Re: [git-users] How does Git storing entire files rather than deltas make it superior?

2019-11-03 Thread Philip Oakley

Hi Michael,

On 03/11/2019 20:22, Michael wrote:

On 2019-11-03, at 8:28 AM, Philip Oakley  wrote:


But you still need to control what gets merged into mainline or master, right?


If you change the management viewpoint from "control" (with all its baggage) to 
"select", then it's a bit easier to see that the manager's task got that bit easier (they 
don't need to 'protect' the VCS 'master') and the coder's workload got easier because they do 
have access to a storage mechanism that works.

You're correct that there is selection of what gets merged, or in a PR 
scenario, accepted as the new 'latest', and the golden repo (usually read-only to 
others) reflects the magic hash number of the latest and greatest.

Keep in mind: Anyone can publish "This is my version of this project".
Very true, and one of my key points about control (of one's local repo) 
being distributed to the user, rather than having to depend on access to 
a central repo.


That I can take someone else's project, and add my own tweak to it, and then my version 
is just as "valid" as theirs, is both the strong and weak point of this system.

A generic user has to determine which of several different repositories has the 
best version to use.
I'd argue that, for 'maintained' open source projects, there will be 
notifications of the locations the current master hash can be 
obtained from, rather than getting it from some arbitrary 'Joe Random'.

But any coder can work with any repository as a starting point to base from.

And, you can merge work done by several different people, each in their own repository, 
to combine into the "newest and bestest".

The issue is less "What's in master", and more "What's in Keybounce's master" vs "What's in Zek's 
master", or "What's in Ubuntu's master" vs "What's in Red Hat's master".
In the collaborative context, it's true that you can take work from 
those you trust (as opposed to Joe.deadbeef), and usually you can also 
easily see what work they have added (diffs, etc.).




Linus's job, as I understand, is more a case of "blessing this version" than 
anything else at this point. Which gets back to the question of selecting from the many 
choices available. The rest of us rely on his ability to select from the many choices so 
that we don't need to worry about studying all the differences.
True; however, the original comparison was with those older VCSes that 
used diff-based recording, and my comparison was with VCS systems that 
started in the engineering domain (which is where all those procedures 
and processes endemic in most 'central control' VCS systems came from).


Hopefully we are strenuously agreeing here...


--
You received this message because you are subscribed to the Google Groups "Git for 
human beings" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to git-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/git-users/ce59a884-0a69-fe7d-fe97-7fe183a0d113%40iee.email.


Re: [git-users] How does Git storing entire files rather than deltas make it superior?

2019-11-03 Thread Michael


On 2019-11-03, at 8:28 AM, Philip Oakley  wrote:

>> But you still need to control what gets merged into mainline or master, 
>> right?
>> 
> If you change the management viewpoint from "control" (with all its baggage) 
> to "select", then it's a bit easier to see that the manager's task got that bit 
> easier (they don't need to 'protect' the VCS 'master') and the coder's 
> workload got easier because they do have access to a storage mechanism that 
> works.
> 
> You're correct that there is selection of what gets merged, or in a PR 
> scenario, accepted as the new 'latest', and the golden repo (usually read-only to 
> others) reflects the magic hash number of the latest and greatest. 

Keep in mind: Anyone can publish "This is my version of this project".

That I can take someone else's project, and add my own tweak to it, and then my 
version is just as "valid" as theirs, is both the strong and weak point of this 
system.

A generic user has to determine which of several different repositories has the 
best version to use.
But any coder can work with any repository as a starting point to base from.

And, you can merge work done by several different people, each in their own 
repository, to combine into the "newest and bestest".

The issue is less "What's in master", and more "What's in Keybounce's master" 
vs "What's in Zek's master", or "What's in Ubuntu's master" vs "What's in Red 
Hat's master".

Linus's job, as I understand, is more a case of "blessing this version" than 
anything else at this point. Which gets back to the question of selecting from 
the many choices available. The rest of us rely on his ability to select from 
the many choices so that we don't need to worry about studying all the 
differences.

-- 
You received this message because you are subscribed to the Google Groups "Git 
for human beings" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to git-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/git-users/43375EA0-8407-4FCB-8202-40114AB93A19%40gmail.com.


Re: [git-users] How does Git storing entire files rather than deltas make it superior?

2019-11-03 Thread Philip Oakley

On 03/11/2019 15:50, likejudo wrote:

'that "Control" [aka managers from hell] has been
*distributed* from the management to the user. You no longer need any
permission to store anything you want into the holy shrine of the
"VCS" '

But you still need to control what gets merged into mainline or master, right?

If you change the management viewpoint from "control" (with all its 
baggage) to "select", then it's a bit easier to see that the manager's 
task got that bit easier (they don't need to 'protect' the VCS 'master') 
and the coder's workload got easier because they do have access to a 
storage mechanism that works.


You're correct that there is selection of what gets merged, or in a PR 
scenario, accepted as the new 'latest', and the golden repo (usually 
read-only to others) reflects the magic hash number of the latest and 
greatest. If you have a copy with the same commit hash (which implicitly 
has the same history in Git) then you are good to go.


Aside: the commit hash embeds both the history and the latest code tree, 
while the tree hash is just the latest code without its history, which 
would be not so great.
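To see the aside concretely (a throwaway-repo sketch, not from the original mail): a commit object names the snapshot tree *and* its parent, so the commit hash pins both the code and the history behind it, while the tree hash alone pins only the code:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
echo v1 > f.txt && git add f.txt
git -c user.name=t -c user.email=t@t commit -q -m one
echo v2 > f.txt && git add f.txt
git -c user.name=t -c user.email=t@t commit -q -m two
# The commit records a 'tree' line (the snapshot) and a 'parent' line (the history):
git cat-file -p HEAD
```

Two repos whose HEADs print the same commit hash therefore agree on the entire history, not just the working tree.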


--
Philip


--
You received this message because you are subscribed to the Google Groups "Git for 
human beings" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to git-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/git-users/3c9aeead-93fb-6709-1b7e-0c9e5ff06e66%40iee.email.


Re: [git-users] How does Git storing entire files rather than deltas make it superior?

2019-11-03 Thread likejudo
'that "Control" [aka managers from hell] has been
*distributed* from the management to the user. You no longer need any
permission to store anything you want into the holy shrine of the
"VCS" '

But you still need to control what gets merged into mainline or master, right?

-- 
You received this message because you are subscribed to the Google Groups "Git 
for human beings" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to git-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/git-users/f42e077b-97cc-4e8a-818f-dd048ae142f2%40googlegroups.com.


Re: [git-users] How does Git storing entire files rather than deltas make it superior?

2019-11-03 Thread likejudo
Excellent answer with a lot of knowledge in it. Thank you, sir. 

-- 
You received this message because you are subscribed to the Google Groups "Git 
for human beings" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to git-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/git-users/fd24aba0-6cba-4119-ba96-42280e51acd1%40googlegroups.com.


Re: [git-users] How does Git storing entire files rather than deltas make it superior?

2019-11-03 Thread Philip Oakley

Hi, a couple of extra comments about the theme.

On 03/11/2019 00:01, Michael wrote:
+1.
On 2019-11-01, at 12:39 PM, likejudo > wrote:


I was wondering if this isn't space inefficient - and how does it 
become superior to a VCS by storing snapshots rather than deltas?
Initially Git is space inefficient, as the first step is simply to 
zlib-compress each and every file in the new snapshot. However, if the 
object-id hash value is identical (nothing changed!) then there is 
nothing new to store - instant de-duplication. But if it did change, 
hmm, then we do get a whole new object.
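As a quick illustration (a throwaway-repo sketch, not part of the original mail): identical content always hashes to the identical object id, so a second copy costs nothing to store:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
echo "hello world" > a.txt
cp a.txt b.txt                  # a byte-identical copy under another name
ha=$(git hash-object -w a.txt)  # -w writes the zlib-compressed loose object
hb=$(git hash-object -w b.txt)
[ "$ha" = "$hb" ] && echo "stored once as $ha"
```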


But that Torvalds guy was sneaky, and so (third step), knowing that the 
new and old files were mostly identical, often with common text even 
across other files, he created the pack compression mechanism, which 
records **similarity** (the old version is the current version from the start to 
point A, and from point B to the end; you'd inserted text between A 
and B, i.e. the diff will be A-B). Hence ... [Michael's good points]
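A minimal sketch of that third step (illustrative commands; the exact pack contents vary by git version): commit two mostly-identical versions, repack, and inspect the pack, where similar blobs can be stored as a base plus a delta:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
seq 1 500 > big.txt
git add big.txt
git -c user.name=t -c user.email=t@t commit -q -m v1
seq 1 501 > big.txt             # almost the same content as v1
git add big.txt
git -c user.name=t -c user.email=t@t commit -q -m v2
git gc -q                       # repack: similar objects may become base + delta
pack=$(ls .git/objects/pack/*.pack)
git verify-pack -v "$pack" | head   # per-object listing, incl. any delta chains
```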




Some people will cite studies showing that the pack files have better 
compression than you'd normally expect; this is to be expected from 
compressing a larger amount of data.






Some people will cite that "unmodified files checksummed" prevents 
unexpected alterations; git is actually the first example I know of a 
block-chain in real life, before it was called a blockchain. Git 
gains all the advantages of blockchains for detecting accuracy.

all true.

What I think is the success of Git, which is implicit in the way Linux 
development works, is that "Control" [aka managers from hell] has been 
*distributed* from the management to the user. You no longer need any 
permission to store anything you want into the holy shrine of the 
"VCS" (in case you might have somehow contaminated it).
The manager is relieved of those horrid tasks of handling coders, and 
simply lists the valid hash of the "correct" versions. All your nif naff 
and trivia are local to you, but are secure, and validated by their own 
hashes. You can get back to the various interim states you were at 
without worry.


The critical point here (and this is slightly philosophical) is that 
there is no longer a single MASTER (see works of art such as the Mona Lisa, 
or your code..). Code can be perfectly replicated at almost zero cost. 
Its value is in having the correct copy (the hash), rather than having 
the only copy. The whole version "Control" paradigm has broken out of 
the 'fragility' box that bedevilled physical artefact control (paper 
drawings, serialised parts, VIN numbers on cars).


Having broken the veracity problem, the diff based approach with a 
central authority falls away, especially when the pack file technique is 
included.


There is still the problem of non-diffable files (e.g. audio/video (AV) 
edits), where it is still an all-or-nothing problem (especially for 
packing), but that is an issue common to both approaches. The Microsoft 
contributors are looking at how they can handle the Windows mono-repo 
(largest in the world!), and then hopefully others will look at the 
large mono-file problems (how to diff and merge AV files).


Some people will question what "superior" means.

The bottom line is this: Git was developed for the linux kernel. Git 
was developed based on the needs of a decently sized small project.


Yea, there was a time when I thought linux was big. "Big" is what you 
get when Microsoft and Google both start moving their 
development/version control over to git. There's stuff in git designed 
to deal with very, very large archives that these two have contributed.


In a nutshell, git has these advantages over everything else that came 
before it:

1. Ability to work with really large archives.
2. Ability to recover not just a version of a file, but a version of a 
project, even as filenames change
3. Ability to check what changes were made in a given subdirectory 
during a period of time -- used by people working on a subset of the 
linux kernel, for example.

4. Ability to merge more than two deltas off a previous base
5. Ability to ensure no one slipped unauthorized changes into the 
source code.
6. Ability to have different people work on different files at the 
same time without ever running into "locking" issues, without having 
to have a network connection at "checkout" time, without needing to 
have a concept of checking out.
7. Ability to consider anyone's copy as the "master" copy -- useful if 
the maintainer/"master" of a project changes.


When you consider these goals, space used by text files isn't nearly 
as important. Once you get to something the size of the linux code 
base, you can start to think that you might be consuming disk space.




As stated, the best way to think of git is as a read-only filesystem. 
Files are presented to git in their finished form, and do not 
get stored in the filesystem until finished. There is no "differential" 
at the lowest level, only a bunch of full files that do not change.

Re: [git-users] How does Git storing entire files rather than deltas make it superior?

2019-11-02 Thread Michael
On 2019-11-01, at 12:39 PM, likejudo  wrote:

> I was wondering if this isn't space inefficient - and how does it become 
> superior to a VCS by storing snapshots rather than deltas?

Some people will cite studies showing that the pack files have better 
compression than you'd normally expect; this is to be expected from compressing 
a larger amount of data.

Some people will cite that "unmodified files checksummed" prevents unexpected 
alterations; git is actually the first example I know of a block-chain in 
real life, before it was called a blockchain. Git gains all the advantages of 
blockchains for detecting accuracy.

Some people will question what "superior" means.

The bottom line is this: Git was developed for the linux kernel. Git was 
developed based on the needs of a decently sized small project.

Yea, there was a time when I thought linux was big. "Big" is what you get when 
Microsoft and Google both start moving their development/version control over 
to git. There's stuff in git designed to deal with very, very large archives 
that these two have contributed.

In a nutshell, git has these advantages over everything else that came before 
it:
1. Ability to work with really large archives.
2. Ability to recover not just a version of a file, but a version of a project, 
even as filenames change
3. Ability to check what changes were made in a given subdirectory during a 
period of time -- used by people working on a subset of the linux kernel, for 
example.
4. Ability to merge more than two deltas off a previous base
5. Ability to ensure no one slipped unauthorized changes into the source code.
6. Ability to have different people work on different files at the same time 
without ever running into "locking" issues, without having to have a network 
connection at "checkout" time, without needing to have a concept of checking 
out.
7. Ability to consider anyone's copy as the "master" copy -- useful if the 
maintainer/"master" of a project changes.
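Ability 3, for example, falls straight out of the snapshot model. A throwaway-repo sketch (the directory names are just illustrative):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
ci() { git -c user.name=t -c user.email=t@t commit -q -m "$1"; }
mkdir drivers net
echo a > drivers/a.c && git add . && ci "drivers: add a.c"
echo b > net/b.c     && git add . && ci "net: add b.c"
# History restricted to one subdirectory -- only commits touching it appear:
git log --oneline -- drivers
```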

When you consider these goals, space used by text files isn't nearly as 
important. Once you get to something the size of the linux code base, you can 
start to think that you might be consuming disk space.



As stated, the best way to think of git is as a read-only filesystem. Files are 
presented to git in their finished form, and do not get stored in the 
filesystem until finished. There is no "differential" at the lowest level, only 
a bunch of full files that do not change.

Everything else is layered on top of that.

The files are named by their hash code.
There are files that contain mappings of user-visible file names to hash codes -- which 
in turn have a hash code name. These are the "directory listings". Some of 
those entries are sub-directories instead of user-supplied files.
There are files that contain the hash of the top-level project directory, and 
information about which version that project directory represents.
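That layering can be walked by hand (illustrative throwaway repo): the commit names the top tree, the tree maps names to hashes, and a sub-directory entry is just another tree:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
mkdir sub && echo data > sub/file.txt
git add sub/file.txt
git -c user.name=t -c user.email=t@t commit -q -m snap
git cat-file -p 'HEAD^{tree}'   # top "directory listing": maps "sub" to a tree hash
git cat-file -p HEAD:sub        # that subtree maps "file.txt" to a blob hash
```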

What does this not give you, that has to be calculated all the time? The diff 
from version N to N+1, or applying "what changed" between C and D as 
a rebase onto B.
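A sketch of both operations in a throwaway repo (the names B, C, D are just illustrative commits, as in the text above):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
ci() { git -c user.name=t -c user.email=t@t commit -q -m "$1"; }
echo base > f && git add f && ci B
git branch side                     # "side" stays at B
echo one > g && git add g && ci C
echo two > h && git add h && ci D
git diff --stat HEAD~1 HEAD         # the C-to-D delta, computed on demand
# Replay only "what changed between C and D" on top of side (i.e. onto B):
git -c user.name=t -c user.email=t@t rebase -q --onto side HEAD~1
git log --format=%s                 # now D then B; C was left behind
```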

Diff-based VCSes give you that cheaply, but lose all the other benefits.
Linux found those benefits to be better.
Microsoft and Google are switching.

Are there issues/problems? Sure.
Are they less of an issue this way than any other way so far? Seems like it.
Are there features people would like to see in Git? Yep. 
Could most of them be added to git without changing the "Read only filesystem" 
at the heart? Yes.

Is there a better system design than git? Sure. Do we know what it is? Probably 
not.

-- 
You received this message because you are subscribed to the Google Groups "Git 
for human beings" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to git-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/git-users/45F7E631-0FD0-48C7-AF98-16785213BC77%40gmail.com.


Re: [git-users] How does Git storing entire files rather than deltas make it superior?

2019-11-02 Thread 'Matthias Urlichs' via Git for human beings
On 02.11.19 16:20, Konstantin Khomoutov wrote:
> how does it become 
> superior to a VCS by storing snapshots rather than deltas?

Deltas based on what? As soon as there are merge operations, you either
run into ambiguities or the code is forced to prefer one branch over the
other. Branches and branch selection and whatnot thus become first-class
citizens in that VCS. You don't want that to happen because branches are
incidental and don't matter to the user: the state of your repository is
not affected by whether you merged A into B or vice versa.

Also, one of git's basic tenets is that every file (and directory and
commit and …) is identified by the hash of its contents. Thus you can
immediately detect if two files (or directories or …) are the same:
their hash is identical.

This in turn allows you to quickly identify common sub-trees between
*any* two commits immediately, without requiring examination of
subtrees. Contrast with delta storage where you always need to
re-assemble the whole tree as soon as the situation becomes non-trivial.
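Sketch of that (illustrative throwaway repo): a directory untouched between two commits is the *same* tree object in both, so git knows the subtrees match from the hash alone, without descending into them:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
ci() { git -c user.name=t -c user.email=t@t commit -q -m "$1"; }
mkdir docs src
echo readme > docs/r.txt && echo main > src/m.c
git add . && ci one
echo changed > src/m.c && git add . && ci two
t1=$(git rev-parse HEAD~1:docs)   # docs/ tree hash in the first commit
t2=$(git rev-parse HEAD:docs)     # docs/ tree hash in the second commit
[ "$t1" = "$t2" ] && echo "docs/ subtree shared: $t1"
```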

In a complex repository like the Linux kernel, with its thicket of
multiple merges, trivial situations are the exception rather than the rule.

-- 
-- Matthias Urlichs

-- 
You received this message because you are subscribed to the Google Groups "Git 
for human beings" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to git-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/git-users/5f1df930-a91e-329b-6cfd-d3d82866d9c7%40urlichs.de.


Re: [git-users] How does Git storing entire files rather than deltas make it superior?

2019-11-02 Thread Konstantin Khomoutov
On Fri, Nov 01, 2019 at 12:39:18PM -0700, likejudo wrote:

> In Scott Chacon's book Pro Git, he says that Git is different from VCSes 
> in that it stores entire files rather than deltas.
> I was wondering if this isn't space inefficient - and how does it become 
> superior to a VCS by storing snapshots rather than deltas?

I have read this, but I'm not sure I have really seen rhetoric claiming
that this approach is exactly "superior".

In fact, the problem with this discussion is that we do not have
well-defined criteria according to which we judge various approaches
taken by different VCSes.

From the standpoint of a user of a VCS, there really is no difference —
at least until you find yourself in an unusual situation where
you have a damaged repository (say, due to a filesystem or hardware
problem, and no backups).

From the purely technical standpoint, the format Git uses may be
considered to have certain advantages. Note that Git only _conceptually_
stores full snapshots of the entire repository in its commits;
technically, only a few copies of a file modified throughout the history
of changes are stored "as is" — with most historic changes contained in
the so-called "pack files", which are indexed, compressed archives.
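You can watch that transition in a throwaway repo (illustrative commands; the exact counts vary): fresh objects are loose, then a repack migrates everything into an indexed pack:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
echo x > f && git add f
git -c user.name=t -c user.email=t@t commit -q -m init
git count-objects -v | grep '^count:'    # loose objects, zlib-compressed "as is"
git gc -q
git count-objects -v | grep '^in-pack:'  # now inside an indexed pack file
```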

Let me cite a very high-quality essay by Keith Packard (one of the
principal folks behind modern X.org):

> <…> git's repository structure is better than others, at least for
> X.org's usage model. It seems to hold several interesting properties:
>
> 1. Files containing object data are never modified. Once written,
> every file is read-only from that point forward.
>
> 2. Compression is done off-line and can be delayed until after the
> primary objects are saved to backup media. This method provides better
> compression than any incremental approach, allowing data to be
> re-ordered on disk to match usage patterns.
>
> 3. Object data is inherently self-checking; you cannot modify an
> object in the repository and escape detection the first time the
> object is referenced.
>
> Many people have complained about git's off-line compression strategy,
> seeing it as a weakness that the system cannot automatically deal with
> this. Admittedly, automatic is always nice, but in this case, the
> off-line process gains significant performance advantages (all
> objects, independent of original source file name are grouped into a
> single compressed file), as well as reliability benefits (original
> objects can be backed-up before being removed from the server). From
> measurements made on a wide variety of repositories, git's compression
> techniques are far and away the most successful in reducing the total
> size of the repository. The reduced size benefits both download times
> and overall repository performance as fewer pages must be mapped to
> operate on objects within a Git repository than within any other
> repository structure.

The full essay is available at [1].

I also happened to answer a quite similar question asked by someone on
SO; the question was about why Git does not use a "real database" for
its backend storage. You might want to read my answer at [2].

1. https://keithp.com/blog/Repository_Formats_Matter/
2. https://stackoverflow.com/a/21141068/720999

-- 
You received this message because you are subscribed to the Google Groups "Git 
for human beings" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to git-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/git-users/20191102152037.7wkdiwrb7yf7nnec%40carbon.