On Fri, Nov 01, 2019 at 12:39:18PM -0700, likejudo wrote:

> In Scott Chacon's book Pro Git, he says that Git is different from other
> VCSes in that it stores entire files rather than deltas.
> I was wondering whether this isn't space-inefficient, and how storing
> snapshots rather than deltas makes it superior to other VCSes?

I have read that too, but I'm not sure I have actually seen it claimed
that this approach is outright "superior".

In fact, the problem with this discussion is that there are no
well-defined criteria by which to judge the various approaches taken by
different VCSes.

From the standpoint of a VCS user, there really is no difference, at
least until you find yourself in an unusual situation such as a damaged
repository (say, due to a filesystem or hardware problem) with no
backups.

From a purely technical standpoint, the format Git uses can be
considered to have certain advantages. Note that Git only _conceptually_
stores a full snapshot of the entire repository in each commit;
physically, only a few revisions of a file modified throughout the
history are stored "as is", with most historical content kept in the
so-called "pack files", which are indexed, delta-compressed archives.

Let me cite a very high-quality essay by Keith Packard (one of the
principal folks behind modern X.org):

> <…> git's repository structure is better than others, at least for
> X.org's usage model. It seems to hold several interesting properties:
>
> 1. Files containing object data are never modified. Once written,
> every file is read-only from that point forward.
>
> 2. Compression is done off-line and can be delayed until after the
> primary objects are saved to backup media. This method provides better
> compression than any incremental approach, allowing data to be
> re-ordered on disk to match usage patterns.
>
> 3. Object data is inherently self-checking; you cannot modify an
> object in the repository and escape detection the first time the
> object is referenced.
>
> Many people have complained about git's off-line compression strategy,
> seeing it as a weakness that the system cannot automatically deal with
> this. Admittedly, automatic is always nice, but in this case, the
> off-line process gains significant performance advantages (all
> objects, independent of original source file name are grouped into a
> single compressed file), as well as reliability benefits (original
> objects can be backed-up before being removed from the server). From
> measurements made on a wide variety of repositories, git's compression
> techniques are far and away the most successful in reducing the total
> size of the repository. The reduced size benefits both download times
> and overall repository performance as fewer pages must be mapped to
> operate on objects within a Git repository than within any other
> repository structure.
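
Point 3 above is easy to demonstrate outside of Git itself. Below is a
minimal Python sketch, assuming the classic SHA-1 loose-object layout
under .git/objects/: an object's name is just the hash of its
decompressed contents, so any modification to the file is detected the
moment the name is re-derived. (In a real repository, git fsck performs
this kind of verification across both loose and packed objects.)

  import hashlib
  import sys
  import zlib
  from pathlib import Path

  def verify_loose_object(path: str) -> bool:
      # A loose object at .git/objects/ab/cdef... is zlib-compressed
      # "<type> <size>\0<payload>"; hashing the decompressed bytes must
      # reproduce the 40-hex-digit name encoded in its path.
      p = Path(path)
      expected = p.parent.name + p.name       # "ab" + "cdef..."
      raw = zlib.decompress(p.read_bytes())
      return hashlib.sha1(raw).hexdigest() == expected

  if __name__ == "__main__":
      print("OK" if verify_loose_object(sys.argv[1]) else "CORRUPT")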

The full essay is available at [1].

I also happened to answer quite a similar question on Stack Overflow; it
asked why Git does not use a "real database" for its backend storage.
You might want to read my answer at [2].

1. https://keithp.com/blog/Repository_Formats_Matter/
2. https://stackoverflow.com/a/21141068/720999
