On Mon, 10 Dec 2007, Gabriel Paubert wrote:

> On Fri, Dec 07, 2007 at 04:47:19PM -0800, Harvey Harrison wrote:
> > Some interesting stats from the highly packed gcc repo.  The long chain
> > lengths very quickly tail off.  Over 60% of the objects have a chain
> > length of 20 or less.  If anyone wants the full list let me know.  I
> > also have included a few other interesting points, the git default
> > depth of 50, my initial guess of 100 and every 10% in the cumulative
> > distribution from 60-100%.
> > 
> > This shows the git default of 50 really isn't that bad, and after
> > about 100 it really starts to get sparse.  
> 
> Do you have a way to know which files have the longest chains?

With 'git verify-pack -v' you get the delta depth for each object.
Then you can use 'git show' with the object SHA1 to see its content.
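
As a rough sketch of that workflow, the following Python parses the output 
of 'git verify-pack -v' and lists the objects with the deepest delta chains. 
It assumes the usual output layout, where a deltified object line has seven 
fields (SHA1, type, size, size in pack, offset, depth, base SHA1) and 
non-delta objects have only five.  The function names and the limit of 20 
are illustrative choices, not anything git provides.

```python
import subprocess


def parse_verify_pack(output):
    """Return (depth, sha1, type, size) for each deltified object, deepest first."""
    entries = []
    for line in output.splitlines():
        fields = line.split()
        # Assumed layout for deltified objects (7 fields):
        #   sha1 type size size-in-pack offset depth base-sha1
        # Non-delta objects (5 fields) and summary lines are skipped.
        if len(fields) == 7 and fields[5].isdigit():
            sha1, obj_type, size = fields[0], fields[1], int(fields[2])
            entries.append((int(fields[5]), sha1, obj_type, size))
    entries.sort(key=lambda e: e[0], reverse=True)
    return entries


def deepest_objects(idx_path, limit=20):
    """Run 'git verify-pack -v' on a pack .idx file and keep the deepest objects."""
    result = subprocess.run(
        ["git", "verify-pack", "-v", idx_path],
        check=True, capture_output=True, text=True,
    )
    return parse_verify_pack(result.stdout)[:limit]
```

Each SHA1 returned can then be passed to 'git show' to see which file or 
tree it corresponds to.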

> I have a suspicion that the ChangeLog* files are among them,
> not only because they are, almost without exception, only modified
> by prepending text to the previous version (and a fairly small amount
> compared to the size of the file), and therefore the diff is simple
> (a single hunk) so that the limit on chain depth is probably what
> causes a new copy to be created. 

My gcc repo is currently repacked with a max delta depth of 50, and 
a quick sample of those objects at the depth limit does indeed show the 
content of the ChangeLog file.  But I have occurrences of the root 
directory tree object too, and the "GCC machine description for IA-32" 
content as well.

But yes, the really deep delta chains are most certainly going to 
contain those ChangeLog files.

> Besides that these files grow quite large and become some of the 
> largest files in the tree, and at least one of them is changed 
> for every commit. This leads again to many versions of fairly 
> large files.
> 
> If this guess is right, this implies that most of the size gains
> from longer chains come from having fewer copies of the ChangeLog*
> files. From a performance point of view, it is rather favourable
> since the differences are simple. This would also explain why
> the window parameter has little effect.

Well, actually the window parameter does have big effects.  For instance 
the default of 10 is completely inadequate for the gcc repo, since 
changing the window size from 10 to 100 made the corresponding pack 
shrink from 2.1GB down to 400MB, with the same max delta depth.
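
For anyone wanting to try this kind of comparison, here is a minimal 
sketch in Python that builds and runs a full repack with a given window 
and depth.  It assumes 'git repack' with its -a, -d, -f, --window and 
--depth options; -f is included so that existing deltas are recomputed 
rather than reused.  The helper names are hypothetical, and the values 
100 and 50 are simply the ones mentioned above.

```python
import subprocess


def repack_command(window, depth):
    """Build a 'git repack' command that recomputes all deltas."""
    return [
        "git", "repack",
        "-a",                  # pack everything into a single pack
        "-d",                  # remove packs made redundant by the new one
        "-f",                  # do not reuse existing deltas
        f"--window={window}",  # candidate objects considered per delta search
        f"--depth={depth}",    # maximum delta chain length
    ]


def repack(repo_path, window=100, depth=50):
    """Run the repack inside repo_path; raises if git exits with an error."""
    subprocess.run(repack_command(window, depth), cwd=repo_path, check=True)
```

Expect a larger window to take noticeably more time and memory on a 
repository of this size.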


Nicolas
