On Fri, Dec 07, 2007 at 04:47:19PM -0800, Harvey Harrison wrote:
> Some interesting stats from the highly packed gcc repo.  The long chain
> lengths very quickly tail off.  Over 60% of the objects have a chain
> length of 20 or less.  If anyone wants the full list let me know.  I
> also have included a few other interesting points, the git default
> depth of 50, my initial guess of 100 and every 10% in the cumulative
> distribution from 60-100%.
> 
> This shows the git default of 50 really isn't that bad, and after
> about 100 it really starts to get sparse.  

Do you have a way to know which files have the longest chains?
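
One possible way to find out, as a rough sketch rather than a tested
recipe: "git verify-pack -v" prints a depth and a base object for every
deltified object, and "git rev-list --objects --all" lists object names
together with their paths. The helper names below (deepest_deltas,
paths_by_sha) are made up for illustration, and the parsing assumes the
usual seven-column layout for deltified entries.

```python
def deepest_deltas(verify_pack_output, top=10):
    """Return (depth, sha, base_sha) tuples for the deepest delta chains.

    Expects the text printed by `git verify-pack -v <pack>.idx`.
    """
    entries = []
    for line in verify_pack_output.splitlines():
        fields = line.split()
        # Deltified entries carry seven fields:
        #   sha type size size-in-pack offset depth base-sha
        # Non-delta entries have five, and the summary lines differ too.
        if len(fields) == 7 and fields[5].isdigit():
            entries.append((int(fields[5]), fields[0], fields[6]))
    entries.sort(reverse=True)  # deepest chains first
    return entries[:top]


def paths_by_sha(rev_list_output):
    """Map object name -> path from `git rev-list --objects --all` text."""
    mapping = {}
    for line in rev_list_output.splitlines():
        sha, sep, path = line.partition(" ")
        # Commits are listed without a path; skip them.
        if sep and path:
            mapping[sha] = path
    return mapping
```

Looking up each sha returned by deepest_deltas in the paths_by_sha
mapping would then show whether the ChangeLog* files dominate the top
of the list.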

I have a suspicion that the ChangeLog* files are among them. They are,
almost without exception, only modified by prepending text to the
previous version (and a fairly small amount compared to the size of
the file). The diff is therefore simple (a single hunk), so the limit
on chain depth is probably what causes a new copy to be created.

Besides that, these files grow quite large and become some of the
largest files in the tree, and at least one of them is changed
for every commit. This again leads to many versions of fairly
large files.

If this guess is right, it implies that most of the size gains
from longer chains come from having fewer copies of the ChangeLog*
files. From a performance point of view, this is rather favourable
since the differences are simple. This would also explain why
the window parameter has little effect.
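
To make the size argument concrete, here is a toy model. It assumes a
single file whose revisions form one linear history, a fixed full size,
a fixed delta size, and a new full copy every time the depth limit is
reached. The function name and all numbers are hypothetical; real packs
share bases, compress with zlib, and the file grows over time.

```python
def packed_size(revisions, full_size, delta_size, max_depth):
    """Estimate stored bytes for one file under a chain depth limit."""
    # Each chain holds one full copy followed by up to max_depth deltas.
    chain_len = max_depth + 1
    full_copies = -(-revisions // chain_len)  # ceiling division
    deltas = revisions - full_copies
    return full_copies * full_size + deltas * delta_size


# Hypothetical figures: 10000 revisions of a 1 MB file, 500-byte deltas.
for depth in (50, 100):
    print(depth, packed_size(10000, 1_000_000, 500, depth))
```

In this model the delta total barely changes with the depth, so nearly
all of the saving comes from storing fewer full copies.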

        Regards,
        Gabriel
