What happens when the repository is bigger than gc.autopacklimit * pack.packSizeLimit?

2014-08-27 Thread Dale R. Worley
[Previously sent to the git-users mailing list, but it probably should
be addressed here.]

A number of commands invoke "git gc --auto" to clean up the repository
when there might be a lot of dangling objects and/or there might be
far too many unpacked files.  The manual pages say:

git gc:
   --auto
   With this option, git gc checks whether any housekeeping is
   required; if not, it exits without performing any work. Some git
   commands run git gc --auto after performing operations that could
   create many loose objects.

   Housekeeping is required if there are too many loose objects or too
   many packs in the repository. If the number of loose objects
   exceeds the value of the gc.auto configuration variable, then all
   loose objects are combined into a single pack using git repack -d
   -l. Setting the value of gc.auto to 0 disables automatic packing of
   loose objects.

git config:
   gc.autopacklimit
   When there are more than this many packs that are not marked with
   *.keep file in the repository, git gc --auto consolidates them into
   one larger pack. The default value is 50. Setting this to 0
   disables it.

What happens when the amount of data in the repository exceeds
gc.autopacklimit * pack.packSizeLimit?  According to the
documentation, "git gc --auto" will then *always* repack the
repository, whether it needs it or not, because the data will require
more than gc.autopacklimit pack files.

And it appears from an experiment that this is what happens.  I have a
repository with pack.packSizeLimit = 99m, and there are 104 pack
files, and even when "git gc" is done, if I do "git gc --auto", it
will do git-repack again.
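
To put numbers on it (assuming the default gc.autopacklimit of 50):
104 packs capped at 99m apiece is roughly 10 GB of packed data, so
even a fresh repack that honors pack.packSizeLimit must produce about
104 packs again, and the count check is guaranteed to fire on the
very next run.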

Looking at the code, I see:

builtin/gc.c:
static int too_many_packs(void)
{
	struct packed_git *p;
	int cnt;

	if (gc_auto_pack_limit <= 0)
		return 0;

	prepare_packed_git();
	for (cnt = 0, p = packed_git; p; p = p->next) {
		if (!p->pack_local)
			continue;
		if (p->pack_keep)
			continue;
		/*
		 * Perhaps check the size of the pack and count only
		 * very small ones here?
		 */
		cnt++;
	}
	return gc_auto_pack_limit <= cnt;
}

Yes, perhaps you *should* check the size of the pack!

What is a good strategy for making this function behave as we want it to?
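
For illustration, here is a standalone sketch (not a patch against
git's internals; the pack directory path and the 16 MiB cutoff are
made-up values) of what counting only small packs could look like:

/*
 * Standalone sketch, not git code: count only "small" packs, the way
 * too_many_packs() might if it honored a size cutoff.  PACKDIR and
 * SMALL_PACK_LIMIT are illustrative values, not existing git knobs.
 * A real version would also skip packs that have a companion *.keep
 * file, as the loop in builtin/gc.c does.
 */
#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>

#define PACKDIR ".git/objects/pack"
#define SMALL_PACK_LIMIT (16 * 1024 * 1024)	/* 16 MiB, arbitrary */

static int count_small_packs(void)
{
	DIR *dir = opendir(PACKDIR);
	struct dirent *de;
	char path[4096];
	struct stat st;
	int cnt = 0;

	if (!dir)
		return 0;
	while ((de = readdir(dir)) != NULL) {
		size_t len = strlen(de->d_name);

		if (len < 5 || strcmp(de->d_name + len - 5, ".pack"))
			continue;	/* not a *.pack file */
		snprintf(path, sizeof(path), "%s/%s", PACKDIR, de->d_name);
		if (stat(path, &st))
			continue;
		if (st.st_size >= SMALL_PACK_LIMIT)
			continue;	/* large pack: leave it alone */
		cnt++;
	}
	closedir(dir);
	return cnt;
}

int main(void)
{
	printf("small packs: %d\n", count_small_packs());
	return 0;
}

gc --auto would then compare this count, rather than the raw pack
count, against gc.autopacklimit, so a repository whose data simply
cannot fit in gc.autopacklimit packs would stop re-triggering.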

Dale


Re: What happens when the repository is bigger than gc.autopacklimit * pack.packSizeLimit?

2014-08-27 Thread Jeff King
On Wed, Aug 27, 2014 at 03:36:53PM -0400, Dale R. Worley wrote:

> And it appears from an experiment that this is what happens.  I have a
> repository with pack.packSizeLimit = 99m, and there are 104 pack
> files, and even when "git gc" is done, if I do "git gc --auto", it
> will do git-repack again.

I agree that gc --auto could be smarter here, but I have to wonder:
why are you setting the packsize limit to 99m in the first place? It is
generally much more efficient to place everything in a single pack.
There are more delta opportunities, fewer base objects, lookup is faster
(we binary search each pack index, but linearly move through the list of
indices), and it is required for advanced techniques like bitmaps.
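
A toy model of that lookup point (this is not git's real code; the
sorted arrays below stand in for pack .idx files) shows why the cost
scales with the number of packs:

/*
 * Toy model, not git's real data structures: each "index" is just a
 * sorted array standing in for a pack .idx.  Lookup binary-searches
 * inside each index but walks the list of indices linearly, so a
 * miss costs about P * log2(N) comparisons over P packs, versus
 * log2(P * N) if everything lived in a single pack.
 */
#include <stdio.h>
#include <stdlib.h>

static int cmp_u32(const void *a, const void *b)
{
	unsigned av = *(const unsigned *)a, bv = *(const unsigned *)b;
	return (av > bv) - (av < bv);
}

/* try every pack's index in turn; stop at the first hit */
static int find_in_packs(unsigned key, unsigned **packs,
			 size_t npacks, size_t entries)
{
	for (size_t i = 0; i < npacks; i++)
		if (bsearch(&key, packs[i], entries,
			    sizeof(unsigned), cmp_u32))
			return 1;
	return 0;
}

int main(void)
{
	size_t npacks = 104, entries = 1000;	/* mirrors the 104-pack case */
	unsigned **packs = malloc(npacks * sizeof(*packs));

	for (size_t i = 0; i < npacks; i++) {
		packs[i] = malloc(entries * sizeof(unsigned));
		for (size_t j = 0; j < entries; j++)
			packs[i][j] = (unsigned)rand();
		qsort(packs[i], entries, sizeof(unsigned), cmp_u32);
	}
	printf("found: %d\n", find_in_packs(12345u, packs, npacks, entries));
	return 0;
}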

-Peff


Re: What happens when the repository is bigger than gc.autopacklimit * pack.packSizeLimit?

2014-08-27 Thread Junio C Hamano
wor...@alum.mit.edu (Dale R. Worley) writes:

> builtin/gc.c:
> static int too_many_packs(void)
> {
> 	struct packed_git *p;
> 	int cnt;
>
> 	if (gc_auto_pack_limit <= 0)
> 		return 0;
>
> 	prepare_packed_git();
> 	for (cnt = 0, p = packed_git; p; p = p->next) {
> 		if (!p->pack_local)
> 			continue;
> 		if (p->pack_keep)
> 			continue;
> 		/*
> 		 * Perhaps check the size of the pack and count only
> 		 * very small ones here?
> 		 */
> 		cnt++;
> 	}
> 	return gc_auto_pack_limit <= cnt;
> }
>
> Yes, perhaps you *should* check the size of the pack!
>
> What is a good strategy for making this function behave as we want it to?

Whoever decides the details of "as we want it to" gets to decide ;-).

I think what we want is a mode where we repack only loose objects
and small packs by concatenating them into a single large one
(with deduping of base objects, the total would become smaller than
the sum), while leaving the existing large ones alone.  Daily
repacking would then just coalesce new objects into the current pack,
which grows gradually; at some point it stops growing and joins the
longer-term large ones, until a full gc is done to optimize the
overall history traversal, or something.
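
As a sketch of that selection policy (illustrative only: the struct,
names, and size cutoff below are made up, not git's real types or
configuration), the decision boils down to rolling up packs below a
cutoff and leaving kept or large packs alone:

/*
 * Packs with a *.keep file or above the cutoff are left alone;
 * everything else (plus loose objects) would be rolled into one
 * new pack.
 */
#include <stdio.h>
#include <stddef.h>

struct pack_info {
	const char *name;
	long long size;		/* on-disk size of the .pack file */
	int keep;		/* has a companion *.keep file */
};

static size_t select_packs_to_coalesce(const struct pack_info *packs,
				       size_t n, long long small_limit,
				       int *selected)
{
	size_t cnt = 0;

	for (size_t i = 0; i < n; i++) {
		selected[i] = !packs[i].keep && packs[i].size < small_limit;
		if (selected[i])
			cnt++;
	}
	return cnt;
}

int main(void)
{
	struct pack_info packs[] = {
		{ "pack-aaaa", 99 * 1024 * 1024, 0 },	/* large: left alone */
		{ "pack-bbbb",  2 * 1024 * 1024, 0 },	/* small: coalesced */
		{ "pack-cccc",  1 * 1024 * 1024, 1 },	/* kept: left alone */
	};
	int selected[3];
	size_t cnt = select_packs_to_coalesce(packs, 3,
					      16LL * 1024 * 1024, selected);

	printf("would coalesce %zu pack(s)\n", cnt);
	return 0;
}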

But if your definition of the boundary between "small" and "large"
is unreasonably low (and/or your definition of "too many" is
unreasonably small), you will always have the problem you found.
