Re: [PATCH v2 3/5] gc --auto: exclude base pack if not enough mem to "repack -ad"

2018-03-16 Thread Duy Nguyen
On Thu, Mar 15, 2018 at 8:21 PM, Ævar Arnfjörð Bjarmason
 wrote:
>
> On Thu, Mar 15 2018, Duy Nguyen jotted:
>
>> On Mon, Mar 12, 2018 at 8:30 PM, Ævar Arnfjörð Bjarmason
>>  wrote:
>>> We already have pack.packSizeLimit, perhaps we could call this
>>> e.g. gc.keepPacksSize=2GB?
>>
>> I'm OK either way. The "base pack" concept comes from the
>> "--keep-base-pack" option where we do keep _one_ base pack. But gc
>> config var has a slightly different semantics when it can keep
>> multiple packs.
>
> I see, yeah it would be great to generalize it to N packs.
>
>>> Finally I wonder if there should be something equivalent to
>>> gc.autoPackLimit for this. I.e. with my proposed semantics above it's
>>> possible that we end up growing forever, i.e. I could have 1000 2GB
>>> packs and then 50 very small packs per gc.autoPackLimit.
>>>
>>> Maybe we need a gc.keepPackLimit=100 to deal with that, then e.g. if
>>> gc.keepPacksSize=2GB is set and we have 101 >= 2GB packs, we'd pick the
>>> two smallest ones and not issue a --keep-pack for those, although then
>>> maybe our memory use would spike past the limit.
>>>
>>> I don't know, maybe we can leave that for later, but I'm quite keen to
>>> turn the top-level config variable into something that just considers
>>> size instead of "base" if possible, and it seems we're >95% of the way
>>> to that already with this patch.
>>
>> At least I will try to ignore gc.keepPacksSize if all packs are kept
>> because of it. That repack run will hurt. But then we're back to one
>> giant pack and plenty of small packs that will take some time to grow
>> up to 2GB again.
>
> I think that semantic really should have its own option. The usefulness
> of this is significantly diminished if it's not a guarantee on the
> resource use of git-gc.
>
> Consider a very large repo where we clone and get a 4GB pack. Then as
> time goes on we end up with lots of loose objects and small packs from
> pulling, and eventually end up with say 4GB + 2x 500MB packs (if our
> limit is 500MB).
>
> If I understand what you're saying correctly, if we ever match the gc
> --auto requirements because we have *just* the big packs and then a
> bunch of loose objects (say we rebased a lot) then we'll try to create a
> giant 5GB pack (+ loose objects).

Yes. There isn't a simple and easy solution here, and I consider
packing (even if it's expensive) to regain performance better than
not packing at all. I could tweak that a bit by keeping the largest
pack out (so we have two packs in the end). After a long, long time,
when your second pack gets to 5 GB, we hit the most expensive
repack. But that should be ok for now, I guess.

I think this repack strategy was discussed here at some point in the
past by the Gerrit guys. Their goal was to reduce I/O, I believe. A
perfect solution could probably be found, but I don't want to hold
this series back until it's found, and I don't want to introduce a
zillion config knobs that become useless later on when the perfect
solution is found.

>>> Actually maybe that should be a "if we're that low on memory, forget
>>> about GC for now" config, but urgh, there's a lot of potential
>>> complexity to be handled here...
>>
>> Yeah I think what you want is a hook. You can make custom rules then.
>> We already have pre-auto-gc hook and could pretty much do what you
>> want without pack-objects memory estimation. But if you want it, maybe
>> we can export the info to the hook somehow.
>
> I can do away with that particular thing, but I'd really like to do
> without the hook. I can automate it on some machines, but then we also
> have un-managed laptops run by users who clone big repos. It's much
> easier to tell them to set a few git config variables than have them
> install & keep some hook up-to-date.

That sounds like we need a mechanism to push hooks (and config stuff)
automatically from the clone source. I think this topic was touched on
at the summit? I don't object to adding new config, but we need to
figure out what we need, and from this thread I think there are too
many "I don't know"s to settle on a solution.
-- 
Duy


Re: [PATCH v2 3/5] gc --auto: exclude base pack if not enough mem to "repack -ad"

2018-03-15 Thread Ævar Arnfjörð Bjarmason

On Thu, Mar 15 2018, Duy Nguyen jotted:

> On Mon, Mar 12, 2018 at 8:30 PM, Ævar Arnfjörð Bjarmason
>  wrote:
>> We already have pack.packSizeLimit, perhaps we could call this
>> e.g. gc.keepPacksSize=2GB?
>
> I'm OK either way. The "base pack" concept comes from the
> "--keep-base-pack" option where we do keep _one_ base pack. But gc
> config var has a slightly different semantics when it can keep
> multiple packs.

I see, yeah it would be great to generalize it to N packs.

>> Finally I wonder if there should be something equivalent to
>> gc.autoPackLimit for this. I.e. with my proposed semantics above it's
>> possible that we end up growing forever, i.e. I could have 1000 2GB
>> packs and then 50 very small packs per gc.autoPackLimit.
>>
>> Maybe we need a gc.keepPackLimit=100 to deal with that, then e.g. if
>> gc.keepPacksSize=2GB is set and we have 101 >= 2GB packs, we'd pick the
>> two smallest ones and not issue a --keep-pack for those, although then
>> maybe our memory use would spike past the limit.
>>
>> I don't know, maybe we can leave that for later, but I'm quite keen to
>> turn the top-level config variable into something that just considers
>> size instead of "base" if possible, and it seems we're >95% of the way
>> to that already with this patch.
>
> At least I will try to ignore gc.keepPacksSize if all packs are kept
> because of it. That repack run will hurt. But then we're back to one
> giant pack and plenty of small packs that will take some time to grow
> up to 2GB again.

I think that semantic really should have its own option. The usefulness
of this is significantly diminished if it's not a guarantee on the
resource use of git-gc.

Consider a very large repo where we clone and get a 4GB pack. Then as
time goes on we end up with lots of loose objects and small packs from
pulling, and eventually end up with say 4GB + 2x 500MB packs (if our
limit is 500MB).

If I understand what you're saying correctly, if we ever match the gc
--auto requirements because we have *just* the big packs and then a
bunch of loose objects (say we rebased a lot) then we'll try to create a
giant 5GB pack (+ loose objects).

>> Finally, I don't like the way the current implementation conflates a
>> "size" variable with auto detecting the size from memory, leaving no way
>> to fallback to the auto-detection if you set it manually.
>>
>> I think we should split out the auto-memory behavior into another
>> variable, and also make the currently hardcoded 50% of memory
>> configurable.
>>
>> That way you could e.g. say you'd always like to keep 2GB packs, but if
>> you happen to have ended up with a 1GB pack and it's time to repack, and
>> you only have 500MB free memory on that system, it would keep the 1GB
>> one until such time as we have more memory.
>
> I don't calculate based on free memory (it's tricky to get that right
> on linux) but physical memory. If you don't have enough memory
> according to this formula, you won't until you add more memory sticks.

Ah, thanks for the clarification.

>>
>> Actually maybe that should be a "if we're that low on memory, forget
>> about GC for now" config, but urgh, there's a lot of potential
>> complexity to be handled here...
>
> Yeah I think what you want is a hook. You can make custom rules then.
> We already have pre-auto-gc hook and could pretty much do what you
> want without pack-objects memory estimation. But if you want it, maybe
> we can export the info to the hook somehow.

I can do away with that particular thing, but I'd really like to do
without the hook. I can automate it on some machines, but then we also
have un-managed laptops run by users who clone big repos. It's much
easier to tell them to set a few git config variables than have them
install & keep some hook up-to-date.
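
To make the "few git config variables" option concrete, an /etc/gitconfig
fragment along these lines is what this thread is circling around. Note
that both names are only proposals under discussion here (gc.keepPacksSize
from Ævar, gc.bigBasePackThreshold from the patch), not settings that
exist yet:

```
[gc]
	# proposal in this thread: keep every pack at or above this
	# size out of "repack -ad" via --keep-pack
	keepPacksSize = 2g
	# from the patch under discussion: keep the base pack above
	# this size instead of estimating from physical RAM
	bigBasePackThreshold = 2g
```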


Re: [PATCH v2 3/5] gc --auto: exclude base pack if not enough mem to "repack -ad"

2018-03-15 Thread Duy Nguyen
On Mon, Mar 12, 2018 at 8:30 PM, Ævar Arnfjörð Bjarmason
 wrote:
> We already have pack.packSizeLimit, perhaps we could call this
> e.g. gc.keepPacksSize=2GB?

I'm OK either way. The "base pack" concept comes from the
"--keep-base-pack" option where we do keep _one_ base pack. But gc
config var has a slightly different semantics when it can keep
multiple packs.

> Finally I wonder if there should be something equivalent to
> gc.autoPackLimit for this. I.e. with my proposed semantics above it's
> possible that we end up growing forever, i.e. I could have 1000 2GB
> packs and then 50 very small packs per gc.autoPackLimit.
>
> Maybe we need a gc.keepPackLimit=100 to deal with that, then e.g. if
> gc.keepPacksSize=2GB is set and we have 101 >= 2GB packs, we'd pick the
> two smallest ones and not issue a --keep-pack for those, although then
> maybe our memory use would spike past the limit.
>
> I don't know, maybe we can leave that for later, but I'm quite keen to
> turn the top-level config variable into something that just considers
> size instead of "base" if possible, and it seems we're >95% of the way
> to that already with this patch.

At least I will try to ignore gc.keepPacksSize if all packs are kept
because of it. That repack run will hurt. But then we're back to one
giant pack and plenty of small packs that will take some time to grow
up to 2GB again.

> Finally, I don't like the way the current implementation conflates a
> "size" variable with auto detecting the size from memory, leaving no way
> to fallback to the auto-detection if you set it manually.
>
> I think we should split out the auto-memory behavior into another
> variable, and also make the currently hardcoded 50% of memory
> configurable.
>
> That way you could e.g. say you'd always like to keep 2GB packs, but if
> you happen to have ended up with a 1GB pack and it's time to repack, and
> you only have 500MB free memory on that system, it would keep the 1GB
> one until such time as we have more memory.

I don't calculate based on free memory (it's tricky to get that right
on linux) but physical memory. If you don't have enough memory
according to this formula, you won't until you add more memory sticks.
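
A minimal sketch of the check described above (a rough illustration of
"estimated usage vs. half of physical RAM", not the patch's actual C
code; the /proc/meminfo probe is a Linux-only assumption):

```shell
# Keep the base pack out of "repack -ad" when the estimated
# pack-objects heap does not fit in half of *physical* RAM
# (free memory is deliberately ignored, as Duy explains).
# Prints 1 for "keep the base pack", 0 for "repack everything".
# Both arguments must use the same unit.
keep_base_pack() {
    estimated_heap=$1
    ram=$2
    if [ "$estimated_heap" -gt $((ram / 2)) ]; then
        echo 1
    else
        echo 0
    fi
}

# Linux-only probe for physical RAM in bytes; the patch itself has
# per-platform fallbacks (sysinfo, sysctl, GlobalMemoryStatusEx).
total_ram() {
    awk '/^MemTotal:/ { print $2 * 1024 }' /proc/meminfo
}
```

With the linux-2.6.git numbers from the thread, a 1.6 GB estimated heap
on a 2 GB machine keeps the base pack; on an 8 GB machine everything
gets repacked.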

>
> Actually maybe that should be a "if we're that low on memory, forget
> about GC for now" config, but urgh, there's a lot of potential
> complexity to be handled here...

Yeah I think what you want is a hook. You can make custom rules then.
We already have pre-auto-gc hook and could pretty much do what you
want without pack-objects memory estimation. But if you want it, maybe
we can export the info to the hook somehow.
-- 
Duy
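
For completeness, the hook route mentioned above could be sketched like
this: a pre-auto-gc hook whose non-zero exit makes "git gc --auto" do
nothing. The 4 GB floor and the /proc/meminfo probe are illustrative
assumptions (Linux-only), not part of the patch:

```shell
#!/bin/sh
# Sketch of a pre-auto-gc hook (an illustration, not part of the
# patch): exiting non-zero from this hook makes "git gc --auto"
# do nothing.

# True when available memory ($1, in kB) is below the floor ($2).
too_little_memory() {
    [ -n "$1" ] && [ "$1" -lt "$2" ]
}

decide() {
    required_kb=$((4 * 1024 * 1024))   # arbitrary 4 GB example floor
    # MemAvailable is a Linux-only probe; other platforms differ.
    avail_kb=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo 2>/dev/null)
    if too_little_memory "$avail_kb" "$required_kb"; then
        echo "pre-auto-gc: only ${avail_kb} kB available, skipping gc" >&2
        return 1
    fi
    return 0
}

# As an installed hook, the script would end with:
#   decide
#   exit $?
```

This is exactly the per-machine maintenance burden Ævar objects to in
the surrounding messages: the script has to be installed and kept
up-to-date in every repository, unlike a config variable.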


Re: [PATCH v2 3/5] gc --auto: exclude base pack if not enough mem to "repack -ad"

2018-03-15 Thread Duy Nguyen
On Mon, Mar 12, 2018 at 7:56 PM, Ævar Arnfjörð Bjarmason
 wrote:
>
> On Wed, Mar 07 2018, Junio C. Hamano jotted:
>
>> Duy Nguyen  writes:
>>>> But to those who say "packs larger than this value is too big" via
>>>> configuration, keeping only the largest of these above-threshold
>>>> packs would look counter-intuitive, wouldn't it, I wonder?
>>>
>>> I think I'll just clarify this in the document. There may be a use
>>> case for keeping multiple large packs, but I don't see it (*). We can
>>> deal with it when it comes.
>>
>> When the project's history grows too much, a large pack that holds
>> its first 10 years of stuff, together with another one that holds
>> its second 20 years of stuff, may both be larger than the threshold
>> and want to be kept.  If we pick only the largest one, we would
>> explode the other one and repack together with loose objects.
>>
>> But realistically, those who would want to control the way in which
>> their repository is packed to such a degree are very likely to add
>> ".keep" files to these two packfiles themselves, so the above would
>> probably not be a concern.  Perhaps we shouldn't do the "automatically
>> pick the largest one and exclude from repacking" when there is a
>> packfile that is marked with ".keep"?
>
> As someone who expects to use this (although hopefully in slightly
> modified form), it's very useful if we can keep the useful semantics in
> gc.* config values without needing some external job finding repos and
> creating *.keep files to get custom behavior.
>
> E.g. I have the use-case of wanting to set this on servers that I know
> are going to be used for cloning some big repos in user's ~ directory
> manually, so if I can set something sensible in /etc/gitconfig that's
> great, but it sucks a lot more to need to write some cronjob that goes
> hunting for repos in those ~ directories and tweaks *.keep files.

If this is about gc.bigBasePackThreshold keeping all packs larger
than the threshold, then yes it will be so in the reroll.
-- 
Duy


Re: [PATCH v2 3/5] gc --auto: exclude base pack if not enough mem to "repack -ad"

2018-03-12 Thread Ævar Arnfjörð Bjarmason

On Mon, Mar 12 2018, Junio C. Hamano jotted:

> On Mon, Mar 12, 2018 at 11:56 AM, Ævar Arnfjörð Bjarmason
>  wrote:
>> As someone who expects to use this (although hopefully in slightly
>> modified form), it's very useful if we can keep the useful semantics in
>> gc.* config values without needing some external job finding repos and
>> creating *.keep files to get custom behavior.
>>
>> E.g. I have the use-case of wanting to set this on servers that I know
>> are going to be used for cloning some big repos in user's ~ directory
>> manually, so if I can set something sensible in /etc/gitconfig that's
>> great, but it sucks a lot more to need to write some cronjob that goes
>> hunting for repos in those ~ directories and tweaks *.keep files.
>
> Yeah, but that is exactly what I suggested, no? That is, if you don't do any
> specific marking to describe _which_ ones need to be kept, this new thing
> would kick in and pick the largest one and repack all others. If you choose
> to want more control, on the other hand, you can mark those packs you
> would want to keep, and this mechanism will not kick in to countermand
> your explicit settings done via those .keep files.

Yes, this configurable mechanism as it stands only needs /etc/gitconfig.

What I was pointing out in this mail is that we really should get the
advanced use-cases right as well via the config (see my
87a7vdqegi@evledraar.gmail.com for details), because it's a pain to
cross the chasm between setting config centrally on the one hand, and
needing to track down .git directories in arbitrary locations on the
FS (you may not have cloned them yourself) to set *.keep flags.

Doubly so if the machines in question are just the laptops of some
developers. It's relatively easy to tell them "we work with git repos,
run these git config commands", not so easy to have them install & keep
up-to-date some arbitrary cronjob that needs to hunt down their repos
and set *.keep flags.


Re: [PATCH v2 3/5] gc --auto: exclude base pack if not enough mem to "repack -ad"

2018-03-12 Thread Junio C Hamano
On Mon, Mar 12, 2018 at 11:56 AM, Ævar Arnfjörð Bjarmason
 wrote:
> As someone who expects to use this (although hopefully in slightly
> modified form), it's very useful if we can keep the useful semantics in
> gc.* config values without needing some external job finding repos and
> creating *.keep files to get custom behavior.
>
> E.g. I have the use-case of wanting to set this on servers that I know
> are going to be used for cloning some big repos in user's ~ directory
> manually, so if I can set something sensible in /etc/gitconfig that's
> great, but it sucks a lot more to need to write some cronjob that goes
> hunting for repos in those ~ directories and tweaks *.keep files.

Yeah, but that is exactly what I suggested, no? That is, if you don't do any
specific marking to describe _which_ ones need to be kept, this new thing
would kick in and pick the largest one and repack all others. If you choose
to want more control, on the other hand, you can mark those packs you
would want to keep, and this mechanism will not kick in to countermand
your explicit settings done via those .keep files.
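
The manual mechanism referred to here is just a marker file: repack
leaves a pack pack-X.pack alone when a sibling pack-X.keep exists. A
tiny sketch of deriving that path (the pack name below is made up for
illustration):

```shell
# Given a pack path, compute the .keep path that marks it as kept.
keep_path() {
    printf '%s\n' "${1%.pack}.keep"
}

# Creating the marker by hand would then be:
#   touch "$(keep_path .git/objects/pack/pack-0123abcd.pack)"
keep_path .git/objects/pack/pack-0123abcd.pack
# → .git/objects/pack/pack-0123abcd.keep
```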


Re: [PATCH v2 3/5] gc --auto: exclude base pack if not enough mem to "repack -ad"

2018-03-12 Thread Ævar Arnfjörð Bjarmason

On Tue, Mar 06 2018, Nguyễn Thái Ngọc Duy jotted:

> pack-objects could be a big memory hog especially on large repos,
> everybody knows that. The suggestion to stick a .keep file on the
> giant base pack to avoid this problem is also known for a long time.
>
> Let's do the suggestion automatically instead of waiting for people to
> come to Git mailing list and get the advice. When a certain condition
> is met, "gc --auto" tells "git repack" to keep the base pack around.
> The end result would be two packs instead of one.
>
> On linux-2.6.git, valgrind massif reports 1.6GB heap in "pack all"
> case, and 535MB [1] in "pack all except the base pack" case. We save
> roughly 1GB memory by excluding the base pack.
>
> gc --auto decides to do this based on an estimation of pack-objects
> memory usage, which is quite accurate at least for the heap part, and
> whether that fits in half of system memory (the assumption here is for
> desktop environment where there are many other applications running).
>
> Since the estimation may be inaccurate and that 1/2 threshold is
> really arbitrary, give the user a finer control over this mechanism:
> if the largest pack is larger than gc.bigBasePackThreshold, it's kept.
>
> PS. A big chunk of the remaining 535MB is the result of pack-objects
> running rev-list internally. This will be dealt with when we could run
> rev-list externally. Right now we can't because pack-objects internal
> rev-list does more regarding unreachable objects, which cannot be done
> by "git rev-list".
>
> Signed-off-by: Nguyễn Thái Ngọc Duy 
> ---
>  Documentation/config.txt |   7 ++
>  Documentation/git-gc.txt |  13 
>  builtin/gc.c | 153 +--
>  builtin/pack-objects.c   |   2 +-
>  config.mak.uname |   1 +
>  git-compat-util.h|   4 +
>  pack-objects.h   |   2 +
>  t/t6500-gc.sh|  29 
>  8 files changed, 204 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index f57e9cf10c..120cf6bac9 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -1549,6 +1549,13 @@ gc.autoDetach::
>   Make `git gc --auto` return immediately and run in background
>   if the system supports it. Default is true.
>
> +gc.bigBasePackThreshold::
> + Make `git gc --auto` only enable `--keep-base-pack` when the
> + base pack's size is larger than this limit (in bytes).
> + Defaults to zero, which disables this check and lets
> + `git gc --auto` determine when to enable `--keep-base-pack`
> + based on memory usage.
> +

I'm really keen to use this (and would be happy to apply a patch on
top), but want to get your thoughts first, see also my just-sent
87bmftqg1n@evledraar.gmail.com
(https://public-inbox.org/git/87bmftqg1n@evledraar.gmail.com/).

The thing I'd like to change is that the underlying --keep-pack= takes a
list of paths (good!), but then I think this patch needlessly
complicates things by talking about "base packs" and having the
implementation limitation that we only ever pass one --keep-pack down to
pack-objects (bad!).

Why don't we instead just have a gc.* variable that you can set to some
size of pack that we'll always implicitly *.keep? That way I can
e.g. clone a 5GB pack and set the limit to 2GB, then keep adding new
content per the rules of gc.autoPackLimit, until I finally create a
larger than 2GB pack, at that point I'll have 5GB, 2GB, and some smaller
packs and loose objects.

We already have pack.packSizeLimit, perhaps we could call this
e.g. gc.keepPacksSize=2GB?

Or is there a use-case for still having the concept of a "base" pack? Is
it magic in some way? Maybe I'm missing something but I don't see why,
we can just stop thinking about whether some one pack is larger than X,
and consider all packs larger than X specially.
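
The "consider all packs larger than X specially" idea can be sketched
as a filter: given pack names with sizes, print the ones that would get
a --keep-pack under the proposed size-based semantics (the names, sizes,
and stdin plumbing here are made up for illustration):

```shell
# Read "name size" pairs on stdin and print the names of packs at
# or above the threshold, i.e. the candidates for --keep-pack.
packs_to_keep() {
    threshold=$1
    while read -r name size; do
        if [ "$size" -ge "$threshold" ]; then
            printf '%s\n' "$name"
        fi
    done
}

# With a threshold of 2000 (MB), the 5000 MB and 2000 MB packs are
# kept and only the 400 MB pack gets repacked:
printf 'pack-a.pack 5000\npack-b.pack 2000\npack-c.pack 400\n' |
    packs_to_keep 2000
# → pack-a.pack
# → pack-b.pack
```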

But if we do, maybe an extra gc.keepBasePack=true?

Finally I wonder if there should be something equivalent to
gc.autoPackLimit for this. I.e. with my proposed semantics above it's
possible that we end up growing forever, i.e. I could have 1000 2GB
packs and then 50 very small packs per gc.autoPackLimit.

Maybe we need a gc.keepPackLimit=100 to deal with that, then e.g. if
gc.keepPacksSize=2GB is set and we have 101 >= 2GB packs, we'd pick the
two smallest ones and not issue a --keep-pack for those, although then
maybe our memory use would spike past the limit.

I don't know, maybe we can leave that for later, but I'm quite keen to
turn the top-level config variable into something that just considers
size instead of "base" if possible, and it seems we're >95% of the way
to that already with this patch.

Finally, I don't like the way the current implementation conflates a
"size" variable with auto detecting the size from memory, leaving no way
to fallback to the auto-detection if you set it manually.

I think we should split out the auto-memory behavior into another
variable, and also make the currently hardcoded 50% of memory
configurable.

Re: [PATCH v2 3/5] gc --auto: exclude base pack if not enough mem to "repack -ad"

2018-03-12 Thread Ævar Arnfjörð Bjarmason

On Wed, Mar 07 2018, Junio C. Hamano jotted:

> Duy Nguyen  writes:
>>> But to those who say "packs larger than this value is too big" via
>>> configuration, keeping only the largest of these above-threshold
>>> packs would look counter-intuitive, wouldn't it, I wonder?
>>
>> I think I'll just clarify this in the document. There may be a use
>> case for keeping multiple large packs, but I don't see it (*). We can
>> deal with it when it comes.
>
> When the project's history grows too much, a large pack that holds
> its first 10 years of stuff, together with another one that holds
> its second 20 years of stuff, may both be larger than the threshold
> and want to be kept.  If we pick only the largest one, we would
> explode the other one and repack together with loose objects.
>
> But realistically, those who would want to control the way in which
> their repository is packed to such a degree are very likely to add
> ".keep" files to these two packfiles themselves, so the above would
> probably not be a concern.  Perhaps we shouldn't do the "automatically
> pick the largest one and exclude from repacking" when there is a
> packfile that is marked with ".keep"?

As someone who expects to use this (although hopefully in slightly
modified form), it's very useful if we can keep the useful semantics in
gc.* config values without needing some external job finding repos and
creating *.keep files to get custom behavior.

E.g. I have the use-case of wanting to set this on servers that I know
are going to be used for cloning some big repos in user's ~ directory
manually, so if I can set something sensible in /etc/gitconfig that's
great, but it sucks a lot more to need to write some cronjob that goes
hunting for repos in those ~ directories and tweaks *.keep files.


Re: [PATCH v2 3/5] gc --auto: exclude base pack if not enough mem to "repack -ad"

2018-03-07 Thread Junio C Hamano
Johannes Schindelin  writes:

> Junio, may I ask you to put this into a SQUASH??? commit so that the
> Windows build no longer fails?

Thanks for a reminder; I also spotted it (I first thought I screwed
up in my editor while reviewing and then went back to the original
on the list) and sent out a response, but then by that time I was
already far into the day's integration cycle.

Will queue a SQUASH??? at the tip.


Re: [PATCH v2 3/5] gc --auto: exclude base pack if not enough mem to "repack -ad"

2018-03-07 Thread Junio C Hamano
Duy Nguyen  writes:

>>> +Set environment variable `GIT_TRACE` in order to see the memory usage
>>> +estimation in `git gc --auto` that determines whether the base pack is
>>> +kept.
>>
>> This is somewhat a puzzling use of trace.  It sounds more like a way
>> to find out "how" memory usage estimation is done and arriving at a
>> wrong value for those who want to debug the feature.
>
> Yeah. I'm not sure this estimation could really be problematic enough
> that people need to debug it this way. A cleaner way (if we think people
> will need this often) is just to add a new option in "git gc" to report
> this estimation breakdown and do nothing else.

Actually after reading the implementation and seeing what it does, I
personally do not have any problem with the way GIT_TRACE is used
for this purpose in this patch.  I am not sure how interesting the
information given by that codepath is in real life; it looks even less
interesting than, say, what comes out of "verify-pack --stat".

>> This is finding the largest pack.
>
> The discussion on .keep files raises one question for me: what if the
> largest pack already has a .keep file, do we still consider it the
> base pack, or should we find the next largest non-kept pack?
>
> I'm guessing we keep things simple here and ignore .keep files.

I agree that it is a sensible design decision.

>>> +#elif defined(GIT_WINDOWS_NATIVE)
>>> + MEMORYSTATUSEX memInfo;
>>> +
>>> + memInfo.dwLength = sizeof(MEMORYSTATUSEX);
>>> + if (GlobalMemoryStatusEx(&memInfo))
>>> + return memInfo;ullTotalPhys;
>>
>> Is this legal C in Microsoft land?
>
> That's the problem with writing code without a way to test it. At
> least travis helped catch a compiler bug on mac.
>
> I'm torn between just #error here and let Windows/Mac guys implement
> it (which they may scream "too much work, I don't wanna") but if I
> help write something first, yes things are potentially broken and need
> verification from those guys. I guess I'll just fix this up and hope
> non-linux guys do the rest.

Yup, we all collaborate and help in ways each of us can.  None of us
can be expected to do any more than that ;-)

>> I wonder if the above should go somewhere under compat/ without
>> ifdef but split into separate files for individual archs (I do not
>> know the answer to this question).
>
> My first choice too. I chose this way after seeing online_cpus()
> thread-utils.c. Not sure which way is best either.

OK.

>> But to those who say "packs larger than this value is too big" via
>> configuration, keeping only the largest of these above-threshold
>> packs would look counter-intuitive, wouldn't it, I wonder?
>
> I think I'll just clarify this in the document. There may be a use
> case for keeping multiple large packs, but I don't see it (*). We can
> deal with it when it comes.

When the project's history grows too much, a large pack that holds
its first 10 years of stuff, together with another one that holds
its second 20 years of stuff, may both be larger than the threshold
and want to be kept.  If we pick only the largest one, we would
explode the other one and repack together with loose objects.

But realistically, those who would want to control the way in which
their repository is packed to such a degree are very likely to add
".keep" files to these two packfiles themselves, so the above would
probably not be a concern.  Perhaps we shouldn't do the "automatically
pick the largest one and exclude from repacking" when there is a
packfile that is marked with ".keep"?

> Using a super large gc.bigBasePackThreshold to disable keeping the
> base pack is intended. But I can't go too high here; it may break
> limits on 32-bit platforms. And 2g sounds really arbitrary.

You could use 42m instead to clarify that it really is an arbitrary
threshold that was chosen only for the purpose of this test perhaps?
;-)


Re: [PATCH v2 3/5] gc --auto: exclude base pack if not enough mem to "repack -ad"

2018-03-07 Thread Duy Nguyen
On Wed, Mar 7, 2018 at 2:19 AM, Junio C Hamano  wrote:
>> +--keep-base-pack::
>> + All packs except the base pack are consolidated into a single
>> + pack. The largest pack is considered the base pack.
>
> This makes it sound as if packs with .keep are also repacked unless
> they meet the threshold for "base pack".  Is that what you actually
> implemented?

Copy/paste problem. That is, I copied this from --auto description,
but I missed the "(except those marked with a `.keep` file)" part. No,
I don't think ignoring .keep files is a good idea, at least not by
default.

> In order to do so, [2/5] needs to allow the "--keep-pack" option
> override the on-disk .keep files.  In an earlier message, I wondered
> if such an arrangement is useful in some use cases; I think it is,
> and because those who do want to see the on-disk .keep files honored
> can collect and include them in the set of packs to be kept via
> "--keep-pack" (after all this is an option for low-level scripting
> anyway).

At the gc level I don't think we need to allow this. But yeah, git-repack
has --pack-kept-objects to ignore .keep. If they specify this, then
repack should ignore .keep (but still follow whatever --keep-pack is
specified). There's some interesting interaction between .keep and the
pack bitmap feature in pack-objects though. I'm not so sure what
happens down there yet.

>> +Set environment variable `GIT_TRACE` in order to see the memory usage
>> +estimation in `git gc --auto` that determines whether the base pack is
>> +kept.
>
> This is somewhat a puzzling use of trace.  It sounds more like a way
> to find out "how" memory usage estimation is done and arriving at a
> wrong value for those who want to debug the feature.

Yeah. I'm not sure this estimation could really be problematic enough
that people need to debug it this way. A cleaner way (if we think people
will need this often) is just to add a new option in "git gc" to report
this estimation breakdown and do nothing else.

>> +static struct packed_git *find_the_base_pack(void)
>> +{
>> + struct packed_git *p, *base = NULL;
>> +
>> + prepare_packed_git();
>> +
>> + for (p = packed_git; p; p = p->next) {
>> + if (p->pack_local &&
>> + (!base || base->pack_size < p->pack_size))
>> + base = p;
>> + }
>> +
>> + return base;
>> +}
>
> This is finding the largest pack.

The discussion on .keep files raises one question for me: what if the
largest pack already has a .keep file, do we still consider it the
base pack, or should we find the next largest non-kept pack?

I'm guessing we keep things simple here and ignore .keep files.

>> +#ifdef HAVE_SYSINFO
>> + struct sysinfo si;
>> +
>> + if (!sysinfo(&si))
>> + return si.totalram;
>> +#elif defined(HAVE_BSD_SYSCTL) && defined(HW_MEMSIZE)
>> + int64_t physical_memory;
>> + int mib[2];
>> + size_t length;
>> +
>> + mib[0] = CTL_HW;
>> + mib[1] = HW_MEMSIZE;
>> + length = sizeof(int64_t);
>> + if (!sysctl(mib, 2, &physical_memory, &length, NULL, 0))
>> + return physical_memory;
>> +#elif defined(GIT_WINDOWS_NATIVE)
>> + MEMORYSTATUSEX memInfo;
>> +
>> + memInfo.dwLength = sizeof(MEMORYSTATUSEX);
>> + if (GlobalMemoryStatusEx(&memInfo))
>> + return memInfo;ullTotalPhys;
>
> Is this legal C in Microsoft land?

That's the problem with writing code without a way to test it. At
least travis helped catch a compiler bug on mac.

I'm torn between just #error here and let Windows/Mac guys implement
it (which they may scream "too much work, I don't wanna") but if I
help write something first, yes things are potentially broken and need
verification from those guys. I guess I'll just fix this up and hope
non-linux guys do the rest.

>> +#else
>> + fprintf(stderr, _("unrecognized platform, assuming %lu GB RAM\n"),
>> + default_ram);
>> +#endif
>> + return default_ram * 1024 * 1024 * 1024;
>> +}
>
> I wonder if the above should go somewhere under compat/ without
> ifdef but split into separate files for individual archs (I do not
> know the answer to this question).

My first choice too. I chose this way after seeing online_cpus()
thread-utils.c. Not sure which way is best either.

>> @@ -427,8 +557,19 @@ int cmd_gc(int argc, const char **argv, const char 
>> *prefix)
>>*/
>>   daemonized = !daemonize();
>>   }
>> - } else
>> - add_repack_all_option();
>> + } else {
>> + struct packed_git *base_pack = find_the_base_pack();
>> + struct packed_git *exclude = NULL;
>> +
>> + if (keep_base_pack != -1) {
>> + if (keep_base_pack)
>> + exclude = base_pack;
>
> OK, --keep-base-pack option always wins if given...
>
>> + } else if (base_pack && big_base_pack_threshold &&
>> +base_pack->pack_size >= big_base_pack_threshold) {

Re: [PATCH v2 3/5] gc --auto: exclude base pack if not enough mem to "repack -ad"

2018-03-07 Thread Johannes Schindelin
Hi Duy,

On Tue, 6 Mar 2018, Nguyễn Thái Ngọc Duy wrote:

> diff --git a/builtin/gc.c b/builtin/gc.c
> index 77fa720bd0..273657ddf4 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -187,7 +211,101 @@ static int too_many_packs(void)
>   return gc_auto_pack_limit < cnt;
>  }
>  
> -static void add_repack_all_option(void)
> +static inline unsigned long total_ram(void)
> +{
> + unsigned long default_ram = 4;
> +#ifdef HAVE_SYSINFO
> + struct sysinfo si;
> +
> + if (!sysinfo(&si))
> + return si.totalram;
> +#elif defined(HAVE_BSD_SYSCTL) && defined(HW_MEMSIZE)
> + int64_t physical_memory;
> + int mib[2];
> + size_t length;
> +
> + mib[0] = CTL_HW;
> + mib[1] = HW_MEMSIZE;
> + length = sizeof(int64_t);
> + if (!sysctl(mib, 2, &physical_memory, &length, NULL, 0))
> + return physical_memory;
> +#elif defined(GIT_WINDOWS_NATIVE)
> + MEMORYSTATUSEX memInfo;
> +
> + memInfo.dwLength = sizeof(MEMORYSTATUSEX);
> + if (GlobalMemoryStatusEx(&memInfo))
> + return memInfo;ullTotalPhys;

This fails to compile:

builtin/gc.c: In function 'total_ram':
builtin/gc.c:235:10: error: incompatible types when returning type 'MEMORYSTATUSEX {aka struct _MEMORYSTATUSEX}' but 'long unsigned int' was expected
    return memInfo;ullTotalPhys;
           ^~~
builtin/gc.c:234:2: error: this 'if' clause does not guard... [-Werror=misleading-indentation]
  if (GlobalMemoryStatusEx(&memInfo))
  ^~
builtin/gc.c:235:18: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the 'if'
    return memInfo;ullTotalPhys;
                  ^~~~
builtin/gc.c:235:18: error: 'ullTotalPhys' undeclared (first use in this function)
builtin/gc.c:235:18: note: each undeclared identifier is reported only once for each function it appears in

I suspect that the first semicolon wanted to be a period instead. At least
it fixes the build here (that's all I can test, I'm at GitMerge and miss a
very interesting discussion about the serialized commit graph to write
this):

-- snip --
diff --git a/builtin/gc.c b/builtin/gc.c
index 4f46a99ab54..9c12f1ee9af 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -232,7 +232,7 @@ static inline unsigned long total_ram(void)
 
memInfo.dwLength = sizeof(MEMORYSTATUSEX);
if (GlobalMemoryStatusEx(&memInfo))
-   return memInfo;ullTotalPhys;
+   return memInfo.ullTotalPhys;
 #else
fprintf(stderr, _("unrecognized platform, assuming %lu GB RAM\n"),
default_ram);

-- snap --

Junio, may I ask you to put this into a SQUASH??? commit so that the
Windows build no longer fails?

Thanks,
Dscho

Re: [PATCH v2 3/5] gc --auto: exclude base pack if not enough mem to "repack -ad"

2018-03-06 Thread Junio C Hamano
Nguyễn Thái Ngọc Duy   writes:

> On linux-2.6.git, valgrind massif reports 1.6GB heap in "pack all"
> case, and 535MB [1] in "pack all except the base pack" case. We save
> roughly 1GB memory by excluding the base pack.

;-)

> gc --auto decides to do this based on an estimation of pack-objects
> memory usage, which is quite accurate at least for the heap part, and
> whether that fits in half of system memory (the assumption here is for
> desktop environment where there are many other applications running).

I was still confused by "decides to do this..." after reading the
above three times.  If this is describing the state _with_ this
patch applied, then "Teach 'gc --auto' to do this automatically..."
would make it clear that is what is going on.

> diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
> index 571b5a7e3c..35ad420d5c 100644
> --- a/Documentation/git-gc.txt
> +++ b/Documentation/git-gc.txt
> @@ -59,6 +59,11 @@ then existing packs (except those marked with a `.keep` file)
>  are consolidated into a single pack by using the `-A` option of
>  'git repack'. Setting `gc.autoPackLimit` to 0 disables
>  automatic consolidation of packs.
> ++
> +If the physical amount of memory is considered not enough for `git
> +repack` to run smoothly, `--keep-base-pack` is enabled. This could be
> +overridden by setting `gc.bigBasePackThreshold` which only enables
> +`--keep-base-pack` when the base pack is larger than the specified limit.

I somehow find the flow of logic in these two sentences harder to
follow than necessary.  Perhaps swapping the order may make it
easier to grok?  That is:

 - When gc.bigBasePackThreshold is set, packs larger than that will
   automatically be kept (i.e. not considered for repacking);

 - When it is not set, we try to guess how memory constrained we are,
   and behave as if the threshold were set to the size of the
   largest packfile we have (i.e. that single pack is kept).

I think another and bigger reason why I found the original hard to
read is because it is described for those who already understand
what "--keep-base-pack" option does.  Rewriting it not to require
the pre-existing knowledge of that option would make it a lot easier
to grok, I would think (you may not realize it because you wrote the
feature and are very familiar with it, though).

> +--keep-base-pack::
> + All packs except the base pack are consolidated into a single
> + pack. The largest pack is considered the base pack.

This makes it sound as if packs with .keep are also repacked unless
they meet the threshold for "base pack".  Is that what you actually
implemented?

In order to do so, [2/5] needs to allow the "--keep-pack" option
override the on-disk .keep files.  In an earlier message, I wondered
if such an arrangement is useful in some use cases; I think it is,
and because those who do want to see the on-disk .keep files honored
can collect and include them in the set of packs to be kept via
"--keep-pack" (after all this is an option for low-level scripting
anyway).

> +Set environment variable `GIT_TRACE` in order to see the memory usage
> +estimation in `git gc --auto` that determines whether the base pack is
> +kept.

This is somewhat a puzzling use of trace.  It sounds more like a way
to find out "how" memory usage estimation is done and arriving at a
wrong value for those who want to debug the feature.

> +static unsigned long big_base_pack_threshold;
> +static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;

A new symbol, which is a good addition.

> +static struct packed_git *find_the_base_pack(void)
> +{
> + struct packed_git *p, *base = NULL;
> +
> + prepare_packed_git();
> +
> + for (p = packed_git; p; p = p->next) {
> + if (p->pack_local &&
> + (!base || base->pack_size < p->pack_size))
> + base = p;
> + }
> +
> + return base;
> +}

This is finding the largest pack.

> @@ -187,7 +211,101 @@ static int too_many_packs(void)
>   return gc_auto_pack_limit < cnt;
>  }
>  
> -static void add_repack_all_option(void)
> +static inline unsigned long total_ram(void)

"inline"?  We'd rather have compiler figure it out, no?

> +{
> + unsigned long default_ram = 4;

4 what?  4 bytes?  Name it perhaps "default_ram_gb" or something?

> +#ifdef HAVE_SYSINFO
> + struct sysinfo si;
> +
> + if (!sysinfo(&si))
> + return si.totalram;
> +#elif defined(HAVE_BSD_SYSCTL) && defined(HW_MEMSIZE)
> + int64_t physical_memory;
> + int mib[2];
> + size_t length;
> +
> + mib[0] = CTL_HW;
> + mib[1] = HW_MEMSIZE;
> + length = sizeof(int64_t);
> + if (!sysctl(mib, 2, &physical_memory, &length, NULL, 0))
> + return physical_memory;
> +#elif defined(GIT_WINDOWS_NATIVE)
> + MEMORYSTATUSEX memInfo;
> +
> + memInfo.dwLength = sizeof(MEMORYSTATUSEX);
> + if (GlobalMemoryStatusEx(&memInfo))
> + return memInfo;ullTotalPhys;

Is this legal C in Microsoft land?

[PATCH v2 3/5] gc --auto: exclude base pack if not enough mem to "repack -ad"

2018-03-06 Thread Nguyễn Thái Ngọc Duy
pack-objects could be a big memory hog especially on large repos,
everybody knows that. The suggestion to stick a .keep file on the
giant base pack to avoid this problem is also known for a long time.

Let's do the suggestion automatically instead of waiting for people to
come to Git mailing list and get the advice. When a certain condition
is met, "gc --auto" tells "git repack" to keep the base pack around.
The end result would be two packs instead of one.

On linux-2.6.git, valgrind massif reports 1.6GB heap in "pack all"
case, and 535MB [1] in "pack all except the base pack" case. We save
roughly 1GB memory by excluding the base pack.

gc --auto decides to do this based on an estimation of pack-objects
memory usage, which is quite accurate at least for the heap part, and
whether that fits in half of system memory (the assumption being a
desktop environment where many other applications are running).
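The gating described here boils down to a comparison like the one
below. This is only a sketch: how the two byte counts are obtained
(the heap estimation and total_ram()) is outside it, and the function
name is hypothetical:

```c
#include <stdint.h>

/* Keep the base pack when the estimated pack-objects memory usage
 * would not fit in half of physical RAM.  Both inputs are in bytes. */
static int mem_too_tight_for_full_repack(uint64_t estimated_usage,
					 uint64_t total_ram_bytes)
{
	return estimated_usage >= total_ram_bytes / 2;
}
```

With the linux-2.6.git numbers above, a 1.6GB estimate on a 2GB
machine trips the check, while 535MB on an 8GB machine does not.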

Since the estimation may be inaccurate and that 1/2 threshold is
really arbitrary, give the user a finer control over this mechanism:
if the largest pack is larger than gc.bigBasePackThreshold, it's kept.

PS. A big chunk of the remaining 535MB is the result of pack-objects
running rev-list internally. This will be dealt with when we could run
rev-list externally. Right now we can't because pack-objects internal
rev-list does more regarding unreachable objects, which cannot be done
by "git rev-list".

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 Documentation/config.txt |   7 ++
 Documentation/git-gc.txt |  13 
 builtin/gc.c | 153 +--
 builtin/pack-objects.c   |   2 +-
 config.mak.uname |   1 +
 git-compat-util.h|   4 +
 pack-objects.h   |   2 +
 t/t6500-gc.sh|  29 
 8 files changed, 204 insertions(+), 7 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index f57e9cf10c..120cf6bac9 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1549,6 +1549,13 @@ gc.autoDetach::
Make `git gc --auto` return immediately and run in background
if the system supports it. Default is true.
 
+gc.bigBasePackThreshold::
+   Make `git gc --auto` only enable `--keep-base-pack` when the
+   base pack's size is larger than this limit (in bytes).
+   Defaults to zero, which disables this check and lets
+   `git gc --auto` determine when to enable `--keep-base-pack`
+   based on memory usage.
+
 gc.logExpiry::
If the file gc.log exists, then `git gc --auto` won't run
unless that file is more than 'gc.logExpiry' old.  Default is
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 571b5a7e3c..35ad420d5c 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -59,6 +59,11 @@ then existing packs (except those marked with a `.keep` file)
 are consolidated into a single pack by using the `-A` option of
 'git repack'. Setting `gc.autoPackLimit` to 0 disables
 automatic consolidation of packs.
++
+If the physical amount of memory is considered not enough for `git
+repack` to run smoothly, `--keep-base-pack` is enabled. This could be
+overridden by setting `gc.bigBasePackThreshold` which only enables
+`--keep-base-pack` when the base pack is larger than the specified limit.
 
 --prune=::
Prune loose objects older than date (default is 2 weeks ago,
@@ -78,6 +83,10 @@ automatic consolidation of packs.
Force `git gc` to run even if there may be another `git gc`
instance running on this repository.
 
+--keep-base-pack::
+   All packs except the base pack are consolidated into a single
+   pack. The largest pack is considered the base pack.
+
 Configuration
 -
 
@@ -167,6 +176,10 @@ run commands concurrently have to live with some risk of corruption (which
 seems to be low in practice) unless they turn off automatic garbage
 collection with 'git config gc.auto 0'.
 
+Set environment variable `GIT_TRACE` in order to see the memory usage
+estimation in `git gc --auto` that determines whether the base pack is
+kept.
+
 HOOKS
 -
 
diff --git a/builtin/gc.c b/builtin/gc.c
index 77fa720bd0..273657ddf4 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -20,6 +20,10 @@
 #include "argv-array.h"
 #include "commit.h"
 #include "packfile.h"
+#include "pack.h"
+#include "pack-objects.h"
+#include "blob.h"
+#include "tree.h"
 
 #define FAILED_RUN "failed to run %s"
 
@@ -39,6 +43,8 @@ static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
+static unsigned long big_base_pack_threshold;
+static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;
 
 static struct argv_array pack_refs_cmd = ARGV_ARRAY_INIT;
 static struct argv_array reflog = ARGV_ARRAY_INIT;
@@ -126,6 +132,9 @@ static void gc_config(void)