Re: [PATCH v3 2/7] gc: add --keep-base-pack

2018-03-19 Thread Ævar Arnfjörð Bjarmason

On Mon, Mar 19 2018, Duy Nguyen jotted:

> On Fri, Mar 16, 2018 at 10:05 PM, Ævar Arnfjörð Bjarmason
>  wrote:
>>
>> On Fri, Mar 16 2018, Nguyễn Thái Ngọc Duy jotted:
>>
>>> +--keep-base-pack::
>>> + All packs except the base pack and those marked with a `.keep`
>>> + files are consolidated into a single pack. The largest pack is
>>> + considered the base pack.
>>> +
>>
>> I wonder if all of this would be less confusing as:
>>
>>> +--keep-biggest-pack::
>>> + All packs except the largest pack and those marked with a `.keep`
>>> + files are consolidated into a single pack.
>>
>> I.e. just skimming these docs I'd expect "base" to somehow be the thing
>> that we initially cloned, of course in almost all cases that *is* the
>> largest pack, but not necessarily. So rather than communicate that
>> expectation let's just say largest/biggest?
>
> Keeping the term base pack allows us to change its definition later
> (something else other than "largest"). But to be honest I can't see
> what else can a base pack(s) be. So unless people object I'm changing
> this to --keep-biggest-pack (which could take a value  to keep 
> largest packs, but I will refrain from doing things we don't need
> right now).

Maybe I've just been reading this differently, but to me the "base" pack
means the pack that holds the basis of our history, i.e. the thing we
first cloned. As in the base of the history.

Let's say we have a 100MB pack that we cloned, and someone adds a 200MB
(uncompressible) binary file to the repo, then we'll have a "base" pack
that's smaller than the "largest" pack.

So when I was initially reading this series I kept looking for some
discovery of *that* pack, but of course it turned out that it's just
looking for the largest pack.

I just think it's best to avoid that confusion since we really mean
largest, and maybe in the future it would be legitimate to treat the
"base" pack differently, i.e. as you pull down more updates you're
likely to need to be consulting it less and less as time goes on, so
maybe we should have some mode to explicitly exclude just *that* pack
eventually. I.e. as an optimization to keep the more relevant stuff in
cache.


Re: [PATCH v3 2/7] gc: add --keep-base-pack

2018-03-19 Thread Duy Nguyen
On Fri, Mar 16, 2018 at 10:05 PM, Ævar Arnfjörð Bjarmason
 wrote:
>
> On Fri, Mar 16 2018, Nguyễn Thái Ngọc Duy jotted:
>
>> +--keep-base-pack::
>> + All packs except the base pack and those marked with a `.keep`
>> + files are consolidated into a single pack. The largest pack is
>> + considered the base pack.
>> +
>
> I wonder if all of this would be less confusing as:
>
>> +--keep-biggest-pack::
>> + All packs except the largest pack and those marked with a `.keep`
>> + files are consolidated into a single pack.
>
> I.e. just skimming these docs I'd expect "base" to somehow be the thing
> that we initially cloned, of course in almost all cases that *is* the
> largest pack, but not necessarily. So rather than communicate that
> expectation let's just say largest/biggest?

Keeping the term base pack allows us to change its definition later
(something else other than "largest"). But to be honest I can't see
what else can a base pack(s) be. So unless people object I'm changing
this to --keep-biggest-pack (which could take a value  to keep 
largest packs, but I will refrain from doing things we don't need
right now).

>
> Maybe I'm the only one who finds this confusing...
-- 
Duy


Re: [PATCH v3 2/7] gc: add --keep-base-pack

2018-03-16 Thread Ævar Arnfjörð Bjarmason

On Fri, Mar 16 2018, Nguyễn Thái Ngọc Duy jotted:

>   struct option builtin_gc_options[] = {
>   OPT__QUIET(, N_("suppress progress reporting")),
> @@ -362,6 +390,8 @@ int cmd_gc(int argc, const char **argv, const char 
> *prefix)
>   OPT_BOOL(0, "aggressive", , N_("be more thorough 
> (increased runtime)")),
>   OPT_BOOL(0, "auto", _gc, N_("enable auto-gc mode")),
>   OPT_BOOL(0, "force", , N_("force running gc even if there 
> may be another gc running")),
> + OPT_BOOL(0, "keep-base-pack", _base_pack,
> +  N_("repack all other packs except the base pack")),
>   OPT_END()
>   };

There's an easy to solve merge conflict here between the current master
& this. Pushed out a solution (for my own use) at
https://github.com/avar/git/ gc-auto-keep-base-pack. Interdiff with
yours:

@@ -112,9 +112,9 @@
struct option builtin_gc_options[] = {
OPT__QUIET(, N_("suppress progress reporting")),
 @@
-   OPT_BOOL(0, "aggressive", , N_("be more thorough 
(increased runtime)")),
-   OPT_BOOL(0, "auto", _gc, N_("enable auto-gc mode")),
-   OPT_BOOL(0, "force", , N_("force running gc even if there 
may be another gc running")),
+   OPT_BOOL_F(0, "force", ,
+  N_("force running gc even if there may be another gc 
running"),
+  PARSE_OPT_NOCOMPLETE),
 +  OPT_BOOL(0, "keep-base-pack", _base_pack,
 +   N_("repack all other packs except the base pack")),
OPT_END()


Re: [PATCH v3 2/7] gc: add --keep-base-pack

2018-03-16 Thread Ævar Arnfjörð Bjarmason

On Fri, Mar 16 2018, Nguyễn Thái Ngọc Duy jotted:

> +--keep-base-pack::
> + All packs except the base pack and those marked with a `.keep`
> + files are consolidated into a single pack. The largest pack is
> + considered the base pack.
> +

I wonder if all of this would be less confusing as:

> +--keep-biggest-pack::
> + All packs except the largest pack and those marked with a `.keep`
> + files are consolidated into a single pack.

I.e. just skimming these docs I'd expect "base" to somehow be the thing
that we initially cloned, of course in almost all cases that *is* the
largest pack, but not necessarily. So rather than communicate that
expectation let's just say largest/biggest?

Maybe I'm the only one who finds this confusing...


[PATCH v3 2/7] gc: add --keep-base-pack

2018-03-16 Thread Nguyễn Thái Ngọc Duy
This adds a new repack mode that combines everything into a secondary
pack, leaving the largest/base pack alone.

This could help reduce memory pressure. On linux-2.6.git, valgrind
massif reports 1.6GB heap in "pack all" case, and 535MB in "pack
all except the base pack" case. We save roughly 1GB memory by
excluding the base pack.

This should also lower I/O because we don't have to rewrite a giant
pack every time (e.g. for linux-2.6.git that's a 1.4GB pack file)..

PS. The use of string_list here seems overkill, but we'll need it in
the next patch...

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 Documentation/git-gc.txt |  7 +-
 builtin/gc.c | 47 
 t/t6500-gc.sh| 22 +++
 3 files changed, 71 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 571b5a7e3c..1717517043 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -9,7 +9,7 @@ git-gc - Cleanup unnecessary files and optimize the local 
repository
 SYNOPSIS
 
 [verse]
-'git gc' [--aggressive] [--auto] [--quiet] [--prune= | --no-prune] 
[--force]
+'git gc' [--aggressive] [--auto] [--quiet] [--prune= | --no-prune] 
[--force] [--keep-base-pack]
 
 DESCRIPTION
 ---
@@ -78,6 +78,11 @@ automatic consolidation of packs.
Force `git gc` to run even if there may be another `git gc`
instance running on this repository.
 
+--keep-base-pack::
+   All packs except the base pack and those marked with a `.keep`
+   files are consolidated into a single pack. The largest pack is
+   considered the base pack.
+
 Configuration
 -
 
diff --git a/builtin/gc.c b/builtin/gc.c
index 77fa720bd0..362dd537a4 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -164,6 +164,24 @@ static int too_many_loose_objects(void)
return needed;
 }
 
+static void find_base_packs(struct string_list *packs)
+{
+   struct packed_git *p, *base = NULL;
+
+   prepare_packed_git();
+
+   for (p = packed_git; p; p = p->next) {
+   if (!p->pack_local)
+   continue;
+   if (!base || base->pack_size < p->pack_size) {
+   base = p;
+   }
+   }
+
+   if (base)
+   string_list_append(packs, base->pack_name);
+}
+
 static int too_many_packs(void)
 {
struct packed_git *p;
@@ -187,7 +205,13 @@ static int too_many_packs(void)
return gc_auto_pack_limit < cnt;
 }
 
-static void add_repack_all_option(void)
+static int keep_one_pack(struct string_list_item *item, void *data)
+{
+   argv_array_pushf(, "--keep-pack=%s", basename(item->string));
+   return 0;
+}
+
+static void add_repack_all_option(struct string_list *keep_pack)
 {
if (prune_expire && !strcmp(prune_expire, "now"))
argv_array_push(, "-a");
@@ -196,6 +220,9 @@ static void add_repack_all_option(void)
if (prune_expire)
argv_array_pushf(, "--unpack-unreachable=%s", 
prune_expire);
}
+
+   if (keep_pack)
+   for_each_string_list(keep_pack, keep_one_pack, NULL);
 }
 
 static void add_repack_incremental_option(void)
@@ -219,7 +246,7 @@ static int need_to_gc(void)
 * there is no need.
 */
if (too_many_packs())
-   add_repack_all_option();
+   add_repack_all_option(NULL);
else if (too_many_loose_objects())
add_repack_incremental_option();
else
@@ -353,6 +380,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
const char *name;
pid_t pid;
int daemonized = 0;
+   int keep_base_pack = -1;
 
struct option builtin_gc_options[] = {
OPT__QUIET(, N_("suppress progress reporting")),
@@ -362,6 +390,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
OPT_BOOL(0, "aggressive", , N_("be more thorough 
(increased runtime)")),
OPT_BOOL(0, "auto", _gc, N_("enable auto-gc mode")),
OPT_BOOL(0, "force", , N_("force running gc even if there 
may be another gc running")),
+   OPT_BOOL(0, "keep-base-pack", _base_pack,
+N_("repack all other packs except the base pack")),
OPT_END()
};
 
@@ -427,8 +457,17 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 */
daemonized = !daemonize();
}
-   } else
-   add_repack_all_option();
+   } else {
+   struct string_list keep_pack = STRING_LIST_INIT_NODUP;
+
+   if (keep_base_pack != -1) {
+   if (keep_base_pack)
+   find_base_packs(_pack);
+   }
+
+   add_repack_all_option(_pack);
+   string_list_clear(_pack, 0);
+   }
 
name =