Re: [PATCH v7 00/13] nd/pack-objects-pack-struct updates

2018-03-26 Thread Jeff King
On Sat, Mar 24, 2018 at 07:33:40AM +0100, Nguyễn Thái Ngọc Duy wrote:

> +unsigned long oe_get_size_slow(struct packing_data *pack,
> +const struct object_entry *e)
> +{
> + struct packed_git *p;
> + struct pack_window *w_curs;
> + unsigned char *buf;
> + enum object_type type;
> + unsigned long used, avail, size;
> +
> + if (e->type_ != OBJ_OFS_DELTA && e->type_ != OBJ_REF_DELTA) {
> + read_lock();
> + if (sha1_object_info(e->idx.oid.hash, &size) < 0)
> + die(_("unable to get size of %s"),
> + oid_to_hex(&e->idx.oid));
> + read_unlock();
> + return size;
> + }
> +
> + p = oe_in_pack(pack, e);
> + if (!p)
> + die("BUG: when e->type is a delta, it must belong to a pack");
> +
> + read_lock();
> + w_curs = NULL;
> + buf = use_pack(p, &w_curs, e->in_pack_offset, &avail);
> + used = unpack_object_header_buffer(buf, avail, &type, &size);
> + if (used == 0)
> + die(_("unable to parse object header of %s"),
> + oid_to_hex(&e->idx.oid));
> +
> + unuse_pack(&w_curs);
> + read_unlock();
> + return size;
> +}

It took me a while to figure out why this treated deltas and non-deltas
differently. At first I thought it was an optimization (since we can
find non-delta sizes quickly by looking at the headers).  But I think
it's just that you want to know the size of the actual _delta_, not the
reconstructed object. And there's no way to ask sha1_object_info() for
that.

Perhaps the _extended version of that function should learn an
OBJECT_INFO_NO_DEREF flag or something to tell it to return the true delta
type and size. Then this whole function could just become a single call.

But short of that, it's probably worth a comment explaining what's going
on.
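
For reference, here is roughly what I was picturing (just a sketch;
OBJECT_INFO_NO_DEREF is a hypothetical flag that does not exist yet):

  unsigned long oe_get_size_slow(struct packing_data *pack,
                                 const struct object_entry *e)
  {
          struct object_info oi = OBJECT_INFO_INIT;
          unsigned long size;

          oi.sizep = &size;
          read_lock();
          /* hypothetical flag: report the stored delta, not the result */
          if (sha1_object_info_extended(e->idx.oid.hash, &oi,
                                        OBJECT_INFO_NO_DEREF) < 0)
                  die(_("unable to get size of %s"),
                      oid_to_hex(&e->idx.oid));
          read_unlock();
          return size;
  }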

> +static void prepare_in_pack_by_idx(struct packing_data *pdata)
> +{
> + struct packed_git **mapping, *p;
> + int cnt = 0, nr = 1 << OE_IN_PACK_BITS;
> +
> + if (getenv("GIT_TEST_FULL_IN_PACK_ARRAY")) {
> + /*
> +  * leave in_pack_by_idx NULL to force in_pack[] to be
> +  * used instead
> +  */
> + return;
> + }

Minor nit, but can we use git_env_bool() here? It's just as easy, and
it's less surprising in some corner cases.
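
I.e., something like this (just a sketch, not tested):

  	if (git_env_bool("GIT_TEST_FULL_IN_PACK_ARRAY", 0)) {
  		/*
  		 * leave in_pack_by_idx NULL to force in_pack[] to be
  		 * used instead
  		 */
  		return;
  	}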

>  struct object_entry *packlist_alloc(struct packing_data *pdata,
>   const unsigned char *sha1,
>   uint32_t index_pos)
>  {
>   struct object_entry *new_entry;
>  
> + if (!pdata->nr_objects) {
> + prepare_in_pack_by_idx(pdata);
> + if (getenv("GIT_TEST_OE_SIZE_BITS")) {
> + int bits = atoi(getenv("GIT_TEST_OE_SIZE_BITS"));;
> + pdata->oe_size_limit = 1 << bits;
> + }
> + if (!pdata->oe_size_limit)
> + pdata->oe_size_limit = 1 << OE_SIZE_BITS;
> + }

Ditto here; I think this could just be:

  pdata->oe_size_limit = git_env_ulong("GIT_TEST_OE_SIZE_BITS",
   (1 << OE_SIZE_BITS));

>   if (pdata->nr_objects >= pdata->nr_alloc) {
>   pdata->nr_alloc = (pdata->nr_alloc  + 1024) * 3 / 2;
>   REALLOC_ARRAY(pdata->objects, pdata->nr_alloc);
> +
> + if (!pdata->in_pack_by_idx)
> + REALLOC_ARRAY(pdata->in_pack, pdata->nr_alloc);
>   }

I was going to complain that we don't use ALLOC_GROW() here, but
actually that part is in the context. ;)

> @@ -35,7 +36,9 @@ enum dfs_state {
>   *
>   * "size" is the uncompressed object size. Compressed size of the raw
>   * data for an object in a pack is not stored anywhere but is computed
> - * and made available when reverse .idx is made.
> + * and made available when reverse .idx is made. Note that when an
> + * delta is reused, "size" is the uncompressed _delta_ size, not the
> + * canonical one after the delta has been applied.

s/an delta/a delta/

> +Running tests with special setups
> +---------------------------------
> +
> +The whole test suite could be run to test some special features
> +that cannot be easily covered by a few specific test cases. These
> +could be enabled by running the test suite with correct GIT_TEST_
> +environment set.
> +
> +GIT_TEST_SPLIT_INDEX forces split-index mode on the whole test suite.
> +
> +GIT_TEST_FULL_IN_PACK_ARRAY exercises the uncommon pack-objects code
> +path where there are more than 1024 packs even if the actual number of
> +packs in the repository is below this limit.
> +
> +GIT_TEST_OE_SIZE_BITS=<bits> exercises the uncommon pack-objects
> +code path where we do not cache object size in memory and read it
> +from existing packs on demand. This normally only happens when the
> +object size is over 2GB. This variable forces the code path on any
> +object larger than 2^<bits> bytes.

It's nice to have these available to test the uncommon cases. But I have
a feeling nobody will ever run them, since it requires extra effort (and
takes a full test run).

I see there's a one-off test for GIT_TEST_FULL_IN_PACK_ARRAY, which I
think is a good idea, since it makes sure the code is exercised in a
normal test suite run. Should we do the same for GIT_TEST_OE_SIZE_BITS?

I haven't done an in-depth read of each patch yet; this was just what
jumped out at me from reading the interdiff.

-Peff

Re: [PATCH v7 00/13] nd/pack-objects-pack-struct updates

2018-03-26 Thread Duy Nguyen
On Mon, Mar 26, 2018 at 5:13 PM, Jeff King wrote:
> On Sat, Mar 24, 2018 at 07:33:40AM +0100, Nguyễn Thái Ngọc Duy wrote:
>
>> +unsigned long oe_get_size_slow(struct packing_data *pack,
>> +const struct object_entry *e)
>> +{
>> + struct packed_git *p;
>> + struct pack_window *w_curs;
>> + unsigned char *buf;
>> + enum object_type type;
>> + unsigned long used, avail, size;
>> +
>> + if (e->type_ != OBJ_OFS_DELTA && e->type_ != OBJ_REF_DELTA) {
>> + read_lock();
>> + if (sha1_object_info(e->idx.oid.hash, &size) < 0)
>> + die(_("unable to get size of %s"),
>> + oid_to_hex(&e->idx.oid));
>> + read_unlock();
>> + return size;
>> + }
>> +
>> + p = oe_in_pack(pack, e);
>> + if (!p)
>> + die("BUG: when e->type is a delta, it must belong to a pack");
>> +
>> + read_lock();
>> + w_curs = NULL;
>> + buf = use_pack(p, &w_curs, e->in_pack_offset, &avail);
>> + used = unpack_object_header_buffer(buf, avail, &type, &size);
>> + if (used == 0)
>> + die(_("unable to parse object header of %s"),
>> + oid_to_hex(&e->idx.oid));
>> +
>> + unuse_pack(&w_curs);
>> + read_unlock();
>> + return size;
>> +}
>
> It took me a while to figure out why this treated deltas and non-deltas
> differently. At first I thought it was an optimization (since we can
> find non-delta sizes quickly by looking at the headers).  But I think
> it's just that you want to know the size of the actual _delta_, not the
> reconstructed object. And there's no way to ask sha1_object_info() for
> that.
>
> Perhaps the _extended version of that function should learn an
> OBJECT_INFO_NO_DEREF flag or something to tell it to return the true delta
> type and size. Then this whole function could just become a single call.
>
> But short of that, it's probably worth a comment explaining what's going
> on.

I thought the elaboration on "size" in the big comment block in front
of struct object_entry was enough. I was wrong. Will add something
here.

>> +Running tests with special setups
>> +---------------------------------
>> +
>> +The whole test suite could be run to test some special features
>> +that cannot be easily covered by a few specific test cases. These
>> +could be enabled by running the test suite with correct GIT_TEST_
>> +environment set.
>> +
>> +GIT_TEST_SPLIT_INDEX forces split-index mode on the whole test suite.
>> +
>> +GIT_TEST_FULL_IN_PACK_ARRAY exercises the uncommon pack-objects code
>> +path where there are more than 1024 packs even if the actual number of
>> +packs in the repository is below this limit.
>> +
>> +GIT_TEST_OE_SIZE_BITS=<bits> exercises the uncommon pack-objects
>> +code path where we do not cache object size in memory and read it
>> +from existing packs on demand. This normally only happens when the
>> +object size is over 2GB. This variable forces the code path on any
>> +object larger than 2^<bits> bytes.
>
> It's nice to have these available to test the uncommon cases. But I have
> a feeling nobody will ever run them, since it requires extra effort (and
> takes a full test run).

I know :) I also know that this does not interfere with
GIT_TEST_SPLIT_INDEX, which is already being run on Travis. So the plan
(after this series is merged) is to make Travis's second run do something
like

make test GIT_TEST_SPLIT...=1 GIT_TEST_FULL..=1 GIT_TEST_OE..=4

That way we don't waste more CPU cycles, and we can make sure these code
paths are always run (at least on one platform).

> I see there's a one-off test for GIT_TEST_FULL_IN_PACK_ARRAY, which I
> think is a good idea, since it makes sure the code is exercised in a
> normal test suite run. Should we do the same for GIT_TEST_OE_SIZE_BITS?

I think the problem with OE_SIZE_BITS is that it affects many different
code paths (like reused deltas), which makes it hard to ensure they all
run. But yes, I think I could construct a pack that executes both code
paths in oe_get_size_slow(). Will do in a reroll.

> I haven't done an in-depth read of each patch yet; this was just what
> jumped out at me from reading the interdiff.

I would really appreciate it if you could find some time to do it. The
bugs I found in this round proved that I had no idea what's really
going on in pack-objects. Sure, I know the big picture, but that's far
from enough to make changes like this.
-- 
Duy


Re: [PATCH v7 00/13] nd/pack-objects-pack-struct updates

2018-03-27 Thread Jeff King
On Mon, Mar 26, 2018 at 07:04:54PM +0200, Duy Nguyen wrote:

> >> +unsigned long oe_get_size_slow(struct packing_data *pack,
> >> +const struct object_entry *e)
> [...]
> > But short of that, it's probably worth a comment explaining what's going
> > on.
> 
> I thought the elaboration on "size" in the big comment block in front
> of struct object_entry was enough. I was wrong. Will add something
> here.

It may be my fault for reading the interdiff, which didn't include that
comment. I was literally just thinking something like:

  /*
   * Return the size of the object without doing any delta
   * reconstruction (so non-deltas are true object sizes, but
   * deltas return the size of the delta data).
   */

> > I see there's a one-off test for GIT_TEST_FULL_IN_PACK_ARRAY, which I
> > think is a good idea, since it makes sure the code is exercised in a
> > normal test suite run. Should we do the same for GIT_TEST_OE_SIZE_BITS?
> 
> I think the problem with OE_SIZE_BITS is it has many different code
> paths (like reused deltas) which is hard to make sure it runs. But yes
> I think I could construct a pack that executes both code paths in
> oe_get_size_slow(). Will do in a reroll.

OK. If it's too painful to construct a good example, don't worry about
it.  It sounds like we're unlikely to get full coverage anyway.

> > I haven't done an in-depth read of each patch yet; this was just what
> > jumped out at me from reading the interdiff.
> 
> I would really appreciate it if you could find some time to do it. The
> bugs I found in this round proved that I had no idea what's really
> going on in pack-objects. Sure I know the big picture but that's far
> from enough to do changes like this.

I didn't get to it today, but I'll try to give it a careful read. There
are quite a few corners of pack-objects I don't know well, but I think
at this point I may be the most expert of the remaining people. Scary. :)

-Peff