Re: [PATCH v4 01/11] pack-objects: a bit of document about struct object_entry

2018-03-17 Thread Duy Nguyen
On Fri, Mar 16, 2018 at 9:32 PM, Junio C Hamano  wrote:
>> +/*
>> + * basic object info
>> + * -
>> + * idx.oid is filled up before delta searching starts. idx.crc32 and
>> + * is only valid after the object is written out and will be used for
>
> "and is"?

There was another field that I thought was only valid after blah blah.
But it was wrong and I forgot to delete this "and" after deleting that
field.

>> + * "hash" contains a path name hash which is used for sorting the
>> + * delta list and also during delta searching. Once prepare_pack()
>> + * returns it's no longer needed.
>
> Hmm, that suggests an interesting optimization opportunity ;-)

Heh.. it does not reduce peak memory consumption though which is why
I'm less interested in freeing it after prepare_pack().

>> + * source pack info
>> + * 
>> + * The (in_pack, in_pack_offset, in_pack_header_size) tuple contains
>> + * the location of the object in the source pack, with or without
>> + * header.
>
> "with or without", meaning...?  An object in the source pack may or
> may not have any in_pack_header, in which case in_pack_header_size
> is zero, or something?  Not suggesting to rephrase (at least not
> yet), but trying to understand.

The location with the header (i.e. true beginning an object in a pack)
or without/after the header so you are at the zlib stream, ready to
inflate or reuse. I'll rephrase this a bit.

>> + *
>> + * delta_child and delta_sibling are last needed in
>> + * compute_write_order(). "delta" and "delta_size" must remain valid
>> + * at object writing phase in case the delta is not cached.
>
> True.  I thought child and sibling are only needed during write
> order computing, so there may be an optimization opportunity there.

See. I wrote all this for a reason. Somebody looking for low hang
fruit can always find some ;-)
-- 
Duy


Re: [PATCH v4 01/11] pack-objects: a bit of document about struct object_entry

2018-03-16 Thread Junio C Hamano
Nguyễn Thái Ngọc Duy   writes:

> The role of this comment block becomes more important after we shuffle
> fields around to shrink this struct. It will be much harder to see what
> field is related to what.
>
> Signed-off-by: Nguyễn Thái Ngọc Duy 
> ---
>  pack-objects.h | 44 
>  1 file changed, 44 insertions(+)
>
> diff --git a/pack-objects.h b/pack-objects.h
> index 03f1191659..85345a4af1 100644
> --- a/pack-objects.h
> +++ b/pack-objects.h
> @@ -1,6 +1,50 @@
>  #ifndef PACK_OBJECTS_H
>  #define PACK_OBJECTS_H
>  
> +/*
> + * basic object info
> + * -
> + * idx.oid is filled up before delta searching starts. idx.crc32 and
> + * is only valid after the object is written out and will be used for

"and is"?

> + * generating the index. idx.offset will be both gradually set and
> + * used in writing phase (base objects get offset first, then deltas
> + * refer to them)
> + *
> + * "size" is the uncompressed object size. Compressed size is not
> + * cached (ie. raw data in a pack) but available via revindex.

I am having a hard time understanding what "ie. raw data in a pack"
is doing in that sentence.

It is correct that compressed size is not cached; it does not even
exist and the only way to know it is to compute it by reversing the
.idx file (or actually uncompressing the compressed stream).

Perhaps:

Compressed size of the raw data for an object in a pack is not
stored anywhere but is computed and made available when reverse
.idx is made.

> + * "hash" contains a path name hash which is used for sorting the
> + * delta list and also during delta searching. Once prepare_pack()
> + * returns it's no longer needed.

Hmm, that suggests an interesting optimization opportunity ;-)

> + * source pack info
> + * 
> + * The (in_pack, in_pack_offset, in_pack_header_size) tuple contains
> + * the location of the object in the source pack, with or without
> + * header.

"with or without", meaning...?  An object in the source pack may or
may not have any in_pack_header, in which case in_pack_header_size
is zero, or something?  Not suggesting to rephrase (at least not
yet), but trying to understand.

> + * "type" and "in_pack_type" both describe object type. in_pack_type
> + * may contain a delta type, while type is always the canonical type.
> + *
> + * deltas
> + * --
> + * Delta links (delta, delta_child and delta_sibling) are created
> + * reflect that delta graph from the source pack then updated or added
> + * during delta searching phase when we find better deltas.

Isn't anything missing after "are created"?  Perhaps "to"?

> + *
> + * delta_child and delta_sibling are last needed in
> + * compute_write_order(). "delta" and "delta_size" must remain valid
> + * at object writing phase in case the delta is not cached.

True.  I thought child and sibling are only needed during write
order computing, so there may be an optimization opportunity there.

> + * If a delta is cached in memory and is compressed delta_data points

s/compressed delta_data/compressed, delta_data/;

> + * to the data and z_delta_size contains the compressed size. If it's
> + * uncompressed [1], z_delta_size must be zero. delta_size is always
> + * the uncompressed size and must be valid even if the delta is not
> + * cached.
> + *
> + * [1] during try_delta phase we don't bother with compressing because
> + * the delta could be quickly replaced with a better one.
> + */
>  struct object_entry {
>   struct pack_idx_entry idx;
>   unsigned long size; /* uncompressed size */


[PATCH v4 01/11] pack-objects: a bit of document about struct object_entry

2018-03-16 Thread Nguyễn Thái Ngọc Duy
The role of this comment block becomes more important after we shuffle
fields around to shrink this struct. It will be much harder to see what
field is related to what.

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 pack-objects.h | 44 
 1 file changed, 44 insertions(+)

diff --git a/pack-objects.h b/pack-objects.h
index 03f1191659..85345a4af1 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -1,6 +1,50 @@
 #ifndef PACK_OBJECTS_H
 #define PACK_OBJECTS_H
 
+/*
+ * basic object info
+ * -
+ * idx.oid is filled up before delta searching starts. idx.crc32 and
+ * is only valid after the object is written out and will be used for
+ * generating the index. idx.offset will be both gradually set and
+ * used in writing phase (base objects get offset first, then deltas
+ * refer to them)
+ *
+ * "size" is the uncompressed object size. Compressed size is not
+ * cached (ie. raw data in a pack) but available via revindex.
+ *
+ * "hash" contains a path name hash which is used for sorting the
+ * delta list and also during delta searching. Once prepare_pack()
+ * returns it's no longer needed.
+ *
+ * source pack info
+ * 
+ * The (in_pack, in_pack_offset, in_pack_header_size) tuple contains
+ * the location of the object in the source pack, with or without
+ * header.
+ *
+ * "type" and "in_pack_type" both describe object type. in_pack_type
+ * may contain a delta type, while type is always the canonical type.
+ *
+ * deltas
+ * --
+ * Delta links (delta, delta_child and delta_sibling) are created
+ * reflect that delta graph from the source pack then updated or added
+ * during delta searching phase when we find better deltas.
+ *
+ * delta_child and delta_sibling are last needed in
+ * compute_write_order(). "delta" and "delta_size" must remain valid
+ * at object writing phase in case the delta is not cached.
+ *
+ * If a delta is cached in memory and is compressed delta_data points
+ * to the data and z_delta_size contains the compressed size. If it's
+ * uncompressed [1], z_delta_size must be zero. delta_size is always
+ * the uncompressed size and must be valid even if the delta is not
+ * cached.
+ *
+ * [1] during try_delta phase we don't bother with compressing because
+ * the delta could be quickly replaced with a better one.
+ */
 struct object_entry {
struct pack_idx_entry idx;
unsigned long size; /* uncompressed size */
-- 
2.16.2.903.gd04caf5039