[PATCH 09/23] midx: write pack names in chunk

2018-06-07 Thread Derrick Stolee
The multi-pack-index (MIDX) needs to track which pack-files are covered
by the MIDX file. Store these in our first required chunk. Since
filenames are not well structured, add padding to keep good alignment in
later chunks.

Modify the 'git midx read' subcommand to output the existence of the
pack-file name chunk. Modify t5319-midx.sh to reflect this new output
and the new expected number of chunks.

Defense in depth: A pattern we are using in the multi-pack-index feature
is to verify the data as we write it. We want to ensure we never write
invalid data to the multi-pack-index. There are many checks during the
write of a MIDX file that double-check that the values we are writing
fit the format definitions. If any value is incorrect, then we notice
before writing invalid data. This mainly helps developers while working
on the feature, but it can also identify issues that only appear when
dealing with very large data sets. These large sets are hard to encode
into test cases.

Signed-off-by: Derrick Stolee 
---
 Documentation/technical/pack-format.txt |   6 +
 builtin/midx.c  |   7 +
 midx.c  | 176 +++-
 object-store.h  |   2 +
 t/t5319-midx.sh |   3 +-
 5 files changed, 188 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/pack-format.txt 
b/Documentation/technical/pack-format.txt
index 17666b4bfc..2b37be7b33 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -296,6 +296,12 @@ CHUNK LOOKUP:
 
 CHUNK DATA:
 
+   Packfile Names (ID: {'P', 'N', 'A', 'M'})
+   Stores the packfile names as concatenated, null-terminated strings.
+   Packfiles must be listed in lexicographic order for fast lookups by
+   name. This is the only chunk not guaranteed to be a multiple of four
+   bytes in length, so should be the last chunk for alignment reasons.
+
(This section intentionally left incomplete.)
 
 TRAILER:
diff --git a/builtin/midx.c b/builtin/midx.c
index c7002f664a..fe56560853 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -28,6 +28,13 @@ static int read_midx_file(const char *object_dir)
   m->num_chunks,
   m->num_packs);
 
+   printf("chunks:");
+
+   if (m->chunk_pack_names)
+   printf(" pack_names");
+
+   printf("\n");
+
printf("object_dir: %s\n", m->object_dir);
 
return 0;
diff --git a/midx.c b/midx.c
index 9fb89c80a2..d4f4a01a51 100644
--- a/midx.c
+++ b/midx.c
@@ -13,6 +13,11 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
+#define MIDX_MAX_CHUNKS 1
+#define MIDX_CHUNK_ALIGNMENT 4
+#define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
+
 static char *get_midx_filename(const char *object_dir)
 {
struct strbuf midx_name = STRBUF_INIT;
@@ -29,6 +34,7 @@ struct midxed_git *load_midxed_git(const char *object_dir)
size_t midx_size;
void *midx_map;
const char *midx_name = get_midx_filename(object_dir);
+   uint32_t i;
 
fd = git_open(midx_name);
if (fd < 0)
@@ -74,6 +80,31 @@ struct midxed_git *load_midxed_git(const char *object_dir)
m->num_chunks = *(m->data + 6);
m->num_packs = get_be32(m->data + 8);
 
+   for (i = 0; i < m->num_chunks; i++) {
+   uint32_t chunk_id = get_be32(m->data + 12 + 
MIDX_CHUNKLOOKUP_WIDTH * i);
+   uint64_t chunk_offset = get_be64(m->data + 16 + 
MIDX_CHUNKLOOKUP_WIDTH * i);
+
+   switch (chunk_id) {
+   case MIDX_CHUNKID_PACKNAMES:
+   m->chunk_pack_names = m->data + chunk_offset;
+   break;
+
+   case 0:
+   die("terminating MIDX chunk id appears earlier 
than expected");
+   break;
+
+   default:
+   /*
+* Do nothing on unrecognized chunks, allowing 
future
+* extensions to add optional chunks.
+*/
+   break;
+   }
+   }
+
+   if (!m->chunk_pack_names)
+   die("MIDX missing required pack-name chunk");
+
return m;
 
 cleanup_fail:
@@ -99,18 +130,88 @@ static size_t write_midx_header(struct hashfile *f,
return MIDX_HEADER_SIZE;
 }
 
+struct pack_pair {
+   uint32_t pack_int_id;
+   char *pack_name;
+};
+
+static int pack_pair_compare(const void *_a, const void *_b)
+{
+   struct pack_pair *a = (struct pack_pair *)_a;
+   struct pack_pair *b = (struct pack_pair *)_b;
+   return strcmp(a->pack_name, b->pack_name);
+}
+
+static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, 

Re: [PATCH 09/23] midx: write pack names in chunk

2018-06-07 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee  wrote:
> @@ -74,6 +80,31 @@ struct midxed_git *load_midxed_git(const char *object_dir)
> m->num_chunks = *(m->data + 6);
> m->num_packs = get_be32(m->data + 8);
>
> +   for (i = 0; i < m->num_chunks; i++) {
> +   uint32_t chunk_id = get_be32(m->data + 12 + 
> MIDX_CHUNKLOOKUP_WIDTH * i);
> +   uint64_t chunk_offset = get_be64(m->data + 16 + 
> MIDX_CHUNKLOOKUP_WIDTH * i);

Would be good to reduce magic numbers like 12 and 16, I think you have
some header length constants for those already.

> +   switch (chunk_id) {
> +   case MIDX_CHUNKID_PACKNAMES:
> +   m->chunk_pack_names = m->data + chunk_offset;
> +   break;
> +
> +   case 0:
> +   die("terminating MIDX chunk id appears 
> earlier than expected");

_()

> +   break;
> +
> +   default:
> +   /*
> +* Do nothing on unrecognized chunks, 
> allowing future
> +* extensions to add optional chunks.
> +*/

I wrote about the chunk term reminding me of PNG format then deleted
it. But it may help to do similar to PNG here. The first letter can
let us know if the chunk is optional and can be safely ignored. E.g.
uppercase first letter cannot be ignored, lowercase go wild.

> +   break;
> +   }
> +   }
> +
> +   if (!m->chunk_pack_names)
> +   die("MIDX missing required pack-name chunk");

_()

> +
> return m;
>
>  cleanup_fail:
> @@ -99,18 +130,88 @@ static size_t write_midx_header(struct hashfile *f,
> return MIDX_HEADER_SIZE;
>  }
>
> +struct pack_pair {
> +   uint32_t pack_int_id;

can this be just pack_id?

> +   char *pack_name;
> +};
> +
> +static int pack_pair_compare(const void *_a, const void *_b)
> +{
> +   struct pack_pair *a = (struct pack_pair *)_a;
> +   struct pack_pair *b = (struct pack_pair *)_b;
> +   return strcmp(a->pack_name, b->pack_name);
> +}
> +
> +static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, 
> uint32_t *perm)
> +{
> +   uint32_t i;
> +   struct pack_pair *pairs;
> +
> +   ALLOC_ARRAY(pairs, nr_packs);
> +
> +   for (i = 0; i < nr_packs; i++) {
> +   pairs[i].pack_int_id = i;
> +   pairs[i].pack_name = pack_names[i];
> +   }
> +
> +   QSORT(pairs, nr_packs, pack_pair_compare);
> +
> +   for (i = 0; i < nr_packs; i++) {
> +   pack_names[i] = pairs[i].pack_name;
> +   perm[pairs[i].pack_int_id] = i;
> +   }

pairs[] is leaked?

> +}
> +
> +static size_t write_midx_pack_names(struct hashfile *f,
> +   char **pack_names,
> +   uint32_t num_packs)
> +{
> +   uint32_t i;
> +   unsigned char padding[MIDX_CHUNK_ALIGNMENT];
> +   size_t written = 0;
> +
> +   for (i = 0; i < num_packs; i++) {
> +   size_t writelen = strlen(pack_names[i]) + 1;
> +
> +   if (i && strcmp(pack_names[i], pack_names[i - 1]) <= 0)
> +   BUG("incorrect pack-file order: %s before %s",
> +   pack_names[i - 1],
> +   pack_names[i]);
> +
> +   hashwrite(f, pack_names[i], writelen);
> +   written += writelen;

side note. This pattern happens a lot. It may be a good idea to make
hashwrite() return writelen so we can just write

written += hashwrite(f, ..., writelen);

> +   }
> +
> +   /* add padding to be aligned */
> +   i = MIDX_CHUNK_ALIGNMENT - (written % MIDX_CHUNK_ALIGNMENT);
> +   if (i < MIDX_CHUNK_ALIGNMENT) {
> +   bzero(padding, sizeof(padding));
> +   hashwrite(f, padding, i);
> +   written += i;
> +   }
> +
> +   return written;
> +}
> +
>  int write_midx_file(const char *object_dir)
>  {
> -   unsigned char num_chunks = 0;
> +   unsigned char cur_chunk, num_chunks = 0;
> char *midx_name;
> struct hashfile *f;
> struct lock_file lk;
> struct packed_git **packs = NULL;
> +   char **pack_names = NULL;
> +   uint32_t *pack_perm;
> uint32_t i, nr_packs = 0, alloc_packs = 0;
> +   uint32_t alloc_pack_names = 0;
> DIR *dir;
> struct dirent *de;
> struct strbuf pack_dir = STRBUF_INIT;
> size_t pack_dir_len;
> +   uint64_t pack_name_concat_len = 0;
> +   uint64_t written = 0;
> +   uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
> +   uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];

This long list of local vars may be a good indicator that this
function needs split up into smaller ones.

>
> midx_name = get_midx_filename(object

Re: [PATCH 09/23] midx: write pack names in chunk

2018-06-21 Thread Derrick Stolee

On 6/7/2018 2:26 PM, Duy Nguyen wrote:

On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee  wrote:

@@ -74,6 +80,31 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 m->num_chunks = *(m->data + 6);
 m->num_packs = get_be32(m->data + 8);

+   for (i = 0; i < m->num_chunks; i++) {
+   uint32_t chunk_id = get_be32(m->data + 12 + 
MIDX_CHUNKLOOKUP_WIDTH * i);
+   uint64_t chunk_offset = get_be64(m->data + 16 + 
MIDX_CHUNKLOOKUP_WIDTH * i);

Would be good to reduce magic numbers like 12 and 16, I think you have
some header length constants for those already.


+   switch (chunk_id) {
+   case MIDX_CHUNKID_PACKNAMES:
+   m->chunk_pack_names = m->data + chunk_offset;
+   break;
+
+   case 0:
+   die("terminating MIDX chunk id appears earlier than 
expected");

_()


This die() and others like it are not marked for translation on purpose, 
as they should never be seen by an end-user.





+   break;
+
+   default:
+   /*
+* Do nothing on unrecognized chunks, allowing 
future
+* extensions to add optional chunks.
+*/

I wrote about the chunk term reminding me of PNG format then deleted
it. But it may help to do similar to PNG here. The first letter can
let us know if the chunk is optional and can be safely ignored. E.g.
uppercase first letter cannot be ignored, lowercase go wild.


That's an interesting way to think about it. That way you could add a 
new "required" chunk and earlier versions could die() realizing they 
don't know how to parse that required chunk.


I think for this format, we should update the file version value when a 
required chunk is needed.





+   break;
+   }
+   }
+
+   if (!m->chunk_pack_names)
+   die("MIDX missing required pack-name chunk");

_()


+
 return m;

  cleanup_fail:
@@ -99,18 +130,88 @@ static size_t write_midx_header(struct hashfile *f,
 return MIDX_HEADER_SIZE;
  }

+struct pack_pair {
+   uint32_t pack_int_id;

can this be just pack_id?


Since packfiles are usually named pack-{hash}.pack, I chose to be 
specific here.





+   char *pack_name;
+};
+
+static int pack_pair_compare(const void *_a, const void *_b)
+{
+   struct pack_pair *a = (struct pack_pair *)_a;
+   struct pack_pair *b = (struct pack_pair *)_b;
+   return strcmp(a->pack_name, b->pack_name);
+}
+
+static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t 
*perm)
+{
+   uint32_t i;
+   struct pack_pair *pairs;
+
+   ALLOC_ARRAY(pairs, nr_packs);
+
+   for (i = 0; i < nr_packs; i++) {
+   pairs[i].pack_int_id = i;
+   pairs[i].pack_name = pack_names[i];
+   }
+
+   QSORT(pairs, nr_packs, pack_pair_compare);
+
+   for (i = 0; i < nr_packs; i++) {
+   pack_names[i] = pairs[i].pack_name;
+   perm[pairs[i].pack_int_id] = i;
+   }

pairs[] is leaked?


Good catch!




+}
+
+static size_t write_midx_pack_names(struct hashfile *f,
+   char **pack_names,
+   uint32_t num_packs)
+{
+   uint32_t i;
+   unsigned char padding[MIDX_CHUNK_ALIGNMENT];
+   size_t written = 0;
+
+   for (i = 0; i < num_packs; i++) {
+   size_t writelen = strlen(pack_names[i]) + 1;
+
+   if (i && strcmp(pack_names[i], pack_names[i - 1]) <= 0)
+   BUG("incorrect pack-file order: %s before %s",
+   pack_names[i - 1],
+   pack_names[i]);
+
+   hashwrite(f, pack_names[i], writelen);
+   written += writelen;

side note. This pattern happens a lot. It may be a good idea to make
hashwrite() return writelen so we can just write

written += hashwrite(f, ..., writelen);


If I change the prototype of hashwrite(), will other callers get 
warnings about not inspecting the return value (for some build options 
on some platforms)?





+   }
+
+   /* add padding to be aligned */
+   i = MIDX_CHUNK_ALIGNMENT - (written % MIDX_CHUNK_ALIGNMENT);
+   if (i < MIDX_CHUNK_ALIGNMENT) {
+   bzero(padding, sizeof(padding));
+   hashwrite(f, padding, i);
+   written += i;
+   }
+
+   return written;
+}
+
  int write_midx_file(const char *object_dir)
  {
-   unsigned char num_chunks = 0;
+   unsigned char cur_chunk, num_chunks = 0;
 char *midx_name;
 struct hashfile *f;
 struct lock_file lk;
 struct packed_git **packs = NULL;
+   char **pack_names = NULL;
+   uint32_t *pack_perm;
 uint32_t i, nr_packs = 0, alloc_

Re: [PATCH 09/23] midx: write pack names in chunk

2018-06-21 Thread Junio C Hamano
Derrick Stolee  writes:

> On 6/7/2018 2:26 PM, Duy Nguyen wrote:
>> On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee  wrote:
>>> @@ -74,6 +80,31 @@ struct midxed_git *load_midxed_git(const char 
>>> *object_dir)
>>>  m->num_chunks = *(m->data + 6);
>>>  m->num_packs = get_be32(m->data + 8);
>>>
>>> +   for (i = 0; i < m->num_chunks; i++) {
>>> +   uint32_t chunk_id = get_be32(m->data + 12 + 
>>> MIDX_CHUNKLOOKUP_WIDTH * i);
>>> +   uint64_t chunk_offset = get_be64(m->data + 16 + 
>>> MIDX_CHUNKLOOKUP_WIDTH * i);
>> Would be good to reduce magic numbers like 12 and 16, I think you have
>> some header length constants for those already.
>>
>>> +   switch (chunk_id) {
>>> +   case MIDX_CHUNKID_PACKNAMES:
>>> +   m->chunk_pack_names = m->data + 
>>> chunk_offset;
>>> +   break;

(style: aren't these case arms indented one level too deep)?

>>> +   case 0:
>>> +   die("terminating MIDX chunk id appears 
>>> earlier than expected");
>> _()
>
> This die() and others like it are not marked for translation on
> purpose, as they should never be seen by an end-user.

Should never be seen because it indicates a software bug, in which
case this should be BUG() instead of die()?

Or did we just find a file corruption on the filesystem?  If so,
then the error is end-user facing and should tell the user something
that hints what is going on in the language the user understands, I
would guess.

>>> +   default:
>>> +   /*
>>> +* Do nothing on unrecognized chunks, 
>>> allowing future
>>> +* extensions to add optional chunks.
>>> +*/
>> I wrote about the chunk term reminding me of PNG format then deleted
>> it. But it may help to do similar to PNG here. The first letter can
>> let us know if the chunk is optional and can be safely ignored. E.g.
>> uppercase first letter cannot be ignored, lowercase go wild.
>
> That's an interesting way to think about it. That way you could add a
> new "required" chunk and earlier versions could die() realizing they
> don't know how to parse that required chunk.

That is how the index extension sections work and may be a good
example to follow.


Re: [PATCH 09/23] midx: write pack names in chunk

2018-06-22 Thread Derrick Stolee

On 6/21/2018 1:38 PM, Junio C Hamano wrote:

Derrick Stolee  writes:


On 6/7/2018 2:26 PM, Duy Nguyen wrote:

On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee  wrote:

@@ -74,6 +80,31 @@ struct midxed_git *load_midxed_git(const char *object_dir)
  m->num_chunks = *(m->data + 6);
  m->num_packs = get_be32(m->data + 8);

+   for (i = 0; i < m->num_chunks; i++) {
+   uint32_t chunk_id = get_be32(m->data + 12 + 
MIDX_CHUNKLOOKUP_WIDTH * i);
+   uint64_t chunk_offset = get_be64(m->data + 16 + 
MIDX_CHUNKLOOKUP_WIDTH * i);

Would be good to reduce magic numbers like 12 and 16, I think you have
some header length constants for those already.


+   switch (chunk_id) {
+   case MIDX_CHUNKID_PACKNAMES:
+   m->chunk_pack_names = m->data + chunk_offset;
+   break;

(style: aren't these case arms indented one level too deep)?


+   case 0:
+   die("terminating MIDX chunk id appears earlier than 
expected");

_()

This die() and others like it are not marked for translation on
purpose, as they should never be seen by an end-user.

Should never be seen because it indicates a software bug, in which
case this should be BUG() instead of die()?

Or did we just find a file corruption on the filesystem?  If so,
then the error is end-user facing and should tell the user something
that hints what is going on in the language the user understands, I
would guess.


+   default:
+   /*
+* Do nothing on unrecognized chunks, allowing 
future
+* extensions to add optional chunks.
+*/

I wrote about the chunk term reminding me of PNG format then deleted
it. But it may help to do similar to PNG here. The first letter can
let us know if the chunk is optional and can be safely ignored. E.g.
uppercase first letter cannot be ignored, lowercase go wild.

That's an interesting way to think about it. That way you could add a
new "required" chunk and earlier versions could die() realizing they
don't know how to parse that required chunk.

That is how the index extension sections work and may be a good
example to follow.


The index extension documentation doesn't appear to be clear about which 
extensions are optional or required, but it seems the split-index is the 
only "required" one and uses lowercase for its extension id.


Since the multi-pack-index has similar structure to the commit-graph 
file, and that file includes an optional chunk with no special casing of 
the chunk id, I think we should stick with the existing model: chunks 
that are added later are optional and if Git _must_ understand it, then 
we increment the version number. Hence, for each version number there is 
a fixed list of required chunks, but an extendible list of optional chunks.


Thanks,
-Stolee


Re: [PATCH 09/23] midx: write pack names in chunk

2018-06-22 Thread Junio C Hamano
Derrick Stolee  writes:

> The index extension documentation doesn't appear to be clear about
> which extensions are optional or required, but it seems the
> split-index is the only "required" one and uses lowercase for its
> extension id.

read-cache.c::

/* Index extensions.
 *
 * The first letter should be 'A'..'Z' for extensions that are not
 * necessary for a correct operation (i.e. optimization data).
 * When new extensions are added that _needs_ to be understood in
 * order to correctly interpret the index file, pick character that
 * is outside the range, to cause the reader to abort.
 */



Re: [PATCH 09/23] midx: write pack names in chunk

2018-06-22 Thread Derrick Stolee

On 6/22/2018 2:31 PM, Junio C Hamano wrote:

Derrick Stolee  writes:


The index extension documentation doesn't appear to be clear about
which extensions are optional or required, but it seems the
split-index is the only "required" one and uses lowercase for its
extension id.

read-cache.c::

 /* Index extensions.
  *
  * The first letter should be 'A'..'Z' for extensions that are not
  * necessary for a correct operation (i.e. optimization data).
  * When new extensions are added that _needs_ to be understood in
  * order to correctly interpret the index file, pick character that
  * is outside the range, to cause the reader to abort.
  */


Thanks! I was reading Documentation/technical/index-format.txt and 
optional extensions are mentioned but not described precisely.