Re: [PATCH v3 5/6] split-index: don't compare stat data of entries already marked for split index

2018-09-29 Thread SZEDER Gábor
On Sat, Sep 29, 2018 at 11:14:29AM +0200, SZEDER Gábor wrote:
> > > Note that comparing the cached data in copied and original entries in
> > 
> > s/cached data/cached stat data/ ? I was confused for a bit.
> 
> No, it's indeed cached data, but now that you mention it, the subject
> line does need a s/stat //.

Or rather s/stat/cached/



Re: [PATCH v3 5/6] split-index: don't compare stat data of entries already marked for split index

2018-09-29 Thread SZEDER Gábor
On Sat, Sep 29, 2018 at 07:36:08AM +0200, Duy Nguyen wrote:
> On Fri, Sep 28, 2018 at 06:24:58PM +0200, SZEDER Gábor wrote:
> > When unpack_trees() constructs a new index, it copies cache entries
> > from the original index [1].  prepare_to_write_split_index() has to
> > deal with this, and it has a dedicated code path for copied entries
> > that are present in the shared index, where it compares the cached
> > data in the corresponding copied and original entries.  If the cached
> > data matches, then they are considered the same; if it differs, then
> > the copied entry will be marked for inclusion as a replacement entry
> > in the just about to be written split index by setting the
> > CE_UPDATE_IN_BASE flag.
> > 
> > However, a cache entry already has its CE_UPDATE_IN_BASE flag set upon
> > reading the split index, if the entry already has a replacement entry
> > there, or upon refreshing the cached stat data, if the corresponding
> > file was modified.  The state of this flag is then preserved when
> > unpack_trees() copies a cache entry from the shared index.
> > 
> > So modify prepare_to_write_split_index() to check the copied cache
> > entries' CE_UPDATE_IN_BASE flag first, and skip the thorough
> > comparison of cached data if the flag is already set.
> 
> OK so this is an optimization, not a bug fix. Right?

Well, a microoptimization at most: with all what's going on in
unpack_trees() I seriously doubt that it's effect is measurable.

> > Note that comparing the cached data in copied and original entries in
> 
> s/cached data/cached stat data/ ? I was confused for a bit.

No, it's indeed cached data, but now that you mention it, the subject
line does need a s/stat //.

The comparison is done with this call:

  ret = memcmp(&ce->ce_stat_data, &base->ce_stat_data,
   offsetof(struct cache_entry, name) -
   offsetof(struct cache_entry, ce_stat_data));

i.e. it starts at the stat data and ends just before the cache entry's
name, and 'struct cache_entry' has several other fields between these
two, including e.g. the cached oid:

  struct cache_entry {
  struct hashmap_entry ent;
  struct stat_data ce_stat_data;
  unsigned int ce_mode;
  unsigned int ce_flags;
  unsigned int mem_pool_allocated;
  unsigned int ce_namelen;
  unsigned int index; /* for link extension */
  struct object_id oid;
  char name[FLEX_ARRAY]; /* more */
  };

However, to me it's mostly about clarity of the code, and about
documenting that CE_UPDATE_IN_BASE might have already been set at that
point and why, so the next dev diving in to debug the split index
doesn't have to figure this out himself.

> > the shared index might actually be entirely unnecessary.  In theory
> > all code paths refreshing the cached stat data of an entry in the
> > shared index should set the CE_UPDATE_IN_BASE flag in that entry, and
> > unpack_trees() should preserve this flag when copying cache entries.
> > This means that the cached data is only ever changed if the
> > CE_UPDATE_IN_BASE flag is set as well.  Our test suite seems to
> > confirm this: instrumenting the conditions in question and running the
> > test suite repeatedly with 'GIT_TEST_SPLIT_INDEX=yes' showed that the
> > cached data in a copied entry differs from the data in the shared
> > entry only if its CE_UPDATE_IN_BASE flag is indeed set.
> 
> Yes I was probably just being paranoid (or sticking to simpler
> checks). I was told that split index is computation expensive and not
> doing unnecesary/expensive checks may help. But let's leave it for
> later.
> 
> > +   } else {
> > +   /*
> > +* Thoroughly compare the cached data to see
> > +* whether it should be marked for inclusion
> > +* in the split index.
> > +*
> > +* This comparison might be unnecessary, as
> > +* code paths modifying the cached data do
> > +* set CE_UPDATE_IN_BASE as well.
> > +*/
> > +   const unsigned int ondisk_flags =
> > +   CE_STAGEMASK | CE_VALID |
> > +   CE_EXTENDED_FLAGS;
> > +   unsigned int ce_flags, base_flags, ret;
> > +   ce_flags = ce->ce_flags;
> > +   base_flags = base->ce_flags;
> > +   /* only on-disk flags matter */
> > +   ce->ce_flags   &= ondisk_flags;
> > +   base->ce_flags &= ondisk_flags;
> > +   ret = memcmp(&ce->ce_stat_data, 
> > &base->ce_stat_data,
> > +offsetof(struct cache_entry, name) 
> > -
> > + 

Re: [PATCH v3 5/6] split-index: don't compare stat data of entries already marked for split index

2018-09-28 Thread Duy Nguyen
On Fri, Sep 28, 2018 at 06:24:58PM +0200, SZEDER Gábor wrote:
> When unpack_trees() constructs a new index, it copies cache entries
> from the original index [1].  prepare_to_write_split_index() has to
> deal with this, and it has a dedicated code path for copied entries
> that are present in the shared index, where it compares the cached
> data in the corresponding copied and original entries.  If the cached
> data matches, then they are considered the same; if it differs, then
> the copied entry will be marked for inclusion as a replacement entry
> in the just about to be written split index by setting the
> CE_UPDATE_IN_BASE flag.
> 
> However, a cache entry already has its CE_UPDATE_IN_BASE flag set upon
> reading the split index, if the entry already has a replacement entry
> there, or upon refreshing the cached stat data, if the corresponding
> file was modified.  The state of this flag is then preserved when
> unpack_trees() copies a cache entry from the shared index.
> 
> So modify prepare_to_write_split_index() to check the copied cache
> entries' CE_UPDATE_IN_BASE flag first, and skip the thorough
> comparison of cached data if the flag is already set.

OK so this is an optimization, not a bug fix. Right?

> Note that comparing the cached data in copied and original entries in

s/cached data/cached stat data/ ? I was confused for a bit.

> the shared index might actually be entirely unnecessary.  In theory
> all code paths refreshing the cached stat data of an entry in the
> shared index should set the CE_UPDATE_IN_BASE flag in that entry, and
> unpack_trees() should preserve this flag when copying cache entries.
> This means that the cached data is only ever changed if the
> CE_UPDATE_IN_BASE flag is set as well.  Our test suite seems to
> confirm this: instrumenting the conditions in question and running the
> test suite repeatedly with 'GIT_TEST_SPLIT_INDEX=yes' showed that the
> cached data in a copied entry differs from the data in the shared
> entry only if its CE_UPDATE_IN_BASE flag is indeed set.

Yes I was probably just being paranoid (or sticking to simpler
checks). I was told that split index is computation expensive and not
doing unnecesary/expensive checks may help. But let's leave it for
later.

> + } else {
> + /*
> +  * Thoroughly compare the cached data to see
> +  * whether it should be marked for inclusion
> +  * in the split index.
> +  *
> +  * This comparison might be unnecessary, as
> +  * code paths modifying the cached data do
> +  * set CE_UPDATE_IN_BASE as well.
> +  */
> + const unsigned int ondisk_flags =
> + CE_STAGEMASK | CE_VALID |
> + CE_EXTENDED_FLAGS;
> + unsigned int ce_flags, base_flags, ret;
> + ce_flags = ce->ce_flags;
> + base_flags = base->ce_flags;
> + /* only on-disk flags matter */
> + ce->ce_flags   &= ondisk_flags;
> + base->ce_flags &= ondisk_flags;
> + ret = memcmp(&ce->ce_stat_data, 
> &base->ce_stat_data,
> +  offsetof(struct cache_entry, name) 
> -
> +  offsetof(struct cache_entry, 
> ce_stat_data));
> + ce->ce_flags = ce_flags;
> + base->ce_flags = base_flags;

Maybe make this block a separate function (compare_ce_content or
something). The amount of indentation is getting too high.

> + if (ret)
> + ce->ce_flags |= CE_UPDATE_IN_BASE;
> + }
>   discard_cache_entry(base);
>   si->base->cache[ce->index - 1] = ce;
>   }
> -- 
> 2.19.0.361.gafc87ffe72
>