On 2/12/2018 5:20 AM, Duy Nguyen wrote:
> On Wed, Feb 7, 2018 at 9:13 PM, Ben Peart <peart...@gmail.com> wrote:

>> On 2/6/2018 7:27 AM, Duy Nguyen wrote:

>>> This is another thing that bugs me. I know you're talking about huge
>>> index files, but at what size should we start this sort of
>>> optimization? Writing down a few MBs on Linux is cheap enough that I
>>> won't bother optimizing (and I get my UNTR extension repaired all the
>>> time, so reduced lstat calls and stuff). This "typically" only kicks
>>> in at a certain size. What size?

>> It's important to identify what we're trading off here.  With the proposed
>> optimization, we'll end up doing fewer writes of the index but potentially
>> more lstat calls.  Even with a small index, writing the index is much more
>> expensive than calling lstat on a file.  Exactly how much more expensive
>> depends on a lot of variables, but even with an SSD it's still an
>> orders-of-magnitude difference.
> Keep in mind it's not just about lstat() calls. Processing ignore
> patterns also takes a big chunk of CPU, especially when you have more
> than a couple ignore rules.

Yes, I'm very familiar with the cost of the exclude list pattern matching code.  I've rewritten it to use a hashmap (for those patterns where that is possible), which dramatically speeds that aspect up, since we tend to abuse it pretty heavily (~60K entries on average).
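
To give a rough idea of what I mean, here is a self-contained sketch of the approach.  It is not our internal change and not git's actual dir.c code, it ignores most of the real .gitignore semantics (basename matching, negation, directory rules), and add_pattern()/is_excluded() are made-up names.  The win comes from putting the wildcard-free patterns into a hash table so that only the patterns which actually contain glob characters still have to go through fnmatch():

#include <fnmatch.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024

struct literal {
    struct literal *next;
    char *pattern;
};

static struct literal *buckets[NBUCKETS];
static const char **wildcards;  /* patterns that still need fnmatch() */
static size_t nr_wildcards;

/* djb2-style string hash, reduced to a bucket index */
static unsigned hash_str(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

static void add_pattern(const char *pat)
{
    if (strpbrk(pat, "*?[\\")) {
        /* has glob characters: keep it on the slow fnmatch() path */
        wildcards = realloc(wildcards,
                            (nr_wildcards + 1) * sizeof(*wildcards));
        wildcards[nr_wildcards++] = pat;
    } else {
        /* exact literal: O(1) membership test later */
        struct literal *l = malloc(sizeof(*l));
        unsigned h = hash_str(pat);
        l->pattern = strdup(pat);
        l->next = buckets[h];
        buckets[h] = l;
    }
}

static int is_excluded(const char *path)
{
    struct literal *l;
    size_t i;

    /* literal patterns: a single hash lookup instead of a linear scan */
    for (l = buckets[hash_str(path)]; l; l = l->next)
        if (!strcmp(l->pattern, path))
            return 1;
    /* wildcard patterns: still a linear scan, but a much shorter one */
    for (i = 0; i < nr_wildcards; i++)
        if (!fnmatch(wildcards[i], path, 0))
            return 1;
    return 0;
}

int main(void)
{
    add_pattern("obj/foo.o");   /* literal: goes into the hash table */
    add_pattern("*.tmp");       /* wildcard: stays on the fnmatch() list */
    printf("%d %d %d\n", is_excluded("obj/foo.o"),
           is_excluded("a.tmp"), is_excluded("src/main.c"));
    return 0;
}

With tens of thousands of mostly-literal patterns, that turns the common exclude check into one hash lookup rather than walking the whole list.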


> I'm not convinced that writing index files is that expensive for small
> files. I don't know about Windows; on Linux it seems fast to me.
> Actually, for small-scale repos, performance probably does not differ
> much either way.

I agree.  For small-scale repos, none of this is significant enough to matter.  Untracked caching should help most as your working directory starts to get large.

>> That means we could potentially lstat hundreds or thousands of files and
>> still come out ahead.  Given that the untracked cache works at the directory
>> level, the lstat cost actually scales with the number of directories that
>> have had changes (i.e. changing a 2nd file in the same directory doesn't add
>> any additional cost).  In "typical" workflows, developers don't often change
>> hundreds of files across different directories, so we'd "typically" still
>> come out ahead.
> I agree. But we must deal with the bad case when someone
> "accidentally" does that. We should not wait until they yell "it's
> too slow now" and then tell them to disable/re-enable the untracked
> cache.

> Another case that can touch a lot of directories over time is
> switching trees (e.g. "git checkout master" then "checkout next", or
> worse, "checkout v1.0").

Your example above makes me wonder if you understand what my patch is doing.  If the index is flagged as dirty for _any_ reason, the entire index is written to disk with the latest information, including the updated untracked cache and all other extensions.  So in your checkout examples above, the index will still get written to disk with the updated untracked cache extension; there would be zero change in behavior or performance.  The _only_ time the index is not written to disk after my patch is when there were no other changes to the index.  In my experience, that only happens for status calls.

The only way to see any degradation in the untracked cache is if a user edits a bunch of files across multiple directories and then calls status repeatedly.  As soon as they do an add, checkout, rm, rebase, etc. (i.e. most git commands other than status), the index gets flagged as dirty and the latest untracked cache extension is written to disk.
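
To make that decision concrete, here is a minimal sketch of the idea only.  These are simplified stand-ins: index_state, UNTR_CACHE_CHANGED, etc. are not the real fields and flags from cache.h, and this is not the actual patch.

#include <stdio.h>

#define UNTR_CACHE_CHANGED  (1 << 0)   /* only the untracked cache moved */
#define OTHER_CHANGES       (~UNTR_CACHE_CHANGED)

struct index_state {
    unsigned changed;   /* bitmask of what is dirty */
};

/* Does this command need to rewrite the on-disk index? */
static int should_write_index(const struct index_state *istate)
{
    if (!istate->changed)
        return 0;   /* nothing is dirty at all */
    if (istate->changed & OTHER_CHANGES)
        /* entries or other extensions changed: write everything,
           including the latest untracked cache extension */
        return 1;
    /*
     * Only the untracked cache was refreshed (the 'status' case):
     * skip the write and pay a few extra lstat()/exclude checks on
     * the next run instead of a full index write now.
     */
    return 0;
}

int main(void)
{
    struct index_state status_only = { UNTR_CACHE_CHANGED };
    struct index_state after_add = { UNTR_CACHE_CHANGED | (1 << 1) };

    printf("status only: write=%d\n", should_write_index(&status_only));
    printf("after add:   write=%d\n", should_write_index(&after_add));
    return 0;
}

In other words, 'status' by itself leaves the on-disk index untouched, while anything that dirties the index for another reason still writes everything, including the freshest untracked cache extension.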

>> We have internal performance tests based on common developer sequences of
>> commands (status, edit a file, status, add, status, commit, log, push, etc.)
>> that show that having the untracked cache turned on actually makes these
>> sequences slower.  At the time, we didn't have time to investigate why, so we
>> just turned off the untracked cache.

>> When writing the fsmonitor patch series which can utilize the untracked
>> cache, I did some investigation into why the untracked cache was slowing
>> things down in these tests and discovered the cause was the additional index
>> writes.  That is what led to this patch.

>> I've been sitting on it for months now because I didn't have the time to
>> write some performance tests for the git perf framework to demonstrate the
>> problem and how this helps solve it.  With the discussion about the
>> fsmonitor extension, I thought this might be a good time to send it out
>> there.
> Hopefully you have time to get some of those numbers :) A patch is
> more convincing when it's backed by numbers. And I'm still not
> convinced that never letting the untracked cache be written to the
> index again is the right thing to do [1].

I agree that any performance patch should be accompanied by a performance test to demonstrate that it actually performs as promised.  I looked for, but didn't find, a performance test for the untracked cache, so I will have to create one from scratch.  To make it as realistic as possible, it needs to simulate typical developer workflows: create/edit files across multiple directories, then execute various common commands (status, add, status, add, status, rm, status, commit, log, rebase, etc.) and time that whole sequence of commands.  All doable, but that kind of scripted workflow isn't supported well in the performance framework yet, so it will take some time (much more than the actual patch :)).

>> If you have the cycles, time a typical series of commands with and without
>> the untracked cache (i.e. don't just call status over and over in a loop) and
>> you should see the same results.  Given my limited time right now, I'm OK
>> putting this on the back burner again until a git perf test can be written
>> to ensure it actually speeds things up as advertised.
> [1] Another case where we must _sometimes_ let the untracked cache be
> written down is to correct its data. My last invalidation bug, for
> example, could leave the untracked cache in a bad state; when you run
> with a new git, it should be able to correct the data and write it
> down. But if you don't allow writing it down, the new 'git' will keep
> seeing the old errors and keep complaining.
