Re: another packed-refs race

2013-05-07 Thread Michael Haggerty
On 05/07/2013 06:44 AM, Jeff King wrote:
 On Tue, May 07, 2013 at 06:32:12AM +0200, Michael Haggerty wrote:
 
 Another potential problem caused by the non-atomicity of loose reference
 reading could be to confuse reachability tests if process 1 is reading
 loose references while process 2 is renaming a reference:

 1. Process 1 looks for refs/heads/a and finds it missing.

 2. Process 2 renames "z" -> "a"

 3. Process 1 looks for refs/heads/z and finds it missing.

 Process 1 would think that any objects pointed to by "a" (formerly
 "z") are unreachable.  This would be unfortunate if it is doing an
 upload-pack and very bad if it is doing a gc.  I wonder if this could be
 a problem in practice?  (Gee, wouldn't it be nice to keep reflogs for
 deleted refs? :-) )
 
 Ugh. Yeah, that is definitely a possible race, and it could be quite
 disastrous for prune.
 
 I am really starting to think that we need a pluggable refs
 architecture, and that busy servers can use something like sqlite to
 keep the ref storage. That would require bumping repositoryformatversion,
 of course, but it would be OK for a server accessible only by git
 protocols.

That would be a fun project.  I like the idea of not burdening people's
local mostly-one-user-at-a-time repositories with code that is hardened
against server-level pounding.

 I also wonder if we can spend extra time to get more reliable results
 for prune, like checking refs, coming up with a prune list, and then
 checking again. I have a feeling it's a 2-generals sort of problem where
 we can always miss a ref that keeps bouncing around, but we could bound
 the probability. I haven't thought that hard about it. Perhaps this will
 give us something to talk about on Thursday. :)

It's not 100% solvable without big changes; there could always be a
malign Dijkstra running around your system, renaming references right
before you read them.  But I guess it would be pretty safe if pack would
keep the union of objects reachable from the references read at the
beginning of the run and objects reachable from the references read at
(aka near) the end of the run.
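In plumbing terms, it might look something like this (untested, file
names invented; real pruning would of course happen inside git itself):

  # snapshot the ref tips before and after computing the candidates,
  # and keep everything reachable from *either* snapshot
  git for-each-ref --format='%(objectname)' >tips.before &&
  # ... compute the prune candidates here; refs may move meanwhile ...
  git for-each-ref --format='%(objectname)' >tips.after &&
  sort -u tips.before tips.after |
  git rev-list --objects --stdin |
  cut -d' ' -f1 | sort >keep.list
  # prune only objects that do not appear in keep.list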

 * Preloading the whole tree of loose references before starting an
 iteration.  As I recall, this was a performance *win*.  It was on my
 to-do list of things to pursue when I have some free time (ha, ha).  I
 mostly wanted to check first that there are not callers who abort the
 iteration soon after starting it.  For example, imagine a caller who
 tries to determine "are there any tags at all" by iterating over
 "refs/tags" with a callback that just returns 1; such a caller would
 suffer the cost of reading all of the loose references in "refs/tags".
 
 Well, you can measure my patches, because that's what they do. :) I
 didn't really consider an early termination from the iteration.
 Certainly something like:
 
   git for-each-ref refs/tags | head -1
 
 would take longer. Though if you have so many refs that the latency is
 a big problem, you should probably consider packing them (it can't
 possibly bite you with a race condition, right?).

No, I don't see a correctness issue.

Michael

-- 
Michael Haggerty
mhag...@alum.mit.edu
http://softwareswirl.blogspot.com/


Re: another packed-refs race

2013-05-06 Thread Michael Haggerty
On 05/03/2013 10:38 AM, Jeff King wrote:
 I found another race related to the packed-refs code. Consider for a
 moment what happens when we are looking at refs and another process does
 a simultaneous git pack-refs --all --prune, updating packed-refs and
 deleting the loose refs.
 
 If we are resolving a single ref, then we will either find its loose
 form or not. If we do, we're done. If not, we can fall back on what was
 in the packed-refs file. If we read the packed-refs file at that point,
 we're OK. If the loose ref existed before but was pruned before we could
 read it, then we know the packed-refs file is already in place, because
 pack-refs would not have deleted the loose ref until it had finished
 writing the new file. But imagine that we already loaded the packed-refs
 file into memory earlier. We may be looking at an _old_ version of it
 that has an irrelevant sha1 from the older packed-refs file. Or it might
 not even exist in the packed-refs file at all, and we think the ref does
 not resolve.

 We could fix this by making sure our packed-refs file is up to date

s/file/cache/

 before using it. E.g., resolving a ref with the following sequence:
 
   1. Look for loose ref. If found, OK.
 
   2. Compare inode/size/mtime/etc of on-disk packed-refs to their values
  from the last time we loaded it. If they're not the same, reload
  packed-refs from disk. Otherwise, continue.
 
   3. Look for ref in in-memory packed-refs.
 
 Not too bad. We introduce one extra stat() for a ref that has been
 packed, and the scheme isn't very complicated.

Let me think out loud alongside your analysis...

By this mechanism the reader can ensure that it never uses a version of
the packed-refs file that is older than its information that the
corresponding loose ref is absent from the filesystem.
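The stat-clean check in step 2 is easy to picture with stat(1) (GNU
coreutils shown here; the C code would instead compare fields of a saved
struct stat):

  # sketch: remember inode/size/mtime when packed-refs is loaded...
  cached=$(stat -c '%i %s %Y' .git/packed-refs 2>/dev/null)
  # ...and compare again before trusting the in-memory copy
  current=$(stat -c '%i %s %Y' .git/packed-refs 2>/dev/null)
  test "$cached" = "$current" || echo 'stale: reload packed-refs'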

This is all assuming that the filesystem accesses have a defined order;
how is that guaranteed?  pack_refs() and commit_ref() both rely on
commit_lock_file(), which calls

close(fd) on the lockfile
rename(lk->filename, result_file)

prune_ref() locks the ref, verifies that its SHA-1 is unchanged, then
calls unlink(), then rollback_lock_file().

The atomicity of rename() guarantees that a reader sees either the old
or the new version of the file in question.  But what guarantees are
there about accesses across two files?  Suppose we start with ref "foo"
that exists only as a loose ref, and we have a pack-refs process doing

write packed-refs with "foo"
commit_lock_file() for packed-refs
read loose ref "foo" and verify that its SHA-1 is unchanged
unlink() loose ref "foo"

while another process is trying to read the reference:

look for loose ref "foo"
read packed-refs

Is there any guarantee that the second process can't see the loose ref
"foo" as being missing but nevertheless read the old version of
packed-refs?  I'm not strong enough on filesystem semantics to answer
that question.

 But what about enumerating refs via for_each_ref? It's possible to have
 the same problem there, and the packed-refs file may be moved into place
 midway through the process of enumerating the loose refs. So we may see
 refs/heads/master, but when we get to refs/remotes/origin/master, it has
 now been packed and pruned.

Yes.

 I _think_ we can get by with:
 
   1. Generate the complete sorted list of loose refs.
 
   2. Check that packed-refs is stat-clean, and reload if necessary, as
  above.
 
   3. Merge the sorted loose and packed lists, letting loose override
  packed (so even if we get repacked halfway through our loose
  traversal and get half of the refs there, it's OK to see duplicates
  in packed-refs, which we will ignore).
 
 This is not very far off of what we do now. Obviously we don't do the
 stat-clean check in step 2. But we also don't generate the complete list
 of loose refs before hitting the packed-refs file. Instead we lazily
 load the loose directories as needed. And of course we cache that
 information in memory, even though it may go out of date. I think the
 best we could do is keep a stat() for each individual directory we see,
 and check it before using the in-memory contents. That may be a lot of
 stats, but it's still better than actually opening each loose ref
 separately.
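The merge in step 3 is just first-one-wins by refname once both lists
are sorted; with text lists, for illustration only:

  # entries as "<sha1> <refname>"; loose listed first, stable sort by
  # name, then keep the first occurrence of each refname
  sort -s -k2 loose.list packed.list | awk '!seen[$2]++'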

The loose refs cache is only used by the for_each_ref() functions, not
for looking up individual references.  Another approach would be to
change the top-level for_each_ref() functions to re-stat() all of the
loose references within the namespace that interests it, *then* verify
that the packed-ref cache is not stale, *then* start the iteration.
Then there would be no need to re-stat() files during the iteration.
(This would mean that we have to prohibit a second reference iteration
from being started while one is already in progress.)

Of course, clearing (part of) the loose reference cache invalidates any
pointers that other callers might have retained to refnames in the old
version of the cache.  I've never really investigated what callers might
hold onto such pointers under the assumption that they will live to the
end of the process.

Re: another packed-refs race

2013-05-06 Thread Michael Haggerty
On 05/03/2013 07:28 PM, Jeff King wrote:
 On Fri, May 03, 2013 at 11:26:11AM +0200, Johan Herland wrote:
 
 You don't really need to be sure that packed-refs is up-to-date. You
 only need to make sure that you don't rely on lazily loading loose refs
 _after_ you have loaded packed-refs.
 
 True. As long as you load them both together, and always make sure you
 do loose first, you'd be fine. But I think there will be corner cases
 where you have loaded _part_ of the loose ref namespace. I think part of
 the point of Michael's ref work is that if you call for_each_tag_ref,
 we would not waste time loading refs/remotes/ at all. If you then follow
 that with a call to for_each_ref, you would want to re-use the cached
 work from traversing refs/tags/, and then traverse refs/remotes/. You
 know that your cached packed-refs is good with respect to refs/tags/,
 but you don't with respect to refs/remotes.

Correct.

[I wonder if there really are a lot of iterations over overlapping parts
of the reference namespace within a single git process...]

 The following solution might work in both the resolve-a-single-ref and
 enumerating-refs case:

 0. Look for ref already cached in memory. If found, OK.

 1. Look for loose ref. If found, OK.

 2. If not found, load all loose refs and packed-refs from disk (in
 that order), and store in memory for remainder of this process. Never
 reload packed-refs from disk (unless you also reload all loose refs
 first).
 
 I think that would be correct (modulo that step 1 cannot happen for
 enumeration). But we would like to avoid loading all loose refs if we
 can. Especially on a cold cache, it can be quite slow, and you may not
 even care about those refs for the current operation (I do not recall
 the exact original motivation for the lazy loading, but it was something
 along those lines).

Lazy loading was first inspired by the observation that effectively
every git invocation was loading *all* loose references to do an
iteration over refs/replace/ (which I've never even used!).  This was
absolutely killing the performance of filter-branch, which creates a lot
of loose references and invokes git many times--even though the cache
was warm.

Michael

-- 
Michael Haggerty
mhag...@alum.mit.edu
http://softwareswirl.blogspot.com/


Re: another packed-refs race

2013-05-06 Thread Jeff King
On Mon, May 06, 2013 at 02:03:40PM +0200, Michael Haggerty wrote:

  We could fix this by making sure our packed-refs file is up to date
 
 s/file/cache/

Yeah, I meant our view of the packed-refs file, but I think "cache"
says that more succinctly. I'll be sure to make it clear when I start
writing real commit messages.

 Let me think out loud alongside your analysis...
 
 By this mechanism the reader can ensure that it never uses a version of
 the packed-refs file that is older than its information that the
 corresponding loose ref is absent from the filesystem.

Yes, that's a good way of saying it.

 This is all assuming that the filesystem accesses have a defined order;
 how is that guaranteed?  pack_refs() and commit_ref() both rely on
 commit_lock_file(), which calls
 
 close(fd) on the lockfile
 rename(lk->filename, result_file)
 
 prune_ref() locks the ref, verifies that its SHA-1 is unchanged, then
 calls unlink(), then rollback_lock_file().
 
 The atomicity of rename() guarantees that a reader sees either the old
 or the new version of the file in question.  But what guarantees are
 there about accesses across two files?

I don't know offhand if any standard makes such guarantees. But there
are other parts of git that do depend on that ordering. For example, I
create a loose object representing commit X. Then I update a ref to
point to X. If the ordering of those operations is not preserved, then a
simultaneous reader would think that the repository is corrupted (the
ref points to a missing object).
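Spelled out with plumbing (just a sketch; the names are arbitrary):

  # the object is written first, then the ref moves; a reader must
  # never see the ref update before the object is visible
  blob=$(echo content | git hash-object -w --stdin) &&
  tree=$(printf '100644 blob %s\tfile\n' "$blob" | git mktree) &&
  commit=$(git commit-tree -m 'commit X' "$tree") &&
  git update-ref refs/heads/x "$commit"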

I would not be surprised if there are distributed or networked
filesystems for which this property does not hold. But I suspect it does
hold for operations on a single machine with a decent kernel (I would
imagine the VFS layer takes care of this). I am just basing my suspicion
on experimental results (the patch I sent does seem to work on my ext4
system).

 The loose refs cache is only used by the for_each_ref() functions, not
 for looking up individual references.  Another approach would be to
 change the top-level for_each_ref() functions to re-stat() all of the
 loose references within the namespace that interests it, *then* verify
 that the packed-ref cache is not stale, *then* start the iteration.
 Then there would be no need to re-stat() files during the iteration.
 (This would mean that we have to prohibit a second reference iteration
 from being started while one is already in progress.)

Hmm. Thinking on this more, I'm not sure that we need to stat the loose
references at all. We do not need to care if the loose refs are up to
date or not. Well, we might care, but the point here is not to pretend
that we have an up-to-date atomic view of the loose refs; it is only to
make sure that the fallback-to-packed behavior does not lie to us about
the existence or value of a ref.

IOW, it is OK to come up with a value for ref X that was true at the
beginning of the program, even if it has been simultaneously updated.
Our program can operate as if it happened in the instant it started,
even though in real life it takes longer. But it is _not_ OK to miss the
existence of a ref, or to come up with a value that it did not hold at
some point during the program (e.g., it is not OK to return some cruft
we wrote into the packed-refs file when we packed it three weeks ago).

That is a weaker guarantee, and I think we can provide it with:

  1. Load all loose refs into cache for a particular enumeration.

  2. Make sure the packed-refs cache is up-to-date (by checking its
 stat() information and reloading if necessary).

  3. Run the usual iteration over the loose/packed ref caches.

If a loose ref is updated after or during (1), that's OK. The ref
hierarchy is not atomic, so it is possible for us to see a state that
never existed (e.g., we read X, somebody updates X to X' and Y to Y',
then we read Y'; we see X and Y', a state that never existed on disk).
But either the ref was already loose, in which case we always see its
loose value as it was when we read it, or it was previously only packed
(or non-existent), in which case we get its value from the
packed-refs (or assume it does not exist), which is its state at the
start of our program. If the loose ref is deleted instead of updated,
it's the same thing; we may see it as existing, as it was at the start
of our program.

If the packed-refs file is updated after we have read all of the loose
refs, we may see it ahead of the loose refs state we have cached. And
that may mean the packed-refs file even has more up-to-date values. But
we don't have to care, as long as we return some value that was valid
during the course of our program. And we do, either because we have the
loose value cached (in which case it trumps the packed-refs version and
we use it), or it was missing (in which case we will use the updated
packed-refs value, which may not even be the most recent value, but is at
least a value that the ref had to hold at some point during the course of
our program).

Re: another packed-refs race

2013-05-06 Thread Jeff King
On Mon, May 06, 2013 at 02:12:29PM +0200, Michael Haggerty wrote:

  I think that would be correct (modulo that step 1 cannot happen for
  enumeration). But we would like to avoid loading all loose refs if we
  can. Especially on a cold cache, it can be quite slow, and you may not
  even care about those refs for the current operation (I do not recall
  the exact original motivation for the lazy loading, but it was something
  along those lines).
 
 Lazy loading was first inspired by the observation that effectively
 every git invocation was loading *all* loose references to do an
 iteration over refs/replace/ (which I've never even used!).  This was
 absolutely killing the performance of filter-branch, which creates a lot
 of loose references and invokes git many times--even though the cache
 was warm.

Yeah, obviously we want to avoid that. I _think_ we can even keep the
lazy loading, as long as we keep the ordering as:

  1. Load a chunk of loose refs (whatever we need for the current
 iteration request).

  2. Double-check that our packed-refs cache is up to date, and reload
 if necessary.

  3. Return the results to the caller.

It would perhaps increase latency to getting results to the caller
(since otherwise we can start feeding answers to the caller as we walk
the tree), but should not increase the total amount of work (just the
extra stat() of packed-refs, once per for_each_ref, which is not much).

I'll see if I can cook up a patch.

-Peff


Re: another packed-refs race

2013-05-06 Thread Jeff King
On Mon, May 06, 2013 at 02:41:22PM -0400, Jeff King wrote:

 That is a weaker guarantee, and I think we can provide it with:
 
   1. Load all loose refs into cache for a particular enumeration.
 
   2. Make sure the packed-refs cache is up-to-date (by checking its
  stat() information and reloading if necessary).
 
   3. Run the usual iteration over the loose/packed ref caches.

This does seem to work in my experiments. With stock git, I can trigger
the race reliably with:

  # base load, in one terminal, as before
  git init -q repo &&
  cd repo &&
  git commit -q --allow-empty -m one &&
  one=`git rev-parse HEAD` &&
  git commit -q --allow-empty -m two &&
  two=`git rev-parse HEAD` &&
  sha1=$one &&
  while true; do
# this re-creates the loose ref in .git/refs/heads/master
if test "$sha1" = "$one"; then
  sha1=$two
else
  sha1=$one
fi &&
git update-ref refs/heads/master $sha1 &&

# we can remove packed-refs safely, as we know that
# its only value is now stale. Real git would not do
# this, but we are simulating the case that master
# simply wasn't included in the last packed-refs file.
rm -f .git/packed-refs &&

# and now we repack, which will create an up-to-date
# packed-refs file, and then delete the loose ref
git pack-refs --all --prune
  done

  # in another terminal, enumerate and make sure we never miss the ref
  cd repo &&
  while true; do
refs=`git.compile for-each-ref --format='%(refname)'`
echo "== $refs"
test -z "$refs" && break
  done

It usually takes about 30 seconds to hit a problem, though I measured
failures at up to 90 seconds. With the patch below (on top of the one I
posted the other day, which refreshes the packed-refs cache in
get_packed_refs), it has been running fine for 15 minutes.

The noop_each_fn is a little gross. I could also just reimplement the
recursion from do_for_each_ref_in_dir (except we don't care about flags,
trim, base, etc, and would just be calling get_ref_dir recursively).
It's a slight repetition of code, but it would be less subtle than what
I have written below (which uses a no-op callback for the side effect
that it primes the loose ref cache). Which poison do you prefer?

diff --git a/refs.c b/refs.c
index 45a7ee6..59ae7e4 100644
--- a/refs.c
+++ b/refs.c
@@ -1363,19 +1363,38 @@ static int do_for_each_ref(const char *submodule, const char *base, each_ref_fn
for_each_rawref(warn_if_dangling_symref, &data);
 }
 
+static int noop_each_fn(const char *ref, const unsigned char *sha1, int flags,
+   void *data)
+{
+   return 0;
+}
+
 static int do_for_each_ref(const char *submodule, const char *base, each_ref_fn fn,
   int trim, int flags, void *cb_data)
 {
struct ref_cache *refs = get_ref_cache(submodule);
-   struct ref_dir *packed_dir = get_packed_refs(refs);
-   struct ref_dir *loose_dir = get_loose_refs(refs);
+   struct ref_dir *packed_dir;
+   struct ref_dir *loose_dir;
int retval = 0;
 
-   if (base && *base) {
-   packed_dir = find_containing_dir(packed_dir, base, 0);
+   /*
+* Prime the loose ref cache; we must make sure the packed ref cache is
+* uptodate after we read the loose refs in order to avoid race
+* conditions with a simultaneous pack-refs --prune.
+*/
+   loose_dir = get_loose_refs(refs);
+   if (base && *base)
loose_dir = find_containing_dir(loose_dir, base, 0);
+   if (loose_dir) {
+   sort_ref_dir(loose_dir);
+   do_for_each_ref_in_dir(loose_dir, 0, base, noop_each_fn, 0, 0,
+  NULL);
}
 
+   packed_dir = get_packed_refs(refs);
+   if (base && *base)
+   packed_dir = find_containing_dir(packed_dir, base, 0);
+
if (packed_dir && loose_dir) {
sort_ref_dir(packed_dir);
sort_ref_dir(loose_dir);




Re: another packed-refs race

2013-05-06 Thread Michael Haggerty
On 05/06/2013 08:41 PM, Jeff King wrote:
 On Mon, May 06, 2013 at 02:03:40PM +0200, Michael Haggerty wrote:
 [...]
 The loose refs cache is only used by the for_each_ref() functions, not
 for looking up individual references.  Another approach would be to
 change the top-level for_each_ref() functions to re-stat() all of the
 loose references within the namespace that interests it, *then* verify
 that the packed-ref cache is not stale, *then* start the iteration.
 Then there would be no need to re-stat() files during the iteration.
 (This would mean that we have to prohibit a second reference iteration
 from being started while one is already in progress.)
 
 Hmm. Thinking on this more, I'm not sure that we need to stat the loose
 references at all. We do not need to care if the loose refs are up to
 date or not. Well, we might care, but the point here is not to pretend
 that we have an up-to-date atomic view of the loose refs; it is only to
 make sure that the fallback-to-packed behavior does not lie to us about
 the existence or value of a ref.
 
 IOW, it is OK to come up with a value for ref X that was true at the
 beginning of the program, even if it has been simultaneously updated.
 Our program can operate as if it happened in the instant it started,
 even though in real life it takes longer. But it is _not_ OK to miss the
 existence of a ref, or to come up with a value that it did not hold at
 some point during the program (e.g., it is not OK to return some cruft
 we wrote into the packed-refs file when we packed it three weeks ago).

This all sounds correct to me.

Another potential problem caused by the non-atomicity of loose reference
reading could be to confuse reachability tests if process 1 is reading
loose references while process 2 is renaming a reference:

1. Process 1 looks for refs/heads/a and finds it missing.

2. Process 2 renames "z" -> "a"

3. Process 1 looks for refs/heads/z and finds it missing.

Process 1 would think that any objects pointed to by "a" (formerly
"z") are unreachable.  This would be unfortunate if it is doing an
upload-pack and very bad if it is doing a gc.  I wonder if this could be
a problem in practice?  (Gee, wouldn't it be nice to keep reflogs for
deleted refs? :-) )
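It should be possible to check with the same sort of loops as the repro
scripts elsewhere in this thread; an untested sketch:

# one terminal renames a branch back and forth
git init -q repo &&
cd repo &&
git commit -q --allow-empty -m one &&
git branch z &&
while true; do
  git branch -m z a &&
  git branch -m a z
done

# another looks for both names; seeing *neither* would demonstrate the race
cd repo &&
while true; do
  a=$(git rev-parse -q --verify refs/heads/a)
  z=$(git rev-parse -q --verify refs/heads/z)
  test -z "$a$z" && echo "missed both" && break
done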

 That is a weaker guarantee, and I think we can provide it with:
 
   1. Load all loose refs into cache for a particular enumeration.
 
   2. Make sure the packed-refs cache is up-to-date (by checking its
  stat() information and reloading if necessary).
 
   3. Run the usual iteration over the loose/packed ref caches.

This sounds reasonable, too, though I'll need some more time to digest
your suggestion in detail.

 [...]
 Of course, clearing (part of) the loose reference cache invalidates any
 pointers that other callers might have retained to refnames in the old
 version of the cache.  I've never really investigated what callers might
 hold onto such pointers under the assumption that they will live to the
 end of the process.
 
 My proposal above gets rid of the need to invalidate the loose refs
 cache, so we can ignore that complexity.

The same concern applies to invalidating the packed-ref cache, which you
still want to do.

 Given all of this trouble, there is an obvious question: why do we have
 a loose reference cache in the first place?  I think there are a few
 reasons:

 1. In case one git process has to iterate through the same part of the
 reference namespace more than once.  (Does this frequently happen?)
 
 I'm not sure how often it happens. There are a few obvious candidates if
 you git grep 'for_each[a-z_]*ref', but I'm not sure if the actual
 performance impact is measurable (the cache should be warm after the
 first run-through).
 
 2. Reading a bunch of loose references at the same time is more
 efficient than reading them one by one interleaved with other file
 reads.  (I think this is a significant win.)
 
 Makes some sense, though I don't recall whether it was ever measured.

I think I measured it once and found it a significant benefit, though I
can't remember whether this was with a warm cache or only with a cold cache.

In fact, I experimented with some other strategies for loose reference
loading for performance reasons.  Currently loose references are read
into the cache lazily, one directory at a time.  I experimented with
changes in both directions:

* Preloading the whole tree of loose references before starting an
iteration.  As I recall, this was a performance *win*.  It was on my
to-do list of things to pursue when I have some free time (ha, ha).  I
mostly wanted to check first that there are not callers who abort the
iteration soon after starting it.  For example, imagine a caller who
tries to determine "are there any tags at all" by iterating over
"refs/tags" with a callback that just returns 1; such a caller would
suffer the cost of reading all of the loose references in "refs/tags".

* Lazy loading loose references one reference at a time.  The 

Re: another packed-refs race

2013-05-06 Thread Jeff King
On Tue, May 07, 2013 at 06:32:12AM +0200, Michael Haggerty wrote:

 Another potential problem caused by the non-atomicity of loose reference
 reading could be to confuse reachability tests if process 1 is reading
 loose references while process 2 is renaming a reference:
 
 1. Process 1 looks for refs/heads/a and finds it missing.
 
 2. Process 2 renames "z" -> "a"
 
 3. Process 1 looks for refs/heads/z and finds it missing.
 
 Process 1 would think that any objects pointed to by "a" (formerly
 "z") are unreachable.  This would be unfortunate if it is doing an
 upload-pack and very bad if it is doing a gc.  I wonder if this could be
 a problem in practice?  (Gee, wouldn't it be nice to keep reflogs for
 deleted refs? :-) )

Ugh. Yeah, that is definitely a possible race, and it could be quite
disastrous for prune.

I am really starting to think that we need a pluggable refs
architecture, and that busy servers can use something like sqlite to
keep the ref storage. That would require bumping repositoryformatversion,
of course, but it would be OK for a server accessible only by git
protocols.
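Purely illustrative (no such backend exists, and the schema is invented
for the example), but it shows why a transactional store kills the
rename race:

sqlite3 refs.db 'CREATE TABLE IF NOT EXISTS refs
  (refname TEXT PRIMARY KEY, sha1 TEXT NOT NULL);'
# a rename is a single transaction; readers see "z" or "a", never neither
sqlite3 refs.db "BEGIN IMMEDIATE;
  INSERT INTO refs (refname, sha1)
    SELECT 'refs/heads/a', sha1 FROM refs WHERE refname = 'refs/heads/z';
  DELETE FROM refs WHERE refname = 'refs/heads/z';
  COMMIT;"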

I also wonder if we can spend extra time to get more reliable results
for prune, like checking refs, coming up with a prune list, and then
checking again. I have a feeling it's a 2-generals sort of problem where
we can always miss a ref that keeps bouncing around, but we could bound
the probability. I haven't thought that hard about it. Perhaps this will
give us something to talk about on Thursday. :)

  My proposal above gets rid of the need to invalidate the loose refs
  cache, so we can ignore that complexity.
 
 The same concern applies to invalidating the packed-ref cache, which you
 still want to do.

True. In theory a call to resolve_ref can invalidate it, so any call
from inside a for_each_ref callback would be suspect.

 * Preloading the whole tree of loose references before starting an
 iteration.  As I recall, this was a performance *win*.  It was on my
 to-do list of things to pursue when I have some free time (ha, ha).  I
 mostly wanted to check first that there are not callers who abort the
 iteration soon after starting it.  For example, imagine a caller who
 tries to determine "are there any tags at all" by iterating over
 "refs/tags" with a callback that just returns 1; such a caller would
 suffer the cost of reading all of the loose references in "refs/tags".

Well, you can measure my patches, because that's what they do. :) I
didn't really consider an early termination from the iteration.
Certainly something like:

  git for-each-ref refs/tags | head -1

would take longer. Though if you have so many refs that the latency is
a big problem, you should probably consider packing them (it can't
possibly bite you with a race condition, right?).

-Peff


Re: another packed-refs race

2013-05-03 Thread Johan Herland
On Fri, May 3, 2013 at 10:38 AM, Jeff King p...@peff.net wrote:
 I found another race related to the packed-refs code. Consider for a
 moment what happens when we are looking at refs and another process does
 a simultaneous git pack-refs --all --prune, updating packed-refs and
 deleting the loose refs.

 [...]

 We could fix this by making sure our packed-refs file is up to date
 before using it. E.g., resolving a ref with the following sequence:

   1. Look for loose ref. If found, OK.

   2. Compare inode/size/mtime/etc of on-disk packed-refs to their values
  from the last time we loaded it. If they're not the same, reload
  packed-refs from disk. Otherwise, continue.

   3. Look for ref in in-memory packed-refs.

 Not too bad. We introduce one extra stat() for a ref that has been
 packed, and the scheme isn't very complicated.

 But what about enumerating refs via for_each_ref? It's possible to have
 the same problem there, and the packed-refs file may be moved into place
 midway through the process of enumerating the loose refs. So we may see
 refs/heads/master, but when we get to refs/remotes/origin/master, it has
 now been packed and pruned. I _think_ we can get by with:

   1. Generate the complete sorted list of loose refs.

   2. Check that packed-refs is stat-clean, and reload if necessary, as
  above.

   3. Merge the sorted loose and packed lists, letting loose override
  packed (so even if we get repacked halfway through our loose
  traversal and get half of the refs there, it's OK to see duplicates
  in packed-refs, which we will ignore).

 This is not very far off of what we do now. Obviously we don't do the
 stat-clean check in step 2. But we also don't generate the complete list
 of loose refs before hitting the packed-refs file. Instead we lazily
 load the loose directories as needed. And of course we cache that
 information in memory, even though it may go out of date. I think the
 best we could do is keep a stat() for each individual directory we see,
 and check it before using the in-memory contents. That may be a lot of
 stats, but it's still better than actually opening each loose ref
 separately.

 So I think it's possible to fix, but I thought you might have some
 insight on the simplest way to fit it into the current ref code.

 Did I explain the problem well enough to understand? Can you think of
 any simpler or better solutions (or is there a case where my proposed
 solutions don't work?).

You don't really need to be sure that packed-refs is up-to-date. You
only need to make sure that you don't rely on lazily loading loose refs
_after_ you have loaded packed-refs.

The following solution might work in both the resolve-a-single-ref and
enumerating-refs case:

0. Look for ref already cached in memory. If found, OK.

1. Look for loose ref. If found, OK.

2. If not found, load all loose refs and packed-refs from disk (in
that order), and store in memory for remainder of this process. Never
reload packed-refs from disk (unless you also reload all loose refs
first).

My rationale for this approach is that if you have a packed-refs file,
you will likely have fewer loose refs, so loading all of them in
addition to the pack-refs file won't be that expensive. (Conversely,
if you do have a lot of loose refs, you're more likely to hit #1, and
not have to load all refs.)

That said, my intuition on the number of loose vs. packed refs, or the
relative cost of reading all loose refs might be off here...


...Johan

-- 
Johan Herland, jo...@herland.net
www.herland.net


Re: another packed-refs race

2013-05-03 Thread Jeff King
On Fri, May 03, 2013 at 11:26:11AM +0200, Johan Herland wrote:

 You don't really need to be sure that packed-refs is up-to-date. You
 only need to make sure that you don't rely on lazily loading loose refs
 _after_ you have loaded packed-refs.

True. As long as you load them both together, and always make sure you
do loose first, you'd be fine. But I think there will be corner cases
where you have loaded _part_ of the loose ref namespace. I think part of
the point of Michael's ref work is that if you call for_each_tag_ref,
we would not waste time loading refs/remotes/ at all. If you then follow
that with a call to for_each_ref, you would want to re-use the cached
work from traversing refs/tags/, and then traverse refs/remotes/. You
know that your cached packed-refs is good with respect to refs/tags/,
but you don't with respect to refs/remotes.

 The following solution might work in both the resolve-a-single-ref and
 enumerating-refs case:
 
 0. Look for ref already cached in memory. If found, OK.
 
 1. Look for loose ref. If found, OK.
 
 2. If not found, load all loose refs and packed-refs from disk (in
 that order), and store in memory for remainder of this process. Never
 reload packed-refs from disk (unless you also reload all loose refs
 first).

I think that would be correct (modulo that step 1 cannot happen for
enumeration). But we would like to avoid loading all loose refs if we
can. Especially on a cold cache, it can be quite slow, and you may not
even care about those refs for the current operation (I do not recall
the exact original motivation for the lazy loading, but it was something
along those lines).

 My rationale for this approach is that if you have a packed-refs file,
 you will likely have fewer loose refs, so loading all of them in
 addition to the pack-refs file won't be that expensive. (Conversely,
 if you do have a lot of loose refs, you're more likely to hit #1, and
 not have to load all refs.)
 
 That said, my intuition on the number of loose vs. packed refs, or the
 relative cost of reading all loose refs might be off here...

I don't think that is necessarily true about the number of loose refs.
In a busy repository, you may have many loose refs that have been
updated.

-Peff


Re: another packed-refs race

2013-05-03 Thread Jeff King
On Fri, May 03, 2013 at 01:28:53PM -0400, Jeff King wrote:

  The following solution might work in both the resolve-a-single-ref and
  enumerating-refs case:
  
  0. Look for ref already cached in memory. If found, OK.
  
  1. Look for loose ref. If found, OK.
  
  2. If not found, load all loose refs and packed-refs from disk (in
  that order), and store in memory for remainder of this process. Never
  reload packed-refs from disk (unless you also reload all loose refs
  first).
 
 I think that would be correct (modulo that step 1 cannot happen for
 enumeration). But we would like to avoid loading all loose refs if we
 can. Especially on a cold cache, it can be quite slow, and you may not
 even care about those refs for the current operation (I do not recall
 the exact original motivation for the lazy loading, but it was something
 along those lines).

Actually, forgetting about enumeration for a minute, that would make
single-ref lookup quite painful. Running "git rev-parse foo" shouldn't
have to even look at most loose refs in the first place. It should be a
couple of open() calls looking for the right spot, and then fall back to
loading packed-refs.
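You can watch that access pattern from the outside (rough sketch; the
exact syscalls vary by platform and version):

  strace -f -e trace=open,lstat git rev-parse refs/heads/foo 2>&1 |
  grep -e 'refs/heads/foo' -e packed-refs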

-Peff


Re: another packed-refs race

2013-05-03 Thread Johan Herland
On Fri, May 3, 2013 at 8:26 PM, Jeff King p...@peff.net wrote:
 On Fri, May 03, 2013 at 01:28:53PM -0400, Jeff King wrote:
  The following solution might work in both the resolve-a-single-ref and
  enumerating-refs case:
 
  0. Look for ref already cached in memory. If found, OK.
 
  1. Look for loose ref. If found, OK.
 
  2. If not found, load all loose refs and packed-refs from disk (in
  that order), and store in memory for remainder of this process. Never
  reload packed-refs from disk (unless you also reload all loose refs
  first).

 I think that would be correct (modulo that step 1 cannot happen for
 enumeration). But we would like to avoid loading all loose refs if we
 can. Especially on a cold cache, it can be quite slow, and you may not
 even care about those refs for the current operation (I do not recall
 the exact original motivation for the lazy loading, but it was something
 along those lines).

 Actually, forgetting about enumeration for a minute, that would make
 single-ref lookup quite painful. Running "git rev-parse foo" shouldn't
 have to even look at most loose refs in the first place. It should be a
 couple of open() calls looking for the right spot, and then fall back to
 loading packed-refs.

True. I was overemphasizing the case where we start looking up one
ref, and later look up more refs from the same process (in which case
the load-everything step would be amortized across the other lookups),
but this is probably not the ref access pattern for most Git commands,
and definitely not for "git rev-parse foo". I think your approach is
better.


...Johan

-- 
Johan Herland, jo...@herland.net
www.herland.net


Re: another packed-refs race

2013-05-03 Thread Jeff King
On Fri, May 03, 2013 at 04:38:47AM -0400, Jeff King wrote:

 For reference, here's a script that demonstrates the problem during
 enumeration (sometimes for-each-ref fails to realize that
 refs/heads/master exists at all):
 
   # run this in one terminal
   git init repo &&
   cd repo &&
   git commit --allow-empty -m foo &&
   base=`git rev-parse HEAD` &&
   while true; do
 # this re-creates the loose ref in .git/refs/heads/master
  git update-ref refs/heads/master $base &&

It turns out this is wrong. Git is smart enough not to bother writing
out the loose ref if it isn't changing. So the script as I showed it
actually ends up in a state with _neither_ the packed-refs file nor the
loose ref for an instant.

The correct script looks like this (it just flips between two objects):

  git init -q repo &&
  cd repo &&
  git commit -q --allow-empty -m one &&
  one=`git rev-parse HEAD` &&
  git commit -q --allow-empty -m two &&
  two=`git rev-parse HEAD` &&
  sha1=$one &&
  while true; do
# this re-creates the loose ref in .git/refs/heads/master
if test "$sha1" = "$one"; then
  sha1=$two
else
  sha1=$one
fi &&
git update-ref refs/heads/master $sha1 &&

# we can remove packed-refs safely, as we know that
# its only value is now stale. Real git would not do
# this, but we are simulating the case that master
# simply wasn't included in the last packed-refs file.
rm -f .git/packed-refs &&

# and now we repack, which will create an up-to-date
# packed-refs file, and then delete the loose ref
git pack-refs --all --prune
  done

And a racy lookup check could look like this:

  cd repo &&
  while true; do
ref=`git rev-parse --verify master`
echo "== $ref"
test -z "$ref" && break
  done

it doesn't know which of the two flipping refs it will get on any given
invocation, but it should never see nothing. It should get one or the
other. With stock git, running these two loops simultaneously typically
causes a failure in the second one for me within about 15 seconds.
The (messy, not ready for application) patch below fixes it (at least I
let it run for 30 minutes without a problem).

The fix is actually two-fold:

  1. Re-load the packed-refs file after each loose object lookup
 failure. This is made more palatable by using stat() to avoid
 re-reading the file in the common case that it wasn't updated.

  2. The loose ref reading itself is actually not atomic. We call
 lstat() on the ref to find out whether it exists (and whether it is
 a symlink). If we get ENOENT, we fall back to finding the packed
 ref.  If it does exist and is a regular file, we proceed to open()
 it. But if the ref gets packed and pruned in the interim, our open
 will fail and we just return NULL to say "oops, I guess it doesn't
 exist". We want the same fallback-to-packed behavior we would get
 if the lstat failed.

 We could potentially do the same when we readlink() a symbolic
 link, but I don't think it is necessary. We do not pack symbolic
 refs, so if readlink gets ENOENT, it's OK to say "nope, the ref
 does not exist".

This doesn't cover the for_each_ref enumeration case at all, which
should still fail.  I'll try to look at that next.

---
diff --git a/refs.c b/refs.c
index de2d8eb..45a7ee6 100644
--- a/refs.c
+++ b/refs.c
@@ -708,6 +708,7 @@ static struct ref_cache {
struct ref_cache *next;
struct ref_entry *loose;
struct ref_entry *packed;
+   struct stat packed_validity;
/* The submodule name, or "" for the main repo. */
char name[FLEX_ARRAY];
 } *ref_cache;
@@ -717,6 +718,7 @@ static void clear_packed_ref_cache(struct ref_cache *refs)
if (refs->packed) {
free_ref_entry(refs->packed);
refs->packed = NULL;
+   memset(&refs->packed_validity, 0, sizeof(refs->packed_validity));
}
 }
 
@@ -876,19 +878,57 @@ static struct ref_dir *get_packed_refs(struct ref_cache *refs)
}
 }
 
+/*
+ * Returns 1 if the cached stat information matches the
+ * current state of the file, and 0 otherwise. This should
+ * probably be refactored to share code with ce_match_stat_basic,
+ * which has platform-specific knobs for which fields to respect.
+ */
+static int check_stat_validity(const struct stat *old, const char *fn)
+{
+   static struct stat null;
+   struct stat cur;
+
+   if (stat(fn, &cur))
+   return errno == ENOENT && !memcmp(old, &null, sizeof(null));
+   return cur.st_ino == old->st_ino &&
+  cur.st_size == old->st_size &&
+  cur.st_mtime == old->st_mtime;
+}
+
+/*
+ * Call fstat, but zero out the stat structure if for whatever
+ * reason we can't get an answer.
+ */
+static int safe_fstat(int fd, struct stat *out)
+{
+   int r = fstat(fd, out);
+   if (r)
+   memset(out, 0, sizeof(*out));
+   return r;
+}
+
 static struct ref_dir *get_packed_refs(struct ref_cache *refs)
 {
+   const char