Re: Re* Protecting old temporary objects being reused from concurrent "git gc"?

2016-11-16 Thread Jeff King
On Wed, Nov 16, 2016 at 05:35:47PM -0800, Junio C Hamano wrote:

> OK, here is what I have queued.
> 
> -- >8 --
> Subject: cache-tree: make sure to "touch" tree objects the cache-tree records
> 
> The cache_tree_fully_valid() function is called by callers that want
> to know if they need to call cache_tree_update(), i.e. as an attempt
> to optimize. They all want to have a fully valid cache-tree in the
> end so that they can write a tree object out.

That makes sense. I was focusing on cache_tree_update() call, but we do
not even get there in the fully-valid case.

So I think this approach is nice as long as there is not a caller who
asks "are we fully valid? I do not need to write, but was just
wondering". That should be a read-only operation, but the freshen calls
may fail with EPERM, for example.

I do not see any such callers, nor do I really expect any. Just trying
to think through the possible consequences.

> Strictly speaking, freshing these tree objects at each and every
> level is probably unnecessary, given that anything reachable from a
> young object inherits the youth from the referring object to be
> protected from pruning.  It should be sufficient to freshen only the
> very top-level tree instead.  Benchmarking and optimization is left
> as an exercise for later days.

Good observation, and nicely explained all around.

-Peff


Re* Protecting old temporary objects being reused from concurrent "git gc"?

2016-11-16 Thread Junio C Hamano
Jeff King  writes:

> I think the case that is helped here is somebody who runs "git
> write-tree" and expects that the timestamp on those trees is fresh. So
> even more a briefly used index, like:
>
>   export GIT_INDEX_FILE=/tmp/foo
>   git read-tree ...
>   git write-tree
>   rm -f $GIT_INDEX_FILE
>
> we'd expect that a "git gc" which runs immediately after would see those
> trees as recent and avoid pruning them (and transitively, any blobs that
> are reachable from the trees). But I don't think that write-tree
> actually freshens them (it sees "oh, we already have these; there is
> nothing to write").

OK, here is what I have queued.

-- >8 --
Subject: cache-tree: make sure to "touch" tree objects the cache-tree records

The cache_tree_fully_valid() function is called by callers that want
to know if they need to call cache_tree_update(), i.e. as an attempt
to optimize. They all want to have a fully valid cache-tree in the
end so that they can write a tree object out.

We used to check if the cached tree object is up-to-date (i.e. the
index entires covered by the cache-tree entry hasn't been changed
since the roll-up hash was computed for the cache-tree entry) and
made sure the tree object is already in the object store.  Since the
top-level tree we would soon be asked to write out (and would find
in the object store) may not be anchored to any refs or commits
immediately, freshen the tree object when it happens.

Similarly, when we actually compute the cache-tree entries in
cache_tree_update(), we refrained from writing a tree object out
when it already exists in the object store.  We would want to
freshen the tree object in that case to protect it from premature
pruning.

Strictly speaking, freshing these tree objects at each and every
level is probably unnecessary, given that anything reachable from a
young object inherits the youth from the referring object to be
protected from pruning.  It should be sufficient to freshen only the
very top-level tree instead.  Benchmarking and optimization is left
as an exercise for later days.

Signed-off-by: Junio C Hamano 
---
 cache-tree.c | 6 +++---
 cache.h  | 1 +
 sha1_file.c  | 9 +++--
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 3ebf9c3aa4..c8c74a1e07 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -225,7 +225,7 @@ int cache_tree_fully_valid(struct cache_tree *it)
int i;
if (!it)
return 0;
-   if (it->entry_count < 0 || !has_sha1_file(it->sha1))
+   if (it->entry_count < 0 || !freshen_object(it->sha1))
return 0;
for (i = 0; i < it->subtree_nr; i++) {
if (!cache_tree_fully_valid(it->down[i]->cache_tree))
@@ -253,7 +253,7 @@ static int update_one(struct cache_tree *it,
 
*skip_count = 0;
 
-   if (0 <= it->entry_count && has_sha1_file(it->sha1))
+   if (0 <= it->entry_count && freshen_object(it->sha1))
return it->entry_count;
 
/*
@@ -393,7 +393,7 @@ static int update_one(struct cache_tree *it,
if (repair) {
unsigned char sha1[20];
hash_sha1_file(buffer.buf, buffer.len, tree_type, sha1);
-   if (has_sha1_file(sha1))
+   if (freshen_object(sha1))
hashcpy(it->sha1, sha1);
else
to_invalidate = 1;
diff --git a/cache.h b/cache.h
index 4ff196c259..72c0b321ac 100644
--- a/cache.h
+++ b/cache.h
@@ -1077,6 +1077,7 @@ extern int sha1_object_info(const unsigned char *, 
unsigned long *);
 extern int hash_sha1_file(const void *buf, unsigned long len, const char 
*type, unsigned char *sha1);
 extern int write_sha1_file(const void *buf, unsigned long len, const char 
*type, unsigned char *return_sha1);
 extern int hash_sha1_file_literally(const void *buf, unsigned long len, const 
char *type, unsigned char *sha1, unsigned flags);
+extern int freshen_object(const unsigned char *);
 extern int pretend_sha1_file(void *, unsigned long, enum object_type, unsigned 
char *);
 extern int force_object_loose(const unsigned char *sha1, time_t mtime);
 extern int git_open_noatime(const char *name);
diff --git a/sha1_file.c b/sha1_file.c
index d0f2aa029b..9acce3d3b8 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3151,6 +3151,11 @@ static int freshen_packed_object(const unsigned char 
*sha1)
return 1;
 }
 
+int freshen_object(const unsigned char *sha1)
+{
+   return freshen_packed_object(sha1) || freshen_loose_object(sha1);
+}
+
 int write_sha1_file(const void *buf, unsigned long len, const char *type, 
unsigned char *sha1)
 {
char hdr[32];
@@ -3160,7 +3165,7 @@ int write_sha1_file(const void *buf, unsigned long len, 
const char *type, unsign
 * it out into .git/objects/??/?{38} file.
 */
write_sha1_file_prepare(buf, len, type, sha1, hdr, &hdrlen);
-   if (freshen_packed_object(sha1) || freshen_loose_object(sha1))
+   if (freshen_

Re: Protecting old temporary objects being reused from concurrent "git gc"?

2016-11-16 Thread Jeff King
On Wed, Nov 16, 2016 at 10:58:30AM -0800, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > I suspect the issue is that read-tree populates the cache-tree index
> > extension, and then write-tree omits the object write before it even
> > gets to write_sha1_file().
> 
> Wait a minute.  The entries in the index and trees in the cache-tree
> are root of "still in use" traversal for the purpose of pruning,
> which makes the "something like this" patch unnecessary for the real
> index file.
> 
> And for temporary index files that is kept for 6 months, touching
> tree objects that cache-tree references is irrelevant---the blobs
> recorded in the "list of objects" part of the index will go stale,
> which is a lot more problematic.

I think the case that is helped here is somebody who runs "git
write-tree" and expects that the timestamp on those trees is fresh. So
even more a briefly used index, like:

  export GIT_INDEX_FILE=/tmp/foo
  git read-tree ...
  git write-tree
  rm -f $GIT_INDEX_FILE

we'd expect that a "git gc" which runs immediately after would see those
trees as recent and avoid pruning them (and transitively, any blobs that
are reachable from the trees). But I don't think that write-tree
actually freshens them (it sees "oh, we already have these; there is
nothing to write").

I could actually see an argument that the read-tree operation should
freshen the blobs themselves (because we know those blobs are now in
active use, and probably shouldn't be pruned), but I am not sure I agree
there. If only because it is weird that an operation which is otherwise
read-only with respect to the repository would modify the object
database.

-Peff


Re: Protecting old temporary objects being reused from concurrent "git gc"?

2016-11-16 Thread Junio C Hamano
Jeff King  writes:

> I suspect the issue is that read-tree populates the cache-tree index
> extension, and then write-tree omits the object write before it even
> gets to write_sha1_file().

Wait a minute.  The entries in the index and trees in the cache-tree
are root of "still in use" traversal for the purpose of pruning,
which makes the "something like this" patch unnecessary for the real
index file.

And for temporary index files that is kept for 6 months, touching
tree objects that cache-tree references is irrelevant---the blobs
recorded in the "list of objects" part of the index will go stale,
which is a lot more problematic.


Re: Protecting old temporary objects being reused from concurrent "git gc"?

2016-11-16 Thread Junio C Hamano
Jeff King  writes:

> ... I notice there is
> a return very early on in update_one() when has_sha1_file() matches, and
> it seems like that would trigger in some interesting cases, too.

Yeah, I missed that.  It says "we were asked to update one
cache_tree that corresponds to this subdirectory, found that
hashes everything below has been rolled up and still valid, and we
already have the right tree object in the object store".

It can simply become freshen(), which is "do we have it in the
object store?" with a side effect of touching iff the answer is
"yes".


Re: Protecting old temporary objects being reused from concurrent "git gc"?

2016-11-16 Thread Jeff King
On Tue, Nov 15, 2016 at 12:01:35PM -0800, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > I suspect the issue is that read-tree populates the cache-tree index
> > extension, and then write-tree omits the object write before it even
> > gets to write_sha1_file(). The solution is that it should probably be
> > calling one of the freshen() functions (possibly just replacing
> > has_sha1_file() with check_and_freshen(), but I haven't looked).
> 
> I think the final writing always happens via write_sha1_file(), but
> an earlier cache-tree update that says "if we have a tree object
> already, then use it, otherwise even though we know the object name
> for this subtree, do not record it in the cache-tree" codepath may
> decide to record the subtree's sha1 without refreshing the referent.
> 
> A fix may look like this.

Yeah, that's along the lines I was expecting, though I'm not familiar
enough with cache-tree to say whether it's sufficient. I notice there is
a return very early on in update_one() when has_sha1_file() matches, and
it seems like that would trigger in some interesting cases, too.

-Peff


Re: Protecting old temporary objects being reused from concurrent "git gc"?

2016-11-15 Thread Junio C Hamano
Jeff King  writes:

> I suspect the issue is that read-tree populates the cache-tree index
> extension, and then write-tree omits the object write before it even
> gets to write_sha1_file(). The solution is that it should probably be
> calling one of the freshen() functions (possibly just replacing
> has_sha1_file() with check_and_freshen(), but I haven't looked).

I think the final writing always happens via write_sha1_file(), but
an earlier cache-tree update that says "if we have a tree object
already, then use it, otherwise even though we know the object name
for this subtree, do not record it in the cache-tree" codepath may
decide to record the subtree's sha1 without refreshing the referent.

A fix may look like this.

 cache-tree.c | 2 +-
 cache.h  | 1 +
 sha1_file.c  | 9 +++--
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 345ea35963..3ae6d056b4 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -401,7 +401,7 @@ static int update_one(struct cache_tree *it,
if (repair) {
unsigned char sha1[20];
hash_sha1_file(buffer.buf, buffer.len, tree_type, sha1);
-   if (has_sha1_file(sha1))
+   if (freshen_object(sha1))
hashcpy(it->sha1, sha1);
else
to_invalidate = 1;
diff --git a/cache.h b/cache.h
index 5cdea6833e..1f5694f308 100644
--- a/cache.h
+++ b/cache.h
@@ -1126,6 +1126,7 @@ extern int sha1_object_info(const unsigned char *, 
unsigned long *);
 extern int hash_sha1_file(const void *buf, unsigned long len, const char 
*type, unsigned char *sha1);
 extern int write_sha1_file(const void *buf, unsigned long len, const char 
*type, unsigned char *return_sha1);
 extern int hash_sha1_file_literally(const void *buf, unsigned long len, const 
char *type, unsigned char *sha1, unsigned flags);
+extern int freshen_object(const unsigned char *);
 extern int pretend_sha1_file(void *, unsigned long, enum object_type, unsigned 
char *);
 extern int force_object_loose(const unsigned char *sha1, time_t mtime);
 extern int git_open_cloexec(const char *name, int flags);
diff --git a/sha1_file.c b/sha1_file.c
index e030805497..1daeb05dcd 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3275,6 +3275,11 @@ static int freshen_packed_object(const unsigned char 
*sha1)
return 1;
 }
 
+int freshen_object(const unsigned char *sha1)
+{
+   return freshen_packed_object(sha1) || freshen_loose_object(sha1);
+}
+
 int write_sha1_file(const void *buf, unsigned long len, const char *type, 
unsigned char *sha1)
 {
char hdr[32];
@@ -3284,7 +3289,7 @@ int write_sha1_file(const void *buf, unsigned long len, 
const char *type, unsign
 * it out into .git/objects/??/?{38} file.
 */
write_sha1_file_prepare(buf, len, type, sha1, hdr, &hdrlen);
-   if (freshen_packed_object(sha1) || freshen_loose_object(sha1))
+   if (freshen_object(sha1))
return 0;
return write_loose_object(sha1, hdr, hdrlen, buf, len, 0);
 }
@@ -3302,7 +3307,7 @@ int hash_sha1_file_literally(const void *buf, unsigned 
long len, const char *typ
 
if (!(flags & HASH_WRITE_OBJECT))
goto cleanup;
-   if (freshen_packed_object(sha1) || freshen_loose_object(sha1))
+   if (freshen_object(sha1))
goto cleanup;
status = write_loose_object(sha1, header, hdrlen, buf, len, 0);
 


Re: Protecting old temporary objects being reused from concurrent "git gc"?

2016-11-15 Thread Matt McCutchen
On Tue, 2016-11-15 at 12:40 -0500, Jeff King wrote:
> On Tue, Nov 15, 2016 at 12:33:04PM -0500, Matt McCutchen wrote:
> 
> > 
> > On Tue, 2016-11-15 at 12:06 -0500, Jeff King wrote:
> > > 
> > >  - when an object write is optimized out because we already have the
> > >    object, git will update the mtime on the file (loose object or
> > >    packfile) to freshen it
> > 
> > FWIW, I am not seeing this happen when I do "git read-tree --prefix"
> > followed by "git write-tree" using the current master (3ab2281).  See
> > the attached test script.
> 
> The optimization I'm thinking about is the one from write_sha1_file(),
> which learned to freshen in 33d4221c7 (write_sha1_file: freshen existing
> objects, 2014-10-15).
> 
> I suspect the issue is that read-tree populates the cache-tree index
> extension, and then write-tree omits the object write before it even
> gets to write_sha1_file(). The solution is that it should probably be
> calling one of the freshen() functions (possibly just replacing
> has_sha1_file() with check_and_freshen(), but I haven't looked).
> 
> I'd definitely welcome patches in this area.

Cool, it's nice to have an idea of what's going on.  I don't think I'm
going to try to fix it myself though.

By the way, thanks for the fast response to my original question!

Matt


Re: Protecting old temporary objects being reused from concurrent "git gc"?

2016-11-15 Thread Jeff King
On Tue, Nov 15, 2016 at 12:33:04PM -0500, Matt McCutchen wrote:

> On Tue, 2016-11-15 at 12:06 -0500, Jeff King wrote:
> >  - when an object write is optimized out because we already have the
> >    object, git will update the mtime on the file (loose object or
> >    packfile) to freshen it
> 
> FWIW, I am not seeing this happen when I do "git read-tree --prefix"
> followed by "git write-tree" using the current master (3ab2281).  See
> the attached test script.

The optimization I'm thinking about is the one from write_sha1_file(),
which learned to freshen in 33d4221c7 (write_sha1_file: freshen existing
objects, 2014-10-15).

I suspect the issue is that read-tree populates the cache-tree index
extension, and then write-tree omits the object write before it even
gets to write_sha1_file(). The solution is that it should probably be
calling one of the freshen() functions (possibly just replacing
has_sha1_file() with check_and_freshen(), but I haven't looked).

I'd definitely welcome patches in this area.

> OK.  I'll write a patch to add a summary of this information to the
> git-gc man page.

Sounds like a good idea. Thanks.

-Peff


Re: Protecting old temporary objects being reused from concurrent "git gc"?

2016-11-15 Thread Matt McCutchen
On Tue, 2016-11-15 at 12:06 -0500, Jeff King wrote:
>  - when an object write is optimized out because we already have the
>    object, git will update the mtime on the file (loose object or
>    packfile) to freshen it

FWIW, I am not seeing this happen when I do "git read-tree --prefix"
followed by "git write-tree" using the current master (3ab2281).  See
the attached test script.

> If you have long-running data (like, a temporary index file that might
> literally sit around for days or weeks) I think that is a potential
> problem. And the solution is probably to use refs in some way to point
> to your objects.

Agreed.  This is not my current scenario.

> If you're worried about a short-term operation where
> somebody happens to run git-gc concurrently, I agree it's a possible
> problem, but I suspect something you can ignore in practice.
> 
> For the most part, a lot of the client-side git tools assume that one
> operation is happening at a time in the repository. And I think that
> largely holds for a developer working on a single clone, and things just
> work in practice.
> 
> Auto-gc makes that a little sketchier, but historically does not seem to
> have really caused problems in practice.

OK.  I'll write a patch to add a summary of this information to the
git-gc man page.

Matt

test-git-read-tree-write-tree-touch-object.sh
Description: application/shellscript


Re: Protecting old temporary objects being reused from concurrent "git gc"?

2016-11-15 Thread Jeff King
On Tue, Nov 15, 2016 at 09:13:14AM -0500, Matt McCutchen wrote:

> I want to change this to something that won't leave an inconsistent
> state if interrupted.  I've written code for this kind of thing before
> that sets GIT_INDEX_FILE and uses a temporary index file and "git
> write-tree".  But I realized that if "git gc" runs concurrently, the
> generated tree could be deleted before it is used and the tool would
> fail.  If I had a need to run "git commit-tree", it seems like I might
> even end up with a commit object with a broken reference to a tree.
>  "git gc" normally doesn't delete objects that were created in the last
> 2 weeks, but if an identical tree was added to the object database more
> than 2 weeks ago by another operation and is unreferenced, it could be
> reused without updating its mtime and it could still get deleted.

Modern versions of git do two things to help with this:

 - any object which is referenced by a "recent" object (within the 2
   weeks) is also considered recent. So if you create a new commit
   object that points to a tree, even before you reference the commit
   that tree is protected

 - when an object write is optimized out because we already have the
   object, git will update the mtime on the file (loose object or
   packfile) to freshen it

This isn't perfect, though. You can decide to reference an existing
object just as it is being deleted. And the pruning process itself is
not atomic (and it's tricky to make it so, just because of what we're
promised by the filesystem).

> Is there a recommended way to avoid this kind of problem in add-on
> tools?  (I searched the Git documentation and the web for information
> about races with "git gc" and didn't find anything useful.)  If not, it
> seems to be a significant design flaw in "git gc", even if the problem
> is extremely rare in practice.  I wonder if some of the built-in
> commands may have the same problem, though I haven't tried to test
> them.  If this is confirmed to be a known problem affecting built-in
> commands, then at least I won't feel bad about introducing the
> same problem into add-on tools. :/

If you have long-running data (like, a temporary index file that might
literally sit around for days or weeks) I think that is a potential
problem. And the solution is probably to use refs in some way to point
to your objects. If you're worried about a short-term operation where
somebody happens to run git-gc concurrently, I agree it's a possible
problem, but I suspect something you can ignore in practice.

For the most part, a lot of the client-side git tools assume that one
operation is happening at a time in the repository. And I think that
largely holds for a developer working on a single clone, and things just
work in practice.

Auto-gc makes that a little sketchier, but historically does not seem to
have really caused problems in practice.

For a busy multi-user server, I recommend turning off auto-gc entirely,
and repacking manually with "-k" to be on the safe side.

-Peff


Protecting old temporary objects being reused from concurrent "git gc"?

2016-11-15 Thread Matt McCutchen
The Braid subproject management tool stores the subproject content in
the main tree and is able to switch to a different upstream revision of
a subproject by doing the equivalent of "git read-tree -m" on the
superproject tree and the two upstream trees.  The tricky part is
preparing temporary trees with the upstream content moved to the path
configured for the superproject.  The usual method is "git read-tree
--prefix", but using what index file?  Braid currently uses the user's
actual worktree, which can leave a mess if it gets interrupted:

https://github.com/cristibalan/braid/blob/7d81da6e86e24de62a74f3ab8d880666cb343b04/lib/braid/commands/update.rb#L98

I want to change this to something that won't leave an inconsistent
state if interrupted.  I've written code for this kind of thing before
that sets GIT_INDEX_FILE and uses a temporary index file and "git
write-tree".  But I realized that if "git gc" runs concurrently, the
generated tree could be deleted before it is used and the tool would
fail.  If I had a need to run "git commit-tree", it seems like I might
even end up with a commit object with a broken reference to a tree.
 "git gc" normally doesn't delete objects that were created in the last
2 weeks, but if an identical tree was added to the object database more
than 2 weeks ago by another operation and is unreferenced, it could be
reused without updating its mtime and it could still get deleted.

Is there a recommended way to avoid this kind of problem in add-on
tools?  (I searched the Git documentation and the web for information
about races with "git gc" and didn't find anything useful.)  If not, it
seems to be a significant design flaw in "git gc", even if the problem
is extremely rare in practice.  I wonder if some of the built-in
commands may have the same problem, though I haven't tried to test
them.  If this is confirmed to be a known problem affecting built-in
commands, then at least I won't feel bad about introducing the
same problem into add-on tools. :/

Thanks,
Matt