Re: Re* Protecting old temporary objects being reused from concurrent "git gc"?
On Wed, Nov 16, 2016 at 05:35:47PM -0800, Junio C Hamano wrote:

> OK, here is what I have queued.
>
> -- >8 --
> Subject: cache-tree: make sure to "touch" tree objects the cache-tree records
>
> The cache_tree_fully_valid() function is called by callers that want
> to know if they need to call cache_tree_update(), i.e. as an attempt
> to optimize. They all want to have a fully valid cache-tree in the
> end so that they can write a tree object out.

That makes sense. I was focusing on the cache_tree_update() call, but
we do not even get there in the fully-valid case.

So I think this approach is nice as long as there is not a caller who
asks "are we fully valid? I do not need to write, but was just
wondering". That should be a read-only operation, but the freshen
calls may fail with EPERM, for example. I do not see any such callers,
nor do I really expect any; I am just trying to think through the
possible consequences.

> Strictly speaking, freshening these tree objects at each and every
> level is probably unnecessary, given that anything reachable from a
> young object inherits the youth from the referring object to be
> protected from pruning. It should be sufficient to freshen only the
> very top-level tree instead. Benchmarking and optimization is left
> as an exercise for later days.

Good observation, and nicely explained all around.

-Peff
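[To make the behavior change concrete: with this patch applied, a
write-tree whose cache-tree is already fully valid rewrites nothing,
but now bumps the mtime of the tree objects it records. A rough way to
observe that, as a sketch only (it assumes the tree is a loose object
and uses GNU "touch -d" / "stat -c"; none of this is from the thread):

  tree=$(git write-tree)                 # computes and saves the cache-tree
  obj=.git/objects/$(echo "$tree" | cut -b1-2)/$(echo "$tree" | cut -b3-)
  touch -d '3 weeks ago' "$obj"          # make the tree object look prunable
  git write-tree >/dev/null              # cache-tree still fully valid: nothing
                                         # is rewritten, but freshen_object()
  stat -c %y "$obj"                      # should have reset the mtime to "now"
]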
Re* Protecting old temporary objects being reused from concurrent "git gc"?
Jeff King writes:

> I think the case that is helped here is somebody who runs "git
> write-tree" and expects that the timestamp on those trees is fresh. So
> even for a briefly used index, like:
>
>   export GIT_INDEX_FILE=/tmp/foo
>   git read-tree ...
>   git write-tree
>   rm -f $GIT_INDEX_FILE
>
> we'd expect that a "git gc" which runs immediately after would see
> those trees as recent and avoid pruning them (and transitively, any
> blobs that are reachable from the trees). But I don't think that
> write-tree actually freshens them (it sees "oh, we already have these;
> there is nothing to write").

OK, here is what I have queued.

-- >8 --
Subject: cache-tree: make sure to "touch" tree objects the cache-tree records

The cache_tree_fully_valid() function is called by callers that want
to know if they need to call cache_tree_update(), i.e. as an attempt
to optimize. They all want to have a fully valid cache-tree in the
end so that they can write a tree object out.

We used to check if the cached tree object is up-to-date (i.e. the
index entries covered by the cache-tree entry haven't been changed
since the roll-up hash was computed for the cache-tree entry) and
made sure the tree object is already in the object store. Since the
top-level tree we would soon be asked to write out (and would find
in the object store) may not be anchored to any refs or commits
immediately, freshen the tree object when it happens.

Similarly, when we actually compute the cache-tree entries in
cache_tree_update(), we refrained from writing a tree object out
when it already exists in the object store. We would want to freshen
the tree object in that case to protect it from premature pruning.

Strictly speaking, freshening these tree objects at each and every
level is probably unnecessary, given that anything reachable from a
young object inherits the youth from the referring object to be
protected from pruning. It should be sufficient to freshen only the
very top-level tree instead. Benchmarking and optimization is left
as an exercise for later days.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 cache-tree.c | 6 +++---
 cache.h      | 1 +
 sha1_file.c  | 9 +++++++--
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 3ebf9c3aa4..c8c74a1e07 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -225,7 +225,7 @@ int cache_tree_fully_valid(struct cache_tree *it)
 	int i;
 	if (!it)
 		return 0;
-	if (it->entry_count < 0 || !has_sha1_file(it->sha1))
+	if (it->entry_count < 0 || !freshen_object(it->sha1))
 		return 0;
 	for (i = 0; i < it->subtree_nr; i++) {
 		if (!cache_tree_fully_valid(it->down[i]->cache_tree))
@@ -253,7 +253,7 @@ static int update_one(struct cache_tree *it,
 
 	*skip_count = 0;
 
-	if (0 <= it->entry_count && has_sha1_file(it->sha1))
+	if (0 <= it->entry_count && freshen_object(it->sha1))
 		return it->entry_count;
 
 	/*
@@ -393,7 +393,7 @@ static int update_one(struct cache_tree *it,
 	if (repair) {
 		unsigned char sha1[20];
 		hash_sha1_file(buffer.buf, buffer.len, tree_type, sha1);
-		if (has_sha1_file(sha1))
+		if (freshen_object(sha1))
 			hashcpy(it->sha1, sha1);
 		else
 			to_invalidate = 1;
diff --git a/cache.h b/cache.h
index 4ff196c259..72c0b321ac 100644
--- a/cache.h
+++ b/cache.h
@@ -1077,6 +1077,7 @@ extern int sha1_object_info(const unsigned char *, unsigned long *);
 extern int hash_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *sha1);
 extern int write_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *return_sha1);
 extern int hash_sha1_file_literally(const void *buf, unsigned long len, const char *type, unsigned char *sha1, unsigned flags);
+extern int freshen_object(const unsigned char *);
 extern int pretend_sha1_file(void *, unsigned long, enum object_type, unsigned char *);
 extern int force_object_loose(const unsigned char *sha1, time_t mtime);
 extern int git_open_noatime(const char *name);
diff --git a/sha1_file.c b/sha1_file.c
index d0f2aa029b..9acce3d3b8 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3151,6 +3151,11 @@ static int freshen_packed_object(const unsigned char *sha1)
 	return 1;
 }
 
+int freshen_object(const unsigned char *sha1)
+{
+	return freshen_packed_object(sha1) || freshen_loose_object(sha1);
+}
+
 int write_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *sha1)
 {
 	char hdr[32];
@@ -3160,7 +3165,7 @@ int write_sha1_file(const void *buf, unsigned long len, const char *type, unsign
 	 * it out into .git/objects/??/?{38} file.
 	 */
 	write_sha1_file_prepare(buf, len, type, sha1, hdr, &hdrlen);
-	if (freshen_packed_object(sha1) || freshen_loose_object(sha1))
+	if (freshen_object(sha1))
 		return 0;
 	return write_loose_object(sha1, hdr, hdrlen, buf, len, 0);
 }
@@ -3178,7 +3183,7 @@ int hash_sha1_file_literally(const void *buf, unsigned long len, const char *typ
 
 	if (!(flags & HASH_WRITE_OBJECT))
 		goto cleanup;
-	if (freshen_packed_object(sha1) || freshen_loose_object(sha1))
+	if (freshen_object(sha1))
 		goto cleanup;
 	status = write_loose_object(sha1, header, hdrlen, buf, len, 0);
Re: Protecting old temporary objects being reused from concurrent "git gc"?
On Wed, Nov 16, 2016 at 10:58:30AM -0800, Junio C Hamano wrote:

> Jeff King writes:
>
> > I suspect the issue is that read-tree populates the cache-tree index
> > extension, and then write-tree omits the object write before it even
> > gets to write_sha1_file().
>
> Wait a minute. The entries in the index and the trees in the
> cache-tree are roots of the "still in use" traversal for the purpose
> of pruning, which makes the "something like this" patch unnecessary
> for the real index file.
>
> And for temporary index files that are kept for 6 months, touching
> tree objects that the cache-tree references is irrelevant---the blobs
> recorded in the "list of objects" part of the index will go stale,
> which is a lot more problematic.

I think the case that is helped here is somebody who runs "git
write-tree" and expects that the timestamp on those trees is fresh. So
even for a briefly used index, like:

  export GIT_INDEX_FILE=/tmp/foo
  git read-tree ...
  git write-tree
  rm -f $GIT_INDEX_FILE

we'd expect that a "git gc" which runs immediately after would see
those trees as recent and avoid pruning them (and transitively, any
blobs that are reachable from the trees). But I don't think that
write-tree actually freshens them (it sees "oh, we already have these;
there is nothing to write").

I could actually see an argument that the read-tree operation should
freshen the blobs themselves (because we know those blobs are now in
active use, and probably shouldn't be pruned), but I am not sure I
agree there, if only because it is weird that an operation which is
otherwise read-only with respect to the repository would modify the
object database.

-Peff
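[Until such freshening is in place, a script using Peff's temporary
index pattern can defend itself by anchoring what it writes with a
ref, along the lines Peff suggests further down-thread. A sketch of
one possible workaround, not something the thread settles on; the
refs/tmp naming is made up:

  export GIT_INDEX_FILE=/tmp/foo
  git read-tree ...
  tree=$(git write-tree)
  git update-ref refs/tmp/mytool-$$ "$tree"   # a non-branch ref may point
                                              # directly at a tree object
  # ... use $tree while it is safely reachable ...
  git update-ref -d refs/tmp/mytool-$$
  rm -f "$GIT_INDEX_FILE"
]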
Re: Protecting old temporary objects being reused from concurrent "git gc"?
Jeff King writes:

> I suspect the issue is that read-tree populates the cache-tree index
> extension, and then write-tree omits the object write before it even
> gets to write_sha1_file().

Wait a minute. The entries in the index and the trees in the
cache-tree are roots of the "still in use" traversal for the purpose
of pruning, which makes the "something like this" patch unnecessary
for the real index file.

And for temporary index files that are kept for 6 months, touching
tree objects that the cache-tree references is irrelevant---the blobs
recorded in the "list of objects" part of the index will go stale,
which is a lot more problematic.
Re: Protecting old temporary objects being reused from concurrent "git gc"?
Jeff King writes:

> ... I notice there is
> a return very early on in update_one() when has_sha1_file() matches, and
> it seems like that would trigger in some interesting cases, too.

Yeah, I missed that. It says "we were asked to update one cache_tree
that corresponds to this subdirectory, found that the hashes of
everything below have been rolled up and are still valid, and we
already have the right tree object in the object store". It can simply
become freshen(), which is "do we have it in the object store?" with a
side effect of touching iff the answer is "yes".
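[A loose-object-only shell analogy of that freshen() contract, as an
illustration rather than the real code (the actual implementation also
probes pack files via freshen_packed_object()):

  # succeed iff the object exists, touching its mtime as a side effect
  freshen () {
	obj=.git/objects/$(echo "$1" | cut -b1-2)/$(echo "$1" | cut -b3-)
	test -f "$obj" && touch "$obj"
  }
]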
Re: Protecting old temporary objects being reused from concurrent "git gc"?
On Tue, Nov 15, 2016 at 12:01:35PM -0800, Junio C Hamano wrote:

> Jeff King writes:
>
> > I suspect the issue is that read-tree populates the cache-tree index
> > extension, and then write-tree omits the object write before it even
> > gets to write_sha1_file(). The solution is that it should probably be
> > calling one of the freshen() functions (possibly just replacing
> > has_sha1_file() with check_and_freshen(), but I haven't looked).
>
> I think the final writing always happens via write_sha1_file(), but
> an earlier cache-tree update codepath, the one that says "if we have
> a tree object already, then use it; otherwise, even though we know
> the object name for this subtree, do not record it in the
> cache-tree", may decide to record the subtree's sha1 without
> refreshing the referent.
>
> A fix may look like this.

Yeah, that's along the lines of what I was expecting, though I'm not
familiar enough with cache-tree to say whether it's sufficient. I
notice there is a return very early on in update_one() when
has_sha1_file() matches, and it seems like that would trigger in some
interesting cases, too.

-Peff
Re: Protecting old temporary objects being reused from concurrent "git gc"?
Jeff King writes:

> I suspect the issue is that read-tree populates the cache-tree index
> extension, and then write-tree omits the object write before it even
> gets to write_sha1_file(). The solution is that it should probably be
> calling one of the freshen() functions (possibly just replacing
> has_sha1_file() with check_and_freshen(), but I haven't looked).

I think the final writing always happens via write_sha1_file(), but an
earlier cache-tree update codepath, the one that says "if we have a
tree object already, then use it; otherwise, even though we know the
object name for this subtree, do not record it in the cache-tree", may
decide to record the subtree's sha1 without refreshing the referent.

A fix may look like this.

 cache-tree.c | 2 +-
 cache.h      | 1 +
 sha1_file.c  | 9 +++++++--
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 345ea35963..3ae6d056b4 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -401,7 +401,7 @@ static int update_one(struct cache_tree *it,
 	if (repair) {
 		unsigned char sha1[20];
 		hash_sha1_file(buffer.buf, buffer.len, tree_type, sha1);
-		if (has_sha1_file(sha1))
+		if (freshen_object(sha1))
 			hashcpy(it->sha1, sha1);
 		else
 			to_invalidate = 1;
diff --git a/cache.h b/cache.h
index 5cdea6833e..1f5694f308 100644
--- a/cache.h
+++ b/cache.h
@@ -1126,6 +1126,7 @@ extern int sha1_object_info(const unsigned char *, unsigned long *);
 extern int hash_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *sha1);
 extern int write_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *return_sha1);
 extern int hash_sha1_file_literally(const void *buf, unsigned long len, const char *type, unsigned char *sha1, unsigned flags);
+extern int freshen_object(const unsigned char *);
 extern int pretend_sha1_file(void *, unsigned long, enum object_type, unsigned char *);
 extern int force_object_loose(const unsigned char *sha1, time_t mtime);
 extern int git_open_cloexec(const char *name, int flags);
diff --git a/sha1_file.c b/sha1_file.c
index e030805497..1daeb05dcd 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3275,6 +3275,11 @@ static int freshen_packed_object(const unsigned char *sha1)
 	return 1;
 }
 
+int freshen_object(const unsigned char *sha1)
+{
+	return freshen_packed_object(sha1) || freshen_loose_object(sha1);
+}
+
 int write_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *sha1)
 {
 	char hdr[32];
@@ -3284,7 +3289,7 @@ int write_sha1_file(const void *buf, unsigned long len, const char *type, unsign
 	 * it out into .git/objects/??/?{38} file.
 	 */
 	write_sha1_file_prepare(buf, len, type, sha1, hdr, &hdrlen);
-	if (freshen_packed_object(sha1) || freshen_loose_object(sha1))
+	if (freshen_object(sha1))
 		return 0;
 	return write_loose_object(sha1, hdr, hdrlen, buf, len, 0);
 }
@@ -3302,7 +3307,7 @@ int hash_sha1_file_literally(const void *buf, unsigned long len, const char *typ
 
 	if (!(flags & HASH_WRITE_OBJECT))
 		goto cleanup;
-	if (freshen_packed_object(sha1) || freshen_loose_object(sha1))
+	if (freshen_object(sha1))
 		goto cleanup;
 	status = write_loose_object(sha1, header, hdrlen, buf, len, 0);
Re: Protecting old temporary objects being reused from concurrent "git gc"?
On Tue, 2016-11-15 at 12:40 -0500, Jeff King wrote:

> On Tue, Nov 15, 2016 at 12:33:04PM -0500, Matt McCutchen wrote:
>
> > On Tue, 2016-11-15 at 12:06 -0500, Jeff King wrote:
> > > - when an object write is optimized out because we already have the
> > >   object, git will update the mtime on the file (loose object or
> > >   packfile) to freshen it
> >
> > FWIW, I am not seeing this happen when I do "git read-tree --prefix"
> > followed by "git write-tree" using the current master (3ab2281). See
> > the attached test script.
>
> The optimization I'm thinking about is the one from write_sha1_file(),
> which learned to freshen in 33d4221c7 (write_sha1_file: freshen
> existing objects, 2014-10-15).
>
> I suspect the issue is that read-tree populates the cache-tree index
> extension, and then write-tree omits the object write before it even
> gets to write_sha1_file(). The solution is that it should probably be
> calling one of the freshen() functions (possibly just replacing
> has_sha1_file() with check_and_freshen(), but I haven't looked).
>
> I'd definitely welcome patches in this area.

Cool, it's nice to have an idea of what's going on. I don't think I'm
going to try to fix it myself, though.

By the way, thanks for the fast response to my original question!

Matt
Re: Protecting old temporary objects being reused from concurrent "git gc"?
On Tue, Nov 15, 2016 at 12:33:04PM -0500, Matt McCutchen wrote:

> On Tue, 2016-11-15 at 12:06 -0500, Jeff King wrote:
> > - when an object write is optimized out because we already have the
> >   object, git will update the mtime on the file (loose object or
> >   packfile) to freshen it
>
> FWIW, I am not seeing this happen when I do "git read-tree --prefix"
> followed by "git write-tree" using the current master (3ab2281). See
> the attached test script.

The optimization I'm thinking about is the one from write_sha1_file(),
which learned to freshen in 33d4221c7 (write_sha1_file: freshen
existing objects, 2014-10-15).

I suspect the issue is that read-tree populates the cache-tree index
extension, and then write-tree omits the object write before it even
gets to write_sha1_file(). The solution is that it should probably be
calling one of the freshen() functions (possibly just replacing
has_sha1_file() with check_and_freshen(), but I haven't looked).

I'd definitely welcome patches in this area.

> OK. I'll write a patch to add a summary of this information to the
> git-gc man page.

Sounds like a good idea. Thanks.

-Peff
Re: Protecting old temporary objects being reused from concurrent "git gc"?
On Tue, 2016-11-15 at 12:06 -0500, Jeff King wrote:

> - when an object write is optimized out because we already have the
>   object, git will update the mtime on the file (loose object or
>   packfile) to freshen it

FWIW, I am not seeing this happen when I do "git read-tree --prefix"
followed by "git write-tree" using the current master (3ab2281). See
the attached test script.

> If you have long-running data (like, a temporary index file that might
> literally sit around for days or weeks) I think that is a potential
> problem. And the solution is probably to use refs in some way to point
> to your objects.

Agreed. This is not my current scenario.

> If you're worried about a short-term operation where
> somebody happens to run git-gc concurrently, I agree it's a possible
> problem, but I suspect something you can ignore in practice.
>
> For the most part, a lot of the client-side git tools assume that one
> operation is happening at a time in the repository. And I think that
> largely holds for a developer working on a single clone, and things
> just work in practice.
>
> Auto-gc makes that a little sketchier, but historically does not seem
> to have really caused problems in practice.

OK. I'll write a patch to add a summary of this information to the
git-gc man page.

Matt

[Attachment: test-git-read-tree-write-tree-touch-object.sh
(application/shellscript)]
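[The attachment itself is not reproduced in the archive. A minimal
reconstruction of the kind of check it describes might look like the
following; this is an assumption, not Matt's actual script, and it
relies on GNU "touch -d" and "stat -c":

  #!/bin/sh
  git init -q repo && cd repo
  echo content >file && git add file
  tree=$(git write-tree)
  obj=.git/objects/$(echo "$tree" | cut -b1-2)/$(echo "$tree" | cut -b3-)
  touch -d '3 weeks ago' "$obj"          # backdate the tree object
  before=$(stat -c %y "$obj")

  export GIT_INDEX_FILE="$PWD/tmp-index"
  git read-tree --prefix=sub/ "$tree"    # cache-tree now records $tree for sub/
  git write-tree >/dev/null              # reuses $tree; does it freshen $obj?
  after=$(stat -c %y "$obj")

  test "$before" = "$after" && echo "tree object was NOT freshened"
]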
Re: Protecting old temporary objects being reused from concurrent "git gc"?
On Tue, Nov 15, 2016 at 09:13:14AM -0500, Matt McCutchen wrote:

> I want to change this to something that won't leave an inconsistent
> state if interrupted. I've written code for this kind of thing before
> that sets GIT_INDEX_FILE and uses a temporary index file and "git
> write-tree". But I realized that if "git gc" runs concurrently, the
> generated tree could be deleted before it is used and the tool would
> fail. If I had a need to run "git commit-tree", it seems like I might
> even end up with a commit object with a broken reference to a tree.
> "git gc" normally doesn't delete objects that were created in the
> last 2 weeks, but if an identical tree was added to the object
> database more than 2 weeks ago by another operation and is
> unreferenced, it could be reused without updating its mtime and it
> could still get deleted.

Modern versions of git do two things to help with this:

  - any object which is referenced by a "recent" object (one created
    within the 2 weeks) is also considered recent. So if you create a
    new commit object that points to a tree, the tree is protected
    even before anything references the commit

  - when an object write is optimized out because we already have the
    object, git will update the mtime on the file (loose object or
    packfile) to freshen it

This isn't perfect, though. You can decide to reference an existing
object just as it is being deleted. And the pruning process itself is
not atomic (and it's tricky to make it so, given what the filesystem
promises us).

> Is there a recommended way to avoid this kind of problem in add-on
> tools? (I searched the Git documentation and the web for information
> about races with "git gc" and didn't find anything useful.) If not,
> it seems to be a significant design flaw in "git gc", even if the
> problem is extremely rare in practice. I wonder if some of the
> built-in commands may have the same problem, though I haven't tried
> to test them. If this is confirmed to be a known problem affecting
> built-in commands, then at least I won't feel bad about introducing
> the same problem into add-on tools. :/

If you have long-running data (like, a temporary index file that might
literally sit around for days or weeks) I think that is a potential
problem. And the solution is probably to use refs in some way to point
to your objects.

If you're worried about a short-term operation where somebody happens
to run git-gc concurrently, I agree it's a possible problem, but I
suspect it is something you can ignore in practice.

For the most part, the client-side git tools assume that one operation
is happening at a time in the repository. And that largely holds for a
developer working on a single clone, so things just work in practice.

Auto-gc makes that a little sketchier, but historically it does not
seem to have caused real problems in practice.

For a busy multi-user server, I recommend turning off auto-gc
entirely, and repacking manually with "-k" to be on the safe side.

-Peff
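[In concrete terms, that last recommendation might look something like
this; a sketch only, worth checking against your git version:

  git config gc.auto 0     # disable auto-gc in the shared repository
  # then, periodically (e.g. from cron):
  git repack -a -d -k      # -k / --keep-unreachable: keep unreachable
                           # objects in the pack rather than pruning them
]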