Re: [msysGit] Re: [PATCH] git tag --contains : avoid stack overflow
Hi Peff, On Mon, 12 Nov 2012, Jeff King wrote: On Mon, Nov 12, 2012 at 11:27:14PM +0100, Jean-Jacques Lafay wrote: 2012/11/11 Jeff King p...@peff.net: On Sun, Nov 11, 2012 at 05:46:32PM +0100, René Scharfe wrote: Ultimately, I have some ideas for doing this in a breadth-first way, which would make it more naturally iterative. It would involve having N bits of storage per commit to check N tags, but it would mean that we could get accurate answers in the face of clock skew (like the merge-base calculation, it would merely get slower in the face of skew). I guess the optimal algorithm may also depend on the commit graph general shape, but intuitively, I'd say that the critical factor is the number and distribution of tags. As soon as you have a significant number of tags (let's say 1% of the commits are tagged, evenly distributed), you'll quickly end up with every commit marked as containing or not the target commit, so that each additional tag check is cheap. This suggests a complexity of O(number of commits) more often then not, however you choose to traverse the graph. We can do much better than O(number of commits), though, if we stop traversing down a path when its timestamp shows that it is too old to contain the commits we are searching for. The problem is that the timestamps cannot always be trusted, because they are generated on machines with wrong clocks, or by buggy software. This could be solved by calculating and caching a generation number, but last time it was discussed there was a lot of arguing and nothing got done. Sadly, not only machines with skewed clocks, but in particular buggy 3rd-party SCMs make this more than just problematic. In a git-svn clone that was used as base for heavy Git development, I encountered quite a lot of Jan 1, 1970 commits. It just cannot be helped, we must distrust timestamps completely. Ciao, Dscho
Re: [msysGit] Re: [PATCH] git tag --contains : avoid stack overflow
On Tue, Nov 13, 2012 at 01:16:01AM +, Johannes Schindelin wrote: We can do much better than O(number of commits), though, if we stop traversing down a path when its timestamp shows that it is too old to contain the commits we are searching for. The problem is that the timestamps cannot always be trusted, because they are generated on machines with wrong clocks, or by buggy software. This could be solved by calculating and caching a generation number, but last time it was discussed there was a lot of arguing and nothing got done. Sadly, not only machines with skewed clocks, but in particular buggy 3rd-party SCMs make this more than just problematic. In a git-svn clone that was used as base for heavy Git development, I encountered quite a lot of Jan 1, 1970 commits. Yeah. We tolerate a certain amount of skew (24 hours for --name-rev, and 5 broken commits in a row for --since). But the big ones are usually software bugs (the big kernel ones were from broken guilt, I think) or broken imports (when I published a bunch of skew statistics last year, the interesting ones were all imports; I don't know if they were software bugs, or just garbage in, garbage out). It just cannot be helped, we must distrust timestamps completely. Note that name-rev will produce wrong answers in the face of clock skew. And I think that you even wrote that code. :) -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [msysGit] Re: [PATCH] git tag --contains : avoid stack overflow
Hi Peff, On Mon, 12 Nov 2012, Jeff King wrote: On Tue, Nov 13, 2012 at 01:16:01AM +, Johannes Schindelin wrote: We can do much better than O(number of commits), though, if we stop traversing down a path when its timestamp shows that it is too old to contain the commits we are searching for. The problem is that the timestamps cannot always be trusted, because they are generated on machines with wrong clocks, or by buggy software. This could be solved by calculating and caching a generation number, but last time it was discussed there was a lot of arguing and nothing got done. Sadly, not only machines with skewed clocks, but in particular buggy 3rd-party SCMs make this more than just problematic. In a git-svn clone that was used as base for heavy Git development, I encountered quite a lot of Jan 1, 1970 commits. Yeah. We tolerate a certain amount of skew (24 hours for --name-rev, and 5 broken commits in a row for --since). But the big ones are usually software bugs (the big kernel ones were from broken guilt, I think) or broken imports (when I published a bunch of skew statistics last year, the interesting ones were all imports; I don't know if they were software bugs, or just garbage in, garbage out). It just cannot be helped, we must distrust timestamps completely. Note that name-rev will produce wrong answers in the face of clock skew. And I think that you even wrote that code. :) IIRC the cute code to short-circuit using the date is not from me. If it is, I am very ashamed. Ciao, Johannes -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [msysGit] Re: [PATCH] git tag --contains : avoid stack overflow
On Tue, Nov 13, 2012 at 04:01:11AM +, Johannes Schindelin wrote: Note that name-rev will produce wrong answers in the face of clock skew. And I think that you even wrote that code. :) IIRC the cute code to short-circuit using the date is not from me. If it is, I am very ashamed. Sorry, but it was: $ git blame -L'/commit-date cutoff/',+1 builtin/name-rev.c bd321bcc name-rev.c (Johannes Schindelin 2005-10-26 15:10:20 +0200 32) if (commit-date cutoff) But it is never too late to fix it. :) -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [msysGit] Re: [PATCH] git tag --contains : avoid stack overflow
Jeff King p...@peff.net writes: Yeah. We tolerate a certain amount of skew (24 hours for --name-rev, and 5 broken commits in a row for --since). But the big ones are usually software bugs (the big kernel ones were from broken guilt, I think) or broken imports (when I published a bunch of skew statistics last year, the interesting ones were all imports; I don't know if they were software bugs, or just garbage in, garbage out). I was hoping that 2e6bdd3 (test-generation: compute generation numbers and clock skews, 2012-09-04) may be a good first step to come up with a practical and cheap solution on top of it. The traversal can be fooled by clock skews when it sees a commit that has a timestamp that is older than it should, causing it to give up, incorrectly thinking that there won't be newer commits that it is interested in behind the problematic commit. The logic implemented by the change is to identify these problematic commits, and we could record these commits with the value of the timestamps they should have had (e.g. the timestamp of the newest ancestor for each of these commits) in a notes tree. Then the traversal logic (commit-list-insert-by-date) could be updated use that corrected timestamp instead not to be fooled by the clock skew. Such a notes tree can be built once and updated by only appending, as a commit will never acquire more ancestors in its parents chain once it is made. Is it too simplistic, or too costly? In git.git we have three such commits whose timestamp need to be corrected, while in the Linux kernel there were 2.2k skewed commits when I counted them a few months ago. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [msysGit] Re: [PATCH] git tag --contains : avoid stack overflow
Hi Peff, On Mon, 12 Nov 2012, Jeff King wrote: On Tue, Nov 13, 2012 at 04:01:11AM +, Johannes Schindelin wrote: Note that name-rev will produce wrong answers in the face of clock skew. And I think that you even wrote that code. :) IIRC the cute code to short-circuit using the date is not from me. If it is, I am very ashamed. Sorry, but it was: $ git blame -L'/commit-date cutoff/',+1 builtin/name-rev.c bd321bcc name-rev.c (Johannes Schindelin 2005-10-26 15:10:20 +0200 32) if (commit-date cutoff) But it is never too late to fix it. :) I will now go and find a hole to hide in. Or alternatively finally go to sleep. Ciao, Johannes -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [msysGit] Re: [PATCH] git tag --contains : avoid stack overflow
On Mon, Nov 12, 2012 at 08:51:37PM -0800, Junio C Hamano wrote: Jeff King p...@peff.net writes: Yeah. We tolerate a certain amount of skew (24 hours for --name-rev, and 5 broken commits in a row for --since). But the big ones are usually software bugs (the big kernel ones were from broken guilt, I think) or broken imports (when I published a bunch of skew statistics last year, the interesting ones were all imports; I don't know if they were software bugs, or just garbage in, garbage out). I was hoping that 2e6bdd3 (test-generation: compute generation numbers and clock skews, 2012-09-04) may be a good first step to come up with a practical and cheap solution on top of it. The traversal can be fooled by clock skews when it sees a commit that has a timestamp that is older than it should, causing it to give up, incorrectly thinking that there won't be newer commits that it is interested in behind the problematic commit. I wrote a similar skew-finding tool last year, though some of the numbers it came up with were different (I remember having many fewer skewed commits in the kernel repo). One problem is that it identifies commits which behave badly with certain algorithms, but it does not identify commits which are wrong. If I skew backwards, it will find my commit. But if I skew forwards, it will label my children as wrong. The logic implemented by the change is to identify these problematic commits, and we could record these commits with the value of the timestamps they should have had (e.g. the timestamp of the newest ancestor for each of these commits) in a notes tree. Then the traversal logic (commit-list-insert-by-date) could be updated use that corrected timestamp instead not to be fooled by the clock skew. Such a notes tree can be built once and updated by only appending, as a commit will never acquire more ancestors in its parents chain once it is made. Is it too simplistic, or too costly? In git.git we have three such commits whose timestamp need to be corrected, while in the Linux kernel there were 2.2k skewed commits when I counted them a few months ago. This came up in the big generations discussion last summer, and I think I even implemented a proof of concept. I couldn't find the actual code, though but only that I got pleasing performance results using a notes tree to store a list of commits with bogus timestamps: http://article.gmane.org/gmane.comp.version-control.git/161101 It is a little wasteful in space if you have a lot of skewed commits (the notes tree stores a 160-bit hash pointing to a blob object storing a 32-bit integer). My personal preference at this point would be: 1. introduce an auxiliary metadata file that would live alongside the pack index and contain generation numbers 2. generate the metadata file during pack indexing. 3. If we have a generation metadata file, but a particular object is not in it, compute the generation; this should be quick because we will hit a file with a stored generation eventually 4. If we do not have any generation metadata files, or if grafts or replace objects are in use, do not use cutoffs in algorithms. Be safe but slow. On the other hand, just switching to doing a single traversal instead of one merge-base computation per tag already got rid of the really awful performance cases. Nobody has complained since that went in, so maybe nobody cares about shaving a few seconds per operation down to a few tens of milliseconds. The real win was shaving tens of seconds down to a few seconds. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [msysGit] Re: [PATCH] git tag --contains : avoid stack overflow
Hi, On Sun, 11 Nov 2012, Jeff King wrote: On Sun, Nov 11, 2012 at 05:46:32PM +0100, René Scharfe wrote: However, I couldn't reproduce it on Linux : where the windows implementations crashes at a ~32000 depth (*not* exactly 32768, mind you), on linux it happily went through 10 commits. I didn't take time to look much further, but maybe on my 64 bit Linux VM, the process can afford to reserve a much bigger address range for the stack of each thread than the 1Mb given to 32 bit processes on windows. Jean-Jacques. I can reproduce it on Linux (Debian testing amd64) with ulimit -s 1000 to reduce the stack size from its default value of 8MB. After reverting ffc4b8012d9a4f92ef238ff72c0d15e9e1b402ed (tag: speed up --contains calculation) the test passes even with the smaller stack, but it makes git tag --contains take thrice the time as before. Right, I am not too surprised. That function replaced the original algorithm with a much faster depth-first recursive one. I haven't looked closely yet at Jean-Jacques' iterative adaptation, but that direction seems like a good fix for now. Ultimately, I have some ideas for doing this in a breadth-first way, which would make it more naturally iterative. It would involve having N bits of storage per commit to check N tags, but it would mean that we could get accurate answers in the face of clock skew (like the merge-base calculation, it would merely get slower in the face of skew). But since I haven't worked on that at all, fixing the depth-first algorithm to be iterative makes sense to me. Have you tried the latest tag-contains branch of git://github.com/msysgit/git/? It contains a couple of brush-ups and a re-write of the recursion (which I hope is right, I had only time to work on it during an unwanted layover at O'Hare). The SHA-1 is fc4f42787a0dd0022d202627681362081a66ef70. Ciao, Johannes