Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE
Hi Peff, On Fri, 10 Aug 2018, Jeff King wrote: > On Fri, Aug 10, 2018 at 06:43:07PM +0200, Johannes Schindelin wrote: > > > So unless you are willing to ignore, to willfully keep this breakage, > > I would suggest not to introduce the ugliness of an overridden > > upload-pack for the sole purpose of disabling the tracing on one side, > > but instead to get this here bug fixed, by helping me with this here > > patch series. > > I'm OK if you want to live with the broken test in the interim. I realize that I failed to tell you that I basically spent 2.5 days worth of worktime to figure this out and come up with three iterations of the patch series (you only saw the latest). I want this patch series in git.git, maybe not in the current form, but in one form or another. I don't want to spend any more minute on trying to figure out the same problem with any other regression test (which might not even be as easily worked around as with a semi-simple --upload-pack option). Thank you for wanting to help. Please accept my apologies for expressing in a poor way that I appreciate your eagerness to provide a patch, but that I think nevertheless that it would be better to work on the GIT_TRACE concurrency problem and its resolution via file locks. It would solve that class of problems, instead of that single regression test being flakey. Sorry, Dscho
Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE
On Fri, Aug 10, 2018 at 11:34:59AM -0700, Junio C Hamano wrote: > Johannes Sixt writes: > > > As this buglet looks like a recurring theme, and a proper fix is > > preferable over repeated work-arounds. To me it looks like we need > > some sort of locking on Windows. Unless your friends at Microsoft have > > an ace in their sleeves that lets us have atomic O_APPEND the POSIX > > way... > > Just to put the severity of the issue in context, we use O_APPEND in > a few codepaths, and the trace thing for debugging is the only thing > that could have multiple writers. Other users of O_APPEND are: > > * refs/files-backend.c uses it so that a reflog entry can be >appended at the end, but because update to each ref is protected >from racing at a lot higher layer with a lock, no two people >would try to append to the same reflog file, so atomicity of >O_APPEND does not matter here. Just an interesting side note, but I think one of the reasons we use dot-locks and not flock() is that the former generally works on network filesystems. I think O_APPEND also is not reliable for non-local files, though, so short of actually dot-locking trace files (yuck) I don't think there's a great solution there. > It may make sense to allow GIT_TRACE to have a placeholder > (e.g. "/tmp/traceout.$$") to help debuggers arrange to give > different processes their own trace output file, which perhaps may > be a simpler and less impactful to the performance solution than > having to make locks at an application layer. Heh, I actually wrote that patch several years ago, as part of an abandoned effort to have config-triggered tracing (e.g., on the server side where you're getting hundreds of requests and you want to enable tracing on a repo for a slice of time). One annoying thing for a case like the test in question is that you don't actually know the pid of the process you're interested in. So we'd generate two trace files, and then you'd have to figure out which one is which. In my earlier work, I coupled it with some magic to allow per-command config, like: [trace "fetch"] packet = /tmp/trace.%p but you could also imagine a "%n" which uses the command name. I don't remember why I abandoned it, but I think a lot had to do with violating the "don't look at config" rules for our repo startup code. A lot of that has been fixed since then, but I haven't really needed it. If anybody is really interested, the patches are at: https://github.com/peff/git jk/trace-stdin (my ultimate goal was to record stdin for pack-objects to replay slow fetches, but I ended up hacking together a patch to pack-objects instead). -Peff
Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE
Jeff King writes: > I have tried to help with the actual problem, by identifying the root > cause (that the trace code's strategy is not fundamentally broken, but > that it relies on O_APPEND whose guarantees are obviously not portable > to Windows) and exploring possible solutions there (FILE_APPEND_DATA). > > Lacking any kind of Windows development environment, I'd just as soon > stop there and let more knowledgeable people take over. But please don't > characterize my presence in this thread as some kind of willful harm or > ignoring of the problem. Thanks. It seems like the other Johannes and the other Jeff are also interested, so we may hopefully see a properly fixed O_APPEND, which would be the real solution that would help appenders that are *not* tracing subsystem, too.
Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE
Johannes Sixt writes: > As this buglet looks like a recurring theme, and a proper fix is > preferable over repeated work-arounds. To me it looks like we need > some sort of locking on Windows. Unless your friends at Microsoft have > an ace in their sleeves that lets us have atomic O_APPEND the POSIX > way... Just to put the severity of the issue in context, we use O_APPEND in a few codepaths, and the trace thing for debugging is the only thing that could have multiple writers. Other users of O_APPEND are: * refs/files-backend.c uses it so that a reflog entry can be appended at the end, but because update to each ref is protected from racing at a lot higher layer with a lock, no two people would try to append to the same reflog file, so atomicity of O_APPEND does not matter here. * sequencer.c wants to use it when moving one insn from the todo list to the 'done' list when it finishes one operation. If you are running two sequences in a single repository, intermixed 'done' list would be the least of your problem, so presumably we are fine here. It may make sense to allow GIT_TRACE to have a placeholder (e.g. "/tmp/traceout.$$") to help debuggers arrange to give different processes their own trace output file, which perhaps may be a simpler and less impactful to the performance solution than having to make locks at an application layer.
Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE
On Fri, Aug 10, 2018 at 06:43:07PM +0200, Johannes Schindelin wrote: > > > Correct. But I am more worried about the "mixed/overwriting" > > > breakage, if there is one; it means we may need to be prepared for > > > systems that lack O_APPEND that works correctly. I initially just > > > assumed that it was what Dscho was seeing, but after re-reading his > > > message, I am not sure anymore. > > > > > > I think the "do not trace the other side" approach you suggest for > > > these tests that only care about one side is more appropriate > > > solution for this particular case. We then do not have to worry > > > about overwriting or output from both sides mixed randomly. > > Just so everybody knows: I never received the above two mails. Something > is seriously rotten in GMX. This seems to have started a week or two ago. I think it may just have been that they went to the gitgitgadget address; I explicitly added you as a cc when I sent the patch with a commit message, which was when I first noticed you weren't cc'd on the earlier bits. > What you cannot see easily, unless you go the route that I offered > Jonathan (open a PR on gitgitgadget/git which will automatically run the > test suite on Windows, macOS and Linux, and of course you can do anything > in a PR, including narrowing down what tests are run) is that sometimes > those lines in the `trace` file were clearly *incomplete*. That is, even > if both processes tried to write atomically, one managed to overwrite the > other's buffer in flight. > > This is a pretty serious thing, even worse than the failing test suite, > and I don't think that your patch acknowledges how serious this is. I don't see anything in what you wrote above or in your original cover letter or series that contradicts the notion that O_APPEND is simply not atomic on Windows. Overwriting the same byte ranges in the file is what I would expect on Linux, as well, if the file were opened without O_APPEND. I have mixed reports from searching online whether an equivalent to O_APPEND exists. Some sources seem to say that FILE_APPEND_DATA is. I couldn't find anything definitive in Microsoft documentation about whether it actually provides atomicity. From my (admittedly limited) digging, that does not seem to be used by the msys2 runtime, so it might be worth exploring. > And please don't give me that "but it works on Linux" response. Seriously. > Sheesh. But it does. And it's designed to. There is literally nothing to fix on Linux. > Even if you have not managed to trigger it, the POSIX standard seems not > to define clearly what the behavior is of two competing write() calls, > unless you lock the files appropriately, as my patch series does. POSIX says: If the O_APPEND flag of the file status flags is set, the file offset shall be set to the end of the file prior to each write and no intervening file modification operation shall occur between changing the file offset and the write operation. and: This volume of POSIX.1-2017 does not specify the behavior of concurrent writes to a regular file from multiple threads, except that each write is atomic. I admit the writing is a little turgid, but I think between the two it is promising the behavior that we expect (and I think that interpretation is pretty common in the Unix world). > So unless you are willing to ignore, to willfully keep this breakage, I > would suggest not to introduce the ugliness of an overridden upload-pack > for the sole purpose of disabling the tracing on one side, but instead to > get this here bug fixed, by helping me with this here patch series. I'm OK if you want to live with the broken test in the interim. But after your complaining about the imminence of it "infecting" master, I thought you might be happy to see a solution that was more immediate. I have tried to help with the actual problem, by identifying the root cause (that the trace code's strategy is not fundamentally broken, but that it relies on O_APPEND whose guarantees are obviously not portable to Windows) and exploring possible solutions there (FILE_APPEND_DATA). Lacking any kind of Windows development environment, I'd just as soon stop there and let more knowledgeable people take over. But please don't characterize my presence in this thread as some kind of willful harm or ignoring of the problem. -Peff
Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE
Am 10.08.2018 um 18:51 schrieb Jeff Hostetler: On 8/10/2018 12:15 PM, Johannes Sixt wrote: Am 09.08.2018 um 19:35 schrieb Johannes Schindelin via GitGitGadget: I reported a couple of times that t5552 is not passing reliably. It has now reached next, and will no doubt infect master soon. Turns out that it is not a Windows-specific issue, even if it occurs a lot more often on Windows than elsewhere. The culprit is that two processes try simultaneously to write to the same file specified via GIT_TRACE_PACKET, and it is not well defined how that should work, even on Linux. Thanks for digging down to the root cause. As has been said, the behavior of O_APPEND is well-defined under POSIX, but last time I looked for equivalent feature on Windows, I did not find any. Last time was when I worked around the same failure in t5503-tagfollow.sh in my private builds: https://github.com/j6t/git/commit/9a447a6844b50b43746d9765b3ac809e2793d742 It is basically the same as Peff suggests: log only one side of the fetch. As this buglet looks like a recurring theme, and a proper fix is preferable over repeated work-arounds. To me it looks like we need some sort of locking on Windows. Unless your friends at Microsoft have an ace in their sleeves that lets us have atomic O_APPEND the POSIX way... The official Windows CRT for open() does document an O_APPEND, but after looking at the distributed source, I'm not sure I trust it. I looked at the CRT code, too, and no, we can't trust it. I have not looked at the MSYS version of the CRT yet so I cannot comment on it. I did find that the following code does what we want. Notice the use of FILE_APPEND_DATA in place of the usual GENERIC_READ. (And yes, the docs are very vague here.) As far as I understand, FILE_APPEND_DATA is just a permission flag (I'm not sure why that would be necessary), but at least the documentation states that it is still necessary for the caller to set the file pointer. But I haven't tried your code below, yet. Should it append the line of text on each invocation? And if so, is it an atomic operation? Before we try a locking solution, perhaps we should tweak mingw_open() to use something like this to get a proper HANDLE directly and then use _open_osfhandle() to get a "fd" for it. Jeff - #include int main() { HANDLE h = CreateFileW(L"foo.txt", FILE_APPEND_DATA, FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL); const char * buf = "this is a test\n"; DWORD len = strlen(buf); DWORD nbw; WriteFile(h, buf, len, &nbw, NULL); CloseHandle(h); return 0; }
Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE
On 8/10/2018 12:51 PM, Jeff Hostetler wrote: On 8/10/2018 12:15 PM, Johannes Sixt wrote: Am 09.08.2018 um 19:35 schrieb Johannes Schindelin via GitGitGadget: I reported a couple of times that t5552 is not passing reliably. It has now reached next, and will no doubt infect master soon. Turns out that it is not a Windows-specific issue, even if it occurs a lot more often on Windows than elsewhere. The culprit is that two processes try simultaneously to write to the same file specified via GIT_TRACE_PACKET, and it is not well defined how that should work, even on Linux. Thanks for digging down to the root cause. As has been said, the behavior of O_APPEND is well-defined under POSIX, but last time I looked for equivalent feature on Windows, I did not find any. Last time was when I worked around the same failure in t5503-tagfollow.sh in my private builds: https://github.com/j6t/git/commit/9a447a6844b50b43746d9765b3ac809e2793d742 It is basically the same as Peff suggests: log only one side of the fetch. As this buglet looks like a recurring theme, and a proper fix is preferable over repeated work-arounds. To me it looks like we need some sort of locking on Windows. Unless your friends at Microsoft have an ace in their sleeves that lets us have atomic O_APPEND the POSIX way... The official Windows CRT for open() does document an O_APPEND, but after looking at the distributed source, I'm not sure I trust it. I have not looked at the MSYS version of the CRT yet so I cannot comment on it. I did find that the following code does what we want. Notice the use of FILE_APPEND_DATA in place of the usual GENERIC_READ. (And yes, the docs are very vague here.) d'oh s/GENERIC_READ/GENERIC_WRITE/ Before we try a locking solution, perhaps we should tweak mingw_open() to use something like this to get a proper HANDLE directly and then use _open_osfhandle() to get a "fd" for it. Jeff - #include int main() { HANDLE h = CreateFileW(L"foo.txt", FILE_APPEND_DATA, FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL); const char * buf = "this is a test\n"; DWORD len = strlen(buf); DWORD nbw; WriteFile(h, buf, len, &nbw, NULL); CloseHandle(h); return 0; }
Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE
On 8/10/2018 12:15 PM, Johannes Sixt wrote: Am 09.08.2018 um 19:35 schrieb Johannes Schindelin via GitGitGadget: I reported a couple of times that t5552 is not passing reliably. It has now reached next, and will no doubt infect master soon. Turns out that it is not a Windows-specific issue, even if it occurs a lot more often on Windows than elsewhere. The culprit is that two processes try simultaneously to write to the same file specified via GIT_TRACE_PACKET, and it is not well defined how that should work, even on Linux. Thanks for digging down to the root cause. As has been said, the behavior of O_APPEND is well-defined under POSIX, but last time I looked for equivalent feature on Windows, I did not find any. Last time was when I worked around the same failure in t5503-tagfollow.sh in my private builds: https://github.com/j6t/git/commit/9a447a6844b50b43746d9765b3ac809e2793d742 It is basically the same as Peff suggests: log only one side of the fetch. As this buglet looks like a recurring theme, and a proper fix is preferable over repeated work-arounds. To me it looks like we need some sort of locking on Windows. Unless your friends at Microsoft have an ace in their sleeves that lets us have atomic O_APPEND the POSIX way... The official Windows CRT for open() does document an O_APPEND, but after looking at the distributed source, I'm not sure I trust it. I have not looked at the MSYS version of the CRT yet so I cannot comment on it. I did find that the following code does what we want. Notice the use of FILE_APPEND_DATA in place of the usual GENERIC_READ. (And yes, the docs are very vague here.) Before we try a locking solution, perhaps we should tweak mingw_open() to use something like this to get a proper HANDLE directly and then use _open_osfhandle() to get a "fd" for it. Jeff - #include int main() { HANDLE h = CreateFileW(L"foo.txt", FILE_APPEND_DATA, FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL); const char * buf = "this is a test\n"; DWORD len = strlen(buf); DWORD nbw; WriteFile(h, buf, len, &nbw, NULL); CloseHandle(h); return 0; }
Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE
Hi, On Fri, 10 Aug 2018, Jeff King wrote: > On Thu, Aug 09, 2018 at 01:49:52PM -0700, Junio C Hamano wrote: > > > Jeff King writes: > > > > > Are you sure that it's not well-defined? We open the path with O_APPEND, > > > which means every write() will be atomically positioned at the end of > > > file. So we would never lose or overwrite data. > > > > > > We do our own buffering in a strbuf, writing the result out in a single > > > write() call (modulo the OS returning a short write, but that should not > > > generally happen when writing short strings to a file). So we should get > > > individual trace lines as atomic units. > > > > > > The order of lines from the two processes is undefined, of course. > > > > Correct. But I am more worried about the "mixed/overwriting" > > breakage, if there is one; it means we may need to be prepared for > > systems that lack O_APPEND that works correctly. I initially just > > assumed that it was what Dscho was seeing, but after re-reading his > > message, I am not sure anymore. > > > > I think the "do not trace the other side" approach you suggest for > > these tests that only care about one side is more appropriate > > solution for this particular case. We then do not have to worry > > about overwriting or output from both sides mixed randomly. Just so everybody knows: I never received the above two mails. Something is seriously rotten in GMX. This seems to have started a week or two ago. > Here it is as a patch on top of jt/fetch-negotiator-skipping, which lets > us pursue any fix for interleaved trace output on Windows without the > pressure of an impending flaky test. > > My gut says that looking into making O_APPEND work there is going to be > the nicest solution, but my gut is not very well versed in Windows > subtleties. ;) Your patch seems to be a patch in the original sense of the word: it is not a solution. And let me spell it out very clearly (as I have mentioned to Jonathan a couple times, without any response): t5552 fails regularly, way more than rarely, see for yourself: https://git-for-windows.visualstudio.com/git/_build/index?definitionId=1&_a=history And the reason it fails was analyzed by me and described in the commit message (I am sorry that the cover letter was still stale and talked about flock(), which I had decided was not the right approach after fighting with Linux over it). What you cannot see easily, unless you go the route that I offered Jonathan (open a PR on gitgitgadget/git which will automatically run the test suite on Windows, macOS and Linux, and of course you can do anything in a PR, including narrowing down what tests are run) is that sometimes those lines in the `trace` file were clearly *incomplete*. That is, even if both processes tried to write atomically, one managed to overwrite the other's buffer in flight. This is a pretty serious thing, even worse than the failing test suite, and I don't think that your patch acknowledges how serious this is. And please don't give me that "but it works on Linux" response. Seriously. Sheesh. Even if you have not managed to trigger it, the POSIX standard seems not to define clearly what the behavior is of two competing write() calls, unless you lock the files appropriately, as my patch series does. So unless you are willing to ignore, to willfully keep this breakage, I would suggest not to introduce the ugliness of an overridden upload-pack for the sole purpose of disabling the tracing on one side, but instead to get this here bug fixed, by helping me with this here patch series. We don't have to rush this, you know? We have to fix it, though. Thank you, Dscho > -- >8 -- > Subject: [PATCH] t5552: suppress upload-pack trace output > > The t5552 test script uses GIT_TRACE_PACKET to monitor what > git-fetch sends and receives. However, because we're > accessing a local repository, the child upload-pack also > sends trace output to the same file. > > On Linux, this works out OK. We open the trace file with > O_APPEND, so all writes are atomically positioned at the end > of the file. No data can be overwritten or omitted. And > since we prepare our small writes in a strbuf and write them > with a single write(), we should see each line as an atomic > unit. The order of lines between the two processes is > undefined, but the test script greps only for "fetch>" or > "fetch<" lines. So under Linux, the test results are > deterministic. > > The test fails intermittently on Windows, however, > reportedly even overwriting bits of the output file (i.e., > O_APPEND does not seem to give us an atomic position+write). > > Since the test only cares about the trace output from fetch, > we can just disable the output from upload-pack. That > doesn't solve the greater question of O_APPEND/trace issues > under Windows, but it easily fixes the flakiness from this > test. > > Reported-by: Johannes Schindelin > Signed-off-by: Jeff King > --- > I'm assuming that this really isn't trig
Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE
Am 09.08.2018 um 19:35 schrieb Johannes Schindelin via GitGitGadget: I reported a couple of times that t5552 is not passing reliably. It has now reached next, and will no doubt infect master soon. Turns out that it is not a Windows-specific issue, even if it occurs a lot more often on Windows than elsewhere. The culprit is that two processes try simultaneously to write to the same file specified via GIT_TRACE_PACKET, and it is not well defined how that should work, even on Linux. Thanks for digging down to the root cause. As has been said, the behavior of O_APPEND is well-defined under POSIX, but last time I looked for equivalent feature on Windows, I did not find any. Last time was when I worked around the same failure in t5503-tagfollow.sh in my private builds: https://github.com/j6t/git/commit/9a447a6844b50b43746d9765b3ac809e2793d742 It is basically the same as Peff suggests: log only one side of the fetch. As this buglet looks like a recurring theme, and a proper fix is preferable over repeated work-arounds. To me it looks like we need some sort of locking on Windows. Unless your friends at Microsoft have an ace in their sleeves that lets us have atomic O_APPEND the POSIX way... -- Hannes
Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE
Jeff King writes: > Here it is as a patch on top of jt/fetch-negotiator-skipping, which lets > us pursue any fix for interleaved trace output on Windows without the > pressure of an impending flaky test. > > My gut says that looking into making O_APPEND work there is going to be > the nicest solution, but my gut is not very well versed in Windows > subtleties. ;) As we know these tests are only interested in "fetch" lines and not record of communication in the other direction, I think this is a nice clean-up that is worth doing. For only two grep's it may not look worth doing, but it would be also a good clean-up to give an got_acked helper that sits next to have_sent and have_not_sent helpers for readability. That is of course outside the scope of this change. Will queue. Thanks. > -- >8 -- > Subject: [PATCH] t5552: suppress upload-pack trace output > > The t5552 test script uses GIT_TRACE_PACKET to monitor what > git-fetch sends and receives. However, because we're > accessing a local repository, the child upload-pack also > sends trace output to the same file. > > On Linux, this works out OK. We open the trace file with > O_APPEND, so all writes are atomically positioned at the end > of the file. No data can be overwritten or omitted. And > since we prepare our small writes in a strbuf and write them > with a single write(), we should see each line as an atomic > unit. The order of lines between the two processes is > undefined, but the test script greps only for "fetch>" or > "fetch<" lines. So under Linux, the test results are > deterministic. > > The test fails intermittently on Windows, however, > reportedly even overwriting bits of the output file (i.e., > O_APPEND does not seem to give us an atomic position+write). > > Since the test only cares about the trace output from fetch, > we can just disable the output from upload-pack. That > doesn't solve the greater question of O_APPEND/trace issues > under Windows, but it easily fixes the flakiness from this > test. > > Reported-by: Johannes Schindelin > Signed-off-by: Jeff King > --- > I'm assuming that this really isn't triggerable on Linux. I tried and > couldn't manage to get it to fail, and the reasoning above explains why. > But I wasn't 100% clear that Dscho hadn't seen it fail on non-Windows. > > t/t5552-skipping-fetch-negotiator.sh | 23 ++- > 1 file changed, 18 insertions(+), 5 deletions(-) > > diff --git a/t/t5552-skipping-fetch-negotiator.sh > b/t/t5552-skipping-fetch-negotiator.sh > index 0a8e0e42ed..0ad50dd839 100755 > --- a/t/t5552-skipping-fetch-negotiator.sh > +++ b/t/t5552-skipping-fetch-negotiator.sh > @@ -28,6 +28,19 @@ have_not_sent () { > done > } > > +# trace_fetch [args] > +# > +# Trace the packet output of fetch, but make sure we disable the variable > +# in the child upload-pack, so we don't combine the results in the same file. > +trace_fetch () { > + client=$1; shift > + server=$1; shift > + GIT_TRACE_PACKET="$(pwd)/trace" \ > + git -C "$client" fetch \ > + --upload-pack 'unset GIT_TRACE_PACKET; git-upload-pack' \ > + "$server" "$@" > +} > + > test_expect_success 'commits with no parents are sent regardless of skip > distance' ' > git init server && > test_commit -C server to_fetch && > @@ -42,7 +55,7 @@ test_expect_success 'commits with no parents are sent > regardless of skip distanc > # "c1" has no parent, it is still sent as "have" even though it would > # normally be skipped. > test_config -C client fetch.negotiationalgorithm skipping && > - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" && > + trace_fetch client "$(pwd)/server" && > have_sent c7 c5 c2 c1 && > have_not_sent c6 c4 c3 > ' > @@ -65,7 +78,7 @@ test_expect_success 'when two skips collide, favor the > larger one' ' > # the next "have" sent will be "c1" (from "c6" skip 4) and not "c4" > # (from "c5side" skip 1). > test_config -C client fetch.negotiationalgorithm skipping && > - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" && > + trace_fetch client "$(pwd)/server" && > have_sent c5side c11 c9 c6 c1 && > have_not_sent c10 c8 c7 c5 c4 c3 c2 > ' > @@ -91,7 +104,7 @@ test_expect_success 'use ref advertisement to filter out > commits' ' > # not need to send any ancestors of "c3", but we still need to send "c3" > # itself. > test_config -C client fetch.negotiationalgorithm skipping && > - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch origin to_fetch && > + trace_fetch client origin to_fetch && > have_sent c5 c4^ c2side && > have_not_sent c4 c4^^ c4^^^ > ' > @@ -121,7 +134,7 @@ test_expect_success 'handle clock skew' ' > # and sent, because (due to clock skew) its only parent has already been > # popped off the priority queue. > test_config -C client fetch.negotiationalgorithm skipping && > - GI
Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE
Jeff King writes: > Here it is as a patch on top of jt/fetch-negotiator-skipping, which lets > us pursue any fix for interleaved trace output on Windows without the > pressure of an impending flaky test. > > My gut says that looking into making O_APPEND work there is going to be > the nicest solution, but my gut is not very well versed in Windows > subtleties. ;) As we know these tests are only interested in "fetch" lines and not record of communication in the other direction, I think this is a nice clean-up that is worth doing. For only two grep's it may not look worth doing, but it would be also a good clean-up to give an got_acked helper that sits next to have_sent and have_not_sent helpers for readability. That is of course outside the scope of this change. Will queue. Thansk. > -- >8 -- > Subject: [PATCH] t5552: suppress upload-pack trace output > > The t5552 test script uses GIT_TRACE_PACKET to monitor what > git-fetch sends and receives. However, because we're > accessing a local repository, the child upload-pack also > sends trace output to the same file. > > On Linux, this works out OK. We open the trace file with > O_APPEND, so all writes are atomically positioned at the end > of the file. No data can be overwritten or omitted. And > since we prepare our small writes in a strbuf and write them > with a single write(), we should see each line as an atomic > unit. The order of lines between the two processes is > undefined, but the test script greps only for "fetch>" or > "fetch<" lines. So under Linux, the test results are > deterministic. > > The test fails intermittently on Windows, however, > reportedly even overwriting bits of the output file (i.e., > O_APPEND does not seem to give us an atomic position+write). > > Since the test only cares about the trace output from fetch, > we can just disable the output from upload-pack. That > doesn't solve the greater question of O_APPEND/trace issues > under Windows, but it easily fixes the flakiness from this > test. > > Reported-by: Johannes Schindelin > Signed-off-by: Jeff King > --- > I'm assuming that this really isn't triggerable on Linux. I tried and > couldn't manage to get it to fail, and the reasoning above explains why. > But I wasn't 100% clear that Dscho hadn't seen it fail on non-Windows. > > t/t5552-skipping-fetch-negotiator.sh | 23 ++- > 1 file changed, 18 insertions(+), 5 deletions(-) > > diff --git a/t/t5552-skipping-fetch-negotiator.sh > b/t/t5552-skipping-fetch-negotiator.sh > index 0a8e0e42ed..0ad50dd839 100755 > --- a/t/t5552-skipping-fetch-negotiator.sh > +++ b/t/t5552-skipping-fetch-negotiator.sh > @@ -28,6 +28,19 @@ have_not_sent () { > done > } > > +# trace_fetch [args] > +# > +# Trace the packet output of fetch, but make sure we disable the variable > +# in the child upload-pack, so we don't combine the results in the same file. > +trace_fetch () { > + client=$1; shift > + server=$1; shift > + GIT_TRACE_PACKET="$(pwd)/trace" \ > + git -C "$client" fetch \ > + --upload-pack 'unset GIT_TRACE_PACKET; git-upload-pack' \ > + "$server" "$@" > +} > + > test_expect_success 'commits with no parents are sent regardless of skip > distance' ' > git init server && > test_commit -C server to_fetch && > @@ -42,7 +55,7 @@ test_expect_success 'commits with no parents are sent > regardless of skip distanc > # "c1" has no parent, it is still sent as "have" even though it would > # normally be skipped. > test_config -C client fetch.negotiationalgorithm skipping && > - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" && > + trace_fetch client "$(pwd)/server" && > have_sent c7 c5 c2 c1 && > have_not_sent c6 c4 c3 > ' > @@ -65,7 +78,7 @@ test_expect_success 'when two skips collide, favor the > larger one' ' > # the next "have" sent will be "c1" (from "c6" skip 4) and not "c4" > # (from "c5side" skip 1). > test_config -C client fetch.negotiationalgorithm skipping && > - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" && > + trace_fetch client "$(pwd)/server" && > have_sent c5side c11 c9 c6 c1 && > have_not_sent c10 c8 c7 c5 c4 c3 c2 > ' > @@ -91,7 +104,7 @@ test_expect_success 'use ref advertisement to filter out > commits' ' > # not need to send any ancestors of "c3", but we still need to send "c3" > # itself. > test_config -C client fetch.negotiationalgorithm skipping && > - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch origin to_fetch && > + trace_fetch client origin to_fetch && > have_sent c5 c4^ c2side && > have_not_sent c4 c4^^ c4^^^ > ' > @@ -121,7 +134,7 @@ test_expect_success 'handle clock skew' ' > # and sent, because (due to clock skew) its only parent has already been > # popped off the priority queue. > test_config -C client fetch.negotiationalgorithm skipping && > - GI
Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE
On Thu, Aug 09, 2018 at 01:49:52PM -0700, Junio C Hamano wrote: > Jeff King writes: > > > Are you sure that it's not well-defined? We open the path with O_APPEND, > > which means every write() will be atomically positioned at the end of > > file. So we would never lose or overwrite data. > > > > We do our own buffering in a strbuf, writing the result out in a single > > write() call (modulo the OS returning a short write, but that should not > > generally happen when writing short strings to a file). So we should get > > individual trace lines as atomic units. > > > > The order of lines from the two processes is undefined, of course. > > Correct. But I am more worried about the "mixed/overwriting" > breakage, if there is one; it means we may need to be prepared for > systems that lack O_APPEND that works correctly. I initially just > assumed that it was what Dscho was seeing, but after re-reading his > message, I am not sure anymore. > > I think the "do not trace the other side" approach you suggest for > these tests that only care about one side is more appropriate > solution for this particular case. We then do not have to worry > about overwriting or output from both sides mixed randomly. Here it is as a patch on top of jt/fetch-negotiator-skipping, which lets us pursue any fix for interleaved trace output on Windows without the pressure of an impending flaky test. My gut says that looking into making O_APPEND work there is going to be the nicest solution, but my gut is not very well versed in Windows subtleties. ;) -- >8 -- Subject: [PATCH] t5552: suppress upload-pack trace output The t5552 test script uses GIT_TRACE_PACKET to monitor what git-fetch sends and receives. However, because we're accessing a local repository, the child upload-pack also sends trace output to the same file. On Linux, this works out OK. We open the trace file with O_APPEND, so all writes are atomically positioned at the end of the file. No data can be overwritten or omitted. And since we prepare our small writes in a strbuf and write them with a single write(), we should see each line as an atomic unit. The order of lines between the two processes is undefined, but the test script greps only for "fetch>" or "fetch<" lines. So under Linux, the test results are deterministic. The test fails intermittently on Windows, however, reportedly even overwriting bits of the output file (i.e., O_APPEND does not seem to give us an atomic position+write). Since the test only cares about the trace output from fetch, we can just disable the output from upload-pack. That doesn't solve the greater question of O_APPEND/trace issues under Windows, but it easily fixes the flakiness from this test. Reported-by: Johannes Schindelin Signed-off-by: Jeff King --- I'm assuming that this really isn't triggerable on Linux. I tried and couldn't manage to get it to fail, and the reasoning above explains why. But I wasn't 100% clear that Dscho hadn't seen it fail on non-Windows. t/t5552-skipping-fetch-negotiator.sh | 23 ++- 1 file changed, 18 insertions(+), 5 deletions(-) diff --git a/t/t5552-skipping-fetch-negotiator.sh b/t/t5552-skipping-fetch-negotiator.sh index 0a8e0e42ed..0ad50dd839 100755 --- a/t/t5552-skipping-fetch-negotiator.sh +++ b/t/t5552-skipping-fetch-negotiator.sh @@ -28,6 +28,19 @@ have_not_sent () { done } +# trace_fetch [args] +# +# Trace the packet output of fetch, but make sure we disable the variable +# in the child upload-pack, so we don't combine the results in the same file. +trace_fetch () { + client=$1; shift + server=$1; shift + GIT_TRACE_PACKET="$(pwd)/trace" \ + git -C "$client" fetch \ + --upload-pack 'unset GIT_TRACE_PACKET; git-upload-pack' \ + "$server" "$@" +} + test_expect_success 'commits with no parents are sent regardless of skip distance' ' git init server && test_commit -C server to_fetch && @@ -42,7 +55,7 @@ test_expect_success 'commits with no parents are sent regardless of skip distanc # "c1" has no parent, it is still sent as "have" even though it would # normally be skipped. test_config -C client fetch.negotiationalgorithm skipping && - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" && + trace_fetch client "$(pwd)/server" && have_sent c7 c5 c2 c1 && have_not_sent c6 c4 c3 ' @@ -65,7 +78,7 @@ test_expect_success 'when two skips collide, favor the larger one' ' # the next "have" sent will be "c1" (from "c6" skip 4) and not "c4" # (from "c5side" skip 1). test_config -C client fetch.negotiationalgorithm skipping && - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" && + trace_fetch client "$(pwd)/server" && have_sent c5side c11 c9 c6 c1 && have_not_sent c10 c8 c7 c5 c4 c3 c2 ' @@ -91,7 +104,7 @@ test_expect_success 'use ref advertisement to filter out commits'
Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE
On Thu, Aug 09, 2018 at 02:08:56PM -0700, Junio C Hamano wrote: > > Correct. But I am more worried about the "mixed/overwriting" > > breakage, if there is one; it means we may need to be prepared for > > systems that lack O_APPEND that works correctly. I initially just > > assumed that it was what Dscho was seeing, but after re-reading his > > message, I am not sure anymore. > > > > I think the "do not trace the other side" approach you suggest for > > these tests that only care about one side is more appropriate > > solution for this particular case. We then do not have to worry > > about overwriting or output from both sides mixed randomly. > > A concluding sentence I forgot to add, after saying "this is simpler > and better to fix test breakage", was > > But if we really are seeing O_APPEND breakage, a mandatory > locking mechanism like this one might be necessary to work > around it (I seriously hope we do not have to, though). > > Sorry for an additional noise. Yeah, my earlier email said "if we just care about this test, then...". But if we do care about making this multiple-tracing case work all the time, then we'd need some solution. If there's something that can work like O_APPEND, even if we have to make some system-specific call, I'd prefer that to locking. I did some searching for atomic append on Windows and got mixed reports on whether their FILE_DATA_APPEND does what we want or not. Knowing nothing about msys, I'd _assume_ that O_APPEND would get translated to that flag, but I got lost trying to find it in git-for-windows/msys2-runtime. Wouldn't it be wonderful if the solution were as simple as making sure that flag was plumbed through? I seriously doubt it will be that easy, though. :) -Peff
Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE
Junio C Hamano writes: > Jeff King writes: > >> Are you sure that it's not well-defined? We open the path with O_APPEND, >> which means every write() will be atomically positioned at the end of >> file. So we would never lose or overwrite data. >> >> We do our own buffering in a strbuf, writing the result out in a single >> write() call (modulo the OS returning a short write, but that should not >> generally happen when writing short strings to a file). So we should get >> individual trace lines as atomic units. >> >> The order of lines from the two processes is undefined, of course. > > Correct. But I am more worried about the "mixed/overwriting" > breakage, if there is one; it means we may need to be prepared for > systems that lack O_APPEND that works correctly. I initially just > assumed that it was what Dscho was seeing, but after re-reading his > message, I am not sure anymore. > > I think the "do not trace the other side" approach you suggest for > these tests that only care about one side is more appropriate > solution for this particular case. We then do not have to worry > about overwriting or output from both sides mixed randomly. A concluding sentence I forgot to add, after saying "this is simpler and better to fix test breakage", was But if we really are seeing O_APPEND breakage, a mandatory locking mechanism like this one might be necessary to work around it (I seriously hope we do not have to, though). Sorry for an additional noise.
Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE
Jeff King writes: > Are you sure that it's not well-defined? We open the path with O_APPEND, > which means every write() will be atomically positioned at the end of > file. So we would never lose or overwrite data. > > We do our own buffering in a strbuf, writing the result out in a single > write() call (modulo the OS returning a short write, but that should not > generally happen when writing short strings to a file). So we should get > individual trace lines as atomic units. > > The order of lines from the two processes is undefined, of course. Correct. But I am more worried about the "mixed/overwriting" breakage, if there is one; it means we may need to be prepared for systems that lack O_APPEND that works correctly. I initially just assumed that it was what Dscho was seeing, but after re-reading his message, I am not sure anymore. I think the "do not trace the other side" approach you suggest for these tests that only care about one side is more appropriate solution for this particular case. We then do not have to worry about overwriting or output from both sides mixed randomly. > diff --git a/t/t5552-skipping-fetch-negotiator.sh > b/t/t5552-skipping-fetch-negotiator.sh > index 3b60bd44e0..5ad5bece55 100755 > --- a/t/t5552-skipping-fetch-negotiator.sh > +++ b/t/t5552-skipping-fetch-negotiator.sh > @@ -28,6 +28,19 @@ have_not_sent () { > done > } > > +# trace_fetch [args] > +# > +# Trace the packet output of fetch, but make sure we disable the variable > +# in the child upload-pack, so we don't combine the results in the same file. > +trace_fetch () { > + client=$1; shift > + server=$1; shift > + GIT_TRACE_PACKET="$(pwd)/trace" \ > + git -C "$client" fetch \ > + --upload-pack 'unset GIT_TRACE_PACKET; git-upload-pack' \ > + "$server" "$@" > +} > + > test_expect_success 'commits with no parents are sent regardless of skip > distance' ' > git init server && > test_commit -C server to_fetch && > @@ -42,7 +55,7 @@ test_expect_success 'commits with no parents are sent > regardless of skip distanc > # "c1" has no parent, it is still sent as "have" even though it would > # normally be skipped. > test_config -C client fetch.negotiationalgorithm skipping && > - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" && > + trace_fetch client "$(pwd)/server" && > have_sent c7 c5 c2 c1 && > have_not_sent c6 c4 c3 > ' > @@ -88,7 +101,7 @@ test_expect_success 'when two skips collide, favor the > larger one' ' > # the next "have" sent will be "c1" (from "c6" skip 4) and not "c4" > # (from "c5side" skip 1). > test_config -C client fetch.negotiationalgorithm skipping && > - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" && > + trace_fetch client "$(pwd)/server" && > have_sent c5side c11 c9 c6 c1 && > have_not_sent c10 c8 c7 c5 c4 c3 c2 > ' > @@ -114,7 +127,7 @@ test_expect_success 'use ref advertisement to filter out > commits' ' > # not need to send any ancestors of "c3", but we still need to send "c3" > # itself. > test_config -C client fetch.negotiationalgorithm skipping && > - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch origin to_fetch && > + trace_fetch client origin to_fetch && > have_sent c5 c4^ c2side && > have_not_sent c4 c4^^ c4^^^ > ' > @@ -144,7 +157,7 @@ test_expect_success 'handle clock skew' ' > # and sent, because (due to clock skew) its only parent has already been > # popped off the priority queue. > test_config -C client fetch.negotiationalgorithm skipping && > - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" && > + trace_fetch client "$(pwd)/server" && > have_sent c2 c1 old4 old2 old1 && > have_not_sent old3 > ' > @@ -176,7 +189,7 @@ test_expect_success 'do not send "have" with ancestors of > commits that server AC > test_commit -C server commit-on-b1 && > > test_config -C client fetch.negotiationalgorithm skipping && > - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" > to_fetch && > + trace_fetch client "$(pwd)/server" to_fetch && > grep " fetch" trace && > > # fetch-pack sends 2 requests each containing 16 "have" lines before
Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE
On Thu, Aug 09, 2018 at 10:35:24AM -0700, Johannes Schindelin via GitGitGadget wrote: > The culprit is that two processes try simultaneously to write to the same > file specified via GIT_TRACE_PACKET, and it is not well defined how that > should work, even on Linux. Are you sure that it's not well-defined? We open the path with O_APPEND, which means every write() will be atomically positioned at the end of file. So we would never lose or overwrite data. We do our own buffering in a strbuf, writing the result out in a single write() call (modulo the OS returning a short write, but that should not generally happen when writing short strings to a file). So we should get individual trace lines as atomic units. The order of lines from the two processes is undefined, of course. > This patch series addresses that by locking the trace fd. I chose to use > flock() instead of fcntl() because the Win32 API LockFileEx() > [https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-lockfileex] > > (which does exactly what I want in this context) has much more similar > semantics to the former than the latter. I must admit I'm not excited to see us adding extra locking and complexity when it is not necessarily needed. Is this a feature we actually care about delivering for normal users of GIT_TRACE, or do we just care about solving the test flakiness here? Because we should be able to do the latter with the patch below. I also wonder if we should be clearing GIT_TRACE as part of clearing repo-env. It would do what we want automatically in this case, though there are definitely cases where I prefer the current behavior (because I really do want to see the upload-pack side of it). Of course I could always use "--upload-pack 'GIT_TRACE_PACKET=... upload-pack'", so this is really just a default. diff --git a/t/t5552-skipping-fetch-negotiator.sh b/t/t5552-skipping-fetch-negotiator.sh index 3b60bd44e0..5ad5bece55 100755 --- a/t/t5552-skipping-fetch-negotiator.sh +++ b/t/t5552-skipping-fetch-negotiator.sh @@ -28,6 +28,19 @@ have_not_sent () { done } +# trace_fetch [args] +# +# Trace the packet output of fetch, but make sure we disable the variable +# in the child upload-pack, so we don't combine the results in the same file. +trace_fetch () { + client=$1; shift + server=$1; shift + GIT_TRACE_PACKET="$(pwd)/trace" \ + git -C "$client" fetch \ + --upload-pack 'unset GIT_TRACE_PACKET; git-upload-pack' \ + "$server" "$@" +} + test_expect_success 'commits with no parents are sent regardless of skip distance' ' git init server && test_commit -C server to_fetch && @@ -42,7 +55,7 @@ test_expect_success 'commits with no parents are sent regardless of skip distanc # "c1" has no parent, it is still sent as "have" even though it would # normally be skipped. test_config -C client fetch.negotiationalgorithm skipping && - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" && + trace_fetch client "$(pwd)/server" && have_sent c7 c5 c2 c1 && have_not_sent c6 c4 c3 ' @@ -88,7 +101,7 @@ test_expect_success 'when two skips collide, favor the larger one' ' # the next "have" sent will be "c1" (from "c6" skip 4) and not "c4" # (from "c5side" skip 1). test_config -C client fetch.negotiationalgorithm skipping && - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" && + trace_fetch client "$(pwd)/server" && have_sent c5side c11 c9 c6 c1 && have_not_sent c10 c8 c7 c5 c4 c3 c2 ' @@ -114,7 +127,7 @@ test_expect_success 'use ref advertisement to filter out commits' ' # not need to send any ancestors of "c3", but we still need to send "c3" # itself. test_config -C client fetch.negotiationalgorithm skipping && - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch origin to_fetch && + trace_fetch client origin to_fetch && have_sent c5 c4^ c2side && have_not_sent c4 c4^^ c4^^^ ' @@ -144,7 +157,7 @@ test_expect_success 'handle clock skew' ' # and sent, because (due to clock skew) its only parent has already been # popped off the priority queue. test_config -C client fetch.negotiationalgorithm skipping && - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" && + trace_fetch client "$(pwd)/server" && have_sent c2 c1 old4 old2 old1 && have_not_sent old3 ' @@ -176,7 +189,7 @@ test_expect_success 'do not send "have" with ancestors of commits that server AC test_commit -C server commit-on-b1 && test_config -C client fetch.negotiationalgorithm skipping && - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" to_fetch && + trace_fetch client "$(pwd)/server" to_fetch && grep " fetch" trace && # fetch-pack sends 2 requests each containing 16 "have"