Re: [RFC PATCH 4/4] t/lib-httpd: sort log based on timestamp to avoid occasional failure
On Wed, Jul 11, 2018 at 02:53:23PM +0200, SZEDER Gábor wrote: > > I.e., could we have a situation where we make a request, the client > > finishes, and then we look at the logs, but nothing has been written by > > apache yet? > > That's possible, I suppose. Highly improbable, but possible. Or the > log entries can be written only partially by the time we look at them. Digging into earlier cases I had seen on my local box, my impression was that this was the problem I was seeing (not reordering). But it's entirely possible that I was confused by reordering. I've been trying to reproduce, but after 5 minutes of t5561 running in a tight loop under high load, I haven't gotten a single failure. > Anyway, I would prefer not to do anything about it just yet, because > I'm curious :) and I'd like to see whether this actually happens in > practice. I can't reproduce for now, so I'm happy either way. :) > > What if the test script provides the in-order expectation, but we check > > only the unordered version (by sorting both actual and expected output > > on the fly)? That would give us a more relaxed check most of the time, > > but somebody digging in to a failure could run the ordered diff (or we > > could even show it automatically on failure instead of just using > > test_cmp). > > this is the fix I'll go for in v2. Great. That seems much simpler to reason about. I'll take a look. -Peff
Re: [RFC PATCH 4/4] t/lib-httpd: sort log based on timestamp to avoid occasional failure
On Thu, Jun 14, 2018 at 7:53 PM Jeff King wrote: > > On Thu, Jun 14, 2018 at 02:31:07PM +0200, SZEDER Gábor wrote: > > > The last test of 't5561-http-backend.sh', 'server request log matches > > test results' may fail occasionally, because the order of entries in > > Apache's access log doesn't match the order of requests sent in the > > previous tests, although all the right requests are there. > I've occasionally run into these failures on my local box, too. I'm > happy somebody is looking into it (I have before, but eventually threw > up my hands in disgust). Heh, well, I found it to be one of the more entertaining debugging sessions... ;) I'm glad that somebody else saw it, too, and outside of Travis CI. > > Now, by default the timestamp of a log entry marks the beginning of > > the request processing, not when the log entry gets written. Since > > our requests are sent sequentially, sorting the log entries based on > > their timestamps would ensure that their order corresponds to the > > order of our requests. > > That's a reasonably clever solution. One thing I wonder, though: are we > always guaranteed that the log entries are written _at all_ before we > look at them? > > I.e., could we have a situation where we make a request, the client > finishes, and then we look at the logs, but nothing has been written by > apache yet? That's possible, I suppose. Highly improbable, but possible. Or the log entries can be written only partially by the time we look at them. FWIW, Apache has the 'BufferedLogs' directive, which "causes mod_log_config to store several log entries in memory and write them together to disk, rather than writing them after each request." It very sensibly defaults to off, and our test config doesn't set it. I would think that stopping Apache should flush the logs, but I don't know whether 'stop_httpd' blocks until the web server is indeed stopped, or it's all async. Perhaps a SIGHUP would suffice to flush the logs, but that's definitely async. Also note that while t5561 checks the access log at the end of the script, t5541 and t5551 do check the logs in the middle of the script, so we would have to stop Apache and then start again. Anyway, I would prefer not to do anything about it just yet, because I'm curious :) and I'd like to see whether this actually happens in practice. > > I don't really like the fix in this patch. I think an unfortunate > > clock skew during the test run could mess up the sorting added in this > > patch and cause test failure. Or if DST or even a leap second happen > > while the test is running. Do we care? Anyway, this occasional test > > failure apparently happens more often than DST and leap seconds > > combined. > > We could probably eliminate DST issues by consistently using UTC for the > timestamps. Leap seconds are probably infrequent enough not to worry > about. More likely is something like clock adjustment due to ntp. Those > adjustments are usually small enough not to matter, but if we're talking > microseconds, it could trigger. Ok, so admittedly I haven't quite thought through all these potential clock-related issues that I mentioned. Setting back the clock at the end of DST is not an issue, because Apache uses the timezone it finds set in the environment, and our test environment does set UTC, so as far as the tests and Apache's log timestamps are concerned there is no such thing as DST. Leap seconds don't go back in time, they are only added to compensate for Earth's slowing rotation, so that's not an issue, either. And I think NTP is supposed to be clever enough to not set back the clock recklessly, but only advance it slower. Of course, someone can still set back the clock manually while these tests run, but I would put that in the "If it hurts, don't do it" box. However, having said all this, ... > What if the test script provides the in-order expectation, but we check > only the unordered version (by sorting both actual and expected output > on the fly)? That would give us a more relaxed check most of the time, > but somebody digging in to a failure could run the ordered diff (or we > could even show it automatically on failure instead of just using > test_cmp). this is the fix I'll go for in v2.
Re: [RFC PATCH 4/4] t/lib-httpd: sort log based on timestamp to avoid occasional failure
On Thu, Jun 14, 2018 at 11:27:03AM -0700, Junio C Hamano wrote: > Jeff King writes: > > >> Another alternative is to simply accept that these tests are racy, and > >> that the resulting test failures are rare enough that it isn't worth > >> the complexity of the workaround, but adding a comment to the affected > >> tests warning about the raciness is sufficient. (But I wrote this > >> when I first saw and tracked down this failure; since then I observed > >> it four more times... :) > > > > It's definitely bugged me. I'd be happy to see some solution. I've been > > close to suggesting that reading apache logs is simply not robust, and > > we should focus our tests on the git-visible state changes (e.g., seeing > > successful requests, updated refs, etc). > > Hmph, that certainly is "checking only the things that matter", > which is desirable. The existing tests that look at the apache logs are very white-box, and come from Shawn's original smart-http series. I suspect it was as much about sanity-checking the implementation then as it was about protecting against regressions. So it's possible that these tests have simply outlived their usefulness. OTOH, they might catch non-functional problems, like if we started sending redundant requests. But then, they are not very comprehensive, as they only cover a few basic situations (so for example, for a while we were sending extra auth-related requests, but I don't think these tests detected that). -Peff
Re: [RFC PATCH 4/4] t/lib-httpd: sort log based on timestamp to avoid occasional failure
Jeff King writes: >> Another alternative is to simply accept that these tests are racy, and >> that the resulting test failures are rare enough that it isn't worth >> the complexity of the workaround, but adding a comment to the affected >> tests warning about the raciness is sufficient. (But I wrote this >> when I first saw and tracked down this failure; since then I observed >> it four more times... :) > > It's definitely bugged me. I'd be happy to see some solution. I've been > close to suggesting that reading apache logs is simply not robust, and > we should focus our tests on the git-visible state changes (e.g., seeing > successful requests, updated refs, etc). Hmph, that certainly is "checking only the things that matter", which is desirable. > A side effect of that is that it would become a lot easier to support > other webservers in our test scripts (though that may still be a fool's > errand just due to the amount of custom config we seem to carry).
Re: [RFC PATCH 4/4] t/lib-httpd: sort log based on timestamp to avoid occasional failure
On Thu, Jun 14, 2018 at 02:31:07PM +0200, SZEDER Gábor wrote: > The last test of 't5561-http-backend.sh', 'server request log matches > test results' may fail occasionally, because the order of entries in > Apache's access log doesn't match the order of requests sent in the > previous tests, although all the right requests are there. I saw it > fail on Travis CI five times in the span of about half a year, when > the order of two subsequent requests was flipped, and could trigger > the failure with a modified Git. However, I was unable to trigger it > with stock Git on my machine. Three tests in > 't5541-http-push-smart.sh' and 't5551-http-fetch-smart.sh' check > requests in the log the same way, so they might be prone to a similar > occasional failure as well. I've occasionally run into these failures on my local box, too. I'm happy somebody is looking into it (I have before, but eventually threw up my hands in disgust). > Now, by default the timestamp of a log entry marks the beginning of > the request processing, not when the log entry gets written. Since > our requests are sent sequentially, sorting the log entries based on > their timestamps would ensure that their order corresponds to the > order of our requests. That's a reasonably clever solution. One thing I wonder, though: are we always guaranteed that the log entries are written _at all_ before we look at them? I.e., could we have a situation where we make a request, the client finishes, and then we look at the logs, but nothing has been written by apache yet? > I don't really like the fix in this patch. I think an unfortunate > clock skew during the test run could mess up the sorting added in this > patch and cause test failure. Or if DST or even a leap second happen > while the test is running. Do we care? Anyway, this occasional test > failure apparently happens more often than DST and leap seconds > combined. We could probably eliminate DST issues by consistently using UTC for the timestamps. Leap seconds are probably infrequent enough not to worry about. More likely is something like clock adjustment due to ntp. Those adjustments are usually small enough not to matter, but if we're talking microseconds, it could trigger. > An alternative I considered was that we could decide that the order of > requests in the access log is not important as long as all the right > requests are there. This would inherently eliminate the raciness > issue, but if something were to go wrong, then it would become rather > hard to find out e.g. which request from which test has gone missing, > especially considering that several requests are sent in more than one > test. We could address this by truncating the access log at the > beginning and checking its contents at the end of each test sending > requests. Unfortunately, this would raise additional difficulties, > because all requests in t5561 come from tests included from > 't556x-common', i.e. from tests shared with > 't5560-http-backend-noserver.sh', which as its name suggests doesn't > run Apache and doesn't have an access log at all. What if the test script provides the in-order expectation, but we check only the unordered version (by sorting both actual and expected output on the fly)? That would give us a more relaxed check most of the time, but somebody digging in to a failure could run the ordered diff (or we could even show it automatically on failure instead of just using test_cmp). > Another alternative is to simply accept that these tests are racy, and > that the resulting test failures are rare enough that it isn't worth > the complexity of the workaround, but adding a comment to the affected > tests warning about the raciness is sufficient. (But I wrote this > when I first saw and tracked down this failure; since then I observed > it four more times... :) It's definitely bugged me. I'd be happy to see some solution. I've been close to suggesting that reading apache logs is simply not robust, and we should focus our tests on the git-visible state changes (e.g., seeing successful requests, updated refs, etc). A side effect of that is that it would become a lot easier to support other webservers in our test scripts (though that may still be a fool's errand just due to the amount of custom config we seem to carry). > Apache doesn't maintain 2.2 anymore; the final maintenance release > 2.2.34 was released in July 2017, almost a year ago. OTOH, our > 't/lib-httpd/apache.conf' contains a couple of IfVersion directives > dealing with versions <2.4, and one even with <2.1. How much do we > actually care about these unmaintained Apache versions, and how much > of this is just bitrotting? I strongly suspect bitrotting. It looks like most of the "< 2.4" directives are from 5 years ago (when Debian switched to 2.4 by default), so even long-term stable systems w
Re: [RFC PATCH 4/4] t/lib-httpd: sort log based on timestamp to avoid occasional failure
On Thu, Jun 14, 2018 at 2:31 PM, SZEDER Gábor wrote: > When a test sends a HTTP request, it can continue execution after > 'git-http-backend' fulfilled that request, but Apache writes the > corresponding access log entry only after 'git-http-backend' exited. > Some time inevitably passes between fulfilling the request and writing > the log entry, and, under unfavourable circumstances, enough time > might pass for the subsequent request to be sent and fulfilled, and > then Apache writes access log entries racily. > > This effect can be exacerbated by adding a bit of variable delay after > the request is fulfilled but before 'git-http-backend' exits, e.g. > like this: > > diff --git a/http-backend.c b/http-backend.c > index f3dc218b2..bbf4c125b 100644 > --- a/http-backend.c > +++ b/http-backend.c > @@ -709,5 +709,7 @@ int cmd_main(int argc, const char **argv) >max_request_buffer); > > cmd->imp(&hdr, cmd_arg); > + if (getpid() % 2) > + sleep(1); > return 0; >} > > This delay considerably increases the chances of log entries being > written out of order, and in turn makes t5561's last test fail almost > every time. Alas, it doesn't seem to be enough to trigger a similar > failure in t5541 and t5551. A bit of addendum here: There are basically two "kinds" of requests: 1. requests asking for regular files, e.g.: GET /smart/repo.git/HEAD HTTP/1.1 200 GET /smart/repo.git/info/refs HTTP/1.1 200 GET /smart/repo.git/objects/info/packs HTTP/1.1 200 2. requests asking for 'git-receive-pack' or 'git-upload-pack', e.g.: GET /smart/repo.git/info/refs?service=git-receive-pack HTTP/1.1 200 POST /smart/repo.git/git-receive-pack HTTP/1.1 200 - This artificial variable delay shuffles only the access log entries written for the first kind of requests, but doesn't appear to have any effect on the order of log entries written for the second kind of requests. t5561 sends both kinds of requests, so this variable delay is sufficient to make the test fail most of the time. t5541 and t5551, however, send only requests of the second kind, so that's why this variable delay can't trigger a similar failure in these test scripts. I haven't investigated any further what else would be necessary to trigger a failure in t5541 and t5551.
[RFC PATCH 4/4] t/lib-httpd: sort log based on timestamp to avoid occasional failure
The last test of 't5561-http-backend.sh', 'server request log matches test results' may fail occasionally, because the order of entries in Apache's access log doesn't match the order of requests sent in the previous tests, although all the right requests are there. I saw it fail on Travis CI five times in the span of about half a year, when the order of two subsequent requests was flipped, and could trigger the failure with a modified Git. However, I was unable to trigger it with stock Git on my machine. Three tests in 't5541-http-push-smart.sh' and 't5551-http-fetch-smart.sh' check requests in the log the same way, so they might be prone to a similar occasional failure as well. When a test sends a HTTP request, it can continue execution after 'git-http-backend' fulfilled that request, but Apache writes the corresponding access log entry only after 'git-http-backend' exited. Some time inevitably passes between fulfilling the request and writing the log entry, and, under unfavourable circumstances, enough time might pass for the subsequent request to be sent and fulfilled, and then Apache writes access log entries racily. This effect can be exacerbated by adding a bit of variable delay after the request is fulfilled but before 'git-http-backend' exits, e.g. like this: diff --git a/http-backend.c b/http-backend.c index f3dc218b2..bbf4c125b 100644 --- a/http-backend.c +++ b/http-backend.c @@ -709,5 +709,7 @@ int cmd_main(int argc, const char **argv) max_request_buffer); cmd->imp(&hdr, cmd_arg); + if (getpid() % 2) + sleep(1); return 0; } This delay considerably increases the chances of log entries being written out of order, and in turn makes t5561's last test fail almost every time. Alas, it doesn't seem to be enough to trigger a similar failure in t5541 and t5551. Now, by default the timestamp of a log entry marks the beginning of the request processing, not when the log entry gets written. Since our requests are sent sequentially, sorting the log entries based on their timestamps would ensure that their order corresponds to the order of our requests. Therefore, change Apache's log format to use microseconds since epoch as timestamps, the finest resolution Apache has to offer. This will: - give enough resolution to make reasonably sure that no two requests have the same timestamp; the default timestamp format has only one second resolution, which is not nearly enough. - make the timestamps sortable; the default timestamp format looks like "[31/Jan/2018:16:33:52 +]", which would trip sorting if the test script happens to run over month boundaries. Modify the strip_access_log() helper function, used by all potentially affected tests, to sort the access log entries based on their timestamp before stripping the uninteresting bits from the entries. Unfortunately, this "microseconds since epoch" is an extended time format specifier which was only introduced in Apache 2.2.30 [1], and standard strftime() doesn't have any format specifiers with microsecond or finer precision, so we can't use the above solution with Apache versions older than 2.2.30. However, since Apache 2.2 is no longer maintained anyway, it's not that big a deal to leave test runs with those old versions prone to this occasional failure. So use microsecond timestamps and sort access log entries only if the test is run with an Apache version supporting it; otherwise stick to the default timestamp format and leave the order of access log entries unchanged. [1] http://httpd.apache.org/docs/2.2/mod/mod_log_config.html Signed-off-by: SZEDER Gábor --- Notes: The log of the latest of this test failures: https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/391499735#L3130 I don't really like the fix in this patch. I think an unfortunate clock skew during the test run could mess up the sorting added in this patch and cause test failure. Or if DST or even a leap second happen while the test is running. Do we care? Anyway, this occasional test failure apparently happens more often than DST and leap seconds combined. Furthermore, in the future we might have computers fast enough to fulfill more than one test requests in a single microsecond; but I guess by that time we'll have higher resolution time formats as well. An alternative I considered was that we could decide that the order of requests in the access log is not important as long as all the right requests are there. This would inherently eliminate the raciness issue, but if something were to go wrong, then it would become rather hard to find out e.g. which request from which test has gone missing, especially considering that several requests are sent in more than one test. We could address this by truncating the access log at the beginning and checking i