Re: [RFC PATCH 4/4] t/lib-httpd: sort log based on timestamp to avoid occasional failure

2018-07-11 Thread Jeff King
On Wed, Jul 11, 2018 at 02:53:23PM +0200, SZEDER Gábor wrote:

> > I.e., could we have a situation where we make a request, the client
> > finishes, and then we look at the logs, but nothing has been written by
> > apache yet?
> 
> That's possible, I suppose.  Highly improbable, but possible.  Or the
> log entries can be written only partially by the time we look at them.

Digging into earlier cases I had seen on my local box, my impression was
that this was the problem I was seeing (not reordering). But it's
entirely possible that I was confused by reordering. I've been trying to
reproduce, but after 5 minutes of t5561 running in a tight loop under
high load, I haven't gotten a single failure.

> Anyway, I would prefer not to do anything about it just yet, because
> I'm curious :) and I'd like to see whether this actually happens in
> practice.

I can't reproduce for now, so I'm happy either way. :)

> > What if the test script provides the in-order expectation, but we check
> > only the unordered version (by sorting both actual and expected output
> > on the fly)? That would give us a more relaxed check most of the time,
> > but somebody digging in to a failure could run the ordered diff (or we
> > could even show it automatically on failure instead of just using
> > test_cmp).
> 
> this is the fix I'll go for in v2.

Great. That seems much simpler to reason about. I'll take a look.

-Peff


Re: [RFC PATCH 4/4] t/lib-httpd: sort log based on timestamp to avoid occasional failure

2018-07-11 Thread SZEDER Gábor
On Thu, Jun 14, 2018 at 7:53 PM Jeff King  wrote:
>
> On Thu, Jun 14, 2018 at 02:31:07PM +0200, SZEDER Gábor wrote:
>
> > The last test of 't5561-http-backend.sh', 'server request log matches
> > test results' may fail occasionally, because the order of entries in
> > Apache's access log doesn't match the order of requests sent in the
> > previous tests, although all the right requests are there.

> I've occasionally run into these failures on my local box, too. I'm
> happy somebody is looking into it (I have before, but eventually threw
> up my hands in disgust).

Heh, well, I found it to be one of the more entertaining debugging
sessions... ;)
I'm glad that somebody else saw it, too, and outside of Travis CI.

> > Now, by default the timestamp of a log entry marks the beginning of
> > the request processing, not when the log entry gets written.  Since
> > our requests are sent sequentially, sorting the log entries based on
> > their timestamps would ensure that their order corresponds to the
> > order of our requests.
>
> That's a reasonably clever solution. One thing I wonder, though: are we
> always guaranteed that the log entries are written _at all_ before we
> look at them?
>
> I.e., could we have a situation where we make a request, the client
> finishes, and then we look at the logs, but nothing has been written by
> apache yet?

That's possible, I suppose.  Highly improbable, but possible.  Or the
log entries can be written only partially by the time we look at them.

FWIW, Apache has the 'BufferedLogs' directive, which "causes
mod_log_config to store several log entries in memory and write them
together to disk, rather than writing them after each request."  It
very sensibly defaults to off, and our test config doesn't set it.

I would think that stopping Apache should flush the logs, but I don't
know whether 'stop_httpd' blocks until the web server is indeed
stopped, or it's all async.  Perhaps a SIGHUP would suffice to flush
the logs, but that's definitely async.  Also note that while t5561
checks the access log at the end of the script, t5541 and t5551 do
check the logs in the middle of the script, so we would have to stop
Apache and then start again.

Anyway, I would prefer not to do anything about it just yet, because
I'm curious :) and I'd like to see whether this actually happens in
practice.


> > I don't really like the fix in this patch.  I think an unfortunate
> > clock skew during the test run could mess up the sorting added in this
> > patch and cause test failure.  Or if DST or even a leap second happen
> > while the test is running.  Do we care?  Anyway, this occasional test
> > failure apparently happens more often than DST and leap seconds
> > combined.
>
> We could probably eliminate DST issues by consistently using UTC for the
> timestamps. Leap seconds are probably infrequent enough not to worry
> about. More likely is something like clock adjustment due to ntp. Those
> adjustments are usually small enough not to matter, but if we're talking
> microseconds, it could trigger.

Ok, so admittedly I haven't quite thought through all these potential
clock-related issues that I mentioned.

Setting back the clock at the end of DST is not an issue, because
Apache uses the timezone it finds set in the environment, and our test
environment does set UTC, so as far as the tests and Apache's log
timestamps are concerned there is no such thing as DST.
Leap seconds don't go back in time, they are only added to compensate
for Earth's slowing rotation, so that's not an issue, either.
And I think NTP is supposed to be clever enough to not set back the
clock recklessly, but only advance it slower.

Of course, someone can still set back the clock manually while these
tests run, but I would put that in the "If it hurts, don't do it" box.


However, having said all this, ...

> What if the test script provides the in-order expectation, but we check
> only the unordered version (by sorting both actual and expected output
> on the fly)? That would give us a more relaxed check most of the time,
> but somebody digging in to a failure could run the ordered diff (or we
> could even show it automatically on failure instead of just using
> test_cmp).

this is the fix I'll go for in v2.


Re: [RFC PATCH 4/4] t/lib-httpd: sort log based on timestamp to avoid occasional failure

2018-06-14 Thread Jeff King
On Thu, Jun 14, 2018 at 11:27:03AM -0700, Junio C Hamano wrote:

> Jeff King  writes:
> 
> >> Another alternative is to simply accept that these tests are racy, and
> >> that the resulting test failures are rare enough that it isn't worth
> >> the complexity of the workaround, but adding a comment to the affected
> >> tests warning about the raciness is sufficient.  (But I wrote this
> >> when I first saw and tracked down this failure; since then I observed
> >> it four more times... :)
> >
> > It's definitely bugged me. I'd be happy to see some solution. I've been
> > close to suggesting that reading apache logs is simply not robust, and
> > we should focus our tests on the git-visible state changes (e.g., seeing
> > successful requests, updated refs, etc).
> 
> Hmph, that certainly is "checking only the things that matter",
> which is desirable.

The existing tests that look at the apache logs are very white-box, and
come from Shawn's original smart-http series. I suspect it was as much
about sanity-checking the implementation then as it was about protecting
against regressions.

So it's possible that these tests have simply outlived their usefulness.

OTOH, they might catch non-functional problems, like if we started
sending redundant requests. But then, they are not very comprehensive,
as they only cover a few basic situations (so for example, for a while
we were sending extra auth-related requests, but I don't think these
tests detected that).

-Peff


Re: [RFC PATCH 4/4] t/lib-httpd: sort log based on timestamp to avoid occasional failure

2018-06-14 Thread Junio C Hamano
Jeff King  writes:

>> Another alternative is to simply accept that these tests are racy, and
>> that the resulting test failures are rare enough that it isn't worth
>> the complexity of the workaround, but adding a comment to the affected
>> tests warning about the raciness is sufficient.  (But I wrote this
>> when I first saw and tracked down this failure; since then I observed
>> it four more times... :)
>
> It's definitely bugged me. I'd be happy to see some solution. I've been
> close to suggesting that reading apache logs is simply not robust, and
> we should focus our tests on the git-visible state changes (e.g., seeing
> successful requests, updated refs, etc).

Hmph, that certainly is "checking only the things that matter",
which is desirable.

> A side effect of that is that it would become a lot easier to support
> other webservers in our test scripts (though that may still be a fool's
> errand just due to the amount of custom config we seem to carry).


Re: [RFC PATCH 4/4] t/lib-httpd: sort log based on timestamp to avoid occasional failure

2018-06-14 Thread Jeff King
On Thu, Jun 14, 2018 at 02:31:07PM +0200, SZEDER Gábor wrote:

> The last test of 't5561-http-backend.sh', 'server request log matches
> test results' may fail occasionally, because the order of entries in
> Apache's access log doesn't match the order of requests sent in the
> previous tests, although all the right requests are there.  I saw it
> fail on Travis CI five times in the span of about half a year, when
> the order of two subsequent requests was flipped, and could trigger
> the failure with a modified Git.  However, I was unable to trigger it
> with stock Git on my machine.  Three tests in
> 't5541-http-push-smart.sh' and 't5551-http-fetch-smart.sh' check
> requests in the log the same way, so they might be prone to a similar
> occasional failure as well.

I've occasionally run into these failures on my local box, too. I'm
happy somebody is looking into it (I have before, but eventually threw
up my hands in disgust).

> Now, by default the timestamp of a log entry marks the beginning of
> the request processing, not when the log entry gets written.  Since
> our requests are sent sequentially, sorting the log entries based on
> their timestamps would ensure that their order corresponds to the
> order of our requests.

That's a reasonably clever solution. One thing I wonder, though: are we
always guaranteed that the log entries are written _at all_ before we
look at them?

I.e., could we have a situation where we make a request, the client
finishes, and then we look at the logs, but nothing has been written by
apache yet?

> I don't really like the fix in this patch.  I think an unfortunate
> clock skew during the test run could mess up the sorting added in this
> patch and cause test failure.  Or if DST or even a leap second happen
> while the test is running.  Do we care?  Anyway, this occasional test
> failure apparently happens more often than DST and leap seconds
> combined.

We could probably eliminate DST issues by consistently using UTC for the
timestamps. Leap seconds are probably infrequent enough not to worry
about. More likely is something like clock adjustment due to ntp. Those
adjustments are usually small enough not to matter, but if we're talking
microseconds, it could trigger.

> An alternative I considered was that we could decide that the order of
> requests in the access log is not important as long as all the right
> requests are there. This would inherently eliminate the raciness
> issue, but if something were to go wrong, then it would become rather
> hard to find out e.g. which request from which test has gone missing,
> especially considering that several requests are sent in more than one
> test.  We could address this by truncating the access log at the
> beginning and checking its contents at the end of each test sending
> requests.  Unfortunately, this would raise additional difficulties,
> because all requests in t5561 come from tests included from
> 't556x-common', i.e. from tests shared with
> 't5560-http-backend-noserver.sh', which as its name suggests doesn't
> run Apache and doesn't have an access log at all.

What if the test script provides the in-order expectation, but we check
only the unordered version (by sorting both actual and expected output
on the fly)? That would give us a more relaxed check most of the time,
but somebody digging in to a failure could run the ordered diff (or we
could even show it automatically on failure instead of just using
test_cmp).

> Another alternative is to simply accept that these tests are racy, and
> that the resulting test failures are rare enough that it isn't worth
> the complexity of the workaround, but adding a comment to the affected
> tests warning about the raciness is sufficient.  (But I wrote this
> when I first saw and tracked down this failure; since then I observed
> it four more times... :)

It's definitely bugged me. I'd be happy to see some solution. I've been
close to suggesting that reading apache logs is simply not robust, and
we should focus our tests on the git-visible state changes (e.g., seeing
successful requests, updated refs, etc).

A side effect of that is that it would become a lot easier to support
other webservers in our test scripts (though that may still be a fool's
errand just due to the amount of custom config we seem to carry).

> Apache doesn't maintain 2.2 anymore; the final maintenance release
> 2.2.34 was released in July 2017, almost a year ago.  OTOH, our
> 't/lib-httpd/apache.conf' contains a couple of IfVersion directives
> dealing with versions <2.4, and one even with <2.1.  How much do we
> actually care about these unmaintained Apache versions, and how much
> of this is just bitrotting?

I strongly suspect bitrotting. It looks like most of the "< 2.4"
directives are from 5 years ago (when Debian switched to 2.4 by
default), so even long-term stable systems w

Re: [RFC PATCH 4/4] t/lib-httpd: sort log based on timestamp to avoid occasional failure

2018-06-14 Thread SZEDER Gábor
On Thu, Jun 14, 2018 at 2:31 PM, SZEDER Gábor  wrote:

> When a test sends a HTTP request, it can continue execution after
> 'git-http-backend' fulfilled that request, but Apache writes the
> corresponding access log entry only after 'git-http-backend' exited.
> Some time inevitably passes between fulfilling the request and writing
> the log entry, and, under unfavourable circumstances, enough time
> might pass for the subsequent request to be sent and fulfilled, and
> then Apache writes access log entries racily.
>
> This effect can be exacerbated by adding a bit of variable delay after
> the request is fulfilled but before 'git-http-backend' exits, e.g.
> like this:
>
>   diff --git a/http-backend.c b/http-backend.c
>   index f3dc218b2..bbf4c125b 100644
>   --- a/http-backend.c
>   +++ b/http-backend.c
>   @@ -709,5 +709,7 @@ int cmd_main(int argc, const char **argv)
>max_request_buffer);
>
> cmd->imp(&hdr, cmd_arg);
>   + if (getpid() % 2)
>   + sleep(1);
> return 0;
>}
>
> This delay considerably increases the chances of log entries being
> written out of order, and in turn makes t5561's last test fail almost
> every time.  Alas, it doesn't seem to be enough to trigger a similar
> failure in t5541 and t5551.

A bit of addendum here:

There are basically two "kinds" of requests:

  1. requests asking for regular files, e.g.:

 GET  /smart/repo.git/HEAD HTTP/1.1 200
 GET  /smart/repo.git/info/refs HTTP/1.1 200
 GET  /smart/repo.git/objects/info/packs HTTP/1.1 200

  2. requests asking for 'git-receive-pack' or 'git-upload-pack',
 e.g.:

 GET  /smart/repo.git/info/refs?service=git-receive-pack HTTP/1.1 200
 POST /smart/repo.git/git-receive-pack HTTP/1.1 200 -

This artificial variable delay shuffles only the access log entries
written for the first kind of requests, but doesn't appear to have any
effect on the order of log entries written for the second kind of
requests.

t5561 sends both kinds of requests, so this variable delay is
sufficient to make the test fail most of the time.

t5541 and t5551, however, send only requests of the second kind, so
that's why this variable delay can't trigger a similar failure in
these test scripts.

I haven't investigated any further what else would be necessary to
trigger a failure in t5541 and t5551.


[RFC PATCH 4/4] t/lib-httpd: sort log based on timestamp to avoid occasional failure

2018-06-14 Thread SZEDER Gábor
The last test of 't5561-http-backend.sh', 'server request log matches
test results' may fail occasionally, because the order of entries in
Apache's access log doesn't match the order of requests sent in the
previous tests, although all the right requests are there.  I saw it
fail on Travis CI five times in the span of about half a year, when
the order of two subsequent requests was flipped, and could trigger
the failure with a modified Git.  However, I was unable to trigger it
with stock Git on my machine.  Three tests in
't5541-http-push-smart.sh' and 't5551-http-fetch-smart.sh' check
requests in the log the same way, so they might be prone to a similar
occasional failure as well.

When a test sends a HTTP request, it can continue execution after
'git-http-backend' fulfilled that request, but Apache writes the
corresponding access log entry only after 'git-http-backend' exited.
Some time inevitably passes between fulfilling the request and writing
the log entry, and, under unfavourable circumstances, enough time
might pass for the subsequent request to be sent and fulfilled, and
then Apache writes access log entries racily.

This effect can be exacerbated by adding a bit of variable delay after
the request is fulfilled but before 'git-http-backend' exits, e.g.
like this:

  diff --git a/http-backend.c b/http-backend.c
  index f3dc218b2..bbf4c125b 100644
  --- a/http-backend.c
  +++ b/http-backend.c
  @@ -709,5 +709,7 @@ int cmd_main(int argc, const char **argv)
   max_request_buffer);

cmd->imp(&hdr, cmd_arg);
  + if (getpid() % 2)
  + sleep(1);
return 0;
   }

This delay considerably increases the chances of log entries being
written out of order, and in turn makes t5561's last test fail almost
every time.  Alas, it doesn't seem to be enough to trigger a similar
failure in t5541 and t5551.

Now, by default the timestamp of a log entry marks the beginning of
the request processing, not when the log entry gets written.  Since
our requests are sent sequentially, sorting the log entries based on
their timestamps would ensure that their order corresponds to the
order of our requests.

Therefore, change Apache's log format to use microseconds since epoch
as timestamps, the finest resolution Apache has to offer.  This will:

  - give enough resolution to make reasonably sure that no two
requests have the same timestamp; the default timestamp format has
only one second resolution, which is not nearly enough.

  - make the timestamps sortable; the default timestamp format looks
like "[31/Jan/2018:16:33:52 +]", which would trip sorting if
the test script happens to run over month boundaries.

Modify the strip_access_log() helper function, used by all potentially
affected tests, to sort the access log entries based on their
timestamp before stripping the uninteresting bits from the entries.

Unfortunately, this "microseconds since epoch" is an extended time
format specifier which was only introduced in Apache 2.2.30 [1], and
standard strftime() doesn't have any format specifiers with
microsecond or finer precision, so we can't use the above solution
with Apache versions older than 2.2.30.  However, since Apache 2.2 is
no longer maintained anyway, it's not that big a deal to leave test
runs with those old versions prone to this occasional failure.  So use
microsecond timestamps and sort access log entries only if the test is
run with an Apache version supporting it; otherwise stick to the
default timestamp format and leave the order of access log entries
unchanged.

[1] http://httpd.apache.org/docs/2.2/mod/mod_log_config.html

Signed-off-by: SZEDER Gábor 
---

Notes:
The log of the latest of this test failures:

  
https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/391499735#L3130

I don't really like the fix in this patch.  I think an unfortunate
clock skew during the test run could mess up the sorting added in this
patch and cause test failure.  Or if DST or even a leap second happen
while the test is running.  Do we care?  Anyway, this occasional test
failure apparently happens more often than DST and leap seconds
combined.
Furthermore, in the future we might have computers fast enough to
fulfill more than one test requests in a single microsecond; but I
guess by that time we'll have higher resolution time formats as well.

An alternative I considered was that we could decide that the order of
requests in the access log is not important as long as all the right
requests are there.  This would inherently eliminate the raciness
issue, but if something were to go wrong, then it would become rather
hard to find out e.g. which request from which test has gone missing,
especially considering that several requests are sent in more than one
test.  We could address this by truncating the access log at the
beginning and checking i