Hi Peff, I just had a look at the patch you provided below (for some reason, my previous search on public-inbox only turned up Gábor's mail to which you responded).
Admittedly, I do not really understand all aspects of it, but it applies, still, and I kicked off a stress test here: https://dev.azure.com/git/git/_build/results?buildId=338 It seems that your patch fixes that t5570 flakiness on macOS, and more importantly, addresses an important issue on macOS. Will play a bit more with it and keep you posted. Ciao, Dscho On Tue, 14 Aug 2018, Jeff King wrote: > On Mon, Aug 06, 2018 at 05:11:13PM +0200, SZEDER Gábor wrote: > > > - 'git upload-pack' receives the request, parses the want line, > > notices the corrupt pack, responds with an 'ERR upload-pack: not > > our ref' pkt-line, and die()s right away. > > > > - 'git fetch' finally approaches the end of the function, where it > > attempts to send a done pkt-line via another send_request() call > > through the now closing TCP socket. > > > > - What happens now seems to depend on the platform: > > > > - On Linux, both on my machine and on Travis CI, it shows textbook > > example behaviour: write() returns with error and sets errno to > > ECONNRESET. Since it happens in write_or_die(), 'git fetch' > > die()s with 'fatal: write error: Connection reset by peer', and > > doesn't show the error send by 'git upload-pack'; how could it, > > it doesn't even get as far to receive upload-pack's ERR > > pkt-line. > > > > The test only checks that 'git fetch' fails, but it doesn't > > check whether it failed with the right error message, so the > > test still succeeds. Had it checked the error message as well, > > we most likely had noticed this issue already, it doesn't happen > > all that rarely. > > Hmm. Traditionally we did not send ERR as part of upload-pack at all. It > was the message you got from git-daemon if it couldn't start the > requested sub-process. It was only later in bdb31eada7 (upload-pack: > report "not our ref" to client, 2017-02-23) that we started sending > them. So I think that is why it does not check the error message: it is > not expecting that case at all (and it is not actually interesting here, > as the real problem is that the remote side is corrupt, but it sadly > does not say anything so useful). > > I think that's somewhat tangential, though. The root of the issue is > this: > > > - On the new OSX images with XCode 9.4 on Travis CI the write() > > triggers SIGPIPE right away, and 'test_must_fail' notices it and > > fails the test. I couldn't see any sign of an ECONNRESET or any > > other error that we could act upon to avoid the SIGPIPE. > > Right, as soon as we get SIGPIPE we can't offer any useful message, > because we're dead. I would argue that fetch should simply turn off > SIGPIPE entirely, and rely on getting EPIPE from write(). But since > we're in write_or_die(), it actually turns EPIPE back into a SIGPIPE > death! > > So we'd probably also want to teach it to use a real write_in_full(), > and then output a more useful message in this case. write_or_die() > really does produce bad messages regardless, because it doesn't know > what it's writing to. > > That would give us a baby step in the right direction, because at least > we'd always be doing a controlled die() then. And then the next step > would be to show the remote error message (even though it's not actually > useful in this case, in theory upload-pack could generate something > better). And that would mean turning the die() on write into an attempt > to drain any ERR messages before either dying or returning an error up > the stack. > > I suspect the (largely untested) patch below would make your test > problems go away. Or instead, we could simply add sigpipe=ok to the > test_must_fail invocation, but I agree with you that the current > behavior on OS X is not ideal (the user sees no error message). > > -Peff > > diff --git a/fetch-pack.c b/fetch-pack.c > index 5714bcbddd..3e80604562 100644 > --- a/fetch-pack.c > +++ b/fetch-pack.c > @@ -188,8 +188,10 @@ static void send_request(struct fetch_pack_args *args, > if (args->stateless_rpc) { > send_sideband(fd, -1, buf->buf, buf->len, LARGE_PACKET_MAX); > packet_flush(fd); > - } else > - write_or_die(fd, buf->buf, buf->len); > + } else { > + if (write_in_full(fd, buf->buf, buf->len) < 0) > + die_errno("unable to write to remote"); > + } > } > > static void insert_one_alternate_object(struct fetch_negotiator *negotiator, > @@ -1167,7 +1169,8 @@ static int send_fetch_request(struct fetch_negotiator > *negotiator, int fd_out, > > /* Send request */ > packet_buf_flush(&req_buf); > - write_or_die(fd_out, req_buf.buf, req_buf.len); > + if (write_in_full(fd_out, req_buf.buf, req_buf.len) < 0) > + die_errno("unable to write request to remote"); > > strbuf_release(&req_buf); > return ret; > diff --git a/pkt-line.c b/pkt-line.c > index a593c08aad..450d0801b1 100644 > --- a/pkt-line.c > +++ b/pkt-line.c > @@ -88,13 +88,15 @@ static void packet_trace(const char *buf, unsigned int > len, int write) > void packet_flush(int fd) > { > packet_trace("0000", 4, 1); > - write_or_die(fd, "0000", 4); > + if (write_in_full(fd, "0000", 4) < 0) > + die_errno("unable to write flush packet"); > } > > void packet_delim(int fd) > { > packet_trace("0001", 4, 1); > - write_or_die(fd, "0001", 4); > + if (write_in_full(fd, "0000", 4) < 0) > + die_errno("unable to write delim packet"); > } > > int packet_flush_gently(int fd) > >