Re: t5570-git-daemon fails with SIGPIPE on OSX
On Sat, Mar 02, 2019 at 10:21:00PM +0100, Johannes Schindelin wrote: > Do you want to turn these two patches into a proper patch series? > Otherwise I can take care of it, probably this Monday or Tuesday. Here they are. The old email sending the first one actually had a typo (it sent "" instead of "0001" for some packets; oops), which is fixed here. I'm actually not sure of the placement for the second one; see the comments there. [1/2]: fetch: avoid calling write_or_die() [2/2]: fetch: ignore SIGPIPE during network operation builtin/fetch.c | 2 ++ fetch-pack.c| 9 ++--- pkt-line.c | 6 -- 3 files changed, 12 insertions(+), 5 deletions(-) -Peff
Re: t5570-git-daemon fails with SIGPIPE on OSX
Hi Junio, On Sun, 3 Mar 2019, Junio C Hamano wrote: > Jeff King writes: > > > But for git-fetch, our primary purpose is receiving data and writing it > > to disk. We should be checking all of our write()s already, and SIGPIPE > > is just confusing. > > Yup, sounds like a very sensible argument. > > > So I'd actually be fine with just declaring that certain commands (like > > fetch) just ignore SIGPIPE entirely. > > Me too (to me, "actually be fine with" would mean "that's a larger > hammer alternative, which is OK, while the base solution is fine, > too"). So do I understand that you'd like something like this? -- snipsnap -- t a/git.c b/git.c index 2f604a41ea..c4122cc699 100644 --- a/git.c +++ b/git.c @@ -4,6 +4,7 @@ #include "help.h" #include "run-command.h" #include "alias.h" +#include "sigchain.h" #define RUN_SETUP (1<<0) #define RUN_SETUP_GENTLY (1<<1) @@ -16,6 +17,7 @@ #define SUPPORT_SUPER_PREFIX (1<<4) #define DELAY_PAGER_CONFIG (1<<5) #define NO_PARSEOPT(1<<6) /* parse-options is not used */ +#define IGNORE_SIGPIPE (1<<7) struct cmd_struct { const char *cmd; @@ -388,6 +390,8 @@ static int run_builtin(struct cmd_struct *p, int argc, const char **argv) prefix = NULL; help = argc == 2 && !strcmp(argv[1], "-h"); if (!help) { + if (p->option & IGNORE_SIGPIPE) + sigchain_push(SIGPIPE, SIG_IGN); if (p->option & RUN_SETUP) prefix = setup_git_directory(); else if (p->option & RUN_SETUP_GENTLY) { @@ -477,7 +481,7 @@ static struct cmd_struct commands[] = { { "diff-tree", cmd_diff_tree, RUN_SETUP | NO_PARSEOPT }, { "difftool", cmd_difftool, RUN_SETUP | NEED_WORK_TREE }, { "fast-export", cmd_fast_export, RUN_SETUP }, - { "fetch", cmd_fetch, RUN_SETUP }, + { "fetch", cmd_fetch, RUN_SETUP | IGNORE_SIGPIPE }, { "fetch-pack", cmd_fetch_pack, RUN_SETUP | NO_PARSEOPT }, { "fmt-merge-msg", cmd_fmt_merge_msg, RUN_SETUP }, { "for-each-ref", cmd_for_each_ref, RUN_SETUP },
Re: t5570-git-daemon fails with SIGPIPE on OSX
Jeff King writes: > But for git-fetch, our primary purpose is receiving data and writing it > to disk. We should be checking all of our write()s already, and SIGPIPE > is just confusing. Yup, sounds like a very sensible argument. > So I'd actually be fine with just declaring that certain commands (like > fetch) just ignore SIGPIPE entirely. Me too (to me, "actually be fine with" would mean "that's a larger hammer alternative, which is OK, while the base solution is fine, too"). Thanks.
Re: t5570-git-daemon fails with SIGPIPE on OSX
Hi Peff, On Fri, 1 Mar 2019, Jeff King wrote: > diff --git a/builtin/fetch.c b/builtin/fetch.c > index b620fd54b4..4ba63d5ac6 100644 > --- a/builtin/fetch.c > +++ b/builtin/fetch.c > @@ -1556,7 +1556,9 @@ static int fetch_one(struct remote *remote, int argc, > const char **argv, int pru > > sigchain_push_common(unlock_pack_on_signal); > atexit(unlock_pack); > + sigchain_push(SIGPIPE, SIG_IGN); > exit_code = do_fetch(gtransport, &rs); > + sigchain_pop(SIGPIPE); > refspec_clear(&rs); > transport_disconnect(gtransport); > gtransport = NULL; That indeed does the job: https://git.visualstudio.com/git/_build/results?buildId=358 > That said, I actually think it's kind of pointless for git-fetch to use > SIGPIPE at all. The purpose of SIGPIPE is that you can write a naive > program that spews output, and you'll get informed (forcefully) by the > OS if the process consuming your output stops listening. That makes > sense for programs like "git log", whose primary purpose is generating > output. > > But for git-fetch, our primary purpose is receiving data and writing it > to disk. We should be checking all of our write()s already, and SIGPIPE > is just confusing. The only "big" output we generate is the status table > at the end. And even if that is going to a pipe that closes, I don't > think we'd want to fail the whole command (we'd want to finalize any > writes for what we just fetched, clean up after ourselves, etc). > > So I'd actually be fine with just declaring that certain commands (like > fetch) just ignore SIGPIPE entirely. That's a bigger change than I'm comfortable with, so I'd like to go with tge diff you gave above. Do you want to turn these two patches into a proper patch series? Otherwise I can take care of it, probably this Monday or Tuesday. Ciao, Dscho
Re: t5570-git-daemon fails with SIGPIPE on OSX
On Fri, Mar 01, 2019 at 04:02:22PM +0100, Johannes Schindelin wrote: > > I think that patch does the write_or_die conversion to handle EPIPE, but > > it would still need to turn off SIGPIPE for the whole process. > > > > So you'd also need to stick a: > > > > sigchain_push(SIGPIPE, SIG_IGN); > > > > somewhere near the start of cmd_fetch(). (There may be a less > > coarse-grained place to put it, but at this point I think we're just > > trying to find out whether this approach even solves the problem). > > This fixes it, it seems. I let the job run with `--stress=50` and even > after half an hour, it did not fail: > > attempts ://git.visualstudio.com/git/_build/results?buildId=354 Cool, I'm glad it works! > (I had to cancel it, I thought that `--stress=50` would stop trying after > 50 runs, but I was obviously incorrect in that assumption...) No, that will do 50 simultaneous jobs. You want --stress-limit=50. It might make more sense to have --stress-jobs, and make --stress=X set the limit, as I think it's much more useful to set the limit than it is to set the number of jobs. At least that's my experience. > So. Good, we have a diff that works. But you mentioned that you'd like to > put it somewhere else? I am a bit unfamiliar with the code paths of > `cmd_fetch()`, so I would be hard pressed to find a better spot. Any hint? Well, I'm of two minds there. If we want to make the minimum change, then we'd just want to disable SIGPIPE while we're actually conversing with the other side. So I guess that would be somewhere in do_fetch(), before we start talking to the other side, and restoring it after we've finished our half of the conversation (i.e., after we've done all the want/have bits and we're receiving the pack). But that's actually pretty awkward to do, because most of those details are handled under the hood by the transport code. So probably something like this is the least-terrible place to put it: diff --git a/builtin/fetch.c b/builtin/fetch.c index b620fd54b4..4ba63d5ac6 100644 --- a/builtin/fetch.c +++ b/builtin/fetch.c @@ -1556,7 +1556,9 @@ static int fetch_one(struct remote *remote, int argc, const char **argv, int pru sigchain_push_common(unlock_pack_on_signal); atexit(unlock_pack); + sigchain_push(SIGPIPE, SIG_IGN); exit_code = do_fetch(gtransport, &rs); + sigchain_pop(SIGPIPE); refspec_clear(&rs); transport_disconnect(gtransport); gtransport = NULL; That said, I actually think it's kind of pointless for git-fetch to use SIGPIPE at all. The purpose of SIGPIPE is that you can write a naive program that spews output, and you'll get informed (forcefully) by the OS if the process consuming your output stops listening. That makes sense for programs like "git log", whose primary purpose is generating output. But for git-fetch, our primary purpose is receiving data and writing it to disk. We should be checking all of our write()s already, and SIGPIPE is just confusing. The only "big" output we generate is the status table at the end. And even if that is going to a pipe that closes, I don't think we'd want to fail the whole command (we'd want to finalize any writes for what we just fetched, clean up after ourselves, etc). So I'd actually be fine with just declaring that certain commands (like fetch) just ignore SIGPIPE entirely. -Peff
Re: t5570-git-daemon fails with SIGPIPE on OSX
Hi Peff, On Fri, 8 Feb 2019, Jeff King wrote: > On Fri, Feb 08, 2019 at 10:28:12AM +0100, Johannes Schindelin wrote: > > > On Fri, 8 Feb 2019, Johannes Schindelin wrote: > > > > > I just had a look at the patch you provided below (for some reason, my > > > previous search on public-inbox only turned up Gábor's mail to which you > > > responded). > > > > > > Admittedly, I do not really understand all aspects of it, but it applies, > > > still, and I kicked off a stress test here: > > > > > > https://dev.azure.com/git/git/_build/results?buildId=338 > > > > > > It seems that your patch fixes that t5570 flakiness on macOS, and more > > > importantly, addresses an important issue on macOS. > > > > > > Will play a bit more with it and keep you posted. > > > > Alas, I was fooled. *Fooled*, I say. Apparently the --stress option makes > > the script *succeed* when it fails? > > [...] > > So I am afraid that your patch does not fix the issue nor does it work > > around it. > > I think that patch does the write_or_die conversion to handle EPIPE, but > it would still need to turn off SIGPIPE for the whole process. > > So you'd also need to stick a: > > sigchain_push(SIGPIPE, SIG_IGN); > > somewhere near the start of cmd_fetch(). (There may be a less > coarse-grained place to put it, but at this point I think we're just > trying to find out whether this approach even solves the problem). This fixes it, it seems. I let the job run with `--stress=50` and even after half an hour, it did not fail: attempts ://git.visualstudio.com/git/_build/results?buildId=354 (I had to cancel it, I thought that `--stress=50` would stop trying after 50 runs, but I was obviously incorrect in that assumption...) Reverting the patch made it fail within a hundred runs: https://git.visualstudio.com/git/_build/results?buildId=356 So. Good, we have a diff that works. But you mentioned that you'd like to put it somewhere else? I am a bit unfamiliar with the code paths of `cmd_fetch()`, so I would be hard pressed to find a better spot. Any hint? Ciao, Dscho
Re: t5570-git-daemon fails with SIGPIPE on OSX
On Fri, Feb 08, 2019 at 10:28:12AM +0100, Johannes Schindelin wrote: > Hi Peff, > > On Fri, 8 Feb 2019, Johannes Schindelin wrote: > > > I just had a look at the patch you provided below (for some reason, my > > previous search on public-inbox only turned up Gábor's mail to which you > > responded). > > > > Admittedly, I do not really understand all aspects of it, but it applies, > > still, and I kicked off a stress test here: > > > > https://dev.azure.com/git/git/_build/results?buildId=338 > > > > It seems that your patch fixes that t5570 flakiness on macOS, and more > > importantly, addresses an important issue on macOS. > > > > Will play a bit more with it and keep you posted. > > Alas, I was fooled. *Fooled*, I say. Apparently the --stress option makes > the script *succeed* when it fails? > [...] > So I am afraid that your patch does not fix the issue nor does it work > around it. I think that patch does the write_or_die conversion to handle EPIPE, but it would still need to turn off SIGPIPE for the whole process. So you'd also need to stick a: sigchain_push(SIGPIPE, SIG_IGN); somewhere near the start of cmd_fetch(). (There may be a less coarse-grained place to put it, but at this point I think we're just trying to find out whether this approach even solves the problem). -Peff
Re: t5570-git-daemon fails with SIGPIPE on OSX
On Fri, Feb 08, 2019 at 09:32:32AM +0100, Johannes Schindelin wrote: > Team, > > On Mon, 6 Aug 2018, SZEDER Gábor wrote: > > > [Resending with Clemens' last used email address. > > Clemens, please consider sending a patch to update our .mailmap file.] > > > > > > On Mon, Aug 6, 2018 at 5:11 PM SZEDER Gábor wrote: > > > > > > Travis CI changed its default OSX image to use XCode 9.4 on 2018-07-31 > > > [1]. Since then OSX build jobs fail rather frequently because of a > > > SIGPIPE in the tests 'fetch notices corrupt pack' or 'fetch notices > > > corrupt idx' in 't5570-git-daemon.sh' [2]. I think this is a symptom > > > a real bug in Git affecting other platforms as well, but these tests > > > are too lax to catch it. > > I am seeing this very frequently now, as it feels like failing in the > Azure Pipeline about half of the time. I was wondering whether it's only an issue on Travis CI, and was waiting whether you'll complain about it :) Evidently it is not, but I still would like to see a report about macOS running directly on someone's Mac hardware, i.e. without all the CI/cloud/whatnot magic. > Any ideas how to fix this test, anyone? I'm afraid that this is not merely an issue with the test, but a platform issue that we should work around somehow. I would have suggested to follow up on what Peff suggested, but I see that you already did, and it didn't work out... In the meantime, for lack of a better option, I started to skip the two failure-prone tests in the OSX build jobs in my automated CI builds with: -- >8 -- Subject: [PATCH] travis-ci: skip flaky tests in 't5570-git-daemon.sh' in OSX build jobs See: https://public-inbox.org/git/CAM0VKj=mcs+cmogzf_xypeb+qzrfmumh52-pv_ndmza9x+r...@mail.gmail.com/ Signed-off-by: SZEDER Gábor --- ci/lib-travisci.sh | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/ci/lib-travisci.sh b/ci/lib-travisci.sh index 69dff4d1ec..1cd2c1db7d 100755 --- a/ci/lib-travisci.sh +++ b/ci/lib-travisci.sh @@ -121,7 +121,9 @@ osx-clang|osx-gcc) # t9810 occasionally fails on Travis CI OS X # t9816 occasionally fails with "TAP out of sequence errors" on # Travis CI OS X - export GIT_SKIP_TESTS="t9810 t9816" + # In 't5570-git-daemon.sh', the tests 'fetch notices corrupt pack' + # and 'fetch notices corrupt idx' fail rather frequently. + export GIT_SKIP_TESTS="t9810 t9816 t5570.9 t5570.10" ;; GIT_TEST_GETTEXT_POISON) export GIT_TEST_GETTEXT_POISON=YesPlease -- 2.20.1.940.g8404bb2d1a -- >8 -- > > > What it boils down to is this sequence: > > > > > > - The test first prepares a repository containing a corrupt pack, > > > ready to be server via 'git daemon'. > > > > > > - Then the test runs 'test_must_fail git fetch ', which connects > > > to 'git daemon', which forks 'git upload-pack', which then > > > advertises refs (only HEAD) and capabilities. So far so good. > > > > > > - 'git fetch' eventually calls fetch-pack.c:find_common(). The > > > first half of this function assembles a request consisting of a > > > want and a flush pkt-line, and sends it via a send_request() call. > > > > > > At this point the scheduling becomes important: let's suppose that > > > fetch is slow and upload-pack is fast. > > > > > > - 'git upload-pack' receives the request, parses the want line, > > > notices the corrupt pack, responds with an 'ERR upload-pack: not > > > our ref' pkt-line, and die()s right away. > > > > > > - 'git fetch' finally approaches the end of the function, where it > > > attempts to send a done pkt-line via another send_request() call > > > through the now closing TCP socket. > > > > > > - What happens now seems to depend on the platform: > > > > > > - On Linux, both on my machine and on Travis CI, it shows textbook > > > example behaviour: write() returns with error and sets errno to > > > ECONNRESET. Since it happens in write_or_die(), 'git fetch' > > > die()s with 'fatal: write error: Connection reset by peer', and > > > doesn't show the error send by 'git upload-pack'; how could it, > > > it doesn't even get as far to receive upload-pack's ERR > > > pkt-line. > > > > > > The test only checks that 'git fetch' fails, but it doesn't > > > check whether it failed with the right error message, so the > > > test still succeeds. Had it checked the error message as well, > > > we most likely had noticed this issue already, it doesn't happen > > > all that rarely. > > > > > > - On the new OSX images with XCode 9.4 on Travis CI the write() > > > triggers SIGPIPE right away, and 'test_must_fail' notices it and > > > fails the test. I couldn't see any sign of an ECONNRESET or any > > > other error that we could act upon to avoid the SIGPIPE. > > > > > > - On OSX with XCode 9.2 on Travis CI there is neither SIGPIPE, nor > > >
Re: t5570-git-daemon fails with SIGPIPE on OSX
Hi Peff, On Fri, 8 Feb 2019, Johannes Schindelin wrote: > I just had a look at the patch you provided below (for some reason, my > previous search on public-inbox only turned up Gábor's mail to which you > responded). > > Admittedly, I do not really understand all aspects of it, but it applies, > still, and I kicked off a stress test here: > > https://dev.azure.com/git/git/_build/results?buildId=338 > > It seems that your patch fixes that t5570 flakiness on macOS, and more > importantly, addresses an important issue on macOS. > > Will play a bit more with it and keep you posted. Alas, I was fooled. *Fooled*, I say. Apparently the --stress option makes the script *succeed* when it fails? I say that because I wanted to make sure that your patch fixes things and reverted your change and started another build, which succeeded. So I started another build, then another build, and they all succeeded. Only then it dawned on me that I had not looked at the *logs*. And they all still report the same issue, even with your patch: https://dev.azure.com/git/git/_build/results?buildId=338&view=logs&jobId=51041795-01c5-57f3-5561-107b6b9e51a6&taskId=fadc714a-a906-5cf2-cc7a-335e443ad2f8&lineStart=1402&lineEnd=1505&colStart=1&colEnd=32 (You will have to scroll all the way down, or press Ctrl+End, to see that "fetch notices corrupt pack" is failing.) So I am afraid that your patch does not fix the issue nor does it work around it. Ciao, Dscho > On Tue, 14 Aug 2018, Jeff King wrote: > > > On Mon, Aug 06, 2018 at 05:11:13PM +0200, SZEDER Gábor wrote: > > > > > - 'git upload-pack' receives the request, parses the want line, > > > notices the corrupt pack, responds with an 'ERR upload-pack: not > > > our ref' pkt-line, and die()s right away. > > > > > > - 'git fetch' finally approaches the end of the function, where it > > > attempts to send a done pkt-line via another send_request() call > > > through the now closing TCP socket. > > > > > > - What happens now seems to depend on the platform: > > > > > > - On Linux, both on my machine and on Travis CI, it shows textbook > > > example behaviour: write() returns with error and sets errno to > > > ECONNRESET. Since it happens in write_or_die(), 'git fetch' > > > die()s with 'fatal: write error: Connection reset by peer', and > > > doesn't show the error send by 'git upload-pack'; how could it, > > > it doesn't even get as far to receive upload-pack's ERR > > > pkt-line. > > > > > > The test only checks that 'git fetch' fails, but it doesn't > > > check whether it failed with the right error message, so the > > > test still succeeds. Had it checked the error message as well, > > > we most likely had noticed this issue already, it doesn't happen > > > all that rarely. > > > > Hmm. Traditionally we did not send ERR as part of upload-pack at all. It > > was the message you got from git-daemon if it couldn't start the > > requested sub-process. It was only later in bdb31eada7 (upload-pack: > > report "not our ref" to client, 2017-02-23) that we started sending > > them. So I think that is why it does not check the error message: it is > > not expecting that case at all (and it is not actually interesting here, > > as the real problem is that the remote side is corrupt, but it sadly > > does not say anything so useful). > > > > I think that's somewhat tangential, though. The root of the issue is > > this: > > > > > - On the new OSX images with XCode 9.4 on Travis CI the write() > > > triggers SIGPIPE right away, and 'test_must_fail' notices it and > > > fails the test. I couldn't see any sign of an ECONNRESET or any > > > other error that we could act upon to avoid the SIGPIPE. > > > > Right, as soon as we get SIGPIPE we can't offer any useful message, > > because we're dead. I would argue that fetch should simply turn off > > SIGPIPE entirely, and rely on getting EPIPE from write(). But since > > we're in write_or_die(), it actually turns EPIPE back into a SIGPIPE > > death! > > > > So we'd probably also want to teach it to use a real write_in_full(), > > and then output a more useful message in this case. write_or_die() > > really does produce bad messages regardless, because it doesn't know > > what it's writing to. > > > > That would give us a baby step in the right direction, because at least > > we'd always be doing a controlled die() then. And then the next step > > would be to show the remote error message (even though it's not actually > > useful in this case, in theory upload-pack could generate something > > better). And that would mean turning the die() on write into an attempt > > to drain any ERR messages before either dying or returning an error up > > the stack. > > > > I suspect the (largely untested) patch below would make your test > > problems go away. Or instead, we could simply add sigpipe=ok to the > > test_must_fail in
Re: t5570-git-daemon fails with SIGPIPE on OSX
Hi Peff, I just had a look at the patch you provided below (for some reason, my previous search on public-inbox only turned up Gábor's mail to which you responded). Admittedly, I do not really understand all aspects of it, but it applies, still, and I kicked off a stress test here: https://dev.azure.com/git/git/_build/results?buildId=338 It seems that your patch fixes that t5570 flakiness on macOS, and more importantly, addresses an important issue on macOS. Will play a bit more with it and keep you posted. Ciao, Dscho On Tue, 14 Aug 2018, Jeff King wrote: > On Mon, Aug 06, 2018 at 05:11:13PM +0200, SZEDER Gábor wrote: > > > - 'git upload-pack' receives the request, parses the want line, > > notices the corrupt pack, responds with an 'ERR upload-pack: not > > our ref' pkt-line, and die()s right away. > > > > - 'git fetch' finally approaches the end of the function, where it > > attempts to send a done pkt-line via another send_request() call > > through the now closing TCP socket. > > > > - What happens now seems to depend on the platform: > > > > - On Linux, both on my machine and on Travis CI, it shows textbook > > example behaviour: write() returns with error and sets errno to > > ECONNRESET. Since it happens in write_or_die(), 'git fetch' > > die()s with 'fatal: write error: Connection reset by peer', and > > doesn't show the error send by 'git upload-pack'; how could it, > > it doesn't even get as far to receive upload-pack's ERR > > pkt-line. > > > > The test only checks that 'git fetch' fails, but it doesn't > > check whether it failed with the right error message, so the > > test still succeeds. Had it checked the error message as well, > > we most likely had noticed this issue already, it doesn't happen > > all that rarely. > > Hmm. Traditionally we did not send ERR as part of upload-pack at all. It > was the message you got from git-daemon if it couldn't start the > requested sub-process. It was only later in bdb31eada7 (upload-pack: > report "not our ref" to client, 2017-02-23) that we started sending > them. So I think that is why it does not check the error message: it is > not expecting that case at all (and it is not actually interesting here, > as the real problem is that the remote side is corrupt, but it sadly > does not say anything so useful). > > I think that's somewhat tangential, though. The root of the issue is > this: > > > - On the new OSX images with XCode 9.4 on Travis CI the write() > > triggers SIGPIPE right away, and 'test_must_fail' notices it and > > fails the test. I couldn't see any sign of an ECONNRESET or any > > other error that we could act upon to avoid the SIGPIPE. > > Right, as soon as we get SIGPIPE we can't offer any useful message, > because we're dead. I would argue that fetch should simply turn off > SIGPIPE entirely, and rely on getting EPIPE from write(). But since > we're in write_or_die(), it actually turns EPIPE back into a SIGPIPE > death! > > So we'd probably also want to teach it to use a real write_in_full(), > and then output a more useful message in this case. write_or_die() > really does produce bad messages regardless, because it doesn't know > what it's writing to. > > That would give us a baby step in the right direction, because at least > we'd always be doing a controlled die() then. And then the next step > would be to show the remote error message (even though it's not actually > useful in this case, in theory upload-pack could generate something > better). And that would mean turning the die() on write into an attempt > to drain any ERR messages before either dying or returning an error up > the stack. > > I suspect the (largely untested) patch below would make your test > problems go away. Or instead, we could simply add sigpipe=ok to the > test_must_fail invocation, but I agree with you that the current > behavior on OS X is not ideal (the user sees no error message). > > -Peff > > diff --git a/fetch-pack.c b/fetch-pack.c > index 5714bcbddd..3e80604562 100644 > --- a/fetch-pack.c > +++ b/fetch-pack.c > @@ -188,8 +188,10 @@ static void send_request(struct fetch_pack_args *args, > if (args->stateless_rpc) { > send_sideband(fd, -1, buf->buf, buf->len, LARGE_PACKET_MAX); > packet_flush(fd); > - } else > - write_or_die(fd, buf->buf, buf->len); > + } else { > + if (write_in_full(fd, buf->buf, buf->len) < 0) > + die_errno("unable to write to remote"); > + } > } > > static void insert_one_alternate_object(struct fetch_negotiator *negotiator, > @@ -1167,7 +1169,8 @@ static int send_fetch_request(struct fetch_negotiator > *negotiator, int fd_out, > > /* Send request */ > packet_buf_flush(&req_buf); > - write_or_die(fd_out, req_buf.buf, req_buf.len); > + if (write_in_full(fd_out, req_buf.bu
Re: t5570-git-daemon fails with SIGPIPE on OSX
Team, On Mon, 6 Aug 2018, SZEDER Gábor wrote: > [Resending with Clemens' last used email address. > Clemens, please consider sending a patch to update our .mailmap file.] > > > On Mon, Aug 6, 2018 at 5:11 PM SZEDER Gábor wrote: > > > > Travis CI changed its default OSX image to use XCode 9.4 on 2018-07-31 > > [1]. Since then OSX build jobs fail rather frequently because of a > > SIGPIPE in the tests 'fetch notices corrupt pack' or 'fetch notices > > corrupt idx' in 't5570-git-daemon.sh' [2]. I think this is a symptom > > a real bug in Git affecting other platforms as well, but these tests > > are too lax to catch it. I am seeing this very frequently now, as it feels like failing in the Azure Pipeline about half of the time. The symptom is this: -- snip -- ++ cp -R '/Users/vsts/agent/2.146.0/work/1/s/t/trash directory.t5570-git-daemon/repo/repo_pack.git' '/Users/vsts/agent/2.146.0/work/1/s/t/trash directory.t5570-git-daemon/repo/repo_bad2.git' ++ cd '/Users/vsts/agent/2.146.0/work/1/s/t/trash directory.t5570-git-daemon/repo/repo_bad2.git' +++ ls objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx ++ p=objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx ++ chmod u+w objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx ++ printf %0256d 0 ++ dd of=objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx bs=256 count=1 seek=1 conv=notrunc 1+0 records in 1+0 records out 256 bytes transferred in 0.18 secs (14316558 bytes/sec) ++ mkdir repo_bad2.git ++ cd repo_bad2.git ++ git --bare init Initialized empty Git repository in /Users/vsts/agent/2.146.0/work/1/s/t/trash directory.t5570-git-daemon/repo_bad2.git/ ++ test_must_fail git --bare fetch git://127.0.0.1:5570/repo_bad2.git ++ case "$1" in ++ _test_ok= ++ git --bare fetch git://127.0.0.1:5570/repo_bad2.git [13879] Connection from 127.0.0.1:60504 [13879] Extended attribute "host": 127.0.0.1:5570 [13879] Extended attribute "protocol": version=0 [13879] Request upload-pack for '/repo_bad2.git' [13879] error: non-monotonic index ./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx [13879] error: non-monotonic index ./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx [13879] error: non-monotonic index ./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx [13879] error: non-monotonic index ./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx [13879] error: refs/heads/master does not point to a valid object! [13879] error: non-monotonic index ./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx [13879] error: non-monotonic index ./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx [13879] error: refs/heads/other does not point to a valid object! [13879] error: non-monotonic index ./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx [13879] error: non-monotonic index ./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx [13879] fatal: git upload-pack: not our ref b5c2e4226db03597a64815fd226648510b22ba40 [11833] [13879] Disconnected (with error) ++ exit_code=141 ++ test 141 -eq 0 ++ test_match_signal 13 141 ++ test 141 = 141 ++ return 0 ++ list_contains '' sigpipe ++ case ",$1," in ++ return 1 ++ test 141 -gt 129 ++ test 141 -le 192 ++ echo 'test_must_fail: died by signal 13: git --bare fetch git://127.0.0.1:5570/repo_bad2.git' test_must_fail: died by signal 13: git --bare fetch git://127.0.0.1:5570/repo_bad2.git ++ return 1 error: last command exited with $?=1 not ok 10 - fetch notices corrupt idx # # cp -R "$GIT_DAEMON_DOCUMENT_ROOT_PATH"/repo_pack.git "$GIT_DAEMON_DOCUMENT_ROOT_PATH"/repo_bad2.git && # (cd "$GIT_DAEMON_DOCUMENT_ROOT_PATH"/repo_bad2.git && #p=$(ls objects/pack/pack-*.idx) && #chmod u+w $p && #printf %0256d 0 | dd of=$p bs=256 count=1 seek=1 conv=notrunc # ) && # mkdir repo_bad2.git && # (cd repo_bad2.git && #git --bare init && #test_must_fail git --bare fetch "$GIT_DAEMON_URL/repo_bad2.git" && #test 0 = $(ls objects/pack | wc -l) # ) # -- snap -- Any ideas how to fix this test, anyone? Ciao, Dscho > > > > What it boils down to is this sequence: > > > > - The test first prepares a repository containing a corrupt pack, > > ready to be server via 'git daemon'. > > > > - Then the test runs 'test_must_fail git fetch ', which connects > > to 'git daemon', which forks 'git upload-pack', which then > > advertises refs (only HEAD) and capabilities. So far so good. > > > > - 'git fetch' eventually calls fetch-pack.c:find_common(). The > > first half of this function assembles a request consisting of a > > want and a flush pkt-line, and sends it via a send_request() call. > > > > At this point the scheduling becomes important: let's suppose that > > fetch is slow and u
Re: t5570-git-daemon fails with SIGPIPE on OSX
On Tue, Aug 14, 2018 at 06:32:47PM -0400, Jeff King wrote: > I suspect the (largely untested) patch below would make your test > problems go away. Or instead, we could simply add sigpipe=ok to the > test_must_fail invocation, but I agree with you that the current > behavior on OS X is not ideal (the user sees no error message). Whoops, that patch of course misses adding "sigchain_push(SIGPIPE, SIG_IGN)" during the fetch-pack operation. I'll leave that as an exercise to the reader. ;) -Peff
Re: t5570-git-daemon fails with SIGPIPE on OSX
On Mon, Aug 06, 2018 at 05:11:13PM +0200, SZEDER Gábor wrote: > - 'git upload-pack' receives the request, parses the want line, > notices the corrupt pack, responds with an 'ERR upload-pack: not > our ref' pkt-line, and die()s right away. > > - 'git fetch' finally approaches the end of the function, where it > attempts to send a done pkt-line via another send_request() call > through the now closing TCP socket. > > - What happens now seems to depend on the platform: > > - On Linux, both on my machine and on Travis CI, it shows textbook > example behaviour: write() returns with error and sets errno to > ECONNRESET. Since it happens in write_or_die(), 'git fetch' > die()s with 'fatal: write error: Connection reset by peer', and > doesn't show the error send by 'git upload-pack'; how could it, > it doesn't even get as far to receive upload-pack's ERR > pkt-line. > > The test only checks that 'git fetch' fails, but it doesn't > check whether it failed with the right error message, so the > test still succeeds. Had it checked the error message as well, > we most likely had noticed this issue already, it doesn't happen > all that rarely. Hmm. Traditionally we did not send ERR as part of upload-pack at all. It was the message you got from git-daemon if it couldn't start the requested sub-process. It was only later in bdb31eada7 (upload-pack: report "not our ref" to client, 2017-02-23) that we started sending them. So I think that is why it does not check the error message: it is not expecting that case at all (and it is not actually interesting here, as the real problem is that the remote side is corrupt, but it sadly does not say anything so useful). I think that's somewhat tangential, though. The root of the issue is this: > - On the new OSX images with XCode 9.4 on Travis CI the write() > triggers SIGPIPE right away, and 'test_must_fail' notices it and > fails the test. I couldn't see any sign of an ECONNRESET or any > other error that we could act upon to avoid the SIGPIPE. Right, as soon as we get SIGPIPE we can't offer any useful message, because we're dead. I would argue that fetch should simply turn off SIGPIPE entirely, and rely on getting EPIPE from write(). But since we're in write_or_die(), it actually turns EPIPE back into a SIGPIPE death! So we'd probably also want to teach it to use a real write_in_full(), and then output a more useful message in this case. write_or_die() really does produce bad messages regardless, because it doesn't know what it's writing to. That would give us a baby step in the right direction, because at least we'd always be doing a controlled die() then. And then the next step would be to show the remote error message (even though it's not actually useful in this case, in theory upload-pack could generate something better). And that would mean turning the die() on write into an attempt to drain any ERR messages before either dying or returning an error up the stack. I suspect the (largely untested) patch below would make your test problems go away. Or instead, we could simply add sigpipe=ok to the test_must_fail invocation, but I agree with you that the current behavior on OS X is not ideal (the user sees no error message). -Peff diff --git a/fetch-pack.c b/fetch-pack.c index 5714bcbddd..3e80604562 100644 --- a/fetch-pack.c +++ b/fetch-pack.c @@ -188,8 +188,10 @@ static void send_request(struct fetch_pack_args *args, if (args->stateless_rpc) { send_sideband(fd, -1, buf->buf, buf->len, LARGE_PACKET_MAX); packet_flush(fd); - } else - write_or_die(fd, buf->buf, buf->len); + } else { + if (write_in_full(fd, buf->buf, buf->len) < 0) + die_errno("unable to write to remote"); + } } static void insert_one_alternate_object(struct fetch_negotiator *negotiator, @@ -1167,7 +1169,8 @@ static int send_fetch_request(struct fetch_negotiator *negotiator, int fd_out, /* Send request */ packet_buf_flush(&req_buf); - write_or_die(fd_out, req_buf.buf, req_buf.len); + if (write_in_full(fd_out, req_buf.buf, req_buf.len) < 0) + die_errno("unable to write request to remote"); strbuf_release(&req_buf); return ret; diff --git a/pkt-line.c b/pkt-line.c index a593c08aad..450d0801b1 100644 --- a/pkt-line.c +++ b/pkt-line.c @@ -88,13 +88,15 @@ static void packet_trace(const char *buf, unsigned int len, int write) void packet_flush(int fd) { packet_trace("", 4, 1); - write_or_die(fd, "", 4); + if (write_in_full(fd, "", 4) < 0) + die_errno("unable to write flush packet"); } void packet_delim(int fd) { packet_trace("0001", 4, 1); - write_or_die(fd, "0001", 4); + if (write_in_full(fd, "", 4) < 0) + die_errno("unable to
Re: t5570-git-daemon fails with SIGPIPE on OSX
[Resending with Clemens' last used email address. Clemens, please consider sending a patch to update our .mailmap file.] On Mon, Aug 6, 2018 at 5:11 PM SZEDER Gábor wrote: > > Travis CI changed its default OSX image to use XCode 9.4 on 2018-07-31 > [1]. Since then OSX build jobs fail rather frequently because of a > SIGPIPE in the tests 'fetch notices corrupt pack' or 'fetch notices > corrupt idx' in 't5570-git-daemon.sh' [2]. I think this is a symptom > a real bug in Git affecting other platforms as well, but these tests > are too lax to catch it. > > What it boils down to is this sequence: > > - The test first prepares a repository containing a corrupt pack, > ready to be server via 'git daemon'. > > - Then the test runs 'test_must_fail git fetch ', which connects > to 'git daemon', which forks 'git upload-pack', which then > advertises refs (only HEAD) and capabilities. So far so good. > > - 'git fetch' eventually calls fetch-pack.c:find_common(). The > first half of this function assembles a request consisting of a > want and a flush pkt-line, and sends it via a send_request() call. > > At this point the scheduling becomes important: let's suppose that > fetch is slow and upload-pack is fast. > > - 'git upload-pack' receives the request, parses the want line, > notices the corrupt pack, responds with an 'ERR upload-pack: not > our ref' pkt-line, and die()s right away. > > - 'git fetch' finally approaches the end of the function, where it > attempts to send a done pkt-line via another send_request() call > through the now closing TCP socket. > > - What happens now seems to depend on the platform: > > - On Linux, both on my machine and on Travis CI, it shows textbook > example behaviour: write() returns with error and sets errno to > ECONNRESET. Since it happens in write_or_die(), 'git fetch' > die()s with 'fatal: write error: Connection reset by peer', and > doesn't show the error send by 'git upload-pack'; how could it, > it doesn't even get as far to receive upload-pack's ERR > pkt-line. > > The test only checks that 'git fetch' fails, but it doesn't > check whether it failed with the right error message, so the > test still succeeds. Had it checked the error message as well, > we most likely had noticed this issue already, it doesn't happen > all that rarely. > > - On the new OSX images with XCode 9.4 on Travis CI the write() > triggers SIGPIPE right away, and 'test_must_fail' notices it and > fails the test. I couldn't see any sign of an ECONNRESET or any > other error that we could act upon to avoid the SIGPIPE. > > - On OSX with XCode 9.2 on Travis CI there is neither SIGPIPE, nor > ECONNRESET, but sending the request actually succeeds even > though there is no process on the other end of the socket > anymore. 'git fetch' then simply continues execution, reads and > parses the ERR pkt-line, and then dies()s with 'fatal: remote > error: upload-pack: not our ref'. So, on the face of it, it > shows the desired behaviour, but I have no idea how that write() > could succeed instead of returning error. > > I don't know what happens on a real Mac as I don't have access to one; > I figured out all the above by enabling packet tracing, adding a > couple of well placed tracing printf() and sleep() calls, running a > bunch of builds on Travis CI, and looking through their logs. But > without access to a debugger and netstat and what not I can't really > go any further. So I would now happily pass the baton to those who > have a Mac and know a thing or two about its porting issues to first > check whether OSX on a real Mac shows the same behaviour as it does in > Travis CI's virtualized(?) environment. And then they can pass the > baton to those who know all the intricacies of the pack protocol and > its implementation to decide what to do with this issue. > > For a mostly reliable reproduction recipe you might want to fetch this > branch: > > https://github.com/szeder/git t5570-git-daemon-sigpipe > > and then run 'make && cd t && ./t5570-git-daemon.sh -v -x' > > > Have fun! ;) > > > 1 - https://blog.travis-ci.com/2018-07-19-xcode9-4-default-announce > > 2 - On git.git's master: > https://travis-ci.org/git/git/jobs/411517552#L2717
t5570-git-daemon fails with SIGPIPE on OSX
Travis CI changed its default OSX image to use XCode 9.4 on 2018-07-31 [1]. Since then OSX build jobs fail rather frequently because of a SIGPIPE in the tests 'fetch notices corrupt pack' or 'fetch notices corrupt idx' in 't5570-git-daemon.sh' [2]. I think this is a symptom a real bug in Git affecting other platforms as well, but these tests are too lax to catch it. What it boils down to is this sequence: - The test first prepares a repository containing a corrupt pack, ready to be server via 'git daemon'. - Then the test runs 'test_must_fail git fetch ', which connects to 'git daemon', which forks 'git upload-pack', which then advertises refs (only HEAD) and capabilities. So far so good. - 'git fetch' eventually calls fetch-pack.c:find_common(). The first half of this function assembles a request consisting of a want and a flush pkt-line, and sends it via a send_request() call. At this point the scheduling becomes important: let's suppose that fetch is slow and upload-pack is fast. - 'git upload-pack' receives the request, parses the want line, notices the corrupt pack, responds with an 'ERR upload-pack: not our ref' pkt-line, and die()s right away. - 'git fetch' finally approaches the end of the function, where it attempts to send a done pkt-line via another send_request() call through the now closing TCP socket. - What happens now seems to depend on the platform: - On Linux, both on my machine and on Travis CI, it shows textbook example behaviour: write() returns with error and sets errno to ECONNRESET. Since it happens in write_or_die(), 'git fetch' die()s with 'fatal: write error: Connection reset by peer', and doesn't show the error send by 'git upload-pack'; how could it, it doesn't even get as far to receive upload-pack's ERR pkt-line. The test only checks that 'git fetch' fails, but it doesn't check whether it failed with the right error message, so the test still succeeds. Had it checked the error message as well, we most likely had noticed this issue already, it doesn't happen all that rarely. - On the new OSX images with XCode 9.4 on Travis CI the write() triggers SIGPIPE right away, and 'test_must_fail' notices it and fails the test. I couldn't see any sign of an ECONNRESET or any other error that we could act upon to avoid the SIGPIPE. - On OSX with XCode 9.2 on Travis CI there is neither SIGPIPE, nor ECONNRESET, but sending the request actually succeeds even though there is no process on the other end of the socket anymore. 'git fetch' then simply continues execution, reads and parses the ERR pkt-line, and then dies()s with 'fatal: remote error: upload-pack: not our ref'. So, on the face of it, it shows the desired behaviour, but I have no idea how that write() could succeed instead of returning error. I don't know what happens on a real Mac as I don't have access to one; I figured out all the above by enabling packet tracing, adding a couple of well placed tracing printf() and sleep() calls, running a bunch of builds on Travis CI, and looking through their logs. But without access to a debugger and netstat and what not I can't really go any further. So I would now happily pass the baton to those who have a Mac and know a thing or two about its porting issues to first check whether OSX on a real Mac shows the same behaviour as it does in Travis CI's virtualized(?) environment. And then they can pass the baton to those who know all the intricacies of the pack protocol and its implementation to decide what to do with this issue. For a mostly reliable reproduction recipe you might want to fetch this branch: https://github.com/szeder/git t5570-git-daemon-sigpipe and then run 'make && cd t && ./t5570-git-daemon.sh -v -x' Have fun! ;) 1 - https://blog.travis-ci.com/2018-07-19-xcode9-4-default-announce 2 - On git.git's master: https://travis-ci.org/git/git/jobs/411517552#L2717