t5570-git-daemon fails with SIGPIPE on OSX

2018-08-06 Thread SZEDER Gábor
Travis CI changed its default OSX image to use XCode 9.4 on 2018-07-31
[1].  Since then OSX build jobs fail rather frequently because of a
SIGPIPE in the tests 'fetch notices corrupt pack' or 'fetch notices
corrupt idx' in 't5570-git-daemon.sh' [2].  I think this is a symptom
a real bug in Git affecting other platforms as well, but these tests
are too lax to catch it.

What it boils down to is this sequence:

  - The test first prepares a repository containing a corrupt pack,
ready to be server via 'git daemon'.

  - Then the test runs 'test_must_fail git fetch ', which connects
to 'git daemon', which forks 'git upload-pack', which then
advertises refs (only HEAD) and capabilities.  So far so good.

  - 'git fetch' eventually calls fetch-pack.c:find_common().  The
first half of this function assembles a request consisting of a
want and a flush pkt-line, and sends it via a send_request() call.

At this point the scheduling becomes important: let's suppose that
fetch is slow and upload-pack is fast.

  - 'git upload-pack' receives the request, parses the want line,
notices the corrupt pack, responds with an 'ERR upload-pack: not
our ref' pkt-line, and die()s right away.

  - 'git fetch' finally approaches the end of the function, where it
attempts to send a done pkt-line via another send_request() call
through the now closing TCP socket.

  - What happens now seems to depend on the platform:

- On Linux, both on my machine and on Travis CI, it shows textbook
  example behaviour: write() returns with error and sets errno to
  ECONNRESET.  Since it happens in write_or_die(), 'git fetch'
  die()s with 'fatal: write error: Connection reset by peer', and
  doesn't show the error send by 'git upload-pack'; how could it,
  it doesn't even get as far to receive upload-pack's ERR
  pkt-line.

  The test only checks that 'git fetch' fails, but it doesn't
  check whether it failed with the right error message, so the
  test still succeeds.  Had it checked the error message as well,
  we most likely had noticed this issue already, it doesn't happen
  all that rarely.

- On the new OSX images with XCode 9.4 on Travis CI the write()
  triggers SIGPIPE right away, and 'test_must_fail' notices it and
  fails the test.  I couldn't see any sign of an ECONNRESET or any
  other error that we could act upon to avoid the SIGPIPE.

- On OSX with XCode 9.2 on Travis CI there is neither SIGPIPE, nor
  ECONNRESET, but sending the request actually succeeds even
  though there is no process on the other end of the socket
  anymore.  'git fetch' then simply continues execution, reads and
  parses the ERR pkt-line, and then dies()s with 'fatal: remote
  error: upload-pack: not our ref'.  So, on the face of it, it
  shows the desired behaviour, but I have no idea how that write()
  could succeed instead of returning error.

I don't know what happens on a real Mac as I don't have access to one;
I figured out all the above by enabling packet tracing, adding a
couple of well placed tracing printf() and sleep() calls, running a
bunch of builds on Travis CI, and looking through their logs.  But
without access to a debugger and netstat and what not I can't really
go any further.  So I would now happily pass the baton to those who
have a Mac and know a thing or two about its porting issues to first
check whether OSX on a real Mac shows the same behaviour as it does in
Travis CI's virtualized(?) environment.  And then they can pass the
baton to those who know all the intricacies of the pack protocol and
its implementation to decide what to do with this issue.

For a mostly reliable reproduction recipe you might want to fetch this
branch:

  https://github.com/szeder/git t5570-git-daemon-sigpipe

and then run 'make && cd t && ./t5570-git-daemon.sh -v -x'


Have fun! ;)


1 - https://blog.travis-ci.com/2018-07-19-xcode9-4-default-announce

2 - On git.git's master:
https://travis-ci.org/git/git/jobs/411517552#L2717


Re: t5570-git-daemon fails with SIGPIPE on OSX

2019-02-08 Thread Johannes Schindelin
Team,

On Mon, 6 Aug 2018, SZEDER Gábor wrote:

> [Resending with Clemens' last used email address.
> Clemens, please consider sending a patch to update our .mailmap file.]
> 
> 
> On Mon, Aug 6, 2018 at 5:11 PM SZEDER Gábor  wrote:
> >
> > Travis CI changed its default OSX image to use XCode 9.4 on 2018-07-31
> > [1].  Since then OSX build jobs fail rather frequently because of a
> > SIGPIPE in the tests 'fetch notices corrupt pack' or 'fetch notices
> > corrupt idx' in 't5570-git-daemon.sh' [2].  I think this is a symptom
> > a real bug in Git affecting other platforms as well, but these tests
> > are too lax to catch it.

I am seeing this very frequently now, as it feels like failing in the
Azure Pipeline about half of the time.

The symptom is this:

-- snip --
++ cp -R '/Users/vsts/agent/2.146.0/work/1/s/t/trash 
directory.t5570-git-daemon/repo/repo_pack.git' 
'/Users/vsts/agent/2.146.0/work/1/s/t/trash 
directory.t5570-git-daemon/repo/repo_bad2.git'
++ cd '/Users/vsts/agent/2.146.0/work/1/s/t/trash 
directory.t5570-git-daemon/repo/repo_bad2.git'
+++ ls objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
++ p=objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
++ chmod u+w objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
++ printf %0256d 0
++ dd of=objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx bs=256 
count=1 seek=1 conv=notrunc
1+0 records in
1+0 records out
256 bytes transferred in 0.18 secs (14316558 bytes/sec)
++ mkdir repo_bad2.git
++ cd repo_bad2.git
++ git --bare init
Initialized empty Git repository in /Users/vsts/agent/2.146.0/work/1/s/t/trash 
directory.t5570-git-daemon/repo_bad2.git/
++ test_must_fail git --bare fetch git://127.0.0.1:5570/repo_bad2.git
++ case "$1" in
++ _test_ok=
++ git --bare fetch git://127.0.0.1:5570/repo_bad2.git
[13879] Connection from 127.0.0.1:60504
[13879] Extended attribute "host": 127.0.0.1:5570
[13879] Extended attribute "protocol": version=0
[13879] Request upload-pack for '/repo_bad2.git'
[13879] error: non-monotonic index 
./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
[13879] error: non-monotonic index 
./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
[13879] error: non-monotonic index 
./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
[13879] error: non-monotonic index 
./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
[13879] error: refs/heads/master does not point to a valid object!
[13879] error: non-monotonic index 
./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
[13879] error: non-monotonic index 
./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
[13879] error: refs/heads/other does not point to a valid object!
[13879] error: non-monotonic index 
./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
[13879] error: non-monotonic index 
./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
[13879] fatal: git upload-pack: not our ref 
b5c2e4226db03597a64815fd226648510b22ba40
[11833] [13879] Disconnected (with error)
++ exit_code=141
++ test 141 -eq 0
++ test_match_signal 13 141
++ test 141 = 141
++ return 0
++ list_contains '' sigpipe
++ case ",$1," in
++ return 1
++ test 141 -gt 129
++ test 141 -le 192
++ echo 'test_must_fail: died by signal 13: git --bare fetch 
git://127.0.0.1:5570/repo_bad2.git'
test_must_fail: died by signal 13: git --bare fetch 
git://127.0.0.1:5570/repo_bad2.git
++ return 1
error: last command exited with $?=1
not ok 10 - fetch notices corrupt idx
#   
#   cp -R "$GIT_DAEMON_DOCUMENT_ROOT_PATH"/repo_pack.git 
"$GIT_DAEMON_DOCUMENT_ROOT_PATH"/repo_bad2.git &&
#   (cd "$GIT_DAEMON_DOCUMENT_ROOT_PATH"/repo_bad2.git &&
#p=$(ls objects/pack/pack-*.idx) &&
#chmod u+w $p &&
#printf %0256d 0 | dd of=$p bs=256 count=1 seek=1 conv=notrunc
#   ) &&
#   mkdir repo_bad2.git &&
#   (cd repo_bad2.git &&
#git --bare init &&
#test_must_fail git --bare fetch 
"$GIT_DAEMON_URL/repo_bad2.git" &&
#test 0 = $(ls objects/pack | wc -l)
#   )
#   
-- snap --

Any ideas how to fix this test, anyone?

Ciao,
Dscho

> >
> > What it boils down to is this sequence:
> >
> >   - The test first prepares a repository containing a corrupt pack,
> > ready to be server via 'git daemon'.
> >
> >   - Then the test runs 'test_must_fail git fetch ', which connects
> > to 'git daemon', which forks 'git upload-pack', which then
> > advertises refs (only HEAD) and capabilities.  So far so good.
> >
> >   - 'git fetch' eventually calls fetch-pack.c:find_common().  The
> > first half of this function assembles a request consisting of a
> > want and a flush pkt-line, and sends it via a send_request() call.
> >
> > At this point the scheduling becomes important: let's suppose that
> > fetch is slow and u

Re: t5570-git-daemon fails with SIGPIPE on OSX

2019-02-08 Thread Johannes Schindelin
Hi Peff,

I just had a look at the patch you provided below (for some reason, my
previous search on public-inbox only turned up Gábor's mail to which you
responded).

Admittedly, I do not really understand all aspects of it, but it applies,
still, and I kicked off a stress test here:

https://dev.azure.com/git/git/_build/results?buildId=338

It seems that your patch fixes that t5570 flakiness on macOS, and more
importantly, addresses an important issue on macOS.

Will play a bit more with it and keep you posted.

Ciao,
Dscho

On Tue, 14 Aug 2018, Jeff King wrote:

> On Mon, Aug 06, 2018 at 05:11:13PM +0200, SZEDER Gábor wrote:
> 
> >   - 'git upload-pack' receives the request, parses the want line,
> > notices the corrupt pack, responds with an 'ERR upload-pack: not
> > our ref' pkt-line, and die()s right away.
> > 
> >   - 'git fetch' finally approaches the end of the function, where it
> > attempts to send a done pkt-line via another send_request() call
> > through the now closing TCP socket.
> > 
> >   - What happens now seems to depend on the platform:
> > 
> > - On Linux, both on my machine and on Travis CI, it shows textbook
> >   example behaviour: write() returns with error and sets errno to
> >   ECONNRESET.  Since it happens in write_or_die(), 'git fetch'
> >   die()s with 'fatal: write error: Connection reset by peer', and
> >   doesn't show the error send by 'git upload-pack'; how could it,
> >   it doesn't even get as far to receive upload-pack's ERR
> >   pkt-line.
> > 
> >   The test only checks that 'git fetch' fails, but it doesn't
> >   check whether it failed with the right error message, so the
> >   test still succeeds.  Had it checked the error message as well,
> >   we most likely had noticed this issue already, it doesn't happen
> >   all that rarely.
> 
> Hmm. Traditionally we did not send ERR as part of upload-pack at all. It
> was the message you got from git-daemon if it couldn't start the
> requested sub-process. It was only later in bdb31eada7 (upload-pack:
> report "not our ref" to client, 2017-02-23) that we started sending
> them. So I think that is why it does not check the error message: it is
> not expecting that case at all (and it is not actually interesting here,
> as the real problem is that the remote side is corrupt, but it sadly
> does not say anything so useful).
> 
> I think that's somewhat tangential, though. The root of the issue is
> this:
> 
> > - On the new OSX images with XCode 9.4 on Travis CI the write()
> >   triggers SIGPIPE right away, and 'test_must_fail' notices it and
> >   fails the test.  I couldn't see any sign of an ECONNRESET or any
> >   other error that we could act upon to avoid the SIGPIPE.
> 
> Right, as soon as we get SIGPIPE we can't offer any useful message,
> because we're dead. I would argue that fetch should simply turn off
> SIGPIPE entirely, and rely on getting EPIPE from write(). But since
> we're in write_or_die(), it actually turns EPIPE back into a SIGPIPE
> death!
> 
> So we'd probably also want to teach it to use a real write_in_full(),
> and then output a more useful message in this case. write_or_die()
> really does produce bad messages regardless, because it doesn't know
> what it's writing to.
> 
> That would give us a baby step in the right direction, because at least
> we'd always be doing a controlled die() then. And then the next step
> would be to show the remote error message (even though it's not actually
> useful in this case, in theory upload-pack could generate something
> better). And that would mean turning the die() on write into an attempt
> to drain any ERR messages before either dying or returning an error up
> the stack.
> 
> I suspect the (largely untested) patch below would make your test
> problems go away. Or instead, we could simply add sigpipe=ok to the
> test_must_fail invocation, but I agree with you that the current
> behavior on OS X is not ideal (the user sees no error message).
> 
> -Peff
> 
> diff --git a/fetch-pack.c b/fetch-pack.c
> index 5714bcbddd..3e80604562 100644
> --- a/fetch-pack.c
> +++ b/fetch-pack.c
> @@ -188,8 +188,10 @@ static void send_request(struct fetch_pack_args *args,
>   if (args->stateless_rpc) {
>   send_sideband(fd, -1, buf->buf, buf->len, LARGE_PACKET_MAX);
>   packet_flush(fd);
> - } else
> - write_or_die(fd, buf->buf, buf->len);
> + } else {
> + if (write_in_full(fd, buf->buf, buf->len) < 0)
> + die_errno("unable to write to remote");
> + }
>  }
>  
>  static void insert_one_alternate_object(struct fetch_negotiator *negotiator,
> @@ -1167,7 +1169,8 @@ static int send_fetch_request(struct fetch_negotiator 
> *negotiator, int fd_out,
>  
>   /* Send request */
>   packet_buf_flush(&req_buf);
> - write_or_die(fd_out, req_buf.buf, req_buf.len);
> + if (write_in_full(fd_out, req_buf.bu

Re: t5570-git-daemon fails with SIGPIPE on OSX

2019-02-08 Thread Johannes Schindelin
Hi Peff,

On Fri, 8 Feb 2019, Johannes Schindelin wrote:

> I just had a look at the patch you provided below (for some reason, my
> previous search on public-inbox only turned up Gábor's mail to which you
> responded).
> 
> Admittedly, I do not really understand all aspects of it, but it applies,
> still, and I kicked off a stress test here:
> 
>   https://dev.azure.com/git/git/_build/results?buildId=338
> 
> It seems that your patch fixes that t5570 flakiness on macOS, and more
> importantly, addresses an important issue on macOS.
> 
> Will play a bit more with it and keep you posted.

Alas, I was fooled. *Fooled*, I say. Apparently the --stress option makes
the script *succeed* when it fails?

I say that because I wanted to make sure that your patch fixes things and
reverted your change and started another build, which succeeded. So I
started another build, then another build, and they all succeeded. Only
then it dawned on me that I had not looked at the *logs*. And they all
still report the same issue, even with your patch:

https://dev.azure.com/git/git/_build/results?buildId=338&view=logs&jobId=51041795-01c5-57f3-5561-107b6b9e51a6&taskId=fadc714a-a906-5cf2-cc7a-335e443ad2f8&lineStart=1402&lineEnd=1505&colStart=1&colEnd=32

(You will have to scroll all the way down, or press Ctrl+End, to see that
"fetch notices corrupt pack" is failing.)

So I am afraid that your patch does not fix the issue nor does it work
around it.

Ciao,
Dscho

> On Tue, 14 Aug 2018, Jeff King wrote:
> 
> > On Mon, Aug 06, 2018 at 05:11:13PM +0200, SZEDER Gábor wrote:
> > 
> > >   - 'git upload-pack' receives the request, parses the want line,
> > > notices the corrupt pack, responds with an 'ERR upload-pack: not
> > > our ref' pkt-line, and die()s right away.
> > > 
> > >   - 'git fetch' finally approaches the end of the function, where it
> > > attempts to send a done pkt-line via another send_request() call
> > > through the now closing TCP socket.
> > > 
> > >   - What happens now seems to depend on the platform:
> > > 
> > > - On Linux, both on my machine and on Travis CI, it shows textbook
> > >   example behaviour: write() returns with error and sets errno to
> > >   ECONNRESET.  Since it happens in write_or_die(), 'git fetch'
> > >   die()s with 'fatal: write error: Connection reset by peer', and
> > >   doesn't show the error send by 'git upload-pack'; how could it,
> > >   it doesn't even get as far to receive upload-pack's ERR
> > >   pkt-line.
> > > 
> > >   The test only checks that 'git fetch' fails, but it doesn't
> > >   check whether it failed with the right error message, so the
> > >   test still succeeds.  Had it checked the error message as well,
> > >   we most likely had noticed this issue already, it doesn't happen
> > >   all that rarely.
> > 
> > Hmm. Traditionally we did not send ERR as part of upload-pack at all. It
> > was the message you got from git-daemon if it couldn't start the
> > requested sub-process. It was only later in bdb31eada7 (upload-pack:
> > report "not our ref" to client, 2017-02-23) that we started sending
> > them. So I think that is why it does not check the error message: it is
> > not expecting that case at all (and it is not actually interesting here,
> > as the real problem is that the remote side is corrupt, but it sadly
> > does not say anything so useful).
> > 
> > I think that's somewhat tangential, though. The root of the issue is
> > this:
> > 
> > > - On the new OSX images with XCode 9.4 on Travis CI the write()
> > >   triggers SIGPIPE right away, and 'test_must_fail' notices it and
> > >   fails the test.  I couldn't see any sign of an ECONNRESET or any
> > >   other error that we could act upon to avoid the SIGPIPE.
> > 
> > Right, as soon as we get SIGPIPE we can't offer any useful message,
> > because we're dead. I would argue that fetch should simply turn off
> > SIGPIPE entirely, and rely on getting EPIPE from write(). But since
> > we're in write_or_die(), it actually turns EPIPE back into a SIGPIPE
> > death!
> > 
> > So we'd probably also want to teach it to use a real write_in_full(),
> > and then output a more useful message in this case. write_or_die()
> > really does produce bad messages regardless, because it doesn't know
> > what it's writing to.
> > 
> > That would give us a baby step in the right direction, because at least
> > we'd always be doing a controlled die() then. And then the next step
> > would be to show the remote error message (even though it's not actually
> > useful in this case, in theory upload-pack could generate something
> > better). And that would mean turning the die() on write into an attempt
> > to drain any ERR messages before either dying or returning an error up
> > the stack.
> > 
> > I suspect the (largely untested) patch below would make your test
> > problems go away. Or instead, we could simply add sigpipe=ok to the
> > test_must_fail in

Re: t5570-git-daemon fails with SIGPIPE on OSX

2019-02-08 Thread SZEDER Gábor
On Fri, Feb 08, 2019 at 09:32:32AM +0100, Johannes Schindelin wrote:
> Team,
> 
> On Mon, 6 Aug 2018, SZEDER Gábor wrote:
> 
> > [Resending with Clemens' last used email address.
> > Clemens, please consider sending a patch to update our .mailmap file.]
> > 
> > 
> > On Mon, Aug 6, 2018 at 5:11 PM SZEDER Gábor  wrote:
> > >
> > > Travis CI changed its default OSX image to use XCode 9.4 on 2018-07-31
> > > [1].  Since then OSX build jobs fail rather frequently because of a
> > > SIGPIPE in the tests 'fetch notices corrupt pack' or 'fetch notices
> > > corrupt idx' in 't5570-git-daemon.sh' [2].  I think this is a symptom
> > > a real bug in Git affecting other platforms as well, but these tests
> > > are too lax to catch it.
> 
> I am seeing this very frequently now, as it feels like failing in the
> Azure Pipeline about half of the time.

I was wondering whether it's only an issue on Travis CI, and was
waiting whether you'll complain about it :)  Evidently it is not, but
I still would like to see a report about macOS running directly on
someone's Mac hardware, i.e. without all the CI/cloud/whatnot magic.

> Any ideas how to fix this test, anyone?

I'm afraid that this is not merely an issue with the test, but a
platform issue that we should work around somehow.  I would have
suggested to follow up on what Peff suggested, but I see that you
already did, and it didn't work out...

In the meantime, for lack of a better option, I started to skip the
two failure-prone tests in the OSX build jobs in my automated CI
builds with:

 -- >8 --

Subject: [PATCH] travis-ci: skip flaky tests in 't5570-git-daemon.sh' in OSX
 build jobs

See: 
https://public-inbox.org/git/CAM0VKj=mcs+cmogzf_xypeb+qzrfmumh52-pv_ndmza9x+r...@mail.gmail.com/

Signed-off-by: SZEDER Gábor 
---
 ci/lib-travisci.sh | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/ci/lib-travisci.sh b/ci/lib-travisci.sh
index 69dff4d1ec..1cd2c1db7d 100755
--- a/ci/lib-travisci.sh
+++ b/ci/lib-travisci.sh
@@ -121,7 +121,9 @@ osx-clang|osx-gcc)
# t9810 occasionally fails on Travis CI OS X
# t9816 occasionally fails with "TAP out of sequence errors" on
# Travis CI OS X
-   export GIT_SKIP_TESTS="t9810 t9816"
+   # In 't5570-git-daemon.sh', the tests 'fetch notices corrupt pack'
+   # and 'fetch notices corrupt idx' fail rather frequently.
+   export GIT_SKIP_TESTS="t9810 t9816 t5570.9 t5570.10"
;;
 GIT_TEST_GETTEXT_POISON)
export GIT_TEST_GETTEXT_POISON=YesPlease
-- 
2.20.1.940.g8404bb2d1a

 -- >8 --


> > > What it boils down to is this sequence:
> > >
> > >   - The test first prepares a repository containing a corrupt pack,
> > > ready to be server via 'git daemon'.
> > >
> > >   - Then the test runs 'test_must_fail git fetch ', which connects
> > > to 'git daemon', which forks 'git upload-pack', which then
> > > advertises refs (only HEAD) and capabilities.  So far so good.
> > >
> > >   - 'git fetch' eventually calls fetch-pack.c:find_common().  The
> > > first half of this function assembles a request consisting of a
> > > want and a flush pkt-line, and sends it via a send_request() call.
> > >
> > > At this point the scheduling becomes important: let's suppose that
> > > fetch is slow and upload-pack is fast.
> > >
> > >   - 'git upload-pack' receives the request, parses the want line,
> > > notices the corrupt pack, responds with an 'ERR upload-pack: not
> > > our ref' pkt-line, and die()s right away.
> > >
> > >   - 'git fetch' finally approaches the end of the function, where it
> > > attempts to send a done pkt-line via another send_request() call
> > > through the now closing TCP socket.
> > >
> > >   - What happens now seems to depend on the platform:
> > >
> > > - On Linux, both on my machine and on Travis CI, it shows textbook
> > >   example behaviour: write() returns with error and sets errno to
> > >   ECONNRESET.  Since it happens in write_or_die(), 'git fetch'
> > >   die()s with 'fatal: write error: Connection reset by peer', and
> > >   doesn't show the error send by 'git upload-pack'; how could it,
> > >   it doesn't even get as far to receive upload-pack's ERR
> > >   pkt-line.
> > >
> > >   The test only checks that 'git fetch' fails, but it doesn't
> > >   check whether it failed with the right error message, so the
> > >   test still succeeds.  Had it checked the error message as well,
> > >   we most likely had noticed this issue already, it doesn't happen
> > >   all that rarely.
> > >
> > > - On the new OSX images with XCode 9.4 on Travis CI the write()
> > >   triggers SIGPIPE right away, and 'test_must_fail' notices it and
> > >   fails the test.  I couldn't see any sign of an ECONNRESET or any
> > >   other error that we could act upon to avoid the SIGPIPE.
> > >
> > > - On OSX with XCode 9.2 on Travis CI there is neither SIGPIPE, nor
> > > 

Re: t5570-git-daemon fails with SIGPIPE on OSX

2019-02-08 Thread Jeff King
On Fri, Feb 08, 2019 at 10:28:12AM +0100, Johannes Schindelin wrote:

> Hi Peff,
> 
> On Fri, 8 Feb 2019, Johannes Schindelin wrote:
> 
> > I just had a look at the patch you provided below (for some reason, my
> > previous search on public-inbox only turned up Gábor's mail to which you
> > responded).
> > 
> > Admittedly, I do not really understand all aspects of it, but it applies,
> > still, and I kicked off a stress test here:
> > 
> > https://dev.azure.com/git/git/_build/results?buildId=338
> > 
> > It seems that your patch fixes that t5570 flakiness on macOS, and more
> > importantly, addresses an important issue on macOS.
> > 
> > Will play a bit more with it and keep you posted.
> 
> Alas, I was fooled. *Fooled*, I say. Apparently the --stress option makes
> the script *succeed* when it fails?
> [...]
> So I am afraid that your patch does not fix the issue nor does it work
> around it.

I think that patch does the write_or_die conversion to handle EPIPE, but
it would still need to turn off SIGPIPE for the whole process.

So you'd also need to stick a:

  sigchain_push(SIGPIPE, SIG_IGN);

somewhere near the start of cmd_fetch(). (There may be a less
coarse-grained place to put it, but at this point I think we're just
trying to find out whether this approach even solves the problem).

-Peff


Re: t5570-git-daemon fails with SIGPIPE on OSX

2018-08-06 Thread SZEDER Gábor
[Resending with Clemens' last used email address.
Clemens, please consider sending a patch to update our .mailmap file.]


On Mon, Aug 6, 2018 at 5:11 PM SZEDER Gábor  wrote:
>
> Travis CI changed its default OSX image to use XCode 9.4 on 2018-07-31
> [1].  Since then OSX build jobs fail rather frequently because of a
> SIGPIPE in the tests 'fetch notices corrupt pack' or 'fetch notices
> corrupt idx' in 't5570-git-daemon.sh' [2].  I think this is a symptom
> a real bug in Git affecting other platforms as well, but these tests
> are too lax to catch it.
>
> What it boils down to is this sequence:
>
>   - The test first prepares a repository containing a corrupt pack,
> ready to be server via 'git daemon'.
>
>   - Then the test runs 'test_must_fail git fetch ', which connects
> to 'git daemon', which forks 'git upload-pack', which then
> advertises refs (only HEAD) and capabilities.  So far so good.
>
>   - 'git fetch' eventually calls fetch-pack.c:find_common().  The
> first half of this function assembles a request consisting of a
> want and a flush pkt-line, and sends it via a send_request() call.
>
> At this point the scheduling becomes important: let's suppose that
> fetch is slow and upload-pack is fast.
>
>   - 'git upload-pack' receives the request, parses the want line,
> notices the corrupt pack, responds with an 'ERR upload-pack: not
> our ref' pkt-line, and die()s right away.
>
>   - 'git fetch' finally approaches the end of the function, where it
> attempts to send a done pkt-line via another send_request() call
> through the now closing TCP socket.
>
>   - What happens now seems to depend on the platform:
>
> - On Linux, both on my machine and on Travis CI, it shows textbook
>   example behaviour: write() returns with error and sets errno to
>   ECONNRESET.  Since it happens in write_or_die(), 'git fetch'
>   die()s with 'fatal: write error: Connection reset by peer', and
>   doesn't show the error send by 'git upload-pack'; how could it,
>   it doesn't even get as far to receive upload-pack's ERR
>   pkt-line.
>
>   The test only checks that 'git fetch' fails, but it doesn't
>   check whether it failed with the right error message, so the
>   test still succeeds.  Had it checked the error message as well,
>   we most likely had noticed this issue already, it doesn't happen
>   all that rarely.
>
> - On the new OSX images with XCode 9.4 on Travis CI the write()
>   triggers SIGPIPE right away, and 'test_must_fail' notices it and
>   fails the test.  I couldn't see any sign of an ECONNRESET or any
>   other error that we could act upon to avoid the SIGPIPE.
>
> - On OSX with XCode 9.2 on Travis CI there is neither SIGPIPE, nor
>   ECONNRESET, but sending the request actually succeeds even
>   though there is no process on the other end of the socket
>   anymore.  'git fetch' then simply continues execution, reads and
>   parses the ERR pkt-line, and then dies()s with 'fatal: remote
>   error: upload-pack: not our ref'.  So, on the face of it, it
>   shows the desired behaviour, but I have no idea how that write()
>   could succeed instead of returning error.
>
> I don't know what happens on a real Mac as I don't have access to one;
> I figured out all the above by enabling packet tracing, adding a
> couple of well placed tracing printf() and sleep() calls, running a
> bunch of builds on Travis CI, and looking through their logs.  But
> without access to a debugger and netstat and what not I can't really
> go any further.  So I would now happily pass the baton to those who
> have a Mac and know a thing or two about its porting issues to first
> check whether OSX on a real Mac shows the same behaviour as it does in
> Travis CI's virtualized(?) environment.  And then they can pass the
> baton to those who know all the intricacies of the pack protocol and
> its implementation to decide what to do with this issue.
>
> For a mostly reliable reproduction recipe you might want to fetch this
> branch:
>
>   https://github.com/szeder/git t5570-git-daemon-sigpipe
>
> and then run 'make && cd t && ./t5570-git-daemon.sh -v -x'
>
>
> Have fun! ;)
>
>
> 1 - https://blog.travis-ci.com/2018-07-19-xcode9-4-default-announce
>
> 2 - On git.git's master:
> https://travis-ci.org/git/git/jobs/411517552#L2717


Re: t5570-git-daemon fails with SIGPIPE on OSX

2018-08-14 Thread Jeff King
On Mon, Aug 06, 2018 at 05:11:13PM +0200, SZEDER Gábor wrote:

>   - 'git upload-pack' receives the request, parses the want line,
> notices the corrupt pack, responds with an 'ERR upload-pack: not
> our ref' pkt-line, and die()s right away.
> 
>   - 'git fetch' finally approaches the end of the function, where it
> attempts to send a done pkt-line via another send_request() call
> through the now closing TCP socket.
> 
>   - What happens now seems to depend on the platform:
> 
> - On Linux, both on my machine and on Travis CI, it shows textbook
>   example behaviour: write() returns with error and sets errno to
>   ECONNRESET.  Since it happens in write_or_die(), 'git fetch'
>   die()s with 'fatal: write error: Connection reset by peer', and
>   doesn't show the error send by 'git upload-pack'; how could it,
>   it doesn't even get as far to receive upload-pack's ERR
>   pkt-line.
> 
>   The test only checks that 'git fetch' fails, but it doesn't
>   check whether it failed with the right error message, so the
>   test still succeeds.  Had it checked the error message as well,
>   we most likely had noticed this issue already, it doesn't happen
>   all that rarely.

Hmm. Traditionally we did not send ERR as part of upload-pack at all. It
was the message you got from git-daemon if it couldn't start the
requested sub-process. It was only later in bdb31eada7 (upload-pack:
report "not our ref" to client, 2017-02-23) that we started sending
them. So I think that is why it does not check the error message: it is
not expecting that case at all (and it is not actually interesting here,
as the real problem is that the remote side is corrupt, but it sadly
does not say anything so useful).

I think that's somewhat tangential, though. The root of the issue is
this:

> - On the new OSX images with XCode 9.4 on Travis CI the write()
>   triggers SIGPIPE right away, and 'test_must_fail' notices it and
>   fails the test.  I couldn't see any sign of an ECONNRESET or any
>   other error that we could act upon to avoid the SIGPIPE.

Right, as soon as we get SIGPIPE we can't offer any useful message,
because we're dead. I would argue that fetch should simply turn off
SIGPIPE entirely, and rely on getting EPIPE from write(). But since
we're in write_or_die(), it actually turns EPIPE back into a SIGPIPE
death!

So we'd probably also want to teach it to use a real write_in_full(),
and then output a more useful message in this case. write_or_die()
really does produce bad messages regardless, because it doesn't know
what it's writing to.

That would give us a baby step in the right direction, because at least
we'd always be doing a controlled die() then. And then the next step
would be to show the remote error message (even though it's not actually
useful in this case, in theory upload-pack could generate something
better). And that would mean turning the die() on write into an attempt
to drain any ERR messages before either dying or returning an error up
the stack.

I suspect the (largely untested) patch below would make your test
problems go away. Or instead, we could simply add sigpipe=ok to the
test_must_fail invocation, but I agree with you that the current
behavior on OS X is not ideal (the user sees no error message).

-Peff

diff --git a/fetch-pack.c b/fetch-pack.c
index 5714bcbddd..3e80604562 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -188,8 +188,10 @@ static void send_request(struct fetch_pack_args *args,
if (args->stateless_rpc) {
send_sideband(fd, -1, buf->buf, buf->len, LARGE_PACKET_MAX);
packet_flush(fd);
-   } else
-   write_or_die(fd, buf->buf, buf->len);
+   } else {
+   if (write_in_full(fd, buf->buf, buf->len) < 0)
+   die_errno("unable to write to remote");
+   }
 }
 
 static void insert_one_alternate_object(struct fetch_negotiator *negotiator,
@@ -1167,7 +1169,8 @@ static int send_fetch_request(struct fetch_negotiator 
*negotiator, int fd_out,
 
/* Send request */
packet_buf_flush(&req_buf);
-   write_or_die(fd_out, req_buf.buf, req_buf.len);
+   if (write_in_full(fd_out, req_buf.buf, req_buf.len) < 0)
+   die_errno("unable to write request to remote");
 
strbuf_release(&req_buf);
return ret;
diff --git a/pkt-line.c b/pkt-line.c
index a593c08aad..450d0801b1 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -88,13 +88,15 @@ static void packet_trace(const char *buf, unsigned int len, 
int write)
 void packet_flush(int fd)
 {
packet_trace("", 4, 1);
-   write_or_die(fd, "", 4);
+   if (write_in_full(fd, "", 4) < 0)
+   die_errno("unable to write flush packet");
 }
 
 void packet_delim(int fd)
 {
packet_trace("0001", 4, 1);
-   write_or_die(fd, "0001", 4);
+   if (write_in_full(fd, "", 4) < 0)
+   die_errno("unable to 

Re: t5570-git-daemon fails with SIGPIPE on OSX

2018-08-14 Thread Jeff King
On Tue, Aug 14, 2018 at 06:32:47PM -0400, Jeff King wrote:

> I suspect the (largely untested) patch below would make your test
> problems go away. Or instead, we could simply add sigpipe=ok to the
> test_must_fail invocation, but I agree with you that the current
> behavior on OS X is not ideal (the user sees no error message).

Whoops, that patch of course misses adding "sigchain_push(SIGPIPE,
SIG_IGN)" during the fetch-pack operation. I'll leave that as an
exercise to the reader. ;)

-Peff


Re: t5570-git-daemon fails with SIGPIPE on OSX

2019-03-01 Thread Johannes Schindelin
Hi Peff,

On Fri, 8 Feb 2019, Jeff King wrote:

> On Fri, Feb 08, 2019 at 10:28:12AM +0100, Johannes Schindelin wrote:
> 
> > On Fri, 8 Feb 2019, Johannes Schindelin wrote:
> > 
> > > I just had a look at the patch you provided below (for some reason, my
> > > previous search on public-inbox only turned up Gábor's mail to which you
> > > responded).
> > > 
> > > Admittedly, I do not really understand all aspects of it, but it applies,
> > > still, and I kicked off a stress test here:
> > > 
> > >   https://dev.azure.com/git/git/_build/results?buildId=338
> > > 
> > > It seems that your patch fixes that t5570 flakiness on macOS, and more
> > > importantly, addresses an important issue on macOS.
> > > 
> > > Will play a bit more with it and keep you posted.
> > 
> > Alas, I was fooled. *Fooled*, I say. Apparently the --stress option makes
> > the script *succeed* when it fails?
> > [...]
> > So I am afraid that your patch does not fix the issue nor does it work
> > around it.
> 
> I think that patch does the write_or_die conversion to handle EPIPE, but
> it would still need to turn off SIGPIPE for the whole process.
> 
> So you'd also need to stick a:
> 
>   sigchain_push(SIGPIPE, SIG_IGN);
> 
> somewhere near the start of cmd_fetch(). (There may be a less
> coarse-grained place to put it, but at this point I think we're just
> trying to find out whether this approach even solves the problem).

This fixes it, it seems. I let the job run with `--stress=50` and even
after half an hour, it did not fail:

attempts ://git.visualstudio.com/git/_build/results?buildId=354

(I had to cancel it, I thought that `--stress=50` would stop trying after
50 runs, but I was obviously incorrect in that assumption...)

Reverting the patch made it fail within a hundred runs:

https://git.visualstudio.com/git/_build/results?buildId=356

So. Good, we have a diff that works. But you mentioned that you'd like to
put it somewhere else? I am a bit unfamiliar with the code paths of
`cmd_fetch()`, so I would be hard pressed to find a better spot. Any hint?

Ciao,
Dscho

Re: t5570-git-daemon fails with SIGPIPE on OSX

2019-03-01 Thread Jeff King
On Fri, Mar 01, 2019 at 04:02:22PM +0100, Johannes Schindelin wrote:

> > I think that patch does the write_or_die conversion to handle EPIPE, but
> > it would still need to turn off SIGPIPE for the whole process.
> > 
> > So you'd also need to stick a:
> > 
> >   sigchain_push(SIGPIPE, SIG_IGN);
> > 
> > somewhere near the start of cmd_fetch(). (There may be a less
> > coarse-grained place to put it, but at this point I think we're just
> > trying to find out whether this approach even solves the problem).
> 
> This fixes it, it seems. I let the job run with `--stress=50` and even
> after half an hour, it did not fail:
> 
> attempts ://git.visualstudio.com/git/_build/results?buildId=354

Cool, I'm glad it works!

> (I had to cancel it, I thought that `--stress=50` would stop trying after
> 50 runs, but I was obviously incorrect in that assumption...)

No, that will do 50 simultaneous jobs. You want --stress-limit=50.

It might make more sense to have --stress-jobs, and make --stress=X set
the limit, as I think it's much more useful to set the limit than it is
to set the number of jobs. At least that's my experience.

> So. Good, we have a diff that works. But you mentioned that you'd like to
> put it somewhere else? I am a bit unfamiliar with the code paths of
> `cmd_fetch()`, so I would be hard pressed to find a better spot. Any hint?

Well, I'm of two minds there. If we want to make the minimum change,
then we'd just want to disable SIGPIPE while we're actually conversing
with the other side. So I guess that would be somewhere in do_fetch(),
before we start talking to the other side, and restoring it after we've
finished our half of the conversation (i.e., after we've done all the
want/have bits and we're receiving the pack). But that's actually pretty
awkward to do, because most of those details are handled under the hood
by the transport code.

So probably something like this is the least-terrible place to put it:

diff --git a/builtin/fetch.c b/builtin/fetch.c
index b620fd54b4..4ba63d5ac6 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1556,7 +1556,9 @@ static int fetch_one(struct remote *remote, int argc, 
const char **argv, int pru
 
sigchain_push_common(unlock_pack_on_signal);
atexit(unlock_pack);
+   sigchain_push(SIGPIPE, SIG_IGN);
exit_code = do_fetch(gtransport, &rs);
+   sigchain_pop(SIGPIPE);
refspec_clear(&rs);
transport_disconnect(gtransport);
gtransport = NULL;

That said, I actually think it's kind of pointless for git-fetch to use
SIGPIPE at all. The purpose of SIGPIPE is that you can write a naive
program that spews output, and you'll get informed (forcefully) by the
OS if the process consuming your output stops listening. That makes
sense for programs like "git log", whose primary purpose is generating
output.

But for git-fetch, our primary purpose is receiving data and writing it
to disk. We should be checking all of our write()s already, and SIGPIPE
is just confusing. The only "big" output we generate is the status table
at the end. And even if that is going to a pipe that closes, I don't
think we'd want to fail the whole command (we'd want to finalize any
writes for what we just fetched, clean up after ourselves, etc).

So I'd actually be fine with just declaring that certain commands (like
fetch) just ignore SIGPIPE entirely.

-Peff


Re: t5570-git-daemon fails with SIGPIPE on OSX

2019-03-02 Thread Johannes Schindelin
Hi Peff,

On Fri, 1 Mar 2019, Jeff King wrote:

> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index b620fd54b4..4ba63d5ac6 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -1556,7 +1556,9 @@ static int fetch_one(struct remote *remote, int argc, 
> const char **argv, int pru
>  
>   sigchain_push_common(unlock_pack_on_signal);
>   atexit(unlock_pack);
> + sigchain_push(SIGPIPE, SIG_IGN);
>   exit_code = do_fetch(gtransport, &rs);
> + sigchain_pop(SIGPIPE);
>   refspec_clear(&rs);
>   transport_disconnect(gtransport);
>   gtransport = NULL;

That indeed does the job:

https://git.visualstudio.com/git/_build/results?buildId=358

> That said, I actually think it's kind of pointless for git-fetch to use
> SIGPIPE at all. The purpose of SIGPIPE is that you can write a naive
> program that spews output, and you'll get informed (forcefully) by the
> OS if the process consuming your output stops listening. That makes
> sense for programs like "git log", whose primary purpose is generating
> output.
> 
> But for git-fetch, our primary purpose is receiving data and writing it
> to disk. We should be checking all of our write()s already, and SIGPIPE
> is just confusing. The only "big" output we generate is the status table
> at the end. And even if that is going to a pipe that closes, I don't
> think we'd want to fail the whole command (we'd want to finalize any
> writes for what we just fetched, clean up after ourselves, etc).
> 
> So I'd actually be fine with just declaring that certain commands (like
> fetch) just ignore SIGPIPE entirely.

That's a bigger change than I'm comfortable with, so I'd like to go with
tge diff you gave above.

Do you want to turn these two patches into a proper patch series?
Otherwise I can take care of it, probably this Monday or Tuesday.

Ciao,
Dscho


Re: t5570-git-daemon fails with SIGPIPE on OSX

2019-03-02 Thread Junio C Hamano
Jeff King  writes:

> But for git-fetch, our primary purpose is receiving data and writing it
> to disk. We should be checking all of our write()s already, and SIGPIPE
> is just confusing.

Yup, sounds like a very sensible argument.

> So I'd actually be fine with just declaring that certain commands (like
> fetch) just ignore SIGPIPE entirely.

Me too (to me, "actually be fine with" would mean "that's a larger
hammer alternative, which is OK, while the base solution is fine,
too").

Thanks.




Re: t5570-git-daemon fails with SIGPIPE on OSX

2019-03-03 Thread Johannes Schindelin
Hi Junio,

On Sun, 3 Mar 2019, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > But for git-fetch, our primary purpose is receiving data and writing it
> > to disk. We should be checking all of our write()s already, and SIGPIPE
> > is just confusing.
> 
> Yup, sounds like a very sensible argument.
> 
> > So I'd actually be fine with just declaring that certain commands (like
> > fetch) just ignore SIGPIPE entirely.
> 
> Me too (to me, "actually be fine with" would mean "that's a larger
> hammer alternative, which is OK, while the base solution is fine,
> too").

So do I understand that you'd like something like this?

-- snipsnap --
t a/git.c b/git.c
index 2f604a41ea..c4122cc699 100644
--- a/git.c
+++ b/git.c
@@ -4,6 +4,7 @@
 #include "help.h"
 #include "run-command.h"
 #include "alias.h"
+#include "sigchain.h"
 
 #define RUN_SETUP  (1<<0)
 #define RUN_SETUP_GENTLY   (1<<1)
@@ -16,6 +17,7 @@
 #define SUPPORT_SUPER_PREFIX   (1<<4)
 #define DELAY_PAGER_CONFIG (1<<5)
 #define NO_PARSEOPT(1<<6) /* parse-options is not used */
+#define IGNORE_SIGPIPE (1<<7)
 
 struct cmd_struct {
const char *cmd;
@@ -388,6 +390,8 @@ static int run_builtin(struct cmd_struct *p, int argc, 
const char **argv)
prefix = NULL;
help = argc == 2 && !strcmp(argv[1], "-h");
if (!help) {
+   if (p->option & IGNORE_SIGPIPE)
+   sigchain_push(SIGPIPE, SIG_IGN);
if (p->option & RUN_SETUP)
prefix = setup_git_directory();
else if (p->option & RUN_SETUP_GENTLY) {
@@ -477,7 +481,7 @@ static struct cmd_struct commands[] = {
{ "diff-tree", cmd_diff_tree, RUN_SETUP | NO_PARSEOPT },
{ "difftool", cmd_difftool, RUN_SETUP | NEED_WORK_TREE },
{ "fast-export", cmd_fast_export, RUN_SETUP },
-   { "fetch", cmd_fetch, RUN_SETUP },
+   { "fetch", cmd_fetch, RUN_SETUP | IGNORE_SIGPIPE },
{ "fetch-pack", cmd_fetch_pack, RUN_SETUP | NO_PARSEOPT },
{ "fmt-merge-msg", cmd_fmt_merge_msg, RUN_SETUP },
{ "for-each-ref", cmd_for_each_ref, RUN_SETUP },



Re: t5570-git-daemon fails with SIGPIPE on OSX

2019-03-03 Thread Jeff King
On Sat, Mar 02, 2019 at 10:21:00PM +0100, Johannes Schindelin wrote:

> Do you want to turn these two patches into a proper patch series?
> Otherwise I can take care of it, probably this Monday or Tuesday.

Here they are. The old email sending the first one actually had a typo
(it sent "" instead of "0001" for some packets; oops), which is
fixed here.

I'm actually not sure of the placement for the second one; see the
comments there.

  [1/2]: fetch: avoid calling write_or_die()
  [2/2]: fetch: ignore SIGPIPE during network operation

 builtin/fetch.c | 2 ++
 fetch-pack.c| 9 ++---
 pkt-line.c  | 6 --
 3 files changed, 12 insertions(+), 5 deletions(-)

-Peff