Re: [ccache] gmake and ccache conspiring together in creating gremlins

2021-02-08 Thread Bernd Petrovitsch via ccache
Hi all!

On 07/02/2021 16:45, Sam Varshavchik via ccache wrote:
[...]
> In the interim I discovered a workaround that seems to work, for
> anyone who's also affected by this. Instead of setting CCACHE_PREFIX
> to distcc, I pointed CCACHE_PREFIX to this:
> 
> #! /bin/sh
> 
> unset MFLAGS ${!MAKE@}
> exec distcc "$@"
> 
> This clears make variables from the environment and when the linker
> invokes make for ...whatever it needs to invoke it for, against a
> manufactured makefile it generates on the fly... it doesn't go look
> for the job server file descriptors, and go sideways.

Hmm, with the whole thread: Shouldn't better the linker (or whatever
tool) cleanup the environment variables before invoking `make` (or
whatever programs) to fix the - IMHO - core cause of the bug?

MfG,
Bernd
-- 
Bernd Petrovitsch  Email : be...@petrovitsch.priv.at
 LUGA : http://www.luga.at

___
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache


Re: [ccache] gmake and ccache conspiring together in creating gremlins

2021-02-08 Thread Sam Varshavchik via ccache
On Mon, Feb 8, 2021 at 2:38 PM Paul Smith  wrote:
>
> On Mon, 2021-02-08 at 10:43 +, Edward Welbourne wrote:
> > Sounds to me like that's a bug: when the descriptors are closed, the
> > part of MAKEFLAGS that claims they're make's jobserver file
> > descriptors should be removed, since that's when the claim stops
> > being true.
>
> I believe there have been other similar issues reported recently.
>
> Certainly fixing MAKEFLAGS when we run without jobserver available is
> something that could be done.
>
> There is a loss of debugging information if we make this change: today
> make can detect if it was invoked in a way that _should_ expect to
> receive a jobserver context, but _didn't_ receive that context.  That
> is, if make sees that jobserver-auth is set but it can't open the
> jobserver pipes it can warn the user that most likely there's a problem
> in their environment or with the setup of their makefiles.
>
> Without this warning there's no way to know when this situation occurs.
>  It's easy to create a situation where every sub-make will create its
> own completely unique jobserver domain.  So you start the top make with
> -j4 and run 4 sub-makes; if you do it wrong then each of 4 sub-makes
> could create a new jobserver domain, and now you're running 16 jobs in
> parallel instead of 4... there's no way for make to warn you about this
> situation.

One thought occurred to me. Specifically: when make executes what it
believes to be something other than a recursive invocation of $(MAKE),
and it closes the job server pipe file descriptors for that, it can
also:

1) Add an additional parameter to MAKEFLAGS, let's call it
"--no-jobserver", and perhaps remove the --jobserver-auth parameter
completely. It might be easier just to append something there, instead
of surgically removing this.

2) Make checks for a --no-jobserver in MAKEFLAGS when it starts. If
it's there it does NOT attempt to validate the file descriptors that
are given in --jobserver-auth (if this parameter is preserved). It's a
given that they're not there:

  if (!FD_OK (job_fds[0]) || !FD_OK (job_fds[1]) || make_job_rfd () < 0)

Don't even do that. What happens right now a warning message gets
printed and make runs without a job server. This change should have
the same result, print the warning but skip the FD_OK tests.

This will result in the same warning, but it should avoid triggering
the bug that I found.

However that might cause a minor regression in LTO linking. I think
that this prevents the LTO linker's internal invocation of make from
finding that it can attach to the original make process's job server.

>From sifting through strace dumps, I see that a linker-invoked make
gets its own -j flag. It appears that the linker is courteous enough
to count how many CPUs it has and use it to construct its own -j flag.

How about this, safe approach: once --no-jobserver is there it stays
there, and gets propagated to all recursively invoked makes. If an
invoke make finds that it has both a --no-jobserver and a -j flag,
it'll warn and refuse to create its own job server, and then proceed
executing one command at a time.

This prevents an arithmetic proliferation of job worker processes if
the original job server's file descriptors get lost. Currently
recursively-invoked makes will find, and attach themselves to, an
existing job server. This is nice; but this is vulnerable to an edge
case that I think I'm hitting: a false positive involving a leaked
file descriptor. This change encourages fixing whatever's causing make
to fail to detect a recursive invocation.

___
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache


Re: [ccache] gmake and ccache conspiring together in creating gremlins

2021-02-08 Thread Sam Varshavchik via ccache
On Sun, Feb 7, 2021 at 7:59 AM Joel Rosdahl  wrote:

> From my perspective, I don't see anything wrong with ccache dup-ing stderr – I
> guess the bug can be triggered by the user running something like
>
>   exec 3>&2
>
> in the shell as well.
>
> I can think of an easy workaround for the problem: only dup2 stderr when 
> running
> the preprocessor or compiler. Done in
> .

In the interim I discovered a workaround that seems to work, for
anyone who's also affected by this. Instead of setting CCACHE_PREFIX
to distcc, I pointed CCACHE_PREFIX to this:

#! /bin/sh

unset MFLAGS ${!MAKE@}
exec distcc "$@"

This clears make variables from the environment and when the linker
invokes make for ...whatever it needs to invoke it for, against a
manufactured makefile it generates on the fly... it doesn't go look
for the job server file descriptors, and go sideways.

___
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache


Re: [ccache] gmake and ccache conspiring together in creating gremlins

2021-02-07 Thread Joel Rosdahl via ccache
On Sun, 7 Feb 2021 at 12:26, Sam Varshavchik via ccache
 wrote:

> I've been chasing down non-deterministic, random build fails.

Thanks for the thorough investigation!

> So, now what? Both ccache, and the linker are contributing to this situation.
> ccache leaks the file descriptor (to distcc apparently?).

Yes, or rather for anyone who wants to use it. Icecream also uses it.

> The linker invokes make for its own use, but it doesn't sanitize the
> environment. make closes the job server file descriptors, but not removing
> them from MAKEFLAGS.

>From my perspective, I don't see anything wrong with ccache dup-ing stderr – I
guess the bug can be triggered by the user running something like

  exec 3>&2

in the shell as well.

I can think of an easy workaround for the problem: only dup2 stderr when running
the preprocessor or compiler. Done in
.

-- Joel

On Sun, 7 Feb 2021 at 12:26, Sam Varshavchik via ccache
 wrote:
>
> I've been chasing down non-deterministic, random build fails. make
> randomly blows up with
>
> write error: stdout
>
> A lot of stracing, experimentation, and pains, determined that this is
> due to a sequence of the following events:
>
> 1) how make uses the job server file descriptor
>
> 2) an intentional leak of the standard error file descriptor in ccache
>
> 3) LTO linking results in linker recursively invoking make (that was a
> new one to me)
>
> 4) A rather nuanced behavior of the Linux PTY driver that can be best
> explained with the following short demo:
>
> #include 
> #include 
> #include 
> #include 
> #include 
>
> int main()
> {
> struct stat state_buf;
>
> for (int i=0; i<8; ++i)
> {
> if (fstat(i, &state_buf) == 0)
> {
> printf("%d: %d/%d\n", i,
>(int)state_buf.st_dev,
>(int)state_buf.st_ino);
> }
> }
>
> printf("NON BLOCKING: %s\n", fcntl(0, F_GETFL) & O_NONBLOCK ? 
> "y":"n");
>
> fcntl(2, F_SETFL, O_NONBLOCK);
>
> printf("NON BLOCKING: %s\n", fcntl(0, F_GETFL) & O_NONBLOCK ? 
> "y":"n");
> return 0;
> }
>
> I'm sshed into my server, and the output of the above is, for me:
>
> 0: 25/3
> 1: 25/3
> 2: 25/3
> NON BLOCKING: n
> NON BLOCKING: y
>
> My stdin, stdout, and stderr is the same dev/ino, and setting non-blocking
> mode one any one of them puts them all in non-blocking mode. I'm setting non-
> blocking on standard error, and my standard input also becomes non-blocking.
>
> After sifting through sources, and straces, this is what I see is happening.
>
> Starting with make, and how it sets up its job file descriptors:
>
> jobserver_parse_auth (const char *auth)
> {
>   /* Given the command-line parameter, parse it.  */
>   if (sscanf (auth, "%d,%d", &job_fds[0], &job_fds[1]) != 2)
>
>   // …
>
>   /* Make sure our pipeline is valid, and (possibly) create a duplicate pipe,
>  that will be closed in the SIGCHLD handler.  If this fails with EBADF,
>  the parent has closed the pipe on us because it didn't think we were a
>  submake.  If so, warn and default to -j1.  */
>
>   if (!FD_OK (job_fds[0]) || !FD_OK (job_fds[1]) || make_job_rfd () < 0)
> }
>
> The TLDR of the above: make reads the job server's file descriptors
> from the MAKEFLAGS environment variable, then checks here if they
> actually exist. If they don't exist, make will create the job server
> pipe. Important: by default they will be file descriptors 3 and 4.
> This becomes a key player in this mystery, a little bit later.
>
> When make spawns a child job (other than a recursive make) the job
> server file descriptors get closed (marked O_CLOEXEC before the actual
> execve), but MAKEFLAGS remains in the environment. They remain open
> for a recursive make call, and the recursive instance picks up the
> ball from here.
>
> The problem with this approach is that if, somehow, a child process
> runs make, and there happens to be a pair of file descriptors here,
> make will assume that they're the job server file descriptors.
>
> Another thing that make does is that it sets one of the file
> descriptors to non-blocking mode. This is in jobserver_setup:
>
>   /* When using pselect() we want the read to be non-blocking.  */
>   set_blocking (job_fds[0], 0);
>
> Now we get to ccache which, amongst things, does the following, in its
> ccache.cpp:
>
> // Make a copy of stderr that will not be cached, so things like distcc can
> // send networking errors to it.
> static void
> set_up_uncached_err()
> {
>   int uncached_fd =
> dup(STDERR_FILENO); // The file descriptor is intentionally leaked.
>   if (uncached_fd == -1) {
> log("dup(2) failed: {}", strerror(errno));
> throw Failure(Statistic::internal_error);
>   }
>
>   Util::setenv("UNCACHED_ERR_FD", fmt::format("{}", uncached_fd));
> }
>
> TLDR: it intentionally lea

[ccache] gmake and ccache conspiring together in creating gremlins

2021-02-07 Thread Sam Varshavchik via ccache
I've been chasing down non-deterministic, random build fails. make
randomly blows up with

write error: stdout

A lot of stracing, experimentation, and pains, determined that this is
due to a sequence of the following events:

1) how make uses the job server file descriptor

2) an intentional leak of the standard error file descriptor in ccache

3) LTO linking results in linker recursively invoking make (that was a
new one to me)

4) A rather nuanced behavior of the Linux PTY driver that can be best
explained with the following short demo:

#include 
#include 
#include 
#include 
#include 

int main()
{
struct stat state_buf;

for (int i=0; i<8; ++i)
{
if (fstat(i, &state_buf) == 0)
{
printf("%d: %d/%d\n", i,
   (int)state_buf.st_dev,
   (int)state_buf.st_ino);
}
}

printf("NON BLOCKING: %s\n", fcntl(0, F_GETFL) & O_NONBLOCK ? "y":"n");

fcntl(2, F_SETFL, O_NONBLOCK);

printf("NON BLOCKING: %s\n", fcntl(0, F_GETFL) & O_NONBLOCK ? "y":"n");
return 0;
}

I'm sshed into my server, and the output of the above is, for me:

0: 25/3
1: 25/3
2: 25/3
NON BLOCKING: n
NON BLOCKING: y

My stdin, stdout, and stderr is the same dev/ino, and setting non-blocking
mode one any one of them puts them all in non-blocking mode. I'm setting non-
blocking on standard error, and my standard input also becomes non-blocking.

After sifting through sources, and straces, this is what I see is happening.

Starting with make, and how it sets up its job file descriptors:

jobserver_parse_auth (const char *auth)
{
  /* Given the command-line parameter, parse it.  */
  if (sscanf (auth, "%d,%d", &job_fds[0], &job_fds[1]) != 2)

  // …

  /* Make sure our pipeline is valid, and (possibly) create a duplicate pipe,
 that will be closed in the SIGCHLD handler.  If this fails with EBADF,
 the parent has closed the pipe on us because it didn't think we were a
 submake.  If so, warn and default to -j1.  */

  if (!FD_OK (job_fds[0]) || !FD_OK (job_fds[1]) || make_job_rfd () < 0)
}

The TLDR of the above: make reads the job server's file descriptors
from the MAKEFLAGS environment variable, then checks here if they
actually exist. If they don't exist, make will create the job server
pipe. Important: by default they will be file descriptors 3 and 4.
This becomes a key player in this mystery, a little bit later.

When make spawns a child job (other than a recursive make) the job
server file descriptors get closed (marked O_CLOEXEC before the actual
execve), but MAKEFLAGS remains in the environment. They remain open
for a recursive make call, and the recursive instance picks up the
ball from here.

The problem with this approach is that if, somehow, a child process
runs make, and there happens to be a pair of file descriptors here,
make will assume that they're the job server file descriptors.

Another thing that make does is that it sets one of the file
descriptors to non-blocking mode. This is in jobserver_setup:

  /* When using pselect() we want the read to be non-blocking.  */
  set_blocking (job_fds[0], 0);

Now we get to ccache which, amongst things, does the following, in its
ccache.cpp:

// Make a copy of stderr that will not be cached, so things like distcc can
// send networking errors to it.
static void
set_up_uncached_err()
{
  int uncached_fd =
dup(STDERR_FILENO); // The file descriptor is intentionally leaked.
  if (uncached_fd == -1) {
log("dup(2) failed: {}", strerror(errno));
throw Failure(Statistic::internal_error);
  }

  Util::setenv("UNCACHED_ERR_FD", fmt::format("{}", uncached_fd));
}

TLDR: it intentionally leaks a file descriptor. A basic standalone
strace of a standalone cache invocation showed ccache leaking file
descriptor 2 to 3. Now, somewhere else I was also getting file
descriptor 4, I haven't tracked down where, exactly. But it was there,

The final ingredient to this perfect storm, is LTO linking. I have
verified that -flto=auto -ffat-lto-objects results in ld running make.
If you compile helloworld, or something, with -flto=auto
-ffat-lto-objects, and link it:

strace -s 256 -o z -f gcc -flto=auto -ffat-lto-objects -o t t.o

strace will show that, somehow, make gets involved:

123302 execve("/usr/bin/make", ["make", "-f", "/tmp/ccyRcNmv.mk", "-j32",
"all"], 0x234ed40 /* 47 vars */ 

And now, let's go back to what make is doing, here:

static void
close_stdout (void)
{
  int prev_fail = ferror (stdout);
  int fclose_fail = fclose (stdout);

  if (prev_fail || fclose_fail)
{
  if (fclose_fail)
perror_with_name (_("write error: stdout"), "");
  else
O (error, NILF, _("write error: stdout"));
  exit (MAKE_TROUBLE);
}
}

So, the grand total, this is what was happening when I was building a
large hairball, using make, distcc, cache, and when using LTO:

1) The original insta