Re: 4.3.90 release candidate segfaults on linux and solaris
> On 2022-09-25 16:25, Dmitry Goncharov wrote: > > On Sun, Sep 25, 2022 at 2:09 AM Martin Dorey > wrote: >> And vfork is where that happens. If I’ve followed the thicket of #ifdef >> correctly and understood the vfork man page, then this is illegal when using >> vfork: >> >> https://github.com/mirror/make/blob/master/src/job.c#L2556 > > > Thanks, Martin. > This is indeed the culprit. > I was able to find a linux where this reproduces. However, the this > does this reproduce on my sun. Denis, can you please try this patch on > your sun? > > regards, Dmitry > > Here is a patch > > index d12a9138..1ed71f0a 100644 > --- a/src/job.c > +++ b/src/job.c > @@ -2286,6 +2286,8 @@ child_execute_job (struct childbase *child, int > good_stdin, char **argv) > posix_spawnattr_t attr; > posix_spawn_file_actions_t fa; > short flags = 0; > +#else > + char **parent_environ = environ; > #endif > > /* Divert child output if we want to capture it. */ > @@ -2301,7 +2303,12 @@ child_execute_job (struct childbase *child, int > good_stdin, char **argv) > > pid = vfork(); > if (pid != 0) > +{ > + /* The child sets environ to child->environment before calling execvp. > + * Restore it here. */ > + environ = parent_environ; > return pid; > +} > > /* We are the child. */ > unblock_all_sigs (); > @@ -2552,7 +2559,8 @@ exec_command (char **argv, char **envp) > errno = ENOEXEC; > > # else > - /* Run the program. */ > + /* Run the program. > + * The parent has to restore environ. */ > environ = envp; > execvp (argv[0], argv); Hello, This patch works perfectly, both on my 32bit linux and my 32bit solaris 10. Thank you. Regards, Denis Excoffier.
Re: 4.3.90 release candidate segfaults on linux and solaris
On Sun, 2022-09-25 at 13:03 -0400, Paul Smith wrote: > The odd thing is that make ALREADY preserves the environment pointer > and restores it, but it does so after child_execute_job() is called > (after removing some ifdefs): Oh, well, clearly child_execute_job() is called from other places, that don't do this. Doh! OK will fix, thanks for the assistance all.
Re: 4.3.90 release candidate segfaults on linux and solaris
On Sun, 2022-09-25 at 10:25 -0400, Dmitry Goncharov wrote: > However, the this does this reproduce on my sun. Sorry Dmitry but I didn't understand that sentence...? The odd thing is that make ALREADY preserves the environment pointer and restores it, but it does so after child_execute_job() is called (after removing some ifdefs): { /* Fork the child process. */ char **parent_environ; run_local: block_sigs (); child->remote = 0; parent_environ = environ; jobserver_pre_child (flags & COMMANDS_RECURSE); child->pid = child_execute_job ((struct childbase *)child, child->good_stdin, argv); environ = parent_environ; /* Restore value child may have clobbered. */ jobserver_post_child (flags & COMMANDS_RECURSE); } I suppose this may not be sufficient to fix the problem?
Re: 4.3.90 release candidate segfaults on linux and solaris
https://docs.oracle.com/cd/E88353_01/html/E37841/execvpe-2.html alleges to document Oracle Solaris 11.4 and appears to support execvpe. Yes, it's listed in Linux pages as a GNU extension and it's not on the Open Group page for the exec family, but symmetry makes it quite an obvious extension. Perhaps that could be detected by a configure test. The Linux page for vfork, but not the Open Group one, says that vfork suspend the caller until the exec is done, so the caller could restore environ after. I don't immediately see that it could hurt and should fix it for Linux. Or there's copying the environment somewhere else. Huh, even just the environ pointer. From: Bug-make on behalf of Martin Dorey Sent: Sunday, September 25, 2022 07:09 To: psm...@gnu.org ; Denis Excoffier Cc: bug-make@gnu.org Subject: Re: 4.3.90 release candidate segfaults on linux and solaris execvpe does indeed fix it for me. Undoing the fix again, doing make clean -k and coming out of the chroot to configure --disable-posix-spawn for my 64 bit root environment (with libc 2.24 from Debian Stretch on Linux 4.9.0-19-amd64), I can reproduce the problem again with this makefile. I'd be surprised if that doesn't crash for Paul too because I suggest that the kernel version isn't actually important either. From: Paul Smith Sent: Sunday, September 25, 2022 06:23 To: Martin Dorey ; Denis Excoffier Cc: bug-make@gnu.org Subject: Re: 4.3.90 release candidate segfaults on linux and solaris * EXTERNAL EMAIL * On Sun, 2022-09-25 at 06:02 +, Martin Dorey wrote: > And vfork is where that happens. If I’ve followed the thicket of > #ifdef correctly and understood the vfork man page, then this is > illegal when using vfork: > > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmirror%2Fmake%2Fblob%2Fmaster%2Fsrc%2Fjob.c%23L2556data=05%7C01%7CMartin.Dorey%40hitachivantara.com%7C82c25851389d4e1c12cb08da9ef91def%7C18791e1761594f52a8d4de814ca8284a%7C0%7C0%7C637997090043171790%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7Csdata=v9hASQYmLkyN49Vpvk4DGrT0roD482%2FC5uIyc3V2DLQ%3Dreserved=0 It's long past time to rewrite this entire part of job.c but I don't want to do that before the release. I did try this with a 32bit Ubuntu 14.04 docker container image which uses a similar libc version, but contrary to my previous assertion (sorry Denis!) if it's an issue with vfork then it IS related to the kernel version and so my docker containers won't be of any use. This is not an actual fix (won't work on Solaris obviously since execvpe() is a GNU extension) but can you try this change to see if it works on your system Martin? At least we can verify that this is the issue: diff --git a/src/job.c b/src/job.c index d12a9138..0a4ddae8 100644 --- a/src/job.c +++ b/src/job.c @@ -2553,8 +2553,7 @@ exec_command (char **argv, char **envp) # else /* Run the program. */ - environ = envp; - execvp (argv[0], argv); + execvpe (argv[0], argv, envp); # endif /* !__EMX__ */
Re: 4.3.90 release candidate segfaults on linux and solaris
On Sun, Sep 25, 2022 at 2:09 AM Martin Dorey wrote: > And vfork is where that happens. If I’ve followed the thicket of #ifdef > correctly and understood the vfork man page, then this is illegal when using > vfork: > > https://github.com/mirror/make/blob/master/src/job.c#L2556 Thanks, Martin. This is indeed the culprit. I was able to find a linux where this reproduces. However, the this does this reproduce on my sun. Denis, can you please try this patch on your sun? regards, Dmitry Here is a patch index d12a9138..1ed71f0a 100644 --- a/src/job.c +++ b/src/job.c @@ -2286,6 +2286,8 @@ child_execute_job (struct childbase *child, int good_stdin, char **argv) posix_spawnattr_t attr; posix_spawn_file_actions_t fa; short flags = 0; +#else + char **parent_environ = environ; #endif /* Divert child output if we want to capture it. */ @@ -2301,7 +2303,12 @@ child_execute_job (struct childbase *child, int good_stdin, char **argv) pid = vfork(); if (pid != 0) +{ + /* The child sets environ to child->environment before calling execvp. + * Restore it here. */ + environ = parent_environ; return pid; +} /* We are the child. */ unblock_all_sigs (); @@ -2552,7 +2559,8 @@ exec_command (char **argv, char **envp) errno = ENOEXEC; # else - /* Run the program. */ + /* Run the program. + * The parent has to restore environ. */ environ = envp; execvp (argv[0], argv);
Re: 4.3.90 release candidate segfaults on linux and solaris
execvpe does indeed fix it for me. Undoing the fix again, doing make clean -k and coming out of the chroot to configure --disable-posix-spawn for my 64 bit root environment (with libc 2.24 from Debian Stretch on Linux 4.9.0-19-amd64), I can reproduce the problem again with this makefile. I'd be surprised if that doesn't crash for Paul too because I suggest that the kernel version isn't actually important either. From: Paul Smith Sent: Sunday, September 25, 2022 06:23 To: Martin Dorey ; Denis Excoffier Cc: bug-make@gnu.org Subject: Re: 4.3.90 release candidate segfaults on linux and solaris * EXTERNAL EMAIL * On Sun, 2022-09-25 at 06:02 +, Martin Dorey wrote: > And vfork is where that happens. If I’ve followed the thicket of > #ifdef correctly and understood the vfork man page, then this is > illegal when using vfork: > > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmirror%2Fmake%2Fblob%2Fmaster%2Fsrc%2Fjob.c%23L2556data=05%7C01%7CMartin.Dorey%40hitachivantara.com%7C82c25851389d4e1c12cb08da9ef91def%7C18791e1761594f52a8d4de814ca8284a%7C0%7C0%7C637997090043171790%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7Csdata=v9hASQYmLkyN49Vpvk4DGrT0roD482%2FC5uIyc3V2DLQ%3Dreserved=0 It's long past time to rewrite this entire part of job.c but I don't want to do that before the release. I did try this with a 32bit Ubuntu 14.04 docker container image which uses a similar libc version, but contrary to my previous assertion (sorry Denis!) if it's an issue with vfork then it IS related to the kernel version and so my docker containers won't be of any use. This is not an actual fix (won't work on Solaris obviously since execvpe() is a GNU extension) but can you try this change to see if it works on your system Martin? At least we can verify that this is the issue: diff --git a/src/job.c b/src/job.c index d12a9138..0a4ddae8 100644 --- a/src/job.c +++ b/src/job.c @@ -2553,8 +2553,7 @@ exec_command (char **argv, char **envp) # else /* Run the program. */ - environ = envp; - execvp (argv[0], argv); + execvpe (argv[0], argv, envp); # endif /* !__EMX__ */
Re: 4.3.90 release candidate segfaults on linux and solaris
On Sun, 2022-09-25 at 06:02 +, Martin Dorey wrote: > And vfork is where that happens. If I’ve followed the thicket of > #ifdef correctly and understood the vfork man page, then this is > illegal when using vfork: > > https://github.com/mirror/make/blob/master/src/job.c#L2556 It's long past time to rewrite this entire part of job.c but I don't want to do that before the release. I did try this with a 32bit Ubuntu 14.04 docker container image which uses a similar libc version, but contrary to my previous assertion (sorry Denis!) if it's an issue with vfork then it IS related to the kernel version and so my docker containers won't be of any use. This is not an actual fix (won't work on Solaris obviously since execvpe() is a GNU extension) but can you try this change to see if it works on your system Martin? At least we can verify that this is the issue: diff --git a/src/job.c b/src/job.c index d12a9138..0a4ddae8 100644 --- a/src/job.c +++ b/src/job.c @@ -2553,8 +2553,7 @@ exec_command (char **argv, char **envp) # else /* Run the program. */ - environ = envp; - execvp (argv[0], argv); + execvpe (argv[0], argv, envp); # endif /* !__EMX__ */
Re: 4.3.90 release candidate segfaults on linux and solaris
I have a 32 bit Debian Squeeze (so 2011 era, with gcc-4.4 - too early for asan - and libc 2.11, so two Debian releases earlier than Denis’s) chroot lying around. Unsurprisingly, his reproducer crashes for me too, configured with no options and make run with no arguments. gdb’s reporting line 119 rather than 118 in expand.c. valgrind annoyingly stops it crashing. I had trouble compiling rule.c because johnny-come-lately mempcpy on this vintage setup seems to be provided as a macro and that doesn't play well with the STRING_SIZE_TUPLE macro, because, until the latter is expanded, mempcpy appears to only have two arguments, when it requires three. #undef mempcpy in rule.c got me over that. I know we depend on C99 now, seemingly without a storm of protest, but this is failing to compile on a setup a decade newer than that. For me, it’s the variable known in the make source as “environ”, which isn't what my gdb calls environ but what it calls ‘environ@@GLIBC_2.0' which has been corrupted, by the second call to recursively_expand_for_file, not, far as I sampled, the environment variables it points to. gdb alleged to let me set a hardware watch point on it but it didn't fire. It seems to have been corrupted to point into the middle of an environment variable. So what should be an array of character pointers is an array of characters. Interpreting those as character pointers promptly doesn't go so well. It’s already changed, for me, by the call to free_childbase. … And vfork is where that happens. If I’ve followed the thicket of #ifdef correctly and understood the vfork man page, then this is illegal when using vfork: https://github.com/mirror/make/blob/master/src/job.c#L2556 (I’d use the savannah git web thing but I don’t see line number links.) The corruption happening in another process would explain the hardware watch point not catching it. The “blame“ says that dodgy assignment is 28 years old but, as was recently topical again, we have stopped using posix_spawn instead of vfork with old libc versions and this code for copying the environment definitions of recursively expanded variables for $(shell) invocations is even newer. Perhaps Make didn’t otherwise use the environ variable that it’s illegally corrupted for the last three decades. From: Bug-make on behalf of Paul Smith Sent: Saturday, September 24, 2022 14:17 To: Denis Excoffier Cc: bug-make@gnu.org Subject: Re: 4.3.90 release candidate segfaults on linux and solaris * EXTERNAL EMAIL * On Sat, 2022-09-24 at 13:06 -0400, Paul Smith wrote: > > The crashes in solaris and linux are so similar, and go away also > > so similarly, that i would primarily think about size of types. > > Under cygwin and MacOS all sizes (pointers, long int, size_t, > > SIZE_MAX, time_t) are 8, while under (this) solaris and (this) > > linux all sizes are 4. Only size of int is 4 on all. > > Aha! You're using a 32bit Linux. That is indeed helpful > information, thanks. I will check it. I installed a 32bit docker container and tried with that but still no failure. I will keep poking at it; maybe I can find a much older container.
Re: 4.3.90 release candidate segfaults on linux and solaris
On Sat, 2022-09-24 at 13:06 -0400, Paul Smith wrote: > > The crashes in solaris and linux are so similar, and go away also > > so similarly, that i would primarily think about size of types. > > Under cygwin and MacOS all sizes (pointers, long int, size_t, > > SIZE_MAX, time_t) are 8, while under (this) solaris and (this) > > linux all sizes are 4. Only size of int is 4 on all. > > Aha! You're using a 32bit Linux. That is indeed helpful > information, thanks. I will check it. I installed a 32bit docker container and tried with that but still no failure. I will keep poking at it; maybe I can find a much older container.
Re: 4.3.90 release candidate segfaults on linux and solaris
On Sat, 2022-09-24 at 18:23 +0200, Denis Excoffier wrote: > The segfault also occurs without ‘env -i’. Thanks for helping me avoid that red herring. > The crashes in solaris and linux are so similar, and go away also so > similarly, that i would primarily think about size of types. Under > cygwin and MacOS all sizes (pointers, long int, size_t, SIZE_MAX, > time_t) are 8, while under (this) solaris and (this) linux all sizes > are 4. Only size of int is 4 on all. Aha! You're using a 32bit Linux. That is indeed helpful information, thanks. I will check it.
Re: 4.3.90 release candidate segfaults on linux and solaris
> On 2022-09-24 17:19, Paul Smith wrote: > > On Sat, 2022-09-24 at 09:36 +0200, Denis Excoffier wrote: >> In my specific configuration (under linux, with --disable-nls, >> --disable-load, without using -j, using 'env -i make -d -n'), a >> segfault always occurs around line 118 of src/expand.c: >> >> My linux is old (2.6.32), > > The kernel version is not very interesting for a userspace program like > GNU make. But, it would be interesting if you could provide the > version of libc you are using; on my system I can use: > > ~$ /lib/x86_64-linux-gnu/libc.so.6 --version | head -n1 > > You can find the library by running "ldd make | grep libc' > > It's also interesting you're running with "env -i" that seems like it > might be related. Do you get the crash if you run without that? > > A quick check on my system was not able to show any issues using the > same setup as you, even building make with ASAN. I will review the > code especially around memory handling in the child structures to see > if I can find an error by inspection. % ldd make linux-gate.so.1 (0xb76e4000) libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb7565000) /lib/ld-linux.so.2 (0xb76e5000) % /lib/i386-linux-gnu/libc.so.6 --version | head -2 GNU C Library (Debian GLIBC 2.19-18+deb8u10) stable release version 2.19, by Roland McGrath et al. Copyright (C) 2014 Free Software Foundation, Inc. % The segfault also occurs without ‘env -i’. I was just trying to minimize the number of entries in the environment. In any case make seems to always add a few ones by default (like MAKEFLAGS etc.). The crashes in solaris and linux are so similar, and go away also so similarly, that i would primarily think about size of types. Under cygwin and MacOS all sizes (pointers, long int, size_t, SIZE_MAX, time_t) are 8, while under (this) solaris and (this) linux all sizes are 4. Only size of int is 4 on all. Regards, Denis Excoffier.
Re: 4.3.90 release candidate segfaults on linux and solaris
On Sat, 2022-09-24 at 09:36 +0200, Denis Excoffier wrote: > In my specific configuration (under linux, with --disable-nls, > --disable-load, without using -j, using 'env -i make -d -n'), a > segfault always occurs around line 118 of src/expand.c: > > My linux is old (2.6.32), The kernel version is not very interesting for a userspace program like GNU make. But, it would be interesting if you could provide the version of libc you are using; on my system I can use: ~$ /lib/x86_64-linux-gnu/libc.so.6 --version | head -n1 You can find the library by running "ldd make | grep libc' It's also interesting you're running with "env -i" that seems like it might be related. Do you get the crash if you run without that? A quick check on my system was not able to show any issues using the same setup as you, even building make with ASAN. I will review the code especially around memory handling in the child structures to see if I can find an error by inspection.