Re: Memory corruption after fork, only on AMD CPUs

2021-12-14 Thread Brad Spencer
Michael Pratt writes:

> On Tue, Dec 14, 2021 at 1:06 PM Michael Pratt wrote:
>>
>> [This is a reply to
>> https://mail-index.netbsd.org/tech-kern/2021/12/01/msg027830.html. I
>> just joined the mailing list and can't seem to find the metadata
>> required for a proper reply. Apologies.]
>>
>> I filed https://gnats.netbsd.org/56535 for this a while ago, which has
>> an even simpler reproducer: a direct fork() call with a child that
>> immediately exits sometimes causes memory corruption in the parent
>> process.
>>
>> We've kept looking since filing https://gnats.netbsd.org/56535 but
>> haven't had luck on further simplification. No C reproducer yet,
>> unfortunately. (No crashes if the Go parent process is single-threaded
>> either.)
>
> I spoke too soon here, we managed to get a reproducer in C today,
> which I've posted at
> https://github.com/golang/go/issues/34988#issuecomment-994115345.

I don't have a big collection of AMD systems, but I do have a couple.
Everything here is Xen, however, and nothing is very recent, either
from the hardware POV or, in a lot of cases, the OS...

Ryzen 3 2200G - 2 vcpu DOMU running 9.0_STABLE and a 1 processor DOM0
running 8.99.25 could not reproduce this running the code from the DOMU
or DOM0.

Athlon 64 X2 5600+ - 1 vcpu DOMU running 9.99.74 and a 1 processor DOM0
running 8.0_STABLE could not reproduce this running the code from the
DOMU or DOM0.


As a control test a 2 vcpu 9.0_STABLE DOMU running on an Intel system
could also not reproduce this.

Since this test is a bit brutal, I didn't let it run too long, as the
systems are doing other stuff, but it ran for several minutes and no
failures were reported.  All of the systems are NetBSD/amd64.

>> This feels like a bug in memory management somewhere (TLB invalidation
>> issue, bug in copy-on-write?). Fundamentally, we have the parent
>> process getting corrupt memory after calling fork with an
>> (effectively) no-op child. That just shouldn't happen.
>>
>> I think we need someone familiar with NetBSD memory management
>> internals to help take a look. Otherwise I'm afraid we won't figure it
>> out and will have to declare that Go doesn't work on NetBSD on AMD
>> CPUs.
>>
>> gdt: that does sound like a different issue to me. It may be worth
>> filing a bug at https://github.com/golang/go/issues with the crash
>> details.
>>
>> Thanks,
>> Michael



-- 
Brad Spencer - b...@anduin.eldar.org - KC8VKS - http://anduin.eldar.org


Re: Memory corruption after fork, only on AMD CPUs

2021-12-14 Thread Michael Pratt
On Tue, Dec 14, 2021 at 1:06 PM Michael Pratt wrote:
>
> [This is a reply to
> https://mail-index.netbsd.org/tech-kern/2021/12/01/msg027830.html. I
> just joined the mailing list and can't seem to find the metadata
> required for a proper reply. Apologies.]
>
> I filed https://gnats.netbsd.org/56535 for this a while ago, which has
> an even simpler reproducer: a direct fork() call with a child that
> immediately exits sometimes causes memory corruption in the parent
> process.
>
> We've kept looking since filing https://gnats.netbsd.org/56535 but
> haven't had luck on further simplification. No C reproducer yet,
> unfortunately. (No crashes if the Go parent process is single-threaded
> either.)

I spoke too soon here, we managed to get a reproducer in C today,
which I've posted at
https://github.com/golang/go/issues/34988#issuecomment-994115345.
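
For readers following along without the link, the general shape of such
a test can be sketched in C as below. This is not the reproducer posted
at the URL above, just a minimal illustration of the scenario described
in this thread: worker threads keep writing and re-reading private
memory while the main thread forks children that exit immediately. The
names fork_stress, worker, BUF_WORDS and MAX_WORKERS are made up for
this sketch, and there is no claim that it triggers the bug on any
particular system.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define BUF_WORDS   4096	/* hypothetical per-worker buffer size */
#define MAX_WORKERS 16		/* hypothetical upper bound on workers */

static atomic_int stop_flag;	/* set to 1 to ask the workers to exit */
static atomic_long mismatches;	/* words that read back wrong */

/* Worker: fill a private stack buffer with a pattern, verify it, repeat. */
static void *
worker(void *arg)
{
	volatile uint64_t buf[BUF_WORDS];
	uint64_t seed = (uint64_t)(uintptr_t)arg;

	while (!atomic_load(&stop_flag)) {
		for (size_t i = 0; i < BUF_WORDS; i++)
			buf[i] = seed + i;
		for (size_t i = 0; i < BUF_WORDS; i++)
			if (buf[i] != seed + i)
				atomic_fetch_add(&mismatches, 1);
		seed += BUF_WORDS;
	}
	return NULL;
}

/*
 * Start nworkers threads, fork nforks children that exit immediately,
 * then stop the workers.  Returns the number of mismatching words seen
 * by the workers, or -1 if the arguments are invalid or a thread could
 * not be created.
 */
long
fork_stress(int nworkers, int nforks)
{
	pthread_t tids[MAX_WORKERS];
	int started = 0;

	if (nworkers < 1 || nworkers > MAX_WORKERS)
		return -1;
	atomic_store(&stop_flag, 0);
	atomic_store(&mismatches, 0);

	for (; started < nworkers; started++)
		if (pthread_create(&tids[started], NULL, worker,
		    (void *)(uintptr_t)(started + 1)) != 0)
			break;

	/* Only fork if every worker is running. */
	for (int i = 0; i < nforks && started == nworkers; i++) {
		pid_t pid = fork();
		if (pid == 0)
			_exit(0);	/* effectively a no-op child */
		if (pid > 0)
			(void)waitpid(pid, NULL, 0);
	}

	atomic_store(&stop_flag, 1);
	for (int i = 0; i < started; i++)
		(void)pthread_join(tids[i], NULL);

	return started == nworkers ? atomic_load(&mismatches) : -1;
}
```

Calling fork_stress(4, 100000) from main() and printing the result would
be one way to use it. On a healthy system the return value should be 0;
any positive value means a worker read back something it did not write.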

>
> This feels like a bug in memory management somewhere (TLB invalidation
> issue, bug in copy-on-write?). Fundamentally, we have the parent
> process getting corrupt memory after calling fork with an
> (effectively) no-op child. That just shouldn't happen.
>
> I think we need someone familiar with NetBSD memory management
> internals to help take a look. Otherwise I'm afraid we won't figure it
> out and will have to declare that Go doesn't work on NetBSD on AMD
> CPUs.
>
> gdt: that does sound like a different issue to me. It may be worth
> filing a bug at https://github.com/golang/go/issues with the crash
> details.
>
> Thanks,
> Michael


Re: Memory corruption after fork, only on AMD CPUs

2021-12-14 Thread Michael Pratt
[This is a reply to
https://mail-index.netbsd.org/tech-kern/2021/12/01/msg027830.html. I
just joined the mailing list and can't seem to find the metadata
required for a proper reply. Apologies.]

I filed https://gnats.netbsd.org/56535 for this a while ago, which has
an even simpler reproducer: a direct fork() call with a child that
immediately exits sometimes causes memory corruption in the parent
process.

We've kept looking since filing https://gnats.netbsd.org/56535 but
haven't had luck on further simplification. No C reproducer yet,
unfortunately. (No crashes if the Go parent process is single-threaded
either.)

This feels like a bug in memory management somewhere (TLB invalidation
issue, bug in copy-on-write?). Fundamentally, we have the parent
process getting corrupt memory after calling fork with an
(effectively) no-op child. That just shouldn't happen.

I think we need someone familiar with NetBSD memory management
internals to help take a look. Otherwise I'm afraid we won't figure it
out and will have to declare that Go doesn't work on NetBSD on AMD
CPUs.

gdt: that does sound like a different issue to me. It may be worth
filing a bug at https://github.com/golang/go/issues with the crash
details.

Thanks,
Michael


Re: python vs. semaphores

2021-12-14 Thread Thomas Klausner
On Tue, Dec 14, 2021 at 02:08:39PM +0100, Thomas Klausner wrote:
> Is anyone aware of problems with semaphores on NetBSD, or has looked
> at this particular problem before?

Ok, my first test already gave me a bug.

The man page says that multiple consecutive sem_open calls (without
sem_close in between) should return the same address. They don't.

I've written an atf test. I don't know how to mark it as
currently-broken, so I'll attach it here (and to the bug report).

I don't think fixing this one will fix semaphores in Python though :)
 Thomas
? .gdbinit
? Atffile
? atf-run.log
? t_sched
? t_sem
Index: t_sem.c
===================================================================
RCS file: /cvsroot/src/tests/lib/librt/t_sem.c,v
retrieving revision 1.5
diff -u -r1.5 t_sem.c
--- t_sem.c 14 May 2020 08:34:19 -  1.5
+++ t_sem.c 14 Dec 2021 15:38:47 -
@@ -313,6 +313,31 @@
    (void)sem_unlink("/sem_c");
 }
 
+ATF_TC_WITH_CLEANUP(sem_open_address);
+ATF_TC_HEAD(sem_open_address, tc)
+{
+   atf_tc_set_md_var(tc, "descr", "Validate that multiple sem_open calls "
+   "return the same address");
+}
+ATF_TC_BODY(sem_open_address, tc)
+{
+   sem_t *sem, *sem2, *sem3;
+   sem = sem_open("/sem_d", O_CREAT | O_EXCL, 0777, 0);
+   ATF_REQUIRE(sem != SEM_FAILED);
+   sem2 = sem_open("/sem_d", O_CREAT | O_EXCL, 0777, 0);
+   ATF_REQUIRE(sem2 == SEM_FAILED && errno == EEXIST);
+   sem3 = sem_open("/sem_d", 0);
+   ATF_REQUIRE(sem3 != SEM_FAILED);
+   ATF_REQUIRE(sem == sem3);
+   ATF_REQUIRE_EQ(sem_close(sem3), 0);
+   ATF_REQUIRE_EQ(sem_close(sem), 0);
+   ATF_REQUIRE_EQ(sem_unlink("/sem_d"), 0);
+}
+ATF_TC_CLEANUP(sem_open_address, tc)
+{
+   (void)sem_unlink("/sem_d");
+}
+
 ATF_TP_ADD_TCS(tp)
 {
 
@@ -320,6 +345,7 @@
    ATF_TP_ADD_TC(tp, child);
    ATF_TP_ADD_TC(tp, pshared);
    ATF_TP_ADD_TC(tp, invalid_ops);
+   ATF_TP_ADD_TC(tp, sem_open_address);
 
    return atf_no_error();
 }
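
Outside of ATF, the same check can be sketched as a small standalone
function. This is only an illustration of the test above, not part of
the patch; the function name sem_open_same_address and whatever
semaphore name is passed to it are hypothetical.

```c
#include <fcntl.h>
#include <semaphore.h>

/*
 * Open the named semaphore twice without closing it in between and
 * compare the returned addresses.
 * Returns 1 if they are equal, 0 if they differ, -1 on any error.
 */
int
sem_open_same_address(const char *name)
{
	sem_t *first, *second;
	int same;

	/* Remove any stale semaphore left over from an earlier run. */
	(void)sem_unlink(name);

	first = sem_open(name, O_CREAT | O_EXCL, 0600, 0);
	if (first == SEM_FAILED)
		return -1;

	second = sem_open(name, 0);
	if (second == SEM_FAILED) {
		(void)sem_close(first);
		(void)sem_unlink(name);
		return -1;
	}

	same = (first == second);

	(void)sem_close(second);
	(void)sem_close(first);
	(void)sem_unlink(name);
	return same;
}
```

A caller would pass any unused name, for example "/sem_addr_demo", and
print the result. Going by the observation above, an affected NetBSD
system would give 0, while the man page wording implies 1.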


python vs. semaphores

2021-12-14 Thread Thomas Klausner
Hi!

Since at least Python 2.7 (2011, probably longer, I haven't dug
deeper), pkgsrc has carried a patch for Python that disables POSIX
semaphore use (pthread_mutex* and pthread_cond* are used instead).

I don't have a small test case yet, but when you build python without
the pkgsrc patches and run

./python ../Tools/scripts/run_tests.py test_compileall \
    test_multiprocessing_fork test_concurrent_futures

these tests will not finish, and I can see lots of python processes in
state 'psem':

 2791 wiz       77    0    96M   11M psem/31   0:00  0.00%  0.00% python

Is anyone aware of problems with semaphores on NetBSD, or has looked
at this particular problem before?

 Thomas