Re: Memory corruption after fork, only on AMD CPUs
Michael Pratt writes:

> On Tue, Dec 14, 2021 at 1:06 PM Michael Pratt wrote:
>>
>> [This is a reply to
>> https://mail-index.netbsd.org/tech-kern/2021/12/01/msg027830.html. I
>> just joined the mailing list and can't seem to find the metadata
>> required for a proper reply. Apologies.]
>>
>> I filed https://gnats.netbsd.org/56535 for this a while ago, which has
>> an even simpler reproducer: a direct fork() call with a child that
>> immediately exits sometimes causes memory corruption in the parent
>> process.
>>
>> We've kept looking since filing https://gnats.netbsd.org/56535 but
>> haven't had luck on further simplification. No C reproducer yet,
>> unfortunately. (No crashes if the Go parent process is single-threaded
>> either.)
>
> I spoke too soon here, we managed to get a reproducer in C today,
> which I've posted at
> https://github.com/golang/go/issues/34988#issuecomment-994115345.

I don't have a big collection of AMD systems, but I do have a couple.
Everything here is Xen, however, and nothing is really very recent,
either from the hardware POV or, in a lot of cases, the OS.

Ryzen 3 2200G - 2 vcpu DOMU running 9.0_STABLE and a 1 processor DOM0
running 8.99.25: could not reproduce this running the code from the
DOMU or DOM0.

Athlon 64 X2 5600+ - 1 vcpu DOMU running 9.99.74 and a 1 processor DOM0
running 8.0_STABLE: could not reproduce this running the code from the
DOMU or DOM0.

As a control test, a 2 vcpu 9.0_STABLE DOMU running on an Intel system
could also not reproduce this.

Since this test is a bit brutal, I didn't let it run too long, as the
systems are doing other stuff, but it ran for several minutes with no
failures reported. All of the systems are NetBSD/amd64.

>> This feels like a bug in memory management somewhere (TLB invalidation
>> issue, bug in copy-on-write?). Fundamentally, we have the parent
>> process getting corrupt memory after calling fork with an
>> (effectively) no-op child. That just shouldn't happen.
>>
>> I think we need someone familiar with NetBSD memory management
>> internals to help take a look. Otherwise I'm afraid we won't figure it
>> out and will have to declare that Go doesn't work on NetBSD on AMD
>> CPUs.
>>
>> gdt: that does sound like a different issue to me. It may be worth
>> filing a bug at https://github.com/golang/go/issues with the crash
>> details.
>>
>> Thanks,
>> Michael

--
Brad Spencer - b...@anduin.eldar.org - KC8VKS - http://anduin.eldar.org
Re: Memory corruption after fork, only on AMD CPUs
On Tue, Dec 14, 2021 at 1:06 PM Michael Pratt wrote:
>
> [This is a reply to
> https://mail-index.netbsd.org/tech-kern/2021/12/01/msg027830.html. I
> just joined the mailing list and can't seem to find the metadata
> required for a proper reply. Apologies.]
>
> I filed https://gnats.netbsd.org/56535 for this a while ago, which has
> an even simpler reproducer: a direct fork() call with a child that
> immediately exits sometimes causes memory corruption in the parent
> process.
>
> We've kept looking since filing https://gnats.netbsd.org/56535 but
> haven't had luck on further simplification. No C reproducer yet,
> unfortunately. (No crashes if the Go parent process is single-threaded
> either.)

I spoke too soon here, we managed to get a reproducer in C today,
which I've posted at
https://github.com/golang/go/issues/34988#issuecomment-994115345.

>
> This feels like a bug in memory management somewhere (TLB invalidation
> issue, bug in copy-on-write?). Fundamentally, we have the parent
> process getting corrupt memory after calling fork with an
> (effectively) no-op child. That just shouldn't happen.
>
> I think we need someone familiar with NetBSD memory management
> internals to help take a look. Otherwise I'm afraid we won't figure it
> out and will have to declare that Go doesn't work on NetBSD on AMD
> CPUs.
>
> gdt: that does sound like a different issue to me. It may be worth
> filing a bug at https://github.com/golang/go/issues with the crash
> details.
>
> Thanks,
> Michael
Re: Memory corruption after fork, only on AMD CPUs
[This is a reply to
https://mail-index.netbsd.org/tech-kern/2021/12/01/msg027830.html. I
just joined the mailing list and can't seem to find the metadata
required for a proper reply. Apologies.]

I filed https://gnats.netbsd.org/56535 for this a while ago, which has
an even simpler reproducer: a direct fork() call with a child that
immediately exits sometimes causes memory corruption in the parent
process.

We've kept looking since filing https://gnats.netbsd.org/56535 but
haven't had luck on further simplification. No C reproducer yet,
unfortunately. (No crashes if the Go parent process is single-threaded
either.)

This feels like a bug in memory management somewhere (TLB invalidation
issue, bug in copy-on-write?). Fundamentally, we have the parent
process getting corrupt memory after calling fork with an
(effectively) no-op child. That just shouldn't happen.

I think we need someone familiar with NetBSD memory management
internals to help take a look. Otherwise I'm afraid we won't figure it
out and will have to declare that Go doesn't work on NetBSD on AMD
CPUs.

gdt: that does sound like a different issue to me. It may be worth
filing a bug at https://github.com/golang/go/issues with the crash
details.

Thanks,
Michael
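The failure pattern described in this thread (a multi-threaded parent, a fork() whose child exits at once, and parent memory that later reads back wrong) can be sketched in C roughly as follows. This is not the reproducer posted on the Go issue; it is a minimal illustration under assumed parameters. The names `fork_stress`, `worker`, `NWORKERS` and `BUF_WORDS` are hypothetical, and on a healthy system the function is expected to report zero mismatches.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWORKERS  4	/* hypothetical: threads touching memory in the parent */
#define BUF_WORDS 4096	/* hypothetical: 64-bit words per private buffer */

static atomic_int stop_flag;
static atomic_long mismatches;

/* Each worker fills a private buffer with a known pattern and reads it back. */
static void *worker(void *arg)
{
	uint64_t seed = (uint64_t)(uintptr_t)arg;
	/* volatile so that every store and load really goes to memory */
	volatile uint64_t *buf = malloc(BUF_WORDS * sizeof(uint64_t));

	if (buf == NULL)
		return NULL;
	while (!atomic_load(&stop_flag)) {
		seed += NWORKERS;
		for (size_t i = 0; i < BUF_WORDS; i++)
			buf[i] = seed ^ i;
		for (size_t i = 0; i < BUF_WORDS; i++)
			if (buf[i] != (seed ^ i))
				atomic_fetch_add(&mismatches, 1);
	}
	free((void *)buf);
	return NULL;
}

/*
 * Fork nforks no-op children while the workers run.
 * Returns the number of mismatches the workers saw, or -1 on a setup error.
 */
long fork_stress(int nforks)
{
	pthread_t tids[NWORKERS];
	int started = 0;
	int failed = 0;

	atomic_store(&stop_flag, 0);
	atomic_store(&mismatches, 0);

	while (started < NWORKERS) {
		if (pthread_create(&tids[started], NULL, worker,
		    (void *)(uintptr_t)(started + 1)) != 0) {
			failed = 1;
			break;
		}
		started++;
	}

	for (int i = 0; i < nforks && !failed; i++) {
		pid_t pid = fork();

		if (pid == 0)
			_exit(0);	/* child: effectively a no-op */
		if (pid < 0 || waitpid(pid, NULL, 0) < 0)
			failed = 1;
	}

	atomic_store(&stop_flag, 1);
	for (int i = 0; i < started; i++)
		(void)pthread_join(tids[i], NULL);

	return failed ? -1 : atomic_load(&mismatches);
}
```

Calling `fork_stress(100000)` from a small `main()` and printing the result would be one way to drive it. A nonzero return would mean a worker read back something other than what it had just written, which is the symptom reported above. A zero return proves nothing, since the reported corruption is intermittent.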
Re: python vs. semaphores
On Tue, Dec 14, 2021 at 02:08:39PM +0100, Thomas Klausner wrote:
> Is anyone aware of problems with semaphores on NetBSD, or has looked
> at this particular problem before?

Ok, my first test already gave me a bug.

The man page says that multiple consecutive sem_open calls (without
sem_close in between) should return the same address. They don't.

I've written an atf test. I don't know how to mark it as
currently-broken, so I'll attach it here (and to the bug report).

I don't think fixing this one will fix semaphores in Python though :)
 Thomas

? .gdbinit
? Atffile
? atf-run.log
? t_sched
? t_sem
Index: t_sem.c
===
RCS file: /cvsroot/src/tests/lib/librt/t_sem.c,v
retrieving revision 1.5
diff -u -r1.5 t_sem.c
--- t_sem.c	14 May 2020 08:34:19 -	1.5
+++ t_sem.c	14 Dec 2021 15:38:47 -
@@ -313,6 +313,31 @@
 	(void)sem_unlink("/sem_c");
 }
 
+ATF_TC_WITH_CLEANUP(sem_open_address);
+ATF_TC_HEAD(sem_open_address, tc)
+{
+	atf_tc_set_md_var(tc, "descr", "Validate that multiple sem_open calls "
+	    "return the same address");
+}
+ATF_TC_BODY(sem_open_address, tc)
+{
+	sem_t *sem, *sem2, *sem3;
+	sem = sem_open("/sem_d", O_CREAT | O_EXCL, 0777, 0);
+	ATF_REQUIRE(sem != SEM_FAILED);
+	sem2 = sem_open("/sem_d", O_CREAT | O_EXCL, 0777, 0);
+	ATF_REQUIRE(sem2 == SEM_FAILED && errno == EEXIST);
+	sem3 = sem_open("/sem_d", 0);
+	ATF_REQUIRE(sem3 != SEM_FAILED);
+	ATF_REQUIRE(sem == sem3);
+	ATF_REQUIRE_EQ(sem_close(sem3), 0);
+	ATF_REQUIRE_EQ(sem_close(sem), 0);
+	ATF_REQUIRE_EQ(sem_unlink("/sem_d"), 0);
+}
+ATF_TC_CLEANUP(sem_open_address, tc)
+{
+	(void)sem_unlink("/sem_d");
+}
+
 ATF_TP_ADD_TCS(tp)
 {
 
@@ -320,6 +345,7 @@
 	ATF_TP_ADD_TC(tp, child);
 	ATF_TP_ADD_TC(tp, pshared);
 	ATF_TP_ADD_TC(tp, invalid_ops);
+	ATF_TP_ADD_TC(tp, sem_open_address);
 
 	return atf_no_error();
 }
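For trying the same check outside the ATF framework, a standalone sketch along these lines may be handy. It is an illustration only, not part of the patch above: the function name `sem_open_same_address` is hypothetical, and the caller chooses the semaphore name. It opens a named semaphore twice without an intervening sem_close and reports whether both calls returned the same pointer.

```c
#include <fcntl.h>
#include <semaphore.h>

/*
 * Open the named semaphore twice without closing it in between.
 * Returns 1 if both calls yield the same address, 0 if the addresses
 * differ, and -1 if either sem_open call fails.
 */
int sem_open_same_address(const char *name)
{
	sem_t *first, *second;
	int same;

	(void)sem_unlink(name);		/* drop any stale semaphore */

	first = sem_open(name, O_CREAT | O_EXCL, 0600, 0);
	if (first == SEM_FAILED)
		return -1;

	second = sem_open(name, 0);	/* plain open of the existing name */
	if (second == SEM_FAILED) {
		(void)sem_close(first);
		(void)sem_unlink(name);
		return -1;
	}

	same = (first == second);

	/* one sem_close per successful sem_open, as in the ATF test */
	(void)sem_close(second);
	(void)sem_close(first);
	(void)sem_unlink(name);
	return same;
}
```

A return value of 0 would correspond to the behaviour reported in this message, and 1 to what the man page is said to promise. The name passed in (for example "/sem_probe", an arbitrary choice) should start with a slash, like the names in the test.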
python vs. semaphores
Hi!

Since at least python 2.7 (2011, probably longer, I haven't dug
deeper) pkgsrc comes with a patch for Python that disables POSIX
semaphore use (pthread_mutex* and pthread_cond* are used instead).

I don't have a small test case yet, but when you build python without
the pkgsrc patches and run

./python ../Tools/scripts/run_tests.py test_compileall test_multiprocessing_fork test_concurrent_futures

these tests will not finish and I can see lots of pythons in state
'psem'

 2791 wiz 77096M 11M psem/31 0:00 0.00% 0.00% python

Is anyone aware of problems with semaphores on NetBSD, or has looked
at this particular problem before?
 Thomas
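To illustrate what "pthread_mutex* and pthread_cond* are used instead" can look like, here is a minimal sketch of a binary lock built from a mutex and a condition variable rather than a POSIX semaphore. It is not taken from CPython or from the pkgsrc patch. The type `struct cond_lock` and the `cond_lock_*` functions are hypothetical names. Unlike a plain mutex, this lock may be released by a thread other than the one that acquired it, which is the semaphore-like property being emulated.

```c
#include <pthread.h>

/* Hypothetical binary lock emulated with a mutex and a condition variable. */
struct cond_lock {
	pthread_mutex_t mu;	/* protects 'locked' */
	pthread_cond_t cv;	/* signalled when the lock is released */
	int locked;		/* 1 while some thread holds the lock */
};

/* Returns 0 on success, or a pthread error number. */
int cond_lock_init(struct cond_lock *l)
{
	int rc;

	l->locked = 0;
	rc = pthread_mutex_init(&l->mu, NULL);
	if (rc != 0)
		return rc;
	rc = pthread_cond_init(&l->cv, NULL);
	if (rc != 0)
		(void)pthread_mutex_destroy(&l->mu);
	return rc;
}

/*
 * Try to take the lock. If 'wait' is nonzero, block until it is free.
 * Returns 1 if the lock was acquired, 0 if it was busy and wait == 0.
 */
int cond_lock_acquire(struct cond_lock *l, int wait)
{
	int got;

	pthread_mutex_lock(&l->mu);
	while (l->locked && wait)
		pthread_cond_wait(&l->cv, &l->mu);
	got = !l->locked;
	if (got)
		l->locked = 1;
	pthread_mutex_unlock(&l->mu);
	return got;
}

/* Release the lock and wake one waiter; any thread may call this. */
void cond_lock_release(struct cond_lock *l)
{
	pthread_mutex_lock(&l->mu);
	l->locked = 0;
	pthread_mutex_unlock(&l->mu);
	pthread_cond_signal(&l->cv);
}

void cond_lock_destroy(struct cond_lock *l)
{
	(void)pthread_cond_destroy(&l->cv);
	(void)pthread_mutex_destroy(&l->mu);
}
```

The trade-off is extra state and two pthread objects per lock in exchange for not depending on sem_* at all, which is presumably why such a fallback is attractive when semaphores misbehave.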