Re: NetBSD-10.0/i386 spurious SIGSEGV
On Sat, Jun 08, 2024 at 10:10:58PM -0400, Mouse wrote:
> First thing I'd look at is the userland instruction(s) around the crash
> point, maybe look at instructions starting at 0xbb610480 or something
> and then disassemble forwards looking for 0xbb610579.  In particular,
> I'd be interested in whether it's a store instruction that failed or
> whether this happened during a syscall trap.

   0xbb610570 <__gettimeofday50>:       mov    $0x1a2,%eax
   0xbb610575 <__gettimeofday50+5>:     int    $0x80
   0xbb610577 <__gettimeofday50+7>:     jb     0xbb61057a <__gettimeofday50+10>
=> 0xbb610579 <__gettimeofday50+9>:     ret

> Are all the failures in __gettimeofday50?  All in trap-to-the-kernel
> calls?

I have seen many crashes on system call returns. Another one in
__gettimeofday50:

   0xbb610570 <__gettimeofday50>:       mov    $0x1a2,%eax
   0xbb610575 <__gettimeofday50+5>:     int    $0x80
   0xbb610577 <__gettimeofday50+7>:     jb     0xbb61057a <__gettimeofday50+10>
   0xbb610579 <__gettimeofday50+9>:     ret
=> 0xbb61057a <__gettimeofday50+10>:    push   %ebx

Another one:

   0xbb610570 <__gettimeofday50>:       mov    $0x1a2,%eax
   0xbb610575 <__gettimeofday50+5>:     int    $0x80
=> 0xbb610577 <__gettimeofday50+7>:     jb     0xbb61057a <__gettimeofday50+10>
   0xbb610579 <__gettimeofday50+9>:     ret

At first I suspected a stack problem, but I think the last one proves
this is not the case: the faulting jb involves no memory access.

> You say "multiple machines"; are those multiple domUs on a single dom0,
> or are they spread across multiple underlying hardware machines?

It happens on multiple hardware machines, and it starts when the domU is
upgraded. I even tested moving a domU from one machine to another, and
the bug followed it. Other netbsd-9 domUs on the same dom0 have no
problem, or at least it is rare enough that I did not notice it for
years.

> If the latter, how similar are those underlying machines?

Same model:
vcpu3: Intel(R) Xeon(R) CPU E3-1220 v6 @ 3.00GHz, id 0x906e9

-- 
Emmanuel Dreyfus
m...@netbsd.org
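When comparing many core dumps, the reasoning above (does the faulting instruction touch memory at all?) can be automated. What follows is only a minimal Python sketch, not something used in this thread: it parses gdb `x/i`-style listings like the three above, finds the line marked `=>`, and reports whether that instruction accesses the stack. The names `faulting_instruction`, `touches_memory`, and the small `MEMORY_ACCESS` table are hypothetical, and the table covers only the mnemonics that appear in these dumps.

```python
import re

# One disassembly line as printed by gdb, optionally prefixed by "=>".
LINE_RE = re.compile(
    r"^(?P<marker>=>)?\s*(?P<addr>0x[0-9a-f]+)\s+<(?P<sym>[^>]+)>:\s*(?P<mnemonic>[a-z]+)"
)

# Assumption: a deliberately tiny table, limited to the mnemonics seen in
# the thread. True means the instruction reads or writes data memory.
MEMORY_ACCESS = {
    "ret": True,    # pops the return address from the stack
    "push": True,   # writes to the stack
    "jb": False,    # conditional jump on the carry flag, no data access
}

def faulting_instruction(dump):
    """Return (address, symbol, mnemonic) for the line marked '=>', or None."""
    for line in dump.splitlines():
        m = LINE_RE.match(line.strip())
        if m and m.group("marker"):
            return int(m.group("addr"), 16), m.group("sym"), m.group("mnemonic")
    return None

def touches_memory(mnemonic):
    """True/False for known mnemonics, None when the table has no entry."""
    return MEMORY_ACCESS.get(mnemonic)

# The three crash sites reported in the message above.
DUMPS = [
    """
   0xbb610570 <__gettimeofday50>:       mov    $0x1a2,%eax
   0xbb610575 <__gettimeofday50+5>:     int    $0x80
   0xbb610577 <__gettimeofday50+7>:     jb     0xbb61057a <__gettimeofday50+10>
=> 0xbb610579 <__gettimeofday50+9>:     ret
""",
    """
   0xbb610570 <__gettimeofday50>:       mov    $0x1a2,%eax
   0xbb610575 <__gettimeofday50+5>:     int    $0x80
   0xbb610577 <__gettimeofday50+7>:     jb     0xbb61057a <__gettimeofday50+10>
   0xbb610579 <__gettimeofday50+9>:     ret
=> 0xbb61057a <__gettimeofday50+10>:    push   %ebx
""",
    """
   0xbb610570 <__gettimeofday50>:       mov    $0x1a2,%eax
   0xbb610575 <__gettimeofday50+5>:     int    $0x80
=> 0xbb610577 <__gettimeofday50+7>:     jb     0xbb61057a <__gettimeofday50+10>
   0xbb610579 <__gettimeofday50+9>:     ret
""",
]

for dump in DUMPS:
    addr, sym, mnemonic = faulting_instruction(dump)
    print(hex(addr), sym, mnemonic, "memory access:", touches_memory(mnemonic))
```

A fault reported on an instruction for which `touches_memory` is False (here the `jb` case) points away from a bad stack page and toward something that happens on the return from the kernel.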
Re: NetBSD-10.0/i386 spurious SIGSEGV
> After upgrading i386 XEN3PAE_DOMU to NetBSD 10.0, various daemons on
> multiple machines get SIGSEGV at places I could not figure any reason
> why it happens.
[...]
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0xbb610579 in __gettimeofday50 () from /lib/libc.so.12
> (gdb) bt
> #0  0xbb610579 in __gettimeofday50 () from /lib/libc.so.12
> #1  0xbb60ca82 in __time50 (t=t@entry=0xbf7fde88)
>     at /usr/src/lib/libc/gen/time.c:52
> #2  0x0808afdd in update_check_stats (check_type=3, check_time=1717878817)
>     at utils.c:3015

First thing I'd look at is the userland instruction(s) around the crash
point, maybe look at instructions starting at 0xbb610480 or something
and then disassemble forwards looking for 0xbb610579.  In particular,
I'd be interested in whether it's a store instruction that failed or
whether this happened during a syscall trap.

Are all the failures in __gettimeofday50?  All in trap-to-the-kernel
calls?

You say "multiple machines"; are those multiple domUs on a single dom0,
or are they spread across multiple underlying hardware machines?  If
the latter, how similar are those underlying machines?  I'm wondering
if perhaps something is broken in a subtle way such that it manifests
on only certain hardware (I'm talking about something along the lines
of "this tickles erratum #2188 in stepping 478 of Intel CPUs from the
Forest Lawn family").

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
NetBSD-10.0/i386 spurious SIGSEGV
Hello,

After upgrading i386 XEN3PAE_DOMU to NetBSD 10.0, various daemons on
multiple machines get SIGSEGV at places where I cannot see any reason
for it to happen.

Here is an example with nagios, but I see similar problems with apache
httpd, sendmail, slapd, and even the built-in syslogd and ping. This is
a rare event that happens a few times a day.

Any hint on how I could track this problem down?

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0xbb610579 in __gettimeofday50 () from /lib/libc.so.12
(gdb) bt
#0  0xbb610579 in __gettimeofday50 () from /lib/libc.so.12
#1  0xbb60ca82 in __time50 (t=t@entry=0xbf7fde88)
    at /usr/src/lib/libc/gen/time.c:52
#2  0x0808afdd in update_check_stats (check_type=3, check_time=1717878817)
    at utils.c:3015
#3  0x0806275b in run_async_host_check (hst=hst@entry=0xbb412300,
    check_options=check_options@entry=0, latency=latency@entry=0.010999,
    scheduled_check=scheduled_check@entry=1,
    reschedule_check=reschedule_check@entry=1,
    time_is_valid=time_is_valid@entry=0xbf7fe2dc,
    preferred_time=preferred_time@entry=0xbf7fe2e8) at checks.c:3257
#4  0x08062b41 in run_scheduled_host_check (hst=hst@entry=0xbb412300,
    check_options=0, latency=latency@entry=0.010999) at checks.c:3023
#5  0x08078882 in handle_timed_event (event=event@entry=0xb968e4f0)
    at events.c:1235
#6  0x080792f8 in event_execution_loop () at events.c:1164
#7  0x080583d5 in main (argc=3, argv=0xbf7fe590) at nagios.c:846
(gdb) frame 1
#1  0xbb60ca82 in __time50 (t=t@entry=0xbf7fde88)
    at /usr/src/lib/libc/gen/time.c:52
52              if (gettimeofday(&tt, NULL) == -1)
(gdb) print tt
$1 = {tv_sec = 1717878817, tv_usec = 592767}

-- 
Emmanuel Dreyfus
m...@netbsd.org