date:20240608

Re: NetBSD-10.0/i386 spurious SIGSEGV

2024-06-08 Thread Emmanuel Dreyfus

On Sat, Jun 08, 2024 at 10:10:58PM -0400, Mouse wrote:
> First thing I'd look at is the userland instruction(s) around the crash
> point, maybe look at instructions starting at 0xbb610480 or something
> and then disassemble forwards looking for 0xbb610579.  In particular,
> I'd be interested in whether it's a store instruction that failed or
> whether this happened during a syscall trap.

   0xbb610570 <__gettimeofday50>:    mov    $0x1a2,%eax
   0xbb610575 <__gettimeofday50+5>:  int    $0x80
   0xbb610577 <__gettimeofday50+7>:  jb     0xbb61057a <__gettimeofday50+10>
=> 0xbb610579 <__gettimeofday50+9>:  ret

> Are all the failures in __gettimeofday50?  All in trap-to-the-kernel
> calls?

I have seen many crashes on system call returns. Another one on
__gettimeofday50:

   0xbb610570 <__gettimeofday50>:    mov    $0x1a2,%eax
   0xbb610575 <__gettimeofday50+5>:  int    $0x80
   0xbb610577 <__gettimeofday50+7>:  jb     0xbb61057a <__gettimeofday50+10>
   0xbb610579 <__gettimeofday50+9>:  ret
=> 0xbb61057a <__gettimeofday50+10>: push   %ebx

Another one:
   0xbb610570 <__gettimeofday50>:    mov    $0x1a2,%eax
   0xbb610575 <__gettimeofday50+5>:  int    $0x80
=> 0xbb610577 <__gettimeofday50+7>:  jb     0xbb61057a <__gettimeofday50+10>
   0xbb610579 <__gettimeofday50+9>:  ret

At first I suspected a stack problem, but I think the last one rules
that out: the faulting instruction is a conditional jump, which involves
no memory access.
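
One way to check whether the system call return path alone can trigger
the fault, independently of nagios and the other daemons, would be a
small stress loop around gettimeofday(2). The following is only a
sketch: hammer_gettimeofday and the STANDALONE build guard are names
made up for illustration, and nothing here is known to reproduce the
crash.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/time.h>

/*
 * Call gettimeofday(2) `iterations` times and count anomalies: calls
 * that fail, or calls that return a tv_usec outside [0, 999999].
 * hammer_gettimeofday is a hypothetical helper, not part of libc.
 */
long
hammer_gettimeofday(long iterations)
{
	struct timeval tv;
	long anomalies = 0;

	assert(iterations >= 0);
	for (long i = 0; i < iterations; i++) {
		if (gettimeofday(&tv, NULL) == -1) {
			perror("gettimeofday");
			anomalies++;
		} else if (tv.tv_usec < 0 || tv.tv_usec >= 1000000) {
			fprintf(stderr, "bogus tv_usec %ld\n", (long)tv.tv_usec);
			anomalies++;
		}
	}
	return anomalies;
}

#ifdef STANDALONE
/* Build with: cc -DSTANDALONE -o hammer hammer.c */
int
main(void)
{
	/* The iteration count is arbitrary; raise it for longer runs. */
	long anomalies = hammer_gettimeofday(100000000L);

	printf("%ld anomalies\n", anomalies);
	return anomalies != 0;
}
#endif
```

If the process dies with SIGSEGV inside the loop, the core dump can be
examined the same way as the ones above. A clean run proves nothing,
given that the fault only shows up a few times a day.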

> You say "multiple machines"; are those multiple domUs on a single dom0,
> or are they spread across multiple underlying hardware machines? 

It happens on multiple hardware machines, and it starts when the domU
is upgraded. I even tested moving a domU from one machine to another,
and the bug followed. Other netbsd-9 domUs on the same dom0 have
no problem, or at least it is rare enough that I have not noticed it
in years.

> If the latter, how similar are those underlying machines? 

Same model:
vcpu3: Intel(R) Xeon(R) CPU E3-1220 v6 @ 3.00GHz, id 0x906e9


-- 
Emmanuel Dreyfus
m...@netbsd.org

Re: NetBSD-10.0/i386 spurious SIGSEGV

2024-06-08 Thread Mouse

> After upgrading i386 XEN3PAE_DOMU to NetBSD 10.0, various daemons on
> multiple machines get SIGSEGV at places I could not figure any reason
> why it happens.  [...]

> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0xbb610579 in __gettimeofday50 () from /lib/libc.so.12
> (gdb) bt
> #0  0xbb610579 in __gettimeofday50 () from /lib/libc.so.12
> #1  0xbb60ca82 in __time50 (t=t@entry=0xbf7fde88)
>     at /usr/src/lib/libc/gen/time.c:52
> #2  0x0808afdd in update_check_stats (check_type=3, check_time=1717878817)
>     at utils.c:3015

First thing I'd look at is the userland instruction(s) around the crash
point, maybe look at instructions starting at 0xbb610480 or something
and then disassemble forwards looking for 0xbb610579.  In particular,
I'd be interested in whether it's a store instruction that failed or
whether this happened during a syscall trap.

Are all the failures in __gettimeofday50?  All in trap-to-the-kernel
calls?

You say "multiple machines"; are those multiple domUs on a single dom0,
or are they spread across multiple underlying hardware machines?  If
the latter, how similar are those underlying machines?  I'm wondering
if perhaps something is broken in a subtle way such that it manifests
on only certain hardware (I'm talking about something along the lines
of "this tickles erratum #2188 in stepping 478 of Intel CPUs from the
Forest Lawn family").

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

NetBSD-10.0/i386 spurious SIGSEGV

2024-06-08 Thread Emmanuel Dreyfus

Hello

After upgrading i386 XEN3PAE_DOMU to NetBSD 10.0, various daemons
on multiple machines get SIGSEGV at places where I cannot figure out
any reason for it. Here is an example with nagios, but I see
similar problems with apache httpd, sendmail, slapd, and even the
built-in syslogd and ping. This is a rare event that happens a
few times a day.

Any hint how I could track this problem down?

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0xbb610579 in __gettimeofday50 () from /lib/libc.so.12
(gdb) bt
#0  0xbb610579 in __gettimeofday50 () from /lib/libc.so.12
#1  0xbb60ca82 in __time50 (t=t@entry=0xbf7fde88)
    at /usr/src/lib/libc/gen/time.c:52
#2  0x0808afdd in update_check_stats (check_type=3, check_time=1717878817)
    at utils.c:3015
#3  0x0806275b in run_async_host_check (hst=hst@entry=0xbb412300,
    check_options=check_options@entry=0,
    latency=latency@entry=0.010999,
    scheduled_check=scheduled_check@entry=1,
    reschedule_check=reschedule_check@entry=1,
    time_is_valid=time_is_valid@entry=0xbf7fe2dc,
    preferred_time=preferred_time@entry=0xbf7fe2e8) at checks.c:3257
#4  0x08062b41 in run_scheduled_host_check (hst=hst@entry=0xbb412300,
    check_options=0, latency=latency@entry=0.010999)
    at checks.c:3023
#5  0x08078882 in handle_timed_event (event=event@entry=0xb968e4f0)
    at events.c:1235
#6  0x080792f8 in event_execution_loop () at events.c:1164
#7  0x080583d5 in main (argc=3, argv=0xbf7fe590) at nagios.c:846
(gdb) frame 1
#1  0xbb60ca82 in __time50 (t=t@entry=0xbf7fde88)
    at /usr/src/lib/libc/gen/time.c:52
52              if (gettimeofday(&tt, NULL) == -1)
(gdb) print tt
$1 = {tv_sec = 1717878817, tv_usec = 592767}



-- 
Emmanuel Dreyfus
m...@netbsd.org
