Re: sparc64 reproducable panic on 5.9
On 03/31/16 09:27, Mark Kettenis wrote:
> > Date: Wed, 30 Mar 2016 09:10:04 +0200
> > From: Stefan Sperling
> >
> > On Wed, Mar 30, 2016 at 07:54:31AM +0100, Fred wrote:
> > > This looks similar to the issues in this thread:
> > >
> > > http://marc.info/?t=14346611501
> > >
> > > I'm not sure a definitive solution was found - but having dc NICs
> > > was part of the issue.
> >
> > It's not driver dependent.
> >
> > On my blade100 a similar crash is happening with ral(4).
> > http://marc.info/?l=openbsd-bugs=144802152407599=2
>
> I made dc(4) print the DMA addresses and lengths of the segments it
> puts on the tx ring (values in hex):
>
> X 60095868/86
> X 6009479a/42
> X 60090768/8e
> X 60095f9a/42
> X 60095f9a/66
>
> The crash happens immediately after that last one. Note that
>
> 0x60095f9a + 0x66 = 0x60096000
>
> In other words, this is a segment that ends exactly at the end of a
> page. In all likelihood the NIC's DMA engine is overfetching and
> runs into the next page. That page isn't mapped into the IOMMU,
> which triggers the fault.
>
> In the past, the mbuf pool used in-page pool page headers. This
> would hide the issue, since the pool page header would "misalign"
> the pool items and induce some trailing unused space in the page.
>
> Not sure how to solve this yet; I'll be looking at Solaris for
> inspiration. Fairly certain that ral(4) has a similar issue.

Am I right in thinking, then, that Steve's [1] approach of changing the
else if in subr_pool.c to:

	} else if (256 > size) {

is just masking the issue?

[1] http://marc.info/?l=openbsd-bugs=144962985826087

Cheers

Fred
Re: sparc64 reproducable panic on 5.9
> Date: Wed, 30 Mar 2016 09:10:04 +0200
> From: Stefan Sperling
>
> On Wed, Mar 30, 2016 at 07:54:31AM +0100, Fred wrote:
> > This looks similar to the issues in this thread:
> >
> > http://marc.info/?t=14346611501
> >
> > I'm not sure a definitive solution was found - but having dc NICs
> > was part of the issue.
>
> It's not driver dependent.
>
> On my blade100 a similar crash is happening with ral(4).
> http://marc.info/?l=openbsd-bugs=144802152407599=2

I made dc(4) print the DMA addresses and lengths of the segments it
puts on the tx ring (values in hex):

X 60095868/86
X 6009479a/42
X 60090768/8e
X 60095f9a/42
X 60095f9a/66

The crash happens immediately after that last one. Note that

0x60095f9a + 0x66 = 0x60096000

In other words, this is a segment that ends exactly at the end of a
page. In all likelihood the NIC's DMA engine is overfetching and runs
into the next page. That page isn't mapped into the IOMMU, which
triggers the fault.

In the past, the mbuf pool used in-page pool page headers. This would
hide the issue, since the pool page header would "misalign" the pool
items and induce some trailing unused space in the page.

Not sure how to solve this yet; I'll be looking at Solaris for
inspiration. Fairly certain that ral(4) has a similar issue.
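The address arithmetic in this message can be checked with a small
standalone sketch. It is not code from dc(4) or the kernel: the helper
names and SKETCH_PAGE_SIZE are made up for illustration, and the 8 KiB
page size is an assumption about OpenBSD/sparc64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 8 KiB pages, as on OpenBSD/sparc64. */
#define SKETCH_PAGE_SIZE 0x2000u

/*
 * Hypothetical helper: true if the segment [addr, addr + len) ends
 * exactly on a page boundary, so any read past its end would land in
 * the following page.
 */
static bool
seg_ends_on_page_boundary(uint64_t addr, uint64_t len)
{
	assert(len > 0);
	return ((addr + len) % SKETCH_PAGE_SIZE) == 0;
}

/*
 * Hypothetical helper: number of bytes between the end of the segment
 * and the end of the page that contains its last byte.
 */
static uint64_t
seg_slack_in_page(uint64_t addr, uint64_t len)
{
	uint64_t rem;

	assert(len > 0);
	rem = (addr + len) % SKETCH_PAGE_SIZE;
	return rem == 0 ? 0 : SKETCH_PAGE_SIZE - rem;
}
```

With the values printed above, seg_ends_on_page_boundary(0x60095f9a,
0x66) is true, while the preceding segment 60095f9a/42 still has 0x24
bytes of slack before the page ends. Under these assumptions, only the
last segment leaves no room for an overfetch.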
Re: sparc64 reproducable panic on 5.9
> Date: Wed, 30 Mar 2016 15:01:51 +0200
> From: Matthieu Herrb
>
> On Wed, Mar 30, 2016 at 07:54:31AM +0100, Fred wrote:
> > On 03/30/16 06:52, Matthieu Herrb wrote:
> > > Hi,
> > >
> > > Yesterday I upgraded a SunFire V100 which serves as an ntpd
> > > server. It was running 5.5 before and had 330 days of uptime
> > > (which seems to eliminate hw failure).
> > >
> > > Running 5.9, the box has already crashed twice, each time after
> > > ca. 4 hours of uptime, with the same panic. Details below.
> > >
> > > Any idea/patch/things to look for under ddb next time it happens?
> >
> > This looks similar to the issues in this thread:
> >
> > http://marc.info/?t=14346611501
> >
> > I'm not sure a definitive solution was found - but having dc NICs
> > was part of the issue.
>
> I can confirm that the patch from
> http://marc.info/?l=openbsd-bugs=144833598011206=2 (reproduced
> below) seems to fix the issue for my machine.
>
> Without it, it's impossible to scp a new /bsd to the machine (I needed
> to use ftp to transfer the kernel to it). With it I've been able to
> scp /bsd several times.
>
> If someone (dlg?, kettenis?) is interested I can set up serial
> console access to a similar machine with the same problem under
> -current.

If you could hook me up with LOM access, that would be great. It
frustrates me quite a bit that these machines don't work anymore...
Re: sparc64 reproducable panic on 5.9
On Wed, Mar 30, 2016 at 07:54:31AM +0100, Fred wrote:
> On 03/30/16 06:52, Matthieu Herrb wrote:
> > Hi,
> >
> > Yesterday I upgraded a SunFire V100 which serves as an ntpd
> > server. It was running 5.5 before and had 330 days of uptime
> > (which seems to eliminate hw failure).
> >
> > Running 5.9, the box has already crashed twice, each time after
> > ca. 4 hours of uptime, with the same panic. Details below.
> >
> > Any idea/patch/things to look for under ddb next time it happens?
>
> This looks similar to the issues in this thread:
>
> http://marc.info/?t=14346611501
>
> I'm not sure a definitive solution was found - but having dc NICs
> was part of the issue.

I can confirm that the patch from
http://marc.info/?l=openbsd-bugs=144833598011206=2 (reproduced
below) seems to fix the issue for my machine.

Without it, it's impossible to scp a new /bsd to the machine (I needed
to use ftp to transfer the kernel to it). With it I've been able to
scp /bsd several times.

If someone (dlg?, kettenis?) is interested I can set up serial
console access to a similar machine with the same problem under
-current.

Index: subr_pool.c
===================================================================
RCS file: /cvs/src/sys/kern/subr_pool.c,v
retrieving revision 1.194
diff -u -r1.194 subr_pool.c
--- subr_pool.c	15 Jan 2016 11:21:58 -	1.194
+++ subr_pool.c	30 Mar 2016 11:42:42 -
@@ -258,7 +258,7 @@
 	 */
 	if (pgsize - (size * items) > sizeof(struct pool_item_header)) {
 		off = pgsize - sizeof(struct pool_item_header);
-	} else if (sizeof(struct pool_item_header) * 2 >= size) {
+	} else if (sizeof(struct pool_item_header) * 8 >= size) {
 		off = pgsize - sizeof(struct pool_item_header);
 		items = off / size;
 	}

-- 
Matthieu Herrb
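To see what changing the multiplier from 2 to 8 does to the page
layout, here is a minimal userland sketch that mirrors only the shape
of the condition in the diff. Everything around that condition is an
assumption: struct layout and pool_layout() are hypothetical names, hdr
stands in for sizeof(struct pool_item_header) (its real value is not
given in this thread), and items and off are assumed to start out as
pgsize / size and 0.

```c
#include <assert.h>
#include <stddef.h>

struct layout {
	size_t items;	/* items that fit in one page */
	size_t off;	/* offset of the in-page header; 0 = header off-page */
	size_t slack;	/* bytes from the end of the last item to the page end */
};

/*
 * Hypothetical model of the header-placement decision.
 * hdr  stands in for sizeof(struct pool_item_header) (assumed value).
 * mult is the constant the patch changes (2 before, 8 after).
 */
static struct layout
pool_layout(size_t pgsize, size_t size, size_t hdr, size_t mult)
{
	struct layout l;

	assert(size > 0 && size <= pgsize && hdr < pgsize);

	l.items = pgsize / size;	/* assumed starting value */
	l.off = 0;			/* assumed starting value */

	if (pgsize - (size * l.items) > hdr) {
		/* Header fits into space that would be wasted anyway. */
		l.off = pgsize - hdr;
	} else if (hdr * mult >= size) {
		/* Give up item space to keep the header in the page. */
		l.off = pgsize - hdr;
		l.items = l.off / size;
	}

	l.slack = pgsize - size * l.items;
	return l;
}
```

As an illustration only, take an 8192-byte page, 256-byte items (the
size mentioned elsewhere in this thread) and an assumed 64-byte header.
With mult = 2 the header goes off-page, 32 items fill the page, and the
last item ends exactly at the page end. With mult = 8 the header stays
in the page, only 31 items fit, and 256 bytes of the page remain after
the last item. That trailing space is what would absorb an overfetch,
which is consistent with the patch hiding the problem rather than
fixing the DMA behaviour.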
Re: sparc64 reproducable panic on 5.9
On Wed, Mar 30, 2016 at 07:54:31AM +0100, Fred wrote:
> This looks similar to the issues in this thread:
>
> http://marc.info/?t=14346611501
>
> I'm not sure a definitive solution was found - but having dc NICs
> was part of the issue.

It's not driver dependent.

On my blade100 a similar crash is happening with ral(4).
http://marc.info/?l=openbsd-bugs=144802152407599=2
Re: sparc64 reproducable panic on 5.9
On 03/30/16 06:52, Matthieu Herrb wrote:
> Hi,
>
> Yesterday I upgraded a SunFire V100 which serves as an ntpd
> server. It was running 5.5 before and had 330 days of uptime
> (which seems to eliminate hw failure).
>
> Running 5.9, the box has already crashed twice, each time after
> ca. 4 hours of uptime, with the same panic. Details below.
>
> Any idea/patch/things to look for under ddb next time it happens?

This looks similar to the issues in this thread:

http://marc.info/?t=14346611501

I'm not sure a definitive solution was found - but having dc NICs was
part of the issue.

hth

Fred

> panic: psycho0: uncorrectable DMA error AFAR 6e866250 (pa=0 tte=0/69270012) AFSR 41ff4080
> Stopped at Debugger+0x8: nop
>    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
> *16805  16805     83    0x100010          0    0  ntpd
> psycho_ue(48a3200, 5a, 2, 4a1e050, 4000fab5700, 127) at psycho_ue+0x7c
> intr_handler(e0017ec8, 48a3300, 5facef, 800, e364, ) at intr_handler+0xc
> sparc_interrupt(18960f8, 400090cf900, 1695200, 0, 0, 400091005a0) at sparc_interrupt+0x298
> pool_put(18960f8, 400090cf900, 400090cf920, 0, 70197ff, 0) at pool_put+0x1dc
> m_free(400090cf900, 9, 400090cf900, 0, 0, 1) at m_free+0x9c
> m_freem(400090cf900, 4000fab5b78, 4000fab5b90, 0, 0, 0) at m_freem+0xc
> sendit(0, 9, 0, 0, 4000fab5df0, 14b) at sendit+0x2bc
> sys_sendto(40009177690, 4000fab5db0, 4000fab5df0, cb70bfd1cc, 0, 14b) at sys_sendto+0x68
> syscall(4000fab5ed0, 485, cb70c04918, cb70c0491c, 0, 0) at syscall+0x34c
> softtrap(9, fffdfdbc, 30, 0, fffdfe20, 10) at softtrap+0x19c
> http://www.openbsd.org/ddb.html describes the minimum info required in bug
> reports. Insufficient info makes it difficult to find and fix bugs.
sparc64 reproducable panic on 5.9
Hi,

Yesterday I upgraded a SunFire V100 which serves as an ntpd
server. It was running 5.5 before and had 330 days of uptime
(which seems to eliminate hw failure).

Running 5.9, the box has already crashed twice, each time after
ca. 4 hours of uptime, with the same panic. Details below.

Any idea/patch/things to look for under ddb next time it happens?

panic: psycho0: uncorrectable DMA error AFAR 6e866250 (pa=0 tte=0/69270012) AFSR 41ff4080
Stopped at Debugger+0x8: nop
   TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
*16805  16805     83    0x100010          0    0  ntpd
psycho_ue(48a3200, 5a, 2, 4a1e050, 4000fab5700, 127) at psycho_ue+0x7c
intr_handler(e0017ec8, 48a3300, 5facef, 800, e364, ) at intr_handler+0xc
sparc_interrupt(18960f8, 400090cf900, 1695200, 0, 0, 400091005a0) at sparc_interrupt+0x298
pool_put(18960f8, 400090cf900, 400090cf920, 0, 70197ff, 0) at pool_put+0x1dc
m_free(400090cf900, 9, 400090cf900, 0, 0, 1) at m_free+0x9c
m_freem(400090cf900, 4000fab5b78, 4000fab5b90, 0, 0, 0) at m_freem+0xc
sendit(0, 9, 0, 0, 4000fab5df0, 14b) at sendit+0x2bc
sys_sendto(40009177690, 4000fab5db0, 4000fab5df0, cb70bfd1cc, 0, 14b) at sys_sendto+0x68
syscall(4000fab5ed0, 485, cb70c04918, cb70c0491c, 0, 0) at syscall+0x34c
softtrap(9, fffdfdbc, 30, 0, fffdfe20, 10) at softtrap+0x19c
http://www.openbsd.org/ddb.html describes the minimum info required in bug
reports. Insufficient info makes it difficult to find and fix bugs.
ddb> LOM event: +330d+14h1m38s host FAULT: watchdog triggered
trace
psycho_ue(48a3200, 5a, 2, 4a1e050, 4000fab5700, 127) at psycho_ue+0x7c
intr_handler(e0017ec8, 48a3300, 5facef, 800, e364, ) at intr_handler+0xc
sparc_interrupt(18960f8, 400090cf900, 1695200, 0, 0, 400091005a0) at sparc_interrupt+0x298
pool_put(18960f8, 400090cf900, 400090cf920, 0, 70197ff, 0) at pool_put+0x1dc
m_free(400090cf900, 9, 400090cf900, 0, 0, 1) at m_free+0x9c
m_freem(400090cf900, 4000fab5b78, 4000fab5b90, 0, 0, 0) at m_freem+0xc
sendit(0, 9, 0, 0, 4000fab5df0, 14b) at sendit+0x2bc
sys_sendto(40009177690, 4000fab5db0, 4000fab5df0, cb70bfd1cc, 0, 14b) at sys_sendto+0x68
syscall(4000fab5ed0, 485, cb70c04918, cb70c0491c, 0, 0) at syscall+0x34c
softtrap(9, fffdfdbc, 30, 0, fffdfe20, 10) at softtrap+0x19c
ddb> ps
   TID   PPID   PGRP    UID  S       FLAGS  WAIT       COMMAND
 18314   4143  18314      0  3        0x83  poll       systat
  4143  30172   4143      0  3    0x10008b  pause      ksh
 30172   2827  30172      0  3        0x92  select     sshd
 10502      1  10502      0  3    0x100083  ttyin      getty
 16663      1  16663      0  3    0x100098  poll       cron
 10518      1  10518    110  3    0x100090  poll       sndiod
 27012      1  27012     99  3    0x100090  poll       sndiod
  7526  31292  31292     95  3    0x100090  kqread     smtpd
 15286  31292  31292     95  3    0x100090  kqread     smtpd
  7378  31292  31292     95  3    0x100090  kqread     smtpd
   813  31292  31292     95  3    0x100090  kqread     smtpd
  7395  31292  31292     95  3    0x100090  kqread     smtpd
 13554  31292  31292    103  3    0x100090  kqread     smtpd
 31292      1  31292      0  3    0x100080  kqread     smtpd
  2827      1   2827      0  3        0x80  select     sshd
 24960  16805  20173     83  3    0x100090  poll       ntpd
*16805  20173  20173     83  7    0x100010             ntpd
 20173      1  20173      0  3    0x100080  poll       ntpd
 32115  28778  28778     74  3    0x100090  bpf        pflogd
 28778      1  28778      0  3        0x80  netio      pflogd
 25537  31067  31067     73  3    0x100090  kqread     syslogd
 31067      1  31067      0  3    0x100080  netio      syslogd
 32003      0      0      0  3     0x14200  pgzero     zerothread
  4880      0      0      0  3     0x14200  aiodoned   aiodoned
  2947      0      0      0  3     0x14200  syncer     update
  2812      0      0      0  3     0x14200  cleaner    cleaner
 24150      0      0      0  3     0x14200  reaper     reaper
 15162      0      0      0  3     0x14200  pgdaemon   pagedaemon
 24138      0      0      0  3     0x14200  bored      crypto
 22463      0      0      0  3     0x14200  pftm       pfpurge
 20852      0      0      0  3     0x14200  usbtsk     usbtask
 21476      0      0      0  3     0x14200  usbatsk    usbatsk
 11816      0      0      0  3     0x14200  bored      sensors
 16023      0      0      0  3     0x14200  bored      softnet
 22770      0      0      0  3     0x14200  bored      systqmp
 20377      0      0      0  3     0x14200  bored      systq
 16037      0      0      0  3  0x40014200             idle0
 10456      0      0      0  3     0x14200  kmalloc    kmthread
     1      0      1      0  3        0x82  wait       init
     0     -1      0      0  3     0x10200  scheduler  swapper
ddb>