Re: sparc64 reproducable panic on 5.9

2016-03-31 Thread Fred

On 03/31/16 09:27, Mark Kettenis wrote:

Date: Wed, 30 Mar 2016 09:10:04 +0200
From: Stefan Sperling 

On Wed, Mar 30, 2016 at 07:54:31AM +0100, Fred wrote:

This looks similar to the issues in this thread:

http://marc.info/?t=14346611501

I'm not sure a definitive solution was found - but having dc NICs was part
of the issue.


It's not driver dependent.

On my blade100 a similar crash is happening with ral(4).
http://marc.info/?l=openbsd-bugs&m=144802152407599&w=2


I made dc(4) print the DMA addresses and lengths of the segments it
puts on the tx ring (values in hex):

X 60095868/86
X 6009479a/42
X 60090768/8e
X 60095f9a/42
X 60095f9a/66

The crash happens immediately after that last one.  Note that

   0x60095f9a + 0x66 = 0x60096000

In other words, this is a segment that is aligned with the end of a
page.  In all likelihood the NIC's DMA engine is overfetching and runs
into the next page.  This page isn't mapped into the IOMMU which
triggers the fault.

In the past, the mbuf pool used in-page pool page headers.  This would
hide the issue since the pool page header would "misalign" the pool
items and induce some trailing unused space in the page.

Not sure how to solve this yet; I'll be looking at Solaris for
inspiration.  Fairly certain that ral(4) has a similar issue.



Am I right in thinking, then, that Steve's [1] approach of changing the
else if in subr_pool.c to:

} else if (256 > size) {

is just masking the issue?

[1] http://marc.info/?l=openbsd-bugs&m=144962985826087

Cheers

Fred



Re: sparc64 reproducable panic on 5.9

2016-03-31 Thread Mark Kettenis
> Date: Wed, 30 Mar 2016 09:10:04 +0200
> From: Stefan Sperling 
> 
> On Wed, Mar 30, 2016 at 07:54:31AM +0100, Fred wrote:
> > This looks similar to the issues in this thread:
> > 
> > http://marc.info/?t=14346611501
> > 
> > I'm not sure a definitive solution was found - but having dc NICs was part
> > of the issue.
> 
> It's not driver dependent.
> 
> On my blade100 a similar crash is happening with ral(4).
> http://marc.info/?l=openbsd-bugs&m=144802152407599&w=2

I made dc(4) print the DMA addresses and lengths of the segments it
puts on the tx ring (values in hex):

X 60095868/86
X 6009479a/42
X 60090768/8e
X 60095f9a/42
X 60095f9a/66

The crash happens immediately after that last one.  Note that

  0x60095f9a + 0x66 = 0x60096000

In other words, this is a segment that is aligned with the end of a
page.  In all likelihood the NIC's DMA engine is overfetching and runs
into the next page.  This page isn't mapped into the IOMMU which
triggers the fault.
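
To make the arithmetic above concrete, here is a minimal sketch of a check
for segments that end exactly on a page boundary.  It is not code from dc(4):
the helper name and the SKETCH_PAGE_SIZE constant are made up for
illustration, and the 8 KB page size is an assumption based on
OpenBSD/sparc64.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 8 KB pages, as on OpenBSD/sparc64. */
#define SKETCH_PAGE_SIZE 0x2000u

/*
 * Hypothetical helper: returns true when the segment [addr, addr + len)
 * ends exactly on a page boundary, so a device that reads past the end
 * of the segment would immediately touch the following page.
 */
static inline bool
seg_ends_at_page_end(uint64_t addr, uint64_t len)
{
	return len != 0 && ((addr + len) & (SKETCH_PAGE_SIZE - 1)) == 0;
}
```

Of the five segments listed, only the last one (0x60095f9a, length 0x66)
satisfies this check.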

In the past, the mbuf pool used in-page pool page headers.  This would
hide the issue since the pool page header would "misalign" the pool
items and induce some trailing unused space in the page.
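
The effect of the in-page header can be illustrated with a small sketch.
The function name is hypothetical, and the numbers in the note below
(8192-byte pages, 256-byte mbufs, a 64-byte header) are assumptions chosen
for illustration; the real header size depends on the kernel and
architecture.

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of bytes between the end of the last pool
 * item and the end of the page, when hdr_size bytes of the page are
 * reserved for an in-page header (hdr_size == 0 models an off-page
 * header).
 */
static inline size_t
gap_after_last_item(size_t page_size, size_t item_size, size_t hdr_size)
{
	size_t items = (page_size - hdr_size) / item_size;

	return page_size - items * item_size;
}
```

Under those assumptions, gap_after_last_item(8192, 256, 0) is 0, so the last
item ends exactly at the page end, while gap_after_last_item(8192, 256, 64)
is 256, which leaves room for an overfetch to stay within the same page.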

Not sure how to solve this yet; I'll be looking at Solaris for
inspiration.  Fairly certain that ral(4) has a similar issue.



Re: sparc64 reproducable panic on 5.9

2016-03-30 Thread Mark Kettenis
> Date: Wed, 30 Mar 2016 15:01:51 +0200
> From: Matthieu Herrb 
> 
> On Wed, Mar 30, 2016 at 07:54:31AM +0100, Fred wrote:
> > On 03/30/16 06:52, Matthieu Herrb wrote:
> > >Hi,
> > >
> > >I upgraded yesterday a SunFire V100 which serves as an ntpd server. It
> > >was running 5.5 before and had 330 days of uptime (which seems to
> > >eliminate hw failure).
> > >
> > >Running 5.9 the box crashed twice already each time after ca 4 hours
> > >of uptime with the same panic. Details below.
> > >
> > >Any idea/patch/things to look for under ddb next time it happens ?
> > 
> > This looks similar to the issues in this thread:
> > 
> > http://marc.info/?t=14346611501
> > 
> > I'm not sure a definitive solution was found - but having dc NICs
> > was part of the issue.
> 
> I can confirm that the patch from
> http://marc.info/?l=openbsd-bugs&m=144833598011206&w=2 (reproduced
> below) seems to fix the issue for my machine. 
> 
> Without it it's impossible to scp a new /bsd to the machine (I needed
> to use ftp to transfer the kernel to it). With it I've been able to
> scp /bsd several times. 
> 
> If someone (dlg?, kettenis?) is interested I can setup a serial
> console access to a similar machine with the same problem under
> -current.

If you could hook me up with LOM access, that would be great.  It
frustrates me quite a bit that these machines don't work anymore...



Re: sparc64 reproducable panic on 5.9

2016-03-30 Thread Matthieu Herrb
On Wed, Mar 30, 2016 at 07:54:31AM +0100, Fred wrote:
> On 03/30/16 06:52, Matthieu Herrb wrote:
> >Hi,
> >
> >I upgraded yesterday a SunFire V100 which serves as an ntpd server. It
> >was running 5.5 before and had 330 days of uptime (which seems to
> >eliminate hw failure).
> >
> >Running 5.9 the box crashed twice already each time after ca 4 hours
> >of uptime with the same panic. Details below.
> >
> >Any idea/patch/things to look for under ddb next time it happens ?
> 
> This looks similar to the issues in this thread:
> 
> http://marc.info/?t=14346611501
> 
> I'm not sure a definitive solution was found - but having dc NICs
> was part of the issue.

I can confirm that the patch from
http://marc.info/?l=openbsd-bugs&m=144833598011206&w=2 (reproduced
below) seems to fix the issue for my machine. 

Without it it's impossible to scp a new /bsd to the machine (I needed
to use ftp to transfer the kernel to it). With it I've been able to
scp /bsd several times. 

If someone (dlg?, kettenis?) is interested I can setup a serial
console access to a similar machine with the same problem under
-current.

Index: subr_pool.c
===
RCS file: /cvs/src/sys/kern/subr_pool.c,v
retrieving revision 1.194
diff -u -r1.194 subr_pool.c
--- subr_pool.c 15 Jan 2016 11:21:58 -  1.194
+++ subr_pool.c 30 Mar 2016 11:42:42 -
@@ -258,7 +258,7 @@
 	 */
 	if (pgsize - (size * items) > sizeof(struct pool_item_header)) {
 		off = pgsize - sizeof(struct pool_item_header);
-	} else if (sizeof(struct pool_item_header) * 2 >= size) {
+	} else if (sizeof(struct pool_item_header) * 8 >= size) {
 		off = pgsize - sizeof(struct pool_item_header);
 		items = off / size;
 	}
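
For readers who want to see what the changed multiplier does, here is a
rough model of the test in the hunk above.  It is a sketch, not the kernel
code: the function name is invented, hdr_size stands in for
sizeof(struct pool_item_header), and it assumes that items starts out as
pgsize / size, which the hunk does not show.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the condition in the diff: returns true when
 * either branch would place the header inside the pool page.
 * mult is 2 in revision 1.194 and 8 with the patch applied.
 */
static inline bool
header_in_page(size_t pgsize, size_t size, size_t hdr_size, size_t mult)
{
	size_t items = pgsize / size;	/* assumption: initial item count */

	if (pgsize - (size * items) > hdr_size)
		return true;
	return hdr_size * mult >= size;
}
```

With an assumed 64-byte header, 8192-byte pages and 256-byte items,
header_in_page(8192, 256, 64, 2) is false and
header_in_page(8192, 256, 64, 8) is true, which would bring back the
in-page header for the mbuf pool.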

-- 
Matthieu Herrb



Re: sparc64 reproducable panic on 5.9

2016-03-30 Thread Stefan Sperling
On Wed, Mar 30, 2016 at 07:54:31AM +0100, Fred wrote:
> This looks similar to the issues in this thread:
> 
> http://marc.info/?t=14346611501
> 
> I'm not sure a definitive solution was found - but having dc NICs was part
> of the issue.

It's not driver dependent.

On my blade100 a similar crash is happening with ral(4).
http://marc.info/?l=openbsd-bugs&m=144802152407599&w=2



Re: sparc64 reproducable panic on 5.9

2016-03-30 Thread Fred

On 03/30/16 06:52, Matthieu Herrb wrote:

Hi,

I upgraded yesterday a SunFire V100 which serves as an ntpd server. It
was running 5.5 before and had 330 days of uptime (which seems to
eliminate hw failure).

Running 5.9 the box crashed twice already each time after ca 4 hours
of uptime with the same panic. Details below.

Any idea/patch/things to look for under ddb next time it happens ?


This looks similar to the issues in this thread:

http://marc.info/?t=14346611501

I'm not sure a definitive solution was found - but having dc NICs was
part of the issue.


hth

Fred




sparc64 reproducable panic on 5.9

2016-03-29 Thread Matthieu Herrb
Hi,

I upgraded yesterday a SunFire V100 which serves as an ntpd server. It
was running 5.5 before and had 330 days of uptime (which seems to
eliminate hw failure).

Running 5.9 the box crashed twice already each time after ca 4 hours
of uptime with the same panic. Details below.

Any idea/patch/things to look for under ddb next time it happens ?

panic: psycho0: uncorrectable DMA error AFAR 6e866250 (pa=0 tte=0/69270012) AFSR 41ff4080
Stopped at  Debugger+0x8:   nop
   TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
*16805  16805     83    0x100010          0    0  ntpd
psycho_ue(48a3200, 5a, 2, 4a1e050, 4000fab5700, 127) at psycho_ue+0x7c
intr_handler(e0017ec8, 48a3300, 5facef, 800, e364, ) at intr_handler+0xc
sparc_interrupt(18960f8, 400090cf900, 1695200, 0, 0, 400091005a0) at sparc_interrupt+0x298
pool_put(18960f8, 400090cf900, 400090cf920, 0, 70197ff, 0) at pool_put+0x1dc
m_free(400090cf900, 9, 400090cf900, 0, 0, 1) at m_free+0x9c
m_freem(400090cf900, 4000fab5b78, 4000fab5b90, 0, 0, 0) at m_freem+0xc
sendit(0, 9, 0, 0, 4000fab5df0, 14b) at sendit+0x2bc
sys_sendto(40009177690, 4000fab5db0, 4000fab5df0, cb70bfd1cc, 0, 14b) at sys_sendto+0x68
syscall(4000fab5ed0, 485, cb70c04918, cb70c0491c, 0, 0) at syscall+0x34c
softtrap(9, fffdfdbc, 30, 0, fffdfe20, 10) at softtrap+0x19c
http://www.openbsd.org/ddb.html describes the minimum info required in bug
reports.  Insufficient info makes it difficult to find and fix bugs.
ddb>
LOM event: +330d+14h1m38s host FAULT: watchdog triggered
trace
psycho_ue(48a3200, 5a, 2, 4a1e050, 4000fab5700, 127) at psycho_ue+0x7c
intr_handler(e0017ec8, 48a3300, 5facef, 800, e364, ) at intr_handler+0xc
sparc_interrupt(18960f8, 400090cf900, 1695200, 0, 0, 400091005a0) at sparc_interrupt+0x298
pool_put(18960f8, 400090cf900, 400090cf920, 0, 70197ff, 0) at pool_put+0x1dc
m_free(400090cf900, 9, 400090cf900, 0, 0, 1) at m_free+0x9c
m_freem(400090cf900, 4000fab5b78, 4000fab5b90, 0, 0, 0) at m_freem+0xc
sendit(0, 9, 0, 0, 4000fab5df0, 14b) at sendit+0x2bc
sys_sendto(40009177690, 4000fab5db0, 4000fab5df0, cb70bfd1cc, 0, 14b) at sys_sendto+0x68
syscall(4000fab5ed0, 485, cb70c04918, cb70c0491c, 0, 0) at syscall+0x34c
softtrap(9, fffdfdbc, 30, 0, fffdfe20, 10) at softtrap+0x19c
ddb> ps
   TID   PPID   PGRP    UID  S       FLAGS  WAIT      COMMAND
 18314   4143  18314      0  3        0x83  poll      systat
  4143  30172   4143      0  3    0x10008b  pause     ksh
 30172   2827  30172      0  3        0x92  select    sshd
 10502      1  10502      0  3    0x100083  ttyin     getty
 16663      1  16663      0  3    0x100098  poll      cron
 10518      1  10518    110  3    0x100090  poll      sndiod
 27012      1  27012     99  3    0x100090  poll      sndiod
  7526  31292  31292     95  3    0x100090  kqread    smtpd
 15286  31292  31292     95  3    0x100090  kqread    smtpd
  7378  31292  31292     95  3    0x100090  kqread    smtpd
   813  31292  31292     95  3    0x100090  kqread    smtpd
  7395  31292  31292     95  3    0x100090  kqread    smtpd
 13554  31292  31292    103  3    0x100090  kqread    smtpd
 31292      1  31292      0  3    0x100080  kqread    smtpd
  2827      1   2827      0  3        0x80  select    sshd
 24960  16805  20173     83  3    0x100090  poll      ntpd
*16805  20173  20173     83  7    0x100010            ntpd
 20173      1  20173      0  3    0x100080  poll      ntpd
 32115  28778  28778     74  3    0x100090  bpf       pflogd
 28778      1  28778      0  3        0x80  netio     pflogd
 25537  31067  31067     73  3    0x100090  kqread    syslogd
 31067      1  31067      0  3    0x100080  netio     syslogd
 32003      0      0      0  3     0x14200  pgzero    zerothread
  4880      0      0      0  3     0x14200  aiodoned  aiodoned
  2947      0      0      0  3     0x14200  syncer    update
  2812      0      0      0  3     0x14200  cleaner   cleaner
 24150      0      0      0  3     0x14200  reaper    reaper
 15162      0      0      0  3     0x14200  pgdaemon  pagedaemon
 24138      0      0      0  3     0x14200  bored     crypto
 22463      0      0      0  3     0x14200  pftm      pfpurge
 20852      0      0      0  3     0x14200  usbtsk    usbtask
 21476      0      0      0  3     0x14200  usbatsk   usbatsk
 11816      0      0      0  3     0x14200  bored     sensors
 16023      0      0      0  3     0x14200  bored     softnet
 22770      0      0      0  3     0x14200  bored     systqmp
 20377      0      0      0  3     0x14200  bored     systq
 16037      0      0      0  3  0x40014200            idle0
 10456      0      0      0  3     0x14200  kmalloc   kmthread
     1      0      1      0  3        0x82  wait      init
     0     -1      0      0  3     0x10200  scheduler swapper
ddb>