Re: re0 and re1 watchdog timeouts, and system freeze

2017-06-12 Thread Björn Ketelaars
On Mon 12/06/2017 15:11, David Gwynne wrote:
> On Fri, Jun 09, 2017 at 07:19:34PM +0200, Björn Ketelaars wrote:
> > On Fri 09/06/2017 12:07, Martin Pieuchot wrote:
> > > On 08/06/17(Thu) 20:38, Björn Ketelaars wrote:
> > > > On Thu 08/06/2017 16:55, Martin Pieuchot wrote:
> > > > > On 07/06/17(Wed) 09:43, Björn Ketelaars wrote:
> > > > > > On Sat 03/06/2017 08:44, Björn Ketelaars wrote:
> > > > > > > 
> > > > > > > Reverting back to the previous kernel fixed the issue above. 
> > > > > > > Question: can
> > > > > > > someone give a hint on how to track this issue?
> > > > > > 
> > > > > > After a bit of experimenting I'm able to reproduce the problem. 
> > > > > > Summary is
> > > > > > that queueing in pf and use of a current (after May 30), multi 
> > > > > > processor
> > > > > > kernel (bsd.mp from snapshots) causes these specific watchdog 
> > > > > > timeouts
> > > > > > followed by a system freeze.
> > > > > > 
> > > > > > Issue is 'gone' when:
> > > > > > 1.) using an older kernel (before May 30);
> > > > > > 2.) removal of queueing statements from pf.conf. Included below the 
> > > > > > specific
> > > > > > snippet;
> > > > > > 3.) switch from MP kernel to SP kernel.
> > > > > > 
> > > > > > New observation is that while queueing, using a MP kernel, the 
> > > > > > download
> > > > > > bandwidth is only a fraction of what is expected. Exchanging the MP 
> > > > > > kernel
> > > > > > with a SP kernel restores the download bandwidth to expected level.
> > > > > > 
> > > > > > I'm guessing that this issue is related to recent work on PF?
> > > > > 
> > > > > It's certainly a problem in, or exposed by, re(4) with the recent MP 
> > > > > work
> > > > > in the network stack.
> > > > > 
> > > > > It would help if you could build a kernel with MP_LOCKDEBUG defined 
> > > > > and
> > > > > see if the resulting kernel enters ddb(4) instead of freezing.
> > > > > 
> > > > > Thanks,
> > > > > Martin
> > > > 
> > > > Thanks for the hint! It helped in entering ddb. I collected a bit of 
> > > > output,
> > > > which you can find below. If I read the trace correctly the crash is 
> > > > related
> > > > to line 1750 of sys/dev/ic/re.c:
> > > > 
> > > > d->rl_cmdstat |= htole32(RL_TDESC_CMD_EOF);
> > > 
> > > Could you test the diff below, always with a MP_LOCKDEBUG kernel and
> > > tell us if you can reproduce the freeze or if the kernel enters ddb(4)?
> > > 
> > > Another question, how often do you see "watchdog timeout" messages?
> > > 
> > > Index: re.c
> > > ===
> > > RCS file: /cvs/src/sys/dev/ic/re.c,v
> > > retrieving revision 1.201
> > > diff -u -p -r1.201 re.c
> > > --- re.c  24 Jan 2017 03:57:34 -  1.201
> > > +++ re.c  9 Jun 2017 10:04:43 -
> > > @@ -2074,9 +2074,6 @@ re_watchdog(struct ifnet *ifp)
> > >   s = splnet();
> > >   printf("%s: watchdog timeout\n", sc->sc_dev.dv_xname);
> > >  
> > > - re_txeof(sc);
> > > - re_rxeof(sc);
> > > -
> > >   re_init(ifp);
> > >  
> > >   splx(s);
> > 
> > The diff (with a MP_LOCKDEBUG kernel) resulted in similar traces as before.
> > ddb output is included below.
> > 
> > With your diff the number of timeout messages decreased from 9 to 2 before
> > entering ddb.
> 
> can you try the diff below please?
> 
> Index: hfsc.c
> ===
> RCS file: /cvs/src/sys/net/hfsc.c,v
> retrieving revision 1.39
> diff -u -p -r1.39 hfsc.c
> --- hfsc.c8 May 2017 11:30:53 -   1.39
> +++ hfsc.c12 Jun 2017 05:08:01 -
> @@ -817,7 +817,7 @@ hfsc_deferred(void *arg)
>   KASSERT(HFSC_ENABLED(ifq));
>  
>   if (!ifq_empty(ifq))
> - (*ifp->if_qstart)(ifq);
> + ifq_start(ifq);
>  
>   hif = ifq->ifq_q;

Your diff fixes my issue.

Previously I was able to reproduce the problem within a minute. I have been
running a kernel with your patch for the last three hours without a hitch. No
more re0/re1 watchdog timeouts, no more bandwidth issues while using queueing
and no more ddb.

Thanks!



Re: re0 and re1 watchdog timeouts, and system freeze

2017-06-12 Thread Mike Belopuhov
On Mon, Jun 12, 2017 at 15:11 +1000, David Gwynne wrote:
> On Fri, Jun 09, 2017 at 07:19:34PM +0200, Björn Ketelaars wrote:
> > On Fri 09/06/2017 12:07, Martin Pieuchot wrote:
> > > On 08/06/17(Thu) 20:38, Björn Ketelaars wrote:
> > > > On Thu 08/06/2017 16:55, Martin Pieuchot wrote:
> > > > > On 07/06/17(Wed) 09:43, Björn Ketelaars wrote:
> > > > > > On Sat 03/06/2017 08:44, Björn Ketelaars wrote:
> > > > > > > 
> > > > > > > Reverting back to the previous kernel fixed the issue above. 
> > > > > > > Question: can
> > > > > > > someone give a hint on how to track this issue?
> > > > > > 
> > > > > > After a bit of experimenting I'm able to reproduce the problem. 
> > > > > > Summary is
> > > > > > that queueing in pf and use of a current (after May 30), multi 
> > > > > > processor
> > > > > > kernel (bsd.mp from snapshots) causes these specific watchdog 
> > > > > > timeouts
> > > > > > followed by a system freeze.
> > > > > > 
> > > > > > Issue is 'gone' when:
> > > > > > 1.) using an older kernel (before May 30);
> > > > > > 2.) removal of queueing statements from pf.conf. Included below the 
> > > > > > specific
> > > > > > snippet;
> > > > > > 3.) switch from MP kernel to SP kernel.
> > > > > > 
> > > > > > New observation is that while queueing, using a MP kernel, the 
> > > > > > download
> > > > > > bandwidth is only a fraction of what is expected. Exchanging the MP 
> > > > > > kernel
> > > > > > with a SP kernel restores the download bandwidth to expected level.
> > > > > > 
> > > > > > I'm guessing that this issue is related to recent work on PF?
> > > > > 
> > > > > It's certainly a problem in, or exposed by, re(4) with the recent MP 
> > > > > work
> > > > > in the network stack.
> > > > > 
> > > > > It would help if you could build a kernel with MP_LOCKDEBUG defined 
> > > > > and
> > > > > see if the resulting kernel enters ddb(4) instead of freezing.
> > > > > 
> > > > > Thanks,
> > > > > Martin
> > > > 
> > > > Thanks for the hint! It helped in entering ddb. I collected a bit of 
> > > > output,
> > > > which you can find below. If I read the trace correctly the crash is 
> > > > related
> > > > to line 1750 of sys/dev/ic/re.c:
> > > > 
> > > > d->rl_cmdstat |= htole32(RL_TDESC_CMD_EOF);
> > > 
> > > Could you test the diff below, always with a MP_LOCKDEBUG kernel and
> > > tell us if you can reproduce the freeze or if the kernel enters ddb(4)?
> > > 
> > > Another question, how often do you see "watchdog timeout" messages?
> > > 
> > > Index: re.c
> > > ===
> > > RCS file: /cvs/src/sys/dev/ic/re.c,v
> > > retrieving revision 1.201
> > > diff -u -p -r1.201 re.c
> > > --- re.c  24 Jan 2017 03:57:34 -  1.201
> > > +++ re.c  9 Jun 2017 10:04:43 -
> > > @@ -2074,9 +2074,6 @@ re_watchdog(struct ifnet *ifp)
> > >   s = splnet();
> > >   printf("%s: watchdog timeout\n", sc->sc_dev.dv_xname);
> > >  
> > > - re_txeof(sc);
> > > - re_rxeof(sc);
> > > -
> > >   re_init(ifp);
> > >  
> > >   splx(s);
> > 
> > The diff (with a MP_LOCKDEBUG kernel) resulted in similar traces as before.
> > ddb output is included below.
> > 
> > With your diff the number of timeout messages decreased from 9 to 2 before
> > entering ddb.
> 
> can you try the diff below please?
>

Oh my, unserialized start.  Someone didn't finish the conversion (-;
OK mikeb

> Index: hfsc.c
> ===
> RCS file: /cvs/src/sys/net/hfsc.c,v
> retrieving revision 1.39
> diff -u -p -r1.39 hfsc.c
> --- hfsc.c8 May 2017 11:30:53 -   1.39
> +++ hfsc.c12 Jun 2017 05:08:01 -
> @@ -817,7 +817,7 @@ hfsc_deferred(void *arg)
>   KASSERT(HFSC_ENABLED(ifq));
>  
>   if (!ifq_empty(ifq))
> - (*ifp->if_qstart)(ifq);
> + ifq_start(ifq);
>  
>   hif = ifq->ifq_q;
>  
> 
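
To make the "unserialized start" remark above concrete: hfsc_deferred() runs asynchronously and, by calling (*ifp->if_qstart)(ifq) directly, invokes the driver's start routine outside the ifqueue serializer, so it can run at the same time as the start call that arrives through if_enqueue() -> ifq_serialize() (the exact path visible in the ddb traces). Two CPUs then race inside re_start()/re_encap() on the same TX descriptor ring. ifq_start() goes through the serializer instead. The following is only an illustrative sketch of the single-runner idea, with hypothetical names; it is not the actual OpenBSD ifq code:

    /*
     * Illustrative sketch only, not the real ifq implementation: at most
     * one CPU runs the driver's start routine at a time; a CPU that finds
     * it busy just records that more work is pending and leaves.
     */
    struct start_serializer {
    	struct mutex	lock;		/* protects the two flags below */
    	int		running;	/* some CPU is inside qstart() */
    	int		pending;	/* more work arrived meanwhile */
    };

    void
    serialized_start(struct start_serializer *s, struct ifqueue *ifq,
        void (*qstart)(struct ifqueue *))
    {
    	mtx_enter(&s->lock);
    	if (s->running) {
    		s->pending = 1;		/* the running CPU will loop again */
    		mtx_leave(&s->lock);
    		return;
    	}
    	s->running = 1;
    	do {
    		s->pending = 0;
    		mtx_leave(&s->lock);
    		(*qstart)(ifq);		/* never more than one CPU in here */
    		mtx_enter(&s->lock);
    	} while (s->pending);
    	s->running = 0;
    	mtx_leave(&s->lock);
    }

Under this discipline the deferred start either runs the start routine alone or merely flags more work for the CPU already inside it, which is the behaviour the one-line hfsc.c change restores by using ifq_start().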



Re: re0 and re1 watchdog timeouts, and system freeze

2017-06-11 Thread David Gwynne
On Fri, Jun 09, 2017 at 07:19:34PM +0200, Björn Ketelaars wrote:
> On Fri 09/06/2017 12:07, Martin Pieuchot wrote:
> > On 08/06/17(Thu) 20:38, Björn Ketelaars wrote:
> > > On Thu 08/06/2017 16:55, Martin Pieuchot wrote:
> > > > On 07/06/17(Wed) 09:43, Björn Ketelaars wrote:
> > > > > On Sat 03/06/2017 08:44, Björn Ketelaars wrote:
> > > > > > 
> > > > > > Reverting back to the previous kernel fixed the issue above. 
> > > > > > Question: can
> > > > > > someone give a hint on how to track this issue?
> > > > > 
> > > > > After a bit of experimenting I'm able to reproduce the problem. 
> > > > > Summary is
> > > > > that queueing in pf and use of a current (after May 30), multi 
> > > > > processor
> > > > > kernel (bsd.mp from snapshots) causes these specific watchdog timeouts
> > > > > followed by a system freeze.
> > > > > 
> > > > > Issue is 'gone' when:
> > > > > 1.) using an older kernel (before May 30);
> > > > > 2.) removal of queueing statements from pf.conf. Included below the 
> > > > > specific
> > > > > snippet;
> > > > > 3.) switch from MP kernel to SP kernel.
> > > > > 
> > > > > New observation is that while queueing, using a MP kernel, the 
> > > > > download
> > > > > bandwidth is only a fraction of what is expected. Exchanging the MP 
> > > > > kernel
> > > > > with a SP kernel restores the download bandwidth to expected level.
> > > > > 
> > > > > I'm guessing that this issue is related to recent work on PF?
> > > > 
> > > > It's certainly a problem in, or exposed by, re(4) with the recent MP 
> > > > work
> > > > in the network stack.
> > > > 
> > > > It would help if you could build a kernel with MP_LOCKDEBUG defined and
> > > > see if the resulting kernel enters ddb(4) instead of freezing.
> > > > 
> > > > Thanks,
> > > > Martin
> > > 
> > > Thanks for the hint! It helped in entering ddb. I collected a bit of 
> > > output,
> > > which you can find below. If I read the trace correctly the crash is 
> > > related
> > > to line 1750 of sys/dev/ic/re.c:
> > > 
> > >   d->rl_cmdstat |= htole32(RL_TDESC_CMD_EOF);
> > 
> > Could you test the diff below, always with a MP_LOCKDEBUG kernel and
> > tell us if you can reproduce the freeze or if the kernel enters ddb(4)?
> > 
> > Another question, how often do you see "watchdog timeout" messages?
> > 
> > Index: re.c
> > ===
> > RCS file: /cvs/src/sys/dev/ic/re.c,v
> > retrieving revision 1.201
> > diff -u -p -r1.201 re.c
> > --- re.c24 Jan 2017 03:57:34 -  1.201
> > +++ re.c9 Jun 2017 10:04:43 -
> > @@ -2074,9 +2074,6 @@ re_watchdog(struct ifnet *ifp)
> > s = splnet();
> > printf("%s: watchdog timeout\n", sc->sc_dev.dv_xname);
> >  
> > -   re_txeof(sc);
> > -   re_rxeof(sc);
> > -
> > re_init(ifp);
> >  
> > splx(s);
> 
> The diff (with a MP_LOCKDEBUG kernel) resulted in similar traces as before.
> ddb output is included below.
> 
> With your diff the number of timeout messages decreased from 9 to 2 before
> entering ddb.

can you try the diff below please?

Index: hfsc.c
===
RCS file: /cvs/src/sys/net/hfsc.c,v
retrieving revision 1.39
diff -u -p -r1.39 hfsc.c
--- hfsc.c  8 May 2017 11:30:53 -   1.39
+++ hfsc.c  12 Jun 2017 05:08:01 -
@@ -817,7 +817,7 @@ hfsc_deferred(void *arg)
KASSERT(HFSC_ENABLED(ifq));
 
if (!ifq_empty(ifq))
-   (*ifp->if_qstart)(ifq);
+   ifq_start(ifq);
 
hif = ifq->ifq_q;
 



Re: re0 and re1 watchdog timeouts, and system freeze

2017-06-09 Thread Björn Ketelaars
On Fri 09/06/2017 12:07, Martin Pieuchot wrote:
> On 08/06/17(Thu) 20:38, Björn Ketelaars wrote:
> > On Thu 08/06/2017 16:55, Martin Pieuchot wrote:
> > > On 07/06/17(Wed) 09:43, Björn Ketelaars wrote:
> > > > On Sat 03/06/2017 08:44, Björn Ketelaars wrote:
> > > > > 
> > > > > Reverting back to the previous kernel fixed the issue above. 
> > > > > Question: can
> > > > > someone give a hint on how to track this issue?
> > > > 
> > > > After a bit of experimenting I'm able to reproduce the problem. Summary 
> > > > is
> > > > that queueing in pf and use of a current (after May 30), multi processor
> > > > kernel (bsd.mp from snapshots) causes these specific watchdog timeouts
> > > > followed by a system freeze.
> > > > 
> > > > Issue is 'gone' when:
> > > > 1.) using an older kernel (before May 30);
> > > > 2.) removal of queueing statements from pf.conf. Included below the 
> > > > specific
> > > > snippet;
> > > > 3.) switch from MP kernel to SP kernel.
> > > > 
> > > > New observation is that while queueing, using a MP kernel, the download
> > > > bandwidth is only a fraction of what is expected. Exchanging the MP 
> > > > kernel
> > > > with a SP kernel restores the download bandwidth to expected level.
> > > > 
> > > > I'm guessing that this issue is related to recent work on PF?
> > > 
> > > It's certainly a problem in, or exposed by, re(4) with the recent MP work
> > > in the network stack.
> > > 
> > > It would help if you could build a kernel with MP_LOCKDEBUG defined and
> > > see if the resulting kernel enters ddb(4) instead of freezing.
> > > 
> > > Thanks,
> > > Martin
> > 
> > Thanks for the hint! It helped in entering ddb. I collected a bit of output,
> > which you can find below. If I read the trace correctly the crash is related
> > to line 1750 of sys/dev/ic/re.c:
> > 
> > d->rl_cmdstat |= htole32(RL_TDESC_CMD_EOF);
> 
> Could you test the diff below, always with a MP_LOCKDEBUG kernel and
> tell us if you can reproduce the freeze or if the kernel enters ddb(4)?
> 
> Another question, how often do you see "watchdog timeout" messages?
> 
> Index: re.c
> ===
> RCS file: /cvs/src/sys/dev/ic/re.c,v
> retrieving revision 1.201
> diff -u -p -r1.201 re.c
> --- re.c  24 Jan 2017 03:57:34 -  1.201
> +++ re.c  9 Jun 2017 10:04:43 -
> @@ -2074,9 +2074,6 @@ re_watchdog(struct ifnet *ifp)
>   s = splnet();
>   printf("%s: watchdog timeout\n", sc->sc_dev.dv_xname);
>  
> - re_txeof(sc);
> - re_rxeof(sc);
> -
>   re_init(ifp);
>  
>   splx(s);

The diff (with a MP_LOCKDEBUG kernel) resulted in similar traces as before.
ddb output is included below.

With your diff the number of timeout messages decreased from 9 to 2 before
entering ddb.


ddb{1}> show panic
the kernel did not panic

ddb{1}> machine ddbcpu 0
Stopped at  db_enter+0x9:   leave

ddb{0}> trace
db_enter(8196b640,200,804d1600,10,8000220e9938,282) at db_e
nter+0x9
x86_ipi_handler(80074010,819e7160,102860,c,4,1813a9522) at x86_
ipi_handler+0x85
Xresume_lapic_ipi() at Xresume_lapic_ipi+0x1c
--- interrupt ---
___mp_lock(819e7160,8187acf0,1,1,8000220a38c0,8000220a7
ab0) at ___mp_lock+0x4a
___mp_acquire_count(819e7160,1,8000220a38c0,8000220a7ab0,ff
011da1e1c8,81019942) at ___mp_acquire_count+0x33
mi_switch(ff011b4a93c0,818bf7e7,9cd,8000220e9bb0,8000220a7a
b0,8000220e9ba0) at mi_switch+0x22b
sleep_finish(8000220e9bb0,1,118,8000220e9bb0,ff011b4a93c0,8
18bf7e7) at sleep_finish+0xc2
tsleep(ff011b4a93c0,118,818bf7e7,9cd,0,40) at tsleep+0x164
kqueue_scan(ff011b4a93c0,40,143d48714800,8000220e9dd0,8000220a7ab0,
8000220e9de4) at kqueue_scan+0x15b
sys_kevent(8000220a7ab0,8000220e9e60,8000220e9eb0,8000220a7ab0,
48,1) at sys_kevent+0x2f6
syscall() at syscall+0x29f
--- syscall (number 72) ---
end of kernel
end trace frame: 0x143d48714800, count: -11
0x143c524f280a:

ddb{0}> machine ddbcpu 1
Stopped at  re_encap+0x24d: movl    0(%r15),%eax

ddb{1}> trace
re_encap(80084000,dd,ff00d6e03f00,800842c0,400,8008
4000) at re_encap+0x24d
re_start(800842c0,7,ff00d6dd7200,80084338,800842c0,
81091d70) at re_start+0x8c
ifq_serialize(800842c0,80084360,80084090,ff00d6a515
00,800842c0,0) at ifq_serialize+0xf8
if_enqueue(80084090,ff00d6a51500,8001dc80,ff00d6a51500,
37,2) at if_enqueue+0x90
ether_output(80084090,ff00d6dd7200,8001dc80,ff011b3f49a
0,80933400,8001dc80) at ether_output+0x1d6
ip_output(ff00d6dd7200,0,800022032cc0,1,0,0) at ip_output+0x8a1
ip_forward(ff00d6dd7200,802a4090,0,0,8244c78a,8244c78a) at ip_forwa
rd+0x1e7
ipv4_input(802a4090,ff00d6dd7200,800,800,ff0009d00200,8
02

Re: re0 and re1 watchdog timeouts, and system freeze

2017-06-09 Thread Martin Pieuchot
On 08/06/17(Thu) 20:38, Björn Ketelaars wrote:
> On Thu 08/06/2017 16:55, Martin Pieuchot wrote:
> > On 07/06/17(Wed) 09:43, Björn Ketelaars wrote:
> > > On Sat 03/06/2017 08:44, Björn Ketelaars wrote:
> > > > 
> > > > Reverting back to the previous kernel fixed the issue above. Question: 
> > > > can
> > > > someone give a hint on how to track this issue?
> > > 
> > > After a bit of experimenting I'm able to reproduce the problem. Summary is
> > > that queueing in pf and use of a current (after May 30), multi processor
> > > kernel (bsd.mp from snapshots) causes these specific watchdog timeouts
> > > followed by a system freeze.
> > > 
> > > Issue is 'gone' when:
> > > 1.) using an older kernel (before May 30);
> > > 2.) removal of queueing statements from pf.conf. Included below the 
> > > specific
> > > snippet;
> > > 3.) switch from MP kernel to SP kernel.
> > > 
> > > New observation is that while queueing, using a MP kernel, the download
> > > bandwidth is only a fraction of what is expected. Exchanging the MP kernel
> > > with a SP kernel restores the download bandwidth to expected level.
> > > 
> > > I'm guessing that this issue is related to recent work on PF?
> > 
> > It's certainly a problem in, or exposed by, re(4) with the recent MP work
> > in the network stack.
> > 
> > It would help if you could build a kernel with MP_LOCKDEBUG defined and
> > see if the resulting kernel enters ddb(4) instead of freezing.
> > 
> > Thanks,
> > Martin
> 
> Thanks for the hint! It helped in entering ddb. I collected a bit of output,
> which you can find below. If I read the trace correctly the crash is related
> to line 1750 of sys/dev/ic/re.c:
> 
>   d->rl_cmdstat |= htole32(RL_TDESC_CMD_EOF);

Could you test the diff below, always with a MP_LOCKDEBUG kernel and
tell us if you can reproduce the freeze or if the kernel enters ddb(4)?

Another question, how often do you see "watchdog timeout" messages?

Index: re.c
===
RCS file: /cvs/src/sys/dev/ic/re.c,v
retrieving revision 1.201
diff -u -p -r1.201 re.c
--- re.c24 Jan 2017 03:57:34 -  1.201
+++ re.c9 Jun 2017 10:04:43 -
@@ -2074,9 +2074,6 @@ re_watchdog(struct ifnet *ifp)
s = splnet();
printf("%s: watchdog timeout\n", sc->sc_dev.dv_xname);
 
-   re_txeof(sc);
-   re_rxeof(sc);
-
re_init(ifp);
 
splx(s);
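
For reference, with this hunk applied re_watchdog() only reports the timeout and reinitialises the chip, presumably so the watchdog no longer touches the TX/RX completion paths (re_txeof()/re_rxeof()) from yet another context. A minimal sketch of the resulting function, reconstructed from the diff context (the softc lookup line is assumed, since it is not visible in the hunk):

    void
    re_watchdog(struct ifnet *ifp)
    {
    	struct rl_softc *sc = ifp->if_softc;	/* assumed; not shown in the hunk */
    	int s;

    	s = splnet();
    	printf("%s: watchdog timeout\n", sc->sc_dev.dv_xname);
    	re_init(ifp);
    	splx(s);
    }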



Re: re0 and re1 watchdog timeouts, and system freeze

2017-06-08 Thread Björn Ketelaars
On Thu 08/06/2017 16:55, Martin Pieuchot wrote:
> On 07/06/17(Wed) 09:43, Björn Ketelaars wrote:
> > On Sat 03/06/2017 08:44, Björn Ketelaars wrote:
> > > 
> > > Reverting back to the previous kernel fixed the issue above. Question: can
> > > someone give a hint on how to track this issue?
> > 
> > After a bit of experimenting I'm able to reproduce the problem. Summary is
> > that queueing in pf and use of a current (after May 30), multi processor
> > kernel (bsd.mp from snapshots) causes these specific watchdog timeouts
> > followed by a system freeze.
> > 
> > Issue is 'gone' when:
> > 1.) using an older kernel (before May 30);
> > 2.) removal of queueing statements from pf.conf. Included below the specific
> > snippet;
> > 3.) switch from MP kernel to SP kernel.
> > 
> > New observation is that while queueing, using a MP kernel, the download
> > bandwidth is only a fraction of what is expected. Exchanging the MP kernel
> > with a SP kernel restores the download bandwidth to expected level.
> > 
> > I'm guessing that this issue is related to recent work on PF?
> 
> It's certainly a problem in, or exposed by, re(4) with the recent MP work
> in the network stack.
> 
> It would help if you could build a kernel with MP_LOCKDEBUG defined and
> see if the resulting kernel enters ddb(4) instead of freezing.
> 
> Thanks,
> Martin

Thanks for the hint! It helped in entering ddb. I collected a bit of output,
which you can find below. If I read the trace correctly the crash is related
to line 1750 of sys/dev/ic/re.c:

d->rl_cmdstat |= htole32(RL_TDESC_CMD_EOF);


ddb{0}> show panic
the kernel did not panic

ddb{0}> trace
db_enter(8196acc0,80002200b8b8,8000e6a8,10,800022023c58
,282) at db_enter+0x9
x86_ipi_handler(8196acd0,819f26a0,14a2d1,c,800022023cd0,fff
f819f95a0) at x86_ipi_handler+0x85
Xresume_lapic_ipi() at Xresume_lapic_ipi+0x1c
--- interrupt ---
___mp_lock(819f26a0,818552f0,1,1,80002200b8b8,8000e
6a8) at ___mp_lock+0x4a
___mp_acquire_count(819f26a0,1,80002200b8b8,8000e6a8,ff
ff81a11680,81707182) at ___mp_acquire_count+0x33
mi_switch(0,0,8000e6a8,800022023ed0,8000e6a8,800022023e
c0) at mi_switch+0x22b
sleep_finish(800022023ed0,1,810b7110,800022023ed0,0,0) at sleep
_finish+0xc2
softclock_thread(8000e6a8,2a2,0,0,29e123d3cabb9a12,800022023f10) at
softclock_thread+0x94
end trace frame: 0x0, count: -8

ddb{0}> machine ddbcpu 1
Stopped at  re_encap+0x24d: movl    0(%r15),%eax

ddb{1}> show panic
the kernel did not panic

ddb{1}> trace
re_encap(80084000,363,ff00d221cc00,800842c0,400,800
84000) at re_encap+0x24d
re_start(800842c0,7,ff00d171d000,80084338,800842c0,
815fdbf0) at re_start+0x8c
ifq_serialize(800842c0,80084360,80084090,ff00d2fbd0
00,800842c0,0) at ifq_serialize+0xf8
if_enqueue(80084090,ff00d2fbd000,8096a240,ff00d2fbd000,
37,2) at if_enqueue+0x90
ether_output(80084090,ff00d171d000,8096a240,ff011b66777
8,804d1000,8096a240) at ether_output+0x1d6
ip_output(ff00d171d000,0,800022032cc0,1,0,0) at ip_output+0x8a1
ip_forward(ff00d171d000,802a4090,0,0,8244c78a,8244c78a) at ip_forwa
rd+0x1e7
ipv4_input(802a4090,ff00d171d000,800,800,ff000990ff80,8
02a4090) at ipv4_input+0x4f7
ether_input(802a4090,ff00d171d000,0,810afed0,802a42
88,802a4090) at ether_input+0xbd
if_input_process(2,800022032eb0,0,0,800022032eb0,80019080) at i
f_input_process+0xfa
taskq_thread(80019080,2a2,80019080,81493240,0,80002
2032f10) at taskq_thread+0x79
end trace frame: 0x0, count: -11

ddb{1}> dmesg
OpenBSD 6.1-current (GENERIC.MP) #7: Thu Jun  8 19:10:36 CEST 2017
b...@gateway.lan:/home/code/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 4245995520 (4049MB)
avail mem = 4111519744 (3921MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xdf16d820 (6 entries)
bios0: vendor coreboot version "SageBios_PCEngines_APU-45" date 04/05/2014
bios0: PC Engines APU
acpi0 at bios0: rev 0
acpi0: sleep states S0 S1 S3 S4 S5
acpi0: tables DSDT FACP SPCR HPET APIC HEST SSDT SSDT SSDT
acpi0: wakeup devices AGPB(S4) HDMI(S4) PBR4(S4) PBR5(S4) PBR6(S4) PBR7(S4) PE2
0(S4) PE21(S4) PE22(S4) PE23(S4) PIBR(S4) UOH1(S3) UOH2(S3) UOH3(S3) UOH4(S3) U
OH5(S3) [...]
acpitimer0 at acpi0: 3579545 Hz, 32 bits
acpihpet0 at acpi0: 14318180 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: AMD G-T40E Processor, 1000.13 MHz
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,C
FLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,MWAIT,SSSE3,CX16,POPCNT,NXE,MMXX,FFXSR,PAGE1GB
,RDTSCP,L

Re: re0 and re1 watchdog timeouts, and system freeze

2017-06-08 Thread Martin Pieuchot
On 07/06/17(Wed) 09:43, Björn Ketelaars wrote:
> On Sat 03/06/2017 08:44, Björn Ketelaars wrote:
> > 
> > Reverting back to the previous kernel fixed the issue above. Question: can
> > someone give a hint on how to track this issue?
> 
> After a bit of experimenting I'm able to reproduce the problem. Summary is
> that queueing in pf and use of a current (after May 30), multi processor
> kernel (bsd.mp from snapshots) causes these specific watchdog timeouts
> followed by a system freeze.
> 
> Issue is 'gone' when:
> 1.) using an older kernel (before May 30);
> 2.) removal of queueing statements from pf.conf. Included below the specific
> snippet;
> 3.) switch from MP kernel to SP kernel.
> 
> New observation is that while queueing, using a MP kernel, the download
> bandwidth is only a fraction of what is expected. Exchanging the MP kernel
> with a SP kernel restores the download bandwidth to expected level.
> 
> I'm guessing that this issue is related to recent work on PF?

It's certainly a problem in, or exposed by, re(4) with the recent MP work
in the network stack.

It would help if you could build a kernel with MP_LOCKDEBUG defined and
see if the resulting kernel enters ddb(4) instead of freezing.

Thanks,
Martin

> 
> 
> --- SNIP ---
> 
> # queueing
> #
> queue up on re0 bandwidth 15M max 15M
> queue up_def parent up bandwidth 1M qlimit 10 default
> queue up_dns parent up bandwidth 2M qlimit 20
> queue up_ssh parent up bandwidth 6M qlimit 50
> queue up_web parent up bandwidth 6M qlimit 50
> match on egress set queue up_def
> match out on egress proto {tcp, udp} to port 1:1024 set queue up_web
> match on egress proto tcp to port 22 set queue up_ssh
> match out on egress proto {tcp, udp} to port 53 set queue up_dns
> match on egress proto icmp set queue up_dns
> match out on egress proto tcp to port {119, 563} set queue up_def 
> 
> queue down on re1 bandwidth 150M max 150M
> queue down_def parent down bandwidth 10M qlimit 100 default
> queue down_dns parent down bandwidth 20M qlimit 200
> queue down_ssh parent down bandwidth 60M qlimit 500
> queue down_web parent down bandwidth 60M qlimit 500
> match on re1 set queue down_def
> match in on re1 proto {tcp, udp} to port 1:1024 set queue down_web
> match on re1 proto tcp to port 22 set queue down_ssh
> match in on re1 proto {tcp, udp} to port 53 set queue down_dns
> match on re1 proto icmp set queue down_dns
> match in on re1 proto tcp to port {119, 563} set queue down_def
> 
> --- SNIP ---
> 
> -- 
> Björn Ketelaars
> GPG key: 0x4F0E5F21
> 



Re: re0 and re1 watchdog timeouts, and system freeze

2017-06-07 Thread Björn Ketelaars
On Sat 03/06/2017 08:44, Björn Ketelaars wrote:
> 
> Reverting back to the previous kernel fixed the issue above. Question: can
> someone give a hint on how to track this issue?

After a bit of experimenting I'm able to reproduce the problem. Summary is
that queueing in pf and use of a current (after May 30), multi processor
kernel (bsd.mp from snapshots) causes these specific watchdog timeouts
followed by a system freeze.

Issue is 'gone' when:
1.) using an older kernel (before May 30);
2.) removal of queueing statements from pf.conf. Included below the specific
snippet;
3.) switch from MP kernel to SP kernel.

New observation is that while queueing, using a MP kernel, the download
bandwidth is only a fraction of what is expected. Exchanging the MP kernel
with a SP kernel restores the download bandwidth to expected level.

I'm guessing that this issue is related to recent work on PF?


--- SNIP ---

# queueing
#
queue up on re0 bandwidth 15M max 15M
queue up_def parent up bandwidth 1M qlimit 10 default
queue up_dns parent up bandwidth 2M qlimit 20
queue up_ssh parent up bandwidth 6M qlimit 50
queue up_web parent up bandwidth 6M qlimit 50
match on egress set queue up_def
match out on egress proto {tcp, udp} to port 1:1024 set queue up_web
match on egress proto tcp to port 22 set queue up_ssh
match out on egress proto {tcp, udp} to port 53 set queue up_dns
match on egress proto icmp set queue up_dns
match out on egress proto tcp to port {119, 563} set queue up_def 

queue down on re1 bandwidth 150M max 150M
queue down_def parent down bandwidth 10M qlimit 100 default
queue down_dns parent down bandwidth 20M qlimit 200
queue down_ssh parent down bandwidth 60M qlimit 500
queue down_web parent down bandwidth 60M qlimit 500
match on re1 set queue down_def
match in on re1 proto {tcp, udp} to port 1:1024 set queue down_web
match on re1 proto tcp to port 22 set queue down_ssh
match in on re1 proto {tcp, udp} to port 53 set queue down_dns
match on re1 proto icmp set queue down_dns
match in on re1 proto tcp to port {119, 563} set queue down_def

--- SNIP ---

-- 
Björn Ketelaars
GPG key: 0x4F0E5F21



re0 and re1 watchdog timeouts, and system freeze

2017-06-02 Thread Björn Ketelaars
I just upgraded my headless pcengines-apu1c from current (amd64, May 30) to
a more recent current (amd64, June 2). Dmesg of the latter is enclosed below.

Upon rebooting the apu1c and logging in with SSH I noticed that the connection
was unreliable (SSH would just stop printing characters), which was rather
easily fixed by logging in with SSH again. After a couple of repeats the apu1c
would seemingly freeze.
Connecting the frozen apu1c to a laptop using a serial cable did not help, as
I was unable to log in: no ddb, nothing. Rebooting the apu1c while connected
over the serial cable resulted in the expected startup and login
prompt on the serial console. After login, and after a few seconds, I received
the following on the console:

re1: watchdog timeout
re0: watchdog timeout
re1: watchdog timeout
...

While SSHing from the laptop to the apu1c, and monitoring via the serial
console, I just received more watchdog timeouts, which finally resulted in
a freeze and no way to enter ddb.

Reverting back to the previous kernel fixed the issue above. Question: can
someone give a hint on how to track this issue?


OpenBSD 6.1-current (GENERIC.MP) #91: Fri Jun  2 05:22:31 MDT 2017
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 4245995520 (4049MB)
avail mem = 4111519744 (3921MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xdf16d820 (6 entries)
bios0: vendor coreboot version "SageBios_PCEngines_APU-45" date 04/05/2014
bios0: PC Engines APU
acpi0 at bios0: rev 0
acpi0: sleep states S0 S1 S3 S4 S5
acpi0: tables DSDT FACP SPCR HPET APIC HEST SSDT SSDT SSDT
acpi0: wakeup devices AGPB(S4) HDMI(S4) PBR4(S4) PBR5(S4) PBR6(S4) PBR7(S4) 
PE20(S4) PE21(S4) PE22(S4) PE23(S4) PIBR(S4) UOH1(S3) UOH2(S3) UOH3(S3) 
UOH4(S3) UOH5(S3) [...]
acpitimer0 at acpi0: 3579545 Hz, 32 bits
acpihpet0 at acpi0: 14318180 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: AMD G-T40E Processor, 1000.16 MHz
cpu0: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,MWAIT,SSSE3,CX16,POPCNT,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,IBS,SKINIT,ITSC
cpu0: 32KB 64b/line 2-way I-cache, 32KB 64b/line 8-way D-cache, 512KB 64b/line 
16-way L2 cache
cpu0: 8 4MB entries fully associative
cpu0: DTLB 40 4KB entries fully associative, 8 4MB entries fully associative
cpu0: TSC frequency 1000160340 Hz
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges
cpu0: apic clock running at 199MHz
cpu0: mwait min=64, max=64, IBE
cpu1 at mainbus0: apid 1 (application processor)
cpu1: AMD G-T40E Processor, 1000.00 MHz
cpu1: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,MWAIT,SSSE3,CX16,POPCNT,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,IBS,SKINIT,ITSC
cpu1: 32KB 64b/line 2-way I-cache, 32KB 64b/line 8-way D-cache, 512KB 64b/line 
16-way L2 cache
cpu1: 8 4MB entries fully associative
cpu1: DTLB 40 4KB entries fully associative, 8 4MB entries fully associative
cpu1: smt 0, core 1, package 0
ioapic0 at mainbus0: apid 2 pa 0xfec0, version 21, 24 pins
acpiprt0 at acpi0: bus -1 (AGPB)
acpiprt1 at acpi0: bus -1 (HDMI)
acpiprt2 at acpi0: bus 1 (PBR4)
acpiprt3 at acpi0: bus 2 (PBR5)
acpiprt4 at acpi0: bus 3 (PBR6)
acpiprt5 at acpi0: bus -1 (PBR7)
acpiprt6 at acpi0: bus 5 (PE20)
acpiprt7 at acpi0: bus -1 (PE21)
acpiprt8 at acpi0: bus -1 (PE22)
acpiprt9 at acpi0: bus -1 (PE23)
acpiprt10 at acpi0: bus 0 (PCI0)
acpiprt11 at acpi0: bus 4 (PIBR)
acpicpu0 at acpi0: C2(0@100 io@0x841), C1(@1 halt!), PSS
acpicpu1 at acpi0: C2(0@100 io@0x841), C1(@1 halt!), PSS
acpibtn0 at acpi0: PWRB
cpu0: 1000 MHz: speeds: 1000 800 MHz
pci0 at mainbus0 bus 0
pchb0 at pci0 dev 0 function 0 "AMD AMD64 14h Host" rev 0x00
ppb0 at pci0 dev 4 function 0 "AMD AMD64 14h PCIE" rev 0x00: msi
pci1 at ppb0 bus 1
re0 at pci1 dev 0 function 0 "Realtek 8168" rev 0x06: RTL8168E/8111E (0x2c00), 
msi, address 00:0d:b9:33:8e:a4
rgephy0 at re0 phy 7: RTL8169S/8110S/8211 PHY, rev. 4
ppb1 at pci0 dev 5 function 0 "AMD AMD64 14h PCIE" rev 0x00: msi
pci2 at ppb1 bus 2
re1 at pci2 dev 0 function 0 "Realtek 8168" rev 0x06: RTL8168E/8111E (0x2c00), 
msi, address 00:0d:b9:33:8e:a5
rgephy1 at re1 phy 7: RTL8169S/8110S/8211 PHY, rev. 4
ppb2 at pci0 dev 6 function 0 "AMD AMD64 14h PCIE" rev 0x00: msi
pci3 at ppb2 bus 3
re2 at pci3 dev 0 function 0 "Realtek 8168" rev 0x06: RTL8168E/8111E (0x2c00), 
msi, address 00:0d:b9:33:8e:a6
rgephy2 at re2 phy 7: RTL8169S/8110S/8211 PHY, rev. 4
ahci0 at pci0 dev 17 function 0 "ATI SBx00 SATA" rev 0x40: apic 2 int 19, AHCI 
1.2
ahci0: port 0: 6.0Gb/s
scsibus1 at ahci0: 32 targets
sd0 at scsibus1 targ 0 lun 0:  SCSI3 0/direct 
fixed naa.5002538043584d30
sd0: 30533MB, 512 bytes/sector, 62533296 sectors, thin
ohci0 at pci0 d