Re: re0 and re1 watchdog timeouts, and system freeze
On Mon 12/06/2017 15:11, David Gwynne wrote:
> On Fri, Jun 09, 2017 at 07:19:34PM +0200, Björn Ketelaars wrote:
> > On Fri 09/06/2017 12:07, Martin Pieuchot wrote:
> > > On 08/06/17(Thu) 20:38, Björn Ketelaars wrote:
> > > > On Thu 08/06/2017 16:55, Martin Pieuchot wrote:
> > > > > On 07/06/17(Wed) 09:43, Björn Ketelaars wrote:
> > > > > > On Sat 03/06/2017 08:44, Björn Ketelaars wrote:
> > > > > > >
> > > > > > > Reverting back to the previous kernel fixed the issue above. Question: can
> > > > > > > someone give a hint on how to track this issue?
> > > > > >
> > > > > > After a bit of experimenting I'm able to reproduce the problem. Summary is
> > > > > > that queueing in pf and use of a current (after May 30), multi processor
> > > > > > kernel (bsd.mp from snapshots) causes these specific watchdog timeouts
> > > > > > followed by a system freeze.
> > > > > >
> > > > > > Issue is 'gone' when:
> > > > > > 1.) using an older kernel (before May 30);
> > > > > > 2.) removal of queueing statements from pf.conf. Included below the specific
> > > > > > snippet;
> > > > > > 3.) switch from MP kernel to SP kernel.
> > > > > >
> > > > > > New observation is that while queueing, using a MP kernel, the download
> > > > > > bandwidth is only a fraction of what is expected. Exchanging the MP kernel
> > > > > > with a SP kernel restores the download bandwidth to expected level.
> > > > > >
> > > > > > I'm guessing that this issue is related to recent work on PF?
> > > > >
> > > > > It's certainly a problem in, or exposed by, re(4) with the recent MP work
> > > > > in the network stack.
> > > > >
> > > > > It would help if you could build a kernel with MP_LOCKDEBUG defined and
> > > > > see if the resulting kernel enters ddb(4) instead of freezing.
> > > > >
> > > > > Thanks,
> > > > > Martin
> > > >
> > > > Thanks for the hint! It helped in entering ddb. I collected a bit of output,
> > > > which you can find below. If I read the trace correctly the crash is related
> > > > to line 1750 of sys/dev/ic/re.c:
> > > >
> > > > d->rl_cmdstat |= htole32(RL_TDESC_CMD_EOF);
> > >
> > > Could you test the diff below, always with a MP_LOCKDEBUG kernel and
> > > tell us if you can reproduce the freeze or if the kernel enters ddb(4)?
> > >
> > > Another question, how often do you see "watchdog timeout" messages?
> > >
> > > Index: re.c
> > > ===================================================================
> > > RCS file: /cvs/src/sys/dev/ic/re.c,v
> > > retrieving revision 1.201
> > > diff -u -p -r1.201 re.c
> > > --- re.c	24 Jan 2017 03:57:34 -	1.201
> > > +++ re.c	9 Jun 2017 10:04:43 -
> > > @@ -2074,9 +2074,6 @@ re_watchdog(struct ifnet *ifp)
> > >  	s = splnet();
> > >  	printf("%s: watchdog timeout\n", sc->sc_dev.dv_xname);
> > >
> > > -	re_txeof(sc);
> > > -	re_rxeof(sc);
> > > -
> > >  	re_init(ifp);
> > >
> > >  	splx(s);
> >
> > The diff (with a MP_LOCKDEBUG kernel) resulted in similar traces as before.
> > ddb output is included below.
> >
> > With your diff the number of timeout messages decreased from 9 to 2 before
> > entering ddb.
>
> can you try the diff below please?
>
> Index: hfsc.c
> ===================================================================
> RCS file: /cvs/src/sys/net/hfsc.c,v
> retrieving revision 1.39
> diff -u -p -r1.39 hfsc.c
> --- hfsc.c	8 May 2017 11:30:53 -	1.39
> +++ hfsc.c	12 Jun 2017 05:08:01 -
> @@ -817,7 +817,7 @@ hfsc_deferred(void *arg)
>  	KASSERT(HFSC_ENABLED(ifq));
>
>  	if (!ifq_empty(ifq))
> -		(*ifp->if_qstart)(ifq);
> +		ifq_start(ifq);
>
>  	hif = ifq->ifq_q;

Your diff fixes my issue. Previously I was able to reproduce the problem
within a minute. I have been running a kernel with your patch for the last
three hours without a hitch. No more re0/re1 watchdog timeouts, no more
bandwidth issues while using queueing and no more ddb.

Thanks!
Re: re0 and re1 watchdog timeouts, and system freeze
On Mon, Jun 12, 2017 at 15:11 +1000, David Gwynne wrote:
> On Fri, Jun 09, 2017 at 07:19:34PM +0200, Björn Ketelaars wrote:
> > On Fri 09/06/2017 12:07, Martin Pieuchot wrote:
> > > On 08/06/17(Thu) 20:38, Björn Ketelaars wrote:
> > > > On Thu 08/06/2017 16:55, Martin Pieuchot wrote:
> > > > > On 07/06/17(Wed) 09:43, Björn Ketelaars wrote:
> > > > > > On Sat 03/06/2017 08:44, Björn Ketelaars wrote:
> > > > > > >
> > > > > > > Reverting back to the previous kernel fixed the issue above. Question: can
> > > > > > > someone give a hint on how to track this issue?
> > > > > >
> > > > > > After a bit of experimenting I'm able to reproduce the problem. Summary is
> > > > > > that queueing in pf and use of a current (after May 30), multi processor
> > > > > > kernel (bsd.mp from snapshots) causes these specific watchdog timeouts
> > > > > > followed by a system freeze.
> > > > > >
> > > > > > Issue is 'gone' when:
> > > > > > 1.) using an older kernel (before May 30);
> > > > > > 2.) removal of queueing statements from pf.conf. Included below the specific
> > > > > > snippet;
> > > > > > 3.) switch from MP kernel to SP kernel.
> > > > > >
> > > > > > New observation is that while queueing, using a MP kernel, the download
> > > > > > bandwidth is only a fraction of what is expected. Exchanging the MP kernel
> > > > > > with a SP kernel restores the download bandwidth to expected level.
> > > > > >
> > > > > > I'm guessing that this issue is related to recent work on PF?
> > > > >
> > > > > It's certainly a problem in, or exposed by, re(4) with the recent MP work
> > > > > in the network stack.
> > > > >
> > > > > It would help if you could build a kernel with MP_LOCKDEBUG defined and
> > > > > see if the resulting kernel enters ddb(4) instead of freezing.
> > > > >
> > > > > Thanks,
> > > > > Martin
> > > >
> > > > Thanks for the hint! It helped in entering ddb. I collected a bit of output,
> > > > which you can find below. If I read the trace correctly the crash is related
> > > > to line 1750 of sys/dev/ic/re.c:
> > > >
> > > > d->rl_cmdstat |= htole32(RL_TDESC_CMD_EOF);
> > >
> > > Could you test the diff below, always with a MP_LOCKDEBUG kernel and
> > > tell us if you can reproduce the freeze or if the kernel enters ddb(4)?
> > >
> > > Another question, how often do you see "watchdog timeout" messages?
> > >
> > > Index: re.c
> > > ===================================================================
> > > RCS file: /cvs/src/sys/dev/ic/re.c,v
> > > retrieving revision 1.201
> > > diff -u -p -r1.201 re.c
> > > --- re.c	24 Jan 2017 03:57:34 -	1.201
> > > +++ re.c	9 Jun 2017 10:04:43 -
> > > @@ -2074,9 +2074,6 @@ re_watchdog(struct ifnet *ifp)
> > >  	s = splnet();
> > >  	printf("%s: watchdog timeout\n", sc->sc_dev.dv_xname);
> > >
> > > -	re_txeof(sc);
> > > -	re_rxeof(sc);
> > > -
> > >  	re_init(ifp);
> > >
> > >  	splx(s);
> >
> > The diff (with a MP_LOCKDEBUG kernel) resulted in similar traces as before.
> > ddb output is included below.
> >
> > With your diff the number of timeout messages decreased from 9 to 2 before
> > entering ddb.
>
> can you try the diff below please?
>

Oh my, unserialized start.  Someone didn't finish the conversion (-;

OK mikeb

> Index: hfsc.c
> ===================================================================
> RCS file: /cvs/src/sys/net/hfsc.c,v
> retrieving revision 1.39
> diff -u -p -r1.39 hfsc.c
> --- hfsc.c	8 May 2017 11:30:53 -	1.39
> +++ hfsc.c	12 Jun 2017 05:08:01 -
> @@ -817,7 +817,7 @@ hfsc_deferred(void *arg)
>  	KASSERT(HFSC_ENABLED(ifq));
>
>  	if (!ifq_empty(ifq))
> -		(*ifp->if_qstart)(ifq);
> +		ifq_start(ifq);
>
>  	hif = ifq->ifq_q;
>
>
Re: re0 and re1 watchdog timeouts, and system freeze
On Fri, Jun 09, 2017 at 07:19:34PM +0200, Björn Ketelaars wrote:
> On Fri 09/06/2017 12:07, Martin Pieuchot wrote:
> > On 08/06/17(Thu) 20:38, Björn Ketelaars wrote:
> > > On Thu 08/06/2017 16:55, Martin Pieuchot wrote:
> > > > On 07/06/17(Wed) 09:43, Björn Ketelaars wrote:
> > > > > On Sat 03/06/2017 08:44, Björn Ketelaars wrote:
> > > > > >
> > > > > > Reverting back to the previous kernel fixed the issue above. Question: can
> > > > > > someone give a hint on how to track this issue?
> > > > >
> > > > > After a bit of experimenting I'm able to reproduce the problem. Summary is
> > > > > that queueing in pf and use of a current (after May 30), multi processor
> > > > > kernel (bsd.mp from snapshots) causes these specific watchdog timeouts
> > > > > followed by a system freeze.
> > > > >
> > > > > Issue is 'gone' when:
> > > > > 1.) using an older kernel (before May 30);
> > > > > 2.) removal of queueing statements from pf.conf. Included below the specific
> > > > > snippet;
> > > > > 3.) switch from MP kernel to SP kernel.
> > > > >
> > > > > New observation is that while queueing, using a MP kernel, the download
> > > > > bandwidth is only a fraction of what is expected. Exchanging the MP kernel
> > > > > with a SP kernel restores the download bandwidth to expected level.
> > > > >
> > > > > I'm guessing that this issue is related to recent work on PF?
> > > >
> > > > It's certainly a problem in, or exposed by, re(4) with the recent MP work
> > > > in the network stack.
> > > >
> > > > It would help if you could build a kernel with MP_LOCKDEBUG defined and
> > > > see if the resulting kernel enters ddb(4) instead of freezing.
> > > >
> > > > Thanks,
> > > > Martin
> > >
> > > Thanks for the hint! It helped in entering ddb. I collected a bit of output,
> > > which you can find below. If I read the trace correctly the crash is related
> > > to line 1750 of sys/dev/ic/re.c:
> > >
> > > d->rl_cmdstat |= htole32(RL_TDESC_CMD_EOF);
> >
> > Could you test the diff below, always with a MP_LOCKDEBUG kernel and
> > tell us if you can reproduce the freeze or if the kernel enters ddb(4)?
> >
> > Another question, how often do you see "watchdog timeout" messages?
> >
> > Index: re.c
> > ===================================================================
> > RCS file: /cvs/src/sys/dev/ic/re.c,v
> > retrieving revision 1.201
> > diff -u -p -r1.201 re.c
> > --- re.c	24 Jan 2017 03:57:34 -	1.201
> > +++ re.c	9 Jun 2017 10:04:43 -
> > @@ -2074,9 +2074,6 @@ re_watchdog(struct ifnet *ifp)
> >  	s = splnet();
> >  	printf("%s: watchdog timeout\n", sc->sc_dev.dv_xname);
> >
> > -	re_txeof(sc);
> > -	re_rxeof(sc);
> > -
> >  	re_init(ifp);
> >
> >  	splx(s);
>
> The diff (with a MP_LOCKDEBUG kernel) resulted in similar traces as before.
> ddb output is included below.
>
> With your diff the number of timeout messages decreased from 9 to 2 before
> entering ddb.

can you try the diff below please?

Index: hfsc.c
===================================================================
RCS file: /cvs/src/sys/net/hfsc.c,v
retrieving revision 1.39
diff -u -p -r1.39 hfsc.c
--- hfsc.c	8 May 2017 11:30:53 -	1.39
+++ hfsc.c	12 Jun 2017 05:08:01 -
@@ -817,7 +817,7 @@ hfsc_deferred(void *arg)
 	KASSERT(HFSC_ENABLED(ifq));
 
 	if (!ifq_empty(ifq))
-		(*ifp->if_qstart)(ifq);
+		ifq_start(ifq);
 
 	hif = ifq->ifq_q;
Re: re0 and re1 watchdog timeouts, and system freeze
On Fri 09/06/2017 12:07, Martin Pieuchot wrote:
> On 08/06/17(Thu) 20:38, Björn Ketelaars wrote:
> > On Thu 08/06/2017 16:55, Martin Pieuchot wrote:
> > > On 07/06/17(Wed) 09:43, Björn Ketelaars wrote:
> > > > On Sat 03/06/2017 08:44, Björn Ketelaars wrote:
> > > > >
> > > > > Reverting back to the previous kernel fixed the issue above. Question: can
> > > > > someone give a hint on how to track this issue?
> > > >
> > > > After a bit of experimenting I'm able to reproduce the problem. Summary is
> > > > that queueing in pf and use of a current (after May 30), multi processor
> > > > kernel (bsd.mp from snapshots) causes these specific watchdog timeouts
> > > > followed by a system freeze.
> > > >
> > > > Issue is 'gone' when:
> > > > 1.) using an older kernel (before May 30);
> > > > 2.) removal of queueing statements from pf.conf. Included below the specific
> > > > snippet;
> > > > 3.) switch from MP kernel to SP kernel.
> > > >
> > > > New observation is that while queueing, using a MP kernel, the download
> > > > bandwidth is only a fraction of what is expected. Exchanging the MP kernel
> > > > with a SP kernel restores the download bandwidth to expected level.
> > > >
> > > > I'm guessing that this issue is related to recent work on PF?
> > >
> > > It's certainly a problem in, or exposed by, re(4) with the recent MP work
> > > in the network stack.
> > >
> > > It would help if you could build a kernel with MP_LOCKDEBUG defined and
> > > see if the resulting kernel enters ddb(4) instead of freezing.
> > >
> > > Thanks,
> > > Martin
> >
> > Thanks for the hint! It helped in entering ddb. I collected a bit of output,
> > which you can find below. If I read the trace correctly the crash is related
> > to line 1750 of sys/dev/ic/re.c:
> >
> > d->rl_cmdstat |= htole32(RL_TDESC_CMD_EOF);
>
> Could you test the diff below, always with a MP_LOCKDEBUG kernel and
> tell us if you can reproduce the freeze or if the kernel enters ddb(4)?
>
> Another question, how often do you see "watchdog timeout" messages?
>
> Index: re.c
> ===================================================================
> RCS file: /cvs/src/sys/dev/ic/re.c,v
> retrieving revision 1.201
> diff -u -p -r1.201 re.c
> --- re.c	24 Jan 2017 03:57:34 -	1.201
> +++ re.c	9 Jun 2017 10:04:43 -
> @@ -2074,9 +2074,6 @@ re_watchdog(struct ifnet *ifp)
>  	s = splnet();
>  	printf("%s: watchdog timeout\n", sc->sc_dev.dv_xname);
>
> -	re_txeof(sc);
> -	re_rxeof(sc);
> -
>  	re_init(ifp);
>
>  	splx(s);

The diff (with a MP_LOCKDEBUG kernel) resulted in similar traces as before.
ddb output is included below.

With your diff the number of timeout messages decreased from 9 to 2 before
entering ddb.

ddb{1}> show panic
the kernel did not panic
ddb{1}> machine ddbcpu 0
Stopped at      db_enter+0x9:   leave
ddb{0}> trace
db_enter(8196b640,200,804d1600,10,8000220e9938,282) at db_enter+0x9
x86_ipi_handler(80074010,819e7160,102860,c,4,1813a9522) at x86_ipi_handler+0x85
Xresume_lapic_ipi() at Xresume_lapic_ipi+0x1c
--- interrupt ---
___mp_lock(819e7160,8187acf0,1,1,8000220a38c0,8000220a7ab0) at ___mp_lock+0x4a
___mp_acquire_count(819e7160,1,8000220a38c0,8000220a7ab0,ff011da1e1c8,81019942) at ___mp_acquire_count+0x33
mi_switch(ff011b4a93c0,818bf7e7,9cd,8000220e9bb0,8000220a7ab0,8000220e9ba0) at mi_switch+0x22b
sleep_finish(8000220e9bb0,1,118,8000220e9bb0,ff011b4a93c0,818bf7e7) at sleep_finish+0xc2
tsleep(ff011b4a93c0,118,818bf7e7,9cd,0,40) at tsleep+0x164
kqueue_scan(ff011b4a93c0,40,143d48714800,8000220e9dd0,8000220a7ab0,8000220e9de4) at kqueue_scan+0x15b
sys_kevent(8000220a7ab0,8000220e9e60,8000220e9eb0,8000220a7ab0,48,1) at sys_kevent+0x2f6
syscall() at syscall+0x29f
--- syscall (number 72) ---
end of kernel
end trace frame: 0x143d48714800, count: -11
0x143c524f280a:
ddb{0}> machine ddbcpu 1
Stopped at      re_encap+0x24d: movl    0(%r15),%eax
ddb{1}> trace
re_encap(80084000,dd,ff00d6e03f00,800842c0,400,80084000) at re_encap+0x24d
re_start(800842c0,7,ff00d6dd7200,80084338,800842c0,81091d70) at re_start+0x8c
ifq_serialize(800842c0,80084360,80084090,ff00d6a51500,800842c0,0) at ifq_serialize+0xf8
if_enqueue(80084090,ff00d6a51500,8001dc80,ff00d6a51500,37,2) at if_enqueue+0x90
ether_output(80084090,ff00d6dd7200,8001dc80,ff011b3f49a0,80933400,8001dc80) at ether_output+0x1d6
ip_output(ff00d6dd7200,0,800022032cc0,1,0,0) at ip_output+0x8a1
ip_forward(ff00d6dd7200,802a4090,0,0,8244c78a,8244c78a) at ip_forward+0x1e7
ipv4_input(802a4090,ff00d6dd7200,800,800,ff0009d00200,802
Re: re0 and re1 watchdog timeouts, and system freeze
On 08/06/17(Thu) 20:38, Björn Ketelaars wrote:
> On Thu 08/06/2017 16:55, Martin Pieuchot wrote:
> > On 07/06/17(Wed) 09:43, Björn Ketelaars wrote:
> > > On Sat 03/06/2017 08:44, Björn Ketelaars wrote:
> > > >
> > > > Reverting back to the previous kernel fixed the issue above. Question: can
> > > > someone give a hint on how to track this issue?
> > >
> > > After a bit of experimenting I'm able to reproduce the problem. Summary is
> > > that queueing in pf and use of a current (after May 30), multi processor
> > > kernel (bsd.mp from snapshots) causes these specific watchdog timeouts
> > > followed by a system freeze.
> > >
> > > Issue is 'gone' when:
> > > 1.) using an older kernel (before May 30);
> > > 2.) removal of queueing statements from pf.conf. Included below the specific
> > > snippet;
> > > 3.) switch from MP kernel to SP kernel.
> > >
> > > New observation is that while queueing, using a MP kernel, the download
> > > bandwidth is only a fraction of what is expected. Exchanging the MP kernel
> > > with a SP kernel restores the download bandwidth to expected level.
> > >
> > > I'm guessing that this issue is related to recent work on PF?
> >
> > It's certainly a problem in, or exposed by, re(4) with the recent MP work
> > in the network stack.
> >
> > It would help if you could build a kernel with MP_LOCKDEBUG defined and
> > see if the resulting kernel enters ddb(4) instead of freezing.
> >
> > Thanks,
> > Martin
>
> Thanks for the hint! It helped in entering ddb. I collected a bit of output,
> which you can find below. If I read the trace correctly the crash is related
> to line 1750 of sys/dev/ic/re.c:
>
> d->rl_cmdstat |= htole32(RL_TDESC_CMD_EOF);

Could you test the diff below, always with a MP_LOCKDEBUG kernel and
tell us if you can reproduce the freeze or if the kernel enters ddb(4)?

Another question, how often do you see "watchdog timeout" messages?

Index: re.c
===================================================================
RCS file: /cvs/src/sys/dev/ic/re.c,v
retrieving revision 1.201
diff -u -p -r1.201 re.c
--- re.c	24 Jan 2017 03:57:34 -	1.201
+++ re.c	9 Jun 2017 10:04:43 -
@@ -2074,9 +2074,6 @@ re_watchdog(struct ifnet *ifp)
 	s = splnet();
 	printf("%s: watchdog timeout\n", sc->sc_dev.dv_xname);
 
-	re_txeof(sc);
-	re_rxeof(sc);
-
 	re_init(ifp);
 
 	splx(s);
Re: re0 and re1 watchdog timeouts, and system freeze
On Thu 08/06/2017 16:55, Martin Pieuchot wrote:
> On 07/06/17(Wed) 09:43, Björn Ketelaars wrote:
> > On Sat 03/06/2017 08:44, Björn Ketelaars wrote:
> > >
> > > Reverting back to the previous kernel fixed the issue above. Question: can
> > > someone give a hint on how to track this issue?
> >
> > After a bit of experimenting I'm able to reproduce the problem. Summary is
> > that queueing in pf and use of a current (after May 30), multi processor
> > kernel (bsd.mp from snapshots) causes these specific watchdog timeouts
> > followed by a system freeze.
> >
> > Issue is 'gone' when:
> > 1.) using an older kernel (before May 30);
> > 2.) removal of queueing statements from pf.conf. Included below the specific
> > snippet;
> > 3.) switch from MP kernel to SP kernel.
> >
> > New observation is that while queueing, using a MP kernel, the download
> > bandwidth is only a fraction of what is expected. Exchanging the MP kernel
> > with a SP kernel restores the download bandwidth to expected level.
> >
> > I'm guessing that this issue is related to recent work on PF?
>
> It's certainly a problem in, or exposed by, re(4) with the recent MP work
> in the network stack.
>
> It would help if you could build a kernel with MP_LOCKDEBUG defined and
> see if the resulting kernel enters ddb(4) instead of freezing.
>
> Thanks,
> Martin

Thanks for the hint! It helped in entering ddb. I collected a bit of output,
which you can find below. If I read the trace correctly the crash is related
to line 1750 of sys/dev/ic/re.c:

d->rl_cmdstat |= htole32(RL_TDESC_CMD_EOF);

ddb{0}> show panic
the kernel did not panic
ddb{0}> trace
db_enter(8196acc0,80002200b8b8,8000e6a8,10,800022023c58,282) at db_enter+0x9
x86_ipi_handler(8196acd0,819f26a0,14a2d1,c,800022023cd0,fff819f95a0) at x86_ipi_handler+0x85
Xresume_lapic_ipi() at Xresume_lapic_ipi+0x1c
--- interrupt ---
___mp_lock(819f26a0,818552f0,1,1,80002200b8b8,8000e6a8) at ___mp_lock+0x4a
___mp_acquire_count(819f26a0,1,80002200b8b8,8000e6a8,ffff81a11680,81707182) at ___mp_acquire_count+0x33
mi_switch(0,0,8000e6a8,800022023ed0,8000e6a8,800022023ec0) at mi_switch+0x22b
sleep_finish(800022023ed0,1,810b7110,800022023ed0,0,0) at sleep_finish+0xc2
softclock_thread(8000e6a8,2a2,0,0,29e123d3cabb9a12,800022023f10) at softclock_thread+0x94
end trace frame: 0x0, count: -8
ddb{0}> machine ddbcpu 1
Stopped at      re_encap+0x24d: movl    0(%r15),%eax
ddb{1}> show panic
the kernel did not panic
ddb{1}> trace
re_encap(80084000,363,ff00d221cc00,800842c0,400,80084000) at re_encap+0x24d
re_start(800842c0,7,ff00d171d000,80084338,800842c0,815fdbf0) at re_start+0x8c
ifq_serialize(800842c0,80084360,80084090,ff00d2fbd000,800842c0,0) at ifq_serialize+0xf8
if_enqueue(80084090,ff00d2fbd000,8096a240,ff00d2fbd000,37,2) at if_enqueue+0x90
ether_output(80084090,ff00d171d000,8096a240,ff011b667778,804d1000,8096a240) at ether_output+0x1d6
ip_output(ff00d171d000,0,800022032cc0,1,0,0) at ip_output+0x8a1
ip_forward(ff00d171d000,802a4090,0,0,8244c78a,8244c78a) at ip_forward+0x1e7
ipv4_input(802a4090,ff00d171d000,800,800,ff000990ff80,802a4090) at ipv4_input+0x4f7
ether_input(802a4090,ff00d171d000,0,810afed0,802a4288,802a4090) at ether_input+0xbd
if_input_process(2,800022032eb0,0,0,800022032eb0,80019080) at if_input_process+0xfa
taskq_thread(80019080,2a2,80019080,81493240,0,800022032f10) at taskq_thread+0x79
end trace frame: 0x0, count: -11
ddb{1}> dmesg
OpenBSD 6.1-current (GENERIC.MP) #7: Thu Jun  8 19:10:36 CEST 2017
    b...@gateway.lan:/home/code/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 4245995520 (4049MB)
avail mem = 4111519744 (3921MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xdf16d820 (6 entries)
bios0: vendor coreboot version "SageBios_PCEngines_APU-45" date 04/05/2014
bios0: PC Engines APU
acpi0 at bios0: rev 0
acpi0: sleep states S0 S1 S3 S4 S5
acpi0: tables DSDT FACP SPCR HPET APIC HEST SSDT SSDT SSDT
acpi0: wakeup devices AGPB(S4) HDMI(S4) PBR4(S4) PBR5(S4) PBR6(S4) PBR7(S4) PE20(S4) PE21(S4) PE22(S4) PE23(S4) PIBR(S4) UOH1(S3) UOH2(S3) UOH3(S3) UOH4(S3) UOH5(S3) [...]
acpitimer0 at acpi0: 3579545 Hz, 32 bits
acpihpet0 at acpi0: 14318180 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: AMD G-T40E Processor, 1000.13 MHz
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,MWAIT,SSSE3,CX16,POPCNT,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,L
Re: re0 and re1 watchdog timeouts, and system freeze
On 07/06/17(Wed) 09:43, Björn Ketelaars wrote:
> On Sat 03/06/2017 08:44, Björn Ketelaars wrote:
> >
> > Reverting back to the previous kernel fixed the issue above. Question: can
> > someone give a hint on how to track this issue?
>
> After a bit of experimenting I'm able to reproduce the problem. Summary is
> that queueing in pf and use of a current (after May 30), multi processor
> kernel (bsd.mp from snapshots) causes these specific watchdog timeouts
> followed by a system freeze.
>
> Issue is 'gone' when:
> 1.) using an older kernel (before May 30);
> 2.) removal of queueing statements from pf.conf. Included below the specific
> snippet;
> 3.) switch from MP kernel to SP kernel.
>
> New observation is that while queueing, using a MP kernel, the download
> bandwidth is only a fraction of what is expected. Exchanging the MP kernel
> with a SP kernel restores the download bandwidth to expected level.
>
> I'm guessing that this issue is related to recent work on PF?

It's certainly a problem in, or exposed by, re(4) with the recent MP work
in the network stack.

It would help if you could build a kernel with MP_LOCKDEBUG defined and
see if the resulting kernel enters ddb(4) instead of freezing.

Thanks,
Martin

>
>
> --- SNIP ---
>
> # queueing
> #
> queue up on re0 bandwidth 15M max 15M
> queue up_def parent up bandwidth 1M qlimit 10 default
> queue up_dns parent up bandwidth 2M qlimit 20
> queue up_ssh parent up bandwidth 6M qlimit 50
> queue up_web parent up bandwidth 6M qlimit 50
> match on egress set queue up_def
> match out on egress proto {tcp, udp} to port 1:1024 set queue up_web
> match on egress proto tcp to port 22 set queue up_ssh
> match out on egress proto {tcp, udp} to port 53 set queue up_dns
> match on egress proto icmp set queue up_dns
> match out on egress proto tcp to port {119, 563} set queue up_def
>
> queue down on re1 bandwidth 150M max 150M
> queue down_def parent down bandwidth 10M qlimit 100 default
> queue down_dns parent down bandwidth 20M qlimit 200
> queue down_ssh parent down bandwidth 60M qlimit 500
> queue down_web parent down bandwidth 60M qlimit 500
> match on re1 set queue down_def
> match in on re1 proto {tcp, udp} to port 1:1024 set queue down_web
> match on re1 proto tcp to port 22 set queue down_ssh
> match in on re1 proto {tcp, udp} to port 53 set queue down_dns
> match on re1 proto icmp set queue down_dns
> match in on re1 proto tcp to port {119, 563} set queue down_def
>
> --- SNIP ---
>
> --
> Björn Ketelaars
> GPG key: 0x4F0E5F21
>
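For reference, building a kernel with that option usually amounts to a small config file on top of GENERIC.MP. This is a sketch of the standard custom-kernel steps, assuming the sources live in /usr/src; the config name DEBUG.MP is made up for this example:

```shell
# Sketch: an amd64 MP kernel with MP_LOCKDEBUG enabled.
cd /usr/src/sys/arch/amd64/conf
cat > DEBUG.MP <<'EOF'
include "arch/amd64/conf/GENERIC.MP"
option	MP_LOCKDEBUG
EOF
config DEBUG.MP
cd ../compile/DEBUG.MP
make
# then install the resulting 'bsd' as /bsd and reboot into it
```

With MP_LOCKDEBUG, a CPU that spins too long on the kernel lock drops into ddb(4) instead of hanging silently, which is what makes the traces in this thread possible.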
Re: re0 and re1 watchdog timeouts, and system freeze
On Sat 03/06/2017 08:44, Björn Ketelaars wrote:
>
> Reverting back to the previous kernel fixed the issue above. Question: can
> someone give a hint on how to track this issue?

After a bit of experimenting I'm able to reproduce the problem. Summary is
that queueing in pf and use of a current (after May 30), multi processor
kernel (bsd.mp from snapshots) causes these specific watchdog timeouts
followed by a system freeze.

Issue is 'gone' when:
1.) using an older kernel (before May 30);
2.) removal of queueing statements from pf.conf. Included below the specific
snippet;
3.) switch from MP kernel to SP kernel.

New observation is that while queueing, using a MP kernel, the download
bandwidth is only a fraction of what is expected. Exchanging the MP kernel
with a SP kernel restores the download bandwidth to expected level.

I'm guessing that this issue is related to recent work on PF?

--- SNIP ---

# queueing
#
queue up on re0 bandwidth 15M max 15M
queue up_def parent up bandwidth 1M qlimit 10 default
queue up_dns parent up bandwidth 2M qlimit 20
queue up_ssh parent up bandwidth 6M qlimit 50
queue up_web parent up bandwidth 6M qlimit 50
match on egress set queue up_def
match out on egress proto {tcp, udp} to port 1:1024 set queue up_web
match on egress proto tcp to port 22 set queue up_ssh
match out on egress proto {tcp, udp} to port 53 set queue up_dns
match on egress proto icmp set queue up_dns
match out on egress proto tcp to port {119, 563} set queue up_def

queue down on re1 bandwidth 150M max 150M
queue down_def parent down bandwidth 10M qlimit 100 default
queue down_dns parent down bandwidth 20M qlimit 200
queue down_ssh parent down bandwidth 60M qlimit 500
queue down_web parent down bandwidth 60M qlimit 500
match on re1 set queue down_def
match in on re1 proto {tcp, udp} to port 1:1024 set queue down_web
match on re1 proto tcp to port 22 set queue down_ssh
match in on re1 proto {tcp, udp} to port 53 set queue down_dns
match on re1 proto icmp set queue down_dns
match in on re1 proto tcp to port {119, 563} set queue down_def

--- SNIP ---

--
Björn Ketelaars
GPG key: 0x4F0E5F21
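When chasing queueing problems like the bandwidth collapse described above, the loaded queue definitions and their live counters can be inspected from userland with the standard OpenBSD tools:

```shell
# Verbose listing of the configured queues with per-queue
# packet, byte and drop counters.
pfctl -s queue -v

# Continuously refreshing per-queue statistics view.
systat queues
```

A queue whose drop counter climbs while traffic stalls is a quick hint that packets are being enqueued but the start routine is not draining them.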
re0 and re1 watchdog timeouts, and system freeze
I just upgraded my headless pcengines-apu1c from current (amd64, May 30) to a
more recent current (amd64, June 2). Dmesg of the latter is enclosed below.

Upon rebooting the apu1c and logging in with SSH I noticed that the connection
was unreliable (SSH would just stop printing characters), which was rather
easily fixed by logging in with SSH again. After a couple of repeats the apu1c
would seemingly freeze. Connecting the frozen apu1c to a laptop using a serial
cable did not help as I was unable to log in, no ddb, nothing.

Rebooting the apu1c while being connected using a serial cable resulted in an
expected startup and login prompt on the serial console. After login, and
after a few seconds, I received the following on my console:

re1: watchdog timeout
re0: watchdog timeout
re1: watchdog timeout
...

While SSHing from the laptop to the apu1c, and monitoring using the serial
console, I just received more watchdog timeouts, which finally resulted in a
freeze and no way to enter ddb.

Reverting back to the previous kernel fixed the issue above. Question: can
someone give a hint on how to track this issue?

OpenBSD 6.1-current (GENERIC.MP) #91: Fri Jun  2 05:22:31 MDT 2017
    dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 4245995520 (4049MB)
avail mem = 4111519744 (3921MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xdf16d820 (6 entries)
bios0: vendor coreboot version "SageBios_PCEngines_APU-45" date 04/05/2014
bios0: PC Engines APU
acpi0 at bios0: rev 0
acpi0: sleep states S0 S1 S3 S4 S5
acpi0: tables DSDT FACP SPCR HPET APIC HEST SSDT SSDT SSDT
acpi0: wakeup devices AGPB(S4) HDMI(S4) PBR4(S4) PBR5(S4) PBR6(S4) PBR7(S4) PE20(S4) PE21(S4) PE22(S4) PE23(S4) PIBR(S4) UOH1(S3) UOH2(S3) UOH3(S3) UOH4(S3) UOH5(S3) [...]
acpitimer0 at acpi0: 3579545 Hz, 32 bits
acpihpet0 at acpi0: 14318180 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: AMD G-T40E Processor, 1000.16 MHz
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,MWAIT,SSSE3,CX16,POPCNT,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,IBS,SKINIT,ITSC
cpu0: 32KB 64b/line 2-way I-cache, 32KB 64b/line 8-way D-cache, 512KB 64b/line 16-way L2 cache
cpu0: 8 4MB entries fully associative
cpu0: DTLB 40 4KB entries fully associative, 8 4MB entries fully associative
cpu0: TSC frequency 1000160340 Hz
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges
cpu0: apic clock running at 199MHz
cpu0: mwait min=64, max=64, IBE
cpu1 at mainbus0: apid 1 (application processor)
cpu1: AMD G-T40E Processor, 1000.00 MHz
cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,MWAIT,SSSE3,CX16,POPCNT,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,IBS,SKINIT,ITSC
cpu1: 32KB 64b/line 2-way I-cache, 32KB 64b/line 8-way D-cache, 512KB 64b/line 16-way L2 cache
cpu1: 8 4MB entries fully associative
cpu1: DTLB 40 4KB entries fully associative, 8 4MB entries fully associative
cpu1: smt 0, core 1, package 0
ioapic0 at mainbus0: apid 2 pa 0xfec0, version 21, 24 pins
acpiprt0 at acpi0: bus -1 (AGPB)
acpiprt1 at acpi0: bus -1 (HDMI)
acpiprt2 at acpi0: bus 1 (PBR4)
acpiprt3 at acpi0: bus 2 (PBR5)
acpiprt4 at acpi0: bus 3 (PBR6)
acpiprt5 at acpi0: bus -1 (PBR7)
acpiprt6 at acpi0: bus 5 (PE20)
acpiprt7 at acpi0: bus -1 (PE21)
acpiprt8 at acpi0: bus -1 (PE22)
acpiprt9 at acpi0: bus -1 (PE23)
acpiprt10 at acpi0: bus 0 (PCI0)
acpiprt11 at acpi0: bus 4 (PIBR)
acpicpu0 at acpi0: C2(0@100 io@0x841), C1(@1 halt!), PSS
acpicpu1 at acpi0: C2(0@100 io@0x841), C1(@1 halt!), PSS
acpibtn0 at acpi0: PWRB
cpu0: 1000 MHz: speeds: 1000 800 MHz
pci0 at mainbus0 bus 0
pchb0 at pci0 dev 0 function 0 "AMD AMD64 14h Host" rev 0x00
ppb0 at pci0 dev 4 function 0 "AMD AMD64 14h PCIE" rev 0x00: msi
pci1 at ppb0 bus 1
re0 at pci1 dev 0 function 0 "Realtek 8168" rev 0x06: RTL8168E/8111E (0x2c00), msi, address 00:0d:b9:33:8e:a4
rgephy0 at re0 phy 7: RTL8169S/8110S/8211 PHY, rev. 4
ppb1 at pci0 dev 5 function 0 "AMD AMD64 14h PCIE" rev 0x00: msi
pci2 at ppb1 bus 2
re1 at pci2 dev 0 function 0 "Realtek 8168" rev 0x06: RTL8168E/8111E (0x2c00), msi, address 00:0d:b9:33:8e:a5
rgephy1 at re1 phy 7: RTL8169S/8110S/8211 PHY, rev. 4
ppb2 at pci0 dev 6 function 0 "AMD AMD64 14h PCIE" rev 0x00: msi
pci3 at ppb2 bus 3
re2 at pci3 dev 0 function 0 "Realtek 8168" rev 0x06: RTL8168E/8111E (0x2c00), msi, address 00:0d:b9:33:8e:a6
rgephy2 at re2 phy 7: RTL8169S/8110S/8211 PHY, rev. 4
ahci0 at pci0 dev 17 function 0 "ATI SBx00 SATA" rev 0x40: apic 2 int 19, AHCI 1.2
ahci0: port 0: 6.0Gb/s
scsibus1 at ahci0: 32 targets
sd0 at scsibus1 targ 0 lun 0: SCSI3 0/direct fixed naa.5002538043584d30
sd0: 30533MB, 512 bytes/sector, 62533296 sectors, thin
ohci0 at pci0 d