On Tue, Aug 15, 2017 at 17:14 +0200, Mike Belopuhov wrote:
> Hi,
>
> I've just triggered an assert in hfsc_deferred (a callout) on an
> MP kernel running on an SP virtual machine:
>
> panic: kernel diagnostic assertion "HFSC_ENABLED(ifq)" failed: file
> "/home/mike/src/openbsd/sys/net/hfsc.c", line 950
> Stopped at db_enter+0x9: leave
> TIDPIDUID PRFLAGS PFLAGS CPU COMMAND
> *247463 28420 0 0x3 00 pfctl
> db_enter() at db_enter+0x9
>
> panic(817f78f0,4,81a3ffc0,8110c140,800c2060,fff
> f81598b1c) at panic+0x102
> __assert(81769d93,817d7350,3b6,817d72bd) at
> __assert+0x
> 35
> hfsc_deferred(800c2060) at hfsc_deferred+0x9e
> timeout_run(8004adc8) at timeout_run+0x4c
> softclock(0) at softclock+0x146
> softintr_dispatch(0) at softintr_dispatch+0x9f
> Xsoftclock() at Xsoftclock+0x1f
> --- interrupt ---
> end of kernel
> end trace frame: 0x728d481974c08548, count: 7
> 0x2cfe9c031c9:
> https://www.openbsd.org/ddb.html describes the minimum info required in bug
> reports. Insufficient info makes it difficult to find and fix bugs.
> ddb{0}> ps
> PID TID PPIDUID S FLAGS WAIT COMMAND
> *28420 247463 5000 0 7 0x3pfctl
>
>
> pfctl runs in the loop reloading the ruleset. So at some point we
> disable HFSC on the interface but lose a race with hfsc_deferred
> before re-enabling it.
>
> IFQ has a mechanism to lock the underlying object and I believe this
> is the right tool for this job. Any other ideas?
>
> I don't think it's a good idea to hold the mutex (ifq_q_enter and
> ifq_q_leave effectively lock and unlock it) during the ifq_start,
> so we have to make a concession and run the ifq_start before knowing
> whether or not HFSC is attached. IMO, it's a small price to pay to
> avoide clutter. Kernel lock assertion is pointless at this point.
>
> OK?
>
I've been running with this while debugging the issue with the active
class list ("panic: kernel diagnostic assertion" from Aug 12 on bugs@)
and I'm quite confident that this works and I don't observe the race
anymore.
In addition, I've figured we can keep the HFSC_ENABLED check as there
is no issue with bailing early here:
diff --git sys/net/hfsc.c sys/net/hfsc.c
index 12504267dc5..c51f1406a0b 100644
--- sys/net/hfsc.c
+++ sys/net/hfsc.c
@@ -950,10 +950,13 @@ hfsc_deferred(void *arg)
{
struct ifnet *ifp = arg;
struct ifqueue *ifq = >if_snd;
struct hfsc_if *hif;
+ if (!HFSC_ENABLED(ifq))
+ return;
+
if (!ifq_empty(ifq))
ifq_start(ifq);
hif = ifq_q_enter(>if_snd, ifq_hfsc_ops);
if (hif == NULL)
> diff --git sys/net/hfsc.c sys/net/hfsc.c
> index 410bea733c6..3c5b6f6ef78 100644
> --- sys/net/hfsc.c
> +++ sys/net/hfsc.c
> @@ -944,20 +944,19 @@ hfsc_deferred(void *arg)
> {
> struct ifnet *ifp = arg;
> struct ifqueue *ifq = >if_snd;
> struct hfsc_if *hif;
>
> - KERNEL_ASSERT_LOCKED();
> - KASSERT(HFSC_ENABLED(ifq));
> -
> if (!ifq_empty(ifq))
> ifq_start(ifq);
>
> - hif = ifq->ifq_q;
> -
> + hif = ifq_q_enter(>if_snd, ifq_hfsc_ops);
> + if (hif == NULL)
> + return;
> /* XXX HRTIMER nearest virtual/fit time is likely less than 1/HZ. */
> timeout_add(>hif_defer, 1);
> + ifq_q_leave(>if_snd, hif);
> }
>
> void
> hfsc_cl_purge(struct hfsc_if *hif, struct hfsc_class *cl, struct mbuf_list
> *ml)
> {