Hi Chris,

Sorry for the slow reply. Day job takes up most of my time.

Anyway, I finally added some logging to /usr/src/sys/net/if_vlan.c etc.:

            if (m == NULL) {
                ifp->if_oerrors++;
                printf("Output Error due to NULL mbuff\n");
                continue;
            }
        }

        if (if_enqueue(ifp0, m)) {
            ifp->if_oerrors++;
            printf("Output Error from if_enqueue\n");
            continue;
        }

        ifp->if_opackets++;


Recompiled the kernel and rebooted onto it, and pushed traffic through it
(~50Mbps).

And sure enough, every single one of the VLAN output drops is due to
"if_enqueue(ifp0, m)" returning non-zero. I edited if.c and again
confirmed that IFQ_ENQUEUE does return the error.

Traced it further back to ifq.c:ifq_enqueue_try(), and rv (from rv =
ifq->ifq_ops->ifqop_enq(ifq, m);)  is 55 for every one of the VLAN output
drops.
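
For reference, 55 is ENOBUFS in the BSD errno numbering, which is the value a tail-drop enqueue would typically hand back when its queue is full. Below is a minimal sketch of that pattern only; `struct pkt`, `struct toy_queue` and `toy_enq()` are hypothetical names for illustration and are not the kernel's mbuf, ifqueue or priq code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a packet; not the kernel's struct mbuf. */
struct pkt {
    struct pkt *next;
};

/* Hypothetical bounded FIFO; not the kernel's struct ifqueue. */
struct toy_queue {
    struct pkt *head;
    struct pkt *tail;
    unsigned int len;
    unsigned int maxlen;
};

/*
 * Tail-drop enqueue: returns 0 on success, or ENOBUFS when the queue
 * already holds maxlen packets. A caller shaped like vlan_start()
 * would count any non-zero return as an output error.
 */
static int
toy_enq(struct toy_queue *q, struct pkt *p)
{
    assert(q != NULL && p != NULL);

    if (q->len >= q->maxlen)
        return (ENOBUFS);

    /* Append to the tail of the singly linked list. */
    p->next = NULL;
    if (q->tail == NULL)
        q->head = p;
    else
        q->tail->next = p;
    q->tail = p;
    q->len++;
    return (0);
}
```

With `maxlen` set to 2, the third call to `toy_enq()` returns ENOBUFS and `len` stays at 2, which is the same shape of failure as the rv seen here.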


Needed some help from a colleague to figure out what
ifq->ifq_ops->ifqop_enq(ifq, m) calls.

We believe it should be calling ifq.c:priq_enq(). Still don't understand
that glue part yet :( But after adding some logging on "if (ifq_len(ifq) >=
ifq->ifq_maxlen)" it doesn't seem to be that? So I have either made a
mistake or gone as far as my knowledge can go? Any _pointers_ guys? ;)


We do use HFSC (and have done since 5.0 without issues), but only on the
physical interface, not on the VLANs.

The reason for this is so that we can _share_ the whole of the 10Gig
interface root bandwidth across all of the VLANs on the same physical .1q
trunk. This has worked great for years without VLAN output errors. I think
this started after 5.8 or 5.9.

I increased the qlimits from the default but that made no difference.


queue trunk_root on $if_trunk bandwidth 4294M
    queue qlocal on $if_trunk parent trunk_root bandwidth 4.1G
        queue local_kern on $if_trunk parent qlocal bandwidth 8M min 8M burst 8M for 1000ms
        queue local_pri on $if_trunk parent qlocal bandwidth 150M min 150M burst 200M for 2500ms qlimit 500
        queue local_data on $if_trunk parent qlocal bandwidth 4G min 1G qlimit 1000
    queue qwan on $if_trunk parent trunk_root bandwidth 190M
        queue wan_rt on $if_trunk parent qwan bandwidth 30M min 19M burst 38M for 5000ms
        queue wan_int on $if_trunk parent qwan bandwidth 19M min 9M
        queue wan_pri on $if_trunk parent qwan bandwidth 19M min 10M burst 25M for 2000ms
        queue wan_vpn on $if_trunk parent qwan bandwidth 50M min 25M
        queue wan_web on $if_trunk parent qwan bandwidth 29M min 10M burst 19M for 3000ms
        queue wan_dflt on $if_trunk parent qwan bandwidth 19M min 10M burst 19M for 5000ms
        queue wan_bulk on $if_trunk parent qwan bandwidth 20M max 100M default

.

.

match out on INSIDE all received-on INSIDE queue (local_data,local_pri) set prio (2,4)


So all traffic flowing from one VLAN to another (on the same trunk) is in
queues local_data and local_pri. However, the queue statistics from
"systat queues 1" show that these large internal queues never drop a single
packet. Yet if_oerrors is still incrementing quite a lot for most of our
VLANs.


Hi Henning, whilst I have the code open, I am also going to have another go
at trying to find the missing 64-bit counter/range check etc. for the HFSC
queue size tomorrow (if I don't get dragged onto anything else).


Thanks for your time and help guys,

Kind regards, Andy Lemin



On Tue, Aug 9, 2016 at 2:48 AM, Chris Cappuccio <ch...@nmedia.net> wrote:

> Andy Lemin [a...@brandwatch.com] wrote:
> > The underlying trunk does not report any Rx or Tx errors at all.
> >
> > And the VLAN interfaces do not report any receive errors, only low rate
> > transmit errors.
> >
> > Also as a thought exercise, could anyone kindly explain/discuss how an
> > output error might even occur or be valid?
> >
>
> Look at /usr/src/sys/net/if_vlan.c, you'll find exactly two places where
> if_oerrors increments. Logically, both are in the vlan_start() routine.
> The first happens after vlan_inject fails. If vlan_inject returns a null
> mbuf, that appears to be a failure within m_prepend(), probably from
> failure to allocate memory for the new mbuf. Where's your dmesg? Are you
> using a card that does hw tagging? (If so, this isn't the codepath you're
> looking for.)
>
> If the failure is the new if_enqueue, it seems like ifq_enqueue would be
> calling priq_enq which would be returning a failure if the queue is full.
> Are you using hfsc?
>
> Chris
