Hi Chris,

Sorry for the slow reply. Day job takes up most of my time.

Anyway, I finally added some logging to /usr/src/sys/net/if_vlan.c etc.:

                if (m == NULL) {
                        ifp->if_oerrors++;
                        printf("Output Error due to NULL mbuff\n");
                        continue;
                }
        }

        if (if_enqueue(ifp0, m)) {
                ifp->if_oerrors++;
                printf("Output Error from if_enqueue\n");
                continue;
        }
        ifp->if_opackets++;

I recompiled the kernel, rebooted onto it, and pushed traffic through it
(~50Mbps). Sure enough, every single instance of the VLAN output drops is
due to "if_enqueue(ifp0, m)" being TRUE.

I edited if.c and again confirmed that IFQ_ENQUEUE does return the error.

I traced it further back to ifq.c:ifq_enqueue_try(), and rv (from
rv = ifq->ifq_ops->ifqop_enq(ifq, m);) is 55 for every one of the VLAN
output drops.

I needed some help from a colleague to figure out what
ifq->ifq_ops->ifqop_enq(ifq, m) calls. We believe it should be calling
ifq.c:priq_enq(). I still don't understand that glue part yet :(

But after adding some logging on "if (ifq_len(ifq) >= ifq->ifq_maxlen)",
it doesn't seem to be that check that fails? So I have either made a
mistake or gone as far as my knowledge can go. Any _pointers_ guys? ;)

We do use HFSC (and have done since 5.0 without issues), but only on the
physical interface, not on the VLANs. The reason for this is so that we can
_share_ the whole of the 10Gig interface root bandwidth across all of the
VLANs on the same physical .1q trunk. This has worked great for years
without VLAN output errors. I think this started after 5.8 or 5.9.

I increased the qlimits from the default but that made no difference.

queue trunk_root on $if_trunk bandwidth 4294M
queue qlocal on $if_trunk parent trunk_root bandwidth 4.1G
queue local_kern on $if_trunk parent qlocal bandwidth 8M min 8M burst 8M for 1000ms
queue local_pri on $if_trunk parent qlocal bandwidth 150M min 150M burst 200M for 2500ms qlimit 500
queue local_data on $if_trunk parent qlocal bandwidth 4G min 1G qlimit 1000
queue qwan on $if_trunk parent trunk_root bandwidth 190M
queue wan_rt on $if_trunk parent qwan bandwidth 30M min 19M burst 38M for 5000ms
queue wan_int on $if_trunk parent qwan bandwidth 19M min 9M
queue wan_pri on $if_trunk parent qwan bandwidth 19M min 10M burst 25M for 2000ms
queue wan_vpn on $if_trunk parent qwan bandwidth 50M min 25M
queue wan_web on $if_trunk parent qwan bandwidth 29M min 10M burst 19M for 3000ms
queue wan_dflt on $if_trunk parent qwan bandwidth 19M min 10M burst 19M for 5000ms
queue wan_bulk on $if_trunk parent qwan bandwidth 20M max 100M default
.
.
match out on INSIDE all received-on INSIDE queue (local_data,local_pri) set prio (2,4)

So all traffic flowing from one VLAN to another (on the same trunk) is
placed in queues local_data and local_pri. However, the queue statistics
from "systat queues 1" show that these large internal queues never drop a
single packet. Yet if_oerrors is still incrementing quite a lot for most of
our VLANs.

Hi Henning, whilst I have the code open, I am also going to have another go
tomorrow at trying to find the missing 64bit counter/range check etc. for
the HFSC queue size (if I don't get dragged onto anything else).

Thanks for your time and help guys,
Kind regards, Andy Lemin

On Tue, Aug 9, 2016 at 2:48 AM, Chris Cappuccio <ch...@nmedia.net> wrote:
> Andy Lemin [a...@brandwatch.com] wrote:
> > The underlying trunk does not report any Rx or Tx errors at all.
> >
> > And the VLAN interfaces do not report any receive errors, only low rate
> > transmit errors.
> >
> > Also as a thought exercise, could anyone kindly explain/discuss how an
> > output error might even occur or be valid?
> >
>
> Look at /usr/src/sys/net/if_vlan.c, you'll find exactly two places where
> if_oerrors increments. Logically, both are in the vlan_start() routine.
> The first happens after vlan_inject fails. If vlan_inject returns a null
> mbuf, that appears to be a failure within m_prepend(), probably from
> failure to allocate memory for the new mbuf. Where's your dmesg? Are you
> using a card that does hw tagging? (If so, this isn't the codepath you're
> looking for.)
>
> If the failure is the new if_enqueue, it seems like ifq_enqueue would be
> calling priq_enq which would be returning a failure if the queue is full.
> Are you using hfsc?
>
> Chris
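
The "glue" in ifq->ifq_ops->ifqop_enq(ifq, m) is a call through a table of function pointers, so which enqueue routine runs depends on which table is attached to the queue at the time. Below is a minimal user-space sketch of that pattern, not the kernel code. Every toy_* name is hypothetical and invented for illustration. The queue-full check mirrors the ifq_len(ifq) >= ifq->ifq_maxlen test quoted in the message above. Returning ENOBUFS is an assumption based on the observed value: on OpenBSD, errno 55 is ENOBUFS.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are the kernel's. */
struct toy_pkt {
	int id;
};

struct toy_ifq;

/* Ops table: selects the queueing discipline at run time. */
struct toy_ifq_ops {
	int (*enq)(struct toy_ifq *, struct toy_pkt *);
};

struct toy_ifq {
	const struct toy_ifq_ops *ops;	/* replaced when a discipline is attached */
	unsigned int len;		/* packets currently queued */
	unsigned int maxlen;		/* queue limit */
	unsigned long oerrors;		/* stand-in for if_oerrors */
	unsigned long opackets;		/* stand-in for if_opackets */
};

/* Default discipline: refuse the packet once the queue is full. */
static int
toy_priq_enq(struct toy_ifq *q, struct toy_pkt *p)
{
	(void)p;	/* a real queue would link the packet in here */
	if (q->len >= q->maxlen)
		return ENOBUFS;	/* assumed error code, see lead-in */
	q->len++;
	return 0;
}

static const struct toy_ifq_ops toy_priq_ops = { toy_priq_enq };

/* Attach the default discipline and reset the counters. */
static void
toy_ifq_init(struct toy_ifq *q, unsigned int maxlen)
{
	assert(maxlen > 0);
	q->ops = &toy_priq_ops;
	q->len = 0;
	q->maxlen = maxlen;
	q->oerrors = 0;
	q->opackets = 0;
}

/* Caller side: dispatch through the ops table and count the outcome. */
static int
toy_output(struct toy_ifq *q, struct toy_pkt *p)
{
	int rv = q->ops->enq(q, p);

	if (rv != 0)
		q->oerrors++;
	else
		q->opackets++;
	return rv;
}
```

With toy_ifq_init(&q, 2), the first two toy_output() calls succeed and the third returns ENOBUFS and bumps oerrors. A different discipline would supply its own ops table with its own enq routine and its own limits. If HFSC does this on the physical interface (an assumption here, not verified against the source), logging placed only in priq_enq() would never fire for those packets.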