Re: pf overlapping IPv6 fragments

2016-11-18 Thread Alexandr Nedvedicky
Hello,

> more strictly here.  Drop the whole fragment state if IPv6 fragments
> appear which have invalid length, fragment-offset or more-fragment-bit.

I like the idea being strict here. I don't like 'goto overlap_fragment'.
the 'overlap_fragment' as a name of jump target is bit confusing. Comment
talks about 'Non terminal fragments' and code jumps to 'overlap_fragment'
>   /* Non terminal fragments must have more fragments flag */
>   if (frent->fe_off + frent->fe_len < total && !frent->fe_mff)
> - goto bad_fragment;
> + goto overlap_fragment;

how about using 'goto free_ipv6_frag' ? It better explains, what's
going to happen.

sorry for bikeshedding
regards
sasha



Re: Intel 10GbE (ix) driver update - Looking for tests

2016-11-18 Thread Mike Belopuhov
On Fri, Nov 18, 2016 at 15:37 +0100, Mike Belopuhov wrote:
> The remaining diff after the interrupt change was committed.  I have
> kept KERNEL_LOCK/UNLOCK dance since those functions can run in
> parallel with similar code triggered by ifconfig so better safe than
> sorry.
> 
> I still need to go through changes in ixgbe_initialize_receive_units.
> One thing that is obvious is that Intel driver calls RSS setup even
> if one queue is configured which we haven't done before.  My and
> others tests don't show any regression regarding this however.
> 
> "Wait for a last completion before clearing buffers" change in the
> ixgbe_clear_tx_pending was committed upstream without any explanation
> but "looks safe".
> 
> FCRTH related change also looks strange, needs to get checked against
> documentation.
>

Quoting the spec for 82599: "The content of the Flow Control Receive
Threshold High (FCRTH) register determines at what point the 82599
transmits the first PAUSE frame." Then it mentions that FCRTL (low
threshold) when enabled controls sending of XON messages.

There are two modes of operation regarding flow control: the regular
one and Priority Flow Control that we don't enable.  Then depending
on a few other features (DCB, Flow director) and whether or not we
enable jumbo frames the FCRTH is set relative to the packet buffer
size (which is 384k for X540 and 512k for 82599).

The way FCRTH is calculated in *our* case (no DCB, no flow director,
etc) is basically Rx packet buffer size minus the delay value which
is somewhere in the range of [24..60]k.  But the changed code below
is a branch that handles the default not configured case: it sets
FCRTL to 0, disable XON and low threshold and seems to be setting
the FCRTH to a "default" value, which might or might not be poorly
chosen.  However it definitely makes more sense than the "-32" we
have there now.

> Ditto regarding changes regarding PHY power and ixgbe_handle_mod.
> Hrvoje has already reported that X550 SFP doesn't seem to be able to
> detect different SFP+ modules when replugged as opposed to X520.
> 

ixgbe_handle_mod change is relevant when you swap different SFP+
modules (fiber, copper, etc), it also handles the unknown PHY
case by dropping out early.  I intend to commit this as one of
the last changes so that it will get a bit more testing.

The PHY power change is explained here:
https://svnweb.freebsd.org/base?view=revision&revision=295093
"This fixes link not detected on X540-AT2 after booting to Linux
which turns the phy power off on detach" -- explains why we
we haven't seen this while some users have reported weird
problems with X540...

diff --git sys/dev/pci/files.pci sys/dev/pci/files.pci
index a6b91fb..34ce9bf 100644
--- sys/dev/pci/files.pci
+++ sys/dev/pci/files.pci
@@ -351,20 +351,21 @@ file  dev/pci/ixgb_ee.c   ixgb
 file   dev/pci/ixgb_hw.c   ixgb
 
 # Intel 82598 10GbE
 device ix: ether, ifnet, ifmedia
 attach ix at pci
 file   dev/pci/if_ix.c ix
 file   dev/pci/ixgbe.c ix
 file   dev/pci/ixgbe_82598.c   ix
 file   dev/pci/ixgbe_82599.c   ix
 file   dev/pci/ixgbe_x540.cix
+file   dev/pci/ixgbe_x550.cix
 file   dev/pci/ixgbe_phy.c ix
 
 # Neterion Xframe 10 Gigabit ethernet 
 device xge: ether, ifnet, ifmedia
 attach xge  at pci
 file   dev/pci/if_xge.cxge
 
 # NetXen NX2031/NX2035 10Gb Ethernet
 device nxe: ether, ifnet, ifmedia
 attach nxe at pci
diff --git sys/dev/pci/if_ix.c sys/dev/pci/if_ix.c
index c41a443..755d40d 100644
--- sys/dev/pci/if_ix.c
+++ sys/dev/pci/if_ix.c
@@ -64,24 +64,34 @@ const struct pci_matchid ixgbe_devices[] = {
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82598_DA_DUAL },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_KX4 },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_KX4_MEZZ },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_XAUI },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_COMBO_BP },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_BPLANE_FCOE },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_CX4 },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_T3_LOM },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_SFP },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_SFP_EM },
+   { PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_SFP_SF_QP },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_SFP_SF2 },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_SFP_FCOE },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599EN_SFP },
+   { PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_QSFP_SF_QP },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_X540T },
+   { PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_X540T1 },
+   { PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_X550T },
+   { PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_X550T1 },
+   { PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_X550EM_X_KX4 },
+   { PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_X550EM_X_KR },
+   { PC

iwm ack rates, again

2016-11-18 Thread Stefan Sperling
We inherited a bug(?) from Linux where the iwm driver adds all mandatory
OFDM rates (up to 24Mbit/s) to the ACK rate set if the AP does not
advertise _any_ basic OFDM rates. This kills traffic at the edge of
the WLAN cell because e.g. RTS is sent at 24Mbit/s instead of 6Mbit/s
and rarely gets through to the AP.
(I have not been able to convince the firmware to send RTS at lower
rates than 6 Mbit/s while in 11n mode.)

This code was also inconsistent with the comments in this function
which confidently state in two places (emphasis mine):

  ""add all mandatory rates that are LOWER THAN all of the
basic rates to these bitmaps."""

  """ just add ANY LOWER rates to the ACK rate bitmap."""

So then instead of adding any lower rates, they add all of them if
none was given?!? Let's just add the lowest one if none was given.
Invisible in this diff: The lowest rates are always added by code below.

With this, ssh connections to a laptop at the edge of the WLAN cell
remain somewhat usable during pkg_add -u.

Index: dev/pci/if_iwm.c
==
--- dev/pci/if_iwm.c
+++ dev/pci/if_iwm.c
@@ -4916,12 +4916,12 @@
 iwm_ack_rates(struct iwm_softc *sc, struct iwm_node *in, int *cck_rates,
 int *ofdm_rates)
 {
struct ieee80211_node *ni = &in->in_ni;
struct ieee80211_rateset *rs = &ni->ni_rates;
-   int lowest_present_ofdm = 100;
-   int lowest_present_cck = 100;
+   int lowest_present_ofdm = -1;
+   int lowest_present_cck = -1;
uint8_t cck = 0;
uint8_t ofdm = 0;
int i;
 
if (ni->ni_chan == IEEE80211_CHAN_ANYC ||
@@ -4928,19 +4928,19 @@
IEEE80211_IS_CHAN_2GHZ(ni->ni_chan)) {
for (i = IWM_FIRST_CCK_RATE; i < IWM_FIRST_OFDM_RATE; i++) {
if ((iwm_ridx2rate(rs, i) & IEEE80211_RATE_BASIC) == 0)
continue;
cck |= (1 << i);
-   if (lowest_present_cck > i)
+   if (lowest_present_cck == -1 || lowest_present_cck > i)
lowest_present_cck = i;
}
}
for (i = IWM_FIRST_OFDM_RATE; i <= IWM_LAST_NON_HT_RATE; i++) {
if ((iwm_ridx2rate(rs, i) & IEEE80211_RATE_BASIC) == 0)
continue;   
ofdm |= (1 << (i - IWM_FIRST_OFDM_RATE));
-   if (lowest_present_ofdm > i)
+   if (lowest_present_ofdm == -1 || lowest_present_ofdm > i)
lowest_present_ofdm = i;
}
 
/*
 * Now we've got the basic rates as bitmaps in the ofdm and cck



Re: reduce block ack Rx latency

2016-11-18 Thread Stefan Sperling
On Fri, Nov 18, 2016 at 03:01:34PM +0100, Stefan Sperling wrote:
> On Fri, Nov 18, 2016 at 02:37:18PM +0100, Stefan Sperling wrote:
> > While playing around with rate scaling, and testing behaviour when
> > increasing the distance to the AP, I noticed that a lot of successfully
> > received frames end up getting stuck or discarded in the block ack
> > buffer logic.
> > 
> > In this situation, I see icmp echo replies arrive in the output
> > of tcpdump -i iwm0 -y IEEE802_11_RADIO -v, but those replies don't
> > reach the ping program in userspace. And when they do, they arrive
> > with a huge delay (up to several seconds).
> > 
> > This diff tunes block ack parameters to vastly reduce buffering timeout.
> > With this, as long as we receive echo replies, they are very likely to
> > reach the ping program. In my testing, some pings still get lost and
> > some get delayed, but it seems latency now depends mostly on how fast
> > the AP chooses a lower data rate -- and once the incoming rate settles
> > at 1Mbit/s, traffic flows smoothly again.
> > 
> > Under ideal wifi signal conditions, this diff should make no difference.
> > 
> > OK?
> 
> A bit of more testing has shown that a 5 msec gap timeout is a bit
> too aggressive when other traffic is passed besides ping. It can
> cause frames to arrive behind the BA window, and we're dropping those.
> In such cases 10 msec seems to work better.

And another diff, which adds another idea on top:

Resync the block ack window backwards (as well as forwards), when we
lose synchronization. The 802.11n standard commands we drop all packets
"before" the window, but well... several details of this standards were
"well intended".
This change should address a situation I saw where echo replies were
received but dropped because they were counted as arriving "before"
the window, and not "after" the window.

However, it comes at a cost of occasional dupes during normal operation,
e.g. in case the sender did not receive our block ack and sends the whole
aggregate again.

Contrary to previous diffs, I'm not changing the MAX_WINMISS threshold
in this diff (currently set to 8 frames). This seems to work alright
and not cause many dupes.

Index: ieee80211_node.h
===
RCS file: /cvs/src/sys/net80211/ieee80211_node.h,v
retrieving revision 1.63
diff -u -p -r1.63 ieee80211_node.h
--- ieee80211_node.h21 Sep 2016 12:21:27 -  1.63
+++ ieee80211_node.h18 Nov 2016 15:15:45 -
@@ -146,7 +146,7 @@ struct ieee80211_rx_ba {
u_int16_t   ba_winsize;
u_int16_t   ba_head;
struct timeout  ba_gap_to;
-#define IEEE80211_BA_GAP_TIMEOUT   300 /* msec */
+#define IEEE80211_BA_GAP_TIMEOUT   10 /* msec */
/* Counter for consecutive frames which missed the BA window. */
int ba_winmiss;
/* Sequence number of previous frame which missed the BA window. */
Index: ieee80211_input.c
===
RCS file: /cvs/src/sys/net80211/ieee80211_input.c,v
retrieving revision 1.180
diff -u -p -r1.180 ieee80211_input.c
--- ieee80211_input.c   21 Sep 2016 12:21:27 -  1.180
+++ ieee80211_input.c   18 Nov 2016 15:14:23 -
@@ -708,10 +708,31 @@ ieee80211_input_ba(struct ieee80211com *
 
if (SEQ_LT(sn, ba->ba_winstart)) {  /* SN < WinStartB */
ic->ic_stats.is_ht_rx_frame_below_ba_winstart++;
-   m_freem(m); /* discard the MPDU */
-   return;
-   }
-   if (SEQ_LT(ba->ba_winend, sn)) {/* WinEndB < SN */
+   /* 
+* Check whether we're consistently ahead of the window,
+* and let the window jump backwards if neccessary.
+*/
+   if (ba->ba_winmiss < IEEE80211_BA_MAX_WINMISS) { 
+   if (ba->ba_missedsn == ((sn - 1) & 0xfff))
+   ba->ba_winmiss++;
+   else
+   ba->ba_winmiss = 0;
+   ba->ba_missedsn = sn;
+   ifp->if_ierrors++;
+   m_freem(m); /* discard the MPDU */
+   return;
+   } else {
+   /* It appears the window has moved for real. */
+   ic->ic_stats.is_ht_rx_ba_window_jump++;
+   ba->ba_winmiss = 0;
+   ba->ba_missedsn = 0;
+   /* Flush our current window. */
+   ieee80211_ba_move_window(ic, ni, tid, ba->ba_winend);
+   /* Reset the window. */
+   ba->ba_winstart = sn;
+   ieee80211_input_ba_flush(ic, ni, ba);
+   }
+   } else if (SEQ_LT(ba->ba_winend, sn)) { /* WinEndB < SN */
ic->ic_stats.is_ht_rx_frame_above_ba_winend++;

Re: pf af-to route output

2016-11-18 Thread Alexander Bluhm
On Mon, Nov 14, 2016 at 04:38:07PM +0100, Alexander Bluhm wrote:
> Hi,
> 
> The !r->rt case is only used by af-to.  pf_route6() calls ip6_output()
> to do the work while pf_route() has some custom implementation for
> that.  It is simpler to call ip_output() or ip6_output() from
> pf_test() directly.
> 
> ok?

anyone?

> 
> bluhm
> 
> Index: net/pf.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/pf.c,v
> retrieving revision 1.998
> diff -u -p -r1.998 pf.c
> --- net/pf.c  14 Nov 2016 13:25:00 -  1.998
> +++ net/pf.c  14 Nov 2016 14:08:57 -
> @@ -5820,50 +5820,34 @@ pf_route(struct pf_pdesc *pd, struct pf_
>   dst->sin_addr = ip->ip_dst;
>   rtableid = m0->m_pkthdr.ph_rtableid;
>  
> - if (!r->rt) {
> - rt = rtalloc(sintosa(dst), RT_RESOLVE, rtableid);
> - if (rt == NULL) {
> - ipstat_inc(ips_noroute);
> + if (s == NULL) {
> + bzero(sns, sizeof(sns));
> + if (pf_map_addr(AF_INET, r,
> + (struct pf_addr *)&ip->ip_src,
> + &naddr, NULL, sns, &r->route, PF_SN_ROUTE)) {
> + DPFPRINTF(LOG_ERR,
> + "pf_route: pf_map_addr() failed.");
>   goto bad;
>   }
>  
> - ifp = if_get(rt->rt_ifidx);
> -
> - if (rt->rt_flags & RTF_GATEWAY)
> - dst = satosin(rt->rt_gateway);
> -
> - m0->m_pkthdr.pf.flags |= PF_TAG_GENERATED;
> + if (!PF_AZERO(&naddr, AF_INET))
> + dst->sin_addr.s_addr = naddr.v4.s_addr;
> + ifp = r->route.kif ?
> + r->route.kif->pfik_ifp : NULL;
>   } else {
> - if (s == NULL) {
> - bzero(sns, sizeof(sns));
> - if (pf_map_addr(AF_INET, r,
> - (struct pf_addr *)&ip->ip_src,
> - &naddr, NULL, sns, &r->route, PF_SN_ROUTE)) {
> - DPFPRINTF(LOG_ERR,
> - "pf_route: pf_map_addr() failed.");
> - goto bad;
> - }
> -
> - if (!PF_AZERO(&naddr, AF_INET))
> - dst->sin_addr.s_addr = naddr.v4.s_addr;
> - ifp = r->route.kif ?
> - r->route.kif->pfik_ifp : NULL;
> - } else {
> - if (!PF_AZERO(&s->rt_addr, AF_INET))
> - dst->sin_addr.s_addr =
> - s->rt_addr.v4.s_addr;
> - ifp = s->rt_kif ? s->rt_kif->pfik_ifp : NULL;
> - }
> -
> - rt = rtalloc(sintosa(dst), RT_RESOLVE, rtableid);
> - if (rt == NULL) {
> - ipstat_inc(ips_noroute);
> - goto bad;
> - }
> + if (!PF_AZERO(&s->rt_addr, AF_INET))
> + dst->sin_addr.s_addr =
> + s->rt_addr.v4.s_addr;
> + ifp = s->rt_kif ? s->rt_kif->pfik_ifp : NULL;
>   }
>   if (ifp == NULL)
>   goto bad;
>  
> + rt = rtalloc(sintosa(dst), RT_RESOLVE, rtableid);
> + if (rt == NULL) {
> + ipstat_inc(ips_noroute);
> + goto bad;
> + }
>  
>   if (pd->kif->pfik_ifp != ifp) {
>   if (pf_test(AF_INET, PF_OUT, ifp, &m0) != PF_PASS)
> @@ -5928,8 +5912,6 @@ pf_route(struct pf_pdesc *pd, struct pf_
>  done:
>   if (r->rt != PF_DUPTO)
>   pd->m = NULL;
> - if (!r->rt)
> - if_put(ifp);
>   rtfree(rt);
>   return;
>  
> @@ -5982,12 +5964,6 @@ pf_route6(struct pf_pdesc *pd, struct pf
>   dst->sin6_addr = ip6->ip6_dst;
>   rtableid = m0->m_pkthdr.ph_rtableid;
>  
> - if (!r->rt) {
> - m0->m_pkthdr.pf.flags |= PF_TAG_GENERATED;
> - ip6_output(m0, NULL, NULL, 0, NULL, NULL);
> - goto done;
> - }
> -
>   if (s == NULL) {
>   bzero(sns, sizeof(sns));
>   if (pf_map_addr(AF_INET6, r, (struct pf_addr *)&ip6->ip6_src,
> @@ -6908,10 +6884,28 @@ done:
>   action = PF_DROP;
>   break;
>   }
> - if (pd.naf == AF_INET)
> - pf_route(&pd, r, s);
> - if (pd.naf == AF_INET6)
> - pf_route6(&pd, r, s);
> + if (r->rt) {
> + switch (pd.naf) {
> + case AF_INET:
> + pf_route(&pd, r, s);
> + break;
> + case AF_INET6:
> + pf_route6(&pd, r, s);
> + break;
> + }
> + }
> + if (pd.m) {
> + pd.m->m_pkthdr.pf.flags |= PF_TAG_GENERATED;
> + 

Re: Intel 10GbE (ix) driver update - Looking for tests

2016-11-18 Thread Mike Belopuhov
On Fri, Nov 18, 2016 at 13:27 +0100, Mike Belopuhov wrote:
> On Thu, Nov 17, 2016 at 23:23 +0100, Mike Belopuhov wrote:
> > On Thu, Nov 17, 2016 at 22:24 +0100, Mike Belopuhov wrote:
> > > On Wed, Nov 16, 2016 at 23:04 +0100, Mike Belopuhov wrote:
> > > > Hi,
> > > > 
> > > > I've done a massive update of our ix(4) driver that brings
> > > > support for X550 family of controllers including those
> > > > integrated into new Xeon chips as well as QSFP support for
> > > > X520 (82599) but this needs thorough testing.  If you're
> > > > using Intel 10Gb controllers, please make sure that you
> > > > either (or both!) test the complete diff found at this URL:
> > > > http://gir.theapt.org/~mike/ixgbe.diff or next few snapshots
> > > > that will (hopefully) contain bits of this monster diff.
> > > > 
> > > > To test the monster diff, make sure that you are running a
> > > > recent snapshot and your kernel source code is up-to-date,
> > > > then reset a few files to the specified revisions and
> > > > remove the support file for X550:
> > > > 
> > > > % pwd
> > > > /usr/src
> > > > % cvs up -r1.326 sys/dev/pci/files.pci
> > > > % cvs up -r1.133 sys/dev/pci/if_ix.c
> > > > % cvs up -r1.14 sys/dev/pci/ixgbe.c
> > > > % cvs up -r1.23 sys/dev/pci/ixgbe.h
> > > > % cvs up -r1.11 sys/dev/pci/ixgbe_82598.c
> > > > % cvs up -r1.12 sys/dev/pci/ixgbe_82599.c
> > > > % cvs up -r1.13 sys/dev/pci/ixgbe_phy.c
> > > > % cvs up -r1.22 sys/dev/pci/ixgbe_type.h
> > > > % cvs up -r1.4 sys/dev/pci/ixgbe_x540.c
> > > > % rm -f sys/dev/pci/ixgbe_x550.c
> > > > 
> > > > To verify that files have been reset:
> > > > 
> > > > % pwd
> > > > /usr/src
> > > > % fgrep "//T1" sys/dev/pci/CVS/Entries
> > > > /files.pci/1.326/Mon Sep 12 09:45:53 2016//T1.326
> > > > /if_ix.c/1.133/Thu Oct 27 05:00:50 2016//T1.133
> > > > /ixgbe.c/1.14/Wed Nov 26 17:03:52 2014//T1.14
> > > > /ixgbe.h/1.23/Tue Oct  4 09:24:02 2016//T1.23
> > > > /ixgbe_82598.c/1.11/Mon Aug  5 19:58:06 2013//T1.11
> > > > /ixgbe_82599.c/1.12/Fri May  1 04:15:00 2015//T1.12
> > > > /ixgbe_phy.c/1.13/Fri May  1 04:15:00 2015//T1.13
> > > > /ixgbe_type.h/1.22/Wed Nov 16 21:53:57 2016//T1.22
> > > > /ixgbe_x540.c/1.4/Wed May 20 14:34:27 2015//T1.4
> > > > 
> > > > And then test and apply the diff:
> > > > 
> > > > % pwd
> > > > /usr/src
> > > > % patch -Csp0  > > > 
> > > > Make sure to reset files every time the source tree gets updated.
> > > > 
> > > > Cheers,
> > > > Mike
> > > 
> > > As of today and file revisions below, most of the boilerplate
> > > code is now committed.  I've tried very hard to not introduce
> > > any noticeable changes in behavior so far.
> > > 
> > > if_ix.c/1.133
> > > ixgbe_82598.c/1.15
> > > ixgbe_x550.c/1.1
> > > ixgbe.c/1.19
> > > ixgbe.h/1.26
> > > ixgbe_82599.c/1.16
> > > ixgbe_x540.c/1.9
> > > ixgbe_type.h/1.29
> > > ixgbe_phy.c/1.18
> > > 
> > > Remaining bits are below.  I'll start picking at low hanging fruits,
> > > but won't mind another pair of eyes.
> > > 
> > 
> > I've just realised I forgot to commit one small bit that is
> > also not part of this diff that might fail the kernel compile.
> > I'll post an update in about 12 hours.
> 
> OK, the issue is resolved now. ixgbe_x550.c should be at 1.2 now.
> 
> Here's an updated diff (Hrvoje Popovski has found out that
> enable_tx_laser function pointers can be not set by X550).
> 

The remaining diff after the interrupt change was committed.  I have
kept KERNEL_LOCK/UNLOCK dance since those functions can run in
parallel with similar code triggered by ifconfig so better safe than
sorry.

I still need to go through changes in ixgbe_initialize_receive_units.
One thing that is obvious is that Intel driver calls RSS setup even
if one queue is configured which we haven't done before.  My and
others tests don't show any regression regarding this however.

"Wait for a last completion before clearing buffers" change in the
ixgbe_clear_tx_pending was committed upstream without any explanation
but "looks safe".

FCRTH related change also looks strange, needs to get checked against
documentation.

Ditto regarding changes regarding PHY power and ixgbe_handle_mod.
Hrvoje has already reported that X550 SFP doesn't seem to be able to
detect different SFP+ modules when replugged as opposed to X520.

diff --git sys/dev/pci/files.pci sys/dev/pci/files.pci
index a6b91fb..34ce9bf 100644
--- sys/dev/pci/files.pci
+++ sys/dev/pci/files.pci
@@ -351,20 +351,21 @@ file  dev/pci/ixgb_ee.c   ixgb
 file   dev/pci/ixgb_hw.c   ixgb
 
 # Intel 82598 10GbE
 device ix: ether, ifnet, ifmedia
 attach ix at pci
 file   dev/pci/if_ix.c ix
 file   dev/pci/ixgbe.c ix
 file   dev/pci/ixgbe_82598.c   ix
 file   dev/pci/ixgbe_82599.c   ix
 file   dev/pci/ixgbe_x540.cix
+file   dev/pci/ixgbe_x550.cix

Re: reduce block ack Rx latency

2016-11-18 Thread Stefan Sperling
On Fri, Nov 18, 2016 at 02:37:18PM +0100, Stefan Sperling wrote:
> While playing around with rate scaling, and testing behaviour when
> increasing the distance to the AP, I noticed that a lot of successfully
> received frames end up getting stuck or discarded in the block ack
> buffer logic.
> 
> In this situation, I see icmp echo replies arrive in the output
> of tcpdump -i iwm0 -y IEEE802_11_RADIO -v, but those replies don't
> reach the ping program in userspace. And when they do, they arrive
> with a huge delay (up to several seconds).
> 
> This diff tunes block ack parameters to vastly reduce buffering timeout.
> With this, as long as we receive echo replies, they are very likely to
> reach the ping program. In my testing, some pings still get lost and
> some get delayed, but it seems latency now depends mostly on how fast
> the AP chooses a lower data rate -- and once the incoming rate settles
> at 1Mbit/s, traffic flows smoothly again.
> 
> Under ideal wifi signal conditions, this diff should make no difference.
> 
> OK?

A bit of more testing has shown that a 5 msec gap timeout is a bit
too aggressive when other traffic is passed besides ping. It can
cause frames to arrive behind the BA window, and we're dropping those.
In such cases 10 msec seems to work better.

Index: ieee80211_node.h
===
RCS file: /cvs/src/sys/net80211/ieee80211_node.h,v
retrieving revision 1.63
diff -u -p -r1.63 ieee80211_node.h
--- ieee80211_node.h21 Sep 2016 12:21:27 -  1.63
+++ ieee80211_node.h18 Nov 2016 13:46:35 -
@@ -146,13 +146,13 @@ struct ieee80211_rx_ba {
u_int16_t   ba_winsize;
u_int16_t   ba_head;
struct timeout  ba_gap_to;
-#define IEEE80211_BA_GAP_TIMEOUT   300 /* msec */
+#define IEEE80211_BA_GAP_TIMEOUT   10 /* msec */
/* Counter for consecutive frames which missed the BA window. */
int ba_winmiss;
/* Sequence number of previous frame which missed the BA window. */
uint16_tba_missedsn;
/* Window moves forward after this many frames have missed it. */
-#define IEEE80211_BA_MAX_WINMISS   8
+#define IEEE80211_BA_MAX_WINMISS   3
 
uint8_t ba_token;
 };



reduce block ack Rx latency

2016-11-18 Thread Stefan Sperling
While playing around with rate scaling, and testing behaviour when
increasing the distance to the AP, I noticed that a lot of successfully
received frames end up getting stuck or discarded in the block ack
buffer logic.

In this situation, I see icmp echo replies arrive in the output
of tcpdump -i iwm0 -y IEEE802_11_RADIO -v, but those replies don't
reach the ping program in userspace. And when they do, they arrive
with a huge delay (up to several seconds).

This diff tunes block ack parameters to vastly reduce buffering timeout.
With this, as long as we receive echo replies, they are very likely to
reach the ping program. In my testing, some pings still get lost and
some get delayed, but it seems latency now depends mostly on how fast
the AP chooses a lower data rate -- and once the incoming rate settles
at 1Mbit/s, traffic flows smoothly again.

Under ideal wifi signal conditions, this diff should make no difference.

OK?

Index: ieee80211_node.h
===
RCS file: /cvs/src/sys/net80211/ieee80211_node.h,v
retrieving revision 1.63
diff -u -p -r1.63 ieee80211_node.h
--- ieee80211_node.h21 Sep 2016 12:21:27 -  1.63
+++ ieee80211_node.h18 Nov 2016 13:10:59 -
@@ -146,13 +146,13 @@ struct ieee80211_rx_ba {
u_int16_t   ba_winsize;
u_int16_t   ba_head;
struct timeout  ba_gap_to;
-#define IEEE80211_BA_GAP_TIMEOUT   300 /* msec */
+#define IEEE80211_BA_GAP_TIMEOUT   5 /* msec */
/* Counter for consecutive frames which missed the BA window. */
int ba_winmiss;
/* Sequence number of previous frame which missed the BA window. */
uint16_tba_missedsn;
/* Window moves forward after this many frames have missed it. */
-#define IEEE80211_BA_MAX_WINMISS   8
+#define IEEE80211_BA_MAX_WINMISS   3
 
uint8_t ba_token;
 };



Re: pf IPv6 hop-by-hop after fragment header

2016-11-18 Thread Alexander Bluhm
On Fri, Nov 18, 2016 at 09:43:43AM +0100, Alexandr Nedvedicky wrote:
> Side Note: I did quick check to RFCs. It seems to me there is a 'bug' in
> specification. RFC 2460 says:
> 
>When more than one extension header is used in the same packet, it is
>recommended that those headers appear in the following order:
>^^^

Later on page 7 RFC 2460 says:

   IPv6 nodes must accept and attempt to process extension headers in
   any order and occurring any number of times in the same packet,
   except for the Hop-by-Hop Options header which is restricted to
   appear immediately after an IPv6 header only.

bluhm



Re: Intel 10GbE (ix) driver update - Looking for tests

2016-11-18 Thread Mike Belopuhov
On Thu, Nov 17, 2016 at 23:23 +0100, Mike Belopuhov wrote:
> On Thu, Nov 17, 2016 at 22:24 +0100, Mike Belopuhov wrote:
> > On Wed, Nov 16, 2016 at 23:04 +0100, Mike Belopuhov wrote:
> > > Hi,
> > > 
> > > I've done a massive update of our ix(4) driver that brings
> > > support for X550 family of controllers including those
> > > integrated into new Xeon chips as well as QSFP support for
> > > X520 (82599) but this needs thorough testing.  If you're
> > > using Intel 10Gb controllers, please make sure that you
> > > either (or both!) test the complete diff found at this URL:
> > > http://gir.theapt.org/~mike/ixgbe.diff or next few snapshots
> > > that will (hopefully) contain bits of this monster diff.
> > > 
> > > To test the monster diff, make sure that you are running a
> > > recent snapshot and your kernel source code is up-to-date,
> > > then reset a few files to the specified revisions and
> > > remove the support file for X550:
> > > 
> > > % pwd
> > > /usr/src
> > > % cvs up -r1.326 sys/dev/pci/files.pci
> > > % cvs up -r1.133 sys/dev/pci/if_ix.c
> > > % cvs up -r1.14 sys/dev/pci/ixgbe.c
> > > % cvs up -r1.23 sys/dev/pci/ixgbe.h
> > > % cvs up -r1.11 sys/dev/pci/ixgbe_82598.c
> > > % cvs up -r1.12 sys/dev/pci/ixgbe_82599.c
> > > % cvs up -r1.13 sys/dev/pci/ixgbe_phy.c
> > > % cvs up -r1.22 sys/dev/pci/ixgbe_type.h
> > > % cvs up -r1.4 sys/dev/pci/ixgbe_x540.c
> > > % rm -f sys/dev/pci/ixgbe_x550.c
> > > 
> > > To verify that files have been reset:
> > > 
> > > % pwd
> > > /usr/src
> > > % fgrep "//T1" sys/dev/pci/CVS/Entries
> > > /files.pci/1.326/Mon Sep 12 09:45:53 2016//T1.326
> > > /if_ix.c/1.133/Thu Oct 27 05:00:50 2016//T1.133
> > > /ixgbe.c/1.14/Wed Nov 26 17:03:52 2014//T1.14
> > > /ixgbe.h/1.23/Tue Oct  4 09:24:02 2016//T1.23
> > > /ixgbe_82598.c/1.11/Mon Aug  5 19:58:06 2013//T1.11
> > > /ixgbe_82599.c/1.12/Fri May  1 04:15:00 2015//T1.12
> > > /ixgbe_phy.c/1.13/Fri May  1 04:15:00 2015//T1.13
> > > /ixgbe_type.h/1.22/Wed Nov 16 21:53:57 2016//T1.22
> > > /ixgbe_x540.c/1.4/Wed May 20 14:34:27 2015//T1.4
> > > 
> > > And then test and apply the diff:
> > > 
> > > % pwd
> > > /usr/src
> > > % patch -Csp0  > > 
> > > Make sure to reset files every time the source tree gets updated.
> > > 
> > > Cheers,
> > > Mike
> > 
> > As of today and file revisions below, most of the boilerplate
> > code is now committed.  I've tried very hard to not introduce
> > any noticeable changes in behavior so far.
> > 
> > if_ix.c/1.133
> > ixgbe_82598.c/1.15
> > ixgbe_x550.c/1.1
> > ixgbe.c/1.19
> > ixgbe.h/1.26
> > ixgbe_82599.c/1.16
> > ixgbe_x540.c/1.9
> > ixgbe_type.h/1.29
> > ixgbe_phy.c/1.18
> > 
> > Remaining bits are below.  I'll start picking at low hanging fruits,
> > but won't mind another pair of eyes.
> > 
> 
> I've just realised I forgot to commit one small bit that is
> also not part of this diff that might fail the kernel compile.
> I'll post an update in about 12 hours.

OK, the issue is resolved now. ixgbe_x550.c should be at 1.2 now.

Here's an updated diff (Hrvoje Popovski has found out that
enable_tx_laser function pointers can be not set by X550).

diff --git sys/dev/pci/files.pci sys/dev/pci/files.pci
index a6b91fb..34ce9bf 100644
--- sys/dev/pci/files.pci
+++ sys/dev/pci/files.pci
@@ -356,10 +356,11 @@ attachix at pci
 file   dev/pci/if_ix.c ix
 file   dev/pci/ixgbe.c ix
 file   dev/pci/ixgbe_82598.c   ix
 file   dev/pci/ixgbe_82599.c   ix
 file   dev/pci/ixgbe_x540.cix
+file   dev/pci/ixgbe_x550.cix
 file   dev/pci/ixgbe_phy.c ix
 
 # Neterion Xframe 10 Gigabit ethernet 
 device xge: ether, ifnet, ifmedia
 attach xge  at pci
diff --git sys/dev/pci/if_ix.c sys/dev/pci/if_ix.c
index d13baea..c07758b 100644
--- sys/dev/pci/if_ix.c
+++ sys/dev/pci/if_ix.c
@@ -69,14 +69,24 @@ const struct pci_matchid ixgbe_devices[] = {
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_BPLANE_FCOE },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_CX4 },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_T3_LOM },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_SFP },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_SFP_EM },
+   { PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_SFP_SF_QP },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_SFP_SF2 },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_SFP_FCOE },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599EN_SFP },
+   { PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_82599_QSFP_SF_QP },
{ PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_X540T },
+   { PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_X540T1 },
+   { PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_X550T },
+   { PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_X550T1 },
+   { PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_X550EM_X_KX4 },
+   { PCI_VENDOR_INTEL, PCI_PRODUCT_INTEL_X550

per cpu counters for icmp

2016-11-18 Thread Jonathan Matthew
This is much like the other per cpu counter conversions, except the counter
enum has gaps in it to match the arrays in struct icmpstat.

Index: icmp_var.h
===
RCS file: /cvs/src/sys/netinet/icmp_var.h,v
retrieving revision 1.14
diff -u -p -u -p -r1.14 icmp_var.h
--- icmp_var.h  19 Jan 2014 05:01:50 -  1.14
+++ icmp_var.h  18 Nov 2016 09:02:48 -
@@ -91,6 +91,32 @@ struct   icmpstat {
 }
 
 #ifdef _KERNEL
-extern struct  icmpstat icmpstat;
+
+#include 
+
+enum icmpstat_counters {
+   icps_error,
+   icps_toofreq,
+   icps_oldshort,
+   icps_oldicmp,
+   icps_outhist,
+   icps_badcode = icps_outhist + ICMP_MAXTYPE + 1,
+   icps_tooshort,
+   icps_checksum,
+   icps_badlen,
+   icps_reflect,
+   icps_bmcastecho,
+   icps_inhist,
+   icps_ncounters = icps_inhist + ICMP_MAXTYPE + 1
+};
+
+static inline void
+icmpstat_inc(enum icmpstat_counters c)
+{
+   extern struct cpumem *icmpcounters;
+   counters_inc(icmpcounters, c);
+}
+
+extern struct cpumem *icmpcounters;
 #endif /* _KERNEL */
 #endif /* _NETINET_ICMP_VAR_H_ */
Index: ip_icmp.c
===
RCS file: /cvs/src/sys/netinet/ip_icmp.c,v
retrieving revision 1.156
diff -u -p -u -p -r1.156 ip_icmp.c
--- ip_icmp.c   16 Nov 2016 12:48:19 -  1.156
+++ ip_icmp.c   18 Nov 2016 09:02:48 -
@@ -121,7 +121,7 @@ static int icmperrpps_count = 0;
 static struct timeval icmperrppslim_last;
 
 static struct rttimer_queue *icmp_redirect_timeout_q = NULL;
-struct icmpstat icmpstat;
+struct cpumem *icmpcounters;
 
 int *icmpctl_vars[ICMPCTL_MAXID] = ICMPCTL_VARS;
 
@@ -129,10 +129,12 @@ void icmp_mtudisc_timeout(struct rtentry
 int icmp_ratelimit(const struct in_addr *, const int, const int);
 void icmp_redirect_timeout(struct rtentry *, struct rttimer *);
 void icmp_input_if(struct ifnet *, struct mbuf *, int);
+int icmp_sysctl_icmpstat(void *, size_t *, void *);
 
 void
 icmp_init(void)
 {
+   icmpcounters = counters_alloc(icps_ncounters, M_COUNTERS);
/*
 * This is only useful if the user initializes redirtimeout to
 * something other than zero.
@@ -157,7 +159,7 @@ icmp_do_error(struct mbuf *n, int type, 
printf("icmp_error(%x, %d, %d)\n", oip, type, code);
 #endif
if (type != ICMP_REDIRECT)
-   icmpstat.icps_error++;
+   icmpstat_inc(icps_error);
/*
 * Don't send error if not the first fragment of message.
 * Don't error if the old packet protocol was ICMP
@@ -169,7 +171,7 @@ icmp_do_error(struct mbuf *n, int type, 
n->m_len >= oiplen + ICMP_MINLEN &&
!ICMP_INFOTYPE(((struct icmp *)
((caddr_t)oip + oiplen))->icmp_type)) {
-   icmpstat.icps_oldicmp++;
+   icmpstat_inc(icps_oldicmp);
goto freeit;
}
/* Don't send error in response to a multicast or broadcast packet */
@@ -180,7 +182,7 @@ icmp_do_error(struct mbuf *n, int type, 
 * First, do a rate limitation check.
 */
if (icmp_ratelimit(&oip->ip_src, type, code)) {
-   icmpstat.icps_toofreq++;
+   icmpstat_inc(icps_toofreq);
goto freeit;
}
 
@@ -227,7 +229,7 @@ icmp_do_error(struct mbuf *n, int type, 
icp = mtod(m, struct icmp *);
if ((u_int)type > ICMP_MAXTYPE)
panic("icmp_error");
-   icmpstat.icps_outhist[type]++;
+   icmpstat_inc(icps_outhist + type);
icp->icmp_type = type;
if (type == ICMP_REDIRECT)
icp->icmp_gwaddr.s_addr = dest;
@@ -351,17 +353,17 @@ icmp_input_if(struct ifnet *ifp, struct 
}
 #endif
if (icmplen < ICMP_MINLEN) {
-   icmpstat.icps_tooshort++;
+   icmpstat_inc(icps_tooshort);
goto freeit;
}
i = hlen + min(icmplen, ICMP_ADVLENMIN);
if (m->m_len < i && (m = m_pullup(m, i)) == NULL) {
-   icmpstat.icps_tooshort++;
+   icmpstat_inc(icps_tooshort);
return;
}
ip = mtod(m, struct ip *);
if (in4_cksum(m, 0, hlen, icmplen)) {
-   icmpstat.icps_checksum++;
+   icmpstat_inc(icps_checksum);
goto freeit;
}
 
@@ -399,7 +401,7 @@ icmp_input_if(struct ifnet *ifp, struct 
}
}
 #endif /* NPF */
-   icmpstat.icps_inhist[icp->icmp_type]++;
+   icmpstat_inc(icps_inhist + icp->icmp_type);
code = icp->icmp_code;
switch (icp->icmp_type) {
 
@@ -464,7 +466,7 @@ icmp_input_if(struct ifnet *ifp, struct 
 */
if (icmplen < ICMP_ADVLENMIN || icmplen < ICMP_ADVLEN(icp) ||
icp->icmp_ip.ip_hl < (sizeof(struct ip) >> 2)) {
-   icmpstat.icps_badlen++;
+   icmpstat_inc(icps_badlen)

Re: pf IPv6 hop-by-hop after fragment header

2016-11-18 Thread Alexandr Nedvedicky
Hello,

> I found the link http://www.secfu.net/ in one of sthen@'s mails.
> There the author mentions that we accept IPv6 hop-by-hop headers
> after fragment headers.  In fact this is a result of my pf fragment
> reassembly, so add an extra check there.
> 
> ok?

I'm O.K. with it.

Side Note: I did quick check to RFCs. It seems to me there is a 'bug' in
specification. RFC 2460 says:

   When more than one extension header is used in the same packet, it is
   recommended that those headers appear in the following order:
   ^^^

   IPv6 header
   Hop-by-Hop Options header
   Destination Options header (note 1)
   Routing header
   Fragment header


The RFC 7045, which updates RFC 2460, says in section 2.2:

   As a reminder, in RFC 2460, it is stated that the Hop-by-Hop Options
   header, if present, must be first.
   ^^^
The quotation upgrades 'recommended/should' to 'required/must'. In any
case bluhm's patch makes sense to me.

regards
sasha