date:20150528

Re: [PATCH net-next 1/1] hv_netvsc: Properly size the vrss queues

2015-05-28 Thread Dan Carpenter

Since you're redoing this anyway.

On Tue, May 26, 2015 at 04:21:09PM -0700, K. Y. Srinivasan wrote:
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index ddcc7f8..dd45440 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -161,6 +161,7 @@ struct netvsc_device_info {
>   unsigned char mac_adr[ETH_ALEN];
>   bool link_state;/* 0 - link up, 1 - link down */
>   int  ring_size;
> + u32  max_num_vrss_chns;

We (Joe and I) have commented before that long names don't mix well with
the 80 character limit.  You could just leave the "num_" out.  Almost
all variables are numbers in C so it doesn't add anything.

regards,
dan carpenter
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] xfrm6: Do not use xfrm_local_error for path MTU issues in tunnels

2015-05-28 Thread Alexander Duyck


On 05/27/2015 10:36 PM, Steffen Klassert wrote:
> On Wed, May 27, 2015 at 10:40:32AM -0700, Alexander Duyck wrote:
>> This change makes it so that we use icmpv6_send to report PMTU issues back
>> into tunnels in the case that the resulting packet is larger than the MTU
>> of the outgoing interface.  Previously xfrm_local_error was being used in
>> this case, however this was resulting in no changes, I suspect due to the
>> fact that the tunnel itself was being kept out of the loop.
>>
>> This patch fixes PMTU problems seen on ip6_vti tunnels and is based on the
>> behavior seen if the socket was orphaned.  Instead of requiring the socket
>> to be orphaned this patch simply defaults to using icmpv6_send in the case
>> that the frame came through a tunnel.
>
> We can use icmpv6_send() just in the case that the packet
> was already transmitted by a tunnel device, otherwise we
> get the bug back that I mentioned in my other mail.
>
> Not sure if we have something to know that the packet
> traversed a tunnel device. That's what I asked in the
> thread 'Looking for a lost patch'.


Okay I will try to do some more digging.  From what I can tell right now 
it looks like my ping attempts are getting hung up on the 
xfrm_local_error in __xfrm6_output.  I wonder if we couldn't somehow 
make use of the skb->cb to store a pointer to the tunnel that could be 
checked to determine if we are going through a VTI or not.


- Alex

RE: [PATCH v4] bnx2x: Alloc 4k fragment for each rx ring buffer element

2015-05-28 Thread Yuval Mintz

>> +struct bnx2x_alloc_pool {
>> + struct page *page;
>> + dma_addr_t  dma;
>> + u8  offset;
>> + u8  frag_count;
>> +};
>...
>>  static int bnx2x_alloc_rx_sge(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>> u16 index, gfp_t gfp_mask)
>>  {
>...
>> + pool->offset += SGE_PAGE_SIZE;
>> + pool->frag_count--;
>> +
>>   return 0;
>>  }

> One SGE_PAGE_SIZE is already bigger than representable by u8, so offset
> will overflow.
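
To make the overflow concrete, here is a minimal userspace sketch. It assumes SGE_PAGE_SIZE is 4096 (the 4k fragment named in the subject line), and the helper names are hypothetical; it is not the driver code, only an illustration of what an 8-bit offset does when a fragment size is added to it.

```c
#include <stdint.h>

/* Assumption: one SGE fragment is 4 KiB, as in the patch subject. */
#define SGE_PAGE_SIZE 4096u

/* Mirrors the u8 field from the patch: the sum is truncated to 8 bits. */
static uint8_t advance_offset_u8(uint8_t offset)
{
	return (uint8_t)(offset + SGE_PAGE_SIZE);
}

/* A wider field (unsigned int, chosen only for illustration) keeps the value. */
static unsigned int advance_offset_wide(unsigned int offset)
{
	return offset + SGE_PAGE_SIZE;
}
```

Since 4096 is a multiple of 256, the u8 variant returns its input unchanged, so under this assumption every fragment would be handed out at the same offset within the page.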

Thanks for the catch Michal.

Actually, this upsets me greatly. We didn't see it on a system with 4KB
pages, but this means you've actually tried to 'sell' us a fastpath fix that
was never tested on machines for which it was meant as an improvement.

Dave - if possible, please wait with accepting any further fixes for this
issue until we [qlogic] manage to prepare a test environment
where we can properly test this with 64KB page size architecture.

Thanks,
Yuval

Re: kmalloc panic

2015-05-28 Thread pavani


Hi Cong,

Thanks for the response.

Where do we need to fix the bug? I mean in the driver, in the kernel
source code, or at the hardware level?


Is there any way to fix this issue in the driver?

Please reply as soon as possible.

Thanks
pavani




On 05/28/2015 10:45 AM, Cong Wang wrote:

(Cc'ing netdev and wireless... Looks like a bug in wireless ext.)

On Wed, May 27, 2015 at 6:46 AM, pavani wrote:

Hi,

I connected to an AP using wpa_supplicant on Linux. After connecting
to the AP I get a "kmalloc panic". Can you help me solve this issue?
The logs are as follows:






CPU: ARM926EJ-S [41069265] revision 5 (ARMv5TEJ), cr=00053177
CPU: VIVT data cache, VIVT instruction cache
Machine: SpaceCom-Lite
Memory policy: ECC disabled, Data cache writeback
On node 0 totalpages: 16384
free_area_init_node: node 0, pgdat c03b7ff8, node_mem_map c03e9000
   Normal zone: 128 pages used for memmap
   Normal zone: 0 pages reserved
   Normal zone: 16256 pages, LIFO batch:3
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 16256
Kernel command line: console=ttyNS1,115200 root=/dev/mtdblock2
rootfstype=jffs2 quiet ro
PID hash table entries: 256 (order: -2, 1024 bytes)
Dentry cache hash table entries: 8192 (order: 3, 32768 bytes)
Inode-cache hash table entries: 4096 (order: 2, 16384 bytes)
Memory: 64MB = 64MB total
Memory: 60964k/60964k available, 4572k reserved, 0K highmem
Virtual kernel memory layout:
 vector  : 0x - 0x1000   (   4 kB)
 fixmap  : 0xfff0 - 0xfffe   ( 896 kB)
 DMA : 0xffc0 - 0xffe0   (   2 MB)
 vmalloc : 0xc480 - 0xf000   ( 696 MB)
 lowmem  : 0xc000 - 0xc400   (  64 MB)
 modules : 0xbf00 - 0xc000   (  16 MB)
   .init : 0xc0008000 - 0xc00e5000   ( 884 kB)
   .text : 0xc00e5000 - 0xc0387000   (2696 kB)
   .data : 0xc039a000 - 0xc03b8600   ( 122 kB)
SLUB: Genslabs=11, HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
Hierarchical RCU implementation.
RCU-based detection of stalled CPUs is enabled.
NR_IRQS:82
Console: colour dummy device 80x30
Calibrating delay loop... 74.13 BogoMIPS (lpj=370688)
Mount-cache hash table entries: 512
CPU: Testing write buffer coherency: ok
Synthetic TSC timer will fire each 131104 jiffies.
NET: Registered protocol family 16
bio: create slab  at 0
Switching to clocksource ns921x-timer0
Switched to NOHz mode on CPU #0
NET: Registered protocol family 2
IP route cache hash table entries: 1024 (order: 0, 4096 bytes)
TCP established hash table entries: 2048 (order: 2, 16384 bytes)
TCP bind hash table entries: 2048 (order: 1, 8192 bytes)
TCP: Hash tables configured (established 2048 bind 2048)
TCP reno registered
UDP hash table entries: 256 (order: 0, 4096 bytes)
UDP-Lite hash table entries: 256 (order: 0, 4096 bytes)
NET: Registered protocol family 1
JFFS2 version 2.2. (NAND) © 2001-2006 Red Hat, Inc.
msgmni has been set to 119
alg: No test for stdrng (krng)
Block layer SCSI generic (bsg) driver version 0.4 loaded (major 253)
io scheduler noop registered
io scheduler deadline registered
io scheduler cfq registered (default)
ns921x-serial.1: ttyNS1 at MMIO 0x90018000 (irq = 8) is a NS921X
console [ttyNS1] enabled
ns921x-serial.2: ttyNS2 at MMIO 0x9002 (irq = 9) is a NS921X
ns921x-serial.3: ttyNS3 at MMIO 0x90028000 (irq = 10) is a NS921X
Digi NS921x UART driver
physmap platform flash device: 1000 at 5000
physmap-flash.0: Found 1 x16 devices at 0x0 in 16-bit bank
physmap-flash.0: Found an alias at 0x200 for the chip at 0x0
physmap-flash.0: Found an alias at 0x400 for the chip at 0x0
physmap-flash.0: Found an alias at 0x600 for the chip at 0x0
physmap-flash.0: Found an alias at 0x800 for the chip at 0x0
physmap-flash.0: Found an alias at 0xa00 for the chip at 0x0
physmap-flash.0: Found an alias at 0xc00 for the chip at 0x0
physmap-flash.0: Found an alias at 0xe00 for the chip at 0x0
  Amd/Fujitsu Extended Query Table at 0x0040
physmap-flash.0: CFI does not contain boot bank location. Assuming top.
number of CFI chips: 1
RedBoot partition parsing not available
Using physmap partition information
Creating 4 MTD partitions on "physmap-flash.0":
0x-0x0008 : "Bootloader"
0x0008-0x0088 : "Fallback"
0x0088-0x01e8 : "Normal"
0x01e8-0x0200 : "Data"
serialcan: serial line CAN interface driver
serialcan: 1 dynamic interface channels.
Digi NS9XXX Ethernet driver
rtc-ns9xxx rtc-ns9xxx.0: rtc core: registered rtc-ns9xxx as rtc0
ns9xxx-wdt ns9xxx-wdt: NS9xxx watchdog timer at 0xc48aa174
fims: Starting to register the FIMs module.
FIMs driver v0.2
fim-sdio: FIM SDIO driver v0.4
TCP cubic registered
NET: Registered protocol family 10
IPv6 over IPv4 tunneling driver
NET: Registered protocol family 17
can: controller area network core (rev 20090105 abi 8)
NET: Registered protocol family 29
can: raw protocol (rev 20090105)
determining board revision
board_rev = 0x0
kmemle

[PATCH 1/7] xfrm: fix a race in xfrm_state_lookup_byspi

2015-05-28 Thread Steffen Klassert

From: Li RongQing 

The returned xfrm_state should be held before unlocking xfrm_state_lock,
otherwise the returned xfrm_state may be released.

Fixes: c454997e6[{pktgen, xfrm} Introduce xfrm_state_lookup_byspi..]
Cc: Fan Du 
Signed-off-by: Li RongQing 
Acked-by: Fan Du 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_state.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index f5e39e3..96688cd 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -927,8 +927,8 @@ struct xfrm_state *xfrm_state_lookup_byspi(struct net *net, __be32 spi,
x->id.spi != spi)
continue;
 
-   spin_unlock_bh(&net->xfrm.xfrm_state_lock);
xfrm_state_hold(x);
+   spin_unlock_bh(&net->xfrm.xfrm_state_lock);
return x;
}
spin_unlock_bh(&net->xfrm.xfrm_state_lock);
-- 
1.9.1
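
The ordering this patch enforces can be sketched in userspace as follows. This is a minimal illustration, assuming a pthread mutex in place of xfrm_state_lock and a plain counter in place of the kernel's reference count; struct state and both function names are hypothetical stand-ins, not kernel API.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct xfrm_state: an SPI plus a reference count. */
struct state {
	uint32_t spi;
	int refcnt;		/* only touched with state_lock held in this sketch */
	struct state *next;
};

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static struct state *state_list;

/* Insert at the head; the list itself owns one reference. */
static void state_insert(struct state *s)
{
	pthread_mutex_lock(&state_lock);
	s->refcnt = 1;
	s->next = state_list;
	state_list = s;
	pthread_mutex_unlock(&state_lock);
}

static struct state *state_lookup_byspi(uint32_t spi)
{
	struct state *s;

	pthread_mutex_lock(&state_lock);
	for (s = state_list; s; s = s->next) {
		if (s->spi != spi)
			continue;
		/*
		 * Take the reference first. Once the lock is dropped,
		 * another thread may remove s from the list and drop
		 * the list's reference, freeing the object.
		 */
		s->refcnt++;
		pthread_mutex_unlock(&state_lock);
		return s;
	}
	pthread_mutex_unlock(&state_lock);
	return NULL;
}
```

With the two statements in the opposite order there is a window between the unlock and the increment in which the object has no reference held on behalf of the caller.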


[PATCH 3/7] esp6: Use high-order sequence number bits for IV generation

2015-05-28 Thread Steffen Klassert

From: Herbert Xu 

I noticed we were only using the low-order bits for IV generation
when ESN is enabled.  This is very bad because it means that the
IV can repeat.  We must use the full 64 bits.

Signed-off-by: Herbert Xu 
Signed-off-by: Steffen Klassert 
---
 net/ipv6/esp6.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 31f1b5d..7c07ce3 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -248,7 +248,8 @@ static int esp6_output(struct xfrm_state *x, struct sk_buff *skb)
aead_givcrypt_set_crypt(req, sg, sg, clen, iv);
aead_givcrypt_set_assoc(req, asg, assoclen);
aead_givcrypt_set_giv(req, esph->enc_data,
- XFRM_SKB_CB(skb)->seq.output.low);
+ XFRM_SKB_CB(skb)->seq.output.low +
+ ((u64)XFRM_SKB_CB(skb)->seq.output.hi << 32));
 
ESP_SKB_CB(skb)->tmp = tmp;
err = crypto_aead_givencrypt(req);
-- 
1.9.1
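
A small sketch of why the low half alone is not enough. The two helper names below are hypothetical; the second one uses the same expression as the patch to combine the halves. With ESN the sequence number is 64 bits wide, so the low 32 bits come around again each time the high half increments.

```c
#include <stdint.h>

/* Low half only: the value that was passed before the patch. */
static uint64_t iv_seed_low_only(uint32_t low, uint32_t hi)
{
	(void)hi;	/* ignored, which is the bug */
	return low;
}

/* Full 64-bit sequence number, combined as in the patch. */
static uint64_t iv_seed_full(uint32_t low, uint32_t hi)
{
	return low + ((uint64_t)hi << 32);
}
```

For low = 5, the first helper returns the same value for hi = 0 and hi = 1, while the second returns two distinct values.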


[PATCH 6/7] xfrm: Override skb->mark with tunnel->parm.i_key in xfrm_input

2015-05-28 Thread Steffen Klassert

From: Alexander Duyck 

This change makes it so that if a tunnel is defined we just use the mark
from the tunnel instead of the mark from the skb header.  By doing this we
can avoid the need to set skb->mark inside of the tunnel receive functions.

Signed-off-by: Alexander Duyck 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_input.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 526c4fe..b58286e 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -13,6 +13,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 static struct kmem_cache *secpath_cachep __read_mostly;
 
@@ -186,6 +188,7 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
struct xfrm_state *x = NULL;
xfrm_address_t *daddr;
struct xfrm_mode *inner_mode;
+   u32 mark = skb->mark;
unsigned int family;
int decaps = 0;
int async = 0;
@@ -203,6 +206,18 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
   XFRM_SPI_SKB_CB(skb)->daddroff);
family = XFRM_SPI_SKB_CB(skb)->family;
 
+   /* if tunnel is present override skb->mark value with tunnel i_key */
+   if (XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip4) {
+   switch (family) {
+   case AF_INET:
+   mark = be32_to_cpu(XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip4->parms.i_key);
+   break;
+   case AF_INET6:
+   mark = be32_to_cpu(XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip6->parms.i_key);
+   break;
+   }
+   }
+
/* Allocate new secpath or COW existing one. */
if (!skb->sp || atomic_read(&skb->sp->refcnt) != 1) {
struct sec_path *sp;
@@ -229,7 +244,7 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
goto drop;
}
 
-   x = xfrm_state_lookup(net, skb->mark, daddr, spi, nexthdr, family);
+   x = xfrm_state_lookup(net, mark, daddr, spi, nexthdr, family);
if (x == NULL) {
XFRM_INC_STATS(net, LINUX_MIB_XFRMINNOSTATES);
xfrm_audit_state_notfound(skb, family, spi, seq);
-- 
1.9.1
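
The selection logic in this patch can be reduced to a few lines. The sketch below is a userspace approximation: struct tunnel and lookup_mark() are hypothetical names, and ntohl() stands in for the kernel's be32_to_cpu(). It shows a local mark being chosen for the lookup while the packet's own mark is left alone.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the tunnel parameters. */
struct tunnel {
	uint32_t i_key;	/* network byte order, like a __be32 field */
};

/* Pick the mark for the state lookup without modifying the packet's mark. */
static uint32_t lookup_mark(uint32_t skb_mark, const struct tunnel *t)
{
	if (t)
		return ntohl(t->i_key);	/* userspace analog of be32_to_cpu() */
	return skb_mark;
}
```

Because the result lives in a local variable, the caller's packet still carries its original mark after the lookup.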


[PATCH 2/7] esp4: Use high-order sequence number bits for IV generation

2015-05-28 Thread Steffen Klassert

From: Herbert Xu 

I noticed we were only using the low-order bits for IV generation
when ESN is enabled.  This is very bad because it means that the
IV can repeat.  We must use the full 64 bits.

Signed-off-by: Herbert Xu 
Signed-off-by: Steffen Klassert 
---
 net/ipv4/esp4.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 421a80b..30b544f 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -256,7 +256,8 @@ static int esp_output(struct xfrm_state *x, struct sk_buff *skb)
aead_givcrypt_set_crypt(req, sg, sg, clen, iv);
aead_givcrypt_set_assoc(req, asg, assoclen);
aead_givcrypt_set_giv(req, esph->enc_data,
- XFRM_SKB_CB(skb)->seq.output.low);
+ XFRM_SKB_CB(skb)->seq.output.low +
+ ((u64)XFRM_SKB_CB(skb)->seq.output.hi << 32));
 
ESP_SKB_CB(skb)->tmp = tmp;
err = crypto_aead_givencrypt(req);
-- 
1.9.1


[PATCH 5/7] ip_vti/ip6_vti: Do not touch skb->mark on xmit

2015-05-28 Thread Steffen Klassert

From: Alexander Duyck 

Instead of modifying skb->mark we can simply modify the flowi_mark that is
generated as a result of the xfrm_decode_session.  By doing this we don't
need to actually touch the skb->mark and it can be preserved as it passes
out through the tunnel.

Signed-off-by: Alexander Duyck 
Signed-off-by: Steffen Klassert 
---
 net/ipv4/ip_vti.c  | 5 +++--
 net/ipv6/ip6_vti.c | 4 +++-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index 9f7269f..4c318e1 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -216,8 +216,6 @@ static netdev_tx_t vti_tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
 
memset(&fl, 0, sizeof(fl));
 
-   skb->mark = be32_to_cpu(tunnel->parms.o_key);
-
switch (skb->protocol) {
case htons(ETH_P_IP):
xfrm_decode_session(skb, &fl, AF_INET);
@@ -233,6 +231,9 @@ static netdev_tx_t vti_tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
return NETDEV_TX_OK;
}
 
+   /* override mark with tunnel output key */
+   fl.flowi_mark = be32_to_cpu(tunnel->parms.o_key);
+
return vti_xmit(skb, dev, &fl);
 }
 
diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index ed9d681..104de4d 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -495,7 +495,6 @@ vti6_tnl_xmit(struct sk_buff *skb, struct net_device *dev)
int ret;
 
memset(&fl, 0, sizeof(fl));
-   skb->mark = be32_to_cpu(t->parms.o_key);
 
switch (skb->protocol) {
case htons(ETH_P_IPV6):
@@ -516,6 +515,9 @@ vti6_tnl_xmit(struct sk_buff *skb, struct net_device *dev)
goto tx_err;
}
 
+   /* override mark with tunnel output key */
+   fl.flowi_mark = be32_to_cpu(t->parms.o_key);
+
ret = vti6_xmit(skb, dev, &fl);
if (ret < 0)
goto tx_err;
-- 
1.9.1


[PATCH 7/7] ip_vti/ip6_vti: Preserve skb->mark after rcv_cb call

2015-05-28 Thread Steffen Klassert

From: Alexander Duyck 

The vti6_rcv_cb and vti_rcv_cb calls were leaving the skb->mark modified
after completing the function.  This resulted in the original skb->mark
value being lost.  Since we only need skb->mark to be set for
xfrm_policy_check we can pull the assignment into the rcv_cb calls and then
just restore the original mark after xfrm_policy_check has been completed.

Signed-off-by: Alexander Duyck 
Signed-off-by: Steffen Klassert 
---
 net/ipv4/ip_vti.c  | 9 +++++++--
 net/ipv6/ip6_vti.c | 9 +++++++--
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index 4c318e1..0c15208 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -65,7 +65,6 @@ static int vti_input(struct sk_buff *skb, int nexthdr, __be32 spi,
goto drop;
 
XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip4 = tunnel;
-   skb->mark = be32_to_cpu(tunnel->parms.i_key);
 
return xfrm_input(skb, nexthdr, spi, encap_type);
}
@@ -91,6 +90,8 @@ static int vti_rcv_cb(struct sk_buff *skb, int err)
struct pcpu_sw_netstats *tstats;
struct xfrm_state *x;
struct ip_tunnel *tunnel = XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip4;
+   u32 orig_mark = skb->mark;
+   int ret;
 
if (!tunnel)
return 1;
@@ -107,7 +108,11 @@ static int vti_rcv_cb(struct sk_buff *skb, int err)
x = xfrm_input_state(skb);
family = x->inner_mode->afinfo->family;
 
-   if (!xfrm_policy_check(NULL, XFRM_POLICY_IN, skb, family))
+   skb->mark = be32_to_cpu(tunnel->parms.i_key);
+   ret = xfrm_policy_check(NULL, XFRM_POLICY_IN, skb, family);
+   skb->mark = orig_mark;
+
+   if (!ret)
return -EPERM;
 
skb_scrub_packet(skb, !net_eq(tunnel->net, dev_net(skb->dev)));
diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index 104de4d..ff3bd86 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -322,7 +322,6 @@ static int vti6_rcv(struct sk_buff *skb)
}
 
XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip6 = t;
-   skb->mark = be32_to_cpu(t->parms.i_key);
 
rcu_read_unlock();
 
@@ -342,6 +341,8 @@ static int vti6_rcv_cb(struct sk_buff *skb, int err)
struct pcpu_sw_netstats *tstats;
struct xfrm_state *x;
struct ip6_tnl *t = XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip6;
+   u32 orig_mark = skb->mark;
+   int ret;
 
if (!t)
return 1;
@@ -358,7 +359,11 @@ static int vti6_rcv_cb(struct sk_buff *skb, int err)
x = xfrm_input_state(skb);
family = x->inner_mode->afinfo->family;
 
-   if (!xfrm_policy_check(NULL, XFRM_POLICY_IN, skb, family))
+   skb->mark = be32_to_cpu(t->parms.i_key);
+   ret = xfrm_policy_check(NULL, XFRM_POLICY_IN, skb, family);
+   skb->mark = orig_mark;
+
+   if (!ret)
return -EPERM;
 
skb_scrub_packet(skb, !net_eq(t->net, dev_net(skb->dev)));
-- 
1.9.1
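
The save, override, restore pattern used in both rcv_cb functions can be sketched in isolation. Everything below is a hypothetical userspace reduction: struct packet, policy_fn and the example callback are illustrative names, not kernel interfaces.

```c
#include <stdint.h>

struct packet {
	uint32_t mark;
};

/* Hypothetical policy callback; returns nonzero when the packet is allowed. */
typedef int (*policy_fn)(const struct packet *p);

/* Example policy for the sketch: allow only packets marked with key 5. */
static int allow_key_5(const struct packet *p)
{
	return p->mark == 5;
}

static int policy_check_with_key(struct packet *p, uint32_t key, policy_fn check)
{
	uint32_t orig_mark = p->mark;
	int ret;

	p->mark = key;		/* the check needs to see the tunnel key */
	ret = check(p);
	p->mark = orig_mark;	/* restore before anything else reads the mark */

	return ret;
}
```

The mark is restored on both the allowed and the denied path, which matches the patch: the restore happens before the return value is examined.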


[PATCH 4/7] xfrm: Always zero high-order sequence number bits

2015-05-28 Thread Steffen Klassert

From: Herbert Xu 

As we're now always including the high bits of the sequence number
in the IV generation process we need to ensure that they don't
contain crap.

This patch ensures that the high sequence bits are always zeroed
so that we don't leak random data into the IV.

Signed-off-by: Herbert Xu 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_replay.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/xfrm/xfrm_replay.c b/net/xfrm/xfrm_replay.c
index dab57da..4fd725a 100644
--- a/net/xfrm/xfrm_replay.c
+++ b/net/xfrm/xfrm_replay.c
@@ -99,6 +99,7 @@ static int xfrm_replay_overflow(struct xfrm_state *x, struct sk_buff *skb)
 
if (x->type->flags & XFRM_TYPE_REPLAY_PROT) {
XFRM_SKB_CB(skb)->seq.output.low = ++x->replay.oseq;
+   XFRM_SKB_CB(skb)->seq.output.hi = 0;
if (unlikely(x->replay.oseq == 0)) {
x->replay.oseq--;
xfrm_audit_state_replay_overflow(x, skb);
@@ -177,6 +178,7 @@ static int xfrm_replay_overflow_bmp(struct xfrm_state *x, struct sk_buff *skb)
 
if (x->type->flags & XFRM_TYPE_REPLAY_PROT) {
XFRM_SKB_CB(skb)->seq.output.low = ++replay_esn->oseq;
+   XFRM_SKB_CB(skb)->seq.output.hi = 0;
if (unlikely(replay_esn->oseq == 0)) {
replay_esn->oseq--;
xfrm_audit_state_replay_overflow(x, skb);
-- 
1.9.1


pull request (net): ipsec 2015-05-28

2015-05-28 Thread Steffen Klassert

1) Fix a race in xfrm_state_lookup_byspi, we need to take
   the refcount before we release xfrm_state_lock.
   From Li RongQing.

2) Fix IV generation on ESN state. We used just the
   low order sequence numbers for IV generation on
   ESN, as a result the IV can repeat on the same
   state. Fix this by using the  high order sequence
   number bits too and make sure to always initialize
   the high order bits with zero. These patches are
   serious stable candidates. Fixes from Herbert Xu.

3) Fix the skb->mark handling on vti. We don't
   reset skb->mark in skb_scrub_packet anymore,
   so vti must care to restore the original
   value back after it was used to lookup the
   vti policy and state. Fixes from Alexander Duyck.

Please pull or let me know if there are problems.

Thanks!

The following changes since commit 39376ccb1968ba9f83e2a880a8bf02ad5dea44e1:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf (2015-04-27 23:12:34 -0400)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git master

for you to fetch changes up to d55c670cbc54b2270a465cdc382ce71adae45785:

  ip_vti/ip6_vti: Preserve skb->mark after rcv_cb call (2015-05-28 06:23:32 +0200)


Alexander Duyck (3):
  ip_vti/ip6_vti: Do not touch skb->mark on xmit
  xfrm: Override skb->mark with tunnel->parm.i_key in xfrm_input
  ip_vti/ip6_vti: Preserve skb->mark after rcv_cb call

Herbert Xu (3):
  esp4: Use high-order sequence number bits for IV generation
  esp6: Use high-order sequence number bits for IV generation
  xfrm: Always zero high-order sequence number bits

Li RongQing (1):
  xfrm: fix a race in xfrm_state_lookup_byspi

 net/ipv4/esp4.c        |  3 ++-
 net/ipv4/ip_vti.c      | 14 ++++++++++----
 net/ipv6/esp6.c        |  3 ++-
 net/ipv6/ip6_vti.c     | 13 ++++++++++---
 net/xfrm/xfrm_input.c  | 17 ++++++++++++++++-
 net/xfrm/xfrm_replay.c |  2 ++
 net/xfrm/xfrm_state.c  |  2 +-
 7 files changed, 43 insertions(+), 11 deletions(-)

Re: kmalloc panic

2015-05-28 Thread Richard Weinberger

On Thu, May 28, 2015 at 9:21 AM, pavani wrote:
> Hi Cong ,
>
> Thanks for the response.
>
> Where do we need to fix the bug? I mean in the driver, in the kernel
> source code, or at the hardware level?

The more interesting question is, is this a recent and pristine kernel
from kernel.org?

-- 
Thanks,
//richard

[PATCH net-next] neigh: Add missing rcu_assign_pointer

2015-05-28 Thread Ying Xue

Commit e4c4e448cf55 ("neigh: Convert garbage collection from softirq
to workqueue") neglected to use the rcu_assign_pointer() macro when
assigning an RCU-protected pointer.

Signed-off-by: Ying Xue 
---
 net/core/neighbour.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 3a74df7..aaad3a5 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -783,7 +783,8 @@ static void neigh_periodic_work(struct work_struct *work)
if (atomic_read(&n->refcnt) == 1 &&
(state == NUD_FAILED ||
 time_after(jiffies, n->used + NEIGH_VAR(n->parms, GC_STALETIME)))) {
-   *np = n->next;
+   rcu_assign_pointer(*np, rcu_dereference_protected(n->next,
+   lockdep_is_held(&tbl->lock)));
n->dead = 1;
write_unlock(&n->lock);
neigh_cleanup_and_release(n);
-- 
1.7.9.5
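
For readers without a kernel tree at hand, the publish step can be approximated in userspace with C11 atomics. This is only an analogy under stated assumptions: a release store plays the role of rcu_assign_pointer(), a relaxed load by the lock holder plays the role of rcu_dereference_protected(), and struct node and unlink_node() are hypothetical names. Grace periods and freeing are not modelled.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
	int key;
	_Atomic(struct node *) next;
};

/*
 * Unlink the node that *np points to from a list that lockless readers
 * may be walking. The caller is assumed to hold the writer-side lock.
 */
static struct node *unlink_node(_Atomic(struct node *) *np)
{
	struct node *n = atomic_load_explicit(np, memory_order_relaxed);
	struct node *next;

	if (!n)
		return NULL;

	/* Writer-side read under the lock: no extra ordering is needed. */
	next = atomic_load_explicit(&n->next, memory_order_relaxed);

	/* Publish the new link with release ordering for concurrent readers. */
	atomic_store_explicit(np, next, memory_order_release);

	return n;
}
```

The removed node is returned so the caller can release it once no reader can still be using it.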


Re: [PATCH/RFC net-next] rocker: remove rocker parameter from functions that have rocker_port parameter

2015-05-28 Thread Simon Horman

On Thu, May 28, 2015 at 08:15:42AM +0200, Jiri Pirko wrote:
> Thu, May 28, 2015 at 05:23:17AM CEST, simon.hor...@netronome.com wrote:
> >The rocker (switch) of a rocker_port may be trivially obtained from
> >the latter it seems cleaner not to pass the former to a function when
> >the latter is being passed anyway.
> 
> I don't understand reason for this patch. I like it the way it is I must
> say. + you introduce possible multiple dereference in a row in call-chain.

My main motivation is that it seems cleaner. I marked it as an RFC
as I wasn't sure if there was a particular reason that things are how they
are. I have no objection to leaving things as they are if that's the
consensus.

Re: [PATCH] xfrm6: Do not use xfrm_local_error for path MTU issues in tunnels

2015-05-28 Thread Steffen Klassert

On Thu, May 28, 2015 at 12:18:51AM -0700, Alexander Duyck wrote:
> On 05/27/2015 10:36 PM, Steffen Klassert wrote:
> >On Wed, May 27, 2015 at 10:40:32AM -0700, Alexander Duyck wrote:
> >>This change makes it so that we use icmpv6_send to report PMTU issues back
> >>into tunnels in the case that the resulting packet is larger than the MTU
> >>of the outgoing interface.  Previously xfrm_local_error was being used in
> >>this case, however this was resulting in no changes, I suspect due to the
> >>fact that the tunnel itself was being kept out of the loop.
> >>
> >>This patch fixes PMTU problems seen on ip6_vti tunnels and is based on the
> >>behavior seen if the socket was orphaned.  Instead of requiring the socket
> >>to be orphaned this patch simply defaults to using icmpv6_send in the case
> >>that the frame came through a tunnel.
> >We can use icmpv6_send() just in the case that the packet
> >was already transmitted by a tunnel device, otherwise we
> >get the bug back that I mentioned in my other mail.
> >
> >Not sure if we have something to know that the packet
> >traversed a tunnel device. That's what I asked in the
> >thread 'Looking for a lost patch'.
> 
> Okay I will try to do some more digging.  From what I can tell right
> now it looks like my ping attempts are getting hung up on the
> xfrm_local_error in __xfrm6_output.  I wonder if we couldn't somehow
> make use of the skb->cb to store a pointer to the tunnel that could
> be checked to determine if we are going through a VTI or not.

Maybe it is as easy as the patch below, could you please test it?

Subject: [PATCH RFC] vti6: Add pmtu handling to vti6_xmit.

We currently rely on the PMTU discovery of xfrm.
However if a packet is locally sent, the PMTU mechanism
of xfrm tries to do local socket notification, which
might not work for applications like ping that don't
check for this. So add pmtu handling to vti6_xmit to
report MTU changes immediately.

Signed-off-by: Steffen Klassert 
---
 net/ipv6/ip6_vti.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index ff3bd86..13cb771 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -434,6 +434,7 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, struct flowi *fl)
struct dst_entry *dst = skb_dst(skb);
struct net_device *tdev;
struct xfrm_state *x;
+   int mtu;
int err = -1;
 
if (!dst)
@@ -468,6 +469,15 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, struct flowi *fl)
skb_dst_set(skb, dst);
skb->dev = skb_dst(skb)->dev;
 
+   mtu = dst_mtu(dst);
+   if (!skb->ignore_df && skb->len > mtu) {
+   skb_dst(skb)->ops->update_pmtu(dst, NULL, skb, mtu);
+
+   icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
+
+   return -EMSGSIZE;
+   }
+
err = dst_output(skb);
if (net_xmit_eval(err) == 0) {
struct pcpu_sw_netstats *tstats = this_cpu_ptr(dev->tstats);
-- 
1.9.1
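
The decision this RFC patch adds can be isolated as a small predicate. The sketch below is a hypothetical userspace reduction: pmtu_check() is not a kernel function, and the side effects in the real hunk (updating the PMTU on the dst and sending ICMPV6_PKT_TOOBIG) are only noted in a comment.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns 0 when the packet may be passed on, or -EMSGSIZE when it is
 * larger than the path MTU and the sender has not allowed the DF check
 * to be ignored.
 */
static int pmtu_check(size_t len, size_t mtu, bool ignore_df)
{
	if (!ignore_df && len > mtu) {
		/* The patch also updates the PMTU and sends ICMPV6_PKT_TOOBIG here. */
		return -EMSGSIZE;
	}
	return 0;
}
```

A packet exactly as large as the MTU passes; only a strictly larger one is rejected, as in the `skb->len > mtu` comparison in the hunk.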


Problems with receiving packets

2015-05-28 Thread sssdrb

Hello,

I use Linux kernel 3.3.8 on an ARM XScale based embedded platform.
I noticed that some applications sometimes lose data from the network.
To be more detailed: for example, I'm using the ping command between two
ARM boards. The communication goes through Ethernet or Wi-Fi. From
time to time ping reports that it has lost some packets. Additionally, I
ran tcpdump on the same side where ping was invoked, and it reports
that all the packets arrived (!).

This is the strange part: ping reports lost data, while tcpdump
shows all of the packets (ICMP requests and replies).

It seems like there is some problem with transferring network data
from kernel space to user space (in the RX path).
I'm sure that all the ICMP reply packets are received by the Ethernet
driver and kernel, but the ping application receives only part of them.

Is this a known problem, and is there any solution for it?

Re: [PATCH net-next V4 00/12] net/mlx5: ConnectX-4 100G Ethernet driver

2015-05-28 Thread David Miller

From: Ben Hutchings 
Date: Wed, 27 May 2015 20:57:37 +0100

> How would an application tell the difference between an IRQ handler
> being renamed, or being unregistered and re-registered under a different
> name?  I'm fairly sure it can't tell.

What do things like the userland IRQ balancer do?

Re: [PATCH 1/2] connector: add cgroup release event report to proc connector

2015-05-28 Thread Dimitri John Ledkov

On 28 May 2015 at 04:30, Zefan Li  wrote:
> On 2015/5/27 20:37, Dimitri John Ledkov wrote:
>> On 27 May 2015 at 12:22, Zefan Li  wrote:
>>> On 2015/5/27 6:07, Dimitri John Ledkov wrote:
 Add a kernel API to send a proc connector notification that a cgroup
 has become empty. A userspace daemon can then act upon such
 information, and usually clean-up and remove such a group as it's no
 longer needed.

 Currently there are two other ways (one for current & one for unified
 cgroups) to receive such notifications, but they either involve
 spawning userspace helper or monitoring a lot of files. This is a
 firehose of all such events instead from a single place.

 In the current cgroups structure the way to get notifications is by
 enabling `release_agent' and setting `notify_on_release' for a given
 cgroup hierarchy. This will then spawn userspace helper with removed
 cgroup as an argument. It has been acknowledged that this is
 expensive, especially in the exit-heavy workloads. In userspace this
 is currently used by systemd and CGmanager that I know of, both of
 agents establish connection to the long running daemon and pass the
 message to it. As a courtesy to other processes, such an event is
 sometimes forwarded further on, e.g. systemd forwards it to the system
 DBus.

 In the future/unified cgroups structure support for `release_agent' is
 removed, without a direct replacement. However, there is a new
 `cgroup.populated' file exposed that recursively reports if there are
 any tasks in a given cgroup hierarchy. It's a very good flag to
 quickly/lazily scan for empty things, however one would need to
 establish inotify watch on each and every cgroup.populated file at
 cgroup setup time (ideally before any pids enter said cgroup). Thus
 again nobody but the original creator of a given cgroup has a
 chance to reliably monitor a cgroup becoming empty (since there is no
 reliable recursive inotify watch).

 Hence, the addition to the proc connector firehose. Multiple things
 (albeit with CAP_NET_ADMIN in the init pid/user namespace) could
 connect and monitor cgroups release notifications. In a way, this
 repeats udev history, at first it was a userspace helper, which later
 became a netlink socket. And I hope, that proc connector is a
 naturally good fit for this notification type.

 For precisely when cgroups should emit this event, see next patch
 against kernel/cgroup.c.

>>>
>>> We really don't want yet another way for cgroup notification.
>>>
>>
>> we do have multiple information sources for similar events in other
>> places... e.g. fork events can be tracked with ptrace and with
>> proc-connector, ditto other things.
>>
>>> Systemd is happy with this cgroup.populated interface. Do you have any
>>> real use case in mind that can't be satisfied with inotify watch?
>>>
>>
>> cgroup.populated is not implemented in systemd and would require a lot
>> of inotify watches.
>
> I believe systemd will use cgroup.populated, though I don't know its
> roadmap. Maybe it's waiting for the kernel to remove the experimental
> flag of unified hierarchy.
>

There is no code in master to support unified hierarchy in systemd
that I can see. And more and more things rely on the current
hierarchy, especially around container-like technologies.

>> Also it's only set on the unified structure and
>> not exposed on the current one.
>>
>> Also it will not allow anybody else to establish notify watch in a
>> timely manner. Thus anyone external to the cgroups creator will not be
>> able to monitor cgroup.populated at the right time.
>
> I guess this isn't a problem, as you can watch the IN_CREATE event, and
> then you'll get notified when a cgroup is created.
>

It is a problem, there is no effective way to establish race-free
inotify watches, which is well known. Having a watch on
/sys/fs/cgroup, one has to establish inotify watch on a directory
created there, and then another watch on cgroup.populated within
there. By which time a process could have already entered, run and
exited.

>> With
>> proc_connector I was thinking processes entering cgroups would be
>> useful events as well, but I don't have a use-case for them yet thus
>> I'm not sure how the event should look like.
>>
>> Would cgroup.populated be exposed on the legacy cgroup hierarchy? At the
>> moment I see about ~20ms of my ~200ms boot wasted on spawning the
>> cgroups agent and I would like to get rid of that as soon as possible.
>> This patch solves it for me. ( i have a matching one to connect to
>> proc connector and then feed notifications to systemd via systemd's
>> private api end-point )
>>
>> Exposing cgroup.populated irrespective of the cgroup mount options
>> would be great, but would result in many watches being established
>> awaiting for a once in a lifecycle condition of a cgroup.

RE: [PATCH] net: tcp: Fix a PTO timing granularity issue

2015-05-28 Thread David Laight

From: Ido Yariv
> Sent: 28 May 2015 05:37
...
> +/* Convert msecs to jiffies, ensuring that the return value is at least 2
> + * jiffies.
> + * This can be used when setting tick-based timers to guarantee that they won't
> + * expire right away.
> + */
> +static inline unsigned long tcp_safe_msecs_to_jiffies(const unsigned int m)

I don't like using 'safe' in function names, being 'safe' depends on what
the caller wants.
Maybe tcp_msecs_to_jiffies_min_2() would be better.
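
A rough userspace sketch of what the suggested name would describe.
SKETCH_HZ and sketch_msecs_to_jiffies() are stand-ins assumed for
illustration only (the kernel's HZ is a build-time option and the real
msecs_to_jiffies() is more involved); the only point is the clamp to a
minimum of 2.

```c
/* Assumption for this sketch: 100 ticks per second. */
#define SKETCH_HZ 100UL

/* Userspace stand-in for msecs_to_jiffies(): round up to whole ticks. */
static unsigned long sketch_msecs_to_jiffies(unsigned int m)
{
	return ((unsigned long)m * SKETCH_HZ + 999UL) / 1000UL;
}

/* Convert msecs to ticks, never returning fewer than 2 ticks. */
static unsigned long tcp_msecs_to_jiffies_min_2(unsigned int m)
{
	unsigned long j = sketch_msecs_to_jiffies(m);

	return j < 2UL ? 2UL : j;
}
```

With a name like this the caller can see the lower bound without reading
the comment.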

David


RE: [PATCH] sctp: Fix mangled IPv4 addresses on a IPv6 listening socket

2015-05-28 Thread David Laight

From: Jason Gunthorpe
> Sent: 27 May 2015 18:05
> On Wed, May 27, 2015 at 04:41:18PM +, David Laight wrote:
> 
> > The code will be sleeping in kernel_accept() and later calls
> > kernel_getpeername().
> > The code is used for both TCP and SCTP and this part is common (using
> > the TCP semantics).
> 
> getpeername uses a different flow, it calls into inet6_getname which
> will always return the AF_INET6 version.

Ok, that explains why I hadn't seen the problem.
It also means I don't have to worry about it.

David


Re: kmalloc panic

2015-05-28 Thread Johannes Berg

On Wed, 2015-05-27 at 22:15 -0700, Cong Wang wrote:

> > rsi_client: module license 'Proprietary' taints kernel.
> > Disabling lock debugging due to kernel taint
> > RSI_Init called and registering the client driver

If this is what I think it is - the redpine signals wifi driver, then I
have no interest in this bug report whatsoever.

Please tell us
 * the exact kernel version, best with local patches
 * the wifi driver used

> > INFO: Allocated in 0x10100100 age=2669517238 cpu=16842820 pid=335563024

and what function this really was (0x10100100 isn't really useful).

johannes


Re: [PATCH net iproute2 v2 1/2] mpls: always set type as RTN_UNICAST for route add/deletes

2015-05-28 Thread Robert Shearman


On 28/05/15 01:06, roopa wrote:

On 5/27/15, 1:08 PM, roopa wrote:

On 5/27/15, 12:59 PM, Robert Shearman wrote:

On 27/05/15 19:37, Roopa Prabhu wrote:

From: Roopa Prabhu 

Kernel expects type RTN_UNICAST for mpls route/dels

Signed-off-by: Vivek Venkataraman 
Signed-off-by: Roopa Prabhu 
---
  ip/iproute.c |5 +
  1 file changed, 5 insertions(+)

diff --git a/ip/iproute.c b/ip/iproute.c
index 670a4c6..71c088b 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -803,6 +803,7 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
  int scope_ok = 0;
  int table_ok = 0;
  int raw = 0;
+int type_ok = 0;

  memset(&req, 0, sizeof(req));

@@ -1095,6 +1096,7 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
  rtnl_rtntype_a2n(&type, *argv) == 0) {
  NEXT_ARG();
  req.r.rtm_type = type;
+type_ok = 1;
  }

  if (matches(*argv, "help") == 0)
@@ -1160,6 +1162,9 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
  }
  }

+if (!type_ok && req.r.rtm_family == AF_MPLS)
+req.r.rtm_type = RTN_UNICAST;
+
  if (req.r.rtm_family == AF_UNSPEC)
  req.r.rtm_family = AF_INET;




There is this block of code near the start of iproute_modify that
sets req.r.rtm_type in the add/modify cases:

if (cmd != RTM_DELROUTE) {
	req.r.rtm_protocol = RTPROT_BOOT;
	req.r.rtm_scope = RT_SCOPE_UNIVERSE;
	req.r.rtm_type = RTN_UNICAST;
}

How about doing similar for the mpls delete case? This would avoid
the need to track if the type has been set and would also make the
way rtm_type is set in the delete case as close as possible to that
in the add/modify cases.

sure that works too. There was already *_ok checks for the rest of the
attributes, ..so added it there.

v3 ...coming...


looking at the code again..now i remember why i have it this way. I will
have to add a check for family around
the code you point out above. And in some cases if the user has not
specified the family explicitly, we derive the msg family
in the while loop that parses the args...based on the other arguments
given by the user.
In the particular mpls case though, user explicitly specifies the family
and moving the patch to the code you point above should be ok.

But to be consistent with the rest of the code, it seems better to do
the check and set the defaults at the end after parsing all the args.
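
A small standalone sketch of the "default after parsing" approach
discussed here. The SKETCH_* constants and resolve_type() are
hypothetical stand-ins for illustration; the real code operates on
req.r.rtm_family and req.r.rtm_type inside iproute_modify().

```c
#include <stdbool.h>

/* Stand-in constants for this sketch; the real ones come from system headers. */
#define SKETCH_AF_INET      2
#define SKETCH_AF_MPLS      28
#define SKETCH_RTN_UNSPEC   0
#define SKETCH_RTN_UNICAST  1

/*
 * Decide the final route type once all arguments have been parsed.
 * "type_ok" records whether the user gave a type explicitly; the family
 * may have been derived from other arguments during parsing, which is
 * why the check runs at the end rather than at the start.
 */
static int resolve_type(int family, int parsed_type, bool type_ok)
{
	if (!type_ok && family == SKETCH_AF_MPLS)
		return SKETCH_RTN_UNICAST;
	return parsed_type;
}
```

An explicit type always wins; the default only applies to MPLS when the
user said nothing.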

So, now i am inclined to keep the v2 patch as is...unless you have
strong reasons.


Ah, yes, of course.

In that case, LGTM.

Thanks,
Rob

Re: [PATCH net iproute2 v2 1/2] mpls: always set type as RTN_UNICAST for route add/deletes

2015-05-28 Thread Robert Shearman


On 27/05/15 19:37, Roopa Prabhu wrote:

From: Roopa Prabhu 

Kernel expects type RTN_UNICAST for mpls route/dels

Signed-off-by: Vivek Venkataraman 
Signed-off-by: Roopa Prabhu 


Reviewed-by: Robert Shearman 


---
  ip/iproute.c |5 +
  1 file changed, 5 insertions(+)

diff --git a/ip/iproute.c b/ip/iproute.c
index 670a4c6..71c088b 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -803,6 +803,7 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
int scope_ok = 0;
int table_ok = 0;
int raw = 0;
+   int type_ok = 0;

memset(&req, 0, sizeof(req));

@@ -1095,6 +1096,7 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
rtnl_rtntype_a2n(&type, *argv) == 0) {
NEXT_ARG();
req.r.rtm_type = type;
+   type_ok = 1;
}

if (matches(*argv, "help") == 0)
@@ -1160,6 +1162,9 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
}
}

+   if (!type_ok && req.r.rtm_family == AF_MPLS)
+   req.r.rtm_type = RTN_UNICAST;
+
if (req.r.rtm_family == AF_UNSPEC)
req.r.rtm_family = AF_INET;




RE: [PATCH/RFC net-next] rocker: remove rocker parameter from functions that have rocker_port parameter

2015-05-28 Thread David Laight

From: Simon Horman
> Sent: 28 May 2015 04:23
> The rocker (switch) of a rocker_port may be trivially obtained from
> the latter it seems cleaner not to pass the former to a function when
> the latter is being passed anyway.

If the arguments are passed in registers (they almost certainly are)
or the function is inlined (possible since they are static) and
the calling code already has both values in registers then
passing both values saves a memory read inside the called code.

So on 'hot paths' it probably makes sense to pass both values.
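
To illustrate the trade-off, a minimal sketch of the two calling
conventions. The struct layouts and function names below are
hypothetical and much simpler than the real rocker driver; they only
show where the extra memory read comes from.

```c
/* Hypothetical, stripped-down stand-ins for the driver structures. */
struct rocker {
	int id;
};

struct rocker_port {
	struct rocker *rocker;
};

/* Variant A: derive the switch from the port (one extra pointer load). */
static int switch_id_from_port(const struct rocker_port *port)
{
	return port->rocker->id;
}

/* Variant B: caller already holds both values and passes both. */
static int switch_id_from_both(const struct rocker *rocker,
			       const struct rocker_port *port)
{
	(void)port;	/* unused here; real code would use it for port state */
	return rocker->id;
}
```

Whether the saved load matters depends on how hot the caller is, so it
would need measuring rather than assuming.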

David


Re: [PATCH net-next V4 00/12] net/mlx5: ConnectX-4 100G Ethernet driver

2015-05-28 Thread Amir Vadai

On Thu, May 28, 2015 at 11:52 AM, David Miller  wrote:
> From: Ben Hutchings 
> Date: Wed, 27 May 2015 20:57:37 +0100
>
>> How would an application tell the difference between an IRQ handler
>> being renamed, or being unregistered and re-registered under a different
>> name?  I'm fairly sure it can't tell.
>
> What do things like the userland IRQ balancer do?

Thanks to Neil Horman, userland scripts can get the irq number from
sysfs (/sys/bus/pci/devices//msi_irqs) which is not based on
the irq naming [1].
He also fixed irq_balancer [2] to use this API instead of being based
on those strings.
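
For reference, a minimal sketch of how a userland tool could collect the
irq numbers from such a directory. list_msi_irqs() is a hypothetical
helper, not code from irqbalance; it assumes the entries in the msi_irqs
directory are named after the irq numbers, and the caller supplies the
full directory path.

```c
#include <dirent.h>
#include <stdlib.h>

/*
 * Collect up to "max" IRQ numbers from a per-device msi_irqs directory.
 * Returns the number of IRQs stored, or -1 if the directory cannot be
 * opened.
 */
static int list_msi_irqs(const char *dir_path, int *irqs, int max)
{
	DIR *dir = opendir(dir_path);
	struct dirent *de;
	int n = 0;

	if (!dir)
		return -1;

	while (n < max && (de = readdir(dir)) != NULL) {
		char *end;
		long irq = strtol(de->d_name, &end, 10);

		/* Skip ".", ".." and anything that is not purely numeric. */
		if (end == de->d_name || *end != '\0')
			continue;
		irqs[n++] = (int)irq;
	}

	closedir(dir);
	return n;
}
```

Nothing here looks at the irq name, which is the point of the sysfs
interface.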

I will drop the irq renaming from the patchset. mlx5_core driver will
set generic irq names (since same irq's might service both Ethernet
and Infiniband), for example: mlx5_comp0@pci::00:04.0.

Thanks,
Amir

[1] - kernel: da8d1c8 PCI/sysfs: add per pci device msi[x] irq listing (v5)
[2] - irq_balancer: 32a7757 Complete rework of how we detect and classify irqs

pull request: bluetooth-next 2015-05-28

2015-05-28 Thread Johan Hedberg

Hi Dave,

Here's a set of patches intended for 4.2. The majority of the changes
are on the 802.15.4 side of things rather than Bluetooth related:

 - All sorts of cleanups & fixes to ieee802154 and related drivers
 - Rework of tx power support in ieee802154 and its drivers
 - Support for setting ieee802154 tx power through nl802154
 - New IDs for the btusb driver
 - Various cleanups & smaller fixes to btusb
 - New btrtl driver for Realtek devices
 - Fix suspend/resume for Realtek devices

Please let me know if there are any issues pulling. Thanks.

Johan

---
The following changes since commit f0b5e8a42f37a880b8467e59dc814f4f21581d3d:

  net: kill useless net_*_ingress_queue() definitions when NET_CLS_ACT is unset (2015-05-13 15:44:28 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git for-upstream

for you to fetch changes up to b5a61c306b0dddb28e3a3ab5d782c73e5f665497:

  atusb: add support for at86rf230 (2015-05-27 19:29:54 +0200)


Alexander Aring (42):
  nl802154: cleanup invalid argument handling
  ieee802154: move validation check out of softmac
  ieee802154: change transmit power to s32
  ieee802154: change transmit power to mbm
  ieee802154: change cca ed level to mbm
  ieee802154: introduce wpan_phy_supported
  ieee802154: add several phy supported handling
  mac802154: check for really changes
  mac802154: remove check if operation is supported
  cfg802154: introduce wpan phy flags
  ieee802154: add iftypes capability
  at86rf230: set cca_modes supported flags
  at86rf230: rework tx power support
  at86rf230: rework tx cca energy detection level
  at86rf230: add cca ed level reset value
  at86rf230: add reset states of tx power level
  nl802154: add support for dump phy capabilities
  at86rf230: fix callback for aret handling
  mac802154: tx: allow xmit complete from hard irq
  ieee802154: add support for atusb transceiver
  fakelb: creating two virtual phys per default
  fakelb: use list_for_each_entry_safe
  fakelb: rename fakelb_dev_priv to fakelb_phy
  fakelb: don't deliver when one phy
  fakelb: declare rwlock static
  fakelb: declare fakelb list static
  fakelb: move lock out of iteration
  fakelb: introduce fakelb ifup phys list
  fakelb: use own channel and page attributes
  fakelb: add virtual phy reset defaults
  fakelb: remove fakelb_hw_deliver
  fakelb: add support for async xmit handling
  fakelb: cleanup code
  at86rf230: add missing cca ed level values
  mac802154: fix hold rtnl while ioctl
  mac802154: remove pib lock
  mac802154: use atomic ops for sequence incrementation
  mac802154: remove mib lock
  nl802154: fix cca mode wpan phy flag
  nl802154: add support for cca ed level info
  nl802154: add support to set cca ed level
  atusb: add support for at86rf230

Arnd Bergmann (1):
  mac802154: select CRYPTO when needed

Carlo Caione (1):
  Bluetooth: btrtl: Create separate module for Realtek BT driver

Chan-yeol Park (1):
  Bluetooth: btusb: Support QCA61x4 ROME v2.0

Daniel Drake (1):
  Bluetooth: btusb: fix Realtek suspend/resume

Florian Grandel (1):
  Bluetooth: mgmt: fix typos

Frederic Danis (4):
  Bluetooth: Fix calls to __hci_cmd_sync()
  Bluetooth: btusb: Fix calls to __hci_cmd_sync()
  Bluetooth: btintel: Fix calls to __hci_cmd_sync()
  Bluetooth: btbcm: Fix calls to __hci_cmd_sync()

Johan Hedberg (1):
  Bluetooth: Add debug logs for legacy SMP crypto functions

Lennert Buytenhek (7):
  mac802154: Avoid rtnl deadlock in mac802154_wpan_ioctl().
  ieee802154 socket: Return EMSGSIZE from raw_sendmsg() if packet too big.
  Documentation/networking/ieee802154.txt: fix various inaccuracies.
  ieee802154: Remove ieee802154_reduced_mlme_ops references.
  ieee802154: Remove 802.15.4/6LoWPAN checks for interface MTU.
  ieee802154 socket: No need to check for ARPHRD_IEEE802154 in raw_bind().
  mac802154: mac802154_mlme_start_req() optimisation.

Leo Yan (1):
  Bluetooth: btwilink: remove DEBUG define

Martin Townsend (1):
  mac802154: fakelb: Fix potential NULL pointer dereference.

Shailendra Verma (2):
  Bluetooth: btusb: Change 1 to true in bool type variable assignment
  Bluetooth: hci_uart: Change 1 to true for bool type variables assignments

Stefan Schmidt (3):
  ieee802154/atusb: Warn about outdated device firmware.
  ieee802154/atusb: Mark driver as AACK enabled in hardware.
  ieee802154/atusb: Set default ed level to 0xbe like the rest of these drivers

Varka Bhadram (2):
  ieee802154: add set transmit power support
  ieee802154: fix typo for file name

Xinming Hu (1):
  Bluetooth: btmrvl: fix compilation warning

 Documentation/networking/ieee802154.txt |  32

Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-28 Thread Jiri Pirko

Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote:
>From: Roopa Prabhu 
>Date: Sun, 17 May 2015 16:42:05 -0700
>
>> On most systems where you can offload routes to hardware,
>> doing routing in software is not an option (the cpu limitations
>> make routing impossible in software).
>
>You absolutely do not get to determine this policy, none of us
>do.
>
>What matters is that by default the damn switch device being there
>is %100 transparent to the user.
>
>And the way to achieve that default is to do software routes as
>a fallback.
>
>I am not going to entertain changes of this nature which fail
>route loading by default just because we've exceeded a device's
>HW capacity to offload.
>
>I thought I was _really_ clear about this at netdev 0.1

I certainly agree that by default, transparency 1:1 sw:hw mapping is
what we need for fib. The current code is a good start!

I see couple of issues regarding switchdev_fib_ipv4_abort:
1) If user adds an entry and switchdev_fib_ipv4_add fails, abort is
   executed and an error is returned. I would expect that the route entry
   should be added in this case. The next attempt to add the same entry
   will be successful.
   The current behaviour breaks the transparency you are referring to.
2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
   disabled for good (until reboot). That is certainly not nice, although
   I understand that is the easiest solution for now.

I believe that we all agree that the 1:1 transparency, although it is a
default, may not be optimal for real-life usage. HW resources are
limited and user does not know them. The danger of hitting _abort and
screwing-up the whole system is huge, unacceptable.

So here, there are couple of more or less simple things that I suggest to
do in order to move a little bit forward:
1) Introduce system-wide option to switch _abort to just plain fail.
   When HW does not have capacity, do not flush and fallback to sw, but
   rather just fail to add the entry. This would not break anything.
   Userspace has to be prepared that entry add could fail.
2) Introduce a way to propagate resources to userspace. Driver knows about
   resources used/available/potentially_available. Switchdev infra could
   be extended in order to propagate the info to the user.
3) Introduce couple of flags for entry add that would alter the default
   behaviour. Something like:
NLM_F_SKIP_KERNEL
NLM_F_SKIP_OFFLOAD
   Again, this does not break the current users. On the other hand, this
   gives new users a leverage to instruct kernel where the entry should
   be added to (or not added to).
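
A rough sketch of how suggestions 1) and 3) could combine into one
per-entry decision. Everything below is hypothetical: the flag values,
the enum and fib_add_policy() are made up for illustration, the flags
exist only as a proposal in this mail, and the sketch simplifies the
fallback to a single entry whereas the current _abort path flushes all
offloaded routes.

```c
#include <stdbool.h>

/* Hypothetical bit values for the proposed flags. */
#define SKETCH_F_SKIP_KERNEL   0x1u
#define SKETCH_F_SKIP_OFFLOAD  0x2u

enum fib_add_result {
	FIB_ADD_FAILED,
	FIB_ADD_SW_ONLY,
	FIB_ADD_HW_ONLY,
	FIB_ADD_SW_AND_HW,
};

/*
 * Decide where a new entry ends up.
 *  flags        - proposed per-entry flags
 *  hw_has_room  - whether the device can take one more entry
 *  plain_fail   - the system-wide option from 1): fail instead of
 *                 falling back to software
 */
static enum fib_add_result fib_add_policy(unsigned int flags,
					  bool hw_has_room, bool plain_fail)
{
	bool want_sw = !(flags & SKETCH_F_SKIP_KERNEL);
	bool want_hw = !(flags & SKETCH_F_SKIP_OFFLOAD);

	if (!want_sw && !want_hw)
		return FIB_ADD_FAILED;

	if (want_hw && !hw_has_room) {
		/* No software copy allowed, or fallback disabled: just fail. */
		if (!want_sw || plain_fail)
			return FIB_ADD_FAILED;
		return FIB_ADD_SW_ONLY;
	}

	if (want_sw && want_hw)
		return FIB_ADD_SW_AND_HW;
	return want_hw ? FIB_ADD_HW_ONLY : FIB_ADD_SW_ONLY;
}
```

With flags == 0 and plain_fail unset, the sketch keeps the transparent
default: the entry always lands in software, and in hardware when there
is room.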

Any thoughts? Objections?

Thanks!

Jiri

[PATCH] netevent: remove automatic variable in register_netevent_notifier()

2015-05-28 Thread Wang Long

Remove automatic variable 'err' in register_netevent_notifier() and
return the return value of atomic_notifier_chain_register() directly.

Signed-off-by: Wang Long 
---
 net/core/netevent.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/core/netevent.c b/net/core/netevent.c
index f17ccd2..8b3bc4f 100644
--- a/net/core/netevent.c
+++ b/net/core/netevent.c
@@ -31,10 +31,7 @@ static ATOMIC_NOTIFIER_HEAD(netevent_notif_chain);
  */
 int register_netevent_notifier(struct notifier_block *nb)
 {
-   int err;
-
-   err = atomic_notifier_chain_register(&netevent_notif_chain, nb);
-   return err;
+   return atomic_notifier_chain_register(&netevent_notif_chain, nb);
 }
 EXPORT_SYMBOL_GPL(register_netevent_notifier);
 
-- 
1.8.3.4


Re: [PATCHv3] pktgen: Convert return type of process_ipsec to bool

2015-05-28 Thread Jesper Dangaard Brouer

On Thu, 28 May 2015 00:11:05 -0400
Nicholas Krause  wrote:

> This converts the function, process_ipsec to the 
> return type of bool due to only returning either
> one or zero.
> 
> Signed-off-by: Nicholas Krause 
> ---
> v3
> Move the v2 changes below the sign off line for this patch.
> v2
> Change incorrect patch subject and make commit message
> clearer
>  net/core/pktgen.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/net/core/pktgen.c b/net/core/pktgen.c
> index 508155b..33bdb76 100644
> --- a/net/core/pktgen.c
> +++ b/net/core/pktgen.c
> @@ -2587,7 +2587,7 @@ static void free_SAs(struct pktgen_dev *pkt_dev)
>   }
>  }
>  
> -static int process_ipsec(struct pktgen_dev *pkt_dev,
> +static bool process_ipsec(struct pktgen_dev *pkt_dev,
> struct sk_buff *skb, __be16 protocol)

When doing this change, could you please align the above line to the
open parenthesis of process_ipsec (even-though it was also misaligned
before).

scripts/checkpatch.pl will tell you:
 CHECK: Alignment should match open parenthesis 

Did anyone tell you that kernel developers nitpick? ;-)

And usually you don't need to Cc the "main" Linux Kernel Mailing List
(linux-ker...@vger.kernel.org) with a trivial patch like this.  Sending
it to the network developers should be enough (netdev@vger.kernel.org).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Re: [PATCH net-next] neigh: Add missing rcu_assign_pointer

2015-05-28 Thread Eric Dumazet

On Thu, 2015-05-28 at 16:28 +0800, Ying Xue wrote:
> Commit e4c4e448cf55 ("neigh: Convert garbage collection from softirq
> to workqueue") misses to use rcu_assign_pointer() macro to assign a
> RCU-protected pointer.
> 
> Signed-off-by: Ying Xue 
> ---
>  net/core/neighbour.c |3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 3a74df7..aaad3a5 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -783,7 +783,8 @@ static void neigh_periodic_work(struct work_struct *work)
>   if (atomic_read(&n->refcnt) == 1 &&
>   (state == NUD_FAILED ||
>  			     time_after(jiffies, n->used + NEIGH_VAR(n->parms, GC_STALETIME)))) {
> -				*np = n->next;
> +				rcu_assign_pointer(*np, rcu_dereference_protected(n->next,
> +						lockdep_is_held(&tbl->lock)));
>   n->dead = 1;
>   write_unlock(&n->lock);
>   neigh_cleanup_and_release(n);


This patch is not needed.

You really should read Documentation/RCU , because it looks like you are
quite confused.

When we remove an element from a RCU protected list, all the objects in
the chain are already ready to be caught by rcu readers.

Therefore, no additional memory barrier is needed before doing *np =
n->next;

Please do not add spurious memory barriers. Like atomic operations, we
want all of them being required and possibly documented.
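
To show the shape of the two operations, a plain single-threaded C
sketch of the "np" idiom. It is not RCU code: there are no readers, no
locking and no grace period here, and the names are made up for
illustration. The comments only mark which store would be
rcu_assign_pointer() in the kernel and which one stays a plain store.

```c
#include <stddef.h>

struct node {
	int key;
	struct node *next;
};

/*
 * Insert at the head. In the kernel, the store to *head is where
 * rcu_assign_pointer() belongs: the new node's fields must be visible
 * before the pointer to it is.
 */
static void list_insert(struct node **head, struct node *n)
{
	n->next = *head;
	*head = n;
}

/*
 * Unlink every node with a matching key. The store "*np = n->next"
 * makes *np point at a node that was published earlier, so nothing new
 * becomes reachable by this store.
 */
static int list_remove_key(struct node **head, int key)
{
	struct node **np = head;
	struct node *n;
	int removed = 0;

	while ((n = *np) != NULL) {
		if (n->key == key) {
			*np = n->next;
			removed++;
			continue;
		}
		np = &n->next;
	}
	return removed;
}

/* Count the nodes currently linked into the list. */
static int list_len(const struct node *head)
{
	int len = 0;

	for (; head; head = head->next)
		len++;
	return len;
}
```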




Re: [PATCH] sctp: fix ASCONF list handling

2015-05-28 Thread Neil Horman

On Wed, May 27, 2015 at 09:52:17PM -0300, mleit...@redhat.com wrote:
> From: Marcelo Ricardo Leitner 
> 
> ->auto_asconf_splist is per namespace and mangled by functions like
> sctp_setsockopt_auto_asconf() which doesn't guarantee any serialization.
> 
> Also, the call to inet_sk_copy_descendant() was backuping
> ->auto_asconf_list through the copy but was not honoring
> ->do_auto_asconf, which could lead to list corruption if it was
> different between both sockets.
> 
> This commit thus fixes the list handling by adding a spinlock to protect
> against multiple writers and converts the list to be protected by RCU
> too, so that we don't have a lock inversion issue at
> sctp_addr_wq_timeout_handler().
> 
> And as this list now uses RCU, we cannot do such backup and restore
> while copying descendant data anymore as readers may be traversing the
> list meanwhile. We fix this by simply ignoring/not copying those fields,
> placed at the end of struct sctp_sock, so we can just ignore it together
> with struct ipv6_pinfo data. For that we create sctp_copy_descendant()
> so we don't clutter inet_sk_copy_descendant() with SCTP info.
> 
> Issue was found with a test application that kept flipping sysctl
> default_auto_asconf on and off.
> 
> Fixes: 9f7d653b67ae ("sctp: Add Auto-ASCONF support (core).")
> Signed-off-by: Marcelo Ricardo Leitner 
> ---
>  include/net/netns/sctp.h   |  6 +-
>  include/net/sctp/structs.h |  2 ++
>  net/sctp/protocol.c|  6 +-
>  net/sctp/socket.c  | 39 ++-
>  4 files changed, 38 insertions(+), 15 deletions(-)
> 
> diff --git a/include/net/netns/sctp.h b/include/net/netns/sctp.h
> index 3573a81815ad9e0efb6ceb721eb066d3726419f0..e080bebb3147af39c8275261f57018eb01e917b0 100644
> --- a/include/net/netns/sctp.h
> +++ b/include/net/netns/sctp.h
> @@ -30,12 +30,15 @@ struct netns_sctp {
>   struct list_head local_addr_list;
>   struct list_head addr_waitq;
>   struct timer_list addr_wq_timer;
> - struct list_head auto_asconf_splist;
> + struct list_head __rcu auto_asconf_splist;
You should use the addr_wq_lock here instead of creating a new lock, as that's
already used to protect most accesses to the list you are concerned about.
Though truthfully, that shouldn't be necessary.  The list in question is only
read in one location and only written in one location.  You can likely just
rcu-ify, as the write side is in process context and protected by lock_sock.

Neil


Re: [PATCH] sctp: fix ASCONF list handling

2015-05-28 Thread Marcelo Ricardo Leitner

On Thu, May 28, 2015 at 06:15:11AM -0400, Neil Horman wrote:
> On Wed, May 27, 2015 at 09:52:17PM -0300, mleit...@redhat.com wrote:
> > From: Marcelo Ricardo Leitner 
> > 
> > ->auto_asconf_splist is per namespace and mangled by functions like
> > sctp_setsockopt_auto_asconf() which doesn't guarantee any serialization.
> > 
> > Also, the call to inet_sk_copy_descendant() was backuping
> > ->auto_asconf_list through the copy but was not honoring
> > ->do_auto_asconf, which could lead to list corruption if it was
> > different between both sockets.
> > 
> > This commit thus fixes the list handling by adding a spinlock to protect
> > against multiple writers and converts the list to be protected by RCU
> > too, so that we don't have a lock inversion issue at
> > sctp_addr_wq_timeout_handler().
> > 
> > And as this list now uses RCU, we cannot do such backup and restore
> > while copying descendant data anymore as readers may be traversing the
> > list meanwhile. We fix this by simply ignoring/not copying those fields,
> > placed at the end of struct sctp_sock, so we can just ignore it together
> > with struct ipv6_pinfo data. For that we create sctp_copy_descendant()
> > so we don't clutter inet_sk_copy_descendant() with SCTP info.
> > 
> > Issue was found with a test application that kept flipping sysctl
> > default_auto_asconf on and off.
> > 
> > Fixes: 9f7d653b67ae ("sctp: Add Auto-ASCONF support (core).")
> > Signed-off-by: Marcelo Ricardo Leitner 
> > ---
> >  include/net/netns/sctp.h   |  6 +-
> >  include/net/sctp/structs.h |  2 ++
> >  net/sctp/protocol.c|  6 +-
> >  net/sctp/socket.c  | 39 ++-
> >  4 files changed, 38 insertions(+), 15 deletions(-)
> > 
> > diff --git a/include/net/netns/sctp.h b/include/net/netns/sctp.h
> > index 3573a81815ad9e0efb6ceb721eb066d3726419f0..e080bebb3147af39c8275261f57018eb01e917b0 100644
> > --- a/include/net/netns/sctp.h
> > +++ b/include/net/netns/sctp.h
> > @@ -30,12 +30,15 @@ struct netns_sctp {
> > struct list_head local_addr_list;
> > struct list_head addr_waitq;
> > struct timer_list addr_wq_timer;
> > -   struct list_head auto_asconf_splist;
> > +   struct list_head __rcu auto_asconf_splist;
> You should use the addr_wq_lock here instead of creating a new lock, as that's
> already used to protect most accesses to the list you are concerned about.

Ok, that works too.

> Though truthfully, that shouldn't be necessary.  The list in question is only
> read in one location and only written in one location.  You can likely just
> rcu-ify, as the write side is in process context and protected by lock_sock.

It should, it's not protected by lock_sock as this list resides in
netns_sctp structure, which lock_sock doesn't cover. Write side is in
process context yes, but this list is written in sctp_init_sock(),
sctp_destroy_sock() and sctp_setsockopt_auto_asconf(), so one could
trigger this by either creating/destroying sockets if
default_auto_asconf=1 or just by creating a bunch of sockets and
flipping asconf via setsockopt (or a combination of these operations).
(I'll point this out in the changelog)

Btw I have two nits on the patch kindly brought to my attention already,
on adding blank newline and bad comment block, will fix it in v2.

  Marcelo


[PATCH net-next] vlan: Add GRO support for non hardware accelerated vlan

2015-05-28 Thread Toshiaki Makita

Currently packets with non-hardware-accelerated vlan cannot be handled
by GRO. This causes low performance for 802.1ad and stacked vlan, as their
vlan tags are currently not stripped by hardware.

This patch adds GRO support for non-hardware-accelerated vlan and
improves receive performance of them.

Test Environment:
 vlan device (.1Q) on vlan device (.1ad) on ixgbe (82599)

Result:

- Before

$ netperf -t TCP_STREAM -H 192.168.20.2 -l 60
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    60.00    5233.17

Rx side CPU usage:
  %usr  %sys  %irq %soft %idle
  0.27 58.03  0.00 41.70  0.00

- After

$ netperf -t TCP_STREAM -H 192.168.20.2 -l 60
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    60.00    7586.85

Rx side CPU usage:
  %usr  %sys  %irq %soft %idle
  0.50 25.83  0.00 59.53 14.14

Signed-off-by: Toshiaki Makita 
---
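Note (not part of the patch): a minimal userspace sketch of the two header
operations that the GRO receive path relies on, assuming a raw 4-byte VLAN
header (2 bytes TCI followed by 2 bytes encapsulated EtherType). The helper
names vlan_same_flow() and vlan_encap_proto() are made up for illustration
and are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VLAN_HLEN 4	/* TCI (2 bytes) + encapsulated protocol (2 bytes) */

/* Two packets may only be merged if their VLAN headers match byte for
 * byte; this mirrors the memcmp() over VLAN_HLEN bytes in the patch. */
static int vlan_same_flow(const uint8_t *a, const uint8_t *b)
{
	return memcmp(a, b, VLAN_HLEN) == 0;
}

/* Read the encapsulated EtherType from a raw VLAN header and return it
 * in host byte order. It selects the next-layer offload handler. */
static uint16_t vlan_encap_proto(const uint8_t *vlan)
{
	return (uint16_t)((vlan[2] << 8) | vlan[3]);
}
```

A differing VLAN id (or priority) in the TCI is enough to keep two packets
in separate flows, which is why the whole header is compared rather than
only the encapsulated protocol.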
 net/8021q/vlan.c | 94 
 1 file changed, 94 insertions(+)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 59555f0..0a9e8e1 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -618,6 +618,90 @@ out:
return err;
 }
 
+static struct sk_buff **vlan_gro_receive(struct sk_buff **head,
+struct sk_buff *skb)
+{
+   struct sk_buff *p, **pp = NULL;
+   struct vlan_hdr *vhdr;
+   unsigned int hlen, off_vlan;
+   const struct packet_offload *ptype;
+   __be16 type;
+   int flush = 1;
+
+   off_vlan = skb_gro_offset(skb);
+   hlen = off_vlan + sizeof(*vhdr);
+   vhdr = skb_gro_header_fast(skb, off_vlan);
+   if (skb_gro_header_hard(skb, hlen)) {
+   vhdr = skb_gro_header_slow(skb, hlen, off_vlan);
+   if (unlikely(!vhdr))
+   goto out;
+   }
+
+   type = vhdr->h_vlan_encapsulated_proto;
+
+   rcu_read_lock();
+   ptype = gro_find_receive_by_type(type);
+   if (!ptype)
+   goto out_unlock;
+
+   flush = 0;
+
+   for (p = *head; p; p = p->next) {
+   struct vlan_hdr *vhdr2;
+
+   if (!NAPI_GRO_CB(p)->same_flow)
+   continue;
+
+   vhdr2 = (struct vlan_hdr *)(p->data + off_vlan);
+   if (memcmp(vhdr, vhdr2, VLAN_HLEN))
+   NAPI_GRO_CB(p)->same_flow = 0;
+   }
+
+   skb_gro_pull(skb, sizeof(*vhdr));
+   skb_gro_postpull_rcsum(skb, vhdr, sizeof(*vhdr));
+   pp = ptype->callbacks.gro_receive(head, skb);
+
+out_unlock:
+   rcu_read_unlock();
+out:
+   NAPI_GRO_CB(skb)->flush |= flush;
+
+   return pp;
+}
+
+static int vlan_gro_complete(struct sk_buff *skb, int nhoff)
+{
+   struct vlan_hdr *vhdr = (struct vlan_hdr *)(skb->data + nhoff);
+   __be16 type = vhdr->h_vlan_encapsulated_proto;
+   struct packet_offload *ptype;
+   int err = -ENOENT;
+
+   rcu_read_lock();
+   ptype = gro_find_complete_by_type(type);
+   if (ptype)
+   err = ptype->callbacks.gro_complete(skb, nhoff + sizeof(*vhdr));
+
+   rcu_read_unlock();
+   return err;
+}
+
+static struct packet_offload vlan_packet_offloads[] __read_mostly = {
+   {
+   .type = cpu_to_be16(ETH_P_8021Q),
+   .callbacks = {
+   .gro_receive = vlan_gro_receive,
+   .gro_complete = vlan_gro_complete,
+   },
+   },
+   {
+   .type = cpu_to_be16(ETH_P_8021AD),
+   .callbacks = {
+   .gro_receive = vlan_gro_receive,
+   .gro_complete = vlan_gro_complete,
+   },
+   },
+};
+
 static int __net_init vlan_init_net(struct net *net)
 {
struct vlan_net *vn = net_generic(net, vlan_net_id);
@@ -645,6 +729,7 @@ static struct pernet_operations vlan_net_ops = {
 static int __init vlan_proto_init(void)
 {
int err;
+   unsigned int i;
 
pr_info("%s v%s\n", vlan_fullname, vlan_version);
 
@@ -668,6 +753,9 @@ static int __init vlan_proto_init(void)
if (err < 0)
goto err5;
 
+   for (i = 0; i < ARRAY_SIZE(vlan_packet_offloads); i++)
+   dev_add_offload(&vlan_packet_offloads[i]);
+
vlan_ioctl_set(vlan_ioctl_handler);
return 0;
 
@@ -685,7 +773,13 @@ err0:
 
 static void __exit vlan_cleanup_module(void)
 {
+   unsigned int i;
+
vlan_ioctl_set(NULL);
+
+   for (i = 0; i < ARRAY_SIZE(vlan_packet_offloads); i++)
+   dev_remove_offload(&vlan_packet_offloads[i]);
+
vlan_netlink_fini();
 
unregister_netdevice_notifier(&vlan_notifier_block);
-- 
1.8.1.2


[net-next 05/14] i40e/i40evf: Add ATR support for tunneled TCP/IPv4/IPv6 packets.

2015-05-28 Thread Jeff Kirsher

From: Anjali Singhai Jain 

Without this, RSS would have done inner header load balancing. Now we can
get the benefits of ATR for tunneled packets to better align TX and RX
queues with the right core/interrupt.

Change-ID: I07d0e0a192faf28fdd33b2f04c32b2a82ff97ddd
Signed-off-by: Anjali Singhai Jain 
Signed-off-by: Jesse Brandeburg 
Signed-off-by: Jeff Kirsher 
---
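Note (not part of the patch): a minimal userspace sketch of how the TCP
header is located behind an IPv4 header, assuming a raw byte buffer that
starts at the (inner or outer) IPv4 header. The helper names ipv4_hlen()
and ipv4_tcp_header() are hypothetical; the driver works on its own
header union instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROTO_TCP 6	/* value of IPPROTO_TCP */

/* IPv4 header length in bytes: the low nibble of the first byte is the
 * IHL field, counted in 32-bit words. Reading it as a byte avoids any
 * unaligned multi-byte access. */
static size_t ipv4_hlen(const uint8_t *iph)
{
	return (size_t)(iph[0] & 0x0F) << 2;
}

/* Return a pointer to the TCP header, or NULL if the packet is not TCP.
 * The protocol field is byte 9 of the IPv4 header. */
static const uint8_t *ipv4_tcp_header(const uint8_t *iph)
{
	if (iph[9] != PROTO_TCP)
		return NULL;
	return iph + ipv4_hlen(iph);
}
```

For a tunneled frame the same arithmetic applies, only the starting
pointer is the inner network header rather than the outer one.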
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 77 +++
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |  1 +
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 34 ++--
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h |  1 +
 4 files changed, 62 insertions(+), 51 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 0b4a7be..8565495 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1923,11 +1923,11 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
  * i40e_atr - Add a Flow Director ATR filter
  * @tx_ring:  ring to add programming descriptor to
  * @skb:  send buffer
- * @flags:send flags
+ * @tx_flags: send tx flags
  * @protocol: wire protocol
  **/
 static void i40e_atr(struct i40e_ring *tx_ring, struct sk_buff *skb,
-u32 flags, __be16 protocol)
+u32 tx_flags, __be16 protocol)
 {
struct i40e_filter_program_desc *fdir_desc;
struct i40e_pf *pf = tx_ring->vsi->back;
@@ -1952,25 +1952,38 @@ static void i40e_atr(struct i40e_ring *tx_ring, struct sk_buff *skb,
if (!tx_ring->atr_sample_rate)
return;
 
-   /* snag network header to get L4 type and address */
-   hdr.network = skb_network_header(skb);
+   if (!(tx_flags & (I40E_TX_FLAGS_IPV4 | I40E_TX_FLAGS_IPV6)))
+   return;
 
-   /* Currently only IPv4/IPv6 with TCP is supported */
-   if (protocol == htons(ETH_P_IP)) {
-   if (hdr.ipv4->protocol != IPPROTO_TCP)
-   return;
+   if (!(tx_flags & I40E_TX_FLAGS_VXLAN_TUNNEL)) {
+   /* snag network header to get L4 type and address */
+   hdr.network = skb_network_header(skb);
 
-   /* access ihl as a u8 to avoid unaligned access on ia64 */
-   hlen = (hdr.network[0] & 0x0F) << 2;
-   } else if (protocol == htons(ETH_P_IPV6)) {
-   if (hdr.ipv6->nexthdr != IPPROTO_TCP)
+   /* Currently only IPv4/IPv6 with TCP is supported
+* access ihl as u8 to avoid unaligned access on ia64
+*/
+   if (tx_flags & I40E_TX_FLAGS_IPV4)
+   hlen = (hdr.network[0] & 0x0F) << 2;
+   else if (protocol == htons(ETH_P_IPV6))
+   hlen = sizeof(struct ipv6hdr);
+   else
return;
-
-   hlen = sizeof(struct ipv6hdr);
} else {
-   return;
+   hdr.network = skb_inner_network_header(skb);
+   hlen = skb_inner_network_header_len(skb);
}
 
+   /* Currently only IPv4/IPv6 with TCP is supported
+* Note: tx_flags gets modified to reflect inner protocols in
+* tx_enable_csum function if encap is enabled.
+*/
+   if ((tx_flags & I40E_TX_FLAGS_IPV4) &&
+   (hdr.ipv4->protocol != IPPROTO_TCP))
+   return;
+   else if ((tx_flags & I40E_TX_FLAGS_IPV6) &&
+(hdr.ipv6->nexthdr != IPPROTO_TCP))
+   return;
+
th = (struct tcphdr *)(hdr.network + hlen);
 
/* Due to lack of space, no more new filters can be programmed */
@@ -2117,16 +2130,14 @@ out:
  * i40e_tso - set up the tso context descriptor
  * @tx_ring:  ptr to the ring to send
  * @skb:  ptr to the skb we're sending
- * @tx_flags: the collected send information
- * @protocol: the send protocol
  * @hdr_len:  ptr to the size of the packet header
  * @cd_tunneling: ptr to context descriptor bits
  *
  * Returns 0 if no TSO can happen, 1 if tso is going, or error
  **/
 static int i40e_tso(struct i40e_ring *tx_ring, struct sk_buff *skb,
-   u32 tx_flags, __be16 protocol, u8 *hdr_len,
-   u64 *cd_type_cmd_tso_mss, u32 *cd_tunneling)
+   u8 *hdr_len, u64 *cd_type_cmd_tso_mss,
+   u32 *cd_tunneling)
 {
u32 cd_cmd, cd_tso_len, cd_mss;
struct ipv6hdr *ipv6h;
@@ -2218,12 +2229,12 @@ static int i40e_tsyn(struct i40e_ring *tx_ring, struct sk_buff *skb,
 /**
  * i40e_tx_enable_csum - Enable Tx checksum offloads
  * @skb: send buffer
- * @tx_flags: Tx flags currently set
+ * @tx_flags: pointer to Tx flags currently set
  * @td_cmd: Tx descriptor command bits to set
  * @td_offset: Tx descriptor header offsets to set
  * @cd_tunneling: ptr to context desc bits
  **/
-static void i40e_tx_enable_csum(struct sk_buff *skb, u32 tx_flags,
+static void i40e_tx_enable_csum(struct sk_buff *skb, u32

[net-next 11/14] i40evf: skb->xmit_more support

2015-05-28 Thread Jeff Kirsher

From: Jesse Brandeburg 

Eric added support for skb->xmit_more in i40e; this ports it to
i40evf as well.

Supporting skb->xmit_more in i40evf is straightforward: we need to move
the i40e_maybe_stop_tx() call so that netif_xmit_stopped() is tested
correctly before deciding not to kick the NIC.

Change-ID: Ia6a2e4a7ab335631c91ced51f55b25eb8468
Signed-off-by: Eric Dumazet 
Signed-off-by: Daniel Borkmann 
Signed-off-by: Jesse Brandeburg 
Tested-by: Jim Young 
Signed-off-by: Jeff Kirsher 
---
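Note (not part of the patch): a minimal userspace model of the doorbell
decision described above, assuming a simplified ring with a free
descriptor count and a stopped flag. struct ring_model, maybe_stop() and
xmit_one() are hypothetical names; the real driver uses its own ring
structure, netif_xmit_stopped() and writel().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a Tx ring; none of these names are
 * the driver's. */
struct ring_model {
	int unused_desc;	/* free descriptors left */
	bool stopped;		/* queue has been stopped */
	int doorbells;		/* number of tail writes issued */
};

/* Stop the queue when fewer than 'needed' descriptors remain. */
static void maybe_stop(struct ring_model *r, int needed)
{
	if (r->unused_desc < needed)
		r->stopped = true;
}

/* Queue one frame. The stop check has to run before the doorbell test:
 * if the queue was just stopped, no further frame will arrive to flush
 * the batch, so the doorbell must be rung even when xmit_more is set. */
static void xmit_one(struct ring_model *r, int desc_used, int needed,
		     bool xmit_more)
{
	r->unused_desc -= desc_used;
	maybe_stop(r, needed);
	if (!xmit_more || r->stopped)
		r->doorbells++;
}
```

The ordering is the point of the sketch: checking for a stopped queue
after the doorbell decision could leave queued descriptors unannounced
to the hardware.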
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 88 ++-
 1 file changed, 47 insertions(+), 41 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 1c79a08..6450663 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -1670,6 +1670,47 @@ linearize_chk_done:
 }
 
 /**
+ * __i40evf_maybe_stop_tx - 2nd level check for tx stop conditions
+ * @tx_ring: the ring to be checked
+ * @size:the size buffer we want to assure is available
+ *
+ * Returns -EBUSY if a stop is needed, else 0
+ **/
+static inline int __i40evf_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
+{
+   netif_stop_subqueue(tx_ring->netdev, tx_ring->queue_index);
+   /* Memory barrier before checking head and tail */
+   smp_mb();
+
+   /* Check again in a case another CPU has just made room available. */
+   if (likely(I40E_DESC_UNUSED(tx_ring) < size))
+   return -EBUSY;
+
+   /* A reprieve! - use start_queue because it doesn't call schedule */
+   netif_start_subqueue(tx_ring->netdev, tx_ring->queue_index);
+   ++tx_ring->tx_stats.restart_queue;
+   return 0;
+}
+
+/**
+ * i40evf_maybe_stop_tx - 1st level check for tx stop conditions
+ * @tx_ring: the ring to be checked
+ * @size:the size buffer we want to assure is available
+ *
+ * Returns 0 if stop is not needed
+ **/
+#ifdef I40E_FCOE
+int i40evf_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
+#else
+static int i40evf_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
+#endif
+{
+   if (likely(I40E_DESC_UNUSED(tx_ring) >= size))
+   return 0;
+   return __i40evf_maybe_stop_tx(tx_ring, size);
+}
+
+/**
  * i40e_tx_map - Build the Tx descriptor
  * @tx_ring:  ring to send buffer on
  * @skb:  send buffer
@@ -1806,8 +1847,12 @@ static void i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 
tx_ring->next_to_use = i;
 
+   i40evf_maybe_stop_tx(tx_ring, DESC_NEEDED);
/* notify HW of packet */
-   writel(i, tx_ring->tail);
+   if (!skb->xmit_more ||
+   netif_xmit_stopped(netdev_get_tx_queue(tx_ring->netdev,
+  tx_ring->queue_index)))
+   writel(i, tx_ring->tail);
 
return;
 
@@ -1829,43 +1874,6 @@ dma_error:
 }
 
 /**
- * __i40e_maybe_stop_tx - 2nd level check for tx stop conditions
- * @tx_ring: the ring to be checked
- * @size:the size buffer we want to assure is available
- *
- * Returns -EBUSY if a stop is needed, else 0
- **/
-static inline int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
-{
-   netif_stop_subqueue(tx_ring->netdev, tx_ring->queue_index);
-   /* Memory barrier before checking head and tail */
-   smp_mb();
-
-   /* Check again in a case another CPU has just made room available. */
-   if (likely(I40E_DESC_UNUSED(tx_ring) < size))
-   return -EBUSY;
-
-   /* A reprieve! - use start_queue because it doesn't call schedule */
-   netif_start_subqueue(tx_ring->netdev, tx_ring->queue_index);
-   ++tx_ring->tx_stats.restart_queue;
-   return 0;
-}
-
-/**
- * i40e_maybe_stop_tx - 1st level check for tx stop conditions
- * @tx_ring: the ring to be checked
- * @size:the size buffer we want to assure is available
- *
- * Returns 0 if stop is not needed
- **/
-static int i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
-{
-   if (likely(I40E_DESC_UNUSED(tx_ring) >= size))
-   return 0;
-   return __i40e_maybe_stop_tx(tx_ring, size);
-}
-
-/**
  * i40e_xmit_descriptor_count - calculate number of tx descriptors needed
  * @skb: send buffer
  * @tx_ring: ring to send buffer on
@@ -1890,7 +1898,7 @@ static int i40e_xmit_descriptor_count(struct sk_buff *skb,
count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size);
 
count += TXD_USE_COUNT(skb_headlen(skb));
-   if (i40e_maybe_stop_tx(tx_ring, count + 4 + 1)) {
+   if (i40evf_maybe_stop_tx(tx_ring, count + 4 + 1)) {
tx_ring->tx_stats.tx_busy++;
return 0;
}
@@ -1966,8 +1974,6 @@ static netdev_tx_t i40e_xmit_frame_ring(struct sk_buff *skb,
i40e_tx_map(tx_ring, skb, first, tx_flags, hdr_len,
td_cmd, td_offset);
 
-   i40e_maybe_stop_tx(tx_ring, DESC_NEEDED);
-
return NETDEV_TX_OK;
 
 out_dro

[net-next 01/14] ethtool: Add helper routines to pass vf to rx_flow_spec

2015-05-28 Thread Jeff Kirsher

From: John Fastabend 

The ring_cookie is 64 bits wide which is much larger than can be used
for actual queue index values. So provide some helper routines to
pack a VF index into the cookie. This is useful to steer packets to
a VF ring without having to know the queue layout of the device.

CC: Alex Duyck 
Signed-off-by: John Fastabend 
Signed-off-by: Jeff Kirsher 
---
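Note (not part of the patch): a minimal userspace sketch of the cookie
layout described above, assuming a 32-bit queue index in the low bits and
an 8-bit VF id directly above it (offset 32). The mask values are derived
from that description, and pack_ring_cookie() is a hypothetical helper
that the patch itself does not provide.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: bits 0..31 queue index, bits 32..39 VF id. */
#define RING_MASK	0x00000000FFFFFFFFULL
#define RING_VF_MASK	0x000000FF00000000ULL
#define RING_VF_OFF	32

/* Hypothetical helper: pack a VF id and a queue index into a cookie. */
static inline uint64_t pack_ring_cookie(uint8_t vf, uint32_t ring)
{
	return ((uint64_t)vf << RING_VF_OFF) | ring;
}

/* Extract the queue index from a cookie. */
static inline uint64_t get_ring(uint64_t cookie)
{
	return cookie & RING_MASK;
}

/* Extract the VF id from a cookie. */
static inline uint64_t get_ring_vf(uint64_t cookie)
{
	return (cookie & RING_VF_MASK) >> RING_VF_OFF;
}
```

A cookie with a VF id of 0 is indistinguishable from a plain queue index,
so existing users that only pass a queue number keep working unchanged.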
 include/uapi/linux/ethtool.h | 25 +
 1 file changed, 25 insertions(+)

diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index ae832b4..0594933 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -796,6 +796,31 @@ struct ethtool_rx_flow_spec {
__u32   location;
 };
 
+/* How rings are laid out when accessing virtual functions or
+ * offloaded queues is device specific. To allow users to do flow
+ * steering and specify these queues the ring cookie is partitioned
+ * into a 32-bit queue index with an 8-bit virtual function id.
+ * This also leaves 3 bytes for further specifiers. It is possible
+ * future devices may support more than 256 virtual functions if
+ * devices start supporting PCIe w/ARI. However at the moment I
+ * do not know of any devices that support this so I do not reserve
+ * space for this at this time. If a future patch consumes the next
+ * byte it should be aware of this possibility.
+ */
+#define ETHTOOL_RX_FLOW_SPEC_RING  0x00000000FFFFFFFFLL
+#define ETHTOOL_RX_FLOW_SPEC_RING_VF   0x000000FF00000000LL
+#define ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF 32
+static inline __u64 ethtool_get_flow_spec_ring(__u64 ring_cookie)
+{
+   return ETHTOOL_RX_FLOW_SPEC_RING & ring_cookie;
+};
+
+static inline __u64 ethtool_get_flow_spec_ring_vf(__u64 ring_cookie)
+{
+   return (ETHTOOL_RX_FLOW_SPEC_RING_VF & ring_cookie) >>
+   ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF;
+};
+
 /**
  * struct ethtool_rxnfc - command to get or set RX flow classification rules
  * @cmd: Specific command number - %ETHTOOL_GRXFH, %ETHTOOL_SRXFH,
-- 
2.1.0

[net-next 06/14] i40e/i40evf: Add stats to count Tunnel ATR hits

2015-05-28 Thread Jeff Kirsher

From: Anjali Singhai Jain 

Add a 3rd dynamic filter counter to track Tunneled ATR hits separately.
The new ethtool port stat is "fdir_atr_tunnel_match".

Change-ID: Idd978b6db2a462b5722397cd2ffd04ef055f8655
Signed-off-by: Anjali Singhai Jain 
Tested-by: Jim Young 
Signed-off-by: Jeff Kirsher 
---
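Note (not part of the patch): a minimal userspace sketch of the per-PF
counter indexing that the new stat relies on, assuming each PF owns a
contiguous block of counters. The names below drop the I40E_ prefix to
make clear that this is an illustration, and fd_counter_idx() is a
hypothetical helper.

```c
#include <assert.h>

/* One block of counters per PF; the tunnel counter is the third slot. */
enum fd_stat_idx {
	FD_STAT_ATR,
	FD_STAT_SB,
	FD_STAT_ATR_TUNNEL,
	FD_STAT_PF_COUNT
};

#define FD_STAT_PF_IDX(pf_id)		((pf_id) * FD_STAT_PF_COUNT)
#define FD_ATR_STAT_IDX(pf_id)		(FD_STAT_PF_IDX(pf_id) + FD_STAT_ATR)
#define FD_SB_STAT_IDX(pf_id)		(FD_STAT_PF_IDX(pf_id) + FD_STAT_SB)
#define FD_ATR_TUNNEL_STAT_IDX(pf_id)	\
	(FD_STAT_PF_IDX(pf_id) + FD_STAT_ATR_TUNNEL)

/* Pick the counter that an ATR filter should bump, depending on
 * whether the flow was seen inside a tunnel. */
static int fd_counter_idx(int pf_id, int tunneled)
{
	return tunneled ? FD_ATR_TUNNEL_STAT_IDX(pf_id)
			: FD_ATR_STAT_IDX(pf_id);
}
```

Because the enum's last member doubles as the block size, adding a
counter automatically widens every PF's block.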
 drivers/net/ethernet/intel/i40e/i40e.h |  3 +++
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |  1 +
 drivers/net/ethernet/intel/i40e/i40e_main.c|  4 
 drivers/net/ethernet/intel/i40e/i40e_txrx.c| 13 ++---
 drivers/net/ethernet/intel/i40e/i40e_type.h|  1 +
 drivers/net/ethernet/intel/i40evf/i40e_type.h  |  1 +
 6 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 33c35d3..0bfa5a0 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -182,6 +182,7 @@ struct i40e_lump_tracking {
 enum i40e_fd_stat_idx {
I40E_FD_STAT_ATR,
I40E_FD_STAT_SB,
+   I40E_FD_STAT_ATR_TUNNEL,
I40E_FD_STAT_PF_COUNT
 };
 #define I40E_FD_STAT_PF_IDX(pf_id) ((pf_id) * I40E_FD_STAT_PF_COUNT)
@@ -189,6 +190,8 @@ enum i40e_fd_stat_idx {
(I40E_FD_STAT_PF_IDX(pf_id) + I40E_FD_STAT_ATR)
 #define I40E_FD_SB_STAT_IDX(pf_id)  \
(I40E_FD_STAT_PF_IDX(pf_id) + I40E_FD_STAT_SB)
+#define I40E_FD_ATR_TUNNEL_STAT_IDX(pf_id) \
+   (I40E_FD_STAT_PF_IDX(pf_id) + I40E_FD_STAT_ATR_TUNNEL)
 
 struct i40e_fdir_filter {
struct hlist_node fdir_node;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index e77b6bd..c568c90 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -147,6 +147,7 @@ static struct i40e_stats i40e_gstrings_stats[] = {
I40E_PF_STAT("rx_hwtstamp_cleared", rx_hwtstamp_cleared),
I40E_PF_STAT("fdir_flush_cnt", fd_flush_cnt),
I40E_PF_STAT("fdir_atr_match", stats.fd_atr_match),
+   I40E_PF_STAT("fdir_atr_tunnel_match", stats.fd_atr_tunnel_match),
I40E_PF_STAT("fdir_sb_match", stats.fd_sb_match),
 
/* LPI stats */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index f1a8c4c..e70a616 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -1102,6 +1102,10 @@ static void i40e_update_pf_stats(struct i40e_pf *pf)
i40e_stat_update32(hw, I40E_GLQF_PCNT(pf->fd_sb_cnt_idx),
   pf->stat_offsets_loaded,
   &osd->fd_sb_match, &nsd->fd_sb_match);
+   i40e_stat_update32(hw,
+ I40E_GLQF_PCNT(I40E_FD_ATR_TUNNEL_STAT_IDX(pf->hw.pf_id)),
+ pf->stat_offsets_loaded,
+ &osd->fd_atr_tunnel_match, &nsd->fd_atr_tunnel_match);
 
val = rd32(hw, I40E_PRTPM_EEE_STAT);
nsd->tx_lpi_status =
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 8565495..fc4ec82 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2033,9 +2033,16 @@ static void i40e_atr(struct i40e_ring *tx_ring, struct sk_buff *skb,
 I40E_TXD_FLTR_QW1_FD_STATUS_SHIFT;
 
dtype_cmd |= I40E_TXD_FLTR_QW1_CNT_ENA_MASK;
-   dtype_cmd |=
-   ((u32)pf->fd_atr_cnt_idx << I40E_TXD_FLTR_QW1_CNTINDEX_SHIFT) &
-   I40E_TXD_FLTR_QW1_CNTINDEX_MASK;
+   if (!(tx_flags & I40E_TX_FLAGS_VXLAN_TUNNEL))
+   dtype_cmd |=
+   ((u32)I40E_FD_ATR_STAT_IDX(pf->hw.pf_id) <<
+   I40E_TXD_FLTR_QW1_CNTINDEX_SHIFT) &
+   I40E_TXD_FLTR_QW1_CNTINDEX_MASK;
+   else
+   dtype_cmd |=
+   ((u32)I40E_FD_ATR_TUNNEL_STAT_IDX(pf->hw.pf_id) <<
+   I40E_TXD_FLTR_QW1_CNTINDEX_SHIFT) &
+   I40E_TXD_FLTR_QW1_CNTINDEX_MASK;
 
fdir_desc->qindex_flex_ptype_vsi = cpu_to_le32(flex_ptype);
fdir_desc->rsvd = cpu_to_le32(0);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_type.h b/drivers/net/ethernet/intel/i40e/i40e_type.h
index 568e855..9a5a75b 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_type.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_type.h
@@ -1133,6 +1133,7 @@ struct i40e_hw_port_stats {
/* flow director stats */
u64 fd_atr_match;
u64 fd_sb_match;
+   u64 fd_atr_tunnel_match;
/* EEE LPI */
u32 tx_lpi_status;
u32 rx_lpi_status;
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_type.h b/drivers/net/ethernet/intel/i40evf/i40e_type.h
index ec9d83a..c463ec4 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_type.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_type.h
@@ -1108,6 +1108,7 @@ struct i40e_hw_p

[net-next 04/14] i40e: Disable offline diagnostics if VFs are enabled

2015-05-28 Thread Jeff Kirsher

From: Greg Rose 

Require the user to disable virtual functions before running the device
offline diagnostics.  The offline diagnostics are intended to ensure
basic operation of the device - it is beyond the scope of the diagnostic
test to handle the additional complexity of bringing all the virtual
functions offline and then back online for each test run.

Change-ID: Ic0b854851a09fc85df0c9e82c220e45885457c30
Signed-off-by: Greg Rose 
Tested-by: Jim Young 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 27 ++
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |  7 ++
 2 files changed, 34 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 4cbaaeb..e77b6bd 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -1548,6 +1548,17 @@ static int i40e_loopback_test(struct net_device *netdev, u64 *data)
return *data;
 }
 
+static inline bool i40e_active_vfs(struct i40e_pf *pf)
+{
+   struct i40e_vf *vfs = pf->vf;
+   int i;
+
+   for (i = 0; i < pf->num_alloc_vfs; i++)
+   if (vfs[i].vf_states & I40E_VF_STAT_ACTIVE)
+   return true;
+   return false;
+}
+
 static void i40e_diag_test(struct net_device *netdev,
   struct ethtool_test *eth_test, u64 *data)
 {
@@ -1560,6 +1571,20 @@ static void i40e_diag_test(struct net_device *netdev,
netif_info(pf, drv, netdev, "offline testing starting\n");
 
set_bit(__I40E_TESTING, &pf->state);
+
+   if (i40e_active_vfs(pf)) {
+   dev_warn(&pf->pdev->dev,
+"Please take active VFS offline and restart the adapter before running NIC diagnostics\n");
+   data[I40E_ETH_TEST_REG] = 1;
+   data[I40E_ETH_TEST_EEPROM]  = 1;
+   data[I40E_ETH_TEST_INTR]= 1;
+   data[I40E_ETH_TEST_LOOPBACK]= 1;
+   data[I40E_ETH_TEST_LINK]= 1;
+   eth_test->flags |= ETH_TEST_FL_FAILED;
+   clear_bit(__I40E_TESTING, &pf->state);
+   goto skip_ol_tests;
+   }
+
/* If the device is online then take it offline */
if (if_running)
/* indicate we're in test mode */
@@ -1605,6 +1630,8 @@ static void i40e_diag_test(struct net_device *netdev,
data[I40E_ETH_TEST_LOOPBACK] = 0;
}
 
+skip_ol_tests:
+
netif_info(pf, drv, netdev, "testing finished\n");
 }
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 78d1c4f..4653b6e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -980,6 +980,13 @@ static int i40e_pci_sriov_enable(struct pci_dev *pdev, int num_vfs)
int pre_existing_vfs = pci_num_vf(pdev);
int err = 0;
 
+   if (test_bit(__I40E_TESTING, &pf->state)) {
+   dev_warn(&pdev->dev,
+"Cannot enable SR-IOV virtual functions while the device is undergoing diagnostic testing\n");
+   err = -EPERM;
+   goto err_out;
+   }
+
dev_info(&pdev->dev, "Allocating %d VFs.\n", num_vfs);
if (pre_existing_vfs && pre_existing_vfs != num_vfs)
i40e_free_vfs(pf);
-- 
2.1.0

[net-next 03/14] i40e: Collect PFC XOFF RX stats even in single TC case

2015-05-28 Thread Jeff Kirsher

From: Neerav Parikh 

When PFC is enabled for any UP in a single TC configuration, the driver
did not collect the PFC XOFF RX stats. Though a single TC with PFC enabled
is not a common scenario, do not prevent the driver from collecting stats
if firmware indicates that PFC is enabled.

Change-ID: Ie20bd58b07608b528f3c6d95894c9ae56b00077a
Signed-off-by: Neerav Parikh 
Tested-by: Jim Young 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index a54c144..f1a8c4c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -772,9 +772,8 @@ static void i40e_update_prio_xoff_rx(struct i40e_pf *pf)
 
dcb_cfg = &hw->local_dcbx_config;
 
-   /* See if DCB enabled with PFC TC */
-   if (!(pf->flags & I40E_FLAG_DCB_ENABLED) ||
-   !(dcb_cfg->pfc.pfcenable)) {
+   /* Collect Link XOFF stats when PFC is disabled */
+   if (!dcb_cfg->pfc.pfcenable) {
i40e_update_link_xoff_rx(pf);
return;
}
-- 
2.1.0

[net-next 00/14][pull request] Intel Wired LAN Driver Updates 2015-05-28

2015-05-28 Thread Jeff Kirsher

This series contains updates to ethtool, ixgbe, i40e and i40evf.

John adds helper routines for ethtool to pass VF to rx_flow_spec.  Since
the ring_cookie is 64 bits wide which is much larger than what could be
used for actual queue index values, provide helper routines to pack a VF
index into the cookie.  Then John provides an ixgbe patch to allow flow
director to use the entire queue space.

Neerav provides an i40e patch to collect XOFF Rx stats, where they were
not being collected before.

Anjali provides ATR support for tunneled packets, as well as stats to
count tunnel ATR hits.  Cleaned up PF struct members which are
unnecessary, since we can use the stat index macro directly.  Cleaned
up flow director ATR/SB messages to a higher debug level since they
are not useful unless silicon validation is happening.

Greg provides a patch to disable offline diagnostics if VFs are enabled
since ethtool offline diagnostic tests are not designed (out of scope)
to disable VF functions for testing and re-enable afterward.  Also cleans
up TODO comment that is no longer needed.

Vasu provides a fix for an FCoE EOF case where i40e_fcoe_ctxt_eof() may be
called before i40e_fcoe_eof_is_supported() is called.

Jesse adds skb->xmit_more support for i40evf.  Then provides a performance
enhancement for i40evf by inlining some functions which provides a 15%
gain in small packet performance.  Also cleans up the use of time_stamp
since it is no longer used to determine if there is a tx_hang and was
a part of a previous tx_hang design which is no longer used.

The following are changes since commit ed2dfd900992aa7b6b3d0abd8ec9a7e9d2c7f827:
  tcp/dccp: warn user for preferred ip_local_port_range
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue master

Anjali Singhai Jain (4):
  i40e/i40evf: Add ATR support for tunneled TCP/IPv4/IPv6 packets.
  i40e/i40evf: Add stats to count Tunnel ATR hits
  i40e: Remove unnecessary pf members
  i40e: Move the FD ATR/SB messages to a higher debug level

Catherine Sullivan (1):
  i40e: Bump version to 1.3.4

Greg Rose (2):
  i40e: Disable offline diagnostics if VFs are enabled
  i40e/i40evf: Remove unneeded TODO

Jesse Brandeburg (3):
  i40evf: skb->xmit_more support
  i40e/i40evf: force inline transmit functions
  i40e/i40evf: remove time_stamp member

John Fastabend (2):
  ethtool: Add helper routines to pass vf to rx_flow_spec
  ixgbe: Allow flow director to use entire queue space

Neerav Parikh (1):
  i40e: Collect PFC XOFF RX stats even in single TC case

Vasu Dev (1):
  i40e: fix unrecognized FCOE EOF case

 drivers/net/ethernet/intel/i40e/i40e.h |   5 +-
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |  30 +++-
 drivers/net/ethernet/intel/i40e/i40e_fcoe.c|  11 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c|  39 ++---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c| 144 ++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h|   2 +-
 drivers/net/ethernet/intel/i40e/i40e_type.h|   1 +
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |   7 +
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c  | 158 ++---
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h  |   2 +-
 drivers/net/ethernet/intel/i40evf/i40e_type.h  |   1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c   |  34 +++--
 include/uapi/linux/ethtool.h   |  25 
 13 files changed, 272 insertions(+), 187 deletions(-)

-- 
2.1.0

[net-next 02/14] ixgbe: Allow flow director to use entire queue space

2015-05-28 Thread Jeff Kirsher

From: John Fastabend 

Flow director is exported to user space using the ethtool ntuple
support. However, currently it only supports steering traffic to a
subset of the queues in use by the hardware. This change allows
flow director to specify queues that have been assigned to virtual
functions by partitioning the ring_cookie into an 8-bit VF specifier
followed by a 32-bit queue index. At the moment we don't have any
ethernet drivers with more than 2^32 queues on a single function,
as best I can tell, nor do I expect this to happen anytime
soon. This way the ring_cookie's normal use for specifying a queue
on a specific PCI function continues to work as expected.

CC: Alex Duyck 
Signed-off-by: John Fastabend 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
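Note (not part of the patch): a minimal userspace sketch of the
validation and mapping described above, assuming that VF id 0 selects the
PF's own queues and VF id N > 0 selects pool N - 1. struct adapter_model
and map_queue() are hypothetical, and for the PF case the sketch assumes
that the ring index equals the hardware queue index, which the real
driver looks up via its ring structures instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical adapter description; the field names mirror the patch
 * but this is only a userspace model. */
struct adapter_model {
	uint32_t num_rx_queues;		/* queues owned by the PF */
	uint32_t num_vfs;		/* number of enabled VFs */
	uint32_t num_rx_queues_per_pool;
};

/* Map (vf, ring) to an absolute queue index, or return -1 if the
 * request is out of range. */
static int map_queue(const struct adapter_model *a, uint8_t vf,
		     uint32_t ring)
{
	if (!vf)
		return ring < a->num_rx_queues ? (int)ring : -1;
	if (vf > a->num_vfs || ring >= a->num_rx_queues_per_pool)
		return -1;
	return (int)((vf - 1) * a->num_rx_queues_per_pool + ring);
}
```

With 2 queues per pool, VF id 3 and ring 0 map to absolute queue 4, since
pools 0 and 1 occupy queues 0 through 3.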
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c | 34 +---
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index 9f6fb19..9a1d0f1 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -2594,18 +2594,35 @@ static int ixgbe_add_ethtool_fdir_entry(struct ixgbe_adapter *adapter,
struct ixgbe_hw *hw = &adapter->hw;
struct ixgbe_fdir_filter *input;
union ixgbe_atr_input mask;
+   u8 queue;
int err;
 
if (!(adapter->flags & IXGBE_FLAG_FDIR_PERFECT_CAPABLE))
return -EOPNOTSUPP;
 
-   /*
-* Don't allow programming if the action is a queue greater than
-* the number of online Rx queues.
+   /* ring_cookie is masked into a set of queues and ixgbe pools or
+* we use the drop index.
 */
-   if ((fsp->ring_cookie != RX_CLS_FLOW_DISC) &&
-   (fsp->ring_cookie >= adapter->num_rx_queues))
-   return -EINVAL;
+   if (fsp->ring_cookie == RX_CLS_FLOW_DISC) {
+   queue = IXGBE_FDIR_DROP_QUEUE;
+   } else {
+   u32 ring = ethtool_get_flow_spec_ring(fsp->ring_cookie);
+   u8 vf = ethtool_get_flow_spec_ring_vf(fsp->ring_cookie);
+
+   if (!vf && (ring >= adapter->num_rx_queues))
+   return -EINVAL;
+   else if (vf &&
+((vf > adapter->num_vfs) ||
+  ring >= adapter->num_rx_queues_per_pool))
+   return -EINVAL;
+
+   /* Map the ring onto the absolute queue index */
+   if (!vf)
+   queue = adapter->rx_ring[ring]->reg_idx;
+   else
+   queue = ((vf - 1) *
+   adapter->num_rx_queues_per_pool) + ring;
+   }
 
/* Don't allow indexes to exist outside of available space */
if (fsp->location >= ((1024 << adapter->fdir_pballoc) - 2)) {
@@ -2683,10 +2700,7 @@ static int ixgbe_add_ethtool_fdir_entry(struct ixgbe_adapter *adapter,
 
/* program filters to filter memory */
err = ixgbe_fdir_write_perfect_filter_82599(hw,
-   &input->filter, input->sw_idx,
-   (input->action == IXGBE_FDIR_DROP_QUEUE) ?
-   IXGBE_FDIR_DROP_QUEUE :
-   adapter->rx_ring[input->action]->reg_idx);
+   &input->filter, input->sw_idx, queue);
if (err)
goto err_out_w_lock;
 
-- 
2.1.0

[net-next 08/14] i40e/i40evf: Remove unneeded TODO

2015-05-28 Thread Jeff Kirsher

From: Greg Rose 

There's no need for a counter so remove the TODO comment.

Change-ID: I3321dda04934c4f5fda9b279ab666192bda44214
Signed-off-by: Greg Rose 
Tested-by: Jim Young 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 3 ---
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 3 ---
 2 files changed, 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index fc4ec82..78ab8b5 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1653,9 +1653,6 @@ static int i40e_clean_rx_irq_ps(struct i40e_ring *rx_ring, int budget)
/* ERR_MASK will only have valid bits if EOP set */
if (unlikely(rx_error & (1 << I40E_RX_DESC_ERROR_RXE_SHIFT))) {
dev_kfree_skb_any(skb);
-   /* TODO: shouldn't we increment a counter indicating the
-* drop?
-*/
continue;
}
 
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 09cc2d7..1c79a08 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -1128,9 +1128,6 @@ static int i40e_clean_rx_irq_ps(struct i40e_ring *rx_ring, int budget)
/* ERR_MASK will only have valid bits if EOP set */
if (unlikely(rx_error & (1 << I40E_RX_DESC_ERROR_RXE_SHIFT))) {
dev_kfree_skb_any(skb);
-   /* TODO: shouldn't we increment a counter indicating the
-* drop?
-*/
continue;
}
 
-- 
2.1.0


[net-next 12/14] i40e/i40evf: force inline transmit functions

2015-05-28 Thread Jeff Kirsher

From: Jesse Brandeburg 

Inlining these functions gives us about 15% more 64 byte packets per
second when using pktgen. 13.3 million to 15 million with a single
queue.

Also rename the functions in i40evf from i40e_* to i40evf_* while we are
touching the function headers.

Change-ID: I3294ae9b085cf438672b6db5f9af122490ead9d0
Signed-off-by: Jesse Brandeburg 
Signed-off-by: Catherine Sullivan 
Tested-by: Jim Young 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 32 
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 36 ---
 2 files changed, 32 insertions(+), 36 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 3414e46..5fa43f7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2063,13 +2063,13 @@ static void i40e_atr(struct i40e_ring *tx_ring, struct sk_buff *skb,
  * otherwise  returns 0 to indicate the flags has been set properly.
  **/
 #ifdef I40E_FCOE
-int i40e_tx_prepare_vlan_flags(struct sk_buff *skb,
-  struct i40e_ring *tx_ring,
-  u32 *flags)
-#else
-static int i40e_tx_prepare_vlan_flags(struct sk_buff *skb,
+inline int i40e_tx_prepare_vlan_flags(struct sk_buff *skb,
  struct i40e_ring *tx_ring,
  u32 *flags)
+#else
+static inline int i40e_tx_prepare_vlan_flags(struct sk_buff *skb,
+struct i40e_ring *tx_ring,
+u32 *flags)
 #endif
 {
__be16 protocol = skb->protocol;
@@ -2412,9 +2412,9 @@ static inline int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
  * Returns 0 if stop is not needed
  **/
 #ifdef I40E_FCOE
-int i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
+inline int i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
 #else
-static int i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
+static inline int i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
 #endif
 {
if (likely(I40E_DESC_UNUSED(tx_ring) >= size))
@@ -2494,13 +2494,13 @@ linearize_chk_done:
  * @td_offset: offset for checksum or crc
  **/
 #ifdef I40E_FCOE
-void i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
-struct i40e_tx_buffer *first, u32 tx_flags,
-const u8 hdr_len, u32 td_cmd, u32 td_offset)
-#else
-static void i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
+inline void i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
struct i40e_tx_buffer *first, u32 tx_flags,
const u8 hdr_len, u32 td_cmd, u32 td_offset)
+#else
+static inline void i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
+  struct i40e_tx_buffer *first, u32 tx_flags,
+  const u8 hdr_len, u32 td_cmd, u32 td_offset)
 #endif
 {
unsigned int data_len = skb->data_len;
@@ -2661,11 +2661,11 @@ dma_error:
  * one descriptor.
  **/
 #ifdef I40E_FCOE
-int i40e_xmit_descriptor_count(struct sk_buff *skb,
-  struct i40e_ring *tx_ring)
-#else
-static int i40e_xmit_descriptor_count(struct sk_buff *skb,
+inline int i40e_xmit_descriptor_count(struct sk_buff *skb,
  struct i40e_ring *tx_ring)
+#else
+static inline int i40e_xmit_descriptor_count(struct sk_buff *skb,
+struct i40e_ring *tx_ring)
 #endif
 {
unsigned int f;
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 6450663..0ac134b 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -1347,7 +1347,7 @@ int i40evf_napi_poll(struct napi_struct *napi, int budget)
 }
 
 /**
- * i40e_tx_prepare_vlan_flags - prepare generic TX VLAN tagging flags for HW
+ * i40evf_tx_prepare_vlan_flags - prepare generic TX VLAN tagging flags for HW
  * @skb: send buffer
  * @tx_ring: ring to send buffer on
  * @flags:   the tx flags to be set
@@ -1358,9 +1358,9 @@ int i40evf_napi_poll(struct napi_struct *napi, int budget)
  * Returns error code indicate the frame should be dropped upon error and the
  * otherwise  returns 0 to indicate the flags has been set properly.
  **/
-static int i40e_tx_prepare_vlan_flags(struct sk_buff *skb,
- struct i40e_ring *tx_ring,
- u32 *flags)
+static inline int i40evf_tx_prepare_vlan_flags(struct sk_buff *skb,
+  struct i40e_ring *tx_ring,
+  u32 *flags)
 {
__be16 protocol = skb->protocol;
u32  tx_flags = 0;
@@ -1699,11 +1699,7 @@ static inline int _

[net-next 14/14] i40e: Bump version to 1.3.4

2015-05-28 Thread Jeff Kirsher

From: Catherine Sullivan 

Bump.

Change-ID: I54ec2787a9fead5e18447078f26e5dd27f01da44
Signed-off-by: Catherine Sullivan 
Tested-by: Jim Young 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index d6113e3..0a3e928 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -39,7 +39,7 @@ static const char i40e_driver_string[] =
 
 #define DRV_VERSION_MAJOR 1
 #define DRV_VERSION_MINOR 3
-#define DRV_VERSION_BUILD 2
+#define DRV_VERSION_BUILD 4
 #define DRV_VERSION __stringify(DRV_VERSION_MAJOR) "." \
 __stringify(DRV_VERSION_MINOR) "." \
 __stringify(DRV_VERSION_BUILD)DRV_KERN
-- 
2.1.0


[net-next 07/14] i40e: Remove unnecessary pf members

2015-05-28 Thread Jeff Kirsher

From: Anjali Singhai Jain 

We can use the stat index macro directly, a variable is not required.

Change-ID: I19f08ac16353dc0cd87a1a8248d714e15a54aa8a
Signed-off-by: Anjali Singhai Jain 
Tested-by: Jim Young 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h |  2 --
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c| 10 --
 3 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 0bfa5a0..aca9cef 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -266,8 +266,6 @@ struct i40e_pf {
 
struct hlist_head fdir_filter_list;
u16 fdir_pf_active_filters;
-   u16 fd_sb_cnt_idx;
-   u16 fd_atr_cnt_idx;
unsigned long fd_flush_timestamp;
u32 fd_flush_cnt;
u32 fd_add_err;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index c568c90..9a68c65 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -2293,7 +2293,7 @@ static int i40e_add_fdir_ethtool(struct i40e_vsi *vsi,
input->pctype = 0;
input->dest_vsi = vsi->id;
input->fd_status = I40E_FILTER_PROGRAM_DESC_FD_STATUS_FD_ID;
-   input->cnt_index  = pf->fd_sb_cnt_idx;
+   input->cnt_index  = I40E_FD_SB_STAT_IDX(pf->hw.pf_id);
input->flow_type = fsp->flow_type;
input->ip4_proto = fsp->h_u.usr_ip4_spec.proto;
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index e70a616..6f16f56 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -1096,10 +1096,12 @@ static void i40e_update_pf_stats(struct i40e_pf *pf)
   &osd->rx_jabber, &nsd->rx_jabber);
 
/* FDIR stats */
-   i40e_stat_update32(hw, I40E_GLQF_PCNT(pf->fd_atr_cnt_idx),
+   i40e_stat_update32(hw,
+  I40E_GLQF_PCNT(I40E_FD_ATR_STAT_IDX(pf->hw.pf_id)),
   pf->stat_offsets_loaded,
   &osd->fd_atr_match, &nsd->fd_atr_match);
-   i40e_stat_update32(hw, I40E_GLQF_PCNT(pf->fd_sb_cnt_idx),
+   i40e_stat_update32(hw,
+  I40E_GLQF_PCNT(I40E_FD_SB_STAT_IDX(pf->hw.pf_id)),
   pf->stat_offsets_loaded,
   &osd->fd_sb_match, &nsd->fd_sb_match);
i40e_stat_update32(hw,
@@ -7679,12 +7681,8 @@ static int i40e_sw_init(struct i40e_pf *pf)
(pf->hw.func_caps.fd_filters_best_effort > 0)) {
pf->flags |= I40E_FLAG_FD_ATR_ENABLED;
pf->atr_sample_rate = I40E_DEFAULT_ATR_SAMPLE_RATE;
-   /* Setup a counter for fd_atr per PF */
-   pf->fd_atr_cnt_idx = I40E_FD_ATR_STAT_IDX(pf->hw.pf_id);
if (!(pf->flags & I40E_FLAG_MFP_ENABLED)) {
pf->flags |= I40E_FLAG_FD_SB_ENABLED;
-   /* Setup a counter for fd_sb per PF */
-   pf->fd_sb_cnt_idx = I40E_FD_SB_STAT_IDX(pf->hw.pf_id);
} else {
dev_info(&pf->pdev->dev,
 "Flow Director Sideband mode Disabled in MFP mode\n");
-- 
2.1.0


[net-next 13/14] i40e/i40evf: remove time_stamp member

2015-05-28 Thread Jeff Kirsher

From: Jesse Brandeburg 

The driver doesn't use the time_stamp member to determine if there is a
tx_hang any more. There really isn't any point to the variable at all
so just remove it. It was left over from a previous tx_hang design.

Change-ID: I4c814827e1bcb46e45118fe37acdcfa814fb62a0
Signed-off-by: Jesse Brandeburg 
Tested-by: Jim Young 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 10 --
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |  1 -
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |  7 ---
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h |  1 -
 4 files changed, 19 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 5fa43f7..cc82a7f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -165,9 +165,6 @@ int i40e_program_fdir_filter(struct i40e_fdir_filter *fdir_data, u8 *raw_packet,
tx_desc->cmd_type_offset_bsz =
build_ctob(td_cmd, 0, I40E_FDIR_MAX_RAW_PACKET_SIZE, 0);
 
-   /* set the timestamp */
-   tx_buf->time_stamp = jiffies;
-
/* Force memory writes to complete before letting h/w
 * know there are new descriptors to fetch.
 */
@@ -810,10 +807,6 @@ static bool i40e_clean_tx_irq(struct i40e_ring *tx_ring, int budget)
 tx_ring->vsi->seid,
 tx_ring->queue_index,
 tx_ring->next_to_use, i);
-   dev_info(tx_ring->dev, "tx_bi[next_to_clean]\n"
-"  time_stamp   <%lx>\n"
-"  jiffies  <%lx>\n",
-tx_ring->tx_bi[i].time_stamp, jiffies);
 
netif_stop_subqueue(tx_ring->netdev, tx_ring->queue_index);
 
@@ -2606,9 +2599,6 @@ static inline void i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 tx_ring->queue_index),
 first->bytecount);
 
-   /* set the timestamp */
-   first->time_stamp = jiffies;
-
/* Force memory writes to complete before letting h/w
 * know there are new descriptors to fetch.  (Only
 * applicable for weak-ordered memory model archs,
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index ea1df3b..0dc48dc 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -147,7 +147,6 @@ enum i40e_dyn_idx_t {
 
 struct i40e_tx_buffer {
struct i40e_tx_desc *next_to_watch;
-   unsigned long time_stamp;
union {
struct sk_buff *skb;
void *raw_buf;
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 0ac134b..ec7e220 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -322,10 +322,6 @@ static bool i40e_clean_tx_irq(struct i40e_ring *tx_ring, int budget)
 tx_ring->vsi->seid,
 tx_ring->queue_index,
 tx_ring->next_to_use, i);
-   dev_info(tx_ring->dev, "tx_bi[next_to_clean]\n"
-"  time_stamp   <%lx>\n"
-"  jiffies  <%lx>\n",
-tx_ring->tx_bi[i].time_stamp, jiffies);
 
netif_stop_subqueue(tx_ring->netdev, tx_ring->queue_index);
 
@@ -1824,9 +1820,6 @@ static inline void i40evf_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 tx_ring->queue_index),
 first->bytecount);
 
-   /* set the timestamp */
-   first->time_stamp = jiffies;
-
/* Force memory writes to complete before letting h/w
 * know there are new descriptors to fetch.  (Only
 * applicable for weak-ordered memory model archs,
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.h b/drivers/net/ethernet/intel/i40evf/i40e_txrx.h
index a23f5e8..e7a34f8 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.h
@@ -146,7 +146,6 @@ enum i40e_dyn_idx_t {
 
 struct i40e_tx_buffer {
struct i40e_tx_desc *next_to_watch;
-   unsigned long time_stamp;
union {
struct sk_buff *skb;
void *raw_buf;
-- 
2.1.0


[net-next 09/14] i40e: fix unrecognized FCOE EOF case

2015-05-28 Thread Jeff Kirsher

From: Vasu Dev 

Because i40e_fcoe_ctxt_eof should never be called without
i40e_fcoe_eof_is_supported being called first, the EOF in fcoe_ctxt_eof
should always be valid and therefore we do not need to print an error
if it is not valid.

However, a WARN_ON to easily catch any calls to i40e_fcoe_ctxt_eof that
aren't preceded by a call to i40e_fcoe_eof_is_supported is helpful.

Change-ID: I3b536b1981ec0bce80576a74440b7dea3908bdb9
Signed-off-by: Vasu Dev 
Tested-by: Jim Young 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_fcoe.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_fcoe.c b/drivers/net/ethernet/intel/i40e/i40e_fcoe.c
index 1803afe..c8b621e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_fcoe.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_fcoe.c
@@ -118,7 +118,7 @@ static inline int i40e_fcoe_fc_eof(struct sk_buff *skb, u8 *eof)
  *
  * The FC EOF is converted to the value understood by HW for descriptor
  * programming. Never call this w/o calling i40e_fcoe_eof_is_supported()
- * first.
+ * first and that already checks for all supported valid eof values.
  **/
 static inline u32 i40e_fcoe_ctxt_eof(u8 eof)
 {
@@ -132,9 +132,12 @@ static inline u32 i40e_fcoe_ctxt_eof(u8 eof)
case FC_EOF_A:
return I40E_TX_DESC_CMD_L4T_EOFT_EOF_A;
default:
-   /* FIXME: still returns 0 */
-   pr_err("Unrecognized EOF %x\n", eof);
-   return 0;
+   /* Supported valid eof shall be already checked by
+* calling i40e_fcoe_eof_is_supported() first,
+* therefore this default case shall never hit.
+*/
+   WARN_ON(1);
+   return -EINVAL;
}
 }
 
-- 
2.1.0
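
The calling contract described in the commit message above (check with
i40e_fcoe_eof_is_supported() first, then convert with i40e_fcoe_ctxt_eof())
can be sketched in plain userspace C roughly as follows. This is only an
illustration, not the driver code: every demo_* name, the EOF codes and the
command values are hypothetical stand-ins. One point that may be worth
checking in the patch itself: a function declared to return u32 cannot hand
back a negative errno as such, since -EINVAL is converted to a large unsigned
value, so the sketch uses an explicit sentinel instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical EOF codes and descriptor command values, for illustration. */
enum demo_eof {
	DEMO_EOF_N  = 0x41,
	DEMO_EOF_T  = 0x42,
	DEMO_EOF_NI = 0x49,
	DEMO_EOF_A  = 0x50
};

#define DEMO_CMD_EOF_N   0u
#define DEMO_CMD_EOF_T   1u
#define DEMO_CMD_EOF_NI  2u
#define DEMO_CMD_EOF_A   3u
#define DEMO_CMD_INVALID UINT32_MAX /* explicit "should never happen" value */

/* Gatekeeper: callers are expected to run this before converting. */
bool demo_eof_is_supported(uint8_t eof)
{
	return eof == DEMO_EOF_N || eof == DEMO_EOF_T ||
	       eof == DEMO_EOF_NI || eof == DEMO_EOF_A;
}

/* Conversion: the default branch is unreachable if the gatekeeper ran. */
uint32_t demo_ctxt_eof(uint8_t eof)
{
	switch (eof) {
	case DEMO_EOF_N:
		return DEMO_CMD_EOF_N;
	case DEMO_EOF_T:
		return DEMO_CMD_EOF_T;
	case DEMO_EOF_NI:
		return DEMO_CMD_EOF_NI;
	case DEMO_EOF_A:
		return DEMO_CMD_EOF_A;
	default:
		return DEMO_CMD_INVALID;
	}
}

/* Caller pattern: validate, then convert. Returns false to drop the frame. */
bool demo_prepare(uint8_t eof, uint32_t *cmd)
{
	if (!demo_eof_is_supported(eof))
		return false;
	*cmd = demo_ctxt_eof(eof);
	assert(*cmd != DEMO_CMD_INVALID); /* plays the role of WARN_ON() */
	return true;
}
```

With this shape, demo_prepare() rejects an unknown EOF before the conversion
is ever reached, which is the situation the WARN_ON in the patch is meant to
flag if a caller skips the check.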


[net-next 10/14] i40e: Move the FD ATR/SB messages to a higher debug level

2015-05-28 Thread Jeff Kirsher

From: Anjali Singhai Jain 

These are not useful unless SV is happening as there is a FD flush counter
that tracks this.

Change-ID: If2655b5a29687247d03a51d35f69854bbeb711ce
Signed-off-by: Anjali Singhai Jain 
Tested-by: Jim Young 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 18 --
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |  9 ++---
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 6f16f56..d6113e3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -4744,7 +4744,8 @@ static int i40e_up_complete(struct i40e_vsi *vsi)
pf->fd_add_err = pf->fd_atr_cnt = 0;
if (pf->fd_tcp_rule > 0) {
pf->flags &= ~I40E_FLAG_FD_ATR_ENABLED;
-   dev_info(&pf->pdev->dev, "Forcing ATR off, sideband rules for TCP/IPv4 exist\n");
+   if (I40E_DEBUG_FD & pf->hw.debug_mask)
+   dev_info(&pf->pdev->dev, "Forcing ATR off, sideband rules for TCP/IPv4 exist\n");
pf->fd_tcp_rule = 0;
}
i40e_fdir_filter_restore(vsi);
@@ -5433,7 +5434,8 @@ void i40e_fdir_check_and_reenable(struct i40e_pf *pf)
if ((pf->flags & I40E_FLAG_FD_SB_ENABLED) &&
(pf->auto_disable_flags & I40E_FLAG_FD_SB_ENABLED)) {
pf->auto_disable_flags &= ~I40E_FLAG_FD_SB_ENABLED;
-   dev_info(&pf->pdev->dev, "FD Sideband/ntuple is being enabled since we have space in the table now\n");
+   if (I40E_DEBUG_FD & pf->hw.debug_mask)
+   dev_info(&pf->pdev->dev, "FD Sideband/ntuple is being enabled since we have space in the table now\n");
}
}
/* Wait for some more space to be available to turn on ATR */
@@ -5441,7 +5443,8 @@ void i40e_fdir_check_and_reenable(struct i40e_pf *pf)
if ((pf->flags & I40E_FLAG_FD_ATR_ENABLED) &&
(pf->auto_disable_flags & I40E_FLAG_FD_ATR_ENABLED)) {
pf->auto_disable_flags &= ~I40E_FLAG_FD_ATR_ENABLED;
-   dev_info(&pf->pdev->dev, "ATR is being enabled since we have space in the table now\n");
+   if (I40E_DEBUG_FD & pf->hw.debug_mask)
+   dev_info(&pf->pdev->dev, "ATR is being enabled since we have space in the table now\n");
}
}
 }
@@ -5474,7 +5477,8 @@ static void i40e_fdir_flush_and_replay(struct i40e_pf *pf)
 
if (!(time_after(jiffies, min_flush_time)) &&
(fd_room < I40E_FDIR_BUFFER_HEAD_ROOM_FOR_ATR)) {
-   dev_info(&pf->pdev->dev, "ATR disabled, not enough FD filter space.\n");
+   if (I40E_DEBUG_FD & pf->hw.debug_mask)
+   dev_info(&pf->pdev->dev, "ATR disabled, not enough FD filter space.\n");
disable_atr = true;
}
 
@@ -5501,7 +5505,8 @@ static void i40e_fdir_flush_and_replay(struct i40e_pf *pf)
if (!disable_atr)
pf->flags |= I40E_FLAG_FD_ATR_ENABLED;
clear_bit(__I40E_FD_FLUSH_REQUESTED, &pf->state);
-   dev_info(&pf->pdev->dev, "FD Filter table flushed and FD-SB replayed.\n");
+   if (I40E_DEBUG_FD & pf->hw.debug_mask)
+   dev_info(&pf->pdev->dev, "FD Filter table flushed and FD-SB replayed.\n");
}
}
 }
@@ -7772,7 +,8 @@ bool i40e_set_ntuple(struct i40e_pf *pf, netdev_features_t features)
pf->fd_add_err = pf->fd_atr_cnt = pf->fd_tcp_rule = 0;
pf->fdir_pf_active_filters = 0;
pf->flags |= I40E_FLAG_FD_ATR_ENABLED;
-   dev_info(&pf->pdev->dev, "ATR re-enabled.\n");
+   if (I40E_DEBUG_FD & pf->hw.debug_mask)
+   dev_info(&pf->pdev->dev, "ATR re-enabled.\n");
/* if ATR was auto disabled it can be re-enabled. */
if ((pf->flags & I40E_FLAG_FD_ATR_ENABLED) &&
(pf->auto_disable_flags & I40E_FLAG_FD_ATR_ENABLED))
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 78ab8b5..3414e46 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -283,7 +283,8 @@ static int i40e_add_del_fdir_tcpv4(struct i40e_vsi *vsi,
if (add) {
pf->fd_tcp_rule++;
if (pf->flags & I40E_FLAG_FD_ATR_ENABLED) {
-   dev_info(&pf->pdev->dev, "Forcing ATR off, sideband rules for TCP/IPv4 flow being applied\n");
+   if (I40E_DEBU

iproute2: missing patches in branch net-next

2015-05-28 Thread Nicolas Dichtel


Hi Stephen,

Some patches that were recently included in the iproute2 net-next branch are no
longer visible on kernel.org. It seems that the branch has been overwritten
(note the "forced update" when I fetched it):

$ git fetch
remote: Counting objects: 65, done.
remote: Compressing objects: 100% (65/65), done.
remote: Total 65 (delta 58), reused 0 (delta 0)
Unpacking objects: 100% (65/65), done.
From git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2
 + aacee2695a90...eb9d6e794b52 net-next   -> origin/net-next  (forced update)
   f043759dd492..c52827e9077f  master -> origin/master


The following patches are lost:
aacee2695a90 tc: gred: Add support for TCA_GRED_LIMIT attribute
b6ec53e3008a xfrmmonitor: allows to monitor in several netns
449b824ad196 ipmonitor: allows to monitor in several netns
3b0006f8183e ipmonitor: introduce print_headers
0628cddd9d5c libnetlink: introduce rtnl_listen_filter_t
2503247d58c3 man: update ip monitor page
6fc1f8add30b iplink_bond: add support for ad_actor and port_key options
df1c7d9138ea codel: add ce_threshold support to codel & fc_codel
30eb304ecd1d tc: add support for Flower classifier
1a4dda7103bc ss: add support for bytes_acked & bytes_received
908755dc49df iproute2: GENEVE support
f9b004020a89 Merge branch 'master' into net-next
8f42ceaf2491 Update kernels for net-next


Regards,
Nicolas

[PATCH iproute2] Fix changing tunnel remote and local address to any

2015-05-28 Thread Thadeu Lima de Souza Cascardo

If a tunnel is created with a local address, you can't change it to any.

 # ip tunnel add tunl1 mode ipip remote 10.16.42.37 local 10.16.42.214 ttl 64
 # ip tunnel show tunl1
 tunl1: ip/ip  remote 10.16.42.37  local 10.16.42.214  ttl 64
 # ip tunnel change tunl1 local any
 # echo $?
 0
 # ip tunnel show tunl1
 tunl1: ip/ip  remote 10.16.42.37  local 10.16.42.214  ttl 64

This happens because parse_args zeroes ip_tunnel_parm. When creating a
tunnel, it is OK to leave the field as is if the address is any. However, when
changing a tunnel, the current parameters are loaded first, so the local and
remote addresses are no longer zero and must be explicitly set to any.

Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 ip/iptunnel.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/ip/iptunnel.c b/ip/iptunnel.c
index be84b83..78fa988 100644
--- a/ip/iptunnel.c
+++ b/ip/iptunnel.c
@@ -167,10 +167,14 @@ static int parse_args(int argc, char **argv, int cmd, struct ip_tunnel_parm *p)
NEXT_ARG();
if (strcmp(*argv, "any"))
p->iph.daddr = get_addr32(*argv);
+   else
+   p->iph.daddr = htonl(INADDR_ANY);
} else if (strcmp(*argv, "local") == 0) {
NEXT_ARG();
if (strcmp(*argv, "any"))
p->iph.saddr = get_addr32(*argv);
+   else
+   p->iph.saddr = htonl(INADDR_ANY);
} else if (strcmp(*argv, "dev") == 0) {
NEXT_ARG();
strncpy(medium, *argv, IFNAMSIZ-1);
-- 
2.4.1
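
The effect of the two added else branches can be illustrated with a small
userspace sketch. It is a simplification under stated assumptions, not the
iproute2 code: the demo_* names are hypothetical, the `parsed` argument stands
in for the result of get_addr32(), and DEMO_ADDR_ANY stands in for
htonl(INADDR_ANY), which is 0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ADDR_ANY 0u /* stand-in for htonl(INADDR_ANY) */

/* Hypothetical, reduced version of the tunnel parameters. */
struct demo_tunnel_parm {
	uint32_t saddr; /* local address, 0 means "any" */
	uint32_t daddr; /* remote address, 0 means "any" */
};

/* Old behaviour: "any" is skipped, so a pre-filled address survives. */
void demo_set_local_old(struct demo_tunnel_parm *p, const char *arg,
			uint32_t parsed)
{
	if (strcmp(arg, "any") != 0)
		p->saddr = parsed;
}

/* Patched behaviour: "any" explicitly resets the address. */
void demo_set_local_new(struct demo_tunnel_parm *p, const char *arg,
			uint32_t parsed)
{
	if (strcmp(arg, "any") != 0)
		p->saddr = parsed;
	else
		p->saddr = DEMO_ADDR_ANY;
}
```

When the structure starts out zeroed (the "add" case) both variants give the
same result; they only differ when it already holds the tunnel's current
addresses (the "change" case), which matches the behaviour shown in the
command transcript above.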


[PATCH net] bridge: fix br_multicast_query_expired() bug

2015-05-28 Thread Eric Dumazet

From: Eric Dumazet 

The querier argument of br_multicast_query_expired() is a pointer to
a struct bridge_mcast_querier:

struct bridge_mcast_querier {
struct br_ip addr;
struct net_bridge_port __rcu*port;
};

The intent of the code was to clear the port field, not the pointer to the querier.

Fixes: 2cd4143192e8 ("bridge: memorize and export selected IGMP/MLD querier port")
Signed-off-by: Eric Dumazet 
Cc: Linus Lüssing 
Cc: Steinar H. Gunderson 
---
 net/bridge/br_multicast.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index a3abe6ed111e..22fd0419b314 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -1822,7 +1822,7 @@ static void br_multicast_query_expired(struct net_bridge *br,
if (query->startup_sent < br->multicast_startup_query_count)
query->startup_sent++;
 
-   RCU_INIT_POINTER(querier, NULL);
+   RCU_INIT_POINTER(querier->port, NULL);
br_multicast_send_query(br, NULL, query);
spin_unlock(&br->multicast_lock);
 }
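
The difference between the two lines in the hunk above comes down to ordinary
C argument passing, which a minimal userspace sketch can show. The demo_*
names are hypothetical, and plain stores stand in for RCU_INIT_POINTER(); the
point is only that assigning to a pointer parameter changes the local copy,
while assigning through it changes the caller's object.

```c
#include <assert.h>
#include <stddef.h>

struct demo_port {
	int id;
};

/* Reduced stand-in for struct bridge_mcast_querier. */
struct demo_querier {
	struct demo_port *port;
};

/* Buggy shape: only the local copy of the pointer argument is cleared. */
void demo_expire_buggy(struct demo_querier *querier)
{
	querier = NULL; /* no effect outside this function */
	(void)querier;
}

/* Fixed shape: clears the port field of the object the caller passed in. */
void demo_expire_fixed(struct demo_querier *querier)
{
	querier->port = NULL;
}
```

After the buggy variant the caller's structure still points at the old port;
only the fixed variant actually forgets it.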




Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters

2015-05-28 Thread Haggai Eran

On 21/05/2015 20:43, Jason Gunthorpe wrote:
> On Thu, May 21, 2015 at 08:33:53AM +0300, Haggai Eran wrote:
> 
>> To create a new child interface on the default P_Key, it's possible to
>> use iproute:
>> # ip link add link ib0 name ib0.1 type ipoib
> 
> Uh..
> 
> A key invariant of the IP stack is that it is possible to uniquely
> identify the ingress device.
> 
> So the above scheme is fine for IPoIB, because it uses the interface
> unique QPN to uniquely ID the netdevice: (Device,Port,Pkey,QPN)
> is the unique ID tuple. The world is happy.
> 
> But RDMA CM doesn't provide the QPN. So when RDMA CM searches the
> netdevs for an address it cannot *uniquely* map to an IPoIB interface.
This is technically true, but anyone who configures their system that
way will also have ARP conflicts. I don't see why we should support such
a configuration.

> This is bad, and *completely wrong*, but today, nobody is going to
> really notice or care. The cases where it does something you don't
> want are not very significant.
> 
> But with containers.. Think this through for a minute: 'In some cases
> the RDMA CM selects the wrong child' - that goes from being a minor
> annoyance to a violation of containment! Worse the criteria for
> 'selects the wrong child' can be triggered by the contained users. E.g.
> the contained user adds an IP to their child that duplicates another
> container. Now we've lost control.
No, this is exactly what would happen in the Ethernet world. If you
create a conflicting configuration between two containers on the same
Ethernet segment, then one of them could get the traffic that was
intended for the other.
I don't see this as a violation of containment, because these containers
are assigned net_devs that communicate on the same segment, so they
behave just as two different hosts would, with or without conflicts.

> 
> The very idea of ib_get_net_dev_by_port_pkey_ip is broken.
> 
> So, I don't know what to say here.. Ideas?
> 
> 1) Forbid creating more than one pkey per ipoib interface?
You probably mean more than one IP on the same pkey. The pkey is
actually part of the request, so it's not an issue.

> 2) Somehow extend the RDMA CM to send the IPoIB qpn too?
> 3) ??
We can do something crazy in the future like moving all CM requests to
run over UDP as in RoCEv2. But both adding the QPN or moving to UDP
require a wire protocol change and won't be compatible with today's systems.

> 
> Right now the only case that comes to mind is duplicating IPs, that is
> already going to cause an ARP collision, so maybe having the RDMA CM
> randomly select an IP is not the end of the world... But with
> containers and security, who knows? I'm not confident I've
> exhaustively thought of all possibilities here.
> 
> --
> 
> Anyhow.. looking again through this series and the existing code, the
> flow is wrong, and really needs to be changed before this starts to
> make sense to anyone, and is no doubt part of how we got here..
> 
> When a REQ arrives RDMA CM needs to run down these steps (this is identical
> to what ip_input.c does)
> 
>  1) Locate the netdev associated with the ingress of the packet,
> in a sane world this is done by only checking the
> unique (Device,Port,Pkey,QPN) tuple.
> If we keep our brokenness, we'd do this based on
> (Device,Port,Pkey,IP) - if there are IP collisions then randomly
> select a netdev (similar to how ARP collision is handled).
That's what ib_get_net_dev_by_port_pkey_ip intends to do.

>  2) Then we do the ip_route_input_noref step, this will set skb_dst to
> the netdev that will handle the packet, or tell us to drop it.
> This is not always the same as the netdev that accepts the
> packet!!!
> 
> NOTE: This route step is missing today, it does critical things
> like check that the node is actually listening on the dest IP!

Isn't this a little over-engineered? If all you want is to make sure the
net dev is up, can't we use something like netif_running()?

Also, this sounds like a major change in behavior even for applications
that do not use containers. I think today RDMA CM will accept
connections even if the ipoib interface is down.

> 
>  3) Now we can use skb_dst to iterate over the set of all RDMA CM listens:
>  1) Bound to the skb_dst netdev
>  2) Unbound in the same namespace as skb_dst netdev
> The first to match the dst IP + port is the listen that will accept the
> connection, now we go into the cma_new_conn_id path, and we don't
> need rdma_translate_ip because we already have the handling netdev.
You shouldn't be able to bind one listener to a netdev in a namespace
and also have a different listener listening for any netdev on that same
namespace. (That is what cma_check_port verifies, right?) So when
looking for a listener in a namespace there should be only one match.

It is true we no longer need the rdma_translate_ip call.
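
For readers following the discussion, the listener lookup in step 3 of the
quoted proposal could look roughly like the userspace sketch below. It is a
hypothetical illustration only: none of the demo_* types or constants exist
in the RDMA CM code, and a flat array scan stands in for whatever data
structure a real implementation would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_IFINDEX_UNBOUND 0  /* hypothetical: listener not bound to a netdev */
#define DEMO_ADDR_WILDCARD   0u /* hypothetical: listener accepts any address */

/* The netdev selected by the routing step (skb_dst in the proposal). */
struct demo_netdev {
	int ifindex;
	int netns;
};

struct demo_listener {
	int bound_ifindex; /* DEMO_IFINDEX_UNBOUND if not bound */
	int netns;
	uint32_t addr;     /* DEMO_ADDR_WILDCARD or a specific address */
	uint16_t port;
};

/*
 * Return the first listener that is either bound to the destination netdev
 * or unbound in the same namespace, and that matches the destination
 * address and port. Returns NULL if nothing is listening.
 */
const struct demo_listener *
demo_find_listener(const struct demo_listener *tbl, size_t n,
		   const struct demo_netdev *dst,
		   uint32_t dst_addr, uint16_t dst_port)
{
	for (size_t i = 0; i < n; i++) {
		const struct demo_listener *l = &tbl[i];
		int dev_ok = l->bound_ifindex == dst->ifindex ||
			     (l->bound_ifindex == DEMO_IFINDEX_UNBOUND &&
			      l->netns == dst->netns);
		int addr_ok = l->addr == DEMO_ADDR_WILDCARD ||
			      l->addr == dst_addr;

		if (dev_ok && addr_ok && l->port == dst_port)
			return l;
	}
	return NULL;
}
```

If, as noted in the reply above, a bound listener and an unbound listener
cannot coexist for the same port in one namespace, at most one table entry
can match and the scan order does not matter.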

> 
> The backwards operation of the current code is part o

Re: [PATCH net-next] vlan: Add GRO support for non hardware accelerated vlan

2015-05-28 Thread Eric Dumazet

On Thu, 2015-05-28 at 20:17 +0900, Toshiaki Makita wrote:
> Currently packets with non-hardware-accelerated vlan cannot be handled
> by GRO. This causes low performance for 802.1ad and stacked vlan, as their
> vlan tags are currently not stripped by hardware.
> 
> This patch adds GRO support for non-hardware-accelerated vlan and
> improves receive performance of them.

Very nice patch !

> 
> Signed-off-by: Toshiaki Makita 
> ---
>  net/8021q/vlan.c | 94 
> 
>  1 file changed, 94 insertions(+)
> 
> diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
> index 59555f0..0a9e8e1 100644
> --- a/net/8021q/vlan.c
> +++ b/net/8021q/vlan.c
> @@ -618,6 +618,90 @@ out:
>   return err;
>  }
>  
> + vhdr2 = (struct vlan_hdr *)(p->data + off_vlan);
> + if (memcmp(vhdr, vhdr2, VLAN_HLEN))
> + NAPI_GRO_CB(p)->same_flow = 0;
> + }


This memcmp() is quite expensive; you'd better use a helper like:

/* vlan header only guaranteed to be 16bit aligned */
static bool vlan_hdr_compare(const struct vlan_hdr *h1, const struct vlan_hdr *h2)
{
#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
return *(u32 *)h1 != *(u32 *)h2;
#else
return (((__force u32)h1->h_vlan_TCI ^ (__force u32)h2->h_vlan_TCI) |
((__force u32)h1->h_vlan_encapsulated_proto ^
 (__force u32)h2->h_vlan_encapsulated_proto)) != 0;
#endif
}
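
The portable (#else) branch of the helper above can be tried out in userspace
with a sketch like the following. The struct and function names are
hypothetical stand-ins for struct vlan_hdr and the suggested helper, and the
memcmp() variant is included only as a reference to cross-check against. As
in the suggested helper, the function returns true when the two headers
differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Reduced stand-in for struct vlan_hdr: two 16-bit fields, no padding. */
struct demo_vlan_hdr {
	uint16_t tci;   /* stand-in for h_vlan_TCI */
	uint16_t proto; /* stand-in for h_vlan_encapsulated_proto */
};

/* XOR each field pair and OR the results: non-zero means "different". */
bool demo_vlan_hdr_differs(const struct demo_vlan_hdr *h1,
			   const struct demo_vlan_hdr *h2)
{
	return (((uint32_t)h1->tci ^ (uint32_t)h2->tci) |
		((uint32_t)h1->proto ^ (uint32_t)h2->proto)) != 0;
}

/* Reference version, used only to cross-check the result. */
bool demo_vlan_hdr_differs_memcmp(const struct demo_vlan_hdr *h1,
				  const struct demo_vlan_hdr *h2)
{
	return memcmp(h1, h2, sizeof(*h1)) != 0;
}
```

Both versions agree on the result; the XOR form simply avoids a library call
for a four-byte comparison and only needs 16-bit aligned loads.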



linux-next: build failure after merge of most of the trees

2015-05-28 Thread Stephen Rothwell

Hi all,

After merging all the trees, today's linux-next build (powerpc
allyesconfig) failed like this:

drivers/vhost/scsi.c: In function 'vhost_scsi_open':
drivers/vhost/scsi.c:1422:3: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration]
   vs = vzalloc(sizeof(*vs));
   ^
drivers/vhost/scsi.c:1422:6: warning: assignment makes pointer from integer without a cast
   vs = vzalloc(sizeof(*vs));
  ^
drivers/target/target_core_pr.c: In function 'core_scsi3_update_and_write_aptpl':
drivers/target/target_core_pr.c:1985:2: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration]
  buf = vzalloc(len);
  ^
drivers/target/target_core_pr.c:1985:6: warning: assignment makes pointer from integer without a cast
  buf = vzalloc(len);
  ^
drivers/target/target_core_pr.c:1991:3: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
   vfree(buf);
   ^
drivers/target/target_core_transport.c: In function 'transport_alloc_session_tags':
drivers/target/target_core_transport.c:258:3: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration]
   se_sess->sess_cmd_map = vzalloc(tag_num * tag_size);
   ^
drivers/target/target_core_transport.c:258:25: warning: assignment makes pointer from integer without a cast
   se_sess->sess_cmd_map = vzalloc(tag_num * tag_size);
 ^
drivers/target/target_core_transport.c:270:4: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
vfree(se_sess->sess_cmd_map);
^
drivers/target/target_core_transport.c: In function 'transport_kmap_data_sg':
drivers/target/target_core_transport.c:2317:2: error: implicit declaration of function 'vmap' [-Werror=implicit-function-declaration]
  cmd->t_data_vmap = vmap(pages, cmd->t_data_nents,  VM_MAP, PAGE_KERNEL);
  ^
drivers/target/target_core_transport.c:2317:53: error: 'VM_MAP' undeclared (first use in this function)
  cmd->t_data_vmap = vmap(pages, cmd->t_data_nents,  VM_MAP, PAGE_KERNEL);
 ^
drivers/target/target_core_transport.c: In function 'transport_kunmap_data_sg':
drivers/target/target_core_transport.c:2335:2: error: implicit declaration of function 'vunmap' [-Werror=implicit-function-declaration]
  vunmap(cmd->t_data_vmap);
  ^
drivers/target/target_core_file.c: In function 'fd_format_prot':
drivers/target/target_core_file.c:809:2: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration]
  buf = vzalloc(unit_size);
  ^
drivers/target/target_core_file.c:809:6: warning: assignment makes pointer from integer without a cast
  buf = vzalloc(unit_size);
  ^
drivers/target/target_core_file.c:822:2: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
  vfree(buf);
  ^
drivers/target/target_core_user.c: In function 'tcmu_configure_device':
drivers/target/target_core_user.c:895:2: error: implicit declaration of 
function 'vzalloc' [-Werror=implicit-function-declaration]
  udev->mb_addr = vzalloc(TCMU_RING_SIZE);
  ^
drivers/target/target_core_user.c:895:16: warning: assignment makes pointer 
from integer without a cast
  udev->mb_addr = vzalloc(TCMU_RING_SIZE);
^
drivers/target/target_core_user.c:947:2: error: implicit declaration of 
function 'vfree' [-Werror=implicit-function-declaration]
  vfree(udev->mb_addr);
  ^
drivers/target/iscsi/iscsi_target.c: In function 'iscsi_target_init_module':
drivers/target/iscsi/iscsi_target.c:557:2: error: implicit declaration of 
function 'vzalloc' [-Werror=implicit-function-declaration]
  iscsit_global->ts_bitmap = vzalloc(size);
  ^
drivers/target/iscsi/iscsi_target.c:557:27: warning: assignment makes pointer 
from integer without a cast
  iscsit_global->ts_bitmap = vzalloc(size);
   ^
drivers/target/iscsi/iscsi_target.c:615:2: error: implicit declaration of 
function 'vfree' [-Werror=implicit-function-declaration]
  vfree(iscsit_global->ts_bitmap);
  ^
drivers/scsi/qla2xxx/tcm_qla2xxx.c: In function 'tcm_qla2xxx_init_lport':
drivers/scsi/qla2xxx/tcm_qla2xxx.c:1578:2: error: implicit declaration of 
function 'vmalloc' [-Werror=implicit-function-declaration]
  lport->lport_loopid_map = vmalloc(sizeof(struct tcm_qla2xxx_fc_loopid) *
  ^
drivers/scsi/qla2xxx/tcm_qla2xxx.c:1578:26: warning: assignment makes pointer 
from integer without a cast
  lport->lport_loopid_map = vmalloc(sizeof(struct tcm_qla2xxx_fc_loopid) *
  ^
drivers/scsi/qla2xxx/tcm_qla2xxx.c: In function 'tcm_qla2xxx_make_lport':
drivers/scsi/qla2xxx/tcm_qla2xxx.c:1643:2: error: implicit declaration of 
function 'vfree' [-Werror=implicit-function-declaration]
  vfree(lport->lport_loopid_map);
  ^

Ouch :-(

Maybe commit 095dc8e0c368 ("tcp: fix/cleanup
inet_ehash_locks_alloc()")?  (That is the only commit I can find that
removes an include of vmalloc.h.)

Ah

Re: [PATCH net] bridge: fix br_multicast_query_expired() bug

2015-05-28 Thread Thadeu Lima de Souza Cascardo

On Thu, May 28, 2015 at 04:42:54AM -0700, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> br_multicast_query_expired() querier argument is a pointer to
> a struct bridge_mcast_querier :
> 
> struct bridge_mcast_querier {
> struct br_ip addr;
> struct net_bridge_port __rcu*port;
> };
> 
> Intent of the code was to clear port field, not the pointer to querier.
> 
> Fixes: 2cd4143192e8 ("bridge: memorize and export selected IGMP/MLD querier 
> port")
> Signed-off-by: Eric Dumazet 
> Cc: Linus Lüssing 
> Cc: Steinar H. Gunderson 
> ---
>  net/bridge/br_multicast.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
> index a3abe6ed111e..22fd0419b314 100644
> --- a/net/bridge/br_multicast.c
> +++ b/net/bridge/br_multicast.c
> @@ -1822,7 +1822,7 @@ static void br_multicast_query_expired(struct 
> net_bridge *br,
>   if (query->startup_sent < br->multicast_startup_query_count)
>   query->startup_sent++;
>  
> - RCU_INIT_POINTER(querier, NULL);
> + RCU_INIT_POINTER(querier->port, NULL);
>   br_multicast_send_query(br, NULL, query);
>   spin_unlock(&br->multicast_lock);
>  }
> 
> 
> 
> --

Acked-by: Thadeu Lima de Souza Cascardo 
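
The bug fixed above comes down to clearing a by-value pointer argument instead
of a field of the object it points to. A minimal userspace sketch, with
hypothetical simplified structures and plain assignments standing in for
RCU_INIT_POINTER (no RCU is modelled here):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the bridge structures. */
struct port_sketch {
	int id;
};

struct querier_sketch {
	struct port_sketch *port;
};

/* Buggy shape: only the local copy of the pointer is set to NULL. */
static void expire_buggy(struct querier_sketch *querier)
{
	querier = NULL;
	(void)querier;  /* the caller's object is untouched */
}

/* Fixed shape: the port field of the caller's object is cleared. */
static void expire_fixed(struct querier_sketch *querier)
{
	assert(querier != NULL);
	querier->port = NULL;
}
```

After expire_buggy(&q) the caller still sees the old q.port; only
expire_fixed(&q) resets it.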

[PATCH v6] net: tcp: Fix a PTO timing granularity issue

2015-05-28 Thread Ido Yariv

The Tail Loss Probe RFC specifies that the PTO value should be set to
max(2 * SRTT, 10ms), where SRTT is the smoothed round-trip time.

The PTO value is converted to jiffies, so the timer may expire
prematurely.

This is especially problematic on systems in which HZ <= 100, so work
around this by setting the timeout to at least 2 jiffies on such
systems.

The 10ms figure was originally selected based on tests performed with
the current implementation and HZ = 1000. Thus, leave the behavior on
systems with HZ > 100 unchanged.

Signed-off-by: Ido Yariv 
---
 include/net/tcp.h | 20 
 net/ipv4/tcp_output.c |  2 +-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2bb2bad..2ff0181 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1751,4 +1751,24 @@ static inline void skb_set_tcp_pure_ack(struct sk_buff 
*skb)
skb->truesize = 2;
 }
 
+/* Convert msecs to jiffies, ensuring that the return value is at least 2
+ * jiffies.
+ * This can be used when setting tick-based timers to guarantee that they won't
+ * expire right away.
+ */
+static inline unsigned long tcp_msecs_to_jiffies_min_2(const unsigned int m)
+{
+   if (__builtin_constant_p(m)) {
+   /* The theoretical upper bound of m for 2 jiffies is 2 seconds,
+* so compare m with that to avoid potential integer overflows.
+*/
+   if ((m > 2 * MSEC_PER_SEC) || (m * HZ > 2 * MSEC_PER_SEC))
+   return msecs_to_jiffies(m);
+
+   return 2;
+   }
+
+   return max_t(u32, 2, msecs_to_jiffies(m));
+}
+
 #endif /* _TCP_H */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 190538a..37694d2 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2207,7 +2207,7 @@ bool tcp_schedule_loss_probe(struct sock *sk)
if (tp->packets_out == 1)
timeout = max_t(u32, timeout,
(rtt + (rtt >> 1) + TCP_DELACK_MAX));
-   timeout = max_t(u32, timeout, msecs_to_jiffies(10));
+   timeout = max_t(u32, timeout, tcp_msecs_to_jiffies_min_2(10));
 
/* If RTO is shorter, just schedule TLP in its place. */
tlp_time_stamp = tcp_time_stamp + timeout;
-- 
2.1.0
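
To make the granularity problem concrete, here is a rough userspace sketch of
the arithmetic, assuming a round-up conversion from milliseconds to ticks.
msecs_to_ticks and msecs_to_ticks_min_2 are hypothetical names, not the kernel
helpers, and the tick rate is passed in as a parameter instead of being the
compile-time HZ.

```c
#include <assert.h>

#define MSEC_PER_SEC_SKETCH 1000UL

/* Round milliseconds up to whole ticks for a rate of hz ticks per second. */
static unsigned long msecs_to_ticks(unsigned long msecs, unsigned long hz)
{
	assert(hz > 0);
	return (msecs * hz + MSEC_PER_SEC_SKETCH - 1) / MSEC_PER_SEC_SKETCH;
}

/* Same conversion, but never fewer than 2 ticks. */
static unsigned long msecs_to_ticks_min_2(unsigned long msecs, unsigned long hz)
{
	unsigned long ticks = msecs_to_ticks(msecs, hz);

	return ticks < 2 ? 2 : ticks;
}
```

With hz = 100, 10 ms converts to a single tick, and a timer armed for one tick
can fire at the very next tick boundary, which may be almost immediately.
Clamping to 2 ticks leaves at least one full tick; with hz = 1000 the result
stays at 10 ticks either way.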


pull-request: mac80211 2015-05-28

2015-05-28 Thread Johannes Berg

Hi Dave,

Please excuse the quick succession with another pull request - Ben
pointed out to me that a fix I'd applied on -next is actually needed on
4.1 - we'll have to live with it being in both I suppose. Sorry about
that.

johannes


The following changes since commit f9dca80b98caac8b4bfb43a2edf1e9f877ccf322:

  mac80211: fix AP_VLAN crypto tailroom calculation (2015-05-20 15:10:11 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git 
tags/mac80211-for-davem-2015-05-28

for you to fetch changes up to 3a7af58faa7829faa26026c245d2a8a44e9605c5:

  mac80211: Fix mac80211.h docbook comments (2015-05-28 14:37:43 +0200)


This just has a single docbook build fix. In my confusion
I'd already sent the same fix for -next, but Ben Hutchings
noted it's necessary in 4.1.


Jonathan Corbet (1):
  mac80211: Fix mac80211.h docbook comments

 include/net/mac80211.h | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)



Re: [net-next 07/14] i40e: Remove unnecessary pf members

2015-05-28 Thread Or Gerlitz

On Thu, May 28, 2015 at 2:25 PM, Jeff Kirsher
 wrote:
> From: Anjali Singhai Jain 
>
> We can use the stat index macro directly, a variable is not required.

Sorry, but while attempting to look at the patch, I wasn't able to
decipher the change-log nor to somehow relate it to the patch title.
Can you make it a little clearer? By "pf members", are we talking
about SRIOV PFs or VFs or none or both?

[PATCH net v2] openvswitch: disable LRO

2015-05-28 Thread Jiri Benc

Currently, openvswitch tries to disable LRO from the user space. This does
not work correctly when the device added is a vlan interface, though.
Instead of dealing with possibly complex stacked cross name space relations
in the user space, do the same as bridging does and call dev_disable_lro in
the kernel.

Signed-off-by: Jiri Benc 
---
v1->v2: Disable LRO unconditionally. If the feature that leaves LRO enabled
is implemented in the future in ovs user space, the conditional disablement
can be implemented in the kernel easily. There won't be any problem even
if such new ovs user space is run with older kernels, enabling LRO is just
an optimization.
---
 net/openvswitch/vport-netdev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c
index 4776282c6417..33e6d6e2908f 100644
--- a/net/openvswitch/vport-netdev.c
+++ b/net/openvswitch/vport-netdev.c
@@ -125,6 +125,7 @@ static struct vport *netdev_create(const struct vport_parms 
*parms)
if (err)
goto error_master_upper_dev_unlink;
 
+   dev_disable_lro(netdev_vport->dev);
dev_set_promiscuity(netdev_vport->dev, 1);
netdev_vport->dev->priv_flags |= IFF_OVS_DATAPATH;
rtnl_unlock();
-- 
1.8.3.1


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-05-28 Thread Haggai Eran

On 26/05/2015 16:34, Doug Ledford wrote:
> On Sun, 2015-05-17 at 08:50 +0300, Haggai Eran wrote:
>> Thanks again everyone for the review comments. I've updated the patch set
>> accordingly. The main changes are in the first patch to use a read-write
>> semaphore instead of an SRCU, and with the reference counting of shared
>> ib_cm_ids.
>> Please let me know if I missed anything, or if there are other issues with
>> the series.
> 
> Hi Haggai,
> 
> I know you are probably busy reworking this right now on the basis of
> Jason's comments.  However, my biggest issue with this patch set right
> now is not technical (well, it is, but it's only partially technical).
Hi,

I'm sorry about the late reply. We had a holiday here, and then some
other tasks took precedence. I've only got back to working on this today.

> 
> This is a core feature more than anything else.  Namespaces for RDMA
> devices is not unique to IB or RoCE in any way.  Yet no thought has been
> given to how this will work universally across all of the RDMA capable
> devices (mainly I'm talking about iWARP here...
I don't agree. It is true we are not planning to provide an iWarp
implementation for network namespaces, as we lack the capacity and the
expertise. However, I think that the changes we proposed to the rdma_cm
module will work with iWarp too. Perhaps with some of Jason's
suggestions it will be smoother, but even in the current design, I think
that if iWarp drivers can provide iw_cm with the network device on which
a request is received, then it should be simple to modify it for
namespace support without significant change to rdma_cm.

> I don't think this is an
> issue for usNIC as if you want namespace support there, you just start
> the user space app in a given namespace and you are probably 90% of the
> way there since the user space application gets its own device and so
> its own MAC/IP and all of the RDMA transfers are UDP, so the
> application's namespace should get inherited by all the rest, but Cisco
> would need to confirm that, hence why I say 90% of the way there, it
> needs confirmed).
> 
> So, while you are reworking things right now, you would ideally contact
> Steve Wise and/or Tatyana Nikolova and discuss the iWARP story on this.
> I know there won't be a lot of overlap between IB and iWARP, but last
> time you were asked you didn't even know if this setup could be extended
> to iWARP.
> 
> For this next statement, I know I'm directing this to you Haggai, but
> please don't take it that way.  I'm really using your patch set to make
> a broader point to everyone on the list.
> 
> When I look at patches for support for a given feature, one of the
> things I'm going to look at is whether or not that feature is specific
> to a given hardware type, or if it's a generic feature.  If it's a
> generic feature, then I'm going to want to know that the person
> submitting it has designed it well.  A pre-requisite of designing a
> generic feature well is that it considers all hardware types, not just
> your specific hardware type.  So when you come back with the next
> version of this patch set, please have an answer for how it should work
> on each hardware type even if you don't have implementation patches for
> each hardware type.

Well, because the RDMA subsystem supports a very diverse set of devices,
I think there are few people who know the details of all hardware types
well. If we are going to evolve the generic parts of the stack, we have
to cooperate. We have to rely on the knowledge of people on the mailing
list to say whether the feature is well designed for all hardware types,
or whether changes are warranted. In this specific case, the patches have
been on the list since February. I think it is enough time to allow
anyone who is interested in network namespace support to chime in.

Regards,
Haggai


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-05-28 Thread Haggai Eran

On 26/05/2015 19:59, Jason Gunthorpe wrote:
> The big open question for ethernet is how to work without relying on
> VLAN to create delegated netdevs - typically one would use a bridge and
> veth's, which do not seem very RDMA compatible. But that doesn't need
> to be answered right now.

I think in Ethernet the first step would be to support macvlan devices.
Like IPoIB child devices, they are directly attached to an RDMA device,
so they don't require handling a complex virtual bridging topology as
veths do.

Re: [net-next 01/14] ethtool: Add helper routines to pass vf to rx_flow_spec

2015-05-28 Thread Or Gerlitz

On Thu, May 28, 2015 at 2:25 PM, Jeff Kirsher
 wrote:
> From: John Fastabend 
>
> The ring_cookie is 64 bits wide which is much larger than can be used
> for actual queue index values. So provide some helper routines to
> pack a VF index into the cookie. This is useful to steer packets to
> a VF ring without having to know the queue layout of the device.

So this patch aims to generalize the proprietary solution introduced
in the commit below?

commit e7c8c60bc5d48994a67e4b1c7bfb01d6979dbc54
Author: Anjali Singhai Jain 
Date:   Tue Apr 7 19:45:31 2015 -0400

i40e: Add support to program FDir SB rules for VF from PF through ethtool

With this patch we can now add Flow director Sideband rules for a VF from
its PF. Here is an example of how it can be done when VF id = 5 and
queue = 2:

"ethtool -N ethx flow-type udp4 src-ip x.x.x.x dst-ip y.y.y.y
src-port p1 dst-port p2 action 2 user-def 5"

User-def specifies VF id and action specifies queue.


>  include/uapi/linux/ethtool.h | 25 +
>  1 file changed, 25 insertions(+)
>
> diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
> index ae832b4..0594933 100644
> --- a/include/uapi/linux/ethtool.h
> +++ b/include/uapi/linux/ethtool.h
> @@ -796,6 +796,31 @@ struct ethtool_rx_flow_spec {
> __u32   location;
>  };
>
> +/* How rings are laid out when accessing virtual functions or
> + * offloaded queues is device specific. To allow users to do flow
> + * steering and specify these queues the ring cookie is partitioned
> + * into a 32bit queue index with an 8 bit virtual function id.
> + * This also leaves the 3bytes for further specifiers. It is possible
> + * future devices may support more than 256 virtual functions if
> + * devices start supporting PCIe w/ARI. However at the moment I
> + * do not know of any devices that support this so I do not reserve
> + * space for this at this time. If a future patch consumes the next
> + * byte it should be aware of this possibility.
> + */
> +#define ETHTOOL_RX_FLOW_SPEC_RING  0x00000000FFFFFFFFLL
> +#define ETHTOOL_RX_FLOW_SPEC_RING_VF   0x000000FF00000000LL
> +#define ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF 32
> +static inline __u64 ethtool_get_flow_spec_ring(__u64 ring_cookie)
> +{
> +   return ETHTOOL_RX_FLOW_SPEC_RING & ring_cookie;
> +};
> +
> +static inline __u64 ethtool_get_flow_spec_ring_vf(__u64 ring_cookie)
> +{
> +   return (ETHTOOL_RX_FLOW_SPEC_RING_VF & ring_cookie) >>
> +   ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF;
> +};
> +
>  /**
>   * struct ethtool_rxnfc - command to get or set RX flow classification rules
>   * @cmd: Specific command number - %ETHTOOL_GRXFH, %ETHTOOL_SRXFH,
> --
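
As a sanity check of the layout described in the quoted comment (32-bit queue
index in the low bits, 8-bit VF id starting at bit 32), here is a small
userspace sketch. The mask values are assumptions derived from that comment,
and ring_cookie_pack is a hypothetical helper; the quoted patch itself only
provides the two getters.

```c
#include <assert.h>
#include <stdint.h>

/* Masks assumed from the quoted comment: 32-bit queue index, 8-bit VF id. */
#define RING_MASK_SKETCH     0x00000000FFFFFFFFULL
#define RING_VF_MASK_SKETCH  0x000000FF00000000ULL
#define RING_VF_OFF_SKETCH   32

/* Hypothetical packer: combine a queue index and a VF id into one cookie. */
static uint64_t ring_cookie_pack(uint32_t queue, unsigned int vf)
{
	assert(vf <= 0xFF);  /* only 8 bits are reserved for the VF id */
	return ((uint64_t)vf << RING_VF_OFF_SKETCH) | queue;
}

/* Extract the queue index (low 32 bits). */
static uint64_t ring_cookie_queue(uint64_t cookie)
{
	return cookie & RING_MASK_SKETCH;
}

/* Extract the VF id (bits 32..39). */
static uint64_t ring_cookie_vf(uint64_t cookie)
{
	return (cookie & RING_VF_MASK_SKETCH) >> RING_VF_OFF_SKETCH;
}
```

For the ethtool example quoted earlier (VF id 5, queue 2), the cookie under
this layout would be 0x0000000500000002.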

Re: [PATCH net v2] openvswitch: disable LRO

2015-05-28 Thread Flavio Leitner

On Thu, May 28, 2015 at 03:04:53PM +0200, Jiri Benc wrote:
> Currently, openvswitch tries to disable LRO from the user space. This does
> not work correctly when the device added is a vlan interface, though.
> Instead of dealing with possibly complex stacked cross name space relations
> in the user space, do the same as bridging does and call dev_disable_lro in
> the kernel.
> 
> Signed-off-by: Jiri Benc 
> ---
LGTM
Acked-by: Flavio Leitner 



Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-05-28 Thread Haggai Eran

On 26/05/2015 20:46, Doug Ledford wrote:
>> Remember, this isn't RDMA namespaces, this is netdev namespace support
>> > for RDMA-CM -> very different things.
> That was the point of my email.  This is a very myopic view of the
> feature.  It *should* at least have an idea of these other things too.

We did give some thought to the question of whether an RDMA namespace is
needed, and concluded that it isn't. RDMA resources such as QP numbers,
memory keys, etc. are allocated by the devices. So different containers
wouldn't care if they share the "QP number namespace", etc. RDMA CM
ports are different because they are chosen by the applications, but
they map directly to the network namespace, so they don't require their
own namespace.

Regards,
Haggai


Re: [PATCH v4] bnx2x: Alloc 4k fragment for each rx ring buffer element

2015-05-28 Thread Gabriel Krisman Bertazi

Yuval Mintz  writes:

> Actually, this upsets me greatly. We didn't see it on a system with 4KB
> pages, but this means you've actually tried to 'sell' us a fastpath fix that
> was never tested on machines for which it was meant as an improvement.

The iteration that inserted this bug was such a quick fix that I didn't
bother rerunning it.  It just modified the type variable.  Take my word,
it was a one-time honest mistake.

The last iteration, as well as all of the previous ones were tested on
Power servers with bnx2x adapters.  The tests included stressing the
device and several iterations of hotplugging the adapter, helping us to
verify correctness of the rx queue operation as well as verifying that DMA
mapping/unmapping were correct.

-- 
Gabriel Krisman Bertazi


Re: [PATCH] sctp: fix ASCONF list handling

2015-05-28 Thread Marcelo Ricardo Leitner

On Thu, May 28, 2015 at 08:17:27AM -0300, Marcelo Ricardo Leitner wrote:
> On Thu, May 28, 2015 at 06:15:11AM -0400, Neil Horman wrote:
> > On Wed, May 27, 2015 at 09:52:17PM -0300, mleit...@redhat.com wrote:
> > > From: Marcelo Ricardo Leitner 
> > > 
> > > ->auto_asconf_splist is per namespace and mangled by functions like
> > > sctp_setsockopt_auto_asconf() which doesn't guarantee any serialization.
> > > 
> > > Also, the call to inet_sk_copy_descendant() was backing up
> > > ->auto_asconf_list through the copy but was not honoring
> > > ->do_auto_asconf, which could lead to list corruption if it was
> > > different between both sockets.
> > > 
> > > This commit thus fixes the list handling by adding a spinlock to protect
> > > against multiple writers and converts the list to be protected by RCU
> > > too, so that we don't have a lock inversion issue at
> > > sctp_addr_wq_timeout_handler().
> > > 
> > > And as this list now uses RCU, we cannot do such backup and restore
> > > while copying descendant data anymore as readers may be traversing the
> > > list meanwhile. We fix this by simply ignoring/not copying those fields,
> > > placed at the end of struct sctp_sock, so we can just ignore it together
> > > with struct ipv6_pinfo data. For that we create sctp_copy_descendant()
> > > so we don't clutter inet_sk_copy_descendant() with SCTP info.
> > > 
> > > Issue was found with a test application that kept flipping sysctl
> > > default_auto_asconf on and off.
> > > 
> > > Fixes: 9f7d653b67ae ("sctp: Add Auto-ASCONF support (core).")
> > > Signed-off-by: Marcelo Ricardo Leitner 
> > > ---
> > >  include/net/netns/sctp.h   |  6 +-
> > >  include/net/sctp/structs.h |  2 ++
> > >  net/sctp/protocol.c|  6 +-
> > >  net/sctp/socket.c  | 39 ++-
> > >  4 files changed, 38 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/include/net/netns/sctp.h b/include/net/netns/sctp.h
> > > index 
> > > 3573a81815ad9e0efb6ceb721eb066d3726419f0..e080bebb3147af39c8275261f57018eb01e917b0
> > >  100644
> > > --- a/include/net/netns/sctp.h
> > > +++ b/include/net/netns/sctp.h
> > > @@ -30,12 +30,15 @@ struct netns_sctp {
> > >   struct list_head local_addr_list;
> > >   struct list_head addr_waitq;
> > >   struct timer_list addr_wq_timer;
> > > - struct list_head auto_asconf_splist;
> > > + struct list_head __rcu auto_asconf_splist;
> > You should use the addr_wq_lock here instead of creating a new lock, as that's
> > already used to protect most accesses to the list you are concerned about.
> 
> Ok, that works too.
> 
> > Though truthfully, that shouldn't be necessary.  The list in question is only
> > read in one location and only written in one location.  You can likely just
> > rcu-ify, as the write side is in process context and protected by lock_sock.
> 
> It should, it's not protected by lock_sock as this list resides in
> netns_sctp structure, which lock_sock doesn't cover. Write side is in
> process context yes, but this list is written in sctp_init_sock(),
> sctp_destroy_sock() and sctp_setsockopt_auto_asconf(), so one could
> trigger this by either creating/destroying sockets if
> default_auto_asconf=1 or just by creating a bunch of sockets and
> flipping asconf via setsockopt (or a combination of these operations).
> (I'll point this out in the changelog)

Hmm.. by reusing addr_wq_lock we don't need to rcu-ify the list, as the
reader is inside that lock too, so I can just protect auto_asconf_splist
writers with addr_wq_lock.

Nice, thanks Neil.

  Marcelo

Re: linux-next: manual merge of the net-next tree with Linus' tree

2015-05-28 Thread Tom Lendacky


On 05/27/2015 11:17 PM, Stephen Rothwell wrote:

Hi all,

Today's linux-next merge of the net-next tree got a conflict in
drivers/net/phy/amd-xgbe-phy.c between commit 983942a5eaca
("amd-xgbe-phy: Fix initial mode when autoneg is disabled") from Linus'
tree and commit 7c12aa08779c ("amd-xgbe: Move the PHY support into
amd-xgbe") from the net-next tree.

I fixed it up (the latter removed the file, so I did that - there may
be more needed) and can carry the fix as necessary (no action is
required).


I looked over the tree and it appears that everything is good. Deleting
the file is the correct action and it looks like the necessary changes
to the Makefiles and Kconfigs are present to account for that.

Thanks,
Tom





Re: DSA and underlying 802.1Q encapsulation

2015-05-28 Thread Vivien Didelot

Hi Guenter,

>>> If yes, the dsa code may need to move the tag into the header.
>>> If we are lucky, a call to vlan_hwaccel_push_inside() might do it.
>>
>> Thanks, I'm currently looking into it and doing some tests, I'm coming back to
>> you asap.

Issue fixed, thanks! vlan_hwaccel_push_inside() adds an additional memmove, but
this approach is simpler.

>>> Do you have some vlan dsa code to share, by any chance ? That might
>>> save me some time, as I am looking into it as well.
>>
>> Yes, I am about to send an RFC for fully integrated 802.1q VLAN support on
>> Marvell 88E6352 devices.
>>
> Excellent, that is going to save me a lot of work! If you don't mind,
> please Cc: me as I am not subscribed to the network mailing list.

My work is based on v4.1-rc3. Unfortunately the VLAN support breaks on latest
net-next/master (e.g. NETIF_F_HW_SWITCH_OFFLOAD is removed; .ndo_bridge_setlink,
used to set tag and PVID, is moved to switchdev).

Are you willing to give it a try based on v4.1-rc3 in the meantime?
Otherwise, I'll send the RFC as soon as the support is functional on net-next.

Thanks,
-v
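
For readers wondering where the extra memmove mentioned above comes from:
moving a VLAN tag from out-of-band metadata into the frame means opening a
4-byte gap after the two MAC addresses. The following is only a flat-buffer
userspace sketch with hypothetical names; the kernel helper operates on an
sk_buff and its headroom, which is not modelled here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN_SKETCH  6
#define VLAN_HLEN_SKETCH 4

/*
 * Insert an 802.1Q tag after the two MAC addresses of an Ethernet frame.
 * buf must have room for len + 4 bytes. Returns the new frame length.
 */
static size_t vlan_push_inside_sketch(uint8_t *buf, size_t len, uint16_t tci)
{
	const size_t off = 2 * ETH_ALEN_SKETCH;

	assert(buf != NULL && len >= off + 2);
	/* Shift ethertype and payload right by 4 bytes to open a gap. */
	memmove(buf + off + VLAN_HLEN_SKETCH, buf + off, len - off);
	buf[off]     = 0x81;                 /* TPID 0x8100, big endian */
	buf[off + 1] = 0x00;
	buf[off + 2] = (uint8_t)(tci >> 8);  /* TCI, big endian */
	buf[off + 3] = (uint8_t)(tci & 0xff);
	return len + VLAN_HLEN_SKETCH;
}
```

The cost is one copy of the bytes that have to move; in exchange the tagging
code sees an ordinary in-band 802.1Q header.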

Re: [PATCH net-next] neigh: Add missing rcu_assign_pointer

2015-05-28 Thread Herbert Xu

Eric Dumazet  wrote:
>
> This patch is not needed.
> 
> You really should read Documentation/RCU , because it looks like you are
> quite confused.
> 
> When we remove an element from a RCU protected list, all the objects in
> the chain are already ready to be caught by rcu readers.
> 
> Therefore, no additional memory barrier is needed before doing *np =
> n->next;
> 
> Please do not add spurious memory barriers. Like atomic operations, we
> want all of them being required and possibly documented.

This patch is indeed bogus, but accessing an RCU-protected pointer like
this will trigger sparse warnings.  So better make it an
RCU_INIT_POINTER.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
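
The distinction discussed above (a barrier when publishing a new object, none
when unlinking one) can be approximated in userspace with C11 atomics. This is
an analogy under stated assumptions, not the kernel API: a release store plays
the role of rcu_assign_pointer, a relaxed store the role of RCU_INIT_POINTER,
all names are hypothetical, writers are assumed to be serialized by a lock, and
deferred freeing is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node_sketch {
	int key;
	struct node_sketch *_Atomic next;
};

/* Publishing a NEW node: release ordering so readers see its fields. */
static void publish_front(struct node_sketch *_Atomic *head,
			  struct node_sketch *n)
{
	assert(n != NULL);
	atomic_store_explicit(&n->next,
			      atomic_load_explicit(head, memory_order_relaxed),
			      memory_order_relaxed);
	atomic_store_explicit(head, n, memory_order_release);
}

/* Unlinking: the successor is already published, so no barrier is needed. */
static void unlink_at(struct node_sketch *_Atomic *np)
{
	struct node_sketch *n = atomic_load_explicit(np, memory_order_relaxed);

	assert(n != NULL);
	atomic_store_explicit(np,
			      atomic_load_explicit(&n->next, memory_order_relaxed),
			      memory_order_relaxed);
}
```

unlink_at() corresponds to the "*np = n->next" step: every node reachable
through n->next was initialized before it became visible, so the store only
needs to be a properly annotated pointer write.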

RE: [PATCH net-next 1/1] hv_netvsc: Properly size the vrss queues

2015-05-28 Thread KY Srinivasan



> -Original Message-
> From: Dan Carpenter [mailto:dan.carpen...@oracle.com]
> Sent: Thursday, May 28, 2015 12:06 AM
> To: KY Srinivasan
> Cc: da...@davemloft.net; netdev@vger.kernel.org; linux-
> ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de;
> a...@canonical.com; jasow...@redhat.com
> Subject: Re: [PATCH net-next 1/1] hv_netvsc: Properly size the vrss queues
> 
> Since you're redoing this anyway.
> 
> On Tue, May 26, 2015 at 04:21:09PM -0700, K. Y. Srinivasan wrote:
> > diff --git a/drivers/net/hyperv/hyperv_net.h
> b/drivers/net/hyperv/hyperv_net.h
> > index ddcc7f8..dd45440 100644
> > --- a/drivers/net/hyperv/hyperv_net.h
> > +++ b/drivers/net/hyperv/hyperv_net.h
> > @@ -161,6 +161,7 @@ struct netvsc_device_info {
> > unsigned char mac_adr[ETH_ALEN];
> > bool link_state;/* 0 - link up, 1 - link down */
> > int  ring_size;
> > +   u32  max_num_vrss_chns;
> 
> We (Joe and I) have commented before that long names don't mix well with
> the 80 character limit.  You could just leave the "num_" out.  Almost
> all variables are numbers in C so it doesn't add anything.

Thanks Dan. Actually I sent out the revised patch yesterday and I currently
don't have the 80 char issue. If it is ok with Dave, I will not re-spin the
patch. However, I will note this comment for future work.

K. Y
> 
> regards,
> dan carpenter

Re: [PATCH net-next 1/1] hv_netvsc: Properly size the vrss queues

2015-05-28 Thread Dan Carpenter

On Thu, May 28, 2015 at 01:52:47PM +, KY Srinivasan wrote:
> 
> 
> > -Original Message-
> > From: Dan Carpenter [mailto:dan.carpen...@oracle.com]
> > Sent: Thursday, May 28, 2015 12:06 AM
> > To: KY Srinivasan
> > Cc: da...@davemloft.net; netdev@vger.kernel.org; linux-
> > ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de;
> > a...@canonical.com; jasow...@redhat.com
> > Subject: Re: [PATCH net-next 1/1] hv_netvsc: Properly size the vrss queues
> > 
> > Since you're redoing this anyway.
> > 
> > On Tue, May 26, 2015 at 04:21:09PM -0700, K. Y. Srinivasan wrote:
> > > diff --git a/drivers/net/hyperv/hyperv_net.h
> > b/drivers/net/hyperv/hyperv_net.h
> > > index ddcc7f8..dd45440 100644
> > > --- a/drivers/net/hyperv/hyperv_net.h
> > > +++ b/drivers/net/hyperv/hyperv_net.h
> > > @@ -161,6 +161,7 @@ struct netvsc_device_info {
> > >   unsigned char mac_adr[ETH_ALEN];
> > >   bool link_state;/* 0 - link up, 1 - link down */
> > >   int  ring_size;
> > > + u32  max_num_vrss_chns;
> > 
> > We (Joe and I) have commented before that long names don't mix well with
> > the 80 character limit.  You could just leave the "num_" out.  Almost
> > all variables are numbers in C so it doesn't add anything.
> 
> Thanks Dan. Actually I sent out the revised patch yesterday and I currently
> don't have the 80 char issue. If it is ok with Dave, I will not re-spin the
> patch. However, I will note this comment for future work.

Yes.  I saw that.  Fine, fine.  It wasn't a redo-the-patch-worthy
comment.  :P

regards,
dan carpenter


[PATCH] net: thunderx: add 64-bit dependency

2015-05-28 Thread Arnd Bergmann

The thunderx ethernet driver fails to build on architectures
that do not have an atomic readq() and writeq() function for
64-bit PCI bus access:

drivers/net/ethernet/cavium/thunder/thunder_bgx.c: In function 'bgx_reg_read':
include/asm-generic/io.h:195:23: error: implicit declaration of function 'readq' [-Werror=implicit-function-declaration]

It seems impossible to get this driver to work on most 32-bit
hardware, so it's better to add an explicit dependency, in
order to let us keep building 'allmodconfig' kernels on
all architectures.

As the driver is meant for the internal hardware on an arm64 SoC, this
is not a problem for usability. Allowing the build on all 64-bit
architectures rather than just CONFIG_ARM64 on the other hand means that
we get the benefit of build testing on x86.

Signed-off-by: Arnd Bergmann 

diff --git a/drivers/net/ethernet/cavium/Kconfig b/drivers/net/ethernet/cavium/Kconfig
index 6365fb4242be..fc3d8e3ee807 100644
--- a/drivers/net/ethernet/cavium/Kconfig
+++ b/drivers/net/ethernet/cavium/Kconfig
@@ -4,7 +4,7 @@
 
 config NET_VENDOR_CAVIUM
 	tristate "Cavium ethernet drivers"
-	depends on PCI
+	depends on PCI && 64BIT
 	---help---
 	  Enable support for the Cavium ThunderX Network Interface
 	  Controller (NIC). The NIC provides the controller and DMA


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-05-28 Thread Doug Ledford

On Thu, 2015-05-28 at 16:07 +0300, Haggai Eran wrote:
> On 26/05/2015 16:34, Doug Ledford wrote:
> > On Sun, 2015-05-17 at 08:50 +0300, Haggai Eran wrote:
> > This is a core feature more than anything else.  Namespaces for RDMA
> > devices are not unique to IB or RoCE in any way.  Yet no thought has been
> > given to how this will work universally across all of the RDMA capable
> > devices (mainly I'm talking about iWARP here...
> I don't agree. It is true we are not planning to provide an iWarp
> implementation for network namespaces, as we lack the capacity and the
> expertise. However, I think that the changes we proposed to the rdma_cm
> module will work with iWarp too. Perhaps with some of Jason's
> suggestions it will be smoother, but even in the current design, I think
> that if iWarp drivers can provide iw_cm with the network device on which
> a request is received, then it should be simple to modify it for
> namespace support without significant change to rdma_cm.

My request wasn't for a functional implementation, just a statement that
you had in fact thought about it and, as you say here, would expect it
to work (and preferably why as well).

> Well, because the RDMA subsystem supports a very diverse set of devices,
> I think there are few people who know the details of all hardware types
> well. If we are going to evolve the generic parts of the stack, we have
> to cooperate. We have to rely on the knowledge of people on the mailing
> list to say whether the feature is well designed for all hardware types,
> or whether changes are warranted. In this specific case, the patches have
> been on the list since February. I think it is enough time to allow
> anyone who is interested in network namespace support to chime in.

You would think that, but sometimes important information comes from
totally different places.  See mine and Jason's comments back and forth
in the SRIOV thread started by Or.

Long story short:

ip link add dev ib0 name ib0.1 type ipoib

is totally broken on at least all Red Hat OSes.  It will require
reworking of the network scripts and NetworkManager assumptions to make
it work.  It will also break DHCP on the interface as pkey/guid are the
only items that uniquely identify DHCP clients.  The net result of our
talks was that it is likely that each interface on the same pkey will
require an alias GUID per child interface in order to keep things
workable.

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD


Re: [PATCH] netevent: remove automatic variable in register_netevent_notifier()

2015-05-28 Thread Sergei Shtylyov


Hello.

On 5/28/2015 1:00 PM, Wang Long wrote:


Remove automatic variable 'err' in register_netevent_notifier() and
return the return value of atomic_notifier_chain_register() directly.


   s/return value/result/, in order to avoid tautology.


Signed-off-by: Wang Long 

[...]
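
For illustration, the cleanup under review can be sketched outside the
kernel as follows.  This is only a minimal model: the struct definitions,
chain_register() and the two register_*() names are hypothetical stand-ins,
and the real atomic_notifier_chain_register() does more (locking, priority
ordering) than this simplified append.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb, unsigned long ev, void *p);
	struct notifier_block *next;
};

struct notifier_head {
	struct notifier_block *head;
};

static struct notifier_head netevent_chain;

/* Simplified model of a chain register call: append to the list, return 0. */
static int chain_register(struct notifier_head *nh, struct notifier_block *nb)
{
	struct notifier_block **p = &nh->head;

	assert(nb != NULL);
	while (*p)
		p = &(*p)->next;
	nb->next = NULL;
	*p = nb;
	return 0;
}

/* Before: the result is parked in an automatic variable, then returned. */
int register_with_local(struct notifier_block *nb)
{
	int err;

	err = chain_register(&netevent_chain, nb);
	return err;
}

/* After: the callee's result is returned directly. */
int register_direct(struct notifier_block *nb)
{
	return chain_register(&netevent_chain, nb);
}
```

Both variants behave identically; the second simply drops the temporary.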

WBR, Sergei


Re: DSA and underlying 802.1Q encapsulation

2015-05-28 Thread Guenter Roeck


On 05/28/2015 06:44 AM, Vivien Didelot wrote:

Hi Guenter,


If yes, the dsa code may need to move the tag into the header.
If we are lucky, a call to vlan_hwaccel_push_inside() might do it.


Thanks, I'm currently looking into it and doing some tests, I'm coming back to
you asap.


Issue fixed, thanks! vlan_hwaccel_push_inside() adds additional memmove, but
this approach is simpler.


Do you have some vlan dsa code to share, by any chance ? That might
save me some time, as I am looking into it as well.


Yes, I am about to send an RFC for fully integrated 802.1q VLAN support on
Marvell 88E6352 devices.


Excellent, that is going to save me a lot of work! If you don't mind,
please Cc: me as I am not subscribed to the network mailing list.


My work is based on v4.1-rc3. Unfortunately the VLAN support breaks on latest
net-next/master (e.g. NETIF_F_HW_SWITCH_OFFLOAD is removed; .ndo_bridge_setlink
used to set tag and PVID, is moved to switchdev.)

Are you willing to give it a try based on v4.1-rc3 in the meantime?
Otherwise, I'll send the RFC as soon as the support is functional on net-next.



I already rebased my code to net-next, not for the switchdev changes but
to catch the dsa changes, so net-next would be much better. I'd like to get
a glance at your code, though, if that is possible.

Thanks,
Guenter


Re: [net-next 01/14] ethtool: Add helper routines to pass vf to rx_flow_spec

2015-05-28 Thread Sergei Shtylyov


Hello.

On 5/28/2015 2:25 PM, Jeff Kirsher wrote:


From: John Fastabend 



The ring_cookie is 64 bits wide which is much larger than can be used
for actual queue index values. So provide some helper routines to
pack a VF index into the cookie. This is useful to steer packets to
a VF ring without having to know the queue layout of the device.



CC: Alex Duyck 
Signed-off-by: John Fastabend 
Signed-off-by: Jeff Kirsher 
---
  include/uapi/linux/ethtool.h | 25 +
  1 file changed, 25 insertions(+)



diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index ae832b4..0594933 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -796,6 +796,31 @@ struct ethtool_rx_flow_spec {
__u32   location;
  };

+/* How rings are layed out when accessing virtual functions or


   s/layed/laid/.


+ * offloaded queues is device specific. To allow users to do flow
+ * steering and specify these queues the ring cookie is partitioned
+ * into a 32bit queue index with an 8 bit virtual function id.
+ * This also leaves the 3bytes for further specifiers. It is possible
+ * future devices may support more than 256 virtual functions if
+ * devices start supporting PCIe w/ARI. However at the moment I
+ * do not know of any devices that support this so I do not reserve
+ * space for this at this time. If a future patch consumes the next
+ * byte it should be aware of this possiblity.


   Possibility.

[...]
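
As a rough sketch of the partitioning that the quoted comment describes
(a 32-bit queue index with an 8-bit VF id above it), the packing could
look like the code below.  The macro and function names here are
hypothetical and chosen for illustration; they are not taken from the
patch, whose helper bodies are elided above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, following the quoted comment:
 * bits 0-31 queue index, bits 32-39 VF id, bits 40-63 left unused. */
#define RING_MASK     0x00000000FFFFFFFFULL
#define RING_VF_MASK  0x000000FF00000000ULL
#define RING_VF_OFF   32

/* Pack a queue index and a VF id into one 64-bit ring cookie. */
uint64_t ring_cookie_pack(uint32_t queue, uint32_t vf)
{
	assert(vf <= 0xff);	/* only 8 bits are reserved for the VF id */
	return ((uint64_t)vf << RING_VF_OFF) | queue;
}

/* Extract the queue index from a cookie. */
uint32_t ring_cookie_queue(uint64_t cookie)
{
	return (uint32_t)(cookie & RING_MASK);
}

/* Extract the VF id from a cookie. */
uint32_t ring_cookie_vf(uint64_t cookie)
{
	return (uint32_t)((cookie & RING_VF_MASK) >> RING_VF_OFF);
}
```

With this layout a caller only has to name a VF and a queue relative to
that VF; it does not need to know how the device numbers its rings.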

WBR, Sergei


Re: [PATCH] sctp: fix ASCONF list handling

2015-05-28 Thread Neil Horman

On Thu, May 28, 2015 at 10:27:32AM -0300, Marcelo Ricardo Leitner wrote:
> On Thu, May 28, 2015 at 08:17:27AM -0300, Marcelo Ricardo Leitner wrote:
> > On Thu, May 28, 2015 at 06:15:11AM -0400, Neil Horman wrote:
> > > On Wed, May 27, 2015 at 09:52:17PM -0300, mleit...@redhat.com wrote:
> > > > From: Marcelo Ricardo Leitner 
> > > > 
> > > > ->auto_asconf_splist is per namespace and mangled by functions like
> > > > sctp_setsockopt_auto_asconf() which doesn't guarantee any serialization.
> > > > 
> > > > Also, the call to inet_sk_copy_descendant() was backuping
> > > > ->auto_asconf_list through the copy but was not honoring
> > > > ->do_auto_asconf, which could lead to list corruption if it was
> > > > different between both sockets.
> > > > 
> > > > This commit thus fixes the list handling by adding a spinlock to protect
> > > > against multiple writers and converts the list to be protected by RCU
> > > > > too, so that we don't have a lock inversion issue at
> > > > sctp_addr_wq_timeout_handler().
> > > > 
> > > > And as this list now uses RCU, we cannot do such backup and restore
> > > > while copying descendant data anymore as readers may be traversing the
> > > > list meanwhile. We fix this by simply ignoring/not copying those fields,
> > > > placed at the end of struct sctp_sock, so we can just ignore it together
> > > > with struct ipv6_pinfo data. For that we create sctp_copy_descendant()
> > > > so we don't clutter inet_sk_copy_descendant() with SCTP info.
> > > > 
> > > > Issue was found with a test application that kept flipping sysctl
> > > > default_auto_asconf on and off.
> > > > 
> > > > Fixes: 9f7d653b67ae ("sctp: Add Auto-ASCONF support (core).")
> > > > Signed-off-by: Marcelo Ricardo Leitner 
> > > > ---
> > > >  include/net/netns/sctp.h   |  6 +-
> > > >  include/net/sctp/structs.h |  2 ++
> > > >  net/sctp/protocol.c|  6 +-
> > > >  net/sctp/socket.c  | 39 ++-
> > > >  4 files changed, 38 insertions(+), 15 deletions(-)
> > > > 
> > > > diff --git a/include/net/netns/sctp.h b/include/net/netns/sctp.h
> > > > > index 3573a81815ad9e0efb6ceb721eb066d3726419f0..e080bebb3147af39c8275261f57018eb01e917b0 100644
> > > > --- a/include/net/netns/sctp.h
> > > > +++ b/include/net/netns/sctp.h
> > > > @@ -30,12 +30,15 @@ struct netns_sctp {
> > > > struct list_head local_addr_list;
> > > > struct list_head addr_waitq;
> > > > struct timer_list addr_wq_timer;
> > > > -   struct list_head auto_asconf_splist;
> > > > +   struct list_head __rcu auto_asconf_splist;
> > > > You should use the addr_wq_lock here instead of creating a new lock, as
> > > > that's already used to protect most accesses to the list you are
> > > > concerned about.
> > 
> > Ok, that works too.
> > 
> > > > Though truthfully, that shouldn't be necessary.  The list in question
> > > > is only read in one location and only written in one location.  You can
> > > > likely just rcu-ify, as the write side is in process context and
> > > > protected by lock_sock.
> > 
> > It should, it's not protected by lock_sock as this list resides in
> > netns_sctp structure, which lock_sock doesn't cover. Write side is in
> > process context yes, but this list is written in sctp_init_sock(),
> > sctp_destroy_sock() and sctp_setsockopt_auto_asconf(), so one could
> > trigger this by either creating/destroying sockets if
> > default_auto_asconf=1 or just by creating a bunch of sockets and
> > flipping asconf via setsockopt (or a combination of these operations).
> > (I'll point this out in the changelog)
> 
> Hmm.. by reusing addr_wq_lock we don't need to rcu-ify the list, as the
> reader is inside that lock too, so I can just protect auto_asconf_splist
> writers with addr_wq_lock.
> 
> Nice, thanks Neil.
> 
Yup, thanks!
Neil

>   Marcelo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Re: tc drop stats different between bond and slave interfaces

2015-05-28 Thread jsulli...@opensourcedevel.com

> On May 26, 2015 at 1:10 PM Cong Wang  wrote:
>
>
> On Mon, May 25, 2015 at 10:35 PM, jsulli...@opensourcedevel.com
>  wrote:
> >
> > I was also surprised to see that, although we are using a prio qdisc on the
> > bond, the physical interface is showing pfifo_fast.
> >
> [...]
> >
> > So why the difference and why the pfifo_fast qdiscs on the physical
> > interfaces?
>
> Qdisc is not aware of the network interface you attach it to, so it doesn't
> know whether it is bond or whatever stacked interface, the qdisc you add to
> bonding master has no idea about its slaves.
>
> For pfifo_fast, it is the default qdisc when you install mq on root, it is
> where mq actually holds the packets.
>
> Hope this helps.

Grr . . . . I think this web client formatted my last response with HTML by
default.  My apologies.

Yes, your reply does help, thank you, although it then raises an interesting
question.  If I neglect the slave interfaces as I have done, can I
accidentally impact the shaping I have done on the bond master? For example,
I may prioritize real time voice and video so their relatively evenly spaced
packets are prioritized and sent to the physical interface with no special
ToS marking.  Someone's selfish mail application sets ToS bits for high
priority and decides to send a huge attachment.  Those packets also flood
into the physical interface behind the video and voice packets but now the
physical interface using pfifo_fast sends the bulk email packets ahead of the
voice and video.  Is this an accurate scenario?

Thus, if one uses traffic shaping on a bonded interface, should one then do
something like use a prio qdisc with a single priority on the physical
interfaces? Thanks - John
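
To make the scenario concrete, here is a small sketch of how pfifo_fast
picks a band from a packet's priority.  The table is the default priomap
as printed by "tc qdisc show" for pfifo_fast; the function names and the
PRIO_MAX/BANDS macros are hypothetical, and the separate step that derives
the priority from the ToS bits is not modelled here.

```c
#include <assert.h>

#define PRIO_MAX 15	/* priorities 0..15 are mapped, as with TC_PRIO_MAX */
#define BANDS    3	/* pfifo_fast has three bands; band 0 is served first */

/* Default pfifo_fast priomap: index is the packet priority, value the band. */
static const unsigned char prio2band[PRIO_MAX + 1] = {
	1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1
};

static_assert(sizeof(prio2band) == PRIO_MAX + 1, "one band per priority");

/* Band that a packet with the given priority is queued in. */
int pfifo_fast_band(unsigned int priority)
{
	return prio2band[priority & PRIO_MAX];
}

/* Nonzero if a queued packet of priority a is dequeued before one of
 * priority b, i.e. if it sits in a lower-numbered band. */
int sent_ahead(unsigned int a, unsigned int b)
{
	return pfifo_fast_band(a) < pfifo_fast_band(b);
}
```

Under this mapping, unmarked traffic (priority 0) lands in band 1 while
priority 6 lands in band 0, so a sender that marks its packets as
interactive would be dequeued first on the slave, whatever order the
qdisc on the bond master chose.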

Re: [net-next 01/14] ethtool: Add helper routines to pass vf to rx_flow_spec

2015-05-28 Thread John Fastabend


On 05/28/2015 06:18 AM, Or Gerlitz wrote:

On Thu, May 28, 2015 at 2:25 PM, Jeff Kirsher
 wrote:

From: John Fastabend 

The ring_cookie is 64 bits wide which is much larger than can be used
for actual queue index values. So provide some helper routines to
pack a VF index into the cookie. This is useful to steer packets to
a VF ring without having to know the queue layout of the device.


So this patch comes  to generalize the proprietary solution introduced
in the below commit?



Well I developed it in a bit different context, but I would expect
the same mechanism to be used for i40e and any other devices that can
support this.


commit e7c8c60bc5d48994a67e4b1c7bfb01d6979dbc54
Author: Anjali Singhai Jain 
Date:   Tue Apr 7 19:45:31 2015 -0400

 i40e: Add support to program FDir SB rules for VF from PF through ethtool

 With this patch we can now add Flow director Sideband rules for a VF from
 it's PF. Here is an example on how it can be done when VF id = 5 and
 queue = 2:

 "ethtool -N ethx flow-type udp4 src-ip x.x.x.x dst-ip y.y.y.y
src-port p1 dst-port p2 action 2 user-def 5"

 User-def specifies VF id and action specifies queue.



  include/uapi/linux/ethtool.h | 25 +
  1 file changed, 25 insertions(+)


[...]


--
John Fastabend Intel Corporation

Fw: [Bug 99081] New: Bridge multicast snooping breaks ICMPv6

2015-05-28 Thread Stephen Hemminger

Begin forwarded message:

Date: Thu, 28 May 2015 10:34:55 +
From: "bugzilla-dae...@bugzilla.kernel.org" 

To: "shemmin...@linux-foundation.org" 
Subject: [Bug 99081] New: Bridge multicast snooping breaks ICMPv6

https://bugzilla.kernel.org/show_bug.cgi?id=99081

            Bug ID: 99081
           Summary: Bridge multicast snooping breaks ICMPv6
           Product: Networking
           Version: 2.5
    Kernel Version: 4.0.4
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Other
          Assignee: shemmin...@linux-foundation.org
          Reporter: sgunder...@bigfoot.com
        Regression: No

Hi,

I've seen this reported many times around the net, with no definite solution;
since I got hit by it again and now used the latest kernel, I thought the best
place would be the kernel Bugzilla.

I have a bridge br0 with eth0 on it, and then some tap devices from KVM guests.
In this situation, by default, I don't have IPv6. I can't ping anything because
ND fails. I've seen this over multiple kernel versions for years, so I'm fairly
certain this is not something about my specific setup.

The common workaround is:

 echo 0 > /sys/devices/virtual/net/br0/bridge/multicast_snooping 

which immediately makes IPv6 work well again.

IPv6 unicast is unaffected; only multicast (i.e., ND) seems to have the problem.

-- 
You are receiving this mail because:
You are the assignee for the bug.

Fw: [Bug 99091] New: Kernel panic while sending network packets over TAP interface

2015-05-28 Thread Stephen Hemminger



Begin forwarded message:

Date: Thu, 28 May 2015 11:44:58 +
From: "bugzilla-dae...@bugzilla.kernel.org" 

To: "shemmin...@linux-foundation.org" 
Subject: [Bug 99091] New: Kernel panic while sending network packets over TAP interface


https://bugzilla.kernel.org/show_bug.cgi?id=99091

            Bug ID: 99091
           Summary: Kernel panic while sending network packets over TAP
                    interface
           Product: Networking
           Version: 2.5
    Kernel Version: 3.11 and higher
          Hardware: x86-64
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Other
          Assignee: shemmin...@linux-foundation.org
          Reporter: r...@open.ch
        Regression: No

We are experiencing kernel panics on a rather specific setup after upgrading to
kernel versions 3.12.40, 3.14.9, 3.16.7, 3.17.7 and 3.18.14. The same
configuration with kernel 3.10.79 runs stable.  Kernel 3.8 proved to be stable
as well.
Unfortunately we are unable to reproduce the bug in a lab environment, but on
one of our production hosts the kernel reliably panics within 24 hours.

In our setup, network traffic takes the following path:
(1) network interface => (2) bridge => (3) VLAN => (4) bridge => (5) TAP
interface => (6) Virtual Machine => (7) bridge => (8) VLAN => (9) bridge =>
(10) GRE interface
The bridges (4) and (7) reply to any ARP request with their MAC address to suck
all traffic into the virtual machine and forward everything coming out of the
virtual machine.

Bisecting points us to commit eda29772 "tun: Support software transmit time
stamping.", but sometimes we did not get a crash dump, so further manual
verification was needed. We managed to prevent 3.18.8 from crashing by removing
commit eda29772 and a few successive fixes (7bf66305, f96eb74c, 4bfb0513). The
crash dump indicates that skb_tstamp_tx() is called from tun_net_xmit(), which
can only happen since the first chunk of eda29772. Several fixes for eda29772
appeared on the stable branches, none of which helps in our case.
We assume the packet in transit during the crash must have been locally
created, as sk_buff->sk must be set to match the call sequence.
We further assume that the crash happens during transmit on a TAP interface
(5), as we see no crashes with traffic over GRE interfaces with TAP interfaces
disabled.
Our setup is designed specifically to cause the calling path "bridge transmit"
- "VLAN transmit" - "bridge transmit" - "GRE or TAP transmit" as reflected by
the crash dump. It appears that this sequence hits a race condition or a
corrupted/uninitialized error queue in skb_queue_tail().

Here is a stack trace from a crashed Linux kernel based on commit 82a54d0e
(linux 3.11-rc1):

general protection fault:  [#1] SMP 
Modules linked in: adm1021 vhost_net vhost macvtap xt_TEE xt_condition(O)
xt_set ip6t_ipv6header ip6t_rt ip6t_eui64 ip6t_frag ip6t_mh ip6t_hbh ip6t_ah
ip6t_REJECT ip6table_mangle ip6table_raw ip6table_filter nf_conntrack_ipv6
nf_defrag_ipv6 ip6_tables ebt_ip6 ip_set_hash_ip ip_set pl2303 e1000e ptp
pps_core i2c_i801 coretemp
CPU: 5 PID: 0 Comm: swapper/5 Tainted: G   O 3.11.0-rc1_1-osix- #1
Hardware name: To be filled by O.E.M. To be filled by O.E.M./To be filled by
O.E.M., BIOS 4.6.4 12/28/2012
task: 88042b99cfe0 ti: 88042b9a2000 task.ti: 88042b9a2000
RIP: 0010:[]  [] skb_queue_tail+0x2e/0x44
RSP: 0018:880440343828  EFLAGS: 00010046
RAX: 0246 RBX: 880411aaa950 RCX: 
RDX: 35322e3535322e35 RSI: 0246 RDI: 880411aaa964
RBP: 880440343840 R08: 8804284879e8 R09: 100a0081
R10:  R11: 8804129d8000 R12: 8804284879c0
R13: 880411aaa964 R14: 000800c1 R15: 100a
FS:  () GS:88044034() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f7900bb1218 CR3: 000424c99000 CR4: 000427e0
Stack:
  880411aaa800 0042 880440343870
 81486210 880411aaa800 8804284879c0 880411aaa800
 880428919800 880440343898 81487d79 880425480180
Call Trace:
  
 [] sock_queue_err_skb+0x9d/0xc8
 [] skb_tstamp_tx+0x80/0x93
 [] tun_net_xmit+0x15a/0x284
 [] dev_hard_start_xmit+0x29e/0x3c8
 [] sch_direct_xmit+0x70/0x185
 [] dev_queue_xmit+0x234/0x429
 [] br_dev_queue_push_xmit+0xa1/0xa6
 [] br_forward_finish+0x22/0x4f
 [] __br_deliver+0x44/0x72
 [] br_deliver+0x56/0x5b
 [] br_dev_xmit+0x15d/0x17d
 [] dev_hard_start_xmit+0x29e/0x3c8
 [] dev_queue_xmit+0x375/0x429
 [] vlan_dev_hard_start_xmit+0x82/0xac
 [] dev_hard_start_xmit+0x29e/0x3c8
 [] dev_queue_xmit+0x375/0x429
 [] br_dev_queue_push_xmit+0xa1/0xa6
 [] br_forward_finish+0x22/0x4f
 [] __br_deliver+0x44/0x72
 [] br_deliver+0x56/0x5b
 [] br_dev_xmit+0x15d/0x17d
 [] dev_hard_start_xmit+0x29e/0x3c8
 [] ? nf_nat_ipv4_out+0x42/0xbf
 [] dev

Re: [PATCH net-next] neigh: Add missing rcu_assign_pointer

2015-05-28 Thread Eric Dumazet

On Thu, 2015-05-28 at 21:50 +0800, Herbert Xu wrote:

> This patch is indeed bogus but accessing an RCU-protected pointer like
> this will trigger sparse warnings.  So better make it an
> RCU_INIT_POINTER.

A = B;  is perfectly fine since both A and B have the same __rcu
attribute.

Sparse has no warning and should not.

root@edumazet-glaptop2:/usr/src/net# grep CONFIG_SPARSE_RCU_POINTER .config
CONFIG_SPARSE_RCU_POINTER=y
root@edumazet-glaptop2:/usr/src/net# make C=2 CF=-D__CHECK_ENDIAN__ net/core/neighbour.o
...
  CHECK   net/core/neighbour.c


Re: Drops in qdisc on ifb interface

2015-05-28 Thread jsulli...@opensourcedevel.com


> On May 25, 2015 at 6:31 PM Eric Dumazet  wrote:
>
>
> On Mon, 2015-05-25 at 16:05 -0400, John A. Sullivan III wrote:
> > Hello, all. On one of our connections we are doing intensive traffic
> > shaping with tc. We are using ifb interfaces for shaping ingress
> > traffic and we also use ifb interfaces for egress so that we can apply
> > the same set of rules to multiple interfaces (e.g., tun and eth
> > interfaces operating on the same physical interface).
> >
> > These are running on very powerful gateways; I have watched them
> > handling 16 Gbps with CPU utilization at a handful of percent. Yet, I
> > am seeing drops on the ifb interfaces when I do a tc -s qdisc show.
> >
> > Why would this be? I would expect if there was some kind of problem that
> > it would manifest as drops on the physical interfaces and not the IFB
> > interface. We have played with queue lengths in both directions. We
> > are using HFSC with SFQ leaves so I would imagine this overrides the
> > very short qlen on the IFB interfaces (32). These are drops and not
> > overlimits.
>
> IFB is single threaded and a serious bottleneck.
>
> Don't use this on egress, this destroys multiqueue capability.
>
> And SFQ is pretty limited (127 packets)
>
> You might try to change your NIC to have a single queue for RX,
> so that you have a single cpu feeding your IFB queue.
>
> (ethtool -L eth0 rx 1)
>
This has been an interesting exercise - thank you for your help along the way,
Eric.  IFB did not seem to bottleneck in our initial testing but there was
really only one flow of traffic during the test at around 1 Gbps.  However, on a
non-test system with many different flows, IFB does seem to be a serious
bottleneck - I assume this is the consequence of being single-threaded.

Single queue did not seem to help.

Am I correct to assume that IFB would be as much of a bottleneck on the ingress
side as it would be on the egress side? If so, is there any way to do high
performance ingress traffic shaping on Linux - a multi-threaded version of IFB
or a different approach? Thanks - John

Re: [PATCH] sctp: fix ASCONF list handling

2015-05-28 Thread Marcelo Ricardo Leitner

On Thu, May 28, 2015 at 10:27:32AM -0300, Marcelo Ricardo Leitner wrote:
> On Thu, May 28, 2015 at 08:17:27AM -0300, Marcelo Ricardo Leitner wrote:
> > On Thu, May 28, 2015 at 06:15:11AM -0400, Neil Horman wrote:
> > > On Wed, May 27, 2015 at 09:52:17PM -0300, mleit...@redhat.com wrote:
> > > > From: Marcelo Ricardo Leitner 
> > > > 
> > > > ->auto_asconf_splist is per namespace and mangled by functions like
> > > > sctp_setsockopt_auto_asconf() which doesn't guarantee any serialization.
> > > > 
> > > > Also, the call to inet_sk_copy_descendant() was backuping
> > > > ->auto_asconf_list through the copy but was not honoring
> > > > ->do_auto_asconf, which could lead to list corruption if it was
> > > > different between both sockets.
> > > > 
> > > > This commit thus fixes the list handling by adding a spinlock to protect
> > > > against multiple writers and converts the list to be protected by RCU
> > > > > too, so that we don't have a lock inversion issue at
> > > > sctp_addr_wq_timeout_handler().
> > > > 
> > > > And as this list now uses RCU, we cannot do such backup and restore
> > > > while copying descendant data anymore as readers may be traversing the
> > > > list meanwhile. We fix this by simply ignoring/not copying those fields,
> > > > placed at the end of struct sctp_sock, so we can just ignore it together
> > > > with struct ipv6_pinfo data. For that we create sctp_copy_descendant()
> > > > so we don't clutter inet_sk_copy_descendant() with SCTP info.
> > > > 
> > > > Issue was found with a test application that kept flipping sysctl
> > > > default_auto_asconf on and off.
> > > > 
> > > > Fixes: 9f7d653b67ae ("sctp: Add Auto-ASCONF support (core).")
> > > > Signed-off-by: Marcelo Ricardo Leitner 
> > > > ---
> > > >  include/net/netns/sctp.h   |  6 +-
> > > >  include/net/sctp/structs.h |  2 ++
> > > >  net/sctp/protocol.c|  6 +-
> > > >  net/sctp/socket.c  | 39 ++-
> > > >  4 files changed, 38 insertions(+), 15 deletions(-)
> > > > 
> > > > diff --git a/include/net/netns/sctp.h b/include/net/netns/sctp.h
> > > > > index 3573a81815ad9e0efb6ceb721eb066d3726419f0..e080bebb3147af39c8275261f57018eb01e917b0 100644
> > > > --- a/include/net/netns/sctp.h
> > > > +++ b/include/net/netns/sctp.h
> > > > @@ -30,12 +30,15 @@ struct netns_sctp {
> > > > struct list_head local_addr_list;
> > > > struct list_head addr_waitq;
> > > > struct timer_list addr_wq_timer;
> > > > -   struct list_head auto_asconf_splist;
> > > > +   struct list_head __rcu auto_asconf_splist;
> > > > You should use the addr_wq_lock here instead of creating a new lock, as
> > > > that's already used to protect most accesses to the list you are
> > > > concerned about.
> > 
> > Ok, that works too.
> > 
> > > > Though truthfully, that shouldn't be necessary.  The list in question
> > > > is only read in one location and only written in one location.  You can
> > > > likely just rcu-ify, as the write side is in process context and
> > > > protected by lock_sock.
> > 
> > It should, it's not protected by lock_sock as this list resides in
> > netns_sctp structure, which lock_sock doesn't cover. Write side is in
> > process context yes, but this list is written in sctp_init_sock(),
> > sctp_destroy_sock() and sctp_setsockopt_auto_asconf(), so one could
> > trigger this by either creating/destroying sockets if
> > default_auto_asconf=1 or just by creating a bunch of sockets and
> > flipping asconf via setsockopt (or a combination of these operations).
> > (I'll point this out in the changelog)
> 
> Hmm.. by reusing addr_wq_lock we don't need to rcu-ify the list, as the
> reader is inside that lock too, so I can just protect auto_asconf_splist
> writers with addr_wq_lock.
> 
> Nice, thanks Neil.

Cannot really do that.. as that creates a lock inversion between
sctp_destroy_sock() (which already holds lock_sock) and
sctp_addr_wq_timeout_handler(), which first grabs addr_wq_lock and then
locks socket by socket.

Due to that, I'm afraid reusing this lock is not possible, and we should
stick with the patch.. what do you think? (though I have to fix the nits
in there)
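
To spell out the inversion, here is a toy order checker, loosely in the
spirit of lockdep.  It takes no real locks; it only records which lock was
taken while another was held and reports when the opposite order has been
seen before.  All names below are hypothetical, and the two path functions
merely mimic the ordering described above for sctp_destroy_sock() and
sctp_addr_wq_timeout_handler().

```c
#include <assert.h>
#include <stdbool.h>

enum lock_id { SOCK_LOCK, ADDR_WQ_LOCK, NR_LOCKS };

/* seen[a][b] is true once some path took b while already holding a. */
static bool seen[NR_LOCKS][NR_LOCKS];

/* Record that 'inner' is taken while 'outer' is held.
 * Returns true if the opposite order was recorded earlier (ABBA). */
bool record_order(enum lock_id outer, enum lock_id inner)
{
	assert(outer != inner);
	seen[outer][inner] = true;
	return seen[inner][outer];
}

/* Destroy path if addr_wq_lock were reused: the socket lock is already
 * held by the caller, then addr_wq_lock would be taken. */
bool destroy_sock_path(void)
{
	return record_order(SOCK_LOCK, ADDR_WQ_LOCK);
}

/* Timer path: addr_wq_lock is taken first, then each socket is locked. */
bool timeout_handler_path(void)
{
	return record_order(ADDR_WQ_LOCK, SOCK_LOCK);
}
```

Whichever of the two paths runs second gets the inversion reported, which
is the same ABBA pattern that could deadlock with real locks on two CPUs.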

  Marcelo


Re: Lost network connectivity in 4.0.x

2015-05-28 Thread Ken Moffat

On Wed, May 27, 2015 at 10:53:00PM -0700, Cong Wang wrote:
> (Please always Cc netdev for networking bugs.)
> 
> On Sat, May 23, 2015 at 8:29 PM, Ken Moffat  wrote:
> > On Sun, May 24, 2015 at 03:43:52AM +0100, Ken Moffat wrote:
> >> Anybody else suffering from lost network connectivity in 4.0.x
> >> kernels ?  A couple of times this week, vim on an nfs-3 mount hung
> >> and I had to reboot.  Both of those occasions were on an AMD desktop
> >> with the r8169 driver, running 4.0.3.  I thought it might be
> >> specific to that machine.  For the last two or three days I've been
> >> using an intel, and about 10 minutes ago it suffered the same problem
> >> while running 4.0.4.  Using ping from another term showed that it
> >> had no connectivity to the server on my local network.
> >>
> >> This is a bit hard to diagnose - nothing in the logs.
> >>
> > I forgot to add that this is with the released gcc-5.1 : I keep
> > forgetting that some people use old compilers ;-)
> >
> 
> Is there any way you can help to narrow down the problem?
> 

Thanks for the reply.  The problem is continuing to show up, but
irregularly and often only after the machine has been booted for a
long time (with s2ram, but I don't think I've used s2ram on every
occasion).

> For example:
> 
> 1) What is your network setup? iptables? routes? etc.
> 
I'm using iptables.  Ah, yes - it started dropping packets around
the time I last had a problem:

May 27 00:48:26 ac4tv dhclient: DHCPREQUEST on eth0 to 192.168.7.254
port 67
May 27 00:48:27 ac4tv dhclient: DHCPACK from 192.168.7.254
May 27 00:48:27 ac4tv dhclient: bound to 192.168.7.152 -- renewal in
1787 seconds.

That address came from my router, and I had been getting the same
address for an hour, but then the dropped packet messages start
appearing - they are for a different address, one that would have
been offered by my server:

May 27 00:53:16 ac4tv kernel: [31922.316798] IPTABLES Packet
Dropped: IN=eth0 OUT= MAC=c8:60:00:97:07:35:bc:ae:c5:57:70:c5:08:00
SRC=192.168.7.11 DST=192.168.7.121 LEN=60 TOS=0x00 PREC=0x00 TTL=64
ID=0 DF PROTO=TCP SPT=2049 DPT=1005 WINDOW=28960 RES=0x00 ACK SYN
URGP=0 
May 27 00:53:17 ac4tv kernel: [31923.316612] IPTABLES Packet
Dropped: IN=eth0 OUT= MAC=c8:60:00:97:07:35:bc:ae:c5:57:70:c5:08:00
SRC=192.168.7.11 DST=192.168.7.121 LEN=60 TOS=0x00 PREC=0x00 TTL=64
ID=0 DF PROTO=TCP SPT=2049 DPT=1005 WINDOW=28960 RES=0x00 ACK SYN
URGP=0 

and those continued until I forced a reboot.

> 2) Can you check the stats to see if there is any error?
>   `ip -s -s li show`, `ethtool -S `
> 

I don't have ethtool installed, and that ip command appears ok at
the moment:

1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    RX: bytes  packets  errors  dropped overrun mcast
    3964       66       0       0       0       0
    RX errors: length   crc     frame   fifo    missed
               0        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    3964       66       0       0       0       0
    TX errors: aborted  fifo    window  heartbeat transns
               0        0       0       0         0
2: eth0:  mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether c8:60:00:97:07:35 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast
    224661061  277642   0       0       0       0
    RX errors: length   crc     frame   fifo    missed
               0        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    278152429  370438   0       0       0       0
    TX errors: aborted  fifo    window  heartbeat transns
               0        0       0       0         6

> 3) Do a bisect?
> 
> Thanks!

That doesn't seem very practical when the machine is ok for a couple
of days at a time.

ĸen
-- 
Nanny Ogg usually went to bed early. After all, she was an old lady.
Sometimes she went to bed as early as 6 a.m.

Re: Drops in qdisc on ifb interface

2015-05-28 Thread Eric Dumazet

On Thu, 2015-05-28 at 10:38 -0400, jsulli...@opensourcedevel.com wrote:

> This has been an interesting exercise - thank you for your help along the way,
> Eric.  IFB did not seem to bottleneck in our initial testing but there was
> really only one flow of traffic during the test at around 1 Gbps.  However,
> on a non-test system with many different flows, IFB does seem to be a serious
> bottleneck - I assume this is the consequence of being single-threaded.
> 
> Single queue did not seem to help.
> 
> Am I correct to assume that IFB would be as much of a bottleneck on the
> ingress side as it would be on the egress side? If so, is there any way to do
> high performance ingress traffic shaping on Linux - a multi-threaded version
> of IFB or a different approach? Thanks - John

IFB still has a long way to go before being efficient.

In the meantime, you could play with the following patch, and
set /sys/class/net/eth0/gro_timeout to 2

This way, the GRO aggregation will work even at 1Gbps, and your IFB will
get big GRO packets instead of single MSS segments.

Both IFB and the IP/TCP stack will have less work to do,
and the receiver will send fewer ACK packets as well.

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index f287186192bb655ba2dc1a205fb251351d593e98..c37f6657c047d3eb9bd72b647572edd53b1881ac 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -151,7 +151,7 @@ static void igb_setup_dca(struct igb_adapter *);
 #endif /* CONFIG_IGB_DCA */
 static int igb_poll(struct napi_struct *, int);
 static bool igb_clean_tx_irq(struct igb_q_vector *);
-static bool igb_clean_rx_irq(struct igb_q_vector *, int);
+static unsigned int igb_clean_rx_irq(struct igb_q_vector *, int);
 static int igb_ioctl(struct net_device *, struct ifreq *, int cmd);
 static void igb_tx_timeout(struct net_device *);
 static void igb_reset_task(struct work_struct *);
@@ -6342,6 +6342,7 @@ static int igb_poll(struct napi_struct *napi, int budget)
 struct igb_q_vector,
 napi);
bool clean_complete = true;
+   unsigned int packets = 0;
 
 #ifdef CONFIG_IGB_DCA
if (q_vector->adapter->flags & IGB_FLAG_DCA_ENABLED)
@@ -6350,15 +6351,17 @@ static int igb_poll(struct napi_struct *napi, int budget)
if (q_vector->tx.ring)
clean_complete = igb_clean_tx_irq(q_vector);
 
-   if (q_vector->rx.ring)
-   clean_complete &= igb_clean_rx_irq(q_vector, budget);
+   if (q_vector->rx.ring) {
+   packets = igb_clean_rx_irq(q_vector, budget);
+   clean_complete &= packets < budget;
+   }
 
/* If all work not completed, return budget and keep polling */
if (!clean_complete)
return budget;
 
/* If not enough Rx work done, exit the polling mode */
-   napi_complete(napi);
+   napi_complete_done(napi, packets);
igb_ring_irq_enable(q_vector);
 
return 0;
@@ -6926,7 +6929,7 @@ static void igb_process_skb_fields(struct igb_ring *rx_ring,
skb->protocol = eth_type_trans(skb, rx_ring->netdev);
 }
 
-static bool igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)
+static unsigned int igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)
 {
struct igb_ring *rx_ring = q_vector->rx.ring;
struct sk_buff *skb = rx_ring->skb;
@@ -7000,7 +7003,7 @@ static bool igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)
if (cleaned_count)
igb_alloc_rx_buffers(rx_ring, cleaned_count);
 
-   return total_packets < budget;
+   return total_packets;
 }
 
 static bool igb_alloc_mapped_page(struct igb_ring *rx_ring,
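
The control flow the patch gives igb_poll() can be modelled in plain user-space C. This is only a sketch under stated assumptions: struct poll_result and model_poll() are hypothetical names, and nothing here touches real NAPI state. The point it illustrates is that the RX cleaner now reports a packet count, the poll routine derives "complete" from count < budget, and the same count is what gets handed to napi_complete_done(). As far as I understand, that count is what lets the core defer the GRO flush when the timeout knob mentioned above is set (I believe the mainline sysfs attribute is named gro_flush_timeout).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of one poll invocation. */
struct poll_result {
	int ret;           /* value returned to the NAPI core */
	bool completed;    /* true if polling stops and the IRQ is re-enabled */
	unsigned int done; /* count that would go to napi_complete_done() */
};

/* Model of the patched poll routine; tx_clean stands for the TX cleaner's result. */
static struct poll_result model_poll(bool tx_clean, unsigned int rx_packets,
				     int budget)
{
	struct poll_result r = { 0, false, 0 };
	bool clean_complete = tx_clean;

	assert(budget > 0);

	/* RX counts as complete only if fewer packets than the budget were seen. */
	clean_complete = clean_complete && (rx_packets < (unsigned int)budget);

	/* Work left over: return the full budget and keep polling. */
	if (!clean_complete) {
		r.ret = budget;
		return r;
	}

	/* Otherwise leave polling mode and report how many packets were handled. */
	r.completed = true;
	r.done = rx_packets;
	return r;
}
```

With a budget of 64, model_poll(true, 10, 64) completes and reports 10 packets, while model_poll(true, 64, 64) returns the budget and stays in polling mode.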



Re: linux-next: build failure after merge of most of the trees

2015-05-28 Thread Eric Dumazet

On Thu, 2015-05-28 at 22:06 +1000, Stephen Rothwell wrote:
> Hi all,
> 
> After merging all the trees, today's linux-next build (powerpc
> allyesconfig) failed like this:
> 
> drivers/vhost/scsi.c: In function 'vhost_scsi_open':
> drivers/vhost/scsi.c:1422:3: error: implicit declaration of function 
> 'vzalloc' [-Werror=implicit-function-declaration]
>vs = vzalloc(sizeof(*vs));
>^
> drivers/vhost/scsi.c:1422:6: warning: assignment makes pointer from integer 
> without a cast
>vs = vzalloc(sizeof(*vs));
>   ^
> drivers/target/target_core_pr.c: In function 
> 'core_scsi3_update_and_write_aptpl':
> drivers/target/target_core_pr.c:1985:2: error: implicit declaration of 
> function 'vzalloc' [-Werror=implicit-function-declaration]
>   buf = vzalloc(len);
>   ^
> drivers/target/target_core_pr.c:1985:6: warning: assignment makes pointer 
> from integer without a cast
>   buf = vzalloc(len);
>   ^
> drivers/target/target_core_pr.c:1991:3: error: implicit declaration of 
> function 'vfree' [-Werror=implicit-function-declaration]
>vfree(buf);
>^
> drivers/target/target_core_transport.c: In function 
> 'transport_alloc_session_tags':
> drivers/target/target_core_transport.c:258:3: error: implicit declaration of 
> function 'vzalloc' [-Werror=implicit-function-declaration]
>se_sess->sess_cmd_map = vzalloc(tag_num * tag_size);
>^
> drivers/target/target_core_transport.c:258:25: warning: assignment makes 
> pointer from integer without a cast
>se_sess->sess_cmd_map = vzalloc(tag_num * tag_size);
>  ^
> drivers/target/target_core_transport.c:270:4: error: implicit declaration of 
> function 'vfree' [-Werror=implicit-function-declaration]
> vfree(se_sess->sess_cmd_map);
> ^
> drivers/target/target_core_transport.c: In function 'transport_kmap_data_sg':
> drivers/target/target_core_transport.c:2317:2: error: implicit declaration of 
> function 'vmap' [-Werror=implicit-function-declaration]
>   cmd->t_data_vmap = vmap(pages, cmd->t_data_nents,  VM_MAP, PAGE_KERNEL);
>   ^
> drivers/target/target_core_transport.c:2317:53: error: 'VM_MAP' undeclared 
> (first use in this function)
>   cmd->t_data_vmap = vmap(pages, cmd->t_data_nents,  VM_MAP, PAGE_KERNEL);
>  ^
> drivers/target/target_core_transport.c: In function 
> 'transport_kunmap_data_sg':
> drivers/target/target_core_transport.c:2335:2: error: implicit declaration of 
> function 'vunmap' [-Werror=implicit-function-declaration]
>   vunmap(cmd->t_data_vmap);
>   ^
> drivers/target/target_core_file.c: In function 'fd_format_prot':
> drivers/target/target_core_file.c:809:2: error: implicit declaration of 
> function 'vzalloc' [-Werror=implicit-function-declaration]
>   buf = vzalloc(unit_size);
>   ^
> drivers/target/target_core_file.c:809:6: warning: assignment makes pointer 
> from integer without a cast
>   buf = vzalloc(unit_size);
>   ^
> drivers/target/target_core_file.c:822:2: error: implicit declaration of 
> function 'vfree' [-Werror=implicit-function-declaration]
>   vfree(buf);
>   ^
> drivers/target/target_core_user.c: In function 'tcmu_configure_device':
> drivers/target/target_core_user.c:895:2: error: implicit declaration of 
> function 'vzalloc' [-Werror=implicit-function-declaration]
>   udev->mb_addr = vzalloc(TCMU_RING_SIZE);
>   ^
> drivers/target/target_core_user.c:895:16: warning: assignment makes pointer 
> from integer without a cast
>   udev->mb_addr = vzalloc(TCMU_RING_SIZE);
> ^
> drivers/target/target_core_user.c:947:2: error: implicit declaration of 
> function 'vfree' [-Werror=implicit-function-declaration]
>   vfree(udev->mb_addr);
>   ^
> drivers/target/iscsi/iscsi_target.c: In function 'iscsi_target_init_module':
> drivers/target/iscsi/iscsi_target.c:557:2: error: implicit declaration of 
> function 'vzalloc' [-Werror=implicit-function-declaration]
>   iscsit_global->ts_bitmap = vzalloc(size);
>   ^
> drivers/target/iscsi/iscsi_target.c:557:27: warning: assignment makes pointer 
> from integer without a cast
>   iscsit_global->ts_bitmap = vzalloc(size);
>^
> drivers/target/iscsi/iscsi_target.c:615:2: error: implicit declaration of 
> function 'vfree' [-Werror=implicit-function-declaration]
>   vfree(iscsit_global->ts_bitmap);
>   ^
> drivers/scsi/qla2xxx/tcm_qla2xxx.c: In function 'tcm_qla2xxx_init_lport':
> drivers/scsi/qla2xxx/tcm_qla2xxx.c:1578:2: error: implicit declaration of 
> function 'vmalloc' [-Werror=implicit-function-declaration]
>   lport->lport_loopid_map = vmalloc(sizeof(struct tcm_qla2xxx_fc_loopid) *
>   ^
> drivers/scsi/qla2xxx/tcm_qla2xxx.c:1578:26: warning: assignment makes pointer 
> from integer without a cast
>   lport->lport_loopid_map = vmalloc(sizeof(struct tcm_qla2xxx_fc_loopid) *
>   ^
> drivers/scsi/qla2xxx/tcm_qla2xxx.c: In function 'tcm_qla2xxx_make_lport':
> drivers/scsi/qla2xxx/tcm_qla2xxx.c:1643:2: error: implic

Re: Drops in qdisc on ifb interface

2015-05-28 Thread jsulli...@opensourcedevel.com


> On May 28, 2015 at 11:14 AM Eric Dumazet  wrote:
>
>
> On Thu, 2015-05-28 at 10:38 -0400, jsulli...@opensourcedevel.com wrote:
>

> IFB still has a long way to go before being efficient.
>
> In the meantime, you could play with the following patch, and
> set /sys/class/net/eth0/gro_timeout to 2
>
> This way, the GRO aggregation will work even at 1Gbps, and your IFB will
> get big GRO packets instead of single MSS segments.
>
> Both IFB and the IP/TCP stack will have less work to do,
> and the receiver will send fewer ACK packets as well.
>
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index f287186192bb655ba2dc1a205fb251351d593e98..c37f6657c047d3eb9bd72b647572edd53b1881ac 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -151,7 +151,7 @@ static void igb_setup_dca(struct igb_adapter *);
> #endif /* CONFIG_IGB_DCA */


Interesting but this is destined to become a critical production system for a
high profile, internationally recognized product so I am hesitant to patch.  I
doubt I can convince my company to do it but is improving IFB the sort of
development effort that could be sponsored and then executed in a moderately
short period of time? Thanks - John

Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-28 Thread Scott Feldman

On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko  wrote:
> Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote:
>>From: Roopa Prabhu 
>>Date: Sun, 17 May 2015 16:42:05 -0700
>>
>>> On most systems where you can offload routes to hardware,
>>> doing routing in software is not an option (the cpu limitations
>>> make routing impossible in software).
>>
>>You absolutely do not get to determine this policy, none of us
>>do.
>>
>>What matters is that by default the damn switch device being there
>>is %100 transparent to the user.
>>
>>And the way to achieve that default is to do software routes as
>>a fallback.
>>
>>I am not going to entertain changes of this nature which fail
>>route loading by default just because we've exceeded a device's
>>HW capacity to offload.
>>
>>I thought I was _really_ clear about this at netdev 0.1
>
> I certainly agree that by default, transparency 1:1 sw:hw mapping is
> what we need for fib. The current code is a good start!
>
> I see couple of issues regarding switchdev_fib_ipv4_abort:
> 1) If user adds an entry, switchdev_fib_ipv4_add fails, abort is
>    executed -> and, error returned. I would expect that route entry should
>    be added in this case. The next attempt of adding the same entry will
>    be successful.
>    The current behaviour breaks the transparency you are referring to.
> 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
>    disabled for good (until reboot). That is certainly not nice, although
>    I understand that is the easiest solution for now.
>
> I believe that we all agree that the 1:1 transparency, although it is a
> default, may not be optimal for real-life usage. HW resources are
> limited and user does not know them. The danger of hitting _abort and
> screwing-up the whole system is huge, unacceptable.
>
> So here, there are couple of more or less simple things that I suggest to
> do in order to move a little bit forward:
> 1) Introduce system-wide option to switch _abort to just plain fail.
>When HW does not have capacity, do not flush and fallback to sw, but
>rather just fail to add the entry. This would not break anything.
>Userspace has to be prepared that entry add could fail.
> 2) Introduce a way to propagate resources to userspace. Driver knows about
>resources used/available/potentially_available. Switchdev infra could
>be extended in order to propagate the info to the user.
> 3) Introduce couple of flags for entry add that would alter the default
>behaviour. Something like:
> NLM_F_SKIP_KERNEL
> NLM_F_SKIP_OFFLOAD
>Again, this does not break the current users. On the other hand, this
>gives new users a leverage to instruct kernel where the entry should
>be added to (or not added to).
>
> Any thoughts? Objections?

I don't like these.  They break transparency and force the user into
the position of having to know hardware failure modes (unique to each
hardware device).  I presented an option d) which avoids these issues;
was it not understood?

[PATCH net-next 2/4] net/mlx4: Add EQ pool

2015-05-28 Thread Or Gerlitz

From: Matan Barak 

Previously, mlx4_en allocated EQs and used them exclusively.
This affected RoCE performance, as event-sensitive applications
were limited to using only the legacy EQs.

Change that by introducing an EQ pool. This pool is managed
by mlx4_core. EQs are assigned to ports (when there is a limited
number of EQs, multiple ports could be assigned to the same EQs).

An exception to this rule is the ASYNC EQ which handles various events.

Legacy EQs are completely removed as all EQs could be shared.

When a consumer (mlx4_ib/mlx4_en) requests an EQ, it asks for
EQ serving on a specific port. The core driver calculates which
EQ should be assigned to that request.

Because IRQs are shared between IB and Ethernet modules, their
names only include the PCI device BDF address.

Signed-off-by: Matan Barak 
Signed-off-by: Ido Shamay 
Signed-off-by: Or Gerlitz 
---
 drivers/infiniband/hw/mlx4/main.c  |   71 ++
 drivers/infiniband/hw/mlx4/mlx4_ib.h   |1 -
 drivers/net/ethernet/mellanox/mlx4/cq.c|   10 +-
 drivers/net/ethernet/mellanox/mlx4/en_cq.c |   48 ++--
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |7 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |   13 +-
 drivers/net/ethernet/mellanox/mlx4/eq.c|  353 ++--
 drivers/net/ethernet/mellanox/mlx4/main.c  |   74 --
 drivers/net/ethernet/mellanox/mlx4/mlx4.h  |   11 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |2 +-
 include/linux/mlx4/device.h|   11 +-
 11 files changed, 342 insertions(+), 259 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 8c96c71..024b0f7 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -2041,77 +2041,52 @@ static void init_pkeys(struct mlx4_ib_dev *ibdev)
 
 static void mlx4_ib_alloc_eqs(struct mlx4_dev *dev, struct mlx4_ib_dev *ibdev)
 {
-   char name[80];
-   int eq_per_port = 0;
-   int added_eqs = 0;
-   int total_eqs = 0;
-   int i, j, eq;
-
-   /* Legacy mode or comp_pool is not large enough */
-   if (dev->caps.comp_pool == 0 ||
-   dev->caps.num_ports > dev->caps.comp_pool)
-   return;
-
-   eq_per_port = dev->caps.comp_pool / dev->caps.num_ports;
-
-   /* Init eq table */
-   added_eqs = 0;
-   mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB)
-   added_eqs += eq_per_port;
-
-   total_eqs = dev->caps.num_comp_vectors + added_eqs;
+   int i, j, eq = 0, total_eqs = 0;
 
-   ibdev->eq_table = kzalloc(total_eqs * sizeof(int), GFP_KERNEL);
+   ibdev->eq_table = kcalloc(dev->caps.num_comp_vectors,
+ sizeof(ibdev->eq_table[0]), GFP_KERNEL);
if (!ibdev->eq_table)
return;
 
-   ibdev->eq_added = added_eqs;
-
-   eq = 0;
-   mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB) {
-   for (j = 0; j < eq_per_port; j++) {
-   snprintf(name, sizeof(name), "mlx4-ib-%d-%d@%s",
-i, j, dev->persist->pdev->bus->name);
-   /* Set IRQ for specific name (per ring) */
-   if (mlx4_assign_eq(dev, name, NULL,
-  &ibdev->eq_table[eq])) {
-   /* Use legacy (same as mlx4_en driver) */
-   pr_warn("Can't allocate EQ %d; reverting to legacy\n", eq);
-   ibdev->eq_table[eq] =
-   (eq % dev->caps.num_comp_vectors);
-   }
-   eq++;
+   for (i = 1; i <= dev->caps.num_ports; i++) {
+   for (j = 0; j < mlx4_get_eqs_per_port(dev, i);
+j++, total_eqs++) {
+   if (i > 1 &&  mlx4_is_eq_shared(dev, total_eqs))
+   continue;
+   ibdev->eq_table[eq] = total_eqs;
+   if (!mlx4_assign_eq(dev, i,
+   &ibdev->eq_table[eq]))
+   eq++;
+   else
+   ibdev->eq_table[eq] = -1;
}
}
 
-   /* Fill the reset of the vector with legacy EQ */
-   for (i = 0, eq = added_eqs; i < dev->caps.num_comp_vectors; i++)
-   ibdev->eq_table[eq++] = i;
+   for (i = eq; i < dev->caps.num_comp_vectors;
+ibdev->eq_table[i++] = -1)
+   ;
 
/* Advertise the new number of EQs to clients */
-   ibdev->ib_dev.num_comp_vectors = total_eqs;
+   ibdev->ib_dev.num_comp_vectors = eq;
 }
 
 static void mlx4_ib_free_eqs(struct mlx4_dev *dev, struct mlx4_ib_dev *ibdev)
 {
int i;
+   int total_eqs = ibdev->ib_dev.num_comp_vectors;
 
-   /* no additional eqs were added */
+   /* no eqs were allocated */

[PATCH net-next 4/4] net/mlx4_core: Make sure there are no pending async events when freeing CQ

2015-05-28 Thread Or Gerlitz

From: Matan Barak 

When freeing a CQ, we need to make sure there are no
asynchronous events (on the ASYNC EQ) that could
relate to this CQ before freeing it.

This is done by introducing synchronize_irq.

Signed-off-by: Matan Barak 
Signed-off-by: Ido Shamay 
Signed-off-by: Or Gerlitz 
---
 drivers/net/ethernet/mellanox/mlx4/cq.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
index 7431cd4..1fc1dc5 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
@@ -369,6 +369,10 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq)
mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n", err, cq->cqn);
 

synchronize_irq(priv->eq_table.eq[MLX4_CQ_TO_EQ_VECTOR(cq->vector)].irq);
+   if (priv->eq_table.eq[MLX4_CQ_TO_EQ_VECTOR(cq->vector)].irq !=
+   priv->eq_table.eq[MLX4_EQ_ASYNC].irq)
+   synchronize_irq(priv->eq_table.eq[MLX4_EQ_ASYNC].irq);
+
 
spin_lock_irq(&cq_table->lock);
radix_tree_delete(&cq_table->tree, cq->cqn);
-- 
1.7.1


[PATCH net-next 1/4] net/mlx4_core: Demote simple multicast and broadcast flow steering rules

2015-05-28 Thread Or Gerlitz

From: Matan Barak 

In SRIOV, when simple (i.e., Ethernet L2 only) flow steering rules are
created, always create them at MLX4_DOMAIN_NIC priority (instead of
the real priority the function created them at). This is done in order
to let multiple functions add broadcast/multicast rules without
affecting other functions, which is necessary for DPDK in SRIOV.

Signed-off-by: Matan Barak 
Signed-off-by: Or Gerlitz 
---
 drivers/infiniband/hw/mlx4/main.c  |4 +-
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  |   23 
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index cc64400..8c96c71 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -1090,7 +1090,7 @@ static int __mlx4_ib_create_flow(struct ib_qp *qp, struct ib_flow_attr *flow_att
 
ret = mlx4_cmd_imm(mdev->dev, mailbox->dma, reg_id, size >> 2, 0,
   MLX4_QP_FLOW_STEERING_ATTACH, MLX4_CMD_TIME_CLASS_A,
-  MLX4_CMD_NATIVE);
+  MLX4_CMD_WRAPPED);
if (ret == -ENOMEM)
pr_err("mcg table is full. Fail to register network rule.\n");
else if (ret == -ENXIO)
@@ -1107,7 +1107,7 @@ static int __mlx4_ib_destroy_flow(struct mlx4_dev *dev, u64 reg_id)
int err;
err = mlx4_cmd(dev, reg_id, 0, 0,
   MLX4_QP_FLOW_STEERING_DETACH, MLX4_CMD_TIME_CLASS_A,
-  MLX4_CMD_NATIVE);
+  MLX4_CMD_WRAPPED);
if (err)
pr_err("Fail to detach network rule. registration id = 0x%llx\n",
   reg_id);
diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
index 15ec081..ab48386 100644
--- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
+++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
@@ -3973,6 +3973,22 @@ static int validate_eth_header_mac(int slave, struct _rule_hw *eth_header,
return 0;
 }
 
+static void handle_eth_header_mcast_prio(struct mlx4_net_trans_rule_hw_ctrl *ctrl,
+struct _rule_hw *eth_header)
+{
+   if (is_multicast_ether_addr(eth_header->eth.dst_mac) ||
+   is_broadcast_ether_addr(eth_header->eth.dst_mac)) {
+   struct mlx4_net_trans_rule_hw_eth *eth =
+   (struct mlx4_net_trans_rule_hw_eth *)eth_header;
+   struct _rule_hw *next_rule = (struct _rule_hw *)(eth + 1);
+   bool last_rule = next_rule->size == 0 && next_rule->id == 0 &&
+   next_rule->rsvd == 0;
+
+   if (last_rule)
+   ctrl->prio = cpu_to_be16(MLX4_DOMAIN_NIC);
+   }
+}
+
 /*
  * In case of missing eth header, append eth header with a MAC address
  * assigned to the VF.
@@ -4125,6 +4141,12 @@ int mlx4_QP_FLOW_STEERING_ATTACH_wrapper(struct mlx4_dev *dev, int slave,
rule_header = (struct _rule_hw *)(ctrl + 1);
header_id = map_hw_to_sw_id(be16_to_cpu(rule_header->id));
 
+   if (header_id == MLX4_NET_TRANS_RULE_ID_ETH)
+   handle_eth_header_mcast_prio(ctrl, rule_header);
+
+   if (slave == dev->caps.function)
+   goto execute;
+
switch (header_id) {
case MLX4_NET_TRANS_RULE_ID_ETH:
if (validate_eth_header_mac(slave, rule_header, rlist)) {
@@ -4151,6 +4173,7 @@ int mlx4_QP_FLOW_STEERING_ATTACH_wrapper(struct mlx4_dev *dev, int slave,
goto err_put;
}
 
+execute:
err = mlx4_cmd_imm(dev, inbox->dma, &vhcr->out_param,
   vhcr->in_modifier, 0,
   MLX4_QP_FLOW_STEERING_ATTACH, MLX4_CMD_TIME_CLASS_A,
-- 
1.7.1
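
The address test that decides whether a rule is demoted can be sketched in user space. The two MAC predicates below follow the standard Ethernet definitions (a group address has the low bit of the first octet set; broadcast is all ones). rule_gets_demoted() and its eth_is_last_header flag are hypothetical stand-ins for the patch's check that the Ethernet header is the only header in the rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_ALEN 6

/* Group (multicast) addresses have the least significant bit of octet 0 set. */
static bool mac_is_multicast(const uint8_t *addr)
{
	assert(addr != NULL);
	return (addr[0] & 0x01) != 0;
}

/* Broadcast is the all-ones address ff:ff:ff:ff:ff:ff. */
static bool mac_is_broadcast(const uint8_t *addr)
{
	assert(addr != NULL);
	for (size_t i = 0; i < ETH_ALEN; i++)
		if (addr[i] != 0xff)
			return false;
	return true;
}

/*
 * Hypothetical summary of the demotion decision: only "simple" rules,
 * where the Ethernet header is the last one, are moved to NIC priority.
 */
static bool rule_gets_demoted(const uint8_t *dst_mac, bool eth_is_last_header)
{
	return eth_is_last_header &&
	       (mac_is_multicast(dst_mac) || mac_is_broadcast(dst_mac));
}
```

A rule matching 01:00:5e:00:00:01 with no further headers would be demoted. The same destination followed by an L3 header, or a unicast destination, would keep the priority the function asked for.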


[PATCH net-next 0/4] mlx4 driver update, May 28, 2015

2015-05-28 Thread Or Gerlitz

Hi Dave,

The 1st patch fixes an issue with a function running DPDK overriding
broadcast steering rules set by other functions. Please add
this one to your -stable queue.

The rest of the series from Matan and Ido deals with scaling the number
of IRQs that serve RoCE applications to be on par with the Ethernet driver.

Or.

Ido Shamay (1):
  net/mlx4_core: Move affinity hints to mlx4_core ownership

Matan Barak (3):
  net/mlx4_core: Demote simple multicast and broadcast flow steering rules
  net/mlx4: Add EQ pool
  net/mlx4_core: Make sure there are no pending async events when freeing CQ

 drivers/infiniband/hw/mlx4/main.c  |   75 ++---
 drivers/infiniband/hw/mlx4/mlx4_ib.h   |1 -
 drivers/net/ethernet/mellanox/mlx4/cq.c|   14 +-
 drivers/net/ethernet/mellanox/mlx4/en_cq.c |   56 ++--
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |7 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |   13 +-
 drivers/net/ethernet/mellanox/mlx4/eq.c|  374 
 drivers/net/ethernet/mellanox/mlx4/main.c  |  110 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4.h  |   12 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |2 +-
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  |   23 ++
 include/linux/mlx4/device.h|   11 +-
 12 files changed, 429 insertions(+), 269 deletions(-)

Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-28 Thread John Fastabend

On 05/28/2015 02:42 AM, Jiri Pirko wrote:
> Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote:
>> From: Roopa Prabhu 
>> Date: Sun, 17 May 2015 16:42:05 -0700
>>
>>> On most systems where you can offload routes to hardware,
>>> doing routing in software is not an option (the cpu limitations
>>> make routing impossible in software).
>>
>> You absolutely do not get to determine this policy, none of us
>> do.
>>
>> What matters is that by default the damn switch device being there
>> is %100 transparent to the user.
>>
>> And the way to achieve that default is to do software routes as
>> a fallback.
>>
>> I am not going to entertain changes of this nature which fail
>> route loading by default just because we've exceeded a device's
>> HW capacity to offload.
>>
>> I thought I was _really_ clear about this at netdev 0.1
>
> I certainly agree that by default, transparency 1:1 sw:hw mapping is
> what we need for fib. The current code is a good start!
>
> I see couple of issues regarding switchdev_fib_ipv4_abort:
> 1) If user adds an entry, switchdev_fib_ipv4_add fails, abort is
>    executed -> and, error returned. I would expect that route entry should
>    be added in this case. The next attempt of adding the same entry will
>    be successful.
>    The current behaviour breaks the transparency you are referring to.
> 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
>    disabled for good (until reboot). That is certainly not nice, although
>    I understand that is the easiest solution for now.
>
> I believe that we all agree that the 1:1 transparency, although it is a
> default, may not be optimal for real-life usage. HW resources are
> limited and user does not know them. The danger of hitting _abort and
> screwing-up the whole system is huge, unacceptable.
>
> So here, there are couple of more or less simple things that I suggest to
> do in order to move a little bit forward:
> 1) Introduce system-wide option to switch _abort to just plain fail.
>    When HW does not have capacity, do not flush and fallback to sw, but
>    rather just fail to add the entry. This would not break anything.
>    Userspace has to be prepared that entry add could fail.
> 2) Introduce a way to propagate resources to userspace. Driver knows about
>    resources used/available/potentially_available. Switchdev infra could
>    be extended in order to propagate the info to the user.

I currently use the FlowAPI work I presented at netdev conference for
this. Perhaps I was a bit reaching by trying to also push it as a
replacement for the ethtool flow classification mechanism all in one
shot. For what it is worth replacing 'ethtool' flow classifier when
I have a pipeline of tables in a NIC is really my first use case for
the 'set' operations but that is off-topic probably.

The benefits I see of using this interface (or if you want rename
it and push it into a different netlink type) is it gives you the entire
view of the switch resources and pipeline from a single interface. Also
because you are talking about system-wide behaviour above it nicely
rolls up into user space software where we can act on it with the
flags we have for l2 already and if we pursue your option (3) also l3.
I like the single interface vs. scattering the information across many
different interfaces this way we can do it once and be done with it.
If you scatter it across all the interfaces just l2,l3 for now but
we will get more then each interface will have its own mechanism and
I have no idea where you put global information such as table ordering.

IMO we are going to need at least the base operations I outlined when
we want to work on many different pipelines possibly with different
ordering of tables, different amounts of resource sharing (l2 vs l3 vs
acls vs...), different levels of support (mac/vlan or just mac). And I
don't think it fits into an existing netlink structure because it's not
specific to any one thing but the model of the hardware.

Also I believe that match/action tables are a really nice way to work
with hardware so this aligns with that. That said I think the interface
would need some tweaks to fit into the current code base. The biggest
one I would want is to make l2/l3 tables 'well-defined' e.g. give them
a #define value so we can always track them down easily, drop the set
operation (at least for now because the tables we have already have
defined interfaces l2/l3  I'll reopen this in the context of extending
flow classification on the NIC), and clean up the action bits so they
are well defined. I've pushed an update to my code on github to restrict
the hardware from exporting arbitrary actions which should be a
reasonable first step.

What do you think? I would like to try to make the above updates and
resubmit if we can get an agreement that "knowing" the hardware
resources and capabilities is useful. It is at least useful for my
software stacks/use cases.

3) Introduce couple of flags for entry add that would alter the default
behaviour. Something like:
NLM_F_SKIP_KERNEL

[PATCH net-next 3/4] net/mlx4_core: Move affinity hints to mlx4_core ownership

2015-05-28 Thread Or Gerlitz

From: Ido Shamay 

Now that EQ management is the sole responsibility of mlx4_core,
the IRQ affinity hint configuration should be in its hands as well.
request_irq is called only once, by the first consumer (possibly mlx4_ib),
so mlx4_en passes the affinity mask too late. We also need to request
vectors according to the cores we want to run on.

mlx4_core's distribution of IRQs to cores is straightforward:
EQ(i)->IRQ will set its affinity hint to core i.
Consumers need to request EQ vectors according to their core
considerations (NUMA).

Signed-off-by: Ido Shamay 
Signed-off-by: Matan Barak 
Signed-off-by: Or Gerlitz 
---
 drivers/net/ethernet/mellanox/mlx4/en_cq.c |   10 +---
 drivers/net/ethernet/mellanox/mlx4/eq.c    |   21 
 drivers/net/ethernet/mellanox/mlx4/main.c  |   36 
 drivers/net/ethernet/mellanox/mlx4/mlx4.h  |    1 +
 4 files changed, 59 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
index d71c567..63769df 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -114,7 +114,7 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
if (cq->is_tx == RX) {
if (!mlx4_is_eq_vector_valid(mdev->dev, priv->port,
 cq->vector)) {
-   cq->vector = cq_idx;
+   cq->vector = cpumask_first(priv->rx_ring[cq->ring]->affinity_mask);
 
err = mlx4_assign_eq(mdev->dev, priv->port,
 &cq->vector);
@@ -160,13 +160,6 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
netif_napi_add(cq->dev, &cq->napi, mlx4_en_poll_tx_cq,
   NAPI_POLL_WEIGHT);
} else {
-   struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
-
-   err = irq_set_affinity_hint(cq->mcq.irq,
-   ring->affinity_mask);
-   if (err)
-   mlx4_warn(mdev, "Failed setting affinity hint\n");
-
netif_napi_add(cq->dev, &cq->napi, mlx4_en_poll_rx_cq, 64);
napi_hash_add(&cq->napi);
}
@@ -205,7 +198,6 @@ void mlx4_en_deactivate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq)
if (!cq->is_tx) {
napi_hash_del(&cq->napi);
synchronize_rcu();
-   irq_set_affinity_hint(cq->mcq.irq, NULL);
}
netif_napi_del(&cq->napi);
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c b/drivers/net/ethernet/mellanox/mlx4/eq.c
index 2e6fc6a..1116882 100644
--- a/drivers/net/ethernet/mellanox/mlx4/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
@@ -221,6 +221,20 @@ static void mlx4_slave_event(struct mlx4_dev *dev, int slave,
slave_event(dev, slave, eqe);
 }
 
+static void mlx4_set_eq_affinity_hint(struct mlx4_priv *priv, int vec)
+{
+   int hint_err;
+   struct mlx4_dev *dev = &priv->dev;
+   struct mlx4_eq *eq = &priv->eq_table.eq[vec];
+
+   if (!eq->affinity_mask || cpumask_empty(eq->affinity_mask))
+   return;
+
+   hint_err = irq_set_affinity_hint(eq->irq, eq->affinity_mask);
+   if (hint_err)
+   mlx4_warn(dev, "irq_set_affinity_hint failed, err %d\n", hint_err);
+}
+
 int mlx4_gen_pkey_eqe(struct mlx4_dev *dev, int slave, u8 port)
 {
struct mlx4_eqe eqe;
@@ -1092,6 +1106,10 @@ static void mlx4_free_irqs(struct mlx4_dev *dev)
 
for (i = 0; i < dev->caps.num_comp_vectors + 1; ++i)
if (eq_table->eq[i].have_irq) {
+   free_cpumask_var(eq_table->eq[i].affinity_mask);
+#if defined(CONFIG_SMP)
+   irq_set_affinity_hint(eq_table->eq[i].irq, NULL);
+#endif
free_irq(eq_table->eq[i].irq, eq_table->eq + i);
eq_table->eq[i].have_irq = 0;
}
@@ -1483,6 +1501,9 @@ int mlx4_assign_eq(struct mlx4_dev *dev, u8 port, int *vector)
clear_bit(*prequested_vector, priv->msix_ctl.pool_bm);
*prequested_vector = -1;
} else {
+#if defined(CONFIG_SMP)
+   mlx4_set_eq_affinity_hint(priv, *prequested_vector);
+#endif
eq_set_ci(&priv->eq_table.eq[*prequested_vector], 1);
priv->eq_table.eq[*prequested_vector].have_irq = 1;
}
diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 3ec5113..0dbd704 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2481,6 +2481,36 @@ err_uar_table_free:
return err;
 }
 
+static int mlx4_init_affinity_hint(struct mlx4_dev *dev, int port, int eqn)
+{
+   int requested_cp

Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters

2015-05-28 Thread Jason Gunthorpe

On Thu, May 28, 2015 at 02:51:51PM +0300, Haggai Eran wrote:

> > But RDMA CM doesn't provide the QPN. So when RDMA CM searches the
> > netdevs for an address it cannot *uniquely* map to a IPoIB interface.

> This is technically true, but if someone configures their system that
> way, they will also have ARP conflicts in addition. I don't see why we
> should support such a configuration.

Based on the discussion in the other thread about this, the feeling is
we should not support same-GUID, same-PKey child interfaces at all. It
breaks too much stuff (DHCP, NetworkManager, IPv6, RDMA-CM).

> No, this is exactly what would happen in the Ethernet world. If you
> create a conflicting configuration between two containers on the same
> Ethernet segment, then one of them could get the traffic that was
> intended for the other.

It is not exactly the same. In Ethernet there is an ARP collision at
L3, but the traffic is properly addressed at L2 and unambiguously
directed into only one container. There are ways to deal with ARP
collisions, but those are only effective if the full LLADDR is
consistently used for routing to containers.

With RDMA-CM it is an L3 collision, so our anti-ARP collision stuff
doesn't help.

Like I said, I don't know, from a security perspective, what to make of
this, but it isn't exactly the same as Ethernet.

> >  1) Locate the netdev associated with the ingress of the packet,
> > in a sane world this is done by only checking the
> > unique (Device,Port,Pkey,QPN) tuple.
> > If we keep our brokenness, we'd do this based on
> > (Device,Port,Pkey,IP) - if there are IP collisions then randomly
> > select a netdev (similar to how ARP collision is handled).
> That's what ib_get_net_dev_by_port_pkey_ip intends to do.

Right, almost there.

ib_get_net_dev_by_port_pkey_ip needs to work in a very specific
way: If there is only one netdev with the (Device,Port,GUID,Pkey)
match then that is the answer.
(guid comes from the CM_REQ, if we add alias GUID support to IPoIB as
 Or suggested then it is needed)

The IP search should *only* be done if there are two children with
identical (Pkey,GUID), and as above, perhaps we should de-support that.
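
A minimal userspace sketch of that selection rule, not kernel code. The
struct demo_netdev type, the demo_pick_netdev() name and the single IPv4
address per netdev are all hypothetical simplifications for illustration.
The point is only the ordering: the (Device,Port,GUID,Pkey) tuple decides
whenever it is unique, and the IP is consulted only when the tuple is
ambiguous.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an IPoIB netdev; not a kernel type. */
struct demo_netdev {
	int      device;
	int      port;
	uint64_t guid;
	uint16_t pkey;
	uint32_t ipv4;	/* assumption: one address per netdev, for brevity */
};

/*
 * Return the netdev that should receive the request, or NULL if none does.
 * The destination IP is looked at only when more than one netdev shares
 * the same (device, port, guid, pkey) tuple.
 */
const struct demo_netdev *
demo_pick_netdev(const struct demo_netdev *devs, size_t n,
		 int device, int port, uint64_t guid, uint16_t pkey,
		 uint32_t dst_ip)
{
	const struct demo_netdev *first = NULL;
	const struct demo_netdev *ip_match = NULL;
	size_t matches = 0;

	assert(devs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		const struct demo_netdev *d = &devs[i];

		if (d->device != device || d->port != port ||
		    d->guid != guid || d->pkey != pkey)
			continue;
		if (!first)
			first = d;
		/* On an IP collision the first match wins (arbitrary choice). */
		if (!ip_match && d->ipv4 == dst_ip)
			ip_match = d;
		matches++;
	}

	if (matches <= 1)
		return first;	/* unique tuple match, or NULL if no match */
	return ip_match;	/* ambiguous tuple: fall back to the IP search */
}
```

If same-GUID, same-PKey children are de-supported, the ip_match branch
goes away and the function reduces to the tuple compare.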

> >  2) Then we do the ip_route_input_noref step, this will set skb_dst to
> > the netdev that will handle the packet, or tell us to drop it.
> > This is not always the same as the netdev that accepts the
> > packet!!!
> > 
> > NOTE: This route step is missing today, it does critical things
> > like check that the node is actually listening on the dest IP!
> 
> Isn't this a little over-engineered? If all you want is to make sure the
> net dev is up, can't we use something like netif_running()?

The routing check is not to see if the netdev is up, it is doing all
sorts of subtle userspace visible things. Like checking there is no
blackhole route configured for the packet, checking that the IP is
present in the system, netdevs are up, etc etc.

We don't get to pick and choose what netdev behaviors we implement
when doing this kind of stuff. Copy the netstack, don't make stuff up.

Understand the two layer separation: first we pick a netdev without
looking at L3 info, then we feed the netdev and L3 info into routing
to complete the process.

> Also, this sounds like a major change in behavior even for applications
> that do not use containers. I think today RDMA CM will accept
> connections even if the ipoib interface is down.

Yes, it is a change in behavior, things move closer to alignment with
how netdev works. We did a similar change to the output side years ago
as well. I guess the input side was missed.

Unless I'm missing something this isn't 'major', these are corner case
conditions/bugs that nobody sane should rely on.

> >  3) Now we can use skb_dst to iterate over the set of all RDMA CM listens:
> >  1) Bound to the skb_dst netdev
> >  2) Unbound in the same namespace as skb_dst netdev
> > The first to match the dst IP + port is the listen that will accept the
> > connection, now we go into the cma_new_conn_id path, and we don't
> > need rdma_translate_ip because we already have the handling netdev.
> You shouldn't be able to bind one listener to a netdev in a namespace
> and also have a different listener listening for any netdev on that same
> namespace. (That is what cma_check_port verifies, right?) So when
> looking for a listener in a namespace there should be only one match.

You know, I don't remember offhand the exact semantics of sockets.
Whatever sockets does :)
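
For reference, the listener walk in step 3 quoted above could be sketched
roughly as below. This is a userspace illustration only: struct
demo_listener, demo_find_listener() and the "0 means wildcard address"
convention are assumptions made for the example, not the rdma_cm
structures, and the real tie-break rules should follow whatever sockets
does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified listener record; not an rdma_cm structure. */
struct demo_listener {
	int      netns;		/* namespace id the listener lives in */
	int      bound_ifindex;	/* netdev it is bound to, 0 if unbound */
	uint32_t ipv4;		/* listen address, 0 = wildcard (assumption) */
	uint16_t port;
};

/*
 * Walk the listeners in order and return the first one that is eligible
 * for the handling netdev (bound to it, or unbound in its namespace) and
 * matches the destination IP and port. Returns NULL if nothing matches.
 */
const struct demo_listener *
demo_find_listener(const struct demo_listener *ls, size_t n,
		   int dst_ifindex, int dst_netns,
		   uint32_t dst_ip, uint16_t dst_port)
{
	assert(ls != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		const struct demo_listener *l = &ls[i];

		/* Listeners in other namespaces are never candidates. */
		if (l->netns != dst_netns)
			continue;
		/* A bound listener only serves its own netdev. */
		if (l->bound_ifindex != 0 && l->bound_ifindex != dst_ifindex)
			continue;
		if (l->port != dst_port)
			continue;
		if (l->ipv4 != 0 && l->ipv4 != dst_ip)
			continue;
		return l;
	}
	return NULL;
}
```

Note that the netdev passed in here is the one routing chose (skb_dst),
which is why no rdma_translate_ip style lookup appears in the sketch.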

> > The backwards operation of the current code is part of why this is all
> > so strange looking, and I think is strongly connected to the private
> > data compare issues Sean is talking about. It is very much the wrong
> > flow to look for the RDMA CM listen first, and then try to work
> > backwards to the netdev.

> That's not what the code does. It first finds the netdev and decides on
> the namespace based on that. It then finds the R

Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-05-28 Thread Jason Gunthorpe

On Thu, May 28, 2015 at 04:22:36PM +0300, Haggai Eran wrote:
> wouldn't care if they share the "QP number namespace", etc. RDMA CM
> ports are different because they are chosen by the applications, but
> they map directly to the network namespace, so they don't require their
> own namespace.

Different containers should have restricted access to the PKey and GID
tables, and to the presence of the device itself, just like in the SRIOV
case.

That is what the 'RDMA Namespace' would control.

Jason

Re: Drops in qdisc on ifb interface

2015-05-28 Thread John Fastabend


On 05/28/2015 08:30 AM, jsulli...@opensourcedevel.com wrote:



On May 28, 2015 at 11:14 AM Eric Dumazet  wrote:


On Thu, 2015-05-28 at 10:38 -0400, jsulli...@opensourcedevel.com wrote:




IFB still has a long way to go before being efficient.

In the meantime, you could play with the following patch, and
set /sys/class/net/eth0/gro_timeout to 2

This way, the GRO aggregation will work even at 1Gbps, and your IFB will
get big GRO packets instead of single MSS segments.

Both IFB and the IP/TCP stack will have less work to do,
and the receiver will send fewer ACK packets as well.

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index f287186192bb655ba2dc1a205fb251351d593e98..c37f6657c047d3eb9bd72b647572edd53b1881ac 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -151,7 +151,7 @@ static void igb_setup_dca(struct igb_adapter *);
#endif /* CONFIG_IGB_DCA */



Interesting but this is destined to become a critical production system for a
high profile, internationally recognized product so I am hesitant to patch.  I
doubt I can convince my company to do it but is improving IFB the sort of
development effort that could be sponsored and then executed in a moderately
short period of time? Thanks - John
--


If you're experimenting, one thing you could do is create many
ifb devices and load balance across them from tc. I'm not
sure if this would be practical in your setup or not, but it might
be worth trying.

One thing I've been debating adding is the ability to match
on the current cpu_id in tc, which would allow you to load balance by
cpu. I could send you a patch if you wanted to test it. I would
expect this to help somewhat with the 'single queue' issue, but sorry,
I haven't had time yet to test it out myself.

.John

--
John Fastabend Intel Corporation
