[net PATCH] i40e/i40evf: Limit TSO to 7 descriptors for payload instead of 8 per packet
This patch addresses a bug introduced based on my interpretation of the
XL710 datasheet.  Specifically section 8.4.1 states that "A single transmit
packet may span up to 8 buffers (up to 8 data descriptors per packet
including both the header and payload buffers)."  It then later goes on to
say that each segment for a TSO obeys the previous rule, however it then
refers to TSO header and the segment payload buffers.

I believe the actual limit for fragments with TSO and a skbuff that has
payload data in the header portion of the buffer is actually only 7
fragments as the skb->data portion counts as 2 buffers, one for the TSO
header, and one for a segment payload buffer.

Fixes: 2d37490b82af ("i40e/i40evf: Rewrite logic for 8 descriptor per packet check")
Signed-off-by: Alexander Duyck
---
This patch has been sanity checked only.  I cannot yet guarantee it
resolves the original issue that was reported.  I'll try to get a
reproduction environment setup tomorrow but I don't know how long that
should take.

 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 40 ++---
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 40 ++---
 2 files changed, 44 insertions(+), 36 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 5d5fa5359a1d..97437f04d99d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2597,12 +2597,17 @@ int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
 }

 /**
- * __i40e_chk_linearize - Check if there are more than 8 fragments per packet
+ * __i40e_chk_linearize - Check if there are more than 8 buffers per packet
  * @skb: send buffer
  *
- * Note: Our HW can't scatter-gather more than 8 fragments to build
- * a packet on the wire and so we need to figure out the cases where we
- * need to linearize the skb.
+ * Note: Our HW can't DMA more than 8 buffers to build a packet on the wire
+ * and so we need to figure out the cases where we need to linearize the skb.
+ *
+ * For TSO we need to count the TSO header and segment payload separately.
+ * As such we need to check cases where we have 7 fragments or more as we
+ * can potentially require 9 DMA transactions, 1 for the TSO header, 1 for
+ * the segment payload in the first descriptor, and another 7 for the
+ * fragments.
 **/
 bool __i40e_chk_linearize(struct sk_buff *skb)
 {
@@ -2614,18 +2619,17 @@ bool __i40e_chk_linearize(struct sk_buff *skb)
 	if (unlikely(!gso_size))
 		return true;

-	/* no need to check if number of frags is less than 8 */
+	/* no need to check if number of frags is less than 7 */
 	nr_frags = skb_shinfo(skb)->nr_frags;
-	if (nr_frags < I40E_MAX_BUFFER_TXD)
+	if (nr_frags < (I40E_MAX_BUFFER_TXD - 1))
 		return false;

 	/* We need to walk through the list and validate that each group
 	 * of 6 fragments totals at least gso_size.  However we don't need
-	 * to perform such validation on the first or last 6 since the first
-	 * 6 cannot inherit any data from a descriptor before them, and the
-	 * last 6 cannot inherit any data from a descriptor after them.
+	 * to perform such validation on the last 6 since the last 6 cannot
+	 * inherit any data from a descriptor after them.
 	 */
-	nr_frags -= I40E_MAX_BUFFER_TXD - 1;
+	nr_frags -= I40E_MAX_BUFFER_TXD - 2;
 	frag = &skb_shinfo(skb)->frags[0];

 	/* Initialize size to the negative value of gso_size minus 1.  We
@@ -2636,19 +2640,19 @@ bool __i40e_chk_linearize(struct sk_buff *skb)
 	 */
 	sum = 1 - gso_size;

-	/* Add size of frags 1 through 5 to create our initial sum */
-	sum += skb_frag_size(++frag);
-	sum += skb_frag_size(++frag);
-	sum += skb_frag_size(++frag);
-	sum += skb_frag_size(++frag);
-	sum += skb_frag_size(++frag);
+	/* Add size of frags 0 through 4 to create our initial sum */
+	sum += skb_frag_size(frag++);
+	sum += skb_frag_size(frag++);
+	sum += skb_frag_size(frag++);
+	sum += skb_frag_size(frag++);
+	sum += skb_frag_size(frag++);

 	/* Walk through fragments adding latest fragment, testing it, and
 	 * then removing stale fragments from the sum.
 	 */
 	stale = &skb_shinfo(skb)->frags[0];
 	for (;;) {
-		sum += skb_frag_size(++frag);
+		sum += skb_frag_size(frag++);

 		/* if sum is negative we failed to make sufficient progress */
 		if (sum < 0)
@@ -2658,7 +2662,7 @@ bool __i40e_chk_linearize(struct sk_buff *skb)
 		if (!--nr_frags)
 			break;

-		sum -= skb_frag_size(++stale);
+		sum -= skb_frag_size(stale++);
 	}

 	return false;
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/dri
Re: [PATCH RFC] net: decrease the length of backlog queue immediately after it's detached from sk
On 2016/3/30 13:34, Eric Dumazet wrote:
> On Tue, 2016-03-29 at 22:25 -0700, Eric Dumazet wrote:
>> On Wed, 2016-03-30 at 13:16 +0800, Yang Yingliang wrote:
>>> When task A hold the sk owned in tcp_sendmsg, if lots of packets
>>> arrive and the packets will be added to backlog queue. The packets
>>> will be handled in release_sock called from tcp_sendmsg. When the
>>> sk_backlog is removed from sk, the length will not decrease until
>>> all the packets in backlog queue are handled. This may leads to the
>>> new packets be dropped because the lenth is too big. So set the
>>> lenth to 0 immediately after it's detached from sk.
>>>
>>> Signed-off-by: Yang Yingliang
>>> ---
>>>  net/core/sock.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/net/core/sock.c b/net/core/sock.c
>>> index 47fc8bb..108be05 100644
>>> --- a/net/core/sock.c
>>> +++ b/net/core/sock.c
>>> @@ -1933,6 +1933,7 @@ static void __release_sock(struct sock *sk)
>>>
>>>  	do {
>>>  		sk->sk_backlog.head = sk->sk_backlog.tail = NULL;
>>> +		sk->sk_backlog.len = 0;
>>>  		bh_unlock_sock(sk);
>>>
>>>  		do {
>>
>> Certainly not.
>>
>> Have you really missed the comment ?
>>
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8eae939f1400326b06d0c9afe53d2a484a326871
>>
>> I do not believe the case you describe can happen, unless a misbehaving
>> driver cooks fat skb (with skb->truesize being far more bigger than
>> skb->len)
>
> And also make sure you backported
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=da882c1f2ecadb0ed582628ec1585e36b137c0f0

Sorry, I made a mistake. I am very sure my kernel has these two patches.
And I can get some dropping of the packets in 10Gb eth.

# netstat -s | grep -i backlog
TCPBacklogDrop: 4135
# netstat -s | grep -i backlog
TCPBacklogDrop: 4167
Re: [PATCH RFC] net: decrease the length of backlog queue immediately after it's detached from sk
On 2016/3/30 13:25, Eric Dumazet wrote:
> On Wed, 2016-03-30 at 13:16 +0800, Yang Yingliang wrote:
>> When task A hold the sk owned in tcp_sendmsg, if lots of packets
>> arrive and the packets will be added to backlog queue. The packets
>> will be handled in release_sock called from tcp_sendmsg. When the
>> sk_backlog is removed from sk, the length will not decrease until
>> all the packets in backlog queue are handled. This may leads to the
>> new packets be dropped because the lenth is too big. So set the
>> lenth to 0 immediately after it's detached from sk.
>>
>> Signed-off-by: Yang Yingliang
>> ---
>>  net/core/sock.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 47fc8bb..108be05 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -1933,6 +1933,7 @@ static void __release_sock(struct sock *sk)
>>
>>  	do {
>>  		sk->sk_backlog.head = sk->sk_backlog.tail = NULL;
>> +		sk->sk_backlog.len = 0;
>>  		bh_unlock_sock(sk);
>>
>>  		do {
>
> Certainly not.
>
> Have you really missed the comment ?
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8eae939f1400326b06d0c9afe53d2a484a326871

My kernel is 4.1 LTS, it seems don't have this patch.
I will try this patch later.

Thanks
Yang

> I do not believe the case you describe can happen, unless a misbehaving
> driver cooks fat skb (with skb->truesize being far more bigger than
> skb->len)
Re: am335x: no multicast reception over VLAN
On Tuesday 29 March 2016 06:14 PM, Grygorii Strashko wrote:
> On 03/29/2016 03:35 PM, Yegor Yefremov wrote:
>> On Tue, Mar 29, 2016 at 1:05 PM, Grygorii Strashko wrote:
>>> On 03/29/2016 08:21 AM, Yegor Yefremov wrote:
>>>> Hi Mugunthan,
>>>>
>>>> On Tue, Mar 29, 2016 at 6:00 AM, Mugunthan V N wrote:
>>>>> Hi Yegor
>>>>>
>>>>> On Wednesday 16 March 2016 08:35 PM, Yegor Yefremov wrote:
>>>>>> I have an am335x based board using CPSW in Dual EMAC mode. Without
>>>>>> VLAN IDs I can receive and send multicast packets [1]. When I create
>>>>>> VLAN ID:
>>>>>>
>>>>>> ip link add link eth1 name eth1.100 type vlan id 100
>>>>>> ip addr add 192.168.100.2/24 brd 192.168.100.255 dev eth1.100
>>>>>> route add -net 224.0.0.0 netmask 224.0.0.0 eth1.100
>>>>>>
>>>>>> I can successfully send multicast packets, but not receive them. On
>>>>>> the other side of the Ethernet cable I've used Pandaboard. Pandaboard
>>>>>> could both receive and send multicast packets via VLAN.
>>>>>
>>>>> Are you trying multicast tx/rx on eth1 or eth1.100?
>>>>
>>>> I'm trying multicast tx/rx on eth1.100. eth1 has no problems.
>>>
>>> it'd be nice if will be able to post here output fom commands:
>>> # switch-config -d [git://git.ti.com/switch-config/switch-config.git v4.1]
>>> # ifconfig -a
>>> # tcpdump -e -f -Q in -i eth0
>>> # tcpdump -e -f -Q in -i eth0.100
>>
>> Which kernel/branch do you want me to test?
>>
>> git://git.ti.com/ti-linux-kernel/ti-linux-kernel.git and ti-rt-linux-4.1.y?
>>
>> So far I was using vanilla kernel.
>
> Your branch (but better 4.5 kernels (or 4.4)).
> Just when you've done with configuration run cmds 1&2,
> and when you run your use-case - run cmds 2&3 on receiver side (grap ~5-10
> packets).
> then stop test and run cmd 1 again.
>
> After all could you provide your console output here, pls.

To use command 1, you need TI kernel [1] as it won't build with vanilla
kernel.

[1]: git://git.ti.com/ti-linux-kernel/ti-linux-kernel.git ti-linux-4.1.y

Regards
Mugunthan V N
Re: [PATCH RFC] net: decrease the length of backlog queue immediately after it's detached from sk
On Tue, 2016-03-29 at 22:25 -0700, Eric Dumazet wrote:
> On Wed, 2016-03-30 at 13:16 +0800, Yang Yingliang wrote:
>> When task A hold the sk owned in tcp_sendmsg, if lots of packets
>> arrive and the packets will be added to backlog queue. The packets
>> will be handled in release_sock called from tcp_sendmsg. When the
>> sk_backlog is removed from sk, the length will not decrease until
>> all the packets in backlog queue are handled. This may leads to the
>> new packets be dropped because the lenth is too big. So set the
>> lenth to 0 immediately after it's detached from sk.
>>
>> Signed-off-by: Yang Yingliang
>> ---
>>  net/core/sock.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 47fc8bb..108be05 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -1933,6 +1933,7 @@ static void __release_sock(struct sock *sk)
>>
>>  	do {
>>  		sk->sk_backlog.head = sk->sk_backlog.tail = NULL;
>> +		sk->sk_backlog.len = 0;
>>  		bh_unlock_sock(sk);
>>
>>  		do {
>
> Certainly not.
>
> Have you really missed the comment ?
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8eae939f1400326b06d0c9afe53d2a484a326871
>
> I do not believe the case you describe can happen, unless a misbehaving
> driver cooks fat skb (with skb->truesize being far more bigger than
> skb->len)

And also make sure you backported

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=da882c1f2ecadb0ed582628ec1585e36b137c0f0
Re: [net-next 02/16] i40e/i40evf: Rewrite logic for 8 descriptor per packet check
On Tue, 2016-03-29 at 22:19 -0700, Alexander Duyck wrote:
> Minor correction here.  This is 72 hex, which is 114 in decimal.  The
> interesting bit I believe is the fact that we have both header and
> payload data in the same descriptor.
>
> You should probably check with your hardware team on this but I have a
> working theory on the issue.  I'm thinking that because both header
> and payload are coming out of the same descriptor this must count as 2
> descriptors and not 1 when we compute the usage for the first
> descriptor.  As such I do probably need to start testing at frag 0
> because the first payload can actually come out of the header region
> if the first descriptor includes both payload and data.
>
> I'll try to submit a patch for it tonight.  If you can test it
> tomorrow I would appreciate it.  Otherwise I will try setting up an
> environment to try and reproduce the issue tomorrow.

If you get it submitted tonight, I can have it ready in my tree for
testing for Jesse and the team when they get into the office tomorrow.
Re: [PATCH RFC] net: decrease the length of backlog queue immediately after it's detached from sk
On Wed, 2016-03-30 at 13:16 +0800, Yang Yingliang wrote:
> When task A hold the sk owned in tcp_sendmsg, if lots of packets
> arrive and the packets will be added to backlog queue. The packets
> will be handled in release_sock called from tcp_sendmsg. When the
> sk_backlog is removed from sk, the length will not decrease until
> all the packets in backlog queue are handled. This may leads to the
> new packets be dropped because the lenth is too big. So set the
> lenth to 0 immediately after it's detached from sk.
>
> Signed-off-by: Yang Yingliang
> ---
>  net/core/sock.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 47fc8bb..108be05 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1933,6 +1933,7 @@ static void __release_sock(struct sock *sk)
>
>  	do {
>  		sk->sk_backlog.head = sk->sk_backlog.tail = NULL;
> +		sk->sk_backlog.len = 0;
>  		bh_unlock_sock(sk);
>
>  		do {

Certainly not.

Have you really missed the comment ?

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8eae939f1400326b06d0c9afe53d2a484a326871

I do not believe the case you describe can happen, unless a misbehaving
driver cooks fat skb (with skb->truesize being far more bigger than
skb->len)
Re: [net-next 02/16] i40e/i40evf: Rewrite logic for 8 descriptor per packet check
On Tue, Mar 29, 2016 at 7:20 PM, Alexander Duyck wrote:
> On Tue, Mar 29, 2016 at 5:34 PM, Jesse Brandeburg wrote:
>> stupid gmail... sent again to the list, sorry for duplicate (but
>> formatted better this time)
>>
>> Hey Alex, this patch appears to have caused a regression (probably both
>> i40e/i40evf). Easily reproducible running rds-stress (see the thread
>> titled "i40e card Tx resets", the middle post by sowmini has the repro
>> steps, ignore the pktgen discussion in the thread.)
>
> I'll see about setting up rds tomorrow.  It should be pretty straight
> forward.
>
>> I've spent some time trying to figure out the right fix, but keep
>> getting stuck in the complicated logic of the function, so I'm sending
>> this in the hope you'll take a look.
>>
>> This is (one example of) the skb that doesn't get linearized:
>> skb_print_bits: headlen=114 datalen=33824 data=238 mac=238 nh=252
>> h=272 gso=1448 frag=17
>
> So the code as I had written it up should be verifying we use no more
> than 8 descriptors as that was what the data sheet states.  You might
> want to request a spec update if the hardware only supports 7 data
> descriptors per segment for a TSO as that isn't what is in the
> documentation.
>
>> skb_print_bits: frag[0]: len: 256
>> skb_print_bits: frag[1]: len: 48
>> skb_print_bits: frag[2]: len: 256
>> skb_print_bits: frag[3]: len: 48
>> skb_print_bits: frag[4]: len: 256
>> skb_print_bits: frag[5]: len: 48
>> skb_print_bits: frag[6]: len: 4096
>>
>> This descriptor ^^^ is the 8th, I believe the hardware mechanism faults
>> on this input.
>
> Is this something that was previously being linearized?  It doesn't
> seem like it would be.  We are only 8 descriptors in and that 7th
> fragment should have flushed the count back to 1 with a remainder.
>
> If you are wanting to trigger a linearize with the updated code you
> would just need to update 2 lines.  Replace "I40E_MAX_BUFFER_TXD - 1"
> with just "I40E_MAX_BUFFER_TXD" in the line where we subtract that
> value from nr_frags.  Then remove one of the "sum +=
> skb_frag_size(++frag);" lines.  That should update the code so that
> you only evaluate 5 buffers at a time instead of 6.  It should force
> things to coalesce if we would span 8 or more descriptors instead of 9
> or more which is what the code currently does based on what was
> required in the EAS.
>
>> I added a print of the sum at each point it is subtracted or added to
>> after initially being set, sum7/8 are in the for loop.
>>
>> skb_print_bits: frag[7]: len: 4096
>> skb_print_bits: frag[8]: len: 48
>> skb_print_bits: frag[9]: len: 4096
>> skb_print_bits: frag[10]: len: 4096
>> skb_print_bits: frag[11]: len: 48
>> skb_print_bits: frag[12]: len: 4096
>> skb_print_bits: frag[13]: len: 4096
>> skb_print_bits: frag[14]: len: 48
>> skb_print_bits: frag[15]: len: 4096
>> skb_print_bits: frag[16]: len: 4096
>> __i40e_chk_linearize: sum1: -1399
>> __i40e_chk_linearize: sum2: -1143
>> __i40e_chk_linearize: sum3: -1095
>> __i40e_chk_linearize: sum4: -839
>> __i40e_chk_linearize: sum5: -791
>> __i40e_chk_linearize: sum7: 3305
>> __i40e_chk_linearize: sum8: 3257
>> __i40e_chk_linearize: sum7: 7353
>> __i40e_chk_linearize: sum8: 7097
>> __i40e_chk_linearize: sum7: 7145
>> __i40e_chk_linearize: sum8: 7097
>> __i40e_chk_linearize: sum7: 11193
>> __i40e_chk_linearize: sum8: 10937
>> __i40e_chk_linearize: sum7: 15033
>> __i40e_chk_linearize: sum8: 14985
>> __i40e_chk_linearize: sum7: 15033
>>
>> This was the descriptors generated from the above:
>> d[054] = 0x 0x16a021140011
>> d[055] = 0x000bfbaa40ee 0x01ca02871640
>
> len 72

Minor correction here.  This is 72 hex, which is 114 in decimal.  The
interesting bit I believe is the fact that we have both header and
payload data in the same descriptor.

You should probably check with your hardware team on this but I have a
working theory on the issue.  I'm thinking that because both header
and payload are coming out of the same descriptor this must count as 2
descriptors and not 1 when we compute the usage for the first
descriptor.  As such I do probably need to start testing at frag 0
because the first payload can actually come out of the header region
if the first descriptor includes both payload and data.

I'll try to submit a patch for it tonight.  If you can test it
tomorrow I would appreciate it.  Otherwise I will try setting up an
environment to try and reproduce the issue tomorrow.

- Alex
Re: [PATCH 06/16] wcn36xx: Fetch private sta data from sta entry instead of from vif
Bjorn Andersson writes:

>> All error/warnings (new ones prefixed by >>):
>>
>>    drivers/net/wireless/ath/wcn36xx/main.c: In function 'wcn36xx_set_key':
>>    drivers/net/wireless/ath/wcn36xx/main.c:389:9: error: implicit
>>    declaration of function 'wcn36xx_sta_to_priv'
>>    [-Werror=implicit-function-declaration]
>>      struct wcn36xx_sta *sta_priv = wcn36xx_sta_to_priv(sta);
>>             ^
>>    drivers/net/wireless/ath/wcn36xx/main.c:389:33: warning: initialization
>>    makes pointer from integer without a cast
>>      struct wcn36xx_sta *sta_priv = wcn36xx_sta_to_priv(sta);
>>                                     ^
>>    cc1: some warnings being treated as errors
>
> This should have been reordered with patch 7, that introduces this
> helper function. Do you want me to resend, or can you apply the patches
> out of order?

It's better that you resend the whole patchset as v2.

-- 
Kalle Valo
[PATCH RFC] net: decrease the length of backlog queue immediately after it's detached from sk
When task A holds the socket lock in tcp_sendmsg and lots of packets
arrive, the packets will be added to the backlog queue. The packets
will be handled in release_sock, called from tcp_sendmsg. When the
sk_backlog is detached from sk, the length will not decrease until
all the packets in the backlog queue are handled. This may lead to
new packets being dropped because the length is too big. So set the
length to 0 immediately after the queue is detached from sk.

Signed-off-by: Yang Yingliang
---
 net/core/sock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/sock.c b/net/core/sock.c
index 47fc8bb..108be05 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1933,6 +1933,7 @@ static void __release_sock(struct sock *sk)

 	do {
 		sk->sk_backlog.head = sk->sk_backlog.tail = NULL;
+		sk->sk_backlog.len = 0;
 		bh_unlock_sock(sk);

 		do {
-- 
2.5.0
Re: [PATCH] ethernet: mvneta: Support netpoll
Hi David,

On 30 March 2016 at 00:09, David Miller wrote:
> From: Ezequiel Garcia
> Date: Mon, 28 Mar 2016 17:41:18 -0300
>
>> +/* Polled functionality used by netconsole and others in non interrupt mode */
>> +static void mvneta_poll_controller(struct net_device *dev)
>> +{
>> +	struct mvneta_port *pp = netdev_priv(dev);
>> +
>> +	on_each_cpu(mvneta_percpu_poll_controller, pp, false);
>> +}
>
> This doesn't work.
>
> netpoll may be invoked from any context whatsoever, even hardware
> interrupt handlers or contexts where cpu interrupts are disabled.
>
> smp_call_function() and thus on_each_cpu() may not be called with
> disabled interrupts or from a hardware interrupt handler or from
> a bottom half handler, all of which are valid situations where
> netpoll may occur since printk's can occur anywhere.

Well, I hated the idea of using on_each_cpu here, but wasn't sure if
there was a real reason that forbid it.

Thanks a lot for the feedback.
-- 
Ezequiel García, VanguardiaSur
www.vanguardiasur.com.ar
Re: [PATCH net] bpf: make padding in bpf_tunnel_key explicit
From: Daniel Borkmann
Date: Wed, 30 Mar 2016 00:02:00 +0200

> Make the 2 byte padding in struct bpf_tunnel_key between tunnel_ttl
> and tunnel_label members explicit. No issue has been observed, and
> gcc/llvm does padding for the old struct already, where tunnel_label
> was not yet present, so the current code works, but since it's part
> of uapi, make sure we don't introduce holes in structs.
>
> Therefore, add tunnel_ext that we can use generically in future
> (f.e. to flag OAM messages for backends, etc). Also add the offset
> to the compat tests to be sure should some compilers not padd the
> tail of the old version of bpf_tunnel_key.
>
> Fixes: 4018ab1875e0 ("bpf: support flow label for bpf_skb_{set,get}_tunnel_key")
> Signed-off-by: Daniel Borkmann

Applied, thanks.
Re: [PATCH] net: fec: stop the "rcv is not +last, " error messages
Hi Troy,

Commit 55cd48c8 ('net: fec: stop the "rcv is not +last, " error messages')
adds a write to a register that is not present in all implementations of
the FEC hardware module. None of the ColdFire SoC parts with the FEC
module have the FTRL (0x1b0) register.

Does this need a quirk flag to key access to this register off? Or can
you piggyback on the FEC_QUIRK_HAS_RACC flag?

Regards
Greg
Re: [PATCH net] net: ipv4: Multipath needs to handle unreachable nexthops
On 3/25/16 4:05 AM, Julian Anastasov wrote:
> Hello,
>
> On Thu, 24 Mar 2016, David Ahern wrote:
>
>> On 3/24/16 4:33 PM, Julian Anastasov wrote:
>>> But for multipath routes we can also consider the nexthops as
>>> "alternatives", so it depends on how one uses the multipath mechanism.
>>> The ability to fallback to another nexthop assumes one connection is
>>> allowed to move from one ISP to another. What if the second ISP decides
>>> to reject the connection? What we have is a broken connection just
>>> because the retransmits were diverted to wrong place in the hurry.
>>> So, the nexthops can be compatible or incompatible. For your setup
>>> they are, for others they are not.
>>
>> I am not sure I completely understand your point. Are you saying that
>> within a single multipath route some connections to nexthops are allowed
>> and others are not?
>>
>> So to put that paragraph into an example
>>
>> 15.0.0.0/16
>>   nexthop via 12.0.0.2 dev swp1 weight 1
>>   nexthop via 12.0.0.3 dev swp1 weight 1
>>
>> Hosts from 15.0/16 could have TCP connections use 12.0.0.2, but not
>> 12.0.0.3 because 12.0.0.3 could be a different ISP and not allow TCP
>> connections from this address space?
>
> Yes. Two cases are possible:
>
> 1. ISP2 filters saddr, traffic with saddr from ISP1 is dropped.
>
> 2. ISP2 allows any saddr. But how the responses from world with
> daddr=IP_from_ISP1 will come from ISP2 link?
>
> If the nexthops are for different ISP the connection can survive
> only if sticks to its ISP. An ISP will not work as a backup link
> for another ISP.

Seems to me this is a problem that is addressed by VRFs, not multipath
routes where some nexthops are actually deadends because they attempt
to cross ISPs.

>> After that if it has information that says that a nexthop is dead, why
>> would it continue to try to probe? Any traffic that selects that nh is
>> dead. That to me defies the basis of having multiple paths.
>
> If entry becomes FAILED this state is preserved if we do not direct
> traffic to this entry. If there was a single connection that was
> rejected after 3 failed probes the next connection (with your patch)
> will fallback to another neigh and the first entry will remain in
> FAILED state until expiration. If one wants to refresh the state often,
> a script/tool that pings all GWs is needed, so that you can notice the
> available or failed paths faster.
>
> We do not know how long is the outage. Long living connections may
> prefer to survive with retransmits. Say you are using SSH via wifi
> link doing important work. Do you want your connection to break just
> because link was down for a while?

neighbor entries have a timeout and when it drops from the cache the arp
will try again. This suggested patch is not saying 'never try a nexthop
again' it is saying 'I have multiple paths and since path 1 is down try
another one'.

I'll send an updated patch when I get time (traveling at the moment); I
guess a sysctl is going to be needed if the behavior you mention with
ISPs is reasonable.
Re: [PATCH] ethernet: mvneta: Support netpoll
From: Ezequiel Garcia
Date: Mon, 28 Mar 2016 17:41:18 -0300

> +/* Polled functionality used by netconsole and others in non interrupt mode */
> +static void mvneta_poll_controller(struct net_device *dev)
> +{
> +	struct mvneta_port *pp = netdev_priv(dev);
> +
> +	on_each_cpu(mvneta_percpu_poll_controller, pp, false);
> +}

This doesn't work.

netpoll may be invoked from any context whatsoever, even hardware
interrupt handlers or contexts where cpu interrupts are disabled.

smp_call_function() and thus on_each_cpu() may not be called with
disabled interrupts or from a hardware interrupt handler or from
a bottom half handler, all of which are valid situations where
netpoll may occur since printk's can occur anywhere.
[PATCHv2 net] team: team should sync the port's uc/mc addrs when add a port
There is an issue when we use macvtap over team: when we replug NIC
links from team0, the real NICs' mc lists will no longer include the
maddr for macvtap. Then we can't receive packets destined for the
macvtap device, as they are filtered out by the NICs' mc lists.

The bonding driver syncs the uc/mc addrs in bond_enslave(). Fix this
issue on team by adding the same uc/mc addr sync in team_port_add().

Signed-off-by: Xin Long
---
 drivers/net/team/team.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index 26c64d2..a0f64cb 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1198,6 +1198,9 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
 		goto err_dev_open;
 	}

+	dev_uc_sync_multiple(port_dev, dev);
+	dev_mc_sync_multiple(port_dev, dev);
+
 	err = vlan_vids_add_by_dev(port_dev, dev);
 	if (err) {
 		netdev_err(dev, "Failed to add vlan ids to device %s\n",
@@ -1261,6 +1264,8 @@ err_enable_netpoll:
 	vlan_vids_del_by_dev(port_dev, dev);

 err_vids_add:
+	dev_uc_unsync(port_dev, dev);
+	dev_mc_unsync(port_dev, dev);
 	dev_close(port_dev);

 err_dev_open:
-- 
2.1.0
Re: [net-next 02/16] i40e/i40evf: Rewrite logic for 8 descriptor per packet check
On Tue, Mar 29, 2016 at 5:34 PM, Jesse Brandeburg wrote:
> stupid gmail... sent again to the list, sorry for duplicate (but
> formatted better this time)
>
> Hey Alex, this patch appears to have caused a regression (probably both
> i40e/i40evf). Easily reproducible running rds-stress (see the thread
> titled "i40e card Tx resets", the middle post by sowmini has the repro
> steps, ignore the pktgen discussion in the thread.)

I'll see about setting up rds tomorrow.  It should be pretty straight
forward.

> I've spent some time trying to figure out the right fix, but keep
> getting stuck in the complicated logic of the function, so I'm sending
> this in the hope you'll take a look.
>
> This is (one example of) the skb that doesn't get linearized:
> skb_print_bits: headlen=114 datalen=33824 data=238 mac=238 nh=252
> h=272 gso=1448 frag=17

So the code as I had written it up should be verifying we use no more
than 8 descriptors as that was what the data sheet states.  You might
want to request a spec update if the hardware only supports 7 data
descriptors per segment for a TSO as that isn't what is in the
documentation.

> skb_print_bits: frag[0]: len: 256
> skb_print_bits: frag[1]: len: 48
> skb_print_bits: frag[2]: len: 256
> skb_print_bits: frag[3]: len: 48
> skb_print_bits: frag[4]: len: 256
> skb_print_bits: frag[5]: len: 48
> skb_print_bits: frag[6]: len: 4096
>
> This descriptor ^^^ is the 8th, I believe the hardware mechanism faults
> on this input.

Is this something that was previously being linearized?  It doesn't
seem like it would be.  We are only 8 descriptors in and that 7th
fragment should have flushed the count back to 1 with a remainder.

If you are wanting to trigger a linearize with the updated code you
would just need to update 2 lines.  Replace "I40E_MAX_BUFFER_TXD - 1"
with just "I40E_MAX_BUFFER_TXD" in the line where we subtract that
value from nr_frags.  Then remove one of the "sum +=
skb_frag_size(++frag);" lines.  That should update the code so that
you only evaluate 5 buffers at a time instead of 6.  It should force
things to coalesce if we would span 8 or more descriptors instead of 9
or more which is what the code currently does based on what was
required in the EAS.

> I added a print of the sum at each point it is subtracted or added to
> after initially being set, sum7/8 are in the for loop.
>
> skb_print_bits: frag[7]: len: 4096
> skb_print_bits: frag[8]: len: 48
> skb_print_bits: frag[9]: len: 4096
> skb_print_bits: frag[10]: len: 4096
> skb_print_bits: frag[11]: len: 48
> skb_print_bits: frag[12]: len: 4096
> skb_print_bits: frag[13]: len: 4096
> skb_print_bits: frag[14]: len: 48
> skb_print_bits: frag[15]: len: 4096
> skb_print_bits: frag[16]: len: 4096
> __i40e_chk_linearize: sum1: -1399
> __i40e_chk_linearize: sum2: -1143
> __i40e_chk_linearize: sum3: -1095
> __i40e_chk_linearize: sum4: -839
> __i40e_chk_linearize: sum5: -791
> __i40e_chk_linearize: sum7: 3305
> __i40e_chk_linearize: sum8: 3257
> __i40e_chk_linearize: sum7: 7353
> __i40e_chk_linearize: sum8: 7097
> __i40e_chk_linearize: sum7: 7145
> __i40e_chk_linearize: sum8: 7097
> __i40e_chk_linearize: sum7: 11193
> __i40e_chk_linearize: sum8: 10937
> __i40e_chk_linearize: sum7: 15033
> __i40e_chk_linearize: sum8: 14985
> __i40e_chk_linearize: sum7: 15033
>
> This was the descriptors generated from the above:
> d[054] = 0x 0x16a021140011
> d[055] = 0x000bfbaa40ee 0x01ca02871640

len 72

> d[056] = 0x000584bcd000 0x040202871640

len 256

> d[057] = 0x000c0bfea9d0 0x00c202871640

len 48

> d[058] = 0x000584bcd100 0x040202871640

len 256

> d[059] = 0x000c0bfeaa00 0x00c202871640

len 48

> d[05a] = 0x000584bcd200 0x040202871640

len 256

> d[05b] = 0x000c0bfeaa30 0x00c202871640

len 48

> d[05c] = 0x00056d5f 0x400202871640

len 4096

From what I can see we are at 8 descriptors and have exceeded 1 MSS.
According to the EAS this is supposed to work.

> d[05d] = 0x00056d5f1000 0x400202871640
> d[05e] = 0x000c0bfeaa60 0x00c202871640
> d[05f] = 0x0005f2762000 0x400202871640
> d[060] = 0x0005f765e000 0x400202871640
> d[061] = 0x000c0bfeaa90 0x00c202871640
> d[062] = 0x000574928000 0x400202871640
> d[063] = 0x000568ba5000 0x400202871640
> d[064] = 0x000c0bfeaac0 0x00c202871640
> d[065] = 0x0005f68cd000 0x400202871640
> d[066] = 0x000585a2a000 0x400202871670
>
> On Fri, Feb 19, 2016 at 3:54 AM Jeff Kirsher wrote:
>>
>> From: Alexander Duyck
>>
>> This patch is meant to rewrite the logic for how we determine if we can
>> transmit the frame or if it needs to be linearized.
>>
>> +	/* Initialize size to the negative value of gso_size minus 1.  We
>> +	 * use this as the worst case scenerio in which the frag ahead
>> +	 * of us only provides one byte which is why we are limited to 6
>> +	 * descriptors for a single transmit as the h
[PATCH net-next v3 2/2] net: dsa: mv88e6xxx: Clear the PDOWN bit on setup
Some of the vendor-specific bootloaders set up this part of the
initialization for us, so this was never added. However, since upstream
bootloaders don't initialize the chip specifically, they leave the
fiber MII's PDOWN flag set, which means that the CPU port doesn't
connect. This patch checks whether this flag has been cleared
beforehand by something else, and if not, clears it.

Reviewed-by: Andrew Lunn
Signed-off-by: Patrick Uiterwijk
---
 drivers/net/dsa/mv88e6xxx.c | 36
 drivers/net/dsa/mv88e6xxx.h | 8
 2 files changed, 44 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 86a2029..50454be 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -2296,6 +2296,25 @@ restore_page_0:
 	return ret;
 }
 
+static int mv88e6xxx_power_on_serdes(struct dsa_switch *ds)
+{
+	int ret;
+
+	ret = _mv88e6xxx_phy_page_read(ds, REG_FIBER_SERDES, PAGE_FIBER_SERDES,
+				       MII_BMCR);
+	if (ret < 0)
+		return ret;
+
+	if (ret & BMCR_PDOWN) {
+		ret &= ~BMCR_PDOWN;
+		ret = _mv88e6xxx_phy_page_write(ds, REG_FIBER_SERDES,
+						PAGE_FIBER_SERDES, MII_BMCR,
+						ret);
+	}
+
+	return ret;
+}
+
 static int mv88e6xxx_setup_port(struct dsa_switch *ds, int port)
 {
 	struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
@@ -2399,6 +2418,23 @@ static int mv88e6xxx_setup_port(struct dsa_switch *ds, int port)
 		goto abort;
 	}
 
+	/* If this port is connected to a SerDes, make sure the SerDes is not
+	 * powered down.
+	 */
+	if (mv88e6xxx_6352_family(ds)) {
+		ret = _mv88e6xxx_reg_read(ds, REG_PORT(port), PORT_STATUS);
+		if (ret < 0)
+			goto abort;
+		ret &= PORT_STATUS_CMODE_MASK;
+		if ((ret == PORT_STATUS_CMODE_100BASE_X) ||
+		    (ret == PORT_STATUS_CMODE_1000BASE_X) ||
+		    (ret == PORT_STATUS_CMODE_SGMII)) {
+			ret = mv88e6xxx_power_on_serdes(ds);
+			if (ret < 0)
+				goto abort;
+		}
+	}
+
 	/* Port Control 2: don't force a good FCS, set the maximum frame size to
 	 * 10240 bytes, disable 802.1q tags checking, don't discard tagged or
 	 * untagged frames on this port, do a destination address lookup on all
diff --git a/drivers/net/dsa/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx.h
index 9a038ab..26a424a 100644
--- a/drivers/net/dsa/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx.h
@@ -28,6 +28,10 @@
 #define SMI_CMD_OP_45_READ_DATA_INC	((3 << 10) | SMI_CMD_BUSY)
 #define SMI_DATA	0x01
 
+/* Fiber/SERDES Registers are located at SMI address F, page 1 */
+#define REG_FIBER_SERDES	0x0f
+#define PAGE_FIBER_SERDES	0x01
+
 #define REG_PORT(p)	(0x10 + (p))
 #define PORT_STATUS	0x00
 #define PORT_STATUS_PAUSE_EN	BIT(15)
@@ -45,6 +49,10 @@
 #define PORT_STATUS_MGMII	BIT(6) /* 6185 */
 #define PORT_STATUS_TX_PAUSED	BIT(5)
 #define PORT_STATUS_FLOW_CTRL	BIT(4)
+#define PORT_STATUS_CMODE_MASK	0x0f
+#define PORT_STATUS_CMODE_100BASE_X	0x8
+#define PORT_STATUS_CMODE_1000BASE_X	0x9
+#define PORT_STATUS_CMODE_SGMII	0xa
 #define PORT_PCS_CTRL	0x01
 #define PORT_PCS_CTRL_RGMII_DELAY_RXCLK	BIT(15)
 #define PORT_PCS_CTRL_RGMII_DELAY_TXCLK	BIT(14)
-- 
2.7.4
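The essential semantics of the fix above are a conditional read-modify-write of the fiber BMCR. The following is a toy userspace model of that logic, not the driver code: the SMI accessors are replaced with a single backing variable, and the reset value 0x0940 is invented for illustration (only BMCR_PDOWN, 0x0800, is the real MII definition).

```c
#include <assert.h>

#define BMCR_PDOWN 0x0800	/* MII BMCR power-down bit (linux/mii.h) */

/* Toy register standing in for the SerDes BMCR behind the SMI
 * accessors; 0x0940 is a made-up post-bootloader value with PDOWN set.
 */
static int fiber_bmcr = 0x0940;

static int phy_read_bmcr(void) { return fiber_bmcr; }
static int phy_write_bmcr(int val) { fiber_bmcr = val; return 0; }

/* Model of mv88e6xxx_power_on_serdes(): only write the register when
 * the bootloader actually left PDOWN set, preserving all other bits.
 */
static int power_on_serdes(void)
{
	int ret = phy_read_bmcr();

	if (ret < 0)
		return ret;

	if (ret & BMCR_PDOWN) {
		ret &= ~BMCR_PDOWN;
		ret = phy_write_bmcr(ret);
	}

	return ret;
}
```

Note the design point the review thread touches on: the write is skipped entirely when PDOWN is already clear, so a vendor bootloader that has done this setup is left alone.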
[PATCH net-next v3 1/2] net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read,write}
Add versions of the phy_page_read and _write functions to be used in a context where the SMI mutex is held. Tested-by: Vivien Didelot Reviewed-by: Vivien Didelot Signed-off-by: Patrick Uiterwijk --- drivers/net/dsa/mv88e6xxx.c | 49 + 1 file changed, 36 insertions(+), 13 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c index fa086e0..86a2029 100644 --- a/drivers/net/dsa/mv88e6xxx.c +++ b/drivers/net/dsa/mv88e6xxx.c @@ -2264,6 +2264,38 @@ static void mv88e6xxx_bridge_work(struct work_struct *work) mutex_unlock(&ps->smi_mutex); } +static int _mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page, +int reg, int val) +{ + int ret; + + ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page); + if (ret < 0) + goto restore_page_0; + + ret = _mv88e6xxx_phy_write_indirect(ds, port, reg, val); +restore_page_0: + _mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0); + + return ret; +} + +static int _mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page, + int reg) +{ + int ret; + + ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page); + if (ret < 0) + goto restore_page_0; + + ret = _mv88e6xxx_phy_read_indirect(ds, port, reg); +restore_page_0: + _mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0); + + return ret; +} + static int mv88e6xxx_setup_port(struct dsa_switch *ds, int port) { struct mv88e6xxx_priv_state *ps = ds_to_priv(ds); @@ -2714,13 +2746,9 @@ int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page, int reg) int ret; mutex_lock(&ps->smi_mutex); - ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page); - if (ret < 0) - goto error; - ret = _mv88e6xxx_phy_read_indirect(ds, port, reg); -error: - _mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0); + ret = _mv88e6xxx_phy_page_read(ds, port, page, reg); mutex_unlock(&ps->smi_mutex); + return ret; } @@ -2731,14 +2759,9 @@ int mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page, int ret; mutex_lock(&ps->smi_mutex); - ret = 
_mv88e6xxx_phy_write_indirect(ds, port, 0x16, page); - if (ret < 0) - goto error; - - ret = _mv88e6xxx_phy_write_indirect(ds, port, reg, val); -error: - _mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0); + ret = _mv88e6xxx_phy_page_write(ds, port, page, reg, val); mutex_unlock(&ps->smi_mutex); + return ret; } -- 2.7.4
Re: [PATCH net-next v2 2/2] net: dsa: mv88e6xxx: Clear the PDOWN bit on setup
On Wed, Mar 30, 2016 at 12:58:04AM +, Patrick Uiterwijk wrote:
> Hi Vivien,
>
> On Tue, Mar 29, 2016 at 6:49 PM, Vivien Didelot wrote:
> > Hi Andrew, Patrick,
> >
> > Andrew Lunn writes:
> >
> >> On Tue, Mar 29, 2016 at 12:23:06PM -0400, Vivien Didelot wrote:
> >>> Hi Patrick,
> >>>
> >>> Two comments below.
> >>>
> >>> Patrick Uiterwijk writes:
> >>>
> >>> > +static int mv88e6xxx_power_on_serdes(struct dsa_switch *ds)
> >>>
> >>> Since this function assumes the SMI lock is already held, its name
> >>> should be prefixed with _ by convention (_mv88e6xxx_power_on_serdes).
> >>
> >> We decided to drop that, since nearly everything would end up with a _
> >> prefix. The assert_smi_lock() should find any missing locks, and
> >> lockdep/deadlocks will make it clear when the lock is taken twice.
> >
> > OK, I didn't know that. This makes sense. There is no need to respin a
> > v3 only for my previous &= comment then.
>
> Does that mean the merger will fix this up?
> Or that I'll roll a v3 when I get a reviewed-by for the second patch?

Hi Patrick

Roll a v3, and you can add Reviewed-by: Andrew Lunn as well as Vivien's
for patch #1.

	Andrew
Re: [PATCH net-next v2 2/2] net: dsa: mv88e6xxx: Clear the PDOWN bit on setup
Hi Vivien, On Tue, Mar 29, 2016 at 6:49 PM, Vivien Didelot wrote: > Hi Andrew, Patrick, > > Andrew Lunn writes: > >> On Tue, Mar 29, 2016 at 12:23:06PM -0400, Vivien Didelot wrote: >>> Hi Patrick, >>> >>> Two comments below. >>> >>> Patrick Uiterwijk writes: >>> >>> > +static int mv88e6xxx_power_on_serdes(struct dsa_switch *ds) >>> >>> Since this function assumes the SMI lock is already held, its name >>> should be prefixed with _ by convention (_mv88e6xxx_power_on_serdes). >> >> We decided to drop at, since nearly everything would end up with a _ >> prefix. The assert_smi_lock() should find any missing locks, and >> lockdep/deadlocks will make it clear when the lock is taken twice. > > OK, I didn't know that. This makes sense. There is no need to respin a > v3 only for my previous &= comment then. Does that mean the merger will fix this up? Or that I'll roll a v3 when I get a reviewed-by for the second patch? Thanks, Patrick
Re: [PATCH net-next] tcp: remove cwnd moderation after recovery
On Tue, 29 Mar 2016 17:15:52 -0700 Yuchung Cheng wrote:
> For non-SACK connections, cwnd is lowered to inflight plus 3 packets
> when the recovery ends. This is an optional feature in the NewReno
> RFC 2582 to reduce the potential burst when cwnd is "re-opened"
> after recovery and inflight is low.
>
> This feature is questionably effective because of PRR: when
> the recovery ends (i.e., snd_una == high_seq) NewReno holds the
> CA_Recovery state for another round trip to prevent false fast
> retransmits. But if the inflight is low, PRR will overwrite the
> moderated cwnd in tcp_cwnd_reduction() later.
>
> On the other hand, if the recovery ends because the sender detects
> that the losses were spurious (e.g., due to reordering), this feature
> unconditionally lowers a reverted cwnd even though nothing was lost.
>
> On principle, the loss recovery module should not update cwnd.
> Further, pacing is much more effective at reducing bursts. Hence this
> patch removes the cwnd moderation feature.
>
> Signed-off-by: Matt Mathis
> Signed-off-by: Neal Cardwell
> Signed-off-by: Soheil Hassas Yeganeh

I have a concern that this might break Linux's built-in protection
against a hostile receiver sending bogus ACKs. Remember that Linux is
different from NewReno. You are changing something that has existed for
a long, long time.
Re: [net-next 02/16] i40e/i40evf: Rewrite logic for 8 descriptor per packet check
stupid gmail... sent again to the list, sorry for duplicate (but
formatted better this time)

Hey Alex, this patch appears to have caused a regression (probably both
i40e/i40evf). Easily reproducible running rds-stress (see the thread
titled "i40e card Tx resets", the middle post by sowmini has the repro
steps, ignore the pktgen discussion in the thread.)

I've spent some time trying to figure out the right fix, but keep
getting stuck in the complicated logic of the function, so I'm sending
this in the hope you'll take a look.

This is (one example of) the skb that doesn't get linearized:

skb_print_bits: > headlen=114 datalen=33824 data=238 mac=238 nh=252 h=272 gso=1448 frag=17
skb_print_bits: frag[0]: len: 256
skb_print_bits: frag[1]: len: 48
skb_print_bits: frag[2]: len: 256
skb_print_bits: frag[3]: len: 48
skb_print_bits: frag[4]: len: 256
skb_print_bits: frag[5]: len: 48
skb_print_bits: frag[6]: len: 4096

This descriptor ^^^ is the 8th, I believe the hardware mechanism faults
on this input.

I added a print of the sum at each point it is subtracted or added to
after initially being set, sum7/8 are in the for loop.
skb_print_bits: frag[7]: len: 4096
skb_print_bits: frag[8]: len: 48
skb_print_bits: frag[9]: len: 4096
skb_print_bits: frag[10]: len: 4096
skb_print_bits: frag[11]: len: 48
skb_print_bits: frag[12]: len: 4096
skb_print_bits: frag[13]: len: 4096
skb_print_bits: frag[14]: len: 48
skb_print_bits: frag[15]: len: 4096
skb_print_bits: frag[16]: len: 4096

__i40e_chk_linearize: sum1: -1399
__i40e_chk_linearize: sum2: -1143
__i40e_chk_linearize: sum3: -1095
__i40e_chk_linearize: sum4: -839
__i40e_chk_linearize: sum5: -791
__i40e_chk_linearize: sum7: 3305
__i40e_chk_linearize: sum8: 3257
__i40e_chk_linearize: sum7: 7353
__i40e_chk_linearize: sum8: 7097
__i40e_chk_linearize: sum7: 7145
__i40e_chk_linearize: sum8: 7097
__i40e_chk_linearize: sum7: 11193
__i40e_chk_linearize: sum8: 10937
__i40e_chk_linearize: sum7: 15033
__i40e_chk_linearize: sum8: 14985
__i40e_chk_linearize: sum7: 15033

This was the descriptors generated from the above:

d[054] = 0x 0x16a021140011
d[055] = 0x000bfbaa40ee 0x01ca02871640
d[056] = 0x000584bcd000 0x040202871640
d[057] = 0x000c0bfea9d0 0x00c202871640
d[058] = 0x000584bcd100 0x040202871640
d[059] = 0x000c0bfeaa00 0x00c202871640
d[05a] = 0x000584bcd200 0x040202871640
d[05b] = 0x000c0bfeaa30 0x00c202871640
d[05c] = 0x00056d5f 0x400202871640
d[05d] = 0x00056d5f1000 0x400202871640
d[05e] = 0x000c0bfeaa60 0x00c202871640
d[05f] = 0x0005f2762000 0x400202871640
d[060] = 0x0005f765e000 0x400202871640
d[061] = 0x000c0bfeaa90 0x00c202871640
d[062] = 0x000574928000 0x400202871640
d[063] = 0x000568ba5000 0x400202871640
d[064] = 0x000c0bfeaac0 0x00c202871640
d[065] = 0x0005f68cd000 0x400202871640
d[066] = 0x000585a2a000 0x400202871670

On Fri, Feb 19, 2016 at 3:54 AM Jeff Kirsher wrote:
>
> From: Alexander Duyck
>
> This patch is meant to rewrite the logic for how we determine if we can
> transmit the frame or if it needs to be linearized.
>
> +	/* Initialize size to the negative value of gso_size minus 1.
We
> +	 * use this as the worst case scenerio in which the frag ahead
> +	 * of us only provides one byte which is why we are limited to 6
> +	 * descriptors for a single transmit as the header and previous
> +	 * fragment are already consuming 2 descriptors.
> +	 */
> +	sum = 1 - gso_size;
> +
> +	/* Add size of frags 1 through 5 to create our initial sum */
> +	sum += skb_frag_size(++frag);

I'm pretty sure this code skips frag[0] due to the pre-increment. The
bug seems to occur in the algorithm because skb->data contains the
L2/L3/L4 header plus some data, and should count as 1 descriptor for
every subsequent sent packet, and if the incoming skb has 7 chunks that
add up to < MSS then we get in trouble.

> +	sum += skb_frag_size(++frag);
> +	sum += skb_frag_size(++frag);
> +	sum += skb_frag_size(++frag);
> +	sum += skb_frag_size(++frag);
> +
> +	/* Walk through fragments adding latest fragment, testing it, and
> +	 * then removing stale fragments from the sum.
> +	 */
> +	stale = &skb_shinfo(skb)->frags[0];
> +	for (;;) {
> +		sum += skb_frag_size(++frag);
> +
> +		/* if sum is negative we failed to make sufficient progress */
> +		if (sum < 0)
> +			return true;
> +
> +		/* use pre-decrement to avoid processing last fragment */
> +		if (!--nr_frags)
> +			break;
> +
> +		sum -= skb_frag_size(++stale);

I think this line also skips stale[0]
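To make the sliding-window arithmetic concrete, here is a userspace sketch of the *patched* check (7-fragment threshold, 6-fragment window), fed with the fragment sizes and gso_size from the rds-stress report above. This is a simplified model, not the driver code: plain int arrays stand in for the skb/frag structures, and the window bookkeeping follows the structure of the fix proposed in this thread.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BUFFER_TXD 8	/* HW limit: 8 buffers per packet on the wire */

/* Model of the patched __i40e_chk_linearize(): returns true when the
 * skb must be linearized. frags[] holds fragment sizes; skb->data is
 * assumed to consume 2 of the 8 buffers (TSO header plus first segment
 * payload), leaving only 6 for any one segment's fragments.
 */
static bool chk_linearize(const int *frags, int nr_frags, int gso_size)
{
	const int *frag, *stale;
	int sum;

	/* no need to check if number of frags is less than 7 */
	if (nr_frags < (MAX_BUFFER_TXD - 1))
		return false;

	/* only windows ending before the last 6 frags need checking */
	nr_frags -= MAX_BUFFER_TXD - 2;
	frag = frags;

	/* worst case: the frag ahead of us provides only one byte */
	sum = 1 - gso_size;

	/* seed the window with frags 0 through 4 */
	sum += *frag++;
	sum += *frag++;
	sum += *frag++;
	sum += *frag++;
	sum += *frag++;

	/* slide a 6-frag window across the fragment list */
	stale = frags;
	for (;;) {
		sum += *frag++;
		if (sum < 0)	/* 6 frags cannot cover one gso_size */
			return true;
		if (!nr_frags--)
			break;
		sum -= *stale++;
	}
	return false;
}
```

With the 7-fragment threshold, the 48-byte fragments in the reported skb make the very first 6-fragment window fall short of one gso_size (the running sum goes negative), so the skb is linearized instead of being handed to the hardware as 8-plus buffers per segment.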
Re: [PATCH RFC net-next] net: core: Pass XPS select queue decision to skb_tx_hash
On 16-03-29 03:24 PM, Saeed Mahameed wrote: > Currently XPS select queue decision is final and overrides/ignores all other > select queue parameters such as QoS TC, RX recording. > > This patch makes get_xps_queue value as a hint for skb_tx_hash, which will > decide > whether to use this hint as is or to tweak it a little to provide the correct > TXQ. > > This will fix bugs in cases such that TC QoS offload (tc_to_txq mapping) and > XPS mapping > are configured but select queue will only respect the XPS configuration which > will skip > the TC QoS queue selection and thus will not satisfy the QoS configuration. > > RFC because I want to discuss how we would like the final behavior of the > __netdev_pick_tx, with this patch it goes as follows: > > netdev_pick_tx(skb): > hint = get_xps_queue > txq = skb_tx_hash(skb, hint) > > skb_tx_hash(skb, hint): > if (skb_rx_queue_recorded(skb)) > return skb_get_rx_queue(skb); > > queue_offset = 0; > if (dev->num_tc) > queue_offset = tc_queue_offset[tc]; > > hash = hint < 0 ? skb_get_hash(skb) : hint; > return hash + queue_offset; > > i.e: instead of blindly return the XPS decision, we pass it to skb_tx_hash > which can make the final > decision for us, select queue will now respect and combine XPS and TC QoS > tc_to_txq mappings. > > Also there is one additional behavioral change that recorded rx queues will > now override the XPS configuration. 
> > Signed-off-by: Saeed Mahameed > --- > drivers/net/ethernet/mellanox/mlx4/en_tx.c |2 +- > include/linux/netdevice.h |6 +++--- > net/core/dev.c | 10 ++ > 3 files changed, 10 insertions(+), 8 deletions(-) > > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c > b/drivers/net/ethernet/mellanox/mlx4/en_tx.c > index c0d7b72..873cf49 100644 > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c > @@ -693,7 +693,7 @@ u16 mlx4_en_select_queue(struct net_device *dev, struct > sk_buff *skb, > u8 up = 0; > > if (dev->num_tc) > - return skb_tx_hash(dev, skb); > + return skb_tx_hash(dev, skb, -1); > I would prefer to not have another strange quirk users have to remember in order to do tx classification. So with this change depending on the driver the queue selection precedence changes. In short I agree with the problem statement but think we can find a better solution. One idea that comes to mind is we can have a tc action to force the queue selection? Now that we have the egress tc hook it would probably be fairly cheap to implement and if users want this behavior they can ask for it explicitly. If your thinking about tc stuff we could fix the tooling to set this action when ever dcb is turned on or hardware rate limiting is enabled, etc. And even if we wanted we could have the driver add the rule in the cases where firmware protocols are configuring the QOS/etc. > if (skb_vlan_tag_present(skb)) > up = skb_vlan_tag_get(skb) >> VLAN_PRIO_SHIFT; > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index cb0d5d0..ad81ffe 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -3130,16 +3130,16 @@ static inline int netif_set_xps_queue(struct > net_device *dev, > #endif > > u16 __skb_tx_hash(const struct net_device *dev, struct sk_buff *skb, > - unsigned int num_tx_queues); > + unsigned int num_tx_queues, int txq_hint); > [...] 
And all this seems like it would only ever be called by drivers'
select-queue routines, which I really wish we could kill off one of
these days instead of adding to. Now, if the signal is something higher
in the stack and not the driver, I think it is OK.

John
[PATCH net-next] tcp: remove cwnd moderation after recovery
For non-SACK connections, cwnd is lowered to inflight plus 3 packets
when the recovery ends. This is an optional feature in the NewReno
RFC 2582 to reduce the potential burst when cwnd is "re-opened"
after recovery and inflight is low.

This feature is questionably effective because of PRR: when
the recovery ends (i.e., snd_una == high_seq) NewReno holds the
CA_Recovery state for another round trip to prevent false fast
retransmits. But if the inflight is low, PRR will overwrite the
moderated cwnd in tcp_cwnd_reduction() later.

On the other hand, if the recovery ends because the sender detects
that the losses were spurious (e.g., due to reordering), this feature
unconditionally lowers a reverted cwnd even though nothing was lost.

On principle, the loss recovery module should not update cwnd.
Further, pacing is much more effective at reducing bursts. Hence this
patch removes the cwnd moderation feature.

Signed-off-by: Matt Mathis
Signed-off-by: Neal Cardwell
Signed-off-by: Soheil Hassas Yeganeh
---
 include/net/tcp.h    | 11 ---
 net/ipv4/tcp_input.c | 11 ---
 2 files changed, 22 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b91370f..f8bb4a4 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1039,17 +1039,6 @@ static inline __u32 tcp_max_tso_deferred_mss(const struct tcp_sock *tp)
 	return 3;
 }
 
-/* Slow start with delack produces 3 packets of burst, so that
- * it is safe "de facto". This will be the default - same as
- * the default reordering threshold - but if reordering increases,
- * we must be able to allow cwnd to burst at least this much in order
- * to not pull it back when holes are filled.
- */
-static __inline__ __u32 tcp_max_burst(const struct tcp_sock *tp)
-{
-	return tp->reordering;
-}
-
 /* Returns end sequence number of the receiver's advertised window */
 static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
 {
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e6e65f7..f87b84a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2252,16 +2252,6 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit)
 	}
 }
 
-/* CWND moderation, preventing bursts due to too big ACKs
- * in dubious situations.
- */
-static inline void tcp_moderate_cwnd(struct tcp_sock *tp)
-{
-	tp->snd_cwnd = min(tp->snd_cwnd,
-			   tcp_packets_in_flight(tp) + tcp_max_burst(tp));
-	tp->snd_cwnd_stamp = tcp_time_stamp;
-}
-
 static bool tcp_tsopt_ecr_before(const struct tcp_sock *tp, u32 when)
 {
 	return tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
@@ -2410,7 +2400,6 @@ static bool tcp_try_undo_recovery(struct sock *sk)
 	/* Hold old state until something *above* high_seq
 	 * is ACKed. For Reno it is MUST to prevent false
 	 * fast retransmits (RFC2582). SACK TCP is safe. */
-	tcp_moderate_cwnd(tp);
 	if (!tcp_any_retrans_done(sk))
 		tp->retrans_stamp = 0;
 	return true;
-- 
2.8.0.rc3.226.g39d4020
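For reference, the behaviour being deleted boils down to a one-line clamp. A minimal userspace model (with the reordering threshold passed in explicitly; tp->reordering defaults to 3) shows why a reverted cwnd gets punished after a spurious loss, which is the commit message's main complaint:

```c
#include <assert.h>

/* Model of the removed tcp_moderate_cwnd()/tcp_max_burst() pair: at
 * the end of recovery, cwnd was clamped to packets in flight plus the
 * reordering threshold.
 */
static unsigned int moderate_cwnd(unsigned int cwnd, unsigned int inflight,
				  unsigned int reordering)
{
	unsigned int cap = inflight + reordering;	/* tcp_max_burst() */

	return cwnd < cap ? cwnd : cap;
}
```

A spurious-loss revert that restores cwnd to 100 with only 5 packets in flight would be clamped to 8 even though nothing was lost, while a sender whose inflight is close to cwnd is unaffected.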
Best way to reduce system call overhead for tun device I/O?
I'm trying to reduce system call overhead when reading/writing to/from
a tun device in userspace. For sockets, one can use
sendmmsg()/recvmmsg(), but a tun fd is not a socket fd, so this doesn't
work.

I see several options to allow userspace to read/write multiple packets
with one syscall:

- Implement a TX/RX ring buffer that is mmap()ed, like with AF_PACKET
  sockets.
- Implement an ioctl() to emulate sendmmsg()/recvmmsg().
- Add a flag that can be set using TUNSETIFF that makes regular
  read()/write() calls handle multiple packets in one go.
- Expose a socket fd to userspace, so regular sendmmsg()/recvmmsg() can
  be used. There is tun_get_socket(), which is used internally in the
  kernel, but this is not exposed to userspace, and doesn't look
  trivial to do either.

What would be the right way to do this?

-- 
Met vriendelijke groet / with kind regards,
Guus Sliepen
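For comparison, this is what the desired batching looks like on a real socket fd: a minimal sendmmsg()/recvmmsg() round trip over an AF_UNIX datagram socketpair (the function name and 3-packet batch are just for illustration). The point of the question is that none of this works on a tun fd today, because tun does not expose a socket to userspace.

```c
#define _GNU_SOURCE		/* for sendmmsg()/recvmmsg() */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send 3 datagrams with one syscall and receive them with one syscall.
 * Returns the number of messages received, or -1 on error; received
 * payloads land in bufs[], their lengths in lens[].
 */
static int batch_roundtrip(char bufs[3][16], unsigned int lens[3])
{
	static char *pkts[3] = { "pkt0", "pkt1", "pkt2" };
	struct mmsghdr tx[3], rx[3];
	struct iovec tiov[3], riov[3];
	int sv[2], i, got;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;

	memset(tx, 0, sizeof(tx));
	memset(rx, 0, sizeof(rx));
	for (i = 0; i < 3; i++) {
		tiov[i].iov_base = pkts[i];
		tiov[i].iov_len = strlen(pkts[i]);
		tx[i].msg_hdr.msg_iov = &tiov[i];
		tx[i].msg_hdr.msg_iovlen = 1;

		riov[i].iov_base = bufs[i];
		riov[i].iov_len = 16;
		rx[i].msg_hdr.msg_iov = &riov[i];
		rx[i].msg_hdr.msg_iovlen = 1;
	}

	if (sendmmsg(sv[0], tx, 3, 0) != 3)	/* one syscall, 3 packets */
		return -1;

	got = recvmmsg(sv[1], rx, 3, 0, NULL);	/* one syscall, up to 3 */
	for (i = 0; i < got; i++)
		lens[i] = rx[i].msg_len;

	close(sv[0]);
	close(sv[1]);
	return got;
}
```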
[PATCH RFC net-next] net: core: Pass XPS select queue decision to skb_tx_hash
Currently XPS select queue decision is final and overrides/ignores all other select queue parameters such as QoS TC, RX recording. This patch makes get_xps_queue value as a hint for skb_tx_hash, which will decide whether to use this hint as is or to tweak it a little to provide the correct TXQ. This will fix bugs in cases such that TC QoS offload (tc_to_txq mapping) and XPS mapping are configured but select queue will only respect the XPS configuration which will skip the TC QoS queue selection and thus will not satisfy the QoS configuration. RFC because I want to discuss how we would like the final behavior of the __netdev_pick_tx, with this patch it goes as follows: netdev_pick_tx(skb): hint = get_xps_queue txq = skb_tx_hash(skb, hint) skb_tx_hash(skb, hint): if (skb_rx_queue_recorded(skb)) return skb_get_rx_queue(skb); queue_offset = 0; if (dev->num_tc) queue_offset = tc_queue_offset[tc]; hash = hint < 0 ? skb_get_hash(skb) : hint; return hash + queue_offset; i.e: instead of blindly return the XPS decision, we pass it to skb_tx_hash which can make the final decision for us, select queue will now respect and combine XPS and TC QoS tc_to_txq mappings. Also there is one additional behavioral change that recorded rx queues will now override the XPS configuration. 
Signed-off-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx4/en_tx.c |2 +- include/linux/netdevice.h |6 +++--- net/core/dev.c | 10 ++ 3 files changed, 10 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c index c0d7b72..873cf49 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c @@ -693,7 +693,7 @@ u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb, u8 up = 0; if (dev->num_tc) - return skb_tx_hash(dev, skb); + return skb_tx_hash(dev, skb, -1); if (skb_vlan_tag_present(skb)) up = skb_vlan_tag_get(skb) >> VLAN_PRIO_SHIFT; diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index cb0d5d0..ad81ffe 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -3130,16 +3130,16 @@ static inline int netif_set_xps_queue(struct net_device *dev, #endif u16 __skb_tx_hash(const struct net_device *dev, struct sk_buff *skb, - unsigned int num_tx_queues); + unsigned int num_tx_queues, int txq_hint); /* * Returns a Tx hash for the given packet when dev->real_num_tx_queues is used * as a distribution range limit for the returned value. */ static inline u16 skb_tx_hash(const struct net_device *dev, - struct sk_buff *skb) + struct sk_buff *skb, int txq_hint) { - return __skb_tx_hash(dev, skb, dev->real_num_tx_queues); + return __skb_tx_hash(dev, skb, dev->real_num_tx_queues, txq_hint); } /** diff --git a/net/core/dev.c b/net/core/dev.c index b9bcbe7..ff640b7 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2394,7 +2394,7 @@ EXPORT_SYMBOL(netif_device_attach); * to be used as a distribution range. 
*/ u16 __skb_tx_hash(const struct net_device *dev, struct sk_buff *skb, - unsigned int num_tx_queues) + unsigned int num_tx_queues, int txq_hint) { u32 hash; u16 qoffset = 0; @@ -2411,9 +2411,12 @@ u16 __skb_tx_hash(const struct net_device *dev, struct sk_buff *skb, u8 tc = netdev_get_prio_tc_map(dev, skb->priority); qoffset = dev->tc_to_txq[tc].offset; qcount = dev->tc_to_txq[tc].count; + if (txq_hint >= qcount) /* This can happen when xps is configured on TC TXQs */ + txq_hint = -1; /* recalculate TXQ hash */ } - return (u16) reciprocal_scale(skb_get_hash(skb), qcount) + qoffset; + hash = txq_hint < 0 ? reciprocal_scale(skb_get_hash(skb), qcount) : txq_hint; + return (u16)(hash) + qoffset; } EXPORT_SYMBOL(__skb_tx_hash); @@ -3201,8 +3204,7 @@ static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb) if (queue_index < 0 || skb->ooo_okay || queue_index >= dev->real_num_tx_queues) { int new_index = get_xps_queue(dev, skb); - if (new_index < 0) - new_index = skb_tx_hash(dev, skb); + new_index = skb_tx_hash(dev, skb, new_index); if (queue_index != new_index && sk && sk_fullsock(sk) && -- 1.7.1
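The precedence the RFC proposes (recorded rx queue first, then an XPS hint validated against the traffic class's queue range, then the flow hash) can be modeled in a few lines. This is a sketch under assumptions, not the kernel code: reciprocal_scale() is replaced with a plain modulo, the recorded-rx-queue branch is omitted, and all names are illustrative.

```c
#include <assert.h>

struct tc_txq { unsigned int offset, count; };	/* like dev->tc_to_txq[tc] */

/* Model of the proposed __skb_tx_hash(): the XPS decision arrives as a
 * hint (-1 = none) and is only honoured when it falls inside the
 * traffic class's queue range, so TC-to-TXQ mappings are respected.
 */
static unsigned int pick_txq(int xps_hint, unsigned int flow_hash,
			     const struct tc_txq *tc)
{
	int hint = xps_hint;
	unsigned int hash;

	if (hint >= (int)tc->count)	/* XPS picked outside this TC's range */
		hint = -1;		/* recalculate from the flow hash */

	hash = hint < 0 ? flow_hash % tc->count : (unsigned int)hint;
	return hash + tc->offset;
}
```

With a TC owning queues 8..11, a valid hint of 2 maps to queue 10, while a hint of 6 (outside the 4-queue range) falls back to the flow hash, which is exactly the bug the patch fixes: today the out-of-range XPS decision would be returned blindly.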
Re: [PATCH] bridge: Allow set bridge ageing time when switchdev disabled
Tue, Mar 29, 2016 at 01:48:08PM IDT, yanhaishu...@cmss.chinamobile.com wrote: >When NET_SWITCHDEV=n, switchdev_port_attr_set will return -EOPNOTSUPP, >we should ignore this error code and continue to set the ageing time. > >Signed-off-by: Haishuang Yan Fixes: c62987bbd8a1 ("bridge: push bridge setting ageing_time down to switchdev") Acked-by: Ido Schimmel Thank you.
Re: [PATCH net] bpf: make padding in bpf_tunnel_key explicit
On Wed, Mar 30, 2016 at 12:02:00AM +0200, Daniel Borkmann wrote: > Make the 2 byte padding in struct bpf_tunnel_key between tunnel_ttl > and tunnel_label members explicit. No issue has been observed, and > gcc/llvm does padding for the old struct already, where tunnel_label > was not yet present, so the current code works, but since it's part > of uapi, make sure we don't introduce holes in structs. > > Therefore, add tunnel_ext that we can use generically in future > (f.e. to flag OAM messages for backends, etc). Also add the offset > to the compat tests to be sure should some compilers not padd the > tail of the old version of bpf_tunnel_key. > > Fixes: 4018ab1875e0 ("bpf: support flow label for bpf_skb_{set, > get}_tunnel_key") > Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov
[PATCH net] bpf: make padding in bpf_tunnel_key explicit
Make the 2 byte padding in struct bpf_tunnel_key between tunnel_ttl and tunnel_label members explicit. No issue has been observed, and gcc/llvm does padding for the old struct already, where tunnel_label was not yet present, so the current code works, but since it's part of uapi, make sure we don't introduce holes in structs. Therefore, add tunnel_ext that we can use generically in future (f.e. to flag OAM messages for backends, etc). Also add the offset to the compat tests to be sure should some compilers not padd the tail of the old version of bpf_tunnel_key. Fixes: 4018ab1875e0 ("bpf: support flow label for bpf_skb_{set, get}_tunnel_key") Signed-off-by: Daniel Borkmann --- include/uapi/linux/bpf.h | 1 + net/core/filter.c| 5 - 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 924f537..23917bb 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -375,6 +375,7 @@ struct bpf_tunnel_key { }; __u8 tunnel_tos; __u8 tunnel_ttl; + __u16 tunnel_ext; __u32 tunnel_label; }; diff --git a/net/core/filter.c b/net/core/filter.c index b7177d0..4b81b71 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -1764,6 +1764,7 @@ static u64 bpf_skb_get_tunnel_key(u64 r1, u64 r2, u64 size, u64 flags, u64 r5) if (unlikely(size != sizeof(struct bpf_tunnel_key))) { switch (size) { case offsetof(struct bpf_tunnel_key, tunnel_label): + case offsetof(struct bpf_tunnel_key, tunnel_ext): goto set_compat; case offsetof(struct bpf_tunnel_key, remote_ipv6[1]): /* Fixup deprecated structure layouts here, so we have @@ -1849,6 +1850,7 @@ static u64 bpf_skb_set_tunnel_key(u64 r1, u64 r2, u64 size, u64 flags, u64 r5) if (unlikely(size != sizeof(struct bpf_tunnel_key))) { switch (size) { case offsetof(struct bpf_tunnel_key, tunnel_label): + case offsetof(struct bpf_tunnel_key, tunnel_ext): case offsetof(struct bpf_tunnel_key, remote_ipv6[1]): /* Fixup deprecated structure layouts here, so we have * a common path 
later on. @@ -1861,7 +1863,8 @@ static u64 bpf_skb_set_tunnel_key(u64 r1, u64 r2, u64 size, u64 flags, u64 r5) return -EINVAL; } } - if (unlikely(!(flags & BPF_F_TUNINFO_IPV6) && from->tunnel_label)) + if (unlikely((!(flags & BPF_F_TUNINFO_IPV6) && from->tunnel_label) || +from->tunnel_ext)) return -EINVAL; skb_dst_drop(skb); -- 1.9.3
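The hole being made explicit can be demonstrated with offsetof(). The structs below are simplified stand-ins, not the uapi definitions: the remote-address union is collapsed to a single u32 so the layout around tunnel_ttl/tunnel_label is easy to see.

```c
#include <stddef.h>
#include <stdint.h>

/* Old layout: gcc/llvm insert a 2-byte hole between tunnel_ttl and
 * tunnel_label to satisfy tunnel_label's 4-byte alignment.
 */
struct tunnel_key_old {
	uint32_t tunnel_id;
	uint32_t remote_ipv4;	/* simplified: real struct has an address union */
	uint8_t  tunnel_tos;
	uint8_t  tunnel_ttl;
	uint32_t tunnel_label;
};

/* New layout: tunnel_ext occupies those 2 bytes explicitly, so there
 * is no implicit hole and the field can be reused later (e.g. OAM
 * flags for backends).
 */
struct tunnel_key_new {
	uint32_t tunnel_id;
	uint32_t remote_ipv4;
	uint8_t  tunnel_tos;
	uint8_t  tunnel_ttl;
	uint16_t tunnel_ext;
	uint32_t tunnel_label;
};
```

Both structs have the same size and the same tunnel_label offset, which is why the old and new layouts stay binary compatible; the patch only makes the padding a named, checkable field.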
[net PATCH v2] gro: Allow tunnel stacking in the case of FOU/GUE
This patch should fix the issues seen with a recent fix to prevent tunnel-in-tunnel frames from being generated with GRO. The fix itself is correct for now as long as we do not add any devices that support NETIF_F_GSO_GRE_CSUM. When such a device is added it could have the potential to mess things up due to the fact that the outer transport header points to the outer UDP header and not the GRE header as would be expected. Fixes: fac8e0f579695 ("tunnels: Don't apply GRO to multiple layers of encapsulation.") Signed-off-by: Alexander Duyck --- v2: Dropped switch statements per suggestion of Tom Herbert. net/ipv4/fou.c | 16 1 file changed, 16 insertions(+) diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c index a0586b4a197d..5a94aea280d3 100644 --- a/net/ipv4/fou.c +++ b/net/ipv4/fou.c @@ -195,6 +195,14 @@ static struct sk_buff **fou_gro_receive(struct sk_buff **head, u8 proto = NAPI_GRO_CB(skb)->proto; const struct net_offload **offloads; + /* We can clear the encap_mark for FOU as we are essentially doing +* one of two possible things. We are either adding an L4 tunnel +* header to the outer L3 tunnel header, or we are are simply +* treating the GRE tunnel header as though it is a UDP protocol +* specific header such as VXLAN or GENEVE. +*/ + NAPI_GRO_CB(skb)->encap_mark = 0; + rcu_read_lock(); offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads; ops = rcu_dereference(offloads[proto]); @@ -352,6 +360,14 @@ static struct sk_buff **gue_gro_receive(struct sk_buff **head, } } + /* We can clear the encap_mark for GUE as we are essentially doing +* one of two possible things. We are either adding an L4 tunnel +* header to the outer L3 tunnel header, or we are are simply +* treating the GRE tunnel header as though it is a UDP protocol +* specific header such as VXLAN or GENEVE. +*/ + NAPI_GRO_CB(skb)->encap_mark = 0; + rcu_read_lock(); offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads; ops = rcu_dereference(offloads[guehdr->proto_ctype]);
Re: [RESEND PATCH 15/16] wcn36xx: don't pad beacons for mesh
On Tue 29 Mar 14:41 PDT 2016, Bjorn Andersson wrote: > From: Jason Mobarak > > Patch "wcn36xx: Pad TIM PVM if needed" has caused a regression in mesh > beaconing. The field tim_off is always 0 for mesh mode, and thus > pvm_len (referring to the TIM length field) and pad are both incorrectly > calculated. Thus, msg_body.beacon_length is incorrectly calculated for > mesh mode. Fix this. > > Fixes: 8ad99a4e3ee5 ("wcn36xx: Pad TIM PVM if needed") > Signed-off-by: Jason Mobarak > Signed-off-by: Chun-Yeow Yeoh > Signed-off-by: Bjorn Andersson > --- > > Resend this single patch with included Fixes tag. > Sorry for the spam, I read the git log incorrectly. The patch referred to is part of this series, so the sha1 is bogus. Regards, Bjorn > drivers/net/wireless/ath/wcn36xx/smd.c | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/drivers/net/wireless/ath/wcn36xx/smd.c > b/drivers/net/wireless/ath/wcn36xx/smd.c > index a57d158298a1..b1bdc229e560 100644 > --- a/drivers/net/wireless/ath/wcn36xx/smd.c > +++ b/drivers/net/wireless/ath/wcn36xx/smd.c > @@ -1410,6 +1410,11 @@ int wcn36xx_smd_send_beacon(struct wcn36xx *wcn, > struct ieee80211_vif *vif, > > pvm_len = skb_beacon->data[tim_off + 1] - 3; > pad = TIM_MIN_PVM_SIZE - pvm_len; > + > + /* Padding is irrelevant to mesh mode since tim_off is always 0. */ > + if (vif->type == NL80211_IFTYPE_MESH_POINT) > + pad = 0; > + > msg_body.beacon_length = skb_beacon->len + pad; > /* TODO need to find out why + 6 is needed */ > msg_body.beacon_length6 = msg_body.beacon_length + 6; > -- > 2.5.0 >
[RESEND PATCH 15/16] wcn36xx: don't pad beacons for mesh
From: Jason Mobarak Patch "wcn36xx: Pad TIM PVM if needed" has caused a regression in mesh beaconing. The field tim_off is always 0 for mesh mode, and thus pvm_len (referring to the TIM length field) and pad are both incorrectly calculated. Thus, msg_body.beacon_length is incorrectly calculated for mesh mode. Fix this. Fixes: 8ad99a4e3ee5 ("wcn36xx: Pad TIM PVM if needed") Signed-off-by: Jason Mobarak Signed-off-by: Chun-Yeow Yeoh Signed-off-by: Bjorn Andersson --- Resend this single patch with included Fixes tag. drivers/net/wireless/ath/wcn36xx/smd.c | 5 + 1 file changed, 5 insertions(+) diff --git a/drivers/net/wireless/ath/wcn36xx/smd.c b/drivers/net/wireless/ath/wcn36xx/smd.c index a57d158298a1..b1bdc229e560 100644 --- a/drivers/net/wireless/ath/wcn36xx/smd.c +++ b/drivers/net/wireless/ath/wcn36xx/smd.c @@ -1410,6 +1410,11 @@ int wcn36xx_smd_send_beacon(struct wcn36xx *wcn, struct ieee80211_vif *vif, pvm_len = skb_beacon->data[tim_off + 1] - 3; pad = TIM_MIN_PVM_SIZE - pvm_len; + + /* Padding is irrelevant to mesh mode since tim_off is always 0. */ + if (vif->type == NL80211_IFTYPE_MESH_POINT) + pad = 0; + msg_body.beacon_length = skb_beacon->len + pad; /* TODO need to find out why + 6 is needed */ msg_body.beacon_length6 = msg_body.beacon_length + 6; -- 2.5.0
Re: [PATCH 06/16] wcn36xx: Fetch private sta data from sta entry instead of from vif
On Tue 29 Mar 10:01 PDT 2016, kbuild test robot wrote: > Hi Pontus, > > [auto build test ERROR on wireless-drivers/master] > [also build test ERROR on v4.6-rc1 next-20160329] > [if your patch is applied to the wrong git tree, please drop us a note to > help improving the system] > > url: > https://github.com/0day-ci/linux/commits/Bjorn-Andersson/Misc-wcn36xx-fixes/20160329-141847 > base: > https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers.git > master > config: sparc64-allyesconfig (attached as .config) > reproduce: > wget > https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross > -O ~/bin/make.cross > chmod +x ~/bin/make.cross > # save the attached .config to linux build tree > make.cross ARCH=sparc64 > > Note: the linux-review/Bjorn-Andersson/Misc-wcn36xx-fixes/20160329-141847 > HEAD 8303daac889854237207e7caefaea94fee0b87f2 builds fine. > It only hurts bisectibility. > > All error/warnings (new ones prefixed by >>): > >drivers/net/wireless/ath/wcn36xx/main.c: In function 'wcn36xx_set_key': > >> drivers/net/wireless/ath/wcn36xx/main.c:389:9: error: implicit declaration > >> of function 'wcn36xx_sta_to_priv' [-Werror=implicit-function-declaration] > struct wcn36xx_sta *sta_priv = wcn36xx_sta_to_priv(sta); > ^ > >> drivers/net/wireless/ath/wcn36xx/main.c:389:33: warning: initialization > >> makes pointer from integer without a cast > struct wcn36xx_sta *sta_priv = wcn36xx_sta_to_priv(sta); > ^ >cc1: some warnings being treated as errors This should have been reordered with patch 7, that introduces this helper function. Do you want me to resend, or can you apply the patches out of order? 
Regards, Bjorn > > vim +/wcn36xx_sta_to_priv +389 drivers/net/wireless/ath/wcn36xx/main.c > >383 struct ieee80211_vif *vif, >384 struct ieee80211_sta *sta, >385 struct ieee80211_key_conf *key_conf) >386{ >387struct wcn36xx *wcn = hw->priv; >388struct wcn36xx_vif *vif_priv = wcn36xx_vif_to_priv(vif); > > 389struct wcn36xx_sta *sta_priv = wcn36xx_sta_to_priv(sta); >390int ret = 0; >391u8 key[WLAN_MAX_KEY_LEN]; >392 > > --- > 0-DAY kernel test infrastructureOpen Source Technology Center > https://lists.01.org/pipermail/kbuild-all Intel Corporation > ___ > wcn36xx mailing list > wcn3...@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/wcn36xx
Re: [net PATCH] gro: Allow tunnel stacking in the case of FOU/GUE
On Tue, Mar 29, 2016 at 2:13 PM, Tom Herbert wrote: > On Mon, Mar 28, 2016 at 4:58 PM, Alexander Duyck wrote: >> This patch should fix the issues seen with a recent fix to prevent >> tunnel-in-tunnel frames from being generated with GRO. The fix itself is >> correct for now as long as we do not add any devices that support >> NETIF_F_GSO_GRE_CSUM. When such a device is added it could have the >> potential to mess things up due to the fact that the outer transport header >> points to the outer UDP header and not the GRE header as would be expected. >> >> Fixes: fac8e0f579695 ("tunnels: Don't apply GRO to multiple layers of >> encapsulation.") >> Signed-off-by: Alexander Duyck >> --- >> >> This should allow us to keep the fix that Jesse added without breaking the >> 3 cases that Tom called out in terms of FOU/GUE. >> >> Additional work will be needed in net-next as we probably need to make it >> so that offloads work correctly when we get around to supporting >> NETIF_F_GSO_GRE_CSUM. >> >> net/ipv4/fou.c | 32 >> 1 file changed, 32 insertions(+) >> >> diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c >> index 4136da9275b2..2c30256ee959 100644 >> --- a/net/ipv4/fou.c >> +++ b/net/ipv4/fou.c >> @@ -195,6 +195,22 @@ static struct sk_buff **fou_gro_receive(struct sk_buff >> **head, >> u8 proto = NAPI_GRO_CB(skb)->proto; >> const struct net_offload **offloads; >> >> + switch (proto) { >> + case IPPROTO_IPIP: >> + case IPPROTO_IPV6: >> + case IPPROTO_GRE: >> + /* We can clear the encap_mark for these 3 protocols as >> +* we are either adding an L4 tunnel header to the outer >> +* L3 tunnel header, or we are are simply treating the >> +* GRE tunnel header as though it is a UDP protocol >> +* specific header such as VXLAN or GENEVE. >> +*/ >> + NAPI_GRO_CB(skb)->encap_mark = 0; >> + /* fall-through */ >> + default: >> + break; >> + } > > Switch statement is not needed, just do NAPI_GRO_CB(skb)->encap_mark = 0; > >> + >> rcu_read_lock(); >> offloads = NAPI_GRO_CB(skb)->is_ipv6 ? 
inet6_offloads : >> inet_offloads; >> ops = rcu_dereference(offloads[proto]); >> @@ -359,6 +375,22 @@ static struct sk_buff **gue_gro_receive(struct sk_buff >> **head, >> NAPI_GRO_CB(p)->flush |= NAPI_GRO_CB(p)->flush_id; >> } >> >> + switch (guehdr->proto_ctype) { >> + case IPPROTO_IPIP: >> + case IPPROTO_IPV6: >> + case IPPROTO_GRE: >> + /* We can clear the encap_mark for these 3 protocols as >> +* we are either adding an L4 tunnel header to the outer >> +* L3 tunnel header, or we are are simply treating the >> +* GRE tunnel header as though it is a UDP protocol >> +* specific header such as VXLAN or GENEVE. >> +*/ >> + NAPI_GRO_CB(skb)->encap_mark = 0; >> + /* fall-through */ >> + default: >> + break; >> + } >> + > Here also. > >> rcu_read_lock(); >> offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : >> inet_offloads; >> ops = rcu_dereference(offloads[guehdr->proto_ctype]); >> Okay, I can update that and submit a v2. The only real reason why I had the switch statements was out of an abundance of caution since those were the only 3 cases where I knew we would run into issues. - Alex
Re: [net PATCH] gro: Allow tunnel stacking in the case of FOU/GUE
On Mon, Mar 28, 2016 at 4:58 PM, Alexander Duyck wrote: > This patch should fix the issues seen with a recent fix to prevent > tunnel-in-tunnel frames from being generated with GRO. The fix itself is > correct for now as long as we do not add any devices that support > NETIF_F_GSO_GRE_CSUM. When such a device is added it could have the > potential to mess things up due to the fact that the outer transport header > points to the outer UDP header and not the GRE header as would be expected. > > Fixes: fac8e0f579695 ("tunnels: Don't apply GRO to multiple layers of > encapsulation.") > Signed-off-by: Alexander Duyck > --- > > This should allow us to keep the fix that Jesse added without breaking the > 3 cases that Tom called out in terms of FOU/GUE. > > Additional work will be needed in net-next as we probably need to make it > so that offloads work correctly when we get around to supporting > NETIF_F_GSO_GRE_CSUM. > > net/ipv4/fou.c | 32 > 1 file changed, 32 insertions(+) > > diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c > index 4136da9275b2..2c30256ee959 100644 > --- a/net/ipv4/fou.c > +++ b/net/ipv4/fou.c > @@ -195,6 +195,22 @@ static struct sk_buff **fou_gro_receive(struct sk_buff > **head, > u8 proto = NAPI_GRO_CB(skb)->proto; > const struct net_offload **offloads; > > + switch (proto) { > + case IPPROTO_IPIP: > + case IPPROTO_IPV6: > + case IPPROTO_GRE: > + /* We can clear the encap_mark for these 3 protocols as > +* we are either adding an L4 tunnel header to the outer > +* L3 tunnel header, or we are are simply treating the > +* GRE tunnel header as though it is a UDP protocol > +* specific header such as VXLAN or GENEVE. > +*/ > + NAPI_GRO_CB(skb)->encap_mark = 0; > + /* fall-through */ > + default: > + break; > + } Switch statement is not needed, just do NAPI_GRO_CB(skb)->encap_mark = 0; > + > rcu_read_lock(); > offloads = NAPI_GRO_CB(skb)->is_ipv6 ? 
inet6_offloads : inet_offloads; > ops = rcu_dereference(offloads[proto]); > @@ -359,6 +375,22 @@ static struct sk_buff **gue_gro_receive(struct sk_buff > **head, > NAPI_GRO_CB(p)->flush |= NAPI_GRO_CB(p)->flush_id; > } > > + switch (guehdr->proto_ctype) { > + case IPPROTO_IPIP: > + case IPPROTO_IPV6: > + case IPPROTO_GRE: > + /* We can clear the encap_mark for these 3 protocols as > +* we are either adding an L4 tunnel header to the outer > +* L3 tunnel header, or we are are simply treating the > +* GRE tunnel header as though it is a UDP protocol > +* specific header such as VXLAN or GENEVE. > +*/ > + NAPI_GRO_CB(skb)->encap_mark = 0; > + /* fall-through */ > + default: > + break; > + } > + Here also. > rcu_read_lock(); > offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads; > ops = rcu_dereference(offloads[guehdr->proto_ctype]); >
Re: [net PATCH] gro: Allow tunnel stacking in the case of FOU/GUE
From: Tom Herbert Date: Mon, 28 Mar 2016 21:51:17 -0700 > No, but I do expect that you support code that is already there. There > was apparently zero testing done on the original patch and it caused > one very obvious regression. So how can we have any confidence > whatsoever that this patch doesn't break other things? Furthermore, > with all these claims of bugs I still don't see that _anyone_ has > taken the time to reproduce any issue and show that this patch > materially fixes any thing. I seriously don't understand how basic > testing could be such a challenge. > > Anyway, what I expect is moot. It's up to davem to decide what to do > with this... You being upset with a lack of testing is one issue, and is legitimate. But the fact that we can't support, and never could support, more than one network header at a time except in a very special case for GRO is very real. And you must acknowledge that this was a very shaky foundation upon which to erect the kinds of things you expect to work.
Re: [PATCH net-next v3.16]r9169: Correct Set Vlan tag
From: Corcodel Marian Date: Tue, 29 Mar 2016 11:33:20 +0300 > This patch add set Vlan tag and flush CPlusCmd register because when unset > RxVlan and RxChkSum bit, whithout some explication , unwanted bits > is set, PCIDAC, PCIMulRW and others.Whithout this patch when run > ethtool -d eth0 on "C+ Command" field missing "VLAN de-tagging" > > Signed-off-by: Corcodel Marian I am hereby blocking you from making any and all postings to the mailing lists at vger.kernel.org There is not negotiating about this, all complaints sent to me will be ignored. You should have thought about the consequences of your actions (or lack thereof) over the past several months. You cannot continually post patches, as you please, and completely refuse to interact with the developers who review your changes and give you feedback. You have not coherently replied to any feedback you have been given. You have almost always ignored the feedback you were given and continued to make the same undesirable changes over and over again. And you refuse to even acknowledge or address the issue of your interactions with the community in any way whatsoever. Therefore you are hurting development, and wasting precious developer resources with your postings. All of this is unacceptable. But that all ends right now. Thank you.
Re: [PATCH 1/1] ipv4: fix NULL pointer dereference in __inet_put_port()
From: fanhui Date: Tue, 29 Mar 2016 14:45:53 +0800 > [] tcp_nuke_addr+0x22c/0x2a0 Do not report or fix problems in non-mainline kernels. Thank you.
Re: ss filter problem
Hi Phil, On Tue, Mar 29, 2016 at 09:32:42PM +0200, Phil Sutter wrote: > Hi, > > I am trying to fix a bug in ss filter code, but feel quite lost right > now. The issue is this: > > | ss -nl -f inet '( sport = :22 )' > > prints not only listening sockets (as requested by -l flag), but > established ones as well (reproduce by opening ssh connection to > 127.0.0.1 before calling above). > > In contrast, the following both don't show the established sockets: > > | ss -nl '( sport = :22 )' > | ss -nl -f inet > > My investigation led me to see that current_filter.states is altered > after ssfilter_parse() returns, and using gdb with a watchpoint I was > able to identify parse_hostcond() to be the bad guy: In line 1560, it > calls filter_af_set() after checking for fam != AF_UNSPEC (which is the > case, since fam = preferred_family and the latter is changed to AF_INET > when parsing '-f inet' parameter). Yes, after removing of fam != AF_UNSPEC body - it works, because it does not overwrite specified states (-l) from command line, but I can't say what it may affect else, I will try to investigate it better. > > This whole jumping back and forth confuses me quite effectively. Since > you did some fixes in the past already, are you possibly able to point > out where/how this tiny mess has to be fixed? > > I guess in an ideal world we would translate '-l' to 'state listen', '-f > inet' to 'src inet:*' and pass everything ANDed together to > ssfilter_parse(). Or maybe that would make things even worse. ;) > > Cheers, Phil I thought I fixed & tested well ss filter, but seems it would be good to have good automation testing. Regards, Vadim Kochan
Re: [PATCH] bus: mvebu-mbus: use %pad to print phys_addr_t
On Tuesday 29 March 2016 18:04:47 Gregory CLEMENT wrote: > > What is the status of this patch? > > Do you plan to send a second version with the title fixed as suggested > by Joe Perches? > > Also do you expect that I collect this patch in the mvebu subsystem? Right now, it's on my long-term todo list along with some 70 other patches I need to revisit. If you want to fix up the title and apply it now, that would be great, otherwise I'll get to it in a few weeks after coming back from ELC/vacation. Arnd
ss filter problem
Hi, I am trying to fix a bug in ss filter code, but feel quite lost right now. The issue is this: | ss -nl -f inet '( sport = :22 )' prints not only listening sockets (as requested by -l flag), but established ones as well (reproduce by opening ssh connection to 127.0.0.1 before calling above). In contrast, the following both don't show the established sockets: | ss -nl '( sport = :22 )' | ss -nl -f inet My investigation led me to see that current_filter.states is altered after ssfilter_parse() returns, and using gdb with a watchpoint I was able to identify parse_hostcond() to be the bad guy: In line 1560, it calls filter_af_set() after checking for fam != AF_UNSPEC (which is the case, since fam = preferred_family and the latter is changed to AF_INET when parsing '-f inet' parameter). This whole jumping back and forth confuses me quite effectively. Since you did some fixes in the past already, are you possibly able to point out where/how this tiny mess has to be fixed? I guess in an ideal world we would translate '-l' to 'state listen', '-f inet' to 'src inet:*' and pass everything ANDed together to ssfilter_parse(). Or maybe that would make things even worse. ;) Cheers, Phil
Re: [PATCH net-next v2 2/2] net: dsa: mv88e6xxx: Clear the PDOWN bit on setup
Hi Andrew, Patrick, Andrew Lunn writes: > On Tue, Mar 29, 2016 at 12:23:06PM -0400, Vivien Didelot wrote: >> Hi Patrick, >> >> Two comments below. >> >> Patrick Uiterwijk writes: >> >> > +static int mv88e6xxx_power_on_serdes(struct dsa_switch *ds) >> >> Since this function assumes the SMI lock is already held, its name >> should be prefixed with _ by convention (_mv88e6xxx_power_on_serdes). > > We decided to drop at, since nearly everything would end up with a _ > prefix. The assert_smi_lock() should find any missing locks, and > lockdep/deadlocks will make it clear when the lock is taken twice. OK, I didn't know that. This makes sense. There is no need to respin a v3 only for my previous &= comment then. Thanks, -v
RE: [PATCH] qed: initialize return rc to avoid returning garbage
> From: Colin Ian King > > in the case where qed_slowpath_irq_req is not called, rc is not assigned and > so > qed_int_igu_enable will return a garbage value. > Fix this by initializing rc to 0. > > Signed-off-by: Colin Ian King Thanks Colin. Acked-by: Yuval Mintz
[RFC] Add netdev all_adj_list refcnt propagation to fix panic
This is an RFC patch to fix a relatively easily reproducible kernel panic related to the all_adj_list handling for netdevs in recent kernels. This is more to generate discussion than anything else. I don't particularly like this approach, I'm hoping someone has a better idea.

The following sequence of commands will reproduce the issue:

ip link add link eth0 name eth0.100 type vlan id 100
ip link add link eth0 name eth0.200 type vlan id 200
ip link add name testbr type bridge
ip link set eth0.100 master testbr
ip link set eth0.200 master testbr
ip link add link testbr mac0 type macvlan
ip link delete dev testbr

This creates an upper/lower tree of (excuse the poor ASCII art):

             /---eth0.100-eth0
mac0-testbr-
             \---eth0.200-eth0

When testbr is deleted, the all_adj_lists are walked, and eth0 is deleted twice from the mac0 list. Unfortunately, during setup in __netdev_upper_dev_link, only one reference to eth0 is added, so this results in a panic. This change adds reference count propagation so things are handled properly.

Matthias Schiffer reported a similar crash in batman-adv:

https://github.com/freifunk-gluon/gluon/issues/680
https://www.open-mesh.org/issues/247

which this patch also seems to resolve.
--- net/core/dev.c | 68 -- 1 file changed, 37 insertions(+), 31 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index b9bcbe7..4b4ef6b 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -5471,6 +5471,7 @@ static inline bool netdev_adjacent_is_neigh_list(struct net_device *dev, static int __netdev_adjacent_dev_insert(struct net_device *dev, struct net_device *adj_dev, + u16 ref_nr, struct list_head *dev_list, void *private, bool master) { @@ -5480,7 +5481,7 @@ static int __netdev_adjacent_dev_insert(struct net_device *dev, adj = __netdev_find_adj(adj_dev, dev_list); if (adj) { - adj->ref_nr++; + adj->ref_nr += ref_nr; return 0; } @@ -5490,7 +5491,7 @@ static int __netdev_adjacent_dev_insert(struct net_device *dev, adj->dev = adj_dev; adj->master = master; - adj->ref_nr = 1; + adj->ref_nr = ref_nr; adj->private = private; dev_hold(adj_dev); @@ -5529,6 +5530,7 @@ free_adj: static void __netdev_adjacent_dev_remove(struct net_device *dev, struct net_device *adj_dev, +u16 ref_nr, struct list_head *dev_list) { struct netdev_adjacent *adj; @@ -5541,10 +5543,10 @@ static void __netdev_adjacent_dev_remove(struct net_device *dev, BUG(); } - if (adj->ref_nr > 1) { - pr_debug("%s to %s ref_nr-- = %d\n", dev->name, adj_dev->name, -adj->ref_nr-1); - adj->ref_nr--; + if (adj->ref_nr > ref_nr) { + pr_debug("%s to %s ref_nr-%d = %d\n", dev->name, adj_dev->name, +ref_nr, adj->ref_nr-ref_nr); + adj->ref_nr -= ref_nr; return; } @@ -5563,21 +5565,22 @@ static void __netdev_adjacent_dev_remove(struct net_device *dev, static int __netdev_adjacent_dev_link_lists(struct net_device *dev, struct net_device *upper_dev, + u16 ref_nr, struct list_head *up_list, struct list_head *down_list, void *private, bool master) { int ret; - ret = __netdev_adjacent_dev_insert(dev, upper_dev, up_list, private, - master); + ret = __netdev_adjacent_dev_insert(dev, upper_dev, ref_nr, up_list, + private, master); if (ret) return ret; - ret = __netdev_adjacent_dev_insert(upper_dev, dev, down_list, 
private, - false); + ret = __netdev_adjacent_dev_insert(upper_dev, dev, ref_nr, down_list, + private, false); if (ret) { - __netdev_adjacent_dev_remove(dev, upper_dev, up_list); + __netdev_adjacent_dev_remove(dev, upper_dev, ref_nr, up_list); return ret; } @@ -5585,9 +5588,10 @@ static int __netdev_adjacent_dev_link_lists(struct net_device *dev, } static int __netdev_adjacent_dev_link(struct net_device *dev, - struct net_device *upper_dev) + struct net_device *upper_dev, + u16 r
Re: [PATCH 06/16] wcn36xx: Fetch private sta data from sta entry instead of from vif
Hi Pontus, [auto build test ERROR on wireless-drivers/master] [also build test ERROR on v4.6-rc1 next-20160329] [if your patch is applied to the wrong git tree, please drop us a note to help improving the system] url: https://github.com/0day-ci/linux/commits/Bjorn-Andersson/Misc-wcn36xx-fixes/20160329-141847 base: https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers.git master config: sparc64-allyesconfig (attached as .config) reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=sparc64 Note: the linux-review/Bjorn-Andersson/Misc-wcn36xx-fixes/20160329-141847 HEAD 8303daac889854237207e7caefaea94fee0b87f2 builds fine. It only hurts bisectibility. All error/warnings (new ones prefixed by >>): drivers/net/wireless/ath/wcn36xx/main.c: In function 'wcn36xx_set_key': >> drivers/net/wireless/ath/wcn36xx/main.c:389:9: error: implicit declaration >> of function 'wcn36xx_sta_to_priv' [-Werror=implicit-function-declaration] struct wcn36xx_sta *sta_priv = wcn36xx_sta_to_priv(sta); ^ >> drivers/net/wireless/ath/wcn36xx/main.c:389:33: warning: initialization >> makes pointer from integer without a cast struct wcn36xx_sta *sta_priv = wcn36xx_sta_to_priv(sta); ^ cc1: some warnings being treated as errors vim +/wcn36xx_sta_to_priv +389 drivers/net/wireless/ath/wcn36xx/main.c 383 struct ieee80211_vif *vif, 384 struct ieee80211_sta *sta, 385 struct ieee80211_key_conf *key_conf) 386 { 387 struct wcn36xx *wcn = hw->priv; 388 struct wcn36xx_vif *vif_priv = wcn36xx_vif_to_priv(vif); > 389 struct wcn36xx_sta *sta_priv = wcn36xx_sta_to_priv(sta); 390 int ret = 0; 391 u8 key[WLAN_MAX_KEY_LEN]; 392 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
[PATCH] qed: initialize return rc to avoid returning garbage
From: Colin Ian King in the case where qed_slowpath_irq_req is not called, rc is not assigned and so qed_int_igu_enable will return a garbage value. Fix this by initializing rc to 0. Signed-off-by: Colin Ian King --- drivers/net/ethernet/qlogic/qed/qed_int.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/qlogic/qed/qed_int.c b/drivers/net/ethernet/qlogic/qed/qed_int.c index ffd0acc..2017b01 100644 --- a/drivers/net/ethernet/qlogic/qed/qed_int.c +++ b/drivers/net/ethernet/qlogic/qed/qed_int.c @@ -2750,7 +2750,7 @@ void qed_int_igu_enable_int(struct qed_hwfn *p_hwfn, int qed_int_igu_enable(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt, enum qed_int_mode int_mode) { - int rc; + int rc = 0; /* Configure AEU signal change to produce attentions */ qed_wr(p_hwfn, p_ptt, IGU_REG_ATTENTION_ENABLE, 0); -- 2.7.4
Re: [PATCH] rtl8xxxu: fix uninitialized return value in ret
Colin King writes: > From: Colin Ian King > > several functions are not initializing a return status in ret > resulting in garbage to be returned instead of 0 for success. > Currently, the calls to these functions are not checking the > return, however, it seems prudent to return the correct status > in case they are to be checked at a later date. > > Signed-off-by: Colin Ian King > --- > drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) Thanks for the patch! I'm surprised the compiler didn't warn about this. I'll add it to my queue for rtl8xxxu. Cheers, Jes > > diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.c > b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.c > index abdff45..9262aad 100644 > --- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.c > +++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.c > @@ -5231,7 +5231,7 @@ static void rtl8xxxu_set_ampdu_min_space(struct > rtl8xxxu_priv *priv, u8 density) > static int rtl8xxxu_active_to_emu(struct rtl8xxxu_priv *priv) > { > u8 val8; > - int count, ret; > + int count, ret = 0; > > /* Start of rtl8723AU_card_enable_flow */ > /* Act to Cardemu sequence*/ > @@ -5281,7 +5281,7 @@ static int rtl8723bu_active_to_emu(struct rtl8xxxu_priv > *priv) > u8 val8; > u16 val16; > u32 val32; > - int count, ret; > + int count, ret = 0; > > /* Turn off RF */ > rtl8xxxu_write8(priv, REG_RF_CTRL, 0); > @@ -5338,7 +5338,7 @@ static int rtl8xxxu_active_to_lps(struct rtl8xxxu_priv > *priv) > { > u8 val8; > u8 val32; > - int count, ret; > + int count, ret = 0; > > rtl8xxxu_write8(priv, REG_TXPAUSE, 0xff);
[PATCH] rtl8xxxu: fix uninitialized return value in ret
From: Colin Ian King several functions are not initializing a return status in ret resulting in garbage to be returned instead of 0 for success. Currently, the calls to these functions are not checking the return, however, it seems prudent to return the correct status in case they are to be checked at a later date. Signed-off-by: Colin Ian King --- drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.c b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.c index abdff45..9262aad 100644 --- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.c +++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.c @@ -5231,7 +5231,7 @@ static void rtl8xxxu_set_ampdu_min_space(struct rtl8xxxu_priv *priv, u8 density) static int rtl8xxxu_active_to_emu(struct rtl8xxxu_priv *priv) { u8 val8; - int count, ret; + int count, ret = 0; /* Start of rtl8723AU_card_enable_flow */ /* Act to Cardemu sequence*/ @@ -5281,7 +5281,7 @@ static int rtl8723bu_active_to_emu(struct rtl8xxxu_priv *priv) u8 val8; u16 val16; u32 val32; - int count, ret; + int count, ret = 0; /* Turn off RF */ rtl8xxxu_write8(priv, REG_RF_CTRL, 0); @@ -5338,7 +5338,7 @@ static int rtl8xxxu_active_to_lps(struct rtl8xxxu_priv *priv) { u8 val8; u8 val32; - int count, ret; + int count, ret = 0; rtl8xxxu_write8(priv, REG_TXPAUSE, 0xff); -- 2.7.4
Re: [PATCH net-next v2 2/2] net: dsa: mv88e6xxx: Clear the PDOWN bit on setup
On Tue, Mar 29, 2016 at 12:23:06PM -0400, Vivien Didelot wrote: > Hi Patrick, > > Two comments below. > > Patrick Uiterwijk writes: > > > +static int mv88e6xxx_power_on_serdes(struct dsa_switch *ds) > > Since this function assumes the SMI lock is already held, its name > should be prefixed with _ by convention (_mv88e6xxx_power_on_serdes). We decided to drop that, since nearly everything would end up with a _ prefix. The assert_smi_lock() should find any missing locks, and lockdep/deadlocks will make it clear when the lock is taken twice. Andrew
Re: [PATCH net-next v2 1/2] net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read,write}
Patrick Uiterwijk writes: > Add versions of the phy_page_read and _write functions to > be used in a context where the SMI mutex is held. > > Signed-off-by: Patrick Uiterwijk Tested-by: Vivien Didelot Reviewed-by: Vivien Didelot Thanks, Vivien
Re: [PATCH net-next v2 2/2] net: dsa: mv88e6xxx: Clear the PDOWN bit on setup
Hi Patrick, Two comments below. Patrick Uiterwijk writes: > +static int mv88e6xxx_power_on_serdes(struct dsa_switch *ds) Since this function assumes the SMI lock is already held, its name should be prefixed with _ by convention (_mv88e6xxx_power_on_serdes). > +{ > + int ret; > + > + ret = _mv88e6xxx_phy_page_read(ds, REG_FIBER_SERDES, PAGE_FIBER_SERDES, > +MII_BMCR); > + if (ret < 0) > + return ret; > + > + if (ret & BMCR_PDOWN) { > + ret = ret & ~BMCR_PDOWN; ret &= ~BMCR_PDOWN; > + ret = _mv88e6xxx_phy_page_write(ds, REG_FIBER_SERDES, > + PAGE_FIBER_SERDES, MII_BMCR, > + ret); > + } > + > + return ret; > +} Thanks, Vivien
Re: [PATCH] bus: mvebu-mbus: use %pad to print phys_addr_t
Hi Arnd, On mar., mars 15 2016, Arnd Bergmann wrote: > A recent change to the mbus driver added a warning printk that > prints a phys_addr_t using the %x format string, which fails in > case we build with 64-bit phys_addr_t: > > drivers/bus/mvebu-mbus.c: In function 'mvebu_mbus_get_dram_win_info': > drivers/bus/mvebu-mbus.c:975:9: error: format '%x' expects argument of type > 'unsigned int', but argument 2 has type 'phys_addr_t {aka long long unsigned > int}' [-Werror=format=] > > This uses the special %pa format string instead, so we always > print the correct type. > > Signed-off-by: Arnd Bergmann > Fixes: f2900acea801 ("bus: mvebu-mbus: provide api for obtaining IO > and DRAM window information") What is the status of this patch? Do you plan to send a second version with the title fixed as suggested by Joe Perches? Also do you expect that I collect this patch in the mvebu subsystem? Thanks, Gregory > --- > The warning was introduced in today's linux-next through the net-next > tree, please apply this fixup on top. > > drivers/bus/mvebu-mbus.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/bus/mvebu-mbus.c b/drivers/bus/mvebu-mbus.c > index c2e52864bb03..ce54a0160faa 100644 > --- a/drivers/bus/mvebu-mbus.c > +++ b/drivers/bus/mvebu-mbus.c > @@ -972,7 +972,7 @@ int mvebu_mbus_get_dram_win_info(phys_addr_t phyaddr, u8 > *target, u8 *attr) > } > } > > - pr_err("invalid dram address 0x%x\n", phyaddr); > + pr_err("invalid dram address %pa\n", &phyaddr); > return -EINVAL; > } > EXPORT_SYMBOL_GPL(mvebu_mbus_get_dram_win_info); > -- > 2.7.0 > -- Gregory Clement, Free Electrons Kernel, drivers, real-time and embedded Linux development, consulting, training and support. http://free-electrons.com
[PATCH net] ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates
From: Eric Dumazet IPv6 counters updates use a different macro than IPv4. Fixes: 36cbb2452cbaf ("udp: Increment UDP_MIB_IGNOREDMULTI for arriving unmatched multicasts") Signed-off-by: Eric Dumazet Cc: Rick Jones Cc: Willem de Bruijn --- net/ipv6/udp.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index fd25e447a5fa..8125931106be 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -843,8 +843,8 @@ start_lookup: flush_stack(stack, count, skb, count - 1); } else { if (!inner_flushed) - UDP_INC_STATS_BH(net, UDP_MIB_IGNOREDMULTI, -proto == IPPROTO_UDPLITE); + UDP6_INC_STATS_BH(net, UDP_MIB_IGNOREDMULTI, + proto == IPPROTO_UDPLITE); consume_skb(skb); } return 0;
RE: [PATCH] net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)
Hi, Netanel, +into 5 levels and assignes interrupt delay value to each level. Should be: assigns +The ENA device AQ and AENQ are allocated on probe and freed ontermination. Should be: on termination. + /* commit previously loaded firmare */ Should be: firmware +static int ena_com_hash_key_destroy(struct ena_com_dev *ena_dev) +{ + struct ena_rss *rss = &ena_dev->rss; + + if (rss->hash_key) + dma_free_coherent(ena_dev->dmadev, + sizeof(*rss->hash_key), + rss->hash_key, + rss->hash_key_dma_addr); + rss->hash_key = NULL; + return 0; +} This method always returns 0. +static int ena_com_hash_ctrl_init(struct ena_com_dev *ena_dev) +{ + struct ena_rss *rss = &ena_dev->rss; + + rss->hash_ctrl = dma_alloc_coherent(ena_dev->dmadev, + sizeof(*rss->hash_ctrl), + &rss->hash_ctrl_dma_addr, + GFP_KERNEL | __GFP_ZERO); + + return 0; +} + This method also always returns 0. +static int ena_com_hash_ctrl_destroy(struct ena_com_dev *ena_dev) +{ + struct ena_rss *rss = &ena_dev->rss; + + if (rss->hash_ctrl) + dma_free_coherent(ena_dev->dmadev, + sizeof(*rss->hash_ctrl), + rss->hash_ctrl, + rss->hash_ctrl_dma_addr); + rss->hash_ctrl = NULL; + + return 0; +} + This method also always returns 0. +static int ena_com_indirect_table_destroy(struct ena_com_dev *ena_dev) +{ + struct ena_rss *rss = &ena_dev->rss; + size_t tbl_size = (1 << rss->tbl_log_size) * + sizeof(struct ena_admin_rss_ind_table_entry); + + if (rss->rss_ind_tbl) + dma_free_coherent(ena_dev->dmadev, + tbl_size, + rss->rss_ind_tbl, + rss->rss_ind_tbl_dma_addr); + rss->rss_ind_tbl = NULL; + + if (rss->host_rss_ind_tbl) + devm_kfree(ena_dev->dmadev, rss->host_rss_ind_tbl); + rss->host_rss_ind_tbl = NULL; + + return 0; +} This method also always returns 0. 
+int ena_com_rss_destroy(struct ena_com_dev *ena_dev) +{ + ena_com_indirect_table_destroy(ena_dev); + ena_com_hash_key_destroy(ena_dev); + ena_com_hash_ctrl_destroy(ena_dev); + + memset(&ena_dev->rss, 0x0, sizeof(ena_dev->rss)); + + return 0; +} This method also always returns 0. Regards, Rami Rosen Intel Corporation
[net-next] bond: set mac address only if necessary
A bond device gets its mac address from the first slave device; it's not necessary to set the slave device's mac address to the bond's if they are already equal. Signed-off-by: Zhang Shengju --- drivers/net/bonding/bond_main.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 97fad05..6569219 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1473,8 +1473,9 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) */ ether_addr_copy(new_slave->perm_hwaddr, slave_dev->dev_addr); - if (!bond->params.fail_over_mac || - BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) { + if ((!bond->params.fail_over_mac || + BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) && + !ether_addr_equal_64bits(bond_dev->dev_addr, slave_dev->dev_addr)) { /* Set slave to master's mac address. The application already * set the master's mac address to that of the first slave */ -- 1.8.3.1
Re: bpf: net/core/filter.c:2115 suspicious rcu_dereference_protected() usage!
On 03/29/2016 03:55 PM, Daniel Borkmann wrote: [ dropping my old email address ] On 03/29/2016 02:58 PM, Michal Kubecek wrote: On Mon, Feb 22, 2016 at 10:31:33AM -0500, Sasha Levin wrote: I've hit the following warning while fuzzing with trinity inside a kvmtool guest running the latest -next kernel: [ 1343.104588] === [ 1343.104591] [ INFO: suspicious RCU usage. ] [ 1343.104619] 4.5.0-rc4-next-20160219-sasha-00026-g7978205-dirty #2978 Not tainted [ 1343.104624] --- [ 1343.104635] net/core/filter.c:2115 suspicious rcu_dereference_protected() usage! [ 1343.104641] [ 1343.104641] other info that might help us debug this: [ 1343.104641] [ 1343.104650] [ 1343.104650] rcu_scheduler_active = 1, debug_locks = 0 [ 1343.104660] 1 lock held by syz-executor/17916: [ 1343.104784] #0: (rtnl_mutex){+.+.+.}, at: rtnl_lock (net/core/rtnetlink.c:71) [ 1343.104789] [ 1343.104789] stack backtrace: [ 1343.104820] CPU: 1 PID: 17916 Comm: trinity-c8 Not tainted 4.5.0-rc4-next-20160219-sasha-00026-g7978205-dirty #2978 [ 1343.104868] 110036968f44 8801b4b47aa8 a23d9a9d 0001 [ 1343.104891] fbfff5c2a630 41b58ab3 adb3a8f2 a23d9905 [ 1343.104914] 8801b5419b40 fbfff7596522 0001 [ 1343.104919] Call Trace: [ 1343.104985] dump_stack (lib/dump_stack.c:53) [ 1343.105060] lockdep_rcu_suspicious (kernel/locking/lockdep.c:4282) [ 1343.105093] sk_detach_filter (net/core/filter.c:2114 (discriminator 5)) [ 1343.105193] tun_detach_filter (drivers/net/tun.c:1808 (discriminator 7)) [ 1343.105238] __tun_chr_ioctl (drivers/net/tun.c:2133) [ 1343.105370] tun_chr_ioctl (drivers/net/tun.c:2161) [ 1343.105407] do_vfs_ioctl (fs/ioctl.c:44 fs/ioctl.c:674) [ 1343.105506] SyS_ioctl (fs/ioctl.c:689 fs/ioctl.c:680) [ 1343.105542] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:200) Looks like sk_detach_filter() wants the socket to be owned, which tun_detach_filter() does not do, unlike sock_setsockopt(). Could you check if the patch below helps?
I'm also not really sure if it is safe to ignore return value of sk_detach_filter() and just sets tun->filter_attached to false - but it's not really clear what should be done if one of the calls fails after some succeeded. Wrt return value, afaik SOCK_FILTER_LOCKED cannot be set for tun devs, so we should be okay. diff --git a/drivers/net/tun.c b/drivers/net/tun.c index afdf950617c3..7417d7c20bab 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1818,11 +1818,13 @@ static int set_offload(struct tun_struct *tun, unsigned long arg) static void tun_detach_filter(struct tun_struct *tun, int n) { int i; -struct tun_file *tfile; for (i = 0; i < n; i++) { -tfile = rtnl_dereference(tun->tfiles[i]); -sk_detach_filter(tfile->socket.sk); +struct sock *sk = rtnl_dereference(tun->tfiles[i])->socket.sk; + +lock_sock(sk); +sk_detach_filter(sk); +release_sock(sk); } tun->filter_attached = false; In tun case, the control path for tun_attach_filter() and tun_detach_filter() is under RTNL lock (held in __tun_chr_ioctl()). So in the BPF core the rcu_dereference_protected(, sock_owned_by_user(sk)) looks like a false positive in this specific use case to me, that we should probably just silence. Running the filter via sk_filter() in tun device happens under rcu_read_lock(), so the dereference and assignment pair seems okay to me. Was wondering whether we should convert this to unattached BPF filter, but this would break with existing expectations from sk_filter() (e.g. security modules). 
If we want to silence it, could be something like the below (only compile-tested): drivers/net/tun.c | 8 +--- include/linux/filter.h | 4 net/core/filter.c | 33 + 3 files changed, 30 insertions(+), 15 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index afdf950..510e90a 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -622,7 +622,8 @@ static int tun_attach(struct tun_struct *tun, struct file *file, bool skip_filte /* Re-attach the filter to persist device */ if (!skip_filter && (tun->filter_attached == true)) { - err = sk_attach_filter(&tun->fprog, tfile->socket.sk); + err = __sk_attach_filter(&tun->fprog, tfile->socket.sk, +lockdep_rtnl_is_held()); if (!err) goto out; } @@ -1822,7 +1823,7 @@ static void tun_detach_filter(struct tun_struct *tun, int n) for (i = 0; i < n; i++) { tfile = rtnl_dereference(tun->tfiles[i]); - sk_detach_filter(tfile->socket.sk); + __sk_detach_filter(tfile->socket.sk, lockdep_rtnl_is_held()); } tun->filter_attached = false; @@ -1835,7 +1836,8 @@ static int tun_attac
Re: [net-next RFC 0/4] SO_BINDTOPREFIX
On Wed, 2016-03-23 at 02:26 +, Gilberto Bertin wrote: > Since the net-next window just opened, I'm resubmitting my RFC for the > SO_BINDTOSUBNET patch, following Mark Smith's suggestion to rename the > whole thing to a more clear SO_BINDTOPREFIX. Please do not add such a monolithic option. BPF is absolutely the way to go here, as it allows for whatever user-specified tweaks, like a list of destination subnetworks, and/or a list of source networks, or the date/time of the day, or port knocking without netfilter, or ... you name it. Simply add an option to load a BPF filter on a socket, used to vary the various compute_score() functions. No hard-coded knowledge in the kernel, but a generic interface.
Re: bpf: net/core/filter.c:2115 suspicious rcu_dereference_protected() usage!
[ dropping my old email address ] On 03/29/2016 02:58 PM, Michal Kubecek wrote: On Mon, Feb 22, 2016 at 10:31:33AM -0500, Sasha Levin wrote: I've hit the following warning while fuzzing with trinity inside a kvmtool guest running the latest -next kernel: [ 1343.104588] === [ 1343.104591] [ INFO: suspicious RCU usage. ] [ 1343.104619] 4.5.0-rc4-next-20160219-sasha-00026-g7978205-dirty #2978 Not tainted [ 1343.104624] --- [ 1343.104635] net/core/filter.c:2115 suspicious rcu_dereference_protected() usage! [ 1343.104641] [ 1343.104641] other info that might help us debug this: [ 1343.104641] [ 1343.104650] [ 1343.104650] rcu_scheduler_active = 1, debug_locks = 0 [ 1343.104660] 1 lock held by syz-executor/17916: [ 1343.104784] #0: (rtnl_mutex){+.+.+.}, at: rtnl_lock (net/core/rtnetlink.c:71) [ 1343.104789] [ 1343.104789] stack backtrace: [ 1343.104820] CPU: 1 PID: 17916 Comm: trinity-c8 Not tainted 4.5.0-rc4-next-20160219-sasha-00026-g7978205-dirty #2978 [ 1343.104868] 110036968f44 8801b4b47aa8 a23d9a9d 0001 [ 1343.104891] fbfff5c2a630 41b58ab3 adb3a8f2 a23d9905 [ 1343.104914] 8801b5419b40 fbfff7596522 0001 [ 1343.104919] Call Trace: [ 1343.104985] dump_stack (lib/dump_stack.c:53) [ 1343.105060] lockdep_rcu_suspicious (kernel/locking/lockdep.c:4282) [ 1343.105093] sk_detach_filter (net/core/filter.c:2114 (discriminator 5)) [ 1343.105193] tun_detach_filter (drivers/net/tun.c:1808 (discriminator 7)) [ 1343.105238] __tun_chr_ioctl (drivers/net/tun.c:2133) [ 1343.105370] tun_chr_ioctl (drivers/net/tun.c:2161) [ 1343.105407] do_vfs_ioctl (fs/ioctl.c:44 fs/ioctl.c:674) [ 1343.105506] SyS_ioctl (fs/ioctl.c:689 fs/ioctl.c:680) [ 1343.105542] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:200) Looks like sk_detach_filter() wants the socket to be owned, which tun_detach_filter() does not do, unlike sock_setsockopt(). Could you check if the patch below helps?
I'm also not really sure if it is safe to ignore return value of sk_detach_filter() and just sets tun->filter_attached to false - but it's not really clear what should be done if one of the calls fails after some succeeded. Wrt return value, afaik SOCK_FILTER_LOCKED cannot be set for tun devs, so we should be okay. diff --git a/drivers/net/tun.c b/drivers/net/tun.c index afdf950617c3..7417d7c20bab 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1818,11 +1818,13 @@ static int set_offload(struct tun_struct *tun, unsigned long arg) static void tun_detach_filter(struct tun_struct *tun, int n) { int i; - struct tun_file *tfile; for (i = 0; i < n; i++) { - tfile = rtnl_dereference(tun->tfiles[i]); - sk_detach_filter(tfile->socket.sk); + struct sock *sk = rtnl_dereference(tun->tfiles[i])->socket.sk; + + lock_sock(sk); + sk_detach_filter(sk); + release_sock(sk); } tun->filter_attached = false; In tun case, the control path for tun_attach_filter() and tun_detach_filter() is under RTNL lock (held in __tun_chr_ioctl()). So in the BPF core the rcu_dereference_protected(, sock_owned_by_user(sk)) looks like a false positive in this specific use case to me, that we should probably just silence. Running the filter via sk_filter() in tun device happens under rcu_read_lock(), so the dereference and assignment pair seems okay to me. Was wondering whether we should convert this to unattached BPF filter, but this would break with existing expectations from sk_filter() (e.g. security modules).
[PATCH] sctp: use list_* in sctp_list_dequeue
Use list_* helpers in sctp_list_dequeue, more readable. Signed-off-by: Marcelo Ricardo Leitner --- include/net/sctp/sctp.h | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h index 65521cfdcadeee35d61f280165a387cc2164ab6d..03fb33efcae21d54192204629ff4ced2e36e7d4d 100644 --- a/include/net/sctp/sctp.h +++ b/include/net/sctp/sctp.h @@ -386,11 +386,9 @@ static inline struct list_head *sctp_list_dequeue(struct list_head *list) { struct list_head *result = NULL; - if (list->next != list) { + if (!list_empty(list)) { result = list->next; - list->next = result->next; - list->next->prev = list; - INIT_LIST_HEAD(result); + list_del_init(result); } return result; } -- 2.5.0
[PATCH 0/2] sctp: delay calls to sk_data_ready() as much as possible
1st patch is a preparation for the 2nd. The idea is to not call ->sk_data_ready() for every data chunk processed while processing packets but only once before releasing the socket. Marcelo Ricardo Leitner (2): sctp: compress bit-wide flags to a bitfield on sctp_sock sctp: delay calls to sk_data_ready() as much as possible include/net/sctp/structs.h | 13 +++-- net/sctp/sm_sideeffect.c | 5 + net/sctp/ulpqueue.c| 4 ++-- 3 files changed, 14 insertions(+), 8 deletions(-) -- 2.5.0
[PATCH 1/2] sctp: compress bit-wide flags to a bitfield on sctp_sock
It wastes space and gets worse as we add new flags, so convert bit-wide flags to a bitfield. Currently it already saves 4 bytes in sctp_sock. Note that do_auto_asconf cannot be merged, as explained in the comment before it. Signed-off-by: Marcelo Ricardo Leitner --- include/net/sctp/structs.h | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h index 6df1ce7a411c548bda4163840a90578b6e1b4cfe..1a6a626904bba4223b7921bbb4be41c2550271a7 100644 --- a/include/net/sctp/structs.h +++ b/include/net/sctp/structs.h @@ -210,14 +210,14 @@ struct sctp_sock { int user_frag; __u32 autoclose; - __u8 nodelay; - __u8 disable_fragments; - __u8 v4mapped; - __u8 frag_interleave; __u32 adaptation_ind; __u32 pd_point; - __u8 recvrcvinfo; - __u8 recvnxtinfo; + __u16 nodelay:1, + disable_fragments:1, + v4mapped:1, + frag_interleave:1, + recvrcvinfo:1, + recvnxtinfo:1; atomic_t pd_mode; /* Receive to here while partial delivery is in effect. */ -- 2.5.0
[PATCH 2/2] sctp: delay calls to sk_data_ready() as much as possible
Currently processing of multiple chunks in a single SCTP packet leads to multiple calls to sk_data_ready, causing multiple wake up signals which are costly and don't make it wake up any faster. With this patch it will note that the wake up is pending and will do it before leaving the state machine interpreter, the latest place possible to do it reliably and cleanly. Note that sk_data_ready events are not dependent on asocs, unlike waking up writers. Signed-off-by: Marcelo Ricardo Leitner --- include/net/sctp/structs.h | 3 ++- net/sctp/sm_sideeffect.c | 5 +++++ net/sctp/ulpqueue.c | 4 ++-- 3 files changed, 9 insertions(+), 3 deletions(-) diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h index 1a6a626904bba4223b7921bbb4be41c2550271a7..21cb11107e378b4da1e7efde22fab4349496e35a 100644 --- a/include/net/sctp/structs.h +++ b/include/net/sctp/structs.h @@ -217,7 +217,8 @@ struct sctp_sock { v4mapped:1, frag_interleave:1, recvrcvinfo:1, - recvnxtinfo:1; + recvnxtinfo:1, + pending_data_ready:1; atomic_t pd_mode; /* Receive to here while partial delivery is in effect. 
*/ diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c index 7fe56d0acabf66cfd8fe29dfdb45f7620b470ac7..e7042f9ce63b0cfca50cae252f51b60b68cb5731 100644 --- a/net/sctp/sm_sideeffect.c +++ b/net/sctp/sm_sideeffect.c @@ -1742,6 +1742,11 @@ out: error = sctp_outq_uncork(&asoc->outqueue, gfp); } else if (local_cork) error = sctp_outq_uncork(&asoc->outqueue, gfp); + + if (sctp_sk(ep->base.sk)->pending_data_ready) { + ep->base.sk->sk_data_ready(ep->base.sk); + sctp_sk(ep->base.sk)->pending_data_ready = 0; + } return error; nomem: error = -ENOMEM; diff --git a/net/sctp/ulpqueue.c b/net/sctp/ulpqueue.c index ce469d648ffbe166f9ae1c5650f481256f31a7f8..72e5b3e41cddf9d79371de8ab01484e4601b97b6 100644 --- a/net/sctp/ulpqueue.c +++ b/net/sctp/ulpqueue.c @@ -264,7 +264,7 @@ int sctp_ulpq_tail_event(struct sctp_ulpq *ulpq, struct sctp_ulpevent *event) sctp_ulpq_clear_pd(ulpq); if (queue == &sk->sk_receive_queue) - sk->sk_data_ready(sk); + sctp_sk(sk)->pending_data_ready = 1; return 1; out_free: @@ -1140,5 +1140,5 @@ void sctp_ulpq_abort_pd(struct sctp_ulpq *ulpq, gfp_t gfp) /* If there is data waiting, send it up the socket now. */ if (sctp_ulpq_clear_pd(ulpq) || ev) - sk->sk_data_ready(sk); + sctp_sk(sk)->pending_data_ready = 1; } -- 2.5.0
[PATCH] sctp: avoid refreshing heartbeat timer too often
Currently on high rate SCTP streams the heartbeat timer refresh can consume quite a lot of resources as timer updates are costly and it contains a random factor, which a) is also costly and b) invalidates the mod_timer() optimization of not updating a timer to the same value. It may even cause the timer to be slightly advanced, for no good reason. There are 3 places that call sctp_transport_reset_timers(), including one that calls it for every single chunk added to a packet on sctp_outq_flush(), even if they are bundled. As there is the possibility of working on several transports, it's not trivial to just postpone this call, as it would require at least remembering which transports were used. This change then moves the random factor outside sctp_transport_timeout(), allowing sctp_transport_reset_timers() to check if a timer update is really necessary. For that, the random factor is considered 0. If the timer's current expiry is still after it, the update is not necessary, as the current expiry is a possible random value and there is no security/functional loss in keeping it. 
On loopback with MTU of 65535 and data chunks with 1636, so that we have a considerable amount of chunks without stressing system calls, netperf -t SCTP_STREAM -l 30, perf looked like this before: Samples: 103K of event 'cpu-clock', Event count (approx.): 2583300 Overhead Command Shared Object Symbol +6,15% netperf [kernel.vmlinux] [k] copy_user_enhanced_fast_string -5,43% netperf [kernel.vmlinux] [k] _raw_write_unlock_irqrestore - _raw_write_unlock_irqrestore - 96,54% _raw_spin_unlock_irqrestore - 36,14% mod_timer + 97,24% sctp_transport_reset_timers + 2,76% sctp_do_sm + 33,65% __wake_up_sync_key + 28,77% sctp_ulpq_tail_event + 1,40% del_timer - 1,84% mod_timer + 99,03% sctp_transport_reset_timers + 0,97% sctp_do_sm + 1,50% sctp_ulpq_tail_event And after this patch: Samples: 120K of event 'cpu-clock', Event count (approx.): 3000225 Overhead Command Shared Object Symbol +7,34% netperf [kernel.vmlinux] [k] memcpy_erms +6,67% netperf [kernel.vmlinux] [k] copy_user_enhanced_fast_string -5,34% netperf [kernel.vmlinux] [k] _raw_write_unlock_irqrestore - _raw_write_unlock_irqrestore - 98,00% _raw_spin_unlock_irqrestore + 54,24% __wake_up_sync_key + 41,54% sctp_ulpq_tail_event - 2,61% mod_timer + 79,88% sctp_transport_update_pmtu + 20,12% sctp_do_sm + 1,61% del_timer + 1,83% sctp_ulpq_tail_event +2,34% netperf [kernel.vmlinux] [k] pvclock_clocksource_read +2,31% netperf [sctp] [k] sctp_sendmsg +1,93% netperf [kernel.vmlinux] [k] __slab_free +1,92% netperf [sctp] [k] sctp_packet_transmit +1,85% netperf [kernel.vmlinux] [k] kfree Signed-off-by: Marcelo Ricardo Leitner --- net/sctp/transport.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/net/sctp/transport.c b/net/sctp/transport.c index 9b6b48c7524e4b441a151b80f0babec81f539d49..c9d6c61f5a511f96f38a0f2c275e418e158e0632 100644 --- a/net/sctp/transport.c +++ b/net/sctp/transport.c @@ -185,6 +185,8 @@ static void sctp_transport_destroy(struct sctp_transport *transport) */ void 
sctp_transport_reset_timers(struct sctp_transport *transport) { + unsigned long expires; + /* RFC 2960 6.3.2 Retransmission Timer Rules * * R1) Every time a DATA chunk is sent to any address(including a @@ -199,8 +201,10 @@ void sctp_transport_reset_timers(struct sctp_transport *transport) sctp_transport_hold(transport); /* When a data chunk is sent, reset the heartbeat interval. */ - if (!mod_timer(&transport->hb_timer, - sctp_transport_timeout(transport))) + expires = sctp_transport_timeout(transport); + if (time_before(transport->hb_timer.expires, expires) && + !mod_timer(&transport->hb_timer, + expires + prandom_u32_max(transport->rto))) sctp_transport_hold(transport); } @@ -595,7 +599,7 @@ void sctp_transport_burst_reset(struct sctp_transport *t) unsigned long sctp_transport_timeout(struct sctp_transport *trans) { /* RTO + timer slack +/- 50% of RTO */ - unsigned long timeout = (trans->rto >> 1) + prandom_u32_max(trans->rto); + unsigned long timeout = trans->rto >> 1; if (trans->state != SCTP_UNCONFIRMED && trans->state != SCTP_PF) -- 2.5.0
[PATCH] sctp: really allow using GFP_KERNEL on sctp_packet_transmit
Somehow my patch for commit cea8768f333e ("sctp: allow sctp_transmit_packet and others to use gfp") missed two important chunks, which are now added. Fixes: cea8768f333e ("sctp: allow sctp_transmit_packet and others to use gfp") Signed-off-by: Marcelo Ricardo Leitner --- net/sctp/output.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/net/sctp/output.c b/net/sctp/output.c index 736c004abfbc2787a3c50fa85168ebdf3b112787..97745351d58c2fb32b9f9b57d61831d7724d83b2 100644 --- a/net/sctp/output.c +++ b/net/sctp/output.c @@ -401,7 +401,7 @@ int sctp_packet_transmit(struct sctp_packet *packet, gfp_t gfp) sk = chunk->skb->sk; /* Allocate the new skb. */ - nskb = alloc_skb(packet->size + MAX_HEADER, GFP_ATOMIC); + nskb = alloc_skb(packet->size + MAX_HEADER, gfp); if (!nskb) goto nomem; @@ -523,8 +523,8 @@ int sctp_packet_transmit(struct sctp_packet *packet, gfp_t gfp) */ if (auth) sctp_auth_calculate_hmac(asoc, nskb, - (struct sctp_auth_chunk *)auth, - GFP_ATOMIC); +(struct sctp_auth_chunk *)auth, +gfp); /* 2) Calculate the Adler-32 checksum of the whole packet, *including the SCTP common header and all the -- 2.5.0
[PATCH] sctp: flush if we can't fit another DATA chunk
There is no point in delaying the packet if we can't fit a single byte of data on it anymore. So let's just reduce the threshold by the amount that a data chunk with 4 bytes (rounding) would use. Signed-off-by: Marcelo Ricardo Leitner --- net/sctp/output.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/sctp/output.c b/net/sctp/output.c index 97745351d58c2fb32b9f9b57d61831d7724d83b2..c518569123ce42a8f21f80754756306c39875013 100644 --- a/net/sctp/output.c +++ b/net/sctp/output.c @@ -705,7 +705,8 @@ static sctp_xmit_t sctp_packet_can_append_data(struct sctp_packet *packet, /* Check whether this chunk and all the rest of pending data will fit * or delay in hopes of bundling a full sized packet. */ - if (chunk->skb->len + q->out_qlen >= transport->pathmtu - packet->overhead) + if (chunk->skb->len + q->out_qlen > + maxsize - packet->overhead - sizeof(sctp_data_chunk_t) - 4) /* Enough data queued to fill a packet */ return SCTP_XMIT_OK; -- 2.5.0
[net-next RFC 0/4] SO_BINDTOPREFIX
Since the net-next window just opened, I'm resubmitting my RFC for the SO_BINDTOSUBNET patch, following Mark Smith's suggestion to rename the whole thing to a clearer SO_BINDTOPREFIX. Some arguments for and against it since the first submission:
* SO_BINDTOPREFIX is an arbitrary option and can be seen as another use case of the SO_REUSEPORT BPF patch
* but at the same time using BPF requires more work/code on the server, and since the bind-to-prefix use case could potentially become a common one, maybe there is some value in having it as an option instead of having to code (either manually or with clang) an eBPF program that would do the same
* it may be possible to achieve the same results using VRF. This would require creating a VRF device, configuring the device routing table and binding each process to a different VRF device (but I'm not sure how this would work/interfere with an existing iptables setup, for example)
This series introduces support for the SO_BINDTOPREFIX socket option, which allows a listener socket to bind to a prefix instead of * or a single address. Motivation: consider a set of servers, each one with thousands and thousands of IP addresses. Since assigning individual /32 or /128 IP addresses would be inefficient, one solution can be assigning prefixes using local routes (with 'ip route add local'). This allows a listener to listen and terminate connections going to any of the IP addresses of these prefixes without explicitly configuring all the IP addresses of the prefix range. This is very efficient. Unfortunately there may be the need to use different prefixes for different purposes. One can imagine port 80 being served by one HTTP server for some IP prefix, while another server is used for another prefix. Right now Linux does not allow this. It is either possible to bind to *, indicating ALL traffic going to a given port, or to individual IP addresses. The first only allows accepting connections from all the prefixes. 
The latter does not scale well with lots of IP addresses. Using bindtoprefix would solve this problem: just by adding a local route rule and setting the SO_BINDTOPREFIX option for a socket it would be possible to easily partition traffic by prefixes. API: the prefix is specified (as an argument of the setsockopt syscall) by the address of the network, and the prefix length of the netmask. IPv4: struct ipv4_prefix { __be32 net; u_char plen; }; and IPv6: struct ipv6_prefix { struct in6_addr net; u_char plen; }; Bind conflicts: two sockets with the bindtoprefix option enabled generate a bind conflict if their network addresses masked with the shortest of their prefixes are equal. The bindtoprefix option can be combined with soreuseport so that two listeners can bind on the same prefix. Any questions/feedback appreciated. Thanks, Gilberto Gilberto Bertin (4): bindtoprefix: infrastructure bindtoprefix: TCP/IPv4 implementation bindtoprefix: TCP/IPv6 implementation bindtoprefix: UDP implementation include/net/sock.h| 20 +++ include/uapi/asm-generic/socket.h | 1 + net/core/sock.c | 111 ++ net/ipv4/inet_connection_sock.c | 20 ++- net/ipv4/inet_hashtables.c| 9 net/ipv4/udp.c| 36 + net/ipv6/inet6_connection_sock.c | 17 +- net/ipv6/inet6_hashtables.c | 6 +++ 8 files changed, 218 insertions(+), 2 deletions(-) -- 2.7.3
[net-next RFC 3/4] bindtoprefix: TCP/IPv6 implementation
Signed-off-by: Gilberto Bertin --- net/ipv6/inet6_connection_sock.c | 17 - net/ipv6/inet6_hashtables.c | 6 ++ 2 files changed, 22 insertions(+), 1 deletion(-) diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c index 36c3f01..c65023f 100644 --- a/net/ipv6/inet6_connection_sock.c +++ b/net/ipv6/inet6_connection_sock.c @@ -27,6 +27,20 @@ #include #include +int inet6_csk_bind_prefix_conflict(const struct sock *sk, + const struct sock *sk2) +{ + u_char plen; + + plen = min(sk->sk_bind_prefix6.plen, sk2->sk_bind_prefix6.plen); + + if (sk->sk_bind_to_prefix && sk2->sk_bind_to_prefix) + return ipv6_prefix_equal(&sk->sk_bind_prefix6.net, +&sk2->sk_bind_prefix6.net, plen); + + return 0; +} + int inet6_csk_bind_conflict(const struct sock *sk, const struct inet_bind_bucket *tb, bool relax) { @@ -44,7 +58,8 @@ int inet6_csk_bind_conflict(const struct sock *sk, if (sk != sk2 && (!sk->sk_bound_dev_if || !sk2->sk_bound_dev_if || -sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) { +sk->sk_bound_dev_if == sk2->sk_bound_dev_if) && +inet6_csk_bind_prefix_conflict(sk, sk2)) { if ((!reuse || !sk2->sk_reuse || sk2->sk_state == TCP_LISTEN) && (!reuseport || !sk2->sk_reuseport || diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c index 21ace5a..bcc16ed 100644 --- a/net/ipv6/inet6_hashtables.c +++ b/net/ipv6/inet6_hashtables.c @@ -114,6 +114,12 @@ static inline int compute_score(struct sock *sk, struct net *net, return -1; score++; } + if (sk->sk_bind_to_prefix) { + if (!ipv6_prefix_equal(&sk->sk_bind_prefix6.net, daddr, + sk->sk_bind_prefix6.plen)) + return -1; + score++; + } if (sk->sk_incoming_cpu == raw_smp_processor_id()) score++; } -- 2.7.3
[net-next RFC 4/4] bindtoprefix: UDP implementation
Signed-off-by: Gilberto Bertin --- net/ipv4/udp.c | 36 1 file changed, 36 insertions(+) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 95d2f19..31b9687 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -133,6 +133,23 @@ EXPORT_SYMBOL(udp_memory_allocated); #define MAX_UDP_PORTS 65536 #define PORTS_PER_CHAIN (MAX_UDP_PORTS / UDP_HTABLE_SIZE_MIN) +static inline int udp_csk_bind_prefix_conflict(const struct sock *sk, + const struct sock *sk2) +{ + __be32 mask; + + if (sk->sk_bind_to_prefix && sk2->sk_bind_to_prefix) { + mask = inet_make_mask(min(sk->sk_bind_prefix4.plen, + sk2->sk_bind_prefix4.plen)); + + return (sk->sk_bind_prefix4.net & mask) == + (sk2->sk_bind_prefix4.net & mask); + } + + return 0; +} + + static int udp_lib_lport_inuse(struct net *net, __u16 num, const struct udp_hslot *hslot, unsigned long *bitmap, @@ -153,6 +170,7 @@ static int udp_lib_lport_inuse(struct net *net, __u16 num, (!sk2->sk_reuse || !sk->sk_reuse) && (!sk2->sk_bound_dev_if || !sk->sk_bound_dev_if || sk2->sk_bound_dev_if == sk->sk_bound_dev_if) && +udp_csk_bind_prefix_conflict(sk, sk2) && (!sk2->sk_reuseport || !sk->sk_reuseport || rcu_access_pointer(sk->sk_reuseport_cb) || !uid_eq(uid, sock_i_uid(sk2))) && @@ -189,6 +207,7 @@ static int udp_lib_lport_inuse2(struct net *net, __u16 num, (!sk2->sk_reuse || !sk->sk_reuse) && (!sk2->sk_bound_dev_if || !sk->sk_bound_dev_if || sk2->sk_bound_dev_if == sk->sk_bound_dev_if) && +udp_csk_bind_prefix_conflict(sk, sk2) && (!sk2->sk_reuseport || !sk->sk_reuseport || rcu_access_pointer(sk->sk_reuseport_cb) || !uid_eq(uid, sock_i_uid(sk2))) && @@ -426,6 +445,15 @@ static inline int compute_score(struct sock *sk, struct net *net, return -1; score += 4; } + + if (sk->sk_bind_to_prefix) { + __be32 mask = inet_make_mask(sk->sk_bind_prefix4.plen); + + if ((sk->sk_bind_prefix4.net & mask) != (daddr & mask)) + return -1; + score += 4; + } + if (sk->sk_incoming_cpu == raw_smp_processor_id()) score++; return score; @@ -471,6 +499,14 @@ static inline int 
compute_score2(struct sock *sk, struct net *net, score += 4; } + if (sk->sk_bind_to_prefix) { + __be32 mask = inet_make_mask(sk->sk_bind_prefix4.plen); + + if ((sk->sk_bind_prefix4.net & mask) != (daddr & mask)) + return -1; + score += 4; + } + if (sk->sk_incoming_cpu == raw_smp_processor_id()) score++; -- 2.7.3
[net-next RFC 1/4] bindtoprefix: infrastructure
Signed-off-by: Gilberto Bertin --- include/net/sock.h| 20 +++ include/uapi/asm-generic/socket.h | 1 + net/core/sock.c | 111 ++ 3 files changed, 132 insertions(+) diff --git a/include/net/sock.h b/include/net/sock.h index f5ea148..409d255 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -109,6 +109,16 @@ typedef struct { #endif } socket_lock_t; +struct ipv4_prefix { + __be32 net; + u_char plen; +}; + +struct ipv6_prefix { + struct in6_addr net; + u_char plen; +}; + struct sock; struct proto; struct net; @@ -176,6 +186,13 @@ struct sock_common { unsigned char skc_ipv6only:1; unsigned char skc_net_refcnt:1; int skc_bound_dev_if; + + unsigned char skc_bind_to_prefix; + union { + struct ipv4_prefix skc_bind_prefix4; + struct ipv6_prefix skc_bind_prefix6; + }; + union { struct hlist_node skc_bind_node; struct hlist_nulls_node skc_portaddr_node; @@ -327,6 +344,9 @@ struct sock { #define sk_state __sk_common.skc_state #define sk_reuse __sk_common.skc_reuse #define sk_reuseport __sk_common.skc_reuseport +#define sk_bind_to_prefix __sk_common.skc_bind_to_prefix +#define sk_bind_prefix4__sk_common.skc_bind_prefix4 +#define sk_bind_prefix6__sk_common.skc_bind_prefix6 #define sk_ipv6only__sk_common.skc_ipv6only #define sk_net_refcnt __sk_common.skc_net_refcnt #define sk_bound_dev_if__sk_common.skc_bound_dev_if diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h index fb8a416..b4dd61f 100644 --- a/include/uapi/asm-generic/socket.h +++ b/include/uapi/asm-generic/socket.h @@ -30,6 +30,7 @@ #define SO_SNDLOWAT19 #define SO_RCVTIMEO20 #define SO_SNDTIMEO21 +#define SO_BINDTOPREFIX 22 #endif /* Security levels - as per NRL IPv6 - don't actually do anything */ diff --git a/net/core/sock.c b/net/core/sock.c index 6c1c8bc..e4c9c55 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -571,6 +571,68 @@ out: return ret; } +static int sock_setbindtoprefix(struct sock *sk, char __user *optval, + int optlen) +{ + int ret = -ENOPROTOOPT; + + if 
(sk->sk_family == AF_INET) { + struct ipv4_prefix bind_prefix4; + + ret = -EFAULT; + if (optlen != sizeof(struct ipv4_prefix)) + goto out; + + if (copy_from_user(&bind_prefix4, optval, + sizeof(struct ipv4_prefix))) + goto out; + + ret = -EINVAL; + if (bind_prefix4.plen > 32) + goto out; + + lock_sock(sk); + + sk->sk_bind_to_prefix = 1; + sk->sk_bind_prefix4.net = bind_prefix4.net; + sk->sk_bind_prefix4.plen = bind_prefix4.plen; + sk_dst_reset(sk); + + release_sock(sk); + + ret = 0; + } else if (sk->sk_family == AF_INET6) { + struct ipv6_prefix bind_prefix6; + + ret = -EFAULT; + if (optlen != sizeof(struct ipv6_prefix)) + goto out; + + if (copy_from_user(&bind_prefix6, optval, + sizeof(struct ipv6_prefix))) + goto out; + + ret = -EINVAL; + if (bind_prefix6.plen > 128) + goto out; + + lock_sock(sk); + + sk->sk_bind_to_prefix = 1; + memcpy(&sk->sk_bind_prefix6.net, &bind_prefix6.net, + sizeof(struct in6_addr)); + sk->sk_bind_prefix6.plen = bind_prefix6.plen; + sk_dst_reset(sk); + + release_sock(sk); + + ret = 0; + } + +out: + return ret; +} + static int sock_getbindtodevice(struct sock *sk, char __user *optval, int __user *optlen, int len) { @@ -611,6 +673,49 @@ out: return ret; } +static int sock_getbindtoprefix(struct sock *sk, char __user *optval, + int __user *optlen, int len) +{ + int ret; + + if (sk->sk_bind_to_prefix == 0) { + len = 0; + goto zero; + } + + if (sk->sk_family == AF_INET) { + ret = -EINVAL; + if (len < sizeof(struct ipv4_prefix)) + goto out; + + len = sizeof(struct ipv4_prefix); + + ret = -EFAULT; + if (copy_to_user(optval, &sk->sk_bind_prefix4, len)) + goto out; + + } else if (sk->sk_family == AF_INET6) { + ret = -EINVAL; +
[net-next RFC 2/4] bindtoprefix: TCP/IPv4 implementation
Signed-off-by: Gilberto Bertin --- net/ipv4/inet_connection_sock.c | 20 +++- net/ipv4/inet_hashtables.c | 9 + 2 files changed, 28 insertions(+), 1 deletion(-) diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 6414891..162c252 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -15,6 +15,7 @@ #include #include +#include #include #include @@ -43,6 +44,22 @@ void inet_get_local_port_range(struct net *net, int *low, int *high) } EXPORT_SYMBOL(inet_get_local_port_range); +static inline int inet_csk_bind_prefix_conflict(const struct sock *sk, + const struct sock *sk2) +{ + __be32 mask; + + if (sk->sk_bind_to_prefix && sk2->sk_bind_to_prefix) { + mask = inet_make_mask(min(sk->sk_bind_prefix4.plen, + sk2->sk_bind_prefix4.plen)); + + return (sk->sk_bind_prefix4.net & mask) == + (sk2->sk_bind_prefix4.net & mask); + } + + return 0; +} + int inet_csk_bind_conflict(const struct sock *sk, const struct inet_bind_bucket *tb, bool relax) { @@ -63,7 +80,8 @@ int inet_csk_bind_conflict(const struct sock *sk, !inet_v6_ipv6only(sk2) && (!sk->sk_bound_dev_if || !sk2->sk_bound_dev_if || -sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) { +sk->sk_bound_dev_if == sk2->sk_bound_dev_if) && +inet_csk_bind_prefix_conflict(sk, sk2)) { if ((!reuse || !sk2->sk_reuse || sk2->sk_state == TCP_LISTEN) && (!reuseport || !sk2->sk_reuseport || diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index ccc5980..44693c4 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -13,6 +13,7 @@ * 2 of the License, or (at your option) any later version. 
*/ +#include #include #include #include @@ -189,6 +190,14 @@ static inline int compute_score(struct sock *sk, struct net *net, return -1; score += 4; } + if (sk->sk_bind_to_prefix) { + __be32 mask = inet_make_mask(sk->sk_bind_prefix4.plen); + + if ((sk->sk_bind_prefix4.net & mask) != (daddr & mask)) + return -1; + score += 4; + } + if (sk->sk_incoming_cpu == raw_smp_processor_id()) score++; } -- 2.7.3
Re: [PATCH net-next v3.16]r8169: Correct Set Vlan tag
On Tue, 2016-03-29 at 11:33 +0300, Corcodel Marian wrote: > This patch add set Vlan tag and flush CPlusCmd register because when unset > RxVlan and RxChkSum bit, whithout some explication , unwanted bits > is set, PCIDAC, PCIMulRW and others.Whithout this patch when run > ethtool -d eth0 on "C+ Command" field missing "VLAN de-tagging" Are you a bot or a human being ? It really looks like a program was assigned to fool us, sending random patches targeting r8169 and random linux kernels. We won't accept patches if they are not targeting current linux kernels.
Re: [PATCH 1/2 net-next v3.16]r8169: Disable set bit multicast enable per multicast address.
On Sun, 2016-03-27 at 19:31 +0300, Corcodel Marian wrote: > On Sat, 26 Mar 2016 10:18:46 -0700 > Eric Dumazet wrote: > > > On Sat, 2016-03-26 at 12:57 +0200, Corcodel Marian wrote: > > > This patch correct set bit multicast enable only once per > > > set_rx_mode invocation. > > > > > > Signed-off-by: Corcodel Marian > > > --- > > > drivers/net/ethernet/realtek/r8169.c | 3 +-- > > > 1 file changed, 1 insertion(+), 2 deletions(-) > > > > > > diff --git a/drivers/net/ethernet/realtek/r8169.c > > > b/drivers/net/ethernet/realtek/r8169.c index 7f6fb1f..f7b0dfb 100644 > > > --- a/drivers/net/ethernet/realtek/r8169.c > > > +++ b/drivers/net/ethernet/realtek/r8169.c > > > @@ -4619,12 +4619,11 @@ static void rtl_set_rx_mode(struct > > > net_device *dev) } else { > > > struct netdev_hw_addr *ha; > > > > > > - rx_mode = AcceptBroadcast | AcceptMyPhys; > > > + rx_mode = AcceptBroadcast | AcceptMyPhys | > > > AcceptMulticast; mc_filter[1] = mc_filter[0] = 0; > > > netdev_for_each_mc_addr(ha, dev) { > > > int bit_nr = ether_crc(ETH_ALEN, ha->addr) > > > >> 26; mc_filter[bit_nr >> 5] |= 1 << (bit_nr & 31); > > > - rx_mode |= AcceptMulticast; > > > } > > > } > > > > > > > If the list is empty, why should we enable AcceptMulticast ? > > > > > > > > I not experienced list empty, allways on my case exist bit_nr variable. If dev->mc list is empty, the netdev_for_each_mc_addr(ha, dev) does nothing at all. In other words, netdev_mc_count(dev) can be 0 Current code is fine, and run in control (slow) path anyway.
Re: [PATCH] mwifiex: add __GFP_REPEAT to skb allocation call
On Tue, 2016-03-29 at 17:27 +0800, Wei-Ning Huang wrote: > Adding some chromium devs to the thread. > > In, http://lxr.free-electrons.com/source/mm/page_alloc.c#L3152 > > The default mm retry allocation when 'order <= > PAGE_ALLOC_COSTLY_ORDER' of gfp_mask contains __GFP_REPEAT. > PAGE_ALLOC_COSTLY_ORDER is defined to be 3. On systems with page size > = 4K, this means memory compaction and retry is only done when the > size of allocation is <= 32K > In mwifiex, the allocation size is 64K. > When we have system with > memory fragmentation and allocation failed, there will be no retry. > This is why we need to add __GFP_REPEAT here to allow the system to > perform memory compaction and retry allocation. > > Maybe Amit@marvell can comment on if this is a good fix on this issue. > I'm also aware that marvell is the progress of implementing > scatter/gatter for mwifiex, which can also fix the issue. Before SG is implemented, you really need to copy incoming frames into smallest chunks (to get lowest skb->truesize) and leave the 64KB allocated stuff forever in the driver. __GFP_REPEAT wont really solve the issue. It seems the problem comes from the fact that the drivers calls dev_kfree_skb_any() after calling mwifiex_deaggr_sdio_pkt(), instead of recycling this very precious 64KB skb once memory gets fragmented. Another problem is that mwifiex_deaggr_sdio_pkt() uses mwifiex_alloc_dma_align_buf() with GFP_KERNEL | GFP_DMA Really GFP_DMA makes no sense here, since the skb is going to be processed by the stack, which has no such requirement. Please use normal skb allocations there. 
diff --git a/drivers/net/wireless/marvell/mwifiex/sdio.c b/drivers/net/wireless/marvell/mwifiex/sdio.c index b2c839a..8404db5 100644 --- a/drivers/net/wireless/marvell/mwifiex/sdio.c +++ b/drivers/net/wireless/marvell/mwifiex/sdio.c @@ -1123,8 +1123,8 @@ static void mwifiex_deaggr_sdio_pkt(struct mwifiex_adapter *adapter, __func__, pkt_len, blk_size); break; } - skb_deaggr = mwifiex_alloc_dma_align_buf(pkt_len, -GFP_KERNEL | GFP_DMA); + skb_deaggr = __netdev_alloc_skb_ip_align(NULL, pkt_len, +GFP_KERNEL); if (!skb_deaggr) break; skb_put(skb_deaggr, pkt_len);
Re: bpf: net/core/filter.c:2115 suspicious rcu_dereference_protected() usage!
On Mon, Feb 22, 2016 at 10:31:33AM -0500, Sasha Levin wrote: > > I've hit the following warning while fuzzing with trinity inside a kvmtool > guest > running the latest -next kernel: > > [ 1343.104588] === > [ 1343.104591] [ INFO: suspicious RCU usage. ] > [ 1343.104619] 4.5.0-rc4-next-20160219-sasha-00026-g7978205-dirty #2978 Not > tainted > [ 1343.104624] --- > [ 1343.104635] net/core/filter.c:2115 suspicious rcu_dereference_protected() > usage! > [ 1343.104641] > [ 1343.104641] other info that might help us debug this: > [ 1343.104641] > [ 1343.104650] > [ 1343.104650] rcu_scheduler_active = 1, debug_locks = 0 > [ 1343.104660] 1 lock held by syz-executor/17916: > [ 1343.104784] #0: (rtnl_mutex){+.+.+.}, at: rtnl_lock > (net/core/rtnetlink.c:71) > [ 1343.104789] > [ 1343.104789] stack backtrace: > [ 1343.104820] CPU: 1 PID: 17916 Comm: trinity-c8 Not tainted > 4.5.0-rc4-next-20160219-sasha-00026-g7978205-dirty #2978 > [ 1343.104868] 110036968f44 8801b4b47aa8 a23d9a9d > 0001 > [ 1343.104891] fbfff5c2a630 41b58ab3 adb3a8f2 > a23d9905 > [ 1343.104914] 8801b5419b40 fbfff7596522 > 0001 > [ 1343.104919] Call Trace: > [ 1343.104985] dump_stack (lib/dump_stack.c:53) > [ 1343.105060] lockdep_rcu_suspicious (kernel/locking/lockdep.c:4282) > [ 1343.105093] sk_detach_filter (net/core/filter.c:2114 (discriminator 5)) > [ 1343.105193] tun_detach_filter (drivers/net/tun.c:1808 (discriminator 7)) > [ 1343.105238] __tun_chr_ioctl (drivers/net/tun.c:2133) > [ 1343.105370] tun_chr_ioctl (drivers/net/tun.c:2161) > [ 1343.105407] do_vfs_ioctl (fs/ioctl.c:44 fs/ioctl.c:674) > [ 1343.105506] SyS_ioctl (fs/ioctl.c:689 fs/ioctl.c:680) > [ 1343.105542] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:200) Looks like sk_detach_filter() wants the socket to be owned, which tun_detach_filter() does not do, unlike sock_setsockopt(). Could you check if the patch below helps?
I'm also not really sure if it is safe to ignore the return value of sk_detach_filter() and just set tun->filter_attached to false - but it's not really clear what should be done if one of the calls fails after some succeeded. diff --git a/drivers/net/tun.c b/drivers/net/tun.c index afdf950617c3..7417d7c20bab 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1818,11 +1818,13 @@ static int set_offload(struct tun_struct *tun, unsigned long arg) static void tun_detach_filter(struct tun_struct *tun, int n) { int i; - struct tun_file *tfile; for (i = 0; i < n; i++) { - tfile = rtnl_dereference(tun->tfiles[i]); - sk_detach_filter(tfile->socket.sk); + struct sock *sk = rtnl_dereference(tun->tfiles[i])->socket.sk; + + lock_sock(sk); + sk_detach_filter(sk); + release_sock(sk); } tun->filter_attached = false;
Re: [PATCH 15/16] wcn36xx: don't pad beacons for mesh
Hello. On 3/29/2016 9:06 AM, Bjorn Andersson wrote: From: Jason Mobarak Patch "wcn36xx: Pad TIM PVM if needed" has caused a regression in mesh beaconing. scripts/checkpatch.pl now enforces the specific commit citing format, i.e. <12-digit SHA1> (""). The field tim_off is always 0 for mesh mode, and thus pvm_len (referring to the TIM length field) and pad are both incorrectly calculated. Thus, msg_body.beacon_length is incorrectly calculated for mesh mode. Fix this. Signed-off-by: Jason Mobarak Signed-off-by: Chun-Yeow Yeoh Signed-off-by: Bjorn Andersson You might want to add the Fixes: tag. [...] MBR, Sergei
Re: am335x: no multicast reception over VLAN
On 03/29/2016 03:35 PM, Yegor Yefremov wrote: > On Tue, Mar 29, 2016 at 1:05 PM, Grygorii Strashko > wrote: >> On 03/29/2016 08:21 AM, Yegor Yefremov wrote: >>> Hi Mugunthan, >>> >>> On Tue, Mar 29, 2016 at 6:00 AM, Mugunthan V N wrote: Hi Yegor On Wednesday 16 March 2016 08:35 PM, Yegor Yefremov wrote: > I have an am335x based board using CPSW in Dual EMAC mode. Without > VLAN IDs I can receive and send multicast packets [1]. When I create > VLAN ID: > > ip link add link eth1 name eth1.100 type vlan id 100 > ip addr add 192.168.100.2/24 brd 192.168.100.255 dev eth1.100 > route add -net 224.0.0.0 netmask 224.0.0.0 eth1.100 > > I can successfully send multicast packets, but not receive them. On > the other side of the Ethernet cable I've used Pandaboard. Pandaboard > could both receive and send multicast packets via VLAN. Are you trying multicast tx/rx on eth1 or eth1.100? >>> >>> I'm trying multicast tx/rx on eth1.100. >>> >>> eth1 has no problems. >>> >> >> it'd be nice if will be able to post here output fom commands: >> # switch-config -d [git://git.ti.com/switch-config/switch-config.git v4.1] >> # ifconfig -a >> # tcpdump -e -f -Q in -i eth0 >> # tcpdump -e -f -Q in -i eth0.100 > > Which kernel/branch do you want me to test? > > git://git.ti.com/ti-linux-kernel/ti-linux-kernel.git and ti-rt-linux-4.1.y? > > So far I was using vanilla kernel. Your branch (but better 4.5 kernels (or 4.4)). Just when you've done with configuration run cmds 1&2, and when you run your use-case - run cmds 2&3 on receiver side (grab ~5-10 packets). Then stop the test and run cmd 1 again. After all that, could you provide your console output here, pls. -- regards, -grygorii
Re: am335x: no multicast reception over VLAN
On Tue, Mar 29, 2016 at 1:05 PM, Grygorii Strashko wrote: > On 03/29/2016 08:21 AM, Yegor Yefremov wrote: >> Hi Mugunthan, >> >> On Tue, Mar 29, 2016 at 6:00 AM, Mugunthan V N wrote: >>> Hi Yegor >>> >>> On Wednesday 16 March 2016 08:35 PM, Yegor Yefremov wrote: I have an am335x based board using CPSW in Dual EMAC mode. Without VLAN IDs I can receive and send multicast packets [1]. When I create VLAN ID: ip link add link eth1 name eth1.100 type vlan id 100 ip addr add 192.168.100.2/24 brd 192.168.100.255 dev eth1.100 route add -net 224.0.0.0 netmask 224.0.0.0 eth1.100 I can successfully send multicast packets, but not receive them. On the other side of the Ethernet cable I've used Pandaboard. Pandaboard could both receive and send multicast packets via VLAN. >>> >>> Are you trying multicast tx/rx on eth1 or eth1.100? >> >> I'm trying multicast tx/rx on eth1.100. >> >> eth1 has no problems. >> > > it'd be nice if will be able to post here output fom commands: > # switch-config -d [git://git.ti.com/switch-config/switch-config.git v4.1] > # ifconfig -a > # tcpdump -e -f -Q in -i eth0 > # tcpdump -e -f -Q in -i eth0.100 Which kernel/branch do you want me to test? git://git.ti.com/ti-linux-kernel/ti-linux-kernel.git and ti-rt-linux-4.1.y? So far I was using vanilla kernel. Yegor
[PATCH net-next v2 2/2] net: dsa: mv88e6xxx: Clear the PDOWN bit on setup
Some of the vendor-specific bootloaders set up this part of the initialization for us, so this was never added. However, since upstream bootloaders don't initialize the chip specifically, they leave the fiber MII's PDOWN flag set, which means that the CPU port doesn't connect. This patch checks whether this flag has already been cleared by something else, and if not, clears it. Signed-off-by: Patrick Uiterwijk --- drivers/net/dsa/mv88e6xxx.c | 36 drivers/net/dsa/mv88e6xxx.h | 8 2 files changed, 44 insertions(+) diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c index 86a2029..9807913 100644 --- a/drivers/net/dsa/mv88e6xxx.c +++ b/drivers/net/dsa/mv88e6xxx.c @@ -2296,6 +2296,25 @@ restore_page_0: return ret; } +static int mv88e6xxx_power_on_serdes(struct dsa_switch *ds) +{ + int ret; + + ret = _mv88e6xxx_phy_page_read(ds, REG_FIBER_SERDES, PAGE_FIBER_SERDES, + MII_BMCR); + if (ret < 0) + return ret; + + if (ret & BMCR_PDOWN) { + ret = ret & ~BMCR_PDOWN; + ret = _mv88e6xxx_phy_page_write(ds, REG_FIBER_SERDES, + PAGE_FIBER_SERDES, MII_BMCR, + ret); + } + + return ret; +} + static int mv88e6xxx_setup_port(struct dsa_switch *ds, int port) { struct mv88e6xxx_priv_state *ps = ds_to_priv(ds); @@ -2399,6 +2418,23 @@ static int mv88e6xxx_setup_port(struct dsa_switch *ds, int port) goto abort; } + /* If this port is connected to a SerDes, make sure the SerDes is not +* powered down.
+*/ + if (mv88e6xxx_6352_family(ds)) { + ret = _mv88e6xxx_reg_read(ds, REG_PORT(port), PORT_STATUS); + if (ret < 0) + goto abort; + ret &= PORT_STATUS_CMODE_MASK; + if ((ret == PORT_STATUS_CMODE_100BASE_X) || + (ret == PORT_STATUS_CMODE_1000BASE_X) || + (ret == PORT_STATUS_CMODE_SGMII)) { + ret = mv88e6xxx_power_on_serdes(ds); + if (ret < 0) + goto abort; + } + } + /* Port Control 2: don't force a good FCS, set the maximum frame size to * 10240 bytes, disable 802.1q tags checking, don't discard tagged or * untagged frames on this port, do a destination address lookup on all diff --git a/drivers/net/dsa/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx.h index 9a038ab..26a424a 100644 --- a/drivers/net/dsa/mv88e6xxx.h +++ b/drivers/net/dsa/mv88e6xxx.h @@ -28,6 +28,10 @@ #define SMI_CMD_OP_45_READ_DATA_INC((3 << 10) | SMI_CMD_BUSY) #define SMI_DATA 0x01 +/* Fiber/SERDES Registers are located at SMI address F, page 1 */ +#define REG_FIBER_SERDES 0x0f +#define PAGE_FIBER_SERDES 0x01 + #define REG_PORT(p)(0x10 + (p)) #define PORT_STATUS0x00 #define PORT_STATUS_PAUSE_EN BIT(15) @@ -45,6 +49,10 @@ #define PORT_STATUS_MGMII BIT(6) /* 6185 */ #define PORT_STATUS_TX_PAUSED BIT(5) #define PORT_STATUS_FLOW_CTRL BIT(4) +#define PORT_STATUS_CMODE_MASK 0x0f +#define PORT_STATUS_CMODE_100BASE_X0x8 +#define PORT_STATUS_CMODE_1000BASE_X 0x9 +#define PORT_STATUS_CMODE_SGMII0xa #define PORT_PCS_CTRL 0x01 #define PORT_PCS_CTRL_RGMII_DELAY_RXCLKBIT(15) #define PORT_PCS_CTRL_RGMII_DELAY_TXCLKBIT(14) -- 2.7.4
[PATCH net-next v2 1/2] net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read,write}
Add versions of the phy_page_read and _write functions to be used in a context where the SMI mutex is held. Signed-off-by: Patrick Uiterwijk --- drivers/net/dsa/mv88e6xxx.c | 49 + 1 file changed, 36 insertions(+), 13 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c index fa086e0..86a2029 100644 --- a/drivers/net/dsa/mv88e6xxx.c +++ b/drivers/net/dsa/mv88e6xxx.c @@ -2264,6 +2264,38 @@ static void mv88e6xxx_bridge_work(struct work_struct *work) mutex_unlock(&ps->smi_mutex); } +static int _mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page, +int reg, int val) +{ + int ret; + + ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page); + if (ret < 0) + goto restore_page_0; + + ret = _mv88e6xxx_phy_write_indirect(ds, port, reg, val); +restore_page_0: + _mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0); + + return ret; +} + +static int _mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page, + int reg) +{ + int ret; + + ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page); + if (ret < 0) + goto restore_page_0; + + ret = _mv88e6xxx_phy_read_indirect(ds, port, reg); +restore_page_0: + _mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0); + + return ret; +} + static int mv88e6xxx_setup_port(struct dsa_switch *ds, int port) { struct mv88e6xxx_priv_state *ps = ds_to_priv(ds); @@ -2714,13 +2746,9 @@ int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page, int reg) int ret; mutex_lock(&ps->smi_mutex); - ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page); - if (ret < 0) - goto error; - ret = _mv88e6xxx_phy_read_indirect(ds, port, reg); -error: - _mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0); + ret = _mv88e6xxx_phy_page_read(ds, port, page, reg); mutex_unlock(&ps->smi_mutex); + return ret; } @@ -2731,14 +2759,9 @@ int mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page, int ret; mutex_lock(&ps->smi_mutex); - ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page); - if 
(ret < 0) - goto error; - - ret = _mv88e6xxx_phy_write_indirect(ds, port, reg, val); -error: - _mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0); + ret = _mv88e6xxx_phy_page_write(ds, port, page, reg, val); mutex_unlock(&ps->smi_mutex); + return ret; } -- 2.7.4
Re: am335x: no multicast reception over VLAN
On 03/29/2016 08:21 AM, Yegor Yefremov wrote: > Hi Mugunthan, > > On Tue, Mar 29, 2016 at 6:00 AM, Mugunthan V N wrote: >> Hi Yegor >> >> On Wednesday 16 March 2016 08:35 PM, Yegor Yefremov wrote: >>> I have an am335x based board using CPSW in Dual EMAC mode. Without >>> VLAN IDs I can receive and send multicast packets [1]. When I create >>> VLAN ID: >>> >>> ip link add link eth1 name eth1.100 type vlan id 100 >>> ip addr add 192.168.100.2/24 brd 192.168.100.255 dev eth1.100 >>> route add -net 224.0.0.0 netmask 224.0.0.0 eth1.100 >>> >>> I can successfully send multicast packets, but not receive them. On >>> the other side of the Ethernet cable I've used Pandaboard. Pandaboard >>> could both receive and send multicast packets via VLAN. >> >> Are you trying multicast tx/rx on eth1 or eth1.100? > > I'm trying multicast tx/rx on eth1.100. > > eth1 has no problems. > it'd be nice if you will be able to post here the output from these commands: # switch-config -d [git://git.ti.com/switch-config/switch-config.git v4.1] # ifconfig -a # tcpdump -e -f -Q in -i eth0 # tcpdump -e -f -Q in -i eth0.100 -- regards, -grygorii
Re: [PATCH] mwifiex: advertise low priority scan feature
I've resent the patch here: https://patchwork.kernel.org/patch/8637861/ Thanks! Wei-Ning On Tue, Mar 22, 2016 at 12:12 PM, Wei-Ning Huang wrote: > Hi Kalle, > > Thanks for the review. I accidentally removed the s-o-b line from > akarwar in this version. > The original patch can be found at: > https://chromium-review.googlesource.com/#/c/246052/ > I've resent a new one. > > Wei-Ning > > On Mon, Mar 21, 2016 at 6:28 PM, Kalle Valo wrote: >> Wei-Ning Huang writes: >> >>> From: Amitkumar Karwar >>> >>> Low priority scan handling code which delays or aborts scan >>> operation based on Tx traffic is removed recently. The reason >>> is firmware already takes care of it in our new feature scan >>> channel gap. Hence we should advertise low priority scan >>> support to cfg80211. >>> >>> This patch fixes a problem in which OBSS scan request from >>> wpa_supplicant was being rejected by cfg80211. >>> >>> Signed-off-by: Wei-Ning Huang >> >> The From line states that this is written by Amitkumar but there's no >> Signed-off-By line from him. I can't take this without that, please >> resend. >> >> (Wei-Ning's s-o-b line is correct, I just need also Amitkumar's line.) >> >> -- >> Kalle Valo > > > > -- > Wei-Ning Huang, 黃偉寧 | Software Engineer, Google Inc., Taiwan | > wnhu...@google.com | Cell: +886 910-380678 -- Wei-Ning Huang, 黃偉寧 | Software Engineer, Google Inc., Taiwan | wnhu...@google.com | Cell: +886 910-380678
RE: [PATCH] mwifiex: add __GFP_REPEAT to skb allocation call
> From: Wei-Ning Huang [mailto:wnhu...@google.com] > Sent: Tuesday, March 29, 2016 2:57 PM > To: Kalle Valo > Cc: Linux Wireless; LKML; Amitkumar Karwar; Nishant Sarmukadam; Sameer > Nanda; netdev@vger.kernel.org; Sonny Rao; Douglas Anderson > Subject: Re: [PATCH] mwifiex: add __GFP_REPEAT to skb allocation call > > Adding some chromium devs to the thread. > > In, http://lxr.free-electrons.com/source/mm/page_alloc.c#L3152 > > The default mm retry allocation when 'order <= PAGE_ALLOC_COSTLY_ORDER' > of gfp_mask contains __GFP_REPEAT. > PAGE_ALLOC_COSTLY_ORDER is defined to be 3. On systems with page size = > 4K, this means memory compaction and retry is only done when the size of > allocation is <= 32K In mwifiex, the allocation size is 64K. When we > have system with memory fragmentation and allocation failed, there will > be no retry. > This is why we need to add __GFP_REPEAT here to allow the system to > perform memory compaction and retry allocation. > > Maybe Amit@marvell can comment on if this is a good fix on this issue. > I'm also aware that marvell is the progress of implementing > scatter/gatter for mwifiex, which can also fix the issue. > > Wei-Ning > This fix would be useful. We have a feature called single port aggregation in which sometimes data received from SDIO interface can be >32k (but less than 64k). This feature improves throughput performance. We are preparing patches for scatter/gather feature. but scatter/gather won't be supported by some platforms. Hence this fix would still be needed. Regards, Amitkumar
[PATCH] bridge: Allow set bridge ageing time when switchdev disabled
When NET_SWITCHDEV=n, switchdev_port_attr_set will return -EOPNOTSUPP; we should ignore this error code and continue to set the ageing time. Signed-off-by: Haishuang Yan --- net/bridge/br_stp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c index e234490..9cb7044 100644 --- a/net/bridge/br_stp.c +++ b/net/bridge/br_stp.c @@ -582,7 +582,7 @@ int br_set_ageing_time(struct net_bridge *br, u32 ageing_time) int err; err = switchdev_port_attr_set(br->dev, &attr); - if (err) + if (err && err != -EOPNOTSUPP) return err; br->ageing_time = t; -- 1.8.3.1
Re: [PATCH] mwifiex: add __GFP_REPEAT to skb allocation call
Adding some chromium devs to the thread. In http://lxr.free-electrons.com/source/mm/page_alloc.c#L3152 The default mm retries allocation when 'order <= PAGE_ALLOC_COSTLY_ORDER' or gfp_mask contains __GFP_REPEAT. PAGE_ALLOC_COSTLY_ORDER is defined to be 3. On systems with page size = 4K, this means memory compaction and retry is only done when the size of allocation is <= 32K. In mwifiex, the allocation size is 64K. When we have a system with memory fragmentation and the allocation fails, there will be no retry. This is why we need to add __GFP_REPEAT here to allow the system to perform memory compaction and retry allocation. Maybe Amit@marvell can comment on whether this is a good fix for this issue. I'm also aware that Marvell is in the process of implementing scatter/gather for mwifiex, which can also fix the issue. Wei-Ning On Tue, Mar 29, 2016 at 4:37 PM, Kalle Valo wrote: > Wei-Ning Huang writes: > >> "single skb allocation failure" happens when system is under heavy >> memory pressure. Add __GFP_REPEAT to skb allocation call so kernel >> attempts to reclaim pages and retry the allocation. >> >> Signed-off-by: Wei-Ning Huang > > Is this really a proper way to fix the issue? This is the first time I'm > hearing about the flag and there isn't even a single user in > drivers/net. I would like to get confirmation from others that > __GFP_REPEAT is really ok to use in a wireless driver before I can take > this. > > -- > Kalle Valo -- Wei-Ning Huang, 黃偉寧 | Software Engineer, Google Inc., Taiwan | wnhu...@google.com | Cell: +886 910-380678
[PATCH net-next v3.16]r8169: Correct Set Vlan tag
This patch add set Vlan tag and flush CPlusCmd register because when unset RxVlan and RxChkSum bit, whithout some explication , unwanted bits is set, PCIDAC, PCIMulRW and others.Whithout this patch when run ethtool -d eth0 on "C+ Command" field missing "VLAN de-tagging" Signed-off-by: Corcodel Marian --- drivers/net/ethernet/realtek/r8169.c | 21 +++-- 1 file changed, 3 insertions(+), 18 deletions(-) diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index 2594bbb..df7ea28 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -752,7 +752,7 @@ struct rtl8169_private { void *Rx_databuff[NUM_RX_DESC]; /* Rx data buffers */ struct ring_info tx_skb[NUM_TX_DESC]; /* Tx data buffers */ struct timer_list timer; - u16 cp_cmd; + int cp_cmd; u16 event_slow; @@ -1797,7 +1797,7 @@ static void __rtl8169_set_features(struct net_device *dev, netdev_features_t features) { struct rtl8169_private *tp = netdev_priv(dev); - netdev_features_t changed = features ^ dev->features; + netdev_features_t changed = features & dev->features; void __iomem *ioaddr = tp->mmio_addr; if (!(changed & (NETIF_F_RXALL | NETIF_F_RXCSUM | @@ -1807,16 +1807,10 @@ static void __rtl8169_set_features(struct net_device *dev, if (changed & (NETIF_F_RXCSUM | NETIF_F_HW_VLAN_CTAG_RX)) { if (features & NETIF_F_RXCSUM) tp->cp_cmd |= RxChkSum; - else - tp->cp_cmd &= ~RxChkSum; if (dev->features & NETIF_F_HW_VLAN_CTAG_RX) tp->cp_cmd |= RxVlan; - else - tp->cp_cmd &= ~RxVlan; - RTL_W16(CPlusCmd, tp->cp_cmd); - RTL_R16(CPlusCmd); } if (changed & NETIF_F_RXALL) { int tmp = (RTL_R32(RxConfig) & ~(AcceptErr | AcceptRunt)); @@ -6573,8 +6567,6 @@ static int rtl_open(struct net_device *dev) rtl8169_init_phy(dev, tp); - __rtl8169_set_features(dev, dev->features); - rtl_pll_power_up(tp); rtl_hw_start(dev); @@ -7060,7 +7052,7 @@ rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) goto err_out_mwi_2; } - tp->cp_cmd = RxChkSum; + 
tp->cp_cmd = ~0x; if ((sizeof(dma_addr_t) > 4) && !pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) && use_dac) { @@ -7101,13 +7093,6 @@ rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) pci_set_master(pdev); - /* -* Pretend we are using VLANs; This bypasses a nasty bug where -* Interrupts stop flowing on high load on 8110SCd controllers. -*/ - if (tp->mac_version == RTL_GIGA_MAC_VER_05) - tp->cp_cmd |= RxVlan; - rtl_init_mdio_ops(tp); rtl_init_pll_power_ops(tp); rtl_init_jumbo_ops(tp); -- 2.1.4
Re: [PATCH net] team: team should sync the port's uc/mc addrs when add a port
On Tue, Mar 29, 2016 at 12:56 PM, Cong Wang wrote: > On Mon, Mar 28, 2016 at 9:42 AM, Xin Long wrote: >> There is an issue when we use mavtap over team: >> When we replug nic links from team0, the real nics's mc list will not >> include the maddr for macvtap any more. then we can't receive pkts to >> macvtap device, as they are filterred by mc list of nic. >> >> In Bonding Driver, it syncs the uc/mc addrs in bond_enslave(). >> >> We will fix this issue on team by adding the port's uc/mc addrs sync in >> team_port_add. >> >> Signed-off-by: Xin Long >> --- >> drivers/net/team/team.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c >> index 26c64d2..17ff367 100644 >> --- a/drivers/net/team/team.c >> +++ b/drivers/net/team/team.c >> @@ -1198,6 +1198,9 @@ static int team_port_add(struct team *team, struct >> net_device *port_dev) >> goto err_dev_open; >> } >> >> + dev_uc_sync_multiple(port_dev, dev); >> + dev_mc_sync_multiple(port_dev, dev); >> + >> err = vlan_vids_add_by_dev(port_dev, dev); > > You need to call dev_{uc,mc}_unsync() on error path, don't you? I think so, I'm gonna post another patch as a fix for this one. Thanks, Cong.
Re: [PATCH] mwifiex: add __GFP_REPEAT to skb allocation call
Wei-Ning Huang writes: > "single skb allocation failure" happens when system is under heavy > memory pressure. Add __GFP_REPEAT to skb allocation call so kernel > attempts to reclaim pages and retry the allocation. > > Signed-off-by: Wei-Ning Huang Is this really a proper way to fix the issue? This is the first time I'm hearing about the flag and there isn't even a single user in drivers/net. I would like to get confirmation from others that __GFP_REPEAT is really ok to use in a wireless driver before I can take this. -- Kalle Valo
Re: [PATCH] net: macb: Only call GPIO functions if there is a valid GPIO
On Mon, Mar 28, 2016 at 08:30:11PM +0300, Sergei Shtylyov wrote: > >@@ -3029,7 +3030,8 @@ static int macb_remove(struct platform_device *pdev) > > mdiobus_free(bp->mii_bus); > > > > /* Shutdown the PHY if there is a GPIO reset */ > >-gpiod_set_value(bp->reset_gpio, 0); > >+if (bp->reset_gpio) > >+gpiod_set_value(bp->reset_gpio, 0); > >Hm, this function was previously OK to call with NULL (it didn't curse)... > Looks like it was changed so that it does complain fairly recently (patch liberally snipped down): commit fdeb8e1547cb9dd39d5d7223b33f3565cf86c28e Author: Linus Walleij gpio: reflect base and ngpio into gpio_device +#define VALIDATE_DESC_VOID(desc) do { \ + if (!desc || !desc->gdev) { \ + pr_warn("%s: invalid GPIO\n", __func__); \ + return; \ + } \ + void gpiod_set_value(struct gpio_desc *desc, int value) { - if (!desc) - return; + VALIDATE_DESC_VOID(desc); Thanks, Charles