Re: [RFC 2/7] ath10k: Add support to process rx packet in thread
On 3/22/21 6:20 PM, Brian Norris wrote:
On Mon, Mar 22, 2021 at 4:58 PM Ben Greear wrote:
On 7/22/20 6:00 AM, Felix Fietkau wrote:
On 2020-07-22 14:55, Johannes Berg wrote:
On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:

I'm considering testing a different approach (with mt76 initially):
- Add a mac80211 rx function that puts processed skbs into a list instead of handing them to the network stack directly.

Would this be *after* all the mac80211 processing, i.e. in place of the rx-up-to-stack?

Yes, it would run all the rx handlers normally and then put the resulting skbs into a list instead of calling netif_receive_skb or napi_gro_frags.

Whatever came of this? I realized I'm running Felix's patch since his mt76 driver needs it. Any chance it will go upstream?

If you're asking about $subject (moving NAPI/RX to a thread), this landed upstream recently:
http://git.kernel.org/linus/adbb4fb028452b1b0488a1a7b66ab856cdf20715

It needs a bit of coaxing to work on a WiFi driver (including: WiFi drivers tend to have a different netdev for NAPI than they expose to /sys/class/net/), but it's there. I'm not sure if people had something else in mind in the stuff you're quoting though.

No, I got it confused with something Felix did:
https://github.com/greearb/mt76/blob/master/patches/0001-net-add-support-for-threaded-NAPI-polling.patch

Maybe the NAPI/RX to a thread thing superseded Felix's patch?

Thanks,
Ben

Brian

-- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [RFC 2/7] ath10k: Add support to process rx packet in thread
On 7/22/20 6:00 AM, Felix Fietkau wrote:
On 2020-07-22 14:55, Johannes Berg wrote:
On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:

I'm considering testing a different approach (with mt76 initially):
- Add a mac80211 rx function that puts processed skbs into a list instead of handing them to the network stack directly.

Would this be *after* all the mac80211 processing, i.e. in place of the rx-up-to-stack?

Yes, it would run all the rx handlers normally and then put the resulting skbs into a list instead of calling netif_receive_skb or napi_gro_frags.

Whatever came of this? I realized I'm running Felix's patch since his mt76 driver needs it. Any chance it will go upstream?

Thanks,
Ben

- Felix

-- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: VRF: ssh port forwarding between non-vrf and vrf interface.
On 1/22/21 8:02 AM, David Ahern wrote: On 1/22/21 8:45 AM, Ben Greear wrote: Hello, I have a system with a management interface that is not in any VRF, and then I have a port that *is* in a VRF. I'd like to be able to set up ssh port forwarding so that when I log into the system on the management interface it will automatically forward to an IP accessible through the VRF interface. Is there a way to do such a thing? For a while I had a system setup with eth0 in a management VRF and setup to do NAT and port forwarding of incoming ssh connections, redirecting to VMs running in a different namespace. Crossing VRFs with netfilter most likely will not work without some development. You might be able to do it with XDP - rewrite packet headers and redirect. That too might need a bit of development depending on the netdevs involved. Maybe easier to improve ssh so that it could specify a netdev to bind to when making the call to the redirected destination? Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [PATCH net] iwlwifi: provide gso_type to GSO packets
On 1/25/21 7:09 AM, Eric Dumazet wrote:

From: Eric Dumazet

net/core/tso.c got recent support for USO, and this broke iwlwifi because the driver implemented a limited form of GSO. Providing ->gso_type allows for skb_is_gso_tcp() to provide a correct result.

Fixes: 3d5b459ba0e3 ("net: tso: add UDP segmentation support")
Signed-off-by: Eric Dumazet
Reported-by: Ben Greear
Bisected-by: Ben Greear

I appreciate the credit, but the bisect and some other initial bug hunting were done by people on this thread:
https://bugzilla.kernel.org/show_bug.cgi?id=209913

Thanks,
Ben

Tested-by: Ben Greear
Cc: Luca Coelho
Cc: linux-wirel...@vger.kernel.org
Cc: Johannes Berg
---
 drivers/net/wireless/intel/iwlwifi/mvm/tx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/tx.c b/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
index a983c215df310776ffe67f3b3ffa203eab609bfc..3712adc3ccc2511d46bcc855efbfba41c487d8e6 100644
--- a/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
+++ b/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
@@ -773,6 +773,7 @@ iwl_mvm_tx_tso_segment(struct sk_buff *skb, unsigned int num_subframes,
 	next = skb_gso_segment(skb, netdev_flags);
 	skb_shinfo(skb)->gso_size = mss;
+	skb_shinfo(skb)->gso_type = ipv4 ? SKB_GSO_TCPV4 : SKB_GSO_TCPV6;
 	if (WARN_ON_ONCE(IS_ERR(next)))
 		return -EINVAL;
 	else if (next)
@@ -795,6 +796,8 @@ iwl_mvm_tx_tso_segment(struct sk_buff *skb, unsigned int num_subframes,
 		if (tcp_payload_len > mss) {
 			skb_shinfo(tmp)->gso_size = mss;
+			skb_shinfo(tmp)->gso_type = ipv4 ? SKB_GSO_TCPV4 :
+							   SKB_GSO_TCPV6;
 		} else {
 			if (qos) {
 				u8 *qc;

-- Ben Greear Candela Technologies Inc http://www.candelatech.com
VRF: ssh port forwarding between non-vrf and vrf interface.
Hello, I have a system with a management interface that is not in any VRF, and then I have a port that *is* in a VRF. I'd like to be able to set up ssh port forwarding so that when I log into the system on the management interface it will automatically forward to an IP accessible through the VRF interface. Is there a way to do such a thing? Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: 5.10.4+ hang with 'rmmod nf_conntrack'
On 1/7/21 10:16 PM, Florian Westphal wrote:
Ben Greear wrote:

I noticed my system has a hung process trying to 'rmmod nf_conntrack'. I've generally been doing the script that calls rmmod forever, but only extensively tested on 5.4 kernel and earlier. If anyone has any ideas, please let me know. This is from 'sysrq t'. I don't see any hung-task splats in dmesg.

rmmod on conntrack loops forever until the active conntrack object count reaches 0. (plus a walk of the conntrack table to evict/put all entries).

I'll see if it is reproducible and if so will try with lockdep enabled...

No idea, there was a regression in 5.6, but that was fixed by the time 5.7 was released. Can't reproduce hangs with a script that injects a few dummy entries and then removes the module:

added=0

add_and_rmmod()
{
	while [ $added -lt 1000 ]; do
		conntrack -I -s $(($RANDOM%256)).$(($RANDOM%256)).$(($RANDOM%256)).$(($RANDOM%255+1)) \
			-d $(($RANDOM%256)).$(($RANDOM%256)).$(($RANDOM%256)).$(($RANDOM%255+1)) \
			--protonum 6 --timeout $(((RANDOM%120) + 240)) --state ESTABLISHED --sport $RANDOM --dport $RANDOM 2> /dev/null || break
		added=$((added + 1))
		if [ $((added % 1000)) -eq 0 ];then
			echo $added
		fi
	done

	echo rmmod after adding $added entries
	conntrack -C
	rmmod nf_conntrack_netlink
	rmmod nf_conntrack
}

add_and_rmmod

I don't see how it would make a difference, but do you have any special conntrack features enabled at run time, e.g. reliable netlink events? (If you don't know what I mean the answer is no).

Not that I know of, but I am using lots of VRF devices, each with their own routing table, as well as some wifi stations and AP netdevs. I'll let you know if I can reproduce it again..

Thanks,
Ben

-- Ben Greear Candela Technologies Inc http://www.candelatech.com
5.10.4+ hang with 'rmmod nf_conntrack'
I noticed my system has a hung process trying to 'rmmod nf_conntrack'. I've generally been doing the script that calls rmmod forever, but only extensively tested on 5.4 kernel and earlier. If anyone has any ideas, please let me know. This is from 'sysrq t'. I don't see any hung-task splats in dmesg.

I'll see if it is reproducible and if so will try with lockdep enabled...

Jan 07 16:12:05 TR-398 kernel: task:rmmod state:R running task stack:0 pid: 4107 ppid: 4054 flags:0x4084
Jan 07 16:12:05 TR-398 kernel: Call Trace:
Jan 07 16:12:05 TR-398 kernel: ? do_softirq_own_stack+0x32/0x40
Jan 07 16:12:05 TR-398 kernel: ? irq_exit_rcu+0x39/0x90
Jan 07 16:12:05 TR-398 kernel: ? sysvec_apic_timer_interrupt+0x34/0x80
Jan 07 16:12:05 TR-398 kernel: ? asm_sysvec_apic_timer_interrupt+0x12/0x20
Jan 07 16:12:05 TR-398 kernel: ? nf_conntrack_attach+0x30/0x30 [nf_conntrack]
Jan 07 16:12:05 TR-398 kernel: ? _raw_spin_lock+0x12/0x20
Jan 07 16:12:05 TR-398 kernel: ? do_softirq_own_stack+0x32/0x40
Jan 07 16:12:05 TR-398 kernel: ? nf_conntrack_lock+0x9/0x40 [nf_conntrack]
Jan 07 16:12:05 TR-398 kernel: ? nf_ct_iterate_cleanup+0x88/0x140 [nf_conntrack]
Jan 07 16:12:05 TR-398 kernel: ? nf_conntrack_cleanup_net_list+0x36/0xc0 [nf_conntrack]
Jan 07 16:12:05 TR-398 kernel: ? unregister_pernet_operations+0xcc/0x130
Jan 07 16:12:05 TR-398 kernel: ? unregister_pernet_subsys+0x18/0x30
Jan 07 16:12:05 TR-398 kernel: ? nf_conntrack_standalone_fini+0x11/0x425 [nf_conntrack]
Jan 07 16:12:05 TR-398 kernel: ? __x64_sys_delete_module+0x131/0x270
Jan 07 16:12:05 TR-398 kernel: ? syscall_trace_enter.isra.21+0xf9/0x190
Jan 07 16:12:05 TR-398 kernel: ? do_syscall_64+0x2d/0x70
Jan 07 16:12:05 TR-398 kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

Thanks,
Ben

-- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: net: tso: add UDP segmentation support: adds regression for ax200 upload
On 12/21/20 12:01 PM, Rainer Suhm wrote: Am 21.12.20 um 20:14 schrieb Eric Dumazet: On Mon, Dec 21, 2020 at 8:04 PM Eric Dumazet wrote: On Mon, Dec 21, 2020 at 7:46 PM Eric Dumazet wrote: On Sat, Dec 19, 2020 at 5:55 PM Ben Greear wrote: On 12/19/20 7:18 AM, Johannes Berg wrote: On Fri, 2020-12-18 at 12:16 -0800, Jakub Kicinski wrote: On Thu, 17 Dec 2020 12:40:26 -0800 Ben Greear wrote: On 12/17/20 10:20 AM, Eric Dumazet wrote: On Thu, Dec 17, 2020 at 7:13 PM Ben Greear wrote: It is the iwlwifi/mvm logic that supports ax200. Let me ask again : I see two different potential call points : drivers/net/wireless/intel/iwlwifi/pcie/tx.c:1529: tso_build_hdr(skb, hdr_page->pos, &tso, data_left, !total_len); drivers/net/wireless/intel/iwlwifi/queue/tx.c:427: tso_build_hdr(skb, hdr_page->pos, &tso, data_left, !total_len); To the best of your knowledge, which one would be used in your case ? Both are horribly complex, I do not want to spend time studying two implementations. It is the queue/tx.c code that executes on my system, verified with printk. Not sure why Intel's not on CC here. Heh :) Let's also add linux-wireless. Luca, is the ax200 TSO performance regression with recent kernel on your radar? It wasn't on mine for sure, so far. But it's supposed to be Christmas vacation, so haven't checked our bug tracker etc. I see Emmanuel was at least looking at the bug report, but not sure what else happened yet. Not to bitch and moan too much, but even the most basic of testing would have shown this, how can testing be so poor on the ax200 driver? It even shows up with the out-of-tree ax200 driver. Off the top of my head, I don't really see the issue. Does anyone have the ability to capture the frames over the air (e.g. with another AX200 in monitor mode, load the driver with amsdu_size=3 module parameter to properly capture A-MSDUs)? I can do that at some point, and likely it could be reproduced with an /n or /ac AP and those are a lot easier to sniff. 
Thanks,
Ben

-- Ben Greear Candela Technologies Inc http://www.candelatech.com

It seems the problem comes from some skbs reaching the driver with gso_type == 0, meaning skb_is_gso_tcp() is fuzzy. (net/core/tso.c is only one of the skb_is_gso_tcp() users)

Local TCP stack should provide either SKB_GSO_TCPV4 or SKB_GSO_TCPV6 for GSO packets.

So maybe the issue is coming from traffic coming from a VM through a tun device or something, and our handling of GSO_ROBUST / DODGY never cared about setting SKB_GSO_TCPV4 or SKB_GSO_TCPV6 if not already given by user space ?

Or a plain bug somewhere, possibly overwriting gso_type with 0 or garbage...

Oh well, iwl_mvm_tx_tso_segment() 'builds' a fake gso packet. I suspect this will fix the issue :

diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/tx.c b/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
index a983c215df310776ffe67f3b3ffa203eab609bfc..e7ad6367c88de4aff700c630d850760d1d3bf011 100644
--- a/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
+++ b/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
@@ -773,6 +773,7 @@ iwl_mvm_tx_tso_segment(struct sk_buff *skb, unsigned int num_subframes,
 	next = skb_gso_segment(skb, netdev_flags);
 	skb_shinfo(skb)->gso_size = mss;
+	skb_shinfo(skb)->gso_type = ipv4 ? SKB_GSO_TCPV4 : SKB_GSO_TCPV6;
 	if (WARN_ON_ONCE(IS_ERR(next)))
 		return -EINVAL;
 	else if (next)

Or more precisely :

diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/tx.c b/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
index a983c215df310776ffe67f3b3ffa203eab609bfc..11145bf29f3cbeefcce1a05cc81fd90978f2cbfe 100644
--- a/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
+++ b/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
@@ -773,6 +773,7 @@ iwl_mvm_tx_tso_segment(struct sk_buff *skb, unsigned int num_subframes,
 	next = skb_gso_segment(skb, netdev_flags);
 	skb_shinfo(skb)->gso_size = mss;
+	skb_shinfo(skb)->gso_type = ipv4 ? SKB_GSO_TCPV4 : SKB_GSO_TCPV6;
 	if (WARN_ON_ONCE(IS_ERR(next)))
 		return -EINVAL;
 	else if (next)
@@ -795,6 +796,7 @@ iwl_mvm_tx_tso_segment(struct sk_buff *skb, unsigned int num_subframes,
 		if (tcp_payload_len > mss) {
 			skb_shinfo(tmp)->gso_size = mss;
+			skb_shinfo(tmp)->gso_type = ipv4 ? SKB_GSO_TCPV4 : SKB_GSO_TCPV6;
 		} else {
 			if (qos) {
 				u8 *qc;

This looks good to me. Transmission rate is in the expected range. iperf3 shows no retries anymore. Here is my kernel log with the above changes applied, and the debug patches from Eric.

I tested this successfully as well.

Eric: Thanks for the patch!

--Ben

-- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: net: tso: add UDP segmentation support: adds regression for ax200 upload
On 12/19/20 7:18 AM, Johannes Berg wrote: On Fri, 2020-12-18 at 12:16 -0800, Jakub Kicinski wrote: On Thu, 17 Dec 2020 12:40:26 -0800 Ben Greear wrote: On 12/17/20 10:20 AM, Eric Dumazet wrote: On Thu, Dec 17, 2020 at 7:13 PM Ben Greear wrote: It is the iwlwifi/mvm logic that supports ax200. Let me ask again : I see two different potential call points : drivers/net/wireless/intel/iwlwifi/pcie/tx.c:1529: tso_build_hdr(skb, hdr_page->pos, &tso, data_left, !total_len); drivers/net/wireless/intel/iwlwifi/queue/tx.c:427: tso_build_hdr(skb, hdr_page->pos, &tso, data_left, !total_len); To the best of your knowledge, which one would be used in your case ? Both are horribly complex, I do not want to spend time studying two implementations. It is the queue/tx.c code that executes on my system, verified with printk. Not sure why Intel's not on CC here. Heh :) Let's also add linux-wireless. Luca, is the ax200 TSO performance regression with recent kernel on your radar? It wasn't on mine for sure, so far. But it's supposed to be Christmas vacation, so haven't checked our bug tracker etc. I see Emmanuel was at least looking at the bug report, but not sure what else happened yet. Not to bitch and moan too much, but even the most basic of testing would have shown this, how can testing be so poor on the ax200 driver? It even shows up with the out-of-tree ax200 driver. Off the top of my head, I don't really see the issue. Does anyone have the ability to capture the frames over the air (e.g. with another AX200 in monitor mode, load the driver with amsdu_size=3 module parameter to properly capture A-MSDUs)? I can do that at some point, and likely it could be reproduced with an /n or /ac AP and those are a lot easier to sniff. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [PATCH 0/3] mac80211: Trigger disconnect for STA during recovery
On 12/17/20 2:24 PM, Brian Norris wrote: On Tue, Dec 15, 2020 at 10:23:33AM -0800, Ben Greear wrote: On 12/15/20 9:21 AM, Youghandhar Chintala wrote: From: Rakesh Pillai Currently in case of target hardware restart ,we just reconfig and re-enable the security keys and enable the network queues to start data traffic back from where it was interrupted. Are there any known mac80211 radios/drivers that *can* support seamless restarts? If not, then just could always enable this feature in mac80211? I'm quite sure that iwlwifi intentionally supports a seamless restart. From my experience with dealing with user reports, I don't recall any issues where restart didn't function as expected, unless there was some deeper underlying failure (e.g., hardware/power failure; driver bugs / lockups). I don't have very good stats for ath10k/QCA6174, but it survives our testing OK and I again don't recall any user-reported complaints in this area. I'd say this is a weaker example though, as I don't have as clear of data. (By contrast, ath10k/WCN399x, which Rakesh, et al, are patching here, does not pass our tests at all, and clearly fails to recover from "seamless" restarts, as noted in patch 3.) I'd also note that we don't operate in AP mode -- only STA -- and IIRC Ben, you've complained about AP mode in the past. I complain about all sorts of things, but I'm usually running station mode :) Do you actually see iwlwifi stations stay associated through firmware crashes? Anyway, happy to hear some have seamless recovery, and in that case, I have no objections to the patch. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: net: tso: add UDP segmentation support: adds regression for ax200 upload
On 12/17/20 10:20 AM, Eric Dumazet wrote: On Thu, Dec 17, 2020 at 7:13 PM Ben Greear wrote: It is the iwlwifi/mvm logic that supports ax200. Let me ask again : I see two different potential call points : drivers/net/wireless/intel/iwlwifi/pcie/tx.c:1529: tso_build_hdr(skb, hdr_page->pos, &tso, data_left, !total_len); drivers/net/wireless/intel/iwlwifi/queue/tx.c:427: tso_build_hdr(skb, hdr_page->pos, &tso, data_left, !total_len); To the best of your knowledge, which one would be used in your case ? Both are horribly complex, I do not want to spend time studying two implementations. It is the queue/tx.c code that executes on my system, verified with printk. Thanks, Ben Thanks. -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: net: tso: add UDP segmentation support: adds regression for ax200 upload
On 12/17/2020 10:07 AM, Eric Dumazet wrote: On Thu, Dec 17, 2020 at 6:56 PM Ben Greear wrote: On 12/17/20 2:11 AM, Eric Dumazet wrote: On Thu, Dec 17, 2020 at 12:59 AM Ben Greear wrote: On 12/16/20 3:09 PM, Ben Greear wrote: Hello Eric, The patch below evidently causes TCP throughput to be about 50Mbps instead of 700Mbps when using ax200 to upload tcp traffic. When I disable TSO, performance goes back up to around 700Mbps. As a followup, when I revert the patch, upload speed goes to ~900Mbps, so even better than just disabling TSO (I left TSO enabled after reverting the patch). Thanks, Ben Thanks for the report ! It seems drivers/net/wireless/intel/iwlwifi/pcie/tx.c:iwl_fill_data_tbs_amsdu() calls tso_build_hdr() with extra bytes (SNAP header), it is not yet clear to me what is broken :/ Your patch is guessing tcp vs udp by looking at header length from what I could tell. So if something uses a different size, it probably gets confused? I do not think so, my patch selects TCP vs UDP by using standard GSO helper skb_is_gso_tcp(skb) tso->tlen is initialized from tso_start() : int tlen = skb_is_gso_tcp(skb) ? tcp_hdrlen(skb) : sizeof(struct udphdr); tso->tlen = tlen; Maybe for some reason skb_is_gso_tcp(skb) returns false in your case, some debugging would help. Can you confirm which driver is used for ax200 ? I see tso_build_hdr() also being used from drivers/net/wireless/intel/iwlwifi/queue/tx.c I tested against the un-modified ax200 5.10.0 kernel driver, and it has the issue. The ax200 backports release/core56 driver acts a bit different (poorer performance over all than in-kernel driver), but has similar upstream issues that are mitigated by disabling TSO. Sorry, I can not find ax200 driver. It is the iwlwifi/mvm logic that supports ax200. Thanks, Ben
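To make the tlen selection quoted above concrete, here is a small user-space model of it. This is a sketch only: the struct, the MODEL_* flag values and the function names are stand-ins invented for illustration and are not the kernel's definitions. It shows what happens if skb_is_gso_tcp() were to return false for a packet that is really TCP, as speculated above: the UDP header size would be used instead of the TCP header length.

```c
/* Illustrative stand-ins; these are not the kernel's names or values. */
enum {
	MODEL_GSO_TCPV4  = 1 << 0,
	MODEL_GSO_TCPV6  = 1 << 1,
	MODEL_GSO_UDP_L4 = 1 << 2,
};

#define MODEL_UDP_HDR_LEN 8u	/* a UDP header is always 8 bytes */

struct model_skb {
	unsigned int gso_type;		/* stand-in for skb_shinfo(skb)->gso_type */
	unsigned int tcp_hdr_len;	/* stand-in for tcp_hdrlen(skb) */
};

/* Models skb_is_gso_tcp(): true only when a TCP GSO bit is set. */
static int model_is_gso_tcp(const struct model_skb *skb)
{
	return (skb->gso_type & (MODEL_GSO_TCPV4 | MODEL_GSO_TCPV6)) != 0;
}

/* Models the tlen selection quoted from tso_start(). */
unsigned int model_tso_tlen(const struct model_skb *skb)
{
	return model_is_gso_tcp(skb) ? skb->tcp_hdr_len : MODEL_UDP_HDR_LEN;
}
```

In this model, a TCP packet with a 32-byte header and gso_type left at 0 yields a transport header length of 8 rather than 32, so any code that builds per-segment headers from that length would work with the wrong size.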
Re: net: tso: add UDP segmentation support: adds regression for ax200 upload
On 12/17/20 2:11 AM, Eric Dumazet wrote: On Thu, Dec 17, 2020 at 12:59 AM Ben Greear wrote: On 12/16/20 3:09 PM, Ben Greear wrote: Hello Eric, The patch below evidently causes TCP throughput to be about 50Mbps instead of 700Mbps when using ax200 to upload tcp traffic. When I disable TSO, performance goes back up to around 700Mbps. As a followup, when I revert the patch, upload speed goes to ~900Mbps, so even better than just disabling TSO (I left TSO enabled after reverting the patch). Thanks, Ben Thanks for the report ! It seems drivers/net/wireless/intel/iwlwifi/pcie/tx.c:iwl_fill_data_tbs_amsdu() calls tso_build_hdr() with extra bytes (SNAP header), it is not yet clear to me what is broken :/ Your patch is guessing tcp vs udp by looking at header length from what I could tell. So if something uses a different size, it probably gets confused? Can you confirm which driver is used for ax200 ? I see tso_build_hdr() also being used from drivers/net/wireless/intel/iwlwifi/queue/tx.c I tested against the un-modified ax200 5.10.0 kernel driver, and it has the issue. The ax200 backports release/core56 driver acts a bit different (poorer performance over all than in-kernel driver), but has similar upstream issues that are mitigated by disabling TSO. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: net: tso: add UDP segmentation support: adds regression for ax200 upload
On 12/16/20 3:09 PM, Ben Greear wrote: Hello Eric, The patch below evidently causes TCP throughput to be about 50Mbps instead of 700Mbps when using ax200 to upload tcp traffic. When I disable TSO, performance goes back up to around 700Mbps. As a followup, when I revert the patch, upload speed goes to ~900Mbps, so even better than just disabling TSO (I left TSO enabled after reverting the patch). Thanks, Ben I recall ~5 years ago we had similar TCP related performance issues with ath10k. I vaguely recall that there might be some driver-level socket pacing tuning value, but I cannot find the right thing to search for. Is this really a thing? If so, maybe it will be a way to resolve this issue? See this more thorough bug report: https://bugzilla.kernel.org/show_bug.cgi?id=209913 Patch description: net: tso: add UDP segmentation support Note that like TCP, we do not support additional encapsulations, and that checksums must be offloaded to the NIC. Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller Thanks, Ben
net: tso: add UDP segmentation support: adds regression for ax200 upload
Hello Eric, The patch below evidently causes TCP throughput to be about 50Mbps instead of 700Mbps when using ax200 to upload tcp traffic. When I disable TSO, performance goes back up to around 700Mbps. I recall ~5 years ago we had similar TCP related performance issues with ath10k. I vaguely recall that there might be some driver-level socket pacing tuning value, but I cannot find the right thing to search for. Is this really a thing? If so, maybe it will be a way to resolve this issue? See this more thorough bug report: https://bugzilla.kernel.org/show_bug.cgi?id=209913 Patch description: net: tso: add UDP segmentation support Note that like TCP, we do not support additional encapsulations, and that checksums must be offloaded to the NIC. Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [PATCH 0/3] mac80211: Trigger disconnect for STA during recovery
On 12/15/20 9:21 AM, Youghandhar Chintala wrote: From: Rakesh Pillai Currently in case of target hardware restart ,we just reconfig and re-enable the security keys and enable the network queues to start data traffic back from where it was interrupted. Are there any known mac80211 radios/drivers that *can* support seamless restarts? If not, then just could always enable this feature in mac80211? Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [PATCH v2 1/3] ath10k: Add history for tracking certain events
On 7/31/20 11:27 AM, Rakesh Pillai wrote:

Add history for tracking the below events
- register read
- register write
- IRQ trigger
- NAPI poll
- CE service
- WMI cmd
- WMI event
- WMI tx completion

This will help in debugging any crash or any improper behaviour.

Tested-on: WCN3990 hw1.0 SNOC WLAN.HL.3.1-01040-QCAHLSWMTPLZ-1

Signed-off-by: Rakesh Pillai
---
 drivers/net/wireless/ath/ath10k/ce.c      | 1 +
 drivers/net/wireless/ath/ath10k/core.h    | 74 +
 drivers/net/wireless/ath/ath10k/debug.c   | 133 ++
 drivers/net/wireless/ath/ath10k/debug.h   | 74 +
 drivers/net/wireless/ath/ath10k/snoc.c    | 15 +++-
 drivers/net/wireless/ath/ath10k/wmi-tlv.c | 1 +
 drivers/net/wireless/ath/ath10k/wmi.c     | 10 +++
 7 files changed, 307 insertions(+), 1 deletion(-)

+void ath10k_record_wmi_event(struct ath10k *ar, enum ath10k_wmi_type type,
+			     u32 id, unsigned char *data)
+{
+	struct ath10k_wmi_event_entry *entry;
+	u32 idx;
+
+	if (type == ATH10K_WMI_EVENT) {
+		if (!ar->wmi_event_history.record)
+			return;

This check above is duplicated below, add it once at top of the method instead.

+
+		spin_lock_bh(&ar->wmi_event_history.hist_lock);
+		idx = ath10k_core_get_next_idx(&ar->reg_access_history.index,
+					       ar->wmi_event_history.max_entries);
+		spin_unlock_bh(&ar->wmi_event_history.hist_lock);
+		entry = &ar->wmi_event_history.record[idx];
+	} else {
+		if (!ar->wmi_cmd_history.record)
+			return;
+
+		spin_lock_bh(&ar->wmi_cmd_history.hist_lock);
+		idx = ath10k_core_get_next_idx(&ar->reg_access_history.index,
+					       ar->wmi_cmd_history.max_entries);
+		spin_unlock_bh(&ar->wmi_cmd_history.hist_lock);
+		entry = &ar->wmi_cmd_history.record[idx];
+	}
+
+	entry->timestamp = ath10k_core_get_timestamp();
+	entry->cpu_id = smp_processor_id();
+	entry->type = type;
+	entry->id = id;
+	memcpy(&entry->data, data + 4, ATH10K_WMI_DATA_LEN);
+}
+EXPORT_SYMBOL(ath10k_record_wmi_event);

@@ -1660,6 +1668,11 @@ static int ath10k_snoc_probe(struct platform_device *pdev)
 	ar->ce_priv = &ar_snoc->ce;
 	msa_size = drv_data->msa_size;
+	ath10k_core_reg_access_history_init(ar, ATH10K_REG_ACCESS_HISTORY_MAX);
+	ath10k_core_wmi_event_history_init(ar, ATH10K_WMI_EVENT_HISTORY_MAX);
+	ath10k_core_wmi_cmd_history_init(ar, ATH10K_WMI_CMD_HISTORY_MAX);
+	ath10k_core_ce_event_history_init(ar, ATH10K_CE_EVENT_HISTORY_MAX);

Maybe only enable this once user turns it on? It sucks up a bit of memory?

+
 	ath10k_snoc_quirks_init(ar);
 	ret = ath10k_snoc_resource_init(ar);

diff --git a/drivers/net/wireless/ath/ath10k/wmi-tlv.c b/drivers/net/wireless/ath/ath10k/wmi-tlv.c
index 932266d..9df5748 100644
--- a/drivers/net/wireless/ath/ath10k/wmi-tlv.c
+++ b/drivers/net/wireless/ath/ath10k/wmi-tlv.c
@@ -627,6 +627,7 @@ static void ath10k_wmi_tlv_op_rx(struct ath10k *ar, struct sk_buff *skb)
 	if (skb_pull(skb, sizeof(struct wmi_cmd_hdr)) == NULL)
 		goto out;
+	ath10k_record_wmi_event(ar, ATH10K_WMI_EVENT, id, skb->data);
 	trace_ath10k_wmi_event(ar, id, skb->data, skb->len);
 	consumed = ath10k_tm_event_wmi(ar, id, skb);

diff --git a/drivers/net/wireless/ath/ath10k/wmi.c b/drivers/net/wireless/ath/ath10k/wmi.c
index a81a1ab..8ebd05c 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.c
+++ b/drivers/net/wireless/ath/ath10k/wmi.c
@@ -1802,6 +1802,15 @@ struct sk_buff *ath10k_wmi_alloc_skb(struct ath10k *ar, u32 len)
 static void ath10k_wmi_htc_tx_complete(struct ath10k *ar, struct sk_buff *skb)
 {
+	struct wmi_cmd_hdr *cmd_hdr;
+	enum wmi_tlv_event_id id;
+
+	cmd_hdr = (struct wmi_cmd_hdr *)skb->data;
+	id = MS(__le32_to_cpu(cmd_hdr->cmd_id), WMI_CMD_HDR_CMD_ID);
+
+	ath10k_record_wmi_event(ar, ATH10K_WMI_TX_COMPL, id,
+				skb->data + sizeof(struct wmi_cmd_hdr));
+
 	dev_kfree_skb(skb);
 }

I think guard the above new code with

if (unlikely(ar->ce_event_history.record)) { ... }

All in all, I think I'd want to compile this out (while leaving other debug compiled in) since it seems this stuff would be rarely used and it adds method calls to hot paths. That is a decision for Kalle though, so see what he says...

Thanks,
Ben

-- Ben Greear Candela Technologies Inc http://www.candelatech.com
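As a rough illustration of the kind of fixed-size event history being reviewed here, with the "is recording enabled" check done once at the top, consider the user-space sketch below. Everything in it is hypothetical: the struct layout, HIST_DATA_LEN and hist_record() are invented for illustration, and the modulo wraparound is an assumption, since ath10k_core_get_next_idx() is not shown in this thread. Locking is deliberately left out.

```c
#include <stdint.h>
#include <string.h>

#define HIST_DATA_LEN 16	/* hypothetical payload size per entry */

struct hist_entry {
	uint32_t id;
	unsigned char data[HIST_DATA_LEN];
};

struct hist {
	struct hist_entry *record;	/* NULL while recording is disabled */
	uint32_t max_entries;
	uint32_t index;			/* number of events recorded so far */
};

/* Record one event; returns the slot used, or -1 if recording is off. */
int hist_record(struct hist *h, uint32_t id, const unsigned char *data)
{
	uint32_t idx;

	/* Single early check, so the hot path pays for one test only. */
	if (!h->record || h->max_entries == 0)
		return -1;

	/* Wrap around: once the buffer is full the oldest slot is reused. */
	idx = h->index++ % h->max_entries;
	h->record[idx].id = id;
	memcpy(h->record[idx].data, data, HIST_DATA_LEN);
	return (int)idx;
}
```

Allocating record only when the user enables the feature would also address the memory concern raised in the review, because a NULL record turns every call into a cheap early return.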
Re: [PATCH v3 0/8] kernel: taint when the driver firmware crashes
On 05/28/2020 07:27 AM, Luis Chamberlain wrote:
On Wed, May 27, 2020 at 02:36:42PM -0700, Jakub Kicinski wrote:
On Wed, 27 May 2020 03:19:18 + Luis Chamberlain wrote:

I read your patch, and granted, I will accept I was under the incorrect assumption that this can only be used by networking devices. However, the devlink approach gives userspace the ability, with the iproute2 devlink util, to query a device's health, onto which we can peg firmware health. But *this* patch series is not about health status and letting users query it, it's about a *critical* situation which has come up with firmware requiring me to reboot my system, and the lack of *any* infrastructure in the kernel today to inform userspace about it. So say we use netlink to report a critical health situation, how are we informing userspace with your patch series about requiring a reboot?

One of the main features of netlink is its pub/sub model of notifications. Whatever you imagine listening to your uevent can listen to devlink-health notifications via devlink. In fact I've shown this off in the RFC patches I sent to you, see the devlink mon health command being used.

Yes but I looked at iproute2 devlink and it seems I made an incorrect assumption that this can only be used for a network device rather than a struct device. I'll take a second look.

Hello Jakub, I'm thinking about something similar to what Luis is proposing, but in my case I'd like to report just when the driver knows the hardware is gone and cannot be recovered, like when this is reported:

[ 2548.851832] WARNING: CPU: 3 PID: 98 at backports-4.19.98-1/net/mac80211/util.c:2040 ieee80211_reconfig+0x98/0xb64 [mac80211]
[ 2548.856020] Hardware became unavailable during restart.

I'd like to be able to tie this into a watch-dog program to allow automatic reboot of the system soon after this event is seen, for instance. Could you post your devlink RFC patches somewhere public?

Thanks,
Ben

-- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [RFC 1/2] devlink: add simple fw crash helpers
On 05/25/2020 02:07 AM, Andy Shevchenko wrote:
On Fri, May 22, 2020 at 04:23:55PM -0700, Steve deRosier wrote:
On Fri, May 22, 2020 at 2:51 PM Luis Chamberlain wrote:

I had to go RTFM re: kernel taints because it has been a very long time since I looked at them. It had always seemed to me that most were caused by "kernel-unfriendly" user actions. The most famous of course is loading proprietary modules, out-of-tree modules, forced module loads, etc... Honestly, I had forgotten the large variety of uses of the taint flags. For anyone who hasn't looked at taints recently, I recommend: https://www.kernel.org/doc/html/latest/admin-guide/tainted-kernels.html

In light of this I don't object to setting a taint on this anymore. I'm a little uneasy, but I've softened on it now, and now I feel it depends on implementation. Specifically, I don't think we should set a taint flag when a driver easily handles a routine firmware crash and is confident that things have come up just fine again. In other words, triggering the taint in every driver module where it spits out a log comment that it had a firmware crash and had to recover seems too much. Sure, firmware shouldn't crash, sure it should be open source so we can fix it, whatever...

While it may sound idealistic, the firmware is, for the end-user and even for a mere kernel developer like me, a complete black box which has more access than the root user in the kernel. We have tons of firmwares and each of them is a potentially dangerous beast. As a user I really care about my data and privacy (a hacker can oops a firmware in order to set up a specific attack vector). So, tainting the kernel is _at least_ what we can do there; the strict rule would be to reboot immediately.

those sort of wishful comments simply ignore reality and our ability to affect effective change.

We can encourage users not to buy cheap crap, for starters.

There is no stable wifi firmware for any price. There is also no obvious feedback from even name-brand NICs like ath10k or AX200 when you report a crash. That said, at least in my experience with ath10k-ct, the OS normally recovers fine from firmware crashes. ath10k already reports full crash reports on udev, so it is easy for user-space to notice and file bug reports upstream if it cares to. Probably other NICs do the same, and if not, they certainly could.

Thanks,
Ben

-- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
On 05/18/2020 10:09 AM, Luis Chamberlain wrote: On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote: On 05/18/2020 09:51 AM, Luis Chamberlain wrote: On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote: On Fri, 2020-05-15 at 21:28 +, Luis Chamberlain wrote:> module_firmware_crashed You didn't CC me or the wireless list on the rest of the patches, so I'm replying to a random one, but ... What is the point here? This should in no way affect the integrity of the system/kernel, for most devices anyway. Keyword you used here is "most device". And in the worst case, *who* knows what other odd things may happen afterwards. So what if ath10k's firmware crashes? If there's a driver bug it will not handle it right (and probably crash, WARN_ON, or something else), but if the driver is working right then that will not affect the kernel at all. Sometimes the device can go into a state which requires driver removal and addition to get things back up. It would be lovely to be able to detect this case in the driver/system somehow! I haven't seen any such cases recently, I assure you that I have run into it. Once it does again I'll report the crash, but the problem with some of this is that unless you scrape the log you won't know. Eventually, a uevent would indeed tell inform me. but in case there is some common case you see, maybe we can think of a way to detect it? ath10k is just one case, this patch series addresses a simple way to annotate this tree-wide. So maybe I can understand that maybe you want an easy way to discover - per device - that the firmware crashed, but that still doesn't warrant a complete kernel taint. That is one reason, another is that a taint helps support cases *fast* easily detect if the issue was a firmware crash, instead of scraping logs for driver specific ways to say the firmware has crashed. You can listen for udev events (I think that is the right term), and find crashes that way. You get the actual crash info as well. 
My follow up to this was to add uevent to add_taint() as well, this way these could generically be processed by userspace. I'm not opposed to the taint, though I have not thought much on it. But, if you can already get the crash info from uevent, and it automatically comes without polling or scraping logs, then what benefit beyond that does the taint give you? Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
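For anyone who wants to try the uevent route discussed above, here is a minimal sketch of a listener. It is not taken from any patch in this thread. It assumes the crash shows up as a kernel uevent whose SUBSYSTEM is `devcoredump` (an assumption about how the driver reports its dump, so check what your driver actually emits), and the helper names `parse_uevent`, `looks_like_fw_crash` and `watch` are made up for illustration.

```python
import socket

# Netlink protocol number for kernel uevents (NETLINK_KOBJECT_UEVENT in
# <linux/netlink.h>); written as a literal in case this Python build
# does not export the constant.
NETLINK_KOBJECT_UEVENT = 15
KERNEL_UEVENT_GROUP = 1  # multicast group the kernel itself sends on


def parse_uevent(raw: bytes) -> dict:
    """Split one kernel uevent datagram into a dict.

    A kernel uevent is a NUL-separated list: an "action@devpath" header
    followed by KEY=VALUE fields.
    """
    fields = [f for f in raw.split(b"\0") if f]
    event = {}
    # Drop the "action@devpath" header; the same data is repeated as fields.
    if fields and b"=" not in fields[0]:
        fields = fields[1:]
    for field in fields:
        key, sep, value = field.partition(b"=")
        if sep:
            event[key.decode(errors="replace")] = value.decode(errors="replace")
    return event


def looks_like_fw_crash(event: dict) -> bool:
    # ASSUMPTION: the driver hands its crash dump to the devcoredump class,
    # so a new dump appears as an "add" event in that subsystem.
    return event.get("ACTION") == "add" and event.get("SUBSYSTEM") == "devcoredump"


def watch() -> None:
    """Block forever, printing events that look like firmware crash dumps (Linux only)."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM, NETLINK_KOBJECT_UEVENT)
    sock.bind((0, KERNEL_UEVENT_GROUP))  # (pid 0 = kernel assigns one, group mask)
    while True:
        event = parse_uevent(sock.recv(65535))
        if looks_like_fw_crash(event):
            print("possible firmware crash:", event.get("DEVPATH"))
```

Calling `watch()` from a small daemon gives the "no polling, no log scraping" behaviour described above; whether a taint adds anything beyond that is the open question in this thread.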
Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
On 05/18/2020 09:51 AM, Luis Chamberlain wrote: On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote: On Fri, 2020-05-15 at 21:28 +, Luis Chamberlain wrote:> module_firmware_crashed You didn't CC me or the wireless list on the rest of the patches, so I'm replying to a random one, but ... What is the point here? This should in no way affect the integrity of the system/kernel, for most devices anyway. Keyword you used here is "most device". And in the worst case, *who* knows what other odd things may happen afterwards. So what if ath10k's firmware crashes? If there's a driver bug it will not handle it right (and probably crash, WARN_ON, or something else), but if the driver is working right then that will not affect the kernel at all. Sometimes the device can go into a state which requires driver removal and addition to get things back up. It would be lovely to be able to detect this case in the driver/system somehow! I haven't seen any such cases recently, but in case there is some common case you see, maybe we can think of a way to detect it? So maybe I can understand that maybe you want an easy way to discover - per device - that the firmware crashed, but that still doesn't warrant a complete kernel taint. That is one reason, another is that a taint helps support cases *fast* easily detect if the issue was a firmware crash, instead of scraping logs for driver specific ways to say the firmware has crashed. You can listen for udev events (I think that is the right term), and find crashes that way. You get the actual crash info as well. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [PATCH] ath10k: increase rx buffer size to 2048
On 04/28/2020 05:01 AM, Kalle Valo wrote: Sven Eckelmann writes: On Wednesday, 1 April 2020 09:00:49 CEST Sven Eckelmann wrote: On Wednesday, 5 February 2020 20:10:43 CEST Linus Lüssing wrote: From: Linus Lüssing Before, only frames with a maximum size of 1528 bytes could be transmitted between two 802.11s nodes. For batman-adv for instance, which adds its own header to each frame, we typically need an MTU of at least 1532 bytes to be able to transmit without fragmentation. This patch now increases the maxmimum frame size from 1528 to 1656 bytes. [...] @Kalle, I saw that this patch was marked as deferred [1] but I couldn't find any mail why it was done so. It seems like this currently creates real world problems - so would be nice if you could explain shortly what is currently blocking its acceptance. Ping? Sorry for the delay, my plan was to first write some documentation about different hardware families but haven't managed to do that yet. My problem with this patch is that I don't know what hardware and firmware versions were tested, so it needs analysis before I feel safe to apply it. The ath10k hardware families are very different that even if a patch works perfectly on one ath10k hardware it could still break badly on another one. What makes me faster to apply ath10k patches is to have comprehensive analysis in the commit log. This shows me the patch author has considered about all hardware families, not just the one he is testing on, and that I don't need to do the analysis myself. It has been in ath10k-ct for a while, and that has some fairly wide coverage in OpenWrt, so likely if there were problems we would have seen it already. I did not make any specific changes to firmware to support this, so upstream firmware should behave similarly. Seems like upstream ath10k could really benefit from having some test beds so you can actually test code on different chips and have confidence in your changes! 
Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: Strange routing with VRF and 5.2.7+
On 9/30/19 11:45 AM, Ben Greear wrote: On 9/22/19 12:23 PM, David Ahern wrote: On 9/20/19 9:57 AM, Ben Greear wrote: On 9/10/19 6:08 PM, Ben Greear wrote: On 9/10/19 3:17 PM, Ben Greear wrote: Today we were testing creating 200 virtual station vdevs on ath9k, and using VRF for the routing. Looks like the same issue happens w/out VRF, but there I have oodles of routing rules, so it is an area ripe for failure. Will upgrade to 5.2.14+ and retest, and try 4.20 as well. Turns out, this was ipsec (strongswan) inserting a rule that pointed to a table that we then used for a vrf w/out realizing the rule was added. Stopping strongswan and/or reconfiguring how routing tables are assigned resolved the issue. Hi Ben: Since you are the pioneer with vrf and ipsec, can you add an ipsec section with some notes to Documentation/networking/vrf.txt? I need to do some more testing, an initial attempt to reproduce my working config on another system did not work properly, and I have not yet dug into it. I'm still grinding out the bugs... Here is my current quandary. In the VRF I have the 'real' device, say eth4 with IP 192.168.5.5. This talks to the VPN gateway device at 192.168.5.1. When I add the xfrm, it is given the address 192.168.10.100. I need all traffic routing out the vrf to use the xfrm as source IP, except the eth4 still needs to be able to talk to the 5.1 device (I think?) Evidently, adding this type of route below will do the trick, at least in a non-vrf setup, and with this route in its own table that is queried after the 'local' routing table, but before the others via use of a fairly generic rule:

default via 192.168.5.1 dev enp1s0 proto static src 192.168.10.100

I am guessing that in VRF world, I can get rid of the rule, and replace the existing default route (given to eth4 when it does DHCP or is statically assigned) with something like the above. And, maybe I need a special route for the VPN gateway itself as destination so that ipsec logic on eth4 can still talk to it? 
(I am thinking of the case where the VPN gateway is not on the local subnet and so we have to route to it special???) Any insight is welcome. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: IPv6 addr and route is gone after adding port to vrf (5.2.0+)
On 10/11/19 1:35 PM, David Ahern wrote: On 10/11/19 7:57 AM, Ben Greear wrote: The down-up cycling is done on purpose - to clear out neigh entries and routes associated with the device under the old VRF. All entries must be created with the device in the new VRF. I believe I found another thing to be aware of relating to this. My logic has been to do supplicant, then do DHCP, and only when DHCP responds do I set up the networking for the wifi station. It is at this time that I would be creating a VRF (or using routing rules if not using VRF). But, when I add the station to the newly created vrf, then it bounces it, and that causes supplicant to have to re-associate (I think, lots of moving pieces, so I could be missing something). Any chance you could just clear the neighbor entries and routes w/out bouncing the interface? yes, it is annoying. I have been meaning to fix that, but never found the motivation to do it. If you have the time, it would be worth avoiding the overhead. I changed my code so that it adds to the vrf first, so I too am lacking motivation and time to dig into the kernel at the moment. I'll let you know if I find time to work on it. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: IPv6 addr and route is gone after adding port to vrf (5.2.0+)
On 8/16/19 2:48 PM, David Ahern wrote: On 8/16/19 3:28 PM, Ben Greear wrote: On 8/16/19 12:15 PM, David Ahern wrote: On 8/16/19 1:13 PM, Ben Greear wrote: I have a problem with a VETH port when setting up a somewhat complicated VRF setup. I am losing the global IPv6 addr, and also the route, apparently when I add the veth device to a vrf. From my script's output: Either enslave the device before adding the address or enable the retention of addresses: sysctl -q -w net.ipv6.conf.all.keep_addr_on_down=1 Thanks, I added it to the vrf first just in case some other logic was expecting the routes to go away on network down. That part now seems to be working. The down-up cycling is done on purpose - to clear out neigh entries and routes associated with the device under the old VRF. All entries must be created with the device in the new VRF. I believe I found another thing to be aware of relating to this. My logic has been to do supplicant, then do DHCP, and only when DHCP responds do I set up the networking for the wifi station. It is at this time that I would be creating a VRF (or using routing rules if not using VRF). But, when I add the station to the newly created vrf, then it bounces it, and that causes supplicant to have to re-associate (I think, lots of moving pieces, so I could be missing something). Any chance you could just clear the neighbor entries and routes w/out bouncing the interface? Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: Strange routing with VRF and 5.2.7+
On 9/22/19 12:23 PM, David Ahern wrote: On 9/20/19 9:57 AM, Ben Greear wrote: On 9/10/19 6:08 PM, Ben Greear wrote: On 9/10/19 3:17 PM, Ben Greear wrote: Today we were testing creating 200 virtual station vdevs on ath9k, and using VRF for the routing. Looks like the same issue happens w/out VRF, but there I have oodles of routing rules, so it is an area ripe for failure. Will upgrade to 5.2.14+ and retest, and try 4.20 as well. Turns out, this was ipsec (strongswan) inserting a rule that pointed to a table that we then used for a vrf w/out realizing the rule was added. Stopping strongswan and/or reconfiguring how routing tables are assigned resolved the issue. Hi Ben: Since you are the pioneer with vrf and ipsec, can you add an ipsec section with some notes to Documentation/networking/vrf.txt? I need to do some more testing, an initial attempt to reproduce my working config on another system did not work properly, and I have not yet dug into it. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: Strange routing with VRF and 5.2.7+
On 9/10/19 6:08 PM, Ben Greear wrote: On 9/10/19 3:17 PM, Ben Greear wrote: Today we were testing creating 200 virtual station vdevs on ath9k, and using VRF for the routing. Looks like the same issue happens w/out VRF, but there I have oodles of routing rules, so it is an area ripe for failure. Will upgrade to 5.2.14+ and retest, and try 4.20 as well. Turns out, this was ipsec (strongswan) inserting a rule that pointed to a table that we then used for a vrf w/out realizing the rule was added. Stopping strongswan and/or reconfiguring how routing tables are assigned resolved the issue. Thanks, Ben Thanks, Ben This really slows down the machine in question. During the minutes that it takes to bring these up and configure them, we lose network connectivity on the management port. If I do 'ip route show', it just shows the default route out of eth0, and the subnet route. But, if I try to ping the gateway, I get an ICMP error coming back from the gateway of one of the virtual stations (which should be safely using VRFs and so not in use when I do a plain 'ping' from the shell). I tried running tshark on eth0 in the background and running ping, and it captures no packets leaving eth0. After some time (and during this time, my various scripts will be (re)configuring vrfs and stations and related vrf routing tables and such, but should *not* be messing with the main routing table), things suddenly start working again. I am curious if anyone has seen anything similar or has suggestions for more ways to debug this. It seems reproducible, but it is a pain to debug. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
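The root cause reported here (a policy rule added by another daemon that points at a table later reused for a VRF) is easy to miss by eye when there are many rules. A rough sketch of a check follows. It assumes `ip rule show` prints each rule with a `lookup <table>` token; the function names and the sample rule list are hypothetical and not from any tool mentioned in this thread.

```python
import re


def tables_referenced_by_rules(ip_rule_output: str) -> dict:
    """Map each table name/id to the rule lines that look it up."""
    refs = {}
    for line in ip_rule_output.splitlines():
        match = re.search(r"\blookup\s+(\S+)", line)
        if match:
            refs.setdefault(match.group(1), []).append(line.strip())
    return refs


def rules_hitting_vrf_tables(ip_rule_output: str, vrf_tables) -> list:
    """Return rule lines that point directly at a table reserved for a VRF."""
    wanted = {str(table) for table in vrf_tables}
    refs = tables_referenced_by_rules(ip_rule_output)
    return [rule for table, rules in refs.items() if table in wanted for rule in rules]


if __name__ == "__main__":
    # Made-up sample; on a real system feed in the output of 'ip rule show'.
    sample = (
        "0:\tfrom all lookup local\n"
        "100:\tfrom all lookup 10001\n"
        "1000:\tfrom all lookup [l3mdev-table]\n"
        "32766:\tfrom all lookup main\n"
        "32767:\tfrom all lookup default\n"
    )
    for rule in rules_hitting_vrf_tables(sample, vrf_tables=[10001]):
        print("rule points straight at a VRF table:", rule)
```

Any rule it prints deserves a second look, since a plain 'ping' from the shell could be steered into that VRF's table even though the main table looks untouched.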
Re: Strange routing with VRF and 5.2.7+
On 9/10/19 3:17 PM, Ben Greear wrote: Today we were testing creating 200 virtual station vdevs on ath9k, and using VRF for the routing. Looks like the same issue happens w/out VRF, but there I have oodles of routing rules, so it is an area ripe for failure. Will upgrade to 5.2.14+ and retest, and try 4.20 as well. Thanks, Ben This really slows down the machine in question. During the minutes that it takes to bring these up and configure them, we lose network connectivity on the management port. If I do 'ip route show', it just shows the default route out of eth0, and the subnet route. But, if I try to ping the gateway, I get an ICMP error coming back from the gateway of one of the virtual stations (which should be safely using VRFs and so not in use when I do a plain 'ping' from the shell). I tried running tshark on eth0 in the background and running ping, and it captures no packets leaving eth0. After some time (and during this time, my various scripts will be (re)configuring vrfs and stations and related vrf routing tables and such, but should *not* be messing with the main routing table), things suddenly start working again. I am curious if anyone has seen anything similar or has suggestions for more ways to debug this. It seems reproducible, but it is a pain to debug. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Strange routing with VRF and 5.2.7+
Today we were testing creating 200 virtual station vdevs on ath9k, and using VRF for the routing. This really slows down the machine in question. During the minutes that it takes to bring these up and configure them, we lose network connectivity on the management port. If I do 'ip route show', it just shows the default route out of eth0, and the subnet route. But, if I try to ping the gateway, I get an ICMP error coming back from the gateway of one of the virtual stations (which should be safely using VRFs and so not in use when I do a plain 'ping' from the shell). I tried running tshark on eth0 in the background and running ping, and it captures no packets leaving eth0. After some time (and during this time, my various scripts will be (re)configuring vrfs and stations and related vrf routing tables and such, but should *not* be messing with the main routing table), things suddenly start working again. I am curious if anyone has seen anything similar or has suggestions for more ways to debug this. It seems reproducible, but it is a pain to debug. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: VRF notes when using ipv6 and flushing tables.
On 08/20/2019 08:02 PM, David Ahern wrote: On 8/20/19 2:27 PM, Ben Greear wrote: I recently spent a few days debugging what in the end was user error on my part. Here are my notes in hope they help someone else. First, 'ip -6 route show vrf vrfX' will not show some of the routes (like local routes) that will show up with 'ip -6 route show table X', where X == vrfX's table-id. If you run 'ip -6 route flush table X', then you will lose all of the auto-generated routes, including anycast, ff00::/8, and local routes. ff00::/8 is needed for neigh discovery to work (probably among other things). The local route is needed or packets won't actually be accepted up the stack (I think that is the symptom at least). Not sure exactly what anycast does, but I'm guessing it is required for something useful. You must manually re-add those to the table unless you know for certain that you do not need them for whatever reason. sorry you went through such a long and painful debugging session. No problem. I learned some details of IPv6 I never realized before, sure to come in useful some day! Thanks, Ben yes, the kernel doc for VRF needs to be updated that 'ip route show vrf X' and 'ip route show table X' are different ('show vrf' mimics the main table in not showing local, broadcast, anycast; 'table vrf' shows all). A suggestion for others: the documentation and selftests directory have a lot of VRF examples now. If something basic is not working (e.g., arp or neigh discovery), see if it works there and if so compare the outputs of the route table along the way. -- Ben Greear Candela Technologies Inc http://www.candelatech.com
VRF notes when using ipv6 and flushing tables.
I recently spent a few days debugging what in the end was user error on my part. Here are my notes in hope they help someone else. First, 'ip -6 route show vrf vrfX' will not show some of the routes (like local routes) that will show up with 'ip -6 route show table X', where X == vrfX's table-id. If you run 'ip -6 route flush table X', then you will lose all of the auto-generated routes, including anycast, ff00::/8, and local routes. ff00::/8 is needed for neigh discovery to work (probably among other things). The local route is needed or packets won't actually be accepted up the stack (I think that is the symptom at least). Not sure exactly what anycast does, but I'm guessing it is required for something useful. You must manually re-add those to the table unless you know for certain that you do not need them for whatever reason. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
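One way to see exactly which routes a flush would take away is to diff the two views before flushing. The sketch below does that. The two `ip` invocations are the ones named in the notes above; the function names are invented for illustration, and it assumes both commands print a shared route with identical text once whitespace is normalized.

```python
import subprocess


def _normalize(output: str) -> list:
    """One entry per non-empty line, with runs of whitespace collapsed."""
    return [" ".join(line.split()) for line in output.splitlines() if line.strip()]


def routes_only_in_table(table_output: str, vrf_output: str) -> list:
    """Routes printed by 'show table X' but hidden by 'show vrf vrfX'."""
    visible = set(_normalize(vrf_output))
    return [route for route in _normalize(table_output) if route not in visible]


def _ip6_route_show(*selector: str) -> str:
    """Run 'ip -6 route show <selector>' (needs iproute2 and an existing VRF)."""
    cmd = ["ip", "-6", "route", "show", *selector]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def hidden_routes(vrf_name: str, table_id: int) -> list:
    """Save this list before flushing so the routes can be re-added by hand."""
    return routes_only_in_table(
        _ip6_route_show("table", str(table_id)),
        _ip6_route_show("vrf", vrf_name),
    )
```

For example, `hidden_routes("vrf10001", 10001)` (names borrowed from elsewhere in this archive) would return the auto-generated entries that 'show vrf' leaves out; keeping that list around is cheap insurance before running a flush.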
Re: IPv6 addr and route is gone after adding port to vrf (5.2.0+)
On 8/16/19 12:15 PM, David Ahern wrote: On 8/16/19 1:13 PM, Ben Greear wrote: I have a problem with a VETH port when setting up a somewhat complicated VRF setup. I am losing the global IPv6 addr, and also the route, apparently when I add the veth device to a vrf. From my script's output: Either enslave the device before adding the address or enable the retention of addresses: sysctl -q -w net.ipv6.conf.all.keep_addr_on_down=1 Thanks, I added it to the vrf first just in case some other logic was expecting the routes to go away on network down. That part now seems to be working. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
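The two options suggested above differ only in command ordering, which a small dry-run helper can make explicit. This is a sketch, not the script used in this thread: `vrf_setup_commands` is a hypothetical name, it only builds the argument lists (nothing is executed), and the commands themselves are the ones quoted in the thread.

```python
def vrf_setup_commands(dev, vrf, addrs6, enslave_first=True):
    """Return argv lists in the order they would be run; nothing is executed."""
    enslave = ["ip", "link", "set", dev, "vrf", vrf]
    add_addrs = [
        ["ip", "-6", "addr", "add", addr, "scope", "global", "dev", dev]
        for addr in addrs6
    ]
    if enslave_first:
        # Option 1: join the VRF first, so the down/up cycle caused by
        # enslaving happens before any global address exists.
        return [enslave] + add_addrs
    # Option 2: keep the original order, but ask the kernel to retain
    # IPv6 addresses across the down/up cycle.
    keep = ["sysctl", "-q", "-w", "net.ipv6.conf.all.keep_addr_on_down=1"]
    return [keep] + add_addrs + [enslave]


if __name__ == "__main__":
    for argv in vrf_setup_commands("rddVR0", "vrf10001", ["2001:3::1/64"]):
        print(" ".join(argv))
```

Printing the list first and feeding each entry to `subprocess.run` only once it looks right keeps the ordering decision visible in one place.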
IPv6 addr and route is gone after adding port to vrf (5.2.0+)
Hello, I have a problem with a VETH port when setting up a somewhat complicated VRF setup. I am losing the global IPv6 addr, and also the route, apparently when I add the veth device to a vrf. From my script's output:

### commands to set up the veth 'rddVR0'
./local/sbin/ip link set rddVR0 down
./local/sbin/ip -4 addr flush dev rddVR0
./local/sbin/ip -6 addr flush dev rddVR0
echo 1 > /proc/sys/net/ipv4/conf/rddVR0/forwarding
echo 1 > /proc/sys/net/ipv6/conf/rddVR0/forwarding
./local/sbin/ip link set rddVR0 up
./local/sbin/ip -4 addr add 10.2.127.1/24 broadcast 10.2.127.255 dev rddVR0
./local/sbin/ip -6 addr add 2001:3::1/64 scope global dev rddVR0
./local/sbin/ip -6 addr add fe80::d0f8:6fff:fe06:8ae/64 scope link dev rddVR0
RTNETLINK answers: File exists
./local/sbin/ip -6 route add 2001:3::1/64 dev rddVR0 table 10001
./local/sbin/ip -6 route add fe80::d0f8:6fff:fe06:8ae/64 dev rddVR0 table 10001
./local/sbin/ip route add 10.2.127.0/24 dev rddVR0 table 10001
echo 1 > /proc/sys/net/ipv4/conf/rddVR0/arp_filter

#printRoutes for table 10001
broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.1 linkdown
10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.1 linkdown
local 10.2.1.1 dev eth1 proto kernel scope host src 10.2.1.1
broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.1 linkdown
broadcast 10.2.8.0 dev vap proto kernel scope link src 10.2.8.1 linkdown
10.2.8.0/24 dev vap proto kernel scope link src 10.2.8.1 linkdown
local 10.2.8.1 dev vap proto kernel scope host src 10.2.8.1
broadcast 10.2.8.255 dev vap proto kernel scope link src 10.2.8.1 linkdown
broadcast 10.2.9.0 dev vap0100 proto kernel scope link src 10.2.9.1 linkdown
10.2.9.0/24 dev vap0100 proto kernel scope link src 10.2.9.1 linkdown
local 10.2.9.1 dev vap0100 proto kernel scope host src 10.2.9.1
broadcast 10.2.9.255 dev vap0100 proto kernel scope link src 10.2.9.1 linkdown
10.2.127.0/24 dev rddVR0 scope link
2001:3::/64 dev rddVR0 metric 1024 pref medium
fe80::/64 dev rddVR0 metric 1024 pref medium

some other commands, route/ip is still there

#printRoutes for table 10001
broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.1 linkdown
10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.1 linkdown
local 10.2.1.1 dev eth1 proto kernel scope host src 10.2.1.1
broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.1 linkdown
broadcast 10.2.8.0 dev vap proto kernel scope link src 10.2.8.1 linkdown
10.2.8.0/24 dev vap proto kernel scope link src 10.2.8.1 linkdown
local 10.2.8.1 dev vap proto kernel scope host src 10.2.8.1
broadcast 10.2.8.255 dev vap proto kernel scope link src 10.2.8.1 linkdown
broadcast 10.2.9.0 dev vap0100 proto kernel scope link src 10.2.9.1 linkdown
10.2.9.0/24 dev vap0100 proto kernel scope link src 10.2.9.1 linkdown
local 10.2.9.1 dev vap0100 proto kernel scope host src 10.2.9.1
broadcast 10.2.9.255 dev vap0100 proto kernel scope link src 10.2.9.1 linkdown
10.2.127.0/24 dev rddVR0 scope link
2001:3::/64 dev rddVR0 metric 1024 pref medium
fe80::/64 dev rddVR0 metric 1024 pref medium

./local/sbin/ip link set rddVR0 vrf vrf10001

#printRoutes for table 10001
broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.1 linkdown
10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.1 linkdown
local 10.2.1.1 dev eth1 proto kernel scope host src 10.2.1.1
broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.1 linkdown
broadcast 10.2.8.0 dev vap proto kernel scope link src 10.2.8.1 linkdown
10.2.8.0/24 dev vap proto kernel scope link src 10.2.8.1 linkdown
local 10.2.8.1 dev vap proto kernel scope host src 10.2.8.1
broadcast 10.2.8.255 dev vap proto kernel scope link src 10.2.8.1 linkdown
broadcast 10.2.9.0 dev vap0100 proto kernel scope link src 10.2.9.1 linkdown
10.2.9.0/24 dev vap0100 proto kernel scope link src 10.2.9.1 linkdown
local 10.2.9.1 dev vap0100 proto kernel scope host src 10.2.9.1
broadcast 10.2.9.255 dev vap0100 proto kernel scope link src 10.2.9.1 linkdown
broadcast 10.2.127.0 dev rddVR0 proto kernel scope link src 10.2.127.1
10.2.127.0/24 dev rddVR0 proto kernel scope link src 10.2.127.1
local 10.2.127.1 dev rddVR0 proto kernel scope host src 10.2.127.1
broadcast 10.2.127.255 dev rddVR0 proto kernel scope link src 10.2.127.1
fe80::/64 dev rddVR0 proto kernel metric 256 pref medium
ff00::/8 dev rddVR0 metric 256 pref medium

Route is gone... 2001:3::/64 dev rddVR0 metric 1024 pref medium

As far as I can tell, the same actions for a wifi AP interface do not hit this problem, but not sure if that is luck or not at this point. Any ideas what might be going on here? Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
lockup in hacked 4.20.17+ kernel, maybe addrconf_verify_work related?
[67044.714944] sock_sendmsg+0x2b/0x40
[67044.714946] ___sys_sendmsg+0x28a/0x2f0
[67044.714947] ? ___sys_recvmsg+0x156/0x1d0
[67044.714950] ? __alloc_pages_nodemask+0x111/0x280
[67044.714954] ? alloc_pages_vma+0x6f/0x1c0
[67044.714957] ? page_add_new_anon_rmap+0x72/0xb0
[67044.714958] ? __handle_mm_fault+0x7db/0x12c0
[67044.714961] __sys_sendmsg+0x52/0xa0
[67044.714964] do_syscall_64+0x4a/0xf0
[67044.714967] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[67044.714969] RIP: 0033:0x7fa9c4af15a7
[67044.714972] Code: Bad RIP value.
[67044.714973] RSP: 002b:7fffdd7ac818 EFLAGS: 0246 ORIG_RAX: 002e
[67044.714974] RAX: ffda RBX: 021ae990 RCX: 7fa9c4af15a7
[67044.714975] RDX: RSI: 7fffdd7ac8b0 RDI: 0008
[67044.714976] RBP: 021b3d80 R08: 0004 R09: 7fa9c4dabf20
[67044.714976] R10: 0170 R11: 0246 R12: 021b3ec0
[67044.714977] R13: 7fffdd7ac8b0 R14: 021b3ec0 R15: 7fffdd7acb18
[67044.714980] INFO: task sshd:1763 blocked for more than 180 seconds.
[67044.720810] Tainted: GW O 4.20.17+ #30
[67044.725186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[67044.732027] sshdD0 1763 1355 0x0080
[67044.732029] Call Trace:
[67044.732038] ? __schedule+0x29e/0x880
[67044.732040] schedule+0x2a/0x80
[67044.732042] schedule_preempt_disabled+0xc/0x20
[67044.732043] __mutex_lock.isra.10+0x2e7/0x4f0
[67044.732046] ? netlink_lookup+0x111/0x160
[67044.732048] __netlink_dump_start+0x4f/0x1d0
[67044.732051] ? rtnl_xdp_prog_skb+0x60/0x60
[67044.732052] rtnetlink_rcv_msg+0x25c/0x390
[67044.732054] ? rtnl_xdp_prog_skb+0x60/0x60
[67044.732055] ? rtnl_calcit.isra.31+0x110/0x110
[67044.732057] netlink_rcv_skb+0x44/0x120
[67044.732059] netlink_unicast+0x18b/0x220
[67044.732060] netlink_sendmsg+0x1ff/0x3d0
[67044.732064] sock_sendmsg+0x2b/0x40
[67044.732066] __sys_sendto+0xe9/0x150
[67044.732070] ? __audit_syscall_exit+0x216/0x280
[67044.732071] __x64_sys_sendto+0x1f/0x30
[67044.732075] do_syscall_64+0x4a/0xf0
[67044.732077] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[67044.732079] RIP: 0033:0x7f16e29c765a
[67044.732082] Code: Bad RIP value.
[67044.732083] RSP: 002b:7ffe57e52e88 EFLAGS: 0246 ORIG_RAX: 002c
[67044.732084] RAX: ffda RBX: 7ffe57e53f80 RCX: 7f16e29c765a
[67044.732085] RDX: 0014 RSI: 7ffe57e53f80 RDI: 0003
[67044.732085] RBP: 7ffe57e53fd0 R08: 7ffe57e53f24 R09: 000c
[67044.732086] R10: R11: 0246 R12: 7ffe57e53f24
[67044.732087] R13: 7ffe57e54160 R14: R15: 0003

Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
mgmt-tx issues with off-channel neighbor response on channel 100
Hello, I'm not sure if the fault is hostapd or the wireless stack (or something else), but this is what I see: I put an AP on channel 100, configured for RRM. STA associates to it and sends a channel report request. hostapd reports tx of the response frame failed with EBUSY (-16). Debugging in the kernel (4.20.8+ hacks) shows it fails because of the offchannel check. This appears to be because hostapd marks the frame as off-channel-OK, and nl80211 fails because of the CAC logic (I think):

static bool cfg80211_off_channel_oper_allowed(struct wireless_dev *wdev)
{
        ASSERT_WDEV_LOCK(wdev);

        if (!cfg80211_beaconing_iface_active(wdev))
                return true;

        if (!(wdev->chandef.chan->flags & IEEE80211_CHAN_RADAR))
                return true;

        return regulatory_pre_cac_allowed(wdev->wiphy);
}

In this case, the packet is not actually off-channel, and CAC has already completed successfully. Any opinions on where to fix this? Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
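To make the failing case easier to reason about, here is a small Python model of the check quoted above. It is only a restatement of the three C conditions: the parameter names are hypothetical stand-ins for the kernel calls, and it says nothing about where a fix belongs.

```python
def off_channel_oper_allowed(beaconing_active: bool,
                             chan_is_radar: bool,
                             pre_cac_allowed: bool) -> bool:
    """Model of the quoted cfg80211 check.

    beaconing_active -> cfg80211_beaconing_iface_active(wdev)
    chan_is_radar    -> chandef.chan->flags & IEEE80211_CHAN_RADAR
    pre_cac_allowed  -> regulatory_pre_cac_allowed(wdev->wiphy)
    """
    if not beaconing_active:
        return True
    if not chan_is_radar:
        return True
    return pre_cac_allowed


if __name__ == "__main__":
    # The scenario from this report: a beaconing AP on a radar (DFS) channel.
    # The outcome then depends only on the regulatory pre-CAC answer.
    for pre_cac in (False, True):
        print("pre_cac_allowed =", pre_cac, "->",
              off_channel_oper_allowed(True, True, pre_cac))
```

Note that the model has no input for "CAC already completed" or "the frame is actually on the operating channel", which matches the observation above that neither fact can rescue the frame once it is flagged as off-channel.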
Re: Waiting for vrf to become free on rmmod of bridge...
On 2/6/19 5:50 PM, David Ahern wrote: On 2/6/19 3:20 PM, Ben Greear wrote: Hello, I just saw this warning on a system running a hacked 4.20.2+ kernel. Any known bugs of this nature in this (upstream) kernel? The command that is blocked is: 'rmmod bridge llc' [17069.299135] unregister_netdevice: waiting for _vrf13 to become free. Usage count = 1 [17079.306438] unregister_netdevice: waiting for _vrf13 to become free. Usage count = 1 [17089.314656] unregister_netdevice: waiting for _vrf13 to become free. Usage count = 1 [17099.322870] unregister_netdevice: waiting for _vrf13 to become free. Usage count = 1 Thanks, Ben No known refcount issues with vrf. I use namespaces for testing which creates devices, adds routes, runs traffic and deletes the device and namespace. That series in the tests has been known to trigger refcount problems in the past. I'm not using namespaces in my test, but it is fairly convoluted. If I figure out how to reproduce the issue I'll let you know. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Waiting for vrf to become free on rmmod of bridge...
Hello, I just saw this warning on a system running a hacked 4.20.2+ kernel. Any known bugs of this nature in this (upstream) kernel? The command that is blocked is: 'rmmod bridge llc' [17069.299135] unregister_netdevice: waiting for _vrf13 to become free. Usage count = 1 [17079.306438] unregister_netdevice: waiting for _vrf13 to become free. Usage count = 1 [17089.314656] unregister_netdevice: waiting for _vrf13 to become free. Usage count = 1 [17099.322870] unregister_netdevice: waiting for _vrf13 to become free. Usage count = 1 Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Can NFS work with VRF?
Hello, I was trying to improve my old series of patches that binds NFS to a particular source IP address so that it could work with VRF in a 4.16 kernel. But, it seems a huge tangle to try to make NFS (and rpc, etc) able to bind to a local netdevice, which I think is what would be needed to make it work with VRF. Has anyone already worked on VRF support for NFS? Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Anyone know if strongswan works with vrf?
Hello, We're trying to create lots of strongswan VPN tunnels on network devices bound to different VRFs. We are using Fedora-24 on the client side, with a 4.16.15+ kernel and updated 'ip' package, etc. So far, no luck getting it to work. Any idea if this is supported or not? Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.
On 06/10/2018 10:10 AM, Michał Kazior wrote: Ben, The patch is symptomatic. fq_tin_dequeue() already checks if the list is empty before it tries to access first entry. I see no point in using the _or_null() + WARN_ON. The 0x3c deref is likely an offset off of NULL base pointer. Did you check gdb/addr2line of the ieee80211_tx_dequeue+0xfb? Where did it point to? gdb pointed to one line above the flow dereference, which is why I was going to put some debugging in there. I suspect there's not enough synchronization between quescing the device/ath10k after fw crashes and performing mac80211's reconfig procedure. I am already running this patch which helps with some of that. That patch never made it upstream, but it fixed problems for me earlier. https://patchwork.kernel.org/patch/9457639/ Could easily be there are some more issues in that logic. Someone else posted a patch to disable mac-80211 tx when FW crashes, I think...I have not tried to backport that. https://patchwork.kernel.org/patch/10411967/ Thanks, Ben Michał On 8 June 2018 at 23:40, Arend van Spriel wrote: On 6/8/2018 5:17 PM, Ben Greear wrote: I recalled an email from Michał leaving tieto so adding his alternate email he provided back then. Gr. AvS On 06/07/2018 04:59 PM, Cong Wang wrote: On Thu, Jun 7, 2018 at 4:48 PM, wrote: diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h index be7c0fa..cb911f0 100644 --- a/include/net/fq_impl.h +++ b/include/net/fq_impl.h @@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq, return NULL; } - flow = list_first_entry(head, struct fq_flow, flowchain); + flow = list_first_entry_or_null(head, struct fq_flow, flowchain); + + if (WARN_ON_ONCE(!flow)) + return NULL; This does not make sense either. list_first_entry_or_null() returns NULL only when the list is empty, but we already check list_empty() right before this code, and it is protected by fq->lock. Hello Michal, git blame shows you as the author of the fq_impl.h code. 
I saw a crash when debugging funky ath10k firmware in a 4.16 + hacks kernel. There was an apparent mostly-null deref in the fq_tin_dequeue method. According to gdb, it was within 1 line of the dereference of 'flow'. My hack above is probably not that useful. Cong thinks maybe the locking is bad. If you get a chance, please review this thread and see if you have any ideas for a better fix (or better debugging code). As always, if you would like me to generate you a buggy firmware that will crash in the tx path and cause all sorts of mayhem in the ath10k driver and wifi stack, I will be happy to do so. https://www.mail-archive.com/netdev@vger.kernel.org/msg239738.html Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.
On 06/07/2018 05:13 PM, Cong Wang wrote: On Thu, Jun 7, 2018 at 4:48 PM, wrote: From: Ben Greear While testing an ath10k firmware that often crashed under load, I was seeing kernel crashes as well. One of them appeared to be a dereference of a NULL flow object in fq_tin_dequeue. I have since fixed the firmware flaw, but I think it would be worth adding the WARN_ON in case the problem appears again. BUG: unable to handle kernel NULL pointer dereference at 003c IP: ieee80211_tx_dequeue+0xfb/0xb10 [mac80211] Instead of adding WARN_ON(), you need to think about the locking there, it is suspicious: fq is from struct ieee80211_local: struct fq *fq = &local->fq; tin is from struct txq_info: struct fq_tin *tin = &txqi->tin; I don't know if fq and tin are supposed to be 1:1, if not there is a bug in the locking, because ->new_flows and ->old_flows are both inside tin instead of fq, but they are protected by fq->lock Maybe whoever put this code together can take a stab at it. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.
On 06/07/2018 04:59 PM, Cong Wang wrote: On Thu, Jun 7, 2018 at 4:48 PM, wrote: diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h index be7c0fa..cb911f0 100644 --- a/include/net/fq_impl.h +++ b/include/net/fq_impl.h @@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq, return NULL; } - flow = list_first_entry(head, struct fq_flow, flowchain); + flow = list_first_entry_or_null(head, struct fq_flow, flowchain); + + if (WARN_ON_ONCE(!flow)) + return NULL; This does not make sense either. list_first_entry_or_null() returns NULL only when the list is empty, but we already check list_empty() right before this code, and it is protected by fq->lock. Nevermind then. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [PATCH] net-fq: Add WARN_ON check for null flow.
On 06/07/2018 02:52 PM, Cong Wang wrote: On Thu, Jun 7, 2018 at 2:41 PM, Ben Greear wrote: On 06/07/2018 02:29 PM, Cong Wang wrote: On Thu, Jun 7, 2018 at 9:06 AM, wrote: --- a/include/net/fq_impl.h +++ b/include/net/fq_impl.h @@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq, flow = list_first_entry(head, struct fq_flow, flowchain); + if (WARN_ON_ONCE(!flow)) + return NULL; + How could list_first_entry() possibly return NULL? You need list_first_entry_or_null(). I don't know for certain that flow was NULL, but something was NULL in this method near that line and flow looked like a likely culprit. I guess possibly tin or fq was passed in as NULL? A NULL pointer is not always 0. You can trigger a NULL-ptr-deref with 0x3c too, but you are checking against 0 in your patch. That is the problem, and that is why list_first_entry_or_null() exists. Ahh, I see what you mean, and that is my mistake. In my case, it did seem to be a mostly-null deref, not a 0x0 deref. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
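A minimal userspace sketch of the two points made in this exchange, assuming a simplified copy of the kernel list macros. `struct demo_flow`, its layout, and `field_address()` are hypothetical and only chosen so that one field sits at offset 0x3c; they are not the real `struct fq_flow`. The sketch shows that `list_first_entry()` on an empty list yields a bogus but non-NULL pointer (so a `!flow` check cannot fire), and that reading a field through a NULL base pointer faults at the field's offset rather than at address 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace copy of the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

/* Integer arithmetic keeps the demo free of out-of-bounds pointer math. */
#define container_of(ptr, type, member) \
	((type *)((uintptr_t)(ptr) - offsetof(type, member)))

#define list_first_entry(head, type, member) \
	container_of((head)->next, type, member)

/* Simplified: the kernel version reads head->next only once. */
#define list_first_entry_or_null(head, type, member) \
	(list_empty(head) ? NULL : list_first_entry(head, type, member))

static void init_list_head(struct list_head *head)
{
	assert(head != NULL);
	head->next = head;
	head->prev = head;
}

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Hypothetical stand-in for a flow object; the layout is invented. */
struct demo_flow {
	char pad[0x3c];
	int deficit;                /* lands at offset 0x3c */
	struct list_head flowchain;
};

/* Address the CPU would touch when reading base->deficit. */
static uintptr_t field_address(const struct demo_flow *base)
{
	return (uintptr_t)base + offsetof(struct demo_flow, deficit);
}
```

On an empty list, `list_first_entry()` returns the list head's address minus the member offset, which is never NULL, while `list_first_entry_or_null()` returns NULL. `field_address(NULL)` evaluates to 0x3c on common platforms, which is the shape of a "mostly-null" fault address.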
Re: [PATCH] net-fq: Add WARN_ON check for null flow.
On 06/07/2018 02:29 PM, Cong Wang wrote: On Thu, Jun 7, 2018 at 9:06 AM, wrote: --- a/include/net/fq_impl.h +++ b/include/net/fq_impl.h @@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq, flow = list_first_entry(head, struct fq_flow, flowchain); + if (WARN_ON_ONCE(!flow)) + return NULL; + How could list_first_entry() possibly return NULL? You need list_first_entry_or_null(). I don't know for certain that flow was NULL, but something was NULL in this method near that line and flow looked like a likely culprit. I guess possibly tin or fq was passed in as NULL? Anyway, if the patch seems worthless just ignore it. I'll leave it in my tree since it should be harmless and will let you know if I ever hit it. If someone else hits a similar crash, hopefully they can report it. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [PATCH] net-fq: Add WARN_ON check for null flow.
On 06/07/2018 09:17 AM, Eric Dumazet wrote: On 06/07/2018 09:06 AM, gree...@candelatech.com wrote: From: Ben Greear While testing an ath10k firmware that often crashed under load, I was seeing kernel crashes as well. One of them appeared to be a dereference of a NULL flow object in fq_tin_dequeue. I have since fixed the firmware flaw, but I think it would be worth adding the WARN_ON in case the problem appears again. common_interrupt+0xf/0xf Please find the exact commit that brought this bug, and add a corresponding Fixes: tag It will be a total pain to bisect this problem since my test case that causes this is running my modified firmware (and a buggy one at that), modified ath10k driver (to work with this firmware and support my test case easily), and the failure case appears to cause multiple different-but-probably-related crashes and often hangs or reboots the test system. Probably this is all caused by some nasty race or buggy logic related to dealing with a crashed ath10k firmware tearing down txq logic from the bottom up. There have been many such bugs in the past, I and others fixed a few, and very likely more remain. For what it is worth, I didn't see this crash in 4.13, and I spent some time testing buggy firmware there occasionally. If someone else has interest in debugging the ath10k driver, I will be happy to generate a mostly-stock firmware image with ability to crash in the TX path and give it to them. It will crash the stock upstream code reliably in my experience. 
Thanks, Ben Signed-off-by: Ben Greear --- include/net/fq_impl.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h index be7c0fa..e40354d 100644 --- a/include/net/fq_impl.h +++ b/include/net/fq_impl.h @@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq, flow = list_first_entry(head, struct fq_flow, flowchain); + if (WARN_ON_ONCE(!flow)) + return NULL; + if (flow->deficit <= 0) { flow->deficit += fq->quantum; list_move_tail(&flow->flowchain, -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Regression bisected to: softirq: Let ksoftirqd do its job
One of my out-of-tree patches is a network impairment tool that acts a lot like an Ethernet bridge with latency, jitter, etc. We noticed recently that we were seeing igb adapter errors when testing with our emulator at high speeds. For whatever reason, it is only easily reproduced when we add jitter to our emulator. This would cause a bit more CPU usage and lock contention in our software, and would increase the skb pkts allocated at any given time. I bisected the problem to the commit below: Author: Eric Dumazet Date: Wed Aug 31 10:42:29 2016 -0700 softirq: Let ksoftirqd do its job A while back, Paolo and Hannes sent an RFC patch adding threaded-able napi poll loop support : (https://patchwork.ozlabs.org/patch/620657/) If I replace my emulator with a bridge, then I do not see the problem. But, I also do not (or very rarely?) see the problem when configuring the emulator with zero latency and jitter, which is how the bridge would act. Any idea what sort of (bad?) behaviour would be able to cause this tx q timeout? If you have any interest, I will be happy to email you my out-of-tree patches and instructions to reproduce the problem. The kernel splat looks like this, and repeats often: May 17 16:03:09 localhost.localdomain kernel: audit: type=1131 audit(1526598189.492:159): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? 
res=success' May 17 16:03:39 localhost.localdomain kernel: [ cut here ] May 17 16:03:39 localhost.localdomain kernel: WARNING: CPU: 5 PID: 0 at /home/greearb/git/linux-bisect/net/sched/sch_generic.c:316 dev_watchdog+0x234/0x240 May 17 16:03:39 localhost.localdomain kernel: NETDEV WATCHDOG: eth5 (igb): transmit queue 0 timed out May 17 16:03:39 localhost.localdomain kernel: Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 fuse macvlan wanlink(O) pktgen cfg80211 sunrpc coretemp intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass ipmi_ssif iTCO_wdt iTCO_vendor_support joydev i2c_i801 lpc_ich i2c_smbus ioatdma shpchp wmi ipmi_si ipmi_msghandler tpm_tis tpm_tis_core tpm acpi_power_meter acpi_pad sch_fq_codel ast drm_kms_helper ttm drm igb hwmon ptp pps_core dca i2c_algo_bit i2c_core fjes ipv6 crc_ccitt [last unloaded: nf_conntrack] May 17 16:03:39 localhost.localdomain kernel: CPU: 5 PID: 0 Comm: swapper/5 Tainted: G O4.8.0-rc7+ #132 May 17 16:03:39 localhost.localdomain kernel: Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017 May 17 16:03:39 localhost.localdomain kernel: 88087fd43d78 81417eb1 88087fd43dc8 May 17 16:03:39 localhost.localdomain kernel: 88087fd43db8 81103556 013c7fd43da8 May 17 16:03:39 localhost.localdomain kernel: 880854221940 0005 880854bb8000 May 17 16:03:39 localhost.localdomain kernel: Call Trace: May 17 16:03:39 localhost.localdomain kernel:[] dump_stack+0x63/0x82 May 17 16:03:39 localhost.localdomain kernel: [] __warn+0xc6/0xe0 May 17 16:03:39 localhost.localdomain kernel: [] warn_slowpath_fmt+0x4a/0x50 May 17 16:03:39 localhost.localdomain kernel: [] dev_watchdog+0x234/0x240 May 17 16:03:39 localhost.localdomain kernel: [] ? qdisc_rcu_free+0x40/0x40 May 17 16:03:39 localhost.localdomain kernel: [] call_timer_fn+0x30/0x150 May 17 16:03:39 localhost.localdomain kernel: [] ? 
qdisc_rcu_free+0x40/0x40 May 17 16:03:39 localhost.localdomain kernel: [] run_timer_softirq+0x1ea/0x450 May 17 16:03:39 localhost.localdomain kernel: [] ? ktime_get+0x37/0xa0 May 17 16:03:39 localhost.localdomain kernel: [] ? lapic_next_deadline+0x21/0x30 May 17 16:03:39 localhost.localdomain kernel: [] ? clockevents_program_event+0x7d/0x120 May 17 16:03:39 localhost.localdomain kernel: [] __do_softirq+0xca/0x2d0 May 17 16:03:39 localhost.localdomain kernel: [] irq_exit+0xb3/0xc0 May 17 16:03:39 localhost.localdomain kernel: [] smp_apic_timer_interrupt+0x3d/0x50 May 17 16:03:39 localhost.localdomain kernel: [] apic_timer_interrupt+0x82/0x90 May 17 16:03:39 localhost.localdomain kernel:[] ? cpuidle_enter_state+0x126/0x300 May 17 16:03:39 localhost.localdomain kernel: [] cpuidle_enter+0x12/0x20 May 17 16:03:39 localhost.localdomain kernel: [] call_cpuidle+0x25/0x40 May 17 16:03:39 localhost.localdomain kernel: [] cpu_startup_entry+0x2ba/0x380 May 17 16:03:39 localhost.localdomain kernel: [] start_secondary+0x149/0x170 May 17 16:03:39 localhost.localdomain kernel: ---[ end trace f62c6dd947785e8f ]--- Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: Performance regression between 4.13 and 4.14
On 05/09/2018 12:02 PM, Ben Greear wrote: On 05/09/2018 11:48 AM, Eric Dumazet wrote: On 05/09/2018 11:43 AM, Ben Greear wrote: On 05/08/2018 10:10 AM, Eric Dumazet wrote: On 05/08/2018 09:44 AM, Ben Greear wrote: Hello, I am trying to track down a performance regression that appears to be between 4.13 and 4.14. I first saw the problem with a hacked version of pktgen on some ixgbe NICs. 4.13 can do right at 10G bi-directional on two ports, and 4.14 and later can do only about 6Gbps. I also tried with user-space UDP traffic on a stock kernel, and I can get about 3.2Gbps combined tx+rx on 4.14 and about 4.4Gbps on 4.13. Attempting to bisect seems to be triggering a weirdness in git, and also lots of commits crash or do not bring up networking, which makes the bisect difficult. Looking at perf top, it would appear that some lock is probably to blame. perf record -a -g -e cycles:pp sleep 5 perf report Then you'll be able to tell us which lock (or call graph) is killing your perf. I seem to be chasing multiple issues. For 4.13, at least part of my problem was that LOCKDEP was enabled, during my bisect, though it does NOT appear enabled in 4.16. I think maybe CONFIG_LOCKDEP moved to CONFIG_PROVE_LOCKING in 4.16, or something like that? My 4.16 .config does have CONFIG_LOCKDEP_SUPPORT enabled, and I see no option to disable it: [greearb@ben-dt3 linux-4.16.x64]$ grep LOCKDEP .config CONFIG_LOCKDEP_SUPPORT=y For 4.16, I am disabling RETRAMPOLINE...are there any other such things I need to disable to keep from getting a performance hit from the spectre-related bug fixes? At this point, I do not care about the security implications. greearb@ben-dt3 linux-4.16.x64]$ grep RETPO .config # CONFIG_RETPOLINE is not set Thanks, Ben No idea really, you mention a 4.13 -> 4.14 regression and jump then to 4.16 :/ I initially saw the problem in 4.16, then bisected, and 4.14 still showed the issue. So, I guess I must have been enabling lockdep the whole time. 
As far as I can tell, this __lock_acquire comes from lockdep, not from normal locking. I re-built 4.16 after verifying as best I could that lockdep was not enabled, and now it performs as expected. I'm going to test a patch that renames __lock_acquire to __lock_acquire_lockdep so that maybe someone else will not make the same mistake I made. + 17.78% 17.78% kpktgend_1 [kernel.kallsyms] [k] __lock_acquire.isra.3 Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
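A small sketch of the .config check that would have caught this earlier. It rests on an assumption about Kconfig rather than on anything stated in the thread: to my understanding, CONFIG_LOCKDEP_SUPPORT only says the architecture can support lockdep, while lockdep itself is active when CONFIG_LOCKDEP=y, which is normally pulled in by options such as CONFIG_PROVE_LOCKING, CONFIG_DEBUG_LOCK_ALLOC or CONFIG_LOCK_STAT. The helper names `config_enabled()` and `lockdep_active()` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if a line of config_text is exactly "<option>=y". */
static bool config_enabled(const char *config_text, const char *option)
{
	size_t len = strlen(option);
	const char *line = config_text;

	while (line != NULL && *line != '\0') {
		if (strncmp(line, option, len) == 0 &&
		    strncmp(line + len, "=y", 2) == 0 &&
		    (line[len + 2] == '\n' || line[len + 2] == '\0'))
			return true;
		line = strchr(line, '\n');   /* advance to the next line */
		if (line != NULL)
			line++;
	}
	return false;
}

/* True if any option that (by assumption) turns lockdep on is set. */
static bool lockdep_active(const char *config_text)
{
	static const char *const opts[] = {
		"CONFIG_LOCKDEP",
		"CONFIG_PROVE_LOCKING",
		"CONFIG_DEBUG_LOCK_ALLOC",
		"CONFIG_LOCK_STAT",
	};

	assert(config_text != NULL);
	for (size_t i = 0; i < sizeof(opts) / sizeof(opts[0]); i++)
		if (config_enabled(config_text, opts[i]))
			return true;
	return false;
}
```

With this check, a config containing only `CONFIG_LOCKDEP_SUPPORT=y` is reported as lockdep-free, which matches the grep output quoted in the thread for the 4.16 build.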
Re: Performance regression between 4.13 and 4.14
On 05/09/2018 11:48 AM, Eric Dumazet wrote: On 05/09/2018 11:43 AM, Ben Greear wrote: On 05/08/2018 10:10 AM, Eric Dumazet wrote: On 05/08/2018 09:44 AM, Ben Greear wrote: Hello, I am trying to track down a performance regression that appears to be between 4.13 and 4.14. I first saw the problem with a hacked version of pktgen on some ixgbe NICs. 4.13 can do right at 10G bi-directional on two ports, and 4.14 and later can do only about 6Gbps. I also tried with user-space UDP traffic on a stock kernel, and I can get about 3.2Gbps combined tx+rx on 4.14 and about 4.4Gbps on 4.13. Attempting to bisect seems to be triggering a weirdness in git, and also lots of commits crash or do not bring up networking, which makes the bisect difficult. Looking at perf top, it would appear that some lock is probably to blame. perf record -a -g -e cycles:pp sleep 5 perf report Then you'll be able to tell us which lock (or call graph) is killing your perf. I seem to be chasing multiple issues. For 4.13, at least part of my problem was that LOCKDEP was enabled, during my bisect, though it does NOT appear enabled in 4.16. I think maybe CONFIG_LOCKDEP moved to CONFIG_PROVE_LOCKING in 4.16, or something like that? My 4.16 .config does have CONFIG_LOCKDEP_SUPPORT enabled, and I see no option to disable it: [greearb@ben-dt3 linux-4.16.x64]$ grep LOCKDEP .config CONFIG_LOCKDEP_SUPPORT=y For 4.16, I am disabling RETRAMPOLINE...are there any other such things I need to disable to keep from getting a performance hit from the spectre-related bug fixes? At this point, I do not care about the security implications. greearb@ben-dt3 linux-4.16.x64]$ grep RETPO .config # CONFIG_RETPOLINE is not set Thanks, Ben No idea really, you mention a 4.13 -> 4.14 regression and jump then to 4.16 :/ I initially saw the problem in 4.16, then bisected, and 4.14 still showed the issue. 
4.13 works, but only when I use a .config I originally built for 4.13, not the 4.16 .config that I ended up using with the bisect (make oldconfig, accept all defaults). I originally configured 4.16 with a .config that had lockdep enabled, then manually tried to disable it through 'make xconfig'. I think that must leave "CONFIG_LOCKDEP=y" in the .config, which screws up older builds during bisect, perhaps? Before doing a (painful) dissection, the perf output would immediately tell you if something is really wrong on your .config. I didn't realize lockdep might be an issue at the time, but here is a 'bad' run from a 4.13+ (plus pktgen hacks). I guess lockdep is why this runs slowly, but I see no obvious proof of that in the output: 4.13+, patched pktgen, 6Gbps throughput, on commit 906dde0f355bd97c080c215811ae7db1137c4af8 Samples: 26K of event 'cycles:pp', Event count (approx.): 20119166736 Children Self Command Shared ObjectSymbol + 87.97% 0.00% kpktgend_1 [kernel.kallsyms][k] ret_from_fork + 87.97% 0.00% kpktgend_1 [kernel.kallsyms][k] kthread + 86.89% 5.42% kpktgend_1 [kernel.kallsyms][k] pktgen_thread_worker + 33.75% 0.18% kpktgend_1 [kernel.kallsyms][k] getnstimeofday64 + 32.77% 4.47% kpktgend_1 [kernel.kallsyms][k] __getnstimeofday64 + 24.60%10.91% kpktgend_1 [kernel.kallsyms][k] lock_acquire + 23.59% 0.03% kpktgend_1 [kernel.kallsyms][k] __do_softirq + 23.55% 0.07% kpktgend_1 [kernel.kallsyms][k] net_rx_action + 22.29% 0.47% kpktgend_1 [kernel.kallsyms][k] getRelativeCurNs + 21.33% 1.71% kpktgend_1 [kernel.kallsyms][k] ixgbe_poll + 15.79% 0.02% kpktgend_1 [kernel.kallsyms][k] ret_from_intr + 15.78% 0.01% kpktgend_1 [kernel.kallsyms][k] do_IRQ + 15.34% 0.01% kpktgend_1 [kernel.kallsyms][k] irq_exit + 13.95%10.00% kpktgend_1 [kernel.kallsyms][k] ip_send_check + 13.80%13.80% kpktgend_1 [kernel.kallsyms][k] __lock_acquire.isra.31 + 12.98% 0.53% kpktgend_1 [kernel.kallsyms][k] pktgen_finalize_skb + 12.31% 0.20% kpktgend_1 [kernel.kallsyms][k] timestamp_skb.isra.24 
+ 11.68% 0.13% kpktgend_1 [kernel.kallsyms][k] napi_gro_receive + 11.36% 0.25% kpktgend_1 [kernel.kallsyms][k] netif_receive_skb_internal + 10.93% 0.00% swapper [kernel.kallsyms][k] verify_cpu + 10.93% 0.00% swapper [kernel.kallsyms][k] cpu_startup_entry + 10.92% 0.02% swapper [kernel.kallsyms][k] do_idle + 10.71% 0.00% swapper [kernel.kallsyms][k] cpuidle_enter +
Re: Performance regression between 4.13 and 4.14
On 05/08/2018 10:10 AM, Eric Dumazet wrote: On 05/08/2018 09:44 AM, Ben Greear wrote: Hello, I am trying to track down a performance regression that appears to be between 4.13 and 4.14. I first saw the problem with a hacked version of pktgen on some ixgbe NICs. 4.13 can do right at 10G bi-directional on two ports, and 4.14 and later can do only about 6Gbps. I also tried with user-space UDP traffic on a stock kernel, and I can get about 3.2Gbps combined tx+rx on 4.14 and about 4.4Gbps on 4.13. Attempting to bisect seems to be triggering a weirdness in git, and also lots of commits crash or do not bring up networking, which makes the bisect difficult. Looking at perf top, it would appear that some lock is probably to blame. perf record -a -g -e cycles:pp sleep 5 perf report Then you'll be able to tell us which lock (or call graph) is killing your perf. I seem to be chasing multiple issues. For 4.13, at least part of my problem was that LOCKDEP was enabled, during my bisect, though it does NOT appear enabled in 4.16. I think maybe CONFIG_LOCKDEP moved to CONFIG_PROVE_LOCKING in 4.16, or something like that? My 4.16 .config does have CONFIG_LOCKDEP_SUPPORT enabled, and I see no option to disable it: [greearb@ben-dt3 linux-4.16.x64]$ grep LOCKDEP .config CONFIG_LOCKDEP_SUPPORT=y For 4.16, I am disabling RETRAMPOLINE...are there any other such things I need to disable to keep from getting a performance hit from the spectre-related bug fixes? At this point, I do not care about the security implications. greearb@ben-dt3 linux-4.16.x64]$ grep RETPO .config # CONFIG_RETPOLINE is not set Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
ICMP redirect and VRF
While debugging some other problem today on a system using ip rules instead of VRF, I ran into a case where the remote router was sending back ICMP redirects. That got me thinking...where would these routes get stored in a VRF scenario? Would it magically go to the correct VRF routing table based on the incoming interface for the ICMP redirect response? Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Performance regression between 4.13 and 4.14
Hello, I am trying to track down a performance regression that appears to be between 4.13 and 4.14. I first saw the problem with a hacked version of pktgen on some ixgbe NICs. 4.13 can do right at 10G bi-directional on two ports, and 4.14 and later can do only about 6Gbps. I also tried with user-space UDP traffic on a stock kernel, and I can get about 3.2Gbps combined tx+rx on 4.14 and about 4.4Gbps on 4.13. Attempting to bisect seems to be triggering a weirdness in git, and also lots of commits crash or do not bring up networking, which makes the bisect difficult. Looking at perf top, it would appear that some lock is probably to blame. Any ideas what might have been introduced during this interval that would cause this? Anyone else seen similar? I'm going to attempt some more manual steps to try to find the commit that introduces this... Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: The SO_BINDTODEVICE was set to the desired interface, but packets are received from all interfaces.
On 05/07/2018 03:19 AM, Damir Mansurov wrote:

Greetings, After a successful call to setsockopt(SO_BINDTODEVICE) to restrict data reception to a single interface, data is still received from all interfaces. setsockopt() returns 0, but recv() then receives data from all available network interfaces. The problem is reproducible on Linux kernels 4.14 - 4.16, but not on Linux kernels 4.4 and 4.13. I have written C code to reproduce this issue (see attached files b2d_send.c and b2d_recv.c). The tested configuration is explained below.

Hello, I am not sure if this is your problem or not, but if you are using VRF, then you need to set SO_BINDTODEVICE before you do the 'normal' bind() call. Thanks, Ben

     PC-1                 PC-2
 -----------          -----------
| b2d_send  |        | b2d_recv  |
|           |        |           |
|     ------|        |------     |
|    | eth0 |--------| eth0 |    |
|     ------|        |------     |
|           |        |           |
|     ------|        |------     |
|    | eth1 |--------| eth1 |    |
|     ------|        |------     |
|           |        |           |
 -----------          -----------

Steps:
1. Copy b2d_recv.c to PC-2, compile it ("gcc -o b2d_recv b2d_recv.c") and run "./b2d_recv eth0 23777" to receive data only from the eth0 interface. The port number 23777 is only an example.
2. Copy b2d_send.c to PC-1, compile it ("gcc -o b2d_send b2d_send.c") and run "./b2d_send ip1 ip2 23777", where ip1 and ip2 are the IP addresses of interfaces eth0 and eth1 of PC-2.
3. Result:
- b2d_recv prints out data from eth0 and eth1 on Linux kernels from 4.14 up to 4.16.
- b2d_recv prints out data from only eth0 on Linux kernels below 4.14.

Thanks, Damir Mansurov dn...@oktetlabs.ru

-- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [PATCH] net: Work around crash in ipv6 fib-walk-continue
On 05/04/2018 10:47 AM, David Ahern wrote: On 4/19/18 12:01 PM, gree...@candelatech.com wrote: From: Ben Greear This keeps us from crashing in certain test cases where we bring up many (1000, for instance) mac-vlans with IPv6 enabled in the kernel. This bug has been around for a very long time. Until a real fix is found (and for stable), maybe it is better to return an incomplete fib walk instead of crashing. BUG: unable to handle kernel NULL pointer dereference at 8 IP: fib6_walk_continue+0x5b/0x140 [ipv6] PGD 8007dfc0c067 P4D 8007dfc0c067 PUD 7e66ff067 PMD 0 Oops: [#1] PREEMPT SMP PTI Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c vrf] CPU: 3 PID: 15117 Comm: ip Tainted: G O 4.16.0+ #5 Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017 RIP: 0010:fib6_walk_continue+0x5b/0x140 [ipv6] RSP: 0018:c90008c3bc10 EFLAGS: 00010287 RAX: 88085ac45050 RBX: 8807e03008a0 RCX: RDX: RSI: c90008c3bc48 RDI: 8232b240 RBP: 880819167600 R08: 0008 R09: 8807dff10071 R10: c90008c3bbd0 R11: R12: 8807e03008a0 R13: 0002 R14: 8807e05744c8 R15: 8807e08ef000 FS: 7f2f04342700() GS:88087fcc() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0008 CR3: 0007e0556002 CR4: 003606e0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: inet6_dump_fib+0x14b/0x2c0 [ipv6] netlink_dump+0x216/0x2a0 netlink_recvmsg+0x254/0x400 ? copy_msghdr_from_user+0xb5/0x110 ___sys_recvmsg+0xe9/0x230 ? find_held_lock+0x3b/0xb0 ? __handle_mm_fault+0x617/0x1180 ? __audit_syscall_entry+0xb3/0x110 ? 
__sys_recvmsg+0x39/0x70 __sys_recvmsg+0x39/0x70 do_syscall_64+0x63/0x120 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x7f2f03a72030 RSP: 002b:7fffab3de508 EFLAGS: 0246 ORIG_RAX: 002f RAX: ffda RBX: 7fffab3e641c RCX: 7f2f03a72030 RDX: RSI: 7fffab3de570 RDI: 0004 RBP: R08: 7e6c R09: 7fffab3e63a8 R10: 7fffab3de5b0 R11: 0246 R12: 7fffab3e6608 R13: 0066b460 R14: 7e6c R15: Code: 85 d2 74 17 f6 40 2a 04 74 11 8b 53 2c 85 d2 0f 84 d7 00 00 00 83 ea 01 89 53 2c c7 4 RIP: fib6_walk_continue+0x5b/0x140 [ipv6] RSP: c90008c3bc10 CR2: 0008 ---[ end trace bd03458864eb266c ]--- Signed-off-by: Ben Greear --- Does your use case that triggers this involve replacing routes? I just noticed the route delete code in fib6_add_rt2node does not have the 'Adjust walkers' code that is in fib6_del_route. Further, the adjust walkers code in fib6_del_route looks suspicious in its timing with route deletes. If you have a reliable reproducer we can try a few things with fib6_del_route and the walker code. Yes, we replace routes, and yes we can reliably reproduce it and will be happy to test patches. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: Performance regressions in TCP_STREAM tests in Linux 4.15 (and later)
On 04/27/2018 08:11 PM, Steven Rostedt wrote: We'd like this email archived in the netdev list, but since netdev is notorious for blocking Outlook email as spam, it didn't go through. So I'm replying here to help get it into the archives. Thanks! -- Steve On Fri, 27 Apr 2018 23:05:46 + Michael Wenig wrote: As part of VMware's performance testing with the Linux 4.15 kernel, we identified CPU cost and throughput regressions when comparing to the Linux 4.14 kernel. The impacted test cases are mostly TCP_STREAM send tests when using small message sizes. The regressions are significant (up to 3x) and were tracked down to be a side effect of Eric Dumazet's RB tree changes that went into the Linux 4.15 kernel. Further investigation showed that our use of the TCP_NODELAY flag in conjunction with Eric's change caused the regressions to show, and simply disabling TCP_NODELAY brought performance back to normal. Eric's change also resulted in significant improvements in our TCP_RR test cases. Based on these results, our theory is that Eric's change made the system faster overall (reduced latency), but as a side effect less aggregation is happening (with TCP_NODELAY), and that results in lower throughput. Previously, even though TCP_NODELAY was set, the system was slower and we still got some benefit from aggregation. Aggregation improves efficiency and throughput, although it can increase latency. If you are seeing a regression in your application throughput after this change, using TCP_NODELAY might help bring performance back, however that might increase latency. I guess you mean _disabling_ TCP_NODELAY instead of _using_ TCP_NODELAY? Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
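For readers who want to try the comparison described above, a minimal sketch of toggling TCP_NODELAY on a socket and reading the setting back. The helper names `set_nodelay()` and `get_nodelay()` are hypothetical; the socket option itself is the standard one from tcp(7). With the option cleared, Nagle's algorithm is allowed to coalesce small writes, which is the aggregation the report refers to.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Enable (1) or disable (0) TCP_NODELAY. Returns 0 on success, -1 on error. */
static int set_nodelay(int fd, int enable)
{
	assert(fd >= 0);
	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
			  &enable, sizeof(enable));
}

/* Returns 1 if TCP_NODELAY is set, 0 if it is clear, -1 on error. */
static int get_nodelay(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	assert(fd >= 0);
	if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) < 0)
		return -1;
	return val != 0;
}
```

A benchmark client would call `set_nodelay(fd, 0)` before sending small messages to reproduce the "disabled" case, and `set_nodelay(fd, 1)` for the case that regressed in the report.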
Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.
On 04/22/2018 02:15 PM, Roopa Prabhu wrote: On Sun, Apr 22, 2018 at 11:54 AM, David Miller wrote: From: Johannes Berg Date: Thu, 19 Apr 2018 17:26:57 +0200 On Thu, 2018-04-19 at 08:25 -0700, Ben Greear wrote: Maybe this could be in followup patches? It's going to touch a lot of files, and might be hell to get merged all at once, and I've never used spatch, so just maybe someone else will volunteer that part :) I guess you'll have to ask davem. :) Well, first of all, I really don't like this. The first reason is that every time I see interface foo become foo2, foo3 is never far behind it. If foo was not extensible enough such that we needed foo2, we better design the new thing with explicitly better extensibility in mind. Furthermore, what you want here is a specific filter. Someone else will want to filter on another criterion, and the next person will want yet another. This needs to be properly generalized. And frankly if we had moved to ethtool netlink/devlink by now, we could just add a netlink attribute for filtering and not even be having this conversation. +1. Also, the RTM_GETSTATS API was added to improve stats query efficiency (with filters). We should look at it to see if this fits there. Keeping all stats queries in one place will help. I like the ethtool API, so I'll be sticking with that for now. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.
On 04/22/2018 11:54 AM, David Miller wrote: From: Johannes Berg Date: Thu, 19 Apr 2018 17:26:57 +0200 On Thu, 2018-04-19 at 08:25 -0700, Ben Greear wrote: Maybe this could be in followup patches? It's going to touch a lot of files, and might be hell to get merged all at once, and I've never used spatch, so just maybe someone else will volunteer that part :) I guess you'll have to ask davem. :) Well, first of all, I really don't like this. The first reason is that every time I see interface foo become foo2, foo3 is never far behind it. If foo was not extensible enough such that we needed foo2, we better design the new thing with explicitly better extensibility in mind. Furthermore, what you want here is a specific filter. Someone else will want to filter on another criterion, and the next person will want yet another. This needs to be properly generalized. And frankly if we had moved to ethtool netlink/devlink by now, we could just add a netlink attribute for filtering and not even be having this conversation. Well, since there are undefined flags, it would be simple enough to extend the API further in the future (flag (1<<31) could mean expect more input members, etc.). And adding up to 30 more flags to filter on different things won't change the API and should be backwards compatible. But if you don't want it, that is OK by me. I agree it is a fairly obscure feature. It would have saved me time if you had said you didn't want it at the first RFC patch though... Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.
On 04/18/2018 11:38 PM, Johannes Berg wrote: On Wed, 2018-04-18 at 14:51 -0700, Ben Greear wrote: It'd be pretty hard to know which flags are firmware stats? Yes, it is, but ethtool stats are difficult to understand in a generic manner anyway, so someone using them is already likely aware of low-level details of the driver(s) they are using. Right. Come to think of it though, + * @get_ethtool_stats2: Return extended statistics about the device. + * This is only useful if the device maintains statistics not + * included in &struct rtnl_link_stats64. + * Takes a flags argument: 0 means all (same as get_ethtool_stats), + * 0x1 (ETHTOOL_GS2_SKIP_FW) means skip firmware stats. + * Other flags are reserved for now. + * Same number of stats will be returned, but some of them might + * not be as accurate/refreshed. This is to allow not querying + * firmware or other expensive-to-read stats, for instance. "skip" vs. "don't refresh" is a bit ambiguous - I'd argue better to either really skip and not return the non-refreshed ones (also helps with the identifying), or rename the flag. In order to efficiently parse lots of stats over and over again, I probe the stat names once on startup, map them to the variable I am trying to use (since different drivers may have different names for the same basic stat), and then I store the stat index. On subsequent stat reads, I just grab stats and go right to the index to store the stat. If the stats indexes change, that will complicate my logic quite a bit. Maybe the flag could be called: ETHTOOL_GS2_NO_REFRESH_FW ? Also, wrt. the rest of the patch, I'd argue that it'd be worthwhile to write the spatch and just add the flags argument to "get_ethtool_stats" instead of adding a separate method - internally to the kernel it's not that hard to change. Maybe this could be in followup patches? 
It's going to touch a lot of files, and might be hell to get merged all at once, and I've never used spatch, so just maybe someone else will volunteer that part :) Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
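The probe-once, index-afterwards approach described in this message can be sketched in user space roughly as follows. This is only an illustration under stated assumptions: the alias table, the `wanted_stat` enum and the `stat_map_*` helpers are hypothetical names invented here, and the stat names are passed as plain C strings rather than the fixed-width string table a real ETHTOOL_GSTRINGS query returns.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical logical counters the application cares about. */
enum wanted_stat { WANT_RX_BYTES, WANT_TX_BYTES, WANT_COUNT };

/* Hypothetical per-driver aliases; real drivers choose their own names. */
static const char *const aliases[WANT_COUNT][2] = {
    [WANT_RX_BYTES] = { "rx_bytes", "rx_octets" },
    [WANT_TX_BYTES] = { "tx_bytes", "tx_octets" },
};

struct stat_map {
    int index[WANT_COUNT]; /* position in the stats array, or -1 if absent */
};

/* One-time probe: scan the driver's stat names and remember each index. */
static void stat_map_build(struct stat_map *m,
                           const char *const *names, size_t n_names)
{
    for (int w = 0; w < WANT_COUNT; w++) {
        m->index[w] = -1;
        for (size_t i = 0; i < n_names && m->index[w] < 0; i++) {
            for (size_t a = 0; a < 2; a++) {
                if (strcmp(names[i], aliases[w][a]) == 0) {
                    m->index[w] = (int)i;
                    break;
                }
            }
        }
    }
}

/* Fast path on every later read: no string compares, just an array index. */
static uint64_t stat_map_read(const struct stat_map *m, enum wanted_stat w,
                              const uint64_t *values, size_t n_values)
{
    int i = m->index[w];

    return (i >= 0 && (size_t)i < n_values) ? values[i] : 0;
}
```

The sketch also shows why a changing stat layout would hurt: the cached indexes are only valid as long as every read returns the same number of stats in the same order.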
Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.
On 04/18/2018 02:26 PM, Johannes Berg wrote: On Tue, 2018-04-17 at 18:49 -0700, gree...@candelatech.com wrote: + * @get_ethtool_stats2: Return extended statistics about the device. + * This is only useful if the device maintains statistics not + * included in &struct rtnl_link_stats64. + * Takes a flags argument: 0 means all (same as get_ethtool_stats), + * 0x1 (ETHTOOL_GS2_SKIP_FW) means skip firmware stats. + * Other flags are reserved for now. It'd be pretty hard to know which flags are firmware stats? Yes, it is, but ethtool stats are difficult to understand in a generic manner anyway, so someone using them is already likely aware of low-level details of the driver(s) they are using. In my case, I have lots of virtual stations (or APs), and I want stats for them as well as for the 'radio', so I would probe the first vdev with flags of 'skip-none' to get all stats, including radio (firmware) stats. And then the rest I would just probe the non-firmware stats. To be honest, I was slightly amused that anyone expressed interest in this patch originally, but maybe other people have similar use case and/or drivers with slow-to-acquire stats. Anyway, there's no way I'm going to take this patch, so you need to float it on netdev first (best CC us here) and get it applied there before we can do anything on the wifi side. I posted the patches to netdev, ath10k and linux-wireless. If I had only posted them individually to different lists I figure I'd be hearing about how the netdev patch is useless because it has no driver support, etc. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
On 01/24/2018 03:59 PM, Ben Greear wrote: On 06/20/2017 08:03 PM, David Ahern wrote: On 6/20/17 5:41 PM, Ben Greear wrote: On 06/20/2017 11:05 AM, Michal Kubecek wrote: On Tue, Jun 20, 2017 at 07:12:27AM -0700, Ben Greear wrote: On 06/14/2017 03:25 PM, David Ahern wrote: On 6/14/17 4:23 PM, Ben Greear wrote: On 06/13/2017 07:27 PM, David Ahern wrote: Let's try a targeted debug patch. See attached I had to change it to pr_err so it would go to our serial console since the system locked hard on crash, and that appears to be enough to change the timing where we can no longer reproduce the problem. ok, let's figure out which one is doing that. There are 3 debug statements. I suspect fib6_del_route is the one setting the state to FWS_U. Can you remove the debug prints in fib6_repair_tree and fib6_walk_continue and try again? We cannot reproduce with just that one printf in the kernel either. It must change the timing too much to trigger the bug. You might try trace_printk() which should have less impact (don't forget to enable /proc/sys/kernel/ftrace_dump_on_oops). We cannot reproduce with trace_printk() either. I think that suggests the walker state is set to FWS_U in fib6_del_route, and it is the FWS_U case in fib6_walk_continue that triggers the fault -- the null parent (pn = fn->parent). So we have the 2 areas of code that are interacting. I'm on a road trip through the end of this week with little time to focus on this problem. I'll get back to you another suggestion when I can. FYI, problem still happens in 4.16. I'm going to re-enable my hack below for this kernel as well...I had hopes it might be fixed... 
BUG: unable to handle kernel NULL pointer dereference at 8 IP: fib6_walk_continue+0x5b/0x140 [ipv6] PGD 8007dfc0c067 P4D 8007dfc0c067 PUD 7e66ff067 PMD 0 Oops: [#1] PREEMPT SMP PTI Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c vrf] CPU: 3 PID: 15117 Comm: ip Tainted: G O 4.16.0+ #5 Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017 RIP: 0010:fib6_walk_continue+0x5b/0x140 [ipv6] RSP: 0018:c90008c3bc10 EFLAGS: 00010287 RAX: 88085ac45050 RBX: 8807e03008a0 RCX: RDX: RSI: c90008c3bc48 RDI: 8232b240 RBP: 880819167600 R08: 0008 R09: 8807dff10071 R10: c90008c3bbd0 R11: R12: 8807e03008a0 R13: 0002 R14: 8807e05744c8 R15: 8807e08ef000 FS: 7f2f04342700() GS:88087fcc() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0008 CR3: 0007e0556002 CR4: 003606e0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: inet6_dump_fib+0x14b/0x2c0 [ipv6] netlink_dump+0x216/0x2a0 netlink_recvmsg+0x254/0x400 ? copy_msghdr_from_user+0xb5/0x110 ___sys_recvmsg+0xe9/0x230 ? find_held_lock+0x3b/0xb0 ? __handle_mm_fault+0x617/0x1180 ? __audit_syscall_entry+0xb3/0x110 ? __sys_recvmsg+0x39/0x70 __sys_recvmsg+0x39/0x70 do_syscall_64+0x63/0x120 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x7f2f03a72030 RSP: 002b:7fffab3de508 EFLAGS: 0246 ORIG_RAX: 002f RAX: ffda RBX: 7fffab3e641c RCX: 7f2f03a72030 RDX: RSI: 7fffab3de570 RDI: 0004 RBP: R08: 7e6c R09: 7fffab3e63a8 R10: 7fffab3de5b0 R11: 0246 R12: 7fffab3e6608 R13: 0066b460 R14: 7e6c R15: Code: 85 d2 74 17 f6 40 2a 04 74 11 8b 53 2c 85 d2 0f 84 d7 00 00 00 83 ea 01 89 53 2c c7 4 RIP: fib6_walk_continue+0x5b/0x140 [ipv6] RSP: c90008c3bc10 CR2: 0008 ---[ end trace bd03458864eb266c ]--- Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: disabled Rebooting in 10 seconds.. ACPI MEMORY or I/O RESET_REG. So, though I don't know the right way to fix it, the patch below appears to make the system not crash. 
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index 68b9cc7..bf19a14 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -1614,6 +1614,12 @@ static int fib6_walk_continue(struct fib6_walker *w) pn = fn->parent; w->node = pn; #ifdef CONFIG_IPV6_SUBTREES + if (WARN_ON_ONCE(!pn)) { + pr_err("FWS-U, w: %p fn: %p pn: %p\n", + w, fn, pn); + /* Attempt to work around crash that has been here forever. --Ben */ + return 0; + } if (FIB6_SUBTREE(pn) == fn) { WARN_ON(!(fn->fn_flags & RTN_ROOT)); w->stat
Re: [RFC] ethtool: Support ETHTOOL_GSTATS2 command.
On 03/20/2018 11:24 AM, Michal Kubecek wrote: On Tue, Mar 20, 2018 at 08:39:33AM -0700, Ben Greear wrote: On 03/20/2018 03:37 AM, Michal Kubecek wrote: IMHO it would be more practical to set "0 means same as GSTATS" as a rule and make ethtool_get_stats() a wrapper for ethtool_get_stats2() to avoid code duplication (or perhaps use a fall-through in the switch). It would also allow drivers to provide only one of the callbacks. Yes, but that would require changing all drivers at once, and would make backporting and out-of-tree drivers harder to manage. I had low hopes that this feature would make it upstream, so I didn't want to propose any large changes up front. I don't think so. What I mean is: (a) driver implements ->get_ethtool_stats2() callback; then we use it for GSTATS2 (b) driver does not implement get_ethtool_stats2() but implements ->get_ethtool_stats(); then we call it for GSTATS2 if level is zero, otherwise GSTATS2 returns -EINVAL and GSTATS is always translated to GSTATS2 with level 0, either by defining ethtool_get_stats() as a wrapper or by fall-through in the switch statement. This way, most drivers could be left untouched and only those which would implement non-default levels would provide ->get_ethtool_stats2() callback instead of ->get_ethtool_stats(). OK, that makes sense. I'll wait on feedback from the flags or #defined levels and re-spin the patch accordingly. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
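The (a)/(b) dispatch rule proposed above could look roughly like the user-space model below. It is a sketch, not the posted patch: `struct mock_ops`, `mock_get_stats()`, `mock_get_stats2()` and the two toy driver callbacks are hypothetical stand-ins for the kernel's `struct ethtool_ops` and ioctl handlers, and the callback signatures are simplified.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for the kernel's struct ethtool_ops. */
struct mock_ops {
    void (*get_ethtool_stats)(uint64_t *data);
    void (*get_ethtool_stats2)(uint64_t *data, uint32_t level);
};

/* GSTATS2 path: prefer the new callback, fall back to the old one at level 0. */
static int mock_get_stats2(const struct mock_ops *ops, uint64_t *data,
                           uint32_t level)
{
    if (ops->get_ethtool_stats2) {      /* case (a) */
        ops->get_ethtool_stats2(data, level);
        return 0;
    }
    if (ops->get_ethtool_stats) {       /* case (b) */
        if (level != 0)
            return -EINVAL;             /* old callback cannot honor a level */
        ops->get_ethtool_stats(data);
        return 0;
    }
    return -EOPNOTSUPP;                 /* driver provides neither callback */
}

/* GSTATS path: always translated to GSTATS2 with level 0. */
static int mock_get_stats(const struct mock_ops *ops, uint64_t *data)
{
    return mock_get_stats2(ops, data, 0);
}

/* Two toy drivers: one with only the old callback, one with only the new. */
static void old_driver_stats(uint64_t *data)
{
    data[0] = 1;
}

static void new_driver_stats2(uint64_t *data, uint32_t level)
{
    data[0] = 100 + level;
}
```

With this shape a driver fills in exactly one of the two callbacks, which is the point of the proposal: untouched drivers keep working for GSTATS and for GSTATS2 at level 0.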
Re: [PATCH] net: dev_forward_skb(): Scrub packet's per-netns info only when crossing netns
On 03/20/2018 09:44 AM, Liran Alon wrote: On 20/03/18 18:24, ebied...@xmission.com wrote: I don't believe the current behavior is a bug. I looked through the history. Basically skb_scrub_packet started out as the scrubbing needed for crossing network namespaces. Then tunnels which needed 90% of the functionality started calling it, with the xnet flag added. Because the tunnels needed to preserve their historic behavior. Then dev_forward_skb started calling skb_scrub_packet. A veth pair is supposed to give the same behavior as a cross-over cable plugged into two local nics. A cross-over cable won't preserve things like the skb mark. So I don't see why anyone would expect a veth pair to preserve the mark. I disagree with this argument. I think that an skb crossing netns is what simulates a real packet crossing physical computers. Following your argument, why should skb->mark be preserved when crossing netdevs on same netns via routing? But this does today preserve skb->mark. Therefore, I do think that skb->mark should conceptually only be scrubbed when crossing netns. Regardless of the netdev used to cross it. It should be scrubbed in VETH as well. That is one way to make virtual routers. Possibly the newer VRF features will give another better way to do it, but you should not break things that used to work. Now, if you want to add a new feature that allows one to configure the kernel (or VETH) for a new behavior, then that might be something to consider. Right now I don't see the point of handling packets that don't cross network namespace boundaries specially, other than to preserve backwards compatibility. Well, backwards compat is a big deal all by itself! Thanks, Ben Eric -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [RFC] ethtool: Support ETHTOOL_GSTATS2 command.
On 03/20/2018 09:11 AM, Steve deRosier wrote: On Tue, Mar 20, 2018 at 8:39 AM, Ben Greear wrote: On 03/20/2018 03:37 AM, Michal Kubecek wrote: On Wed, Mar 07, 2018 at 11:51:29AM -0800, gree...@candelatech.com wrote: From: Ben Greear This is similar to ETHTOOL_GSTATS, but it allows you to specify a 'level'. This level can be used by the driver to decrease the amount of stats refreshed. In particular, this helps with ath10k since getting the firmware stats can be slow. Signed-off-by: Ben Greear --- NOTE: I know to make it upstream I would need to split the patch and remove the #define for 'backporting' that I added. But, is the feature in general wanted? If so, I'll do the patch split and other tweaks that might be suggested. Yes, but that would require changing all drivers at once, and would make backporting and out-of-tree drivers harder to manage. I had low hopes that this feature would make it upstream, so I didn't want to propose any large changes up front. Hi Ben, I find the feature OK, but I'm not thrilled with the arbitrary scale of "level". Maybe there could be some named values, either on a spectrum as level already is, similar to the kernel log DEBUG, WARN, INFO type levels. Or named bit flags like the way the ath drivers do their debug flags for granular results. Thoughts? Yes, that would be easier to code too. If there are any other drivers out there that might take advantage of this, maybe they could chime in with what levels and/or bit-fields they would like to see. For instance a bit that says 'refresh-stats-from-firmware' would be great for ath10k, but maybe useless for everyone else. Thanks, Ben - Steve -- Ben Greear Candela Technologies Inc http://www.candelatech.com
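A named 'refresh-stats-from-firmware' bit, as floated above, might be modeled like this. Everything here is an assumption made for illustration: `GS2_REFRESH_FW`, `GS2_KNOWN_FLAGS`, `struct mock_radio` and `mock_fill_stats()` are hypothetical names, and rejecting unknown bits with -EINVAL is a design choice of this sketch (so that unassigned bits can be given meaning later), not something the thread agreed on.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical named flag bits instead of an arbitrary numeric level. */
#define GS2_REFRESH_FW   (1u << 0)   /* pay for a slow firmware query */
#define GS2_KNOWN_FLAGS  (GS2_REFRESH_FW)

/* Toy device: one cheap software counter, one expensive firmware counter. */
struct mock_radio {
    uint64_t sw_stat;         /* always cheap to read */
    uint64_t fw_live;         /* value the firmware would report right now */
    uint64_t cached_fw_stat;  /* last value fetched from firmware */
    unsigned int fw_queries;  /* how many slow queries were issued */
};

/*
 * Fill out[0] (software stat) and out[1] (firmware stat). The layout is the
 * same regardless of flags; only the freshness of out[1] changes.
 */
static int mock_fill_stats(struct mock_radio *r, uint32_t flags,
                           uint64_t out[2])
{
    if (flags & ~GS2_KNOWN_FLAGS)
        return -EINVAL;       /* assumption: unknown bits are rejected */

    if (flags & GS2_REFRESH_FW) {
        r->fw_queries++;      /* stands in for the slow firmware round trip */
        r->cached_fw_stat = r->fw_live;
    }

    out[0] = r->sw_stat;
    out[1] = r->cached_fw_stat;
    return 0;
}
```

A caller with many virtual devices on one radio could pass `GS2_REFRESH_FW` for the first device only and 0 for the rest, so the firmware is queried once per polling round.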
Re: [RFC] ethtool: Support ETHTOOL_GSTATS2 command.
On 03/20/2018 03:37 AM, Michal Kubecek wrote: On Wed, Mar 07, 2018 at 11:51:29AM -0800, gree...@candelatech.com wrote: From: Ben Greear This is similar to ETHTOOL_GSTATS, but it allows you to specify a 'level'. This level can be used by the driver to decrease the amount of stats refreshed. In particular, this helps with ath10k since getting the firmware stats can be slow. Signed-off-by: Ben Greear --- NOTE: I know to make it upstream I would need to split the patch and remove the #define for 'backporting' that I added. But, is the feature in general wanted? If so, I'll do the patch split and other tweaks that might be suggested. I'm not familiar enough with the technical background of stats collecting to comment on usefulness and desirability of this feature. Adding a new command just to add a numeric parameter certainly doesn't feel right but it's how the ioctl interface works. I take it as a reminder to find some time to get back to the netlink interface. diff --git a/net/core/ethtool.c b/net/core/ethtool.c index 674b6c9..d3b709f 100644 --- a/net/core/ethtool.c +++ b/net/core/ethtool.c @@ -1947,6 +1947,54 @@ static int ethtool_get_stats(struct net_device *dev, void __user *useraddr) return ret; } +static int ethtool_get_stats2(struct net_device *dev, void __user *useraddr) +{ + struct ethtool_stats stats; + const struct ethtool_ops *ops = dev->ethtool_ops; + u64 *data; + int ret, n_stats; + u32 stats_level = 0; + + if (!ops->get_ethtool_stats2 || !ops->get_sset_count) + return -EOPNOTSUPP; + + n_stats = ops->get_sset_count(dev, ETH_SS_STATS); + if (n_stats < 0) + return n_stats; + if (n_stats > S32_MAX / sizeof(u64)) + return -ENOMEM; + WARN_ON_ONCE(!n_stats); + if (copy_from_user(&stats, useraddr, sizeof(stats))) + return -EFAULT; + + /* User can specify the level of stats to query. How the +* level value is used is up to the driver, but in general, +* 0 means 'all', 1 means least, and higher means more. 
+* The idea is that some stats may be expensive to query, so user +* space could just ask for the cheap ones... +*/ + stats_level = stats.n_stats; + + stats.n_stats = n_stats; + data = vzalloc(n_stats * sizeof(u64)); + if (n_stats && !data) + return -ENOMEM; + + ops->get_ethtool_stats2(dev, &stats, data, stats_level); + + ret = -EFAULT; + if (copy_to_user(useraddr, &stats, sizeof(stats))) + goto out; + useraddr += sizeof(stats); + if (n_stats && copy_to_user(useraddr, data, n_stats * sizeof(u64))) + goto out; + ret = 0; + + out: + vfree(data); + return ret; +} + static int ethtool_get_phy_stats(struct net_device *dev, void __user *useraddr) { struct ethtool_stats stats; IMHO it would be more practical to set "0 means same as GSTATS" as a rule and make ethtool_get_stats() a wrapper for ethtool_get_stats2() to avoid code duplication (or perhaps a use fall-through in the switch). It would also allow drivers to provide only one of the callbacks. Yes, but that would require changing all drivers at once, and would make backporting and out-of-tree drivers harder to manage. I had low hopes that this feature would make it upstream, so I didn't want to propose any large changes up front. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [PATCH net] virtio-net: disable NAPI only when enabled during XDP set
On 02/28/2018 09:22 AM, David Miller wrote: From: Jason Wang Date: Wed, 28 Feb 2018 18:20:04 +0800 We try to disable NAPI to prevent a single XDP TX queue being used by multiple cpus. But we don't check if device is up (NAPI is enabled), this could result in a stall because of an infinite wait in napi_disable(). Fix this by checking the device state through netif_running() first. Fixes: 4941d472bf95b ("virtio-net: do not reset during XDP set") Signed-off-by: Jason Wang Yes, mis-paired NAPI enable/disable are really a pain. Probably, we can do something in the interfaces or mechanisms to make this less error prone and less fragile. Anyways, applied and queued up for -stable, thanks! I just hit a similar bug in ath10k. It seems like napi has plenty of free bit flags so it could keep track of 'is-enabled' state and allow someone to call napi_disable multiple times w/out deadlocking. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
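The 'is-enabled' tracking idea in the last paragraph can be illustrated with a small user-space model. This is not how the kernel's napi_enable()/napi_disable() are implemented; `struct mock_napi`, `mock_napi_enable()` and `mock_napi_disable()` are hypothetical names, and the model is single-threaded with no locking or atomics.

```c
#include <stdbool.h>

/* Toy model of a NAPI context that remembers whether it is enabled. */
struct mock_napi {
    bool enabled;
    unsigned int disable_waits; /* times the "real", possibly blocking, disable ran */
};

/* Returns true if the state changed, false if it was already enabled. */
static bool mock_napi_enable(struct mock_napi *n)
{
    if (n->enabled)
        return false;
    n->enabled = true;
    return true;
}

/*
 * Returns true if the state changed. A second disable is a no-op instead of
 * waiting forever for a poll that will never run.
 */
static bool mock_napi_disable(struct mock_napi *n)
{
    if (!n->enabled)
        return false;
    n->disable_waits++;         /* stands in for waiting on an in-flight poll */
    n->enabled = false;
    return true;
}
```

In this model a mis-paired second disable simply returns false, which is the behavior being asked for; a real implementation would also have to handle concurrent callers.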
Re: [PATCH RFC net-next 1/4] ipv4: fib_rules: support match on sport, dport and ip proto
On 02/12/2018 04:03 PM, David Miller wrote: From: Eric Dumazet Date: Mon, 12 Feb 2018 13:54:59 -0800 We had project/teams using different routing tables for each vlan they setup :/ Indeed, people use FIB rules and think they can scale in software. As currently implemented, they can't. The example you give sounds possibly like a great VRF use case btw :-) I'm one of those people with lots of FIB rules wishing it would scale better, and wanting a routing table per netdev. If there is a relatively easy suggestion to make this work better, I'd like to give it a try. I have not looked at VRF at all to date... Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
On 06/20/2017 08:03 PM, David Ahern wrote: On 6/20/17 5:41 PM, Ben Greear wrote: On 06/20/2017 11:05 AM, Michal Kubecek wrote: On Tue, Jun 20, 2017 at 07:12:27AM -0700, Ben Greear wrote: On 06/14/2017 03:25 PM, David Ahern wrote: On 6/14/17 4:23 PM, Ben Greear wrote: On 06/13/2017 07:27 PM, David Ahern wrote: Let's try a targeted debug patch. See attached I had to change it to pr_err so it would go to our serial console since the system locked hard on crash, and that appears to be enough to change the timing where we can no longer reproduce the problem. ok, let's figure out which one is doing that. There are 3 debug statements. I suspect fib6_del_route is the one setting the state to FWS_U. Can you remove the debug prints in fib6_repair_tree and fib6_walk_continue and try again? We cannot reproduce with just that one printf in the kernel either. It must change the timing too much to trigger the bug. You might try trace_printk() which should have less impact (don't forget to enable /proc/sys/kernel/ftrace_dump_on_oops). We cannot reproduce with trace_printk() either. I think that suggests the walker state is set to FWS_U in fib6_del_route, and it is the FWS_U case in fib6_walk_continue that triggers the fault -- the null parent (pn = fn->parent). So we have the 2 areas of code that are interacting. I'm on a road trip through the end of this week with little time to focus on this problem. I'll get back to you another suggestion when I can. So, though I don't know the right way to fix it, the patch below appears to make the system not crash. diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index 68b9cc7..bf19a14 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -1614,6 +1614,12 @@ static int fib6_walk_continue(struct fib6_walker *w) pn = fn->parent; w->node = pn; #ifdef CONFIG_IPV6_SUBTREES + if (WARN_ON_ONCE(!pn)) { + pr_err("FWS-U, w: %p fn: %p pn: %p\n", + w, fn, pn); + /* Attempt to work around crash that has been here forever. 
--Ben */ + return 0; + } if (FIB6_SUBTREE(pn) == fn) { WARN_ON(!(fn->fn_flags & RTN_ROOT)); w->state = FWS_L; The printout looks like this (when adding 4000 mac-vlans, so it is pretty rare). PN is definitely NULL sometimes: [root@2u-6n ~]# journalctl -f|grep FWS Jan 24 15:48:05 2u-6n kernel: IPv6: FWS-U, w: 8807ea121ba0 fn: 880856a09260 pn: (null) Jan 24 15:51:15 2u-6n kernel: IPv6: FWS-U, w: 8807e3963de0 fn: 880856a09260 pn: (null) Jan 24 15:51:15 2u-6n kernel: IPv6: FWS-U, w: 88081ac22de0 fn: 880856a09260 pn: (null) Jan 24 15:53:13 2u-6n kernel: IPv6: FWS-U, w: 8808290c69c0 fn: 8807e369f920 pn: (null) Jan 24 15:53:24 2u-6n kernel: IPv6: FWS-U, w: 8807ea3156c0 fn: 88082d1eeb60 pn: (null) Jan 24 15:48:04 2u-6n kernel: 8021q: adding VLAN 0 to HW filter on device eth2#1006 Jan 24 15:48:05 2u-6n kernel: [ cut here ] Jan 24 15:48:05 2u-6n kernel: WARNING: CPU: 5 PID: 3346 at /home/greearb/git/linux-4.13.dev.y/net/ipv6/ip6_fib.c:1617 fib6_walk_continue+0x154/0x1b0 [ipv6] Jan 24 15:48:05 2u-6n kernel: Modules linked in: 8021q garp mrp stp llc fuse macvlan wanlink(O) pktgen ipmi_ssif coretemp intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm ath9k irqbypass iTCO_wdt ath9k_common iTCO_vendor_support ath9k_hw ath i2c_i801 mac80211 joydev lpc_ich cfg80211 ioatdma shpchp tpm_tis tpm_tis_core wmi tpm ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter nfsd auth_rpcgss nfs_acl sch_fq_codel lockd grace sunrpc ast drm_kms_helper ttm drm igb hwmon ptp pps_core dca i2c_algo_bit i2c_core ipv6 crc_ccitt Jan 24 15:48:05 2u-6n kernel: CPU: 5 PID: 3346 Comm: ip Tainted: G O 4.13.16+ #22 Jan 24 15:48:05 2u-6n kernel: Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017 Jan 24 15:48:05 2u-6n kernel: task: 8807e9ef1dc0 task.stack: c9002083c000 Jan 24 15:48:05 2u-6n kernel: RIP: 0010:fib6_walk_continue+0x154/0x1b0 [ipv6] Jan 24 15:48:05 2u-6n kernel: RSP: 0018:c9002083fbc0 EFLAGS: 00010246 
Jan 24 15:48:05 2u-6n kernel: RAX: RBX: 8807ea121ba0 RCX: Jan 24 15:48:05 2u-6n kernel: RDX: 880856a09260 RSI: c9002083fc00 RDI: 81ef2140 Jan 24 15:48:05 2u-6n kernel: RBP: c9002083fbc8 R08: 0008 R09: 8807e36f6b25 Jan 24 15:48:05 2u-6n kernel: R10: c9002083fb70 R11: 000
Re: e1000e hardware unit hangs
On 01/24/2018 10:38 AM, Denys Fedoryshchenko wrote: On 2018-01-24 20:31, Ben Greear wrote: On 01/24/2018 08:34 AM, Neftin, Sasha wrote: On 1/24/2018 18:11, Alexander Duyck wrote: On Tue, Jan 23, 2018 at 3:46 PM, Ben Greear wrote: Hello, Anyone have any more suggestions for making e1000e work better? This is from a 4.9.65+ kernel, with these additional e1000e patches applied: e1000e: Fix error path in link detection e1000e: Fix wrong comment related to link detection e1000e: Fix return value test e1000e: Separate signaling for link check/link up e1000e: Avoid receiver overrun interrupt bursts Most of these patches shouldn't address anything that would trigger Tx hangs. They are mostly related to just link detection. Test case is simply to run 3 tcp connections each trying to send 56Kbps of bi-directional data between a pair of e1000e interfaces :) No OOM related issues are seen on this kernel...similar test on 4.13 showed some OOM issues, but I have not debugged that yet... Really a question like this probably belongs on e1000-devel or intel-wired-lan so I have added those lists and the e1000e maintainer to the thread. It would be useful if you could provide more information about the device itself such as the ID and the kind of test you are running. Keep in mind the e1000e driver supports a pretty broad swath of devices so we need to narrow things down a bit. please, also re-check if your kernel include: e1000e: fix buffer overrun while the I219 is processing DMA transactions e1000e: fix the use of magic numbers for buffer overrun issue where you take fresh version of kernel? Hello, I tried adding those two patches, but I still see this splat shortly after starting my test. The kernel I am using is here: https://github.com/greearb/linux-ct-4.13 I've seen similar issues at least back to the 4.0 kernel, including stock kernels and my own kernels with additional patches. 
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295298499, wd-timeout: 5000 jiffies: 4295304192 tx-queues: 1 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: [ cut here ] Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: WARNING: CPU: 0 PID: 0 at /home/greearb/git/linux-4.13.dev.y/net/sched/sch_generic.c:322 dev_watchdog+0x228/0x250 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c cfg80211 macvlan wanlink(O) pktgen bnep bluetooth f...ss tpm_tis ip Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O4.13.16+ #22 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Hardware name: Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: task: 81e104c0 task.stack: 81e0 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP: 0010:dev_watchdog+0x228/0x250 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP: 0018:88042fc03e50 EFLAGS: 00010282 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX: 0086 RBX: RCX: Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX: 88042fc15b40 RSI: 88042fc0dbf8 RDI: 88042fc0dbf8 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP: 88042fc03e98 R08: 0001 R09: 03c4 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10: R11: 03c4 R12: 1388 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R13: 000100050dc3 R14: 88041767 R15: 000100052400 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: FS: () GS:88042fc0() knlGS: Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CS: 0010 DS: ES: CR0: 80050033 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CR2: 01d14000 CR3: 01e09000 CR4: 001406f0 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Call Trace: Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? 
qdisc_rcu_free+0x40/0x40 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: call_timer_fn+0x30/0x160 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? qdisc_rcu_free+0x40/0x40 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: run_timer_softirq+0x1f0/0x450 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? lapic_next_deadline+0x21/0x30 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? clockevents_program_event+0x78/0xf0 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: __do_softirq+0xc1/0x2c0 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: irq_exit+0xb1/0xc0 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: smp_apic_timer_interrupt+0x38/0x50 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: apic_timer_interrupt+0x89/0x90 Ja
Re: e1000e hardware unit hangs
On 01/24/2018 08:34 AM, Neftin, Sasha wrote: On 1/24/2018 18:11, Alexander Duyck wrote: On Tue, Jan 23, 2018 at 3:46 PM, Ben Greear wrote: Hello, Anyone have any more suggestions for making e1000e work better? This is from a 4.9.65+ kernel, with these additional e1000e patches applied: e1000e: Fix error path in link detection e1000e: Fix wrong comment related to link detection e1000e: Fix return value test e1000e: Separate signaling for link check/link up e1000e: Avoid receiver overrun interrupt bursts Most of these patches shouldn't address anything that would trigger Tx hangs. They are mostly related to just link detection. Test case is simply to run 3 tcp connections each trying to send 56Kbps of bi-directional data between a pair of e1000e interfaces :) No OOM related issues are seen on this kernel...similar test on 4.13 showed some OOM issues, but I have not debugged that yet... Really a question like this probably belongs on e1000-devel or intel-wired-lan so I have added those lists and the e1000e maintainer to the thread. It would be useful if you could provide more information about the device itself such as the ID and the kind of test you are running. Keep in mind the e1000e driver supports a pretty broad swath of devices so we need to narrow things down a bit. please, also re-check if your kernel include: e1000e: fix buffer overrun while the I219 is processing DMA transactions e1000e: fix the use of magic numbers for buffer overrun issue where you take fresh version of kernel? Hello, I tried adding those two patches, but I still see this splat shortly after starting my test. The kernel I am using is here: https://github.com/greearb/linux-ct-4.13 I've seen similar issues at least back to the 4.0 kernel, including stock kernels and my own kernels with additional patches. 
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295298499, wd-timeout: 5000 jiffies: 4295304192 tx-queues: 1 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: [ cut here ] Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: WARNING: CPU: 0 PID: 0 at /home/greearb/git/linux-4.13.dev.y/net/sched/sch_generic.c:322 dev_watchdog+0x228/0x250 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c cfg80211 macvlan wanlink(O) pktgen bnep bluetooth f...ss tpm_tis ip Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O4.13.16+ #22 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Hardware name: Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: task: 81e104c0 task.stack: 81e0 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP: 0010:dev_watchdog+0x228/0x250 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP: 0018:88042fc03e50 EFLAGS: 00010282 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX: 0086 RBX: RCX: Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX: 88042fc15b40 RSI: 88042fc0dbf8 RDI: 88042fc0dbf8 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP: 88042fc03e98 R08: 0001 R09: 03c4 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10: R11: 03c4 R12: 1388 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R13: 000100050dc3 R14: 88041767 R15: 000100052400 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: FS: () GS:88042fc0() knlGS: Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CS: 0010 DS: ES: CR0: 80050033 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CR2: 01d14000 CR3: 01e09000 CR4: 001406f0 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Call Trace: Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? 
qdisc_rcu_free+0x40/0x40 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: call_timer_fn+0x30/0x160 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? qdisc_rcu_free+0x40/0x40 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: run_timer_softirq+0x1f0/0x450 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? lapic_next_deadline+0x21/0x30 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? clockevents_program_event+0x78/0xf0 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: __do_softirq+0xc1/0x2c0 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: irq_exit+0xb1/0xc0 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: smp_apic_timer_interrupt+0x38/0x50 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: apic_timer_interrupt+0x89/0x90 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:
Re: TCP many-connection regression (bisected to 4.5.0-rc2+)
On 01/23/2018 03:27 PM, Ben Greear wrote: On 01/23/2018 03:21 PM, Eric Dumazet wrote: On Tue, 2018-01-23 at 15:10 -0800, Ben Greear wrote: On 01/23/2018 02:29 PM, Eric Dumazet wrote: On Tue, 2018-01-23 at 14:09 -0800, Ben Greear wrote: On 01/23/2018 02:07 PM, Eric Dumazet wrote: On Tue, 2018-01-23 at 13:49 -0800, Ben Greear wrote: On 01/22/2018 10:16 AM, Eric Dumazet wrote: On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote: My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other on a system with 16GB RAM and send slow-speed data. This works fine on a 4.7 kernel, but will not work at all on a 4.13. The 4.13 first complains about running out of tcp memory, but even after forcing those values higher, the max connections we can get is around 15k. Both kernels have my out-of-tree patches applied, so it is possible it is my fault at this point. Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels? I will start bisecting in the meantime... Hi Ben Unfortunately I have no idea. Are you using loopback flows, or have I misunderstood you ? How loopback connections can be slow-speed ? Hello Eric, looks like it is one of your commits that causes the issue I see. Here are some more details on my specific test case I used to bisect: I have two ixgbe ports looped back, configured on same subnet, but with different IPs. Routing table rules, SO_BINDTODEVICE, binding to specific IPs on both client and server side let me send-to-self over the external looped cable. I have 2 mac-vlans on each physical interface. I created 5 server-side connections on one physical port, and two more on one of the mac-vlans. On the client-side, I create a process that spawns 5000 connections to the corresponding server side. End result is 25,000 connections on one pair of real interfaces, and 10,000 connections on the mac-vlan ports. In the passing case, I get very close to all 5000 connections on all endpoints quickly. 
In the failing case, I get a max of around 16k connections on the two physical ports. The two mac-vlans have 10k connections across them working reliably. It seems to be an issue with 'connect' failing. connect(2074, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress) socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2075 fcntl(2075, F_GETFD)= 0 fcntl(2075, F_SETFD, FD_CLOEXEC)= 0 setsockopt(2075, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0 setsockopt(2075, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(2075, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0 getsockopt(2075, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0 getsockopt(2075, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0 setsockopt(2075, SOL_TCP, TCP_NODELAY, [0], 4) = 0 fcntl(2075, F_GETFL)= 0x2 (flags O_RDWR) fcntl(2075, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0 connect(2075, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress) socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2076 fcntl(2076, F_GETFD)= 0 fcntl(2076, F_SETFD, FD_CLOEXEC)= 0 setsockopt(2076, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0 setsockopt(2076, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(2076, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0 getsockopt(2076, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0 getsockopt(2076, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0 setsockopt(2076, SOL_TCP, TCP_NODELAY, [0], 4) = 0 fcntl(2076, F_GETFL)= 0x2 (flags O_RDWR) fcntl(2076, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0 connect(2076, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address) ea8add2b190395408b22a9127bed2c0912aecbc8 is the first bad commit commit ea8add2b190395408b22a9127bed2c0912aecbc8 Author: Eric Dumazet Date: Thu Feb 11 16:28:50 2016 -0800 tcp/dccp: 
better use of ephemeral ports in bind() Implement strategy used in __inet_hash_connect() in opposite way : Try to find a candidate using odd ports, then fallback to even ports. We no longer disable BH for whole traversal, but one bucket at a time. We also use cond_resched() to yield cpu to other tasks if needed. I removed one indentation level and tried to mirror the loop we have in __inet_hash_connect() and variable names to ease code maintenance. Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller :04 04 3af4595c6eb6d331e1cba78a142d44e00f710d81 e0c014ae8b7e2867256eff60f6210821d36eacef M net I will be happy to test patches or try to get any other results that might help diagno
e1000e hardware unit hangs
, trans_start: 4294748730, wd-timeout: 5000 jiffies: 4294759424 tx-queues: 1
Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0 eth3: Reset adapter unexpectedly
Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: e1000e :06:00.0 eth2: Reset adapter unexpectedly
Jan 23 15:39:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 23 15:39:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, trans_start: 4294766123, wd-timeout: 5000 jiffies: 4294771200 tx-queues: 1
Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_start: 4294766125, wd-timeout: 5000 jiffies: 4294771200 tx-queues: 1
Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: e1000e :06:00.0 eth2: Reset adapter unexpectedly
Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0 eth3: Reset adapter unexpectedly
Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e :06:00.0 eth2: Detected Hardware Unit Hang: TDH TDT ...
Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com
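For anyone reading the watchdog lines above, the numbers can be checked by hand. The sketch below only does the subtraction, using the trans_start, jiffies and wd-timeout values copied from the log. It assumes the watchdog fires once jiffies has moved more than wd-timeout past trans_start, and it ignores counter wrap-around, so treat it as an illustration and not as the kernel's actual check. The labels and function names are made up for this example.

```python
# Values copied from the NETDEV WATCHDOG lines in the log above.
# The first entry is cut off in the archive, so its interface is unknown.
SAMPLES = [
    # (label, trans_start, jiffies, wd_timeout)
    ("first entry (truncated)", 4294748730, 4294759424, 5000),
    ("15:39:25 eth2", 4294766123, 4294771200, 5000),
    ("15:39:25 eth3", 4294766125, 4294771200, 5000),
]


def stalled_for(trans_start, jiffies):
    """Jiffies elapsed since the queue last started a transmit."""
    return jiffies - trans_start


def timed_out(trans_start, jiffies, wd_timeout):
    """Assumed rule: timeout once the stall exceeds wd_timeout (no wrap handling)."""
    return stalled_for(trans_start, jiffies) > wd_timeout


if __name__ == "__main__":
    for label, trans_start, jiffies, wd_timeout in SAMPLES:
        stall = stalled_for(trans_start, jiffies)
        print(f"{label}: stalled {stall} jiffies, limit {wd_timeout}, "
              f"timed out: {timed_out(trans_start, jiffies, wd_timeout)}")
```

In all three samples the stall is longer than the 5000-jiffy limit, which is consistent with the "transmit queue 0 timed out" messages.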
Re: TCP many-connection regression (bisected to 4.5.0-rc2+)
On 01/23/2018 03:21 PM, Eric Dumazet wrote: On Tue, 2018-01-23 at 15:10 -0800, Ben Greear wrote: On 01/23/2018 02:29 PM, Eric Dumazet wrote: On Tue, 2018-01-23 at 14:09 -0800, Ben Greear wrote: On 01/23/2018 02:07 PM, Eric Dumazet wrote: On Tue, 2018-01-23 at 13:49 -0800, Ben Greear wrote: On 01/22/2018 10:16 AM, Eric Dumazet wrote: On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote: My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other on a system with 16GB RAM and send slow-speed data. This works fine on a 4.7 kernel, but will not work at all on a 4.13. The 4.13 first complains about running out of tcp memory, but even after forcing those values higher, the max connections we can get is around 15k. Both kernels have my out-of-tree patches applied, so it is possible it is my fault at this point. Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels? I will start bisecting in the meantime... Hi Ben Unfortunately I have no idea. Are you using loopback flows, or have I misunderstood you ? How loopback connections can be slow-speed ? Hello Eric, looks like it is one of your commits that causes the issue I see. Here are some more details on my specific test case I used to bisect: I have two ixgbe ports looped back, configured on same subnet, but with different IPs. Routing table rules, SO_BINDTODEVICE, binding to specific IPs on both client and server side let me send-to-self over the external looped cable. I have 2 mac-vlans on each physical interface. I created 5 server-side connections on one physical port, and two more on one of the mac-vlans. On the client-side, I create a process that spawns 5000 connections to the corresponding server side. End result is 25,000 connections on one pair of real interfaces, and 10,000 connections on the mac-vlan ports. In the passing case, I get very close to all 5000 connections on all endpoints quickly. 
In the failing case, I get a max of around 16k connections on the two physical ports. The two mac-vlans have 10k connections across them working reliably. It seems to be an issue with 'connect' failing. connect(2074, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress) socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2075 fcntl(2075, F_GETFD)= 0 fcntl(2075, F_SETFD, FD_CLOEXEC)= 0 setsockopt(2075, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0 setsockopt(2075, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(2075, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0 getsockopt(2075, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0 getsockopt(2075, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0 setsockopt(2075, SOL_TCP, TCP_NODELAY, [0], 4) = 0 fcntl(2075, F_GETFL)= 0x2 (flags O_RDWR) fcntl(2075, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0 connect(2075, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress) socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2076 fcntl(2076, F_GETFD)= 0 fcntl(2076, F_SETFD, FD_CLOEXEC)= 0 setsockopt(2076, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0 setsockopt(2076, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(2076, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0 getsockopt(2076, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0 getsockopt(2076, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0 setsockopt(2076, SOL_TCP, TCP_NODELAY, [0], 4) = 0 fcntl(2076, F_GETFL)= 0x2 (flags O_RDWR) fcntl(2076, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0 connect(2076, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address) ea8add2b190395408b22a9127bed2c0912aecbc8 is the first bad commit commit ea8add2b190395408b22a9127bed2c0912aecbc8 Author: Eric Dumazet Date: Thu Feb 11 16:28:50 2016 -0800 tcp/dccp: 
better use of ephemeral ports in bind() Implement strategy used in __inet_hash_connect() in opposite way : Try to find a candidate using odd ports, then fallback to even ports. We no longer disable BH for whole traversal, but one bucket at a time. We also use cond_resched() to yield cpu to other tasks if needed. I removed one indentation level and tried to mirror the loop we have in __inet_hash_connect() and variable names to ease code maintenance. Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller :04 04 3af4595c6eb6d331e1cba78a142d44e00f710d81 e0c014ae8b7e2867256eff60f6210821d36eacef M net I will be happy to test patches or try to get any other results that might help diagnose this problem better. Problem is I do not s
Re: TCP many-connection regression (bisected to 4.5.0-rc2+)
On 01/23/2018 02:29 PM, Eric Dumazet wrote: On Tue, 2018-01-23 at 14:09 -0800, Ben Greear wrote: On 01/23/2018 02:07 PM, Eric Dumazet wrote: On Tue, 2018-01-23 at 13:49 -0800, Ben Greear wrote: On 01/22/2018 10:16 AM, Eric Dumazet wrote: On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote: My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other on a system with 16GB RAM and send slow-speed data. This works fine on a 4.7 kernel, but will not work at all on a 4.13. The 4.13 first complains about running out of tcp memory, but even after forcing those values higher, the max connections we can get is around 15k. Both kernels have my out-of-tree patches applied, so it is possible it is my fault at this point. Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels? I will start bisecting in the meantime... Hi Ben Unfortunately I have no idea. Are you using loopback flows, or have I misunderstood you ? How loopback connections can be slow-speed ? Hello Eric, looks like it is one of your commits that causes the issue I see. Here are some more details on my specific test case I used to bisect: I have two ixgbe ports looped back, configured on same subnet, but with different IPs. Routing table rules, SO_BINDTODEVICE, binding to specific IPs on both client and server side let me send-to-self over the external looped cable. I have 2 mac-vlans on each physical interface. I created 5 server-side connections on one physical port, and two more on one of the mac-vlans. On the client-side, I create a process that spawns 5000 connections to the corresponding server side. End result is 25,000 connections on one pair of real interfaces, and 10,000 connections on the mac-vlan ports. In the passing case, I get very close to all 5000 connections on all endpoints quickly. In the failing case, I get a max of around 16k connections on the two physical ports. 
The two mac-vlans have 10k connections across them working reliably. It seems to be an issue with 'connect' failing. connect(2074, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress) socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2075 fcntl(2075, F_GETFD)= 0 fcntl(2075, F_SETFD, FD_CLOEXEC)= 0 setsockopt(2075, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0 setsockopt(2075, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(2075, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0 getsockopt(2075, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0 getsockopt(2075, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0 setsockopt(2075, SOL_TCP, TCP_NODELAY, [0], 4) = 0 fcntl(2075, F_GETFL)= 0x2 (flags O_RDWR) fcntl(2075, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0 connect(2075, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress) socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2076 fcntl(2076, F_GETFD)= 0 fcntl(2076, F_SETFD, FD_CLOEXEC)= 0 setsockopt(2076, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0 setsockopt(2076, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(2076, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0 getsockopt(2076, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0 getsockopt(2076, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0 setsockopt(2076, SOL_TCP, TCP_NODELAY, [0], 4) = 0 fcntl(2076, F_GETFL)= 0x2 (flags O_RDWR) fcntl(2076, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0 connect(2076, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address) ea8add2b190395408b22a9127bed2c0912aecbc8 is the first bad commit commit ea8add2b190395408b22a9127bed2c0912aecbc8 Author: Eric Dumazet Date: Thu Feb 11 16:28:50 2016 -0800 tcp/dccp: better use of ephemeral ports in bind() Implement strategy used in 
__inet_hash_connect() in opposite way : Try to find a candidate using odd ports, then fallback to even ports. We no longer disable BH for whole traversal, but one bucket at a time. We also use cond_resched() to yield cpu to other tasks if needed. I removed one indentation level and tried to mirror the loop we have in __inet_hash_connect() and variable names to ease code maintenance. Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller :04 04 3af4595c6eb6d331e1cba78a142d44e00f710d81 e0c014ae8b7e2867256eff60f6210821d36eacef M net I will be happy to test patches or try to get any other results that might help diagnose this problem better. Problem is I do not see anything obvious here. Please provide /proc/sys/net/ipv4/ip_local_port_range [root@lf1003-e
Re: TCP many-connection regression (bisected to 4.5.0-rc2+)
On 01/23/2018 02:07 PM, Eric Dumazet wrote: On Tue, 2018-01-23 at 13:49 -0800, Ben Greear wrote: On 01/22/2018 10:16 AM, Eric Dumazet wrote: On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote: My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other on a system with 16GB RAM and send slow-speed data. This works fine on a 4.7 kernel, but will not work at all on a 4.13. The 4.13 first complains about running out of tcp memory, but even after forcing those values higher, the max connections we can get is around 15k. Both kernels have my out-of-tree patches applied, so it is possible it is my fault at this point. Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels? I will start bisecting in the meantime... Hi Ben Unfortunately I have no idea. Are you using loopback flows, or have I misunderstood you ? How loopback connections can be slow-speed ? Hello Eric, looks like it is one of your commits that causes the issue I see. Here are some more details on my specific test case I used to bisect: I have two ixgbe ports looped back, configured on same subnet, but with different IPs. Routing table rules, SO_BINDTODEVICE, binding to specific IPs on both client and server side let me send-to-self over the external looped cable. I have 2 mac-vlans on each physical interface. I created 5 server-side connections on one physical port, and two more on one of the mac-vlans. On the client-side, I create a process that spawns 5000 connections to the corresponding server side. End result is 25,000 connections on one pair of real interfaces, and 10,000 connections on the mac-vlan ports. In the passing case, I get very close to all 5000 connections on all endpoints quickly. In the failing case, I get a max of around 16k connections on the two physical ports. The two mac-vlans have 10k connections across them working reliably. It seems to be an issue with 'connect' failing. 
connect(2074, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress) socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2075 fcntl(2075, F_GETFD)= 0 fcntl(2075, F_SETFD, FD_CLOEXEC)= 0 setsockopt(2075, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0 setsockopt(2075, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(2075, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0 getsockopt(2075, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0 getsockopt(2075, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0 setsockopt(2075, SOL_TCP, TCP_NODELAY, [0], 4) = 0 fcntl(2075, F_GETFL)= 0x2 (flags O_RDWR) fcntl(2075, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0 connect(2075, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress) socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2076 fcntl(2076, F_GETFD)= 0 fcntl(2076, F_SETFD, FD_CLOEXEC)= 0 setsockopt(2076, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0 setsockopt(2076, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(2076, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0 getsockopt(2076, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0 getsockopt(2076, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0 setsockopt(2076, SOL_TCP, TCP_NODELAY, [0], 4) = 0 fcntl(2076, F_GETFL)= 0x2 (flags O_RDWR) fcntl(2076, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0 connect(2076, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address) ea8add2b190395408b22a9127bed2c0912aecbc8 is the first bad commit commit ea8add2b190395408b22a9127bed2c0912aecbc8 Author: Eric Dumazet Date: Thu Feb 11 16:28:50 2016 -0800 tcp/dccp: better use of ephemeral ports in bind() Implement strategy used in __inet_hash_connect() in opposite way : Try to find a candidate using odd ports, then fallback to even ports. 
We no longer disable BH for whole traversal, but one bucket at a time. We also use cond_resched() to yield cpu to other tasks if needed. I removed one indentation level and tried to mirror the loop we have in __inet_hash_connect() and variable names to ease code maintenance. Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller :04 04 3af4595c6eb6d331e1cba78a142d44e00f710d81 e0c014ae8b7e2867256eff60f6210821d36eacef M net I will be happy to test patches or try to get any other results that might help diagnose this problem better. Problem is I do not see anything obvious here. Please provide /proc/sys/net/ipv4/ip_local_port_range [root@lf1003-e3v2-13100124-f20x64 ~]# cat /proc/sys/net/ipv4/ip_local_port_range 1 61001 Also you probab
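As a rough illustration of the port range shown above and of the "odd ports first, then even ports" order described in the commit message, here is a small sketch. It is not the kernel code: the real search starts from a random offset and checks for bind conflicts, none of which is modeled here. The function names are invented for this example.

```python
def read_port_range(text):
    """Parse the two numbers printed by /proc/sys/net/ipv4/ip_local_port_range."""
    low, high = (int(tok) for tok in text.split())
    return low, high


def count_by_parity(low, high):
    """Return (odd_count, even_count) for the inclusive range low..high."""
    total = high - low + 1
    odd = sum(1 for port in range(low, high + 1) if port % 2 == 1)
    return odd, total - odd


def candidate_ports(low, high, odd_first=True):
    """Yield every port in low..high, one parity first, then the other.

    Simplification: the scan is in ascending order with no random start.
    """
    first = 1 if odd_first else 0
    for parity in (first, 1 - first):
        for port in range(low, high + 1):
            if port % 2 == parity:
                yield port


if __name__ == "__main__":
    low, high = read_port_range("1 61001")
    odd, even = count_by_parity(low, high)
    print(f"range {low}-{high}: {odd} odd ports, {even} even ports")
    print("scan order for 1-6:", list(candidate_ports(1, 6)))
```

With the range quoted above, each parity holds roughly 30k ports. Whether that split explains the connection ceiling seen in the test is not established in this thread.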
Re: TCP many-connection regression between 4.7 and 4.13 kernels.
On 01/22/2018 10:46 AM, Josh Hunt wrote:
On Mon, Jan 22, 2018 at 10:30 AM, Ben Greear wrote:
On 01/22/2018 10:16 AM, Eric Dumazet wrote:
On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:

My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other on a system with 16GB RAM and send slow-speed data. This works fine on a 4.7 kernel, but will not work at all on a 4.13. The 4.13 first complains about running out of tcp memory, but even after forcing those values higher, the max connections we can get is around 15k.

Both kernels have my out-of-tree patches applied, so it is possible it is my fault at this point.

Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels?

I will start bisecting in the meantime...

Hi Ben

Unfortunately I have no idea. Are you using loopback flows, or have I misunderstood you ?

How loopback connections can be slow-speed ?

I am sending to self, but over external network interfaces, by using routing tables and rules and such.

On 4.13.16+, I see the Intel driver bouncing when I try to start 20k connections. In this case, I have a pair of 10G ports doing 15k, and then I try to start 5k on two of the 1G ports

Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:43 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_s...es: 1
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0 eth3: Reset adapter unexpectedly

Ben

We had an interface doing this and grabbing these commits resolved it for us:

4aea7a5c5e94 e1000e: Avoid receiver overrun interrupt bursts
19110cfbb34d e1000e: Separate signaling for link check/link up
d3509f8bc7b0 e1000e: Fix return value test
65a29da1f5fd e1000e: Fix wrong comment related to link detection
c4c40e51f9c3 e1000e: Fix error path in link detection

They are in the LTS kernels now, but don't believe they were when we first hit this problem.

Thanks a lot for the suggestions, I can confirm that these patches applied to my 4.13.16+ tree does indeed seem to fix the problem.

Thanks,
Ben

Josh

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com
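When comparing kernels with and without the e1000e commits listed above, it can help to count link flaps in the log instead of eyeballing it. The sketch below is one possible way to do that. It assumes log lines in the same "e1000e: ethN NIC Link is Up/Down" form as the excerpt quoted in this message; the regular expression and function name are made up for this example, and the sample is only a four-line excerpt of that log.

```python
import re
from collections import Counter

# Matches the e1000e link messages in the form seen in the log above.
LINK_RE = re.compile(r"e1000e: (\S+) NIC Link is (Up|Down)")

# Excerpt copied from the log quoted in this message.
SAMPLE_LOG = """\
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:43 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
"""


def count_link_events(lines):
    """Count (interface, state) link events in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    for (iface, state), n in sorted(count_link_events(SAMPLE_LOG.splitlines()).items()):
        print(f"{iface} {state}: {n}")
```

Running the same counter over a log from a patched kernel would give a quick before/after comparison.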
Re: TCP many-connection regression (bisected to 4.5.0-rc2+)
On 01/22/2018 10:16 AM, Eric Dumazet wrote:
On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:

My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other on a system with 16GB RAM and send slow-speed data. This works fine on a 4.7 kernel, but will not work at all on a 4.13. The 4.13 first complains about running out of tcp memory, but even after forcing those values higher, the max connections we can get is around 15k.

Both kernels have my out-of-tree patches applied, so it is possible it is my fault at this point.

Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels?

I will start bisecting in the meantime...

Hi Ben

Unfortunately I have no idea. Are you using loopback flows, or have I misunderstood you ?

How loopback connections can be slow-speed ?

Hello Eric, looks like it is one of your commits that causes the issue I see.

Here are some more details on my specific test case I used to bisect:

I have two ixgbe ports looped back, configured on same subnet, but with different IPs. Routing table rules, SO_BINDTODEVICE, binding to specific IPs on both client and server side let me send-to-self over the external looped cable.

I have 2 mac-vlans on each physical interface.

I created 5 server-side connections on one physical port, and two more on one of the mac-vlans.

On the client-side, I create a process that spawns 5000 connections to the corresponding server side.

End result is 25,000 connections on one pair of real interfaces, and 10,000 connections on the mac-vlan ports.

In the passing case, I get very close to all 5000 connections on all endpoints quickly.

In the failing case, I get a max of around 16k connections on the two physical ports. The two mac-vlans have 10k connections across them working reliably. It seems to be an issue with 'connect' failing.

connect(2074, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress)
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2075
fcntl(2075, F_GETFD) = 0
fcntl(2075, F_SETFD, FD_CLOEXEC) = 0
setsockopt(2075, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
setsockopt(2075, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(2075, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0
getsockopt(2075, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
getsockopt(2075, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
setsockopt(2075, SOL_TCP, TCP_NODELAY, [0], 4) = 0
fcntl(2075, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(2075, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
connect(2075, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress)
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2076
fcntl(2076, F_GETFD) = 0
fcntl(2076, F_SETFD, FD_CLOEXEC) = 0
setsockopt(2076, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
setsockopt(2076, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(2076, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0
getsockopt(2076, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
getsockopt(2076, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
setsockopt(2076, SOL_TCP, TCP_NODELAY, [0], 4) = 0
fcntl(2076, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(2076, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
connect(2076, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address)

ea8add2b190395408b22a9127bed2c0912aecbc8 is the first bad commit
commit ea8add2b190395408b22a9127bed2c0912aecbc8
Author: Eric Dumazet
Date: Thu Feb 11 16:28:50 2016 -0800

tcp/dccp: better use of ephemeral ports in bind()

Implement strategy used in __inet_hash_connect() in opposite way :

Try to find a candidate using odd ports, then fallback to even ports.

We no longer disable BH for whole traversal, but one bucket at a time. We also use cond_resched() to yield cpu to other tasks if needed.

I removed one indentation level and tried to mirror the loop we have in __inet_hash_connect() and variable names to ease code maintenance.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

:04 04 3af4595c6eb6d331e1cba78a142d44e00f710d81 e0c014ae8b7e2867256eff60f6210821d36eacef M net

I will be happy to test patches or try to get any other results that might help diagnose this problem better.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com
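For readers who want to reproduce the per-socket sequence in the strace output above without the original test tool, here is a minimal sketch: optional SO_BINDTODEVICE, SO_REUSEADDR, bind to a specific local IP with port 0, switch to non-blocking, then connect. The function name and parameters are invented for this example; it is not the test program used in this report. SO_BINDTODEVICE normally needs root, so it is skipped unless a device name is passed, and the fallback value 25 is the Linux constant, used only if the Python build does not export it.

```python
import errno
import socket

# Linux value of SO_BINDTODEVICE, used if this Python build lacks the constant.
SO_BINDTODEVICE = getattr(socket, "SO_BINDTODEVICE", 25)


def start_connect(local_ip, remote_ip, remote_port, device=None):
    """Open one non-blocking TCP connection the way the strace output shows.

    Returns (sock, err) where err is 0 or an errno value such as
    errno.EINPROGRESS or errno.EADDRNOTAVAIL.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if device is not None:
            # Usually requires root (CAP_NET_RAW).
            sock.setsockopt(socket.SOL_SOCKET, SO_BINDTODEVICE,
                            device.encode() + b"\0")
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        # Port 0 asks the kernel to pick an ephemeral port at bind() time.
        sock.bind((local_ip, 0))
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0)
        sock.setblocking(False)
        err = sock.connect_ex((remote_ip, remote_port))
    except OSError:
        sock.close()
        raise
    return sock, err


if __name__ == "__main__":
    # Self-contained demo against a local listener on the loopback address.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(8)
    sock, err = start_connect("127.0.0.1", "127.0.0.1", server.getsockname()[1])
    print("local port:", sock.getsockname()[1],
          "connect result:", errno.errorcode.get(err, err))
    sock.close()
    server.close()
```

In the trace above the first two connects return EINPROGRESS, which is normal for a non-blocking socket, and the last one returns EADDRNOTAVAIL. Calling start_connect in a loop with the addresses from the report (10.1.1.4 to 10.1.1.5, port 33012, device eth4) would be the way to look for the same error.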
Re: TCP many-connection regression between 4.7 and 4.13 kernels.
On 01/22/2018 10:30 AM, Ben Greear wrote: On 01/22/2018 10:16 AM, Eric Dumazet wrote: On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote: My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other on a system with 16GB RAM and send slow-speed data. This works fine on a 4.7 kernel, but will not work at all on a 4.13. The 4.13 first complains about running out of tcp memory, but even after forcing those values higher, the max connections we can get is around 15k. Both kernels have my out-of-tree patches applied, so it is possible it is my fault at this point. Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels? I will start bisecting in the meantime... Hi Ben Unfortunately I have no idea. Are you using loopback flows, or have I misunderstood you ? How loopback connections can be slow-speed ? I am sending to self, but over external network interfaces, by using routing tables and rules and such. On 4.13.16+, I see the Intel driver bouncing when I try to start 20k connections. 
In this case, I have a pair of 10G ports doing 15k, and then I try to start 5k on two of the 1G ports Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx Jan 22 10:15:43 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down Jan 22 10:15:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_s...es: 1 Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0 eth3: Reset adapter unexpectedly System reports 10+GB RAM free in this case, btw. Actually, maybe the good kernel was even older than 4.7...I see same resets and inability to do a full 20k connections on 4.7 too. I double-checked with system-test and it seems 4.4 was a good kernel. I'll test that next. 
Here is splat from 4.7: [ 238.921679] [ cut here ] [ 238.921689] WARNING: CPU: 0 PID: 3 at /home/greearb/git/linux-bisect/net/sched/sch_generic.c:272 dev_watchdog+0xd4/0x12f [ 238.921690] NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out [ 238.921691] Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 cfg80211 macvlan pktgen bnep bluetooth fuse coretemp intel_rapl ftdi_sio x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm iTCO_wdt iTCO_vendor_support joydev ie31200_edac ipmi_devintf irqbypass serio_raw ipmi_si edac_core shpchp fjes video i2c_i801 tpm_tis lpc_ich ipmi_msghandler tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc mgag200 i2c_algo_bit drm_kms_helper ttm drm i2c_core e1000e ixgbe mdio hwmon dca ptp pps_core ipv6 [last unloaded: nf_conntrack] [ 238.921720] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.7.0 #62 [ 238.921721] Hardware name: Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012 [ 238.921723] 88041cdd7cd8 81352a23 88041cdd7d28 [ 238.921725] 88041cdd7d18 810ea5dd 01101cdd7d90 [ 238.921727] 880417a84000 0100 8163ecff 880417a84440 [ 238.921728] Call Trace: [ 238.921733] [] dump_stack+0x61/0x7d [ 238.921736] [] __warn+0xbd/0xd8 [ 238.921738] [] ? netif_tx_lock+0x81/0x81 [ 238.921740] [] warn_slowpath_fmt+0x46/0x4e [ 238.921741] [] ? netif_tx_lock+0x74/0x81 [ 238.921743] [] dev_watchdog+0xd4/0x12f [ 238.921746] [] call_timer_fn+0x65/0x11b [ 238.921748] [] ? netif_tx_lock+0x81/0x81 [ 238.921749] [] run_timer_softirq+0x1ad/0x1d7 [ 238.921751] [] __do_softirq+0xfb/0x25c [ 238.921752] [] run_ksoftirqd+0x19/0x35 [ 238.921755] [] smpboot_thread_fn+0x169/0x1a9 [ 238.921756] [] ? sort_range+0x1d/0x1d [ 238.921759] [] kthread+0xa0/0xa8 [ 238.921763] [] ret_from_fork+0x1f/0x40 [ 238.921764] [] ? init_completion+0x24/0x24 [ 238.921765] ---[ end trace 933912956c6ee5ff ]--- [ 238.961672] e1000e :07:00.0 eth3: Reset adapter unexpectedly So, on 4.4.8+, I see this and other splats related to e1000e. 
I guess that is a separate issue. I can easily start 40k connections however, 30k across the two 10G ports, and 10k more across a pair of mac-vlans on the 10G ports (since I was out of address space to add a full 40k on the two physical ports). Looks like the e1000e problem is a separate issu
Re: TCP many-connection regression between 4.7 and 4.13 kernels.
On 01/22/2018 10:16 AM, Eric Dumazet wrote:
On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:

My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other on a system with 16GB RAM and send slow-speed data. This works fine on a 4.7 kernel, but will not work at all on a 4.13. The 4.13 first complains about running out of tcp memory, but even after forcing those values higher, the max connections we can get is around 15k.

Both kernels have my out-of-tree patches applied, so it is possible it is my fault at this point.

Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels?

I will start bisecting in the meantime...

Hi Ben

Unfortunately I have no idea. Are you using loopback flows, or have I misunderstood you ?

How loopback connections can be slow-speed ?

I am sending to self, but over external network interfaces, by using routing tables and rules and such.

On 4.13.16+, I see the Intel driver bouncing when I try to start 20k connections. In this case, I have a pair of 10G ports doing 15k, and then I try to start 5k on two of the 1G ports

Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:43 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 22 10:15:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_s...es: 1
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0 eth3: Reset adapter unexpectedly

System reports 10+GB RAM free in this case, btw.

Actually, maybe the good kernel was even older than 4.7...I see same resets and inability to do a full 20k connections on 4.7 too. I double-checked with system-test and it seems 4.4 was a good kernel. I'll test that next.

Here is splat from 4.7:

[ 238.921679] [ cut here ]
[ 238.921689] WARNING: CPU: 0 PID: 3 at /home/greearb/git/linux-bisect/net/sched/sch_generic.c:272 dev_watchdog+0xd4/0x12f
[ 238.921690] NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out
[ 238.921691] Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 cfg80211 macvlan pktgen bnep bluetooth fuse coretemp intel_rapl ftdi_sio x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm iTCO_wdt iTCO_vendor_support joydev ie31200_edac ipmi_devintf irqbypass serio_raw ipmi_si edac_core shpchp fjes video i2c_i801 tpm_tis lpc_ich ipmi_msghandler tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc mgag200 i2c_algo_bit drm_kms_helper ttm drm i2c_core e1000e ixgbe mdio hwmon dca ptp pps_core ipv6 [last unloaded: nf_conntrack]
[ 238.921720] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.7.0 #62
[ 238.921721] Hardware name: Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012
[ 238.921723] 88041cdd7cd8 81352a23 88041cdd7d28
[ 238.921725] 88041cdd7d18 810ea5dd 01101cdd7d90
[ 238.921727] 880417a84000 0100 8163ecff 880417a84440
[ 238.921728] Call Trace:
[ 238.921733] [] dump_stack+0x61/0x7d
[ 238.921736] [] __warn+0xbd/0xd8
[ 238.921738] [] ? netif_tx_lock+0x81/0x81
[ 238.921740] [] warn_slowpath_fmt+0x46/0x4e
[ 238.921741] [] ? netif_tx_lock+0x74/0x81
[ 238.921743] [] dev_watchdog+0xd4/0x12f
[ 238.921746] [] call_timer_fn+0x65/0x11b
[ 238.921748] [] ? netif_tx_lock+0x81/0x81
[ 238.921749] [] run_timer_softirq+0x1ad/0x1d7
[ 238.921751] [] __do_softirq+0xfb/0x25c
[ 238.921752] [] run_ksoftirqd+0x19/0x35
[ 238.921755] [] smpboot_thread_fn+0x169/0x1a9
[ 238.921756] [] ? sort_range+0x1d/0x1d
[ 238.921759] [] kthread+0xa0/0xa8
[ 238.921763] [] ret_from_fork+0x1f/0x40
[ 238.921764] [] ? init_completion+0x24/0x24
[ 238.921765] ---[ end trace 933912956c6ee5ff ]---
[ 238.961672] e1000e :07:00.0 eth3: Reset adapter unexpectedly

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com
TCP many-connection regression between 4.7 and 4.13 kernels.
My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other on a system with 16GB RAM and send slow-speed data. This works fine on a 4.7 kernel, but will not work at all on a 4.13. The 4.13 first complains about running out of tcp memory, but even after forcing those values higher, the max connections we can get is around 15k. Both kernels have my out-of-tree patches applied, so it is possible it is my fault at this point. Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels? I will start bisecting in the meantime... Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
fm10k cannot get link
Hello, We're trying to get an Intel 100G NIC to work, and so far, cannot get it to link. The cable is: X0016I4AO3 QSFP28 10Gtek (any suggestions for a better/different one?) [5.022681] fm10k :05:00.0: PCI Express bandwidth of 64GT/s available [5.022683] fm10k :05:00.0: (Speed:8.0GT/s, Width: x8, Encoding Loss:<2%, Payload:256B) [5.022684] fm10k :05:00.0: 00:e0:ed:54:78:f2 [5.027864] fm10k :06:00.0: PCI Express bandwidth of 64GT/s available [5.027865] fm10k :06:00.0: (Speed:8.0GT/s, Width: x8, Encoding Loss:<2%, Payload:256B) [5.027866] fm10k :06:00.0: 00:e0:ed:54:78:f3 [6.057950] Modules linked in: ioatdma(+) shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc ast drm_kms_helper ttm igb drm i2c_algo_bit i2c_core ixgbe mdio hwmon fm10k ptp pps_core dca fjes ipv6 crc_ccitt [7.294441] fm10k :05:00.0 eth0.r: renamed from eth0 [ 14.044914] fm10k :05:00.0 eth2: renamed from eth0.r [ 14.107798] fm10k :06:00.0 eth1.r: renamed from eth1 [ 14.178217] fm10k :06:00.0 eth3: renamed from eth1.r [root@lf1005c-is14120020 ~]# ethtool eth3 Settings for eth3: Current message level: 0x0007 (7) drv probe link Link detected: no [root@lf1005c-is14120020 ~]# uname -a Linux lf1005c-is14120020 4.9.29+ #46 SMP PREEMPT Wed Jul 26 17:48:57 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux [root@lf1005c-is14120020 ~]# ethtool -i eth3 driver: fm10k version: 0.21.2-k firmware-version: bus-info: :06:00.0 supports-statistics: yes supports-test: yes supports-eeprom-access: no supports-register-dump: yes supports-priv-flags: no [root@lf1005c-is14120020 ~]# lspci|grep 06 06:00.0 Ethernet controller: Intel Corporation Device 15a4 Please let me know if you have any suggestions. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: Ethtool question
On 10/12/2017 03:00 PM, Roopa Prabhu wrote: On Thu, Oct 12, 2017 at 2:45 PM, Ben Greear wrote: On 10/11/2017 01:49 PM, David Miller wrote: From: "John W. Linville" Date: Wed, 11 Oct 2017 16:44:07 -0400 On Wed, Oct 11, 2017 at 09:51:56AM -0700, Ben Greear wrote: I noticed today that setting some ethtool settings to the same value returns an error code. I would think this should silently return success instead? Makes it easier to call it from scripts this way: [root@lf0313-6477 lanforge]# ethtool -L eth3 combined 1 combined unmodified, ignoring no channel parameters changed, aborting current values: tx 0 rx 0 other 1 combined 1 [root@lf0313-6477 lanforge]# echo $? 1 I just had this discussion a couple of months ago with someone. My initial feeling was like you, a no-op is not a failure. But someone convinced me otherwise...I will now endeavour to remember who that was and how they convinced me... Anyone else have input here? I guess this usually happens when drivers don't support changing the settings at all. So they just make their ethtool operation for the 'set' always return an error. We could have a generic ethtool helper that does "get" and then if the "set" request is identical just return zero. But from another perspective, the error returned from the "set" in this situation also indicates to the user that the driver does not support the "set" operation which has value and meaning in and of itself. And we'd lose that with the given suggestion. In my case, the driver (igb) does support the set, my program just made the same ethtool call several times and it fails after the initial change (that actually changes something), as best as I can figure. This error is returned by ethtool user-space. It does a get, check and then set if user has requested changes. So, should we fix ethtool to return 0 in this case instead of an error code? I think so. If the driver itself returns an error, then probably return the error code and/or fix the driver as seems appropriate. 
Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
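For illustration only, the "get, check, then set" flow discussed in this thread, with an identical request treated as success, could be modeled like the C sketch below. This is a userspace model under stated assumptions, not ethtool's actual code: `struct channel_counts`, `set_channels_idempotent()`, `driver_set_fn` and the stub `fake_driver_set()` are hypothetical names invented for the example.

```c
#include <assert.h>

/* Hypothetical mirror of the four counts printed for "current values". */
struct channel_counts {
	unsigned int rx;
	unsigned int tx;
	unsigned int other;
	unsigned int combined;
};

/* Hypothetical driver hook: returns 0 on success, negative on error. */
typedef int (*driver_set_fn)(const struct channel_counts *want);

/* Stub driver used only by this sketch; counts how often it is called. */
static int fake_driver_calls;

static int fake_driver_set(const struct channel_counts *want)
{
	(void)want;
	fake_driver_calls++;
	return 0;
}

/*
 * Compare the requested counts with the current ones. An identical request
 * is a no-op and returns 0 without touching the driver; otherwise the
 * driver's own return code is passed through unchanged.
 */
static int set_channels_idempotent(struct channel_counts *cur,
				   const struct channel_counts *want,
				   driver_set_fn set)
{
	int err;

	assert(cur && want && set);

	if (cur->rx == want->rx && cur->tx == want->tx &&
	    cur->other == want->other && cur->combined == want->combined)
		return 0;	/* nothing to change: success, not an error */

	err = set(want);
	if (err == 0)
		*cur = *want;	/* remember the settings that were applied */
	return err;
}
```

The trade-off raised earlier in the thread still applies to this shape: a driver that cannot change the setting at all is never asked when the request matches the current values, so that particular error is not reported for a no-op request.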
Re: Ethtool question
On 10/11/2017 01:49 PM, David Miller wrote: From: "John W. Linville" Date: Wed, 11 Oct 2017 16:44:07 -0400 On Wed, Oct 11, 2017 at 09:51:56AM -0700, Ben Greear wrote: I noticed today that setting some ethtool settings to the same value returns an error code. I would think this should silently return success instead? Makes it easier to call it from scripts this way: [root@lf0313-6477 lanforge]# ethtool -L eth3 combined 1 combined unmodified, ignoring no channel parameters changed, aborting current values: tx 0 rx 0 other 1 combined 1 [root@lf0313-6477 lanforge]# echo $? 1 I just had this discussion a couple of months ago with someone. My initial feeling was like you, a no-op is not a failure. But someone convinced me otherwise...I will now endeavour to remember who that was and how they convinced me... Anyone else have input here? I guess this usually happens when drivers don't support changing the settings at all. So they just make their ethtool operation for the 'set' always return an error. We could have a generic ethtool helper that does "get" and then if the "set" request is identical just return zero. But from another perspective, the error returned from the "set" in this situation also indicates to the user that the driver does not support the "set" operation which has value and meaning in and of itself. And we'd lose that with the given suggestion. In my case, the driver (igb) does support the set, my program just made the same ethtool call several times and it fails after the initial change (that actually changes something), as best as I can figure. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Ethtool question
I noticed today that setting some ethtool settings to the same value returns an error code. I would think this should silently return success instead? Makes it easier to call it from scripts this way: [root@lf0313-6477 lanforge]# ethtool -L eth3 combined 1 combined unmodified, ignoring no channel parameters changed, aborting current values: tx 0 rx 0 other 1 combined 1 [root@lf0313-6477 lanforge]# echo $? 1 Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: Can libpcap filter on vlan tags when vlans are hardware-accelerated?
On 09/12/2017 01:26 PM, Michal Kubecek wrote: On Tue, Sep 12, 2017 at 11:54:43AM -0700, Ben Greear wrote: It does not appear to work on Fedora-26, and I'm curious if someone knows what needs doing to get this support working? It's rather complicated. The "vlan" and "vlan <id>" filters didn't handle the case when vlan information is passed in metadata until commit 04660eb1e561 ("Use BPF extensions in compiled filters"), i.e. libpcap 1.7.0. Unfortunately that commit made libpcap always check only metadata for the first outermost vlan tag so that it broke the case when vlan information is passed in the packet itself (which is less frequent today). To handle both cases correctly, you would need libpcap with commits d739b068ac29 ("Make VLAN filter handle both metadata and inline tags") and 7c7a19fbd9af ("Fix logic of combined VLAN test") and also the optimizer fix from https://github.com/the-tcpdump-group/libpcap/pull/582/commits/075015a3d17a (without it the filters generate incorrect BPF in some cases unless the optimizer is disabled). As far as I can see, these commits are not in any release yet. Michal Kubecek So, I cloned the latest libpcap, and I'm going to start poking at this. Do you happen to know if I need to do anything special other than 'pcap_compile()'? I'm curious how the library would know if it can use the newer kernel API or not...or maybe it is somehow magically backwards/forward compatible? Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: Can libpcap filter on vlan tags when vlans are hardware-accelerated?
On 09/12/2017 11:54 AM, Ben Greear wrote: It does not appear to work on Fedora-26, and I'm curious if someone knows what needs doing to get this support working? Thanks, Ben Gah, I spoke too soon. system-test guy says it works on cmd-line, but not when we try to make it work in another way...could be local bug, I'll poke at this more. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Can libpcap filter on vlan tags when vlans are hardware-accelerated?
It does not appear to work on Fedora-26, and I'm curious if someone knows what needs doing to get this support working? Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [PATCH] Fix build on fedora-14 (and other older systems)
On 09/03/2017 08:50 AM, Stephen Hemminger wrote: On Sat, 2 Sep 2017 07:15:02 -0700 gree...@candelatech.com wrote: diff --git a/include/linux/sysinfo.h b/include/linux/sysinfo.h index 934335a..3596b02 100644 --- a/include/linux/sysinfo.h +++ b/include/linux/sysinfo.h @@ -3,6 +3,14 @@ #include +/* So we can compile on older OSs, hopefully this is correct. --Ben */ +#ifndef __kernel_long_t +typedef long __kernel_long_t; +#endif +#ifndef __kernel_ulong_t +typedef unsigned long __kernel_ulong_t; +#endif + #define SI_LOAD_SHIFT 16 struct sysinfo { __kernel_long_t uptime; /* Seconds since boot */ I am not accepting this patch because all files in include/linux are automatically regenerated from kernel 'make install_headers'. No exceptions. If you want to change a header in include/linux it has to go through upstream kernel inclusion. It would be wrong to add this to the actual kernel header I think. Do you have another suggestion for fixing iproute2 compile? Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: Problem compiling iproute2 on older systems
On 09/02/2017 12:55 AM, Michal Kubecek wrote: On Fri, Sep 01, 2017 at 04:52:20PM -0700, Ben Greear wrote: In the patch below, usage of __kernel_ulong_t and __kernel_long_t is introduced, but that is not available on older system (fedora-14, at least). It is not a #define, so I am having trouble finding a quick hack around this. Any ideas on how to make this work better on older OSs running modern kernels? Author: Stephen Hemminger 2017-01-12 17:54:39 Committer: Stephen Hemminger 2017-01-12 17:54:39 Child: c7ec7697e3f000359aa317394e6dd972e35c1f84 (Fix build on fedora-14 (and other older systems)) Branches: master, remotes/origin/master Follows: v3.10.0 Precedes: add more uapi header files In order to ensure no backward/forward compatiablity problems, make sure that all kernel headers used come from the local copy. Signed-off-by: Stephen Hemminger --- include/linux/sysinfo.h --- new file mode 100644 index 000..934335a @@ -0,0 +1,24 @@ +#ifndef _LINUX_SYSINFO_H +#define _LINUX_SYSINFO_H + +#include + +#define SI_LOAD_SHIFT 16 +struct sysinfo { + __kernel_long_t uptime; /* Seconds since boot */ + __kernel_ulong_t loads[3]; /* 1, 5, and 15 minute load averages */ + __kernel_ulong_t totalram; /* Total usable main memory size */ + __kernel_ulong_t freeram; /* Available memory size */ + __kernel_ulong_t sharedram; /* Amount of shared memory */ + __kernel_ulong_t bufferram; /* Memory used by buffers */ + __kernel_ulong_t totalswap; /* Total swap space size */ + __kernel_ulong_t freeswap; /* swap space still available */ + __u16 procs;/* Number of current processes */ + __u16 pad; /* Explicit padding for m68k */ + __kernel_ulong_t totalhigh; /* Total high memory size */ + __kernel_ulong_t freehigh; /* Available high memory size */ + __u32 mem_unit; /* Memory unit size in bytes */ + char _f[20-2*sizeof(__kernel_ulong_t)-sizeof(__u32)]; /* Padding: libc5 uses this.. */ +}; + +#endif /* _LINUX_SYSINFO_H */ I've been already thinking about this a bit. 
Normally, we would simply add the file where __kernel_long_t and __kernel_ulong_t are defined. The problem is that this file depends on architecture - which is the point of these types. The good thing is that iproute2 doesn't actually use struct sysinfo anywhere so we don't need to have them defined correctly. One possible workaround would therefore be defining them as long and unsigned long. As long as we don't use the types anywhere, we would be fine. Another option would be to replace include/linux/sysinfo.h with an empty file. The problem I can see with this is that if someone uses a script to refresh all copies of uapi headers automatically, the script would have to be aware that it must not update this file and preserve the fake empty one. I just sent a patch that appears to compile on all of my build systems, which are generally fedora-14 to fedora-24 currently. I haven't actually tested functionality yet, but if you say it is unused, then it is very likely to be OK, and even if not, I think it will be fine unless someone is trying to cross-compile. And in that case, there is probably more than one issue involved... Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Problem compiling iproute2 on older systems
In the patch below, usage of __kernel_ulong_t and __kernel_long_t is introduced, but that is not available on older system (fedora-14, at least). It is not a #define, so I am having trouble finding a quick hack around this. Any ideas on how to make this work better on older OSs running modern kernels? Author: Stephen Hemminger 2017-01-12 17:54:39 Committer: Stephen Hemminger 2017-01-12 17:54:39 Child: c7ec7697e3f000359aa317394e6dd972e35c1f84 (Fix build on fedora-14 (and other older systems)) Branches: master, remotes/origin/master Follows: v3.10.0 Precedes: add more uapi header files In order to ensure no backward/forward compatiablity problems, make sure that all kernel headers used come from the local copy. Signed-off-by: Stephen Hemminger --- include/linux/sysinfo.h --- new file mode 100644 index 000..934335a @@ -0,0 +1,24 @@ +#ifndef _LINUX_SYSINFO_H +#define _LINUX_SYSINFO_H + +#include + +#define SI_LOAD_SHIFT 16 +struct sysinfo { + __kernel_long_t uptime; /* Seconds since boot */ + __kernel_ulong_t loads[3]; /* 1, 5, and 15 minute load averages */ + __kernel_ulong_t totalram; /* Total usable main memory size */ + __kernel_ulong_t freeram; /* Available memory size */ + __kernel_ulong_t sharedram; /* Amount of shared memory */ + __kernel_ulong_t bufferram; /* Memory used by buffers */ + __kernel_ulong_t totalswap; /* Total swap space size */ + __kernel_ulong_t freeswap; /* swap space still available */ + __u16 procs;/* Number of current processes */ + __u16 pad; /* Explicit padding for m68k */ + __kernel_ulong_t totalhigh; /* Total high memory size */ + __kernel_ulong_t freehigh; /* Available high memory size */ + __u32 mem_unit; /* Memory unit size in bytes */ + char _f[20-2*sizeof(__kernel_ulong_t)-sizeof(__u32)]; /* Padding: libc5 uses this.. */ +}; + +#endif /* _LINUX_SYSINFO_H */ -- Ben Greear Candela Technologies Inc http://www.candelatech.com
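A small C sketch of why "it is not a #define" makes this awkward: the preprocessor only sees macros, so an `#ifndef` test on a typedef name cannot tell whether the type already exists. The names `demo_kernel_long_t`, `DEMO_GUARD_SAW_TYPE` and `guard_saw_type()` are hypothetical stand-ins used only for this illustration, not anything from the kernel headers.

```c
#include <assert.h>

/* Hypothetical stand-in for a type such as __kernel_long_t. */
typedef long demo_kernel_long_t;

/*
 * The preprocessor only knows about macros. A typedef name is not a macro,
 * so the #ifndef branch is taken even though the type was declared above.
 */
#ifndef demo_kernel_long_t
#define DEMO_GUARD_SAW_TYPE 0
#else
#define DEMO_GUARD_SAW_TYPE 1
#endif

/* Returns 0: the guard did not notice the typedef. */
static int guard_saw_type(void)
{
	demo_kernel_long_t probe = 0;	/* compiles, so the type does exist */

	assert(sizeof(probe) == sizeof(long));
	return DEMO_GUARD_SAW_TYPE;
}
```

A guard of this kind therefore needs some macro to test. Whether a suitable macro is available on a given older system is not something this sketch shows.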
Re: Regression: Bug 196547 - Since 4.12 - bonding module not working with wireless drivers
On 08/16/2017 08:18 PM, Dan Williams wrote: On Wed, 2017-08-16 at 19:36 -0700, Ben Greear wrote: On 08/16/2017 07:11 PM, Dan Williams wrote: On Wed, 2017-08-16 at 14:31 -0700, David Miller wrote: From: Dan Williams Date: Wed, 16 Aug 2017 16:22:41 -0500 My biggest suggestion is that perhaps bonding should grow hysteresis for link speeds. Since WiFi can change speed every packet, you probably don't want the bond characteristics changing every couple seconds just in case your WiFi link is jumping around. Ethernet won't bounce around that much, so the hysteresis would have no effect there. Or, if people are concerned about response time to speed changes on ethernet (where you probably do want an instant switch-over) some new flag to indicate that certain devices don't have stable speeds over time. Or just report the average of the range the wireless link can hit, and be done with it. I think you guys are overcomplicating things. That range can be from 1 to > 800Mb/s. No, it won't usually be all over that range, but it won't be uncommon to fluctuate by hundreds of Mb/s. I'm not sure a simple average is really the answer here. Even doing that would require new knobs to ethtool, since the rate depends heavily on card capabilities and also what AP you're connected to *at that moment*. If you roam to another AP, then the max speed can certainly change. You'll probably say "aim for the 75% case" or something like that, which is fine, but then you're depending on your 75% case to be (a) single AP, (b) never move (eg, only bond wifi + ethernet), (c) little radio interference. I'm not sure I'd buy that. If I've put words in your mouth, forgive me. If you keep ethtool API simple and just return the last (rx-rate + tx-rate) / 2, or the rate averaged over the last 100 frames or 10 seconds, then the caller can do longer term averaging as it sees fit. Probably no need for lots of averaging complexity in the kernel. 
Yeah, that works too, but I was thinking it was better to present the actual data through ethtool so that things other than bonding could use it, and since bonding is the thing that actually cares about the fluctuation, make it do the more extensive processing. What do you mean by 'actual data'? If you want to know the most accurate transmit/rx rate info, then you need to pay attention to each and every frame's tx/rx rate, as well as it's ampdu/amsdu, retries, etc. It is virtually impossible. So, you will have to settle for something less... I suggest something simple to calculate, similar to existing values that are available via debugfs and/or 'iw dev foo station dump', etc. Let higher layers manipulate the raw data as they see fit (they can query ethtool as often as they like). Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
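As a rough sketch of the split suggested above (the kernel reports something simple, the caller smooths it), the two pieces might look like this in C. Both function names are hypothetical and the exponentially weighted moving average is just one possible choice of caller-side averaging; nothing here is an existing ethtool or bonding interface.

```c
#include <assert.h>

/* Simple per-query figure: mean of the last rx and tx rates, in Mb/s. */
static unsigned int link_rate_sample(unsigned int rx_mbps, unsigned int tx_mbps)
{
	/* Add in a wider type so the sum cannot overflow. */
	return (unsigned int)(((unsigned long long)rx_mbps + tx_mbps) / 2);
}

/*
 * Caller-side smoothing (an assumption of this sketch): exponentially
 * weighted moving average, new = old + (sample - old) / 2^shift.
 * A larger shift reacts more slowly to rate changes.
 */
static unsigned int ewma_update(unsigned int avg, unsigned int sample,
				unsigned int shift)
{
	long long delta;

	assert(shift < 16);
	delta = (long long)sample - (long long)avg;
	return (unsigned int)((long long)avg + delta / (1LL << shift));
}
```

A caller such as bonding could poll the simple figure as often as it likes and feed each result through `ewma_update()`, picking the shift to match how much hysteresis it wants.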
Re: Regression: Bug 196547 - Since 4.12 - bonding module not working with wireless drivers
On 08/16/2017 07:11 PM, Dan Williams wrote: On Wed, 2017-08-16 at 14:31 -0700, David Miller wrote: From: Dan Williams Date: Wed, 16 Aug 2017 16:22:41 -0500 My biggest suggestion is that perhaps bonding should grow hysteresis for link speeds. Since WiFi can change speed every packet, you probably don't want the bond characteristics changing every couple seconds just in case your WiFi link is jumping around. Ethernet won't bounce around that much, so the hysteresis would have no effect there. Or, if people are concerned about response time to speed changes on ethernet (where you probably do want an instant switch-over) some new flag to indicate that certain devices don't have stable speeds over time. Or just report the average of the range the wireless link can hit, and be done with it. I think you guys are overcomplicating things. That range can be from 1 to > 800Mb/s. No, it won't usually be all over that range, but it won't be uncommon to fluctuate by hundreds of Mb/s. I'm not sure a simple average is really the answer here. Even doing that would require new knobs to ethtool, since the rate depends heavily on card capabilities and also what AP you're connected to *at that moment*. If you roam to another AP, then the max speed can certainly change. You'll probably say "aim for the 75% case" or something like that, which is fine, but then you're depending on your 75% case to be (a) single AP, (b) never move (eg, only bond wifi + ethernet), (c) little radio interference. I'm not sure I'd buy that. If I've put words in your mouth, forgive me. If you keep ethtool API simple and just return the last (rx-rate + tx-rate) / 2, or the rate averaged over the last 100 frames or 10 seconds, then the caller can do longer term averaging as it sees fit. Probably no need for lots of averaging complexity in the kernel. rate-ctrl for wifi basically doesn't happen until you transmit or receive a fairly steady stream, so it will fluctuate a lot. 
Thanks, Ben Dan -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
On 06/20/2017 11:05 AM, Michal Kubecek wrote: On Tue, Jun 20, 2017 at 07:12:27AM -0700, Ben Greear wrote: On 06/14/2017 03:25 PM, David Ahern wrote: On 6/14/17 4:23 PM, Ben Greear wrote: On 06/13/2017 07:27 PM, David Ahern wrote: Let's try a targeted debug patch. See attached I had to change it to pr_err so it would go to our serial console since the system locked hard on crash, and that appears to be enough to change the timing where we can no longer reproduce the problem. ok, let's figure out which one is doing that. There are 3 debug statements. I suspect fib6_del_route is the one setting the state to FWS_U. Can you remove the debug prints in fib6_repair_tree and fib6_walk_continue and try again? We cannot reproduce with just that one printf in the kernel either. It must change the timing too much to trigger the bug. You might try trace_printk() which should have less impact (don't forget to enable /proc/sys/kernel/ftrace_dump_on_oops). We cannot reproduce with trace_printk() either. Thanks, Ben Michal Kubecek -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
On 06/14/2017 03:25 PM, David Ahern wrote: On 6/14/17 4:23 PM, Ben Greear wrote: On 06/13/2017 07:27 PM, David Ahern wrote: Let's try a targeted debug patch. See attached I had to change it to pr_err so it would go to our serial console since the system locked hard on crash, and that appears to be enough to change the timing where we can no longer reproduce the problem. ok, let's figure out which one is doing that. There are 3 debug statements. I suspect fib6_del_route is the one setting the state to FWS_U. Can you remove the debug prints in fib6_repair_tree and fib6_walk_continue and try again? We cannot reproduce with just that one printf in the kernel either. It must change the timing too much to trigger the bug. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
On 06/13/2017 07:27 PM, David Ahern wrote: Let's try a targeted debug patch. See attached I had to change it to pr_err so it would go to our serial console since the system locked hard on crash, and that appears to be enough to change the timing where we can no longer reproduce the problem. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
On 06/13/2017 01:28 PM, David Ahern wrote: On 6/13/17 2:16 PM, Ben Greear wrote: On 06/09/2017 02:25 PM, Eric Dumazet wrote: On Fri, 2017-06-09 at 07:27 -0600, David Ahern wrote: On 6/8/17 11:55 PM, Cong Wang wrote: On Thu, Jun 8, 2017 at 2:27 PM, Ben Greear wrote: As far as I can tell, the patch did not help, or at least we still reproduce the crash easily. netlink dump is serialized by nlk->cb_mutex so I don't think that patch makes any sense w.r.t race condition. From what I can see fn_sernum should be accessed under table lock, so when saving and checking it during a walk make sure the lock is held. That has nothing to do with the netlink dump, but the table changing during a walk. Yes, your patch makes total sense, of course. I guess someone should go ahead and make an official patch and submit it, even if it doesn't fix my problem. I can do that; was hoping to root cause the problem first. (gdb) l *(fib6_walk_continue+0x76) 0x188c6 is in fib6_walk_continue (/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:1593). 1588 if (fn == w->root) 1589 return 0; 1590 pn = fn->parent; 1591 w->node = pn; 1592 #ifdef CONFIG_IPV6_SUBTREES 1593 if (FIB6_SUBTREE(pn) == fn) { Apparently fn->parent is NULL here for some reason, but I don't know if that is expected or not. If a simple NULL check is not enough here, we have to trace why it is NULL. From my understanding, parent should not be null hence the attempts to fix access to table nodes under a lock. i.e., figuring out why it is null here. If someone has more suggestions, I'll be happy to test. I have looked at the code again and nothing is jumping out. Will look again later today. I noticed there is some code to help fix up the walkers when nodes are deleted. They use lock: read_lock(&net->ipv6.fib6_walker_lock); The code you were tweaking uses a different lock: read_lock_bh(&table->tb6_lock); It is certainly not simple code, so I don't know if that is correct or not, but might possibly be a place to start looking. 
I'm going to re-test with a WARN_ON to see if that triggers since previous suggestion was that fn->parent was NULL. diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index 51cd637..86295df 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -1571,6 +1571,10 @@ static int fib6_walk_continue(struct fib6_walker *w) case FWS_U: if (fn == w->root) return 0; + if (!fn->parent) { + WARN_ON_ONCE(1); + return 0; + } pn = fn->parent; w->node = pn; #ifdef CONFIG_IPV6_SUBTREES Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
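To make the intent of that debug check concrete, here is a simplified userspace model of the "go up" step with the parent pointer tested before use. It is only a sketch: `struct demo_node`, `struct demo_walker`, `enum demo_step` and `demo_walk_up()` are hypothetical stand-ins, and the real walker has more states, subtree handling and locking that are left out here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the fib6 node and walker. */
struct demo_node {
	struct demo_node *parent;
};

struct demo_walker {
	struct demo_node *root;
	struct demo_node *node;
};

enum demo_step {
	DEMO_DONE = 0,		/* reached the root, walk finished */
	DEMO_MOVED_UP = 1,	/* w->node now points at the parent */
	DEMO_BROKEN = -1	/* non-root node without a parent */
};

/*
 * One "go up" step. The parent pointer is checked before it is used, so a
 * node that was detached underneath the walker is reported instead of
 * being dereferenced.
 */
static enum demo_step demo_walk_up(struct demo_walker *w)
{
	struct demo_node *fn;

	assert(w && w->node);
	fn = w->node;

	if (fn == w->root)
		return DEMO_DONE;
	if (!fn->parent)
		return DEMO_BROKEN;	/* where the debug patch puts its warning */

	w->node = fn->parent;
	return DEMO_MOVED_UP;
}
```

A check like this only contains the symptom. If it fires, the open question from the thread remains: why a node that is not the root has lost its parent while a walker still points at it.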
Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
On 06/09/2017 02:25 PM, Eric Dumazet wrote: On Fri, 2017-06-09 at 07:27 -0600, David Ahern wrote: On 6/8/17 11:55 PM, Cong Wang wrote: On Thu, Jun 8, 2017 at 2:27 PM, Ben Greear wrote: As far as I can tell, the patch did not help, or at least we still reproduce the crash easily. netlink dump is serialized by nlk->cb_mutex so I don't think that patch makes any sense w.r.t race condition. From what I can see fn_sernum should be accessed under table lock, so when saving and checking it during a walk make sure the lock is held. That has nothing to do with the netlink dump, but the table changing during a walk. Yes, your patch makes total sense, of course. I guess someone should go ahead and make an official patch and submit it, even if it doesn't fix my problem. (gdb) l *(fib6_walk_continue+0x76) 0x188c6 is in fib6_walk_continue (/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:1593). 1588 if (fn == w->root) 1589 return 0; 1590 pn = fn->parent; 1591 w->node = pn; 1592 #ifdef CONFIG_IPV6_SUBTREES 1593 if (FIB6_SUBTREE(pn) == fn) { Apparently fn->parent is NULL here for some reason, but I don't know if that is expected or not. If a simple NULL check is not enough here, we have to trace why it is NULL. From my understanding, parent should not be null hence the attempts to fix access to table nodes under a lock. i.e., figuring out why it is null here. If someone has more suggestions, I'll be happy to test. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
; 1597} (gdb) l *(inet6_dump_fib+0x1ab) 0x1939b is in inet6_dump_fib (/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:392). 387 w->skip = w->count; 388 } else 389 w->skip = 0; 390 391 res = fib6_walk_continue(w); 392 read_unlock_bh(&table->tb6_lock); 393 if (res <= 0) { 394 fib6_walker_unlink(net, w); 395 cb->args[4] = 0; 396 } (gdb) [greearb@ben-dt3 linux-2.6]$ git diff diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index d4bf2c6..4e32a16 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -372,12 +372,13 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb, read_lock_bh(&table->tb6_lock); res = fib6_walk(net, w); - read_unlock_bh(&table->tb6_lock); if (res > 0) { cb->args[4] = 1; cb->args[5] = w->root->fn_sernum; } + read_unlock_bh(&table->tb6_lock); } else { + read_lock_bh(&table->tb6_lock); if (cb->args[5] != w->root->fn_sernum) { /* Begin at the root if the tree changed */ cb->args[5] = w->root->fn_sernum; @@ -387,7 +388,6 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb, } else w->skip = 0; - read_lock_bh(&table->tb6_lock); res = fib6_walk_continue(w); read_unlock_bh(&table->tb6_lock); if (res <= 0) { Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
On 06/06/2017 05:27 PM, Eric Dumazet wrote: On Tue, 2017-06-06 at 18:00 -0600, David Ahern wrote: On 6/6/17 3:06 PM, Ben Greear wrote: This bug has been around forever, and we recently got an intern and stuck him with trying to reproduce it on the latest kernel. It is still here. I'm not super excited about trying to fix this, but we can easily test patches if someone has a patch to try. Can you try this (whitespace damaged on paste, but it is moving the lock ahead of the fn_sernum check): diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index deea901746c8..7a44c49055c0 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -378,6 +378,7 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb, cb->args[5] = w->root->fn_sernum; } } else { + read_lock_bh(&table->tb6_lock); if (cb->args[5] != w->root->fn_sernum) { /* Begin at the root if the tree changed */ cb->args[5] = w->root->fn_sernum; @@ -387,7 +388,6 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb, } else w->skip = 0; - read_lock_bh(&table->tb6_lock); res = fib6_walk_continue(w); read_unlock_bh(&table->tb6_lock); if (res <= 0) { Good catch, but it looks like similar fix is needed a few lines before. We will test this tomorrow. 
Thanks, Ben diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index deea901746c8570c5e801e40592c91e3b62812e0..b214443dc8346cef3690df7f27cc48a864028865 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -372,12 +372,13 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb, read_lock_bh(&table->tb6_lock); res = fib6_walk(net, w); - read_unlock_bh(&table->tb6_lock); if (res > 0) { cb->args[4] = 1; cb->args[5] = w->root->fn_sernum; } + read_unlock_bh(&table->tb6_lock); } else { + read_lock_bh(&table->tb6_lock); if (cb->args[5] != w->root->fn_sernum) { /* Begin at the root if the tree changed */ cb->args[5] = w->root->fn_sernum; @@ -387,7 +388,6 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb, } else w->skip = 0; - read_lock_bh(&table->tb6_lock); res = fib6_walk_continue(w); read_unlock_bh(&table->tb6_lock); if (res <= 0) { -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
Hello,

This bug has been around forever, and we recently got an intern and stuck him with trying to reproduce it on the latest kernel. It is still here. I'm not super excited about trying to fix this, but we can easily test patches if someone has a patch to try.

Test case is to create 1000 mac-vlans and bring them up, with user-space tools running lots of 'dump' related commands as part of bringing up the interfaces and configuring some special source-based routing tables.

(gdb) l *(inet6_dump_fib+0x109)
0x192f9 is in inet6_dump_fib (/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:392).
387             } else
388                     w->skip = 0;
389
390             read_lock_bh(&table->tb6_lock);
391             res = fib6_walk_continue(w);
392             read_unlock_bh(&table->tb6_lock);
393             if (res <= 0) {
394                     fib6_walker_unlink(net, w);
395                     cb->args[4] = 0;
396             }

(gdb) l *(fib6_walk_continue+0x76)
0x188c6 is in fib6_walk_continue (/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:1593).
1588                    if (fn == w->root)
1589                            return 0;
1590                    pn = fn->parent;
1591                    w->node = pn;
1592    #ifdef CONFIG_IPV6_SUBTREES
1593                    if (FIB6_SUBTREE(pn) == fn) {
1594                            WARN_ON(!(fn->fn_flags & RTN_ROOT));
1595                            w->state = FWS_L;
1596                            continue;
1597                    }

[root@ct524-ffb0 ~]#
BUG: unable to handle kernel NULL pointer dereference at 0018
IP: fib6_walk_continue+0x76/0x180 [ipv6]
PGD 3d9226067 P4D 3d9226067 PUD 3d9020067 PMD 0
Oops: [#1] PREEMPT SMP
Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c bnep fuse macvlan pktgen cfg80211 ipmi_ssif iTCO_wdt iTCO_vendor_support coretemp intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass joydev i2c_i801 ie31200_edac intel_pch_thermal shpchp hci_uart ipmi_si btbcm btqca ipmi_devintf btintel ipmi_msghandler bluetooth pinctrl_sunrisepoint acpi_als pinctrl_intel video tpm_tis intel_lpss_acpi kfifo_buf tpm_tis_core intel_lpss industrialio tpm acpi_pad acpi_power_meter sch_fq_codel nfsd auth_rpcgss nfs_acl lockd grace sunrpc ast drm_kms_helper ttm drm igb hwmon ptp pps_core dca i2c_algo_bit i2c_hid i2c_core ipv6 crc_ccitt [last unloaded: nf_conntrack]
CPU: 1 PID: 996 Comm: ip Not tainted 4.12.0-rc4+ #32
Hardware name: Supermicro Super Server/X11SSM-F, BIOS 1.0b 12/29/2015
task: 8803d4d61dc0 task.stack: c9000970c000
RIP: 0010:fib6_walk_continue+0x76/0x180 [ipv6]
RSP: 0018:c9000970fbb8 EFLAGS: 00010283
RAX: 8803de84b020 RBX: 8803e0756f00 RCX:
RDX: RSI: c9000970fc00 RDI: 81eee280
RBP: c9000970fbc0 R08: 0008 R09: 8803d4fbbf31
R10: c9000970fb68 R11: R12: 0001
R13: 0001 R14: 8803e0756f00 R15: 8803d9345b18
FS: 7f32ca4ec700() GS:88047784() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 0018 CR3: 0003ddacc000 CR4: 003406e0
DR0: DR1: DR2:
DR3: DR6: fffe0ff0 DR7: 0400
Call Trace:
 inet6_dump_fib+0x109/0x290 [ipv6]
 netlink_dump+0x11d/0x290
 netlink_recvmsg+0x260/0x3f0
 sock_recvmsg+0x38/0x40
 ___sys_recvmsg+0xe9/0x230
 ? alloc_pages_vma+0x9d/0x260
 ? page_add_new_anon_rmap+0x88/0xc0
 ? lru_cache_add_active_or_unevictable+0x31/0xb0
 ? __handle_mm_fault+0xce3/0xf70
 __sys_recvmsg+0x3d/0x70
 ? __sys_recvmsg+0x3d/0x70
 SyS_recvmsg+0xd/0x20
 do_syscall_64+0x56/0xc0
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f32c9e21050
RSP: 002b:7fff96401de8 EFLAGS: 0246 ORIG_RAX: 002f
RAX: ffda RBX: RCX: 7f32c9e21050
RDX: RSI: 7fff96401e50 RDI: 0004
RBP: 7fff96405e74 R08: 3fe4 R09:
R10: 7fff96401e90 R11: 0246 R12: 0064f3a0
R13: 7fff96405ee0 R14: 3fe4 R15:
Code: f6 40 2a 04 74 11 8b 53 30 85 d2 0f 84 02 01 00 00 83 ea 01 89 53 30 c7 43 28 04 00 00 00 48 39 43 10 74 33 48 8b 10 48 89 53 18 <48> 39 42 18 0f 84 a3 00 00 00 48 39 42 08 0f 84 ae 00 00 00 48
RIP: fib6_walk_continue+0x76/0x180 [ipv6] RSP: c9000970fbb8
CR2: 0018
---[ end trace 5ebbc4ee97bea64e ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
Rebooting in 10 seconds..
ACPI MEMORY or I/O RESET_REG.

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com
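Note on the test case: the report describes creating 1000 macvlans and bringing them up while user-space tools issue many route dumps. A rough way to script something similar is sketched below in Python. The exact commands used in the original test are not given in the report, so the `ip` invocations, the parent interface name, and the helper names `build_repro_plan` and `run_plan` are all assumptions. The sketch only interleaves dumps with interface creation in one process. Triggering the race probably needs dumps running concurrently from separate processes.

```python
import subprocess


def build_repro_plan(parent, count, dumps_per_iface=1, prefix="mv"):
    """Build the list of argv lists: create each macvlan, bring it up, dump routes.

    All command choices here are assumptions, not taken from the report.
    """
    plan = []
    for i in range(count):
        name = f"{prefix}{i}"
        plan.append(["ip", "link", "add", "link", parent,
                     "name", name, "type", "macvlan"])
        plan.append(["ip", "link", "set", "dev", name, "up"])
        for _ in range(dumps_per_iface):
            # An IPv6 route dump across all tables goes through inet6_dump_fib.
            plan.append(["ip", "-6", "route", "show", "table", "all"])
    return plan


def run_plan(plan, dry_run=True):
    """Print the commands, and execute them only when dry_run is False (needs root)."""
    for argv in plan:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=False, stdout=subprocess.DEVNULL)
```

Calling `run_plan(build_repro_plan("eth0", 1000))` prints the command sequence without touching the system. Passing `dry_run=False` runs it for real and should only be done on a disposable test machine.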
Re: 'iw events' stops receiving events after a while on 4.9 + hacks
On 05/31/2017 01:18 AM, Bastian Bittorf wrote:
> * Johannes Berg [31.05.2017 10:09]:
>>> Is there any way to dump out the socket information if we reproduce
>>> the problem?
>>
>> I have no idea, sorry. If you or Bastian can tell me how to reproduce
>> the problem, I can try to investigate it.
>
> there was an interesting fix regarding the shell-builtin 'read' in
> busybox[1]. I will retest again and report if this changes anything.
>
> bye, bastian
>
> PS: @ben: are you also using 'iw event | while read -r LINE ...'?

I'm using a perl script to read the output, and not using busybox.

I have not seen the problem again, so it is not easy for me to reproduce.

If you reproduce it, maybe check 'strace' on the 'iw' process to see if it is hung on writing output to the pipe or reading input? In my case, it appeared to be hung reading input from netlink, input that never arrived.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com
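Note on monitoring the reader side: since the problem shows up as an 'iw event' process that silently stops producing output, the consumer can at least notice the silence. The following Python sketch reads lines from a pipe file descriptor and reports when nothing has arrived for a configurable time. The function name `watch_lines` and the timeout value are hypothetical, and an idle period does not prove a hang (the system may simply have no events), so this is only a hint for when to attach strace.

```python
import os
import select


def watch_lines(fd, idle_timeout):
    """Yield ("line", text), ("idle", None) or ("eof", None) for a readable fd.

    Reads the raw descriptor directly so that no user-space buffering hides data.
    """
    buf = b""
    while True:
        ready, _, _ = select.select([fd], [], [], idle_timeout)
        if not ready:
            # Nothing arrived within idle_timeout seconds; the caller decides
            # whether that is suspicious.
            yield ("idle", None)
            continue
        chunk = os.read(fd, 4096)
        if not chunk:
            # The writer closed its end of the pipe.
            yield ("eof", None)
            return
        buf += chunk
        # Hand out every complete line, keeping a trailing partial line buffered.
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            yield ("line", line.decode(errors="replace"))
```

One possible use is to start the tool with `subprocess.Popen(["iw", "event"], stdout=subprocess.PIPE)` and pass `proc.stdout.fileno()` to `watch_lines`. An "eof" result means the tool exited or closed its output, while repeated "idle" results with the process still alive match the symptom described above.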
Re: 'iw events' stops receiving events after a while on 4.9 + hacks
On 05/17/2017 06:30 AM, Johannes Berg wrote:
> On Wed, 2017-05-17 at 12:08 +0200, Bastian Bittorf wrote:
>> * Ben Greear [17.05.2017 11:51]:
>>> I have been keeping an 'iw events' program running with a perl script
>>> gathering its output and post-processing it. This has been working for
>>> several years on 4.7 and earlier kernels, but when testing on 4.9
>>> overnight, I notice that 'iw events' is not showing any input.
>>> 'strace' shows that it is waiting on recvmsg. If I start a second
>>> 'iw events' then it will get wifi events as expected.
>>
>> me too, also seen on 4.4 - I'm happy for debug ideas.
>
> I've never seen this. Does it happen when it's very long-running? Or
> when there are lots of events? Perhaps something in the socket buffer
> accounting is going wrong, so that it's slowly decreasing to 0?

I saw it exactly once so far, and it happened overnight, but we have not been doing a lot of work with the 4.9 kernel until recently. I don't think there were many messages on this system, and certainly others have run much longer on systems that should be generating many more events without trouble.

Is there any way to dump out the socket information if we reproduce the problem?

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com
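Note on dumping socket information: the thread does not settle on a method. One place that may help is /proc/net/netlink, which lists netlink sockets with per-socket counters. The Python sketch below parses that file by its header row instead of fixed positions, because the set of columns is an assumption here and can differ between kernel versions. The helper names are hypothetical. The "Pid" column holds the netlink port ID, which is often but not always the owning process ID.

```python
def parse_proc_netlink(text):
    """Parse /proc/net/netlink-style text into a list of dicts keyed by header names."""
    lines = text.strip().splitlines()
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip rows that do not match the header layout
        rows.append(dict(zip(header, fields)))
    return rows


def sockets_for_portid(rows, portid):
    """Select rows whose "Pid" column (the netlink port ID) equals portid."""
    return [row for row in rows if row.get("Pid") == str(portid)]


def read_proc_netlink(path="/proc/net/netlink"):
    """Read and parse the live file; only works on Linux."""
    with open(path) as f:
        return parse_proc_netlink(f.read())
```

If the columns include receive-memory and drop counters (named "Rmem" and "Drops" on the kernels this sketch assumes), comparing them between a stuck 'iw event' and a freshly started one could help test the socket buffer accounting theory raised above.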