Re: [vnet] [epair] epair interface stops working after some time
Hi, I have filed a bug for this issue and cc'd both of you on it. https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=227100

Best, Reshad

On 29 March 2018 6:39:13 PM IST, Kristof Provost wrote:
>On 29 Mar 2018, at 14:48, Reshad Patuck wrote:
>> pulling the 'net.link.epair.netisr_maxqlen' down does seem to make
>> this occur faster.
>>
>Good, I think my hypothesis about where the issue lies is correct then.
>You should be able to avoid (or at least reduce the frequency of) the
>issue by increasing the value on your system(s).
>
>> When I dropped it to 2 like Kristof did, a box which was not
>> exhibiting the problems began to have the same symptoms.
>> Bumping it back up to 2100 did not restore the functionality (I don't
>> know if it should).
>>
>It’s good to know this. It doesn’t surprise me that it doesn’t fix
>things.
>Something’s wrong in the code which handles an overflow of the netisr
>queue in the epair driver. Once that happens the IFF_DRV_OACTIVE flag
>gets set, and we keep enqueuing outside the netisr queue.
>Somehow we never end up back in epair_nh_drainedcpu(), so the flag
>never gets cleared and the driver never recovers.
>
>> I will create a PR for this later today with all the information I
>> have gathered so that we can have it all in one place.
>>
>Thanks. Please cc me on it. I’ll see if I can figure out what the
>problem is, but we might need someone smarter, so cc Bjoern too.
>
>Regards,
>Kristof

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: [vnet] [epair] epair interface stops working after some time
On 29 Mar 2018, at 14:48, Reshad Patuck wrote:

> pulling the 'net.link.epair.netisr_maxqlen' down does seem to make this occur faster.

Good, I think my hypothesis about where the issue lies is correct then. You should be able to avoid (or at least reduce the frequency of) the issue by increasing the value on your system(s).

> When I dropped it to 2 like Kristof did, a box which was not exhibiting the problems began to have the same symptoms. Bumping it back up to 2100 did not restore the functionality (I don't know if it should).

It’s good to know this. It doesn’t surprise me that it doesn’t fix things. Something’s wrong in the code which handles an overflow of the netisr queue in the epair driver. Once that happens the IFF_DRV_OACTIVE flag gets set, and we keep enqueuing outside the netisr queue. Somehow we never end up back in epair_nh_drainedcpu(), so the flag never gets cleared and the driver never recovers.

> I will create a PR for this later today with all the information I have gathered so that we can have it all in one place.

Thanks. Please cc me on it. I’ll see if I can figure out what the problem is, but we might need someone smarter, so cc Bjoern too.

Regards,
Kristof
Re: [vnet] [epair] epair interface stops working after some time
Hi,

pulling the 'net.link.epair.netisr_maxqlen' down does seem to make this occur faster.

When I dropped it to 2 like Kristof did, a box which was not exhibiting the problems began to have the same symptoms. Bumping it back up to 2100 did not restore the functionality (I don't know if it should).

I will create a PR for this later today with all the information I have gathered so that we can have it all in one place. Till then I still have access to a box which is naturally in this state. Let me know if there is anything you would like me to check.

Thanks for the help,
Reshad

On 28 March 2018 12:32:44 AM IST, Kristof Provost wrote:
>On 27 Mar 2018, at 20:59, Reshad Patuck wrote:
>> The current value of 'net.link.epair.netisr_maxqlen' is 2100, I will
>> make it 210.
>> Will this require a reboot? Or can I just change the sysctl and
>> reload the epair module?
>>
>You shouldn’t need to reboot or reload the epair module. When I set it
>to 2 on my box it pretty much immediately lost connectivity over the
>epair interfaces.
>
>I’d expect you to get hit by the bug relatively quickly now, so be
>aware of that.
>
>Regards,
>Kristof
Re: [vnet] [epair] epair interface stops working after some time
Excellent, will give it a try on a box that I never have this problem on. Will let you know if the symptoms are the same when I trigger it.

Best,
Reshad

On 28 March 2018 12:32:44 AM IST, Kristof Provost wrote:
>On 27 Mar 2018, at 20:59, Reshad Patuck wrote:
>> The current value of 'net.link.epair.netisr_maxqlen' is 2100, I will
>> make it 210.
>> Will this require a reboot? Or can I just change the sysctl and
>> reload the epair module?
>>
>You shouldn’t need to reboot or reload the epair module. When I set it
>to 2 on my box it pretty much immediately lost connectivity over the
>epair interfaces.
>
>I’d expect you to get hit by the bug relatively quickly now, so be
>aware of that.
>
>Regards,
>Kristof
Re: [vnet] [epair] epair interface stops working after some time
On 27 Mar 2018, at 20:59, Reshad Patuck wrote:

> The current value of 'net.link.epair.netisr_maxqlen' is 2100, I will make it 210. Will this require a reboot? Or can I just change the sysctl and reload the epair module?

You shouldn’t need to reboot or reload the epair module. When I set it to 2 on my box it pretty much immediately lost connectivity over the epair interfaces.

I’d expect you to get hit by the bug relatively quickly now, so be aware of that.

Regards,
Kristof
Re: [vnet] [epair] epair interface stops working after some time
Hi,

@Kristof: The current value of 'net.link.epair.netisr_maxqlen' is 2100, I will make it 210. Will this require a reboot? Or can I just change the sysctl and reload the epair module?

@Bjoern: here is the output of 'netstat -Q':

```
# netstat -Q
Configuration:
Setting                        Current        Limit
Thread count                         1            1
Default queue limit                256        10240
Dispatch policy                 direct          n/a
Threads bound to CPUs         disabled          n/a

Protocols:
Name   Proto QLimit Policy Dispatch Flags
ip         1    256   flow  default   ---
igmp       2    256 source  default   ---
rtsock     3    256 source  default   ---
arp        4    256 source  default   ---
ether      5    256 source   direct   ---
ip6        6    256   flow  default   ---
epair      8   2100    cpu  default   CD-

Workstreams:
WSID CPU   Name   Len WMark    Disp'd  HDisp'd QDrops    Queued   Handled
   0   0 ip         0    30  11409267        0      0  13574317  24983409
   0   0 igmp       0     0         0        0      0         0         0
   0   0 rtsock     0     1         0        0      0        42        42
   0   0 arp        0     0  61109751        0      0         0  61109751
   0   0 ether      0     0 115098020        0      0         0 115098020
   0   0 ip6        0    10  36157577        0      0   4273274  40430846
   0   0 epair      0  2100         0        0 210972 303785724 303785724
```

I still have access to a machine in this state, but will need to reset it to a working state soon. Please let me know if there is any information you would like me to get from this machine before I reset it.

Best,
Reshad

On 27 March 2018 8:18:29 PM IST, "Bjoern A. Zeeb" wrote:
>On 27 Mar 2018, at 14:40, Kristof Provost wrote:
>
>> (Re-cc freebsd-net, because this is useful information)
>>
>> On 27 Mar 2018, at 13:07, Reshad Patuck wrote:
>>> The epair crash occurred again today running the epair module code
>>> with the added dtrace sdt providers.
>>>
>>> Running the same command as last time, 'dtrace -n ::epair\*:'
>>> returns the following:
>>> ```
>>> CPU ID FUNCTION:NAME
>>> …
>>> 0 66499 epair_transmit_locked:enqueued
>>> ```
>>>
>>> Looks like it's filled up a queue somewhere and is dropping
>>> connections post that.
>>>
>>> The value of the 'error' is 55. I can see both the ifp and m structs
>>> but don't know what to look for in them.
>>>
>> That’s useful. Error 55 is ENOBUFS, which in IFQ_ENQUEUE() means
>> we’re hitting _IF_QFULL().
>> There don’t seem to be counters for that drop though, so that makes
>> it hard to diagnose without these extra probe points.
>> It also explains why you don’t really see any drop counters
>> incrementing.
>>
>> The fact that this queue is full presumably means that the other side
>> is not reading packets off it any more.
>> That’s supposed to happen in epair_start_locked() (look for the
>> IFQ_DEQUEUE() calls).
>>
>> It’s not at all clear to me how, but it looks like the receive side
>> is not doing its work.
>>
>> It looks like the IFQ code is already a fallback for when the netisr
>> queue is full.
>> That code might be broken, or there might be a different issue that
>> will just mean you’ll always end up in the same situation,
>> regardless of queue size.
>>
>> It’s probably worth trying to play with
>> ‘net.route.netisr_maxqlen’. I’d recommend *lowering* it, to see
>> if the problem happens more frequently that way. If it does it’ll be
>> helpful in reproducing and trying to fix this. If it doesn’t the
>> full queues are probably a consequence rather than a cause/trigger.
>> (Of course, once you’ve confirmed that lowering the netisr_maxqlen
>> makes the problem more frequent go ahead and increase it.)
>
>netstat -Q will be useful
Re: [vnet] [epair] epair interface stops working after some time
On 27 Mar 2018, at 16:48, Bjoern A. Zeeb wrote:

On 27 Mar 2018, at 14:40, Kristof Provost wrote:

(Re-cc freebsd-net, because this is useful information)

On 27 Mar 2018, at 13:07, Reshad Patuck wrote:

The epair crash occurred again today running the epair module code with the added dtrace sdt providers.

Running the same command as last time, 'dtrace -n ::epair\*:' returns the following:
```
CPU ID FUNCTION:NAME
…
0 66499 epair_transmit_locked:enqueued
```

Looks like it's filled up a queue somewhere and is dropping connections post that.

The value of the 'error' is 55. I can see both the ifp and m structs but don't know what to look for in them.

That’s useful. Error 55 is ENOBUFS, which in IFQ_ENQUEUE() means we’re hitting _IF_QFULL(). There don’t seem to be counters for that drop though, so that makes it hard to diagnose without these extra probe points. It also explains why you don’t really see any drop counters incrementing.

The fact that this queue is full presumably means that the other side is not reading packets off it any more. That’s supposed to happen in epair_start_locked() (look for the IFQ_DEQUEUE() calls).

It’s not at all clear to me how, but it looks like the receive side is not doing its work.

It looks like the IFQ code is already a fallback for when the netisr queue is full. That code might be broken, or there might be a different issue that will just mean you’ll always end up in the same situation, regardless of queue size.

It’s probably worth trying to play with ‘net.route.netisr_maxqlen’. I’d recommend *lowering* it, to see if the problem happens more frequently that way. If it does it’ll be helpful in reproducing and trying to fix this. If it doesn’t the full queues are probably a consequence rather than a cause/trigger. (Of course, once you’ve confirmed that lowering the netisr_maxqlen makes the problem more frequent go ahead and increase it.)

netstat -Q will be useful

Reshad included that in his e-mail to me:

On the system with the bug 'netstat -Q' seems to have queue drops for epair.
```
# netstat -Q
Configuration:
Setting                        Current        Limit
Thread count                         1            1
Default queue limit                256        10240
Dispatch policy                 direct          n/a
Threads bound to CPUs         disabled          n/a

Protocols:
Name   Proto QLimit Policy Dispatch Flags
ip         1    256   flow  default   ---
igmp       2    256 source  default   ---
rtsock     3    256 source  default   ---
arp        4    256 source  default   ---
ether      5    256 source   direct   ---
ip6        6    256   flow  default   ---
epair      8   2100    cpu  default   CD-

Workstreams:
WSID CPU   Name   Len WMark    Disp'd  HDisp'd QDrops    Queued   Handled
   0   0 ip         0    30  11150458        0      0  13092275  24242558
   0   0 igmp       0     0         0        0      0         0         0
   0   0 rtsock     0     1         0        0      0        42        42
   0   0 arp        0     0  56380919        0      0         0  56380919
   0   0 ether      0     0 108761357        0      0         0 108761357
   0   0 ip6        0    10  34999359        0      0   4091259  39090613
   0   0 epair      0  2100         0        0 210972 303785724 303785724
```

I also noticed that the values for 'epair' in the 'Workstreams' section, including drops, do not change, while all others increase after some time.

I think I’ve triggered this problem by setting net.link.epair.netisr_maxqlen to an absurdly low value (2 in my case).

It looks like there’s an issue with the handling of an overflow of the “hardware” queue, but I don’t really understand that code.

Regards,
Kristof
Re: [vnet] [epair] epair interface stops working after some time
On 27 Mar 2018, at 14:40, Kristof Provost wrote:

(Re-cc freebsd-net, because this is useful information)

On 27 Mar 2018, at 13:07, Reshad Patuck wrote:

The epair crash occurred again today running the epair module code with the added dtrace sdt providers.

Running the same command as last time, 'dtrace -n ::epair\*:' returns the following:
```
CPU ID FUNCTION:NAME
…
0 66499 epair_transmit_locked:enqueued
```

Looks like it's filled up a queue somewhere and is dropping connections post that.

The value of the 'error' is 55. I can see both the ifp and m structs but don't know what to look for in them.

That’s useful. Error 55 is ENOBUFS, which in IFQ_ENQUEUE() means we’re hitting _IF_QFULL(). There don’t seem to be counters for that drop though, so that makes it hard to diagnose without these extra probe points. It also explains why you don’t really see any drop counters incrementing.

The fact that this queue is full presumably means that the other side is not reading packets off it any more. That’s supposed to happen in epair_start_locked() (look for the IFQ_DEQUEUE() calls).

It’s not at all clear to me how, but it looks like the receive side is not doing its work.

It looks like the IFQ code is already a fallback for when the netisr queue is full. That code might be broken, or there might be a different issue that will just mean you’ll always end up in the same situation, regardless of queue size.

It’s probably worth trying to play with ‘net.route.netisr_maxqlen’. I’d recommend *lowering* it, to see if the problem happens more frequently that way. If it does it’ll be helpful in reproducing and trying to fix this. If it doesn’t the full queues are probably a consequence rather than a cause/trigger. (Of course, once you’ve confirmed that lowering the netisr_maxqlen makes the problem more frequent go ahead and increase it.)

netstat -Q will be useful
Re: [vnet] [epair] epair interface stops working after some time
On 27 Mar 2018, at 16:40, Kristof Provost wrote:
> It’s probably worth trying to play with ‘net.route.netisr_maxqlen’.

I probably mean ‘net.link.epair.netisr_maxqlen’ here.

Regards,
Kristof
Re: [vnet] [epair] epair interface stops working after some time
(Re-cc freebsd-net, because this is useful information)

On 27 Mar 2018, at 13:07, Reshad Patuck wrote:

The epair crash occurred again today running the epair module code with the added dtrace sdt providers.

Running the same command as last time, 'dtrace -n ::epair\*:' returns the following:
```
CPU ID FUNCTION:NAME
…
0 66499 epair_transmit_locked:enqueued
```

Looks like it's filled up a queue somewhere and is dropping connections post that.

The value of the 'error' is 55. I can see both the ifp and m structs but don't know what to look for in them.

That’s useful. Error 55 is ENOBUFS, which in IFQ_ENQUEUE() means we’re hitting _IF_QFULL(). There don’t seem to be counters for that drop though, so that makes it hard to diagnose without these extra probe points. It also explains why you don’t really see any drop counters incrementing.

The fact that this queue is full presumably means that the other side is not reading packets off it any more. That’s supposed to happen in epair_start_locked() (look for the IFQ_DEQUEUE() calls).

It’s not at all clear to me how, but it looks like the receive side is not doing its work.

It looks like the IFQ code is already a fallback for when the netisr queue is full. That code might be broken, or there might be a different issue that will just mean you’ll always end up in the same situation, regardless of queue size.

It’s probably worth trying to play with ‘net.route.netisr_maxqlen’. I’d recommend *lowering* it, to see if the problem happens more frequently that way. If it does it’ll be helpful in reproducing and trying to fix this. If it doesn’t the full queues are probably a consequence rather than a cause/trigger. (Of course, once you’ve confirmed that lowering the netisr_maxqlen makes the problem more frequent go ahead and increase it.)

Regards,
Kristof
Re: [vnet] [epair] epair interface stops working after some time
Hi,

I attempted to unload the pf module, but this did not cause any changes.

I am not creating/destroying any VNET jails at the time epairs stop functioning. Multiple VNET jails are started when I start the box, but there is no further activity (starts or stops of vnet jails, creation/deletion of epair interfaces, pf start, stop or reload).

I have been monitoring output from the following:
- netstat -ss
- netstat -m
- vmstat -z
- vmstat -m

I will add 'netstat -i' to my battery of monitoring commands.

So far the only pattern I can see out of the ordinary is the 'vmstat -m' output for epairs, where the size seems to keep growing, and at some point, the memory-use and high-use grow too. The epair interface seems to stop working when the memory-use and high-use grow. I have also noticed that these parameters stay almost constant on other boxes.

Here is a link (http://dpaste.com/3WB6AD4.txt) to the csv file containing the 'vmstat -m' output for 'epair' over time. I noticed the epair begin to fail at timestamp 2018-01-09T07:56Z, but this test ran every 5 minutes so it could be up to 5 minutes before this timestamp.

NOTE: I have used --libxo on the vmstat to get json output; it seems to have lost the trailing 'K' in the memory-use column.

I will update things here if I find anything else in the logs. Please let me know if there is anything else I should look at, or if there is any other output you would like.
Best regards,
Reshad

On Thursday 11 January 2018 2:20:06 AM IST Kristof Provost wrote:
> On 5 Jan 2018, at 20:54, Reshad Patuck wrote:
> > I have done the following on both servers to test what happens:
> > - Created a new epair interface epair3a and epair3b
> > - upped both interfaces
> > - given epair3a IP address 10.20.30.40/24 (I don't have this subnet
> > anywhere in my network)
> > - attempted to ping 10.20.30.50
> > - checked for any packets on epair3b
> > On the server where epairs are working, I can see ARP packets for
> > 10.20.30.50, but on the server where epairs are not working I can't
> > see any packets on epair3b.
> > I can however see the arp packets on epair3a on both servers.
>
> So epair3a was not added to the bridge and epair3b was not added to a
> jail?
> That’s interesting, because it should mean the problem is not with the
> bridge or jail.
> As it affects ARP packets it also shouldn’t be a pf problem.
> It might be worth unloading the pf module, just to re-confirm, but I
> wouldn’t expect it to make a difference.
>
> > Please let me know if there is anything I can do to debug this issue
> > or if you need any other information.
>
> Are you creating/destroying vnet jails at any point? Is there a
> correlation with that and the start of the epair issues?
>
> Are there any errors in `netstat -s` or `netstat -i epair3a` ?
>
> Regards,
> Kristof
Re: [vnet] [epair] epair interface stops working after some time
On 5 Jan 2018, at 20:54, Reshad Patuck wrote:

I have done the following on both servers to test what happens:
- Created a new epair interface epair3a and epair3b
- upped both interfaces
- given epair3a IP address 10.20.30.40/24 (I don't have this subnet anywhere in my network)
- attempted to ping 10.20.30.50
- checked for any packets on epair3b

On the server where epairs are working, I can see ARP packets for 10.20.30.50, but on the server where epairs are not working I can't see any packets on epair3b. I can however see the arp packets on epair3a on both servers.

So epair3a was not added to the bridge and epair3b was not added to a jail? That’s interesting, because it should mean the problem is not with the bridge or jail. As it affects ARP packets it also shouldn’t be a pf problem. It might be worth unloading the pf module, just to re-confirm, but I wouldn’t expect it to make a difference.

Please let me know if there is anything I can do to debug this issue or if you need any other information.

Are you creating/destroying vnet jails at any point? Is there a correlation with that and the start of the epair issues?

Are there any errors in `netstat -s` or `netstat -i epair3a` ?

Regards,
Kristof
[vnet][epair] epair interface stops working after some time
Hey,

I am having a strange issue with one of my servers. I have a couple of VNET jails on FreeBSD 12 r321619 set up using if_bridge and epairs. Each VNET jail (and the host too) has a pf firewall limiting inbound traffic.

Everything works as intended for some time (1-5 days): services inside the jail work and the jail can connect out to the rest of the network. After some time of working fine I suddenly find that the jails stop receiving traffic and can not send traffic out. Essentially the traffic on one end of the epair does not come out the other.

I have linked to a diagram with my network setup for the jails. Essentially the same setup is running on another identical server at another location and has been running for at least two weeks without any issues.

The symptoms are as follows:
- I can connect to the server via ssh (on igb0 at IP 192.168.1.50).
- All connections from outside the jails work fine (from 192.168.1.50 to external IPs).
- I can not connect to any services running inside the jails from either outside or inside the server.
- I can not connect out from the jails (jexec in to the jails and then attempt to connect out).
- When I attempt to connect out from one of the jails:
  - I see arp traffic (via tcpdump) on the epair inside the jail (epair0b)
  - I can't see the same arp traffic (via tcpdump) on the epair outside the jail (epair0a)
- 'arp -a' inside the jails shows incomplete arps for any external IP I try to reach.
- When I tcpdump on igb0, bridge0 or epair0a I see broadcast/multicast/general network traffic.
- When I tcpdump on epair0b I see no traffic at all.
I have done the following on both servers to test what happens:
- Created a new epair interface epair3a and epair3b
- upped both interfaces
- given epair3a IP address 10.20.30.40/24 (I don't have this subnet anywhere in my network)
- attempted to ping 10.20.30.50
- checked for any packets on epair3b

On the server where epairs are working, I can see ARP packets for 10.20.30.50, but on the server where epairs are not working I can't see any packets on epair3b. I can however see the arp packets on epair3a on both servers.

This is the third time I have found this on the same server and the other server is still going strong. After rebooting the server this problem seems to go away temporarily, but seems to manifest itself again after some time.

Any commands, ideas, thoughts on how to troubleshoot what is wrong here will be much appreciated. Please let me know if there is anything I can do to debug this issue or if you need any other information.

Thanks and best regards,
Reshad

Link to network diagram: https://i.imgur.com/1XdRjt0.jpg