Re: Ingress tc filters with IPSec
On May 30, 2015 at 2:24 AM John A. Sullivan III jsulli...@opensourcedevel.com wrote:

On Sat, 2015-05-30 at 01:52 -0400, John A. Sullivan III wrote:

Argh! Yet another obstacle from my ignorance. We are attempting ingress traffic shaping using IFB interfaces on traffic coming in via GRE / IPSec. Filters and hash tables are working fine with plain GRE, including stripping the header. We even got the ematch filter working so that the ESP packets are the only packets not redirected to IFB. But, regardless of whether we redirect ESP packets to IFB, the filters never see the decrypted packets. I thought the packets passed through the interface twice - first encrypted and then decrypted. However, tcpdump only shows the ESP packets on the interface. How do we apply filters to the packets after decryption? Thanks - John

I see what changed. In the past this seemed to work, but we were using tunnel mode. We were trying to use transport mode in this application, but that seems to prevent the decrypted packet contents from appearing again on the interface. Reverting to tunnel mode made the contents visible again and our filters are working as expected - John

Alas, this is still a problem, since we are using VRRP and the tunnel end points are the virtual IP addresses. That makes StrongSWAN choke on selector matching in tunnel mode, so back to trying to make transport mode work. I am guessing we do not see the second pass of the packet because it is only encrypted and not encapsulated. So my hunch is that we need to pass the ESP packet into the ifb qdisc but look elsewhere in the packet for the filter matching information. We know that matching on the normal offsets does not work, so I am hoping the decrypted packet is decipherable by the filter matching logic but still has the ESP transport header attached.
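For readers following along, the redirect arrangement described above might look something like the sketch below. The interface names (eth0, ifb0) are assumptions, and the u32 match on IP protocol 50 stands in for the ematch filter actually used in the thread:

```shell
# Hypothetical sketch: redirect all ingress traffic on eth0 to ifb0,
# except ESP (IP protocol 50), which is passed through untouched so it
# can be decrypted first. Interface names are assumptions.
modprobe ifb
ip link set ifb0 up
tc qdisc add dev eth0 handle ffff: ingress
# ESP matched first (prio 1) and left alone.
tc filter add dev eth0 parent ffff: protocol ip prio 1 u32 \
    match ip protocol 50 0xff action pass
# Everything else goes to ifb0 for shaping.
tc filter add dev eth0 parent ffff: protocol ip prio 2 u32 \
    match u32 0 0 action mirred egress redirect dev ifb0
```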
Normally, to extract the contents of my GRE tunnel, I would place them into a separate hash table with the GRE header stripped off and then filter them into TCP and UDP hash tables:

tc filter add dev ifb0 parent 11:0 protocol ip prio 2 u32 match ip protocol 47 0xff match u16 0x0800 0x at 22 link 11: offset at 0 mask 0f00 shift 6 plus 4 eat

So we match the GRE protocol and determine that GRE is carrying an IP packet. With the ESP transport header and IV (AES = 16B) interposed between the IP header and the GRE header, I suppose the first part of this filter becomes:

tc filter add dev ifb0 parent 11:0 protocol ip prio 2 u32 match ip protocol 47 0xff match u16 0x0800 0x at 46

but what do I do with the second half to find the start of the TCP/UDP header? Is it still "offset at 0" because tc filter somehow knows where the interior IP header starts, or should it be "offset at 48" to account for the GRE + ESP headers? Or is there a better way to filter ingress traffic on GRE/IPSec tunnels? Thanks - John

--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
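The byte offsets in those two filters can be checked with plain shell arithmetic, assuming a 20-byte outer IP header with no options, an 8-byte ESP header (SPI plus sequence number), and the 16-byte AES IV mentioned above:

```shell
# Byte offsets for the u32 matches, relative to the start of the outer
# IP header. Assumes no IP options (20-byte header).
ip_hdr=20          # outer IPv4 header
gre_ptype=2        # GRE protocol-type field sits 2 bytes into the GRE header
esp_hdr=8          # ESP: 4-byte SPI + 4-byte sequence number
aes_iv=16          # AES IV in transport mode
echo $((ip_hdr + gre_ptype))                      # 22: plain GRE
echo $((ip_hdr + esp_hdr + aes_iv + gre_ptype))   # 46: GRE behind ESP
echo $((ip_hdr + esp_hdr + aes_iv + 4))           # 48: payload past the 4-byte base GRE header
```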
Re: Ingress tc filters with IPSec
On May 30, 2015 at 4:12 PM jsulli...@opensourcedevel.com jsulli...@opensourcedevel.com wrote:

On May 30, 2015 at 2:24 AM John A. Sullivan III jsulli...@opensourcedevel.com wrote: snip

Alas, this is not working. I set a continue action for the ESP traffic:

tc filter replace dev ifb0 parent 11:0 protocol ip prio 1 u32 match ip protocol 50 0xff action continue

and that seems to be matching:

filter parent 11: protocol ip pref 1 u32 fh 802::800 order 2048 key ht 802 bkt 0 terminal flowid ???
(rule hit 3130003 success 2931853) match 0032/00ff at 8 (success 2931853) action order 1: gact action continue random type none pass val 0 index 1 ref 1 bind 1 installed 294 sec

And I even reduced the GRE filter to just look for the GRE protocol in the IP header:

tc filter add dev ifb0 parent 11:0 protocol ip prio 2 u32 match ip protocol 47 0xff link 11: offset at 48 mask 0f00 shift 6 plus 4 eat

but it does not appear to be matching at all:

filter parent 11: protocol ip pref 2 u32 fh 800::800 order 2048 key ht 800 bkt 0 link 11: (rule hit 3130012 success 0) match 002f/00ff at 8 (success 0) offset 0f006 at 48 plus 4 eat

Any suggestions about how to traffic shape ingress traffic coming off an ESP transport connection? Thanks - John
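One way to see what the filter logic can see, as done with tcpdump earlier in the thread, is to capture by IP protocol on the wire interface (eth0 is an assumed name):

```shell
# In transport mode the thread reports that only ESP frames appear on
# the interface; these two captures make that visible. eth0 is assumed.
tcpdump -ni eth0 -c 20 'ip proto 50'   # ESP packets, pre-decryption
tcpdump -ni eth0 -c 20 'ip proto 47'   # GRE packets - absent in transport mode per the thread
```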
Re: Drops in qdisc on ifb interface
On May 28, 2015 at 1:17 PM Eric Dumazet eric.duma...@gmail.com wrote:

On Thu, 2015-05-28 at 12:33 -0400, jsulli...@opensourcedevel.com wrote: Our initial testing has been single flow, but the ultimate purpose is processing real-time video in a complex application which ingests associated metadata, posts to a consumer-facing cloud, and does reporting back - so lots of different traffic with very different demands - a perfect tc environment.

Wait, do you really plan on using TCP for real-time video?

The overall product does, but the video source feeds come over a different network via UDP. There are, however, RTMP quality control feeds coming across this connection. There may also occasionally be test UDP source feeds on this connection, but those are not production. Thanks - John
Re: Drops in qdisc on ifb interface
On May 28, 2015 at 1:49 PM Eric Dumazet eric.duma...@gmail.com wrote:

On Thu, 2015-05-28 at 13:31 -0400, jsulli...@opensourcedevel.com wrote: The overall product does, but the video source feeds come over a different network via UDP. There are, however, RTMP quality control feeds coming across this connection. There may also occasionally be test UDP source feeds on this connection, but those are not production. Thanks - John

This is important to know, because UDP won't benefit from GRO. I was assuming your receiver had to handle ~88000 packets per second, so I was doubting it could saturate one core, but maybe your target is very different.

That PPS estimate seems accurate - the port speed and CIR on the shaped connection is 1 Gbps. I'm still mystified by why the GbE bottlenecks on IFB but the 10GbE does not. Thanks - John
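A rough sanity check of the ~88000 pps figure, assuming 1 Gbps of full-size Ethernet frames (1514 bytes on the wire, ignoring preamble and inter-frame gap for simplicity):

```shell
# 1 Gbps in bytes per second, divided by a 1514-byte Ethernet frame.
echo $((1000000000 / 8 / 1514))   # 82562 - same order as the ~88000 estimate above
```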
Re: Drops in qdisc on ifb interface
On May 25, 2015 at 6:31 PM Eric Dumazet eric.duma...@gmail.com wrote:

On Mon, 2015-05-25 at 16:05 -0400, John A. Sullivan III wrote: Hello, all. On one of our connections we are doing intensive traffic shaping with tc. We are using ifb interfaces for shaping ingress traffic and we also use ifb interfaces for egress so that we can apply the same set of rules to multiple interfaces (e.g., tun and eth interfaces operating on the same physical interface). These are running on very powerful gateways; I have watched them handling 16 Gbps with CPU utilization at a handful of percent. Yet, I am seeing drops on the ifb interfaces when I do a tc -s qdisc show. Why would this be? I would expect that if there were some kind of problem it would manifest as drops on the physical interfaces and not the IFB interface. We have played with queue lengths in both directions. We are using HFSC with SFQ leaves, so I would imagine this overrides the very short qlen on the IFB interfaces (32). These are drops and not overlimits.

IFB is single threaded and a serious bottleneck. Don't use this on egress; it destroys multiqueue capability. And SFQ is pretty limited (127 packets). You might try changing your NIC to have a single queue for RX, so that you have a single cpu feeding your IFB queue. (ethtool -L eth0 rx 1)

This has been an interesting exercise - thank you for your help along the way, Eric. IFB did not seem to bottleneck in our initial testing, but there was really only one flow of traffic during the test, at around 1 Gbps. However, on a non-test system with many different flows, IFB does seem to be a serious bottleneck - I assume this is the consequence of being single-threaded. Single queue did not seem to help. Am I correct to assume that IFB would be as much of a bottleneck on the ingress side as on the egress side? If so, is there any way to do high performance ingress traffic shaping on Linux - a multi-threaded version of IFB or a different approach?
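The egress arrangement described above (several interfaces feeding one rule set) is commonly built roughly as below; ifb1, eth0, and tun0 are assumed names, and per the warning in this message the pattern funnels all egress through IFB's single thread:

```shell
# Assumed names throughout. Both eth0 and tun0 redirect egress into
# ifb1, so one set of shaping rules on ifb1 covers both - at the cost
# of serializing everything through the single-threaded IFB device.
ip link set ifb1 up
for dev in eth0 tun0; do
    tc qdisc add dev "$dev" root handle 1: prio
    tc filter add dev "$dev" parent 1: protocol all prio 1 u32 \
        match u32 0 0 action mirred egress redirect dev ifb1
done
```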
Thanks - John
Re: tc drop stats different between bond and slave interfaces
On May 26, 2015 at 1:10 PM Cong Wang cw...@twopensource.com wrote:

On Mon, May 25, 2015 at 10:35 PM, jsulli...@opensourcedevel.com jsulli...@opensourcedevel.com wrote: I was also surprised to see that, although we are using a prio qdisc on the bond, the physical interface is showing pfifo_fast. [...] So why the difference, and why the pfifo_fast qdiscs on the physical interfaces?

A qdisc is not aware of the network interface you attach it to, so it doesn't know whether it is a bond or any other stacked interface; the qdisc you add to the bonding master has no idea about its slaves. As for pfifo_fast, it is the default child qdisc when mq is installed on the root; it is where mq actually holds the packets. Hope this helps.

Grr . . . I think this web client formatted my last response with HTML by default. My apologies. Yes, your reply does help, thank you, although it then raises an interesting question. If I neglect the slave interfaces as I have done, can I accidentally impact the shaping I have done on the bond master? For example, I may prioritize real time voice and video so their relatively evenly spaced packets are prioritized and sent to the physical interface with no special ToS marking. Someone's selfish mail application sets ToS bits for high priority and decides to send a huge attachment. Those packets also flood into the physical interface behind the video and voice packets, but now the physical interface using pfifo_fast sends the bulk email packets ahead of the voice and video. Is this an accurate scenario? Thus, if one uses traffic shaping on a bonded interface, should one then do something like use a prio qdisc with a single priority on the physical interfaces? Thanks - John
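If the reordering scenario above is a real risk, one conservative option (a sketch, with eth8 and eth9 as assumed slave names) is to give each slave a single plain FIFO so the ordering decided on the bond is preserved:

```shell
# Replace each slave's default mq/pfifo_fast with a single-band FIFO
# so packets leave in the order the bond's prio qdisc chose.
# eth8/eth9 are assumed slave names.
tc qdisc replace dev eth8 root pfifo limit 1000
tc qdisc replace dev eth9 root pfifo limit 1000
```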
Re: Drops in qdisc on ifb interface
On May 28, 2015 at 11:45 AM John Fastabend john.fastab...@gmail.com wrote:

On 05/28/2015 08:30 AM, jsulli...@opensourcedevel.com wrote:

On May 28, 2015 at 11:14 AM Eric Dumazet eric.duma...@gmail.com wrote:

On Thu, 2015-05-28 at 10:38 -0400, jsulli...@opensourcedevel.com wrote: snip

IFB still has a long way to go before being efficient. In the meantime, you could play with the following patch, and set /sys/class/net/eth0/gro_timeout to 2. This way, GRO aggregation will work even at 1 Gbps, and your IFB will get big GRO packets instead of single-MSS segments. Both IFB and the IP/TCP stack will have less work to do, and the receiver will send fewer ACK packets as well.

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index f287186192bb655ba2dc1a205fb251351d593e98..c37f6657c047d3eb9bd72b647572edd53b1881ac 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -151,7 +151,7 @@ static void igb_setup_dca(struct igb_adapter *);
 #endif /* CONFIG_IGB_DCA */

snip

Interesting, but this is destined to become a critical production system for a high profile, internationally recognized product, so I am hesitant to patch. I doubt I can convince my company to do it, but is improving IFB the sort of development effort that could be sponsored and then executed in a moderately short period of time? Thanks - John

If you're experimenting, one thing you could do is create many ifb devices and load balance across them from tc. I'm not sure if this would be practical in your setup or not, but it might be worth trying. One thing I've been debating adding is the ability to match on the current cpu_id in tc, which would allow you to load balance by cpu. I could send you a patch if you wanted to test it. I would expect this to help somewhat with the 'single queue' issue but sorry, I haven't had time yet to test it out myself. .John

-- John Fastabend Intel Corporation

In the meantime, I've noticed something strange.
When testing traffic between the two primary gateways, and thus identical traffic flows, I have the bottleneck on the one which uses two bonded GbE igb interfaces but not on the one which uses two bonded 10 GbE ixgbe interfaces. The ethtool -k settings are identical, e.g., gso, gro, lro. The ring buffer is larger on the ixgbe cards, but I would not think that would affect this. Identical kernels. The gateway hardware is identical and not working hard at all - no CPU or RAM pressure. Any idea why one bottlenecks and the other does not?

Returning to your idea, John, how would I load balance? I assume I would need to attach several filters to the physical interfaces, each redirecting traffic to different IFB devices. However, couldn't this work against the traffic shaping? Let's take an extreme example: all the time sensitive ingress packets find their way onto ifb0 and all the bulk ingress packets find their way onto ifb1. As these packets are merged back to the physical interface, won't they simply be treated in pfifo_fast (or other physical interface qdisc) order? Thanks - John
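The many-IFB suggestion could be sketched roughly as below. The split key (low bit of the source address) is an arbitrary illustration, not something from the thread, and eth0 is an assumed interface name:

```shell
# Spread ingress over two IFBs. The split key here (last bit of the
# source IP, i.e. byte 15 of the IP header) is purely illustrative.
for i in 0 1; do
    ip link add ifb$i type ifb 2>/dev/null
    ip link set ifb$i up
done
tc qdisc add dev eth0 handle ffff: ingress
# Odd last byte of the source address -> ifb1.
tc filter add dev eth0 parent ffff: protocol ip prio 1 u32 \
    match u8 0x01 0x01 at 15 action mirred egress redirect dev ifb1
# Everything else -> ifb0.
tc filter add dev eth0 parent ffff: protocol ip prio 2 u32 \
    match u32 0 0 action mirred egress redirect dev ifb0
```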
Re: Drops in qdisc on ifb interface
On May 28, 2015 at 12:26 PM Eric Dumazet eric.duma...@gmail.com wrote:

On Thu, 2015-05-28 at 08:45 -0700, John Fastabend wrote: snip

It seems John uses a single 1 Gbps flow, so only one cpu would receive NIC interrupts. The only way he could get better results would be to schedule IFB work on another core. (Assuming one cpu is 100% busy servicing NIC + IFB, but I really doubt it...)

Our initial testing has been single flow, but the ultimate purpose is processing real-time video in a complex application which ingests associated metadata, posts to a consumer-facing cloud, and does reporting back - so lots of different traffic with very different demands - a perfect tc environment. CPU utilization is remarkably light. Every once in a while, we see a single CPU about 50% utilized with si. Thanks, all - John
Re: Drops in qdisc on ifb interface
On May 28, 2015 at 11:14 AM Eric Dumazet eric.duma...@gmail.com wrote:

On Thu, 2015-05-28 at 10:38 -0400, jsulli...@opensourcedevel.com wrote: snip

IFB still has a long way to go before being efficient. In the meantime, you could play with the following patch, and set /sys/class/net/eth0/gro_timeout to 2. This way, GRO aggregation will work even at 1 Gbps, and your IFB will get big GRO packets instead of single-MSS segments. Both IFB and the IP/TCP stack will have less work to do, and the receiver will send fewer ACK packets as well.

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index f287186192bb655ba2dc1a205fb251351d593e98..c37f6657c047d3eb9bd72b647572edd53b1881ac 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -151,7 +151,7 @@ static void igb_setup_dca(struct igb_adapter *);
 #endif /* CONFIG_IGB_DCA */

snip

Interesting, but this is destined to become a critical production system for a high profile, internationally recognized product, so I am hesitant to patch. I doubt I can convince my company to do it, but is improving IFB the sort of development effort that could be sponsored and then executed in a moderately short period of time? Thanks - John
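With the igb patch above applied, the runtime knob mentioned in this message is set via sysfs (eth0 is an assumed interface name, and the setting only takes effect once the driver honors it):

```shell
# Enable a small GRO flush timeout so aggregation keeps working at
# 1 Gbps; IFB then sees large GRO packets rather than single-MSS
# segments. Requires the patched driver; eth0 is assumed.
echo 2 > /sys/class/net/eth0/gro_timeout
cat /sys/class/net/eth0/gro_timeout   # confirm the setting
```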
tc drop stats different between bond and slave interfaces
Hello, all. I'm troubleshooting why tunneled performance is degraded on one of our Internet connections. Eric Dumazet was very helpful with some earlier issues. We replaced SFQ with fq_codel as the leaf qdisc on our HFSC classes and we no longer have drops on the ifb interfaces. However, now we are seeing drops on the physical interfaces. These are bonded using 802.3ad. I assume we are correct to execute the tc commands against the bond interface. However, I was surprised to see the drop statistics differ between the bond interface and the slave interfaces. On one side, we see no errors on the bond interface and none on one slave, but quite a number on the other slave:

root@gwhq-2:~# tc -s qdisc show dev bond1
qdisc prio 2: root refcnt 17 bands 2 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 Sent 62053402767 bytes 41315883 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc ingress ffff: parent ffff:fff1
 Sent 7344131114 bytes 11437274 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
root@gwhq-2:~# tc -s qdisc show dev eth8
qdisc mq 0: root
 Sent 62044791989 bytes 41310334 pkt (dropped 5700, overlimits 0 requeues 2488)
 backlog 0b 0p requeues 2488
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 18848 bytes 152 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc pfifo_fast 0: parent :5 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 62044765871 bytes 41310027 pkt (dropped 5700, overlimits 0 requeues 2487)
 backlog 0b 0p requeues 2487
qdisc pfifo_fast 0: parent :6 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 5754 bytes 137 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc pfifo_fast 0: parent :7 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc pfifo_fast 0: parent :8 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 1516 bytes 18 pkt (dropped 0, overlimits 0 requeues 1)
 backlog 0b 0p requeues 1

I was also surprised to see that, although we are using a prio qdisc on the bond, the physical interface is showing pfifo_fast. On the other side, we show drops on the bond but none on either physical interface:

qdisc prio 2: root refcnt 17 bands 2 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 Sent 7744366990 bytes 11438167 pkt (dropped 8, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc ingress ffff: parent ffff:fff1
 Sent 59853360604 bytes 41423515 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
root@lcppeppr-labc02:~# tc -s qdisc show dev eth7
qdisc mq 0: root
 Sent 7744152748 bytes 11432931 pkt (dropped 0, overlimits 0 requeues 69)
 backlog 0b 0p requeues 69
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 71342010 bytes 844547 pkt (dropped 0, overlimits 0 requeues 10)
 backlog 0b 0p requeues 10
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 104260672 bytes 1298159 pkt (dropped 0, overlimits 0 requeues 4)
 backlog 0b 0p requeues 4
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 58931075 bytes 708986 pkt (dropped 0, overlimits 0 requeues 1)
 backlog 0b 0p requeues 1
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 7288852140 bytes 5677457 pkt (dropped 0, overlimits 0 requeues 14)
 backlog 0b 0p requeues 14
qdisc pfifo_fast 0: parent :5 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 42372833 bytes 506483 pkt (dropped 0, overlimits 0 requeues 1)
 backlog 0b 0p requeues 1
qdisc pfifo_fast 0: parent :6 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 36524401 bytes 395709 pkt (dropped 0, overlimits 0 requeues 30)
 backlog 0b 0p requeues 30
qdisc pfifo_fast 0: parent :7 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 121978491 bytes 1737068 pkt (dropped 0, overlimits 0 requeues 5)
 backlog 0b 0p requeues 5
qdisc pfifo_fast 0: parent :8 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 13336774 bytes 184341 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc pfifo_fast 0: parent :9 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 2553156 bytes 38393 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc pfifo_fast 0: parent :a bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 676410 bytes 7091 pkt
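The change described at the top of this message (fq_codel replacing SFQ as the leaf under HFSC) would look roughly like this; the handles, rates, and class ids are invented for illustration and are not from the thread:

```shell
# Illustrative only: one HFSC class on ifb0 with an fq_codel leaf in
# place of SFQ. Rates and class ids are assumptions.
tc qdisc add dev ifb0 root handle 1: hfsc default 20
tc class add dev ifb0 parent 1: classid 1:20 hfsc sc rate 900mbit ul rate 1000mbit
tc qdisc add dev ifb0 parent 1:20 handle 120: fq_codel
```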