Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.
On 9/20/2017 7:18 AM, Tom Herbert wrote:
On Tue, Sep 19, 2017 at 10:13 PM, Eric Dumazet wrote:
On Tue, 2017-09-19 at 21:59 -0700, Samudrala, Sridhar wrote:
On 9/19/2017 5:48 PM, Tom Herbert wrote:
On Tue, Sep 19, 2017 at 5:34 PM, Samudrala, Sridhar wrote:
On 9/12/2017 3:53 PM, Tom Herbert wrote:
On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar wrote:
On 9/12/2017 8:47 AM, Eric Dumazet wrote:
On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
On 9/11/2017 8:53 PM, Eric Dumazet wrote:
On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:

Two ints in sock_common for this purpose is quite expensive, and the use case for this is limited-- even if an RX->TX queue mapping were introduced to eliminate the queue-pair assumption, this still won't help if the receive and transmit interfaces are different for the connection. I think we really need to see some very compelling results to be able to justify this.

Will try to collect and post some perf data with symmetric queue configuration.

Here is some performance data I collected with a memcached workload over an ixgbe 10Gb NIC with the mcblaster benchmark. ixgbe is configured with 16 queues, and rx-usecs is set to 1000 for a very low interrupt rate:

ethtool -L p1p1 combined 16
ethtool -C p1p1 rx-usecs 1000

and busy poll is set to 1000 usecs:

sysctl net.core.busy_poll = 1000

16 threads 800K requests/sec
====================================================================
                  rtt(min/avg/max)usecs  intr/sec  contextswitch/sec
--------------------------------------------------------------------
Default           2/182/10641            23391     61163
Symmetric Queues  2/50/6311              20457     32843

32 threads 800K requests/sec
====================================================================
                  rtt(min/avg/max)usecs  intr/sec  contextswitch/sec
--------------------------------------------------------------------
Default           2/162/6390             32168     69450
Symmetric Queues  2/50/3853              35044     35847

No idea what "Default" configuration is. Please report how xps_cpus is being set, how many RSS queues there are, and what the mapping is between RSS queues and CPUs and shared caches. Also, whether and how threads are pinned.

Default is linux 4.13 with the settings I listed above.

ethtool -L p1p1 combined 16
ethtool -C p1p1 rx-usecs 1000
sysctl net.core.busy_poll = 1000

# ethtool -x p1p1
RX flow hash indirection table for p1p1 with 16 RX ring(s):
    0:   0  1  2  3  4  5  6  7
    8:   8  9 10 11 12 13 14 15
   16:   0  1  2  3  4  5  6  7
   24:   8  9 10 11 12 13 14 15
   32:   0  1  2  3  4  5  6  7
   40:   8  9 10 11 12 13 14 15
   48:   0  1  2  3  4  5  6  7
   56:   8  9 10 11 12 13 14 15
   64:   0  1  2  3  4  5  6  7
   72:   8  9 10 11 12 13 14 15
   80:   0  1  2  3  4  5  6  7
   88:   8  9 10 11 12 13 14 15
   96:   0  1  2  3  4  5  6  7
  104:   8  9 10 11 12 13 14 15
  112:   0  1  2  3  4  5  6  7
  120:   8  9 10 11 12 13 14 15

smp_affinity for the 16 queue pairs:
141 p1p1-TxRx-0   ,0001
142 p1p1-TxRx-1   ,0002
143 p1p1-TxRx-2   ,0004
144 p1p1-TxRx-3   ,0008
145 p1p1-TxRx-4   ,0010
146 p1p1-TxRx-5   ,0020
147 p1p1-TxRx-6   ,0040
148 p1p1-TxRx-7   ,0080
149 p1p1-TxRx-8   ,0100
150 p1p1-TxRx-9   ,0200
151 p1p1-TxRx-10  ,0400
152 p1p1-TxRx-11  ,0800
153 p1p1-TxRx-12  ,1000
154 p1p1-TxRx-13  ,2000
155 p1p1-TxRx-14  ,4000
156 p1p1-TxRx-15  ,8000

xps_cpus for the 16 Tx queues:
,0001 ,0002 ,0004 ,0008 ,0010 ,0020 ,0040 ,0080 ,0100 ,0200 ,0400 ,0800 ,1000 ,2000 ,4000 ,8000

memcached threads are not pinned.
...

I urge you to take the time to properly tune this host. The linux kernel does not do automagic configuration. This is user policy. Documentation/networking/scaling.txt has everything you need.

Yes, tuning a system for optimal performance is difficult. Even if you find a performance benefit for a configuration on one system, that might not translate to another. In other words, if you've produced some code that seems to perform better than previous implementation on a test machine it's n
Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.
Sridhar Samudrala writes:
> This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be used
> to enable symmetric tx and rx queues on a socket.
>
> This option is specifically useful for epoll based multi threaded workloads
> where each thread handles packets received on a single RX queue. In this model,
> we have noticed that it helps to send the packets on the same TX queue
> corresponding to the queue-pair associated with the RX queue, specifically when
> busy poll is enabled with epoll().
>
> Two new fields are added to struct sock_common to cache the last rx ifindex and
> the rx queue in the receive path of an SKB. __netdev_pick_tx() returns the
> cached rx queue when this option is enabled and the TX is happening on the
> same device.

Would it help to make the rx and tx skb hashes symmetric (skb_get_hash_symmetric) on request?
Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.
On Tue, Sep 19, 2017 at 10:13 PM, Eric Dumazet wrote:
> On Tue, 2017-09-19 at 21:59 -0700, Samudrala, Sridhar wrote:
>> On 9/19/2017 5:48 PM, Tom Herbert wrote:
>> > On Tue, Sep 19, 2017 at 5:34 PM, Samudrala, Sridhar wrote:
>> > > On 9/12/2017 3:53 PM, Tom Herbert wrote:
>> > > > On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar wrote:
>> > > > > On 9/12/2017 8:47 AM, Eric Dumazet wrote:
>> > > > > > On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
>> > > > > > > On 9/11/2017 8:53 PM, Eric Dumazet wrote:
>> > > > > > > > On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:
>> > > > > > > > > Two ints in sock_common for this purpose is quite expensive and the use case for this is limited-- even if a RX->TX queue mapping were introduced to eliminate the queue pair assumption this still won't help if the receive and transmit interfaces are different for the connection. I think we really need to see some very compelling results to be able to justify this.
>> > > > > > > Will try to collect and post some perf data with symmetric queue configuration.
>> > >
>> > > Here is some performance data I collected with a memcached workload over an ixgbe 10Gb NIC with the mcblaster benchmark. ixgbe is configured with 16 queues, and rx-usecs is set to 1000 for a very low interrupt rate.
>> > > ethtool -L p1p1 combined 16
>> > > ethtool -C p1p1 rx-usecs 1000
>> > > and busy poll is set to 1000 usecs
>> > > sysctl net.core.busy_poll = 1000
>> > >
>> > > 16 threads 800K requests/sec
>> > > ====================================================================
>> > >                   rtt(min/avg/max)usecs  intr/sec  contextswitch/sec
>> > > --------------------------------------------------------------------
>> > > Default           2/182/10641            23391     61163
>> > > Symmetric Queues  2/50/6311              20457     32843
>> > >
>> > > 32 threads 800K requests/sec
>> > > ====================================================================
>> > >                   rtt(min/avg/max)usecs  intr/sec  contextswitch/sec
>> > > --------------------------------------------------------------------
>> > > Default           2/162/6390             32168     69450
>> > > Symmetric Queues  2/50/3853              35044     35847
>> > >
>> > No idea what "Default" configuration is. Please report how xps_cpus is being set, how many RSS queues there are, and what the mapping is between RSS queues and CPUs and shared caches. Also, whether and how threads are pinned.
>> Default is linux 4.13 with the settings I listed above.
>> ethtool -L p1p1 combined 16
>> ethtool -C p1p1 rx-usecs 1000
>> sysctl net.core.busy_poll = 1000
>>
>> # ethtool -x p1p1
>> RX flow hash indirection table for p1p1 with 16 RX ring(s):
>>     0:   0  1  2  3  4  5  6  7
>>     8:   8  9 10 11 12 13 14 15
>>    16:   0  1  2  3  4  5  6  7
>>    24:   8  9 10 11 12 13 14 15
>>    32:   0  1  2  3  4  5  6  7
>>    40:   8  9 10 11 12 13 14 15
>>    48:   0  1  2  3  4  5  6  7
>>    56:   8  9 10 11 12 13 14 15
>>    64:   0  1  2  3  4  5  6  7
>>    72:   8  9 10 11 12 13 14 15
>>    80:   0  1  2  3  4  5  6  7
>>    88:   8  9 10 11 12 13 14 15
>>    96:   0  1  2  3  4  5  6  7
>>   104:   8  9 10 11 12 13 14 15
>>   112:   0  1  2  3  4  5  6  7
>>   120:   8  9 10 11 12 13 14 15
>>
>> smp_affinity for the 16 queue pairs:
>> 141 p1p1-TxRx-0   ,0001
>> 142 p1p1-TxRx-1   ,0002
>> 143 p1p1-TxRx-2   ,0004
>> 144 p1p1-TxRx-3   ,0008
>> 145 p1p1-TxRx-4   ,0010
>> 146 p1p1-TxRx-5   ,0020
>> 147 p1p1-TxRx-6   ,0040
>> 148 p1p1-TxRx-7   ,0080
>> 149 p1p1-TxRx-8   ,0100
>> 150 p1p1-TxRx-9   ,0200
>> 151 p1p1-TxRx-10  ,0400
>> 152 p1p1-TxRx-11  ,0800
>> 153 p1p1-TxRx-12  ,1000
>> 154 p1p1-TxRx-13  ,2000
>> 155 p1p1-TxRx-14  ,4000
>> 156 p1p1-TxRx-15  ,8000
>> xps_cpus for the 16 Tx queues:
>> ,0001
>> ,0002
>> ,0004
>> ,0008
>> ,0010
>> ,0020
>> ,0040
>> ,0080
>> ,0100
>> ,0200
>> ,0400
>> ,0
Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.
On Tue, 2017-09-19 at 21:59 -0700, Samudrala, Sridhar wrote:
> On 9/19/2017 5:48 PM, Tom Herbert wrote:
> > On Tue, Sep 19, 2017 at 5:34 PM, Samudrala, Sridhar wrote:
> > > On 9/12/2017 3:53 PM, Tom Herbert wrote:
> > > > On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar wrote:
> > > > > On 9/12/2017 8:47 AM, Eric Dumazet wrote:
> > > > > > On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
> > > > > > > On 9/11/2017 8:53 PM, Eric Dumazet wrote:
> > > > > > > > On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:
> > > > > > > > > Two ints in sock_common for this purpose is quite expensive and the use case for this is limited-- even if a RX->TX queue mapping were introduced to eliminate the queue pair assumption this still won't help if the receive and transmit interfaces are different for the connection. I think we really need to see some very compelling results to be able to justify this.
> > > > > > > Will try to collect and post some perf data with symmetric queue configuration.
> > >
> > > Here is some performance data I collected with a memcached workload over an ixgbe 10Gb NIC with the mcblaster benchmark. ixgbe is configured with 16 queues, and rx-usecs is set to 1000 for a very low interrupt rate.
> > > ethtool -L p1p1 combined 16
> > > ethtool -C p1p1 rx-usecs 1000
> > > and busy poll is set to 1000 usecs
> > > sysctl net.core.busy_poll = 1000
> > >
> > > 16 threads 800K requests/sec
> > > ====================================================================
> > >                   rtt(min/avg/max)usecs  intr/sec  contextswitch/sec
> > > --------------------------------------------------------------------
> > > Default           2/182/10641            23391     61163
> > > Symmetric Queues  2/50/6311              20457     32843
> > >
> > > 32 threads 800K requests/sec
> > > ====================================================================
> > >                   rtt(min/avg/max)usecs  intr/sec  contextswitch/sec
> > > --------------------------------------------------------------------
> > > Default           2/162/6390             32168     69450
> > > Symmetric Queues  2/50/3853              35044     35847
> > >
> > No idea what "Default" configuration is. Please report how xps_cpus is being set, how many RSS queues there are, and what the mapping is between RSS queues and CPUs and shared caches. Also, whether and how threads are pinned.
> Default is linux 4.13 with the settings I listed above.
> ethtool -L p1p1 combined 16
> ethtool -C p1p1 rx-usecs 1000
> sysctl net.core.busy_poll = 1000
>
> # ethtool -x p1p1
> RX flow hash indirection table for p1p1 with 16 RX ring(s):
>     0:   0  1  2  3  4  5  6  7
>     8:   8  9 10 11 12 13 14 15
>    16:   0  1  2  3  4  5  6  7
>    24:   8  9 10 11 12 13 14 15
>    32:   0  1  2  3  4  5  6  7
>    40:   8  9 10 11 12 13 14 15
>    48:   0  1  2  3  4  5  6  7
>    56:   8  9 10 11 12 13 14 15
>    64:   0  1  2  3  4  5  6  7
>    72:   8  9 10 11 12 13 14 15
>    80:   0  1  2  3  4  5  6  7
>    88:   8  9 10 11 12 13 14 15
>    96:   0  1  2  3  4  5  6  7
>   104:   8  9 10 11 12 13 14 15
>   112:   0  1  2  3  4  5  6  7
>   120:   8  9 10 11 12 13 14 15
>
> smp_affinity for the 16 queue pairs:
> 141 p1p1-TxRx-0   ,0001
> 142 p1p1-TxRx-1   ,0002
> 143 p1p1-TxRx-2   ,0004
> 144 p1p1-TxRx-3   ,0008
> 145 p1p1-TxRx-4   ,0010
> 146 p1p1-TxRx-5   ,0020
> 147 p1p1-TxRx-6   ,0040
> 148 p1p1-TxRx-7   ,0080
> 149 p1p1-TxRx-8   ,0100
> 150 p1p1-TxRx-9   ,0200
> 151 p1p1-TxRx-10  ,0400
> 152 p1p1-TxRx-11  ,0800
> 153 p1p1-TxRx-12  ,1000
> 154 p1p1-TxRx-13  ,2000
> 155 p1p1-TxRx-14  ,4000
> 156 p1p1-TxRx-15  ,8000
> xps_cpus for the 16 Tx queues:
> ,0001 ,0002 ,0004 ,0008 ,0010 ,0020 ,0040 ,0080 ,0100 ,0200 ,0400 ,0800 ,1000 ,2000 ,4000 ,8000
> memcached threads are not pinned.
> ...
I urge you to take the time to properly tune this host. The linux kernel does not do automagic configuration. This is user policy. Documentation/networking/scaling.txt has everything you need.
Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.
On Tue, Sep 19, 2017 at 5:34 PM, Samudrala, Sridhar wrote:
> On 9/12/2017 3:53 PM, Tom Herbert wrote:
>> On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar wrote:
>>> On 9/12/2017 8:47 AM, Eric Dumazet wrote:
>>>> On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
>>>>> On 9/11/2017 8:53 PM, Eric Dumazet wrote:
>>>>>> On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:
>>>>>>> Two ints in sock_common for this purpose is quite expensive and the use case for this is limited-- even if a RX->TX queue mapping were introduced to eliminate the queue pair assumption this still won't help if the receive and transmit interfaces are different for the connection. I think we really need to see some very compelling results to be able to justify this.
>>>>> Will try to collect and post some perf data with symmetric queue configuration.
>
> Here is some performance data I collected with a memcached workload over an ixgbe 10Gb NIC with the mcblaster benchmark. ixgbe is configured with 16 queues, and rx-usecs is set to 1000 for a very low interrupt rate.
> ethtool -L p1p1 combined 16
> ethtool -C p1p1 rx-usecs 1000
> and busy poll is set to 1000 usecs
> sysctl net.core.busy_poll = 1000
>
> 16 threads 800K requests/sec
> ====================================================================
>                   rtt(min/avg/max)usecs  intr/sec  contextswitch/sec
> --------------------------------------------------------------------
> Default           2/182/10641            23391     61163
> Symmetric Queues  2/50/6311              20457     32843
>
> 32 threads 800K requests/sec
> ====================================================================
>                   rtt(min/avg/max)usecs  intr/sec  contextswitch/sec
> --------------------------------------------------------------------
> Default           2/162/6390             32168     69450
> Symmetric Queues  2/50/3853              35044     35847

No idea what "Default" configuration is. Please report how xps_cpus is being set, how many RSS queues there are, and what the mapping is between RSS queues and CPUs and shared caches. Also, whether and how threads are pinned.

Thanks,
Tom
Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.
On 9/12/2017 3:53 PM, Tom Herbert wrote:
On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar wrote:
On 9/12/2017 8:47 AM, Eric Dumazet wrote:
On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
On 9/11/2017 8:53 PM, Eric Dumazet wrote:
On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:

Two ints in sock_common for this purpose is quite expensive and the use case for this is limited-- even if a RX->TX queue mapping were introduced to eliminate the queue pair assumption this still won't help if the receive and transmit interfaces are different for the connection. I think we really need to see some very compelling results to be able to justify this.

Will try to collect and post some perf data with symmetric queue configuration.

Here is some performance data I collected with a memcached workload over an ixgbe 10Gb NIC with the mcblaster benchmark. ixgbe is configured with 16 queues, and rx-usecs is set to 1000 for a very low interrupt rate:

ethtool -L p1p1 combined 16
ethtool -C p1p1 rx-usecs 1000

and busy poll is set to 1000 usecs:

sysctl net.core.busy_poll = 1000

16 threads 800K requests/sec
====================================================================
                  rtt(min/avg/max)usecs  intr/sec  contextswitch/sec
--------------------------------------------------------------------
Default           2/182/10641            23391     61163
Symmetric Queues  2/50/6311              20457     32843

32 threads 800K requests/sec
====================================================================
                  rtt(min/avg/max)usecs  intr/sec  contextswitch/sec
--------------------------------------------------------------------
Default           2/162/6390             32168     69450
Symmetric Queues  2/50/3853              35044     35847

Yes, this is unreasonable cost. XPS should really cover the case already.

Eric,

Can you clarify how XPS covers the RX->TX queue mapping case? Is it possible to configure XPS to select the TX queue based on the RX queue of a flow? IIUC, it is based on the CPU of the thread doing the transmit OR based on skb->priority to TC mapping? It may be possible to get this effect if the threads are pinned to a core, but if the app threads are freely moving, I am not sure how XPS can be configured to select the TX queue based on the RX queue of a flow.
If the application is freely moving, how can the NIC properly select the RX queue so that packets are coming to the appropriate queue?

The RX queue is selected via RSS, and we don't want to move the flow based on where the thread is running.

Unless flow director is enabled on the Intel device... This was, I believe, one of the first attempts to introduce a queue pair notion to general purpose NICs. The idea was that the device records the TX queue for a flow and then uses that to determine the receive queue in a symmetric fashion. aRFS is similar, but it was under SW control how the mapping is done. As Eric mentioned there are scalability issues with these mechanisms, but we also found that flow director can easily reorder packets whenever the thread moves.

You must be referring to the ATR (application targeted routing) feature on Intel NICs where a flow director entry is added for a flow based on the TX queue used for that flow. Instead, we would like to select the TX queue based on the RX queue of a flow.

This is called aRFS, and it does not scale to millions of flows. We tried in the past, and this went nowhere really, since the setup cost is prohibitive and DDOS vulnerable.

XPS will follow the thread, since selection is done on the current cpu.

The problem is the RX side. If the application is free to migrate, then special support (aRFS) is needed from the hardware.

This may be true if most of the rx processing is happening in the interrupt context. But with busy polling, I think we don't need aRFS, as a thread should be able to poll any queue irrespective of where it is running.

It's not just a problem with interrupt processing; in general we like to have all receive processing and subsequent transmit of a reply done on one CPU. Silo'ing is good for performance and parallelism. This can sometimes be relaxed in situations where CPUs share a cache, so crossing CPUs is not costly.

Yes. We would like to get this behavior even without binding the app thread to a CPU.
At least for passive connections, we already have all the support in the kernel so that you can have one thread per NIC queue, dealing with sockets that have incoming packets all received on one NIC RX queue. (And of course all TX packets will use the symmetric TX queue)

SO_REUSEPORT plus an appropriate BPF filter can achieve that.

Say you have 32 queues, 32 cpus. Simply use 32 listeners, 32 threads (or 32 pools of threads).

Yes. This will work if each thread is pinned to a core associated with the RX interrupt. It may not be possible to pin the threads to a core. Instead we want to associate a thread to a queue and do all the RX and TX completion of a queue in the same thread context via busy polling.
Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.
On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar wrote:
> On 9/12/2017 8:47 AM, Eric Dumazet wrote:
>> On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
>>> On 9/11/2017 8:53 PM, Eric Dumazet wrote:
>>>> On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:
>>>>> Two ints in sock_common for this purpose is quite expensive and the use case for this is limited-- even if a RX->TX queue mapping were introduced to eliminate the queue pair assumption this still won't help if the receive and transmit interfaces are different for the connection. I think we really need to see some very compelling results to be able to justify this.
>>> Will try to collect and post some perf data with symmetric queue configuration.
>>>> Yes, this is unreasonable cost. XPS should really cover the case already.
>>> Eric,
>>> Can you clarify how XPS covers the RX->TX queue mapping case? Is it possible to configure XPS to select the TX queue based on the RX queue of a flow? IIUC, it is based on the CPU of the thread doing the transmit OR based on skb->priority to TC mapping? It may be possible to get this effect if the threads are pinned to a core, but if the app threads are freely moving, I am not sure how XPS can be configured to select the TX queue based on the RX queue of a flow.
>> If the application is freely moving, how can the NIC properly select the RX queue so that packets are coming to the appropriate queue?
> The RX queue is selected via RSS and we don't want to move the flow based on where the thread is running.

Unless flow director is enabled on the Intel device... This was, I believe, one of the first attempts to introduce a queue pair notion to general purpose NICs. The idea was that the device records the TX queue for a flow and then uses that to determine the receive queue in a symmetric fashion. aRFS is similar, but was under SW control how the mapping is done.
As Eric mentioned there are scalability issues with these mechanisms, but we also found that flow director can easily reorder packets whenever the thread moves.

>> This is called aRFS, and it does not scale to millions of flows. We tried in the past, and this went nowhere really, since the setup cost is prohibitive and DDOS vulnerable.
>>
>> XPS will follow the thread, since selection is done on the current cpu.
>>
>> The problem is the RX side. If the application is free to migrate, then special support (aRFS) is needed from the hardware.
> This may be true if most of the rx processing is happening in the interrupt context. But with busy polling, I think we don't need aRFS, as a thread should be able to poll any queue irrespective of where it is running.

It's not just a problem with interrupt processing; in general we like to have all receive processing and subsequent transmit of a reply done on one CPU. Silo'ing is good for performance and parallelism. This can sometimes be relaxed in situations where CPUs share a cache, so crossing CPUs is not costly.

>> At least for passive connections, we already have all the support in the kernel so that you can have one thread per NIC queue, dealing with sockets that have incoming packets all received on one NIC RX queue. (And of course all TX packets will use the symmetric TX queue)
>>
>> SO_REUSEPORT plus appropriate BPF filter can achieve that.
>>
>> Say you have 32 queues, 32 cpus. Simply use 32 listeners, 32 threads (or 32 pools of threads)
> Yes. This will work if each thread is pinned to a core associated with the RX interrupt. It may not be possible to pin the threads to a core. Instead we want to associate a thread to a queue and do all the RX and TX completion of a queue in the same thread context via busy polling.

When that happens it's possible for RX to be done on the completely wrong CPU, which we know is suboptimal.
However, this shouldn't negatively affect the TX side, since XPS will just use the queue appropriate for the running CPU. Like Eric said, this is really a receive problem more than a transmit problem. Keeping them as independent paths seems to be a good approach.

Tom
Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.
On 9/12/2017 8:47 AM, Eric Dumazet wrote:
On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
On 9/11/2017 8:53 PM, Eric Dumazet wrote:
On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:

Two ints in sock_common for this purpose is quite expensive and the use case for this is limited-- even if a RX->TX queue mapping were introduced to eliminate the queue pair assumption this still won't help if the receive and transmit interfaces are different for the connection. I think we really need to see some very compelling results to be able to justify this.

Will try to collect and post some perf data with symmetric queue configuration.

Yes, this is unreasonable cost. XPS should really cover the case already.

Eric,

Can you clarify how XPS covers the RX->TX queue mapping case? Is it possible to configure XPS to select the TX queue based on the RX queue of a flow? IIUC, it is based on the CPU of the thread doing the transmit OR based on skb->priority to TC mapping? It may be possible to get this effect if the threads are pinned to a core, but if the app threads are freely moving, I am not sure how XPS can be configured to select the TX queue based on the RX queue of a flow.

If the application is freely moving, how can the NIC properly select the RX queue so that packets are coming to the appropriate queue?

The RX queue is selected via RSS, and we don't want to move the flow based on where the thread is running.

This is called aRFS, and it does not scale to millions of flows. We tried in the past, and this went nowhere really, since the setup cost is prohibitive and DDOS vulnerable.

XPS will follow the thread, since selection is done on the current cpu.

The problem is the RX side. If the application is free to migrate, then special support (aRFS) is needed from the hardware.

This may be true if most of the rx processing is happening in the interrupt context.
But with busy polling, I think we don't need aRFS, as a thread should be able to poll any queue irrespective of where it is running.

At least for passive connections, we already have all the support in the kernel so that you can have one thread per NIC queue, dealing with sockets that have incoming packets all received on one NIC RX queue. (And of course all TX packets will use the symmetric TX queue)

SO_REUSEPORT plus appropriate BPF filter can achieve that.

Say you have 32 queues, 32 cpus. Simply use 32 listeners, 32 threads (or 32 pools of threads).

Yes. This will work if each thread is pinned to a core associated with the RX interrupt. It may not be possible to pin the threads to a core. Instead we want to associate a thread to a queue and do all the RX and TX completion of a queue in the same thread context via busy polling.

Thanks
Sridhar
Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.
On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
> On 9/11/2017 8:53 PM, Eric Dumazet wrote:
>> On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:
>>> Two ints in sock_common for this purpose is quite expensive and the use case for this is limited-- even if a RX->TX queue mapping were introduced to eliminate the queue pair assumption this still won't help if the receive and transmit interfaces are different for the connection. I think we really need to see some very compelling results to be able to justify this.
> Will try to collect and post some perf data with symmetric queue configuration.
>> Yes, this is unreasonable cost.
>>
>> XPS should really cover the case already.
>
> Eric,
>
> Can you clarify how XPS covers the RX->TX queue mapping case?
> Is it possible to configure XPS to select the TX queue based on the RX queue of a flow?
> IIUC, it is based on the CPU of the thread doing the transmit OR based on skb->priority to TC mapping?
> It may be possible to get this effect if the threads are pinned to a core, but if the app threads are freely moving, I am not sure how XPS can be configured to select the TX queue based on the RX queue of a flow.

If the application is freely moving, how can the NIC properly select the RX queue so that packets are coming to the appropriate queue?

This is called aRFS, and it does not scale to millions of flows. We tried in the past, and this went nowhere really, since the setup cost is prohibitive and DDOS vulnerable.

XPS will follow the thread, since selection is done on the current cpu.

The problem is the RX side. If the application is free to migrate, then special support (aRFS) is needed from the hardware.

At least for passive connections, we already have all the support in the kernel so that you can have one thread per NIC queue, dealing with sockets that have incoming packets all received on one NIC RX queue.
(And of course all TX packets will use the symmetric TX queue)

SO_REUSEPORT plus appropriate BPF filter can achieve that.

Say you have 32 queues, 32 cpus.

Simply use 32 listeners, 32 threads (or 32 pools of threads)
Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.
On 9/11/2017 8:53 PM, Eric Dumazet wrote:
On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:

Two ints in sock_common for this purpose is quite expensive and the use case for this is limited-- even if a RX->TX queue mapping were introduced to eliminate the queue pair assumption this still won't help if the receive and transmit interfaces are different for the connection. I think we really need to see some very compelling results to be able to justify this.

Will try to collect and post some perf data with symmetric queue configuration.

Yes, this is unreasonable cost. XPS should really cover the case already.

Eric,

Can you clarify how XPS covers the RX->TX queue mapping case? Is it possible to configure XPS to select the TX queue based on the RX queue of a flow? IIUC, it is based on the CPU of the thread doing the transmit OR based on skb->priority to TC mapping? It may be possible to get this effect if the threads are pinned to a core, but if the app threads are freely moving, I am not sure how XPS can be configured to select the TX queue based on the RX queue of a flow.

Thanks
Sridhar
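For context on how XPS is configured at all: it is set per TX queue via a CPU bitmask in sysfs, which is where the 1:1 xps_cpus layout quoted later in this thread comes from. The sketch below only prints the commands rather than applying them (writing to /sys needs root, and the device name p1p1 and 16-queue layout are taken from the mail); pipe the output to `sh` as root to actually apply it.

```shell
# Print the commands mapping each of 16 TX queues to the matching CPU
# (mask for queue i is 1 << i, so queue 0 -> 0001 ... queue 15 -> 8000).
for i in $(seq 0 15); do
    printf 'echo %04x > /sys/class/net/p1p1/queues/tx-%d/xps_cpus\n' \
        $((1 << i)) "$i"
done
```

Note this only controls which TX queue a CPU transmits on; as the thread discusses, it does nothing to tie TX queue choice to the RX queue of a flow when threads migrate.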
Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.
On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:
> Two ints in sock_common for this purpose is quite expensive and the
> use case for this is limited-- even if a RX->TX queue mapping were
> introduced to eliminate the queue pair assumption this still won't
> help if the receive and transmit interfaces are different for the
> connection. I think we really need to see some very compelling results
> to be able to justify this.

Yes, this is unreasonable cost.

XPS should really cover the case already.
Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.
On Thu, Aug 31, 2017 at 4:27 PM, Sridhar Samudrala wrote:
> This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be used
> to enable symmetric tx and rx queues on a socket.
>
> This option is specifically useful for epoll based multi threaded workloads
> where each thread handles packets received on a single RX queue. In this model,
> we have noticed that it helps to send the packets on the same TX queue
> corresponding to the queue-pair associated with the RX queue, specifically when
> busy poll is enabled with epoll().
>
> Two new fields are added to struct sock_common to cache the last rx ifindex and
> the rx queue in the receive path of an SKB. __netdev_pick_tx() returns the
> cached rx queue when this option is enabled and the TX is happening on the
> same device.
>
> Signed-off-by: Sridhar Samudrala
> ---
>  include/net/request_sock.h        |  1 +
>  include/net/sock.h                | 17 +
>  include/uapi/asm-generic/socket.h |  2 ++
>  net/core/dev.c                    |  8 +++-
>  net/core/sock.c                   | 10 ++
>  net/ipv4/tcp_input.c              |  1 +
>  net/ipv4/tcp_ipv4.c               |  1 +
>  net/ipv4/tcp_minisocks.c          |  1 +
>  8 files changed, 40 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/request_sock.h b/include/net/request_sock.h
> index 23e2205..c3bc12e 100644
> --- a/include/net/request_sock.h
> +++ b/include/net/request_sock.h
> @@ -100,6 +100,7 @@ static inline struct sock *req_to_sk(struct request_sock *req)
>  	req_to_sk(req)->sk_prot = sk_listener->sk_prot;
>  	sk_node_init(&req_to_sk(req)->sk_node);
>  	sk_tx_queue_clear(req_to_sk(req));
> +	req_to_sk(req)->sk_symmetric_queues = sk_listener->sk_symmetric_queues;
>  	req->saved_syn = NULL;
>  	refcount_set(&req->rsk_refcnt, 0);
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 03a3625..3421809 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -138,11 +138,14 @@ void SOCK_DEBUG(const struct sock *sk, const char *msg, ...)
>   *	@skc_node: main hash linkage for various protocol lookup tables
>   *	@skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
>   *	@skc_tx_queue_mapping: tx queue number for this connection
> + *	@skc_rx_queue_mapping: rx queue number for this connection
> + *	@skc_rx_ifindex: rx ifindex for this connection
>   *	@skc_flags: place holder for sk_flags
>   *		%SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
>   *		%SO_OOBINLINE settings, %SO_TIMESTAMPING settings
>   *	@skc_incoming_cpu: record/match cpu processing incoming packets
>   *	@skc_refcnt: reference count
> + *	@skc_symmetric_queues: symmetric tx/rx queues
>   *
>   *	This is the minimal network layer representation of sockets, the header
>   *	for struct sock and struct inet_timewait_sock.
> @@ -177,6 +180,7 @@ struct sock_common {
>  	unsigned char		skc_reuseport:1;
>  	unsigned char		skc_ipv6only:1;
>  	unsigned char		skc_net_refcnt:1;
> +	unsigned char		skc_symmetric_queues:1;
>  	int			skc_bound_dev_if;
>  	union {
>  		struct hlist_node	skc_bind_node;
> @@ -214,6 +218,8 @@ struct sock_common {
>  		struct hlist_nulls_node skc_nulls_node;
>  	};
>  	int			skc_tx_queue_mapping;
> +	int			skc_rx_queue_mapping;
> +	int			skc_rx_ifindex;

Two ints in sock_common for this purpose is quite expensive and the use case for this is limited-- even if a RX->TX queue mapping were introduced to eliminate the queue pair assumption this still won't help if the receive and transmit interfaces are different for the connection. I think we really need to see some very compelling results to be able to justify this.

Thanks,
Tom
Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.
On 9/10/2017 8:19 AM, Tom Herbert wrote:
> On Thu, Aug 31, 2017 at 4:27 PM, Sridhar Samudrala wrote:
>> [patch description and earlier hunks snipped; the sock.h changes continue]
>>  	union {
>>  		int			skc_incoming_cpu;
>>  		u32			skc_rcv_wnd;
>> @@ -324,6 +330,8 @@ struct sock {
>>  #define sk_nulls_node		__sk_common.skc_nulls_node
>>  #define sk_refcnt		__sk_common.skc_refcnt
>>  #define sk_tx_queue_mapping	__sk_common.skc_tx_queue_mapping
>> +#define sk_rx_queue_mapping	__sk_common.skc_rx_queue_mapping
>> +#define sk_rx_ifindex		__sk_common.skc_rx_ifindex
>>
>>  #define sk_dontcopy_begin	__sk_common.skc_dontcopy_begin
>>  #define sk_dontcopy_end		__sk_common.skc_dontcopy_end
>> @@ -340,6 +348,7 @@ struct sock {
>>  #define sk_reuseport		__sk_common.skc_reuseport
>>  #define sk_ipv6only		__sk_common.skc_ipv6only
>>  #define sk_net_refcnt		__sk_common.skc_net_refcnt
>> +#define sk_symmetric_queues	__sk_common.skc_symmetric_queues
>>  #define sk_bound_dev_if		__sk_common.skc_bound_dev_if
>>  #define sk_bind_node		__sk_common.skc_bind_node
>>  #define sk_prot			__sk_common.skc_prot
>> @@ -1676,6 +1685,14 @@ static inline int sk_tx_queue_get(const struct sock *sk)
>>  	return sk ? sk->sk_tx_queue_mapping : -1;
>>  }
>>
>> +static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff *skb)
>> +{
>> +	if (sk->sk_symmetric_queues) {
>> +		sk->sk_rx_ifindex = skb->skb_iif;
>> +		sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);
>> +	}
>> +}
>> +
>>  static inline void sk_set_socke
Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.
On Thu, Aug 31, 2017 at 4:27 PM, Sridhar Samudrala wrote:
> This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be used
> to enable symmetric tx and rx queues on a socket.
>
> This option is specifically useful for epoll based multi-threaded workloads
> where each thread handles packets received on a single RX queue. In this model,
> we have noticed that it helps to send the packets on the same TX queue
> corresponding to the queue-pair associated with the RX queue, specifically when
> busy poll is enabled with epoll().
>
Please provide more details and test results on exactly how this helps. Why would this be better than optimized XPS?

Thanks,
Tom
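For context, the "optimized XPS" alternative Tom refers to is configured per TX queue through sysfs rather than per socket. A sketch of a symmetric XPS setup for a 16-queue NIC, assuming the RX IRQ for queue N is affinitized to CPU N (the device name is illustrative; requires root):

```shell
# Hypothetical device name; adjust for your NIC.
DEV=eth0

# Map TX queue N to CPU N, so a thread pinned to the CPU that services
# RX queue N also transmits on queue N -- the same symmetry the RFC
# tries to achieve per socket.
for q in $(seq 0 15); do
    mask=$(printf '%x' $((1 << q)))
    echo "$mask" > "/sys/class/net/$DEV/queues/tx-$q/xps_cpus"
done
```

The trade-off under debate: XPS keys the TX queue off the sending CPU, so it only yields symmetry when threads stay pinned, while the socket option follows the flow even if the thread migrates.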
[RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.
This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be
used to enable symmetric tx and rx queues on a socket.

This option is specifically useful for epoll-based multi-threaded workloads
where each thread handles packets received on a single RX queue. In this
model, we have noticed that it helps to send the packets on the same TX
queue corresponding to the queue pair associated with the RX queue,
specifically when busy poll is enabled with epoll().

Two new fields are added to struct sock_common to cache the last rx ifindex
and the rx queue in the receive path of an SKB. __netdev_pick_tx() returns
the cached rx queue when this option is enabled and the TX is happening on
the same device.

Signed-off-by: Sridhar Samudrala
---
 include/net/request_sock.h        |  1 +
 include/net/sock.h                | 17 +++++++++++++++++
 include/uapi/asm-generic/socket.h |  2 ++
 net/core/dev.c                    |  8 +++++++-
 net/core/sock.c                   | 10 ++++++++++
 net/ipv4/tcp_input.c              |  1 +
 net/ipv4/tcp_ipv4.c               |  1 +
 net/ipv4/tcp_minisocks.c          |  1 +
 8 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 23e2205..c3bc12e 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -100,6 +100,7 @@ static inline struct sock *req_to_sk(struct request_sock *req)
 	req_to_sk(req)->sk_prot = sk_listener->sk_prot;
 	sk_node_init(&req_to_sk(req)->sk_node);
 	sk_tx_queue_clear(req_to_sk(req));
+	req_to_sk(req)->sk_symmetric_queues = sk_listener->sk_symmetric_queues;
 	req->saved_syn = NULL;
 	refcount_set(&req->rsk_refcnt, 0);

diff --git a/include/net/sock.h b/include/net/sock.h
index 03a3625..3421809 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -138,11 +138,14 @@ void SOCK_DEBUG(const struct sock *sk, const char *msg, ...)
  *	@skc_node: main hash linkage for various protocol lookup tables
  *	@skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
  *	@skc_tx_queue_mapping: tx queue number for this connection
+ *	@skc_rx_queue_mapping: rx queue number for this connection
+ *	@skc_rx_ifindex: rx ifindex for this connection
  *	@skc_flags: place holder for sk_flags
  *		%SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
  *		%SO_OOBINLINE settings, %SO_TIMESTAMPING settings
  *	@skc_incoming_cpu: record/match cpu processing incoming packets
  *	@skc_refcnt: reference count
+ *	@skc_symmetric_queues: symmetric tx/rx queues
  *
  *	This is the minimal network layer representation of sockets, the header
  *	for struct sock and struct inet_timewait_sock.
@@ -177,6 +180,7 @@ struct sock_common {
 	unsigned char		skc_reuseport:1;
 	unsigned char		skc_ipv6only:1;
 	unsigned char		skc_net_refcnt:1;
+	unsigned char		skc_symmetric_queues:1;
 	int			skc_bound_dev_if;
 	union {
 		struct hlist_node	skc_bind_node;
@@ -214,6 +218,8 @@ struct sock_common {
 		struct hlist_nulls_node skc_nulls_node;
 	};
 	int			skc_tx_queue_mapping;
+	int			skc_rx_queue_mapping;
+	int			skc_rx_ifindex;
 	union {
 		int		skc_incoming_cpu;
 		u32		skc_rcv_wnd;
@@ -324,6 +330,8 @@ struct sock {
 #define sk_nulls_node		__sk_common.skc_nulls_node
 #define sk_refcnt		__sk_common.skc_refcnt
 #define sk_tx_queue_mapping	__sk_common.skc_tx_queue_mapping
+#define sk_rx_queue_mapping	__sk_common.skc_rx_queue_mapping
+#define sk_rx_ifindex		__sk_common.skc_rx_ifindex

 #define sk_dontcopy_begin	__sk_common.skc_dontcopy_begin
 #define sk_dontcopy_end		__sk_common.skc_dontcopy_end
@@ -340,6 +348,7 @@ struct sock {
 #define sk_reuseport		__sk_common.skc_reuseport
 #define sk_ipv6only		__sk_common.skc_ipv6only
 #define sk_net_refcnt		__sk_common.skc_net_refcnt
+#define sk_symmetric_queues	__sk_common.skc_symmetric_queues
 #define sk_bound_dev_if		__sk_common.skc_bound_dev_if
 #define sk_bind_node		__sk_common.skc_bind_node
 #define sk_prot			__sk_common.skc_prot
@@ -1676,6 +1685,14 @@ static inline int sk_tx_queue_get(const struct sock *sk)
 	return sk ? sk->sk_tx_queue_mapping : -1;
 }

+static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff *skb)
+{
+	if (sk->sk_symmetric_queues) {
+		sk->sk_rx_ifindex = skb->skb_iif;
+		sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);
+	}
+}
+
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
 	sk_tx_queue_clear(sk);
diff --git a/include/uapi/asm-generic/so