Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-20 Thread Samudrala, Sridhar



On 9/20/2017 7:18 AM, Tom Herbert wrote:

On Tue, Sep 19, 2017 at 10:13 PM, Eric Dumazet  wrote:

On Tue, 2017-09-19 at 21:59 -0700, Samudrala, Sridhar wrote:

On 9/19/2017 5:48 PM, Tom Herbert wrote:

On Tue, Sep 19, 2017 at 5:34 PM, Samudrala, Sridhar
 wrote:

On 9/12/2017 3:53 PM, Tom Herbert wrote:

On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar
 wrote:

On 9/12/2017 8:47 AM, Eric Dumazet wrote:

On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:

On 9/11/2017 8:53 PM, Eric Dumazet wrote:

On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:


Two ints in sock_common for this purpose is quite expensive and the
use case for this is limited-- even if a RX->TX queue mapping were
introduced to eliminate the queue pair assumption this still won't
help if the receive and transmit interfaces are different for the
connection. I think we really need to see some very compelling
results
to be able to justify this.

Will try to collect and post some perf data with symmetric queue
configuration.

Here is some performance data I collected with a memcached workload over
an ixgbe 10Gb NIC, using the mcblaster benchmark.
ixgbe is configured with 16 queues, and rx-usecs is set to 1000 for a very
low interrupt rate:
   ethtool -L p1p1 combined 16
   ethtool -C p1p1 rx-usecs 1000
and busy poll is set to 1000 usecs:
   sysctl net.core.busy_poll=1000

16 threads @ 800K requests/sec
==============================
                   rtt (min/avg/max) usecs   intr/sec   contextswitch/sec
-------------------------------------------------------------------------
Default            2/182/10641               23391      61163
Symmetric Queues   2/50/6311                 20457      32843

32 threads @ 800K requests/sec
==============================
                   rtt (min/avg/max) usecs   intr/sec   contextswitch/sec
-------------------------------------------------------------------------
Default            2/162/6390                32168      69450
Symmetric Queues   2/50/3853                 35044      35847


No idea what the "Default" configuration is. Please report how xps_cpus is
being set, how many RSS queues there are, and what the mapping is
between RSS queues and CPUs and shared caches. Also, whether the
threads are pinned.

Default is Linux 4.13 with the settings I listed above:
 ethtool -L p1p1 combined 16
 ethtool -C p1p1 rx-usecs 1000
 sysctl net.core.busy_poll=1000

# ethtool -x p1p1
RX flow hash indirection table for p1p1 with 16 RX ring(s):
  0:    0  1  2  3  4  5  6  7
  8:    8  9 10 11 12 13 14 15
 16:    0  1  2  3  4  5  6  7
 24:    8  9 10 11 12 13 14 15
 32:    0  1  2  3  4  5  6  7
 40:    8  9 10 11 12 13 14 15
 48:    0  1  2  3  4  5  6  7
 56:    8  9 10 11 12 13 14 15
 64:    0  1  2  3  4  5  6  7
 72:    8  9 10 11 12 13 14 15
 80:    0  1  2  3  4  5  6  7
 88:    8  9 10 11 12 13 14 15
 96:    0  1  2  3  4  5  6  7
104:    8  9 10 11 12 13 14 15
112:    0  1  2  3  4  5  6  7
120:    8  9 10 11 12 13 14 15

smp_affinity for the 16 queuepairs
 141 p1p1-TxRx-0 ,0001
 142 p1p1-TxRx-1 ,0002
 143 p1p1-TxRx-2 ,0004
 144 p1p1-TxRx-3 ,0008
 145 p1p1-TxRx-4 ,0010
 146 p1p1-TxRx-5 ,0020
 147 p1p1-TxRx-6 ,0040
 148 p1p1-TxRx-7 ,0080
 149 p1p1-TxRx-8 ,0100
 150 p1p1-TxRx-9 ,0200
 151 p1p1-TxRx-10 ,0400
 152 p1p1-TxRx-11 ,0800
 153 p1p1-TxRx-12 ,1000
 154 p1p1-TxRx-13 ,2000
 155 p1p1-TxRx-14 ,4000
 156 p1p1-TxRx-15 ,8000
xps_cpus for the 16 Tx queues
 ,0001
 ,0002
 ,0004
 ,0008
 ,0010
 ,0020
 ,0040
 ,0080
 ,0100
 ,0200
 ,0400
 ,0800
 ,1000
 ,2000
 ,4000
 ,8000
memcached threads are not pinned.


...

I urge you to take the time to properly tune this host.

The Linux kernel does not do automagic configuration. This is user policy.

Documentation/networking/scaling.txt has everything you need.


Yes, tuning a system for optimal performance is difficult. Even if you
find a performance benefit for a configuration on one system, that
might not translate to another. In other words, if you've produced
some code that seems to perform better than the previous implementation on
a test machine it's n

Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-20 Thread Hannes Frederic Sowa
Sridhar Samudrala  writes:

> This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be used
> to enable symmetric tx and rx queues on a socket.
>
> This option is specifically useful for epoll based multi-threaded workloads
> where each thread handles packets received on a single RX queue. In this model,
> we have noticed that it helps to send the packets on the same TX queue
> corresponding to the queue-pair associated with the RX queue, specifically when
> busy poll is enabled with epoll().
>
> Two new fields are added to struct sock_common to cache the last rx ifindex and
> the rx queue in the receive path of an SKB. __netdev_pick_tx() returns the
> cached rx queue when this option is enabled and the TX is happening on the same
> device.

Would it help to make the rx and tx skb hashes symmetric
(skb_get_hash_symmetric) on request?


Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-20 Thread Tom Herbert
On Tue, Sep 19, 2017 at 10:13 PM, Eric Dumazet  wrote:
> On Tue, 2017-09-19 at 21:59 -0700, Samudrala, Sridhar wrote:
>> On 9/19/2017 5:48 PM, Tom Herbert wrote:
>> > On Tue, Sep 19, 2017 at 5:34 PM, Samudrala, Sridhar
>> >  wrote:
>> > > On 9/12/2017 3:53 PM, Tom Herbert wrote:
>> > > > On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar
>> > > >  wrote:
>> > > > >
>> > > > > On 9/12/2017 8:47 AM, Eric Dumazet wrote:
>> > > > > > On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
>> > > > > > > On 9/11/2017 8:53 PM, Eric Dumazet wrote:
>> > > > > > > > On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:
>> > > > > > > >
>> > > > > > > > > Two ints in sock_common for this purpose is quite expensive 
>> > > > > > > > > and the
>> > > > > > > > > use case for this is limited-- even if a RX->TX queue 
>> > > > > > > > > mapping were
>> > > > > > > > > introduced to eliminate the queue pair assumption this still 
>> > > > > > > > > won't
>> > > > > > > > > help if the receive and transmit interfaces are different 
>> > > > > > > > > for the
>> > > > > > > > > connection. I think we really need to see some very 
>> > > > > > > > > compelling
>> > > > > > > > > results
>> > > > > > > > > to be able to justify this.
>> > > > > > > Will try to collect and post some perf data with symmetric queue
>> > > > > > > configuration.
>> > >
>> > > Here is some performance data I collected with a memcached workload over
>> > > an ixgbe 10Gb NIC, using the mcblaster benchmark.
>> > > ixgbe is configured with 16 queues, and rx-usecs is set to 1000 for a very
>> > > low interrupt rate:
>> > >   ethtool -L p1p1 combined 16
>> > >   ethtool -C p1p1 rx-usecs 1000
>> > > and busy poll is set to 1000 usecs:
>> > >   sysctl net.core.busy_poll=1000
>> > >
>> > > 16 threads @ 800K requests/sec
>> > > ==============================
>> > >                    rtt (min/avg/max) usecs   intr/sec   contextswitch/sec
>> > > -------------------------------------------------------------------------
>> > > Default            2/182/10641               23391      61163
>> > > Symmetric Queues   2/50/6311                 20457      32843
>> > >
>> > > 32 threads @ 800K requests/sec
>> > > ==============================
>> > >                    rtt (min/avg/max) usecs   intr/sec   contextswitch/sec
>> > > -------------------------------------------------------------------------
>> > > Default            2/162/6390                32168      69450
>> > > Symmetric Queues   2/50/3853                 35044      35847
>> > >
>> > No idea what the "Default" configuration is. Please report how xps_cpus is
>> > being set, how many RSS queues there are, and what the mapping is
>> > between RSS queues and CPUs and shared caches. Also, whether the
>> > threads are pinned.
>> Default is Linux 4.13 with the settings I listed above:
>> ethtool -L p1p1 combined 16
>> ethtool -C p1p1 rx-usecs 1000
>> sysctl net.core.busy_poll=1000
>>
>> # ethtool -x p1p1
>> RX flow hash indirection table for p1p1 with 16 RX ring(s):
>>   0:    0  1  2  3  4  5  6  7
>>   8:    8  9 10 11 12 13 14 15
>>  16:    0  1  2  3  4  5  6  7
>>  24:    8  9 10 11 12 13 14 15
>>  32:    0  1  2  3  4  5  6  7
>>  40:    8  9 10 11 12 13 14 15
>>  48:    0  1  2  3  4  5  6  7
>>  56:    8  9 10 11 12 13 14 15
>>  64:    0  1  2  3  4  5  6  7
>>  72:    8  9 10 11 12 13 14 15
>>  80:    0  1  2  3  4  5  6  7
>>  88:    8  9 10 11 12 13 14 15
>>  96:    0  1  2  3  4  5  6  7
>> 104:    8  9 10 11 12 13 14 15
>> 112:    0  1  2  3  4  5  6  7
>> 120:    8  9 10 11 12 13 14 15
>>
>> smp_affinity for the 16 queuepairs
>> 141 p1p1-TxRx-0 ,0001
>> 142 p1p1-TxRx-1 ,0002
>> 143 p1p1-TxRx-2 ,0004
>> 144 p1p1-TxRx-3 ,0008
>> 145 p1p1-TxRx-4 ,0010
>> 146 p1p1-TxRx-5 ,0020
>> 147 p1p1-TxRx-6 ,0040
>> 148 p1p1-TxRx-7 ,0080
>> 149 p1p1-TxRx-8 ,0100
>> 150 p1p1-TxRx-9 ,0200
>> 151 p1p1-TxRx-10 ,0400
>> 152 p1p1-TxRx-11 ,0800
>> 153 p1p1-TxRx-12 ,1000
>> 154 p1p1-TxRx-13 ,2000
>> 155 p1p1-TxRx-14 ,4000
>> 156 p1p1-TxRx-15 ,8000
>> xps_cpus for the 16 Tx queues
>> ,0001
>> ,0002
>> ,0004
>> ,0008
>> ,0010
>> ,0020
>> ,0040
>> ,0080
>> ,0100
>> ,0200
>> ,0400
>> ,0

Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-19 Thread Eric Dumazet
On Tue, 2017-09-19 at 21:59 -0700, Samudrala, Sridhar wrote:
> On 9/19/2017 5:48 PM, Tom Herbert wrote:
> > On Tue, Sep 19, 2017 at 5:34 PM, Samudrala, Sridhar
> >  wrote:
> > > On 9/12/2017 3:53 PM, Tom Herbert wrote:
> > > > On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar
> > > >  wrote:
> > > > > 
> > > > > On 9/12/2017 8:47 AM, Eric Dumazet wrote:
> > > > > > On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
> > > > > > > On 9/11/2017 8:53 PM, Eric Dumazet wrote:
> > > > > > > > On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:
> > > > > > > > 
> > > > > > > > > Two ints in sock_common for this purpose is quite expensive 
> > > > > > > > > and the
> > > > > > > > > use case for this is limited-- even if a RX->TX queue mapping 
> > > > > > > > > were
> > > > > > > > > introduced to eliminate the queue pair assumption this still 
> > > > > > > > > won't
> > > > > > > > > help if the receive and transmit interfaces are different for 
> > > > > > > > > the
> > > > > > > > > connection. I think we really need to see some very compelling
> > > > > > > > > results
> > > > > > > > > to be able to justify this.
> > > > > > > Will try to collect and post some perf data with symmetric queue
> > > > > > > configuration.
> > > 
> > > Here is some performance data I collected with a memcached workload over
> > > an ixgbe 10Gb NIC, using the mcblaster benchmark.
> > > ixgbe is configured with 16 queues, and rx-usecs is set to 1000 for a very
> > > low interrupt rate:
> > >   ethtool -L p1p1 combined 16
> > >   ethtool -C p1p1 rx-usecs 1000
> > > and busy poll is set to 1000 usecs:
> > >   sysctl net.core.busy_poll=1000
> > > 
> > > 16 threads @ 800K requests/sec
> > > ==============================
> > >                    rtt (min/avg/max) usecs   intr/sec   contextswitch/sec
> > > -------------------------------------------------------------------------
> > > Default            2/182/10641               23391      61163
> > > Symmetric Queues   2/50/6311                 20457      32843
> > >
> > > 32 threads @ 800K requests/sec
> > > ==============================
> > >                    rtt (min/avg/max) usecs   intr/sec   contextswitch/sec
> > > -------------------------------------------------------------------------
> > > Default            2/162/6390                32168      69450
> > > Symmetric Queues   2/50/3853                 35044      35847
> > > 
> > No idea what the "Default" configuration is. Please report how xps_cpus is
> > being set, how many RSS queues there are, and what the mapping is
> > between RSS queues and CPUs and shared caches. Also, whether the
> > threads are pinned.
> Default is Linux 4.13 with the settings I listed above:
> ethtool -L p1p1 combined 16
> ethtool -C p1p1 rx-usecs 1000
> sysctl net.core.busy_poll=1000
> 
> # ethtool -x p1p1
> RX flow hash indirection table for p1p1 with 16 RX ring(s):
>   0:    0  1  2  3  4  5  6  7
>   8:    8  9 10 11 12 13 14 15
>  16:    0  1  2  3  4  5  6  7
>  24:    8  9 10 11 12 13 14 15
>  32:    0  1  2  3  4  5  6  7
>  40:    8  9 10 11 12 13 14 15
>  48:    0  1  2  3  4  5  6  7
>  56:    8  9 10 11 12 13 14 15
>  64:    0  1  2  3  4  5  6  7
>  72:    8  9 10 11 12 13 14 15
>  80:    0  1  2  3  4  5  6  7
>  88:    8  9 10 11 12 13 14 15
>  96:    0  1  2  3  4  5  6  7
> 104:    8  9 10 11 12 13 14 15
> 112:    0  1  2  3  4  5  6  7
> 120:    8  9 10 11 12 13 14 15
> 
> smp_affinity for the 16 queuepairs
> 141 p1p1-TxRx-0 ,0001
> 142 p1p1-TxRx-1 ,0002
> 143 p1p1-TxRx-2 ,0004
> 144 p1p1-TxRx-3 ,0008
> 145 p1p1-TxRx-4 ,0010
> 146 p1p1-TxRx-5 ,0020
> 147 p1p1-TxRx-6 ,0040
> 148 p1p1-TxRx-7 ,0080
> 149 p1p1-TxRx-8 ,0100
> 150 p1p1-TxRx-9 ,0200
> 151 p1p1-TxRx-10 ,0400
> 152 p1p1-TxRx-11 ,0800
> 153 p1p1-TxRx-12 ,1000
> 154 p1p1-TxRx-13 ,2000
> 155 p1p1-TxRx-14 ,4000
> 156 p1p1-TxRx-15 ,8000
> xps_cpus for the 16 Tx queues
> ,0001
> ,0002
> ,0004
> ,0008
> ,0010
> ,0020
> ,0040
> ,0080
> ,0100
> ,0200
> ,0400
> ,0800
> ,1000
> ,2000
> ,4000
> ,8000
> memcached threads are not pinned.
> 

...

I urge you to take the time 

Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-19 Thread Tom Herbert
On Tue, Sep 19, 2017 at 5:34 PM, Samudrala, Sridhar
 wrote:
> On 9/12/2017 3:53 PM, Tom Herbert wrote:
>>
>> On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar
>>  wrote:
>>>
>>>
>>> On 9/12/2017 8:47 AM, Eric Dumazet wrote:

 On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
>
> On 9/11/2017 8:53 PM, Eric Dumazet wrote:
>>
>> On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:
>>
>>> Two ints in sock_common for this purpose is quite expensive and the
>>> use case for this is limited-- even if a RX->TX queue mapping were
>>> introduced to eliminate the queue pair assumption this still won't
>>> help if the receive and transmit interfaces are different for the
>>> connection. I think we really need to see some very compelling
>>> results
>>> to be able to justify this.
>
> Will try to collect and post some perf data with symmetric queue
> configuration.
>
>
> Here is some performance data I collected with a memcached workload over
> an ixgbe 10Gb NIC, using the mcblaster benchmark.
> ixgbe is configured with 16 queues, and rx-usecs is set to 1000 for a very
> low interrupt rate:
>   ethtool -L p1p1 combined 16
>   ethtool -C p1p1 rx-usecs 1000
> and busy poll is set to 1000 usecs:
>   sysctl net.core.busy_poll=1000
>
> 16 threads @ 800K requests/sec
> ==============================
>                    rtt (min/avg/max) usecs   intr/sec   contextswitch/sec
> -------------------------------------------------------------------------
> Default            2/182/10641               23391      61163
> Symmetric Queues   2/50/6311                 20457      32843
>
> 32 threads @ 800K requests/sec
> ==============================
>                    rtt (min/avg/max) usecs   intr/sec   contextswitch/sec
> -------------------------------------------------------------------------
> Default            2/162/6390                32168      69450
> Symmetric Queues   2/50/3853                 35044      35847
>
No idea what the "Default" configuration is. Please report how xps_cpus is
being set, how many RSS queues there are, and what the mapping is
between RSS queues and CPUs and shared caches. Also, whether the
threads are pinned.

Thanks,
Tom


Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-19 Thread Samudrala, Sridhar

On 9/12/2017 3:53 PM, Tom Herbert wrote:

On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar
 wrote:


On 9/12/2017 8:47 AM, Eric Dumazet wrote:

On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:

On 9/11/2017 8:53 PM, Eric Dumazet wrote:

On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:


Two ints in sock_common for this purpose is quite expensive and the
use case for this is limited-- even if a RX->TX queue mapping were
introduced to eliminate the queue pair assumption this still won't
help if the receive and transmit interfaces are different for the
connection. I think we really need to see some very compelling results
to be able to justify this.

Will try to collect and post some perf data with symmetric queue
configuration.


Here is some performance data I collected with a memcached workload over
an ixgbe 10Gb NIC, using the mcblaster benchmark.
ixgbe is configured with 16 queues, and rx-usecs is set to 1000 for a very
low interrupt rate:
  ethtool -L p1p1 combined 16
  ethtool -C p1p1 rx-usecs 1000
and busy poll is set to 1000 usecs:
  sysctl net.core.busy_poll=1000

16 threads @ 800K requests/sec
==============================
                   rtt (min/avg/max) usecs   intr/sec   contextswitch/sec
-------------------------------------------------------------------------
Default            2/182/10641               23391      61163
Symmetric Queues   2/50/6311                 20457      32843

32 threads @ 800K requests/sec
==============================
                   rtt (min/avg/max) usecs   intr/sec   contextswitch/sec
-------------------------------------------------------------------------
Default            2/162/6390                32168      69450
Symmetric Queues   2/50/3853                 35044      35847




Yes, this is unreasonable cost.

XPS should really cover the case already.


Eric,

Can you clarify how XPS covers the RX->TX queue mapping case?
Is it possible to configure XPS to select the TX queue based on the RX queue
of a flow?
IIUC, it is based on the CPU of the thread doing the transmit, or based
on the skb->priority to TC mapping.
It may be possible to get this effect if the threads are pinned to a
core, but if the app threads are freely moving, I am not sure how XPS
can be configured to select the TX queue based on the RX queue of a flow.

If the application is freely moving, how can the NIC properly select the RX
queue so that packets arrive on the appropriate queue?

The RX queue is selected via RSS and we don't want to move the flow based on
where the thread is running.

Unless flow director is enabled on the Intel device... This was, I
believe, one of the first attempts to introduce a queue-pair notion to
general purpose NICs. The idea was that the device records the TX
queue for a flow and then uses that to determine the receive queue in a
symmetric fashion. aRFS is similar, but how the mapping is done is under
SW control. As Eric mentioned, there are scalability issues with
these mechanisms, but we also found that flow director can easily
reorder packets whenever the thread moves.


You must be referring to the ATR (Application Targeted Routing) feature
on Intel NICs, where a flow director entry is added for a flow based on
the TX queue used for that flow. Instead, we would like to select the TX
queue based on the RX queue of a flow.






This is called aRFS, and it does not scale to millions of flows.
We tried in the past, and this went nowhere really, since the setup cost
is prohibitive and DDOS vulnerable.

XPS will follow the thread, since selection is done on current cpu.

The problem is RX side. If application is free to migrate, then special
support (aRFS) is needed from the hardware.

This may be true if most of the rx processing is happening in the interrupt
context. But with busy polling, I think we don't need aRFS, as a thread
should be able to poll any queue irrespective of where it is running.

It's not just a problem with interrupt processing; in general we like
to have all receive processing and the subsequent transmit of a reply
done on one CPU. Silo'ing is good for performance and parallelism.
This can sometimes be relaxed in situations where CPUs share a cache,
so crossing CPUs is not costly.


Yes. We would like to get this behavior even without binding the app 
thread to a CPU.







At least for passive connections, we already have all the support in the
kernel so that you can have one thread per NIC queue, dealing with
sockets that have incoming packets all received on one NIC RX queue.
(And of course all TX packets will use the symmetric TX queue)

SO_REUSEPORT plus appropriate BPF filter can achieve that.

Say you have 32 queues, 32 cpus.

Simply use 32 listeners, 32 threads (or 32 pools of threads)

Yes. This will work if each thread is pinned to a core associated with the
RX interrupt.
It may not be possible to pin the threads to a core.
Instead we want to associate a thread with a queue and do all the RX and TX
completion of a queue in the same thread context via busy polling.

Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-12 Thread Tom Herbert
On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar
 wrote:
>
>
> On 9/12/2017 8:47 AM, Eric Dumazet wrote:
>>
>> On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
>>>
>>> On 9/11/2017 8:53 PM, Eric Dumazet wrote:

 On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:

> Two ints in sock_common for this purpose is quite expensive and the
> use case for this is limited-- even if a RX->TX queue mapping were
> introduced to eliminate the queue pair assumption this still won't
> help if the receive and transmit interfaces are different for the
> connection. I think we really need to see some very compelling results
> to be able to justify this.
>>>
>>> Will try to collect and post some perf data with symmetric queue
>>> configuration.
>>>
 Yes, this is unreasonable cost.

 XPS should really cover the case already.

>>>
>>> Eric,
>>>
>>> Can you clarify how XPS covers the RX->TX queue mapping case?
>>> Is it possible to configure XPS to select the TX queue based on the RX queue
>>> of a flow?
>>> IIUC, it is based on the CPU of the thread doing the transmit, or based
>>> on the skb->priority to TC mapping.
>>> It may be possible to get this effect if the threads are pinned to a
>>> core, but if the app threads are freely moving, I am not sure how XPS
>>> can be configured to select the TX queue based on the RX queue of a flow.
>>
>> If the application is freely moving, how can the NIC properly select the RX
>> queue so that packets arrive on the appropriate queue?
>
> The RX queue is selected via RSS and we don't want to move the flow based on
> where the thread is running.

Unless flow director is enabled on the Intel device... This was, I
believe, one of the first attempts to introduce a queue-pair notion to
general purpose NICs. The idea was that the device records the TX
queue for a flow and then uses that to determine the receive queue in a
symmetric fashion. aRFS is similar, but how the mapping is done is under
SW control. As Eric mentioned, there are scalability issues with
these mechanisms, but we also found that flow director can easily
reorder packets whenever the thread moves.

>>
>>
>> This is called aRFS, and it does not scale to millions of flows.
>> We tried in the past, and this went nowhere really, since the setup cost
>> is prohibitive and DDOS vulnerable.
>>
>> XPS will follow the thread, since selection is done on current cpu.
>>
>> The problem is RX side. If application is free to migrate, then special
>> support (aRFS) is needed from the hardware.
>
> This may be true if most of the rx processing is happening in the interrupt
> context.
> But with busy polling, I think we don't need aRFS, as a thread should be
> able to poll any queue irrespective of where it is running.

It's not just a problem with interrupt processing; in general we like
to have all receive processing and the subsequent transmit of a reply
done on one CPU. Silo'ing is good for performance and parallelism.
This can sometimes be relaxed in situations where CPUs share a cache,
so crossing CPUs is not costly.

>>
>>
>> At least for passive connections, we already have all the support in the
>> kernel so that you can have one thread per NIC queue, dealing with
>> sockets that have incoming packets all received on one NIC RX queue.
>> (And of course all TX packets will use the symmetric TX queue)
>>
>> SO_REUSEPORT plus appropriate BPF filter can achieve that.
>>
>> Say you have 32 queues, 32 cpus.
>>
>> Simply use 32 listeners, 32 threads (or 32 pools of threads)
>
> Yes. This will work if each thread is pinned to a core associated with the
> RX interrupt.
> It may not be possible to pin the threads to a core.
> Instead we want to associate a thread to a queue and do all the RX and TX
> completion
> of a queue in the same thread context via busy polling.
>
When that happens it's possible for RX to be done on the completely
wrong CPU, which we know is suboptimal. However, this shouldn't
negatively affect the TX side, since XPS will just use the queue
appropriate for the running CPU. Like Eric said, this is really a receive
problem more than a transmit problem. Keeping them as independent
paths seems to be a good approach.

Tom


Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-12 Thread Samudrala, Sridhar



On 9/12/2017 8:47 AM, Eric Dumazet wrote:

On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:

On 9/11/2017 8:53 PM, Eric Dumazet wrote:

On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:


Two ints in sock_common for this purpose is quite expensive and the
use case for this is limited-- even if a RX->TX queue mapping were
introduced to eliminate the queue pair assumption this still won't
help if the receive and transmit interfaces are different for the
connection. I think we really need to see some very compelling results
to be able to justify this.

Will try to collect and post some perf data with symmetric queue
configuration.


Yes, this is unreasonable cost.

XPS should really cover the case already.
   

Eric,

Can you clarify how XPS covers the RX->TX queue mapping case?
Is it possible to configure XPS to select the TX queue based on the RX queue
of a flow?
IIUC, it is based on the CPU of the thread doing the transmit, or based
on the skb->priority to TC mapping.
It may be possible to get this effect if the threads are pinned to a
core, but if the app threads are freely moving, I am not sure how XPS
can be configured to select the TX queue based on the RX queue of a flow.

If the application is freely moving, how can the NIC properly select the RX
queue so that packets arrive on the appropriate queue?

The RX queue is selected via RSS and we don't want to move the flow based on
where the thread is running.


This is called aRFS, and it does not scale to millions of flows.
We tried in the past, and this went nowhere really, since the setup cost
is prohibitive and DDOS vulnerable.

XPS will follow the thread, since selection is done on current cpu.

The problem is RX side. If application is free to migrate, then special
support (aRFS) is needed from the hardware.
This may be true if most of the rx processing is happening in the
interrupt context.
But with busy polling, I think we don't need aRFS, as a thread should be
able to poll any queue irrespective of where it is running.


At least for passive connections, we already have all the support in the
kernel so that you can have one thread per NIC queue, dealing with
sockets that have incoming packets all received on one NIC RX queue.
(And of course all TX packets will use the symmetric TX queue)

SO_REUSEPORT plus appropriate BPF filter can achieve that.

Say you have 32 queues, 32 cpus.

Simply use 32 listeners, 32 threads (or 32 pools of threads)
Yes. This will work if each thread is pinned to a core associated with
the RX interrupt.
It may not be possible to pin the threads to a core.
Instead we want to associate a thread with a queue and do all the RX and
TX completion of a queue in the same thread context via busy polling.

Thanks
Sridhar





Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-12 Thread Eric Dumazet
On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
> 
> On 9/11/2017 8:53 PM, Eric Dumazet wrote:
> > On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:
> >
> >> Two ints in sock_common for this purpose is quite expensive and the
> >> use case for this is limited-- even if a RX->TX queue mapping were
> >> introduced to eliminate the queue pair assumption this still won't
> >> help if the receive and transmit interfaces are different for the
> >> connection. I think we really need to see some very compelling results
> >> to be able to justify this.
> Will try to collect and post some perf data with symmetric queue 
> configuration.
> 
> > Yes, this is unreasonable cost.
> >
> > XPS should really cover the case already.
> >   
> Eric,
> 
> Can you clarify how XPS covers the RX->TX queue mapping case?
> Is it possible to configure XPS to select the TX queue based on the RX queue
> of a flow?
> IIUC, it is based on the CPU of the thread doing the transmit, or based
> on the skb->priority to TC mapping.
> It may be possible to get this effect if the threads are pinned to a
> core, but if the app threads are freely moving, I am not sure how XPS
> can be configured to select the TX queue based on the RX queue of a flow.

If the application is freely moving, how can the NIC properly select the RX
queue so that packets arrive on the appropriate queue?

This is called aRFS, and it does not scale to millions of flows.
We tried in the past, and this went nowhere really, since the setup cost
is prohibitive and DDOS vulnerable.

XPS will follow the thread, since selection is done on current cpu.

The problem is RX side. If application is free to migrate, then special
support (aRFS) is needed from the hardware.

At least for passive connections, we already have all the support in the
kernel so that you can have one thread per NIC queue, dealing with
sockets that have incoming packets all received on one NIC RX queue.
(And of course all TX packets will use the symmetric TX queue)

SO_REUSEPORT plus appropriate BPF filter can achieve that.

Say you have 32 queues, 32 cpus.

Simply use 32 listeners, 32 threads (or 32 pools of threads)




Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-11 Thread Samudrala, Sridhar



On 9/11/2017 8:53 PM, Eric Dumazet wrote:

On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:


Two ints in sock_common for this purpose is quite expensive and the
use case for this is limited-- even if a RX->TX queue mapping were
introduced to eliminate the queue pair assumption this still won't
help if the receive and transmit interfaces are different for the
connection. I think we really need to see some very compelling results
to be able to justify this.
Will try to collect and post some perf data with symmetric queue 
configuration.



Yes, this is unreasonable cost.

XPS should really cover the case already.
  

Eric,

Can you clarify how XPS covers the RX->TX queue mapping case?
Is it possible to configure XPS to select the TX queue based on the RX queue
of a flow?
IIUC, it is based on the CPU of the thread doing the transmit, or based
on the skb->priority to TC mapping.
It may be possible to get this effect if the threads are pinned to a
core, but if the app threads are freely moving, I am not sure how XPS
can be configured to select the TX queue based on the RX queue of a flow.


Thanks
Sridhar


Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-11 Thread Eric Dumazet
On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:

> Two ints in sock_common for this purpose is quite expensive and the
> use case for this is limited-- even if a RX->TX queue mapping were
> introduced to eliminate the queue pair assumption this still won't
> help if the receive and transmit interfaces are different for the
> connection. I think we really need to see some very compelling results
> to be able to justify this.
> 

Yes, this is unreasonable cost.

XPS should really cover the case already.
 



Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-11 Thread Tom Herbert
On Thu, Aug 31, 2017 at 4:27 PM, Sridhar Samudrala
 wrote:
> This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be used
> to enable symmetric tx and rx queues on a socket.
>
> This option is specifically useful for epoll based multi-threaded workloads
> where each thread handles packets received on a single RX queue. In this model,
> we have noticed that it helps to send the packets on the same TX queue
> corresponding to the queue-pair associated with the RX queue, specifically when
> busy poll is enabled with epoll().
>
> Two new fields are added to struct sock_common to cache the last rx ifindex and
> the rx queue in the receive path of an SKB. __netdev_pick_tx() returns the
> cached rx queue when this option is enabled and the TX is happening on the same
> device.
>
> Signed-off-by: Sridhar Samudrala 
> ---
>  include/net/request_sock.h        |  1 +
>  include/net/sock.h                | 17 +
>  include/uapi/asm-generic/socket.h |  2 ++
>  net/core/dev.c                    |  8 +++-
>  net/core/sock.c                   | 10 ++
>  net/ipv4/tcp_input.c              |  1 +
>  net/ipv4/tcp_ipv4.c               |  1 +
>  net/ipv4/tcp_minisocks.c          |  1 +
>  8 files changed, 40 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/request_sock.h b/include/net/request_sock.h
> index 23e2205..c3bc12e 100644
> --- a/include/net/request_sock.h
> +++ b/include/net/request_sock.h
> @@ -100,6 +100,7 @@ static inline struct sock *req_to_sk(struct request_sock *req)
>         req_to_sk(req)->sk_prot = sk_listener->sk_prot;
>         sk_node_init(&req_to_sk(req)->sk_node);
>         sk_tx_queue_clear(req_to_sk(req));
> +       req_to_sk(req)->sk_symmetric_queues = sk_listener->sk_symmetric_queues;
>         req->saved_syn = NULL;
>         refcount_set(&req->rsk_refcnt, 0);
>
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 03a3625..3421809 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -138,11 +138,14 @@ void SOCK_DEBUG(const struct sock *sk, const char *msg, 
> ...)
>   * @skc_node: main hash linkage for various protocol lookup tables
>   * @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
>   * @skc_tx_queue_mapping: tx queue number for this connection
> + * @skc_rx_queue_mapping: rx queue number for this connection
> + * @skc_rx_ifindex: rx ifindex for this connection
>   * @skc_flags: place holder for sk_flags
>   * %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
>   * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
>   * @skc_incoming_cpu: record/match cpu processing incoming packets
>   * @skc_refcnt: reference count
> + * @skc_symmetric_queues: symmetric tx/rx queues
>   *
>   * This is the minimal network layer representation of sockets, the 
> header
>   * for struct sock and struct inet_timewait_sock.
> @@ -177,6 +180,7 @@ struct sock_common {
> unsigned char   skc_reuseport:1;
> unsigned char   skc_ipv6only:1;
> unsigned char   skc_net_refcnt:1;
> +   unsigned char   skc_symmetric_queues:1;
> int skc_bound_dev_if;
> union {
> struct hlist_node   skc_bind_node;
> @@ -214,6 +218,8 @@ struct sock_common {
> struct hlist_nulls_node skc_nulls_node;
> };
> int skc_tx_queue_mapping;
> +   int skc_rx_queue_mapping;
> +   int skc_rx_ifindex;

Two ints in sock_common for this purpose is quite expensive and the
use case for this is limited-- even if a RX->TX queue mapping were
introduced to eliminate the queue pair assumption this still won't
help if the receive and transmit interfaces are different for the
connection. I think we really need to see some very compelling results
to be able to justify this.

Thanks,
Tom


Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-11 Thread Alexander Duyck
On Mon, Sep 11, 2017 at 3:07 PM, Tom Herbert  wrote:
> On Mon, Sep 11, 2017 at 9:49 AM, Samudrala, Sridhar
>  wrote:
>> On 9/10/2017 8:19 AM, Tom Herbert wrote:
>>>
>>> On Thu, Aug 31, 2017 at 4:27 PM, Sridhar Samudrala
>>>  wrote:

 This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be
 used
 to enable symmetric tx and rx queues on a socket.

 This option is specifically useful for epoll based multi threaded
 workloads
 where each thread handles packets received on a single RX queue. In this
 model,
 we have noticed that it helps to send the packets on the same TX queue
 corresponding to the queue-pair associated with the RX queue specifically
 when
 busy poll is enabled with epoll().

 Two new fields are added to struct sock_common to cache the last rx
 ifindex and
 the rx queue in the receive path of an SKB. __netdev_pick_tx() returns
 the cached
 rx queue when this option is enabled and the TX is happening on the same
 device.

 Signed-off-by: Sridhar Samudrala 
 ---
   include/net/request_sock.h|  1 +
   include/net/sock.h| 17 +
   include/uapi/asm-generic/socket.h |  2 ++
   net/core/dev.c|  8 +++-
   net/core/sock.c   | 10 ++
   net/ipv4/tcp_input.c  |  1 +
   net/ipv4/tcp_ipv4.c   |  1 +
   net/ipv4/tcp_minisocks.c  |  1 +
   8 files changed, 40 insertions(+), 1 deletion(-)

 diff --git a/include/net/request_sock.h b/include/net/request_sock.h
 index 23e2205..c3bc12e 100644
 --- a/include/net/request_sock.h
 +++ b/include/net/request_sock.h
 @@ -100,6 +100,7 @@ static inline struct sock *req_to_sk(struct
 request_sock *req)
  req_to_sk(req)->sk_prot = sk_listener->sk_prot;
  sk_node_init(&req_to_sk(req)->sk_node);
  sk_tx_queue_clear(req_to_sk(req));
 +   req_to_sk(req)->sk_symmetric_queues =
 sk_listener->sk_symmetric_queues;
  req->saved_syn = NULL;
  refcount_set(&req->rsk_refcnt, 0);

 diff --git a/include/net/sock.h b/include/net/sock.h
 index 03a3625..3421809 100644
 --- a/include/net/sock.h
 +++ b/include/net/sock.h
 @@ -138,11 +138,14 @@ void SOCK_DEBUG(const struct sock *sk, const char
 *msg, ...)
* @skc_node: main hash linkage for various protocol lookup tables
* @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
* @skc_tx_queue_mapping: tx queue number for this connection
 + * @skc_rx_queue_mapping: rx queue number for this connection
 + * @skc_rx_ifindex: rx ifindex for this connection
* @skc_flags: place holder for sk_flags
* %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
* %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
* @skc_incoming_cpu: record/match cpu processing incoming packets
* @skc_refcnt: reference count
 + * @skc_symmetric_queues: symmetric tx/rx queues
*
* This is the minimal network layer representation of sockets, the
 header
* for struct sock and struct inet_timewait_sock.
 @@ -177,6 +180,7 @@ struct sock_common {
  unsigned char   skc_reuseport:1;
  unsigned char   skc_ipv6only:1;
  unsigned char   skc_net_refcnt:1;
 +   unsigned char   skc_symmetric_queues:1;
  int skc_bound_dev_if;
  union {
  struct hlist_node   skc_bind_node;
 @@ -214,6 +218,8 @@ struct sock_common {
  struct hlist_nulls_node skc_nulls_node;
  };
  int skc_tx_queue_mapping;
 +   int skc_rx_queue_mapping;
 +   int skc_rx_ifindex;
  union {
  int skc_incoming_cpu;
  u32 skc_rcv_wnd;
 @@ -324,6 +330,8 @@ struct sock {
   #define sk_nulls_node  __sk_common.skc_nulls_node
   #define sk_refcnt  __sk_common.skc_refcnt
   #define sk_tx_queue_mapping__sk_common.skc_tx_queue_mapping
 +#define sk_rx_queue_mapping__sk_common.skc_rx_queue_mapping
 +#define sk_rx_ifindex  __sk_common.skc_rx_ifindex

   #define sk_dontcopy_begin  __sk_common.skc_dontcopy_begin
   #define sk_dontcopy_end__sk_common.skc_dontcopy_end
 @@ -340,6 +348,7 @@ struct sock {
   #define sk_reuseport   __sk_common.skc_reuseport
   #define sk_ipv6only__sk_common.skc_ipv6only
   #define sk_net_refcnt  __sk_common.skc_net_refcnt
 +#define sk_symmetric_queues__sk_common.skc_symmetric_queues

Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-11 Thread Tom Herbert
On Mon, Sep 11, 2017 at 9:49 AM, Samudrala, Sridhar
 wrote:
> On 9/10/2017 8:19 AM, Tom Herbert wrote:
>>
>> On Thu, Aug 31, 2017 at 4:27 PM, Sridhar Samudrala
>>  wrote:
>>>
>>> This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be
>>> used
>>> to enable symmetric tx and rx queues on a socket.
>>>
>>> This option is specifically useful for epoll based multi threaded
>>> workloads
>>> where each thread handles packets received on a single RX queue. In this
>>> model,
>>> we have noticed that it helps to send the packets on the same TX queue
>>> corresponding to the queue-pair associated with the RX queue specifically
>>> when
>>> busy poll is enabled with epoll().
>>>
>>> Two new fields are added to struct sock_common to cache the last rx
>>> ifindex and
>>> the rx queue in the receive path of an SKB. __netdev_pick_tx() returns
>>> the cached
>>> rx queue when this option is enabled and the TX is happening on the same
>>> device.
>>>
>>> Signed-off-by: Sridhar Samudrala 
>>> ---
>>>   include/net/request_sock.h|  1 +
>>>   include/net/sock.h| 17 +
>>>   include/uapi/asm-generic/socket.h |  2 ++
>>>   net/core/dev.c|  8 +++-
>>>   net/core/sock.c   | 10 ++
>>>   net/ipv4/tcp_input.c  |  1 +
>>>   net/ipv4/tcp_ipv4.c   |  1 +
>>>   net/ipv4/tcp_minisocks.c  |  1 +
>>>   8 files changed, 40 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/net/request_sock.h b/include/net/request_sock.h
>>> index 23e2205..c3bc12e 100644
>>> --- a/include/net/request_sock.h
>>> +++ b/include/net/request_sock.h
>>> @@ -100,6 +100,7 @@ static inline struct sock *req_to_sk(struct
>>> request_sock *req)
>>>  req_to_sk(req)->sk_prot = sk_listener->sk_prot;
>>>  sk_node_init(&req_to_sk(req)->sk_node);
>>>  sk_tx_queue_clear(req_to_sk(req));
>>> +   req_to_sk(req)->sk_symmetric_queues =
>>> sk_listener->sk_symmetric_queues;
>>>  req->saved_syn = NULL;
>>>  refcount_set(&req->rsk_refcnt, 0);
>>>
>>> diff --git a/include/net/sock.h b/include/net/sock.h
>>> index 03a3625..3421809 100644
>>> --- a/include/net/sock.h
>>> +++ b/include/net/sock.h
>>> @@ -138,11 +138,14 @@ void SOCK_DEBUG(const struct sock *sk, const char
>>> *msg, ...)
>>>* @skc_node: main hash linkage for various protocol lookup tables
>>>* @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
>>>* @skc_tx_queue_mapping: tx queue number for this connection
>>> + * @skc_rx_queue_mapping: rx queue number for this connection
>>> + * @skc_rx_ifindex: rx ifindex for this connection
>>>* @skc_flags: place holder for sk_flags
>>>* %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
>>>* %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
>>>* @skc_incoming_cpu: record/match cpu processing incoming packets
>>>* @skc_refcnt: reference count
>>> + * @skc_symmetric_queues: symmetric tx/rx queues
>>>*
>>>* This is the minimal network layer representation of sockets, the
>>> header
>>>* for struct sock and struct inet_timewait_sock.
>>> @@ -177,6 +180,7 @@ struct sock_common {
>>>  unsigned char   skc_reuseport:1;
>>>  unsigned char   skc_ipv6only:1;
>>>  unsigned char   skc_net_refcnt:1;
>>> +   unsigned char   skc_symmetric_queues:1;
>>>  int skc_bound_dev_if;
>>>  union {
>>>  struct hlist_node   skc_bind_node;
>>> @@ -214,6 +218,8 @@ struct sock_common {
>>>  struct hlist_nulls_node skc_nulls_node;
>>>  };
>>>  int skc_tx_queue_mapping;
>>> +   int skc_rx_queue_mapping;
>>> +   int skc_rx_ifindex;
>>>  union {
>>>  int skc_incoming_cpu;
>>>  u32 skc_rcv_wnd;
>>> @@ -324,6 +330,8 @@ struct sock {
>>>   #define sk_nulls_node  __sk_common.skc_nulls_node
>>>   #define sk_refcnt  __sk_common.skc_refcnt
>>>   #define sk_tx_queue_mapping__sk_common.skc_tx_queue_mapping
>>> +#define sk_rx_queue_mapping__sk_common.skc_rx_queue_mapping
>>> +#define sk_rx_ifindex  __sk_common.skc_rx_ifindex
>>>
>>>   #define sk_dontcopy_begin  __sk_common.skc_dontcopy_begin
>>>   #define sk_dontcopy_end__sk_common.skc_dontcopy_end
>>> @@ -340,6 +348,7 @@ struct sock {
>>>   #define sk_reuseport   __sk_common.skc_reuseport
>>>   #define sk_ipv6only__sk_common.skc_ipv6only
>>>   #define sk_net_refcnt  __sk_common.skc_net_refcnt
>>> +#define sk_symmetric_queues__sk_common.skc_symmetric_queues
>>>   #define sk_bound_dev_if__sk_common.skc_bound_dev_if
>>>   #define sk_bind_node   __sk_common.skc_bind_node
>>>   #define s

Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-11 Thread Alexander Duyck
On Mon, Sep 11, 2017 at 9:48 AM, Samudrala, Sridhar
 wrote:
>
>
> On 9/9/2017 6:28 PM, Alexander Duyck wrote:
>>
>> On Thu, Aug 31, 2017 at 4:27 PM, Sridhar Samudrala
>>  wrote:
>>>
>>> This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be
>>> used
>>> to enable symmetric tx and rx queues on a socket.
>>>
>>> This option is specifically useful for epoll based multi threaded
>>> workloads
>>> where each thread handles packets received on a single RX queue. In this
>>> model,
>>> we have noticed that it helps to send the packets on the same TX queue
>>> corresponding to the queue-pair associated with the RX queue specifically
>>> when
>>> busy poll is enabled with epoll().
>>>
>>> Two new fields are added to struct sock_common to cache the last rx
>>> ifindex and
>>> the rx queue in the receive path of an SKB. __netdev_pick_tx() returns
>>> the cached
>>> rx queue when this option is enabled and the TX is happening on the same
>>> device.
>>>
>>> Signed-off-by: Sridhar Samudrala 
>>> ---
>>>   include/net/request_sock.h|  1 +
>>>   include/net/sock.h| 17 +
>>>   include/uapi/asm-generic/socket.h |  2 ++
>>>   net/core/dev.c|  8 +++-
>>>   net/core/sock.c   | 10 ++
>>>   net/ipv4/tcp_input.c  |  1 +
>>>   net/ipv4/tcp_ipv4.c   |  1 +
>>>   net/ipv4/tcp_minisocks.c  |  1 +
>>>   8 files changed, 40 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/net/request_sock.h b/include/net/request_sock.h
>>> index 23e2205..c3bc12e 100644
>>> --- a/include/net/request_sock.h
>>> +++ b/include/net/request_sock.h
>>> @@ -100,6 +100,7 @@ static inline struct sock *req_to_sk(struct
>>> request_sock *req)
>>>  req_to_sk(req)->sk_prot = sk_listener->sk_prot;
>>>  sk_node_init(&req_to_sk(req)->sk_node);
>>>  sk_tx_queue_clear(req_to_sk(req));
>>> +   req_to_sk(req)->sk_symmetric_queues =
>>> sk_listener->sk_symmetric_queues;
>>>  req->saved_syn = NULL;
>>>  refcount_set(&req->rsk_refcnt, 0);
>>>
>>> diff --git a/include/net/sock.h b/include/net/sock.h
>>> index 03a3625..3421809 100644
>>> --- a/include/net/sock.h
>>> +++ b/include/net/sock.h
>>> @@ -138,11 +138,14 @@ void SOCK_DEBUG(const struct sock *sk, const char
>>> *msg, ...)
>>>* @skc_node: main hash linkage for various protocol lookup tables
>>>* @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
>>>* @skc_tx_queue_mapping: tx queue number for this connection
>>> + * @skc_rx_queue_mapping: rx queue number for this connection
>>> + * @skc_rx_ifindex: rx ifindex for this connection
>>>* @skc_flags: place holder for sk_flags
>>>* %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
>>>* %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
>>>* @skc_incoming_cpu: record/match cpu processing incoming packets
>>>* @skc_refcnt: reference count
>>> + * @skc_symmetric_queues: symmetric tx/rx queues
>>>*
>>>* This is the minimal network layer representation of sockets, the
>>> header
>>>* for struct sock and struct inet_timewait_sock.
>>> @@ -177,6 +180,7 @@ struct sock_common {
>>>  unsigned char   skc_reuseport:1;
>>>  unsigned char   skc_ipv6only:1;
>>>  unsigned char   skc_net_refcnt:1;
>>> +   unsigned char   skc_symmetric_queues:1;
>>>  int skc_bound_dev_if;
>>>  union {
>>>  struct hlist_node   skc_bind_node;
>>> @@ -214,6 +218,8 @@ struct sock_common {
>>>  struct hlist_nulls_node skc_nulls_node;
>>>  };
>>>  int skc_tx_queue_mapping;
>>> +   int skc_rx_queue_mapping;
>>> +   int skc_rx_ifindex;
>>>  union {
>>>  int skc_incoming_cpu;
>>>  u32 skc_rcv_wnd;
>>> @@ -324,6 +330,8 @@ struct sock {
>>>   #define sk_nulls_node  __sk_common.skc_nulls_node
>>>   #define sk_refcnt  __sk_common.skc_refcnt
>>>   #define sk_tx_queue_mapping__sk_common.skc_tx_queue_mapping
>>> +#define sk_rx_queue_mapping__sk_common.skc_rx_queue_mapping
>>> +#define sk_rx_ifindex  __sk_common.skc_rx_ifindex
>>>
>>>   #define sk_dontcopy_begin  __sk_common.skc_dontcopy_begin
>>>   #define sk_dontcopy_end__sk_common.skc_dontcopy_end
>>> @@ -340,6 +348,7 @@ struct sock {
>>>   #define sk_reuseport   __sk_common.skc_reuseport
>>>   #define sk_ipv6only__sk_common.skc_ipv6only
>>>   #define sk_net_refcnt  __sk_common.skc_net_refcnt
>>> +#define sk_symmetric_queues__sk_common.skc_symmetric_queues
>>>   #define sk_bound_dev_if__sk_common.skc_bound_dev_if
>>>   #define sk_bind_node   __sk_common.skc_bind_node
>>>   #d

Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-11 Thread Samudrala, Sridhar

On 9/10/2017 8:19 AM, Tom Herbert wrote:

On Thu, Aug 31, 2017 at 4:27 PM, Sridhar Samudrala
 wrote:

This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be used
to enable symmetric tx and rx queues on a socket.

This option is specifically useful for epoll based multi threaded workloads
where each thread handles packets received on a single RX queue. In this model,
we have noticed that it helps to send the packets on the same TX queue
corresponding to the queue-pair associated with the RX queue specifically when
busy poll is enabled with epoll().

Two new fields are added to struct sock_common to cache the last rx ifindex and
the rx queue in the receive path of an SKB. __netdev_pick_tx() returns the 
cached
rx queue when this option is enabled and the TX is happening on the same device.

Signed-off-by: Sridhar Samudrala 
---
  include/net/request_sock.h|  1 +
  include/net/sock.h| 17 +
  include/uapi/asm-generic/socket.h |  2 ++
  net/core/dev.c|  8 +++-
  net/core/sock.c   | 10 ++
  net/ipv4/tcp_input.c  |  1 +
  net/ipv4/tcp_ipv4.c   |  1 +
  net/ipv4/tcp_minisocks.c  |  1 +
  8 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 23e2205..c3bc12e 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -100,6 +100,7 @@ static inline struct sock *req_to_sk(struct request_sock 
*req)
 req_to_sk(req)->sk_prot = sk_listener->sk_prot;
 sk_node_init(&req_to_sk(req)->sk_node);
 sk_tx_queue_clear(req_to_sk(req));
+   req_to_sk(req)->sk_symmetric_queues = sk_listener->sk_symmetric_queues;
 req->saved_syn = NULL;
 refcount_set(&req->rsk_refcnt, 0);

diff --git a/include/net/sock.h b/include/net/sock.h
index 03a3625..3421809 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -138,11 +138,14 @@ void SOCK_DEBUG(const struct sock *sk, const char *msg, 
...)
   * @skc_node: main hash linkage for various protocol lookup tables
   * @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
   * @skc_tx_queue_mapping: tx queue number for this connection
+ * @skc_rx_queue_mapping: rx queue number for this connection
+ * @skc_rx_ifindex: rx ifindex for this connection
   * @skc_flags: place holder for sk_flags
   * %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
   * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
   * @skc_incoming_cpu: record/match cpu processing incoming packets
   * @skc_refcnt: reference count
+ * @skc_symmetric_queues: symmetric tx/rx queues
   *
   * This is the minimal network layer representation of sockets, the header
   * for struct sock and struct inet_timewait_sock.
@@ -177,6 +180,7 @@ struct sock_common {
 unsigned char   skc_reuseport:1;
 unsigned char   skc_ipv6only:1;
 unsigned char   skc_net_refcnt:1;
+   unsigned char   skc_symmetric_queues:1;
 int skc_bound_dev_if;
 union {
 struct hlist_node   skc_bind_node;
@@ -214,6 +218,8 @@ struct sock_common {
 struct hlist_nulls_node skc_nulls_node;
 };
 int skc_tx_queue_mapping;
+   int skc_rx_queue_mapping;
+   int skc_rx_ifindex;
 union {
 int skc_incoming_cpu;
 u32 skc_rcv_wnd;
@@ -324,6 +330,8 @@ struct sock {
  #define sk_nulls_node  __sk_common.skc_nulls_node
  #define sk_refcnt  __sk_common.skc_refcnt
  #define sk_tx_queue_mapping__sk_common.skc_tx_queue_mapping
+#define sk_rx_queue_mapping__sk_common.skc_rx_queue_mapping
+#define sk_rx_ifindex  __sk_common.skc_rx_ifindex

  #define sk_dontcopy_begin  __sk_common.skc_dontcopy_begin
  #define sk_dontcopy_end__sk_common.skc_dontcopy_end
@@ -340,6 +348,7 @@ struct sock {
  #define sk_reuseport   __sk_common.skc_reuseport
  #define sk_ipv6only__sk_common.skc_ipv6only
  #define sk_net_refcnt  __sk_common.skc_net_refcnt
+#define sk_symmetric_queues__sk_common.skc_symmetric_queues
  #define sk_bound_dev_if__sk_common.skc_bound_dev_if
  #define sk_bind_node   __sk_common.skc_bind_node
  #define sk_prot__sk_common.skc_prot
@@ -1676,6 +1685,14 @@ static inline int sk_tx_queue_get(const struct sock *sk)
 return sk ? sk->sk_tx_queue_mapping : -1;
  }

+static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff *skb)
+{
+   if (sk->sk_symmetric_queues) {
+   sk->sk_rx_ifindex = skb->skb_iif;
+   sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);
+   }
+}
+
  static inline void sk_set_socke

Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-11 Thread Samudrala, Sridhar



On 9/9/2017 6:28 PM, Alexander Duyck wrote:

On Thu, Aug 31, 2017 at 4:27 PM, Sridhar Samudrala
 wrote:

This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be used
to enable symmetric tx and rx queues on a socket.

This option is specifically useful for epoll based multi threaded workloads
where each thread handles packets received on a single RX queue. In this model,
we have noticed that it helps to send the packets on the same TX queue
corresponding to the queue-pair associated with the RX queue specifically when
busy poll is enabled with epoll().

Two new fields are added to struct sock_common to cache the last rx ifindex and
the rx queue in the receive path of an SKB. __netdev_pick_tx() returns the 
cached
rx queue when this option is enabled and the TX is happening on the same device.

Signed-off-by: Sridhar Samudrala 
---
  include/net/request_sock.h|  1 +
  include/net/sock.h| 17 +
  include/uapi/asm-generic/socket.h |  2 ++
  net/core/dev.c|  8 +++-
  net/core/sock.c   | 10 ++
  net/ipv4/tcp_input.c  |  1 +
  net/ipv4/tcp_ipv4.c   |  1 +
  net/ipv4/tcp_minisocks.c  |  1 +
  8 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 23e2205..c3bc12e 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -100,6 +100,7 @@ static inline struct sock *req_to_sk(struct request_sock 
*req)
 req_to_sk(req)->sk_prot = sk_listener->sk_prot;
 sk_node_init(&req_to_sk(req)->sk_node);
 sk_tx_queue_clear(req_to_sk(req));
+   req_to_sk(req)->sk_symmetric_queues = sk_listener->sk_symmetric_queues;
 req->saved_syn = NULL;
 refcount_set(&req->rsk_refcnt, 0);

diff --git a/include/net/sock.h b/include/net/sock.h
index 03a3625..3421809 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -138,11 +138,14 @@ void SOCK_DEBUG(const struct sock *sk, const char *msg, 
...)
   * @skc_node: main hash linkage for various protocol lookup tables
   * @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
   * @skc_tx_queue_mapping: tx queue number for this connection
+ * @skc_rx_queue_mapping: rx queue number for this connection
+ * @skc_rx_ifindex: rx ifindex for this connection
   * @skc_flags: place holder for sk_flags
   * %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
   * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
   * @skc_incoming_cpu: record/match cpu processing incoming packets
   * @skc_refcnt: reference count
+ * @skc_symmetric_queues: symmetric tx/rx queues
   *
   * This is the minimal network layer representation of sockets, the header
   * for struct sock and struct inet_timewait_sock.
@@ -177,6 +180,7 @@ struct sock_common {
 unsigned char   skc_reuseport:1;
 unsigned char   skc_ipv6only:1;
 unsigned char   skc_net_refcnt:1;
+   unsigned char   skc_symmetric_queues:1;
 int skc_bound_dev_if;
 union {
 struct hlist_node   skc_bind_node;
@@ -214,6 +218,8 @@ struct sock_common {
 struct hlist_nulls_node skc_nulls_node;
 };
 int skc_tx_queue_mapping;
+   int skc_rx_queue_mapping;
+   int skc_rx_ifindex;
 union {
 int skc_incoming_cpu;
 u32 skc_rcv_wnd;
@@ -324,6 +330,8 @@ struct sock {
  #define sk_nulls_node  __sk_common.skc_nulls_node
  #define sk_refcnt  __sk_common.skc_refcnt
  #define sk_tx_queue_mapping__sk_common.skc_tx_queue_mapping
+#define sk_rx_queue_mapping__sk_common.skc_rx_queue_mapping
+#define sk_rx_ifindex  __sk_common.skc_rx_ifindex

  #define sk_dontcopy_begin  __sk_common.skc_dontcopy_begin
  #define sk_dontcopy_end__sk_common.skc_dontcopy_end
@@ -340,6 +348,7 @@ struct sock {
  #define sk_reuseport   __sk_common.skc_reuseport
  #define sk_ipv6only__sk_common.skc_ipv6only
  #define sk_net_refcnt  __sk_common.skc_net_refcnt
+#define sk_symmetric_queues__sk_common.skc_symmetric_queues
  #define sk_bound_dev_if__sk_common.skc_bound_dev_if
  #define sk_bind_node   __sk_common.skc_bind_node
  #define sk_prot__sk_common.skc_prot
@@ -1676,6 +1685,14 @@ static inline int sk_tx_queue_get(const struct sock *sk)
 return sk ? sk->sk_tx_queue_mapping : -1;
  }

+static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff *skb)
+{
+   if (sk->sk_symmetric_queues) {
+   sk->sk_rx_ifindex = skb->skb_iif;
+   sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);
+   }
+}
+
  static inline void sk_set_

Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-10 Thread Tom Herbert
On Thu, Aug 31, 2017 at 4:27 PM, Sridhar Samudrala
 wrote:
> This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be used
> to enable symmetric tx and rx queues on a socket.
>
> This option is specifically useful for epoll based multi threaded workloads
> where each thread handles packets received on a single RX queue. In this
> model,
> we have noticed that it helps to send the packets on the same TX queue
> corresponding to the queue-pair associated with the RX queue specifically when
> busy poll is enabled with epoll().
>
> Two new fields are added to struct sock_common to cache the last rx ifindex 
> and
> the rx queue in the receive path of an SKB. __netdev_pick_tx() returns the 
> cached
> rx queue when this option is enabled and the TX is happening on the same 
> device.
>
> Signed-off-by: Sridhar Samudrala 
> ---
>  include/net/request_sock.h|  1 +
>  include/net/sock.h| 17 +
>  include/uapi/asm-generic/socket.h |  2 ++
>  net/core/dev.c|  8 +++-
>  net/core/sock.c   | 10 ++
>  net/ipv4/tcp_input.c  |  1 +
>  net/ipv4/tcp_ipv4.c   |  1 +
>  net/ipv4/tcp_minisocks.c  |  1 +
>  8 files changed, 40 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/request_sock.h b/include/net/request_sock.h
> index 23e2205..c3bc12e 100644
> --- a/include/net/request_sock.h
> +++ b/include/net/request_sock.h
> @@ -100,6 +100,7 @@ static inline struct sock *req_to_sk(struct request_sock 
> *req)
> req_to_sk(req)->sk_prot = sk_listener->sk_prot;
> sk_node_init(&req_to_sk(req)->sk_node);
> sk_tx_queue_clear(req_to_sk(req));
> +   req_to_sk(req)->sk_symmetric_queues = 
> sk_listener->sk_symmetric_queues;
> req->saved_syn = NULL;
> refcount_set(&req->rsk_refcnt, 0);
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 03a3625..3421809 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -138,11 +138,14 @@ void SOCK_DEBUG(const struct sock *sk, const char *msg, 
> ...)
>   * @skc_node: main hash linkage for various protocol lookup tables
>   * @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
>   * @skc_tx_queue_mapping: tx queue number for this connection
> + * @skc_rx_queue_mapping: rx queue number for this connection
> + * @skc_rx_ifindex: rx ifindex for this connection
>   * @skc_flags: place holder for sk_flags
>   * %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
>   * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
>   * @skc_incoming_cpu: record/match cpu processing incoming packets
>   * @skc_refcnt: reference count
> + * @skc_symmetric_queues: symmetric tx/rx queues
>   *
>   * This is the minimal network layer representation of sockets, the 
> header
>   * for struct sock and struct inet_timewait_sock.
> @@ -177,6 +180,7 @@ struct sock_common {
> unsigned char   skc_reuseport:1;
> unsigned char   skc_ipv6only:1;
> unsigned char   skc_net_refcnt:1;
> +   unsigned char   skc_symmetric_queues:1;
> int skc_bound_dev_if;
> union {
> struct hlist_node   skc_bind_node;
> @@ -214,6 +218,8 @@ struct sock_common {
> struct hlist_nulls_node skc_nulls_node;
> };
> int skc_tx_queue_mapping;
> +   int skc_rx_queue_mapping;
> +   int skc_rx_ifindex;
> union {
> int skc_incoming_cpu;
> u32 skc_rcv_wnd;
> @@ -324,6 +330,8 @@ struct sock {
>  #define sk_nulls_node  __sk_common.skc_nulls_node
>  #define sk_refcnt  __sk_common.skc_refcnt
>  #define sk_tx_queue_mapping__sk_common.skc_tx_queue_mapping
> +#define sk_rx_queue_mapping__sk_common.skc_rx_queue_mapping
> +#define sk_rx_ifindex  __sk_common.skc_rx_ifindex
>
>  #define sk_dontcopy_begin  __sk_common.skc_dontcopy_begin
>  #define sk_dontcopy_end__sk_common.skc_dontcopy_end
> @@ -340,6 +348,7 @@ struct sock {
>  #define sk_reuseport   __sk_common.skc_reuseport
>  #define sk_ipv6only__sk_common.skc_ipv6only
>  #define sk_net_refcnt  __sk_common.skc_net_refcnt
> +#define sk_symmetric_queues__sk_common.skc_symmetric_queues
>  #define sk_bound_dev_if__sk_common.skc_bound_dev_if
>  #define sk_bind_node   __sk_common.skc_bind_node
>  #define sk_prot__sk_common.skc_prot
> @@ -1676,6 +1685,14 @@ static inline int sk_tx_queue_get(const struct sock 
> *sk)
> return sk ? sk->sk_tx_queue_mapping : -1;
>  }
>
> +static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff *skb)
> +{
> +   if (sk->sk_symmetric_queues) {
> +   sk->sk_rx_ifindex = skb->skb

Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-09 Thread Tom Herbert
On Thu, Aug 31, 2017 at 4:27 PM, Sridhar Samudrala
 wrote:
> This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be used
> to enable symmetric tx and rx queues on a socket.
>
> This option is specifically useful for epoll based multi threaded workloads
> where each thread handles packets received on a single RX queue. In this
> model,
> we have noticed that it helps to send the packets on the same TX queue
> corresponding to the queue-pair associated with the RX queue specifically when
> busy poll is enabled with epoll().
>
Please provide more details, test results on exactly how this helps.
Why would this be better than optimized XPS?

Thanks,
Tom

> Two new fields are added to struct sock_common to cache the last rx ifindex 
> and
> the rx queue in the receive path of an SKB. __netdev_pick_tx() returns the 
> cached
> rx queue when this option is enabled and the TX is happening on the same 
> device.
>
> Signed-off-by: Sridhar Samudrala 
> ---
>  include/net/request_sock.h|  1 +
>  include/net/sock.h| 17 +
>  include/uapi/asm-generic/socket.h |  2 ++
>  net/core/dev.c|  8 +++-
>  net/core/sock.c   | 10 ++
>  net/ipv4/tcp_input.c  |  1 +
>  net/ipv4/tcp_ipv4.c   |  1 +
>  net/ipv4/tcp_minisocks.c  |  1 +
>  8 files changed, 40 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/request_sock.h b/include/net/request_sock.h
> index 23e2205..c3bc12e 100644
> --- a/include/net/request_sock.h
> +++ b/include/net/request_sock.h
> @@ -100,6 +100,7 @@ static inline struct sock *req_to_sk(struct request_sock 
> *req)
Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-09 Thread Alexander Duyck

[RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-08-31 Thread Sridhar Samudrala
This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be used
to enable symmetric tx and rx queues on a socket.

This option is specifically useful for epoll-based multi-threaded workloads
where each thread handles packets received on a single RX queue. In this model,
we have noticed that it helps to send the packets on the same TX queue
corresponding to the queue pair associated with the RX queue, specifically when
busy poll is enabled with epoll().

Two new fields are added to struct sock_common to cache the last rx ifindex and
the rx queue in the receive path of an SKB. __netdev_pick_tx() returns the
cached rx queue when this option is enabled and the TX is happening on the same
device.

Signed-off-by: Sridhar Samudrala 
---
 include/net/request_sock.h        |  1 +
 include/net/sock.h                | 17 +++++++++++++++++
 include/uapi/asm-generic/socket.h |  2 ++
 net/core/dev.c                    |  8 +++++++-
 net/core/sock.c                   | 10 ++++++++++
 net/ipv4/tcp_input.c              |  1 +
 net/ipv4/tcp_ipv4.c               |  1 +
 net/ipv4/tcp_minisocks.c          |  1 +
 8 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 23e2205..c3bc12e 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -100,6 +100,7 @@ static inline struct sock *req_to_sk(struct request_sock *req)
req_to_sk(req)->sk_prot = sk_listener->sk_prot;
sk_node_init(&req_to_sk(req)->sk_node);
sk_tx_queue_clear(req_to_sk(req));
+   req_to_sk(req)->sk_symmetric_queues = sk_listener->sk_symmetric_queues;
req->saved_syn = NULL;
refcount_set(&req->rsk_refcnt, 0);
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 03a3625..3421809 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -138,11 +138,14 @@ void SOCK_DEBUG(const struct sock *sk, const char *msg, ...)
  * @skc_node: main hash linkage for various protocol lookup tables
  * @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
  * @skc_tx_queue_mapping: tx queue number for this connection
+ * @skc_rx_queue_mapping: rx queue number for this connection
+ * @skc_rx_ifindex: rx ifindex for this connection
  * @skc_flags: place holder for sk_flags
  * %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
  * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
  * @skc_incoming_cpu: record/match cpu processing incoming packets
  * @skc_refcnt: reference count
+ * @skc_symmetric_queues: symmetric tx/rx queues
  *
  * This is the minimal network layer representation of sockets, the header
  * for struct sock and struct inet_timewait_sock.
@@ -177,6 +180,7 @@ struct sock_common {
unsigned char   skc_reuseport:1;
unsigned char   skc_ipv6only:1;
unsigned char   skc_net_refcnt:1;
+   unsigned char   skc_symmetric_queues:1;
int skc_bound_dev_if;
union {
struct hlist_node   skc_bind_node;
@@ -214,6 +218,8 @@ struct sock_common {
struct hlist_nulls_node skc_nulls_node;
};
int skc_tx_queue_mapping;
+   int skc_rx_queue_mapping;
+   int skc_rx_ifindex;
union {
int skc_incoming_cpu;
u32 skc_rcv_wnd;
@@ -324,6 +330,8 @@ struct sock {
 #define sk_nulls_node  __sk_common.skc_nulls_node
 #define sk_refcnt  __sk_common.skc_refcnt
 #define sk_tx_queue_mapping    __sk_common.skc_tx_queue_mapping
+#define sk_rx_queue_mapping    __sk_common.skc_rx_queue_mapping
+#define sk_rx_ifindex          __sk_common.skc_rx_ifindex
 
 #define sk_dontcopy_begin  __sk_common.skc_dontcopy_begin
 #define sk_dontcopy_end__sk_common.skc_dontcopy_end
@@ -340,6 +348,7 @@ struct sock {
 #define sk_reuseport   __sk_common.skc_reuseport
 #define sk_ipv6only__sk_common.skc_ipv6only
 #define sk_net_refcnt  __sk_common.skc_net_refcnt
+#define sk_symmetric_queues    __sk_common.skc_symmetric_queues
 #define sk_bound_dev_if        __sk_common.skc_bound_dev_if
 #define sk_bind_node           __sk_common.skc_bind_node
 #define sk_prot                __sk_common.skc_prot
@@ -1676,6 +1685,14 @@ static inline int sk_tx_queue_get(const struct sock *sk)
return sk ? sk->sk_tx_queue_mapping : -1;
 }
 
+static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff *skb)
+{
+   if (sk->sk_symmetric_queues) {
+   sk->sk_rx_ifindex = skb->skb_iif;
+   sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);
+   }
+}
+
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
sk_tx_queue_clear(sk);
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/so