On Mon, Aug 08, 2016 at 11:52:58AM +0200, Martin Pieuchot wrote:
> On 07/30/16 00:17, Alexander Bluhm wrote:
> > Spliced TCP sockets become faster if we put the output part into
> > its own task thread. This is inspired by userland copy where we
> > also have to go through the scheduler. This gives the socket buffer
> > a chance to be filled up and tcp_output() is called less often and
> > with bigger chunks.
>
> This is really interesting. Do you have a clear understanding why is
> it getting faster? I'm worried about introducing a hack just for socket
> splicing and would prefer a more generic solution if possible.
I ran a profiling kernel. The first block is without my diff, the
second is with the task queue for splicing applied.
As you can see, the number of calls to the input functions is
roughly the same.
[6] 31.6 0.04 2.26 23694 tcp_input [6]
0.52 0.46 14253/29465 tcp_output [7]
0.00 0.63 14239/14244 sorwakeup [15]
0.00 0.42 9451/9451 sowwakeup [18]
0.00 0.00 14239/21847 sbappendstream [487]
...
[8] 20.6 0.06 2.75 23712 tcp_input [8]
0.97 1.32 16470/16984 tcp_output [9]
0.00 0.00 14258/14367 sorwakeup [352]
0.00 0.00 9434/9504 sowwakeup [378]
0.00 0.00 14254/14498 sbappendstream [416]
...
When looking at somove(9), the number of calls goes down from 23620
to 205 if it is running in a task. Note that the sbsync() calls are
essentially equal, which means the same number of mbufs is transferred.
0.00 0.00 1/23620 sosplice [182]
0.00 0.42 9449/23620 sowwakeup [18]
0.00 0.63 14170/23620 sorwakeup [15]
[8] 14.5 0.00 1.05 23620 somove [8]
0.00 1.05 15210/15220 tcp_usrreq [9]
0.00 0.01 7605/7605 m_resethdr [74]
0.00 0.00 23620/250919 splassert_check [410]
0.00 0.00 14241/14242 sbsync [496]
0.00 0.00 1/205 sosplice [292]
0.00 0.08 204/205 sotask [46]
[45] 0.6 0.00 0.08 205 somove [45]
0.02 0.06 410/529 tcp_usrreq [39]
0.00 0.00 205/205 m_resethdr [339]
0.00 0.00 14147/14152 sbsync [708]
0.00 0.00 205/265679 splassert_check [626]
The difference comes especially from tcp_usrreq() calling tcp_output().
[9] 14.4 0.00 1.05 15220 tcp_usrreq [9]
0.56 0.49 15212/29465 tcp_output [7]
...
[39] 0.7 0.02 0.08 529 tcp_usrreq [39]
0.03 0.04 504/16984 tcp_output [9]
...
The total number of tcp_output() calls is not reduced that much, as
we still have the ACKs triggered by tcp_input().
0.52 0.46 14253/29465 tcp_input [6]
0.56 0.49 15212/29465 tcp_usrreq [9]
[7] 27.9 1.08 0.95 29465 tcp_output [7]
0.44 0.50 23749/23750 ip_output [10]
...
0.03 0.04 504/16984 tcp_usrreq [39]
0.97 1.32 16470/16984 tcp_input [8]
[9] 17.3 1.00 1.36 16984 tcp_output [9]
0.42 0.84 21365/21370 ip_output [12]
...
So my idea is to delay tcp_output() a little bit to allow the socket
buffers to fill up. Then we can use bulk operations. The concept of
using a thread comes from the analogy with a userland copy: there the
buffer fills up until the relay process is scheduled. Using a soft
interrupt would have the same effect. We have been running this task
diff in production for a year now.
bluhm