On Mon, Aug 08, 2016 at 11:52:58AM +0200, Martin Pieuchot wrote:
> On 07/30/16 00:17, Alexander Bluhm wrote:
> > Spliced TCP sockets become faster if we put the output part into
> > its own task thread.  This is inspired by userland copy where we
> > also have to go through the scheduler.  This gives the socket buffer
> > a chance to be filled up and tcp_output() is called less often and
> > with bigger chunks.
> 
> This is really interesting.  Do you have a clear understanding why is
> it getting faster?  I'm worried about introducing a hack just for socket
> splicing and would prefer a more generic solution if possible.

I ran a profiling kernel.  The first block below is without my
diff, the second is with the task queue for splicing applied.

As you can see, the number of calls to the input functions is
roughly the same.

[6]     31.6    0.04        2.26   23694         tcp_input [6]
                0.52        0.46   14253/29465       tcp_output [7]
                0.00        0.63   14239/14244       sorwakeup [15]
                0.00        0.42    9451/9451        sowwakeup [18]
                0.00        0.00   14239/21847       sbappendstream [487]
                ...


[8]     20.6    0.06        2.75   23712         tcp_input [8]
                0.97        1.32   16470/16984       tcp_output [9]
                0.00        0.00   14258/14367       sorwakeup [352]
                0.00        0.00    9434/9504        sowwakeup [378]
                0.00        0.00   14254/14498       sbappendstream [416]
                ...

When looking at somove(9), the number of calls goes down from 23620
to 205 when it runs in a task.  Note that the sbsync() call counts
are equal, which means the same number of mbufs is transferred.

                0.00        0.00       1/23620       sosplice [182]
                0.00        0.42    9449/23620       sowwakeup [18]
                0.00        0.63   14170/23620       sorwakeup [15]
[8]     14.5    0.00        1.05   23620         somove [8]
                0.00        1.05   15210/15220       tcp_usrreq [9]
                0.00        0.01    7605/7605        m_resethdr [74]
                0.00        0.00   23620/250919      splassert_check [410]
                0.00        0.00   14241/14242       sbsync [496]

                0.00        0.00       1/205         sosplice [292]
                0.00        0.08     204/205         sotask [46]
[45]     0.6    0.00        0.08     205         somove [45]
                0.02        0.06     410/529         tcp_usrreq [39]
                0.00        0.00     205/205         m_resethdr [339]
                0.00        0.00   14147/14152       sbsync [708]
                0.00        0.00     205/265679      splassert_check [626]

It is mainly tcp_usrreq() calling tcp_output() that makes the
difference.

[9]     14.4    0.00        1.05   15220         tcp_usrreq [9]
                0.56        0.49   15212/29465       tcp_output [7]
                ...

[39]     0.7    0.02        0.08     529         tcp_usrreq [39]
                0.03        0.04     504/16984       tcp_output [9]
                ...

The total number of tcp_output() calls is not reduced that much,
as we still have the ACKs triggered from tcp_input().

                0.52        0.46   14253/29465       tcp_input [6]
                0.56        0.49   15212/29465       tcp_usrreq [9]
[7]     27.9    1.08        0.95   29465         tcp_output [7]
                0.44        0.50   23749/23750       ip_output [10]
                ...

                0.03        0.04     504/16984       tcp_usrreq [39]
                0.97        1.32   16470/16984       tcp_input [8]
[9]     17.3    1.00        1.36   16984         tcp_output [9]
                0.42        0.84   21365/21370       ip_output [12]
                ...

So my idea is to delay tcp_output() a little bit to allow the socket
buffers to fill up.  Then we can use bulk operations.  The concept
of using a thread comes from the analogy with a userland copy:
there the buffer fills until the relay process is scheduled.  Using
a soft interrupt would have the same effect.  We have been running
this task diff in production for a year now.

bluhm
