Soumya and I have been working on-and-off for a couple of months on a
design for both async callback and zero-copy, based upon APIs already
implemented for Gluster. Once we have something comprehensive and
well-written, I'd like to get feedback from other FSALs.
And of course, zero-copy is the whole point of RDMA. Earlier Gluster
testing with Samba showed that zero-copy gave a better performance
improvement than async IO.
The underlying Linux OS calls only allow one or the other. For example,
for TCP output in Ganesha V2.5/ntirpc 1.5, I've eliminated one task
switch, but still use writev() in a semi-dedicated thread, as there
is no async writev() variant. We should have a measurable performance
improvement (but it might be masked by all the MDCACHE changes).
For FSALs we have the opportunity to design a combined system.
Here's the current state of the design introduction:
NFS-Ganesha direct data placement
with reduced task switching and zero-copy
Currently/Previously
(Task switch 1.) Upon signalling (epoll), a master polling thread launches
other worker threads, one for each signal.
(Task switch 2.) If there is more than one concurrent request on the same
transport tuple (IP source, IP destination, source port, destination port), the
request is added to a stall queue.
(Task switch 3.) While parsing the NFS input, each thread can wait for more
data.
(Task switch 4.) After parsing the NFS input, the thread queues the request
according to several (4) priorities for handling by another worker thread.
Requests are not handled in order.
(Task switch 5.) While executing the NFS request, the thread can stall waiting
for FSAL data.
(Task switch 6.) After retrieving the resulting data, the thread hands off the
output to another thread to handle the system write. [Eliminated in Ganesha
V2.5/ntirpc v1.5]
Ideally
(Task switch 1.) Upon signalling (epoll), the worker thread will make only one
system call to accept the incoming connection.
If there is more than one signal at a time, that same worker queues the
additional signals, queues another work request to handle the next signal, then
continues hot processing of the first signal. Note that this replaces the stall
queue: the later requests draw from a worker pool and are executed sequentially
in a fair-queuing fashion.
To remain hot, the thread checks for additional work before returning to the
idle pool.
(Task switch 2.) Instead of waiting for a read system call to complete, use a
callback to schedule another worker thread, parse the NFS request, and call the
appropriate FSAL.
If more [TCP, RDMA] data is needed for the request, the thread will save the
state for the subsequent signal.
(Task switch 3.) While executing the NFS request, the thread can stall waiting
for FSAL data. The FSAL will return its result, and make a second system call
to send output. If the FSAL result does not require a stall, no task switch is
needed.
To remain hot, the thread checks for additional output data before returning to
the idle pool. Other threads will queue their output data. (As of Ganesha
V2.5/ntirpc v1.5, this is implemented for TCP.)
Input signal changes
Currently, the (epoll) signal is blocked per fd after each fd signal. The input
signal thread does not reinstate the fd signal until after input processing is
complete. This causes a data backlog in the underlying OS until data is dropped
for lack of signal processing. The evidence appears as sawtooth patterns in TCP
traffic: the OS acknowledges (acks) incoming data until no more can be held,
causing a TCP stall and slow start.
Ideally, the signal should never be blocked. That is not possible until the
entire task scheme is upgraded according to this plan, so in the meantime the
block should be reinstated as soon as practicable, allowing new signals to be
queued quickly.
The signal queue(s) implemented for RDMA should be used for all signals.
Preliminary testing by CEA demonstrated that up to 3,000 client connections
could be handled during cold startup. However, this cannot be implemented until
better asynchrony and parallelism are available.
Transport parallelism
Currently, on SVC_RECV() a new transport (SVCXPRT) is spawned for each incoming
TCP and RDMA connection, but not for UDP connections. This requires extensive
locking around UDP receive and send, as each incoming request uses the same
buffers for input and
output, and stores common data fields used by both input and output. There
exists a UDP multi-threading window between SVC_RECV() and SVC_SEND() – that
is, the long-standing code is !MT-safe.
Instead, spawn a new UDP transport for each incoming request. Rather than
allocating a separate buffer for each UDP transport, append an IOQ buffer,
replacing the rpc_buffer() pointer. This will keep the number of memory
allocation calls and contention
exactly the same as previously, and permit usage of the significantly faster
duplex IOQ for