Hello all (Dave, Benoit, Damjan, and all others),

We have a VPP application with a single RX worker/thread that receives
all packets from a NIC and N-1 packet-processing threads that are
transmit-only. Basically, on the NIC we have 1 rx queue and N-1 transmit
queues. Each rx packet/buffer is handed off from the rx thread to a set
of cores (service chaining, pipelining), and each packet-processing core
transmits on its own transmit queue. Some of the packet-processing
threads may hold packets in a queue for seconds, even minutes.

I read that in VPP buffer management a buffer has three states:
available (in the global pool), cached (in a worker thread's local
pool), and used. There is a single global buffer pool and a per-worker
cache pool.

Since a buffer must be returned to the pool after its tx completes, in
this specific scenario (1 rx thread, N-1 tx threads) we would like the
buffers to be returned to the rx thread, so that rx buffers are always
available to receive packets and we do not encounter rx-miss on the NIC.

A spinlock is used to alloc/free buffers from the global pool. In this
case, since the N-1 threads are tx-only and do no rx, returning buffers
to their local caches does not benefit performance. We would like those
buffers returned to the global pool, and ideally directly to the buffer
cache of the single rx thread. I am concerned that as the number of tx
threads grows, more buffers will be returned to the global pool, which
requires taking the spinlock to free them. The single rx thread will run
out of cached buffers and will have to allocate from the global pool,
increasing the chances of spinlock contention overall, which could
potentially hurt performance.

Do you agree with my characterization of the problem? Or do you think the
problem is not severe?

Do you have any suggestions for how we could optimize buffer allocation
in this case? There are four goals:

   - the rx thread never runs out of rx buffers
   - buffers in the pool/caches are not left unused
   - spinlock contention when allocating/freeing buffers from the global
   pool is almost 0
   - the scheme scales as we increase the number of transmit
   threads/cores, e.g. 8, 10, 12, 16, 20, 24

One obvious solution I was thinking of is to reduce the size of the
local buffer cache in the transmit worker threads and increase the local
buffer cache of the single rx thread. Does VPP allow an application to
set per-worker/thread buffer cache sizes?

The application threads (tx threads) that queue packets are required to
enforce a maximum/threshold queue depth, and if the threshold is
exceeded, the application flushes out the queued packets.

Is there any other technique we could use -- for example, after
transmit completes, having the NIC (or driver) move the buffers directly
back to the rx thread?

I would really appreciate your guidance on optimizing buffer usage and
reducing spinlock contention across the tx and rx cores.

Thank you,

- Pranab K Das
-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#21736): https://lists.fd.io/g/vpp-dev/message/21736
-=-=-=-=-=-=-=-=-=-=-=-
