Hello all (Dave, Benoit, Damjan, and all others),

We have a VPP application with a single RX worker/thread that receives all packets from a NIC and N-1 packet-processing threads that are transmit-only. Basically, on the NIC we have 1 rx queue and N-1 transmit queues. The rx packet/buffer is handed off from the rx thread to a set of cores (service chaining, pipelining), and each packet-processing core transmits on its own transmit queue. Some of the packet-processing threads might queue packets for seconds or even minutes.
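To make the hand-off concrete, here is roughly the kind of single-producer/single-consumer ring I have in mind for moving buffer indices from the rx thread to a processing core (an illustrative C sketch only; all names here are mine, and VPP's own inter-thread hand-off infrastructure is of course more involved than this):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Simplified model of the rx -> worker hand-off: a single-producer/
 * single-consumer ring carrying buffer indices. One ring per worker;
 * only the rx thread enqueues and only that worker dequeues, so no
 * lock is needed. Illustrative only, not VPP code. */
#define RING_SZ 256 /* must be a power of two */

typedef struct {
  _Atomic uint32_t head;   /* written by producer (rx thread) */
  _Atomic uint32_t tail;   /* written by consumer (worker) */
  uint32_t slots[RING_SZ]; /* buffer indices in flight */
} spsc_ring_t;

static int ring_enqueue(spsc_ring_t *r, uint32_t bi) {
  uint32_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
  uint32_t t = atomic_load_explicit(&r->tail, memory_order_acquire);
  if (h - t == RING_SZ)
    return -1; /* full: rx thread must back off or drop */
  r->slots[h & (RING_SZ - 1)] = bi;
  atomic_store_explicit(&r->head, h + 1, memory_order_release);
  return 0;
}

static int ring_dequeue(spsc_ring_t *r, uint32_t *bi) {
  uint32_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
  uint32_t h = atomic_load_explicit(&r->head, memory_order_acquire);
  if (t == h)
    return -1; /* empty */
  *bi = r->slots[t & (RING_SZ - 1)];
  atomic_store_explicit(&r->tail, t + 1, memory_order_release);
  return 0;
}
```

The point is that each hand-off is lock-free; the contention I am worried about is entirely on the buffer alloc/free side, described next.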
I read that in VPP buffer management a buffer has three states: available, cached (per worker thread), and used. There is a single global buffer pool and a per-worker cache pool. Since a buffer must be returned to the pool after packet tx completes, in this specific scenario (1 rx and N-1 tx threads) we would like buffers to be returned to the rx thread so that there are always rx buffers available to receive packets and we don't encounter rx-misses on the NIC.

A spinlock is used to alloc/free buffers from the global pool. Since there is no rx on the N-1 threads (they are tx-only), returning buffers to their local caches does not benefit performance. We would like the buffers to be returned to the global pool, and in fact directly to the buffer cache of the single rx thread. My concern is that as the number of tx threads grows, more buffers will be returned to the global pool, which requires taking the spinlock to free them. Meanwhile the single rx thread will run out of cached buffers and will have to allocate from the global pool, increasing the chances of spinlock contention overall, which could hurt performance. Do you agree with my characterization of the problem? Or do you think the problem is not severe? Do you have any suggestions on how we could optimize buffer allocation in this case? The goals are:
- the rx thread never runs out of rx buffers
- buffers in the pool/caches are not left unused
- spinlock contention when allocating/freeing buffers from the global pool is close to 0
- the scheme scales as we increase the number of transmit threads/cores, e.g. 8, 10, 12, 16, 20, 24

One obvious solution I was thinking of is to reduce the size of the local buffer cache on the transmit worker threads and increase the local buffer cache of the single rx thread. Does VPP allow an application to set the buffer cache size per worker/thread? The application threads (tx threads) that queue packets are required to enforce a max or threshold queue depth.
If the threshold is exceeded, the application flushes out the queued packets. Is there any other technique we could use, e.g. after transmitting, having the NIC/driver move the buffers directly back to the rx thread? I really appreciate your guidance on optimizing buffer usage and reducing spinlock contention across the tx and rx cores.

Thank you,
- Pranab K Das
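P.S. To make the "return buffers directly to the rx thread" idea concrete, here is the kind of design I am imagining: each tx thread owns one single-producer/single-consumer return ring, and the rx thread drains all of them into its own free-buffer cache between NIC polls, so neither side touches the global pool's spinlock on the fast path. This is a hypothetical sketch of my own, not an existing VPP feature; all names are made up.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical direct-return path: one SPSC ring per tx thread,
 * drained only by the rx thread. Illustrative only, not VPP code. */
#define N_TX 4       /* number of tx-only threads */
#define RET_SZ 128   /* ring depth, power of two */
#define RX_CACHE 512 /* rx thread's free-buffer cache depth */

typedef struct {
  _Atomic uint32_t head; /* written by the owning tx thread */
  _Atomic uint32_t tail; /* written by the rx thread */
  uint32_t slots[RET_SZ];
} return_ring_t;

static return_ring_t rings[N_TX];
static uint32_t rx_cache[RX_CACHE];
static uint32_t rx_n_cached;

/* tx side: called on tx-complete instead of freeing to the pool */
static int return_buffer(uint32_t tx_id, uint32_t bi) {
  return_ring_t *r = &rings[tx_id];
  uint32_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
  if (h - atomic_load_explicit(&r->tail, memory_order_acquire) == RET_SZ)
    return -1; /* ring full: fall back to the locked global pool */
  r->slots[h & (RET_SZ - 1)] = bi;
  atomic_store_explicit(&r->head, h + 1, memory_order_release);
  return 0;
}

/* rx side: drain every tx thread's ring into the rx cache;
 * returns the number of buffers reclaimed this pass */
static uint32_t drain_returns(void) {
  uint32_t drained = 0;
  for (uint32_t i = 0; i < N_TX; i++) {
    return_ring_t *r = &rings[i];
    uint32_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t h = atomic_load_explicit(&r->head, memory_order_acquire);
    while (t != h && rx_n_cached < RX_CACHE) {
      rx_cache[rx_n_cached++] = r->slots[t & (RET_SZ - 1)];
      t++;
      drained++;
    }
    atomic_store_explicit(&r->tail, t, memory_order_release);
  }
  return drained;
}
```

The global pool and its spinlock would remain only as an overflow/underflow fallback. Is something along these lines feasible within VPP's buffer management, or would it fight the existing vlib free path?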
Links: View/Reply Online (#21736): https://lists.fd.io/g/vpp-dev/message/21736