Hi Hao,

The current design of the application is, very roughly, the following:

1. There is one main thread which pumps the packets out of the NIC queues using rte_eth_rx_burst(), as you said. In the future we may need several main threads to be able to scale the application; each of them would work on a separate group of RX queues. The main thread distributes the received packets to N other threads using the single-producer single-consumer rings provided by DPDK (rte_ring).

2. Each of these N other threads runs a separate instance of the F-stack. As I said, we use a networking stack per thread and a share-nothing design. Let's call them worker threads for now.

3. Each worker thread has its own spsc_ring for the incoming packets and uses a separate NIC queue to send the outgoing packets using rte_eth_tx_burst(). The main loop of such a worker thread looks roughly like this (pseudo code):

    while (not stopped) {
        if (X_milliseconds_have_passed)
            call_fstack_tick_functionality();
        send_queued_fstack_packets(); // using rte_eth_tx_burst
        dequeue_incoming_packets_from_spsc_ring();
        enqueue_the_incoming_packets_to_the_fstack();
        if (Y_milliseconds_have_passed)
            process_fstack_socket_events_using_epoll();
    }

   You may note the following things from the above code:

   - The packets are sent (rte_eth_tx_burst) in the same thread where the socket events are processed. The outgoing packets are also sent if we queue enough of them while processing socket write events, but going into that would complicate the explanation here.
   - The timer events and the socket events are not processed on each iteration of the loop. The X and Y millisecond intervals come from a config file and are measured using rte_rdtsc.
   - The loop is very similar to the one present in the F-stack itself - https://github.com/F-Stack/f-stack/blob/dev/lib/ff_dpdk_if.c#L1817. It's just that in our case this loop is decoupled from the F-stack, because we removed the DPDK from the F-stack in order to use the latter as a separate library and to run a separate networking stack per thread.

4. The number of worker threads is configurable via the application config file, and the application sets up the NIC with the same number of RX/TX queues as the number of worker threads. This way the main thread pumps packets out of N RX queues and each worker thread enqueues packets to its own TX queue, i.e. there is no sharing. So the application may run with a single RX/TX queue, and then it will have one main thread and one worker thread. Or it may run with 10 RX/TX queues, and then it will have 1 main thread and 10 worker threads. It depends on the amount of traffic that we expect to handle, the NIC capabilities, etc.

Regards,
Pavel.
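
P.S. In case it helps to see the worker loop from point 3 in something closer to real code, here is a minimal, self-contained sketch. It is not our actual code and it does not use DPDK: struct spsc_ring is a hypothetical stand-in for an rte_ring in single-producer/single-consumer mode, the "now" argument stands in for a value read with rte_rdtsc(), and plain counters replace the F-stack tick, input and epoll calls. All names in it are made up for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 256u
#define BURST_SIZE 32u
static_assert((RING_SIZE & (RING_SIZE - 1)) == 0, "RING_SIZE must be a power of two");

/* Hypothetical stand-in for an rte_ring used in SP/SC mode. */
struct spsc_ring {
    _Atomic uint32_t head;   /* written only by the producer (main thread) */
    _Atomic uint32_t tail;   /* written only by the consumer (worker thread) */
    void *slots[RING_SIZE];
};

/* Producer side: returns false when the ring is full. */
static bool spsc_enqueue(struct spsc_ring *r, void *pkt)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false;
    r->slots[head & (RING_SIZE - 1)] = pkt;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: dequeues up to max packets, returns how many were taken. */
static size_t spsc_dequeue_burst(struct spsc_ring *r, void **out, size_t max)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    size_t n = head - tail;
    if (n > max)
        n = max;
    for (size_t i = 0; i < n; i++)
        out[i] = r->slots[(tail + i) & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + (uint32_t)n, memory_order_release);
    return n;
}

/* Per-worker state; the intervals are the X and Y values from the config,
 * already converted to clock units (TSC cycles in the real application). */
struct worker {
    struct spsc_ring *rx_ring;
    uint64_t tick_interval;    /* "X milliseconds" */
    uint64_t event_interval;   /* "Y milliseconds" */
    uint64_t last_tick;
    uint64_t last_events;
    uint64_t ticks;            /* how many times the stack tick ran */
    uint64_t event_runs;       /* how many times socket events were processed */
    uint64_t packets_in;       /* packets handed to the stack */
};

/* One iteration of the worker loop; "now" would come from rte_rdtsc(). */
static void worker_iteration(struct worker *w, uint64_t now)
{
    void *burst[BURST_SIZE];

    if (now - w->last_tick >= w->tick_interval) {
        w->last_tick = now;
        w->ticks++;            /* the F-stack tick functionality goes here */
    }

    /* Queued outgoing packets would be flushed here with
     * rte_eth_tx_burst() on this worker's own TX queue. */

    size_t n = spsc_dequeue_burst(w->rx_ring, burst, BURST_SIZE);
    w->packets_in += n;        /* the packets are enqueued to the stack here */

    if (now - w->last_events >= w->event_interval) {
        w->last_events = now;
        w->event_runs++;       /* epoll-style socket event processing goes here */
    }
}
```

The real loop simply calls the equivalent of worker_iteration() with a fresh timestamp until the thread is stopped. Passing the timestamp in as an argument is only a convenience of the sketch; it makes the two time gates easy to exercise without a real clock.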