> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Wednesday, April 7, 2021 11:59 AM
>
> On Wed, Apr 07, 2021 at 09:11:23AM +0200, Morten Brørup wrote:
> > > From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Honnappa Nagarahalli
> > > Sent: Wednesday, April 7, 2021 2:48 AM
> > >
> > > <snip>
> > >
> > > > > From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Tom Barbette
> > > > > Sent: Wednesday, March 31, 2021 10:53 AM
> > > > >
> > > > > On 31-03-21 02:44, Honnappa Nagarahalli wrote:
> > > > > > - Ability to tune the values of #defines
> > > > > >     * Few prominent points discussed
> > > > > >         - This will result in #ifdefs in the code (for ex: in testpmd)
> > > > > >         - One option is for all the PMDs to document their configurable #defines in PMD specific header files. Having these distributed is much easier to search.
> > > > > >         - Can some of the existing #defines be converted to runtime configurations? For ex: RTE_MAX_LCORE? This might impact ABI.
> > > > > >     * Bruce to think about converting the doc to a blog or an email on the mailing list. But soliciting feedback is most important.
> > > > >
> > > > > One alternative path worth looking at is to encourage the use of LTO, and modify APIs so the configuration can be provided at linking time and propagated by the compiler.
> > > > >
> > > > > E.g. one can define rte_max_lcore as a weak constant symbol, equal to 128. At linking time the user may provide a rte_max_lcore that is more tailored, and still, dynamic arrays[rte_max_lcore] will be allocatable in the .bss section, avoiding an indirection. The compiler will be able to optimize loops etc., which is impossible with pure runtime configuration.
> > > > >
> > > > > In packetmill.io we actually pushed this to the next level, where the driver can completely change its behavior without recompiling DPDK itself and spawning ifdefs everywhere.
> > > > >
> > > > > However, the price is the slowness of LTO...
> > > > >
> > > > > My 2 cents.
> > > > >
> > > > > Tom
> > > >
> > > > If we are moving away from Compile Time parameters, I certainly prefer Tom's suggestion of Link Time parameters, rather than Run Time parameters.
> > >
> > > I think compile time constants are fine if they are not used in #ifdef. For ex: if they are used in 'if (...)', it will help eliminate code and branches.
> >
> > Yes!
> >
> > And "if (...)" is more flexible than #ifdef/#if because it allows the expression to be mixed with non-constants.
> >
> > Then perhaps Bruce's script to automatically make C constants out of #defines was not so silly anyway. :-)
> >
> > > > This might also provide a middle ground for optimizations where Compile Time parameters are considered unacceptable by the DPDK community. I'm thinking about something along the lines of the "constant size" rte_event array presented at the 2020 Userspace Summit by Harry
> > > > (https://static.sched.com/hosted_files/dpdkuserspace2020/d3/dpdk_userspace_20_api_performance_hvh.pdf). Taking this thinking even further out, a Link Time parameter could perhaps replace the nb_pkts parameter in an optimized rte_eth_rx_burst() function.
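
For illustration, here is a minimal sketch of the weak-symbol idea Tom describes, assuming a hypothetical rte_max_lcore link-time constant (not the existing RTE_MAX_LCORE macro); whether LTO actually propagates the resolved value this cleanly would need to be verified:

/* In a DPDK library file: a weak default that the linker keeps only if
 * the application does not provide its own definition. */
__attribute__((weak)) const unsigned int rte_max_lcore = 128;

/* In an application file (optional): a strong definition that the linker
 * prefers over the weak default above. */
const unsigned int rte_max_lcore = 4;

/* In any other file: with LTO, the value resolved at link time is visible
 * to the compiler, so loops like this can be unrolled or otherwise
 * optimized. */
extern const unsigned int rte_max_lcore;

void init_all_lcores(void (*init_lcore)(unsigned int))
{
	unsigned int i;

	for (i = 0; i < rte_max_lcore; i++)
		init_lcore(i);
}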
> > > >
> > > > Optimally, I would like to see e.g. the RX burst size being so constant that the PMD's RX function knows it and can use vector functions and possibly loop unrolling, without having to implement a pre-check on nb_pkts and a trailing non-vector loop for receiving any remaining odd nb_pkts. All the DPDK examples use #define MAX_PKT_BURST 32 or similar, and I assume most DPDK applications do too.
> >
> > I do not trust the compiler to be clever enough to realize that the PMD's RX function is always called with a specific nb_pkts and optimize all this cruft away at compile time (or at link time), unless it is a #define or a compile time constant.
>
> It certainly is not possible to do at compile time, because the calls are in a different compilation unit from the functions themselves, not to mention that at link time the RX functions are called via a function pointer.

Exactly. It would only work with a global #define or global compile time constant.
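
For context, the indirection Bruce refers to looks roughly like this (a simplified paraphrase of rte_eth_rx_burst() from rte_ethdev.h, with validation and callback handling omitted), which is why a constant nb_pkts at the call site cannot propagate into the PMD at compile time:

static inline uint16_t
rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
		struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
{
	struct rte_eth_dev *dev = &rte_eth_devices[port_id];

	/* Indirect call through a per-device function pointer: the compiler
	 * cannot see which PMD receive function will actually run here. */
	return (*dev->rx_pkt_burst)(dev->data->rx_queues[queue_id],
			rx_pkts, nb_pkts);
}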
> Therefore the only way to do this that I am aware of is to have a wrapper function used for the common values inside the drivers themselves.
>
> For example, inside the i40e driver (which I'm using because it's the one I'm most familiar with), the main receive function is already a wrapper around a raw receive function, using constant-expansion by the compiler of the final parameter (NULL) to automatically remove the code for tracking scattered packets.
>
> uint16_t
> i40e_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
> 		uint16_t nb_pkts)
> {
> 	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
> }
>
> We can produce a version of this optimized for 32-element dequeues by special-casing where nb_pkts == 32:
>
> uint16_t
> i40e_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
> 		uint16_t nb_pkts)
> {
> 	if (nb_pkts == 32)
> 		return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, 32, NULL);
> 	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
> }
>
> Now the compiler, when inlining the _raw_ function, can see that it needs two copies, and for the first, that nb_pkts is a compile-time constant of 32.

Great example! Now, consider this modification to your example, where rte_eth_rx_burst_size is a global constant variable that can be evaluated at compile time:

uint16_t
i40e_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
		uint16_t nb_pkts)
{
-	if (nb_pkts == 32)
-		return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, 32, NULL);
+	if (nb_pkts == rte_eth_rx_burst_size)
+		return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts,
+				rte_eth_rx_burst_size, NULL);
	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
}

> However, I'm not sure how useful an optimization like this is, and I'd be interested to see what benefits testing shows.

Yes, performance test results would be beneficial.

> Beyond the loop iteration count of 32, there is also the check after each burst of 8 dequeues in the driver that we have a full set of 8 - and abort the loop if not.

That loop could be optimized too, going from 8 to 32 - or going all the way to rte_eth_rx_burst_size.

> It's also the case that unless an app is already at maximum load (or overloaded), one would probably not expect to always get a full set of 32 packets each time, as you have no additional headroom for more.

Except if the application is designed to call rte_eth_rx_burst() less frequently to get a full burst as often as possible, using the NIC's many RX descriptors as a burst buffer. Working with full bursts throughout the application provides higher total performance per clock cycle.

> /Bruce
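
To make the application side of this concrete, here is a hypothetical sketch (the RTE_ETH_RX_BURST_SIZE name is assumed for illustration and is not an existing DPDK symbol; it could equally be the link-time constant discussed above) of an application that always requests the full, constant burst and lets the NIC's RX descriptor ring buffer packets between calls:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Hypothetical global burst size, shared by the whole application. */
#define RTE_ETH_RX_BURST_SIZE 32

static void
rx_loop(uint16_t port_id, uint16_t queue_id)
{
	struct rte_mbuf *pkts[RTE_ETH_RX_BURST_SIZE];
	uint16_t nb_rx;

	for (;;) {
		/* Always ask for the full, constant burst size; packets
		 * accumulate in the RX descriptor ring between calls. */
		nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts,
				RTE_ETH_RX_BURST_SIZE);
		if (nb_rx == 0)
			continue;

		/* Placeholder for real processing of the (ideally full-size)
		 * burst; here the packets are simply freed. */
		rte_pktmbuf_free_bulk(pkts, nb_rx);
	}
}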