> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Wednesday, April 7, 2021 11:59 AM
>
> On Wed, Apr 07, 2021 at 09:11:23AM +0200, Morten Brørup wrote:
> > > From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Honnappa Nagarahalli
> > > Sent: Wednesday, April 7, 2021 2:48 AM
> > >
> > > <snip>
> > >
> > > > > From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Tom Barbette
> > > > > Sent: Wednesday, March 31, 2021 10:53 AM
> > > > >
> > > > > On 31-03-21 02:44, Honnappa Nagarahalli wrote:
> > > > > > - Ability to tune the values of #defines
> > > > > >     * Few prominent points discussed
> > > > > >         - This will result in #ifdefs in the code (for ex: in testpmd)
> > > > > >         - One option is for all the PMDs to document their configurable #defines in PMD specific header files. Having these distributed is much easier to search.
> > > > > >         - Can some of the existing #defines be converted to runtime configurations? For ex: RTE_MAX_LCORE? This might impact ABI.
> > > > > >     * Bruce to think about converting the doc to a blog or an email on the mailing list. But soliciting feedback is most important.
> > > > >
> > > > > One alternative path worth looking at is to encourage the use of LTO, and modify APIs so the configuration can be provided at linking time and propagated by the compiler.
> > > > >
> > > > > E.g. one can define rte_max_lcore as a weak constant symbol, equal to 128. At linking time the user may provide a rte_max_lcore that is more tailored, and still, dynamic arrays[rte_max_lcore] will be allocatable in the .bss section, avoiding an indirection. The compiler will be able to optimize loops etc., which is impossible with pure runtime configuration.
> > > > >
> > > > > In packetmill.io we actually pushed this to the next level, where the driver can completely change its behavior without recompiling DPDK itself and spawning ifdefs everywhere.
> > > > >
> > > > > However, the price is the slowness of LTO...
> > > > >
> > > > > My 2 cents.
> > > > >
> > > > > Tom
> > > >
> > > > If we are moving away from Compile Time parameters, I certainly prefer Tom's suggestion of Link Time parameters, rather than Run Time parameters.
> > >
> > > I think compile time constants are fine if they are not used in #ifdef. For ex: if they are used in 'if (...)', it will help eliminate code and branches.
> >
> > Yes!
> >
> > And "if (...)" is more flexible than #ifdef/#if because it allows the expression to be mixed with non-constants.
> >
> > Then perhaps Bruce's script to automatically make C constants out of #defines was not so silly anyway. :-)
> >
> > > > This might also provide a middle ground for optimizations where Compile Time parameters are considered unacceptable by the DPDK community. I'm thinking about something along the lines of the "constant size" rte_event array presented at the 2020 Userspace Summit by Harry
> > > > (https://static.sched.com/hosted_files/dpdkuserspace2020/d3/dpdk_userspace_20_api_performance_hvh.pdf). Taking this thinking even further out, a Link Time parameter could perhaps replace the nb_pkts parameter in an optimized rte_eth_rx_burst() function.
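
For illustration, here is a minimal sketch of the weak-symbol idea Tom describes, assuming a hypothetical rte_max_lcore link-time constant (not the existing RTE_MAX_LCORE macro); whether LTO actually propagates the resolved value this cleanly would need to be verified:

/* In a DPDK library file: a weak default that the linker keeps only if
 * the application does not provide its own definition. */
__attribute__((weak)) const unsigned int rte_max_lcore = 128;

/* In an application file (optional): a strong definition that the linker
 * prefers over the weak default above. */
const unsigned int rte_max_lcore = 4;

/* In any other file: with LTO, the value resolved at link time is visible
 * to the compiler, so loops like this can be unrolled or otherwise
 * optimized. */
extern const unsigned int rte_max_lcore;

void init_all_lcores(void (*init_lcore)(unsigned int))
{
	unsigned int i;

	for (i = 0; i < rte_max_lcore; i++)
		init_lcore(i);
}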
> > > >
> > > > Optimally, I would like to see e.g. the RX burst size being so constant that the PMD's RX function knows it and can use vector functions and possibly loop unrolling, without having to implement a pre-check on nb_pkts and a trailing non-vector loop for receiving any remaining odd nb_pkts. All the DPDK examples use #define MAX_PKT_BURST 32 or similar, and I assume most DPDK applications do too.
> >
> > I do not trust the compiler to be clever enough to realize that the PMD's RX function is always called with a specific nb_pkts and optimize all this cruft away at compile time (or at link time), unless it is a #define or a compile time constant.
>
> It certainly is not possible to do at compile time, because the calls are in a different compilation unit from the functions themselves, not to mention that at link time the RX functions are called via a function pointer.

Exactly. It would only work with a global #define or global compile time constant.
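
For context, the indirection Bruce refers to looks roughly like this (a simplified paraphrase of rte_eth_rx_burst() from rte_ethdev.h, with validation and callback handling omitted), which is why a constant nb_pkts at the call site cannot propagate into the PMD at compile time:

static inline uint16_t
rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
		struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
{
	struct rte_eth_dev *dev = &rte_eth_devices[port_id];

	/* Indirect call through a per-device function pointer: the compiler
	 * cannot see which PMD receive function will actually run here. */
	return (*dev->rx_pkt_burst)(dev->data->rx_queues[queue_id],
			rx_pkts, nb_pkts);
}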
> Therefore the only way to do this that I am aware of is to have a wrapper function used for the common values inside the drivers themselves.
>
> For example, inside the i40e driver (which I'm using because it's the one I'm most familiar with), the main receive function is already a wrapper around a raw receive function, using constant-expansion by the compiler of the final parameter (NULL) to automatically remove the code for tracking scattered packets.
>
> uint16_t
> i40e_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
> 		uint16_t nb_pkts)
> {
> 	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
> }
>
> We can produce a version of this optimized for 32-element dequeues by special-casing where nb_pkts == 32:
>
> uint16_t
> i40e_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
> 		uint16_t nb_pkts)
> {
> 	if (nb_pkts == 32)
> 		return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, 32, NULL);
> 	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
> }
>
> Now the compiler, when inlining the _raw_ function, can see that it needs two copies, and for the first, that nb_pkts is a compile-time constant of 32.

Great example! Now, consider this modification to your example, where rte_eth_rx_burst_size is a global constant variable that can be evaluated at compile time:

uint16_t
i40e_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
		uint16_t nb_pkts)
{
-	if (nb_pkts == 32)
-		return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, 32, NULL);
+	if (nb_pkts == rte_eth_rx_burst_size)
+		return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts,
+				rte_eth_rx_burst_size, NULL);
	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
}

> However, I'm not sure how useful an optimization like this is, and I'd be interested to see what benefits testing shows.

Yes, performance test results would be beneficial.

> Beyond the loop iteration count of 32, there is also the check after each burst of 8 dequeues in the driver that we have a full set of 8 - and abort the loop if not.

That loop could be optimized too, going from 8 to 32 - or going all the way to rte_eth_rx_burst_size.

> It's also the case that unless an app is already at maximum load (or overloaded), one would probably not expect to always get a full set of 32 packets each time, as you have no additional headroom for more.

Except if the application is designed to call rte_eth_rx_burst() less frequently to get a full burst as often as possible, using the NIC's many RX descriptors as a burst buffer. Working with full bursts throughout the application provides higher total performance per clock cycle.

> /Bruce
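
To make the application side of this concrete, here is a hypothetical sketch (the RTE_ETH_RX_BURST_SIZE name is assumed for illustration and is not an existing DPDK symbol; it could equally be the link-time constant discussed above) of an application that always requests the full, constant burst and lets the NIC's RX descriptor ring buffer packets between calls:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Hypothetical global burst size, shared by the whole application. */
#define RTE_ETH_RX_BURST_SIZE 32

static void
rx_loop(uint16_t port_id, uint16_t queue_id)
{
	struct rte_mbuf *pkts[RTE_ETH_RX_BURST_SIZE];
	uint16_t nb_rx;

	for (;;) {
		/* Always ask for the full, constant burst size; packets
		 * accumulate in the RX descriptor ring between calls. */
		nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts,
				RTE_ETH_RX_BURST_SIZE);
		if (nb_rx == 0)
			continue;

		/* Placeholder for real processing of the (ideally full-size)
		 * burst; here the packets are simply freed. */
		rte_pktmbuf_free_bulk(pkts, nb_rx);
	}
}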