> On 6/17/2021 4:17 PM, Morten Brørup wrote: > >> From: Ananyev, Konstantin [mailto:konstantin.anan...@intel.com] > >> Sent: Thursday, 17 June 2021 16.59 > >> > >>>>>> > >>>>>> 14/06/2021 15:15, Bruce Richardson: > >>>>>>> On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote: > >>>>>>>>> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Thomas > >> Monjalon > >>>>>>>>> Sent: Monday, 14 June 2021 12.59 > >>>>>>>>> > >>>>>>>>> Performance of access in a fixed-size array is very good > >>>>>>>>> because of cache locality > >>>>>>>>> and because there is a single pointer to dereference. > >>>>>>>>> The only drawback is the lack of flexibility: > >>>>>>>>> the size of such an array cannot be increase at runtime. > >>>>>>>>> > >>>>>>>>> An approach to this problem is to allocate the array at > >> runtime, > >>>>>>>>> being as efficient as static arrays, but still limited to a > >> maximum. > >>>>>>>>> > >>>>>>>>> That's why the API rte_parray is introduced, > >>>>>>>>> allowing to declare an array of pointer which can be resized > >>>>>>>>> dynamically > >>>>>>>>> and automatically at runtime while keeping a good read > >> performance. > >>>>>>>>> > >>>>>>>>> After resize, the previous array is kept until the next resize > >>>>>>>>> to avoid crashs during a read without any lock. > >>>>>>>>> > >>>>>>>>> Each element is a pointer to a memory chunk dynamically > >> allocated. > >>>>>>>>> This is not good for cache locality but it allows to keep the > >> same > >>>>>>>>> memory per element, no matter how the array is resized. > >>>>>>>>> Cache locality could be improved with mempools. > >>>>>>>>> The other drawback is having to dereference one more pointer > >>>>>>>>> to read an element. > >>>>>>>>> > >>>>>>>>> There is not much locks, so the API is for internal use only. > >>>>>>>>> This API may be used to completely remove some compilation- > >> time > >>>>>>>>> maximums. > >>>>>>>> > >>>>>>>> I get the purpose and overall intention of this library. > >>>>>>>> > >>>>>>>> I probably already mentioned that I prefer "embedded style > >> programming" with fixed size arrays, rather than runtime > >> configurability. > >>>>> It's > >>>>>> my personal opinion, and the DPDK Tech Board clearly prefers > >> reducing the amount of compile time configurability, so there is no way > >>> for > >>>>>> me to stop this progress, and I do not intend to oppose to this > >> library. :-) > >>>>>>>> > >>>>>>>> This library is likely to become a core library of DPDK, so I > >> think it is important getting it right. Could you please mention a few > >>>>> examples > >>>>>> where you think this internal library should be used, and where > >> it should not be used. Then it is easier to discuss if the border line > >>> between > >>>>>> control path and data plane is correct. E.g. this library is not > >> intended to be used for dynamically sized packet queues that grow and > >>> shrink > >>>>> in > >>>>>> the fast path. > >>>>>>>> > >>>>>>>> If the library becomes a core DPDK library, it should probably > >> be public instead of internal. E.g. if the library is used to make > >>>>>> RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some > >> applications might also need dynamically sized arrays for their > >>>>>> application specific per-port runtime data, and this library > >> could serve that purpose too. > >>>>>>>> > >>>>>>> > >>>>>>> Thanks Thomas for starting this discussion and Morten for > >> follow-up. > >>>>>>> > >>>>>>> My thinking is as follows, and I'm particularly keeping in mind > >> the cases > >>>>>>> of e.g. RTE_MAX_ETHPORTS, as a leading candidate here. > >>>>>>> > >>>>>>> While I dislike the hard-coded limits in DPDK, I'm also not > >> convinced that > >>>>>>> we should switch away from the flat arrays or that we need fully > >> dynamic > >>>>>>> arrays that grow/shrink at runtime for ethdevs. I would suggest > >> a half-way > >>>>>>> house here, where we keep the ethdevs as an array, but one > >> allocated/sized > >>>>>>> at runtime rather than statically. This would allow us to have a > >>>>>>> compile-time default value, but, for use cases that need it, > >> allow use of a > >>>>>>> flag e.g. "max-ethdevs" to change the size of the parameter > >> given to the > >>>>>>> malloc call for the array. This max limit could then be > >> provided to apps > >>>>>>> too if they want to match any array sizes. [Alternatively those > >> apps could > >>>>>>> check the provided size and error out if the size has been > >> increased beyond > >>>>>>> what the app is designed to use?]. There would be no extra > >> dereferences per > >>>>>>> rx/tx burst call in this scenario so performance should be the > >> same as > >>>>>>> before (potentially better if array is in hugepage memory, I > >> suppose). > >>>>>> > >>>>>> I think we need some benchmarks to decide what is the best > >> tradeoff. > >>>>>> I spent time on this implementation, but sorry I won't have time > >> for benchmarks. > >>>>>> Volunteers? > >>>>> > >>>>> I had only a quick look at your approach so far. > >>>>> But from what I can read, in MT environment your suggestion will > >> require > >>>>> extra synchronization for each read-write access to such parray > >> element (lock, rcu, ...). > >>>>> I think what Bruce suggests will be much ligther, easier to > >> implement and less error prone. > >>>>> At least for rte_ethdevs[] and friends. > >>>>> Konstantin > >>>> > >>>> One more thought here - if we are talking about rte_ethdev[] in > >> particular, I think we can: > >>>> 1. move public function pointers (rx_pkt_burst(), etc.) from > >> rte_ethdev into a separate flat array. > >>>> We can keep it public to still use inline functions for 'fast' > >> calls rte_eth_rx_burst(), etc. to avoid > >>>> any regressions. > >>>> That could still be flat array with max_size specified at > >> application startup. > >>>> 2. Hide rest of rte_ethdev struct in .c. > >>>> That will allow us to change the struct itself and the whole > >> rte_ethdev[] table in a way we like > >>>> (flat array, vector, hash, linked list) without ABI/API breakages. > >>>> > >>>> Yes, it would require all PMDs to change prototype for > >> pkt_rx_burst() function > >>>> (to accept port_id, queue_id instead of queue pointer), but the > >> change is mechanical one. > >>>> Probably some macro can be provided to simplify it. > >>>> > >>> > >>> We are already planning some tasks for ABI stability for v21.11, I > >> think > >>> splitting 'struct rte_eth_dev' can be part of that task, it enables > >> hiding more > >>> internal data. > >> > >> Ok, sounds good. > >> > >>> > >>>> The only significant complication I can foresee with implementing > >> that approach - > >>>> we'll need a an array of 'fast' function pointers per queue, not > >> per device as we have now > >>>> (to avoid extra indirection for callback implementation). > >>>> Though as a bonus we'll have ability to use different RX/TX > >> funcions per queue. > >>>> > >>> > >>> What do you think split Rx/Tx callback into its own struct too? > >>> > >>> Overall 'rte_eth_dev' can be split into three as: > >>> 1. rte_eth_dev > >>> 2. rte_eth_dev_burst > >>> 3. rte_eth_dev_cb > >>> > >>> And we can hide 1 from applications even with the inline functions. > >> > >> As discussed off-line, I think: > >> it is possible. > >> My absolute preference would be to have just 1/2 (with CB hidden). > >> But even with 1/2/3 in place I think it would be a good step forward. > >> Probably worth to start with 1/2/3 first and then see how difficult it > >> would be to switch to 1/2. > >> Do you plan to start working on it? > >> > >> Konstantin > > > > If you do proceed with this, be very careful. E.g. the inlined rx/tx burst > > functions should not touch more cache lines than they do today - > especially if there are many active ports. The inlined rx/tx burst functions > are very simple, so thorough code review (and possibly also of the > resulting assembly) is appropriate. Simple performance testing might not > detect if more cache lines are accessed than before the > modifications. > > > > Don't get me wrong... I do consider this an improvement of the ethdev > > library; I'm only asking you to take extra care! > > > > ack > > If we split as above, I think device specific data 'struct rte_eth_dev_data' > should be part of 1 (rte_eth_dev). Which means Rx/Tx inline functions access > additional cache line. > > To prevent this, what about duplicating 'data' in 2 (rte_eth_dev_burst)?
I think it would be better to change rx_pkt_burst() to accept port_id and queue_id, instead of void *. I.E: typedef uint16_t (*eth_rx_burst_t)(uint16_t port_id, uint16_t queue_id, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); And we can do actual de-referencing of private rxq data inside the actual rx function. > We have > enough space for it to fit into single cache line, currently it is: > struct rte_eth_dev { > eth_rx_burst_t rx_pkt_burst; /* 0 8 */ > eth_tx_burst_t tx_pkt_burst; /* 8 8 */ > eth_tx_prep_t tx_pkt_prepare; /* 16 8 */ > eth_rx_queue_count_t rx_queue_count; /* 24 8 */ > eth_rx_descriptor_done_t rx_descriptor_done; /* 32 8 */ > eth_rx_descriptor_status_t rx_descriptor_status; /* 40 8 */ > eth_tx_descriptor_status_t tx_descriptor_status; /* 48 8 */ > struct rte_eth_dev_data * data; /* 56 8 */ > /* --- cacheline 1 boundary (64 bytes) --- */ > > 'rx_descriptor_done' is deprecated and will be removed;