> > From: Ferruh Yigit [mailto:ferruh.yi...@amd.com] > > Sent: Monday, 15 April 2024 17.08 > > > > On 4/12/2024 3:44 PM, Morten Brørup wrote: > > >>>>>>>> Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum > > >> offload > > >>>>>>>> examples. > > >>>>>>> > > >>>>>>> I strongly disagree with this change! > > >>>>>>> > > >>>>>>> It will cause a huge performance degradation for shaping > > >> applications: > > >>>>>>> > > >>>>>>> A packet will be processed and finalized at an output or > > >> forwarding > > >>>>>> pipeline stage, where some other fields might also be written, > > >> so > > >>>>>>> zeroing e.g. the out_ip checksum at this stage has low cost > > >> (no new > > >>>>>> cache misses). > > >>>>>>> > > >>>>>>> Then, the packet might be queued for QoS or similar. > > >>>>>>> > > >>>>>>> If rte_eth_tx_prepare() must be called at the egress pipeline > > >> stage, > > >>>>>> it has to write to the packet and cause a cache miss per packet, > > >>>>>>> instead of simply passing on the packet to the NIC hardware. > > >>>>>>> > > >>>>>>> It must be possible to finalize the packet at the > > >> output/forwarding > > >>>>>> pipeline stage! > > >>>>>> > > >>>>>> If you can finalize your packet on output/forwarding, then why > > >> you > > >>>>>> can't invoke tx_prepare() on the same stage? > > >>>>>> There seems to be some misunderstanding about what tx_prepare() > > >> does - > > >>>>>> in fact it doesn't communicate with HW queue (doesn't update TXD > > >> ring, > > >>>>>> etc.), what it does - just make changes in mbuf itself. > > >>>>>> Yes, it reads some fields in SW TX queue struct (max number of > > >> TXDs per > > >>>>>> packet, etc.), but AFAIK it is safe > > >>>>>> to call tx_prepare() and tx_burst() from different threads. > > >>>>>> At least on implementations I am aware about. > > >>>>>> Just checked the docs - it seems not stated explicitly anywhere, > > >> might > > >>>>>> be that's why it causing such misunderstanding. > > >>>>>> > > >>>>>>> > > >>>>>>> Also, how is rte_eth_tx_prepare() supposed to work for cloned > > >> packets > > >>>>>> egressing on different NIC hardware? > > >>>>>> > > >>>>>> If you create a clone of full packet (including L2/L3) headers > > >> then > > >>>>>> obviously such construction might not > > >>>>>> work properly with tx_prepare() over two different NICs. > > >>>>>> Though In majority of cases you do clone segments with data, > > >> while at > > >>>>>> least L2 headers are put into different segments. > > >>>>>> One simple approach would be to keep L3 header in that separate > > >> segment. > > >>>>>> But yes, there is a problem when you'll need to send exactly the > > >> same > > >>>>>> packet over different NICs. > > >>>>>> As I remember, for bonding PMD things don't work quite well here > > >> - you > > >>>>>> might have a bond over 2 NICs with > > >>>>>> different tx_prepare() and which one to call might be not clear > > >> till > > >>>>>> actual PMD tx_burst() is invoked. > > >>>>>> > > >>>>>>> > > >>>>>>> In theory, it might get even worse if we make this opaque > > >> instead of > > >>>>>> transparent and standardized: > > >>>>>>> One PMD might reset out_ip checksum to 0x0000, and another PMD > > >> might > > >>>>>> reset it to 0xFFFF. > > >>>>>> > > >>>>>>> > > >>>>>>> I can only see one solution: > > >>>>>>> We need to standardize on common minimum requirements for how > > >> to > > >>>>>> prepare packets for each TX offload. > > >>>>>> > > >>>>>> If we can make each and every vendor to agree here - that > > >> definitely > > >>>>>> will help to simplify things quite a bit. > > >>>>> > > >>>>> An API is more than a function name and parameters. > > >>>>> It also has preconditions and postconditions. > > >>>>> > > >>>>> All major NIC vendors are contributing to DPDK. > > >>>>> It should be possible to reach consensus for reasonable minimum > > >> requirements > > >>>> for offloads. > > >>>>> Hardware- and driver-specific exceptions can be documented with > > >> the offload > > >>>> flag, or with rte_eth_rx/tx_burst(), like the note to > > >>>>> rte_eth_rx_burst(): > > >>>>> "Some drivers using vector instructions require that nb_pkts is > > >> divisible by > > >>>> 4 or 8, depending on the driver implementation." > > >>>> > > >>>> If we introduce a rule that everyone supposed to follow and then > > >> straightway > > >>>> allow people to have a 'documented exceptions', > > >>>> for me it means like 'no rule' in practice. > > >>>> A 'documented exceptions' approach might work if you have 5 > > >> different PMDs to > > >>>> support, but not when you have 50+. > > >>>> No-one would write an app with possible 10 different exception > > cases > > >> in his > > >>>> head. > > >>>> Again, with such approach we can forget about backward > > >> compatibility. > > >>>> I think we already had this discussion before, my opinion remains > > >> the same > > >>>> here - > > >>>> 'documented exceptions' approach is a way to trouble. > > >>> > > >>> The "minimum requirements" should be the lowest common denominator > > of > > >> all NICs. > > >>> Exceptions should be extremely few, for outlier NICs that still want > > >> to provide an offload and its driver is unable to live up to the > > >>> minimum requirements. > > >>> Any exception should require techboard approval. If a NIC/driver > > does > > >> not support the "minimum requirements" for an offload > > >>> feature, it is not allowed to claim support for that offload > > feature, > > >> or needs to seek approval for an exception. > > >>> > > >>> As another option for NICs not supporting the minimum requirements > > of > > >> an offload feature, we could introduce offload flags with > > >>> finer granularity. E.g. one offload flag for "gold standard" TX > > >> checksum update (where the packet's checksum field can have any > > >>> value), and another offload flag for "silver standard" TX checksum > > >> update (where the packet's checksum field must have a > > >>> precomputed value). > > >> > > >> Actually yes, I was thinking in the same direction - we need some > > extra > > >> API to allow user to distinguish. > > >> Probably we can do something like that: a new API for the ethdev call > > >> that would take as a parameter > > >> TX offloads bitmap and in return specify would it need to modify > > >> contents of packet to support these > > >> offloads or not. > > >> Something like: > > >> int rte_ethdev_tx_offload_pkt_mod_required(unt64_t tx_offloads) > > >> > > >> For the majority of the drivers that satisfy these "minimum > > >> requirements" corresponding devops > > >> entry will be empty and we'll always return 0, otherwise PMD has to > > >> provide a proper devop. > > >> Then again, it would be up to the user, to determine can he pass same > > >> packet to 2 different NICs or not. > > >> > > >> I suppose it is similar to what you were talking about? > > > > > > I was thinking something more simple: > > > > > > The NIC exposes its RX and TX offload capabilities to the application > > through the rx/tx_offload_capa and other fields in the rte_eth_dev_info > > structure returned by rte_eth_dev_info_get(). > > > > > > E.g. tx_offload_capa might have the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM flag > > set. > > > These capability flags (or enums) are mostly undocumented in the code, > > but I guess that the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability means that > > the NIC is able to update the IPv4 header checksum at egress (on the > > wire, i.e. without modifying the mbuf or packet data), and that the > > application must set RTE_MBUF_F_TX_IP_CKSUM in the mbufs to utilize this > > offload. > > > I would define and document what each capability flag/enum exactly > > means, the minimum requirements (as defined by the DPDK community) for > > the driver to claim support for it, and the requirements for an > > application to use it. > > > > > > > +1 to improve documentation, and clear offload where it is needed. > > > > Another gap is in testing, whenever a device/driver claims an offload > > capability, we don't have a test suit to confirm and verify this claim. > > Yep, conformance testing is lacking big time in our CI. > Adding the ts-factory tests to the CI was a big step in this direction. > But for now, we are mostly relying on vendor's internal testing. > > > > > > > > For the sake of discussion, let's say that > > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM means "gold standard" TX checksum update > > capability (i.e. no requirements to the checksum field in the packet > > contents). > > > If some NIC requires the checksum field in the packet contents to have > > a precomputed value, the NIC would not be allowed to claim the > > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability. > > > Such a NIC would need to define and document a new capability, e.g. > > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED, for the "silver standard" TX > > checksum update capability. > > > > > > In other words: I would encode variations of offload capabilities > > directly in the capabilities flags. > > > Then we don't need additional APIs to help interpret those > > capabilities. > > > > > > This way, the application can probe the NIC capabilities to determine > > what can be offloaded, and how to do it. > > > > > > The application can be designed to: > > > 1. use a common packet processing pipeline, utilizing only the lowest > > common capabilities denominator of all detected NICs, or > > > 2. use a packet processing pipeline, handling packets differently > > according to the capabilities of the involved NICs. > > > > > > > Offload capabilities are already provided to enable applications as you > > mentioned above. > > > > Agree that '_ASSISTED' capability flags gives more details to > > application to manage it but my concern is it complicates offloading > > more. > > > > The number of "assisted" offloads is not much, mainly it is for cksum. > > Current approach is simpler, devices requires precondition implements it > > in 'tx_prepare' and application using these offloads calls before > > 'tx_burst'. Device/drivers doesn't require it don't have 'tx_prepare' so > > no impact to the application. > > > > Will it work to have something in between, another capability to hold if > > 'tx_prepare' needs to be called with this device or not? > > It can be possible to make this variable 32bits that holds offloads > > require assistance in a bit-wise way, but I prefer simple if > > 'tx_prepare' required flag, this may help application to decide to use > > offloads in that device and to call 'tx_prepare' or not. > > Consider an IP Multicast packet to be transmitted on many ports. > > With my suggestion, the packet can be prepared once - i.e. the > output/forwarding stage sets the IP header checksum as required by > the (lowest common denominator) offload capabilities flag - before being > cloned for tx() on each port. > > With tx_prepare(), the packet needs to be copied (deep copy!) for each port, > and then tx_prepare() needs to be called for each port > before tx().
Not really. At least not always. Let say for multicast, you'll most likely will need to have a different L2 header for each ethdev port anyway. Usual way to overcome it - have a separate segment for L2 header attached to the rest of the packet data segment. Nothing stops you to have inner/outer L3/L4 headers in that 'header' segment too. Then you do need to have a multiple copies of data segment. Actually after another thought - that might be a simple way to fix problem with bonding PMD: allow it to create a 'header' segments on need. Not sure how it will affect performance, but should help to avoid current problem when packet contents are modified in tx_burst(). > The opaque tx_prepare() concept violates the concept of not modifying the > packet at egress. That actually depends where you draw a line for egress: before or after tx_prepare(). > DPDK follows a principle of not using opaque types, so application developers > can optimize accordingly. The opaque behavior of > tx_prepare() violates this principle, and I consider it a horrible hack! > > > > > NB: There may be other variations than requiring packet contents to be > > modified, and they might be granular. > > > E.g. a NIC might require assistance for TCP/UDP checksum offload, but > > not for IP checksum offload, so a function telling if packet contents > > requires modification would not suffice. > > > E.g. RTE_ETH_TX_OFFLOAD_MULTI_SEGS is defined, but the > > rte_eth_dev_info structure doesn't expose information about the max > > number of segments it can handle. > > > > > > > Another good point to report max number of segment can be handled, +1 > > > > > > > PS: For backwards compatibility, we might define > > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM as the "silver standard" offload to > > support the current "minimum requirements", and add > > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ANY for the "gold standard" offload. > > > > > > > > >> > > >>> For reference, consider RSS, where the feature support flags have > > very > > >> high granularity. > > >>> > > >>>> > > >>>>> You mention the bonding driver, which is a good example. > > >>>>> The rte_eth_tx_burst() documentation has a note about the API > > >> postcondition > > >>>> exception for the bonding driver: > > >>>>> "This function must not modify mbufs (including packets data) > > >> unless the > > >>>> refcnt is 1. An exception is the bonding PMD, [...], mbufs > > >>>>> may be modified." > > >>>> > > >>>> For me, what we've done for bonding tx_prepare/tx_burst() is a > > >> really bad > > >>>> example. > > >>>> Initial agreement and design choice was that tx_burst() should not > > >> modify > > >>>> contents of the packets > > >>>> (that actually was one of the reasons why tx_prepare() was > > >> introduced). > > >>>> The only reason I agreed on that exception - because I couldn't > > >> come-up with > > >>>> something less uglier. > > >>>> > > >>>> Actually, these problems with bonding PMD made me to start thinking > > >> that > > >>>> current > > >>>> tx_prepare/tx_burst approach might need to be reconsidered somehow. > > >>> > > >>> In cases where a preceding call to tx_prepare() is required, how is > > it > > >> worse modifying the packet in tx_burst() than modifying the > > >>> packet in tx_prepare()? > > >>> > > >>> Both cases violate the postcondition that packets are not modified > > at > > >> egress. > > >>> > > >>>> > > >>>>>> Then we can probably have one common tx_prepare() for all > > >> vendors ;) > > >>>>> > > >>>>> Yes, that would be the goal. > > >>>>> More realistically, the ethdev layer could perform the common > > >> checks, and > > >>>> only the non-conforming drivers would have to implement > > >>>>> their specific tweaks. > > >>>> > > >>>> Hmm, but that's what we have right now: > > >>>> - fields in mbuf and packet data that user has to fill correctly > > and > > >> dev > > >>>> specific tx_prepare(). > > >>>> How what you suggest will differ then? > > >>> > > >>> You're 100 % right here. We could move more checks into the ethdev > > >> layer, specifically checks related to the "minimum > > >>> requirements". > > >>> > > >>>> And how it will help let say with bonding PMD situation, or with > > TX- > > >> ing of the > > >>>> same packet over 2 different NICs? > > >>> > > >>> The bonding driver is broken. > > >>> It can only be fixed by not violating the egress postcondition in > > >> either tx_burst() or tx_prepare(). > > >>> "Minimum requirements" might help doing that. > > >>> > > >>>> > > >>>>> If we don't standardize the meaning of the offload flags, the > > >> application > > >>>> developers cannot trust them! > > >>>>> I'm afraid this is the current situation - application developers > > >> either > > >>>> test with specific NIC hardware, or don't use the offload features. > > >>>> > > >>>> Well, I have used TX offloads through several projects, it worked > > >> quite well. > > >>> > > >>> That is good to hear. > > >>> And I don't oppose to that. > > >>> > > >>> In this discussion, I am worried about the roadmap direction for > > DPDK. > > >>> I oppose to the concept of requiring calling tx_prepare() before > > >> calling tx_burst() when using offload. I think it is conceptually > > wrong, > > >>> and breaks the egress postcondition. > > >>> I propose "minimum requirements" as a better solution. > > >>> > > >>>> Though have to admit, never have to use TX offloads together with > > >> our bonding > > >>>> PMD. > > >>>> > > >