> > > > > > > > > > > Mandate use of rte_eth_tx_prepare() in the mbuf Tx > > > checksum > > > > > offload > > > > > > > > > > > examples. > > > > > > > > > > > > > > > > > > > > I strongly disagree with this change! > > > > > > > > > > > > > > > > > > > > It will cause a huge performance degradation for shaping > > > > > applications: > > > > > > > > > > > > > > > > > > > > A packet will be processed and finalized at an output or > > > > > forwarding > > > > > > > > > pipeline stage, where some other fields might also be > > > written, > > > > > so > > > > > > > > > > zeroing e.g. the out_ip checksum at this stage has low > > > cost > > > > > (no new > > > > > > > > > cache misses). > > > > > > > > > > > > > > > > > > > > Then, the packet might be queued for QoS or similar. > > > > > > > > > > > > > > > > > > > > If rte_eth_tx_prepare() must be called at the egress > > > pipeline > > > > > stage, > > > > > > > > > it has to write to the packet and cause a cache miss per > > > packet, > > > > > > > > > > instead of simply passing on the packet to the NIC > > > hardware. > > > > > > > > > > > > > > > > > > > > It must be possible to finalize the packet at the > > > > > output/forwarding > > > > > > > > > pipeline stage! > > > > > > > > > > > > > > > > > > If you can finalize your packet on output/forwarding, then > > > why > > > > > you > > > > > > > > > can't invoke tx_prepare() on the same stage? > > > > > > > > > There seems to be some misunderstanding about what > > > tx_prepare() > > > > > does - > > > > > > > > > in fact it doesn't communicate with HW queue (doesn't update > > > TXD > > > > > ring, > > > > > > > > > etc.), what it does - just make changes in mbuf itself. > > > > > > > > > Yes, it reads some fields in SW TX queue struct (max number > > > of > > > > > TXDs per > > > > > > > > > packet, etc.), but AFAIK it is safe > > > > > > > > > to call tx_prepare() and tx_burst() from different threads. > > > > > > > > > At least on implementations I am aware about. > > > > > > > > > Just checked the docs - it seems not stated explicitly > > > anywhere, > > > > > might > > > > > > > > > be that's why it causing such misunderstanding. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Also, how is rte_eth_tx_prepare() supposed to work for > > > cloned > > > > > packets > > > > > > > > > egressing on different NIC hardware? > > > > > > > > > > > > > > > > > > If you create a clone of full packet (including L2/L3) > > > headers > > > > > then > > > > > > > > > obviously such construction might not > > > > > > > > > work properly with tx_prepare() over two different NICs. > > > > > > > > > Though In majority of cases you do clone segments with data, > > > > > while at > > > > > > > > > least L2 headers are put into different segments. > > > > > > > > > One simple approach would be to keep L3 header in that > > > separate > > > > > segment. > > > > > > > > > But yes, there is a problem when you'll need to send exactly > > > the > > > > > same > > > > > > > > > packet over different NICs. > > > > > > > > > As I remember, for bonding PMD things don't work quite well > > > here > > > > > - you > > > > > > > > > might have a bond over 2 NICs with > > > > > > > > > different tx_prepare() and which one to call might be not > > > clear > > > > > till > > > > > > > > > actual PMD tx_burst() is invoked. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > In theory, it might get even worse if we make this opaque > > > > > instead of > > > > > > > > > transparent and standardized: > > > > > > > > > > One PMD might reset out_ip checksum to 0x0000, and another > > > PMD > > > > > might > > > > > > > > > reset it to 0xFFFF. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I can only see one solution: > > > > > > > > > > We need to standardize on common minimum requirements for > > > how > > > > > to > > > > > > > > > prepare packets for each TX offload. > > > > > > > > > > > > > > > > > > If we can make each and every vendor to agree here - that > > > > > definitely > > > > > > > > > will help to simplify things quite a bit. > > > > > > > > > > > > > > > > An API is more than a function name and parameters. > > > > > > > > It also has preconditions and postconditions. > > > > > > > > > > > > > > > > All major NIC vendors are contributing to DPDK. > > > > > > > > It should be possible to reach consensus for reasonable > > > minimum > > > > > requirements > > > > > > > for offloads. > > > > > > > > Hardware- and driver-specific exceptions can be documented > > > with > > > > > the offload > > > > > > > flag, or with rte_eth_rx/tx_burst(), like the note to > > > > > > > > rte_eth_rx_burst(): > > > > > > > > "Some drivers using vector instructions require that nb_pkts > > > is > > > > > divisible by > > > > > > > 4 or 8, depending on the driver implementation." > > > > > > > > > > > > > > If we introduce a rule that everyone supposed to follow and then > > > > > straightway > > > > > > > allow people to have a 'documented exceptions', > > > > > > > for me it means like 'no rule' in practice. > > > > > > > A 'documented exceptions' approach might work if you have 5 > > > > > different PMDs to > > > > > > > support, but not when you have 50+. > > > > > > > No-one would write an app with possible 10 different exception > > > cases > > > > > in his > > > > > > > head. > > > > > > > Again, with such approach we can forget about backward > > > > > compatibility. > > > > > > > I think we already had this discussion before, my opinion > > > remains > > > > > the same > > > > > > > here - > > > > > > > 'documented exceptions' approach is a way to trouble. > > > > > > > > > > > > The "minimum requirements" should be the lowest common denominator > > > of > > > > > all NICs. > > > > > > Exceptions should be extremely few, for outlier NICs that still > > > want > > > > > to provide an offload and its driver is unable to live up to the > > > > > > minimum requirements. > > > > > > Any exception should require techboard approval. If a NIC/driver > > > does > > > > > not support the "minimum requirements" for an offload > > > > > > feature, it is not allowed to claim support for that offload > > > feature, > > > > > or needs to seek approval for an exception. > > > > > > > > > > > > As another option for NICs not supporting the minimum requirements > > > of > > > > > an offload feature, we could introduce offload flags with > > > > > > finer granularity. E.g. one offload flag for "gold standard" TX > > > > > checksum update (where the packet's checksum field can have any > > > > > > value), and another offload flag for "silver standard" TX checksum > > > > > update (where the packet's checksum field must have a > > > > > > precomputed value). > > > > > > > > > > Actually yes, I was thinking in the same direction - we need some > > > extra > > > > > API to allow user to distinguish. > > > > > Probably we can do something like that: a new API for the ethdev > > > call > > > > > that would take as a parameter > > > > > TX offloads bitmap and in return specify would it need to modify > > > > > contents of packet to support these > > > > > offloads or not. > > > > > Something like: > > > > > int rte_ethdev_tx_offload_pkt_mod_required(unt64_t tx_offloads) > > > > > > > > > > For the majority of the drivers that satisfy these "minimum > > > > > requirements" corresponding devops > > > > > entry will be empty and we'll always return 0, otherwise PMD has to > > > > > provide a proper devop. > > > > > Then again, it would be up to the user, to determine can he pass > > > same > > > > > packet to 2 different NICs or not. > > > > > > > > > > I suppose it is similar to what you were talking about? > > > > > > > > I was thinking something more simple: > > > > > > > > The NIC exposes its RX and TX offload capabilities to the application > > > through the rx/tx_offload_capa and other fields in the > > > > rte_eth_dev_info structure returned by rte_eth_dev_info_get(). > > > > > > > > E.g. tx_offload_capa might have the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM flag > > > set. > > > > These capability flags (or enums) are mostly undocumented in the code, > > > but I guess that the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM > > > > capability means that the NIC is able to update the IPv4 header > > > checksum at egress (on the wire, i.e. without modifying the mbuf or > > > > packet data), and that the application must set RTE_MBUF_F_TX_IP_CKSUM > > > in the mbufs to utilize this offload. > > > > I would define and document what each capability flag/enum exactly > > > means, the minimum requirements (as defined by the DPDK > > > > community) for the driver to claim support for it, and the > > > requirements for an application to use it. > > > > For the sake of discussion, let's say that > > > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM means "gold standard" TX checksum update > > > capability > > > > (i.e. no requirements to the checksum field in the packet contents). > > > > If some NIC requires the checksum field in the packet contents to have > > > a precomputed value, the NIC would not be allowed to claim > > > > the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability. > > > > Such a NIC would need to define and document a new capability, e.g. > > > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED, for the "silver > > > > standard" TX checksum update capability. > > > > In other words: I would encode variations of offload capabilities > > > directly in the capabilities flags. > > > > Then we don't need additional APIs to help interpret those > > > capabilities. > > > > > > I understood your intention with different flags, yes it should work too > > > I think. > > > The reason I am not very fond of it - it will require to double > > > TX_OFFLOAD flags. > > > > An additional feature flag is only required if a NIC is not conforming to > > the "minimum requirements" of an offload feature, and the > > techboard permits introducing a variant of an existing feature. > > There should be very few additional feature flags for variants - exceptions > > only - or the "minimum requirements" are not broad > > enough to support the majority of NICs. > > Ok, so you suggest to group all existing reqs plus what all current > tx_prepare() do into "minimum requirements"? > So with current drivers in place we wouldn't need these new flags, but we'll > reserve such opportunity. > That might work, if there are no contradictory requirements in current PMDs, > and PMDs maintainers with > less reqs will agree with these 'extra' stuff.
Just to check how easy/hard would be to get a consensus, compiled a list of mbuf changes done by different PMDs in tx_prepare(). See below. Could be not fully correct or complete. PMD maintainers, feel free to update it, if I missed something. >From how it looks to me: if we'll go the way you suggest, then hns3 and virtio will most likely become a 'second class citizens' - will need a special offload flags for them. Plus, either all PMDs that now set tx_prepare()=NULL will have to agree to require rte_net_intel_cksum_prepare() to be done, or all Intel PMDs and few others will also be downgraded to 'second class'. PMD: atlantic MOD: rte_net_intel_cksum_prepare() /*for ipv4_hdr->hdr_checksum = 0; (tcp|udp)_hdr->cksum=rte_ipv(4|6)_phdr_cksum(...);*/ PMD: cpfl/idpf MOD: none PMD: em/igb/igc/fm10k/i40e/iavf/ice/ixgbe MOD: rte_net_intel_cksum_prepare() PMD: enic MOD: rte_net_intel_cksum_prepare() PMD: hns3 MOD: rte_net_intel_cksum_prepare() plus some extra: /* * A UDP packet with the same dst_port as VXLAN\VXLAN_GPE\GENEVE will * be recognized as a tunnel packet in HW. In this case, if UDP CKSUM * offload is set and the tunnel mask has not been set, the CKSUM will * be wrong since the header length is wrong and driver should complete * the CKSUM to avoid CKSUM error. */ PMD: ionic MOD: none PMD: ngbe MOD: rte_net_intel_cksum_prepare() PMD: qede MOD: none PMD: txgbe MOD: rte_net_intel_cksum_prepare() PMD: virtio: MOD: rte_net_intel_cksum_prepare() plus some extra: - for RTE_MBUF_F_TX_TCP_SEG: virtio_tso_fix_cksum() - for RTE_MBUF_F_TX_VLAN: rte_vlan_insert() PMD: vmxnet3 MOD: rte_net_intel_cksum_prepare() For all other PMDs in our main tree set tx_prepare = NULL.