[dpdk-dev] [PATCH v6 1/6] ethdev: add Tx preparation

2016-10-14 Thread Tomasz Kulasek
Added API for `rte_eth_tx_prep`

uint16_t rte_eth_tx_prep(uint8_t port_id, uint16_t queue_id,
struct rte_mbuf **tx_pkts, uint16_t nb_pkts)

Added fields to the `struct rte_eth_desc_lim`:

uint16_t nb_seg_max;
/**< Max number of segments per whole packet. */

uint16_t nb_mtu_seg_max;
/**< Max number of segments per one MTU */

Created `rte_pkt.h` header with commonly used functions:

int rte_validate_tx_offload(struct rte_mbuf *m)
to validate general requirements for Tx offloads in a packet, such as
flag completeness. In the current implementation this function is called
optionally, when RTE_LIBRTE_ETHDEV_DEBUG is enabled.

int rte_phdr_cksum_fix(struct rte_mbuf *m)
to fix the pseudo-header checksum for TSO and non-TSO TCP/UDP packets
before hardware Tx checksum offload.
 - for non-TSO TCP/UDP packets the full pseudo-header checksum is
   computed and set.
 - for TSO the IP payload length is not included.
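
The two cases above can be illustrated with a small standalone sketch (plain C, no DPDK dependencies; phdr_cksum_ipv4() is a hypothetical helper that mirrors the behaviour described for rte_ipv4_phdr_cksum(), working on host-order values for readability rather than on the big-endian header in place):

```c
#include <stdint.h>

/* Ones-complement sum of the IPv4 pseudo header, folded to 16 bits.
 * For non-TSO packets the L4 length is included in the sum; for TSO
 * it is left out, so the hardware can patch in the per-segment length
 * itself.  Host byte order is used here purely for readability. */
static uint16_t
phdr_cksum_ipv4(uint32_t src, uint32_t dst, uint8_t proto,
                uint16_t l4_len, int is_tso)
{
    uint32_t sum = 0;

    sum += (src >> 16) + (src & 0xffff);   /* source address    */
    sum += (dst >> 16) + (dst & 0xffff);   /* dest address      */
    sum += proto;                          /* zero byte + proto */
    if (!is_tso)
        sum += l4_len;                     /* omitted for TSO   */

    while (sum >> 16)                      /* fold the carries  */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;  /* seed value, not the final ~cksum  */
}
```

For the same 5-tuple the non-TSO and TSO seeds differ by exactly the L4 length term, which is why the fix-up code must look at PKT_TX_TCP_SEG to pick the right variant.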


PERFORMANCE TESTS
-----------------

This feature was tested with modified csum engine from test-pmd.

The packet checksum preparation was moved from application to Tx
preparation step placed before burst.

We may expect some overhead caused by:
1) using an additional callback before burst,
2) rescanning the burst,
3) additional condition checking (packet validation),
4) worse optimization (e.g. packet data access, etc.)

We tested it using the ixgbe Tx preparation implementation with some parts
disabled, to have comparable information about the impact of different
parts of the implementation.

IMPACT:

1) For an unimplemented Tx preparation callback the performance impact is
   negligible,
2) The packet condition check without checksum modifications (nb_segs,
   available offloads, etc.) gives 14626628/14252168 (~2.62% drop),
3) Full support in the ixgbe driver (point 2 + packet checksum
   initialization) gives 14060924/13588094 (~3.48% drop)
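
The prep-then-burst flow that was measured can be sketched with a toy model (plain C; toy_pkt, toy_tx_prep(), and the two limit constants are illustrative stand-ins, not the DPDK API): the prepare step rescans the burst, stops at the first invalid packet, and the application then sends only the validated prefix.

```c
#include <stdint.h>

/* Toy packet: just the fields the checks below look at. */
struct toy_pkt {
    int nb_segs;   /* number of segments in the chain        */
    int tso;       /* 1 if TSO is requested for this packet  */
};

#define NB_SEG_MAX      8  /* assumed device limit per TSO packet     */
#define NB_MTU_SEG_MAX  2  /* assumed device limit per non-TSO packet */

/* Mimics the tx_prep contract: returns the number of leading packets
 * that pass validation; the first failing packet is left untouched
 * (the real API would also set rte_errno there). */
static uint16_t
toy_tx_prep(struct toy_pkt **pkts, uint16_t nb)
{
    uint16_t i;

    for (i = 0; i < nb; i++) {
        int limit = pkts[i]->tso ? NB_SEG_MAX : NB_MTU_SEG_MAX;
        if (pkts[i]->nb_segs > limit)
            break;  /* stop at the first invalid packet */
    }
    return i;
}
```

The extra loop over the burst and the per-packet branch are exactly the costs listed in points 2) and 3) above.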


Signed-off-by: Tomasz Kulasek 
---
 config/common_base            |    1 +
 lib/librte_ether/rte_ethdev.h |   85 +
 lib/librte_mbuf/rte_mbuf.h    |    9 +++
 lib/librte_net/Makefile       |    3 +-
 lib/librte_net/rte_pkt.h      |  137 +
 5 files changed, 234 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_net/rte_pkt.h

diff --git a/config/common_base b/config/common_base
index c7fd3db..619284b 100644
--- a/config/common_base
+++ b/config/common_base
@@ -120,6 +120,7 @@ CONFIG_RTE_MAX_QUEUES_PER_PORT=1024
 CONFIG_RTE_LIBRTE_IEEE1588=n
 CONFIG_RTE_ETHDEV_QUEUE_STAT_CNTRS=16
 CONFIG_RTE_ETHDEV_RXTX_CALLBACKS=y
+CONFIG_RTE_ETHDEV_TX_PREP=y

 #
 # Support NIC bypass logic
diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index 38641e8..a10ed9c 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -182,6 +182,7 @@ extern "C" {
 #include 
 #include 
 #include 
+#include 
 #include "rte_ether.h"
 #include "rte_eth_ctrl.h"
 #include "rte_dev_info.h"
@@ -699,6 +700,8 @@ struct rte_eth_desc_lim {
uint16_t nb_max;   /**< Max allowed number of descriptors. */
uint16_t nb_min;   /**< Min allowed number of descriptors. */
uint16_t nb_align; /**< Number of descriptors should be aligned to. */
+   uint16_t nb_seg_max; /**< Max number of segments per whole packet. */
+   uint16_t nb_mtu_seg_max; /**< Max number of segments per one MTU. */
 };

 /**
@@ -1188,6 +1191,11 @@ typedef uint16_t (*eth_tx_burst_t)(void *txq,
   uint16_t nb_pkts);
 /**< @internal Send output packets on a transmit queue of an Ethernet device. */

+typedef uint16_t (*eth_tx_prep_t)(void *txq,
+  struct rte_mbuf **tx_pkts,
+  uint16_t nb_pkts);
+/**< @internal Prepare output packets on a transmit queue of an Ethernet device. */
+
 typedef int (*flow_ctrl_get_t)(struct rte_eth_dev *dev,
   struct rte_eth_fc_conf *fc_conf);
 /**< @internal Get current flow control parameter on an Ethernet device */
@@ -1622,6 +1630,7 @@ struct rte_eth_rxtx_callback {
 struct rte_eth_dev {
eth_rx_burst_t rx_pkt_burst; /**< Pointer to PMD receive function. */
eth_tx_burst_t tx_pkt_burst; /**< Pointer to PMD transmit function. */
+   eth_tx_prep_t tx_pkt_prep; /**< Pointer to PMD transmit prepare function. */
struct rte_eth_dev_data *data;  /**< Pointer to device data */
const struct eth_driver *driver;/**< Driver for this device */
const struct eth_dev_ops *dev_ops; /**< Functions exported by PMD */
@@ -2816,6 +2825,82 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts, nb_pkts);
 }

+/**
+ * Process a burst of output packets on a transmit queue of an Ethernet device.
+ *
+ * The rte_eth_tx_prep() function is invoked to prepare output packets to 

[dpdk-dev] [PATCH v6 1/6] ethdev: add Tx preparation

2016-10-18 Thread Olivier Matz
Hi Tomasz,

I think the principle of tx_prep() is good, it may for instance help to
remove the function virtio_tso_fix_cksum() from the virtio, and maybe
even change the mbuf TSO/cksum API.

I have some questions/comments below, I'm sorry it comes very late.

On 10/14/2016 05:05 PM, Tomasz Kulasek wrote:
> Added API for `rte_eth_tx_prep`
> 
> uint16_t rte_eth_tx_prep(uint8_t port_id, uint16_t queue_id,
>   struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> 
> Added fields to the `struct rte_eth_desc_lim`:
> 
>   uint16_t nb_seg_max;
>   /**< Max number of segments per whole packet. */
> 
>   uint16_t nb_mtu_seg_max;
>   /**< Max number of segments per one MTU */

Not sure I understand the second one. Is this the case for TSO?

Is it a usual limitation in different network hardware?
Can this info be retrieved/used by the application?

> 
> Created `rte_pkt.h` header with common used functions:
> 
> int rte_validate_tx_offload(struct rte_mbuf *m)
>   to validate general requirements for tx offload in packet such a
>   flag completness. In current implementation this function is called
>   optionaly when RTE_LIBRTE_ETHDEV_DEBUG is enabled.
> 
> int rte_phdr_cksum_fix(struct rte_mbuf *m)
>   to fix pseudo header checksum for TSO and non-TSO tcp/udp packets
>   before hardware tx checksum offload.
>- for non-TSO tcp/udp packets full pseudo-header checksum is
>  counted and set.
>- for TSO the IP payload length is not included.

Why not in rte_net.h?


> [...]
>  
> @@ -2816,6 +2825,82 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
>   return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts, 
> nb_pkts);
>  }
>  
> +/**
> + * Process a burst of output packets on a transmit queue of an Ethernet 
> device.
> + *
> + * The rte_eth_tx_prep() function is invoked to prepare output packets to be
> + * transmitted on the output queue *queue_id* of the Ethernet device 
> designated
> + * by its *port_id*.
> + * The *nb_pkts* parameter is the number of packets to be prepared which are
> + * supplied in the *tx_pkts* array of *rte_mbuf* structures, each of them
> + * allocated from a pool created with rte_pktmbuf_pool_create().
> + * For each packet to send, the rte_eth_tx_prep() function performs
> + * the following operations:
> + *
> + * - Check if packet meets devices requirements for tx offloads.

Do you mean hardware requirements?
Can the application be aware of these requirements? I mean capability
flags, or something in dev_infos?

Maybe the comment could be more precise?

> + * - Check limitations about number of segments.
> + *
> + * - Check additional requirements when debug is enabled.

What kind of additional requirements?

> + *
> + * - Update and/or reset required checksums when tx offload is set for 
> packet.
> + *

By reading this, I think it may not be clear for the user about what
should be set in the mbuf. In mbuf API, it is said:

 * TCP segmentation offload. To enable this offload feature for a
 * packet to be transmitted on hardware supporting TSO:
 *  - set the PKT_TX_TCP_SEG flag in mbuf->ol_flags (this flag implies
 *PKT_TX_TCP_CKSUM)
 *  - set the flag PKT_TX_IPV4 or PKT_TX_IPV6
 *  - if it's IPv4, set the PKT_TX_IP_CKSUM flag and write the IP checksum
 *to 0 in the packet
 *  - fill the mbuf offload information: l2_len, l3_len, l4_len, tso_segsz
 *  - calculate the pseudo header checksum without taking ip_len in account,
 *and set it in the TCP header. Refer to rte_ipv4_phdr_cksum() and
 *rte_ipv6_phdr_cksum() that can be used as helpers.


If I understand well, using tx_prep(), the user will have to do the
same except writing the IP checksum to 0, and without setting the
TCP pseudo header checksum, right?
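
If that reading is right, the application-side setup would shrink to roughly the sketch below (toy struct and illustrative flag values, not the rte_mbuf API; the pseudo-header checksum and IP-checksum-zeroing steps from the mbuf doc are the parts tx_prep() would take over):

```c
#include <stdint.h>
#include <string.h>

/* Illustrative flag values only -- the real PKT_TX_* flags live in
 * rte_mbuf.h. */
#define F_TX_TCP_SEG  (1u << 0)   /* implies TCP cksum offload */
#define F_TX_IPV4     (1u << 1)
#define F_TX_IP_CKSUM (1u << 2)

struct toy_mbuf {
    uint32_t ol_flags;
    uint16_t l2_len, l3_len, l4_len, tso_segsz;
};

/* What the app would still do before calling tx_prep(): set the
 * offload flags and header lengths.  No pseudo-header checksum
 * computation here -- that is left to the prepare step. */
static void
setup_tso_ipv4(struct toy_mbuf *m, uint16_t mss)
{
    memset(m, 0, sizeof(*m));
    m->ol_flags = F_TX_TCP_SEG | F_TX_IPV4 | F_TX_IP_CKSUM;
    m->l2_len = 14;      /* Ethernet header          */
    m->l3_len = 20;      /* IPv4 header, no options  */
    m->l4_len = 20;      /* TCP header, no options   */
    m->tso_segsz = mss;  /* segment size for TSO     */
}
```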


> + * The rte_eth_tx_prep() function returns the number of packets ready to be
> + * sent. A return value equal to *nb_pkts* means that all packets are valid 
> and
> + * ready to be sent.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The index of the transmit queue through which output packets must be
> + *   sent.
> + *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + * @param tx_pkts
> + *   The address of an array of *nb_pkts* pointers to *rte_mbuf* structures
> + *   which contain the output packets.
> + * @param nb_pkts
> + *   The maximum number of packets to process.
> + * @return
> + *   The number of packets correct and ready to be sent. The return value 
> can be
> + *   less than the value of the *tx_pkts* parameter when some packet doesn't
> + *   meet devices requirements with rte_errno set appropriately.
> + */

Can we add the constraint that invalid packets are left untouched?

I think most of the time there will be a software fallback in that
case, so it would be good to ensure that this function does not change
the flags or the packet.

[dpdk-dev] [PATCH v6 1/6] ethdev: add Tx preparation

2016-10-19 Thread Kulasek, TomaszX
Hi Olivier,

> -Original Message-
> From: Olivier Matz [mailto:olivier.matz at 6wind.com]
> Sent: Tuesday, October 18, 2016 16:57
> To: Kulasek, TomaszX ; dev at dpdk.org
> Cc: Ananyev, Konstantin ;
> thomas.monjalon at 6wind.com
> Subject: Re: [dpdk-dev] [PATCH v6 1/6] ethdev: add Tx preparation
> 
> Hi Tomasz,
> 
> I think the principle of tx_prep() is good, it may for instance help to
> remove the function virtio_tso_fix_cksum() from the virtio, and maybe even
> change the mbuf TSO/cksum API.
> 
> I have some questions/comments below, I'm sorry it comes very late.
> 
> On 10/14/2016 05:05 PM, Tomasz Kulasek wrote:
> > Added API for `rte_eth_tx_prep`
> >
> > uint16_t rte_eth_tx_prep(uint8_t port_id, uint16_t queue_id,
> > struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> >
> > Added fields to the `struct rte_eth_desc_lim`:
> >
> > uint16_t nb_seg_max;
> > /**< Max number of segments per whole packet. */
> >
> > uint16_t nb_mtu_seg_max;
> > /**< Max number of segments per one MTU */
> 
> Not sure I understand the second one. Is this is case of TSO?
> 
> Is it a usual limitation in different network hardware?
> Can this info be retrieved/used by the application?
> 

Yes, the limitation on the number of segments may differ depending on TSO/non-TSO,
e.g. for the Fortville NIC. This information is available to the application.

> >
> > Created `rte_pkt.h` header with common used functions:
> >
> > int rte_validate_tx_offload(struct rte_mbuf *m)
> > to validate general requirements for tx offload in packet such a
> > flag completness. In current implementation this function is called
> > optionaly when RTE_LIBRTE_ETHDEV_DEBUG is enabled.
> >
> > int rte_phdr_cksum_fix(struct rte_mbuf *m)
> > to fix pseudo header checksum for TSO and non-TSO tcp/udp packets
> > before hardware tx checksum offload.
> >  - for non-TSO tcp/udp packets full pseudo-header checksum is
> >counted and set.
> >  - for TSO the IP payload length is not included.
> 
> Why not in rte_net.h?
> 
> 
> > [...]
> >
> > @@ -2816,6 +2825,82 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t
> queue_id,
> > return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts,
> > nb_pkts);  }
> >
> > +/**
> > + * Process a burst of output packets on a transmit queue of an Ethernet
> device.
> > + *
> > + * The rte_eth_tx_prep() function is invoked to prepare output
> > +packets to be
> > + * transmitted on the output queue *queue_id* of the Ethernet device
> > +designated
> > + * by its *port_id*.
> > + * The *nb_pkts* parameter is the number of packets to be prepared
> > +which are
> > + * supplied in the *tx_pkts* array of *rte_mbuf* structures, each of
> > +them
> > + * allocated from a pool created with rte_pktmbuf_pool_create().
> > + * For each packet to send, the rte_eth_tx_prep() function performs
> > + * the following operations:
> > + *
> > + * - Check if packet meets devices requirements for tx offloads.
> 
> Do you mean hardware requirements?
> Can the application be aware of these requirements? I mean capability
> flags, or something in dev_infos?

Yes. If some offloads cannot be handled by the hardware, it fails. Also if,
e.g., the number of segments is invalid, and so on.

> 
> Maybe the comment could be more precise?
> 
> > + * - Check limitations about number of segments.
> > + *
> > + * - Check additional requirements when debug is enabled.
> 
> What kind of additional requirements?
> 

We may assume that the application sets things right, e.g. the correct IP header
version is required for most checksum offloads. To avoid additional performance
overhead, these checks are done only when debug is on.
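
A flag-completeness check of this kind can be sketched as follows (plain C; the F_TX_* values are illustrative stand-ins for the real PKT_TX_* flags in rte_mbuf.h, and toy_validate_tx_offload() is a hypothetical reduction of rte_validate_tx_offload()):

```c
#include <stdint.h>

/* Illustrative flag values only -- real PKT_TX_* flags differ. */
#define F_TX_IP_CKSUM  (1ULL << 0)
#define F_TX_TCP_SEG   (1ULL << 1)
#define F_TX_IPV4      (1ULL << 2)
#define F_TX_IPV6      (1ULL << 3)

/* Debug-only completeness check: IP checksum offload only makes
 * sense for IPv4, and TSO needs an L3 type flag to be set. */
static int
toy_validate_tx_offload(uint64_t ol_flags)
{
    if ((ol_flags & F_TX_IP_CKSUM) && !(ol_flags & F_TX_IPV4))
        return -1;  /* IP cksum requested, packet not marked IPv4 */
    if ((ol_flags & F_TX_TCP_SEG) &&
        !(ol_flags & (F_TX_IPV4 | F_TX_IPV6)))
        return -1;  /* TSO without an L3 type */
    return 0;
}
```

Compiling such checks out in non-debug builds is what keeps the fast path free of the extra branches.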

> > + *
> > + * - Update and/or reset required checksums when tx offload is set for
> packet.
> > + *
> 
> By reading this, I think it may not be clear for the user about what
> should be set in the mbuf. In mbuf API, it is said:
> 
>  * TCP segmentation offload. To enable this offload feature for a
>  * packet to be transmitted on hardware supporting TSO:
>  *  - set the PKT_TX_TCP_SEG flag in mbuf->ol_flags (this flag implies
>  *PKT_TX_TCP_CKSUM)
>  *  - set the flag PKT_TX_IPV4 or PKT_TX_IPV6
>  *  - if it's IPv4, set the PKT_TX_IP_CKSUM flag and write the IP checksum
>  *to 0 in the packet
>  *  - fill the mbuf offload information: l2_len, l3_len, l4_len, tso_segsz
>  *  - calculate the pseudo header checksum without taking ip_len in
> account,
>  *and set it in the TC

[dpdk-dev] [PATCH v6 1/6] ethdev: add Tx preparation

2016-10-19 Thread Ananyev, Konstantin
Hi guys,

> >
> > > +static inline int
> > > +rte_phdr_cksum_fix(struct rte_mbuf *m) {
> > > + struct ipv4_hdr *ipv4_hdr;
> > > + struct ipv6_hdr *ipv6_hdr;
> > > + struct tcp_hdr *tcp_hdr;
> > > + struct udp_hdr *udp_hdr;
> > > + uint64_t ol_flags = m->ol_flags;
> > > + uint64_t inner_l3_offset = m->l2_len;
> > > +
> > > + if (ol_flags & PKT_TX_OUTER_IP_CKSUM)
> > > + inner_l3_offset += m->outer_l2_len + m->outer_l3_len;
> > > +
> > > + if ((ol_flags & PKT_TX_UDP_CKSUM) == PKT_TX_UDP_CKSUM) {
> > > + if (ol_flags & PKT_TX_IPV4) {
> > > + ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
> > > + inner_l3_offset);
> > > +
> > > + if (ol_flags & PKT_TX_IP_CKSUM)
> > > + ipv4_hdr->hdr_checksum = 0;
> > > +
> > > + udp_hdr = (struct udp_hdr *)((char *)ipv4_hdr + m->l3_len);
> > > + udp_hdr->dgram_cksum = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
> > > + } else {
> > > + ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
> > > + inner_l3_offset);
> > > + /* non-TSO udp */
> > > + udp_hdr = rte_pktmbuf_mtod_offset(m, struct udp_hdr *,
> > > + inner_l3_offset + m->l3_len);
> > > + udp_hdr->dgram_cksum = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
> > > + }
> > > + } else if ((ol_flags & PKT_TX_TCP_CKSUM) ||
> > > + (ol_flags & PKT_TX_TCP_SEG)) {
> > > + if (ol_flags & PKT_TX_IPV4) {
> > > + ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
> > > + inner_l3_offset);
> > > +
> > > + if (ol_flags & PKT_TX_IP_CKSUM)
> > > + ipv4_hdr->hdr_checksum = 0;
> > > +
> > > + /* non-TSO tcp or TSO */
> > > + tcp_hdr = (struct tcp_hdr *)((char *)ipv4_hdr + m->l3_len);
> > > + tcp_hdr->cksum = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
> > > + } else {
> > > + ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
> > > + inner_l3_offset);
> > > + /* non-TSO tcp or TSO */
> > > + tcp_hdr = rte_pktmbuf_mtod_offset(m, struct tcp_hdr *,
> > > + inner_l3_offset + m->l3_len);
> > > + tcp_hdr->cksum = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
> > > + }
> > > + }
> > > + return 0;
> > > +}
> > > +
> > > +#endif /* _RTE_PKT_H_ */
> > >
> >
> > The function expects that all the network headers are in the first
> > segment, and that each of them is contiguous.
> >
> 
> Yes, I see...

Yes it does.
I suppose it is a legitimate restriction (assumption) for those who'd like to
use that function. But that's a good point, and I suppose we need to state it
explicitly in the API comments.

> 
> > Also, I had an interesting remark from Stephen [1] on a similar code.
> > If the mbuf is a clone, it will modify the data of the direct mbuf, which
> > should be read-only. Note that it is likely to happen in a TCP stack,
> > because the packet is kept locally in case it has to be retransmitted.
> > Cloning a mbuf is more efficient than duplicating it.
> >
> > I plan to fix it in virtio code by "uncloning" the headers.
> >
> > [1] http://dpdk.org/ml/archives/dev/2016-October/048873.html

This subject is probably a bit off-topic...
My position on it: it shouldn't be the PMD's responsibility to make these kinds
of checks.
I think it should be the upper layer's responsibility to provide the PMD an mbuf
that can be safely used (and in that case modified) by the driver.
Let's say the upper layer would like to use packet clones for TCP retransmissions;
it can easily overcome that problem by cloning only the data part of the packet,
putting the L2/L3/L4 headers in a separate (head) mbuf, and chaining it with the
cloned data mbuf.
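
That split can be sketched with a toy mbuf model (plain C; toy_mbuf, clone_data(), and chain_header() are hypothetical stand-ins for rte_pktmbuf_clone() and mbuf chaining, not the DPDK API): only the data segment shares memory with the original, so the driver may rewrite the private header segment freely.

```c
#include <stdlib.h>
#include <string.h>

/* Minimal stand-in for a chained mbuf segment. */
struct toy_mbuf {
    char            *buf;      /* payload this segment points at       */
    size_t           len;
    int              shared;   /* 1 if buf is shared with the original */
    struct toy_mbuf *next;
};

/* "Clone": a new mbuf header that points at the SAME data buffer,
 * which must therefore be treated as read-only. */
static struct toy_mbuf *
clone_data(struct toy_mbuf *orig)
{
    struct toy_mbuf *c = calloc(1, sizeof(*c));
    c->buf = orig->buf;
    c->len = orig->len;
    c->shared = 1;
    return c;
}

/* Private header segment chained in front of the cloned data: the
 * driver can patch checksums here without touching shared memory. */
static struct toy_mbuf *
chain_header(const char *hdr, size_t hlen, struct toy_mbuf *data)
{
    struct toy_mbuf *h = calloc(1, sizeof(*h));
    h->buf = malloc(hlen);
    memcpy(h->buf, hdr, hlen);  /* private copy of L2/L3/L4 headers */
    h->len = hlen;
    h->next = data;
    return h;
}
```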

Konstantin