On Fri, Feb 04, 2022 at 09:29:56PM +1000, David Gwynne wrote:
> On Fri, Jan 28, 2022 at 05:26:01PM +0100, Alexander Bluhm wrote:
> > On Wed, Jan 26, 2022 at 11:05:51AM +0100, Claudio Jeker wrote:
> > > On Wed, Jan 26, 2022 at 01:29:42AM +0100, Alexander Bluhm wrote:
> > > > Hi,
> > > >
> > > > There were some problems with ix(4) and ixl(4) hardware checksumming
> > > > for the output path on strict alignment architectures.
> > > >
> > > > I have merged jan@'s diffs and added some sanity checks and
> > > > workarounds.
> > > >
> > > > - If the first mbuf is not aligned or not contiguous, use m_copydata()
> > > > to extract the IP, IPv6, or TCP header.
> > > > - If the header is in the first mbuf, use m_data for the fast path.
> > > > - Add netstat counter for invalid header chains. This makes
> > > > us aware when hardware checksumming fails.
> > > > - Add netstat counter for header copies. This indicates that
> > > > better storage allocation in the network stack is possible.
> > > > It also allows us to recognize alignment problems on non-strict
> > > > architectures.
> > > > - There is no risk of crashes on sparc64.
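
(for illustration, a minimal sketch of the copy fallback described above;
hypothetical code, not from the posted diff. mp is the packet, ehdrlen the
ethernet header length:)

	struct ip ip;

	/* safe for any mbuf layout and alignment; costs a 20 byte copy */
	m_copydata(mp, ehdrlen, sizeof(ip), (caddr_t)&ip);
	iphlen = ip.ip_hl << 2;
	ipproto = ip.ip_p;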
> > > >
> > > > Does this approach make sense?
> > >
> > > I think it is overly complicated and too much data is copied around.
> > > First of all there is no need to extract ipproto.
> > > The code can just use the M_TCP_CSUM_OUT and M_UDP_CSUM_OUT flags (they
> > > are not set for other protos).
> > > Because of this, only the ip_hlen field needs to be accessed, and this
> > > can be done with m_getptr().
> >
> > A solution with m_getptr() is where we started. It has already
> > been rejected. The problem is that there are 6 ways to implement
> > this feature. Every option has its drawbacks and was rejected.
> >
> > Options are:
> > 1. m_getptr() and access the struct. Alignment cannot be guaranteed.
> > 2. m_getptr() and access the byte or word. Header fields should be
> > accessed by structs.
> > 3. Always m_copydata. Too much overhead.
> > 4. Always use m_data. Kernel may crash or use invalid data.
> > 5. Combination of m_data and m_copydata. Too complex.
> > 6. Track the fields in mbuf header. Too fragile and memory overhead.
> >
> > In my measurements checksum offloading gave us a 10% performance boost
> > so I want this feature. Other drivers also have it.
> >
> > Could we get another idea or find a consensus which option to use?
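
(for reference, option 1 boils down to the following sketch, using the same
calls as the ix diff below; mp is the packet, iphlen and ipproto the fields
to extract:)

	struct mbuf *m;
	struct ip *ip;
	int hoff;

	/* find the mbuf holding the ip header and its offset within it */
	m = m_getptr(mp, sizeof(struct ether_header), &hoff);
	KASSERT(m != NULL && m->m_len - hoff >= sizeof(*ip));
	ip = (struct ip *)(mtod(m, caddr_t) + hoff);

	/* struct access: faults on strict alignment archs if unaligned */
	iphlen = ip->ip_hl << 2;
	ipproto = ip->ip_p;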
>
> after staring at this for a few hours my conclusion is option 1 actually
> is the right approach, but the diff for ixl has a bug.
>
> > > In the IPv6 case even more can be skipped, since ip_hlen is static for IPv6.
> > >
> > > In ixl(4) the TCP header length also needs to be extracted. Again the code
> > > can be simplified because HW checksumming is only enabled if ip_hlen == 5
> > > and so the offset of the th_off field is static (for both IPv4 and IPv6).
> > > Again m_getptr can be used to just access the byte with th_off.
> > >
> > > Longterm in_proto_cksum_out() should probably help provide the th_off
> > > field. I think enforcing ip_hlen == 5 for UDP and TCP is fine; who needs
> > > IP options on UDP and TCP?
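
(a hypothetical sketch of that single-byte access: with ip_hlen pinned to 5
the TCP data-offset byte sits at a fixed position, 12 bytes into the TCP
header, and a one-byte load can never fault, even on strict alignment archs:)

	struct mbuf *m;
	u_int thlen;
	int hoff;

	/* for IPv6, use sizeof(struct ip6_hdr) instead of sizeof(struct ip) */
	m = m_getptr(mp, ETHER_HDR_LEN + sizeof(struct ip) + 12, &hoff);

	/* th_off is the high nibble of that byte, counted in 32-bit words */
	thlen = (*(mtod(m, uint8_t *) + hoff) >> 4) << 2;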
> >
> > Other diffs have been rejected as they make too many assumptions about
> > how the stack works.
>
> my opinion is we can make these assumptions. the CSUM_OUT flags are
> only set in very specific places where the stack itself is constructing
> or checking properly aligned headers, and then the stack maintains
> this alignment until it reaches the driver. this is where the first
> of the bugs in the ixl diff comes in.
>
> in the diff ixl_tx_setup_offload() is called after the dma mapping
> occurs. the mapping is implemented in this code:
>
> static inline int
> ixl_load_mbuf(bus_dma_tag_t dmat, bus_dmamap_t map, struct mbuf *m)
> {
> 	int error;
>
> 	error = bus_dmamap_load_mbuf(dmat, map, m,
> 	    BUS_DMA_STREAMING | BUS_DMA_NOWAIT);
> 	if (error != EFBIG)
> 		return (error);
>
> 	error = m_defrag(m, M_DONTWAIT);
> 	if (error != 0)
> 		return (error);
>
> 	return (bus_dmamap_load_mbuf(dmat, map, m,
> 	    BUS_DMA_STREAMING | BUS_DMA_NOWAIT));
> }
>
> the problem is that when we get a heavily fragmented mbuf we call
> m_defrag, which basically allocates a cluster and copies all the
> data from the fragments into it. however, m_defrag does not maintain
> the alignment of the payload, meaning the ethernet header starts
> on a 4-byte boundary because that's what we get from the cluster pool,
> and the IP and TCP headers end up with a 2-byte misalignment.
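
(as an illustration, ALIGNED_POINTER() from <sys/param.h> captures exactly
the property m_defrag breaks; a hypothetical check, not code from the diff:)

	caddr_t p = mtod(m, caddr_t) + ETHER_HDR_LEN;	/* 14 byte offset */

	/* true after m_defrag: p is 2 bytes off a 4-byte boundary, so
	 * dereferencing a struct ip through it faults on sparc64 */
	if (!ALIGNED_POINTER(p, uint32_t))
		printf("ip header misaligned\n");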
>
> my proposed solution to this is to set up the offload bits before
> calling ixl_load_mbuf. i'm confident this is the bug that deraadt@ hit.
>
> because of when the CSUM_OUT flags are set, i think m_getptr is fine for
> pulling the mbuf apart. if it's not, i want the code to blow up so it
> gets fixed. that's why the code in my diff below lacks defenses.
>
> the other bug is that the uint64_t holding the offload flags isn't
> reset between being called for different mbufs. my solution to that is
> to have ixl_tx_setup_offload() return the flags so the value in
> ixl_start is overwritten every time.
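
(roughly, the shape of both fixes together; a hypothetical fragment with
simplified names, not the actual ixl_start():)

	uint64_t offload;

	/* parse the headers while the stack's alignment still holds,
	 * i.e. before ixl_load_mbuf() gets a chance to m_defrag() */
	offload = ixl_tx_setup_offload(m);	/* 0 if no CSUM_OUT flags */

	if (ixl_load_mbuf(sc->sc_dmat, map, m) != 0) {
		m_freem(m);
		continue;
	}

	/* the offload word is rebuilt from scratch for every packet */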
>
> lastly, ixl shouldn't (doesn't need to?) spelunk in the packet if
> the CSUM_OUT flags aren't set. ixl_tx_setup_offload() now returns 0
> early if none of the flags are set.
>
> i'll have a look at ix and the ixl rx bits tomorrow.
ix(4) also looked at the packet after a possible m_defrag; the diff below
moves the header parsing before the dma mapping, like the ixl fix. this
works on amd64 with ipv4 and ipv6, with and without vlans, and it seems
to behave as expected. we'll try sparc64 soon.

the diff is a lot easier to read after it's applied.
Index: if_ix.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_ix.c,v
retrieving revision 1.181
diff -u -p -r1.181 if_ix.c
--- if_ix.c 27 Jan 2022 18:28:44 -0000 1.181
+++ if_ix.c 6 Feb 2022 12:32:21 -0000
@@ -156,7 +156,8 @@ int ixgbe_encap(struct tx_ring *, struct
int ixgbe_dma_malloc(struct ix_softc *, bus_size_t,
struct ixgbe_dma_alloc *, int);
void ixgbe_dma_free(struct ix_softc *, struct ixgbe_dma_alloc *);
-int ixgbe_tx_ctx_setup(struct tx_ring *, struct mbuf *, uint32_t *,
+static int
+ ixgbe_tx_ctx_setup(struct tx_ring *, struct mbuf *, uint32_t *,
uint32_t *);
int ixgbe_tso_setup(struct tx_ring *, struct mbuf *, uint32_t *,
uint32_t *);
@@ -1406,6 +1407,14 @@ ixgbe_encap(struct tx_ring *txr, struct
map = txbuf->map;
/*
+ * Set the appropriate offload context
+ * this will become the first descriptor.
+ */
+ ntxc = ixgbe_tx_ctx_setup(txr, m_head, &cmd_type_len, &olinfo_status);
+ if (ntxc == -1)
+ goto xmit_fail;
+
+ /*
* Map the packet for DMA.
*/
switch (bus_dmamap_load_mbuf(txr->txdma.dma_tag, map,
@@ -1422,14 +1431,6 @@ ixgbe_encap(struct tx_ring *txr, struct
return (0);
}
- /*
- * Set the appropriate offload context
- * this will becomes the first descriptor.
- */
- ntxc = ixgbe_tx_ctx_setup(txr, m_head, &cmd_type_len, &olinfo_status);
- if (ntxc == -1)
- goto xmit_fail;
-
i = txr->next_avail_desc + ntxc;
if (i >= sc->num_tx_desc)
i -= sc->num_tx_desc;
@@ -1880,6 +1881,7 @@ ixgbe_setup_interface(struct ix_softc *s
#endif
ifp->if_capabilities |= IFCAP_CSUM_TCPv4 | IFCAP_CSUM_UDPv4;
+ ifp->if_capabilities |= IFCAP_CSUM_TCPv6 | IFCAP_CSUM_UDPv6;
/*
* Specify the media types supported by this sc and register
@@ -2426,32 +2428,85 @@ ixgbe_free_transmit_buffers(struct tx_ri
*
**********************************************************************/
-int
+static void
+ixgbe_csum_offload(struct mbuf *mp,
+ uint32_t *vlan_macip_lens_p, uint32_t *type_tucmd_mlhl_p)
+{
+ struct ether_header *eh = mtod(mp, struct ether_header *);
+ struct mbuf *m;
+ int hoff;
+ uint32_t iphlen;
+ uint32_t vlan_macip_lens = 0;
+ uint32_t type_tucmd_mlhl = 0;
+ uint8_t ipproto;
+
+ vlan_macip_lens |= (sizeof(*eh) << IXGBE_ADVTXD_MACLEN_SHIFT);
+
+ switch (ntohs(eh->ether_type)) {
+ case ETHERTYPE_IP: {
+ struct ip *ip;
+
+ m = m_getptr(mp, sizeof(*eh), &hoff);
+ KASSERT(m != NULL && m->m_len - hoff >= sizeof(*ip));
+ ip = (struct ip *)(mtod(m, caddr_t) + hoff);
+
+ iphlen = ip->ip_hl << 2;
+ ipproto = ip->ip_p;
+
+ type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_IPV4;
+ break;
+ }
+
+#ifdef INET6
+ case ETHERTYPE_IPV6: {
+ struct ip6_hdr *ip6;
+
+ m = m_getptr(mp, sizeof(*eh), &hoff);
+ KASSERT(m != NULL && m->m_len - hoff >= sizeof(*ip6));
+ ip6 = (struct ip6_hdr *)(mtod(m, caddr_t) + hoff);
+
+ iphlen = sizeof(*ip6);
+ ipproto = ip6->ip6_nxt;
+
+ type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_IPV6;
+ break;
+ }
+#endif
+
+ default:
+ panic("CSUM_OUT set for non-IP packet");
+ /* NOTREACHED */
+ }
+
+ vlan_macip_lens |= iphlen;
+
+ switch (ipproto) {
+ case IPPROTO_TCP:
+ KASSERT(ISSET(mp->m_pkthdr.csum_flags, M_TCP_CSUM_OUT));
+ type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_TCP;
+ break;
+ case IPPROTO_UDP:
+ KASSERT(ISSET(mp->m_pkthdr.csum_flags, M_UDP_CSUM_OUT));
+ type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_UDP;
+ break;
+ default:
+ panic("CSUM_OUT set for wrong protocol");
+ /* NOTREACHED */
+ }
+
+ *vlan_macip_lens_p = vlan_macip_lens;
+ *type_tucmd_mlhl_p = type_tucmd_mlhl;
+}
+
+static int
ixgbe_tx_ctx_setup(struct tx_ring *txr, struct mbuf *mp,
uint32_t *cmd_type_len, uint32_t *olinfo_status)
{
struct ixgbe_adv_tx_context_desc *TXD;
struct ixgbe_tx_buf *tx_buffer;
-#if NVLAN > 0
- struct ether_vlan_header *eh;
-#else
- struct ether_header *eh;
-#endif
- struct ip *ip;
-#ifdef notyet
- struct ip6_hdr *ip6;
-#endif
- struct mbuf *m;
- int ipoff;
uint32_t vlan_macip_lens = 0, type_tucmd_mlhl = 0;
- int ehdrlen, ip_hlen = 0;
- uint16_t etype;
- uint8_t ipproto = 0;
- int offload = TRUE;
int ctxd = txr->next_avail_desc;
-#if NVLAN > 0
- uint16_t vtag = 0;
-#endif
+ int offload = 0;
#if notyet
/* First check if TSO is to be used */
@@ -2459,112 +2514,35 @@ ixgbe_tx_ctx_setup(struct tx_ring *txr,
return (ixgbe_tso_setup(txr, mp, cmd_type_len, olinfo_status));
#endif
- if ((mp->m_pkthdr.csum_flags & (M_TCP_CSUM_OUT | M_UDP_CSUM_OUT)) == 0)
- offload = FALSE;
-
/* Indicate the whole packet as payload when not doing TSO */
*olinfo_status |= mp->m_pkthdr.len << IXGBE_ADVTXD_PAYLEN_SHIFT;
- /* Now ready a context descriptor */
- TXD = (struct ixgbe_adv_tx_context_desc *) &txr->tx_base[ctxd];
- tx_buffer = &txr->tx_buffers[ctxd];
-
- /*
- * In advanced descriptors the vlan tag must
- * be placed into the descriptor itself. Hence
- * we need to make one even if not doing offloads.
- */
#if NVLAN > 0
- if (mp->m_flags & M_VLANTAG) {
- vtag = mp->m_pkthdr.ether_vtag;
+ if (ISSET(mp->m_flags, M_VLANTAG)) {
+ uint32_t vtag = mp->m_pkthdr.ether_vtag;
vlan_macip_lens |= (vtag << IXGBE_ADVTXD_VLAN_SHIFT);
- } else
-#endif
- if (offload == FALSE)
- return (0); /* No need for CTX */
-
- /*
- * Determine where frame payload starts.
- * Jump over vlan headers if already present,
- * helpful for QinQ too.
- */
- if (mp->m_len < sizeof(struct ether_header))
- return (-1);
-#if NVLAN > 0
- eh = mtod(mp, struct ether_vlan_header *);
- if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) {
- if (mp->m_len < sizeof(struct ether_vlan_header))
- return (-1);
- etype = ntohs(eh->evl_proto);
- ehdrlen = ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN;
- } else {
- etype = ntohs(eh->evl_encap_proto);
- ehdrlen = ETHER_HDR_LEN;
+ offload |= 1;
}
-#else
- eh = mtod(mp, struct ether_header *);
- etype = ntohs(eh->ether_type);
- ehdrlen = ETHER_HDR_LEN;
#endif
- /* Set the ether header length */
- vlan_macip_lens |= ehdrlen << IXGBE_ADVTXD_MACLEN_SHIFT;
-
- switch (etype) {
- case ETHERTYPE_IP:
- if (mp->m_pkthdr.len < ehdrlen + sizeof(*ip))
- return (-1);
- m = m_getptr(mp, ehdrlen, &ipoff);
- KASSERT(m != NULL && m->m_len - ipoff >= sizeof(*ip));
- ip = (struct ip *)(m->m_data + ipoff);
- ip_hlen = ip->ip_hl << 2;
- ipproto = ip->ip_p;
- type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_IPV4;
- break;
-#ifdef notyet
- case ETHERTYPE_IPV6:
- if (mp->m_pkthdr.len < ehdrlen + sizeof(*ip6))
- return (-1);
- m = m_getptr(mp, ehdrlen, &ipoff);
- KASSERT(m != NULL && m->m_len - ipoff >= sizeof(*ip6));
- ip6 = (struct ip6 *)(m->m_data + ipoff);
- ip_hlen = sizeof(*ip6);
- /* XXX-BZ this will go badly in case of ext hdrs. */
- ipproto = ip6->ip6_nxt;
- type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_IPV6;
- break;
-#endif
- default:
- offload = FALSE;
- break;
+ if (ISSET(mp->m_pkthdr.csum_flags, M_TCP_CSUM_OUT|M_UDP_CSUM_OUT)) {
+ ixgbe_csum_offload(mp, &vlan_macip_lens, &type_tucmd_mlhl);
+ *olinfo_status |= IXGBE_TXD_POPTS_TXSM << 8;
+ offload |= 1;
}
- vlan_macip_lens |= ip_hlen;
- type_tucmd_mlhl |= IXGBE_ADVTXD_DCMD_DEXT | IXGBE_ADVTXD_DTYP_CTXT;
-
- switch (ipproto) {
- case IPPROTO_TCP:
- if (mp->m_pkthdr.csum_flags & M_TCP_CSUM_OUT)
- type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_TCP;
- break;
- case IPPROTO_UDP:
- if (mp->m_pkthdr.csum_flags & M_UDP_CSUM_OUT)
- type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_UDP;
- break;
- default:
- offload = FALSE;
- break;
- }
+ if (!offload)
+ return (0);
- if (offload) /* For the TX descriptor setup */
- *olinfo_status |= IXGBE_TXD_POPTS_TXSM << 8;
+ type_tucmd_mlhl |= IXGBE_ADVTXD_DCMD_DEXT | IXGBE_ADVTXD_DTYP_CTXT;
- /* Now copy bits into descriptor */
- TXD->vlan_macip_lens = htole32(vlan_macip_lens);
- TXD->type_tucmd_mlhl = htole32(type_tucmd_mlhl);
+ TXD = (struct ixgbe_adv_tx_context_desc *)&txr->tx_base[ctxd];
+ htolem32(&TXD->vlan_macip_lens, vlan_macip_lens);
TXD->seqnum_seed = htole32(0);
+ htolem32(&TXD->type_tucmd_mlhl, type_tucmd_mlhl);
TXD->mss_l4len_idx = htole32(0);
+ tx_buffer = &txr->tx_buffers[ctxd];
tx_buffer->m_head = NULL;
tx_buffer->eop_index = -1;
Index: ixgbe.h
===================================================================
RCS file: /cvs/src/sys/dev/pci/ixgbe.h,v
retrieving revision 1.32
diff -u -p -r1.32 ixgbe.h
--- ixgbe.h 18 Jul 2020 07:18:22 -0000 1.32
+++ ixgbe.h 6 Feb 2022 12:32:21 -0000
@@ -65,6 +65,7 @@
#include <netinet/in.h>
#include <netinet/if_ether.h>
#include <netinet/ip.h>
+#include <netinet/ip6.h>
#if NBPFILTER > 0
#include <net/bpf.h>