Re: [PATCH net] net: ipv6: Validate GSO SKB before finish IPv6 processing
Hi Aya,

> Daniel Axtens, does this solve the issue referred in your commit?
> 8914a595110a bnx2x: disable GSO where gso_size is too big for hardware

No, because:

> Note: These cases are handled in the same manner in IPv4 output finish.
> This patch aligns the behavior of IPv6 and IPv4.

and the issue my patch addresses was hit on IPv4, not IPv6.

In my case, the issue was that a packet came in on an ibmveth interface, GSO but with a gso_size far too large for the egress hardware. Open vSwitch then transferred it to the bnx2x interface, where the firmware promptly panicked.

There's a nice long description at
https://patchwork.ozlabs.org/project/netdev/patch/20180111235905.10110-1-...@axtens.net/
however the description wasn't included in the commit message.

Kind regards,
Daniel

> Thanks,
> Aya
>
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 749ad72386b2..36466f2211bd 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -125,8 +125,43 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
>  	return -EINVAL;
>  }
>
> +static int
> +ip6_finish_output_gso_slowpath_drop(struct net *net, struct sock *sk,
> +				    struct sk_buff *skb, unsigned int mtu)
> +{
> +	struct sk_buff *segs, *nskb;
> +	netdev_features_t features;
> +	int ret = 0;
> +
> +	/* Please see corresponding comment in ip_finish_output_gso
> +	 * describing the cases where GSO segment length exceeds the
> +	 * egress MTU.
> +	 */
> +	features = netif_skb_features(skb);
> +	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
> +	if (IS_ERR_OR_NULL(segs)) {
> +		kfree_skb(skb);
> +		return -ENOMEM;
> +	}
> +
> +	consume_skb(skb);
> +
> +	skb_list_walk_safe(segs, segs, nskb) {
> +		int err;
> +
> +		skb_mark_not_on_list(segs);
> +		err = ip6_fragment(net, sk, segs, ip6_finish_output2);
> +		if (err && ret == 0)
> +			ret = err;
> +	}
> +
> +	return ret;
> +}
> +
>  static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>  {
> +	unsigned int mtu;
> +
>  #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
>  	/* Policy lookup after SNAT yielded a new policy */
>  	if (skb_dst(skb)->xfrm) {
> @@ -135,7 +170,11 @@ static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff
>  	}
>  #endif
>
> -	if ((skb->len > ip6_skb_dst_mtu(skb) && !skb_is_gso(skb)) ||
> +	mtu = ip6_skb_dst_mtu(skb);
> +	if (skb_is_gso(skb) && !skb_gso_validate_network_len(skb, mtu))
> +		return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu);
> +
> +	if ((skb->len > mtu && !skb_is_gso(skb)) ||
> 	    dst_allfrag(skb_dst(skb)) ||
> 	    (IP6CB(skb)->frag_max_size && skb->len > IP6CB(skb)->frag_max_size))
>  		return ip6_fragment(net, sk, skb, ip6_finish_output2);
> --
> 2.14.1
Re: GSO where gso_size is too big for hardware
Willem de Bruijn writes:

> On Tue, Jan 22, 2019 at 11:24 AM Ben Hutchings wrote:
>>
>> Last year you applied these fixes for a potential denial-of-service in
>> the bnx2x driver:
>>
>> commit 2b16f048729bf35e6c28a40cbfad07239f9dcd90
>> Author: Daniel Axtens
>> Date: Wed Jan 31 14:15:33 2018 +1100
>>
>>     net: create skb_gso_validate_mac_len()
>>
>> commit 8914a595110a6eca69a5e275b323f5d09e18f4f9
>> Author: Daniel Axtens
>> Date: Wed Jan 31 14:15:34 2018 +1100
>>
>>     bnx2x: disable GSO where gso_size is too big for hardware
>>
>> However I don't understand why the check is done only in the bnx2x
>> driver. Shouldn't the networking core ensure that gso_size + L3/L4
>> headers is <= the device MTU? If not, is every driver that does TSO
>> expected to check this?
>>
>> Also, should these fixes go to stable? I'm not sure whether you're
>> still handling stable patches for any of the unfixed versions (< 4.16)
>> now.
>>
>> Ben.
>
> Irrespective of the GSO issue, this sounds relevant to this other thread
>
> Stack sends oversize UDP packet to the driver
> https://www.mail-archive.com/netdev@vger.kernel.org/msg279006.html
>
> which discusses a specific cause of larger than MTU packets and its
> effect on the bnxt.
>
> Perhaps these patches were initially applied to the bnx2x driver only,
> because at the time that was the only nic known to lock up on such
> packets? Either way, a device independent validation is indeed
> probably preferable (independent of fixing that lo flapping root
> cause).

I did try to propose a more generic approach:

1) "[PATCH 0/3] Check gso_size of packets when forwarding"
   (https://www.spinics.net/lists/netdev/msg478634.html)

In this series I looked just at forwarding, where there is a check against the regular MTU but not against the GSO size. I added checks to is_skb_forwardable and the Open vSwitch forwarding path, but in feedback it was pointed out that I missed other ways a packet could be forwarded, such as tc-mirred and bpf.
A more generic approach was desired, so I proposed:

2) "[PATCH v2 0/4] Check size of packets before sending"
   (https://www.spinics.net/lists/netdev/msg480847.html)

This added a check to is_skb_forwardable and another check to validate_xmit_skb. It was not considered desirable to include this as a primary fix because it would be very hard to backport, so we included the fix for the bnx2x card specifically instead. I think the idea was to come back to the generic fix later.

I do remember then spending quite a bit of time trying to get the generic fix sorted out after that. I remember working on the intricacies of verifying various GSO stuff - I sent some fixes to GSO_BY_FRAGS handling, and I know I got sidetracked by GSO_DODGY somehow as well. I think the problem was that without dealing with GSO_DODGY, you end up with edge cases where untrusted providers of dodgy packets can bypass a naive length check.

Anyway, then I got busy - my job at that point was mostly support-driven and customers kept having new urgent issues! - so I didn't get to finish it.

This was about a year ago, so my recollection may be wrong and I may have misunderstood things.

Regards,
Daniel
Re: Stack sends oversize UDP packet to the driver
Hi Michael,

> I've received a bug report of oversized UDP packets sent to the
> bnxt_en driver for transmission. There is no check for illegal length
> in the driver and it will send a corrupted BD to the NIC if the
> non-TSO length exceeds the maximum MTU supported by the driver. This
> ultimately causes the driver to hang.
>
> Looking a little deeper, it looks like the route of the SKB was
> initially to "lo" and therefore no fragmentation was done. And it
> looks like the route later got changed to the bnxt_en dev before
> transmission. The user was doing multiple VM reboots and the bad
> length was happening on the Linux host.
>
> I can add a length check in the driver to prevent this. But is there
> a better way to prevent this in the stack? Thanks.

I hit a similar sounding issue on a bnx2x - see commit 8914a595110a6eca69a5e275b323f5d09e18f4f9. In that case, a GSO packet with a gso_size too large for the firmware was coming to the bnx2x driver from an ibmveth device via Open vSwitch.

I also toyed with a fix in the stack and ended up fixing just the driver. I was hoping to get a generic fix into the stack afterwards, but didn't get anything finished. Looking back at old branches, it looks like I considered adding MTU validation to validate_xmit_skb, but I never got that upstream. My vague recollection is that I ended up caught by edge cases: GSO_DODGY allows an untrusted source to set gso parameters, so that needed to be validated first - and that was complex and potentially slow, and I just got overtaken by more urgent work. (Note that this was a year ago and was in many ways my introduction to TSO/GSO, so I could be completely wrong.)

Anyway, I can send you my partial work if it would be helpful.

Regards,
Daniel
Re: bnxt: card intermittently hanging and dropping link
Hi Michael,

>>> The main issue is the TX timeout.
>>> .
>>> [ 2682.911693] bnxt_en :3b:00.0 eth4: TX timeout detected, starting reset task!
>>> [ 2683.782496] bnxt_en :3b:00.0 eth4: Resp cmpl intr err msg: 0x51
>>> [ 2683.783061] bnxt_en :3b:00.0 eth4: hwrm_ring_free tx failed. rc:-1
>>> [ 2684.634557] bnxt_en :3b:00.0 eth4: Resp cmpl intr err msg: 0x51
>>> [ 2684.635120] bnxt_en :3b:00.0 eth4: hwrm_ring_free tx failed. rc:-1
>>>
>>> and it is not recovering.
>>>
>>> Please provide ethtool -i eth4 which will show the firmware version on
>>> the NIC. Let's see if the firmware is too old.
>>
>> driver: bnxt_en
>> version: 1.8.0
>> firmware-version: 20.6.151.0/pkg 20.06.05.11
>
> I believe the firmware should be updated. My colleague will contact
> you on how to proceed.

Thank you very much, I'll follow up with them off-list.

Regards,
Daniel
Re: bnxt: card intermittently hanging and dropping link
Hi Michael,

> The main issue is the TX timeout.
> .
>
>> [ 2682.911693] bnxt_en :3b:00.0 eth4: TX timeout detected, starting
>> reset task!
>> [ 2683.782496] bnxt_en :3b:00.0 eth4: Resp cmpl intr err msg: 0x51
>> [ 2683.783061] bnxt_en :3b:00.0 eth4: hwrm_ring_free tx failed. rc:-1
>> [ 2684.634557] bnxt_en :3b:00.0 eth4: Resp cmpl intr err msg: 0x51
>> [ 2684.635120] bnxt_en :3b:00.0 eth4: hwrm_ring_free tx failed. rc:-1
>
> and it is not recovering.
>
> Please provide ethtool -i eth4 which will show the firmware version on
> the NIC. Let's see if the firmware is too old.

driver: bnxt_en
version: 1.8.0
firmware-version: 20.6.151.0/pkg 20.06.05.11
expansion-rom-version:
bus-info: :3b:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no

Thanks!

Regards,
Daniel
bnxt: card intermittently hanging and dropping link
Hi Michael,

I have some user reports of a Broadcom 57412 card intermittently hanging and dropping the link. The problem has been observed on a Dell server with an Ubuntu 4.13 kernel (bnxt_en version 1.7.0) and with an Ubuntu 4.15 kernel (bnxt_en version 1.8.0). It seems to occur while mounting an XFS volume over iSCSI, although running blkid on the partition succeeds, and it seems to be a different volume each time. I've included an excerpt from the kernel log below.

It seems that other people have reported this with a RHEL7 kernel - I have no idea what driver version that would be running, or what workload they were operating, but the warnings and messages look the same: https://www.dell.com/community/PowerEdge-Hardware-General/Critical-network-bnxt-en-module-crashes-on-14G-servers/td-p/6031769/highlight/true

The forum poster reported that disconnecting the card from power for 5 minutes was sufficient to get things working again, and I have asked our user to test that.

Is this a known issue?
Regards,
Daniel

[ 2662.526151] scsi host14: iSCSI Initiator over TCP/IP
[ 2662.538110] scsi host15: iSCSI Initiator over TCP/IP
[ 2662.547350] scsi host16: iSCSI Initiator over TCP/IP
[ 2662.554660] scsi host17: iSCSI Initiator over TCP/IP
[ 2662.813860] scsi 15:0:0:1: Direct-Access PURE FlashArray PQ: 0 ANSI: 6
[ 2662.813888] scsi 14:0:0:1: Direct-Access PURE FlashArray PQ: 0 ANSI: 6
[ 2662.813972] scsi 16:0:0:1: Direct-Access PURE FlashArray PQ: 0 ANSI: 6
[ 2662.814322] sd 15:0:0:1: Attached scsi generic sg1 type 0
[ 2662.814553] sd 14:0:0:1: Attached scsi generic sg2 type 0
[ 2662.814554] scsi 17:0:0:1: Direct-Access PURE FlashArray PQ: 0 ANSI: 6
[ 2662.814612] sd 16:0:0:1: Attached scsi generic sg3 type 0
[ 2662.815081] sd 15:0:0:1: [sdb] 10737418240 512-byte logical blocks: (5.50 TB/5.00 TiB)
[ 2662.815195] sd 15:0:0:1: [sdb] Write Protect is off
[ 2662.815197] sd 15:0:0:1: [sdb] Mode Sense: 43 00 00 08
[ 2662.815229] sd 14:0:0:1: [sdc] 10737418240 512-byte logical blocks: (5.50 TB/5.00 TiB)
[ 2662.815292] sd 17:0:0:1: Attached scsi generic sg4 type 0
[ 2662.815342] sd 14:0:0:1: [sdc] Write Protect is off
[ 2662.815343] sd 14:0:0:1: [sdc] Mode Sense: 43 00 00 08
[ 2662.815419] sd 15:0:0:1: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 2662.815447] sd 16:0:0:1: [sdd] 10737418240 512-byte logical blocks: (5.50 TB/5.00 TiB)
[ 2662.815544] sd 14:0:0:1: [sdc] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 2662.815614] sd 16:0:0:1: [sdd] Write Protect is off
[ 2662.815615] sd 16:0:0:1: [sdd] Mode Sense: 43 00 00 08
[ 2662.815882] sd 16:0:0:1: [sdd] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 2662.816188] sd 17:0:0:1: [sde] 10737418240 512-byte logical blocks: (5.50 TB/5.00 TiB)
[ 2662.816298] sd 17:0:0:1: [sde] Write Protect is off
[ 2662.816300] sd 17:0:0:1: [sde] Mode Sense: 43 00 00 08
[ 2662.816502] sd 17:0:0:1: [sde] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 2662.820080] sd 15:0:0:1: [sdb] Attached SCSI disk
[ 2662.820594] sd 14:0:0:1: [sdc] Attached SCSI disk
[ 2662.820995] sd 17:0:0:1: [sde] Attached SCSI disk
[ 2662.821176] sd 16:0:0:1: [sdd] Attached SCSI disk
[ 2662.913642] device-mapper: multipath round-robin: version 1.2.0 loaded
[ 2663.954001] XFS (dm-2): Mounting V5 Filesystem
[ 2673.186083] connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295558209, last ping 4295559460, now 4295560768
[ 2673.186135] connection3:0: detected conn error (1022)
[ 2673.186137] connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295558209, last ping 4295559460, now 4295560768
[ 2673.186168] connection2:0: detected conn error (1022)
[ 2673.186170] connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295558209, last ping 4295559460, now 4295560768
[ 2673.186211] connection1:0: detected conn error (1022)
[ 2674.209870] connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295558463, last ping 4295559720, now 4295561024
[ 2674.209924] connection4:0: detected conn error (1022)
[ 2678.560630] session1: session recovery timed out after 5 secs
[ 2678.560641] session2: session recovery timed out after 5 secs
[ 2678.560647] session3: session recovery timed out after 5 secs
[ 2678.951453] device-mapper: multipath: Failing path 8:32.
[ 2678.951509] device-mapper: multipath: Failing path 8:48.
[ 2678.951548] device-mapper: multipath: Failing path 8:16.
[ 2679.584302] session4: session recovery timed out after 5 secs
[ 2679.584313] sd 17:0:0:1: rejecting I/O to offline device
[ 2679.584356] sd 17:0:0:1: [sde] killing request
[ 2679.584362] sd 17:0:0:1: rejecting I/O to offline device
[ 2679.584392] sd 17:0:0:1: [sde] killing request
[ 2679.584401] sd 17:0:0:1: [sde] FAILED Result: hostbyte=DID_NO_CONNECT drive
[PATCH] net: use skb_is_gso_sctp() instead of open-coding
As well as the basic conversion, I noticed that a lot of the SCTP code checks gso_type without first checking skb_is_gso(), so I have added that where appropriate. Also, document the helper.

Cc: Daniel Borkmann
Cc: Marcelo Ricardo Leitner
Signed-off-by: Daniel Axtens
---
This depends on d02f51cbcf12 ("bpf: fix bpf_skb_adjust_net/bpf_skb_proto_xlat to deal with gso sctp skbs") which introduces the required helper. That is in the bpf tree at the moment.
---
 Documentation/networking/segmentation-offloads.txt | 5 -
 net/core/skbuff.c                                  | 2 +-
 net/sched/act_csum.c                               | 2 +-
 net/sctp/input.c                                   | 8
 net/sctp/inqueue.c                                 | 2 +-
 net/sctp/offload.c                                 | 2 +-
 6 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/Documentation/networking/segmentation-offloads.txt b/Documentation/networking/segmentation-offloads.txt
index 23a8dd91a9ec..36bb931b35e0 100644
--- a/Documentation/networking/segmentation-offloads.txt
+++ b/Documentation/networking/segmentation-offloads.txt
@@ -155,7 +155,10 @@ Therefore, any code in the core networking stack must be aware of the
 possibility that gso_size will be GSO_BY_FRAGS and handle that
 case appropriately.
 
-There are a couple of helpers to make this easier:
+There are some helpers to make this easier:
+
+ - skb_is_gso(skb) && skb_is_gso_sctp(skb) is the best way to see if
+   an skb is an SCTP GSO skb.
 
  - For size checks, the skb_gso_validate_*_len family of helpers
    correctly considers GSO_BY_FRAGS.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0bb0d8877954..baf990528943 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4904,7 +4904,7 @@ static unsigned int skb_gso_transport_seglen(const struct sk_buff *skb)
 			thlen += inner_tcp_hdrlen(skb);
 	} else if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))) {
 		thlen = tcp_hdrlen(skb);
-	} else if (unlikely(shinfo->gso_type & SKB_GSO_SCTP)) {
+	} else if (unlikely(skb_is_gso_sctp(skb))) {
 		thlen = sizeof(struct sctphdr);
 	}
 	/* UFO sets gso_size to the size of the fragmentation

diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index b7ba9b06b147..24b2e8e681cf 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -350,7 +350,7 @@ static int tcf_csum_sctp(struct sk_buff *skb, unsigned int ihl,
 {
 	struct sctphdr *sctph;
 
-	if (skb_is_gso(skb) && skb_shinfo(skb)->gso_type & SKB_GSO_SCTP)
+	if (skb_is_gso(skb) && skb_is_gso_sctp(skb))
 		return 1;
 
 	sctph = tcf_csum_skb_nextlayer(skb, ihl, ipl, sizeof(*sctph));

diff --git a/net/sctp/input.c b/net/sctp/input.c
index 0247cc432e02..b381d78548ac 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -106,6 +106,7 @@ int sctp_rcv(struct sk_buff *skb)
 	int family;
 	struct sctp_af *af;
 	struct net *net = dev_net(skb->dev);
+	bool is_gso = skb_is_gso(skb) && skb_is_gso_sctp(skb);
 
 	if (skb->pkt_type != PACKET_HOST)
 		goto discard_it;
@@ -123,8 +124,7 @@ int sctp_rcv(struct sk_buff *skb)
 	 * it's better to just linearize it otherwise crc computing
 	 * takes longer.
 	 */
-	if ((!(skb_shinfo(skb)->gso_type & SKB_GSO_SCTP) &&
-	     skb_linearize(skb)) ||
+	if ((!is_gso && skb_linearize(skb)) ||
 	    !pskb_may_pull(skb, sizeof(struct sctphdr)))
 		goto discard_it;
 
@@ -135,7 +135,7 @@ int sctp_rcv(struct sk_buff *skb)
 	if (skb_csum_unnecessary(skb))
 		__skb_decr_checksum_unnecessary(skb);
 	else if (!sctp_checksum_disable &&
-		 !(skb_shinfo(skb)->gso_type & SKB_GSO_SCTP) &&
+		 !is_gso &&
 		 sctp_rcv_checksum(net, skb) < 0)
 		goto discard_it;
 	skb->csum_valid = 1;
@@ -1218,7 +1218,7 @@ static struct sctp_association *__sctp_rcv_lookup_harder(struct net *net,
 	 * issue as packets hitting this are mostly INIT or INIT-ACK and
 	 * those cannot be on GSO-style anyway.
 	 */
-	if ((skb_shinfo(skb)->gso_type & SKB_GSO_SCTP) == SKB_GSO_SCTP)
+	if (skb_is_gso(skb) && skb_is_gso_sctp(skb))
 		return NULL;
 
 	ch = (struct sctp_chunkhdr *)skb->data;

diff --git a/net/sctp/inqueue.c b/net/sctp/inqueue.c
index 48392552ee7c..23ebc5318edc 100644
--- a/net/sctp/inqueue.c
+++ b/net/sctp/inqueue.c
@@ -170,7 +170,7 @@ struct sctp_chunk *sctp_inq_pop(struct sctp_inq *queue)
 
 		chunk = list_entry(entry, struct sctp_chunk, list);
 
-		if ((skb_shinfo(chunk->skb)->gso_type & SKB_GSO_SCTP) == SKB_GSO_SCTP) {
+		if (skb_
[PATCH] docs: segmentation-offloads.txt: Correct TCP gso_types
Pretty minor: just SKB_GSO_TCP -> SKB_GSO_TCPV4 and SKB_GSO_TCP6 -> SKB_GSO_TCPV6.

Signed-off-by: Daniel Axtens
---
 Documentation/networking/segmentation-offloads.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/segmentation-offloads.txt b/Documentation/networking/segmentation-offloads.txt
index d47480b61ac6..e879b009c75d 100644
--- a/Documentation/networking/segmentation-offloads.txt
+++ b/Documentation/networking/segmentation-offloads.txt
@@ -20,8 +20,8 @@ TCP Segmentation Offload
 
 TCP segmentation allows a device to segment a single frame into multiple
 frames with a data payload size specified in skb_shinfo()->gso_size.
-When TCP segmentation requested the bit for either SKB_GSO_TCP or
-SKB_GSO_TCP6 should be set in skb_shinfo()->gso_type and
+When TCP segmentation requested the bit for either SKB_GSO_TCPV4 or
+SKB_GSO_TCPV6 should be set in skb_shinfo()->gso_type and
 skb_shinfo()->gso_size should be set to a non-zero value.
 
 TCP segmentation is dependent on support for the use of partial checksum
-- 
2.14.1
Re: [PATCH bpf] bpf: fix bpf_skb_adjust_net/bpf_skb_proto_xlat to deal with gso sctp skbs
Hi Daniel,

> From: Daniel Axtens
>
> SCTP GSO skbs have a gso_size of GSO_BY_FRAGS, so any sort of
> unconditionally mangling of that will result in nonsense value
> and would corrupt the skb later on.
>
> Therefore, i) add two helpers skb_increase_gso_size() and
> skb_decrease_gso_size() that would throw a one time warning and
> bail out for such skbs and ii) refuse and return early with an
> error in those BPF helpers that are affected. We do need to bail
> out as early as possible from there before any changes on the
> skb have been performed.
>
> Fixes: 6578171a7ff0 ("bpf: add bpf_skb_change_proto helper")
> Co-authored-by: Daniel Borkmann
> Signed-off-by: Daniel Axtens
> Cc: Marcelo Ricardo Leitner
> Acked-by: Alexei Starovoitov

I've looked over your changes and they all look good to me.

> +/* Note: Should be called only if skb_is_gso(skb) is true */
> +static inline bool skb_is_gso_sctp(const struct sk_buff *skb)
> +{
> +	return skb_shinfo(skb)->gso_type & SKB_GSO_SCTP;
> +}
> +

This helper is a fantastic idea and I will send a docs update to highlight it.

Regards,
Daniel
[PATCH v2 4/4] net: make skb_gso_*_seglen functions private
They're very hard to use properly, as they do not consider the GSO_BY_FRAGS case. Code should use skb_gso_validate_network_len and skb_gso_validate_mac_len instead, as they do consider this case.

Make the seglen functions static, which stops people using them outside of skbuff.c.

Signed-off-by: Daniel Axtens
---
v2: drop inline from functions.
---
 include/linux/skbuff.h | 33 -
 net/core/skbuff.c      | 37 +++-
 2 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a057dd1a75c7..ddf77cf4ff2d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3285,7 +3285,6 @@ int skb_zerocopy(struct sk_buff *to, struct sk_buff *from,
 void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len);
 int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen);
 void skb_scrub_packet(struct sk_buff *skb, bool xnet);
-unsigned int skb_gso_transport_seglen(const struct sk_buff *skb);
 bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu);
 bool skb_gso_validate_mac_len(const struct sk_buff *skb, unsigned int len);
 struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features);
@@ -4104,38 +4103,6 @@ static inline bool skb_head_is_locked(const struct sk_buff *skb)
 	return !skb->head_frag || skb_cloned(skb);
 }
 
-/**
- * skb_gso_network_seglen - Return length of individual segments of a gso packet
- *
- * @skb: GSO skb
- *
- * skb_gso_network_seglen is used to determine the real size of the
- * individual segments, including Layer3 (IP, IPv6) and L4 headers (TCP/UDP).
- *
- * The MAC/L2 header is not accounted for.
- */
-static inline unsigned int skb_gso_network_seglen(const struct sk_buff *skb)
-{
-	unsigned int hdr_len = skb_transport_header(skb) -
-			       skb_network_header(skb);
-	return hdr_len + skb_gso_transport_seglen(skb);
-}
-
-/**
- * skb_gso_mac_seglen - Return length of individual segments of a gso packet
- *
- * @skb: GSO skb
- *
- * skb_gso_mac_seglen is used to determine the real size of the
- * individual segments, including MAC/L2, Layer3 (IP, IPv6) and L4
- * headers (TCP/UDP).
- */
-static inline unsigned int skb_gso_mac_seglen(const struct sk_buff *skb)
-{
-	unsigned int hdr_len = skb_transport_header(skb) - skb_mac_header(skb);
-	return hdr_len + skb_gso_transport_seglen(skb);
-}
-
 /* Local Checksum Offload.
  * Compute outer checksum based on the assumption that the
  * inner checksum will be offloaded later.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b63767008824..0bb0d8877954 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4891,7 +4891,7 @@ EXPORT_SYMBOL_GPL(skb_scrub_packet);
  *
  * The MAC/L2 or network (IP, IPv6) headers are not accounted for.
  */
-unsigned int skb_gso_transport_seglen(const struct sk_buff *skb)
+static unsigned int skb_gso_transport_seglen(const struct sk_buff *skb)
 {
 	const struct skb_shared_info *shinfo = skb_shinfo(skb);
 	unsigned int thlen = 0;
@@ -4913,7 +4913,40 @@ unsigned int skb_gso_transport_seglen(const struct sk_buff *skb)
 	 */
 	return thlen + shinfo->gso_size;
 }
-EXPORT_SYMBOL_GPL(skb_gso_transport_seglen);
+
+/**
+ * skb_gso_network_seglen - Return length of individual segments of a gso packet
+ *
+ * @skb: GSO skb
+ *
+ * skb_gso_network_seglen is used to determine the real size of the
+ * individual segments, including Layer3 (IP, IPv6) and L4 headers (TCP/UDP).
+ *
+ * The MAC/L2 header is not accounted for.
+ */
+static unsigned int skb_gso_network_seglen(const struct sk_buff *skb)
+{
+	unsigned int hdr_len = skb_transport_header(skb) -
+			       skb_network_header(skb);
+
+	return hdr_len + skb_gso_transport_seglen(skb);
+}
+
+/**
+ * skb_gso_mac_seglen - Return length of individual segments of a gso packet
+ *
+ * @skb: GSO skb
+ *
+ * skb_gso_mac_seglen is used to determine the real size of the
+ * individual segments, including MAC/L2, Layer3 (IP, IPv6) and L4
+ * headers (TCP/UDP).
+ */
+static unsigned int skb_gso_mac_seglen(const struct sk_buff *skb)
+{
+	unsigned int hdr_len = skb_transport_header(skb) - skb_mac_header(skb);
+
+	return hdr_len + skb_gso_transport_seglen(skb);
+}
 
 /**
  * skb_gso_size_check - check the skb size, considering GSO_BY_FRAGS
-- 
2.14.1
[PATCH v2 1/4] net: rename skb_gso_validate_mtu -> skb_gso_validate_network_len
If you take a GSO skb and split it into packets, will the network length (L3 headers + L4 headers + payload) of those packets be small enough to fit within a given MTU? skb_gso_validate_mtu gives you the answer to that question. However, we recently added a way to validate the MAC length of a split GSO skb (L2+L3+L4+payload), and the names get confusing, so rename skb_gso_validate_mtu to skb_gso_validate_network_len.

Signed-off-by: Daniel Axtens
---
 include/linux/skbuff.h                  |  2 +-
 net/core/skbuff.c                       | 11 ++-
 net/ipv4/ip_forward.c                   |  2 +-
 net/ipv4/ip_output.c                    |  2 +-
 net/ipv4/netfilter/nf_flow_table_ipv4.c |  2 +-
 net/ipv6/ip6_output.c                   |  2 +-
 net/ipv6/netfilter/nf_flow_table_ipv6.c |  2 +-
 net/mpls/af_mpls.c                      |  2 +-
 net/xfrm/xfrm_device.c                  |  2 +-
 9 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c1e66bdcf583..a057dd1a75c7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3286,7 +3286,7 @@ void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len);
 int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen);
 void skb_scrub_packet(struct sk_buff *skb, bool xnet);
 unsigned int skb_gso_transport_seglen(const struct sk_buff *skb);
-bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu);
+bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu);
 bool skb_gso_validate_mac_len(const struct sk_buff *skb, unsigned int len);
 struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features);
 struct sk_buff *skb_vlan_untag(struct sk_buff *skb);

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 09bd89c90a71..b63767008824 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4955,19 +4955,20 @@ static inline bool skb_gso_size_check(const struct sk_buff *skb,
 }
 
 /**
- * skb_gso_validate_mtu - Return in case such skb fits a given MTU
+ * skb_gso_validate_network_len - Will a split GSO skb fit into a given MTU?
  *
  * @skb: GSO skb
  * @mtu: MTU to validate against
  *
- * skb_gso_validate_mtu validates if a given skb will fit a wanted MTU
- * once split.
+ * skb_gso_validate_network_len validates if a given skb will fit a
+ * wanted MTU once split. It considers L3 headers, L4 headers, and the
+ * payload.
  */
-bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu)
+bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu)
 {
 	return skb_gso_size_check(skb, skb_gso_network_seglen(skb), mtu);
 }
-EXPORT_SYMBOL_GPL(skb_gso_validate_mtu);
+EXPORT_SYMBOL_GPL(skb_gso_validate_network_len);
 
 /**
  * skb_gso_validate_mac_len - Will a split GSO skb fit in a given length?

diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 2dd21c3281a1..b54b948b0596 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -55,7 +55,7 @@ static bool ip_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
 	if (skb->ignore_df)
 		return false;
 
-	if (skb_is_gso(skb) && skb_gso_validate_mtu(skb, mtu))
+	if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
 		return false;
 
 	return true;

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index e8e675be60ec..66340ab750e6 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -248,7 +248,7 @@ static int ip_finish_output_gso(struct net *net, struct sock *sk,
 
 	/* common case: seglen is <= mtu
 	 */
-	if (skb_gso_validate_mtu(skb, mtu))
+	if (skb_gso_validate_network_len(skb, mtu))
 		return ip_finish_output2(net, sk, skb);
 
 	/* Slowpath - GSO segment length exceeds the egress MTU.

diff --git a/net/ipv4/netfilter/nf_flow_table_ipv4.c b/net/ipv4/netfilter/nf_flow_table_ipv4.c
index 25d2975da156..2447077d163d 100644
--- a/net/ipv4/netfilter/nf_flow_table_ipv4.c
+++ b/net/ipv4/netfilter/nf_flow_table_ipv4.c
@@ -185,7 +185,7 @@ static bool __nf_flow_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
 	if ((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0)
 		return false;
 
-	if (skb_is_gso(skb) && skb_gso_validate_mtu(skb, mtu))
+	if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
 		return false;
 
 	return true;

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 997c7f19ad62..a8a919520090 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -412,7 +412,7 @@ static bool ip6_pkt_too_big(const struct sk_buff *skb, unsigned int mtu)
 	if (skb->ignore_df)
 		return false;
 
-	if (skb_is_gso(skb) && skb_gso_validate_mtu(skb, mtu))
+	if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
[PATCH v2 2/4] net: sched: tbf: handle GSO_BY_FRAGS case in enqueue
tbf_enqueue() checks the size of a packet before enqueuing it. However, the GSO size check does not consider the GSO_BY_FRAGS case, and so will drop GSO SCTP packets, causing a massive drop in throughput.

Use skb_gso_validate_mac_len() instead, as it does consider that case.

Signed-off-by: Daniel Axtens
---
 net/sched/sch_tbf.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c
index 229172d509cc..03225a8df973 100644
--- a/net/sched/sch_tbf.c
+++ b/net/sched/sch_tbf.c
@@ -188,7 +188,8 @@ static int tbf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	int ret;
 
 	if (qdisc_pkt_len(skb) > q->max_size) {
-		if (skb_is_gso(skb) && skb_gso_mac_seglen(skb) <= q->max_size)
+		if (skb_is_gso(skb) &&
+		    skb_gso_validate_mac_len(skb, q->max_size))
 			return tbf_segment(skb, sch, to_free);
 		return qdisc_drop(skb, sch, to_free);
 	}
-- 
2.14.1
[PATCH v2 3/4] net: xfrm: use skb_gso_validate_network_len() to check gso sizes
Replace skb_gso_network_seglen() with skb_gso_validate_network_len(), as it considers the GSO_BY_FRAGS case.

Signed-off-by: Daniel Axtens
---
 net/ipv4/xfrm4_output.c | 3 ++-
 net/ipv6/xfrm6_output.c | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/xfrm4_output.c b/net/ipv4/xfrm4_output.c
index 94b8702603bc..be980c195fc5 100644
--- a/net/ipv4/xfrm4_output.c
+++ b/net/ipv4/xfrm4_output.c
@@ -30,7 +30,8 @@ static int xfrm4_tunnel_check_size(struct sk_buff *skb)
 	mtu = dst_mtu(skb_dst(skb));
 
 	if ((!skb_is_gso(skb) && skb->len > mtu) ||
-	    (skb_is_gso(skb) && skb_gso_network_seglen(skb) > ip_skb_dst_mtu(skb->sk, skb))) {
+	    (skb_is_gso(skb) &&
+	     !skb_gso_validate_network_len(skb, ip_skb_dst_mtu(skb->sk, skb)))) {
 		skb->protocol = htons(ETH_P_IP);
 
 		if (skb->sk)

diff --git a/net/ipv6/xfrm6_output.c b/net/ipv6/xfrm6_output.c
index 8ae87d4ec5ff..5959ce9620eb 100644
--- a/net/ipv6/xfrm6_output.c
+++ b/net/ipv6/xfrm6_output.c
@@ -82,7 +82,7 @@ static int xfrm6_tunnel_check_size(struct sk_buff *skb)
 
 	if ((!skb_is_gso(skb) && skb->len > mtu) ||
 	    (skb_is_gso(skb) &&
-	     skb_gso_network_seglen(skb) > ip6_skb_dst_mtu(skb))) {
+	     !skb_gso_validate_network_len(skb, ip6_skb_dst_mtu(skb)))) {
 		skb->dev = dst->dev;
 		skb->protocol = htons(ETH_P_IPV6);
-- 
2.14.1
[PATCH v2 0/4] GSO_BY_FRAGS correctness improvements
As requested [1], I went through and had a look at users of gso_size to see if there were things that need to be fixed to consider GSO_BY_FRAGS, and I have tried to improve our helper functions to deal with this case. I found a few. This fixes bugs relating to the use of skb_gso_*_seglen() where GSO_BY_FRAGS is not considered. Patch 1 renames skb_gso_validate_mtu to skb_gso_validate_network_len. This is a follow-up to my earlier patch 2b16f048729b ("net: create skb_gso_validate_mac_len()"), and just makes everything a bit clearer. Patches 2 and 3 replace the final users of skb_gso_network_seglen() - which doesn't consider GSO_BY_FRAGS - with skb_gso_validate_network_len(), which does. This allows me to make the skb_gso_*_seglen functions private in patch 4 - now future users won't accidentally do the wrong comparison. Two things remain. One is qdisc_pkt_len_init, which is discussed at [2] - it's caught up in the GSO_DODGY mess. I don't have any expertise in GSO_DODGY, and it looks like a good clean fix will involve unpicking the whole validation mess, so I have left it for now. Secondly, there are 3 eBPF opcodes that change the gso_size of an SKB and don't consider GSO_BY_FRAGS. This is going through the bpf tree. Regards, Daniel [1] https://patchwork.ozlabs.org/comment/1852414/ [2] https://www.spinics.net/lists/netdev/msg482397.html PS: This is all in the core networking stack. For a driver to be affected by this it would need to support NETIF_F_GSO_SCTP / NETIF_F_GSO_SOFTWARE and then either use gso_size or not be a purely virtual device. (Many drivers look at gso_size, but do not support SCTP segmentation, so the core network will segment an SCTP GSO skb before it hits them.) Based on that, the only driver that may be affected is sunvnet, but I have no way of testing it, so I haven't looked at it.
v2: split out bpf stuff; fix review comments from Dave Miller. Daniel Axtens (4): net: rename skb_gso_validate_mtu -> skb_gso_validate_network_len net: sched: tbf: handle GSO_BY_FRAGS case in enqueue net: xfrm: use skb_gso_validate_network_len() to check gso sizes net: make skb_gso_*_seglen functions private include/linux/skbuff.h | 35 +--- net/core/skbuff.c | 48 - net/ipv4/ip_forward.c | 2 +- net/ipv4/ip_output.c | 2 +- net/ipv4/netfilter/nf_flow_table_ipv4.c | 2 +- net/ipv4/xfrm4_output.c | 3 ++- net/ipv6/ip6_output.c | 2 +- net/ipv6/netfilter/nf_flow_table_ipv6.c | 2 +- net/ipv6/xfrm6_output.c | 2 +- net/mpls/af_mpls.c | 2 +- net/sched/sch_tbf.c | 3 ++- net/xfrm/xfrm_device.c | 2 +- 12 files changed, 54 insertions(+), 51 deletions(-) -- 2.14.1
Re: [PATCH 5/6] net: add and use helpers when adjusting gso_size
Hi Daniel, >> It means that a user loaded eBPF program can trigger logs full of >> warnings merely by using this eBPF helper and generating GSO'd SCTP >> traffic. >> >> Daniel and Alexei, this is a serious problem. The eBPF helpers >> mentioned here cannot handle SCTP GSO packets properly, and in fact >> corrupt them if they adjust the gso_size. >> >> SCTP GSO packets use the GSO_BY_FRAGS scheme and cannot be treated >> the same way we treat normal GSO packets. > > Thanks for Cc, I would have missed it. This patch in combination with > patch 6/6 seems okay since we reject it in these cases, which is fine > (although it could be a warn_once), but patch 6/6 is definitely buggy > in that we leave the skb in a half edited state when leaving the helper. > Daniel, want me to fix this up from bpf side and route 5/6 and actual > fix via bpf tree? Otherwise please respin with fixed 6/6. I'm happy for you to take these two via the bpf tree with whatever fixes/changes are appropriate - I'm no eBPF expert so my fix was a 'best guess' only. Let me know if you would like anything respun on my end. Regards, Daniel PS. apologies for missing you on Cc - will make sure you're cced on any future bpf-related work! > > Thanks, > Daniel
[PATCH 3/6] net: xfrm: use skb_gso_validate_network_len() to check gso sizes
Replace skb_gso_network_seglen() with skb_gso_validate_network_len(), as it considers the GSO_BY_FRAGS case. Signed-off-by: Daniel Axtens --- net/ipv4/xfrm4_output.c | 3 ++- net/ipv6/xfrm6_output.c | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/net/ipv4/xfrm4_output.c b/net/ipv4/xfrm4_output.c index 94b8702603bc..be980c195fc5 100644 --- a/net/ipv4/xfrm4_output.c +++ b/net/ipv4/xfrm4_output.c @@ -30,7 +30,8 @@ static int xfrm4_tunnel_check_size(struct sk_buff *skb) mtu = dst_mtu(skb_dst(skb)); if ((!skb_is_gso(skb) && skb->len > mtu) || - (skb_is_gso(skb) && skb_gso_network_seglen(skb) > ip_skb_dst_mtu(skb->sk, skb))) { + (skb_is_gso(skb) && +!skb_gso_validate_network_len(skb, ip_skb_dst_mtu(skb->sk, skb)))) { skb->protocol = htons(ETH_P_IP); if (skb->sk) diff --git a/net/ipv6/xfrm6_output.c b/net/ipv6/xfrm6_output.c index 8ae87d4ec5ff..5959ce9620eb 100644 --- a/net/ipv6/xfrm6_output.c +++ b/net/ipv6/xfrm6_output.c @@ -82,7 +82,7 @@ static int xfrm6_tunnel_check_size(struct sk_buff *skb) if ((!skb_is_gso(skb) && skb->len > mtu) || (skb_is_gso(skb) && -skb_gso_network_seglen(skb) > ip6_skb_dst_mtu(skb))) { +!skb_gso_validate_network_len(skb, ip6_skb_dst_mtu(skb)))) { skb->dev = dst->dev; skb->protocol = htons(ETH_P_IPV6); -- 2.14.1
[PATCH 6/6] net: filter: refuse to change gso_size of SCTP packets
An eBPF tc action on a veth pair can be passed an SCTP GSO skb, which has gso_size of GSO_BY_FRAGS. If that action calls bpf_skb_change_proto(), bpf_skb_net_grow() or bpf_skb_net_shrink(), the code will unconditionally attempt to increment or decrement the gso_size to some nonsense value. It's not clear how this could be done correctly, so simply refuse to operate on SCTP GSO skbs. Cc: Marcelo Ricardo Leitner Signed-off-by: Daniel Axtens --- Marcelo - I am not an SCTP expert by any means, so if this code should be doing something else, please let me know. --- net/core/filter.c | 16 1 file changed, 16 insertions(+) diff --git a/net/core/filter.c b/net/core/filter.c index d9684d8a3ac7..f189c988962e 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2096,6 +2096,10 @@ static int bpf_skb_proto_4_to_6(struct sk_buff *skb) return ret; if (skb_is_gso(skb)) { + /* SCTP uses GSO_BY_FRAGS, cannot adjust */ + if (unlikely(skb_shinfo(skb)->gso_type & SKB_GSO_SCTP)) + return -ENOTSUPP; + /* SKB_GSO_TCPV4 needs to be changed into * SKB_GSO_TCPV6. */ @@ -2132,6 +2136,10 @@ static int bpf_skb_proto_6_to_4(struct sk_buff *skb) return ret; if (skb_is_gso(skb)) { + /* SCTP uses GSO_BY_FRAGS, cannot adjust */ + if (unlikely(skb_shinfo(skb)->gso_type & SKB_GSO_SCTP)) + return -ENOTSUPP; + /* SKB_GSO_TCPV6 needs to be changed into * SKB_GSO_TCPV4. */ @@ -2252,6 +2260,10 @@ static int bpf_skb_net_grow(struct sk_buff *skb, u32 len_diff) return ret; if (skb_is_gso(skb)) { + /* SCTP uses GSO_BY_FRAGS, cannot adjust */ + if (unlikely(skb_shinfo(skb)->gso_type & SKB_GSO_SCTP)) + return -ENOTSUPP; + /* Due to header grow, MSS needs to be downgraded. */ skb_decrease_gso_size(skb, len_diff); /* Header must be checked, and gso_segs recomputed. 
*/ @@ -2276,6 +2288,10 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 len_diff) return ret; if (skb_is_gso(skb)) { + /* SCTP uses GSO_BY_FRAGS, cannot adjust */ + if (unlikely(skb_shinfo(skb)->gso_type & SKB_GSO_SCTP)) + return -ENOTSUPP; + /* Due to header shrink, MSS can be upgraded. */ skb_increase_gso_size(skb, len_diff); /* Header must be checked, and gso_segs recomputed. */ -- 2.14.1
[PATCH 5/6] net: add and use helpers when adjusting gso_size
An audit of users of gso_size reveals that an eBPF tc action on a veth pair can be passed an SCTP GSO skb, which has gso_size of GSO_BY_FRAGS. If that action calls bpf_skb_change_proto(), bpf_skb_net_grow() or bpf_skb_net_shrink(), the gso_size will be unconditionally incremented or decremented to some nonsense value. Add helpers that WARN if attempting to change a gso_size of a GSO_BY_FRAGS skb (and leave the value unchanged). Signed-off-by: Daniel Axtens --- Documentation/networking/segmentation-offloads.txt | 11 +-- include/linux/skbuff.h | 16 net/core/filter.c | 8 3 files changed, 29 insertions(+), 6 deletions(-) diff --git a/Documentation/networking/segmentation-offloads.txt b/Documentation/networking/segmentation-offloads.txt index d47480b61ac6..23a8dd91a9ec 100644 --- a/Documentation/networking/segmentation-offloads.txt +++ b/Documentation/networking/segmentation-offloads.txt @@ -153,8 +153,15 @@ To signal this, gso_size is set to the special value GSO_BY_FRAGS. Therefore, any code in the core networking stack must be aware of the possibility that gso_size will be GSO_BY_FRAGS and handle that case -appropriately. (For size checks, the skb_gso_validate_*_len family of -helpers do this automatically.) +appropriately. + +There are a couple of helpers to make this easier: + + - For size checks, the skb_gso_validate_*_len family of helpers correctly + considers GSO_BY_FRAGS. + + - For manipulating packets, skb_increase_gso_size and skb_decrease_gso_size + will check for GSO_BY_FRAGS and WARN if asked to manipulate these skbs. This also affects drivers with the NETIF_F_FRAGLIST & NETIF_F_GSO_SCTP bits set. Note also that NETIF_F_GSO_SCTP is included in NETIF_F_GSO_SOFTWARE. 
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index d8340e6e8814..157a91864e16 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -4047,6 +4047,22 @@ static inline void skb_gso_reset(struct sk_buff *skb) skb_shinfo(skb)->gso_type = 0; } +static inline void skb_increase_gso_size(struct sk_buff *skb, u16 increment) +{ + if (WARN_ON_ONCE(skb_shinfo(skb)->gso_size == GSO_BY_FRAGS)) + return; + + skb_shinfo(skb)->gso_size += increment; +} + +static inline void skb_decrease_gso_size(struct sk_buff *skb, u16 decrement) +{ + if (WARN_ON_ONCE(skb_shinfo(skb)->gso_size == GSO_BY_FRAGS)) + return; + + skb_shinfo(skb)->gso_size -= decrement; +} + void __skb_warn_lro_forwarding(const struct sk_buff *skb); static inline bool skb_warn_if_lro(const struct sk_buff *skb) diff --git a/net/core/filter.c b/net/core/filter.c index 0c121adbdbaa..d9684d8a3ac7 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2105,7 +2105,7 @@ static int bpf_skb_proto_4_to_6(struct sk_buff *skb) } /* Due to IPv6 header, MSS needs to be downgraded. */ - skb_shinfo(skb)->gso_size -= len_diff; + skb_decrease_gso_size(skb, len_diff); /* Header must be checked, and gso_segs recomputed. */ skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY; skb_shinfo(skb)->gso_segs = 0; @@ -2141,7 +2141,7 @@ static int bpf_skb_proto_6_to_4(struct sk_buff *skb) } /* Due to IPv4 header, MSS can be upgraded. */ - skb_shinfo(skb)->gso_size += len_diff; + skb_increase_gso_size(skb, len_diff); /* Header must be checked, and gso_segs recomputed. */ skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY; skb_shinfo(skb)->gso_segs = 0; @@ -2253,7 +2253,7 @@ static int bpf_skb_net_grow(struct sk_buff *skb, u32 len_diff) if (skb_is_gso(skb)) { /* Due to header grow, MSS needs to be downgraded. */ - skb_shinfo(skb)->gso_size -= len_diff; + skb_decrease_gso_size(skb, len_diff); /* Header must be checked, and gso_segs recomputed. 
*/ skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY; skb_shinfo(skb)->gso_segs = 0; @@ -2277,7 +2277,7 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 len_diff) if (skb_is_gso(skb)) { /* Due to header shrink, MSS can be upgraded. */ - skb_shinfo(skb)->gso_size += len_diff; + skb_increase_gso_size(skb, len_diff); /* Header must be checked, and gso_segs recomputed. */ skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY; skb_shinfo(skb)->gso_segs = 0; -- 2.14.1
[PATCH 4/6] net: make skb_gso_*_seglen functions private
They're very hard to use properly as they do not consider the GSO_BY_FRAGS case. Code should use skb_gso_validate_network_len and skb_gso_validate_mac_len as they do consider this case. Make the seglen functions static inline, which stops people using them, and also means the validate functions don't have to make any out of line calls. Signed-off-by: Daniel Axtens --- include/linux/skbuff.h | 33 - net/core/skbuff.c | 37 +++-- 2 files changed, 35 insertions(+), 35 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index eeb76bb7730a..d8340e6e8814 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3288,7 +3288,6 @@ int skb_zerocopy(struct sk_buff *to, struct sk_buff *from, void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len); int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen); void skb_scrub_packet(struct sk_buff *skb, bool xnet); -unsigned int skb_gso_transport_seglen(const struct sk_buff *skb); bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu); bool skb_gso_validate_mac_len(const struct sk_buff *skb, unsigned int len); struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features); @@ -4107,38 +4106,6 @@ static inline bool skb_head_is_locked(const struct sk_buff *skb) return !skb->head_frag || skb_cloned(skb); } -/** - * skb_gso_network_seglen - Return length of individual segments of a gso packet - * - * @skb: GSO skb - * - * skb_gso_network_seglen is used to determine the real size of the - * individual segments, including Layer3 (IP, IPv6) and L4 headers (TCP/UDP). - * - * The MAC/L2 header is not accounted for. 
- */ -static inline unsigned int skb_gso_network_seglen(const struct sk_buff *skb) -{ - unsigned int hdr_len = skb_transport_header(skb) - - skb_network_header(skb); - return hdr_len + skb_gso_transport_seglen(skb); -} - -/** - * skb_gso_mac_seglen - Return length of individual segments of a gso packet - * - * @skb: GSO skb - * - * skb_gso_mac_seglen is used to determine the real size of the - * individual segments, including MAC/L2, Layer3 (IP, IPv6) and L4 - * headers (TCP/UDP). - */ -static inline unsigned int skb_gso_mac_seglen(const struct sk_buff *skb) -{ - unsigned int hdr_len = skb_transport_header(skb) - skb_mac_header(skb); - return hdr_len + skb_gso_transport_seglen(skb); -} - /* Local Checksum Offload. * Compute outer checksum based on the assumption that the * inner checksum will be offloaded later. diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 0c0c1d6f28ef..a664a3ae507e 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -4893,7 +4893,7 @@ EXPORT_SYMBOL_GPL(skb_scrub_packet); * * The MAC/L2 or network (IP, IPv6) headers are not accounted for. */ -unsigned int skb_gso_transport_seglen(const struct sk_buff *skb) +static inline unsigned int skb_gso_transport_seglen(const struct sk_buff *skb) { const struct skb_shared_info *shinfo = skb_shinfo(skb); unsigned int thlen = 0; @@ -4915,7 +4915,40 @@ unsigned int skb_gso_transport_seglen(const struct sk_buff *skb) */ return thlen + shinfo->gso_size; } -EXPORT_SYMBOL_GPL(skb_gso_transport_seglen); + +/** + * skb_gso_network_seglen - Return length of individual segments of a gso packet + * + * @skb: GSO skb + * + * skb_gso_network_seglen is used to determine the real size of the + * individual segments, including Layer3 (IP, IPv6) and L4 headers (TCP/UDP). + * + * The MAC/L2 header is not accounted for. 
+ */ +static inline unsigned int skb_gso_network_seglen(const struct sk_buff *skb) +{ + unsigned int hdr_len = skb_transport_header(skb) - + skb_network_header(skb); + + return hdr_len + skb_gso_transport_seglen(skb); +} + +/** + * skb_gso_mac_seglen - Return length of individual segments of a gso packet + * + * @skb: GSO skb + * + * skb_gso_mac_seglen is used to determine the real size of the + * individual segments, including MAC/L2, Layer3 (IP, IPv6) and L4 + * headers (TCP/UDP). + */ +static inline unsigned int skb_gso_mac_seglen(const struct sk_buff *skb) +{ + unsigned int hdr_len = skb_transport_header(skb) - skb_mac_header(skb); + + return hdr_len + skb_gso_transport_seglen(skb); +} /** * skb_gso_size_check - check the skb size, considering GSO_BY_FRAGS -- 2.14.1
[PATCH 2/6] net: sched: tbf: handle GSO_BY_FRAGS case in enqueue
tbf_enqueue() checks the size of a packet before enqueuing it. However, the GSO size check does not consider the GSO_BY_FRAGS case, and so will drop GSO SCTP packets, causing a massive drop in throughput. Use skb_gso_validate_mac_len() instead, as it does consider that case. Signed-off-by: Daniel Axtens --- net/sched/sch_tbf.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c index 229172d509cc..03225a8df973 100644 --- a/net/sched/sch_tbf.c +++ b/net/sched/sch_tbf.c @@ -188,7 +188,8 @@ static int tbf_enqueue(struct sk_buff *skb, struct Qdisc *sch, int ret; if (qdisc_pkt_len(skb) > q->max_size) { - if (skb_is_gso(skb) && skb_gso_mac_seglen(skb) <= q->max_size) + if (skb_is_gso(skb) && + skb_gso_validate_mac_len(skb, q->max_size)) return tbf_segment(skb, sch, to_free); return qdisc_drop(skb, sch, to_free); } -- 2.14.1
[PATCH 0/6] GSO_BY_FRAGS correctness improvements
As requested [1], I went through and had a look at users of gso_size to see if there were things that need to be fixed to consider GSO_BY_FRAGS, and I have tried to improve our helper functions to deal with this case. I found a few, and this fixes all but one of them. The one thing that remains is qdisc_pkt_len_init, which is discussed at [2] - it's caught up in the GSO_DODGY mess. I don't have any expertise in GSO_DODGY, and it looks like a good clean fix will involve unpicking the whole validation mess, so I have left it for now. Patch 1 renames skb_gso_validate_mtu to skb_gso_validate_network_len. This is a follow-up to my earlier patch 2b16f048729b ("net: create skb_gso_validate_mac_len()"), and just makes everything a bit clearer. Patches 2 and 3 replace the final users of skb_gso_network_seglen() - which doesn't consider GSO_BY_FRAGS - with skb_gso_validate_network_len(), which does. This allows me to make the skb_gso_*_seglen functions private in patch 4 - now future users won't accidentally do the wrong comparison. Lastly, there are 3 eBPF opcodes that change the gso_size of an SKB and don't consider GSO_BY_FRAGS. To address this, patch 5 adds a couple of helpers for changing gso_size that throw a WARN if you use them on a GSO_BY_FRAGS packet. To return a useful error to eBPF, patch 6 goes further and prevents any change of gso_size for SCTP GSO skbs. Regards, Daniel [1] https://patchwork.ozlabs.org/comment/1852414/ [2] https://www.spinics.net/lists/netdev/msg482397.html PS: This is all in the core networking stack. For a driver to be affected by this it would need to support NETIF_F_GSO_SCTP / NETIF_F_GSO_SOFTWARE and then either use gso_size or not be a purely virtual device. (Many drivers look at gso_size, but do not support SCTP segmentation, so the core network will segment an SCTP GSO skb before it hits them.) Based on that, the only driver that may be affected is sunvnet, but I have no way of testing it, so I haven't looked at it.
Daniel Axtens (6): net: rename skb_gso_validate_mtu -> skb_gso_validate_network_len net: sched: tbf: handle GSO_BY_FRAGS case in enqueue net: xfrm: use skb_gso_validate_network_len() to check gso sizes net: make skb_gso_*_seglen functions private net: add and use helpers when adjusting gso_size net: filter: refuse to change gso_size of SCTP packets Documentation/networking/segmentation-offloads.txt | 11 - include/linux/skbuff.h | 51 -- net/core/filter.c | 24 -- net/core/skbuff.c | 48 +--- net/ipv4/ip_forward.c | 2 +- net/ipv4/ip_output.c | 2 +- net/ipv4/netfilter/nf_flow_table_ipv4.c | 2 +- net/ipv4/xfrm4_output.c | 3 +- net/ipv6/ip6_output.c | 2 +- net/ipv6/netfilter/nf_flow_table_ipv6.c | 2 +- net/ipv6/xfrm6_output.c | 2 +- net/mpls/af_mpls.c | 2 +- net/sched/sch_tbf.c | 3 +- net/xfrm/xfrm_device.c | 2 +- 14 files changed, 99 insertions(+), 57 deletions(-) -- 2.14.1
[PATCH 1/6] net: rename skb_gso_validate_mtu -> skb_gso_validate_network_len
If you take a GSO skb, and split it into packets, will the network length (L3 headers + L4 headers + payload) of those packets be small enough to fit within a given MTU? skb_gso_validate_mtu gives you the answer to that question. However, we recently added a way to validate the MAC length of a split GSO skb (L2+L3+L4+payload), and the names get confusing, so rename skb_gso_validate_mtu to skb_gso_validate_network_len. Signed-off-by: Daniel Axtens --- include/linux/skbuff.h | 2 +- net/core/skbuff.c | 11 ++- net/ipv4/ip_forward.c | 2 +- net/ipv4/ip_output.c | 2 +- net/ipv4/netfilter/nf_flow_table_ipv4.c | 2 +- net/ipv6/ip6_output.c | 2 +- net/ipv6/netfilter/nf_flow_table_ipv6.c | 2 +- net/mpls/af_mpls.c | 2 +- net/xfrm/xfrm_device.c | 2 +- 9 files changed, 14 insertions(+), 13 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 9bc1750ca3d3..eeb76bb7730a 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3289,7 +3289,7 @@ void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len); int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen); void skb_scrub_packet(struct sk_buff *skb, bool xnet); unsigned int skb_gso_transport_seglen(const struct sk_buff *skb); -bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu); +bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu); bool skb_gso_validate_mac_len(const struct sk_buff *skb, unsigned int len); struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features); struct sk_buff *skb_vlan_untag(struct sk_buff *skb); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 96d36b81a3a5..0c0c1d6f28ef 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -4957,19 +4957,20 @@ static inline bool skb_gso_size_check(const struct sk_buff *skb, } /** - * skb_gso_validate_mtu - Return in case such skb fits a given MTU + * skb_gso_validate_network_len - Will a split GSO skb fit into a given MTU?
* * @skb: GSO skb * @mtu: MTU to validate against * - * skb_gso_validate_mtu validates if a given skb will fit a wanted MTU - * once split. + * skb_gso_validate_network_len validates if a given skb will fit a + * wanted MTU once split. It considers L3 headers, L4 headers, and the + * payload. */ -bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu) +bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu) { return skb_gso_size_check(skb, skb_gso_network_seglen(skb), mtu); } -EXPORT_SYMBOL_GPL(skb_gso_validate_mtu); +EXPORT_SYMBOL_GPL(skb_gso_validate_network_len); /** * skb_gso_validate_mac_len - Will a split GSO skb fit in a given length? diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c index 2dd21c3281a1..b54b948b0596 100644 --- a/net/ipv4/ip_forward.c +++ b/net/ipv4/ip_forward.c @@ -55,7 +55,7 @@ static bool ip_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu) if (skb->ignore_df) return false; - if (skb_is_gso(skb) && skb_gso_validate_mtu(skb, mtu)) + if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu)) return false; return true; diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index e8e675be60ec..66340ab750e6 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -248,7 +248,7 @@ static int ip_finish_output_gso(struct net *net, struct sock *sk, /* common case: seglen is <= mtu */ - if (skb_gso_validate_mtu(skb, mtu)) + if (skb_gso_validate_network_len(skb, mtu)) return ip_finish_output2(net, sk, skb); /* Slowpath - GSO segment length exceeds the egress MTU. 
diff --git a/net/ipv4/netfilter/nf_flow_table_ipv4.c b/net/ipv4/netfilter/nf_flow_table_ipv4.c index 25d2975da156..2447077d163d 100644 --- a/net/ipv4/netfilter/nf_flow_table_ipv4.c +++ b/net/ipv4/netfilter/nf_flow_table_ipv4.c @@ -185,7 +185,7 @@ static bool __nf_flow_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu) if ((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0) return false; - if (skb_is_gso(skb) && skb_gso_validate_mtu(skb, mtu)) + if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu)) return false; return true; diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 997c7f19ad62..a8a919520090 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -412,7 +412,7 @@ static bool ip6_pkt_too_big(const struct sk_buff *skb, unsigned int mtu) if (skb->ignore_df) return false; - if (skb_is_gso(skb) && skb_gso_validate_mtu(skb, mtu)) + if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
Re: syzcaller patch postings...
Dmitry Vyukov writes: > On Thu, Feb 22, 2018 at 3:35 PM, David Miller wrote: >> From: Dmitry Vyukov >> Date: Thu, 22 Feb 2018 10:58:07 +0100 >> >>> Do I understand it correctly that if syzbot replies to the CC list >>> that was in the testing request, it will resolve the problem? So if >>> netdev wasn't in CC, it will not be added to CC. >>> >>> I will go and fix it now. >> >> I don't want syzbot to send the patch to netdev, even if it >> was in the CC: list. >> >> And again this goes for netfilter-devel and linux-wireless as >> well. >> >> There is no reason whatsoever for syzbot to ever post an already >> posted patch back to the list again, even if it was on the CC: >> list. >> >> In fact netdev will be on that CC: list most of the time. > > But if the list on CC the first time, then the patch is already on > Patchwork, right? When syzbot replies it will add In-Reply-To, so this > should be treated as a comment to the existing patch? Will it still > cause problems? You would think that it would treat it as a comment, but it doesn't. We treat something as a comment if and only if the subject begins with some variant of Re:. An I-R-T is not enough: I think the reasoning was that people sometimes post their v2 series as replies to their v1 series, which is bad; but we erred on the side of not losing patches. There's not really a solid conceptual framework for this - in part because parsing the sheer variety of mail people post is really, really hard. Suggestions on a better algorithm are of course welcome. > How does Patchwork understand that an email contains a patch? Perhaps > if we make these emails appear as they don't contain a patch, it will > solve the problem. We have a pretty sophisticated parser that looks for things that look like the output of diff. 
[FWIW, the algorithm is at https://github.com/getpatchwork/patchwork/blob/master/patchwork/parser.py#L688 ] You *could* try to trick it: looking at the code, probably the simplest way would be to replace '---' in '--- a/foo/bar.c' with something else, but I'm really quite uncomfortable with that. Regards, Daniel
Re: syzcaller patch postings...
Dmitry Vyukov writes: > On Thu, Feb 22, 2018 at 2:31 PM, Daniel Axtens wrote: >> Dmitry Vyukov writes: >> >>> On Thu, Feb 22, 2018 at 9:26 AM, Paolo Abeni wrote: >>>> On Wed, 2018-02-21 at 16:47 -0500, David Miller wrote: >>>>> I have to mention this now before it gets out of control. >>>>> >>>>> I would like to ask that syzkaller stop posting the patch it is >>>>> testing when it posts to netdev. >>>> >>>> There is an open issue on this topic: >>>> >>>> https://github.com/google/syzkaller/issues/526 >>>> >>>> The current behaviour is that syzbot replies to all get_maintainer.pl >>>> recipients after testing a patch, regardless of the test submission >>>> recipient list, the idea was instead to respect such list. >>> >>> >>> Hi David, Florian, Paolo, >>> >>> Didn't realize it triggers patchwork. This wasn't intentional, sorry. >> >> A little-publicised and incorrectly-documented(!) feature of Patchwork >> is that it supports some email headers. In particular, if you include an >> "X-Patchwork-Hint: ignore" header, the mail will not be parsed by >> Patchwork. >> >> This will stop it being recorded as a patch. Unfortunately it will also >> stop it being recorded as a comment - I don't know if that's an issue in >> this case. Maybe we can set you up with Patchwork 2's new checks >> infrastructure instead. > > Nice. But unfortunately the current mailing technology we use allows > very limited set of headers and no custom headers: > https://cloud.google.com/appengine/docs/standard/go/mail/mail-with-headers-attachments > So while possible, it would require very significant rework... Ah, oh well, nevermind. > What's the Patchwork 2's new checks infrastructure? It's probably more a long-term thing than an immediate fix, but... The checks API is designed to integrate reporting of CI/testing results into Patchwork. It allows - through a REST API - an arbitrary process (like your checking) to report success/warning/failure against a patch. 
In your case you could report success = patch prevents bug, and failure = bug still exists with patch. It's still slightly a work-in-progress: at the moment you need an API key from a maintainer to post checks. But it does look pretty in the web frontend: e.g. https://patchwork.ozlabs.org/patch/871346/ - and the number of successful/warning/failed tests shows up on the patch list page. There's currently only one project (that I know of) out there that uses the checks API - Snowpatch: https://github.com/ruscur/snowpatch If, at any point in the future, you want to explore this, let me know as I'd be *very* happy to help with the implementation and if needed push features into Patchwork that make it easier/better. > If it will still remain a problem (hopefully not), then maybe it's > possible to blacklist syzbot address from creating new patches. syzbot > can do a lot, but so far does not also generate fixes for the bugs it > discovers :) In immediate practical terms, that might be the easiest. They all come from the same email address, right? Regards, Daniel
Re: syzcaller patch postings...
Dmitry Vyukov writes: > On Thu, Feb 22, 2018 at 9:26 AM, Paolo Abeni wrote: >> On Wed, 2018-02-21 at 16:47 -0500, David Miller wrote: >>> I have to mention this now before it gets out of control. >>> >>> I would like to ask that syzkaller stop posting the patch it is >>> testing when it posts to netdev. >> >> There is an open issue on this topic: >> >> https://github.com/google/syzkaller/issues/526 >> >> The current behaviour is that syzbot replies to all get_maintainer.pl >> recipients after testing a patch, regardless of the test submission >> recipient list, the idea was instead to respect such list. > > > Hi David, Florian, Paolo, > > Didn't realize it triggers patchwork. This wasn't intentional, sorry. A little-publicised and incorrectly-documented(!) feature of Patchwork is that it supports some email headers. In particular, if you include an "X-Patchwork-Hint: ignore" header, the mail will not be parsed by Patchwork. This will stop it being recorded as a patch. Unfortunately it will also stop it being recorded as a comment - I don't know if that's an issue in this case. Maybe we can set you up with Patchwork 2's new checks infrastructure instead. > > Do I understand it correctly that if syzbot replies to the CC list > that was in the testing request, it will resolve the problem? So if > netdev wasn't in CC, it will not be added to CC. > > I will go and fix it now.
Re: [PATCH] net: sched: tbf: handle GSO_BY_FRAGS case in enqueue
Hi Marcelo, > > If this block was meant to be an out-of-band/changelog comment, your > SOB line should be above the first --- marker. > Anyhow, > Reviewed-by: Marcelo Ricardo Leitner Thanks - I did a v2 with this around the right way [1], but DaveM asked me to be a bit more thorough and look for other GSO_BY_FRAGS issues. [2] I did find a couple, including the one I asked you about with qdisc_pkt_len_init [3] (thanks for your answer, btw). Currently I'm trying to wrap my head around GSO_DODGY as Eric mentioned on that thread, and then I'll do a series with this fix and the other GSO_BY_FRAGS fixes I find. Regards, Daniel [1] https://patchwork.ozlabs.org/patch/869145/ [2] https://patchwork.ozlabs.org/comment/1852414/ [3] https://www.spinics.net/lists/netdev/msg482397.html > >> --- >> net/sched/sch_tbf.c | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c >> index 229172d509cc..03225a8df973 100644 >> --- a/net/sched/sch_tbf.c >> +++ b/net/sched/sch_tbf.c >> @@ -188,7 +188,8 @@ static int tbf_enqueue(struct sk_buff *skb, struct Qdisc >> *sch, >> int ret; >> >> if (qdisc_pkt_len(skb) > q->max_size) { >> -if (skb_is_gso(skb) && skb_gso_mac_seglen(skb) <= q->max_size) >> +if (skb_is_gso(skb) && >> +skb_gso_validate_mac_len(skb, q->max_size)) >> return tbf_segment(skb, sch, to_free); >> return qdisc_drop(skb, sch, to_free); >> } >> -- >> 2.14.1 >>
[PATCH 0/3] Updates to segmentation-offloads.txt
I've been trying to wrap my head around GSO for a while now. This is a set of small changes to the docs that would probably have been helpful when I was starting out. I realise that GSO_DODGY is still a notable omission - I'm hesitant to write too much on it just yet as I don't understand it well and I think it's in the process of changing. Daniel Axtens (3): docs: segmentation-offloads.txt: update for UFO deprecation docs: segmentation-offloads.txt: Fix ref to SKB_GSO_TUNNEL_REMCSUM docs: segmentation-offloads.txt: add SCTP info Documentation/networking/segmentation-offloads.txt | 38 +++--- 1 file changed, 34 insertions(+), 4 deletions(-) -- 2.14.1
[PATCH 3/3] docs: segmentation-offloads.txt: add SCTP info
Most of this is extracted from 90017accff61 ("sctp: Add GSO support"), with some extra text about GSO_BY_FRAGS and the need to check for it. Cc: Marcelo Ricardo Leitner Signed-off-by: Daniel Axtens --- Documentation/networking/segmentation-offloads.txt | 26 ++ 1 file changed, 26 insertions(+) diff --git a/Documentation/networking/segmentation-offloads.txt b/Documentation/networking/segmentation-offloads.txt index b247471a183c..d47480b61ac6 100644 --- a/Documentation/networking/segmentation-offloads.txt +++ b/Documentation/networking/segmentation-offloads.txt @@ -13,6 +13,7 @@ The following technologies are described: * Generic Segmentation Offload - GSO * Generic Receive Offload - GRO * Partial Generic Segmentation Offload - GSO_PARTIAL + * SCTP acceleration with GSO - GSO_BY_FRAGS TCP Segmentation Offload @@ -132,3 +133,28 @@ values for if the header was simply duplicated. The one exception to this is the outer IPv4 ID field. It is up to the device drivers to guarantee that the IPv4 ID field is incremented in the case that a given header does not have the DF bit set. + +SCTP acceleration with GSO +=== + +SCTP - despite the lack of hardware support - can still take advantage of +GSO to pass one large packet through the network stack, rather than +multiple small packets. + +This requires a different approach to other offloads, as SCTP packets +cannot be just segmented to (P)MTU. Rather, the chunks must be contained in +IP segments, padding respected. So unlike regular GSO, SCTP can't just +generate a big skb, set gso_size to the fragmentation point and deliver it +to IP layer. + +Instead, the SCTP protocol layer builds an skb with the segments correctly +padded and stored as chained skbs, and skb_segment() splits based on those. +To signal this, gso_size is set to the special value GSO_BY_FRAGS. + +Therefore, any code in the core networking stack must be aware of the +possibility that gso_size will be GSO_BY_FRAGS and handle that case +appropriately.
(For size checks, the skb_gso_validate_*_len family of +helpers do this automatically.) + +This also affects drivers with the NETIF_F_FRAGLIST & NETIF_F_GSO_SCTP bits +set. Note also that NETIF_F_GSO_SCTP is included in NETIF_F_GSO_SOFTWARE. -- 2.14.1
[PATCH 1/3] docs: segmentation-offloads.txt: update for UFO deprecation
UFO is deprecated except for tuntap and packet per 0c19f846d582, ("net: accept UFO datagrams from tuntap and packet"). Update UFO docs to reflect this. Signed-off-by: Daniel Axtens --- Documentation/networking/segmentation-offloads.txt | 4 1 file changed, 4 insertions(+) diff --git a/Documentation/networking/segmentation-offloads.txt b/Documentation/networking/segmentation-offloads.txt index 2f09455a993a..2cda12ab7075 100644 --- a/Documentation/networking/segmentation-offloads.txt +++ b/Documentation/networking/segmentation-offloads.txt @@ -49,6 +49,10 @@ datagram into multiple IPv4 fragments. Many of the requirements for UDP fragmentation offload are the same as TSO. However the IPv4 ID for fragments should not increment as a single IPv4 datagram is fragmented. +UFO is deprecated: modern kernels will no longer generate UFO skbs, but can +still receive them from tuntap and similar devices. Offload of UDP-based +tunnel protocols is still supported. + IPIP, SIT, GRE, UDP Tunnel, and Remote Checksum Offloads -- 2.14.1
[PATCH 2/3] docs: segmentation-offloads.txt: Fix ref to SKB_GSO_TUNNEL_REMCSUM
The doc originally called it SKB_GSO_REMCSUM. Fix it. Fixes: f7a6272bf3cb ("Documentation: Add documentation for TSO and GSO features") Signed-off-by: Daniel Axtens -- The only text change is s/SKB_GSO_REMCSUM/SKB_GSO_TUNNEL_REMCSUM/, but I rewrapped the text which is why the whole para changed. --- Documentation/networking/segmentation-offloads.txt | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/Documentation/networking/segmentation-offloads.txt b/Documentation/networking/segmentation-offloads.txt index 2cda12ab7075..b247471a183c 100644 --- a/Documentation/networking/segmentation-offloads.txt +++ b/Documentation/networking/segmentation-offloads.txt @@ -87,10 +87,10 @@ SKB_GSO_UDP_TUNNEL_CSUM. These two additional tunnel types reflect the fact that the outer header also requests to have a non-zero checksum included in the outer header. -Finally there is SKB_GSO_REMCSUM which indicates that a given tunnel header -has requested a remote checksum offload. In this case the inner headers -will be left with a partial checksum and only the outer header checksum -will be computed. +Finally there is SKB_GSO_TUNNEL_REMCSUM which indicates that a given tunnel +header has requested a remote checksum offload. In this case the inner +headers will be left with a partial checksum and only the outer header +checksum will be computed. Generic Segmentation Offload -- 2.14.1
qdisc_pkt_len_init: SCTP/GSO_BY_FRAGS and robustness questions
Hi Marcelo and Eric, I'm working on checking code that might be impacted by GSO_BY_FRAGS - after finding that the token bucket filter qdisc code doesn't handle it properly, DaveM said I should look for other places where this might be an issue [0]. I'm currently looking at qdisc_pkt_len_init in net/core/dev.c. This is called by __dev_queue_xmit, before validate_xmit_skb, so before an SCTP skb would be segmented if the hardware doesn't support SCTP offload. There are two things I was hoping you two could offer some advice on: 1) Eric, in 7c68d1a6b4db ("net: qdisc_pkt_len_init() should be more robust") you replaced a chunk of code similar to that found in skb_gso_transport_seglen() with more robust code. Do we need to change skb_gso_transport_seglen() in a similar way? 2) Marcelo, unlike skb_gso_transport_seglen(), where you added a case for SCTP in 90017accff61 ("sctp: Add GSO support"), there doesn't seem to be a GSO_BY_FRAGS or SCTP check in qdisc_pkt_len_init, so I think the accounting is probably wrong for SCTP. I'm not 100% sure how to fix this as it's now quite different from the calculations in skb_gso_transport_seglen() - so I was hoping that you might have an idea. Thanks in advance! [0]: https://patchwork.ozlabs.org/patch/869145/#1852414 Regards, Daniel
[PATCH v2] net: sched: tbf: handle GSO_BY_FRAGS case in enqueue
tbf_enqueue() checks the size of a packet before enqueuing it. However, the GSO size check does not consider the GSO_BY_FRAGS case, and so will drop GSO SCTP packets, causing a massive drop in throughput. Use skb_gso_validate_mac_len() instead, as it does consider that case. Signed-off-by: Daniel Axtens --- skb_gso_validate_mac_len() is an out-of-line call, but so is skb_gso_mac_seglen(), so this is slower but not much slower. I will send a patch to make the skb_gso_validate_* functions inline-able shortly. Also, GSO_BY_FRAGS considered harmful - I'm pretty sure this is not the only place it causes issues. v2: put S-o-b in the right spot, thanks Andrew Donnellan --- net/sched/sch_tbf.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c index 229172d509cc..03225a8df973 100644 --- a/net/sched/sch_tbf.c +++ b/net/sched/sch_tbf.c @@ -188,7 +188,8 @@ static int tbf_enqueue(struct sk_buff *skb, struct Qdisc *sch, int ret; if (qdisc_pkt_len(skb) > q->max_size) { - if (skb_is_gso(skb) && skb_gso_mac_seglen(skb) <= q->max_size) + if (skb_is_gso(skb) && + skb_gso_validate_mac_len(skb, q->max_size)) return tbf_segment(skb, sch, to_free); return qdisc_drop(skb, sch, to_free); } -- 2.14.1
[PATCH] net: sched: tbf: handle GSO_BY_FRAGS case in enqueue
tbf_enqueue() checks the size of a packet before enqueuing it. However, the GSO size check does not consider the GSO_BY_FRAGS case, and so will drop GSO SCTP packets, causing a massive drop in throughput. Use skb_gso_validate_mac_len() instead, as it does consider that case. --- skb_gso_validate_mac_len() is an out-of-line call, but so is skb_gso_mac_seglen(), so this is slower but not much slower. I will send a patch to make the skb_gso_validate_* functions inline-able shortly. Also, GSO_BY_FRAGS considered harmful - I'm pretty sure this is not the only place it causes issues. Signed-off-by: Daniel Axtens --- net/sched/sch_tbf.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c index 229172d509cc..03225a8df973 100644 --- a/net/sched/sch_tbf.c +++ b/net/sched/sch_tbf.c @@ -188,7 +188,8 @@ static int tbf_enqueue(struct sk_buff *skb, struct Qdisc *sch, int ret; if (qdisc_pkt_len(skb) > q->max_size) { - if (skb_is_gso(skb) && skb_gso_mac_seglen(skb) <= q->max_size) + if (skb_is_gso(skb) && + skb_gso_validate_mac_len(skb, q->max_size)) return tbf_segment(skb, sch, to_free); return qdisc_drop(skb, sch, to_free); } -- 2.14.1
RE: [PATCH v4 2/2] bnx2x: disable GSO where gso_size is too big for hardware
"Chopra, Manish" writes: >> -Original Message- >> From: Daniel Axtens [mailto:d...@axtens.net] >> Sent: Wednesday, January 31, 2018 8:46 AM >> To: netdev@vger.kernel.org >> Cc: Daniel Axtens ; Eric Dumazet ; >> Chopra, Manish ; Jason Wang >> ; Pravin Shelar ; Marcelo Ricardo >> Leitner >> Subject: [PATCH v4 2/2] bnx2x: disable GSO where gso_size is too big for >> hardware >> >> If a bnx2x card is passed a GSO packet with a gso_size larger than >> ~9700 bytes, it will cause a firmware error that will bring the card >> down: >> >> bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f0)]MC assert! >> bnx2x: [bnx2x_mc_assert:720(enP24p1s0f0)]XSTORM_ASSERT_LIST_INDEX 0x2 >> bnx2x: [bnx2x_mc_assert:736(enP24p1s0f0)]XSTORM_ASSERT_INDEX 0x0 = >> 0x 0x25e43e47 0x00463e01 0x00010052 >> bnx2x: [bnx2x_mc_assert:750(enP24p1s0f0)]Chip Revision: everest3, FW >> Version: 7_13_1 ... (dump of values continues) ... >> >> Detect when the mac length of a GSO packet is greater than the maximum >> packet size (9700 bytes) and disable GSO. >> >> Signed-off-by: Daniel Axtens >> >> --- >> >> v4: Only call the slow check if the gso_size is large. >> Eric - I think this is what you had in mind? >> Manish - do you think this is an acceptable performance trade-off? >> GSO will work for any packet size, and only jumbo frames will >> have to do the slower test. >> >> Again, only build-tested. 
>> --- >> drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 18 >> ++ >> 1 file changed, 18 insertions(+) >> >> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c >> b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c >> index 7b08323e3f3d..74fc9af4aadb 100644 >> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c >> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c >> @@ -12934,6 +12934,24 @@ static netdev_features_t >> bnx2x_features_check(struct sk_buff *skb, >>struct net_device *dev, >>netdev_features_t features) >> { >> +/* >> + * A skb with gso_size + header length > 9700 will cause a >> + * firmware panic. Drop GSO support. >> + * >> + * Eventually the upper layer should not pass these packets down. >> + * >> + * For speed, if the gso_size is <= 9000, assume there will >> + * not be 700 bytes of headers and pass it through. Only do a >> + * full (slow) validation if the gso_size is > 9000. >> + * >> + * (Due to the way SKB_BY_FRAGS works this will also do a full >> + * validation in that case.) >> + */ >> +if (unlikely(skb_is_gso(skb) && >> + (skb_shinfo(skb)->gso_size > 9000) && >> + !skb_gso_validate_mac_len(skb, 9700))) >> +features &= ~NETIF_F_GSO_MASK; > > Hi Daniel, > > Obviously, it could be bad from performance perspective since every gso > packet has to do these check. > When running iperf/netperf performance benchmark, where GSO is likely to > occur. > > Why do you have to put two checks for skb_is_gso() and gso_size ? Isn't > gso_size > anything means GSO skb ? Yes, the check is redundant. I do it to make my logic clearer. The compiler will optimise it into one check. 
I compiled with gcc-7.2 and gcc-4.8, both targeting amd64, and in both cases I could only find one relevant cmp: skb_is_gso(): /home/dja/dev/linux/linux/include/linux/skbuff.h:4032 3686: 48 8b 8f d0 00 00 00mov0xd0(%rdi),%rcx bnx2x_features_check(): /home/dja/dev/linux/linux/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c:12950 368d: 66 81 7c 01 04 28 23cmpw $0x2328,0x4(%rcx,%rax,1) 3694: 0f 87 ea 01 00 00 ja 3884 0x2328 is decimal 9000. > I assume it won't cause disabling the offload if "headers [L2 + L3 + L4] + > gso_size" is still <= 9700. ? That is correct. The flow is: If the gso_size is less than or equal to 9000, offload is enabled with no further checks. This is the fast path. If the gso_size is greater than 9000, then the "headers [L2 + L3 + L4] + gso_size" is checked. This is an out-of-line check done once per GSO skb. If headers + gso_size is <= 9700, offload is enabled. If headers + gso_size > 9700, offload is disabled. Regards, Daniel > > Thanks, > Manish
[PATCH v4 2/2] bnx2x: disable GSO where gso_size is too big for hardware
If a bnx2x card is passed a GSO packet with a gso_size larger than ~9700 bytes, it will cause a firmware error that will bring the card down: bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f0)]MC assert! bnx2x: [bnx2x_mc_assert:720(enP24p1s0f0)]XSTORM_ASSERT_LIST_INDEX 0x2 bnx2x: [bnx2x_mc_assert:736(enP24p1s0f0)]XSTORM_ASSERT_INDEX 0x0 = 0x 0x25e43e47 0x00463e01 0x00010052 bnx2x: [bnx2x_mc_assert:750(enP24p1s0f0)]Chip Revision: everest3, FW Version: 7_13_1 ... (dump of values continues) ... Detect when the mac length of a GSO packet is greater than the maximum packet size (9700 bytes) and disable GSO. Signed-off-by: Daniel Axtens --- v4: Only call the slow check if the gso_size is large. Eric - I think this is what you had in mind? Manish - do you think this is an acceptable performance trade-off? GSO will work for any packet size, and only jumbo frames will have to do the slower test. Again, only build-tested. --- drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c index 7b08323e3f3d..74fc9af4aadb 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c @@ -12934,6 +12934,24 @@ static netdev_features_t bnx2x_features_check(struct sk_buff *skb, struct net_device *dev, netdev_features_t features) { + /* +* A skb with gso_size + header length > 9700 will cause a +* firmware panic. Drop GSO support. +* +* Eventually the upper layer should not pass these packets down. +* +* For speed, if the gso_size is <= 9000, assume there will +* not be 700 bytes of headers and pass it through. Only do a +* full (slow) validation if the gso_size is > 9000. +* +* (Due to the way GSO_BY_FRAGS works this will also do a full +* validation in that case.)
+*/ + if (unlikely(skb_is_gso(skb) && +(skb_shinfo(skb)->gso_size > 9000) && +!skb_gso_validate_mac_len(skb, 9700))) + features &= ~NETIF_F_GSO_MASK; + features = vlan_features_check(skb, features); return vxlan_features_check(skb, features); } -- 2.14.1
[PATCH v4 0/2] bnx2x: disable GSO on too-large packets
We observed a case where a packet received on an ibmveth device had a GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x device, where it caused a firmware assert. This is described in detail at [0]. Ultimately we want a fix in the core, but that is very tricky to backport. So for now, just stop the bnx2x driver from crashing. When net-next re-opens I will send the fix to the core and a revert for this. v4 changes: - fix compilation error with EXPORTs (patch 1) - only do slow test if gso_size is greater than 9000 bytes (patch 2) Thanks, Daniel [0]: https://patchwork.ozlabs.org/patch/859410/ Cc: Eric Dumazet Cc: manish.cho...@cavium.com Cc: Jason Wang Cc: Pravin Shelar Cc: Marcelo Ricardo Leitner Daniel Axtens (2): net: create skb_gso_validate_mac_len() bnx2x: disable GSO where gso_size is too big for hardware drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 18 +++ include/linux/skbuff.h | 16 ++ net/core/skbuff.c| 63 +++- net/sched/sch_tbf.c | 10 4 files changed, 84 insertions(+), 23 deletions(-) -- 2.14.1
[PATCH v4 1/2] net: create skb_gso_validate_mac_len()
If you take a GSO skb, and split it into packets, will the MAC length (L2 + L3 + L4 headers + payload) of those packets be small enough to fit within a given length? Move skb_gso_mac_seglen() to skbuff.h with other related functions like skb_gso_network_seglen() so we can use it, and then create skb_gso_validate_mac_len to do the full calculation. Signed-off-by: Daniel Axtens --- v4: fix mistake in EXPORT name --- include/linux/skbuff.h | 16 + net/core/skbuff.c | 63 +++--- net/sched/sch_tbf.c| 10 3 files changed, 66 insertions(+), 23 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index b8e0da6c27d6..242d6773c7c2 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3287,6 +3287,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen); void skb_scrub_packet(struct sk_buff *skb, bool xnet); unsigned int skb_gso_transport_seglen(const struct sk_buff *skb); bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu); +bool skb_gso_validate_mac_len(const struct sk_buff *skb, unsigned int len); struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features); struct sk_buff *skb_vlan_untag(struct sk_buff *skb); int skb_ensure_writable(struct sk_buff *skb, int write_len); @@ -4120,6 +4121,21 @@ static inline unsigned int skb_gso_network_seglen(const struct sk_buff *skb) return hdr_len + skb_gso_transport_seglen(skb); } +/** + * skb_gso_mac_seglen - Return length of individual segments of a gso packet + * + * @skb: GSO skb + * + * skb_gso_mac_seglen is used to determine the real size of the + * individual segments, including MAC/L2, Layer3 (IP, IPv6) and L4 + * headers (TCP/UDP). + */ +static inline unsigned int skb_gso_mac_seglen(const struct sk_buff *skb) +{ + unsigned int hdr_len = skb_transport_header(skb) - skb_mac_header(skb); + return hdr_len + skb_gso_transport_seglen(skb); +} + /* Local Checksum Offload. 
* Compute outer checksum based on the assumption that the * inner checksum will be offloaded later. diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 01e8285aea73..8c61c27c1b28 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -4914,37 +4914,74 @@ unsigned int skb_gso_transport_seglen(const struct sk_buff *skb) EXPORT_SYMBOL_GPL(skb_gso_transport_seglen); /** - * skb_gso_validate_mtu - Return in case such skb fits a given MTU + * skb_gso_size_check - check the skb size, considering GSO_BY_FRAGS * - * @skb: GSO skb - * @mtu: MTU to validate against + * There are a couple of instances where we have a GSO skb, and we + * want to determine what size it would be after it is segmented. * - * skb_gso_validate_mtu validates if a given skb will fit a wanted MTU - * once split. + * We might want to check: + * -L3+L4+payload size (e.g. IP forwarding) + * - L2+L3+L4+payload size (e.g. sanity check before passing to driver) + * + * This is a helper to do that correctly considering GSO_BY_FRAGS. + * + * @seg_len: The segmented length (from skb_gso_*_seglen). In the + * GSO_BY_FRAGS case this will be [header sizes + GSO_BY_FRAGS]. + * + * @max_len: The maximum permissible length. + * + * Returns true if the segmented length <= max length. 
*/ -bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu) -{ +static inline bool skb_gso_size_check(const struct sk_buff *skb, + unsigned int seg_len, + unsigned int max_len) { const struct skb_shared_info *shinfo = skb_shinfo(skb); const struct sk_buff *iter; - unsigned int hlen; - - hlen = skb_gso_network_seglen(skb); if (shinfo->gso_size != GSO_BY_FRAGS) - return hlen <= mtu; + return seg_len <= max_len; /* Undo this so we can re-use header sizes */ - hlen -= GSO_BY_FRAGS; + seg_len -= GSO_BY_FRAGS; skb_walk_frags(skb, iter) { - if (hlen + skb_headlen(iter) > mtu) + if (seg_len + skb_headlen(iter) > max_len) return false; } return true; } + +/** + * skb_gso_validate_mtu - Return in case such skb fits a given MTU + * + * @skb: GSO skb + * @mtu: MTU to validate against + * + * skb_gso_validate_mtu validates if a given skb will fit a wanted MTU + * once split. + */ +bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu) +{ + return skb_gso_size_check(skb, skb_gso_network_seglen(skb), mtu); +} EXPORT_SYMBOL_GPL(skb_gso_validate_mtu); +/** + * skb_gso_validate_mac_len - Will a split GSO skb fit in a given length? + * + * @skb: GSO skb + * @len: length to validate against + * + * skb_gso_validate_mac_len validates if a given skb will fit a wanted + * length once split, including L2, L3 and L4 headers and the payload. + */ +bool skb_gso_validate_ma
Re: [PATCH v3 1/2] net: create skb_gso_validate_mac_len()
Marcelo Ricardo Leitner writes: > Hi, > > On Tue, Jan 30, 2018 at 12:14:46PM +1100, Daniel Axtens wrote: >> If you take a GSO skb, and split it into packets, will the MAC >> length (L2 + L3 + L4 headers + payload) of those packets be small >> enough to fit within a given length? >> >> Move skb_gso_mac_seglen() to skbuff.h with other related functions >> like skb_gso_network_seglen() so we can use it, and then create >> skb_gso_validate_mac_len to do the full calculation. >> >> Signed-off-by: Daniel Axtens >> --- >> include/linux/skbuff.h | 16 + >> net/core/skbuff.c | 65 >> +++--- >> net/sched/sch_tbf.c| 10 >> 3 files changed, 67 insertions(+), 24 deletions(-) >> >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h >> index b8e0da6c27d6..242d6773c7c2 100644 >> --- a/include/linux/skbuff.h >> +++ b/include/linux/skbuff.h >> @@ -3287,6 +3287,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff >> *skb, int shiftlen); >> void skb_scrub_packet(struct sk_buff *skb, bool xnet); >> unsigned int skb_gso_transport_seglen(const struct sk_buff *skb); >> bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu); >> +bool skb_gso_validate_mac_len(const struct sk_buff *skb, unsigned int len); >> struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t >> features); >> struct sk_buff *skb_vlan_untag(struct sk_buff *skb); >> int skb_ensure_writable(struct sk_buff *skb, int write_len); >> @@ -4120,6 +4121,21 @@ static inline unsigned int >> skb_gso_network_seglen(const struct sk_buff *skb) >> return hdr_len + skb_gso_transport_seglen(skb); >> } >> >> +/** >> + * skb_gso_mac_seglen - Return length of individual segments of a gso packet >> + * >> + * @skb: GSO skb >> + * >> + * skb_gso_mac_seglen is used to determine the real size of the >> + * individual segments, including MAC/L2, Layer3 (IP, IPv6) and L4 >> + * headers (TCP/UDP). 
>> + */ >> +static inline unsigned int skb_gso_mac_seglen(const struct sk_buff *skb) >> +{ >> +unsigned int hdr_len = skb_transport_header(skb) - skb_mac_header(skb); >> +return hdr_len + skb_gso_transport_seglen(skb); >> +} >> + >> /* Local Checksum Offload. >> * Compute outer checksum based on the assumption that the >> * inner checksum will be offloaded later. >> diff --git a/net/core/skbuff.c b/net/core/skbuff.c >> index 01e8285aea73..55d84ab7d093 100644 >> --- a/net/core/skbuff.c >> +++ b/net/core/skbuff.c >> @@ -4914,36 +4914,73 @@ unsigned int skb_gso_transport_seglen(const struct >> sk_buff *skb) >> EXPORT_SYMBOL_GPL(skb_gso_transport_seglen); >> >> /** >> - * skb_gso_validate_mtu - Return in case such skb fits a given MTU >> + * skb_gso_size_check - check the skb size, considering GSO_BY_FRAGS >> * >> - * @skb: GSO skb >> - * @mtu: MTU to validate against >> + * There are a couple of instances where we have a GSO skb, and we >> + * want to determine what size it would be after it is segmented. >> * >> - * skb_gso_validate_mtu validates if a given skb will fit a wanted MTU >> - * once split. >> + * We might want to check: >> + * -L3+L4+payload size (e.g. IP forwarding) >> + * - L2+L3+L4+payload size (e.g. sanity check before passing to driver) >> + * >> + * This is a helper to do that correctly considering GSO_BY_FRAGS. >> + * >> + * @seg_len: The segmented length (from skb_gso_*_seglen). In the >> + * GSO_BY_FRAGS case this will be [header sizes + GSO_BY_FRAGS]. >> + * >> + * @max_len: The maximum permissible length. >> + * >> + * Returns true if the segmented length <= max length. 
>> */ >> -bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu) >> -{ >> +static inline bool skb_gso_size_check(const struct sk_buff *skb, >> + unsigned int seg_len, >> + unsigned int max_len) { >> const struct skb_shared_info *shinfo = skb_shinfo(skb); >> const struct sk_buff *iter; >> -unsigned int hlen; >> - >> -hlen = skb_gso_network_seglen(skb); >> >> if (shinfo->gso_size != GSO_BY_FRAGS) >> -return hlen <= mtu; >> +return seg_len <= max_len; >> >> /* Undo this so we can re-use header sizes */ >> -hlen -= GSO_BY_FRAGS; >> +seg_len -= GSO_BY_FRAGS; >> >
Re: [PATCH v3 1/2] net: create skb_gso_validate_mac_len()
Hi Eric, >> skb_gso_transport_seglen(skb) is quite expensive (out of line) >> >> It is unfortunate bnx2x seems to support 9600 MTU ( >> ETH_MAX_JUMBO_PACKET_SIZE ), because 100 bytes of headers can be too >> small in some cases. >> >> Presumably we could avoid calling the function for standard MTU <= >> 9000 Yes, it is an expensive call. I will send another version where we only call the expensive path in the jumbo frame case. I'll wait until tomorrow in case there is any more feedback in the next 24 hours or so. > Also are we sure about these bnx2x crashes being limited to TSO ? > > Maybe it will crash the same after GSO has segmented the packet and we > provide a big (like 10,000 bytes) packet ? I do not believe upper stack > will prevent this. We did test with TSO off and that was sufficient to prevent the crash. You are right that the upper stack sends down the 10k packet, so I don't know why it prevents the crash. But we did definitely test it! I don't have access to the firmware details but I imagine it's an issue with the offloaded segmentation. Thanks for your patience with the many versions of this patch set. Regards, Daniel
[PATCH v3 1/2] net: create skb_gso_validate_mac_len()
If you take a GSO skb, and split it into packets, will the MAC length (L2 + L3 + L4 headers + payload) of those packets be small enough to fit within a given length? Move skb_gso_mac_seglen() to skbuff.h with other related functions like skb_gso_network_seglen() so we can use it, and then create skb_gso_validate_mac_len to do the full calculation. Signed-off-by: Daniel Axtens --- include/linux/skbuff.h | 16 + net/core/skbuff.c | 65 +++--- net/sched/sch_tbf.c| 10 3 files changed, 67 insertions(+), 24 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index b8e0da6c27d6..242d6773c7c2 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3287,6 +3287,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen); void skb_scrub_packet(struct sk_buff *skb, bool xnet); unsigned int skb_gso_transport_seglen(const struct sk_buff *skb); bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu); +bool skb_gso_validate_mac_len(const struct sk_buff *skb, unsigned int len); struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features); struct sk_buff *skb_vlan_untag(struct sk_buff *skb); int skb_ensure_writable(struct sk_buff *skb, int write_len); @@ -4120,6 +4121,21 @@ static inline unsigned int skb_gso_network_seglen(const struct sk_buff *skb) return hdr_len + skb_gso_transport_seglen(skb); } +/** + * skb_gso_mac_seglen - Return length of individual segments of a gso packet + * + * @skb: GSO skb + * + * skb_gso_mac_seglen is used to determine the real size of the + * individual segments, including MAC/L2, Layer3 (IP, IPv6) and L4 + * headers (TCP/UDP). + */ +static inline unsigned int skb_gso_mac_seglen(const struct sk_buff *skb) +{ + unsigned int hdr_len = skb_transport_header(skb) - skb_mac_header(skb); + return hdr_len + skb_gso_transport_seglen(skb); +} + /* Local Checksum Offload. * Compute outer checksum based on the assumption that the * inner checksum will be offloaded later. 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 01e8285aea73..55d84ab7d093 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -4914,36 +4914,73 @@ unsigned int skb_gso_transport_seglen(const struct sk_buff *skb) EXPORT_SYMBOL_GPL(skb_gso_transport_seglen); /** - * skb_gso_validate_mtu - Return in case such skb fits a given MTU + * skb_gso_size_check - check the skb size, considering GSO_BY_FRAGS * - * @skb: GSO skb - * @mtu: MTU to validate against + * There are a couple of instances where we have a GSO skb, and we + * want to determine what size it would be after it is segmented. * - * skb_gso_validate_mtu validates if a given skb will fit a wanted MTU - * once split. + * We might want to check: + * -L3+L4+payload size (e.g. IP forwarding) + * - L2+L3+L4+payload size (e.g. sanity check before passing to driver) + * + * This is a helper to do that correctly considering GSO_BY_FRAGS. + * + * @seg_len: The segmented length (from skb_gso_*_seglen). In the + * GSO_BY_FRAGS case this will be [header sizes + GSO_BY_FRAGS]. + * + * @max_len: The maximum permissible length. + * + * Returns true if the segmented length <= max length. 
*/ -bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu) -{ +static inline bool skb_gso_size_check(const struct sk_buff *skb, + unsigned int seg_len, + unsigned int max_len) { const struct skb_shared_info *shinfo = skb_shinfo(skb); const struct sk_buff *iter; - unsigned int hlen; - - hlen = skb_gso_network_seglen(skb); if (shinfo->gso_size != GSO_BY_FRAGS) - return hlen <= mtu; + return seg_len <= max_len; /* Undo this so we can re-use header sizes */ - hlen -= GSO_BY_FRAGS; + seg_len -= GSO_BY_FRAGS; skb_walk_frags(skb, iter) { - if (hlen + skb_headlen(iter) > mtu) + if (seg_len + skb_headlen(iter) > max_len) return false; } return true; } -EXPORT_SYMBOL_GPL(skb_gso_validate_mtu); + +/** + * skb_gso_validate_mtu - Return in case such skb fits a given MTU + * + * @skb: GSO skb + * @mtu: MTU to validate against + * + * skb_gso_validate_mtu validates if a given skb will fit a wanted MTU + * once split. + */ +bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu) +{ + return skb_gso_size_check(skb, skb_gso_network_seglen(skb), mtu); +} +EXPORT_SYMBOL_GPL(skb_gso_validate_network_len); + +/** + * skb_gso_validate_mac_len - Will a split GSO skb fit in a given length? + * + * @skb: GSO skb + * @len: length to validate against + * + * skb_gso_validate_mac_len validates if a given skb will fit a wanted + * length once split, including L2, L3 and L4 headers and the payload. + */ +bool skb_gso_vali
[PATCH v3 0/2] bnx2x: disable GSO on too-large packets
We observed a case where a packet received on an ibmveth device had a GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x device, where it caused a firmware assert. This is described in detail at [0]. Ultimately we want a fix in the core, but that is very tricky to backport. So for now, just stop the bnx2x driver from crashing. When net-next re-opens I will send the fix to the core and a revert for this. Marcelo: I have left renaming skb_gso_validate_mtu() for the next series. Thanks, Daniel [0]: https://patchwork.ozlabs.org/patch/859410/ Cc: manish.cho...@cavium.com Cc: Jason Wang Cc: Pravin Shelar Cc: Marcelo Ricardo Leitner Daniel Axtens (2): net: create skb_gso_validate_mac_len() bnx2x: disable GSO where gso_size is too big for hardware drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 9 include/linux/skbuff.h | 16 ++ net/core/skbuff.c| 65 +++- net/sched/sch_tbf.c | 10 4 files changed, 76 insertions(+), 24 deletions(-) -- 2.14.1
[PATCH v3 2/2] bnx2x: disable GSO where gso_size is too big for hardware
If a bnx2x card is passed a GSO packet with a gso_size larger than
~9700 bytes, it will cause a firmware error that will bring the card
down:

bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f0)]MC assert!
bnx2x: [bnx2x_mc_assert:720(enP24p1s0f0)]XSTORM_ASSERT_LIST_INDEX 0x2
bnx2x: [bnx2x_mc_assert:736(enP24p1s0f0)]XSTORM_ASSERT_INDEX 0x0 = 0x 0x25e43e47 0x00463e01 0x00010052
bnx2x: [bnx2x_mc_assert:750(enP24p1s0f0)]Chip Revision: everest3, FW Version: 7_13_1
... (dump of values continues) ...

Detect when the mac length of a GSO packet is greater than the maximum
packet size (9700 bytes) and disable GSO.

Signed-off-by: Daniel Axtens
---
v3: use skb_gso_validate_mac_len to get the actual size, to allow
jumbo frames to work. I don't have local access to a bnx2x card and
sending this off to the user who does would add about a month of
round-trip time, so this particular spin is build-tested only.
(Earlier spins were tested on real hardware.)
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 7b08323e3f3d..fbb08e360625 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -12934,6 +12934,15 @@ static netdev_features_t bnx2x_features_check(struct sk_buff *skb,
 					      struct net_device *dev,
 					      netdev_features_t features)
 {
+	/*
+	 * A skb with gso_size + header length > 9700 will cause a
+	 * firmware panic. Drop GSO support.
+	 *
+	 * Eventually the upper layer should not pass these packets down.
+	 */
+	if (unlikely(skb_is_gso(skb) && !skb_gso_validate_mac_len(skb, 9700)))
+		features &= ~NETIF_F_GSO_MASK;
+
 	features = vlan_features_check(skb, features);
 	return vxlan_features_check(skb, features);
 }
-- 
2.14.1
Re: [PATCH v2 0/4] Check size of packets before sending
Eric Dumazet writes: > On Fri, 2018-01-26 at 00:44 +1100, Daniel Axtens wrote: >> Hi Eric, >> >> > May I ask which tree are you targeting ? >> > >> > ( Documentation/networking/netdev-FAQ.txt ) >> >> I have been targeting net-next, but I haven't pulled for about two >> weeks. I will rebase and if there are conflicts I will resend early next >> week. >> >> > Anything touching GSO is very risky and should target net-next, >> > especially considering 4.15 is released this week end. >> > >> > Are we really willing to backport this intrusive series in stable >> > trees, or do we have a smaller fix for bnx2x ? >> >> I do actually have a smaller fix for bnx2x, although it would need more work: >> https://patchwork.ozlabs.org/patch/859410/ >> >> It leaves open the possibility of too-large packets causing issues on >> other drivers. DaveM wasn't a fan: >> https://patchwork.ozlabs.org/patch/859410/#1839429 > > Yes, I know he prefers a generic solution, but I am pragmatic here. > Old kernels are very far from current GSO stack in net-next. > > Backporting all the dependencies is going to be very boring/risky. OK, so how about: - first, a series that introduces skb_gso_validate_mac_len and uses it in bnx2x. This should be backportable without difficulty. - then, a series that wires skb_gso_validate_mac_len into the core - validate_xmit_skb for instance, and reverts the bnx2x fix. This would not need to be backported. If that's an approach we can agree on, I am ready to send it when net-next opens again. Regards, Daniel
Re: [PATCH v2 0/4] Check size of packets before sending
Hi Eric, > May I ask which tree are you targeting ? > > ( Documentation/networking/netdev-FAQ.txt ) I have been targeting net-next, but I haven't pulled for about two weeks. I will rebase and if there are conflicts I will resend early next week. > Anything touching GSO is very risky and should target net-next, > especially considering 4.15 is released this week end. > > Are we really willing to backport this intrusive series in stable > trees, or do we have a smaller fix for bnx2x ? I do actually have a smaller fix for bnx2x, although it would need more work: https://patchwork.ozlabs.org/patch/859410/ It leaves open the possibility of too-large packets causing issues on other drivers. DaveM wasn't a fan: https://patchwork.ozlabs.org/patch/859410/#1839429 Regards, Daniel > > Thanks.
[PATCH v2 4/4] net: check the size of a packet in validate_xmit_skb
There are a number of paths where an oversize skb could be sent to a driver. The driver should not be required to check for this - the core layer should do it instead. Add a check to validate_xmit_skb that checks both GSO and non-GSO packets and drops them if they are too large. Signed-off-by: Daniel Axtens --- net/core/dev.c | 17 + 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index 6c96c26aadbf..f09eece2cd21 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1830,13 +1830,11 @@ static inline void net_timestamp_set(struct sk_buff *skb) __net_timestamp(SKB); \ } \ -bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb) +static inline bool skb_mac_len_fits_dev(const struct net_device *dev, + const struct sk_buff *skb) { unsigned int len; - if (!(dev->flags & IFF_UP)) - return false; - len = dev->mtu + dev->hard_header_len + VLAN_HLEN; if (skb->len <= len) return true; @@ -1850,6 +1848,14 @@ bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb) return false; } + +bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb) +{ + if (!(dev->flags & IFF_UP)) + return false; + + return skb_mac_len_fits_dev(dev, skb); +} EXPORT_SYMBOL_GPL(is_skb_forwardable); int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) @@ -3081,6 +3087,9 @@ static struct sk_buff *validate_xmit_skb(struct sk_buff *skb, struct net_device if (unlikely(!skb)) goto out_null; + if (unlikely(!skb_mac_len_fits_dev(dev, skb))) + goto out_kfree_skb; + if (netif_needs_gso(skb, features)) { struct sk_buff *segs; -- 2.14.1
[PATCH v2 1/4] net: rename skb_gso_validate_mtu -> skb_gso_validate_network_len
If you take a GSO skb, and split it into packets, will the network length (L3 headers + L4 headers + payload) of those packets be small enough to fit within a given MTU? skb_gso_validate_mtu gives you the answer to that question. However, we're about to add a way to validate the MAC length of a split GSO skb (L2+L3+L4+payload), and the names get confusing, so rename skb_gso_validate_mtu to skb_gso_validate_network_len Signed-off-by: Daniel Axtens --- include/linux/skbuff.h | 2 +- net/core/skbuff.c | 9 + net/ipv4/ip_forward.c | 2 +- net/ipv4/ip_output.c| 2 +- net/ipv4/netfilter/nf_flow_table_ipv4.c | 2 +- net/ipv6/ip6_output.c | 2 +- net/ipv6/netfilter/nf_flow_table_ipv6.c | 2 +- net/mpls/af_mpls.c | 2 +- net/xfrm/xfrm_device.c | 2 +- 9 files changed, 13 insertions(+), 12 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index b8e0da6c27d6..b137c79bf88d 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3286,7 +3286,7 @@ void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len); int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen); void skb_scrub_packet(struct sk_buff *skb, bool xnet); unsigned int skb_gso_transport_seglen(const struct sk_buff *skb); -bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu); +bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu); struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features); struct sk_buff *skb_vlan_untag(struct sk_buff *skb); int skb_ensure_writable(struct sk_buff *skb, int write_len); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 01e8285aea73..a93e5c7aa5b2 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -4914,15 +4914,16 @@ unsigned int skb_gso_transport_seglen(const struct sk_buff *skb) EXPORT_SYMBOL_GPL(skb_gso_transport_seglen); /** - * skb_gso_validate_mtu - Return in case such skb fits a given MTU + * skb_gso_validate_network_len - Will a split GSO skb fit into a 
given MTU? * * @skb: GSO skb * @mtu: MTU to validate against * - * skb_gso_validate_mtu validates if a given skb will fit a wanted MTU - * once split. + * skb_gso_validate_network_len validates if a given skb will fit a + * wanted MTU once split. It considers L3 headers, L4 headers, and the + * payload. */ -bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu) +bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu) { const struct skb_shared_info *shinfo = skb_shinfo(skb); const struct sk_buff *iter; diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c index 2dd21c3281a1..b54b948b0596 100644 --- a/net/ipv4/ip_forward.c +++ b/net/ipv4/ip_forward.c @@ -55,7 +55,7 @@ static bool ip_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu) if (skb->ignore_df) return false; - if (skb_is_gso(skb) && skb_gso_validate_mtu(skb, mtu)) + if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu)) return false; return true; diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index e8e675be60ec..66340ab750e6 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -248,7 +248,7 @@ static int ip_finish_output_gso(struct net *net, struct sock *sk, /* common case: seglen is <= mtu */ - if (skb_gso_validate_mtu(skb, mtu)) + if (skb_gso_validate_network_len(skb, mtu)) return ip_finish_output2(net, sk, skb); /* Slowpath - GSO segment length exceeds the egress MTU. 
diff --git a/net/ipv4/netfilter/nf_flow_table_ipv4.c b/net/ipv4/netfilter/nf_flow_table_ipv4.c index b2d01eb25f2c..cdf2625dc277 100644 --- a/net/ipv4/netfilter/nf_flow_table_ipv4.c +++ b/net/ipv4/netfilter/nf_flow_table_ipv4.c @@ -185,7 +185,7 @@ static bool __nf_flow_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu) if ((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0) return false; - if (skb_is_gso(skb) && skb_gso_validate_mtu(skb, mtu)) + if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu)) return false; return true; diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 18547a44bdaf..4e888328d4dd 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -412,7 +412,7 @@ static bool ip6_pkt_too_big(const struct sk_buff *skb, unsigned int mtu) if (skb->ignore_df) return false; - if (skb_is_gso(skb) && skb_gso_validate_mtu(skb, mtu)) + if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu)) return false; return true; diff --git a/net/ipv6/netfilter/nf_flow_table_ipv6.c b/net/ipv6/netfilter/nf_flo
[PATCH v2 2/4] net: move skb_gso_mac_seglen to skbuff.h
We're about to use this elsewhere, so move it into the header with the other related functions like skb_gso_network_seglen(). Signed-off-by: Daniel Axtens --- include/linux/skbuff.h | 15 +++ net/sched/sch_tbf.c| 10 -- 2 files changed, 15 insertions(+), 10 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index b137c79bf88d..4b3ca6a5ec0a 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -4120,6 +4120,21 @@ static inline unsigned int skb_gso_network_seglen(const struct sk_buff *skb) return hdr_len + skb_gso_transport_seglen(skb); } +/** + * skb_gso_mac_seglen - Return length of individual segments of a gso packet + * + * @skb: GSO skb + * + * skb_gso_mac_seglen is used to determine the real size of the + * individual segments, including MAC/L2, Layer3 (IP, IPv6) and L4 + * headers (TCP/UDP). + */ +static inline unsigned int skb_gso_mac_seglen(const struct sk_buff *skb) +{ + unsigned int hdr_len = skb_transport_header(skb) - skb_mac_header(skb); + return hdr_len + skb_gso_transport_seglen(skb); +} + /* Local Checksum Offload. * Compute outer checksum based on the assumption that the * inner checksum will be offloaded later. diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c index 83e76d046993..229172d509cc 100644 --- a/net/sched/sch_tbf.c +++ b/net/sched/sch_tbf.c @@ -142,16 +142,6 @@ static u64 psched_ns_t2l(const struct psched_ratecfg *r, return len; } -/* - * Return length of individual segments of a gso packet, - * including all headers (MAC, IP, TCP/UDP) - */ -static unsigned int skb_gso_mac_seglen(const struct sk_buff *skb) -{ - unsigned int hdr_len = skb_transport_header(skb) - skb_mac_header(skb); - return hdr_len + skb_gso_transport_seglen(skb); -} - /* GSO packet is too big, segment it so that tbf can transmit * each segment in time */ -- 2.14.1
[PATCH v2 3/4] net: is_skb_forwardable: check the size of GSO segments
is_skb_forwardable attempts to detect if a packet is too large to be sent to the destination device. However, this test does not consider GSO skbs, and it is possible that a GSO skb, when segmented, will be larger than the device can transmit. Create skb_gso_validate_mac_len, and use that to check. Signed-off-by: Daniel Axtens --- include/linux/skbuff.h | 1 + net/core/dev.c | 7 +++--- net/core/skbuff.c | 67 +++--- 3 files changed, 57 insertions(+), 18 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 4b3ca6a5ec0a..ec9c47b5a1c8 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3287,6 +3287,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen); void skb_scrub_packet(struct sk_buff *skb, bool xnet); unsigned int skb_gso_transport_seglen(const struct sk_buff *skb); bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu); +bool skb_gso_validate_mac_len(const struct sk_buff *skb, unsigned int len); struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features); struct sk_buff *skb_vlan_untag(struct sk_buff *skb); int skb_ensure_writable(struct sk_buff *skb, int write_len); diff --git a/net/core/dev.c b/net/core/dev.c index 94435cd09072..6c96c26aadbf 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1841,11 +1841,12 @@ bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb) if (skb->len <= len) return true; - /* if TSO is enabled, we don't care about the length as the packet -* could be forwarded without being segmented before + /* +* if TSO is enabled, we need to check the size of the +* segmented packets */ if (skb_is_gso(skb)) - return true; + return skb_gso_validate_mac_len(skb, len); return false; } diff --git a/net/core/skbuff.c b/net/core/skbuff.c index a93e5c7aa5b2..93f66725c32d 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -4914,37 +4914,74 @@ unsigned int skb_gso_transport_seglen(const struct sk_buff *skb) 
EXPORT_SYMBOL_GPL(skb_gso_transport_seglen); /** - * skb_gso_validate_network_len - Will a split GSO skb fit into a given MTU? + * skb_gso_size_check - check the skb size, considering GSO_BY_FRAGS * - * @skb: GSO skb - * @mtu: MTU to validate against + * There are a couple of instances where we have a GSO skb, and we + * want to determine what size it would be after it is segmented. * - * skb_gso_validate_network_len validates if a given skb will fit a - * wanted MTU once split. It considers L3 headers, L4 headers, and the - * payload. + * We might want to check: + * -L3+L4+payload size (e.g. IP forwarding) + * - L2+L3+L4+payload size (e.g. sanity check before passing to driver) + * + * This is a helper to do that correctly considering GSO_BY_FRAGS. + * + * @seg_len: The segmented length (from skb_gso_*_seglen). In the + * GSO_BY_FRAGS case this will be [header sizes + GSO_BY_FRAGS]. + * + * @max_len: The maximum permissible length. + * + * Returns true if the segmented length <= max length. */ -bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu) -{ +static inline bool skb_gso_size_check(const struct sk_buff *skb, + unsigned int seg_len, + unsigned int max_len) { const struct skb_shared_info *shinfo = skb_shinfo(skb); const struct sk_buff *iter; - unsigned int hlen; - - hlen = skb_gso_network_seglen(skb); if (shinfo->gso_size != GSO_BY_FRAGS) - return hlen <= mtu; + return seg_len <= max_len; /* Undo this so we can re-use header sizes */ - hlen -= GSO_BY_FRAGS; + seg_len -= GSO_BY_FRAGS; skb_walk_frags(skb, iter) { - if (hlen + skb_headlen(iter) > mtu) + if (seg_len + skb_headlen(iter) > max_len) return false; } return true; } -EXPORT_SYMBOL_GPL(skb_gso_validate_mtu); + +/** + * skb_gso_validate_network_len - Does an skb fit a given MTU? + * + * @skb: GSO skb + * @mtu: MTU to validate against + * + * skb_gso_validate_network_len validates if a given skb will fit a + * wanted MTU once split. 
It considers L3 headers, L4 headers, and the + * payload. + */ +bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu) +{ + return skb_gso_size_check(skb, skb_gso_network_seglen(skb), mtu); +} +EXPORT_SYMBOL_GPL(skb_gso_validate_network_len); + +/** + * skb_gso_validate_mac_len - Will a split GSO skb fit in a given length? + * + * @skb: GSO skb + * @len: length to validate against + * + * skb_gso_validate_mac_len validates if a given skb will fit a wanted + * length once split, including L2, L3 and L4 headers and the payload. + */ +
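[Editor's note: the GSO_BY_FRAGS branch in skb_gso_size_check() is the subtle part of this patch. A hedged userspace model of its control flow — the skb and frag-list walking are replaced by a plain array, and the seg_len convention (headers + GSO_BY_FRAGS in the sentinel case) follows the kernel comment above:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GSO_BY_FRAGS 0xFFFF	/* sentinel gso_size, as in skbuff.h */

/* Model of skb_gso_size_check: in the common case every segment has
 * the same length, so compare seg_len against max_len once. With
 * GSO_BY_FRAGS each fragment supplies its own payload length, so
 * subtract the sentinel (leaving just the header bytes) and check
 * headers + that fragment's headlen for every fragment. */
static bool gso_size_check(unsigned int gso_size, unsigned int seg_len,
			   const unsigned int *frag_headlen, size_t nfrags,
			   unsigned int max_len)
{
	size_t i;

	if (gso_size != GSO_BY_FRAGS)
		return seg_len <= max_len;

	seg_len -= GSO_BY_FRAGS;	/* undo, to re-use header sizes */
	for (i = 0; i < nfrags; i++)
		if (seg_len + frag_headlen[i] > max_len)
			return false;
	return true;
}
```

With a 40-byte L3+L4 header and a 1500-byte MTU: a fixed gso_size of 1448 gives seg_len 1488 and passes, while in the GSO_BY_FRAGS case (seg_len = 40 + GSO_BY_FRAGS) a fragment with a 2000-byte headlen fails because 40 + 2000 exceeds the MTU.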
[PATCH v2 0/4] Check size of packets before sending
There are a few ways we can send packets that are too large to a
network driver. When non-GSO packets are forwarded, we validate their
size, based on the MTU of the destination device. However, when GSO
packets are forwarded, we do not validate their size. We implicitly
assume that when they are segmented, the resultant packets will be
correctly sized. This is not always the case.

We observed a case where a packet received on an ibmveth device had a
GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x
device, where it caused a firmware assert. This is described in detail
at [0] and was the genesis of this series.

Rather than fixing this in the driver, this series fixes the core
path. It does it in 2 steps:

1) make is_skb_forwardable check GSO packets - this catches bridges
2) make validate_xmit_skb check the size of all packets, so as to
   catch everything else (e.g. macvlan, tc-mirred, OVS)

I am a bit nervous about how this series will interact with nested
VLANs, as the existing code only allows for one VLAN_HLEN. (Previously
these packets would sail past unchecked.) But I thought it would be
prudent to get more eyes on this sooner rather than later.

Thanks,
Daniel

v1: https://www.spinics.net/lists/netdev/msg478634.html

Changes in v2:
- improve names, thanks Marcelo Ricardo Leitner
- add check to validate_xmit_skb; thanks to everyone who participated
  in the discussion.
- drop extra check in Open vSwitch. Bad packets will be caught by
  validate_xmit_skb for now and we can come back and add it later if
  OVS people would like the extra logging.
[0]: https://patchwork.ozlabs.org/patch/859410/

Cc: Jason Wang
Cc: Pravin Shelar
Cc: Marcelo Ricardo Leitner
Cc: manish.cho...@cavium.com
Cc: d...@openvswitch.org

Daniel Axtens (4):
  net: rename skb_gso_validate_mtu -> skb_gso_validate_network_len
  net: move skb_gso_mac_seglen to skbuff.h
  net: is_skb_forwardable: check the size of GSO segments
  net: check the size of a packet in validate_xmit_skb

 include/linux/skbuff.h                  | 18 -
 net/core/dev.c                          | 24
 net/core/skbuff.c                       | 66 ++---
 net/ipv4/ip_forward.c                   |  2 +-
 net/ipv4/ip_output.c                    |  2 +-
 net/ipv4/netfilter/nf_flow_table_ipv4.c |  2 +-
 net/ipv6/ip6_output.c                   |  2 +-
 net/ipv6/netfilter/nf_flow_table_ipv6.c |  2 +-
 net/mpls/af_mpls.c                      |  2 +-
 net/sched/sch_tbf.c                     | 10
 net/xfrm/xfrm_device.c                  |  2 +-
 11 files changed, 93 insertions(+), 39 deletions(-)

-- 
2.14.1
Re: [PATCH 0/3] Check gso_size of packets when forwarding
Pravin Shelar writes: > On Thu, Jan 18, 2018 at 5:28 PM, Daniel Axtens wrote: >> Pravin Shelar writes: >> >>> On Thu, Jan 18, 2018 at 5:08 AM, Daniel Axtens wrote: >>>> Pravin Shelar writes: >>>> >>>>> On Mon, Jan 15, 2018 at 6:09 PM, Daniel Axtens wrote: >>>>>> When regular packets are forwarded, we validate their size against the >>>>>> MTU of the destination device. However, when GSO packets are >>>>>> forwarded, we do not validate their size against the MTU. We >>>>>> implicitly assume that when they are segmented, the resultant packets >>>>>> will be correctly sized. >>>>>> >>>>>> This is not always the case. >>>>>> >>>>>> We observed a case where a packet received on an ibmveth device had a >>>>>> GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x >>>>>> device, where it caused a firmware assert. This is described in detail >>>>>> at [0] and was the genesis of this series. Rather than fixing it in >>>>>> the driver, this series fixes the forwarding path. >>>>>> >>>>> Are there any other possible forwarding path in networking stack? TC >>>>> is one subsystem that could forward such a packet to the bnx2x device, >>>>> how is that handled ? >>>> >>>> So far I have only looked at bridges, openvswitch and macvlan. In >>>> general, if the code uses dev_forward_skb() it should automatically be >>>> fine as that invokes is_skb_forwardable(), which we patch. >>>> >>> But there are other ways to forward packets, e.g tc-mirred or bpf >>> redirect. We need to handle all these cases rather than fixing one at >>> a time. As Jason suggested netif_needs_gso() looks like good function >>> to validate if a device is capable of handling given GSO packet. >> >> I am not entirely sure this is a better solution. >> >> The biggest reason I am uncomfortable with this is that if >> netif_needs_gso() returns true, the skb will be segmented. The segment >> sizes will be based on gso_size. 
Since gso_size is greater than the MTU, >> the resulting segments will themselves be over-MTU. Those over-MTU >> segments will then be passed to the network card. I think we should not >> be creating over-MTU segments; we should instead be dropping the packet >> and logging. >> > > Would this oversized segment cause the firmware assert? > If this solves the assert issue then I do not see much value in adding > checks in fast-path just for logging. No - I tested this (or rather, as I don't have direct access to a bnx2x card, this was tested on my behalf): as long as the packet is not a GSO packet, it doesn't cause the crash. So we *could* segment them, I just think that knowingly creating invalid segments is not particularly pleasant. Do you think my approach in a later email which checks and drops without logging is sufficient? It is the same GSO check, and also checks non-GSO packets against the MTU. Regards, Daniel
Re: [PATCH 0/3] Check gso_size of packets when forwarding
Daniel Axtens writes: > Pravin Shelar writes: > >> On Thu, Jan 18, 2018 at 5:08 AM, Daniel Axtens wrote: >>> Pravin Shelar writes: >>> >>>> On Mon, Jan 15, 2018 at 6:09 PM, Daniel Axtens wrote: >>>>> When regular packets are forwarded, we validate their size against the >>>>> MTU of the destination device. However, when GSO packets are >>>>> forwarded, we do not validate their size against the MTU. We >>>>> implicitly assume that when they are segmented, the resultant packets >>>>> will be correctly sized. >>>>> >>>>> This is not always the case. >>>>> >>>>> We observed a case where a packet received on an ibmveth device had a >>>>> GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x >>>>> device, where it caused a firmware assert. This is described in detail >>>>> at [0] and was the genesis of this series. Rather than fixing it in >>>>> the driver, this series fixes the forwarding path. >>>>> >>>> Are there any other possible forwarding path in networking stack? TC >>>> is one subsystem that could forward such a packet to the bnx2x device, >>>> how is that handled ? >>> >>> So far I have only looked at bridges, openvswitch and macvlan. In >>> general, if the code uses dev_forward_skb() it should automatically be >>> fine as that invokes is_skb_forwardable(), which we patch. >>> >> But there are other ways to forward packets, e.g tc-mirred or bpf >> redirect. We need to handle all these cases rather than fixing one at >> a time. As Jason suggested netif_needs_gso() looks like good function >> to validate if a device is capable of handling given GSO packet. > > I am not entirely sure this is a better solution. > > The biggest reason I am uncomfortable with this is that if > netif_needs_gso() returns true, the skb will be segmented. The segment > sizes will be based on gso_size. Since gso_size is greater than the MTU, > the resulting segments will themselves be over-MTU. Those over-MTU > segments will then be passed to the network card. 
I think we should not > be creating over-MTU segments; we should instead be dropping the packet > and logging. > > I do take the point that you and Jason are making: a more generic > fix would be good. I'm just not sure where to put it. How about this? It puts the check in validate_xmit_skb(). (completely untested) Ultimately I think we need to make a broader policy decision about who is responsible for ensuring that packets are correctly sized - is it the caller of dev_queue_xmit (as in bridges and openswitch) or is it lower down the stack, such as validate_xmit_skb()? Making the caller check is more efficient - packets created on the system are always going to fit due to the checks in the protocol code, so rechecking is inefficient and unnecessary. But checking later is more reliable as we don't have to identify all forwarding paths. Regards, Daniel diff --git a/net/core/dev.c b/net/core/dev.c index 5af04be128c1..b8c3f33ece19 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1830,13 +1830,11 @@ static inline void net_timestamp_set(struct sk_buff *skb) __net_timestamp(SKB); \ } \ -bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb) +static inline bool does_skb_fit_mtu(const struct net_device *dev, + const struct sk_buff *skb) { unsigned int len; - if (!(dev->flags & IFF_UP)) - return false; - len = dev->mtu + dev->hard_header_len + VLAN_HLEN; if (skb->len <= len) return true; @@ -1850,6 +1848,14 @@ bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb) return false; } + +bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb) +{ + if (!(dev->flags & IFF_UP)) + return false; + + return does_skb_fit_mtu(dev, skb); +} EXPORT_SYMBOL_GPL(is_skb_forwardable); int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) @@ -3081,6 +3087,9 @@ static struct sk_buff *validate_xmit_skb(struct sk_buff *skb, struct net_device if (unlikely(!skb)) goto out_null; + if 
(unlikely(!does_skb_fit_mtu(dev, skb))) + goto out_kfree_skb; + if (netif_needs_gso(skb, features)) { struct sk_buff *segs; > > > Regards, > Daniel
Re: [PATCH 0/3] Check gso_size of packets when forwarding
Pravin Shelar writes: > On Thu, Jan 18, 2018 at 5:08 AM, Daniel Axtens wrote: >> Pravin Shelar writes: >> >>> On Mon, Jan 15, 2018 at 6:09 PM, Daniel Axtens wrote: >>>> When regular packets are forwarded, we validate their size against the >>>> MTU of the destination device. However, when GSO packets are >>>> forwarded, we do not validate their size against the MTU. We >>>> implicitly assume that when they are segmented, the resultant packets >>>> will be correctly sized. >>>> >>>> This is not always the case. >>>> >>>> We observed a case where a packet received on an ibmveth device had a >>>> GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x >>>> device, where it caused a firmware assert. This is described in detail >>>> at [0] and was the genesis of this series. Rather than fixing it in >>>> the driver, this series fixes the forwarding path. >>>> >>> Are there any other possible forwarding path in networking stack? TC >>> is one subsystem that could forward such a packet to the bnx2x device, >>> how is that handled ? >> >> So far I have only looked at bridges, openvswitch and macvlan. In >> general, if the code uses dev_forward_skb() it should automatically be >> fine as that invokes is_skb_forwardable(), which we patch. >> > But there are other ways to forward packets, e.g tc-mirred or bpf > redirect. We need to handle all these cases rather than fixing one at > a time. As Jason suggested netif_needs_gso() looks like good function > to validate if a device is capable of handling given GSO packet. I am not entirely sure this is a better solution. The biggest reason I am uncomfortable with this is that if netif_needs_gso() returns true, the skb will be segmented. The segment sizes will be based on gso_size. Since gso_size is greater than the MTU, the resulting segments will themselves be over-MTU. Those over-MTU segments will then be passed to the network card. 
I think we should not be creating over-MTU segments; we should instead be dropping the packet and logging. I do take the point that you and Jason are making: a more generic fix would be good. I'm just not sure where to put it. Regards, Daniel
Re: [PATCH 0/3] Check gso_size of packets when forwarding
Daniel Axtens writes: > Jason Wang writes: > >> On 2018年01月18日 16:28, Pravin Shelar wrote: >>> On Mon, Jan 15, 2018 at 6:09 PM, Daniel Axtens wrote: >>>> When regular packets are forwarded, we validate their size against the >>>> MTU of the destination device. However, when GSO packets are >>>> forwarded, we do not validate their size against the MTU. We >>>> implicitly assume that when they are segmented, the resultant packets >>>> will be correctly sized. >>>> >>>> This is not always the case. >>>> >>>> We observed a case where a packet received on an ibmveth device had a >>>> GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x >>>> device, where it caused a firmware assert. This is described in detail >>>> at [0] and was the genesis of this series. Rather than fixing it in >>>> the driver, this series fixes the forwarding path. >>>> >>> Are there any other possible forwarding path in networking stack? TC >>> is one subsystem that could forward such a packet to the bnx2x device, >>> how is that handled ? >> >> Yes, so it looks to me we should do the check in e.g netif_needs_gso() >> before passing it to hardware. And bnx2x needs to set its gso_max_size >> correctly. > > I don't think gso_max_size is quite the same. If I understand > net/ipv4/tcp.c correctly, gso_max_size is supposed to cap the total > length of the skb, not the length of each segmented packet. The problem > for the bnx2x card is the length of the segment, not the overall length. > >> >> Btw, looks like this could be triggered from a guest which is a DOS. We >> need request a CVE for this. >> > > You are correct about how this can be triggered: in fact it came to my > attention because a network packet from one LPAR (PowerVM virtual > machine) brought down the card attached to a different LPAR. It didn't > occur to me that it was potentially a security issue. I am talking with > the security team at Canonical regarding this. 
I have requested a CVE from the Distributed Weakness Filing. Regards, Daniel > > Regards, > Daniel > >> Thanks >> >>> >>>> To fix this: >>>> >>>> - Move a helper in patch 1. >>>> >>>> - Validate GSO segment lengths in is_skb_forwardable() in the GSO >>>> case, rather than assuming all will be well. This fixes bridges. >>>> This is patch 2. >>>> >>>> - Open vSwitch uses its own slightly specialised algorithm for >>>> checking lengths. Wire up checking for that in patch 3. >>>> >>>> [0]: https://patchwork.ozlabs.org/patch/859410/ >>>> >>>> Cc: manish.cho...@cavium.com >>>> Cc: d...@openvswitch.org >>>> >>>> Daniel Axtens (3): >>>>net: move skb_gso_mac_seglen to skbuff.h >>>>net: is_skb_forwardable: validate length of GSO packet segments >>>>openvswitch: drop GSO packets that are too large >>>> >>>> include/linux/skbuff.h | 16 >>>> net/core/dev.c | 7 --- >>>> net/core/skbuff.c | 34 ++ >>>> net/openvswitch/vport.c | 37 ++--- >>>> net/sched/sch_tbf.c | 10 -- >>>> 5 files changed, 84 insertions(+), 20 deletions(-) >>>> >>>> -- >>>> 2.14.1 >>>>
Re: [PATCH 0/3] Check gso_size of packets when forwarding
Jason Wang writes: > On 2018年01月18日 16:28, Pravin Shelar wrote: >> On Mon, Jan 15, 2018 at 6:09 PM, Daniel Axtens wrote: >>> When regular packets are forwarded, we validate their size against the >>> MTU of the destination device. However, when GSO packets are >>> forwarded, we do not validate their size against the MTU. We >>> implicitly assume that when they are segmented, the resultant packets >>> will be correctly sized. >>> >>> This is not always the case. >>> >>> We observed a case where a packet received on an ibmveth device had a >>> GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x >>> device, where it caused a firmware assert. This is described in detail >>> at [0] and was the genesis of this series. Rather than fixing it in >>> the driver, this series fixes the forwarding path. >>> >> Are there any other possible forwarding path in networking stack? TC >> is one subsystem that could forward such a packet to the bnx2x device, >> how is that handled ? > > Yes, so it looks to me we should do the check in e.g netif_needs_gso() > before passing it to hardware. And bnx2x needs to set its gso_max_size > correctly. I don't think gso_max_size is quite the same. If I understand net/ipv4/tcp.c correctly, gso_max_size is supposed to cap the total length of the skb, not the length of each segmented packet. The problem for the bnx2x card is the length of the segment, not the overall length. > > Btw, looks like this could be triggered from a guest which is a DOS. We > need request a CVE for this. > You are correct about how this can be triggered: in fact it came to my attention because a network packet from one LPAR (PowerVM virtual machine) brought down the card attached to a different LPAR. It didn't occur to me that it was potentially a security issue. I am talking with the security team at Canonical regarding this. Regards, Daniel > Thanks > >> >>> To fix this: >>> >>> - Move a helper in patch 1. 
>>> >>> - Validate GSO segment lengths in is_skb_forwardable() in the GSO >>> case, rather than assuming all will be well. This fixes bridges. >>> This is patch 2. >>> >>> - Open vSwitch uses its own slightly specialised algorithm for >>> checking lengths. Wire up checking for that in patch 3. >>> >>> [0]: https://patchwork.ozlabs.org/patch/859410/ >>> >>> Cc: manish.cho...@cavium.com >>> Cc: d...@openvswitch.org >>> >>> Daniel Axtens (3): >>>net: move skb_gso_mac_seglen to skbuff.h >>>net: is_skb_forwardable: validate length of GSO packet segments >>>openvswitch: drop GSO packets that are too large >>> >>> include/linux/skbuff.h | 16 >>> net/core/dev.c | 7 --- >>> net/core/skbuff.c | 34 ++ >>> net/openvswitch/vport.c | 37 ++--- >>> net/sched/sch_tbf.c | 10 -- >>> 5 files changed, 84 insertions(+), 20 deletions(-) >>> >>> -- >>> 2.14.1 >>>
Re: [PATCH 0/3] Check gso_size of packets when forwarding
Pravin Shelar writes: > On Mon, Jan 15, 2018 at 6:09 PM, Daniel Axtens wrote: >> When regular packets are forwarded, we validate their size against the >> MTU of the destination device. However, when GSO packets are >> forwarded, we do not validate their size against the MTU. We >> implicitly assume that when they are segmented, the resultant packets >> will be correctly sized. >> >> This is not always the case. >> >> We observed a case where a packet received on an ibmveth device had a >> GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x >> device, where it caused a firmware assert. This is described in detail >> at [0] and was the genesis of this series. Rather than fixing it in >> the driver, this series fixes the forwarding path. >> > Are there any other possible forwarding path in networking stack? TC > is one subsystem that could forward such a packet to the bnx2x device, > how is that handled ? So far I have only looked at bridges, openvswitch and macvlan. In general, if the code uses dev_forward_skb() it should automatically be fine as that invokes is_skb_forwardable(), which we patch. There is potentially a forwarding path in macvlan that doesn't pass through dev_forward_skb() (which calls is_skb_forwardable). I did post a patch to this earlier, but it got somewhat derailed by how you would get such a packet in to a macvlan device and whether that should be fixed instead. Once we have some consensus on Jason's questions and the overall approach is solid, I'm happy to work on that again. To answer your specific question, I'm not aware of how tc can cause a packet to be forwarded - if you can point me to the right general area I can have a look. Regards, Daniel > >> To fix this: >> >> - Move a helper in patch 1. >> >> - Validate GSO segment lengths in is_skb_forwardable() in the GSO >>case, rather than assuming all will be well. This fixes bridges. >>This is patch 2. 
>> >> - Open vSwitch uses its own slightly specialised algorithm for >>checking lengths. Wire up checking for that in patch 3. >> >> [0]: https://patchwork.ozlabs.org/patch/859410/ >> >> Cc: manish.cho...@cavium.com >> Cc: d...@openvswitch.org >> >> Daniel Axtens (3): >> net: move skb_gso_mac_seglen to skbuff.h >> net: is_skb_forwardable: validate length of GSO packet segments >> openvswitch: drop GSO packets that are too large >> >> include/linux/skbuff.h | 16 >> net/core/dev.c | 7 --- >> net/core/skbuff.c | 34 ++ >> net/openvswitch/vport.c | 37 ++--- >> net/sched/sch_tbf.c | 10 -- >> 5 files changed, 84 insertions(+), 20 deletions(-) >> >> -- >> 2.14.1 >>
[PATCH 3/3] openvswitch: drop GSO packets that are too large
Open vSwitch attempts to detect if a packet is too large to be sent to the destination device. However, this test does not consider GSO packets, and it is possible that a GSO packet, when resegmented, will be larger than the device can send. Add detection for packets which are too large. We use skb_gso_validate_len, reusing the length calculation in the existing checks - see 738314a084aa ("openvswitch: use hard_header_len instead of hardcoded ETH_HLEN") This is different from the is_skb_forwardable logic in that it only allows for the length of a VLAN tag if one is actually present. Signed-off-by: Daniel Axtens --- net/openvswitch/vport.c | 37 ++--- 1 file changed, 30 insertions(+), 7 deletions(-) diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c index b6c8524032a0..290eeaa82344 100644 --- a/net/openvswitch/vport.c +++ b/net/openvswitch/vport.c @@ -464,16 +464,16 @@ int ovs_vport_receive(struct vport *vport, struct sk_buff *skb, return 0; } -static unsigned int packet_length(const struct sk_buff *skb, - struct net_device *dev) +static unsigned int packet_length_offset(const struct sk_buff *skb, +const struct net_device *dev) { - unsigned int length = skb->len - dev->hard_header_len; + unsigned int length = dev->hard_header_len; if (!skb_vlan_tag_present(skb) && eth_type_vlan(skb->protocol)) - length -= VLAN_HLEN; + length += VLAN_HLEN; - /* Don't subtract for multiple VLAN tags. Most (all?) drivers allow + /* Don't adjust for multiple VLAN tags. Most (all?) drivers allow * (ETH_LEN + VLAN_HLEN) in addition to the mtu value, but almost none * account for 802.1ad. e.g. is_skb_forwardable(). 
*/ @@ -481,6 +481,21 @@ static unsigned int packet_length(const struct sk_buff *skb, return length; } +static inline unsigned int packet_length(const struct sk_buff *skb, +const struct net_device *dev) +{ + return skb->len - packet_length_offset(skb, dev); +} + +static inline bool vport_gso_validate_len(const struct sk_buff *skb, + const struct net_device *dev, + unsigned int mtu) +{ + unsigned int len = mtu + packet_length_offset(skb, dev); + + return skb_gso_validate_len(skb, len); +} + void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto) { int mtu = vport->dev->mtu; @@ -504,13 +519,21 @@ void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto) goto drop; } - if (unlikely(packet_length(skb, vport->dev) > mtu && -!skb_is_gso(skb))) { + if (!skb_is_gso(skb) && + unlikely(packet_length(skb, vport->dev) > mtu)) { net_warn_ratelimited("%s: dropped over-mtu packet: %d > %d\n", vport->dev->name, packet_length(skb, vport->dev), mtu); vport->dev->stats.tx_errors++; goto drop; + } else if (skb_is_gso(skb) && + unlikely(!vport_gso_validate_len(skb, vport->dev, mtu))) { + net_warn_ratelimited("%s: dropped over-mtu GSO packet: " +"gso_size = %d, mtu = %d\n", +vport->dev->name, +skb_shinfo(skb)->gso_size, mtu); + vport->dev->stats.tx_errors++; + goto drop; } skb->dev = vport->dev; -- 2.14.1
[PATCH 0/3] Check gso_size of packets when forwarding
When regular packets are forwarded, we validate their size against the MTU of the destination device. However, when GSO packets are forwarded, we do not validate their size against the MTU. We implicitly assume that when they are segmented, the resultant packets will be correctly sized. This is not always the case. We observed a case where a packet received on an ibmveth device had a GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x device, where it caused a firmware assert. This is described in detail at [0] and was the genesis of this series. Rather than fixing it in the driver, this series fixes the forwarding path. To fix this: - Move a helper in patch 1. - Validate GSO segment lengths in is_skb_forwardable() in the GSO case, rather than assuming all will be well. This fixes bridges. This is patch 2. - Open vSwitch uses its own slightly specialised algorithm for checking lengths. Wire up checking for that in patch 3. [0]: https://patchwork.ozlabs.org/patch/859410/ Cc: manish.cho...@cavium.com Cc: d...@openvswitch.org Daniel Axtens (3): net: move skb_gso_mac_seglen to skbuff.h net: is_skb_forwardable: validate length of GSO packet segments openvswitch: drop GSO packets that are too large include/linux/skbuff.h | 16 net/core/dev.c | 7 --- net/core/skbuff.c | 34 ++ net/openvswitch/vport.c | 37 ++--- net/sched/sch_tbf.c | 10 -- 5 files changed, 84 insertions(+), 20 deletions(-) -- 2.14.1
[PATCH 1/3] net: move skb_gso_mac_seglen to skbuff.h
We're about to use this elsewhere, so move it into the header with the other related functions like skb_gso_network_seglen(). Signed-off-by: Daniel Axtens --- include/linux/skbuff.h | 15 +++ net/sched/sch_tbf.c| 10 -- 2 files changed, 15 insertions(+), 10 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index b8e0da6c27d6..b8acafd73272 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -4120,6 +4120,21 @@ static inline unsigned int skb_gso_network_seglen(const struct sk_buff *skb) return hdr_len + skb_gso_transport_seglen(skb); } +/** + * skb_gso_mac_seglen - Return length of individual segments of a gso packet + * + * @skb: GSO skb + * + * skb_gso_mac_seglen is used to determine the real size of the + * individual segments, including MAC/L2, Layer3 (IP, IPv6) and L4 + * headers (TCP/UDP). + */ +static inline unsigned int skb_gso_mac_seglen(const struct sk_buff *skb) +{ + unsigned int hdr_len = skb_transport_header(skb) - skb_mac_header(skb); + return hdr_len + skb_gso_transport_seglen(skb); +} + /* Local Checksum Offload. * Compute outer checksum based on the assumption that the * inner checksum will be offloaded later. diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c index 83e76d046993..229172d509cc 100644 --- a/net/sched/sch_tbf.c +++ b/net/sched/sch_tbf.c @@ -142,16 +142,6 @@ static u64 psched_ns_t2l(const struct psched_ratecfg *r, return len; } -/* - * Return length of individual segments of a gso packet, - * including all headers (MAC, IP, TCP/UDP) - */ -static unsigned int skb_gso_mac_seglen(const struct sk_buff *skb) -{ - unsigned int hdr_len = skb_transport_header(skb) - skb_mac_header(skb); - return hdr_len + skb_gso_transport_seglen(skb); -} - /* GSO packet is too big, segment it so that tbf can transmit * each segment in time */ -- 2.14.1
[PATCH 2/3] net: is_skb_forwardable: validate length of GSO packet segments
is_skb_forwardable attempts to detect if a packet is too large to be sent to the destination device. However, this test does not consider GSO packets, and it is possible that a GSO packet, when resegmented, will be larger than the device can transmit. Add detection for packets which will be too large by creating a skb_gso_validate_len() routine which is similar to skb_gso_validate_mtu(), but which considers L2 headers, and wire it up in is_skb_forwardable(). Signed-off-by: Daniel Axtens --- include/linux/skbuff.h | 1 + net/core/dev.c | 7 --- net/core/skbuff.c | 34 ++ 3 files changed, 39 insertions(+), 3 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index b8acafd73272..6a9e4b6f80a7 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3287,6 +3287,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen); void skb_scrub_packet(struct sk_buff *skb, bool xnet); unsigned int skb_gso_transport_seglen(const struct sk_buff *skb); bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu); +bool skb_gso_validate_len(const struct sk_buff *skb, unsigned int len); struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features); struct sk_buff *skb_vlan_untag(struct sk_buff *skb); int skb_ensure_writable(struct sk_buff *skb, int write_len); diff --git a/net/core/dev.c b/net/core/dev.c index 94435cd09072..5af04be128c1 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1841,11 +1841,12 @@ bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb) if (skb->len <= len) return true; - /* if TSO is enabled, we don't care about the length as the packet -* could be forwarded without being segmented before + /* +* if TSO is enabled, we need to check the size of the +* segmented packets */ if (skb_is_gso(skb)) - return true; + return skb_gso_validate_len(skb, len); return false; } diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 01e8285aea73..aefe049e4b0c 100644 --- 
a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -4945,6 +4945,40 @@ bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu) } EXPORT_SYMBOL_GPL(skb_gso_validate_mtu); +/** + * skb_gso_validate_len - Will a split GSO skb fit in a given length? + * + * @skb: GSO skb + * @len: length to validate against + * + * skb_gso_validate_len validates if a given skb will fit a wanted + * length once split, including L2, L3 and L4 headers. + * + * Similar to skb_gso_validate_mtu, but inclusive of L2 headers. + */ +bool skb_gso_validate_len(const struct sk_buff *skb, unsigned int len) +{ + const struct skb_shared_info *shinfo = skb_shinfo(skb); + const struct sk_buff *iter; + unsigned int hlen; + + hlen = skb_gso_mac_seglen(skb); + + if (shinfo->gso_size != GSO_BY_FRAGS) + return hlen <= len; + + /* Undo this so we can re-use header sizes */ + hlen -= GSO_BY_FRAGS; + + skb_walk_frags(skb, iter) { + if (hlen + skb_headlen(iter) > len) + return false; + } + + return true; +} +EXPORT_SYMBOL_GPL(skb_gso_validate_len); + static struct sk_buff *skb_reorder_vlan_header(struct sk_buff *skb) { if (skb_cow(skb, skb_headroom(skb)) < 0) { -- 2.14.1
Re: [PATCH v2] bnx2x: disable GSO where gso_size is too big for hardware
David Miller writes: > From: Daniel Axtens > Date: Fri, 12 Jan 2018 10:59:05 +1100 > >> If a bnx2x card is passed a GSO packet with a gso_size larger than >> ~9700 bytes, it will cause a firmware error that will bring the card >> down: >> >> bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f0)]MC assert! >> bnx2x: [bnx2x_mc_assert:720(enP24p1s0f0)]XSTORM_ASSERT_LIST_INDEX 0x2 >> bnx2x: [bnx2x_mc_assert:736(enP24p1s0f0)]XSTORM_ASSERT_INDEX 0x0 = >> 0x 0x25e43e47 0x00463e01 0x00010052 >> bnx2x: [bnx2x_mc_assert:750(enP24p1s0f0)]Chip Revision: everest3, FW >> Version: 7_13_1 >> ... (dump of values continues) ... >> >> Detect when gso_size + header length is greater than the maximum >> packet size (9700 bytes) and disable GSO. For simplicity and speed >> this is approximated by comparing gso_size against 9200 and assuming >> no-one will have more than 500 bytes of headers. > > What is the MTU size configured on the bnx2x device when these 9700 > byte packets are seen? > > If it's less than 9700, whatever is allowing your device (openvswitch, > ibmveth, whatever) needs to be fixed. Sure, I had an approach that checks the gso_size in is_skb_forwardable and the equivalent openvswitch path - I'll send that. Regards, Daniel > > I don't like this at all, quite frankly. We'll have one device now that > has this special check, probably many others can run into this situation > as well but they won't be used on these kinds of powerpc boxes and > therefore nobody is going to notice. > > I'm not applying this without more information or better justification, > sorry.
[PATCH v2] bnx2x: disable GSO where gso_size is too big for hardware
If a bnx2x card is passed a GSO packet with a gso_size larger than ~9700 bytes, it will cause a firmware error that will bring the card down:

bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f0)]MC assert!
bnx2x: [bnx2x_mc_assert:720(enP24p1s0f0)]XSTORM_ASSERT_LIST_INDEX 0x2
bnx2x: [bnx2x_mc_assert:736(enP24p1s0f0)]XSTORM_ASSERT_INDEX 0x0 = 0x 0x25e43e47 0x00463e01 0x00010052
bnx2x: [bnx2x_mc_assert:750(enP24p1s0f0)]Chip Revision: everest3, FW Version: 7_13_1
... (dump of values continues) ...

Detect when gso_size + header length is greater than the maximum packet size (9700 bytes) and disable GSO. For simplicity and speed this is approximated by comparing gso_size against 9200 and assuming no-one will have more than 500 bytes of headers.

This raises the obvious question - how do we end up with a packet with a gso_size that's greater than 9700? This has been observed on a powerpc system when Open vSwitch is forwarding a packet from an ibmveth device.

ibmveth is a bit special. It's the driver for communication between virtual machines (aka 'partitions'/LPARs) running under IBM's proprietary hypervisor on ppc machines. It allows sending very large packets (up to 64kB) between LPARs. This involves some quite 'interesting' things: for example, when talking TCP, the MSS is stored in the checksum field (see ibmveth_rx_mss_helper() in ibmveth.c).

Normally on a box like this, there would be a Virtual I/O Server (VIOS) partition that owns the physical network card. VIOS lets the AIX partitions know when they're talking to a real network and that they should drop their MSS. This works fine if VIOS owns the physical network card. However, in this case, a Linux partition owns the card (this is known as a NovaLink setup). The negotiation between VIOS and AIX uses a non-standard TCP option, so Linux has never supported that. Instead, Linux just supports receiving large packets. It doesn't support any form of messaging/MSS negotiation back to other LPARs.
To get some clarity about where the large MSS was coming from, I asked Thomas Falcon, the maintainer of ibmveth, for some background: "In most cases, large segments are an aggregation of smaller packets by the Virtual I/O Server (VIOS) partition and then are forwarded to the Linux LPAR / ibmveth driver. These segments can be as large as 64KB. In this case, since the customer is using Novalink, I believe what is happening is pretty straightforward: the large segments are created by the AIX partition and then forwarded to the Linux partition, ... The ibmveth driver doesn't do any aggregation itself but just ensures the proper bits are set before sending the frame up to avoid giving the upper layers indigestion." It is possible to stop AIX from sending these large segments, but it requires configuration on each LPAR. While ibmveth's behaviour is admittedly weird, we should fix this here: it shouldn't be possible for it to cause a firmware panic on another card. Cc: Thomas Falcon # ibmveth Cc: Yuval Mintz # bnx2x Thanks-to: Jay Vosburgh # veth info Signed-off-by: Daniel Axtens --- v2: change to a feature check as suggested by Eric Dumazet. --- drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c index 7b08323e3f3d..bab909b5d7a2 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c @@ -12934,6 +12934,17 @@ static netdev_features_t bnx2x_features_check(struct sk_buff *skb, struct net_device *dev, netdev_features_t features) { + /* +* A skb with gso_size + header length > 9700 will cause a +* firmware panic. Drop GSO support. +* +* To avoid costly calculations on all packets (and because +* super-jumbo frames are rare), allow 500 bytes of headers +* and just disable GSO if gso_size is greater than 9200. 
+*/ + if (unlikely(skb_is_gso(skb) && skb_shinfo(skb)->gso_size > 9200)) + features &= ~NETIF_F_GSO_MASK; + features = vlan_features_check(skb, features); return vxlan_features_check(skb, features); } -- 2.14.1
Re: [PATCH] macvlan: verify MTU before lowerdev xmit
Hi Dave, > So how exactly do the oversized packets get to the macvlan device from > the VM in this scenerio? That detail seems to be missing from the > diagrams you provided earlier. The VM and the macvlan boxes are just > connected with a line. Inside the VM I'm using netperf talking on an interface which the guest believes to have a MTU of 1500. I'm setting up the VM using libvirt - my understanding is that libvirt creates a macvtap device in private mode, and qemu opens that tap device and writes data from the emulated network card (I see the same behaviour with a emulated rtl8139, e1000, and with virtio). I think I could replicate this with any userspace program - qemu is just the easiest for me at the moment. Hopefully that's what you had in mind? Let me know if you wanted different info. Regards, Daniel [The gory details, in case it matters: The VM has a network adaptor with the following XML: ] > > Thank you.
Re: [PATCH] macvlan: verify MTU before lowerdev xmit
Hi Dave,

> This is an area where we really haven't set down some clear rules
> for behavior.
>
> If an interface has a particular MTU, it must be able to successfully
> send MTU sized packets on that link be it virtual or physical.
>
> Only a "next hop" can have a different MTU and thus drop packets.
> This requirement is absolutely necessary in order for proper
> signalling (path MTU messages) to make their way back to the sending
> host.
>
> In this VM-->macvlan case it's more like a point to point connection
> and there lacks a "next hop" to serve as the provider of proper
> signalling.
>
> This whole situation seems to be handled quite poorly in virtualized
> setups. Allowing one end of the virtual networking "link" into the
> guest to have a different MTU from the other end is a HUGE mistake.

I agree, but users do it, so I'm just trying to figure out the best way to handle it.

Currently the bridge code and openvswitch will detect when a large packet arrives and drop the packet* - the bridge code with is_skb_forwardable() and openvswitch with its own approach in vport.c. This patch tries to bring macvlan in line with those.

*except for GSO packets - they are assumed to be ok, which isn't always true. I am working on some patches for this - but my approach may need to be changed in light of what you're saying.

> There needs to be control path signalling between the guest and the
> provider of the virtual link so that they can synchronize their MTU
> settings.

We have these sorts of problems on probably every type of virtual interface, some of which are easier to deal with than others. I'm particularly worried about interfaces like ibmveth where the 'other end' is usually an AIX partition on a big powerpc system. AIX tends to bring up these interfaces with MTUs of around 64k as well. (This is what originally got me down the rabbit hole of caring about mismatched-MTU handling!)

> Yes this is hard, but what is happening now doesn't fly in the long
> term.
I'm at a bit of a loss about how we would do a proper fix. The only thing that comes to mind would be for the receive routines of the virtual network interfaces to check the size of incoming packets, but - if I understand correctly - we would need to know what the maximum size for a received packet should be.

Regards,
Daniel
Re: [PATCH] macvlan: verify MTU before lowerdev xmit
Hi Shannon,

> Now that I think about this a little more, why is this not already
> getting handled by the NETDEV_CHANGEMTU notifier? In what case are you
> running into this, and why is it not triggering the notifier?

My commit message was probably not super clear here - apologies for any mangling of terminology/concepts. The changed MTU is getting propagated from the lowerdev to the macvtap interface just fine, it's just not getting enforced. Hopefully the following diagrams make it clearer.

This is the current behaviour: a VM sends a packet of 1514 bytes and it makes it through the macvtap/macvlan and lowerdev, to the destination, even though the intermediate MTUs are lower.

*----------*
|    VM    |
|          |
| mtu 1500 |
*----------*
     |
     V 1514 byte packet
     |
*----------*
| macvtap  |
|          |
| mtu 1480 |
*----------*
     |
     V 1514 byte packet
     |
*----------*
| lowerdev |
|          |
| mtu 1480 |
*----------*
     |
     V 1514 byte packet
     |
*----------*
|   dest   |
|          |
| mtu 1500 |
*----------*

My thought here is that the lowerdev should not be asked to transmit a packet that is larger than its MTU. The patch causes the following behaviour:

*----------*
|    VM    |
|          |
| mtu 1500 |
*----------*
     |
     V 1514 byte packet
     |
*----------*
| macvtap  |
|          |
| mtu 1480 |
*----------*
     |
     X packet dropped
       1500 > 1480

I think this makes macvlan consistent with bridges.

>> As for the other paths, I see that the call to dev_forward_skb() already
>> has this protection in it, but does the call to dev_queue_xmit_accel()
>> in macvlan_start_xmit() need similar protection?

I think so. I will have a look and do a v2 if the core idea of the patch is OK.

Regards,
Daniel

>>
>
> sln
>
>>
>>
>>> skb->dev = vlan->lowerdev;
>>> return dev_queue_xmit(skb);
>>> }
Re: [PATCH] ibmveth: Kernel crash LSO offload flag toggle
Hi Bryant,

This looks a bit better, but...

> The following patch ensures that the bounce_buffer is not null
> prior to using it within skb_copy_from_linear_data.

How would this occur?

Looking at ibmveth.c, I see bounce_buffer being freed in ibmveth_close() and allocated in ibmveth_open() only. If allocation fails, the whole opening of the device fails with -ENOMEM.

It seems your test case - changing TSO - causes ibmveth_set_tso() to cause an adaptor restart - an ibmveth_close(dev) and then an ibmveth_open(dev). Is this happening in parallel with an out of memory condition - is the memory allocation failing? Alternatively, could it be the case that you're closing the device while packets are in flight, and then trying to use a bounce_buffer that's been freed elsewhere? Do you need to decouple memory freeing from ibmveth_close?

> The problem can be recreated by toggling large send offload on/off.
>
> The following script, when run (along with some iperf traffic), recreates the
> crash within 5-10 mins or so.
>
> while true
> do
> ethtool -k ibmveth0 | grep tcp-segmentation-offload
> ethtool -K ibmveth0 tso off
> ethtool -k ibmveth0 | grep tcp-segmentation-offload
> ethtool -K ibmveth0 tso on
> done
>
> Note: This issue happens the very first time large send offload is
> turned off too (but the above script recreates the issue every time)
>
> [76563.914173] Unable to handle kernel paging request for data at address 0x
> [76563.914197] Faulting instruction address: 0xc0063940
> [76563.914205] Oops: Kernel access of bad area, sig: 11 [#1]
> [76563.914210] SMP NR_CPUS=2048 NUMA pSeries
> [76563.914217] Modules linked in: rpadlpar_io rpaphp dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag nls_utf8 isofs binfmt_misc pseries_rng rtc_generic autofs4 ibmvfc scsi_transport_fc ibmvscsi ibmveth
> [76563.914251] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.4.0-34-generic #53-Ubuntu

^--- yikes!
There are relevant changes to this area since 4.4: 2c42bf4b4317 ("ibmveth: check return of skb_linearize in ibmveth_start_xmit") 66aa0678efc2 ("ibmveth: Support to enable LSO/CSO for Trunk VEA.") Does this crash occurs on a more recent kernel? > diff --git a/drivers/net/ethernet/ibm/ibmveth.c > b/drivers/net/ethernet/ibm/ibmveth.c > index f210398200ece..1d29b1649118d 100644 > --- a/drivers/net/ethernet/ibm/ibmveth.c > +++ b/drivers/net/ethernet/ibm/ibmveth.c > @@ -1092,8 +1092,14 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff > *skb, >*/ > if (force_bounce || (!skb_is_nonlinear(skb) && > (skb->len < tx_copybreak))) { > - skb_copy_from_linear_data(skb, adapter->bounce_buffer, > - skb->len); > + if (adapter->bounce_buffer) { > + skb_copy_from_linear_data(skb, adapter->bounce_buffer, > + skb->len); > + } else { > + adapter->tx_send_failed++; > + netdev->stats.tx_dropped++; > + goto out; Finally, as I alluded to at the top of this message, isn't the disappearance of the bounce-buffer a pretty serious issue? As I understand it, it means your data structures are now inconsistent. Do you need to - at the least - be more chatty here? Regards, Daniel > + } > > descs[0].fields.flags_len = desc_flags | skb->len; > descs[0].fields.address = adapter->bounce_buffer_dma; > -- > 2.13.6 (Apple Git-96)
[PATCH] macvlan: verify MTU before lowerdev xmit
If a macvlan device which is not in bridge mode receives a packet, it is sent straight to the lowerdev without checking against the device's MTU. This also happens for multicast traffic. Add an is_skb_forwardable() check against the lowerdev before sending the packet out through it. I think this is the simplest and best way to do it, and is consistent with the use of dev_forward_skb() in the bridge path. This is easy to replicate: - create a VM with a macvtap connection in private mode - set the lowerdev MTU to something low in the host (e.g. 1480) - do not set the MTU lower in the guest (e.g. keep at 1500) - netperf to a different host with the same high MTU - observe that currently, the driver will forward too-big packets - observe that with this patch the packets are dropped Cc: Shannon Nelson Signed-off-by: Daniel Axtens --- After hearing Shannon's lightning talk on macvlan at netdev I figured I'd strike while the iron is hot and get this out of my patch queue where it has been languishing. --- drivers/net/macvlan.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index a178c5efd33e..8adcad6798c5 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -534,6 +534,10 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev) } xmit_world: + /* verify MTU */ + if (!is_skb_forwardable(vlan->lowerdev, skb)) + return NET_XMIT_DROP; + skb->dev = vlan->lowerdev; return dev_queue_xmit(skb); } -- 2.11.0
Re: [PATCH] ibmveth: Kernel crash LSO offload flag toggle
Hi Bryant, A few things: 1) The commit message could probably be trimmed, especially the stack traces. 2) What exactly are you changing and why does it fix the issue? I couldn't figure that out from the commit message. 3) Does this need a Fixes: tag? > } > > - netdev->min_mtu = IBMVETH_MIN_MTU; > - netdev->max_mtu = ETH_MAX_MTU; > - 4) What does the above hunk do? It seems unrelated... Regards, Daniel
Re: [PATCH] bnx2x: drop packets where gso_size is too big for hardware
Hi Eric,

>> +	if (unlikely(skb_shinfo(skb)->gso_size + hlen > MAX_PACKET_SIZE)) {
>> +		BNX2X_ERR("reported gso segment size plus headers "
>> +			  "(%d + %d) > MAX_PACKET_SIZE; dropping pkt!",
>> +			  skb_shinfo(skb)->gso_size, hlen);
>> +
>> +		goto free_and_drop;
>> +	}
>> +
>
> If you had this test in bnx2x_features_check(), packet could be
> segmented by core networking stack before reaching bnx2x_start_xmit() by
> clearing NETIF_F_GSO_MASK
>
> -> No drop would be involved.
>
> check i40evf_features_check() for similar logic.

So I've been experimenting with this and reading through the core networking code. If my understanding is correct, disabling GSO will cause the packet to be segmented, but it will be segmented into gso_size+header length packets. So in this case (~10kB gso_size) the resultant packets will still be too big - although at least they don't cause a crash in that case.

We could continue with this anyway as it at least prevents the crash - but, and I haven't been able to find a nice definitive answer to this - are implementations of ndo_start_xmit permitted to assume that the skb passed in will fit within the MTU? I notice that most callers will attempt to ensure this - for example ip_output.c, ip6_output.c and ip_forward.c all contain calls to skb_gso_validate_mtu(). If implementations are permitted to assume this, perhaps a fix to openvswitch would be more appropriate?

Regards,
Daniel
Re: [PATCH] bnx2x: drop packets where gso_size is too big for hardware
Eric Dumazet writes:
> If you had this test in bnx2x_features_check(), packet could be
> segmented by core networking stack before reaching bnx2x_start_xmit() by
> clearing NETIF_F_GSO_MASK
>
> -> No drop would be involved.

Thanks for the pointer - networking code is all a bit new to me.

I'm just struggling at the moment to figure out the right way to calculate the length. My original patch uses gso_size + hlen, but:

- On reflection, while this solves the immediate bug, I'm not 100% sure this is the right thing to be calculating
- If it is, then we have the problem that hlen is calculated in a bunch of weird and wonderful ways which make it a nightmare to extract.

Yuval (or anyone else who groks the driver properly) - what's the right test to be doing here to make sure we don't write too much data to the card?

Regards,
Daniel
[PATCH] bnx2x: drop packets where gso_size is too big for hardware
If a bnx2x card is passed a GSO packet with a gso_size larger than ~9700 bytes, it will cause a firmware error that will bring the card down:

bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f0)]MC assert!
bnx2x: [bnx2x_mc_assert:720(enP24p1s0f0)]XSTORM_ASSERT_LIST_INDEX 0x2
bnx2x: [bnx2x_mc_assert:736(enP24p1s0f0)]XSTORM_ASSERT_INDEX 0x0 = 0x 0x25e43e47 0x00463e01 0x00010052
bnx2x: [bnx2x_mc_assert:750(enP24p1s0f0)]Chip Revision: everest3, FW Version: 7_13_1
... (dump of values continues) ...

Detect when gso_size + header length is greater than the maximum packet size (9700 bytes) and drop the packet.

This raises the obvious question - how do we end up with a packet with a gso_size that's greater than 9700? This has been observed on a powerpc system when Open vSwitch is forwarding a packet from an ibmveth device.

ibmveth is a bit special. It's the driver for communication between virtual machines (aka 'partitions'/LPARs) running under IBM's proprietary hypervisor on ppc machines. It allows sending very large packets (up to 64kB) between LPARs. This involves some quite 'interesting' things: for example, when talking TCP, the MSS is stored in the checksum field (see ibmveth_rx_mss_helper() in ibmveth.c).

Normally on a box like this, there would be a Virtual I/O Server (VIOS) partition that owns the physical network card. VIOS lets the AIX partitions know when they're talking to a real network and that they should drop their MSS. This works fine if VIOS owns the physical network card. However, in this case, a Linux partition owns the card (this is known as a NovaLink setup). The negotiation between VIOS and AIX uses a non-standard TCP option, so Linux has never supported that. Instead, Linux just supports receiving large packets. It doesn't support any form of messaging/MSS negotiation back to other LPARs.
To get some clarity about where the large MSS was coming from, I asked Thomas Falcon, the maintainer of ibmveth, for some background: "In most cases, large segments are an aggregation of smaller packets by the Virtual I/O Server (VIOS) partition and then are forwarded to the Linux LPAR / ibmveth driver. These segments can be as large as 64KB. In this case, since the customer is using Novalink, I believe what is happening is pretty straightforward: the large segments are created by the AIX partition and then forwarded to the Linux partition, ... The ibmveth driver doesn't do any aggregation itself but just ensures the proper bits are set before sending the frame up to avoid giving the upper layers indigestion." It is possible to stop AIX from sending these large segments, but it requires configuration on each LPAR. While ibmveth's behaviour is admittedly weird, we should fix this here: it shouldn't be possible for it to cause a firmware panic on another card. Cc: Thomas Falcon # ibmveth Cc: Yuval Mintz # bnx2x Thanks-to: Jay Vosburgh # veth info Signed-off-by: Daniel Axtens --- drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 2 ++ drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 33 +++- drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c | 1 - 3 files changed, 23 insertions(+), 13 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h index 352beff796ae..b36d54737d70 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h @@ -2517,4 +2517,6 @@ void bnx2x_set_rx_ts(struct bnx2x *bp, struct sk_buff *skb); */ int bnx2x_vlan_reconfigure_vid(struct bnx2x *bp); +#define MAX_PACKET_SIZE(9700) + #endif /* bnx2x.h */ diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c index 1216c1f1e052..1c5517a9348c 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c +++ 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c @@ -3742,6 +3742,7 @@ netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct net_device *dev) __le16 pkt_size = 0; struct ethhdr *eth; u8 mac_type = UNICAST_ADDRESS; + unsigned int pkts_compl = 0, bytes_compl = 0; #ifdef BNX2X_STOP_ON_ERROR if (unlikely(bp->panic)) @@ -4029,6 +4030,14 @@ netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct net_device *dev) skb->len, hlen, skb_headlen(skb), skb_shinfo(skb)->gso_size); + if (unlikely(skb_shinfo(skb)->gso_size + hlen > MAX_PACKET_SIZE)) { + BNX2X_ERR("reported gso segment size plus headers " + "(%d + %d) > MAX_PACKET_SIZE; dropping pkt!", + skb_shinfo(skb)->gso_size, hlen); + + goto free_and_drop; + } + tx_start_bd->bd_flags.as_bitfield |= ETH_TX_BD_FLAGS_SW_LSO;
[PATCH] openvswitch: fix mis-ordered comment lines for ovs_skb_cb
I was trying to wrap my head around the meaning of mru, and realised that the second line of the comment defining it had somehow ended up after the line defining cutlen, leading to much confusion. Reorder the lines to make sense.

Signed-off-by: Daniel Axtens
---
 net/openvswitch/datapath.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index da931bdef8a7..5d8dcd88815f 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -98,8 +98,8 @@ struct datapath {
  * @input_vport: The original vport packet came in on. This value is cached
  * when a packet is received by OVS.
  * @mru: The maximum received fragement size; 0 if the packet is not
- * @cutlen: The number of bytes from the packet end to be removed.
  * fragmented.
+ * @cutlen: The number of bytes from the packet end to be removed.
  */
 struct ovs_skb_cb {
 	struct vport		*input_vport;
-- 
2.11.0
bnx2x - bnx2x_attn_int_deasserted3 - MC assert!
Hi,

I'm looking into a powerpc 64-bit little-endian system with 4 (IBM-branded) Broadcom Corporation NetXtreme II BCM57800 1/10 Gigabit Ethernet adaptors installed. It's running 4.11 and using Open vSwitch and bonded interfaces.

We're seeing regular firmware errors coming out of the network cards when netperf is started as soon as the interfaces appear:

[60072.330838] bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f0)]MC assert!
[60072.330854] bnx2x: [bnx2x_mc_assert:720(enP24p1s0f0)]XSTORM_ASSERT_LIST_INDEX 0x2
[60072.330860] bnx2x: [bnx2x_mc_assert:736(enP24p1s0f0)]XSTORM_ASSERT_INDEX 0x0 = 0x 0x25e43e47 0x00463e01 0x00010052
[60072.330878] bnx2x: [bnx2x_mc_assert:750(enP24p1s0f0)]Chip Revision: everest3, FW Version: 7_13_1
... (dump of values follows) ...

I've included the dmesg output with the dump of both crashing cards below. The errors do not appear if netperf isn't started until a few minutes after the interfaces come up. The errors are not new in 4.11 - they have been reproduced in 4.8 and 4.10.

This all seems to come out of firmware rather than Linux, and I don't have the datasheets to decode the magic numbers, so I was hoping you'd be able to shed some light on it?

Thanks in advance!

Regards,
Daniel Axtens

== dmesg output ==

[60072.330838] bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f0)]MC assert!
[60072.330854] bnx2x: [bnx2x_mc_assert:720(enP24p1s0f0)]XSTORM_ASSERT_LIST_INDEX 0x2
[60072.330860] bnx2x: [bnx2x_mc_assert:736(enP24p1s0f0)]XSTORM_ASSERT_INDEX 0x0 = 0x 0x25e43e47 0x00463e01 0x00010052
[60072.330878] bnx2x: [bnx2x_mc_assert:750(enP24p1s0f0)]Chip Revision: everest3, FW Version: 7_13_1
[60072.330881] bnx2x: [bnx2x_attn_int_deasserted3:4329(enP24p1s0f0)]driver assert
[60072.330884] bnx2x: [bnx2x_panic_dump:923(enP24p1s0f0)]begin crash dump -
[60072.330888] bnx2x: [bnx2x_panic_dump:933(enP24p1s0f0)]def_idx(0xe51f) def_att_idx(0x4) attn_state(0x1) spq_prod_idx(0x37) next_stats_cnt(0xe517)
[60072.330892] bnx2x: [bnx2x_panic_dump:938(enP24p1s0f0)]DSB: attn bits(0x0) ack(0x1) id(0x0) idx(0x4)
[60072.330894] bnx2x: [bnx2x_panic_dump:939(enP24p1s0f0)] def (0x0 0x0 0x0 0x0 0x0 0x0 0x0 0xe605 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0) igu_sb_id(0x0) igu_seg_id(0x1) pf_id(0x0) vnic_id(0x0) vf_id(0xff) vf_valid (0x0) state(0x1)
[60072.330909] bnx2x: [bnx2x_panic_dump:990(enP24p1s0f0)]fp0: rx_bd_prod(0x4245) rx_bd_cons(0x7e) rx_comp_prod(0x76e1) rx_comp_cons(0x7515) *rx_cons_sb(0x7515)
[60072.330913] bnx2x: [bnx2x_panic_dump:993(enP24p1s0f0)] rx_sge_prod(0x0) last_max_sge(0x0) fp_hc_idx(0x287b)
[60072.330916] bnx2x: [bnx2x_panic_dump:1010(enP24p1s0f0)]fp0: tx_pkt_prod(0x211b) tx_pkt_cons(0x211b) tx_bd_prod(0x4278) tx_bd_cons(0x4277) *tx_cons_sb(0x211b)
[60072.330920] bnx2x: [bnx2x_panic_dump:1010(enP24p1s0f0)]fp0: tx_pkt_prod(0x0) tx_pkt_cons(0x0) tx_bd_prod(0x0) tx_bd_cons(0x0) *tx_cons_sb(0x0)
[60072.330924] bnx2x: [bnx2x_panic_dump:1010(enP24p1s0f0)]fp0: tx_pkt_prod(0x0) tx_pkt_cons(0x0) tx_bd_prod(0x0) tx_bd_cons(0x0) *tx_cons_sb(0x0)
[60072.330928] bnx2x: [bnx2x_panic_dump:1021(enP24p1s0f0)] run indexes (0x287b 0x0)
[60072.330929] bnx2x: [bnx2x_panic_dump:1027(enP24p1s0f0)] indexes (0x0 0x7515 0x0 0x0 0x0 0x211b 0x0 0x0)pf_id(0x0) vf_id(0xff) vf_valid(0x0) vnic_id(0x0) same_igu_sb_1b(0x1) state(0x1)
[60072.330956] bnx2x: [bnx2x_panic_dump:990(enP24p1s0f0)]fp1: rx_bd_prod(0x8af8) rx_bd_cons(0x931) rx_comp_prod(0xa811) rx_comp_cons(0xa645) *rx_cons_sb(0xa645)
[60072.330960] bnx2x: [bnx2x_panic_dump:993(enP24p1s0f0)] rx_sge_prod(0x0) last_max_sge(0x0) fp_hc_idx(0x21d8)
[60072.330964] bnx2x: [bnx2x_panic_dump:1010(enP24p1s0f0)]fp1: tx_pkt_prod(0x1eb8) tx_pkt_cons(0x1eb8) tx_bd_prod(0x3dad) tx_bd_cons(0x3dac) *tx_cons_sb(0x1eb8)
[60072.330968] bnx2x: [bnx2x_panic_dump:1010(enP24p1s0f0)]fp1: tx_pkt_prod(0x0) tx_pkt_cons(0x0) tx_bd_prod(0x0) tx_bd_cons(0x0) *tx_cons_sb(0x0)
[60072.330972] bnx2x: [bnx2x_panic_dump:1010(enP24p1s0f0)]fp1: tx_pkt_prod(0x0) tx_pkt_cons(0x0) tx_bd_prod(0x0) tx_bd_cons(0x0) *tx_cons_sb(0x0)
[60072.330975] bnx2x: [bnx2x_panic_dump:1021(enP24p1s0f0)] run indexes (0x21d8 0x0)
[60072.330976] bnx2x: [bnx2x_panic_dump:1027(enP24p1s0f0)] indexes (0x0 0xa645 0x0 0x0 0x0 0x1eb8 0x0 0x0)pf_id(0x0) vf_id(0xff) vf_valid(0x0) vnic_id(0x0) same_igu_sb_1b(0x1) state(0x1)
[60072.331003] bnx2x: [bnx2x_panic_dump:990(enP24p1s0f0)]fp2: rx_bd_prod(0x1cd5) rx_bd_cons(0xb0e) rx_comp_prod(0xdd30) rx_comp_cons(0xdb64) *rx_cons_sb(0xdb64)
[60072.331007] bnx2x: [bnx2x_panic_dump:993(enP24p1s0f0)] rx_sge_prod(0x0) last_max_sge(0x0) fp_hc_idx(0x97d0)
[60072.331011] bnx2x: [bnx2x_panic_dump:1010(enP24p1s0f0)]fp2: tx_pkt_prod(0x1ed6) tx_pkt_cons(0x1ed6) tx_bd_prod(0x3de9) tx_bd_cons(0x3de8) *tx_cons_sb(0x1ed6)
[60072.331015] bnx2x: [bnx2x_panic_dump:1010(enP24p1s0f0)]fp2: tx_pkt_prod(0x0)