Re: Moving frags and SKBTX_DEV_ZEROCOPY skbs
On 14/05/14 20:41, Zoltan Kiss wrote: But here is the thing: deliver_skb calls orphan_frags for every packet delivered to the local stack, so we are safe IF these functions are called before the IP stack. So we are safe now, but things can go wrong, if: - such a frag-mangling function is called before deliver_skb, now or in the future - if someone wants to take advantage of zerocopy in the guest<->backend path Running through the code I've found the following core functions can shuffle frags between skbs (and don't handle zerocopy skbs already): skb_gro_receive skb_shift skb_split None of them can meet at the moment with zerocopy skbs, but it's better to keep it in mind for the future, that would blow up these kind of skbs. Zoli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Moving frags and SKBTX_DEV_ZEROCOPY skbs
On 14/05/14 15:23, Eric Dumazet wrote: On Wed, 2014-05-14 at 14:40 +0100, Zoltan Kiss wrote: Hi, Recently I've investigated issues around SKBTX_DEV_ZEROCOPY skbs where the frags list were modified. I came across this function skb_shift(), which moves frags between skbs. And there are a lot more of such kind, skb_split or skb_try_coalesce, for example. It could be a dangerous thing if a frag is referenced from an skb which doesn't have the original destructor_arg, and to avoid that skb_orphan_frags should be called. Although probably these functions are not normally touched in usual usecases, I think it would be useful to review core skb functions proactively and add an skb_orphan_frags everywhere where the frags could be referenced from other places. Any opinion about this? For skb_shift(), it is currently used from tcp stack only, where this SKBTX_DEV_ZEROCOPY thing is not used, so I do not think there is a bug for the moment. It is called from tcp_input.c, which suggests it can be called on incoming TCP packets. If the backend domain communicates with the frontend through sockets, zerocopy packets can turn up here. But here is the thing: deliver_skb calls orphan_frags for every packet delivered to the local stack, so we are safe IF these functions are called before the IP stack. So we are safe now, but things can go wrong, if: - such a frag-mangling function is called before deliver_skb, now or in the future - if someone wants to take advantage of zerocopy in the guest<->backend path I already gave a patch for skb_try_coalesce() : For this one we do not wan skb_orphan_frags() overhead. Its simply better in this case to abort. diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 1b62343f5837..85995a14aafc 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -3838,7 +3839,10 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from, return true; } - if (skb_has_frag_list(to) || skb_has_frag_list(from)) + if (skb_has_frag_list(to) || + skb_has_frag_list(from) || + (skb_shinfo(to)->tx_flags & SKBTX_DEV_ZEROCOPY) || + (skb_shinfo(from)->tx_flags & SKBTX_DEV_ZEROCOPY)) return false; if (skb_headlen(from) != 0) { -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Moving frags and SKBTX_DEV_ZEROCOPY skbs
Hi, Recently I've investigated issues around SKBTX_DEV_ZEROCOPY skbs where the frags list were modified. I came across this function skb_shift(), which moves frags between skbs. And there are a lot more of such kind, skb_split or skb_try_coalesce, for example. It could be a dangerous thing if a frag is referenced from an skb which doesn't have the original destructor_arg, and to avoid that skb_orphan_frags should be called. Although probably these functions are not normally touched in usual usecases, I think it would be useful to review core skb functions proactively and add an skb_orphan_frags everywhere where the frags could be referenced from other places. Any opinion about this? Regards, Zoltan -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3.13] core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
On 22/04/14 20:18, Ben Hutchings wrote: From: Zoltan Kiss commit 36d5fe6a000790f56039afe26834265db0a3ad4c upstream. skb_zerocopy can copy elements of the frags array between skbs, but it doesn't orphan them. Also, it doesn't handle errors, so this patch takes care of that as well, and modify the callers accordingly. skb_tx_error() is also added to the callers so they will signal the failed delivery towards the creator of the skb. Signed-off-by: Zoltan Kiss Signed-off-by: David S. Miller [bwh: Backported to 3.13: skb_zerocopy() is new in 3.14, but was moved from a static function in nfnetlink_queue. We need to patch that and its caller, but not openvswitch.] Signed-off-by: Ben Hutchings --- On Tue, 2014-04-22 at 19:02 +0100, Zoltan Kiss wrote: On 22/04/14 16:38, Ben Hutchings wrote: On Mon, 2014-04-21 at 12:26 +0100, Luis Henriques wrote: Hi David, On Thu, Mar 27, 2014 at 03:29:56PM -0400, David Miller wrote: From: Zoltan Kiss Date: Wed, 26 Mar 2014 22:37:45 + skb_zerocopy can copy elements of the frags array between skbs, but it doesn't orphan them. Also, it doesn't handle errors, so this patch takes care of that as well, and modify the callers accordingly. skb_tx_error() is also added to the callers so they will signal the failed delivery towards the creator of the skb. Signed-off-by: Zoltan Kiss Fingers crossed :-) Applied, thanks Zoltan. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Could you please queue this for stable as well? It has CVE-2014-2568 assigned to it and I believe it is applicable to some of the trees. It was applied to 3.13, but needs backporting to earlier versions. I posted my attempt in <1397429860.10849.86.ca...@deadeye.wl.decadent.org.uk> but it needs testing/reviewing. This one is a different issue, although it is very similar. Sorry for mixing these up. I believe this bug has been present since zerocopy was added to nfqueue in Linux 3.10 (commit ae08ce002108). This is the version I used for Debian's 3.13 branch, which might be usable for older stable branches too. Seems OK, thanks! Ben. --- --- a/net/netfilter/nfnetlink_queue_core.c +++ b/net/netfilter/nfnetlink_queue_core.c @@ -235,22 +235,23 @@ nfqnl_flush(struct nfqnl_instance *queue spin_unlock_bh(&queue->lock); } -static void +static int nfqnl_zcopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen) { int i, j = 0; int plen = 0; /* length of skb->head fragment */ + int ret; struct page *page; unsigned int offset; /* dont bother with small payloads */ - if (len <= skb_tailroom(to)) { - skb_copy_bits(from, 0, skb_put(to, len), len); - return; - } + if (len <= skb_tailroom(to)) + return skb_copy_bits(from, 0, skb_put(to, len), len); if (hlen) { - skb_copy_bits(from, 0, skb_put(to, hlen), hlen); + ret = skb_copy_bits(from, 0, skb_put(to, hlen), hlen); + if (unlikely(ret)) + return ret; len -= hlen; } else { plen = min_t(int, skb_headlen(from), len); @@ -268,6 +269,11 @@ nfqnl_zcopy(struct sk_buff *to, const st to->len += len + plen; to->data_len += len + plen; + if (unlikely(skb_orphan_frags(from, GFP_ATOMIC))) { + skb_tx_error(from); + return -ENOMEM; + } + for (i = 0; i < skb_shinfo(from)->nr_frags; i++) { if (!len) break; @@ -278,6 +284,8 @@ nfqnl_zcopy(struct sk_buff *to, const st j++; } skb_shinfo(to)->nr_frags = j; + + return 0; } static int @@ -374,13 +382,16 @@ nfqnl_build_packet_message(struct net *n skb = nfnetlink_alloc_skb(net, size, queue->peer_portid, GFP_ATOMIC); - if (!skb) + if (!skb) { + skb_tx_error(entskb); return NULL; + } nlh = nlmsg_put(skb, 0, 0, NFNL_SUBSYS_QUEUE << 8 | NFQNL_MSG_PACKET, sizeof(struct nfgenmsg), 0); if (!nlh) { + skb_tx_error(entskb); kfree_skb(skb); return NULL; } @@ -504,13 +515,15 @@ nfqnl_build_packet_message(struct net *n nla->nla_type = NFQA_PAYLOAD; nla->nla_len = nla_attr_size(data_len); - nfqnl_zcopy(skb, entskb, data_len, hlen); + if (nfqnl_zcopy(skb, entskb, data_len, hlen)) + goto nla_put_failure; } nlh->nlmsg_len = skb->len; return skb; nla_put_failure: + skb_tx_error(entskb);
Re: [PATCH v5] core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
On 22/04/14 16:38, Ben Hutchings wrote: On Mon, 2014-04-21 at 12:26 +0100, Luis Henriques wrote: Hi David, On Thu, Mar 27, 2014 at 03:29:56PM -0400, David Miller wrote: From: Zoltan Kiss Date: Wed, 26 Mar 2014 22:37:45 + skb_zerocopy can copy elements of the frags array between skbs, but it doesn't orphan them. Also, it doesn't handle errors, so this patch takes care of that as well, and modify the callers accordingly. skb_tx_error() is also added to the callers so they will signal the failed delivery towards the creator of the skb. Signed-off-by: Zoltan Kiss Fingers crossed :-) Applied, thanks Zoltan. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Could you please queue this for stable as well? It has CVE-2014-2568 assigned to it and I believe it is applicable to some of the trees. It was applied to 3.13, but needs backporting to earlier versions. I posted my attempt in <1397429860.10849.86.ca...@deadeye.wl.decadent.org.uk> but it needs testing/reviewing. This one is a different issue, although it is very similar. Zoli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4] core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
On 26/03/14 20:12, David Miller wrote: From: David Miller Date: Wed, 26 Mar 2014 15:59:58 -0400 (EDT) From: Zoltan Kiss Date: Fri, 21 Mar 2014 10:31:34 + skb_zerocopy can copy elements of the frags array between skbs, but it doesn't orphan them. Also, it doesn't handle errors, so this patch takes care of that as well, and modify the callers accordingly. skb_tx_error() is also added to the callers so they will signal the failed delivery towards the creator of the skb. Signed-off-by: Zoltan Kiss Applied, thanks Zoltan. Actually, Zoltan, you have to fix this: net/core/skbuff.c: In function ‘skb_zerocopy’: net/core/skbuff.c:2172:2: warning: passing argument 1 of ‘skb_orphan_frags’ discards ‘const’ qualifi er from pointer target type [enabled by default] In file included from include/linux/tcp.h:21:0, from net/core/skbuff.c:50: include/linux/skbuff.h:1904:19: note: expected ‘struct sk_buff *’ but argument is of type ‘const str uct sk_buff *’ net/core/skbuff.c:2173:3: warning: passing argument 1 of ‘skb_tx_error’ discards ‘const’ qualifier f rom pointer target type [enabled by default] net/core/skbuff.c:642:6: note: expected ‘struct sk_buff *’ but argument is of type ‘const struct sk_ buff *’ Ok, resubmitted. 'from' is now not a const parameter, because skb->pfmemalloc might change. Zoli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5] core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
skb_zerocopy can copy elements of the frags array between skbs, but it doesn't orphan them. Also, it doesn't handle errors, so this patch takes care of that as well, and modify the callers accordingly. skb_tx_error() is also added to the callers so they will signal the failed delivery towards the creator of the skb. Signed-off-by: Zoltan Kiss --- v2: orphan the frags right before touching the frags v3: - orphan 'from' instead of 'to' - call skb_tx_error() in the callers if something went wrong v4: correctly use error path in queue_userspace_packet v5: remove const qualifier of "from" in skb_zerocopy parameters diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 03db95a..0db91fa 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2508,8 +2508,8 @@ int skb_splice_bits(struct sk_buff *skb, unsigned int offset, unsigned int flags); void skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to); unsigned int skb_zerocopy_headlen(const struct sk_buff *from); -void skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, - int len, int hlen); +int skb_zerocopy(struct sk_buff *to, struct sk_buff *from, +int len, int hlen); void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len); int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen); void skb_scrub_packet(struct sk_buff *skb, bool xnet); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 3f14c63..908ad98 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -2127,25 +2127,31 @@ EXPORT_SYMBOL_GPL(skb_zerocopy_headlen); * * The `hlen` as calculated by skb_zerocopy_headlen() specifies the * headroom in the `to` buffer. + * + * Return value: + * 0: everything is OK + * -ENOMEM: couldn't orphan frags of @from due to lack of memory + * -EFAULT: skb_copy_bits() found some problem with skb geometry */ -void -skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen) +int +skb_zerocopy(struct sk_buff *to, struct sk_buff *from, int len, int hlen) { int i, j = 0; int plen = 0; /* length of skb->head fragment */ + int ret; struct page *page; unsigned int offset; BUG_ON(!from->head_frag && !hlen); /* dont bother with small payloads */ - if (len <= skb_tailroom(to)) { - skb_copy_bits(from, 0, skb_put(to, len), len); - return; - } + if (len <= skb_tailroom(to)) + return skb_copy_bits(from, 0, skb_put(to, len), len); if (hlen) { - skb_copy_bits(from, 0, skb_put(to, hlen), hlen); + ret = skb_copy_bits(from, 0, skb_put(to, hlen), hlen); + if (unlikely(ret)) + return ret; len -= hlen; } else { plen = min_t(int, skb_headlen(from), len); @@ -2163,6 +2169,11 @@ skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen) to->len += len + plen; to->data_len += len + plen; + if (unlikely(skb_orphan_frags(from, GFP_ATOMIC))) { + skb_tx_error(from); + return -ENOMEM; + } + for (i = 0; i < skb_shinfo(from)->nr_frags; i++) { if (!len) break; @@ -2173,6 +2184,8 @@ skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen) j++; } skb_shinfo(to)->nr_frags = j; + + return 0; } EXPORT_SYMBOL_GPL(skb_zerocopy); diff --git a/net/netfilter/nfnetlink_queue_core.c b/net/netfilter/nfnetlink_queue_core.c index f072fe8..108120f 100644 --- a/net/netfilter/nfnetlink_queue_core.c +++ b/net/netfilter/nfnetlink_queue_core.c @@ -354,13 +354,16 @@ nfqnl_build_packet_message(struct net *net, struct nfqnl_instance *queue, skb = nfnetlink_alloc_skb(net, size, queue->peer_portid, GFP_ATOMIC); - if (!skb) + if (!skb) { + skb_tx_error(entskb); return NULL; + } nlh = nlmsg_put(skb, 0, 0, NFNL_SUBSYS_QUEUE << 8 | NFQNL_MSG_PACKET, sizeof(struct nfgenmsg), 0); if (!nlh) { + skb_tx_error(entskb); kfree_skb(skb); return NULL; } @@ -488,13 +491,15 @@ nfqnl_build_packet_message(struct net *net, struct nfqnl_instance *queue, nla->nla_type = NFQA_PAYLOAD; nla->nla_len = nla_attr_size(data_len); - skb_zerocopy(skb, entskb, data_len, hlen); + if (skb_zerocopy(skb, entskb, data_len, hlen)) + goto nla_put_failure; } nlh->nlmsg_len = skb->len; return skb; nla_put_failure: + skb_tx_error(en
Re: [PATCH v3] core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
On 20/03/14 22:22, Thomas Graf wrote: On 03/20/2014 05:02 PM, Zoltan Kiss wrote: --- a/net/openvswitch/datapath.c +++ b/net/openvswitch/datapath.c @@ -464,7 +464,9 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb, } nla->nla_len = nla_attr_size(skb->len); -skb_zerocopy(user_skb, skb, skb->len, hlen); +err = skb_zerocopy(user_skb, skb, skb->len, hlen); +if (err) +goto out; /* Pad OVS_PACKET_ATTR_PACKET if linear copy was performed */ if (!(dp->user_features & OVS_DP_F_UNALIGNED)) { @@ -478,6 +480,7 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb, err = genlmsg_unicast(ovs_dp_get_net(dp), user_skb, upcall_info->portid); out: +skb_tx_error(skb); Really sorry to bug you again. We'll get this right ;-) Patch looks great except for this call to skb_tx_error() which is now done in the error *and* success path. Make the call conditional on if (err) and we're good. Ah, sorry, I'm really sloppy with this patch. I've sent a fixed one. Zoli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4] core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
skb_zerocopy can copy elements of the frags array between skbs, but it doesn't orphan them. Also, it doesn't handle errors, so this patch takes care of that as well, and modify the callers accordingly. skb_tx_error() is also added to the callers so they will signal the failed delivery towards the creator of the skb. Signed-off-by: Zoltan Kiss --- v2: orphan the frags right before touching the frags v3: - orphan 'from' instead of 'to' - call skb_tx_error() in the callers if something went wrong v4: correctly use error path in queue_userspace_packet diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 03db95a..35c4e85 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2508,8 +2508,8 @@ int skb_splice_bits(struct sk_buff *skb, unsigned int offset, unsigned int flags); void skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to); unsigned int skb_zerocopy_headlen(const struct sk_buff *from); -void skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, - int len, int hlen); +int skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, +int len, int hlen); void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len); int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen); void skb_scrub_packet(struct sk_buff *skb, bool xnet); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 3f14c63..4cf0ee5 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -2127,25 +2127,31 @@ EXPORT_SYMBOL_GPL(skb_zerocopy_headlen); * * The `hlen` as calculated by skb_zerocopy_headlen() specifies the * headroom in the `to` buffer. + * + * Return value: + * 0: everything is OK + * -ENOMEM: couldn't orphan frags of @from due to lack of memory + * -EFAULT: skb_copy_bits() found some problem with skb geometry */ -void +int skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen) { int i, j = 0; int plen = 0; /* length of skb->head fragment */ + int ret; struct page *page; unsigned int offset; BUG_ON(!from->head_frag && !hlen); /* dont bother with small payloads */ - if (len <= skb_tailroom(to)) { - skb_copy_bits(from, 0, skb_put(to, len), len); - return; - } + if (len <= skb_tailroom(to)) + return skb_copy_bits(from, 0, skb_put(to, len), len); if (hlen) { - skb_copy_bits(from, 0, skb_put(to, hlen), hlen); + ret = skb_copy_bits(from, 0, skb_put(to, hlen), hlen); + if (unlikely(ret)) + return ret; len -= hlen; } else { plen = min_t(int, skb_headlen(from), len); @@ -2163,6 +2169,11 @@ skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen) to->len += len + plen; to->data_len += len + plen; + if (unlikely(skb_orphan_frags(from, GFP_ATOMIC))) { + skb_tx_error(from); + return -ENOMEM; + } + for (i = 0; i < skb_shinfo(from)->nr_frags; i++) { if (!len) break; @@ -2173,6 +2184,8 @@ skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen) j++; } skb_shinfo(to)->nr_frags = j; + + return 0; } EXPORT_SYMBOL_GPL(skb_zerocopy); diff --git a/net/netfilter/nfnetlink_queue_core.c b/net/netfilter/nfnetlink_queue_core.c index f072fe8..108120f 100644 --- a/net/netfilter/nfnetlink_queue_core.c +++ b/net/netfilter/nfnetlink_queue_core.c @@ -354,13 +354,16 @@ nfqnl_build_packet_message(struct net *net, struct nfqnl_instance *queue, skb = nfnetlink_alloc_skb(net, size, queue->peer_portid, GFP_ATOMIC); - if (!skb) + if (!skb) { + skb_tx_error(entskb); return NULL; + } nlh = nlmsg_put(skb, 0, 0, NFNL_SUBSYS_QUEUE << 8 | NFQNL_MSG_PACKET, sizeof(struct nfgenmsg), 0); if (!nlh) { + skb_tx_error(entskb); kfree_skb(skb); return NULL; } @@ -488,13 +491,15 @@ nfqnl_build_packet_message(struct net *net, struct nfqnl_instance *queue, nla->nla_type = NFQA_PAYLOAD; nla->nla_len = nla_attr_size(data_len); - skb_zerocopy(skb, entskb, data_len, hlen); + if (skb_zerocopy(skb, entskb, data_len, hlen)) + goto nla_put_failure; } nlh->nlmsg_len = skb->len; return skb; nla_put_failure: + skb_tx_error(entskb); kfree_skb(skb); net_err_ratelimited("nf_queue: error creating packet message\n"); return NULL; diff --
[PATCH v3] core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
skb_zerocopy can copy elements of the frags array between skbs, but it doesn't orphan them. Also, it doesn't handle errors, so this patch takes care of that as well, and modify the callers accordingly. skb_tx_error() is also added to the callers so they will signal the failed delivery towards the creator of the skb. Signed-off-by: Zoltan Kiss --- v2: orphan the frags right before touching the frags v3: - orphan 'from' instead of 'to' - call skb_tx_error() in the callers if something went wrong diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 03db95a..35c4e85 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2508,8 +2508,8 @@ int skb_splice_bits(struct sk_buff *skb, unsigned int offset, unsigned int flags); void skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to); unsigned int skb_zerocopy_headlen(const struct sk_buff *from); -void skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, - int len, int hlen); +int skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, +int len, int hlen); void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len); int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen); void skb_scrub_packet(struct sk_buff *skb, bool xnet); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 3f14c63..73f0291 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -2127,25 +2127,31 @@ EXPORT_SYMBOL_GPL(skb_zerocopy_headlen); * * The `hlen` as calculated by skb_zerocopy_headlen() specifies the * headroom in the `to` buffer. + * + * Return value: + * 0: everything is OK + * -ENOMEM: couldn't orphan frags of from due to lack of memory + * -EFAULT: skb_copy_bits() found some problem with geometry */ -void +int skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen) { int i, j = 0; int plen = 0; /* length of skb->head fragment */ + int ret; struct page *page; unsigned int offset; BUG_ON(!from->head_frag && !hlen); /* dont bother with small payloads */ - if (len <= skb_tailroom(to)) { - skb_copy_bits(from, 0, skb_put(to, len), len); - return; - } + if (len <= skb_tailroom(to)) + return skb_copy_bits(from, 0, skb_put(to, len), len); if (hlen) { - skb_copy_bits(from, 0, skb_put(to, hlen), hlen); + ret = skb_copy_bits(from, 0, skb_put(to, hlen), hlen); + if (unlikely(ret)) + return ret; len -= hlen; } else { plen = min_t(int, skb_headlen(from), len); @@ -2163,6 +2169,11 @@ skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen) to->len += len + plen; to->data_len += len + plen; + if (unlikely(skb_orphan_frags(from, GFP_ATOMIC))) { + skb_tx_error(from); + return -ENOMEM; + } + for (i = 0; i < skb_shinfo(from)->nr_frags; i++) { if (!len) break; @@ -2173,6 +2184,8 @@ skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen) j++; } skb_shinfo(to)->nr_frags = j; + + return 0; } EXPORT_SYMBOL_GPL(skb_zerocopy); diff --git a/net/netfilter/nfnetlink_queue_core.c b/net/netfilter/nfnetlink_queue_core.c index f072fe8..108120f 100644 --- a/net/netfilter/nfnetlink_queue_core.c +++ b/net/netfilter/nfnetlink_queue_core.c @@ -354,13 +354,16 @@ nfqnl_build_packet_message(struct net *net, struct nfqnl_instance *queue, skb = nfnetlink_alloc_skb(net, size, queue->peer_portid, GFP_ATOMIC); - if (!skb) + if (!skb) { + skb_tx_error(entskb); return NULL; + } nlh = nlmsg_put(skb, 0, 0, NFNL_SUBSYS_QUEUE << 8 | NFQNL_MSG_PACKET, sizeof(struct nfgenmsg), 0); if (!nlh) { + skb_tx_error(entskb); kfree_skb(skb); return NULL; } @@ -488,13 +491,15 @@ nfqnl_build_packet_message(struct net *net, struct nfqnl_instance *queue, nla->nla_type = NFQA_PAYLOAD; nla->nla_len = nla_attr_size(data_len); - skb_zerocopy(skb, entskb, data_len, hlen); + if (skb_zerocopy(skb, entskb, data_len, hlen)) + goto nla_put_failure; } nlh->nlmsg_len = skb->len; return skb; nla_put_failure: + skb_tx_error(entskb); kfree_skb(skb); net_err_ratelimited("nf_queue: error creating packet message\n"); return NULL; diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c inde
Re: [PATCH v2] core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
On 20/03/14 12:33, Thomas Graf wrote: On 03/20/2014 01:16 PM, Thomas Graf wrote: On 03/19/2014 10:07 PM, Zoltan Kiss wrote: skb_zerocopy can copy elements of the frags array between skbs, but it doesn't orphan them. Also, it doesn't handle errors, so this patch takes care of that as well. Signed-off-by: Zoltan Kiss Acked-by: Thomas Graf I take this back ;) --- +if (unlikely(skb_orphan_frags(to, GFP_ATOMIC))) { +skb_tx_error(to); +return -ENOMEM; +} Just noticed that you orphan the Netlink skb frags which do not exist yet instead of the source skb frags. Oh, sorry, I'll fix this up. Did you consider calling skb_tx_error() for Netlink message allocation failures for the upcall as well? That memory pressure is currently not reported back. Yeah, that makes sense, I'll fix it the callers. Btw. I guess that just provides some statistical data for the creator of the skb. At least for netback it only increment a stat counter. Zoli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
On 19/03/14 20:47, Thomas Graf wrote: On 03/19/2014 09:38 PM, Zoltan Kiss wrote: skb_zerocopy can copy elements of the frags array between skbs, but it doesn't orphan them. Also, it doesn't handle errors, so this patch takes care of that as well. Signed-off-by: Zoltan Kiss --- +if (unlikely(skb_orphan_frags(to, GFP_ATOMIC))) { +skb_tx_error(to); +return -ENOMEM; } I think you should move this down to right before we iterate over the frags. I agree, I was mislead by that __skb_fill_page_desc, but now I see it is actually the head buffer of the source, not a frag from there. And in case of a failure in skb_copy_bits we can spare an orphan_frag. Zoli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
skb_zerocopy can copy elements of the frags array between skbs, but it doesn't orphan them. Also, it doesn't handle errors, so this patch takes care of that as well. Signed-off-by: Zoltan Kiss --- v2: orphan the frags right before touching the frags diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 03db95a..35c4e85 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2508,8 +2508,8 @@ int skb_splice_bits(struct sk_buff *skb, unsigned int offset, unsigned int flags); void skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to); unsigned int skb_zerocopy_headlen(const struct sk_buff *from); -void skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, - int len, int hlen); +int skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, +int len, int hlen); void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len); int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen); void skb_scrub_packet(struct sk_buff *skb, bool xnet); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 3f14c63..5cc99b1 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -2127,25 +2127,31 @@ EXPORT_SYMBOL_GPL(skb_zerocopy_headlen); * * The `hlen` as calculated by skb_zerocopy_headlen() specifies the * headroom in the `to` buffer. + * + * Return value: + * 0: everything is OK + * -ENOMEM: couldn't orphan frags of from due to lack of memory + * -EFAULT: skb_copy_bits() found some problem with geometry */ -void +int skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen) { int i, j = 0; int plen = 0; /* length of skb->head fragment */ + int ret; struct page *page; unsigned int offset; BUG_ON(!from->head_frag && !hlen); /* dont bother with small payloads */ - if (len <= skb_tailroom(to)) { - skb_copy_bits(from, 0, skb_put(to, len), len); - return; - } + if (len <= skb_tailroom(to)) + return skb_copy_bits(from, 0, skb_put(to, len), len); if (hlen) { - skb_copy_bits(from, 0, skb_put(to, hlen), hlen); + ret = skb_copy_bits(from, 0, skb_put(to, hlen), hlen); + if (unlikely(ret)) + return ret; len -= hlen; } else { plen = min_t(int, skb_headlen(from), len); @@ -2163,6 +2169,11 @@ skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen) to->len += len + plen; to->data_len += len + plen; + if (unlikely(skb_orphan_frags(to, GFP_ATOMIC))) { + skb_tx_error(to); + return -ENOMEM; + } + for (i = 0; i < skb_shinfo(from)->nr_frags; i++) { if (!len) break; @@ -2173,6 +2184,8 @@ skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen) j++; } skb_shinfo(to)->nr_frags = j; + + return 0; } EXPORT_SYMBOL_GPL(skb_zerocopy); diff --git a/net/netfilter/nfnetlink_queue_core.c b/net/netfilter/nfnetlink_queue_core.c index f072fe8..ea02d16 100644 --- a/net/netfilter/nfnetlink_queue_core.c +++ b/net/netfilter/nfnetlink_queue_core.c @@ -488,7 +488,8 @@ nfqnl_build_packet_message(struct net *net, struct nfqnl_instance *queue, nla->nla_type = NFQA_PAYLOAD; nla->nla_len = nla_attr_size(data_len); - skb_zerocopy(skb, entskb, data_len, hlen); + if (skb_zerocopy(skb, entskb, data_len, hlen)) + goto nla_put_failure; } nlh->nlmsg_len = skb->len; diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c index c53fe0c..9a9f668 100644 --- a/net/openvswitch/datapath.c +++ b/net/openvswitch/datapath.c @@ -464,7 +464,9 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb, } nla->nla_len = nla_attr_size(skb->len); - skb_zerocopy(user_skb, skb, skb->len, hlen); + err = skb_zerocopy(user_skb, skb, skb->len, hlen); + if (err) + goto out; /* Pad OVS_PACKET_ATTR_PACKET if linear copy was performed */ if (!(dp->user_features & OVS_DP_F_UNALIGNED)) { -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
On 19/03/14 20:24, David Miller wrote: From: Zoltan Kiss Date: Tue, 18 Mar 2014 21:17:35 + skb_zerocopy can copy elements of the frags array between skbs, but it doesn't orphan them. Also, it doesn't handle errors, so this patch takes care of that as well. Signed-off-by: Zoltan Kiss --- net/openvswitch/datapath.c |6 ++ Something is seriously wrong with the diffstat generated for your submission, please fix this and resubmit. Sorry, I've turned off diffstat in guilt a while ago, but this patch had an old version of it. I've resent it. Zoli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
skb_zerocopy can copy elements of the frags array between skbs, but it doesn't orphan them. Also, it doesn't handle errors, so this patch takes care of that as well. Signed-off-by: Zoltan Kiss --- diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 03db95a..35c4e85 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2508,8 +2508,8 @@ int skb_splice_bits(struct sk_buff *skb, unsigned int offset, unsigned int flags); void skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to); unsigned int skb_zerocopy_headlen(const struct sk_buff *from); -void skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, - int len, int hlen); +int skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, +int len, int hlen); void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len); int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen); void skb_scrub_packet(struct sk_buff *skb, bool xnet); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 3f14c63..60f623c 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -2127,25 +2127,36 @@ EXPORT_SYMBOL_GPL(skb_zerocopy_headlen); * * The `hlen` as calculated by skb_zerocopy_headlen() specifies the * headroom in the `to` buffer. + * + * Return value: + * 0: everything is OK + * -ENOMEM: couldn't orphan frags of from due to lack of memory + * -EFAULT: skb_copy_bits() found some problem with geometry */ -void +int skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen) { int i, j = 0; int plen = 0; /* length of skb->head fragment */ + int ret; struct page *page; unsigned int offset; BUG_ON(!from->head_frag && !hlen); /* dont bother with small payloads */ - if (len <= skb_tailroom(to)) { - skb_copy_bits(from, 0, skb_put(to, len), len); - return; + if (len <= skb_tailroom(to)) + return skb_copy_bits(from, 0, skb_put(to, len), len); + + if (unlikely(skb_orphan_frags(to, GFP_ATOMIC))) { + skb_tx_error(to); + return -ENOMEM; } if (hlen) { - skb_copy_bits(from, 0, skb_put(to, hlen), hlen); + ret = skb_copy_bits(from, 0, skb_put(to, hlen), hlen); + if (unlikely(ret)) + return ret; len -= hlen; } else { plen = min_t(int, skb_headlen(from), len); @@ -2173,6 +2184,8 @@ skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen) j++; } skb_shinfo(to)->nr_frags = j; + + return 0; } EXPORT_SYMBOL_GPL(skb_zerocopy); diff --git a/net/netfilter/nfnetlink_queue_core.c b/net/netfilter/nfnetlink_queue_core.c index f072fe8..ea02d16 100644 --- a/net/netfilter/nfnetlink_queue_core.c +++ b/net/netfilter/nfnetlink_queue_core.c @@ -488,7 +488,8 @@ nfqnl_build_packet_message(struct net *net, struct nfqnl_instance *queue, nla->nla_type = NFQA_PAYLOAD; nla->nla_len = nla_attr_size(data_len); - skb_zerocopy(skb, entskb, data_len, hlen); + if (skb_zerocopy(skb, entskb, data_len, hlen)) + goto nla_put_failure; } nlh->nlmsg_len = skb->len; diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c index c53fe0c..9a9f668 100644 --- a/net/openvswitch/datapath.c +++ b/net/openvswitch/datapath.c @@ -464,7 +464,9 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb, } nla->nla_len = nla_attr_size(skb->len); - skb_zerocopy(user_skb, skb, skb->len, hlen); + err = skb_zerocopy(user_skb, skb, skb->len, hlen); + if (err) + goto out; /* Pad OVS_PACKET_ATTR_PACKET if linear copy was performed */ if (!(dp->user_features & OVS_DP_F_UNALIGNED)) { -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
skb_zerocopy can copy elements of the frags array between skbs, but it doesn't orphan them. Also, it doesn't handle errors, so this patch takes care of that as well. Signed-off-by: Zoltan Kiss --- net/openvswitch/datapath.c |6 ++ 1 file changed, 6 insertions(+) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 03db95a..35c4e85 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2508,8 +2508,8 @@ int skb_splice_bits(struct sk_buff *skb, unsigned int offset, unsigned int flags); void skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to); unsigned int skb_zerocopy_headlen(const struct sk_buff *from); -void skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, - int len, int hlen); +int skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, +int len, int hlen); void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len); int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen); void skb_scrub_packet(struct sk_buff *skb, bool xnet); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 3f14c63..60f623c 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -2127,25 +2127,36 @@ EXPORT_SYMBOL_GPL(skb_zerocopy_headlen); * * The `hlen` as calculated by skb_zerocopy_headlen() specifies the * headroom in the `to` buffer. + * + * Return value: + * 0: everything is OK + * -ENOMEM: couldn't orphan frags of from due to lack of memory + * -EFAULT: skb_copy_bits() found some problem with geometry */ -void +int skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen) { int i, j = 0; int plen = 0; /* length of skb->head fragment */ + int ret; struct page *page; unsigned int offset; BUG_ON(!from->head_frag && !hlen); /* dont bother with small payloads */ - if (len <= skb_tailroom(to)) { - skb_copy_bits(from, 0, skb_put(to, len), len); - return; + if (len <= skb_tailroom(to)) + return skb_copy_bits(from, 0, skb_put(to, len), len); + + if (unlikely(skb_orphan_frags(to, GFP_ATOMIC))) { + skb_tx_error(to); + return -ENOMEM; } if (hlen) { - skb_copy_bits(from, 0, skb_put(to, hlen), hlen); + ret = skb_copy_bits(from, 0, skb_put(to, hlen), hlen); + if (unlikely(ret)) + return ret; len -= hlen; } else { plen = min_t(int, skb_headlen(from), len); @@ -2173,6 +2184,8 @@ skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen) j++; } skb_shinfo(to)->nr_frags = j; + + return 0; } EXPORT_SYMBOL_GPL(skb_zerocopy); diff --git a/net/netfilter/nfnetlink_queue_core.c b/net/netfilter/nfnetlink_queue_core.c index f072fe8..ea02d16 100644 --- a/net/netfilter/nfnetlink_queue_core.c +++ b/net/netfilter/nfnetlink_queue_core.c @@ -488,7 +488,8 @@ nfqnl_build_packet_message(struct net *net, struct nfqnl_instance *queue, nla->nla_type = NFQA_PAYLOAD; nla->nla_len = nla_attr_size(data_len); - skb_zerocopy(skb, entskb, data_len, hlen); + if (skb_zerocopy(skb, entskb, data_len, hlen)) + goto nla_put_failure; } nlh->nlmsg_len = skb->len; diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c index c53fe0c..9a9f668 100644 --- a/net/openvswitch/datapath.c +++ b/net/openvswitch/datapath.c @@ -464,7 +464,9 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb, } nla->nla_len = nla_attr_size(skb->len); - skb_zerocopy(user_skb, skb, skb->len, hlen); + err = skb_zerocopy(user_skb, skb, skb->len, hlen); + if (err) + goto out; /* Pad OVS_PACKET_ATTR_PACKET if linear copy was performed */ if (!(dp->user_features & OVS_DP_F_UNALIGNED)) { -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
On 07/03/14 04:46, Pravin Shelar wrote: On Thu, Mar 6, 2014 at 9:09 AM, Zoltan Kiss wrote: Do you have any feedback on this? I'm also adding KVM list as they might be interested in this. Zoli On 28/02/14 19:16, Zoltan Kiss wrote: The kernel datapath now switched to zerocopy Netlink messages, but that also means that the pages on frags array are sent straight to userspace. If those pages came outside the kernel, we have to swap them out with local copies. Signed-off-by: Zoltan Kiss I do not think this is required, netlink zero copy only maps pre-allocated buffers to user-space. How do you mean "pre-allocated"? By who? As far as I've seen the skb in this function came straight from the device (vif in our case), and skb_zerocopy just copy the frags to user_skb, which is sent to the userspace. Those frags contain pages from guest, and it's a bad idea to pass them to userspace: e.g if userspace dies in the meantime, what happens with them? Also, in Xen's case they are actually not mapped to userspace, so accessing them can lead to garbage. Zoli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
Do you have any feedback on this? I'm also adding KVM list as they might be interested in this. Zoli On 28/02/14 19:16, Zoltan Kiss wrote: The kernel datapath now switched to zerocopy Netlink messages, but that also means that the pages on frags array are sent straight to userspace. If those pages came outside the kernel, we have to swap them out with local copies. Signed-off-by: Zoltan Kiss --- net/openvswitch/datapath.c |6 ++ 1 file changed, 6 insertions(+) diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c index 36f8872..ffb563c 100644 --- a/net/openvswitch/datapath.c +++ b/net/openvswitch/datapath.c @@ -464,6 +464,12 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb, } nla->nla_len = nla_attr_size(skb->len); + if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) { + err = -ENOMEM; + skb_tx_error(skb); + goto out; + } + skb_zerocopy(user_skb, skb, skb->len, hlen); /* Pad OVS_PACKET_ATTR_PACKET if linear copy was performed */ -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] openvswitch: orphan frags on local receive
Hi, I'm also planning a similar patch, but it will call skb_orphan_frags on the skb in datapath.c::queue_userspace_packet, right before skb_zerocopy, so packets sent up to userspace via Netlink doesn't harm guests. I haven't checked your patch thoroughly, does it handle a different scenario? Regards, Zoltan Kiss On 24/02/14 13:15, Qin Chuanyu wrote: with vhost tx zero_copy, guest nic might get hang when host reserving skb in socket queue delivered by guest, the case has been solved in tun, it also been needed by openvswitch. This could easily happened when a LAST_ACK state tcp occuring between guest and host. Signed-off-by: Chuanyu Qin --- net/openvswitch/vport-internal_dev.c |3 +++ net/openvswitch/vport.c |1 + 2 files changed, 4 insertions(+), 0 deletions(-) diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c index 729c687..adb25e2 100644 --- a/net/openvswitch/vport-internal_dev.c +++ b/net/openvswitch/vport-internal_dev.c @@ -212,6 +212,9 @@ static int internal_dev_recv(struct vport *vport, struct sk_buff *skb) struct net_device *netdev = netdev_vport_priv(vport)->dev; int len; +if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) +return -NET_RX_DROP; + len = skb->len; skb_dst_drop(skb); diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c index 208dd9a..04172d6 100644 --- a/net/openvswitch/vport.c +++ b/net/openvswitch/vport.c @@ -383,6 +383,7 @@ int ovs_vport_send(struct vport *vport, struct sk_buff *skb) u64_stats_update_end(&stats->syncp); } else if (sent < 0) { ovs_vport_record_error(vport, VPORT_E_TX_ERROR); +skb_tx_error(skb); kfree_skb(skb); } else ovs_vport_record_error(vport, VPORT_E_TX_DROPPED); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] [RFC v2 1/4] bridge: enable interfaces to opt out from becoming the root bridge
On 20/02/14 20:01, Luis R. Rodriguez wrote: On Thu, Feb 20, 2014 at 5:19 AM, Zoltan Kiss wrote: How about this: netback sets the root_block flag and a random MAC by default. So the default behaviour won't change, DAD will be happy, and userspace don't have to do anything unless it's using netback for STP root bridge (I don't think there are too many toolstacks doing that), in which case it has to remove the root_block flag instead of setting a random MAC. :D that's exactly what I ended up proposing too. I mentioned how xen-netback could do this as well, we'd keep or rename the flag I added, and then the bridge could would look at it and enable the root block if the flag is set. Stephen however does not like having the bridge code look at magic flags for this behavior and would prefer for us to get the tools to ask for the root block. Let's follow more up on that thread We don't need that new flag, just forget about it. Set that root_block flag from netback device init, around the time you generate the random MAC, or at the earliest possible time. Nothing else has to be done from kernel side. If someone wants netback to be a root port, then remove root_block from their tools, instead of changing the the MAC address, as it happens now. Another problem with the random addresses, pointed out by Ian earlier, that when adding/removing interfaces, the bridge does recalculate it's MAC address, and choose the lowest one. In the general usecase I think that's normal, but in case of Xen networking, we would like to keep the bridge using the physical interface's MAC, because the local port of the bridge is used for Dom0 network traffic, therefore changing the bridge MAC when a netback device has lower MAC breaks that traffic. I think the best is to address this from userspace: if it set the MAC of the bridge explicitly, dev_set_mac_address() does dev->addr_assign_type = NET_ADDR_SET;, so br_stp_recalculate_bridge_id() will exit before changing anything. And when I say userspace, I mean Xen specific tools which does networking configuration, e.g. xapi in XenServer case. Not brctl, it doesn't have to know whether this is a xenbrX device or a bridge used for another purposes. Zoli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] [RFC v2 1/4] bridge: enable interfaces to opt out from becoming the root bridge
On 20/02/14 20:24, Luis R. Rodriguez wrote: On Thu, Feb 20, 2014 at 9:19 AM, Stephen Hemminger wrote: On Wed, 19 Feb 2014 09:59:33 -0800 "Luis R. Rodriguez" wrote: On Wed, Feb 19, 2014 at 9:08 AM, Stephen Hemminger wrote: Please only use the netlink/sysfs flags fields that already exist for new features. Sure, but what if we know a driver in most cases wants the root block and we'd want to make it the default, thereby only requiring userspace for toggling it off. Something in userspace has to put the device into the bridge. Fix the port setup in that tool via the netlink or sysfs flags in the bridge. It should not have to be handled in the bridge looking at magic flags in the device. Agreed that's the best strategy and I'll work on sending patches to brctl to enable the root_block preference. This approach however also I don't think brctl should deal with any Xen specific stuff. I assume there is a misunderstanding in this thread: when I (and possibly other Xen folks) talk about "userspace" or "toolstack" here, I mean Xen specific tools which use e.g. brctl to set up bridges. Not brctl itself. requires a userspace upgrade. I'm trying to see if we can get an old-nasty-cryptic-hack practice removed from the kernel and we'd try to prevent future drivers from using it -- without requiring userspace upgrade. In this case the bad practice is to using a high static MAC address for mimicking a root block default preference. In order to remove that *without* requiring a userspace upgrade the dev->priv_flag approach is the only thing I can think of. If this would go in we'd replace the high static MAC address with a random MAC address to prevent IPv6 SLAAC / DAD conflicts. I'd document this flag and indicate with preference for userspace to be the one tuning these knobs. Without this we'd have to keep the high static MAC address on upstream drivers and let userspace do the random'ization if it confirms the userspace knob to turn the root block flag is available. Is the priv_flag approach worth the compromise to remove the root block hack practice? Luis -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2 2/4] net: enables interface option to skip IP
On 20/02/14 20:39, Luis R. Rodriguez wrote: On Wed, Feb 19, 2014 at 11:13 AM, Zoltan Kiss wrote: On 19/02/14 17:20, Luis R. Rodriguez wrote: On 19/02/14 17:20, Luis R. Rodriguez also wrote: Zoltan has noted though some use cases of IPv4 or IPv6 addresses on backends though <...> As discussed in the other threads though there *is* some use cases of assigning IPv4 or IPv6 addresses to the backend interfaces though: routing them (although its unclear to me if iptables can be used instead, Zoltan?). Not with OVS, it steals the packet before netfilter hooks. Got it, thanks! Can't the route be added using a front-end IP address instead on the host though ? I just tried that on a Xen system and it seems to work. Perhaps I'm not understand the exact topology on the routing case. So in my case I have the backend without any IPv4 or IPv6 interfaces, the guest has IPv4, IPv6 addresses and even a TUN for VPN and I can create routes on the host to the front end by not using the backend device name but instead using the front-end target IP. Check this how current Xen scripts does routed networking: http://wiki.xen.org/wiki/Xen_Networking#Associating_routes_with_virtual_devices Note, there are no bridges involved here! As the above page says, the backend has to have IP address, maybe it's not true anymore. I'm not too familiar with this setup too, I've used it only once. Zoli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] [RFC v2 1/4] bridge: enable interfaces to opt out from becoming the root bridge
On 19/02/14 16:45, Luis R. Rodriguez wrote: On Mon, Feb 17, 2014 at 9:52 AM, Zoltan Kiss wrote: On 15/02/14 02:59, Luis R. Rodriguez wrote: From: "Luis R. Rodriguez" It doesn't make sense for some interfaces to become a root bridge at any point in time. One example is virtual backend interfaces which rely on other entities on the bridge for actual physical connectivity. They only provide virtual access. It is possible that a guest bridge together to VIF, either from the same Dom0 bridge or from different ones. In that case using STP on VIFs sound sensible to me. You seem to describe a case whereby it can make sense for xen-netback interfaces to end up becoming the root port of a bridge. Can you elaborate a little more on that as it was unclear the use case. Well, I might be wrong on that, but the scenario I was thinking: a guest (let's say domain 1) can have multiple interfaces on different Dom0 (or driver domain) bridges, let's say vif1.0 is plugged into xenbr0 and vif1.1 is in xenbr1. If the guest wants to make a bridge of this two, then using STP makes sense. I wanted to bring up CloudStack's virtual router as an example, but then I realized it's probably doesn't do such thing. However I don't think we should hardcode that a netback interface can't be RP ever. Additionally if such cases exist then under the current upstream implementation one would simply need to change the MAC address in order to enable a vif to become the root port. Stephen noted there is a way to avoid nominating an interface for a root port through the root block flag. We should use that instead of the MAC address hacks. Let's keep in mind that part of the motivation for this series is to avoid a duplicate IPv6 address left in place by use cases whereby the MAC address of the backend vif was left static. The use case your are explaining likely describes the more prevalent use case where address conflicts can occur, perhaps when administrators for got to change the backend MAC address. If we embrace a random MAC address we'd avoid that issue, and but we'd need to update userspace to use the root block on topologies where desired. If I understand you correctly, this is the same I suggested in my another email sent 1.5 hour ago. Zoli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] [RFC v2 1/4] bridge: enable interfaces to opt out from becoming the root bridge
On 19/02/14 17:02, Luis R. Rodriguez wrote: On Wed, Feb 19, 2014 at 6:35 AM, Zoltan Kiss wrote: On 19/02/14 09:52, Ian Campbell wrote: Can't we arrange things in the Xen hotplug scripts such that if the root_block stuff isn't available/doesn't work we fallback to the existing fe:ff:ff:ff:ff usage? That would avoid concerns about forward/backwards compat I think. It wouldn't solve the issue you are targeting on old systems, but it also doesn't regress them any further. I agree, I think this problem could be better handled from userspace: if it can set root_block then change the default MAC to a random one, if it can't, then stay with the default one. Or if someone doesn't care about STP but DAD is still important, userspace can have a force_random_mac option somewhere to change to a random MAC regardless of root_block presence. Folks, what if I repurpose my patch to use the IFF_BRIDGE_NON_ROOT (or relabel to IFF_ROOT_BLOCK_DEF) flag for a default driver preference upon initialization so that root block will be used once the device gets added to a bridge. The purpose would be to avoid drivers from using the high MAC address hack, streamline to use a random MAC address thereby avoiding the possible duplicate address situation for IPv6. In the STP use case for these interfaces we'd just require userspace to unset the root block. I'd consider the STP use case the most odd of all. The caveat to this approach is 3.8 would be needed (or its the root block patches cherry picked) for base kernels older than 3.8. How about this: netback sets the root_block flag and a random MAC by default. So the default behaviour won't change, DAD will be happy, and userspace don't have to do anything unless it's using netback for STP root bridge (I don't think there are too many toolstacks doing that), in which case it has to remove the root_block flag instead of setting a random MAC. Zoli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2 2/4] net: enables interface option to skip IP
On 19/02/14 17:20, Luis R. Rodriguez wrote: On Wed, Feb 19, 2014 at 8:45 AM, Dan Williams wrote: On Tue, 2014-02-18 at 13:19 -0800, Luis R. Rodriguez wrote: On Mon, Feb 17, 2014 at 12:23 PM, Dan Williams wrote: On Fri, 2014-02-14 at 18:59 -0800, Luis R. Rodriguez wrote: From: "Luis R. Rodriguez" Some interfaces do not need to have any IPv4 or IPv6 addresses, so enable an option to specify this. One example where this is observed are virtualization backend interfaces which just use the net_device constructs to help with their respective frontends. This should optimize boot time and complexity on virtualization environments for each backend interface while also avoiding triggering SLAAC and DAD, which is simply pointless for these type of interfaces. Would it not be better/cleaner to use disable_ipv6 and then add a disable_ipv4 sysctl, then use those with that interface? Sure, but note that the both disable_ipv6 and accept_dada sysctl parameters are global. ipv4 and ipv6 interfaces are created upon NETDEVICE_REGISTER, which will get triggered when a driver calls register_netdev(). The goal of this patch was to enable an early optimization for drivers that have no need ever for ipv4 or ipv6 interfaces. Each interface gets override sysctls too though, eg: /proc/sys/net/ipv6/conf/enp0s25/disable_ipv6 I hadn't seen those, thanks! which is the one I meant; you're obviously right that the global ones aren't what you want here. But the specific ones should be suitable? Under the approach Stephen mentioned by first ensuring the interface is down yes. There's one use case I can consider to still want the patch though, more on that below. If you set that on a per-interface basis, then you'll get EPERM or something whenever you try to add IPv6 addresses or do IPv6 routing. Neat, thanks. Zoltan has noted though some use cases of IPv4 or IPv6 addresses on backends though, as such this is no longer applicable as a requirement. The ipv4 sysctl however still seems like a reasonable approach to enable optimizations of the network in topologies where its known we won't need them but -- we'd need to consider a much more granular solution, not just global as it is now for disable_ipv6, and we'd also have to figure out a clean way to do this to not incur the cost of early address interface addition upon register_netdev(). Given that we have a use case for ipv4 and ipv6 addresses on xen-netback we no longer have an immediate use case for such early optimization primitives though, so I'll drop this. The IFF_SKIP_IP seems to duplicate at least part of what disable_ipv6 is already doing. disable_ipv6 is global, the goal was to make this granular and skip the cost upon early boot, but its been clarified we don't need this. Like Stephen says, you need to make sure you set them before IFF_UP, but beyond that, wouldn't the interface-specific sysctls work? Yeah that'll do it, unless there is a measurable run time benefit cost to never even add these in the first place. Consider a host with tons of guests, not sure how many is 'a lot' these days. One would have to measure the cost of reducing the amount of time it takes to boot these up. As discussed in the other threads though there *is* some use cases of assigning IPv4 or IPv6 addresses to the backend interfaces though: routing them (although its unclear to me if iptables can be used instead, Zoltan?). Not with OVS, it steals the packet before netfilter hooks. Zoli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] [RFC v2 1/4] bridge: enable interfaces to opt out from becoming the root bridge
On 19/02/14 09:52, Ian Campbell wrote: On Tue, 2014-02-18 at 13:02 -0800, Luis R. Rodriguez wrote: On Sun, Feb 16, 2014 at 10:57 AM, Stephen Hemminger wrote: On Fri, 14 Feb 2014 18:59:37 -0800 "Luis R. Rodriguez" wrote: From: "Luis R. Rodriguez" It doesn't make sense for some interfaces to become a root bridge at any point in time. One example is virtual backend interfaces which rely on other entities on the bridge for actual physical connectivity. They only provide virtual access. Device drivers that know they should never become part of the root bridge have been using a trick of setting their MAC address to a high broadcast MAC address such as FE:FF:FF:FF:FF:FF. Instead of using these hacks lets the interfaces annotate its intent and generalizes a solution for multiple drivers, while letting the drivers use a random MAC address or one prefixed with a proper OUI. This sort of hack is used by both qemu and xen for their backend interfaces. Cc: Stephen Hemminger Cc: bri...@lists.linux-foundation.org Cc: net...@vger.kernel.org Cc: linux-ker...@vger.kernel.org Signed-off-by: Luis R. Rodriguez This is already supported in a more standard way via the root block flag. Great! For documentation purposes the root_block flag is a sysfs attribute, added via 3.8 through commit 1007dd1a. The respective interface flag is IFLA_BRPORT_PROTECT and can be set via the iproute2 bridge utility or through sysfs: mcgrof@garbanzo ~/linux (git::master)$ find /sys/ -name root_block /sys/devices/pci:00/:00:04.0/:02:00.0/net/eth0/brport/root_block /sys/devices/vif-3-0/net/vif3.0/brport/root_block /sys/devices/virtual/net/vif3.0-emu/brport/root_block mcgrof@garbanzo ~/devel/iproute2 (git::master)$ cat /sys/devices/vif-3-0/net/vif3.0/brport/root_block 0 mcgrof@garbanzo ~/devel/iproute2 (git::master)$ sudo bridge link set dev vif3.0 root_block on mcgrof@garbanzo ~/devel/iproute2 (git::master)$ cat /sys/devices/vif-3-0/net/vif3.0/brport/root_block 1 So if we'd want to avoid using the MAC address hack alternative to skip a root port userspace would need to be updated to simply set this attribute after adding the device to the bridge. Based on Zoltan's feedback there seems to be use cases to not enable this always for all xen-netback interfaces though as such we can just punt this to userspace for the topologies that require this. The original motivation for this series was to avoid the IPv6 duplicate address incurred by the MAC address hack for avoiding the root bridge. Given that Zoltan also noted a use case whereby IPv4 and IPv6 addresses can be assigned to the backend interfaces we should be able to avoid the duplicate address situation for IPv6 by using a proper random MAC address *once* userspace has been updated also to use IFLA_BRPORT_PROTECT. New userspace can't and won't need to set this flag for older kernels (older than 3.8) as root_block is not implemented on those kernels and the MAC address hack would still be used there. This strategy however does put a requirement on new kernels to use new userspace as otherwise the MAC address workaround would not be in place and root_block would not take effect. Can't we arrange things in the Xen hotplug scripts such that if the root_block stuff isn't available/doesn't work we fallback to the existing fe:ff:ff:ff:ff usage? That would avoid concerns about forward/backwards compat I think. It wouldn't solve the issue you are targeting on old systems, but it also doesn't regress them any further. I agree, I think this problem could be better handled from userspace: if it can set root_block then change the default MAC to a random one, if it can't, then stay with the default one. Or if someone doesn't care about STP but DAD is still important, userspace can have a force_random_mac option somewhere to change to a random MAC regardless of root_block presence. Zoli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] [RFC v2 1/4] bridge: enable interfaces to opt out from becoming the root bridge
On 15/02/14 02:59, Luis R. Rodriguez wrote: From: "Luis R. Rodriguez" It doesn't make sense for some interfaces to become a root bridge at any point in time. One example is virtual backend interfaces which rely on other entities on the bridge for actual physical connectivity. They only provide virtual access. It is possible that a guest bridge together to VIF, either from the same Dom0 bridge or from different ones. In that case using STP on VIFs sound sensible to me. Zoli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] [RFC v2 4/4] xen-netback: skip IPv4 and IPv6 interfaces
There is a valid scenario to put IP addresses on the backend VIFs: http://wiki.xen.org/wiki/Xen_Networking#Routing Also, the backend is not necessarily Dom0, you can connect twou guests with backend/frontend pairs. Zoli On 15/02/14 02:59, Luis R. Rodriguez wrote: From: "Luis R. Rodriguez" The xen-netback driver is used only to provide a backend interface for the frontend. The link is the only thing we use, and that is used internally for letting us know when the xen-netfront is ready, when it switches to XenbusStateConnected. Note that only when the both the xen-netfront and xen-netback are both in state XenbusStateConnected will xen-netback allow userspace on the host (backend) to bring up the interface. Enabling and disabling the interface will simply enable or disable NAPI respectively, and that's used for IRQ communication set up with the xen event channels. Cc: Paul Durrant Cc: Ian Campbell Cc: Wei Liu Cc: xen-de...@lists.xenproject.org Cc: net...@vger.kernel.org Signed-off-by: Luis R. Rodriguez --- drivers/net/xen-netback/interface.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c index d380e3f..07e6fd2 100644 --- a/drivers/net/xen-netback/interface.c +++ b/drivers/net/xen-netback/interface.c @@ -351,7 +351,7 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid, eth_hw_addr_random(dev); memcpy(dev->dev_addr, xen_oui, 3); - dev->priv_flags |= IFF_BRIDGE_NON_ROOT; + dev->priv_flags |= IFF_BRIDGE_NON_ROOT | IFF_SKIP_IP; netif_napi_add(dev, &vif->napi, xenvif_poll, XENVIF_NAPI_WEIGHT); netif_carrier_off(dev); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html