Re: tunnels: Don't apply GRO to multiple layers of encapsulation.
On Thu, Aug 31, 2017 at 6:58 AM, wrote:
> [ resend due to mail problems at my end ]
>
> Hi Jesse,
>
> The backport of fac8e0f579695a3ecbc4d3cac369139d7f819971,
> "tunnels: Don't apply GRO to multiple layers of encapsulation",
> to linux-4.1.y seems to have missed a line.
>
> The 4.1 commit is 066b300e5be43cb61697539e2a3a9aac5afb422f.
>
> The potentially missing line is:
>
> -	.gro_receive = ipv6_gro_receive,
> +	.gro_receive = sit_gro_receive,
>
> I am not experiencing any bad symptoms. I simply noticed
> that the patch introduced a new function, sit_gro_receive,
> without introducing any users, and that same patch in
> linux-4.4.y does have a user.

Thanks for pointing that out. The line you mentioned should indeed be there and seems to have been missed in the backport. The backport was actually done by Sasha, not by me - would you mind sending a patch to him or working with him to fix it?
Re: [PATCH] geneve: fix max_mtu setting
On Mon, Jun 27, 2016 at 6:27 PM, 严海双 <yanhaishu...@cmss.chinamobile.com> wrote:
>
> On Jun 28, 2016, at 12:10 AM, Jesse Gross <je...@kernel.org> wrote:
>
> On Sun, Jun 26, 2016 at 6:13 PM, Haishuang Yan
> <yanhaishu...@cmss.chinamobile.com> wrote:
>
> On Jun 26, 2016, at 8:35 PM, zhuyj <zyjzyj2...@gmail.com> wrote:
>
> + if (geneve->remote.sa.sa_family == AF_INET)
> +         max_mtu -= sizeof(struct iphdr);
> + else
> +         max_mtu -= sizeof(struct ipv6hdr);
>
> Sorry, if sa_family is not AF_INET, it is AF_INET6?
>
> There are a lot of macros in include/linux/socket.h.
>
> Zhu Yanjun
>
> Only the two enumerations AF_INET and AF_INET6 are assigned in
> geneve_newlink:
>
> There's actually a third possibility: AF_UNSPEC, which is the default
> if neither remote type is specified. This is used by lightweight
> tunnels and should be able to work with either IPv4/v6. For the
> purposes of the MTU calculation this means that the IPv4 header size
> should be used to avoid disallowing potentially valid configurations.
>
> Yes, you’re right. Thanks for your advice. I will send a v2 commit like this:
>
>     if (geneve->remote.sa.sa_family == AF_INET6)
>             max_mtu -= sizeof(struct ipv6hdr);
>     else
>             max_mtu -= sizeof(struct iphdr);
>
> Is this ok?

Yes, that looks fine to me.
Re: [PATCH] geneve: fix max_mtu setting
On Sun, Jun 26, 2016 at 6:13 PM, 严海双 wrote:
>
>> On Jun 26, 2016, at 8:35 PM, zhuyj wrote:
>>
>> + if (geneve->remote.sa.sa_family == AF_INET)
>> +         max_mtu -= sizeof(struct iphdr);
>> + else
>> +         max_mtu -= sizeof(struct ipv6hdr);
>>
>> Sorry, if sa_family is not AF_NET, it is AF_INET6?
>>
>> There is a lot of macros in include/linux/socket.h.
>>
>> Zhu Yanjun
>>
>
> There are only two enumerations AF_INET and AF_INET6 have been assigned in
> geneve_newlink:

There's actually a third possibility: AF_UNSPEC, which is the default if neither remote type is specified. This is used by lightweight tunnels and should be able to work with either IPv4/v6. For the purposes of the MTU calculation this means that the IPv4 header size should be used to avoid disallowing potentially valid configurations.
Re: [ovs-dev] [PATCH] openvswitch: reduce padding in struct sw_flow_key
On Fri, Mar 18, 2016 at 6:34 AM, Arnd Bergmann wrote:
> This means it's still too large really, we just don't warn about it any more,
> and will get the warning again once another member is added. My patch is a
> band-aid at best, but more work is needed here. One problem is that
> ovs_flow_cmd_new and ovs_flow_cmd_set have two copies of struct sw_flow_key on
> the stack, one of them nested within sw_flow_mask. If we could reduce that to
> a single instance, the stack usage would be cut in half here.

I think it should be possible to reduce those two functions to only use a single instance of the structure. For new flows, we're already allocating the key structure on the heap, so it seems like we could translate the key directly into that rather than into an intermediate location on the stack. And when setting flows, I'm not sure that we really need the mask at all - we don't do anything with it other than check it against the actions, but we really should be using the original mask for that rather than the new one.

It seems like it would be preferable to go down that route rather than this patch, since as you say, this patch really only suppresses the warning and doesn't have a significant impact on the actual stack consumption. Plus, the ordering of the flow layout affects how much data needs to be compared during packet lookup, so this change will likely be bad for forwarding performance.
Re: [ovs-dev] [PATCH] ovs: do not allocate memory from offline numa node
On Fri, Oct 9, 2015 at 8:54 AM, Jarno Rajahalme wrote:
>
> On Oct 8, 2015, at 4:03 PM, Jesse Gross wrote:
>
> On Wed, Oct 7, 2015 at 10:47 AM, Jarno Rajahalme
> wrote:
>
> On Oct 6, 2015, at 6:01 PM, Jesse Gross wrote:
>
> On Mon, Oct 5, 2015 at 1:25 PM, Alexander Duyck
> wrote:
>
> On 10/05/2015 06:59 AM, Vlastimil Babka wrote:
>
> On 10/02/2015 12:18 PM, Konstantin Khlebnikov wrote:
>
> When openvswitch tries allocate memory from offline numa node 0:
> stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO, 0)
> It catches VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid))
> [ replaced with VM_WARN_ON(!node_online(nid)) recently ] in linux/gfp.h
> This patch disables numa affinity in this case.
>
> Signed-off-by: Konstantin Khlebnikov
>
> ...
>
> diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
> index f2ea83ba4763..c7f74aab34b9 100644
> --- a/net/openvswitch/flow_table.c
> +++ b/net/openvswitch/flow_table.c
> @@ -93,7 +93,8 @@ struct sw_flow *ovs_flow_alloc(void)
>
> 	/* Initialize the default stat node. */
> 	stats = kmem_cache_alloc_node(flow_stats_cache,
> -				      GFP_KERNEL | __GFP_ZERO, 0);
> +				      GFP_KERNEL | __GFP_ZERO,
> +				      node_online(0) ? 0 : NUMA_NO_NODE);
>
> Stupid question: can node 0 become offline between this check, and the
> VM_WARN_ON? :) BTW what kind of system has node 0 offline?
>
> Another question to ask would be is it possible for node 0 to be online, but
> be a memoryless node?
>
> I would say you are better off just making this call kmem_cache_alloc. I
> don't see anything that indicates the memory has to come from node 0, so
> adding the extra overhead doesn't provide any value.
>
> I agree that this at least makes me wonder, though I actually have
> concerns in the opposite direction - I see assumptions about this
> being on node 0 in net/openvswitch/flow.c.
>
> Jarno, since you originally wrote this code, can you take a look to see
> if everything still makes sense?
> We keep the pre-allocated stats node at array index 0, which is initially
> used by all CPUs, but if CPUs from multiple numa nodes start updating the
> stats, we allocate additional stats nodes (up to one per numa node), and the
> CPUs on node 0 keep using the preallocated entry. If stats cannot be
> allocated from the CPU's local node, then those CPUs keep using the entry at
> index 0. Currently the code in net/openvswitch/flow.c will try to allocate
> the local memory repeatedly, which may not be optimal when there is no
> memory at the local node.
>
> Allocating the memory for index 0 from other than node 0, as discussed
> here, just means that the CPUs on node 0 will keep on using non-local memory
> for stats. In a scenario where there are CPUs on two nodes (0, 1), but only
> node 1 has memory, a shared flow entry will still end up having separate
> memory allocated for both nodes, but both of the allocations would be at
> node 1. However, there is still a high likelihood that the memory
> allocations would not share a cache line, which should prevent the nodes
> from invalidating each other’s caches. Based on this I do not see a problem
> relaxing the memory allocation for the default stats node. If node 0 has
> memory, however, it would be better to allocate the memory from node 0.
>
> Thanks for going through all of that.
>
> It seems like the question that is being raised is whether it actually
> makes sense to try to get the initial memory on node 0, especially
> since it seems to introduce some corner cases. Is there any reason why
> the flow is more likely to hit node 0 than a randomly chosen one?
> (Assuming that this is a multinode system; otherwise it's kind of a
> moot point.) We could have a separate pointer to the default allocated
> memory, so it wouldn't conflict with memory that was intentionally
> allocated for node 0.
> It would still be preferable to know from which node the default stats node
> was allocated, and store it in the appropriate pointer in the array. We
> could then add a new “default stats node index” that would be used to locate
> the node in the array of pointers we already have. That way we would avoid
> extra allocation and processing of the default stats node.

I agree, that sounds reasonable to me. Will you make that change? Besides eliminating corner cases, it might help performance in some cases too by avoiding stressing memory bandwidth on node 0.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [ovs-dev] [PATCH] ovs: do not allocate memory from offline numa node
On Wed, Oct 7, 2015 at 10:47 AM, Jarno Rajahalme wrote:
>
>> On Oct 6, 2015, at 6:01 PM, Jesse Gross wrote:
>>
>> On Mon, Oct 5, 2015 at 1:25 PM, Alexander Duyck
>> wrote:
>>> On 10/05/2015 06:59 AM, Vlastimil Babka wrote:
>>>>
>>>> On 10/02/2015 12:18 PM, Konstantin Khlebnikov wrote:
>>>>>
>>>>> When openvswitch tries allocate memory from offline numa node 0:
>>>>> stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO, 0)
>>>>> It catches VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid))
>>>>> [ replaced with VM_WARN_ON(!node_online(nid)) recently ] in linux/gfp.h
>>>>> This patch disables numa affinity in this case.
>>>>>
>>>>> Signed-off-by: Konstantin Khlebnikov
>>>>
>>>> ...
>>>>
>>>>> diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
>>>>> index f2ea83ba4763..c7f74aab34b9 100644
>>>>> --- a/net/openvswitch/flow_table.c
>>>>> +++ b/net/openvswitch/flow_table.c
>>>>> @@ -93,7 +93,8 @@ struct sw_flow *ovs_flow_alloc(void)
>>>>>
>>>>>  	/* Initialize the default stat node. */
>>>>>  	stats = kmem_cache_alloc_node(flow_stats_cache,
>>>>> -				      GFP_KERNEL | __GFP_ZERO, 0);
>>>>> +				      GFP_KERNEL | __GFP_ZERO,
>>>>> +				      node_online(0) ? 0 : NUMA_NO_NODE);
>>>>
>>>> Stupid question: can node 0 become offline between this check, and the
>>>> VM_WARN_ON? :) BTW what kind of system has node 0 offline?
>>>
>>> Another question to ask would be is it possible for node 0 to be online, but
>>> be a memoryless node?
>>>
>>> I would say you are better off just making this call kmem_cache_alloc. I
>>> don't see anything that indicates the memory has to come from node 0, so
>>> adding the extra overhead doesn't provide any value.
>>
>> I agree that this at least makes me wonder, though I actually have
>> concerns in the opposite direction - I see assumptions about this
>> being on node 0 in net/openvswitch/flow.c.
>> Jarno, since you originally wrote this code, can you take a look to see
>> if everything still makes sense?
>
> We keep the pre-allocated stats node at array index 0, which is initially
> used by all CPUs, but if CPUs from multiple numa nodes start updating the
> stats, we allocate additional stats nodes (up to one per numa node), and the
> CPUs on node 0 keep using the preallocated entry. If stats cannot be
> allocated from the CPU's local node, then those CPUs keep using the entry at
> index 0. Currently the code in net/openvswitch/flow.c will try to allocate
> the local memory repeatedly, which may not be optimal when there is no
> memory at the local node.
>
> Allocating the memory for index 0 from other than node 0, as discussed
> here, just means that the CPUs on node 0 will keep on using non-local memory
> for stats. In a scenario where there are CPUs on two nodes (0, 1), but only
> node 1 has memory, a shared flow entry will still end up having separate
> memory allocated for both nodes, but both of the allocations would be at
> node 1. However, there is still a high likelihood that the memory
> allocations would not share a cache line, which should prevent the nodes
> from invalidating each other’s caches. Based on this I do not see a problem
> relaxing the memory allocation for the default stats node. If node 0 has
> memory, however, it would be better to allocate the memory from node 0.

Thanks for going through all of that.

It seems like the question that is being raised is whether it actually makes sense to try to get the initial memory on node 0, especially since it seems to introduce some corner cases. Is there any reason why the flow is more likely to hit node 0 than a randomly chosen one? (Assuming that this is a multinode system; otherwise it's kind of a moot point.) We could have a separate pointer to the default allocated memory, so it wouldn't conflict with memory that was intentionally allocated for node 0.
Re: [ovs-dev] [PATCH] ovs: do not allocate memory from offline numa node
On Mon, Oct 5, 2015 at 1:25 PM, Alexander Duyck wrote:
> On 10/05/2015 06:59 AM, Vlastimil Babka wrote:
>>
>> On 10/02/2015 12:18 PM, Konstantin Khlebnikov wrote:
>>>
>>> When openvswitch tries allocate memory from offline numa node 0:
>>> stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO, 0)
>>> It catches VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid))
>>> [ replaced with VM_WARN_ON(!node_online(nid)) recently ] in linux/gfp.h
>>> This patch disables numa affinity in this case.
>>>
>>> Signed-off-by: Konstantin Khlebnikov
>>
>> ...
>>
>>> diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
>>> index f2ea83ba4763..c7f74aab34b9 100644
>>> --- a/net/openvswitch/flow_table.c
>>> +++ b/net/openvswitch/flow_table.c
>>> @@ -93,7 +93,8 @@ struct sw_flow *ovs_flow_alloc(void)
>>>
>>>  	/* Initialize the default stat node. */
>>>  	stats = kmem_cache_alloc_node(flow_stats_cache,
>>> -				      GFP_KERNEL | __GFP_ZERO, 0);
>>> +				      GFP_KERNEL | __GFP_ZERO,
>>> +				      node_online(0) ? 0 : NUMA_NO_NODE);
>>
>> Stupid question: can node 0 become offline between this check, and the
>> VM_WARN_ON? :) BTW what kind of system has node 0 offline?
>
> Another question to ask would be is it possible for node 0 to be online, but
> be a memoryless node?
>
> I would say you are better off just making this call kmem_cache_alloc. I
> don't see anything that indicates the memory has to come from node 0, so
> adding the extra overhead doesn't provide any value.

I agree that this at least makes me wonder, though I actually have concerns in the opposite direction - I see assumptions about this being on node 0 in net/openvswitch/flow.c.

Jarno, since you originally wrote this code, can you take a look to see if everything still makes sense?
Re: [PATCH net-next 1/9] openvswitch: Scrub packet in ovs_vport_receive()
On Tue, Aug 4, 2015 at 9:40 PM, Joe Stringer wrote:
> On 1 August 2015 at 12:17, Thomas Graf wrote:
>> On 07/31/15 at 10:51am, Joe Stringer wrote:
>>> On 31 July 2015 at 07:34, Hannes Frederic Sowa wrote:
>>> > In general, this shouldn't be necessary as the packet should already be
>>> > scrubbed before they arrive here.
>>> >
>>> > Could you maybe add a WARN_ON and check how those skbs with conntrack
>>> > data traverse the stack? I also didn't understand why make it dependent
>>> > on the socket.
>>>
>>> OK, sure. One case I could think of is with an OVS internal port in
>>> another namespace, directly attached to the bridge. I'll have a play
>>> around with WARN_ON and see if I can come up with something more
>>> trimmed down.
>>
>> The OVS internal port will definitely pass through an unscrubbed
>> packet across namespaces. I think the proper thing to do would be
>> to scrub but conditionally keep metadata.
>
> It's only "unscrubbed" when receiving from local stack at the moment.
> Some pieces are cleared when handing towards the local stack, and
> there's no configuration for that behaviour. Presumably internal port
> transmit and receive should mirror each other?
>
> I don't have a specific use case either way. The remaining code for
> this series handles this case correctly, it's just a matter of what
> behaviour we're looking for. We could implement the flag as you say, I
> presume that userspace would need to specify this during vport
> creation and the default should work similar to the existing behaviour
> (ie, keep metadata). One thing that's not entirely clear to me is
> exactly which metadata should be represented by this flag and whether
> the single flag is expressive enough.

I would prefer not to have a flag as it seems unnecessarily complicated (doubly so if we try to have multiple flags to express different combinations).
The use case for moving internal ports to different namespaces is pretty narrow, so it seems like we can just pick a set of metadata to keep and go with that. Mark seems the primary one to me. I also think that it would be better to use skb->dev to determine the original namespace rather than the socket. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [linux-nics] [PATCHv4 net] i40e: Implement ndo_gso_check()
On Fri, Dec 26, 2014 at 3:58 PM, Jesse Gross wrote:
> On Fri, Dec 5, 2014 at 2:12 PM, Jeff Kirsher wrote:
>> On Fri, 2014-12-05 at 10:41 -0800, Joe Stringer wrote:
>>> ndo_gso_check() was recently introduced to allow NICs to report the
>>> offloading support that they have on a per-skb basis. Add an
>>> implementation for this driver which checks for IPIP, GRE, UDP
>>> tunnels.
>>>
>>> Signed-off-by: Joe Stringer
>>> ---
>>> v4: Simplify the check to just do tunnel header length.
>>> Fix #define style issue.
>>> v3: Drop IPIP and GRE (no driver support even though hw supports it).
>>> Check for UDP outer protocol for UDP tunnels.
>>> v2: Expand to include IP in IP and IPv4/IPv6 inside GRE/UDP tunnels.
>>> Add MAX_INNER_LENGTH (as 80).
>>> ---
>>> drivers/net/ethernet/intel/i40e/i40e_main.c | 12
>>> 1 file changed, 12 insertions(+)
>>
>> Thanks Joe, I will update the patch in my queue with your latest
>> version.
>
> Jeff, as a heads-up, this patch and the corresponding one for fm10k
> will no longer apply now that the ndo has changed. This was changed by
> 5f35227e ("net: Generalize ndo_gso_check to ndo_features_check"). The
> update needed is trivial and is just due to the change in the function
> signature but I'm not sure where you are in the testing process with
> this.

Jeff - just wanted to check if you had seen this.
Re: [linux-nics] [PATCHv4 net] i40e: Implement ndo_gso_check()
On Fri, Dec 5, 2014 at 2:12 PM, Jeff Kirsher wrote:
> On Fri, 2014-12-05 at 10:41 -0800, Joe Stringer wrote:
>> ndo_gso_check() was recently introduced to allow NICs to report the
>> offloading support that they have on a per-skb basis. Add an
>> implementation for this driver which checks for IPIP, GRE, UDP
>> tunnels.
>>
>> Signed-off-by: Joe Stringer
>> ---
>> v4: Simplify the check to just do tunnel header length.
>> Fix #define style issue.
>> v3: Drop IPIP and GRE (no driver support even though hw supports it).
>> Check for UDP outer protocol for UDP tunnels.
>> v2: Expand to include IP in IP and IPv4/IPv6 inside GRE/UDP tunnels.
>> Add MAX_INNER_LENGTH (as 80).
>> ---
>> drivers/net/ethernet/intel/i40e/i40e_main.c | 12
>> 1 file changed, 12 insertions(+)
>
> Thanks Joe, I will update the patch in my queue with your latest
> version.

Jeff, as a heads-up, this patch and the corresponding one for fm10k
will no longer apply now that the ndo has changed. This was changed by
5f35227e ("net: Generalize ndo_gso_check to ndo_features_check"). The
update needed is trivial and is just due to the change in the function
signature but I'm not sure where you are in the testing process with
this.
Re: [PATCHv2 net] i40e: Implement ndo_gso_check()
On Tue, Dec 2, 2014 at 10:26 AM, Jesse Gross wrote:
> On Mon, Dec 1, 2014 at 4:09 PM, Tom Herbert wrote:
>> On Mon, Dec 1, 2014 at 3:53 PM, Jesse Gross wrote:
>>> On Mon, Dec 1, 2014 at 3:47 PM, Tom Herbert wrote:
>>>> On Mon, Dec 1, 2014 at 3:35 PM, Joe Stringer wrote:
>>>>> On 21 November 2014 at 09:59, Joe Stringer wrote:
>>>>>> On 20 November 2014 16:19, Jesse Gross wrote:
>>>>>>> I don't know if we need to have the check at all for IPIP though -
>>>>>>> after all the driver doesn't expose support for it at all (actually it
>>>>>>> doesn't expose GRE either). This raises kind of an interesting
>>>>>>> question about the checks though - it's pretty easy to add support to
>>>>>>> the driver for a new GSO type (and I imagine that people will be
>>>>>>> adding GRE soon) and forget to update the check.
>>>>>>
>>>>>> If the check is more conservative, then testing would show that it's
>>>>>> not working and lead people to figure out why (and update the check).
>>>>>
>>>>> More concretely, one suggestion would be something like the following at
>>>>> the start of each gso_check():
>>>>>
>>>>> + const int supported = SKB_GSO_TCPV4 | SKB_GSO_TCPV6 | SKB_GSO_FCOE |
>>>>> + SKB_GSO_UDP | SKB_GSO_UDP_TUNNEL;
>>>>> +
>>>>> + if (skb_shinfo(skb)->gso_type & ~supported)
>>>>> + return false;
>>>>
>>>> This should already be handled by net_gso_ok.
>>>
>>> My original point wasn't so much that this isn't handled at the moment
>>> but that it's easy to add a supported GSO type but then forget to
>>> update this check - i.e. if a driver already supports UDP_TUNNEL and
>>> adds support for GRE with the same constraints. It seems not entirely
>>> ideal that this function is acting as a blacklist rather than a
>>> whitelist.
>>
>> Agreed, it would be nice to have all the checking logic in one place.
>> If all the drivers end up implementing ndo_gso_check then we could
>> potentially get rid of the GSO types as features. This probably
>> wouldn't be a bad thing since we already know that the features
>> mechanism doesn't scale (for instance there's no way to indicate that
>> certain combinations of GSO types are supported by a device).
>
> This crossed my mind and I agree that it's pretty clear that the
> features mechanism isn't scaling very well. Presumably, the logical
> extension of this is that each driver would have a function that looks
> at a packet and returns a set of offload operations that it can
> support rather than exposing a set of protocols. However, it seems
> like it would probably result in a bunch of duplicate code in each
> driver.

I think a possible middleground here is to convert ndo_gso_check() to
ndo_features_check(). This would behave similarly to netif_skb_features()
and give drivers an opportunity to knock out features for a given packet.
This would allow us to avoid duplicate code in the immediate case of
tunnels where we need to handle both GSO and checksums and potentially
enable wider usage in the future if it makes sense.
Re: [PATCHv2 net] i40e: Implement ndo_gso_check()
On Mon, Dec 1, 2014 at 4:09 PM, Tom Herbert wrote:
> On Mon, Dec 1, 2014 at 3:53 PM, Jesse Gross wrote:
>> On Mon, Dec 1, 2014 at 3:47 PM, Tom Herbert wrote:
>>> On Mon, Dec 1, 2014 at 3:35 PM, Joe Stringer wrote:
>>>> On 21 November 2014 at 09:59, Joe Stringer wrote:
>>>>> On 20 November 2014 16:19, Jesse Gross wrote:
>>>>>> I don't know if we need to have the check at all for IPIP though -
>>>>>> after all the driver doesn't expose support for it at all (actually it
>>>>>> doesn't expose GRE either). This raises kind of an interesting
>>>>>> question about the checks though - it's pretty easy to add support to
>>>>>> the driver for a new GSO type (and I imagine that people will be
>>>>>> adding GRE soon) and forget to update the check.
>>>>>
>>>>> If the check is more conservative, then testing would show that it's
>>>>> not working and lead people to figure out why (and update the check).
>>>>
>>>> More concretely, one suggestion would be something like the following at
>>>> the start of each gso_check():
>>>>
>>>> + const int supported = SKB_GSO_TCPV4 | SKB_GSO_TCPV6 | SKB_GSO_FCOE |
>>>> + SKB_GSO_UDP | SKB_GSO_UDP_TUNNEL;
>>>> +
>>>> + if (skb_shinfo(skb)->gso_type & ~supported)
>>>> + return false;
>>>
>>> This should already be handled by net_gso_ok.
>>
>> My original point wasn't so much that this isn't handled at the moment
>> but that it's easy to add a supported GSO type but then forget to
>> update this check - i.e. if a driver already supports UDP_TUNNEL and
>> adds support for GRE with the same constraints. It seems not entirely
>> ideal that this function is acting as a blacklist rather than a
>> whitelist.
>
> Agreed, it would be nice to have all the checking logic in one place.
> If all the drivers end up implementing ndo_gso_check then we could
> potentially get rid of the GSO types as features. This probably
> wouldn't be a bad thing since we already know that the features
> mechanism doesn't scale (for instance there's no way to indicate that
> certain combinations of GSO types are supported by a device).

This crossed my mind and I agree that it's pretty clear that the
features mechanism isn't scaling very well. Presumably, the logical
extension of this is that each driver would have a function that looks
at a packet and returns a set of offload operations that it can
support rather than exposing a set of protocols. However, it seems
like it would probably result in a bunch of duplicate code in each
driver.
Re: [PATCHv2 net] i40e: Implement ndo_gso_check()
On Mon, Dec 1, 2014 at 3:47 PM, Tom Herbert wrote:
> On Mon, Dec 1, 2014 at 3:35 PM, Joe Stringer wrote:
>> On 21 November 2014 at 09:59, Joe Stringer wrote:
>>> On 20 November 2014 16:19, Jesse Gross wrote:
>>>> I don't know if we need to have the check at all for IPIP though -
>>>> after all the driver doesn't expose support for it at all (actually it
>>>> doesn't expose GRE either). This raises kind of an interesting
>>>> question about the checks though - it's pretty easy to add support to
>>>> the driver for a new GSO type (and I imagine that people will be
>>>> adding GRE soon) and forget to update the check.
>>>
>>> If the check is more conservative, then testing would show that it's
>>> not working and lead people to figure out why (and update the check).
>>
>> More concretely, one suggestion would be something like the following at
>> the start of each gso_check():
>>
>> + const int supported = SKB_GSO_TCPV4 | SKB_GSO_TCPV6 | SKB_GSO_FCOE |
>> + SKB_GSO_UDP | SKB_GSO_UDP_TUNNEL;
>> +
>> + if (skb_shinfo(skb)->gso_type & ~supported)
>> + return false;
>
> This should already be handled by net_gso_ok.

My original point wasn't so much that this isn't handled at the moment
but that it's easy to add a supported GSO type but then forget to
update this check - i.e. if a driver already supports UDP_TUNNEL and
adds support for GRE with the same constraints. It seems not entirely
ideal that this function is acting as a blacklist rather than a
whitelist.
Re: [PATCHv2 net] i40e: Implement ndo_gso_check()
On Thu, Nov 20, 2014 at 3:11 PM, Joe Stringer wrote:
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c
> b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index c3a7f4a..2b01c8d 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -7447,6 +7447,36 @@ static int i40e_ndo_fdb_dump(struct sk_buff *skb,
>
> #endif /* USE_DEFAULT_FDB_DEL_DUMP */
> #endif /* HAVE_FDB_OPS */
> +static bool i40e_gso_check(struct sk_buff *skb, struct net_device *dev)
> +{
> + if ((skb_shinfo(skb)->gso_type & SKB_GSO_IPIP) &&
> + (skb->inner_protocol_type != ENCAP_TYPE_IPPROTO ||
> +skb->inner_protocol != htons(IPPROTO_IPIP))) {

I think this check on inner_protocol isn't really semantically correct -
that field is a union with inner_ipproto, so it would be more correct to
check against that and then you wouldn't need to use htons either.

I don't know if we need to have the check at all for IPIP though - after
all the driver doesn't expose support for it at all (actually it doesn't
expose GRE either). This raises kind of an interesting question about
the checks though - it's pretty easy to add support to the driver for a
new GSO type (and I imagine that people will be adding GRE soon) and
forget to update the check.

> + if (skb_shinfo(skb)->gso_type & (SKB_GSO_GRE | SKB_GSO_UDP_TUNNEL)) {
> + unsigned char *ihdr;
> +
> + if (skb->inner_protocol_type != ENCAP_TYPE_ETHER)
> + return false;

I guess if you want to be nice, it might be good to check the equivalent
IP protocol types as well.
Re: [PATCH net 2/5] i40e: Implement ndo_gso_check()
On Thu, Nov 20, 2014 at 11:16 AM, Joe Stringer wrote:
> On Tuesday, November 04, 2014 15:45:22 Jesse Gross wrote:
>> On Tue, Nov 4, 2014 at 1:56 PM, Joe Stringer wrote:
>> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c
>> > b/drivers/net/ethernet/intel/i40e/i40e_main.c index c3a7f4a..21829b5
>> > 100644
>> > --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>> > +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> > +static bool i40e_gso_check(struct sk_buff *skb, struct net_device *dev)
>> > +{
>> > + if ((skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL) &&
>> > + (skb->inner_protocol_type != ENCAP_TYPE_ETHER ||
>> > +skb->inner_protocol != htons(ETH_P_TEB) ||
>> > +skb_inner_mac_header(skb) - skb_transport_header(skb) > 64))
>> > + return false;
>>
>> I think it may be possible to even support a few more things here.
>> According to the datasheet here:
>> http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xl710-10-40-controller-datasheet.pdf
>>
>> This can actually support 64 bytes beyond the tunnel header, which
>> would make for a total of 80 bytes. It looks like it can also support
>> IPv4 or IPv6 beyond just Ethernet as the encapsulated protocol.
>>
>> Intel guys, can you confirm that this is correct?
>
> I'm just respinning this for v4/6 beyond GRE/UDP tunnel and IPIP, and I found
> the description of max protocol parsing size of 480B (with individual header
> limit of 255B). I couldn't find where you get this 64/80 number or which
> headers it maps to. Could you (or one of the intel guys) expand on this?

The number that I gave was from the section on Geneve support (on page
708), which says that it can support up to 64 bytes of options (this was
also my understanding from previous conversations with Intel guys). I
searched for the 480 byte limit and it seems like it is for receive
instead of transmit, which could conceivably be different.
Re: [PATCH net 0/5] Implement ndo_gso_check() for vxlan nics
On Thu, Nov 6, 2014 at 8:06 AM, Tom Herbert wrote: > On Wed, Nov 5, 2014 at 10:16 PM, Sathya Perla wrote: >>> -Original Message- >>> From: Tom Herbert [mailto:therb...@google.com] >>> >>> On Wed, Nov 5, 2014 at 6:15 PM, David Miller >>> wrote: >>> > From: Joe Stringer >>> > Date: Wed, 5 Nov 2014 17:06:46 -0800 >>> > >>> >> My impression was that the changes are more likely to be >>> >> hardware-specific (like the i40e changes) rather than software-specific, >>> >> like changes that might be integrated into the helper. >>> > >>> > I think there is more commonality amongst hardware capabilities, >>> > and this is why I want the helper to play itself out. >>> > >>> >> That said, I can rework for one helper. The way I see it would be the >>> >> same code as these patches, as "vxlan_gso_check(struct sk_buff *)" in >>> >> drivers/net/vxlan.c which would be called from each driver. Is that what >>> >> you had in mind? >>> > >>> > Yes. >>> >>> Note that this code is not VXLAN specific, it will also accept NVGRE >>> and GRE/UDP with keyid and TEB. I imagine all these cases should be >>> indistinguishable to the hardware so they probably just work (which >>> would be cool!). It might be better to name and locate the helper >>> function to reflect that. >> >> Tom, I'm confused as to how the value of (skb_inner_mac_header(skb) - >> skb_transport_header(skb)) >> would be the same for VxLAN and NVGRE encapsulated packets. Wouldn't this >> value be 16 for VxLAN >> and 8 for NVGRE? >> > The inner headers are reset in iptunnel_handle_offloads. This is > called in the xmit encapsulation functions (GRE, fou, VXLAN, etc.) > before adding in encapsulation headers (skb_push), so the > mac_inner_header will point to the encapsulation payload, i.e. the > encapsulated packet. This should not change after being set, although > inner network and inner transport can. 
> The headers are only set on the first encapsulation, so with nested tunnels the inner headers point to the innermost encapsulated packet. > Since VXLAN and NVGRE have the same size of encapsulation (8 UDP + 8 header), skb_inner_mac_header(skb) - skb_transport_header(skb) should always be 16. Tom, NVGRE is not encapsulated in UDP and it is not 16 bytes. http://tools.ietf.org/html/draft-sridharan-virtualization-nvgre-06
Re: [PATCH net 2/5] i40e: Implement ndo_gso_check()
On Tue, Nov 4, 2014 at 1:56 PM, Joe Stringer wrote: > diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c > b/drivers/net/ethernet/intel/i40e/i40e_main.c > index c3a7f4a..21829b5 100644 > --- a/drivers/net/ethernet/intel/i40e/i40e_main.c > +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c > +static bool i40e_gso_check(struct sk_buff *skb, struct net_device *dev) > +{ > + if ((skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL) && > + (skb->inner_protocol_type != ENCAP_TYPE_ETHER || > +skb->inner_protocol != htons(ETH_P_TEB) || > +skb_inner_mac_header(skb) - skb_transport_header(skb) > 64)) > + return false; I think it may be possible to even support a few more things here. According to the datasheet here: http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xl710-10-40-controller-datasheet.pdf This can actually support 64 bytes beyond the tunnel header, which would make for a total of 80 bytes. It looks like it can also support IPv4 or IPv6 beyond just Ethernet as the encapsulated protocol. Intel guys, can you confirm that this is correct?
Re: [PATCH v2] openvswitch: supply a dummy err_handler of gre_cisco_protocol to prevent kernel crash
On Fri, Apr 4, 2014 at 9:20 PM, wei zhang wrote: > At 2014-04-05 07:05:59, "Jesse Gross" wrote: >>On Tue, Apr 1, 2014 at 5:23 PM, Wei Zhang wrote: >>> >>> v2 -> v1: use the same logic as gre_rcv() to distinguish which packets are >>> intended for us! >> >>As a tip on kernel process: if you put the version information after >>three dashes below the signed-off-by line then git will automatically >>remove it when the final patch is applied. > > Thanks, should I modify it and send a v3 patch? > >> >>> diff --git a/net/openvswitch/vport-gre.c b/net/openvswitch/vport-gre.c >>> index a3d6951..f391df1 100644 >>> --- a/net/openvswitch/vport-gre.c >>> +++ b/net/openvswitch/vport-gre.c >>> @@ -110,6 +110,21 @@ static int gre_rcv(struct sk_buff *skb, >>> return PACKET_RCVD; >>> } >>> >>> +/* Called with rcu_read_lock and BH disabled. */ >>> +static int gre_err(struct sk_buff *skb, u32 info, >>> + const struct tnl_ptk_info *tpi) >>> +{ >>> + struct ovs_net *ovs_net; >>> + struct vport *vport; >>> + >>> + ovs_net = net_generic(dev_net(skb->dev), ovs_net_id); >>> + vport = rcu_dereference(ovs_net->vport_net.gre_vport); >>> + if (unlikely(!vport)) >>> + return PACKET_REJECT; >>> + else >>> + return PACKET_RCVD; >> >>Sorry, I forgot to say this before - if we receive the packet then we >>should also call consume_skb() on it. > > Maybe there is no need to call consume_skb()? The icmp_rcv() would > call kfree_skb() for us. I also checked ipgre_err(), which returns > PACKET_RCVD without calling consume_skb() too. Thanks, you are right. I applied your patch as is.
Re: [PATCH v2] openvswitch: supply a dummy err_handler of gre_cisco_protocol to prevent kernel crash
On Tue, Apr 1, 2014 at 5:23 PM, Wei Zhang wrote: > When using a gre vport, openvswitch registers a gre_cisco_protocol but does not > supply an err_handler with it. The gre_cisco_err() in net/ipv4/gre_demux.c > expects > an err_handler to be provided with the gre_cisco_protocol implementation, and calls > ->err_handler() without an existence check, causing a kernel crash. > > This patch provides an err_handler to fix this bug. > > v2 -> v1: use the same logic as gre_rcv() to distinguish which packets are > intended for us! As a tip on kernel process: if you put the version information after three dashes below the signed-off-by line then git will automatically remove it when the final patch is applied. > diff --git a/net/openvswitch/vport-gre.c b/net/openvswitch/vport-gre.c > index a3d6951..f391df1 100644 > --- a/net/openvswitch/vport-gre.c > +++ b/net/openvswitch/vport-gre.c > @@ -110,6 +110,21 @@ static int gre_rcv(struct sk_buff *skb, > return PACKET_RCVD; > } > > +/* Called with rcu_read_lock and BH disabled. */ > +static int gre_err(struct sk_buff *skb, u32 info, > + const struct tnl_ptk_info *tpi) > +{ > + struct ovs_net *ovs_net; > + struct vport *vport; > + > + ovs_net = net_generic(dev_net(skb->dev), ovs_net_id); > + vport = rcu_dereference(ovs_net->vport_net.gre_vport); > + if (unlikely(!vport)) > + return PACKET_REJECT; > + else > + return PACKET_RCVD; Sorry, I forgot to say this before - if we receive the packet then we should also call consume_skb() on it.
Re: [PATCH] openvswitch: supply a dummy err_handler of gre_cisco_protocol to prevent kernel crash
On Tue, Apr 1, 2014 at 8:24 AM, wei zhang wrote: > At 2014-04-01 08:49:53, "Jesse Gross" wrote: >>On Sun, Mar 30, 2014 at 5:12 AM, wei zhang wrote: >>> At 2014-03-29 06:02:25, "Jesse Gross" wrote: > >>> Maybe I misunderstand something? I think if we discard all packets passed to us >>> when we use a gre vport, a new gre_cisco_protocol with lower priority could >>> not see the packets intended for it. >> >>That's true but in this case it would also not see any data packets, >>so I don't think that situation would work well anyways. >> >>> I checked the implementation of ipgre_err(), which is called before >>> the err_handler of the gre vport. It uses the (local address, remote >>> address, key) >>> to distinguish the packets which are really intended for it, although it could >>> not >>> always get the key from the icmp packet. Should we do the same as it? >>> I'm not sure this is feasible; any advice is appreciated. >> >>OVS does flow based matching rather than using a static set of >>configuration parameters, so everything "matches" in some way >>(although the result might be to drop). > > So the flow based match can be dynamically determined by the ovs daemon; we could > not find out who a packet belongs to until we call ovs_dp_upcall(), isn't that right? That's right - and since the OVS flow table always has a default behavior (even if it is drop or send to controller) there's never a packet that isn't considered to be destined to OVS once it is received. If this makes sense to you, would you mind submitting the patch you had earlier formally with a commit message and signed-off-by line?
Re: [PATCH] openvswitch: supply a dummy err_handler of gre_cisco_protocol to prevent kernel crash
On Sun, Mar 30, 2014 at 5:12 AM, wei zhang wrote: > At 2014-03-29 06:02:25, "Jesse Gross" wrote: > >>I'm not sure that rejecting all ICMP packets is the correct thing to do >>here since it means that we could pass them onto a later caller even >>though they are intended for us. We should probably use the same logic >>as for receiving packets and just discard them here. > > Thank you very much for your advice, did you mean this logic? Yes, that's what I was thinking. [...] > Maybe I misunderstand something? I think if we discard all packets passed to us > when we use a gre vport, a new gre_cisco_protocol with lower priority could > not see the packets intended for it. That's true but in this case it would also not see any data packets, so I don't think that situation would work well anyways. > I checked the implementation of ipgre_err(), which is called before > the err_handler of the gre vport. It uses the (local address, remote address, key) > to distinguish the packets which are really intended for it, although it could not > always get the key from the icmp packet. Should we do the same as it? > I'm not sure this is feasible; any advice is appreciated. OVS does flow based matching rather than using a static set of configuration parameters, so everything "matches" in some way (although the result might be to drop). This generally means that OVS is the receiver of last resort and nothing currently has a lower priority. That actually means the difference between the patches is somewhat academic but it seems more robust for the logic to be consistent.
Re: [PATCH] net/openvswitch: Use with RCU_INIT_POINTER(x, NULL) in vport-gre.c
On Sun, Mar 23, 2014 at 12:22 PM, Monam Agarwal wrote: > This patch replaces rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL) > > The rcu_assign_pointer() ensures that the initialization of a structure > is carried out before storing a pointer to that structure. > And in the case of the NULL pointer, there is no structure to initialize. > So, rcu_assign_pointer(p, NULL) can be safely converted to > RCU_INIT_POINTER(p, NULL) > > Signed-off-by: Monam Agarwal Applied.
Re: [PATCH] openvswitch: supply a dummy err_handler of gre_cisco_protocol to prevent kernel crash
On Thu, Mar 27, 2014 at 2:56 PM, Wei Zhang wrote: > diff --git a/net/openvswitch/vport-gre.c b/net/openvswitch/vport-gre.c > index a3d6951..d64c639 100644 > --- a/net/openvswitch/vport-gre.c > +++ b/net/openvswitch/vport-gre.c > @@ -110,6 +110,12 @@ static int gre_rcv(struct sk_buff *skb, > return PACKET_RCVD; > } > > +static int gre_dummy_err(struct sk_buff *skb, u32 info, > +const struct tnl_ptk_info *tpi) > +{ > + return PACKET_REJECT; > +} I'm not sure that rejecting all ICMP packets is the correct thing to do here since it means that we could pass them onto a later caller even though they are intended for us. We should probably use the same logic as for receiving packets and just discard them here.
Re: [ovs-discuss] Linus GIT Head OOPs reproducable in open vswitch when running mininet topology
On Fri, Jan 31, 2014 at 10:18 AM, Thomas Glanzmann wrote: >> Do you know if this happens with an older kernel or with a simpler topology? > > No, I don't. I just verified that the Ubuntu Mininet uses the > openvswitch kernel module from openvswitch and not the one that is > shipped with the kernel. Ubuntu precise does not crash with the exact same > topology. The kernel from Precise doesn't call the function that is triggering the problem, so it's not too surprising that it doesn't have the same issue. It's not clear that this is actually a bug in the OVS code since it happens in a different function and that function accesses data that OVS doesn't really touch. Do you know if this happens with the bridge? Or can you try bisecting? Or use gdb to track down the faulting address?
Re: [ovs-discuss] Linus GIT Head OOPs reproducable in open vswitch when running mininet topology
On Thu, Jan 30, 2014 at 6:33 PM, Thomas Glanzmann wrote: > Hello Jesse, > >> This looks like the kernel module included with upstream Linux instead >> of from OVS git, is that correct? > > Correct. > >> Can you please describe what you are doing instead of just giving your >> script? > > I created 8 hosts. 2 hosts are connected to each switch. That gives > me 4 switches which are connected using a ring topology. The reason for > that is that I want to test the Layer2, Layer3 IPv4 and IPv6 > capabilities of OpenDayLight. Do you know what type of devices are being attached to OVS (i.e. tap, veth, etc.)? Do you know if this happens with an older kernel or with a simpler topology?
Re: [ovs-discuss] Linus GIT Head OOPs reproducable in open vswitch when running mininet topology
On Thu, Jan 30, 2014 at 12:44 PM, Thomas Glanzmann wrote: > Hello, > open vswitch git head with Linus tip OOPses for me reproducibly when I > load the following mininet topology: This looks like the kernel module included with upstream Linux instead of from OVS git, is that correct? Can you please describe what you are doing instead of just giving your script? It would also be helpful if you could use GDB to find out the source of the faulting address.
Re: [-next] openvswitch BUILD_BUG_ON failed
On Tue, Sep 3, 2013 at 11:55 PM, Geert Uytterhoeven wrote:
> On Tue, Sep 3, 2013 at 11:44 PM, Jesse Gross wrote:
>> On Sat, Aug 31, 2013 at 5:11 AM, Geert Uytterhoeven wrote:
>>> On Fri, Aug 30, 2013 at 3:11 AM, Jesse Gross wrote:
>>>> On Thu, Aug 29, 2013 at 3:10 PM, David Miller wrote:
>>>>> From: Jesse Gross
>>>>> Date: Thu, 29 Aug 2013 14:42:22 -0700
>>>>>
>>>>>> On Thu, Aug 29, 2013 at 2:21 PM, Geert Uytterhoeven wrote:
>>>>>>> However, I have some doubts about other alignment "enforcements":
>>>>>>>
>>>>>>> "__aligned(__alignof__(long))" makes the whole struct aligned to the
>>>>>>> alignment rule for "long":
>>>>>>>   1. This is only 2 bytes on m68k, i.e. != sizeof(long).
>>>>>>>   2. This is 4 bytes on many 32-bit platforms, which may be less than
>>>>>>>      the default alignment for "__be64" (cfr. some members of struct
>>>>>>>      ovs_key_ipv4_tunnel), so this may make those 64-bit members
>>>>>>>      unaligned.
>>>>>>
>>>>>> Do any of those 32-bit architectures actually care about alignment of
>>>>>> 64 bit values? On 32-bit x86, a long is 32 bits but the alignment
>>>>>> requirement of __be64 is also 32 bit.
>>>>>
>>>>> All except x86-32 do, it is in fact the odd man out with respect to
>>>>> this issue.
>>>>
>>>> Thanks, good to know.
>>>>
>>>> Andy, do you want to modify your patch to just drop the alignment
>>>> specification as Geert suggested (but definitely keep the new build
>>>> assert that you added)? It's probably better to just send the patch to
>>>> netdev (against net-next) as well since you'll likely get better
>>>> comments there and we can fix this faster if you cut out the middleman.
>>>
>>> Why do you want to keep the build asserts?
>>> Is this in-memory structure also transferred as-is over the network?
>>> If yes, you definitely want the padding.
>>
>> Well they caught this bug and really don't cost anything.
>>
>>> Nevertheless, as the struct contains u32 and even __be64 members, the
>>> size of the struct will always be a multiple of the alignment unit for
>>> 64-bit quantities (and thus also for long), as per the C standard.
>>> Hence the check
>>>
>>>     BUILD_BUG_ON(sizeof(struct sw_flow_key) % __alignof__(long));
>>>
>>> will only catch bad compiler bugs or people adding __packed to the
>>> struct.
>>
>> It's possible that we might want to pack the structure in the future.
>> More generally though, the contents of the struct are really
>> independent of the alignment requirements here because we're accessing
>> it as an array of bytes in long-sized chunks, so implicitly depending
>> on the size of the members is not that great.
>
> So you're accessing it as an array of bytes in long-sized chunks.
> What are you doing with this accessed data?
> Transferring over the network?
> Storing on disk?
> Then it must be portable across machines and architectures, right?

It's just an in-memory hash table lookup. No one else ever sees it.
Re: [-next] openvswitch BUILD_BUG_ON failed
On Sat, Aug 31, 2013 at 5:11 AM, Geert Uytterhoeven wrote:
> On Fri, Aug 30, 2013 at 3:11 AM, Jesse Gross wrote:
>> On Thu, Aug 29, 2013 at 3:10 PM, David Miller wrote:
>>> From: Jesse Gross
>>> Date: Thu, 29 Aug 2013 14:42:22 -0700
>>>
>>>> On Thu, Aug 29, 2013 at 2:21 PM, Geert Uytterhoeven wrote:
>>>>> However, I have some doubts about other alignment "enforcements":
>>>>>
>>>>> "__aligned(__alignof__(long))" makes the whole struct aligned to the
>>>>> alignment rule for "long":
>>>>>   1. This is only 2 bytes on m68k, i.e. != sizeof(long).
>>>>>   2. This is 4 bytes on many 32-bit platforms, which may be less than
>>>>>      the default alignment for "__be64" (cfr. some members of struct
>>>>>      ovs_key_ipv4_tunnel), so this may make those 64-bit members
>>>>>      unaligned.
>>>>
>>>> Do any of those 32-bit architectures actually care about alignment of
>>>> 64 bit values? On 32-bit x86, a long is 32 bits but the alignment
>>>> requirement of __be64 is also 32 bit.
>>>
>>> All except x86-32 do, it is in fact the odd man out with respect to
>>> this issue.
>>
>> Thanks, good to know.
>>
>> Andy, do you want to modify your patch to just drop the alignment
>> specification as Geert suggested (but definitely keep the new build
>> assert that you added)? It's probably better to just send the patch to
>> netdev (against net-next) as well since you'll likely get better
>> comments there and we can fix this faster if you cut out the middleman.
>
> Why do you want to keep the build asserts?
> Is this in-memory structure also transferred as-is over the network?
> If yes, you definitely want the padding.

Well they caught this bug and really don't cost anything.

> Nevertheless, as the struct contains u32 and even __be64 members, the
> size of the struct will always be a multiple of the alignment unit for
> 64-bit quantities (and thus also for long), as per the C standard.
> Hence the check
>
>     BUILD_BUG_ON(sizeof(struct sw_flow_key) % __alignof__(long));
>
> will only catch bad compiler bugs or people adding __packed to the
> struct.

It's possible that we might want to pack the structure in the future.
More generally though, the contents of the struct are really independent
of the alignment requirements here because we're accessing it as an
array of bytes in long-sized chunks, so implicitly depending on the size
of the members is not that great.
Re: [-next] openvswitch BUILD_BUG_ON failed
On Thu, Aug 29, 2013 at 3:10 PM, David Miller wrote:
> From: Jesse Gross
> Date: Thu, 29 Aug 2013 14:42:22 -0700
>
>> On Thu, Aug 29, 2013 at 2:21 PM, Geert Uytterhoeven wrote:
>>> However, I have some doubts about other alignment "enforcements":
>>>
>>> "__aligned(__alignof__(long))" makes the whole struct aligned to the
>>> alignment rule for "long":
>>>   1. This is only 2 bytes on m68k, i.e. != sizeof(long).
>>>   2. This is 4 bytes on many 32-bit platforms, which may be less than
>>>      the default alignment for "__be64" (cfr. some members of struct
>>>      ovs_key_ipv4_tunnel), so this may make those 64-bit members
>>>      unaligned.
>>
>> Do any of those 32-bit architectures actually care about alignment of
>> 64 bit values? On 32-bit x86, a long is 32 bits but the alignment
>> requirement of __be64 is also 32 bit.
>
> All except x86-32 do, it is in fact the odd man out with respect to this
> issue.

Thanks, good to know.

Andy, do you want to modify your patch to just drop the alignment
specification as Geert suggested (but definitely keep the new build
assert that you added)? It's probably better to just send the patch to
netdev (against net-next) as well since you'll likely get better comments
there and we can fix this faster if you cut out the middleman.
Re: [-next] openvswitch BUILD_BUG_ON failed
On Thu, Aug 29, 2013 at 2:21 PM, Geert Uytterhoeven wrote:
> However, I have some doubts about other alignment "enforcements":
>
> "__aligned(__alignof__(long))" makes the whole struct aligned to the
> alignment rule for "long":
>   1. This is only 2 bytes on m68k, i.e. != sizeof(long).
>   2. This is 4 bytes on many 32-bit platforms, which may be less than
>      the default alignment for "__be64" (cfr. some members of struct
>      ovs_key_ipv4_tunnel), so this may make those 64-bit members
>      unaligned.

Do any of those 32-bit architectures actually care about alignment of
64 bit values? On 32-bit x86, a long is 32 bits but the alignment
requirement of __be64 is also 32 bit.
Re: linux-next: Tree for Jun 21 (netdev: openvswitch)
On Fri, Jun 21, 2013 at 8:22 AM, Randy Dunlap wrote:
> On 06/21/13 01:17, Stephen Rothwell wrote:
>> Hi all,
>>
>> Happy solstice!
>>
>> Changes since 20130620:
>
> when CONFIG_INET is not enabled:
>
>   CC      net/openvswitch/flow.o
> In file included from net/openvswitch/flow.c:43:0:
> include/net/ip_tunnels.h: In function 'tunnel_ip_select_ident':
> include/net/ip_tunnels.h:155:3: error: implicit declaration of function
> '__ip_select_ident' [-Werror=implicit-function-declaration]

Thanks, I just sent out a couple of patches to fix config issues with the
recent tunneling series.
Re: [PATCH v4 4/9] net: openvswitch: use this_cpu_ptr per-cpu helper
On Fri, Nov 16, 2012 at 12:35 AM, Shan Wei wrote:
> Shan Wei said, at 2012/11/13 9:52:
>> From: Shan Wei
>>
>> Just use the faster this_cpu_ptr instead of
>> per_cpu_ptr(p, smp_processor_id());
>>
>> Signed-off-by: Shan Wei
>> Reviewed-by: Christoph Lameter
>
> Jesse Gross, would you like to pick it up to your tree?

Applied, thanks.
Re: [PATCH 4/9] net: openvswitch: use this_cpu_ptr per-cpu helper
On Thu, Nov 1, 2012 at 7:33 AM, Christoph Lameter wrote:
> On Thu, 1 Nov 2012, Shan Wei wrote:
>
>> But for different fields in the same per-cpu variable, how do we
>> guarantee that n_missed and n_hit are from the same cpu?
>>     this_cpu_read(dp->stats_percpu->n_missed);
>>     [processor changed]
>>     this_cpu_read(dp->stats_percpu->n_hit);
>
> What does currently guarantee that? If it is guaranteed then you can
> use the __this_cpu_xxx ops.

Preemption is disabled in all of the places where writes are done and
all of the reads are from foreign CPUs.