Re: [v1,net-next 3/4] net: qos: police action add index for tc flower offloading
On 2020-06-24 8:34 p.m., Po Liu wrote: -Original Message- That is the point i was trying to get to. Basically: You have a counter table which is referenced by "index" You also have a meter/policer table which is referenced by "index". They should be one same group and same meaning. Didnt follow. You mean the index is the same for both the stat and policer? For policers, they maintain their own stats. So when i say: tc ... flower ... action police ... index 5 The index referred to is in the policer table Sure. Means police with No. 5 entry. But for other actions, example when i say: tc ... flower ... action drop index 10 Still the question, does gact action drop could bind with index? It doesn't meanful. Depends on your hardware. From this discussion i am trying to understand where the constraint is for your case. Whether it is your h/w or the TSN spec. For a sample counting which is flexible see here: https://p4.org/p4-spec/docs/PSA.html#sec-counters That concept is not specific to P4 but rather to newer flow-based hardware. More context: The assumption these days is we can have a _lot_ of flows with a lot of actions. Then you want to be able to collect the stats separately, possibly one counter entry for each action of interest. Why is this important?f For analytics uses cases, when you are retrieving the stats you want to reduce the amount of data being retrieved. Typically these stats are polled every X seconds. For starters, you dont dump filters (which in your case seems to be the only way to get the stats). In current tc, you dump the actions. But that could be improved so you can just dump the stats. The mapping of stats index to actions is known to the entity doing the dump. Does that make sense? The index is in the counter/stats table. It is not exactly "10" in hardware, the driver magically hides it from the user - so it could be hw counter index 1234 Not exactly. Current flower offloading stats means get the chain index for that flow filter. The other actions should bind to that chain index. > So if i read correctly: You have an index per filter pointing to the counter table. Is this something _you_ decided to do in software or is it how the hardware works? (note i referred to this as "legacy ACL" approach earlier. It worked like that in old hardware because the main use case was to have one action on a match (drop/accept kind). Like IEEE802.1Qci, what I am doing is bind gate action to filter chain(mandatory). And also police action as optional. I cant seem to find this spec online. Is it freely available? Also, if i understand you correctly you are saying according to this spec you can only have the following type of policy: tc .. filter match-spec-here .. \ action gate gate-action-attributes \ action police ... That "action gate" MUST always be present but "action police" is optional? There is stream counter table which summary the counters pass gate action entry and police action entry for that chain index(there is a bit different if two chain sharing same action list). One chain counter which tc show stats get counter source: struct psfp_streamfilter_counters { u64 matching_frames_count; u64 passing_frames_count; u64 not_passing_frames_count; u64 passing_sdu_count; u64 not_passing_sdu_count; u64 red_frames_count; }; Assuming psfp is something defined in IEEE802.1Qci and the spec will describe these? Is the filter "index" pointing to one of those in some counter table? When pass to the user space, summarize as: stats.pkts = counters.matching_frames_count + counters.not_passing_sdu_count - filter->stats.pkts; > stats.drops = counters.not_passing_frames_count + counters.not_passing_sdu_count + counters.red_frames_count - filter->stats.drops; Thanks for the explanation. What is filter->stats? The rest of those counters seem related to the gate action. How do you account for policing actions? cheers, jamal
Re: [v1,net-next 3/4] net: qos: police action add index for tc flower offloading
On 2020-06-23 7:52 p.m., Po Liu wrote: Hi Jamal, My question: Is this any different from how stats are structured? [..] My question: Why cant you apply the same semantics for the counters? Does your hardware have an indexed counter/stats table? If yes then you Yes, That is the point i was trying to get to. Basically: You have a counter table which is referenced by "index" You also have a meter/policer table which is referenced by "index". For policers, they maintain their own stats. So when i say: tc ... flower ... action police ... index 5 The index referred to is in the policer table But for other actions, example when i say: tc ... flower ... action drop index 10 The index is in the counter/stats table. It is not exactly "10" in hardware, the driver magically hides it from the user - so it could be hw counter index 1234 The old approach is to assume the classifier (flower in this case) has a counter. The reason for this assumption is older hardware was designed to deal with a single action per match. So a counter to the filter is also the counter to the (single) action. I get the feeling your hardware fits in that space. Modern use cases have evolved from the ACL single match and action approach. Maintaining the old thought/architecture breaks in two use cases: 1) when you have multiple actions per policy filter. You need counter-per-action for various reasons 2) Sharing of counters across filters and action. This can be achieve tc supports the above and is sufficient to cover the old use cases. I am just worried, architecturally, we are restricting ourselves to the old scheme. Another reason this is important is for the sake of analytics. A user space app can poll just for the stats table in hardware (or the cached version in the kernel) and reduce the amount of data crossing to user space.. cheers, jamal
Re: [v1,net-next 3/4] net: qos: police action add index for tc flower offloading
On 2020-06-23 7:55 a.m., Po Liu wrote: [..] My question: Is this any different from how stats are structured? I don't know I fully catch the question. Are you trying to get how many frames for each filter chain passing one index policing action? If one index police action bind to multiple tc filter(they should have differnt chain index ). All those filter should get same index police action stats value since they are sharing the same hardware entry. But I don't think this is the problem. This is a good thing. What is nice is i can use the same index for s/w and h/w (and no need for a translation/remapping). With index provide to device driver(map the s/w action index to a h/w table index ), user could list the police actions list by command: # tc actions show action police Shows the police action table by index. This is also nice. My question: Why cant you apply the same semantics for the counters? Does your hardware have an indexed counter/stats table? If yes then you should be able to do similar thing for counters as you do for policer (i.e use an index and share counters across actions). So when i say: tc action drop index 5 and tc action ok index 5 infact they use the same counter. cheers, jamal
Re: [v1,net-next 3/4] net: qos: police action add index for tc flower offloading
This certainly brings an interesting point which i brought up earlier when Jiri was doing offloading of stats. In this case the action index is being used as the offloaded policer index (note: there'd need to be a check whether the index is infact acceptable to the h/w etc unless there 2^32 meters available in the hardware). My question: Is this any different from how stats are structured? In this case you can map the s/w action index to a h/w table index (of meters). My comment then was: hardware i have encountered (and i pointed to P4 model as well) assumes an indexed table of stats. cheers, jamal On 2020-06-23 2:34 a.m., Po Liu wrote: From: Po Liu Hardware may own many entries for police flow. So that make one(or multi) flow to be policed by one hardware entry. This patch add the police action index provide to the driver side make it mapping the driver hardware entry index. Signed-off-by: Po Liu --- include/net/flow_offload.h | 1 + net/sched/cls_api.c| 1 + 2 files changed, 2 insertions(+) diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h index c2ef19c6b27d..eed98075b1ae 100644 --- a/include/net/flow_offload.h +++ b/include/net/flow_offload.h @@ -232,6 +232,7 @@ struct flow_action_entry { booltruncate; } sample; struct {/* FLOW_ACTION_POLICE */ + u32 index; s64 burst; u64 rate_bytes_ps; u32 mtu; diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c index 6aba7d5ba1ec..fdc4c89ca1fa 100644 --- a/net/sched/cls_api.c +++ b/net/sched/cls_api.c @@ -3659,6 +3659,7 @@ int tc_setup_flow_action(struct flow_action *flow_action, entry->police.rate_bytes_ps = tcf_police_rate_bytes_ps(act); entry->police.mtu = tcf_police_tcfp_mtu(act); + entry->police.index = act->tcfa_index; } else if (is_tcf_ct(act)) { entry->id = FLOW_ACTION_CT; entry->ct.action = tcf_ct_action(act);
Re: [PATCH net-next 0/2] Change tc action identifiers to be more consistent
On 2019-02-07 2:45 a.m., Eli Cohen wrote: This two patch series modifies TC actions identifiers to be more consistent and also puts them in one place so new identifiers numbers can be chosen more easily. For the series: Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH/RFC] Make perf_event_open() propagate errors for use in bpf_perf_event_open()
On 2019-01-11 10:55 a.m., Arnaldo Carvalho de Melo wrote: Hi Peter, bpf_perf_event_open() already returns a value, but if perf_event_output's output_begin (mostly perf_output_begin) fails, the only way to know about that is looking before/after the rb->lost, right? For ring buffer users that is ok, we'll get a PERF_RECORD_LOST, etc, but for BPF programs it would be convenient to get that -ENOSPC and do some fallback, whatever makes sense, like in my augmented_syscalls stuff for 'perf trace', i.e. don't augment it (i.e. push stuff at the end of the normal payload), just don't filter the raw_syscalls:sys_enter, 'perf trace' will get the enter syscall enter event without the pointer dereference at the end, etc, warn the user but don't lose a syscall in the strace-like output. What do you think? Am I missing something? Probably ;-) Ah, its just test built. Works as advertised ;-> Tested-by: Jamal Hadi Salim cheers, jamal
Re: linux-next: manual merge of the net-next tree with the net tree
On 2018-10-08 9:21 p.m., Stephen Rothwell wrote: Hi all, Today's linux-next merge of the net-next tree got a conflict in: net/sched/cls_u32.c between commit: 6d4c407744dd ("net: sched: cls_u32: fix hnode refcounting") from the net tree and commit: a030598690c6 ("net: sched: cls_u32: simplify the hell out u32_delete() emptiness check") from the net-next tree. I fixed it up (I reverted the net tree commit as I could not tell wich parts of it, if any, are still needed) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. Attached should fix it. Al, please double check. cheers, jamal --- a/net-next/net/sched/cls_u32.c 2018-10-09 05:18:04.046642676 -0400 +++ b/net/sched/cls_u32.c 2018-10-09 05:29:51.965528032 -0400 @@ -391,6 +391,7 @@ RCU_INIT_POINTER(root_ht->next, tp_c->hlist); rcu_assign_pointer(tp_c->hlist, root_ht); + root_ht->refcnt++; rcu_assign_pointer(tp->root, root_ht); tp->data = tp_c; return 0; @@ -606,7 +607,7 @@ struct tc_u_hnode __rcu **hn; struct tc_u_hnode *phn; - WARN_ON(ht->refcnt); + WARN_ON(--ht->refcnt); u32_clear_hnode(tp, ht, extack); @@ -634,7 +635,7 @@ WARN_ON(root_ht == NULL); - if (root_ht && --root_ht->refcnt == 0) + if (root_ht && --root_ht->refcnt == 1) u32_destroy_hnode(tp, root_ht, extack); if (--tp_c->refcnt == 0) { @@ -679,7 +680,6 @@ } if (ht->refcnt == 1) { - ht->refcnt--; u32_destroy_hnode(tp, ht, extack); } else { NL_SET_ERR_MSG_MOD(extack, "Can not delete in-use filter"); @@ -1079,8 +1079,7 @@ } #endif - err = u32_set_parms(net, tp, base, n, tb, tca[TCA_RATE], ovr, - extack); + err = u32_set_parms(net, tp, base, n, tb, tca[TCA_RATE], ovr, extack); if (err == 0) { struct tc_u_knode __rcu **ins; struct tc_u_knode *pins;
Re: linux-next: manual merge of the net-next tree with the net tree
On 2018-10-08 9:21 p.m., Stephen Rothwell wrote: Hi all, Today's linux-next merge of the net-next tree got a conflict in: net/sched/cls_u32.c between commit: 6d4c407744dd ("net: sched: cls_u32: fix hnode refcounting") from the net tree and commit: a030598690c6 ("net: sched: cls_u32: simplify the hell out u32_delete() emptiness check") from the net-next tree. I fixed it up (I reverted the net tree commit as I could not tell wich parts of it, if any, are still needed) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. Attached should fix it. Al, please double check. cheers, jamal --- a/net-next/net/sched/cls_u32.c 2018-10-09 05:18:04.046642676 -0400 +++ b/net/sched/cls_u32.c 2018-10-09 05:29:51.965528032 -0400 @@ -391,6 +391,7 @@ RCU_INIT_POINTER(root_ht->next, tp_c->hlist); rcu_assign_pointer(tp_c->hlist, root_ht); + root_ht->refcnt++; rcu_assign_pointer(tp->root, root_ht); tp->data = tp_c; return 0; @@ -606,7 +607,7 @@ struct tc_u_hnode __rcu **hn; struct tc_u_hnode *phn; - WARN_ON(ht->refcnt); + WARN_ON(--ht->refcnt); u32_clear_hnode(tp, ht, extack); @@ -634,7 +635,7 @@ WARN_ON(root_ht == NULL); - if (root_ht && --root_ht->refcnt == 0) + if (root_ht && --root_ht->refcnt == 1) u32_destroy_hnode(tp, root_ht, extack); if (--tp_c->refcnt == 0) { @@ -679,7 +680,6 @@ } if (ht->refcnt == 1) { - ht->refcnt--; u32_destroy_hnode(tp, ht, extack); } else { NL_SET_ERR_MSG_MOD(extack, "Can not delete in-use filter"); @@ -1079,8 +1079,7 @@ } #endif - err = u32_set_parms(net, tp, base, n, tb, tca[TCA_RATE], ovr, - extack); + err = u32_set_parms(net, tp, base, n, tb, tca[TCA_RATE], ovr, extack); if (err == 0) { struct tc_u_knode __rcu **ins; struct tc_u_knode *pins;
Re: [PATCH net-next v2] net: sched: don't disable bh when accessing action idr
On 21/05/18 04:03 PM, Vlad Buslov wrote: Initial net_device implementation used ingress_lock spinlock to synchronize ingress path of device. This lock was used in both process and bh context. In some code paths action map lock was obtained while holding ingress_lock. Commit e1e992e52faa ("[NET_SCHED] protect action config/dump from irqs") modified actions to always disable bh, while using action map lock, in order to prevent deadlock on ingress_lock in softirq. This lock was removed from net_device, so disabling bh, while accessing action map, is no longer necessary. Replace all action idr spinlock usage with regular calls that do not disable bh. Signed-off-by: Vlad Buslov <vla...@mellanox.com> Acked-by: Jamal Hadi Salim <j...@mojatatu.com> cheers, jamal
Re: [PATCH net-next v2] net: sched: don't disable bh when accessing action idr
On 21/05/18 04:03 PM, Vlad Buslov wrote: Initial net_device implementation used ingress_lock spinlock to synchronize ingress path of device. This lock was used in both process and bh context. In some code paths action map lock was obtained while holding ingress_lock. Commit e1e992e52faa ("[NET_SCHED] protect action config/dump from irqs") modified actions to always disable bh, while using action map lock, in order to prevent deadlock on ingress_lock in softirq. This lock was removed from net_device, so disabling bh, while accessing action map, is no longer necessary. Replace all action idr spinlock usage with regular calls that do not disable bh. Signed-off-by: Vlad Buslov Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH 00/14] Modify action API for implementing lockless actions
On 14/05/18 04:46 PM, Vlad Buslov wrote: On Mon 14 May 2018 at 18:03, Jamal Hadi Salim <j...@mojatatu.com> wrote: On 14/05/18 10:27 AM, Vlad Buslov wrote: Hello Jamal, I'm trying to run tdc, but keep getting following error even on clean branch without my patches: Vlad, not sure if you saw my email: Apply Roman's patch and try again https://marc.info/?l=linux-netdev=152639369112020=2 cheers, jamal
Re: [PATCH 00/14] Modify action API for implementing lockless actions
On 14/05/18 04:46 PM, Vlad Buslov wrote: On Mon 14 May 2018 at 18:03, Jamal Hadi Salim wrote: On 14/05/18 10:27 AM, Vlad Buslov wrote: Hello Jamal, I'm trying to run tdc, but keep getting following error even on clean branch without my patches: Vlad, not sure if you saw my email: Apply Roman's patch and try again https://marc.info/?l=linux-netdev=152639369112020=2 cheers, jamal
Re: [PATCH 00/14] Modify action API for implementing lockless actions
On 14/05/18 10:27 AM, Vlad Buslov wrote: Currently, all netlink protocol handlers for updating rules, actions and qdiscs are protected with single global rtnl lock which removes any possibility for parallelism. This patch set is a first step to remove rtnl lock dependency from TC rules update path. It updates act API to use atomic operations, rcu and spinlocks for fine-grained locking. It also extend API with functions that are needed to update existing actions for parallel execution. Outline of changes: - Change tc action to use atomic reference and bind counters, rcu mechanism for cookie update. - Extend action ops API with 'delete' function and 'unlocked' flag. - Change action API to work with actions in lockless manner based on primitives implemented in previous patches. - Extend action API with new functions necessary to implement unlocked actions. Please run all the tdc tests with these changes. This area has almost good test coverage at this point. If you need help just ping me. cheers, jamal
Re: [PATCH 00/14] Modify action API for implementing lockless actions
On 14/05/18 10:27 AM, Vlad Buslov wrote: Currently, all netlink protocol handlers for updating rules, actions and qdiscs are protected with single global rtnl lock which removes any possibility for parallelism. This patch set is a first step to remove rtnl lock dependency from TC rules update path. It updates act API to use atomic operations, rcu and spinlocks for fine-grained locking. It also extend API with functions that are needed to update existing actions for parallel execution. Outline of changes: - Change tc action to use atomic reference and bind counters, rcu mechanism for cookie update. - Extend action ops API with 'delete' function and 'unlocked' flag. - Change action API to work with actions in lockless manner based on primitives implemented in previous patches. - Extend action API with new functions necessary to implement unlocked actions. Please run all the tdc tests with these changes. This area has almost good test coverage at this point. If you need help just ping me. cheers, jamal
Re: [PATCH net-next 2/2] net sched act_vlan: VLAN action rewrite to use RCU lock/unlock and update
On 17-10-10 10:33 PM, Manish Kurup wrote: Using a spinlock in the VLAN action causes performance issues when the VLAN action is used on multiple cores. Rewrote the VLAN action to use RCU read locking for reads and updates instead. Signed-off-by: Manish Kurup <manish.ku...@verizon.com> Acked-by: Jamal Hadi Salim <j...@mojatatu.com> cheers, jamal
Re: [PATCH net-next 2/2] net sched act_vlan: VLAN action rewrite to use RCU lock/unlock and update
On 17-10-10 10:33 PM, Manish Kurup wrote: Using a spinlock in the VLAN action causes performance issues when the VLAN action is used on multiple cores. Rewrote the VLAN action to use RCU read locking for reads and updates instead. Signed-off-by: Manish Kurup Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH] net/sched: act_skbmod: remove unneeded rcu_read_unlock in tcf_skbmod_dump
On 17-03-04 07:01 PM, Alexey Khoroshilov wrote: Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Alexey Khoroshilov <khoroshi...@ispras.ru> Acked-by: Jamal Hadi Salim <j...@mojatatu.com> cheers, jamal
Re: [PATCH] net/sched: act_skbmod: remove unneeded rcu_read_unlock in tcf_skbmod_dump
On 17-03-04 07:01 PM, Alexey Khoroshilov wrote: Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Alexey Khoroshilov Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH]net:sched:release lock before tcf_dump_walker() normal return to avoid deadlock
On 16-12-06 12:36 AM, Feng Deng wrote: From: Feng Dengrelease lock before tcf_dump_walker() normal return to avoid deadlock /Scratching my head. I am probably missing something obvious. What are the condition under which this deadlock will happen? Do you have a testcase we can try? cheers, jamal
Re: [PATCH]net:sched:release lock before tcf_dump_walker() normal return to avoid deadlock
On 16-12-06 12:36 AM, Feng Deng wrote: From: Feng Deng release lock before tcf_dump_walker() normal return to avoid deadlock /Scratching my head. I am probably missing something obvious. What are the condition under which this deadlock will happen? Do you have a testcase we can try? cheers, jamal
Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets
On 16-09-28 08:45 AM, Cyrill Gorcunov wrote: Note: inet_diag somewhere has a netlink structure that has a hole. I pointed it out to Eric D. and he said we cant add it now because it would break ABI. Naming holes generated by a compiler for alignment sake should not break abi (because alignments are abi by self), but I may be missing something in context of where the structure you're talking about is present. And what about non-x86 archs? They might have different members alignment requirements. struct tcp_info. Sorry - i didnt mean to drag this for long; but the more i think about it, the union is a pragmatic approach. cheers, jamal
Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets
On 16-09-28 08:45 AM, Cyrill Gorcunov wrote: Note: inet_diag somewhere has a netlink structure that has a hole. I pointed it out to Eric D. and he said we cant add it now because it would break ABI. Naming holes generated by a compiler for alignment sake should not break abi (because alignments are abi by self), but I may be missing something in context of where the structure you're talking about is present. And what about non-x86 archs? They might have different members alignment requirements. struct tcp_info. Sorry - i didnt mean to drag this for long; but the more i think about it, the union is a pragmatic approach. cheers, jamal
Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets
On 16-09-28 08:27 AM, Jamal Hadi Salim wrote: On 16-09-28 08:16 AM, David Miller wrote: From: Jamal Hadi Salim <j...@mojatatu.com> Date: Wed, 28 Sep 2016 08:09:28 -0400 On 16-09-28 08:07 AM, David Miller wrote: Right, it would be legal for an existing user to have code that explicitly initializes every member of the structure, including 'pad'. So we have to keep that member around, at a minimum, for their sake. I think we need to start labelling any new pad fields added as "Not UAPI. Do not fsck fondle this". They must initialize it to zero. What if in the future actually meant to use 0 for something?;-> For example in Cyrill's case it means PROTO_IP Not sure if it useful to interpret or not but it is part of the enum for protocols. I just tested with: socket(AF_INET, SOCK_RAW, 0); and got back: EPROTONOSUPPORT So in this case doesnt matter. But my point stands i.e 0 could mean something. cheers, jamal
Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets
On 16-09-28 08:27 AM, Jamal Hadi Salim wrote: On 16-09-28 08:16 AM, David Miller wrote: From: Jamal Hadi Salim Date: Wed, 28 Sep 2016 08:09:28 -0400 On 16-09-28 08:07 AM, David Miller wrote: Right, it would be legal for an existing user to have code that explicitly initializes every member of the structure, including 'pad'. So we have to keep that member around, at a minimum, for their sake. I think we need to start labelling any new pad fields added as "Not UAPI. Do not fsck fondle this". They must initialize it to zero. What if in the future actually meant to use 0 for something?;-> For example in Cyrill's case it means PROTO_IP Not sure if it useful to interpret or not but it is part of the enum for protocols. I just tested with: socket(AF_INET, SOCK_RAW, 0); and got back: EPROTONOSUPPORT So in this case doesnt matter. But my point stands i.e 0 could mean something. cheers, jamal
Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets
On 16-09-28 08:16 AM, David Miller wrote: From: Jamal Hadi Salim <j...@mojatatu.com> Date: Wed, 28 Sep 2016 08:09:28 -0400 On 16-09-28 08:07 AM, David Miller wrote: Right, it would be legal for an existing user to have code that explicitly initializes every member of the structure, including 'pad'. So we have to keep that member around, at a minimum, for their sake. I think we need to start labelling any new pad fields added as "Not UAPI. Do not fsck fondle this". They must initialize it to zero. What if in the future actually meant to use 0 for something?;-> For example in Cyrill's case it means PROTO_IP Not sure if it useful to interpret or not but it is part of the enum for protocols. Maybe we shouldnt be adding pad fields in these netlink structure definitions then one can liberally add new ones. Note: inet_diag somewhere has a netlink structure that has a hole. I pointed it out to Eric D. and he said we cant add it now because it would break ABI. So where do we draw the line for future extensions? cheers, jamal
Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets
On 16-09-28 08:16 AM, David Miller wrote: From: Jamal Hadi Salim Date: Wed, 28 Sep 2016 08:09:28 -0400 On 16-09-28 08:07 AM, David Miller wrote: Right, it would be legal for an existing user to have code that explicitly initializes every member of the structure, including 'pad'. So we have to keep that member around, at a minimum, for their sake. I think we need to start labelling any new pad fields added as "Not UAPI. Do not fsck fondle this". They must initialize it to zero. What if in the future actually meant to use 0 for something?;-> For example in Cyrill's case it means PROTO_IP Not sure if it useful to interpret or not but it is part of the enum for protocols. Maybe we shouldnt be adding pad fields in these netlink structure definitions then one can liberally add new ones. Note: inet_diag somewhere has a netlink structure that has a hole. I pointed it out to Eric D. and he said we cant add it now because it would break ABI. So where do we draw the line for future extensions? cheers, jamal
Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets
On 16-09-28 08:07 AM, David Miller wrote: Right, it would be legal for an existing user to have code that explicitly initializes every member of the structure, including 'pad'. So we have to keep that member around, at a minimum, for their sake. I think we need to start labelling any new pad fields added as "Not UAPI. Do not fsck fondle this". cheers, jamal
Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets
On 16-09-28 08:07 AM, David Miller wrote: Right, it would be legal for an existing user to have code that explicitly initializes every member of the structure, including 'pad'. So we have to keep that member around, at a minimum, for their sake. I think we need to start labelling any new pad fields added as "Not UAPI. Do not fsck fondle this". cheers, jamal
Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets
On 16-09-28 07:27 AM, Cyrill Gorcunov wrote: On Wed, Sep 28, 2016 at 07:06:26AM -0400, Jamal Hadi Salim wrote: This structure is uapi, so anyone has complete rights to reference @pad in the userspace programs. Sure it would be more clear to remove the @pad completely, but if we choose so I think it's better to do on top instead and then if someone complain we can easily revert the single trivial commit instead of this big patch. I am conflicted. A field labelled "pad" does not appear to be valid as "UAPI". It is a cosmetic indicator. If you did sizeof() with or without it being present the value doesnt change. I think you miss the point what I'm trying to say: currently end-user may have reference to this member (for any reason) and his program will compile and run. If we change the name the compilation procedure fails and this will break API. Yes, referrning @pad is bad idea for userspace code, and yes (!) better to simply rename it but lets do that later, on top, so that if we break something in userspace we could easily revert the oneline change. I understood well your point;-> Maybe my response was not clear: _nobody should be fscking fondling pad fields_ setting them or otherwise. Maybe let these programs fail. I asked if you knew any such app which did anything with a pad field. cheers, jamal
Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets
On 16-09-28 07:27 AM, Cyrill Gorcunov wrote: On Wed, Sep 28, 2016 at 07:06:26AM -0400, Jamal Hadi Salim wrote: This structure is uapi, so anyone has complete rights to reference @pad in the userspace programs. Sure it would be more clear to remove the @pad completely, but if we choose so I think it's better to do on top instead and then if someone complain we can easily revert the single trivial commit instead of this big patch. I am conflicted. A field labelled "pad" does not appear to be valid as "UAPI". It is a cosmetic indicator. If you did sizeof() with or without it being present the value doesnt change. I think you miss the point what I'm trying to say: currently end-user may have reference to this member (for any reason) and his program will compile and run. If we change the name the compilation procedure fails and this will break API. Yes, referrning @pad is bad idea for userspace code, and yes (!) better to simply rename it but lets do that later, on top, so that if we break something in userspace we could easily revert the oneline change. I understood well your point;-> Maybe my response was not clear: _nobody should be fscking fondling pad fields_ setting them or otherwise. Maybe let these programs fail. I asked if you knew any such app which did anything with a pad field. cheers, jamal
Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets
On 16-09-28 06:51 AM, Cyrill Gorcunov wrote: On Wed, Sep 28, 2016 at 06:43:01AM -0400, Jamal Hadi Salim wrote: [..] I dont know how compilation will fail but you may be right with note: that is not how pads have been used in the past. They are supposed to cosmetic annotation which indicates "here's a hole; use it in the future if you are looking to add something". And someone in the future can claim them. I am not sure if MBZ philosophy applies. This structure is uapi, so anyone has complete rights to reference @pad in the userspace programs. Sure it would be more clear to remove the @pad completely, but if we choose so I think it's better to do on top instead and then if someone complain we can easily revert the single trivial commit instead of this big patch. I am conflicted. A field labelled "pad" does not appear to be valid as "UAPI". It is a cosmetic indicator. If you did sizeof() with or without it being present the value doesnt change. BTW: There is at least one major structure in inet diag has a hole today and doesnt have a padding indicator. If protocol goes over u8 then complete inet_diag_req_v2 structure will have to be reworked becaue @sdiag_protocol is u8 as well. IOW, once someone liftup IPPROTO_MAX > 255, he will notice the problem immediately because diag for such module simply stop working properly. ok. cheers, jamal
Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets
On 16-09-28 06:51 AM, Cyrill Gorcunov wrote: On Wed, Sep 28, 2016 at 06:43:01AM -0400, Jamal Hadi Salim wrote: [..] I dont know how compilation will fail but you may be right with note: that is not how pads have been used in the past. They are supposed to cosmetic annotation which indicates "here's a hole; use it in the future if you are looking to add something". And someone in the future can claim them. I am not sure if MBZ philosophy applies. This structure is uapi, so anyone has complete rights to reference @pad in the userspace programs. Sure it would be more clear to remove the @pad completely, but if we choose so I think it's better to do on top instead and then if someone complain we can easily revert the single trivial commit instead of this big patch. I am conflicted. A field labelled "pad" does not appear to be valid as "UAPI". It is a cosmetic indicator. If you did sizeof() with or without it being present the value doesnt change. BTW: There is at least one major structure in inet diag has a hole today and doesnt have a padding indicator. If protocol goes over u8 then complete inet_diag_req_v2 structure will have to be reworked becaue @sdiag_protocol is u8 as well. IOW, once someone liftup IPPROTO_MAX > 255, he will notice the problem immediately because diag for such module simply stop working properly. ok. cheers, jamal
Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets
On 16-09-28 06:17 AM, Cyrill Gorcunov wrote: On Wed, Sep 28, 2016 at 06:08:00AM -0400, Jamal Hadi Salim wrote: ... @@ -38,7 +38,10 @@ struct inet_diag_req_v2 { __u8sdiag_family; __u8sdiag_protocol; __u8idiag_ext; - __u8pad; + union { + __u8pad; + __u8sdiag_raw_protocol; /* SOCK_RAW only, @pad for others */ + }; Above looks funny. Why is it a union? pad is for exposing a byte-hole for padding/alignment reasons and i doubt anybody is using it. Someone may have set it to zero explicitly on source level, and the compilation will fail on new kernel then. So no, keeping the name is reasonable. I dont know how compilation will fail but you may be right with note: that is not how pads have been used in the past. They are supposed to cosmetic annotation which indicates "here's a hole; use it in the future if you are looking to add something". And someone in the future can claim them. I am not sure if MBZ philosophy applies. Should you not just rename it? Also I notice when things like __raw_v4_lookup() are claiming it is unsigned short instead of a u8? The protocol is still up to 255 for a while, is it expected that IPPROTO_MAX will be increased in more-less near future? Of course I can drop the idea of using @pad here and switch to some extended reauest but prefer to stick with simplier solution. Hm? Ok. If i understood correctly it was already unsigned short before your patch -so i agree it doesnt matter. Maybe just put a comment to express that if ever protocol goes above 255 it wont be sufficient. cheers, jamal PS:- sorry for butting in the discussion - i blame it on coffee.
Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets
On 16-09-28 06:17 AM, Cyrill Gorcunov wrote: On Wed, Sep 28, 2016 at 06:08:00AM -0400, Jamal Hadi Salim wrote: ... @@ -38,7 +38,10 @@ struct inet_diag_req_v2 { __u8sdiag_family; __u8sdiag_protocol; __u8idiag_ext; - __u8pad; + union { + __u8pad; + __u8sdiag_raw_protocol; /* SOCK_RAW only, @pad for others */ + }; Above looks funny. Why is it a union? pad is for exposing a byte-hole for padding/alignment reasons and i doubt anybody is using it. Someone may have set it to zero explicitly on source level, and the compilation will fail on new kernel then. So no, keeping the name is reasonable. I dont know how compilation will fail but you may be right with note: that is not how pads have been used in the past. They are supposed to cosmetic annotation which indicates "here's a hole; use it in the future if you are looking to add something". And someone in the future can claim them. I am not sure if MBZ philosophy applies. Should you not just rename it? Also I notice when things like __raw_v4_lookup() are claiming it is unsigned short instead of a u8? The protocol is still up to 255 for a while, is it expected that IPPROTO_MAX will be increased in more-less near future? Of course I can drop the idea of using @pad here and switch to some extended reauest but prefer to stick with simplier solution. Hm? Ok. If i understood correctly it was already unsigned short before your patch -so i agree it doesnt matter. Maybe just put a comment to express that if ever protocol goes above 255 it wont be sufficient. cheers, jamal PS:- sorry for butting in the discussion - i blame it on coffee.
Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets
On 16-09-28 05:03 AM, Cyrill Gorcunov wrote: In criu we are actively using diag interface to collect sockets present in the system when dumping applications. And while for unix, tcp, udp[lite], packet, netlink it works as expected, the raw sockets do not have. Thus add it. v2: - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@) - implement @destroy for diag requests (by dsa@) v3: - add export of raw_abort for IPv6 (by dsa@) - pass net-admin flag into inet_sk_diag_fill due to changes in net-next branch (by dsa@) v4: - use @pad in struct inet_diag_req_v2 for raw socket protocol specification: raw module carries sockets which may have custom protocol passed from socket() syscall and sole @sdiag_protocol is not enough to match underlied ones - start reporting protocol specifed in socket() call when sockets are raw ones for the same reason: user space tools like ss may parse this attribute and use it for socket matching v5 (by eric.dumazet@): - use sock_hold in raw_sock_get instead of atomic_inc, we're holding (raw_v4_hashinfo|raw_v6_hashinfo)->lock when looking up so counter won't be zero here. CC: David S. MillerCC: Eric Dumazet CC: David Ahern CC: Alexey Kuznetsov CC: James Morris CC: Hideaki YOSHIFUJI CC: Patrick McHardy CC: Andrey Vagin CC: Stephen Hemminger Signed-off-by: Cyrill Gorcunov --- Thanks all for feedback! Take a look please once time permit. include/net/raw.h |6 + include/net/rawv6.h|7 + include/uapi/linux/inet_diag.h |5 net/ipv4/Kconfig |8 + net/ipv4/Makefile |1 net/ipv4/inet_diag.c |9 + net/ipv4/raw.c | 21 +++ net/ipv4/raw_diag.c| 233 + net/ipv6/raw.c |7 - 9 files changed, 292 insertions(+), 5 deletions(-) Index: linux-ml.git/include/net/raw.h === --- linux-ml.git.orig/include/net/raw.h +++ linux-ml.git/include/net/raw.h @@ -23,6 +23,12 @@ extern struct proto raw_prot; +extern struct raw_hashinfo raw_v4_hashinfo; +struct sock *__raw_v4_lookup(struct net *net, struct sock *sk, +unsigned short num, __be32 raddr, +__be32 laddr, int dif); + +int raw_abort(struct sock *sk, int err); void raw_icmp_error(struct sk_buff *, int, u32); int raw_local_deliver(struct sk_buff *, int); Index: linux-ml.git/include/net/rawv6.h === --- linux-ml.git.orig/include/net/rawv6.h +++ linux-ml.git/include/net/rawv6.h @@ -3,6 +3,13 @@ #include +extern struct raw_hashinfo raw_v6_hashinfo; +struct sock *__raw_v6_lookup(struct net *net, struct sock *sk, +unsigned short num, const struct in6_addr *loc_addr, +const struct in6_addr *rmt_addr, int dif); + +int raw_abort(struct sock *sk, int err); + void raw6_icmp_error(struct sk_buff *, int nexthdr, u8 type, u8 code, int inner_offset, __be32); bool raw6_local_deliver(struct sk_buff *, int); Index: linux-ml.git/include/uapi/linux/inet_diag.h === --- linux-ml.git.orig/include/uapi/linux/inet_diag.h +++ linux-ml.git/include/uapi/linux/inet_diag.h @@ -38,7 +38,10 @@ struct inet_diag_req_v2 { __u8sdiag_family; __u8sdiag_protocol; __u8idiag_ext; - __u8pad; + union { + __u8pad; + __u8sdiag_raw_protocol; /* SOCK_RAW only, @pad for others */ + }; Above looks funny. Why is it a union? pad is for exposing a byte-hole for padding/alignment reasons and i doubt anybody is using it. Should you not just rename it? Also I notice when things like __raw_v4_lookup() are claiming it is unsigned short instead of a u8? cheers, jamal
Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets
On 16-09-28 05:03 AM, Cyrill Gorcunov wrote: In criu we are actively using diag interface to collect sockets present in the system when dumping applications. And while for unix, tcp, udp[lite], packet, netlink it works as expected, the raw sockets do not have. Thus add it. v2: - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@) - implement @destroy for diag requests (by dsa@) v3: - add export of raw_abort for IPv6 (by dsa@) - pass net-admin flag into inet_sk_diag_fill due to changes in net-next branch (by dsa@) v4: - use @pad in struct inet_diag_req_v2 for raw socket protocol specification: raw module carries sockets which may have custom protocol passed from socket() syscall and sole @sdiag_protocol is not enough to match underlied ones - start reporting protocol specifed in socket() call when sockets are raw ones for the same reason: user space tools like ss may parse this attribute and use it for socket matching v5 (by eric.dumazet@): - use sock_hold in raw_sock_get instead of atomic_inc, we're holding (raw_v4_hashinfo|raw_v6_hashinfo)->lock when looking up so counter won't be zero here. CC: David S. Miller CC: Eric Dumazet CC: David Ahern CC: Alexey Kuznetsov CC: James Morris CC: Hideaki YOSHIFUJI CC: Patrick McHardy CC: Andrey Vagin CC: Stephen Hemminger Signed-off-by: Cyrill Gorcunov --- Thanks all for feedback! Take a look please once time permit. include/net/raw.h |6 + include/net/rawv6.h|7 + include/uapi/linux/inet_diag.h |5 net/ipv4/Kconfig |8 + net/ipv4/Makefile |1 net/ipv4/inet_diag.c |9 + net/ipv4/raw.c | 21 +++ net/ipv4/raw_diag.c| 233 + net/ipv6/raw.c |7 - 9 files changed, 292 insertions(+), 5 deletions(-) Index: linux-ml.git/include/net/raw.h === --- linux-ml.git.orig/include/net/raw.h +++ linux-ml.git/include/net/raw.h @@ -23,6 +23,12 @@ extern struct proto raw_prot; +extern struct raw_hashinfo raw_v4_hashinfo; +struct sock *__raw_v4_lookup(struct net *net, struct sock *sk, +unsigned short num, __be32 raddr, +__be32 laddr, int dif); + +int raw_abort(struct sock *sk, int err); void raw_icmp_error(struct sk_buff *, int, u32); int raw_local_deliver(struct sk_buff *, int); Index: linux-ml.git/include/net/rawv6.h === --- linux-ml.git.orig/include/net/rawv6.h +++ linux-ml.git/include/net/rawv6.h @@ -3,6 +3,13 @@ #include +extern struct raw_hashinfo raw_v6_hashinfo; +struct sock *__raw_v6_lookup(struct net *net, struct sock *sk, +unsigned short num, const struct in6_addr *loc_addr, +const struct in6_addr *rmt_addr, int dif); + +int raw_abort(struct sock *sk, int err); + void raw6_icmp_error(struct sk_buff *, int nexthdr, u8 type, u8 code, int inner_offset, __be32); bool raw6_local_deliver(struct sk_buff *, int); Index: linux-ml.git/include/uapi/linux/inet_diag.h === --- linux-ml.git.orig/include/uapi/linux/inet_diag.h +++ linux-ml.git/include/uapi/linux/inet_diag.h @@ -38,7 +38,10 @@ struct inet_diag_req_v2 { __u8sdiag_family; __u8sdiag_protocol; __u8idiag_ext; - __u8pad; + union { + __u8pad; + __u8sdiag_raw_protocol; /* SOCK_RAW only, @pad for others */ + }; Above looks funny. Why is it a union? pad is for exposing a byte-hole for padding/alignment reasons and i doubt anybody is using it. Should you not just rename it? Also I notice when things like __raw_v4_lookup() are claiming it is unsigned short instead of a u8? cheers, jamal
Re: [BUG] act_ife: sleeping functions called in atomic context
On 16-06-17 01:31 PM, Cong Wang wrote: My patch is against -net. (I see you already figured out your patch is missing in -net-next.) Ok, should have re-read this email before working on the patch;-> Or are you suggesting to rebase it for -net-next? I think it fixes some real bug so -net is better, although it is slightly large as a bug fix. I am conflicted. There are a lot of changes in net-next at the moment; adding this to -net seems like will definetely cause merge issues for Dave. Ok, Cong - patch attached and tested. Let me know what you think. cheers, jamal diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c index b7fa969..b52deb4 100644 --- a/net/sched/act_ife.c +++ b/net/sched/act_ife.c @@ -239,8 +239,7 @@ static int ife_validate_metatype(struct tcf_meta_ops *ops, void *val, int len) return ret; } -/* called when adding new meta information - * under ife->tcf_lock +/* called to find new metadata ops. Possibly load it as a module. */ static int load_metaops_and_vet(struct tcf_ife_info *ife, u32 metaid, void *val, int len) @@ -251,11 +250,9 @@ static int load_metaops_and_vet(struct tcf_ife_info *ife, u32 metaid, if (!ops) { ret = -ENOENT; #ifdef CONFIG_MODULES - spin_unlock_bh(>tcf_lock); rtnl_unlock(); request_module("ifemeta%u", metaid); rtnl_lock(); - spin_lock_bh(>tcf_lock); ops = find_ife_oplist(metaid); #endif } @@ -272,7 +269,6 @@ static int load_metaops_and_vet(struct tcf_ife_info *ife, u32 metaid, } /* called when adding new meta information - * under ife->tcf_lock */ static int add_metainfo(struct tcf_ife_info *ife, u32 metaid, void *metaval, int len) @@ -302,7 +298,9 @@ static int add_metainfo(struct tcf_ife_info *ife, u32 metaid, void *metaval, } } + spin_lock_bh(>tcf_lock); list_add_tail(>metalist, >metalist); + spin_unlock_bh(>tcf_lock); return ret; } @@ -357,7 +355,6 @@ out_nlmsg_trim: return -1; } -/* under ife->tcf_lock */ static void _tcf_ife_cleanup(struct tc_action *a, int bind) { struct tcf_ife_info *ife = a->priv; @@ -385,7 +382,6 @@ static void tcf_ife_cleanup(struct tc_action *a, int bind) spin_unlock_bh(>tcf_lock); } -/* under ife->tcf_lock */ static int populate_metalist(struct tcf_ife_info *ife, struct nlattr **tb) { int len = 0; @@ -465,6 +461,8 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla, } ife = to_ife(a); + if (exists) + spin_lock_bh(>tcf_lock); ife->flags = parm->flags; if (parm->flags & IFE_ENCODE) { @@ -475,10 +473,9 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla, saddr = nla_data(tb[TCA_IFE_SMAC]); } - spin_lock_bh(>tcf_lock); ife->tcf_action = parm->action; - if (parm->flags & IFE_ENCODE) { + if (daddr) ether_addr_copy(ife->eth_dst, daddr); else @@ -492,6 +489,9 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla, ife->eth_type = ife_type; } + if (exists) + spin_unlock_bh(>tcf_lock); + if (ret == ACT_P_CREATED) INIT_LIST_HEAD(>metalist); @@ -505,7 +505,6 @@ metadata_parse_err: if (ret == ACT_P_CREATED) _tcf_ife_cleanup(a, bind); - spin_unlock_bh(>tcf_lock); return err; } @@ -524,13 +523,10 @@ metadata_parse_err: if (ret == ACT_P_CREATED) _tcf_ife_cleanup(a, bind); - spin_unlock_bh(>tcf_lock); return err; } } - spin_unlock_bh(>tcf_lock); - if (ret == ACT_P_CREATED) tcf_hash_insert(tn, a);
Re: [BUG] act_ife: sleeping functions called in atomic context
On 16-06-17 01:31 PM, Cong Wang wrote: My patch is against -net. (I see you already figured out your patch is missing in -net-next.) Ok, should have re-read this email before working on the patch;-> Or are you suggesting to rebase it for -net-next? I think it fixes some real bug so -net is better, although it is slightly large as a bug fix. I am conflicted. There are a lot of changes in net-next at the moment; adding this to -net seems like will definetely cause merge issues for Dave. Ok, Cong - patch attached and tested. Let me know what you think. cheers, jamal diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c index b7fa969..b52deb4 100644 --- a/net/sched/act_ife.c +++ b/net/sched/act_ife.c @@ -239,8 +239,7 @@ static int ife_validate_metatype(struct tcf_meta_ops *ops, void *val, int len) return ret; } -/* called when adding new meta information - * under ife->tcf_lock +/* called to find new metadata ops. Possibly load it as a module. */ static int load_metaops_and_vet(struct tcf_ife_info *ife, u32 metaid, void *val, int len) @@ -251,11 +250,9 @@ static int load_metaops_and_vet(struct tcf_ife_info *ife, u32 metaid, if (!ops) { ret = -ENOENT; #ifdef CONFIG_MODULES - spin_unlock_bh(>tcf_lock); rtnl_unlock(); request_module("ifemeta%u", metaid); rtnl_lock(); - spin_lock_bh(>tcf_lock); ops = find_ife_oplist(metaid); #endif } @@ -272,7 +269,6 @@ static int load_metaops_and_vet(struct tcf_ife_info *ife, u32 metaid, } /* called when adding new meta information - * under ife->tcf_lock */ static int add_metainfo(struct tcf_ife_info *ife, u32 metaid, void *metaval, int len) @@ -302,7 +298,9 @@ static int add_metainfo(struct tcf_ife_info *ife, u32 metaid, void *metaval, } } + spin_lock_bh(>tcf_lock); list_add_tail(>metalist, >metalist); + spin_unlock_bh(>tcf_lock); return ret; } @@ -357,7 +355,6 @@ out_nlmsg_trim: return -1; } -/* under ife->tcf_lock */ static void _tcf_ife_cleanup(struct tc_action *a, int bind) { struct tcf_ife_info *ife = a->priv; @@ -385,7 +382,6 @@ static void tcf_ife_cleanup(struct tc_action *a, int bind) spin_unlock_bh(>tcf_lock); } -/* under ife->tcf_lock */ static int populate_metalist(struct tcf_ife_info *ife, struct nlattr **tb) { int len = 0; @@ -465,6 +461,8 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla, } ife = to_ife(a); + if (exists) + spin_lock_bh(>tcf_lock); ife->flags = parm->flags; if (parm->flags & IFE_ENCODE) { @@ -475,10 +473,9 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla, saddr = nla_data(tb[TCA_IFE_SMAC]); } - spin_lock_bh(>tcf_lock); ife->tcf_action = parm->action; - if (parm->flags & IFE_ENCODE) { + if (daddr) ether_addr_copy(ife->eth_dst, daddr); else @@ -492,6 +489,9 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla, ife->eth_type = ife_type; } + if (exists) + spin_unlock_bh(>tcf_lock); + if (ret == ACT_P_CREATED) INIT_LIST_HEAD(>metalist); @@ -505,7 +505,6 @@ metadata_parse_err: if (ret == ACT_P_CREATED) _tcf_ife_cleanup(a, bind); - spin_unlock_bh(>tcf_lock); return err; } @@ -524,13 +523,10 @@ metadata_parse_err: if (ret == ACT_P_CREATED) _tcf_ife_cleanup(a, bind); - spin_unlock_bh(>tcf_lock); return err; } } - spin_unlock_bh(>tcf_lock); - if (ret == ACT_P_CREATED) tcf_hash_insert(tn, a);
Re: [BUG] act_ife: sleeping functions called in atomic context
On 16-06-17 01:38 AM, Cong Wang wrote: On Thu, Jun 16, 2016 at 7:14 PM, Cong Wangwrote: I think we can just remove that tcf_lock, I am testing a patch now. Please try the attached patch, I will do more tests tomorrow. Thanks! Cong, What tree are you using? I dont see the time aggregation patches that I sent (and Dave took in) in your changes. Comments: Is GFP_ATOMIC really necessary? Thats user->kernel interface. GFP_KERNEL should be sufficient. Also, it would be nice to kill the lock - but this feels like two patches in one. 1) to fix the alloc not to be under the lock 2) to kill said lock. Maybe split it as such for easier review. I am using this action extensively so will be happy to test. I think my patch is a good beginning to #1 - if you fix the forgotten unlock and ensure we lock around updating ife fields when it exists already (you said it in your earlier email and I thought about that afterwards). cheers, jamal
Re: [BUG] act_ife: sleeping functions called in atomic context
On 16-06-17 01:38 AM, Cong Wang wrote: On Thu, Jun 16, 2016 at 7:14 PM, Cong Wang wrote: I think we can just remove that tcf_lock, I am testing a patch now. Please try the attached patch, I will do more tests tomorrow. Thanks! Cong, What tree are you using? I dont see the time aggregation patches that I sent (and Dave took in) in your changes. Comments: Is GFP_ATOMIC really necessary? Thats user->kernel interface. GFP_KERNEL should be sufficient. Also, it would be nice to kill the lock - but this feels like two patches in one. 1) to fix the alloc not to be under the lock 2) to kill said lock. Maybe split it as such for easier review. I am using this action extensively so will be happy to test. I think my patch is a good beginning to #1 - if you fix the forgotten unlock and ensure we lock around updating ife fields when it exists already (you said it in your earlier email and I thought about that afterwards). cheers, jamal
Re: [BUG] act_ife: sleeping functions called in atomic context
On 16-06-16 05:43 PM, Cong Wang wrote: On Thu, Jun 16, 2016 at 1:50 PM, Alexey Khoroshilovwrote: tcf_ife_init() contains a big chunk of code executed with ife->tcf_lock spinlock held. But that code contains several calls to sleeping functions: populate_metalist() and use_all_metadata() -> add_metainfo() -> find_ife_oplist(metaid) -> read_lock() -> try_module_get(o->owner) -> kzalloc(sizeof(*mi), GFP_KERNEL); Hmm, we don't need to hold that spinlock when we create a new ife action, since we haven't inserted it yet. We do need it when we modify an existing one. So I am thinking if we can refactor that code to avoid spinlock whenever possible. Does attached (compile tested) patch help? -> ops->alloc(mi, metaval); -> module_put(ops->owner); _tcf_ife_cleanup() -> module_put() The same problem is actual for tcf_ife_cleanup() as well. Huh? Both module_put() and kfree() should not sleep, right? I dont think there is any sleeping there ;-> cheers, jamal diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c index 6bbc518..e341bef 100644 --- a/net/sched/act_ife.c +++ b/net/sched/act_ife.c @@ -302,7 +302,9 @@ static int add_metainfo(struct tcf_ife_info *ife, u32 metaid, void *metaval, } } + spin_lock_bh(>tcf_lock); list_add_tail(>metalist, >metalist); + spin_unlock_bh(>tcf_lock); return ret; } @@ -474,7 +476,6 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla, saddr = nla_data(tb[TCA_IFE_SMAC]); } - spin_lock_bh(>tcf_lock); ife->tcf_action = parm->action; if (parm->flags & IFE_ENCODE) { @@ -504,7 +505,6 @@ metadata_parse_err: if (ret == ACT_P_CREATED) _tcf_ife_cleanup(a, bind); - spin_unlock_bh(>tcf_lock); return err; } @@ -523,13 +523,10 @@ metadata_parse_err: if (ret == ACT_P_CREATED) _tcf_ife_cleanup(a, bind); - spin_unlock_bh(>tcf_lock); return err; } } - spin_unlock_bh(>tcf_lock); - if (ret == ACT_P_CREATED) tcf_hash_insert(tn, a);
Re: [BUG] act_ife: sleeping functions called in atomic context
On 16-06-16 05:43 PM, Cong Wang wrote: On Thu, Jun 16, 2016 at 1:50 PM, Alexey Khoroshilov wrote: tcf_ife_init() contains a big chunk of code executed with ife->tcf_lock spinlock held. But that code contains several calls to sleeping functions: populate_metalist() and use_all_metadata() -> add_metainfo() -> find_ife_oplist(metaid) -> read_lock() -> try_module_get(o->owner) -> kzalloc(sizeof(*mi), GFP_KERNEL); Hmm, we don't need to hold that spinlock when we create a new ife action, since we haven't inserted it yet. We do need it when we modify an existing one. So I am thinking if we can refactor that code to avoid spinlock whenever possible. Does attached (compile tested) patch help? -> ops->alloc(mi, metaval); -> module_put(ops->owner); _tcf_ife_cleanup() -> module_put() The same problem is actual for tcf_ife_cleanup() as well. Huh? Both module_put() and kfree() should not sleep, right? I dont think there is any sleeping there ;-> cheers, jamal diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c index 6bbc518..e341bef 100644 --- a/net/sched/act_ife.c +++ b/net/sched/act_ife.c @@ -302,7 +302,9 @@ static int add_metainfo(struct tcf_ife_info *ife, u32 metaid, void *metaval, } } + spin_lock_bh(>tcf_lock); list_add_tail(>metalist, >metalist); + spin_unlock_bh(>tcf_lock); return ret; } @@ -474,7 +476,6 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla, saddr = nla_data(tb[TCA_IFE_SMAC]); } - spin_lock_bh(>tcf_lock); ife->tcf_action = parm->action; if (parm->flags & IFE_ENCODE) { @@ -504,7 +505,6 @@ metadata_parse_err: if (ret == ACT_P_CREATED) _tcf_ife_cleanup(a, bind); - spin_unlock_bh(>tcf_lock); return err; } @@ -523,13 +523,10 @@ metadata_parse_err: if (ret == ACT_P_CREATED) _tcf_ife_cleanup(a, bind); - spin_unlock_bh(>tcf_lock); return err; } } - spin_unlock_bh(>tcf_lock); - if (ret == ACT_P_CREATED) tcf_hash_insert(tn, a);
Re: [PATCH net-next V2] tun: introduce tx skb ring
On 16-06-15 07:52 AM, Jamal Hadi Salim wrote: > On 16-06-15 04:38 AM, Jason Wang wrote: >> We used to queue tx packets in sk_receive_queue, this is less >> efficient since it requires spinlocks to synchronize between producer > > So this is more exercising the skb array improvements. For tun > it would be useful to see general performance numbers on user/kernel > crossing (i.e tun read/write). > If you have the cycles can you run such tests? > Ignore my message - you are running pktgen from a VM towards the host. So the numbers you posted are what i was interested in. Thanks for the good work. cheers, jamal
Re: [PATCH net-next V2] tun: introduce tx skb ring
On 16-06-15 07:52 AM, Jamal Hadi Salim wrote: > On 16-06-15 04:38 AM, Jason Wang wrote: >> We used to queue tx packets in sk_receive_queue, this is less >> efficient since it requires spinlocks to synchronize between producer > > So this is more exercising the skb array improvements. For tun > it would be useful to see general performance numbers on user/kernel > crossing (i.e tun read/write). > If you have the cycles can you run such tests? > Ignore my message - you are running pktgen from a VM towards the host. So the numbers you posted are what i was interested in. Thanks for the good work. cheers, jamal
Re: [PATCH net-next V2] tun: introduce tx skb ring
On 16-06-15 04:38 AM, Jason Wang wrote: > We used to queue tx packets in sk_receive_queue, this is less > efficient since it requires spinlocks to synchronize between producer > and consumer. > > This patch tries to address this by: > > - introduce a new mode which will be only enabled with IFF_TX_ARRAY >set and switch from sk_receive_queue to a fixed size of skb >array with 256 entries in this mode. > - introduce a new proto_ops peek_len which was used for peeking the >skb length. > - implement a tun version of peek_len for vhost_net to use and convert >vhost_net to use peek_len if possible. > > Pktgen test shows about 18% improvement on guest receiving pps for small > buffers: > > Before: ~122pps > After : ~144pps > So this is more exercising the skb array improvements. For tun it would be useful to see general performance numbers on user/kernel crossing (i.e tun read/write). If you have the cycles can you run such tests? cheers, jamal
Re: [PATCH net-next V2] tun: introduce tx skb ring
On 16-06-15 04:38 AM, Jason Wang wrote: > We used to queue tx packets in sk_receive_queue, this is less > efficient since it requires spinlocks to synchronize between producer > and consumer. > > This patch tries to address this by: > > - introduce a new mode which will be only enabled with IFF_TX_ARRAY >set and switch from sk_receive_queue to a fixed size of skb >array with 256 entries in this mode. > - introduce a new proto_ops peek_len which was used for peeking the >skb length. > - implement a tun version of peek_len for vhost_net to use and convert >vhost_net to use peek_len if possible. > > Pktgen test shows about 18% improvement on guest receiving pps for small > buffers: > > Before: ~122pps > After : ~144pps > So this is more exercising the skb array improvements. For tun it would be useful to see general performance numbers on user/kernel crossing (i.e tun read/write). If you have the cycles can you run such tests? cheers, jamal
Re: Deleting child qdisc doesn't reset parent to default qdisc?
On 16-04-14 01:49 PM, Eric Dumazet wrote: And what would be the chosen behavior ? TBF is probably a bad example because it started life as a classless qdisc. There was only one built-in fifo queue that was shaped. Then someone made it classful and changed this behavior. To me it sounds reasonable to have the default behavior restored. At minimal consistency. Relying on TBF installing a bfifo for you at delete would be hazardous. For example CBQ got it differently than HFSC If qdisc_create_dflt() fails in CBQ, we fail the 'delete', while HFSC falls back to noop_qdisc, without warning the user :( At least always using noop_qdisc is consistent. No magic there. Doing 'unification' right now would break existing scripts. This is too late, I am afraid. Sigh. So rant: IMO, we should let any new APIs and API updates stay longer in discussion. Or better mark them as unstable for sometime. The excuse that "it is out in the wild therefore cant be changed" is harmful because the timeline is "forever" whereas patches are applied after a short period of posting and discussions and sometimes not involving the right people. It is like having a jury issuing a death sentence after 1 week of deliberation. You cant take it back after execution. cheers, jamal
Re: Deleting child qdisc doesn't reset parent to default qdisc?
On 16-04-14 01:49 PM, Eric Dumazet wrote: And what would be the chosen behavior ? TBF is probably a bad example because it started life as a classless qdisc. There was only one built-in fifo queue that was shaped. Then someone made it classful and changed this behavior. To me it sounds reasonable to have the default behavior restored. At minimal consistency. Relying on TBF installing a bfifo for you at delete would be hazardous. For example CBQ got it differently than HFSC If qdisc_create_dflt() fails in CBQ, we fail the 'delete', while HFSC falls back to noop_qdisc, without warning the user :( At least always using noop_qdisc is consistent. No magic there. Doing 'unification' right now would break existing scripts. This is too late, I am afraid. Sigh. So rant: IMO, we should let any new APIs and API updates stay longer in discussion. Or better mark them as unstable for sometime. The excuse that "it is out in the wild therefore cant be changed" is harmful because the timeline is "forever" whereas patches are applied after a short period of posting and discussions and sometimes not involving the right people. It is like having a jury issuing a death sentence after 1 week of deliberation. You cant take it back after execution. cheers, jamal
Re: [PATCH net v3 2/4] net: add rx_nohandler stat counter
On 16-02-09 03:40 AM, David Miller wrote: From: Eric Dumazet Date: Mon, 08 Feb 2016 14:57:40 -0800 Whole point of TLV is that it allows us to add new fields at the end of the structures. ... Look at iproute2, you were the one adding in 2004 code to cope with various tcp_info sizes. So 12 years later, you cannot say it does not work anymore. +1 The TLV L should be canonical way to determine length. i.e should be sufficient to just look at L and understand that content has changed. But: Using sizeof could be dangerous unless the data is packed to be 32-bit aligned. Looking INET_DIAG_INFO check for sizeof there is a small 8 bit hole in tcp_info I think between these two fields: __u8tcpi_snd_wscale : 4, tcpi_rcv_wscale : 4; __u32 tcpi_rto; --- The kernel will pad to make sure the TLV data is 32-bit aligned. I am not sure if that will be the same length as sizeof() in all hardware + compilers... For this case, it is almost safe to just add a version field - probably in the hole. Or have a #define to say what the expected length should be. Or add an 8 bit pad. In general adding new fields that are non-optional is problematic. i.e by non-optional i mean always expected to be present. I think a good test is old kernel with new iproute2. If the new field is non-optional, it will fail (example iproute2 may try to print a value that it expects but because old kernel doesnt understand it; it is non-existent). cheers, jamal
Re: [PATCH net v3 2/4] net: add rx_nohandler stat counter
On 16-02-09 03:40 AM, David Miller wrote: From: Eric DumazetDate: Mon, 08 Feb 2016 14:57:40 -0800 Whole point of TLV is that it allows us to add new fields at the end of the structures. ... Look at iproute2, you were the one adding in 2004 code to cope with various tcp_info sizes. So 12 years later, you cannot say it does not work anymore. +1 The TLV L should be canonical way to determine length. i.e should be sufficient to just look at L and understand that content has changed. But: Using sizeof could be dangerous unless the data is packed to be 32-bit aligned. Looking INET_DIAG_INFO check for sizeof there is a small 8 bit hole in tcp_info I think between these two fields: __u8tcpi_snd_wscale : 4, tcpi_rcv_wscale : 4; __u32 tcpi_rto; --- The kernel will pad to make sure the TLV data is 32-bit aligned. I am not sure if that will be the same length as sizeof() in all hardware + compilers... For this case, it is almost safe to just add a version field - probably in the hole. Or have a #define to say what the expected length should be. Or add an 8 bit pad. In general adding new fields that are non-optional is problematic. i.e by non-optional i mean always expected to be present. I think a good test is old kernel with new iproute2. If the new field is non-optional, it will fail (example iproute2 may try to print a value that it expects but because old kernel doesnt understand it; it is non-existent). cheers, jamal
Re: [PATCH 0/7] Freescale DPAA FMan FLIB(s)
Hi Emil, On 03/05/15 10:04, Emil Medve wrote: Hello Jamal, On 03/05/2015 08:35 AM, Jamal Hadi Salim wrote: Hi Emil, No. All the kernel drivers/code we want to upstream is meant to stand on its own and be used the "normal" Linux/Unix way Ok, thanks - that was my only concern. Note there is a lot of offload efforts going on in Linux right now. Take a look at the proceedings from netdev01. You should take advantage of that to shape the direction of your patches. I have suffered at the hands of the sdk for this processor in the past; it is just a bad idea to keep these SDKs around when Linux can express the features sufficiently. Note: The main thing i'd be interested in is when you get to offloading the classifier/scheduler etc. I know you are focussing on just the ethernet level at the moment, cheers, jamal -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/7] Freescale DPAA FMan FLIB(s)
Hi Emil, On 03/05/15 10:04, Emil Medve wrote: Hello Jamal, On 03/05/2015 08:35 AM, Jamal Hadi Salim wrote: Hi Emil, No. All the kernel drivers/code we want to upstream is meant to stand on its own and be used the normal Linux/Unix way Ok, thanks - that was my only concern. Note there is a lot of offload efforts going on in Linux right now. Take a look at the proceedings from netdev01. You should take advantage of that to shape the direction of your patches. I have suffered at the hands of the sdk for this processor in the past; it is just a bad idea to keep these SDKs around when Linux can express the features sufficiently. Note: The main thing i'd be interested in is when you get to offloading the classifier/scheduler etc. I know you are focussing on just the ethernet level at the moment, cheers, jamal -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/7] Freescale DPAA FMan FLIB(s)
Hi Emil, On 03/05/15 08:48, Emil Medve wrote: The intent is to upstream the entire suite of the DPAA drivers. All the drivers are still WIP, but B/QMan have been already presented to the upstream community and this is the first attempt to publish (some low level code of) the FMan driver. As we go through our internal checklist and in the same time address community feedback we'll soon get the drivers to be acceptable for the upstream trees The first version of the actual Ethernet driver will follow imminently SDK enablement is a side-effect Meaning? Let me ask the question differently: Do i need your sdk to use the features exposed or can i use something like tc to set up the deficit rr or wred or the exposed classifiers and associated actions? Would your sdk (via user space direct programming) benefit because you have pushed these pieces into the kernel? How are you planning to add support for your classifiers, queue schedulers etc? Yes Yes as in these will be available via linux kernel or via your sdk? Is that a patch on top of this or it is something that sits on user space? Both. Full DPAA/Ethernet enablement will be present in the kernel. We also have support for user-space based approach. I'm unsure where/when we might publish that. Of course the SDK is always a place you can turn to for all the code we have (in whatever state it might be) the sdk is open source? cheers, jamal -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/7] Freescale DPAA FMan FLIB(s)
On 03/05/15 00:45, Emil Medve wrote: From: Igal Liberman The Freescale Data Path Acceleration Architecture (DPAA) is a set of hardware components on specific QorIQ P and T series multicore processors. This architecture provides the infrastructure to support simplified sharing of networking interfaces and accelerators by multiple CPU cores, and the accelerators themselves. One of the DPAA accelerators is the Frame Manager (FMan), which combines the Ethernet network interfaces with packet distribution logic to provide intelligent distribution and queuing decisions for incoming traffic at line rate. This patch presents the FMan Foundation Libraries (FLIB) headers. The FMan FLIB suite adds basic support for the DPAA FMan hardware register access. The FMan FLIB suite is used in Freescale's SDK Releases. Is this intended to merely enable your sdk? How are you planning to add support for your classifiers, queue schedulers etc? Is that a patch on top of this or it is something that sits on user space? cheers, jamal -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/7] Freescale DPAA FMan FLIB(s)
On 03/05/15 00:45, Emil Medve wrote: From: Igal Liberman igal.liber...@freescale.com The Freescale Data Path Acceleration Architecture (DPAA) is a set of hardware components on specific QorIQ P and T series multicore processors. This architecture provides the infrastructure to support simplified sharing of networking interfaces and accelerators by multiple CPU cores, and the accelerators themselves. One of the DPAA accelerators is the Frame Manager (FMan), which combines the Ethernet network interfaces with packet distribution logic to provide intelligent distribution and queuing decisions for incoming traffic at line rate. This patch presents the FMan Foundation Libraries (FLIB) headers. The FMan FLIB suite adds basic support for the DPAA FMan hardware register access. The FMan FLIB suite is used in Freescale's SDK Releases. Is this intended to merely enable your sdk? How are you planning to add support for your classifiers, queue schedulers etc? Is that a patch on top of this or it is something that sits on user space? cheers, jamal -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/7] Freescale DPAA FMan FLIB(s)
Hi Emil, On 03/05/15 08:48, Emil Medve wrote: The intent is to upstream the entire suite of the DPAA drivers. All the drivers are still WIP, but B/QMan have been already presented to the upstream community and this is the first attempt to publish (some low level code of) the FMan driver. As we go through our internal checklist and in the same time address community feedback we'll soon get the drivers to be acceptable for the upstream trees The first version of the actual Ethernet driver will follow imminently SDK enablement is a side-effect Meaning? Let me ask the question differently: Do i need your sdk to use the features exposed or can i use something like tc to set up the deficit rr or wred or the exposed classifiers and associated actions? Would your sdk (via user space direct programming) benefit because you have pushed these pieces into the kernel? How are you planning to add support for your classifiers, queue schedulers etc? Yes Yes as in these will be available via linux kernel or via your sdk? Is that a patch on top of this or it is something that sits on user space? Both. Full DPAA/Ethernet enablement will be present in the kernel. We also have support for user-space based approach. I'm unsure where/when we might publish that. Of course the SDK is always a place you can turn to for all the code we have (in whatever state it might be) the sdk is open source? cheers, jamal -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 0/3] net: Add Keystone NetCP ethernet driver support
On 09/09/14 11:19, Santosh Shilimkar wrote: All the documentation is open including packet accelerator offload in ti.com. Very nice. Would you do me a kindness and point to the switch interface documentation (and other ones on that soc)? We got such requests from customers but couldn't support it for Linux. It has been difficult because every chip vendor is trying to do their own thing. Some have huge (fugly) SDKs in user space which make it worse. Thats the struggle we are trying to deal with. Of course none of those vendors want to open up their specs. You present a nice opportunity to not follow that path. We are also looking for such support and any direction are welcome. Your slide deck seems to capture the key topics like L2/IPSEC offload which we are also interested to hear. The slides list the most popular offloads. But not necessarily all known offloads. Just to be clear, your point was about L2 switch offload which the driver don't support at the moment. It might confuse others. The driver doesn't implements anything non-standard. If i understood you correctly: Your initial patches dont intend to expose any offloads - you are just abstracting this as a NIC. I think that is a legit reason. However, the problem is you are also exposing the packet processors and switch offloading in a proprietary way. For a sample of how L2 basic functions like FDB tables are controlled within a NIC - take a look at the Intel NICs. Either that or you hide all the offload interfaces and over time add them (starting with L2 - NICs with L2 are common). cheers, jamal -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 0/3] net: Add Keystone NetCP ethernet driver support
On 09/09/14 11:19, Santosh Shilimkar wrote: All the documentation is open including packet accelerator offload in ti.com. Very nice. Would you do me a kindness and point to the switch interface documentation (and other ones on that soc)? We got such requests from customers but couldn't support it for Linux. It has been difficult because every chip vendor is trying to do their own thing. Some have huge (fugly) SDKs in user space which make it worse. Thats the struggle we are trying to deal with. Of course none of those vendors want to open up their specs. You present a nice opportunity to not follow that path. We are also looking for such support and any direction are welcome. Your slide deck seems to capture the key topics like L2/IPSEC offload which we are also interested to hear. The slides list the most popular offloads. But not necessarily all known offloads. Just to be clear, your point was about L2 switch offload which the driver don't support at the moment. It might confuse others. The driver doesn't implements anything non-standard. If i understood you correctly: Your initial patches dont intend to expose any offloads - you are just abstracting this as a NIC. I think that is a legit reason. However, the problem is you are also exposing the packet processors and switch offloading in a proprietary way. For a sample of how L2 basic functions like FDB tables are controlled within a NIC - take a look at the Intel NICs. Either that or you hide all the offload interfaces and over time add them (starting with L2 - NICs with L2 are common). cheers, jamal -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 0/3] net: Add Keystone NetCP ethernet driver support
On 09/08/14 10:41, Santosh Shilimkar wrote: The NetCP plugin module infrastructure use all the standard kernel infrastructure and its very tiny. So i found this manual here: http://www.silica.com/fileadmin/02_Products/Productdetails/Texas_Instruments/SILICA_TI_66AK2E05-ds.pdf Glad there is an open document! There are a couple of ethernet switch chips I can spot there. Can i control those with "bridge" or say "brctl" utilities? I can see the bridge ports are exposed and i should be able to control them via ifconfig or ip link. Thats what "standard kernel infrastructure" means. Magic hidden in a driver is not. Take a look at recent netconf discussion (as well as earlier referenced discussions): http://vger.kernel.org/netconf-nf-offload.pdf Maybe we can help providing you some direction? The problem is it doesnt seem that the offload specs for those other pieces are open? e.g how do i add an entry to the L2 switch? cheers, jamal -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 0/3] net: Add Keystone NetCP ethernet driver support
On 09/08/14 10:41, Santosh Shilimkar wrote: The NetCP plugin module infrastructure use all the standard kernel infrastructure and its very tiny. So i found this manual here: http://www.silica.com/fileadmin/02_Products/Productdetails/Texas_Instruments/SILICA_TI_66AK2E05-ds.pdf Glad there is an open document! There are a couple of ethernet switch chips I can spot there. Can i control those with bridge or say brctl utilities? I can see the bridge ports are exposed and i should be able to control them via ifconfig or ip link. Thats what standard kernel infrastructure means. Magic hidden in a driver is not. Take a look at recent netconf discussion (as well as earlier referenced discussions): http://vger.kernel.org/netconf-nf-offload.pdf Maybe we can help providing you some direction? The problem is it doesnt seem that the offload specs for those other pieces are open? e.g how do i add an entry to the L2 switch? cheers, jamal -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 2/3] net: Add Keystone NetCP ethernet driver
On 08/15/14 11:12, Santosh Shilimkar wrote: I am curious about these two calls below(netcp_process_one_rx_packet and netcp_tx_submit_skb): On tx you seem to be broadcasting to all "sub-modules" and on receive you seem to be invoking from all as well. I couldnt find the code for any of the sub-modules but i suspect things like switching/bridging or ipsec etc could be sub-modules. If yes, have you thought about integrating these features into Linux proper instead? cheers, jamal + +static inline int netcp_process_one_rx_packet(struct netcp_intf *netcp) +{ + + /* Call each of the RX hooks */ + p_info.skb = skb; + p_info.rxtstamp_complete = false; + list_for_each_entry(rx_hook, >rxhook_list_head, list) { + int ret; + + ret = rx_hook->hook_rtn(rx_hook->order, rx_hook->hook_data, + _info); + if (unlikely(ret)) { + dev_err(netcp->ndev_dev, "RX hook %d failed: %d\n", + rx_hook->order, ret); + netcp->ndev->stats.rx_errors++; + dev_kfree_skb(skb); + return 0; + } + } .. + return 0; +} [..] +static inline int netcp_tx_submit_skb(struct netcp_intf *netcp, + struct sk_buff *skb, + struct knav_dma_desc *desc) +{ + struct netcp_tx_pipe *tx_pipe = NULL; + + p_info.netcp = netcp; + p_info.skb = skb; + p_info.tx_pipe = NULL; + p_info.psdata_len = 0; + p_info.ts_context = NULL; + p_info.txtstamp_complete = NULL; + p_info.epib = desc->epib; + p_info.psdata = desc->psdata; + memset(p_info.epib, 0, KNAV_DMA_NUM_EPIB_WORDS * sizeof(u32)); + + /* Find out where to inject the packet for transmission */ + list_for_each_entry(tx_hook, >txhook_list_head, list) { + ret = tx_hook->hook_rtn(tx_hook->order, tx_hook->hook_data, + _info); + if (unlikely(ret != 0)) { + dev_err(netcp->ndev_dev, "TX hook %d rejected the packet with reason(%d)\n", + tx_hook->order, ret); + ret = (ret < 0) ? ret : NETDEV_TX_OK; + goto out; + } + } + -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 2/3] net: Add Keystone NetCP ethernet driver
On 08/15/14 11:12, Santosh Shilimkar wrote: I am curious about these two calls below(netcp_process_one_rx_packet and netcp_tx_submit_skb): On tx you seem to be broadcasting to all sub-modules and on receive you seem to be invoking from all as well. I couldnt find the code for any of the sub-modules but i suspect things like switching/bridging or ipsec etc could be sub-modules. If yes, have you thought about integrating these features into Linux proper instead? cheers, jamal + +static inline int netcp_process_one_rx_packet(struct netcp_intf *netcp) +{ + + /* Call each of the RX hooks */ + p_info.skb = skb; + p_info.rxtstamp_complete = false; + list_for_each_entry(rx_hook, netcp-rxhook_list_head, list) { + int ret; + + ret = rx_hook-hook_rtn(rx_hook-order, rx_hook-hook_data, + p_info); + if (unlikely(ret)) { + dev_err(netcp-ndev_dev, RX hook %d failed: %d\n, + rx_hook-order, ret); + netcp-ndev-stats.rx_errors++; + dev_kfree_skb(skb); + return 0; + } + } .. + return 0; +} [..] +static inline int netcp_tx_submit_skb(struct netcp_intf *netcp, + struct sk_buff *skb, + struct knav_dma_desc *desc) +{ + struct netcp_tx_pipe *tx_pipe = NULL; + + p_info.netcp = netcp; + p_info.skb = skb; + p_info.tx_pipe = NULL; + p_info.psdata_len = 0; + p_info.ts_context = NULL; + p_info.txtstamp_complete = NULL; + p_info.epib = desc-epib; + p_info.psdata = desc-psdata; + memset(p_info.epib, 0, KNAV_DMA_NUM_EPIB_WORDS * sizeof(u32)); + + /* Find out where to inject the packet for transmission */ + list_for_each_entry(tx_hook, netcp-txhook_list_head, list) { + ret = tx_hook-hook_rtn(tx_hook-order, tx_hook-hook_data, + p_info); + if (unlikely(ret != 0)) { + dev_err(netcp-ndev_dev, TX hook %d rejected the packet with reason(%d)\n, + tx_hook-order, ret); + ret = (ret 0) ? ret : NETDEV_TX_OK; + goto out; + } + } + -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] xfrm: fix handling of XFRM policies mark and mask.
On 13-02-06 09:39 AM, Emmanuel Thierry wrote: I think you misread the example ! I did ;-> Marks are both 1, masks are different. This case is more complex than a policy with no mark (so mark=0 and mask=0) versus a policy with an exact mark (so mark=1 and mask=0x), and i wanted to know if the algorithm would take these kind of cases into account. Aha. I think this is pushing the envelope a little - are there good use cases for this? certainly you could insert with most exact mask first. No such check is made at the moment. cheers, jamal -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] xfrm: fix handling of XFRM policies mark and mask.
On 13-02-06 08:53 AM, Emmanuel Thierry wrote: Actually, we didn't think about this problem since we work with priorities, putting the default policy (without a mark) at a minor priority than the marked one. I think priorities are the way to go in cases of ambiguity. Your remark makes clearer the ideas behind the design of XFRM, but this leads to an interesting concern. If on policy insertion, the policy were inserted depending on the accuracy of the mark (the more the mask is specific, the more the mark must be put at the beginning of the list), how would we decide which is the more specific between these ones ? ip -6 xfrm policy add src fd00::1/128 dst fd00::2/128 dir out mark 0x0001 mask 0x0005 ip -6 xfrm policy add src fd00::1/128 dst fd00::2/128 dir out mark 0x0001 mask 0x0003 They look different to me, no? i.e i dont see a conflict - one has mark=5 and the other has mark=3. cheers, jamal -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] xfrm: fix handling of XFRM policies mark and mask.
On 13-02-06 08:53 AM, Emmanuel Thierry wrote: Actually, we didn't think about this problem since we work with priorities, putting the default policy (without a mark) at a minor priority than the marked one. I think priorities are the way to go in cases of ambiguity. Your remark makes clearer the ideas behind the design of XFRM, but this leads to an interesting concern. If on policy insertion, the policy were inserted depending on the accuracy of the mark (the more the mask is specific, the more the mark must be put at the beginning of the list), how would we decide which is the more specific between these ones ? ip -6 xfrm policy add src fd00::1/128 dst fd00::2/128 dir out mark 0x0001 mask 0x0005 ip -6 xfrm policy add src fd00::1/128 dst fd00::2/128 dir out mark 0x0001 mask 0x0003 They look different to me, no? i.e i dont see a conflict - one has mark=5 and the other has mark=3. cheers, jamal -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] xfrm: fix handling of XFRM policies mark and mask.
On 13-02-06 09:39 AM, Emmanuel Thierry wrote: I think you misread the example ! I did ;- Marks are both 1, masks are different. This case is more complex than a policy with no mark (so mark=0 and mask=0) versus a policy with an exact mark (so mark=1 and mask=0x), and i wanted to know if the algorithm would take these kind of cases into account. Aha. I think this is pushing the envelope a little - are there good use cases for this? certainly you could insert with most exact mask first. No such check is made at the moment. cheers, jamal -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [net V2] act_mirred: do not drop packets when fails to mirror it
On Thu, 2012-08-16 at 14:44 +0800, Jason Wang wrote: > We drop packet unconditionally when we fail to mirror it. This is not intended > in some cases. Consdier for kvm guest, we may mirror the traffic of the bridge > to a tap device used by a VM. When kernel fails to mirror the packet in > conditions such as when qemu crashes or stop polling the tap, it's hard for > the > management software to detect such condition and clean the the mirroring > before. This would lead all packets to the bridge to be dropped and break the > netowrk of other virtual machines. > > To solve the issue, the patch does not drop packets when kernel fails to > mirror > it, and only drop the redirected packets. > > Signed-off-by: Jason Wang Signed-off-by: Jamal Hadi Salim cheers, jamal -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [net V2] act_mirred: do not drop packets when fails to mirror it
On Thu, 2012-08-16 at 14:44 +0800, Jason Wang wrote: We drop packet unconditionally when we fail to mirror it. This is not intended in some cases. Consdier for kvm guest, we may mirror the traffic of the bridge to a tap device used by a VM. When kernel fails to mirror the packet in conditions such as when qemu crashes or stop polling the tap, it's hard for the management software to detect such condition and clean the the mirroring before. This would lead all packets to the bridge to be dropped and break the netowrk of other virtual machines. To solve the issue, the patch does not drop packets when kernel fails to mirror it, and only drop the redirected packets. Signed-off-by: Jason Wang jasow...@redhat.com Signed-off-by: Jamal Hadi Salim j...@mojatatu.com cheers, jamal -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] act_mirred: do not drop packets when fails to mirror it
On Wed, 2012-08-15 at 21:42 +0800, Jason Wang wrote: > > I met it actually through the following steps: > > - start a kvm guest with tap and make it to be an interface of the bridge > - mirror the ingress traffic of the bridge to the tap > - terminate the qemu process, the tap device is then removed > - all packet goes to bridge would be dropped, so the network of guests > in the same bridge would be broken > Makes sense. Can you please leave the err check braces i.e if (err) { m->tcf_qstats.overlimits++; if (m->tcfm_eaction != TCA_EGRESS_MIRROR) retval = TC_ACT_SHOT; else retval = m->tcf_action; } else { retval = m->tcf_action; } Or at least dont use TC_ACT_STOLEN. cheers, jamal -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] act_mirred: do not drop packets when fails to mirror it
On Wed, 2012-08-15 at 17:37 +0800, Jason Wang wrote: > We drop packet unconditionally when we fail to mirror it. This is not intended > in some cases. Hi Jason, Did you actually notice the behavior you described or were you going by the XXX comment I had in the code? cheers, jamal -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 15/21] net sched: Pass the skb into change so it can access NETLINK_CB
On Mon, 2012-08-13 at 13:18 -0700, Eric W. Biederman wrote: > From: "Eric W. Biederman" > > cls_flow.c plays with uids and gids. Unless I misread that > code it is possible for classifiers to depend on the specific uid and > gid values. Therefore I need to know the user namespace of the > netlink socket that is installing the packet classifiers. Pass > in the rtnetlink skb so I can access the NETLINK_CB of the passed > packet. In particular I want access to sk_user_ns(NETLINK_CB(in_skb).ssk). > > Pass in not the user namespace but the incomming rtnetlink skb into > the the classifier change routines as that is generally the more useful > parameter. > > Cc: Jamal Hadi Salim > Acked-by: Serge Hallyn > Signed-off-by: Eric W. Biederman Acked-by: Jamal Hadi Salim cheers, jamal -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 15/21] net sched: Pass the skb into change so it can access NETLINK_CB
On Mon, 2012-08-13 at 13:18 -0700, Eric W. Biederman wrote: From: Eric W. Biederman ebied...@xmission.com cls_flow.c plays with uids and gids. Unless I misread that code it is possible for classifiers to depend on the specific uid and gid values. Therefore I need to know the user namespace of the netlink socket that is installing the packet classifiers. Pass in the rtnetlink skb so I can access the NETLINK_CB of the passed packet. In particular I want access to sk_user_ns(NETLINK_CB(in_skb).ssk). Pass in not the user namespace but the incomming rtnetlink skb into the the classifier change routines as that is generally the more useful parameter. Cc: Jamal Hadi Salim j...@mojatatu.com Acked-by: Serge Hallyn serge.hal...@canonical.com Signed-off-by: Eric W. Biederman ebied...@xmission.com Acked-by: Jamal Hadi Salim j...@mojatatu.com cheers, jamal -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] act_mirred: do not drop packets when fails to mirror it
On Wed, 2012-08-15 at 17:37 +0800, Jason Wang wrote: We drop packet unconditionally when we fail to mirror it. This is not intended in some cases. Hi Jason, Did you actually notice the behavior you described or were you going by the XXX comment I had in the code? cheers, jamal -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] act_mirred: do not drop packets when fails to mirror it
On Wed, 2012-08-15 at 21:42 +0800, Jason Wang wrote: I met it actually through the following steps: - start a kvm guest with tap and make it to be an interface of the bridge - mirror the ingress traffic of the bridge to the tap - terminate the qemu process, the tap device is then removed - all packet goes to bridge would be dropped, so the network of guests in the same bridge would be broken Makes sense. Can you please leave the err check braces i.e if (err) { m-tcf_qstats.overlimits++; if (m-tcfm_eaction != TCA_EGRESS_MIRROR) retval = TC_ACT_SHOT; else retval = m-tcf_action; } else { retval = m-tcf_action; } Or at least dont use TC_ACT_STOLEN. cheers, jamal -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/