Re: [v1,net-next 3/4] net: qos: police action add index for tc flower offloading

2020-06-26 Thread Jamal Hadi Salim

On 2020-06-24 8:34 p.m., Po Liu wrote:




-Original Message-



That is the point i was trying to get to. Basically:
You have a counter table which is referenced by "index"
You also have a meter/policer table which is referenced by "index".


They should be one same group and same meaning.



Didnt follow. You mean the index is the same for both the
stat and policer?



For policers, they maintain their own stats. So when i say:
tc ... flower ... action police ... index 5 The index referred to is in the
policer table



Sure. Means police with No. 5 entry.


But for other actions, example when i say:
tc ... flower ... action drop index 10


Still the question, does gact action drop could bind with index? It doesn't 
meanful.



Depends on your hardware. From this discussion i am
trying to understand where the constraint is for your case.
Whether it is your h/w or the TSN spec.
For a sample counting which is flexible see here:
https://p4.org/p4-spec/docs/PSA.html#sec-counters

That concept is not specific to P4 but rather to
newer flow-based hardware.

More context:
The assumption these days is we can have a _lot_ of flows with a lot
of actions.
Then you want to be able to collect the stats separately, possibly one
counter entry for each action of interest.
Why is this important?f For analytics uses cases,
when you are retrieving the stats you want to reduce the amount of
data being retrieved. Typically these stats are polled every X seconds.
For starters, you dont dump filters (which in your case seems to be
the only way to get the stats).
In current tc, you dump the actions. But that could be improved so
you can just dump the stats. The mapping of stats index to actions
is known to the entity doing the dump.

Does that make sense?


The index is in the counter/stats table.
It is not exactly "10" in hardware, the driver magically hides it from the
user - so it could be hw counter index 1234


Not exactly. Current flower offloading stats means get the chain index for that 
flow filter. The other actions should bind to that chain index.

>

So if i read correctly: You have an index per filter pointing to the
counter table.
Is this something _you_ decided to do in software or is it how the
hardware works? (note i referred to this as "legacy ACL" approach
earlier. It worked like that in old hardware because the main use
case was to have one action on a match (drop/accept kind).


Like IEEE802.1Qci, what I am doing is bind gate action to filter 
chain(mandatory). And also police action as optional.


I cant seem to find this spec online. Is it freely available?
Also, if i understand you correctly you are saying according to this
spec you can only have the following type of policy:
tc .. filter match-spec-here .. \
action gate gate-action-attributes \
action police ...

That "action gate" MUST always be present
but "action police" is optional?


There is stream counter table which summary the counters pass gate action entry 
and police action entry for that chain index(there is a bit different if two 
chain sharing same action list).
One chain counter which tc show stats get counter source:
struct psfp_streamfilter_counters {
 u64 matching_frames_count;
 u64 passing_frames_count;
 u64 not_passing_frames_count;
 u64 passing_sdu_count;
 u64 not_passing_sdu_count;
 u64 red_frames_count;
};



Assuming psfp is something defined in IEEE802.1Qci and the spec will
describe these?
Is the filter  "index" pointing to one of those in some counter table?



When pass to the user space, summarize as:
 stats.pkts = counters.matching_frames_count +  
counters.not_passing_sdu_count - filter->stats.pkts;

>

 stats.drops = counters.not_passing_frames_count + 
counters.not_passing_sdu_count +   counters.red_frames_count - 
filter->stats.drops;



Thanks for the explanation.
What is filter->stats?
The rest of those counters seem related to the gate action.
How do you account for policing actions?

cheers,
jamal



Re: [v1,net-next 3/4] net: qos: police action add index for tc flower offloading

2020-06-24 Thread Jamal Hadi Salim

On 2020-06-23 7:52 p.m., Po Liu wrote:

Hi Jamal,





My question: Is this any different from how stats are structured?




[..]

My question: Why cant you apply the same semantics for the counters?
Does your hardware have an indexed counter/stats table? If yes then you


Yes, 


That is the point i was trying to get to. Basically:
You have a counter table which is referenced by "index"
You also have a meter/policer table which is referenced by "index".

For policers, they maintain their own stats. So when i say:
tc ... flower ... action police ... index 5
The index referred to is in the policer table

But for other actions, example when i say:
tc ... flower ... action drop index 10
The index is in the counter/stats table.
It is not exactly "10" in hardware, the driver magically hides
it from the user - so it could be hw counter index 1234

The old approach is to assume the classifier (flower in this
case) has a counter. The reason for this assumption is older
hardware was designed to deal with a single action per match.
So a counter to the filter is also the counter to the
(single) action. I get the feeling your hardware fits in that
space.

Modern use cases have evolved from the ACL single match and action
approach. Maintaining the old thought/architecture breaks in two
use cases:
1) when you have multiple actions per policy filter. You need
counter-per-action for various reasons
2) Sharing of counters across filters and action. This can
be achieve

tc supports the above and is sufficient to cover the old use
cases.
I am just worried, architecturally, we are restricting ourselves
to the old scheme.

Another reason this is important is for the sake of analytics.
A user space app can poll just for the stats table in hardware
(or the cached version in the kernel) and reduce the amount of
data crossing to user space..

cheers,
jamal







Re: [v1,net-next 3/4] net: qos: police action add index for tc flower offloading

2020-06-23 Thread Jamal Hadi Salim

On 2020-06-23 7:55 a.m., Po Liu wrote:


[..]

My question: Is this any different from how stats are structured?


I don't know I fully catch the question. Are you trying to get how many frames 
for each filter chain passing one index policing action?
If one index police action bind to multiple tc filter(they should have differnt 
chain index ). All those filter should get same index police action stats value 
since they are sharing the same hardware entry. But I don't think this is the 
problem.



This is a good thing. What is nice is i can use the same index for
s/w and h/w (and no need for a translation/remapping).


With index provide to device driver(map the s/w action index to a h/w table 
index ), user could list the police actions list by command:
# tc actions show action police
Shows the police action table by index.


This is also nice.

My question: Why cant you apply the same semantics for the counters?
Does your hardware have an indexed counter/stats table? If yes
then you should be able to do similar thing for counters
as you do for policer (i.e use an index and share counters across
actions). So when i say:
tc action drop index 5
and
tc action ok index 5
infact they use the same counter.


cheers,
jamal



Re: [v1,net-next 3/4] net: qos: police action add index for tc flower offloading

2020-06-23 Thread Jamal Hadi Salim

This certainly brings an interesting point which i brought up earlier
when Jiri was doing offloading of stats.
In this case the action index is being used as the offloaded
policer index (note: there'd need to be a check whether the
index is infact acceptable to the h/w etc unless there
2^32 meters available in the hardware).

My question: Is this any different from how stats are structured?
In this case you can map the s/w action index to a h/w table index
(of meters).
My comment then was: hardware i have encountered (and i pointed to P4
model as well) assumes an indexed table of stats.

cheers,
jamal

On 2020-06-23 2:34 a.m., Po Liu wrote:

From: Po Liu 

Hardware may own many entries for police flow. So that make one(or
  multi) flow to be policed by one hardware entry. This patch add the
police action index provide to the driver side make it mapping the
driver hardware entry index.

Signed-off-by: Po Liu 
---
  include/net/flow_offload.h | 1 +
  net/sched/cls_api.c| 1 +
  2 files changed, 2 insertions(+)

diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index c2ef19c6b27d..eed98075b1ae 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -232,6 +232,7 @@ struct flow_action_entry {
booltruncate;
} sample;
struct {/* FLOW_ACTION_POLICE */
+   u32 index;
s64 burst;
u64 rate_bytes_ps;
u32 mtu;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 6aba7d5ba1ec..fdc4c89ca1fa 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3659,6 +3659,7 @@ int tc_setup_flow_action(struct flow_action *flow_action,
entry->police.rate_bytes_ps =
tcf_police_rate_bytes_ps(act);
entry->police.mtu = tcf_police_tcfp_mtu(act);
+   entry->police.index = act->tcfa_index;
} else if (is_tcf_ct(act)) {
entry->id = FLOW_ACTION_CT;
entry->ct.action = tcf_ct_action(act);





Re: [PATCH net-next 0/2] Change tc action identifiers to be more consistent

2019-02-07 Thread Jamal Hadi Salim

On 2019-02-07 2:45 a.m., Eli Cohen wrote:

This two patch series modifies TC actions identifiers to be more consistent and
also puts them in one place so new identifiers numbers can be chosen more
easily.



For the series:
Acked-by: Jamal Hadi Salim 

cheers,
jamal


Re: [PATCH/RFC] Make perf_event_open() propagate errors for use in bpf_perf_event_open()

2019-01-12 Thread Jamal Hadi Salim

On 2019-01-11 10:55 a.m., Arnaldo Carvalho de Melo wrote:

Hi Peter,

bpf_perf_event_open() already returns a value, but if
perf_event_output's output_begin (mostly perf_output_begin) fails,
the only way to know about that is looking before/after the rb->lost,
right?

For ring buffer users that is ok, we'll get a PERF_RECORD_LOST,
etc, but for BPF programs it would be convenient to get that -ENOSPC and
do some fallback, whatever makes sense, like in my augmented_syscalls
stuff for 'perf trace', i.e. don't augment it (i.e. push stuff at the
end of the normal payload), just don't filter the
raw_syscalls:sys_enter, 'perf trace' will get the enter syscall enter
event without the pointer dereference at the end, etc, warn the user but
don't lose a syscall in the strace-like output. 

What do you think? Am I missing something? Probably ;-)

Ah, its just test built.


Works as advertised ;->
Tested-by: Jamal Hadi Salim 

cheers,
jamal


Re: linux-next: manual merge of the net-next tree with the net tree

2018-10-09 Thread Jamal Hadi Salim

On 2018-10-08 9:21 p.m., Stephen Rothwell wrote:

Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

   net/sched/cls_u32.c

between commit:

   6d4c407744dd ("net: sched: cls_u32: fix hnode refcounting")

from the net tree and commit:

   a030598690c6 ("net: sched: cls_u32: simplify the hell out u32_delete() emptiness 
check")

from the net-next tree.

I fixed it up (I reverted the net tree commit as I could not tell
wich parts of it, if any, are still needed) and can carry the fix as
necessary. This is now fixed as far as linux-next is concerned, but any
non trivial conflicts should be mentioned to your upstream maintainer
when your tree is submitted for merging.  You may also want to consider
cooperating with the maintainer of the conflicting tree to minimise any
particularly complex conflicts.



Attached should fix it. Al, please double check.

cheers,
jamal
--- a/net-next/net/sched/cls_u32.c  2018-10-09 05:18:04.046642676 -0400
+++ b/net/sched/cls_u32.c   2018-10-09 05:29:51.965528032 -0400
@@ -391,6 +391,7 @@
RCU_INIT_POINTER(root_ht->next, tp_c->hlist);
rcu_assign_pointer(tp_c->hlist, root_ht);
 
+   root_ht->refcnt++;
rcu_assign_pointer(tp->root, root_ht);
tp->data = tp_c;
return 0;
@@ -606,7 +607,7 @@
struct tc_u_hnode __rcu **hn;
struct tc_u_hnode *phn;
 
-   WARN_ON(ht->refcnt);
+   WARN_ON(--ht->refcnt);
 
u32_clear_hnode(tp, ht, extack);
 
@@ -634,7 +635,7 @@
 
WARN_ON(root_ht == NULL);
 
-   if (root_ht && --root_ht->refcnt == 0)
+   if (root_ht && --root_ht->refcnt == 1)
u32_destroy_hnode(tp, root_ht, extack);
 
if (--tp_c->refcnt == 0) {
@@ -679,7 +680,6 @@
}
 
if (ht->refcnt == 1) {
-   ht->refcnt--;
u32_destroy_hnode(tp, ht, extack);
} else {
NL_SET_ERR_MSG_MOD(extack, "Can not delete in-use filter");
@@ -1079,8 +1079,7 @@
}
 #endif
 
-   err = u32_set_parms(net, tp, base, n, tb, tca[TCA_RATE], ovr,
-   extack);
+   err = u32_set_parms(net, tp, base, n, tb, tca[TCA_RATE], ovr, extack);
if (err == 0) {
struct tc_u_knode __rcu **ins;
struct tc_u_knode *pins;


Re: linux-next: manual merge of the net-next tree with the net tree

2018-10-09 Thread Jamal Hadi Salim

On 2018-10-08 9:21 p.m., Stephen Rothwell wrote:

Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

   net/sched/cls_u32.c

between commit:

   6d4c407744dd ("net: sched: cls_u32: fix hnode refcounting")

from the net tree and commit:

   a030598690c6 ("net: sched: cls_u32: simplify the hell out u32_delete() emptiness 
check")

from the net-next tree.

I fixed it up (I reverted the net tree commit as I could not tell
wich parts of it, if any, are still needed) and can carry the fix as
necessary. This is now fixed as far as linux-next is concerned, but any
non trivial conflicts should be mentioned to your upstream maintainer
when your tree is submitted for merging.  You may also want to consider
cooperating with the maintainer of the conflicting tree to minimise any
particularly complex conflicts.



Attached should fix it. Al, please double check.

cheers,
jamal
--- a/net-next/net/sched/cls_u32.c  2018-10-09 05:18:04.046642676 -0400
+++ b/net/sched/cls_u32.c   2018-10-09 05:29:51.965528032 -0400
@@ -391,6 +391,7 @@
RCU_INIT_POINTER(root_ht->next, tp_c->hlist);
rcu_assign_pointer(tp_c->hlist, root_ht);
 
+   root_ht->refcnt++;
rcu_assign_pointer(tp->root, root_ht);
tp->data = tp_c;
return 0;
@@ -606,7 +607,7 @@
struct tc_u_hnode __rcu **hn;
struct tc_u_hnode *phn;
 
-   WARN_ON(ht->refcnt);
+   WARN_ON(--ht->refcnt);
 
u32_clear_hnode(tp, ht, extack);
 
@@ -634,7 +635,7 @@
 
WARN_ON(root_ht == NULL);
 
-   if (root_ht && --root_ht->refcnt == 0)
+   if (root_ht && --root_ht->refcnt == 1)
u32_destroy_hnode(tp, root_ht, extack);
 
if (--tp_c->refcnt == 0) {
@@ -679,7 +680,6 @@
}
 
if (ht->refcnt == 1) {
-   ht->refcnt--;
u32_destroy_hnode(tp, ht, extack);
} else {
NL_SET_ERR_MSG_MOD(extack, "Can not delete in-use filter");
@@ -1079,8 +1079,7 @@
}
 #endif
 
-   err = u32_set_parms(net, tp, base, n, tb, tca[TCA_RATE], ovr,
-   extack);
+   err = u32_set_parms(net, tp, base, n, tb, tca[TCA_RATE], ovr, extack);
if (err == 0) {
struct tc_u_knode __rcu **ins;
struct tc_u_knode *pins;


Re: [PATCH net-next v2] net: sched: don't disable bh when accessing action idr

2018-05-22 Thread Jamal Hadi Salim

On 21/05/18 04:03 PM, Vlad Buslov wrote:

Initial net_device implementation used ingress_lock spinlock to synchronize
ingress path of device. This lock was used in both process and bh context.
In some code paths action map lock was obtained while holding ingress_lock.
Commit e1e992e52faa ("[NET_SCHED] protect action config/dump from irqs")
modified actions to always disable bh, while using action map lock, in
order to prevent deadlock on ingress_lock in softirq. This lock was removed
from net_device, so disabling bh, while accessing action map, is no longer
necessary.

Replace all action idr spinlock usage with regular calls that do not
disable bh.

Signed-off-by: Vlad Buslov <vla...@mellanox.com>


Acked-by: Jamal Hadi Salim <j...@mojatatu.com>

cheers,
jamal


Re: [PATCH net-next v2] net: sched: don't disable bh when accessing action idr

2018-05-22 Thread Jamal Hadi Salim

On 21/05/18 04:03 PM, Vlad Buslov wrote:

Initial net_device implementation used ingress_lock spinlock to synchronize
ingress path of device. This lock was used in both process and bh context.
In some code paths action map lock was obtained while holding ingress_lock.
Commit e1e992e52faa ("[NET_SCHED] protect action config/dump from irqs")
modified actions to always disable bh, while using action map lock, in
order to prevent deadlock on ingress_lock in softirq. This lock was removed
from net_device, so disabling bh, while accessing action map, is no longer
necessary.

Replace all action idr spinlock usage with regular calls that do not
disable bh.

Signed-off-by: Vlad Buslov 


Acked-by: Jamal Hadi Salim 

cheers,
jamal


Re: [PATCH 00/14] Modify action API for implementing lockless actions

2018-05-15 Thread Jamal Hadi Salim

On 14/05/18 04:46 PM, Vlad Buslov wrote:


On Mon 14 May 2018 at 18:03, Jamal Hadi Salim <j...@mojatatu.com> wrote:

On 14/05/18 10:27 AM, Vlad Buslov wrote:




Hello Jamal,

I'm trying to run tdc, but keep getting following error even on clean
branch without my patches:


Vlad, not sure if you saw my email:
Apply Roman's patch and try again

https://marc.info/?l=linux-netdev=152639369112020=2

cheers,
jamal


Re: [PATCH 00/14] Modify action API for implementing lockless actions

2018-05-15 Thread Jamal Hadi Salim

On 14/05/18 04:46 PM, Vlad Buslov wrote:


On Mon 14 May 2018 at 18:03, Jamal Hadi Salim  wrote:

On 14/05/18 10:27 AM, Vlad Buslov wrote:




Hello Jamal,

I'm trying to run tdc, but keep getting following error even on clean
branch without my patches:


Vlad, not sure if you saw my email:
Apply Roman's patch and try again

https://marc.info/?l=linux-netdev=152639369112020=2

cheers,
jamal


Re: [PATCH 00/14] Modify action API for implementing lockless actions

2018-05-14 Thread Jamal Hadi Salim

On 14/05/18 10:27 AM, Vlad Buslov wrote:

Currently, all netlink protocol handlers for updating rules, actions and
qdiscs are protected with single global rtnl lock which removes any
possibility for parallelism. This patch set is a first step to remove
rtnl lock dependency from TC rules update path. It updates act API to
use atomic operations, rcu and spinlocks for fine-grained locking. It
also extend API with functions that are needed to update existing
actions for parallel execution.

Outline of changes:
- Change tc action to use atomic reference and bind counters, rcu
   mechanism for cookie update.
- Extend action ops API with 'delete' function and 'unlocked' flag.
- Change action API to work with actions in lockless manner based on
   primitives implemented in previous patches.
- Extend action API with new functions necessary to implement unlocked
   actions.


Please run all the tdc tests with these changes. This area has almost
good test coverage at this point. If you need help just ping me.

cheers,
jamal


Re: [PATCH 00/14] Modify action API for implementing lockless actions

2018-05-14 Thread Jamal Hadi Salim

On 14/05/18 10:27 AM, Vlad Buslov wrote:

Currently, all netlink protocol handlers for updating rules, actions and
qdiscs are protected with single global rtnl lock which removes any
possibility for parallelism. This patch set is a first step to remove
rtnl lock dependency from TC rules update path. It updates act API to
use atomic operations, rcu and spinlocks for fine-grained locking. It
also extend API with functions that are needed to update existing
actions for parallel execution.

Outline of changes:
- Change tc action to use atomic reference and bind counters, rcu
   mechanism for cookie update.
- Extend action ops API with 'delete' function and 'unlocked' flag.
- Change action API to work with actions in lockless manner based on
   primitives implemented in previous patches.
- Extend action API with new functions necessary to implement unlocked
   actions.


Please run all the tdc tests with these changes. This area has almost
good test coverage at this point. If you need help just ping me.

cheers,
jamal


Re: [PATCH net-next 2/2] net sched act_vlan: VLAN action rewrite to use RCU lock/unlock and update

2017-10-11 Thread Jamal Hadi Salim

On 17-10-10 10:33 PM, Manish Kurup wrote:

Using a spinlock in the VLAN action causes performance issues when the VLAN
action is used on multiple cores. Rewrote the VLAN action to use RCU read
locking for reads and updates instead.

Signed-off-by: Manish Kurup <manish.ku...@verizon.com>


Acked-by: Jamal Hadi Salim <j...@mojatatu.com>

cheers,
jamal


Re: [PATCH net-next 2/2] net sched act_vlan: VLAN action rewrite to use RCU lock/unlock and update

2017-10-11 Thread Jamal Hadi Salim

On 17-10-10 10:33 PM, Manish Kurup wrote:

Using a spinlock in the VLAN action causes performance issues when the VLAN
action is used on multiple cores. Rewrote the VLAN action to use RCU read
locking for reads and updates instead.

Signed-off-by: Manish Kurup 


Acked-by: Jamal Hadi Salim 

cheers,
jamal


Re: [PATCH] net/sched: act_skbmod: remove unneeded rcu_read_unlock in tcf_skbmod_dump

2017-03-04 Thread Jamal Hadi Salim

On 17-03-04 07:01 PM, Alexey Khoroshilov wrote:

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov <khoroshi...@ispras.ru>


Acked-by: Jamal Hadi Salim <j...@mojatatu.com>

cheers,
jamal



Re: [PATCH] net/sched: act_skbmod: remove unneeded rcu_read_unlock in tcf_skbmod_dump

2017-03-04 Thread Jamal Hadi Salim

On 17-03-04 07:01 PM, Alexey Khoroshilov wrote:

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov 


Acked-by: Jamal Hadi Salim 

cheers,
jamal



Re: [PATCH]net:sched:release lock before tcf_dump_walker() normal return to avoid deadlock

2016-12-06 Thread Jamal Hadi Salim

On 16-12-06 12:36 AM, Feng Deng wrote:

From: Feng Deng

release lock before tcf_dump_walker() normal return to avoid deadlock



/Scratching my head.

I am probably missing something obvious.
What are the condition under which this deadlock will happen?
Do you have a testcase we can try?

cheers,
jamal





Re: [PATCH]net:sched:release lock before tcf_dump_walker() normal return to avoid deadlock

2016-12-06 Thread Jamal Hadi Salim

On 16-12-06 12:36 AM, Feng Deng wrote:

From: Feng Deng

release lock before tcf_dump_walker() normal return to avoid deadlock



/Scratching my head.

I am probably missing something obvious.
What are the condition under which this deadlock will happen?
Do you have a testcase we can try?

cheers,
jamal





Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets

2016-09-28 Thread Jamal Hadi Salim

On 16-09-28 08:45 AM, Cyrill Gorcunov wrote:


Note: inet_diag somewhere has a netlink structure that
has a hole. I pointed it out to Eric D. and he said we cant
add it now because it would break ABI.


Naming holes generated by a compiler for alignment sake should
not break abi (because alignments are abi by self), but I may
be missing something in context of where the structure you're
talking about is present. And what about non-x86 archs? They
might have different members alignment requirements.



struct tcp_info.

Sorry - i didnt mean to drag this for long; but the more i think
about it, the union is a pragmatic approach.

cheers,
jamal



Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets

2016-09-28 Thread Jamal Hadi Salim

On 16-09-28 08:45 AM, Cyrill Gorcunov wrote:


Note: inet_diag somewhere has a netlink structure that
has a hole. I pointed it out to Eric D. and he said we cant
add it now because it would break ABI.


Naming holes generated by a compiler for alignment sake should
not break abi (because alignments are abi by self), but I may
be missing something in context of where the structure you're
talking about is present. And what about non-x86 archs? They
might have different members alignment requirements.



struct tcp_info.

Sorry - i didnt mean to drag this for long; but the more i think
about it, the union is a pragmatic approach.

cheers,
jamal



Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets

2016-09-28 Thread Jamal Hadi Salim

On 16-09-28 08:27 AM, Jamal Hadi Salim wrote:

On 16-09-28 08:16 AM, David Miller wrote:

From: Jamal Hadi Salim <j...@mojatatu.com>
Date: Wed, 28 Sep 2016 08:09:28 -0400


On 16-09-28 08:07 AM, David Miller wrote:


Right, it would be legal for an existing user to have code that
explicitly initializes every member of the structure, including 'pad'.
So we have to keep that member around, at a minimum, for their sake.



I think we need to start labelling any new pad fields added as
"Not UAPI. Do not fsck fondle this".


They must initialize it to zero.



What if in the future actually meant to use 0 for
something?;-> For example in Cyrill's case it means PROTO_IP
Not sure if it useful to interpret or not but it is part of the
enum for protocols.


I just tested with: socket(AF_INET, SOCK_RAW, 0);

and got back: EPROTONOSUPPORT

So in this case doesnt matter. But my point stands i.e 0 could mean
something.

cheers,
jamal


Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets

2016-09-28 Thread Jamal Hadi Salim

On 16-09-28 08:27 AM, Jamal Hadi Salim wrote:

On 16-09-28 08:16 AM, David Miller wrote:

From: Jamal Hadi Salim 
Date: Wed, 28 Sep 2016 08:09:28 -0400


On 16-09-28 08:07 AM, David Miller wrote:


Right, it would be legal for an existing user to have code that
explicitly initializes every member of the structure, including 'pad'.
So we have to keep that member around, at a minimum, for their sake.



I think we need to start labelling any new pad fields added as
"Not UAPI. Do not fsck fondle this".


They must initialize it to zero.



What if in the future actually meant to use 0 for
something?;-> For example in Cyrill's case it means PROTO_IP
Not sure if it useful to interpret or not but it is part of the
enum for protocols.


I just tested with: socket(AF_INET, SOCK_RAW, 0);

and got back: EPROTONOSUPPORT

So in this case doesnt matter. But my point stands i.e 0 could mean
something.

cheers,
jamal


Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets

2016-09-28 Thread Jamal Hadi Salim

On 16-09-28 08:16 AM, David Miller wrote:

From: Jamal Hadi Salim <j...@mojatatu.com>
Date: Wed, 28 Sep 2016 08:09:28 -0400


On 16-09-28 08:07 AM, David Miller wrote:


Right, it would be legal for an existing user to have code that
explicitly initializes every member of the structure, including 'pad'.
So we have to keep that member around, at a minimum, for their sake.



I think we need to start labelling any new pad fields added as
"Not UAPI. Do not fsck fondle this".


They must initialize it to zero.



What if in the future actually meant to use 0 for
something?;-> For example in Cyrill's case it means PROTO_IP
Not sure if it useful to interpret or not but it is part of the
enum for protocols.

Maybe we shouldnt be adding pad fields in these netlink
structure definitions then one can liberally add new ones.
Note: inet_diag somewhere has a netlink structure that
has a hole. I pointed it out to Eric D. and he said we cant
add it now because it would break ABI.
So where do we draw the line for future extensions?

cheers,
jamal


Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets

2016-09-28 Thread Jamal Hadi Salim

On 16-09-28 08:16 AM, David Miller wrote:

From: Jamal Hadi Salim 
Date: Wed, 28 Sep 2016 08:09:28 -0400


On 16-09-28 08:07 AM, David Miller wrote:


Right, it would be legal for an existing user to have code that
explicitly initializes every member of the structure, including 'pad'.
So we have to keep that member around, at a minimum, for their sake.



I think we need to start labelling any new pad fields added as
"Not UAPI. Do not fsck fondle this".


They must initialize it to zero.



What if in the future actually meant to use 0 for
something?;-> For example in Cyrill's case it means PROTO_IP
Not sure if it useful to interpret or not but it is part of the
enum for protocols.

Maybe we shouldnt be adding pad fields in these netlink
structure definitions then one can liberally add new ones.
Note: inet_diag somewhere has a netlink structure that
has a hole. I pointed it out to Eric D. and he said we cant
add it now because it would break ABI.
So where do we draw the line for future extensions?

cheers,
jamal


Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets

2016-09-28 Thread Jamal Hadi Salim

On 16-09-28 08:07 AM, David Miller wrote:


Right, it would be legal for an existing user to have code that
explicitly initializes every member of the structure, including 'pad'.
So we have to keep that member around, at a minimum, for their sake.



I think we need to start labelling any new pad fields added as
"Not UAPI. Do not fsck fondle this".

cheers,
jamal


Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets

2016-09-28 Thread Jamal Hadi Salim

On 16-09-28 08:07 AM, David Miller wrote:


Right, it would be legal for an existing user to have code that
explicitly initializes every member of the structure, including 'pad'.
So we have to keep that member around, at a minimum, for their sake.



I think we need to start labelling any new pad fields added as
"Not UAPI. Do not fsck fondle this".

cheers,
jamal


Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets

2016-09-28 Thread Jamal Hadi Salim

On 16-09-28 07:27 AM, Cyrill Gorcunov wrote:

On Wed, Sep 28, 2016 at 07:06:26AM -0400, Jamal Hadi Salim wrote:


This structure is uapi, so anyone has complete rights to reference
@pad in the userspace programs. Sure it would be more clear to remove
the @pad completely, but if we choose so I think it's better to do
on top instead and then if someone complain we can easily revert
the single trivial commit instead of this big patch.


I am conflicted.
A field labelled "pad" does not appear to be valid as "UAPI". It is
a cosmetic indicator. If you did sizeof() with or without it being
present the value doesnt change.


I think you miss the point what I'm trying to say: currently end-user
may have reference to this member (for any reason) and his program
will compile and run. If we change the name the compilation procedure
fails and this will break API. Yes, referrning @pad is bad idea for
userspace code, and yes (!) better to simply rename it but lets do
that later, on top, so that if we break something in userspace
we could easily revert the oneline change.




I understood well your point;-> Maybe my response was not clear:
_nobody should be fscking fondling pad fields_ setting them or
otherwise.
Maybe let these programs fail. I asked if you knew any such app which
did anything with a pad field.

cheers,
jamal



Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets

2016-09-28 Thread Jamal Hadi Salim

On 16-09-28 07:27 AM, Cyrill Gorcunov wrote:

On Wed, Sep 28, 2016 at 07:06:26AM -0400, Jamal Hadi Salim wrote:


This structure is uapi, so anyone has complete rights to reference
@pad in the userspace programs. Sure it would be more clear to remove
the @pad completely, but if we choose so I think it's better to do
on top instead and then if someone complain we can easily revert
the single trivial commit instead of this big patch.


I am conflicted.
A field labelled "pad" does not appear to be valid as "UAPI". It is
a cosmetic indicator. If you did sizeof() with or without it being
present the value doesnt change.


I think you miss the point what I'm trying to say: currently end-user
may have reference to this member (for any reason) and his program
will compile and run. If we change the name the compilation procedure
fails and this will break API. Yes, referrning @pad is bad idea for
userspace code, and yes (!) better to simply rename it but lets do
that later, on top, so that if we break something in userspace
we could easily revert the oneline change.




I understood well your point;-> Maybe my response was not clear:
_nobody should be fscking fondling pad fields_ setting them or
otherwise.
Maybe let these programs fail. I asked if you knew any such app which
did anything with a pad field.

cheers,
jamal



Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets

2016-09-28 Thread Jamal Hadi Salim

On 16-09-28 06:51 AM, Cyrill Gorcunov wrote:

On Wed, Sep 28, 2016 at 06:43:01AM -0400, Jamal Hadi Salim wrote:


[..]

I dont know how compilation will fail but you may be right with note:
that is not how pads have been used in the past. They are supposed to
cosmetic annotation which indicates "here's a hole; use it in the
future if you are looking to add something". And someone in the
future can claim them. I am not sure if MBZ philosophy applies.


This structure is uapi, so anyone has complete rights to reference
@pad in the userspace programs. Sure it would be more clear to remove
the @pad completely, but if we choose so I think it's better to do
on top instead and then if someone complain we can easily revert
the single trivial commit instead of this big patch.


I am conflicted.
A field labelled "pad" does not appear to be valid as "UAPI". It is
a cosmetic indicator. If you did sizeof() with or without it being
present the value doesnt change.
BTW: There is at least one major structure in inet diag has a hole
today and doesnt have a padding indicator.


If protocol goes over u8 then complete inet_diag_req_v2 structure will
have to be reworked becaue @sdiag_protocol is u8 as well. IOW, once
someone liftup IPPROTO_MAX > 255, he will notice the problem immediately
because diag for such module simply stop working properly.



ok.

cheers,
jamal


Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets

2016-09-28 Thread Jamal Hadi Salim

On 16-09-28 06:51 AM, Cyrill Gorcunov wrote:

On Wed, Sep 28, 2016 at 06:43:01AM -0400, Jamal Hadi Salim wrote:


[..]

I dont know how compilation will fail but you may be right with note:
that is not how pads have been used in the past. They are supposed to
cosmetic annotation which indicates "here's a hole; use it in the
future if you are looking to add something". And someone in the
future can claim them. I am not sure if MBZ philosophy applies.


This structure is uapi, so anyone has complete rights to reference
@pad in the userspace programs. Sure it would be more clear to remove
the @pad completely, but if we choose so I think it's better to do
on top instead and then if someone complain we can easily revert
the single trivial commit instead of this big patch.


I am conflicted.
A field labelled "pad" does not appear to be valid as "UAPI". It is
a cosmetic indicator. If you did sizeof() with or without it being
present the value doesnt change.
BTW: There is at least one major structure in inet diag has a hole
today and doesnt have a padding indicator.


If protocol goes over u8 then complete inet_diag_req_v2 structure will
have to be reworked becaue @sdiag_protocol is u8 as well. IOW, once
someone liftup IPPROTO_MAX > 255, he will notice the problem immediately
because diag for such module simply stop working properly.



ok.

cheers,
jamal


Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets

2016-09-28 Thread Jamal Hadi Salim

On 16-09-28 06:17 AM, Cyrill Gorcunov wrote:

On Wed, Sep 28, 2016 at 06:08:00AM -0400, Jamal Hadi Salim wrote:
...

@@ -38,7 +38,10 @@ struct inet_diag_req_v2 {
__u8sdiag_family;
__u8sdiag_protocol;
__u8idiag_ext;
-   __u8pad;
+   union {
+   __u8pad;
+   __u8sdiag_raw_protocol; /* SOCK_RAW only, @pad for 
others */
+   };



Above looks funny. Why is it a union? pad is for exposing a byte-hole
for padding/alignment reasons and i doubt anybody is using it.


Someone may have set it to zero explicitly on source level, and the
compilation will fail on new kernel then. So no, keeping the name
is reasonable.



I dont know how compilation will fail but you may be right with note:
that is not how pads have been used in the past. They are supposed to
cosmetic annotation which indicates "here's a hole; use it in the
future if you are looking to add something". And someone in the
future can claim them. I am not sure if MBZ philosophy applies.


Should you not just rename it?
Also I notice when things like __raw_v4_lookup() are claiming it is unsigned
short instead of a u8?


The protocol is still up to 255 for a while, is it expected that IPPROTO_MAX
will be increased in more-less near future? Of course I can drop the idea
of using @pad here and switch to some extended reauest but prefer to stick
with simplier solution. Hm?



Ok. If i understood correctly it was already unsigned short before your
patch -so i agree it doesnt matter. Maybe just put a comment to express
that if ever protocol goes above 255 it wont be sufficient.



cheers,
jamal

PS:- sorry for butting in the discussion - i blame it on coffee.


Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets

2016-09-28 Thread Jamal Hadi Salim

On 16-09-28 06:17 AM, Cyrill Gorcunov wrote:

On Wed, Sep 28, 2016 at 06:08:00AM -0400, Jamal Hadi Salim wrote:
...

@@ -38,7 +38,10 @@ struct inet_diag_req_v2 {
__u8sdiag_family;
__u8sdiag_protocol;
__u8idiag_ext;
-   __u8pad;
+   union {
+   __u8pad;
+   __u8sdiag_raw_protocol; /* SOCK_RAW only, @pad for 
others */
+   };



Above looks funny. Why is it a union? pad is for exposing a byte-hole
for padding/alignment reasons and i doubt anybody is using it.


Someone may have set it to zero explicitly on source level, and the
compilation will fail on new kernel then. So no, keeping the name
is reasonable.



I dont know how compilation will fail but you may be right with note:
that is not how pads have been used in the past. They are supposed to
cosmetic annotation which indicates "here's a hole; use it in the
future if you are looking to add something". And someone in the
future can claim them. I am not sure if MBZ philosophy applies.


Should you not just rename it?
Also I notice when things like __raw_v4_lookup() are claiming it is unsigned
short instead of a u8?


The protocol is still up to 255 for a while, is it expected that IPPROTO_MAX
will be increased in more-less near future? Of course I can drop the idea
of using @pad here and switch to some extended reauest but prefer to stick
with simplier solution. Hm?



Ok. If i understood correctly it was already unsigned short before your
patch -so i agree it doesnt matter. Maybe just put a comment to express
that if ever protocol goes above 255 it wont be sufficient.



cheers,
jamal

PS:- sorry for butting in the discussion - i blame it on coffee.


Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets

2016-09-28 Thread Jamal Hadi Salim

On 16-09-28 05:03 AM, Cyrill Gorcunov wrote:

In criu we are actively using diag interface to collect sockets
present in the system when dumping applications. And while for
unix, tcp, udp[lite], packet, netlink it works as expected,
the raw sockets do not have. Thus add it.

v2:
 - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@)
 - implement @destroy for diag requests (by dsa@)

v3:
 - add export of raw_abort for IPv6 (by dsa@)
 - pass net-admin flag into inet_sk_diag_fill due to
   changes in net-next branch (by dsa@)

v4:
 - use @pad in struct inet_diag_req_v2 for raw socket
   protocol specification: raw module carries sockets
   which may have custom protocol passed from socket()
   syscall and sole @sdiag_protocol is not enough to
   match underlied ones
 - start reporting protocol specifed in socket() call
   when sockets are raw ones for the same reason: user
   space tools like ss may parse this attribute and use
   it for socket matching

v5 (by eric.dumazet@):
 - use sock_hold in raw_sock_get instead of atomic_inc,
   we're holding (raw_v4_hashinfo|raw_v6_hashinfo)->lock
   when looking up so counter won't be zero here.

CC: David S. Miller 
CC: Eric Dumazet 
CC: David Ahern 
CC: Alexey Kuznetsov 
CC: James Morris 
CC: Hideaki YOSHIFUJI 
CC: Patrick McHardy 
CC: Andrey Vagin 
CC: Stephen Hemminger 
Signed-off-by: Cyrill Gorcunov 
---

Thanks all for feedback! Take a look please once time permit.

 include/net/raw.h  |6 +
 include/net/rawv6.h|7 +
 include/uapi/linux/inet_diag.h |5
 net/ipv4/Kconfig   |8 +
 net/ipv4/Makefile  |1
 net/ipv4/inet_diag.c   |9 +
 net/ipv4/raw.c |   21 +++
 net/ipv4/raw_diag.c|  233 +
 net/ipv6/raw.c |7 -
 9 files changed, 292 insertions(+), 5 deletions(-)

Index: linux-ml.git/include/net/raw.h
===
--- linux-ml.git.orig/include/net/raw.h
+++ linux-ml.git/include/net/raw.h
@@ -23,6 +23,12 @@

 extern struct proto raw_prot;

+extern struct raw_hashinfo raw_v4_hashinfo;
+struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
+unsigned short num, __be32 raddr,
+__be32 laddr, int dif);
+
+int raw_abort(struct sock *sk, int err);
 void raw_icmp_error(struct sk_buff *, int, u32);
 int raw_local_deliver(struct sk_buff *, int);

Index: linux-ml.git/include/net/rawv6.h
===
--- linux-ml.git.orig/include/net/rawv6.h
+++ linux-ml.git/include/net/rawv6.h
@@ -3,6 +3,13 @@

 #include 

+extern struct raw_hashinfo raw_v6_hashinfo;
+struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
+unsigned short num, const struct in6_addr 
*loc_addr,
+const struct in6_addr *rmt_addr, int dif);
+
+int raw_abort(struct sock *sk, int err);
+
 void raw6_icmp_error(struct sk_buff *, int nexthdr,
u8 type, u8 code, int inner_offset, __be32);
 bool raw6_local_deliver(struct sk_buff *, int);
Index: linux-ml.git/include/uapi/linux/inet_diag.h
===
--- linux-ml.git.orig/include/uapi/linux/inet_diag.h
+++ linux-ml.git/include/uapi/linux/inet_diag.h
@@ -38,7 +38,10 @@ struct inet_diag_req_v2 {
__u8sdiag_family;
__u8sdiag_protocol;
__u8idiag_ext;
-   __u8pad;
+   union {
+   __u8pad;
+   __u8sdiag_raw_protocol; /* SOCK_RAW only, @pad for 
others */
+   };



Above looks funny. Why is it a union? pad is for exposing a byte-hole
for padding/alignment reasons and i doubt anybody is using it.
Should you not just rename it?
Also I notice when things like __raw_v4_lookup() are claiming it is 
unsigned short instead of a u8?


cheers,
jamal


Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets

2016-09-28 Thread Jamal Hadi Salim

On 16-09-28 05:03 AM, Cyrill Gorcunov wrote:

In criu we are actively using diag interface to collect sockets
present in the system when dumping applications. And while for
unix, tcp, udp[lite], packet, netlink it works as expected,
the raw sockets do not have. Thus add it.

v2:
 - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@)
 - implement @destroy for diag requests (by dsa@)

v3:
 - add export of raw_abort for IPv6 (by dsa@)
 - pass net-admin flag into inet_sk_diag_fill due to
   changes in net-next branch (by dsa@)

v4:
 - use @pad in struct inet_diag_req_v2 for raw socket
   protocol specification: raw module carries sockets
   which may have custom protocol passed from socket()
   syscall and sole @sdiag_protocol is not enough to
   match underlied ones
 - start reporting protocol specifed in socket() call
   when sockets are raw ones for the same reason: user
   space tools like ss may parse this attribute and use
   it for socket matching

v5 (by eric.dumazet@):
 - use sock_hold in raw_sock_get instead of atomic_inc,
   we're holding (raw_v4_hashinfo|raw_v6_hashinfo)->lock
   when looking up so counter won't be zero here.

CC: David S. Miller 
CC: Eric Dumazet 
CC: David Ahern 
CC: Alexey Kuznetsov 
CC: James Morris 
CC: Hideaki YOSHIFUJI 
CC: Patrick McHardy 
CC: Andrey Vagin 
CC: Stephen Hemminger 
Signed-off-by: Cyrill Gorcunov 
---

Thanks all for feedback! Take a look please once time permit.

 include/net/raw.h  |6 +
 include/net/rawv6.h|7 +
 include/uapi/linux/inet_diag.h |5
 net/ipv4/Kconfig   |8 +
 net/ipv4/Makefile  |1
 net/ipv4/inet_diag.c   |9 +
 net/ipv4/raw.c |   21 +++
 net/ipv4/raw_diag.c|  233 +
 net/ipv6/raw.c |7 -
 9 files changed, 292 insertions(+), 5 deletions(-)

Index: linux-ml.git/include/net/raw.h
===
--- linux-ml.git.orig/include/net/raw.h
+++ linux-ml.git/include/net/raw.h
@@ -23,6 +23,12 @@

 extern struct proto raw_prot;

+extern struct raw_hashinfo raw_v4_hashinfo;
+struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
+unsigned short num, __be32 raddr,
+__be32 laddr, int dif);
+
+int raw_abort(struct sock *sk, int err);
 void raw_icmp_error(struct sk_buff *, int, u32);
 int raw_local_deliver(struct sk_buff *, int);

Index: linux-ml.git/include/net/rawv6.h
===
--- linux-ml.git.orig/include/net/rawv6.h
+++ linux-ml.git/include/net/rawv6.h
@@ -3,6 +3,13 @@

 #include 

+extern struct raw_hashinfo raw_v6_hashinfo;
+struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
+unsigned short num, const struct in6_addr 
*loc_addr,
+const struct in6_addr *rmt_addr, int dif);
+
+int raw_abort(struct sock *sk, int err);
+
 void raw6_icmp_error(struct sk_buff *, int nexthdr,
u8 type, u8 code, int inner_offset, __be32);
 bool raw6_local_deliver(struct sk_buff *, int);
Index: linux-ml.git/include/uapi/linux/inet_diag.h
===
--- linux-ml.git.orig/include/uapi/linux/inet_diag.h
+++ linux-ml.git/include/uapi/linux/inet_diag.h
@@ -38,7 +38,10 @@ struct inet_diag_req_v2 {
__u8sdiag_family;
__u8sdiag_protocol;
__u8idiag_ext;
-   __u8pad;
+   union {
+   __u8pad;
+   __u8sdiag_raw_protocol; /* SOCK_RAW only, @pad for 
others */
+   };



Above looks funny. Why is it a union? pad is for exposing a byte-hole
for padding/alignment reasons and i doubt anybody is using it.
Should you not just rename it?
Also I notice when things like __raw_v4_lookup() are claiming it is 
unsigned short instead of a u8?


cheers,
jamal


Re: [BUG] act_ife: sleeping functions called in atomic context

2016-06-18 Thread Jamal Hadi Salim

On 16-06-17 01:31 PM, Cong Wang wrote:


My patch is against -net. (I see you already figured out your patch is
missing in -net-next.)



Ok, should have re-read this email before working on the patch;->


Or are you suggesting to rebase it for -net-next? I think it fixes some real
bug so -net is better, although it is slightly large as a bug fix.



I am conflicted. There are a lot of changes in net-next at the moment;
adding this to -net seems like will definetely cause merge issues for
Dave.

Ok, Cong - patch attached and tested. Let me know what you think.

cheers,
jamal


diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index b7fa969..b52deb4 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -239,8 +239,7 @@ static int ife_validate_metatype(struct tcf_meta_ops *ops, 
void *val, int len)
return ret;
 }
 
-/* called when adding new meta information
- * under ife->tcf_lock
+/* called to find new metadata ops. Possibly load it as a module.
 */
 static int load_metaops_and_vet(struct tcf_ife_info *ife, u32 metaid,
void *val, int len)
@@ -251,11 +250,9 @@ static int load_metaops_and_vet(struct tcf_ife_info *ife, 
u32 metaid,
if (!ops) {
ret = -ENOENT;
 #ifdef CONFIG_MODULES
-   spin_unlock_bh(>tcf_lock);
rtnl_unlock();
request_module("ifemeta%u", metaid);
rtnl_lock();
-   spin_lock_bh(>tcf_lock);
ops = find_ife_oplist(metaid);
 #endif
}
@@ -272,7 +269,6 @@ static int load_metaops_and_vet(struct tcf_ife_info *ife, 
u32 metaid,
 }
 
 /* called when adding new meta information
- * under ife->tcf_lock
 */
 static int add_metainfo(struct tcf_ife_info *ife, u32 metaid, void *metaval,
int len)
@@ -302,7 +298,9 @@ static int add_metainfo(struct tcf_ife_info *ife, u32 
metaid, void *metaval,
}
}
 
+   spin_lock_bh(>tcf_lock);
list_add_tail(>metalist, >metalist);
+   spin_unlock_bh(>tcf_lock);
 
return ret;
 }
@@ -357,7 +355,6 @@ out_nlmsg_trim:
return -1;
 }
 
-/* under ife->tcf_lock */
 static void _tcf_ife_cleanup(struct tc_action *a, int bind)
 {
struct tcf_ife_info *ife = a->priv;
@@ -385,7 +382,6 @@ static void tcf_ife_cleanup(struct tc_action *a, int bind)
spin_unlock_bh(>tcf_lock);
 }
 
-/* under ife->tcf_lock */
 static int populate_metalist(struct tcf_ife_info *ife, struct nlattr **tb)
 {
int len = 0;
@@ -465,6 +461,8 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla,
}
 
ife = to_ife(a);
+   if (exists)
+   spin_lock_bh(>tcf_lock);
ife->flags = parm->flags;
 
if (parm->flags & IFE_ENCODE) {
@@ -475,10 +473,9 @@ static int tcf_ife_init(struct net *net, struct nlattr 
*nla,
saddr = nla_data(tb[TCA_IFE_SMAC]);
}
 
-   spin_lock_bh(>tcf_lock);
ife->tcf_action = parm->action;
-
if (parm->flags & IFE_ENCODE) {
+
if (daddr)
ether_addr_copy(ife->eth_dst, daddr);
else
@@ -492,6 +489,9 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla,
ife->eth_type = ife_type;
}
 
+   if (exists)
+   spin_unlock_bh(>tcf_lock);
+
if (ret == ACT_P_CREATED)
INIT_LIST_HEAD(>metalist);
 
@@ -505,7 +505,6 @@ metadata_parse_err:
if (ret == ACT_P_CREATED)
_tcf_ife_cleanup(a, bind);
 
-   spin_unlock_bh(>tcf_lock);
return err;
}
 
@@ -524,13 +523,10 @@ metadata_parse_err:
if (ret == ACT_P_CREATED)
_tcf_ife_cleanup(a, bind);
 
-   spin_unlock_bh(>tcf_lock);
return err;
}
}
 
-   spin_unlock_bh(>tcf_lock);
-
if (ret == ACT_P_CREATED)
tcf_hash_insert(tn, a);
 


Re: [BUG] act_ife: sleeping functions called in atomic context

2016-06-18 Thread Jamal Hadi Salim

On 16-06-17 01:31 PM, Cong Wang wrote:


My patch is against -net. (I see you already figured out your patch is
missing in -net-next.)



Ok, should have re-read this email before working on the patch;->


Or are you suggesting to rebase it for -net-next? I think it fixes some real
bug so -net is better, although it is slightly large as a bug fix.



I am conflicted. There are a lot of changes in net-next at the moment;
adding this to -net seems like will definetely cause merge issues for
Dave.

Ok, Cong - patch attached and tested. Let me know what you think.

cheers,
jamal


diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index b7fa969..b52deb4 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -239,8 +239,7 @@ static int ife_validate_metatype(struct tcf_meta_ops *ops, 
void *val, int len)
return ret;
 }
 
-/* called when adding new meta information
- * under ife->tcf_lock
+/* called to find new metadata ops. Possibly load it as a module.
 */
 static int load_metaops_and_vet(struct tcf_ife_info *ife, u32 metaid,
void *val, int len)
@@ -251,11 +250,9 @@ static int load_metaops_and_vet(struct tcf_ife_info *ife, 
u32 metaid,
if (!ops) {
ret = -ENOENT;
 #ifdef CONFIG_MODULES
-   spin_unlock_bh(>tcf_lock);
rtnl_unlock();
request_module("ifemeta%u", metaid);
rtnl_lock();
-   spin_lock_bh(>tcf_lock);
ops = find_ife_oplist(metaid);
 #endif
}
@@ -272,7 +269,6 @@ static int load_metaops_and_vet(struct tcf_ife_info *ife, 
u32 metaid,
 }
 
 /* called when adding new meta information
- * under ife->tcf_lock
 */
 static int add_metainfo(struct tcf_ife_info *ife, u32 metaid, void *metaval,
int len)
@@ -302,7 +298,9 @@ static int add_metainfo(struct tcf_ife_info *ife, u32 
metaid, void *metaval,
}
}
 
+   spin_lock_bh(>tcf_lock);
list_add_tail(>metalist, >metalist);
+   spin_unlock_bh(>tcf_lock);
 
return ret;
 }
@@ -357,7 +355,6 @@ out_nlmsg_trim:
return -1;
 }
 
-/* under ife->tcf_lock */
 static void _tcf_ife_cleanup(struct tc_action *a, int bind)
 {
struct tcf_ife_info *ife = a->priv;
@@ -385,7 +382,6 @@ static void tcf_ife_cleanup(struct tc_action *a, int bind)
spin_unlock_bh(>tcf_lock);
 }
 
-/* under ife->tcf_lock */
 static int populate_metalist(struct tcf_ife_info *ife, struct nlattr **tb)
 {
int len = 0;
@@ -465,6 +461,8 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla,
}
 
ife = to_ife(a);
+   if (exists)
+   spin_lock_bh(>tcf_lock);
ife->flags = parm->flags;
 
if (parm->flags & IFE_ENCODE) {
@@ -475,10 +473,9 @@ static int tcf_ife_init(struct net *net, struct nlattr 
*nla,
saddr = nla_data(tb[TCA_IFE_SMAC]);
}
 
-   spin_lock_bh(>tcf_lock);
ife->tcf_action = parm->action;
-
if (parm->flags & IFE_ENCODE) {
+
if (daddr)
ether_addr_copy(ife->eth_dst, daddr);
else
@@ -492,6 +489,9 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla,
ife->eth_type = ife_type;
}
 
+   if (exists)
+   spin_unlock_bh(>tcf_lock);
+
if (ret == ACT_P_CREATED)
INIT_LIST_HEAD(>metalist);
 
@@ -505,7 +505,6 @@ metadata_parse_err:
if (ret == ACT_P_CREATED)
_tcf_ife_cleanup(a, bind);
 
-   spin_unlock_bh(>tcf_lock);
return err;
}
 
@@ -524,13 +523,10 @@ metadata_parse_err:
if (ret == ACT_P_CREATED)
_tcf_ife_cleanup(a, bind);
 
-   spin_unlock_bh(>tcf_lock);
return err;
}
}
 
-   spin_unlock_bh(>tcf_lock);
-
if (ret == ACT_P_CREATED)
tcf_hash_insert(tn, a);
 


Re: [BUG] act_ife: sleeping functions called in atomic context

2016-06-17 Thread Jamal Hadi Salim

On 16-06-17 01:38 AM, Cong Wang wrote:

On Thu, Jun 16, 2016 at 7:14 PM, Cong Wang  wrote:


I think we can just remove that tcf_lock, I am testing a patch now.


Please try the attached patch, I will do more tests tomorrow.

Thanks!



Cong, What tree are you using? I dont see the time aggregation patches
that I sent (and Dave took in) in your changes.

Comments:
Is GFP_ATOMIC really necessary? Thats user->kernel interface. GFP_KERNEL 
should be sufficient.

Also, it would be nice to kill the lock - but this feels like two
patches in one. 1) to fix the alloc not to be under the lock 2) to
kill said lock. Maybe split it as such for easier review.
I am using this action extensively so will be happy to test.
I think my patch is a good beginning to #1 - if you fix the forgotten
unlock and ensure we lock around updating ife fields when it exists
already (you said it in your earlier email and I thought about
that afterwards).

cheers,
jamal



Re: [BUG] act_ife: sleeping functions called in atomic context

2016-06-17 Thread Jamal Hadi Salim

On 16-06-17 01:38 AM, Cong Wang wrote:

On Thu, Jun 16, 2016 at 7:14 PM, Cong Wang  wrote:


I think we can just remove that tcf_lock, I am testing a patch now.


Please try the attached patch, I will do more tests tomorrow.

Thanks!



Cong, What tree are you using? I dont see the time aggregation patches
that I sent (and Dave took in) in your changes.

Comments:
Is GFP_ATOMIC really necessary? Thats user->kernel interface. GFP_KERNEL 
should be sufficient.

Also, it would be nice to kill the lock - but this feels like two
patches in one. 1) to fix the alloc not to be under the lock 2) to
kill said lock. Maybe split it as such for easier review.
I am using this action extensively so will be happy to test.
I think my patch is a good beginning to #1 - if you fix the forgotten
unlock and ensure we lock around updating ife fields when it exists
already (you said it in your earlier email and I thought about
that afterwards).

cheers,
jamal



Re: [BUG] act_ife: sleeping functions called in atomic context

2016-06-16 Thread Jamal Hadi Salim

On 16-06-16 05:43 PM, Cong Wang wrote:

On Thu, Jun 16, 2016 at 1:50 PM, Alexey Khoroshilov
 wrote:

tcf_ife_init() contains a big chunk of code executed with
ife->tcf_lock spinlock held. But that code contains several calls
to sleeping functions:
   populate_metalist() and use_all_metadata()
 -> add_metainfo()
   -> find_ife_oplist(metaid)
 -> read_lock()
 -> try_module_get(o->owner)
   -> kzalloc(sizeof(*mi), GFP_KERNEL);


Hmm, we don't need to hold that spinlock when we create a new ife action,
since we haven't inserted it yet. We do need it when we modify an existing
one. So I am thinking if we can refactor that code to avoid spinlock
whenever possible.



Does attached (compile tested) patch help?


   -> ops->alloc(mi, metaval);
   -> module_put(ops->owner);
   _tcf_ife_cleanup()
 -> module_put()

The same problem is actual for tcf_ife_cleanup() as well.



Huh? Both module_put() and kfree() should not sleep, right?



I dont think there is any sleeping there ;->

cheers,
jamal
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 6bbc518..e341bef 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -302,7 +302,9 @@ static int add_metainfo(struct tcf_ife_info *ife, u32 
metaid, void *metaval,
}
}
 
+   spin_lock_bh(>tcf_lock);
list_add_tail(>metalist, >metalist);
+   spin_unlock_bh(>tcf_lock);
 
return ret;
 }
@@ -474,7 +476,6 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla,
saddr = nla_data(tb[TCA_IFE_SMAC]);
}
 
-   spin_lock_bh(>tcf_lock);
ife->tcf_action = parm->action;
 
if (parm->flags & IFE_ENCODE) {
@@ -504,7 +505,6 @@ metadata_parse_err:
if (ret == ACT_P_CREATED)
_tcf_ife_cleanup(a, bind);
 
-   spin_unlock_bh(>tcf_lock);
return err;
}
 
@@ -523,13 +523,10 @@ metadata_parse_err:
if (ret == ACT_P_CREATED)
_tcf_ife_cleanup(a, bind);
 
-   spin_unlock_bh(>tcf_lock);
return err;
}
}
 
-   spin_unlock_bh(>tcf_lock);
-
if (ret == ACT_P_CREATED)
tcf_hash_insert(tn, a);
 


Re: [BUG] act_ife: sleeping functions called in atomic context

2016-06-16 Thread Jamal Hadi Salim

On 16-06-16 05:43 PM, Cong Wang wrote:

On Thu, Jun 16, 2016 at 1:50 PM, Alexey Khoroshilov
 wrote:

tcf_ife_init() contains a big chunk of code executed with
ife->tcf_lock spinlock held. But that code contains several calls
to sleeping functions:
   populate_metalist() and use_all_metadata()
 -> add_metainfo()
   -> find_ife_oplist(metaid)
 -> read_lock()
 -> try_module_get(o->owner)
   -> kzalloc(sizeof(*mi), GFP_KERNEL);


Hmm, we don't need to hold that spinlock when we create a new ife action,
since we haven't inserted it yet. We do need it when we modify an existing
one. So I am thinking if we can refactor that code to avoid spinlock
whenever possible.



Does attached (compile tested) patch help?


   -> ops->alloc(mi, metaval);
   -> module_put(ops->owner);
   _tcf_ife_cleanup()
 -> module_put()

The same problem is actual for tcf_ife_cleanup() as well.



Huh? Both module_put() and kfree() should not sleep, right?



I dont think there is any sleeping there ;->

cheers,
jamal
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 6bbc518..e341bef 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -302,7 +302,9 @@ static int add_metainfo(struct tcf_ife_info *ife, u32 
metaid, void *metaval,
}
}
 
+   spin_lock_bh(>tcf_lock);
list_add_tail(>metalist, >metalist);
+   spin_unlock_bh(>tcf_lock);
 
return ret;
 }
@@ -474,7 +476,6 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla,
saddr = nla_data(tb[TCA_IFE_SMAC]);
}
 
-   spin_lock_bh(>tcf_lock);
ife->tcf_action = parm->action;
 
if (parm->flags & IFE_ENCODE) {
@@ -504,7 +505,6 @@ metadata_parse_err:
if (ret == ACT_P_CREATED)
_tcf_ife_cleanup(a, bind);
 
-   spin_unlock_bh(>tcf_lock);
return err;
}
 
@@ -523,13 +523,10 @@ metadata_parse_err:
if (ret == ACT_P_CREATED)
_tcf_ife_cleanup(a, bind);
 
-   spin_unlock_bh(>tcf_lock);
return err;
}
}
 
-   spin_unlock_bh(>tcf_lock);
-
if (ret == ACT_P_CREATED)
tcf_hash_insert(tn, a);
 


Re: [PATCH net-next V2] tun: introduce tx skb ring

2016-06-15 Thread Jamal Hadi Salim
On 16-06-15 07:52 AM, Jamal Hadi Salim wrote:
> On 16-06-15 04:38 AM, Jason Wang wrote:
>> We used to queue tx packets in sk_receive_queue, this is less
>> efficient since it requires spinlocks to synchronize between producer

> 
> So this is more exercising the skb array improvements. For tun
> it would be useful to see general performance numbers on user/kernel
> crossing (i.e tun read/write).
> If you have the cycles can you run such tests?
> 

Ignore my message - you are running pktgen from a VM towards the host.
So the numbers you posted are what i was interested in.
Thanks for the good work.

cheers,
jamal



Re: [PATCH net-next V2] tun: introduce tx skb ring

2016-06-15 Thread Jamal Hadi Salim
On 16-06-15 07:52 AM, Jamal Hadi Salim wrote:
> On 16-06-15 04:38 AM, Jason Wang wrote:
>> We used to queue tx packets in sk_receive_queue, this is less
>> efficient since it requires spinlocks to synchronize between producer

> 
> So this is more exercising the skb array improvements. For tun
> it would be useful to see general performance numbers on user/kernel
> crossing (i.e tun read/write).
> If you have the cycles can you run such tests?
> 

Ignore my message - you are running pktgen from a VM towards the host.
So the numbers you posted are what i was interested in.
Thanks for the good work.

cheers,
jamal



Re: [PATCH net-next V2] tun: introduce tx skb ring

2016-06-15 Thread Jamal Hadi Salim
On 16-06-15 04:38 AM, Jason Wang wrote:
> We used to queue tx packets in sk_receive_queue, this is less
> efficient since it requires spinlocks to synchronize between producer
> and consumer.
> 
> This patch tries to address this by:
> 
> - introduce a new mode which will be only enabled with IFF_TX_ARRAY
>set and switch from sk_receive_queue to a fixed size of skb
>array with 256 entries in this mode.
> - introduce a new proto_ops peek_len which was used for peeking the
>skb length.
> - implement a tun version of peek_len for vhost_net to use and convert
>vhost_net to use peek_len if possible.
> 
> Pktgen test shows about 18% improvement on guest receiving pps for small
> buffers:
> 
> Before: ~122pps
> After : ~144pps
> 

So this is more exercising the skb array improvements. For tun
it would be useful to see general performance numbers on user/kernel
crossing (i.e tun read/write).
If you have the cycles can you run such tests?

cheers,
jamal





Re: [PATCH net-next V2] tun: introduce tx skb ring

2016-06-15 Thread Jamal Hadi Salim
On 16-06-15 04:38 AM, Jason Wang wrote:
> We used to queue tx packets in sk_receive_queue, this is less
> efficient since it requires spinlocks to synchronize between producer
> and consumer.
> 
> This patch tries to address this by:
> 
> - introduce a new mode which will be only enabled with IFF_TX_ARRAY
>set and switch from sk_receive_queue to a fixed size of skb
>array with 256 entries in this mode.
> - introduce a new proto_ops peek_len which was used for peeking the
>skb length.
> - implement a tun version of peek_len for vhost_net to use and convert
>vhost_net to use peek_len if possible.
> 
> Pktgen test shows about 18% improvement on guest receiving pps for small
> buffers:
> 
> Before: ~122pps
> After : ~144pps
> 

So this is more exercising the skb array improvements. For tun
it would be useful to see general performance numbers on user/kernel
crossing (i.e tun read/write).
If you have the cycles can you run such tests?

cheers,
jamal





Re: Deleting child qdisc doesn't reset parent to default qdisc?

2016-04-15 Thread Jamal Hadi Salim

On 16-04-14 01:49 PM, Eric Dumazet wrote:


And what would be the chosen behavior ?



TBF is probably a bad example because it started life as
a classless qdisc. There was only one built-in fifo queue
that was shaped. Then someone made it classful and changed
this behavior. To me it sounds reasonable to have the
default behavior restored. At minimal consistency.


Relying on TBF installing a bfifo for you at delete would be hazardous.

For example CBQ got it differently than HFSC

If qdisc_create_dflt() fails in CBQ, we fail the 'delete', while HFSC
falls back to noop_qdisc, without warning the user :(

At least always using noop_qdisc is consistent. No magic there.

Doing 'unification' right now would break existing scripts.

This is too late, I am afraid.



Sigh. So rant:
IMO, we should let any new APIs and API updates stay longer
in discussion. Or better mark them as unstable for sometime.
The excuse that "it is out in the wild therefore cant be changed"
is harmful because the timeline is "forever" whereas
patches are applied after a short period of posting
and discussions and sometimes not involving the right people.
It is like having a jury issuing a death sentence after 1 week of
deliberation. You cant take it back after execution.

cheers,
jamal


Re: Deleting child qdisc doesn't reset parent to default qdisc?

2016-04-15 Thread Jamal Hadi Salim

On 16-04-14 01:49 PM, Eric Dumazet wrote:


And what would be the chosen behavior ?



TBF is probably a bad example because it started life as
a classless qdisc. There was only one built-in fifo queue
that was shaped. Then someone made it classful and changed
this behavior. To me it sounds reasonable to have the
default behavior restored. At minimal consistency.


Relying on TBF installing a bfifo for you at delete would be hazardous.

For example CBQ got it differently than HFSC

If qdisc_create_dflt() fails in CBQ, we fail the 'delete', while HFSC
falls back to noop_qdisc, without warning the user :(

At least always using noop_qdisc is consistent. No magic there.

Doing 'unification' right now would break existing scripts.

This is too late, I am afraid.



Sigh. So rant:
IMO, we should let any new APIs and API updates stay longer
in discussion. Or better mark them as unstable for sometime.
The excuse that "it is out in the wild therefore cant be changed"
is harmful because the timeline is "forever" whereas
patches are applied after a short period of posting
and discussions and sometimes not involving the right people.
It is like having a jury issuing a death sentence after 1 week of
deliberation. You cant take it back after execution.

cheers,
jamal


Re: [PATCH net v3 2/4] net: add rx_nohandler stat counter

2016-02-09 Thread Jamal Hadi Salim

On 16-02-09 03:40 AM, David Miller wrote:

From: Eric Dumazet 
Date: Mon, 08 Feb 2016 14:57:40 -0800


Whole point of TLV is that it allows us to add new fields at the end of
the structures.

  ...

Look at iproute2, you were the one adding in 2004 code to cope with
various tcp_info sizes.

So 12 years later, you cannot say it does not work anymore.


+1



The TLV L should be canonical way to determine length. i.e should be
sufficient to just look at L and understand that content has changed.
But:
Using sizeof could be dangerous unless the data is packed to be
32-bit aligned. Looking INET_DIAG_INFO check for sizeof
there is a small 8 bit hole in tcp_info I think between
these two fields:


__u8tcpi_snd_wscale : 4, tcpi_rcv_wscale : 4;
__u32   tcpi_rto;
---

The kernel will pad to make sure the TLV data is 32-bit aligned.
I am not sure if that will be the same length as sizeof() in all
hardware + compilers... For this case,
it is almost safe to just add a version field - probably in the hole.
Or have a #define to say what the expected length should be. Or add
an 8 bit pad.

In general adding new fields that are non-optional is problematic. i.e
by non-optional i mean always expected to be present.
I think a good test is old kernel with new iproute2. If the new field
is non-optional, it will fail (example iproute2 may try to print a value
that it expects but because old kernel doesnt understand it; it is 
non-existent).


cheers,
jamal


Re: [PATCH net v3 2/4] net: add rx_nohandler stat counter

2016-02-09 Thread Jamal Hadi Salim

On 16-02-09 03:40 AM, David Miller wrote:

From: Eric Dumazet 
Date: Mon, 08 Feb 2016 14:57:40 -0800


Whole point of TLV is that it allows us to add new fields at the end of
the structures.

  ...

Look at iproute2, you were the one adding in 2004 code to cope with
various tcp_info sizes.

So 12 years later, you cannot say it does not work anymore.


+1



The TLV L should be canonical way to determine length. i.e should be
sufficient to just look at L and understand that content has changed.
But:
Using sizeof could be dangerous unless the data is packed to be
32-bit aligned. Looking INET_DIAG_INFO check for sizeof
there is a small 8 bit hole in tcp_info I think between
these two fields:


__u8tcpi_snd_wscale : 4, tcpi_rcv_wscale : 4;
__u32   tcpi_rto;
---

The kernel will pad to make sure the TLV data is 32-bit aligned.
I am not sure if that will be the same length as sizeof() in all
hardware + compilers... For this case,
it is almost safe to just add a version field - probably in the hole.
Or have a #define to say what the expected length should be. Or add
an 8 bit pad.

In general adding new fields that are non-optional is problematic. i.e
by non-optional i mean always expected to be present.
I think a good test is old kernel with new iproute2. If the new field
is non-optional, it will fail (example iproute2 may try to print a value
that it expects but because old kernel doesnt understand it; it is 
non-existent).


cheers,
jamal


Re: [PATCH 0/7] Freescale DPAA FMan FLIB(s)

2015-03-06 Thread Jamal Hadi Salim

Hi Emil,

On 03/05/15 10:04, Emil Medve wrote:

Hello Jamal,


On 03/05/2015 08:35 AM, Jamal Hadi Salim wrote:

Hi Emil,



No. All the kernel drivers/code we want to upstream is meant to stand on
its own and be used the "normal" Linux/Unix way



Ok, thanks - that was my only concern.
Note there is a lot of offload efforts going on in Linux right now.
Take a look at the proceedings from netdev01. You should take advantage
of that to shape the direction of your patches.
I have suffered at the hands of the sdk for this processor in the past;
it is just a bad idea to keep these SDKs around when Linux can express
the features sufficiently.
Note: The main thing i'd be interested in is when you get to offloading
the classifier/scheduler etc. I know you are focussing on just the 
ethernet level at the moment,


cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/7] Freescale DPAA FMan FLIB(s)

2015-03-06 Thread Jamal Hadi Salim

Hi Emil,

On 03/05/15 10:04, Emil Medve wrote:

Hello Jamal,


On 03/05/2015 08:35 AM, Jamal Hadi Salim wrote:

Hi Emil,



No. All the kernel drivers/code we want to upstream is meant to stand on
its own and be used the normal Linux/Unix way



Ok, thanks - that was my only concern.
Note there is a lot of offload efforts going on in Linux right now.
Take a look at the proceedings from netdev01. You should take advantage
of that to shape the direction of your patches.
I have suffered at the hands of the sdk for this processor in the past;
it is just a bad idea to keep these SDKs around when Linux can express
the features sufficiently.
Note: The main thing i'd be interested in is when you get to offloading
the classifier/scheduler etc. I know you are focussing on just the 
ethernet level at the moment,


cheers,
jamal

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/7] Freescale DPAA FMan FLIB(s)

2015-03-05 Thread Jamal Hadi Salim

Hi Emil,

On 03/05/15 08:48, Emil Medve wrote:


The intent is to upstream the entire suite of the DPAA drivers. All the
drivers are still WIP, but B/QMan have been already presented to the
upstream community and this is the first attempt to publish (some low
level code of) the FMan driver. As we go through our internal checklist
and in the same time address community feedback we'll soon get the
drivers to be acceptable for the upstream trees

The first version of the actual Ethernet driver will follow imminently

SDK enablement is a side-effect



Meaning? Let me ask the question differently:
Do i need your sdk to use the features exposed or can i use something
like tc to set up the deficit rr or wred or the exposed classifiers
and associated actions?
Would your sdk (via user space direct programming) benefit because you
have pushed these pieces into the kernel?


How are you planning to
add support for your classifiers, queue schedulers etc?


Yes



Yes as in these will be available via linux kernel or via your sdk?


Is that a patch
on top of this or it is something that sits on user space?


Both. Full DPAA/Ethernet enablement will be present in the kernel. We
also have support for user-space based approach. I'm unsure where/when
we might publish that. Of course the SDK is always a place you can turn
to for all the code we have (in whatever state it might be)



the sdk is open source?

cheers,
jamal
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/7] Freescale DPAA FMan FLIB(s)

2015-03-05 Thread Jamal Hadi Salim

On 03/05/15 00:45, Emil Medve wrote:

From: Igal Liberman 

The Freescale Data Path Acceleration Architecture (DPAA) is a set of
hardware components on specific QorIQ P and T series multicore processors.
This architecture provides the infrastructure to support simplified
sharing of networking interfaces and accelerators by multiple CPU cores,
and the accelerators themselves.

One of the DPAA accelerators is the Frame Manager (FMan), which
combines the Ethernet network interfaces with packet distribution
logic to provide intelligent distribution and queuing decisions for
incoming traffic at line rate.

This patch presents the FMan Foundation Libraries (FLIB) headers.
The FMan FLIB suite adds basic support for the DPAA FMan hardware register 
access.
The FMan FLIB suite is used in Freescale's SDK Releases.



Is this intended to merely enable your sdk? How are you planning to
add support for your classifiers, queue schedulers etc? Is that a patch
on top of this or it is something that sits on user space?

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/7] Freescale DPAA FMan FLIB(s)

2015-03-05 Thread Jamal Hadi Salim

On 03/05/15 00:45, Emil Medve wrote:

From: Igal Liberman igal.liber...@freescale.com

The Freescale Data Path Acceleration Architecture (DPAA) is a set of
hardware components on specific QorIQ P and T series multicore processors.
This architecture provides the infrastructure to support simplified
sharing of networking interfaces and accelerators by multiple CPU cores,
and the accelerators themselves.

One of the DPAA accelerators is the Frame Manager (FMan), which
combines the Ethernet network interfaces with packet distribution
logic to provide intelligent distribution and queuing decisions for
incoming traffic at line rate.

This patch presents the FMan Foundation Libraries (FLIB) headers.
The FMan FLIB suite adds basic support for the DPAA FMan hardware register 
access.
The FMan FLIB suite is used in Freescale's SDK Releases.



Is this intended to merely enable your sdk? How are you planning to
add support for your classifiers, queue schedulers etc? Is that a patch
on top of this or it is something that sits on user space?

cheers,
jamal

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/7] Freescale DPAA FMan FLIB(s)

2015-03-05 Thread Jamal Hadi Salim

Hi Emil,

On 03/05/15 08:48, Emil Medve wrote:


The intent is to upstream the entire suite of the DPAA drivers. All the
drivers are still WIP, but B/QMan have been already presented to the
upstream community and this is the first attempt to publish (some low
level code of) the FMan driver. As we go through our internal checklist
and in the same time address community feedback we'll soon get the
drivers to be acceptable for the upstream trees

The first version of the actual Ethernet driver will follow imminently

SDK enablement is a side-effect



Meaning? Let me ask the question differently:
Do i need your sdk to use the features exposed or can i use something
like tc to set up the deficit rr or wred or the exposed classifiers
and associated actions?
Would your sdk (via user space direct programming) benefit because you
have pushed these pieces into the kernel?


How are you planning to
add support for your classifiers, queue schedulers etc?


Yes



Yes as in these will be available via linux kernel or via your sdk?


Is that a patch
on top of this or it is something that sits on user space?


Both. Full DPAA/Ethernet enablement will be present in the kernel. We
also have support for user-space based approach. I'm unsure where/when
we might publish that. Of course the SDK is always a place you can turn
to for all the code we have (in whatever state it might be)



the sdk is open source?

cheers,
jamal
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 0/3] net: Add Keystone NetCP ethernet driver support

2014-09-10 Thread Jamal Hadi Salim

On 09/09/14 11:19, Santosh Shilimkar wrote:


All the documentation is open including packet accelerator offload
in ti.com.


Very nice.
Would you do me a kindness and point to the switch interface
documentation (and other ones on that soc)?


We got such requests from customers but couldn't
support it for Linux.


It has been difficult because every chip vendor is trying
to do their own thing. Some have huge (fugly) SDKs in user space
which make it worse. Thats the struggle we are trying to
deal with. Of course none of those vendors want to open
up their specs. You present a nice opportunity to not follow
that path.


We are also looking for such
support and any direction are welcome. Your slide
deck seems to capture the key topics like L2/IPSEC
offload which we are also interested to hear.



The slides list the most popular offloads. But not necessarily
all known offloads.


Just to be clear, your point was about L2 switch offload
which the driver don't support at the moment. It might confuse
others. The driver doesn't implements anything non-standard.



If i understood you correctly:
Your initial patches dont intend to expose any offloads - you are just
abstracting this as a NIC. I think that is a legit reason.
However, the problem is you are also exposing the packet processors
and switch offloading in a proprietary way.
For a sample of how L2 basic functions like FDB tables are controlled
within a NIC - take a look at the Intel NICs.
Either that or you hide all the offload interfaces and over time add
them (starting with L2 - NICs with L2 are common).

cheers,
jamal
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 0/3] net: Add Keystone NetCP ethernet driver support

2014-09-10 Thread Jamal Hadi Salim

On 09/09/14 11:19, Santosh Shilimkar wrote:


All the documentation is open including packet accelerator offload
in ti.com.


Very nice.
Would you do me a kindness and point to the switch interface
documentation (and other ones on that soc)?


We got such requests from customers but couldn't
support it for Linux.


It has been difficult because every chip vendor is trying
to do their own thing. Some have huge (fugly) SDKs in user space
which make it worse. Thats the struggle we are trying to
deal with. Of course none of those vendors want to open
up their specs. You present a nice opportunity to not follow
that path.


We are also looking for such
support and any direction are welcome. Your slide
deck seems to capture the key topics like L2/IPSEC
offload which we are also interested to hear.



The slides list the most popular offloads. But not necessarily
all known offloads.


Just to be clear, your point was about L2 switch offload
which the driver don't support at the moment. It might confuse
others. The driver doesn't implements anything non-standard.



If i understood you correctly:
Your initial patches dont intend to expose any offloads - you are just
abstracting this as a NIC. I think that is a legit reason.
However, the problem is you are also exposing the packet processors
and switch offloading in a proprietary way.
For a sample of how L2 basic functions like FDB tables are controlled
within a NIC - take a look at the Intel NICs.
Either that or you hide all the offload interfaces and over time add
them (starting with L2 - NICs with L2 are common).

cheers,
jamal
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 0/3] net: Add Keystone NetCP ethernet driver support

2014-09-09 Thread Jamal Hadi Salim

On 09/08/14 10:41, Santosh Shilimkar wrote:


The NetCP plugin module infrastructure use all the standard kernel
infrastructure and its very tiny.


So i found this manual here:
http://www.silica.com/fileadmin/02_Products/Productdetails/Texas_Instruments/SILICA_TI_66AK2E05-ds.pdf

Glad there is an open document!
There are a couple of ethernet switch chips I can spot there.

Can i control those with "bridge" or say "brctl" utilities?

I can see the bridge ports are exposed and i should be able to
control them via ifconfig or ip link. Thats what "standard
kernel infrastructure" means. Magic hidden in a driver is
not.

Take a look at recent netconf discussion (as well as earlier
referenced discussions):
http://vger.kernel.org/netconf-nf-offload.pdf

Maybe we can help providing you some direction?
The problem is it doesnt seem that the offload specs for
those other pieces are open? e.g how do i add an entry
to the L2 switch?

cheers,
jamal
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 0/3] net: Add Keystone NetCP ethernet driver support

2014-09-09 Thread Jamal Hadi Salim

On 09/08/14 10:41, Santosh Shilimkar wrote:


The NetCP plugin module infrastructure use all the standard kernel
infrastructure and its very tiny.


So i found this manual here:
http://www.silica.com/fileadmin/02_Products/Productdetails/Texas_Instruments/SILICA_TI_66AK2E05-ds.pdf

Glad there is an open document!
There are a couple of ethernet switch chips I can spot there.

Can i control those with bridge or say brctl utilities?

I can see the bridge ports are exposed and i should be able to
control them via ifconfig or ip link. Thats what standard
kernel infrastructure means. Magic hidden in a driver is
not.

Take a look at recent netconf discussion (as well as earlier
referenced discussions):
http://vger.kernel.org/netconf-nf-offload.pdf

Maybe we can help providing you some direction?
The problem is it doesnt seem that the offload specs for
those other pieces are open? e.g how do i add an entry
to the L2 switch?

cheers,
jamal
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 2/3] net: Add Keystone NetCP ethernet driver

2014-08-19 Thread Jamal Hadi Salim

On 08/15/14 11:12, Santosh Shilimkar wrote:


I am curious about these two calls below(netcp_process_one_rx_packet
and netcp_tx_submit_skb):
On tx you seem to be broadcasting to all "sub-modules" and on receive
you seem to be invoking from all as well.
I couldnt find the code for any of the sub-modules but i suspect
things like switching/bridging or ipsec etc could be sub-modules.
If yes, have you thought about integrating these features into
Linux proper instead?

cheers,
jamal


+
+static inline int netcp_process_one_rx_packet(struct netcp_intf *netcp)
+{




+
+   /* Call each of the RX hooks */
+   p_info.skb = skb;
+   p_info.rxtstamp_complete = false;
+   list_for_each_entry(rx_hook, >rxhook_list_head, list) {
+   int ret;
+
+   ret = rx_hook->hook_rtn(rx_hook->order, rx_hook->hook_data,
+   _info);
+   if (unlikely(ret)) {
+   dev_err(netcp->ndev_dev, "RX hook %d failed: %d\n",
+   rx_hook->order, ret);
+   netcp->ndev->stats.rx_errors++;
+   dev_kfree_skb(skb);
+   return 0;
+   }
+   }

..


+   return 0;
+}


[..]


+static inline int netcp_tx_submit_skb(struct netcp_intf *netcp,
+ struct sk_buff *skb,
+ struct knav_dma_desc *desc)
+{
+   struct netcp_tx_pipe *tx_pipe = NULL;



+
+   p_info.netcp = netcp;
+   p_info.skb = skb;
+   p_info.tx_pipe = NULL;
+   p_info.psdata_len = 0;
+   p_info.ts_context = NULL;
+   p_info.txtstamp_complete = NULL;
+   p_info.epib = desc->epib;
+   p_info.psdata = desc->psdata;
+   memset(p_info.epib, 0, KNAV_DMA_NUM_EPIB_WORDS * sizeof(u32));
+
+   /* Find out where to inject the packet for transmission */
+   list_for_each_entry(tx_hook, >txhook_list_head, list) {
+   ret = tx_hook->hook_rtn(tx_hook->order, tx_hook->hook_data,
+   _info);
+   if (unlikely(ret != 0)) {
+   dev_err(netcp->ndev_dev, "TX hook %d rejected the packet 
with reason(%d)\n",
+   tx_hook->order, ret);
+   ret = (ret < 0) ? ret : NETDEV_TX_OK;
+   goto out;
+   }
+   }
+



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 2/3] net: Add Keystone NetCP ethernet driver

2014-08-19 Thread Jamal Hadi Salim

On 08/15/14 11:12, Santosh Shilimkar wrote:


I am curious about these two calls below(netcp_process_one_rx_packet
and netcp_tx_submit_skb):
On tx you seem to be broadcasting to all sub-modules and on receive
you seem to be invoking from all as well.
I couldnt find the code for any of the sub-modules but i suspect
things like switching/bridging or ipsec etc could be sub-modules.
If yes, have you thought about integrating these features into
Linux proper instead?

cheers,
jamal


+
+static inline int netcp_process_one_rx_packet(struct netcp_intf *netcp)
+{




+
+   /* Call each of the RX hooks */
+   p_info.skb = skb;
+   p_info.rxtstamp_complete = false;
+   list_for_each_entry(rx_hook, netcp-rxhook_list_head, list) {
+   int ret;
+
+   ret = rx_hook-hook_rtn(rx_hook-order, rx_hook-hook_data,
+   p_info);
+   if (unlikely(ret)) {
+   dev_err(netcp-ndev_dev, RX hook %d failed: %d\n,
+   rx_hook-order, ret);
+   netcp-ndev-stats.rx_errors++;
+   dev_kfree_skb(skb);
+   return 0;
+   }
+   }

..


+   return 0;
+}


[..]


+static inline int netcp_tx_submit_skb(struct netcp_intf *netcp,
+ struct sk_buff *skb,
+ struct knav_dma_desc *desc)
+{
+   struct netcp_tx_pipe *tx_pipe = NULL;



+
+   p_info.netcp = netcp;
+   p_info.skb = skb;
+   p_info.tx_pipe = NULL;
+   p_info.psdata_len = 0;
+   p_info.ts_context = NULL;
+   p_info.txtstamp_complete = NULL;
+   p_info.epib = desc-epib;
+   p_info.psdata = desc-psdata;
+   memset(p_info.epib, 0, KNAV_DMA_NUM_EPIB_WORDS * sizeof(u32));
+
+   /* Find out where to inject the packet for transmission */
+   list_for_each_entry(tx_hook, netcp-txhook_list_head, list) {
+   ret = tx_hook-hook_rtn(tx_hook-order, tx_hook-hook_data,
+   p_info);
+   if (unlikely(ret != 0)) {
+   dev_err(netcp-ndev_dev, TX hook %d rejected the packet 
with reason(%d)\n,
+   tx_hook-order, ret);
+   ret = (ret  0) ? ret : NETDEV_TX_OK;
+   goto out;
+   }
+   }
+



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] xfrm: fix handling of XFRM policies mark and mask.

2013-02-06 Thread Jamal Hadi Salim

On 13-02-06 09:39 AM, Emmanuel Thierry wrote:





I think you misread the example !


I did ;->


Marks are both 1, masks are different.

This case is more complex than a policy
with no mark (so mark=0 and mask=0) versus
a policy with an exact mark (so mark=1 and mask=0x),
and i wanted to know if the algorithm would take these kind of cases into 
account.



Aha. I think this is pushing the envelope a little - are there good use 
cases for this?
certainly you could insert with most exact mask first. No such check is 
made at the moment.


cheers,
jamal
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] xfrm: fix handling of XFRM policies mark and mask.

2013-02-06 Thread Jamal Hadi Salim

On 13-02-06 08:53 AM, Emmanuel Thierry wrote:

Actually, we didn't think about this problem since we work with priorities, 
putting the default policy (without a mark) at a minor priority than the marked 
one.


I think priorities are the way to go in cases of ambiguity.


Your remark makes clearer the ideas behind the design of XFRM, but this leads 
to an interesting concern. If on policy insertion, the policy were inserted 
depending on the accuracy of the mark (the more the mask is specific, the more 
the mark must be put at the beginning of the list), how would we decide which 
is the more specific between these ones ?

ip -6 xfrm policy add src fd00::1/128 dst fd00::2/128 dir out mark 0x0001 
mask 0x0005

ip -6 xfrm policy add src fd00::1/128 dst fd00::2/128 dir out mark 0x0001 
mask 0x0003


They look different to me, no? i.e i dont see a conflict - one has 
mark=5 and the other

has mark=3.

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] xfrm: fix handling of XFRM policies mark and mask.

2013-02-06 Thread Jamal Hadi Salim

On 13-02-06 08:53 AM, Emmanuel Thierry wrote:

Actually, we didn't think about this problem since we work with priorities, 
putting the default policy (without a mark) at a minor priority than the marked 
one.


I think priorities are the way to go in cases of ambiguity.


Your remark makes clearer the ideas behind the design of XFRM, but this leads 
to an interesting concern. If on policy insertion, the policy were inserted 
depending on the accuracy of the mark (the more the mask is specific, the more 
the mark must be put at the beginning of the list), how would we decide which 
is the more specific between these ones ?

ip -6 xfrm policy add src fd00::1/128 dst fd00::2/128 dir out mark 0x0001 
mask 0x0005

ip -6 xfrm policy add src fd00::1/128 dst fd00::2/128 dir out mark 0x0001 
mask 0x0003


They look different to me, no? i.e i dont see a conflict - one has 
mark=5 and the other

has mark=3.

cheers,
jamal

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] xfrm: fix handling of XFRM policies mark and mask.

2013-02-06 Thread Jamal Hadi Salim

On 13-02-06 09:39 AM, Emmanuel Thierry wrote:





I think you misread the example !


I did ;-


Marks are both 1, masks are different.

This case is more complex than a policy
with no mark (so mark=0 and mask=0) versus
a policy with an exact mark (so mark=1 and mask=0x),
and i wanted to know if the algorithm would take these kind of cases into 
account.



Aha. I think this is pushing the envelope a little - are there good use 
cases for this?
certainly you could insert with most exact mask first. No such check is 
made at the moment.


cheers,
jamal
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [net V2] act_mirred: do not drop packets when fails to mirror it

2012-08-16 Thread Jamal Hadi Salim
On Thu, 2012-08-16 at 14:44 +0800, Jason Wang wrote:
> We drop packet unconditionally when we fail to mirror it. This is not intended
> in some cases. Consdier for kvm guest, we may mirror the traffic of the bridge
> to a tap device used by a VM. When kernel fails to mirror the packet in
> conditions such as when qemu crashes or stop polling the tap, it's hard for 
> the
> management software to detect such condition and clean the the mirroring
> before. This would lead all packets to the bridge to be dropped and break the
> netowrk of other virtual machines.
> 
> To solve the issue, the patch does not drop packets when kernel fails to 
> mirror
> it, and only drop the redirected packets.
> 
> Signed-off-by: Jason Wang 

Signed-off-by: Jamal Hadi Salim 

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [net V2] act_mirred: do not drop packets when fails to mirror it

2012-08-16 Thread Jamal Hadi Salim
On Thu, 2012-08-16 at 14:44 +0800, Jason Wang wrote:
 We drop packet unconditionally when we fail to mirror it. This is not intended
 in some cases. Consdier for kvm guest, we may mirror the traffic of the bridge
 to a tap device used by a VM. When kernel fails to mirror the packet in
 conditions such as when qemu crashes or stop polling the tap, it's hard for 
 the
 management software to detect such condition and clean the the mirroring
 before. This would lead all packets to the bridge to be dropped and break the
 netowrk of other virtual machines.
 
 To solve the issue, the patch does not drop packets when kernel fails to 
 mirror
 it, and only drop the redirected packets.
 
 Signed-off-by: Jason Wang jasow...@redhat.com

Signed-off-by: Jamal Hadi Salim j...@mojatatu.com

cheers,
jamal

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] act_mirred: do not drop packets when fails to mirror it

2012-08-15 Thread Jamal Hadi Salim
On Wed, 2012-08-15 at 21:42 +0800, Jason Wang wrote:

> 
> I met it actually through the following steps:
> 
> - start a kvm guest with tap and make it to be an interface of the bridge
> - mirror the ingress traffic of the bridge to the tap
> - terminate the qemu process, the tap device is then removed
> - all packet goes to bridge would be dropped, so the network of guests 
> in the same bridge would be broken
> 

Makes sense.
Can you please leave the err check braces i.e
if (err) {
m->tcf_qstats.overlimits++;
if (m->tcfm_eaction != TCA_EGRESS_MIRROR)
retval = TC_ACT_SHOT;
else 
retval = m->tcf_action;   
} else {
retval = m->tcf_action;
}

Or at least dont use TC_ACT_STOLEN.

cheers,
jamal


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] act_mirred: do not drop packets when fails to mirror it

2012-08-15 Thread Jamal Hadi Salim

On Wed, 2012-08-15 at 17:37 +0800, Jason Wang wrote:
> We drop packet unconditionally when we fail to mirror it. This is not intended
> in some cases.

Hi Jason,
Did you actually notice the behavior you described or were you going by
the XXX comment I had in the code?

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 15/21] net sched: Pass the skb into change so it can access NETLINK_CB

2012-08-15 Thread Jamal Hadi Salim
On Mon, 2012-08-13 at 13:18 -0700, Eric W. Biederman wrote:
> From: "Eric W. Biederman" 
> 
> cls_flow.c plays with uids and gids.  Unless I misread that
> code it is possible for classifiers to depend on the specific uid and
> gid values.  Therefore I need to know the user namespace of the
> netlink socket that is installing the packet classifiers.  Pass
> in the rtnetlink skb so I can access the NETLINK_CB of the passed
> packet.  In particular I want access to sk_user_ns(NETLINK_CB(in_skb).ssk).
> 
> Pass in not the user namespace but the incomming rtnetlink skb into
> the the classifier change routines as that is generally the more useful
> parameter.
> 
> Cc: Jamal Hadi Salim 
> Acked-by: Serge Hallyn 
> Signed-off-by: Eric W. Biederman 


Acked-by: Jamal Hadi Salim 

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 15/21] net sched: Pass the skb into change so it can access NETLINK_CB

2012-08-15 Thread Jamal Hadi Salim
On Mon, 2012-08-13 at 13:18 -0700, Eric W. Biederman wrote:
 From: Eric W. Biederman ebied...@xmission.com
 
 cls_flow.c plays with uids and gids.  Unless I misread that
 code it is possible for classifiers to depend on the specific uid and
 gid values.  Therefore I need to know the user namespace of the
 netlink socket that is installing the packet classifiers.  Pass
 in the rtnetlink skb so I can access the NETLINK_CB of the passed
 packet.  In particular I want access to sk_user_ns(NETLINK_CB(in_skb).ssk).
 
 Pass in not the user namespace but the incomming rtnetlink skb into
 the the classifier change routines as that is generally the more useful
 parameter.
 
 Cc: Jamal Hadi Salim j...@mojatatu.com
 Acked-by: Serge Hallyn serge.hal...@canonical.com
 Signed-off-by: Eric W. Biederman ebied...@xmission.com


Acked-by: Jamal Hadi Salim j...@mojatatu.com

cheers,
jamal

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] act_mirred: do not drop packets when fails to mirror it

2012-08-15 Thread Jamal Hadi Salim

On Wed, 2012-08-15 at 17:37 +0800, Jason Wang wrote:
 We drop packet unconditionally when we fail to mirror it. This is not intended
 in some cases.

Hi Jason,
Did you actually notice the behavior you described or were you going by
the XXX comment I had in the code?

cheers,
jamal

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] act_mirred: do not drop packets when fails to mirror it

2012-08-15 Thread Jamal Hadi Salim
On Wed, 2012-08-15 at 21:42 +0800, Jason Wang wrote:

 
 I met it actually through the following steps:
 
 - start a kvm guest with tap and make it to be an interface of the bridge
 - mirror the ingress traffic of the bridge to the tap
 - terminate the qemu process, the tap device is then removed
 - all packet goes to bridge would be dropped, so the network of guests 
 in the same bridge would be broken
 

Makes sense.
Can you please leave the err check braces i.e
if (err) {
m-tcf_qstats.overlimits++;
if (m-tcfm_eaction != TCA_EGRESS_MIRROR)
retval = TC_ACT_SHOT;
else 
retval = m-tcf_action;   
} else {
retval = m-tcf_action;
}

Or at least dont use TC_ACT_STOLEN.

cheers,
jamal


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/