Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread Jamal Hadi Salim

On 2018-10-11 2:44 p.m., David Ahern wrote:

On 10/11/18 12:05 PM, Jamal Hadi Salim wrote:

On 2018-10-11 1:04 p.m., David Ahern wrote:




I meant the general API of users passing filter arguments as attributes
to the dump (or values in the header) -- KIND, MASTER, device index,
etc. This is an existing API and existing capability.



which i referred to as use-case driven. It is not unreasonable
to optimize for the most common - but every time somebody
comes up with something new you need to patch the kernel.



I disagree with your overall premise of bpf the end-all hammer. It is a
tool but not the only tool. For starters, you are proposing building the
message, run the filter on it, and potentially back the message up to
drop the recently added piece because the filter does not want it
included. That is still wasting a lot of cpu cycles to build and drop. I
am thinking about scale to 1 million routes -- I do not need the dump
loop building a message for 1 million entries only to drop 99% of them.
That is crazy.



My earlier suggestion was for somewhere before the skb is formed.
In the vicinity of xxx_fill_info()
The "create skb and drop" kind works already today with some
acrobatics needed for some cases with cbpf.

Is it unfeasible to add an ebpf hook at that point and ask a user
supplied code "is this ok to send?" - this is no different
than doing a "get by key" operation where key/filter is any arbitrary
construction of fields rtm understands (including the ones you
provided like table index) that are passed in the user program.
Classical "select" mechanism in database tables


The way the kernel manages route tables says I should pass in the table
id as it is a major differentiator on what is returned. From there
lookup the specific table (I need to fix this part per my response to
Andrew), and then only walk it. The existing semantics, capabilities
that exist for other dump commands is the most efficient for some of
these high level, big hammer filters.



Sure.


What you want gets into the tiniest of details and yes the imagination
can go wild with combinations of filter options. So maybe this scanning
of post-built messages is reasonable *after* the high level sorting is done.



That doesnt require any change.

cheers,
jamal


Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread David Miller
From: Sowmini Varadhan 
Date: Thu, 11 Oct 2018 15:32:48 -0400

> Without getting into Ahern's patchset, which he obviously feels 
> quite passionately about..
> 
> On (10/11/18 12:28), David Miller wrote:
>> 
>> Once you've composed the message, the whole point of filtering is lost.
> 
> it would be nice to apply the filter *before* constructing the skb, 
> but afaict most things in BPF today only operate on sk_buffs. How should
> we use *BPF on something other than an sk_buff?

Personally I'm not going to spend cycles on that.

What's important to me in the short term is that David's patch set is
an appropriate way to add filtering, using existing facilities and
mechanisms that already exist for that purpose.

If people want to explore a possible eBPF mechanism for the future,
with an emphasis on "future", feel free to explore to your heart's
content.

But that doesn't exist in any form whatsoever, so that's not what we
should be talking about here.


Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread Sowmini Varadhan


Without getting into Ahern's patchset, which he obviously feels 
quite passionately about..

On (10/11/18 12:28), David Miller wrote:
> 
> Once you've composed the message, the whole point of filtering is lost.

it would be nice to apply the filter *before* constructing the skb, 
but afaict most things in BPF today only operate on sk_buffs. How should
we use *BPF on something other than an sk_buff?

--Sowmini




Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread David Miller
From: David Ahern 
Date: Thu, 11 Oct 2018 12:44:49 -0600

> I disagree with your overall premise of bpf the end-all hammer. It is a
> tool but not the only tool. For starters, you are proposing building the
> message, run the filter on it, and potentially back the message up to
> drop the recently added piece because the filter does not want it
> included. That is still wasting a lot of cpu cycles to build and drop. I
> am thinking about scale to 1 million routes -- I do not need the dump
> loop building a message for 1 million entries only to drop 99% of them.
> That is crazy.

I completely agree.

Once you've composed the message, the whole point of filtering is lost.


Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread David Ahern
On 10/11/18 12:05 PM, Jamal Hadi Salim wrote:
> On 2018-10-11 1:04 p.m., David Ahern wrote:
> 
>> You can already filter link dumps by kind. How? By passing in the KIND
>> attribute on a dump request. This type of filtering exists for link
>> dumps, neighbor dumps, fdb dumps. Why is there a push to make route
>> dumps different? Why can't they be consistent and use existing semantics?
> 
> I think you meant filtering by ifindex in neighbor.

I meant the general API of users passing filter arguments as attributes
to the dump (or values in the header) -- KIND, MASTER, device index,
etc. This is an existing API and existing capability.

> note: I would argue that there are already "adhoc" ways of filtering
> in place, mostly use case driven). Otherwise Sowmini wouldnt have to
> craft that bpf filter. There are netlink users who have none or some
> weird filtering involved. There is no arguement that your approach
> works for rtm. But the rest of the users missing filters will require
> similar kernel changes. Could this be made generic enough to benefit
> other netlink users?
> The problem is there's always one new attribute that would make sense
> for some use case which requires a kernel change ("send me an event only
> if you get link down" or "dump all ports with link down").
> 

I disagree with your overall premise of bpf the end-all hammer. It is a
tool but not the only tool. For starters, you are proposing building the
message, run the filter on it, and potentially back the message up to
drop the recently added piece because the filter does not want it
included. That is still wasting a lot of cpu cycles to build and drop. I
am thinking about scale to 1 million routes -- I do not need the dump
loop building a message for 1 million entries only to drop 99% of them.
That is crazy.

The way the kernel manages route tables says I should pass in the table
id as it is a major differentiator on what is returned. From there
lookup the specific table (I need to fix this part per my response to
Andrew), and then only walk it. The existing semantics, capabilities
that exist for other dump commands is the most efficient for some of
these high level, big hammer filters.

What you want gets into the tiniest of details and yes the imagination
can go wild with combinations of filter options. So maybe this scanning
of post-built messages is reasonable *after* the high level sorting is done.


Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread Jamal Hadi Salim

On 2018-10-11 1:04 p.m., David Ahern wrote:


You can already filter link dumps by kind. How? By passing in the KIND
attribute on a dump request. This type of filtering exists for link
dumps, neighbor dumps, fdb dumps. Why is there a push to make route
dumps different? Why can't they be consistent and use existing semantics?


I think you meant filtering by ifindex in neighbor.
note: I would argue that there are already "adhoc" ways of filtering
in place, mostly use case driven). Otherwise Sowmini wouldnt have to
craft that bpf filter. There are netlink users who have none or some
weird filtering involved. There is no arguement that your approach
works for rtm. But the rest of the users missing filters will require
similar kernel changes. Could this be made generic enough to benefit
other netlink users?
The problem is there's always one new attribute that would make sense
for some use case which requires a kernel change ("send me an event only
if you get link down" or "dump all ports with link down").

cheers,
jamal




Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread David Ahern
On 10/11/18 10:46 AM, Jamal Hadi Salim wrote:
> On 2018-10-11 12:16 p.m., David Ahern wrote:
> 
> Yes, you can do it with cBPF but some complexity may occur. Example:
> if i was interested to netdevice events of "kind = vxlan &&
> admin flag is down" then that is non trivial to do with classical but
> would be reasonably comfortable to do with ebpf.
> Note:
> That filter will work fine for dumps as well since the semantics
> are the same.

You can already filter link dumps by kind. How? By passing in the KIND
attribute on a dump request. This type of filtering exists for link
dumps, neighbor dumps, fdb dumps. Why is there a push to make route
dumps different? Why can't they be consistent and use existing semantics?


Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread Jamal Hadi Salim

On 2018-10-11 12:16 p.m., David Ahern wrote:




IMO, bpf at the fill_info stage is not appropriate.



Somewhere before the skb is formed (and nlmsg is built).
If you go as far as constructing it, then cBPF per what
Sowmini should work; but there will be constructs which
are trickier.




skb->sk (hence attached filter) should be available at that point.
Classical bpf per Sowmini's example maybe trickier.
Better - why dont we have an ebpf hook at this stage and then
we dont have to make changes to the kernel when someone adds
one more field to the filter?

BTW: useful for events as well - not just dumps (as the name
fib_dump_filter suggests)


you mean kernel notifications on internal events?
1. there is no user socket when notifications are created and the
*_fill_info is invoked

2. notifications are global going to potentially many sockets. For these
cases the existing sk_filter is appropriate.


#2 - netlink being a broad/multicast bus with attached listeners.

Yes, you can do it with cBPF but some complexity may occur. Example:
if i was interested to netdevice events of "kind = vxlan &&
admin flag is down" then that is non trivial to do with classical but
would be reasonably comfortable to do with ebpf.
Note:
That filter will work fine for dumps as well since the semantics
are the same.
the win is:  in the future when you just wanna add that one new
filter attribute you dont need a kernel patch in and roll in a
new production kernel.

cheers,
jamal


Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread Sowmini Varadhan
On (10/11/18 09:33), Roopa Prabhu wrote:
> 3. All networking subsystems already have this type of netlink
> attribute filtering that apps rely on. This series
> just makes it consistent for route dumps. Apps use such mechanism
> already when requesting dumps.
> Like everywhere else, BPF hook can be an alternate parallel mechanism.

sure and that make sense. though I hope we will explore those
alternate mechanisms too.

--Sowmini



Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread Roopa Prabhu
On Thu, Oct 11, 2018 at 9:16 AM David Ahern  wrote:
>
> On 10/11/18 10:07 AM, Jamal Hadi Salim wrote:
> > On 2018-10-11 11:46 a.m., Sowmini Varadhan wrote:
> >> On (10/11/18 08:26), Stephen Hemminger wrote:
> >>> You can do the something like this already with BPF socket filters.
> >>> But writing BPF for multi-part messages is hard.
> >>
> >> Indeed. And I was just experimenting with this for ARP just last week.
> >> So to handle the caes of "ip neigh show a.b.c.d" without walking through
> >> the entire arp table and filtering in userspace, you could add a
> >> sk_filter()
> >> hook like this:
> >>
> >
> > If this could be done a lot earlier aka at xxx_fill_info() bpf would
> > be a very good answer.
>
> IMO, bpf at the fill_info stage is not appropriate.
>
>
> > skb->sk (hence attached filter) should be available at that point.
> > Classical bpf per Sowmini's example maybe trickier.
> > Better - why dont we have an ebpf hook at this stage and then
> > we dont have to make changes to the kernel when someone adds
> > one more field to the filter?
> >
> > BTW: useful for events as well - not just dumps (as the name
> > fib_dump_filter suggests)
>
> you mean kernel notifications on internal events?
> 1. there is no user socket when notifications are created and the
> *_fill_info is invoked
>
> 2. notifications are global going to potentially many sockets. For these
> cases the existing sk_filter is appropriate.

3. All networking subsystems already have this type of netlink
attribute filtering that apps rely on. This series
just makes it consistent for route dumps. Apps use such mechanism
already when requesting dumps.
Like everywhere else, BPF hook can be an alternate parallel mechanism.


Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread David Ahern
On 10/11/18 10:07 AM, Jamal Hadi Salim wrote:
> On 2018-10-11 11:46 a.m., Sowmini Varadhan wrote:
>> On (10/11/18 08:26), Stephen Hemminger wrote:
>>> You can do the something like this already with BPF socket filters.
>>> But writing BPF for multi-part messages is hard.
>>
>> Indeed. And I was just experimenting with this for ARP just last week.
>> So to handle the caes of "ip neigh show a.b.c.d" without walking through
>> the entire arp table and filtering in userspace, you could add a
>> sk_filter()
>> hook like this:
>>
> 
> If this could be done a lot earlier aka at xxx_fill_info() bpf would
> be a very good answer.

IMO, bpf at the fill_info stage is not appropriate.


> skb->sk (hence attached filter) should be available at that point.
> Classical bpf per Sowmini's example maybe trickier.
> Better - why dont we have an ebpf hook at this stage and then
> we dont have to make changes to the kernel when someone adds
> one more field to the filter?
> 
> BTW: useful for events as well - not just dumps (as the name
> fib_dump_filter suggests)

you mean kernel notifications on internal events?
1. there is no user socket when notifications are created and the
*_fill_info is invoked

2. notifications are global going to potentially many sockets. For these
cases the existing sk_filter is appropriate.


Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread David Ahern
On 10/11/18 10:10 AM, Sowmini Varadhan wrote:
> On (10/11/18 09:32), David Ahern wrote:
>>
>> Route dumps are done for the entire FIB for each address family. As we
>> approach internet routing tables (700k+ routes for IPv4, currently
>> around 55k for IPv6) with many VRFs dumping the entire table is grossly
>> inefficient when for example only a single VRF table is wanted.
> 
> I think someone mentioned a long time ago that a VRF is not an 
> interface/driver/net_device but rather a separate routing table with a
> dedicated set of interfaces, iirc :-) :-)
> 
> In the latter model, if you wanted to dump a VRF table, you'd only
> lock that table, and walk it, instead of holding up other VRFS 
> 
> sorry, could not resist my i-told-you-so moment :-P

huh?

VRF is a device has absolutely no relevance to this patch set.

There is no existing mechanism for dumping a single table. That's one of
the points of this set.


Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread Sowmini Varadhan
On (10/11/18 09:32), David Ahern wrote:
> 
> Route dumps are done for the entire FIB for each address family. As we
> approach internet routing tables (700k+ routes for IPv4, currently
> around 55k for IPv6) with many VRFs dumping the entire table is grossly
> inefficient when for example only a single VRF table is wanted.

I think someone mentioned a long time ago that a VRF is not an 
interface/driver/net_device but rather a separate routing table with a
dedicated set of interfaces, iirc :-) :-)

In the latter model, if you wanted to dump a VRF table, you'd only
lock that table, and walk it, instead of holding up other VRFS 

sorry, could not resist my i-told-you-so moment :-P

--Sowmini


Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread Jamal Hadi Salim

On 2018-10-11 11:46 a.m., Sowmini Varadhan wrote:

On (10/11/18 08:26), Stephen Hemminger wrote:

You can do the something like this already with BPF socket filters.
But writing BPF for multi-part messages is hard.


Indeed. And I was just experimenting with this for ARP just last week.
So to handle the caes of "ip neigh show a.b.c.d" without walking through
the entire arp table and filtering in userspace, you could add a sk_filter()
hook like this:



If this could be done a lot earlier aka at xxx_fill_info() bpf would
be a very good answer.
skb->sk (hence attached filter) should be available at that point.
Classical bpf per Sowmini's example maybe trickier.
Better - why dont we have an ebpf hook at this stage and then
we dont have to make changes to the kernel when someone adds
one more field to the filter?

BTW: useful for events as well - not just dumps (as the name
fib_dump_filter suggests)

cheers,
jamal


Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread Sowmini Varadhan
On (10/11/18 08:26), Stephen Hemminger wrote:
> You can do the something like this already with BPF socket filters.
> But writing BPF for multi-part messages is hard.

Indeed. And I was just experimenting with this for ARP just last week.
So to handle the caes of "ip neigh show a.b.c.d" without walking through
the entire arp table and filtering in userspace, you could add a sk_filter()
hook like this:

@@ -2258,6 +2260,24 @@ static int neigh_fill_info(struct sk_buff *skb, struct ne
goto nla_put_failure;
 
nlmsg_end(skb, nlh);
+
+   /* XXX skb->sk can be null in the neigh_timer_handler->__neigh_notify 
+* path. Revisit..
+*/
+   if (!skb->sk)
+   return 0;
+
+   /* pull/push skb->data pointers so that sk_filter only sees the
+* most recent nlh that wasjust added.
+*/
+   len = skb->len - nlh->nlmsg_len;
+   skb_pull(skb, len);
+   ret = sk_filter(skb->sk, skb);
+   skb_push(skb, len);
+   if (ret)
+   nlmsg_cancel(skb, nlh);
return 0;
 
Writing the cBPF filter is not horrible, due to the nla extension. e.g.,
to pass a filter that matches on if_index and ipv4 address, the bpf_asm
src below will do the job. The benefit of using cBPF is that we can 
use this older kernels as well

/*
 * Generated from the bpf_asm src
 * ldb [20] ; len(nlmsghdr) + offsetof(ndm_ifindex)
 * jne sll->sll_ifindex, skip
 * ld #28   ; A <- len(nlmsghdr) + len(ndmsg), payload offset
 * ldx #1   ; X <-  NDA_DST
 * ld #nla  ; A <- offset(NDA_DST)
 * jeq #0, skip
 * tax
 * ld [x + 4]   ; A <- value(NDA_DST)
 * jneq htonl(addr), skip
 * ret #-1
 * skip: ret #0
 */
struct sock_filter bpf_filter[] = {
{ 0x30,  0,  0, 0x0014 },
{ 0x15,  0,  1, sll->sll_ifindex },
{ ,  0,  0, 0x001c },
{ 0x01,  0,  0, 0x0001 },
{ 0x20,  0,  0, 0xf00c },
{ 0x15,  4,  0, 00 },
{ 0x07,  0,  0, 00 },
{ 0x40,  0,  0, 0x0004 },
{ 0x15,  0,  1, htonl(addr) },
{ 0x06,  0,  0, 0x },
{ 0x06,  0,  0, 00 },
{ 0x06,  0,  0, 0x },
{ 0x06,  0,  0, 00 },
};


Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread David Ahern
On 10/11/18 9:26 AM, Stephen Hemminger wrote:
>>
> 
> You can do the something like this already with BPF socket filters.
> But writing BPF for multi-part messages is hard.
> 
> Maybe a generic eBPF filter mechanism would be more flexible?
> 

That exists today and does not cover what is needed here:
1. The filters apply *after* the skb has been filled in.

2. an skb will have many routes within it and the user filter could
apply to any of those messages within the skb. It is not efficient to
generate the skb and then re-create it with a bpf filter.

The point here is to not even fill in the skb for something userspace
does not care about.

Route dumps are done for the entire FIB for each address family. As we
approach internet routing tables (700k+ routes for IPv4, currently
around 55k for IPv6) with many VRFs dumping the entire table is grossly
inefficient when for example only a single VRF table is wanted.


Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread Stephen Hemminger
On Thu, 11 Oct 2018 08:06:18 -0700
David Ahern  wrote:

> From: David Ahern 
> 
> Implement kernel side filtering of route dumps by protocol (e.g., which
> routing daemon installed the route), route type (e.g., unicast), table
> id and nexthop device.
> 
> iproute2 has been doing this filtering in userspace for years; pushing
> the filters to the kernel side reduces the amount of data the kernel
> sends and reduces wasted cycles on both sides processing unwanted data.
> These initial options provide a huge improvement for efficiently
> examining routes on large scale systems.
> 
> David Ahern (9):
>   net: Add struct for fib dump filter
>   net/ipv4: Plumb support for filtering route dumps
>   net/ipv6: Plumb support for filtering route dumps
>   net/mpls: Plumb support for filtering route dumps
>   net: Plumb support for filtering ipv4 and ipv6 multicast route dumps
>   net: Enable kernel side filtering of route dumps
>   net/mpls: Handle kernel side filtering of route dumps
>   net/ipv6: Bail early if user only wants cloned entries
>   net/ipv4: Bail early if user only wants prefix entries
> 
>  include/linux/mroute_base.h |  5 +--
>  include/net/ip6_route.h |  1 +
>  include/net/ip_fib.h| 14 ++--
>  net/ipv4/fib_frontend.c | 64 +++--
>  net/ipv4/fib_trie.c | 37 +--
>  net/ipv4/ipmr.c |  8 +++--
>  net/ipv4/ipmr_base.c| 33 -
>  net/ipv6/ip6_fib.c  | 17 +++--
>  net/ipv6/ip6mr.c|  7 ++--
>  net/ipv6/route.c| 40 -
>  net/mpls/af_mpls.c  | 86 
> +++--
>  11 files changed, 262 insertions(+), 50 deletions(-)
> 

You can do the something like this already with BPF socket filters.
But writing BPF for multi-part messages is hard.

Maybe a generic eBPF filter mechanism would be more flexible?


[PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread David Ahern
From: David Ahern 

Implement kernel side filtering of route dumps by protocol (e.g., which
routing daemon installed the route), route type (e.g., unicast), table
id and nexthop device.

iproute2 has been doing this filtering in userspace for years; pushing
the filters to the kernel side reduces the amount of data the kernel
sends and reduces wasted cycles on both sides processing unwanted data.
These initial options provide a huge improvement for efficiently
examining routes on large scale systems.

David Ahern (9):
  net: Add struct for fib dump filter
  net/ipv4: Plumb support for filtering route dumps
  net/ipv6: Plumb support for filtering route dumps
  net/mpls: Plumb support for filtering route dumps
  net: Plumb support for filtering ipv4 and ipv6 multicast route dumps
  net: Enable kernel side filtering of route dumps
  net/mpls: Handle kernel side filtering of route dumps
  net/ipv6: Bail early if user only wants cloned entries
  net/ipv4: Bail early if user only wants prefix entries

 include/linux/mroute_base.h |  5 +--
 include/net/ip6_route.h |  1 +
 include/net/ip_fib.h| 14 ++--
 net/ipv4/fib_frontend.c | 64 +++--
 net/ipv4/fib_trie.c | 37 +--
 net/ipv4/ipmr.c |  8 +++--
 net/ipv4/ipmr_base.c| 33 -
 net/ipv6/ip6_fib.c  | 17 +++--
 net/ipv6/ip6mr.c|  7 ++--
 net/ipv6/route.c| 40 -
 net/mpls/af_mpls.c  | 86 +++--
 11 files changed, 262 insertions(+), 50 deletions(-)

-- 
2.11.0