Re: [RFC 02/11] Add RoCE driver framework

2016-09-14 Thread Leon Romanovsky
On Thu, Sep 15, 2016 at 05:11:03AM +, Mintz, Yuval wrote:
> > As a summary, I didn't see in your responses any real-life example where
> > you will need a global debug level for your driver.
>
> Not sure what you're expecting - a list of BZs / private e-mails where
> user reproductions were needed?
> You're basically ignoring my claims that such are used, instead wanting
> "evidence". I'm not going to try and produce any such.

I asked for an example, not evidence, of a case where "modprobe your_driver
debug=1" would be superior to "modprobe your_driver dyndbg==pmf".

https://www.kernel.org/doc/Documentation/dynamic-debug-howto.txt
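For readers following along, the dynamic-debug alternative Leon refers to looks roughly like this (the module name `qedr` is purely illustrative; this assumes CONFIG_DYNAMIC_DEBUG is set and debugfs is mounted):

```shell
# Enable every pr_debug() callsite in the module at load time; '==' sets
# the flags exactly: p (print), m (module name), f (function name).
modprobe qedr dyndbg==pmf

# The same thing at run time, with finer granularity, via the control file:
echo 'module qedr +p'       > /sys/kernel/debug/dynamic_debug/control
echo 'func qedr_probe +pmfl' > /sys/kernel/debug/dynamic_debug/control
```

These are privileged administrative commands, shown only to illustrate the query syntax from the howto linked above.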




Re: [PATCH V2 2/3] net-next: dsa: add Qualcomm tag RX/TX handler

2016-09-14 Thread Florian Fainelli
On 09/14/2016 03:39 AM, John Crispin wrote:
> Add support for the 2-byte Qualcomm tag that gigabit switches such as
> the QCA8337/N might insert when receiving packets, or that we need
> to insert while targeting specific switch ports. The tag is inserted
> directly behind the Ethernet header.
> 
> Signed-off-by: John Crispin 

Reviewed-by: Florian Fainelli 
-- 
Florian


RE: [RFC 02/11] Add RoCE driver framework

2016-09-14 Thread Mintz, Yuval
> > > If you want dynamic prints, you have two options:
> > > 1. Add support of ethtool to whole RDMA stack.
> > > 2. Use dynamic tracing infrastructure.
> >
> > > Which option do you prefer?
> > Option 3 - continuing this discussion. :-)
> 
> Sorry,
> I was under the impression that you wanted this driver to be merged, but it
> looks like that was an incorrect assumption. Let's continue the discussion.

No, this is an RFC - there's no chance for *this* to merge, but this is exactly
the right time to discuss this sort of stuff.

> > Perhaps I misread your intentions - I thought that by dynamic debug
> > you meant that all debug in RDMA should be pr_debug() based, and
> > therefore my objection regarding the ease with which users can
> > configure it.
> 
> It is not for all of RDMA, but for your proposed driver. You are adding
> this "debug" module parameter to your module.

I don't get your answer. 
I made a generic remark [and actually one in favor of your arguments],
and instead of saying something meaningful you bash the driver.

> > If all you meant was 'dynamically set' as opposed to 'statically set'
> > then I agree that having that sort of configurability is preferable
> > [Even though end-user would still probably prefer a module parameter
> > for reproductions; As the name implies, 'debug' isn't meant to be used
> > in other situations].
> 
> We are not adding code just for fun but for a real reason, especially
> interfaces which will be visible to the user.
> 
> The overall expectation from a driver's authors is that they are
> submitting a driver which doesn't have bring-up issues. For real-life
> scenarios, where bugs will be revealed after some time of usage, this
> global debug is useless.

This has nothing to do with bring-up; real drivers experience issues years
after they're productized.

> > Do notice you would be harming user-experience of reproductions though
> > - as it would have to follow different mechanisms to open debug prints
> > of various qed* components.
> 
> I don't understand this point at all. Do you think that it is normal to
> ask a user to debug your driver? Is this called "user-experience"?

No, I call this 'user involved in fixing the driver' - it has nothing to do with
user-experience. Sometimes a user has specifics in his system that can't
be easily identified, and thus lab reproductions fail and the user assists
in the reproduction. While I never claimed this is good practice, it does happen.

> As a summary, I didn't see in your responses any real-life example where
> you will need a global debug level for your driver.

Not sure what you're expecting - a list of BZs / private e-mails where
user reproductions were needed?
You're basically ignoring my claims that such are used, instead wanting
"evidence". I'm not going to try and produce any such.

Doug - I think we need a definite answer from you here; it doesn't look like
this discussion will bear any fruit.
If a debug module parameter is completely unacceptable, we'd remove it
[regardless of what I think about it].



Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 09:38:16PM -0700, Andy Lutomirski wrote:
> On Wed, Sep 14, 2016 at 9:31 PM, Alexei Starovoitov
>  wrote:
> > On Wed, Sep 14, 2016 at 09:08:57PM -0700, Andy Lutomirski wrote:
> >> On Wed, Sep 14, 2016 at 9:00 PM, Alexei Starovoitov
> >>  wrote:
> >> > On Wed, Sep 14, 2016 at 07:27:08PM -0700, Andy Lutomirski wrote:
> >> >> >> >
> >> >> >> > This RFC handle both cgroup and seccomp approaches in a similar 
> >> >> >> > way. I
> >> >> >> > don't see why building on top of cgroup v2 is a problem. Is there
> >> >> >> > security issues with delegation?
> >> >> >>
> >> >> >> What I mean is: cgroup v2 delegation has a functionality problem.
> >> >> >> Tejun says [1]:
> >> >> >>
> >> >> >> We haven't had to face this decision because cgroup has never 
> >> >> >> properly
> >> >> >> supported delegating to applications and the in-use setups where this
> >> >> >> happens are custom configurations where there is no boundary between
> >> >> >> system and applications and adhoc trial-and-error is good enough a 
> >> >> >> way
> >> >> >> to find a working solution.  That wiggle room goes away once we
> >> >> >> officially open this up to individual applications.
> >> >> >>
> >> >> >> Unless and until that changes, I think that landlock should stay away
> >> >> >> from cgroups.  Others could reasonably disagree with me.
> >> >> >
> >> >> > Ours and Sargun's use cases for cgroup+lsm+bpf is not for security
> >> >> > and not for sandboxing. So the above doesn't matter in such contexts.
> >> >> > lsm hooks + cgroups provide convenient scope and existing entry 
> >> >> > points.
> >> >> > Please see checmate examples how it's used.
> >> >> >
> >> >>
> >> >> To be clear: I'm not arguing at all that there shouldn't be
> >> >> bpf+lsm+cgroup integration.  I'm arguing that the unprivileged
> >> >> landlock interface shouldn't expose any cgroup integration, at least
> >> >> until the cgroup situation settles down a lot.
> >> >
> >> > ahh. yes. we're perfectly in agreement here.
> >> > I'm suggesting that the next RFC shouldn't include unpriv
> >> > and seccomp at all. Once bpf+lsm+cgroup is merged, we can
> >> > argue about unpriv with cgroups and even unpriv as a whole,
> >> > since it's not a given. Seccomp integration is also questionable.
> >> > I'd rather not have seccomp as a gate keeper for this lsm.
> >> > lsm and seccomp are orthogonal hook points. Syscalls and lsm hooks
> >> > don't have one to one relationship, so mixing them up is only
> >> > asking for trouble further down the road.
> >> > If we really need to carry some information from seccomp to lsm+bpf,
> >> > it's easier to add eBPF support to seccomp and let bpf side deal
> >> > with passing whatever information.
> >> >
> >>
> >> As an argument for keeping seccomp (or an extended seccomp) as the
> >> interface for an unprivileged bpf+lsm: seccomp already checks off most
> >> of the boxes for safely letting unprivileged programs sandbox
> >> themselves.
> >
> > you mean the attach part of seccomp syscall that deals with no_new_priv?
> > sure, that's reusable.
> >
> >> Furthermore, to the extent that there are use cases for
> >> unprivileged bpf+lsm that *aren't* expressible within the seccomp
> >> hierarchy, I suspect that syscall filters have exactly the same
> >> problem and that we should fix seccomp to cover it.
> >
> > not sure what you mean by 'seccomp hierarchy'. The normal process
> > hierarchy ?
> 
> Kind of.  I mean the filter layers that are inherited across fork(),
> the TSYNC mechanism, etc.
> 
> > imo the main deficiency of seccomp is its inability to look into arguments.
> > One can argue that it's a blessing, since composite args
> > are not yet copied into the kernel memory.
> > But in a lot of cases the seccomp arguments are FDs pointing
> > to kernel objects and if programs could examine those objects
> > the sandboxing scope would be more precise.
> > lsm+bpf solves that part and I'd still argue that it's
> > orthogonal to seccomp's pass/reject flow.
> > I mean if seccomp says 'ok' the syscall should continue executing
> > as normal and whatever LSM hooks were triggered by it may have
> > their own lsm+bpf verdicts.
> 
> I agree with all of this...
> 
> > Furthermore in the process hierarchy different children
> > should be able to set their own lsm+bpf filters that are not
> > related to parallel seccomp+bpf hierarchy of programs.
> > seccomp syscall can be an interface to attach programs
> > to lsm hooks, but nothing more than that.
> 
> I'm not sure what you mean.  I mean that, logically, I think we should
> be able to do:
> 
> seccomp(attach a syscall filter);
> fork();
> child does seccomp(attach some lsm filters);
> 
> I think that they *should* be related to the seccomp+bpf hierarchy of
> programs in that they are entries in the same logical list of filter
> layers installed.  Some of those layers can be syscall filters and
> some of the layers can be lsm filters.

Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Andy Lutomirski
On Wed, Sep 14, 2016 at 9:31 PM, Alexei Starovoitov
 wrote:
> On Wed, Sep 14, 2016 at 09:08:57PM -0700, Andy Lutomirski wrote:
>> On Wed, Sep 14, 2016 at 9:00 PM, Alexei Starovoitov
>>  wrote:
>> > On Wed, Sep 14, 2016 at 07:27:08PM -0700, Andy Lutomirski wrote:
>> >> >> >
>> >> >> > This RFC handle both cgroup and seccomp approaches in a similar way. 
>> >> >> > I
>> >> >> > don't see why building on top of cgroup v2 is a problem. Is there
>> >> >> > security issues with delegation?
>> >> >>
>> >> >> What I mean is: cgroup v2 delegation has a functionality problem.
>> >> >> Tejun says [1]:
>> >> >>
>> >> >> We haven't had to face this decision because cgroup has never properly
>> >> >> supported delegating to applications and the in-use setups where this
>> >> >> happens are custom configurations where there is no boundary between
>> >> >> system and applications and adhoc trial-and-error is good enough a way
>> >> >> to find a working solution.  That wiggle room goes away once we
>> >> >> officially open this up to individual applications.
>> >> >>
>> >> >> Unless and until that changes, I think that landlock should stay away
>> >> >> from cgroups.  Others could reasonably disagree with me.
>> >> >
>> >> > Ours and Sargun's use cases for cgroup+lsm+bpf is not for security
>> >> > and not for sandboxing. So the above doesn't matter in such contexts.
>> >> > lsm hooks + cgroups provide convenient scope and existing entry points.
>> >> > Please see checmate examples how it's used.
>> >> >
>> >>
>> >> To be clear: I'm not arguing at all that there shouldn't be
>> >> bpf+lsm+cgroup integration.  I'm arguing that the unprivileged
>> >> landlock interface shouldn't expose any cgroup integration, at least
>> >> until the cgroup situation settles down a lot.
>> >
>> > ahh. yes. we're perfectly in agreement here.
>> > I'm suggesting that the next RFC shouldn't include unpriv
>> > and seccomp at all. Once bpf+lsm+cgroup is merged, we can
>> > argue about unpriv with cgroups and even unpriv as a whole,
>> > since it's not a given. Seccomp integration is also questionable.
>> > I'd rather not have seccomp as a gate keeper for this lsm.
>> > lsm and seccomp are orthogonal hook points. Syscalls and lsm hooks
>> > don't have one to one relationship, so mixing them up is only
>> > asking for trouble further down the road.
>> > If we really need to carry some information from seccomp to lsm+bpf,
>> > it's easier to add eBPF support to seccomp and let bpf side deal
>> > with passing whatever information.
>> >
>>
>> As an argument for keeping seccomp (or an extended seccomp) as the
>> interface for an unprivileged bpf+lsm: seccomp already checks off most
>> of the boxes for safely letting unprivileged programs sandbox
>> themselves.
>
> you mean the attach part of seccomp syscall that deals with no_new_priv?
> sure, that's reusable.
>
>> Furthermore, to the extent that there are use cases for
>> unprivileged bpf+lsm that *aren't* expressible within the seccomp
>> hierarchy, I suspect that syscall filters have exactly the same
>> problem and that we should fix seccomp to cover it.
>
> not sure what you mean by 'seccomp hierarchy'. The normal process
> hierarchy ?

Kind of.  I mean the filter layers that are inherited across fork(),
the TSYNC mechanism, etc.

> imo the main deficiency of seccomp is its inability to look into arguments.
> One can argue that it's a blessing, since composite args
> are not yet copied into the kernel memory.
> But in a lot of cases the seccomp arguments are FDs pointing
> to kernel objects and if programs could examine those objects
> the sandboxing scope would be more precise.
> lsm+bpf solves that part and I'd still argue that it's
> orthogonal to seccomp's pass/reject flow.
> I mean if seccomp says 'ok' the syscall should continue executing
> as normal and whatever LSM hooks were triggered by it may have
> their own lsm+bpf verdicts.

I agree with all of this...

> Furthermore in the process hierarchy different children
> should be able to set their own lsm+bpf filters that are not
> related to parallel seccomp+bpf hierarchy of programs.
> seccomp syscall can be an interface to attach programs
> to lsm hooks, but nothing more than that.

I'm not sure what you mean.  I mean that, logically, I think we should
be able to do:

seccomp(attach a syscall filter);
fork();
child does seccomp(attach some lsm filters);

I think that they *should* be related to the seccomp+bpf hierarchy of
programs in that they are entries in the same logical list of filter
layers installed.  Some of those layers can be syscall filters and
some of the layers can be lsm filters.  If we subsequently add a way
to attach a removable seccomp filter or a way to attach a seccomp
filter that logs failures to some fd watched by an outside monitor, I
think that should work for lsm, too, with more or less the same
interface.
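For readers unfamiliar with the existing interface: the "attach a syscall filter" step in Andy's sketch looks roughly like this today. A minimal, self-contained sketch (Linux-only; a production filter would also validate `seccomp_data.arch` before trusting the syscall number) that installs one filter layer making `getpid()` fail with EPERM:

```c
#include <assert.h>
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Install one seccomp filter layer that makes getpid() fail with EPERM,
 * then verify it took effect.  Returns 0 on success. */
int deny_getpid_with_eperm(void)
{
	struct sock_filter filter[] = {
		/* A = seccomp_data.nr (the syscall number) */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* if (nr == __NR_getpid) fall through, else skip one insn */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
		BPF_STMT(BPF_RET | BPF_K,
			 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	/* no_new_privs is what makes unprivileged attachment safe */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	/* Attach one filter layer; each further call stacks another layer */
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
		return -1;

	/* The filtered syscall now fails with EPERM instead of running */
	return (syscall(SYS_getpid) == -1 && errno == EPERM) ? 0 : -1;
}
```

Each call installs one more entry in the per-task list of filter layers, which is exactly the "same logical list" notion above: a child after fork() inherits the parent's layers and can stack its own on top.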

If we need a way for a sandbox manager to opt 

Re: [RFC 02/11] Add RoCE driver framework

2016-09-14 Thread Leon Romanovsky
On Wed, Sep 14, 2016 at 06:25:38PM +, Mintz, Yuval wrote:
> > > > > >> >> +uint debug;
> > > > > >> >> +module_param(debug, uint, 0);
> > > > > >> >> +MODULE_PARM_DESC(debug, "Default debug msglevel");
> > > > > >>
> > > > > >> >Why are you adding this as a module parameter?
> > > > > >>
> > > > > >>  I believe this is mostly to follow same line as qede which also 
> > > > > >>defines
> > > > > > > 'debug' module parameter for allowing easy user control of debug
> > > > > > > prints [& specifically for probe prints, which can't be controlled
> > > > > > > otherwise].
> > > > >
> > > > > > Can you give us an example where dynamic debug and tracing 
> > > > > > infrastructures
> > > > > > are not enough?
> > > > >
> > > > > > AFAIK, most of these debug module parameters are legacy copy/paste
> > > > > > code which is useless in real life scenarios.
> > > > >
> > > > > Define 'enough'; Using dynamic debug you can provide all the necessary
> > > > > information at an even better granularity than that achieved by the
> > > > > suggested infrastructure, but it is harder for an end-user to use.
> > > > > Same goes for tracing.
> > > > >
> > > > > The 'debug' option provides an easy grouping for prints related to a 
> > > > > specific
> > > > > area in the driver.
> > > >
> > > > It is hard to agree with you that a user who knows how to load modules
> > > > with parameters won't succeed in enabling debug prints.
> > >
> > > I think you're giving too much credit to the end-user. :-D
> > >
> > > > In addition, a global increase in debug level for the whole driver will
> > > > create a printk storm in dmesg and add nothing to debuggability.
> > >
> > > So basically, what you're claiming is that the ethtool 'msglvl' setting
> > > for devices is completely obsolete. While this *might* be true, we use it
> > > extensively in our qede and qed drivers; the debug module parameter merely
> > > provides a manner of setting the debug value prior to initial probe for
> > > all interfaces. qedr follows the same practice.
>
> > Thanks for this excellent example. Ethtool 'msglvl' adds this
> > dynamically, while your 'debug' argument works only at module load.
>
> > If you want dynamic prints, you have two options:
> > 1. Add support of ethtool to whole RDMA stack.
> > 2. Use dynamic tracing infrastructure.
>
> > Which option do you prefer?
> Option 3 - continuing this discussion. :-)

Sorry,
I was under the impression that you wanted this driver to be merged, but it
looks like that was an incorrect assumption. Let's continue the discussion.

>
> Perhaps I misread your intentions - I thought that by dynamic debug
> you meant that all debug in RDMA should be pr_debug() based, and
> therefore my objection regarding the ease with which users can
> configure it.

It is not for all of RDMA, but for your proposed driver. You are adding
this "debug" module parameter to your module.

> If all you meant was 'dynamically set' as opposed to 'statically set'
> then I agree that having that sort of configurability is preferable
> [Even though end-user would still probably prefer a module
> parameter for reproductions; As the name implies, 'debug' isn't
> meant to be used in other situations].

We are not adding code just for fun but for a real reason, especially
interfaces which will be visible to the user.

The overall expectation from a driver's authors is that they are
submitting a driver which doesn't have bring-up issues. For real-life
scenarios, where bugs will be revealed after some time of usage, this
global debug is useless.

>
> The other thing to consider are the probe-time prints.
> Problem is, you wouldn't have a control node between probe
> and until after the probing would be over, so it would be a bit
> hard to configure that.
> You can always think of some generic method of fixing that as well
> [sysfs node for the entire system for probe-time prints, perhaps?]

/sys/kernel/debug/tracing
/sys/kernel/debug/dynamic_debug
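To make these paths concrete for the probe-time case raised above: dyndbg queries passed on the modprobe command line are applied when the module is loaded, i.e. before its probe routine runs, so probe-time pr_debug() output is covered without a module parameter (module name illustrative):

```shell
# Applied at load time, before the driver's probe routine runs:
modprobe qedr dyndbg=+pmf

# List the driver's pr_debug() callsites and their current flags:
grep qedr /sys/kernel/debug/dynamic_debug/control
```

Again, these are privileged administrative commands shown only to illustrate the mechanism.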

>
> Do notice you would be harming user-experience of reproductions
> though - as it would have to follow different mechanisms to open
> debug prints of various qed* components.

I don't understand this point at all. Do you think that it is normal to
ask a user to debug your driver? Is this called "user-experience"?

And regarding the second point, the old code is not an excuse to
copy/paste bad practices.

As a summary, I didn't see in your responses any real-life example where
you will need a global debug level for your driver.

Thanks
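For reference, the ethtool 'msglvl' mechanism debated above is driven like this (interface name and values illustrative; the bitmap is composed of the kernel's NETIF_MSG_* classes):

```shell
# Show the current message level:
ethtool eth0 | grep -i 'message level'

# Set it as a raw NETIF_MSG_* bitmap ...
ethtool -s eth0 msglvl 0x2041

# ... or by named message class:
ethtool -s eth0 msglvl probe on link on tx_err on
```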




Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 09:08:57PM -0700, Andy Lutomirski wrote:
> On Wed, Sep 14, 2016 at 9:00 PM, Alexei Starovoitov
>  wrote:
> > On Wed, Sep 14, 2016 at 07:27:08PM -0700, Andy Lutomirski wrote:
> >> >> >
> >> >> > This RFC handle both cgroup and seccomp approaches in a similar way. I
> >> >> > don't see why building on top of cgroup v2 is a problem. Is there
> >> >> > security issues with delegation?
> >> >>
> >> >> What I mean is: cgroup v2 delegation has a functionality problem.
> >> >> Tejun says [1]:
> >> >>
> >> >> We haven't had to face this decision because cgroup has never properly
> >> >> supported delegating to applications and the in-use setups where this
> >> >> happens are custom configurations where there is no boundary between
> >> >> system and applications and adhoc trial-and-error is good enough a way
> >> >> to find a working solution.  That wiggle room goes away once we
> >> >> officially open this up to individual applications.
> >> >>
> >> >> Unless and until that changes, I think that landlock should stay away
> >> >> from cgroups.  Others could reasonably disagree with me.
> >> >
> >> > Ours and Sargun's use cases for cgroup+lsm+bpf is not for security
> >> > and not for sandboxing. So the above doesn't matter in such contexts.
> >> > lsm hooks + cgroups provide convenient scope and existing entry points.
> >> > Please see checmate examples how it's used.
> >> >
> >>
> >> To be clear: I'm not arguing at all that there shouldn't be
> >> bpf+lsm+cgroup integration.  I'm arguing that the unprivileged
> >> landlock interface shouldn't expose any cgroup integration, at least
> >> until the cgroup situation settles down a lot.
> >
> > ahh. yes. we're perfectly in agreement here.
> > I'm suggesting that the next RFC shouldn't include unpriv
> > and seccomp at all. Once bpf+lsm+cgroup is merged, we can
> > argue about unpriv with cgroups and even unpriv as a whole,
> > since it's not a given. Seccomp integration is also questionable.
> > I'd rather not have seccomp as a gate keeper for this lsm.
> > lsm and seccomp are orthogonal hook points. Syscalls and lsm hooks
> > don't have one to one relationship, so mixing them up is only
> > asking for trouble further down the road.
> > If we really need to carry some information from seccomp to lsm+bpf,
> > it's easier to add eBPF support to seccomp and let bpf side deal
> > with passing whatever information.
> >
> 
> As an argument for keeping seccomp (or an extended seccomp) as the
> interface for an unprivileged bpf+lsm: seccomp already checks off most
> of the boxes for safely letting unprivileged programs sandbox
> themselves.  

you mean the attach part of seccomp syscall that deals with no_new_priv?
sure, that's reusable.

> Furthermore, to the extent that there are use cases for
> unprivileged bpf+lsm that *aren't* expressible within the seccomp
> hierarchy, I suspect that syscall filters have exactly the same
> problem and that we should fix seccomp to cover it.

not sure what you mean by 'seccomp hierarchy'. The normal process
hierarchy ?
imo the main deficiency of seccomp is its inability to look into arguments.
One can argue that it's a blessing, since composite args
are not yet copied into the kernel memory.
But in a lot of cases the seccomp arguments are FDs pointing
to kernel objects and if programs could examine those objects
the sandboxing scope would be more precise.
lsm+bpf solves that part and I'd still argue that it's
orthogonal to seccomp's pass/reject flow.
I mean if seccomp says 'ok' the syscall should continue executing
as normal and whatever LSM hooks were triggered by it may have
their own lsm+bpf verdicts.
Furthermore in the process hierarchy different children
should be able to set their own lsm+bpf filters that are not
related to parallel seccomp+bpf hierarchy of programs.
seccomp syscall can be an interface to attach programs
to lsm hooks, but nothing more than that.



Re: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes regardless of skb or not

2016-09-14 Thread John Fastabend
On 16-09-14 05:43 PM, Alexei Starovoitov wrote:
> On Wed, Sep 14, 2016 at 11:57:24PM +, Brown, Aaron F wrote:
>>> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On
>>> Behalf Of John Fastabend
>>> Sent: Monday, September 12, 2016 3:13 PM
>>> To: bbla...@plumgrid.com; john.fastab...@gmail.com;
>>> alexei.starovoi...@gmail.com; Kirsher, Jeffrey T
>>> ; bro...@redhat.com; da...@davemloft.net
>>> Cc: xiyou.wangc...@gmail.com; intel-wired-...@lists.osuosl.org;
>>> u9012...@gmail.com; netdev@vger.kernel.org
>>> Subject: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes
>>> regardless of skb or not
>>>
>>> The BQL API does not reference the sk_buff nor does the driver need to
>>> reference the sk_buff to calculate the length of a transmitted frame.
>>> This patch removes an sk_buff reference from the xmit irq path and
>>> also allows packets sent from XDP to use BQL.
>>>
>>> Signed-off-by: John Fastabend 
>>> ---
>>>  drivers/net/ethernet/intel/e1000/e1000_main.c |7 ++-
>>>  1 file changed, 2 insertions(+), 5 deletions(-)
>>
>> This patch is causing all my e1000 adapters to fail a simple ftp session
>> with really slow response (hashing on) followed by adapter resets.  I have
>> seen this on 4 different e1000 nics now: an 82543GC, 82544GC, 82546EB and
>> an 82545GM.  On a few occasions I get a splat captured to dmesg.  Here is
>> an example:
>> 
>> [ cut here ]
>> WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:316 
>> dev_watchdog+0x1c2/0x1d0
>> NETDEV WATCHDOG: eth1 (e1000): transmit queue 0 timed out
> 
> Thanks a lot for the tests! Really appreciate it.
> 

Thanks.

Jeff, please drop the series for now; obviously this won't work. It needs
some work.


Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Andy Lutomirski
On Wed, Sep 14, 2016 at 9:00 PM, Alexei Starovoitov
 wrote:
> On Wed, Sep 14, 2016 at 07:27:08PM -0700, Andy Lutomirski wrote:
>> >> >
>> >> > This RFC handle both cgroup and seccomp approaches in a similar way. I
>> >> > don't see why building on top of cgroup v2 is a problem. Is there
>> >> > security issues with delegation?
>> >>
>> >> What I mean is: cgroup v2 delegation has a functionality problem.
>> >> Tejun says [1]:
>> >>
>> >> We haven't had to face this decision because cgroup has never properly
>> >> supported delegating to applications and the in-use setups where this
>> >> happens are custom configurations where there is no boundary between
>> >> system and applications and adhoc trial-and-error is good enough a way
>> >> to find a working solution.  That wiggle room goes away once we
>> >> officially open this up to individual applications.
>> >>
>> >> Unless and until that changes, I think that landlock should stay away
>> >> from cgroups.  Others could reasonably disagree with me.
>> >
>> > Ours and Sargun's use cases for cgroup+lsm+bpf is not for security
>> > and not for sandboxing. So the above doesn't matter in such contexts.
>> > lsm hooks + cgroups provide convenient scope and existing entry points.
>> > Please see checmate examples how it's used.
>> >
>>
>> To be clear: I'm not arguing at all that there shouldn't be
>> bpf+lsm+cgroup integration.  I'm arguing that the unprivileged
>> landlock interface shouldn't expose any cgroup integration, at least
>> until the cgroup situation settles down a lot.
>
> ahh. yes. we're perfectly in agreement here.
> I'm suggesting that the next RFC shouldn't include unpriv
> and seccomp at all. Once bpf+lsm+cgroup is merged, we can
> argue about unpriv with cgroups and even unpriv as a whole,
> since it's not a given. Seccomp integration is also questionable.
> I'd rather not have seccomp as a gate keeper for this lsm.
> lsm and seccomp are orthogonal hook points. Syscalls and lsm hooks
> don't have one to one relationship, so mixing them up is only
> asking for trouble further down the road.
> If we really need to carry some information from seccomp to lsm+bpf,
> it's easier to add eBPF support to seccomp and let bpf side deal
> with passing whatever information.
>

As an argument for keeping seccomp (or an extended seccomp) as the
interface for an unprivileged bpf+lsm: seccomp already checks off most
of the boxes for safely letting unprivileged programs sandbox
themselves.  Furthermore, to the extent that there are use cases for
unprivileged bpf+lsm that *aren't* expressible within the seccomp
hierarchy, I suspect that syscall filters have exactly the same
problem and that we should fix seccomp to cover it.

If I ever add a "seccomp monitor", which is something I want to do
eventually, I think it should work for lsm+bpf as well, which is
another argument for keeping it in seccomp.

--Andy


Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 07:27:08PM -0700, Andy Lutomirski wrote:
> >> >
> >> > This RFC handle both cgroup and seccomp approaches in a similar way. I
> >> > don't see why building on top of cgroup v2 is a problem. Is there
> >> > security issues with delegation?
> >>
> >> What I mean is: cgroup v2 delegation has a functionality problem.
> >> Tejun says [1]:
> >>
> >> We haven't had to face this decision because cgroup has never properly
> >> supported delegating to applications and the in-use setups where this
> >> happens are custom configurations where there is no boundary between
> >> system and applications and adhoc trial-and-error is good enough a way
> >> to find a working solution.  That wiggle room goes away once we
> >> officially open this up to individual applications.
> >>
> >> Unless and until that changes, I think that landlock should stay away
> >> from cgroups.  Others could reasonably disagree with me.
> >
> > Ours and Sargun's use cases for cgroup+lsm+bpf is not for security
> > and not for sandboxing. So the above doesn't matter in such contexts.
> > lsm hooks + cgroups provide convenient scope and existing entry points.
> > Please see checmate examples how it's used.
> >
> 
> To be clear: I'm not arguing at all that there shouldn't be
> bpf+lsm+cgroup integration.  I'm arguing that the unprivileged
> landlock interface shouldn't expose any cgroup integration, at least
> until the cgroup situation settles down a lot.

ahh. yes. we're perfectly in agreement here.
I'm suggesting that the next RFC shouldn't include unpriv
and seccomp at all. Once bpf+lsm+cgroup is merged, we can
argue about unpriv with cgroups and even unpriv as a whole,
since it's not a given. Seccomp integration is also questionable.
I'd rather not have seccomp as a gate keeper for this lsm.
lsm and seccomp are orthogonal hook points. Syscalls and lsm hooks
don't have one to one relationship, so mixing them up is only
asking for trouble further down the road.
If we really need to carry some information from seccomp to lsm+bpf,
it's easier to add eBPF support to seccomp and let bpf side deal
with passing whatever information. 



[PATCH] nfp: fix error return code in nfp_net_netdev_open()

2016-09-14 Thread Wei Yongjun
From: Wei Yongjun 

Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: 73725d9dfd99 ("nfp: allocate ring SW structs dynamically")
Signed-off-by: Wei Yongjun 
---
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 252e492..39dadfc 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -2044,12 +2044,16 @@ static int nfp_net_netdev_open(struct net_device *netdev)
 
 	nn->rx_rings = kcalloc(nn->num_rx_rings, sizeof(*nn->rx_rings),
 			       GFP_KERNEL);
-	if (!nn->rx_rings)
+	if (!nn->rx_rings) {
+		err = -ENOMEM;
 		goto err_free_lsc;
+	}
 	nn->tx_rings = kcalloc(nn->num_tx_rings, sizeof(*nn->tx_rings),
 			       GFP_KERNEL);
-	if (!nn->tx_rings)
+	if (!nn->tx_rings) {
+		err = -ENOMEM;
 		goto err_free_rx_rings;
+	}
 
 	for (r = 0; r < nn->num_r_vecs; r++) {
 		err = nfp_net_prepare_vector(nn, &nn->r_vecs[r], r);

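The fix follows the kernel's standard goto-unwind error path. The same shape in a plain userspace sketch (struct and function names hypothetical), including the "err left as 0" bug the patch corrects:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the rx/tx ring pointers in the patch above. */
struct rings {
	int *rx;
	int *tx;
};

/* Allocate both ring arrays with goto-unwind error handling.
 * Returns 0 on success, or -ENOMEM with nothing left allocated. */
int rings_alloc(struct rings *r, size_t n_rx, size_t n_tx)
{
	int err;

	r->rx = calloc(n_rx, sizeof(*r->rx));
	if (!r->rx) {
		err = -ENOMEM;	/* the bug fixed above: forgetting this line
				 * makes the function return 0 (success) even
				 * though the allocation failed */
		goto err_out;
	}

	r->tx = calloc(n_tx, sizeof(*r->tx));
	if (!r->tx) {
		err = -ENOMEM;
		goto err_free_rx;
	}
	return 0;

err_free_rx:
	free(r->rx);
	r->rx = NULL;
err_out:
	return err;
}
```

Each later allocation jumps to a label that unwinds everything allocated before it, so the error paths stay linear and hard to get out of order.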


Re: [PATCH 0/5] Make /sys/class/net per net namespace objects belong to container

2016-09-14 Thread Dmitry Torokhov
On Mon, Aug 29, 2016 at 5:38 AM, Eric W. Biederman
 wrote:
> David Miller  writes:
>
>> From: Dmitry Torokhov 
>> Date: Tue, 16 Aug 2016 15:33:10 -0700
>>
>>> There are objects in /sys hierarchy (/sys/class/net/) that logically belong
>>> to a namespace/container. Unfortunately all sysfs objects start their life
>>> belonging to global root, and while we could change ownership manually,
>>> keeping tracks of all objects that come and go is cumbersome. It would
>>> be better if kernel created them using correct uid/gid from the beginning.
>>>
>>> This series changes kernfs to allow creating object's with arbitrary
>>> uid/gid, adds get_ownership() callback to ktype structure so subsystems
>>> could supply their own logic (likely tied to namespace support) for
>>> determining ownership of kobjects, and adjusts sysfs code to make use of
>>> this information. Lastly net-sysfs is adjusted to make sure that objects in
>>> net namespace are owned by the root user from the owning user namespace.
>>>
>>> Note that we do not adjust ownership of objects moved into a new namespace
>>> (as when moving a network device into a container) as userspace can easily
>>> do it.
>>
>> I need some domain experts to review this series please.
>
> I just came back from vacation and I will aim to take a look shortly.
>
> The big picture idea seems sensible.  Having a better ownship of sysfs
> files that are part of a network namespace.  I will have to look at the
> details to see if the implementation is similarly sensible.

Eric,

Did you find anything objectionable in the series or should I fix up
the !CONFIG_SYSFS error in networking patch and resubmit?

Thanks.

-- 
Dmitry


Re: [PATCH -next] net: dsa: b53: Remove unused including

2016-09-14 Thread Florian Fainelli
2016-09-14 19:24 GMT-07:00 Wei Yongjun :
> From: Wei Yongjun 
>
> Remove including  that don't need it.
>
> Signed-off-by: Wei Yongjun 

Acked-by: Florian Fainelli 
--
Florian


Re: [PATCH -next] net: dsa: bcm_sf2: Fix non static symbol warning

2016-09-14 Thread Florian Fainelli
2016-09-14 19:24 GMT-07:00 Wei Yongjun :
> From: Wei Yongjun 
>
> Fixes the following sparse warning:
>
> drivers/net/dsa/bcm_sf2.c:963:19: warning:
>  symbol 'bcm_sf2_io_ops' was not declared. Should it be static?
>
> Signed-off-by: Wei Yongjun 

Acked-by: Florian Fainelli 
-- 
Florian


Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Andy Lutomirski
On Wed, Sep 14, 2016 at 7:19 PM, Alexei Starovoitov
 wrote:
> On Wed, Sep 14, 2016 at 06:25:07PM -0700, Andy Lutomirski wrote:
>> On Wed, Sep 14, 2016 at 3:11 PM, Mickaël Salaün  wrote:
>> >
>> > On 14/09/2016 20:27, Andy Lutomirski wrote:
>> >> On Wed, Sep 14, 2016 at 12:24 AM, Mickaël Salaün  wrote:
>> >>> Add a new flag CGRP_NO_NEW_PRIVS for each cgroup. This flag is initially
>> >>> set for all cgroup except the root. The flag is clear when a new process
>> >>> without the no_new_privs flags is attached to the cgroup.
>> >>>
>> >>> If a cgroup is landlocked, then any new attempt, from an unprivileged
>> >>> process, to attach a process without no_new_privs to this cgroup will
>> >>> be denied.
>> >>
>> >> Until and unless everyone can agree on a way to properly namespace,
>> >> delegate, etc cgroups, I think that trying to add unprivileged
>> >> semantics to cgroups is nuts.  Given the big thread about cgroup v2,
>> >> no-internal-tasks, etc, I just don't see how this approach can be
>> >> viable.
>> >
>> > As far as I can tell, the no_new_privs flag of at task is not related to
>> > namespaces. The CGRP_NO_NEW_PRIVS flag is only a cache to quickly access
>> > the no_new_privs property of *tasks* in a cgroup. The semantic is 
>> > unchanged.
>> >
>> > Using cgroup is optional, any task could use the seccomp-based
>> > landlocking instead. However, for those that want/need to manage a
>> > security policy in a more dynamic way, using cgroups may make sense.
>> >
>> > I though cgroup delegation was OK in the v2, isn't it the case? Do you
>> > have some links?
>> >
>> >>
>> >> Can we try to make landlock work completely independently of cgroups
>> >> so that it doesn't get stuck and so that programs can use it without
>> >> worrying about cgroup v1 vs v2, interactions with cgroup managers,
>> >> cgroup managers that (supposedly?) will start migrating processes
>> >> around piecemeal and almost certainly blowing up landlock in the
>> >> process, etc?
>> >
>> > This RFC handle both cgroup and seccomp approaches in a similar way. I
>> > don't see why building on top of cgroup v2 is a problem. Is there
>> > security issues with delegation?
>>
>> What I mean is: cgroup v2 delegation has a functionality problem.
>> Tejun says [1]:
>>
>> We haven't had to face this decision because cgroup has never properly
>> supported delegating to applications and the in-use setups where this
>> happens are custom configurations where there is no boundary between
>> system and applications and adhoc trial-and-error is good enough a way
>> to find a working solution.  That wiggle room goes away once we
>> officially open this up to individual applications.
>>
>> Unless and until that changes, I think that landlock should stay away
>> from cgroups.  Others could reasonably disagree with me.
>
> Ours and Sargun's use cases for cgroup+lsm+bpf is not for security
> and not for sandboxing. So the above doesn't matter in such contexts.
> lsm hooks + cgroups provide convenient scope and existing entry points.
> Please see checmate examples how it's used.
>

To be clear: I'm not arguing at all that there shouldn't be
bpf+lsm+cgroup integration.  I'm arguing that the unprivileged
landlock interface shouldn't expose any cgroup integration, at least
until the cgroup situation settles down a lot.

--Andy


[PATCH -next] net: emac: remove .owner field for driver

2016-09-14 Thread Wei Yongjun
From: Wei Yongjun 

Remove .owner field if calls are used which set it automatically.

Generated by: scripts/coccinelle/api/platform_no_drv_owner.cocci

Signed-off-by: Wei Yongjun 
---
 drivers/net/ethernet/qualcomm/emac/emac.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/qualcomm/emac/emac.c 
b/drivers/net/ethernet/qualcomm/emac/emac.c
index 56e0a9f..764e7b4b 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac.c
@@ -731,7 +731,6 @@ static struct platform_driver emac_platform_driver = {
.probe  = emac_probe,
.remove = emac_remove,
.driver = {
-   .owner  = THIS_MODULE,
.name   = "qcom-emac",
.of_match_table = emac_dt_match,
},





[PATCH -next] net: emac: remove unnecessary dev_set_drvdata()

2016-09-14 Thread Wei Yongjun
From: Wei Yongjun 

The driver core clears the driver data to NULL after device_release
or on probe failure. Thus, it is not needed to manually clear the
device driver data to NULL.

Signed-off-by: Wei Yongjun 
---
 drivers/net/ethernet/qualcomm/emac/emac.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/qualcomm/emac/emac.c 
b/drivers/net/ethernet/qualcomm/emac/emac.c
index 56e0a9f..42d2d233 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac.c
@@ -722,7 +722,6 @@ static int emac_remove(struct platform_device *pdev)
 
mdiobus_unregister(adpt->mii_bus);
free_netdev(netdev);
-   dev_set_drvdata(&pdev->dev, NULL);
 
return 0;
 }





[PATCH -next] net: dsa: b53: Remove unused including

2016-09-14 Thread Wei Yongjun
From: Wei Yongjun 

Remove including  that don't need it.

Signed-off-by: Wei Yongjun 
---
 drivers/net/dsa/b53/b53_priv.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/dsa/b53/b53_priv.h b/drivers/net/dsa/b53/b53_priv.h
index 76672da..f192a67 100644
--- a/drivers/net/dsa/b53/b53_priv.h
+++ b/drivers/net/dsa/b53/b53_priv.h
@@ -372,7 +372,6 @@ static inline void b53_arl_from_entry(u64 *mac_vid, u32 
*fwd_entry,
 
 #ifdef CONFIG_BCM47XX
 
-#include 
 #include 
 #include 
 static inline int b53_switch_get_reset_gpio(struct b53_device *dev)





[PATCH -next] net: dsa: bcm_sf2: Fix non static symbol warning

2016-09-14 Thread Wei Yongjun
From: Wei Yongjun 

Fixes the following sparse warning:

drivers/net/dsa/bcm_sf2.c:963:19: warning:
 symbol 'bcm_sf2_io_ops' was not declared. Should it be static?

Signed-off-by: Wei Yongjun 
---
 drivers/net/dsa/bcm_sf2.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index 51f1fc0..f8bb5a5 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -960,7 +960,7 @@ static int bcm_sf2_core_write64(struct b53_device *dev, u8 
page, u8 reg,
return 0;
 }
 
-struct b53_io_ops bcm_sf2_io_ops = {
+static struct b53_io_ops bcm_sf2_io_ops = {
.read8  = bcm_sf2_core_read8,
.read16 = bcm_sf2_core_read16,
.read32 = bcm_sf2_core_read32,



Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 06:25:07PM -0700, Andy Lutomirski wrote:
> On Wed, Sep 14, 2016 at 3:11 PM, Mickaël Salaün  wrote:
> >
> > On 14/09/2016 20:27, Andy Lutomirski wrote:
> >> On Wed, Sep 14, 2016 at 12:24 AM, Mickaël Salaün  wrote:
> >>> Add a new flag CGRP_NO_NEW_PRIVS for each cgroup. This flag is initially
> >>> set for all cgroup except the root. The flag is clear when a new process
> >>> without the no_new_privs flags is attached to the cgroup.
> >>>
> >>> If a cgroup is landlocked, then any new attempt, from an unprivileged
> >>> process, to attach a process without no_new_privs to this cgroup will
> >>> be denied.
> >>
> >> Until and unless everyone can agree on a way to properly namespace,
> >> delegate, etc cgroups, I think that trying to add unprivileged
> >> semantics to cgroups is nuts.  Given the big thread about cgroup v2,
> >> no-internal-tasks, etc, I just don't see how this approach can be
> >> viable.
> >
> > As far as I can tell, the no_new_privs flag of at task is not related to
> > namespaces. The CGRP_NO_NEW_PRIVS flag is only a cache to quickly access
> > the no_new_privs property of *tasks* in a cgroup. The semantic is unchanged.
> >
> > Using cgroup is optional, any task could use the seccomp-based
> > landlocking instead. However, for those that want/need to manage a
> > security policy in a more dynamic way, using cgroups may make sense.
> >
> > I though cgroup delegation was OK in the v2, isn't it the case? Do you
> > have some links?
> >
> >>
> >> Can we try to make landlock work completely independently of cgroups
> >> so that it doesn't get stuck and so that programs can use it without
> >> worrying about cgroup v1 vs v2, interactions with cgroup managers,
> >> cgroup managers that (supposedly?) will start migrating processes
> >> around piecemeal and almost certainly blowing up landlock in the
> >> process, etc?
> >
> > This RFC handle both cgroup and seccomp approaches in a similar way. I
> > don't see why building on top of cgroup v2 is a problem. Is there
> > security issues with delegation?
> 
> What I mean is: cgroup v2 delegation has a functionality problem.
> Tejun says [1]:
> 
> We haven't had to face this decision because cgroup has never properly
> supported delegating to applications and the in-use setups where this
> happens are custom configurations where there is no boundary between
> system and applications and adhoc trial-and-error is good enough a way
> to find a working solution.  That wiggle room goes away once we
> officially open this up to individual applications.
> 
> Unless and until that changes, I think that landlock should stay away
> from cgroups.  Others could reasonably disagree with me.

Ours and Sargun's use cases for cgroup+lsm+bpf is not for security
and not for sandboxing. So the above doesn't matter in such contexts.
lsm hooks + cgroups provide convenient scope and existing entry points.
Please see checmate examples how it's used.



Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Andy Lutomirski
On Wed, Sep 14, 2016 at 3:11 PM, Mickaël Salaün  wrote:
>
> On 14/09/2016 20:27, Andy Lutomirski wrote:
>> On Wed, Sep 14, 2016 at 12:24 AM, Mickaël Salaün  wrote:
>>> Add a new flag CGRP_NO_NEW_PRIVS for each cgroup. This flag is initially
>>> set for all cgroup except the root. The flag is clear when a new process
>>> without the no_new_privs flags is attached to the cgroup.
>>>
>>> If a cgroup is landlocked, then any new attempt, from an unprivileged
>>> process, to attach a process without no_new_privs to this cgroup will
>>> be denied.
>>
>> Until and unless everyone can agree on a way to properly namespace,
>> delegate, etc cgroups, I think that trying to add unprivileged
>> semantics to cgroups is nuts.  Given the big thread about cgroup v2,
>> no-internal-tasks, etc, I just don't see how this approach can be
>> viable.
>
> As far as I can tell, the no_new_privs flag of at task is not related to
> namespaces. The CGRP_NO_NEW_PRIVS flag is only a cache to quickly access
> the no_new_privs property of *tasks* in a cgroup. The semantic is unchanged.
>
> Using cgroup is optional, any task could use the seccomp-based
> landlocking instead. However, for those that want/need to manage a
> security policy in a more dynamic way, using cgroups may make sense.
>
> I though cgroup delegation was OK in the v2, isn't it the case? Do you
> have some links?
>
>>
>> Can we try to make landlock work completely independently of cgroups
>> so that it doesn't get stuck and so that programs can use it without
>> worrying about cgroup v1 vs v2, interactions with cgroup managers,
>> cgroup managers that (supposedly?) will start migrating processes
>> around piecemeal and almost certainly blowing up landlock in the
>> process, etc?
>
> This RFC handle both cgroup and seccomp approaches in a similar way. I
> don't see why building on top of cgroup v2 is a problem. Is there
> security issues with delegation?

What I mean is: cgroup v2 delegation has a functionality problem.
Tejun says [1]:

We haven't had to face this decision because cgroup has never properly
supported delegating to applications and the in-use setups where this
happens are custom configurations where there is no boundary between
system and applications and adhoc trial-and-error is good enough a way
to find a working solution.  That wiggle room goes away once we
officially open this up to individual applications.

Unless and until that changes, I think that landlock should stay away
from cgroups.  Others could reasonably disagree with me.


[1] https://lkml.kernel.org/g/<20160909225747.ga30...@mtj.duckdns.org>


Re: [RFC v3 19/22] landlock: Add interrupted origin

2016-09-14 Thread Andy Lutomirski
On Wed, Sep 14, 2016 at 3:14 PM, Mickaël Salaün  wrote:
>
> On 14/09/2016 20:29, Andy Lutomirski wrote:
>> On Wed, Sep 14, 2016 at 12:24 AM, Mickaël Salaün  wrote:
>>> This third origin of hook call should cover all possible trigger paths
>>> (e.g. page fault). Landlock eBPF programs can then take decisions
>>> accordingly.
>>>
>>> Signed-off-by: Mickaël Salaün 
>>> Cc: Alexei Starovoitov 
>>> Cc: Andy Lutomirski 
>>> Cc: Daniel Borkmann 
>>> Cc: Kees Cook 
>>> ---
>>
>>
>>>
>>> +   if (unlikely(in_interrupt())) {
>>
>> IMO security hooks have no business being called from interrupts.
>> Aren't they all synchronous things done by tasks?  Interrupts are
>> driver things.
>>
>> Are you trying to check for page faults and such?
>
> Yes, that was the idea you did put in my mind. Not sure how to deal with
> this.
>

It's not so easy, unfortunately.  The easiest reliable way might be to
set a TS_ flag on all syscall entries when TIF_SECCOMP or similar is
set.

--Andy


Re: [PATCH V2 3/3] net-next: dsa: add new driver for qca8xxx family

2016-09-14 Thread Andrew Lunn
On Wed, Sep 14, 2016 at 12:39:02PM +0200, John Crispin wrote:
> This patch contains initial support for the QCA8337 switch. It
> will detect a QCA8337 switch, if present and declared in the DT.
> 
> Each port will be represented through a standalone net_device interface,
> as for other DSA switches. CPU can communicate with any of the ports by
> setting an IP@ on ethN interface. Most of the extra callbacks of the DSA
> subsystem are already supported, such as bridge offloading, stp, fdb.
> 
> Signed-off-by: John Crispin 
> ---
> Changes in V2
> * add proper locking for the FDB table
> * remove udelay when changing the page. neither datasheet nore SDK code
>   requires a sleep
> * add a cond_resched to the busy wait loop
> * use nested locking when accessing the mdio bus
> * remove the phy_to_port() wrappers
> * remove mmd access function and use existing phy helpers
> * fix a copy/paste bug breaking the eee callbacks
> * use default vid 1 when fdb entries are added fro vid 0
> * remove the phy id check and add a switch id check instead
> * add error handling to the mdio read/write functions
> * remove inline usage

Hi John

Thanks for the changes, it looks a lot better.

> +static u16 qca8k_current_page = 0xffff;
> +
> +static struct
> +qca8k_priv *qca8k_to_priv(struct dsa_switch *ds)
> +{
> + struct qca8k_priv *priv = ds->priv;
> +
> + return priv;
> +}

https://lkml.org/lkml/2016/8/31/936

In this patch, Vivien removed all the callers to ds_to_priv(). Which
is what this function is. Please just use ds->priv.

> +
> +static void
> +qca8k_fdb_flush(struct qca8k_priv *priv)
> +{
> + mutex_lock(&priv->fdb_mutex);
> + qca8k_fdb_access(priv, QCA8K_FDB_FLUSH, -1);
> + mutex_unlock(&priv->fdb_mutex);
> +}

So this protects the fdb. But should this mutex be more general? Take for
example:

> +qca8k_eee_enable_set(struct dsa_switch *ds, int port, bool enable)
> +{
> + struct qca8k_priv *priv = qca8k_to_priv(ds);
> + u32 lpi_en = QCA8K_REG_EEE_CTRL_LPI_EN(port);
> + u32 reg;
> +
> + reg = qca8k_read(priv, QCA8K_REG_EEE_CTRL);
> + if (enable)
> + reg |= lpi_en;
> + else
> + reg &= ~lpi_en;
> + qca8k_write(priv, QCA8K_REG_EEE_CTRL, reg);
> +}
> +

Here you do a read/modify/write. Could this be called on two
interfaces at the same time? I think you might want this protected as
well by a mutex. So maybe rename fdb_mutex to something more generic
and use it in places like this as well?

> +static int
> +qca8k_setup(struct dsa_switch *ds)
> +{
> + struct qca8k_priv *priv = qca8k_to_priv(ds);
> + int ret, i, phy_mode = -1;
> +
> + /* Start by setting up the register mapping */
> + priv->regmap = devm_regmap_init(ds->dev, NULL, priv,
> + &qca8k_regmap_config);
> +
> + if (IS_ERR(priv->regmap))
> + pr_warn("regmap initialization failed");
> +
> + /* Initialize CPU port pad mode (xMII type, delays...) */
> + phy_mode = of_get_phy_mode(ds->ports[ds->dst->cpu_port].dn);
> + if (phy_mode < 0) {
> + pr_err("Can't find phy-mode for master device\n");
> + return phy_mode;
> + }
> + ret = qca8k_set_pad_ctrl(priv, QCA8K_CPU_PORT, phy_mode);
> + if (ret < 0)
> + return ret;

Maybe add a check here that dsa_is_cpu_port(ds, 0) is true?  If you
say the CPU port has to be 0, it should be checked for.

> + /* Enable CPU Port */
> + qca8k_reg_set(priv, QCA8K_REG_GLOBAL_FW_CTRL0,
> +   QCA8K_GLOBAL_FW_CTRL0_CPU_PORT_EN);
> +
> + /* Enable MIB counters */
> + qca8k_reg_set(priv, QCA8K_REG_MIB, QCA8K_MIB_CPU_KEEP);
> + qca8k_write(priv, QCA8K_REG_MODULE_EN, QCA8K_MODULE_EN_MIB);

Will this implicitly clear the MIB counters? They should be set to
zero, otherwise they can survive a reboot of the host.

> +
> + /* Setup connection between CPU ports & PHYs */

You missed a few "PHY". We tend to call such ports user ports, or
external ports.

> + for (i = 0; i < DSA_MAX_PORTS; i++) {
> + /* CPU port gets connected to all PHYs in the switch */
> + if (dsa_is_cpu_port(ds, i)) {
> + qca8k_rmw(priv, QCA8K_PORT_LOOKUP_CTRL(QCA8K_CPU_PORT),
> +   QCA8K_PORT_LOOKUP_MEMBER,
> +   ds->enabled_port_mask);
> + }
> +
> + /* Invividual PHYs gets connected to CPU port only */
> + if (ds->enabled_port_mask & BIT(i)) {
> + int shift = 16 * (i % 2);
> +
> + qca8k_rmw(priv, QCA8K_PORT_LOOKUP_CTRL(i),
> +   QCA8K_PORT_LOOKUP_MEMBER,
> +   BIT(QCA8K_CPU_PORT));
> +
> + /* Enable ARP Auto-learning by default */
> + qca8k_reg_set(priv, QCA8K_PORT_LOOKUP_CTRL(i),
> +   QCA8K_PORT_LOOKUP_LEARN);
> +
> + /* For 

Re: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes regardless of skb or not

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 11:57:24PM +, Brown, Aaron F wrote:
> > From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On
> > Behalf Of John Fastabend
> > Sent: Monday, September 12, 2016 3:13 PM
> > To: bbla...@plumgrid.com; john.fastab...@gmail.com;
> > alexei.starovoi...@gmail.com; Kirsher, Jeffrey T
> > ; bro...@redhat.com; da...@davemloft.net
> > Cc: xiyou.wangc...@gmail.com; intel-wired-...@lists.osuosl.org;
> > u9012...@gmail.com; netdev@vger.kernel.org
> > Subject: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes
> > regardless of skb or not
> > 
> > The BQL API does not reference the sk_buff nor does the driver need to
> > reference the sk_buff to calculate the length of a transmitted frame.
> > This patch removes an sk_buff reference from the xmit irq path and
> > also allows packets sent from XDP to use BQL.
> > 
> > Signed-off-by: John Fastabend 
> > ---
> >  drivers/net/ethernet/intel/e1000/e1000_main.c |7 ++-
> >  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> This patch is causing all my e1000 adapters to fail a simple ftp session with 
> really slow response (hashing on) followed by adapter resets.  I have seen 
> this on 4 different e1000 nics now, an 82543GC, 82544GC, 82546EB and an 
> 82545GM.  On a few occasions I get a splat captured to dmesg.  Here is an 
> example:
> 
> [ cut here ]
> WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x1c2/0x1d0
> NETDEV WATCHDOG: eth1 (e1000): transmit queue 0 timed out

Thanks a lot for the tests! Really appreciate it.



Re: [PATCH net-next 1/7] lwt: Add net to build_state argument

2016-09-14 Thread kbuild test robot
Hi Tom,

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Tom-Herbert/net-ILA-resolver-and-generic-resolver-backend/20160915-073357
config: microblaze-mmu_defconfig (attached as .config)
compiler: microblaze-linux-gcc (GCC) 4.9.0
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=microblaze 

All warnings (new ones prefixed by >>):

   net/ipv4/fib_semantics.c: In function 'fib_encap_match':
>> net/ipv4/fib_semantics.c:614:8: warning: passing argument 1 of 
>> 'lwtunnel_build_state' from incompatible pointer type
 ret = lwtunnel_build_state(net, dev, encap_type, encap,
   ^
   In file included from net/ipv4/fib_semantics.c:45:0:
   include/net/lwtunnel.h:172:19: note: expected 'struct net_device *' but 
argument is of type 'struct net *'
static inline int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
  ^
   net/ipv4/fib_semantics.c:614:8: warning: passing argument 2 of 
'lwtunnel_build_state' makes integer from pointer without a cast
 ret = lwtunnel_build_state(net, dev, encap_type, encap,
   ^
   In file included from net/ipv4/fib_semantics.c:45:0:
   include/net/lwtunnel.h:172:19: note: expected 'u16' but argument is of type 
'struct net_device *'
static inline int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
  ^
   net/ipv4/fib_semantics.c:614:8: warning: passing argument 3 of 
'lwtunnel_build_state' makes pointer from integer without a cast
 ret = lwtunnel_build_state(net, dev, encap_type, encap,
   ^
   In file included from net/ipv4/fib_semantics.c:45:0:
   include/net/lwtunnel.h:172:19: note: expected 'struct nlattr *' but argument 
is of type 'u16'
static inline int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
  ^
   net/ipv4/fib_semantics.c:614:8: warning: passing argument 4 of 
'lwtunnel_build_state' makes integer from pointer without a cast
 ret = lwtunnel_build_state(net, dev, encap_type, encap,
   ^
   In file included from net/ipv4/fib_semantics.c:45:0:
   include/net/lwtunnel.h:172:19: note: expected 'unsigned int' but argument is 
of type 'struct nlattr *'
static inline int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
  ^
   net/ipv4/fib_semantics.c:614:8: warning: passing argument 5 of 
'lwtunnel_build_state' makes pointer from integer without a cast
 ret = lwtunnel_build_state(net, dev, encap_type, encap,
   ^
   In file included from net/ipv4/fib_semantics.c:45:0:
   include/net/lwtunnel.h:172:19: note: expected 'const void *' but argument is 
of type 'int'
static inline int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
  ^
   net/ipv4/fib_semantics.c:614:8: warning: passing argument 6 of 
'lwtunnel_build_state' from incompatible pointer type
 ret = lwtunnel_build_state(net, dev, encap_type, encap,
   ^
   In file included from net/ipv4/fib_semantics.c:45:0:
   include/net/lwtunnel.h:172:19: note: expected 'struct lwtunnel_state **' but 
argument is of type 'const struct fib_config *'
static inline int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
  ^
   net/ipv4/fib_semantics.c:614:8: error: too many arguments to function 
'lwtunnel_build_state'
 ret = lwtunnel_build_state(net, dev, encap_type, encap,
   ^
   In file included from net/ipv4/fib_semantics.c:45:0:
   include/net/lwtunnel.h:172:19: note: declared here
static inline int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
  ^
   net/ipv4/fib_semantics.c: In function 'fib_create_info':
   net/ipv4/fib_semantics.c:1102:10: warning: passing argument 1 of 
'lwtunnel_build_state' from incompatible pointer type
   err = lwtunnel_build_state(net, dev, cfg->fc_encap_type,
 ^
   In file included from net/ipv4/fib_semantics.c:45:0:
   include/net/lwtunnel.h:172:19: note: expected 'struct net_device *' but 
argument is of type 'struct net *'
static inline int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
  ^
   net/ipv4/fib_semantics.c:1102:10: warning: passing argument 2 of 
'lwtunnel_build_state' makes integer from pointer without a cast
   err = lwtunnel_build_state(net, dev, cfg->fc_encap_type,
 ^
   In file included from net/ipv4/fib_semantics.c:45:0:
   include/net/lwtunnel.h:172:19: note: expected 'u16' but argument is of type 
'struct net_device *'
static inline int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
  ^
   net/ipv4/fib_semantics.c:1102:10: warning: passing argument 3 of 
'lwtunnel_build_state' makes pointer from 

Re: [PATCH V2 2/3] net-next: dsa: add Qualcomm tag RX/TX handler

2016-09-14 Thread Andrew Lunn
On Wed, Sep 14, 2016 at 12:39:01PM +0200, John Crispin wrote:
> Add support for the 2-bytes Qualcomm tag that gigabit switches such as
> the QCA8337/N might insert when receiving packets, or that we need
> to insert while targeting specific switch ports. The tag is inserted
> directly behind the ethernet header.
> 
> Signed-off-by: John Crispin 

Reviewed-by: Andrew Lunn 

Andrew


Re: [PATCH V2 1/3] Documentation: devicetree: add qca8k binding

2016-09-14 Thread Andrew Lunn
On Wed, Sep 14, 2016 at 12:39:00PM +0200, John Crispin wrote:
> Add device-tree binding for ar8xxx switch families.
> 
> Cc: devicet...@vger.kernel.org
> Signed-off-by: John Crispin 
> ---
> Changes in V2
> * fixup ecample to include phy nodes and corresponding phandles
> * add a note explaining why we need to phy nodes
> 
>  .../devicetree/bindings/net/dsa/qca8k.txt  |   88 
> 
>  1 file changed, 88 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/dsa/qca8k.txt
> 
> diff --git a/Documentation/devicetree/bindings/net/dsa/qca8k.txt 
> b/Documentation/devicetree/bindings/net/dsa/qca8k.txt
> new file mode 100644
> index 000..2c1582a
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/dsa/qca8k.txt
> @@ -0,0 +1,88 @@
> +* Qualcomm Atheros QCA8xxx switch family
> +
> +Required properties:
> +
> +- compatible: should be "qca,qca8337"
> +- #size-cells: must be 0
> +- #address-cells: must be 1
> +
> +Subnodes:
> +
> +The integrated switch subnode should be specified according to the binding
> +described in dsa/dsa.txt. As the QCA8K switches do not have a N:N mapping of
> +port and PHY id, each subnode describing a port needs to have a valid phandle
> +referencing the internal PHY connected to it.

Hi John

I've not looked at the driver yet, but you said yesterday the CPU port
has to be port 0. I think it would be good to document that here.

Otherwise, this is looking good.

   Andrew


Re: [PATCH net-next 1/7] lwt: Add net to build_state argument

2016-09-14 Thread kbuild test robot
Hi Tom,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Tom-Herbert/net-ILA-resolver-and-generic-resolver-backend/20160915-073357
config: i386-randconfig-x003-201637 (attached as .config)
compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

Note: the 
linux-review/Tom-Herbert/net-ILA-resolver-and-generic-resolver-backend/20160915-073357
 HEAD ef54da9d3e09732810a23f283e0f8039fb18eede builds fine.
  It only hurts bisectibility.

All errors (new ones prefixed by >>):

   net/core/lwtunnel.c: In function 'lwtunnel_encap_str':
>> net/core/lwtunnel.c:42:7: error: 'LWTUNNEL_ENCAP_ILA_NOTIFY' undeclared 
>> (first use in this function)
 case LWTUNNEL_ENCAP_ILA_NOTIFY:
  ^
   net/core/lwtunnel.c:42:7: note: each undeclared identifier is reported only 
once for each function it appears in

vim +/LWTUNNEL_ENCAP_ILA_NOTIFY +42 net/core/lwtunnel.c

36   */
37  switch (encap_type) {
38  case LWTUNNEL_ENCAP_MPLS:
39  return "MPLS";
40  case LWTUNNEL_ENCAP_ILA:
41  return "ILA";
  > 42  case LWTUNNEL_ENCAP_ILA_NOTIFY:
43  return "ILA_NOTIFY";
44  case LWTUNNEL_ENCAP_IP6:
45  case LWTUNNEL_ENCAP_IP:

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: [RFC 2/3] staging: rtl8712: Change _LED_STATE enum in rtl871x driver to avoid conflicts with LED namespace

2016-09-14 Thread Larry Finger

On 09/14/2016 04:55 PM, Zach Brown wrote:

Adding led support for phy causes namespace conflicts for some
phy drivers.

The rtl871 driver declared an enum for representing LED states. The enum
contains constant LED_OFF which conflicted with declartation found in
linux/leds.h. LED_OFF changed to LED_STATE_OFF

Signed-off-by: Zach Brown 


I have no objection to this change. My only substantive comment is that LED_ON 
should also be changed to LED_STATE_ON, otherwise there might be another 
namespace collision later. There is also a typo in the commit message. In 
addition, staging driver patches should be sent to Greg KH.


Larry


---
 drivers/staging/rtl8712/rtl8712_led.c | 192 +-
 1 file changed, 96 insertions(+), 96 deletions(-)

diff --git a/drivers/staging/rtl8712/rtl8712_led.c 
b/drivers/staging/rtl8712/rtl8712_led.c
index 9055827..d53ad76 100644
--- a/drivers/staging/rtl8712/rtl8712_led.c
+++ b/drivers/staging/rtl8712/rtl8712_led.c
@@ -52,7 +52,7 @@
 enum _LED_STATE_871x {
LED_UNKNOWN = 0,
LED_ON = 1,
-   LED_OFF = 2,
+   LED_STATE_OFF = 2,
LED_BLINK_NORMAL = 3,
LED_BLINK_SLOWLY = 4,
LED_POWER_ON_BLINK = 5,
@@ -92,7 +92,7 @@ static void InitLed871x(struct _adapter *padapter, struct 
LED_871x *pLed,
nic = padapter->pnetdev;
pLed->padapter = padapter;
pLed->LedPin = LedPin;
-   pLed->CurrLedState = LED_OFF;
+   pLed->CurrLedState = LED_STATE_OFF;
pLed->bLedOn = false;
pLed->bLedBlinkInProgress = false;
pLed->BlinkTimes = 0;
@@ -249,7 +249,7 @@ static void SwLedBlink(struct LED_871x *pLed)
} else {
/* Assign LED state to toggle. */
if (pLed->BlinkingLedState == LED_ON)
-   pLed->BlinkingLedState = LED_OFF;
+   pLed->BlinkingLedState = LED_STATE_OFF;
else
pLed->BlinkingLedState = LED_ON;

@@ -312,7 +312,7 @@ static void SwLedBlink1(struct LED_871x *pLed)
switch (pLed->CurrLedState) {
case LED_BLINK_SLOWLY:
if (pLed->bLedOn)
-   pLed->BlinkingLedState = LED_OFF;
+   pLed->BlinkingLedState = LED_STATE_OFF;
else
pLed->BlinkingLedState = LED_ON;
mod_timer(&pLed->BlinkTimer, jiffies +
@@ -320,7 +320,7 @@ static void SwLedBlink1(struct LED_871x *pLed)
break;
case LED_BLINK_NORMAL:
if (pLed->bLedOn)
-   pLed->BlinkingLedState = LED_OFF;
+   pLed->BlinkingLedState = LED_STATE_OFF;
else
pLed->BlinkingLedState = LED_ON;
mod_timer(&pLed->BlinkTimer, jiffies +
@@ -335,7 +335,7 @@ static void SwLedBlink1(struct LED_871x *pLed)
pLed->bLedLinkBlinkInProgress = true;
pLed->CurrLedState = LED_BLINK_NORMAL;
if (pLed->bLedOn)
-   pLed->BlinkingLedState = LED_OFF;
+   pLed->BlinkingLedState = LED_STATE_OFF;
else
pLed->BlinkingLedState = LED_ON;
mod_timer(&pLed->BlinkTimer, jiffies +
@@ -344,7 +344,7 @@ static void SwLedBlink1(struct LED_871x *pLed)
pLed->bLedNoLinkBlinkInProgress = true;
pLed->CurrLedState = LED_BLINK_SLOWLY;
if (pLed->bLedOn)
-   pLed->BlinkingLedState = LED_OFF;
+   pLed->BlinkingLedState = LED_STATE_OFF;
else
pLed->BlinkingLedState = LED_ON;
mod_timer(&pLed->BlinkTimer, jiffies +
@@ -353,7 +353,7 @@ static void SwLedBlink1(struct LED_871x *pLed)
pLed->bLedScanBlinkInProgress = false;
} else {
 if (pLed->bLedOn)
-   pLed->BlinkingLedState = LED_OFF;
+   pLed->BlinkingLedState = LED_STATE_OFF;
else
pLed->BlinkingLedState = LED_ON;
mod_timer(&pLed->BlinkTimer, jiffies +
@@ -369,7 +369,7 @@ static void SwLedBlink1(struct LED_871x *pLed)
pLed->bLedLinkBlinkInProgress = true;
pLed->CurrLedState = LED_BLINK_NORMAL;
if (pLed->bLedOn)
-   pLed->BlinkingLedState = LED_OFF;
+   pLed->BlinkingLedState = LED_STATE_OFF;
else
pLed->BlinkingLedState = 

RE: [PATCH v4 05/16] IB/pvrdma: Add functions for Verbs support

2016-09-14 Thread Adit Ranadive
On Wed, Sep 14, 2016 at 05:49:50 -0700 Christoph Hellwig wrote:
> > +   props->max_fmr = dev->dsr->caps.max_fmr;
> > +   props->max_map_per_fmr = dev->dsr->caps.max_map_per_fmr;
> 
> Please don't add FMR support to any new drivers.

We don't, and our device reports these as 0. If you want me to be more explicit,
I can remove the zeroed-out properties.


Re: [PATCH v4 09/16] IB/pvrdma: Add support for Completion Queues

2016-09-14 Thread Adit Ranadive
On Wed, Sep 14, 2016 at 05:43:37 -0700, Yuval Shaia wrote:
> On Sun, Sep 11, 2016 at 09:49:19PM -0700, Adit Ranadive wrote:
> > +
> > +static int pvrdma_poll_one(struct pvrdma_cq *cq, struct pvrdma_qp
> **cur_qp,
> > +  struct ib_wc *wc)
> > +{
> > +   struct pvrdma_dev *dev = to_vdev(cq->ibcq.device);
> > +   int has_data;
> > +   unsigned int head;
> > +   bool tried = false;
> > +   struct pvrdma_cqe *cqe;
> > +
> > +retry:
> > > +   has_data = pvrdma_idx_ring_has_data(&cq->ring_state->rx,
> > > +   cq->ibcq.cqe, &head);
> > +   if (has_data == 0) {
> > +   if (tried)
> > +   return -EAGAIN;
> > +
> > +   /* Pass down POLL to give physical HCA a chance to poll. */
> > +   pvrdma_write_uar_cq(dev, cq->cq_handle |
> PVRDMA_UAR_CQ_POLL);
> > +
> > +   tried = true;
> > +   goto retry;
> > +   } else if (has_data == PVRDMA_INVALID_IDX) {
> 
> I didn't went throw the entire life cycle of RX-ring's head and tail but you
> need to make sure that PVRDMA_INVALID_IDX error is recoverable one, i.e
> there is probability that in the next call to pvrdma_poll_one it will be fine.
> Otherwise it is an endless loop.

We have never run into this issue internally, but I don't think we can recover
here in the driver. The only way to recover would be to destroy and recreate the
CQ, which we shouldn't do since it could be used by multiple QPs.
We don't have a way yet to recover in the device. Once we add that, this check
should go away.

The reason I returned an error value from poll_cq in v3 was to break the
possible loop so that it might give clients a chance to recover. But since
poll_cq is not expected to fail, I just log the device error here. I can revert
to that version if you want to break the possible loop.


Re: [PATCH net-next 1/7] lwt: Add net to build_state argument

2016-09-14 Thread kbuild test robot
Hi Tom,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Tom-Herbert/net-ILA-resolver-and-generic-resolver-backend/20160915-073357
config: x86_64-randconfig-x013-201637 (attached as .config)
compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All error/warnings (new ones prefixed by >>):

   net/ipv4/fib_semantics.c: In function 'fib_get_nhs':
>> net/ipv4/fib_semantics.c:514:32: error: passing argument 1 of 
>> 'lwtunnel_build_state' from incompatible pointer type 
>> [-Werror=incompatible-pointer-types]
ret = lwtunnel_build_state(net, dev,
   ^~~
   In file included from net/ipv4/fib_semantics.c:45:0:
   include/net/lwtunnel.h:172:19: note: expected 'struct net_device *' but 
argument is of type 'struct net *'
static inline int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
  ^~~~
>> net/ipv4/fib_semantics.c:514:37: warning: passing argument 2 of 
>> 'lwtunnel_build_state' makes integer from pointer without a cast 
>> [-Wint-conversion]
ret = lwtunnel_build_state(net, dev,
^~~
   In file included from net/ipv4/fib_semantics.c:45:0:
   include/net/lwtunnel.h:172:19: note: expected 'u16 {aka short unsigned int}' 
but argument is of type 'struct net_device *'
static inline int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
  ^~~~
>> net/ipv4/fib_semantics.c:515:11: warning: passing argument 3 of 
>> 'lwtunnel_build_state' makes pointer from integer without a cast 
>> [-Wint-conversion]
  nla_get_u16(
  ^~~
   In file included from net/ipv4/fib_semantics.c:45:0:
   include/net/lwtunnel.h:172:19: note: expected 'struct nlattr *' but argument 
is of type 'u16 {aka short unsigned int}'
static inline int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
  ^~~~
   net/ipv4/fib_semantics.c:517:11: warning: passing argument 4 of 
'lwtunnel_build_state' makes integer from pointer without a cast 
[-Wint-conversion]
  nla,  AF_INET, cfg,
  ^~~
   In file included from net/ipv4/fib_semantics.c:45:0:
   include/net/lwtunnel.h:172:19: note: expected 'unsigned int' but argument is 
of type 'struct nlattr *'
static inline int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
  ^~~~
   In file included from net/ipv4/fib_semantics.c:23:0:
>> include/linux/socket.h:163:18: warning: passing argument 5 of 
>> 'lwtunnel_build_state' makes pointer from integer without a cast 
>> [-Wint-conversion]
#define AF_INET  2 /* Internet IP Protocol  */
 ^
>> net/ipv4/fib_semantics.c:517:17: note: in expansion of macro 'AF_INET'
  nla,  AF_INET, cfg,
^~~
   In file included from net/ipv4/fib_semantics.c:45:0:
   include/net/lwtunnel.h:172:19: note: expected 'const void *' but argument is 
of type 'int'
static inline int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
  ^~~~
   net/ipv4/fib_semantics.c:517:26: error: passing argument 6 of 
'lwtunnel_build_state' from incompatible pointer type 
[-Werror=incompatible-pointer-types]
  nla,  AF_INET, cfg,
 ^~~
   In file included from net/ipv4/fib_semantics.c:45:0:
   include/net/lwtunnel.h:172:19: note: expected 'struct lwtunnel_state **' but 
argument is of type 'struct fib_config *'
static inline int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
  ^~~~
>> net/ipv4/fib_semantics.c:514:11: error: too many arguments to function 
>> 'lwtunnel_build_state'
ret = lwtunnel_build_state(net, dev,
  ^~~~
   In file included from net/ipv4/fib_semantics.c:45:0:
   include/net/lwtunnel.h:172:19: note: declared here
static inline int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
  ^~~~
   net/ipv4/fib_semantics.c: In function 'fib_encap_match':
   net/ipv4/fib_semantics.c:614:29: error: passing argument 1 of 
'lwtunnel_build_state' from incompatible pointer type 
[-Werror=incompatible-pointer-types]
 ret = lwtunnel_build_state(net, dev, encap_type, encap,
^~~
   In file included from net/ipv4/fib_semantics.c:45:0:
   include/net/lwtunnel.h:172:19: note: expected 'struct net_device *' but 
argument is of type 'struct net *'
static inline int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
  ^~~~
   net/ipv4/fib_semantics.c:614:34: warning: passing argument 2 of 
'lwtunnel_build_state' makes integer from pointer without a cast 

RE: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes regardless of skb or not

2016-09-14 Thread Brown, Aaron F
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On
> Behalf Of John Fastabend
> Sent: Monday, September 12, 2016 3:13 PM
> To: bbla...@plumgrid.com; john.fastab...@gmail.com;
> alexei.starovoi...@gmail.com; Kirsher, Jeffrey T
> ; bro...@redhat.com; da...@davemloft.net
> Cc: xiyou.wangc...@gmail.com; intel-wired-...@lists.osuosl.org;
> u9012...@gmail.com; netdev@vger.kernel.org
> Subject: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes
> regardless of skb or not
> 
> The BQL API does not reference the sk_buff nor does the driver need to
> reference the sk_buff to calculate the length of a transmitted frame.
> This patch removes an sk_buff reference from the xmit irq path and
> also allows packets sent from XDP to use BQL.
> 
> Signed-off-by: John Fastabend 
> ---
>  drivers/net/ethernet/intel/e1000/e1000_main.c |7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)

This patch is causing all my e1000 adapters to fail a simple ftp session with 
really slow response (hashing on) followed by adapter resets.  I have seen this 
on 4 different e1000 nics now, an 82543GC, 82544GC, 82546EB and an 82545GM.  On 
a few occasions I get a splat captured to dmesg.  Here is an example:

[ cut here ]
WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x1c2/0x1d0
NETDEV WATCHDOG: eth1 (e1000): transmit queue 0 timed out
Modules linked in: e1000 e100 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4
 ipv6 crc_ccitt p4_clockmod dm_mirror dm_region_hash dm_log uinput sg serio_raw
 dcdbas mii i2c_piix4 i2c_core cfi_probe gen_probe cfi_util scb2_flash dm_mod(E)
 ext4(E) mbcache(E) jbd2(E) sd_mod(E) sr_mod(E) cdrom(E) aacraid(E) pata_acpi(E)
 ata_generic(E) pata_serverworks(E) [last unloaded: e1000]
CPU: 1 PID: 0 Comm: swapper/1 Tainted: GE   4.8.0-rc6_next-queue_dev-queue_b0b5ade #8
Hardware name: Dell Computer Corporation PowerEdge 2650 /0D5995, BIOS A21 10/05/2006
  c06724fb c0b6038a c0b6038a 013c c04596d5 c0b647ac f3d2fed0
  c0b6038a 013c c08bff52 c08bff52 0009 f2922000 
 f318cf00 0001e788 c045979b 0009  f3d2feb8 c0b647ac f3d2fed0
Call Trace:
 [] ? dump_stack+0x47/0x6c
 [] ? __warn+0x105/0x120
 [] ? dev_watchdog+0x1c2/0x1d0
 [] ? dev_watchdog+0x1c2/0x1d0
 [] ? warn_slowpath_fmt+0x3b/0x40
 [] ? dev_watchdog+0x1c2/0x1d0
 [] ? call_timer_fn+0x3b/0x140
 [] ? dev_trans_start+0x60/0x60
 [] ? expire_timers+0x9b/0x110
 [] ? run_timer_softirq+0xd1/0x260
 [] ? net_tx_action+0xe0/0x1a0
 [] ? rcu_process_callbacks+0x47/0xe0
 [] ? __do_softirq+0xc8/0x280
 [] ? irq_exit+0x90/0x90
 [] ? do_softirq_own_stack+0x22/0x30
   [] ? irq_exit+0x85/0x90
 [] ? smp_apic_timer_interrupt+0x30/0x40
 [] ? apic_timer_interrupt+0x2d/0x34
 [] ? default_idle+0x1c/0xd0
 [] ? __tick_nohz_idle_enter+0x92/0x140
 [] ? arch_cpu_idle+0x6/0x10
 [] ? cpuidle_idle_call+0x7d/0xe0
 [] ? cpu_idle_loop+0x155/0x210
 [] ? start_secondary+0xb5/0xe0
---[ end trace fa448b49f7848a42 ]---
e1000 :02:06.0 eth1: Reset adapter
e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000 :02:06.0 eth1: Reset adapter
e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000 :02:06.0 eth1: Reset adapter
e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000 :02:06.0 eth1: Reset adapter
e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000 :02:06.0 eth1: Reset adapter
e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000 :02:06.0 eth1: Reset adapter
e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000 :02:06.0 eth1: Reset adapter
e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX


RE: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-14 Thread Brown, Aaron F
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On
> Behalf Of Rustad, Mark D
> Sent: Tuesday, September 13, 2016 3:41 PM
> To: Alexei Starovoitov 
> Cc: Eric Dumazet ; Tom Herbert
> ; Brenden Blanco ; Linux
> Kernel Network Developers ; intel-wired-lan  wired-...@lists.osuosl.org>; William Tu ; Jesper
> Dangaard Brouer ; Cong Wang
> ; David S. Miller 
> Subject: Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP
> support
> 
> Alexei Starovoitov  wrote:
> 
> > On Tue, Sep 13, 2016 at 07:14:27PM +, Rustad, Mark D wrote:
> >> Alexei Starovoitov  wrote:
> >>
> >>> On Tue, Sep 13, 2016 at 06:28:03PM +, Rustad, Mark D wrote:
>  Alexei Starovoitov  wrote:
> 
> > I've looked through qemu and it appears only emulate e1k and tg3.
> > The latter is still used in the field, so the risk of touching
> > it is higher.
> 
>  I have no idea what makes you think that e1k is *not* "used in the
>  field".

I personally use a number of physical e1000 parts in my lab.  Primarily for an 
out of band control adapter, as it almost never changes so tends to be very 
stable, and it is rare that I have to test it so I'm not fighting with a 
control adapter that I am also testing the driver on.

>  I grant you it probably is used more virtualized than not these days,
>  but it
>  certainly exists and is used. You can still buy them new at Newegg for
>  goodness sakes!
> >>>
> >>> the point is that it's only used virtualized, since PCI (not PCIe) has
> >>> been long dead.

Maybe for new server installations.  But modern motherboards often still have 
PCI, university / lab environments are slow to upgrade specialty test equipment 
that uses PCI and home users often don't want to get rid of their expensive 
high quality sound card, video capture card (or whatever.)  Plain old PCI is 
still in use.

> >>
> >> My point is precisely the opposite. It is a real device, it exists in real
> >> systems and it is used in those systems. I worked on embedded systems
> that
> >> ran Linux and used e1000 devices. I am sure they are still out there
> >> because
> >> customers are still paying for support of those systems.
> >>
> >> Yes, PCI(-X) is absent from any current hardware and has been for some
> >> years
> >> now, but there is an installed base that continues. What part of that
> >> installed base updates software? I don't know, but I would not just
> assume
> >> that it is 0. I know that I updated the kernel on those embedded systems
> >> that I worked on when I was supporting them. Never at the bleeding edge,
> >> but
> >> generally hopping from one LTS kernel to another as needed.
> >
> > I suspect modern linux won't boot on such old pci only systems for other
> > reasons not related to networking, since no one really cares to test
> > kernels there.
> 
> Actually it does boot, because although the motherboard was PCIe, the slots
> and the adapters in them were PCI-X. So the core architecture was not so
> stale.
> 
> > So I think we mostly agree. There is chance that this xdp e1k code will
> > find a way to that old system. What are the chances those users will
> > be using xdp there? I think pretty close to zero.
> 
> For sure they wouldn't be using XDP, but they could suffer regressions in a
> changed driver that might find its way there. That is the risk.
> 
> > The pci-e nics integrated into motherboards that pretend to be tg3
> > (though they're not at all build by broadcom) are significantly more
> > common.
> > That's why I picked e1k instead of tg3.
> 
> That may be true (I really don't know anything about tg3 so I certainly
> can't dispute it), so the risk could be smaller with e1k, but there is
> still a regression risk for real existing hardware. That is my concern.

And one of the patches supporting this seems to have done just that.  I'll 
reply to the patch itself with details.
> 
> > Also note how this patch is not touching anything in the main e1k path
> > (because I don't have a hw to test and I suspect Intel's driver team
> > doesn't have it either)

Your suspicion is incorrect.  My lab has multiple boxes out on benches and at 
least one cabinet 3/4 full of ancient hardware, including a decent variety of 
e1000 cards (and even a number of e100 ones.)  I also keep a handful of test 
systems in my regression rack devoted to e1000, though those have dwindled (old 
hardware eventually dies) and is currently in need of a bit of work, which this 
patch set has encouraged me to get on to.

> >  to make sure there is no breakage on those
> > old systems. I created separate e1000_xmit_raw_frame() routine
> > instead of adding flags into 

[PATCH v2] net: VRF: Pass original iif to ip_route_input()

2016-09-14 Thread Mark Tomlinson
The function ip_rcv_finish() calls l3mdev_ip_rcv(). On any VRF except
the global VRF, this replaces skb->dev with the VRF master interface.
When calling ip_route_input_noref() from here, the checks for forwarding
look at this master device instead of the initial ingress interface.
This will allow packets to be routed which normally would be dropped.
For example, an interface that is not assigned an IP address should
drop packets, but because the checking is against the master device, the
packet will be forwarded.

The fix here is to still call l3mdev_ip_rcv(), but remember the initial
net_device. This is passed to the other functions within ip_rcv_finish,
so they still see the original interface.

Signed-off-by: Mark Tomlinson 
Acked-by: David Ahern 
---
 net/ipv4/ip_input.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 4b351af..d6feabb 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -312,6 +312,7 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, 
struct sk_buff *skb)
 {
const struct iphdr *iph = ip_hdr(skb);
struct rtable *rt;
+   struct net_device *dev = skb->dev;
 
/* if ingress device is enslaved to an L3 master device pass the
 * skb to its handler for processing
@@ -341,7 +342,7 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, 
struct sk_buff *skb)
 */
if (!skb_valid_dst(skb)) {
int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
-  iph->tos, skb->dev);
+  iph->tos, dev);
if (unlikely(err)) {
if (err == -EXDEV)
__NET_INC_STATS(net, LINUX_MIB_IPRPFILTER);
@@ -370,7 +371,7 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, 
struct sk_buff *skb)
__IP_UPD_PO_STATS(net, IPSTATS_MIB_INBCAST, skb->len);
} else if (skb->pkt_type == PACKET_BROADCAST ||
   skb->pkt_type == PACKET_MULTICAST) {
-   struct in_device *in_dev = __in_dev_get_rcu(skb->dev);
+   struct in_device *in_dev = __in_dev_get_rcu(dev);
 
/* RFC 1122 3.3.6:
 *
-- 
2.9.3



Re: [RFC v3 03/22] bpf,landlock: Add a new arraymap type to deal with (Landlock) handles

2016-09-14 Thread Alexei Starovoitov
On Thu, Sep 15, 2016 at 01:22:49AM +0200, Mickaël Salaün wrote:
> 
> On 14/09/2016 20:51, Alexei Starovoitov wrote:
> > On Wed, Sep 14, 2016 at 09:23:56AM +0200, Mickaël Salaün wrote:
> >> This new arraymap looks like a set and brings new properties:
> >> * strong typing of entries: the eBPF functions get the array type of
> >>   elements instead of CONST_PTR_TO_MAP (e.g.
> >>   CONST_PTR_TO_LANDLOCK_HANDLE_FS);
> >> * force sequential filling (i.e. replace or append-only update), which
> >>   allows quick browsing of all entries.
> >>
> >> This strong typing is useful to statically check if the content of a map
> >> can be passed to an eBPF function. For example, Landlock use it to store
> >> and manage kernel objects (e.g. struct file) instead of dealing with
> >> userland raw data. This improve efficiency and ensure that an eBPF
> >> program can only call functions with the right high-level arguments.
> >>
> >> The enum bpf_map_handle_type list low-level types (e.g.
> >> BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD) which are identified when
> >> updating a map entry (handle). These handle types are used to infer a
> >> high-level arraymap type which are listed in enum bpf_map_array_type
> >> (e.g. BPF_MAP_ARRAY_TYPE_LANDLOCK_FS).
> >>
> >> For now, this new arraymap is only used by Landlock LSM (cf. next
> >> commits) but it could be useful for other needs.
> >>
> >> Changes since v2:
> >> * add a RLIMIT_NOFILE-based limit to the maximum number of arraymap
> >>   handle entries (suggested by Andy Lutomirski)
> >> * remove useless checks
> >>
> >> Changes since v1:
> >> * arraymap of handles replace custom checker groups
> >> * simpler userland API
> >>
> >> Signed-off-by: Mickaël Salaün 
> >> Cc: Alexei Starovoitov 
> >> Cc: Andy Lutomirski 
> >> Cc: Daniel Borkmann 
> >> Cc: David S. Miller 
> >> Cc: Kees Cook 
> >> Link: 
> >> https://lkml.kernel.org/r/calcetrwwtiz3kztkegow24-dvhqq6lftwexh77fd2g5o71y...@mail.gmail.com
> >> ---
> >>  include/linux/bpf.h  |  14 
> >>  include/uapi/linux/bpf.h |  18 +
> >>  kernel/bpf/arraymap.c| 203 
> >> +++
> >>  kernel/bpf/verifier.c|  12 ++-
> >>  4 files changed, 246 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> >> index fa9a988400d9..eae4ce4542c1 100644
> >> --- a/include/linux/bpf.h
> >> +++ b/include/linux/bpf.h
> >> @@ -13,6 +13,10 @@
> >>  #include 
> >>  #include 
> >>  
> >> +#ifdef CONFIG_SECURITY_LANDLOCK
> >> +#include  /* struct file */
> >> +#endif /* CONFIG_SECURITY_LANDLOCK */
> >> +
> >>  struct perf_event;
> >>  struct bpf_map;
> >>  
> >> @@ -38,6 +42,7 @@ struct bpf_map_ops {
> >>  struct bpf_map {
> >>atomic_t refcnt;
> >>enum bpf_map_type map_type;
> >> +  enum bpf_map_array_type map_array_type;
> >>u32 key_size;
> >>u32 value_size;
> >>u32 max_entries;
> >> @@ -187,6 +192,9 @@ struct bpf_array {
> >> */
> >>enum bpf_prog_type owner_prog_type;
> >>bool owner_jited;
> >> +#ifdef CONFIG_SECURITY_LANDLOCK
> >> +  u32 n_entries;  /* number of entries in a handle array */
> >> +#endif /* CONFIG_SECURITY_LANDLOCK */
> >>union {
> >>char value[0] __aligned(8);
> >>void *ptrs[0] __aligned(8);
> >> @@ -194,6 +202,12 @@ struct bpf_array {
> >>};
> >>  };
> >>  
> >> +#ifdef CONFIG_SECURITY_LANDLOCK
> >> +struct map_landlock_handle {
> >> +  u32 type; /* enum bpf_map_handle_type */
> >> +};
> >> +#endif /* CONFIG_SECURITY_LANDLOCK */
> >> +
> >>  #define MAX_TAIL_CALL_CNT 32
> >>  
> >>  struct bpf_event_entry {
> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> index 7cd36166f9b7..b68de57f7ab8 100644
> >> --- a/include/uapi/linux/bpf.h
> >> +++ b/include/uapi/linux/bpf.h
> >> @@ -87,6 +87,15 @@ enum bpf_map_type {
> >>BPF_MAP_TYPE_PERCPU_ARRAY,
> >>   BPF_MAP_TYPE_STACK_TRACE,
> >>BPF_MAP_TYPE_CGROUP_ARRAY,
> >> +  BPF_MAP_TYPE_LANDLOCK_ARRAY,
> >> +};
> >> +
> >> +enum bpf_map_array_type {
> >> +  BPF_MAP_ARRAY_TYPE_UNSPEC,
> >> +};
> >> +
> >> +enum bpf_map_handle_type {
> >> +  BPF_MAP_HANDLE_TYPE_UNSPEC,
> >>  };
> > 
> > missing something. why does it have to be special enough to have its own
> > fd array implementation?
> > Please take a look at how BPF_MAP_TYPE_PERF_EVENT_ARRAY,
> > BPF_MAP_TYPE_CGROUP_ARRAY and BPF_MAP_TYPE_PROG_ARRAY are done.
> > They all store objects into an array map that user space passes via FD.
> > I think the same model should apply here.
> 
> The idea is to have multiple way for userland to describe a resource
> (e.g. an open file descriptor, a path or a glob pattern). The kernel
> representation could then be a "struct path *" or dedicated types (e.g.
> custom glob).

hmm. I think the user space api should only deal with FDs. Everything
else is user space's job to encapsulate/hide.

> Another interesting 

[PATCH net-next 7/7] ila: Resolver mechanism

2016-09-14 Thread Tom Herbert
Implement an ILA resolver. This uses LWT to implement the hook to a
userspace resolver and tracks pending unresolved addresses using the
backend net resolver.

The idea is that the kernel sets an ILA resolver route to the
SIR prefix, something like:

ip route add ::/64 encap ila-resolve \
 via 2401:db00:20:911a::27:0 dev eth0

When a packet hits the route the address is looked up in a resolver
table. If the entry is created (no entry with the address already
exists) then an rtnl message is generated with group
RTNLGRP_ILA_NOTIFY and type RTM_ADDR_RESOLVE. A userspace
daemon can listen for such messages and perform an ILA resolution
protocol to determine the ILA mapping. If the mapping is resolved
then a /128 ila encap route is set so that the host can perform
ILA translation and send directly to the destination.

Signed-off-by: Tom Herbert 
---
 include/uapi/linux/ila.h   |   9 ++
 include/uapi/linux/lwtunnel.h  |   1 +
 include/uapi/linux/rtnetlink.h |   8 +-
 net/ipv6/Kconfig   |   1 +
 net/ipv6/ila/Makefile  |   2 +-
 net/ipv6/ila/ila.h |  16 +++
 net/ipv6/ila/ila_common.c  |   7 ++
 net/ipv6/ila/ila_lwt.c |   9 ++
 net/ipv6/ila/ila_resolver.c| 246 +
 net/ipv6/ila/ila_xlat.c|  15 ++-
 10 files changed, 304 insertions(+), 10 deletions(-)
 create mode 100644 net/ipv6/ila/ila_resolver.c

diff --git a/include/uapi/linux/ila.h b/include/uapi/linux/ila.h
index 948c0a9..f186f8b 100644
--- a/include/uapi/linux/ila.h
+++ b/include/uapi/linux/ila.h
@@ -42,4 +42,13 @@ enum {
ILA_CSUM_NO_ACTION,
 };
 
+enum {
+   ILA_NOTIFY_ATTR_UNSPEC,
+   ILA_NOTIFY_ATTR_TIMEOUT,/* u32 */
+
+   __ILA_NOTIFY_ATTR_MAX,
+};
+
+#define ILA_NOTIFY_ATTR_MAX(__ILA_NOTIFY_ATTR_MAX - 1)
+
 #endif /* _UAPI_LINUX_ILA_H */
diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h
index a478fe8..d880e49 100644
--- a/include/uapi/linux/lwtunnel.h
+++ b/include/uapi/linux/lwtunnel.h
@@ -9,6 +9,7 @@ enum lwtunnel_encap_types {
LWTUNNEL_ENCAP_IP,
LWTUNNEL_ENCAP_ILA,
LWTUNNEL_ENCAP_IP6,
+   LWTUNNEL_ENCAP_ILA_NOTIFY,
__LWTUNNEL_ENCAP_MAX,
 };
 
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 262f037..a775464 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -12,7 +12,8 @@
  */
 #define RTNL_FAMILY_IPMR   128
 #define RTNL_FAMILY_IP6MR  129
-#define RTNL_FAMILY_MAX129
+#define RTNL_FAMILY_ILA130
+#define RTNL_FAMILY_MAX130
 
/****
  * Routing/neighbour discovery messages.
@@ -144,6 +145,9 @@ enum {
RTM_GETSTATS = 94,
 #define RTM_GETSTATS RTM_GETSTATS
 
+   RTM_ADDR_RESOLVE = 95,
+#define RTM_ADDR_RESOLVE RTM_ADDR_RESOLVE
+
__RTM_MAX,
 #define RTM_MAX(((__RTM_MAX + 3) & ~3) - 1)
 };
@@ -656,6 +660,8 @@ enum rtnetlink_groups {
 #define RTNLGRP_MPLS_ROUTE RTNLGRP_MPLS_ROUTE
RTNLGRP_NSID,
 #define RTNLGRP_NSID   RTNLGRP_NSID
+   RTNLGRP_ILA_NOTIFY,
+#define RTNLGRP_ILA_NOTIFY RTNLGRP_ILA_NOTIFY
__RTNLGRP_MAX
 };
 #define RTNLGRP_MAX(__RTNLGRP_MAX - 1)
diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index 2343e4f..cf3ea8e 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -97,6 +97,7 @@ config IPV6_ILA
tristate "IPv6: Identifier Locator Addressing (ILA)"
depends on NETFILTER
select LWTUNNEL
+   select NET_EXT_RESOLVER
---help---
  Support for IPv6 Identifier Locator Addressing (ILA).
 
diff --git a/net/ipv6/ila/Makefile b/net/ipv6/ila/Makefile
index 4b32e59..f2aadc3 100644
--- a/net/ipv6/ila/Makefile
+++ b/net/ipv6/ila/Makefile
@@ -4,4 +4,4 @@
 
 obj-$(CONFIG_IPV6_ILA) += ila.o
 
-ila-objs := ila_common.o ila_lwt.o ila_xlat.o
+ila-objs := ila_common.o ila_lwt.o ila_xlat.o ila_resolver.o
diff --git a/net/ipv6/ila/ila.h b/net/ipv6/ila/ila.h
index e0170f6..e369611 100644
--- a/net/ipv6/ila/ila.h
+++ b/net/ipv6/ila/ila.h
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -23,6 +24,16 @@
 #include 
 #include 
 
+extern unsigned int ila_net_id;
+
+struct ila_net {
+   struct rhashtable rhash_table;
+   spinlock_t *locks; /* Bucket locks for entry manipulation */
+   unsigned int locks_mask;
+   bool hooks_registered;
+   struct net_rslv *nrslv;
+};
+
 struct ila_locator {
union {
__u8v8[8];
@@ -114,9 +125,14 @@ void ila_update_ipv6_locator(struct sk_buff *skb, struct 
ila_params *p,
 
 void ila_init_saved_csum(struct ila_params *p);
 
+void ila_rslv_resolved(struct ila_net *ilan, struct ila_addr *iaddr);
 int ila_lwt_init(void);
 void ila_lwt_fini(void);
 int ila_xlat_init(void);
 void ila_xlat_fini(void);
+int ila_rslv_init(void);
+void 

Re: [RFC v3 07/22] landlock: Handle file comparisons

2016-09-14 Thread Alexei Starovoitov
On Thu, Sep 15, 2016 at 01:02:22AM +0200, Mickaël Salaün wrote:
> > 
> > I would suggest for the next RFC to do minimal 7 patches up to this point
> > with simple example that demonstrates the use case.
> > I would avoid all unpriv stuff and all of seccomp for the next RFC as well,
> > otherwise I don't think we can realistically make forward progress, since
> > there are too many issues raised in the subsequent patches.
> 
> I hope we will find a common agreement about seccomp vs cgroup… I think
> both approaches have their advantages, can be complementary and nicely
> combined.

I don't mind having both task-based lsm and cgroup-based as long as
infrastructure is not duplicated and the scaling issues from the earlier
version are resolved.
I'm proposing to do cgroup only for the next RFC, since mine and Sargun's
use case for this bpf+lsm+cgroup is _not_ security or sandboxing.
No need for unpriv, no_new_priv to cgroups are other things that Andy
is concerned about.

> Unprivileged sandboxing is the main goal of Landlock. This should not be
> a problem, even for privileged features, thanks to the new subtype/access.

yes. the point is that unpriv stuff can come later after agreement is reached.
If we keep arguing about seccomp details this set won't go anywhere.
Even in the basic part (which is cgroup+bpf+lsm) there are plenty of questions
still to be agreed.

> Agreed. With this RFC, the Checmate features (i.e. network helpers)
> should be able to sit on top of Landlock.

I think neither of them should be called fancy names for no technical reason.
We will have only one bpf based lsm. That's it and it doesn't
need an obscure name. Directory name can be security/bpf/..stuff.c



[PATCH net-next 3/7] rhashtable: Call library function alloc_bucket_locks

2016-09-14 Thread Tom Herbert
To allocate the array of bucket locks for the hash table we now
call library function alloc_bucket_spinlocks. This function is
based on the old alloc_bucket_locks in rhashtable and should
produce the same effect.

Acked-by: Thomas Graf 
Signed-off-by: Tom Herbert 
---
 lib/rhashtable.c | 46 --
 1 file changed, 4 insertions(+), 42 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 06c2872..5b53304 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -59,50 +59,10 @@ EXPORT_SYMBOL_GPL(lockdep_rht_bucket_is_held);
 #define ASSERT_RHT_MUTEX(HT)
 #endif
 
-
-static int alloc_bucket_locks(struct rhashtable *ht, struct bucket_table *tbl,
- gfp_t gfp)
-{
-   unsigned int i, size;
-#if defined(CONFIG_PROVE_LOCKING)
-   unsigned int nr_pcpus = 2;
-#else
-   unsigned int nr_pcpus = num_possible_cpus();
-#endif
-
-   nr_pcpus = min_t(unsigned int, nr_pcpus, 64UL);
-   size = roundup_pow_of_two(nr_pcpus * ht->p.locks_mul);
-
-   /* Never allocate more than 0.5 locks per bucket */
-   size = min_t(unsigned int, size, tbl->size >> 1);
-
-   if (sizeof(spinlock_t) != 0) {
-   tbl->locks = NULL;
-#ifdef CONFIG_NUMA
-   if (size * sizeof(spinlock_t) > PAGE_SIZE &&
-   gfp == GFP_KERNEL)
-   tbl->locks = vmalloc(size * sizeof(spinlock_t));
-#endif
-   if (gfp != GFP_KERNEL)
-   gfp |= __GFP_NOWARN | __GFP_NORETRY;
-
-   if (!tbl->locks)
-   tbl->locks = kmalloc_array(size, sizeof(spinlock_t),
-  gfp);
-   if (!tbl->locks)
-   return -ENOMEM;
-   for (i = 0; i < size; i++)
-   spin_lock_init(&tbl->locks[i]);
-   }
-   tbl->locks_mask = size - 1;
-
-   return 0;
-}
-
 static void bucket_table_free(const struct bucket_table *tbl)
 {
if (tbl)
-   kvfree(tbl->locks);
+   free_bucket_spinlocks(tbl->locks);
 
kvfree(tbl);
 }
@@ -131,7 +91,9 @@ static struct bucket_table *bucket_table_alloc(struct 
rhashtable *ht,
 
tbl->size = nbuckets;
 
-   if (alloc_bucket_locks(ht, tbl, gfp) < 0) {
+   /* Never allocate more than 0.5 locks per bucket */
+   if (alloc_bucket_spinlocks(&tbl->locks, &tbl->locks_mask,
+  tbl->size >> 1, ht->p.locks_mul, gfp)) {
bucket_table_free(tbl);
return NULL;
}
-- 
2.8.0.rc2



[PATCH net-next 1/7] lwt: Add net to build_state argument

2016-09-14 Thread Tom Herbert
Users of LWT need to know the net (network namespace) if they want to
implement per-net operations in LWT.

Signed-off-by: Tom Herbert 
---
 include/net/lwtunnel.h| 10 +-
 net/core/lwtunnel.c   | 11 +++
 net/ipv4/fib_semantics.c  |  7 ---
 net/ipv4/ip_tunnel_core.c | 12 ++--
 net/ipv6/ila/ila_lwt.c|  6 +++---
 net/ipv6/route.c  |  2 +-
 net/mpls/mpls_iptunnel.c  |  6 +++---
 7 files changed, 29 insertions(+), 25 deletions(-)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index ea3f80f..3cd6b7c 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -33,9 +33,9 @@ struct lwtunnel_state {
 };
 
 struct lwtunnel_encap_ops {
-   int (*build_state)(struct net_device *dev, struct nlattr *encap,
-  unsigned int family, const void *cfg,
-  struct lwtunnel_state **ts);
+   int (*build_state)(struct net *net, struct net_device *dev,
+  struct nlattr *encap, unsigned int family,
+  const void *cfg, struct lwtunnel_state **ts);
int (*output)(struct net *net, struct sock *sk, struct sk_buff *skb);
int (*input)(struct sk_buff *skb);
int (*fill_encap)(struct sk_buff *skb,
@@ -106,8 +106,8 @@ int lwtunnel_encap_add_ops(const struct lwtunnel_encap_ops 
*op,
   unsigned int num);
 int lwtunnel_encap_del_ops(const struct lwtunnel_encap_ops *op,
   unsigned int num);
-int lwtunnel_build_state(struct net_device *dev, u16 encap_type,
-struct nlattr *encap,
+int lwtunnel_build_state(struct net *net, struct net_device *dev,
+u16 encap_type, struct nlattr *encap,
 unsigned int family, const void *cfg,
 struct lwtunnel_state **lws);
 int lwtunnel_fill_encap(struct sk_buff *skb,
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index e5f84c2..ba8be0b 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -39,6 +39,8 @@ static const char *lwtunnel_encap_str(enum 
lwtunnel_encap_types encap_type)
return "MPLS";
case LWTUNNEL_ENCAP_ILA:
return "ILA";
+   case LWTUNNEL_ENCAP_ILA_NOTIFY:
+   return "ILA_NOTIFY";
case LWTUNNEL_ENCAP_IP6:
case LWTUNNEL_ENCAP_IP:
case LWTUNNEL_ENCAP_NONE:
@@ -96,9 +98,10 @@ int lwtunnel_encap_del_ops(const struct lwtunnel_encap_ops 
*ops,
 }
 EXPORT_SYMBOL(lwtunnel_encap_del_ops);
 
-int lwtunnel_build_state(struct net_device *dev, u16 encap_type,
-struct nlattr *encap, unsigned int family,
-const void *cfg, struct lwtunnel_state **lws)
+int lwtunnel_build_state(struct net *net, struct net_device *dev,
+u16 encap_type, struct nlattr *encap,
+unsigned int family, const void *cfg,
+struct lwtunnel_state **lws)
 {
const struct lwtunnel_encap_ops *ops;
int ret = -EINVAL;
@@ -123,7 +126,7 @@ int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
}
 #endif
if (likely(ops && ops->build_state))
-   ret = ops->build_state(dev, encap, family, cfg, lws);
+   ret = ops->build_state(net, dev, encap, family, cfg, lws);
rcu_read_unlock();
 
return ret;
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 388d3e2..aee4e95 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -511,7 +511,8 @@ static int fib_get_nhs(struct fib_info *fi, struct 
rtnexthop *rtnh,
goto err_inval;
if (cfg->fc_oif)
dev = __dev_get_by_index(net, 
cfg->fc_oif);
-   ret = lwtunnel_build_state(dev, nla_get_u16(
+   ret = lwtunnel_build_state(net, dev,
+  nla_get_u16(
   nla_entype),
   nla,  AF_INET, cfg,
   &lwtstate);
@@ -610,7 +611,7 @@ static int fib_encap_match(struct net *net, u16 encap_type,
 
if (oif)
dev = __dev_get_by_index(net, oif);
-   ret = lwtunnel_build_state(dev, encap_type, encap,
+   ret = lwtunnel_build_state(net, dev, encap_type, encap,
   AF_INET, cfg, &lwtstate);
if (!ret) {
result = lwtunnel_cmp_encap(lwtstate, nh->nh_lwtstate);
@@ -1098,7 +1099,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
goto err_inval;
if (cfg->fc_oif)
dev = __dev_get_by_index(net, cfg->fc_oif);
-   err = 

Re: [RFC v3 03/22] bpf,landlock: Add a new arraymap type to deal with (Landlock) handles

2016-09-14 Thread Mickaël Salaün

On 14/09/2016 20:51, Alexei Starovoitov wrote:
> On Wed, Sep 14, 2016 at 09:23:56AM +0200, Mickaël Salaün wrote:
>> This new arraymap looks like a set and brings new properties:
>> * strong typing of entries: the eBPF functions get the array type of
>>   elements instead of CONST_PTR_TO_MAP (e.g.
>>   CONST_PTR_TO_LANDLOCK_HANDLE_FS);
>> * force sequential filling (i.e. replace or append-only update), which
>>   allows quick browsing of all entries.
>>
>> This strong typing is useful to statically check if the content of a map
>> can be passed to an eBPF function. For example, Landlock uses it to store
>> and manage kernel objects (e.g. struct file) instead of dealing with
>> userland raw data. This improves efficiency and ensures that an eBPF
>> program can only call functions with the right high-level arguments.
>>
>> The enum bpf_map_handle_type lists low-level types (e.g.
>> BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD) which are identified when
>> updating a map entry (handle). These handle types are used to infer a
>> high-level arraymap type which is listed in enum bpf_map_array_type
>> (e.g. BPF_MAP_ARRAY_TYPE_LANDLOCK_FS).
>>
>> For now, this new arraymap is only used by Landlock LSM (cf. next
>> commits) but it could be useful for other needs.
>>
>> Changes since v2:
>> * add a RLIMIT_NOFILE-based limit to the maximum number of arraymap
>>   handle entries (suggested by Andy Lutomirski)
>> * remove useless checks
>>
>> Changes since v1:
>> * arraymap of handles replace custom checker groups
>> * simpler userland API
>>
>> Signed-off-by: Mickaël Salaün 
>> Cc: Alexei Starovoitov 
>> Cc: Andy Lutomirski 
>> Cc: Daniel Borkmann 
>> Cc: David S. Miller 
>> Cc: Kees Cook 
>> Link: 
>> https://lkml.kernel.org/r/calcetrwwtiz3kztkegow24-dvhqq6lftwexh77fd2g5o71y...@mail.gmail.com
>> ---
>>  include/linux/bpf.h  |  14 
>>  include/uapi/linux/bpf.h |  18 +
>>  kernel/bpf/arraymap.c| 203 
>> +++
>>  kernel/bpf/verifier.c|  12 ++-
>>  4 files changed, 246 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index fa9a988400d9..eae4ce4542c1 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -13,6 +13,10 @@
>>  #include 
>>  #include 
>>  
>> +#ifdef CONFIG_SECURITY_LANDLOCK
>> +#include  /* struct file */
>> +#endif /* CONFIG_SECURITY_LANDLOCK */
>> +
>>  struct perf_event;
>>  struct bpf_map;
>>  
>> @@ -38,6 +42,7 @@ struct bpf_map_ops {
>>  struct bpf_map {
>>  atomic_t refcnt;
>>  enum bpf_map_type map_type;
>> +enum bpf_map_array_type map_array_type;
>>  u32 key_size;
>>  u32 value_size;
>>  u32 max_entries;
>> @@ -187,6 +192,9 @@ struct bpf_array {
>>   */
>>  enum bpf_prog_type owner_prog_type;
>>  bool owner_jited;
>> +#ifdef CONFIG_SECURITY_LANDLOCK
>> +u32 n_entries;  /* number of entries in a handle array */
>> +#endif /* CONFIG_SECURITY_LANDLOCK */
>>  union {
>>  char value[0] __aligned(8);
>>  void *ptrs[0] __aligned(8);
>> @@ -194,6 +202,12 @@ struct bpf_array {
>>  };
>>  };
>>  
>> +#ifdef CONFIG_SECURITY_LANDLOCK
>> +struct map_landlock_handle {
>> +u32 type; /* enum bpf_map_handle_type */
>> +};
>> +#endif /* CONFIG_SECURITY_LANDLOCK */
>> +
>>  #define MAX_TAIL_CALL_CNT 32
>>  
>>  struct bpf_event_entry {
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 7cd36166f9b7..b68de57f7ab8 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -87,6 +87,15 @@ enum bpf_map_type {
>>  BPF_MAP_TYPE_PERCPU_ARRAY,
>>  BPF_MAP_TYPE_STACK_TRACE,
>>  BPF_MAP_TYPE_CGROUP_ARRAY,
>> +BPF_MAP_TYPE_LANDLOCK_ARRAY,
>> +};
>> +
>> +enum bpf_map_array_type {
>> +BPF_MAP_ARRAY_TYPE_UNSPEC,
>> +};
>> +
>> +enum bpf_map_handle_type {
>> +BPF_MAP_HANDLE_TYPE_UNSPEC,
>>  };
> 
> missing something. why it has to be special to have it's own
> fd array implementation?
> Please take a look how BPF_MAP_TYPE_PERF_EVENT_ARRAY, 
> BPF_MAP_TYPE_CGROUP_ARRAY and BPF_MAP_TYPE_PROG_ARRAY are done.
> The all store objects into array map that user space passes via FD.
> I think the same model should apply here.

The idea is to have multiple ways for userland to describe a resource
(e.g. an open file descriptor, a path or a glob pattern). The kernel
representation could then be a "struct path *" or dedicated types (e.g.
custom glob).

Another interesting point (that could replace
check_map_func_compatibility()) is that BPF_MAP_TYPE_LANDLOCK_ARRAY
translates to dedicated (abstract) types (instead of CONST_PTR_TO_MAP)
thanks to bpf_reg_type_from_map(). This is useful to abstract the
userland (map) interface from the kernel object(s) dealing with that type.

A third point is that BPF_MAP_TYPE_LANDLOCK_ARRAY is a kind of set. It
is optimized to 

[PATCH net-next 5/7] rhashtable: abstract out function to get hash

2016-09-14 Thread Tom Herbert
Split out most of rht_key_hashfn which is calculating the hash into
its own function. This way the hash function can be called separately to
get the hash value.

Acked-by: Thomas Graf 
Signed-off-by: Tom Herbert 
---
 include/linux/rhashtable.h | 28 ++--
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index fd82584..e398a62 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -208,34 +208,42 @@ static inline unsigned int rht_bucket_index(const struct 
bucket_table *tbl,
return (hash >> RHT_HASH_RESERVED_SPACE) & (tbl->size - 1);
 }
 
-static inline unsigned int rht_key_hashfn(
-   struct rhashtable *ht, const struct bucket_table *tbl,
-   const void *key, const struct rhashtable_params params)
+static inline unsigned int rht_key_get_hash(struct rhashtable *ht,
+   const void *key, const struct rhashtable_params params,
+   unsigned int hash_rnd)
 {
unsigned int hash;
 
/* params must be equal to ht->p if it isn't constant. */
if (!__builtin_constant_p(params.key_len))
-   hash = ht->p.hashfn(key, ht->key_len, tbl->hash_rnd);
+   hash = ht->p.hashfn(key, ht->key_len, hash_rnd);
else if (params.key_len) {
unsigned int key_len = params.key_len;
 
if (params.hashfn)
-   hash = params.hashfn(key, key_len, tbl->hash_rnd);
+   hash = params.hashfn(key, key_len, hash_rnd);
else if (key_len & (sizeof(u32) - 1))
-   hash = jhash(key, key_len, tbl->hash_rnd);
+   hash = jhash(key, key_len, hash_rnd);
else
-   hash = jhash2(key, key_len / sizeof(u32),
- tbl->hash_rnd);
+   hash = jhash2(key, key_len / sizeof(u32), hash_rnd);
} else {
unsigned int key_len = ht->p.key_len;
 
if (params.hashfn)
-   hash = params.hashfn(key, key_len, tbl->hash_rnd);
+   hash = params.hashfn(key, key_len, hash_rnd);
else
-   hash = jhash(key, key_len, tbl->hash_rnd);
+   hash = jhash(key, key_len, hash_rnd);
}
 
+   return hash;
+}
+
+static inline unsigned int rht_key_hashfn(
+   struct rhashtable *ht, const struct bucket_table *tbl,
+   const void *key, const struct rhashtable_params params)
+{
+   unsigned int hash = rht_key_get_hash(ht, key, params, tbl->hash_rnd);
+
return rht_bucket_index(tbl, hash);
 }
 
-- 
2.8.0.rc2



[PATCH net-next 2/7] spinlock: Add library function to allocate spinlock buckets array

2016-09-14 Thread Tom Herbert
Add two new library functions, alloc_bucket_spinlocks and
free_bucket_spinlocks. These are used to allocate and free an array
of spinlocks that are useful as locks for hash buckets. The interface
specifies the maximum number of spinlocks in the array as well
as a CPU multiplier to derive the number of spinlocks to allocate.
The number allocated is rounded up to a power of two to make
the array amenable to hash lookup.

Reviewed-by: Greg Rose 
Acked-by: Thomas Graf 

Signed-off-by: Tom Herbert 
---
 include/linux/spinlock.h |  6 +
 lib/Makefile |  2 +-
 lib/bucket_locks.c   | 63 
 3 files changed, 70 insertions(+), 1 deletion(-)
 create mode 100644 lib/bucket_locks.c

diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 47dd0ce..4ebdfbf 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -416,4 +416,10 @@ extern int _atomic_dec_and_lock(atomic_t *atomic, 
spinlock_t *lock);
 #define atomic_dec_and_lock(atomic, lock) \
__cond_lock(lock, _atomic_dec_and_lock(atomic, lock))
 
+int alloc_bucket_spinlocks(spinlock_t **locks, unsigned int *lock_mask,
+  unsigned int max_size, unsigned int cpu_mult,
+  gfp_t gfp);
+
+void free_bucket_spinlocks(spinlock_t *locks);
+
 #endif /* __LINUX_SPINLOCK_H */
diff --git a/lib/Makefile b/lib/Makefile
index 5dc77a8..f91185e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -36,7 +36,7 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o 
debug_locks.o random32.o \
 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-once.o
+once.o bucket_locks.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/bucket_locks.c b/lib/bucket_locks.c
new file mode 100644
index 000..bb9bf11
--- /dev/null
+++ b/lib/bucket_locks.c
@@ -0,0 +1,63 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* Allocate an array of spinlocks to be accessed by a hash. Two arguments
+ * indicate the number of elements to allocate in the array. max_size
+ * gives the maximum number of elements to allocate. cpu_mult gives
+ * the number of locks per CPU to allocate. The size is rounded up
+ * to a power of 2 to be suitable as a hash table.
+ */
+int alloc_bucket_spinlocks(spinlock_t **locks, unsigned int *locks_mask,
+  unsigned int max_size, unsigned int cpu_mult,
+  gfp_t gfp)
+{
+   unsigned int i, size;
+#if defined(CONFIG_PROVE_LOCKING)
+   unsigned int nr_pcpus = 2;
+#else
+   unsigned int nr_pcpus = num_possible_cpus();
+#endif
+   spinlock_t *tlocks = NULL;
+
+   if (cpu_mult) {
+   nr_pcpus = min_t(unsigned int, nr_pcpus, 64UL);
+   size = min_t(unsigned int, nr_pcpus * cpu_mult, max_size);
+   } else {
+   size = max_size;
+   }
+   size = roundup_pow_of_two(size);
+
+   if (!size)
+   return -EINVAL;
+
+   if (sizeof(spinlock_t) != 0) {
+#ifdef CONFIG_NUMA
+   if (size * sizeof(spinlock_t) > PAGE_SIZE &&
+   gfp == GFP_KERNEL)
+   tlocks = vmalloc(size * sizeof(spinlock_t));
+#endif
+   if (gfp != GFP_KERNEL)
+   gfp |= __GFP_NOWARN | __GFP_NORETRY;
+
+   if (!tlocks)
+   tlocks = kmalloc_array(size, sizeof(spinlock_t), gfp);
+   if (!tlocks)
+   return -ENOMEM;
+   for (i = 0; i < size; i++)
+   spin_lock_init(&tlocks[i]);
+   }
+   *locks = tlocks;
+   *locks_mask = size - 1;
+
+   return 0;
+}
+EXPORT_SYMBOL(alloc_bucket_spinlocks);
+
+void free_bucket_spinlocks(spinlock_t *locks)
+{
+   kvfree(locks);
+}
+EXPORT_SYMBOL(free_bucket_spinlocks);
-- 
2.8.0.rc2



[PATCH net-next 0/7] net: ILA resolver and generic resolver backend

2016-09-14 Thread Tom Herbert
This patch set implements an ILA host side resolver. This uses LWT to
implement the hook to a userspace resolver and tracks pending unresolved
addresses using the backend net resolver.

This patch set contains:

- An new library function to allocate an array of spinlocks for use
  with locking hash buckets.
- Make hash function in rhashtable directly callable.
- A generic resolver backend infrastructure. This primarily does two
  things: tracks unresolved addresses and implements a timeout for
  when resolution does not happen. These mechanisms provide rate
  limiting control over resolution requests (for instance, in ILA it
  is used to rate limit requests to userspace to resolve addresses).
- The ILA resolver. This implements the path from the kernel ILA
  implementation to a userspace daemon when an identifier address
  needs to be resolved.
- Routing messages are used over netlink to indicate resolution
  requests.

Changes from the initial RFC:

 - Added net argument to LWT build_state
 - Made resolve timeout an attribute of the LWT encap route
 - Changed ILA notifications to be regular routing messages of event
   RTM_ADDR_RESOLVE, family RTNL_FAMILY_ILA, and group
   RTNLGRP_ILA_NOTIFY

Tested:
 - Ran a UDP flood to random addresses in a resolver prefix. Observed
   timeout and limits were working (watching "ip monitor").
 - Also ran against an ILA client daemon that runs the resolver
   protocol. Observed that when resolution completes (ILA encap route is
   installed) routing messages are no longer sent.

Tom Herbert (7):
  lwt: Add net to build_state argument
  spinlock: Add library function to allocate spinlock buckets array
  rhashtable: Call library function alloc_bucket_locks
  ila: Call library function alloc_bucket_locks
  rhashtable: abstract out function to get hash
  net: Generic resolver backend
  ila: Resolver mechanism

 include/linux/rhashtable.h |  28 +++--
 include/linux/spinlock.h   |   6 +
 include/net/lwtunnel.h |  10 +-
 include/net/resolver.h |  58 +
 include/uapi/linux/ila.h   |   9 ++
 include/uapi/linux/lwtunnel.h  |   1 +
 include/uapi/linux/rtnetlink.h |   8 +-
 lib/Makefile   |   2 +-
 lib/bucket_locks.c |  63 ++
 lib/rhashtable.c   |  46 +--
 net/Kconfig|   4 +
 net/core/Makefile  |   1 +
 net/core/lwtunnel.c|  11 +-
 net/core/resolver.c| 268 +
 net/ipv4/fib_semantics.c   |   7 +-
 net/ipv4/ip_tunnel_core.c  |  12 +-
 net/ipv6/Kconfig   |   1 +
 net/ipv6/ila/Makefile  |   2 +-
 net/ipv6/ila/ila.h |  16 +++
 net/ipv6/ila/ila_common.c  |   7 ++
 net/ipv6/ila/ila_lwt.c |  15 ++-
 net/ipv6/ila/ila_resolver.c| 246 +
 net/ipv6/ila/ila_xlat.c|  51 ++--
 net/ipv6/route.c   |   2 +-
 net/mpls/mpls_iptunnel.c   |   6 +-
 25 files changed, 761 insertions(+), 119 deletions(-)
 create mode 100644 include/net/resolver.h
 create mode 100644 lib/bucket_locks.c
 create mode 100644 net/core/resolver.c
 create mode 100644 net/ipv6/ila/ila_resolver.c

-- 
2.8.0.rc2



[PATCH net-next 4/7] ila: Call library function alloc_bucket_locks

2016-09-14 Thread Tom Herbert
To allocate the array of bucket locks for the hash table we now
call library function alloc_bucket_spinlocks.

Signed-off-by: Tom Herbert 
---
 net/ipv6/ila/ila_xlat.c | 36 +---
 1 file changed, 5 insertions(+), 31 deletions(-)

diff --git a/net/ipv6/ila/ila_xlat.c b/net/ipv6/ila/ila_xlat.c
index e604013..7d1c34b 100644
--- a/net/ipv6/ila/ila_xlat.c
+++ b/net/ipv6/ila/ila_xlat.c
@@ -30,34 +30,6 @@ struct ila_net {
bool hooks_registered;
 };
 
-#define LOCKS_PER_CPU 10
-
-static int alloc_ila_locks(struct ila_net *ilan)
-{
-   unsigned int i, size;
-   unsigned int nr_pcpus = num_possible_cpus();
-
-   nr_pcpus = min_t(unsigned int, nr_pcpus, 32UL);
-   size = roundup_pow_of_two(nr_pcpus * LOCKS_PER_CPU);
-
-   if (sizeof(spinlock_t) != 0) {
-#ifdef CONFIG_NUMA
-   if (size * sizeof(spinlock_t) > PAGE_SIZE)
-   ilan->locks = vmalloc(size * sizeof(spinlock_t));
-   else
-#endif
-   ilan->locks = kmalloc_array(size, sizeof(spinlock_t),
-   GFP_KERNEL);
-   if (!ilan->locks)
-   return -ENOMEM;
-   for (i = 0; i < size; i++)
-   spin_lock_init(&ilan->locks[i]);
-   }
-   ilan->locks_mask = size - 1;
-
-   return 0;
-}
-
 static u32 hashrnd __read_mostly;
 static __always_inline void __ila_hash_secret_init(void)
 {
@@ -561,14 +533,16 @@ static const struct genl_ops ila_nl_ops[] = {
},
 };
 
-#define ILA_HASH_TABLE_SIZE 1024
+#define LOCKS_PER_CPU 10
+#define MAX_LOCKS 1024
 
 static __net_init int ila_init_net(struct net *net)
 {
int err;
struct ila_net *ilan = net_generic(net, ila_net_id);
 
-   err = alloc_ila_locks(ilan);
+   err = alloc_bucket_spinlocks(&ilan->locks, &ilan->locks_mask,
+MAX_LOCKS, LOCKS_PER_CPU, GFP_KERNEL);
if (err)
return err;
 
@@ -583,7 +557,7 @@ static __net_exit void ila_exit_net(struct net *net)
 
rhashtable_free_and_destroy(&ilan->rhash_table, ila_free_cb, NULL);
 
-   kvfree(ilan->locks);
+   free_bucket_spinlocks(ilan->locks);
 
if (ilan->hooks_registered)
nf_unregister_net_hooks(net, ila_nf_hook_ops,
-- 
2.8.0.rc2



[PATCH net-next 6/7] net: Generic resolver backend

2016-09-14 Thread Tom Herbert
This patch implements the backend of a resolver, specifically it
provides a means to track unresolved addresses and to time them out.

The resolver is mostly a frontend to an rhashtable where the key
of the table is whatever address type or object is tracked. A resolver
instance is created by net_rslv_create. A resolver is destroyed by
net_rslv_destroy.

There are two functions that are used to manipulate entries in the
table: net_rslv_lookup_and_create and net_rslv_resolved.

net_rslv_lookup_and_create is called with an unresolved address as
the argument. It returns a structure of type net_rslv_ent. When
called, a lookup is performed to see if an entry for the address
is already in the table; if it is, the entry is returned and
false is returned in the new bool pointer argument to indicate that
the entry was preexisting. If an entry is not found, one is created
and true is returned in the new pointer argument. It is expected
that when an entry is new the address resolution protocol is
initiated (for instance an RTM_ADDR_RESOLVE message may be sent to a
userspace daemon as we will do in ILA). If net_rslv_lookup_and_create
returns NULL then presumably the hash table has reached the limit of
the number of outstanding unresolved addresses; the caller should
take appropriate actions to avoid spamming the resolution protocol.

net_rslv_resolved is called when resolution is complete (e.g. an
ILA locator mapping was instantiated for a locator). The entry is
removed from the hash table.

An argument to net_rslv_create indicates a timeout for pending
resolutions in milliseconds. If the timer fires before resolution
then the entry is removed from the table. Subsequently, another
attempt to resolve the same address will result in a new entry in
the table.

net_rslv_lookup_and_create allocates a net_rslv_ent struct and
includes allocating related user data. This is the object[] field
in the structure. The key (unresolved address) is always the first
field in the object. Following that the caller may add its own
private data. The key length and the size of the user object
(including the key) are specified in net_rslv_create.

There are three callback functions that can be set as arguments to
net_rslv_create:

   - cmp_fn: Compare function for hash table. Arguments are the
   key and an object in the table. If this is NULL then the
   default memcmp of rhashtable is used.

   - init_fn: Initialize a new net_rslv_ent structure. This allows
   initialization of the user portion of the structure
   (the object[]).

   - destroy_fn: Called right before a net_rslv_ent is freed.
   This allows cleanup of user data associated with the
   entry.

Note that the resolver backend only tracks unresolved addresses; it
is up to the caller to perform the mechanism of resolution. This
includes the possibility of queuing packets awaiting resolution;
this can be accomplished for instance by maintaining an skbuff queue
in the net_rslv_ent user object[] data.

DOS mitigation is done by limiting the number of entries in the
resolver table (the max_size argument of net_rslv_create)
and setting a timeout. If the timeout is set then the maximum rate
of new resolution requests is max_table_size / timeout. For
instance, with a maximum size of 1000 entries and a timeout of 100
msecs the maximum rate of resolution requests is 10,000/s.

Signed-off-by: Tom Herbert 
---
 include/net/resolver.h |  58 +++
 net/Kconfig|   4 +
 net/core/Makefile  |   1 +
 net/core/resolver.c| 268 +
 4 files changed, 331 insertions(+)
 create mode 100644 include/net/resolver.h
 create mode 100644 net/core/resolver.c

diff --git a/include/net/resolver.h b/include/net/resolver.h
new file mode 100644
index 000..9274237
--- /dev/null
+++ b/include/net/resolver.h
@@ -0,0 +1,58 @@
+#ifndef __NET_RESOLVER_H
+#define __NET_RESOLVER_H
+
+#include 
+
+struct net_rslv;
+struct net_rslv_ent;
+
+typedef int (*net_rslv_cmpfn)(struct net_rslv *nrslv, const void *key,
+ const void *object);
+typedef void (*net_rslv_initfn)(struct net_rslv *nrslv, void *object);
+typedef void (*net_rslv_destroyfn)(struct net_rslv_ent *nrent);
+
+struct net_rslv {
+   struct rhashtable rhash_table;
+   struct rhashtable_params params;
+   net_rslv_cmpfn rslv_cmp;
+   net_rslv_initfn rslv_init;
+   net_rslv_destroyfn rslv_destroy;
+   size_t obj_size;
+   spinlock_t *locks;
+   unsigned int locks_mask;
+   unsigned int hash_rnd;
+};
+
+struct net_rslv_ent {
+   struct rcu_head rcu;
+   union {
+   /* Fields set when entry is in hash table */
+   struct {
+   struct rhash_head node;
+   struct delayed_work timeout_work;
+   struct net_rslv *nrslv;
+   };
+
+   /* Fields set when rcu freeing structure */
+

Re: [RFC 3/3] phy,leds: add support for led triggers on phy link state change

2016-09-14 Thread Florian Fainelli
On 09/14/2016 02:55 PM, Zach Brown wrote:
> From: Josh Cartwright 
> 
> Create an option CONFIG_LED_TRIGGER_PHY (default n), which will
> create a set of led triggers for each instantiated PHY device.  There is
> one LED trigger per link-speed, per-phy.
> 
> This allows for a user to configure their system to allow a set of LEDs
> to represent link state changes on the phy.

The part that seems to be missing from this changeset (not an issue) is
how you can "accelerate" the triggers, or rather make sure that the
trigger configuration potentially calls back into the PHY driver when
the requested trigger is actually supported by the on-PHY LEDs.

Other than that, just minor nits here and there.

> 
> Signed-off-by: Josh Cartwright 
> Signed-off-by: Nathan Sullivan 
> Signed-off-by: Zach Brown 
> ---

> +config LED_TRIGGER_PHY
> + bool "Support LED triggers for tracking link state"
> + depends on LEDS_TRIGGERS
> + ---help---
> +   Adds support for a set of LED trigger events per-PHY.  Link
> +   state change will trigger the events, for consumption by an
> +   LED class driver.  There are triggers for each link speed,
> +   and are of the form:
> +::
> +
> +   Where speed is one of: 10Mb, 100Mb, Gb, 2.5Gb, or 10GbE.

I would do s/10GbE/10Gb/ just to be consistent with the other speeds, or
maybe, to avoid too much duplication of how we represent speeds, use
the same set of strings that drivers/net/phy/phy.c::phy_speed_to_str uses.


> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index c6f6683..3345767 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -936,6 +936,7 @@ void phy_state_machine(struct work_struct *work)
>   phydev->state = PHY_NOLINK;
>   netif_carrier_off(phydev->attached_dev);
>   phydev->adjust_link(phydev->attached_dev);
> + phy_led_trigger_change_speed(phydev);

Since we have essentially two actions to take when performing link state
changes:

- call the adjust_link callback
- call into the LED trigger

we might consider encapsulating this into a function that could be named
phy_adjust_link() and does both of these. That would be a preliminary
patch to this one.

>  
> @@ -345,6 +347,8 @@ struct phy_device *phy_device_create(struct mii_bus *bus, 
> int addr, int phy_id,
>  
>   dev->state = PHY_DOWN;
>  
> + phy_led_triggers_register(dev);
> +
>   mutex_init(&dev->lock);
>   INIT_DELAYED_WORK(&dev->state_queue, phy_state_machine);
>   INIT_WORK(&dev->phy_queue, phy_change);

Humm, should it be before the PHY state machine workqueue creation or
after? Just wondering if there is a remote chance we could get an user
to immediately program a trigger and this could create a problem, maybe
not so much.

> diff --git a/drivers/net/phy/phy_led_triggers.c 
> b/drivers/net/phy/phy_led_triggers.c
> new file mode 100644
> index 000..6c40414
> --- /dev/null
> +++ b/drivers/net/phy/phy_led_triggers.c
> @@ -0,0 +1,109 @@
> +/* Copyright (C) 2016 National Instruments Corp.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +#include 
> +#include 
> +
> +void phy_led_trigger_change_speed(struct phy_device *phy)
> +{
> + struct phy_led_trigger *plt;
> +
> + if (!phy->link) {
> + if (phy->last_triggered) {
> + led_trigger_event(&phy->last_triggered->trigger,
> +   LED_OFF);
> + phy->last_triggered = NULL;
> + }
> + return;
> + }
> +
> + switch (phy->speed) {
> + case SPEED_10:
> + plt = &phy->phy_led_trigger[0];
> + break;
> + case SPEED_100:
> + plt = &phy->phy_led_trigger[1];
> + break;
> + case SPEED_1000:
> + plt = &phy->phy_led_trigger[2];
> + break;
> + case SPEED_2500:
> + plt = &phy->phy_led_trigger[3];
> + break;
> + case SPEED_10000:
> + plt = &phy->phy_led_trigger[4];
> + break;

We could use a table here indexed by the speed, or have a function that
does phy_speed_to_led_trigger(unsigned int speed) for instance, which
would be more robust to adding other speeds in the future.

> + default:
> + plt = NULL;
> + break;
> + }
> +
> + if (plt != phy->last_triggered) {
> + 

Re: [PATCHv3 net-next 08/15] nfp: add BPF to NFP code translator

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 08:00:16PM +0100, Jakub Kicinski wrote:
> Add translator for JITing eBPF to operations which
> can be executed on NFP's programmable engines.
> 
> Signed-off-by: Jakub Kicinski 
> ---
> v3:
>  - don't clone the program for the verifier (no longer needed);
>  - temporarily add a local copy of macros from bitfield.h.

so what's the status of that other patch? which tree is it going through?
Does it mean we'd have to wait till after the merge window? :(
That would be sad, since it looks like it's almost ready.



[PATCH net-next 0/4] rxrpc: Support IPv6 [ver #2]

2016-09-14 Thread David Howells

Here is a set of patches that add IPv6 support.  They need to be applied on
top of the just-posted miscellaneous fix patches.  They are:

 (1) Make autobinding of an unconnected socket work when sendmsg() is
 called to initiate a client call.

 (2) Don't specify the protocol when creating the client socket, but rather
 take the default instead.

 (3) Use rxrpc_extract_addr_from_skb() in a couple of places that were
 doing the same thing manually.  This allows the IPv6 address
 extraction to be done in fewer places.

 (4) Add IPv6 support.  With this, calls can be made to IPv6 servers from
 userspace AF_RXRPC programs; AFS, however, can't use IPv6 yet as the
 RPC calls need to be upgradeable.

Changes:

 (V2) Made IPv6 support conditional on CONFIG_IPV6.

  Changed a memcpy argument that was taking the address of an array
  struct member (with the '&' operator) and adding 12 to just add 12 to
  the array member.  This otherwise causes a problem on gcc-4.9.

The patches can be found here also:


http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite

Tagged thusly:

git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
rxrpc-rewrite-20160913-2

David
---
David Howells (4):
  rxrpc: Create an address for sendmsg() to bind unbound socket with
  rxrpc: Don't specify protocol to when creating transport socket
  rxrpc: Use rxrpc_extract_addr_from_skb() rather than doing this manually
  rxrpc: Add IPv6 support


 net/rxrpc/Kconfig|7 +++
 net/rxrpc/af_rxrpc.c |   32 +++-
 net/rxrpc/conn_object.c  |   10 
 net/rxrpc/local_event.c  |   13 ++---
 net/rxrpc/local_object.c |   41 +++-
 net/rxrpc/output.c   |   50 +--
 net/rxrpc/peer_event.c   |   26 ++
 net/rxrpc/peer_object.c  |  119 ++
 net/rxrpc/proc.c |   30 +---
 net/rxrpc/utils.c|2 +
 10 files changed, 211 insertions(+), 119 deletions(-)



[PATCH net-next 2/4] rxrpc: Don't specify protocol to when creating transport socket [ver #2]

2016-09-14 Thread David Howells
Pass 0 as the protocol argument when creating the transport socket rather
than IPPROTO_UDP.

Signed-off-by: David Howells 
---

 net/rxrpc/local_object.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rxrpc/local_object.c b/net/rxrpc/local_object.c
index 782b9adf67cb..8720be2a6250 100644
--- a/net/rxrpc/local_object.c
+++ b/net/rxrpc/local_object.c
@@ -103,8 +103,8 @@ static int rxrpc_open_socket(struct rxrpc_local *local)
_enter("%p{%d}", local, local->srx.transport_type);
 
/* create a socket to represent the local endpoint */
-	ret = sock_create_kern(&init_net, PF_INET, local->srx.transport_type,
-			       IPPROTO_UDP, &local->socket);
+	ret = sock_create_kern(&init_net, local->srx.transport.family,
+			       local->srx.transport_type, 0, &local->socket);
if (ret < 0) {
_leave(" = %d [socket]", ret);
return ret;



[PATCH net-next 1/4] rxrpc: Create an address for sendmsg() to bind unbound socket with [ver #2]

2016-09-14 Thread David Howells
Create an address for sendmsg() to bind an unbound socket with, rather than
using a completely blank address; otherwise the transport socket creation
will fail because it will try to use address family 0.

We use the address family specified in the protocol argument when the
AF_RXRPC socket was created and SOCK_DGRAM as the default.  For anything
else, bind() must be used.

Signed-off-by: David Howells 
---

 net/rxrpc/af_rxrpc.c |   12 
 1 file changed, 12 insertions(+)

diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index 25d00ded24bc..741b0d8d2e8c 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -401,6 +401,18 @@ static int rxrpc_sendmsg(struct socket *sock, struct msghdr *m, size_t len)
 
switch (rx->sk.sk_state) {
case RXRPC_UNBOUND:
+   rx->srx.srx_family = AF_RXRPC;
+   rx->srx.srx_service = 0;
+   rx->srx.transport_type = SOCK_DGRAM;
+   rx->srx.transport.family = rx->family;
+   switch (rx->family) {
+   case AF_INET:
+   rx->srx.transport_len = sizeof(struct sockaddr_in);
+   break;
+   default:
+   ret = -EAFNOSUPPORT;
+   goto error_unlock;
+   }
	local = rxrpc_lookup_local(&rx->srx);
if (IS_ERR(local)) {
ret = PTR_ERR(local);



[PATCH net-next 3/4] rxrpc: Use rxrpc_extract_addr_from_skb() rather than doing this manually [ver #2]

2016-09-14 Thread David Howells
There are two places that want to transmit a packet in response to one just
received and manually pick the address to reply to out of the sk_buff.
Make them use rxrpc_extract_addr_from_skb() instead so that IPv6 is handled
automatically.

Signed-off-by: David Howells 
---

 net/rxrpc/local_event.c |   13 +
 net/rxrpc/output.c  |   32 ++--
 2 files changed, 11 insertions(+), 34 deletions(-)

diff --git a/net/rxrpc/local_event.c b/net/rxrpc/local_event.c
index cdd58e6e9fbd..f073e932500e 100644
--- a/net/rxrpc/local_event.c
+++ b/net/rxrpc/local_event.c
@@ -15,8 +15,6 @@
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
 #include 
 #include 
@@ -33,7 +31,7 @@ static void rxrpc_send_version_request(struct rxrpc_local *local,
 {
struct rxrpc_wire_header whdr;
struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
-   struct sockaddr_in sin;
+   struct sockaddr_rxrpc srx;
struct msghdr msg;
struct kvec iov[2];
size_t len;
@@ -41,12 +39,11 @@ static void rxrpc_send_version_request(struct rxrpc_local *local,
 
_enter("");
 
-   sin.sin_family = AF_INET;
-   sin.sin_port = udp_hdr(skb)->source;
-   sin.sin_addr.s_addr = ip_hdr(skb)->saddr;
+	if (rxrpc_extract_addr_from_skb(&srx, skb) < 0)
+   return;
 
-	msg.msg_name	= &sin;
-	msg.msg_namelen	= sizeof(sin);
+	msg.msg_name	= &srx.transport;
+	msg.msg_namelen	= srx.transport_len;
msg.msg_control = NULL;
msg.msg_controllen = 0;
msg.msg_flags   = 0;
diff --git a/net/rxrpc/output.c b/net/rxrpc/output.c
index 90c7722d5779..ec3621f2c5c8 100644
--- a/net/rxrpc/output.c
+++ b/net/rxrpc/output.c
@@ -15,8 +15,6 @@
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
 #include 
 #include "ar-internal.h"
@@ -272,10 +270,7 @@ send_fragmentable:
  */
 void rxrpc_reject_packets(struct rxrpc_local *local)
 {
-   union {
-   struct sockaddr sa;
-   struct sockaddr_in sin;
-   } sa;
+   struct sockaddr_rxrpc srx;
struct rxrpc_skb_priv *sp;
struct rxrpc_wire_header whdr;
struct sk_buff *skb;
@@ -292,32 +287,21 @@ void rxrpc_reject_packets(struct rxrpc_local *local)
iov[1].iov_len = sizeof(code);
size = sizeof(whdr) + sizeof(code);
 
-	msg.msg_name = &sa;
+	msg.msg_name = &srx.transport;
msg.msg_control = NULL;
msg.msg_controllen = 0;
msg.msg_flags = 0;
 
-   memset(, 0, sizeof(sa));
-   sa.sa.sa_family = local->srx.transport.family;
-   switch (sa.sa.sa_family) {
-   case AF_INET:
-   msg.msg_namelen = sizeof(sa.sin);
-   break;
-   default:
-   msg.msg_namelen = 0;
-   break;
-   }
-
memset(, 0, sizeof(whdr));
whdr.type = RXRPC_PACKET_TYPE_ABORT;
 
	while ((skb = skb_dequeue(&local->reject_queue))) {
rxrpc_see_skb(skb);
sp = rxrpc_skb(skb);
-   switch (sa.sa.sa_family) {
-   case AF_INET:
-   sa.sin.sin_port = udp_hdr(skb)->source;
-   sa.sin.sin_addr.s_addr = ip_hdr(skb)->saddr;
+
+		if (rxrpc_extract_addr_from_skb(&srx, skb) == 0) {
+   msg.msg_namelen = srx.transport_len;
+
code = htonl(skb->priority);
 
whdr.epoch  = htonl(sp->hdr.epoch);
@@ -329,10 +313,6 @@ void rxrpc_reject_packets(struct rxrpc_local *local)
whdr.flags  &= RXRPC_CLIENT_INITIATED;
 
kernel_sendmsg(local->socket, , iov, 2, size);
-   break;
-
-   default:
-   break;
}
 
rxrpc_free_skb(skb);



[PATCH net-next 4/4] rxrpc: Add IPv6 support [ver #2]

2016-09-14 Thread David Howells
Add IPv6 support to AF_RXRPC.  With this, AF_RXRPC sockets can be created:

service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET6);

instead of:

service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET);

The AFS filesystem doesn't support IPv6 at the moment, though, since that
requires upgrades to some of the RPC calls.

Note that a good portion of this patch is replacing "%pI4:%u" in print
statements with "%pISpc" which is able to handle both protocols and print
the port.

Signed-off-by: David Howells 
---

 net/rxrpc/Kconfig|7 +++
 net/rxrpc/af_rxrpc.c |   20 ++--
 net/rxrpc/conn_object.c  |   10 
 net/rxrpc/local_object.c |   37 +++---
 net/rxrpc/output.c   |   18 +++
 net/rxrpc/peer_event.c   |   26 ++
 net/rxrpc/peer_object.c  |  119 ++
 net/rxrpc/proc.c |   30 +---
 net/rxrpc/utils.c|2 +
 9 files changed, 186 insertions(+), 83 deletions(-)

diff --git a/net/rxrpc/Kconfig b/net/rxrpc/Kconfig
index 784c53163b7b..13396c74b5c1 100644
--- a/net/rxrpc/Kconfig
+++ b/net/rxrpc/Kconfig
@@ -19,6 +19,13 @@ config AF_RXRPC
 
  See Documentation/networking/rxrpc.txt.
 
+config AF_RXRPC_IPV6
+   bool "IPv6 support for RxRPC"
+   depends on (IPV6 = m && AF_RXRPC = m) || (IPV6 = y && AF_RXRPC)
+   help
+ Say Y here to allow AF_RXRPC to use IPV6 UDP as well as IPV4 UDP as
+ its network transport.
+
 
 config AF_RXRPC_DEBUG
bool "RxRPC dynamic debugging"
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index 741b0d8d2e8c..09f81befc705 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -106,19 +106,25 @@ static int rxrpc_validate_address(struct rxrpc_sock *rx,
case AF_INET:
if (srx->transport_len < sizeof(struct sockaddr_in))
return -EINVAL;
-	_debug("INET: %x @ %pI4",
-	       ntohs(srx->transport.sin.sin_port),
-	       &srx->transport.sin.sin_addr);
tail = offsetof(struct sockaddr_rxrpc, transport.sin.__pad);
break;
 
+#ifdef CONFIG_AF_RXRPC_IPV6
case AF_INET6:
+   if (srx->transport_len < sizeof(struct sockaddr_in6))
+   return -EINVAL;
+   tail = offsetof(struct sockaddr_rxrpc, transport) +
+   sizeof(struct sockaddr_in6);
+   break;
+#endif
+
default:
return -EAFNOSUPPORT;
}
 
if (tail < len)
memset((void *)srx + tail, 0, len - tail);
+	_debug("INET: %pISp", &srx->transport);
return 0;
 }
 
@@ -409,6 +415,11 @@ static int rxrpc_sendmsg(struct socket *sock, struct msghdr *m, size_t len)
case AF_INET:
rx->srx.transport_len = sizeof(struct sockaddr_in);
break;
+#ifdef CONFIG_AF_RXRPC_IPV6
+   case AF_INET6:
+   rx->srx.transport_len = sizeof(struct sockaddr_in6);
+   break;
+#endif
default:
ret = -EAFNOSUPPORT;
goto error_unlock;
@@ -563,7 +574,8 @@ static int rxrpc_create(struct net *net, struct socket *sock, int protocol,
return -EAFNOSUPPORT;
 
/* we support transport protocol UDP/UDP6 only */
-   if (protocol != PF_INET)
+   if (protocol != PF_INET &&
+   IS_ENABLED(CONFIG_AF_RXRPC_IPV6) && protocol != PF_INET6)
return -EPROTONOSUPPORT;
 
if (sock->type != SOCK_DGRAM)
diff --git a/net/rxrpc/conn_object.c b/net/rxrpc/conn_object.c
index ffa9addb97b2..bb1f29280aea 100644
--- a/net/rxrpc/conn_object.c
+++ b/net/rxrpc/conn_object.c
@@ -134,6 +134,16 @@ struct rxrpc_connection *rxrpc_find_connection_rcu(struct rxrpc_local *local,
srx.transport.sin.sin_addr.s_addr)
goto not_found;
break;
+#ifdef CONFIG_AF_RXRPC_IPV6
+   case AF_INET6:
+   if (peer->srx.transport.sin6.sin6_port !=
+   srx.transport.sin6.sin6_port ||
+   memcmp(>srx.transport.sin6.sin6_addr,
+  _addr,
+  sizeof(struct in6_addr)) != 0)
+   goto not_found;
+   break;
+#endif
default:
BUG();
}
diff --git a/net/rxrpc/local_object.c b/net/rxrpc/local_object.c
index 8720be2a6250..e3fad80b0795 100644
--- a/net/rxrpc/local_object.c
+++ b/net/rxrpc/local_object.c
@@ -58,6 +58,17 @@ static long rxrpc_local_cmp_key(const struct rxrpc_local *local,
 			memcmp(&local->srx.transport.sin.sin_addr,
 			       &srx->transport.sin.sin_addr,
   sizeof(struct in_addr));
+#ifdef 

Re: [PATCHv3 net-next 05/15] bpf: enable non-core use of the verfier

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 08:00:13PM +0100, Jakub Kicinski wrote:
> Advanced JIT compilers and translators may want to use
> eBPF verifier as a base for parsers or to perform custom
> checks and validations.
> 
> Add ability for external users to invoke the verifier
> and provide callbacks to be invoked for every instruction
> checked.  For now, only the most basic callback for
> per-instruction pre-interpretation checks is added.  More
> advanced users may also like to have per-instruction post
> callback and state comparison callback.
> 
> Signed-off-by: Jakub Kicinski 
> ---
>  include/linux/bpf_parser.h |  89 ++
>  kernel/bpf/verifier.c  | 134 
> +++--
>  2 files changed, 158 insertions(+), 65 deletions(-)
>  create mode 100644 include/linux/bpf_parser.h
> 
> diff --git a/include/linux/bpf_parser.h b/include/linux/bpf_parser.h
> new file mode 100644
> index ..daa53b204f4d
> --- /dev/null
> +++ b/include/linux/bpf_parser.h

'bpf parser' is a bit of a misleading name, since it can be read
as a parser written in BPF.
Also, the header file contains verifier bits, so wouldn't
bpf_verifier.h be a better name?

> @@ -0,0 +1,89 @@
> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +#ifndef _LINUX_BPF_PARSER_H
> +#define _LINUX_BPF_PARSER_H 1
> +
> +#include  /* for enum bpf_reg_type */
> +#include  /* for MAX_BPF_STACK */
> +
> +struct reg_state {
> + enum bpf_reg_type type;
> + union {
> + /* valid when type == CONST_IMM | PTR_TO_STACK | UNKNOWN_VALUE 
> */
> + s64 imm;
> +
> + /* valid when type == PTR_TO_PACKET* */
> + struct {
> + u32 id;
> + u16 off;
> + u16 range;
> + };
> +
> + /* valid when type == CONST_PTR_TO_MAP | PTR_TO_MAP_VALUE |
> +  *   PTR_TO_MAP_VALUE_OR_NULL
> +  */
> + struct bpf_map *map_ptr;
> + };
> +};
> +
> +enum bpf_stack_slot_type {
> + STACK_INVALID,/* nothing was stored in this stack slot */
> + STACK_SPILL,  /* register spilled into stack */
> + STACK_MISC/* BPF program wrote some data into this slot */
> +};
> +
> +#define BPF_REG_SIZE 8   /* size of eBPF register in bytes */
> +
> +/* state of the program:
> + * type of all registers and stack info
> + */
> +struct verifier_state {
> + struct reg_state regs[MAX_BPF_REG];
> + u8 stack_slot_type[MAX_BPF_STACK];
> + struct reg_state spilled_regs[MAX_BPF_STACK / BPF_REG_SIZE];
> +};
> +
> +/* linked list of verifier states used to prune search */
> +struct verifier_state_list {
> + struct verifier_state state;
> + struct verifier_state_list *next;
> +};
> +
> +struct bpf_insn_aux_data {
> + enum bpf_reg_type ptr_type; /* pointer type for load/store insns */
> +};
> +
> +#define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program 
> */
> +
> +struct verifier_env;
> +struct bpf_ext_parser_ops {
> + int (*insn_hook)(struct verifier_env *env,
> +  int insn_idx, int prev_insn_idx);
> +};

How about calling this bpf_ext_analyzer_ops
and the main entry bpf_analyzer()?
I think that would better convey what it's doing.

> +
> +/* single container for all structs
> + * one verifier_env per bpf_check() call
> + */
> +struct verifier_env {
> + struct bpf_prog *prog;  /* eBPF program being verified */
> + struct verifier_stack_elem *head; /* stack of verifier states to be 
> processed */
> + int stack_size; /* number of states to be processed */
> + struct verifier_state cur_state; /* current verifier state */
> + struct verifier_state_list **explored_states; /* search pruning 
> optimization */
> + const struct bpf_ext_parser_ops *pops; /* external parser ops */
> + void *ppriv; /* pointer to external parser's private data */

A bit hard to review, since the move and the addition are in one patch.
I think ppriv and pops are too obscure names.
Maybe analyzer_ops and analyzer_priv?

Conceptually looks good.



Re: [RFC v3 07/22] landlock: Handle file comparisons

2016-09-14 Thread Mickaël Salaün


On 14/09/2016 23:06, Alexei Starovoitov wrote:
> On Wed, Sep 14, 2016 at 09:24:00AM +0200, Mickaël Salaün wrote:
>> Add eBPF functions to compare file system access with a Landlock file
>> system handle:
>> * bpf_landlock_cmp_fs_prop_with_struct_file(prop, map, map_op, file)
>>   This function allows to compare the dentry, inode, device or mount
>>   point of the currently accessed file, with a reference handle.
>> * bpf_landlock_cmp_fs_beneath_with_struct_file(opt, map, map_op, file)
>>   This function allows an eBPF program to check if the current accessed
>>   file is the same or in the hierarchy of a reference handle.
>>
>> The goal of file system handle is to abstract kernel objects such as a
>> struct file or a struct inode. Userland can create this kind of handle
>> thanks to the BPF_MAP_UPDATE_ELEM command. The element is a struct
>> landlock_handle containing the handle type (e.g.
>> BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD) and a file descriptor. This could
>> also be any descriptions able to match a struct file or a struct inode
>> (e.g. path or glob string).
>>
>> Changes since v2:
>> * add MNT_INTERNAL check to only add file handle from user-visible FS
>>   (e.g. no anonymous inode)
>> * replace struct file* with struct path* in map_landlock_handle
>> * add BPF protos
>> * fix bpf_landlock_cmp_fs_prop_with_struct_file()
>>
>> Signed-off-by: Mickaël Salaün 
>> Cc: Alexei Starovoitov 
>> Cc: Andy Lutomirski 
>> Cc: Daniel Borkmann 
>> Cc: David S. Miller 
>> Cc: James Morris 
>> Cc: Kees Cook 
>> Cc: Serge E. Hallyn 
>> Link: 
>> https://lkml.kernel.org/r/calcetrwwtiz3kztkegow24-dvhqq6lftwexh77fd2g5o71y...@mail.gmail.com
> 
> thanks for keeping the links to the previous discussion.
> Long term it should help, though I worry we already at the point
> where there are too many outstanding issues to resolve before we
> can proceed with reasonable code review.
> 
>> +/*
>> + * bpf_landlock_cmp_fs_prop_with_struct_file
>> + *
>> + * Cf. include/uapi/linux/bpf.h
>> + */
>> +static inline u64 bpf_landlock_cmp_fs_prop_with_struct_file(u64 r1_property,
>> +u64 r2_map, u64 r3_map_op, u64 r4_file, u64 r5)
>> +{
>> +u8 property = (u8) r1_property;
>> +struct bpf_map *map = (struct bpf_map *) (unsigned long) r2_map;
>> +enum bpf_map_array_op map_op = r3_map_op;
>> +struct file *file = (struct file *) (unsigned long) r4_file;
> 
> please use just added BPF_CALL_ macros. They will help readability of the 
> above.
> 
>> +struct bpf_array *array = container_of(map, struct bpf_array, map);
>> +struct path *p1, *p2;
>> +struct map_landlock_handle *handle;
>> +int i;
>> +
>> +/* ARG_CONST_PTR_TO_LANDLOCK_HANDLE_FS is an arraymap */
>> +if (unlikely(!map)) {
>> +WARN_ON(1);
>> +return -EFAULT;
>> +}
>> +if (unlikely(!file))
>> +return -ENOENT;
>> +if (unlikely((property | _LANDLOCK_FLAG_FS_MASK) != 
>> _LANDLOCK_FLAG_FS_MASK))
>> +return -EINVAL;
>> +
>> +/* for now, only handle OP_OR */
>> +switch (map_op) {
>> +case BPF_MAP_ARRAY_OP_OR:
>> +break;
>> +case BPF_MAP_ARRAY_OP_UNSPEC:
>> +case BPF_MAP_ARRAY_OP_AND:
>> +case BPF_MAP_ARRAY_OP_XOR:
>> +default:
>> +return -EINVAL;
>> +}
>> +p2 = &file->f_path;
>> +
>> +synchronize_rcu();
> 
> that is completely broken.
> bpf programs are executing under rcu_lock.
> please enable CONFIG_PROVE_RCU and retest everything.

Thanks for the tip. I will fix this.

> 
> I would suggest for the next RFC to do minimal 7 patches up to this point
> with simple example that demonstrates the use case.
> I would avoid all unpriv stuff and all of seccomp for the next RFC as well,
> otherwise I don't think we can realistically make forward progress, since
> there are too many issues raised in the subsequent patches.

I hope we will find a common agreement about seccomp vs cgroup… I think
both approaches have their advantages; they can be complementary and
nicely combined.

Unprivileged sandboxing is the main goal of Landlock. This should not be
a problem, even for privileged features, thanks to the new subtype/access.

> 
> The common part that is mergeable is prog's subtype extension to
> the verifier that can be used for better tracing and is the common
> piece of infra needed for both landlock and checmate LSMs
> (which must be one LSM anyway)

Agreed. With this RFC, the Checmate features (i.e. network helpers)
should be able to sit on top of Landlock.





Re: [PATCH v4 00/16] Add Paravirtual RDMA Driver

2016-09-14 Thread Jason Gunthorpe
On Wed, Sep 14, 2016 at 10:20:22PM +, Woodruff, Robert J wrote:

> >this new scheme works with >kernel.org 4.8, then it is possible
> >that it could go into that OFED-4.8 Release, but again, we are
> >still looking at the new scheme and evaluating how it affects >the
> >community OFED.
> 
> One other question.  Jason, the OFA (and its members) want to
> maintain the dual-license (BSD/GPL) for the code, as is the case for
> all the code that was in the OFA git trees on the OFA server that
> you pulled from.

The code in the OFA trees has a mixture of licenses. It is all GPLv2
compatible for sure, but there are at least three variations of the
'BSD' license. As far as I know, only the 2 clause OpenIB.org BSD
license is approved for use in OFA projects. Someone from the board
could correct me if I am wrong.

This means the 3 projects with the BSD patent clause, hosted on the
OFA servers, were already not conforming.

As is Intel's HFI1 driver, which uses a three-clause BSD license.

> package follows that licensing model for accepting any new code into
> that combined repo ?

As with the kernel we'd discourage 're-licensing' existing files.

However, since this is not a OFA project, I, personally, would not
turn away a GPLv2 compatible contribution, but I am proposing that the
'default' license for the project be OFA compatible.

I think license enforcement of its members falls to the OFA.

Doug may feel differently.

Jason


Re: [PATCHv3 net-next 04/15] bpf: don't (ab)use instructions to store state

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 08:00:12PM +0100, Jakub Kicinski wrote:
> Storing state in reserved fields of instructions makes
> it impossible to run validator on programs already
> marked as read-only. Allocate and use an array of
> per-instruction state instead.
> 
> While touching the error path rename and move existing
> jump target.
> 
> Suggested-by: Alexei Starovoitov 
> Signed-off-by: Jakub Kicinski 

Nice and clean. Thanks!
Acked-by: Alexei Starovoitov 



Re: [RFC v3 21/22] bpf,landlock: Add optional skb pointer in the Landlock context

2016-09-14 Thread Mickaël Salaün

On 14/09/2016 23:20, Alexei Starovoitov wrote:
> On Wed, Sep 14, 2016 at 09:24:14AM +0200, Mickaël Salaün wrote:
>> This is a proof of concept to expose optional values that could depend
>> of the process access rights.
>>
>> There is two dedicated flags: LANDLOCK_FLAG_ACCESS_SKB_READ and
>> LANDLOCK_FLAG_ACCESS_SKB_WRITE. Each of them can be activated to access
>> eBPF functions manipulating a skb in a read or write way.
>>
>> Signed-off-by: Mickaël Salaün 
> ...
>>  /* Handle check flags */
>>  #define LANDLOCK_FLAG_FS_DENTRY (1 << 0)
>> @@ -619,12 +621,15 @@ struct landlock_handle {
>>   * @args: LSM hook arguments, see include/linux/lsm_hooks.h for their
>>   *description and the LANDLOCK_HOOK* definitions from
>>   *security/landlock/lsm.c for their types.
>> + * @opt_skb: optional skb pointer, accessible with the
>> + *   LANDLOCK_FLAG_ACCESS_SKB_* flags for network-related hooks.
>>   */
>>  struct landlock_data {
>>  __u32 hook; /* enum landlock_hook_id */
>>  __u16 origin; /* LANDLOCK_FLAG_ORIGIN_* */
>>  __u16 cookie; /* seccomp RET_LANDLOCK */
>>  __u64 args[6];
>> +__u64 opt_skb;
>>  };
> 
> missing something here.
> This patch doesn't make use of it.
> That's something for the future?
> How that field will be populated?
> Why make it different vs the rest or args[6] ?
> 
> 

I don't use this code; its only purpose is to show how to deal with
fine-grained privileges of Landlock programs (to allow Sargun to add his
custom helpers from Checmate). However, this optional field may be part
of args[6].





Re: [RFC v3 07/22] landlock: Handle file comparisons

2016-09-14 Thread Mickaël Salaün

On 14/09/2016 21:07, Jann Horn wrote:
> On Wed, Sep 14, 2016 at 09:24:00AM +0200, Mickaël Salaün wrote:
>> Add eBPF functions to compare file system access with a Landlock file
>> system handle:
>> * bpf_landlock_cmp_fs_prop_with_struct_file(prop, map, map_op, file)
>>   This function allows to compare the dentry, inode, device or mount
>>   point of the currently accessed file, with a reference handle.
>> * bpf_landlock_cmp_fs_beneath_with_struct_file(opt, map, map_op, file)
>>   This function allows an eBPF program to check if the current accessed
>>   file is the same or in the hierarchy of a reference handle.
> [...]
>> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
>> index 94256597eacd..edaab4c87292 100644
>> --- a/kernel/bpf/arraymap.c
>> +++ b/kernel/bpf/arraymap.c
>> @@ -603,6 +605,9 @@ static void landlock_put_handle(struct 
>> map_landlock_handle *handle)
>>  enum bpf_map_handle_type handle_type = handle->type;
>>  
>>  switch (handle_type) {
>> +case BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD:
>> +path_put(&handle->path);
>> +break;
>>  case BPF_MAP_HANDLE_TYPE_UNSPEC:
>>  default:
>>  WARN_ON(1);
> [...]
>> diff --git a/security/landlock/checker_fs.c b/security/landlock/checker_fs.c
>> new file mode 100644
>> index ..39eb85dc7d18
>> --- /dev/null
>> +++ b/security/landlock/checker_fs.c
> [...]
>> +static inline u64 bpf_landlock_cmp_fs_prop_with_struct_file(u64 r1_property,
>> +u64 r2_map, u64 r3_map_op, u64 r4_file, u64 r5)
>> +{
>> +u8 property = (u8) r1_property;
>> +struct bpf_map *map = (struct bpf_map *) (unsigned long) r2_map;
>> +enum bpf_map_array_op map_op = r3_map_op;
>> +struct file *file = (struct file *) (unsigned long) r4_file;
>> +struct bpf_array *array = container_of(map, struct bpf_array, map);
>> +struct path *p1, *p2;
>> +struct map_landlock_handle *handle;
>> +int i;
> 
> Please don't use int when iterating over an array, use size_t.

OK, I will use size_t.

> 
> 
>> +/* for now, only handle OP_OR */
> 
> Is "OP_OR" an appropriate name for something that ANDs the success of
> checks?
> 
> 
> [...]
>> +synchronize_rcu();
> 
> Can you put a comment here that explains what's going on?

Hum, this should not be here.

> 
> 
>> +for (i = 0; i < array->n_entries; i++) {
>> +bool result_dentry = !(property & LANDLOCK_FLAG_FS_DENTRY);
>> +bool result_inode = !(property & LANDLOCK_FLAG_FS_INODE);
>> +bool result_device = !(property & LANDLOCK_FLAG_FS_DEVICE);
>> +bool result_mount = !(property & LANDLOCK_FLAG_FS_MOUNT);
>> +
>> +handle = (struct map_landlock_handle *)
>> +(array->value + array->elem_size * i);
>> +
>> +if (handle->type != BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD) {
>> +WARN_ON(1);
>> +return -EFAULT;
>> +}
>> +p1 = &handle->path;
>> +
>> +if (!result_dentry && p1->dentry == p2->dentry)
>> +result_dentry = true;
> 
> Why is this safe? As far as I can tell, this is not in an RCU read-side
> critical section (synchronize_rcu() was just called), and no lock has been
> taken. What prevents someone from removing the arraymap entry while we're
> looking at it? Am I missing something?

I will try to properly deal with RCU.





Re: [RFC v3 11/22] seccomp,landlock: Handle Landlock hooks per process hierarchy

2016-09-14 Thread Mickaël Salaün

On 14/09/2016 20:43, Andy Lutomirski wrote:
> On Wed, Sep 14, 2016 at 12:24 AM, Mickaël Salaün  wrote:
>> A Landlock program will be triggered according to its subtype/origin
>> bitfield. The LANDLOCK_FLAG_ORIGIN_SECCOMP value will trigger the
>> Landlock program when a seccomp filter will return RET_LANDLOCK.
>> Moreover, it is possible to return a 16-bit cookie which will be
>> readable by the Landlock programs in its context.
> 
> Are you envisioning that the filters will return RET_LANDLOCK most of
> the time or rarely?  If it's most of the time, then maybe this could
> be simplified a bit by unconditionally calling the landlock filter and
> letting the landlock filter access a struct seccomp_data if needed.

Exposing seccomp_data in a Landlock context may be a good idea. The main
implication is that Landlock programs may then be architecture-specific
(if dealing with data), as seccomp filters are. Another point is that it
removes any direct binding between seccomp filters and Landlock programs.
I will try this (simpler) approach.

> 
>>
>> Only seccomp filters loaded from the same thread and before a Landlock
>> program can trigger it through LANDLOCK_FLAG_ORIGIN_SECCOMP. Multiple
>> Landlock programs can be triggered by one or more seccomp filters. This
>> way, each RET_LANDLOCK (with specific cookie) will trigger all the
>> allowed Landlock programs once.
> 
> This interface seems somewhat awkward.  Should we not have a way to
> atomically install a whole pile of landlock filters and the associated
> seccomp filter all at once?

I can change the seccomp(2) use in this way: instead of loading a
Landlock program, (atomically) load an array of Landlock programs.

However, exposing seccomp_data to Landlock programs looks like a better
way to deal with it. This removes the need to manage an array of Landlock
programs.

 Mickaël





RE: [PATCH v4 00/16] Add Paravirtual RDMA Driver

2016-09-14 Thread Woodruff, Robert J
Woody Wrote,
>We are still discussing this within the OFA EWG for the OFED releases. I 
>suspect that if all of the maintainers of the user-space packages agree to 
>merge into >and support this new merged repo-model, then OFED would eventually 
>base their OFED user-space packages on that repo, rather than the individual 
>repo's >that are used today. The question is, when. Work has now just started 
>on OFED-4.8, that is based on the kernel.org 4.8 kernel. If this new scheme 
>works with >kernel.org 4.8, then it is possible that it could go into that 
>OFED-4.8 Release, but again, we are still looking at the new scheme and 
>evaluating how it affects >the community OFED.

One other question.  Jason, the OFA (and its members) want to maintain the 
dual-license (BSD/GPL) for the code, as is the case for all the code that was 
in the OFA git trees on the OFA server that you pulled from. Will you guys be 
ensuring that this new user-space package follows that licensing model for
accepting any new code into that combined repo?


Re: [PATCH 3/3] mm: memcontrol: consolidate cgroup socket tracking

2016-09-14 Thread Andrew Morton
On Thu, 15 Sep 2016 13:34:24 +0800 kbuild test robot <l...@intel.com> wrote:

> Hi Johannes,
> 
> [auto build test ERROR on net/master]
> [also build test ERROR on v4.8-rc6 next-20160914]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> [Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for 
> convenience) to record what (public, well-known) commit your patch series was 
> built on]
> [Check https://git-scm.com/docs/git-format-patch for more information]
> 
> url:
> https://github.com/0day-ci/linux/commits/Johannes-Weiner/mm-memcontrol-make-per-cpu-charge-cache-IRQ-safe-for-socket-accounting/20160915-035634
> config: m68k-sun3_defconfig (attached as .config)
> compiler: m68k-linux-gcc (GCC) 4.9.0
> reproduce:
> wget 
> https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
>  -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=m68k 
> 
> All errors (new ones prefixed by >>):
> 
>net/built-in.o: In function `sk_alloc':
> >> (.text+0x4076): undefined reference to `mem_cgroup_sk_alloc'
>net/built-in.o: In function `__sk_destruct':
> >> sock.c:(.text+0x457e): undefined reference to `mem_cgroup_sk_free'
>net/built-in.o: In function `sk_clone_lock':
>(.text+0x4f1c): undefined reference to `mem_cgroup_sk_alloc'

This?

--- a/mm/memcontrol.c~mm-memcontrol-consolidate-cgroup-socket-tracking-fix
+++ a/mm/memcontrol.c
@@ -5655,9 +5655,6 @@ void mem_cgroup_sk_alloc(struct sock *sk
 {
struct mem_cgroup *memcg;
 
-   if (!mem_cgroup_sockets_enabled)
-   return;
-
/*
 * Socket cloning can throw us here with sk_memcg already
 * filled. It won't however, necessarily happen from
--- a/net/core/sock.c~mm-memcontrol-consolidate-cgroup-socket-tracking-fix
+++ a/net/core/sock.c
@@ -1385,7 +1385,8 @@ static void sk_prot_free(struct proto *p
slab = prot->slab;
 
	cgroup_sk_free(&sk->sk_cgrp_data);
-   mem_cgroup_sk_free(sk);
+   if (mem_cgroup_sockets_enabled)
+   mem_cgroup_sk_free(sk);
security_sk_free(sk);
if (slab != NULL)
kmem_cache_free(slab, sk);
@@ -1422,7 +1423,8 @@ struct sock *sk_alloc(struct net *net, i
sock_net_set(sk, net);
	atomic_set(&sk->sk_wmem_alloc, 1);
 
-   mem_cgroup_sk_alloc(sk);
+   if (mem_cgroup_sockets_enabled)
+   mem_cgroup_sk_alloc(sk);
	cgroup_sk_alloc(&sk->sk_cgrp_data);
	sock_update_classid(&sk->sk_cgrp_data);
	sock_update_netprioidx(&sk->sk_cgrp_data);
@@ -1569,7 +1571,8 @@ struct sock *sk_clone_lock(const struct
newsk->sk_incoming_cpu = raw_smp_processor_id();
	atomic64_set(&newsk->sk_cookie, 0);
 
-   mem_cgroup_sk_alloc(newsk);
+   if (mem_cgroup_sockets_enabled)
+   mem_cgroup_sk_alloc(newsk);
	cgroup_sk_alloc(&newsk->sk_cgrp_data);
 
/*
_



Re: [RFC v3 19/22] landlock: Add interrupted origin

2016-09-14 Thread Mickaël Salaün

On 14/09/2016 20:29, Andy Lutomirski wrote:
> On Wed, Sep 14, 2016 at 12:24 AM, Mickaël Salaün  wrote:
>> This third origin of hook call should cover all possible trigger paths
>> (e.g. page fault). Landlock eBPF programs can then take decisions
>> accordingly.
>>
>> Signed-off-by: Mickaël Salaün 
>> Cc: Alexei Starovoitov 
>> Cc: Andy Lutomirski 
>> Cc: Daniel Borkmann 
>> Cc: Kees Cook 
>> ---
> 
> 
>>
>> +   if (unlikely(in_interrupt())) {
> 
> IMO security hooks have no business being called from interrupts.
> Aren't they all synchronous things done by tasks?  Interrupts are
> driver things.
> 
> Are you trying to check for page faults and such?

Yes, that was the idea you did put in my mind. Not sure how to deal with
this.





Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Mickaël Salaün

On 14/09/2016 20:27, Andy Lutomirski wrote:
> On Wed, Sep 14, 2016 at 12:24 AM, Mickaël Salaün  wrote:
>> Add a new flag CGRP_NO_NEW_PRIVS for each cgroup. This flag is initially
>> set for all cgroups except the root. The flag is cleared when a new process
>> without the no_new_privs flag is attached to the cgroup.
>>
>> If a cgroup is landlocked, then any new attempt, from an unprivileged
>> process, to attach a process without no_new_privs to this cgroup will
>> be denied.
> 
> Until and unless everyone can agree on a way to properly namespace,
> delegate, etc cgroups, I think that trying to add unprivileged
> semantics to cgroups is nuts.  Given the big thread about cgroup v2,
> no-internal-tasks, etc, I just don't see how this approach can be
> viable.

As far as I can tell, the no_new_privs flag of a task is not related to
namespaces. The CGRP_NO_NEW_PRIVS flag is only a cache to quickly access
the no_new_privs property of *tasks* in a cgroup. The semantics are unchanged.

Using cgroup is optional, any task could use the seccomp-based
landlocking instead. However, for those that want/need to manage a
security policy in a more dynamic way, using cgroups may make sense.

I thought cgroup delegation was OK in v2, isn't that the case? Do you
have some links?

> 
> Can we try to make landlock work completely independently of cgroups
> so that it doesn't get stuck and so that programs can use it without
> worrying about cgroup v1 vs v2, interactions with cgroup managers,
> cgroup managers that (supposedly?) will start migrating processes
> around piecemeal and almost certainly blowing up landlock in the
> process, etc?

This RFC handles both cgroup and seccomp approaches in a similar way. I
don't see why building on top of cgroup v2 is a problem. Are there
security issues with delegation?

> 
> I have no problem with looking at prototypes for how landlock +
> cgroups would work, but I can't imagine the result being mergeable.
> 





Re: [RFC v3 17/22] cgroup: Add access check for cgroup_get_from_fd()

2016-09-14 Thread Mickaël Salaün

On 14/09/2016 09:24, Mickaël Salaün wrote:
> Add security access check for cgroup backed FD. The "cgroup.procs" file
> of the corresponding cgroup must be readable to identify the cgroup, and
> writable to prove that the current process can manage this cgroup (e.g.
> through delegation). This is similar to the check done by
> cgroup_procs_write_permission().
> 
> Signed-off-by: Mickaël Salaün 
> Cc: Alexei Starovoitov 
> Cc: Andy Lutomirski 
> Cc: Daniel Borkmann 
> Cc: Daniel Mack 
> Cc: David S. Miller 
> Cc: Kees Cook 
> Cc: Tejun Heo 
> ---
>  include/linux/cgroup.h |  2 +-
>  kernel/bpf/arraymap.c  |  2 +-
>  kernel/bpf/syscall.c   |  6 +++---
>  kernel/cgroup.c| 16 +++-
>  4 files changed, 20 insertions(+), 6 deletions(-)
...
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 48b650a640a9..3bbaf3f02ed2 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -6241,17 +6241,20 @@ EXPORT_SYMBOL_GPL(cgroup_get_from_path);
>  /**
>   * cgroup_get_from_fd - get a cgroup pointer from a fd
>   * @fd: fd obtained by open(cgroup2_dir)
> + * @access_mask: contains the permission mask
>   *
>   * Find the cgroup from a fd which should be obtained
>   * by opening a cgroup directory.  Returns a pointer to the
>   * cgroup on success. ERR_PTR is returned if the cgroup
>   * cannot be found.
>   */
> -struct cgroup *cgroup_get_from_fd(int fd)
> +struct cgroup *cgroup_get_from_fd(int fd, int access_mask)
>  {
>   struct cgroup_subsys_state *css;
>   struct cgroup *cgrp;
>   struct file *f;
> + struct inode *inode;
> + int ret;
>  
>   f = fget_raw(fd);
>   if (!f)
> @@ -6268,6 +6271,17 @@ struct cgroup *cgroup_get_from_fd(int fd)
>   return ERR_PTR(-EBADF);
>   }
>  
> + ret = -ENOMEM;
> + inode = kernfs_get_inode(f->f_path.dentry->d_sb, cgrp->procs_file.kn);

I forgot to properly move fput(f) after this line… This will be fixed.





RE: [PATCH v4 00/16] Add Paravirtual RDMA Driver

2016-09-14 Thread Woodruff, Robert J
Adit wrote,
>Thanks. So does this mean that the libraries distributed via OFED 
>(openfabrics.org) will be now from the rdma-plumbing git tree?
>Or is the switch to happen only when distros start shipping with the 4.9 
>kernel by default?

We are still discussing this within the OFA EWG for the OFED releases. I 
suspect that if all of the maintainers of the user-space 
packages agree to merge into and support this new merged repo-model, then OFED 
would eventually base their OFED user-space packages
on that repo, rather than the individual repos that are used today. The 
question is when. Work has now just started on OFED-4.8,
which is based on the kernel.org 4.8 kernel. If this new scheme works with 
kernel.org 4.8, then it is possible that it could go into that OFED-4.8
Release, but again, we are still looking at the new scheme and evaluating how 
it affects the community OFED.

Woody



[RFC 1/3] skge: Change LED_OFF to LED_REG_OFF in Marvell skge driver to avoid conflicts with leds namespace

2016-09-14 Thread Zach Brown
Adding led support for phy causes namespace conflicts for some
phy drivers.

The Marvell skge driver declared an enum representing the states of the
Link LED Register. The enum contained the constant LED_OFF, which conflicted
with the declaration found in linux/leds.h.
LED_OFF is changed to LED_REG_OFF.

Signed-off-by: Zach Brown 
---
 drivers/net/ethernet/marvell/skge.c | 4 ++--
 drivers/net/ethernet/marvell/skge.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/marvell/skge.c 
b/drivers/net/ethernet/marvell/skge.c
index 7173836..ff5a087 100644
--- a/drivers/net/ethernet/marvell/skge.c
+++ b/drivers/net/ethernet/marvell/skge.c
@@ -1062,7 +1062,7 @@ static void skge_link_up(struct skge_port *skge)
 
 static void skge_link_down(struct skge_port *skge)
 {
-   skge_write8(skge->hw, SK_REG(skge->port, LNK_LED_REG), LED_OFF);
+   skge_write8(skge->hw, SK_REG(skge->port, LNK_LED_REG), LED_REG_OFF);
netif_carrier_off(skge->netdev);
netif_stop_queue(skge->netdev);
 
@@ -2668,7 +2668,7 @@ static int skge_down(struct net_device *dev)
if (hw->ports == 1)
free_irq(hw->pdev->irq, hw);
 
-   skge_write8(skge->hw, SK_REG(skge->port, LNK_LED_REG), LED_OFF);
+   skge_write8(skge->hw, SK_REG(skge->port, LNK_LED_REG), LED_REG_OFF);
if (is_genesis(hw))
genesis_stop(skge);
else
diff --git a/drivers/net/ethernet/marvell/skge.h 
b/drivers/net/ethernet/marvell/skge.h
index a2eb341..ec054d3 100644
--- a/drivers/net/ethernet/marvell/skge.h
+++ b/drivers/net/ethernet/marvell/skge.h
@@ -663,7 +663,7 @@ enum {
LED_SYNC_ON = 1<<3, /* Use Sync Wire to switch LED */
LED_SYNC_OFF= 1<<2, /* Disable Sync Wire Input */
LED_ON  = 1<<1, /* switch LED on */
-   LED_OFF = 1<<0, /* switch LED off */
+   LED_REG_OFF = 1<<0, /* switch LED off */
 };
 
 /* Receive GMAC FIFO (YUKON) */
-- 
2.7.4



[RFC 2/3] staging: rtl8712: Change _LED_STATE enum in rtl871x driver to avoid conflicts with LED namespace

2016-09-14 Thread Zach Brown
Adding led support for phy causes namespace conflicts for some
phy drivers.

The rtl871x driver declared an enum for representing LED states. The enum
contains the constant LED_OFF, which conflicted with the declaration found in
linux/leds.h. LED_OFF is changed to LED_STATE_OFF.

Signed-off-by: Zach Brown 
---
 drivers/staging/rtl8712/rtl8712_led.c | 192 +-
 1 file changed, 96 insertions(+), 96 deletions(-)

diff --git a/drivers/staging/rtl8712/rtl8712_led.c 
b/drivers/staging/rtl8712/rtl8712_led.c
index 9055827..d53ad76 100644
--- a/drivers/staging/rtl8712/rtl8712_led.c
+++ b/drivers/staging/rtl8712/rtl8712_led.c
@@ -52,7 +52,7 @@
 enum _LED_STATE_871x {
LED_UNKNOWN = 0,
LED_ON = 1,
-   LED_OFF = 2,
+   LED_STATE_OFF = 2,
LED_BLINK_NORMAL = 3,
LED_BLINK_SLOWLY = 4,
LED_POWER_ON_BLINK = 5,
@@ -92,7 +92,7 @@ static void InitLed871x(struct _adapter *padapter, struct 
LED_871x *pLed,
nic = padapter->pnetdev;
pLed->padapter = padapter;
pLed->LedPin = LedPin;
-   pLed->CurrLedState = LED_OFF;
+   pLed->CurrLedState = LED_STATE_OFF;
pLed->bLedOn = false;
pLed->bLedBlinkInProgress = false;
pLed->BlinkTimes = 0;
@@ -249,7 +249,7 @@ static void SwLedBlink(struct LED_871x *pLed)
} else {
/* Assign LED state to toggle. */
if (pLed->BlinkingLedState == LED_ON)
-   pLed->BlinkingLedState = LED_OFF;
+   pLed->BlinkingLedState = LED_STATE_OFF;
else
pLed->BlinkingLedState = LED_ON;
 
@@ -312,7 +312,7 @@ static void SwLedBlink1(struct LED_871x *pLed)
switch (pLed->CurrLedState) {
case LED_BLINK_SLOWLY:
if (pLed->bLedOn)
-   pLed->BlinkingLedState = LED_OFF;
+   pLed->BlinkingLedState = LED_STATE_OFF;
else
pLed->BlinkingLedState = LED_ON;
mod_timer(&pLed->BlinkTimer, jiffies +
@@ -320,7 +320,7 @@ static void SwLedBlink1(struct LED_871x *pLed)
break;
case LED_BLINK_NORMAL:
if (pLed->bLedOn)
-   pLed->BlinkingLedState = LED_OFF;
+   pLed->BlinkingLedState = LED_STATE_OFF;
else
pLed->BlinkingLedState = LED_ON;
mod_timer(&pLed->BlinkTimer, jiffies +
@@ -335,7 +335,7 @@ static void SwLedBlink1(struct LED_871x *pLed)
pLed->bLedLinkBlinkInProgress = true;
pLed->CurrLedState = LED_BLINK_NORMAL;
if (pLed->bLedOn)
-   pLed->BlinkingLedState = LED_OFF;
+   pLed->BlinkingLedState = LED_STATE_OFF;
else
pLed->BlinkingLedState = LED_ON;
mod_timer(&pLed->BlinkTimer, jiffies +
@@ -344,7 +344,7 @@ static void SwLedBlink1(struct LED_871x *pLed)
pLed->bLedNoLinkBlinkInProgress = true;
pLed->CurrLedState = LED_BLINK_SLOWLY;
if (pLed->bLedOn)
-   pLed->BlinkingLedState = LED_OFF;
+   pLed->BlinkingLedState = LED_STATE_OFF;
else
pLed->BlinkingLedState = LED_ON;
mod_timer(&pLed->BlinkTimer, jiffies +
@@ -353,7 +353,7 @@ static void SwLedBlink1(struct LED_871x *pLed)
pLed->bLedScanBlinkInProgress = false;
} else {
 if (pLed->bLedOn)
-   pLed->BlinkingLedState = LED_OFF;
+   pLed->BlinkingLedState = LED_STATE_OFF;
else
pLed->BlinkingLedState = LED_ON;
mod_timer(&pLed->BlinkTimer, jiffies +
@@ -369,7 +369,7 @@ static void SwLedBlink1(struct LED_871x *pLed)
pLed->bLedLinkBlinkInProgress = true;
pLed->CurrLedState = LED_BLINK_NORMAL;
if (pLed->bLedOn)
-   pLed->BlinkingLedState = LED_OFF;
+   pLed->BlinkingLedState = LED_STATE_OFF;
else
pLed->BlinkingLedState = LED_ON;
mod_timer(&pLed->BlinkTimer, jiffies +
@@ -378,7 +378,7 @@ static void SwLedBlink1(struct LED_871x *pLed)
pLed->bLedNoLinkBlinkInProgress = true;
pLed->CurrLedState = LED_BLINK_SLOWLY;
if (pLed->bLedOn)
-

[RFC 3/3] phy,leds: add support for led triggers on phy link state change

2016-09-14 Thread Zach Brown
From: Josh Cartwright 

Create an option CONFIG_LED_TRIGGER_PHY (default n), which will
create a set of led triggers for each instantiated PHY device.  There is
one LED trigger per link-speed, per-phy.

This allows for a user to configure their system to allow a set of LEDs
to represent link state changes on the phy.

Signed-off-by: Josh Cartwright 
Signed-off-by: Nathan Sullivan 
Signed-off-by: Zach Brown 
---
 drivers/net/phy/Kconfig|  12 
 drivers/net/phy/Makefile   |   1 +
 drivers/net/phy/phy.c  |   8 +++
 drivers/net/phy/phy_device.c   |   4 ++
 drivers/net/phy/phy_led_triggers.c | 109 +
 include/linux/phy.h|   9 +++
 include/linux/phy_led_triggers.h   |  42 ++
 7 files changed, 185 insertions(+)
 create mode 100644 drivers/net/phy/phy_led_triggers.c
 create mode 100644 include/linux/phy_led_triggers.h

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 1c3e07c..4aeaba4 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -15,6 +15,18 @@ if PHYLIB
 config SWPHY
bool
 
+config LED_TRIGGER_PHY
+   bool "Support LED triggers for tracking link state"
+   depends on LEDS_TRIGGERS
+   ---help---
+ Adds support for a set of LED trigger events per-PHY.  Link
+ state change will trigger the events, for consumption by an
+ LED class driver.  There are triggers for each link speed,
+ and are of the form:
+  ::
+
+ Where speed is one of: 10Mb, 100Mb, Gb, 2.5Gb, or 10GbE.
+
 comment "MDIO bus device drivers"
 
 config MDIO_BCM_IPROC
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index e58667d..86d12cd 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -2,6 +2,7 @@
 
 libphy-y   := phy.o phy_device.o mdio_bus.o mdio_device.o
 libphy-$(CONFIG_SWPHY) += swphy.o
+libphy-$(CONFIG_LED_TRIGGER_PHY)   += phy_led_triggers.o
 
 obj-$(CONFIG_PHYLIB)   += libphy.o
 
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index c6f6683..3345767 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -936,6 +936,7 @@ void phy_state_machine(struct work_struct *work)
phydev->state = PHY_NOLINK;
netif_carrier_off(phydev->attached_dev);
phydev->adjust_link(phydev->attached_dev);
+   phy_led_trigger_change_speed(phydev);
break;
}
 
@@ -949,6 +950,7 @@ void phy_state_machine(struct work_struct *work)
phydev->state = PHY_RUNNING;
netif_carrier_on(phydev->attached_dev);
phydev->adjust_link(phydev->attached_dev);
+   phy_led_trigger_change_speed(phydev);
 
} else if (0 == phydev->link_timeout--)
needs_aneg = true;
@@ -976,6 +978,7 @@ void phy_state_machine(struct work_struct *work)
phydev->state = PHY_RUNNING;
netif_carrier_on(phydev->attached_dev);
phydev->adjust_link(phydev->attached_dev);
+   phy_led_trigger_change_speed(phydev);
}
break;
case PHY_FORCING:
@@ -992,6 +995,7 @@ void phy_state_machine(struct work_struct *work)
}
 
phydev->adjust_link(phydev->attached_dev);
+   phy_led_trigger_change_speed(phydev);
break;
case PHY_RUNNING:
/* Only register a CHANGE if we are polling and link changed
@@ -1021,6 +1025,7 @@ void phy_state_machine(struct work_struct *work)
}
 
phydev->adjust_link(phydev->attached_dev);
+   phy_led_trigger_change_speed(phydev);
 
if (phy_interrupt_is_valid(phydev))
err = phy_config_interrupt(phydev,
@@ -1031,6 +1036,7 @@ void phy_state_machine(struct work_struct *work)
phydev->link = 0;
netif_carrier_off(phydev->attached_dev);
phydev->adjust_link(phydev->attached_dev);
+   phy_led_trigger_change_speed(phydev);
do_suspend = true;
}
break;
@@ -1055,6 +1061,7 @@ void phy_state_machine(struct work_struct *work)
phydev->state = PHY_NOLINK;
}
phydev->adjust_link(phydev->attached_dev);
+   phy_led_trigger_change_speed(phydev);
} else {
phydev->state = PHY_AN;
phydev->link_timeout = PHY_AN_TIMEOUT;
@@ -1071,6 

[RFC 0/3] Add support for led triggers on phy link state change

2016-09-14 Thread Zach Brown
Some drivers that include phy.h defined LED_OFF, which conflicts with the
definition in leds.h. PHY LED support uses leds.h, so the two namespaces are no
longer isolated.
The first two patches fix the two net drivers that declared enum constants that
conflict with enum constants in linux/leds.h.

The final patch adds support for led triggers on phy link state changes by
adding a config option. When set the config option will create a set of led
triggers for each phy device. Users can use the led triggers to represent link
state changes on the phy.
The changes assume that there are 5 speed options:
10Mb, 100Mb, 1Gb, 2.5Gb, 10Gb.
The assumption makes mapping a phy_device's current speed to a trigger easy,
but means there are triggers made that aren't used if the phy doesn't support
the corresponding speeds.
Thoughts on how to better manage the created triggers would be appreciated,
if it is important to do so.

Josh Cartwright (1):
  phy,leds: add support for led triggers on phy link state change

Zach Brown (2):
  skge: Change LED_OFF to LED_REG_OFF in Marvell skge driver to avoid
conflicts with leds namespace
  staging: rtl8712: Change _LED_STATE enum in rtl871x driver to avoid
conflicts with LED namespace

 drivers/net/ethernet/marvell/skge.c   |   4 +-
 drivers/net/ethernet/marvell/skge.h   |   2 +-
 drivers/net/phy/Kconfig   |  12 +++
 drivers/net/phy/Makefile  |   1 +
 drivers/net/phy/phy.c |   8 ++
 drivers/net/phy/phy_device.c  |   4 +
 drivers/net/phy/phy_led_triggers.c| 109 +++
 drivers/staging/rtl8712/rtl8712_led.c | 192 +-
 include/linux/phy.h   |   9 ++
 include/linux/phy_led_triggers.h  |  42 
 10 files changed, 284 insertions(+), 99 deletions(-)
 create mode 100644 drivers/net/phy/phy_led_triggers.c
 create mode 100644 include/linux/phy_led_triggers.h

--
2.7.4



Re: [PATCH 3/3] mm: memcontrol: consolidate cgroup socket tracking

2016-09-14 Thread kbuild test robot
Hi Johannes,

[auto build test ERROR on net/master]
[also build test ERROR on v4.8-rc6 next-20160914]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]
[Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for 
convenience) to record what (public, well-known) commit your patch series was 
built on]
[Check https://git-scm.com/docs/git-format-patch for more information]

url:
https://github.com/0day-ci/linux/commits/Johannes-Weiner/mm-memcontrol-make-per-cpu-charge-cache-IRQ-safe-for-socket-accounting/20160915-035634
config: m68k-sun3_defconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 4.9.0
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=m68k 

All errors (new ones prefixed by >>):

   net/built-in.o: In function `sk_alloc':
>> (.text+0x4076): undefined reference to `mem_cgroup_sk_alloc'
   net/built-in.o: In function `__sk_destruct':
>> sock.c:(.text+0x457e): undefined reference to `mem_cgroup_sk_free'
   net/built-in.o: In function `sk_clone_lock':
   (.text+0x4f1c): undefined reference to `mem_cgroup_sk_alloc'

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: [RFC v3 22/22] samples/landlock: Add sandbox example

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 09:24:15AM +0200, Mickaël Salaün wrote:
> Add a basic sandbox tool to create a process isolated from some part of
> the system. This can depend of the current cgroup.
> 
> Example with the current process hierarchy (seccomp):
> 
>   $ ls /home
>   user1
>   $ LANDLOCK_ALLOWED='/bin:/lib:/usr:/tmp:/proc/self/fd/0' \
>   ./samples/landlock/sandbox /bin/sh -i
>   Launching a new sandboxed process.
>   $ ls /home
>   ls: cannot open directory '/home': Permission denied
> 
> Example with a cgroup:
> 
>   $ mkdir /sys/fs/cgroup/sandboxed
>   $ ls /home
>   user1
>   $ LANDLOCK_CGROUPS='/sys/fs/cgroup/sandboxed' \
>   LANDLOCK_ALLOWED='/bin:/lib:/usr:/tmp:/proc/self/fd/0' \
>   ./samples/landlock/sandbox
>   Ready to sandbox with cgroups.
>   $ ls /home
>   user1
>   $ echo $$ > /sys/fs/cgroup/sandboxed/cgroup.procs
>   $ ls /home
>   ls: cannot open directory '/home': Permission denied
> 
> Changes since v2:
> * use BPF_PROG_ATTACH for cgroup handling
> 
> Signed-off-by: Mickaël Salaün 
...
> + struct bpf_insn hook_path[] = {
> + /* specify an option, if any */
> + BPF_MOV32_IMM(BPF_REG_1, 0),
> + /* handles to compare with */
> + BPF_LD_MAP_FD(BPF_REG_2, map_fs),
> + BPF_MOV64_IMM(BPF_REG_3, BPF_MAP_ARRAY_OP_OR),
> + /* hook argument (struct file) */
> + BPF_LDX_MEM(BPF_DW, BPF_REG_4, BPF_REG_6, offsetof(struct
> + landlock_data, args[0])),
> + /* checker function */
> + BPF_EMIT_CALL(BPF_FUNC_landlock_cmp_fs_beneath_with_struct_file),

the example is sweet!
Since only that helper is used, could you skip the other one
from the patches (for now) ?



Re: [RFC v3 21/22] bpf,landlock: Add optional skb pointer in the Landlock context

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 09:24:14AM +0200, Mickaël Salaün wrote:
> This is a proof of concept to expose optional values that could depend
> of the process access rights.
> 
> There is two dedicated flags: LANDLOCK_FLAG_ACCESS_SKB_READ and
> LANDLOCK_FLAG_ACCESS_SKB_WRITE. Each of them can be activated to access
> eBPF functions manipulating a skb in a read or write way.
> 
> Signed-off-by: Mickaël Salaün 
...
>  /* Handle check flags */
>  #define LANDLOCK_FLAG_FS_DENTRY  (1 << 0)
> @@ -619,12 +621,15 @@ struct landlock_handle {
>   * @args: LSM hook arguments, see include/linux/lsm_hooks.h for there
>   *description and the LANDLOCK_HOOK* definitions from
>   *security/landlock/lsm.c for their types.
> + * @opt_skb: optional skb pointer, accessible with the
> + *   LANDLOCK_FLAG_ACCESS_SKB_* flags for network-related hooks.
>   */
>  struct landlock_data {
>   __u32 hook; /* enum landlock_hook_id */
>   __u16 origin; /* LANDLOCK_FLAG_ORIGIN_* */
>   __u16 cookie; /* seccomp RET_LANDLOCK */
>   __u64 args[6];
> + __u64 opt_skb;
>  };

missing something here.
This patch doesn't make use of it.
Is that something for the future?
How will that field be populated?
Why make it different vs the rest of args[6]?



Re: [RFC v3 14/22] bpf/cgroup: Make cgroup_bpf_update() return an error code

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 09:24:07AM +0200, Mickaël Salaün wrote:
> This will be useful to support Landlock for the next commits.
> 
> Signed-off-by: Mickaël Salaün 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> Cc: Daniel Mack 
> Cc: David S. Miller 
> Cc: Tejun Heo 

I think this is good to do regardless. Sooner or later cgroup_bpf_update
will start rejecting the prog attach, like we discussed, with a flag
that would disallow processes lower in the cgroup hierarchy from installing
their own bpf programs.
It will be a minimal change to bpf_prog_attach() error handling,
but will greatly help Mickael to build stuff on top.
DanielM, can you refactor your patch to do that from the start?

Thanks!

> ---
>  include/linux/bpf-cgroup.h |  4 ++--
>  kernel/bpf/cgroup.c|  3 ++-
>  kernel/bpf/syscall.c   | 10 ++
>  kernel/cgroup.c|  6 --
>  4 files changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> index 2234042d7f61..6cca7924ee17 100644
> --- a/include/linux/bpf-cgroup.h
> +++ b/include/linux/bpf-cgroup.h
> @@ -31,13 +31,13 @@ struct cgroup_bpf {
>  void cgroup_bpf_put(struct cgroup *cgrp);
>  void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
>  
> -void __cgroup_bpf_update(struct cgroup *cgrp,
> +int __cgroup_bpf_update(struct cgroup *cgrp,
>struct cgroup *parent,
>struct bpf_prog *prog,
>enum bpf_attach_type type);
>  
>  /* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
> -void cgroup_bpf_update(struct cgroup *cgrp,
> +int cgroup_bpf_update(struct cgroup *cgrp,
>  struct bpf_prog *prog,
>  enum bpf_attach_type type);
>  
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 782878ec4f2d..7b75fa692617 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -83,7 +83,7 @@ void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup 
> *parent)
>   *
>   * Must be called with cgroup_mutex held.
>   */
> -void __cgroup_bpf_update(struct cgroup *cgrp,
> +int __cgroup_bpf_update(struct cgroup *cgrp,
>struct cgroup *parent,
>struct bpf_prog *prog,
>enum bpf_attach_type type)
> @@ -117,6 +117,7 @@ void __cgroup_bpf_update(struct cgroup *cgrp,
>   bpf_prog_put(old_pinned.prog);
>   static_branch_dec(&cgroup_bpf_enabled_key);
>   }
> + return 0;
>  }
>  
>  /**
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 45a91d59..c978f2d9a1b3 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -831,6 +831,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>  {
>   struct bpf_prog *prog;
>   struct cgroup *cgrp;
> + int result;
>  
>   if (!capable(CAP_NET_ADMIN))
>   return -EPERM;
> @@ -858,10 +859,10 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>   return PTR_ERR(cgrp);
>   }
>  
> - cgroup_bpf_update(cgrp, prog, attr->attach_type);
> + result = cgroup_bpf_update(cgrp, prog, attr->attach_type);
>   cgroup_put(cgrp);
>  
> - return 0;
> + return result;
>  }
>  
>  #define BPF_PROG_DETACH_LAST_FIELD attach_type
> @@ -869,6 +870,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>  static int bpf_prog_detach(const union bpf_attr *attr)
>  {
>   struct cgroup *cgrp;
> + int result = 0;
>  
>   if (!capable(CAP_NET_ADMIN))
>   return -EPERM;
> @@ -883,7 +885,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>   if (IS_ERR(cgrp))
>   return PTR_ERR(cgrp);
>  
> - cgroup_bpf_update(cgrp, NULL, attr->attach_type);
> + result = cgroup_bpf_update(cgrp, NULL, attr->attach_type);
>   cgroup_put(cgrp);
>   break;
>  
> @@ -891,7 +893,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>   return -EINVAL;
>   }
>  
> - return 0;
> + return result;
>  }
>  #endif /* CONFIG_CGROUP_BPF */
>  
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 87324ce481b1..48b650a640a9 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -6450,15 +6450,17 @@ static __init int cgroup_namespaces_init(void)
>  subsys_initcall(cgroup_namespaces_init);
>  
>  #ifdef CONFIG_CGROUP_BPF
> -void cgroup_bpf_update(struct cgroup *cgrp,
> +int cgroup_bpf_update(struct cgroup *cgrp,
>  struct bpf_prog *prog,
>  enum bpf_attach_type type)
>  {
>   struct cgroup *parent = cgroup_parent(cgrp);
> + int result;
>  
>   mutex_lock(&cgroup_mutex);
> - __cgroup_bpf_update(cgrp, parent, prog, type);
> + result = __cgroup_bpf_update(cgrp, parent, prog, type);
>   

Re: [PATCH 7/9] net: ethernet: ti: cpts: calc mult and shift from refclk freq

2016-09-14 Thread Grygorii Strashko

On 09/15/2016 12:03 AM, Richard Cochran wrote:

On Wed, Sep 14, 2016 at 11:47:46PM +0300, Grygorii Strashko wrote:

As I understand (and tested), clocks_calc_mult_shift() will
return max possible mult which can be used without overflow.


Yes, BUT the returned value depends on the @maxsec input.  As the
kerneldoc says,

 * Larger ranges may reduce the conversion accuracy by choosing smaller
 * mult and shift factors.

In addition to that, frequency tuning works by calculating a percentage of
'mult', and if 'mult' is small, then the tuning resolution is poor.

So we don't want @maxsec as large as possible, but as small as
possible.



Ok. So, will it work if maxsec is ~= (maxsec / 2) + 1?
The + 1 is to cover potential delays of the overflow-check work execution.

[...]


--
regards,
-grygorii


Re: [PATCH 3/3] mm: memcontrol: consolidate cgroup socket tracking

2016-09-14 Thread kbuild test robot
Hi Johannes,

[auto build test ERROR on net/master]
[also build test ERROR on v4.8-rc6 next-20160914]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]
[Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for 
convenience) to record what (public, well-known) commit your patch series was 
built on]
[Check https://git-scm.com/docs/git-format-patch for more information]

url:
https://github.com/0day-ci/linux/commits/Johannes-Weiner/mm-memcontrol-make-per-cpu-charge-cache-IRQ-safe-for-socket-accounting/20160915-035634
config: parisc-c3000_defconfig (attached as .config)
compiler: hppa-linux-gnu-gcc (Debian 5.4.0-6) 5.4.0 20160609
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=parisc 

All errors (new ones prefixed by >>):

   net/built-in.o: In function `sk_alloc':
>> (.text.sk_alloc+0x88): undefined reference to `mem_cgroup_sk_alloc'
   net/built-in.o: In function `__sk_destruct':
>> net/core/sock.o:(.text.__sk_destruct+0xbc): undefined reference to 
>> `mem_cgroup_sk_free'
   net/built-in.o: In function `sk_clone_lock':
>> (.text.sk_clone_lock+0x164): undefined reference to `mem_cgroup_sk_alloc'

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [PATCH 3/3] mm: memcontrol: consolidate cgroup socket tracking

2016-09-14 Thread kbuild test robot
Hi Johannes,

[auto build test ERROR on net/master]
[also build test ERROR on v4.8-rc6 next-20160914]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]
[Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for 
convenience) to record what (public, well-known) commit your patch series was 
built on]
[Check https://git-scm.com/docs/git-format-patch for more information]

url:
https://github.com/0day-ci/linux/commits/Johannes-Weiner/mm-memcontrol-make-per-cpu-charge-cache-IRQ-safe-for-socket-accounting/20160915-035634
config: parisc-b180_defconfig (attached as .config)
compiler: hppa-linux-gnu-gcc (Debian 5.4.0-6) 5.4.0 20160609
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=parisc 

All errors (new ones prefixed by >>):

   net/built-in.o: In function `sk_alloc':
   (.text.sk_alloc+0xb4): undefined reference to `mem_cgroup_sk_alloc'
   net/built-in.o: In function `__sk_destruct':
>> net/core/.tmp_sock.o:(.text.__sk_destruct+0xdc): undefined reference to 
>> `mem_cgroup_sk_free'
   net/built-in.o: In function `sk_clone_lock':
   (.text.sk_clone_lock+0x184): undefined reference to `mem_cgroup_sk_alloc'

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [RFC v3 07/22] landlock: Handle file comparisons

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 09:24:00AM +0200, Mickaël Salaün wrote:
> Add eBPF functions to compare file system access with a Landlock file
> system handle:
> * bpf_landlock_cmp_fs_prop_with_struct_file(prop, map, map_op, file)
>   This function allows to compare the dentry, inode, device or mount
>   point of the currently accessed file, with a reference handle.
> * bpf_landlock_cmp_fs_beneath_with_struct_file(opt, map, map_op, file)
>   This function allows an eBPF program to check if the current accessed
>   file is the same or in the hierarchy of a reference handle.
> 
> The goal of file system handle is to abstract kernel objects such as a
> struct file or a struct inode. Userland can create this kind of handle
> thanks to the BPF_MAP_UPDATE_ELEM command. The element is a struct
> landlock_handle containing the handle type (e.g.
> BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD) and a file descriptor. This could
> also be any descriptions able to match a struct file or a struct inode
> (e.g. path or glob string).
> 
> Changes since v2:
> * add MNT_INTERNAL check to only add file handle from user-visible FS
>   (e.g. no anonymous inode)
> * replace struct file* with struct path* in map_landlock_handle
> * add BPF protos
> * fix bpf_landlock_cmp_fs_prop_with_struct_file()
> 
> Signed-off-by: Mickaël Salaün 
> Cc: Alexei Starovoitov 
> Cc: Andy Lutomirski 
> Cc: Daniel Borkmann 
> Cc: David S. Miller 
> Cc: James Morris 
> Cc: Kees Cook 
> Cc: Serge E. Hallyn 
> Link: 
> https://lkml.kernel.org/r/calcetrwwtiz3kztkegow24-dvhqq6lftwexh77fd2g5o71y...@mail.gmail.com

thanks for keeping the links to the previous discussion.
Long term it should help, though I worry we are already at the point
where there are too many outstanding issues to resolve before we
can proceed with reasonable code review.

> +/*
> + * bpf_landlock_cmp_fs_prop_with_struct_file
> + *
> + * Cf. include/uapi/linux/bpf.h
> + */
> +static inline u64 bpf_landlock_cmp_fs_prop_with_struct_file(u64 r1_property,
> + u64 r2_map, u64 r3_map_op, u64 r4_file, u64 r5)
> +{
> + u8 property = (u8) r1_property;
> + struct bpf_map *map = (struct bpf_map *) (unsigned long) r2_map;
> + enum bpf_map_array_op map_op = r3_map_op;
> + struct file *file = (struct file *) (unsigned long) r4_file;

please use just added BPF_CALL_ macros. They will help readability of the above.

> + struct bpf_array *array = container_of(map, struct bpf_array, map);
> + struct path *p1, *p2;
> + struct map_landlock_handle *handle;
> + int i;
> +
> + /* ARG_CONST_PTR_TO_LANDLOCK_HANDLE_FS is an arraymap */
> + if (unlikely(!map)) {
> + WARN_ON(1);
> + return -EFAULT;
> + }
> + if (unlikely(!file))
> + return -ENOENT;
> + if (unlikely((property | _LANDLOCK_FLAG_FS_MASK) != _LANDLOCK_FLAG_FS_MASK))
> + return -EINVAL;
> +
> + /* for now, only handle OP_OR */
> + switch (map_op) {
> + case BPF_MAP_ARRAY_OP_OR:
> + break;
> + case BPF_MAP_ARRAY_OP_UNSPEC:
> + case BPF_MAP_ARRAY_OP_AND:
> + case BPF_MAP_ARRAY_OP_XOR:
> + default:
> + return -EINVAL;
> + }
> + p2 = &file->f_path;
> +
> + synchronize_rcu();

that is completely broken.
bpf programs are executing under rcu_lock.
please enable CONFIG_PROVE_RCU and retest everything.

I would suggest for the next RFC to do minimal 7 patches up to this point
with simple example that demonstrates the use case.
I would avoid all unpriv stuff and all of seccomp for the next RFC as well,
otherwise I don't think we can realistically make forward progress, since
there are too many issues raised in the subsequent patches.

The common part that is mergeable is prog's subtype extension to
the verifier that can be used for better tracing and is the common
piece of infra needed for both landlock and checmate LSMs
(which must be one LSM anyway)



Re: [PATCH 7/9] net: ethernet: ti: cpts: calc mult and shift from refclk freq

2016-09-14 Thread Richard Cochran
On Wed, Sep 14, 2016 at 11:47:46PM +0300, Grygorii Strashko wrote:
> As I understand (and tested), clocks_calc_mult_shift() will 
> return the max possible mult which can be used without overflow.

Yes, BUT the returned value depends on the @maxsec input.  As the
kerneldoc says,

 * Larger ranges may reduce the conversion accuracy by choosing smaller
 * mult and shift factors.

In addition to that, frequency tuning works by calculating a percentage of
'mult', and if 'mult' is small, then the tuning resolution is poor.

So we don't want @maxsec as large as possible, but as small as
possible.

> If the calculated results do not satisfy the end user, custom values can
> be passed in DT.

If we calculate automatically, then the result had better well be
optimal or nearly so.  Otherwise, we should leave it as a manual input
via DTS, IMHO, so that someone is forced to check the values.

Thanks,
Richard


Re: [PATCH 3/9] net: ethernet: ti: cpts: rework initialization/deinitialization

2016-09-14 Thread Grygorii Strashko
On 09/14/2016 11:52 PM, Richard Cochran wrote:
> On Wed, Sep 14, 2016 at 11:37:45PM +0300, Grygorii Strashko wrote:
> The problem is that if cpts is not initialized then the pointer to
> cpts (in consumer/parent driver - NETCP) will be NULL.
> 
> You made that problem with your "clean up" in this series.
> Previously, cpts was always allocated.

not exactly - it was allocated in CPSW .probe() manually, and
without this re-work it would have to be allocated in NETCP the
same way - manually. So, the checks we are discussing here would still be
present, but they would need to be done in the CPSW/NETCP drivers -
just one level up and duplicated in both these drivers.

> 
>> So, rx_enable will be inaccessible and cause a crash :(
> 
> So just make sure cpts is initialized.

OK. I'll update code this way.

-- 
regards,
-grygorii


Re: [PATCH 8/9] net: ethernet: ti: cpts: fix overflow check period

2016-09-14 Thread Grygorii Strashko

On 09/14/2016 11:43 PM, Richard Cochran wrote:

On Wed, Sep 14, 2016 at 11:23:43PM +0300, Grygorii Strashko wrote:

If yes, then those changes are correct both from patch #7's point of
view and from patch #8's, because they are separate standalone changes.
In patch #7 it is reasonable to bail out earlier, while in patch #8
it needs to move forward a bit because I need to know maxsec.


And what about the extra blank line?  AFAICT, placing the test later
in patch #7 is correct logic and has the advantage of not distracting
reviewers with pointless churn!



NP. I'll change it.

--
regards,
-grygorii


Re: [PATCH 3/9] net: ethernet: ti: cpts: rework initialization/deinitialization

2016-09-14 Thread Richard Cochran
On Wed, Sep 14, 2016 at 11:37:45PM +0300, Grygorii Strashko wrote:
> The problem is that if cpts is not initialized then the pointer to
> cpts (in consumer/parent driver - NETCP) will be NULL.

You made that problem with your "clean up" in this series.
Previously, cpts was always allocated.

> So, rx_enable will be inaccessible and cause a crash :(

So just make sure cpts is initialized.

Thanks,
Richard


Re: [PATCH 7/9] net: ethernet: ti: cpts: calc mult and shift from refclk freq

2016-09-14 Thread Grygorii Strashko
On 09/14/2016 11:26 PM, Richard Cochran wrote:
> On Wed, Sep 14, 2016 at 04:02:29PM +0300, Grygorii Strashko wrote:
>> +static void cpts_calc_mult_shift(struct cpts *cpts)
>> +{
>> +u64 maxsec;
>> +u32 freq;
>> +u32 mult;
>> +u32 shift;
>> +u64 ns;
>> +u64 frac;
>> +
>> +if (cpts->cc_mult || cpts->cc.shift)
>> +return;
>> +
>> +freq = clk_get_rate(cpts->refclk);
>> +
>> +/* Calc the maximum number of seconds which we can run before
>> + * wrapping around.
>> + */
>> +maxsec = cpts->cc.mask;
>> +do_div(maxsec, freq);
>> +if (!maxsec)
>> +maxsec = 1;
> 
> This doesn't pass the smell test.  If the max counter value is M, you
> are figuring M*1/F which is the time in seconds corresponding to M.
> We set M=2^32-1, and so 'freq' would have to be greater than 4 GHz in
> order for 'maxsec' to be zero.  Do you really expect such high
> frequency input clocks?

no. can drop it.

> 
>> +else if (maxsec > 600 && cpts->cc.mask > UINT_MAX)
>> +maxsec = 600;
> 
> What is this all about?  cc.mask is always 2^32 - 1.

Oh. Not sure if we will update CPTS to support this, but on
 KS2 E/L (66AK2E) the CPTS counter can work in 64-bit mode.

> 
>> +clocks_calc_mult_shift(&mult, &shift, freq, NSEC_PER_SEC, maxsec);
>> +
>> +cpts->cc_mult = mult;
>> +cpts->cc.mult = mult;
> 
> In order to get good resolution on the frequency adjustment, we want
> to keep 'mult' as large as possible.  I don't see your code doing
> this.  We can rely on the watchdog reader (work queue) to prevent
> overflows.

As I understand (and tested), clocks_calc_mult_shift() will 
return the max possible mult which can be used without overflow.
If the calculated results do not satisfy the end user, custom values can
be passed in DT.

-- 
regards,
-grygorii


Re: [PATCH v4 00/16] Add Paravirtual RDMA Driver

2016-09-14 Thread Jason Gunthorpe
On Wed, Sep 14, 2016 at 07:44:45PM +, Adit Ranadive wrote:
> On Wed, Sep 14, 2016 at 10:37:00 -0700, Jason Gunthorpe wrote:
> > We desire to use this as the vehicle for the userspace included with the 4.9
> > kernel.
> > 
> > I anticipate the tree will be running by Oct 1.
> 
> Thanks. So does this mean that the libraries distributed via OFED 
> (openfabrics.org)
> will be now from the rdma-plumbing git tree?

I can't speak to the exact transition schedule each downstream will
follow. OFED is a downstream distribution focused on backporting to
old distros.

I would expect that by the time OFED starts to backport 4.9 they will
need to use rdma plumbing.

At the cutoff point we will immediately tell all downstreams the
latest official upstream source is in rdma-plumbing and they should
stop monitoring the old repos.

> Or is the switch to happen only when distros start shipping with the
> 4.9 kernel by default?

Now that we are going down this path with the 4.9 set as the time
frame, I wouldn't worry about timing: there is no downside for you in
working within rdma-plumbing, and it guarantees that eventually your
code will be widely distributed.

Jason


Re: [PATCH 3/3] mm: memcontrol: consolidate cgroup socket tracking

2016-09-14 Thread Tejun Heo
On Wed, Sep 14, 2016 at 03:48:46PM -0400, Johannes Weiner wrote:
> The cgroup core and the memory controller need to track socket
> ownership for different purposes, but the tracking sites being
> entirely different is kind of ugly.
> 
> Be a better citizen and rename the memory controller callbacks to
> match the cgroup core callbacks, then move them to the same place.
> 
> Signed-off-by: Johannes Weiner 

For 1-3,

Acked-by: Tejun Heo 

Thanks.

-- 
tejun


Re: [PATCH 8/9] net: ethernet: ti: cpts: fix overflow check period

2016-09-14 Thread Richard Cochran
On Wed, Sep 14, 2016 at 11:23:43PM +0300, Grygorii Strashko wrote:
> If yes, then those changes are correct both from patch #7's point of 
> view and from patch #8's, because they are separate standalone changes.
> In patch #7 it is reasonable to bail out earlier, while in patch #8
> it needs to move forward a bit because I need to know maxsec.

And what about the extra blank line?  AFAICT, placing the test later
in patch #7 is correct logic and has the advantage of not distracting
reviewers with pointless churn!

Thanks,
Richard


Re: [PATCH 3/9] net: ethernet: ti: cpts: rework initialization/deinitialization

2016-09-14 Thread Grygorii Strashko
On 09/14/2016 11:32 PM, Richard Cochran wrote:
> On Wed, Sep 14, 2016 at 11:10:48PM +0300, Grygorii Strashko wrote:
>> On 09/14/2016 04:52 PM, Richard Cochran wrote:
>>> On Wed, Sep 14, 2016 at 04:02:25PM +0300, Grygorii Strashko wrote:
> 
 -  if (!cpts->rx_enable)
 +  if (!cpts || !cpts->rx_enable)
return;
> 
>> Ok. I can't say I like all these checks, but there is an internal requirement 
>> to allow CPTS to be disabled through DT on KS2 (even if built in).
>> I'll try to clarify and come back here.
>>
>> But it'll be good to know your position - acceptable/can be 
>> discussed/completely unacceptable?
> 
> Look at that code snippet.  We already test for @cpts->rx_enable.  If
> you want to disable cpts at run time, just avoid setting that field.
> There is no need for any new tests or return codes.
> 

The problem is that if cpts is not initialized then the pointer to
cpts (in consumer/parent driver - NETCP) will be NULL.
So, rx_enable will be inaccessible and cause a crash :(

-- 
regards,
-grygorii


Re: [PATCH 3/9] net: ethernet: ti: cpts: rework initialization/deinitialization

2016-09-14 Thread Richard Cochran
On Wed, Sep 14, 2016 at 11:10:48PM +0300, Grygorii Strashko wrote:
> On 09/14/2016 04:52 PM, Richard Cochran wrote:
> > On Wed, Sep 14, 2016 at 04:02:25PM +0300, Grygorii Strashko wrote:

> >> -  if (!cpts->rx_enable)
> >> +  if (!cpts || !cpts->rx_enable)
> >>return;

> Ok. I can't say I like all these checks, but there is an internal requirement 
> to allow CPTS to be disabled through DT on KS2 (even if built in).
> I'll try to clarify and come back here.
> 
> But it'll be good to know your position - acceptable/can be 
> discussed/completely unacceptable?

Look at that code snippet.  We already test for @cpts->rx_enable.  If
you want to disable cpts at run time, just avoid setting that field.
There is no need for any new tests or return codes.

Thanks,
Richard


Re: [PATCH 7/9] net: ethernet: ti: cpts: calc mult and shift from refclk freq

2016-09-14 Thread Richard Cochran
On Wed, Sep 14, 2016 at 04:02:29PM +0300, Grygorii Strashko wrote:
> +static void cpts_calc_mult_shift(struct cpts *cpts)
> +{
> + u64 maxsec;
> + u32 freq;
> + u32 mult;
> + u32 shift;
> + u64 ns;
> + u64 frac;
> +
> + if (cpts->cc_mult || cpts->cc.shift)
> + return;
> +
> + freq = clk_get_rate(cpts->refclk);
> +
> + /* Calc the maximum number of seconds which we can run before
> +  * wrapping around.
> +  */
> + maxsec = cpts->cc.mask;
> + do_div(maxsec, freq);
> + if (!maxsec)
> + maxsec = 1;

This doesn't pass the smell test.  If the max counter value is M, you
are figuring M*1/F which is the time in seconds corresponding to M.
We set M=2^32-1, and so 'freq' would have to be greater than 4 GHz in
order for 'maxsec' to be zero.  Do you really expect such high
frequency input clocks?

> + else if (maxsec > 600 && cpts->cc.mask > UINT_MAX)
> + maxsec = 600;

What is this all about?  cc.mask is always 2^32 - 1.

> + clocks_calc_mult_shift(&mult, &shift, freq, NSEC_PER_SEC, maxsec);
> +
> + cpts->cc_mult = mult;
> + cpts->cc.mult = mult;

In order to get good resolution on the frequency adjustment, we want
to keep 'mult' as large as possible.  I don't see your code doing
this.  We can rely on the watchdog reader (work queue) to prevent
overflows.

Thanks,
Richard

> + cpts->cc.shift = shift;
> + /* Check calculations and inform if not precise */
> + frac = 0;
> + ns = cyclecounter_cyc2ns(&cpts->cc, freq, cpts->cc.mask, &frac);
> +
> + dev_info(cpts->dev,
> +  "CPTS: ref_clk_freq:%u calc_mult:%u calc_shift:%u error:%lld 
> nsec/sec\n",
> +  freq, cpts->cc_mult, cpts->cc.shift, (ns - NSEC_PER_SEC));
> +}


Re: [PATCH 8/9] net: ethernet: ti: cpts: fix overflow check period

2016-09-14 Thread Grygorii Strashko
On 09/14/2016 11:08 PM, Richard Cochran wrote:
> On Wed, Sep 14, 2016 at 11:03:18PM +0300, Grygorii Strashko wrote:
>> On 09/14/2016 05:25 PM, Richard Cochran wrote:
>>> On Wed, Sep 14, 2016 at 04:02:30PM +0300, Grygorii Strashko wrote:
 @@ -427,9 +427,6 @@ static void cpts_calc_mult_shift(struct cpts *cpts)
u64 ns;
u64 frac;
  
 -  if (cpts->cc_mult || cpts->cc.shift)
 -  return;
 -
freq = clk_get_rate(cpts->refclk);
  
/* Calc the maximum number of seconds which we can run before
>>>
>>> This hunk has nothing to do with $subject.
>>
> Sry, but I did not get your comment here :(
> I'd be happy to update the patch according to your request, but could you provide 
> more info here, pls?
> 
> You added that code in patch #7.  Then you moved it in patch #8.  You
> could have made the code correct in patch #7 to begin with.
> 

Do you mean 
 -  if (cpts->cc_mult || cpts->cc.shift)
 -  return;
??

If yes, then those changes are correct both from patch #7's point of 
view and from patch #8's, because they are separate standalone changes.
In patch #7 it is reasonable to bail out earlier, while in patch #8
it needs to move forward a bit because I need to know maxsec. 

Sry that I'm bothering you with stupid questions.


-- 
regards,
-grygorii


Re: [PATCH 3/9] net: ethernet: ti: cpts: rework initialization/deinitialization

2016-09-14 Thread Grygorii Strashko
On 09/14/2016 04:52 PM, Richard Cochran wrote:
> On Wed, Sep 14, 2016 at 04:02:25PM +0300, Grygorii Strashko wrote:
>> @@ -323,7 +307,7 @@ void cpts_rx_timestamp(struct cpts *cpts, struct sk_buff 
>> *skb)
>>  u64 ns;
>>  struct skb_shared_hwtstamps *ssh;
>>  
>> -if (!cpts->rx_enable)
>> +if (!cpts || !cpts->rx_enable)
>>  return;
> 
> This function is in the hot path, and you have added a pointless new
> test.  Don't do that.
> 
>>  ns = cpts_find_ts(cpts, skb, CPTS_EV_RX);
>>  if (!ns)
>> @@ -338,7 +322,7 @@ void cpts_tx_timestamp(struct cpts *cpts, struct sk_buff 
>> *skb)
>>  u64 ns;
>>  struct skb_shared_hwtstamps ssh;
>>  
>> -if (!(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS))
>> +if (!cpts || !(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS))
>>  return;
> 
> Same here.
> 
>>  ns = cpts_find_ts(cpts, skb, CPTS_EV_TX);
>>  if (!ns)
>> @@ -348,53 +332,102 @@ void cpts_tx_timestamp(struct cpts *cpts, struct 
>> sk_buff *skb)
>>  skb_tstamp_tx(skb, &ssh);
>>  }
>>  
>> -int cpts_register(struct device *dev, struct cpts *cpts,
>> -  u32 mult, u32 shift)
>> +int cpts_register(struct cpts *cpts)
>>  {
>>  int err, i;
>> -unsigned long flags;
>>  
>> -cpts->info = cpts_info;
>> -cpts->clock = ptp_clock_register(&cpts->info, dev);
>> -if (IS_ERR(cpts->clock)) {
>> -err = PTR_ERR(cpts->clock);
>> -cpts->clock = NULL;
>> -return err;
>> -}
>> -spin_lock_init(&cpts->lock);
>> -
>> -cpts->cc.read = cpts_systim_read;
>> -cpts->cc.mask = CLOCKSOURCE_MASK(32);
>> -cpts->cc_mult = mult;
>> -cpts->cc.mult = mult;
>> -cpts->cc.shift = shift;
>> +if (!cpts)
>> +return -EINVAL;
> 
> Not hot path, but still silly.  The caller should never pass NULL.

Ok. I can't say I like all these checks, but there is an internal requirement 
to allow CPTS to be disabled through DT on KS2 (even if built in).
I'll try to clarify and come back here.

But it'll be good to know your position - acceptable/can be 
discussed/completely unacceptable?


-- 
regards,
-grygorii


Re: [PATCH 8/9] net: ethernet: ti: cpts: fix overflow check period

2016-09-14 Thread Richard Cochran
On Wed, Sep 14, 2016 at 11:03:18PM +0300, Grygorii Strashko wrote:
> On 09/14/2016 05:25 PM, Richard Cochran wrote:
> > On Wed, Sep 14, 2016 at 04:02:30PM +0300, Grygorii Strashko wrote:
> >> @@ -427,9 +427,6 @@ static void cpts_calc_mult_shift(struct cpts *cpts)
> >>u64 ns;
> >>u64 frac;
> >>  
> >> -  if (cpts->cc_mult || cpts->cc.shift)
> >> -  return;
> >> -
> >>freq = clk_get_rate(cpts->refclk);
> >>  
> >>/* Calc the maximum number of seconds which we can run before
> > 
> > This hunk has nothing to do with $subject.
> 
> Sry, but I did not get your comment here :(
> I'd be happy to update the patch according to your request, but could you provide 
> more info here, pls?

You added that code in patch #7.  Then you moved it in patch #8.  You
could have made the code correct in patch #7 to begin with.

> >>cpts->cc_mult = mult;
> >>cpts->cc.mult = mult;
> >>cpts->cc.shift = shift;
> >> +

If you want a blank line here, then put into the original patch #7.

Thanks,
Richard


Re: [PATCH 8/9] net: ethernet: ti: cpts: fix overflow check period

2016-09-14 Thread Grygorii Strashko
On 09/14/2016 05:25 PM, Richard Cochran wrote:
> On Wed, Sep 14, 2016 at 04:02:30PM +0300, Grygorii Strashko wrote:
>> @@ -427,9 +427,6 @@ static void cpts_calc_mult_shift(struct cpts *cpts)
>>  u64 ns;
>>  u64 frac;
>>  
>> -if (cpts->cc_mult || cpts->cc.shift)
>> -return;
>> -
>>  freq = clk_get_rate(cpts->refclk);
>>  
>>  /* Calc the maximum number of seconds which we can run before
> 
> This hunk has nothing to do with $subject.

Sry, but I did not get your comment here :(
I'd be happy to update the patch according to your request, but could you provide more 
info here, pls?



> 
>> @@ -442,11 +439,20 @@ static void cpts_calc_mult_shift(struct cpts *cpts)
>>  else if (maxsec > 600 && cpts->cc.mask > UINT_MAX)
>>  maxsec = 600;
>>  
>> +/* Calc overflow check period (maxsec / 2) */
>> +cpts->ov_check_period = (HZ * maxsec) / 2;
>> +dev_info(cpts->dev, "cpts: overflow check period %lu\n",
>> + cpts->ov_check_period);
>> +
>> +if (cpts->cc_mult || cpts->cc.shift)
>> +return;
>> +
>>  clocks_calc_mult_shift(&mult, &shift, freq, NSEC_PER_SEC, maxsec);
>>  
>>  cpts->cc_mult = mult;
>>  cpts->cc.mult = mult;
>>  cpts->cc.shift = shift;
>> +
> 
> Nor does this.



-- 
regards,
-grygorii


Re: [PATCH 4/9] net: ethernet: ti: cpts: move dt props parsing to cpts driver

2016-09-14 Thread Richard Cochran
On Wed, Sep 14, 2016 at 10:45:54PM +0300, Grygorii Strashko wrote:
> With this change it will not be required to add the same DT parsing code
> in the Keystone 2 netcp driver, so the overall number of lines will be reduced.

This explanation would make the commit message much more informative.

Thanks,
Richard


Re: [PATCH 7/9] net: ethernet: ti: cpts: calc mult and shift from refclk freq

2016-09-14 Thread Grygorii Strashko

On 09/14/2016 05:22 PM, Richard Cochran wrote:

On Wed, Sep 14, 2016 at 04:02:29PM +0300, Grygorii Strashko wrote:

@@ -35,6 +33,8 @@ Optional properties:
  For example in dra72x-evm, pcf gpio has to be
  driven low so that cpsw slave 0 and phy data
  lines are connected via mux.
+- cpts_clock_mult  : Numerator to convert input clock ticks into nanoseconds
+- cpts_clock_shift : Denominator to convert input clock ticks into nanoseconds


You should explain to the reader how these will be calculated when the
properties are missing.


Not sure how fully it should be explained in the bindings - I'll try.





diff --git a/drivers/net/ethernet/ti/cpts.c b/drivers/net/ethernet/ti/cpts.c
index ff8bb85..8046a21 100644
--- a/drivers/net/ethernet/ti/cpts.c
+++ b/drivers/net/ethernet/ti/cpts.c
@@ -418,18 +418,60 @@ void cpts_unregister(struct cpts *cpts)
clk_disable(cpts->refclk);
 }

+static void cpts_calc_mult_shift(struct cpts *cpts)
+{
+   u64 maxsec;
+   u32 freq;
+   u32 mult;
+   u32 shift;
+   u64 ns;
+   u64 frac;


Why so many new lines?  This isn't good style.  Please combine
variables of the same type into one line and sort the lists
alphabetically.


It's a matter of preference :), but sure - I'll update.



+   if (cpts->cc_mult || cpts->cc.shift)
+   return;
+
+   freq = clk_get_rate(cpts->refclk);
+
+   /* Calc the maximum number of seconds which we can run before
+* wrapping around.
+*/
+   maxsec = cpts->cc.mask;
+   do_div(maxsec, freq);
+   if (!maxsec)
+   maxsec = 1;
+   else if (maxsec > 600 && cpts->cc.mask > UINT_MAX)
+   maxsec = 600;
+
+   clocks_calc_mult_shift(&mult, &shift, freq, NSEC_PER_SEC, maxsec);
+
+   cpts->cc_mult = mult;
+   cpts->cc.mult = mult;
+   cpts->cc.shift = shift;
+   /* Check calculations and inform if not precise */


Contrary to this comment, you are not making any kind of judgment
about whether the calculations are precise or not.


+   frac = 0;
+   ns = cyclecounter_cyc2ns(&cpts->cc, freq, cpts->cc.mask, &frac);
+
+   dev_info(cpts->dev,
+"CPTS: ref_clk_freq:%u calc_mult:%u calc_shift:%u error:%lld 
nsec/sec\n",
+freq, cpts->cc_mult, cpts->cc.shift, (ns - NSEC_PER_SEC));
+}
+




Thanks for the review.

--
regards,
-grygorii


Re: [PATCH RFC 5/6] net: Generic resolver backend

2016-09-14 Thread Tom Herbert
On Wed, Sep 14, 2016 at 2:49 AM, Thomas Graf  wrote:
> On 09/09/16 at 04:19pm, Tom Herbert wrote:
>> diff --git a/net/core/resolver.c b/net/core/resolver.c
>> new file mode 100644
>> index 000..61b36c5
>> --- /dev/null
>> +++ b/net/core/resolver.c
>> @@ -0,0 +1,267 @@
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>
> This include list could be stripped down a bit. ila, lwt, fib, ...
>
>> +
>> +static struct net_rslv_ent *net_rslv_new_ent(struct net_rslv *nrslv,
>> +  void *key)
>
> Comment above that net_rslv_get_lock() must be held?
>
>> +{
>> + struct net_rslv_ent *nrent;
>> + int err;
>> +
>> + nrent = kzalloc(sizeof(*nrent) + nrslv->obj_size, GFP_KERNEL);
>
> GFP_ATOMIC since you typically hold net_rslv_get_lock() spinlock?
>
>> + if (!nrent)
>> + return ERR_PTR(-EAGAIN);
>> +
>> + /* Key is always at beginning of object data */
>> + memcpy(nrent->object, key, nrslv->params.key_len);
>> +
>> + /* Initialize user data */
>> + if (nrslv->rslv_init)
>> + nrslv->rslv_init(nrslv, nrent);
>> +
>> + /* Put in hash table */
>> + err = rhashtable_lookup_insert_fast(&nrslv->rhash_table,
>> + &nrent->node, nrslv->params);
>> + if (err)
>> + return ERR_PTR(err);
>> +
>> + if (nrslv->timeout) {
>> + /* Schedule timeout for resolver */
>> + INIT_DELAYED_WORK(&nrent->timeout_work, net_rslv_delayed_work);
>
> Should this be done before inserting into rhashtable?
>
Adding to the table and setting delayed work are done under a lock so
I think it should be okay. I'll add a comment to the function that the
lock is held.

>> + schedule_delayed_work(&nrent->timeout_work, nrslv->timeout);
>> + }
>> +
>> + nrent->nrslv = nrslv;
>
> Same here.  net_rslv_cancel_all_delayed_work() walking the rhashtable could
> see ->nrslv as NULL.

I'll move it up, but net_rslv_cancel_all_delayed_work is only called
when we're destroying the table so it would be a bug in the higher
layer if it is both destroying the table and adding entries at the
same time.

Thanks,
Tom


>


Re: [PATCH 6/9] net: ethernet: ti: cpts: clean up event list if event pool is empty

2016-09-14 Thread Grygorii Strashko

On 09/14/2016 05:14 PM, Richard Cochran wrote:

On Wed, Sep 14, 2016 at 04:02:28PM +0300, Grygorii Strashko wrote:

From: WingMan Kwok 

When a CPTS user does not exit gracefully by disabling cpts
timestamping and leaving a joined multicast group, the system
continues to receive and timestamp the PTP packets, which eventually
occupy all the event list entries.  When this happens, the added code
tries to remove some list entries which are expired.

Signed-off-by: WingMan Kwok 
Signed-off-by: Grygorii Strashko 
---
 drivers/net/ethernet/ti/cpts.c | 30 --
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpts.c b/drivers/net/ethernet/ti/cpts.c
index 970d4e2..ff8bb85 100644
--- a/drivers/net/ethernet/ti/cpts.c
+++ b/drivers/net/ethernet/ti/cpts.c
@@ -57,22 +57,48 @@ static int cpts_fifo_pop(struct cpts *cpts, u32 *high, u32 
*low)
return -1;
 }

+static int cpts_event_list_clean_up(struct cpts *cpts)


5 words, that is quite a mouthful.  How about this instead?

static int cpts_purge_events(struct cpts *cpts);


+{
+   struct list_head *this, *next;
+   struct cpts_event *event;
+   int removed = 0;
+
+   list_for_each_safe(this, next, &cpts->events) {
+   event = list_entry(this, struct cpts_event, list);
+   if (event_expired(event)) {
+   list_del_init(&event->list);
+   list_add(&event->list, &cpts->pool);
+   ++removed;
+   }
+   }
+   return removed;
+}
+
 /*
  * Returns zero if matching event type was found.
  */
 static int cpts_fifo_read(struct cpts *cpts, int match)
 {
int i, type = -1;
+   int removed;


No need for another variable, just change the return code above to

return removed ? 0 : -1;

and then you have ...


u32 hi, lo;
struct cpts_event *event;

for (i = 0; i < CPTS_FIFO_DEPTH; i++) {
if (cpts_fifo_pop(cpts, &hi, &lo))
break;
+
if (list_empty(&cpts->pool)) {
-   pr_err("cpts: event pool is empty\n");
-   return -1;
+   removed = cpts_event_list_clean_up(cpts);
+   if (!removed) {
+   dev_err(cpts->dev,
+   "cpts: event pool is empty\n");
+   return -1;
+   }


if (cpts_purge_events(cpts)) {
dev_err(cpts->dev, "cpts: event pool empty\n");
return -1;
}

Notice how I avoided the ugly line break?


+   dev_dbg(cpts->dev,
+   "cpts: event pool cleaned up %d\n", removed);
}
+
event = list_first_entry(&cpts->pool, struct cpts_event, list);
event->tmo = jiffies + 2;
event->high = hi;
--
2.9.3





NP here - will update. Thanks for your review.


--
regards,
-grygorii


[PATCH 3/3] mm: memcontrol: consolidate cgroup socket tracking

2016-09-14 Thread Johannes Weiner
The cgroup core and the memory controller need to track socket
ownership for different purposes, but the tracking sites being
entirely different is kind of ugly.

Be a better citizen and rename the memory controller callbacks to
match the cgroup core callbacks, then move them to the same place.

Signed-off-by: Johannes Weiner 
---
 include/linux/memcontrol.h |  4 ++--
 mm/memcontrol.c| 19 +++
 net/core/sock.c|  6 +++---
 net/ipv4/tcp.c |  2 --
 net/ipv4/tcp_ipv4.c|  3 ---
 5 files changed, 16 insertions(+), 18 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0710143723bc..ca11b3e6dd65 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -773,8 +773,8 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback 
*wb,
 #endif /* CONFIG_CGROUP_WRITEBACK */
 
 struct sock;
-void sock_update_memcg(struct sock *sk);
-void sock_release_memcg(struct sock *sk);
+void mem_cgroup_sk_alloc(struct sock *sk);
+void mem_cgroup_sk_free(struct sock *sk);
 bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
 void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int 
nr_pages);
 #ifdef CONFIG_MEMCG
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 60bb830abc34..2caf1ee86e78 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2939,7 +2939,7 @@ static int memcg_update_tcp_limit(struct mem_cgroup 
*memcg, unsigned long limit)
/*
 * The active flag needs to be written after the static_key
 * update. This is what guarantees that the socket activation
-* function is the last one to run. See sock_update_memcg() for
+* function is the last one to run. See mem_cgroup_sk_alloc() 
for
 * details, and note that we don't mark any socket as belonging
 * to this memcg until that flag is up.
 *
@@ -2948,7 +2948,7 @@ static int memcg_update_tcp_limit(struct mem_cgroup 
*memcg, unsigned long limit)
 * as accounted, but the accounting functions are not patched in
 * yet, we'll lose accounting.
 *
-* We never race with the readers in sock_update_memcg(),
+* We never race with the readers in mem_cgroup_sk_alloc(),
 * because when this value change, the code to process it is not
 * patched in yet.
 */
@@ -5651,11 +5651,15 @@ void mem_cgroup_migrate(struct page *oldpage, struct 
page *newpage)
 DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
 EXPORT_SYMBOL(memcg_sockets_enabled_key);
 
-void sock_update_memcg(struct sock *sk)
+void mem_cgroup_sk_alloc(struct sock *sk)
 {
struct mem_cgroup *memcg;
 
-   /* Socket cloning can throw us here with sk_cgrp already
+   if (!mem_cgroup_sockets_enabled)
+   return;
+
+   /*
+* Socket cloning can throw us here with sk_memcg already
 * filled. It won't however, necessarily happen from
 * process context. So the test for root memcg given
 * the current task's memcg won't help us in this case.
@@ -5680,12 +5684,11 @@ void sock_update_memcg(struct sock *sk)
 out:
rcu_read_unlock();
 }
-EXPORT_SYMBOL(sock_update_memcg);
 
-void sock_release_memcg(struct sock *sk)
+void mem_cgroup_sk_free(struct sock *sk)
 {
-   WARN_ON(!sk->sk_memcg);
-   css_put(&sk->sk_memcg->css);
+   if (sk->sk_memcg)
+   css_put(&sk->sk_memcg->css);
 }
 
 /**
diff --git a/net/core/sock.c b/net/core/sock.c
index 038e660ef844..c73e28fc9c2a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1363,6 +1363,7 @@ static void sk_prot_free(struct proto *prot, struct sock 
*sk)
slab = prot->slab;
 
cgroup_sk_free(&sk->sk_cgrp_data);
+   mem_cgroup_sk_free(sk);
security_sk_free(sk);
if (slab != NULL)
kmem_cache_free(slab, sk);
@@ -1399,6 +1400,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t 
priority,
sock_net_set(sk, net);
atomic_set(&sk->sk_wmem_alloc, 1);
 
+   mem_cgroup_sk_alloc(sk);
cgroup_sk_alloc(&sk->sk_cgrp_data);
sock_update_classid(&sk->sk_cgrp_data);
sock_update_netprioidx(&sk->sk_cgrp_data);
@@ -1545,6 +1547,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const 
gfp_t priority)
newsk->sk_incoming_cpu = raw_smp_processor_id();
atomic64_set(&newsk->sk_cookie, 0);
 
+   mem_cgroup_sk_alloc(newsk);
cgroup_sk_alloc(&newsk->sk_cgrp_data);
 
/*
@@ -1569,9 +1572,6 @@ struct sock *sk_clone_lock(const struct sock *sk, const 
gfp_t priority)
sk_set_socket(newsk, NULL);
newsk->sk_wq = NULL;
 
-   if (mem_cgroup_sockets_enabled && sk->sk_memcg)
-   

[PATCH 1/3] mm: memcontrol: make per-cpu charge cache IRQ-safe for socket accounting

2016-09-14 Thread Johannes Weiner
From: Johannes Weiner 

During cgroup2 rollout into production, we started encountering css
refcount underflows and css access crashes in the memory controller.
Splitting the heavily shared css reference counter into logical users
narrowed the imbalance down to the cgroup2 socket memory accounting.

The problem turns out to be the per-cpu charge cache. Cgroup1 had a
separate socket counter, but the new cgroup2 socket accounting goes
through the common charge path that uses a shared per-cpu cache for
all memory that is being tracked. Those caches are safe against
scheduling preemption, but not against interrupts - such as the newly
added packet receive path. When cache draining is interrupted by
network RX taking pages out of the cache, the resuming drain operation
will put references of in-use pages, thus causing the imbalance.

Disable IRQs during all per-cpu charge cache operations.

Fixes: f7e1cb6ec51b ("mm: memcontrol: account socket memory in unified 
hierarchy memory controller")
Cc:  # 4.5+
Signed-off-by: Johannes Weiner 
---
 mm/memcontrol.c | 31 ++-
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7a8d6624758a..60bb830abc34 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1710,17 +1710,22 @@ static DEFINE_MUTEX(percpu_charge_mutex);
 static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
struct memcg_stock_pcp *stock;
+   unsigned long flags;
bool ret = false;
 
if (nr_pages > CHARGE_BATCH)
return ret;
 
-   stock = &get_cpu_var(memcg_stock);
+   local_irq_save(flags);
+
+   stock = this_cpu_ptr(&memcg_stock);
if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
stock->nr_pages -= nr_pages;
ret = true;
}
-   put_cpu_var(memcg_stock);
+
+   local_irq_restore(flags);
+
return ret;
 }
 
@@ -1741,15 +1746,18 @@ static void drain_stock(struct memcg_stock_pcp *stock)
stock->cached = NULL;
 }
 
-/*
- * This must be called under preempt disabled or must be called by
- * a thread which is pinned to local cpu.
- */
 static void drain_local_stock(struct work_struct *dummy)
 {
-   struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+   struct memcg_stock_pcp *stock;
+   unsigned long flags;
+
+   local_irq_save(flags);
+
+   stock = this_cpu_ptr(&memcg_stock);
drain_stock(stock);
	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
+
+   local_irq_restore(flags);
 }
 
 /*
@@ -1758,14 +1766,19 @@ static void drain_local_stock(struct work_struct *dummy)
  */
 static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
-   struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
+   struct memcg_stock_pcp *stock;
+   unsigned long flags;
+
+   local_irq_save(flags);
 
+   stock = this_cpu_ptr(&memcg_stock);
if (stock->cached != memcg) { /* reset if necessary */
drain_stock(stock);
stock->cached = memcg;
}
stock->nr_pages += nr_pages;
-   put_cpu_var(memcg_stock);
+
+   local_irq_restore(flags);
 }
 
 /*
-- 
2.9.3



[PATCH 2/3] cgroup: duplicate cgroup reference when cloning sockets

2016-09-14 Thread Johannes Weiner
From: Johannes Weiner 

When a socket is cloned, the associated sock_cgroup_data is duplicated
but not its reference on the cgroup. As a result, the cgroup reference
count will underflow when both sockets are destroyed later on.

Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
Cc:  # 4.5+
Signed-off-by: Johannes Weiner 
---
 kernel/cgroup.c | 6 ++++++
 net/core/sock.c | 5 ++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0c4db7908264..b0d727d26fc7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -6297,6 +6297,12 @@ void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
if (cgroup_sk_alloc_disabled)
return;
 
+   /* Socket clone path */
+   if (skcd->val) {
+   cgroup_get(sock_cgroup_ptr(skcd));
+   return;
+   }
+
rcu_read_lock();
 
while (true) {
diff --git a/net/core/sock.c b/net/core/sock.c
index 51a730485649..038e660ef844 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1340,7 +1340,6 @@ static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
if (!try_module_get(prot->owner))
goto out_free_sec;
sk_tx_queue_clear(sk);
-   cgroup_sk_alloc(&sk->sk_cgrp_data);
}
 
return sk;
@@ -1400,6 +1399,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
sock_net_set(sk, net);
	atomic_set(&sk->sk_wmem_alloc, 1);
 
+   cgroup_sk_alloc(&sk->sk_cgrp_data);
	sock_update_classid(&sk->sk_cgrp_data);
	sock_update_netprioidx(&sk->sk_cgrp_data);
}
@@ -1544,6 +1544,9 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
newsk->sk_priority = 0;
newsk->sk_incoming_cpu = raw_smp_processor_id();
	atomic64_set(&newsk->sk_cookie, 0);
+
+   cgroup_sk_alloc(&newsk->sk_cgrp_data);
+
/*
	 * Before updating sk_refcnt, we must commit prior changes to memory
 * (Documentation/RCU/rculist_nulls.txt for details)
-- 
2.9.3


