[PATCH net-next] fix unsafe set_memory_rw from softirq

2013-10-02 Thread Alexei Starovoitov
on x86 system with net.core.bpf_jit_enable = 1

sudo tcpdump -i eth1 'tcp port 22'

causes the warning:
[   56.766097]  Possible unsafe locking scenario:
[   56.766097]
[   56.780146]CPU0
[   56.786807]----
[   56.793188]   lock(&(&vb->lock)->rlock);
[   56.799593]   <Interrupt>
[   56.805889] lock(&(&vb->lock)->rlock);
[   56.812266]
[   56.812266]  *** DEADLOCK ***
[   56.812266]
[   56.830670] 1 lock held by ksoftirqd/1/13:
[   56.836838]  #0:  (rcu_read_lock){.+.+..}, at: [] 
vm_unmap_aliases+0x8c/0x380
[   56.849757]
[   56.849757] stack backtrace:
[   56.862194] CPU: 1 PID: 13 Comm: ksoftirqd/1 Not tainted 3.12.0-rc3+ #45
[   56.868721] Hardware name: System manufacturer System Product Name/P8Z77 WS, 
BIOS 3007 07/26/2012
[   56.882004]  821944c0 88080bbdb8c8 8175a145 
0007
[   56.895630]  88080bbd5f40 88080bbdb928 81755b14 
0001
[   56.909313]  88080001 8808 8101178f 
0001
[   56.923006] Call Trace:
[   56.929532]  [] dump_stack+0x55/0x76
[   56.936067]  [] print_usage_bug+0x1f7/0x208
[   56.942445]  [] ? save_stack_trace+0x2f/0x50
[   56.948932]  [] ? check_usage_backwards+0x150/0x150
[   56.955470]  [] mark_lock+0x282/0x2c0
[   56.961945]  [] __lock_acquire+0x45d/0x1d50
[   56.968474]  [] ? __lock_acquire+0x2de/0x1d50
[   56.975140]  [] ? cpumask_next_and+0x55/0x90
[   56.981942]  [] lock_acquire+0x92/0x1d0
[   56.988745]  [] ? vm_unmap_aliases+0x16a/0x380
[   56.995619]  [] _raw_spin_lock+0x41/0x50
[   57.002493]  [] ? vm_unmap_aliases+0x16a/0x380
[   57.009447]  [] vm_unmap_aliases+0x16a/0x380
[   57.016477]  [] ? vm_unmap_aliases+0x8c/0x380
[   57.023607]  [] change_page_attr_set_clr+0xc0/0x460
[   57.030818]  [] ? trace_hardirqs_on+0xd/0x10
[   57.037896]  [] ? kmem_cache_free+0xb0/0x2b0
[   57.044789]  [] ? free_object_rcu+0x93/0xa0
[   57.051720]  [] set_memory_rw+0x2f/0x40
[   57.058727]  [] bpf_jit_free+0x2c/0x40
[   57.065577]  [] sk_filter_release_rcu+0x1a/0x30
[   57.072338]  [] rcu_process_callbacks+0x202/0x7c0
[   57.078962]  [] __do_softirq+0xf7/0x3f0
[   57.085373]  [] run_ksoftirqd+0x35/0x70

cannot reuse the filter memory, since it's read-only, so we have to
extend sk_filter with a work_struct

Signed-off-by: Alexei Starovoitov 
---
 arch/x86/net/bpf_jit_comp.c |   17 -
 include/linux/filter.h  |1 +
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 79c216a..89a43df 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -772,13 +772,20 @@ out:
return;
 }
 
+static void bpf_jit_free_deferred(struct work_struct *work)
+{
+   struct sk_filter *fp = container_of(work, struct sk_filter, work);
+   unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
+   struct bpf_binary_header *header = (void *)addr;
+
+   set_memory_rw(addr, header->pages);
+   module_free(NULL, header);
+}
+
 void bpf_jit_free(struct sk_filter *fp)
 {
if (fp->bpf_func != sk_run_filter) {
-   unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
-   struct bpf_binary_header *header = (void *)addr;
-
-   set_memory_rw(addr, header->pages);
-   module_free(NULL, header);
+   INIT_WORK(&fp->work, bpf_jit_free_deferred);
+   schedule_work(&fp->work);
}
 }
diff --git a/include/linux/filter.h b/include/linux/filter.h
index a6ac848..378fa03 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -27,6 +27,7 @@ struct sk_filter
unsigned intlen;/* Number of filter blocks */
unsigned int(*bpf_func)(const struct sk_buff *skb,
const struct sock_filter *filter);
+   struct work_struct  work;
struct rcu_head rcu;
struct sock_filter  insns[0];
 };
-- 
1.7.9.5



Re: [PATCH net-next] fix unsafe set_memory_rw from softirq

2013-10-02 Thread Alexei Starovoitov
On Wed, Oct 2, 2013 at 9:23 PM, Eric Dumazet  wrote:
> On Wed, 2013-10-02 at 20:50 -0700, Alexei Starovoitov wrote:
>> on x86 system with net.core.bpf_jit_enable = 1
>
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index a6ac848..378fa03 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -27,6 +27,7 @@ struct sk_filter
>>   unsigned intlen;/* Number of filter blocks */
>>   unsigned int(*bpf_func)(const struct sk_buff *skb,
>>   const struct sock_filter *filter);
>> + struct work_struct  work;
>>   struct rcu_head rcu;
>>   struct sock_filter  insns[0];
>>  };
>
> Nice catch !
>
> It seems only x86 and s390 needs this work_struct.

I think an #ifdef CONFIG_X86 is a bit ugly inside struct sk_filter, but
I don't mind either way.
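
For illustration, the kind of layout being called ugly might look like this
(a sketch only, never actually posted; the config gating is hypothetical):

struct sk_filter
{
	atomic_t		refcnt;
	unsigned int		len;	/* Number of filter blocks */
	unsigned int		(*bpf_func)(const struct sk_buff *skb,
					    const struct sock_filter *filter);
#if defined(CONFIG_X86) || defined(CONFIG_S390)	/* arches whose JIT free would need deferral */
	struct work_struct	work;
#endif
	struct rcu_head		rcu;
	struct sock_filter	insns[0];
};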

> (and you might CC Heiko Carstens  to ask him
> to make the s390 part, of Ack it if you plan to do it)

set_memory_rw() on s390 is a simple page table walker that doesn't do
any IPIs, unlike x86.
Heiko, please confirm that it's not an issue there.

Thanks
Alexei


Re: [PATCH net-next] fix unsafe set_memory_rw from softirq

2013-10-03 Thread Alexei Starovoitov
On Wed, Oct 2, 2013 at 9:57 PM, Eric Dumazet  wrote:
> On Wed, 2013-10-02 at 21:53 -0700, Eric Dumazet wrote:
>> On Wed, 2013-10-02 at 21:44 -0700, Alexei Starovoitov wrote:
>>
>> > I think ifdef config_x86 is a bit ugly inside struct sk_filter, but
>> > don't mind whichever way.
>>
>> Its not fair to make sk_filter bigger, because it means that simple (non
>> JIT) filter might need an extra cache line.
>>
>> You could presumably use the following layout instead :
>>
>> struct sk_filter
>> {
>> atomic_trefcnt;
>> struct rcu_head rcu;
>>   struct work_struct  work;
>>
>> unsigned intlen cacheline_aligned;/* Number of 
>> filter blocks */
>> unsigned int(*bpf_func)(const struct sk_buff *skb,
>> const struct sock_filter 
>> *filter);
>> struct sock_filter  insns[0];
>> };
>
> And since @len is not used by sk_run_filter() use :
>
> struct sk_filter {
> atomic_trefcnt;
> int len; /* number of filter blocks */
> struct rcu_head rcu;
> struct work_struct  work;
>
> unsigned int(*bpf_func)(const struct sk_buff *skb,
> const struct sock_filter *filter) 
> cacheline_aligned;
> struct sock_filter  insns[0];
> };

Yes, it makes sense to avoid the first-insn cache miss inside sk_run_filter(),
at the expense of an 8-byte gap between work and bpf_func
(on x86_64 without lockdep).

Probably even better to overlap the work and insns fields.
Pro: sk_filter size stays the same, no impact on the non-JIT case.
Con: the code would be harder to understand.
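
For illustration, one way to picture the overlap (a sketch of the idea only;
the v2 patch below keeps insns[] as-is and casts fp->insns to a work_struct
pointer instead of using a union):

struct sk_filter
{
	atomic_t		refcnt;
	unsigned int		len;	/* Number of filter blocks */
	struct rcu_head		rcu;
	unsigned int		(*bpf_func)(const struct sk_buff *skb,
					    const struct sock_filter *filter);
	union {
		struct sock_filter	insns[0];	/* used by the interpreter */
		struct work_struct	work;		/* reused by the JIT free path;
							 * insns are dead after JIT */
	};
};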

Another problem is that the kfree(sk_filter) inside
sk_filter_release_rcu() needs to move inside bpf_jit_free(),
so self-NACK. Let me fix these issues and respin.

Thanks
Alexei


[PATCH v2 net-next] fix unsafe set_memory_rw from softirq

2013-10-03 Thread Alexei Starovoitov
on x86 system with net.core.bpf_jit_enable = 1

sudo tcpdump -i eth1 'tcp port 22'

causes the warning:
[   56.766097]  Possible unsafe locking scenario:
[   56.766097]
[   56.780146]CPU0
[   56.786807]----
[   56.793188]   lock(&(&vb->lock)->rlock);
[   56.799593]   <Interrupt>
[   56.805889] lock(&(&vb->lock)->rlock);
[   56.812266]
[   56.812266]  *** DEADLOCK ***
[   56.812266]
[   56.830670] 1 lock held by ksoftirqd/1/13:
[   56.836838]  #0:  (rcu_read_lock){.+.+..}, at: [] 
vm_unmap_aliases+0x8c/0x380
[   56.849757]
[   56.849757] stack backtrace:
[   56.862194] CPU: 1 PID: 13 Comm: ksoftirqd/1 Not tainted 3.12.0-rc3+ #45
[   56.868721] Hardware name: System manufacturer System Product Name/P8Z77 WS, 
BIOS 3007 07/26/2012
[   56.882004]  821944c0 88080bbdb8c8 8175a145 
0007
[   56.895630]  88080bbd5f40 88080bbdb928 81755b14 
0001
[   56.909313]  88080001 8808 8101178f 
0001
[   56.923006] Call Trace:
[   56.929532]  [] dump_stack+0x55/0x76
[   56.936067]  [] print_usage_bug+0x1f7/0x208
[   56.942445]  [] ? save_stack_trace+0x2f/0x50
[   56.948932]  [] ? check_usage_backwards+0x150/0x150
[   56.955470]  [] mark_lock+0x282/0x2c0
[   56.961945]  [] __lock_acquire+0x45d/0x1d50
[   56.968474]  [] ? __lock_acquire+0x2de/0x1d50
[   56.975140]  [] ? cpumask_next_and+0x55/0x90
[   56.981942]  [] lock_acquire+0x92/0x1d0
[   56.988745]  [] ? vm_unmap_aliases+0x16a/0x380
[   56.995619]  [] _raw_spin_lock+0x41/0x50
[   57.002493]  [] ? vm_unmap_aliases+0x16a/0x380
[   57.009447]  [] vm_unmap_aliases+0x16a/0x380
[   57.016477]  [] ? vm_unmap_aliases+0x8c/0x380
[   57.023607]  [] change_page_attr_set_clr+0xc0/0x460
[   57.030818]  [] ? trace_hardirqs_on+0xd/0x10
[   57.037896]  [] ? kmem_cache_free+0xb0/0x2b0
[   57.044789]  [] ? free_object_rcu+0x93/0xa0
[   57.051720]  [] set_memory_rw+0x2f/0x40
[   57.058727]  [] bpf_jit_free+0x2c/0x40
[   57.065577]  [] sk_filter_release_rcu+0x1a/0x30
[   57.072338]  [] rcu_process_callbacks+0x202/0x7c0
[   57.078962]  [] __do_softirq+0xf7/0x3f0
[   57.085373]  [] run_ksoftirqd+0x35/0x70

cannot reuse the JITed filter memory, since it's read-only,
so use the original BPF insns memory to hold the work_struct

defer kfree of sk_filter until the JIT has completed the freeing

tested on x86_64 and i386

Signed-off-by: Alexei Starovoitov 
---
 arch/x86/net/bpf_jit_comp.c |   20 +++-
 include/linux/filter.h  |9 +++--
 net/core/filter.c   |8 ++--
 3 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 79c216a..1396a0a 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -772,13 +772,23 @@ out:
return;
 }
 
+static void bpf_jit_free_deferred(struct work_struct *work)
+{
+   struct sk_filter *fp = container_of((void *)work, struct sk_filter,
+   insns);
+   unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
+   struct bpf_binary_header *header = (void *)addr;
+
+   set_memory_rw(addr, header->pages);
+   module_free(NULL, header);
+   kfree(fp);
+}
+
 void bpf_jit_free(struct sk_filter *fp)
 {
if (fp->bpf_func != sk_run_filter) {
-   unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
-   struct bpf_binary_header *header = (void *)addr;
-
-   set_memory_rw(addr, header->pages);
-   module_free(NULL, header);
+   struct work_struct *work = (struct work_struct *)fp->insns;
+   INIT_WORK(work, bpf_jit_free_deferred);
+   schedule_work(work);
}
 }
diff --git a/include/linux/filter.h b/include/linux/filter.h
index a6ac848..4876ac4 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -25,15 +25,20 @@ struct sk_filter
 {
atomic_trefcnt;
unsigned intlen;/* Number of filter blocks */
+   struct rcu_head rcu;
unsigned int(*bpf_func)(const struct sk_buff *skb,
const struct sock_filter *filter);
-   struct rcu_head rcu;
+   /* insns start right after bpf_func, so that sk_run_filter() fetches
+* first insn from the same cache line that was used to call into
+* sk_run_filter()
+*/
struct sock_filter  insns[0];
 };
 
 static inline unsigned int sk_filter_len(const struct sk_filter *fp)
 {
-   return fp->len * sizeof(struct sock_filter) + sizeof(*fp);
+   return max(fp->len * sizeof(struct sock_filter),
+  sizeof(struct work_struct)) + sizeof(*fp);
 }
 
 extern int sk_filter(struct sock *sk, struct sk_buff *skb);
diff --git a/net/core/filter.c b/net/core/filter.c
index 6438f29..1ebbc21 100644

Re: [PATCH v2 net-next] fix unsafe set_memory_rw from softirq

2013-10-03 Thread Alexei Starovoitov
On Thu, Oct 3, 2013 at 4:02 PM, Eric Dumazet  wrote:
> On Thu, 2013-10-03 at 15:47 -0700, Alexei Starovoitov wrote:
>> on x86 system with net.core.bpf_jit_enable = 1
>>
>
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -644,7 +644,9 @@ void sk_filter_release_rcu(struct rcu_head *rcu)
>>   struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
>>
>>   bpf_jit_free(fp);
>> +#if !defined(CONFIG_X86_64) /* x86_64 has a deferred free */
>>   kfree(fp);
>> +#endif
>
> Sorry this is not very nice.
>
> Make bpf_jit_free(fp) a bool ?  true : caller must free, false : caller
> must not free ?
>
> if (bpf_jit_free(fp))
> kfree(fp);
>
> Or move the kfree() in bpf_jit_free()

I think it's cleaner too; I just didn't want to touch all architectures.
Will do then.
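
For reference, a minimal sketch of the "move the kfree() into bpf_jit_free()"
option (function names follow the v2 patch above; how the interpreter branch
is handled here is an assumption, not the posted code):

void bpf_jit_free(struct sk_filter *fp)
{
	if (fp->bpf_func != sk_run_filter) {
		struct work_struct *work = (struct work_struct *)fp->insns;

		INIT_WORK(work, bpf_jit_free_deferred);
		schedule_work(work);	/* deferred path also does kfree(fp) */
	} else {
		kfree(fp);		/* interpreted filter: free immediately */
	}
}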


Re: [PATCH v2 net-next] fix unsafe set_memory_rw from softirq

2013-10-03 Thread Alexei Starovoitov
On Thu, Oct 3, 2013 at 4:07 PM, Eric Dumazet  wrote:
> On Thu, 2013-10-03 at 15:47 -0700, Alexei Starovoitov wrote:
>
>> @@ -722,7 +725,8 @@ EXPORT_SYMBOL_GPL(sk_unattached_filter_destroy);
>>  int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
>>  {
>>   struct sk_filter *fp, *old_fp;
>> - unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
>> + unsigned int fsize = max(sizeof(struct sock_filter) * fprog->len,
>> +  sizeof(struct work_struct));
>>   int err;
>>
>>   if (sock_flag(sk, SOCK_FILTER_LOCKED))
>
> Thats broken, as we might copy more data from user than expected,
> and eventually trigger EFAULT :
>
> if (copy_from_user(fp->insns, fprog->filter, fsize)) {

yes. will fix.
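
For reference, a minimal sketch of the fix being discussed (variable names are
illustrative; the allocation grows to fit the work_struct while the
copy_from_user() length stays at the user-supplied size):

	unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
	unsigned int sk_fsize = max_t(u32, fsize, sizeof(struct work_struct));
	struct sk_filter *fp;

	fp = sock_kmalloc(sk, sk_fsize + sizeof(*fp), GFP_KERNEL);
	if (!fp)
		return -ENOMEM;

	/* copy only what userspace actually supplied */
	if (copy_from_user(fp->insns, fprog->filter, fsize)) {
		sock_kfree_s(sk, fp, sk_fsize + sizeof(*fp));
		return -EFAULT;
	}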


Re: [PATCH v2 net-next] fix unsafe set_memory_rw from softirq

2013-10-03 Thread Alexei Starovoitov
On Thu, Oct 3, 2013 at 4:11 PM, Alexei Starovoitov  wrote:
> On Thu, Oct 3, 2013 at 4:07 PM, Eric Dumazet  wrote:
>> On Thu, 2013-10-03 at 15:47 -0700, Alexei Starovoitov wrote:
>>
>>> @@ -722,7 +725,8 @@ EXPORT_SYMBOL_GPL(sk_unattached_filter_destroy);
>>>  int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
>>>  {
>>>   struct sk_filter *fp, *old_fp;
>>> - unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
>>> + unsigned int fsize = max(sizeof(struct sock_filter) * fprog->len,
>>> +  sizeof(struct work_struct));
>>>   int err;
>>>
>>>   if (sock_flag(sk, SOCK_FILTER_LOCKED))
>>
>> Thats broken, as we might copy more data from user than expected,
>> and eventually trigger EFAULT :
>>
>> if (copy_from_user(fp->insns, fprog->filter, fsize)) {
>
> yes. will fix.

Tested on x86_64/i386 only,
with tcpdump and netsniff, 1-4k filter sizes.
Thank you for the careful review.


Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes)

2016-07-19 Thread Alexei Starovoitov
On Wed, Jul 20, 2016 at 01:19:51AM +0200, Daniel Borkmann wrote:
> On 07/19/2016 06:34 PM, Alexei Starovoitov wrote:
> >On Tue, Jul 19, 2016 at 01:17:53PM +0200, Daniel Borkmann wrote:
> >>>+  return -EINVAL;
> >>>+
> >>>+  /* Is this a user address, or a kernel address? */
> >>>+  if (!access_ok(VERIFY_WRITE, to, size))
> >>>+  return -EINVAL;
> >>>+
> >>>+  return probe_kernel_write(to, from, size);
> >>
> >>I'm still worried that this can lead to all kind of hard to find
> >>bugs or races for user processes, if you make this writable to entire
> >>user address space (which is the only thing that access_ok() checks
> >>for). What if the BPF program has bugs and writes to wrong addresses
> >>for example, introducing bugs in some other, non-intended processes
> >>elsewhere? Can this be limited to syscalls only? And if so, to the
> >>passed memory only?
> >
> >my understanding that above code will write only to memory of current 
> >process,
> >so impact is contained and in that sense buggy kprobe program is no different
> >from buggy seccomp program.
> 
> Compared to seccomp, you might not notice that a race has happened,
> in seccomp case you might have killed your process, which is visible.
> But ok, in ptrace() case it might be similar issue perhaps ...
> 
> The asm-generic version does __access_ok(..) { return 1; } for nommu
> case, I haven't checked closely enough whether there's actually an arch
> that uses this, but for example arm nommu with only one addr space would
> certainly result in access_ok() as 1, and then you could also go into
> probe_kernel_write(), no?

Good point. How does arm nommu handle copy_to_user? If there is no MMU,
then there is no user/kernel address space split? Crazy archs.
I guess we have to disable this helper on all such archs.

> Don't know that code well enough, but I believe the check would only
> ensure in normal use-cases that user process doesn't fiddle with kernel
> address space, but not necessarily guarantee that this really only
> belongs to the process address space.

Why? On x86 that's exactly what it does. access_ok == true means
it's a user space address, and since we're in _this user context_,
probe_kernel_write can only affect this user.
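
Roughly what "checks the space range" means here (a simplified sketch of the
x86 semantics under discussion, not the real implementation):

/* access_ok() is only a range check against the user/kernel split;
 * it does not prove that the page is mapped or writable.
 */
static inline bool access_ok_sketch(const void *addr, unsigned long size)
{
	unsigned long limit = user_addr_max();	/* TASK_SIZE_MAX under USER_DS on x86 */
	unsigned long a = (unsigned long)addr;

	return size <= limit && a <= limit - size;
}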

> x86 code comments this with "note that, depending on architecture,
> this function probably just checks that the pointer is in the user
> space range - after calling this function, memory access functions may
> still return -EFAULT".

Yes, I've read that comment too :)
Certainly not an expert, but on the archs I've looked at, access_ok
has the same meaning as on x86: they check the address range to
make sure the address doesn't belong to the kernel.
Could I have missed something? Certainly. Please double check :)

> Also, what happens in case of kernel thread?

My understanding is that if access_ok(addr) == true, the addr will never
point to memory of a kernel thread.
We need an expert opinion. Whom should we ping?

> As it stands, it does ...
> 
>   if (unlikely(in_interrupt()))
>   return -EINVAL;
>   if (unlikely(!task || !task->pid))
>   return -EINVAL;
> 
> So up to here, irq/sirq, NULL current and that current is not the 'idle'
> process is being checked (still fail to see the point for the !task->pid,
> I believe the intend here is different).
> 
>   /* Is this a user address, or a kernel address? */
>   if (!access_ok(VERIFY_WRITE, to, size))
>   return -EINVAL;
> 
> Now here. What if it's a kernel thread? You'll have KERNEL_DS segment,
> task->pid was non-zero as well for the kthread, so access_ok() will
> pass and you can still execute probe_kernel_write() ...

I think user_addr_max() should be zero for a kthread, but
it's worth checking to be sure.

> >Limiting this to syscalls will make it too limited.
> >I'm in favor of this change, because it allows us to experiment
> >with restartable sequences and lock-free algorithms that need ultrafast
> >access to cpuid without burdening the kernel with stable abi.
> >
> >>Have you played around with ptrace() to check whether you could
> >>achieve similar functionality (was thinking about things like [1],
> >>PTRACE_PEEK{TEXT,DATA} / PTRACE_POKE{TEXT,DATA}). If not, why can't
> >>this be limited to a similar functionality for only the current task.
> >>ptrace() utilizes helpers like access_process_vm(), maybe this can
> >>similarly be adapted here, too (under the circumstances that sleeping
> >>is not allowed)?
> >
> >If we hack access_process_vm I think at the end

Re: [PATCH v4 1/2] bpf: Add bpf_probe_write BPF helper to be called in tracers (kprobes)

2016-07-21 Thread Alexei Starovoitov
On Thu, Jul 21, 2016 at 06:09:17PM -0700, Sargun Dhillon wrote:
> This allows user memory to be written to during the course of a kprobe.
> It shouldn't be used to implement any kind of security mechanism
> because of TOC-TOU attacks, but rather to debug, divert, and
> manipulate execution of semi-cooperative processes.
> 
> Although it uses probe_kernel_write, we limit the address space
> the probe can write into by checking the space with access_ok.
> This is so the call doesn't sleep.
> 
> Given this feature is experimental, and has the risk of crashing
> the system, we print a warning on invocation.
> 
> It was tested with the tracex7 program on x86-64.
> 
> Signed-off-by: Sargun Dhillon 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> ---
>  include/uapi/linux/bpf.h  | 12 
>  kernel/bpf/verifier.c |  9 +
>  kernel/trace/bpf_trace.c  | 37 +
>  samples/bpf/bpf_helpers.h |  2 ++
>  4 files changed, 60 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2b7076f..4536282 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -365,6 +365,18 @@ enum bpf_func_id {
>*/
>   BPF_FUNC_get_current_task,
>  
> + /**
> +  * bpf_probe_write(void *dst, void *src, int len)
> +  * safely attempt to write to a location
> +  * @dst: destination address in userspace
> +  * @src: source address on stack
> +  * @len: number of bytes to copy
> +  * Return:
> +  *   Returns number of bytes that could not be copied.
> +  *   On success, this will be zero

That is an odd comment.
There are only three possible return values: 0, -EFAULT, -EPERM.

> +  */
> + BPF_FUNC_probe_write,
> +
>   __BPF_FUNC_MAX_ID,
>  };
>  
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index f72f23b..6785008 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1154,6 +1154,15 @@ static int check_call(struct verifier_env *env, int 
> func_id)
>   return -EINVAL;
>   }
>  
> + if (func_id == BPF_FUNC_probe_write) {
> + 
> pr_warn_once("\n");
> + pr_warn_once("* bpf_probe_write: Experimental Feature in use 
> *\n");
> + pr_warn_once("* bpf_probe_write: Feature may corrupt memory  
> *\n");
> + 
> pr_warn_once("\n");
> + pr_notice_ratelimited("bpf_probe_write in use by: %.16s-%d",
> +   current->comm, task_pid_nr(current));
> + }

I think a single-line pr_notice_ratelimited() with 'feature may corrupt user
memory' will be enough.
Also please move this into the tracing-specific part, bpf_trace.c,
similar to bpf_get_trace_printk_proto(), instead of verifier.c.

> +static u64 bpf_probe_write(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
> +{
> + void *unsafe_ptr = (void *) (long) r1;
> + void *src = (void *) (long) r2;
> + int size = (int) r3;
> + struct task_struct *task = current;
> +
> + /*

bpf_trace.c follows the non-net comment style, so it's fine here;
it's just distracting next to the rest of the net style.

> +  * Ensure we're in a user context which it is safe for the helper
> +  * to run. This helper has no business in a kthread
> +  *
> +  * access_ok should prevent writing to non-user memory, but on
> +  * some architectures (nommu, etc...) access_ok isn't enough
> +  * So we check the current segment
> +  */
> +
> + if (unlikely(in_interrupt() || (task->flags & PF_KTHREAD)))
> + return -EPERM;

Should we also add a check for !PF_EXITING?
Signals are not delivered to such tasks, and I'm not sure
what the state of their mm would be.
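
A sketch of the extra guard being discussed (assumption: folded into the
existing checks, as suggested later in the thread):

	if (unlikely(in_interrupt() ||
		     (current->flags & (PF_KTHREAD | PF_EXITING))))
		return -EPERM;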

> + if (unlikely(segment_eq(get_fs(), KERNEL_DS)))
> + return -EPERM;
> + if (!access_ok(VERIFY_WRITE, unsafe_ptr, size))
> + return -EPERM;
> +
> + return probe_kernel_write(unsafe_ptr, src, size);
> +}
> +
> +static const struct bpf_func_proto bpf_probe_write_proto = {
> + .func   = bpf_probe_write,
> + .gpl_only   = true,
> + .ret_type   = RET_INTEGER,
> + .arg1_type  = ARG_ANYTHING,
> + .arg2_type  = ARG_PTR_TO_STACK,
> + .arg3_type  = ARG_CONST_STACK_SIZE,

I have second thoughts on naming.
I think 'consistency' with probe_read is actually hurting here.
People derive the semantics of a helper mainly from its name.
If we call it bpf_probe_read, it would mean that it

Re: [PATCH v4 1/2] bpf: Add bpf_probe_write BPF helper to be called in tracers (kprobes)

2016-07-22 Thread Alexei Starovoitov
On Fri, Jul 22, 2016 at 11:53:52AM +0200, Daniel Borkmann wrote:
> On 07/22/2016 04:14 AM, Alexei Starovoitov wrote:
> >On Thu, Jul 21, 2016 at 06:09:17PM -0700, Sargun Dhillon wrote:
> >>This allows user memory to be written to during the course of a kprobe.
> >>It shouldn't be used to implement any kind of security mechanism
> >>because of TOC-TOU attacks, but rather to debug, divert, and
> >>manipulate execution of semi-cooperative processes.
> >>
> >>Although it uses probe_kernel_write, we limit the address space
> >>the probe can write into by checking the space with access_ok.
> >>This is so the call doesn't sleep.
> >>
> >>Given this feature is experimental, and has the risk of crashing
> >>the system, we print a warning on invocation.
> >>
> >>It was tested with the tracex7 program on x86-64.
> >>
> >>Signed-off-by: Sargun Dhillon 
> >>Cc: Alexei Starovoitov 
> >>Cc: Daniel Borkmann 
> >>---
> >>  include/uapi/linux/bpf.h  | 12 
> >>  kernel/bpf/verifier.c |  9 +
> >>  kernel/trace/bpf_trace.c  | 37 +
> >>  samples/bpf/bpf_helpers.h |  2 ++
> >>  4 files changed, 60 insertions(+)
> >>
> >>diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >>index 2b7076f..4536282 100644
> >>--- a/include/uapi/linux/bpf.h
> >>+++ b/include/uapi/linux/bpf.h
> >>@@ -365,6 +365,18 @@ enum bpf_func_id {
> >> */
> >>BPF_FUNC_get_current_task,
> >>
> >>+   /**
> >>+* bpf_probe_write(void *dst, void *src, int len)
> >>+* safely attempt to write to a location
> >>+* @dst: destination address in userspace
> >>+* @src: source address on stack
> >>+* @len: number of bytes to copy
> >>+* Return:
> >>+*   Returns number of bytes that could not be copied.
> >>+*   On success, this will be zero
> >
> >that is odd comment.
> >there are only three possible return values 0, -EFAULT, -EPERM
> 
> Agree.
> 
> >>+*/
> >>+   BPF_FUNC_probe_write,
> >>+
> >>__BPF_FUNC_MAX_ID,
> >>  };
> >>
> >>diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >>index f72f23b..6785008 100644
> >>--- a/kernel/bpf/verifier.c
> >>+++ b/kernel/bpf/verifier.c
> >>@@ -1154,6 +1154,15 @@ static int check_call(struct verifier_env *env, int 
> >>func_id)
> >>return -EINVAL;
> >>}
> >>
> >>+   if (func_id == BPF_FUNC_probe_write) {
> >>+   
> >>pr_warn_once("\n");
> >>+   pr_warn_once("* bpf_probe_write: Experimental Feature in use 
> >>*\n");
> >>+   pr_warn_once("* bpf_probe_write: Feature may corrupt memory  
> >>*\n");
> >>+   
> >>pr_warn_once("\n");
> >>+   pr_notice_ratelimited("bpf_probe_write in use by: %.16s-%d",
> >>+ current->comm, task_pid_nr(current));
> >>+   }
> >
> >I think single line pr_notice_ratelimited() with 'feature may corrupt user 
> >memory'
> >will be enough.
> 
> Agree.
> 
> >Also please move this to tracing specific part into bpf_trace.c
> >similar to bpf_get_trace_printk_proto() instead of verifier.c
> 
> Yes, sorry for not being too clear about it, this spot will then be
> called by the verifier to fetch it for every function call. Meaning that
> this could get printed multiple times for loading a single program, but
> I think it's okay. If single line, I'd make that pr_warn_ratelimited(),
> and probably something like ...

yes, but inside check_call() it may be printed even more times, since the same
call site can be verified multiple times due to different reg types.

>  "note: %s[%d] is installing a program with bpf_probe_write helper that may 
> corrupt user memory!",
>  current->comm, task_pid_nr(current)

sounds good to me.

> 
> You can make that 'current->flags & (PF_KTHREAD | PF_EXITING)' and
> we don't need the extra task var either.

+1

> >>+   if (unlikely(segment_eq(get_fs(), KERNEL_DS)))
> >>+   return -EPERM;
> >>+   if (!access_ok(VERIFY_WRITE, unsafe_ptr, size))
> >>+   return -EPERM;
> >>+
> >>+   ret

Re: [PATCH v4 1/2] bpf: Add bpf_probe_write BPF helper to be called in tracers (kprobes)

2016-07-23 Thread Alexei Starovoitov
On Fri, Jul 22, 2016 at 05:05:27PM -0700, Sargun Dhillon wrote:
> It was tested with the tracex7 program on x86-64.

It's my fault for starting the tracexN tradition, which turned out to be
cumbersome; let's not continue it. Instead, could you rename it
to something meaningful, like test_probe_write_user?
Right now it just prints the client's peer address and a human needs to
visually verify that probe_write_user actually happened; if you can
convert it into a test it will help a lot.
We were planning to convert all of samples/bpf/ into tests,
so we can run them continuously.

btw, a single-patch re-submit will not be picked up. Please always
re-submit the whole patch set together.

> +static const struct bpf_func_proto *bpf_get_probe_write_proto(void) {
> + pr_warn_once("*\n");
> + pr_warn_once("* bpf_probe_write_user: Experimental Feature in use *\n");
> + pr_warn_once("* bpf_probe_write_user: Feature may corrupt memory  *\n");
> + pr_warn_once("*\n");
> + pr_notice_ratelimited("bpf_probe_write_user: %s[%d] installing program 
> with helper: it may corrupt user memory!",
> + current->comm, task_pid_nr(current));

I thought we were agreeing on a single pr_warn_ratelimited without the banner?

The rest looks good.
Thanks!



Re: [PATCH net-next v5 1/2] bpf: Add bpf_probe_write_user BPF helper to be called in tracers

2016-07-23 Thread Alexei Starovoitov
On Sat, Jul 23, 2016 at 05:43:48PM -0700, Sargun Dhillon wrote:
> This allows user memory to be written to during the course of a kprobe.
> It shouldn't be used to implement any kind of security mechanism
> because of TOC-TOU attacks, but rather to debug, divert, and
> manipulate execution of semi-cooperative processes.
> 
> Although it uses probe_kernel_write, we limit the address space
> the probe can write into by checking the space with access_ok.
> This is so the call doesn't sleep. In addition we ensure the threads's
> current fs / segment is USER_DS and the thread isn't exiting nor
> a kernel thread.
> 
> Given this feature is experimental, and has the risk of crashing the
> system, we print a warning on first invocation, and the process name
> on subsequent invocations.
> 
> It was tested with the tracex7 program on x86-64.

s/tracex7/test_probe_write_user from the next patch/
or just drop this sentence.

> +static const struct bpf_func_proto *bpf_get_probe_write_proto(void) {
> + pr_warn_ratelimited("bpf_probe_write_user: %s[%d] installing program 
> with helper: it may corrupt user memory!",
> + current->comm, task_pid_nr(current));

I think checkpatch should have complained here:
the current->comm line should start under the opening "

No other nits for this patch :)
Once fixed, feel free to add my Acked-by: Alexei Starovoitov 



Re: [PATCH net-next v5 2/2] samples/bpf: Add test/example of using bpf_probe_write_user bpf helper

2016-07-23 Thread Alexei Starovoitov
On Sat, Jul 23, 2016 at 05:44:11PM -0700, Sargun Dhillon wrote:
> This example shows using a kprobe to act as a dnat mechanism to divert
> traffic for arbitrary endpoints. It rewrite the arguments to a syscall
> while they're still in userspace, and before the syscall has a chance
> to copy the argument into kernel space.
> 
> Although this is an example, it also acts as a test because the mapped
> address is 255.255.255.255:555 -> real address, and that's not a legal
> address to connect to. If the helper is broken, the example will fail.

nice. makes sense.

> Signed-off-by: Sargun Dhillon 
...
> +/* Copyright (c) 2013-2015 PLUMgrid, http://plumgrid.com

hmm, not sure what to think about this line.
plumgrid got a new employee? ;)

> + if (load_bpf_file(filename)) {
> + printf("%s", bpf_log_buf);
> + return 1;
> + }
> +
> + /* Is the server's getsockname = the socket getpeername */
> + assert(memcmp(&serv_addr, &tmp_addr, sizeof(struct sockaddr_in)) == 0);

Thanks. So $? == 0 will indicate success, right?
After the respin feel free to add my ack.



Re: [PATCH v4 1/2] bpf: Add bpf_probe_write BPF helper to be called in tracers (kprobes)

2016-07-23 Thread Alexei Starovoitov
On Sat, Jul 23, 2016 at 05:39:42PM -0700, Sargun Dhillon wrote:
> The example has been modified to act like a test in the follow up set. It 
> tests 
> for the positive case (Did the helper work or not) as opposed to the negative 
> case (is the helper able to violate the safety constraints we set forth)? I 
> could do that as well, in another patch by mprotecting those pages, or some 
> such. Should I add an additional negative test?

That would be awesome, but doesn't have to be in this patch set.
It can be done as a follow up.

Thanks!



Re: [PATCH net-next v6 0/2] bpf: add bpf_probe_write_user helper & example

2016-07-23 Thread Alexei Starovoitov
On Sat, Jul 23, 2016 at 08:22:04PM -0700, Sargun Dhillon wrote:
> This patch series contains two patches that add support for a probe_write
> helper to BPF programs. This allows them to manipulate user memory during
> the course of tracing. The second patch in the series has an example that
> uses it, in one the intended ways to divert execution.
> 
> Thanks to Alexei Starovoitov, and Daniel Borkmann for review, I've made
> changes based on their recommendations. 
> 
> This helper should be considered experimental, so we print a warning
> to dmesg when it is along with the command and pid. A follow-up patchset
> will contain a mechanism to verify the safety of the probe beyond what
> was done by hand.

I'd like to clarify the above: 'helper is experimental' means that
it should only be used for experiments and not in production.
That's what the warning is for.
If Dave applies it, it will be permanent ABI and cannot be removed.
In other words it's for debugging user apps and trying out crazy ideas,
like the experiments we will run with different approaches around
restartable sequences and tracing.



lsm naming dilemma. Re: [RFC v3 07/22] landlock: Handle file comparisons

2016-09-19 Thread Alexei Starovoitov
On Thu, Sep 15, 2016 at 11:25:10PM +0200, Mickaël Salaün wrote:
> >> Agreed. With this RFC, the Checmate features (i.e. network helpers)
> >> should be able to sit on top of Landlock.
> > 
> > I think neither of them should be called fancy names for no technical 
> > reason.
> > We will have only one bpf based lsm. That's it and it doesn't
> > need an obscure name. Directory name can be security/bpf/..stuff.c
> 
> I disagree on an LSM named "BPF". I first started with the "seccomp LSM"
> name (first RFC) but I later realized that it is confusing because
> seccomp is associated to its syscall and the underlying features. Same
> thing goes for BPF. It is also artificially hard to grep on a name too
> used in the kernel source tree.
> Making an association between the generic eBPF mechanism and a security
> centric approach (i.e. LSM) seems a bit reductive (for BPF). Moreover,
> the seccomp interface [1] can still be used.

agree with above.

> Landlock is a nice name to depict a sandbox as an enclave (i.e. a
> landlocked country/state). I want to keep this name, which is simple,
> express the goal of Landlock nicely and is comparable to other sandbox
> mechanisms as Seatbelt or Pledge.
> Landlock should not be confused with the underlying eBPF implementation.
> Landlock could use more than only eBPF in the future and eBPF could be
> used in other LSM as well.

There will not be two BPF-based LSMs.
Therefore, unless you can convince Sargun to give up his 'checmate' name,
nothing goes in.
The features you both need are 90% the same, so they must be done
as part of a single LSM, whatever you both agree to call it.



Re: [PATCH v1] cgroup,bpf: Add access check for cgroup_get_from_fd()

2016-09-19 Thread Alexei Starovoitov
On Tue, Sep 20, 2016 at 12:49:13AM +0200, Mickaël Salaün wrote:
> Add security access check for cgroup backed FD. The "cgroup.procs" file
> of the corresponding cgroup should be readable to identify the cgroup,
> and writable to prove that the current process can manage this cgroup
> (e.g. through delegation). This is similar to the check done by
> cgroup_procs_write_permission().
> 
> Fixes: 4ed8ec521ed5 ("cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAY")

I don't understand what the 'Fixes' tag is about.
This looks like a new feature or a tightening?
Since the cgroup was opened by the process and it got an fd,
it had access, so the extra check here looks unnecessary.

> -struct cgroup *cgroup_get_from_fd(int fd)
> +struct cgroup *cgroup_get_from_fd(int fd, int access_mask)
>  {
>   struct cgroup_subsys_state *css;
>   struct cgroup *cgrp;
>   struct file *f;
> + struct inode *inode;
> + int ret;
>  
>   f = fget_raw(fd);
>   if (!f)
>   return ERR_PTR(-EBADF);
>  
>   css = css_tryget_online_from_dir(f->f_path.dentry, NULL);
> - fput(f);

why move it down?

> - if (IS_ERR(css))
> - return ERR_CAST(css);
> + if (IS_ERR(css)) {
> + ret = PTR_ERR(css);
> + goto put_f;
> + }
>  
>   cgrp = css->cgroup;
>   if (!cgroup_on_dfl(cgrp)) {
> - cgroup_put(cgrp);
> - return ERR_PTR(-EBADF);
> + ret = -EBADF;
> + goto put_cgrp;
> + }
> +
> + ret = -ENOMEM;
> + inode = kernfs_get_inode(f->f_path.dentry->d_sb, cgrp->procs_file.kn);
> + if (inode) {
> + ret = inode_permission(inode, access_mask);
> + iput(inode);
>   }
> + if (ret)
> + goto put_cgrp;
>  
> + fput(f);
>   return cgrp;
> +
> +put_cgrp:
> + cgroup_put(cgrp);
> +put_f:
> + fput(f);
> + return ERR_PTR(ret);
>  }
>  EXPORT_SYMBOL_GPL(cgroup_get_from_fd);
>  
> -- 
> 2.9.3
> 


Re: [PATCH 2/2] perf record: Add --dry-run option to check cmdline options

2016-06-20 Thread Alexei Starovoitov
On Mon, Jun 20, 2016 at 11:38:18AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Mon, Jun 20, 2016 at 11:29:13AM +0800, Wangnan (F) escreveu:
> > On 2016/6/17 0:48, Arnaldo Carvalho de Melo wrote:
> > >Em Thu, Jun 16, 2016 at 08:02:41AM +, Wang Nan escreveu:
> > >>With '--dry-run', 'perf record' doesn't do reall recording. Combine with
> > >>llvm.dump-obj option, --dry-run can be used to help compile BPF objects 
> > >>for
> > >>embedded platform.
> > >So these are nice and have value, but can we have a subcommand to do all
> > >this with an expressive name, Something like:
> 
> > >   perf bpfcc foo.c -o foo
> 
> > >or shorter:
> 
> > >   perf bcc foo.c -o foo
> 
> > >Just like one would use gcc or some other compiler to generate something
> > >for later use?
> 
> > I'll try it today. I thought a subcommand require a bigger feature,
> > and wrapping clang is not big enough.
> 
> Not really, we may have as many as we like, given that they provide
> something useful, like I think is the case here.
> 
> Having to edit ~/.perfconfig, create a new section, a variable in it
> with a boolean value (at first, just reading the changeset comment, I
> thought I had to provide a directory where to store the objects
> "dumped"), to then use a tool to record a .c event, but not recording
> (use dry-run, which is useful to test the command line, etc), to then
> get, on the current directory, the end result looked to me a convoluted
> way to ask perf to compile the given .c file into a .o for later use.
> 
> Doing:
> 
>   perf bcc -c foo.c
> 
> Looks so much simpler and similar to an existing compile source code
> into object file workflow (gcc's, any C compiler) that I think it would
> fit in the workflow being discussed really nicely.

I'm hopeful that eventually we'll be able to merge the iovisor/bcc project
with perf, so it would be good to reserve the 'perf bcc' command for that
future use. Also, picking a different name for compiling would be less
confusing to users who are already familiar with bcc. Instead we could use:
perf bpfcc foo.c -o foo.o
perf cc foo.c
perf compile foo.c



Re: [PATCH] ppc: Fix BPF JIT for ABIv2

2016-06-21 Thread Alexei Starovoitov

On 6/21/16 7:47 AM, Thadeu Lima de Souza Cascardo wrote:


The calling convention is different with ABIv2 and so we'll need changes
in bpf_slow_path_common() and sk_negative_common().


How big would those changes be? Do we know?

How come no one reported this was broken previously? This is the first I've
heard of it being broken.



I just heard of it less than two weeks ago, and only could investigate it last
week, when I realized mainline was also affected.

It looks like the little-endian support for classic JIT were done before the
conversion to ABIv2. And as JIT is disabled by default, no one seems to have
exercised it.


It's not a surprise, unfortunately. The JITs that were written before
test_bpf.ko was developed were missing corner cases. A typical tcpdump
filter would be fine, but fragmented packets, negative offsets and
out-of-bounds accesses wouldn't be handled correctly.
I'd suggest validating the stable backport with test_bpf as well.



Re: [PATCH -next 2/4] cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAY

2016-06-21 Thread Alexei Starovoitov
On Tue, Jun 21, 2016 at 05:23:20PM -0700, Martin KaFai Lau wrote:
> Add a BPF_MAP_TYPE_CGROUP_ARRAY and its bpf_map_ops's implementations.
> To update an element, the caller is expected to obtain a cgroup2 backed
> fd by open(cgroup2_dir) and then update the array with that fd.
> 
> Signed-off-by: Martin KaFai Lau 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> Cc: Tejun Heo 

Acked-by: Alexei Starovoitov 


Re: [PATCH -next 4/4] cgroup: bpf: Add an example to do cgroup checking in BPF

2016-06-21 Thread Alexei Starovoitov
On Tue, Jun 21, 2016 at 05:23:22PM -0700, Martin KaFai Lau wrote:
> test_cgrp2_array_pin.c:
> A userland program that creates a bpf_map (BPF_MAP_TYPE_GROUP_ARRAY),
> pouplates/updates it with a cgroup2's backed fd and pins it to a
> bpf-fs's file.  The pinned file can be loaded by tc and then used
> by the bpf prog later.  This program can also update an existing pinned
> array and it could be useful for debugging/testing purpose.
> 
> test_cgrp2_tc_kern.c:
> A bpf prog which should be loaded by tc.  It is to demonstrate
> the usage of bpf_skb_in_cgroup.
> 
> test_cgrp2_tc.sh:
> A script that glues the test_cgrp2_array_pin.c and
> test_cgrp2_tc_kern.c together.  The idea is like:
> 1. Use test_cgrp2_array_pin.c to populate a BPF_MAP_TYPE_CGROUP_ARRAY
>with a cgroup fd
> 2. Load the test_cgrp2_tc_kern.o by tc
> 3. Do a 'ping -6 ff02::1%ve' to ensure the packet has been
>dropped because of a match on the cgroup
> 
> Most of the lines in test_cgrp2_tc.sh is the boilerplate
> to setup the cgroup/bpf-fs/net-devices/netns...etc.  It is
> not bulletproof on errors but should work well enough and
> give enough debug info if things did not go well.
> 
> Signed-off-by: Martin KaFai Lau 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> Cc: Tejun Heo 
> ---
>  samples/bpf/Makefile   |   3 +
>  samples/bpf/bpf_helpers.h  |   2 +
>  samples/bpf/test_cgrp2_array_pin.c | 109 +
>  samples/bpf/test_cgrp2_tc.sh   | 189 
> +
>  samples/bpf/test_cgrp2_tc_kern.c   |  71 ++
>  5 files changed, 374 insertions(+)
...
> +struct bpf_elf_map SEC("maps") test_cgrp2_array_pin = {
> + .type   = BPF_MAP_TYPE_CGROUP_ARRAY,
> + .size_key   = sizeof(uint32_t),
> + .size_value = sizeof(uint32_t),
> + .pinning= PIN_GLOBAL_NS,
> + .max_elem   = 1,
> +};
> +
> +SEC("filter")
> +int handle_egress(struct __sk_buff *skb)
> +{
> + void *data = (void *)(long)skb->data;
> + struct eth_hdr *eth = data;
> + struct ipv6hdr *ip6h = data + sizeof(*eth);
> + void *data_end = (void *)(long)skb->data_end;
> + char dont_care_msg[] = "dont care %04x %d\n";
> + char pass_msg[] = "pass\n";
> + char reject_msg[] = "reject\n";
> +
> + /* single length check */
> + if (data + sizeof(*eth) + sizeof(*ip6h) > data_end)
> + return TC_ACT_OK;

love the test case.
It's using tc + clsact + cls_bpf in da mode + bpffs + direct packet access
and new cgroup helper.
All the most recent features I can think of :)

Acked-by: Alexei Starovoitov 



Re: [PATCH -next 3/4] cgroup: bpf: Add bpf_skb_in_cgroup_proto

2016-06-21 Thread Alexei Starovoitov
On Tue, Jun 21, 2016 at 05:23:21PM -0700, Martin KaFai Lau wrote:
> Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk
> belongs to a descendant of a cgroup2.  It is similar to the
> feature added in netfilter:
> commit c38c4597e4bf ("netfilter: implement xt_cgroup cgroup2 path match")
> 
> The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY
> which will be used by the bpf_skb_in_cgroup.
> 
> Modifications to the bpf verifier is to ensure BPF_MAP_TYPE_CGROUP_ARRAY
> and bpf_skb_in_cgroup() are always used together.
> 
> Signed-off-by: Martin KaFai Lau 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> Cc: Tejun Heo 
> ---
>  include/uapi/linux/bpf.h |  1 +
>  kernel/bpf/verifier.c|  8 
>  net/core/filter.c| 36 
>  3 files changed, 45 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index ef4e386..a91714bd 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -314,6 +314,7 @@ enum bpf_func_id {
>*/
>   BPF_FUNC_skb_get_tunnel_opt,
>   BPF_FUNC_skb_set_tunnel_opt,
> + BPF_FUNC_skb_in_cgroup,
...
> +static u64 bpf_skb_in_cgroup(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
> +{
...
> + if (unlikely(!cgrp))
> + return -ENOENT;
> +
> + return cgroup_is_descendant(sock_cgroup_ptr(&sk->sk_cgrp_data), cgrp);

If you need to respin the patch for other reasons, please add kdoc
to bpf.h for this new helper, similar to the other helpers,
saying that return values 0 or 1 indicate the cgroup2 descendant relation
and < 0 indicates an error.
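
Roughly the kind of kdoc being asked for (wording is illustrative, not the
final uapi text):

	/**
	 * int bpf_skb_in_cgroup(skb, map, index)
	 *     Check cgroup2 membership of skb
	 *     @skb: pointer to skb
	 *     @map: pointer to bpf_map of BPF_MAP_TYPE_CGROUP_ARRAY type
	 *     @index: index of the cgroup in the bpf_map
	 *     Return:
	 *       == 0 skb failed the cgroup2 descendant test
	 *       == 1 skb succeeded the cgroup2 descendant test
	 *        < 0 error
	 */
	BPF_FUNC_skb_in_cgroup,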

Acked-by: Alexei Starovoitov 



Re: [PATCH v2] bpf: Set register type according to is_valid_access()

2016-09-22 Thread Alexei Starovoitov
On Thu, Sep 22, 2016 at 09:56:47PM +0200, Mickaël Salaün wrote:
> This fix a pointer leak when an unprivileged eBPF program read a pointer
> value from the context. Even if is_valid_access() returns a pointer
> type, the eBPF verifier replace it with UNKNOWN_VALUE. The register
> value containing an address is then allowed to leak. Moreover, this
> prevented unprivileged eBPF programs to use functions with (legitimate)
> pointer arguments.
> 
> This bug is not an issue for now because the only unprivileged eBPF
> program allowed is of type BPF_PROG_TYPE_SOCKET_FILTER and all the types
> from its context are UNKNOWN_VALUE. However, this fix is important for
> future unprivileged eBPF program types which could use pointers in their
> context.
> 
> Signed-off-by: Mickaël Salaün 
> Fixes: 969bf05eb3ce ("bpf: direct packet access")

Please drop the 'Fixes' tag and rewrite the commit log.
It's not a fix.
Right now only two reg types can be seen: PTR_TO_PACKET and PTR_TO_PACKET_END.
Both appear only in clsact and XDP programs, which are root-only.
So nothing is leaking at present.
At best this patch is a pre-patch for some future work.



Re: [PATCH 1/2] bpf samples: fix compiler errors with sockex2 and sockex3

2016-09-24 Thread Alexei Starovoitov
On Sat, Sep 24, 2016 at 02:10:04AM +0530, Naveen N. Rao wrote:
> These samples fail to compile as 'struct flow_keys' conflicts with
> definition in net/flow_dissector.h. Fix the same by renaming the
> structure used in the sample.
> 
> Signed-off-by: Naveen N. Rao 

Thanks for the fix.
Acked-by: Alexei Starovoitov 



Re: [PATCH 2/2] bpf samples: update tracex5 sample to use __seccomp_filter

2016-09-24 Thread Alexei Starovoitov
On Sat, Sep 24, 2016 at 02:10:05AM +0530, Naveen N. Rao wrote:
> seccomp_phase1() does not exist anymore. Instead, update sample to use
> __seccomp_filter(). While at it, set max locked memory to unlimited.
> 
> Signed-off-by: Naveen N. Rao 

Acked-by: Alexei Starovoitov 



Re: [PATCH 2/3] bpf powerpc: implement support for tail calls

2016-09-24 Thread Alexei Starovoitov
On Sat, Sep 24, 2016 at 12:33:54AM +0200, Daniel Borkmann wrote:
> On 09/23/2016 10:35 PM, Naveen N. Rao wrote:
> >Tail calls allow JIT'ed eBPF programs to call into other JIT'ed eBPF
> >programs. This can be achieved either by:
> >(1) retaining the stack setup by the first eBPF program and having all
> >subsequent eBPF programs re-using it, or,
> >(2) by unwinding/tearing down the stack and having each eBPF program
> >deal with its own stack as it sees fit.
> >
> >To ensure that this does not create loops, there is a limit to how many
> >tail calls can be done (currently 32). This requires the JIT'ed code to
> >maintain a count of the number of tail calls done so far.
> >
> >Approach (1) is simple, but requires every eBPF program to have (almost)
> >the same prologue/epilogue, regardless of whether they need it. This is
> >inefficient for small eBPF programs which may not sometimes need a
> >prologue at all. As such, to minimize impact of tail call
> >implementation, we use approach (2) here which needs each eBPF program
> >in the chain to use its own prologue/epilogue. This is not ideal when
> >many tail calls are involved and when all the eBPF programs in the chain
> >have similar prologue/epilogue. However, the impact is restricted to
> >programs that do tail calls. Individual eBPF programs are not affected.
> >
> >We maintain the tail call count in a fixed location on the stack and
> >updated tail call count values are passed in through this. The very
> >first eBPF program in a chain sets this up to 0 (the first 2
> >instructions). Subsequent tail calls skip the first two eBPF JIT
> >instructions to maintain the count. For programs that don't do tail
> >calls themselves, the first two instructions are NOPs.
> >
> >Signed-off-by: Naveen N. Rao 
> 
> Thanks for adding support, Naveen, that's really great! I think 2) seems
> fine as well in this context as prologue size can vary quite a bit here,
> and depending on program types likelihood of tail call usage as well (but
> I wouldn't expect deep nesting). Thanks a lot!

Great stuff. In these circumstances approach 2 makes sense to me as well.



Re: [PATCH 00/14] perf clang: Support compiling BPF script use builtin clang

2016-09-24 Thread Alexei Starovoitov
On Fri, Sep 23, 2016 at 12:49:47PM +, Wang Nan wrote:
> This patch set is the first step to implement features I announced
> in LinuxCon NA 2016. See page 31 of:
> 
>  
> http://events.linuxfoundation.org/sites/events/files/slides/Performance%20Monitoring%20and%20Analysis%20Using%20perf%20and%20BPF_1.pdf
> 
> This patch set links LLVM and Clang libraries to perf, so perf
> is able to compile BPF script to BPF object on the fly.

Nice!
So a single perf binary won't have the external LLVM dependency anymore,
or will both ways be maintained?
Does the command line stay the same?
If I understand the patches correctly, this set is establishing
the basic functionality, with more complex features coming?



Re: [PATCH v3] bpf: Set register type according to is_valid_access()

2016-09-26 Thread Alexei Starovoitov
On Mon, Sep 26, 2016 at 04:49:17PM +0200, Daniel Borkmann wrote:
> On 09/24/2016 08:01 PM, Mickaël Salaün wrote:
> >This prevent future potential pointer leaks when an unprivileged eBPF
> >program will read a pointer value from its context. Even if
> >is_valid_access() returns a pointer type, the eBPF verifier replace it
> >with UNKNOWN_VALUE. The register value that contains a kernel address is
> >then allowed to leak. Moreover, this fix allows unprivileged eBPF
> >programs to use functions with (legitimate) pointer arguments.
> >
> >Not an issue currently since reg_type is only set for PTR_TO_PACKET or
> >PTR_TO_PACKET_END in XDP and TC programs that can only be loaded as
> >privileged. For now, the only unprivileged eBPF program allowed is for
> >socket filtering and all the types from its context are UNKNOWN_VALUE.
> >However, this fix is important for future unprivileged eBPF programs
> >which could use pointers in their context.
> >
> >Signed-off-by: Mickaël Salaün 
> >Cc: Alexei Starovoitov 
> >Cc: Daniel Borkmann 
> 
> Seems okay to me:
> 
> Acked-by: Daniel Borkmann 

Acked-by: Alexei Starovoitov 

Mickael, please mention [PATCH net-next] in subject next time.
Thanks



Re: [PATCH 00/14] perf clang: Support compiling BPF script use builtin clang

2016-09-26 Thread Alexei Starovoitov
On Mon, Sep 26, 2016 at 09:49:30AM +0800, Wangnan (F) wrote:
> 
> 
> On 2016/9/24 23:16, Alexei Starovoitov wrote:
> >On Fri, Sep 23, 2016 at 12:49:47PM +, Wang Nan wrote:
> >>This patch set is the first step to implement features I announced
> >>in LinuxCon NA 2016. See page 31 of:
> >>
> >>  
> >> http://events.linuxfoundation.org/sites/events/files/slides/Performance%20Monitoring%20and%20Analysis%20Using%20perf%20and%20BPF_1.pdf
> >>
> >>This patch set links LLVM and Clang libraries to perf, so perf
> >>is able to compile BPF script to BPF object on the fly.
> >Nice!
> >So single perf binary won't have llvm external dependency anymore
> >or both ways will be maintained?
> >The command line stays the same?
> 
> Yes. This patch set doesn't change interface. It compiles BPF script
> with builtin clang, and if it fail, fall back to external clang.
> 
> >If I understand the patches correctly, this set is establishing
> >the basic functionality and more complex features coming?
> >
> 
> Yes. Following steps are:
> 
>  1. Ease of use improvement: automatically include BPF functions
> declaration and macros.

+1

>  2. Perf's hook: compile part of BPF script into native code, run
> them in perf when something happen. Create a channel, coordinate
> BPF and native code use bpf-output event.

+1

>  3. Define a new language to support common profiling task. I'm not
> very clear what the new language should be. It may looks like lua,
> perf converts it to C then to LLVM IR with builtin clang.

Many tracing languages have been invented in the past.
At this point I'm not sure what exactly a new language would solve.
To make it easier to write BPF programs?
I think it will be more fruitful to tweak clang/llvm to add
good warnings/errors for cases where we know that the C is not going
to be compiled into code that the kernel verifier will accept.
For example, we can complain about loops, uninitialized variables,
non-inlined and unknown helper functions... all from clang/llvm.
IMO that would be the better path forward and would help
both tracing and networking users who write in this restricted C.
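
For illustration, the kind of restricted-C pitfalls meant here (a made-up
example, not taken from the thread; the verifier of this era rejects all three
patterns and clang could in principle flag them up front):

#define SEC(name) __attribute__((section(name), used))

static __attribute__((noinline)) int not_inlined(int x)
{
	return x * 2;			/* calls to non-inlined functions are rejected */
}

SEC("socket")
int restricted_c_pitfalls(void *ctx)
{
	int arr[4];			/* never initialized */
	int sum = 0;
	int i;

	for (i = 0; i < 1024; i++)	/* back-edges (loops) are rejected */
		sum += i;

	return not_inlined(sum + arr[2]);	/* uninitialized stack read is rejected */
}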



Re: [PATCH trival 1/2] bpf: clean up put_cpu_var usage

2016-09-26 Thread Alexei Starovoitov
On Mon, Sep 26, 2016 at 11:14:50AM -0700, Shaohua Li wrote:
> put_cpu_var takes the percpu data, not the data returned from
> get_cpu_var.
> 
> This doesn't change the behavior.
> 
> Cc: Tejun Heo 
> Cc: Alexei Starovoitov 
> Signed-off-by: Shaohua Li 

Looks good. Nice catch.
Please rebase to the net-next tree and send it to the netdev list.
Otherwise we'll have conflicts at merge-window time.

> ---
>  kernel/bpf/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 03fd23d..b73913b 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -1031,7 +1031,7 @@ u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, 
> u64 r5)
>  
>   state = &get_cpu_var(bpf_user_rnd_state);
>   res = prandom_u32_state(state);
> - put_cpu_var(state);
> + put_cpu_var(bpf_user_rnd_state);
>  
>   return res;
>  }
> -- 
> 2.9.3
> 


Re: [PATCH trival -resend 1/2] bpf: clean up put_cpu_var usage

2016-09-27 Thread Alexei Starovoitov
On Tue, Sep 27, 2016 at 08:42:41AM -0700, Shaohua Li wrote:
> put_cpu_var takes the percpu data, not the data returned from
> get_cpu_var.
> 
> This doesn't change the behavior.
> 
> Cc: Tejun Heo 
> Cc: Alexei Starovoitov 
> Signed-off-by: Shaohua Li 

Acked-by: Alexei Starovoitov 



Re: [PATCH] samples/bpf: fix resource leak on opened file descriptor

2016-07-28 Thread Alexei Starovoitov
On Sun, Jul 24, 2016 at 06:50:47PM +0100, Colin King wrote:
> From: Colin Ian King 
> 
> file f needs to be closed, fixes resource leak.
> 
> Signed-off-by: Colin Ian King 

Have been travelling, sorry for the delay.
Acked-by: Alexei Starovoitov 



Re: [PATCH v2] bpf: silence warnings when building kernel/bpf/core.c with W=1

2016-07-31 Thread Alexei Starovoitov
On Mon, Aug 01, 2016 at 12:33:30AM -0400, Valdis Kletnieks wrote:
> Building with W=1 generates some 350 lines of warnings of the form:
> 
> kernel/bpf/core.c: In function '__bpf_prog_run':
> kernel/bpf/core.c:476:33: warning: initialized field overwritten 
> [-Woverride-init]
>[BPF_ALU | BPF_ADD | BPF_X] = &&ALU_ADD_X,
>  ^~
> kernel/bpf/core.c:476:33: note: (near initialization for 'jumptable[12]')
> 
> Since they come from the way we intentionally build the table, silence
> that one specific warning.
> 
> Signed-off-by: Valdis Kletnieks 
> 
> Version 2: Add bpf: subsystem tag to subject line
> 
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index eed911d091da..bb915f9d9f92 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -1,3 +1,4 @@
> +CFLAGS_core.o += -Wno-override-init

and at least 2 other such patches for other files...
Is there a single warning where -Woverride-init was useful?
May be worth disabling this warning for the whole build?



Re: [PATCH v2] bpf: silence warnings when building kernel/bpf/core.c with W=1

2016-08-01 Thread Alexei Starovoitov
On Mon, Aug 01, 2016 at 01:18:43AM -0400, valdis.kletni...@vt.edu wrote:
> On Sun, 31 Jul 2016 21:42:22 -0700, Alexei Starovoitov said:
> 
> > and at least 2 other such patches for other files...
> > Is there a single warning where -Woverride-init was useful?
> > May be worth disabling this warning for the whole build?
> 
> There's a few other cases that *aren't* the "define the array to zero
> and then build the entries from a list" form.
> 
> In particular, there's still 3 odd complaints:
> 
> drivers/ata/ahci.c:
> drivers/ata/ahci.h:393:16: warning: initialized field overwritten [-Woverride-init]
>   .can_queue  = AHCI_MAX_CMDS - 1,
> 
> drivers/block/drbd/drbd_main.c:
> drivers/block/drbd/drbd_main.c:3767:22: warning: initialized field overwritten [-Woverride-init]
>[P_RETRY_WRITE]  = "retry_write",
> 
> arch/x86/kernel/cpu/common.c:
> ./arch/x86/include/asm/page_64_types.h:22:21: warning: initialized field overwritten [-Woverride-init]
>  #define DEBUG_STKSZ (PAGE_SIZE << DEBUG_STACK_ORDER)
> 
> The point of these patches is to make -Woverride-init *useful* - you'll never
> spot 3 warnings in a flood of over 9,000 understood-and-ignored warnings.
> 
> Get rid of the 9,000 understood-and-ignored warnings, and then things that
> probably *should* be looked at can be noticed.

I don't think it makes sense to play kernel whack-a-warning in the hope that
a particular warning will find something useful.
Please show a few cases where it actually found a real issue; otherwise
just disable it for the whole build.




Re: [PATCH RFC 2/2] rlimits: report resource limits violations

2016-09-07 Thread Alexei Starovoitov
On Wed, Sep 07, 2016 at 01:27:35PM +0300, Yauheni Kaliuta wrote:
> The patch instrument different places of resource limits checks with
> reporting using the infrastructure from the previous patch.
> 
> Signed-off-by: Yauheni Kaliuta 
> ---
>  arch/ia64/kernel/perfmon.c |  4 +++-
>  arch/powerpc/kvm/book3s_64_vio.c   |  6 --
>  arch/powerpc/mm/mmu_context_iommu.c|  6 --
>  drivers/android/binder.c   |  7 ++-
>  drivers/infiniband/core/umem.c |  1 +
>  drivers/infiniband/hw/hfi1/user_pages.c|  5 -
>  drivers/infiniband/hw/qib/qib_user_pages.c |  1 +
>  drivers/infiniband/hw/usnic/usnic_uiom.c   |  1 +
>  drivers/misc/mic/scif/scif_rma.c   |  1 +
>  drivers/vfio/vfio_iommu_spapr_tce.c|  6 --
>  drivers/vfio/vfio_iommu_type1.c|  4 
>  fs/attr.c  |  4 +++-
>  fs/binfmt_aout.c   |  4 +++-
>  fs/binfmt_flat.c   |  1 +
>  fs/coredump.c  |  4 +++-
>  fs/exec.c  | 14 ++
>  fs/file.c  | 26 +-
>  fs/select.c|  4 +++-
>  include/linux/mm.h |  7 ++-
>  ipc/mqueue.c   | 10 --
>  ipc/shm.c  |  1 +
>  kernel/bpf/syscall.c   | 15 ---
>  kernel/events/core.c   |  1 +
>  kernel/fork.c  |  9 ++---
>  kernel/sched/core.c| 17 +
>  kernel/signal.c|  7 ---
>  kernel/sys.c   |  9 ++---
>  kernel/time/posix-cpu-timers.c |  8 
>  mm/mlock.c | 14 +-
>  mm/mmap.c  | 19 +++
>  mm/mremap.c|  4 +++-
>  net/unix/af_unix.c |  9 ++---
>  32 files changed, 179 insertions(+), 50 deletions(-)

I'm certainly not excited that we'd need to maintain this
rlimit tracking for the foreseeable future.
I can be convinced otherwise, but so far I don't see
a strong enough use case that warrants these changes all over the tree.



Re: [PATCH v2 1/5] blk-mq: abstract tag allocation out into scale_bitmap library

2016-09-07 Thread Alexei Starovoitov

On 9/7/16 4:46 PM, Omar Sandoval wrote:

From: Omar Sandoval 

This is a generally useful data structure, so make it available to
anyone else who might want to use it. It's also a nice cleanup
separating the allocation logic from the rest of the tag handling logic.

The code is behind a new Kconfig option, CONFIG_SCALE_BITMAP, which is
only selected by CONFIG_BLOCK for now.

This should be a complete noop functionality-wise.

Signed-off-by: Omar Sandoval 
---
  MAINTAINERS  |   1 +
  block/Kconfig|   1 +
  block/blk-mq-tag.c   | 469 ++-
  block/blk-mq-tag.h   |  37 +---
  block/blk-mq.c   | 113 +++
  block/blk-mq.h   |   9 -
  include/linux/blk-mq.h   |   9 +-
  include/linux/scale_bitmap.h | 340 +++
  lib/Kconfig  |   3 +
  lib/Makefile |   2 +
  lib/scale_bitmap.c   | 305 

...

diff --git a/include/linux/scale_bitmap.h b/include/linux/scale_bitmap.h
new file mode 100644
index 000..63f712b
--- /dev/null
+++ b/include/linux/scale_bitmap.h
@@ -0,0 +1,340 @@
+/*
+ * Fast and scalable bitmaps.

...

+/**
+ * struct scale_bitmap_word - Word in a &struct scale_bitmap.
+ */
+struct scale_bitmap_word {
+/**
+ * struct scale_bitmap - Scalable bitmap.
+ *
+ * A &struct scale_bitmap is spread over multiple cachelines to avoid 
ping-pong.
+ * This trades off higher memory usage for better scalability.
+ */
+struct scale_bitmap {


scale_bitmap sounds odd, since 'scale' is also a verb.
We also have lib/rhashtable.c:
 * Resizable, Scalable, Concurrent Hash Table
everything is 'scalable' nowadays.
May be resizable bitmap would be a better name?
'struct rbitmap'... lib/rbitmap.c ?



Re: [PATCH v2 1/5] blk-mq: abstract tag allocation out into scale_bitmap library

2016-09-07 Thread Alexei Starovoitov

On 9/7/16 5:38 PM, Omar Sandoval wrote:

On Wed, Sep 07, 2016 at 05:01:56PM -0700, Alexei Starovoitov wrote:

On 9/7/16 4:46 PM, Omar Sandoval wrote:

From: Omar Sandoval 

This is a generally useful data structure, so make it available to
anyone else who might want to use it. It's also a nice cleanup
separating the allocation logic from the rest of the tag handling logic.

The code is behind a new Kconfig option, CONFIG_SCALE_BITMAP, which is
only selected by CONFIG_BLOCK for now.

This should be a complete noop functionality-wise.

Signed-off-by: Omar Sandoval 
---
   MAINTAINERS  |   1 +
   block/Kconfig|   1 +
   block/blk-mq-tag.c   | 469 
++-
   block/blk-mq-tag.h   |  37 +---
   block/blk-mq.c   | 113 +++
   block/blk-mq.h   |   9 -
   include/linux/blk-mq.h   |   9 +-
   include/linux/scale_bitmap.h | 340 +++
   lib/Kconfig  |   3 +
   lib/Makefile |   2 +
   lib/scale_bitmap.c   | 305 

...

diff --git a/include/linux/scale_bitmap.h b/include/linux/scale_bitmap.h
new file mode 100644
index 000..63f712b
--- /dev/null
+++ b/include/linux/scale_bitmap.h
@@ -0,0 +1,340 @@
+/*
+ * Fast and scalable bitmaps.

...

+/**
+ * struct scale_bitmap_word - Word in a &struct scale_bitmap.
+ */
+struct scale_bitmap_word {
+/**
+ * struct scale_bitmap - Scalable bitmap.
+ *
+ * A &struct scale_bitmap is spread over multiple cachelines to avoid 
ping-pong.
+ * This trades off higher memory usage for better scalability.
+ */
+struct scale_bitmap {


scale_bitmap sounds odd, since 'scale' is also a verb.
We also have lib/rhashtable.c:
  * Resizable, Scalable, Concurrent Hash Table
everything is 'scalable' nowadays.


Agreed, I'm not a huge fan of the name.


May be resizable bitmap would be a better name?
'struct rbitmap'... lib/rbitmap.c ?



Hm, the resizing operation isn't very well thought-out right now, it's
there because it's okay for the way blk-mq uses it, but it's definitely
not the point of the data structure. It's more of a cache-friendly
bitmap, or a sparse bitmap. `struct sbitmap`? `struct cbitmap`?


yeah. naming is hard.
I think the name ideally should indicate how this bitmap
is different from the plain array of bits that is already covered by
the primitives in bitmap.h.
Is it because the user can wait on a bit, or because it's
smp-aware? sort of percpu? I think that's the main trick behind how
it achieves good concurrent set/get access, right?
struct pcpu_bitmap ?
struct sbitmap is fine too.
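
For illustration, a rough sketch of the layout being discussed (names
are made up here, not the proposed scale_bitmap API): each word of the
bitmap sits in its own cacheline, so concurrent set/clear from
different CPUs mostly touch different cachelines.

struct cl_bitmap_word {
	unsigned long word;		/* the actual bits */
	unsigned long depth;		/* number of bits used in this word */
} ____cacheline_aligned_in_smp;		/* one word per cacheline */

struct cl_bitmap {
	unsigned int depth;		/* total number of bits */
	unsigned int shift;		/* log2(bits per word) */
	unsigned int map_nr;		/* number of words */
	struct cl_bitmap_word *map;
};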



Re: [RFC v3 03/22] bpf,landlock: Add a new arraymap type to deal with (Landlock) handles

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 09:23:56AM +0200, Mickaël Salaün wrote:
> This new arraymap looks like a set and brings new properties:
> * strong typing of entries: the eBPF functions get the array type of
>   elements instead of CONST_PTR_TO_MAP (e.g.
>   CONST_PTR_TO_LANDLOCK_HANDLE_FS);
> * force sequential filling (i.e. replace or append-only update), which
>   allow quick browsing of all entries.
> 
> This strong typing is useful to statically check if the content of a map
> can be passed to an eBPF function. For example, Landlock use it to store
> and manage kernel objects (e.g. struct file) instead of dealing with
> userland raw data. This improve efficiency and ensure that an eBPF
> program can only call functions with the right high-level arguments.
> 
> The enum bpf_map_handle_type list low-level types (e.g.
> BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD) which are identified when
> updating a map entry (handle). This handle types are used to infer a
> high-level arraymap type which are listed in enum bpf_map_array_type
> (e.g. BPF_MAP_ARRAY_TYPE_LANDLOCK_FS).
> 
> For now, this new arraymap is only used by Landlock LSM (cf. next
> commits) but it could be useful for other needs.
> 
> Changes since v2:
> * add a RLIMIT_NOFILE-based limit to the maximum number of arraymap
>   handle entries (suggested by Andy Lutomirski)
> * remove useless checks
> 
> Changes since v1:
> * arraymap of handles replace custom checker groups
> * simpler userland API
> 
> Signed-off-by: Mickaël Salaün 
> Cc: Alexei Starovoitov 
> Cc: Andy Lutomirski 
> Cc: Daniel Borkmann 
> Cc: David S. Miller 
> Cc: Kees Cook 
> Link: 
> https://lkml.kernel.org/r/calcetrwwtiz3kztkegow24-dvhqq6lftwexh77fd2g5o71y...@mail.gmail.com
> ---
>  include/linux/bpf.h  |  14 
>  include/uapi/linux/bpf.h |  18 +
>  kernel/bpf/arraymap.c| 203 
> +++
>  kernel/bpf/verifier.c|  12 ++-
>  4 files changed, 246 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index fa9a988400d9..eae4ce4542c1 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -13,6 +13,10 @@
>  #include 
>  #include 
>  
> +#ifdef CONFIG_SECURITY_LANDLOCK
> +#include  /* struct file */
> +#endif /* CONFIG_SECURITY_LANDLOCK */
> +
>  struct perf_event;
>  struct bpf_map;
>  
> @@ -38,6 +42,7 @@ struct bpf_map_ops {
>  struct bpf_map {
>   atomic_t refcnt;
>   enum bpf_map_type map_type;
> + enum bpf_map_array_type map_array_type;
>   u32 key_size;
>   u32 value_size;
>   u32 max_entries;
> @@ -187,6 +192,9 @@ struct bpf_array {
>*/
>   enum bpf_prog_type owner_prog_type;
>   bool owner_jited;
> +#ifdef CONFIG_SECURITY_LANDLOCK
> + u32 n_entries;  /* number of entries in a handle array */
> +#endif /* CONFIG_SECURITY_LANDLOCK */
>   union {
>   char value[0] __aligned(8);
>   void *ptrs[0] __aligned(8);
> @@ -194,6 +202,12 @@ struct bpf_array {
>   };
>  };
>  
> +#ifdef CONFIG_SECURITY_LANDLOCK
> +struct map_landlock_handle {
> + u32 type; /* enum bpf_map_handle_type */
> +};
> +#endif /* CONFIG_SECURITY_LANDLOCK */
> +
>  #define MAX_TAIL_CALL_CNT 32
>  
>  struct bpf_event_entry {
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 7cd36166f9b7..b68de57f7ab8 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -87,6 +87,15 @@ enum bpf_map_type {
>   BPF_MAP_TYPE_PERCPU_ARRAY,
>   BPF_MAP_TYPE_STACK_TRACE,
>   BPF_MAP_TYPE_CGROUP_ARRAY,
> + BPF_MAP_TYPE_LANDLOCK_ARRAY,
> +};
> +
> +enum bpf_map_array_type {
> + BPF_MAP_ARRAY_TYPE_UNSPEC,
> +};
> +
> +enum bpf_map_handle_type {
> + BPF_MAP_HANDLE_TYPE_UNSPEC,
>  };

missing something. why does it have to be special and have its own
fd array implementation?
Please take a look at how BPF_MAP_TYPE_PERF_EVENT_ARRAY,
BPF_MAP_TYPE_CGROUP_ARRAY and BPF_MAP_TYPE_PROG_ARRAY are done.
They all store objects into an array map via FDs that user space passes in.
I think the same model should apply here.
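
For reference, a minimal user-space sketch of that model (raw bpf(2)
syscall, placeholder fd values): the object is handed to the kernel as
a file descriptor through the ordinary BPF_MAP_UPDATE_ELEM command and
the kernel resolves the fd to the underlying object at update time.

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* store the object behind obj_fd (e.g. a perf event or prog fd) into
 * slot 'slot' of an fd-array style map referenced by map_fd */
static int fd_array_update(int map_fd, uint32_t slot, uint32_t obj_fd)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.map_fd = map_fd;
	attr.key = (uintptr_t)&slot;
	attr.value = (uintptr_t)&obj_fd;
	attr.flags = BPF_ANY;

	return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}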



Re: [RFC v3 07/22] landlock: Handle file comparisons

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 09:24:00AM +0200, Mickaël Salaün wrote:
> Add eBPF functions to compare file system access with a Landlock file
> system handle:
> * bpf_landlock_cmp_fs_prop_with_struct_file(prop, map, map_op, file)
>   This function allows to compare the dentry, inode, device or mount
>   point of the currently accessed file, with a reference handle.
> * bpf_landlock_cmp_fs_beneath_with_struct_file(opt, map, map_op, file)
>   This function allows an eBPF program to check if the current accessed
>   file is the same or in the hierarchy of a reference handle.
> 
> The goal of file system handle is to abstract kernel objects such as a
> struct file or a struct inode. Userland can create this kind of handle
> thanks to the BPF_MAP_UPDATE_ELEM command. The element is a struct
> landlock_handle containing the handle type (e.g.
> BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD) and a file descriptor. This could
> also be any descriptions able to match a struct file or a struct inode
> (e.g. path or glob string).
> 
> Changes since v2:
> * add MNT_INTERNAL check to only add file handle from user-visible FS
>   (e.g. no anonymous inode)
> * replace struct file* with struct path* in map_landlock_handle
> * add BPF protos
> * fix bpf_landlock_cmp_fs_prop_with_struct_file()
> 
> Signed-off-by: Mickaël Salaün 
> Cc: Alexei Starovoitov 
> Cc: Andy Lutomirski 
> Cc: Daniel Borkmann 
> Cc: David S. Miller 
> Cc: James Morris 
> Cc: Kees Cook 
> Cc: Serge E. Hallyn 
> Link: 
> https://lkml.kernel.org/r/calcetrwwtiz3kztkegow24-dvhqq6lftwexh77fd2g5o71y...@mail.gmail.com

thanks for keeping the links to the previous discussion.
Long term it should help, though I worry we're already at the point
where there are too many outstanding issues to resolve before we
can proceed with a reasonable code review.

> +/*
> + * bpf_landlock_cmp_fs_prop_with_struct_file
> + *
> + * Cf. include/uapi/linux/bpf.h
> + */
> +static inline u64 bpf_landlock_cmp_fs_prop_with_struct_file(u64 r1_property,
> + u64 r2_map, u64 r3_map_op, u64 r4_file, u64 r5)
> +{
> + u8 property = (u8) r1_property;
> + struct bpf_map *map = (struct bpf_map *) (unsigned long) r2_map;
> + enum bpf_map_array_op map_op = r3_map_op;
> + struct file *file = (struct file *) (unsigned long) r4_file;

please use the just-added BPF_CALL_ macros. They will help the readability of the above.
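
For illustration, a hedged sketch of what the helper above could look
like with those macros (argument names taken from the quoted code, the
body trimmed to the first checks):

BPF_CALL_4(bpf_landlock_cmp_fs_prop_with_struct_file, u8, property,
	   struct bpf_map *, map, u64, map_op, struct file *, file)
{
	/* the macro generates the u64 r1..r5 wrapper and the casts,
	 * so the body keeps only the actual checks */
	if (unlikely(!map)) {
		WARN_ON(1);
		return -EFAULT;
	}
	if (unlikely(!file))
		return -ENOENT;
	/* remaining checks as in the quoted body below */
	return 0;
}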

> + struct bpf_array *array = container_of(map, struct bpf_array, map);
> + struct path *p1, *p2;
> + struct map_landlock_handle *handle;
> + int i;
> +
> + /* ARG_CONST_PTR_TO_LANDLOCK_HANDLE_FS is an arraymap */
> + if (unlikely(!map)) {
> + WARN_ON(1);
> + return -EFAULT;
> + }
> + if (unlikely(!file))
> + return -ENOENT;
> + if (unlikely((property | _LANDLOCK_FLAG_FS_MASK) != 
> _LANDLOCK_FLAG_FS_MASK))
> + return -EINVAL;
> +
> + /* for now, only handle OP_OR */
> + switch (map_op) {
> + case BPF_MAP_ARRAY_OP_OR:
> + break;
> + case BPF_MAP_ARRAY_OP_UNSPEC:
> + case BPF_MAP_ARRAY_OP_AND:
> + case BPF_MAP_ARRAY_OP_XOR:
> + default:
> + return -EINVAL;
> + }
> + p2 = &file->f_path;
> +
> + synchronize_rcu();

that is completely broken.
bpf programs execute under rcu_read_lock.
please enable CONFIG_PROVE_RCU and retest everything.

I would suggest for the next RFC to do minimal 7 patches up to this point
with simple example that demonstrates the use case.
I would avoid all unpriv stuff and all of seccomp for the next RFC as well,
otherwise I don't think we can realistically make forward progress, since
there are too many issues raised in the subsequent patches.

The common part that is mergeable is prog's subtype extension to
the verifier that can be used for better tracing and is the common
piece of infra needed for both landlock and checmate LSMs
(which must be one LSM anyway)



Re: [RFC v3 14/22] bpf/cgroup: Make cgroup_bpf_update() return an error code

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 09:24:07AM +0200, Mickaël Salaün wrote:
> This will be useful to support Landlock for the next commits.
> 
> Signed-off-by: Mickaël Salaün 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> Cc: Daniel Mack 
> Cc: David S. Miller 
> Cc: Tejun Heo 

I think this is good to do regardless. Sooner or later cgroup_bpf_update
will start rejecting the prog attach, like the flag we discussed
that would disallow processes lower in the cgroup hierarchy from installing
their own bpf programs.
It will be a minimal change to bpf_prog_attach() error handling,
but will greatly help Mickaël build stuff on top.
DanielM, can you refactor your patch to do that from the start?

Thanks!

> ---
>  include/linux/bpf-cgroup.h |  4 ++--
>  kernel/bpf/cgroup.c|  3 ++-
>  kernel/bpf/syscall.c   | 10 ++
>  kernel/cgroup.c|  6 --
>  4 files changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> index 2234042d7f61..6cca7924ee17 100644
> --- a/include/linux/bpf-cgroup.h
> +++ b/include/linux/bpf-cgroup.h
> @@ -31,13 +31,13 @@ struct cgroup_bpf {
>  void cgroup_bpf_put(struct cgroup *cgrp);
>  void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
>  
> -void __cgroup_bpf_update(struct cgroup *cgrp,
> +int __cgroup_bpf_update(struct cgroup *cgrp,
>struct cgroup *parent,
>struct bpf_prog *prog,
>enum bpf_attach_type type);
>  
>  /* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
> -void cgroup_bpf_update(struct cgroup *cgrp,
> +int cgroup_bpf_update(struct cgroup *cgrp,
>  struct bpf_prog *prog,
>  enum bpf_attach_type type);
>  
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 782878ec4f2d..7b75fa692617 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -83,7 +83,7 @@ void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup 
> *parent)
>   *
>   * Must be called with cgroup_mutex held.
>   */
> -void __cgroup_bpf_update(struct cgroup *cgrp,
> +int __cgroup_bpf_update(struct cgroup *cgrp,
>struct cgroup *parent,
>struct bpf_prog *prog,
>enum bpf_attach_type type)
> @@ -117,6 +117,7 @@ void __cgroup_bpf_update(struct cgroup *cgrp,
>   bpf_prog_put(old_pinned.prog);
>   static_branch_dec(&cgroup_bpf_enabled_key);
>   }
> + return 0;
>  }
>  
>  /**
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 45a91d59..c978f2d9a1b3 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -831,6 +831,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>  {
>   struct bpf_prog *prog;
>   struct cgroup *cgrp;
> + int result;
>  
>   if (!capable(CAP_NET_ADMIN))
>   return -EPERM;
> @@ -858,10 +859,10 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>   return PTR_ERR(cgrp);
>   }
>  
> - cgroup_bpf_update(cgrp, prog, attr->attach_type);
> + result = cgroup_bpf_update(cgrp, prog, attr->attach_type);
>   cgroup_put(cgrp);
>  
> - return 0;
> + return result;
>  }
>  
>  #define BPF_PROG_DETACH_LAST_FIELD attach_type
> @@ -869,6 +870,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>  static int bpf_prog_detach(const union bpf_attr *attr)
>  {
>   struct cgroup *cgrp;
> + int result = 0;
>  
>   if (!capable(CAP_NET_ADMIN))
>   return -EPERM;
> @@ -883,7 +885,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>   if (IS_ERR(cgrp))
>   return PTR_ERR(cgrp);
>  
> - cgroup_bpf_update(cgrp, NULL, attr->attach_type);
> + result = cgroup_bpf_update(cgrp, NULL, attr->attach_type);
>   cgroup_put(cgrp);
>   break;
>  
> @@ -891,7 +893,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>   return -EINVAL;
>   }
>  
> - return 0;
> + return result;
>  }
>  #endif /* CONFIG_CGROUP_BPF */
>  
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 87324ce481b1..48b650a640a9 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -6450,15 +6450,17 @@ static __init int cgroup_namespaces_init(void)
>  subsys_initcall(cgroup_namespaces_init);
>  
>  #ifdef CONFIG_CGROUP_BPF
> -void cgroup_bpf_update(struct cgroup *cgrp,
> +int cgroup_bpf_update(struct cgroup *cgrp,
>  struct bpf_prog *prog,
>

Re: [RFC v3 21/22] bpf,landlock: Add optional skb pointer in the Landlock context

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 09:24:14AM +0200, Mickaël Salaün wrote:
> This is a proof of concept to expose optional values that could depend
> of the process access rights.
> 
> There is two dedicated flags: LANDLOCK_FLAG_ACCESS_SKB_READ and
> LANDLOCK_FLAG_ACCESS_SKB_WRITE. Each of them can be activated to access
> eBPF functions manipulating a skb in a read or write way.
> 
> Signed-off-by: Mickaël Salaün 
...
>  /* Handle check flags */
>  #define LANDLOCK_FLAG_FS_DENTRY  (1 << 0)
> @@ -619,12 +621,15 @@ struct landlock_handle {
>   * @args: LSM hook arguments, see include/linux/lsm_hooks.h for there
>   *description and the LANDLOCK_HOOK* definitions from
>   *security/landlock/lsm.c for their types.
> + * @opt_skb: optional skb pointer, accessible with the
> + *   LANDLOCK_FLAG_ACCESS_SKB_* flags for network-related hooks.
>   */
>  struct landlock_data {
>   __u32 hook; /* enum landlock_hook_id */
>   __u16 origin; /* LANDLOCK_FLAG_ORIGIN_* */
>   __u16 cookie; /* seccomp RET_LANDLOCK */
>   __u64 args[6];
> + __u64 opt_skb;
>  };

missing something here.
This patch doesn't make use of it.
Is that something for the future?
How will that field be populated?
Why make it different vs the rest of args[6]?



Re: [RFC v3 22/22] samples/landlock: Add sandbox example

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 09:24:15AM +0200, Mickaël Salaün wrote:
> Add a basic sandbox tool to create a process isolated from some part of
> the system. This can depend of the current cgroup.
> 
> Example with the current process hierarchy (seccomp):
> 
>   $ ls /home
>   user1
>   $ LANDLOCK_ALLOWED='/bin:/lib:/usr:/tmp:/proc/self/fd/0' \
>   ./samples/landlock/sandbox /bin/sh -i
>   Launching a new sandboxed process.
>   $ ls /home
>   ls: cannot open directory '/home': Permission denied
> 
> Example with a cgroup:
> 
>   $ mkdir /sys/fs/cgroup/sandboxed
>   $ ls /home
>   user1
>   $ LANDLOCK_CGROUPS='/sys/fs/cgroup/sandboxed' \
>   LANDLOCK_ALLOWED='/bin:/lib:/usr:/tmp:/proc/self/fd/0' \
>   ./samples/landlock/sandbox
>   Ready to sandbox with cgroups.
>   $ ls /home
>   user1
>   $ echo $$ > /sys/fs/cgroup/sandboxed/cgroup.procs
>   $ ls /home
>   ls: cannot open directory '/home': Permission denied
> 
> Changes since v2:
> * use BPF_PROG_ATTACH for cgroup handling
> 
> Signed-off-by: Mickaël Salaün 
...
> + struct bpf_insn hook_path[] = {
> + /* specify an option, if any */
> + BPF_MOV32_IMM(BPF_REG_1, 0),
> + /* handles to compare with */
> + BPF_LD_MAP_FD(BPF_REG_2, map_fs),
> + BPF_MOV64_IMM(BPF_REG_3, BPF_MAP_ARRAY_OP_OR),
> + /* hook argument (struct file) */
> + BPF_LDX_MEM(BPF_DW, BPF_REG_4, BPF_REG_6, offsetof(struct
> + landlock_data, args[0])),
> + /* checker function */
> + 
> BPF_EMIT_CALL(BPF_FUNC_landlock_cmp_fs_beneath_with_struct_file),

the example is sweet!
Since only that helper is used, could you skip the other one
from the patches (for now) ?



Re: [RFC v3 07/22] landlock: Handle file comparisons

2016-09-14 Thread Alexei Starovoitov
On Thu, Sep 15, 2016 at 01:02:22AM +0200, Mickaël Salaün wrote:
> > 
> > I would suggest for the next RFC to do minimal 7 patches up to this point
> > with simple example that demonstrates the use case.
> > I would avoid all unpriv stuff and all of seccomp for the next RFC as well,
> > otherwise I don't think we can realistically make forward progress, since
> > there are too many issues raised in the subsequent patches.
> 
> I hope we will find a common agreement about seccomp vs cgroup… I think
> both approaches have their advantages, can be complementary and nicely
> combined.

I don't mind having both task-based lsm and cgroup-based as long as
the infrastructure is not duplicated and the scaling issues from the earlier version
are resolved.
I'm proposing to do cgroup only for the next RFC, since mine and Sargun's
use case for this bpf+lsm+cgroup is _not_ security or sandboxing.
No need for unpriv, no_new_priv to cgroups are other things that Andy
is concerned about.

> Unprivileged sandboxing is the main goal of Landlock. This should not be
> a problem, even for privileged features, thanks to the new subtype/access.

yes. the point is that unpriv stuff can come later, after agreement is reached.
If we keep arguing about seccomp details this set won't go anywhere.
Even in the basic part (which is cgroup+bpf+lsm) there are plenty of questions
still to be agreed on.

> Agreed. With this RFC, the Checmate features (i.e. network helpers)
> should be able to sit on top of Landlock.

I think neither of them should be called fancy names for no technical reason.
We will have only one bpf based lsm. That's it and it doesn't
need an obscure name. Directory name can be security/bpf/..stuff.c



Re: [RFC v3 03/22] bpf,landlock: Add a new arraymap type to deal with (Landlock) handles

2016-09-14 Thread Alexei Starovoitov
On Thu, Sep 15, 2016 at 01:22:49AM +0200, Mickaël Salaün wrote:
> 
> On 14/09/2016 20:51, Alexei Starovoitov wrote:
> > On Wed, Sep 14, 2016 at 09:23:56AM +0200, Mickaël Salaün wrote:
> >> This new arraymap looks like a set and brings new properties:
> >> * strong typing of entries: the eBPF functions get the array type of
> >>   elements instead of CONST_PTR_TO_MAP (e.g.
> >>   CONST_PTR_TO_LANDLOCK_HANDLE_FS);
> >> * force sequential filling (i.e. replace or append-only update), which
> >>   allow quick browsing of all entries.
> >>
> >> This strong typing is useful to statically check if the content of a map
> >> can be passed to an eBPF function. For example, Landlock use it to store
> >> and manage kernel objects (e.g. struct file) instead of dealing with
> >> userland raw data. This improve efficiency and ensure that an eBPF
> >> program can only call functions with the right high-level arguments.
> >>
> >> The enum bpf_map_handle_type list low-level types (e.g.
> >> BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD) which are identified when
> >> updating a map entry (handle). This handle types are used to infer a
> >> high-level arraymap type which are listed in enum bpf_map_array_type
> >> (e.g. BPF_MAP_ARRAY_TYPE_LANDLOCK_FS).
> >>
> >> For now, this new arraymap is only used by Landlock LSM (cf. next
> >> commits) but it could be useful for other needs.
> >>
> >> Changes since v2:
> >> * add a RLIMIT_NOFILE-based limit to the maximum number of arraymap
> >>   handle entries (suggested by Andy Lutomirski)
> >> * remove useless checks
> >>
> >> Changes since v1:
> >> * arraymap of handles replace custom checker groups
> >> * simpler userland API
> >>
> >> Signed-off-by: Mickaël Salaün 
> >> Cc: Alexei Starovoitov 
> >> Cc: Andy Lutomirski 
> >> Cc: Daniel Borkmann 
> >> Cc: David S. Miller 
> >> Cc: Kees Cook 
> >> Link: 
> >> https://lkml.kernel.org/r/calcetrwwtiz3kztkegow24-dvhqq6lftwexh77fd2g5o71y...@mail.gmail.com
> >> ---
> >>  include/linux/bpf.h  |  14 
> >>  include/uapi/linux/bpf.h |  18 +
> >>  kernel/bpf/arraymap.c| 203 
> >> +++
> >>  kernel/bpf/verifier.c|  12 ++-
> >>  4 files changed, 246 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> >> index fa9a988400d9..eae4ce4542c1 100644
> >> --- a/include/linux/bpf.h
> >> +++ b/include/linux/bpf.h
> >> @@ -13,6 +13,10 @@
> >>  #include 
> >>  #include 
> >>  
> >> +#ifdef CONFIG_SECURITY_LANDLOCK
> >> +#include  /* struct file */
> >> +#endif /* CONFIG_SECURITY_LANDLOCK */
> >> +
> >>  struct perf_event;
> >>  struct bpf_map;
> >>  
> >> @@ -38,6 +42,7 @@ struct bpf_map_ops {
> >>  struct bpf_map {
> >>atomic_t refcnt;
> >>enum bpf_map_type map_type;
> >> +  enum bpf_map_array_type map_array_type;
> >>u32 key_size;
> >>u32 value_size;
> >>u32 max_entries;
> >> @@ -187,6 +192,9 @@ struct bpf_array {
> >> */
> >>enum bpf_prog_type owner_prog_type;
> >>bool owner_jited;
> >> +#ifdef CONFIG_SECURITY_LANDLOCK
> >> +  u32 n_entries;  /* number of entries in a handle array */
> >> +#endif /* CONFIG_SECURITY_LANDLOCK */
> >>union {
> >>char value[0] __aligned(8);
> >>void *ptrs[0] __aligned(8);
> >> @@ -194,6 +202,12 @@ struct bpf_array {
> >>};
> >>  };
> >>  
> >> +#ifdef CONFIG_SECURITY_LANDLOCK
> >> +struct map_landlock_handle {
> >> +  u32 type; /* enum bpf_map_handle_type */
> >> +};
> >> +#endif /* CONFIG_SECURITY_LANDLOCK */
> >> +
> >>  #define MAX_TAIL_CALL_CNT 32
> >>  
> >>  struct bpf_event_entry {
> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> index 7cd36166f9b7..b68de57f7ab8 100644
> >> --- a/include/uapi/linux/bpf.h
> >> +++ b/include/uapi/linux/bpf.h
> >> @@ -87,6 +87,15 @@ enum bpf_map_type {
> >>BPF_MAP_TYPE_PERCPU_ARRAY,
> >>	BPF_MAP_TYPE_STACK_TRACE,
> >>BPF_MAP_TYPE_CGROUP_ARRAY,
> >> +  BPF_MAP_TYPE_LANDLOCK_ARRAY,
> >> +};
> >> +
> >> +enum bpf_map_array_type {
>

Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 06:25:07PM -0700, Andy Lutomirski wrote:
> On Wed, Sep 14, 2016 at 3:11 PM, Mickaël Salaün  wrote:
> >
> > On 14/09/2016 20:27, Andy Lutomirski wrote:
> >> On Wed, Sep 14, 2016 at 12:24 AM, Mickaël Salaün  wrote:
> >>> Add a new flag CGRP_NO_NEW_PRIVS for each cgroup. This flag is initially
> >>> set for all cgroup except the root. The flag is clear when a new process
> >>> without the no_new_privs flags is attached to the cgroup.
> >>>
> >>> If a cgroup is landlocked, then any new attempt, from an unprivileged
> >>> process, to attach a process without no_new_privs to this cgroup will
> >>> be denied.
> >>
> >> Until and unless everyone can agree on a way to properly namespace,
> >> delegate, etc cgroups, I think that trying to add unprivileged
> >> semantics to cgroups is nuts.  Given the big thread about cgroup v2,
> >> no-internal-tasks, etc, I just don't see how this approach can be
> >> viable.
> >
> > As far as I can tell, the no_new_privs flag of at task is not related to
> > namespaces. The CGRP_NO_NEW_PRIVS flag is only a cache to quickly access
> > the no_new_privs property of *tasks* in a cgroup. The semantic is unchanged.
> >
> > Using cgroup is optional, any task could use the seccomp-based
> > landlocking instead. However, for those that want/need to manage a
> > security policy in a more dynamic way, using cgroups may make sense.
> >
> > I though cgroup delegation was OK in the v2, isn't it the case? Do you
> > have some links?
> >
> >>
> >> Can we try to make landlock work completely independently of cgroups
> >> so that it doesn't get stuck and so that programs can use it without
> >> worrying about cgroup v1 vs v2, interactions with cgroup managers,
> >> cgroup managers that (supposedly?) will start migrating processes
> >> around piecemeal and almost certainly blowing up landlock in the
> >> process, etc?
> >
> > This RFC handle both cgroup and seccomp approaches in a similar way. I
> > don't see why building on top of cgroup v2 is a problem. Is there
> > security issues with delegation?
> 
> What I mean is: cgroup v2 delegation has a functionality problem.
> Tejun says [1]:
> 
> We haven't had to face this decision because cgroup has never properly
> supported delegating to applications and the in-use setups where this
> happens are custom configurations where there is no boundary between
> system and applications and adhoc trial-and-error is good enough a way
> to find a working solution.  That wiggle room goes away once we
> officially open this up to individual applications.
> 
> Unless and until that changes, I think that landlock should stay away
> from cgroups.  Others could reasonably disagree with me.

Ours and Sargun's use cases for cgroup+lsm+bpf is not for security
and not for sandboxing. So the above doesn't matter in such contexts.
lsm hooks + cgroups provide convenient scope and existing entry points.
Please see checmate examples how it's used.



Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 07:27:08PM -0700, Andy Lutomirski wrote:
> >> >
> >> > This RFC handle both cgroup and seccomp approaches in a similar way. I
> >> > don't see why building on top of cgroup v2 is a problem. Is there
> >> > security issues with delegation?
> >>
> >> What I mean is: cgroup v2 delegation has a functionality problem.
> >> Tejun says [1]:
> >>
> >> We haven't had to face this decision because cgroup has never properly
> >> supported delegating to applications and the in-use setups where this
> >> happens are custom configurations where there is no boundary between
> >> system and applications and adhoc trial-and-error is good enough a way
> >> to find a working solution.  That wiggle room goes away once we
> >> officially open this up to individual applications.
> >>
> >> Unless and until that changes, I think that landlock should stay away
> >> from cgroups.  Others could reasonably disagree with me.
> >
> > Ours and Sargun's use cases for cgroup+lsm+bpf is not for security
> > and not for sandboxing. So the above doesn't matter in such contexts.
> > lsm hooks + cgroups provide convenient scope and existing entry points.
> > Please see checmate examples how it's used.
> >
> 
> To be clear: I'm not arguing at all that there shouldn't be
> bpf+lsm+cgroup integration.  I'm arguing that the unprivileged
> landlock interface shouldn't expose any cgroup integration, at least
> until the cgroup situation settles down a lot.

ahh. yes. we're perfectly in agreement here.
I'm suggesting that the next RFC shouldn't include unpriv
and seccomp at all. Once bpf+lsm+cgroup is merged, we can
argue about unpriv with cgroups and even unpriv as a whole,
since it's not a given. Seccomp integration is also questionable.
I'd rather not have seccomp as a gate keeper for this lsm.
lsm and seccomp are orthogonal hook points. Syscalls and lsm hooks
don't have one to one relationship, so mixing them up is only
asking for trouble further down the road.
If we really need to carry some information from seccomp to lsm+bpf,
it's easier to add eBPF support to seccomp and let bpf side deal
with passing whatever information. 



Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 09:08:57PM -0700, Andy Lutomirski wrote:
> On Wed, Sep 14, 2016 at 9:00 PM, Alexei Starovoitov
>  wrote:
> > On Wed, Sep 14, 2016 at 07:27:08PM -0700, Andy Lutomirski wrote:
> >> >> >
> >> >> > This RFC handle both cgroup and seccomp approaches in a similar way. I
> >> >> > don't see why building on top of cgroup v2 is a problem. Is there
> >> >> > security issues with delegation?
> >> >>
> >> >> What I mean is: cgroup v2 delegation has a functionality problem.
> >> >> Tejun says [1]:
> >> >>
> >> >> We haven't had to face this decision because cgroup has never properly
> >> >> supported delegating to applications and the in-use setups where this
> >> >> happens are custom configurations where there is no boundary between
> >> >> system and applications and adhoc trial-and-error is good enough a way
> >> >> to find a working solution.  That wiggle room goes away once we
> >> >> officially open this up to individual applications.
> >> >>
> >> >> Unless and until that changes, I think that landlock should stay away
> >> >> from cgroups.  Others could reasonably disagree with me.
> >> >
> >> > Ours and Sargun's use cases for cgroup+lsm+bpf is not for security
> >> > and not for sandboxing. So the above doesn't matter in such contexts.
> >> > lsm hooks + cgroups provide convenient scope and existing entry points.
> >> > Please see checmate examples how it's used.
> >> >
> >>
> >> To be clear: I'm not arguing at all that there shouldn't be
> >> bpf+lsm+cgroup integration.  I'm arguing that the unprivileged
> >> landlock interface shouldn't expose any cgroup integration, at least
> >> until the cgroup situation settles down a lot.
> >
> > ahh. yes. we're perfectly in agreement here.
> > I'm suggesting that the next RFC shouldn't include unpriv
> > and seccomp at all. Once bpf+lsm+cgroup is merged, we can
> > argue about unpriv with cgroups and even unpriv as a whole,
> > since it's not a given. Seccomp integration is also questionable.
> > I'd rather not have seccomp as a gate keeper for this lsm.
> > lsm and seccomp are orthogonal hook points. Syscalls and lsm hooks
> > don't have one to one relationship, so mixing them up is only
> > asking for trouble further down the road.
> > If we really need to carry some information from seccomp to lsm+bpf,
> > it's easier to add eBPF support to seccomp and let bpf side deal
> > with passing whatever information.
> >
> 
> As an argument for keeping seccomp (or an extended seccomp) as the
> interface for an unprivileged bpf+lsm: seccomp already checks off most
> of the boxes for safely letting unprivileged programs sandbox
> themselves.  

you mean the attach part of seccomp syscall that deals with no_new_priv?
sure, that's reusable.

> Furthermore, to the extent that there are use cases for
> unprivileged bpf+lsm that *aren't* expressible within the seccomp
> hierarchy, I suspect that syscall filters have exactly the same
> problem and that we should fix seccomp to cover it.

not sure what you mean by 'seccomp hierarchy'. The normal process
hierarchy ?
imo the main deficiency of seccomp is the inability to look into arguments.
One can argue that it's a blessing, since composite args
are not yet copied into the kernel memory.
But in a lot of cases the seccomp arguments are FDs pointing
to kernel objects and if programs could examine those objects
the sandboxing scope would be more precise.
lsm+bpf solves that part and I'd still argue that it's
orthogonal to seccomp's pass/reject flow.
I mean if seccomp says 'ok' the syscall should continue executing
as normal and whatever LSM hooks were triggered by it may have
their own lsm+bpf verdicts.
Furthermore in the process hierarchy different children
should be able to set their own lsm+bpf filters that are not
related to parallel seccomp+bpf hierarchy of programs.
seccomp syscall can be an interface to attach programs
to lsm hooks, but nothing more than that.
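
For illustration, a minimal classic-BPF seccomp filter (a user-space
sketch, unrelated to these patches): it can key off the syscall number
and the raw register-sized arguments exposed in struct seccomp_data,
but it cannot follow an fd argument to the kernel object behind it,
which is the part lsm+bpf covers.

#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

static int install_filter(void)
{
	struct sock_filter filter[] = {
		/* load the syscall number; this and the raw args are
		 * all the filter can ever see (a real filter would
		 * also check seccomp_data.arch first) */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* kill on openat(), allow everything else */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_openat, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}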



Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 09:38:16PM -0700, Andy Lutomirski wrote:
> On Wed, Sep 14, 2016 at 9:31 PM, Alexei Starovoitov
>  wrote:
> > On Wed, Sep 14, 2016 at 09:08:57PM -0700, Andy Lutomirski wrote:
> >> On Wed, Sep 14, 2016 at 9:00 PM, Alexei Starovoitov
> >>  wrote:
> >> > On Wed, Sep 14, 2016 at 07:27:08PM -0700, Andy Lutomirski wrote:
> >> >> >> >
> >> >> >> > This RFC handle both cgroup and seccomp approaches in a similar 
> >> >> >> > way. I
> >> >> >> > don't see why building on top of cgroup v2 is a problem. Is there
> >> >> >> > security issues with delegation?
> >> >> >>
> >> >> >> What I mean is: cgroup v2 delegation has a functionality problem.
> >> >> >> Tejun says [1]:
> >> >> >>
> >> >> >> We haven't had to face this decision because cgroup has never 
> >> >> >> properly
> >> >> >> supported delegating to applications and the in-use setups where this
> >> >> >> happens are custom configurations where there is no boundary between
> >> >> >> system and applications and adhoc trial-and-error is good enough a 
> >> >> >> way
> >> >> >> to find a working solution.  That wiggle room goes away once we
> >> >> >> officially open this up to individual applications.
> >> >> >>
> >> >> >> Unless and until that changes, I think that landlock should stay away
> >> >> >> from cgroups.  Others could reasonably disagree with me.
> >> >> >
> >> >> > Ours and Sargun's use cases for cgroup+lsm+bpf is not for security
> >> >> > and not for sandboxing. So the above doesn't matter in such contexts.
> >> >> > lsm hooks + cgroups provide convenient scope and existing entry 
> >> >> > points.
> >> >> > Please see checmate examples how it's used.
> >> >> >
> >> >>
> >> >> To be clear: I'm not arguing at all that there shouldn't be
> >> >> bpf+lsm+cgroup integration.  I'm arguing that the unprivileged
> >> >> landlock interface shouldn't expose any cgroup integration, at least
> >> >> until the cgroup situation settles down a lot.
> >> >
> >> > ahh. yes. we're perfectly in agreement here.
> >> > I'm suggesting that the next RFC shouldn't include unpriv
> >> > and seccomp at all. Once bpf+lsm+cgroup is merged, we can
> >> > argue about unpriv with cgroups and even unpriv as a whole,
> >> > since it's not a given. Seccomp integration is also questionable.
> >> > I'd rather not have seccomp as a gate keeper for this lsm.
> >> > lsm and seccomp are orthogonal hook points. Syscalls and lsm hooks
> >> > don't have one to one relationship, so mixing them up is only
> >> > asking for trouble further down the road.
> >> > If we really need to carry some information from seccomp to lsm+bpf,
> >> > it's easier to add eBPF support to seccomp and let bpf side deal
> >> > with passing whatever information.
> >> >
> >>
> >> As an argument for keeping seccomp (or an extended seccomp) as the
> >> interface for an unprivileged bpf+lsm: seccomp already checks off most
> >> of the boxes for safely letting unprivileged programs sandbox
> >> themselves.
> >
> > you mean the attach part of seccomp syscall that deals with no_new_priv?
> > sure, that's reusable.
> >
> >> Furthermore, to the extent that there are use cases for
> >> unprivileged bpf+lsm that *aren't* expressible within the seccomp
> >> hierarchy, I suspect that syscall filters have exactly the same
> >> problem and that we should fix seccomp to cover it.
> >
> > not sure what you mean by 'seccomp hierarchy'. The normal process
> > hierarchy ?
> 
> Kind of.  I mean the filter layers that are inherited across fork(),
> the TSYNC mechanism, etc.
> 
> > imo the main deficiency of seccomp is the inability to look into arguments.
> > One can argue that it's a blessing, since composite args
> > are not yet copied into the kernel memory.
> > But in a lot of cases the seccomp arguments are FDs pointing
> > to kernel objects and if programs could examine those objects
> > the sandb

Re: [PATCH net-next 4/6] perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs

2016-08-30 Thread Alexei Starovoitov
On Mon, Aug 29, 2016 at 02:17:18PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 26, 2016 at 07:31:22PM -0700, Alexei Starovoitov wrote:
> > +static int perf_event_set_bpf_handler(struct perf_event *event, u32 
> > prog_fd)
> > +{
> > +   struct bpf_prog *prog;
> > +
> > +   if (event->overflow_handler_context)
> > +   /* hw breakpoint or kernel counter */
> > +   return -EINVAL;
> > +
> > +   if (event->prog)
> > +   return -EEXIST;
> > +
> > +   prog = bpf_prog_get_type(prog_fd, BPF_PROG_TYPE_PERF_EVENT);
> > +   if (IS_ERR(prog))
> > +   return PTR_ERR(prog);
> > +
> > +   event->prog = prog;
> > +   event->orig_overflow_handler = READ_ONCE(event->overflow_handler);
> > +   WRITE_ONCE(event->overflow_handler, bpf_overflow_handler);
> > +   return 0;
> > +}
> > +
> > +static void perf_event_free_bpf_handler(struct perf_event *event)
> > +{
> > +   struct bpf_prog *prog = event->prog;
> > +
> > +   if (!prog)
> > +   return;
> 
> Does it make sense to do something like:
> 
>   WARN_ON_ONCE(event->overflow_handler != bpf_overflow_handler);

Yes, that's an implicit assumption here, but checking for it
would be overkill. event->overflow_handler and event->prog are set
back to back in two places and reset here once, together.
Such a warn_on would only make people reading this code in the future
think that this bit is too complex to analyze by hand.

> > +
> > +   WRITE_ONCE(event->overflow_handler, event->orig_overflow_handler);
> > +   event->prog = NULL;
> > +   bpf_prog_put(prog);
> > +}
> 
> 
> >  static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
> >  {
> > bool is_kprobe, is_tracepoint;
> > struct bpf_prog *prog;
> >  
> > +   if (event->attr.type == PERF_TYPE_HARDWARE ||
> > +   event->attr.type == PERF_TYPE_SOFTWARE)
> > +   return perf_event_set_bpf_handler(event, prog_fd);
> > +
> > if (event->attr.type != PERF_TYPE_TRACEPOINT)
> > return -EINVAL;
> >  
> > @@ -7647,6 +7711,8 @@ static void perf_event_free_bpf_prog(struct 
> > perf_event *event)
> >  {
> > struct bpf_prog *prog;
> >  
> > +   perf_event_free_bpf_handler(event);
> > +
> > if (!event->tp_event)
> > return;
> >  
> 
> Does it at all make sense to merge the tp_event->prog thing into this
> new event->prog?

'struct trace_event_call *tp_event' is global while tp_event->perf_events
are per-cpu, so I don't see how we can do that without breaking user space
logic. Right now users do a single perf_event_open of a kprobe and attach a prog
that is executed on all cpus where the kprobe is firing. Additional per-cpu
filtering is done from within the bpf prog.
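
For illustration, a minimal sketch of that per-cpu filtering, assuming
the samples/bpf helper declarations (hypothetical program, not part of
this patch):

SEC("kprobe/sys_write")
int count_on_cpu0(struct pt_regs *ctx)
{
	/* the same program runs on every cpu where the kprobe fires;
	 * the prog itself narrows things down to the cpu it cares about */
	if (bpf_get_smp_processor_id() != 0)
		return 0;

	/* count/record the event here (map update, trace_printk, ...) */
	return 0;
}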

> >  #ifdef CONFIG_HAVE_HW_BREAKPOINT
> > @@ -8957,6 +9029,14 @@ perf_event_alloc(struct perf_event_attr *attr, int 
> > cpu,
> > if (!overflow_handler && parent_event) {
> > overflow_handler = parent_event->overflow_handler;
> > context = parent_event->overflow_handler_context;
> > +   if (overflow_handler == bpf_overflow_handler) {
> > +   event->prog = bpf_prog_inc(parent_event->prog);
> > +   event->orig_overflow_handler = 
> > parent_event->orig_overflow_handler;
> > +   if (IS_ERR(event->prog)) {
> > +   event->prog = NULL;
> > +   overflow_handler = NULL;
> > +   }
> > +   }
> > }
> 
> Should we not fail the entire perf_event_alloc() call in that IS_ERR()
> case?

Yes. Good point. Will do.



[PATCH v2 net-next 0/6] perf, bpf: add support for bpf in sw/hw perf_events

2016-08-31 Thread Alexei Starovoitov
Hi Peter, Dave,

this patch set is a follow up to the discussion:
https://lkml.kernel.org/r/20160804142853.GO6862%20()%20twins%20!%20programming%20!%20kicks-ass%20!%20net
It turned out to be simpler than what we discussed.

Patches 1-3 is bpf-side prep for the main patch 4
that adds bpf program as an overflow_handler to sw and hw perf_events.
Peter, please review.

Patches 5 and 6 are examples from myself and Brendan.

v1-v2: fixed issues spotted by Peter and Daniel.

Thanks!

Alexei Starovoitov (5):
  bpf: support 8-byte metafield access
  bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type
  bpf: perf_event progs should only use preallocated maps
  perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT
programs
  samples/bpf: add perf_event+bpf example

Brendan Gregg (1):
  samples/bpf: add sampleip example

 include/linux/bpf.h |   4 +
 include/linux/perf_event.h  |   7 ++
 include/uapi/linux/Kbuild   |   1 +
 include/uapi/linux/bpf.h|   1 +
 include/uapi/linux/bpf_perf_event.h |  18 +++
 kernel/bpf/verifier.c   |  31 +-
 kernel/events/core.c|  85 +-
 kernel/trace/bpf_trace.c|  60 ++
 samples/bpf/Makefile|   8 ++
 samples/bpf/bpf_helpers.h   |   2 +
 samples/bpf/bpf_load.c  |   7 +-
 samples/bpf/sampleip_kern.c |  38 +++
 samples/bpf/sampleip_user.c | 196 +
 samples/bpf/trace_event_kern.c  |  65 +++
 samples/bpf/trace_event_user.c  | 213 
 15 files changed, 730 insertions(+), 6 deletions(-)
 create mode 100644 include/uapi/linux/bpf_perf_event.h
 create mode 100644 samples/bpf/sampleip_kern.c
 create mode 100644 samples/bpf/sampleip_user.c
 create mode 100644 samples/bpf/trace_event_kern.c
 create mode 100644 samples/bpf/trace_event_user.c

-- 
2.8.0



[PATCH v2 net-next 4/6] perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs

2016-08-31 Thread Alexei Starovoitov
Allow attaching BPF_PROG_TYPE_PERF_EVENT programs to sw and hw perf events
via overflow_handler mechanism.
When the program is attached, the overflow handlers become stacked.
The program acts as a filter.
Returning zero from the program means that the normal perf_event_output handler
will not be called and the sampling event won't be stored in the ring buffer.

The overflow_handler_context==NULL is an additional safety check
to make sure programs are not attached to hw breakpoints and watchdog
in case other checks (that prevent that now anyway) get accidentally
relaxed in the future.

The program refcnt is incremented in case perf_events are inherited
when the target task is forked.
Similar to kprobe and tracepoint programs, there is no ioctl to
detach the program or swap an already attached program. User space is
expected to close(perf_event_fd) like it does right now for kprobe+bpf.
That restriction simplifies the code quite a bit.

The invocation of overflow_handler in __perf_event_overflow() is now
done via READ_ONCE, since that pointer can be replaced when the program
is attached while perf_event itself could have been active already.
There is no need to do similar treatment for event->prog, since it's
assigned only once before it's accessed.
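
As an illustration of the filter semantics described above (a hedged
sketch, not one of the programs in this set):

SEC("perf_event")
int sample_filter(struct bpf_perf_event_data *ctx)
{
	/* 0 means: don't call the original overflow handler, so this
	 * sample is dropped and never reaches the ring buffer */
	if (ctx->sample_period < 10000)
		return 0;

	/* non-zero falls through to the saved orig_overflow_handler */
	return 1;
}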

Signed-off-by: Alexei Starovoitov 
---
 include/linux/bpf.h|  4 +++
 include/linux/perf_event.h |  2 ++
 kernel/events/core.c   | 85 +-
 3 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 11134238417d..9a904f63f8c1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -297,6 +297,10 @@ static inline struct bpf_prog *bpf_prog_add(struct 
bpf_prog *prog, int i)
 static inline void bpf_prog_put(struct bpf_prog *prog)
 {
 }
+static inline struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog)
+{
+   return ERR_PTR(-EOPNOTSUPP);
+}
 #endif /* CONFIG_BPF_SYSCALL */
 
 /* verifier prototypes for helper functions called from eBPF programs */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 97bfe62f30d7..dcaaaf3ec8e6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -679,6 +679,8 @@ struct perf_event {
u64 (*clock)(void);
perf_overflow_handler_t overflow_handler;
void*overflow_handler_context;
+   perf_overflow_handler_t orig_overflow_handler;
+   struct bpf_prog *prog;
 
 #ifdef CONFIG_EVENT_TRACING
struct trace_event_call *tp_event;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3cfabdf7b942..305433ab2447 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7022,7 +7022,7 @@ static int __perf_event_overflow(struct perf_event *event,
irq_work_queue(&event->pending);
}
 
-   event->overflow_handler(event, data, regs);
+   READ_ONCE(event->overflow_handler)(event, data, regs);
 
if (*perf_event_fasync(event) && event->pending_kill) {
event->pending_wakeup = 1;
@@ -7637,11 +7637,75 @@ static void perf_event_free_filter(struct perf_event 
*event)
ftrace_profile_free_filter(event);
 }
 
+static void bpf_overflow_handler(struct perf_event *event,
+struct perf_sample_data *data,
+struct pt_regs *regs)
+{
+   struct bpf_perf_event_data_kern ctx = {
+   .data = data,
+   .regs = regs,
+   };
+   int ret = 0;
+
+#ifdef CONFIG_BPF_SYSCALL
+   preempt_disable();
+   if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
+   goto out;
+   rcu_read_lock();
+   ret = BPF_PROG_RUN(event->prog, (void *)&ctx);
+   rcu_read_unlock();
+ out:
+   __this_cpu_dec(bpf_prog_active);
+   preempt_enable();
+#endif
+   if (!ret)
+   return;
+
+   event->orig_overflow_handler(event, data, regs);
+}
+
+static int perf_event_set_bpf_handler(struct perf_event *event, u32 prog_fd)
+{
+   struct bpf_prog *prog;
+
+   if (event->overflow_handler_context)
+   /* hw breakpoint or kernel counter */
+   return -EINVAL;
+
+   if (event->prog)
+   return -EEXIST;
+
+   prog = bpf_prog_get_type(prog_fd, BPF_PROG_TYPE_PERF_EVENT);
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+
+   event->prog = prog;
+   event->orig_overflow_handler = READ_ONCE(event->overflow_handler);
+   WRITE_ONCE(event->overflow_handler, bpf_overflow_handler);
+   return 0;
+}
+
+static void perf_event_free_bpf_handler(struct perf_event *event)
+{
+   struct bpf_prog *prog = event->prog;
+
+   if (!prog)
+   return;
+
+   WRITE_ONCE(event->overflow_handler, event->orig_overflow_handler);
+   event-

[PATCH v2 net-next 2/6] bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type

2016-08-31 Thread Alexei Starovoitov
Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to
HW and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE
correspondingly in uapi/linux/perf_event.h)

The program visible context meta structure is
struct bpf_perf_event_data {
struct pt_regs regs;
 __u64 sample_period;
};
which is accessible directly from the program:
int bpf_prog(struct bpf_perf_event_data *ctx)
{
  ... ctx->sample_period ...
  ... ctx->regs.ip ...
}

The bpf verifier rewrites the accesses into kernel internal
struct bpf_perf_event_data_kern which allows changing
struct perf_sample_data without affecting bpf programs.
New fields can be added to the end of struct bpf_perf_event_data
in the future.

Signed-off-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 include/linux/perf_event.h  |  5 
 include/uapi/linux/Kbuild   |  1 +
 include/uapi/linux/bpf.h|  1 +
 include/uapi/linux/bpf_perf_event.h | 18 +++
 kernel/trace/bpf_trace.c| 60 +
 5 files changed, 85 insertions(+)
 create mode 100644 include/uapi/linux/bpf_perf_event.h

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2b6b43cc0dd5..97bfe62f30d7 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -788,6 +788,11 @@ struct perf_output_handle {
int page;
 };
 
+struct bpf_perf_event_data_kern {
+   struct pt_regs *regs;
+   struct perf_sample_data *data;
+};
+
 #ifdef CONFIG_CGROUP_PERF
 
 /*
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 185f8ea2702f..d0352a971ebd 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -71,6 +71,7 @@ header-y += binfmts.h
 header-y += blkpg.h
 header-y += blktrace_api.h
 header-y += bpf_common.h
+header-y += bpf_perf_event.h
 header-y += bpf.h
 header-y += bpqether.h
 header-y += bsg.h
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e4c5a1baa993..f896dfac4ac0 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -95,6 +95,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SCHED_ACT,
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
+   BPF_PROG_TYPE_PERF_EVENT,
 };
 
 #define BPF_PSEUDO_MAP_FD  1
diff --git a/include/uapi/linux/bpf_perf_event.h 
b/include/uapi/linux/bpf_perf_event.h
new file mode 100644
index ..067427259820
--- /dev/null
+++ b/include/uapi/linux/bpf_perf_event.h
@@ -0,0 +1,18 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _UAPI__LINUX_BPF_PERF_EVENT_H__
+#define _UAPI__LINUX_BPF_PERF_EVENT_H__
+
+#include 
+#include 
+
+struct bpf_perf_event_data {
+   struct pt_regs regs;
+   __u64 sample_period;
+};
+
+#endif /* _UAPI__LINUX_BPF_PERF_EVENT_H__ */
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ad35213b8405..0ac414abbf68 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1,4 +1,5 @@
 /* Copyright (c) 2011-2015 PLUMgrid, http://plumgrid.com
+ * Copyright (c) 2016 Facebook
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of version 2 of the GNU General Public
@@ -8,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -552,10 +554,68 @@ static struct bpf_prog_type_list tracepoint_tl = {
.type   = BPF_PROG_TYPE_TRACEPOINT,
 };
 
+static bool pe_prog_is_valid_access(int off, int size, enum bpf_access_type 
type,
+   enum bpf_reg_type *reg_type)
+{
+   if (off < 0 || off >= sizeof(struct bpf_perf_event_data))
+   return false;
+   if (type != BPF_READ)
+   return false;
+   if (off % size != 0)
+   return false;
+   if (off == offsetof(struct bpf_perf_event_data, sample_period)) {
+   if (size != sizeof(u64))
+   return false;
+   } else {
+   if (size != sizeof(long))
+   return false;
+   }
+   return true;
+}
+
+static u32 pe_prog_convert_ctx_access(enum bpf_access_type type, int dst_reg,
+ int src_reg, int ctx_off,
+ struct bpf_insn *insn_buf,
+ struct bpf_prog *prog)
+{
+   struct bpf_insn *insn = insn_buf;
+
+   BUILD_BUG_ON(FIELD_SIZEOF(struct perf_sample_data, period) != 
sizeof(u64));
+   switch (ctx_off) {
+   case offsetof(struct bpf_perf_event_data, sample_period):
+   *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct 
bpf_perf_event_data_kern, data)),
+ dst_reg, src_reg,
+ off

[PATCH v2 net-next 5/6] samples/bpf: add perf_event+bpf example

2016-08-31 Thread Alexei Starovoitov
The bpf program is called 50 times a second and does
hashmap[kern&user_stackid]++.
Its primary purpose is to check that key bpf helpers like map lookup, update,
get_stackid, trace_printk and ctx access are all working.
It checks:
- PERF_COUNT_HW_CPU_CYCLES on all cpus
- PERF_COUNT_HW_CPU_CYCLES for current process and inherited perf_events to 
children
- PERF_COUNT_SW_CPU_CLOCK on all cpus
- PERF_COUNT_SW_CPU_CLOCK for current process
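
For orientation, a condensed sketch of what such a kern program looks
like, assuming the samples/bpf conventions (the actual
trace_event_kern.c added by this patch is the authoritative version):

struct key_t {
	u32 kernstack;
	u32 userstack;
};

struct bpf_map_def SEC("maps") counts = {
	.type = BPF_MAP_TYPE_HASH,
	.key_size = sizeof(struct key_t),
	.value_size = sizeof(u64),
	.max_entries = 10000,
};

struct bpf_map_def SEC("maps") stackmap = {
	.type = BPF_MAP_TYPE_STACK_TRACE,
	.key_size = sizeof(u32),
	.value_size = PERF_MAX_STACK_DEPTH * sizeof(u64),
	.max_entries = 10000,
};

SEC("perf_event")
int bpf_prog1(struct bpf_perf_event_data *ctx)
{
	struct key_t key = {};
	u64 *val, one = 1;

	/* hashmap[kern&user_stackid]++ as described above */
	key.kernstack = bpf_get_stackid(ctx, &stackmap, 0);
	key.userstack = bpf_get_stackid(ctx, &stackmap, BPF_F_USER_STACK);

	val = bpf_map_lookup_elem(&counts, &key);
	if (val)
		(*val)++;
	else
		bpf_map_update_elem(&counts, &key, &one, BPF_NOEXIST);
	return 0;
}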

Signed-off-by: Alexei Starovoitov 
---
 samples/bpf/Makefile   |   4 +
 samples/bpf/bpf_helpers.h  |   2 +
 samples/bpf/bpf_load.c |   7 +-
 samples/bpf/trace_event_kern.c |  65 +
 samples/bpf/trace_event_user.c | 213 +
 5 files changed, 290 insertions(+), 1 deletion(-)
 create mode 100644 samples/bpf/trace_event_kern.c
 create mode 100644 samples/bpf/trace_event_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index db3cb061bfcd..a69cf9045285 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -25,6 +25,7 @@ hostprogs-y += test_cgrp2_array_pin
 hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
+hostprogs-y += trace_event
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -52,6 +53,7 @@ xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
 test_current_task_under_cgroup-objs := bpf_load.o libbpf.o \
   test_current_task_under_cgroup_user.o
+trace_event-objs := bpf_load.o libbpf.o trace_event_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -79,6 +81,7 @@ always += test_cgrp2_tc_kern.o
 always += xdp1_kern.o
 always += xdp2_kern.o
 always += test_current_task_under_cgroup_kern.o
+always += trace_event_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -103,6 +106,7 @@ HOSTLOADLIBES_test_overhead += -lelf -lrt
 HOSTLOADLIBES_xdp1 += -lelf
 HOSTLOADLIBES_xdp2 += -lelf
 HOSTLOADLIBES_test_current_task_under_cgroup += -lelf
+HOSTLOADLIBES_trace_event += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index bbdf62a1e45e..90f44bd2045e 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -55,6 +55,8 @@ static int (*bpf_skb_get_tunnel_opt)(void *ctx, void *md, int size) =
(void *) BPF_FUNC_skb_get_tunnel_opt;
 static int (*bpf_skb_set_tunnel_opt)(void *ctx, void *md, int size) =
(void *) BPF_FUNC_skb_set_tunnel_opt;
+static unsigned long long (*bpf_get_prandom_u32)(void) =
+   (void *) BPF_FUNC_get_prandom_u32;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 0cfda2320320..97913e109b14 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -51,6 +51,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
bool is_xdp = strncmp(event, "xdp", 3) == 0;
+   bool is_perf_event = strncmp(event, "perf_event", 10) == 0;
enum bpf_prog_type prog_type;
char buf[256];
int fd, efd, err, id;
@@ -69,6 +70,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
prog_type = BPF_PROG_TYPE_TRACEPOINT;
} else if (is_xdp) {
prog_type = BPF_PROG_TYPE_XDP;
+   } else if (is_perf_event) {
+   prog_type = BPF_PROG_TYPE_PERF_EVENT;
} else {
printf("Unknown event '%s'\n", event);
return -1;
@@ -82,7 +85,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 
prog_fd[prog_cnt++] = fd;
 
-   if (is_xdp)
+   if (is_xdp || is_perf_event)
return 0;
 
if (is_socket) {
@@ -326,6 +329,7 @@ int load_bpf_file(char *path)
memcmp(shname_prog, "kretprobe/", 10) == 0 ||
memcmp(shname_prog, "tracepoint/", 11) == 0 ||
memcmp(shname_prog, "xdp", 3) == 0 ||
+   memcmp(shname_prog, "perf_event", 10) == 0 ||
memcmp(shname_prog, "socket", 6) == 0)
load_and_attach(shname_prog, insns, data_prog->d_size);
}
@@ -344,6 +348,7 @@ int load_bpf_file(char *path)
memcmp(shname, "kretprobe/", 10) == 0 ||
memcmp(shname, "

[PATCH v2 net-next 3/6] bpf: perf_event progs should only use preallocated maps

2016-08-31 Thread Alexei Starovoitov
Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use
preallocated hash maps, since doing memory allocation
in overflow_handler can crash depending on where nmi got triggered.
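
For illustration, a user-space sketch of creating a hash map this check will
accept (leaving map_flags at 0 keeps all elements preallocated; passing
BPF_F_NO_PREALLOC instead would make the verifier reject the map for a
BPF_PROG_TYPE_PERF_EVENT program). The raw bpf(2) syscall is used only to
keep the example self-contained; error handling is omitted:

#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int create_prealloc_hash_map(int key_size, int value_size,
				    int max_entries)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.map_type = BPF_MAP_TYPE_HASH;
	attr.key_size = key_size;
	attr.value_size = value_size;
	attr.max_entries = max_entries;
	attr.map_flags = 0;	/* do NOT set BPF_F_NO_PREALLOC */

	return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}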

Signed-off-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 kernel/bpf/verifier.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c1c9e441f0f5..48c2705db22c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2511,6 +2511,20 @@ process_bpf_exit:
return 0;
 }
 
+static int check_map_prog_compatibility(struct bpf_map *map,
+   struct bpf_prog *prog)
+
+{
+   if (prog->type == BPF_PROG_TYPE_PERF_EVENT &&
+   (map->map_type == BPF_MAP_TYPE_HASH ||
+map->map_type == BPF_MAP_TYPE_PERCPU_HASH) &&
+   (map->map_flags & BPF_F_NO_PREALLOC)) {
+   verbose("perf_event programs can only use preallocated hash 
map\n");
+   return -EINVAL;
+   }
+   return 0;
+}
+
 /* look for pseudo eBPF instructions that access map FDs and
  * replace them with actual map pointers
  */
@@ -2518,7 +2532,7 @@ static int replace_map_fd_with_map_ptr(struct verifier_env *env)
 {
struct bpf_insn *insn = env->prog->insnsi;
int insn_cnt = env->prog->len;
-   int i, j;
+   int i, j, err;
 
for (i = 0; i < insn_cnt; i++, insn++) {
if (BPF_CLASS(insn->code) == BPF_LDX &&
@@ -2562,6 +2576,12 @@ static int replace_map_fd_with_map_ptr(struct verifier_env *env)
return PTR_ERR(map);
}
 
+   err = check_map_prog_compatibility(map, env->prog);
+   if (err) {
+   fdput(f);
+   return err;
+   }
+
/* store map pointer inside BPF_LD_IMM64 instruction */
insn[0].imm = (u32) (unsigned long) map;
insn[1].imm = ((u64) (unsigned long) map) >> 32;
-- 
2.8.0



[PATCH v2 net-next 6/6] samples/bpf: add sampleip example

2016-08-31 Thread Alexei Starovoitov
From: Brendan Gregg 

sample instruction pointer and frequency count in a BPF map

Signed-off-by: Brendan Gregg 
Signed-off-by: Alexei Starovoitov 
---
 samples/bpf/Makefile|   4 +
 samples/bpf/sampleip_kern.c |  38 +
 samples/bpf/sampleip_user.c | 196 
 3 files changed, 238 insertions(+)
 create mode 100644 samples/bpf/sampleip_kern.c
 create mode 100644 samples/bpf/sampleip_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index a69cf9045285..12b7304d55dc 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -26,6 +26,7 @@ hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
 hostprogs-y += trace_event
+hostprogs-y += sampleip
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -54,6 +55,7 @@ xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
 test_current_task_under_cgroup-objs := bpf_load.o libbpf.o \
   test_current_task_under_cgroup_user.o
 trace_event-objs := bpf_load.o libbpf.o trace_event_user.o
+sampleip-objs := bpf_load.o libbpf.o sampleip_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -82,6 +84,7 @@ always += xdp1_kern.o
 always += xdp2_kern.o
 always += test_current_task_under_cgroup_kern.o
 always += trace_event_kern.o
+always += sampleip_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -107,6 +110,7 @@ HOSTLOADLIBES_xdp1 += -lelf
 HOSTLOADLIBES_xdp2 += -lelf
 HOSTLOADLIBES_test_current_task_under_cgroup += -lelf
 HOSTLOADLIBES_trace_event += -lelf
+HOSTLOADLIBES_sampleip += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/sampleip_kern.c b/samples/bpf/sampleip_kern.c
new file mode 100644
index ..774a681f374a
--- /dev/null
+++ b/samples/bpf/sampleip_kern.c
@@ -0,0 +1,38 @@
+/* Copyright 2016 Netflix, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+#define MAX_IPS	8192
+
+struct bpf_map_def SEC("maps") ip_map = {
+   .type = BPF_MAP_TYPE_HASH,
+   .key_size = sizeof(u64),
+   .value_size = sizeof(u32),
+   .max_entries = MAX_IPS,
+};
+
+SEC("perf_event")
+int do_sample(struct bpf_perf_event_data *ctx)
+{
+   u64 ip;
+   u32 *value, init_val = 1;
+
+   ip = ctx->regs.ip;
+   value = bpf_map_lookup_elem(&ip_map, &ip);
+   if (value)
+   *value += 1;
+   else
+   /* E2BIG not tested for this example only */
+   bpf_map_update_elem(&ip_map, &ip, &init_val, BPF_NOEXIST);
+
+   return 0;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/sampleip_user.c b/samples/bpf/sampleip_user.c
new file mode 100644
index ..260a6bdd6413
--- /dev/null
+++ b/samples/bpf/sampleip_user.c
@@ -0,0 +1,196 @@
+/*
+ * sampleip: sample instruction pointer and frequency count in a BPF map.
+ *
+ * Copyright 2016 Netflix, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "libbpf.h"
+#include "bpf_load.h"
+
+#define DEFAULT_FREQ   99
+#define DEFAULT_SECS   5
+#define MAX_IPS	8192
+#define PAGE_OFFSET	0xffff880000000000
+
+static int nr_cpus;
+
+static void usage(void)
+{
+   printf("USAGE: sampleip [-F freq] [duration]\n");
+   printf("   -F freq# sample frequency (Hertz), default 99\n");
+   printf("   duration   # sampling duration (seconds), default 5\n");
+}
+
+static int sampling_start(int *pmu_fd, int freq)
+{
+   int i;
+
+   struct perf_event_attr pe_sample_attr = {
+   .type = PERF_TYPE_SOFTWARE,
+   .freq = 1,
+   .sample_period = freq,
+   .config = PERF_COUNT_SW_CPU_CLOCK,
+   .inherit = 1,
+   };
+
+   for (i = 0; i < nr_cpus; i++) {
+   pmu_fd[i] = perf_event_open(&pe_sample_attr, -1 /* pid */, i,
+   -1 /* group_fd */, 0 /* flags */);
+   if (pmu_fd[i] < 0) {
+   fprintf(stderr, "ERROR: Initializing perf sampling\n");
+   return 1;
+   }
+   assert(ioctl(pmu_fd[i],

[PATCH v2 net-next 1/6] bpf: support 8-byte metafield access

2016-08-31 Thread Alexei Starovoitov
The verifier supported only 4-byte metafields in
struct __sk_buff and struct xdp_md. The metafields in upcoming
struct bpf_perf_event are 8-byte to match register width in struct pt_regs.
Teach verifier to recognize 8-byte metafield access.
The patch doesn't affect safety of sockets and xdp programs.
They check for 4-byte only ctx access before these conditions are hit.
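
Illustration only (not part of the patch): the 8-byte accesses the verifier
now has to recognize come from ordinary C reads of a __u64 ctx field. With
the bpf_perf_event_data context introduced later in this series, LLVM emits
a BPF_LDX | BPF_MEM | BPF_DW instruction for the load below, which is exactly
the case do_check() and convert_ctx_accesses() learn to handle here:

#include <uapi/linux/bpf.h>
#include <uapi/linux/bpf_perf_event.h>
#include "bpf_helpers.h"

SEC("perf_event")
int period_check(struct bpf_perf_event_data *ctx)
{
	/* ctx->sample_period is __u64: an 8-byte (BPF_DW) ctx load */
	if (ctx->sample_period < 10000)
		return 0;
	return 1;
}
char _license[] SEC("license") = "GPL";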

Signed-off-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 kernel/bpf/verifier.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index abb61f3f6900..c1c9e441f0f5 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2333,7 +2333,8 @@ static int do_check(struct verifier_env *env)
if (err)
return err;
 
-   if (BPF_SIZE(insn->code) != BPF_W) {
+   if (BPF_SIZE(insn->code) != BPF_W &&
+   BPF_SIZE(insn->code) != BPF_DW) {
insn_idx++;
continue;
}
@@ -2642,9 +2643,11 @@ static int convert_ctx_accesses(struct verifier_env *env)
for (i = 0; i < insn_cnt; i++, insn++) {
u32 insn_delta, cnt;
 
-   if (insn->code == (BPF_LDX | BPF_MEM | BPF_W))
+   if (insn->code == (BPF_LDX | BPF_MEM | BPF_W) ||
+   insn->code == (BPF_LDX | BPF_MEM | BPF_DW))
type = BPF_READ;
-   else if (insn->code == (BPF_STX | BPF_MEM | BPF_W))
+   else if (insn->code == (BPF_STX | BPF_MEM | BPF_W) ||
+insn->code == (BPF_STX | BPF_MEM | BPF_DW))
type = BPF_WRITE;
else
continue;
-- 
2.8.0



Re: [PATCH v2 net-next 4/6] perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs

2016-09-01 Thread Alexei Starovoitov
On Thu, Sep 01, 2016 at 10:12:51AM +0200, Peter Zijlstra wrote:
> On Wed, Aug 31, 2016 at 02:50:41PM -0700, Alexei Starovoitov wrote:
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index 97bfe62f30d7..dcaaaf3ec8e6 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -679,6 +679,8 @@ struct perf_event {
> > u64 (*clock)(void);
> > perf_overflow_handler_t overflow_handler;
> > void*overflow_handler_context;
> 
> > +   perf_overflow_handler_t orig_overflow_handler;
> > +   struct bpf_prog *prog;
> 
> Should we put that under CONFIG_BPF_SYSCALL too?

sure. will do.

> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 3cfabdf7b942..305433ab2447 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> 
> > @@ -7637,11 +7637,75 @@ static void perf_event_free_filter(struct perf_event *event)
> > ftrace_profile_free_filter(event);
> >  }
> >  
> > +static void bpf_overflow_handler(struct perf_event *event,
> > +struct perf_sample_data *data,
> > +struct pt_regs *regs)
> > +{
> > +   struct bpf_perf_event_data_kern ctx = {
> > +   .data = data,
> > +   .regs = regs,
> > +   };
> > +   int ret = 0;
> > +
> > +#ifdef CONFIG_BPF_SYSCALL
> > +   preempt_disable();
> > +   if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
> > +   goto out;
> > +   rcu_read_lock();
> > +   ret = BPF_PROG_RUN(event->prog, (void *)&ctx);
> > +   rcu_read_unlock();
> > + out:
> 
> Please, no leading space before labels. Use something like:

good catch.
sadly checkpatch didn't complain.
 
> [diff "default"]
> xfuncname = "^[[:alpha:]$_].*[^:]$"
> 
> In your .gitconfig if you want to keep diff output 'sane'.

interesting trick. don't remember being bitten by it.
This extra space was a typo.

> > +   __this_cpu_dec(bpf_prog_active);
> > +   preempt_enable();
> > +#endif
> > +   if (!ret)
> > +   return;
> > +
> > +   event->orig_overflow_handler(event, data, regs);
> > +}
> 
> Other than that, ACK.

Thanks!



[PATCH v3 net-next 0/6] perf, bpf: add support for bpf in sw/hw perf_events

2016-09-01 Thread Alexei Starovoitov
Hi Peter, Dave,

this patch set is a follow up to the discussion:
https://lkml.kernel.org/r/20160804142853.GO6862%20()%20twins%20!%20programming%20!%20kicks-ass%20!%20net
It turned out to be simpler than what we discussed.

Patches 1-3 are bpf-side prep for the main patch 4
that adds bpf program as an overflow_handler to sw and hw perf_events.

Patches 5 and 6 are examples from myself and Brendan.

Peter,
to implement your suggestion to add ifdef CONFIG_BPF_SYSCALL
inside struct perf_event, I had to shuffle ifdefs in events/core.c
Please double check whether that is what you wanted to see.

v2->v3: fixed few more minor issues
v1->v2: fixed issues spotted by Peter and Daniel.

Thanks!

Alexei Starovoitov (5):
  bpf: support 8-byte metafield access
  bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type
  bpf: perf_event progs should only use preallocated maps
  perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT
programs
  samples/bpf: add perf_event+bpf example

Brendan Gregg (1):
  samples/bpf: add sampleip example

 include/linux/bpf.h |   4 +
 include/linux/perf_event.h  |   9 ++
 include/uapi/linux/Kbuild   |   1 +
 include/uapi/linux/bpf.h|   1 +
 include/uapi/linux/bpf_perf_event.h |  18 +++
 kernel/bpf/verifier.c   |  31 +-
 kernel/events/core.c|  89 ++-
 kernel/trace/bpf_trace.c|  61 +++
 samples/bpf/Makefile|   8 ++
 samples/bpf/bpf_helpers.h   |   2 +
 samples/bpf/bpf_load.c  |   7 +-
 samples/bpf/sampleip_kern.c |  38 +++
 samples/bpf/sampleip_user.c | 196 +
 samples/bpf/trace_event_kern.c  |  65 +++
 samples/bpf/trace_event_user.c  | 213 
 15 files changed, 737 insertions(+), 6 deletions(-)
 create mode 100644 include/uapi/linux/bpf_perf_event.h
 create mode 100644 samples/bpf/sampleip_kern.c
 create mode 100644 samples/bpf/sampleip_user.c
 create mode 100644 samples/bpf/trace_event_kern.c
 create mode 100644 samples/bpf/trace_event_user.c

-- 
2.8.0



[PATCH v3 net-next 1/6] bpf: support 8-byte metafield access

2016-09-01 Thread Alexei Starovoitov
The verifier supported only 4-byte metafields in
struct __sk_buff and struct xdp_md. The metafields in upcoming
struct bpf_perf_event are 8-byte to match register width in struct pt_regs.
Teach verifier to recognize 8-byte metafield access.
The patch doesn't affect safety of sockets and xdp programs.
They check for 4-byte only ctx access before these conditions are hit.

Signed-off-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 kernel/bpf/verifier.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index abb61f3f6900..c1c9e441f0f5 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2333,7 +2333,8 @@ static int do_check(struct verifier_env *env)
if (err)
return err;
 
-   if (BPF_SIZE(insn->code) != BPF_W) {
+   if (BPF_SIZE(insn->code) != BPF_W &&
+   BPF_SIZE(insn->code) != BPF_DW) {
insn_idx++;
continue;
}
@@ -2642,9 +2643,11 @@ static int convert_ctx_accesses(struct verifier_env *env)
for (i = 0; i < insn_cnt; i++, insn++) {
u32 insn_delta, cnt;
 
-   if (insn->code == (BPF_LDX | BPF_MEM | BPF_W))
+   if (insn->code == (BPF_LDX | BPF_MEM | BPF_W) ||
+   insn->code == (BPF_LDX | BPF_MEM | BPF_DW))
type = BPF_READ;
-   else if (insn->code == (BPF_STX | BPF_MEM | BPF_W))
+   else if (insn->code == (BPF_STX | BPF_MEM | BPF_W) ||
+insn->code == (BPF_STX | BPF_MEM | BPF_DW))
type = BPF_WRITE;
else
continue;
-- 
2.8.0



[PATCH v3 net-next 2/6] bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type

2016-09-01 Thread Alexei Starovoitov
Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to
HW and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE
correspondingly in uapi/linux/perf_event.h)

The program visible context meta structure is
struct bpf_perf_event_data {
struct pt_regs regs;
 __u64 sample_period;
};
which is accessible directly from the program:
int bpf_prog(struct bpf_perf_event_data *ctx)
{
  ... ctx->sample_period ...
  ... ctx->regs.ip ...
}

The bpf verifier rewrites the accesses into kernel internal
struct bpf_perf_event_data_kern which allows changing
struct perf_sample_data without affecting bpf programs.
New fields can be added to the end of struct bpf_perf_event_data
in the future.

Signed-off-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 include/linux/perf_event.h  |  5 +++
 include/uapi/linux/Kbuild   |  1 +
 include/uapi/linux/bpf.h|  1 +
 include/uapi/linux/bpf_perf_event.h | 18 +++
 kernel/trace/bpf_trace.c| 61 +
 5 files changed, 86 insertions(+)
 create mode 100644 include/uapi/linux/bpf_perf_event.h

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2b6b43cc0dd5..97bfe62f30d7 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -788,6 +788,11 @@ struct perf_output_handle {
int page;
 };
 
+struct bpf_perf_event_data_kern {
+   struct pt_regs *regs;
+   struct perf_sample_data *data;
+};
+
 #ifdef CONFIG_CGROUP_PERF
 
 /*
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 185f8ea2702f..d0352a971ebd 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -71,6 +71,7 @@ header-y += binfmts.h
 header-y += blkpg.h
 header-y += blktrace_api.h
 header-y += bpf_common.h
+header-y += bpf_perf_event.h
 header-y += bpf.h
 header-y += bpqether.h
 header-y += bsg.h
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e4c5a1baa993..f896dfac4ac0 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -95,6 +95,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SCHED_ACT,
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
+   BPF_PROG_TYPE_PERF_EVENT,
 };
 
 #define BPF_PSEUDO_MAP_FD  1
diff --git a/include/uapi/linux/bpf_perf_event.h b/include/uapi/linux/bpf_perf_event.h
new file mode 100644
index ..067427259820
--- /dev/null
+++ b/include/uapi/linux/bpf_perf_event.h
@@ -0,0 +1,18 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _UAPI__LINUX_BPF_PERF_EVENT_H__
+#define _UAPI__LINUX_BPF_PERF_EVENT_H__
+
+#include 
+#include 
+
+struct bpf_perf_event_data {
+   struct pt_regs regs;
+   __u64 sample_period;
+};
+
+#endif /* _UAPI__LINUX_BPF_PERF_EVENT_H__ */
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ad35213b8405..d3869b03d9fe 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1,4 +1,5 @@
 /* Copyright (c) 2011-2015 PLUMgrid, http://plumgrid.com
+ * Copyright (c) 2016 Facebook
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of version 2 of the GNU General Public
@@ -8,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -552,10 +554,69 @@ static struct bpf_prog_type_list tracepoint_tl = {
.type   = BPF_PROG_TYPE_TRACEPOINT,
 };
 
+static bool pe_prog_is_valid_access(int off, int size, enum bpf_access_type type,
+   enum bpf_reg_type *reg_type)
+{
+   if (off < 0 || off >= sizeof(struct bpf_perf_event_data))
+   return false;
+   if (type != BPF_READ)
+   return false;
+   if (off % size != 0)
+   return false;
+   if (off == offsetof(struct bpf_perf_event_data, sample_period)) {
+   if (size != sizeof(u64))
+   return false;
+   } else {
+   if (size != sizeof(long))
+   return false;
+   }
+   return true;
+}
+
+static u32 pe_prog_convert_ctx_access(enum bpf_access_type type, int dst_reg,
+ int src_reg, int ctx_off,
+ struct bpf_insn *insn_buf,
+ struct bpf_prog *prog)
+{
+   struct bpf_insn *insn = insn_buf;
+
+   switch (ctx_off) {
+   case offsetof(struct bpf_perf_event_data, sample_period):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct perf_sample_data, period) != sizeof(u64));
+   *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct bpf_perf_event_data_kern, data)),
+ dst_reg, src_reg,
+ 

[PATCH v3 net-next 5/6] samples/bpf: add perf_event+bpf example

2016-09-01 Thread Alexei Starovoitov
The bpf program is called 50 times a second and does
hashmap[kern&user_stackid]++.
Its primary purpose is to check that key bpf helpers like map lookup, update,
get_stackid, trace_printk and ctx access are all working.
It checks:
- PERF_COUNT_HW_CPU_CYCLES on all cpus
- PERF_COUNT_HW_CPU_CYCLES for current process and inherited perf_events to 
children
- PERF_COUNT_SW_CPU_CLOCK on all cpus
- PERF_COUNT_SW_CPU_CLOCK for current process

Signed-off-by: Alexei Starovoitov 
---
 samples/bpf/Makefile   |   4 +
 samples/bpf/bpf_helpers.h  |   2 +
 samples/bpf/bpf_load.c |   7 +-
 samples/bpf/trace_event_kern.c |  65 +
 samples/bpf/trace_event_user.c | 213 +
 5 files changed, 290 insertions(+), 1 deletion(-)
 create mode 100644 samples/bpf/trace_event_kern.c
 create mode 100644 samples/bpf/trace_event_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index db3cb061bfcd..a69cf9045285 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -25,6 +25,7 @@ hostprogs-y += test_cgrp2_array_pin
 hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
+hostprogs-y += trace_event
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -52,6 +53,7 @@ xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
 test_current_task_under_cgroup-objs := bpf_load.o libbpf.o \
   test_current_task_under_cgroup_user.o
+trace_event-objs := bpf_load.o libbpf.o trace_event_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -79,6 +81,7 @@ always += test_cgrp2_tc_kern.o
 always += xdp1_kern.o
 always += xdp2_kern.o
 always += test_current_task_under_cgroup_kern.o
+always += trace_event_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -103,6 +106,7 @@ HOSTLOADLIBES_test_overhead += -lelf -lrt
 HOSTLOADLIBES_xdp1 += -lelf
 HOSTLOADLIBES_xdp2 += -lelf
 HOSTLOADLIBES_test_current_task_under_cgroup += -lelf
+HOSTLOADLIBES_trace_event += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index bbdf62a1e45e..90f44bd2045e 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -55,6 +55,8 @@ static int (*bpf_skb_get_tunnel_opt)(void *ctx, void *md, int size) =
(void *) BPF_FUNC_skb_get_tunnel_opt;
 static int (*bpf_skb_set_tunnel_opt)(void *ctx, void *md, int size) =
(void *) BPF_FUNC_skb_set_tunnel_opt;
+static unsigned long long (*bpf_get_prandom_u32)(void) =
+   (void *) BPF_FUNC_get_prandom_u32;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 0cfda2320320..97913e109b14 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -51,6 +51,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
bool is_xdp = strncmp(event, "xdp", 3) == 0;
+   bool is_perf_event = strncmp(event, "perf_event", 10) == 0;
enum bpf_prog_type prog_type;
char buf[256];
int fd, efd, err, id;
@@ -69,6 +70,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
prog_type = BPF_PROG_TYPE_TRACEPOINT;
} else if (is_xdp) {
prog_type = BPF_PROG_TYPE_XDP;
+   } else if (is_perf_event) {
+   prog_type = BPF_PROG_TYPE_PERF_EVENT;
} else {
printf("Unknown event '%s'\n", event);
return -1;
@@ -82,7 +85,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 
prog_fd[prog_cnt++] = fd;
 
-   if (is_xdp)
+   if (is_xdp || is_perf_event)
return 0;
 
if (is_socket) {
@@ -326,6 +329,7 @@ int load_bpf_file(char *path)
memcmp(shname_prog, "kretprobe/", 10) == 0 ||
memcmp(shname_prog, "tracepoint/", 11) == 0 ||
memcmp(shname_prog, "xdp", 3) == 0 ||
+   memcmp(shname_prog, "perf_event", 10) == 0 ||
memcmp(shname_prog, "socket", 6) == 0)
load_and_attach(shname_prog, insns, data_prog->d_size);
}
@@ -344,6 +348,7 @@ int load_bpf_file(char *path)
memcmp(shname, "kretprobe/", 10) == 0 ||
memcmp(shname, "

[PATCH v3 net-next 4/6] perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs

2016-09-01 Thread Alexei Starovoitov
Allow attaching BPF_PROG_TYPE_PERF_EVENT programs to sw and hw perf events
via the overflow_handler mechanism.
When a program is attached, the overflow handlers become stacked.
The program acts as a filter.
Returning zero from the program means that the normal perf_event_output handler
will not be called and the sampling event won't be stored in the ring buffer.

The overflow_handler_context==NULL is an additional safety check
to make sure programs are not attached to hw breakpoints and watchdog
in case other checks (that prevent that now anyway) get accidentally
relaxed in the future.

The program refcnt is incremented in case perf_events are inherited
when the target task is forked.
Similar to kprobe and tracepoint programs, there is no ioctl to
detach the program or swap an already attached program. User space is
expected to close(perf_event_fd) like it does right now for kprobe+bpf.
That restriction simplifies the code quite a bit.

The invocation of overflow_handler in __perf_event_overflow() is now
done via READ_ONCE, since that pointer can be replaced when the program
is attached while perf_event itself could have been active already.
There is no need to do similar treatment for event->prog, since it's
assigned only once before it's accessed.
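
A minimal sketch of such a filter, assuming x86_64 (regs.ip) and an
illustrative kernel-address boundary; it is not one of the samples added by
this series. Returning 0 suppresses the normal output for kernel-mode
samples, returning non-zero falls through to the original overflow handler:

#include <uapi/linux/bpf.h>
#include <uapi/linux/bpf_perf_event.h>
#include "bpf_helpers.h"

#define KERN_ADDR_START 0xffff880000000000ULL	/* assumed boundary, x86_64 */

SEC("perf_event")
int drop_kernel_samples(struct bpf_perf_event_data *ctx)
{
	if (ctx->regs.ip >= KERN_ADDR_START)
		return 0;	/* filtered: sample not written to the ring buffer */
	return 1;		/* orig_overflow_handler() runs as usual */
}
char _license[] SEC("license") = "GPL";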

Signed-off-by: Alexei Starovoitov 
---
 include/linux/bpf.h|  4 +++
 include/linux/perf_event.h |  4 +++
 kernel/events/core.c   | 89 +-
 3 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 11134238417d..9a904f63f8c1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -297,6 +297,10 @@ static inline struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i)
 static inline void bpf_prog_put(struct bpf_prog *prog)
 {
 }
+static inline struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog)
+{
+   return ERR_PTR(-EOPNOTSUPP);
+}
 #endif /* CONFIG_BPF_SYSCALL */
 
 /* verifier prototypes for helper functions called from eBPF programs */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 97bfe62f30d7..ccb73a58113d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -679,6 +679,10 @@ struct perf_event {
u64 (*clock)(void);
perf_overflow_handler_t overflow_handler;
void*overflow_handler_context;
+#ifdef CONFIG_BPF_SYSCALL
+   perf_overflow_handler_t orig_overflow_handler;
+   struct bpf_prog *prog;
+#endif
 
 #ifdef CONFIG_EVENT_TRACING
struct trace_event_call *tp_event;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3cfabdf7b942..85bf4c37911f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7022,7 +7022,7 @@ static int __perf_event_overflow(struct perf_event *event,
irq_work_queue(&event->pending);
}
 
-   event->overflow_handler(event, data, regs);
+   READ_ONCE(event->overflow_handler)(event, data, regs);
 
if (*perf_event_fasync(event) && event->pending_kill) {
event->pending_wakeup = 1;
@@ -7637,11 +7637,83 @@ static void perf_event_free_filter(struct perf_event *event)
ftrace_profile_free_filter(event);
 }
 
+#ifdef CONFIG_BPF_SYSCALL
+static void bpf_overflow_handler(struct perf_event *event,
+struct perf_sample_data *data,
+struct pt_regs *regs)
+{
+   struct bpf_perf_event_data_kern ctx = {
+   .data = data,
+   .regs = regs,
+   };
+   int ret = 0;
+
+   preempt_disable();
+   if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
+   goto out;
+   rcu_read_lock();
+   ret = BPF_PROG_RUN(event->prog, (void *)&ctx);
+   rcu_read_unlock();
+out:
+   __this_cpu_dec(bpf_prog_active);
+   preempt_enable();
+   if (!ret)
+   return;
+
+   event->orig_overflow_handler(event, data, regs);
+}
+
+static int perf_event_set_bpf_handler(struct perf_event *event, u32 prog_fd)
+{
+   struct bpf_prog *prog;
+
+   if (event->overflow_handler_context)
+   /* hw breakpoint or kernel counter */
+   return -EINVAL;
+
+   if (event->prog)
+   return -EEXIST;
+
+   prog = bpf_prog_get_type(prog_fd, BPF_PROG_TYPE_PERF_EVENT);
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+
+   event->prog = prog;
+   event->orig_overflow_handler = READ_ONCE(event->overflow_handler);
+   WRITE_ONCE(event->overflow_handler, bpf_overflow_handler);
+   return 0;
+}
+
+static void perf_event_free_bpf_handler(struct perf_event *event)
+{
+   struct bpf_prog *prog = event->prog;
+
+   if (!prog)
+   return;
+
+   WRITE_ONCE(event->overflow_handler, event->orig_overflow_h

[PATCH v3 net-next 6/6] samples/bpf: add sampleip example

2016-09-01 Thread Alexei Starovoitov
From: Brendan Gregg 

sample instruction pointer and frequency count in a BPF map

Signed-off-by: Brendan Gregg 
Signed-off-by: Alexei Starovoitov 
---
 samples/bpf/Makefile|   4 +
 samples/bpf/sampleip_kern.c |  38 +
 samples/bpf/sampleip_user.c | 196 
 3 files changed, 238 insertions(+)
 create mode 100644 samples/bpf/sampleip_kern.c
 create mode 100644 samples/bpf/sampleip_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index a69cf9045285..12b7304d55dc 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -26,6 +26,7 @@ hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
 hostprogs-y += trace_event
+hostprogs-y += sampleip
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -54,6 +55,7 @@ xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
 test_current_task_under_cgroup-objs := bpf_load.o libbpf.o \
   test_current_task_under_cgroup_user.o
 trace_event-objs := bpf_load.o libbpf.o trace_event_user.o
+sampleip-objs := bpf_load.o libbpf.o sampleip_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -82,6 +84,7 @@ always += xdp1_kern.o
 always += xdp2_kern.o
 always += test_current_task_under_cgroup_kern.o
 always += trace_event_kern.o
+always += sampleip_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -107,6 +110,7 @@ HOSTLOADLIBES_xdp1 += -lelf
 HOSTLOADLIBES_xdp2 += -lelf
 HOSTLOADLIBES_test_current_task_under_cgroup += -lelf
 HOSTLOADLIBES_trace_event += -lelf
+HOSTLOADLIBES_sampleip += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/sampleip_kern.c b/samples/bpf/sampleip_kern.c
new file mode 100644
index ..774a681f374a
--- /dev/null
+++ b/samples/bpf/sampleip_kern.c
@@ -0,0 +1,38 @@
+/* Copyright 2016 Netflix, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+#define MAX_IPS	8192
+
+struct bpf_map_def SEC("maps") ip_map = {
+   .type = BPF_MAP_TYPE_HASH,
+   .key_size = sizeof(u64),
+   .value_size = sizeof(u32),
+   .max_entries = MAX_IPS,
+};
+
+SEC("perf_event")
+int do_sample(struct bpf_perf_event_data *ctx)
+{
+   u64 ip;
+   u32 *value, init_val = 1;
+
+   ip = ctx->regs.ip;
+   value = bpf_map_lookup_elem(&ip_map, &ip);
+   if (value)
+   *value += 1;
+   else
+   /* E2BIG not tested for this example only */
+   bpf_map_update_elem(&ip_map, &ip, &init_val, BPF_NOEXIST);
+
+   return 0;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/sampleip_user.c b/samples/bpf/sampleip_user.c
new file mode 100644
index ..260a6bdd6413
--- /dev/null
+++ b/samples/bpf/sampleip_user.c
@@ -0,0 +1,196 @@
+/*
+ * sampleip: sample instruction pointer and frequency count in a BPF map.
+ *
+ * Copyright 2016 Netflix, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "libbpf.h"
+#include "bpf_load.h"
+
+#define DEFAULT_FREQ   99
+#define DEFAULT_SECS   5
+#define MAX_IPS	8192
+#define PAGE_OFFSET	0xffff880000000000
+
+static int nr_cpus;
+
+static void usage(void)
+{
+   printf("USAGE: sampleip [-F freq] [duration]\n");
+   printf("   -F freq# sample frequency (Hertz), default 99\n");
+   printf("   duration   # sampling duration (seconds), default 5\n");
+}
+
+static int sampling_start(int *pmu_fd, int freq)
+{
+   int i;
+
+   struct perf_event_attr pe_sample_attr = {
+   .type = PERF_TYPE_SOFTWARE,
+   .freq = 1,
+   .sample_period = freq,
+   .config = PERF_COUNT_SW_CPU_CLOCK,
+   .inherit = 1,
+   };
+
+   for (i = 0; i < nr_cpus; i++) {
+   pmu_fd[i] = perf_event_open(&pe_sample_attr, -1 /* pid */, i,
+   -1 /* group_fd */, 0 /* flags */);
+   if (pmu_fd[i] < 0) {
+   fprintf(stderr, "ERROR: Initializing perf sampling\n");
+   return 1;
+   }
+   assert(ioctl(pmu_fd[i],

[PATCH v3 net-next 3/6] bpf: perf_event progs should only use preallocated maps

2016-09-01 Thread Alexei Starovoitov
Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use
preallocated hash maps, since doing memory allocation
in overflow_handler can crash depending on where nmi got triggered.

Signed-off-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 kernel/bpf/verifier.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c1c9e441f0f5..48c2705db22c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2511,6 +2511,20 @@ static int do_check(struct verifier_env *env)
return 0;
 }
 
+static int check_map_prog_compatibility(struct bpf_map *map,
+   struct bpf_prog *prog)
+
+{
+   if (prog->type == BPF_PROG_TYPE_PERF_EVENT &&
+   (map->map_type == BPF_MAP_TYPE_HASH ||
+map->map_type == BPF_MAP_TYPE_PERCPU_HASH) &&
+   (map->map_flags & BPF_F_NO_PREALLOC)) {
+   verbose("perf_event programs can only use preallocated hash 
map\n");
+   return -EINVAL;
+   }
+   return 0;
+}
+
 /* look for pseudo eBPF instructions that access map FDs and
  * replace them with actual map pointers
  */
@@ -2518,7 +2532,7 @@ static int replace_map_fd_with_map_ptr(struct verifier_env *env)
 {
struct bpf_insn *insn = env->prog->insnsi;
int insn_cnt = env->prog->len;
-   int i, j;
+   int i, j, err;
 
for (i = 0; i < insn_cnt; i++, insn++) {
if (BPF_CLASS(insn->code) == BPF_LDX &&
@@ -2562,6 +2576,12 @@ static int replace_map_fd_with_map_ptr(struct verifier_env *env)
return PTR_ERR(map);
}
 
+   err = check_map_prog_compatibility(map, env->prog);
+   if (err) {
+   fdput(f);
+   return err;
+   }
+
/* store map pointer inside BPF_LD_IMM64 instruction */
insn[0].imm = (u32) (unsigned long) map;
insn[1].imm = ((u64) (unsigned long) map) >> 32;
-- 
2.8.0



Re: [RFC V2 PATCH 00/25] Kernel NET policy

2016-08-04 Thread Alexei Starovoitov
On Wed, Dec 31, 2014 at 08:38:49PM -0500, kan.li...@intel.com wrote:
> 
> Changes since V1:
>  - Using work queue to set Rx network flow classification rules and search
>available NET policy object asynchronously.
>  - Using RCU lock to replace read-write lock
>  - Redo performance test and update performance results.
>  - Some minor modification for codes and documents.
>  - Remove i40e related patches which will be submitted in separate thread.

Most of the issues brought up in the prior submission were not addressed,
so one more NACK from me as well.
My objection with this approach is the same as others:
such policy doesn't belong in the kernel.

>  1. Why userspace tool cannot do the same thing?
> A: Kernel is more suitable for NET policy.
>- User space code would be far more complicated to get right and 
> perform
>  well . It always need to work with out of date state compared to the
>  latest, because it cannot do any locking with the kernel state.
>- User space code is less efficient than kernel code, because of the
>  additional context switches needed.
>- Kernel is in the right position to coordinate requests from multiple
>  users.

and the above excuses are the reason to hack flow director rules into the kernel?
You can do the same in user space. It's not a kernel job.



Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

2016-08-04 Thread Alexei Starovoitov
On Thu, Aug 04, 2016 at 04:28:53PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote:
> 
> > As for pmu tracepoints: if I were to instrument it (although I wasn't
> > planning to), I'd put a tracepoint in perf_event_overflow() called
> > "perf:perf_overflow", with the same arguments. That could then be used
> > for all PMU overflow events, without needing to add specific
> > tracepoints. 
> 
> Could we not teach BPF to replace event->overflow_handler and inject
> itself there?
> 
> We don't currently have nice interfaces for doing that, but it should be
> possible to do I think. We already have the indirect function call, so
> injecting ourself there has 0 overhead.

you're right. It all makes sense. I guess I was too lazy to look into
how to do it properly. Adding a tracepoint looked like a quick and
easy way to achieve the same thing.
As far as the api goes, probably the existing IOC_SET_BPF ioctl will do too.
Currently overflow_handler is set at event alloc time. If we start
changing it on the fly with atomic xchg(), afaik things shouldn't
break, since each overflow_handler is run to completion and doesn't
change global state, right?
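
A sketch of what the ioctl-based attach discussed above could look like from
user space, reusing the existing PERF_EVENT_IOC_SET_BPF ioctl on a sampling
software event (error handling trimmed; prog_fd is assumed to be an already
loaded program of the new type being discussed):

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int attach_bpf_to_cpu_clock(int prog_fd, int cpu)
{
	struct perf_event_attr attr = {
		.type = PERF_TYPE_SOFTWARE,
		.config = PERF_COUNT_SW_CPU_CLOCK,
		.freq = 1,
		.sample_freq = 99,	/* 99 Hz sampling */
	};
	int pe_fd;

	pe_fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */, cpu,
			-1 /* group_fd */, 0 /* flags */);
	if (pe_fd < 0)
		return -1;

	ioctl(pe_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
	ioctl(pe_fd, PERF_EVENT_IOC_ENABLE, 0);
	return pe_fd;
}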



Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

2016-08-04 Thread Alexei Starovoitov
On Thu, Aug 04, 2016 at 09:13:16PM -0700, Brendan Gregg wrote:
> On Thu, Aug 4, 2016 at 6:43 PM, Alexei Starovoitov
>  wrote:
> > On Thu, Aug 04, 2016 at 04:28:53PM +0200, Peter Zijlstra wrote:
> >> On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote:
> >>
> >> > As for pmu tracepoints: if I were to instrument it (although I wasn't
> >> > planning to), I'd put a tracepoint in perf_event_overflow() called
> >> > "perf:perf_overflow", with the same arguments. That could then be used
> >> > for all PMU overflow events, without needing to add specific
> >> > tracepoints.
> >>
> >> Could we not teach BPF to replace event->overflow_handler and inject
> >> itself there?
> >>
> >> We don't currently have nice interfaces for doing that, but it should be
> >> possible to do I think. We already have the indirect function call, so
> >> injecting ourself there has 0 overhead.
> 
> Sounds like a good idea, especially for things like struct
> file_operations so that we can statically instrument file system
> read/writes with zero non-enabled overhead, and not worry about high
> frequency workloads (>10M events/sec).
> 
> These perf probes aren't high frequency, though, and the code is not
> normally in use, so overhead should be much less of a concern.
> Sampling at 999 Hertz * CPUs is as frequent as I'd go. And if the
> tracepoint code is still adding a mem read, conditional, and branch,
> then that's not many instructions, especially considering the normal
> use case of these perf functions: creating records and writing to a
> perf ring buffer, then picking that up in user space by perf, then
> either processing it live or writing to perf.data, back to the file
> system, etc. It would be hard to benchmark the effect of adding a few
> instructions to that path (and any results may be more sensitive to
> cache line placement than the instructions).

tracepoints are actually zero overhead already via static-key mechanism.
I don't think Peter's objection for the tracepoint was due to overhead.

> The perf:perf_hrtimer probe point is also reading state mid-way
> through a function, so it's not quite as simple as wrapping the
> function pointer. I do like that idea, though, but for things like
> struct file_operations.
> 
> >
> > you're right. All makes sense. I guess I was too lazy to look into
> > how to do it properly. Adding a tracepoint looked like quick and
> > easy way to achieve the same.
> > As far as api goes probably existing IOC_SET_BPF ioctl will do too.
> > Currently overflow_handler is set at event alloc time. If we start
> > changing it on the fly with atomic xchg(), afaik things shouldn't
> > break, since each overflow_handler is run to completion and doesn't
> > change global state, right?
> >
> 
> How would it be implemented? I was thinking of adding explicit wrappers, eg:

instead of adding a tracepoint to perf_swevent_hrtimer we can replace
overflow_handler for that particular event with some form of bpf wrapper
(probably a new bpf program type). Then not only periodic events
will be triggering the bpf prog, but pmu events as well.
So instead of the normal __perf_event_output() writing into the ring buffer,
a bpf prog will be called that can optionally write into a different
rb via bpf_perf_event_output. The question is what to pass into the
program to make the most use out of it. 'struct pt_regs' is a done deal,
but perf_sample_data we cannot pass as-is, since it's kernel internal.
Probably something similar to the __sk_buff mirror would be needed.
Another nice benefit of doing this via overflow_handler instead of a tracepoint
is that the exclude_idle, exclude_user, exclude_kernel flags of the perf event
will all magically work and the program will be event specific.
So two parallel 'perf record'-like sampling sessions won't conflict.
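
A rough sketch of that idea, using names from what this eventually became
(the bpf_perf_event_data context and the "perf_event" section come from the
later patch set; the map name and payload are illustrative). The program
sidesteps the default output and emits only the instruction pointer into its
own perf ring buffer:

#include <uapi/linux/bpf.h>
#include <uapi/linux/bpf_perf_event.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") my_rb = {
	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size = sizeof(int),
	.value_size = sizeof(u32),
	.max_entries = 64,	/* >= number of possible CPUs */
};

SEC("perf_event")
int sample_ip_to_own_rb(struct bpf_perf_event_data *ctx)
{
	u32 cpu = bpf_get_smp_processor_id();
	u64 ip = ctx->regs.ip;

	bpf_perf_event_output(ctx, &my_rb, cpu, &ip, sizeof(ip));
	return 0;	/* suppress the default __perf_event_output() path */
}
char _license[] SEC("license") = "GPL";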



Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

2016-08-05 Thread Alexei Starovoitov
On Fri, Aug 05, 2016 at 12:52:09PM +0200, Peter Zijlstra wrote:
> > > > Currently overflow_handler is set at event alloc time. If we start
> > > > changing it on the fly with atomic xchg(), afaik things shouldn't
> > > > break, since each overflow_handler is run to completion and doesn't
> > > > change global state, right?
> 
> Yes, or even a simple WRITE_ONCE() to replace it, as long as we make
> sure to use a READ_ONCE() to load the pointer.
> 
> As long as we're sure to limit this poking to a single user its fairly
> simple to get right. The moment there can be concurrency a lot of fail
> can happen.

agreed.

> > So instead of normal __perf_event_output() writing into ringbuffer,
> > a bpf prog will be called that can optionally write into different
> > rb via bpf_perf_event_output. 
> 
> It could even chain and call into the original function once its done
> and have both outputs.

interesting idea. makes sense.
Also, thinking about concurrency and the need to remember the original
handler somewhere, would it be a cleaner api to add a bit to perf_event_attr
and use attr.config1 as bpf_fd?
Then perf_event_open at event creation time will use the bpf prog as
overflow_handler. That solves the concurrency concerns and potential semantic
issues if we go with the ioctl() approach.
Like if we perf_event_open() an event for a task, then bpf attach to it,
what are child tasks and the corresponding inherited events supposed to do?
Inherit overflow_handler, right? But then detach of bpf in the parent
is supposed to clear it in inherited events as well. A bit complicated.
I guess we can define it that way.
Just seems easier to do bpf attach at perf_event_open time only.

> > The question is what to pass into the
> > program to make the most use out of it. 'struct pt_regs' is done deal.
> > but perf_sample_data we cannot pass as-is, since it's kernel internal.
> 
> Urgh, does it have to be stable API? Can't we simply rely on the kernel
> headers to provide the right structure definition?

yes we can. The concern is about assumptions people will make about
perf_sample_data and the speed of access to it. From bpf program point
of view the pointer to perf_sample_data will be opaque unsafe pointer,
so any access to fields would have to be done via bpf_probe_read which
has non-trivial overhead.
If we go with the uapi mirror of perf_sample_data approach, it will be
fast, since mirror is not an actual struct. Like the 'struct __sk_buff' we
have in uapi/linux/bpf.h is a meta structure. It's not allocated anywhere
and no fields are copied. When bpf program does 'skb->vlan_present'
the verifier rewrites it at load time into corresponding access to
kernel internal 'struct sk_buff' fields with bitmask, shifts and such.
For this case we can define something like
struct bpf_perf_sample_data {
  __u64 period;
};
then the bpf prog will only be able to access that single field, which the verifier
will translate into 'data->period' where data is 'struct perf_sample_data *'
Later we can add other fields if necessary. The kernel is free to mess
around with perf_sample_data whichever way without impacting bpf progs.



Re: [RFC v2 09/10] landlock: Handle cgroups

2016-08-25 Thread Alexei Starovoitov
On Thu, Aug 25, 2016 at 12:32:44PM +0200, Mickaël Salaün wrote:
> Add an eBPF function bpf_landlock_cmp_cgroup_beneath(opt, map, map_op)
> to compare the current process cgroup with a cgroup handle, The handle
> can match the current cgroup if it is the same or a child. This allows
> to make conditional rules according to the current cgroup.
> 
> A cgroup handle is a map entry created from a file descriptor referring
> a cgroup directory (e.g. by opening /sys/fs/cgroup/X). In this case, the
> map entry is of type BPF_MAP_HANDLE_TYPE_LANDLOCK_CGROUP_FD and the
> inferred array map is of type BPF_MAP_ARRAY_TYPE_LANDLOCK_CGROUP.
> 
> An unprivileged process can create and manipulate cgroups thanks to
> cgroup delegation.
> 
> Signed-off-by: Mickaël Salaün 
...
> +static inline u64 bpf_landlock_cmp_cgroup_beneath(u64 r1_option, u64 r2_map,
> + u64 r3_map_op, u64 r4, u64 r5)
> +{
> + u8 option = (u8) r1_option;
> + struct bpf_map *map = (struct bpf_map *) (unsigned long) r2_map;
> + enum bpf_map_array_op map_op = r3_map_op;
> + struct bpf_array *array = container_of(map, struct bpf_array, map);
> + struct cgroup *cg1, *cg2;
> + struct map_landlock_handle *handle;
> + int i;
> +
> + /* ARG_CONST_PTR_TO_LANDLOCK_HANDLE_CGROUP is an arraymap */
> + if (unlikely(!map)) {
> + WARN_ON(1);
> + return -EFAULT;
> + }
> + if (unlikely((option | _LANDLOCK_FLAG_OPT_MASK) != _LANDLOCK_FLAG_OPT_MASK))
> + return -EINVAL;
> +
> + /* for now, only handle OP_OR */
> + switch (map_op) {
> + case BPF_MAP_ARRAY_OP_OR:
> + break;
> + case BPF_MAP_ARRAY_OP_UNSPEC:
> + case BPF_MAP_ARRAY_OP_AND:
> + case BPF_MAP_ARRAY_OP_XOR:
> + default:
> + return -EINVAL;
> + }
> +
> + synchronize_rcu();
> +
> + for (i = 0; i < array->n_entries; i++) {
> + handle = (struct map_landlock_handle *)
> + (array->value + array->elem_size * i);
> +
> + /* protected by the proto types, should not happen */
> + if (unlikely(handle->type != BPF_MAP_HANDLE_TYPE_LANDLOCK_CGROUP_FD)) {
> + WARN_ON(1);
> + return -EFAULT;
> + }
> + if (unlikely(!handle->css)) {
> + WARN_ON(1);
> + return -EFAULT;
> + }
> +
> + if (option & LANDLOCK_FLAG_OPT_REVERSE) {
> + cg1 = handle->css->cgroup;
> + cg2 = task_css_set(current)->dfl_cgrp;
> + } else {
> + cg1 = task_css_set(current)->dfl_cgrp;
> + cg2 = handle->css->cgroup;
> + }
> +
> + if (cgroup_is_descendant(cg1, cg2))
> + return 0;
> + }
> + return 1;
> +}

- please take a look at the existing bpf_current_task_under_cgroup and
reuse BPF_MAP_TYPE_CGROUP_ARRAY as a minimum. Doing a new cgroup array
is nothing but duplication of the code.

- I don't think such 'for' loop can scale. The solution needs to work
with thousands of containers and thousands of cgroups.
In the patch 06/10 the proposal is to use 'current' as holder of
the programs:
+   for (prog = current->seccomp.landlock_prog;
+   prog; prog = prog->prev) {
+   if (prog->filter == landlock_ret->filter) {
+   cur_ret = BPF_PROG_RUN(prog->prog, (void *)&ctx);
+   break;
+   }
+   }
imo that's the root of scalability issue.
I think to be able to scale the bpf programs have to be attached to
cgroups instead of tasks.
That would be very different api. seccomp doesn't need to be touched.
But that is the only way I see to be able to scale.
May be another way of thinking about it is 'lsm cgroup controller'
that Sargun is proposing.
The lsm hooks will provide stable execution points and the programs
will be called like:
prog = task_css_set(current)->dfl_cgrp->bpf.prog_effective[lsm_hook_id];
BPF_PROG_RUN(prog, ctx);
The delegation functionality and 'prog_effective' logic that
Daniel Mack is proposing will be fully reused here.
External container management software will be able to apply bpf
programs to control tasks under cgroup and such
bpf_landlock_cmp_cgroup_beneath() helper won't be necessary.
The user will be able to register different programs for different lsm hooks.
If I understand the patch 6/10 correctly, there is one (or a list) prog for
all lsm hooks per task which is not flexible enough.
Anoop Naravaram's use case is to control the ports the applications
under cgroup can bind and listen on.
Such use case can be solved by such 'lsm cgroup controller' by
attaching bpf program to security_socket_bind lsm hook and
filtering sockaddr.
Furthermore Sargun's use case is to allow further sockaddr rewrites
from the bpf program which can be done as natural extension
of such mechanism.

If I understood Daniel's Anoop's Sargu

Re: [PATCH v3 net-next] bpf/verifier: track liveness for pruning

2017-08-17 Thread Alexei Starovoitov

On 8/15/17 12:34 PM, Edward Cree wrote:

State of a register doesn't matter if it wasn't read in reaching an exit;
 a write screens off all reads downstream of it from all explored_states
 upstream of it.
This allows us to prune many more branches; here are some processed insn
 counts for some Cilium programs:
Program  before  after
bpf_lb_opt_-DLB_L3.o   6515   3361
bpf_lb_opt_-DLB_L4.o   8976   5176
bpf_lb_opt_-DUNKNOWN.o 2960   1137
bpf_lxc_opt_-DDROP_ALL.o  95412  48537
bpf_lxc_opt_-DUNKNOWN.o  141706  78718
bpf_netdev.o  24251  17995
bpf_overlay.o 10999   9385

The runtime is also improved; here are 'time' results in ms:
Program  before  after
bpf_lb_opt_-DLB_L3.o 24  6
bpf_lb_opt_-DLB_L4.o 26 11
bpf_lb_opt_-DUNKNOWN.o   11  2
bpf_lxc_opt_-DDROP_ALL.o   1288139
bpf_lxc_opt_-DUNKNOWN.o1768234
bpf_netdev.o 62 31
bpf_overlay.o15 13

Signed-off-by: Edward Cree 


this is one ingenious hack. Love it!
It took me a whole day to understand most of it, but I still have
a few questions:


+
+static void propagate_liveness(const struct bpf_verifier_state *state,
+  struct bpf_verifier_state *parent)


here the name 'parent' is very confusing, since for the first
iteration of the loop below it transfers liveness from the 'neighbor'
state to the current state and only then traverses the chain
of parents in the current state.
Would be good to document it, since I was struggling the most
with this name until I realized that the parent link list you build
in is_state_visited() is the actual sequence of (roughly) basic blocks, and
the name 'parent' applies there, but not for the first iteration
of this function.


@@ -3407,6 +3501,14 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
memcpy(&new_sl->state, &env->cur_state, sizeof(env->cur_state));
new_sl->next = env->explored_states[insn_idx];
env->explored_states[insn_idx] = new_sl;
+   /* connect new state to parentage chain */
+   env->cur_state.parent = &new_sl->state;
+   /* clear liveness marks in current state */
+   for (i = 0; i < BPF_REG_FP; i++)
+   env->cur_state.regs[i].live = REG_LIVE_NONE;
+   for (i = 0; i < MAX_BPF_STACK / BPF_REG_SIZE; i++)
+   if (env->cur_state.stack_slot_type[i * BPF_REG_SIZE] == STACK_SPILL)
+   env->cur_state.spilled_regs[i].live = REG_LIVE_NONE;


and this part I don't get at all.
It seems you're trying to sort-of do per-fake-basic-block liveness
analysis, but our state_list_marks are not correct if we go with the
canonical basic block definition, since we mark the jump insn and
not the insn after the branch, and not every basic block boundary is
properly detected.
So if the algorithm is only supposed to work for basic blocks (for sequences of
instructions without control flow changes) then it's broken.
If it should work with control flow insns then it should also work
for the whole chain of insns from the first one till bpf_exit...
So I tried removing the two clearing loops above and the results are much
better:
before  after
bpf_lb-DLB_L3.o 26041120
bpf_lb-DLB_L4.o 11159   1371
bpf_lb-DUNKNOWN.o   1116485
bpf_lxc-DDROP_ALL.o 34566   12758
bpf_lxc-DUNKNOWN.o  53267   18337
bpf_netdev.o17843   10564
bpf_overlay.o   86725513

but it feels too good to be true and probably not correct.
So either way we need to fix something it seems.



Re: [PATCH 1/3] bpf: Don't check for current being NULL

2017-10-16 Thread Alexei Starovoitov
On Mon, Oct 16, 2017 at 11:18 AM, Richard Weinberger  wrote:
> current is never NULL.
>
> Signed-off-by: Richard Weinberger 
> ---
>  kernel/bpf/helpers.c | 12 
>  1 file changed, 12 deletions(-)
>
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 3d24e238221e..e8845adcd15e 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -120,9 +120,6 @@ BPF_CALL_0(bpf_get_current_pid_tgid)
>  {
> struct task_struct *task = current;
>
> -   if (unlikely(!task))
> -   return -EINVAL;
> -

really? in all context? including irq and nmi?


Re: [PATCH 3/3] bpf: Make sure that ->comm does not change under us.

2017-10-16 Thread Alexei Starovoitov
On Mon, Oct 16, 2017 at 2:10 PM, Richard Weinberger  wrote:
> Am Montag, 16. Oktober 2017, 23:02:06 CEST schrieb Daniel Borkmann:
>> On 10/16/2017 10:55 PM, Richard Weinberger wrote:
>> > Am Montag, 16. Oktober 2017, 22:50:43 CEST schrieb Daniel Borkmann:
>> >>>   struct task_struct *task = current;
>> >>>
>> >>> + task_lock(task);
>> >>>
>> >>>   strncpy(buf, task->comm, size);
>> >>>
>> >>> + task_unlock(task);
>> >>
>> >> Wouldn't this potentially lead to a deadlock? E.g. you attach yourself
>> >> to task_lock() / spin_lock() / etc, and then the BPF prog triggers the
>> >> bpf_get_current_comm() taking the lock again ...
>> >
>> > Yes, but doesn't the same apply to the use case when I attach to strncpy()
>> > and run bpf_get_current_comm()?
>>
>> You mean due to recursion? In that case trace_call_bpf() would bail out
>> due to the bpf_prog_active counter.
>
> Ah, that's true.
> So, when someone wants to use bpf_get_current_comm() while tracing task_lock,
> we have a problem. I agree.
> On the other hand, without locking the function may return wrong results.

it will surely race with somebody else setting task comm and it's fine.
all of bpf tracing is read-only, so locks are only allowed inside bpf core
bits like maps. Taking core locks like task_lock() is quite scary.
bpf scripts rely on bpf_probe_read() of all sorts of kernel fields
so reading comm here w/o lock is fine.


Re: [PATCH 1/3] bpf: Don't check for current being NULL

2017-10-16 Thread Alexei Starovoitov
On Tue, Oct 17, 2017 at 12:23:13AM +0200, Richard Weinberger wrote:
> Alexei,
> 
> Am Dienstag, 17. Oktober 2017, 00:06:08 CEST schrieb Alexei Starovoitov:
> > On Mon, Oct 16, 2017 at 11:18 AM, Richard Weinberger  wrote:
> > > current is never NULL.
> > > 
> > > Signed-off-by: Richard Weinberger 
> > > ---
> > > 
> > >  kernel/bpf/helpers.c | 12 
> > >  1 file changed, 12 deletions(-)
> > > 
> > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > > index 3d24e238221e..e8845adcd15e 100644
> > > --- a/kernel/bpf/helpers.c
> > > +++ b/kernel/bpf/helpers.c
> > > @@ -120,9 +120,6 @@ BPF_CALL_0(bpf_get_current_pid_tgid)
> > > 
> > >  {
> > >  
> > > struct task_struct *task = current;
> > > 
> > > -   if (unlikely(!task))
> > > -   return -EINVAL;
> > > -
> > 
> > really? in all context? including irq and nmi?
> 
> I would be astonished current is NULL in such a context.
> 
> To be sure, let's CC linux-arch.
> IIRC I talked also with Al about this and he also assumed that current
> cannot be NULL.

Hmm I probably mistakenly stole the !current check from somewhere.
Happy to delete all these checks then.



Re: [PATCH net 1/3] mm, percpu: add support for __GFP_NOWARN flag

2017-10-17 Thread Alexei Starovoitov
On Tue, Oct 17, 2017 at 04:55:52PM +0200, Daniel Borkmann wrote:
> Add an option for pcpu_alloc() to support __GFP_NOWARN flag.
> Currently, we always throw a warning when size or alignment
> is unsupported (and also dump stack on failed allocation
> requests). The warning itself is harmless since we return
> NULL anyway for any failed request, which callers are
> required to handle anyway. However, it becomes harmful when
> panic_on_warn is set.
> 
> The rationale for the WARN() in pcpu_alloc() is that it can
> be tracked when larger than supported allocation requests are
> made such that allocations limits can be tweaked if warranted.
> This makes sense for in-kernel users, however, there are users
> of pcpu allocator where allocation size is derived from user
> space requests, e.g. when creating BPF maps. In these cases,
> the requests should fail gracefully without throwing a splat.
> 
> The current work-around was to check allocation size against
> the upper limit of PCPU_MIN_UNIT_SIZE from call-sites for
> bailing out prior to a call to pcpu_alloc() in order to
> avoid throwing the WARN(). This is bad in multiple ways since
> PCPU_MIN_UNIT_SIZE is an implementation detail, and having
> the checks on call-sites only complicates the code for no
> good reason. Thus, lets fix it generically by supporting the
> __GFP_NOWARN flag that users can then use with calling the
> __alloc_percpu_gfp() helper instead.
> 
> Signed-off-by: Daniel Borkmann 
> Cc: Tejun Heo 
> Cc: Mark Rutland 

The approach looks great to me. We've been doing this dance around
the allocator warning for a long time. It's really not the job of bpf code
to guess at valid parameters for the pcpu allocator.
Adding support for __GFP_NOWARN and using it in bpf is a much cleaner
fix that avoids layering violations.
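
(For illustration, the call-site pattern this enables is simply something
like the sketch below; the gfp flags, size and align here are stand-ins
for the user-derived values a map implementation would pass:)

	void __percpu *ptr;

	ptr = __alloc_percpu_gfp(size, align, GFP_KERNEL | __GFP_NOWARN);
	if (!ptr)	/* oversized request now fails quietly, no splat */
		return -ENOMEM;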

Acked-by: Alexei Starovoitov 



Re: [PATCH net 2/3] bpf: fix splat for illegal devmap percpu allocation

2017-10-17 Thread Alexei Starovoitov
On Tue, Oct 17, 2017 at 04:55:53PM +0200, Daniel Borkmann wrote:
> It was reported that syzkaller was able to trigger a splat on
> devmap percpu allocation due to illegal/unsupported allocation
> request size passed to __alloc_percpu():
> 
>   [   70.094249] illegal size (32776) or align (8) for percpu allocation
>   [   70.094256] [ cut here ]
>   [   70.094259] WARNING: CPU: 3 PID: 3451 at mm/percpu.c:1365 
> pcpu_alloc+0x96/0x630
>   [...]
>   [   70.094325] Call Trace:
>   [   70.094328]  __alloc_percpu_gfp+0x12/0x20
>   [   70.094330]  dev_map_alloc+0x134/0x1e0
>   [   70.094331]  SyS_bpf+0x9bc/0x1610
>   [   70.094333]  ? selinux_task_setrlimit+0x5a/0x60
>   [   70.094334]  ? security_task_setrlimit+0x43/0x60
>   [   70.094336]  entry_SYSCALL_64_fastpath+0x1a/0xa5
> 
> This was due to too large max_entries for the map such that we
> surpassed the upper limit of PCPU_MIN_UNIT_SIZE. It's fine to
> fail naturally here, so switch to __alloc_percpu_gfp() and pass
> __GFP_NOWARN instead.
> 
> Fixes: 11393cc9b9be ("xdp: Add batching support to redirect map")
> Reported-by: Mark Rutland 
> Reported-by: Shankara Pailoor 
> Reported-by: Richard Weinberger 
> Signed-off-by: Daniel Borkmann 
> Cc: John Fastabend 

Acked-by: Alexei Starovoitov 



Re: [PATCH net 3/3] bpf: do not test for PCPU_MIN_UNIT_SIZE before percpu allocations

2017-10-17 Thread Alexei Starovoitov
On Tue, Oct 17, 2017 at 04:55:54PM +0200, Daniel Borkmann wrote:
> PCPU_MIN_UNIT_SIZE is an implementation detail of the percpu
> allocator. Given we support __GFP_NOWARN now, lets just let
> the allocation request fail naturally instead. The two call
> sites from BPF mistakenly assumed __GFP_NOWARN would work, so
> no changes needed to their actual __alloc_percpu_gfp() calls
> which use the flag already.
> 
> Signed-off-by: Daniel Borkmann 

Acked-by: Alexei Starovoitov 



Re: [PATCH 5/5] MIPS: Add support for eBPF JIT.

2017-05-25 Thread Alexei Starovoitov
On Thu, May 25, 2017 at 05:38:26PM -0700, David Daney wrote:
> Since the eBPF machine has 64-bit registers, we only support this in
> 64-bit kernels.  As of the writing of this commit log test-bpf is showing:
> 
>   test_bpf: Summary: 316 PASSED, 0 FAILED, [308/308 JIT'ed]
> 
> All current test cases are successfully compiled.
> 
> Signed-off-by: David Daney 
> ---
>  arch/mips/Kconfig   |1 +
>  arch/mips/net/bpf_jit.c | 1627 
> ++-
>  arch/mips/net/bpf_jit.h |7 +
>  3 files changed, 1633 insertions(+), 2 deletions(-)

Great stuff. I wonder what the performance difference is between
the interpreter and the JIT.

> + * eBPF stack frame will be something like:
> + *
> + *  Entry $sp -->   ++
> + *  |   $ra  (optional)  |
> + *  ++
> + *  |   $s0  (optional)  |
> + *  ++
> + *  |   $s1  (optional)  |
> + *  ++
> + *  |   $s2  (optional)  |
> + *  ++
> + *  |   $s3  (optional)  |
> + *  ++
> + *  |   tmp-storage  (if $ra saved)  |
> + * $sp + tmp_offset --> ++ <--BPF_REG_10
> + *  |   BPF_REG_10 relative storage  |
> + *  |MAX_BPF_STACK (optional)|
> + *  |  . |
> + *  |  . |
> + *  |  . |
> + * $sp >++
> + *
> + * If BPF_REG_10 is never referenced, then the MAX_BPF_STACK sized
> + * area is not allocated.
> + */

It's especially great to see that you've put the tmp storage
above the program stack and made the stack allocation optional.
At the moment I'm working on reducing bpf program stack size,
so that the JIT and interpreter can use only the stack they need.
Looking at this JIT code, only minimal changes will be needed.



[PATCH v2 net-next 1/3] perf, bpf: Add BPF support to all perf_event types

2017-05-25 Thread Alexei Starovoitov
From: Teng Qin 

Allow BPF program to attach to all perf_event types supported
by the current bpf and perf code logic, including HW_CACHE, RAW,
and dynamic pmu events.

Also add support for reading these event counters using
bpf_perf_event_read() helper.

Signed-off-by: Teng Qin 
Signed-off-by: Alexei Starovoitov 
---
 kernel/bpf/arraymap.c| 26 +++---
 kernel/events/core.c |  6 +-
 kernel/trace/bpf_trace.c |  4 ++--
 3 files changed, 14 insertions(+), 22 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 5e00b2333c26..55ffa9949128 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -462,26 +462,22 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map 
*map,
 
event = perf_file->private_data;
ee = ERR_PTR(-EINVAL);
+   /* Per-task events are not supported */
+   if (event->attach_state & PERF_ATTACH_TASK)
+   goto err_out;
 
attr = perf_event_attrs(event);
if (IS_ERR(attr) || attr->inherit)
goto err_out;
+   /* TRACEPOINT and BREAKPOINT not supported in perf_event_read_local */
+   if (attr->type == PERF_TYPE_TRACEPOINT ||
+   attr->type == PERF_TYPE_BREAKPOINT)
+   goto err_out;
 
-   switch (attr->type) {
-   case PERF_TYPE_SOFTWARE:
-   if (attr->config != PERF_COUNT_SW_BPF_OUTPUT)
-   goto err_out;
-   /* fall-through */
-   case PERF_TYPE_RAW:
-   case PERF_TYPE_HARDWARE:
-   ee = bpf_event_entry_gen(perf_file, map_file);
-   if (ee)
-   return ee;
-   ee = ERR_PTR(-ENOMEM);
-   /* fall-through */
-   default:
-   break;
-   }
+   ee = bpf_event_entry_gen(perf_file, map_file);
+   if (ee)
+   return ee;
+   ee = ERR_PTR(-ENOMEM);
 
 err_out:
fput(perf_file);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6e75a5c9412d..52f667046599 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8037,12 +8037,8 @@ static int perf_event_set_bpf_prog(struct perf_event 
*event, u32 prog_fd)
bool is_kprobe, is_tracepoint;
struct bpf_prog *prog;
 
-   if (event->attr.type == PERF_TYPE_HARDWARE ||
-   event->attr.type == PERF_TYPE_SOFTWARE)
-   return perf_event_set_bpf_handler(event, prog_fd);
-
if (event->attr.type != PERF_TYPE_TRACEPOINT)
-   return -EINVAL;
+   return perf_event_set_bpf_handler(event, prog_fd);
 
if (event->tp_event->prog)
return -EEXIST;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 460a031c77e5..8425bf193f39 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -248,8 +248,8 @@ BPF_CALL_2(bpf_perf_event_read, struct bpf_map *, map, u64, 
flags)
return -ENOENT;
 
event = ee->event;
-   if (unlikely(event->attr.type != PERF_TYPE_HARDWARE &&
-event->attr.type != PERF_TYPE_RAW))
+   if (unlikely(event->attr.type == PERF_TYPE_SOFTWARE &&
+event->attr.config == PERF_COUNT_SW_BPF_OUTPUT))
return -EINVAL;
 
/* make sure event is local and doesn't have pmu::count */
-- 
2.9.3



[PATCH v2 net-next 3/3] bpf: update perf event helper functions documentation

2017-05-25 Thread Alexei Starovoitov
From: Teng Qin 

This commit updates documentation of the bpf_perf_event_output and
bpf_perf_event_read helpers to match their implementation.

Signed-off-by: Teng Qin 
Signed-off-by: Alexei Starovoitov 
---
 include/uapi/linux/bpf.h   | 11 +++
 tools/include/uapi/linux/bpf.h | 11 +++
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 94dfa9def355..e78aece03628 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -313,8 +313,11 @@ union bpf_attr {
  * @flags: room for future extensions
  * Return: 0 on success or negative error
  *
- * u64 bpf_perf_event_read(&map, index)
- * Return: Number events read or error code
+ * u64 bpf_perf_event_read(map, flags)
+ * read perf event counter value
+ * @map: pointer to perf_event_array map
+ * @flags: index of event in the map or bitmask flags
+ * Return: value of perf event counter read or error code
  *
  * int bpf_redirect(ifindex, flags)
  * redirect to another netdev
@@ -328,11 +331,11 @@ union bpf_attr {
  * @skb: pointer to skb
  * Return: realm if != 0
  *
- * int bpf_perf_event_output(ctx, map, index, data, size)
+ * int bpf_perf_event_output(ctx, map, flags, data, size)
  * output perf raw sample
  * @ctx: struct pt_regs*
  * @map: pointer to perf_event_array map
- * @index: index of event in the map
+ * @flags: index of event in the map or bitmask flags
  * @data: data on stack to be output as raw data
  * @size: size of data
  * Return: 0 on success or negative error
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 94dfa9def355..e78aece03628 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -313,8 +313,11 @@ union bpf_attr {
  * @flags: room for future extensions
  * Return: 0 on success or negative error
  *
- * u64 bpf_perf_event_read(&map, index)
- * Return: Number events read or error code
+ * u64 bpf_perf_event_read(map, flags)
+ * read perf event counter value
+ * @map: pointer to perf_event_array map
+ * @flags: index of event in the map or bitmask flags
+ * Return: value of perf event counter read or error code
  *
  * int bpf_redirect(ifindex, flags)
  * redirect to another netdev
@@ -328,11 +331,11 @@ union bpf_attr {
  * @skb: pointer to skb
  * Return: realm if != 0
  *
- * int bpf_perf_event_output(ctx, map, index, data, size)
+ * int bpf_perf_event_output(ctx, map, flags, data, size)
  * output perf raw sample
  * @ctx: struct pt_regs*
  * @map: pointer to perf_event_array map
- * @index: index of event in the map
+ * @flags: index of event in the map or bitmask flags
  * @data: data on stack to be output as raw data
  * @size: size of data
  * Return: 0 on success or negative error
-- 
2.9.3



[PATCH v2 net-next 0/3] bpf: Add BPF support to all perf_event

2017-05-25 Thread Alexei Starovoitov
v1->v2: address Peter's feedback. Refactor patch 1 to allow attaching
bpf programs to all event types and reading counters from all of them as well
patch 2 - more tests
patch 3 - address Dave's feedback and document bpf_perf_event_read()
and bpf_perf_event_output() properly

Teng Qin (3):
  perf, bpf: Add BPF support to all perf_event types
  samples/bpf: add samples for more perf event types
  bpf: update perf event helper functions documentation

 include/uapi/linux/bpf.h   |  11 ++-
 kernel/bpf/arraymap.c  |  26 +++---
 kernel/events/core.c   |   6 +-
 kernel/trace/bpf_trace.c   |   4 +-
 samples/bpf/bpf_helpers.h  |   3 +-
 samples/bpf/trace_event_user.c |  46 ++-
 samples/bpf/tracex6_kern.c |  28 +--
 samples/bpf/tracex6_user.c | 176 -
 tools/include/uapi/linux/bpf.h |  11 ++-
 9 files changed, 232 insertions(+), 79 deletions(-)

-- 
2.9.3



[PATCH v2 net-next 2/3] samples/bpf: add samples for more perf event types

2017-05-25 Thread Alexei Starovoitov
From: Teng Qin 

This commit adds test code to attach BPF to HW_CACHE and RAW type events
and updates clean-up logic to disable the perf events before closing pmu_fd.

This commit also adds test code to read SOFTWARE, HW_CACHE, RAW and dynamic
pmu events from BPF program using bpf_perf_event_read(). Refactored the
existing sample to fork an individual task on each CPU, attach the kprobe to
a more controllable function, and more accurately check whether each read on
every CPU returned a good value.

Signed-off-by: Teng Qin 
Signed-off-by: Alexei Starovoitov 
---
 samples/bpf/bpf_helpers.h  |   3 +-
 samples/bpf/trace_event_user.c |  46 ++-
 samples/bpf/tracex6_kern.c |  28 +--
 samples/bpf/tracex6_user.c | 176 -
 4 files changed, 204 insertions(+), 49 deletions(-)

diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 9a9c95f2c9fb..51e567bc70fc 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -31,7 +31,8 @@ static unsigned long long (*bpf_get_current_uid_gid)(void) =
(void *) BPF_FUNC_get_current_uid_gid;
 static int (*bpf_get_current_comm)(void *buf, int buf_size) =
(void *) BPF_FUNC_get_current_comm;
-static int (*bpf_perf_event_read)(void *map, int index) =
+static unsigned long long (*bpf_perf_event_read)(void *map,
+unsigned long long flags) =
(void *) BPF_FUNC_perf_event_read;
 static int (*bpf_clone_redirect)(void *ctx, int ifindex, int flags) =
(void *) BPF_FUNC_clone_redirect;
diff --git a/samples/bpf/trace_event_user.c b/samples/bpf/trace_event_user.c
index fa4336423da5..666761773fda 100644
--- a/samples/bpf/trace_event_user.c
+++ b/samples/bpf/trace_event_user.c
@@ -122,13 +122,14 @@ static void test_perf_event_all_cpu(struct 
perf_event_attr *attr)
 {
int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
int *pmu_fd = malloc(nr_cpus * sizeof(int));
-   int i;
+   int i, error = 0;
 
/* open perf_event on all cpus */
for (i = 0; i < nr_cpus; i++) {
pmu_fd[i] = sys_perf_event_open(attr, -1, i, -1, 0);
if (pmu_fd[i] < 0) {
printf("sys_perf_event_open failed\n");
+   error = 1;
goto all_cpu_err;
}
assert(ioctl(pmu_fd[i], PERF_EVENT_IOC_SET_BPF, prog_fd[0]) == 
0);
@@ -137,9 +138,13 @@ static void test_perf_event_all_cpu(struct perf_event_attr 
*attr)
system("dd if=/dev/zero of=/dev/null count=5000k");
print_stacks();
 all_cpu_err:
-   for (i--; i >= 0; i--)
+   for (i--; i >= 0; i--) {
+   ioctl(pmu_fd[i], PERF_EVENT_IOC_DISABLE, 0);
close(pmu_fd[i]);
+   }
free(pmu_fd);
+   if (error)
+   int_exit(0);
 }
 
 static void test_perf_event_task(struct perf_event_attr *attr)
@@ -150,7 +155,7 @@ static void test_perf_event_task(struct perf_event_attr 
*attr)
pmu_fd = sys_perf_event_open(attr, 0, -1, -1, 0);
if (pmu_fd < 0) {
printf("sys_perf_event_open failed\n");
-   return;
+   int_exit(0);
}
assert(ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd[0]) == 0);
assert(ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0) == 0);
@@ -175,11 +180,45 @@ static void test_bpf_perf_event(void)
.config = PERF_COUNT_SW_CPU_CLOCK,
.inherit = 1,
};
+   struct perf_event_attr attr_hw_cache_l1d = {
+   .sample_freq = SAMPLE_FREQ,
+   .freq = 1,
+   .type = PERF_TYPE_HW_CACHE,
+   .config =
+   PERF_COUNT_HW_CACHE_L1D |
+   (PERF_COUNT_HW_CACHE_OP_READ << 8) |
+   (PERF_COUNT_HW_CACHE_RESULT_ACCESS << 16),
+   .inherit = 1,
+   };
+   struct perf_event_attr attr_hw_cache_branch_miss = {
+   .sample_freq = SAMPLE_FREQ,
+   .freq = 1,
+   .type = PERF_TYPE_HW_CACHE,
+   .config =
+   PERF_COUNT_HW_CACHE_BPU |
+   (PERF_COUNT_HW_CACHE_OP_READ << 8) |
+   (PERF_COUNT_HW_CACHE_RESULT_MISS << 16),
+   .inherit = 1,
+   };
+   struct perf_event_attr attr_type_raw = {
+   .sample_freq = SAMPLE_FREQ,
+   .freq = 1,
+   .type = PERF_TYPE_RAW,
+   /* Intel Instruction Retired */
+   .config = 0xc0,
+   .inherit = 1,
+   };
 
test_perf_event_all_cpu(&attr_type_hw);
test_perf_event_task(&attr_type_hw);
test_perf_event_all_cpu(&attr_type_sw);
test_perf_event_task(&attr_type_sw);
+   test_perf_event_all_cpu(&attr_hw_cache_l1d);
+   test_perf_e

Re: [PATCH v2 net-next 1/3] perf, bpf: Add BPF support to all perf_event types

2017-05-30 Thread Alexei Starovoitov

On 5/29/17 2:39 AM, Peter Zijlstra wrote:



Do we want something like the below to replace much of the above?

if (!perf_event_valid_local(event, NULL, cpu))
goto err_out;

Seems to be roughly what you're after, although I suppose @cpu might be
hard to determine a priori, so maybe we should allow a magic value to
short-circuit that test.

---
 kernel/events/core.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8d6acaeeea17..a7dc34f19568 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3630,6 +3630,36 @@ static inline u64 perf_event_count(struct perf_event 
*event)
 }

 /*
+ * perf_event_valid_local() - validates if the event is usable by 
perf_event_read_local()
+ * event: the event to validate
+ * task:  the task the @event will be used in
+ * cpu:   the cpu the @event will be used on
+ *
+ * In case one wants to disallow all per-task events, use @task = NULL.
+ * In case one wants to disallow all per-cpu events, use @cpu = -1.
+ */
+bool perf_event_valid_local(struct perf_event *event, struct task_struct 
*task, int cpu)
+{
+   /* See perf_event_read_local() for the reasons for these tests */
+
+   if ((event->attach_state & PERF_ATTACH_TASK) &&
+   event->hw.target != task)
+   return false;
+
+   if (!(event->attach_state & PERF_ATTACH_TASK) &&
+   event->cpu != cpu)
+   return false;


we do if (unlikely(event->oncpu != cpu))
as dynamic check inside bpf_perf_event_read(), since we cannot do it
statically at perf_event_array update time.
If we drop the above 'if' and keep 'task==null' trick,
then indeed we can use this function as static check.

Right now we're trying to keep as many checks as possible as
static checks to make bpf_perf_event_read() faster.
I guess we can drop that approach and do perf_event_valid_local()
check for every read since perf_event_read_local() does all the
same checks anyway.
So how about converting all WARN_ON in perf_event_read_local()
into 'return -EINVAL' and change func proto into:
int perf_event_read_local(struct perf_event *event, u64 *counter_val)

> I cannot find reason for this comment. That is, why would
> perf_event_read_local() not support those two types?

I don't know. What is the meaning of
reading tracepoint/breakpoint counter?
Because of 'event->oncpu != cpu' dynamic check all counters are
expected to be per-cpu. I'm not sure how uncore counters work.
What do they have in event->oncpu? -1? I guess they have pmu->count?
So we cannot read them from bpf program anyway?

If we change warn_ons in perf_event_read_local() to returns
them we can make per-task counters working.
User side will open per-task counter and bpf program will
do current->pid != expected_pid check before calling
bpf_perf_event_read(). bpf scripts often do that already.

int perf_event_read_local(struct perf_event *event, u64 *counter_val)
{
	unsigned long flags;
	int ret = 0;

	local_irq_save(flags);
	if ((event->attach_state & PERF_ATTACH_TASK) &&
	    event->hw.target != current) {
		ret = -EINVAL;
		goto out;
	}

	if (!(event->attach_state & PERF_ATTACH_TASK) &&
	    event->cpu != smp_processor_id()) {
		ret = -EINVAL;
		goto out;
	}

	... inherit and pmu->count checks here ...

	*counter_val = local64_read(&event->count);
out:
	local_irq_restore(flags);
	return ret;
}
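
And the bpf program side of the per-task pattern mentioned above stays
trivial. A rough sketch in samples/bpf style (map name, probe point and
the hardcoded expected_pid are only illustrative; real scripts fill the
pid in from user space):

#include <linux/ptrace.h>
#include <linux/version.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") counters = {
	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size = sizeof(int),
	.value_size = sizeof(u32),
	.max_entries = 64,
};

SEC("kprobe/sys_write")
int on_write(struct pt_regs *ctx)
{
	u32 expected_pid = 1234;
	char fmt[] = "count %llu\n";
	u64 count;

	/* per-task filter done by the program itself */
	if ((u32)bpf_get_current_pid_tgid() != expected_pid)
		return 0;

	count = bpf_perf_event_read(&counters, bpf_get_smp_processor_id());
	bpf_trace_printk(fmt, sizeof(fmt), count);
	return 0;
}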

thoughts?



Re: [PATCH v2 net-next 1/3] perf, bpf: Add BPF support to all perf_event types

2017-05-30 Thread Alexei Starovoitov

On 5/30/17 9:51 AM, Peter Zijlstra wrote:

On Tue, May 30, 2017 at 08:52:14AM -0700, Alexei Starovoitov wrote:


+   if (!(event->attach_state & PERF_ATTACH_TASK) &&
+   event->cpu != cpu)
+   return false;


we do if (unlikely(event->oncpu != cpu))
as dynamic check inside bpf_perf_event_read(), since we cannot do it
statically at perf_event_array update time.


Right, that's what I thought.


If we drop the above 'if' and keep 'task==null' trick,
then indeed we can use this function as static check.


Right, or otherwise have a special value to disable it.


Right now we're trying to keep as many checks as possible as
static checks to make bpf_perf_event_read() faster.
I guess we can drop that approach and do perf_event_valid_local()
check for every read since perf_event_read_local() does all the
same checks anyway.
So how about converting all WARN_ON in perf_event_read_local()
into 'return -EINVAL' and change func proto into:
int perf_event_read_local(struct perf_event *even, u64 *counter_val)


I'm confused on how that is better. My recent patches to WARN should
have greatly improved performance of WARN_ON_ONCE(). And looking at that
code, I suspect it's dominated by the POPF for inactive events.


I cannot find reason for this comment. That is, why would
perf_event_read_local() not support those two types?


I don't know. What is the meaning of
reading tracepoint/breakpoint counter?


They count like all other software events. +1 for each occurrence.

So for instance, if you use irq_vectors:local_timer_entry you get how
many cpu local timer instances happened during your measurement window.

Same with a breakpoint, it counts how many times it got hit. Typically
you'd want to install a custom handler on breakpoints to do something
'interesting', but even without that its acts like a normal software
event.


Because of 'event->oncpu != cpu' dynamic check all counters are
expected to be per-cpu. I'm not sure how uncore counters work.


Uncore thingies are assigned to any online CPU in their 'domain'.


What do they have in event->oncpu? -1? I guess they have pmu->count?
So we cannot read them from bpf program anyway?


They have the CPU number of the CPU that's assigned to them. So you
_could_ make use of them, but its a bit tricky to get them to work
reliably because you'd have to get that CPU 'right' and it can change.

Typically they would end up on the first CPU in their domain, but with
CPU hotplug you can move them about and get confusion.

I'd have to think on how to do that nicely.


If we change warn_ons in perf_event_read_local() to returns
them we can make per-task counters working.


I'm not entirely sure I see how that is required. Should per task not
already work? The WARN that's there will only trigger if you call them
on the wrong task, which is something you shouldn't do anyway.


The kernel WARN is considered to be a bug of bpf infra. That's the
reason we do all these checks at map update time and at run-time.
The bpf program authors should be able to do all possible experiments
until their scripts work. Dealing with kernel warns and reboots is not
something user space folks like to do.
Today bpf_perf_event_read() for per-task events isn't really
working due to event->oncpu != cpu runtime check in there.
If we convert warns to returns the existing scripts will continue
to work as-is and per-task will be possible.



Re: [PATCH 0/2][v2] Add the ability to do BPF directed error injection

2017-10-31 Thread Alexei Starovoitov

On 10/31/17 6:55 PM, David Miller wrote:

From: Josef Bacik 
Date: Tue, 31 Oct 2017 11:45:55 -0400


v1->v2:
- moved things around to make sure that bpf_override_return could really only be
  used for an ftrace kprobe.
- killed the special return values from trace_call_bpf.
- renamed pc_modified to bpf_kprobe_state so bpf_override_return could tell if
  it was being called from an ftrace kprobe context.
- reworked the logic in kprobe_perf_func to take advantage of bpf_kprobe_state.
- updated the test as per Alexei's review.

A lot of our error paths are not well tested because we have no good way of
injecting errors generically.  Some subsystems (block, memory) have ways to
inject errors, but they are random so it's hard to get reproducible results.

With BPF we can add determinism to our error injection.  We can use kprobes and
other things to verify we are injecting errors at the exact case we are trying
to test.  This patch gives us the tool to actually do the error injection part.
It is very simple, we just set the return value of the pt_regs we're given to
whatever we provide, and then override the PC with a dummy function that simply
returns.

Right now this only works on x86, but it would be simple enough to expand to
other architectures.  Thanks,


This appears to moreso target the tracing tree than the networking tree.

Let me know if that's not the case and I should be the one intergrating
these changes.


I don't think it will apply to anything but net-next. If it goes via any
other tree we will have major conflicts during the merge window.
btw I haven't reviewed them a second time yet.



Re: [PATCH 1/2] bpf: add a bpf_override_function helper

2017-10-31 Thread Alexei Starovoitov

On 10/31/17 8:45 AM, Josef Bacik wrote:

From: Josef Bacik 

Error injection is sloppy and very ad-hoc.  BPF could fill this niche
perfectly with its kprobe functionality.  We could make sure errors are
only triggered in specific call chains that we care about with very
specific situations.  Accomplish this with the bpf_override_function
helper.  This will modify the probed function's return value to the
specified value and set the PC to an override function that simply
returns, bypassing the originally probed function.  This gives us a nice
clean way to implement systematic error injection for all of our code
paths.

Signed-off-by: Josef Bacik 
---
 arch/Kconfig |  3 +++
 arch/x86/Kconfig |  1 +
 arch/x86/include/asm/kprobes.h   |  4 
 arch/x86/include/asm/ptrace.h|  5 +
 arch/x86/kernel/kprobes/ftrace.c | 14 ++
 include/linux/trace_events.h |  7 +++
 include/uapi/linux/bpf.h |  7 ++-
 kernel/trace/Kconfig | 11 +++
 kernel/trace/bpf_trace.c | 30 
 kernel/trace/trace_kprobe.c  | 42 +---
 10 files changed, 116 insertions(+), 8 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index d789a89cb32c..4fb618082259 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -195,6 +195,9 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
bool

+config HAVE_KPROBE_OVERRIDE
+   bool
+
 config HAVE_NMI
bool

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 971feac13506..5126d2750dd0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -152,6 +152,7 @@ config X86
select HAVE_KERNEL_XZ
select HAVE_KPROBES
select HAVE_KPROBES_ON_FTRACE
+   select HAVE_KPROBE_OVERRIDE
select HAVE_KRETPROBES
select HAVE_KVM
select HAVE_LIVEPATCH   if X86_64
diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index 6cf65437b5e5..c6c3b1f4306a 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -67,6 +67,10 @@ extern const int kretprobe_blacklist_size;
 void arch_remove_kprobe(struct kprobe *p);
 asmlinkage void kretprobe_trampoline(void);

+#ifdef CONFIG_KPROBES_ON_FTRACE
+extern void arch_ftrace_kprobe_override_function(struct pt_regs *regs);
+#endif
+
 /* Architecture specific copy of original instruction*/
 struct arch_specific_insn {
/* copy of the original instruction */
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 91c04c8e67fa..f04e71800c2f 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -108,6 +108,11 @@ static inline unsigned long regs_return_value(struct 
pt_regs *regs)
return regs->ax;
 }

+static inline void regs_set_return_value(struct pt_regs *regs, unsigned long 
rc)
+{
+   regs->ax = rc;
+}
+
 /*
  * user_mode(regs) determines whether a register set came from user
  * mode.  On x86_32, this is true if V8086 mode was enabled OR if the
diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c
index 041f7b6dfa0f..3c455bf490cb 100644
--- a/arch/x86/kernel/kprobes/ftrace.c
+++ b/arch/x86/kernel/kprobes/ftrace.c
@@ -97,3 +97,17 @@ int arch_prepare_kprobe_ftrace(struct kprobe *p)
p->ainsn.boostable = false;
return 0;
 }
+
+asmlinkage void override_func(void);
+asm(
+   ".type override_func, @function\n"
+   "override_func:\n"
+   "  ret\n"
+   ".size override_func, .-override_func\n"
+);
+
+void arch_ftrace_kprobe_override_function(struct pt_regs *regs)
+{
+   regs->ip = (unsigned long)&override_func;
+}
+NOKPROBE_SYMBOL(arch_ftrace_kprobe_override_function);
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index fc6aeca945db..9179f109c49b 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -521,7 +521,14 @@ do {   
\
 #ifdef CONFIG_PERF_EVENTS
 struct perf_event;

+enum {
+   BPF_STATE_NORMAL_KPROBE = 0,
+   BPF_STATE_FTRACE_KPROBE,
+   BPF_STATE_MODIFIED_PC,
+};
+
 DECLARE_PER_CPU(struct pt_regs, perf_trace_regs);
+DECLARE_PER_CPU(int, bpf_kprobe_state);

 extern int  perf_trace_init(struct perf_event *event);
 extern void perf_trace_destroy(struct perf_event *event);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0b7b54d898bd..1ad5b87a42f6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -673,6 +673,10 @@ union bpf_attr {
  * @buf: buf to fill
  * @buf_size: size of the buf
  * Return : 0 on success or negative error code
+ *
+ * int bpf_override_return(pt_regs, rc)
+ * @pt_regs: pointer to struct pt_regs
+ * @rc: the return value to set
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -732,7 +736,8 @@ union bpf_attr {
 

Re: linux-next: manual merge of the tip tree with the net-next tree

2017-11-01 Thread Alexei Starovoitov
On Wed, Nov 01, 2017 at 09:55:24AM +0100, Peter Zijlstra wrote:
> On Wed, Nov 01, 2017 at 09:27:43AM +0100, Ingo Molnar wrote:
> > 
> > * Peter Zijlstra  wrote:
> > 
> > > On Wed, Nov 01, 2017 at 06:15:54PM +1100, Stephen Rothwell wrote:
> > > > Hi all,
> > > > 
> > > > Today's linux-next merge of the tip tree got a conflict in:
> > > > 
> > > >   kernel/trace/bpf_trace.c
> > > > 
> > > > between commits:
> > > > 
> > > >   97562633bcba ("bpf: perf event change needed for subsequent bpf 
> > > > helpers")
> > > > and more changes ...
> > > > 
> > > > from the net-next tree and commit:
> > > > 
> > > >   7d9285e82db5 ("perf/bpf: Extend the perf_event_read_local() 
> > > > interface, a.k.a. "bpf: perf event change needed for subsequent bpf 
> > > > helpers"")
> > > > 
> > > > from the tip tree.
> > > 
> > > So those should be the exact same patch; except for Changelog and
> > > subject. Code wise there shouldn't be a conflict.
> > 
> > So the problem is that then we have:
> > 
> >   0d3d73aac2ff ("perf/core: Rewrite event timekeeping")
> > 
> > which changes the code. This is a known conflict generation pattern: Git 
> > isn't 
> > smart enough to sort out that (probably because it would make merges too 
> > expensive) - and it's a bad flow in any case.
> 
> Hmm, I thought having that same base patch in both trees would allow it
> to resolve that conflict. A well..

sigh. I had the same impression.
In the past the same patch was applied to both tip and net-next
and there were no conflicts.
Maybe git could have been smarter if you kept the same
one line commit description as we have in net-next?

Will it help if we push the same ("perf/core: Rewrite event timekeeping") commit
into net-next ? They will be in different order, so it probably
won't help and only make things worse. That sucks.

I think we need to discuss what our approach should be moving
forward to commits that affect tracing and networking at the same time.
I don't think pushing to only one tree is an option, since it will be
close to impossible to resolve such conflicts at merge window time.
Linus would need to do some major surgery to untangle the mess.
I think it's still the best to push to both trees and expect
such mini-conflicts to appear in linux-next.
Like this time it was pretty obvious that commits are the same
and no real action necessary.

Also what do you mean by "same patch != same commit" ?
Like if we had pushed to some 3rd tree first and then pulled
into tip and net-next it would have been better?



Re: [PATCH 1/2] bpf: add a bpf_override_function helper

2017-11-01 Thread Alexei Starovoitov

On 11/1/17 10:00 AM, Josef Bacik wrote:

From: Josef Bacik 

Error injection is sloppy and very ad-hoc.  BPF could fill this niche
perfectly with its kprobe functionality.  We could make sure errors are
only triggered in specific call chains that we care about with very
specific situations.  Accomplish this with the bpf_override_function
helper.  This will modify the probed function's return value to the
specified value and set the PC to an override function that simply
returns, bypassing the originally probed function.  This gives us a nice
clean way to implement systematic error injection for all of our code
paths.

Signed-off-by: Josef Bacik 


Both bpf and tracing bits look great to me.
Acked-by: Alexei Starovoitov 



Re: [PATCH 2/2] samples/bpf: add a test for bpf_override_return

2017-11-01 Thread Alexei Starovoitov

On 11/1/17 10:00 AM, Josef Bacik wrote:

From: Josef Bacik 

This adds a basic test for bpf_override_return to verify it works.  We
override the main function for mounting a btrfs fs so it'll return
-ENOMEM and then make sure that trying to mount a btrfs fs will fail.

Signed-off-by: Josef Bacik 


Acked-by: Alexei Starovoitov 


+++ b/samples/bpf/test_override_return.sh
@@ -0,0 +1,15 @@
+#!/bin/bash
+
+rm -f testfile.img
+dd if=/dev/zero of=testfile.img bs=1M seek=1000 count=1
+DEVICE=$(losetup --show -f testfile.img)
+mkfs.btrfs -f $DEVICE
+mkdir tmpmnt
+./tracex7 $DEVICE
+if [ $? -eq 0 ]
+then
+   echo "SUCCESS!"
+else
+   echo "FAILED!"
+fi
+losetup -d $DEVICE
diff --git a/samples/bpf/tracex7_kern.c b/samples/bpf/tracex7_kern.c
new file mode 100644
index ..1ab308a43e0f
--- /dev/null
+++ b/samples/bpf/tracex7_kern.c
@@ -0,0 +1,16 @@
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+SEC("kprobe/open_ctree")
+int bpf_prog1(struct pt_regs *ctx)
+{
+   unsigned long rc = -12;
+
+   bpf_override_return(ctx, rc);
+   return 0;
+}


great stuff. I wonder in how many kernel test frameworks
it will appear in the near future.
We can even stress test bpf with bpf.



Re: [PATCH net-next V2 3/3] tun: add eBPF based queue selection method

2017-11-01 Thread Alexei Starovoitov
On Wed, Nov 01, 2017 at 03:59:48PM +0200, Michael S. Tsirkin wrote:
> On Wed, Nov 01, 2017 at 09:02:03PM +0800, Jason Wang wrote:
> > 
> > 
> > On 2017年11月01日 00:45, Michael S. Tsirkin wrote:
> > > > +static void __tun_set_steering_ebpf(struct tun_struct *tun,
> > > > +   struct bpf_prog *new)
> > > > +{
> > > > +   struct bpf_prog *old;
> > > > +
> > > > +   old = rtnl_dereference(tun->steering_prog);
> > > > +   rcu_assign_pointer(tun->steering_prog, new);
> > > > +
> > > > +   if (old) {
> > > > +   synchronize_net();
> > > > +   bpf_prog_destroy(old);
> > > > +   }
> > > > +}
> > > > +
> > > Is this really called under rtnl?
> > 
> > Yes it is __tun_chr_ioctl() will call rtnl_lock().
> 
> Is the call from tun_free_netdev under rtnl too?
> 
> > > If no then rtnl_dereference
> > > is wrong. If yes I'm not sure you can call synchronize_net
> > > under rtnl.
> > > 
> > 
> > Are you worrying about the long wait? Looking at synchronize_net(), it does:
> > 
> > void synchronize_net(void)
> > {
> >     might_sleep();
> >     if (rtnl_is_locked())
> >         synchronize_rcu_expedited();
> >     else
> >         synchronize_rcu();
> > }
> > EXPORT_SYMBOL(synchronize_net);
> > 
> > Thanks
> 
> Not the wait - expedited is not a good thing to allow unpriveledged
> userspace to do, it interrupts all VMs running on the same box.
> 
> We could use a callback though the docs warn userspace can use that
> to cause a DOS and needs to be limited.

the whole __tun_set_steering_ebpf() looks odd to me.
There is the tun_attach_filter/tun_detach_filter pattern
that works for classic BPF. Why is this strange
synchronize_net() there for eBPF?
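
For comparison, a non-blocking swap would look something like the sketch
below (names are made up; it assumes tun->steering_prog is wrapped in a
small struct carrying an rcu_head, and that dropping the prog reference
is safe from an rcu callback), avoiding both synchronize_net() under rtnl
and the expedited grace period:

struct tun_steering_prog {
	struct rcu_head rcu;
	struct bpf_prog *prog;
};

static void tun_steering_prog_free(struct rcu_head *rcu)
{
	struct tun_steering_prog *p =
		container_of(rcu, struct tun_steering_prog, rcu);

	/* release the prog after the grace period, no blocking wait here */
	bpf_prog_put(p->prog);
	kfree(p);
}

static void __tun_set_steering_ebpf(struct tun_struct *tun,
				    struct tun_steering_prog *new)
{
	struct tun_steering_prog *old;

	old = rtnl_dereference(tun->steering_prog);
	rcu_assign_pointer(tun->steering_prog, new);
	if (old)
		call_rcu(&old->rcu, tun_steering_prog_free);
}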



Re: [PATCH v2 net-next 3/5] bpf, cgroup: implement eBPF-based device controller for cgroup v2

2017-11-02 Thread Alexei Starovoitov

On 11/2/17 7:54 AM, Roman Gushchin wrote:

+#define DEV_BPF_ACC_MKNOD  (1ULL << 0)
+#define DEV_BPF_ACC_READ   (1ULL << 1)
+#define DEV_BPF_ACC_WRITE  (1ULL << 2)
+
+#define DEV_BPF_DEV_BLOCK  (1ULL << 0)
+#define DEV_BPF_DEV_CHAR   (1ULL << 1)
+


all macros in bpf.h start with BPF_
To be consistent with the rest can you rename above to BPF_DEVCG_.. ?


Re: [PATCH 2/2] [net-next] bpf: fix out-of-bounds access warning in bpf_check

2017-11-02 Thread Alexei Starovoitov
On Thu, Nov 02, 2017 at 12:05:52PM +0100, Arnd Bergmann wrote:
> The bpf_verifer_ops array is generated dynamically and may be
> empty depending on configuration, which then causes an out
> of bounds access:
> 
> kernel/bpf/verifier.c: In function 'bpf_check':
> kernel/bpf/verifier.c:4320:29: error: array subscript is above array bounds 
> [-Werror=array-bounds]
> 
> This adds a check to the start of the function as a workaround.
> I would assume that the function is never called in that configuration,
> so the warning is probably harmless.
> 
> Fixes: 00176a34d9e2 ("bpf: remove the verifier ops from program structure")
> Signed-off-by: Arnd Bergmann 
> ---
> Since there hasn't been a linux-next release in two weeks, I'm not
> entirely sure this is still needed, but from looking of the net-next
> contents it seems it is. I did not check any other trees that might
> have a fix already.
> ---
>  kernel/bpf/verifier.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 750aff880ecb..debb60ad08ee 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -4447,6 +4447,10 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr 
> *attr)
>   struct bpf_verifer_log *log;
>   int ret = -EINVAL;
>  
> + /* no program is valid */
> + if (ARRAY_SIZE(bpf_verifier_ops) == 0)
> + return -EINVAL;

sorry I don't see how bpf_verifier_ops can be empty.
Did you mix it up with your previous patch when you made bpf_analyzer_ops empty?



Re: [PATCH 2/2] [net-next] bpf: fix out-of-bounds access warning in bpf_check

2017-11-02 Thread Alexei Starovoitov
On Thu, Nov 02, 2017 at 05:14:00PM +0100, Arnd Bergmann wrote:
> On Thu, Nov 2, 2017 at 4:59 PM, Alexei Starovoitov
>  wrote:
> > On Thu, Nov 02, 2017 at 12:05:52PM +0100, Arnd Bergmann wrote:
> >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >> index 750aff880ecb..debb60ad08ee 100644
> >> --- a/kernel/bpf/verifier.c
> >> +++ b/kernel/bpf/verifier.c
> >> @@ -4447,6 +4447,10 @@ int bpf_check(struct bpf_prog **prog, union 
> >> bpf_attr *attr)
> >>   struct bpf_verifer_log *log;
> >>   int ret = -EINVAL;
> >>
> >> + /* no program is valid */
> >> + if (ARRAY_SIZE(bpf_verifier_ops) == 0)
> >> + return -EINVAL;
> >
> > sorry I don't see how bpf_verifier_ops can be empty.
> > Did you mix it up with your previous patch when you made bpf_analyzer_ops 
> > empty?
> 
> I confused the two a couple of times while creating the patches, but
> I'm still fairly
> sure I got it right in the end:
> 
> bpf_verifier_ops is an array that gets generated by including 
> linux/bpf_types.h.
> That file has two kinds of entries:
> 
> - BPF_MAP_TYPE() entries are left out, as that macro is defined to an
> empty string
>   here.
> 
> - BPF_PROG_TYPE() entries are conditional depending on CONFIG_NET and
>   CONFIG_BPF_EVENTS. In the configuration that produces the warning,
>   both are disabled.

I see. Didn't realize that it's possible to enable bpf syscall
without networking and tracing support.
I'm wondering whether it's better to disallow such a useless mode in kconfig,
but it's probably going to be convoluted.
Above if (ARRAY_SIZE(bpf_verifier_ops) == 0) will be optimized away
by gcc in 99.9% of configs, so I guess that's fine, so:
Acked-by: Alexei Starovoitov 



Re: [PATCH 1/2] [net-next] bpf: fix link error without CONFIG_NET

2017-11-02 Thread Alexei Starovoitov
On Thu, Nov 02, 2017 at 10:55:30AM -0700, Jakub Kicinski wrote:
> On Thu,  2 Nov 2017 12:05:51 +0100, Arnd Bergmann wrote:
> > I ran into this link error with the latest net-next plus linux-next
> > trees when networking is disabled:
> > 
> > kernel/bpf/verifier.o:(.rodata+0x2958): undefined reference to 
> > `tc_cls_act_analyzer_ops'
> > kernel/bpf/verifier.o:(.rodata+0x2970): undefined reference to 
> > `xdp_analyzer_ops'
> > 
> > It seems that the code was written to deal with varying contents of
> > the arrray, but the actual #ifdef was missing. Both tc_cls_act_analyzer_ops
> > and xdp_analyzer_ops are defined in the core networking code, so adding
> > a check for CONFIG_NET seems appropriate here, and I've verified this with
> > many randconfig builds
> > 
> > Fixes: 4f9218aaf8a4 ("bpf: move knowledge about post-translation offsets 
> > out of verifier")
> > Signed-off-by: Arnd Bergmann 
> 
> Thanks Arnd!  I was hoping to nuke this code before build bots catch up
> to me, didn't work out :)

yeah. Jakub's patches may not make it in time for net-next closing.
so let's use this fix for now.

Acked-by: Alexei Starovoitov 



Re: [PATCH 1/2] bpf: add a bpf_override_function helper

2017-11-11 Thread Alexei Starovoitov

On 11/11/17 4:14 PM, Ingo Molnar wrote:


* Josef Bacik  wrote:


On Fri, Nov 10, 2017 at 10:34:59AM +0100, Ingo Molnar wrote:


* Josef Bacik  wrote:


@@ -551,6 +578,10 @@ static const struct bpf_func_proto 
*kprobe_prog_func_proto(enum bpf_func_id func
return &bpf_get_stackid_proto;
case BPF_FUNC_perf_event_read_value:
return &bpf_perf_event_read_value_proto;
+   case BPF_FUNC_override_return:
+   pr_warn_ratelimited("%s[%d] is installing a program with 
bpf_override_return helper that may cause unexpected behavior!",
+   current->comm, task_pid_nr(current));
+   return &bpf_override_return_proto;


So if this new functionality is used we'll always print this into the syslog?

The warning is also a bit passive aggressive about informing the user: what
unexpected behavior can happen, what is the worst case?



It's modeled after the other warnings bpf will spit out, but with this feature
you are skipping a function and instead returning some arbitrary value, so
anything could go wrong if you mess something up.  For instance I screwed up my
initial test case and made every IO submitted return an error instead of just on
the one file system I was attempting to test, so all sorts of hilarity ensued.


Ok, then for the x86 bits:

  NAK-ed-by: Ingo Molnar 

One of the major advantages of having an in-kernel BPF sandbox is to never crash
the kernel - and allowing BPF programs to just randomly modify the return value 
of
kernel functions sounds immensely broken to me.

(And yes, I realize that kprobes are used here as a vehicle, but the point
remains.)


yeah. modifying an arbitrary function's return value pushes bpf outside of
its safety guarantees, and in that sense the same override_return
could be done from a kernel module if the kernel provided
the x64 side of the facility introduced by this patch.
On the other hand, adding parts of this feature to the kernel only
to be used by an external kernel module is quite ugly too and not
something that was ever done before.
How about we restrict this bpf_override_return() only to the functions
whose callers expect to handle errors?
We can add something similar to NOKPROBE_SYMBOL(). Like
ALLOW_RETURN_OVERRIDE() and on the btrfs side mark the functions
we're going to test with this feature.
Then the 'not crashing the kernel' requirement will be preserved.
btrfs or whatever else we will be testing with override_return
will be functioning in 'stress test' mode, and if the bpf program
is not careful and returns an error all the time then one particular
subsystem (like btrfs) will not be functional, but the kernel
will not crash.
Thoughts?
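
For concreteness, a rough sketch of what such an annotation could look
like, modeled on how NOKPROBE_SYMBOL() collects addresses in a dedicated
section. None of these names exist today; they are purely illustrative:

#define ALLOW_RETURN_OVERRIDE(fname)					\
	static unsigned long __used					\
	__attribute__((__section__("_return_override_whitelist")))	\
	_row_addr_##fname = (unsigned long)fname

/* section boundaries provided by the linker */
extern unsigned long __start__return_override_whitelist[];
extern unsigned long __stop__return_override_whitelist[];

static bool addr_allows_return_override(unsigned long addr)
{
	unsigned long *p;

	for (p = __start__return_override_whitelist;
	     p < __stop__return_override_whitelist; p++)
		if (*p == addr)
			return true;
	return false;
}

/* and on the btrfs side, next to the function being stress tested: */
ALLOW_RETURN_OVERRIDE(open_ctree);

The attach path for programs that use bpf_override_return() would then
reject kprobes whose address is not in that whitelist.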



Re: [ftrace-bpf 1/5] add BPF_PROG_TYPE_FTRACE to bpf

2017-11-12 Thread Alexei Starovoitov
On Sun, Nov 12, 2017 at 07:28:24AM +, yupeng0...@gmail.com wrote:
> Add a new type BPF_PROG_TYPE_FTRACE to bpf, let bpf can be attached to
> ftrace. Ftrace pass the function parameters to bpf prog, bpf prog
> return 1 or 0 to indicate whether ftrace can trace this function. The
> major propose is provide an accurate way to trigger function graph
> trace. Changes in code:
> 1. add FTRACE_BPF_FILTER in kernel/trace/Kconfig. Let ftrace pass
> function parameter to bpf need to modify architecture dependent code,
> so this feature will only be enabled only when it is enabled in
> Kconfig and the architecture support this feature. If an architecture
> support this feature, it should define a macro whose name is
> FTRACE_BPF_FILTER, e.g.:
> So other code in kernel can check whether the macro FTRACE_BPF_FILTER
> is defined to know whether this feature is really enabled.
> 2. add BPF_PROG_TYPE_FTRACE in bpf_prog_type
> 3. check kernel version when load BPF_PROG_TYPE_FTRACE bpf prog
> 4. define ftrace_prog_func_proto, the prog input is a struct
> ftrace_regs type pointer, it is similar as pt_regs in kprobe, it
> is an architecture dependent code, if an architecture doens't define
> FTRACE_BPF_FILTER, use a fake ftrace_prog_func_proto.
> 5. add BPF_PROG_TYPE in bpf_types.h
> 
> Signed-off-by: yupeng0...@gmail.com

In general I like the bigger concept of adding bpf filtering to ftrace,
but there are a lot of fundamental issues with this patch set.

1. anything bpf related has to go via net-next tree.

> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -118,6 +118,7 @@ enum bpf_prog_type {
>   BPF_PROG_TYPE_UNSPEC,
>   BPF_PROG_TYPE_SOCKET_FILTER,
>   BPF_PROG_TYPE_KPROBE,
> + BPF_PROG_TYPE_FTRACE,
>   BPF_PROG_TYPE_SCHED_CLS,

2.
this obviously breaks ABI. New types can only be added to the end.

> +static bool ftrace_prog_is_valid_access(int off, int size,
> + enum bpf_access_type type,
> + struct bpf_insn_access_aux *info)
> +{
> + if (off < 0 || off >= sizeof(struct ftrace_regs))
> + return false;

3.
this won't even compile, since ftrace_regs is only added in the patch 4.

Since bpf program will see ftrace_regs as an input it becomes
abi, so has to be defined in uapi/linux/bpf_ftrace.h or similar.
We need to think through how to make it generic across archs
instead of defining ftrace_regs for each arch.

4.
the patch 2/3 takes an approach of passing FD integer value in text form
to the kernel. That approach was discussed years ago and rejected.
It has to use binary interface like perf_event + ioctl.
See RFC patches where we're extending perf_event_open syscall to
support binary access to kprobe/uprobe.
imo binary interface to ftrace is pre-requisite to ftrace+bpf work.
We've had too many issues with text based kprobe api to repeat
the same mistake here.

5.
patch 4 hacks save_mcount_regs asm to pass ctx pointer in %rcx
whereas it's only used in ftrace_graph_caller which doesn't seem right.
It points out to another issue that such ftrace+bpf integration
is only done for ftrace_graph_caller without extensibility in mind.
If we do ftrace+bpf I'd rather see generic framework that applies
to all of ftrace instead of single feature of it.

6.
copyright line copy-pasted incorrectly.



Re: [PATCH 1/2] bpf: add a bpf_override_function helper

2017-11-03 Thread Alexei Starovoitov
On Fri, Nov 03, 2017 at 05:52:22PM +0100, Daniel Borkmann wrote:
> On 11/03/2017 03:31 PM, Josef Bacik wrote:
> > On Fri, Nov 03, 2017 at 12:12:13AM +0100, Daniel Borkmann wrote:
> > > Hi Josef,
> > > 
> > > one more issue I just noticed, see comment below:
> > > 
> > > On 11/02/2017 03:37 PM, Josef Bacik wrote:
> > > [...]
> > > > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > > > index cdd78a7beaae..dfa44fd74bae 100644
> > > > --- a/include/linux/filter.h
> > > > +++ b/include/linux/filter.h
> > > > @@ -458,7 +458,8 @@ struct bpf_prog {
> > > > locked:1,   /* Program image 
> > > > locked? */
> > > > gpl_compatible:1, /* Is filter GPL 
> > > > compatible? */
> > > > cb_access:1,/* Is control block 
> > > > accessed? */
> > > > -   dst_needed:1;   /* Do we need dst 
> > > > entry? */
> > > > +   dst_needed:1,   /* Do we need dst 
> > > > entry? */
> > > > +   kprobe_override:1; /* Do we override a 
> > > > kprobe? */
> > > > kmemcheck_bitfield_end(meta);
> > > > enum bpf_prog_type  type;   /* Type of BPF program 
> > > > */
> > > > u32 len;/* Number of filter 
> > > > blocks */
> > > [...]
> > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > > index d906775e12c1..f8f7927a9152 100644
> > > > --- a/kernel/bpf/verifier.c
> > > > +++ b/kernel/bpf/verifier.c
> > > > @@ -4189,6 +4189,8 @@ static int fixup_bpf_calls(struct 
> > > > bpf_verifier_env *env)
> > > > prog->dst_needed = 1;
> > > > if (insn->imm == BPF_FUNC_get_prandom_u32)
> > > > bpf_user_rnd_init_once();
> > > > +   if (insn->imm == BPF_FUNC_override_return)
> > > > +   prog->kprobe_override = 1;
> > > > if (insn->imm == BPF_FUNC_tail_call) {
> > > > /* If we tail call into other programs, we
> > > >  * cannot make any assumptions since they can
> > > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > > index 9660ee65fbef..0d7fce52391d 100644
> > > > --- a/kernel/events/core.c
> > > > +++ b/kernel/events/core.c
> > > > @@ -8169,6 +8169,13 @@ static int perf_event_set_bpf_prog(struct 
> > > > perf_event *event, u32 prog_fd)
> > > > return -EINVAL;
> > > > }
> > > > 
> > > > +   /* Kprobe override only works for kprobes, not uprobes. */
> > > > +   if (prog->kprobe_override &&
> > > > +   !(event->tp_event->flags & TRACE_EVENT_FL_KPROBE)) {
> > > > +   bpf_prog_put(prog);
> > > > +   return -EINVAL;
> > > > +   }
> > > 
> > > Can we somehow avoid the prog->kprobe_override flag here completely
> > > and also same in the perf_event_attach_bpf_prog() handler?
> > > 
> > > Reason is that it's not reliable for bailing out this way: Think of
> > > the main program you're attaching doesn't use bpf_override_return()
> > > helper, but it tail-calls into other BPF progs that make use of it
> > > instead. So above check would be useless and will fail and we continue
> > > to attach the prog for probes where it's not intended to be used.
> > > 
> > > We've had similar issues in the past e.g. c2002f983767 ("bpf: fix
> > > checking xdp_adjust_head on tail calls") is just one of those. Thus,
> > > can we avoid the flag altogether and handle such error case differently?
> > 
> > So if I'm reading this right there's no way to know what we'll tail call at 
> > any
> > given point, so I need to go back to my previous iteration of this patch and
> > always save the state of the kprobe in the per-cpu variable to make sure we
> > don't use bpf_override_return in the wrong case?
> 
> Yeah.
> 
> > The tail call functions won't be in the BPF_PROG_ARRAY right?  It'll be just
> > some other arbitrary function?  If that's the case then we really need 
> > something
> > like this
> 
> With BPF_PROG_ARRAY you mean BPF_MAP_TYPE_PROG_ARRAY or the prog array
> for the tracing/multiprog attach point? The program you're calling into
> is inside the BPF_MAP_TYPE_PROG_ARRAY map, but can change at any time
> and can have nesting as well.
> 
> > https://patchwork.kernel.org/patch/10034815/
> > 
> > and I need to bring that back right?  Thanks,
> 
> I'm afraid so. The thing with skb cb_access which was brought up there is
> that once you have a tail call in the prog you cannot make any assumptions
> anymore, therefore the cb_access flag is set to 1 so we save/restore for
> those cases precautionary since it could be accessed or not later on. In
> your case I think this wouldn't work since legitimate bpf kprobes progs could
> use tail calls today, so setting prog->kprobe_override there would prevent
> attaching for non-kprobes due to subsequent flags & TRACE_EVENT_F

Re: [RFC PATCH] bpf: Add helpers to read useful task_struct members

2017-11-04 Thread Alexei Starovoitov

On 11/3/17 3:58 PM, Sandipan Das wrote:

For added security, the layout of some structures can be
randomized by enabling CONFIG_GCC_PLUGIN_RANDSTRUCT. One
such structure is task_struct. To build BPF programs, we
use Clang which does not support this feature. So, if we
attempt to read a field of a structure with a randomized
layout within a BPF program, we do not get the expected
value because of incorrect offsets. To observe this, it
is not mandatory to have CONFIG_GCC_PLUGIN_RANDSTRUCT
enabled because the structure annotations/members added
for this purpose are enough to cause this. So, all kernel
builds are affected.

For example, considering samples/bpf/offwaketime_kern.c,
if we try to print the values of pid and comm inside the
task_struct passed to waker() by adding the following
lines of code at the appropriate place

  char fmt[] = "waker(): p->pid = %u, p->comm = %s\n";
  bpf_trace_printk(fmt, sizeof(fmt), _(p->pid), _(p->comm));

it is seen that upon rebuilding and running this sample
followed by inspecting /sys/kernel/debug/tracing/trace,
the output looks like the following

   _-=> irqs-off
  / _=> need-resched
 | / _---=> hardirq/softirq
 || / _--=> preempt-depth
 ||| / delay
TASK-PID   CPU#  TIMESTAMP  FUNCTION
   | |   |      | |
  -0 [007] d.s.  1883.443594: 0x0001: waker(): p->pid = 0, 
p->comm =
  -0 [018] d.s.  1883.453588: 0x0001: waker(): p->pid = 0, 
p->comm =
  -0 [007] d.s.  1883.463584: 0x0001: waker(): p->pid = 0, 
p->comm =
  -0 [009] d.s.  1883.483586: 0x0001: waker(): p->pid = 0, 
p->comm =
  -0 [005] d.s.  1883.493583: 0x0001: waker(): p->pid = 0, 
p->comm =
  -0 [009] d.s.  1883.503583: 0x0001: waker(): p->pid = 0, 
p->comm =
  -0 [018] d.s.  1883.513578: 0x0001: waker(): p->pid = 0, 
p->comm =
 systemd-journal-3140  [003] d...  1883.627660: 0x0001: waker(): p->pid = 0, 
p->comm =
 systemd-journal-3140  [003] d...  1883.627704: 0x0001: waker(): p->pid = 0, 
p->comm =
 systemd-journal-3140  [003] d...  1883.627723: 0x0001: waker(): p->pid = 0, 
p->comm =

To avoid this, we add new BPF helpers that read the
correct values for some of the important task_struct
members such as pid, tgid, comm and flags which are
extensively used in BPF-based analysis tools such as
bcc. Since these helpers are built with GCC, they use
the correct offsets when referencing a member.

Signed-off-by: Sandipan Das 

...

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f90860d1f897..324508d27bd2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -338,6 +338,16 @@ union bpf_attr {
  * @skb: pointer to skb
  * Return: classid if != 0
  *
+ * u64 bpf_get_task_pid_tgid(struct task_struct *task)
+ * Return: task->tgid << 32 | task->pid
+ *
+ * int bpf_get_task_comm(struct task_struct *task)
+ * Stores task->comm into buf
+ * Return: 0 on success or negative error
+ *
+ * u32 bpf_get_task_flags(struct task_struct *task)
+ * Return: task->flags
+ *


I don't think it's a solution.
Tracing scripts read other fields too.
Making it work for these 3 fields is a drop in a bucket.
If randomization is used I think we have to accept
that existing bpf scripts won't be usable.
Long term solution is to support 'BPF Type Format' or BTF
(which is old C-Type Format) for kernel data structures,
so bcc scripts wouldn't need to use kernel headers and clang.
The proper offsets will be described in BTF.
We were planning to use it initially to describe map key/value,
but it applies for this case as well.
There will be a tool that will take dwarf from vmlinux and
compress it into BTF. Kernel will also be able to verify
that BTF is a valid BTF.
I'm assuming that gcc randomization plugin produces dwarf
with correct offsets, if not, it would have to be fixed.



Re: [RFC PATCH] bpf: Add helpers to read useful task_struct members

2017-11-04 Thread Alexei Starovoitov

On 11/5/17 2:31 AM, Naveen N. Rao wrote:

Hi Alexei,

Alexei Starovoitov wrote:

On 11/3/17 3:58 PM, Sandipan Das wrote:

For added security, the layout of some structures can be
randomized by enabling CONFIG_GCC_PLUGIN_RANDSTRUCT. One
such structure is task_struct. To build BPF programs, we
use Clang which does not support this feature. So, if we
attempt to read a field of a structure with a randomized
layout within a BPF program, we do not get the expected
value because of incorrect offsets. To observe this, it
is not mandatory to have CONFIG_GCC_PLUGIN_RANDSTRUCT
enabled because the structure annotations/members added
for this purpose are enough to cause this. So, all kernel
builds are affected.

For example, considering samples/bpf/offwaketime_kern.c,
if we try to print the values of pid and comm inside the
task_struct passed to waker() by adding the following
lines of code at the appropriate place

  char fmt[] = "waker(): p->pid = %u, p->comm = %s\n";
  bpf_trace_printk(fmt, sizeof(fmt), _(p->pid), _(p->comm));

it is seen that upon rebuilding and running this sample
followed by inspecting /sys/kernel/debug/tracing/trace,
the output looks like the following

                              _-----=> irqs-off
                             / _----=> need-resched
                            | / _---=> hardirq/softirq
                            || / _--=> preempt-depth
                            ||| /     delay
           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
              | |       |   ||||       |         |
          <idle>-0     [007] d.s.  1883.443594: 0x0001: waker(): p->pid = 0, p->comm =
          <idle>-0     [018] d.s.  1883.453588: 0x0001: waker(): p->pid = 0, p->comm =
          <idle>-0     [007] d.s.  1883.463584: 0x0001: waker(): p->pid = 0, p->comm =
          <idle>-0     [009] d.s.  1883.483586: 0x0001: waker(): p->pid = 0, p->comm =
          <idle>-0     [005] d.s.  1883.493583: 0x0001: waker(): p->pid = 0, p->comm =
          <idle>-0     [009] d.s.  1883.503583: 0x0001: waker(): p->pid = 0, p->comm =
          <idle>-0     [018] d.s.  1883.513578: 0x0001: waker(): p->pid = 0, p->comm =
 systemd-journal-3140  [003] d...  1883.627660: 0x0001: waker(): p->pid = 0, p->comm =
 systemd-journal-3140  [003] d...  1883.627704: 0x0001: waker(): p->pid = 0, p->comm =
 systemd-journal-3140  [003] d...  1883.627723: 0x0001: waker(): p->pid = 0, p->comm =

To avoid this, we add new BPF helpers that read the
correct values for some of the important task_struct
members such as pid, tgid, comm and flags which are
extensively used in BPF-based analysis tools such as
bcc. Since these helpers are built with GCC, they use
the correct offsets when referencing a member.

Signed-off-by: Sandipan Das 

...

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f90860d1f897..324508d27bd2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -338,6 +338,16 @@ union bpf_attr {
  * @skb: pointer to skb
  * Return: classid if != 0
  *
+ * u64 bpf_get_task_pid_tgid(struct task_struct *task)
+ * Return: task->tgid << 32 | task->pid
+ *
+ * int bpf_get_task_comm(struct task_struct *task)
+ * Stores task->comm into buf
+ * Return: 0 on success or negative error
+ *
+ * u32 bpf_get_task_flags(struct task_struct *task)
+ * Return: task->flags
+ *


I don't think it's a solution.
Tracing scripts read other fields too;
making it work for these three fields is a drop in the bucket.


Indeed. However...


If randomization is used, I think we have to accept
that existing bpf scripts won't be usable.


... the actual issue is that randomization isn't necessary for this to
show up. The annotations added to mark off the structure members result
in some of those members being moved into an anonymous structure, which
is then padded differently. So, *all* kernels since v4.13 are
affected, afaict.
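
A stand-alone illustration of the padding point above (hypothetical types,
compiled as ordinary userspace C, not kernel code): wrapping a run of members
in an anonymous struct makes the compiler align and size that group as a
unit, so member offsets can differ from the unwrapped layout.

  #include <stdio.h>
  #include <stddef.h>

  struct plain {          /* members laid out inline */
          char a;
          char b;
          long c;
  };

  struct wrapped {        /* same members, grouped in an anonymous struct */
          char a;
          struct {
                  char b;
                  long c;
          };
  };

  int main(void)
  {
          printf("plain:   b at %zu, c at %zu\n",
                 offsetof(struct plain, b), offsetof(struct plain, c));
          printf("wrapped: b at %zu, c at %zu\n",
                 offsetof(struct wrapped, b), offsetof(struct wrapped, c));
          return 0;
  }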


hmm. why would all 4.13+ kernels be affected?
It's just an anonymous struct inside task_struct.
Are you saying it's due to clang not adding this 'struct { };' treatment
to task_struct?

I thought such a struct shouldn't change the layout.
If it does, we need to fix include/linux/compiler-clang.h to add that
anon struct as well.
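
A minimal sketch of what that compiler-clang.h change could look like,
assuming (per the discussion above) that the gcc side's
randomized_struct_fields_start/end markers expand to an anonymous
'struct { ... };' wrapper while clang's currently expand to nothing.
This is only the idea being floated here, not an actual upstream patch.

  /* include/linux/compiler-clang.h (sketch): emit the same anonymous-struct
   * wrapper around the randomizable task_struct members that gcc emits, so
   * both compilers agree on padding and therefore on member offsets.
   */
  #define randomized_struct_fields_start	struct {
  #define randomized_struct_fields_end	};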


As such, we wanted to propose this as a short-term solution, but I do
agree that this doesn't solve the real issue.


The long-term solution is to support 'BPF Type Format' or BTF
(which is the old C-Type Format) for kernel data structures,
so bcc scripts wouldn't need to use kernel headers and clang.
The proper offsets will be described in BTF.
We were planning to use it initially to describe map key/value,
but it applies to this case as well.
There will be a tool that takes the DWARF from vmlinux and
compresses it into BTF. The kernel will also be able to verify
that the BTF it is given is valid.


Thi

Re: [PATCH net 0/3] Fix for BPF devmap percpu allocation splat

2017-10-18 Thread Alexei Starovoitov
On Wed, Oct 18, 2017 at 7:22 AM, Daniel Borkmann  wrote:
>
> Higher prio imo would be to make the allocation itself faster,
> though. I remember we talked about this back in May wrt the hashtable,
> but I kind of lost track of whether there was an update on this in
> the meantime. ;-)

The new percpu allocator by Dennis fixed those issues. It's in 4.14.


Re: [PATCH 18/27] bpf: Restrict kernel image access functions when the kernel is locked down

2017-10-19 Thread Alexei Starovoitov
On Thu, Oct 19, 2017 at 03:52:49PM +0100, David Howells wrote:
> From: Chun-Yi Lee 
> 
> There are some bpf functions that can be used to read kernel memory:
> bpf_probe_read, bpf_probe_write_user and bpf_trace_printk.  These allow
> private keys in kernel memory (e.g. the hibernation image signing key) to
> be read by an eBPF program.  Prohibit those functions when the kernel is
> locked down.
> 
> Signed-off-by: Chun-Yi Lee 
> Signed-off-by: David Howells 
> cc: net...@vger.kernel.org
> ---
> 
>  kernel/trace/bpf_trace.c |   11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index dc498b605d5d..35e85a3fdb37 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -65,6 +65,11 @@ BPF_CALL_3(bpf_probe_read, void *, dst, u32, size, const 
> void *, unsafe_ptr)
>  {
>   int ret;
>  
> + if (kernel_is_locked_down("BPF")) {
> + memset(dst, 0, size);
> + return -EPERM;
> + }

That doesn't help the lockdown purpose.
If you don't trust root, the only way to prevent bpf from reading
memory is to disable the whole thing.
Have a single check in sys_bpf() to disallow everything if
kernel_is_locked_down(),
and don't add overhead to a critical path like bpf_probe_read().
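
A sketch of the suggestion above, not the posted patch: a single check at
the bpf(2) syscall entry point in kernel/bpf/syscall.c, using the lockdown
series' kernel_is_locked_down() helper, instead of per-helper checks on the
tracing fast path.

  SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
  {
          /* Reject the whole syscall when locked down; the tracing helpers
           * and other fast paths then need no extra checks.
           */
          if (kernel_is_locked_down("BPF"))
                  return -EPERM;

          /* ... existing capability checks and command dispatch ... */
  }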


