Re: [iovisor-dev] Accessing user memory and minor page faults

2017-10-01 Thread Alexei Starovoitov via iovisor-dev
On Sun, Oct 01, 2017 at 02:08:36PM -0700, Gianluca Borello via iovisor-dev 
wrote:
> On Mon, Sep 25, 2017 at 9:36 AM, Alexei Starovoitov
>  wrote:
> >
> > this issue was discussed at Plumbers and it seems there may be
> > a solution in sight. The work on 'speculative page faults' will
> > remove mm->mmap_sem in favor of srcu approach with sequence numbers
> > and we will be able to do find_vma() and vma->vm_ops->access() from
> > the non-sleepable context.
> > From bpf program point of view it probably be a new helper
> > bpf_probe_read_harder() ;) or something that will try normal
> > pagefault_disabled read first and if it fails will try
> > srcu_read_lock+vma->access approach.
> >
> 
> Thank you Alexei for your reply and sorry for the delay, I just
> finally found the time over the weekend to go over your message more
> deeply.
> 
> I applied the speculative page fault patch to my tree to better
> understand the implications of your comment and indeed this patch (way
> over my head!) seems a huge leap forward because it allows us to
> lookup a VMA without taking any lock, so we can do it in a
> non-sleepable context.
> 
> However, I am still missing how this could be a resolutive fix. Let's
> imagine for example the case I mentioned above where we have a fork()
> child and right after the fork all VMAs referring to mapped files will
> not have any valid PTEs (but the file is already in the page cache).
> 
> In this case, there's little we can do beside grabbing the VMA and
> asking some vma->vm_ops to give us the page corresponding to the
> address we're looking for. With the speculative fault, we can do it
> also from a BPF helper, however some vm_ops methods are not ready to
> be called in a non-sleepable context. For example, for filemap:
> 
> - fault() is not safe because it consistently ends up in a
> might_sleep() invocation [1][2]
> - map_pages() seems safe (but is it also for other VMA implementations?)
> - access() is not defined
> 
> So, which ones would this BPF helper call in order to guarantee
> usefulness while not causing blocking? Just calling vm_ops->access()
> wouldn't help in this case since it's not defined. Looking at the code
> for __access_remote_vm(), it seems it does a mix of get_user_pages()
> (which in turn calls vm_ops->fault() and/or vm_ops->map_pages()) and
> as a fallback it uses vm_ops->access(), but of course that one can
> sleep.

my understanding of speculative page fault patch is that the whole
get_user_pages will operate under srcu, so we can call it
if necessary (not only find_vma will be safe to call).

> Perhaps the solution is much simpler and I just didn't grasp all the
> implications of this work? (sorry again, it's the first time I dabble
> in this subsystem).

all correct. It's not simpler than what you described.
I thought access() will be available in the situation we care about.
The only bit I'm missing is why to trace exec() args it will be going
all the way to filemap-backed pages of the executable ?
The strings are in the heap/stack, no?
we can try to implement filemap_access and it probably won't
need to lock_page to access it, since readahead does it without locking.
Accessing things via kmap or ioremap (the way generic_access_phys does it)
is too costly, so probably not the right approach.

___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] New helper bpf_get_current_pidns_info

2017-10-01 Thread carlos antonio neira bustos via iovisor-dev
Hi,

I'll add the device major/minor numbers into the pidns_info structure and
work on those 4 points.
Thanks for your guidance on this.

Bests

On Sat, Sep 30, 2017 at 9:55 PM, Y Song  wrote:

> On Thu, Sep 28, 2017 at 2:02 PM, carlos antonio neira bustos
>  wrote:
> > Hi All,
> >
> > I'm still working this issue https://github.com/iovisor/bcc/issues/1329.
> > I have added tests under samples/bpf, here is test calling this new
> helper.
> >
> > Inside the container
> >
> > ping-10619 [000] d.s1  5319.547909: 0x0001: ns_id 4026532197 tgid
> 10619
> > pid 10619
> >
> > Outside the container
> >
> > ping-12174 [000] d.s1  5480.582818: 0x0001: ns_id 4026531836 tgid
> 12174
> > pid 12174
> >
> >
> > Let me know if something needs to be changed.
> >
> > Thanks again for your help and comments.
> >
> > Here is the patch
> >
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index b69e7a5..34b608e 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -227,6 +227,12 @@ struct bpf_event_entry {
> >   struct rcu_head rcu;
> >  };
> >
> > +struct bpf_current_pidns_info {
> > +u64 ns_id;
> > +u32 tgid;
> > +u32 pid;
> > +};
>
> During linux plumbers conference, I talked to
> Eric Biederman (kernel namespace maintainer).
> The below is an example output of "stat -L /proc/self/ns/pid":
>
> -bash-4.3$ stat -L /proc/self/ns/pid
>   File: '/proc/self/ns/pid'
>   Size: 0 Blocks: 0  IO Block: 4096   regular empty
> file
> Device: 3h/3dInode: 4026531836  Links: 1
> Access: (0444/-r--r--r--)  Uid: (0/root)   Gid: (0/root)
> Context: system_u:object_r:nsfs_t:s0
> Access: 2017-09-28 21:21:41.496571299 -0700
> Modify: 2017-09-28 21:21:41.496571299 -0700
> Change: 2017-09-28 21:21:41.496571299 -0700
>  Birth: -
> -bash-4.3$
>
> You will notice that there is a "Device" field with major and minor
> number. Currently, all namespace files will have the same device.
> However, in the future, it is possible (maybe under really rare
> cases) that different pid_ns files may belong to different device.
>
> So he suggests that we should put device major/minor numbers
> in the pidns_info structure as well. Could you help do that as well?
>
> > +
> >  u64 bpf_tail_call(u64 ctx, u64 r2, u64 index, u64 r4, u64 r5);
> >  u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
> >
> > @@ -375,6 +381,8 @@ extern const struct bpf_func_proto
> > bpf_skb_vlan_push_proto;
> >  extern const struct bpf_func_proto bpf_skb_vlan_pop_proto;
> >  extern const struct bpf_func_proto bpf_get_stackid_proto;
> >
> > +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
> > +
> >  /* Shared helpers among cBPF and eBPF. */
> >  void bpf_user_rnd_init_once(void);
> >  u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index e99e3e6..c1b94fa 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -539,6 +539,15 @@ union bpf_attr {
> >   * @mode: operation mode (enum bpf_adj_room_mode)
> >   * @flags: reserved for future use
> >   * Return: 0 on success or negative error code
> > + *
> > + * int bpf_get_current_pidns_info(void *buf, int size_of_buf)
> > + * stores the following  namespace data into
> > + * bpf_current_pins_info struct:
> > + * namespace id
> > + * tgid inside namespace
> > + * pid  inside namespace
> > + * Return: 0 on success or negative error
> > + *
> >   */
> >  #define __BPF_FUNC_MAPPER(FN)  \
> >   FN(unspec),   \
> > @@ -591,7 +600,9 @@ union bpf_attr {
> >   FN(get_socket_uid),  \
> >   FN(set_hash),   \
> >   FN(setsockopt),   \
> > - FN(skb_adjust_room),
> > + FN(skb_adjust_room),\
> > + FN(get_current_pidns_info),
> > +
> >
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which
> > helper
> >   * function eBPF program intends to call
> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > index ad5f559..c81ffa0 100644
> > --- a/kernel/bpf/core.c
> > +++ b/kernel/bpf/core.c
> > @@ -1379,6 +1379,9 @@ const struct bpf_func_proto
> > bpf_get_current_pid_tgid_proto __weak;
> >  const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
> >  const struct bpf_func_proto bpf_get_current_comm_proto __weak;
> >
> > +const struct bpf_func_proto bpf_get_current_pidns_info __weak;
> > +
> > +
> >  const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
> >  {
> >   return NULL;
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index 3d24e23..682d623 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -18,6 +18,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  /* If kernel subsystem is allowing eBPF programs to call this function,
> >   * inside its own verifier_ops->get_func_proto() callback it should
> return
> > @@