Re: How to properly fix reading user pointers in bpf in android kernel 4.9?

2024-05-24 Thread Bagas Sanjaya
[also Cc: bpf maintainers and get_maintainer output]

On Thu, May 23, 2024 at 07:52:22PM +0300, Marcel wrote:
> This seems to be a long-standing problem with the Linux kernel in 
> general. bpf_probe_read should have worked for both kernel and user pointers, 
> but it fails with an access error when reading a user one instead. 
> 
> I know there's a patch upstream that fixes this by introducing new helpers 
> for reading kernel and userspace pointers, and I tried to backport them 
> to my kernel, but with no success. Tools like bcc fail to use them and instead 
> report that the arguments sent to the helpers are invalid. I assume this 
> is because the ARG_CONST_STACK_SIZE and ARG_PTR_TO_RAW_STACK argument types 
> handle data differently in the 4.9 android version and the upstream version, 
> but I'm not sure that this is the cause. I left the patch I did below, with 
> a link to the kernel I'm working on; maybe someone can take a look and give 
> me a hand (the patch isn't applied yet).

What upstream patch? Has it already been merged into mainline?
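
On the ARG_* suspicion: those names were renamed in v4.10 (ARG_PTR_TO_RAW_STACK
became ARG_PTR_TO_UNINIT_MEM, ARG_CONST_STACK_SIZE became ARG_CONST_SIZE), so a
4.9 backport has to keep the old spellings, as your patch below does. For
reference only, a minimal sketch of the mapping (untested, assuming the 4.9
names carry the same "uninitialized stack buffer + constant size" semantics):

static const struct bpf_func_proto bpf_probe_read_user_proto = {
	.func		= bpf_probe_read_user,
	.gpl_only	= true,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_PTR_TO_RAW_STACK,	/* upstream: ARG_PTR_TO_UNINIT_MEM */
	.arg2_type	= ARG_CONST_STACK_SIZE,	/* upstream: ARG_CONST_SIZE_OR_ZERO */
	.arg3_type	= ARG_ANYTHING,
};

One concrete difference worth ruling out: upstream's ARG_CONST_SIZE_OR_ZERO
accepts zero-sized reads, while the old ARG_CONST_STACK_SIZE rejects them, so
any zero-sized probe read emitted by bcc should fail verification on 4.9.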

> 
> 
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 744b4763b80e..de94c13b7193 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -559,6 +559,43 @@ enum bpf_func_id {
> */
> BPF_FUNC_probe_read_user,
>  
> +   /**
> +   * int bpf_probe_read_kernel(void *dst, int size, const void *unsafe_ptr)
> +   * Safely read size bytes from a kernel space address.
> +   * Return: 0 on success or negative error
> +   */
> +   BPF_FUNC_probe_read_kernel,
> +
> + /**
> +  * int bpf_probe_read_user_str(void *dst, int size, const void *unsafe_ptr)
> +  * Copy a NUL terminated string from an unsafe user address. In case
> +  * the string length is smaller than size, the target is not padded
> +  * with further NUL bytes. In case the string length is larger than
> +  * size, just size-1 bytes are copied and the last byte is set to NUL.
> +  * @dst: destination address
> +  * @size: maximum number of bytes to copy, including the trailing NUL
> +  * @unsafe_ptr: unsafe user address
> +  * Return:
> +  *   > 0 length of the string including the trailing NUL on success
> +  *   < 0 error
> +  */
> + BPF_FUNC_probe_read_user_str,
> +
> + /**
> +  * int bpf_probe_read_kernel_str(void *dst, int size, const void *unsafe_ptr)
> +  * Copy a NUL terminated string from an unsafe kernel address. In case
> +  * the string length is smaller than size, the target is not padded
> +  * with further NUL bytes. In case the string length is larger than
> +  * size, just size-1 bytes are copied and the last byte is set to NUL.
> +  * @dst: destination address
> +  * @size: maximum number of bytes to copy, including the trailing NUL
> +  * @unsafe_ptr: unsafe kernel address
> +  * Return:
> +  *   > 0 length of the string including the trailing NUL on success
> +  *   < 0 error
> +  */
> + BPF_FUNC_probe_read_kernel_str,
> +
>   __BPF_FUNC_MAX_ID,
>  };
>  
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index a1e37a5d8c88..3478ca744a45 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -94,7 +94,7 @@ static const struct bpf_func_proto bpf_probe_read_proto = {
>   .arg3_type  = ARG_ANYTHING,
>  };
>  
> -BPF_CALL_3(bpf_probe_read_user, void *, dst, u32, size, const void *, unsafe_ptr)
> +BPF_CALL_3(bpf_probe_read_user, void *, dst, u32, size, const void __user *, unsafe_ptr)
>  {
>   int ret;
>  
> @@ -115,6 +115,27 @@ static const struct bpf_func_proto bpf_probe_read_user_proto = {
>  };
>  
>  
> +BPF_CALL_3(bpf_probe_read_kernel, void *, dst, u32, size, const void *, unsafe_ptr)
> +{
> + int ret;
> +
> + ret = probe_kernel_read(dst, unsafe_ptr, size);
> + if (unlikely(ret < 0))
> + memset(dst, 0, size);
> +
> + return ret;
> +}
> +
> +static const struct bpf_func_proto bpf_probe_read_kernel_proto = {
> + .func   = bpf_probe_read_kernel,
> + .gpl_only   = true,
> + .ret_type   = RET_INTEGER,
> + .arg1_type  = ARG_PTR_TO_RAW_STACK,
> + .arg2_type  = ARG_CONST_STACK_SIZE,
> + .arg3_type  = ARG_ANYTHING,
> +};
> +
> +
>  BPF_CALL_3(bpf_probe_write_user, void *, unsafe_ptr, const void *, src,
>  u32, size)
>  {
> @@ -487,6 +508,69 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = {
>   .arg3_type  = ARG_ANYTHING,
>  };
>  
> +
> +
> +BPF_CALL_3(bpf_probe_read_user_str, void *, dst, u32, size,
> +const void __user *, unsafe_ptr)
> +{
> + int ret;
> +
> + /*
> +  * The strncpy_from_unsafe() call will likely not fill the entire
> +  * buffer, but that's okay in this circumstance as we're probing
> +  * 

Re: [RFC PATCH bpf-next] ksnoop: kernel argument/return value tracing/display using BTF

2021-01-05 Thread Alan Maguire



On Tue, 5 Jan 2021, Cong Wang wrote:

> On Mon, Jan 4, 2021 at 7:29 AM Alan Maguire  wrote:
> >
> > BPF Type Format (BTF) provides a description of kernel data structures
> > and of the types kernel functions utilize as arguments and return values.
> >
> > A helper was recently added - bpf_snprintf_btf() - that uses that
> > description to create a string representation of the data provided,
> > using the BTF id of its type.  For example to create a string
> > representation of a "struct sk_buff", the pointer to the skb
> > is provided along with the type id of "struct sk_buff".
> >
> > Here that functionality is utilized to support tracing kernel
> > function entry and return using k[ret]probes.  The "struct pt_regs"
> > context can be used to derive arguments and return values, and
> > when the user supplies a function name we
> >
> > - look it up in /proc/kallsyms to find its address/module
> > - look it up in the BTF kernel data to get types of arguments
> >   and return value
> > - store a map representation of the trace information, keyed by
> >   instruction pointer
> > - on function entry/return we look up the map to retrieve the BTF
> >   ids of the arguments/return values and can call bpf_snprintf_btf()
> >   with these argument/return values along with the type ids to store
> >   a string representation in the map.
> > - this is then sent via perf event to userspace where it can be
> >   displayed.
> >
> > ksnoop can be used to show function signatures; for example:
> 
> This is definitely quite useful!
> 
> Is it possible to integrate this with bpftrace? That would save people
> from learning yet another tool. ;)
> 

I'd imagine (and hope!) other tracing tools will do this, but right 
now the aim is to make the task of tracing kernel data structures simpler, 
so having a tool dedicated to just that can hopefully help those 
discussions.  There's a bit more work to be done to simplify that task, for
example implementing Alexei's suggestion to support pretty-printing of 
data structures using BTF in libbpf.

My hope is that we can evolve this tool - or something like it - to the 
point where we can solve that one problem easily, and that other more 
general tracers can then make use of that solution.  I probably should
have made all of this clearer in the patch submission, sorry about that.

Alan


Re: [RFC PATCH bpf-next] ksnoop: kernel argument/return value tracing/display using BTF

2021-01-05 Thread Cong Wang
On Mon, Jan 4, 2021 at 7:29 AM Alan Maguire  wrote:
>
> BPF Type Format (BTF) provides a description of kernel data structures
> and of the types kernel functions utilize as arguments and return values.
>
> A helper was recently added - bpf_snprintf_btf() - that uses that
> description to create a string representation of the data provided,
> using the BTF id of its type.  For example to create a string
> representation of a "struct sk_buff", the pointer to the skb
> is provided along with the type id of "struct sk_buff".
>
> Here that functionality is utilized to support tracing kernel
> function entry and return using k[ret]probes.  The "struct pt_regs"
> context can be used to derive arguments and return values, and
> when the user supplies a function name we
>
> - look it up in /proc/kallsyms to find its address/module
> - look it up in the BTF kernel data to get types of arguments
>   and return value
> - store a map representation of the trace information, keyed by
>   instruction pointer
> - on function entry/return we look up the map to retrieve the BTF
>   ids of the arguments/return values and can call bpf_snprintf_btf()
>   with these argument/return values along with the type ids to store
>   a string representation in the map.
> - this is then sent via perf event to userspace where it can be
>   displayed.
>
> ksnoop can be used to show function signatures; for example:

This is definitely quite useful!

Is it possible to integrate this with bpftrace? That would save people
from learning yet another tool. ;)

Thanks.


Re: [RFC PATCH bpf-next] ksnoop: kernel argument/return value tracing/display using BTF

2021-01-04 Thread Alexei Starovoitov
On Mon, Jan 04, 2021 at 03:26:31PM +, Alan Maguire wrote:
> 
> ksnoop can be used to show function signatures; for example:
> 
> $ ksnoop info ip_send_skb
> int  ip_send_skb(struct net  * net, struct sk_buff  * skb);
> 
> Then we can trace the function, for example:
> 
> $ ksnoop trace ip_send_skb

Thanks for sharing. It will be a useful tool.

> +
> + data = get_arg(ctx, currtrace->base_arg);
> +
> + dataptr = (void *)data;
> +
> + if (currtrace->offset)
> + dataptr += currtrace->offset;
> +
> + /* look up member value and read into data field, provided
> +  * it fits within a __u64; when it does, it can be used in
> +  * predicate evaluation.
> +  */
> + if (currtrace->flags & KSNOOP_F_MEMBER) {
> + ret = -EINVAL;
> + data = 0;
> + if (currtrace->size <= sizeof(__u64))
> + ret = bpf_probe_read_kernel(&data,
> + currtrace->size,
> + dataptr);
> + else
> + bpf_printk("size was %d cant trace",
> +currtrace->size);
> + if (ret) {
> + currdata->err_type_id =
> + currtrace->type_id;
> + currdata->err = ret;
> + continue;
> + }
> + if (currtrace->flags & KSNOOP_F_PTR)
> + dataptr = (void *)data;
> + }
> +
> + /* simple predicate evaluation: if any predicate fails,
> +  * skip all tracing for this function.
> +  */
> + if (currtrace->flags & KSNOOP_F_PREDICATE_MASK) {
> + bool ok = false;
> +
> + if (currtrace->flags & KSNOOP_F_PREDICATE_EQ &&
> + data == currtrace->predicate_value)
> + ok = true;
> +
> + if (currtrace->flags & KSNOOP_F_PREDICATE_NOTEQ &&
> + data != currtrace->predicate_value)
> + ok = true;
> +
> + if (currtrace->flags & KSNOOP_F_PREDICATE_GT &&
> + data > currtrace->predicate_value)
> + ok = true;
> + if (currtrace->flags & KSNOOP_F_PREDICATE_LT &&
> + data < currtrace->predicate_value)
> + ok = true;
> +
> + if (!ok)
> + goto skiptrace;
> + }
> +
> + currdata->raw_value = data;
> +
> + if (currtrace->flags & (KSNOOP_F_PTR | KSNOOP_F_MEMBER))
> + btf_ptr.ptr = dataptr;
> + else
> + btf_ptr.ptr = &data;
> +
> + btf_ptr.type_id = currtrace->type_id;
> +
> + if (trace->buf_len + MAX_TRACE_DATA >= MAX_TRACE_BUF)
> + break;
> +
> + buf_offset = &trace->buf[trace->buf_len];
> + if (buf_offset > &trace->buf[MAX_TRACE_BUF]) {
> + currdata->err_type_id = currtrace->type_id;
> +     currdata->err = -ENOSPC;
> + continue;
> + }
> + currdata->buf_offset = trace->buf_len;
> +
> + ret = bpf_snprintf_btf(buf_offset,
> +MAX_TRACE_DATA,
> +&btf_ptr, sizeof(btf_ptr),
> +BTF_F_PTR_RAW);

The overhead would be much lower if, instead of printing in the kernel, the
tool's bpf prog would dump the struct data into a ring buffer and let the user
space part of the tool do the pretty printing. There would be no need to pass
btf_id from the user space to the kernel either. The user space would need to
gain pretty printing logic, but maybe we can share the code somehow between
the kernel and libbpf.
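
A minimal sketch of that direction (BPF ring buffer, available since 5.8; the
map/struct names here are made up, and the plumbing that fetches the right
argument and type id from ksnoop's per-function map is elided):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>	/* PT_REGS_PARM2; needs __TARGET_ARCH_* set */

#define MAX_DATA	2048

struct event {
	__u32 type_id;			/* BTF id; userspace resolves and prints */
	__u32 data_len;
	__u8 data[MAX_DATA];
};

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("kprobe/ip_send_skb")
int BPF_KPROBE(trace_ip_send_skb)
{
	struct event *e;

	e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
	if (!e)
		return 0;

	e->type_id = 0;			/* would come from the per-func map */
	e->data_len = MAX_DATA;

	/* ship raw object bytes; no bpf_snprintf_btf() in the hot path */
	if (bpf_probe_read_kernel(e->data, sizeof(e->data),
				  (void *)PT_REGS_PARM2(ctx)) < 0) {
		bpf_ringbuf_discard(e, 0);
		return 0;
	}

	bpf_ringbuf_submit(e, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

Userspace then feeds e->data plus the type id to its own BTF pretty printer.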

Separately, the interpreter in the bpf prog to handle predicates is kinda
anti-bpf :) I think ksnoop can generate bpf code on the fly instead. No need
for llvm. The currtrace->offset/size would be written into the prog placeholder
instructions by ksnoop before loading the prog. With much lower overhead for
filtering.
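
One way to get most of that effect without hand-writing instructions is
libbpf's read-only data: the loader fills in constants before load, and the
verifier sees them as immediates and prunes the dead branches. A sketch (all
names hypothetical; this substitutes rodata specialization for the
placeholder-instruction rewriting suggested above, achieving the same
constant folding at load time):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* BPF side: filled in by the loader before the object is loaded */
const volatile __u32 targ_offset = 0;
const volatile __u64 pred_value = 0;
const volatile bool pred_eq = false;

SEC("kprobe/ip_send_skb")
int BPF_KPROBE(trace_fn)
{
	__u64 data = 0;

	bpf_probe_read_kernel(&data, sizeof(data),
			      (void *)(PT_REGS_PARM2(ctx) + targ_offset));
	if (pred_eq && data != pred_value)
		return 0;	/* whole branch is pruned when pred_eq stays false */
	/* ... record the trace ... */
	return 0;
}

/* Loader side (inside main(), via a generated skeleton; error handling
 * omitted): */
struct ksnoop_bpf *skel = ksnoop_bpf__open();
skel->rodata->targ_offset = off;
skel->rodata->pred_value = val;
skel->rodata->pred_eq = true;
ksnoop_bpf__load(skel);		/* constants are baked in at this point */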


[RFC PATCH bpf-next] ksnoop: kernel argument/return value tracing/display using BTF

2021-01-04 Thread Alan Maguire
BPF Type Format (BTF) provides a description of kernel data structures
and of the types kernel functions utilize as arguments and return values.

A helper was recently added - bpf_snprintf_btf() - that uses that
description to create a string representation of the data provided,
using the BTF id of its type.  For example to create a string
representation of a "struct sk_buff", the pointer to the skb
is provided along with the type id of "struct sk_buff".

Here that functionality is utilized to support tracing kernel
function entry and return using k[ret]probes.  The "struct pt_regs"
context can be used to derive arguments and return values, and
when the user supplies a function name we

- look it up in /proc/kallsyms to find its address/module
- look it up in the BTF kernel data to get types of arguments
  and return value
- store a map representation of the trace information, keyed by
  instruction pointer
- on function entry/return we look up the map to retrieve the BTF
  ids of the arguments/return values and can call bpf_snprintf_btf()
  with these argument/return values along with the type ids to store
  a string representation in the map.
- this is then sent via perf event to userspace where it can be
  displayed.
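
The hot path above reduces to one helper call. A stripped-down sketch of it
(the type id below is a placeholder; ksnoop resolves the real BTF id of
struct sk_buff in userspace and passes it in through the map):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char str[2048];				/* string form of one argument */

SEC("kprobe/ip_send_skb")
int show_second_arg(struct pt_regs *ctx)
{
	struct btf_ptr p = {};

	p.ptr = (void *)PT_REGS_PARM2(ctx);	/* the skb argument */
	p.type_id = 0;			/* placeholder for the real type id */

	bpf_snprintf_btf(str, sizeof(str), &p, sizeof(p), BTF_F_PTR_RAW);
	/* str now holds e.g. "*(struct sk_buff){ ... }"; it is sent to
	 * userspace via a perf event for display */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";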

ksnoop can be used to show function signatures; for example:

$ ksnoop info ip_send_skb
int  ip_send_skb(struct net  * net, struct sk_buff  * skb);

Then we can trace the function, for example:

$ ksnoop trace ip_send_skb
TASKPID CPU# TIMESTAMP FUNCTION

ping   38331 251523.616148 ip_send_skb(
net = *(struct net){
 .passive = (refcount_t){
  .refs = (atomic_t){
   .counter = (int)2,
  },
 },

etc.  Truncated data is suffixed by "..." (2048 bytes of
string value are provided for each argument).  Up to
five arguments are displayed.

The arguments are referred to via name (e.g. skb, net), and
the return value is referred to as "return" (using the keyword
ensures we can never clash with an argument name), i.e.

ping   38331 251523.617250 ip_send_skb(
return = (int)0

   );

ksnoop can select specific arguments/return value rather
than tracing everything; for example:

$ ksnoop "ip_send_skb(skb)"

...will only trace the skb argument.  A single level of
reference is supported also, for example:

$ ksnoop "ip_send_skb(skb->sk)"

..for a pointer member or

$ ksnoop "ip_send_skb(skb->len)"

...for a non-pointer member.

Multiple functions can be specified also, for example:

$ ksnoop ip_send_skb ip_rcv

ksnoop will work for in-kernel and module-specific functions,
but in the latter case only base types or core kernel types
will be displayed; bpf_snprintf_btf() does not currently
support module-specific type display.

If invalid memory (such as a userspace pointers or invalid
NULL pointers) is encountered in function arguments, return
values or references, ksnoop will report it like this:

  irqbalance   10433 282167.478364 getname(
filename = 0x7ffd5a0cca10
/* Cannot show 'filename' as 
'char  *'.
 * Userspace/invalid ptr? */

   );

ksnoop can handle simple predicate evaluations;
"==", "!=", ">", "<", ">=", "<=" are supported and the
the assumption is that for a trace to be recorded, all
predicates have to evaluate to true.  For example:

$ ksnoop "ip_send_skb(skb->len == 84, skb)"
ping  190091  19671.328156 ip_send_skb(
skb->len = (unsigned int)84
,

skb = *(struct sk_buff){
 (union){
  .sk = (struct sock 
*)0x930a01095c00,
  .ip_defrag_offset = 
(int)17390592,
 },
 (union){
  (struct){
   ._skb_refdst = (long 
unsigned int)18446624275917552448,
   .destructor = ( 
*)0xa5bfaf00,
  },
  .tcp_tsorted_anchor = (struct 
list_head){
   .next = (struct list_head 

Re: [PATCH v4 bpf-next 1/5] bpf: add in-kernel split BTF support

2020-11-10 Thread Song Liu



> On Nov 10, 2020, at 10:31 AM, Andrii Nakryiko  
> wrote:
> 
> On Tue, Nov 10, 2020 at 9:50 AM Song Liu  wrote:
>> 
>> 
>> 
>>> On Nov 9, 2020, at 5:19 PM, Andrii Nakryiko  wrote:
>>> 
>>> Adjust in-kernel BTF implementation to support a split BTF mode of 
>>> operation.
>>> Changes are mostly mirroring libbpf split BTF changes, with the exception of
>>> start_id being 0 for in-kernel implementation due to simpler read-only mode.
>>> 
>>> Otherwise, for split BTF logic, most of the logic of jumping to base BTF,
>>> where necessary, is encapsulated in a few helper functions. Type numbering and
>>> string offset in a split BTF are logically continuing where base BTF ends, 
>>> so
>>> most of the high-level logic is kept without changes.
>>> 
>>> Type verification and size resolution is only doing an added resolution of 
>>> new
>>> split BTF types and relies on already cached size and type resolution 
>>> results
>>> in the base BTF.
>>> 
>>> Signed-off-by: Andrii Nakryiko 
>>> ---
>>> kernel/bpf/btf.c | 171 +++++++++--
>>> 1 file changed, 119 insertions(+), 52 deletions(-)
>>> 
>>> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
>>> index 6324de8c59f7..727c1c27053f 100644
>>> --- a/kernel/bpf/btf.c
>>> +++ b/kernel/bpf/btf.c
>>> @@ -203,12 +203,17 @@ struct btf {
>>>  const char *strings;
>>>  void *nohdr_data;
>>>  struct btf_header hdr;
>>> - u32 nr_types;
>>> + u32 nr_types; /* includes VOID for base BTF */
>>>  u32 types_size;
>>>  u32 data_size;
>>>  refcount_t refcnt;
>>>  u32 id;
>>>  struct rcu_head rcu;
>>> +
>>> + /* split BTF support */
>>> + struct btf *base_btf;
>>> + u32 start_id; /* first type ID in this BTF (0 for base BTF) */
>>> + u32 start_str_off; /* first string offset (0 for base BTF) */
>>> };
>>> 
>>> enum verifier_phase {
>>> @@ -449,14 +454,27 @@ static bool btf_type_is_datasec(const struct btf_type 
>>> *t)
>>>  return BTF_INFO_KIND(t->info) == BTF_KIND_DATASEC;
>>> }
>>> 
>>> +static u32 btf_nr_types_total(const struct btf *btf)
>>> +{
>>> + u32 total = 0;
>>> +
>>> + while (btf) {
>>> + total += btf->nr_types;
>>> + btf = btf->base_btf;
>>> + }
>>> +
>>> + return total;
>>> +}
>>> +
>>> s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind)
>>> {
>>>  const struct btf_type *t;
>>>  const char *tname;
>>> - u32 i;
>>> + u32 i, total;
>>> 
>>> - for (i = 1; i <= btf->nr_types; i++) {
>>> - t = btf->types[i];
>>> + total = btf_nr_types_total(btf);
>>> + for (i = 1; i < total; i++) {
>>> + t = btf_type_by_id(btf, i);
>>>  if (BTF_INFO_KIND(t->info) != kind)
>>>  continue;
>>> 
>>> @@ -599,8 +617,14 @@ static const struct btf_kind_operations 
>>> *btf_type_ops(const struct btf_type *t)
>>> 
>>> static bool btf_name_offset_valid(const struct btf *btf, u32 offset)
>>> {
>>> - return BTF_STR_OFFSET_VALID(offset) &&
>>> - offset < btf->hdr.str_len;
>>> + if (!BTF_STR_OFFSET_VALID(offset))
>>> + return false;
>>> +
>>> + while (offset < btf->start_str_off)
>>> + btf = btf->base_btf;
>> 
>> Do we need "if (!btf) return false;" in the while loop? (and some other 
>> loops below)
> 
> No, because for base btf start_str_off and start_type_id are always
> zero, so loop condition is always false.

Ah, I misread the code. Thanks for the explanation. 

Acked-by: Song Liu 

Re: [PATCH v4 bpf-next 1/5] bpf: add in-kernel split BTF support

2020-11-10 Thread Andrii Nakryiko
On Tue, Nov 10, 2020 at 9:50 AM Song Liu  wrote:
>
>
>
> > On Nov 9, 2020, at 5:19 PM, Andrii Nakryiko  wrote:
> >
> > Adjust in-kernel BTF implementation to support a split BTF mode of 
> > operation.
> > Changes are mostly mirroring libbpf split BTF changes, with the exception of
> > start_id being 0 for in-kernel implementation due to simpler read-only mode.
> >
> > Otherwise, for split BTF logic, most of the logic of jumping to base BTF,
> > where necessary, is encapsulated in a few helper functions. Type numbering and
> > string offset in a split BTF are logically continuing where base BTF ends, 
> > so
> > most of the high-level logic is kept without changes.
> >
> > Type verification and size resolution is only doing an added resolution of 
> > new
> > split BTF types and relies on already cached size and type resolution 
> > results
> > in the base BTF.
> >
> > Signed-off-by: Andrii Nakryiko 
> > ---
> > kernel/bpf/btf.c | 171 +++++++++--
> > 1 file changed, 119 insertions(+), 52 deletions(-)
> >
> > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > index 6324de8c59f7..727c1c27053f 100644
> > --- a/kernel/bpf/btf.c
> > +++ b/kernel/bpf/btf.c
> > @@ -203,12 +203,17 @@ struct btf {
> >   const char *strings;
> >   void *nohdr_data;
> >   struct btf_header hdr;
> > - u32 nr_types;
> > + u32 nr_types; /* includes VOID for base BTF */
> >   u32 types_size;
> >   u32 data_size;
> >   refcount_t refcnt;
> >   u32 id;
> >   struct rcu_head rcu;
> > +
> > + /* split BTF support */
> > + struct btf *base_btf;
> > + u32 start_id; /* first type ID in this BTF (0 for base BTF) */
> > + u32 start_str_off; /* first string offset (0 for base BTF) */
> > };
> >
> > enum verifier_phase {
> > @@ -449,14 +454,27 @@ static bool btf_type_is_datasec(const struct btf_type 
> > *t)
> >   return BTF_INFO_KIND(t->info) == BTF_KIND_DATASEC;
> > }
> >
> > +static u32 btf_nr_types_total(const struct btf *btf)
> > +{
> > + u32 total = 0;
> > +
> > + while (btf) {
> > + total += btf->nr_types;
> > + btf = btf->base_btf;
> > + }
> > +
> > + return total;
> > +}
> > +
> > s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind)
> > {
> >   const struct btf_type *t;
> >   const char *tname;
> > - u32 i;
> > + u32 i, total;
> >
> > - for (i = 1; i <= btf->nr_types; i++) {
> > - t = btf->types[i];
> > + total = btf_nr_types_total(btf);
> > + for (i = 1; i < total; i++) {
> > + t = btf_type_by_id(btf, i);
> >   if (BTF_INFO_KIND(t->info) != kind)
> >   continue;
> >
> > @@ -599,8 +617,14 @@ static const struct btf_kind_operations 
> > *btf_type_ops(const struct btf_type *t)
> >
> > static bool btf_name_offset_valid(const struct btf *btf, u32 offset)
> > {
> > - return BTF_STR_OFFSET_VALID(offset) &&
> > - offset < btf->hdr.str_len;
> > + if (!BTF_STR_OFFSET_VALID(offset))
> > + return false;
> > +
> > + while (offset < btf->start_str_off)
> > + btf = btf->base_btf;
>
> Do we need "if (!btf) return false;" in the while loop? (and some other loops 
> below)

No, because for base btf start_str_off and start_type_id are always
zero, so loop condition is always false.
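
To make the invariant concrete (hypothetical sizes):

/* base BTF:  start_str_off = 0,    hdr.str_len = 1000 -> offsets [0, 1000)
 * split BTF: start_str_off = 1000, hdr.str_len = 500  -> offsets [1000, 1500)
 *
 * resolving offset 42 starting from the split BTF:
 *   42 < 1000        -> btf = btf->base_btf
 *   42 < 0 is false  -> loop exits; since the base BTF always has
 *                       start_str_off == 0, the condition can never hold
 *                       there and no NULL check is needed
 */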

>
> [...]
>


Re: [PATCH v4 bpf-next 1/5] bpf: add in-kernel split BTF support

2020-11-10 Thread Song Liu



> On Nov 9, 2020, at 5:19 PM, Andrii Nakryiko  wrote:
> 
> Adjust in-kernel BTF implementation to support a split BTF mode of operation.
> Changes are mostly mirroring libbpf split BTF changes, with the exception of
> start_id being 0 for in-kernel implementation due to simpler read-only mode.
> 
> Otherwise, for split BTF logic, most of the logic of jumping to base BTF,
> where necessary, is encapsulated in a few helper functions. Type numbering and
> string offset in a split BTF are logically continuing where base BTF ends, so
> most of the high-level logic is kept without changes.
> 
> Type verification and size resolution is only doing an added resolution of new
> split BTF types and relies on already cached size and type resolution results
> in the base BTF.
> 
> Signed-off-by: Andrii Nakryiko 
> ---
> kernel/bpf/btf.c | 171 +--
> 1 file changed, 119 insertions(+), 52 deletions(-)
> 
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 6324de8c59f7..727c1c27053f 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -203,12 +203,17 @@ struct btf {
>   const char *strings;
>   void *nohdr_data;
>   struct btf_header hdr;
> - u32 nr_types;
> + u32 nr_types; /* includes VOID for base BTF */
>   u32 types_size;
>   u32 data_size;
>   refcount_t refcnt;
>   u32 id;
>   struct rcu_head rcu;
> +
> + /* split BTF support */
> + struct btf *base_btf;
> + u32 start_id; /* first type ID in this BTF (0 for base BTF) */
> + u32 start_str_off; /* first string offset (0 for base BTF) */
> };
> 
> enum verifier_phase {
> @@ -449,14 +454,27 @@ static bool btf_type_is_datasec(const struct btf_type 
> *t)
>   return BTF_INFO_KIND(t->info) == BTF_KIND_DATASEC;
> }
> 
> +static u32 btf_nr_types_total(const struct btf *btf)
> +{
> + u32 total = 0;
> +
> + while (btf) {
> + total += btf->nr_types;
> + btf = btf->base_btf;
> + }
> +
> + return total;
> +}
> +
> s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind)
> {
>   const struct btf_type *t;
>   const char *tname;
> - u32 i;
> + u32 i, total;
> 
> - for (i = 1; i <= btf->nr_types; i++) {
> - t = btf->types[i];
> + total = btf_nr_types_total(btf);
> + for (i = 1; i < total; i++) {
> + t = btf_type_by_id(btf, i);
>   if (BTF_INFO_KIND(t->info) != kind)
>   continue;
> 
> @@ -599,8 +617,14 @@ static const struct btf_kind_operations 
> *btf_type_ops(const struct btf_type *t)
> 
> static bool btf_name_offset_valid(const struct btf *btf, u32 offset)
> {
> - return BTF_STR_OFFSET_VALID(offset) &&
> - offset < btf->hdr.str_len;
> + if (!BTF_STR_OFFSET_VALID(offset))
> + return false;
> +
> + while (offset < btf->start_str_off)
> + btf = btf->base_btf;

Do we need "if (!btf) return false;" in the while loop? (and some other loops 
below)

[...]



[PATCH v4 bpf-next 1/5] bpf: add in-kernel split BTF support

2020-11-09 Thread Andrii Nakryiko
Adjust in-kernel BTF implementation to support a split BTF mode of operation.
Changes are mostly mirroring libbpf split BTF changes, with the exception of
start_id being 0 for in-kernel implementation due to simpler read-only mode.

Otherwise, for split BTF logic, most of the logic of jumping to base BTF,
where necessary, is encapsulated in a few helper functions. Type numbering and
string offset in a split BTF are logically continuing where base BTF ends, so
most of the high-level logic is kept without changes.

Type verification and size resolution is only doing an added resolution of new
split BTF types and relies on already cached size and type resolution results
in the base BTF.

Signed-off-by: Andrii Nakryiko 
---
 kernel/bpf/btf.c | 171 +--
 1 file changed, 119 insertions(+), 52 deletions(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 6324de8c59f7..727c1c27053f 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -203,12 +203,17 @@ struct btf {
const char *strings;
void *nohdr_data;
struct btf_header hdr;
-   u32 nr_types;
+   u32 nr_types; /* includes VOID for base BTF */
u32 types_size;
u32 data_size;
refcount_t refcnt;
u32 id;
struct rcu_head rcu;
+
+   /* split BTF support */
+   struct btf *base_btf;
+   u32 start_id; /* first type ID in this BTF (0 for base BTF) */
+   u32 start_str_off; /* first string offset (0 for base BTF) */
 };
 
 enum verifier_phase {
@@ -449,14 +454,27 @@ static bool btf_type_is_datasec(const struct btf_type *t)
return BTF_INFO_KIND(t->info) == BTF_KIND_DATASEC;
 }
 
+static u32 btf_nr_types_total(const struct btf *btf)
+{
+   u32 total = 0;
+
+   while (btf) {
+   total += btf->nr_types;
+   btf = btf->base_btf;
+   }
+
+   return total;
+}
+
 s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind)
 {
const struct btf_type *t;
const char *tname;
-   u32 i;
+   u32 i, total;
 
-   for (i = 1; i <= btf->nr_types; i++) {
-   t = btf->types[i];
+   total = btf_nr_types_total(btf);
+   for (i = 1; i < total; i++) {
+   t = btf_type_by_id(btf, i);
if (BTF_INFO_KIND(t->info) != kind)
continue;
 
@@ -599,8 +617,14 @@ static const struct btf_kind_operations 
*btf_type_ops(const struct btf_type *t)
 
 static bool btf_name_offset_valid(const struct btf *btf, u32 offset)
 {
-   return BTF_STR_OFFSET_VALID(offset) &&
-   offset < btf->hdr.str_len;
+   if (!BTF_STR_OFFSET_VALID(offset))
+   return false;
+
+   while (offset < btf->start_str_off)
+   btf = btf->base_btf;
+
+   offset -= btf->start_str_off;
+   return offset < btf->hdr.str_len;
 }
 
 static bool __btf_name_char_ok(char c, bool first, bool dot_ok)
@@ -614,10 +638,22 @@ static bool __btf_name_char_ok(char c, bool first, bool 
dot_ok)
return true;
 }
 
+static const char *btf_str_by_offset(const struct btf *btf, u32 offset)
+{
+   while (offset < btf->start_str_off)
+   btf = btf->base_btf;
+
+   offset -= btf->start_str_off;
+   if (offset < btf->hdr.str_len)
+   return &btf->strings[offset];
+
+   return NULL;
+}
+
 static bool __btf_name_valid(const struct btf *btf, u32 offset, bool dot_ok)
 {
/* offset must be valid */
-   const char *src = &btf->strings[offset];
+   const char *src = btf_str_by_offset(btf, offset);
const char *src_limit;
 
if (!__btf_name_char_ok(*src, true, dot_ok))
@@ -650,27 +686,28 @@ static bool btf_name_valid_section(const struct btf *btf, 
u32 offset)
 
 static const char *__btf_name_by_offset(const struct btf *btf, u32 offset)
 {
+   const char *name;
+
if (!offset)
return "(anon)";
-   else if (offset < btf->hdr.str_len)
-   return &btf->strings[offset];
-   else
-   return "(invalid-name-offset)";
+
+   name = btf_str_by_offset(btf, offset);
+   return name ?: "(invalid-name-offset)";
 }
 
 const char *btf_name_by_offset(const struct btf *btf, u32 offset)
 {
-   if (offset < btf->hdr.str_len)
-   return &btf->strings[offset];
-
-   return NULL;
+   return btf_str_by_offset(btf, offset);
 }
 
 const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id)
 {
-   if (type_id > btf->nr_types)
-   return NULL;
+   while (type_id < btf->start_id)
+   btf = btf->base_btf;
 
+   type_id -= btf->start_id;
+   if (type_id >= btf->nr_types)
+   return NULL;
return btf->types[type_id];
 }
 
@@ -1390,17 +1427,13 @@ static int btf_add_type(struct 

[PATCH v3 bpf-next 1/5] bpf: add in-kernel split BTF support

2020-11-09 Thread Andrii Nakryiko
Adjust in-kernel BTF implementation to support a split BTF mode of operation.
Changes are mostly mirroring libbpf split BTF changes, with the exception of
start_id being 0 for in-kernel implementation due to simpler read-only mode.

Otherwise, for split BTF logic, most of the logic of jumping to base BTF,
where necessary, is encapsulated in a few helper functions. Type numbering and
string offset in a split BTF are logically continuing where base BTF ends, so
most of the high-level logic is kept without changes.

Type verification and size resolution is only doing an added resolution of new
split BTF types and relies on already cached size and type resolution results
in the base BTF.

Signed-off-by: Andrii Nakryiko 
---
 kernel/bpf/btf.c | 171 +--
 1 file changed, 119 insertions(+), 52 deletions(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index ed7d02e8bc93..894ee33f4c84 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -204,12 +204,17 @@ struct btf {
const char *strings;
void *nohdr_data;
struct btf_header hdr;
-   u32 nr_types;
+   u32 nr_types; /* includes VOID for base BTF */
u32 types_size;
u32 data_size;
refcount_t refcnt;
u32 id;
struct rcu_head rcu;
+
+   /* split BTF support */
+   struct btf *base_btf;
+   u32 start_id; /* first type ID in this BTF (0 for base BTF) */
+   u32 start_str_off; /* first string offset (0 for base BTF) */
 };
 
 enum verifier_phase {
@@ -450,14 +455,27 @@ static bool btf_type_is_datasec(const struct btf_type *t)
return BTF_INFO_KIND(t->info) == BTF_KIND_DATASEC;
 }
 
+static u32 btf_nr_types_total(const struct btf *btf)
+{
+   u32 total = 0;
+
+   while (btf) {
+   total += btf->nr_types;
+   btf = btf->base_btf;
+   }
+
+   return total;
+}
+
 s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind)
 {
const struct btf_type *t;
const char *tname;
-   u32 i;
+   u32 i, total;
 
-   for (i = 1; i <= btf->nr_types; i++) {
-   t = btf->types[i];
+   total = btf_nr_types_total(btf);
+   for (i = 1; i < total; i++) {
+   t = btf_type_by_id(btf, i);
if (BTF_INFO_KIND(t->info) != kind)
continue;
 
@@ -600,8 +618,14 @@ static const struct btf_kind_operations 
*btf_type_ops(const struct btf_type *t)
 
 static bool btf_name_offset_valid(const struct btf *btf, u32 offset)
 {
-   return BTF_STR_OFFSET_VALID(offset) &&
-   offset < btf->hdr.str_len;
+   if (!BTF_STR_OFFSET_VALID(offset))
+   return false;
+
+   while (offset < btf->start_str_off)
+   btf = btf->base_btf;
+
+   offset -= btf->start_str_off;
+   return offset < btf->hdr.str_len;
 }
 
 static bool __btf_name_char_ok(char c, bool first, bool dot_ok)
@@ -615,10 +639,22 @@ static bool __btf_name_char_ok(char c, bool first, bool 
dot_ok)
return true;
 }
 
+static const char *btf_str_by_offset(const struct btf *btf, u32 offset)
+{
+   while (offset < btf->start_str_off)
+   btf = btf->base_btf;
+
+   offset -= btf->start_str_off;
+   if (offset < btf->hdr.str_len)
+   return &btf->strings[offset];
+
+   return NULL;
+}
+
 static bool __btf_name_valid(const struct btf *btf, u32 offset, bool dot_ok)
 {
/* offset must be valid */
-   const char *src = &btf->strings[offset];
+   const char *src = btf_str_by_offset(btf, offset);
const char *src_limit;
 
if (!__btf_name_char_ok(*src, true, dot_ok))
@@ -651,27 +687,28 @@ static bool btf_name_valid_section(const struct btf *btf, 
u32 offset)
 
 static const char *__btf_name_by_offset(const struct btf *btf, u32 offset)
 {
+   const char *name;
+
if (!offset)
return "(anon)";
-   else if (offset < btf->hdr.str_len)
-   return &btf->strings[offset];
-   else
-   return "(invalid-name-offset)";
+
+   name = btf_str_by_offset(btf, offset);
+   return name ?: "(invalid-name-offset)";
 }
 
 const char *btf_name_by_offset(const struct btf *btf, u32 offset)
 {
-   if (offset < btf->hdr.str_len)
-   return &btf->strings[offset];
-
-   return NULL;
+   return btf_str_by_offset(btf, offset);
 }
 
 const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id)
 {
-   if (type_id > btf->nr_types)
-   return NULL;
+   while (type_id < btf->start_id)
+   btf = btf->base_btf;
 
+   type_id -= btf->start_id;
+   if (type_id >= btf->nr_types)
+   return NULL;
return btf->types[type_id];
 }
 
@@ -1391,17 +1428,13 @@ static int btf_add_type(struct 

Re: [PATCH v2 bpf-next 1/5] bpf: add in-kernel split BTF support

2020-11-06 Thread Andrii Nakryiko
On Fri, Nov 6, 2020 at 5:28 PM Song Liu  wrote:
>
>
>
> > On Nov 6, 2020, at 3:02 PM, Andrii Nakryiko  wrote:
> >
> > Adjust in-kernel BTF implementation to support a split BTF mode of 
> > operation.
> > Changes are mostly mirroring libbpf split BTF changes, with the exception of
> > start_id being 0 for in-kernel implementation due to simpler read-only mode.
> >
> > Otherwise, for split BTF logic, most of the logic of jumping to base BTF,
> > where necessary, is encapsulated in a few helper functions. Type numbering and
> > string offset in a split BTF are logically continuing where base BTF ends, 
> > so
> > most of the high-level logic is kept without changes.
> >
> > Type verification and size resolution is only doing an added resolution of 
> > new
> > split BTF types and relies on already cached size and type resolution 
> > results
> > in the base BTF.
> >
> > Signed-off-by: Andrii Nakryiko 
>
> [...]
>
> >
> > @@ -600,8 +618,15 @@ static const struct btf_kind_operations 
> > *btf_type_ops(const struct btf_type *t)
> >
> > static bool btf_name_offset_valid(const struct btf *btf, u32 offset)
> > {
> > - return BTF_STR_OFFSET_VALID(offset) &&
> > - offset < btf->hdr.str_len;
> > + if (!BTF_STR_OFFSET_VALID(offset))
> > + return false;
> > +again:
> > + if (offset < btf->start_str_off) {
> > + btf = btf->base_btf;
> > + goto again;
>
> Can we do a while loop instead of "goto again;"?

yep, not sure why I went with goto...

while (offset < btf->start_str_off)
btf = btf->base_btf;

Shorter.

>
> > + }
> > + offset -= btf->start_str_off;
> > + return offset < btf->hdr.str_len;
> > }
> >
> > static bool __btf_name_char_ok(char c, bool first, bool dot_ok)
> > @@ -615,10 +640,25 @@ static bool __btf_name_char_ok(char c, bool first, 
> > bool dot_ok)
> >   return true;
> > }
> >
> > +static const char *btf_str_by_offset(const struct btf *btf, u32 offset)
> > +{
> > +again:
> > + if (offset < btf->start_str_off) {
> > + btf = btf->base_btf;
> > + goto again;
> > + }
>
> Maybe add a btf_find_base_btf(btf, offset) helper for this logic?

No strong feelings about this, but given it's a two-line loop might
not be worth it. I'd also need a pretty verbose
btf_find_base_btf_for_str_offset() and
btf_find_base_btf_for_type_id(). I feel like the loop might be less
distracting actually.

>
> > +
> > + offset -= btf->start_str_off;
> > + if (offset < btf->hdr.str_len)
> > + return &btf->strings[offset];
> > +
> > + return NULL;
> > +}
> > +
>
> [...]
>
> > }
> >
> > const char *btf_name_by_offset(const struct btf *btf, u32 offset)
> > {
> > - if (offset < btf->hdr.str_len)
> > - return &btf->strings[offset];
> > -
> > - return NULL;
> > + return btf_str_by_offset(btf, offset);
> > }
>
> IIUC, btf_str_by_offset() and btf_name_by_offset() are identical. Can we
> just keep btf_name_by_offset()?

btf_str_by_offset() is static, so it should be inlinable, while
btf_name_by_offset() is a global function; I was worried about
regressing performance for __btf_name_valid() and
__btf_name_by_offset(). Premature optimization you think?

>
> >
> > const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id)
> > {
> > - if (type_id > btf->nr_types)
> > - return NULL;
> > +again:
> > + if (type_id < btf->start_id) {
> > + btf = btf->base_btf;
> > + goto again;
> > + }
>
> ditto, goto again..
>
> [...]
>
>


Re: [PATCH v2 bpf-next 1/5] bpf: add in-kernel split BTF support

2020-11-06 Thread Song Liu



> On Nov 6, 2020, at 3:02 PM, Andrii Nakryiko  wrote:
> 
> Adjust in-kernel BTF implementation to support a split BTF mode of operation.
> Changes are mostly mirroring libbpf split BTF changes, with the exception of
> start_id being 0 for in-kernel implementation due to simpler read-only mode.
> 
> Otherwise, for split BTF logic, most of the logic of jumping to base BTF,
> where necessary, is encapsulated in a few helper functions. Type numbering and
> string offset in a split BTF are logically continuing where base BTF ends, so
> most of the high-level logic is kept without changes.
> 
> Type verification and size resolution is only doing an added resolution of new
> split BTF types and relies on already cached size and type resolution results
> in the base BTF.
> 
> Signed-off-by: Andrii Nakryiko 

[...]

> 
> @@ -600,8 +618,15 @@ static const struct btf_kind_operations 
> *btf_type_ops(const struct btf_type *t)
> 
> static bool btf_name_offset_valid(const struct btf *btf, u32 offset)
> {
> - return BTF_STR_OFFSET_VALID(offset) &&
> - offset < btf->hdr.str_len;
> + if (!BTF_STR_OFFSET_VALID(offset))
> + return false;
> +again:
> + if (offset < btf->start_str_off) {
> + btf = btf->base_btf;
> + goto again;

Can we do a while loop instead of "goto again;"?

> + }
> + offset -= btf->start_str_off;
> + return offset < btf->hdr.str_len;
> }
> 
> static bool __btf_name_char_ok(char c, bool first, bool dot_ok)
> @@ -615,10 +640,25 @@ static bool __btf_name_char_ok(char c, bool first, bool 
> dot_ok)
>   return true;
> }
> 
> +static const char *btf_str_by_offset(const struct btf *btf, u32 offset)
> +{
> +again:
> + if (offset < btf->start_str_off) {
> + btf = btf->base_btf;
> + goto again;
> + }

Maybe add a btf_find_base_btf(btf, offset) helper for this logic?

> +
> + offset -= btf->start_str_off;
> + if (offset < btf->hdr.str_len)
> + return &btf->strings[offset];
> +
> + return NULL;
> +}
> +

[...]

> }
> 
> const char *btf_name_by_offset(const struct btf *btf, u32 offset)
> {
> - if (offset < btf->hdr.str_len)
> - return &btf->strings[offset];
> -
> - return NULL;
> + return btf_str_by_offset(btf, offset);
> }

IIUC, btf_str_by_offset() and btf_name_by_offset() are identical. Can we
just keep btf_name_by_offset()?

> 
> const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id)
> {
> - if (type_id > btf->nr_types)
> - return NULL;
> +again:
> + if (type_id < btf->start_id) {
> + btf = btf->base_btf;
> + goto again;
> + }

ditto, goto again..

[...]




[PATCH v2 bpf-next 1/5] bpf: add in-kernel split BTF support

2020-11-06 Thread Andrii Nakryiko
Adjust in-kernel BTF implementation to support a split BTF mode of operation.
Changes are mostly mirroring libbpf split BTF changes, with the exception of
start_id being 0 for in-kernel implementation due to simpler read-only mode.

Otherwise, for split BTF logic, most of the logic of jumping to base BTF,
where necessary, is encapsulated in few helper functions. Type numbering and
string offset in a split BTF are logically continuing where base BTF ends, so
most of the high-level logic is kept without changes.

Type verification and size resolution is only doing an added resolution of new
split BTF types and relies on already cached size and type resolution results
in the base BTF.

Signed-off-by: Andrii Nakryiko 
---
 kernel/bpf/btf.c | 182 +--
 1 file changed, 130 insertions(+), 52 deletions(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index ed7d02e8bc93..f61944a3873b 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -204,12 +204,17 @@ struct btf {
const char *strings;
void *nohdr_data;
struct btf_header hdr;
-   u32 nr_types;
+   u32 nr_types; /* includes VOID for base BTF */
u32 types_size;
u32 data_size;
refcount_t refcnt;
u32 id;
struct rcu_head rcu;
+
+   /* split BTF support */
+   struct btf *base_btf;
+   u32 start_id; /* first type ID in this BTF (0 for base BTF) */
+   u32 start_str_off; /* first string offset (0 for base BTF) */
 };
 
 enum verifier_phase {
@@ -450,14 +455,27 @@ static bool btf_type_is_datasec(const struct btf_type *t)
return BTF_INFO_KIND(t->info) == BTF_KIND_DATASEC;
 }
 
+static u32 btf_nr_types_total(const struct btf *btf)
+{
+   u32 total = 0;
+
+   while (btf) {
+   total += btf->nr_types;
+   btf = btf->base_btf;
+   }
+
+   return total;
+}
+
 s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind)
 {
const struct btf_type *t;
const char *tname;
-   u32 i;
+   u32 i, total;
 
-   for (i = 1; i <= btf->nr_types; i++) {
-   t = btf->types[i];
+   total = btf_nr_types_total(btf);
+   for (i = 1; i < total; i++) {
+   t = btf_type_by_id(btf, i);
if (BTF_INFO_KIND(t->info) != kind)
continue;
 
@@ -600,8 +618,15 @@ static const struct btf_kind_operations 
*btf_type_ops(const struct btf_type *t)
 
 static bool btf_name_offset_valid(const struct btf *btf, u32 offset)
 {
-   return BTF_STR_OFFSET_VALID(offset) &&
-   offset < btf->hdr.str_len;
+   if (!BTF_STR_OFFSET_VALID(offset))
+   return false;
+again:
+   if (offset < btf->start_str_off) {
+   btf = btf->base_btf;
+   goto again;
+   }
+   offset -= btf->start_str_off;
+   return offset < btf->hdr.str_len;
 }
 
 static bool __btf_name_char_ok(char c, bool first, bool dot_ok)
@@ -615,10 +640,25 @@ static bool __btf_name_char_ok(char c, bool first, bool 
dot_ok)
return true;
 }
 
+static const char *btf_str_by_offset(const struct btf *btf, u32 offset)
+{
+again:
+   if (offset < btf->start_str_off) {
+   btf = btf->base_btf;
+   goto again;
+   }
+
+   offset -= btf->start_str_off;
+   if (offset < btf->hdr.str_len)
+   return &btf->strings[offset];
+
+   return NULL;
+}
+
 static bool __btf_name_valid(const struct btf *btf, u32 offset, bool dot_ok)
 {
/* offset must be valid */
-   const char *src = &btf->strings[offset];
+   const char *src = btf_str_by_offset(btf, offset);
const char *src_limit;
 
if (!__btf_name_char_ok(*src, true, dot_ok))
@@ -651,27 +691,31 @@ static bool btf_name_valid_section(const struct btf *btf, 
u32 offset)
 
 static const char *__btf_name_by_offset(const struct btf *btf, u32 offset)
 {
+   const char *name;
+
if (!offset)
return "(anon)";
-   else if (offset < btf->hdr.str_len)
-   return &btf->strings[offset];
-   else
-   return "(invalid-name-offset)";
+
+   name = btf_str_by_offset(btf, offset);
+   return name ?: "(invalid-name-offset)";
 }
 
 const char *btf_name_by_offset(const struct btf *btf, u32 offset)
 {
-   if (offset < btf->hdr.str_len)
-   return &btf->strings[offset];
-
-   return NULL;
+   return btf_str_by_offset(btf, offset);
 }
 
 const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id)
 {
-   if (type_id > btf->nr_types)
-   return NULL;
+again:
+   if (type_id < btf->start_id) {
+   btf = btf->base_btf;
+   goto again;
+   }
 
+   type_id -= btf->start_id;
+   if (type_id >= btf->nr_types)
+  

[PATCH bpf-next 1/5] bpf: add in-kernel split BTF support

2020-11-05 Thread Andrii Nakryiko
Adjust in-kernel BTF implementation to support a split BTF mode of operation.
Changes are mostly mirroring libbpf split BTF changes, with the exception of
start_id being 0 for in-kernel implementation due to simpler read-only mode.

Otherwise, for split BTF logic, most of the logic of jumping to base BTF,
where necessary, is encapsulated in a few helper functions. Type numbering and
string offset in a split BTF are logically continuing where base BTF ends, so
most of the high-level logic is kept without changes.

Type verification and size resolution is only doing an added resolution of new
split BTF types and relies on already cached size and type resolution results
in the base BTF.

Signed-off-by: Andrii Nakryiko 
---
 kernel/bpf/btf.c | 182 +--
 1 file changed, 130 insertions(+), 52 deletions(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index ed7d02e8bc93..f61944a3873b 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -204,12 +204,17 @@ struct btf {
const char *strings;
void *nohdr_data;
struct btf_header hdr;
-   u32 nr_types;
+   u32 nr_types; /* includes VOID for base BTF */
u32 types_size;
u32 data_size;
refcount_t refcnt;
u32 id;
struct rcu_head rcu;
+
+   /* split BTF support */
+   struct btf *base_btf;
+   u32 start_id; /* first type ID in this BTF (0 for base BTF) */
+   u32 start_str_off; /* first string offset (0 for base BTF) */
 };
 
 enum verifier_phase {
@@ -450,14 +455,27 @@ static bool btf_type_is_datasec(const struct btf_type *t)
return BTF_INFO_KIND(t->info) == BTF_KIND_DATASEC;
 }
 
+static u32 btf_nr_types_total(const struct btf *btf)
+{
+   u32 total = 0;
+
+   while (btf) {
+   total += btf->nr_types;
+   btf = btf->base_btf;
+   }
+
+   return total;
+}
+
 s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind)
 {
const struct btf_type *t;
const char *tname;
-   u32 i;
+   u32 i, total;
 
-   for (i = 1; i <= btf->nr_types; i++) {
-   t = btf->types[i];
+   total = btf_nr_types_total(btf);
+   for (i = 1; i < total; i++) {
+   t = btf_type_by_id(btf, i);
if (BTF_INFO_KIND(t->info) != kind)
continue;
 
@@ -600,8 +618,15 @@ static const struct btf_kind_operations 
*btf_type_ops(const struct btf_type *t)
 
 static bool btf_name_offset_valid(const struct btf *btf, u32 offset)
 {
-   return BTF_STR_OFFSET_VALID(offset) &&
-   offset < btf->hdr.str_len;
+   if (!BTF_STR_OFFSET_VALID(offset))
+   return false;
+again:
+   if (offset < btf->start_str_off) {
+   btf = btf->base_btf;
+   goto again;
+   }
+   offset -= btf->start_str_off;
+   return offset < btf->hdr.str_len;
 }
 
 static bool __btf_name_char_ok(char c, bool first, bool dot_ok)
@@ -615,10 +640,25 @@ static bool __btf_name_char_ok(char c, bool first, bool 
dot_ok)
return true;
 }
 
+static const char *btf_str_by_offset(const struct btf *btf, u32 offset)
+{
+again:
+   if (offset < btf->start_str_off) {
+   btf = btf->base_btf;
+   goto again;
+   }
+
+   offset -= btf->start_str_off;
+   if (offset < btf->hdr.str_len)
+   return &btf->strings[offset];
+
+   return NULL;
+}
+
 static bool __btf_name_valid(const struct btf *btf, u32 offset, bool dot_ok)
 {
/* offset must be valid */
-   const char *src = &btf->strings[offset];
+   const char *src = btf_str_by_offset(btf, offset);
const char *src_limit;
 
if (!__btf_name_char_ok(*src, true, dot_ok))
@@ -651,27 +691,31 @@ static bool btf_name_valid_section(const struct btf *btf, 
u32 offset)
 
 static const char *__btf_name_by_offset(const struct btf *btf, u32 offset)
 {
+   const char *name;
+
if (!offset)
return "(anon)";
-   else if (offset < btf->hdr.str_len)
-   return &btf->strings[offset];
-   else
-   return "(invalid-name-offset)";
+
+   name = btf_str_by_offset(btf, offset);
+   return name ?: "(invalid-name-offset)";
 }
 
 const char *btf_name_by_offset(const struct btf *btf, u32 offset)
 {
-   if (offset < btf->hdr.str_len)
-   return &btf->strings[offset];
-
-   return NULL;
+   return btf_str_by_offset(btf, offset);
 }
 
 const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id)
 {
-   if (type_id > btf->nr_types)
-   return NULL;
+again:
+   if (type_id < btf->start_id) {
+   btf = btf->base_btf;
+   goto again;
+   }
 
+   type_id -= btf->start_id;
+   if (type_id >= btf->nr_types)
+  

[PATCH bpf-next 0/2] selftests/bpf: BTF-based kernel data display fixes

2020-09-29 Thread Alan Maguire
Resolve issues in bpf selftests introduced with BTF-based kernel data
display selftests; these are

- a warning introduced in snprintf_btf.c; and
- compilation failures with an old kernel's vmlinux.h

Alan Maguire (2):
  selftests/bpf: fix unused-result warning in snprintf_btf.c
  selftests/bpf: ensure snprintf_btf/bpf_iter tests compatibility with
old vmlinux.h

 .../selftests/bpf/prog_tests/snprintf_btf.c|  2 +-
 tools/testing/selftests/bpf/progs/bpf_iter.h   | 23 ++
 tools/testing/selftests/bpf/progs/btf_ptr.h| 27 ++
 .../selftests/bpf/progs/netif_receive_skb.c|  2 +-
 4 files changed, 52 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/btf_ptr.h

-- 
1.8.3.1



[PATCH 5.7 359/477] bpf/sockmap: Fix kernel panic at __tcp_bpf_recvmsg

2020-06-23 Thread Greg Kroah-Hartman
From: dihu 

[ Upstream commit 487082fb7bd2a32b66927d2b22e3a81b072b44f0 ]

When a user application calls read() with the MSG_PEEK flag to read data
from a bpf sockmap socket, a kernel panic happens at
__tcp_bpf_recvmsg+0x12c/0x350. With MSG_PEEK set, the sk_msg is not
removed from the ingress_msg queue after being read. Because it is not
checked whether the sk_msg is the last msg of the ingress_msg queue, the
next sk_msg may be the head of the ingress_msg queue, whose memory
address of the sg page is invalid. So it is necessary to add a check to
prevent this problem.
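
A hedged sketch of the userspace pattern that trips this (fd is assumed to
be a TCP socket whose ingress traffic is redirected through a BPF sockmap;
not a guaranteed reproducer):

#include <sys/socket.h>

/* MSG_PEEK leaves every sk_msg on ingress_msg; when one peek spans the
 * whole queue, the pre-fix loop stepped from the tail onto the list
 * head and dereferenced an invalid sg page. */
static ssize_t peek_all(int fd, char *buf, size_t len)
{
	return recv(fd, buf, len, MSG_PEEK);
}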

[20759.125457] BUG: kernel NULL pointer dereference, address: 0000000000000008
[20759.132118] CPU: 53 PID: 51378 Comm: envoy Tainted: G E 5.4.32 #1
[20759.140890] Hardware name: Inspur SA5212M4/YZMB-00370-109, BIOS 4.1.12 06/18/2017
[20759.149734] RIP: 0010:copy_page_to_iter+0xad/0x300
[20759.270877] __tcp_bpf_recvmsg+0x12c/0x350
[20759.276099] tcp_bpf_recvmsg+0x113/0x370
[20759.281137] inet_recvmsg+0x55/0xc0
[20759.285734] __sys_recvfrom+0xc8/0x130
[20759.290566] ? __audit_syscall_entry+0x103/0x130
[20759.296227] ? syscall_trace_enter+0x1d2/0x2d0
[20759.301700] ? __audit_syscall_exit+0x1e4/0x290
[20759.307235] __x64_sys_recvfrom+0x24/0x30
[20759.312226] do_syscall_64+0x55/0x1b0
[20759.316852] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: dihu 
Signed-off-by: Alexei Starovoitov 
Acked-by: John Fastabend 
Acked-by: Jakub Sitnicki 
Link: 
https://lore.kernel.org/bpf/20200605084625.9783-1-anny...@linux.alibaba.com
Signed-off-by: Sasha Levin 
---
 net/ipv4/tcp_bpf.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 9c5540887fbe5..7aa68f4aae6c3 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -64,6 +64,9 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
} while (i != msg_rx->sg.end);
 
if (unlikely(peek)) {
+   if (msg_rx == list_last_entry(&psock->ingress_msg,
+ struct sk_msg, list))
+   break;
msg_rx = list_next_entry(msg_rx, list);
continue;
}
-- 
2.25.1





[PATCH 5.4 247/314] bpf/sockmap: Fix kernel panic at __tcp_bpf_recvmsg

2020-06-23 Thread Greg Kroah-Hartman
From: dihu 

[ Upstream commit 487082fb7bd2a32b66927d2b22e3a81b072b44f0 ]

When a user application calls read() with the MSG_PEEK flag to read data
from a bpf sockmap socket, a kernel panic happens at
__tcp_bpf_recvmsg+0x12c/0x350. With MSG_PEEK set, the sk_msg is not
removed from the ingress_msg queue after being read. Because it is not
checked whether the sk_msg is the last msg of the ingress_msg queue, the
next sk_msg may be the head of the ingress_msg queue, whose memory
address of the sg page is invalid. So it is necessary to add a check to
prevent this problem.

[20759.125457] BUG: kernel NULL pointer dereference, address: 0000000000000008
[20759.132118] CPU: 53 PID: 51378 Comm: envoy Tainted: G E 5.4.32 #1
[20759.140890] Hardware name: Inspur SA5212M4/YZMB-00370-109, BIOS 4.1.12 06/18/2017
[20759.149734] RIP: 0010:copy_page_to_iter+0xad/0x300
[20759.270877] __tcp_bpf_recvmsg+0x12c/0x350
[20759.276099] tcp_bpf_recvmsg+0x113/0x370
[20759.281137] inet_recvmsg+0x55/0xc0
[20759.285734] __sys_recvfrom+0xc8/0x130
[20759.290566] ? __audit_syscall_entry+0x103/0x130
[20759.296227] ? syscall_trace_enter+0x1d2/0x2d0
[20759.301700] ? __audit_syscall_exit+0x1e4/0x290
[20759.307235] __x64_sys_recvfrom+0x24/0x30
[20759.312226] do_syscall_64+0x55/0x1b0
[20759.316852] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: dihu 
Signed-off-by: Alexei Starovoitov 
Acked-by: John Fastabend 
Acked-by: Jakub Sitnicki 
Link: 
https://lore.kernel.org/bpf/20200605084625.9783-1-anny...@linux.alibaba.com
Signed-off-by: Sasha Levin 
---
 net/ipv4/tcp_bpf.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 69b0254083904..ad9f382027311 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -96,6 +96,9 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
} while (i != msg_rx->sg.end);
 
if (unlikely(peek)) {
+   if (msg_rx == list_last_entry(&psock->ingress_msg,
+ struct sk_msg, list))
+   break;
msg_rx = list_next_entry(msg_rx, list);
continue;
}
-- 
2.25.1





[PATCH AUTOSEL 5.7 367/388] bpf/sockmap: Fix kernel panic at __tcp_bpf_recvmsg

2020-06-17 Thread Sasha Levin
From: dihu 

[ Upstream commit 487082fb7bd2a32b66927d2b22e3a81b072b44f0 ]

When a user application calls read() with the MSG_PEEK flag to read data
from a bpf sockmap socket, a kernel panic happens at
__tcp_bpf_recvmsg+0x12c/0x350. With MSG_PEEK set, the sk_msg is not
removed from the ingress_msg queue after being read. Because it is not
checked whether the sk_msg is the last msg of the ingress_msg queue, the
next sk_msg may be the head of the ingress_msg queue, whose memory
address of the sg page is invalid. So it is necessary to add a check to
prevent this problem.

[20759.125457] BUG: kernel NULL pointer dereference, address: 0000000000000008
[20759.132118] CPU: 53 PID: 51378 Comm: envoy Tainted: G E 5.4.32 #1
[20759.140890] Hardware name: Inspur SA5212M4/YZMB-00370-109, BIOS 4.1.12 06/18/2017
[20759.149734] RIP: 0010:copy_page_to_iter+0xad/0x300
[20759.270877] __tcp_bpf_recvmsg+0x12c/0x350
[20759.276099] tcp_bpf_recvmsg+0x113/0x370
[20759.281137] inet_recvmsg+0x55/0xc0
[20759.285734] __sys_recvfrom+0xc8/0x130
[20759.290566] ? __audit_syscall_entry+0x103/0x130
[20759.296227] ? syscall_trace_enter+0x1d2/0x2d0
[20759.301700] ? __audit_syscall_exit+0x1e4/0x290
[20759.307235] __x64_sys_recvfrom+0x24/0x30
[20759.312226] do_syscall_64+0x55/0x1b0
[20759.316852] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: dihu 
Signed-off-by: Alexei Starovoitov 
Acked-by: John Fastabend 
Acked-by: Jakub Sitnicki 
Link: 
https://lore.kernel.org/bpf/20200605084625.9783-1-anny...@linux.alibaba.com
Signed-off-by: Sasha Levin 
---
 net/ipv4/tcp_bpf.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 9c5540887fbe..7aa68f4aae6c 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -64,6 +64,9 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
} while (i != msg_rx->sg.end);
 
if (unlikely(peek)) {
+   if (msg_rx == list_last_entry(&psock->ingress_msg,
+ struct sk_msg, list))
+   break;
msg_rx = list_next_entry(msg_rx, list);
continue;
}
-- 
2.25.1



[PATCH AUTOSEL 5.4 254/266] bpf/sockmap: Fix kernel panic at __tcp_bpf_recvmsg

2020-06-17 Thread Sasha Levin
From: dihu 

[ Upstream commit 487082fb7bd2a32b66927d2b22e3a81b072b44f0 ]

When a user application calls read() with the MSG_PEEK flag to read data
from a bpf sockmap socket, a kernel panic happens at
__tcp_bpf_recvmsg+0x12c/0x350. With MSG_PEEK set, the sk_msg is not
removed from the ingress_msg queue after being read. Because it is not
checked whether the sk_msg is the last msg of the ingress_msg queue, the
next sk_msg may be the head of the ingress_msg queue, whose memory
address of the sg page is invalid. So it is necessary to add a check to
prevent this problem.

[20759.125457] BUG: kernel NULL pointer dereference, address: 0000000000000008
[20759.132118] CPU: 53 PID: 51378 Comm: envoy Tainted: G E 5.4.32 #1
[20759.140890] Hardware name: Inspur SA5212M4/YZMB-00370-109, BIOS 4.1.12 06/18/2017
[20759.149734] RIP: 0010:copy_page_to_iter+0xad/0x300
[20759.270877] __tcp_bpf_recvmsg+0x12c/0x350
[20759.276099] tcp_bpf_recvmsg+0x113/0x370
[20759.281137] inet_recvmsg+0x55/0xc0
[20759.285734] __sys_recvfrom+0xc8/0x130
[20759.290566] ? __audit_syscall_entry+0x103/0x130
[20759.296227] ? syscall_trace_enter+0x1d2/0x2d0
[20759.301700] ? __audit_syscall_exit+0x1e4/0x290
[20759.307235] __x64_sys_recvfrom+0x24/0x30
[20759.312226] do_syscall_64+0x55/0x1b0
[20759.316852] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: dihu 
Signed-off-by: Alexei Starovoitov 
Acked-by: John Fastabend 
Acked-by: Jakub Sitnicki 
Link: https://lore.kernel.org/bpf/20200605084625.9783-1-anny...@linux.alibaba.com
Signed-off-by: Sasha Levin 
---
 net/ipv4/tcp_bpf.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 69b025408390..ad9f38202731 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -96,6 +96,9 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
} while (i != msg_rx->sg.end);
 
if (unlikely(peek)) {
+		if (msg_rx == list_last_entry(&psock->ingress_msg,
+					      struct sk_msg, list))
+			break;
msg_rx = list_next_entry(msg_rx, list);
continue;
}
-- 
2.25.1



Re: [PATCH] bpf/sockmap: fix kernel panic at __tcp_bpf_recvmsg

2020-06-02 Thread Jakub Sitnicki
On Fri, May 29, 2020 at 11:05 AM CEST, dihu wrote:
> On 2020/5/27 5:10, John Fastabend wrote:
>> dihu wrote:
>>>  From 865a45747de6b68fd02a0ff128a69a5c8feb73c3 Mon Sep 17 00:00:00 2001
>>> From: dihu 
>>> Date: Mon, 25 May 2020 17:23:16 +0800
>>> Subject: [PATCH] bpf/sockmap: fix kernel panic at __tcp_bpf_recvmsg
>>>
>>> When a user application calls read() with the MSG_PEEK flag on a bpf
>>> sockmap socket, a kernel panic happens at __tcp_bpf_recvmsg+0x12c/0x350.
>>> With MSG_PEEK set, the sk_msg is not removed from the ingress_msg queue
>>> after it is read. Because the code does not check whether the sk_msg is
>>> the last entry of the ingress_msg queue, the next "entry" may be the
>>> head of the ingress_msg queue itself, whose sg page memory address is
>>> invalid. Add a check to prevent this.
>>>
>>> [20759.125457] BUG: kernel NULL pointer dereference, address: 0000000000000008
>>> [20759.132118] CPU: 53 PID: 51378 Comm: envoy Tainted: G E 5.4.32 #1
>>> [20759.140890] Hardware name: Inspur SA5212M4/YZMB-00370-109, BIOS 4.1.12 06/18/2017
>>> [20759.149734] RIP: 0010:copy_page_to_iter+0xad/0x300
>>> [20759.270877] __tcp_bpf_recvmsg+0x12c/0x350
>>> [20759.276099] tcp_bpf_recvmsg+0x113/0x370
>>> [20759.281137] inet_recvmsg+0x55/0xc0
>>> [20759.285734] __sys_recvfrom+0xc8/0x130
>>> [20759.290566] ? __audit_syscall_entry+0x103/0x130
>>> [20759.296227] ? syscall_trace_enter+0x1d2/0x2d0
>>> [20759.301700] ? __audit_syscall_exit+0x1e4/0x290
>>> [20759.307235] __x64_sys_recvfrom+0x24/0x30
>>> [20759.312226] do_syscall_64+0x55/0x1b0
>>> [20759.316852] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> Signed-off-by: dihu 
>>> ---
>>>   net/ipv4/tcp_bpf.c | 3 +++
>>>   1 file changed, 3 insertions(+)
>>>
>>> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
>>> index 5a05327..c0d4624 100644
>>> --- a/net/ipv4/tcp_bpf.c
>>> +++ b/net/ipv4/tcp_bpf.c
>>> @@ -64,6 +64,9 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock 
>>> *psock,
>>> } while (i != msg_rx->sg.end);
>>>
>>> if (unlikely(peek)) {
>>> +   if (msg_rx == list_last_entry(&psock->ingress_msg,
>>> +   struct sk_msg, list))
>>> +break;
>>
>> Thanks. The change looks good, but the spacing is a bit off. Can we
>> turn those spaces into tabs? Otherwise, adding a Fixes tag and
>> my ack would be great.
>>
>> Fixes: 02c558b2d5d67 ("bpf: sockmap, support for msg_peek in sk_msg with 
>> redirect ingress")
>> Acked-by: John Fastabend 
>
>
> From 127a334fa5e5d029353ceb1a0414886c527f4be5 Mon Sep 17 00:00:00 2001
> From: dihu 
> Date: Fri, 29 May 2020 16:38:50 +0800
> Subject: [PATCH] bpf/sockmap: fix kernel panic at __tcp_bpf_recvmsg
>
> When a user application calls read() with the MSG_PEEK flag on a bpf
> sockmap socket, a kernel panic happens at __tcp_bpf_recvmsg+0x12c/0x350.
> With MSG_PEEK set, the sk_msg is not removed from the ingress_msg queue
> after it is read. Because the code does not check whether the sk_msg is
> the last entry of the ingress_msg queue, the next "entry" may be the
> head of the ingress_msg queue itself, whose sg page memory address is
> invalid. Add a check to prevent this.
>
> [20759.125457] BUG: kernel NULL pointer dereference, address: 0000000000000008
> [20759.132118] CPU: 53 PID: 51378 Comm: envoy Tainted: G E 5.4.32 #1
> [20759.140890] Hardware name: Inspur SA5212M4/YZMB-00370-109, BIOS 4.1.12 06/18/2017
> [20759.149734] RIP: 0010:copy_page_to_iter+0xad/0x300
> [20759.270877] __tcp_bpf_recvmsg+0x12c/0x350
> [20759.276099] tcp_bpf_recvmsg+0x113/0x370
> [20759.281137] inet_recvmsg+0x55/0xc0
> [20759.285734] __sys_recvfrom+0xc8/0x130
> [20759.290566] ? __audit_syscall_entry+0x103/0x130
> [20759.296227] ? syscall_trace_enter+0x1d2/0x2d0
> [20759.301700] ? __audit_syscall_exit+0x1e4/0x290
> [20759.307235] __x64_sys_recvfrom+0x24/0x30
> [20759.312226] do_syscall_64+0x55/0x1b0
> [20759.316852] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Signed-off-by: dihu 
> ---
> net/ipv4/tcp_bpf.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index 5a05327..b82e4c3 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c
> @@ -64,6 +64,9 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock 
> *psock,
>   } while (i != msg_rx->sg.end);
>
>   if (unlikely(peek)) {
> +   if (msg_rx == list_last_entry(&psock->ingress_msg,
> +   struct sk_msg, list))
> +break;
>msg_rx = list_next_entry(msg_rx, list);
>continue;
>   }

Looks like the patch is garbled, I suspect due to copy-paste into an
e-mail client. A context line got wrapped, and there are non-breaking
spaces instead of tabs in the body.

The crash fix is important, so could you resend it with `git send-email`?

  git send-email --to b...@vger.kernel.org --cc net...@vger.kernel.org file.patch

You might find the documentation below helpful:

  https://www.kernel.org/doc/html/latest/process/email-clients.html


Re: [PATCH] bpf/sockmap: fix kernel panic at __tcp_bpf_recvmsg

2020-05-29 Thread dihu



On 2020/5/27 5:10, John Fastabend wrote:

dihu wrote:

 From 865a45747de6b68fd02a0ff128a69a5c8feb73c3 Mon Sep 17 00:00:00 2001
From: dihu 
Date: Mon, 25 May 2020 17:23:16 +0800
Subject: [PATCH] bpf/sockmap: fix kernel panic at __tcp_bpf_recvmsg

When a user application calls read() with the MSG_PEEK flag on a bpf
sockmap socket, a kernel panic happens at __tcp_bpf_recvmsg+0x12c/0x350.
With MSG_PEEK set, the sk_msg is not removed from the ingress_msg queue
after it is read. Because the code does not check whether the sk_msg is
the last entry of the ingress_msg queue, the next "entry" may be the
head of the ingress_msg queue itself, whose sg page memory address is
invalid. Add a check to prevent this.

[20759.125457] BUG: kernel NULL pointer dereference, address: 0000000000000008
[20759.132118] CPU: 53 PID: 51378 Comm: envoy Tainted: G E 5.4.32 #1
[20759.140890] Hardware name: Inspur SA5212M4/YZMB-00370-109, BIOS 4.1.12 06/18/2017
[20759.149734] RIP: 0010:copy_page_to_iter+0xad/0x300
[20759.270877] __tcp_bpf_recvmsg+0x12c/0x350
[20759.276099] tcp_bpf_recvmsg+0x113/0x370
[20759.281137] inet_recvmsg+0x55/0xc0
[20759.285734] __sys_recvfrom+0xc8/0x130
[20759.290566] ? __audit_syscall_entry+0x103/0x130
[20759.296227] ? syscall_trace_enter+0x1d2/0x2d0
[20759.301700] ? __audit_syscall_exit+0x1e4/0x290
[20759.307235] __x64_sys_recvfrom+0x24/0x30
[20759.312226] do_syscall_64+0x55/0x1b0
[20759.316852] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: dihu 
---
  net/ipv4/tcp_bpf.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 5a05327..c0d4624 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -64,6 +64,9 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
} while (i != msg_rx->sg.end);

if (unlikely(peek)) {
+   if (msg_rx == list_last_entry(&psock->ingress_msg,
+   struct sk_msg, list))
+break;


Thanks. The change looks good, but the spacing is a bit off. Can we
turn those spaces into tabs? Otherwise, adding a Fixes tag and
my ack would be great.

Fixes: 02c558b2d5d67 ("bpf: sockmap, support for msg_peek in sk_msg with redirect 
ingress")
Acked-by: John Fastabend 



From 127a334fa5e5d029353ceb1a0414886c527f4be5 Mon Sep 17 00:00:00 2001
From: dihu 
Date: Fri, 29 May 2020 16:38:50 +0800
Subject: [PATCH] bpf/sockmap: fix kernel panic at __tcp_bpf_recvmsg

When a user application calls read() with the MSG_PEEK flag on a bpf
sockmap socket, a kernel panic happens at __tcp_bpf_recvmsg+0x12c/0x350.
With MSG_PEEK set, the sk_msg is not removed from the ingress_msg queue
after it is read. Because the code does not check whether the sk_msg is
the last entry of the ingress_msg queue, the next "entry" may be the
head of the ingress_msg queue itself, whose sg page memory address is
invalid. Add a check to prevent this.

[20759.125457] BUG: kernel NULL pointer dereference, address: 0000000000000008
[20759.132118] CPU: 53 PID: 51378 Comm: envoy Tainted: G E 5.4.32 #1
[20759.140890] Hardware name: Inspur SA5212M4/YZMB-00370-109, BIOS 4.1.12 06/18/2017
[20759.149734] RIP: 0010:copy_page_to_iter+0xad/0x300
[20759.270877] __tcp_bpf_recvmsg+0x12c/0x350
[20759.276099] tcp_bpf_recvmsg+0x113/0x370
[20759.281137] inet_recvmsg+0x55/0xc0
[20759.285734] __sys_recvfrom+0xc8/0x130
[20759.290566] ? __audit_syscall_entry+0x103/0x130
[20759.296227] ? syscall_trace_enter+0x1d2/0x2d0
[20759.301700] ? __audit_syscall_exit+0x1e4/0x290
[20759.307235] __x64_sys_recvfrom+0x24/0x30
[20759.312226] do_syscall_64+0x55/0x1b0
[20759.316852] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: dihu 
---
 net/ipv4/tcp_bpf.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 5a05327..b82e4c3 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -64,6 +64,9 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock 
*psock,

     } while (i != msg_rx->sg.end);

     if (unlikely(peek)) {
+            if (msg_rx == list_last_entry(&psock->ingress_msg,
+                          struct sk_msg, list))
+                break;
         msg_rx = list_next_entry(msg_rx, list);
         continue;
     }
--
1.8.3.1



RE: [PATCH] bpf/sockmap: fix kernel panic at __tcp_bpf_recvmsg

2020-05-26 Thread John Fastabend
dihu wrote:
> From 865a45747de6b68fd02a0ff128a69a5c8feb73c3 Mon Sep 17 00:00:00 2001
> From: dihu 
> Date: Mon, 25 May 2020 17:23:16 +0800
> Subject: [PATCH] bpf/sockmap: fix kernel panic at __tcp_bpf_recvmsg
> 
> When a user application calls read() with the MSG_PEEK flag on a bpf
> sockmap socket, a kernel panic happens at __tcp_bpf_recvmsg+0x12c/0x350.
> With MSG_PEEK set, the sk_msg is not removed from the ingress_msg queue
> after it is read. Because the code does not check whether the sk_msg is
> the last entry of the ingress_msg queue, the next "entry" may be the
> head of the ingress_msg queue itself, whose sg page memory address is
> invalid. Add a check to prevent this.
> 
> [20759.125457] BUG: kernel NULL pointer dereference, address: 0000000000000008
> [20759.132118] CPU: 53 PID: 51378 Comm: envoy Tainted: G E 5.4.32 #1
> [20759.140890] Hardware name: Inspur SA5212M4/YZMB-00370-109, BIOS 4.1.12 06/18/2017
> [20759.149734] RIP: 0010:copy_page_to_iter+0xad/0x300
> [20759.270877] __tcp_bpf_recvmsg+0x12c/0x350
> [20759.276099] tcp_bpf_recvmsg+0x113/0x370
> [20759.281137] inet_recvmsg+0x55/0xc0
> [20759.285734] __sys_recvfrom+0xc8/0x130
> [20759.290566] ? __audit_syscall_entry+0x103/0x130
> [20759.296227] ? syscall_trace_enter+0x1d2/0x2d0
> [20759.301700] ? __audit_syscall_exit+0x1e4/0x290
> [20759.307235] __x64_sys_recvfrom+0x24/0x30
> [20759.312226] do_syscall_64+0x55/0x1b0
> [20759.316852] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> Signed-off-by: dihu 
> ---
>  net/ipv4/tcp_bpf.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index 5a05327..c0d4624 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c
> @@ -64,6 +64,9 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock 
> *psock,
>} while (i != msg_rx->sg.end);
> 
>if (unlikely(peek)) {
> +   if (msg_rx == list_last_entry(&psock->ingress_msg,
> +   struct sk_msg, list))
> +break;


Thanks. The change looks good, but the spacing is a bit off. Can we
turn those spaces into tabs? Otherwise, adding a Fixes tag and
my ack would be great.

Fixes: 02c558b2d5d67 ("bpf: sockmap, support for msg_peek in sk_msg with 
redirect ingress")
Acked-by: John Fastabend 


Re: BPF in the kernel

2000-09-21 Thread Marc Lehmann

On Mon, Sep 18, 2000 at 10:55:20PM -0700, Philippe Troin <[EMAIL PROTECTED]> wrote:
> Ideally, libpcap should be extended to support the LPFisms. An LSF

Indeed, and I sent a patch to do so about 4 weeks after LSF was in the
kernel, but nothing happened so far. ("Hey, the -s option suddenly works ;")

-- 
  -==- |
  ==-- _   |
  ---==---(_)__  __   __   Marc Lehmann  +--
  --==---/ / _ \/ // /\ \/ /   [EMAIL PROTECTED] |e|
  -=/_/_//_/\_,_/ /_/\_\   XX11-RIPE --+
The choice of a GNU generation   |
 |



Re: BPF in the kernel

2000-09-18 Thread Philippe Troin

kjh63 <[EMAIL PROTECTED]> writes:

> > How do the Linux kernel and BPF relate to each other?
> >
> > a) Linux has BPF (I don't think so).

It has LSF, the Linux Socket Filter.

> > b) Linux has its own equivalent of BPF (part of NAT?)

Yes, the LSF.

> > c) Linux does not have anything like BPF.

BPF opcodes work on LSF. LSF has some extensions to BPF, like
fetching which interface the packet came from.

> > d) Something else (if so, then what?)
> 
> a) The Documentation/networking/filters.txt may say so, but I don't think
> so either:
> 
> [root@localhost networking]# ls -al /dev/bpf0
> ls: /dev/bpf0: No such file or directory
> [root@localhost networking]# cd /dev/
> [root@localhost /dev]# sh MAKEDEV bpf0
> MAKEDEV: don't know how to make device "bpf0"

LSF does not work with device nodes; it works via setsockopt():

int fd;  /* an already-open socket */
struct sock_fprog bpfp;
/* fill bpfp with a classic BPF program */
setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &bpfp, sizeof(bpfp));

> How can I make ethereal (or libpcap) work with LSF?

Last time I checked, libpcap was emulating BPF in user space, which is
of course bad because every packet gets copied to user space.

Ideally, libpcap should be extended to support the LPFisms. An LSF
filter on some ethernet traffic does not have to be bound to an
interface; you can filter all incoming packets.

I can send you an example if you need one.

Phil.
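
As a rough illustration of the above, here is a minimal sketch that
attaches a classic BPF filter with SO_ATTACH_FILTER. It assumes the
struct sock_fprog API from linux/filter.h and an AF_PACKET socket;
error paths are trimmed, and it is a sketch rather than the example
offered above.

#include <stdio.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_ether.h>
#include <linux/filter.h>

int main(void)
{
	/* Classic BPF: load the ethertype halfword at offset 12,
	 * accept IPv4 frames, drop everything else. */
	struct sock_filter code[] = {
		BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 12),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ETH_P_IP, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, 0xffff),	/* accept */
		BPF_STMT(BPF_RET | BPF_K, 0),		/* drop   */
	};
	struct sock_fprog prog = {
		.len = sizeof(code) / sizeof(code[0]),
		.filter = code,
	};

	/* Needs CAP_NET_RAW; not bound to any one interface, so it
	 * sees (and filters) packets from all interfaces. */
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	if (fd < 0) {
		perror("socket");
		return 1;
	}
	if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER,
		       &prog, sizeof(prog)) < 0) {
		perror("setsockopt(SO_ATTACH_FILTER)");
		return 1;
	}
	/* recv() on fd now returns only IPv4 frames. */
	return 0;
}

Because the filter runs in the kernel, unwanted packets are dropped
before they are ever copied to user space, which is the advantage over
user-space emulation mentioned above.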



BPF in the kernel

2000-09-18 Thread kjh63

> How do the Linux kernel and BPF relate to each other?
>
> a) Linux has BPF (I don't think so).
> b) Linux has its own equivalent of BPF (part of NAT?)
> c) Linux does not have anything like BPF.
> d) Something else (if so, then what?)

a) The Documentation/networking/filters.txt may say so, but I don't think so
either:

[root@localhost networking]# ls -al /dev/bpf0
ls: /dev/bpf0: No such file or directory

[root@localhost networking]# cd /dev/
[root@localhost /dev]# sh MAKEDEV bpf0
MAKEDEV: don't know how to make device "bpf0"

How can I make ethereal (or libpcap) work with LSF?

Thanks

Kurt
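
For reference, the usual libpcap call sequence is sketched below;
whether the compiled filter then runs in-kernel or is emulated in user
space depends on the libpcap build. The interface name "eth0" and the
filter string are placeholders, and PCAP_NETMASK_UNKNOWN assumes a
reasonably recent libpcap.

#include <stdio.h>
#include <pcap/pcap.h>

int main(void)
{
	char errbuf[PCAP_ERRBUF_SIZE];
	struct bpf_program prog;

	/* "eth0" is a placeholder interface name. */
	pcap_t *p = pcap_open_live("eth0", 65535, 0, 1000, errbuf);
	if (!p) {
		fprintf(stderr, "pcap_open_live: %s\n", errbuf);
		return 1;
	}

	/* Compile the filter expression to BPF and attach it; libpcap
	 * decides whether it runs in the kernel or in user space. */
	if (pcap_compile(p, &prog, "tcp port 80", 1, PCAP_NETMASK_UNKNOWN) < 0 ||
	    pcap_setfilter(p, &prog) < 0) {
		pcap_perror(p, "filter");
		return 1;
	}

	pcap_freecode(&prog);
	pcap_close(p);
	return 0;
}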



