Re: [PATCH 3/4] vmlinux.lds.h: Fix alignment for __ksymtab*, __kcrctab_* and .pci_fixup sections

2023-12-22 Thread Masahiro Yamada
On Fri, Dec 22, 2023 at 6:02 PM Helge Deller  wrote:
>
> On 12/21/23 14:07, Masahiro Yamada wrote:
> > On Thu, Nov 23, 2023 at 7:18 AM  wrote:
> >>
> >> From: Helge Deller 
> >>
> >> On 64-bit architectures without CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
> >> (e.g. ppc64, ppc64le, parisc, s390x,...) the __KSYM_REF() macro stores
> >> 64-bit pointers into the __ksymtab* sections.
> >> Make sure that the start of those sections is 64-bit aligned in the vmlinux
> >> executable, otherwise unaligned memory accesses may happen at runtime.
> >
> >
> > Are you solving a real problem?
>
> Not any longer.
> I faced a problem on parisc when neither #1 and #3 were applied
> because of a buggy unalignment exception handler. But this is
> not something which I would count a "real generic problem".
>
> > 1/4 already ensures the proper alignment of __ksymtab*, doesn't it?
>
> Yes, it does.
>
> >...
> > So, my understanding is this patch is unneeded.
>
> Yes, it's not required and I'm fine if we drop it.
>
> But regarding __kcrctab:
>
> >> @@ -498,6 +501,7 @@
> >>  }   \
> >>  \
> >>  /* Kernel symbol table: Normal symbols */   \
> >> +   . = ALIGN(4);   \
> >>  __kcrctab : AT(ADDR(__kcrctab) - LOAD_OFFSET) { \
> >>  __start___kcrctab = .;  \
> >>  KEEP(*(SORT(___kcrctab+*))) \
>
> I think this patch would be beneficial to get proper alignment:
>
> diff --git a/include/linux/export-internal.h b/include/linux/export-internal.h
> index cd253eb51d6c..d445705ac13c 100644
> --- a/include/linux/export-internal.h
> +++ b/include/linux/export-internal.h
> @@ -64,6 +64,7 @@
>
>   #define SYMBOL_CRC(sym, crc, sec)   \
>  asm(".section \"___kcrctab" sec "+" #sym "\",\"a\"" "\n" \
> +   ".balign 4" "\n" \
>  "__crc_" #sym ":"   "\n" \
>  ".long " #crc   "\n" \
>  ".previous" "\n")


Yes!


Please send a patch with this:


Fixes: f3304ecd7f06 ("linux/export: use inline assembler to populate
symbol CRCs")



-- 
Best Regards
Masahiro Yamada



Re: [PATCH bpf-next 0/3] bpf: introduce BPF_MAP_TYPE_RELAY

2023-12-22 Thread Philo Lu




On 2023/12/22 22:45, Jiri Olsa wrote:

On Fri, Dec 22, 2023 at 08:21:43PM +0800, Philo Lu wrote:

The patch set introduce a new type of map, BPF_MAP_TYPE_RELAY, based on
relay interface [0]. It provides a way for persistent and overwritable data
transfer.

As stated in [0], relay is a efficient method for log and data transfer.
And the interface is simple enough so that we can implement and use this
type of map with current map interfaces. Besides we need a new helper
bpf_relay_output to output data to user, similar with bpf_ringbuf_output.

We need this map because currently neither ringbuf nor perfbuf satisfies
the requirements of relatively long-term consistent tracing, where the bpf
program keeps writing into the buffer without any bundled reader, and the
buffer supports overwriting. For users, they just run the bpf program to
collect data, and are able to read as need. The detailed discussion can be
found at [1].

The buffer is exposed to users as per-cpu files in debugfs, supporting read
and mmap, and it is up to users how to formulate and read it, either
through a program with mmap or just `cat`. Specifically, the files are
created as "/sys/kerenl/debug//#cpu", where the 
is defined with map_update_elem (Note that we do not need to implement
actual elem operators for relay_map).

If this map is acceptable, other parts including docs, libbpf support,
selftests, and benchmarks (if need) will be added in the following version.


looks useful, selftests might be already helpful to see the usage



Ok, I will add selftests in the next version.

Thanks.


jirka



[0]
https://github.com/torvalds/linux/blob/master/Documentation/filesystems/relay.rst
[1]
https://lore.kernel.org/bpf/20231219122850.433be...@gandalf.local.home/T/

Philo Lu (3):
   bpf: implement relay map basis
   bpf: implement map_update_elem to init relay file
   bpf: introduce bpf_relay_output helper

  include/linux/bpf.h   |   1 +
  include/linux/bpf_types.h |   3 +
  include/uapi/linux/bpf.h  |  17 +++
  kernel/bpf/Makefile   |   3 +
  kernel/bpf/helpers.c  |   4 +
  kernel/bpf/relaymap.c | 213 ++
  kernel/bpf/syscall.c  |   1 +
  kernel/bpf/verifier.c |   8 ++
  kernel/trace/bpf_trace.c  |   4 +
  9 files changed, 254 insertions(+)
  create mode 100644 kernel/bpf/relaymap.c

--
2.32.0.3.g01195cf9f





Re: [PATCH bpf-next 3/3] bpf: introduce bpf_relay_output helper

2023-12-22 Thread Philo Lu




On 2023/12/22 22:46, Jiri Olsa wrote:

On Fri, Dec 22, 2023 at 08:21:46PM +0800, Philo Lu wrote:

Like perfbuf/ringbuf, a helper is needed to write into the buffer, named
bpf_relay_output, whose usage is same as ringbuf. Note that it works only
after relay files are set, i.e., after calling map_update_elem for the
created relay map.


we can't add helpers anymore, this will need to be kfunc
check functions marked with __bpf_kfunc as examples



Got it, I will change it to kfunc in the next version.

Thanks.


jirka



Signed-off-by: Philo Lu 
---
  include/linux/bpf.h  |  1 +
  include/uapi/linux/bpf.h | 10 ++
  kernel/bpf/helpers.c |  4 
  kernel/bpf/relaymap.c| 26 ++
  kernel/bpf/verifier.c|  8 
  kernel/trace/bpf_trace.c |  4 
  6 files changed, 53 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 7671530d6e4e..b177122369e6 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3095,6 +3095,7 @@ extern const struct bpf_func_proto bpf_get_retval_proto;
  extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto;
  extern const struct bpf_func_proto bpf_cgrp_storage_get_proto;
  extern const struct bpf_func_proto bpf_cgrp_storage_delete_proto;
+extern const struct bpf_func_proto bpf_relay_output_proto;
  
  const struct bpf_func_proto *tracing_prog_func_proto(

enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 143b75676bd3..03c0c1953ba1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5686,6 +5686,15 @@ union bpf_attr {
   *0 on success.
   *
   ***-ENOENT** if the bpf_local_storage cannot be found.
+ *
+ * long bpf_relay_output(void *map, void *data, u64 size, u64 flags)
+ * Description
+ * Copy *size* bytes from *data* into *map* of type 
BPF_MAP_TYPE_RELAY.
+ * Currently, the *flags* must be 0.
+ * Return
+ * 0 on success.
+ *
+ * **-ENOENT** if the relay base_file in debugfs cannot be found.
   */
  #define ___BPF_FUNC_MAPPER(FN, ctx...)\
FN(unspec, 0, ##ctx)\
@@ -5900,6 +5909,7 @@ union bpf_attr {
FN(user_ringbuf_drain, 209, ##ctx)  \
FN(cgrp_storage_get, 210, ##ctx)\
FN(cgrp_storage_delete, 211, ##ctx) \
+   FN(relay_output, 212, ##ctx)\
/* */
  
  /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index be72824f32b2..0c26e87ce729 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1720,6 +1720,10 @@ bpf_base_func_proto(enum bpf_func_id func_id)
return _ringbuf_discard_proto;
case BPF_FUNC_ringbuf_query:
return _ringbuf_query_proto;
+#ifdef CONFIG_RELAY
+   case BPF_FUNC_relay_output:
+   return _relay_output_proto;
+#endif
case BPF_FUNC_strncmp:
return _strncmp_proto;
case BPF_FUNC_strtol:
diff --git a/kernel/bpf/relaymap.c b/kernel/bpf/relaymap.c
index 588c8de0a4bd..f9e2e4a780df 100644
--- a/kernel/bpf/relaymap.c
+++ b/kernel/bpf/relaymap.c
@@ -173,6 +173,32 @@ static u64 relay_map_mem_usage(const struct bpf_map *map)
return usage;
  }
  
+BPF_CALL_4(bpf_relay_output, struct bpf_map *, map, void *, data, u64, size,

+  u64, flags)
+{
+   struct bpf_relay_map *rmap;
+
+   /* not support any flag now */
+   if (unlikely(flags))
+   return -EINVAL;
+
+   rmap = container_of(map, struct bpf_relay_map, map);
+   if (!rmap->relay_chan->has_base_filename)
+   return -ENOENT;
+
+   relay_write(rmap->relay_chan, data, size);
+   return 0;
+}
+
+const struct bpf_func_proto bpf_relay_output_proto = {
+   .func   = bpf_relay_output,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_CONST_MAP_PTR,
+   .arg2_type  = ARG_PTR_TO_MEM | MEM_RDONLY,
+   .arg3_type  = ARG_CONST_SIZE_OR_ZERO,
+   .arg4_type  = ARG_ANYTHING,
+};
+
  BTF_ID_LIST_SINGLE(relay_map_btf_ids, struct, bpf_relay_map)
  const struct bpf_map_ops relay_map_ops = {
.map_meta_equal = bpf_map_meta_equal,
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f13008d27f35..8c8287d6ae18 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8800,6 +8800,10 @@ static int check_map_func_compatibility(struct 
bpf_verifier_env *env,
if (func_id != BPF_FUNC_user_ringbuf_drain)
goto error;
break;
+   case BPF_MAP_TYPE_RELAY:
+   if (func_id != BPF_FUNC_relay_output)
+   goto error;
+   break;
case BPF_MAP_TYPE_STACK_TRACE:
if (func_id != BPF_FUNC_get_stackid)
 

Re: [PATCH bpf-next 2/3] bpf: implement map_update_elem to init relay file

2023-12-22 Thread Philo Lu




On 2023/12/22 22:45, Jiri Olsa wrote:

On Fri, Dec 22, 2023 at 08:21:45PM +0800, Philo Lu wrote:

map_update_elem is used to create relay files and bind them with the
relay channel, which is created with BPF_MAP_CREATE. This allows users
to set a custom directory name. It must be used with key=NULL and
flag=0.

Here is an example:
```
struct {
__uint(type, BPF_MAP_TYPE_RELAY);
__uint(max_entries, 4096);
} my_relay SEC(".maps");
...
char dir_name[] = "relay_test";
bpf_map_update_elem(map_fd, NULL, dir_name, 0);
```

Then, directory `/sys/kerenl/debug/relay_test` will be created, which
includes files of my_relay0...my_relay[#cpu]. Each represents a per-cpu
buffer with size 8 * 4096 B (there are 8 subbufs by default, each with
size 4096B).

Signed-off-by: Philo Lu 
---
  kernel/bpf/relaymap.c | 32 +++-
  1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/relaymap.c b/kernel/bpf/relaymap.c
index d0adc7f67758..588c8de0a4bd 100644
--- a/kernel/bpf/relaymap.c
+++ b/kernel/bpf/relaymap.c
@@ -117,7 +117,37 @@ static void *relay_map_lookup_elem(struct bpf_map *map, 
void *key)
  static long relay_map_update_elem(struct bpf_map *map, void *key, void *value,
   u64 flags)
  {
-   return -EOPNOTSUPP;
+   struct bpf_relay_map *rmap;
+   struct dentry *parent;
+   int err;
+
+   if (unlikely(flags))
+   return -EINVAL;
+
+   if (unlikely(key))
+   return -EINVAL;
+
+   rmap = container_of(map, struct bpf_relay_map, map);
+
+   /* The directory already exists */
+   if (rmap->relay_chan->has_base_filename)
+   return -EEXIST;
+
+   /* Setup relay files. Note that the directory name passed as value 
should
+* not be longer than map->value_size, including the '\0' at the end.
+*/
+   ((char *)value)[map->value_size - 1] = '\0';
+   parent = debugfs_create_dir(value, NULL);
+   if (IS_ERR_OR_NULL(parent))
+   return PTR_ERR(parent);
+
+   err = relay_late_setup_files(rmap->relay_chan, map->name, parent);
+   if (err) {
+   debugfs_remove_recursive(parent);
+   return err;
+   }


looks like this patch could be moved to the previous one?



OK, will do it.


jirka


+
+   return 0;
  }
  
  static long relay_map_delete_elem(struct bpf_map *map, void *key)

--
2.32.0.3.g01195cf9f






Re: [PATCH bpf-next 1/3] bpf: implement relay map basis

2023-12-22 Thread Philo Lu




On 2023/12/22 22:45, Jiri Olsa wrote:

On Fri, Dec 22, 2023 at 08:21:44PM +0800, Philo Lu wrote:

SNIP


+/* bpf_attr is used as follows:
+ * - key size: must be 0
+ * - value size: value will be used as directory name by map_update_elem
+ *   (to create relay files). If passed as 0, it will be set to NAME_MAX as
+ *   default
+ *
+ * - max_entries: subbuf size
+ * - map_extra: subbuf num, default as 8
+ *
+ * When alloc, we do not set up relay files considering dir_name conflicts.
+ * Instead we use relay_late_setup_files() in map_update_elem(), and thus the
+ * value is used as dir_name, and map->name is used as base_filename.
+ */
+static struct bpf_map *relay_map_alloc(union bpf_attr *attr)
+{
+   struct bpf_relay_map *rmap;
+
+   if (unlikely(attr->map_flags & ~RELAY_CREATE_FLAG_MASK))
+   return ERR_PTR(-EINVAL);
+
+   /* key size must be 0 in relay map */
+   if (unlikely(attr->key_size))
+   return ERR_PTR(-EINVAL);
+
+   if (unlikely(attr->value_size > NAME_MAX)) {
+   pr_warn("value_size should be no more than %d\n", NAME_MAX);
+   return ERR_PTR(-EINVAL);
+   } else if (attr->value_size == 0)
+   attr->value_size = NAME_MAX;


the concept of no key with just value seems strange.. I never worked
with relay channels, so perhaps stupid question: but why not have one
relay channel for given key? having the debugfs like:

   /sys/kernel/debug/my_rmap/mychannel/

Here, a relay map is actually a relay channel, which includes buffers 
for all cpus. And I think 2 levels is enough when we use relay map in 
`/sys/kernel/debug/`: /[#cpu]. The `dir_name` is 
necessary because user could use the same `map_name` in different bpf 
programs, and we can use it as an additional tag to distinguish them. 
The `dir_name` is set by user with relay_map_update_elem.


Here is an example. Assume we have 2 relay maps (rmap_a and rmap_b) and 
2 cpus, the debugfs will be like:

```
/sys/kernel/debug//rmap_a0
/sys/kernel/debug//rmap_a1
/sys/kernel/debug//rmap_b0
/sys/kernel/debug//rmap_b1
```

So I think the key point here is that we just need one field to set the 
`dir_name`, either key or value. I chose key as NULL because I think it 
suggests "Normally map_update_elem should be invoked just once for a 
relay map". But I think it okay to use key instead, and value as NULL.



+
+   /* set default subbuf num */
+   attr->map_extra = attr->map_extra & UINT_MAX;
+   if (!attr->map_extra)
+   attr->map_extra = 8;
+
+   if (!attr->map_name || strlen(attr->map_name) == 0)


attr->map_name is allways != NULL


+   return ERR_PTR(-EINVAL);
+
+   rmap = bpf_map_area_alloc(sizeof(*rmap), NUMA_NO_NODE);
+   if (!rmap)
+   return ERR_PTR(-ENOMEM);
+
+   bpf_map_init_from_attr(>map, attr);
+
+   rmap->relay_cb.create_buf_file = create_buf_file_handler;
+   rmap->relay_cb.remove_buf_file = remove_buf_file_handler;
+   if (attr->map_flags & BPF_F_OVERWRITE)
+   rmap->relay_cb.subbuf_start = subbuf_start_overwrite;
+
+   rmap->relay_chan = relay_open(NULL, NULL,
+   attr->max_entries, 
attr->map_extra,
+   >relay_cb, NULL);


wrong indentation


Got it. I will adjust it.


+   if (!rmap->relay_chan)
+   return ERR_PTR(-EINVAL);
+
+   return >map;
+}
+
+static void relay_map_free(struct bpf_map *map)
+{
+   struct bpf_relay_map *rmap;
+
+   rmap = container_of(map, struct bpf_relay_map, map);
+   relay_close(rmap->relay_chan);
+   debugfs_remove_recursive(rmap->relay_chan->parent);
+   kfree(rmap);


should you use bpf_map_area_free instead?


Thanks for catching. Will fix it.


jirka


+}
+
+static void *relay_map_lookup_elem(struct bpf_map *map, void *key)
+{
+   return ERR_PTR(-EOPNOTSUPP);
+}
+
+static long relay_map_update_elem(struct bpf_map *map, void *key, void *value,
+  u64 flags)
+{
+   return -EOPNOTSUPP;
+}
+
+static long relay_map_delete_elem(struct bpf_map *map, void *key)
+{
+   return -EOPNOTSUPP;
+}
+
+static int relay_map_get_next_key(struct bpf_map *map, void *key,
+   void *next_key)
+{
+   return -EOPNOTSUPP;
+}
+
+static u64 relay_map_mem_usage(const struct bpf_map *map)
+{
+   struct bpf_relay_map *rmap;
+   u64 usage = sizeof(struct bpf_relay_map);
+
+   rmap = container_of(map, struct bpf_relay_map, map);
+   usage += sizeof(struct rchan);
+   usage += (sizeof(struct rchan_buf) + rmap->relay_chan->alloc_size)
+* num_online_cpus();
+   return usage;
+}
+
+BTF_ID_LIST_SINGLE(relay_map_btf_ids, struct, bpf_relay_map)
+const struct bpf_map_ops relay_map_ops = {
+   .map_meta_equal = bpf_map_meta_equal,
+   .map_alloc = relay_map_alloc,
+   .map_free = 

Re: [PATCH 0/4] Section alignment issues?

2023-12-22 Thread Masahiro Yamada
On Fri, Dec 22, 2023 at 5:23 PM Helge Deller  wrote:
>
> On 12/21/23 16:42, Masahiro Yamada wrote:
> > On Thu, Dec 21, 2023 at 10:40 PM Masahiro Yamada  
> > wrote:
> >>
> >> On Thu, Nov 23, 2023 at 7:18 AM  wrote:
> >>>
> >>> From: Helge Deller 
> >>>
> >>> While working on the 64-bit parisc kernel, I noticed that the __ksymtab[]
> >>> table was not correctly 64-bit aligned in many modules.
> >>> The following patches do fix some of those issues in the generic code.
> >>>
> >>> But further investigation shows that multiple sections in the kernel and 
> >>> in
> >>> modules are possibly not correctly aligned, and thus may lead to 
> >>> performance
> >>> degregations at runtime (small on x86, huge on parisc, sparc and others 
> >>> which
> >>> need exception handlers). Sometimes wrong alignments may also be simply 
> >>> hidden
> >>> by the linker or kernel module loader which pulls in the sections by luck 
> >>> with
> >>> a correct alignment (e.g. because the previous section was aligned 
> >>> already).
> >>>
> >>> An objdump on a x86 module shows e.g.:
> >>>
> >>> ./kernel/net/netfilter/nf_log_syslog.ko: file format elf64-x86-64
> >>> Sections:
> >>> Idx Name  Size  VMA   LMA   File off  
> >>> Algn
> >>>0 .text 1fdf      0040 
> >>>  2**4
> >>>CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
> >>>1 .init.text00f6      2020 
> >>>  2**4
> >>>CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
> >>>2 .exit.text005c      2120 
> >>>  2**4
> >>>CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
> >>>3 .rodata.str1.8 00dc      
> >>> 2180  2**3
> >>>CONTENTS, ALLOC, LOAD, READONLY, DATA
> >>>4 .rodata.str1.1 030a      
> >>> 225c  2**0
> >>>CONTENTS, ALLOC, LOAD, READONLY, DATA
> >>>5 .rodata   00b0      2580 
> >>>  2**5
> >>>CONTENTS, ALLOC, LOAD, READONLY, DATA
> >>>6 .modinfo  019e      2630 
> >>>  2**0
> >>>CONTENTS, ALLOC, LOAD, READONLY, DATA
> >>>7 .return_sites 0034      27ce 
> >>>  2**0
> >>>CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
> >>>8 .call_sites   029c      2802 
> >>>  2**0
> >>>CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
> >>>
> >>> In this example I believe the ".return_sites" and ".call_sites" should 
> >>> have
> >>> an alignment of at least 32-bit (4 bytes).
> >>>
> >>> On other architectures or modules other sections like ".altinstructions" 
> >>> or
> >>> "__ex_table" may show up wrongly instead.
> >>>
> >>> In general I think it would be beneficial to search for wrong alignments 
> >>> at
> >>> link time, and maybe at runtime.
> >>>
> >>> The patch at the end of this cover letter
> >>> - adds compile time checks to the "modpost" tool, and
> >>> - adds a runtime check to the kernel module loader at runtime.
> >>> And it will possibly show false positives too (!!!)
> >>> I do understand that some of those sections are not performce critical
> >>> and thus any alignment is OK.
> >>>
> >>> The modpost patch will emit at compile time such warnings (on x86-64 
> >>> kernel build):
> >>>
> >>> WARNING: modpost: vmlinux: section .initcall7.init (type 1, flags 2) has 
> >>> alignment of 1, expected at least 4.
> >>> Maybe you need to add ALIGN() to the modules.lds file (or fix modpost) ?
> >>> WARNING: modpost: vmlinux: section .altinstructions (type 1, flags 2) has 
> >>> alignment of 1, expected at least 2.
> >>> WARNING: modpost: vmlinux: section .initcall6.init (type 1, flags 2) has 
> >>> alignment of 1, expected at least 4.
> >>> WARNING: modpost: vmlinux: section .initcallearly.init (type 1, flags 2) 
> >>> has alignment of 1, expected at least 4.
> >>> WARNING: modpost: vmlinux: section .rodata.cst2 (type 1, flags 18) has 
> >>> alignment of 2, expected at least 64.
> >>> WARNING: modpost: vmlinux: section .static_call_tramp_key (type 1, flags 
> >>> 2) has alignment of 1, expected at least 8.
> >>> WARNING: modpost: vmlinux: section .con_initcall.init (type 1, flags 2) 
> >>> has alignment of 1, expected at least 8.
> >>> WARNING: modpost: vmlinux: section __bug_table (type 1, flags 3) has 
> >>> alignment of 1, expected at least 4.
> >>> ...
> >>
> >>
> >>
> >>
> >> modpost acts on vmlinux.o instead of vmlinux.
> >>
> >>
> >> vmlinux.o is a relocatable ELF, which is not a real layout
> >> because no linker script has been considered yet at this
> >> point.
> >>
> >>
> >> vmlinux is an executable ELF, produced by a linker,
> >> with the linker script taken into 

Re: [PATCH bpf-next 1/3] bpf: implement relay map basis

2023-12-22 Thread kernel test robot
Hi Philo,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:
https://github.com/intel-lab-lkp/linux/commits/Philo-Lu/bpf-implement-relay-map-basis/20231222-204545
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:
https://lore.kernel.org/r/20231222122146.65519-2-lulie%40linux.alibaba.com
patch subject: [PATCH bpf-next 1/3] bpf: implement relay map basis
config: x86_64-kexec 
(https://download.01.org/0day-ci/archive/20231223/202312230850.nqeizxj6-...@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20231223/202312230850.nqeizxj6-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202312230850.nqeizxj6-...@intel.com/

All warnings (new ones prefixed by >>):

   kernel/bpf/relaymap.c: In function 'relay_map_alloc':
>> kernel/bpf/relaymap.c:79:13: warning: the comparison will always evaluate as 
>> 'true' for the address of 'map_name' will never be NULL [-Waddress]
  79 | if (!attr->map_name || strlen(attr->map_name) == 0)
 | ^
   In file included from include/linux/bpf.h:7,
from include/linux/filter.h:9,
from kernel/bpf/relaymap.c:3:
   include/uapi/linux/bpf.h:1394:25: note: 'map_name' declared here
1394 | charmap_name[BPF_OBJ_NAME_LEN];
 | ^~~~


vim +79 kernel/bpf/relaymap.c

43  
44  /* bpf_attr is used as follows:
45   * - key size: must be 0
46   * - value size: value will be used as directory name by map_update_elem
47   *   (to create relay files). If passed as 0, it will be set to 
NAME_MAX as
48   *   default
49   *
50   * - max_entries: subbuf size
51   * - map_extra: subbuf num, default as 8
52   *
53   * When alloc, we do not set up relay files considering dir_name 
conflicts.
54   * Instead we use relay_late_setup_files() in map_update_elem(), and 
thus the
55   * value is used as dir_name, and map->name is used as base_filename.
56   */
57  static struct bpf_map *relay_map_alloc(union bpf_attr *attr)
58  {
59  struct bpf_relay_map *rmap;
60  
61  if (unlikely(attr->map_flags & ~RELAY_CREATE_FLAG_MASK))
62  return ERR_PTR(-EINVAL);
63  
64  /* key size must be 0 in relay map */
65  if (unlikely(attr->key_size))
66  return ERR_PTR(-EINVAL);
67  
68  if (unlikely(attr->value_size > NAME_MAX)) {
69  pr_warn("value_size should be no more than %d\n", 
NAME_MAX);
70  return ERR_PTR(-EINVAL);
71  } else if (attr->value_size == 0)
72  attr->value_size = NAME_MAX;
73  
74  /* set default subbuf num */
75  attr->map_extra = attr->map_extra & UINT_MAX;
76  if (!attr->map_extra)
77  attr->map_extra = 8;
78  
  > 79  if (!attr->map_name || strlen(attr->map_name) == 0)
80  return ERR_PTR(-EINVAL);
81  
82  rmap = bpf_map_area_alloc(sizeof(*rmap), NUMA_NO_NODE);
83  if (!rmap)
84  return ERR_PTR(-ENOMEM);
85  
86  bpf_map_init_from_attr(>map, attr);
87  
88  rmap->relay_cb.create_buf_file = create_buf_file_handler;
89  rmap->relay_cb.remove_buf_file = remove_buf_file_handler;
90  if (attr->map_flags & BPF_F_OVERWRITE)
91  rmap->relay_cb.subbuf_start = subbuf_start_overwrite;
92  
93  rmap->relay_chan = relay_open(NULL, NULL,
94  
attr->max_entries, attr->map_extra,
95  
>relay_cb, NULL);
96  if (!rmap->relay_chan)
97  return ERR_PTR(-EINVAL);
98  
99  return >map;
   100  }
   101  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



Re: [PATCH net-next v3 3/3] net: add netmem_ref to skb_frag_t

2023-12-22 Thread kernel test robot
Hi Mina,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url:
https://github.com/intel-lab-lkp/linux/commits/Mina-Almasry/vsock-virtio-use-skb_frag_-helpers/20231222-164637
base:   net-next/main
patch link:
https://lore.kernel.org/r/20231220214505.2303297-4-almasrymina%40google.com
patch subject: [PATCH net-next v3 3/3] net: add netmem_ref to skb_frag_t
config: i386-randconfig-141-20231222 
(https://download.01.org/0day-ci/archive/20231223/202312230739.g0tfssdt-...@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20231223/202312230739.g0tfssdt-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202312230739.g0tfssdt-...@intel.com/

All errors (new ones prefixed by >>):

   net/kcm/kcmsock.c: In function 'kcm_write_msgs':
>> net/kcm/kcmsock.c:637:59: error: 'skb_frag_t' {aka 'struct skb_frag'} has no 
>> member named 'bv_len'
 637 | msize += skb_shinfo(skb)->frags[i].bv_len;
 |   ^


vim +637 net/kcm/kcmsock.c

cd6e111bf5be5c Tom Herbert   2016-03-07  578  
ab7ac4eb9832e3 Tom Herbert   2016-03-07  579  /* Write any messages ready 
on the kcm socket.  Called with kcm sock lock
ab7ac4eb9832e3 Tom Herbert   2016-03-07  580   * held.  Return bytes 
actually sent or error.
ab7ac4eb9832e3 Tom Herbert   2016-03-07  581   */
ab7ac4eb9832e3 Tom Herbert   2016-03-07  582  static int 
kcm_write_msgs(struct kcm_sock *kcm)
ab7ac4eb9832e3 Tom Herbert   2016-03-07  583  {
c31a25e1db486f David Howells 2023-06-09  584unsigned int total_sent 
= 0;
ab7ac4eb9832e3 Tom Herbert   2016-03-07  585struct sock *sk = 
>sk;
ab7ac4eb9832e3 Tom Herbert   2016-03-07  586struct kcm_psock *psock;
c31a25e1db486f David Howells 2023-06-09  587struct sk_buff *head;
ab7ac4eb9832e3 Tom Herbert   2016-03-07  588int ret = 0;
ab7ac4eb9832e3 Tom Herbert   2016-03-07  589  
ab7ac4eb9832e3 Tom Herbert   2016-03-07  590kcm->tx_wait_more = 
false;
ab7ac4eb9832e3 Tom Herbert   2016-03-07  591psock = kcm->tx_psock;
ab7ac4eb9832e3 Tom Herbert   2016-03-07  592if (unlikely(psock && 
psock->tx_stopped)) {
ab7ac4eb9832e3 Tom Herbert   2016-03-07  593/* A reserved 
psock was aborted asynchronously. Unreserve
ab7ac4eb9832e3 Tom Herbert   2016-03-07  594 * it and we'll 
retry the message.
ab7ac4eb9832e3 Tom Herbert   2016-03-07  595 */
ab7ac4eb9832e3 Tom Herbert   2016-03-07  596
unreserve_psock(kcm);
cd6e111bf5be5c Tom Herbert   2016-03-07  597
kcm_report_tx_retry(kcm);
ab7ac4eb9832e3 Tom Herbert   2016-03-07  598if 
(skb_queue_empty(>sk_write_queue))
ab7ac4eb9832e3 Tom Herbert   2016-03-07  599return 
0;
ab7ac4eb9832e3 Tom Herbert   2016-03-07  600  
c31a25e1db486f David Howells 2023-06-09  601
kcm_tx_msg(skb_peek(>sk_write_queue))->started_tx = false;
ab7ac4eb9832e3 Tom Herbert   2016-03-07  602}
ab7ac4eb9832e3 Tom Herbert   2016-03-07  603  
c31a25e1db486f David Howells 2023-06-09  604  retry:
c31a25e1db486f David Howells 2023-06-09  605while ((head = 
skb_peek(>sk_write_queue))) {
c31a25e1db486f David Howells 2023-06-09  606struct msghdr 
msg = {
c31a25e1db486f David Howells 2023-06-09  607
.msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES,
c31a25e1db486f David Howells 2023-06-09  608};
c31a25e1db486f David Howells 2023-06-09  609struct 
kcm_tx_msg *txm = kcm_tx_msg(head);
c31a25e1db486f David Howells 2023-06-09  610struct sk_buff 
*skb;
c31a25e1db486f David Howells 2023-06-09  611unsigned int 
msize;
c31a25e1db486f David Howells 2023-06-09  612int i;
ab7ac4eb9832e3 Tom Herbert   2016-03-07  613  
c31a25e1db486f David Howells 2023-06-09  614if 
(!txm->started_tx) {
c31a25e1db486f David Howells 2023-06-09  615psock = 
reserve_psock(kcm);
c31a25e1db486f David Howells 2023-06-09  616if 
(!psock)
c31a25e1db486f David Howells 2023-06-09  617
goto out;
c31a25e1db486f David Howells 2023-06-09  618skb = 
head;
c31a25e1db486f David Howells 2023-06-09  619
txm->frag_offset = 0;
c31a25e1db486f David Howells 2023-06-09  620
txm-

Re: [PATCH bpf-next 1/3] bpf: implement relay map basis

2023-12-22 Thread kernel test robot
Hi Philo,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:
https://github.com/intel-lab-lkp/linux/commits/Philo-Lu/bpf-implement-relay-map-basis/20231222-204545
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:
https://lore.kernel.org/r/20231222122146.65519-2-lulie%40linux.alibaba.com
patch subject: [PATCH bpf-next 1/3] bpf: implement relay map basis
config: arm-randconfig-001-20231223 
(https://download.01.org/0day-ci/archive/20231223/202312230638.iydd8joz-...@intel.com/config)
compiler: clang version 18.0.0git (https://github.com/llvm/llvm-project 
d3ef86708241a3bee902615c190dead1638c4e09)
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20231223/202312230638.iydd8joz-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202312230638.iydd8joz-...@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/bpf/relaymap.c:79:13: warning: address of array 'attr->map_name' will 
>> always evaluate to 'true' [-Wpointer-bool-conversion]
  79 | if (!attr->map_name || strlen(attr->map_name) == 0)
 | ~~~^~~~
   1 warning generated.


vim +79 kernel/bpf/relaymap.c

43  
44  /* bpf_attr is used as follows:
45   * - key size: must be 0
46   * - value size: value will be used as directory name by map_update_elem
47   *   (to create relay files). If passed as 0, it will be set to 
NAME_MAX as
48   *   default
49   *
50   * - max_entries: subbuf size
51   * - map_extra: subbuf num, default as 8
52   *
53   * When alloc, we do not set up relay files considering dir_name 
conflicts.
54   * Instead we use relay_late_setup_files() in map_update_elem(), and 
thus the
55   * value is used as dir_name, and map->name is used as base_filename.
56   */
57  static struct bpf_map *relay_map_alloc(union bpf_attr *attr)
58  {
59  struct bpf_relay_map *rmap;
60  
61  if (unlikely(attr->map_flags & ~RELAY_CREATE_FLAG_MASK))
62  return ERR_PTR(-EINVAL);
63  
64  /* key size must be 0 in relay map */
65  if (unlikely(attr->key_size))
66  return ERR_PTR(-EINVAL);
67  
68  if (unlikely(attr->value_size > NAME_MAX)) {
69  pr_warn("value_size should be no more than %d\n", 
NAME_MAX);
70  return ERR_PTR(-EINVAL);
71  } else if (attr->value_size == 0)
72  attr->value_size = NAME_MAX;
73  
74  /* set default subbuf num */
75  attr->map_extra = attr->map_extra & UINT_MAX;
76  if (!attr->map_extra)
77  attr->map_extra = 8;
78  
  > 79  if (!attr->map_name || strlen(attr->map_name) == 0)
80  return ERR_PTR(-EINVAL);
81  
82  rmap = bpf_map_area_alloc(sizeof(*rmap), NUMA_NO_NODE);
83  if (!rmap)
84  return ERR_PTR(-ENOMEM);
85  
86  bpf_map_init_from_attr(>map, attr);
87  
88  rmap->relay_cb.create_buf_file = create_buf_file_handler;
89  rmap->relay_cb.remove_buf_file = remove_buf_file_handler;
90  if (attr->map_flags & BPF_F_OVERWRITE)
91  rmap->relay_cb.subbuf_start = subbuf_start_overwrite;
92  
93  rmap->relay_chan = relay_open(NULL, NULL,
94  
attr->max_entries, attr->map_extra,
95  
>relay_cb, NULL);
96  if (!rmap->relay_chan)
97  return ERR_PTR(-EINVAL);
98  
99  return >map;
   100  }
   101  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



[ANNOUNCE] 5.10.204-rt100

2023-12-22 Thread Luis Claudio R. Goncalves
Hello RT-list!

I'm pleased to announce the 5.10.204-rt100 stable release.

This release is an RT-only update.
The only change in this release is specific to RT:

rt: mm/page_alloc: backport missing bits from __build_all_zonelists() fix

A while ago upstream landed commit a2ebb51575828 ("mm/page_alloc: use
write_seqlock_irqsave() instead write_seqlock() + local_irq_save().")
to fix a problem that had already been worked on v5.10-rt via commit
7bdd3bd5143a4 ("Revert "mm/page_alloc: fix potential deadlock on
zonelist_update_seqseqlock""). Sebastian pointed out it was important
to backport the missing elements of a2ebb51575828 for code consistency.

You can get this release via the git tree at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

  branch: v5.10-rt
  Head SHA1: 3f1186be09688c4aedf2d61176990651cf996c75

Or to build 5.10.204-rt100 directly, the following patches should be applied:

  https://www.kernel.org/pub/linux/kernel/v5.x/linux-5.10.tar.xz

  https://www.kernel.org/pub/linux/kernel/v5.x/patch-5.10.204.xz

  
https://www.kernel.org/pub/linux/kernel/projects/rt/5.10/older/patch-5.10.204-rt100.patch.xz

Signing key fingerprint:

  9354 0649 9972 8D31 D464  D140 F394 A423 F8E6 7C26

All keys used for the above files and repositories can be found on the
following git repository:

   git://git.kernel.org/pub/scm/docs/kernel/pgpkeys.git

Enjoy!
Luis

Changes from v5.10.204-rt99:
---

Luis Claudio R. Goncalves (1):
  rt: mm/page_alloc: backport missing bits from __build_all_zonelists() fix
---
 mm/page_alloc.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 39d1782b398f..cd1e8d0b2269 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6043,8 +6043,9 @@ static void __build_all_zonelists(void *data)
int nid;
int __maybe_unused cpu;
pg_data_t *self = data;
+   unsigned long flags;
 
-   write_seqlock(_update_seq);
+   write_seqlock_irqsave(_update_seq, flags);
 
 #ifdef CONFIG_NUMA
memset(node_load, 0, sizeof(node_load));
@@ -6077,7 +6078,7 @@ static void __build_all_zonelists(void *data)
 #endif
}
 
-   write_sequnlock(_update_seq);
+   write_sequnlock_irqrestore(_update_seq, flags);
 }
 
 static noinline void __init




[ANNOUNCE] 5.10.204-rt99

2023-12-22 Thread Luis Claudio R. Goncalves
Hello RT-list!

I'm pleased to announce the 5.10.204-rt99 stable release.

This release is just an update to the new stable 5.10.204 version
and no RT-specific changes have been made.

You can get this release via the git tree at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

  branch: v5.10-rt
  Head SHA1: f795cc5faaf7e9c18f67c418d1ca5d8d18658320

Or to build 5.10.204-rt99 directly, the following patches should be applied:

  https://www.kernel.org/pub/linux/kernel/v5.x/linux-5.10.tar.xz

  https://www.kernel.org/pub/linux/kernel/v5.x/patch-5.10.204.xz

  
https://www.kernel.org/pub/linux/kernel/projects/rt/5.10/older/patch-5.10.204-rt99.patch.xz

Signing key fingerprint:

  9354 0649 9972 8D31 D464  D140 F394 A423 F8E6 7C26

All keys used for the above files and repositories can be found on the
following git repository:

   git://git.kernel.org/pub/scm/docs/kernel/pgpkeys.git

Enjoy!
Luis




[ANNOUNCE] 4.14.333-rt158

2023-12-22 Thread Luis Claudio R. Goncalves
Hello RT-list!

I'm pleased to announce the 4.14.333-rt158 stable release.

This release is just an update to the new stable 4.14.333 version
and no RT-specific changes have been performed.

You can get this release via the git tree at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

  branch: v4.14-rt
  Head SHA1: 33cbc8b5610cf1ec35407cc3434385dadb71b8ed

Or to build 4.14.333-rt158 directly, the following patches should be applied:

  https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.14.tar.xz

  https://www.kernel.org/pub/linux/kernel/v4.x/patch-4.14.333.xz

  
https://www.kernel.org/pub/linux/kernel/projects/rt/4.14/older/patch-4.14.333-rt158.patch.xz

Signing key fingerprint:

  9354 0649 9972 8D31 D464  D140 F394 A423 F8E6 7C26

All keys used for the above files and repositories can be found on the
following git repository:

   git://git.kernel.org/pub/scm/docs/kernel/pgpkeys.git

Enjoy!
Luis




Re: [PATCH net-next v3 3/3] net: add netmem_ref to skb_frag_t

2023-12-22 Thread kernel test robot
Hi Mina,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url:
https://github.com/intel-lab-lkp/linux/commits/Mina-Almasry/vsock-virtio-use-skb_frag_-helpers/20231222-164637
base:   net-next/main
patch link:
https://lore.kernel.org/r/20231220214505.2303297-4-almasrymina%40google.com
patch subject: [PATCH net-next v3 3/3] net: add netmem_ref to skb_frag_t
config: powerpc-allmodconfig 
(https://download.01.org/0day-ci/archive/20231223/202312230340.icf8soop-...@intel.com/config)
compiler: clang version 18.0.0git (https://github.com/llvm/llvm-project 
d3ef86708241a3bee902615c190dead1638c4e09)
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20231223/202312230340.icf8soop-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202312230340.icf8soop-...@intel.com/

All errors (new ones prefixed by >>):

>> net/kcm/kcmsock.c:637:39: error: no member named 'bv_len' in 'struct 
>> skb_frag'
 637 | msize += skb_shinfo(skb)->frags[i].bv_len;
 |  ~ ^
   1 error generated.


vim +637 net/kcm/kcmsock.c

cd6e111bf5be5c Tom Herbert   2016-03-07  578  
ab7ac4eb9832e3 Tom Herbert   2016-03-07  579  /* Write any messages ready 
on the kcm socket.  Called with kcm sock lock
ab7ac4eb9832e3 Tom Herbert   2016-03-07  580   * held.  Return bytes 
actually sent or error.
ab7ac4eb9832e3 Tom Herbert   2016-03-07  581   */
ab7ac4eb9832e3 Tom Herbert   2016-03-07  582  static int 
kcm_write_msgs(struct kcm_sock *kcm)
ab7ac4eb9832e3 Tom Herbert   2016-03-07  583  {
c31a25e1db486f David Howells 2023-06-09  584unsigned int total_sent 
= 0;
ab7ac4eb9832e3 Tom Herbert   2016-03-07  585struct sock *sk = 
>sk;
ab7ac4eb9832e3 Tom Herbert   2016-03-07  586struct kcm_psock *psock;
c31a25e1db486f David Howells 2023-06-09  587struct sk_buff *head;
ab7ac4eb9832e3 Tom Herbert   2016-03-07  588int ret = 0;
ab7ac4eb9832e3 Tom Herbert   2016-03-07  589  
ab7ac4eb9832e3 Tom Herbert   2016-03-07  590kcm->tx_wait_more = 
false;
ab7ac4eb9832e3 Tom Herbert   2016-03-07  591psock = kcm->tx_psock;
ab7ac4eb9832e3 Tom Herbert   2016-03-07  592if (unlikely(psock && 
psock->tx_stopped)) {
ab7ac4eb9832e3 Tom Herbert   2016-03-07  593/* A reserved 
psock was aborted asynchronously. Unreserve
ab7ac4eb9832e3 Tom Herbert   2016-03-07  594 * it and we'll 
retry the message.
ab7ac4eb9832e3 Tom Herbert   2016-03-07  595 */
ab7ac4eb9832e3 Tom Herbert   2016-03-07  596
unreserve_psock(kcm);
cd6e111bf5be5c Tom Herbert   2016-03-07  597
kcm_report_tx_retry(kcm);
ab7ac4eb9832e3 Tom Herbert   2016-03-07  598if 
(skb_queue_empty(>sk_write_queue))
ab7ac4eb9832e3 Tom Herbert   2016-03-07  599return 
0;
ab7ac4eb9832e3 Tom Herbert   2016-03-07  600  
c31a25e1db486f David Howells 2023-06-09  601
kcm_tx_msg(skb_peek(>sk_write_queue))->started_tx = false;
ab7ac4eb9832e3 Tom Herbert   2016-03-07  602}
ab7ac4eb9832e3 Tom Herbert   2016-03-07  603  
c31a25e1db486f David Howells 2023-06-09  604  retry:
c31a25e1db486f David Howells 2023-06-09  605while ((head = 
skb_peek(>sk_write_queue))) {
c31a25e1db486f David Howells 2023-06-09  606struct msghdr 
msg = {
c31a25e1db486f David Howells 2023-06-09  607
.msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES,
c31a25e1db486f David Howells 2023-06-09  608};
c31a25e1db486f David Howells 2023-06-09  609struct 
kcm_tx_msg *txm = kcm_tx_msg(head);
c31a25e1db486f David Howells 2023-06-09  610struct sk_buff 
*skb;
c31a25e1db486f David Howells 2023-06-09  611unsigned int 
msize;
c31a25e1db486f David Howells 2023-06-09  612int i;
ab7ac4eb9832e3 Tom Herbert   2016-03-07  613  
c31a25e1db486f David Howells 2023-06-09  614if 
(!txm->started_tx) {
c31a25e1db486f David Howells 2023-06-09  615psock = 
reserve_psock(kcm);
c31a25e1db486f David Howells 2023-06-09  616if 
(!psock)
c31a25e1db486f David Howells 2023-06-09  617
goto out;
c31a25e1db486f David Howells 2023-06-09  618skb = 
head;
c31a25e1db486f David Howells 2023-06-09  619
txm->frag_offset = 0;
c31a25e1db486f David Howells 2023-06-09  620 

Re: [PATCH 1/4] linux/export: Fix alignment for 64-bit ksymtab entries

2023-12-22 Thread Luis Chamberlain
On Fri, Dec 22, 2023 at 04:01:30PM +0900, Masahiro Yamada wrote:
> On Fri, Dec 22, 2023 at 3:08 PM Luis Chamberlain  wrote:
> >
> > On Thu, Dec 21, 2023 at 10:07:13PM -0800, Luis Chamberlain wrote:
> > >
> > > If we want to go bananas we could even get a graph of size of modules
> >
> > Sorry I meant size of number of symbols Vs cost.
> >
> >  Luis
> 
> 
> 
> But, 1/4 is really a bug-fix, isn't it?

Ah you mean a regression fix, yeah sure, thanks I see !

 Luis



Re: [PATCH 2/4] modules: Ensure 64-bit alignment on __ksymtab_* sections

2023-12-22 Thread Luis Chamberlain
On Fri, Dec 22, 2023 at 01:13:26PM +0100, Helge Deller wrote:
> Hi Luis,
> 
> On 12/22/23 06:59, Luis Chamberlain wrote:
> > On Wed, Nov 22, 2023 at 11:18:12PM +0100, del...@kernel.org wrote:
> > > From: Helge Deller 
> > > 
> > > On 64-bit architectures without CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
> > > (e.g. ppc64, ppc64le, parisc, s390x,...) the __KSYM_REF() macro stores
> > > 64-bit pointers into the __ksymtab* sections.
> > > Make sure that those sections will be correctly aligned at module link 
> > > time,
> > > otherwise unaligned memory accesses may happen at runtime.
> > 
> > The ramifications are not explained there.
> 
> What do you mean exactly by this?

You don't explain the impact of not applying this patch.

> > You keep sending me patches with this and we keep doing a nose dive
> > on this. It means I have to do more work.
> Sorry about that. Since you are mentioned as maintainer for modules I
> thought you are the right contact.

I am certaintly the right person to send this patch to, however, I am
saying that given our previous dialog I would like the commit log to
describe a bit more detail of a few things:

 * how you found this
 * what are the impact of not applying it

Specially if you are going to be sending patches right before the
holidays with a Cc stable fix annotation. This gets me on hight alert
and I have to put things down to see how really critical this is so to
decide to fast track this to Linus or not.

> Furthermore, this patch is pretty small and trivial.

Many patches can be but some can break things and some can be small but
also improve things critically, for example if we are aware of broken
exception handlers.

> And I had the impression that people understand that having unaligned
> structures is generally a bad idea as it usually always impacts performance
> (although the performance penalty in this case isn't critical at all,
> as we are not on hot paths).

There are two aspects to this, one is the critical nature which is
implied by your patch which pegs it as a stable fix, given you prior
patches about this it leaves me wondering if it is fixing some crashes
on some systems given faulty exception handlers.

The performance thing is also subjective, but it could not be subjective
by doing some very trivial tests as I suggested. Such a test would also
help as we lack specific selftsts for this case and we can grow it later
with other things. I figured I'd ask you to try it, since *you* keep
sending patches about misalignments on module sections so I figured
you must be seeing something others are not, and so you must care.

> > Thoughts?
> 
> I really appreciate your thoughts here!
> 
> But given the triviality of this patch, I personally think you are
> proposing a huge academic investment in time and resources with no real 
> benefit.
> On which architecture would you suggest to test this?
> What would be the effective (really new) outcome?
> If the performance penalty is zero or close to zero, will that imply
> that such a patch isn't useful?
> Please also keep in mind that not everyone gets paid to work on the kernel,
> so I have neither the time nor the various architectures to test on.

I think you make this seem so difficult, but I understand it could seem
that way. I've attached at the end of this email a patch that does just
this then to help.

> So, honestly I don't see a real reason why it shouldn't be applied...

Like I said, you Cc'd stable as a fix, as a maintainer it is my job to
verify how critical this is and ask for more details about how you found
it and evaluate the real impact. Even if it was not a stable fix I tend
to ask this for patches, even if they are trivial.

> > > Signed-off-by: Helge Deller 
> > > Cc:  # v6.0+
> > 
> > That's a stretch without any data, don't you think?
> 
> Yes. No need to push such a patch to stable series.

OK, can you extend the patch below with something like:

perf stat --repeat 100 --pre 'modprobe -r b a b c' -- 
./tools/testing/selftests/module/find_symbol.sh

And test before and after?

I ran a simple test as-is and the data I get is within noise, and so
I think we need the --repeat 100 thing.

---
before:
sudo ./tools/testing/selftests/module/find_symbol.sh 

 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

81,956,206 ns   duration_time   
  
81,883,000 ns   system_time 
  
   210  page-faults 
  

   0.081956206 seconds time elapsed

   0.0 seconds user
   0.081883000 seconds sys



 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

85,960,863 ns   duration_time   
  
84,679,000 ns   system_time  

Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2023-12-22 Thread Christophe Leroy


Le 22/12/2023 à 06:35, Kees Cook a écrit :
> [Vous ne recevez pas souvent de courriers de k...@kernel.org. Découvrez 
> pourquoi ceci est important à https://aka.ms/LearnAboutSenderIdentification ]
> 
> On December 21, 2023 4:16:56 AM PST, Michael Ellerman  
> wrote:
>> Cc +Kees
>>
>> Christophe Leroy  writes:
>>> Declaring rodata_enabled and mark_rodata_ro() at all time
>>> helps removing related #ifdefery in C files.
>>>
>>> Signed-off-by: Christophe Leroy 
>>> ---
>>>   include/linux/init.h |  4 
>>>   init/main.c  | 21 +++--
>>>   2 files changed, 7 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/include/linux/init.h b/include/linux/init.h
>>> index 01b52c9c7526..d2b47be38a07 100644
>>> --- a/include/linux/init.h
>>> +++ b/include/linux/init.h
>>> @@ -168,12 +168,8 @@ extern initcall_entry_t __initcall_end[];
>>>
>>>   extern struct file_system_type rootfs_fs_type;
>>>
>>> -#if defined(CONFIG_STRICT_KERNEL_RWX) || defined(CONFIG_STRICT_MODULE_RWX)
>>>   extern bool rodata_enabled;
>>> -#endif
>>> -#ifdef CONFIG_STRICT_KERNEL_RWX
>>>   void mark_rodata_ro(void);
>>> -#endif
>>>
>>>   extern void (*late_time_init)(void);
>>>
>>> diff --git a/init/main.c b/init/main.c
>>> index e24b0780fdff..807df08c501f 100644
>>> --- a/init/main.c
>>> +++ b/init/main.c
>>> @@ -1396,10 +1396,9 @@ static int __init set_debug_rodata(char *str)
>>>   early_param("rodata", set_debug_rodata);
>>>   #endif
>>>
>>> -#ifdef CONFIG_STRICT_KERNEL_RWX
>>>   static void mark_readonly(void)
>>>   {
>>> -if (rodata_enabled) {
>>> +if (IS_ENABLED(CONFIG_STRICT_KERNEL_RWX) && rodata_enabled) {
> 
> I think this will break without rodata_enabled actual existing on other 
> architectures. (Only declaration was made visible, not the definition, which 
> is above here and still behind ifdefs?)

The compiler constant-folds IS_ENABLED(CONFIG_STRICT_KERNEL_RWX).
When it is false, the second part is dropped.

Exemple:

bool test(void)
{
if (IS_ENABLED(CONFIG_STRICT_KERNEL_RWX) && rodata_enabled)
return true;
else
return false;
}

With CONFIG_STRICT_KERNEL_RWX set, it directly returns the content of 
rodata_enabled:

0160 :
  160:  3d 20 00 00 lis r9,0
162: R_PPC_ADDR16_HArodata_enabled
  164:  88 69 00 00 lbz r3,0(r9)
166: R_PPC_ADDR16_LOrodata_enabled
  168:  4e 80 00 20 blr

With CONFIG_STRICT_KERNEL_RWX unset, it returns 0 and doesn't reference 
rodata_enabled at all:

00bc :
   bc:  38 60 00 00 li  r3,0
   c0:  4e 80 00 20 blr

Many places in the kernel use this approach to minimise amount of #ifdefs.

Christophe


Re: [RFC PATCH 1/5] dt-bindings: mfd: add entry for the Marvell 88PM88X PMICs

2023-12-22 Thread Karel Balej
Rob,

thank you very much for your feedback.

On Mon Dec 18, 2023 at 4:17 PM CET, Rob Herring wrote:
> > +  Marvell 88PM880 and 88PM886 are two similar PMICs providing
> > +  several functions such as onkey, regulators or battery and
> > +  charger. Both seem to come in two revisions -- A0 and A1.
> > +
> > +properties:
> > +  compatible:
> > +const: marvell,88pm886-a1
>
> The description talks about 4 different devices, but only 1 here. 
>
> Do you expect to need A0 support? Devices with these PMICs should be 
> known and few, right? 

I know of three smartphones which have 88PM886 and all of them (at least
the revisions that have been tested) seem to use A1. So no, I don't know
of any device that would need A0, but I wanted have the driver ready in
case somebody needed to add it later. What change do you then suggest?

Thank you and kind regards,
K. B.



Re: [PATCH bpf-next 3/3] bpf: introduce bpf_relay_output helper

2023-12-22 Thread Jiri Olsa
On Fri, Dec 22, 2023 at 08:21:46PM +0800, Philo Lu wrote:
> Like perfbuf/ringbuf, a helper is needed to write into the buffer, named
> bpf_relay_output, whose usage is same as ringbuf. Note that it works only
> after relay files are set, i.e., after calling map_update_elem for the
> created relay map.

we can't add helpers anymore, this will need to be kfunc
check functions marked with __bpf_kfunc as examples

jirka

> 
> Signed-off-by: Philo Lu 
> ---
>  include/linux/bpf.h  |  1 +
>  include/uapi/linux/bpf.h | 10 ++
>  kernel/bpf/helpers.c |  4 
>  kernel/bpf/relaymap.c| 26 ++
>  kernel/bpf/verifier.c|  8 
>  kernel/trace/bpf_trace.c |  4 
>  6 files changed, 53 insertions(+)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 7671530d6e4e..b177122369e6 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -3095,6 +3095,7 @@ extern const struct bpf_func_proto bpf_get_retval_proto;
>  extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto;
>  extern const struct bpf_func_proto bpf_cgrp_storage_get_proto;
>  extern const struct bpf_func_proto bpf_cgrp_storage_delete_proto;
> +extern const struct bpf_func_proto bpf_relay_output_proto;
>  
>  const struct bpf_func_proto *tracing_prog_func_proto(
>enum bpf_func_id func_id, const struct bpf_prog *prog);
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 143b75676bd3..03c0c1953ba1 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5686,6 +5686,15 @@ union bpf_attr {
>   *   0 on success.
>   *
>   *   **-ENOENT** if the bpf_local_storage cannot be found.
> + *
> + * long bpf_relay_output(void *map, void *data, u64 size, u64 flags)
> + *   Description
> + *   Copy *size* bytes from *data* into *map* of type 
> BPF_MAP_TYPE_RELAY.
> + *   Currently, the *flags* must be 0.
> + *   Return
> + *   0 on success.
> + *
> + *   **-ENOENT** if the relay base_file in debugfs cannot be found.
>   */
>  #define ___BPF_FUNC_MAPPER(FN, ctx...)   \
>   FN(unspec, 0, ##ctx)\
> @@ -5900,6 +5909,7 @@ union bpf_attr {
>   FN(user_ringbuf_drain, 209, ##ctx)  \
>   FN(cgrp_storage_get, 210, ##ctx)\
>   FN(cgrp_storage_delete, 211, ##ctx) \
> + FN(relay_output, 212, ##ctx)\
>   /* */
>  
>  /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index be72824f32b2..0c26e87ce729 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1720,6 +1720,10 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>   return _ringbuf_discard_proto;
>   case BPF_FUNC_ringbuf_query:
>   return _ringbuf_query_proto;
> +#ifdef CONFIG_RELAY
> + case BPF_FUNC_relay_output:
> + return _relay_output_proto;
> +#endif
>   case BPF_FUNC_strncmp:
>   return _strncmp_proto;
>   case BPF_FUNC_strtol:
> diff --git a/kernel/bpf/relaymap.c b/kernel/bpf/relaymap.c
> index 588c8de0a4bd..f9e2e4a780df 100644
> --- a/kernel/bpf/relaymap.c
> +++ b/kernel/bpf/relaymap.c
> @@ -173,6 +173,32 @@ static u64 relay_map_mem_usage(const struct bpf_map *map)
>   return usage;
>  }
>  
> +BPF_CALL_4(bpf_relay_output, struct bpf_map *, map, void *, data, u64, size,
> +u64, flags)
> +{
> + struct bpf_relay_map *rmap;
> +
> + /* not support any flag now */
> + if (unlikely(flags))
> + return -EINVAL;
> +
> + rmap = container_of(map, struct bpf_relay_map, map);
> + if (!rmap->relay_chan->has_base_filename)
> + return -ENOENT;
> +
> + relay_write(rmap->relay_chan, data, size);
> + return 0;
> +}
> +
> +const struct bpf_func_proto bpf_relay_output_proto = {
> + .func   = bpf_relay_output,
> + .ret_type   = RET_INTEGER,
> + .arg1_type  = ARG_CONST_MAP_PTR,
> + .arg2_type  = ARG_PTR_TO_MEM | MEM_RDONLY,
> + .arg3_type  = ARG_CONST_SIZE_OR_ZERO,
> + .arg4_type  = ARG_ANYTHING,
> +};
> +
>  BTF_ID_LIST_SINGLE(relay_map_btf_ids, struct, bpf_relay_map)
>  const struct bpf_map_ops relay_map_ops = {
>   .map_meta_equal = bpf_map_meta_equal,
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index f13008d27f35..8c8287d6ae18 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -8800,6 +8800,10 @@ static int check_map_func_compatibility(struct 
> bpf_verifier_env *env,
>   if (func_id != BPF_FUNC_user_ringbuf_drain)
>   goto error;
>   break;
> + case BPF_MAP_TYPE_RELAY:
> + if (func_id != BPF_FUNC_relay_output)
> + goto error;
> + break;
>   case BPF_MAP_TYPE_STACK_TRACE:
>   if (func_id != BPF_FUNC_get_stackid)

Re: [PATCH bpf-next 2/3] bpf: implement map_update_elem to init relay file

2023-12-22 Thread Jiri Olsa
On Fri, Dec 22, 2023 at 08:21:45PM +0800, Philo Lu wrote:
> map_update_elem is used to create relay files and bind them with the
> relay channel, which is created with BPF_MAP_CREATE. This allows users
> to set a custom directory name. It must be used with key=NULL and
> flag=0.
> 
> Here is an example:
> ```
> struct {
> __uint(type, BPF_MAP_TYPE_RELAY);
> __uint(max_entries, 4096);
> } my_relay SEC(".maps");
> ...
> char dir_name[] = "relay_test";
> bpf_map_update_elem(map_fd, NULL, dir_name, 0);
> ```
> 
> Then, directory `/sys/kerenl/debug/relay_test` will be created, which
> includes files of my_relay0...my_relay[#cpu]. Each represents a per-cpu
> buffer with size 8 * 4096 B (there are 8 subbufs by default, each with
> size 4096B).
> 
> Signed-off-by: Philo Lu 
> ---
>  kernel/bpf/relaymap.c | 32 +++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/bpf/relaymap.c b/kernel/bpf/relaymap.c
> index d0adc7f67758..588c8de0a4bd 100644
> --- a/kernel/bpf/relaymap.c
> +++ b/kernel/bpf/relaymap.c
> @@ -117,7 +117,37 @@ static void *relay_map_lookup_elem(struct bpf_map *map, 
> void *key)
>  static long relay_map_update_elem(struct bpf_map *map, void *key, void 
> *value,
>  u64 flags)
>  {
> - return -EOPNOTSUPP;
> + struct bpf_relay_map *rmap;
> + struct dentry *parent;
> + int err;
> +
> + if (unlikely(flags))
> + return -EINVAL;
> +
> + if (unlikely(key))
> + return -EINVAL;
> +
> + rmap = container_of(map, struct bpf_relay_map, map);
> +
> + /* The directory already exists */
> + if (rmap->relay_chan->has_base_filename)
> + return -EEXIST;
> +
> + /* Setup relay files. Note that the directory name passed as value 
> should
> +  * not be longer than map->value_size, including the '\0' at the end.
> +  */
> + ((char *)value)[map->value_size - 1] = '\0';
> + parent = debugfs_create_dir(value, NULL);
> + if (IS_ERR_OR_NULL(parent))
> + return PTR_ERR(parent);
> +
> + err = relay_late_setup_files(rmap->relay_chan, map->name, parent);
> + if (err) {
> + debugfs_remove_recursive(parent);
> + return err;
> + }

looks like this patch could be moved to the previous one?

jirka

> +
> + return 0;
>  }
>  
>  static long relay_map_delete_elem(struct bpf_map *map, void *key)
> -- 
> 2.32.0.3.g01195cf9f
> 
> 



Re: [PATCH bpf-next 1/3] bpf: implement relay map basis

2023-12-22 Thread Jiri Olsa
On Fri, Dec 22, 2023 at 08:21:44PM +0800, Philo Lu wrote:

SNIP

> +/* bpf_attr is used as follows:
> + * - key size: must be 0
> + * - value size: value will be used as directory name by map_update_elem
> + *   (to create relay files). If passed as 0, it will be set to NAME_MAX as
> + *   default
> + *
> + * - max_entries: subbuf size
> + * - map_extra: subbuf num, default as 8
> + *
> + * When alloc, we do not set up relay files considering dir_name conflicts.
> + * Instead we use relay_late_setup_files() in map_update_elem(), and thus the
> + * value is used as dir_name, and map->name is used as base_filename.
> + */
> +static struct bpf_map *relay_map_alloc(union bpf_attr *attr)
> +{
> + struct bpf_relay_map *rmap;
> +
> + if (unlikely(attr->map_flags & ~RELAY_CREATE_FLAG_MASK))
> + return ERR_PTR(-EINVAL);
> +
> + /* key size must be 0 in relay map */
> + if (unlikely(attr->key_size))
> + return ERR_PTR(-EINVAL);
> +
> + if (unlikely(attr->value_size > NAME_MAX)) {
> + pr_warn("value_size should be no more than %d\n", NAME_MAX);
> + return ERR_PTR(-EINVAL);
> + } else if (attr->value_size == 0)
> + attr->value_size = NAME_MAX;

the concept of no key with just value seems strange.. I never worked
with relay channels, so perhaps stupid question: but why not have one
relay channel for given key? having the debugfs like:

  /sys/kernel/debug/my_rmap/mychannel/

> +
> + /* set default subbuf num */
> + attr->map_extra = attr->map_extra & UINT_MAX;
> + if (!attr->map_extra)
> + attr->map_extra = 8;
> +
> + if (!attr->map_name || strlen(attr->map_name) == 0)

attr->map_name is allways != NULL

> + return ERR_PTR(-EINVAL);
> +
> + rmap = bpf_map_area_alloc(sizeof(*rmap), NUMA_NO_NODE);
> + if (!rmap)
> + return ERR_PTR(-ENOMEM);
> +
> + bpf_map_init_from_attr(>map, attr);
> +
> + rmap->relay_cb.create_buf_file = create_buf_file_handler;
> + rmap->relay_cb.remove_buf_file = remove_buf_file_handler;
> + if (attr->map_flags & BPF_F_OVERWRITE)
> + rmap->relay_cb.subbuf_start = subbuf_start_overwrite;
> +
> + rmap->relay_chan = relay_open(NULL, NULL,
> + attr->max_entries, 
> attr->map_extra,
> + >relay_cb, NULL);

wrong indentation

> + if (!rmap->relay_chan)
> + return ERR_PTR(-EINVAL);
> +
> + return >map;
> +}
> +
> +static void relay_map_free(struct bpf_map *map)
> +{
> + struct bpf_relay_map *rmap;
> +
> + rmap = container_of(map, struct bpf_relay_map, map);
> + relay_close(rmap->relay_chan);
> + debugfs_remove_recursive(rmap->relay_chan->parent);
> + kfree(rmap);

should you use bpf_map_area_free instead?

jirka

> +}
> +
> +static void *relay_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> + return ERR_PTR(-EOPNOTSUPP);
> +}
> +
> +static long relay_map_update_elem(struct bpf_map *map, void *key, void 
> *value,
> +u64 flags)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +static long relay_map_delete_elem(struct bpf_map *map, void *key)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +static int relay_map_get_next_key(struct bpf_map *map, void *key,
> + void *next_key)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +static u64 relay_map_mem_usage(const struct bpf_map *map)
> +{
> + struct bpf_relay_map *rmap;
> + u64 usage = sizeof(struct bpf_relay_map);
> +
> + rmap = container_of(map, struct bpf_relay_map, map);
> + usage += sizeof(struct rchan);
> + usage += (sizeof(struct rchan_buf) + rmap->relay_chan->alloc_size)
> +  * num_online_cpus();
> + return usage;
> +}
> +
> +BTF_ID_LIST_SINGLE(relay_map_btf_ids, struct, bpf_relay_map)
> +const struct bpf_map_ops relay_map_ops = {
> + .map_meta_equal = bpf_map_meta_equal,
> + .map_alloc = relay_map_alloc,
> + .map_free = relay_map_free,
> + .map_lookup_elem = relay_map_lookup_elem,
> + .map_update_elem = relay_map_update_elem,
> + .map_delete_elem = relay_map_delete_elem,
> + .map_get_next_key = relay_map_get_next_key,
> + .map_mem_usage = relay_map_mem_usage,
> + .map_btf_id = _map_btf_ids[0],
> +};
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 1bf9805ee185..35ae54ac6736 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1147,6 +1147,7 @@ static int map_create(union bpf_attr *attr)
>   }
>  
>   if (attr->map_type != BPF_MAP_TYPE_BLOOM_FILTER &&
> + attr->map_type != BPF_MAP_TYPE_RELAY &&
>   attr->map_extra != 0)
>   return -EINVAL;
>  
> -- 
> 2.32.0.3.g01195cf9f
> 



Re: [PATCH bpf-next 0/3] bpf: introduce BPF_MAP_TYPE_RELAY

2023-12-22 Thread Jiri Olsa
On Fri, Dec 22, 2023 at 08:21:43PM +0800, Philo Lu wrote:
> The patch set introduce a new type of map, BPF_MAP_TYPE_RELAY, based on
> relay interface [0]. It provides a way for persistent and overwritable data
> transfer.
> 
> As stated in [0], relay is a efficient method for log and data transfer.
> And the interface is simple enough so that we can implement and use this
> type of map with current map interfaces. Besides we need a new helper
> bpf_relay_output to output data to user, similar with bpf_ringbuf_output.
> 
> We need this map because currently neither ringbuf nor perfbuf satisfies
> the requirements of relatively long-term consistent tracing, where the bpf
> program keeps writing into the buffer without any bundled reader, and the
> buffer supports overwriting. For users, they just run the bpf program to
> collect data, and are able to read as need. The detailed discussion can be
> found at [1].
> 
> The buffer is exposed to users as per-cpu files in debugfs, supporting read
> and mmap, and it is up to users how to formulate and read it, either
> through a program with mmap or just `cat`. Specifically, the files are
> created as "/sys/kerenl/debug//#cpu", where the 
> is defined with map_update_elem (Note that we do not need to implement
> actual elem operators for relay_map).
> 
> If this map is acceptable, other parts including docs, libbpf support,
> selftests, and benchmarks (if need) will be added in the following version.

looks useful, selftests might be already helpful to see the usage

jirka

> 
> [0]
> https://github.com/torvalds/linux/blob/master/Documentation/filesystems/relay.rst
> [1]
> https://lore.kernel.org/bpf/20231219122850.433be...@gandalf.local.home/T/
> 
> Philo Lu (3):
>   bpf: implement relay map basis
>   bpf: implement map_update_elem to init relay file
>   bpf: introduce bpf_relay_output helper
> 
>  include/linux/bpf.h   |   1 +
>  include/linux/bpf_types.h |   3 +
>  include/uapi/linux/bpf.h  |  17 +++
>  kernel/bpf/Makefile   |   3 +
>  kernel/bpf/helpers.c  |   4 +
>  kernel/bpf/relaymap.c | 213 ++
>  kernel/bpf/syscall.c  |   1 +
>  kernel/bpf/verifier.c |   8 ++
>  kernel/trace/bpf_trace.c  |   4 +
>  9 files changed, 254 insertions(+)
>  create mode 100644 kernel/bpf/relaymap.c
> 
> --
> 2.32.0.3.g01195cf9f
> 



[PATCH bpf-next 3/3] bpf: introduce bpf_relay_output helper

2023-12-22 Thread Philo Lu
Like perfbuf/ringbuf, a helper is needed to write into the buffer, named
bpf_relay_output, whose usage is same as ringbuf. Note that it works only
after relay files are set, i.e., after calling map_update_elem for the
created relay map.

Signed-off-by: Philo Lu 
---
 include/linux/bpf.h  |  1 +
 include/uapi/linux/bpf.h | 10 ++
 kernel/bpf/helpers.c |  4 
 kernel/bpf/relaymap.c| 26 ++
 kernel/bpf/verifier.c|  8 
 kernel/trace/bpf_trace.c |  4 
 6 files changed, 53 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 7671530d6e4e..b177122369e6 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3095,6 +3095,7 @@ extern const struct bpf_func_proto bpf_get_retval_proto;
 extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto;
 extern const struct bpf_func_proto bpf_cgrp_storage_get_proto;
 extern const struct bpf_func_proto bpf_cgrp_storage_delete_proto;
+extern const struct bpf_func_proto bpf_relay_output_proto;
 
 const struct bpf_func_proto *tracing_prog_func_proto(
   enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 143b75676bd3..03c0c1953ba1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5686,6 +5686,15 @@ union bpf_attr {
  * 0 on success.
  *
  * **-ENOENT** if the bpf_local_storage cannot be found.
+ *
+ * long bpf_relay_output(void *map, void *data, u64 size, u64 flags)
+ * Description
+ * Copy *size* bytes from *data* into *map* of type 
BPF_MAP_TYPE_RELAY.
+ * Currently, the *flags* must be 0.
+ * Return
+ * 0 on success.
+ *
+ * **-ENOENT** if the relay base_file in debugfs cannot be found.
  */
 #define ___BPF_FUNC_MAPPER(FN, ctx...) \
FN(unspec, 0, ##ctx)\
@@ -5900,6 +5909,7 @@ union bpf_attr {
FN(user_ringbuf_drain, 209, ##ctx)  \
FN(cgrp_storage_get, 210, ##ctx)\
FN(cgrp_storage_delete, 211, ##ctx) \
+   FN(relay_output, 212, ##ctx)\
/* */
 
 /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index be72824f32b2..0c26e87ce729 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1720,6 +1720,10 @@ bpf_base_func_proto(enum bpf_func_id func_id)
return _ringbuf_discard_proto;
case BPF_FUNC_ringbuf_query:
return _ringbuf_query_proto;
+#ifdef CONFIG_RELAY
+   case BPF_FUNC_relay_output:
+   return _relay_output_proto;
+#endif
case BPF_FUNC_strncmp:
return _strncmp_proto;
case BPF_FUNC_strtol:
diff --git a/kernel/bpf/relaymap.c b/kernel/bpf/relaymap.c
index 588c8de0a4bd..f9e2e4a780df 100644
--- a/kernel/bpf/relaymap.c
+++ b/kernel/bpf/relaymap.c
@@ -173,6 +173,32 @@ static u64 relay_map_mem_usage(const struct bpf_map *map)
return usage;
 }
 
+BPF_CALL_4(bpf_relay_output, struct bpf_map *, map, void *, data, u64, size,
+  u64, flags)
+{
+   struct bpf_relay_map *rmap;
+
+   /* not support any flag now */
+   if (unlikely(flags))
+   return -EINVAL;
+
+   rmap = container_of(map, struct bpf_relay_map, map);
+   if (!rmap->relay_chan->has_base_filename)
+   return -ENOENT;
+
+   relay_write(rmap->relay_chan, data, size);
+   return 0;
+}
+
+const struct bpf_func_proto bpf_relay_output_proto = {
+   .func   = bpf_relay_output,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_CONST_MAP_PTR,
+   .arg2_type  = ARG_PTR_TO_MEM | MEM_RDONLY,
+   .arg3_type  = ARG_CONST_SIZE_OR_ZERO,
+   .arg4_type  = ARG_ANYTHING,
+};
+
 BTF_ID_LIST_SINGLE(relay_map_btf_ids, struct, bpf_relay_map)
 const struct bpf_map_ops relay_map_ops = {
.map_meta_equal = bpf_map_meta_equal,
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f13008d27f35..8c8287d6ae18 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8800,6 +8800,10 @@ static int check_map_func_compatibility(struct 
bpf_verifier_env *env,
if (func_id != BPF_FUNC_user_ringbuf_drain)
goto error;
break;
+   case BPF_MAP_TYPE_RELAY:
+   if (func_id != BPF_FUNC_relay_output)
+   goto error;
+   break;
case BPF_MAP_TYPE_STACK_TRACE:
if (func_id != BPF_FUNC_get_stackid)
goto error;
@@ -8932,6 +8936,10 @@ static int check_map_func_compatibility(struct 
bpf_verifier_env *env,
if (map->map_type != BPF_MAP_TYPE_USER_RINGBUF)
goto error;
break;
+   case BPF_FUNC_relay_output:
+   if (map->map_type != 

[PATCH bpf-next 0/3] bpf: introduce BPF_MAP_TYPE_RELAY

2023-12-22 Thread Philo Lu
The patch set introduce a new type of map, BPF_MAP_TYPE_RELAY, based on
relay interface [0]. It provides a way for persistent and overwritable data
transfer.

As stated in [0], relay is a efficient method for log and data transfer.
And the interface is simple enough so that we can implement and use this
type of map with current map interfaces. Besides we need a new helper
bpf_relay_output to output data to user, similar with bpf_ringbuf_output.

We need this map because currently neither ringbuf nor perfbuf satisfies
the requirements of relatively long-term consistent tracing, where the bpf
program keeps writing into the buffer without any bundled reader, and the
buffer supports overwriting. For users, they just run the bpf program to
collect data, and are able to read as need. The detailed discussion can be
found at [1].

The buffer is exposed to users as per-cpu files in debugfs, supporting read
and mmap, and it is up to users how to formulate and read it, either
through a program with mmap or just `cat`. Specifically, the files are
created as "/sys/kerenl/debug//#cpu", where the 
is defined with map_update_elem (Note that we do not need to implement
actual elem operators for relay_map).

If this map is acceptable, other parts including docs, libbpf support,
selftests, and benchmarks (if need) will be added in the following version.

[0]
https://github.com/torvalds/linux/blob/master/Documentation/filesystems/relay.rst
[1]
https://lore.kernel.org/bpf/20231219122850.433be...@gandalf.local.home/T/

Philo Lu (3):
  bpf: implement relay map basis
  bpf: implement map_update_elem to init relay file
  bpf: introduce bpf_relay_output helper

 include/linux/bpf.h   |   1 +
 include/linux/bpf_types.h |   3 +
 include/uapi/linux/bpf.h  |  17 +++
 kernel/bpf/Makefile   |   3 +
 kernel/bpf/helpers.c  |   4 +
 kernel/bpf/relaymap.c | 213 ++
 kernel/bpf/syscall.c  |   1 +
 kernel/bpf/verifier.c |   8 ++
 kernel/trace/bpf_trace.c  |   4 +
 9 files changed, 254 insertions(+)
 create mode 100644 kernel/bpf/relaymap.c

--
2.32.0.3.g01195cf9f




[PATCH bpf-next 2/3] bpf: implement map_update_elem to init relay file

2023-12-22 Thread Philo Lu
map_update_elem is used to create relay files and bind them with the
relay channel, which is created with BPF_MAP_CREATE. This allows users
to set a custom directory name. It must be used with key=NULL and
flag=0.

Here is an example:
```
struct {
__uint(type, BPF_MAP_TYPE_RELAY);
__uint(max_entries, 4096);
} my_relay SEC(".maps");
...
char dir_name[] = "relay_test";
bpf_map_update_elem(map_fd, NULL, dir_name, 0);
```

Then, directory `/sys/kerenl/debug/relay_test` will be created, which
includes files of my_relay0...my_relay[#cpu]. Each represents a per-cpu
buffer with size 8 * 4096 B (there are 8 subbufs by default, each with
size 4096B).

Signed-off-by: Philo Lu 
---
 kernel/bpf/relaymap.c | 32 +++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/relaymap.c b/kernel/bpf/relaymap.c
index d0adc7f67758..588c8de0a4bd 100644
--- a/kernel/bpf/relaymap.c
+++ b/kernel/bpf/relaymap.c
@@ -117,7 +117,37 @@ static void *relay_map_lookup_elem(struct bpf_map *map, 
void *key)
 static long relay_map_update_elem(struct bpf_map *map, void *key, void *value,
   u64 flags)
 {
-   return -EOPNOTSUPP;
+   struct bpf_relay_map *rmap;
+   struct dentry *parent;
+   int err;
+
+   if (unlikely(flags))
+   return -EINVAL;
+
+   if (unlikely(key))
+   return -EINVAL;
+
+   rmap = container_of(map, struct bpf_relay_map, map);
+
+   /* The directory already exists */
+   if (rmap->relay_chan->has_base_filename)
+   return -EEXIST;
+
+   /* Setup relay files. Note that the directory name passed as value 
should
+* not be longer than map->value_size, including the '\0' at the end.
+*/
+   ((char *)value)[map->value_size - 1] = '\0';
+   parent = debugfs_create_dir(value, NULL);
+   if (IS_ERR_OR_NULL(parent))
+   return PTR_ERR(parent);
+
+   err = relay_late_setup_files(rmap->relay_chan, map->name, parent);
+   if (err) {
+   debugfs_remove_recursive(parent);
+   return err;
+   }
+
+   return 0;
 }
 
 static long relay_map_delete_elem(struct bpf_map *map, void *key)
-- 
2.32.0.3.g01195cf9f




[PATCH bpf-next 1/3] bpf: implement relay map basis

2023-12-22 Thread Philo Lu
BPF_MAP_TYPE_RELAY is implemented based on relay interface, which
creates per-cpu buffer to transfer data. Each buffer is essentially a
list of fix-sized sub-buffers, and is exposed to user space as files in
debugfs. attr->max_entries is used as subbuf size and attr->map_extra is
used as subbuf num. Currently, the default value of subbuf num is 8.

The data can be accessed by read or mmap via these files. For example,
if there are 2 cpus, files could be `/sys/kernel/debug/mydir/my_rmap0`
and `/sys/kernel/debug/mydir/my_rmap1`.

Buffer-only mode is used to create the relay map, which just allocates
the buffer without creating user-space files. Then user can setup the
files with map_update_elem, thus allowing user to define the directory
name in debugfs. map_update_elem is implemented in the following patch.

A new map flag named BPF_F_OVERWRITE is introduced to set overwrite mode
of relay map.

Signed-off-by: Philo Lu 
---
 include/linux/bpf_types.h |   3 +
 include/uapi/linux/bpf.h  |   7 ++
 kernel/bpf/Makefile   |   3 +
 kernel/bpf/relaymap.c | 157 ++
 kernel/bpf/syscall.c  |   1 +
 5 files changed, 171 insertions(+)
 create mode 100644 kernel/bpf/relaymap.c

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index fc0d6f32c687..c122d7b494c5 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -132,6 +132,9 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, 
bpf_struct_ops_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_USER_RINGBUF, user_ringbuf_map_ops)
+#ifdef CONFIG_RELAY
+BPF_MAP_TYPE(BPF_MAP_TYPE_RELAY, relay_map_ops)
+#endif
 
 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 754e68ca8744..143b75676bd3 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -951,6 +951,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_BLOOM_FILTER,
BPF_MAP_TYPE_USER_RINGBUF,
BPF_MAP_TYPE_CGRP_STORAGE,
+   BPF_MAP_TYPE_RELAY,
 };
 
 /* Note that tracing related programs such as
@@ -1330,6 +1331,9 @@ enum {
 
 /* Get path from provided FD in BPF_OBJ_PIN/BPF_OBJ_GET commands */
BPF_F_PATH_FD   = (1U << 14),
+
+/* Enable overwrite for relay map */
+   BPF_F_OVERWRITE = (1U << 15),
 };
 
 /* Flags for BPF_PROG_QUERY. */
@@ -1401,6 +1405,9 @@ union bpf_attr {
 * BPF_MAP_TYPE_BLOOM_FILTER - the lowest 4 bits indicate the
 * number of hash functions (if 0, the bloom filter will default
 * to using 5 hash functions).
+*
+* BPF_MAP_TYPE_RELAY - the lowest 32 bits indicate the number 
of
+* relay subbufs (if 0, the number will be set to 8 by default).
 */
__u64   map_extra;
};
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index f526b7573e97..45b35bb0e572 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -10,6 +10,9 @@ obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o 
helpers.o tnum.o log.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_iter.o map_iter.o task_iter.o prog_iter.o 
link_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o 
bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
+ifeq ($(CONFIG_RELAY),y)
+obj-$(CONFIG_BPF_SYSCALL) += relaymap.o
+endif
 obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM}+= bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
diff --git a/kernel/bpf/relaymap.c b/kernel/bpf/relaymap.c
new file mode 100644
index ..d0adc7f67758
--- /dev/null
+++ b/kernel/bpf/relaymap.c
@@ -0,0 +1,157 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define RELAY_CREATE_FLAG_MASK (BPF_F_OVERWRITE)
+
+struct bpf_relay_map {
+   struct bpf_map map;
+   struct rchan *relay_chan;
+   struct rchan_callbacks relay_cb;
+};
+
+static struct dentry *create_buf_file_handler(const char *filename,
+  struct dentry 
*parent, umode_t mode,
+  struct rchan_buf 
*buf, int *is_global)
+{
+   /* Because we do relay_late_setup_files(), create_buf_file(NULL, NULL, 
...)
+* will be called by relay_open.
+*/
+   if (!filename)
+   return NULL;
+
+   return debugfs_create_file(filename, mode, parent, buf,
+  _file_operations);
+}
+
+static int remove_buf_file_handler(struct dentry *dentry)
+{
+   debugfs_remove(dentry);
+   return 0;
+}
+
+/* For non-overwrite, use default subbuf_start cb */
+static int 

Re: [PATCH 2/4] modules: Ensure 64-bit alignment on __ksymtab_* sections

2023-12-22 Thread Helge Deller

Hi Luis,

On 12/22/23 06:59, Luis Chamberlain wrote:

On Wed, Nov 22, 2023 at 11:18:12PM +0100, del...@kernel.org wrote:

From: Helge Deller 

On 64-bit architectures without CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
(e.g. ppc64, ppc64le, parisc, s390x,...) the __KSYM_REF() macro stores
64-bit pointers into the __ksymtab* sections.
Make sure that those sections will be correctly aligned at module link time,
otherwise unaligned memory accesses may happen at runtime.


The ramifications are not explained there.


What do you mean exactly by this?


You keep sending me patches with this and we keep doing a nose dive
on this. It means I have to do more work.

Sorry about that. Since you are mentioned as maintainer for modules I
thought you are the right contact.

Furthermore, this patch is pretty small and trivial.
And I had the impression that people understand that having unaligned
structures is generally a bad idea as it usually always impacts performance
(although the performance penalty in this case isn't critical at all,
as we are not on hot paths).


So as I had suggested with your patch which I merged in
commit 87c482bdfa79 ("modules: Ensure natural alignment for
.altinstructions and __bug_table sections") please clarify the
impact of not merging this patch. Also last time you noticed the
misalignment due to a faulty exception handler, please mention how
you found this out now.


Sure.


And since this is not your first patch on the exact related subject
I'd appreciate if you can show me perf stat results differences between
having and not having this patch merged. Why? Because we talk about
a performance penalthy, but we are not saying how much, and since this
is an ongoing thing, we might as well have a tool belt with ways to
measure such performance impact to bring clarity and value to this
and future related patches.


The __kcrctab* sections store 32-bit entities, so use ALIGN(4) for those.


I've given some thought about how to test this. Sadly perf kallsysms
just opens the /proc/kallsysms file, but that's fine, we need our own
test.

I think a 3 new simple modules selftest would do it and running perf
stat on it. One module, let us call it module A which constructs its own
name space prefix for its exported symbols and has tons of silly symbols
for arbitrary data, whatever. We then have module B which refers to a
few arbitrary symbols from module A, hopefully spread out linearly, so
if module A had 10,000 symbols, we'd have module A refer to a symbol
ever 1,000 symbols. Finally we want a symbol C which has say, 50,000
symbols all of which will not be used at all by the first two modules,
but the selftest will load module C first, prior to calling modprobe B.

We'll stress test this way two calls which use find_symbol():

1) Upon load of B it will trigger simplify_symbols() to look for the
symbol it uses from the module A with tons of symbols. That's an
indirect way for us to call resolve_symbol_wait() from module A without
having to use symbol_get() which want to remove as exported as it is
just a hack which should go away. Our goal is for us to test
resolve_symbol() which will call find_symbol() and that will eventually
look for the symbol on module A with:

   find_exported_symbol_in_section()

That uses bsearch() so a binary search for the symbol and we'd end up
hitting the misalignments here. Binary search will at worst be O(log(n))
and so the only way to aggreviate the search will be to add tons of
symbols to A, and have B use a few of them.

2) When you load B, userspace will at first load A as depmod will inform
userspace A goes before B. Upon B's load towards the end right before
we call module B's init routine we get complete_formation() called on
the module. That will first check for duplicate symbols with the call
to verify_exported_symbols(). That is when we'll force iteration on
module C's insane symbol list.

The selftests just runs

perf stat -e pick-your-poison-for-misalignments 
tools/testing/selftests/kmod/ksymtab.sh

Where ksymtab.sh is your new script which calls:

modprobe C
modprobe B

I say pick-your-poison-for-misalignments because I am not sure what is
best here.

Thoughts?


I really appreciate your thoughts here!

But given the triviality of this patch, I personally think you are
proposing a huge academic investment in time and resources with no real benefit.
On which architecture would you suggest to test this?
What would be the effective (really new) outcome?
If the performance penalty is zero or close to zero, will that imply
that such a patch isn't useful?
Please also keep in mind that not everyone gets paid to work on the kernel,
so I have neither the time nor the various architectures to test on.

Then, as Masahiro already mentioned, the real solution is
probably to add ".balign" to the various inline assembler parts.
With this the alignment gets recorded in the sh_addralign field
in the object files, and the linker will correctly lay out the executable
even without 

Re: [PATCH v3 17/34] lib/zlib: Unpoison DFLTCC output buffers

2023-12-22 Thread Alexander Potapenko
On Thu, Dec 14, 2023 at 12:36 AM Ilya Leoshkevich  wrote:
>
> The constraints of the DFLTCC inline assembly are not precise: they
> do not communicate the size of the output buffers to the compiler, so
> it cannot automatically instrument it.
>
> Add the manual kmsan_unpoison_memory() calls for the output buffers.
> The logic is the same as in [1].
>
> [1] 
> https://github.com/zlib-ng/zlib-ng/commit/1f5ddcc009ac3511e99fc88736a9e1a6381168c5
>
> Reported-by: Alexander Gordeev 
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 


> @@ -34,6 +37,7 @@ static inline dfltcc_cc dfltcc(
>  )
>  {
>  Byte *t2 = op1 ? *op1 : NULL;
> +unsigned char *orig_t2 = t2;
>  size_t t3 = len1 ? *len1 : 0;
>  const Byte *t4 = op2 ? *op2 : NULL;
>  size_t t5 = len2 ? *len2 : 0;
> @@ -59,6 +63,26 @@ static inline dfltcc_cc dfltcc(
>   : "cc", "memory");
>  t2 = r2; t3 = r3; t4 = r4; t5 = r5;
>
> +switch (fn & DFLTCC_FN_MASK) {

It might be a good idea to add a comment explaining what this block of
code does.
(And that it is no-op in non-KMSAN builds)



Re: [PATCH v3 27/34] s390/irqflags: Do not instrument arch_local_irq_*() with KMSAN

2023-12-22 Thread Alexander Potapenko
On Thu, Dec 14, 2023 at 12:36 AM Ilya Leoshkevich  wrote:
>
> KMSAN generates the following false positives on s390x:
>
> [6.063666] DEBUG_LOCKS_WARN_ON(lockdep_hardirqs_enabled())
> [ ...]
> [6.577050] Call Trace:
> [6.619637]  [<0690d2de>] check_flags+0x1fe/0x210
> [6.665411] ([<0690d2da>] check_flags+0x1fa/0x210)
> [6.707478]  [<006cec1a>] lock_acquire+0x2ca/0xce0
> [6.749959]  [<069820ea>] _raw_spin_lock_irqsave+0xea/0x190
> [6.794912]  [<041fc988>] __stack_depot_save+0x218/0x5b0
> [6.838420]  [<0197affe>] __msan_poison_alloca+0xfe/0x1a0
> [6.882985]  [<07c5827c>] start_kernel+0x70c/0xd50
> [6.927454]  [<00100036>] startup_continue+0x36/0x40
>
> Between trace_hardirqs_on() and `stosm __mask, 3` lockdep thinks that
> interrupts are on, but on the CPU they are still off. KMSAN
> instrumentation takes spinlocks, giving lockdep a chance to see and
> complain about this discrepancy.
>
> KMSAN instrumentation is inserted in order to poison the __mask
> variable. Disable instrumentation in the respective functions. They are
> very small and it's easy to see that no important metadata updates are
> lost because of this.
>
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH vhost v4 06/15] vdpa: Track device suspended state

2023-12-22 Thread Dragos Tatulea
On Wed, 2023-12-20 at 13:55 +0100, Dragos Tatulea wrote:
> On Wed, 2023-12-20 at 11:46 +0800, Jason Wang wrote:
> > On Wed, Dec 20, 2023 at 2:09 AM Dragos Tatulea  wrote:
> > > 
> > > Set vdpa device suspended state on successful suspend. Clear it on
> > > successful resume and reset.
> > > 
> > > The state will be locked by the vhost_vdpa mutex. The mutex is taken
> > > during suspend, resume and reset in vhost_vdpa_unlocked_ioctl. The
> > > exception is vhost_vdpa_open which does a device reset but that should
> > > be safe because it can only happen before the other ops.
> > > 
> > > Signed-off-by: Dragos Tatulea 
> > > Suggested-by: Eugenio Pérez 
> > > ---
> > >  drivers/vhost/vdpa.c | 17 +++--
> > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > > index b4e8ddf86485..00b4fa8e89f2 100644
> > > --- a/drivers/vhost/vdpa.c
> > > +++ b/drivers/vhost/vdpa.c
> > > @@ -59,6 +59,7 @@ struct vhost_vdpa {
> > > int in_batch;
> > > struct vdpa_iova_range range;
> > > u32 batch_asid;
> > > +   bool suspended;
> > 
> > Any reason why we don't do it in the core vDPA device but here?
> > 
> Not really. I wanted to be safe and not expose it in a header due to locking.
> 
A few clearer answers for why the state is not added in struct vdpa_device:
- All the suspend infrastructure is currently only for vhost.
- If the state would be moved to struct vdpa_device then the cf_lock would have
to be used. This adds more complexity to the code.

Thanks,
Dragos


Re: [PATCH vhost v4 02/15] vdpa: Add VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag

2023-12-22 Thread Dragos Tatulea
On Fri, 2023-12-22 at 03:29 -0500, Michael S. Tsirkin wrote:
> On Thu, Dec 21, 2023 at 03:07:22PM +, Dragos Tatulea wrote:
> > > > > In that case you're right, we don't need feature flags. But I think it
> > > > > would be great to also move the error return in case userspace tries
> > > > > to modify vq parameters out of suspend state.
> > > > > 
> > > > On the driver side or on the core side?
> > > > 
> > > 
> > > Core side.
> > > 
> > Checking my understanding: instead of the feature flags there would be a 
> > check
> > (for .set_vq_addr and .set_vq_state) to return an error if they are called 
> > under
> > DRIVER_OK and not suspended state?
> 
> Yea this looks much saner, if we start adding feature flags for
> each OPERATION_X_LEGAL_IN_STATE_Y then we will end up with N^2
> feature bits which is not reasonable.
> 
Ack. Is the v2 enough or should I respin a v5 with the updated Acked-by tags?

I will prepare the core part as a different series without the flags.

Thanks,
Dragos


RE: [PATCH 0/4] Section alignment issues?

2023-12-22 Thread David Laight
...
> diff --git a/include/linux/init.h b/include/linux/init.h
> index 3fa3f6241350..650311e4b215 100644
> --- a/include/linux/init.h
> +++ b/include/linux/init.h
> @@ -264,6 +264,7 @@ extern struct module __this_module;
>  #define define_initcall(fn, __stub, __name, __sec) \
> __define_initcall_stub(__stub, fn)  \
> asm(".section   \"" __sec "\", \"a\"\n" \
> +   ".balign 4  \n" \
> __stringify(__name) ":  \n" \
> ".long  " __stringify(__stub) " - . \n" \
> ".previous  \n");   \
> 
> 
> 
> Then, "this section requires at least 4 byte alignment"
> is recorded in the sh_addralign field.

Perhaps one of the headers should contain (something like):
#ifdef CONFIG_64
#define BALIGN_PTR ".balign 8\n"
#else
#define BALIGN_PTR ".balign 4\n"
#endif

to make it all easier (although that example doesn't need it).

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


Re: [PATCH 0/4] Section alignment issues?

2023-12-22 Thread Helge Deller

On 12/20/23 20:40, Luis Chamberlain wrote:

On Tue, Dec 19, 2023 at 01:26:49PM -0800, Luis Chamberlain wrote:

On Wed, Nov 22, 2023 at 11:18:10PM +0100, del...@kernel.org wrote:

From: Helge Deller 
My questions:
- Am I wrong with my analysis?


This would typically of course depend on the arch, but whether it helps
is what I would like to see with real numbers rather then speculation.
Howeer, I don't expect some of these are hot paths except maybe the
table lookups. So could you look at some perf stat differences
without / with alignment ? Other than bootup live patching would be
a good test case. We have selftest for modules, the script in selftests
tools/testing/selftests/kmod/kmod.sh is pretty aggressive, but the live
patching tests might be better suited.


- What does people see on other architectures?
- Does it make sense to add a compile- and runtime-check, like the patch below, 
to the kernel?


The chatty aspects really depend on the above results.

Aren't there some archs where an unaligned access would actually crash?
Why hasn't that happened?


I've gone down through memory lane and we have discussed this before.

It would seem this misalignment should not affect performance, and this
should not be an issue unless you have a buggy exception hanlder. We
actually ran into one before. Please refer to merge commit

e74acdf55da6649dd30be5b621a93b71cbe7f3f9


Yes, this is the second time I stumbled over this issue.
But let's continue discussing in the other mail thread...

Helge



Re: [PATCH 3/4] vmlinux.lds.h: Fix alignment for __ksymtab*, __kcrctab_* and .pci_fixup sections

2023-12-22 Thread Helge Deller

On 12/21/23 14:07, Masahiro Yamada wrote:

On Thu, Nov 23, 2023 at 7:18 AM  wrote:


From: Helge Deller 

On 64-bit architectures without CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
(e.g. ppc64, ppc64le, parisc, s390x,...) the __KSYM_REF() macro stores
64-bit pointers into the __ksymtab* sections.
Make sure that the start of those sections is 64-bit aligned in the vmlinux
executable, otherwise unaligned memory accesses may happen at runtime.



Are you solving a real problem?


Not any longer.
I faced a problem on parisc when neither #1 and #3 were applied
because of a buggy unalignment exception handler. But this is
not something which I would count a "real generic problem".


1/4 already ensures the proper alignment of __ksymtab*, doesn't it?


Yes, it does.


...
So, my understanding is this patch is unneeded.


Yes, it's not required and I'm fine if we drop it.

But regarding __kcrctab:


@@ -498,6 +501,7 @@
 }   \
 \
 /* Kernel symbol table: Normal symbols */   \
+   . = ALIGN(4);   \
 __kcrctab : AT(ADDR(__kcrctab) - LOAD_OFFSET) { \
 __start___kcrctab = .;  \
 KEEP(*(SORT(___kcrctab+*))) \


I think this patch would be beneficial to get proper alignment:

diff --git a/include/linux/export-internal.h b/include/linux/export-internal.h
index cd253eb51d6c..d445705ac13c 100644
--- a/include/linux/export-internal.h
+++ b/include/linux/export-internal.h
@@ -64,6 +64,7 @@

 #define SYMBOL_CRC(sym, crc, sec)   \
asm(".section \"___kcrctab" sec "+" #sym "\",\"a\"" "\n" \
+   ".balign 4" "\n" \
"__crc_" #sym ":"   "\n" \
".long " #crc   "\n" \
".previous" "\n")


Helge



Re: [PATCH vhost v4 02/15] vdpa: Add VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag

2023-12-22 Thread Michael S. Tsirkin
On Thu, Dec 21, 2023 at 03:07:22PM +, Dragos Tatulea wrote:
> > > > In that case you're right, we don't need feature flags. But I think it
> > > > would be great to also move the error return in case userspace tries
> > > > to modify vq parameters out of suspend state.
> > > > 
> > > On the driver side or on the core side?
> > > 
> > 
> > Core side.
> > 
> Checking my understanding: instead of the feature flags there would be a check
> (for .set_vq_addr and .set_vq_state) to return an error if they are called 
> under
> DRIVER_OK and not suspended state?

Yea this looks much saner, if we start adding feature flags for
each OPERATION_X_LEGAL_IN_STATE_Y then we will end up with N^2
feature bits which is not reasonable.

-- 
MST




Re: [PATCH 0/4] Section alignment issues?

2023-12-22 Thread Helge Deller

On 12/21/23 16:42, Masahiro Yamada wrote:

On Thu, Dec 21, 2023 at 10:40 PM Masahiro Yamada  wrote:


On Thu, Nov 23, 2023 at 7:18 AM  wrote:


From: Helge Deller 

While working on the 64-bit parisc kernel, I noticed that the __ksymtab[]
table was not correctly 64-bit aligned in many modules.
The following patches do fix some of those issues in the generic code.

But further investigation shows that multiple sections in the kernel and in
modules are possibly not correctly aligned, and thus may lead to performance
degregations at runtime (small on x86, huge on parisc, sparc and others which
need exception handlers). Sometimes wrong alignments may also be simply hidden
by the linker or kernel module loader which pulls in the sections by luck with
a correct alignment (e.g. because the previous section was aligned already).

An objdump on a x86 module shows e.g.:

./kernel/net/netfilter/nf_log_syslog.ko: file format elf64-x86-64
Sections:
Idx Name  Size  VMA   LMA   File off  Algn
   0 .text 1fdf      0040  2**4
   CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
   1 .init.text00f6      2020  2**4
   CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
   2 .exit.text005c      2120  2**4
   CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
   3 .rodata.str1.8 00dc      2180  2**3
   CONTENTS, ALLOC, LOAD, READONLY, DATA
   4 .rodata.str1.1 030a      225c  2**0
   CONTENTS, ALLOC, LOAD, READONLY, DATA
   5 .rodata   00b0      2580  2**5
   CONTENTS, ALLOC, LOAD, READONLY, DATA
   6 .modinfo  019e      2630  2**0
   CONTENTS, ALLOC, LOAD, READONLY, DATA
   7 .return_sites 0034      27ce  2**0
   CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
   8 .call_sites   029c      2802  2**0
   CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA

In this example I believe the ".return_sites" and ".call_sites" should have
an alignment of at least 32-bit (4 bytes).

On other architectures or modules other sections like ".altinstructions" or
"__ex_table" may show up wrongly instead.

In general I think it would be beneficial to search for wrong alignments at
link time, and maybe at runtime.

The patch at the end of this cover letter
- adds compile time checks to the "modpost" tool, and
- adds a runtime check to the kernel module loader at runtime.
And it will possibly show false positives too (!!!)
I do understand that some of those sections are not performce critical
and thus any alignment is OK.

The modpost patch will emit at compile time such warnings (on x86-64 kernel 
build):

WARNING: modpost: vmlinux: section .initcall7.init (type 1, flags 2) has 
alignment of 1, expected at least 4.
Maybe you need to add ALIGN() to the modules.lds file (or fix modpost) ?
WARNING: modpost: vmlinux: section .altinstructions (type 1, flags 2) has 
alignment of 1, expected at least 2.
WARNING: modpost: vmlinux: section .initcall6.init (type 1, flags 2) has 
alignment of 1, expected at least 4.
WARNING: modpost: vmlinux: section .initcallearly.init (type 1, flags 2) has 
alignment of 1, expected at least 4.
WARNING: modpost: vmlinux: section .rodata.cst2 (type 1, flags 18) has 
alignment of 2, expected at least 64.
WARNING: modpost: vmlinux: section .static_call_tramp_key (type 1, flags 2) has 
alignment of 1, expected at least 8.
WARNING: modpost: vmlinux: section .con_initcall.init (type 1, flags 2) has 
alignment of 1, expected at least 8.
WARNING: modpost: vmlinux: section __bug_table (type 1, flags 3) has alignment 
of 1, expected at least 4.
...





modpost acts on vmlinux.o instead of vmlinux.


vmlinux.o is a relocatable ELF, which is not a real layout
because no linker script has been considered yet at this
point.


vmlinux is an executable ELF, produced by a linker,
with the linker script taken into consideration
to determine the final section/symbol layout.


So, checking this in modpost is meaningless.



I did not check the module checking code, but
modules are also relocatable ELF.




Sorry, I replied too early.
(Actually I replied without reading your modpost code).

Now, I understand what your checker is doing.


I did not test how many false positives are produced,
but it catches several suspicious mis-alignments.


Yes.


However, I am not convinced with this warning.


+   warn("%s: section %s (type %d, flags %lu) has
alignment of %d, expected at least %d.\n"
+"Maybe you need to add ALIGN() to the modules.lds
file (or fix modpost) ?\n",
+