[GIT PULL] Modules changes for v6.10-rc1

2024-05-14 Thread Luis Chamberlain
The following changes since commit a5131c3fdf2608f1c15f3809e201cf540eb28489:

  Merge tag 'x86-shstk-2024-05-13' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (2024-05-13 19:33:23 
-0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/ 
tags/modules-6.10-rc1

for you to fetch changes up to 2c9e5d4a008293407836d29d35dfd4353615bd2f:

  bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of (2024-05-14 
00:36:29 -0700)


Modules changes for v6.10-rc1

Finally something fun. Mike Rapoport does some cleanup to allow us to
take out module_alloc() out of modules into a new paint shedded execmem_alloc()
and execmem_free() so to make emphasis these helpers are actually used outside
of modules. It starts with a no-functional changes API rename / placeholders
to then allow architectures to define their requirements into a new shiny
struct execmem_info with ranges, and requirements for those ranges. Archs
now can intitialize this execmem_info as the last part of mm_core_init() if
they have to diverge from the norm. Each range is a known type clearly
articulated and spelled out in enum execmem_type.

Although a lot of this is major cleanup and prep work for future enhancements an
immediate clear gain is we get to enable KPROBES without MODULES now. That is
ultimately what motiviated to pick this work up again, now with smaller goal as
concrete stepping stone.

This has been sitting on linux-next for a little less than a month, a few issues
were found already and fixed, in particular an odd mips boot issue. Arch folks
reviewed the code too. This is ready for wider exposure and testing.


Justin Stitt (1):
  kallsyms: replace deprecated strncpy with strscpy

Mike Rapoport (IBM) (16):
  arm64: module: remove unneeded call to kasan_alloc_module_shadow()
  mips: module: rename MODULE_START to MODULES_VADDR
  nios2: define virtual address space for modules
  sparc: simplify module_alloc()
  module: make module_memory_{alloc,free} more self-contained
  mm: introduce execmem_alloc() and execmem_free()
  mm/execmem, arch: convert simple overrides of module_alloc to execmem
  mm/execmem, arch: convert remaining overrides of module_alloc to execmem
  riscv: extend execmem_params for generated code allocations
  arm64: extend execmem_info for generated code allocations
  powerpc: extend execmem_params for kprobes allocations
  arch: make execmem setup available regardless of CONFIG_MODULES
  x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
  powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where appropriate
  kprobes: remove dependency on CONFIG_MODULES
  bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of

Yifan Hong (1):
  module: allow UNUSED_KSYMS_WHITELIST to be relative against objtree.

 arch/Kconfig |  10 ++-
 arch/arm/kernel/module.c |  34 -
 arch/arm/mm/init.c   |  45 +++
 arch/arm64/Kconfig   |   1 +
 arch/arm64/kernel/module.c   | 126 --
 arch/arm64/kernel/probes/kprobes.c   |   7 --
 arch/arm64/mm/init.c | 140 ++
 arch/arm64/net/bpf_jit_comp.c|  11 ---
 arch/loongarch/kernel/module.c   |   6 --
 arch/loongarch/mm/init.c |  21 +
 arch/mips/include/asm/pgtable-64.h   |   4 +-
 arch/mips/kernel/module.c|  10 ---
 arch/mips/mm/fault.c |   4 +-
 arch/mips/mm/init.c  |  23 ++
 arch/nios2/include/asm/pgtable.h |   5 +-
 arch/nios2/kernel/module.c   |  20 -
 arch/nios2/mm/init.c |  21 +
 arch/parisc/kernel/module.c  |  12 ---
 arch/parisc/mm/init.c|  23 +-
 arch/powerpc/Kconfig |   2 +-
 arch/powerpc/include/asm/kasan.h |   2 +-
 arch/powerpc/kernel/head_8xx.S   |   4 +-
 arch/powerpc/kernel/head_book3s_32.S |   6 +-
 arch/powerpc/kernel/kprobes.c|  22 +-
 arch/powerpc/kernel/module.c |  38 --
 arch/powerpc/lib/code-patching.c |   2 +-
 arch/powerpc/mm/book3s32/mmu.c   |   2 +-
 arch/powerpc/mm/mem.c|  64 
 arch/riscv/include/asm/pgtable.h |   3 +
 arch/riscv/kernel/module.c   |  12 ---
 arch/riscv/kernel/probes/kprobes.c   |  10 ---
 arch/riscv/mm/init.c |  35 +
 arch/riscv/net/bpf_jit_core.c|  13 
 arch/s390/kernel/ftrace.c|   4 +-
 arch/s390/kernel/kprobes.c   |   4 +-
 arch/s390/kernel/module.c|  42 +-
 arch/s390/mm/init.c  |  30 
 arch/sparc/include/asm/pgtable_32.h  |   2 +
 arch/sparc/kernel/module.c   |  30 
 

Re: [PATCH] modules: Drop the .export_symbol section from the final modules

2024-05-11 Thread Luis Chamberlain
On Wed, Apr 17, 2024 at 01:35:30PM +0800, wang...@lemote.com wrote:
> From: Wang Yao 
> 
> Commit ddb5cdbafaaa ("kbuild: generate KSYMTAB entries by modpost")
> forget drop the .export_symbol section from the final modules.
> 
> Signed-off-by: Wang Yao 

Masahiro, commit ddb5cdbafaaa ("kbuild: generate KSYMTAB entries by
modpost") was your change, wanna address / take it through your
tree? It makes sense to me though.

  Luis

> ---
>  scripts/module.lds.S | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/scripts/module.lds.S b/scripts/module.lds.S
> index bf5bcf2836d8..89ff01a22634 100644
> --- a/scripts/module.lds.S
> +++ b/scripts/module.lds.S
> @@ -13,6 +13,7 @@ SECTIONS {
>   /DISCARD/ : {
>   *(.discard)
>   *(.discard.*)
> + *(.export_symbol)
>   }
>  
>   __ksymtab   0 : { *(SORT(___ksymtab+*)) }
> -- 
> 2.27.0
> 



Re: [PATCH RESEND v8 00/16] mm: jit/text allocator

2024-05-05 Thread Luis Chamberlain
On Sun, May 05, 2024 at 07:06:12PM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> Hi,
> 
> The patches are also available in git:
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v8
> 
> v8:
> * fix intialization of default_execmem_info

Thanks, applied and pushed to modules-next. If we find fixes, let's
please just now have separate patches on top of this series.

  Luis



Re: [PATCH v7 00/16] mm: jit/text allocator

2024-05-02 Thread Luis Chamberlain
On Thu, May 02, 2024 at 11:50:36PM +0100, Liviu Dudau wrote:
> On Mon, Apr 29, 2024 at 09:29:20AM -0700, Luis Chamberlain wrote:
> > On Mon, Apr 29, 2024 at 03:16:04PM +0300, Mike Rapoport wrote:
> > > From: "Mike Rapoport (IBM)" 
> > > 
> > > Hi,
> > > 
> > > The patches are also available in git:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v7
> > > 
> > > v7 changes:
> > > * define MODULE_{VADDR,END} for riscv32 to fix the build and avoid
> > >   #ifdefs in a function body
> > > * add Acks, thanks everybody
> > 
> > Thanks, I've pushed this to modules-next for further exposure / testing.
> > Given the status of testing so far with prior revisions, in that only a
> > few issues were found and that those were fixed, and the status of
> > reviews, this just might be ripe for v6.10.
> 
> Looks like there is still some work needed. I've picked up next-20240501
> and on arch/mips with CONFIG_MODULE_COMPRESS_XZ=y and 
> CONFIG_MODULE_DECOMPRESS=y
> I fail to load any module:
> 
> # modprobe rfkill
> [11746.539090] Invalid ELF header magic: != ELF
> [11746.587149] execmem: unable to allocate memory
> modprobe: can't load module rfkill (kernel/net/rfkill/rfkill.ko.xz): Out of 
> memory
> 
> The (hopefully) relevant parts of my .config:

Thanks for the report! Any chance we can get you to try a bisection? I
think it should take 2-3 test boots. To help reduce scope you try modules-next:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=modules-next

Then can you check by resetting your tree to commmit 3fbe6c2f820a76 (mm:
introduce execmem_alloc() and execmem_free()"). I suspect that should
boot, so your bad commit would be the tip 3c2c250cb3a5fbb ("bpf: remove
CONFIG_BPF_JIT dependency on CONFIG_MODULES of").

That gives us only a few commits to bisect:

git log --oneline 3fbe6c2f820a76bc36d5546bda85832f57c8fce2..
3c2c250cb3a5 (HEAD -> modules-next, korg/modules-next) bpf: remove 
CONFIG_BPF_JIT dependency on CONFIG_MODULES of
11e8e65cce5c kprobes: remove dependency on CONFIG_MODULES
e10cbc38697b powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where 
appropriate
4da3d38f24c5 x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
13ae3d74ee70 arch: make execmem setup available regardless of CONFIG_MODULES
460bbbc70a47 powerpc: extend execmem_params for kprobes allocations
e1a14069b5b4 arm64: extend execmem_info for generated code allocations
971e181c6585 riscv: extend execmem_params for generated code allocations
0fa276f26721 mm/execmem, arch: convert remaining overrides of module_alloc to 
execmem
022cef244287 mm/execmem, arch: convert simple overrides of module_alloc to 
execmem

With 2-3 boots we should be to tell which is the bad commit.

  Luis



Re: [PATCH v7 00/16] mm: jit/text allocator

2024-04-29 Thread Luis Chamberlain
On Mon, Apr 29, 2024 at 03:16:04PM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> Hi,
> 
> The patches are also available in git:
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v7
> 
> v7 changes:
> * define MODULE_{VADDR,END} for riscv32 to fix the build and avoid
>   #ifdefs in a function body
> * add Acks, thanks everybody

Thanks, I've pushed this to modules-next for further exposure / testing.
Given the status of testing so far with prior revisions, in that only a
few issues were found and that those were fixed, and the status of
reviews, this just might be ripe for v6.10.

  Luis



Re: [PATCH v6 00/16] mm: jit/text allocator

2024-04-26 Thread Luis Chamberlain
On Fri, Apr 26, 2024 at 11:28:38AM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> Hi,
> 
> The patches are also available in git:
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v6
> 
> v6 changes:
> * restore patch "arm64: extend execmem_info for generated code
>   allocations" that disappeared in v5 rebase
> * update execmem initialization so that by default it will be
>   initialized early while late initialization will be an opt-in

I've taken this new iteration again through modules-next so to help get
more testing exposure to this. If we run into conflicts again with mm
we can see if Andrew is willing to take this in through there. However,
it may make sense to only consider up to "mm: introduce execmem_alloc() and
execmem_free()" for v6.10 given we're bound to likely find more issues
and we are already at rc5.

  Luis



Re: [PATCH v5 00/15] mm: jit/text allocator

2024-04-23 Thread Luis Chamberlain
On Mon, Apr 22, 2024 at 12:44:21PM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> (something went wrong with the prevois posting, sorry for the noise)
> 
> Hi,
> 
> Since v3 I looked into making execmem more of an utility toolbox, as we
> discussed at LPC with Mark Rutland, but it was getting more hairier than
> having a struct describing architecture constraints and a type identifying
> the consumer of execmem.
> 
> And I do think that having the description of architecture constraints for
> allocations of executable memory in a single place is better than having it
> spread all over the place.
> 
> The patches available via git:
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v5

Thanks! I've merged and pushed this onto modules-next in its entirety now for
wider testing.

  Luis



Re: [PATCH] module: ban '.', '..' as module names, ban '/' in module names

2024-04-14 Thread Luis Chamberlain
On Sun, Apr 14, 2024 at 10:05:05PM +0300, Alexey Dobriyan wrote:
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3616,4 +3616,12 @@ extern int vfs_fadvise(struct file *file, loff_t 
> offset, loff_t len,
>  extern int generic_fadvise(struct file *file, loff_t offset, loff_t len,
>  int advice);
>  
> +/*
> + * Use this if data from userspace end up as directory/filename on
> + * some virtual filesystem.
> + */
> +static inline bool string_is_vfs_ready(const char *s)
> +{
> + return strcmp(s, ".") != 0 && strcmp(s, "..") != 0 && !strchr(s, '/');
> +}
>  #endif /* _LINUX_FS_H */
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -2893,6 +2893,11 @@ static int load_module(struct load_info *info, const 
> char __user *uargs,
>  
>   audit_log_kern_module(mod->name);
>  
> + if (!string_is_vfs_ready(mod->name)) {
> + err = -EINVAL;
> + goto free_module;
> + }
> +

Sensible change however to put string_is_vfs_ready() in include/linux/fs.h 
is a stretch if there really are no other users.

  Luis



Re: [PATCH v4 00/15] mm: jit/text allocator

2024-04-11 Thread Luis Chamberlain
On Thu, Apr 11, 2024 at 07:00:36PM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> Hi,
> 
> Since v3 I looked into making execmem more of an utility toolbox, as we
> discussed at LPC with Mark Rutland, but it was getting more hairier than
> having a struct describing architecture constraints and a type identifying
> the consumer of execmem.
> 
> And I do think that having the description of architecture constraints for
> allocations of executable memory in a single place is better that having it
> spread all over the place.
> 
> The patches available via git:
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v4

I've taken the first 5 patches through modules-next for now to get early
exposure to testing. Of those I just had minor nit feedback on the 5th,
but the rest look good.

Let's wait for review for the rest of the patches 6-15.

  Luis



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-11 Thread Luis Chamberlain
On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> module_alloc() is used everywhere as a mean to allocate memory for code.
> 
> Beside being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF to modules and
> puts the burden of code allocation to the modules code.
> 
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
> 
> Start splitting code allocation from modules by introducing execmem_alloc()
> and execmem_free() APIs.
> 
> Initially, execmem_alloc() is a wrapper for module_alloc() and
> execmem_free() is a replacement of module_memfree() to allow updating all
> call sites to use the new APIs.
> 
> Since architectures define different restrictions on placement,
> permissions, alignment and other parameters for memory that can be used by
> different subsystems that allocate executable memory, execmem_alloc() takes
> a type argument, that will be used to identify the calling subsystem and to
> allow architectures define parameters for ranges suitable for that
> subsystem.

It would be good to describe this is a non-fuctional change.

> Signed-off-by: Mike Rapoport (IBM) 
> ---

> diff --git a/mm/execmem.c b/mm/execmem.c
> new file mode 100644
> index ..ed2ea41a2543
> --- /dev/null
> +++ b/mm/execmem.c
> @@ -0,0 +1,26 @@
> +// SPDX-License-Identifier: GPL-2.0

And this just needs to copy over the copyright notices from the main.c file.

  Luis



Re: [PATCH] [v4] module: don't ignore sysfs_create_link() failures

2024-04-08 Thread Luis Chamberlain
On Mon, Apr 08, 2024 at 10:05:58AM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The sysfs_create_link() return code is marked as __must_check, but the
> module_add_driver() function tries hard to not care, by assigning the
> return code to a variable. When building with 'make W=1', gcc still
> warns because this variable is only assigned but not used:
> 
> drivers/base/module.c: In function 'module_add_driver':
> drivers/base/module.c:36:6: warning: variable 'no_warn' set but not used 
> [-Wunused-but-set-variable]
> 
> Rework the code to properly unwind and return the error code to the
> caller. My reading of the original code was that it tries to
> not fail when the links already exist, so keep ignoring -EEXIST
> errors.
> 
> Fixes: e17e0f51aeea ("Driver core: show drivers in /sys/module/")
> Reviewed-by: Greg Kroah-Hartman 

Reviewed-by: Luis Chamberlain 

  Luis



[GIT PULL] Modules changes for v6.9-rc1

2024-03-12 Thread Luis Chamberlain
The following changes since commit 41bccc98fb7931d63d03f326a746ac4d429c1dd3:

  Linux 6.8-rc2 (2024-01-28 17:01:12 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/ 
tags/modules-6.9-rc1

for you to fetch changes up to d1909c0221739356f31c721de4743e7d219a56cc:

  module: Don't ignore errors from set_memory_XX() (2024-02-16 11:30:43 -0800)


Modules changes for v6.9-rc1

Christophe Leroy did most of the work on this release, first with a few
cleanups on CONFIG_STRICT_KERNEL_RWX and ending with error handling for
when set_memory_XX() can fail. This is part of a larger effort to clean
up all these callers which can fail, modules is just part of it.

This has been sitting on linux-next for about a month without issues.


Christophe Leroy (6):
  module: Use set_memory_rox()
  module: Change module_enable_{nx/x/ro}() to more explicit names
  init: Declare rodata_enabled and mark_rodata_ro() at all time
  modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around rodata_enabled
  powerpc: Simplify strict_kernel_rwx_enabled()
  module: Don't ignore errors from set_memory_XX()

Randy Dunlap (1):
  lib/test_kmod: fix kernel-doc warnings

 arch/powerpc/include/asm/mmu.h |  9 +-
 include/linux/init.h   |  4 ---
 init/main.c| 21 +-
 kernel/module/internal.h   |  6 ++--
 kernel/module/main.c   | 20 +++---
 kernel/module/strict_rwx.c | 63 +++---
 lib/test_kmod.c|  6 +++-
 7 files changed, 73 insertions(+), 56 deletions(-)



Re: [PATCH v1] module.h: define __symbol_get_gpl() as a regular __symbol_get()

2024-03-12 Thread Luis Chamberlain
On Tue, Mar 12, 2024 at 03:25:27PM -0700, Luis Chamberlain wrote:
> On Tue, Feb 13, 2024 at 02:10:45PM +0300, Andrew Kanner wrote:
> > On Thu, Feb 01, 2024 at 10:13:54AM -0800, Luis Chamberlain wrote:
> > > 
> > > While you're at it, if you want to try it, you could see if you can
> > > improve the situation more by looking at symbol_get() users that remain
> > > and seeing if you can instead fix it with proper Kconfig dependency and
> > > at build time. Then we can just remove it as well.
> > > 
> > >   Luis
> > 
> > Sorry for the late reply.
> > 
> > Luis, can you give more details of your idea? I re-read it once, then
> > came back and still don't understand.
> > 
> > I see that there are ~10 users for symbol_get() currently. Do you want
> > to stringify symbol names at build time to completely remove
> > symbol_get() from module.h? Correct me if I'm wrong since using of a
> > fuction which is not declared anywhere sounds confusing.
> 
> As an example look at the code and see if there's a sensible way to make
> some calls built-in instead of part of the module, then the module can
> have a kconfig builtin option, that adds to the built-in code which
> means you don't need the symbol_get().
> 
> For some other pieces of code it may require other strategies.

An example is FW_LOADER_USER_HELPER which is bool only, and is selected
by users. It didn't use symbol_get() before, however its an example of
how through Kconfig you can align requirements and define built-in
components, even if they do come from a module.

  Luis



Re: [PATCH v1] module.h: define __symbol_get_gpl() as a regular __symbol_get()

2024-03-12 Thread Luis Chamberlain
On Tue, Feb 13, 2024 at 02:10:45PM +0300, Andrew Kanner wrote:
> On Thu, Feb 01, 2024 at 10:13:54AM -0800, Luis Chamberlain wrote:
> > 
> > While you're at it, if you want to try it, you could see if you can
> > improve the situation more by looking at symbol_get() users that remain
> > and seeing if you can instead fix it with proper Kconfig dependency and
> > at build time. Then we can just remove it as well.
> > 
> >   Luis
> 
> Sorry for the late reply.
> 
> Luis, can you give more details of your idea? I re-read it once, then
> came back and still don't understand.
> 
> I see that there are ~10 users for symbol_get() currently. Do you want
> to stringify symbol names at build time to completely remove
> symbol_get() from module.h? Correct me if I'm wrong since using of a
> fuction which is not declared anywhere sounds confusing.

As an example look at the code and see if there's a sensible way to make
some calls built-in instead of part of the module, then the module can
have a kconfig builtin option, that adds to the built-in code which
means you don't need the symbol_get().

For some other pieces of code it may require other strategies.

  Luis



Re: [RFC][PATCH 0/4] Make bpf_jit and kprobes work with CONFIG_MODULES=n

2024-03-07 Thread Luis Chamberlain
On Thu, Mar 7, 2024 at 6:50 PM Masami Hiramatsu  wrote:
>
> On Wed, 6 Mar 2024 17:58:14 -0800
> Song Liu  wrote:
>
> > Hi Calvin,
> >
> > It is great to hear from you! :)
> >
> > On Wed, Mar 6, 2024 at 3:23 PM Calvin Owens  wrote:
> > >
> > > On Wednesday 03/06 at 13:34 -0800, Luis Chamberlain wrote:
> > > > On Wed, Mar 06, 2024 at 12:05:07PM -0800, Calvin Owens wrote:
> > > > > Hello all,
> > > > >
> > > > > This patchset makes it possible to use bpftrace with kprobes on 
> > > > > kernels
> > > > > built without loadable module support.
> > > >
> > > > This is a step in the right direction for another reason: clearly the
> > > > module_alloc() is not about modules, and we have special reasons for it
> > > > now beyond modules. The effort to share a generalize a huge page for
> > > > these things is also another reason for some of this but that is more
> > > > long term.
> > > >
> > > > I'm all for minor changes here so to avoid regressions but it seems a
> > > > rename is in order -- if we're going to all this might as well do it
> > > > now. And for that I'd just like to ask you paint the bikeshed with
> > > > Song Liu as he's been the one slowly making way to help us get there
> > > > with the "module: replace module_layout with module_memory",
> > > > and Mike Rapoport as he's had some follow up attempts [0]. As I see it,
> > > > the EXECMEM stuff would be what we use instead then. Mike kept the
> > > > module_alloc() and the execmem was just a wrapper but your move of the
> > > > arch stuff makes sense as well and I think would complement his series
> > > > nicely.
> > >
> > > I apologize for missing that. I think these are the four most recent
> > > versions of the different series referenced from that LWN link:
> > >
> > >   a) https://lore.kernel.org/all/20230918072955.2507221-1-r...@kernel.org/
> > >   b) https://lore.kernel.org/all/20230526051529.3387103-1-s...@kernel.org/
> > >   c) https://lore.kernel.org/all/20221107223921.3451913-1-s...@kernel.org/
> > >   d) 
> > > https://lore.kernel.org/all/20201120202426.18009-1-rick.p.edgeco...@intel.com/
> > >
> > > Song and Mike, please correct me if I'm wrong, but I think what I've
> > > done here (see [1], sorry for not adding you initially) is compatible
> > > with everything both of you have recently proposed above. How do you
> > > feel about this as a first step?
> >
> > I agree that the work here is compatible with other efforts. I have no
> > objection to making this the first step.
> >
> > >
> > > For naming, execmem_alloc() seems reasonable to me? I have no strong
> > > feelings at all, I'll just use that going forward unless somebody else
> > > expresses an opinion.
> >
> > I am not good at naming things. No objection from me to "execmem_alloc".
>
> Hm, it sounds good to me too. I think we should add a patch which just
> rename the module_alloc/module_memfree with execmem_alloc/free first.

I think that would be cleaner, yes. Leaving the possible move to a
secondary patch and placing the testing more on the later part.

 Luis



Re: [RFC][PATCH 0/4] Make bpf_jit and kprobes work with CONFIG_MODULES=n

2024-03-06 Thread Luis Chamberlain
On Wed, Mar 06, 2024 at 12:05:07PM -0800, Calvin Owens wrote:
> Hello all,
> 
> This patchset makes it possible to use bpftrace with kprobes on kernels
> built without loadable module support.

This is a step in the right direction for another reason: clearly the
module_alloc() is not about modules, and we have special reasons for it
now beyond modules. The effort to share a generalize a huge page for
these things is also another reason for some of this but that is more
long term.

I'm all for minor changes here so to avoid regressions but it seems a
rename is in order -- if we're going to all this might as well do it
now. And for that I'd just like to ask you paint the bikeshed with
Song Liu as he's been the one slowly making way to help us get there
with the "module: replace module_layout with module_memory",
and Mike Rapoport as he's had some follow up attempts [0]. As I see it,
the EXECMEM stuff would be what we use instead then. Mike kept the
module_alloc() and the execmem was just a wrapper but your move of the
arch stuff makes sense as well and I think would complement his series
nicely.

If you're gonna split code up to move to another place, it'd be nice
if you can add copyright headers as was done with the kernel/module.c
split into kernel/module/*.c

Can we start with some small basic stuff we can all agree on?

[0] https://lwn.net/Articles/944857/

  Luis



Re: [PATCH v4] modules: wait do_free_init correctly

2024-02-27 Thread Luis Chamberlain
On Tue, Feb 27, 2024 at 10:35:46AM +0800, Changbin Du wrote:
> The synchronization here is to ensure the ordering of freeing of a module
> init so that it happens before W+X checking. It is worth noting it is not
> that the freeing was not happening, it is just that our sanity checkers
> raced against the permission checkers which assume init memory is already
> gone.
> 
> Commit 1a7b7d922081 ("modules: Use vmalloc special flag") moved
> calling do_free_init() into a global workqueue instead of relying on it
> being called through call_rcu(..., do_free_init), which used to allowed us
> call do_free_init() asynchronously after the end of a subsequent grace
> period. The move to a global workqueue broke the gaurantees for code which
> needed to be sure the do_free_init() would complete with rcu_barrier().
> To fix this callers which used to rely on rcu_barrier() must now instead
> use flush_work(_free_wq).
> 
> Without this fix, we still could encounter false positive reports in W+X
> checking since the rcu_barrier() here can not ensure the ordering now.
> 
> Even worse, the rcu_barrier() can introduce significant delay. Eric Chanudet
> reported that the rcu_barrier introduces ~0.1s delay on a PREEMPT_RT kernel.
> 
>   [0.291444] Freeing unused kernel memory: 5568K
>   [0.402442] Run /sbin/init as init process
> 
> With this fix, the above delay can be eliminated.
> 
> Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag")
> Signed-off-by: Changbin Du 
> Cc: Xiaoyi Su 
> Cc: Eric Chanudet 
> Cc: Luis Chamberlain 
> Tested-by: Eric Chanudet 

Acked-by: Luis Chamberlain 

  Luis



Re: [PATCH v3] modules: wait do_free_init correctly

2024-02-21 Thread Luis Chamberlain
+ live-patching folks,

Finally, things are starting to be much clearer. Thanks for the time
for working on this, some more comments below and a question which
I think deserves some attention.

On Sat, Feb 17, 2024 at 04:18:10PM +0800, Changbin Du wrote:
> The synchronization here is just to ensure the module init's been freed
> before doing W+X checking. 

Some nits, this should read instead:

Fix the ordering of freeing of a module init so that it happens before
W+X checking.

> But the commit 1a7b7d922081 ("modules: Use
> vmalloc special flag") moves do_free_init() into a global workqueue
> instead of call_rcu(). So now rcu_barrier() can not ensure that do_free_init
> has completed. We should wait it via flush_work().

Remove "But" and adjust as:

Commit 1a7b7d922081 ("modules: Use vmalloc special flag") moved
calling do_free_init() into a global workqueue instead of relying on it
being called through call_rcu(..., do_free_init), which used to allowed us
call do_free_init() asynchronously after the end of a subsequent grace  
   
period. The move to a global workqueue broke the gaurantees for code
which needed to be sure the do_free_init() would complete with rcu_barrier().
To fix this callers which used to rely on rcu_barrier() must now instead
use flush_work(_free_wq).

> Without this fix, we still could encounter false positive reports in
> W+X checking,

This is good thanks for the clarification.

I think it would be useful for the commit log then to describe also that
it is not that the freeing was not happening, it is just that our sanity
checkers raced against the permission checkers which assume init memory
is already gone.

> and the rcu synchronization is unnecessary which can
> introduce significant delay.

While this can be true, I am not sure if we can remove it. See below.

> Eric Chanudet reports that the rcu_barrier introduces ~0.1s delay on a
> PREEMPT_RT kernel.

That's a separate issue.

>   [0.291444] Freeing unused kernel memory: 5568K
>   [0.402442] Run /sbin/init as init process
> 
> With this fix, the above delay can be eliminated.
> 
> Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag")
> Signed-off-by: Changbin Du 
> Cc: Xiaoyi Su 
> Cc: Eric Chanudet 
> 
> ---
> v3:
>   - amend comment in do_init_module() and update commit msg.
> v2:
>   - fix compilation issue for no CONFIG_MODULES found by 0-DAY.
> ---
>  include/linux/moduleloader.h | 8 
>  init/main.c  | 5 +++--
>  kernel/module/main.c | 9 +++--
>  3 files changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
> index 001b2ce83832..89b1e0ed9811 100644
> --- a/include/linux/moduleloader.h
> +++ b/include/linux/moduleloader.h
> @@ -115,6 +115,14 @@ int module_finalize(const Elf_Ehdr *hdr,
>   const Elf_Shdr *sechdrs,
>   struct module *mod);
>  
> +#ifdef CONFIG_MODULES
> +void flush_module_init_free_work(void);
> +#else
> +static inline void flush_module_init_free_work(void)
> +{
> +}
> +#endif
> +
>  /* Any cleanup needed when module leaves. */
>  void module_arch_cleanup(struct module *mod);
>  
> diff --git a/init/main.c b/init/main.c
> index e24b0780fdff..f0b7e21ac67f 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -99,6 +99,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include 
> @@ -1402,11 +1403,11 @@ static void mark_readonly(void)
>   if (rodata_enabled) {
>   /*
>* load_module() results in W+X mappings, which are cleaned
> -  * up with call_rcu().  Let's make sure that queued work is
> +  * up with init_free_wq. Let's make sure that queued work is
>* flushed so that we don't hit false positives looking for
>* insecure pages which are W+X.
>*/
> - rcu_barrier();

Was this the only source of waiters that used rcu_barrier() to sync ?
What about kallsyms, live-patching ?

This original source to the addition of this rcu_barrier() (in a slight
older modified form with with rcu_barrier_sched()) was commit
ae646f0b9ca13 ("init: fix false positives in W+X checking") since
v4.17 in 2018, 6 years ago. So I'm hoping we don't have any other
side-by new users which have grown dependent on this rcu_barrier() for
other call_rcu()'s they may have used, but it is hard to tell.

So while I agree that flush work is the right solution, removing the
rcu_barrier() is technically another change which could potentially
regress for other reasons now. It is perhaps safe, but I'm used to
surprises for minor changes like these. So I think it makes sense to
lift it now, and test it in the wild to see what could possibly break,
I'd much prefer to split this as two separate commits. One which does
the fix, and another that lifts the rcu_barrier() with the stated
rationale and savings on time of ~0.1s on PREEMPT_RT kernels.

  Luis



Re: [RFC PATCH] kernel/module: add a safer implementation of try_module_get()

2024-02-02 Thread Luis Chamberlain
On Thu, Feb 01, 2024 at 03:27:54PM +0100, Marco Pagani wrote:
> 
> On 2024-01-30 21:47, Luis Chamberlain wrote:
> > 
> > It very much sounds like there is a desire to have this but without a
> > user, there is no justification.
> 
> I was working on a set of patches to fix an issue in the fpga subsystem
> when I came across your commit 557aafac1153 ("kernel/module: add
> documentation for try_module_get()") that made me realize we also had a
> safety problem. 
> 
> To solve this problem for the fpga manager, we had to add a mutex to
> ensure the low-level module still exists before calling
> try_module_get(). However, having a safer version of try_module_get()
> would have simplified the code and made it more robust against changes.
> 
> https://lore.kernel.org/linux-fpga/2024060242.149265-1-marpa...@redhat.com/
> 
> I suspect there may be other cases where try_module_get() is
> inadvertently called without ensuring that the module still exists
> that may benefit from a safer implementation.

Maybe so, however I'm not yet sure if this is safe from deadlocks.
Please work on a series of selftest simple modules which demonstrate
its use / and a simple bash script selftest loader which verifies this
won't bust. Consider you may have third party modules which also race
with this too, and other users without this new API.

> >> +bool try_module_get_safe(struct module *module)
> >> +{
> >> +  struct module *mod;
> >> +  bool ret = true;
> >> +
> >> +  if (!module)
> >> +  goto out;
> >> +
> >> +  mutex_lock(_mutex);
> > 
> > If a user comes around then this should be mutex_lock_interruptible(),
> > and add might_sleep()
> 
> Would it be okay to return false if it gets interrupted, or should I
> change the return type to int to propagate -EINTR? My concern with
> changing the signature is that it would be less straightforward to
> use the function in place of try_module_get().

Since we want a safe mechanism we might as well not allow a simple drop
in replacement but a more robust one so that users take care of the
return value properly.

  Luis



Re: [PATCH v2 0/3] modules: few of alignment fixes

2024-02-02 Thread Luis Chamberlain
On Sat, Feb 03, 2024 at 12:20:38AM +0900, Masahiro Yamada wrote:
> On Fri, Feb 2, 2024 at 3:05 AM Luis Chamberlain  wrote:
> >
> > On Wed, Jan 31, 2024 at 02:11:44PM -0800, Luis Chamberlain wrote:
> > > On Mon, Jan 29, 2024 at 11:26:39AM -0800, Luis Chamberlain wrote:
> > > > Masahiro, if there no issues feel free to take this or I can take them 
> > > > in
> > > > too via the modules-next tree. Lemme know!
> > >
> > > I've queued this onto modules-testing to get winder testing [0]
> >
> > I've moved it to modules-next as I've found no issues.
> >
> >   Luis
> 
> 
> I believe this patch series is wrong.
> 
> I thought we agreed that the alignment must be added to
> individual asm code, not to the linker script.
>
> I am surprised that you came back to this.

I misseed the dialog on the old cover letter, sorry. I've yanked these patches
out. I'd expect a respin from Helge.

  Luis



Re: [PATCH v1] module.h: define __symbol_get_gpl() as a regular __symbol_get()

2024-02-01 Thread Luis Chamberlain
On Thu, Feb 01, 2024 at 12:29:46PM +0300, Andrew Kanner wrote:
> On Thu, Feb 01, 2024 at 06:29:58AM +0100, Christoph Hellwig wrote:
> > On Wed, Jan 31, 2024 at 10:02:52PM +0300, Andrew Kanner wrote:
> > > Prototype for __symbol_get_gpl() was introduced in the initial git
> > > commit 1da177e4c3f4 ("Linux-2.6.12-rc2"), but was not used after that.
> > > 
> > > In commit 9011e49d54dc ("modules: only allow symbol_get of
> > > EXPORT_SYMBOL_GPL modules") Christoph Hellwig switched __symbol_get()
> > > to process GPL symbols only, most likely this is what
> > > __symbol_get_gpl() was designed to do.
> > > 
> > > We might either define __symbol_get_gpl() as __symbol_get() or remove
> > > it completely as suggested by Mauro Carvalho Chehab.
> > 
> > Just remove it, there is no need to keep unused funtionality around.
> > 
> > Btw, where did the discussion start?  I hope you're not trying to
> > add new symbol_get users?
> > 
> 
> Of course not, no new users needed.
> 
> I haven't discussed it directly. I found the unused __symbol_get_gpl()
> myself, but during investigation of wether it was ever used somewhere
> found the old patch series suggested by Mauro Carvalho Chehab (in Cc).
> 
> Link: 
> https://lore.kernel.org/lkml/5f001015990a76c0da35a4c3cf08e457ec353ab2.1652113087.git.mche...@kernel.org/
> 
> The patch series is from 2022 and not merged. You can take [PATCH v6
> 1/4] which removes the unused symbol from the link.
> 
> Or I can resend v2 with my commit msg. But not sure about how it works
> in such a case - will adding Suggested-by tag (if no objections from
> Mauro) with the Link be ok?

While you're at it, if you want to try it, you could see if you can
improve the situation more by looking at symbol_get() users that remain
and seeing if you can instead fix it with proper Kconfig dependency and
at build time. Then we can just remove it as well.

  Luis



Re: [PATCH] lib/test_kmod: fix kernel-doc warnings

2024-02-01 Thread Luis Chamberlain
On Fri, Nov 03, 2023 at 09:20:44PM -0700, Randy Dunlap wrote:
> Fix all kernel-doc warnings in test_kmod.c:
> - Mark some enum values as private so that kernel-doc is not needed
>   for them
> - s/thread_mutex/thread_lock/ in a struct's kernel-doc comments
> - add kernel-doc info for @task_sync
> 
> test_kmod.c:67: warning: Enum value '__TEST_KMOD_INVALID' not described in 
> enum 'kmod_test_case'
> test_kmod.c:67: warning: Enum value '__TEST_KMOD_MAX' not described in enum 
> 'kmod_test_case'
> test_kmod.c:100: warning: Function parameter or member 'task_sync' not 
> described in 'kmod_test_device_info'
> test_kmod.c:134: warning: Function parameter or member 'thread_mutex' not 
> described in 'kmod_test_device'
> 
> Signed-off-by: Randy Dunlap 
> Cc: Luis Chamberlain 
> Cc: linux-modu...@vger.kernel.org

Applied and pushed, thanks!

  Luis



Re: [PATCH v2 0/3] modules: few of alignment fixes

2024-02-01 Thread Luis Chamberlain
On Wed, Jan 31, 2024 at 02:11:44PM -0800, Luis Chamberlain wrote:
> On Mon, Jan 29, 2024 at 11:26:39AM -0800, Luis Chamberlain wrote:
> > Masahiro, if there no issues feel free to take this or I can take them in
> > too via the modules-next tree. Lemme know!
> 
> I've queued this onto modules-testing to get winder testing [0]

I've moved it to modules-next as I've found no issues.

  Luis



Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2024-01-31 Thread Luis Chamberlain
On Wed, Jan 31, 2024 at 06:53:13AM +, Christophe Leroy wrote:
> The problem being identified in commit 677bfb9db8a3 ("module: Don't 
> ignore errors from set_memory_XX()"), you can keep/re-apply the series 
> [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time.

Sure, queued that up into modules-testing before I push to modules-next.

  Luis



Re: [PATCH v2 0/3] modules: few of alignment fixes

2024-01-31 Thread Luis Chamberlain
On Mon, Jan 29, 2024 at 11:26:39AM -0800, Luis Chamberlain wrote:
> Masahiro, if there no issues feel free to take this or I can take them in
> too via the modules-next tree. Lemme know!

I've queued this onto modules-testing to get winder testing [0]

[0] 
https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=modules-testing

  Luis



Re: [RFC PATCH] kernel/module: add a safer implementation of try_module_get()

2024-01-30 Thread Luis Chamberlain
On Tue, Jan 30, 2024 at 08:36:14PM +0100, Marco Pagani wrote:
> The current implementation of try_module_get() requires the module to
> exist and be live as a precondition. While this may seem intuitive at
> first glance, enforcing the precondition can be tricky, considering that
> modules can be unloaded at any time if not previously taken. For
> instance, the caller could be preempted just before calling
> try_module_get(), and while preempted, the module could be unloaded and
> freed. More subtly, the module could also be unloaded at any point while
> executing try_module_get() before incrementing the refount with
> atomic_inc_not_zero().
> 
> Neglecting the precondition that the module must exist and be live can
> cause unexpected race conditions that can lead to crashes. However,
> ensuring that the precondition is met may require additional locking
> that increases the complexity of the code and can make it more
> error-prone.
> 
> This patch adds a slower yet safer implementation of try_module_get()
> that checks if the module is valid by looking into the mod_tree before
> taking the module's refcount. This new function can be safely called on
> stale and invalid module pointers, relieving developers from the burden
> of ensuring that the module exists and is live before attempting to take
> it.
> 
> The tree lookup and refcount increment are executed after taking the
> module_mutex to prevent the module from being unloaded after looking up
> the tree.
> 
> Signed-off-by: Marco Pagani 

It very much sounds like there is a desire to have this but without a
user, there is no justification.

> +bool try_module_get_safe(struct module *module)
> +{
> + struct module *mod;
> + bool ret = true;
> +
> + if (!module)
> + goto out;
> +
> + mutex_lock(_mutex);

If a user comes around then this should be mutex_lock_interruptible(),
and add might_sleep()

> +
> + /*
> +  * Check if the address points to a valid live module and take
> +  * the refcount only if it points to the module struct.
> +  */
> + mod = __module_address((unsigned long)module);
> + if (mod && mod == module && module_is_live(mod))
> + __module_get(mod);
> + else
> + ret = false;
> +
> + mutex_unlock(_mutex);
> +
> +out:
> + return ret;
> +}
> +EXPORT_SYMBOL(try_module_get_safe);

And EXPORT_SYMBOL_GPL() would need to be used.

I'd also expect selftests to be expanded for this case, but again,
without a user, this is just trying to resolve a problem which does not
exist.

  Luis



Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2024-01-30 Thread Luis Chamberlain
On Tue, Jan 30, 2024 at 06:48:11PM +0100, Marek Szyprowski wrote:
> Dear All,
> 
> On 30.01.2024 12:03, Christophe Leroy wrote:
> > Le 30/01/2024 à 10:16, Chen-Yu Tsai a écrit :
> >> [Vous ne recevez pas souvent de courriers de we...@chromium.org. D?couvrez 
> >> pourquoi ceci est important ? 
> >> https://aka.ms/LearnAboutSenderIdentification ]
> >>
> >> On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote:
> >>> On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote:
> >>>> Declaring rodata_enabled and mark_rodata_ro() at all time
> >>>> helps removing related #ifdefery in C files.
> >>>>
> >>>> Signed-off-by: Christophe Leroy 
> >>> Very nice cleanup, thanks!, applied and pushed
> >>>
> >>> Luis
> >> On next-20240130, which has your modules-next branch, and thus this
> >> series and the other "module: Use set_memory_rox()" series applied,
> >> my kernel crashes in some very weird way. Reverting your branch
> >> makes the crash go away.
> >>
> >> I thought I'd report it right away. Maybe you folks would know what's
> >> happening here? This is on arm64.
> > That's strange, it seems to bug in module_bug_finalize() which is
> > _before_ calls to module_enable_ro() and such.
> >
> > Can you try to revert the 6 patches one by one to see which one
> > introduces the problem ?
> >
> > In reality, only patch 677bfb9db8a3 really change things. Other ones are
> > more on less only cleanup.
> 
> I've also run into this issue with today's (20240130) linux-next on my 
> test farm. The issue is not fully reproducible, so it was a bit hard to 
> bisect it automatically. I've spent some time on manual testing and it 
> looks that reverting the following 2 commits on top of linux-next fixes 
> the problem:
> 
> 65929884f868 ("modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around 
> rodata_enabled")
> 677bfb9db8a3 ("module: Don't ignore errors from set_memory_XX()")
> 
> This in fact means that commit 677bfb9db8a3 is responsible for this 
> regression, as 65929884f868 has to be reverted only because the latter 
> depends on it. Let me know what I can do to help debugging this issue.

Thanks for the bisect, I've reset my tree to commit
3559ad395bf02 ("module: Change module_enable_{nx/x/ro}() to more
explicit names") for now then, so to remove those commits.

  Luis



Re: [RESEND PATCH v2] modules: wait do_free_init correctly

2024-01-30 Thread Luis Chamberlain
On Tue, Jan 30, 2024 at 09:40:38AM +0800, Changbin Du wrote:
> On Mon, Jan 29, 2024 at 09:53:58AM -0800, Luis Chamberlain wrote:
> > On Mon, Jan 29, 2024 at 10:03:04AM +0800, Changbin Du wrote:
> > > The commit 1a7b7d922081 ("modules: Use vmalloc special flag") moves
> > > do_free_init() into a global workqueue instead of call_rcu(). So now
> > > rcu_barrier() can not ensure that do_free_init has completed. We should
> > > wait it via flush_work().
> > > 
> > > Without this fix, we still could encounter false positive reports in
> > > W+X checking, and rcu synchronization is unnecessary.
> > 
> > You didn't answer my question, which should be documented in the commit log.
> > 
> > Does this mean we never freed modules init because of this? If so then
> > your commit log should clearly explain that. It should also explain that
> > if true (you have to verify) then it means we were no longer saving
> > the memory we wished to save, and that is important for distributions
> > which do want to save anything on memory. You may want to do a general
> > estimate on how much that means these days on any desktop / server.
>
> Actually, I have explained it in commit msg. It's not about saving memory. The
> synchronization here is just to ensure the module init's been freed before
> doing W+X checking. The problem is that the current implementation is wrong,
> rcu_barrier() cannot guarantee that. So we can encounter false positive 
> reports.
> But anyway, the module init will be freed, and it's just a timing related 
> issue.

Your desciption here is better than the commit log.

  Luis



Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2024-01-29 Thread Luis Chamberlain
On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote:
> Declaring rodata_enabled and mark_rodata_ro() at all time
> helps removing related #ifdefery in C files.
> 
> Signed-off-by: Christophe Leroy 

Very nice cleanup, thanks!, applied and pushed

  Luis



Re: [PATCH 1/3] module: Use set_memory_rox()

2024-01-29 Thread Luis Chamberlain
On Thu, Dec 21, 2023 at 08:24:23AM +0100, Christophe Leroy wrote:
> A couple of architectures seem concerned about calling set_memory_ro()
> and set_memory_x() too frequently and have implemented a version of
> set_memory_rox(), see commit 60463628c9e0 ("x86/mm: Implement native
> set_memory_rox()") and commit 22e99fa56443 ("s390/mm: implement
> set_memory_rox()")
> 
> Use set_memory_rox() in modules when STRICT_MODULES_RWX is set.
> 
> Signed-off-by: Christophe Leroy 

Nice simplification. I applied all 3 patches and pushed!

  Luis



[PATCH v2 1/4] selftests: add new kallsyms selftests

2024-01-29 Thread Luis Chamberlain
We lack find_symbol() selftests, so add one. This let's us stress test
improvements easily on find_symbol() or optimizations. It also inherently
allows us to test the limits of kallsyms on Linux today.

We test a pathalogical use case for kallsyms by introducing modules
which are automatically written for us with a larger number of symbols.
We have 4 kallsyms test modules:

A: has KALLSYSMS_NUMSYMS exported symbols
B: uses one of A's symbols
C: adds KALLSYMS_SCALE_FACTOR * KALLSYSMS_NUMSYMS exported
D: adds 2 * the symbols than C

By using anything much larger than KALLSYSMS_NUMSYMS as 10,000 and
KALLSYMS_SCALE_FACTOR of 8 we segfault today. So we're capped at
around 16 symbols somehow today. We can inpsect that issue at
our leasure later, but for now the real value to this test is that
this will easily allow us to test improvements on find_symbol().

On x86_64 we can use perf, for other architectures we just use 'time'
and allow for customizations. For example a future enhancements could
be done for parisc to check for unaligned accesses which triggers a
special special exception handler assembler code inside the kernel.
The negative impact on performance is so large on parisc that it
keeps track of its accesses on /proc/cpuinfo as UAH:

IRQ:   CPU0   CPU1
3:   1332  0 SuperIO  ttyS0
7:1270013  0 SuperIO  pata_ns87415
64:  320023012  320021431 CPU  timer
65:   17080507   20624423 CPU  IPI
UAH:   10948640  58104   Unaligned access handler traps

While at it, this tidies up lib/ test modules to allow us to have
a new directory for them. The amount of test modules under lib/
is insane.

This should also hopefully showcase how to start doing basic
self module writing code, which may be more useful for more complex
cases later in the future.

Signed-off-by: Luis Chamberlain 
---
 lib/Kconfig.debug | 103 ++
 lib/Makefile  |   1 +
 lib/tests/Makefile|   1 +
 lib/tests/module/.gitignore   |   4 +
 lib/tests/module/Makefile |  15 ++
 lib/tests/module/gen_test_kallsyms.sh | 128 ++
 tools/testing/selftests/module/Makefile   |  12 ++
 tools/testing/selftests/module/config |   3 +
 tools/testing/selftests/module/find_symbol.sh |  81 +++
 9 files changed, 348 insertions(+)
 create mode 100644 lib/tests/Makefile
 create mode 100644 lib/tests/module/.gitignore
 create mode 100644 lib/tests/module/Makefile
 create mode 100755 lib/tests/module/gen_test_kallsyms.sh
 create mode 100644 tools/testing/selftests/module/Makefile
 create mode 100644 tools/testing/selftests/module/config
 create mode 100755 tools/testing/selftests/module/find_symbol.sh

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 97ce28f4d154..29db47ca251f 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2835,6 +2835,109 @@ config TEST_KMOD
 
  If unsure, say N.
 
+config TEST_RUNTIME
+   bool
+
+config TEST_RUNTIME_MODULE
+   bool
+
+config TEST_KALLSYMS
+   tristate "module kallsyms find_symbol() test"
+   depends on m
+   select TEST_RUNTIME
+   select TEST_RUNTIME_MODULE
+   select TEST_KALLSYMS_A
+   select TEST_KALLSYMS_B
+   select TEST_KALLSYMS_C
+   select TEST_KALLSYMS_D
+   help
+ This allows us to stress test find_symbol() through the kallsyms
+ used to place symbols on the kernel ELF kallsyms and modules kallsyms
+ where we place kernel symbols such as exported symbols.
+
+ We have four test modules:
+
+ A: has KALLSYSMS_NUMSYMS exported symbols
+ B: uses one of A's symbols
+ C: adds KALLSYMS_SCALE_FACTOR * KALLSYSMS_NUMSYMS exported
+ D: adds 2 * the symbols than C
+
+ We stress test find_symbol() through two means:
+
+ 1) Upon load of B it will trigger simplify_symbols() to look for the
+ one symbol it uses from the module A with tons of symbols. This is an
+ indirect way for us to have B call resolve_symbol_wait() upon module
+ load. This will eventually call find_symbol() which will eventually
+ try to find the symbols used with find_exported_symbol_in_section().
+ find_exported_symbol_in_section() uses bsearch() so a binary search
+ for each symbol. Binary search will at worst be O(log(n)) so the
+ larger TEST_MODULE_KALLSYSMS the worse the search.
+
+ 2) The selftests should load C first, before B. Upon B's load towards
+ the end right before we call module B's init routine we get
+ complete_formation() called on the module. That will first check
+ for duplicate symbols with the call to verify_exported_symbols().
+ That is when we'll force iteration on module C's insane symbol list.
+ Since it has 10 * KALLSYMS_NUMSYMS it me

[PATCH v2 4/4] modules: Add missing entry for __ex_table

2024-01-29 Thread Luis Chamberlain
From: Helge Deller 

The entry for __ex_table was missing, which may make __ex_table
become 1- or 2-byte aligned in modules.
Add the entry to ensure it gets 32-bit aligned.

Signed-off-by: Helge Deller 
Signed-off-by: Luis Chamberlain 
---
 scripts/module.lds.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/scripts/module.lds.S b/scripts/module.lds.S
index b00415a9ff27..488f61b156b2 100644
--- a/scripts/module.lds.S
+++ b/scripts/module.lds.S
@@ -26,6 +26,7 @@ SECTIONS {
.altinstructions0 : ALIGN(8) { KEEP(*(.altinstructions)) }
__bug_table 0 : ALIGN(8) { KEEP(*(__bug_table)) }
__jump_table0 : ALIGN(8) { KEEP(*(__jump_table)) }
+   __ex_table  0 : ALIGN(4) { KEEP(*(__ex_table)) }
 
__patchable_function_entries : { *(__patchable_function_entries) }
 
-- 
2.42.0




[PATCH v2 0/3] modules: few of alignment fixes

2024-01-29 Thread Luis Chamberlain
Helge Deller had written a series of patches to fix a few 
misalignemnt annotations in the kernel [0]. Three of these
patches were tagged as being stable candidates. Because of these
annotation I suggested proof of the imapact, however we did not
easily have a way to verify the value / impact of fixing a few
misaligment changes in the kernel for modules.

For the more hotpath alignment fix Helge had I suggested we could
easily test this by stress testing find_symbol() with a few set of
modules. This adds such tests to allow us to both allow us to test
the impact of such possible fix, and also while at it, let's us
test future improvements on the find_symbol() path.

Changes on this v2:

 - Adds new selftest for kallsyms
 - Drops patch #1 as Masahiro Yamada already applied it to linux-kbuild/fixes
 - Removes stable tags
 - Drops patch #3 as it was not needed
 - Adds a new patch with the issues noted by Helge as a fix
   to commit f3304ecd7f06 ("linux/export: use inline assembler to
   populate symbol CRCs") as noted by Masahiro Yamada
 - Adds selftest impact on x86_64 for patch #2, this really should
   be tested on parisc though as that's an example architecture
   where we could see perhaps more improvement

[0] https://lkml.kernel.org/r/2023111814.139916-1-del...@kernel.org

Masahiro, if there no issues feel free to take this or I can take them in
too via the modules-next tree. Lemme know!

Helge Deller (2):
  modules: Ensure 64-bit alignment on __ksymtab_* sections
  modules: Add missing entry for __ex_table

Luis Chamberlain (2):
  selftests: add new kallsyms selftests
  vmlinux.lds.h: add missing alignment for symbol CRCs

 include/linux/export-internal.h   |   1 +
 lib/Kconfig.debug | 103 ++
 lib/Makefile  |   1 +
 lib/tests/Makefile|   1 +
 lib/tests/module/.gitignore   |   4 +
 lib/tests/module/Makefile |  15 ++
 lib/tests/module/gen_test_kallsyms.sh | 128 ++
 scripts/module.lds.S  |   9 +-
 tools/testing/selftests/module/Makefile   |  12 ++
 tools/testing/selftests/module/config |   3 +
 tools/testing/selftests/module/find_symbol.sh |  81 +++
 11 files changed, 354 insertions(+), 4 deletions(-)
 create mode 100644 lib/tests/Makefile
 create mode 100644 lib/tests/module/.gitignore
 create mode 100644 lib/tests/module/Makefile
 create mode 100755 lib/tests/module/gen_test_kallsyms.sh
 create mode 100644 tools/testing/selftests/module/Makefile
 create mode 100644 tools/testing/selftests/module/config
 create mode 100755 tools/testing/selftests/module/find_symbol.sh

-- 
2.42.0




[PATCH v2 3/4] vmlinux.lds.h: add missing alignment for symbol CRCs

2024-01-29 Thread Luis Chamberlain
Commit f3304ecd7f06 (linux/export: use inline assembler to populate
symbol CRCs") fixed an issue with unexpected padding  by adding
asm inline assembly to directly specify the desired data layout.
It however forgot to add the alignment, fix that.

Reported-by: Helge Deller 
Fixes: f3304ecd7f06 ("linux/export: use inline assembler to populate symbol 
CRCs")
Signed-off-by: Luis Chamberlain 
---
 include/linux/export-internal.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/export-internal.h b/include/linux/export-internal.h
index 69501e0ec239..51b8cf3f60ef 100644
--- a/include/linux/export-internal.h
+++ b/include/linux/export-internal.h
@@ -61,6 +61,7 @@
 
 #define SYMBOL_CRC(sym, crc, sec)   \
asm(".section \"___kcrctab" sec "+" #sym "\",\"a\"" "\n" \
+   ".balign 4" "\n" \
"__crc_" #sym ":"   "\n" \
".long " #crc   "\n" \
".previous" "\n")
-- 
2.42.0




[PATCH v2 2/4] modules: Ensure 64-bit alignment on __ksymtab_* sections

2024-01-29 Thread Luis Chamberlain
From: Helge Deller 

On 64-bit architectures without CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
(e.g. ppc64, ppc64le, parisc, s390x,...) the __KSYM_REF() macro stores
64-bit pointers into the __ksymtab* sections.
Make sure that those sections will be correctly aligned at module link time,
otherwise unaligned memory accesses may happen at runtime.

The __kcrctab* sections store 32-bit entities, so use ALIGN(4) for those.

Testing with the kallsyms selftest on x86_64 we see a savings of about
1,958,153 ns in the worst case which may or not be within noise. Testing
on parisc would be useful and welcomed.

On x86_64 before:

 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

86,430,119 ns   duration_time
84,407,000 ns   system_time
   213  page-faults

   0.086430119 seconds time elapsed

   0.0 seconds user
   0.084407000 seconds sys

 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

85,777,474 ns   duration_time
82,581,000 ns   system_time
   212  page-faults

   0.085777474 seconds time elapsed

   0.0 seconds user
   0.082581000 seconds sys

 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

87,906,053 ns   duration_time
87,939,000 ns   system_time
   212  page-faults

   0.087906053 seconds time elapsed

   0.0 seconds user
   0.087939000 seconds sys

After:

 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

82,925,631 ns   duration_time
83,000,000 ns   system_time
   212  page-faults

   0.082925631 seconds time elapsed

   0.0 seconds user
   0.08300 seconds sys

 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

87,776,380 ns   duration_time
86,678,000 ns   system_time
   213  page-faults

   0.087776380 seconds time elapsed

   0.0 seconds user
   0.086678000 seconds sys

 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

85,947,900 ns   duration_time
82,006,000 ns   system_time
   212  page-faults

   0.085947900 seconds time elapsed

   0.0 seconds user
   0.082006000 seconds sys

Signed-off-by: Helge Deller 
[mcgrof: ran kallsyms selftests on x86_64]
Signed-off-by: Luis Chamberlain 
---
 scripts/module.lds.S | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/scripts/module.lds.S b/scripts/module.lds.S
index bf5bcf2836d8..b00415a9ff27 100644
--- a/scripts/module.lds.S
+++ b/scripts/module.lds.S
@@ -15,10 +15,10 @@ SECTIONS {
*(.discard.*)
}
 
-   __ksymtab   0 : { *(SORT(___ksymtab+*)) }
-   __ksymtab_gpl   0 : { *(SORT(___ksymtab_gpl+*)) }
-   __kcrctab   0 : { *(SORT(___kcrctab+*)) }
-   __kcrctab_gpl   0 : { *(SORT(___kcrctab_gpl+*)) }
+   __ksymtab   0 : ALIGN(8) { *(SORT(___ksymtab+*)) }
+   __ksymtab_gpl   0 : ALIGN(8) { *(SORT(___ksymtab_gpl+*)) }
+   __kcrctab   0 : ALIGN(4) { *(SORT(___kcrctab+*)) }
+   __kcrctab_gpl   0 : ALIGN(4) { *(SORT(___kcrctab_gpl+*)) }
 
.ctors  0 : ALIGN(8) { *(SORT(.ctors.*)) *(.ctors) }
.init_array 0 : ALIGN(8) { *(SORT(.init_array.*)) 
*(.init_array) }
-- 
2.42.0




Re: [PATCH 4/4] modules: Add missing entry for __ex_table

2024-01-29 Thread Luis Chamberlain
On Wed, Nov 22, 2023 at 11:18:14PM +0100, del...@kernel.org wrote:
> From: Helge Deller 
> 
> The entry for __ex_table was missing, which may make __ex_table
> become 1- or 2-byte aligned in modules.
> Add the entry to ensure it gets 32-bit aligned.
> 
> Signed-off-by: Helge Deller 
> Cc:  # v6.0+

Cc'ing stable was overkill, I'll remove it.

  Luis

> ---
>  scripts/module.lds.S | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/scripts/module.lds.S b/scripts/module.lds.S
> index b00415a9ff27..488f61b156b2 100644
> --- a/scripts/module.lds.S
> +++ b/scripts/module.lds.S
> @@ -26,6 +26,7 @@ SECTIONS {
>   .altinstructions0 : ALIGN(8) { KEEP(*(.altinstructions)) }
>   __bug_table 0 : ALIGN(8) { KEEP(*(__bug_table)) }
>   __jump_table0 : ALIGN(8) { KEEP(*(__jump_table)) }
> + __ex_table  0 : ALIGN(4) { KEEP(*(__ex_table)) }
>  
>   __patchable_function_entries : { *(__patchable_function_entries) }
>  
> -- 
> 2.41.0
> 
> 



Re: [RESEND PATCH v2] modules: wait do_free_init correctly

2024-01-29 Thread Luis Chamberlain
On Mon, Jan 29, 2024 at 10:03:04AM +0800, Changbin Du wrote:
> The commit 1a7b7d922081 ("modules: Use vmalloc special flag") moves
> do_free_init() into a global workqueue instead of call_rcu(). So now
> rcu_barrier() can not ensure that do_free_init has completed. We should
> wait it via flush_work().
> 
> Without this fix, we still could encounter false positive reports in
> W+X checking, and rcu synchronization is unnecessary.

You didn't answer my question, which should be documented in the commit log.

Does this mean we never freed modules init because of this? If so then
your commit log should clearly explain that. It should also explain that
if true (you have to verify) then it means we were no longer saving
the memory we wished to save, and that is important for distributions
which do want to save anything on memory. You may want to do a general
estimate on how much that means these days on any desktop / server.

  Luis



Re: [PATCH 2/4] modules: Ensure 64-bit alignment on __ksymtab_* sections

2024-01-22 Thread Luis Chamberlain
On Mon, Jan 22, 2024 at 05:47:49PM +0100, Helge Deller wrote:
> On 1/22/24 17:10, Luis Chamberlain wrote:
> > 
> > It's within the noise for x86_64, but given what you suggest
> > for parisc where it is much more expensive, we should see a non-noise
> > delta. Even just time on loading the module should likely result in
> > a considerable delta than on x86_64. You may just need to play a bit
> > with the default values at build time.
> 
> I don't know if it will be a "considerable" amount of time.

There are variables which you can tune, so given what you suggest
it should be possible to get to that spot rather easily with a few
changes to the default values I think.

> > If you don't feel like doing that test that's fine too, we can just
> > ignore that.
> 
> I can do that test, but I won't have time for that in the next few weeks...

I would appreciate it!

  Luis



Re: [PATCH 2/4] modules: Ensure 64-bit alignment on __ksymtab_* sections

2024-01-22 Thread Luis Chamberlain
On Sat, Dec 30, 2023 at 08:33:24AM +0100, Helge Deller wrote:
> Your selftest code is based on perf.
> AFAICS we don't have perf on parisc/hppa, 

I see!

> so I can't test your selftest code
> on that architecture.
> I assume you tested on x86, where the CPU will transparently take care of
> unaligned accesses. This is probably why the results are within
> the noise.
> But on some platforms the CPU raises an exception on unaligned accesses
> and jumps into special exception handler assembler code inside the kernel.
> This is much more expensive than on x86, which is why we track on parisc
> in /proc/cpuinfo counters on how often this exception handler is called:
> IRQ:   CPU0   CPU1
>   3:   1332  0 SuperIO  ttyS0
>   7:1270013  0 SuperIO  pata_ns87415
>  64:  320023012  320021431 CPU  timer
>  65:   17080507   20624423 CPU  IPI
> UAH:   10948640  58104   Unaligned access handler traps
> 
> This "UAH" field could theoretically be used to extend your selftest.

Nice!

> But is it really worth it? The outcome is very much architecture and CPU
> specific, maybe it's just within the noise as you measured.

It's within the noise for x86_64, but given what you suggest
for parisc where it is much more expensive, we should see a non-noise
delta. Even just time on loading the module should likely result in
a considerable delta than on x86_64. You may just need to play a bit
with the default values at build time.

> IMHO we should always try to natively align structures, and if we see
> we got it wrong in kernel code, we should fix it.

This was all motivated by the first review criteria of these patches
as if they were stable worthy or not. Even if we don't consider them
stable material, given the test is now written and easily extended to
test on parisc with just timing information and UAH I think it would
be nice to have this data for a few larger default factor values so we
can compare against x86_64 while we're at it.

If you don't feel like doing that test that's fine too, we can just
ignore that. I'll still apply the patches but, I figured I'd ask to
collect information while the test was already written and it should
now be easy to compare / contrast differences.

  Luis



[GIT PULL] Modules changes for v6.8-rc1

2024-01-09 Thread Luis Chamberlain
The following changes since commit ceb6a6f023fd3e8b07761ed900352ef574010bcb:

  Linux 6.7-rc6 (2023-12-17 15:19:28 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/ 
tags/modules-6.8-rc1

for you to fetch changes up to 4515d08a742c76612b65d2f47a87d12860519842:

  kernel/module: improve documentation for try_module_get() (2023-12-21 
10:26:14 -0800)


Modules changes for v6.8-rc1

Just one cleanup and one documentation improvement change. No functional
changes. However, this has been tested on linux-next for over 1 month.


Kevin Hao (1):
  module: Remove redundant TASK_UNINTERRUPTIBLE

Marco Pagani (1):
  kernel/module: improve documentation for try_module_get()

 include/linux/module.h | 2 +-
 kernel/module/dups.c   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)



Re: [PATCH 1/4] linux/export: Fix alignment for 64-bit ksymtab entries

2023-12-22 Thread Luis Chamberlain
On Fri, Dec 22, 2023 at 04:01:30PM +0900, Masahiro Yamada wrote:
> On Fri, Dec 22, 2023 at 3:08 PM Luis Chamberlain  wrote:
> >
> > On Thu, Dec 21, 2023 at 10:07:13PM -0800, Luis Chamberlain wrote:
> > >
> > > If we want to go bananas we could even get a graph of size of modules
> >
> > Sorry I meant size of number of symbols Vs cost.
> >
> >  Luis
> 
> 
> 
> But, 1/4 is really a bug-fix, isn't it?

Ah you mean a regression fix, yeah sure, thanks I see !

 Luis



Re: [PATCH 2/4] modules: Ensure 64-bit alignment on __ksymtab_* sections

2023-12-22 Thread Luis Chamberlain
On Fri, Dec 22, 2023 at 01:13:26PM +0100, Helge Deller wrote:
> Hi Luis,
> 
> On 12/22/23 06:59, Luis Chamberlain wrote:
> > On Wed, Nov 22, 2023 at 11:18:12PM +0100, del...@kernel.org wrote:
> > > From: Helge Deller 
> > > 
> > > On 64-bit architectures without CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
> > > (e.g. ppc64, ppc64le, parisc, s390x,...) the __KSYM_REF() macro stores
> > > 64-bit pointers into the __ksymtab* sections.
> > > Make sure that those sections will be correctly aligned at module link 
> > > time,
> > > otherwise unaligned memory accesses may happen at runtime.
> > 
> > The ramifications are not explained there.
> 
> What do you mean exactly by this?

You don't explain the impact of not applying this patch.

> > You keep sending me patches with this and we keep doing a nose dive
> > on this. It means I have to do more work.
> Sorry about that. Since you are mentioned as maintainer for modules I
> thought you are the right contact.

I am certaintly the right person to send this patch to, however, I am
saying that given our previous dialog I would like the commit log to
describe a bit more detail of a few things:

 * how you found this
 * what are the impact of not applying it

Specially if you are going to be sending patches right before the
holidays with a Cc stable fix annotation. This gets me on hight alert
and I have to put things down to see how really critical this is so to
decide to fast track this to Linus or not.

> Furthermore, this patch is pretty small and trivial.

Many patches can be but some can break things and some can be small but
also improve things critically, for example if we are aware of broken
exception handlers.

> And I had the impression that people understand that having unaligned
> structures is generally a bad idea as it usually always impacts performance
> (although the performance penalty in this case isn't critical at all,
> as we are not on hot paths).

There are two aspects to this, one is the critical nature which is
implied by your patch which pegs it as a stable fix, given you prior
patches about this it leaves me wondering if it is fixing some crashes
on some systems given faulty exception handlers.

The performance thing is also subjective, but it could not be subjective
by doing some very trivial tests as I suggested. Such a test would also
help as we lack specific selftsts for this case and we can grow it later
with other things. I figured I'd ask you to try it, since *you* keep
sending patches about misalignments on module sections so I figured
you must be seeing something others are not, and so you must care.

> > Thoughts?
> 
> I really appreciate your thoughts here!
> 
> But given the triviality of this patch, I personally think you are
> proposing a huge academic investment in time and resources with no real 
> benefit.
> On which architecture would you suggest to test this?
> What would be the effective (really new) outcome?
> If the performance penalty is zero or close to zero, will that imply
> that such a patch isn't useful?
> Please also keep in mind that not everyone gets paid to work on the kernel,
> so I have neither the time nor the various architectures to test on.

I think you make this seem so difficult, but I understand it could seem
that way. I've attached at the end of this email a patch that does just
this then to help.

> So, honestly I don't see a real reason why it shouldn't be applied...

Like I said, you Cc'd stable as a fix, as a maintainer it is my job to
verify how critical this is and ask for more details about how you found
it and evaluate the real impact. Even if it was not a stable fix I tend
to ask this for patches, even if they are trivial.

> > > Signed-off-by: Helge Deller 
> > > Cc:  # v6.0+
> > 
> > That's a stretch without any data, don't you think?
> 
> Yes. No need to push such a patch to stable series.

OK, can you extend the patch below with something like:

perf stat --repeat 100 --pre 'modprobe -r b a b c' -- 
./tools/testing/selftests/module/find_symbol.sh

And test before and after?

I ran a simple test as-is and the data I get is within noise, and so
I think we need the --repeat 100 thing.

---
before:
sudo ./tools/testing/selftests/module/find_symbol.sh 

 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

81,956,206 ns   duration_time   
  
81,883,000 ns   system_time 
  
   210  page-faults 
  

   0.081956206 seconds time elapsed

   0.0 seconds user
   0.081

Re: [PATCH 1/4] linux/export: Fix alignment for 64-bit ksymtab entries

2023-12-21 Thread Luis Chamberlain
On Thu, Dec 21, 2023 at 10:07:13PM -0800, Luis Chamberlain wrote:
> 
> If we want to go bananas we could even get a graph of size of modules

Sorry I meant size of number of symbols Vs cost.

 Luis



Re: [PATCH 1/4] linux/export: Fix alignment for 64-bit ksymtab entries

2023-12-21 Thread Luis Chamberlain
On Fri, Dec 22, 2023 at 01:01:23AM +0900, Masahiro Yamada wrote:
> On Thu, Dec 21, 2023 at 7:22 PM Masahiro Yamada  wrote:
> >
> > On Thu, Nov 23, 2023 at 7:18 AM  wrote:
> > >
> > > From: Helge Deller 
> > >
> > > An alignment of 4 bytes is wrong for 64-bit platforms which don't define
> > > CONFIG_HAVE_ARCH_PREL32_RELOCATIONS (which then store 64-bit pointers).
> > > Fix their alignment to 8 bytes.
> > >
> > > Signed-off-by: Helge Deller 
> >
> >
> > This is correct.
> >
> > Acked-by: Masahiro Yamada 
> >
> > Please add
> >
> >
> > Fixes: ddb5cdbafaaa ("kbuild: generate KSYMTAB entries by modpost")
> >
> >
> 
> 
> If there is no objection, I will pick this up
> to linux-kbuild/fixes.

The new selftests I've suggested should help get perf data to cover both
modules and built-in kernel symbols given find_symbol() will first hit
built-in symbols first before modules with one caveat: we'd want to extend
the selftest with a part which builds a module built-in with also tons
of other symbols.

So I'm all for you taking this but I don't think we need to rush for the
same reasons I mentioned in my reply to Helge.

I think it would be nice to get real perf data with perf stat as I
suggested, and include that in the commit logs. I think it would also be
useful to include a description about the fact that there is no real fix
and that the performance hit is all that happens as the architecture
just emulates the aligment. In the worst case, if exception handlers
are broken we could crash but that is rare although it does happen.

If we want to go bananas we could even get a graph of size of modules
Vs cost on misaligment as a relationship with time. Without this, frankly
cost on "performance" is artificial.

Thoughts?

  Luis



Re: [PATCH 2/4] modules: Ensure 64-bit alignment on __ksymtab_* sections

2023-12-21 Thread Luis Chamberlain
On Wed, Nov 22, 2023 at 11:18:12PM +0100, del...@kernel.org wrote:
> From: Helge Deller 
> 
> On 64-bit architectures without CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
> (e.g. ppc64, ppc64le, parisc, s390x,...) the __KSYM_REF() macro stores
> 64-bit pointers into the __ksymtab* sections.
> Make sure that those sections will be correctly aligned at module link time,
> otherwise unaligned memory accesses may happen at runtime.

The ramifications are not explained there. You keep sending me patches
with this and we keep doing a nose dive on this. It means I have to do
more work. So as I had suggested with your patch which I merged in
commit 87c482bdfa79 ("modules: Ensure natural alignment for
.altinstructions and __bug_table sections") please clarify the
impact of not merging this patch. Also last time you noticed the
misalignment due to a faulty exception handler, please mention how
you found this out now.

And since this is not your first patch on the exact related subject
I'd appreciate if you can show me perf stat results differences between
having and not having this patch merged. Why? Because we talk about
a performance penalthy, but we are not saying how much, and since this
is an ongoing thing, we might as well have a tool belt with ways to
measure such performance impact to bring clarity and value to this
and future related patches.

> The __kcrctab* sections store 32-bit entities, so use ALIGN(4) for those.

I've given some thought about how to test this. Sadly perf kallsysms
just opens the /proc/kallsysms file, but that's fine, we need our own
test.

I think a 3 new simple modules selftest would do it and running perf
stat on it. One module, let us call it module A which constructs its own
name space prefix for its exported symbols and has tons of silly symbols
for arbitrary data, whatever. We then have module B which refers to a
few arbitrary symbols from module A, hopefully spread out linearly, so
if module A had 10,000 symbols, we'd have module A refer to a symbol
ever 1,000 symbols. Finally we want a symbol C which has say, 50,000
symbols all of which will not be used at all by the first two modules,
but the selftest will load module C first, prior to calling modprobe B.

We'll stress test this way two calls which use find_symbol():

1) Upon load of B it will trigger simplify_symbols() to look for the
symbol it uses from the module A with tons of symbols. That's an
indirect way for us to call resolve_symbol_wait() from module A without
having to use symbol_get() which want to remove as exported as it is
just a hack which should go away. Our goal is for us to test
resolve_symbol() which will call find_symbol() and that will eventually
look for the symbol on module A with:

  find_exported_symbol_in_section()

That uses bsearch() so a binary search for the symbol and we'd end up
hitting the misalignments here. Binary search will at worst be O(log(n))
and so the only way to aggreviate the search will be to add tons of
symbols to A, and have B use a few of them.

2) When you load B, userspace will at first load A as depmod will inform
userspace A goes before B. Upon B's load towards the end right before
we call module B's init routine we get complete_formation() called on
the module. That will first check for duplicate symbols with the call
to verify_exported_symbols(). That is when we'll force iteration on
module C's insane symbol list.

The selftests just runs

perf stat -e pick-your-poison-for-misalignments 
tools/testing/selftests/kmod/ksymtab.sh

Where ksymtab.sh is your new script which calls:

modprobe C
modprobe B

I say pick-your-poison-for-misalignments because I am not sure what is
best here.

Thoughts?

> Signed-off-by: Helge Deller 
> Cc:  # v6.0+

That's a stretch without any data, don't you think?

 Luis



Re: [PATCH] kernel/module: improve documentation for try_module_get()

2023-12-21 Thread Luis Chamberlain
On Thu, Dec 21, 2023 at 05:58:47PM +0100, Marco Pagani wrote:
> The sentence "this call will fail if the module is already being
> removed" is potentially confusing and may contradict the rest of the
> documentation. If one tries to get a module that has already been
> removed using a stale pointer, the kernel will crash.
> 
> Signed-off-by: Marco Pagani 

Thanks, patch applied!

  Luis



Re: [PATCH 0/4] Section alignment issues?

2023-12-20 Thread Luis Chamberlain
On Tue, Dec 19, 2023 at 01:26:49PM -0800, Luis Chamberlain wrote:
> On Wed, Nov 22, 2023 at 11:18:10PM +0100, del...@kernel.org wrote:
> > From: Helge Deller 
> > My questions:
> > - Am I wrong with my analysis?
> 
> This would typically of course depend on the arch, but whether it helps
> is what I would like to see with real numbers rather then speculation.
> Howeer, I don't expect some of these are hot paths except maybe the
> table lookups. So could you look at some perf stat differences
> without / with alignment ? Other than bootup live patching would be
> a good test case. We have selftest for modules, the script in selftests
> tools/testing/selftests/kmod/kmod.sh is pretty aggressive, but the live
> patching tests might be better suited.
> 
> > - What does people see on other architectures?
> > - Does it make sense to add a compile- and runtime-check, like the patch 
> > below, to the kernel?
> 
> The chatty aspects really depend on the above results.
> 
> Aren't there some archs where an unaligned access would actually crash?
> Why hasn't that happened?

I've gone down through memory lane and we have discussed this before.

It would seem this misalignment should not affect performance, and this
should not be an issue unless you have a buggy exception hanlder. We
actually ran into one before. Please refer to merge commit

e74acdf55da6649dd30be5b621a93b71cbe7f3f9

  Luis



Re: [PATCH] modules: wait do_free_init correctly

2023-12-20 Thread Luis Chamberlain
On Wed, Dec 20, 2023 at 01:27:51PM +0800, Changbin Du wrote:
> On Tue, Dec 19, 2023 at 01:52:03PM -0800, Luis Chamberlain wrote:
> > On Tue, Dec 19, 2023 at 12:51:51PM -0800, Andrew Morton wrote:
> > > On Tue, 19 Dec 2023 22:12:31 +0800 Changbin Du  
> > > wrote:
> > > 
> > > > The commit 1a7b7d922081 ("modules: Use vmalloc special flag") moves
> > > > do_free_init() into a global workqueue instead of call_rcu(). So now
> > > > we should wait it via flush_work().
> > > 
> > > What are the runtime effects of this change?
> > 
> > Indeed that's needed given how old this culprit commit is:
> > 
> > git describe --contains 1a7b7d922081
> > v5.2-rc1~192^2~5
> > 
> > Who did this work and for what reason? What triggered this itch?
> >
> Seems the waiting was introduced by commit ae646f0b9ca ("init: fix false 
> positives
> in W+X checking").
> 
> As what I have observed, mark_readonly() is only invoked by the first user 
> mode
> thread function kernel_init(), which is before userspace /init. So is it real
> possible we have loaded modules at this point?

Are you saying we don't free any module inits at all then? I asked a lot
of questions and your answers seem slim.

How did you find this?
What actual impact does this have without the patch?

The commit must document this.

  Luis



Re: [PATCH] modules: wait do_free_init correctly

2023-12-19 Thread Luis Chamberlain
On Tue, Dec 19, 2023 at 12:51:51PM -0800, Andrew Morton wrote:
> On Tue, 19 Dec 2023 22:12:31 +0800 Changbin Du  wrote:
> 
> > The commit 1a7b7d922081 ("modules: Use vmalloc special flag") moves
> > do_free_init() into a global workqueue instead of call_rcu(). So now
> > we should wait it via flush_work().
> 
> What are the runtime effects of this change?

Indeed that's needed given how old this culprit commit is:

git describe --contains 1a7b7d922081
v5.2-rc1~192^2~5

Who did this work and for what reason? What triggered this itch?

Is it perhaps for an out of tree driver that did something funky
on its module exit?

As per Documentation/RCU/rcubarrier.rst rcu_barrier will ensure the
callbacks complete, so interms of determinism both mechanisms will
have waited for the free. It seems we're now just limiting the scope.

This could also mean initialization grew used to having RCU calls on
init complete at this point in time, even for modules, and so localizing
this wait may now also introduce other unexpected behaviour.

  Luis



Re: [PATCH 0/4] Section alignment issues?

2023-12-19 Thread Luis Chamberlain
On Wed, Nov 22, 2023 at 11:18:10PM +0100, del...@kernel.org wrote:
> From: Helge Deller 
> My questions:
> - Am I wrong with my analysis?

This would typically of course depend on the arch, but whether it helps
is what I would like to see with real numbers rather then speculation.
Howeer, I don't expect some of these are hot paths except maybe the
table lookups. So could you look at some perf stat differences
without / with alignment ? Other than bootup live patching would be
a good test case. We have selftest for modules, the script in selftests
tools/testing/selftests/kmod/kmod.sh is pretty aggressive, but the live
patching tests might be better suited.

> - What does people see on other architectures?
> - Does it make sense to add a compile- and runtime-check, like the patch 
> below, to the kernel?

The chatty aspects really depend on the above results.

Aren't there some archs where an unaligned access would actually crash?
Why hasn't that happened?

  Luis



Re: [PATCH] init: move THIS_MODULE from to

2023-12-04 Thread Luis Chamberlain
On Sun, Nov 26, 2023 at 04:19:14PM +0900, Masahiro Yamada wrote:
> Commit f50169324df4 ("module.h: split out the EXPORT_SYMBOL into
> export.h") appropriately separated EXPORT_SYMBOL into 
> because modules and EXPORT_SYMBOL are orthogonal; modules are symbol
> consumers, while EXPORT_SYMBOL are used by symbol providers, which
> may not be necessarily a module.
> 
> However, that commit also relocated THIS_MODULE. As explained in the
> commit description, the intention was to define THIS_MODULE in a
> lightweight header, but I do not believe  was the
> suitable location because EXPORT_SYMBOL and THIS_MODULE are unrelated.
> 
> Move it to another lightweight header, . The reason for
> choosing  is to make  self-contained
> without relying on  incorrectly including
> .
> 
> With this adjustment, the role of  becomes clearer as
> it only defines EXPORT_SYMBOL.
> 
> Signed-off-by: Masahiro Yamada 

Reviewed-by: Luis Chamberlain 

Do you want this this to go through modules-next or your tree? I'm fine
it goes either way.

  Luis



Re: [PATCH 2/3] modpost: Extended modversion support

2023-11-16 Thread Luis Chamberlain
On Wed, Nov 15, 2023 at 06:50:10PM +, Matthew Maurer wrote:
> Adds a new format for modversions which stores each field in a separate
> elf section.

The "why" is critical and not mentioned. And I'd like to also see
documented this with foresight, if Rust needed could this be used
in the future for other things?

Also please include folks CC'd in *one* patch to *all* patches as
otherwise we have no context.

> This initially adds support for variable length names, but
> could later be used to add additional fields to modversions in a
> backwards compatible way if needed.
> 
> Adding support for variable length names makes it possible to enable
> MODVERSIONS and RUST at the same time.
> 
> Signed-off-by: Matthew Maurer 
> ---
>  arch/powerpc/kernel/module_64.c | 24 +-

Why was only powerpc modified? If the commit log explained this it would
make it easier for review.

> diff --git a/kernel/module/internal.h b/kernel/module/internal.h
> index c8b7b4dcf782..0c188c96a045 100644
> --- a/kernel/module/internal.h
> +++ b/kernel/module/internal.h
> @@ -80,7 +80,7 @@ struct load_info {
>   unsigned int used_pages;
>  #endif
>   struct {
> - unsigned int sym, str, mod, vers, info, pcpu;
> + unsigned int sym, str, mod, vers, info, pcpu, vers_ext_crc, 
> vers_ext_name;

We might as well modify this in a preliminary patch to add each new
unsinged int in a new line, so that it is easier to blame when each new
entry gets added. It should not grow the size of the struct at all but
it would make futur extensions easier to review what is new and git
blame easier to spot when something was added.

Although we don't use this extensively today this can easily grow for
convenience and making code easier to read.

> diff --git a/kernel/module/version.c b/kernel/module/version.c
> index 53f43ac5a73e..93d97dad8c77 100644
> --- a/kernel/module/version.c
> +++ b/kernel/module/version.c
> @@ -19,11 +19,28 @@ int check_version(const struct load_info *info,
>   unsigned int versindex = info->index.vers;
>   unsigned int i, num_versions;
>   struct modversion_info *versions;
> + struct modversion_info_ext version_ext;
>  
>   /* Exporting module didn't supply crcs?  OK, we're already tainted. */
>   if (!crc)
>   return 1;
>  
> + /* If we have extended version info, rely on it */
> + if (modversion_ext_start(info, _ext) >= 0) {

There are two things we need to do to make processing modules easier:

  1) ELF validation
  2) Once checked then process the information

We used to have this split up but also had a few places which did both
1) and 2) together. This was wrong and so I want to keep things tidy
and ensure we do things which validate the ELF separate. To that
end please put the checks to validate the ELF first so that we report
to users with a proper error/debug check in case the ELF is wrong,
this enables futher debug checks for that to be done instead of
confusing users who end up scratching their heads why something
failed.

So please split up the ELF validation check and put that into
elf_validity_cache_copy() which runs *earlier* than this.

Then *if* if has this, you just process it. Please take care to be
very pedantic in the elf_validity_cache_copy() and extend the checks
you have for validation in modversion_ext_start() and bring them to
elf_validity_cache_copy() with perhaps *more* stuff which does any
insane checks to verify it is 100% correct.

> + do {
> + if (strncmp(version_ext.name.value, symname,
> + version_ext.name.end - 
> version_ext.name.value) != 0)
> + continue;
> +
> + if (*version_ext.crc.value == *crc)
> + return 1;
> + pr_debug("Found checksum %X vs module %X\n",
> +  *crc, *version_ext.crc.value);
> + goto bad_version;
> + } while (modversion_ext_advance(_ext) == 0);

Can you do a for_each_foo()) type loop here instead after validation?
Because the validation would ensure your loop is bounded then. Look at
for_each_mod_mem_type() for inspiration.

> + goto broken_toolchain;

The broken toolchain thing would then be an issue reported in the
ELF validation.

> @@ -87,6 +105,65 @@ int same_magic(const char *amagic, const char *bmagic,
>   return strcmp(amagic, bmagic) == 0;
>  }
>  
> +#define MODVERSION_FIELD_START(sec, field) \
> + field.value = (typeof(field.value))sec.sh_addr; \
> + field.end = field.value + sec.sh_size
> +
> +ssize_t modversion_ext_start(const struct load_info *info,
> +  struct modversion_info_ext *start)
> +{
> + unsigned int crc_idx = info->index.vers_ext_crc;
> + unsigned int name_idx = info->index.vers_ext_name;
> + Elf_Shdr *sechdrs = info->sechdrs;
> +
> + // Both of these fields are needed for this to be 

[GIT PULL] Modules changes for v6.7-rc1

2023-11-01 Thread Luis Chamberlain
The following changes since commit 7d461b291e65938f15f56fe58da2303b07578a76:

  Merge tag 'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm 
(2023-11-01 06:28:35 -1000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/ 
tags/modules-6.7-rc1

for you to fetch changes up to ea0b0bcef4917a2640ecc100c768b8e785784834:

  module: Annotate struct module_notes_attrs with __counted_by (2023-11-01 
13:07:32 -0700)


Modules changes for v6.7-rc1

The only thing worth highligthing is that gzip moves to use vmalloc() instead of
kmalloc just as we had a fix for this for zstd on v6.6-rc1. The rest is regular
house keeping, keeping things neat, tidy, and boring.

Oh and this has been on linux-next for over a month.


Andrea Righi (1):
  module/decompress: use vmalloc() for gzip decompression workspace

Kees Cook (2):
  module: Clarify documentation of module_param_call()
  module: Annotate struct module_notes_attrs with __counted_by

Luis Chamberlain (1):
  MAINTAINERS: add include/linux/module*.h to modules

Tiezhu Yang (2):
  module: Make is_mapping_symbol() return bool
  module: Make is_valid_name() return bool

Zhu Mao (1):
  module: Fix comment typo

 MAINTAINERS   | 2 +-
 include/linux/module_symbol.h | 2 +-
 include/linux/moduleparam.h   | 6 +-
 kernel/module/decompress.c| 4 ++--
 kernel/module/stats.c | 2 +-
 kernel/module/sysfs.c | 2 +-
 scripts/mod/modpost.c | 4 ++--
 7 files changed, 13 insertions(+), 9 deletions(-)



Re: [PATCH 5/5] modules: only allow symbol_get of EXPORT_SYMBOL_GPL modules

2023-10-18 Thread Luis Chamberlain
On Wed, Oct 18, 2023 at 07:31:46AM +0200, Christoph Hellwig wrote:
> On Wed, Oct 18, 2023 at 01:30:18AM +0100, David Woodhouse wrote:
> > 
> > But if we're going to tolerate the core kernel still exporting some
> > stuff with EXPORT_SYMBOL, why isn't OK for a GPL-licensed module do to
> > the same? Even an *in-tree* GPL-licensed module now can't export
> > functionality with EXPORT_SYMBOL and have it used with symbol_get().
> 
> Anything using symbol_get is by intent very deeply internal for tightly
> coupled modules working together, and thus not a non-GPL export.
> 
> In fact the current series is just a stepping stone.  Once some mess
> in the kvm/vfio integration is fixed up we'll require a new explicit
> EXPORT_SYMBOL variant as symbol_get wasn't ever intended to be used
> on totally random symbols not exported for use by symbol_get.

The later patches in the series also show we could resolves most
uses through Kconfig and at build time, it really begs the question
if we even need it for any real valid uses.

  Luis



Re: [PATCH v3] module: Add CONFIG_MODULE_DISABLE_INIT_FREE option

2023-10-12 Thread Luis Chamberlain
On Thu, Oct 12, 2023 at 09:50:27AM -0700, Luis Chamberlain wrote:
> > In the kernel, selecting the CONFIG_MODULE_DISABLE_INIT_FREE
> > option disables the asynchronous freeing of init sections.
> 
> No it does not.

I take it back, your patch effectively only does this.

  Luis



Re: [PATCH v4] module: Add CONFIG_MODULE_DISABLE_INIT_FREE option

2023-10-12 Thread Luis Chamberlain
On Thu, Oct 12, 2023 at 07:17:19AM +0530, Joey Jiao wrote:
>  
> +config MODULE_DISABLE_INIT_FREE
> + bool "Disable freeing of init sections"
> + default n
> + help
> +   By default, kernel will free init sections after module being fully
> +   loaded.
> +
> +   MODULE_DISABLE_INIT_FREE allows users to prevent the freeing of init
> +   sections. This option is particularly helpful for syzkaller fuzzing,
> +   ensuring that the module consistently loads into the same address
> +   across reboots.

How and why does not free'ing init help with syzkaller exactly? I don't
see the relationship between not free'ing init and ensuring th emodule
loads into the same address. There could be many things which could
incur an address gets allocated from a module at another location which
a module can take. I cannot fathom how this simple toggle could ensure
modules following the address allocations accross reboots. That seems
like odd chance, not something actually deterministic.

> +
>  endif # MODULES
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index 98fedfdb8db5..0f242b7b29fe 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -2593,7 +2593,8 @@ static noinline int do_init_module(struct module *mod)
>* be cleaned up needs to sync with the queued work - ie
>* rcu_barrier()
>*/
> - if (llist_add(>node, _free_list))
> + if (llist_add(>node, _free_list) &&
> + !IS_ENABLED(CONFIG_MODULE_DISABLE_INIT_FREE))
>   schedule_work(_free_wq);





Re: [PATCH v3] module: Add CONFIG_MODULE_DISABLE_INIT_FREE option

2023-10-12 Thread Luis Chamberlain
On Thu, Oct 12, 2023 at 07:10:11AM +0530, Joey Jiao wrote:
> To facilitate syzkaller test, it's essential for the module to retain the same
> address across reboots.

Why?

> In userspace, the execution of modprobe commands must
> occur sequentially.

Why?

> In the kernel, selecting the CONFIG_MODULE_DISABLE_INIT_FREE
> option disables the asynchronous freeing of init sections.

No it does not.

> Signed-off-by: Joey Jiao 
> ---
>  kernel/module/Kconfig | 8 
>  kernel/module/main.c  | 5 +++--
>  2 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig
> index 33a2e991f608..1cdbee4c51de 100644
> --- a/kernel/module/Kconfig
> +++ b/kernel/module/Kconfig
> @@ -389,4 +389,12 @@ config MODULES_TREE_LOOKUP
>   def_bool y
>   depends on PERF_EVENTS || TRACING || CFI_CLANG
>  
> +config MODULE_DISABLE_INIT_FREE
> + bool "Disable freeing of init sections"
> + default n
> + help
> +   Allows users to prevent the freeing of init sections. This option is
> +   particularly helpful for syzkaller fuzzing, ensuring that the module
> +   consistently loads into the same address across reboots.
> +
>  endif # MODULES
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index 98fedfdb8db5..a5210b90c078 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -2593,8 +2593,9 @@ static noinline int do_init_module(struct module *mod)
>* be cleaned up needs to sync with the queued work - ie
>* rcu_barrier()
>*/
> - if (llist_add(>node, _free_list))
> - schedule_work(_free_wq);
> + if (llist_add(>node, _free_list) &&
> + !IS_ENABLED(CONFIG_MODULE_DISABLE_INIT_FREE))
> + schedule_work(_free_wq);

llist_add() returns true if the list was empty prior to adding the
entry, so the functionality you are adding makes no sense with the
commit log in any way shape or form.

  Luis



Re: [PATCH v2] module: Add CONFIG_MODULE_LOAD_IN_SEQUENCE option

2023-10-11 Thread Luis Chamberlain
On Wed, Oct 11, 2023 at 01:14:38PM +0530, Joey Jiao wrote:
> When modprobe cmds are executed one by one, the final loaded modules
> are not in fixed sequence as expected.
> 
> Add the option to make sure modules are in fixed sequence across reboot.
> 
> Signed-off-by: Joey Jiao 
> ---
>  kernel/module/Kconfig | 11 +++
>  kernel/module/main.c  |  3 ++-
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig
> index 33a2e991f608..b45a45f31d6d 100644
> --- a/kernel/module/Kconfig
> +++ b/kernel/module/Kconfig
> @@ -389,4 +389,15 @@ config MODULES_TREE_LOOKUP
>   def_bool y
>   depends on PERF_EVENTS || TRACING || CFI_CLANG
>  
> +config MODULE_LOAD_IN_SEQUENCE
> + bool "Load module in sequence"
> + default n
> + help
> +   By default, modules are loaded in random sequence depending on when 
> modprobe
> +   is executed.
> +
> +   This option allows modules to be loaded in sequence if modprobe cmds 
> are
> +   executed one by one in sequence. This option is helpful during 
> syzkaller fuzzing
> +   to make sure module is loaded into fixed address across device reboot.
> +
>  endif # MODULES
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index 98fedfdb8db5..e238a31d09eb 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -2594,7 +2594,8 @@ static noinline int do_init_module(struct module *mod)
>* rcu_barrier()
>*/
>   if (llist_add(>node, _free_list))
> - schedule_work(_free_wq);
> + if (!IS_ENABLED(CONFIG_MODULE_LOAD_IN_SEQUENCE)) {
> + schedule_work(_free_wq);
>  

As Christoph suggested the rationale for something like this needs to be
clearly spelled out in the commit log and if so valuable it should be
a default. The commit log and even the Kconfig description do little
to justify any rationale for this.

  Luis



Re: [PATCH] module: Add CONFIG_MODULE_LOAD_IN_SEQUENCE option

2023-10-10 Thread Luis Chamberlain
Please find a good email client to reply to patches.

On Wed, Oct 11, 2023 at 01:57:58AM +, Joey Jiao (QUIC) wrote:
> Hi Luis,
> 
> > How is ignoring an error ensuring ordering?
> The change is just to disable the schedule_work.

That's different and can be made clearer. Try:

if (!IS_ENABLED(CONFIG_FOO))
schedule_stuff

> > Why are you making this only now be called with this new kconfig option?
> This sequence loading is especially helpful for syzkaller coverage decoding.
> When kaslr is disabled, address inside core kernel is fixed, so syzkaller can 
> always get right function/line number from addr2line.
> But module address keeps change across rebooting, in first booting, it might 
> be loaded at X1, and at X2 after reboot, and at X3 after another reboot.
> In this way, syzkaller just can't decode correctly for module address. And 
> syzkaller currently uses PC and branch info for coverage guided things.
> 
> There was a discussion previously here 
> https://groups.google.com/g/syzkaller/c/1Pnm_BjrZO8/m/WOyAKx8ZAgAJ for 
> modprobe.

You are missing my point, you are disabling in effect a piece of code
where it was not before.

  Luis



Re: [PATCH v2 0/5] params: harden string ops and allocatio ops

2023-10-10 Thread Luis Chamberlain
On Mon, Oct 02, 2023 at 09:57:59AM -0700, Kees Cook wrote:
> On Mon, Oct 02, 2023 at 03:48:51PM +0300, Andy Shevchenko wrote:
> > A couple of patches are for get the string ops, used in the module,
> > slightly harden. On top a few cleanups.
> > 
> > Since the main part is rather hardening, I think the Kees' tree is
> > the best fit for the series, but I'm open for another option(s).
> > 
> > Changelog v2:
> > - dropped the s*printf() --> sysfs_emit() conversion as it revealed
> >   an issue, i.e. reuse getters with non-page-aligned pointer, which
> >   would be addressed separately
> > - added cover letter and clarified the possible route for the series
> >   (Luis)
> > 
> > Andy Shevchenko (5):
> >   params: Introduce the param_unknown_fn type
> >   params: Do not go over the limit when getting the string length
> >   params: Use size_add() for kmalloc()
> >   params: Sort headers
> >   params: Fix multi-line comment style
> 
> Seems like a nice bit of clean-up.
> 
> Reviewed-by: Kees Cook 

Reviewed-by: Luis Chamberlain 

  Luis



Re: [PATCH] module: Add CONFIG_MODULE_LOAD_IN_SEQUENCE option

2023-10-10 Thread Luis Chamberlain
On Mon, Oct 09, 2023 at 10:26:35AM +0530, Joey Jiao wrote:
> When modprobe cmds are executed one by one, the final loaded modules
> are not in fixed sequence as expected.
> 
> Add the option to make sure modules are in fixed sequence across reboot.
> 
> Signed-off-by: Joey Jiao 
> ---
>  kernel/module/Kconfig | 11 +++
>  kernel/module/main.c  |  6 ++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig
> index 33a2e991f608..b45a45f31d6d 100644
> --- a/kernel/module/Kconfig
> +++ b/kernel/module/Kconfig
> @@ -389,4 +389,15 @@ config MODULES_TREE_LOOKUP
>   def_bool y
>   depends on PERF_EVENTS || TRACING || CFI_CLANG
>  
> +config MODULE_LOAD_IN_SEQUENCE
> + bool "Load module in sequence"
> + default n
> + help
> +   By default, modules are loaded in random sequence depending on when 
> modprobe
> +   is executed.
> +
> +   This option allows modules to be loaded in sequence if modprobe cmds 
> are
> +   executed one by one in sequence. This option is helpful during 
> syzkaller fuzzing
> +   to make sure module is loaded into fixed address across device reboot.
> +
>  endif # MODULES
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index 98fedfdb8db5..587fd84083ae 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -2593,11 +2593,17 @@ static noinline int do_init_module(struct module *mod)
>* be cleaned up needs to sync with the queued work - ie
>* rcu_barrier()
>*/
> +#ifdef CONFIG_MODULE_LOAD_IN_SEQUENCE
> + llist_add(>node, _free_list);
> +#else
>   if (llist_add(>node, _free_list))
>   schedule_work(_free_wq);
> +#endif

How is ignoring an error ensuring ordering?

>   mutex_unlock(_mutex);
> +#ifdef CONFIG_MODULE_LOAD_IN_SEQUENCE
>   wake_up_all(_wq);
> +#endif

Why are you making this only now be called with this new kconfig option?

  Luis



Re: [PATCH v3 0/7] sysctl: Remove sentinel elements from arch

2023-10-10 Thread Luis Chamberlain
On Mon, Oct 02, 2023 at 01:30:35PM +0200, Joel Granados via B4 Relay wrote:
> V3:
> * Removed the ia64 patch to avoid conflicts with the ia64 removal
> * Rebased onto v6.6-rc4
> * Kept/added the trailing comma for the ctl_table arrays. This was a comment
>   that we received "drivers/*" patch set.
> * Link to v2: 
> https://lore.kernel.org/r/20230913-jag-sysctl_remove_empty_elem_arch-v2-0-d1bd13a29...@samsung.com

Thanks! I replaced the v2 with this v3 on sysctl-next.

  Luis


Re: [PATCH] module: Fix comment typo

2023-09-20 Thread Luis Chamberlain
On Sun, Jun 18, 2023 at 10:58:43PM +0800, zhumao...@208suo.com wrote:
> 
> Delete duplicated word in comment.
> 
> Signed-off-by: Zhu Mao 
> ---

Applied and pushed, next time please consider sending a slew of typo
fixes, just one word is a bit too much trouble for just one patch.

  Luis



Re: [PATCH 0/2] module: Do some small changes

2023-09-20 Thread Luis Chamberlain
On Fri, Jun 16, 2023 at 09:51:31AM +0800, Tiezhu Yang wrote:
> The first patch is suggested by Xi Zhang  offline,
> the second patch is inspired by the first patch, no functional changes.

Applied and pushed, thanks!

  Luis



Re: [PATCH] module/decompress: use vmalloc() for gzip decompression workspace

2023-09-20 Thread Luis Chamberlain
On Wed, Aug 30, 2023 at 05:58:20PM +0200, Andrea Righi wrote:
> Use a similar approach as commit a419beac4a07 ("module/decompress: use
> vmalloc() for zstd decompression workspace") and replace kmalloc() with
> vmalloc() also for the gzip module decompression workspace.
> 
> In this case the workspace is represented by struct inflate_workspace
> that can be fairly large for kmalloc() and it can potentially lead to
> allocation errors on certain systems:
> 
> $ pahole inflate_workspace
> struct inflate_workspace {
>   struct inflate_state   inflate_state;/* 0  9544 */
>   /* --- cacheline 149 boundary (9536 bytes) was 8 bytes ago --- */
>   unsigned char  working_window[32768]; /*  9544 32768 */
> 
>   /* size: 42312, cachelines: 662, members: 2 */
>   /* last cacheline: 8 bytes */
> };
> 
> Considering that there is no need to use continuous physical memory,
> simply switch to vmalloc() to provide a more reliable in-kernel module
> decompression.
> 
> Fixes: b1ae6dc41eaa ("module: add in-kernel support for decompressing")
> Signed-off-by: Andrea Righi 

Applied, and pushed, thanks!

  Luis



[PATCH] MAINTAINERS: add include/linux/module*.h to modules

2023-09-20 Thread Luis Chamberlain
Use glob include/linux/module*.h to capture all module changes.

Suggested-by: Kees Cook 
Signed-off-by: Luis Chamberlain 
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 3be1bdfe8ecc..b78cf8dc7f16 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14289,7 +14289,7 @@ L:  linux-kernel@vger.kernel.org
 S: Maintained
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git 
modules-next
 F: include/linux/kmod.h
-F: include/linux/module.h
+F: include/linux/module*.h
 F: kernel/module/
 F: lib/test_kmod.c
 F: scripts/module*
-- 
2.39.2




Re: [PATCH v2 0/8] sysctl: Remove sentinel elements from arch

2023-09-20 Thread Luis Chamberlain
On Wed, Sep 13, 2023 at 11:10:54AM +0200, Joel Granados via B4 Relay wrote:
> V2:
> * Added clarification both in the commit messages and the coverletter as
>   to why this patch is safe to apply.
> * Added {Acked,Reviewed,Tested}-by from list
> * Link to v1: 
> https://lore.kernel.org/r/20230906-jag-sysctl_remove_empty_elem_arch-v1-0-3935d4854...@samsung.com

Thanks! I've merged this onto sysctl-next.

  Luis


Re: [RFC v2 6/6] fs: add automatic kernel fs freeze / thaw and remove kthread freezing

2021-04-20 Thread Luis Chamberlain
On Tue, Apr 20, 2021 at 01:59:03PM +0100, Christoph Hellwig wrote:
> > This also removes all the superflous freezer calls on all filesystems
> > as they are no longer needed as the VFS now performs filesystem
> > freezing/thaw if the filesystem has support for it. The filesystem
> > therefore is in charge of properly dealing with quiescing of the
> > filesystem through its callbacks.
> 
> Can you split that out from the main logic change?  Maybe even into one
> patch per file system?

The issue with this is that once you do the changes in pm to
freeze/suspend, if you leave the other changes in for the filesystems
freeze / resume will stall, so all this needs to be an atomic operation
if we want bisectable kernels.

  Luis


Re: [PATCH 2/2] fs/kernel_read_file: use usermodehelper_read_trylock() as a stop gap

2021-04-19 Thread Luis Chamberlain
On Fri, Apr 16, 2021 at 11:58:50PM +, Luis Chamberlain wrote:
> The endless wait for the read which the piece of hardware never got
> stalls resume as sync calls are all asynchronous.

Sorry, this should read:

The endless wait for the read, which the piece of hardware never got,
stalls resume as all pm resume calls are serialized and synchronous.

  Luis


[RFC v2 4/6] fs: distinguish between user initiated freeze and kernel initiated freeze

2021-04-16 Thread Luis Chamberlain
Userspace can initiate a freeze call using ioctls. If the kernel decides
to freeze a filesystem later it must be able to distinguish if userspace
had initiated the freeze, so that it does not unfreeze it later
automatically on resume.

Likewise if the kernel is initiating a freeze on its own it should *not*
fail to freeze a filesystem if a user had already frozen it on our behalf.
This same concept applies to thawing, even if its not possible for
userspace to beat the kernel in thawing a filesystem. This logic however
has never applied to userspace freezing and thawing, two consecutive
userspace freeze calls will results in only the first one succeeding, so
we must retain the same behaviour in userspace.

This doesn't implement yet kernel initiated filesystem freeze calls,
this will be done in subsequent calls. This change should introduce
no functional changes, it just extends the definitions a frozen
filesystem to account for future kernel initiated filesystem freeze.

Signed-off-by: Luis Chamberlain 
---
 fs/super.c | 27 ++-
 include/linux/fs.h | 17 +++--
 2 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 744b2399a272..53106d4c7f56 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -40,7 +40,7 @@
 #include 
 #include "internal.h"
 
-static int thaw_super_locked(struct super_block *sb);
+static int thaw_super_locked(struct super_block *sb, bool usercall);
 
 static LIST_HEAD(super_blocks);
 static DEFINE_SPINLOCK(sb_lock);
@@ -977,7 +977,7 @@ static void do_thaw_all_callback(struct super_block *sb)
down_write(>s_umount);
if (sb->s_root && sb->s_flags & SB_BORN) {
emergency_thaw_bdev(sb);
-   thaw_super_locked(sb);
+   thaw_super_locked(sb, false);
} else {
up_write(>s_umount);
}
@@ -1625,10 +1625,13 @@ static void sb_freeze_unlock(struct super_block *sb)
 }
 
 /* Caller takes lock and handles active count */
-static int freeze_locked_super(struct super_block *sb)
+static int freeze_locked_super(struct super_block *sb, bool usercall)
 {
int ret;
 
+   if (!usercall && sb_is_frozen(sb))
+   return 0;
+
if (!sb_is_unfrozen(sb))
return -EBUSY;
 
@@ -1673,7 +1676,10 @@ static int freeze_locked_super(struct super_block *sb)
 * For debugging purposes so that fs can warn if it sees write activity
 * when frozen is set to SB_FREEZE_COMPLETE, and for thaw_super().
 */
-   sb->s_writers.frozen = SB_FREEZE_COMPLETE;
+   if (usercall)
+   sb->s_writers.frozen = SB_FREEZE_COMPLETE;
+   else
+   sb->s_writers.frozen = SB_FREEZE_COMPLETE_AUTO;
return 0;
 }
 
@@ -1717,7 +1723,7 @@ int freeze_super(struct super_block *sb)
atomic_inc(>s_active);
 
down_write(>s_umount);
-   error = freeze_locked_super(sb);
+   error = freeze_locked_super(sb, true);
if (error) {
deactivate_locked_super(sb);
goto out;
@@ -1731,10 +1737,13 @@ int freeze_super(struct super_block *sb)
 EXPORT_SYMBOL(freeze_super);
 
 /* Caller deals with the sb->s_umount */
-static int __thaw_super_locked(struct super_block *sb)
+static int __thaw_super_locked(struct super_block *sb, bool usercall)
 {
int error;
 
+   if (!usercall && sb_is_unfrozen(sb))
+   return 0;
+
if (!sb_is_frozen(sb))
return -EINVAL;
 
@@ -1763,11 +1772,11 @@ static int __thaw_super_locked(struct super_block *sb)
 }
 
 /* Handles unlocking of sb->s_umount for you */
-static int thaw_super_locked(struct super_block *sb)
+static int thaw_super_locked(struct super_block *sb, bool usercall)
 {
int error;
 
-   error = __thaw_super_locked(sb);
+   error = __thaw_super_locked(sb, usercall);
if (error) {
up_write(>s_umount);
return error;
@@ -1787,6 +1796,6 @@ static int thaw_super_locked(struct super_block *sb)
 int thaw_super(struct super_block *sb)
 {
down_write(>s_umount);
-   return thaw_super_locked(sb);
+   return thaw_super_locked(sb, true);
 }
 EXPORT_SYMBOL(thaw_super);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3dcf2c1968e5..6980e709e94a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1406,9 +1406,10 @@ enum {
SB_FREEZE_FS = 3,   /* For internal FS use (e.g. to stop
 * internal threads if needed) */
SB_FREEZE_COMPLETE = 4, /* ->freeze_fs finished successfully */
+   SB_FREEZE_COMPLETE_AUTO = 5,/* same but initiated automatically */
 };
 
-#define SB_FREEZE_LEVELS (SB_FREEZE_COMPLETE - 1)
+#define SB_FREEZE_LEVELS (SB_FREEZE_COMPLETE_AUTO - 2)
 
 struct sb_writers {
int frozen; /* Is sb 

[RFC v2 6/6] fs: add automatic kernel fs freeze / thaw and remove kthread freezing

2021-04-16 Thread Luis Chamberlain
Add support to automatically handle freezing and thawing filesystems
during the kernel's suspend/resume cycle.

This is needed so that we properly really stop IO in flight without
races after userspace has been frozen. Without this we rely on
kthread freezing and its semantics are loose and error prone.
For instance, even though a kthread may use try_to_freeze() and end
up being frozen we have no way of being sure that everything that
has been spawned asynchronously from it (such as timers) have also
been stopped as well.

A long term advantage of also adding filesystem freeze / thawing
supporting during suspend / hibernation is that long term we may
be able to eventually drop the kernel's thread freezing completely
as it was originally added to stop disk IO in flight as we hibernate
or suspend.

This also removes all the superflous freezer calls on all filesystems
as they are no longer needed as the VFS now performs filesystem
freezing/thaw if the filesystem has support for it. The filesystem
therefore is in charge of properly dealing with quiescing of the
filesystem through its callbacks.

This also implies that many kthread users exist which have been
adding freezer semantics onto its kthreads without need. These also
will need to be reviewed later.

This is based on prior work originally by Rafael Wysocki and later by
Jiri Kosina.

The following Coccinelle rule was used as to remove the now superflous
freezer calls:

spatch --sp-file fs-freeze-cleanup.cocci --in-place fs/

@ has_freeze_fs @
identifier super_ops;
expression freeze_op;
@@

struct super_operations super_ops = {
.freeze_fs = freeze_op,
};

@ remove_set_freezable depends on has_freeze_fs @
expression time;
statement S, S2;
expression task, current;
@@

(
-   set_freezable();
|
-   if (try_to_freeze())
-   continue;
|
-   try_to_freeze();
|
-   freezable_schedule();
+   schedule();
|
-   freezable_schedule_timeout(time);
+   schedule_timeout(time);
|
-   if (freezing(task)) { S }
|
-   if (freezing(task)) { S }
-   else
{ S2 }
|
-   freezing(current)
)

@ remove_wq_freezable @
expression WQ_E, WQ_ARG1, WQ_ARG2, WQ_ARG3, WQ_ARG4;
identifier fs_wq_fn;
@@

(
WQ_E = alloc_workqueue(WQ_ARG1,
-  WQ_ARG2 | WQ_FREEZABLE,
+  WQ_ARG2,
   ...);
|
WQ_E = alloc_workqueue(WQ_ARG1,
-  WQ_ARG2 | WQ_FREEZABLE | WQ_ARG3,
+  WQ_ARG2 | WQ_ARG3,
   ...);
|
WQ_E = alloc_workqueue(WQ_ARG1,
-  WQ_ARG2 | WQ_ARG3 | WQ_FREEZABLE,
+  WQ_ARG2 | WQ_ARG3,
   ...);
|
WQ_E = alloc_workqueue(WQ_ARG1,
-  WQ_ARG2 | WQ_ARG3 | WQ_FREEZABLE | WQ_ARG4,
+  WQ_ARG2 | WQ_ARG3 | WQ_ARG4,
   ...);
|
WQ_E =
-   WQ_ARG1 | WQ_FREEZABLE
+   WQ_ARG1
|
WQ_E =
-   WQ_ARG1 | WQ_FREEZABLE | WQ_ARG3
+   WQ_ARG1 | WQ_ARG3
|
fs_wq_fn(
-   WQ_FREEZABLE | WQ_ARG2 | WQ_ARG3
+   WQ_ARG2 | WQ_ARG3
)
|
fs_wq_fn(
-   WQ_FREEZABLE | WQ_ARG2
+   WQ_ARG2
)
|
fs_wq_fn(
-   WQ_FREEZABLE
+   0
)
)

Signed-off-by: Luis Chamberlain 
---
 fs/btrfs/disk-io.c |  4 +-
 fs/btrfs/scrub.c   |  2 +-
 fs/cifs/cifsfs.c   | 10 ++---
 fs/cifs/dfs_cache.c|  2 +-
 fs/ext4/super.c|  2 -
 fs/f2fs/gc.c   |  7 +---
 fs/f2fs/segment.c  |  6 +--
 fs/gfs2/glock.c|  6 +--
 fs/gfs2/main.c |  4 +-
 fs/jfs/jfs_logmgr.c| 11 ++
 fs/jfs/jfs_txnmgr.c| 31 +--
 fs/nilfs2/segment.c| 48 ++-
 fs/super.c | 88 ++
 fs/xfs/xfs_log.c   |  3 +-
 fs/xfs/xfs_mru_cache.c |  2 +-
 fs/xfs/xfs_pwork.c |  2 +-
 fs/xfs/xfs_super.c | 14 +++
 fs/xfs/xfs_trans_ail.c |  7 +---
 include/linux/fs.h | 13 +++
 kernel/power/process.c | 15 ++-
 20 files changed, 175 insertions(+), 102 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9fc2ec72327f..2c718f1eaae3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2303,7 +2303,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info 
*fs_info,
struct btrfs_fs_devices *fs_devices)
 {
u32 max_active = fs_info->thread_pool_size;
-   unsigned int flags = WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_UNBOUND;
+   unsigned int flags = WQ_MEM_RECLAIM | WQ_UNBOUND;
 
fs_info->workers =
btrfs_alloc_workqueue(fs_info, "worker",
@@ -2355,7 +2355,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info 
*fs_info,

[RFC v2 5/6] fs: add iterate_supers_excl() and iterate_supers_reverse_excl()

2021-04-16 Thread Luis Chamberlain
There are use cases where we wish to traverse the superblock list
but also capture errors, and in which case we want to avoid having
our callers issue a lock themselves since we can do the locking for
the callers. Provide a iterate_supers_excl() which calls a function
with the write lock held. If an error occurs we capture it and
propagate it.

Likewise there are use cases where we wish to traverse the superblock
list but in reverse order. The new iterate_supers_reverse_excl() helpers
does this but also also captures any errors encountered.

Reviewed-by: Jan Kara 
Signed-off-by: Luis Chamberlain 
---
 fs/super.c | 91 ++
 include/linux/fs.h |  2 +
 2 files changed, 93 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index 53106d4c7f56..2a6ef4ec2496 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -705,6 +705,97 @@ void iterate_supers(void (*f)(struct super_block *, void 
*), void *arg)
spin_unlock(_lock);
 }
 
+/**
+ * iterate_supers_excl - exclusively call func for all active superblocks
+ * @f: function to call
+ * @arg: argument to pass to it
+ *
+ * Scans the superblock list and calls given function, passing it
+ * locked superblock and given argument. Returns 0 unless an error
+ * occurred on calling the function on any superblock.
+ */
+int iterate_supers_excl(int (*f)(struct super_block *, void *), void *arg)
+{
+   struct super_block *sb, *p = NULL;
+   int error = 0;
+
+   spin_lock(_lock);
+   list_for_each_entry(sb, _blocks, s_list) {
+   if (hlist_unhashed(>s_instances))
+   continue;
+   sb->s_count++;
+   spin_unlock(_lock);
+
+   down_write(>s_umount);
+   if (sb->s_root && (sb->s_flags & SB_BORN)) {
+   error = f(sb, arg);
+   if (error) {
+   up_write(>s_umount);
+   spin_lock(_lock);
+   __put_super(sb);
+   break;
+   }
+   }
+   up_write(>s_umount);
+
+   spin_lock(_lock);
+   if (p)
+   __put_super(p);
+   p = sb;
+   }
+   if (p)
+   __put_super(p);
+   spin_unlock(_lock);
+
+   return error;
+}
+
+/**
+ * iterate_supers_reverse_excl - exclusively calls func in reverse order
+ * @f: function to call
+ * @arg: argument to pass to it
+ *
+ * Scans the superblock list and calls given function, passing it
+ * locked superblock and given argument, in reverse order, and holding
+ * the s_umount write lock. Returns if an error occurred.
+ */
+int iterate_supers_reverse_excl(int (*f)(struct super_block *, void *),
+void *arg)
+{
+   struct super_block *sb, *p = NULL;
+   int error = 0;
+
+   spin_lock(_lock);
+   list_for_each_entry_reverse(sb, _blocks, s_list) {
+   if (hlist_unhashed(>s_instances))
+   continue;
+   sb->s_count++;
+   spin_unlock(_lock);
+
+   down_write(>s_umount);
+   if (sb->s_root && (sb->s_flags & SB_BORN)) {
+   error = f(sb, arg);
+   if (error) {
+   up_write(>s_umount);
+   spin_lock(_lock);
+   __put_super(sb);
+   break;
+   }
+   }
+   up_write(>s_umount);
+
+   spin_lock(_lock);
+   if (p)
+   __put_super(p);
+   p = sb;
+   }
+   if (p)
+   __put_super(p);
+   spin_unlock(_lock);
+
+   return error;
+}
+
 /**
  * iterate_supers_type - call function for superblocks of given type
  * @type: fs type
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6980e709e94a..0f4d624f0f3f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3442,6 +3442,8 @@ extern struct super_block *get_active_super(struct 
block_device *bdev);
 extern void drop_super(struct super_block *sb);
 extern void drop_super_exclusive(struct super_block *sb);
 extern void iterate_supers(void (*)(struct super_block *, void *), void *);
+extern int iterate_supers_excl(int (*f)(struct super_block *, void *), void 
*arg);
+extern int iterate_supers_reverse_excl(int (*)(struct super_block *, void *), 
void *);
 extern void iterate_supers_type(struct file_system_type *,
void (*)(struct super_block *, void *), void *);
 
-- 
2.29.2



[RFC v2 1/6] fs: provide unlocked helper for freeze_super()

2021-04-16 Thread Luis Chamberlain
freeze_super() holds a write lock, however we wish to also enable
callers which already hold the write lock. To do this provide a helper
and make freeze_super() use it. This way, all that freeze_super() does
now is lock handling and active count management.

This change has no functional changes.

Suggested-by: Dave Chinner 
Reviewed-by: Jan Kara 
Signed-off-by: Luis Chamberlain 
---
 fs/super.c | 100 +
 1 file changed, 55 insertions(+), 45 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 11b7e7213fd1..e24d0849d935 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1624,59 +1624,20 @@ static void sb_freeze_unlock(struct super_block *sb)
percpu_up_write(sb->s_writers.rw_sem + level);
 }
 
-/**
- * freeze_super - lock the filesystem and force it into a consistent state
- * @sb: the super to lock
- *
- * Syncs the super to make sure the filesystem is consistent and calls the fs's
- * freeze_fs.  Subsequent calls to this without first thawing the fs will 
return
- * -EBUSY.
- *
- * During this function, sb->s_writers.frozen goes through these values:
- *
- * SB_UNFROZEN: File system is normal, all writes progress as usual.
- *
- * SB_FREEZE_WRITE: The file system is in the process of being frozen.  New
- * writes should be blocked, though page faults are still allowed. We wait for
- * all writes to complete and then proceed to the next stage.
- *
- * SB_FREEZE_PAGEFAULT: Freezing continues. Now also page faults are blocked
- * but internal fs threads can still modify the filesystem (although they
- * should not dirty new pages or inodes), writeback can run etc. After waiting
- * for all running page faults we sync the filesystem which will clean all
- * dirty pages and inodes (no new dirty pages or inodes can be created when
- * sync is running).
- *
- * SB_FREEZE_FS: The file system is frozen. Now all internal sources of fs
- * modification are blocked (e.g. XFS preallocation truncation on inode
- * reclaim). This is usually implemented by blocking new transactions for
- * filesystems that have them and need this additional guard. After all
- * internal writers are finished we call ->freeze_fs() to finish filesystem
- * freezing. Then we transition to SB_FREEZE_COMPLETE state. This state is
- * mostly auxiliary for filesystems to verify they do not modify frozen fs.
- *
- * sb->s_writers.frozen is protected by sb->s_umount.
- */
-int freeze_super(struct super_block *sb)
+/* Caller takes lock and handles active count */
+static int freeze_locked_super(struct super_block *sb)
 {
int ret;
 
-   atomic_inc(>s_active);
-   down_write(>s_umount);
-   if (sb->s_writers.frozen != SB_UNFROZEN) {
-   deactivate_locked_super(sb);
+   if (sb->s_writers.frozen != SB_UNFROZEN)
return -EBUSY;
-   }
 
-   if (!(sb->s_flags & SB_BORN)) {
-   up_write(>s_umount);
+   if (!(sb->s_flags & SB_BORN))
return 0;   /* sic - it's "nothing to do" */
-   }
 
if (sb_rdonly(sb)) {
/* Nothing to do really... */
sb->s_writers.frozen = SB_FREEZE_COMPLETE;
-   up_write(>s_umount);
return 0;
}
 
@@ -1705,7 +1666,6 @@ int freeze_super(struct super_block *sb)
sb->s_writers.frozen = SB_UNFROZEN;
sb_freeze_unlock(sb);
wake_up(>s_writers.wait_unfrozen);
-   deactivate_locked_super(sb);
return ret;
}
}
@@ -1714,9 +1674,59 @@ int freeze_super(struct super_block *sb)
 * when frozen is set to SB_FREEZE_COMPLETE, and for thaw_super().
 */
sb->s_writers.frozen = SB_FREEZE_COMPLETE;
+   return 0;
+}
+
+/**
+ * freeze_super - lock the filesystem and force it into a consistent state
+ * @sb: the super to lock
+ *
+ * Syncs the super to make sure the filesystem is consistent and calls the fs's
+ * freeze_fs.  Subsequent calls to this without first thawing the fs will 
return
+ * -EBUSY.
+ *
+ * During this function, sb->s_writers.frozen goes through these values:
+ *
+ * SB_UNFROZEN: File system is normal, all writes progress as usual.
+ *
+ * SB_FREEZE_WRITE: The file system is in the process of being frozen.  New
+ * writes should be blocked, though page faults are still allowed. We wait for
+ * all writes to complete and then proceed to the next stage.
+ *
+ * SB_FREEZE_PAGEFAULT: Freezing continues. Now also page faults are blocked
+ * but internal fs threads can still modify the filesystem (although they
+ * should not dirty new pages or inodes), writeback can run etc. After waiting
+ * for all running page faults we sync the filesystem which will clean all
+ * dirty pages and inodes (no new dirty pages or inodes can be created when
+ * sync is running).
+ *
+ *

[RFC v2 3/6] fs: add a helper for thaw_super_locked() which does not unlock

2021-04-16 Thread Luis Chamberlain
The thaw_super_locked() expects the caller to hold the sb->s_umount
semaphore. It also handles the unlocking of the semaphore for you.
Allow for cases where the caller will do the unlocking of the semaphore.

Signed-off-by: Luis Chamberlain 
---
 fs/super.c | 25 +++--
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 72b445a69a45..744b2399a272 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1730,14 +1730,13 @@ int freeze_super(struct super_block *sb)
 }
 EXPORT_SYMBOL(freeze_super);
 
-static int thaw_super_locked(struct super_block *sb)
+/* Caller deals with the sb->s_umount */
+static int __thaw_super_locked(struct super_block *sb)
 {
int error;
 
-   if (!sb_is_frozen(sb)) {
-   up_write(>s_umount);
+   if (!sb_is_frozen(sb))
return -EINVAL;
-   }
 
if (sb_rdonly(sb)) {
sb->s_writers.frozen = SB_UNFROZEN;
@@ -1752,7 +1751,6 @@ static int thaw_super_locked(struct super_block *sb)
printk(KERN_ERR
"VFS:Filesystem thaw failed\n");
lockdep_sb_freeze_release(sb);
-   up_write(>s_umount);
return error;
}
}
@@ -1761,10 +1759,25 @@ static int thaw_super_locked(struct super_block *sb)
sb_freeze_unlock(sb);
 out:
wake_up(>s_writers.wait_unfrozen);
-   deactivate_locked_super(sb);
return 0;
 }
 
+/* Handles unlocking of sb->s_umount for you */
+static int thaw_super_locked(struct super_block *sb)
+{
+   int error;
+
+   error = __thaw_super_locked(sb);
+   if (error) {
+   up_write(>s_umount);
+   return error;
+   }
+
+   deactivate_locked_super(sb);
+
+   return 0;
+ }
+
 /**
  * thaw_super -- unlock filesystem
  * @sb: the super to thaw
-- 
2.29.2



[RFC v2 2/6] fs: add frozen sb state helpers

2021-04-16 Thread Luis Chamberlain
The question of whether or not a superblock is frozen needs to be
augmented in the future to account for differences between a user
initiated freeze and a kernel initiated freeze done automatically
on behalf of the kernel.

Provide helpers so that these can be used instead so that we don't
have to expand checks later in these same call sites as we expand
the definition of a frozen superblock.

Signed-off-by: Luis Chamberlain 
---
 fs/ext4/ext4_jbd2.c |  2 +-
 fs/super.c  |  4 ++--
 fs/xfs/xfs_trans.c  |  3 +--
 include/linux/fs.h  | 34 ++
 4 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index be799040a415..efda50563feb 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -72,7 +72,7 @@ static int ext4_journal_check_start(struct super_block *sb)
 
if (sb_rdonly(sb))
return -EROFS;
-   WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE);
+   WARN_ON(sb_is_frozen(sb));
journal = EXT4_SB(sb)->s_journal;
/*
 * Special case here: if the journal has aborted behind our
diff --git a/fs/super.c b/fs/super.c
index e24d0849d935..72b445a69a45 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1629,7 +1629,7 @@ static int freeze_locked_super(struct super_block *sb)
 {
int ret;
 
-   if (sb->s_writers.frozen != SB_UNFROZEN)
+   if (!sb_is_unfrozen(sb))
return -EBUSY;
 
if (!(sb->s_flags & SB_BORN))
@@ -1734,7 +1734,7 @@ static int thaw_super_locked(struct super_block *sb)
 {
int error;
 
-   if (sb->s_writers.frozen != SB_FREEZE_COMPLETE) {
+   if (!sb_is_frozen(sb)) {
up_write(>s_umount);
return -EINVAL;
}
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index bcc978011869..b4669dd65c9e 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -272,8 +272,7 @@ xfs_trans_alloc(
 * Zero-reservation ("empty") transactions can't modify anything, so
 * they're allowed to run while we're frozen.
 */
-   WARN_ON(resp->tr_logres > 0 &&
-   mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
+   WARN_ON(resp->tr_logres > 0 && sb_is_frozen(mp->m_super));
ASSERT(!(flags & XFS_TRANS_RES_FDBLKS) ||
   xfs_sb_version_haslazysbcount(>m_sb));
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c3c88fdb9b2a..3dcf2c1968e5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1885,6 +1885,40 @@ static inline bool sb_start_intwrite_trylock(struct 
super_block *sb)
return __sb_start_write_trylock(sb, SB_FREEZE_FS);
 }
 
+/**
+ * sb_is_frozen_by_user - is superblock frozen by a user call
+ * @sb: the super to check
+ *
+ * Returns true if the super freeze was initiated by userspace, for instance,
+ * an ioctl call.
+ */
+static inline bool sb_is_frozen_by_user(struct super_block *sb)
+{
+   return sb->s_writers.frozen == SB_FREEZE_COMPLETE;
+}
+
+/**
+ * sb_is_frozen - is superblock frozen
+ * @sb: the super to check
+ *
+ * Returns true if the super is frozen.
+ */
+static inline bool sb_is_frozen(struct super_block *sb)
+{
+   return sb_is_frozen_by_user(sb);
+}
+
+/**
+ * sb_is_unfrozen - is superblock unfrozen
+ * @sb: the super to check
+ *
+ * Returns true if the super is unfrozen.
+ */
+static inline bool sb_is_unfrozen(struct super_block *sb)
+{
+   return sb->s_writers.frozen == SB_UNFROZEN;
+}
+
 bool inode_owner_or_capable(struct user_namespace *mnt_userns,
const struct inode *inode);
 
-- 
2.29.2



[RFC v2 0/6] vfs: provide automatic kernel freeze / resume

2021-04-16 Thread Luis Chamberlain
This picks up where I left off in 2017, I was inclined to respin this
up again due to a new issue Lukas reported which is easy to reproduce.
I posted a stop-gap patch for that issue [0] and a test case, however,
*this* is the work we want upstream to fix these sorts of issues.

As discussed long ago though at LSFMM, we have much work to do. However,
we can take small strides to get us closer. This is trying to take one
step. It addresses most of the comments and feedback from the last
series I posted, however it doesn't address the biggest two issues:

 o provide clean and clear semantics between userspace ioctls /
   automatic fs freeze, and freeze bdev. This also involves moving the
   counter stuff from bdev to the superblock. This is pending work.
 o The loopback hack which breaks the reverse ordering isn't addressed,
   perhaps just flagging it suffices for now?
 o The long term desirable DAG *is* plausable and I have an initial
   kernel graph implementation which I could use, but that may take
   longer to merge.

What this series *does* address based on the last series is:

  o Rebased onto linux-next tag next-20210415
  o Fixed RCU complaints. The issue was that I was adding new fs levels, and
this increated undesirably also the amount of rw semaphores, but we were
just using the new levels to distinguish *who* was triggering the suspend,
it was either inside the kernel and automatic, or triggered by userspace.
  o thaw_super_locked() was added but that unlocks the sb sb->s_umount,
our exclusive reverse iterate supers however will want to hold that
semaphore, so we provide a helper which lets the caller do the unlocking
for you, and make thaw_super_locked() a user of that.
  o WQ_FREEZABLE is now dealt with
  o I have folded the fs freezer removal stuff into the patch which adds
the automatic fs frezer / thaw work from the kernel as otherwise separting
this would create intermediate steps which would produce kernels which
can stall on suspend / resume.

[0] https://lkml.kernel.org/r/20210416235850.23690-1-mcg...@kernel.org
[1] https://lkml.kernel.org/r/20171129232356.28296-1-mcg...@kernel.org  
    

Luis Chamberlain (6):
  fs: provide unlocked helper for freeze_super()
  fs: add frozen sb state helpers
  fs: add a helper for thaw_super_locked() which does not unlock
  fs: distinguish between user initiated freeze and kernel initiated
freeze
  fs: add iterate_supers_excl() and iterate_supers_reverse_excl()
  fs: add automatic kernel fs freeze / thaw and remove kthread freezing

 fs/btrfs/disk-io.c |   4 +-
 fs/btrfs/scrub.c   |   2 +-
 fs/cifs/cifsfs.c   |  10 +-
 fs/cifs/dfs_cache.c|   2 +-
 fs/ext4/ext4_jbd2.c|   2 +-
 fs/ext4/super.c|   2 -
 fs/f2fs/gc.c   |   7 +-
 fs/f2fs/segment.c  |   6 +-
 fs/gfs2/glock.c|   6 +-
 fs/gfs2/main.c |   4 +-
 fs/jfs/jfs_logmgr.c|  11 +-
 fs/jfs/jfs_txnmgr.c|  31 ++--
 fs/nilfs2/segment.c|  48 +++---
 fs/super.c | 321 ++---
 fs/xfs/xfs_log.c   |   3 +-
 fs/xfs/xfs_mru_cache.c |   2 +-
 fs/xfs/xfs_pwork.c |   2 +-
 fs/xfs/xfs_super.c |  14 +-
 fs/xfs/xfs_trans.c |   3 +-
 fs/xfs/xfs_trans_ail.c |   7 +-
 include/linux/fs.h |  64 +++-
 kernel/power/process.c |  15 +-
 22 files changed, 405 insertions(+), 161 deletions(-)

-- 
2.29.2



[PATCH 2/2] fs/kernel_read_file: use usermodehelper_read_trylock() as a stop gap

2021-04-16 Thread Luis Chamberlain
The VFS lacks support to do automatic freeze / thaw of filesystems
on the suspend / resume cycle. This can cause some issues, namely
stalls when we have reads/writes during the suspend / resume cycle.

Although for module loading / kexec the probability of this happening
is extremely low, maybe even impossible, its a known real issue with
the request_firmare() API which it does direct fs read. For this reason
only be chatty about the issue on the call used by the firmware API.

Lukas Middendorf has reported an easy situation to reproduce, which can
be caused by questionably buggy drivers which call the request_firmware()
API on resume.

All you have to do is do the suspend / resume cycle using a driver
which will call request_firmware() on resume on a fresh XFS or
btrfs filesystem which has random files but the target file present:

for n in {1..1000}; do
dd if=/dev/urandom of=/lib/firmware/file$( printf %03d "$n" ).bin bs=1 
count=$((RANDOM + 1024 ))
done

You can reproduce this either with a buggy driver or by using the
test_firmware driver with the new hooks added to test this case.

No other request_firmware() calls can be triggered for this hang to work.
The hang occurs because:

request_firmware() -->
kernel_read_file_from_path_initns() -->
file_open_root() -->
  btrfs_lookup() --> ... -->
  btrfs_search_slot() --> read_block_for_search() -->
--> read_tree_block() --> btree_read_extent_buffer_pages() -->
--> submit_one_bio() --> btrfs_submit_metadata_bio() -->
--> btrfsic_submit_bio() --> submit_bio()
--> this completes and then
--> wait_on_page_locked() on the first page
--> never returns

The endless wait for the read which the piece of hardware never got
stalls resume as sync calls are all asynchronous.

The VFS fs freeze work fixes this issue, however it requires a bit
more work, which may take a while to land upstream, and so for now
this provides a simple stop-gap solution.

We can remove this stop-gap solution once the kernel gets VFS
fs freeze / thaw support.

Cc: raf...@kernel.org
Cc: j...@suse.cz
Cc: bvanass...@acm.org
Cc: ker...@tuxforce.de
Cc: mche...@kernel.org
Cc: keesc...@chromium.org
Reported-by: Lukas Middendorf 
Signed-off-by: Luis Chamberlain 
---
 fs/kernel_read_file.c | 37 +++--
 kernel/kexec_file.c   |  9 -
 kernel/module.c   |  8 +++-
 3 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
index 90d255fbdd9b..f82d8534bf39 100644
--- a/fs/kernel_read_file.c
+++ b/fs/kernel_read_file.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /**
  * kernel_read_file() - read file contents into a kernel buffer
@@ -134,12 +135,20 @@ int kernel_read_file_from_path(const char *path, loff_t 
offset, void **buf,
if (!path || !*path)
return -EINVAL;
 
+   ret = usermodehelper_read_trylock();
+   if (ret)
+   return ret;
+
file = filp_open(path, O_RDONLY, 0);
-   if (IS_ERR(file))
+   if (IS_ERR(file)) {
+   usermodehelper_read_unlock();
return PTR_ERR(file);
+   }
 
ret = kernel_read_file(file, offset, buf, buf_size, file_size, id);
fput(file);
+
+   usermodehelper_read_unlock();
return ret;
 }
 EXPORT_SYMBOL_GPL(kernel_read_file_from_path);
@@ -160,13 +169,37 @@ int kernel_read_file_from_path_initns(const char *path, 
loff_t offset,
get_fs_root(init_task.fs, );
task_unlock(_task);
 
+   /*
+* Note: we may be incorrectly called on a driver's resume callback.
+*
+* The kernel's power management may be busy bringing us to suspend
+* or trying to get us back to resume. If we try to do a direct write
+* during this time a block driver may never get that request, and the
+* filesystem can wait forever. This requires proper VFS work, which
+* is not yet ready.
+*
+* Likewise busy trying here is not possible as well as we'd be holding
+* up the kernel's pm resume, and waiting will not allow use to thaw
+* the filesystem, we'd just wait forever. Best we can do is
+* communuicate the problem so that drivers use the firwmare cache or
+* implement their own prior to resume.
+*/
+   ret = usermodehelper_read_trylock();
+   if (ret) {
+   pr_warn_once("Trying direct fs read while not allowed");
+   return ret;
+   }
+
file = file_open_root(root.dentry, root.mnt, path, O_RDONLY, 0);
path_put();
-   if (IS_ERR(file))
+   if (IS_ERR(file)) {
+   usermodehelper_read_unlock();
return PTR_ERR(file);
+   }
 
ret = kernel_read_file(file, offset, buf, buf_size, file_size, id);
fput(file);
+

[PATCH 0/2] fs: provide a stop gap fix for buggy resume firmware API calls

2021-04-16 Thread Luis Chamberlain
Lukas has reported an issue [0] with a media driver which causes stall on
resume. At first his suspicion was that this issue was due to a bug in
btrfs. I managed to build a custom driver to reproduce the issue and
confirmed it was not a bug in btrfs. The issue is reproducible in XFS
as well. This issue is a demonstration of how broken the filesystem
suspend / resume cycle is and how easy it can be to trigger an issue.

By only doing reads with the firmware API used incorrectly, a simple
suspend / resume cycle can stall a system. The stall happens since
the hardware never gets read request issued by the filesystem as it
was already suspended. The fs waits forever. The stall also happens
because resume calls are synchronous and if one does not complete
we'll wait forever.

My new unposted VFS series for the fs freezer / resume work fixes this,
however this series will require a bit more discussion before this lands
upstream. And so this series provides a test case for the issue and an
intermediate stop-gap patch which resolves the issue for now. We can
remove this once the VFS freeze work lands upstream.

[0] https://lkml.kernel.org/r/c79e24a5-f808-d1f0-1f09-ee6f135d9...@tuxforce.de

Luis Chamberlain (2):
  test_firmware: add suspend support to test buggy drivers
  fs/kernel_read_file: use usermodehelper_read_trylock() as a stop gap

 fs/kernel_read_file.c |  37 -
 kernel/kexec_file.c   |   9 +-
 kernel/module.c   |   8 +-
 lib/test_firmware.c   | 136 --
 tools/testing/selftests/firmware/fw_lib.sh|   8 +-
 .../selftests/firmware/fw_test_resume.sh  |  80 +++
 6 files changed, 261 insertions(+), 17 deletions(-)
 create mode 100755 tools/testing/selftests/firmware/fw_test_resume.sh

-- 
2.29.2



[PATCH 1/2] test_firmware: add suspend support to test buggy drivers

2021-04-16 Thread Luis Chamberlain
Lukas Middendorf reported a situation where a driver's
request_firmware() call on resume caused a stall. Upon
inspection the issue was that the driver in question was
calling request_firmware() only on resume, and since we
currently do not have a generic kernel VFS freeze / thaw
solution in place we are allowing races for filesystems
to race against the disappearance of a block device, and
this is presently an issue which can lead to a stall.

It is difficult to reproduce this unless you have hardware
which mimics this setup. So to test this setup, let's just
implement support for doing these wacky things. This lets
us test that corner case easily as follows.

echo N > /sys/module/printk/parameters/console_suspend
modprobe test_firmware
echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test
systemctl suspend

Then call virsh dompmwakeup guest-id, on the guest and you
can reproduce the issue easily. The issue is reprodicible
with xfs and btrfs, and using a new partition for /lib/firmware
with a files created first as follows:

for n in {1..1000}; do
  dd if=/dev/urandom of=/lib/firmware/file$( printf %03d "$n" ).bin bs=1 
count=$((RANDOM + 1024 ))
done

Cc: raf...@kernel.org
Cc: j...@suse.cz
Cc: bvanass...@acm.org
Cc: ker...@tuxforce.de
Cc: mche...@kernel.org
Cc: keesc...@chromium.org
Signed-off-by: Luis Chamberlain 
---
 lib/test_firmware.c   | 136 --
 tools/testing/selftests/firmware/fw_lib.sh|   8 +-
 .../selftests/firmware/fw_test_resume.sh  |  80 +++
 3 files changed, 211 insertions(+), 13 deletions(-)
 create mode 100755 tools/testing/selftests/firmware/fw_test_resume.sh

diff --git a/lib/test_firmware.c b/lib/test_firmware.c
index b6fe89add9fe..47ba6f4380ab 100644
--- a/lib/test_firmware.c
+++ b/lib/test_firmware.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -35,6 +36,8 @@ MODULE_IMPORT_NS(TEST_FIRMWARE);
 static DEFINE_MUTEX(test_fw_mutex);
 static const struct firmware *test_firmware;
 
+static struct platform_device *pdev;
+
 struct test_batched_req {
u8 idx;
int rc;
@@ -58,6 +61,9 @@ struct test_batched_req {
  * @sync_direct: when the sync trigger is used if this is true
  * request_firmware_direct() will be used instead.
  * @send_uevent: whether or not to send a uevent for async requests
+ * @enable_resume_test: if @senable_resume is true this will enable a test to
+ * issue a request_firmware() upon resume. This is useful to test resume
+ * after suspend filesystem races.
  * @num_requests: number of requests to try per test case. This is trigger
  * specific.
  * @reqs: stores all requests information
@@ -99,6 +105,7 @@ struct test_config {
bool partial;
bool sync_direct;
bool send_uevent;
+   bool enable_resume_test;
u8 num_requests;
u8 read_fw_idx;
 
@@ -195,6 +202,7 @@ static int __test_firmware_config_init(void)
test_fw_config->file_offset = 0;
test_fw_config->partial = false;
test_fw_config->sync_direct = false;
+   test_fw_config->enable_resume_test = false;
test_fw_config->req_firmware = request_firmware;
test_fw_config->test_result = 0;
test_fw_config->reqs = NULL;
@@ -275,6 +283,9 @@ static ssize_t config_show(struct device *dev,
len += scnprintf(buf + len, PAGE_SIZE - len,
"sync_direct:\t\t%s\n",
test_fw_config->sync_direct ? "true" : "false");
+   len += scnprintf(buf+len, PAGE_SIZE - len,
+   "enable_resume_test:\t\t%s\n",
+   test_fw_config->enable_resume_test ? "true" : "false");
len += scnprintf(buf + len, PAGE_SIZE - len,
"read_fw_idx:\t%u\n", test_fw_config->read_fw_idx);
 
@@ -538,6 +549,22 @@ static ssize_t config_sync_direct_show(struct device *dev,
 }
 static DEVICE_ATTR_RW(config_sync_direct);
 
+static ssize_t config_enable_resume_test_store(struct device *dev,
+  struct device_attribute *attr,
+  const char *buf, size_t count)
+{
+   return test_dev_config_update_bool(buf, count,
+  _fw_config->enable_resume_test);
+}
+
+static ssize_t config_enable_resume_test_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+   return test_dev_config_show_bool(buf, 
test_fw_config->enable_resume_test);
+}
+static DEVICE_ATTR_RW(config_enable_resume_test);
+
 static ssize_t config_send_uevent_store(struct device *dev,
struct device_attribute *attr,
c

Re: [RFC v3 0/2] vfs / btrfs: add support for ustat()

2021-04-15 Thread Luis Chamberlain
On Thu, Apr 15, 2021 at 02:17:58PM -0400, Josef Bacik wrote:
> There's a lot of larger things that need to
> be addressed in general to support the volume approach inside file systems
> that is going to require a lot of work inside of VFS.  If you feel like
> tackling that work and then wiring up btrfs by all means have at it, but I'm
> not seeing a urgent need to address this.  Thanks,

That's precisely what I what I want to hear me about. Things like this.
Would btrfs be the ony user of volumes inside filesystem? Jeff had
mentioned before this could also allow namespaces per volumes, and this
might be a desirable feature.

What else?

 Luis


Re: [RFC v3 0/2] vfs / btrfs: add support for ustat()

2021-04-15 Thread Luis Chamberlain
On Wed, Aug 23, 2017 at 3:31 PM Jeff Mahoney  wrote:
>
> On 8/15/14 5:29 AM, Al Viro wrote:
> > On Thu, Aug 14, 2014 at 07:58:56PM -0700, Luis R. Rodriguez wrote:
> >
> >> Christoph had noted that this seemed associated to the problem
> >> that the btrfs uses different assignments for st_dev than s_dev,
> >> but much as I'd like to see that changed based on discussions so
> >> far its unclear if this is going to be possible unless strong
> >> commitment is reached.
>
> Resurrecting a dead thread since we've been carrying this patch anyway
> since then.
>
> > Explain, please.  Whose commitment and commitment to what, exactly?
> > Having different ->st_dev values for different files on the same
> > fs is a bloody bad idea; why does btrfs do that at all?  If nothing else,
> > it breaks the usual "are those two files on the same fs?" tests...
>
> It's because btrfs snapshots would have inode number collisions.
> Changing the inode numbers for snapshots would negate a big benefit of
> btrfs snapshots: the quick creation and lightweight on-disk
> representation due to metadata sharing.
>
> The thing is that ustat() used to work.  Your commit 0ee5dc676a5f8
> (btrfs: kill magical embedded struct superblock) had a regression:
> Since it replaced the superblock with a simple dev_t, it rendered the
> device no longer discoverable by user_get_super.  We need a list_head to
> attach for searching.
>
> There's an argument that this is hacky.  It's valid.  The only other
> feedback I've heard is to use a real superblock for subvolumes to do
> this instead.  That doesn't work either, due to things like freeze/thaw
> and inode writeback.  Ultimately, what we need is a single file system
> with multiple namespaces.  Years ago we just needed different inode
> namespaces, but as people have started adopting btrfs for containers, we
> need more than that.  I've heard requests for per-subvolume security
> contexts.  I'd imagine user namespaces are on someone's wish list.  A
> working df can be done with ->d_automount, but the way btrfs handles
> having a "canonical" subvolume location has always been a way to avoid
> directory loops.  I'd like to just automount subvolumes everywhere
> they're referenced.  One solution, for which I have no code yet, is to
> have something like a superblock-light that we can hang things like a
> security context, a user namespace, and an anonymous dev.  Most file
> systems would have just one.  Btrfs would have one per subvolume.
>
> That's a big project with a bunch of discussion.

4 years have gone by and this patch is still being carried around for
btrfs. Other than resolving this ustat() issue for btrfs are there new
reasons to support this effort done to be done properly? Are there
other filesystems that would benefit? I'd like to get an idea of the
stakeholder here before considering taking this on or not.

 Luis


Re: [PATCH v3] firmware_loader: fix use-after-free in firmware_fallback_sysfs

2021-04-14 Thread Luis Chamberlain
Shuah, a question for you toward the end here.

On Wed, Apr 14, 2021 at 02:24:05PM +0530, Anirudh Rayabharam wrote:
> This use-after-free happens when a fw_priv object has been freed but
> hasn't been removed from the pending list (pending_fw_head). The next
> time fw_load_sysfs_fallback tries to insert into the list, it ends up
> accessing the pending_list member of the previoiusly freed fw_priv.
> 
> The root cause here is that all code paths that abort the fw load
> don't delete it from the pending list. For example:
> 
>   _request_firmware()
> -> fw_abort_batch_reqs()
> -> fw_state_aborted()
> 
> To fix this, delete the fw_priv from the list in __fw_set_state() if
> the new state is DONE or ABORTED. This way, all aborts will remove
> the fw_priv from the list. Accordingly, remove calls to list_del_init
> that were being made before calling fw_state_(aborted|done).
> 
> Also, in fw_load_sysfs_fallback, don't add the fw_priv to the pending
> list if it is already aborted. Instead, just jump out and return early.
> 
> Fixes: bcfbd3523f3c ("firmware: fix a double abort case with 
> fw_load_sysfs_fallback")
> Reported-by: syzbot+de271708674e20930...@syzkaller.appspotmail.com
> Tested-by: syzbot+de271708674e20930...@syzkaller.appspotmail.com
> Signed-off-by: Anirudh Rayabharam 
> ---
> 
> Changes in v3:
> Modified the patch to incorporate suggestions by Luis Chamberlain in
> order to fix the root cause instead of applying a "band-aid" kind of
> fix.
> https://lore.kernel.org/lkml/20210403013143.gv4...@42.do-not-panic.com/
> 
> Changes in v2:
> 1. Fixed 1 error and 1 warning (in the commit message) reported by
> checkpatch.pl. The error was regarding the format for referring to
> another commit "commit  ("oneline")". The warning was for line
> longer than 75 chars. 
> 
> ---
>  drivers/base/firmware_loader/fallback.c | 8 ++--
>  drivers/base/firmware_loader/firmware.h | 6 +-
>  drivers/base/firmware_loader/main.c | 2 ++
>  3 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/base/firmware_loader/fallback.c 
> b/drivers/base/firmware_loader/fallback.c
> index 91899d185e31..73581b6998b4 100644
> --- a/drivers/base/firmware_loader/fallback.c
> +++ b/drivers/base/firmware_loader/fallback.c
> @@ -94,7 +94,6 @@ static void __fw_load_abort(struct fw_priv *fw_priv)
>   if (fw_sysfs_done(fw_priv))
>   return;
>  
> - list_del_init(_priv->pending_list);
>   fw_state_aborted(fw_priv);
>  }
>  
> @@ -280,7 +279,6 @@ static ssize_t firmware_loading_store(struct device *dev,
>* Same logic as fw_load_abort, only the DONE bit
>* is ignored and we set ABORT only on failure.
>*/
> - list_del_init(_priv->pending_list);
>   if (rc) {
>   fw_state_aborted(fw_priv);
>   written = rc;
> @@ -513,6 +511,11 @@ static int fw_load_sysfs_fallback(struct fw_sysfs 
> *fw_sysfs, long timeout)
>   }
>  
>   mutex_lock(_lock);
> + if (fw_state_is_aborted(fw_priv)) {
> + mutex_unlock(_lock);
> + retval = -EAGAIN;
> + goto out;
> + }

Thanks for the quick follow up!

This would regress commit 76098b36b5db1 ("firmware: send -EINTR on
signal abort on fallback mechanism") which I had mentioned in my follow
up email you posted a link to. It would regress it since the condition
is just being met earlier and you nullify the effort. So essentially
on Android you would make not being able to detect signal handlers
like the SIGCHLD signal sent to init, if init was the same process
dealing with the sysfs fallback firmware upload.

The way I dealt with this in my patch was I decided to return -EINTR
in the earlier case in the hunk you added, instead of -EAGAIN. In
addition to this, later on fw_load_sysfs_fallback() when
fw_sysfs_wait_timeout() is used that would also deal with checking
for error codes on wait, and only then check if it was a signal
that cancelled things (the check for -ERESTARTSYS). We therefore
only send to userspace -EAGAIN when the wait really did hit the
timeout.

But also note that my change added a check for
fw_state_is_aborted(fw_priv) inside fw_sysfs_wait_timeout(),
as that was a recently intended goal.

In either case I documented well *why* we do these error checks
before sending a code to userspace on fw_sysfs_wait_timeout() since
otherwise it would be easy to regress that code, so please also
document that as I did.

I'll re-iterate again also:

Shuah's commit 0542ad88fbdd81bb ("firmware loader: Fix
_request_firmware_load(

Re: [PATCH v2] firmware_loader: fix use-after-free in firmware_fallback_sysfs

2021-04-13 Thread Luis Chamberlain
On Tue, Apr 13, 2021 at 04:12:42PM +0530, Anirudh Rayabharam wrote:
> The use-after-free happens when a fw_priv object has been freed but
> hasn't been removed from the pending list (pending_fw_head). The next
> time fw_load_sysfs_fallback tries to insert into the list, it ends up
> accessing the pending_list member of the previoiusly freed fw_priv.
> 
> In commit bcfbd3523f3c ("firmware: fix a double abort case with
> fw_load_sysfs_fallback"), fw_load_abort() is skipped if
> fw_sysfs_wait_timeout() returns -ENOENT. This causes the fw_priv to
> not be removed from the pending list.
> 
> To fix this, delete the fw_priv from the pending list when retval
> is -ENOENT instead of skipping the entire block.
> 
> Fixes: bcfbd3523f3c ("firmware: fix a double abort case with 
> fw_load_sysfs_fallback")
> Reported-by: syzbot+de271708674e20930...@syzkaller.appspotmail.com
> Tested-by: syzbot+de271708674e20930...@syzkaller.appspotmail.com
> Signed-off-by: Anirudh Rayabharam 

Thanks for your patch Anirudh, but please also see this reply to the
issue:

http://lkml.kernel.org/r/20210403013143.gv4...@42.do-not-panic.com

The way you patched the issue is just a band-aid, meaning we keep on
moving the issue further and it seems that's just the wrong approach.

Can you try the patch in that thread, to verify if the UAF goes away?


Re: [PATCH v2 1/1] kernel.h: Split out panic and oops helpers

2021-04-09 Thread Luis Chamberlain
On Fri, Apr 09, 2021 at 01:02:50PM +0300, Andy Shevchenko wrote:
> kernel.h is being used as a dump for all kinds of stuff for a long time.
> Here is the attempt to start cleaning it up by splitting out panic and
> oops helpers.
> 
> There are several purposes of doing this:
> - dropping dependency in bug.h
> - dropping a loop by moving out panic_notifier.h
> - unload kernel.h from something which has its own domain
> 
> At the same time convert users tree-wide to use new headers, although
> for the time being include new header back to kernel.h to avoid twisted
> indirected includes for existing users.
> 
> Signed-off-by: Andy Shevchenko 
> Reviewed-by: Bjorn Andersson 
> Acked-by: Mike Rapoport 
> Acked-by: Corey Minyard 
> Acked-by: Christian Brauner 
> Acked-by: Arnd Bergmann 
> Acked-by: Kees Cook 
> Acked-by: Wei Liu 
> Acked-by: Rasmus Villemoes 
> Signed-off-by: Andrew Morton 

Acked-by: Luis Chamberlain 

  Luis


Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate

2021-04-08 Thread Luis Chamberlain
On Thu, Apr 08, 2021 at 10:55:53AM +0200, Greg KH wrote:
> So to add crazy complexity to the kernel, 

I agree that this can be tricky. However, driver developers are going to
open code this either way. The problem with that as well, and one of my
own reasons for striving for at least *trying* for a generic solution,
was that I am aware that driver developers may be trying a busy solution
rather than the try method. The busy approach means you could also end
up in a situation where userspace can prevent full undoing / removal
of a driver. The try method is best effort in the sense that if the
driver is going it won't be allowed.

So a sensible consideration I think is we at least choose one of these:

 a) We allow driver developers to open code it on drivers which need it on
each and every single sysfs knob on the driver where its needed, and
accept folks might do it wrong

 b) Provide a macro which is opt-in, ie, not doing it for all
attributes, but perhaps one which the driver author *is* aware to
try / put of the driver method.

 c) Document this race and other races / failings so that driver
authors who do care about module removal are aware and know what
to do.

In this thread two races were exposed on syfs:

  * deadlock when a sysfs attribute uses a lock which is also used on
module __exit

  * possible race against the device's private data, and this is type
specific. I think many people probably missed the last hunks of my
proposed patch which added dev_type_get() which were not part of the
deadlock fix. At least I showed how attributes for all block devices
have an exposed race, which can crash if removal of a block device
with del_gendisk() happens while a sysfs attribute is about to be
used.
 
I don't think either race is trivial for a driver developer to assume a
solution for. Most focus on this thread was about the first race, the
seconod however is also very real, and likely can cause more issues on
rolling backs on error codes unrelated to rmmod...

> for an operation that can only
> be triggered manually by a root user, is not worth it in my opinion, as
> the maintainer of that code the complexity was asked to be made to.

Three things:

1) Many driver maintainers *do* care that rmmod works well. To the point
that if it does not, they feel ashamed. Reason for this is that in some
subsystems this *is* a typical test case. So although rmmod may be
a vodoo thing for core, many driver developers do want this to work
well.

As someone who also works on many test cases to expose odd issues in the
kernel unrelated to module removal, I can also say that module removal
does also expose other possible races which would otherwise be difficult
to trigger. So it is also a helfup aid as a developer working on
differen areas of the kernel.

2) Folks have pointed out that this is not just about rmmod, rolling
back removal of sysfs attributes due to error code paths is another
real scenario to consider. I don't think we have rules to tell userspace
to not muck with sysfs files after they are exposed. In fact those
uevents we send to userspace allow userspace to know exactly when to
muck with them.

3) Sadly, some sysfs attributes also purposely do things which *also*
mimic what is typically done on module removal, such as removal of an
interface, or block device. That was the case with zram, it allows
remvoal / reset of a device... Yes these are odd things to do, but we
allow for it. And sysfs users *do* need to be aware of these corner
cases if they want to support them.

There **may** be some severity to some of the issues mentioned above, to
allow really bad things to be done in userspace even without module
removal... but I didn't have time yet to expose a pattern with coccinelle
yet to see how commonplace some of these issue are. I was focusing at
first more for a generic solution if possible, as I thought that would
be better first evaluated, and to let others slowly become aware of the
issue.

  Luis


Re: [PATCH v1 1/1] kernel.h: Split out panic and oops helpers

2021-04-07 Thread Luis Chamberlain
On Wed, Apr 07, 2021 at 05:59:19PM +0300, Andy Shevchenko wrote:
> On Wed, Apr 7, 2021 at 5:30 PM Luis Chamberlain  wrote:
> > On Wed, Apr 07, 2021 at 10:33:44AM +0300, Andy Shevchenko wrote:
> > > On Wed, Apr 7, 2021 at 10:25 AM Luis Chamberlain  
> > > wrote:
> > > > On Tue, Apr 06, 2021 at 04:31:58PM +0300, Andy Shevchenko wrote:
> 
> ...
> 
> > > > Why is it worth it to add another file just for this?
> > >
> > > The main point is to break tons of loops that prevent having clean
> > > headers anymore.
> > >
> > > In this case, see bug.h, which is very important in this sense.
> >
> > OK based on the commit log this was not clear, it seemed more of moving
> > panic stuff to its own file, so just cleanup.
> 
> Sorry for that. it should have mentioned the kernel folder instead of
> lib. But I think it won't clarify the above.
> 
> In any case there are several purposes in this case
>  - dropping dependency in bug.h
>  - dropping a loop by moving out panic_notifier.h
>  - unload kernel.h from something which has its own domain
> 
> I think that you are referring to the commit message describing 3rd
> one, but not 1st and 2nd.

Right!

> I will amend this for the future splits, thanks!

Don't get me wrong, I love the motivation behind just the 3rd purpose,
however I figured there might be something more when I saw panic_notifier.h.
It was just not clear.

But awesome stuff!

  Luis


Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate

2021-04-07 Thread Luis Chamberlain
On Tue, Apr 06, 2021 at 06:38:24PM -0700, Minchan Kim wrote:
> To clarify what I understood form the discussion until now:
> 
> 1. zram shouldn't allow creating more zram instance during
> rmmod(It causes CPU multistate splat)

Right!

> 2. the private data of gendisk shouldn't destroyed while zram
> sysfs knob is working(it makes system goes crash)

Yup, this is resolved with the bdgrab / bdput on each sysfs knob.

And the last issue is:

3) which patch 2/2 addresed. If a mutex is shared on sysfs
knobs and module removal, you must do try_module_get() to prevent
the deadlock.

  Luis


Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate

2021-04-07 Thread Luis Chamberlain
On Tue, Apr 06, 2021 at 06:23:46PM -0700, Minchan Kim wrote:
> On Tue, Apr 06, 2021 at 12:29:09AM +0000, Luis Chamberlain wrote:
> > On Mon, Apr 05, 2021 at 12:58:05PM -0700, Minchan Kim wrote:
> > > On Mon, Apr 05, 2021 at 07:00:23PM +, Luis Chamberlain wrote:
> > > > Which is why the *try_module_get()* I think is much more suitable, as
> > > > it will always fails if we're already going down.
> > > 
> > > How does the try_module_get solved the problem?
> > 
> > The try stuff only resolves the deadlock. The bget() / bdput() resolves
> > the race to access to the gendisk private_data.
> 
> That's the one I missed in this discussion. Now I am reading your [2/2]
> in original patch. I thought it was just zram instance was destroyed
> by sysfs race problem so you had seen the deadlock.

Patch [2/2] indeed dealt with a zram instance race but the issue was not
a crash due to indirection, it was because of a deadlock. The deadlock
happens because a shared mutex is used both at sysfs and also on __exit.
I'll expand on the example as you request so that both issues are
clearly understood.

The zram race you spotted which could lead to a derefence and crash is
a separate one, which the bget() / bdput() on sysfs knobs resolves. That
race happens because zram's sysfs knobs don't prevent del_gendisk() from
completing currently, and so a dereference can happen.

> Hmm, we are discussing several problems all at once. I feel it's time
> to jump v2 with your way in this point. You said three different
> problems. As I asked, please write it down with more detail with
> code sequence as we discussed other thread. If you mean a deadlock,
> please write what specific locks was deadlock with it.
> It would make discussion much easier. Let's discuss the issue
> one by one in each thread.

Sure. Will post a v2.

> > But so far Greg does not see enough justification for a), so we can either
> > show how wider this issue is (which I can do using coccinelle), or we
> > just open code the try_module_get() / put on each driver that needs it
> > for now. Either way it would resolve the issue.
> 
> I second if it's general problem for drivers, I agree it's worth to
> addresss in the core unless the driver introduce the mess. I have
> no idea here since I didn't understand the problem, yet.

I suspect these issue are generic, however hard to reproduce, but this
just means busying out sysfs knobs and doing module removal can likely
cause crashes to some kernel drivers.

Since it seems the position to take is module removal is best effort,
if crashes on module removal are important to module maintainers, the
position to take is that such races are best addressed on the driver
side, not core.

> > As for b), given that I think even you had missed my attempt to
> > generialzie the bdget/bdput solution for any attribute type (did you see
> > my dev_type_get() and dev_type_put() proposed changes?), I don't think
> > this problem is yet well defined in a generic way for us to rule it out.
> > It is however easier to simply open code this per driver that needs it
> > for now given that I don't think Greg is yet convinced the deadlock is
> > yet a widespread issue. I however am pretty sure both races races *do*
> > exist outside of zram in many places.
> 
> It would be good sign to propose general solution.

Same situation here, as the race is with module removal.

Even in the case where module removal is possible today and likely
"supported" -- on livepatching, it seems we're shying away slowly from
that situation, and will likely expose a knob to ensure removing a
livepatch (and do module removal) will require a knob to be flipped
to say, "Yes, I know what I am doing").

> > They try_module_get() approach resolves the deadlock race, but it does
> > so in a lazy way. I mean lazy in that then rmmod wins over sysfs knobs.
> > So touching sysfs knobs won't make an rmmod fail. I think that's more
> > typical expected behaviour. Why? Because I find it odd that looping
> > forever touching sysfs attributes should prevent a module removal. But
> > that's a personal preference.
> 
> I agree with you that would be better but let's see how it could go
> clean.

OK great.

> FYI, please look at hot_remove_store which also can remove zram
> instance on demand. I am looking forward to seeing your v2.

Sure, I'll address all sysfs knobs in v2.

  Luis


Re: [PATCH v1 1/1] kernel.h: Split out panic and oops helpers

2021-04-07 Thread Luis Chamberlain
On Wed, Apr 07, 2021 at 10:33:44AM +0300, Andy Shevchenko wrote:
> On Wed, Apr 7, 2021 at 10:25 AM Luis Chamberlain  wrote:
> >
> > On Tue, Apr 06, 2021 at 04:31:58PM +0300, Andy Shevchenko wrote:
> > > diff --git a/include/linux/panic_notifier.h 
> > > b/include/linux/panic_notifier.h
> > > new file mode 100644
> > > index ..41e32483d7a7
> > > --- /dev/null
> > > +++ b/include/linux/panic_notifier.h
> > > @@ -0,0 +1,12 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#ifndef _LINUX_PANIC_NOTIFIERS_H
> > > +#define _LINUX_PANIC_NOTIFIERS_H
> > > +
> > > +#include 
> > > +#include 
> > > +
> > > +extern struct atomic_notifier_head panic_notifier_list;
> > > +
> > > +extern bool crash_kexec_post_notifiers;
> > > +
> > > +#endif   /* _LINUX_PANIC_NOTIFIERS_H */
> >
> > Why is it worth it to add another file just for this?
> 
> The main point is to break tons of loops that prevent having clean
> headers anymore.
>
> In this case, see bug.h, which is very important in this sense.

OK based on the commit log this was not clear, it seemed more of moving
panic stuff to its own file, so just cleanup.

> >  Seems like a very
> > small file.
> 
> If it is an argument, it's kinda strange. We have much smaller headers.

The motivation for such separate file was just not clear on the commit
log.

  Luis


Re: [PATCH v1 1/1] kernel.h: Split out panic and oops helpers

2021-04-06 Thread Luis Chamberlain
On Tue, Apr 06, 2021 at 04:31:58PM +0300, Andy Shevchenko wrote:
> diff --git a/include/linux/panic_notifier.h b/include/linux/panic_notifier.h
> new file mode 100644
> index ..41e32483d7a7
> --- /dev/null
> +++ b/include/linux/panic_notifier.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_PANIC_NOTIFIERS_H
> +#define _LINUX_PANIC_NOTIFIERS_H
> +
> +#include 
> +#include 
> +
> +extern struct atomic_notifier_head panic_notifier_list;
> +
> +extern bool crash_kexec_post_notifiers;
> +
> +#endif   /* _LINUX_PANIC_NOTIFIERS_H */

Why is it worth it to add another file just for this? Seems like a very
small file.

  Luis


Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate

2021-04-05 Thread Luis Chamberlain
On Mon, Apr 05, 2021 at 12:58:05PM -0700, Minchan Kim wrote:
> On Mon, Apr 05, 2021 at 07:00:23PM +0000, Luis Chamberlain wrote:
> > On Mon, Apr 05, 2021 at 10:07:24AM -0700, Minchan Kim wrote:
> > > On Thu, Apr 01, 2021 at 11:59:25PM +, Luis Chamberlain wrote:
> > > > And come to think of it the last patch I had sent with a new
> > > > DECLARE_RWSEM(zram_unload) also has this same issue making most
> > > > sysfs attributes rather fragile.
> > > 
> > > Thanks for looking the way. I agree the single zram_index_rwlock is
> > > not the right approach to fix it. However, I still hope we find more
> > > generic solution to fix them at once since I see it's zram instance
> > > racing problem.
> > 
> > They are 3 separate different problems. Related, but different.
> 
> What are 3 different problems? I am asking since I remember only two:
> one for CPU multistate and the other one for sysfs during rmmod.

The third one is the race to use sysfs attributes and those routines
then derefernece th egendisk private_data.

> > If the idea then is to busy out rmmod if a sysfs attribute is being
> > read, that could then mean rmmod can sometimes never complete. Hogging
> > up / busying out sysfs attributes means the module cannto be removed.
> 
> It's true but is it a big problem? There are many cases that system
> just return error if it's busy and rely on the admin. IMHO, rmmod should
> be part of them.

It depends on existing userspace scripts which are used to test and
expectations set. Consider existing tests, you would know better, and
since you are the maintainer you decide.

I at least know for many other types of device drivers an rmmod is
a sledge hammer.

You decide. I just thought it would be good to highlight the effect now
rather than us considering it later.

> > Which is why the *try_module_get()* I think is much more suitable, as
> > it will always fails if we're already going down.
> 
> How does the try_module_get solved the problem?

The try stuff only resolves the deadlock. The bget() / bdput() resolves
the race to access to the gendisk private_data.

> > > I see one of the problems is how I could make new zram object's
> > > attribute group for zram knobs under /sys/block/zram0 since block
> > > layer already made zram0 kobject via device_add_disk.
> > 
> > Right.. well the syfs attribute races uncovered here actually do
> > apply to any block driver as well. And which is why I was aiming
> > for something generic if possible.
> 
> It would be great but that's not the one we have atm so want to
> proceed to fix anyway.

What is not the one we have atm? I *do* have a proposed generic solution
for 2/3 issues we have been disussing:

 a) deadlock on sysfs access
 b) gendisk private_data race

But so far Greg does not see enough justification for a), so we can either
show how wider this issue is (which I can do using coccinelle), or we
just open code the try_module_get() / put on each driver that needs it
for now. Either way it would resolve the issue.

As for b), given that I think even you had missed my attempt to
generialzie the bdget/bdput solution for any attribute type (did you see
my dev_type_get() and dev_type_put() proposed changes?), I don't think
this problem is yet well defined in a generic way for us to rule it out.
It is however easier to simply open code this per driver that needs it
for now given that I don't think Greg is yet convinced the deadlock is
yet a widespread issue. I however am pretty sure both races races *do*
exist outside of zram in many places.

> > I am not sure if you missed the last hunks of the generic solution,
> > but that would resolve the issue you noted. Here is the same approach
> > but in a non-generic solution, specific to just one attribute so far
> > and to zram:
> 
> So idea is refcount of the block_device's inode 

Yes that itself prevents races against the gendisk private_data from
being invalid. Why because a bdget() would not be successful after
del_gendisk():

del_gendisk() --> invalidate_partition() --> __invalidate_device() --> 
invalidate_inodes()

> and module_exit path
> checks also the inode refcount to make rmmod failure?

They try_module_get() approach resolves the deadlock race, but it does
so in a lazy way. I mean lazy in that then rmmod wins over sysfs knobs.
So touching sysfs knobs won't make an rmmod fail. I think that's more
typical expected behaviour. Why? Because I find it odd that looping
forever touching sysfs attributes should prevent a module removal. But
that's a personal preference.

  Luis


Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate

2021-04-05 Thread Luis Chamberlain
On Mon, Apr 05, 2021 at 10:07:24AM -0700, Minchan Kim wrote:
> On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote:
> > And come to think of it the last patch I had sent with a new
> > DECLARE_RWSEM(zram_unload) also has this same issue making most
> > sysfs attributes rather fragile.
> 
> Thanks for looking the way. I agree the single zram_index_rwlock is
> not the right approach to fix it. However, I still hope we find more
> generic solution to fix them at once since I see it's zram instance
> racing problem.

They are 3 separate different problems. Related, but different.
At this point I think it would be difficult to resolve all 3
with one solution without creating side issues, but hey,
I'm curious if you find a solution that does not create other
issues.

> A approach I am considering is to make struct zram include kobject
> and then make zram sysfs auto populated under the kobject. So, zram/sysfs
> lifetime should be under the kobject. With it, sysfs race probem I
> mentioned above should be gone. Furthermore, zram_remove should fail
> if one of the alive zram objects is existing
> (i.e., zram->kobject->refcount > 1) so module_exit will fail, too.

If the idea then is to busy out rmmod if a sysfs attribute is being
read, that could then mean rmmod can sometimes never complete. Hogging
up / busying out sysfs attributes means the module cannto be removed.
Which is why the *try_module_get()* I think is much more suitable, as
it will always fails if we're already going down.

> I see one of the problems is how I could make new zram object's
> attribute group for zram knobs under /sys/block/zram0 since block
> layer already made zram0 kobject via device_add_disk.

Right.. well the syfs attribute races uncovered here actually do
apply to any block driver as well. And which is why I was aiming
for something generic if possible.

I am not sure if you missed the last hunks of the generic solution,
but that would resolve the issue you noted. Here is the same approach
but in a non-generic solution, specific to just one attribute so far
and to zram:

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 494695ff227e..b566916e4ad9 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1724,6 +1724,11 @@ static ssize_t disksize_store(struct device *dev,
 
mutex_lock(_index_mutex);
 
+   if (!bdgrab(dev_to_bdev(dev))) {
+   err = -ENODEV;
+   goto out_nodev;
+   }
+
if (!zram_up || zram->claim) {
err = -ENODEV;
goto out;
@@ -1760,6 +1765,7 @@ static ssize_t disksize_store(struct device *dev,
zram->disksize = disksize;
set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
up_write(>init_lock);
+   bdput(dev_to_bdev(dev);
 
mutex_unlock(_index_mutex);
 
@@ -1770,6 +1776,8 @@ static ssize_t disksize_store(struct device *dev,
 out_unlock:
up_write(>init_lock);
 out:
+   bdput(dev_to_bdev(dev);
+out_nodev:
mutex_unlock(_index_mutex);
return err;
 }


Re: [syzbot] KASAN: use-after-free Read in fw_load_sysfs_fallback

2021-04-02 Thread Luis Chamberlain
On Fri, Apr 02, 2021 at 02:41:20PM -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:5ee96fa9 Merge tag 'irq-urgent-2021-03-21' of git://git.ke..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=1028d621d0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=c51293a9ca630f6d
> dashboard link: https://syzkaller.appspot.com/bug?extid=9b91d635e2b51efd6371
> compiler:   Debian clang version 11.0.1-2
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1794dad6d0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11f7afe6d0
> 
> The issue was bisected to:
> 
> commit bcfbd3523f3c6eea51a74d217a8ebc5463bcb7f4
> Author: Junyong Sun 
> Date:   Tue Mar 3 02:36:08 2020 +
> 
> firmware: fix a double abort case with fw_load_sysfs_fallback
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=17e4778ad0
> final oops: https://syzkaller.appspot.com/x/report.txt?x=1414778ad0
> console output: https://syzkaller.appspot.com/x/log.txt?x=1014778ad0
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+9b91d635e2b51efd6...@syzkaller.appspotmail.com
> Fixes: bcfbd3523f3c ("firmware: fix a double abort case with 
> fw_load_sysfs_fallback")
> 
> platform regulatory.0: Direct firmware load for regulatory.db failed with 
> error -2
> platform regulatory.0: Falling back to sysfs fallback for: regulatory.db
> ==
> BUG: KASAN: use-after-free in __list_add_valid+0x36/0xc0 lib/list_debug.c:23
> Read of size 8 at addr 888014188ac8 by task syz-executor310/9819
> 
> CPU: 0 PID: 9819 Comm: syz-executor310 Not tainted 5.12.0-rc3-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x176/0x24e lib/dump_stack.c:120
>  print_address_description+0x5f/0x3a0 mm/kasan/report.c:232
>  __kasan_report mm/kasan/report.c:399 [inline]
>  kasan_report+0x15c/0x200 mm/kasan/report.c:416
>  __list_add_valid+0x36/0xc0 lib/list_debug.c:23
>  __list_add include/linux/list.h:67 [inline]
>  list_add include/linux/list.h:86 [inline]
>  fw_load_sysfs_fallback+0x110/0x720 
> drivers/base/firmware_loader/fallback.c:516
>  fw_load_from_user_helper+0x242/0x320 
> drivers/base/firmware_loader/fallback.c:581
>  _request_firmware+0x2c5/0x4c0 drivers/base/firmware_loader/main.c:831
>  request_firmware+0x35/0x50 drivers/base/firmware_loader/main.c:875
>  reg_reload_regdb+0x3a/0x1b0 net/wireless/reg.c:1095
>  genl_family_rcv_msg_doit net/netlink/genetlink.c:739 [inline]
>  genl_family_rcv_msg net/netlink/genetlink.c:783 [inline]
>  genl_rcv_msg+0xe4e/0x1280 net/netlink/genetlink.c:800
>  netlink_rcv_skb+0x190/0x3a0 net/netlink/af_netlink.c:2502
>  genl_rcv+0x24/0x40 net/netlink/genetlink.c:811
>  netlink_unicast_kernel net/netlink/af_netlink.c:1312 [inline]
>  netlink_unicast+0x786/0x940 net/netlink/af_netlink.c:1338
>  netlink_sendmsg+0x9ae/0xd50 net/netlink/af_netlink.c:1927
>  sock_sendmsg_nosec net/socket.c:654 [inline]
>  sock_sendmsg net/socket.c:674 [inline]
>  sys_sendmsg+0x519/0x800 net/socket.c:2350
>  ___sys_sendmsg net/socket.c:2404 [inline]
>  __sys_sendmsg+0x2bf/0x370 net/socket.c:2433
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> RIP: 0033:0x44ab59
> Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 01 16 00 00 90 48 89 f8 48 89 f7 48 
> 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 
> 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:7fa2b8ba9318 EFLAGS: 0246 ORIG_RAX: 002e
> RAX: ffda RBX: 004d62a8 RCX: 0044ab59
> RDX:  RSI: 21c0 RDI: 0003
> RBP: 004d62a0 R08: 0099 R09: 
> R10: 000c R11: 0246 R12: 0031313230386c6e
> R13: 7ffeeaba390f R14: 7fa2b8ba9400 R15: 00022000
> 
> Allocated by task 9818:
>  kasan_save_stack mm/kasan/common.c:38 [inline]
>  kasan_set_track mm/kasan/common.c:46 [inline]
>  set_alloc_info mm/kasan/common.c:427 [inline]
>  kasan_kmalloc+0xc2/0xf0 mm/kasan/common.c:506
>  kasan_kmalloc include/linux/kasan.h:233 [inline]
>  kmem_cache_alloc_trace+0x21b/0x350 mm/slub.c:2934
>  kmalloc include/linux/slab.h:554 [inline]
>  kzalloc include/linux/slab.h:684 [inline]
>  __allocate_fw_priv+0x98/0x340 drivers/base/firmware_loader/main.c:186
>  alloc_lookup_fw_priv+0x1c3/0x380 drivers/base/firmware_loader/main.c:250
>  _request_firmware_prepare+0x23c/0x5b0 drivers/base/firmware_loader/main.c:744
>  _request_firmware+0xd9/0x4c0 drivers/base/firmware_loader/main.c:806
>  request_firmware+0x35/0x50 drivers/base/firmware_loader/main.c:875
>  reg_reload_regdb+0x3a/0x1b0 net/wireless/reg.c:1095
>  

Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate

2021-04-02 Thread Luis Chamberlain
On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote:
> On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote:
> > As for the syfs deadlock possible with drivers, this fixes it in a generic 
> > way:
> > 
> > commit fac43d8025727a74f80a183cc5eb74ed902a5d14
> > Author: Luis Chamberlain 
> > Date:   Sat Mar 27 14:58:15 2021 +
> > 
> > sysfs: add optional module_owner to attribute
> > 
> > This is needed as otherwise the owner of the attribute
> > or group read/store might have a shared lock used on driver removal,
> > and deadlock if we race with driver removal.
> > 
> > Signed-off-by: Luis Chamberlain 
> 
> No, please no.  Module removal is a "best effort",

Not for live patching. I am not sure if I am missing any other valid
use case?

> if the system dies when it happens, that's on you. 

I think the better approach for now is simply to call testers / etc to
deal with this open coded. I cannot be sure that other than live
patching there may be other valid use cases for module removal, and for
races we really may care for where userspace *will* typically be mucking
with sysfs attributes. Monitoring my systems's sysfs attributes I am
actually quite surprised at the random pokes at them.

> I am not willing to expend extra energy
> and maintance of core things like sysfs for stuff like this that does
> not matter in any system other than a developer's box.

Should we document this as well? Without this it is unclear that tons of
random tests are sanely nullified. At least this dead lock I spotted can
be pretty common form on many drivers.

> Lock data, not code please.  Trying to tie data structure's lifespans
> to the lifespan of code is a tangled mess, and one that I do not want to
> add to in any form.

Driver developers will simply have to open code these protections. In
light of what I see on LTP / fuzzing, I suspect the use case will grow
and we'll have to revisit this in the future. But for now, sure, we can
just open code the required protections everywhere to not crash on module
removal.

  Luis


Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate

2021-04-01 Thread Luis Chamberlain
On Mon, Mar 22, 2021 at 03:12:01PM -0700, Minchan Kim wrote:
> On Mon, Mar 22, 2021 at 08:41:56PM +0000, Luis Chamberlain wrote:
> > 
> > I would not call it *every* syfs knob as not all deal with things which
> > are related to CPU hotplug multistate, right? Note that using just
> > try_module_get() alone (that is the second patch only, does not fix the
> > race I am describing above).
> 
> It wouldn't be CPU hotplug multistate issue but I'd like to call it
> as more "zram instance race" bug.
> What happens in this case?
> 
> CPU 1CPU 2
> 
> destroy_devices
> ..
>  compact_store()
>  zram = dev_to_zram(dev);
> idr_for_each(zram_remove_cb
>   zram_remove
>   ..
>   kfree(zram)
>  down_read(>init_lock);
> 
> 
> CPU 1CPU 2
> 
> hot_remove_store
>  compact_store()
>  zram = dev_to_zram(dev);
>   zram_remove
> kfree(zram)
>  down_read(>init_lock);
>   
> So, for me we need to close the zram instance create/removal
> with sysfs rather than focusing on CPU hotplug issue.

Sure, that's a good point.

The issue which I noted for the race which ends up in a deadlock is only
possible if a shared lock is used on removal but also on sysfs knobs.

At first glance, the issue you describe above *seems* to be just proper
care driver developers must take with structures used. It is certainly
yet another issue we need to address, and if we can generalize a
solution even better. I now recall I *think* I spotted that race a while
ago and mentioned it to Kees and David Howells but I didn't have a
solution for it yet. More on this below.

The issue you point out is real, however you cannot disregard the
CPU hoplug possible race as well, it is a design consideration which
the CPU hotplug multistate support warns for -- consider driver removal.
I agree that perhaps solving this "zram instance race" can fix he
hotplug race as well. If we can solves all 3 issues in one shot even
better. But let's evaluate that prospect...

> Maybe, we could reuse zram_index_mutex with modifying it with
> rw_semaphore. What do you think?

Although ideal given it would knock 3 birds with 1 stone, it ends up
actually making the sysfs attributes rather useless in light of the
requirements for each of the races. Namely, the sysfs deadlock race
*must* use a try lock approach, just as the try_module_get() case.
It must use this approach so to immediately just bail out if we have
our module being removed, and so on our __exit path. By trying to
repurpose zram_index_mutex we end up then just doing too much with it
and making the syfs attributes rather fragile for most uses:

Consider disksize_show(), that would have to become:

static ssize_t disksize_show(struct device *dev, struct device_attribute *attr, 
char *buf)
{
struct zram *zram = dev_to_zram(dev);
+   size_t disksize;

+   down_read(_index_rwlock);
+   disksize = zram->disksize;
+   up_read(_index_rwlock);
-   return scnprintf(buf, PAGE_SIZE, "%llu\n", zram->disksize);
+   return scnprintf(buf, PAGE_SIZE, "%llu\n", disksize);
}

What's wrong with this?

It can block during a write, yes, but there is a type of write which
will make this crash after the read lock is acquired. When the instance
is removed. What if we try down_read_trylock()?

static ssize_t disksize_show(struct device *dev, struct device_attribute *attr, 
char *buf)
{
struct zram *zram = dev_to_zram(dev);
+   size_t disksize;

+   if (!down_read_trylock(_index_rwlock))
+   return -ENODEV;
+
+   disksize = zram->disksize;
+   up_read(_index_rwlock);
-   return scnprintf(buf, PAGE_SIZE, "%llu\n", zram->disksize);
+   return scnprintf(buf, PAGE_SIZE, "%llu\n", disksize);
}

What's wrong with this?

If it got the lock, it should be OK as it is preventing writes from
taking the lock for a bit. But then this just becomes pretty fragile,
it will fail whenever another read or write is happening, triggering
perhaps quite a bit of regressions on tests.

And if we use write_trylock() we end up with the same fragile nature
of failing the read with ENODEV for any silly thing going on with the
driver.

And come to think of it the last patch I had sent with a new
DECLARE_RWSEM(zram_unload) also has this same issue making most
sysfs attributes rather fragile.

So, I still believe we should split this up into two separate patches then
as I had originally proposed. I looked into the race you described as well
and I believe I have a generic solution for that as well.

As for the syf

Re: [Ksummit-discuss] RFC: create mailing list "linux-issues" focussed on issues/bugs and regressions

2021-03-23 Thread Luis Chamberlain
On Mon, Mar 22, 2021 at 1:25 PM Thorsten Leemhuis  wrote:
>
>
>
> On 22.03.21 19:32, Linus Torvalds wrote:
> > On Mon, Mar 22, 2021 at 8:18 AM Thorsten Leemhuis  
> > wrote:
> >>
> >> I even requested a
> >> "linux-regressi...@vger.kernel.org" a while later, but didn't hear
> >> anything back; and, sadly, about the same time I started having trouble
> >> finding spare time for working on regression tracking. :-/
> >
> > Honestly, I'd much prefer the name 'linux-regressions' as being much
> > more targeted than 'linux-issues'.
>
> That only solves one of the two problem I'm trying to solve (albeit the
> one that is more important to me). That way users still have no easy way
> to query for reports about issues that are no regressions – say
> something is broken and they have no idea if it once worked or never
> worked at all.

Without a known baseline of what works OK an issue cannot easily be
categorized as a regression. This "problem" I think deserves its own
considerations.

There are some kernel-ci solutions out there which report "issues"
which help develop such baselines, however what we need is a community
visible list of the sum, a list of *known issues upstream* represents
a baseline. The easiest way to develop such baselines are with
respective tests. We however obviously need to also accept new user
reported issues as possible issues which can be candidate baseline
issues, for which perhaps there are no known tests yet available to
reproduce.

Then there are the considerations also that some distribution issues,
which can be part of a distribution baseline, might fit into the
circle of upstream known issues, or baseline as well. But not all
issues part of a distribution baseline are part of the upstream
baseline, an example is a botched backport. Most distribution
baselines however tend to be private, however I'd like to see that
changed using OpenSUSE/SLE as an example in order to help with the
upstream baseline effort. I'd like to encourage other distributions to
follow suit.

Test frameworks help develop a baseline and so working on them helps
reduce the scope of this problem. We have many test frameworks. What I
have not seen is a public generic baseline "list" for each of these. I
have spent a bit of time on this problem and have come up with a
generic format for issues on test frameworks as part of kdevops [0] in
the hopes that it could be used to easily grep for known issues
against upstream kernels / distribution releases. The format is
simple:

mcgrof@bicho ~/kdevops (git::master)$ cat
workflows/blktests/expunges/5.12.0-rc1-next-20210304/failures.txt
block/009 # korg#212305 failure rate 1/669
block/011
block/012

The korg#212305 refers to bugzilla.kernel.org bug ID #212305 [0].

Distribution issues:

mcgrof@bicho ~/kdevops (git::master)$ cat
workflows/blktests/expunges/debian/testing/failures.txt
block/011
block/012
meta/005
meta/006
meta/009
nbd/002
nbd/003 # causes a hang after running a few times
scsi/004

I have support for blktests and fstests, will add selftests soon. I
tend to work on debian baseline as a public demo for work. The
OpenSUSE Leap 15.3 baseline will be reflective of the real SLE15.3
baseline.

The nice thing about having a public baseline is we can then really be
confident into labelling a new issue that comes up as a possible
regression. However, confidence is subjective, and so one must also
define confidence clearly. You associate confidence to a baseline by
the number of full tests you have run against a baseline for a
respective test framework. Borrowing IO stabilizing terms, I'm using a
test "steady state goal" for this, it means how many times have you
run all possible tests against a known baseline without failure. So a
steady state of 100 for blktests means your confidence in the baseline
you have developed is of 100 full tests. A higher steady state goal
however means more time is required to test, and so sometimes you
might be confined to only use a low steady state goal, but then use
side workers to run random tests with a higher test count. So for
instance, the failure rate of the issue reported on korg#212305 is
defined by the average number of times one must run a test in order
for it to fail. If your baseline steady state goal was just 100,
chances are low you may have run into that issue.

Are there other known collections of public baselines easily grep'able
for different test frameworks? Where can we contribute and collaborate
to such a thing?

PS. My current goal for steady state goal for upstream is 1000 for
blktests, 100 for fstests per filesystem.

[0] https://github.com/mcgrof/kdevops
[1] https://bugzilla.kernel.org/show_bug.cgi?id=212305

  Luis


Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate

2021-03-22 Thread Luis Chamberlain
On Mon, Mar 22, 2021 at 09:37:17AM -0700, Minchan Kim wrote:
> On Fri, Mar 19, 2021 at 07:09:24PM +0000, Luis Chamberlain wrote:
> > Indeed one issue is a consequence of the other but a bit better
> > description can be put together for both separately.
> > 
> > The warning comes up when cpu hotplug detects that the callback
> > is being removed, but yet "instances" for multistate are left behind,
> > ie, not removed. CPU hotplug multistate allows us to dedicate a callback
> > for zram so that it gets called every time a CPU hot plugs or unplugs.
> > In the zram driver's case we want to trigger the callback per each
> > struct zcomp, one is used per supported and used supported compression
> 
>  you meant "each one is used per supported compression"?

Yup.

> > algorithm. The zram driver allocates a struct zcomp for an compression
> > algorithm on a need basis. The default compressor is CONFIG_ZRAM_DEF_COMP
> > which today is CONFIG_ZRAM_DEF_COMP_LZORLE ("lzo-rle"). Although we may
> > have different zram devices, zram devices which use the same compression
> > algorithm share the same struct zcomp. The "multi" in CPU hotplug multstate
> 
> It allocates each own zcomp, not sharing even though zram devices
> use the same compression algorithm.

Right.

> > refers to these different struct zcomp instances, each of these will
> > have the callback routine registered run for it. The kernel's CPU
> > hotplug multistate keeps a linked list of these different structures
> > so that it will iterate over them on CPU transitions. By default at
> > driver initialization we will create just one zram device (num_devices=1)
> > and a zcomp structure then set for the lzo-rle comrpession algorithm.
> > At driver removal we first remove each zram device, and so we destroy
> > the struct zcomp. But since we expose sysfs attributes to create new
> > devices or reset / initialize existing ones, we can easily end up
> > re-initializing a struct zcomp before the exit routine of the module
> > removes the cpu hotplug callback. When this happens the kernel's
> > CPU hotplug will detect that at least one instance (struct zcomp for
> > us) exists.
> 
> It's very helpful to understand. Thanks a lot!S
> 
> So IIUC, it's fundamentally race between destroy_devices(i.e., module
> unload) and every sysfs knobs in zram. Isn't it?

I would not call it *every* syfs knob as not all deal with things which
are related to CPU hotplug multistate, right? Note that using just
try_module_get() alone (that is the second patch only, does not fix the
race I am describing above).

There are two issues.

> Below specific example is one of them and every sysfs code in zram
> could access freed object(e.g., zram->something).
> And you are claiming there isn't good way to fix it in kernfs(please
> add it in the description, too) even though it's general problem.

Correct, the easier route would have been through the block layer as
its the one adding our syfs attributes for us but even it canno deal
with this race on behalf of drivers given the currently exposed
semantics on kernfs.

> (If we had, we may detect the race and make zram_remove_cb fails
> so unloading modulies fails, finally)
> 
> So, shouldn't we introcude a global rw_semaphore?
> 
> destroy_devices
>   class_unregister
>   down_write(_unload);
>   idr_for_each(zram_remove_cb);
>   up_write(_unload);
> 
> And every sysfs code hold the lock with down_read in the first place
> and release the lock right before return. So, introduce a zram sysfs
> wrapper to centralize all of locking logic.

Actually that does work but only if we use write lock attempts so to
be able to knock two birds with one stone, so to address the deadlock
with sysfs attribute removal. We're not asking politely to read some
value off of a zram devices with these locks, we're ensuring to not
run if our driver is going to be removed, so replacing the
try_module_get() with something similar.

> Does it make sense?
> 
> > 
> > And so we can have:
> > 
> > static void destroy_devices(void)
> > {
> > class_unregister(_control_class);
> > idr_for_each(_index_idr, _remove_cb, NULL);
> > zram_debugfs_destroy();
> > idr_destroy(_index_idr);
> > unregister_blkdev(zram_major, "zram");
> > cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> > }
> > 
> > While destroy_devices() runs we can have this race:
> > 
> > 
> > CPU 1CPU 2
> > 
> > class_unregister(...);
> > idr_for_each(...);
> I think sysfs was already remove her

Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate

2021-03-19 Thread Luis Chamberlain
On Fri, Mar 12, 2021 at 11:28:21AM -0800, Minchan Kim wrote:
> On Fri, Mar 12, 2021 at 06:32:38PM +0000, Luis Chamberlain wrote:
> > On Thu, Mar 11, 2021 at 06:14:40PM -0800, Minchan Kim wrote:
> > > On Wed, Mar 10, 2021 at 09:21:28PM +, Luis Chamberlain wrote:
> > > > On Mon, Mar 08, 2021 at 06:55:30PM -0800, Minchan Kim wrote:
> > > > > If I understand correctly, bugs you found were related to module
> > > > > unloading race while the zram are still working.
> > > > 
> > > > No, that is a simplifcation of the issue. The issue consists of
> > > > two separate issues:
> > > > 
> > > >  a) race against module unloading in light of incorrect racty use of
> > > > cpu hotplug multistate support
> > > 
> > > 
> > > Could you add some pusedo code sequence to show the race cleary?
> > 
> > Let us deal with each issue one at time. First, let's address
> > understanding the kernel warning can be reproduced easily by
> > triggering zram02.sh from LTP twice:
> > 
> > kernel: [ cut here ]
> > kernel: Error: Removing state 63 which has instances left.
> > kernel: WARNING: CPU: 7 PID: 70457 at kernel/cpu.c:2069 
> > __cpuhp_remove_state_cpuslocked+0xf9/0x100
> > kernel: Modules linked in: zram(E-) zsmalloc(E) 
> > 
> > The first patch prevents this race. This race is possible because on
> > module init we associate callbacks for CPU hotplug add / remove:
> > 
> > static int __init zram_init(void)   
> > 
> > {
> > ...
> > ret = cpuhp_setup_state_multi(CPUHP_ZCOMP_PREPARE, "block/zram:prepare",
> >   zcomp_cpu_up_prepare, zcomp_cpu_dead); 
> > ...
> > }
> > 
> > The zcomp_cpu_dead() accesses the zcom->comp, and if zcomp->comp is
> > removed and this function is called, clearly we'll be accessing some
> > random data here and can easily crash afterwards:
> 
> I am trying to understand this crash. You mentioned the warning
> above but here mention crash(I understand sysfs race but this is
> different topic). What's the relation with above warning and
> crash here? You are talking the warning could change to the
> crash sometimes? 

Indeed one issue is a consequence of the other but a bit better
description can be put together for both separately.

The warning comes up when cpu hotplug detects that the callback
is being removed, but yet "instances" for multistate are left behind,
ie, not removed. CPU hotplug multistate allows us to dedicate a callback
for zram so that it gets called every time a CPU hot plugs or unplugs.
In the zram driver's case we want to trigger the callback per each
struct zcomp, one is used per supported and used supported compression
algorithm. The zram driver allocates a struct zcomp for an compression
algorithm on a need basis. The default compressor is CONFIG_ZRAM_DEF_COMP
which today is CONFIG_ZRAM_DEF_COMP_LZORLE ("lzo-rle"). Although we may
have different zram devices, zram devices which use the same compression
algorithm share the same struct zcomp. The "multi" in CPU hotplug multstate
refers to these different struct zcomp instances, each of these will
have the callback routine registered run for it. The kernel's CPU
hotplug multistate keeps a linked list of these different structures
so that it will iterate over them on CPU transitions. By default at
driver initialization we will create just one zram device (num_devices=1)
and a zcomp structure then set for the lzo-rle comrpession algorithm.
At driver removal we first remove each zram device, and so we destroy
the struct zcomp. But since we expose sysfs attributes to create new
devices or reset / initialize existing ones, we can easily end up
re-initializing a struct zcomp before the exit routine of the module
removes the cpu hotplug callback. When this happens the kernel's
CPU hotplug will detect that at least one instance (struct zcomp for
us) exists.

And so we can have:

static void destroy_devices(void)
{
class_unregister(_control_class);
idr_for_each(_index_idr, _remove_cb, NULL);
zram_debugfs_destroy();
idr_destroy(_index_idr);
unregister_blkdev(zram_major, "zram");
cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
}

While destroy_devices() runs we can have this race:


CPU 1CPU 2

class_unregister(...);
idr_for_each(...);
zram_debugfs_destroy();
disksize_store(...)
idr_destroy(...);
unregister_blkdev(...);
cpuhp_remove_multi_state(...);


The warning comes up on cpuhp_remove_multi_state() when it sees
that the state for CPUHP_ZCOMP_PREPARE does not have an empty
instance linked list.

After the idr_destory() its also easy to run into a crash. The
above predicament also leads to memory leaks.

And so we need to have a state machine for when we're up and when
we're going down generically.

Let me know if it makes sense now, if so I can amend the commit log
accordingly.

  Luis


Re: blktests: block/009 next-20210304 failure rate average of 1/448

2021-03-18 Thread Luis Chamberlain
On Tue, Mar 16, 2021 at 06:47:39PM +, Luis Chamberlain wrote:
> On Tue, Mar 16, 2021 at 05:46:45PM +0000, Luis Chamberlain wrote:
> > I've managed to reproduce blktests block/009 failures with kdevops [0]
> > on linux-next tag next-20210304 with a current failure rate average of
> > 1/448 (3 counted failures so far).
> 
> Confirmed on next-20210316 with current failure rate at 1/1008

Just in case this was a scsi_debug issue instead (I am covering that
prospect on another bug just for scsi_debug korg#212337 [0]) I tried
a userspace solution based on what I have observed I still can reproduce
this block/009 failure. The failure rate is much lower though, I have it
now at 1/1705 but alas it is still failing.

[0] https://bugzilla.kernel.org/show_bug.cgi?id=212337

The patch below demonstrates the exra settle work for scsi_debug
attempted, and with it, this is still failing. So either the settle
work needs *more* effort, or this is a real issue.

diff --git a/common/scsi_debug b/common/scsi_debug
index b48cdc9..ecdbcc6 100644
--- a/common/scsi_debug
+++ b/common/scsi_debug
@@ -8,13 +8,42 @@ _have_scsi_debug() {
_have_modules scsi_debug
 }
 
+# As per korg#212337 [0] we must do more work in userspace to settle
+# scsi_debug devices a bit more carefully.
+
+# [0] https://bugzilla.kernel.org/show_bug.cgi?id=212337
+_settle_scsi_debug_device() {
+   SCSI_DEBUG_MAX_WAIT=10
+   SCSI_DEBUG_COUNT_WAIT_LOOP=0
+   while true ; do
+   if [[ -b $1 ]]; then
+   SCSI_DEBUG_LSOF_COUNT=$(lsof $1 | wc -l)
+   if [[ $SCSI_DEBUG_LSOF_COUNT -ne 0 ]]; then
+   sleep 1;
+   else
+   break
+   fi
+   else
+   # Let device come up
+   sleep 1
+
+   let 
SCSI_DEBUG_COUNT_WAIT_LOOP=$SCSI_DEBUG_COUNT_WAIT_LOOP+1
+   if [[ $SCSI_DEBUG_COUNT_WAIT_LOOP -ge 
$SCSI_DEBUG_MAX_WAIT ]]; then
+   break
+   fi
+   fi
+   done
+}
+
 _init_scsi_debug() {
if ! modprobe -r scsi_debug || ! modprobe scsi_debug "$@"; then
return 1
fi
-
udevadm settle
 
+   # Allow dependencies to load
+   sleep 1
+
local host_sysfs host target_sysfs target
SCSI_DEBUG_HOSTS=()
SCSI_DEBUG_TARGETS=()
@@ -43,6 +72,10 @@ _init_scsi_debug() {
return 1
fi
 
+   for i in $SCSI_DEBUG_DEVICES ; do
+   _settle_scsi_debug_device /dev/$i
+   done
+
return 0
 }
 


Re: blktests: block/009 next-20210304 failure rate average of 1/448

2021-03-18 Thread Luis Chamberlain
Adding linux-fsdevel as folks working on fstests might be
interested.

On Tue, Mar 16, 2021 at 05:46:45PM +, Luis Chamberlain wrote:
> My personal suspicion is not on the block layer but on scsi_debug
> because this can fail:
> 
> modprobe scsi_debug; rmmod scsi_debug
> 
> This second issue may be a secondary separate issue, but I figured 
> I'd mention it. To fix this later issue I've looked at ways to
> make scsi_debug_init() wait until its scsi devices are probed,
> however its not clear how to do this correctly. If someone has
> an idea let me know. If that fixes this issue then we know it was
> that.

OK so this other issue with scsi_debug indeed deserves its own tracking
so I filed a bug for it but also looked into it and tried to see how to
resolve it.

Someone who works on scsi should revise my work as I haven't touched
scsi before except for the recent block layer work I had done for the
blktrace races, however, my own analysis is that this should not be
fixed in scsi_debug but instead in the users of scsi_debug.

The rationale for that is here:

https://bugzilla.kernel.org/show_bug.cgi?id=212337

The skinny of it is that we have no control over when userspace may muck
with the newly exposed devices as they are being initialized, and
shoe-horning a solution in scsi_debug_init() is prone to always be allow
a race with userspace never letting scsi_debug_init() complete.

So best we can do is just use something like lsof on the tools which
use scsi_debug *prior* to mucking with the devices and / or removal of
the module.

I'll follow up with respective blktests / fstests patches, which I
suspect may address a few false positives.

  Luis


  1   2   3   4   5   6   >