Re: [Cluster-devel] [PATCH v3 05/49] mm: shrinker: add infrastructure for dynamically allocating shrinker

2023-07-28 Thread Simon Horman
On Thu, Jul 27, 2023 at 04:04:18PM +0800, Qi Zheng wrote:
> Currently, the shrinker instances can be divided into the following three
> types:
> 
> a) global shrinker instance statically defined in the kernel, such as
>workingset_shadow_shrinker.
> 
> b) global shrinker instance statically defined in the kernel modules, such
>as mmu_shrinker in x86.
> 
> c) shrinker instance embedded in other structures.
> 
> For case a, the memory of shrinker instance is never freed. For case b,
> the memory of shrinker instance will be freed after synchronize_rcu() when
> the module is unloaded. For case c, the memory of shrinker instance will
> be freed along with the structure it is embedded in.
> 
> In preparation for implementing lockless slab shrink, we need to
> dynamically allocate those shrinker instances in case c, then the memory
> can be dynamically freed alone by calling kfree_rcu().
> 
> So this commit adds the following new APIs for dynamically allocating
> shrinker, and add a private_data field to struct shrinker to record and
> get the original embedded structure.
> 
> 1. shrinker_alloc()
> 
> Used to allocate shrinker instance itself and related memory, it will
> return a pointer to the shrinker instance on success and NULL on failure.
> 
> 2. shrinker_register()
> 
> Used to register the shrinker instance, which is same as the current
> register_shrinker_prepared().
> 
> 3. shrinker_free()
> 
> Used to unregister (if needed) and free the shrinker instance.
> 
> In order to simplify shrinker-related APIs and make shrinker more
> independent of other kernel mechanisms, subsequent submissions will use
> the above API to convert all shrinkers (including case a and b) to
> dynamically allocated, and then remove all existing APIs.
> 
> This will also have another advantage mentioned by Dave Chinner:
> 
> ```
> The other advantage of this is that it will break all the existing
> out of tree code and third party modules using the old API and will
> no longer work with a kernel using lockless slab shrinkers. They
> need to break (both at the source and binary levels) to stop bad
> things from happening due to using uncoverted shrinkers in the new

nit: uncoverted -> unconverted

> setup.
> ```
> 
> Signed-off-by: Qi Zheng 

...

> diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
> index f1becfd45853..506257585408 100644
> --- a/mm/shrinker_debug.c
> +++ b/mm/shrinker_debug.c
> @@ -191,6 +191,20 @@ int shrinker_debugfs_add(struct shrinker *shrinker)
>   return 0;
>  }
>  
> +int shrinker_debugfs_name_alloc(struct shrinker *shrinker, const char *fmt,
> + va_list ap)
> +{
> + shrinker->name = kvasprintf_const(GFP_KERNEL, fmt, ap);
> +
> + return shrinker->name ? 0 : -ENOMEM;
> +}
> +
> +void shrinker_debugfs_name_free(struct shrinker *shrinker)
> +{
> + kfree_const(shrinker->name);
> + shrinker->name = NULL;
> +}
> +

These functions have no prototype in this file,
perhaps internal.h should be included?

>  int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
>  {
>   struct dentry *entry;

...



Re: [Cluster-devel] [syzbot] [gfs2?] kernel panic: hung_task: blocked tasks (2)

2023-07-28 Thread Bob Peterson

On 7/28/23 3:20 AM, David Howells wrote:

syzbot  wrote:


Fixes: 9c8ad7a2ff0b ("uapi, x86: Fix the syscall numbering of the mount API syscalls 
[ver #2]")


This would seem unlikely to be the culprit.  It just changes the numbering on
the fsconfig-related syscalls.

Running the test program on v6.5-rc3, however, I end up with the test process
stuck in the D state:

INFO: task repro-17687f1aa:5551 blocked for more than 120 seconds.
   Not tainted 6.5.0-rc3-build3+ #1448
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:repro-17687f1aa state:D stack:0 pid:5551  ppid:5516   flags:0x4002
Call Trace:
  
  __schedule+0x4a7/0x4f1
  schedule+0x66/0xa1
  schedule_timeout+0x9d/0xd7
  ? __next_timer_interrupt+0xf6/0xf6
  gfs2_gl_hash_clear+0xa0/0xdc
  ? sugov_irq_work+0x15/0x15
  gfs2_put_super+0x19f/0x1d3
  generic_shutdown_super+0x78/0x187
  kill_block_super+0x1c/0x32
  deactivate_locked_super+0x2f/0x61
  cleanup_mnt+0xab/0xcc
  task_work_run+0x6b/0x80
  exit_to_user_mode_prepare+0x76/0xfd
  syscall_exit_to_user_mode+0x14/0x31
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f89aac31dab
RSP: 002b:7fff43d9b878 EFLAGS: 0206 ORIG_RAX: 00a6
RAX:  RBX: 7fff43d9cad8 RCX: 7f89aac31dab
RDX:  RSI: 000a RDI: 7fff43d9b920
RBP: 7fff43d9c960 R08:  R09: 0073
R10:  R11: 0206 R12: 
R13: 7fff43d9cae8 R14: 00417e18 R15: 7f89aad51000
  

David


Hi David,

This indicates gfs2 is having trouble resolving and freeing all its 
glocks, which usually means a reference counting problem or ail (active 
items list) problem during unmount.


If gfs2_gl_hash_clear gets stuck for a long period of time it is 
supposed to dump the remaining list of glocks that still have not been 
resolved. I think it takes 10 minutes or so. Can you post the console 
messages that follow? That will help us figure out what's happening. Thanks.


Regards,

Bob Peterson



Re: [Cluster-devel] (subset) [PATCH v6 0/7] fs: implement multigrain timestamps

2023-07-28 Thread Christian Brauner
On Tue, 25 Jul 2023 10:58:13 -0400, Jeff Layton wrote:
> The VFS always uses coarse-grained timestamps when updating the
> ctime and mtime after a change. This has the benefit of allowing
> filesystems to optimize away a lot metadata updates, down to around 1
> per jiffy, even when a file is under heavy writes.
> 
> Unfortunately, this coarseness has always been an issue when we're
> exporting via NFSv3, which relies on timestamps to validate caches. A
> lot of changes can happen in a jiffy, so timestamps aren't sufficient to
> help the client decide to invalidate the cache.
> 
> [...]

Survives xfstests (tmpfs, ext4, overlayfs) and the fs portions of LTP.
Let's keep our eyes open for any potential issues. Past suspects has
been IMA interacting with overlayfs. We'll see. Picked everything minus
the tmpfs-writepage patch that was contentious.

---

Applied to the vfs.ctime branch of the vfs/vfs.git tree.
Patches in the vfs.ctime branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.ctime

[1/7] fs: pass the request_mask to generic_fillattr
  https://git.kernel.org/vfs/vfs/c/0a6ab6dc6958
[2/7] fs: add infrastructure for multigrain timestamps
  https://git.kernel.org/vfs/vfs/c/d242b98ac3e9
[4/7] tmpfs: add support for multigrain timestamps
  https://git.kernel.org/vfs/vfs/c/1f31c58cf032
[5/7] xfs: switch to multigrain timestamps
  https://git.kernel.org/vfs/vfs/c/859dd91017dd
[6/7] ext4: switch to multigrain timestamps
  https://git.kernel.org/vfs/vfs/c/093af249eab4
[7/7] btrfs: convert to multigrain timestamps
  https://git.kernel.org/vfs/vfs/c/b90a04d1c30c



Re: [Cluster-devel] [PATCH v3 04/49] mm: shrinker: remove redundant shrinker_rwsem in debugfs operations

2023-07-28 Thread Simon Horman
On Thu, Jul 27, 2023 at 04:04:17PM +0800, Qi Zheng wrote:
> The debugfs_remove_recursive() will wait for debugfs_file_put() to return,
> so the shrinker will not be freed when doing debugfs operations (such as
> shrinker_debugfs_count_show() and shrinker_debugfs_scan_write()), so there
> is no need to hold shrinker_rwsem during debugfs operations.
> 
> Signed-off-by: Qi Zheng 
> Reviewed-by: Muchun Song 
> ---
>  mm/shrinker_debug.c | 14 --
>  1 file changed, 14 deletions(-)
> 
> diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
> index 3ab53fad8876..f1becfd45853 100644
> --- a/mm/shrinker_debug.c
> +++ b/mm/shrinker_debug.c
> @@ -55,11 +55,6 @@ static int shrinker_debugfs_count_show(struct seq_file *m, 
> void *v)
>   if (!count_per_node)
>   return -ENOMEM;
>  
> - ret = down_read_killable(&shrinker_rwsem);
> - if (ret) {
> - kfree(count_per_node);
> - return ret;
> - }
>   rcu_read_lock();

Hi Qi Zheng,

As can be seen in the next hunk, this function returns 'ret'.
However, with this change 'ret' is uninitialised unless
signal_pending() returns non-zero in the while loop below.

This is flagged in a clan-16 W=1 build.

 mm/shrinker_debug.c:87:11: warning: variable 'ret' is used uninitialized 
whenever 'do' loop exits because its condition is false 
[-Wsometimes-uninitialized]
 } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
  ^~~~
 mm/shrinker_debug.c:92:9: note: uninitialized use occurs here
 return ret;
^~~
 mm/shrinker_debug.c:87:11: note: remove the condition if it is always true
 } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
  ^~~~
  1
 mm/shrinker_debug.c:77:7: warning: variable 'ret' is used uninitialized 
whenever 'if' condition is true [-Wsometimes-uninitialized]
 if (!memcg_aware) {
 ^~~~
 mm/shrinker_debug.c:92:9: note: uninitialized use occurs here
 return ret;
^~~
 mm/shrinker_debug.c:77:3: note: remove the 'if' if its condition is always 
false
 if (!memcg_aware) {
 ^~~
 mm/shrinker_debug.c:52:9: note: initialize the variable 'ret' to silence this 
warning
 int ret, nid;
^
 = 0

>  
>   memcg_aware = shrinker->flags & SHRINKER_MEMCG_AWARE;
> @@ -92,7 +87,6 @@ static int shrinker_debugfs_count_show(struct seq_file *m, 
> void *v)
>   } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
>  
>   rcu_read_unlock();
> - up_read(&shrinker_rwsem);
>  
>   kfree(count_per_node);
>   return ret;

...



Re: [Cluster-devel] [syzbot] [gfs2?] kernel panic: hung_task: blocked tasks (2)

2023-07-28 Thread syzbot
syzbot has bisected this issue to:

commit 9c8ad7a2ff0bfe58f019ec0abc1fb965114dde7d
Author: David Howells 
Date:   Thu May 16 11:52:27 2019 +

uapi, x86: Fix the syscall numbering of the mount API syscalls [ver #2]

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=169b475ea8
start commit:   fdf0eaf11452 Linux 6.5-rc2
git tree:   upstream
final oops: https://syzkaller.appspot.com/x/report.txt?x=159b475ea8
console output: https://syzkaller.appspot.com/x/log.txt?x=119b475ea8
kernel config:  https://syzkaller.appspot.com/x/.config?x=27e33fd2346a54b
dashboard link: https://syzkaller.appspot.com/bug?extid=607aa822c60b2e75b269
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=11322fb6a8
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=17687f1aa8

Reported-by: syzbot+607aa822c60b2e75b...@syzkaller.appspotmail.com
Fixes: 9c8ad7a2ff0b ("uapi, x86: Fix the syscall numbering of the mount API 
syscalls [ver #2]")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection



Re: [Cluster-devel] [PATCH v3 04/49] mm: shrinker: remove redundant shrinker_rwsem in debugfs operations

2023-07-28 Thread Qi Zheng

Hi Simon,

On 2023/7/28 16:13, Simon Horman wrote:

On Thu, Jul 27, 2023 at 04:04:17PM +0800, Qi Zheng wrote:

The debugfs_remove_recursive() will wait for debugfs_file_put() to return,
so the shrinker will not be freed when doing debugfs operations (such as
shrinker_debugfs_count_show() and shrinker_debugfs_scan_write()), so there
is no need to hold shrinker_rwsem during debugfs operations.

Signed-off-by: Qi Zheng 
Reviewed-by: Muchun Song 
---
  mm/shrinker_debug.c | 14 --
  1 file changed, 14 deletions(-)

diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
index 3ab53fad8876..f1becfd45853 100644
--- a/mm/shrinker_debug.c
+++ b/mm/shrinker_debug.c
@@ -55,11 +55,6 @@ static int shrinker_debugfs_count_show(struct seq_file *m, 
void *v)
if (!count_per_node)
return -ENOMEM;
  
-	ret = down_read_killable(&shrinker_rwsem);

-   if (ret) {
-   kfree(count_per_node);
-   return ret;
-   }
rcu_read_lock();


Hi Qi Zheng,

As can be seen in the next hunk, this function returns 'ret'.
However, with this change 'ret' is uninitialised unless
signal_pending() returns non-zero in the while loop below.


Thanks for your feedback, the 'ret' should be initialized to 0,
will fix it.

Thanks,
Qi



This is flagged in a clan-16 W=1 build.

  mm/shrinker_debug.c:87:11: warning: variable 'ret' is used uninitialized 
whenever 'do' loop exits because its condition is false 
[-Wsometimes-uninitialized]
  } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
   ^~~~
  mm/shrinker_debug.c:92:9: note: uninitialized use occurs here
  return ret;
 ^~~
  mm/shrinker_debug.c:87:11: note: remove the condition if it is always true
  } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
   ^~~~
   1
  mm/shrinker_debug.c:77:7: warning: variable 'ret' is used uninitialized 
whenever 'if' condition is true [-Wsometimes-uninitialized]
  if (!memcg_aware) {
  ^~~~
  mm/shrinker_debug.c:92:9: note: uninitialized use occurs here
  return ret;
 ^~~
  mm/shrinker_debug.c:77:3: note: remove the 'if' if its condition is always 
false
  if (!memcg_aware) {
  ^~~
  mm/shrinker_debug.c:52:9: note: initialize the variable 'ret' to silence this 
warning
  int ret, nid;
 ^
  = 0

  
  	memcg_aware = shrinker->flags & SHRINKER_MEMCG_AWARE;

@@ -92,7 +87,6 @@ static int shrinker_debugfs_count_show(struct seq_file *m, 
void *v)
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
  
  	rcu_read_unlock();

-   up_read(&shrinker_rwsem);
  
  	kfree(count_per_node);

return ret;


...