Hi Ulrich,

On 2021/6/15 17:01, Ulrich Windl wrote:
Hi Guys!

Just to keep you informed on the issue:
I was informed that I'm not the only one seeing this problem, and there seems
to be some "negative interference" between BtrFS reorganizing its extents
periodically and OCFS2 making reflink snapshots (a local cron job here) in
current SUSE SLES kernels. It seems to happen almost exactly at 0:00 (midnight).
We encountered the same hang in our local environment; the problem looks like
it is caused by a btrfs-balance job run, but I need to crash the kernel for
further analysis. Hi Ulrich, do you know how to reproduce this hang reliably?
E.g. run the reflink snapshot script and trigger the btrfs-balance job?
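In case it helps, this is roughly what I have in mind as a reproducer (only a
sketch: the mount points, file names and loop counts are made up, and I assume
the cron job uses the reflink tool from ocfs2-tools; cp --reflink=always should
behave the same):

#!/bin/bash
# Sketch of a possible reproducer (paths and counts are placeholders):
# take OCFS2 reflink snapshots in a loop, as the nightly cron job would,
# while the BtrFS filesystem that provides the OCFS2 mount point runs a
# balance job that reorganizes its extents.
OCFS2_MNT=/mnt/ocfs2     # placeholder: the shared OCFS2 filesystem
BTRFS_MNT=/mnt/btrfs     # placeholder: the BtrFS fs hosting the mount point

# snapshot loop, mimicking the local cron job
( for i in $(seq 1 600); do
      reflink "$OCFS2_MNT/data.img" "$OCFS2_MNT/snapshots/data.img.$i"
      sleep 1
  done ) &

# start a full balance at the same time, like the periodic BtrFS maintenance
btrfs balance start --full-balance "$BTRFS_MNT"
wait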


Thanks
Gang


The only thing that BtrFS and OCFS2 have in common here is that BtrFS provides
the mount point for OCFS2.

Regards,
Ulrich

Ulrich Windl wrote on 02.06.2021 at 11:00 in message <60B748A4.E0C:161:60728>:
Gang He <g...@suse.com> wrote on 02.06.2021 at 08:34 in message
<am6pr04mb6488de7d2da906bad73fa3a1cf...@am6pr04mb6488.eurprd04.prod.outlook.com>

Hi Ulrich,

The hang problem looks like one addressed by an existing fix
(90bd070aae6c4fb5d302f9c4b9c88be60c8197ec "ocfs2: fix deadlock between
setattr and dio_end_io_write"), but that is not 100% certain.
If possible, could you report a bug to SUSE, so that we can work on it
further?
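To see whether the installed kernel already carries that fix, something like
the following might help (kernel-default is an assumption for SLES 15 SP2, and
the changelog wording is only a guess, so treat the grep as a heuristic):

# Which kernel is running, and does its RPM changelog mention the patch?
# (The exact changelog entry text on SUSE kernels may differ; grepping for
# the upstream patch subject is only a rough check.)
uname -r
rpm -q --changelog kernel-default | grep -i "deadlock between setattr and dio_end_io_write"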

Hi!

Actually a service request for the issue is open at SUSE. However, I don't
know which L3 engineer is working on it.
I see some "funny" effects, like these:
On one node "ls" hangs, but can be interrupted with ^C; on another node "ls"
also hangs, but cannot be stopped with ^C or ^Z
(most processes cannot even be killed with "kill -9").
"ls" on the directory also hangs, as does an "rm" for a non-existent file.

What I really wonder is what triggered the effect, and more importantly how
to recover from it.
Initially I had suspected the rather full (95%) filesystem, but even at 95%
there are still 24 GB available.
The other suspect was concurrent creation of reflink snapshots while the
file being snapshotted was changing (e.g. allocating a hole in a sparse file).
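For illustration, that suspected scenario would look roughly like this (just a
sketch; the path, file name and sizes are invented, and it is not confirmed
that this actually reproduces the hang):

#!/bin/bash
# Sketch of the suspected race (all paths/sizes are placeholders):
# one process keeps allocating space inside a sparse file while another
# process takes reflink snapshots of the very same file.
F=/srv/ocfs2/sparse.img
truncate -s 10G "$F"          # create a sparse file full of holes

# writer: fill holes by allocating 64M ranges at increasing offsets
( for off_mb in $(seq 0 64 1024); do
      fallocate -o $((off_mb * 1024 * 1024)) -l 64M "$F"
  done ) &

# snapshotter: concurrently take reflink snapshots of the changing file
( for i in $(seq 1 20); do
      reflink "$F" "$F.snap.$i"    # or: cp --reflink=always "$F" "$F.snap.$i"
  done ) &
wait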

Regards,
Ulrich


Thanks
Gang

________________________________________
From: Users <users-boun...@clusterlabs.org> on behalf of Ulrich Windl
<ulrich.wi...@rz.uni-regensburg.de>
Sent: Tuesday, June 1, 2021 15:14
To: users@clusterlabs.org
Subject: [ClusterLabs] Antw: Hanging OCFS2 Filesystem any one else?

Ulrich Windl wrote on 31.05.2021 at 12:11 in message <60B4B65A.A8F:161:60728>:
Hi!

We have an OCFS2 filesystem shared between three cluster nodes (SLES 15 SP2,
Kernel 5.3.18-24.64-default). The filesystem is filled up to about 95%, and
we have an odd effect:
A stat() system call on some of the files hangs indefinitely (state "D").
("ls -l" and "rm" also hang, but I suspect those are calling stat()
internally, too.)
My first suspect is that the effect might be related to the filesystem being
95% full.
The other suspect is that concurrent reflink calls may trigger the effect.

Did anyone else experience something similar?

Hi!

I have some details:
It seems there is a reader/writer deadlock trying to allocate additional
blocks for a file.
The stacktrace looks like this:
Jun 01 07:56:31 h16 kernel:  rwsem_down_write_slowpath+0x251/0x620
Jun 01 07:56:31 h16 kernel:  ? __ocfs2_change_file_space+0xb3/0x620 [ocfs2]
Jun 01 07:56:31 h16 kernel:  __ocfs2_change_file_space+0xb3/0x620 [ocfs2]
Jun 01 07:56:31 h16 kernel:  ocfs2_fallocate+0x82/0xa0 [ocfs2]
Jun 01 07:56:31 h16 kernel:  vfs_fallocate+0x13f/0x2a0
Jun 01 07:56:31 h16 kernel:  ksys_fallocate+0x3c/0x70
Jun 01 07:56:31 h16 kernel:  __x64_sys_fallocate+0x1a/0x20
Jun 01 07:56:31 h16 kernel:  do_syscall_64+0x5b/0x1e0

That is the only writer (on that host), but there are multiple readers like
this:
Jun 01 07:56:31 h16 kernel:  rwsem_down_read_slowpath+0x172/0x300
Jun 01 07:56:31 h16 kernel:  ? dput+0x2c/0x2f0
Jun 01 07:56:31 h16 kernel:  ? lookup_slow+0x27/0x50
Jun 01 07:56:31 h16 kernel:  lookup_slow+0x27/0x50
Jun 01 07:56:31 h16 kernel:  walk_component+0x1c4/0x300
Jun 01 07:56:31 h16 kernel:  ? path_init+0x192/0x320
Jun 01 07:56:31 h16 kernel:  path_lookupat+0x6e/0x210
Jun 01 07:56:31 h16 kernel:  ? __put_lkb+0x45/0xd0 [dlm]
Jun 01 07:56:31 h16 kernel:  filename_lookup+0xb6/0x190
Jun 01 07:56:31 h16 kernel:  ? kmem_cache_alloc+0x3d/0x250
Jun 01 07:56:31 h16 kernel:  ? getname_flags+0x66/0x1d0
Jun 01 07:56:31 h16 kernel:  ? vfs_statx+0x73/0xe0
Jun 01 07:56:31 h16 kernel:  vfs_statx+0x73/0xe0
Jun 01 07:56:31 h16 kernel:  ? fsnotify_grab_connector+0x46/0x80
Jun 01 07:56:31 h16 kernel:  __do_sys_newstat+0x39/0x70
Jun 01 07:56:31 h16 kernel:  ? do_unlinkat+0x92/0x320
Jun 01 07:56:31 h16 kernel:  do_syscall_64+0x5b/0x1e0

So that will match the hanging stat() quite nicely!

However, the PID displayed as holding the writer does not exist in the
system (on that node).
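
(For completeness, this is roughly how such blocked-task stacks can be
collected; the PID below is just a placeholder.)

# Ask the kernel to dump all tasks in uninterruptible ("D") sleep to the log
# (as root, and sysrq must be enabled):
echo w > /proc/sysrq-trigger
dmesg | tail -n 200

# Or look at a single hung process directly (1234 is a placeholder PID):
cat /proc/1234/stack
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'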

Regards,
Ulrich



Regards,
Ulrich










_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
