Re: [ClusterLabs] Antw: [EXT] Correctly stop pacemaker on 2-node cluster with SBD and failed devices?

2021-06-16 Thread Roger Zhou


On 6/16/21 3:03 PM, Andrei Borzenkov wrote:





We thought that access to storage was restored, but one step was
missing so devices appeared empty.

At this point I tried to restart the pacemaker. But as soon as I
stopped pacemaker SBD rebooted nodes ‑ which is logical, as quorum was
now lost.

How to cleanly stop pacemaker in this case and keep nodes up?


Unconfigure sbd devices, I guess.



Do you have *practical* suggestions on how to do it online in a
running pacemaker cluster? Can you explain how it is going to help
given that lack of sbd device was not the problem in the first place?


I would translate this issue as "how to gracefully shut down sbd and deregister
it from pacemaker for the whole cluster". There seems to be no way to do that
except `systemctl stop corosync`.


With that in mind, to keep sbd from committing suicide, the somewhat tricky
steps below might help. I'm not sure they fit your situation as a whole, though.


crm cluster run "systemctl stop pacemaker"
crm cluster run "systemctl stop corosync"
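Before starting the stack again, it may be worth checking that the sbd device
header is readable again; a minimal sketch, with a hypothetical device path:

# Hypothetical device path; use the SBD_DEVICE configured in /etc/sysconfig/sbd.
SBD_DEV=/dev/disk/by-id/scsi-EXAMPLE
sbd -d "$SBD_DEV" dump     # show the on-disk header and the configured timeouts
sbd -d "$SBD_DEV" list     # show the node slots and any pending messages

# Once the device looks sane again, reverse the shutdown:
crm cluster run "systemctl start corosync"
crm cluster run "systemctl start pacemaker"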

BR,
Roger

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Antw: Hanging OCFS2 Filesystem any one else?

2021-06-16 Thread Gang He

Hi Ulrich,

On 2021/6/15 17:01, Ulrich Windl wrote:

Hi Guys!

Just to keep you informed on the issue:
I was informed that I'm not the only one seeing this problem, and there seems
to be some "negative interference" between BtrFS reorganizing its extents
periodically and OCFS2 making reflink snapshots (a local cron job here) in
current SUSE SLES kernels. It seems that happens almost exactly at 0:00 o'clock.
We encountered the same hang in our local environment; the problem looks
like it is caused by a btrfs-balance job run, but I need to crash the
kernel for further analysis.
Hi Ulrich, do you know how to reproduce this hang reliably? E.g. run the
reflink snapshot script and trigger the btrfs-balance job?
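In case it helps with the analysis, here is a small sketch of how the hung
state could be captured, assuming magic SysRq is enabled and (for the crash
dump) kdump is already configured:

# Dump the stacks of all blocked (D-state) tasks to the kernel log.
echo w > /proc/sysrq-trigger
dmesg | tail -n 200

# Optionally, with kdump configured, force a crash dump for offline analysis.
# WARNING: this panics and reboots the node.
echo c > /proc/sysrq-trigger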



Thanks
Gang



The only thing that BtrFS and OCFS2 have in common here is that BtrFS provides
the mount point for OCFS2.

Regards,
Ulrich


Ulrich Windl schrieb am 02.06.2021 um 11:00 in Nachricht <60B748A4.E0C : 161 : 60728>:

Gang He schrieb am 02.06.2021 um 08:34 in Nachricht:

Hi Ulrich,

The hang problem looks like it is fixed by commit
90bd070aae6c4fb5d302f9c4b9c88be60c8197ec (ocfs2: fix deadlock between setattr
and dio_end_io_write), but it is not 100% sure.
If possible, could you help report a bug to SUSE, so we can work on it further?


Hi!

Actually a service request for the issue is open at SUSE. However I don't
know which L3 engineer is working on it.
I have some "funny" effects, like these:
On one node "ls" hangs, but can be interrupted with ^C; on another node "ls"
also hangs, but cannot be stopped with ^C or ^Z.
(Most processes cannot even be killed with "kill -9".)
"ls" on the directory also hangs, just as an "rm" for a non-existent file does.

What I really wonder is what triggered the effect, and more importantly, how
to recover from it.
Initially I had suspected the rather full (95%) filesystem, but that still
leaves 24GB available.
The other suspect was concurrent creation of reflink snapshots while the file
being snapshotted changed (e.g. allocating a hole in a sparse file).

Regards,
Ulrich



Thanks
Gang


From: Users  on behalf of Ulrich Windl

Sent: Tuesday, June 1, 2021 15:14
To: users@clusterlabs.org
Subject: [ClusterLabs] Antw: Hanging OCFS2 Filesystem any one else?


Ulrich Windl schrieb am 31.05.2021 um 12:11 in Nachricht <60B4B65A.A8F : 161 : 60728>:

Hi!

We have an OCFS2 filesystem shared between three cluster nodes (SLES 15 SP2,
Kernel 5.3.18‑24.64‑default). The filesystem is filled up to about 95%, and
we have an odd effect:
A stat() system call to some of the files hangs indefinitely (state "D").
("ls ‑l" and "rm" also hang, but I suspect those are calling stat()
internally, too.)
My only suspect is that the effect might be related to the 95% being used.
The other suspect is that concurrent reflink calls may trigger the effect.


Did anyone else experience something similar?


Hi!

I have some details:
It seems there is a reader/writer deadlock trying to allocate additional
blocks for a file.
The stacktrace looks like this:
Jun 01 07:56:31 h16 kernel:  rwsem_down_write_slowpath+0x251/0x620
Jun 01 07:56:31 h16 kernel:  ? __ocfs2_change_file_space+0xb3/0x620 [ocfs2]
Jun 01 07:56:31 h16 kernel:  __ocfs2_change_file_space+0xb3/0x620 [ocfs2]
Jun 01 07:56:31 h16 kernel:  ocfs2_fallocate+0x82/0xa0 [ocfs2]
Jun 01 07:56:31 h16 kernel:  vfs_fallocate+0x13f/0x2a0
Jun 01 07:56:31 h16 kernel:  ksys_fallocate+0x3c/0x70
Jun 01 07:56:31 h16 kernel:  __x64_sys_fallocate+0x1a/0x20
Jun 01 07:56:31 h16 kernel:  do_syscall_64+0x5b/0x1e0

That is the only writer (on that host), but there are multiple readers like this:
Jun 01 07:56:31 h16 kernel:  rwsem_down_read_slowpath+0x172/0x300
Jun 01 07:56:31 h16 kernel:  ? dput+0x2c/0x2f0
Jun 01 07:56:31 h16 kernel:  ? lookup_slow+0x27/0x50
Jun 01 07:56:31 h16 kernel:  lookup_slow+0x27/0x50
Jun 01 07:56:31 h16 kernel:  walk_component+0x1c4/0x300
Jun 01 07:56:31 h16 kernel:  ? path_init+0x192/0x320
Jun 01 07:56:31 h16 kernel:  path_lookupat+0x6e/0x210
Jun 01 07:56:31 h16 kernel:  ? __put_lkb+0x45/0xd0 [dlm]
Jun 01 07:56:31 h16 kernel:  filename_lookup+0xb6/0x190
Jun 01 07:56:31 h16 kernel:  ? kmem_cache_alloc+0x3d/0x250
Jun 01 07:56:31 h16 kernel:  ? getname_flags+0x66/0x1d0
Jun 01 07:56:31 h16 kernel:  ? vfs_statx+0x73/0xe0
Jun 01 07:56:31 h16 kernel:  vfs_statx+0x73/0xe0
Jun 01 07:56:31 h16 kernel:  ? fsnotify_grab_connector+0x46/0x80
Jun 01 07:56:31 h16 kernel:  __do_sys_newstat+0x39/0x70
Jun 01 07:56:31 h16 kernel:  ? do_unlinkat+0x92/0x320
Jun 01 07:56:31 h16 kernel:  do_syscall_64+0x5b/0x1e0

So that will match the hanging stat() quite nicely!

However the PID displayed as holding the writer does not exist in the system (on that node).

Regards,
Ulrich




Regards,
Ulrich









___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/



Re: [ClusterLabs] Antw: [EXT] Correctly stop pacemaker on 2-node cluster with SBD and failed devices?

2021-06-16 Thread Klaus Wenninger
On Wed, Jun 16, 2021 at 11:26 AM Klaus Wenninger wrote:

>
>
> On Wed, Jun 16, 2021 at 10:47 AM Roger Zhou  wrote:
>
>>
>> On 6/16/21 3:03 PM, Andrei Borzenkov wrote:
>>
>> >
>> >>>
>> >>> We thought that access to storage was restored, but one step was
>> >>> missing so devices appeared empty.
>> >>>
>> >>> At this point I tried to restart the pacemaker. But as soon as I
>> >>> stopped pacemaker SBD rebooted nodes ‑ which is logical, as quorum was
>> >>> now lost.
>> >>>
>> >>> How to cleanly stop pacemaker in this case and keep nodes up?
>> >>
>> >> Unconfigure sbd devices, I guess.
>> >>
>> >
>> > Do you have *practical* suggestions on how to do it online in a
>> > running pacemaker cluster? Can you explain how it is going to help
>> > given that lack of sbd device was not the problem in the first place?
>>
>> I would translate this issue as "how to gracefully shut down sbd and
>> deregister it from pacemaker for the whole cluster". There seems to be no
>> way to do that except `systemctl stop corosync`.
>>
>> With that in mind, to keep sbd from committing suicide, the somewhat tricky
>> steps below might help. I'm not sure they fit your situation as a whole,
>> though.
>>
>> crm cluster run "systemctl stop pacemaker"
>> crm cluster run "systemctl stop corosync"
>>
> I guess this won't be helpful in this situation.
> As I've already tried to explain before: shutting down pacemaker on one of
> the nodes - if the sbd device can't be reached - should already be enough
> for the other one to suicide.
>
> One thing coming to my mind - not less ugly than the other suggestions here,
> I'm afraid - is to dummy-register at the cpg protocol right after stopping
> pacemaker. If after that you want to bring down corosync & sbd as well, it
> should be possible to do that quickly enough - as pcs otherwise does with
> 3+ node clusters.
>

Something else coming to my mind that might be more
helpful and less ugly - I have to think it over a bit, though:

With the new startup/shutdown syncing, pacemaker
should stay connected to the cpg protocol until a final
handshake with sbd on shutdown.
If we could bring all nodes to a state right before that
handshake with e.g. pcs, we would have lots of time for that.
And the final step, incl. the corosync/sbd shutdown, is quick
enough that it can happen on all nodes within the
watchdog timeout.

Klaus

>
>> BR,
>> Roger
>>
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>>
>
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Correctly stop pacemaker on 2-node cluster with SBD and failed devices?

2021-06-16 Thread Klaus Wenninger
On Wed, Jun 16, 2021 at 10:47 AM Roger Zhou  wrote:

>
> On 6/16/21 3:03 PM, Andrei Borzenkov wrote:
>
> >
> >>>
> >>> We thought that access to storage was restored, but one step was
> >>> missing so devices appeared empty.
> >>>
> >>> At this point I tried to restart the pacemaker. But as soon as I
> >>> stopped pacemaker SBD rebooted nodes ‑ which is logical, as quorum was
> >>> now lost.
> >>>
> >>> How to cleanly stop pacemaker in this case and keep nodes up?
> >>
> >> Unconfigure sbd devices, I guess.
> >>
> >
> > Do you have *practical* suggestions on how to do it online in a
> > running pacemaker cluster? Can you explain how it is going to help
> > given that lack of sbd device was not the problem in the first place?
>
> I would translate this issue as "how to gracefully shut down sbd and
> deregister it from pacemaker for the whole cluster". There seems to be no
> way to do that except `systemctl stop corosync`.
>
> With that in mind, to keep sbd from committing suicide, the somewhat tricky
> steps below might help. I'm not sure they fit your situation as a whole,
> though.
>
> crm cluster run "systemctl stop pacemaker"
> crm cluster run "systemctl stop corosync"
>
I guess this won't be helpful in this situation.
As I've already tried to explain before: shutting down
pacemaker on one of the nodes - if the sbd device can't
be reached - should already be enough for the other
one to suicide.

One thing coming to my mind - not less ugly than the other
suggestions here, I'm afraid - is to dummy-register at the
cpg protocol right after stopping pacemaker. If after that
you want to bring down corosync & sbd as well, it should be
possible to do that quickly enough - as pcs otherwise does
with 3+ node clusters.
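For reference, the pcs behaviour mentioned here for 3+ node clusters is, as
far as I know, the batched two-phase stop, e.g.:

# To the best of my knowledge, pcs stops pacemaker on all nodes first and only
# then stops corosync (and with it sbd) everywhere, keeping the window short.
pcs cluster stop --all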

>
> BR,
> Roger
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: Re: Antw: [EXT] Correctly stop pacemaker on 2-node cluster with SBD and failed devices?

2021-06-16 Thread Ulrich Windl
>>> Andrei Borzenkov schrieb am 16.06.2021 um 09:03 in Nachricht:
> On Wed, Jun 16, 2021 at 9:05 AM Ulrich Windl
>  wrote:
>>
>> >>> Andrei Borzenkov schrieb am 15.06.2021 um 17:20 in Nachricht:
>> > We had the following situation
>> >
>> > 2‑node cluster with single device (just single external storage
>> > available). Storage failed. So SBD lost access to the device. Cluster
>> > was still up, both nodes were running.
>>
>> Shouldn't sbd fence then (after some delay)?
>>
> 
> No. That is what pacemaker integration is for.
> 
>> >
>> > We thought that access to storage was restored, but one step was
>> > missing so devices appeared empty.
>> >
>> > At this point I tried to restart the pacemaker. But as soon as I
>> > stopped pacemaker SBD rebooted nodes ‑ which is logical, as quorum was
>> > now lost.
>> >
>> > How to cleanly stop pacemaker in this case and keep nodes up?
>>
>> Unconfigure sbd devices, I guess.
>>
> 
> Do you have *practical* suggestions on how to do it online in a
> running pacemaker cluster? Can you explain how it is going to help
> given that lack of sbd device was not the problem in the first place?

My guess was that sbd timed out waiting for your failed devices. As you didn't
provide more details, it's mostly guesswork.
You can only change the SBD device while the node is down, but you can do it
node by node.
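A rough sketch of such a node-by-node change, assuming crmsh and the usual
/etc/sysconfig/sbd layout on SLES (the node name and device path below are
hypothetical):

# On the node to be changed:
crm node standby node1            # move resources away first
crm cluster stop                  # stops pacemaker, corosync and sbd locally
# Adjust SBD_DEVICE in /etc/sysconfig/sbd, then (once per new device, not
# once per node) initialize the sbd header:
sbd -d /dev/disk/by-id/scsi-EXAMPLE create
crm cluster start
crm node online node1
# Repeat on the next node.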

Regards,
Ulrich

> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: [EXT] Re: Issue with Pacemaker config related to VIP and an LSB resource

2021-06-16 Thread Ulrich Windl
>>> Michael Romero schrieb am 16.06.2021 um 08:02 in Nachricht:
> “But in general I guess the idea of rechecking resource after failure
> timeout once (similar to initial probe) sounds interesting. It could be

I think the standard behaviour is "recover", i.e. try to restart the resource
on the same node anyway.


> more robust in that resource agent could check whether resource start is
> possible now at all and prevent unsuccessful attempt to migrate resource
> back to original node.”

A monitor running into a timeout does not indicate a resource failure, but an
RA failure.
So most likely the next monitor will time out anyway, and you just lose time.

> 
> Yes! This is exactly the behavior I would like to produce. Maybe if this is
> not possible with LSB, is it possible with an OCF resource ? Also I’ve
> considered having it simply not retry but I would prefer this other
> configuration if it is at all possible.

If you want reliability, use OCF RAs; the rest wasn't designed for HA
clusters.
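As an illustration of the behaviour discussed above, a minimal crmsh sketch
(the resource name, agent and values are placeholders, not the poster's actual
configuration): leaving failure-timeout at its default of 0 keeps failures on
record until an operator clears them.

# Placeholder resource; failure-timeout=0 means the failcount is never
# cleared automatically, so recovery needs an explicit cleanup.
crm configure primitive my_service ocf:heartbeat:Dummy \
    op monitor interval=30s timeout=60s \
    meta migration-threshold=3 failure-timeout=0

# After the administrator (or a script) has fixed the underlying problem:
crm resource cleanup my_service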

Regards,
Ulrich

> 
> On Tue, Jun 15, 2021 at 10:54 PM Andrei Borzenkov wrote:
> 
>> On 16.06.2021 01:49, Michael Romero wrote:
>> >
>> > At which point an administrator or an automated script could intervene
>>
>> If you are going to always use manual intervention outside of pacemaker,
>> just leave failure timeout on default 0 so cluster will never clear
>> failure count automatically on a node.
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>>
>> ClusterLabs home: https://www.clusterlabs.org/ 
>>
> -- 
> Michael Romero
> 
> Lead Infrastructure Engineer
> 
> Engineering | Convoso
> 562-338-9868
> mrom...@convoso.com 
> www.convoso.com 



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Correctly stop pacemaker on 2-node cluster with SBD and failed devices?

2021-06-16 Thread Andrei Borzenkov
On Wed, Jun 16, 2021 at 9:05 AM Ulrich Windl wrote:
>
> >>> Andrei Borzenkov schrieb am 15.06.2021 um 17:20 in Nachricht:
> > We had the following situation
> >
> > 2‑node cluster with single device (just single external storage
> > available). Storage failed. So SBD lost access to the device. Cluster
> > was still up, both nodes were running.
>
> Shouldn't sbd fence then (after some delay)?
>

No. That is what pacemaker integration is for.

> >
> > We thought that access to storage was restored, but one step was
> > missing so devices appeared empty.
> >
> > At this point I tried to restart the pacemaker. But as soon as I
> > stopped pacemaker SBD rebooted nodes ‑ which is logical, as quorum was
> > now lost.
> >
> > How to cleanly stop pacemaker in this case and keep nodes up?
>
> Unconfigure sbd devices, I guess.
>

Do you have *practical* suggestions on how to do it online in a
running pacemaker cluster? Can you explain how it is going to help
given that lack of sbd device was not the problem in the first place?
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: Re: Antw: [EXT] Re: Antw: Hanging OCFS2 Filesystem any one else?

2021-06-16 Thread Ulrich Windl
>>> Gang He  schrieb am 16.06.2021 um 07:48 in Nachricht
<7b2df128-4701-704c-869f-e08b0e37e...@suse.com>:
> Hi Ulrich,
> 
> On 2021/6/15 17:01, Ulrich Windl wrote:
>> Hi Guys!
>> 
>> Just to keep you informed on the issue:
>> I was informed that I'm not the only one seeing this problem, and there seems
>> to be some "negative interference" between BtrFS reorganizing its extents
>> periodically and OCFS2 making reflink snapshots (a local cron job here) in
>> current SUSE SLES kernels. It seems that happens almost exactly at 0:00 o'clock.
> We encountered the same hang in our local environment; the problem looks
> like it is caused by a btrfs-balance job run, but I need to crash the
> kernel for further analysis.
> Hi Ulrich, do you know how to reproduce this hang reliably? E.g. run the
> reflink snapshot script and trigger the btrfs-balance job?

Hi!

My guess is (as a reflink snapshot is very fast, or at least "quite" fast,
while a BtrFS balance takes some time) that you'll have to start a BtrFS
balance and, while it's running, create one or more reflink snapshots.
In our original case the BtrFS balance was run at midnight, and the reflink
snapshots were created hourly. The midnight snapshots did not complete.
The really interesting thing is that the only common point of OCFS2 and BtrFS
is the mount point provided by BtrFS.
My guess is that the issue won't happen if the OCFS2 is not mounted on the
BtrFS being balanced; otherwise it would be a disastrous bug (I'm not saying
that the bug in its present form is less severe).
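A minimal reproduction sketch along those lines (paths and file names are
hypothetical; it assumes the OCFS2 volume is mounted below the BtrFS mount
point, as in the setup described here, and uses cp --reflink for the
snapshots):

# Hypothetical layout: BtrFS mounted at /data, OCFS2 mounted at /data/ocfs2.
# Start a balance on the BtrFS that provides the mount point ...
btrfs balance start /data &

# ... and, while it is running, create reflink snapshots on the OCFS2 volume.
for i in $(seq 1 10); do
    cp --reflink=always /data/ocfs2/image.raw /data/ocfs2/image.raw.snap.$i
    sleep 5
done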

Regards,
Ulrich


> 
> 
> Thanks
> Gang
> 
>> 
>> The only thing that BtrFS and OCFS2 have in common here is that BtrFS provides
>> the mount point for OCFS2.
>> 
>> Regards,
>> Ulrich
>> 
> Ulrich Windl schrieb am 02.06.2021 um 11:00 in Nachricht <60B748A4.E0C : 161 : 60728>:
>> Gang He schrieb am 02.06.2021 um 08:34 in Nachricht:
>>>
 Hi Ulrich,

 The hang problem looks like it is fixed by commit
 90bd070aae6c4fb5d302f9c4b9c88be60c8197ec (ocfs2: fix deadlock between
 setattr and dio_end_io_write), but it is not 100% sure.
 If possible, could you help report a bug to SUSE, so we can work on it
 further?
>>>
>>> Hi!
>>>
>>> Actually a service request for the issue is open at SUSE. However I don't
>>> know which L3 engineer is working on it.
>>> I have some "funny" effects, like these:
>>> On one node "ls" hangs, but can be interrupted with ^C; on another node "ls"
>>> also hangs, but cannot be stopped with ^C or ^Z.
>>> (Most processes cannot even be killed with "kill -9".)
>>> "ls" on the directory also hangs, just as an "rm" for a non-existent file does.
>>>
>>> What I really wonder is what triggered the effect, and more importantly, how
>>> to recover from it.
>>> Initially I had suspected the rather full (95%) filesystem, but that still
>>> leaves 24GB available.
>>> The other suspect was concurrent creation of reflink snapshots while the
>>> file being snapshotted changed (e.g. allocating a hole in a sparse file).
>>>
>>> Regards,
>>> Ulrich
>>>

 Thanks
 Gang

 
 From: Users  on behalf of Ulrich Windl
 
 Sent: Tuesday, June 1, 2021 15:14
 To: users@clusterlabs.org 
 Subject: [ClusterLabs] Antw: Hanging OCFS2 Filesystem any one else?

>>> Ulrich Windl schrieb am 31.05.2021 um 12:11 in Nachricht <60B4B65A.A8F : 161 : 60728>:
> Hi!
>
> We have an OCFS2 filesystem shared between three cluster nodes (SLES 15 SP2,
> Kernel 5.3.18‑24.64‑default). The filesystem is filled up to about 95%, and
> we have an odd effect:
> A stat() system call to some of the files hangs indefinitely (state "D").
> ("ls ‑l" and "rm" also hang, but I suspect those are calling stat()
> internally, too.)
> My only suspect is that the effect might be related to the 95% being used.
> The other suspect is that concurrent reflink calls may trigger the effect.
>
> Did anyone else experience something similar?

 Hi!

 I have some details:
 It seems there is a reader/writer deadlock trying to allocate additional
 blocks for a file.
 The stacktrace looks like this:
 Jun 01 07:56:31 h16 kernel:  rwsem_down_write_slowpath+0x251/0x620
 Jun 01 07:56:31 h16 kernel:  ? __ocfs2_change_file_space+0xb3/0x620 [ocfs2]
 Jun 01 07:56:31 h16 kernel:  __ocfs2_change_file_space+0xb3/0x620 [ocfs2]
 Jun 01 07:56:31 h16 kernel:  ocfs2_fallocate+0x82/0xa0 [ocfs2]
 Jun 01 07:56:31 h16 kernel:  vfs_fallocate+0x13f/0x2a0
 Jun 01 07:56:31 h16 kernel:  ksys_fallocate+0x3c/0x70
 Jun 01 07:56:31 h16 kernel:  __x64_sys_fallocate+0x1a/0x20
 Jun 01 07:56:31 h16 kernel:  do_syscall_64+0x5b/0x1e0

 That is the only writer (on that host), but there are multiple readers like
 this:
 Jun 01 07:56:31 h16 kernel:  rwsem_down_read_slowpath+0x172/0x300
 Jun 01 07:56:31