Re: [ClusterLabs] Some unexpected DLM messages; OCFS2 related? "send_repeat_remove dir"

2021-10-08 Thread Gang He via Users
Hello Ulrich, See my comments inline. On 2021/10/8 16:38, Ulrich Windl wrote: Hi! I just noticed these two messages on two nodes of a 3-node cluster: Oct 08 10:00:14 h18 kernel: dlm: 790F9C237C2A45758135FE4945B7A744: send_repeat_remove dir 119 O09d835 Oct 08 10:00:14

Re: [ClusterLabs] Problem with high load (IO)

2021-09-29 Thread Gang He via Users
On 2021/9/29 16:20, Lentes, Bernd wrote: - On Sep 29, 2021, at 4:37 AM, Gang He g...@suse.com wrote: Hi Lentes, Thanks for your feedback. I have some questions as below: 1) how do you clone these VM images from each ocfs2 node via reflink? do you encounter any problems during this step

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Antw: Hanging OCFS2 Filesystem any one else?

2021-07-13 Thread Gang He
Gang He wrote on 11.07.2021 at 10:55 in message: Hi Ulrich, Thanks for your update. Based on some feedback from upstream, there is a patch (ocfs2: initialize ip_next_orphan) which should fix this problem. I can confirm the patch looks very similar to your problem. I will verify

Re: [ClusterLabs] Antw: [EXT] Re: Antw: Hanging OCFS2 Filesystem any one else?

2021-07-11 Thread Gang He
Hi Ulrich, Thanks for your update. Based on some feedback from upstream, there is a patch (ocfs2: initialize ip_next_orphan) which should fix this problem. I can confirm the patch looks very similar to your problem. I will verify it next week, then let you know the result. Thanks Gang

Re: [ClusterLabs] Antw: [EXT] Re: Antw: Hanging OCFS2 Filesystem any one else?

2021-06-16 Thread Gang He
1 at 11:00 in message <60B748A4.E0C : 161 : 60728>: Gang He wrote on 02.06.2021 at 08:34 in message om> Hi Ulrich, The hang problem looks related to a fix (90bd070aae6c4fb5d302f9c4b9c88be60c8197ec ocfs2: fix deadlock between setattr and dio_end_io_write), but it is not 100%

Re: [ClusterLabs] Antw: Hanging OCFS2 Filesystem any one else?

2021-06-02 Thread Gang He
Hi Ulrich, The hang problem looks related to a fix (90bd070aae6c4fb5d302f9c4b9c88be60c8197ec ocfs2: fix deadlock between setattr and dio_end_io_write), but it is not 100% certain. If possible, could you report a bug to SUSE so we can work on it further. Thanks Gang

Re: [ClusterLabs] OCFS2 fragmentation with snapshots

2021-05-20 Thread Gang He
Hi Ulrich, On 2021/5/18 18:52, Ulrich Windl wrote: Hi! I thought using the reflink feature of OCFS2 would be just a nice way to make crash-consistent VM snapshots while they are running. As it is a bit tricky to find out how much data is shared between snapshots, I started to write an
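The reflink-based snapshot approach discussed in this thread can be sketched with GNU cp's --reflink option (ocfs2-tools also ships a dedicated reflink utility); the paths below are hypothetical stand-ins, using a temp directory in place of an OCFS2 mount point:

```shell
# Sketch of a reflink snapshot; a temp dir stands in for the OCFS2 mount.
# On OCFS2 the copy shares extents with the source file; --reflink=auto
# falls back to a plain copy on filesystems without reflink support.
mnt=$(mktemp -d)
echo "vm disk data" > "$mnt/vm1.img"
cp --reflink=auto "$mnt/vm1.img" "$mnt/vm1.img.snap"
cmp -s "$mnt/vm1.img" "$mnt/vm1.img.snap" && echo "snapshot matches source"
```

On a real OCFS2 volume the snapshot initially consumes almost no extra space, which is also why measuring how much data is shared between snapshots is non-trivial.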

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Q: What is lvmlockd locking?

2021-01-22 Thread Gang He
On 2021/1/22 16:17, Ulrich Windl wrote: Gang He wrote on 22.01.2021 at 09:13 in message <1fd1c07d-d12c-fea9-4b17-90a977fe7...@suse.com>: Hi Ulrich, I reviewed the crm configuration file; some comments below: 1) the lvmlockd resource is used for a shared VG; if you do no

Re: [ClusterLabs] Antw: [EXT] Re: Q: What is lvmlockd locking?

2021-01-22 Thread Gang He
That means this order should be wrong: order ord_lockspace_fs__lvmlockd Mandatory: cln_lockspace_ocfs2 cln_lvmlock Thanks Gang On 2021/1/21 20:08, Ulrich Windl wrote: Gang He wrote on 21.01.2021 at 11:30 in message <59b543ee-0824-6b91-d0af-48f66922b...@suse.com>: Hi Ulri
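The conventional startup ordering behind this remark puts dlm first, then lvmlockd, then shared-VG activation, and the OCFS2 filesystem last. A crm configure sketch (resource names here are illustrative, not taken from Ulrich's actual configuration):

```
# crm configure fragment (sketch; clone names are hypothetical)
order ord_dlm__lvmlockd Mandatory: cln_dlm cln_lvmlockd
order ord_lvmlockd__vg Mandatory: cln_lvmlockd cln_vg_activate
order ord_vg__fs Mandatory: cln_vg_activate cln_lockspace_ocfs2
```

The point of the thread is that an ordering which starts the OCFS2 lockspace before lvmlockd inverts this chain.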

Re: [ClusterLabs] Q: What is lvmlockd locking?

2021-01-21 Thread Gang He
Hi Ulrich, Is the problem reproducible stably? Could you share your pacemaker crm configuration and the related OS/lvm2/resource-agents version information? I suspect the problem was caused by the lvmlockd resource agent script, which did not handle this corner case correctly. Thanks Gang

Re: [ClusterLabs] Q: LVM-activate a shared LV

2020-12-11 Thread Gang He
Hi Ulrich, Which Linux distribution/version do you use? Could you share the whole crm configuration? Here is a crm configuration demo for your reference: primitive dlm ocf:pacemaker:controld \ op start interval=0 timeout=90 \ op stop interval=0 timeout=100 \ op monitor
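A fuller version of such a demo (a sketch only; the VG name, timeouts, and clone names are illustrative assumptions, not part of the original message) typically looks like:

```
# crm configure sketch: dlm + shared-VG activation, cloned on all nodes
primitive dlm ocf:pacemaker:controld \
        op start interval=0 timeout=90 \
        op stop interval=0 timeout=100 \
        op monitor interval=60 timeout=60
primitive vg1 ocf:heartbeat:LVM-activate \
        params vgname=vg1 vg_access_mode=lvmlockd activation_mode=shared \
        op monitor interval=30 timeout=90
group base-group dlm vg1
clone base-clone base-group interleave=true
```

The group keeps dlm and the VG activation ordered on each node, and the clone runs the pair cluster-wide.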

Re: [ClusterLabs] ocfs2 + pacemaker

2020-09-23 Thread Gang He
for how to set up the pacemaker/corosync cluster stack. Then add a dlm resource clone, and add an ocfs2 resource clone. Thanks Gang Best regards, On 23.09.2020 10:11, Gang He wrote: Hello Michael, the ocfs2:o2cb resource is provided by resource-agents on the SLES 11.x series. For the newer SLES series (e.g. 12

Re: [ClusterLabs] ocfs2 + pacemaker

2020-09-23 Thread Gang He
Hello Michael, the ocfs2:o2cb resource is provided by resource-agents on the SLES 11.x series. For the newer SLES series (e.g. 12 or 15), there is no o2cb resource agent in the resource-agents rpm, since this resource is no longer needed. You can refer to the new SUSE HA guide for how to set up ocfs2 on pacemaker
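On the newer SLES releases the stack therefore reduces to a dlm clone plus an ocfs2 Filesystem resource; a sketch (device path and mount point below are hypothetical):

```
# crm configure sketch: OCFS2 under pacemaker without o2cb
primitive dlm ocf:pacemaker:controld \
        op monitor interval=60 timeout=60
primitive ocfs2-fs ocf:heartbeat:Filesystem \
        params device="/dev/sdb1" directory="/mnt/ocfs2" fstype=ocfs2 \
        op monitor interval=20 timeout=40
group base-group dlm ocfs2-fs
clone base-clone base-group interleave=true
```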

Re: [ClusterLabs] SBD on shared disk

2020-02-05 Thread Gang He
Hello Strahil, This kind of configuration is not recommended. Why? The SBD partition needs to be accessed by the cluster nodes reliably and frequently, but the other partition (for the XFS file system) may be under extreme I/O pressure; in that case, the SBD partition IO requests will

Re: [ClusterLabs] DLM in the cluster can tolerate more than one node failure at the same time?

2019-10-23 Thread Gang He
> To: users@clusterlabs.org > Subject: Re: [ClusterLabs] DLM in the cluster can tolerate more than one node > failure at the same time? > > On 22/10/2019 07:15, Gang He wrote: > > Hi List, > > > > I remember that the master node has the full copy for one DLM lock

Re: [ClusterLabs] gfs2: fsid=xxxx:work.3: fatal: filesystem consistency error

2019-10-21 Thread Gang He
Hi Bob, > -----Original Message----- > From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Bob > Peterson > Sent: 21 October 2019 21:02 > To: Cluster Labs - All topics related to open-source clustering welcomed > > Subject: Re: [ClusterLabs] gfs2: fsid=:work.3: fatal: filesystem >

Re: [ClusterLabs] DLM, cLVM, GFS2 and OCFS2 managed by systemd instead of crm ?

2019-10-15 Thread Gang He
Hello Lentes, In a cluster environment we usually need to fence (or dynamically add/delete) nodes; the full stack provided by pacemaker/corosync can complete this automatically and in an integrated way. Thanks Gang From: Users on behalf of Lentes, Bernd Sent:

Re: [ClusterLabs] trace of Filesystem RA does not log

2019-10-14 Thread Gang He
> -----Original Message----- > From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Lentes, > Bernd > Sent: 14 October 2019 20:04 > To: Pacemaker ML > Subject: Re: [ClusterLabs] trace of Filesystem RA does not log > > > >> -----Original Message----- > >> From: Users

Re: [ClusterLabs] trace of Filesystem RA does not log

2019-10-13 Thread Gang He
Hello Lentes, > -----Original Message----- > From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Lentes, > Bernd > Sent: 11 October 2019 22:32 > To: Pacemaker ML > Subject: [ClusterLabs] trace of Filesystem RA does not log > > Hi, > > occasionally the stop of a Filesystem resource

Re: [ClusterLabs] Where to find documentation for cluster MD?

2019-10-10 Thread Gang He
Hello Ulrich, Cluster MD belongs to the SLE HA extension product. The related doc link is here, e.g. https://documentation.suse.com/sle-ha/15-SP1/single-html/SLE-HA-guide/#cha-ha-cluster-md Thanks Gang > -----Original Message----- > From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of

Re: [ClusterLabs] File System does not do a recovery on fail over

2019-06-12 Thread Gang He
> > Regards, > > > Indivar Nair > > > > On Tue, Jun 11, 2019 at 10:18 AM Gang He wrote: >> >> Hi Indivar, >> >> See my comments inline. >> >> >>> On 6/11/2019 at 12:10 pm, in message >> , Indivar &

Re: [ClusterLabs] File System does not do a recovery on fail over

2019-06-10 Thread Gang He
Hi Indivar, See my comments inline. >>> On 6/11/2019 at 12:10 pm, in message , Indivar Nair wrote: > Hello ..., > > I have an Active-Passive cluster with two nodes hosting an XFS > Filesystem over a CLVM Volume. > > If a failover happens, the volume is mounted on the other node without > a

[ClusterLabs] Where do we download the source code of libdlm

2019-05-27 Thread Gang He
Hello Guys, As the subject says, I want to download the source code of libdlm, to see its git log changes. libdlm is used to build dlm_controld, dlm_stonith, dlm_tool, etc. Thanks Gang

Re: [ClusterLabs] Q: repeating message " cmirrord[17741]: [yEa32lLX] Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN"

2018-11-12 Thread Gang He
Hello Ulrich, Can you reproduce this issue stably? If yes, please share your steps. We also encountered a similar issue: it looks like cmirrord cannot join the CPG (a corosync-related concept); the resource then times out and the node is fenced. Thanks Gang >>> On 2018/11/12 at 15:46, in

Re: [ClusterLabs] VirtualDomain as resources and OCFS2

2018-09-11 Thread Gang He
Hello Lentes, >>> On 2018/9/11 at 20:50, in message <584818902.7776848.1536670226935.javamail.zim...@helmholtz-muenchen.de>, "Lentes, Bernd" wrote: > > - On Sep 11, 2018, at 4:29 AM, Gang He g...@suse.com wrote: > >> Hello Lentes, >> >>

Re: [ClusterLabs] VirtualDomain as resources and OCFS2

2018-09-10 Thread Gang He
Hello Lentes, It does not look like an OCFS2 or pacemaker problem; it looks more like a virtualization problem. From the OCFS2/LVM2 perspective, if you use one LV per VirtualDomain, the guest VMs on that VirtualDomain cannot occupy the other LVs' storage space. If you use OCFS2 on one LV for

Re: [ClusterLabs] Fwd: Re: [Cluster-devel] [PATCH] dlm: prompt the user SCTP is experimental

2018-04-16 Thread Gang He
ks like a bug? Since the tests hung, we had to reboot that node manually. Thanks Gang >>> > On Thu, Apr 12, 2018 at 09:31:49PM -0600, Gang He wrote: >> During this period, could we allow the tcp protocol to work (rather than return > an error directly) under a two-ring cluste

Re: [ClusterLabs] snapshoting of running VirtualDomain resources - OCFS2 ?

2018-03-18 Thread Gang He
Hi Lentes, >>> > > - On Mar 15, 2018, at 3:47 AM, Gang He g...@suse.com wrote: >> Just one comments, you have to make sure the VM file integrity before > calling >> reflink. >> > > Hi Gang, > > how could i achieve that ? sync ? The d

Re: [ClusterLabs] snapshoting of running VirtualDomain resources - OCFS2 ?

2018-03-14 Thread Gang He
Hello Lentes, >>> > Hi, > > I have a 2-node cluster with my services (web, db) running in VirtualDomain > resources. > I have a SAN with cLVM; each guest lies in a dedicated logical volume with > an ext3 fs. > > Currently I'm thinking about snapshotting the guests to make a backup in the >

Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-12 Thread Gang He
ormally under that situation. Yan/Bin, do you have any comments about a two-node cluster? Which configuration settings will affect corosync quorum/DLM? Thanks Gang > > > -- > Regards, > Muhammad Sharfuddin > > On 3/12/2018 10:59 AM, Gang He wrote: >> Hello Muhammad,

Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-12 Thread Gang He
, please watch where the mount.ocfs2 process is hung via the "cat /proc/xxx/stack" command. If the back trace stops in the DLM kernel module, the root cause is usually a cluster configuration problem. Thanks Gang >>> > On 3/12/2018 7:32 AM, Gang He wrote: >> Hello Muhammad
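The suggested check can be scripted roughly as follows (a sketch; it assumes the hung process is literally named mount.ocfs2, and reading /proc/<pid>/stack requires root):

```shell
# Dump the kernel stack of a hung mount.ocfs2 process, if one exists.
pid=$(pgrep -x mount.ocfs2 | head -n1)
if [ -n "$pid" ]; then
    # Frames inside the dlm_* functions usually point at a cluster
    # configuration problem rather than an ocfs2 bug.
    cat "/proc/$pid/stack"
else
    echo "no mount.ocfs2 process found"
fi
```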

Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-11 Thread Gang He
Hello Muhammad, I think this problem is not in ocfs2; the cause looks like lost cluster quorum. For a two-node cluster (unlike a three-node cluster), if one node is offline, quorum is lost by default. So you should configure the two-node related quorum settings according to the
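The two-node quorum setting referred to here is, in current corosync versions, the votequorum two_node option (a configuration sketch, not Muhammad's actual file):

```
# /etc/corosync/corosync.conf (fragment)
quorum {
    provider: corosync_votequorum
    two_node: 1    # implies wait_for_all: 1; lets a lone surviving node keep quorum
}
```

With two_node set, the surviving node retains quorum after its peer goes offline, so DLM and the ocfs2 mount can proceed.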

Re: [ClusterLabs] [Cluster-devel] DLM connection channel switch take too long time (> 5mins)

2018-03-08 Thread Gang He
Hello Digimer, >>> > On 2018-03-08 12:10 PM, David Teigland wrote: >>> I use active rrp_mode in corosync.conf and reboot the cluster to make the > configuration effective. >>> But the roughly 5-minute hang in the new_lockspace() function is still there. >> >> The last time I tested connection

Re: [ClusterLabs] [Cluster-devel] DLM connection channel switch take too long time (> 5mins)

2018-03-08 Thread Gang He
Hello David, If the SCTP implementation did not fix this problem, is there any workaround for a two-ring cluster? Could we use the TCP protocol in DLM under a two-ring cluster to bypass the connection channel switch issue? Thanks Gang >>> >> I use active rrp_mode in corosync.conf and reboot the

Re: [ClusterLabs] Antw: Re: [Cluster-devel] DLM connection channel switch take too long time (> 5mins)

2018-03-08 Thread Gang He
pleted immediately. Yes, the behavior does not follow the O_NONBLOCK flag; 5 minutes is far too long. Thanks Gang Regards, Ulrich >>> "Gang He" <g...@suse.com> wrote on 08.03.2018 at 10:48 in message <5aa1776502f9000ad...@prv-mh.provo.novell.com>: >

Re: [ClusterLabs] [Cluster-devel] DLM connection channel switch take too long time (> 5mins)

2018-03-08 Thread Gang He
/sle_ha/book_sleha/data/sec_ha_installation_terms.html > > Those fixes I saw in 4.14.* > >> On 8 Mar 2018, at 09:12, Gang He <g...@suse.com> wrote: >> >> Hi Feldhost, >> >> >>>>> >>> Hello Gang He, >>> >>

Re: [ClusterLabs] [Cluster-devel] DLM connection channel switch take too long time (> 5mins)

2018-03-08 Thread Gang He
Hi Feldhost, >>> > Hello Gang He, > > which type of corosync rrp_mode do you use? Passive or Active? clvm1:/etc/corosync # cat corosync.conf | grep rrp_mode rrp_mode: passive Did you try testing both? No, only this mode. Also, what kernel version do you use? I s

[ClusterLabs] DLM connection channel switch take too long time (> 5mins)

2018-03-07 Thread Gang He
Hello list and David Teigland, I hit a problem on a two-ring cluster; it can be reproduced with the steps below. 1) Set up a two-ring cluster with two nodes, e.g. clvm1 (nodeid 172204569) addr_list eth0 10.67.162.25 eth1 192.168.152.240 clvm2 (nodeid 172204570) addr_list eth0
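A two-ring corosync setup like the one described is configured roughly as follows (a sketch; the network addresses are inferred from the node addresses in the report, assuming /24 networks):

```
# corosync.conf fragment (sketch) for a two-ring (RRP) cluster
totem {
    rrp_mode: passive          # the reporter later also tested active
    interface {
        ringnumber: 0
        bindnetaddr: 10.67.162.0      # eth0 network
    }
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.152.0    # eth1 network
    }
}
```

The reported problem is that when ring 0 fails, DLM's connection channel switch to ring 1 can take more than 5 minutes.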

Re: [ClusterLabs] Corosync OCFS2

2017-11-08 Thread Gang He
Hello David, If you want to use OCFS2 with the Pacemaker stack, you do not need ocfs2_controld in the new version, and you do not need to configure an o2cb resource either. I can give you a crm demo in an SLE12SP3 environment (actually there has not been any change since SLE12SP1): crm(live/tb-node1)configure# show

Re: [ClusterLabs] PSA Ubuntu 16.04 and OCSF2 corruption

2017-04-11 Thread Gang He
Hello Kyle, From the ocfs2 code, ocfs2 supports the fstrim operation in a cluster environment. If file system corruption was encountered, it should be a bug. By the way, it is recommended to run the fstrim operation from only one node in the cluster. Thanks Gang >>> > Hello, > >

Re: [ClusterLabs] HA/Clusterlabs Summit 2017 Proposal

2017-02-06 Thread Gang He
Hi Kristoffer, The meeting looks very attractive. Just one question: does the meeting have any website archiving the previous topics/presentations/materials? Thanks Gang >>> > Hi everyone! > > The last time we had an HA summit was in 2015, and the intention then > was to have SUSE