Re: [ClusterLabs] Where to Find pcs and pcsd for openSUSE Leap 42.3

2017-11-06 Thread Eric Ren

Hi,

On 11/07/2017 05:35 AM, Eric Robinson wrote:


I installed corosync 2.4.3 and pacemaker 1.1.17 from the openSUSE Leap 
42.3 repos, but I can’t find pcs or pcsd. Does anybody know where to 
download them from?




openSUSE/SUSE uses the CLI tool "crmsh" and the web UI "hawk" to manage HA clusters.
Please see the "quick start" doc [1] and the other HA docs at [2].

[1] 
https://www.suse.com/documentation/sle-ha-12/install-quick/data/install-quick.html

[2] https://www.suse.com/documentation/sle-ha-12/index.html
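
If you are used to pcs, the rough crmsh equivalents look like this (a minimal sketch; the resource name and IP address are just examples):

"""
crm status                          # roughly "pcs status"
crm configure show                  # roughly "pcs config"
crm configure primitive vip IPaddr2 \
    params ip=192.168.100.10 \
    op monitor interval=30s         # roughly "pcs resource create ..."
"""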

Eric


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] DRBD, dual-primary, Live-Migration - Cluster FS necessary ?

2017-09-07 Thread Eric Ren

Hi,

On 09/07/2017 01:30 AM, Lentes, Bernd wrote:

Hi,

I just want to be sure. I created a DRBD partition in a dual-primary setup. I 
have a VirtualDomain (KVM) resource which resides on the raw DRBD device (without 
a FS), and I can live-migrate.
Are there situations in this setup where a cluster FS is necessary/recommended?
I'd like to avoid it, since it adds complexity.

With a cluster FS (OCFS2, GFS2), you can use qcow2 files as images for the 
VMs, and then you get the features that the qcow2 format provides, e.g. the 
image grows as data is added instead of the whole image space being allocated 
at creation.
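
For example (a minimal sketch; the mount point and size are hypothetical):

"""
# /mnt/ocfs2 is a hypothetical OCFS2 mount point shared by the nodes
qemu-img create -f qcow2 /mnt/ocfs2/vm1.qcow2 50G
qemu-img info /mnt/ocfs2/vm1.qcow2     # "disk size" stays small until the guest writes data
"""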


Eric

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] convert already clustered LVM to cLVM

2017-07-02 Thread Eric Ren
Hi Cristiano,

On Wed, Jun 28, 2017 at 08:46:55AM +, Cristiano Coltro wrote: 
> Hi all,
> 
> A brief History
> 
> A customer of mine has set up a SLES 11 SP3 and SLES 11 SP4 HA environment and 
> he has some LVM volume groups as resources.
> 
> When he tried to extend an LV because he needed more space, he wasn't able to, 
> due to a corruption of the metadata that doesn't heal even after recovery with 
> vgcfgrestore.
> 

It's dangerous to have a non-clustered VG on shared disks. The
corruption does _not_ surprise me if the VG is on shared disks,
i.e. the disks can be attached to more than one node in the cluster.

> The customer has not configured the VG with the "c" (clustered) attribute:
> 
>   VG      #PV #LV #SN Attr   VSize   VFree
>   vg001   3   0 wz--n- 279.22g  88.22g
>   vgWP4   2  10   0 wz--n-  49.97g   5.97g
> 
> which prevents extending the LV, and he gets:
> 
>   Setting global/locking_type to 3
>   global/wait_for_locks not found in config: defaulting to 1
>   Cluster locking selected.
>   connect() failed on local socket: No such file or directory
>   Internal cluster locking initialisation failed.

Please check whether the clvmd daemon is running properly; it is a must when
locking_type is 3.

>   Setting global/fallback_to_local_locking to 1
>   locking/fallback_to_local_locking not found in config: defaulting to 1
>   WARNING: Falling back to local file-based locking
> 
> 
> This is because he has locking_type set to 3 (correctly), but the VG must also 
> have the "c" attribute to work properly.
> 
> Coming to my question:
> 
> 
>   *   Is there a way to convert the VGs to clustered VGs without pain? I know that 
> technically "vgchange -cy" does the trick, but I'm not sure what impact it can 
> have.

No bad impact, if a clustered VG is what you need. "vgchange -cy" will give the VG
cluster-wide protection, which is needed when the VG is shared among
nodes. It won't break anything; on the contrary, keeping a non-clustered VG on
shared disks is dangerous, because the metadata on the shared disks may be
corrupted unexpectedly by more than one node.
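
For example, a rough sketch, assuming clvmd is already running on all nodes and using the VG name from the output above:

"""
ps -C clvmd -o pid,cmd       # first make sure the clvmd daemon is up on every node
vgchange -cy vgWP4           # set the clustered ("c") attribute on the shared VG
vgs vgWP4                    # Attr should now read "wz--nc" instead of "wz--n-"
"""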

Thanks,
Eric

> 
> Thanks in advance
> Cristiano
> 
> Cristiano Coltro
> Premium Support Engineer
> 
> mail: cristiano.col...@microfocus.com
> mobile: +39 335 1435589
> phone +39 02 36634936
> __
> 





___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: PSA Ubuntu 16.04 and OCSF2 corruption

2017-04-11 Thread Eric Ren

Hi,

On 04/11/2017 02:07 PM, Ulrich Windl wrote:

Kyle O'Donnell wrote on 10.04.2017 at 22:33 in message

<1870523950.17456.1491856403263.javamail.zim...@0b10.mx>:

Hello,

Just opened what I think is a bug with the Ubuntu util-linux package:

https://bugs.launchpad.net/ubuntu/+source/util-linux/+bug/1681410

TL;DR

The 'fstrim' command is run weekly on all filesystems.  If you're using
ocfs2 and the same filesystem is mounted on multiple Ubuntu 16.04 servers,
this fstrim is run at the same time on the same device from all servers.  I'm
positive this is what's causing my filesystem corruption issues, which occur
a minute or two after fstrim is scheduled to run.

Without knowing the details of fstrim, I think that with proper locking in place, 
filesystem corruption should never occur. However, if one node in OCFS2 allocates 
the block that another node trims (the non-locking case), you will lose that 
newly allocated block. I wonder whether fstrim can be used on cluster 
filesystems at all.

It looks like ocfs2 supports fstrim, according to this commit:

commit e80de36d8dbff216a384e9204e54d59deeadf344
Author: Tao Ma 
Date:   Mon May 23 10:36:43 2011 +0800

ocfs2: Add ocfs2_trim_fs for SSD trim support.
Add ocfs2_trim_fs to support trimming freed clusters in the
volume. A range will be given and all the freed clusters greater
than minlen will be discarded to the block layer.
Signed-off-by: Tao Ma 
Signed-off-by: Joel Becker 

If something goes wrong with it, I'd suggest opening a bug, or asking the ocfs2 mailing list 
for help.
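
If the simultaneous weekly trim is indeed the trigger, a possible workaround sketch (assuming the weekly run comes from systemd's fstrim.timer; on some setups it is a cron job instead) is to disable the scheduled run and trim manually from a single node at a quiet time:

"""
systemctl disable --now fstrim.timer   # on all nodes, or at least on all but one
fstrim -v /mnt/ocfs2                   # hypothetical mount point; run from one node only
"""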
Eric


Regards,
Ulrich



-Kyle





___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] SLES 12 SP2 HAE

2017-04-05 Thread Eric Ren

Hi Cristiano,

Thanks for your report. I suggest you file a bug for this at 
https://bugzilla.suse.com
so that it can be routed to the right person quickly.

On 04/05/2017 03:26 PM, Cristiano Coltro wrote:

Hi all,
I noticed some behaviour on SLES 12 SP2 HAE:


   1.  Cluster name
If you go to yast2 > cluster and change the cluster name, the change is not reflected 
in /etc/corosync/corosync.conf, which still shows the default name "hacluster".

   2.  Expected votes
With a 2-node cluster, the quorum section of corosync.conf is configured like this:

    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1

which is correct.

If for some reason you redo "ha-cluster-init" on the same first node WITHOUT 
overwriting the corosync.conf and then perform an "ha-cluster-join" on the second 
node, the corosync.conf changes like this:

    provider: corosync_votequorum
    expected_votes: 3
    two_node: 0

so it seems that the cluster does not check the number of nodes but 
simply adds +1 for every join you perform IF you DON'T overwrite the original 
corosync.conf.

Are 1 & 2 expected behaviour? Any experience on that?
It's probably a bug that needs fixing, I think, so don't mix YaST and the HA bootstrap 
scripts to set up your cluster at the moment :)
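
For now, if the counters are already wrong, a manual correction sketch for the quorum section of /etc/corosync/corosync.conf would look roughly like this (copy the file to both nodes and restart/reload corosync afterwards):

"""
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}
"""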


Regards,
Eric


Thanks,
Cristiano


Cristiano Coltro
Premium Support Engineer

mail: cristiano.col...@microfocus.com
mobile: +39 335 1435589
phone +39 02 36634936
__








___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Cannot clone clvmd resource

2017-03-05 Thread Eric Ren

Hi,

On 03/03/2017 03:27 PM, Ulrich Windl wrote:

Eric Ren <z...@suse.com> wrote on 03.03.2017 at 04:12 in message

<c004860e-376e-4bc3-1d35-d60428b41...@suse.com>:
[...]

A bugfix for this issue has been released in lvm2 2.02.120-70.1. And, since
SLE12-SP2 and openSUSE Leap 42.2, we recommend using
'/usr/lib/ocf/resource.d/heartbeat/clvm' instead, which comes from the
'resource-agents' package.

[...]
It seems some release notes were not clear enough: I found out that we are also 
using ocf:lvm2:clvmd here (SLES11 SP4). When trying to diff, I found this:
# diff -u /usr/lib/ocf/resource.d/{lvm2,heartbeat}/clvmd |less
diff: /usr/lib/ocf/resource.d/heartbeat/clvmd: No such file or directory

It's 'clvm', not 'clvmd', under /usr/lib/ocf/resource.d/heartbeat/.

# rpm -qf /usr/lib/ocf/resource.d/heartbeat /usr/lib/ocf/resource.d/lvm2/
resource-agents-3.9.5-49.2
lvm2-clvm-2.02.98-0.42.3

I'm confused!

Did I answer your question?
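
For reference, a minimal sketch of a configuration using the resource-agents RA (primitive and group names are just examples):

"""
primitive dlm ocf:pacemaker:controld \
    op monitor interval=60 timeout=60
primitive clvm ocf:heartbeat:clvm \
    op monitor interval=60 timeout=90
group base-group dlm clvm
clone base-clone base-group \
    meta interleave=true
"""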

Thanks,
Eric


Regards,
Ulrich








___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Antw: Re: ocf:lvm2:VolumeGroup Probe Issue

2017-02-16 Thread Eric Ren

Hi,


On 02/16/2017 08:16 PM, Ulrich Windl wrote:

[snip]

Any other advice? Is ocf:heartbeat:LVM or ocf:lvm2:VolumeGroup the
more popular RA for managing LVM VG's? Any comments from other users
on experiences using either (good, bad)?

I have a little bit of experience with "ocf:heartbeat:LVM". Each volume group
needs an independent resource agent for it. Something like:

You mean "an independent resource instance (primitive)"?

Yes, I meant it for "ocf:heartbeat:LVM". And I cannot find
"ocf:lvm2:VolumeGroup" on SLES.

One RA should be good for all VGs ;-)

Oh, really? If so, why is "params volgrpname" required?

You are still mixing class (RA) and object (primitive)! IMHO RA is the script 
(class) like ocf:heartbeat:LVM, while the primitive (object) is a configuration 
based on the RA. So you'd have multiple objects (primitives) of one class (RA).


Aha, my bad. It's like the concepts of "class and object" in an 
object-oriented language.

Thanks for pointing it out:)

Eric


Regards,
Ulrich


"""
crm(live)configure# ra info ocf:heartbeat:LVM
...
Parameters (*: required, []: default):

volgrpname* (string): Volume group name
  The name of volume group.
"""

And I failed to show "ocf:lvm2:VolumeGroup":
"""
crm(live)configure# ra info ocf:lvm2:
ocf:lvm2:clvmd ocf:lvm2:cmirrord
"""

Am I missing something?

Thanks for your input:)
Eric

"""
primitive vg1 LVM \
   params volgrpname=vg1 exclusive=true \
   op start timeout=100 interval=0 \
   op stop timeout=40 interval=0 \
   op monitor interval=60 timeout=240
"""

And, "dlm" and "clvm" resource agents are grouped and then cloned like:
"""
group base-group dlm clvm
clone base-clone base-group \
   meta target-role=Started interleave=true
"""

Then, put an "order" constraint like:
"""
order base_first_vg1 inf: base-clone vg1
"""

Does "ocf:lvm2:VolumeGroup" can follow the same pattern?

Thanks,
Eric

Both appear to achieve the
same function, just a bit differently.


Thanks,

Marc









___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: ocf:lvm2:VolumeGroup Probe Issue

2017-02-16 Thread Eric Ren

Hi Ulrich!

On 02/16/2017 03:31 PM, Ulrich Windl wrote:

Eric Ren <z...@suse.com> wrote on 16.02.2017 at 04:50 in message

<ae65ce9d-ef8b-1e13-faf1-217c8a5c0...@suse.com>:

Hi,

On 11/09/2016 12:37 AM, Marc Smith wrote:

Hi,

First, I realize ocf:lvm2:VolumeGroup comes from the LVM2 package and
not resource-agents, but I'm hoping someone on this list is familiar
with this RA and can provide some insight.

In my cluster configuration, I'm using ocf:lvm2:VolumeGroup to manage
my LVM VG's, and I'm using the cluster to manage DLM and CLVM. I have
my constraints in place and everything seems to be working mostly,
except I'm hitting a glitch with ocf:lvm2:VolumeGroup and the initial
probe operation.

On startup, a probe operation (monitor) is issued for all of the
resources, but ocf:lvm2:VolumeGroup is returning OCF_ERR_GENERIC in
VolumeGroup_status() (via VolumeGroup_monitor()) since clvmd hasn't
started yet... this line in VolumeGroup_status() is the trouble:

VGOUT=`vgdisplay -v $OCF_RESKEY_volgrpname 2>&1` || exit $OCF_ERR_GENERIC

When clvmd is not running, 'vgdisplay -v name' will always return
something like this:

--snip--
connect() failed on local socket: No such file or directory
Internal cluster locking initialisation failed.
WARNING: Falling back to local file-based locking.
Volume Groups with the clustered attribute will be inaccessible.
  VG name on command line not found in list of VGs: biggie
Volume group "biggie" not found
Cannot process volume group biggie
--snip--

And exits with a status of 5. So, my question is, do I patch the RA?
Or is there some cluster constraint I can add so a probe/monitor
operation isn't performed for the VolumeGroup resource until CLVM has
been started?

Any other advice? Is ocf:heartbeat:LVM or ocf:lvm2:VolumeGroup the
more popular RA for managing LVM VG's? Any comments from other users
on experiences using either (good, bad)?

I have a little bit of experience with "ocf:heartbeat:LVM". Each volume group
needs an independent resource agent for it. Something like:

You mean "an independent resource instance (primitive)"?

Yes, I meant it for "ocf:heartbeat:LVM". And I cannot find 
"ocf:lvm2:VolumeGroup" on SLES.

One RA should be good for all VGs ;-)

Oh, really? If so, why is "params volgrpname" required?

"""
crm(live)configure# ra info ocf:heartbeat:LVM
...
Parameters (*: required, []: default):

volgrpname* (string): Volume group name
The name of volume group.
"""

And I failed to show "ocf:lvm2:VolumeGroup":
"""
crm(live)configure# ra info ocf:lvm2:
ocf:lvm2:clvmd ocf:lvm2:cmirrord
"""

Am I missing something?

Thanks for your input:)
Eric



"""
primitive vg1 LVM \
  params volgrpname=vg1 exclusive=true \
  op start timeout=100 interval=0 \
  op stop timeout=40 interval=0 \
  op monitor interval=60 timeout=240
"""

And, "dlm" and "clvm" resource agents are grouped and then cloned like:
"""
group base-group dlm clvm
clone base-clone base-group \
  meta target-role=Started interleave=true
"""

Then, put an "order" constraint like:
"""
order base_first_vg1 inf: base-clone vg1
"""

Does "ocf:lvm2:VolumeGroup" can follow the same pattern?

Thanks,
Eric

Both appear to achieve the
same function, just a bit differently.


Thanks,

Marc





___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] ocf:lvm2:VolumeGroup Probe Issue

2017-02-15 Thread Eric Ren

Hi,

On 11/09/2016 12:37 AM, Marc Smith wrote:

Hi,

First, I realize ocf:lvm2:VolumeGroup comes from the LVM2 package and
not resource-agents, but I'm hoping someone on this list is familiar
with this RA and can provide some insight.

In my cluster configuration, I'm using ocf:lvm2:VolumeGroup to manage
my LVM VG's, and I'm using the cluster to manage DLM and CLVM. I have
my constraints in place and everything seems to be working mostly,
except I'm hitting a glitch with ocf:lvm2:VolumeGroup and the initial
probe operation.

On startup, a probe operation (monitor) is issued for all of the
resources, but ocf:lvm2:VolumeGroup is returning OCF_ERR_GENERIC in
VolumeGroup_status() (via VolumeGroup_monitor()) since clvmd hasn't
started yet... this line in VolumeGroup_status() is the trouble:

VGOUT=`vgdisplay -v $OCF_RESKEY_volgrpname 2>&1` || exit $OCF_ERR_GENERIC

When clvmd is not running, 'vgdisplay -v name' will always return
something like this:

--snip--
   connect() failed on local socket: No such file or directory
   Internal cluster locking initialisation failed.
   WARNING: Falling back to local file-based locking.
   Volume Groups with the clustered attribute will be inaccessible.
 VG name on command line not found in list of VGs: biggie
   Volume group "biggie" not found
   Cannot process volume group biggie
--snip--

And exits with a status of 5. So, my question is, do I patch the RA?
Or is there some cluster constraint I can add so a probe/monitor
operation isn't performed for the VolumeGroup resource until CLVM has
been started?

Any other advice? Is ocf:heartbeat:LVM or ocf:lvm2:VolumeGroup the
more popular RA for managing LVM VG's? Any comments from other users
on experiences using either (good, bad)?

I have a little bit of experience with "ocf:heartbeat:LVM". Each volume group needs an
independent resource agent for it. Something like:

"""
primitive vg1 LVM \
params volgrpname=vg1 exclusive=true \
op start timeout=100 interval=0 \
op stop timeout=40 interval=0 \
op monitor interval=60 timeout=240
"""

And, "dlm" and "clvm" resource agents are grouped and then cloned like:
"""
group base-group dlm clvm
clone base-clone base-group \
meta target-role=Started interleave=true
"""

Then, put an "order" constraint like:
"""
order base_first_vg1 inf: base-clone vg1
"""

Does "ocf:lvm2:VolumeGroup" can follow the same pattern?

Thanks,
Eric

Both appear to achieve the
same function, just a bit differently.


Thanks,

Marc





___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Antw: Re: OCFS2 on cLVM with node waiting for fencing timeout

2016-10-13 Thread Eric Ren

Hi,

On 10/13/2016 04:36 PM, Ulrich Windl wrote:

Eric Ren <z...@suse.com> wrote on 13.10.2016 at 09:48 in message

<73f764d0-75e7-122f-ff4e-d0b27dbdd...@suse.com>:
[...]

When assuming node h01 still lived when communication failed, wouldn't
quorum prevent h01 from doing anything with DLM and OCFS2 anyway?

Not sure I understand you correctly. By default, losing quorum will make
DLM stop its service.

That's what I'm talking about: If 1 of 3 nodes is rebooting (or the cluster is 
split-brain 1:2), the single node CANNOT continue due to lack of quorum, while 
the remaining two nodes can. Is it still necessary to wait for completion of 
stonith?
Quorum and fencing completion are different conditions that are checked before 
the service is provided again. FYI:


https://github.com/renzhengeek/libdlm/blob/master/dlm_controld/cpg.c#L603



See `man dlm_controld`:
```
--enable_quorum_lockspace 0|1
 enable/disable quorum requirement for lockspace operations
```

Does not exist in SLES11 SP4...
Well, I think it's better to keep the default behavior. Otherwise, it's dangerous when 
a split-brain happens.


Eric


Ulrich







___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: OCFS2 on cLVM with node waiting for fencing timeout

2016-10-13 Thread Eric Ren

Hi,

On 10/11/2016 02:18 PM, Ulrich Windl wrote:

emmanuel segura wrote on 10.10.2016 at 16:49 in message

Re: [ClusterLabs] OCFS2 on cLVM with node waiting for fencing timeout

2016-10-13 Thread Eric Ren

Hi,

On 10/10/2016 10:46 PM, Ulrich Windl wrote:

Hi!

I observed an interesting thing: in a three-node cluster (SLES11 SP4) with cLVM 
and OCFS2 on top, one node was fenced as the OCFS2 filesystem was somehow busy 
on unmount. We have (mainly for paranoid reasons) an excessively long fencing 
timeout for SBD: 180 seconds.

While one node was actually reset immediately (the cluster was still waiting for the fencing 
to "complete" through the timeout), the other nodes seemed to freeze the filesystem. 
Thus I observed a read delay of > 140 seconds on one node; the other was also close to 140 
seconds.
OCFS2 and cLVM both depend on DLM. The DLM daemon will notify them to stop service 
(which means any cluster locking request will be blocked) during the fencing process.

So I'm wondering why it takes so long to finish the fencing process?

Eric


This was not expected (by me) for a cluster filesystem.

I wonder: Is that expected behavior?

Regards,
Ulrich







___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] node is always offline

2016-08-16 Thread Eric Ren

Hi,

On 08/16/2016 01:29 PM, wsl...@126.com wrote:

thank you for your reply.
The network is fine, the firewall has been stopped, and SELinux has been disabled.
/var/log/pacemaker.log as follows:
To be honest, this piece of the logs alone is useless. The whole log file should be 
attached, covering messages from the first time things went bad until now ;-)

Also, lots of info is missing, like the OS distribution, package versions, cluster 
configuration, etc.

A "hb_report" (man hb_report) would be great if the "cluster-glue" package is 
installed.
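
For example, a sketch (adjust the start time and destination path as needed):

"""
hb_report -f "2016-08-16 13:00" /tmp/hb_report_node0
"""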

Eric

Aug 16 13:11:12 [8980] node0   crmd: info: peer_update_callback:
node0 is now (null)
Aug 16 13:11:12 [8980] node0   crmd: info: crm_get_peer:Node 1 
has uuid 1
Aug 16 13:11:12 [8980] node0   crmd: info: crm_update_peer_proc:
cluster_connect_cpg: Node node0[1] - corosync-cpg is now online
Aug 16 13:11:12 [8980] node0   crmd: info: peer_update_callback:
Client node0/peer now has status [online] (DC=)
Aug 16 13:11:12 [8980] node0   crmd: info: init_cs_connection_once: 
Connection to 'corosync': established
Aug 16 13:11:12 [8980] node0   crmd:   notice: cluster_connect_quorum:  
Quorum acquired
Aug 16 13:11:12 [8980] node0   crmd: info: do_ha_control:   
Connected to the cluster
Aug 16 13:11:12 [8980] node0   crmd: info: lrmd_ipc_connect:
Connecting to lrmd
Aug 16 13:11:12 [8977] node0   lrmd: info: crm_client_new:  
Connecting 0x16025a0 for uid=189 gid=189 pid=8980 
id=e317bf62-fb55-46a1-af17-9284234917b8
Aug 16 13:11:12 [8975] node0cib: info: cib_process_request: 
Completed cib_modify operation for section nodes: OK (rc=0, 
origin=local/crmd/3, version=0.1.0)
Aug 16 13:11:12 [8980] node0   crmd: info: do_lrm_control:  LRM 
connection established
Aug 16 13:11:12 [8980] node0   crmd: info: do_started:  Delaying start, 
no membership data (0010)
Aug 16 13:11:12 [8980] node0   crmd: info: do_started:  Delaying start, 
no membership data (0010)
Aug 16 13:11:12 [8975] node0cib: info: cib_process_request: 
Completed cib_query operation for section crm_config: OK (rc=0, 
origin=local/crmd/4, version=0.1.0)
Aug 16 13:11:12 [8980] node0   crmd: info: pcmk_quorum_notification:
Membership 8571476093872377196: quorum retained (1)
Aug 16 13:11:12 [8980] node0   crmd:   notice: crm_update_peer_state:   
pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Aug 16 13:11:12 [8980] node0   crmd: info: peer_update_callback:
node0 is now member (was (null))
Aug 16 13:11:12 [8980] node0   crmd:   notice: crm_update_peer_state:   
pcmk_quorum_notification: Node node0[1] - state is now lost (was member)
Aug 16 13:11:12 [8980] node0   crmd: info: peer_update_callback:
node0 is now lost (was member)
Aug 16 13:11:12 [8980] node0   crmd:error: reap_dead_nodes: We're 
not part of the cluster anymore
Aug 16 13:11:12 [8975] node0cib: info: crm_client_new:  
Connecting 0x1cbcd80 for uid=0 gid=0 pid=8976 
id=9afb683d-8390-408a-a690-fb5447fefd37
Aug 16 13:11:12 [8976] node0 stonith-ng:   notice: setup_cib:   Watching for 
stonith topology changes
Aug 16 13:11:12 [8976] node0 stonith-ng: info: qb_ipcs_us_publish:  server 
name: stonith-ng
Aug 16 13:11:12 [8976] node0 stonith-ng: info: main:Starting 
stonith-ng mainloop
Aug 16 13:11:12 [8976] node0 stonith-ng: info: pcmk_cpg_membership: 
Joined[0.0] stonith-ng.1
Aug 16 13:11:12 [8976] node0 stonith-ng: info: pcmk_cpg_membership: 
Member[0.0] stonith-ng.1
Aug 16 13:11:12 [8980] node0   crmd:error: do_log:  FSA: Input 
I_ERROR from reap_dead_nodes() received in state S_STARTING
Aug 16 13:11:12 [8980] node0   crmd:   notice: do_state_transition: 
State transition S_STARTING -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL 
origin=reap_dead_nodes ]
Aug 16 13:11:12 [8980] node0   crmd:  warning: do_recover:  Fast-tracking 
shutdown in response to errors
Aug 16 13:11:12 [8980] node0   crmd:error: do_started:  Start 
cancelled... S_RECOVERY
Aug 16 13:11:12 [8980] node0   crmd:error: do_log:  FSA: Input 
I_TERMINATE from do_recover() received in state S_RECOVERY
Aug 16 13:11:12 [8980] node0   crmd: info: do_state_transition: 
State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE 
cause=C_FSA_INTERNAL origin=do_recover ]
Aug 16 13:11:12 [8980] node0   crmd:   notice: lrm_state_verify_stopped:
Stopped 0 recurring operations at shutdown (0 ops remaining)
Aug 16 13:11:12 [8980] node0   crmd: info: do_lrm_control:  
Disconnecting from the LRM
Aug 16 13:11:12 [8980] node0   crmd: info: lrmd_api_disconnect: 
Disconnecting from lrmd service
Aug 16 13:11:12 [8980] node0   crmd: info: 

Re: [ClusterLabs] agent ocf:pacemaker:controld

2016-07-22 Thread Eric Ren

Hello,

On 07/21/2016 09:31 PM, Da Shi Cao wrote:

I've built the dlm_tool suite using the source from 
https://git.fedorahosted.org/cgit/dlm.git/log/.  The resource using 
ocf:pacemaker:controld will always fail to start because of a timeout, even if the 
start timeout is set to 120s! But if dlm_controld is first started outside the 
cluster management, then the resource will show up and stay well!
1. Why do you suppose it's because of a timeout? Any logs from when the DLM RA 
failed to start?
"ocf:pacemaker:controld" is a bash script 
(/usr/lib/ocf/resource.d/pacemaker/controld).
If you take a look at this script, you'll find it assumes that 
dlm_controld is installed in a certain place (/usr/sbin/dlm_controld on
openSUSE). So, how would the dlm RA find your dlm daemon?
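
For example, a quick sanity check (sketch):

"""
which dlm_controld              # is the locally built daemon in $PATH at all?
ls -l /usr/sbin/dlm_controld    # the path the RA expects, per the note above
"""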

Another question is: what's the difference between dlm_controld and gfs_controld? 
Must they both be present if a clustered GFS filesystem is mounted?

2. dlm_controld is a userland daemon for the dlm kernel module, while 
gfs2_controld is for gfs2, I think. However, in recent releases 
(Red Hat and SUSE, AFAIK),
gfs_controld is no longer needed. But I don't know much of the history behind 
this change. Hope someone could elaborate on this a bit more ;-)


Cheers,
Eric



Thanks a lot!
Dashi Cao

From: Da Shi Cao <dscao...@hotmail.com>
Sent: Wednesday, July 20, 2016 4:47:31 PM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] agent ocf:pacemaker:controld

Thank you all for the information about dlm_controld. I will make a try using 
https://git.fedorahosted.org/cgit/dlm.git/log/ .

Dashi Cao


From: Jan Pokorný <jpoko...@redhat.com>
Sent: Monday, July 18, 2016 8:47:50 PM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] agent ocf:pacemaker:controld


On 18/07/16 07:59, Da Shi Cao wrote:

dlm_controld is very tightly coupled with cman.

Wrong assumption.

In fact, support for shipping ocf:pacemaker:controld has been
explicitly restricted to cases when CMAN logic (specifically the
respective handle-all initscript that is in turn, in that limited use
case, triggered from pacemaker's proper one and, moreover, takes
care of dlm_controld management on its own so any subsequent attempts
to do the same would be ineffective) is _not_ around:

https://github.com/ClusterLabs/pacemaker/commit/6a11d2069dcaa57b445f73b52f642f694e55caf3
(accidental syntactical typos were fixed later on:
https://github.com/ClusterLabs/pacemaker/commit/aa5509df412cb9ea39ae3d3918e0c66c326cda77)


I have built a cluster purely with
pacemaker+corosync+fence_sanlock. But if agent
ocf:pacemaker:controld is desired, dlm_controld must exist! I can
only find it in cman.
Can the command dlm_controld be obtained without bringing in cman?

To recap what others have suggested:

On 18/07/16 08:57 +0100, Christine Caulfield wrote:

There should be a package called 'dlm' that has a dlm_controld suitable
for use with pacemaker.

On 18/07/16 17:26 +0800, Eric Ren wrote:

DLM upstream hosted here:
   https://git.fedorahosted.org/cgit/dlm.git/log/

The name of DLM on openSUSE is libdlm.

--
Jan (Poki)





___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] agent ocf:pacemaker:controld

2016-07-18 Thread Eric Ren

Hi,

On 07/18/2016 02:59 PM, Da Shi Cao wrote:

dlm_controld is very tightly coupled with cman. I have built a cluster purely 
with pacemaker+corosync+fence_sanlock. But if agent ocf:pacemaker:controld is 
desired, dlm_controld must exist! I can only find it in cman.


Not sure I understand these words:

 "But if agent ocf:pacemaker:controld is desired, dlm_controld must exist!"

You mean this?
1. The dlm RA in pacemaker+corosync?
2. If so, under what circumstances does dlm_controld exit? Any symptoms and logs?


Can the command dlm_controld be obtained without bringing in cman?


DLM upstream is hosted here:
  https://git.fedorahosted.org/cgit/dlm.git/log/

The name of the DLM package on openSUSE is libdlm.
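
For example, on openSUSE/SLES (a sketch):

"""
zypper install libdlm        # provides dlm_controld, per the note above
which dlm_controld
"""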

Regards,
Eric


Best Regards
Dashi Cao





___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] DLM fencing

2016-03-29 Thread Eric Ren

Hello Ferenc,

Just want to communicate thoughts, AFAIC.



I've been meaning to explore this connection for a long time, but have never found much
useful material on the subject.  How does DLM fencing fit into the
modern Pacemaker architecture?  Fencing is a confusing topic in itself


Yes; unfortunately, the best material so far is probably the source code, such 
as the resource agent script for ocf:pacemaker:controld 
(/usr/lib/ocf/resource.d/pacemaker/controld), libdlm/dlm_controld/, etc.




already (fence_legacy, fence_pcmk, stonith, stonithd, stonith_admin),
then dlm_controld can use dlm_stonith to proxy fencing requests to
Pacemaker, and it becomes hopeless... :)


What you said about dlm_stonith is true. It just invokes a Pacemaker API 
to tell Pacemaker who should be fenced and when. Pacemaker does the hard 
work; depending on which fencing method is used, I guess the fencing 
request will finally reach its destination - a fencing resource agent.


I'm just starting to learn how these components cooperate with each other. 
Could you share any updates if you've learned something? I would 
be grateful ;-)




I'd be grateful for a pointer to a good overview document, or a quick
sketch if you can spare the time.  To invoke some concrete questions:
When does DLM fence a node?  Is it necessary only when there's no


Limiting the discussion to DLM itself, DLM will actively request fencing 
when an uncontrolled lockspace has been found in the kernel. Only a reboot 
can make that node clean.



resource manager running on the cluster?  Does it matter whether
dlm_controld is run as a standalone daemon or as a controld resource?


According to DLM's man pages and the code I've read, DLM gives us two 
options: daemon fencing (I guess "daemon" is short for dlm_daemon) and 
dlm_stonith. Daemon fencing is done by DLM itself, which has 
configuration (man dlm.conf) and lots of 
code (dlm/dlm_controld/{fence*|daemon_cpg.c}) to handle fencing. But I 
have never configured DLM's own fencing; by default, it (may) use dlm_stonith 
as a proxy, and then Pacemaker takes over...
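
For example, the proxy setup would look roughly like this in /etc/dlm/dlm.conf (a sketch based on the dlm.conf man page; I haven't tested it either):

"""
# hand fencing requests off to Pacemaker via the dlm_stonith proxy
fence_all /usr/sbin/dlm_stonith
"""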


So I think corosync, which provides membership knowledge, is a must for DLM, 
but Pacemaker is optional if DLM fencing has been configured and you 
don't want any other RAs - which I have also never tried ;-)



Wouldn't Pacemaker fence a failing node itself all the same?  Or is
dlm_stonith for the case when only the stonithd component of Pacemaker
is active somehow?



Please correct me if there's any problem.

Eric

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org