[ClusterLabs] Antw: Re: Live Guest Migration timeouts for VirtualDomain resources

2017-01-18 Thread Ulrich Windl
>>> Ken Gaillot  wrote on 18.01.2017 at 16:32 in
>>> message
<4b02d3fa-4693-473b-8bed-dc98f9e3f...@redhat.com>:
> On 01/17/2017 04:45 PM, Scott Greenlese wrote:
>> Ken and Co,
>> 
>> Thanks for the useful information.
>> 

[...]
>> 
>> Is this internally coded within the class=ocf provider=heartbeat
>> type=VirtualDomain resource agent?
> 
> Aha, I just realized what the issue is: the operation name is
> migrate_to, not migrate-to.
> 
> For technical reasons, pacemaker can't validate operation names (at the
> time that the configuration is edited, it does not necessarily have
> access to the agent metadata).

BUT the set of operations is finite, right? So if those were in some XML 
schema, the names could at least be verified (not meaning that the operation is 
actually supported).
BTW: Would a "crm configure verify" detect this kind of problem?
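
For reference, the action names an agent actually advertises can be checked by
hand against its metadata, e.g. (a rough sketch; output formatting varies
between resource-agents versions):

# list the operations the agent itself declares:
crm_resource --show-metadata ocf:heartbeat:VirtualDomain | grep '<action '

# "crm configure verify" / crm_verify validate the CIB against the schema,
# which does not enumerate per-agent operation names, so a misspelled
# operation such as "migrate-to" would most likely still pass:
crm_verify --live-check -V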

[...]

Ulrich




___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Colocations and Orders Syntax Changed?

2017-01-18 Thread Eric Robinson
Greetings!

I have a lot of pacemaker clusters, each running multiple instances of mysql.  
I configure them so that the mysql resources are all dependent on an underlying 
stack of supporting resources consisting of a virtual IP address (p_vip), a 
filesystem (p_fs), often an LVM resource (p_lvm), and a drbd resource (p_drbd). 
If any resource in the underlying stack moves, then all of them move 
together and the mysql resources follow. However, each of the mysql resources 
can be stopped and started independently without impacting any other resources. 
I accomplish that with a configuration such as the following:

colocation c_clust10 inf: ( p_mysql_103 p_mysql_150 p_mysql_204 p_mysql_206 
p_mysql_244 p_mysql_247 ) p_vip_clust10 p_fs_clust10 ms_drbd0:Master
order o_clust10 inf: ms_drbd0:promote p_fs_clust10 p_vip_clust10 ( p_mysql_103 
p_mysql_150 p_mysql_204 p_mysql_206 p_mysql_244 p_mysql_247 )

This has suddenly stopped working. On my newest cluster, when I try to use the 
same approach, the configuration gets rearranged on me automatically. The 
parentheses get moved. Often each of the underlying resources is changed to the 
same thing with ":Master" appended. Sometimes the whole colocation stanza gets 
replaced with raw xml. I have messed around with it, and the following is the 
best I can come up with, but when I stop a mysql resource everything else 
stops!

colocation c_clust19 inf: ( p_mysql_057 p_mysql_092 p_mysql_187 p_mysql_213 
p_vip_clust19 p_mysql_702 p_mysql_743 p_fs_clust19 p_lv_on_drbd0 ) ( 
ms_drbd0:Master )
order o_clust19 inf: ms_drbd0:promote ( p_lv_on_drbd0:start ) ( p_fs_clust19 
p_vip_clust19 ) ( p_mysql_057 p_mysql_092 p_mysql_187 p_mysql_213 p_mysql_702 
p_mysql_743 )

The old cluster is running Pacemaker 1.1.10. The new one is running 1.1.12.

What can I do to get it running right again? I want all the underlying 
resources (vip, fs, lvm, drbd) to move together. I want the mysql instances to 
be collocated with the underlying resources, but I want them to be independent 
of each other so they can each be started and stopped without hurting anything.
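
(For reference, a rough, untested sketch of the resource-set form this layering
maps to via pcs -- set syntax differs between pcs versions, so treat it as an
illustration rather than a drop-in fix; resource names are the ones above:)

# order: promote drbd, then bring up the stack in sequence, then the mysql
# instances, which are unordered among themselves:
pcs constraint order set ms_drbd0 action=promote \
    set p_lv_on_drbd0 p_fs_clust19 p_vip_clust19 \
    set p_mysql_057 p_mysql_092 p_mysql_187 p_mysql_213 p_mysql_702 p_mysql_743 \
        sequential=false

# colocation: mysql instances (independent of each other) follow the stack,
# which follows the drbd master:
pcs constraint colocation set p_mysql_057 p_mysql_092 p_mysql_187 p_mysql_213 \
        p_mysql_702 p_mysql_743 sequential=false \
    set p_vip_clust19 p_fs_clust19 p_lv_on_drbd0 \
    set ms_drbd0 role=Master \
    setoptions score=INFINITY

sequential=false on the mysql set is what keeps the instances independent of
one another while still tying them to the underlying stack.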

--
Eric Robinson


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] HALVM problem with 2 nodes cluster

2017-01-18 Thread Ferenc Wágner
Marco Marino  writes:

> Ferenc, regarding the flag use_lvmetad in
> /usr/lib/ocf/resource.d/heartbeat/LVM I read:
>
>> lvmetad is a daemon that caches lvm metadata to improve the
>> performance of LVM commands. This daemon should never be used when
>> volume groups exist that are being managed by the cluster. The
>> lvmetad daemon introduces a response lag, where certain LVM commands
>> look like they have completed (like vg activation) when in fact the
>> command is still in progress by the lvmetad.  This can cause
>> reliability issues when managing volume groups in the cluster.  For
>> Example, if you have a volume group that is a dependency for another
>> application, it is possible the cluster will think the volume group
>> is activated and attempt to start the application before volume group
>> is really accessible... lvmetad is bad.
>
> in the function LVM_validate_all()

Wow, if this is true, then this is serious breakage in LVM.  Thanks for
the pointer.  I think this should be brought up with the LVM developers.
-- 
Feri

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] [Announce] clufter v0.59.8 released

2017-01-18 Thread Jan Pokorný
I am happy to announce that clufter, a tool/library for transforming
and analyzing cluster configuration formats, got its version 0.59.8
tagged and released (incl. signature using my 60BCBB4F5CD7F9EF key):


or alternative (original) location:



The updated test suite for this version is also provided:

or alternatively:


Changelog highlights for v0.59.8 (also available as a tag message):

- bug fix release (related to ccs_flatten helper)
- bug fixes:
  . some configuration flags (sort of meta attributes) governing how rgmanager
    treats services or particular resources were previously dropped during the
    initial resource tree flattening phase (via the ccs_flatten binary);
    now some are selectively restored in the output. In particular, omitting
    __failure_expire_time/__max_failures was downright a bug, as they are
    expected by the filters (ccsflat2cibprelude) further in the chain; the
    others are restored for possible future use: __enforce_timeouts and
    __restart_expire_time/__max_restarts
  . the ccs_flatten binary received an update of the function responsible
    for parsing time-related inputs so as to harden it against values
    encoded as unexpectedly long strings or similar exotic cases
    [related rgmanager bugs: rhbz#1036652, rhbz#1414139]
- internal enhancements:
  . ccs_flatten binary sources and snippets of related filters
received some more maintenance care

* * *

The public repository (notably master and next branches) is currently at

(rather than ).

Official, signed releases can be found at
 or, alternatively, at

(also beware, automatic git archives preserve a "dev structure").

Natively packaged in Fedora (python-clufter, clufter-cli, ...).

Issues & suggestions can be reported at either of (regardless of whether you use Fedora)
,
.


Happy clustering/high-availing :)

-- 
Jan (Poki)


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Live Guest Migration timeouts for VirtualDomain resources

2017-01-18 Thread Tomas Jelinek

On 18.1.2017 at 16:32, Ken Gaillot wrote:

On 01/17/2017 04:45 PM, Scott Greenlese wrote:

Ken and Co,

Thanks for the useful information.

I bumped the migrate-to timeout value from 1200ms to 360s, which should
be more than enough time
to successfully migrate the resource (i.e. the KVM guest). The migration
was again interrupted with a timeout
at the 20000ms (20 second) mark, thus stopping / failing the resource,
which shuts down the guest, then
reassigning the resource to the target node, then cold starting the
resource and guest on the target.

Does anybody know where this 20000ms timeout value gets set? 20sec is
clearly inadequate for a virtual machine
that is running with any kind of I/O or memory intensive workload. I
need to have a way to change that timeout
value, if not by the migrate-to timeout value, then what?

Is this internally coded within the class=ocf provider=heartbeat
type=VirtualDomain resource agent?


Aha, I just realized what the issue is: the operation name is
migrate_to, not migrate-to.

For technical reasons, pacemaker can't validate operation names (at the
time that the configuration is edited, it does not necessarily have
access to the agent metadata).


We'll add operation name validation to pcs to cover this issue. For the "pcs 
resource create" command it will be done in the near future. For the rest 
of the commands, however, I cannot really promise any date. This is not on 
our priority list right now.


Regards,
Tomas




For completeness, here's what I did..

## Show current timeout value.

[root@zs95kj ~]# date;pcs resource show zs95kjg110061_res
Tue Jan 17 13:36:05 EST 2017
Resource: zs95kjg110061_res (class=ocf provider=heartbeat
type=VirtualDomain)
Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
hypervisor=qemu:///system migration_transport=ssh
Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
remote-addr=10.20.110.61 target-role=Stopped
Operations: start interval=0s timeout=480
(zs95kjg110061_res-start-interval-0s)
stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
migrate-from interval=0s timeout=1200
(zs95kjg110061_res-migrate-from-interval-0s)
migrate-to interval=0s *timeout=1200*
(zs95kjg110061_res-migrate-to-interval-0s)


## Change timeout value from 1200(ms) to 360s ...


[root@zs95kj ~]# pcs resource update zs95kjg110061_res op migrate-to
timeout="360s"

[root@zs95kj ~]# date;pcs resource show zs95kjg110061_res
Tue Jan 17 13:38:10 EST 2017
Resource: zs95kjg110061_res (class=ocf provider=heartbeat
type=VirtualDomain)
Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
hypervisor=qemu:///system migration_transport=ssh
Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
remote-addr=10.20.110.61 target-role=Stopped
Operations: start interval=0s timeout=480
(zs95kjg110061_res-start-interval-0s)
stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
migrate-from interval=0s timeout=1200
(zs95kjg110061_res-migrate-from-interval-0s)
migrate-to interval=0s *timeout=360s*
(zs95kjg110061_res-migrate-to-interval-0s)


[root@zs95kj ~]# date;pcs resource enable zs95kjg110061_res
Tue Jan 17 13:40:55 EST 2017
[root@zs95kj ~]#


[root@zs95kj ~]# date;pcs resource show |grep zs95kjg110061_res
Tue Jan 17 13:41:16 EST 2017
zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1


## Started the I/O intensive 'blast' workload on the guest.


## Initiate the LGM via resource move CLI ..


[root@zs95kj ~]# date;pcs resource move zs95kjg110061_res zs90kppcs1
Tue Jan 17 13:54:50 EST 2017


## System log shows the following error thrown:

Jan 17 13:54:53 zs95kj crmd[27555]: notice: Operation
zs95kjg110061_stop_0: ok (node=zs95kjpcs1, call=44, rc=0,
cib-update=450, confirmed=true)
Jan 17 13:54:53 zs95kj attrd[27553]: notice: Removing all zs95kjg110061
attributes for zs95kjpcs1
Jan 17 13:54:54 zs95kj VirtualDomain(zs95kjg110061_res)[135045]: INFO:
zs95kjg110061: *Starting live migration to zs90kppcs1 (using: virsh
--connect=qemu:///system --quiet migrate --live zs95kjg110061
qemu+ssh://zs90kppcs1/system ).*
Jan 17 13:55:14 zs95kj lrmd[27552]: warning:
zs95kjg110061_res_migrate_to_0 process (PID 135045) timed out
Jan 17 13:55:14 zs95kj lrmd[27552]: warning:
zs95kjg110061_res_migrate_to_0:135045 - timed out after 20000ms
Jan 17 13:55:14 zs95kj crmd[27555]: error: Operation
zs95kjg110061_res_migrate_to_0: Timed Out (node=zs95kjpcs1, call=941,
timeout=20000ms)
Jan 17 13:55:15 zs95kj VirtualDomain(zs95kjg110061_res)[136996]: INFO:
Issuing graceful shutdown request for domain zs95kjg110061.
Jan 17 13:55:26 zs95kj systemd-machined: Machine qemu-58-zs95kjg110061
terminated.
Jan 17 13:55:26 zs95kj crmd[27555]: notice: Operation
zs95kjg110061_res_stop_0: ok (node=zs95kjpcs1, call=943, rc=0,
cib-update=459, confirmed=true)



This is consistent with my original symptom... the "internal" timeout
value of 20000ms seems to 

Re: [ClusterLabs] Antw: Re: VirtualDomain started in two hosts

2017-01-18 Thread Ken Gaillot
On 01/18/2017 03:49 AM, Ferenc Wágner wrote:
> Ken Gaillot  writes:
> 
>> * When you move the VM, the cluster detects that it is not running on
>> the node you told it to keep it running on. Because there is no
>> "Stopped" monitor, the cluster doesn't immediately realize that a new
>> rogue instance is running on another node. So, the cluster thinks the VM
>> crashed on the original node, and recovers it by starting it again.
> 
> Ken, do you mean that if a periodic "stopped" monitor is configured, it
> is forced to run immediately (out of schedule) when the regular periodic
> monitor unexpectedly returns with stopped status?  That is, before the
> cluster takes the recovery action?  Conceptually, that would be similar
> to the probe run on node startup.  If not, then maybe it would be a
> useful resource option to have (I mean running cluster-wide probes on an
> unexpected monitor failure, before recovery).  An optional safety check.

No, there is nothing like that currently. The regular and "Stopped"
monitors run independently. Because they must have different intervals,
that does mean that the two sides of the issue may be detected at
different times.
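
(For reference, such a pair of monitors would look something like the following
with pcs -- the resource name is hypothetical, and the only hard requirement is
that the two intervals differ:)

# assuming a regular monitor (e.g. interval=30s) already exists, add a second
# one that runs while the resource is expected to be stopped:
pcs resource op add my_vm monitor interval=45s role=Stopped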

It is an interesting idea to have an option to reprobe on operation
failure. I think it may be overkill; the only failure situation it would
be good for is one like this, where a resource was moved out of cluster
control. The vast majority of failure scenarios wouldn't be helped. If
that sort of thing happens a lot in your cluster, you really need to
figure out how to stop doing that. :)

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Live Guest Migration timeouts for VirtualDomain resources

2017-01-18 Thread Ken Gaillot
On 01/17/2017 04:45 PM, Scott Greenlese wrote:
> Ken and Co,
> 
> Thanks for the useful information.
> 
> I bumped the migrate-to timeout value from 1200ms to 360s, which should
> be more than enough time
> to successfully migrate the resource (i.e. the KVM guest). The migration
> was again interrupted with a timeout
> at the 20000ms (20 second) mark, thus stopping / failing the resource,
> which shuts down the guest, then
> reassigning the resource to the target node, then cold starting the
> resource and guest on the target.
> 
> Does anybody know where this 20000ms timeout value gets set? 20sec is
> clearly inadequate for a virtual machine
> that is running with any kind of I/O or memory intensive workload. I
> need to have a way to change that timeout
> value, if not by the migrate-to timeout value, then what?
> 
> Is this internally coded within the class=ocf provider=heartbeat
> type=VirtualDomain resource agent?

Aha, I just realized what the issue is: the operation name is
migrate_to, not migrate-to.

For technical reasons, pacemaker can't validate operation names (at the
time that the configuration is edited, it does not necessarily have
access to the agent metadata).
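
In other words, the configured timeout was attached to an operation name that
pacemaker never executes, so the default 20-second operation timeout applied to
the real migrate_to action. A hedged sketch of the cleanup, against the resource
shown below (the same applies to migrate-from vs. migrate_from):

# remove the misnamed operation and set the timeout on the real one:
pcs resource op remove zs95kjg110061_res migrate-to
pcs resource update zs95kjg110061_res op migrate_to timeout=360s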

> For completeness, here's what I did..
> 
> ## Show current timeout value.
> 
> [root@zs95kj ~]# date;pcs resource show zs95kjg110061_res
> Tue Jan 17 13:36:05 EST 2017
> Resource: zs95kjg110061_res (class=ocf provider=heartbeat
> type=VirtualDomain)
> Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
> hypervisor=qemu:///system migration_transport=ssh
> Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
> remote-addr=10.20.110.61 target-role=Stopped
> Operations: start interval=0s timeout=480
> (zs95kjg110061_res-start-interval-0s)
> stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
> monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
> migrate-from interval=0s timeout=1200
> (zs95kjg110061_res-migrate-from-interval-0s)
> migrate-to interval=0s *timeout=1200*
> (zs95kjg110061_res-migrate-to-interval-0s)
> 
> 
> ## Change timeout value from 1200(ms) to 360s ...
> 
> 
> [root@zs95kj ~]# pcs resource update zs95kjg110061_res op migrate-to
> timeout="360s"
> 
> [root@zs95kj ~]# date;pcs resource show zs95kjg110061_res
> Tue Jan 17 13:38:10 EST 2017
> Resource: zs95kjg110061_res (class=ocf provider=heartbeat
> type=VirtualDomain)
> Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
> hypervisor=qemu:///system migration_transport=ssh
> Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
> remote-addr=10.20.110.61 target-role=Stopped
> Operations: start interval=0s timeout=480
> (zs95kjg110061_res-start-interval-0s)
> stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
> monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
> migrate-from interval=0s timeout=1200
> (zs95kjg110061_res-migrate-from-interval-0s)
> migrate-to interval=0s *timeout=360s*
> (zs95kjg110061_res-migrate-to-interval-0s)
> 
> 
> [root@zs95kj ~]# date;pcs resource enable zs95kjg110061_res
> Tue Jan 17 13:40:55 EST 2017
> [root@zs95kj ~]#
> 
> 
> [root@zs95kj ~]# date;pcs resource show |grep zs95kjg110061_res
> Tue Jan 17 13:41:16 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
> 
> 
> ## Started the I/O intensive 'blast' workload on the guest.
> 
> 
> ## Initiate the LGM via resource move CLI ..
> 
> 
> [root@zs95kj ~]# date;pcs resource move zs95kjg110061_res zs90kppcs1
> Tue Jan 17 13:54:50 EST 2017
> 
> 
> ## System log shows the following error thrown:
> 
> Jan 17 13:54:53 zs95kj crmd[27555]: notice: Operation
> zs95kjg110061_stop_0: ok (node=zs95kjpcs1, call=44, rc=0,
> cib-update=450, confirmed=true)
> Jan 17 13:54:53 zs95kj attrd[27553]: notice: Removing all zs95kjg110061
> attributes for zs95kjpcs1
> Jan 17 13:54:54 zs95kj VirtualDomain(zs95kjg110061_res)[135045]: INFO:
> zs95kjg110061: *Starting live migration to zs90kppcs1 (using: virsh
> --connect=qemu:///system --quiet migrate --live zs95kjg110061
> qemu+ssh://zs90kppcs1/system ).*
> Jan 17 13:55:14 zs95kj lrmd[27552]: warning:
> zs95kjg110061_res_migrate_to_0 process (PID 135045) timed out
> Jan 17 13:55:14 zs95kj lrmd[27552]: warning:
> zs95kjg110061_res_migrate_to_0:135045 - timed out after 20000ms
> Jan 17 13:55:14 zs95kj crmd[27555]: error: Operation
> zs95kjg110061_res_migrate_to_0: Timed Out (node=zs95kjpcs1, call=941,
> timeout=20000ms)
> Jan 17 13:55:15 zs95kj VirtualDomain(zs95kjg110061_res)[136996]: INFO:
> Issuing graceful shutdown request for domain zs95kjg110061.
> Jan 17 13:55:26 zs95kj systemd-machined: Machine qemu-58-zs95kjg110061
> terminated.
> Jan 17 13:55:26 zs95kj crmd[27555]: notice: Operation
> zs95kjg110061_res_stop_0: ok (node=zs95kjpcs1, call=943, rc=0,
> cib-update=459, confirmed=true)
> 
> 
> This is consistent with my original symptom... the "internal" timeout
> value of 20000ms seems to override the migrate-to timeout value in the
> resource,
> if in fact ... the migrate-to 

Re: [ClusterLabs] HALVM problem with 2 nodes cluster

2017-01-18 Thread Marco Marino
Ferenc, regarding the flag use_lvmetad in
/usr/lib/ocf/resource.d/heartbeat/LVM I read:

"#lvmetad is a daemon that caches lvm metadata to improve the
# performance of LVM commands. This daemon should never be used when
# volume groups exist that are being managed by the cluster. The
lvmetad
# daemon introduces a response lag, where certain LVM commands look
like
# they have completed (like vg activation) when in fact the command
# is still in progress by the lvmetad.  This can cause reliability
issues
# when managing volume groups in the cluster.  For Example, if you
have a
# volume group that is a dependency for another application, it is
possible
# the cluster will think the volume group is activated and attempt
to start
# the application before volume group is really accesible...
lvmetad is bad."

in the function LVM_validate_all()
Anyway, it's only a warning, but there is a good reason for it. I'm not an expert;
I'm studying for a certification and I have a lot of doubts.
Thank you for your help
Marco




2017-01-18 11:03 GMT+01:00 Ferenc Wágner :

> Marco Marino  writes:
>
> > I agree with you for
> > use_lvmetad = 0 (setting it = 1 in a clustered environment is an error)
>
> Where does this information come from?  AFAIK, if locking_type=3 (LVM
> uses internal clustered locking, that is, clvmd), lvmetad is not used
> anyway, even if it's running.  So it's best to disable it to avoid
> warning messages all around.  This is the case with active/active
> clustering in LVM itself, in which Pacemaker isn't involved.
>
> On the other hand, if you use Pacemaker to do active/passive clustering
> by appropriately activating/deactivating your VG, this isn't clustering
> from the LVM point of view, you don't set the clustered flag on your VG,
> don't run clvmd and use locking_type=1.  Lvmetad should be perfectly
> fine with this in principle (unless it caches metadata of inactive VGs,
> which would be stupid, but I never tested this).
>
> > but I think I have to set
> > locking_type = 3 only if I use clvm
>
> Right.
> --
> Feri
>
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] HALVM problem with 2 nodes cluster

2017-01-18 Thread Ferenc Wágner
Marco Marino  writes:

> I agree with you for
> use_lvmetad = 0 (setting it = 1 in a clustered environment is an error)

Where does this information come from?  AFAIK, if locking_type=3 (LVM
uses internal clustered locking, that is, clvmd), lvmetad is not used
anyway, even if it's running.  So it's best to disable it to avoid
warning messages all around.  This is the case with active/active
clustering in LVM itself, in which Pacemaker isn't involved.

On the other hand, if you use Pacemaker to do active/passive clustering
by appropriately activating/deactivating your VG, this isn't clustering
from the LVM point of view, you don't set the clustered flag on your VG,
don't run clvmd and use locking_type=1.  Lvmetad should be perfectly
fine with this in principle (unless it caches metadata of inactive VGs,
which would be stupid, but I never tested this).
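
For illustration, the two setups described above boil down to something like
this in /etc/lvm/lvm.conf (a sketch; the second use_lvmetad=0 follows the LVM
agent's warning quoted elsewhere in this thread rather than a hard requirement):

# active/active with clvmd doing the locking (lvmetad is ignored anyway):
locking_type = 3
use_lvmetad = 0

# active/passive, Pacemaker activates/deactivates the VG, no clvmd:
locking_type = 1
use_lvmetad = 0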

> but I think I have to set
> locking_type = 3 only if I use clvm

Right.
-- 
Feri

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: how do you do your Samba?

2017-01-18 Thread lejeczek



On 17/01/17 15:59, Ulrich Windl wrote:

lejeczek  wrote on 17.01.2017 at 16:27 in message

:

hi everyone

asking here as I hope experts here have already done it
dozens of times.
I've gotten a nice answer from the samba authors, but it explained
the idea only in general; here I hope someone could actually
explain how this (HA cluster) should be configured and set up.
I asked the following question:

"...to experts, all experienced sambians - when one uses an HA
cluster, does one deem CTDB needless, redundant?
Or do you somehow team those two together?
Or maybe ctdb is so superior that it rules this domain,
and if HA exists it leaves Samba to ctdb?"

With CTDB, management is easier IMHO; otherwise you'll have to keep the Samba 
config (e.g. users and passwords) in sync.


a little bit more?
I'm just starting this adventure with HA, and I read 
ocf:heartbeat:CTDB

"..
This agent expects the samba and windbind resources
to be managed outside of CTDB's control as a separate set of 
resources controlled
by the cluster manager.  The optional support for enabling 
CTDB management of these daemons will be depreciated.
"
yet, this howto - 
http://linux-ha.org/wiki/CTDB_(resource_agent) - says:

"..
# primitive ctdb ocf:heartbeat:CTDB params \
ctdb_recovery_lock="/shared-fs/samba/ctdb.lock" \
ctdb_manages_samba="yes" \
ctdb_manages_winbind="yes" \
op monitor timeout=20 interval=10
"

and that is vastly confusing to me, and to anybody who just 
started researching the subject yesterday, I'd imagine.
I was hoping someone could, if not in great detail, at 
least outline the configuration steps for making HA+Samba 
a smoothly working solution.






"

many thanks & regards,
L.
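
As a starting point, a hedged sketch of the layout the agent description above
implies (CTDB cloned across the nodes, smbd/winbind managed by the cluster
rather than by CTDB; unit names and the lock path are illustrative and vary by
distribution):

primitive p_ctdb ocf:heartbeat:CTDB \
    params ctdb_recovery_lock="/shared-fs/samba/ctdb.lock" \
    op monitor interval=10 timeout=20
primitive p_smbd systemd:smb op monitor interval=30
primitive p_winbind systemd:winbind op monitor interval=30
group g_ctdb_samba p_ctdb p_smbd p_winbind
clone cl_ctdb_samba g_ctdb_samba meta interleave=true

The recovery lock has to live on a filesystem mounted on all nodes, so in
practice a cloned shared-filesystem resource is typically ordered before this
clone.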




___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: VirtualDomain started in two hosts

2017-01-18 Thread Ferenc Wágner
Ken Gaillot  writes:

> * When you move the VM, the cluster detects that it is not running on
> the node you told it to keep it running on. Because there is no
> "Stopped" monitor, the cluster doesn't immediately realize that a new
> rogue instance is running on another node. So, the cluster thinks the VM
> crashed on the original node, and recovers it by starting it again.

Ken, do you mean that if a periodic "stopped" monitor is configured, it
is forced to run immediately (out of schedule) when the regular periodic
monitor unexpectedly returns with stopped status?  That is, before the
cluster takes the recovery action?  Conceptually, that would be similar
to the probe run on node startup.  If not, then maybe it would be a
useful resource option to have (I mean running cluster-wide probes on an
unexpected monitor failure, before recovery).  An optional safety check.
-- 
Regards,
Feri

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] HALVM problem with 2 nodes cluster

2017-01-18 Thread Marco Marino
Hi Bliu, thank you.
I agree with you for
use_lvmetad = 0 (setting it = 1 in a clustered environment is an error)
but I think I have to set
locking_type = 3 only if I use clvm
In my case, I'm trying to use plain LVM (not clvm), so I think that locking_type = 1 is ok.
What do you think?

Furthermore, I have an application (managed as a resource in the cluster)
that continuously creates and removes logical volumes in the cluster. Is this
a problem? The application uses a custom lvm.conf configuration file where
I have volume_list = [ "@pacemaker" ]

Thank you




2017-01-18 10:12 GMT+01:00 bliu :

> Hi, Marco
>
> On 01/18/2017 04:45 PM, Marco Marino wrote:
>
> Hi, I'm trying to realize a cluster with 2 nodes that manages a volume
> group.
> Basically I have a san connected to both nodes that exposes 1 lun. So both
> nodes have a disk /dev/sdb. From one node I did:
> fdisk /dev/sdb  <- Create a partition with type = 8e (LVM)
> pvcreate /dev/sdb1
> vgcreate myvg
>
> then
>
> pcs resource create halvm LVM volgrpname=myvg exclusive=true
>
> Last command fails with an error: "LVM: myvg did not activate correctly"
>
> Reading /usr/lib/ocf/resource.d/heartbeat/LVM, this happens because it
> seems that I need at least one logical volume inside the volume group
> before creating the resource. Is this correct?
>
> Yes, you need to create the pv and vg before you let the cluster manage it.
>
> Furthermore, how can I set volume_list in lvm.conf? Actually in lvm.conf I
> have:
>
> Normally, clvm is used in a cluster with shared storage as:
> locking_type = 3
> use_lvmetad = 0
>
> locking_type = 1
> use_lvmetad = 1
> volume_list = [ "vg-with-root-lv" ]
>
>
> Thank you
>
>
>
>
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] HALVM problem with 2 nodes cluster

2017-01-18 Thread bliu



On 01/18/2017 04:45 PM, Marco Marino wrote:
Hi, I'm trying to realize a cluster with 2 nodes that manages a volume 
group.
Basically I have a san connected to both nodes that exposes 1 lun. So 
both nodes have a disk /dev/sdb. From one node I did:

fdisk /dev/sdb  <- Create a partition with type = 8e (LVM)
pvcreate /dev/sdb1
vgcreate myvg

then

pcs resource create halvm LVM volgrpname=myvg exclusive=true

Last command fails with an error: "LVM: myvg did not activate correctly"

Reading /usr/lib/ocf/resource.d/heartbeat/LVM, this happens because it 
seems that I need at least one logical volume inside the volume group 
before creating the resource. Is this correct?

Yes
Furthermore, how can I set volume_list in lvm.conf? Actually in 
lvm.conf I have:

Normally, clvm is used in a cluster with shared storage as:
locking_type = 3
use_lvmetad = 0

locking_type = 1
use_lvmetad = 1
volume_list = [ "vg-with-root-lv" ]


Thank you






___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] HALVM problem with 2 nodes cluster

2017-01-18 Thread Marco Marino
Hi, I'm trying to realize a cluster with 2 nodes that manages a volume
group.
Basically I have a san connected to both nodes that exposes 1 lun. So both
nodes have a disk /dev/sdb. From one node I did:
fdisk /dev/sdb  <- Create a partition with type = 8e (LVM)
pvcreate /dev/sdb1
vgcreate myvg

then

pcs resource create halvm LVM volgrpname=myvg exclusive=true

Last command fails with an error: "LVM: myvg did not activate correctly"

Reading /usr/lib/ocf/resource.d/heartbeat/LVM, this happens because it
seems that I need at least one logical volume inside the volume group
before creating the resource. Is this correct?
Furthermore, how can I set volume_list in lvm.conf? Actually in lvm.conf I
have:
locking_type = 1
use_lvmetad = 1
volume_list = [ "vg-with-root-lv" ]
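
(For what it's worth, a minimal sketch of the sequence discussed elsewhere in
this thread -- the LV name/size and the extra volume_list entry are illustrative
assumptions, not a tested recipe:)

# create the VG on the shared LUN and give it at least one LV before
# handing it to the cluster (vgcreate needs the PV as an argument):
pvcreate /dev/sdb1
vgcreate myvg /dev/sdb1
lvcreate -n lv_data -L 1G myvg
pcs resource create halvm LVM volgrpname=myvg exclusive=true

# lvm.conf: only VGs/LVs matching volume_list may auto-activate on a node;
# entries can be a VG name, "vg/lv", or "@tag" (anything carrying that tag),
# e.g. keeping the root VG plus tagged volumes:
#   volume_list = [ "vg-with-root-lv", "@pacemaker" ]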


Thank you
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org