Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Prasad, Shashank
> I don't think that having a hook that bypass stonith is the right way….

 

The intention is NOT to bypass STONITH. STONITH shall always remain active, and 
an integral part of the cluster. The discussion is about bailing out of 
situations when the STONITH itself fails due to fencing agent failures, and how 
one can automate the process of bailing out.

 

All that the surviving nodes in the cluster need to be informed of is that the 
failed node has indeed failed; hence the suggestion for a hook.

 

The hook (let's say: STONITH-Failure-Recovery-Hook) under discussion will only 
be fired when the fencing agent fails. The STONITH-Failure-Recovery-Hook is realized 
via a script. The "${CRM_alert_rsc}", "${CRM_alert_task}", "${CRM_alert_desc}" 
and "${CRM_alert_node}" variables in the Pacemaker Alert can be used to 
match up with the STONITH resource and its failures, and to invoke the 
STONITH-Failure-Recovery-Hook as appropriate.
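A rough sketch of what such an alert agent could look like (everything here is 
illustrative: the script name and path are made up, the failed fencing is assumed 
to raise a fencing-kind alert, and the CRM_alert_* variables are assumed to be 
delivered as documented for alert agents):

    #!/bin/sh
    # /usr/local/bin/stonith-failure-hook.sh  (hypothetical name/path)
    # React only to fencing events that did not succeed.
    if [ "${CRM_alert_kind}" = "fencing" ] && [ "${CRM_alert_rc}" != "0" ]; then
        # CRM_alert_node is the fencing target here. Confirming it manually
        # bypasses real proof of node death, so use this with great care.
        sudo pcs stonith confirm "${CRM_alert_node}" --force
    fi
    exit 0

registered with something like:

    pcs alert create id=stonith-failure-hook path=/usr/local/bin/stonith-failure-hook.sh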

 

I also agree with Klaus that a quorum device is a good strategy.

That needs a 3rd node in the cluster. If such an option can be exercised, it 
should be.

 

Thanx.

 

 

 

From: Tomer Azran [mailto:tomer.az...@edp.co.il] 
Sent: Tuesday, July 25, 2017 3:00 AM
To: kwenn...@redhat.com; Cluster Labs - All topics related to open-source 
clustering welcomed; Prasad, Shashank
Subject: RE: [ClusterLabs] Two nodes cluster issue

 

I tend to agree with Klaus – I don't think that having a hook that bypasses 
stonith is the right way. It is better to not use stonith at all.

I think I will try to use an iSCSI target on my qdevice and set SBD to use it.

I still don't understand why qdevice can't take the place of SBD with shared 
storage; correct me if I'm wrong, but it looks like both of them are there for 
the same reason.

 

From: Klaus Wenninger [mailto:kwenn...@redhat.com] 
Sent: Monday, July 24, 2017 9:01 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Prasad, Shashank 
Subject: Re: [ClusterLabs] Two nodes cluster issue

 

On 07/24/2017 07:32 PM, Prasad, Shashank wrote:

Sometimes IPMI fence devices use shared power of the node, and it 
cannot be avoided.

In such scenarios the HA cluster is NOT able to handle the power 
failure of a node, since the power is shared with its own fence device.

The failure of IPMI based fencing can also occur for other 
reasons.

 

A failure to fence the failed node will cause the cluster to be marked 
UNCLEAN.

To get over it, the following command needs to be invoked on the 
surviving node.

 

pcs stonith confirm <node> --force

 

This can be automated by hooking in a recovery script when the 
Stonith resource reports a ‘Timed Out’ event.

To be more specific, the Pacemaker Alerts can be used to watch for 
Stonith timeouts and failures.

In that script, all that essentially has to be executed is the 
aforementioned command.


If I get you right here, you could as well disable fencing in the first place.
Actually quorum-based-watchdog-fencing is the way to do this in a
safe manner. This of course assumes you have a proper source of
quorum in your 2-node setup, with e.g. qdevice, or using a shared
disk with sbd (not directly pacemaker quorum here but a similar thing
handled inside sbd).
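For the qdevice variant, a minimal sketch (the qnetd host name is a placeholder 
and the package/daemon setup on the third machine is omitted):

    # on a third machine (not a cluster node): install and start corosync-qnetd
    # then, on the two cluster nodes:
    pcs quorum device add model net host=qnetd-host algorithm=ffsplit
    pcs quorum status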



Since the alerts are issued from the ‘hacluster’ login, sudo permissions 
for ‘hacluster’ need to be configured.
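A hedged example of such a sudoers entry (the pcs path and the wildcard are 
assumptions; keep the rule as narrow as possible):

    # /etc/sudoers.d/hacluster  (example only)
    hacluster ALL=(root) NOPASSWD: /usr/sbin/pcs stonith confirm *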

 

Thanx.

 

 

From: Klaus Wenninger [mailto:kwenn...@redhat.com] 
Sent: Monday, July 24, 2017 9:24 PM
To: Kristián Feldsam; Cluster Labs - All topics related to open-source 
clustering welcomed
Subject: Re: [ClusterLabs] Two nodes cluster issue

 

On 07/24/2017 05:37 PM, Kristián Feldsam wrote:

I personally think that powering off the node via a switched PDU is more 
safe, or not?


True if that is working in your environment. If you can't do a physical 
setup
where you aren't simultaneously losing connection to both your node and
the switch-device (or you just want to cover cases where that happens)
you have to come up with something else.





S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz  

www.feldhost.cz   - FeldHost™ – profesionální 
hostingové a serverové služby za adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446 

 

On 24 Jul 2017, at 17:27, Klaus Wenninger 

Re: [ClusterLabs] why resources are restarted when a node rejoins a cluster?

2017-07-24 Thread Digimer

  
  
On 2017-07-24 11:04 PM, ztj wrote:

  
  
  
  
Hi all,
I have 2 CentOS nodes with heartbeat and pacemaker-1.1.13 installed, and almost
everything is working fine. I have only apache configured for testing; when a
node goes down the failover is done correctly, but there's a problem when a
node fails back.

For example, let's say that Node1 has the lead on the apache resource, then I
reboot Node1, so Pacemaker detects that it went down, then apache is promoted
to Node2 and keeps running fine there. That's fine, but when Node1 recovers and
joins the cluster again, apache is restarted on Node2 again.

Does anyone know why resources are restarted when a node rejoins a cluster? Thanks
  


You sent this to the moderators, not the list.

Please don't use heartbeat, it is extremely deprecated. Please
switch to corosync.

To offer any other advice, you need to share your config and the
logs from both nodes. Please respond to the list, not
developers-ow...@clusterlabs.org.

digimer
-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
  


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Ken Gaillot
On Mon, 2017-07-24 at 21:29 +, Tomer Azran wrote:
> I tend to agree with Klaus – I don't think that having a hook that
> bypasses stonith is the right way. It is better to not use stonith at
> all.
> 
> I think I will try to use an iSCSI target on my qdevice and set SBD to
> use it.

Certainly, two levels of real stonith is best -- but (at the risk of
committing heresy) I can see Shashank's point.

If you've got extensive redundancy everywhere else, then the chances of
something taking out both the node and its IPMI, yet still allowing it
to interfere with shared resources, are very small. It comes down to
whether you're willing to accept that small risk. Such a setup is
definitely better than disabling fencing altogether, because the IPMI
fence level safely handles all node failure scenarios that don't also
take out the IPMI (and the bypass actually handles a complete power cut
safely).

If you can do a second level of real fencing, that is of course
preferred.

> I still don't understand why qdevice can't take the place of SBD with
> shared storage; correct me if I'm wrong, but it looks like both of
> them are there for the same reason.
> 
>  
> 
> From: Klaus Wenninger [mailto:kwenn...@redhat.com] 
> Sent: Monday, July 24, 2017 9:01 PM
> To: Cluster Labs - All topics related to open-source clustering
> welcomed ; Prasad, Shashank 
> Subject: Re: [ClusterLabs] Two nodes cluster issue
> 
> 
>  
> 
> On 07/24/2017 07:32 PM, Prasad, Shashank wrote:
> 
> 
> Sometimes IPMI fence devices use shared power of the node, and
> it cannot be avoided.
> 
> In such scenarios the HA cluster is NOT able to handle the
> power failure of a node, since the power is shared with its
> own fence device.
> 
> The failure of IPMI based fencing can also exist due to other
> reasons also.
> 
>  
> 
> A failure to fence the failed node will cause cluster to be
> marked UNCLEAN.
> 
> To get over it, the following command needs to be invoked on
> the surviving node.
> 
>  
> 
> pcs stonith confirm  --force
> 
>  
> 
> This can be automated by hooking a recovery script, when the
> the Stonith resource ‘Timed Out’ event.
> 
> To be more specific, the Pacemaker Alerts can be used for
> watch for Stonith timeouts and failures.
> 
> In that script, all that’s essentially to be executed is the
> aforementioned command.
> 
> 
> 
> If I get you right here you can disable fencing then in the first
> place.
> Actually quorum-based-watchdog-fencing is the way to do this in a
> safe manner. This of course assumes you have a proper source for
> quorum in your 2-node-setup with e.g. qdevice or using a shared
> disk with sbd (not directly pacemaker quorum here but similar thing
> handled inside sbd).
> 
> 
> 
> 
> Since the alerts are issued from ‘hacluster’ login, sudo
> permissions for ‘hacluster’ needs to be configured.
> 
>  
> 
> Thanx.
> 
>  
> 
>  
> 
> From: Klaus Wenninger [mailto:kwenn...@redhat.com] 
> Sent: Monday, July 24, 2017 9:24 PM
> To: Kristián Feldsam; Cluster Labs - All topics related to
> open-source clustering welcomed
> Subject: Re: [ClusterLabs] Two nodes cluster issue
> 
> 
>  
> 
> On 07/24/2017 05:37 PM, Kristián Feldsam wrote:
> 
> 
> I personally think that power off node by switched pdu
> is more safe, or not?
> 
> 
> 
> True if that is working in you environment. If you can't do a
> physical setup
> where you aren't simultaneously loosing connection to both
> your node and
> the switch-device (or you just want to cover cases where that
> happens)
> you have to come up with something else.
> 
> 
> 
> 
> 
> 
> S pozdravem Kristián Feldsam
> Tel.: +420 773 303 353, +421 944 137 535
> E-mail.: supp...@feldhost.cz
> 
> www.feldhost.cz - FeldHost™ – profesionální hostingové a
> serverové služby za adekvátní ceny.
> 
> FELDSAM s.r.o.
> V rohu 434/3
> Praha 4 – Libuš, PSČ 142 00
> IČ: 290 60 958, DIČ: CZ290 60 958
> C 200350 vedená u Městského soudu v Praze
> 
> Banka: Fio banka a.s.
> Číslo účtu: 2400330446/2010
> BIC: FIOBCZPPXX
> IBAN: CZ82 2010  0024 0033 0446 
> 
> 
>  
> 
> On 24 Jul 2017, at 17:27, Klaus Wenninger
>  wrote:
> 
> 

Re: [ClusterLabs] resources do not migrate although node is going to standby

2017-07-24 Thread Ken Gaillot
On Mon, 2017-07-24 at 20:52 +0200, Lentes, Bernd wrote:
> Hi,
> 
> just to be sure:
> i have a VirtualDomain resource (called prim_vm_servers_alive) running on one 
> node (ha-idg-2). From reasons i don't remember i have a location constraint:
> location cli-prefer-prim_vm_servers_alive prim_vm_servers_alive role=Started 
> inf: ha-idg-2
> 
> Now i try to set this node into standby, because i need it to reboot.
> From what i think now the resource can't migrate to node ha-idg-1 because of 
> this constraint. Right ?

Right, the "inf:" makes it mandatory. BTW, the "cli-" at the beginning
indicates that this was created by a command-line tool such as pcs, crm
shell or crm_resource. Such tools implement "ban"/"move" type commands
by adding such constraints, and then offer a separate manual command to
remove such constraints (e.g. "pcs resource clear").
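For example (a sketch; the resource and constraint names are taken from Bernd's 
configuration, and which command applies depends on the tool that created the 
constraint):

    # pcs
    pcs resource clear prim_vm_servers_alive
    # crm shell, something like
    crm resource unmove prim_vm_servers_alive
    # or delete the constraint directly
    crm configure delete cli-prefer-prim_vm_servers_alive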

> 
> That's what the log says:
> Jul 21 18:03:50 ha-idg-2 VirtualDomain(prim_vm_servers_alive)[28565]: ERROR: 
> Server_Monitoring: live migration to qemu+ssh://ha-idg-1/system  failed: 1
> Jul 21 18:03:50 ha-idg-2 lrmd[8573]:   notice: operation_finished: 
> prim_vm_servers_alive_migrate_to_0:28565:stderr [ error: Requested operation 
> is not valid: domain 'Server_Monitoring' is already active ]
> Jul 21 18:03:50 ha-idg-2 crmd[8576]:   notice: process_lrm_event: Operation 
> prim_vm_servers_alive_migrate_to_0: unknown error (node=ha-idg-2, call=114, 
> rc=1, cib-update=572, confirmed=true)
> Jul 21 18:03:50 ha-idg-2 crmd[8576]:   notice: process_lrm_event: 
> ha-idg-2-prim_vm_servers_alive_migrate_to_0:114 [ error: Requested operation 
> is not valid: domain 'Server_Monitoring' is already active\n ]
> Jul 21 18:03:50 ha-idg-2 crmd[8576]:  warning: status_from_rc: Action 64 
> (prim_vm_servers_alive_migrate_to_0) on ha-idg-2 failed (target: 0 vs. rc: 
> 1): Error
> Jul 21 18:03:50 ha-idg-2 crmd[8576]:   notice: abort_transition_graph: 
> Transition aborted by prim_vm_servers_alive_migrate_to_0 'modify' on 
> ha-idg-2: Event failed 
> (magic=0:1;64:417:0:656ecd4a-f8e8-46c9-b4e6-194616237988, cib=0.879.5, sou
> rce=match_graph_event:350, 0)
> Jul 21 18:03:50 ha-idg-2 crmd[8576]:  warning: status_from_rc: Action 64 
> (prim_vm_servers_alive_migrate_to_0) on ha-idg-2 failed (target: 0 vs. rc: 
> 1): Error
> Jul 21 18:03:53 ha-idg-2 VirtualDomain(prim_vm_mausdb)[28564]: ERROR: 
> mausdb_vm: live migration to qemu+ssh://ha-idg-1/system  failed: 1
> 
> That is the way i understand "Requested operation is not valid". It's not 
> possible because of the constraint.
> I just wanted to be sure. And because the resource can't be migrated but the 
> host is going to standby the resource is stopped. Right ?
> 
> Strange is that a second resource also running on node ha-idg-2 called 
> prim_vm_mausdb also didn't migrate to the other node. And that's something i 
> don't understand completely.
> The resource didn't have any location constraint.
> Both VirtualDomains have a vnc server configured (that i can monitor the boot 
> procedure if i have starting problems). The vnc port for prim_vm_mausdb is 
> 5900 in the configuration file.
> The port is set to auto for prim_vm_servers_alive because i forgot to 
> configure it fix. So it must be s.th like 5900+ because both resources were 
> running concurrently on the same node.
> But prim_vm_mausdb can't migrate because the port is occupied on the other 
> node ha-idg-1:
> 
> Jul 21 18:03:53 ha-idg-2 VirtualDomain(prim_vm_mausdb)[28564]: ERROR: 
> mausdb_vm: live migration to qemu+ssh://ha-idg-1/system  failed: 1
> Jul 21 18:03:53 ha-idg-2 lrmd[8573]:   notice: operation_finished: 
> prim_vm_mausdb_migrate_to_0:28564:stderr [ error: internal error: early end 
> of file from monitor: possible problem: ]
> Jul 21 18:03:53 ha-idg-2 lrmd[8573]:   notice: operation_finished: 
> prim_vm_mausdb_migrate_to_0:28564:stderr [ Failed to start VNC server on 
> `127.0.0.1:0,share=allow-exclusive': Failed to bind socket: Address already 
> in use ]
> Jul 21 18:03:53 ha-idg-2 lrmd[8573]:   notice: operation_finished: 
> prim_vm_mausdb_migrate_to_0:28564:stderr [  ]
> Jul 21 18:03:53 ha-idg-2 crmd[8576]:   notice: process_lrm_event: Operation 
> prim_vm_mausdb_migrate_to_0: unknown error (node=ha-idg-2, call=110, rc=1, 
> cib-update=573, confirmed=true)
> Jul 21 18:03:53 ha-idg-2 crmd[8576]:   notice: process_lrm_event: 
> ha-idg-2-prim_vm_mausdb_migrate_to_0:110 [ error: internal error: early end 
> of file from monitor: possible problem:\nFailed to start VNC server on 
> `127.0.0.1:0,share=allow
> -exclusive': Failed to bind socket: Address already in use\n\n ]
> Jul 21 18:03:53 ha-idg-2 crmd[8576]:  warning: status_from_rc: Action 51 
> (prim_vm_mausdb_migrate_to_0) on ha-idg-2 failed (target: 0 vs. rc: 1): Error
> Jul 21 18:03:53 ha-idg-2 crmd[8576]:  warning: status_from_rc: Action 51 
> (prim_vm_mausdb_migrate_to_0) on ha-idg-2 failed (target: 0 vs. rc: 1): Error
> 
> Do i understand it correctly that the port is occupied on the node it 

Re: [ClusterLabs] timeout for stop VirtualDomain running Windows 7

2017-07-24 Thread Ken Gaillot
On Mon, 2017-07-24 at 19:30 +0200, Lentes, Bernd wrote:
> Hi,
> 
> i have a VirtualDomain resource running a Windows 7 client. This is the 
> respective configuration:
> 
> primitive prim_vm_servers_alive VirtualDomain \
> params config="/var/lib/libvirt/images/xml/Server_Monitoring.xml" \
> params hypervisor="qemu:///system" \
> params migration_transport=ssh \
> params autoset_utilization_cpu=false \
> params autoset_utilization_hv_memory=false \
> op start interval=0 timeout=120 \
> op stop interval=0 timeout=130 \
> op monitor interval=30 timeout=30 \
> op migrate_from interval=0 timeout=180 \
> op migrate_to interval=0 timeout=190 \
> meta allow-migrate=true target-role=Started is-managed=true
> 
> The timeout for the stop operation is 130 seconds. But our windows 7 clients, 
> as most do, install updates from time to time.
> And then a shutdown can take 10 or 20 minutes or even longer.
> If the timeout isn't as long as the installation of the updates takes then 
> the vm is forced off. With all possible negative consequences.
> But on the other hand i don't like to set a timeout of e.g. 20 minutes, which 
> may still not be enough in some circumstances, but is much too long
> if the guest doesn't install updates.
> 
> Any ideas ?
> 
> Thanks.
> 
> 
> Bernd

If you can restrict updates to a certain time window, you can set up a
rule that uses a longer timeout during that window.

If you can't restrict the time window, but you can run a script when
updates are done, you could set a node attribute at that time (and clear
it on reboot), and use a similar rule based on the attribute.
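A hedged sketch of the attribute half of that idea (the attribute name is made 
up, and the rule that consumes it would still have to be added to the resource's 
stop operation):

    # mark the node while the guest is installing updates
    crm_attribute --type nodes --node ha-idg-2 --name long_vm_stop --update true
    # clear it again once the guest has rebooted cleanly
    crm_attribute --type nodes --node ha-idg-2 --name long_vm_stop --delete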
-- 
Ken Gaillot 





___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Tomer Azran
I tend to agree with Klaus – I don't think that having a hook that bypasses 
stonith is the right way. It is better to not use stonith at all.
I think I will try to use an iSCSI target on my qdevice and set SBD to use it.
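A rough sketch of that (the device path is a placeholder; check the sbd 
documentation for your distribution before relying on it):

    # initialize the sbd slots on the shared iSCSI disk (run once)
    sbd -d /dev/disk/by-id/scsi-MY_ISCSI_DISK create
    sbd -d /dev/disk/by-id/scsi-MY_ISCSI_DISK dump
    # then point sbd at the disk and a watchdog in /etc/sysconfig/sbd on both
    # nodes, e.g. SBD_DEVICE and SBD_WATCHDOG_DEV, and enable the sbd service
    # before restarting the cluster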
I still don't understand why qdevice can't take the place of SBD with shared 
storage; correct me if I'm wrong, but it looks like both of them are there for 
the same reason.

From: Klaus Wenninger [mailto:kwenn...@redhat.com]
Sent: Monday, July 24, 2017 9:01 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Prasad, Shashank 
Subject: Re: [ClusterLabs] Two nodes cluster issue

On 07/24/2017 07:32 PM, Prasad, Shashank wrote:
Sometimes IPMI fence devices use shared power of the node, and it cannot be 
avoided.
In such scenarios the HA cluster is NOT able to handle the power failure of a 
node, since the power is shared with its own fence device.
The failure of IPMI based fencing can also occur for other reasons.

A failure to fence the failed node will cause the cluster to be marked UNCLEAN.
To get over it, the following command needs to be invoked on the surviving node.

pcs stonith confirm <node> --force

This can be automated by hooking in a recovery script when the Stonith 
resource reports a ‘Timed Out’ event.
To be more specific, the Pacemaker Alerts can be used to watch for Stonith 
timeouts and failures.
In that script, all that essentially has to be executed is the aforementioned 
command.

If I get you right here you can disable fencing then in the first place.
Actually quorum-based-watchdog-fencing is the way to do this in a
safe manner. This of course assumes you have a proper source for
quorum in your 2-node-setup with e.g. qdevice or using a shared
disk with sbd (not directly pacemaker quorum here but similar thing
handled inside sbd).


Since the alerts are issued from the ‘hacluster’ login, sudo permissions for 
‘hacluster’ need to be configured.

Thanx.


From: Klaus Wenninger [mailto:kwenn...@redhat.com]
Sent: Monday, July 24, 2017 9:24 PM
To: Kristián Feldsam; Cluster Labs - All topics related to open-source 
clustering welcomed
Subject: Re: [ClusterLabs] Two nodes cluster issue

On 07/24/2017 05:37 PM, Kristián Feldsam wrote:
I personally think that power off node by switched pdu is more safe, or not?

True if that is working in you environment. If you can't do a physical setup
where you aren't simultaneously loosing connection to both your node and
the switch-device (or you just want to cover cases where that happens)
you have to come up with something else.





On 24 Jul 2017, at 17:27, Klaus Wenninger 
> wrote:

On 07/24/2017 05:15 PM, Tomer Azran wrote:
I still don't understand why the qdevice concept doesn't help on this 
situation. Since the master node is down, I would expect the quorum to declare 
it as dead.
Why doesn't it happens?

That is not how quorum works. It just limits the decision-making to the quorate 
subset of the cluster.
Still the unknown nodes are not sure to be down.
That is why I suggested to have quorum-based watchdog-fencing with sbd.
That would assure that within a certain time all nodes of the non-quorate part
of the cluster are down.
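The 'certain time' is essentially the watchdog timeout; a hedged sketch of the 
cluster-side knob (the value is only an example and has to be safely larger than 
sbd's own SBD_WATCHDOG_TIMEOUT):

    pcs property set stonith-watchdog-timeout=10s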






On Mon, Jul 24, 2017 at 4:15 PM +0300, "Dmitri Maziuk" 
> wrote:

On 2017-07-24 07:51, Tomer Azran wrote:

> We don't have the ability to use it.

> Is that the only solution?



No, but I'd recommend thinking about it first. Are you sure you will

care about your cluster working when your server room is on fire? 'Cause

unless you have halon suppression, your server room is a complete

write-off anyway. (Think water from sprinklers hitting rich chunky volts

in the servers.)



Dima



___

Users mailing list: Users@clusterlabs.org

http://lists.clusterlabs.org/mailman/listinfo/users



Project Home: http://www.clusterlabs.org

Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

Bugs: http://bugs.clusterlabs.org





___

Users mailing list: Users@clusterlabs.org

http://lists.clusterlabs.org/mailman/listinfo/users



Project Home: http://www.clusterlabs.org

Getting started: 

Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Kristián Feldsam
yes, I just had an idea: he probably has a managed switch or fabric...

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 24 Jul 2017, at 22:18, Klaus Wenninger  wrote:
> 
> On 07/24/2017 09:46 PM, Kristián Feldsam wrote:
>> so why not use some other fencing method, like disabling the port on the switch, so 
>> nobody can access the faulty node and write data to it? it is common practice 
>> too.
> 
> Well don't get me wrong here. I don't want to hard-sell sbd.
> Just though that very likely requirements that prevent usage
> of a remote-controlled power-switch will make access
> to a switch to disable the ports unusable as well.
> And if a working qdevice setup is there already the gap between
> what he thought he would get from qdevice and what he actually
> had just matches exactly quorum-based-watchdog-fencing.
> 
> But you are of course right.
> I don't really know the scenario.
> Maybe fabric fencing is the perfect match - good to mention it
> here as a possibility.
> 
> Regards,
> Klaus
>   
>> 
>> 
>>> On 24 Jul 2017, at 21:16, Klaus Wenninger >> > wrote:
>>> 
>>> On 07/24/2017 08:27 PM, Prasad, Shashank wrote:
 My understanding is that  SBD will need a shared storage between clustered 
 nodes.
 And that, SBD will need at least 3 nodes in a cluster, if using w/o shared 
 storage.
>>> 
>>> Haven't tried to be honest but reason for 3 nodes is that without
>>> shared disk you need a real quorum-source and not something
>>> 'faked' as with 2-node-feature in corosync.
>>> But I don't see anything speaking against getting the proper
>>> quorum via qdevice instead with a third full cluster-node.
>>> 
  
 Therefore, for systems which do NOT use shared storage between 1+1 HA 
 clustered nodes, SBD may NOT be an option.
 Correct me, if I am wrong.
  
 For cluster systems using the likes of iDRAC/IMM2 fencing agents, which 
 have redundant but shared power supply units with the nodes, the normal 
 fencing mechanisms should work for all resiliency scenarios, but for 
 IMM2/iDRAC are being NOT reachable for whatsoever reasons. And, to bail 
 out of those situations in the absence of SBD, I believe using 
 used-defined failover hooks (via scripts) into Pacemaker Alerts, with sudo 
 permissions for ‘hacluster’, should help.
>>> 
>>> If you don't see your fencing device assuming after some time
>>> the the corresponding node will probably be down is quite risky
>>> in my opinion.
>>> But why not assure it to be down using a watchdog?
>>> 
  
 Thanx.
  
  
 From: Klaus Wenninger [mailto:kwenn...@redhat.com 
 ] 
 Sent: Monday, July 24, 2017 11:31 PM
 To: Cluster Labs - All topics related to open-source clustering welcomed; 
 Prasad, Shashank
 Subject: Re: [ClusterLabs] Two nodes cluster issue
  
 On 07/24/2017 07:32 PM, Prasad, Shashank wrote:
 Sometimes IPMI fence devices use shared power of the node, and it cannot 
 be avoided.
 In such scenarios the HA cluster is NOT able to handle the power failure 
 of a node, since the power is shared with its own fence device.
 The failure of IPMI based fencing can also exist due to other reasons also.
  
 A failure to fence the failed node will cause cluster to be marked UNCLEAN.
 To get over it, the following command needs to be invoked on the surviving 
 node.
  
 pcs stonith confirm  --force
  
 This can be automated by hooking a recovery script, when the the Stonith 
 resource ‘Timed Out’ event.
 To be more specific, the Pacemaker Alerts can be used for watch for 
 Stonith timeouts and failures.
 In that script, all that’s essentially to be executed is the 
 aforementioned command.
 
 If I get you right here you can disable fencing then in the first place.
 Actually quorum-based-watchdog-fencing is the way to do this in a
 safe manner. This of course assumes you have a proper source for

Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Klaus Wenninger
On 07/24/2017 09:46 PM, Kristián Feldsam wrote:
> so why to use some other fencing method like disablink port on switch,
> so nobody can acces faultly node and write data to it. it is common
> practice too.

Well don't get me wrong here. I don't want to hard-sell sbd.
I just thought that very likely the requirements that prevent usage
of a remote-controlled power-switch will make access
to a switch to disable the ports unusable as well.
And if a working qdevice setup is there already, the gap between
what he thought he would get from qdevice and what he actually
had matches exactly quorum-based-watchdog-fencing.

But you are of course right.
I don't really know the scenario.
Maybe fabric fencing is the perfect match - good to mention it
here as a possibility.

Regards,
Klaus
 
>
>
>> On 24 Jul 2017, at 21:16, Klaus Wenninger > > wrote:
>>
>> On 07/24/2017 08:27 PM, Prasad, Shashank wrote:
>>> My understanding is that  SBD will need a shared storage between
>>> clustered nodes.
>>> And that, SBD will need at least 3 nodes in a cluster, if using w/o
>>> shared storage.
>>
>> Haven't tried to be honest but reason for 3 nodes is that without
>> shared disk you need a real quorum-source and not something
>> 'faked' as with 2-node-feature in corosync.
>> But I don't see anything speaking against getting the proper
>> quorum via qdevice instead with a third full cluster-node.
>>
>>>  
>>> Therefore, for systems which do NOT use shared storage between 1+1
>>> HA clustered nodes, SBD may NOT be an option.
>>> Correct me, if I am wrong.
>>>  
>>> For cluster systems using the likes of iDRAC/IMM2 fencing agents,
>>> which have redundant but shared power supply units with the nodes,
>>> the normal fencing mechanisms should work for all resiliency
>>> scenarios, but for IMM2/iDRAC are being NOT reachable for whatsoever
>>> reasons. And, to bail out of those situations in the absence of SBD,
>>> I believe using used-defined failover hooks (via scripts) into
>>> Pacemaker Alerts, with sudo permissions for ‘hacluster’, should help.
>>
>> If you don't see your fencing device assuming after some time
>> the the corresponding node will probably be down is quite risky
>> in my opinion.
>> But why not assure it to be down using a watchdog?
>>
>>>  
>>> Thanx.
>>>  
>>>  
>>> *From:* Klaus Wenninger [mailto:kwenn...@redhat.com] 
>>> *Sent:* Monday, July 24, 2017 11:31 PM
>>> *To:* Cluster Labs - All topics related to open-source clustering
>>> welcomed; Prasad, Shashank
>>> *Subject:* Re: [ClusterLabs] Two nodes cluster issue
>>>  
>>> On 07/24/2017 07:32 PM, Prasad, Shashank wrote:
>>>
>>> Sometimes IPMI fence devices use shared power of the node, and
>>> it cannot be avoided.
>>> In such scenarios the HA cluster is NOT able to handle the power
>>> failure of a node, since the power is shared with its own fence
>>> device.
>>> The failure of IPMI based fencing can also exist due to other
>>> reasons also.
>>>  
>>> A failure to fence the failed node will cause cluster to be
>>> marked UNCLEAN.
>>> To get over it, the following command needs to be invoked on the
>>> surviving node.
>>>  
>>> pcs stonith confirm  --force
>>>  
>>> This can be automated by hooking a recovery script, when the the
>>> Stonith resource ‘Timed Out’ event.
>>> To be more specific, the Pacemaker Alerts can be used for watch
>>> for Stonith timeouts and failures.
>>> In that script, all that’s essentially to be executed is the
>>> aforementioned command.
>>>
>>>
>>> If I get you right here you can disable fencing then in the first place.
>>> Actually quorum-based-watchdog-fencing is the way to do this in a
>>> safe manner. This of course assumes you have a proper source for
>>> quorum in your 2-node-setup with e.g. qdevice or using a shared
>>> disk with sbd (not directly pacemaker quorum here but similar thing
>>> handled inside sbd).
>>>
>>>
>>> Since the alerts are issued from ‘hacluster’ login, sudo permissions
>>> for ‘hacluster’ needs to be configured.
>>>  
>>> Thanx.
>>>  
>>>  
>>> *From:* Klaus Wenninger [mailto:kwenn...@redhat.com] 
>>> *Sent:* Monday, July 24, 2017 9:24 PM
>>> *To:* Kristián Feldsam; Cluster Labs - All topics related to
>>> open-source clustering welcomed
>>> *Subject:* Re: [ClusterLabs] Two nodes cluster issue
>>>  
>>> On 07/24/2017 05:37 PM, Kristián Feldsam wrote:
>>>
>>> I 

Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Kristián Feldsam
so why not use some other fencing method, like disabling the port on the switch, so 
nobody can access the faulty node and write data to it? it is common practice too.

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 24 Jul 2017, at 21:16, Klaus Wenninger  wrote:
> 
> On 07/24/2017 08:27 PM, Prasad, Shashank wrote:
>> My understanding is that  SBD will need a shared storage between clustered 
>> nodes.
>> And that, SBD will need at least 3 nodes in a cluster, if using w/o shared 
>> storage.
> 
> Haven't tried to be honest but reason for 3 nodes is that without
> shared disk you need a real quorum-source and not something
> 'faked' as with 2-node-feature in corosync.
> But I don't see anything speaking against getting the proper
> quorum via qdevice instead with a third full cluster-node.
> 
>>  
>> Therefore, for systems which do NOT use shared storage between 1+1 HA 
>> clustered nodes, SBD may NOT be an option.
>> Correct me, if I am wrong.
>>  
>> For cluster systems using the likes of iDRAC/IMM2 fencing agents, which have 
>> redundant but shared power supply units with the nodes, the normal fencing 
>> mechanisms should work for all resiliency scenarios, but for IMM2/iDRAC are 
>> being NOT reachable for whatsoever reasons. And, to bail out of those 
>> situations in the absence of SBD, I believe using used-defined failover 
>> hooks (via scripts) into Pacemaker Alerts, with sudo permissions for 
>> ‘hacluster’, should help.
> 
> If you don't see your fencing device assuming after some time
> the the corresponding node will probably be down is quite risky
> in my opinion.
> But why not assure it to be down using a watchdog?
> 
>>  
>> Thanx.
>>  
>>  
>> From: Klaus Wenninger [mailto:kwenn...@redhat.com 
>> ] 
>> Sent: Monday, July 24, 2017 11:31 PM
>> To: Cluster Labs - All topics related to open-source clustering welcomed; 
>> Prasad, Shashank
>> Subject: Re: [ClusterLabs] Two nodes cluster issue
>>  
>> On 07/24/2017 07:32 PM, Prasad, Shashank wrote:
>> Sometimes IPMI fence devices use shared power of the node, and it cannot be 
>> avoided.
>> In such scenarios the HA cluster is NOT able to handle the power failure of 
>> a node, since the power is shared with its own fence device.
>> The failure of IPMI based fencing can also exist due to other reasons also.
>>  
>> A failure to fence the failed node will cause cluster to be marked UNCLEAN.
>> To get over it, the following command needs to be invoked on the surviving 
>> node.
>>  
>> pcs stonith confirm  --force
>>  
>> This can be automated by hooking a recovery script, when the the Stonith 
>> resource ‘Timed Out’ event.
>> To be more specific, the Pacemaker Alerts can be used for watch for Stonith 
>> timeouts and failures.
>> In that script, all that’s essentially to be executed is the aforementioned 
>> command.
>> 
>> If I get you right here you can disable fencing then in the first place.
>> Actually quorum-based-watchdog-fencing is the way to do this in a
>> safe manner. This of course assumes you have a proper source for
>> quorum in your 2-node-setup with e.g. qdevice or using a shared
>> disk with sbd (not directly pacemaker quorum here but similar thing
>> handled inside sbd).
>> 
>> 
>> Since the alerts are issued from ‘hacluster’ login, sudo permissions for 
>> ‘hacluster’ needs to be configured.
>>  
>> Thanx.
>>  
>>  
>> From: Klaus Wenninger [mailto:kwenn...@redhat.com 
>> ] 
>> Sent: Monday, July 24, 2017 9:24 PM
>> To: Kristián Feldsam; Cluster Labs - All topics related to open-source 
>> clustering welcomed
>> Subject: Re: [ClusterLabs] Two nodes cluster issue
>>  
>> On 07/24/2017 05:37 PM, Kristián Feldsam wrote:
>> I personally think that power off node by switched pdu is more safe, or not?
>> 
>> True if that is working in you environment. If you can't do a physical setup
>> where you aren't simultaneously loosing connection to both your node and
>> the switch-device (or you just want to cover cases where that happens)
>> you have to come up with something else.
>> 
>> 
>> 
>> 

Re: [ClusterLabs] resources do not migrate although node is going to standby

2017-07-24 Thread Kristián Feldsam
hmm, i think that it is just a preferred location; if it is not available, the server 
should start on the other node. you can of course migrate manually by crm resource 
move resource_name node_name - which in effect changes that location preference

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 24 Jul 2017, at 20:52, Lentes, Bernd  
> wrote:
> 
> Hi,
> 
> just to be sure:
> i have a VirtualDomain resource (called prim_vm_servers_alive) running on one 
> node (ha-idg-2). From reasons i don't remember i have a location constraint:
> location cli-prefer-prim_vm_servers_alive prim_vm_servers_alive role=Started 
> inf: ha-idg-2
> 
> Now i try to set this node into standby, because i need it to reboot.
> From what i think now the resource can't migrate to node ha-idg-1 because of 
> this constraint. Right ?
> 
> That's what the log says:
> Jul 21 18:03:50 ha-idg-2 VirtualDomain(prim_vm_servers_alive)[28565]: ERROR: 
> Server_Monitoring: live migration to qemu+ssh://ha-idg-1/system  failed: 1
> Jul 21 18:03:50 ha-idg-2 lrmd[8573]:   notice: operation_finished: 
> prim_vm_servers_alive_migrate_to_0:28565:stderr [ error: Requested operation 
> is not valid: domain 'Server_Monitoring' is already active ]
> Jul 21 18:03:50 ha-idg-2 crmd[8576]:   notice: process_lrm_event: Operation 
> prim_vm_servers_alive_migrate_to_0: unknown error (node=ha-idg-2, call=114, 
> rc=1, cib-update=572, confirmed=true)
> Jul 21 18:03:50 ha-idg-2 crmd[8576]:   notice: process_lrm_event: 
> ha-idg-2-prim_vm_servers_alive_migrate_to_0:114 [ error: Requested operation 
> is not valid: domain 'Server_Monitoring' is already active\n ]
> Jul 21 18:03:50 ha-idg-2 crmd[8576]:  warning: status_from_rc: Action 64 
> (prim_vm_servers_alive_migrate_to_0) on ha-idg-2 failed (target: 0 vs. rc: 
> 1): Error
> Jul 21 18:03:50 ha-idg-2 crmd[8576]:   notice: abort_transition_graph: 
> Transition aborted by prim_vm_servers_alive_migrate_to_0 'modify' on 
> ha-idg-2: Event failed 
> (magic=0:1;64:417:0:656ecd4a-f8e8-46c9-b4e6-194616237988, cib=0.879.5, sou
> rce=match_graph_event:350, 0)
> Jul 21 18:03:50 ha-idg-2 crmd[8576]:  warning: status_from_rc: Action 64 
> (prim_vm_servers_alive_migrate_to_0) on ha-idg-2 failed (target: 0 vs. rc: 
> 1): Error
> Jul 21 18:03:53 ha-idg-2 VirtualDomain(prim_vm_mausdb)[28564]: ERROR: 
> mausdb_vm: live migration to qemu+ssh://ha-idg-1/system  failed: 1
> 
> That is the way i understand "Requested operation is not valid". It's not 
> possible because of the constraint.
> I just wanted to be sure. And because the resource can't be migrated but the 
> host is going to standby the resource is stopped. Right ?
> 
> Strange is that a second resource also running on node ha-idg-2 called 
> prim_vm_mausdb also didn't migrate to the other node. And that's something i 
> don't understand completely.
> The resource didn't have any location constraint.
> Both VirtualDomains have a vnc server configured (that i can monitor the boot 
> procedure if i have starting problems). The vnc port for prim_vm_mausdb is 
> 5900 in the configuration file.
> The port is set to auto for prim_vm_servers_alive because i forgot to 
> configure it fix. So it must be s.th like 5900+ because both resources were 
> running concurrently on the same node.
> But prim_vm_mausdb can't migrate because the port is occupied on the other 
> node ha-idg-1:
> 
> Jul 21 18:03:53 ha-idg-2 VirtualDomain(prim_vm_mausdb)[28564]: ERROR: 
> mausdb_vm: live migration to qemu+ssh://ha-idg-1/system  failed: 1
> Jul 21 18:03:53 ha-idg-2 lrmd[8573]:   notice: operation_finished: 
> prim_vm_mausdb_migrate_to_0:28564:stderr [ error: internal error: early end 
> of file from monitor: possible problem: ]
> Jul 21 18:03:53 ha-idg-2 lrmd[8573]:   notice: operation_finished: 
> prim_vm_mausdb_migrate_to_0:28564:stderr [ Failed to start VNC server on 
> `127.0.0.1:0,share=allow-exclusive': Failed to bind socket: Address already 
> in use ]
> Jul 21 18:03:53 ha-idg-2 lrmd[8573]:   notice: operation_finished: 
> prim_vm_mausdb_migrate_to_0:28564:stderr [  ]
> Jul 21 18:03:53 ha-idg-2 crmd[8576]:   notice: process_lrm_event: Operation 
> prim_vm_mausdb_migrate_to_0: unknown error (node=ha-idg-2, call=110, rc=1, 
> cib-update=573, confirmed=true)
> Jul 21 18:03:53 ha-idg-2 crmd[8576]:   notice: process_lrm_event: 
> ha-idg-2-prim_vm_mausdb_migrate_to_0:110 [ error: internal error: early end 
> of file from monitor: possible problem:\nFailed to start VNC server on 
> `127.0.0.1:0,share=allow
> -exclusive': Failed to bind socket: Address already in use\n\n ]
> Jul 21 18:03:53 ha-idg-2 

Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Klaus Wenninger
On 07/24/2017 08:27 PM, Prasad, Shashank wrote:
>
> My understanding is that  SBD will need a shared storage between
> clustered nodes.
>
> And that, SBD will need at least 3 nodes in a cluster, if using w/o
> shared storage.
>

Haven't tried it, to be honest, but the reason for 3 nodes is that without a
shared disk you need a real quorum source and not something
'faked' as with the 2-node feature in corosync.
But I don't see anything speaking against getting the proper
quorum via qdevice instead of with a third full cluster node.

>  
>
> Therefore, for systems which do NOT use shared storage between 1+1 HA
> clustered nodes, SBD may NOT be an option.
>
> Correct me, if I am wrong.
>
>  
>
> For cluster systems using the likes of iDRAC/IMM2 fencing agents,
> which have redundant but shared power supply units with the nodes, the
> normal fencing mechanisms should work for all resiliency scenarios,
> except when IMM2/iDRAC is NOT reachable for whatsoever reasons.
> And, to bail out of those situations in the absence of SBD, I believe
> using user-defined failover hooks (via scripts) into Pacemaker Alerts,
> with sudo permissions for ‘hacluster’, should help.
>

If you don't see your fencing device, assuming after some time
that the corresponding node will probably be down is quite risky
in my opinion.
But why not assure it to be down using a watchdog?

>  
>
> Thanx.
>
>  
>
>  
>
> *From:*Klaus Wenninger [mailto:kwenn...@redhat.com]
> *Sent:* Monday, July 24, 2017 11:31 PM
> *To:* Cluster Labs - All topics related to open-source clustering
> welcomed; Prasad, Shashank
> *Subject:* Re: [ClusterLabs] Two nodes cluster issue
>
>  
>
> On 07/24/2017 07:32 PM, Prasad, Shashank wrote:
>
> Sometimes IPMI fence devices use shared power of the node, and it
> cannot be avoided.
>
> In such scenarios the HA cluster is NOT able to handle the power
> failure of a node, since the power is shared with its own fence
> device.
>
> The failure of IPMI based fencing can also exist due to other
> reasons also.
>
>  
>
> A failure to fence the failed node will cause cluster to be marked
> UNCLEAN.
>
> To get over it, the following command needs to be invoked on the
> surviving node.
>
>  
>
> pcs stonith confirm  --force
>
>  
>
> This can be automated by hooking a recovery script, when the the
> Stonith resource ‘Timed Out’ event.
>
> To be more specific, the Pacemaker Alerts can be used for watch
> for Stonith timeouts and failures.
>
> In that script, all that’s essentially to be executed is the
> aforementioned command.
>
>
> If I get you right here you can disable fencing then in the first place.
> Actually quorum-based-watchdog-fencing is the way to do this in a
> safe manner. This of course assumes you have a proper source for
> quorum in your 2-node-setup with e.g. qdevice or using a shared
> disk with sbd (not directly pacemaker quorum here but similar thing
> handled inside sbd).
>
>
> Since the alerts are issued from ‘hacluster’ login, sudo permissions
> for ‘hacluster’ needs to be configured.
>
>  
>
> Thanx.
>
>  
>
>  
>
> *From:*Klaus Wenninger [mailto:kwenn...@redhat.com]
> *Sent:* Monday, July 24, 2017 9:24 PM
> *To:* Kristián Feldsam; Cluster Labs - All topics related to
> open-source clustering welcomed
> *Subject:* Re: [ClusterLabs] Two nodes cluster issue
>
>  
>
> On 07/24/2017 05:37 PM, Kristián Feldsam wrote:
>
> I personally think that power off node by switched pdu is more
> safe, or not?
>
>
> True if that is working in you environment. If you can't do a physical
> setup
> where you aren't simultaneously loosing connection to both your node and
> the switch-device (or you just want to cover cases where that happens)
> you have to come up with something else.
>
>
>
>
>
>  
>
> On 24 Jul 2017, at 17:27, Klaus Wenninger  > wrote:
>
>  
>
> On 07/24/2017 05:15 PM, Tomer Azran wrote:
>
> I still don't understand why the qdevice concept doesn't help
> on this situation. Since the master node is down, I would
> expect the quorum to declare it as dead.
>
> Why doesn't it happens?
>
>
> That is not how quorum works. It just limits the decision-making
> to the quorate subset of the cluster.
> Still the unknown nodes are not sure to be down.
> That is why I suggested to have quorum-based watchdog-fencing with
> 

[ClusterLabs] resources do not migrate although node is going to standby

2017-07-24 Thread Lentes, Bernd
Hi,

just to be sure:
i have a VirtualDomain resource (called prim_vm_servers_alive) running on one 
node (ha-idg-2). For reasons i don't remember i have a location constraint:
location cli-prefer-prim_vm_servers_alive prim_vm_servers_alive role=Started 
inf: ha-idg-2

Now i try to set this node into standby, because i need it to reboot.
From what i think now the resource can't migrate to node ha-idg-1 because of 
this constraint. Right ?

That's what the log says:
Jul 21 18:03:50 ha-idg-2 VirtualDomain(prim_vm_servers_alive)[28565]: ERROR: 
Server_Monitoring: live migration to qemu+ssh://ha-idg-1/system  failed: 1
Jul 21 18:03:50 ha-idg-2 lrmd[8573]:   notice: operation_finished: 
prim_vm_servers_alive_migrate_to_0:28565:stderr [ error: Requested operation is 
not valid: domain 'Server_Monitoring' is already active ]
Jul 21 18:03:50 ha-idg-2 crmd[8576]:   notice: process_lrm_event: Operation 
prim_vm_servers_alive_migrate_to_0: unknown error (node=ha-idg-2, call=114, 
rc=1, cib-update=572, confirmed=true)
Jul 21 18:03:50 ha-idg-2 crmd[8576]:   notice: process_lrm_event: 
ha-idg-2-prim_vm_servers_alive_migrate_to_0:114 [ error: Requested operation is 
not valid: domain 'Server_Monitoring' is already active\n ]
Jul 21 18:03:50 ha-idg-2 crmd[8576]:  warning: status_from_rc: Action 64 
(prim_vm_servers_alive_migrate_to_0) on ha-idg-2 failed (target: 0 vs. rc: 1): 
Error
Jul 21 18:03:50 ha-idg-2 crmd[8576]:   notice: abort_transition_graph: 
Transition aborted by prim_vm_servers_alive_migrate_to_0 'modify' on ha-idg-2: 
Event failed (magic=0:1;64:417:0:656ecd4a-f8e8-46c9-b4e6-194616237988, 
cib=0.879.5, sou
rce=match_graph_event:350, 0)
Jul 21 18:03:50 ha-idg-2 crmd[8576]:  warning: status_from_rc: Action 64 
(prim_vm_servers_alive_migrate_to_0) on ha-idg-2 failed (target: 0 vs. rc: 1): 
Error
Jul 21 18:03:53 ha-idg-2 VirtualDomain(prim_vm_mausdb)[28564]: ERROR: 
mausdb_vm: live migration to qemu+ssh://ha-idg-1/system  failed: 1

That is the way i understand "Requested operation is not valid". It's not 
possible because of the constraint.
I just wanted to be sure. And because the resource can't be migrated, but the 
host is going to standby, the resource is stopped. Right ?

What is strange is that a second resource also running on node ha-idg-2, called 
prim_vm_mausdb, also didn't migrate to the other node. And that's something i 
don't understand completely.
That resource doesn't have any location constraint.
Both VirtualDomains have a vnc server configured (so that i can monitor the boot 
procedure if i have starting problems). The vnc port for prim_vm_mausdb is 5900 
in the configuration file.
The port is set to auto for prim_vm_servers_alive because i forgot to configure 
a fixed one. So it must be something like 5900+ because both resources were running 
concurrently on the same node.
But prim_vm_mausdb can't migrate because the port is occupied on the other node 
ha-idg-1:

Jul 21 18:03:53 ha-idg-2 VirtualDomain(prim_vm_mausdb)[28564]: ERROR: 
mausdb_vm: live migration to qemu+ssh://ha-idg-1/system  failed: 1
Jul 21 18:03:53 ha-idg-2 lrmd[8573]:   notice: operation_finished: 
prim_vm_mausdb_migrate_to_0:28564:stderr [ error: internal error: early end of 
file from monitor: possible problem: ]
Jul 21 18:03:53 ha-idg-2 lrmd[8573]:   notice: operation_finished: 
prim_vm_mausdb_migrate_to_0:28564:stderr [ Failed to start VNC server on 
`127.0.0.1:0,share=allow-exclusive': Failed to bind socket: Address already in 
use ]
Jul 21 18:03:53 ha-idg-2 lrmd[8573]:   notice: operation_finished: 
prim_vm_mausdb_migrate_to_0:28564:stderr [  ]
Jul 21 18:03:53 ha-idg-2 crmd[8576]:   notice: process_lrm_event: Operation 
prim_vm_mausdb_migrate_to_0: unknown error (node=ha-idg-2, call=110, rc=1, 
cib-update=573, confirmed=true)
Jul 21 18:03:53 ha-idg-2 crmd[8576]:   notice: process_lrm_event: 
ha-idg-2-prim_vm_mausdb_migrate_to_0:110 [ error: internal error: early end of 
file from monitor: possible problem:\nFailed to start VNC server on 
`127.0.0.1:0,share=allow
-exclusive': Failed to bind socket: Address already in use\n\n ]
Jul 21 18:03:53 ha-idg-2 crmd[8576]:  warning: status_from_rc: Action 51 
(prim_vm_mausdb_migrate_to_0) on ha-idg-2 failed (target: 0 vs. rc: 1): Error
Jul 21 18:03:53 ha-idg-2 crmd[8576]:  warning: status_from_rc: Action 51 
(prim_vm_mausdb_migrate_to_0) on ha-idg-2 failed (target: 0 vs. rc: 1): Error

Do i understand it correctly that the port is occupied on the node it should 
migrate to (ha-idg-1) ?
But there is no vm running there and i don't have a standalone vnc server configured. 
Why is the port occupied ?
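One way to avoid such VNC port clashes in the first place (a sketch; 5901 is just 
an example value) is to pin a distinct fixed port per guest in its libvirt domain 
XML instead of using autoport:

    <graphics type='vnc' port='5901' autoport='no' listen='127.0.0.1'/>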

Btw: are the network sockets live-migrated too during a live migration of a 
VirtualDomain resource ?
It should be like that.

Thanks.


Bernd



-- 
Bernd Lentes 

Systemadministration 
institute of developmental genetics 
Gebäude 35.34 - Raum 208 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 (0)89 3187 1241 
fax: +49 (0)89 3187 2294 

no backup - no mercy
 

Helmholtz Zentrum Muenchen

Re: [ClusterLabs] timeout for stop VirtualDomain running Windows 7

2017-07-24 Thread Kristián Feldsam
hmm, is it possible to disable installing updates on shutdown, and do regular 
maintenance for updating manually?

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 24 Jul 2017, at 19:30, Lentes, Bernd  
> wrote:
> 
> Hi,
> 
> i have a VirtualDomian resource running a Windows 7 client. This is the 
> respective configuration:
> 
> primitive prim_vm_servers_alive VirtualDomain \
>params config="/var/lib/libvirt/images/xml/Server_Monitoring.xml" \
>params hypervisor="qemu:///system" \
>params migration_transport=ssh \
>params autoset_utilization_cpu=false \
>params autoset_utilization_hv_memory=false \
>op start interval=0 timeout=120 \
>op stop interval=0 timeout=130 \
>op monitor interval=30 timeout=30 \
>op migrate_from interval=0 timeout=180 \
>op migrate_to interval=0 timeout=190 \
>meta allow-migrate=true target-role=Started is-managed=true
> 
> The timeout for the stop operation is 130 seconds. But our windows 7 clients, 
> as most do, install updates from time to time .
> And then a shutdown can take 10 or 20 minutes or even longer.
> If the timeout isn't as long as the installation of the updates takes then 
> the vm is forced off. With all possible negative consequences.
> But on the other hand i don't like to set a timeout of eg. 20 minutes, which 
> may still not be enough in some circumstances, but is much too long
> if the guest doesn't install updates.
> 
> Any ideas ?
> 
> Thanks.
> 
> 
> Bernd
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Klaus Wenninger
On 07/24/2017 07:32 PM, Prasad, Shashank wrote:
>
> Sometimes IPMI fence devices use shared power of the node, and it
> cannot be avoided.
>
> In such scenarios the HA cluster is NOT able to handle the power
> failure of a node, since the power is shared with its own fence device.
>
> The failure of IPMI based fencing can also occur for other reasons.
>
>  
>
> A failure to fence the failed node will cause cluster to be marked
> UNCLEAN.
>
> To get over it, the following command needs to be invoked on the
> surviving node.
>
>  
>
> pcs stonith confirm  --force
>
>  
>
> This can be automated by hooking in a recovery script when the
> Stonith resource reports a ‘Timed Out’ event.
>
> To be more specific, the Pacemaker Alerts can be used to watch for
> Stonith timeouts and failures.
>
> In that script, all that essentially has to be executed is the
> aforementioned command.
>

If I get you right here, you could as well disable fencing in the first place.
Actually quorum-based-watchdog-fencing is the way to do this in a
safe manner. This of course assumes you have a proper source of
quorum in your 2-node setup, with e.g. qdevice, or using a shared
disk with sbd (not directly pacemaker quorum here but a similar thing
handled inside sbd).

> Since the alerts are issued from the ‘hacluster’ login, sudo permissions
> for ‘hacluster’ need to be configured.
>
>  
>
> Thanx.
>
>  
>
>  
>
> *From:*Klaus Wenninger [mailto:kwenn...@redhat.com]
> *Sent:* Monday, July 24, 2017 9:24 PM
> *To:* Kristián Feldsam; Cluster Labs - All topics related to
> open-source clustering welcomed
> *Subject:* Re: [ClusterLabs] Two nodes cluster issue
>
>  
>
> On 07/24/2017 05:37 PM, Kristián Feldsam wrote:
>
> I personally think that power off node by switched pdu is more
> safe, or not?
>
>
> True if that is working in you environment. If you can't do a physical
> setup
> where you aren't simultaneously loosing connection to both your node and
> the switch-device (or you just want to cover cases where that happens)
> you have to come up with something else.
>
>
>
>
>  
>
> On 24 Jul 2017, at 17:27, Klaus Wenninger  > wrote:
>
>  
>
> On 07/24/2017 05:15 PM, Tomer Azran wrote:
>
> I still don't understand why the qdevice concept doesn't help
> in this situation. Since the master node is down, I would
> expect the quorum to declare it as dead.
>
> Why doesn't it happen?
>
>
> That is not how quorum works. It just limits the decision-making
> to the quorate subset of the cluster.
> Still the unknown nodes are not sure to be down.
> That is why I suggested to have quorum-based watchdog-fencing with
> sbd.
> That would assure that within a certain time all nodes of the
> non-quorate part
> of the cluster are down.
>
>
>
>
> On Mon, Jul 24, 2017 at 4:15 PM +0300, "Dmitri
> Maziuk"  > wrote:
>
> On 2017-07-24 07:51, Tomer Azran wrote:
>
> > We don't have the ability to use it.
>
> > Is that the only solution?
>
>  
>
> No, but I'd recommend thinking about it first. Are you sure you will 
>
> care about your cluster working when your server room is on fire? 'Cause 
>
> unless you have halon suppression, your server room is a complete 
>
> write-off anyway. (Think water from sprinklers hitting rich chunky volts 
>
> in the servers.)
>
>  
>
> Dima
>
>  
>
> ___
>
> Users mailing list: Users@clusterlabs.org 
>
> http://lists.clusterlabs.org/mailman/listinfo/users
>
>  
>
> Project Home: http://www.clusterlabs.org 
>
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>
> Bugs: http://bugs.clusterlabs.org 
>
>
>
>
> ___
>
> Users mailing list: Users@clusterlabs.org 
>
> http://lists.clusterlabs.org/mailman/listinfo/users
>
>  
>
> Project Home: http://www.clusterlabs.org 
>
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>
> Bugs: http://bugs.clusterlabs.org 
>
>  
>
> -- 
>
> Klaus Wenninger
>
>  
>
> Senior 

Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Prasad, Shashank
Sometimes IPMI fence devices use shared power of the node, and it cannot be 
avoided.

In such scenarios the HA cluster is NOT able to handle the power failure of a 
node, since the power is shared with its own fence device.

The failure of IPMI-based fencing can also occur for other reasons.

 

A failure to fence the failed node will cause the cluster to be marked UNCLEAN.

To get over it, the following command needs to be invoked on the surviving node.

 

pcs stonith confirm  --force

 

This can be automated by hooking a recovery script to the Stonith
resource ‘Timed Out’ event.

To be more specific, the Pacemaker Alerts can be used to watch for Stonith
timeouts and failures.

In that script, all that essentially needs to be executed is the aforementioned
command.

Since the alerts are issued from the ‘hacluster’ login, sudo permissions for
‘hacluster’ need to be configured.
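
For illustration only, a rough sketch of such an alert agent; the script path, log
file, sudo use and the exact CRM_alert_* fields checked are assumptions, and
auto-confirming a fence is only safe if something else has verified that the node
really is powered off:

#!/bin/sh
# Illustrative Pacemaker alert agent - reacts to failed fencing events.
# Pacemaker passes the standard CRM_alert_* environment variables.
LOG=/var/log/stonith-failure-recovery.log

if [ "${CRM_alert_kind}" = "fencing" ] && [ "${CRM_alert_rc}" != "0" ]; then
    echo "$(date): fencing of ${CRM_alert_node} failed: ${CRM_alert_desc}" >> "$LOG"
    # Same effect as running 'pcs stonith confirm <node> --force' by hand.
    sudo pcs stonith confirm "${CRM_alert_node}" --force >> "$LOG" 2>&1
fi
exit 0

The agent would then be registered with something like
'pcs alert create path=/usr/local/bin/stonith-failure-recovery.sh'.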

 

Thanx.

 

 

From: Klaus Wenninger [mailto:kwenn...@redhat.com] 
Sent: Monday, July 24, 2017 9:24 PM
To: Kristián Feldsam; Cluster Labs - All topics related to open-source 
clustering welcomed
Subject: Re: [ClusterLabs] Two nodes cluster issue

 

On 07/24/2017 05:37 PM, Kristián Feldsam wrote:

I personally think that powering off the node by a switched PDU is safer, or
not?


True if that is working in your environment. If you can't do a physical setup
where you aren't simultaneously losing connection to both your node and
the switch-device (or you just want to cover cases where that happens),
you have to come up with something else.





S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446 

 

On 24 Jul 2017, at 17:27, Klaus Wenninger  wrote:

 

On 07/24/2017 05:15 PM, Tomer Azran wrote:

I still don't understand why the qdevice concept doesn't help
in this situation. Since the master node is down, I would expect the quorum to
declare it as dead.

Why doesn't it happen?


That is not how quorum works. It just limits the decision-making to the 
quorate subset of the cluster.
Still the unknown nodes are not sure to be down.
That is why I suggested to have quorum-based watchdog-fencing with sbd.
That would assure that within a certain time all nodes of the 
non-quorate part
of the cluster are down.








On Mon, Jul 24, 2017 at 4:15 PM +0300, "Dmitri Maziuk" 
 wrote:

On 2017-07-24 07:51, Tomer Azran wrote:
> We don't have the ability to use it.
> Is that the only solution?
 
No, but I'd recommend thinking about it first. Are you sure you will 
care about your cluster working when your server room is on fire? 
'Cause 
unless you have halon suppression, your server room is a complete 
write-off anyway. (Think water from sprinklers hitting rich chunky 
volts 
in the servers.)
 
Dima
 
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users
 
Project Home: http://www.clusterlabs.org  
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org  






___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users
 
Project Home: http://www.clusterlabs.org  
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org  

 

-- 
Klaus Wenninger
 
Senior Software Engineer, EMEA ENG Openstack Infrastructure
 
Red Hat
 
kwenn...@redhat.com   

___
Users mailing list: Users@clusterlabs.org 
 
http://lists.clusterlabs.org/mailman/listinfo/users 
 

Project Home: http://www.clusterlabs.org  
Getting started: 
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
 
Bugs: http://bugs.clusterlabs.org 

[ClusterLabs] timeout for stop VirtualDomain running Windows 7

2017-07-24 Thread Lentes, Bernd
Hi,

I have a VirtualDomain resource running a Windows 7 client. This is the
respective configuration:

primitive prim_vm_servers_alive VirtualDomain \
params config="/var/lib/libvirt/images/xml/Server_Monitoring.xml" \
params hypervisor="qemu:///system" \
params migration_transport=ssh \
params autoset_utilization_cpu=false \
params autoset_utilization_hv_memory=false \
op start interval=0 timeout=120 \
op stop interval=0 timeout=130 \
op monitor interval=30 timeout=30 \
op migrate_from interval=0 timeout=180 \
op migrate_to interval=0 timeout=190 \
meta allow-migrate=true target-role=Started is-managed=true

The timeout for the stop operation is 130 seconds. But our Windows 7 clients,
as most do, install updates from time to time.
And then a shutdown can take 10 or 20 minutes or even longer.
If the timeout isn't as long as the installation of the updates takes, then the
VM is forced off, with all possible negative consequences.
But on the other hand I don't like to set a timeout of e.g. 20 minutes, which
may still not be enough in some circumstances, but is much too long
if the guest doesn't install updates.
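
For illustration, in the same crm syntax as above the stop timeout could simply be
raised (1200 is an arbitrary example value):

op stop interval=0 timeout=1200

Alternatively, assuming crmsh is in use here, the resource could be taken out of
Pacemaker's control for a planned update window:

# let Pacemaker ignore the VM while Windows installs updates and shuts down
crm resource unmanage prim_vm_servers_alive
# ... update / long shutdown / restart ...
crm resource manage prim_vm_servers_alive

Neither of these handles an unpredictable shutdown time automatically.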

Any ideas ?

Thanks.


Bernd

-- 
Bernd Lentes 

Systemadministration 
institute of developmental genetics 
Gebäude 35.34 - Raum 208 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 (0)89 3187 1241 
fax: +49 (0)89 3187 2294 

no backup - no mercy
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Heinrich Bassler, Dr. Alfons Enhsen
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] epic fail

2017-07-24 Thread Dimitri Maziuk
On 07/24/2017 11:34 AM, Ken Gaillot wrote:
> On Mon, 2017-07-24 at 18:09 +0200, Valentin Vidic wrote:
>> On Mon, Jul 24, 2017 at 11:01:26AM -0500, Dimitri Maziuk wrote:
>>> Lsof/fuser show the PID of the process holding FS open as "kernel".
>>
>> That could be the NFS server running in the kernel.
> 
> Dimitri,
> 
> Is the NFS server also managed by pacemaker? Is it ordered after DRBD?
> Did pacemaker try to stop it before stopping DRBD?
> 

See the other post w/ the log. Sorry for trimming it off of the first
one -- I can repost the whole thing if it makes it easier.

Yes, it successfully stopped dovecot @ 14:03:46, nfs_server @
14:03:47, removed all the symlinks, and failed to unmount /raid @ 14:03:47.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu



signature.asc
Description: OpenPGP digital signature
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] epic fail

2017-07-24 Thread Ken Gaillot
On Mon, 2017-07-24 at 18:09 +0200, Valentin Vidic wrote:
> On Mon, Jul 24, 2017 at 11:01:26AM -0500, Dimitri Maziuk wrote:
> > Lsof/fuser show the PID of the process holding FS open as "kernel".
> 
> That could be the NFS server running in the kernel.

Dimitri,

Is the NFS server also managed by pacemaker? Is it ordered after DRBD?
Did pacemaker try to stop it before stopping DRBD?
-- 
Ken Gaillot 





___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] epic fail

2017-07-24 Thread Valentin Vidic
On Mon, Jul 24, 2017 at 10:38:40AM -0500, Ken Gaillot wrote:
> Standby is not necessary, it's just a cautious step that allows the
> admin to verify that all resources moved off correctly. The restart that
> yum does should be sufficient for pacemaker to move everything.
> 
> A restart shouldn't lead to fencing in any case where something's not
> going seriously wrong. I'm not familiar with the "kernel is using it"
> message, I haven't run into that before.

Right, the pacemaker upgrade might not be the biggest problem.  I've seen
other package upgrades cause RA monitors to return results like 
$OCF_NOT_RUNNING or $OCF_ERR_INSTALLED.  This of course causes the
cluster to react, so I prefer the node standby option :)
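
For illustration, the standby approach with pcs could look like this (the node name
is a placeholder; crmsh has equivalent commands):

# move all resources off the node before upgrading
pcs cluster standby node1.example.com
yum update
# rejoin the node once the upgrade (and any reboot) is done
pcs cluster unstandby node1.example.com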

In this case pacemaker was trying to stop the resources, the stop
action failed, and the upgrading node was killed off by the second
node trying to clean up the mess.  The resources should have come up
on the second node after that.

-- 
Valentin

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] epic fail

2017-07-24 Thread Dimitri Maziuk
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stopping NFS 
> server ...
> Jul 22 14:03:46 zebrafish systemd: Stopping NFS server and services...
> Jul 22 14:03:46 zebrafish systemd: Stopped NFS server and services.
> Jul 22 14:03:46 zebrafish systemd: Stopping NFS Mount Daemon...
> Jul 22 14:03:46 zebrafish systemd: Stopping NFSv4 ID-name mapping service...
> Jul 22 14:03:46 zebrafish rpc.mountd[2655]: Caught signal 15, un-registering 
> and exiting.
> Jul 22 14:03:46 zebrafish systemd: Stopped NFSv4 ID-name mapping service.
> Jul 22 14:03:46 zebrafish systemd: Stopped NFS Mount Daemon.
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: threads
> Jul 22 14:03:46 zebrafish kernel: nfsd: last server has exited, flushing 
> export cache
> Jul 22 14:03:46 zebrafish systemd: Stopping NFS status monitor for NFSv2/3 
> locking
> Jul 22 14:03:46 zebrafish systemd: Stopped NFS status monitor for NFSv2/3 
> locking..
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: rpc-statd
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: nfs-idmapd
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: nfs-mountd
> Jul 22 14:03:46 zebrafish systemd: Stopping RPC bind service...
> Jul 22 14:03:46 zebrafish systemd: Stopped RPC bind service.
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: rpcbind
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: rpc-gssd
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: umount 
> (1/10 attempts)
> Jul 22 14:03:47 zebrafish nfsserver(server_nfs)[6614]: INFO: NFS server 
> stopped
> Jul 22 14:03:47 zebrafish crmd[1078]:  notice: Result of stop operation for 
> server_nfs on zebrafish: 0 (ok)
> Jul 22 14:03:47 zebrafish crmd[1078]:  notice: Initiating stop operation 
> floating_ip_stop_0 locally on zebrafish
> Jul 22 14:03:48 zebrafish crmd[1078]:  notice: Result of stop operation for 
> server_dovecot on zebrafish: 0 (ok)
> Jul 22 14:03:48 zebrafish crmd[1078]:  notice: Initiating stop operation 
> symlink_etc_pki_stop_0 locally on zebrafish
> Jul 22 14:03:48 zebrafish IPaddr2(floating_ip)[6769]: INFO: IP status = ok, 
> IP_CIP=
> Jul 22 14:03:48 zebrafish crmd[1078]:  notice: Initiating stop operation 
> symlink_var_dovecot_stop_0 locally on zebrafish
> Jul 22 14:03:48 zebrafish crmd[1078]:  notice: Result of stop operation for 
> floating_ip on zebrafish: 0 (ok)
> Jul 22 14:03:48 zebrafish symlink(symlink_etc_pki)[6821]: INFO: removed 
> '/etc/pki'
> Jul 22 14:03:48 zebrafish symlink(symlink_var_dovecot)[6822]: INFO: removed 
> '/var/spool/dovecot'
> Jul 22 14:03:48 zebrafish crmd[1078]:  notice: Result of stop operation for 
> symlink_var_dovecot on zebrafish: 0 (ok)
> Jul 22 14:03:48 zebrafish crmd[1078]:  notice: Initiating stop operation 
> symlink_etc_dovecot_stop_0 locally on zebrafish
> Jul 22 14:03:48 zebrafish crmd[1078]:  notice: Result of stop operation for 
> symlink_etc_pki on zebrafish: 0 (ok)
> Jul 22 14:03:48 zebrafish symlink(symlink_etc_dovecot)[6863]: INFO: removed 
> '/etc/dovecot'
> Jul 22 14:03:48 zebrafish crmd[1078]:  notice: Result of stop operation for 
> symlink_etc_dovecot on zebrafish: 0 (ok)
> Jul 22 14:03:48 zebrafish crmd[1078]:  notice: Initiating stop operation 
> drbd_filesystem_stop_0 locally on zebrafish
> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: Running 
> stop for /dev/drbd0 on /raid
> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: Trying to 
> unmount /raid
> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
> unmount /raid; trying cleanup with TERM
...

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu



signature.asc
Description: OpenPGP digital signature
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] epic fail

2017-07-24 Thread Kristián Feldsam
nfs server/share is also managed by pacemaker and order is set right?
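
For illustration, ordering of that kind with pcs might look like the sketch below;
drbd_filesystem and server_nfs are the resource names from the log, while ms_drbd
is an assumed name for the DRBD master/slave resource:

pcs constraint order promote ms_drbd then start drbd_filesystem
pcs constraint order start drbd_filesystem then start server_nfs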

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 24 Jul 2017, at 18:01, Dimitri Maziuk  wrote:
> 
> On 07/24/2017 10:38 AM, Ken Gaillot wrote:
> 
>> A restart shouldn't lead to fencing in any case where something's not
>> going seriously wrong. I'm not familiar with the "kernel is using it"
>> message, I haven't run into that before.
> 
> I posted it at least once before.
> 
>> 
>> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: Running 
>> stop for /dev/drbd0 on /raid
>> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: Trying to 
>> unmount /raid
>> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid; trying cleanup with TERM
>> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
>> processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:49 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid; trying cleanup with TERM
>> Jul 22 14:03:49 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
>> processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:50 zebrafish ntpd[596]: Deleting interface #8 enp2s0f0, 
>> 144.92.167.221#123, interface stats: received=0, sent=0, dropped=0, 
>> active_time=260 secs
>> Jul 22 14:03:50 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid; trying cleanup with TERM
>> Jul 22 14:03:50 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
>> processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:51 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid; trying cleanup with KILL
>> Jul 22 14:03:51 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
>> processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:52 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid; trying cleanup with KILL
>> Jul 22 14:03:53 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
>> processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:54 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid; trying cleanup with KILL
>> Jul 22 14:03:54 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
>> processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:55 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid, giving up!
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info 
>> about processes that use ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [  the device is found by lsof(8) 
>> or fuser(1)) ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; 
>> trying cleanup with TERM ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info 
>> about processes that use ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [  the device is found by lsof(8) 
>> or fuser(1)) ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; 
>> trying cleanup with TERM ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info 
>> about processes that use ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [  the device is found by lsof(8) 
>> or fuser(1)) ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; 
>> trying cleanup with TERM ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info 
>> about processes 

Re: [ClusterLabs] epic fail

2017-07-24 Thread Valentin Vidic
On Mon, Jul 24, 2017 at 11:01:26AM -0500, Dimitri Maziuk wrote:
> Lsof/fuser show the PID of the process holding FS open as "kernel".

That could be the NFS server running in the kernel.

-- 
Valentin

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] epic fail

2017-07-24 Thread Dimitri Maziuk
On 07/24/2017 10:38 AM, Ken Gaillot wrote:

> A restart shouldn't lead to fencing in any case where something's not
> going seriously wrong. I'm not familiar with the "kernel is using it"
> message, I haven't run into that before.

I posted it at least once before.

> 
> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: Running 
> stop for /dev/drbd0 on /raid
> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: Trying to 
> unmount /raid
> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
> unmount /raid; trying cleanup with TERM
> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
> processes on /raid were signalled. force_unmount is set to 'yes'
> Jul 22 14:03:49 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
> unmount /raid; trying cleanup with TERM
> Jul 22 14:03:49 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
> processes on /raid were signalled. force_unmount is set to 'yes'
> Jul 22 14:03:50 zebrafish ntpd[596]: Deleting interface #8 enp2s0f0, 
> 144.92.167.221#123, interface stats: received=0, sent=0, dropped=0, 
> active_time=260 secs
> Jul 22 14:03:50 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
> unmount /raid; trying cleanup with TERM
> Jul 22 14:03:50 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
> processes on /raid were signalled. force_unmount is set to 'yes'
> Jul 22 14:03:51 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
> unmount /raid; trying cleanup with KILL
> Jul 22 14:03:51 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
> processes on /raid were signalled. force_unmount is set to 'yes'
> Jul 22 14:03:52 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
> unmount /raid; trying cleanup with KILL
> Jul 22 14:03:53 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
> processes on /raid were signalled. force_unmount is set to 'yes'
> Jul 22 14:03:54 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
> unmount /raid; trying cleanup with KILL
> Jul 22 14:03:54 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
> processes on /raid were signalled. force_unmount is set to 'yes'
> Jul 22 14:03:55 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
> unmount /raid, giving up!
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info about 
> processes that use ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [  the device is found by lsof(8) 
> or fuser(1)) ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; 
> trying cleanup with TERM ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info about 
> processes that use ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [  the device is found by lsof(8) 
> or fuser(1)) ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; 
> trying cleanup with TERM ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info about 
> processes that use ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [  the device is found by lsof(8) 
> or fuser(1)) ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; 
> trying cleanup with TERM ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info about 
> processes that use ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [  the device is found by lsof(8) 
> or fuser(1)) ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; 
> trying cleanup with KILL ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info about 
> processes that use ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> 

Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Klaus Wenninger
On 07/24/2017 05:37 PM, Kristián Feldsam wrote:
> I personally think that powering off the node by a switched PDU is safer,
> or not?

True if that is working in your environment. If you can't do a physical setup
where you aren't simultaneously losing connection to both your node and
the switch-device (or you just want to cover cases where that happens),
you have to come up with something else.

>
> S pozdravem Kristián Feldsam
> Tel.: +420 773 303 353, +421 944 137 535
> E-mail.: supp...@feldhost.cz 
>
> www.feldhost.cz  - FeldHost™ – profesionální
> hostingové a serverové služby za adekvátní ceny.
>
> FELDSAM s.r.o.
> V rohu 434/3
> Praha 4 – Libuš, PSČ 142 00
> IČ: 290 60 958, DIČ: CZ290 60 958
> C 200350 vedená u Městského soudu v Praze
>
> Banka: Fio banka a.s.
> Číslo účtu: 2400330446/2010
> BIC: FIOBCZPPXX
> IBAN: CZ82 2010  0024 0033 0446
>
>> On 24 Jul 2017, at 17:27, Klaus Wenninger > > wrote:
>>
>> On 07/24/2017 05:15 PM, Tomer Azran wrote:
>>> I still don't understand why the qdevice concept doesn't help in
>>> this situation. Since the master node is down, I would expect the
>>> quorum to declare it as dead.
>>> Why doesn't it happen?
>>
>> That is not how quorum works. It just limits the decision-making to
>> the quorate subset of the cluster.
>> Still the unknown nodes are not sure to be down.
>> That is why I suggested to have quorum-based watchdog-fencing with sbd.
>> That would assure that within a certain time all nodes of the
>> non-quorate part
>> of the cluster are down.
>>
>>>
>>>
>>>
>>> On Mon, Jul 24, 2017 at 4:15 PM +0300, "Dmitri
>>> Maziuk" >> > wrote:
>>>
>>> On 2017-07-24 07:51, Tomer Azran wrote:
>>> > We don't have the ability to use it.
>>> > Is that the only solution?
>>>
>>> No, but I'd recommend thinking about it first. Are you sure you will 
>>> care about your cluster working when your server room is on fire? 
>>> 'Cause 
>>> unless you have halon suppression, your server room is a complete 
>>> write-off anyway. (Think water from sprinklers hitting rich chunky 
>>> volts 
>>> in the servers.)
>>>
>>> Dima
>>>
>>> ___
>>> Users mailing list: Users@clusterlabs.org
>>> http://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>>>
>>>
>>> ___
>>> Users mailing list: Users@clusterlabs.org
>>> http://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>>
>> -- 
>> Klaus Wenninger
>>
>> Senior Software Engineer, EMEA ENG Openstack Infrastructure
>>
>> Red Hat
>>
>> kwenn...@redhat.com   
>> ___
>> Users mailing list: Users@clusterlabs.org 
>> http://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org 
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org 
>
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] epic fail

2017-07-24 Thread Ken Gaillot
On Mon, 2017-07-24 at 17:13 +0200, Kristián Feldsam wrote:
> Hmm, so when you know that it happens also when putting the node into standby,
> then why do you run yum update on a live cluster? It must be clear that
> the node will be fenced.

Standby is not necessary, it's just a cautious step that allows the
admin to verify that all resources moved off correctly. The restart that
yum does should be sufficient for pacemaker to move everything.

A restart shouldn't lead to fencing in any case where something's not
going seriously wrong. I'm not familiar with the "kernel is using it"
message, I haven't run into that before.

The only case where special handling was needed before a yum update is a
node running pacemaker_remote instead of the full cluster stack, before
pacemaker 1.1.15.

> Would you post your pacemaker config? + some logs?
> 
> S pozdravem Kristián Feldsam
> Tel.: +420 773 303 353, +421 944 137 535
> E-mail.: supp...@feldhost.cz
> 
> www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové
> služby za adekvátní ceny.
> 
> FELDSAM s.r.o.
> V rohu 434/3
> Praha 4 – Libuš, PSČ 142 00
> IČ: 290 60 958, DIČ: CZ290 60 958
> C 200350 vedená u Městského soudu v Praze
> 
> Banka: Fio banka a.s.
> Číslo účtu: 2400330446/2010
> BIC: FIOBCZPPXX
> IBAN: CZ82 2010  0024 0033 0446
> 
> > On 24 Jul 2017, at 17:04, Dimitri Maziuk 
> > wrote:
> > 
> > On 07/24/2017 09:40 AM, Jan Pokorný wrote:
> > 
> > > Would there be an interest, though?  And would that be meaningful?
> > 
> > IMO the only reason to put a node in standby is if you want to
> > reboot
> > the active node with no service interruption. For anything else,
> > including a reboot with service interruption (during maintenance
> > window), it's a no.
> > 
> > This is akin to "your mouse has moved, windows needs to be
> > restarted".
> > Except the mouse thing is a joke whereas those "standby" clowns
> > appear
> > to be serious.
> > 
> > With this particular failure, something in the Redhat patched kernel
> > (NFS?) does not release the DRBD filesystem. It happens when I put
> > the
> > node in standby as well, the only difference is not messing up the
> > RPM
> > database which isn't that hard to fix. Since I have several centos 6
> > +
> > DRBD + NFS + heartbeat R1 pairs running happily for years, I have to
> > conclude that centos 7 is simply the wrong tool for this particular
> > job.
> > 
> > -- 
> > Dimitri Maziuk
> > Programmer/sysadmin
> > BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Klaus Wenninger
On 07/24/2017 05:32 PM, Tomer Azran wrote:
> So your suggestion is to use sbd with or without qdevice? What is the
> point of having a qdevice in a two-node cluster if it doesn't help in
> this situation?

If you have a qdevice setup that is already working (meaning that one
of your nodes is quorate and the other not if they are split) I would
use that.
And if you use sbd with just a watchdog (no shared disk) - this should be
supported in CentOS 7.3 (you said you are there somewhere down below iirc) -
it would be assured that the node that is not quorate goes down reliably and
that the other node assumes it to be down after a timeout you configure using
the cluster property stonith-watchdog-timeout.
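
For illustration, a minimal sketch of that setup on CentOS 7.3, assuming sbd is
installed, a working watchdog device is available, and using 10s purely as an
example value:

# /etc/sysconfig/sbd (watchdog-only mode, no shared disk)
SBD_WATCHDOG_DEV=/dev/watchdog

# enable sbd so it is started together with the cluster stack
systemctl enable sbd
pcs cluster stop --all && pcs cluster start --all

# how long the survivor waits before assuming the non-quorate node has self-fenced
pcs property set stonith-watchdog-timeout=10s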

>
>
> From: Klaus Wenninger
> Sent: Monday, July 24, 18:28
> Subject: Re: [ClusterLabs] Two nodes cluster issue
> To: Cluster Labs - All topics related to open-source clustering
> welcomed, Tomer Azran
>
>
> On 07/24/2017 05:15 PM, Tomer Azran wrote:
>> I still don't understand why the qdevice concept doesn't help in this
>> situation. Since the master node is down, I would expect the quorum
>> to declare it as dead.
>> Why doesn't it happen?
>
> That is not how quorum works. It just limits the decision-making to
> the quorate subset of the cluster.
> Still the unknown nodes are not sure to be down.
> That is why I suggested to have quorum-based watchdog-fencing with sbd.
> That would assure that within a certain time all nodes of the
> non-quorate part
> of the cluster are down.
>
>>
>>
>>
>> On Mon, Jul 24, 2017 at 4:15 PM +0300, "Dmitri Maziuk"
>> > wrote:
>>
>>> On 2017-07-24 07:51, Tomer Azran wrote: > We don't have the ability
>>> to use it. > Is that the only solution? No, but I'd recommend
>>> thinking about it first. Are you sure you will care about your
>>> cluster working when your server room is on fire? 'Cause unless you
>>> have halon suppression, your server room is a complete write-off
>>> anyway. (Think water from sprinklers hitting rich chunky volts in
>>> the servers.) Dima ___
>>> Users mailing list: Users@clusterlabs.org
>>> http://lists.clusterlabs.org/mailman/listinfo/users
>>>  Project Home:
>>> http://www.clusterlabs.org Getting started:
>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs:
>>> http://bugs.clusterlabs.org
>>
>>
>> ___ Users mailing list:
>> Users@clusterlabs.org 
>> http://lists.clusterlabs.org/mailman/listinfo/users
>>  Project Home:
>> http://www.clusterlabs.org Getting started:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs:
>> http://bugs.clusterlabs.org
>
>
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Kristián Feldsam
I personally think that powering off the node by a switched PDU is safer, or not?

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 24 Jul 2017, at 17:27, Klaus Wenninger  wrote:
> 
> On 07/24/2017 05:15 PM, Tomer Azran wrote:
>> I still don't understand why the qdevice concept doesn't help in this
>> situation. Since the master node is down, I would expect the quorum to
>> declare it as dead.
>> Why doesn't it happen?
> 
> That is not how quorum works. It just limits the decision-making to the 
> quorate subset of the cluster.
> Still the unknown nodes are not sure to be down.
> That is why I suggested to have quorum-based watchdog-fencing with sbd.
> That would assure that within a certain time all nodes of the non-quorate part
> of the cluster are down.
> 
>> 
>> 
>> 
>> On Mon, Jul 24, 2017 at 4:15 PM +0300, "Dmitri Maziuk" 
>> > wrote:
>> 
>> On 2017-07-24 07:51, Tomer Azran wrote:
>> > We don't have the ability to use it.
>> > Is that the only solution?
>> 
>> No, but I'd recommend thinking about it first. Are you sure you will 
>> care about your cluster working when your server room is on fire? 'Cause 
>> unless you have halon suppression, your server room is a complete 
>> write-off anyway. (Think water from sprinklers hitting rich chunky volts 
>> in the servers.)
>> 
>> Dima
>> 
>> ___
>> Users mailing list: Users@clusterlabs.org 
>> http://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> 
>> Project Home: http://www.clusterlabs.org 
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>> 
>> Bugs: http://bugs.clusterlabs.org 
>> 
>> 
>> ___
>> Users mailing list: Users@clusterlabs.org 
>> http://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> 
>> Project Home: http://www.clusterlabs.org 
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>> 
>> Bugs: http://bugs.clusterlabs.org 
> 
> -- 
> Klaus Wenninger
> 
> Senior Software Engineer, EMEA ENG Openstack Infrastructure
> 
> Red Hat
> 
> kwenn...@redhat.com    
> ___
> Users mailing list: Users@clusterlabs.org 
> http://lists.clusterlabs.org/mailman/listinfo/users 
> 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> 
> Bugs: http://bugs.clusterlabs.org 

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Tomer Azran
So your suggestion is to use sbd with or without qdevice? What is the point of
having a qdevice in a two-node cluster if it doesn't help in this situation?


From: Klaus Wenninger
Sent: Monday, July 24, 18:28
Subject: Re: [ClusterLabs] Two nodes cluster issue
To: Cluster Labs - All topics related to open-source clustering welcomed, Tomer 
Azran


On 07/24/2017 05:15 PM, Tomer Azran wrote:
I still don't understand why the qdevice concept doesn't help in this
situation. Since the master node is down, I would expect the quorum to declare
it as dead.
Why doesn't it happen?

That is not how quorum works. It just limits the decision-making to the quorate 
subset of the cluster.
Still the unknown nodes are not sure to be down.
That is why I suggested to have quorum-based watchdog-fencing with sbd.
That would assure that within a certain time all nodes of the non-quorate part
of the cluster are down.




On Mon, Jul 24, 2017 at 4:15 PM +0300, "Dmitri Maziuk" 
> wrote:

On 2017-07-24 07:51, Tomer Azran wrote: > We don't have the ability to use it. 
> Is that the only solution? No, but I'd recommend thinking about it first. Are 
you sure you will care about your cluster working when your server room is on 
fire? 'Cause unless you have halon suppression, your server room is a complete 
write-off anyway. (Think water from sprinklers hitting rich chunky volts in the 
servers.) Dima ___ Users mailing 
list: Users@clusterlabs.org 
http://lists.clusterlabs.org/mailman/listinfo/users
 Project Home: http://www.clusterlabs.org Getting started: 
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: 
http://bugs.clusterlabs.org


___ Users mailing list: 
Users@clusterlabs.org 
http://lists.clusterlabs.org/mailman/listinfo/users
 Project Home: http://www.clusterlabs.org Getting started: 
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: 
http://bugs.clusterlabs.org

-- Klaus Wenninger Senior Software Engineer, EMEA ENG Openstack Infrastructure 
Red Hat 
kwenning@redhat.com

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Klaus Wenninger
On 07/24/2017 05:15 PM, Tomer Azran wrote:
> I still don't understand why the qdevice concept doesn't help in this
> situation. Since the master node is down, I would expect the quorum to
> declare it as dead.
> Why doesn't it happen?

That is not how quorum works. It just limits the decision-making to the
quorate subset of the cluster.
Still the unknown nodes are not sure to be down.
That is why I suggested to have quorum-based watchdog-fencing with sbd.
That would assure that within a certain time all nodes of the
non-quorate part
of the cluster are down.

>
>
>
> On Mon, Jul 24, 2017 at 4:15 PM +0300, "Dmitri Maziuk"
> > wrote:
>
> On 2017-07-24 07:51, Tomer Azran wrote:
> > We don't have the ability to use it.
> > Is that the only solution?
>
> No, but I'd recommend thinking about it first. Are you sure you will 
> care about your cluster working when your server room is on fire? 'Cause 
> unless you have halon suppression, your server room is a complete 
> write-off anyway. (Think water from sprinklers hitting rich chunky volts 
> in the servers.)
>
> Dima
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


-- 
Klaus Wenninger

Senior Software Engineer, EMEA ENG Openstack Infrastructure

Red Hat

kwenn...@redhat.com   

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Tomer Azran
I still don't understand why the qdevice concept doesn't help in this
situation. Since the master node is down, I would expect the quorum to declare
it as dead.
Why doesn't it happen?



On Mon, Jul 24, 2017 at 4:15 PM +0300, "Dmitri Maziuk" 
> wrote:


On 2017-07-24 07:51, Tomer Azran wrote:
> We don't have the ability to use it.
> Is that the only solution?

No, but I'd recommend thinking about it first. Are you sure you will
care about your cluster working when your server room is on fire? 'Cause
unless you have halon suppression, your server room is a complete
write-off anyway. (Think water from sprinklers hitting rich chunky volts
in the servers.)

Dima

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] epic fail

2017-07-24 Thread Kristián Feldsam
Hmm, so when you know that it happens also when putting the node into standby, then why
do you run yum update on a live cluster? It must be clear that the node will be fenced.

Would you post your pacemaker config? + some logs?

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 24 Jul 2017, at 17:04, Dimitri Maziuk  wrote:
> 
> On 07/24/2017 09:40 AM, Jan Pokorný wrote:
> 
>> Would there be an interest, though?  And would that be meaningful?
> 
> IMO the only reason to put a node in standby is if you want to reboot
> the active node with no service interruption. For anything else,
> including a reboot with service interruption (during maintenance
> window), it's a no.
> 
> This is akin to "your mouse has moved, windows needs to be restarted".
> Except the mouse thing is a joke whereas those "standby" clowns appear
> to be serious.
> 
> With this particular failure, something in the Redhat patched kernel
> (NFS?) does not release the DRBD filesystem. It happens when I put the
> node in standby as well, the only difference is not messing up the RPM
> database which isn't that hard to fix. Since I have several centos 6 +
> DRBD + NFS + heartbeat R1 pairs running happily for years, I have to
> conclude that centos 7 is simply the wrong tool for this particular job.
> 
> -- 
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] epic fail

2017-07-24 Thread Dimitri Maziuk
On 07/24/2017 09:40 AM, Jan Pokorný wrote:

> Would there be an interest, though?  And would that be meaningful?

IMO the only reason to put a node in standby is if you want to reboot
the active node with no service interruption. For anything else,
including a reboot with service interruption (during maintenance
window), it's a no.

This is akin to "your mouse has moved, windows needs to be restarted".
Except the mouse thing is a joke whereas those "standby" clowns appear
to be serious.

With this particular failure, something in the Redhat patched kernel
(NFS?) does not release the DRBD filesystem. It happens when I put the
node in standby as well, the only difference is not messing up the RPM
database which isn't that hard to fix. Since I have several centos 6 +
DRBD + NFS + heartbeat R1 pairs running happily for years, I have to
conclude that centos 7 is simply the wrong tool for this particular job.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu



signature.asc
Description: OpenPGP digital signature
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [ClusterLabs Developers] [HA/ClusterLabs Summit] Key-Signing Party, 2017 Edition

2017-07-24 Thread Jan Pokorný
On 23/07/17 12:32 +0100, Adam Spiers wrote:
> Jan Pokorný  wrote:
>> So, going to attend summit and want your key signed while reciprocally
>> spreading the web of trust?
>> Awesome, let's reuse the steps from the last time:
>> 
>> Once you have a key pair (and provided that you are using GnuPG),
>> please run the following sequence:
>> 
>>   # figure out the key ID for the identity to be verified;
>>   # IDENTITY is either your associated email address/your name
>>   # if only single key ID matches, specific key otherwise
>>   # (you can use "gpg -K" to select a desired ID at the "sec" line)
>>   KEY=$(gpg --with-colons 'IDENTITY' | grep '^pub' | cut -d: -f5)
> 
> AFAICS this has two problems: it's missing a --list-key option,

Bummer!  I've been checking the original thread(s) for responses from
others, but forgot to check my own:
http://lists.linux-ha.org/pipermail/linux-ha/2015-January/048511.html

Thanks for spotting (and the public key already sent), Adam.

> and it doesn't handle multiple matches for 'IDENTITY'.  So to make it
> choose the newest key if there are several:
> 
>read IDENTITY
>KEY=$(gpg --with-colons --list-key "$IDENTITY" | grep '^pub' |
>  sort -t: -nr -k6 | head -n1 | cut -d: -f5)

Good point.  Hopefully affected persons, allegedly heavy users of GPG,
are capable to adapt on-the-fly anyway :-)

>>  # export the public key to a file that is suitable for exchange
>>  gpg --export -a -- $KEY > $KEY
>> 
>>  # verify that you have an expected data to share
>>  gpg --with-fingerprint -- $KEY

-- 
Jan (Poki)


pgpUQYEVl7JOS.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] epic fail

2017-07-24 Thread Jan Pokorný
On 23/07/17 14:40 +0200, Valentin Vidic wrote:
> On Sun, Jul 23, 2017 at 07:27:03AM -0500, Dmitri Maziuk wrote:
>> So yesterday I ran yum update that puled in the new pacemaker and tried to
>> restart it. The node went into its usual "can't unmount drbd because kernel
>> is using it" and got stonith'ed in the middle of yum transaction. The end
>> result: DRBD reports split brain, HA daemons don't start on boot, RPM
>> database is FUBAR. I've had enough. I'm rebuilding this cluster as centos 6
>> + heartbeat R1.
> 
> It seems you did not put the node into standby before the upgrade as it
> still had resources running.  What was the old/new pacemaker version there?

Thinking out loud, it shouldn't be too hard to deliver an RPM
plugin[1] with RPM-shipped pacemaker (it doesn't make much sense
otherwise) that will hook into RPM transactions, putting the node
into standby first so to cover the corner case one updates the
live cluster.  Something akin to systemd_inhibit.so.

Would there be an interest, though?  And would that be meaningful?

[1] http://rpm.org/devel_doc/plugins.html

-- 
Jan (Poki)


pgpIjMoTZC4Yn.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Kristián Feldsam
APC AP7921 is just for 200€ on ebay.

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 24 Jul 2017, at 15:12, Dmitri Maziuk  wrote:
> 
> On 2017-07-24 07:51, Tomer Azran wrote:
>> We don't have the ability to use it.
>> Is that the only solution?
> 
> No, but I'd recommend thinking about it first. Are you sure you will care 
> about your cluster working when your server room is on fire? 'Cause unless 
> you have halon suppression, your server room is a complete write-off anyway. 
> (Think water from sprinklers hitting rich chunky volts in the servers.)
> 
> Dima
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Dmitri Maziuk

On 2017-07-24 07:51, Tomer Azran wrote:

We don't have the ability to use it.
Is that the only solution?


No, but I'd recommend thinking about it first. Are you sure you will 
care about your cluster working when your server room is on fire? 'Cause 
unless you have halon suppression, your server room is a complete 
write-off anyway. (Think water from sprinklers hitting rich chunky volts 
in the servers.)


Dima

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Tomer Azran
We don't have the ability to use it.
Is that the only solution?

In addition, it will not cover a scenario in which the server room is down (for
example, fire or earthquake), since the switch will go down as well.

From: Klaus Wenninger
Sent: Monday, July 24, 15:31
Subject: Re: [ClusterLabs] Two nodes cluster issue
To: Cluster Labs - All topics related to open-source clustering welcomed, 
Kristián Feldsam


On 07/24/2017 02:05 PM, Kristián Feldsam wrote:
Hello, you have to use a second fencing device, for example an APC Switched PDU.

https://wiki.clusterlabs.org/wiki/Configure_Multiple_Fencing_Devices_Using_pcs

Problem here seems to be that the fencing devices available are running from
the same power-supply as the node itself. So they are kind of useless to 
determine
whether the partner-node has no power or simply is not reachable via the network.


S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové 
a serverové služby za adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

On 24 Jul 2017, at 13:51, Tomer Azran 
 wrote:

Hello,

We built a pacemaker cluster with 2 physical servers.
We configured DRBD in Master\Slave setup, a floating IP and file system mount 
in Active\Passive mode.
We configured two STONITH devices (fence_ipmilan), one for each server.

We are trying to simulate a situation where the Master server crashes with no
power.
We pulled both of the PSU cables and the server becomes offline (UNCLEAN).
The resources that the Master used to hold are now in Started (UNCLEAN) state.
The state is unclean since the STONITH failed (the STONITH device is located on 
the server (Intel RMM4 - IPMI) – which uses the same power supply).

The problem is that now the cluster does not release the resources that the
Master holds, and the service goes down.

Is there any way to overcome this situation?
We tried to add a qdevice but got the same results.

If you have already setup qdevice (using an additional node or so) you could use
quorum-based watchdog-fencing via SBD.


We are using pacemaker 1.1.15 on CentOS 7.3

Thanks,
Tomer.
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org



___ Users mailing list: 
Users@clusterlabs.org 
http://lists.clusterlabs.org/mailman/listinfo/users
 Project Home: http://www.clusterlabs.org Getting started: 
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: 
http://bugs.clusterlabs.org

-- Klaus Wenninger Senior Software Engineer, EMEA ENG Openstack Infrastructure 
Red Hat 
kwenning@redhat.com

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Klaus Wenninger
On 07/24/2017 02:05 PM, Kristián Feldsam wrote:
> Hello, you have to use a second fencing device, for example an APC Switched PDU.
>
> https://wiki.clusterlabs.org/wiki/Configure_Multiple_Fencing_Devices_Using_pcs

Problem here seems to be that the available fencing devices are running from
the same power supply as the node itself. So they are kind of useless for determining
whether the partner node has no power or is simply not reachable via the network.
 
>
> Best regards, Kristián Feldsam
> Tel.: +420 773 303 353, +421 944 137 535
> E-mail: supp...@feldhost.cz
>
> www.feldhost.cz - FeldHost™ – professional hosting and server services
> at fair prices.
>
> FELDSAM s.r.o.
> V rohu 434/3
> Praha 4 – Libuš, PSČ 142 00
> Company ID: 290 60 958, VAT ID: CZ290 60 958
> File No. C 200350, registered at the Municipal Court in Prague
>
> Bank: Fio banka a.s.
> Account number: 2400330446/2010
> BIC: FIOBCZPPXX
> IBAN: CZ82 2010  0024 0033 0446
>
>> On 24 Jul 2017, at 13:51, Tomer Azran  wrote:
>>
>> Hello,
>>  
>> We built a pacemaker cluster with 2 physical servers.
>> We configured DRBD in Master\Slave setup, a floating IP and file
>> system mount in Active\Passive mode.
>> We configured two STONITH devices (fence_ipmilan), one for each server.
>>  
>> We are trying to simulate a situation where the Master server crashes
>> with no power.
>> We pulled both of the PSU cables and the server went offline
>> (UNCLEAN).
>> The resources that the Master used to hold are now in Started
>> (UNCLEAN) state.
>> The state is unclean since the STONITH failed (the STONITH device is
>> located on the server (Intel RMM4 - IPMI) – which uses the same power
>> supply).
>>  
>> The problem is that now, the cluster does not release the resources
>> that the Master holds, and the service goes down.
>>  
>> Is there any way to overcome this situation?
>> We tried to add a qdevice but got the same results.

If you have already set up qdevice (using an additional node or so) you
could use quorum-based watchdog-fencing via SBD.

>>  
>> We are using pacemaker 1.1.15 on CentOS 7.3
>>  
>> Thanks,
>> Tomer.
>> ___
>> Users mailing list: Users@clusterlabs.org 
>> http://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org 
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org 
>
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


-- 
Klaus Wenninger

Senior Software Engineer, EMEA ENG Openstack Infrastructure

Red Hat

kwenn...@redhat.com   

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Kristián Feldsam
Hello, you have to use second fencing device, for ex. APC Switched PDU.

https://wiki.clusterlabs.org/wiki/Configure_Multiple_Fencing_Devices_Using_pcs

Best regards, Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – professional hosting and server services at 
fair prices.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
Company ID: 290 60 958, VAT ID: CZ290 60 958
File No. C 200350, registered at the Municipal Court in Prague

Bank: Fio banka a.s.
Account number: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 24 Jul 2017, at 13:51, Tomer Azran  wrote:
> 
> Hello,
>  
> We built a pacemaker cluster with 2 physical servers.
> We configured DRBD in Master\Slave setup, a floating IP and file system mount 
> in Active\Passive mode.
> We configured two STONITH devices (fence_ipmilan), one for each server.
>  
> We are trying to simulate a situation where the Master server crashes with no 
> power.
> We pulled both of the PSU cables and the server went offline (UNCLEAN).
> The resources that the Master used to hold are now in Started (UNCLEAN) state.
> The state is unclean since the STONITH failed (the STONITH device is located 
> on the server (Intel RMM4 - IPMI) – which uses the same power supply).
>  
> The problem is that now, the cluster does not release the resources that 
> the Master holds, and the service goes down.
>  
> Is there any way to overcome this situation?
> We tried to add a qdevice but got the same results.
>  
> We are using pacemaker 1.1.15 on CentOS 7.3
>  
> Thanks,
> Tomer.
> ___
> Users mailing list: Users@clusterlabs.org 
> http://lists.clusterlabs.org/mailman/listinfo/users 
> 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> 
> Bugs: http://bugs.clusterlabs.org 
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Tomer Azran
Hello,

We built a pacemaker cluster with 2 physical servers.
We configured DRBD in Master\Slave setup, a floating IP and file system mount 
in Active\Passive mode.
We configured two STONITH devices (fence_ipmilan), one for each server.

We are trying to simulate a situation where the Master server crashes with no 
power.
We pulled both of the PSU cables and the server went offline (UNCLEAN).
The resources that the Master used to hold are now in Started (UNCLEAN) state.
The state is unclean since the STONITH failed (the STONITH device is located on 
the server (Intel RMM4 - IPMI) - which uses the same power supply).

The problem is that now, the cluster does not release the resources that the 
Master holds, and the service goes down.

Is there any way to overcome this situation?
We tried to add a qdevice but got the same results.

We are using pacemaker 1.1.15 on CentOS 7.3

Thanks,
Tomer.
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pcs: how to properly unset a value for resource/stonith? [Was: (no subject)]

2017-07-24 Thread ArekW
Hi, thank you for setting the subject. I confirm that the parameter can be
disabled. The only issue is that sometimes there is a "zombie" message
in the logs, like the one I showed before:
Jul 20 07:14:11 nfsnode1 stonith-ng[11097]: warning: fence_vbox[3092]
stderr: [ WARNING:root:Parse error: Ignoring option 'verbose' because
it does not have value ]
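
If in doubt whether that warning is just leftover noise from the agent run, the
current configuration can be checked with something along these lines (resource
name taken from this thread):

pcs stonith show vbox-fencing
pcs cluster cib scope=resources | grep verbose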


2017-07-20 15:46 GMT+02:00 Jan Pokorný :
> Hello ArekW,
>
> first of all, gentle reminder to always set the subject for the posts
> to the list (or, as a rule of thumb, in any email-based conversation).
>
> On 20/07/17 08:43 +0200, Klaus Wenninger wrote:
>> On 07/20/2017 07:21 AM, ArekW wrote:
>>> Hi, How to properly unset a value with pcs? Set to false or null gives 
>>> error:
>>>
>>> # pcs stonith update vbox-fencing verbose=false --force
>>> or
>>> # pcs stonith update vbox-fencing verbose= --force
>>>
>> The latter should be fine actually.
>
> True:
>
>   # rm test.cib
>   # yum install -y fence-virt
>
>   # pcs -f test.cib stonith create fence-virt-069 fence_xvm auth=sha256 \
> hash=sha256 key_file=/etc/cluster/fence_xvm.key timeout=5 \
> pcmk_host_map=virt-069:virt-069.example.com
>   # pcs -f test.cib cluster cib scope=resources
>   <resources>
>     <primitive class="stonith" id="fence-virt-069" type="fence_xvm">
>       <instance_attributes id="...">
>         <nvpair id="..." name="auth" value="sha256"/>
>         <nvpair id="..." name="hash" value="sha256"/>
>         <nvpair id="..." name="key_file" value="/etc/cluster/fence_xvm.key"/>
>         <nvpair id="..." name="pcmk_host_map" value="virt-069:virt-069.example.com"/>
>         <nvpair id="..." name="timeout" value="5"/>
>       </instance_attributes>
>       <operations>
>         <op id="..." interval="..." name="monitor"/>
>       </operations>
>     </primitive>
>   </resources>
>
>   # pcs -f test.cib stonith update fence-virt-069 key_file=
>   # pcs -f test.cib cluster cib scope=resources
>   <resources>
>     <primitive class="stonith" id="fence-virt-069" type="fence_xvm">
>       <instance_attributes id="...">
>         <nvpair id="..." name="auth" value="sha256"/>
>         <nvpair id="..." name="hash" value="sha256"/>
>         <nvpair id="..." name="pcmk_host_map" value="virt-069:virt-069.example.com"/>
>         <nvpair id="..." name="timeout" value="5"/>
>       </instance_attributes>
>       <operations>
>         <op id="..." interval="..." name="monitor"/>
>       </operations>
>     </primitive>
>   </resources>
>
> --
> Jan (Poki)
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [ClusterLabs Developers] [HA/ClusterLabs Summit] Key-Signing Party, 2017 Edition

2017-07-24 Thread Kristoffer Grönlund
Jan Pokorný  writes:

> [ Unknown signature status ]
> Hello cluster masters :-)
>
> as there's little less than 7 weeks left to "The Summit" meetup
> (), it's about time to get the ball
> rolling so we can voluntarily augment the digital trust amongst
us, the attendees, on an OpenPGP basis.
>
> Doing that, we'll actually establish a tradition since this will
> be the second time such event is being kicked off (unlike the birds
> of the feather gathering itself, was edu-feathered back then):
>
>   
>   
>
> If there are no objections, yours truly will conduct this undertaking.
> (As an aside, I am toying with an idea of optimizing the process
> a bit now that many keys are cross-signed already; I doubt there's
> a value of adding identical signatures just with different timestamps,
> unless, of course, the inscribed level of trust is going to change,
> presumably elevate -- any comments?)

Hi Jan,

No objections from me, thank you for taking charge of this!

Cheers,
Kristoffer


>
> * * *
>
> So, going to attend summit and want your key signed while reciprocally
> spreading the web of trust?
> Awesome, let's reuse the steps from the last time:
>
> Once you have a key pair (and provided that you are using GnuPG),
> please run the following sequence:
>
> # figure out the key ID for the identity to be verified;
> # IDENTITY is either your associated email address/your name
> # if only single key ID matches, specific key otherwise
> # (you can use "gpg -K" to select a desired ID at the "sec" line)
> KEY=$(gpg -k --with-colons -- 'IDENTITY' | grep '^pub' | cut -d: -f5)
>
> # export the public key to a file that is suitable for exchange
> gpg --export -a -- $KEY > $KEY
>
> # verify that you have an expected data to share
> gpg --with-fingerprint -- $KEY
>
> with IDENTITY adjusted as per the instruction above, and send me the
> resulting $KEY file, preferably in a signed (or even encrypted[*]) email
> from an address associated with that very public key of yours.
>
> Timeline?
> Please, send me your public keys *by 2017-09-05*, off-list and
> best with [key-2017-ha] prefix in the subject.  I will then compile
> a list of the attendees together with their keys and publish it at
> 
> so it can be printed beforehand.
>
> [*] You can find my public key at public keyservers:
> 
> Indeed, the trust in this key should be ephemeral/one-off
> (e.g. using a temporary keyring, not a universal one before we
> proceed with the signing :)
>
> * * *
>
> Thanks for your cooperation, looking forward to this side stage
> (but nonetheless important if release or commit[1] signing is to get
> traction) happening and hope this will be beneficial to all involved.
>
> See you there!
>
>
> [1] for instance, see:
> 
> 
>
> -- 
> Jan (Poki)
> ___
> Developers mailing list
> develop...@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/developers

-- 
// Kristoffer Grönlund
// kgronl...@suse.com

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org