[ClusterLabs] Is STONITH resource location special?

2015-10-19 Thread Ferenc Wagner
Hi,

Pacemaker Explained discusses the special treatment of STONITH resources
in section 13.3.  Now, I have configured 6 fence_ipmilan instances in a
cluster which currently runs on 4 nodes.  No STONITH resource started on the
node it can kill, even though I haven't configured location constraints
yet.  Is this clever placement accidental, or is it another aspect of
the special treatment of STONITH resources?  Should I configure the
usual location constraints, or are they not needed under 1.1.13?
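
If the usual constraints do turn out to be necessary, I assume something
like the following pcs form would do (the resource and node names are
placeholders):

    # keep the fence device that can kill node1 away from node1 itself
    pcs constraint location fence-node1-ipmi avoids node1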

Also, I haven't configured pcmk_host_check, so it should default to
dynamic-list, but the pcmk_host_list setting is still effective,
according to stonith_admin.  I don't mind this, but it seems to
contradict the documentation.  Who is right?
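
For what it's worth, the check was along these lines (the node name is a
placeholder):

    # which devices does the cluster think can fence a given node?
    stonith_admin --list node1
    # list all registered fence devices
    stonith_admin --list-registered
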
-- 
Thanks,
Feri.



Re: [ClusterLabs] Fencing questions.

2015-10-19 Thread Digimer
On 19/10/15 06:53 AM, Arjun Pandey wrote:
> Hi
> 
> I am running a 2-node cluster with this config on CentOS 6.5/6.6, where

It's important to keep both nodes on the same minor version, particularly
in this case. Please either upgrade the CentOS 6.5 node to 6.6, or upgrade
both nodes to 6.7.

> I have a multi-state resource foo running in master/slave mode and a
> bunch of floating IP addresses configured. Additionally I have
> a colocation constraint for the IP addresses to be colocated with the master.
> 
> Please find the following files attached 
> cluster.conf
> CIB

It's preferable on a mailing list to copy the text into the body of the
message. Easier to read.

> Issues that I have:
> 1. Daemons required for fencing
> Earlier we were invoking "cman start quorum" from the pacemaker script, which
> ensured that fenced/gfs and the other daemons were not started. This was OK
> since fencing wasn't being handled earlier.

The cman fencing is simply a pass-through to pacemaker. When pacemaker
tells cman that fencing succeeded, it informs DLM and begins cleanup.
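
On the cman side, all that's needed is fence_pcmk configured as the fence
device so that requests from fenced get redirected to pacemaker. A rough
sketch with ccs, using a placeholder node name (repeat per node, adjusted
to your cluster.conf):

    ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
    ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node1.example.com
    ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node1.example.com pcmk-redirect port=node1.example.com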

> For fencing purposes, do we only need fenced to be started? We don't
> have any gfs partitions that we want to monitor via pacemaker. My
> concern here is that if I use the unmodified script, pacemaker start
> time increases significantly: I see a difference of 60 seconds compared to
> the earlier startup before "service pacemaker status" shows up as started.

Don't start fenced manually, just start pacemaker and let it handle
everything. Ideally, use the pcs command (and pcsd daemon on the nodes)
to start/stop the cluster, but you'll need to upgrade to 6.7.
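
For example (assuming pcs/pcsd as shipped with 6.7 on every node):

    pcs cluster start --all    # brings up cman and pacemaker on every node
    pcs cluster stop --all
    pcs cluster status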

> 2. Fencing test cases.
> Based on the internet queries I could find, apart from pulling out
> the dedicated cable, the only other case suggested is killing the corosync
> process on one of the nodes.
> Are there any other basic cases that I should look at?
> What about bringing an interface down manually? I understand that this is
> an unlikely scenario, but I am just looking for more ways to test this out.

echo c > /proc/sysrq-trigger == kernel panic. It's my preferred test.
Also, killing the power to the node will cause IPMI to fail and will
test your backup fence method, if you have it, or ensure the cluster
locks up if you don't (better to hang than to risk corruption).
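
A typical run of that test: trigger the panic on the victim and watch the
result from a surviving node, for example:

    # on the node to be fenced
    echo c > /proc/sysrq-trigger
    # on a surviving node, confirm the fence fired and resources recovered
    crm_mon -1
    grep -i stonith /var/log/messages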

> 3. Testing whether fencing is working or not.
> Previously I have been using fence_ilo4 from the shell to test whether
> the command is working. I was assuming that a similar invocation would be
> done by stonith when actual fencing needs to be done.
> 
> However, based on other threads I could find, people also use fence_tool
>  to try this out. As I understand it, this tests whether
> fencing, when invoked by fenced for a particular node, succeeds or not. Is
> that valid?

fence_tool is just a command for controlling the cluster's fencing daemon
(fenced); the fence_X agents do the actual work.
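
Run by hand, an agent check would look roughly like this (the address,
credentials and the lanplus flag are just example values for a typical
iLO setup):

    # status query over IPMI; nothing actually gets fenced
    fence_ipmilan -a 192.168.0.10 -l admin -p secret -P -o status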

> Since we are configuring fence_pcmk as the fence device, the flow of
> things is 
> fenced -> fence_pcmk -> stonith -> fence agent.

Basically correct.
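
Either end of that chain can be exercised from the command line, e.g.
(node name is a placeholder):

    # cman side: ask fenced to fence a node (goes through fence_pcmk into pacemaker)
    fence_node node2
    # pacemaker side: ask stonithd to reboot a node directly
    stonith_admin --reboot node2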

> 4. Fencing agent to be used (fence_ipmilan vs fence_ilo4)
> Also, for iLO fencing I see both fence_ilo4 and fence_ipmilan available.
> I have been using fence_ilo4 until now.

Whichever works is fine. I believe a lot of the fence_X out-of-band
agents are actually just symlinks to fence_ipmilan, but I might be wrong.

> I think this mail has multiple questions, and I will probably send out
> another mail for a few issues I see after fencing takes place.
> 
> Thanks in advance
> Arjun
> 
> 


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [ClusterLabs] Coming in 1.1.14: remapping sequential reboots to all-off-then-all-on

2015-10-19 Thread Ken Gaillot
On 10/19/2015 11:42 AM, Digimer wrote:
> On 19/10/15 12:34 PM, Ken Gaillot wrote:
>> Pacemaker supports fencing "topologies", allowing multiple fencing
>> devices to be used (in conjunction or as fallbacks) when a node needs to
>> be fenced.
>>
>> However, there is a catch when using something like redundant power
>> supplies. If you put two power switches in the same topology level, and
>> Pacemaker needs to reboot the node, it will reboot the first power
>> switch and then the second -- which has no effect since the supplies are
>> redundant.
>>
>> Pacemaker's upstream master branch has new handling that will be part of
>> the eventual 1.1.14 release. In such a case, it will turn all the
>> devices off, then turn them all back on again.
> 
> How long will it stay in the 'off' state? Is it configurable? I
> ask because if it's too short, some PSUs may not actually lose power.
> One or two seconds should be way more than enough, though.

It simply waits for the fence agent to return success from the "off"
command before proceeding. I wouldn't assume any particular time between
that and initiating "on", and there's no way to set a delay there --
it's up to the agent to not return success until the action is actually
complete.

The standard says that agents should actually confirm that the device is
in the desired state after sending a command, so hopefully this is
already baked in.

>> With previous versions, there was a complicated configuration workaround
>> involving creating separate devices for the off and on actions. With the
>> new version, it happens automatically, and no special configuration is
>> needed.
>>
>> Here's an example where node1 is the affected node, and apc1 and apc2
>> are the fence devices:
>>
>>pcs stonith level add 1 node1 apc1,apc2
> 
> Where would the outlet definition go? 'apc1:4,apc2:4'?

"apc1" here is name of a Pacemaker fence resource. Hostname, port, etc.
would be configured in the definition of the "apc1" resource (which I
omitted above to focus on the topology config).

>> Of course you can configure it using crm or XML as well.
>>
>> The fencing operation will be treated as successful as long as the "off"
>> commands succeed, because then it is safe for the cluster to recover any
>> resources that were on the node. Timeouts and errors in the "on" phase
>> will be logged but ignored.
>>
>> Any action-specific timeout for the remapped action will be used (for
>> example, pcmk_off_timeout will be used when executing the "off" command,
>> not pcmk_reboot_timeout).
> 
> I think this answers my question about how long it stays off for. What
> would be an example config to control the off time then?

This isn't a delay, but a timeout before declaring the action failed. If
an "off" command does not return in this amount of time, the command
(and the entire topology level) will be considered failed, and the next
level will be tried.

The timeouts are configured in the fence resource definition. So
combining the above questions, apc1 might be defined like this:

    pcs stonith create apc1 fence_apc_snmp \
        ipaddr=apc1.example.com \
        login=user passwd='supersecret' \
        pcmk_off_timeout=30s \
        pcmk_host_map="node1.example.com:1,node2.example.com:2"
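
With apc2 defined the same way, the level command from earlier ties both
devices into one topology level:

    pcs stonith level add 1 node1 apc1,apc2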

>> The new code knows to skip the "on" step if the fence agent has
>> automatic unfencing (because it will happen when the node rejoins the
>> cluster). This allows fence_scsi to work with this feature.
> 
> http://i.imgur.com/i7BzivK.png

:-D




[ClusterLabs] Coming in 1.1.14: remapping sequential reboots to all-off-then-all-on

2015-10-19 Thread Ken Gaillot
Pacemaker supports fencing "topologies", allowing multiple fencing
devices to be used (in conjunction or as fallbacks) when a node needs to
be fenced.

However, there is a catch when using something like redundant power
supplies. If you put two power switches in the same topology level, and
Pacemaker needs to reboot the node, it will reboot the first power
switch and then the second -- which has no effect since the supplies are
redundant.

Pacemaker's upstream master branch has new handling that will be part of
the eventual 1.1.14 release. In such a case, it will turn all the
devices off, then turn them all back on again.

With previous versions, there was a complicated configuration workaround
involving creating separate devices for the off and on actions. With the
new version, it happens automatically, and no special configuration is
needed.

Here's an example where node1 is the affected node, and apc1 and apc2
are the fence devices:

   pcs stonith level add 1 node1 apc1,apc2

Of course you can configure it using crm or XML as well.
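
For instance, a roughly equivalent crm shell form (crmsh syntax may differ
slightly between versions) would be:

    crm configure fencing_topology node1: apc1,apc2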

The fencing operation will be treated as successful as long as the "off"
commands succeed, because then it is safe for the cluster to recover any
resources that were on the node. Timeouts and errors in the "on" phase
will be logged but ignored.

Any action-specific timeout for the remapped action will be used (for
example, pcmk_off_timeout will be used when executing the "off" command,
not pcmk_reboot_timeout).

The new code knows to skip the "on" step if the fence agent has
automatic unfencing (because it will happen when the node rejoins the
cluster). This allows fence_scsi to work with this feature.
-- 
Ken Gaillot 



[ClusterLabs] Automatic IPC buffer adjustment in Pacemaker 1.1.11?

2015-10-19 Thread Ferenc Wagner
Hi,

The http://www.ultrabug.fr/tuning-pacemaker-for-large-clusters/ blog
post states that "Pacemaker v1.1.11 should come with a feature which
will allow the IPC layer to adjust the PCMK_ipc_buffer automagically".
However, I failed to identify this in the ChangeLog.  Did it really
happen, can I drop my PCMK_ipc_buffer settings, and if yes, do I need to
switch this feature on somehow?
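
The setting in question is just an environment variable picked up by the
daemons, along these lines (the value here is only an example):

    # /etc/sysconfig/pacemaker (or /etc/default/pacemaker on Debian-based systems)
    PCMK_ipc_buffer=1048576
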
-- 
Thanks,
Feri.



Re: [ClusterLabs] Corosync & pacemaker quit between boot and login

2015-10-19 Thread Matthias Ferdinand
On Fri, Oct 16, 2015 at 06:25:29PM +0200, users-requ...@clusterlabs.org wrote:
> From: Russell Martin 
> Subject: [ClusterLabs] Corosync & pacemaker quit between boot and login
> 
> ...
> Both corosync and pacemaker seem to start fine during boot (they both say 
> "[OK]")
> However, once logged in, neither daemon is running.

Hi,

Since you are using a desktop install, could it be that NetworkManager
interferes with your network configuration?

Otherwise you would have to look through /var/log/corosync/corosync.log
for a hint.
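
A quick check might be:

    # are the daemons actually still running?
    service corosync status
    service pacemaker status
    # and look for the reason they exited
    tail -n 100 /var/log/corosync/corosync.log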

Regards
Matthias

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org