[ClusterLabs] Antw: Re: Resources not monitored in SLES11 SP4 (1.1.12-f47ea56)

2018-06-26 Thread Ulrich Windl
>>> Ken Gaillot  wrote on 26.06.2018 at 18:22 in
>>> message
<1530030128.5202.5.ca...@redhat.com>:
> On Tue, 2018-06-26 at 10:45 +0300, Vladislav Bogdanov wrote:
>> 26.06.2018 09:14, Ulrich Windl wrote:
>> > Hi!
>> > 
>> > We just observed a strange effect we cannot explain in SLES 11
>> > SP4 (pacemaker 1.1.12-f47ea56):
>> > We run about a dozen Xen PVMs on a three-node cluster (plus some
>> > infrastructure and monitoring stuff). It has all worked well so far,
>> > and there was no significant change recently.
>> > However, when a colleague stopped one VM for maintenance via a cluster
>> > command, the cluster did not notice when the PVM was actually
>> > running again (it had been started outside the cluster (a bad
>> > idea, I know)).
>> 
>> To be on the safe side in such cases you'd probably want to enable an
>> additional monitor for the "Stopped" role. The default one covers only
>> the "Started" role. It's the same as for multistate resources, where you
>> need several monitor ops, for the "Started/Slave" and "Master" roles.
>> But this will increase the load.
>> And I believe the cluster should reprobe a resource on all nodes once
>> you change target-role back to "Started".
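
For illustration, a minimal crm configure sketch of such an extra monitor (resource name and parameters are hypothetical; the two monitors must use different intervals):

  primitive vm_example ocf:heartbeat:Xen \
      params xmfile="/etc/xen/vm_example.cfg" \
      op monitor interval=600s timeout=60s \
      op monitor role=Stopped interval=610s timeout=60s
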
> 
> Which raises the question, how did you stop the VM initially?

I thought "(...) stopped one VM for maintenance via cluster command" is 
obvious. It was something like "crm resource stop ...".

> 
> If you stopped it by setting target-role to Stopped, likely the cluster
> still thinks it's stopped, and you need to set it to Started again. If
> instead you set maintenance mode or unmanaged the resource, then
> stopped the VM manually, then most likely it's still in that mode and
> needs to be taken out of it.
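
For example (crm shell; resource name hypothetical), depending on how the VM was stopped:

  crm resource start vm1      # clears target-role=Stopped
  crm resource manage vm1     # re-manage it, if it had been unmanaged
  crm resource reprobe        # force a re-probe of the current resource state
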

The point was that when the command to start the resource was given, the cluster
completely ignored the fact that it was already running and started the VM on a
second node (which could be disastrous). But that's leading away from the main
question...

> 
>> 
>> > Examining the logs, it seems that the recheck timer popped
>> > periodically, but no monitor action was run for the VM (the action
>> > is configured to run every 10 minutes).
>> > 
>> > Actually the only monitor operations found were:
>> > May 23 08:04:13
>> > Jun 13 08:13:03
>> > Jun 25 09:29:04
>> > Then a manual "reprobe" was done, and several monitor operations
>> > were run.
>> > Then again I see no more monitor actions in syslog.
>> > 
>> > What could be the reasons for this? Too many operations defined?
>> > 
>> > The other message I don't understand is like ":
>> > Rolling back scores from "
>> > 
>> > Could it be a new bug introduced in pacemaker, or could it be some
>> > configuration problem (The status is completely clean however)?
>> > 
>> > According to the package changelog, there was no change since Nov
>> > 2016...
>> > 
>> > Regards,
>> > Ulrich
>> > 
>> > 
> -- 
> Ken Gaillot 




___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker 2.0.0-rc6 now available

2018-06-26 Thread Ken Gaillot
We have one final (seriously this time, I mean it) release candidate
for Pacemaker 2.0.0, to highlight recent changes and give everyone one
more chance to test. The final should be released in one to two weeks.
Source code is available at:

  https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-2.0.0-rc6

The most important part of this release is the automatic transformation
of pre-2.0 configurations to 2.0 syntax. It should be possible to use
any older configuration with 2.0 (a few obscure features were dropped
with no replacement, but the configuration should still upgrade, and
warnings will be given in such cases).

We also have two small new features: .mount, .path, and .timer systemd
unit files are now supported as resources, and stonith_admin has a new
--validate option to check a potential device configuration (which will
be more useful for scripting or higher-level tools than end users).
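
For example, once 2.0.0 is in place, a systemd mount or timer unit can be managed directly as a resource (unit and resource names below are hypothetical):

  pcs resource create srv_mount systemd:srv.mount op monitor interval=60s

and a prospective fence device configuration can be checked with the new stonith_admin --validate option before it is committed to the CIB.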

As usual, there were bug fixes, including for a couple of regressions
introduced in 1.1.17 and 1.1.18. For details, see the change log:

  https://github.com/ClusterLabs/pacemaker/blob/2.0/ChangeLog

If you are upgrading to 2.0.0 for the first time, the wiki page for the
2.0 release may come in handy:

  https://wiki.clusterlabs.org/wiki/Pacemaker_2.0_Changes

Many thanks to contributors of source code to this release, including 
Jan Pokorný, Klaus Wenninger, and Ken Gaillot.
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] VM failure during shutdown

2018-06-26 Thread Andrei Borzenkov
26.06.2018 19:36, Ken Gaillot wrote:
> 
> One problem is that you are creating the VM, and then later adding
> constraints about what the cluster can do with it. Therefore there is a
> time in between where the cluster can start it without any constraint.
> The solution is to make your changes all at once. Both pcs and crm have
> a way to do this; with pcs, it's:
> 
>   pcs cluster cib 
>   pcs -f  ...whatever command you want...
>   ...repeat...
>   pcs cluster cib-push --config 
> 

An alternative is to initially create resources with target-role=Stopped
and then remove it after adding suitable constraints.
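
A minimal sketch of that approach with pcs (resource and clone names are hypothetical; the DRBD resource is assumed to be a master/slave clone):

  pcs resource create my_vm VirtualDomain config="/path/to/my_vm.xml" \
      meta target-role=Stopped
  pcs constraint colocation add my_vm with master my_drbd-master INFINITY
  pcs constraint order promote my_drbd-master then start my_vm
  pcs resource enable my_vm    # clears target-role=Stopped once constraints exist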
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Stop one VM, another tries to migrate

2018-06-26 Thread Jason Gauthier
On Tue, Jun 26, 2018 at 12:40 PM Ken Gaillot  wrote:
>
> On Tue, 2018-06-26 at 07:19 -0400, Jason Gauthier wrote:
> > Greetings,
> >
> >I am using my cluster platform primarily for virtual machines.
> > While I've still been in implementation mode, I felt like things were
> > somewhat stable. However, I've noticed that sometimes when I stop a
> > resource another resource tries to migrate.   I did this morning, and
> > that scenario occurred.   Basically, I 'crm resource stop Omicron',
> > and the machine 'Lapras' tried to migrate as well.  I've included
> > cluster logs since I can't make heads or tails of this decision.
>
> Have a look at resource-stickiness.
>
> Basically, the cluster will by default try to balance the number of
> resources across all nodes (subject to your constraints of course).
> Stickiness tells it to prefer to keep running resources where they are,
> and only consider balancing when starting a resource.
>

Ah, I had no idea that was a thing!  I wouldn't have noticed if the
migration hadn't failed.
Which is a secondary concern.

> > I've attached a cluster log, but also put it in line here since I'm
> > not sure the preferred way.  This log only pertains to the actions
> > since issuing the resource stop.
> >
> > Jun 26 07:01:49 [4552] alphacib: info: cib_perform_op:
> >  Diff: --- 1.442.64 2
> > Jun 26 07:01:49 [4552] alphacib: info: cib_perform_op:
> >  Diff: +++ 1.443.0 92508eef9d32f83b93e7f1ed2dff3340
> > Jun 26 07:01:49 [4552] alphacib: info: cib_perform_op:
> >  +  /cib:  @epoch=443, @num_updates=0
> > Jun 26 07:01:49 [4552] alphacib: info: cib_perform_op:
> >  +  /cib/configuration/resources/primitive[@id='Omicron']/meta_attrib
> > utes[@id='Omicron-meta_attributes']/nvpair[@id='Omicron-
> > meta_attributes-target-ro
> > le']:  @value=Stopped
> > Jun 26 07:01:49 [4557] alpha   crmd: info:
> > abort_transition_graph:  Transition aborted by
> > Omicron-meta_attributes-target-role doing modify target-role=Stopped:
> > Configuration change | cib=1.443.0 source=te_upda
> > te_diff:444
> > path=/cib/configuration/resources/primitive[@id='Omicron']/meta_attri
> > butes[@id='Omicron-meta_attributes']/nvpair[@id='Omicron-
> > meta_attributes-target-role']
> > complete=true
> > Jun 26 07:01:49 [4557] alpha   crmd:   notice:
> > do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE |
> > input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
> > Jun 26 07:01:49 [4553] alpha stonith-ng: info:
> > update_cib_stonith_devices_v2:   Updating device list from the
> > cib: modify nvpair[@id='Omicron-meta_attributes-target-role']
> > Jun 26 07:01:49 [4553] alpha stonith-ng: info:
> > cib_devices_update:
> >  Updating devices to version 1.443.0
> > Jun 26 07:01:49 [4552] alphacib: info:
> > cib_process_request: Completed cib_apply_diff operation for section
> > 'all': OK (rc=0, origin=alpha/cibadmin/2, version=1.443.0)
> > Jun 26 07:01:49 [4553] alpha stonith-ng: info: cib_device_update:
> >  Device ipmi_alpha has been disabled on alpha: score=-INFINITY
> > Jun 26 07:01:49 [4552] alphacib: info: cib_file_backup:
> >  Archived previous version as /var/lib/pacemaker/cib/cib-83.raw
> > Jun 26 07:01:49 [4552] alphacib: info:
> > cib_file_write_with_digest:  Wrote version 1.443.0 of the CIB to disk
> > (digest: 2a60981d2eceb59a6ed3015ce20f9dff)
> > Jun 26 07:01:49 [4552] alphacib: info:
> > cib_file_write_with_digest:  Reading cluster configuration file
> > /var/lib/pacemaker/cib/cib.g9gWwY (digest:
> > /var/lib/pacemaker/cib/cib.tslkpk)
> > Jun 26 07:01:49 [4556] alphapengine: info:
> > determine_online_status_fencing: Node beta is active
> > Jun 26 07:01:49 [4556] alphapengine: info:
> > determine_online_status: Node beta is online
> > Jun 26 07:01:49 [4556] alphapengine: info:
> > determine_online_status_fencing: Node alpha is active
> > Jun 26 07:01:49 [4556] alphapengine: info:
> > determine_online_status: Node alpha is online
> > Jun 26 07:01:49 [4556] alphapengine: info:
> > determine_op_status: Operation monitor found resource Calibre active
> > on beta
> > Jun 26 07:01:49 [4556] alphapengine: info:
> > determine_op_status: Operation monitor found resource Calibre active
> > on beta
> > Jun 26 07:01:49 [4556] alphapengine: info:
> > determine_op_status: Operation monitor found resource Iota active on
> > beta
> > Jun 26 07:01:49 [4556] alphapengine: info:
> > determine_op_status: Operation monitor found resource Iota active on
> > beta
> > Jun 26 07:01:49 [4556] alphapengine: info:
> > determine_op_status: Operation monitor found resource Lapras active
> > on
> > beta
> > Jun 26 07:01:49 [4556] alphapengine: info:
> > determine_op_status: Operation monitor found resource Lapras active
> > on
> > beta
> > Jun 26 07:01:49 [4556] alphapengine:   

Re: [ClusterLabs] Stop one VM, another tries to migrate

2018-06-26 Thread Ken Gaillot
On Tue, 2018-06-26 at 07:19 -0400, Jason Gauthier wrote:
> Greetings,
> 
>    I am using my cluster platform primarily for virtual machines.
> While I've still been in implementation mode, I felt like things were
> somewhat stable. However, I've noticed that sometimes when I stop a
> resource another resource tries to migrate.   I did this morning, and
> that scenario occurred.   Basically, I 'crm resource stop Omicron',
> and the machine 'Lapras' tried to migrate as well.  I've included
> cluster logs since I can't make heads or tails of this decision.

Have a look at resource-stickiness.

Basically, the cluster will by default try to balance the number of
resources across all nodes (subject to your constraints of course).
Stickiness tells it to prefer to keep running resources where they are,
and only consider balancing when starting a resource.
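
A common way to enable this, with crm shell (the value is arbitrary; INFINITY would prevent automatic rebalancing entirely):

  crm configure rsc_defaults resource-stickiness=100
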

> 
> I've attached a cluster log, but also put it in line here since I'm
> not sure the preferred way.  This log only pertains to the actions
> since issuing the resource stop.
> 
> Jun 26 07:01:49 [4552] alphacib: info: cib_perform_op:
>  Diff: --- 1.442.64 2
> Jun 26 07:01:49 [4552] alphacib: info: cib_perform_op:
>  Diff: +++ 1.443.0 92508eef9d32f83b93e7f1ed2dff3340
> Jun 26 07:01:49 [4552] alphacib: info: cib_perform_op:
>  +  /cib:  @epoch=443, @num_updates=0
> Jun 26 07:01:49 [4552] alphacib: info: cib_perform_op:
>  +  /cib/configuration/resources/primitive[@id='Omicron']/meta_attrib
> utes[@id='Omicron-meta_attributes']/nvpair[@id='Omicron-
> meta_attributes-target-ro
> le']:  @value=Stopped
> Jun 26 07:01:49 [4557] alpha   crmd: info:
> abort_transition_graph:  Transition aborted by
> Omicron-meta_attributes-target-role doing modify target-role=Stopped:
> Configuration change | cib=1.443.0 source=te_upda
> te_diff:444
> path=/cib/configuration/resources/primitive[@id='Omicron']/meta_attri
> butes[@id='Omicron-meta_attributes']/nvpair[@id='Omicron-
> meta_attributes-target-role']
> complete=true
> Jun 26 07:01:49 [4557] alpha   crmd:   notice:
> do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE |
> input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
> Jun 26 07:01:49 [4553] alpha stonith-ng: info:
> update_cib_stonith_devices_v2:   Updating device list from the
> cib: modify nvpair[@id='Omicron-meta_attributes-target-role']
> Jun 26 07:01:49 [4553] alpha stonith-ng: info:
> cib_devices_update:
>  Updating devices to version 1.443.0
> Jun 26 07:01:49 [4552] alphacib: info:
> cib_process_request: Completed cib_apply_diff operation for section
> 'all': OK (rc=0, origin=alpha/cibadmin/2, version=1.443.0)
> Jun 26 07:01:49 [4553] alpha stonith-ng: info: cib_device_update:
>  Device ipmi_alpha has been disabled on alpha: score=-INFINITY
> Jun 26 07:01:49 [4552] alphacib: info: cib_file_backup:
>  Archived previous version as /var/lib/pacemaker/cib/cib-83.raw
> Jun 26 07:01:49 [4552] alphacib: info:
> cib_file_write_with_digest:  Wrote version 1.443.0 of the CIB to disk
> (digest: 2a60981d2eceb59a6ed3015ce20f9dff)
> Jun 26 07:01:49 [4552] alphacib: info:
> cib_file_write_with_digest:  Reading cluster configuration file
> /var/lib/pacemaker/cib/cib.g9gWwY (digest:
> /var/lib/pacemaker/cib/cib.tslkpk)
> Jun 26 07:01:49 [4556] alphapengine: info:
> determine_online_status_fencing: Node beta is active
> Jun 26 07:01:49 [4556] alphapengine: info:
> determine_online_status: Node beta is online
> Jun 26 07:01:49 [4556] alphapengine: info:
> determine_online_status_fencing: Node alpha is active
> Jun 26 07:01:49 [4556] alphapengine: info:
> determine_online_status: Node alpha is online
> Jun 26 07:01:49 [4556] alphapengine: info:
> determine_op_status: Operation monitor found resource Calibre active
> on beta
> Jun 26 07:01:49 [4556] alphapengine: info:
> determine_op_status: Operation monitor found resource Calibre active
> on beta
> Jun 26 07:01:49 [4556] alphapengine: info:
> determine_op_status: Operation monitor found resource Iota active on
> beta
> Jun 26 07:01:49 [4556] alphapengine: info:
> determine_op_status: Operation monitor found resource Iota active on
> beta
> Jun 26 07:01:49 [4556] alphapengine: info:
> determine_op_status: Operation monitor found resource Lapras active
> on
> beta
> Jun 26 07:01:49 [4556] alphapengine: info:
> determine_op_status: Operation monitor found resource Lapras active
> on
> beta
> Jun 26 07:01:49 [4556] alphapengine: info:
> determine_op_status: Operation monitor found resource Tau active on
> beta
> Jun 26 07:01:49 [4556] alphapengine: info:
> determine_op_status: Operation monitor found resource Tau active on
> beta
> Jun 26 07:01:49 [4556] alphapengine: info:
> determine_op_status: Operation monitor found resource Omicron active
> on alpha
> Jun 26 

Re: [ClusterLabs] VM failure during shutdown

2018-06-26 Thread Ken Gaillot
On Tue, 2018-06-26 at 18:24 +0300, Vaggelis Papastavros wrote:
> Many thanks for the excellent answer ,
> Ken after investigation of the log files :
> In our environment we have two drbd partitions one for customer_vms
> and on for sigma_vms 
> For the customer_vms the active node is node2 and for the sigma_vms
> the active node is node1 .
> [root@sgw-01 drbd.d]# drbdadm status
> customer_vms role:Secondary
>   disk:UpToDate
>   sgw-02 role:Primary
>     peer-disk:UpToDate
> 
> sigma_vms role:Primary
>   disk:UpToDate
>   sgw-02 role:Secondary
>     peer-disk:UpToDate
> 
> when I create a new VM, I can't force the resource creation to take
> place on a specific node; the cluster places the resource
> spontaneously on one of the two nodes (if the node happens to be the
> drbd Primary then it's ok, otherwise pacemaker raises a failure for the
> node).
> My solution is the following  :
> pcs resource create windows_VM_res VirtualDomain
> hypervisor="qemu:///system"
> config="/opt/sigma_vms/xml_definitions/windows_VM.xml" 
> (the cluster arbitrarily tries to place the above resource on node 2,
> which is currently the secondary for the corresponding partition.
> Personally, I assume that the VirtualDomain agent should be able to read
> the correct disk location from the XML definition and then find the
> correct drbd node)
> pcs constraint colocation add windows_VM_res with
> StorageDRBD_SigmaVMs INFINITY
> 
> pcs constraint order start StorageDRBD_SigmaVMs_rers then start
> windows_VM

Two things will help:

One problem is that you are creating the VM, and then later adding
constraints about what the cluster can do with it. Therefore there is a
time in between where the cluster can start it without any constraint.
The solution is to make your changes all at once. Both pcs and crm have
a way to do this; with pcs, it's:

  pcs cluster cib 
  pcs -f  ...whatever command you want...
  ...repeat...
  pcs cluster cib-push --config 

The second problem is that you have an ordering constraint but no
colocation constraint. With your current setup, windows_VM has to start
after the storage, but it doesn't have to start on the same node. You
need a colocation constraint as well, to ensure they start on the same
node.
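
Putting both suggestions together, a rough sketch (the temporary file name is arbitrary, the DRBD master/slave clone name is a guess based on the thread, and the exact cib-push syntax varies slightly between pcs versions):

  pcs cluster cib vm_cfg.xml
  pcs -f vm_cfg.xml resource create windows_VM_res VirtualDomain \
      hypervisor="qemu:///system" \
      config="/opt/sigma_vms/xml_definitions/windows_VM.xml"
  pcs -f vm_cfg.xml constraint colocation add windows_VM_res \
      with master StorageDRBD_SigmaVMs-master INFINITY
  pcs -f vm_cfg.xml constraint order promote StorageDRBD_SigmaVMs-master \
      then start windows_VM_res
  pcs cluster cib-push vm_cfg.xml --config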

> 
> pcs resource cleanup windows_VM_res
> After the above steps the VM is located on the correct node and
> everything is ok.
> 
> Is my approach correct ?
> 
> Your opinion would be valuable,
> Sincerely 
> 
> 
> On 06/25/2018 07:15 PM, Ken Gaillot wrote:
> > On Mon, 2018-06-25 at 09:47 -0500, Ken Gaillot wrote:
> > > On Mon, 2018-06-25 at 11:33 +0300, Vaggelis Papastavros wrote:
> > > > Dear friends ,
> > > > 
> > > > We have the following configuration :
> > > > 
> > > > CentOS7 , pacemaker 0.9.152 and Corosync 2.4.0, storage with
> > > > DRBD
> > > > and 
>> > > stonith enabled with APC PDU devices.
> > > > 
> > > > I have a windows VM configured as cluster resource with the
> > > > following 
> > > > attributes :
> > > > 
> > > > Resource: WindowSentinelOne_res (class=ocf provider=heartbeat 
> > > > type=VirtualDomain)
> > > > Attributes: hypervisor=qemu:///system 
> > > > config=/opt/customer_vms/conf/WindowSentinelOne/WindowSentinelO
> > > > ne.x
> > > > ml
> > > >  
> > > > migration_transport=ssh
> > > > Utilization: cpu=8 hv_memory=8192
> > > > Operations: start interval=0s timeout=120s 
> > > > (WindowSentinelOne_res-start-interval-0s)
> > > >          stop interval=0s timeout=120s 
> > > > (WindowSentinelOne_res-stop-interval-0s)
> > > >  monitor interval=10s timeout=30s 
> > > > (WindowSentinelOne_res-monitor-interval-10s)
> > > > 
> > > > under some circumstances  (which i try to identify) the VM
> > > > fails
> > > > and 
> > > > disappears under virsh list --all and also pacemaker reports
> > > > the VM
> > > > as 
> > > > stopped .
> > > > 
> > > > If run pcs resource cleanup windows_wm everything is OK, but i
> > > > can't 
> > > > identify the reason of failure.
> > > > 
> > > > For example when shutdown the VM (with windows shutdown)  the
> > > > cluster 
> > > > reports the following :
> > > > 
> > > > WindowSentinelOne_res    (ocf::heartbeat:VirtualDomain):
> > > > Started
> > > > sgw-
> > > > 02 
> > > > (failure ignored)
> > > > 
> > > > Failed Actions:
> > > > * WindowSentinelOne_res_monitor_1 on sgw-02 'not running'
> > > > (7): 
> > > > call=67, status=complete, exitreason='none',
> > > >  last-rc-change='Mon Jun 25 07:41:37 2018', queued=0ms,
> > > > exec=0ms.
> > > > 
> > > > 
> > > > My questions are
> > > > 
> > > > 1) why the VM shutdown is reported as (FailedAction) from
> > > > cluster ?
> > > > Its 
> > > > a worthy operation during VM life cycle .
> > > 
> > > Pacemaker has no way of knowing that the VM was intentionally
> > > shut
> > > down, vs crashed.
> > > 
> > > When some resource is managed by the cluster, all starts and
> > > stops of
> > > the resource have to go through the cluster. You can either set
> > > 

Re: [ClusterLabs] difference between external/ipmi and fence_ipmilan

2018-06-26 Thread Ken Gaillot
On Tue, 2018-06-26 at 12:00 +0200, Stefan K wrote:
> Hello,
> 
> can somebody tell me the difference between external/ipmi and
> fence_ipmilan? Are there preferences?
> Is one of these more common or has some advantages? 
> 
> Thanks in advance!
> best regards
> Stefan

The distinction is mostly historical. At one time, there were two
different open-source clustering environments, each with its own set of
fence agents. The community eventually settled on Pacemaker as a sort
of merged evolution of the earlier environments, and so it supports
both styles of fence agents. Thus, you often see an "external/*" agent
and a "fence_*" agent available for the same physical device.

However, they are completely different implementations, so there may be
substantive differences as well. I'm not familiar enough with these two
to address that, maybe someone else can.
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] difference between external/ipmi and fence_ipmilan

2018-06-26 Thread Digimer
On 2018-06-26 06:00 AM, Stefan K wrote:
> Hello,
> 
> can somebody tell me the difference between external/ipmi and fence_ipmilan? 
> Are there preferences?
> Is one of these more common or has some advantages? 
> 
> Thanks in advance!
> best regards
> Stefan

I believe (others can confirm) that 'external/ipmi' was used before the
fence agents were merged as part of the larger heartbeat/pacemaker and
Red Hat stack merger.

As I understand it, external/ipmi is deprecated and fence_X are the
agents that should be used now.

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Resources not monitored in SLES11 SP4 (1.1.12-f47ea56)

2018-06-26 Thread Ken Gaillot
On Tue, 2018-06-26 at 10:45 +0300, Vladislav Bogdanov wrote:
> 26.06.2018 09:14, Ulrich Windl wrote:
> > Hi!
> > 
> > We just observed a strange effect we cannot explain in SLES 11
> > SP4 (pacemaker 1.1.12-f47ea56):
> > We run about a dozen Xen PVMs on a three-node cluster (plus some
> > infrastructure and monitoring stuff). It has all worked well so far,
> > and there was no significant change recently.
> > However, when a colleague stopped one VM for maintenance via a cluster
> > command, the cluster did not notice when the PVM was actually
> > running again (it had been started outside the cluster (a bad
> > idea, I know)).
> 
> To be on the safe side in such cases you'd probably want to enable an
> additional monitor for the "Stopped" role. The default one covers only
> the "Started" role. It's the same as for multistate resources, where you
> need several monitor ops, for the "Started/Slave" and "Master" roles.
> But this will increase the load.
> And I believe the cluster should reprobe a resource on all nodes once
> you change target-role back to "Started".

Which raises the question, how did you stop the VM initially?

If you stopped it by setting target-role to Stopped, likely the cluster
still thinks it's stopped, and you need to set it to Started again. If
instead you set maintenance mode or unmanaged the resource, then
stopped the VM manually, then most likely it's still in that mode and
needs to be taken out of it.

> 
> > Examining the logs, it seems that the recheck timer popped
> > periodically, but no monitor action was run for the VM (the action
> > is configured to run every 10 minutes).
> > 
> > Actually the only monitor operations found were:
> > May 23 08:04:13
> > Jun 13 08:13:03
> > Jun 25 09:29:04
> > Then a manual "reprobe" was done, and several monitor operations
> > were run.
> > Then again I see no more monitor actions in syslog.
> > 
> > What could be the reasons for this? Too many operations defined?
> > 
> > The other message I don't understand is like ":
> > Rolling back scores from "
> > 
> > Could it be a new bug introduced in pacemaker, or could it be some
> > configuration problem (The status is completely clean however)?
> > 
> > According to the package changelog, there was no change since Nov
> > 2016...
> > 
> > Regards,
> > Ulrich
> > 
> > 
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] VM failure during shutdown

2018-06-26 Thread Emmanuel Gelati
I think that you need:

  pcs resource create windows_VM_res VirtualDomain \
      hypervisor="qemu:///system" \
      config="/opt/sigma_vms/xml_definitions/windows_VM.xml" \
      meta target-role=Stopped

This way, pacemaker doesn't start the resource.

2018-06-26 17:24 GMT+02:00 Vaggelis Papastavros :

> Many thanks for the excellent answer ,
>
> Ken after investigation of the log files :
>
> In our environment we have two drbd partitions one for customer_vms and on
> for sigma_vms
>
> For the customer_vms the active node is node2 and for the sigma_vms the
> active node is node1 .
>
> [root@sgw-01 drbd.d]# drbdadm status
>
> customer_vms role:Secondary
>   disk:UpToDate
>   sgw-02 role:Primary
> peer-disk:UpToDate
>
> sigma_vms role:Primary
>   disk:UpToDate
>   sgw-02 role:Secondary
> peer-disk:UpToDate
>
> when i create a new VM *i can't force the resource creation* to take
> place on a specific node , the cluster places the resource
>
> spontaneously on one of the two nodes (if the node happens to be the drbd
> Primary then is ok, else the pacemaker raise a failure fro the node) .
>
> My solution is the following  :
>
> pcs resource create windows_VM_res VirtualDomain
> hypervisor="qemu:///system" 
> config="/opt/sigma_vms/xml_definitions/windows_VM.xml"
>
>
> (the cluster arbitrarily try to place the above resource on node 2 who is
> currently the secondary for the corresponding partition. Personally
>
> i assume that the VirtualDomain agent should be able to read the correct
> disk location from the xml defintion and then try to find the correct drbd
> node)
>
> pcs constraint colocation add windows_VM_res with StorageDRBD_SigmaVMs
> INFINITY
>
> pcs constraint order start StorageDRBD_SigmaVMs_rers then start windows_VM
>
> pcs resource cleanup windows_VM_res
>
> After the above steps the VM is located on the correct node and everything
> is ok.
>
>
> *Is my approach correct ?*
>
>
> Your opinion would be valuable,
>
> Sincerely
>
>
>
> On 06/25/2018 07:15 PM, Ken Gaillot wrote:
>
> On Mon, 2018-06-25 at 09:47 -0500, Ken Gaillot wrote:
>
> On Mon, 2018-06-25 at 11:33 +0300, Vaggelis Papastavros wrote:
>
> Dear friends ,
>
> We have the following configuration :
>
> CentOS7 , pacemaker 0.9.152 and Corosync 2.4.0, storage with DRBD
> and
> stonith enabled with APC PDU devices.
>
> I have a windows VM configured as cluster resource with the
> following
> attributes :
>
> Resource: WindowSentinelOne_res (class=ocf provider=heartbeat
> type=VirtualDomain)
> Attributes: hypervisor=qemu:///system
> config=/opt/customer_vms/conf/WindowSentinelOne/WindowSentinelOne.x
> ml
>
> migration_transport=ssh
> Utilization: cpu=8 hv_memory=8192
> Operations: start interval=0s timeout=120s
> (WindowSentinelOne_res-start-interval-0s)
>  stop interval=0s timeout=120s
> (WindowSentinelOne_res-stop-interval-0s)
>  monitor interval=10s timeout=30s
> (WindowSentinelOne_res-monitor-interval-10s)
>
> under some circumstances  (which i try to identify) the VM fails
> and
> disappears under virsh list --all and also pacemaker reports the VM
> as
> stopped .
>
> If run pcs resource cleanup windows_wm everything is OK, but i
> can't
> identify the reason of failure.
>
> For example when shutdown the VM (with windows shutdown)  the
> cluster
> reports the following :
>
> WindowSentinelOne_res(ocf::heartbeat:VirtualDomain): Started
> sgw-
> 02
> (failure ignored)
>
> Failed Actions:
> * WindowSentinelOne_res_monitor_1 on sgw-02 'not running' (7):
> call=67, status=complete, exitreason='none',
>  last-rc-change='Mon Jun 25 07:41:37 2018', queued=0ms,
> exec=0ms.
>
>
> My questions are
>
> 1) why the VM shutdown is reported as (FailedAction) from cluster ?
> Its
> a worthy operation during VM life cycle .
>
> Pacemaker has no way of knowing that the VM was intentionally shut
> down, vs crashed.
>
> When some resource is managed by the cluster, all starts and stops of
> the resource have to go through the cluster. You can either set
> target-
> role=Stopped in the resource configuration, or if it's a temporary
> issue (e.g. rebooting for some OS updates), you could set is-
> managed=false to take it out of cluster control, do the work, then
> set
> is-managed=true again.
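
A concrete sketch of that manual-maintenance pattern with pcs, using the resource name from this thread:

  pcs resource unmanage WindowSentinelOne_res   # hand the VM over for manual work
  # ... shut down / update / restart the VM outside the cluster ...
  pcs resource manage WindowSentinelOne_res     # give control back to the cluster
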
>
> Also, a nice feature is that you can use rules to set a maintenance
> window ahead of time (especially helpful if the person who maintains
> the cluster isn't the same person who needs to do the VM updates). For
> example, you could set a rule that the resource's is-managed option
> will be false from 9pm to midnight on Fridays. See:
> http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pa
> cemaker_Explained/index.html#idm140583511697312
>
> particularly the parts about time/date expressions and using rules to
> control resource options.
>
>
> 2) why sometimes the resource is marked as stopped (the VM is
> healthy)
> and needs cleanup ?
>
> That's a problem. If the VM is truly healthy, it sounds like there's

Re: [ClusterLabs] Antw: Re: Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
Hi again,

I did another test. I modified the docker container in order to be able to run
strace. Running strace corosync-quorumtool -ps I got the following:

[attachment: corosync-quorumtool-strace.log]

I tried to understand what happens behind the scenes, but it is not easy for me.
Hoping someone on this list can help.

> On 26 Jun 2018, at 16:06, Ulrich Windl  wrote:
> 
> >>> Salvatore D'angelo  wrote on 26.06.2018 at 10:40 in message:
> > Hi,
> > Yes,
> > I am reproducing only the required part for test. I think the original
> > system has a larger shm. The problem is that I do not know exactly how to
> > change it.
> 
> If you want to go paranoid, here's a setting from a SLES11 system:
> # grep shm /etc/sysctl.conf
> kernel.shmmax = 9223372036854775807
> kernel.shmall = 1152921504606846720
> 
> [...]
> See SYSCTL(8)
> 
> Regards,
> Ulrich
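
For anyone repeating this, restricting the trace to the shared-memory related calls usually makes the failure easier to spot; an illustrative invocation:

  strace -f -e trace=open,openat,mmap,ftruncate,unlink corosync-quorumtool -ps 2>&1 | grep -i shm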
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] VM failure during shutdown

2018-06-26 Thread Vaggelis Papastavros

Many thanks for the excellent answer ,

Ken after investigation of the log files :

In our environment we have two DRBD partitions, one for customer_vms and
one for sigma_vms.


For the customer_vms the active node is node2 and for the sigma_vms the 
active node is node1 .


[root@sgw-01 drbd.d]# drbdadm status

customer_vms role:Secondary
  disk:UpToDate
  sgw-02 role:Primary
    peer-disk:UpToDate

sigma_vms role:Primary
  disk:UpToDate
  sgw-02 role:Secondary
    peer-disk:UpToDate

When I create a new VM, I can't force the resource creation to take
place on a specific node; the cluster places the resource
spontaneously on one of the two nodes (if the node happens to be the
DRBD Primary then it's ok, otherwise pacemaker raises a failure for the node).


My solution is the following  :

pcs resource create windows_VM_res VirtualDomain 
hypervisor="qemu:///system" 
config="/opt/sigma_vms/xml_definitions/windows_VM.xml"


(The cluster arbitrarily tries to place the above resource on node 2, which
is currently the secondary for the corresponding partition. Personally,
I assume that the VirtualDomain agent should be able to read the correct
disk location from the XML definition and then find the correct
DRBD node.)


pcs constraint colocation add windows_VM_res with StorageDRBD_SigmaVMs 
INFINITY


pcs constraint order start StorageDRBD_SigmaVMs_rers then start windows_VM

pcs resource cleanup windows_VM_res

After the above steps the VM is located on the correct node and 
everything is ok.



Is my approach correct?


Your opinion would be valuable,

Sincerely



On 06/25/2018 07:15 PM, Ken Gaillot wrote:

On Mon, 2018-06-25 at 09:47 -0500, Ken Gaillot wrote:

On Mon, 2018-06-25 at 11:33 +0300, Vaggelis Papastavros wrote:

Dear friends ,

We have the following configuration :

CentOS7 , pacemaker 0.9.152 and Corosync 2.4.0, storage with DRBD
and
stonith enabled with APC PDU devices.

I have a windows VM configured as cluster resource with the
following
attributes :

Resource: WindowSentinelOne_res (class=ocf provider=heartbeat
type=VirtualDomain)
Attributes: hypervisor=qemu:///system
config=/opt/customer_vms/conf/WindowSentinelOne/WindowSentinelOne.x
ml
  
migration_transport=ssh

Utilization: cpu=8 hv_memory=8192
Operations: start interval=0s timeout=120s
(WindowSentinelOne_res-start-interval-0s)
          stop interval=0s timeout=120s
(WindowSentinelOne_res-stop-interval-0s)
  monitor interval=10s timeout=30s
(WindowSentinelOne_res-monitor-interval-10s)

under some circumstances  (which i try to identify) the VM fails
and
disappears under virsh list --all and also pacemaker reports the VM
as
stopped .

If run pcs resource cleanup windows_wm everything is OK, but i
can't
identify the reason of failure.

For example when shutdown the VM (with windows shutdown)  the
cluster
reports the following :

WindowSentinelOne_res    (ocf::heartbeat:VirtualDomain): Started
sgw-
02
(failure ignored)

Failed Actions:
* WindowSentinelOne_res_monitor_1 on sgw-02 'not running' (7):
call=67, status=complete, exitreason='none',
  last-rc-change='Mon Jun 25 07:41:37 2018', queued=0ms,
exec=0ms.


My questions are

1) Why is the VM shutdown reported as a failed action by the cluster?
It's a worthy operation during the VM life cycle.

Pacemaker has no way of knowing that the VM was intentionally shut
down, vs crashed.

When some resource is managed by the cluster, all starts and stops of
the resource have to go through the cluster. You can either set
target-
role=Stopped in the resource configuration, or if it's a temporary
issue (e.g. rebooting for some OS updates), you could set is-
managed=false to take it out of cluster control, do the work, then
set
is-managed=true again.

Also, a nice feature is that you can use rules to set a maintenance
window ahead of time (especially helpful if the person who maintains
the cluster isn't the same person who needs to do the VM updates). For
example, you could set a rule that the resource's is-managed option
will be false from 9pm to midnight on Fridays. See:

http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pa
cemaker_Explained/index.html#idm140583511697312

particularly the parts about time/date expressions and using rules to
control resource options.


2) why sometimes the resource is marked as stopped (the VM is
healthy)
and needs cleanup ?

That's a problem. If the VM is truly healthy, it sounds like there's
an
issue with the resource agent. You'd have to look at the logs to see
if
it gave any more information (e.g. if it's a timeout, raising the
timeout might be sufficient).


3) I can't understand the corosync logs ... during the the VM
shutdown
corosync logs is the following

FYI, the system log will have the most important messages.
corosync.log
will additionally have info-level messages -- potentially helpful but
definitely difficult to follow.


Jun 25 07:41:37 [5140] sgw-02   crmd: info:
process_lrm_event: 

Re: [ClusterLabs] pcs 0.9.165 released

2018-06-26 Thread Tomas Jelinek

On 25.6.2018 at 21:29, Jan Pokorný wrote:

On 25/06/18 12:08 +0200, Tomas Jelinek wrote:

I am happy to announce the latest release of pcs, version 0.9.165.


What a mighty patch/micro version component ;-)

With several pacemaker 2.0 release candidates out, it would be perhaps
welcome to share details about versioning (branches) politics of pcs
regarding the supported stacks, since this is something I myself
didn't learn about until recently and only by chance...


There are two pcs branches:
* pcs-0.9 continues supporting corosync 2.x and pacemaker 1.x, no 
corosync 3.x or pacemaker 2.x support is planned

* pcs-0.10 will support corosync 3.x and pacemaker 2.x only

pcs-0.10 will be released and announced on this list when ready.


Tomas



Thanks



___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Antw: Re: Upgrade corosync problem

2018-06-26 Thread Ulrich Windl
>>> Salvatore D'angelo  wrote on 26.06.2018 at 10:40 in
message :
> Hi,
> 
> Yes,
> 
> I am reproducing only the required part for test. I think the original 
> system has a larger shm. The problem is that I do not know exactly how to 
> change it.

If you want to go paranoid, here's a setting from a SLES11 system:
# grep shm /etc/sysctl.conf
kernel.shmmax = 9223372036854775807
kernel.shmall = 1152921504606846720

[...]
See SYSCTL(8)
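
The same values can also be applied at runtime with sysctl (a sketch; persist
them in /etc/sysctl.conf as above):

  sysctl -w kernel.shmmax=9223372036854775807
  sysctl -w kernel.shmall=1152921504606846720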

Regards,
Ulrich


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
I noticed that corosync 2.4.4 depends on the following libraries:
https://launchpad.net/ubuntu/+source/corosync/2.4.4-3 


I imagine that all the corosync-* and libcorosync-* libraries are built from
the corosync build, so I should have them. Am I correct?

libcfg6
libcmap4
libcpg4
libquorum5
libsam4
libtotem-pg5
libvotequorum8

Can you tell me where these libraries come from and if I need them?
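
For reference, those libraries are all built from the corosync source package
itself, so a local build of 2.4.4 produces them; the Ubuntu binary packages just
split them out. A quick way to check what is installed and linked (package names
assumed):

  ldd $(which corosync-quorumtool)
  dpkg -l | grep -E 'corosync|libqb|libcpg|libquorum|libvotequorum|libtotem'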

> On 26 Jun 2018, at 14:08, Christine Caulfield  wrote:
> 
> On 26/06/18 12:16, Salvatore D'angelo wrote:
>> libqb update to 1.0.3 but same issue.
>> 
>> I know corosync has also these dependencies nspr and nss3. I updated
>> them using apt-get install, here the version installed:
>> 
>>libnspr4, libnspr4-dev  2:4.13.1-0ubuntu0.14.04.1
>>libnss3, libnss3-dev, libnss3-nssb   2:3.28.4-0ubuntu0.14.04.3
>> 
>> but same problem.
>> 
>> I am working on Ubuntu 14.04 image and I know that packages could be
>> quite old here. Are there new versions for these libraries?
>> Where I can download them? I tried to search on google but results where
>> quite confusing.
>> 
> 
> It's pretty unlikely to be the crypto libraries. It's almost certainly
> in libqb, with a small possibility that of corosync.  Which versions did
> you have that worked (libqb and corosync) ?
> 
> Chrissie
> 
> 
>> 
>>> On 26 Jun 2018, at 12:27, Christine Caulfield >> > wrote:
>>> 
>>> On 26/06/18 11:24, Salvatore D'angelo wrote:
 Hi,
 
 I have tried with:
 0.16.0.real-1ubuntu4
 0.16.0.real-1ubuntu5
 
 which version should I try?
>>> 
>>> 
>>> Hmm both of those are actually quite old! maybe a newer one?
>>> 
>>> Chrissie
>>> 
 
> On 26 Jun 2018, at 12:03, Christine Caulfield  
> > wrote:
> 
> On 26/06/18 11:00, Salvatore D'angelo wrote:
>> Consider that the container is the same when corosync 2.3.5 run.
>> If it is something related to the container probably the 2.4.4
>> introduced a feature that has an impact on container.
>> Should be something related to libqb according to the code.
>> Anyone can help?
>> 
> 
> 
> Have you tried downgrading libqb to the previous version to see if it
> still happens?
> 
> Chrissie
> 
>>> On 26 Jun 2018, at 11:56, Christine Caulfield >> 
>>> 
>>> > wrote:
>>> 
>>> On 26/06/18 10:35, Salvatore D'angelo wrote:
 Sorry after the command:
 
 corosync-quorumtool -ps
 
 the error in log are still visible. Looking at the source code it
 seems
 problem is at this line:
 https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
 
 if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
 fprintf(stderr, "Cannot initialize QUORUM service\n");
 q_handle = 0;
 goto out;
 }
 
 if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
 fprintf(stderr, "Cannot initialise CFG service\n");
 c_handle = 0;
 goto out;
 }
 
 The quorum_initialize function is defined here:
 https://github.com/corosync/corosync/blob/master/lib/quorum.c
 
 It seems interacts with libqb to allocate space on /dev/shm but
 something fails. I tried to update the libqb with apt-get install
 but no
 success.
 
 The same for second function:
 https://github.com/corosync/corosync/blob/master/lib/cfg.c
 
 Now I am not an expert of libqb. I have the
 version 0.16.0.real-1ubuntu5.
 
 The folder /dev/shm has 777 permission like other nodes with older
 corosync and pacemaker that work fine. The only difference is that I
 only see files created by root, no one created by hacluster like
 other
 two nodes (probably because pacemaker didn’t start correctly).
 
 This is the analysis I have done so far.
 Any suggestion?
 
 
>>> 
>>> Hmm. t seems very likely something to do with the way the container is
>>> set up then - and I know nothing about containers. Sorry :/
>>> 
>>> Can anyone else help here?
>>> 
>>> Chrissie
>>> 
> On 26 Jun 2018, at 11:03, Salvatore D'angelo
> mailto:sasadang...@gmail.com>
> 
> 
> > wrote:
> 
> Yes, sorry you’re right I could find it by myself.
> However, I did the following:
> 
> 1. Added the line you suggested to /etc/fstab
> 2. mount -o remount /dev/shm
> 3. Now I correctly see /dev/shm of 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
corosync 2.3.5 and libqb 0.16.0

> On 26 Jun 2018, at 14:08, Christine Caulfield  wrote:
> 
> On 26/06/18 12:16, Salvatore D'angelo wrote:
>> libqb update to 1.0.3 but same issue.
>> 
>> I know corosync has also these dependencies nspr and nss3. I updated
>> them using apt-get install, here the version installed:
>> 
>>libnspr4, libnspr4-dev  2:4.13.1-0ubuntu0.14.04.1
>>libnss3, libnss3-dev, libnss3-nssb   2:3.28.4-0ubuntu0.14.04.3
>> 
>> but same problem.
>> 
>> I am working on Ubuntu 14.04 image and I know that packages could be
>> quite old here. Are there new versions for these libraries?
>> Where I can download them? I tried to search on google but results where
>> quite confusing.
>> 
> 
> It's pretty unlikely to be the crypto libraries. It's almost certainly
> in libqb, with a small possibility that of corosync.  Which versions did
> you have that worked (libqb and corosync) ?
> 
> Chrissie
> 
> 
>> 
>>> On 26 Jun 2018, at 12:27, Christine Caulfield >> 
>>> >> wrote:
>>> 
>>> On 26/06/18 11:24, Salvatore D'angelo wrote:
 Hi,
 
 I have tried with:
 0.16.0.real-1ubuntu4
 0.16.0.real-1ubuntu5
 
 which version should I try?
>>> 
>>> 
>>> Hmm both of those are actually quite old! maybe a newer one?
>>> 
>>> Chrissie
>>> 
 
> On 26 Jun 2018, at 12:03, Christine Caulfield  
> >
> >> wrote:
> 
> On 26/06/18 11:00, Salvatore D'angelo wrote:
>> Consider that the container is the same when corosync 2.3.5 run.
>> If it is something related to the container probably the 2.4.4
>> introduced a feature that has an impact on container.
>> Should be something related to libqb according to the code.
>> Anyone can help?
>> 
> 
> 
> Have you tried downgrading libqb to the previous version to see if it
> still happens?
> 
> Chrissie
> 
>>> On 26 Jun 2018, at 11:56, Christine Caulfield >> 
>>> >
>>> >
>>> >> wrote:
>>> 
>>> On 26/06/18 10:35, Salvatore D'angelo wrote:
 Sorry after the command:
 
 corosync-quorumtool -ps
 
 the error in log are still visible. Looking at the source code it
 seems
 problem is at this line:
 https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
  
 
 
 if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
 fprintf(stderr, "Cannot initialize QUORUM service\n");
 q_handle = 0;
 goto out;
 }
 
 if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
 fprintf(stderr, "Cannot initialise CFG service\n");
 c_handle = 0;
 goto out;
 }
 
 The quorum_initialize function is defined here:
 https://github.com/corosync/corosync/blob/master/lib/quorum.c 
 
 
 It seems interacts with libqb to allocate space on /dev/shm but
 something fails. I tried to update the libqb with apt-get install
 but no
 success.
 
 The same for second function:
 https://github.com/corosync/corosync/blob/master/lib/cfg.c 
 
 
 Now I am not an expert of libqb. I have the
 version 0.16.0.real-1ubuntu5.
 
 The folder /dev/shm has 777 permission like other nodes with older
 corosync and pacemaker that work fine. The only difference is that I
 only see files created by root, no one created by hacluster like
 other
 two nodes (probably because pacemaker didn’t start correctly).
 
 This is the analysis I have done so far.
 Any suggestion?
 
 
>>> 
>>> Hmm. t seems very likely something to do with the way the container is
>>> set up then - and I know nothing about containers. Sorry :/
>>> 
>>> Can anyone else help here?
>>> 
>>> Chrissie
>>> 
> On 26 Jun 2018, at 11:03, Salvatore D'angelo
> mailto:sasadang...@gmail.com> 
> >
> >
> >
> >> 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Christine Caulfield
On 26/06/18 12:16, Salvatore D'angelo wrote:
> libqb update to 1.0.3 but same issue.
> 
> I know corosync has also these dependencies nspr and nss3. I updated
> them using apt-get install, here the version installed:
> 
>    libnspr4, libnspr4-dev  2:4.13.1-0ubuntu0.14.04.1
>    libnss3, libnss3-dev, libnss3-nssb   2:3.28.4-0ubuntu0.14.04.3
> 
> but same problem.
> 
> I am working on Ubuntu 14.04 image and I know that packages could be
> quite old here. Are there new versions for these libraries?
> Where I can download them? I tried to search on google but results where
> quite confusing.
> 

It's pretty unlikely to be the crypto libraries. It's almost certainly
in libqb, with a small possibility that of corosync.  Which versions did
you have that worked (libqb and corosync) ?

Chrissie


> 
>> On 26 Jun 2018, at 12:27, Christine Caulfield > > wrote:
>>
>> On 26/06/18 11:24, Salvatore D'angelo wrote:
>>> Hi,
>>>
>>> I have tried with:
>>> 0.16.0.real-1ubuntu4
>>> 0.16.0.real-1ubuntu5
>>>
>>> which version should I try?
>>
>>
>> Hmm both of those are actually quite old! maybe a newer one?
>>
>> Chrissie
>>
>>>
 On 26 Jun 2018, at 12:03, Christine Caulfield >>> 
 > wrote:

 On 26/06/18 11:00, Salvatore D'angelo wrote:
> Consider that the container is the same when corosync 2.3.5 run.
> If it is something related to the container probably the 2.4.4
> introduced a feature that has an impact on container.
> Should be something related to libqb according to the code.
> Anyone can help?
>


 Have you tried downgrading libqb to the previous version to see if it
 still happens?

 Chrissie

>> On 26 Jun 2018, at 11:56, Christine Caulfield > 
>> 
>> > wrote:
>>
>> On 26/06/18 10:35, Salvatore D'angelo wrote:
>>> Sorry after the command:
>>>
>>> corosync-quorumtool -ps
>>>
>>> the error in log are still visible. Looking at the source code it
>>> seems
>>> problem is at this line:
>>> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>>>
>>>     if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
>>> fprintf(stderr, "Cannot initialize QUORUM service\n");
>>> q_handle = 0;
>>> goto out;
>>> }
>>>
>>> if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
>>> fprintf(stderr, "Cannot initialise CFG service\n");
>>> c_handle = 0;
>>> goto out;
>>> }
>>>
>>> The quorum_initialize function is defined here:
>>> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>>>
>>> It seems interacts with libqb to allocate space on /dev/shm but
>>> something fails. I tried to update the libqb with apt-get install
>>> but no
>>> success.
>>>
>>> The same for second function:
>>> https://github.com/corosync/corosync/blob/master/lib/cfg.c
>>>
>>> Now I am not an expert of libqb. I have the
>>> version 0.16.0.real-1ubuntu5.
>>>
>>> The folder /dev/shm has 777 permission like other nodes with older
>>> corosync and pacemaker that work fine. The only difference is that I
>>> only see files created by root, no one created by hacluster like
>>> other
>>> two nodes (probably because pacemaker didn’t start correctly).
>>>
>>> This is the analysis I have done so far.
>>> Any suggestion?
>>>
>>>
>>
>> Hmm. t seems very likely something to do with the way the container is
>> set up then - and I know nothing about containers. Sorry :/
>>
>> Can anyone else help here?
>>
>> Chrissie
>>
 On 26 Jun 2018, at 11:03, Salvatore D'angelo
 mailto:sasadang...@gmail.com>
 
 
 > wrote:

 Yes, sorry you’re right I could find it by myself.
 However, I did the following:

 1. Added the line you suggested to /etc/fstab
 2. mount -o remount /dev/shm
 3. Now I correctly see /dev/shm of 512M with df -h
 Filesystem      Size  Used Avail Use% Mounted on
 overlay          63G   11G   49G  19% /
 tmpfs            64M  4.0K   64M   1% /dev
 tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
 osxfs           466G  158G  305G  35% /Users
 /dev/sda1        63G   11G   49G  19% /etc/hosts
 shm             512M   15M  498M   3% /dev/shm
 tmpfs          1000M     0 1000M   0% /sys/firmware
 tmpfs           128M     0  128M   0% /tmp

 The errors in log went away. Consider that I remove the log file
 before start corosync so it does not contains lines of previous
 

[ClusterLabs] Stop one VM, another tries to migrate

2018-06-26 Thread Jason Gauthier
Greetings,

   I am using my cluster platform primarily for virtual machines.
While I've still been in implementation mode, I felt like things were
somewhat stable. However, I've noticed that sometimes when I stop a
resource another resource tries to migrate.   I did this morning, and
that scenario occurred.   Basically, I 'crm resource stop Omicron',
and the machine 'Lapras' tried to migrate as well.  I've included
cluster logs since I can't make heads or tails of this decision.

I've attached a cluster log, but also put it in line here since I'm
not sure the preferred way.  This log only pertains to the actions
since issuing the resource stop.

Jun 26 07:01:49 [4552] alpha cib: info: cib_perform_op: Diff: --- 1.442.64 2
Jun 26 07:01:49 [4552] alpha cib: info: cib_perform_op: Diff: +++ 1.443.0 92508eef9d32f83b93e7f1ed2dff3340
Jun 26 07:01:49 [4552] alpha cib: info: cib_perform_op: +  /cib:  @epoch=443, @num_updates=0
Jun 26 07:01:49 [4552] alpha cib: info: cib_perform_op: +  /cib/configuration/resources/primitive[@id='Omicron']/meta_attributes[@id='Omicron-meta_attributes']/nvpair[@id='Omicron-meta_attributes-target-role']:  @value=Stopped
Jun 26 07:01:49 [4557] alpha crmd: info: abort_transition_graph: Transition aborted by Omicron-meta_attributes-target-role doing modify target-role=Stopped: Configuration change | cib=1.443.0 source=te_update_diff:444 path=/cib/configuration/resources/primitive[@id='Omicron']/meta_attributes[@id='Omicron-meta_attributes']/nvpair[@id='Omicron-meta_attributes-target-role'] complete=true
Jun 26 07:01:49 [4557] alpha crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
Jun 26 07:01:49 [4553] alpha stonith-ng: info: update_cib_stonith_devices_v2: Updating device list from the cib: modify nvpair[@id='Omicron-meta_attributes-target-role']
Jun 26 07:01:49 [4553] alpha stonith-ng: info: cib_devices_update: Updating devices to version 1.443.0
Jun 26 07:01:49 [4552] alpha cib: info: cib_process_request: Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=alpha/cibadmin/2, version=1.443.0)
Jun 26 07:01:49 [4553] alpha stonith-ng: info: cib_device_update: Device ipmi_alpha has been disabled on alpha: score=-INFINITY
Jun 26 07:01:49 [4552] alpha cib: info: cib_file_backup: Archived previous version as /var/lib/pacemaker/cib/cib-83.raw
Jun 26 07:01:49 [4552] alpha cib: info: cib_file_write_with_digest: Wrote version 1.443.0 of the CIB to disk (digest: 2a60981d2eceb59a6ed3015ce20f9dff)
Jun 26 07:01:49 [4552] alpha cib: info: cib_file_write_with_digest: Reading cluster configuration file /var/lib/pacemaker/cib/cib.g9gWwY (digest: /var/lib/pacemaker/cib/cib.tslkpk)
Jun 26 07:01:49 [4556] alpha pengine: info: determine_online_status_fencing: Node beta is active
Jun 26 07:01:49 [4556] alpha pengine: info: determine_online_status: Node beta is online
Jun 26 07:01:49 [4556] alpha pengine: info: determine_online_status_fencing: Node alpha is active
Jun 26 07:01:49 [4556] alpha pengine: info: determine_online_status: Node alpha is online
Jun 26 07:01:49 [4556] alpha pengine: info: determine_op_status: Operation monitor found resource Calibre active on beta
Jun 26 07:01:49 [4556] alpha pengine: info: determine_op_status: Operation monitor found resource Calibre active on beta
Jun 26 07:01:49 [4556] alpha pengine: info: determine_op_status: Operation monitor found resource Iota active on beta
Jun 26 07:01:49 [4556] alpha pengine: info: determine_op_status: Operation monitor found resource Iota active on beta
Jun 26 07:01:49 [4556] alpha pengine: info: determine_op_status: Operation monitor found resource Lapras active on beta
Jun 26 07:01:49 [4556] alpha pengine: info: determine_op_status: Operation monitor found resource Lapras active on beta
Jun 26 07:01:49 [4556] alpha pengine: info: determine_op_status: Operation monitor found resource Tau active on beta
Jun 26 07:01:49 [4556] alpha pengine: info: determine_op_status: Operation monitor found resource Tau active on beta
Jun 26 07:01:49 [4556] alpha pengine: info: determine_op_status: Operation monitor found resource Omicron active on alpha
Jun 26 07:01:49 [4556] alpha pengine: info: determine_op_status: Operation monitor found resource Omicron active on alpha
Jun 26 07:01:49 [4556] alpha pengine: info: determine_op_status: Operation monitor found resource Plex active on alpha
Jun 26 07:01:49 [4556] alpha pengine: info: determine_op_status: Operation monitor found resource Plex active on alpha
Jun 26 07:01:49 [4556] alpha pengine: info: determine_op_status: Operation monitor found resource Umbreon active on alpha
Jun 26 07:01:49 [4556] alpha pengine: 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
I updated libqb to 1.0.3, but the same issue remains.

I know corosync also has these dependencies: nspr and nss3. I updated them using 
apt-get install; here are the versions installed:

   libnspr4, libnspr4-dev   2:4.13.1-0ubuntu0.14.04.1
   libnss3, libnss3-dev, libnss3-nssb   2:3.28.4-0ubuntu0.14.04.3

but same problem.

I am working on an Ubuntu 14.04 image and I know that packages could be quite old 
here. Are there newer versions of these libraries?
Where can I download them? I tried to search on Google but the results were quite 
confusing.
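
For what it's worth, a quick sketch of how to check what the configured Ubuntu repositories actually offer (the package names are taken from the list above; apt-cache is assumed to be available in the container):

apt-get update
# show installed and candidate versions for the relevant packages
apt-cache policy libnspr4 libnspr4-dev libnss3 libnss3-dev libqb0 libqb-dev
# list every version a repository provides for a single package
apt-cache madison libnss3

If apt shows no newer candidate, the remaining options are a newer Ubuntu release or building the library from upstream source.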


> On 26 Jun 2018, at 12:27, Christine Caulfield  wrote:
> 
> On 26/06/18 11:24, Salvatore D'angelo wrote:
>> Hi,
>> 
>> I have tried with:
>> 0.16.0.real-1ubuntu4
>> 0.16.0.real-1ubuntu5
>> 
>> which version should I try?
> 
> 
> Hmm both of those are actually quite old! maybe a newer one?
> 
> Chrissie
> 
>> 
>>> On 26 Jun 2018, at 12:03, Christine Caulfield >> > wrote:
>>> 
>>> On 26/06/18 11:00, Salvatore D'angelo wrote:
 Consider that the container is the same when corosync 2.3.5 run.
 If it is something related to the container probably the 2.4.4
 introduced a feature that has an impact on container.
 Should be something related to libqb according to the code.
 Anyone can help?
 
>>> 
>>> 
>>> Have you tried downgrading libqb to the previous version to see if it
>>> still happens?
>>> 
>>> Chrissie
>>> 
> On 26 Jun 2018, at 11:56, Christine Caulfield  
> > wrote:
> 
> On 26/06/18 10:35, Salvatore D'angelo wrote:
>> Sorry after the command:
>> 
>> corosync-quorumtool -ps
>> 
>> the error in log are still visible. Looking at the source code it seems
>> problem is at this line:
>> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>> 
>> if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
>> fprintf(stderr, "Cannot initialize QUORUM service\n");
>> q_handle = 0;
>> goto out;
>> }
>> 
>> if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
>> fprintf(stderr, "Cannot initialise CFG service\n");
>> c_handle = 0;
>> goto out;
>> }
>> 
>> The quorum_initialize function is defined here:
>> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>> 
>> It seems interacts with libqb to allocate space on /dev/shm but
>> something fails. I tried to update the libqb with apt-get install
>> but no
>> success.
>> 
>> The same for second function:
>> https://github.com/corosync/corosync/blob/master/lib/cfg.c
>> 
>> Now I am not an expert of libqb. I have the
>> version 0.16.0.real-1ubuntu5.
>> 
>> The folder /dev/shm has 777 permission like other nodes with older
>> corosync and pacemaker that work fine. The only difference is that I
>> only see files created by root, no one created by hacluster like other
>> two nodes (probably because pacemaker didn’t start correctly).
>> 
>> This is the analysis I have done so far.
>> Any suggestion?
>> 
>> 
> 
> Hmm. It seems very likely something to do with the way the container is
> set up then - and I know nothing about containers. Sorry :/
> 
> Can anyone else help here?
> 
> Chrissie
> 
>>> On 26 Jun 2018, at 11:03, Salvatore D'angelo
>>> mailto:sasadang...@gmail.com>
>>> 
>>> > wrote:
>>> 
>>> Yes, sorry you’re right I could find it by myself.
>>> However, I did the following:
>>> 
>>> 1. Added the line you suggested to /etc/fstab
>>> 2. mount -o remount /dev/shm
>>> 3. Now I correctly see /dev/shm of 512M with df -h
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> overlay  63G   11G   49G  19% /
>>> tmpfs64M  4.0K   64M   1% /dev
>>> tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
>>> osxfs   466G  158G  305G  35% /Users
>>> /dev/sda163G   11G   49G  19% /etc/hosts
>>> *shm 512M   15M  498M   3% /dev/shm*
>>> tmpfs  1000M 0 1000M   0% /sys/firmware
>>> tmpfs   128M 0  128M   0% /tmp
>>> 
>>> The errors in log went away. Consider that I remove the log file
>>> before start corosync so it does not contains lines of previous
>>> executions.
>>> 
>>> 
>>> But the command:
>>> corosync-quorumtool -ps
>>> 
>>> still give:
>>> Cannot initialize QUORUM service
>>> 
>>> Consider that few minutes before it gave me the message:
>>> Cannot initialize CFG service
>>> 
>>> I do not know the differences between CFG and QUORUM in this case.
>>> 
>>> If I try to start pacemaker the service is OK but I see only pacemaker
>>> and the Transport does not work if I try to run a cam command.

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Christine Caulfield
On 26/06/18 11:24, Salvatore D'angelo wrote:
> Hi,
> 
> I have tried with:
> 0.16.0.real-1ubuntu4
> 0.16.0.real-1ubuntu5
> 
> which version should I try?


Hmm, both of those are actually quite old! Maybe a newer one?

Chrissie

> 
>> On 26 Jun 2018, at 12:03, Christine Caulfield > > wrote:
>>
>> On 26/06/18 11:00, Salvatore D'angelo wrote:
>>> Consider that the container is the same when corosync 2.3.5 run.
>>> If it is something related to the container probably the 2.4.4
>>> introduced a feature that has an impact on container.
>>> Should be something related to libqb according to the code.
>>> Anyone can help?
>>>
>>
>>
>> Have you tried downgrading libqb to the previous version to see if it
>> still happens?
>>
>> Chrissie
>>
 On 26 Jun 2018, at 11:56, Christine Caulfield >>> 
 > wrote:

 On 26/06/18 10:35, Salvatore D'angelo wrote:
> Sorry after the command:
>
> corosync-quorumtool -ps
>
> the error in log are still visible. Looking at the source code it seems
> problem is at this line:
> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>
>     if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
> fprintf(stderr, "Cannot initialize QUORUM service\n");
> q_handle = 0;
> goto out;
> }
>
> if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
> fprintf(stderr, "Cannot initialise CFG service\n");
> c_handle = 0;
> goto out;
> }
>
> The quorum_initialize function is defined here:
> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>
> It seems interacts with libqb to allocate space on /dev/shm but
> something fails. I tried to update the libqb with apt-get install
> but no
> success.
>
> The same for second function:
> https://github.com/corosync/corosync/blob/master/lib/cfg.c
>
> Now I am not an expert of libqb. I have the
> version 0.16.0.real-1ubuntu5.
>
> The folder /dev/shm has 777 permission like other nodes with older
> corosync and pacemaker that work fine. The only difference is that I
> only see files created by root, no one created by hacluster like other
> two nodes (probably because pacemaker didn’t start correctly).
>
> This is the analysis I have done so far.
> Any suggestion?
>
>

 Hmm. It seems very likely something to do with the way the container is
 set up then - and I know nothing about containers. Sorry :/

 Can anyone else help here?

 Chrissie

>> On 26 Jun 2018, at 11:03, Salvatore D'angelo
>> mailto:sasadang...@gmail.com>
>> 
>> > wrote:
>>
>> Yes, sorry you’re right I could find it by myself.
>> However, I did the following:
>>
>> 1. Added the line you suggested to /etc/fstab
>> 2. mount -o remount /dev/shm
>> 3. Now I correctly see /dev/shm of 512M with df -h
>> Filesystem      Size  Used Avail Use% Mounted on
>> overlay          63G   11G   49G  19% /
>> tmpfs            64M  4.0K   64M   1% /dev
>> tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
>> osxfs           466G  158G  305G  35% /Users
>> /dev/sda1        63G   11G   49G  19% /etc/hosts
>> *shm             512M   15M  498M   3% /dev/shm*
>> tmpfs          1000M     0 1000M   0% /sys/firmware
>> tmpfs           128M     0  128M   0% /tmp
>>
>> The errors in log went away. Consider that I remove the log file
>> before start corosync so it does not contains lines of previous
>> executions.
>> 
>>
>> But the command:
>> corosync-quorumtool -ps
>>
>> still give:
>> Cannot initialize QUORUM service
>>
>> Consider that few minutes before it gave me the message:
>> Cannot initialize CFG service
>>
>> I do not know the differences between CFG and QUORUM in this case.
>>
>> If I try to start pacemaker the service is OK but I see only pacemaker
>> and the Transport does not work if I try to run a cam command.
>> Any suggestion?
>>
>>
>>> On 26 Jun 2018, at 10:49, Christine Caulfield
>>> mailto:ccaul...@redhat.com>
>>> 
>>> > wrote:
>>>
>>> On 26/06/18 09:40, Salvatore D'angelo wrote:
 Hi,

 Yes,

 I am reproducing only the required part for test. I think the
 original
 system has a larger shm. The problem is that I do not know
 exactly how
 to change it.
 I tried the following steps, but I have the impression I didn’t
 performed the right one:

 1. remove everything under /tmp
 2. Added the following line to /etc/fstab
 tmpfs   /tmp         tmpfs  

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
Hi,

I have tried with:
0.16.0.real-1ubuntu4
0.16.0.real-1ubuntu5

which version should I try?

> On 26 Jun 2018, at 12:03, Christine Caulfield  wrote:
> 
> On 26/06/18 11:00, Salvatore D'angelo wrote:
>> Consider that the container is the same when corosync 2.3.5 run.
>> If it is something related to the container probably the 2.4.4
>> introduced a feature that has an impact on container.
>> Should be something related to libqb according to the code.
>> Anyone can help?
>> 
> 
> 
> Have you tried downgrading libqb to the previous version to see if it
> still happens?
> 
> Chrissie
> 
>>> On 26 Jun 2018, at 11:56, Christine Caulfield >> > wrote:
>>> 
>>> On 26/06/18 10:35, Salvatore D'angelo wrote:
 Sorry after the command:
 
 corosync-quorumtool -ps
 
 the error in log are still visible. Looking at the source code it seems
 problem is at this line:
 https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
 
 if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
 fprintf(stderr, "Cannot initialize QUORUM service\n");
 q_handle = 0;
 goto out;
 }
 
 if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
 fprintf(stderr, "Cannot initialise CFG service\n");
 c_handle = 0;
 goto out;
 }
 
 The quorum_initialize function is defined here:
 https://github.com/corosync/corosync/blob/master/lib/quorum.c
 
 It seems interacts with libqb to allocate space on /dev/shm but
 something fails. I tried to update the libqb with apt-get install but no
 success.
 
 The same for second function:
 https://github.com/corosync/corosync/blob/master/lib/cfg.c
 
 Now I am not an expert of libqb. I have the version 0.16.0.real-1ubuntu5.
 
 The folder /dev/shm has 777 permission like other nodes with older
 corosync and pacemaker that work fine. The only difference is that I
 only see files created by root, no one created by hacluster like other
 two nodes (probably because pacemaker didn’t start correctly).
 
 This is the analysis I have done so far.
 Any suggestion?
 
 
>>> 
>>> Hmm. It seems very likely something to do with the way the container is
>>> set up then - and I know nothing about containers. Sorry :/
>>> 
>>> Can anyone else help here?
>>> 
>>> Chrissie
>>> 
> On 26 Jun 2018, at 11:03, Salvatore D'angelo  
> > wrote:
> 
> Yes, sorry you’re right I could find it by myself.
> However, I did the following:
> 
> 1. Added the line you suggested to /etc/fstab
> 2. mount -o remount /dev/shm
> 3. Now I correctly see /dev/shm of 512M with df -h
> Filesystem  Size  Used Avail Use% Mounted on
> overlay  63G   11G   49G  19% /
> tmpfs64M  4.0K   64M   1% /dev
> tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
> osxfs   466G  158G  305G  35% /Users
> /dev/sda163G   11G   49G  19% /etc/hosts
> *shm 512M   15M  498M   3% /dev/shm*
> tmpfs  1000M 0 1000M   0% /sys/firmware
> tmpfs   128M 0  128M   0% /tmp
> 
> The errors in log went away. Consider that I remove the log file
> before start corosync so it does not contains lines of previous
> executions.
> 
> 
> But the command:
> corosync-quorumtool -ps
> 
> still give:
> Cannot initialize QUORUM service
> 
> Consider that few minutes before it gave me the message:
> Cannot initialize CFG service
> 
> I do not know the differences between CFG and QUORUM in this case.
> 
> If I try to start pacemaker the service is OK but I see only pacemaker
> and the Transport does not work if I try to run a cam command.
> Any suggestion?
> 
> 
>> On 26 Jun 2018, at 10:49, Christine Caulfield > 
>> > wrote:
>> 
>> On 26/06/18 09:40, Salvatore D'angelo wrote:
>>> Hi,
>>> 
>>> Yes,
>>> 
>>> I am reproducing only the required part for test. I think the original
>>> system has a larger shm. The problem is that I do not know exactly how
>>> to change it.
>>> I tried the following steps, but I have the impression I didn’t
>>> performed the right one:
>>> 
>>> 1. remove everything under /tmp
>>> 2. Added the following line to /etc/fstab
>>> tmpfs   /tmp tmpfs  
>>> defaults,nodev,nosuid,mode=1777,size=128M 
>>> 0  0
>>> 3. mount /tmp
>>> 4. df -h
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> overlay  63G   11G   49G  19% /
>>> tmpfs64M  4.0K   64M   1% /dev
>>> tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
>>> osxfs   466G  158G  305G  35% /Users
>>> /dev/sda1  

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Christine Caulfield
On 26/06/18 11:00, Salvatore D'angelo wrote:
> Consider that the container is the same when corosync 2.3.5 run.
> If it is something related to the container probably the 2.4.4
> introduced a feature that has an impact on container.
> Should be something related to libqb according to the code.
> Anyone can help?
> 


Have you tried downgrading libqb to the previous version to see if it
still happens?

Chrissie

>> On 26 Jun 2018, at 11:56, Christine Caulfield > > wrote:
>>
>> On 26/06/18 10:35, Salvatore D'angelo wrote:
>>> Sorry after the command:
>>>
>>> corosync-quorumtool -ps
>>>
>>> the error in log are still visible. Looking at the source code it seems
>>> problem is at this line:
>>> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>>>
>>>     if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
>>> fprintf(stderr, "Cannot initialize QUORUM service\n");
>>> q_handle = 0;
>>> goto out;
>>> }
>>>
>>> if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
>>> fprintf(stderr, "Cannot initialise CFG service\n");
>>> c_handle = 0;
>>> goto out;
>>> }
>>>
>>> The quorum_initialize function is defined here:
>>> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>>>
>>> It seems interacts with libqb to allocate space on /dev/shm but
>>> something fails. I tried to update the libqb with apt-get install but no
>>> success.
>>>
>>> The same for second function:
>>> https://github.com/corosync/corosync/blob/master/lib/cfg.c
>>>
>>> Now I am not an expert of libqb. I have the version 0.16.0.real-1ubuntu5.
>>>
>>> The folder /dev/shm has 777 permission like other nodes with older
>>> corosync and pacemaker that work fine. The only difference is that I
>>> only see files created by root, no one created by hacluster like other
>>> two nodes (probably because pacemaker didn’t start correctly).
>>>
>>> This is the analysis I have done so far.
>>> Any suggestion?
>>>
>>>
>>
>> Hmm. It seems very likely something to do with the way the container is
>> set up then - and I know nothing about containers. Sorry :/
>>
>> Can anyone else help here?
>>
>> Chrissie
>>
 On 26 Jun 2018, at 11:03, Salvatore D'angelo >>> 
 > wrote:

 Yes, sorry you’re right I could find it by myself.
 However, I did the following:

 1. Added the line you suggested to /etc/fstab
 2. mount -o remount /dev/shm
 3. Now I correctly see /dev/shm of 512M with df -h
 Filesystem      Size  Used Avail Use% Mounted on
 overlay          63G   11G   49G  19% /
 tmpfs            64M  4.0K   64M   1% /dev
 tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
 osxfs           466G  158G  305G  35% /Users
 /dev/sda1        63G   11G   49G  19% /etc/hosts
 *shm             512M   15M  498M   3% /dev/shm*
 tmpfs          1000M     0 1000M   0% /sys/firmware
 tmpfs           128M     0  128M   0% /tmp

 The errors in log went away. Consider that I remove the log file
 before start corosync so it does not contains lines of previous
 executions.
 

 But the command:
 corosync-quorumtool -ps

 still give:
 Cannot initialize QUORUM service

 Consider that few minutes before it gave me the message:
 Cannot initialize CFG service

 I do not know the differences between CFG and QUORUM in this case.

 If I try to start pacemaker the service is OK but I see only pacemaker
 and the Transport does not work if I try to run a cam command.
 Any suggestion?


> On 26 Jun 2018, at 10:49, Christine Caulfield  
> > wrote:
>
> On 26/06/18 09:40, Salvatore D'angelo wrote:
>> Hi,
>>
>> Yes,
>>
>> I am reproducing only the required part for test. I think the original
>> system has a larger shm. The problem is that I do not know exactly how
>> to change it.
>> I tried the following steps, but I have the impression I didn’t
>> performed the right one:
>>
>> 1. remove everything under /tmp
>> 2. Added the following line to /etc/fstab
>> tmpfs   /tmp         tmpfs  
>> defaults,nodev,nosuid,mode=1777,size=128M 
>>         0  0
>> 3. mount /tmp
>> 4. df -h
>> Filesystem      Size  Used Avail Use% Mounted on
>> overlay          63G   11G   49G  19% /
>> tmpfs            64M  4.0K   64M   1% /dev
>> tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
>> osxfs           466G  158G  305G  35% /Users
>> /dev/sda1        63G   11G   49G  19% /etc/hosts
>> shm              64M   11M   54M  16% /dev/shm
>> tmpfs          1000M     0 1000M   0% /sys/firmware
>> *tmpfs           128M     0  128M   0% /tmp*
>>
>> The errors are exactly the same.
>> I have the impression that I changed the wrong parameter. Probably I

[ClusterLabs] difference between external/ipmi and fence_ipmilan

2018-06-26 Thread Stefan K
Hello,

Can somebody tell me the difference between external/ipmi and fence_ipmilan? 
Are there preferences?
Is one of these more common, or does one have some advantages? 

Thanks in advance!
best regards
Stefan
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
Consider that the container is the same one where corosync 2.3.5 ran.
If it is something related to the container, probably 2.4.4 introduced a 
feature that has an impact on containers.
It should be something related to libqb, according to the code.
Can anyone help?
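
If the suspicion is a libqb change, a sketch of pinning libqb back to the distro build on Ubuntu (the libqb0/libqb-dev package names and the version string are assumptions based on the versions mentioned in this thread):

# reinstall the distro libqb and keep apt from upgrading it again
apt-get install libqb0=0.16.0.real-1ubuntu4 libqb-dev=0.16.0.real-1ubuntu4
apt-mark hold libqb0 libqb-dev
# then rebuild/reinstall corosync 2.4.4 so it links against this libqb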

> On 26 Jun 2018, at 11:56, Christine Caulfield  wrote:
> 
> On 26/06/18 10:35, Salvatore D'angelo wrote:
>> Sorry after the command:
>> 
>> corosync-quorumtool -ps
>> 
>> the error in log are still visible. Looking at the source code it seems
>> problem is at this line:
>> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>> 
>> if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
>> fprintf(stderr, "Cannot initialize QUORUM service\n");
>> q_handle = 0;
>> goto out;
>> }
>> 
>> if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
>> fprintf(stderr, "Cannot initialise CFG service\n");
>> c_handle = 0;
>> goto out;
>> }
>> 
>> The quorum_initialize function is defined here:
>> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>> 
>> It seems interacts with libqb to allocate space on /dev/shm but
>> something fails. I tried to update the libqb with apt-get install but no
>> success.
>> 
>> The same for second function:
>> https://github.com/corosync/corosync/blob/master/lib/cfg.c
>> 
>> Now I am not an expert of libqb. I have the version 0.16.0.real-1ubuntu5.
>> 
>> The folder /dev/shm has 777 permission like other nodes with older
>> corosync and pacemaker that work fine. The only difference is that I
>> only see files created by root, no one created by hacluster like other
>> two nodes (probably because pacemaker didn’t start correctly).
>> 
>> This is the analysis I have done so far.
>> Any suggestion?
>> 
>> 
> 
> Hmm. It seems very likely something to do with the way the container is
> set up then - and I know nothing about containers. Sorry :/
> 
> Can anyone else help here?
> 
> Chrissie
> 
>>> On 26 Jun 2018, at 11:03, Salvatore D'angelo >> 
>>> >> wrote:
>>> 
>>> Yes, sorry you’re right I could find it by myself.
>>> However, I did the following:
>>> 
>>> 1. Added the line you suggested to /etc/fstab
>>> 2. mount -o remount /dev/shm
>>> 3. Now I correctly see /dev/shm of 512M with df -h
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> overlay  63G   11G   49G  19% /
>>> tmpfs64M  4.0K   64M   1% /dev
>>> tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
>>> osxfs   466G  158G  305G  35% /Users
>>> /dev/sda163G   11G   49G  19% /etc/hosts
>>> *shm 512M   15M  498M   3% /dev/shm*
>>> tmpfs  1000M 0 1000M   0% /sys/firmware
>>> tmpfs   128M 0  128M   0% /tmp
>>> 
>>> The errors in log went away. Consider that I remove the log file
>>> before start corosync so it does not contains lines of previous
>>> executions.
>>> 
>>> 
>>> But the command:
>>> corosync-quorumtool -ps
>>> 
>>> still give:
>>> Cannot initialize QUORUM service
>>> 
>>> Consider that few minutes before it gave me the message:
>>> Cannot initialize CFG service
>>> 
>>> I do not know the differences between CFG and QUORUM in this case.
>>> 
>>> If I try to start pacemaker the service is OK but I see only pacemaker
>>> and the Transport does not work if I try to run a cam command.
>>> Any suggestion?
>>> 
>>> 
 On 26 Jun 2018, at 10:49, Christine Caulfield >>> 
 >> wrote:
 
 On 26/06/18 09:40, Salvatore D'angelo wrote:
> Hi,
> 
> Yes,
> 
> I am reproducing only the required part for test. I think the original
> system has a larger shm. The problem is that I do not know exactly how
> to change it.
> I tried the following steps, but I have the impression I didn’t
> performed the right one:
> 
> 1. remove everything under /tmp
> 2. Added the following line to /etc/fstab
> tmpfs   /tmp tmpfs   defaults,nodev,nosuid,mode=1777,size=128M 
> 0  0
> 3. mount /tmp
> 4. df -h
> Filesystem  Size  Used Avail Use% Mounted on
> overlay  63G   11G   49G  19% /
> tmpfs64M  4.0K   64M   1% /dev
> tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
> osxfs   466G  158G  305G  35% /Users
> /dev/sda163G   11G   49G  19% /etc/hosts
> shm  64M   11M   54M  16% /dev/shm
> tmpfs  1000M 0 1000M   0% /sys/firmware
> *tmpfs   128M 0  128M   0% /tmp*
> 
> The errors are exactly the same.
> I have the impression that I changed the wrong parameter. Probably I
> have to change:
> shm  64M   11M   54M  16% /dev/shm
> 
> but I do not know how to do that. Any suggestion?
> 
 
 According to google, you just add a new line to /etc/fstab for /dev/shm

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Christine Caulfield
On 26/06/18 10:35, Salvatore D'angelo wrote:
> Sorry after the command:
> 
> corosync-quorumtool -ps
> 
> the error in log are still visible. Looking at the source code it seems
> problem is at this line:
> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
> 
>     if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
> fprintf(stderr, "Cannot initialize QUORUM service\n");
> q_handle = 0;
> goto out;
> }
> 
> if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
> fprintf(stderr, "Cannot initialise CFG service\n");
> c_handle = 0;
> goto out;
> }
> 
> The quorum_initialize function is defined here:
> https://github.com/corosync/corosync/blob/master/lib/quorum.c
> 
> It seems interacts with libqb to allocate space on /dev/shm but
> something fails. I tried to update the libqb with apt-get install but no
> success.
> 
> The same for second function:
> https://github.com/corosync/corosync/blob/master/lib/cfg.c
> 
> Now I am not an expert of libqb. I have the version 0.16.0.real-1ubuntu5.
> 
> The folder /dev/shm has 777 permission like other nodes with older
> corosync and pacemaker that work fine. The only difference is that I
> only see files created by root, no one created by hacluster like other
> two nodes (probably because pacemaker didn’t start correctly).
> 
> This is the analysis I have done so far.
> Any suggestion?
> 
> 

Hmm. It seems very likely something to do with the way the container is
set up then - and I know nothing about containers. Sorry :/

Can anyone else help here?

Chrissie

>> On 26 Jun 2018, at 11:03, Salvatore D'angelo > > wrote:
>>
>> Yes, sorry you’re right I could find it by myself.
>> However, I did the following:
>>
>> 1. Added the line you suggested to /etc/fstab
>> 2. mount -o remount /dev/shm
>> 3. Now I correctly see /dev/shm of 512M with df -h
>> Filesystem      Size  Used Avail Use% Mounted on
>> overlay          63G   11G   49G  19% /
>> tmpfs            64M  4.0K   64M   1% /dev
>> tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
>> osxfs           466G  158G  305G  35% /Users
>> /dev/sda1        63G   11G   49G  19% /etc/hosts
>> *shm             512M   15M  498M   3% /dev/shm*
>> tmpfs          1000M     0 1000M   0% /sys/firmware
>> tmpfs           128M     0  128M   0% /tmp
>>
>> The errors in log went away. Consider that I remove the log file
>> before start corosync so it does not contains lines of previous
>> executions.
>> 
>>
>> But the command:
>> corosync-quorumtool -ps
>>
>> still give:
>> Cannot initialize QUORUM service
>>
>> Consider that few minutes before it gave me the message:
>> Cannot initialize CFG service
>>
>> I do not know the differences between CFG and QUORUM in this case.
>>
>> If I try to start pacemaker the service is OK but I see only pacemaker
>> and the Transport does not work if I try to run a cam command.
>> Any suggestion?
>>
>>
>>> On 26 Jun 2018, at 10:49, Christine Caulfield >> > wrote:
>>>
>>> On 26/06/18 09:40, Salvatore D'angelo wrote:
 Hi,

 Yes,

 I am reproducing only the required part for test. I think the original
 system has a larger shm. The problem is that I do not know exactly how
 to change it.
 I tried the following steps, but I have the impression I didn’t
 performed the right one:

 1. remove everything under /tmp
 2. Added the following line to /etc/fstab
 tmpfs   /tmp         tmpfs   defaults,nodev,nosuid,mode=1777,size=128M 
         0  0
 3. mount /tmp
 4. df -h
 Filesystem      Size  Used Avail Use% Mounted on
 overlay          63G   11G   49G  19% /
 tmpfs            64M  4.0K   64M   1% /dev
 tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
 osxfs           466G  158G  305G  35% /Users
 /dev/sda1        63G   11G   49G  19% /etc/hosts
 shm              64M   11M   54M  16% /dev/shm
 tmpfs          1000M     0 1000M   0% /sys/firmware
 *tmpfs           128M     0  128M   0% /tmp*

 The errors are exactly the same.
 I have the impression that I changed the wrong parameter. Probably I
 have to change:
 shm              64M   11M   54M  16% /dev/shm

 but I do not know how to do that. Any suggestion?

>>>
>>> According to google, you just add a new line to /etc/fstab for /dev/shm
>>>
>>> tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0
>>>
>>> Chrissie
>>>
> On 26 Jun 2018, at 09:48, Christine Caulfield  
> > wrote:
>
> On 25/06/18 20:41, Salvatore D'angelo wrote:
>> Hi,
>>
>> Let me add here one important detail. I use Docker for my test with 5
>> containers deployed on my Mac.
>> Basically the team that worked on this project installed the cluster
>> on soft layer bare metal.
>> The PostgreSQL cluster was hard to test and if a misconfiguration
>> occurred 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
Sorry after the command:

corosync-quorumtool -ps

the errors in the log are still visible. Looking at the source code, it seems the problem 
is at these lines:
https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c 


if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
        fprintf(stderr, "Cannot initialize QUORUM service\n");
        q_handle = 0;
        goto out;
}

if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
        fprintf(stderr, "Cannot initialise CFG service\n");
        c_handle = 0;
        goto out;
}

The quorum_initialize function is defined here:
https://github.com/corosync/corosync/blob/master/lib/quorum.c 


It seems to interact with libqb to allocate space on /dev/shm, but something 
fails. I tried to update libqb with apt-get install, but with no success.

The same applies to the second function:
https://github.com/corosync/corosync/blob/master/lib/cfg.c 


Now, I am not an expert on libqb. I have version 0.16.0.real-1ubuntu5.

The folder /dev/shm has 777 permissions like the other nodes with older corosync and 
pacemaker that work fine. The only difference is that I only see files created 
by root, and none created by hacluster as on the other two nodes (probably because 
pacemaker didn’t start correctly).

This is the analysis I have done so far.
Any suggestion?
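
One way to see exactly which shared-memory operation libqb is failing on (a sketch; it assumes strace is installed in the container):

# trace the /dev/shm ring-buffer setup done during the IPC connection
strace -f corosync-quorumtool -ps 2>&1 | grep -E 'shm|mmap|ftruncate'

# and check how much of /dev/shm the running corosync has already consumed
df -h /dev/shm
ls -l /dev/shm/qb-*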


> On 26 Jun 2018, at 11:03, Salvatore D'angelo  wrote:
> 
> Yes, sorry you’re right I could find it by myself.
> However, I did the following:
> 
> 1. Added the line you suggested to /etc/fstab
> 2. mount -o remount /dev/shm
> 3. Now I correctly see /dev/shm of 512M with df -h
> Filesystem  Size  Used Avail Use% Mounted on
> overlay  63G   11G   49G  19% /
> tmpfs64M  4.0K   64M   1% /dev
> tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
> osxfs   466G  158G  305G  35% /Users
> /dev/sda163G   11G   49G  19% /etc/hosts
> shm 512M   15M  498M   3% /dev/shm
> tmpfs  1000M 0 1000M   0% /sys/firmware
> tmpfs   128M 0  128M   0% /tmp
> 
> The errors in log went away. Consider that I remove the log file before start 
> corosync so it does not contains lines of previous executions.
> 
> 
> But the command:
> corosync-quorumtool -ps
> 
> still give:
> Cannot initialize QUORUM service
> 
> Consider that few minutes before it gave me the message:
> Cannot initialize CFG service
> 
> I do not know the differences between CFG and QUORUM in this case.
> 
> If I try to start pacemaker the service is OK but I see only pacemaker and 
> the Transport does not work if I try to run a cam command.
> Any suggestion?
> 
> 
>> On 26 Jun 2018, at 10:49, Christine Caulfield > > wrote:
>> 
>> On 26/06/18 09:40, Salvatore D'angelo wrote:
>>> Hi,
>>> 
>>> Yes,
>>> 
>>> I am reproducing only the required part for test. I think the original
>>> system has a larger shm. The problem is that I do not know exactly how
>>> to change it.
>>> I tried the following steps, but I have the impression I didn’t
>>> performed the right one:
>>> 
>>> 1. remove everything under /tmp
>>> 2. Added the following line to /etc/fstab
>>> tmpfs   /tmp tmpfs   defaults,nodev,nosuid,mode=1777,size=128M 
>>> 0  0
>>> 3. mount /tmp
>>> 4. df -h
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> overlay  63G   11G   49G  19% /
>>> tmpfs64M  4.0K   64M   1% /dev
>>> tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
>>> osxfs   466G  158G  305G  35% /Users
>>> /dev/sda163G   11G   49G  19% /etc/hosts
>>> shm  64M   11M   54M  16% /dev/shm
>>> tmpfs  1000M 0 1000M   0% /sys/firmware
>>> *tmpfs   128M 0  128M   0% /tmp*
>>> 
>>> The errors are exactly the same.
>>> I have the impression that I changed the wrong parameter. Probably I
>>> have to change:
>>> shm  64M   11M   54M  16% /dev/shm
>>> 
>>> but I do not know how to do that. Any suggestion?
>>> 
>> 
>> According to google, you just add a new line to /etc/fstab for /dev/shm
>> 
>> tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0
>> 
>> Chrissie
>> 
 On 26 Jun 2018, at 09:48, Christine Caulfield >>> 
 >> wrote:
 
 On 25/06/18 20:41, Salvatore D'angelo wrote:
> Hi,
> 
> Let me add here one important detail. I use Docker for my test with 5
> containers deployed on my Mac.
> Basically the team that worked on this project installed the cluster
> on soft layer bare metal.
> The PostgreSQL cluster was hard to test and if a misconfiguration
> occurred recreate the cluster from scratch is not 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
Yes, sorry you’re right I could find it by myself.
However, I did the following:

1. Added the line you suggested to /etc/fstab
2. mount -o remount /dev/shm
3. Now I correctly see /dev/shm of 512M with df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          63G   11G   49G  19% /
tmpfs            64M  4.0K   64M   1% /dev
tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
osxfs           466G  158G  305G  35% /Users
/dev/sda1        63G   11G   49G  19% /etc/hosts
shm             512M   15M  498M   3% /dev/shm
tmpfs          1000M     0 1000M   0% /sys/firmware
tmpfs           128M     0  128M   0% /tmp

The errors in log went away. Consider that I remove the log file before start corosync so it does not contains lines of previous executions.

corosync.log
Description: Binary data
But the command:
corosync-quorumtool -ps

still gives:
Cannot initialize QUORUM service

Consider that a few minutes before it gave me the message:
Cannot initialize CFG service

I do not know the differences between CFG and QUORUM in this case.

If I try to start pacemaker the service is OK but I see only pacemaker
and the Transport does not work if I try to run a crm command.
Any suggestion?

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Christine Caulfield
On 26/06/18 09:40, Salvatore D'angelo wrote:
> Hi,
> 
> Yes,
> 
> I am reproducing only the required part for test. I think the original
> system has a larger shm. The problem is that I do not know exactly how
> to change it.
> I tried the following steps, but I have the impression I didn’t
> performed the right one:
> 
> 1. remove everything under /tmp
> 2. Added the following line to /etc/fstab
> tmpfs   /tmp         tmpfs   defaults,nodev,nosuid,mode=1777,size=128M 
>         0  0
> 3. mount /tmp
> 4. df -h
> Filesystem      Size  Used Avail Use% Mounted on
> overlay          63G   11G   49G  19% /
> tmpfs            64M  4.0K   64M   1% /dev
> tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
> osxfs           466G  158G  305G  35% /Users
> /dev/sda1        63G   11G   49G  19% /etc/hosts
> shm              64M   11M   54M  16% /dev/shm
> tmpfs          1000M     0 1000M   0% /sys/firmware
> *tmpfs           128M     0  128M   0% /tmp*
> 
> The errors are exactly the same.
> I have the impression that I changed the wrong parameter. Probably I
> have to change:
> shm              64M   11M   54M  16% /dev/shm
> 
> but I do not know how to do that. Any suggestion?
> 

According to google, you just add a new line to /etc/fstab for /dev/shm

tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0

Chrissie

>> On 26 Jun 2018, at 09:48, Christine Caulfield > > wrote:
>>
>> On 25/06/18 20:41, Salvatore D'angelo wrote:
>>> Hi,
>>>
>>> Let me add here one important detail. I use Docker for my test with 5
>>> containers deployed on my Mac.
>>> Basically the team that worked on this project installed the cluster
>>> on soft layer bare metal.
>>> The PostgreSQL cluster was hard to test and if a misconfiguration
>>> occurred recreate the cluster from scratch is not easy.
>>> Test it was a cumbersome if you consider that we access to the
>>> machines with a complex system hard to describe here.
>>> For this reason I ported the cluster on Docker for test purpose. I am
>>> not interested to have it working for months, I just need a proof of
>>> concept. 
>>>
>>> When the migration works I’ll port everything on bare metal where the
>>> size of resources are ambundant.  
>>>
>>> Now I have enough RAM and disk space on my Mac so if you tell me what
>>> should be an acceptable size for several days of running it is ok for me.
>>> It is ok also have commands to clean the shm when required.
>>> I know I can find them on Google but if you can suggest me these info
>>> I’ll appreciate. I have OS knowledge to do that but I would like to
>>> avoid days of guesswork and try and error if possible.
>>
>>
>> I would recommend at least 128MB of space on /dev/shm, 256MB if you can
>> spare it. My 'standard' system uses 75MB under normal running allowing
>> for one command-line query to run.
>>
>> If I read this right then you're reproducing a bare-metal system in
>> containers now? so the original systems will have a default /dev/shm
>> size which is probably much larger than your containers?
>>
>> I'm just checking here that we don't have a regression in memory usage
>> as Poki suggested.
>>
>> Chrissie
>>
 On 25 Jun 2018, at 21:18, Jan Pokorný >>> > wrote:

 On 25/06/18 19:06 +0200, Salvatore D'angelo wrote:
> Thanks for reply. I scratched my cluster and created it again and
> then migrated as before. This time I uninstalled pacemaker,
> corosync, crmsh and resource agents with make uninstall
>
> then I installed new packages. The problem is the same, when
> I launch:
> corosync-quorumtool -ps
>
> I got: Cannot initialize QUORUM service
>
> Here the log with debug enabled:
>
>
> [18019] pg3 corosyncerror   [QB    ] couldn't create circular mmap
> on /dev/shm/qb-cfg-event-18020-18028-23-data
> [18019] pg3 corosyncerror   [QB    ]
> qb_rb_open:cfg-event-18020-18028-23: Resource temporarily
> unavailable (11)
> [18019] pg3 corosyncdebug   [QB    ] Free'ing ringbuffer:
> /dev/shm/qb-cfg-request-18020-18028-23-header
> [18019] pg3 corosyncdebug   [QB    ] Free'ing ringbuffer:
> /dev/shm/qb-cfg-response-18020-18028-23-header
> [18019] pg3 corosyncerror   [QB    ] shm connection FAILED:
> Resource temporarily unavailable (11)
> [18019] pg3 corosyncerror   [QB    ] Error in connection setup
> (18020-18028-23): Resource temporarily unavailable (11)
>
> I tried to check /dev/shm and I am not sure these are the right
> commands, however:
>
> df -h /dev/shm
> Filesystem  Size  Used Avail Use% Mounted on
> shm  64M   16M   49M  24% /dev/shm
>
> ls /dev/shm
> qb-cmap-request-18020-18036-25-data    qb-corosync-blackbox-data
>    qb-quorum-request-18020-18095-32-data
> qb-cmap-request-18020-18036-25-header  qb-corosync-blackbox-header
>  qb-quorum-request-18020-18095-32-header
>
> Is 64 Mb 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
Hi,

Yes,

I am reproducing only the required part for testing. I think the original system 
has a larger shm. The problem is that I do not know exactly how to change it.
I tried the following steps, but I have the impression I didn’t perform the 
right one:

1. remove everything under /tmp
2. Added the following line to /etc/fstab
tmpfs   /tmp   tmpfs   defaults,nodev,nosuid,mode=1777,size=128M   0  0
3. mount /tmp
4. df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          63G   11G   49G  19% /
tmpfs            64M  4.0K   64M   1% /dev
tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
osxfs           466G  158G  305G  35% /Users
/dev/sda1        63G   11G   49G  19% /etc/hosts
shm              64M   11M   54M  16% /dev/shm
tmpfs          1000M     0 1000M   0% /sys/firmware
tmpfs           128M     0  128M   0% /tmp

The errors are exactly the same.
I have the impression that I changed the wrong parameter. Probably I have to 
change:
shm              64M   11M   54M  16% /dev/shm

but I do not know how to do that. Any suggestion?
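
For the record, inside a Docker container /dev/shm is normally sized by the container runtime rather than by /etc/fstab; a sketch of both options (the 512M figure follows the size suggested elsewhere in this thread, and the image name is a placeholder):

# option 1: set the size when the container is created
docker run --shm-size=512m <image>

# option 2: resize it inside an already-running container
# (may require a privileged container)
mount -o remount,size=512m /dev/shm
df -h /dev/shm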

> On 26 Jun 2018, at 09:48, Christine Caulfield  wrote:
> 
> On 25/06/18 20:41, Salvatore D'angelo wrote:
>> Hi,
>> 
>> Let me add here one important detail. I use Docker for my test with 5 
>> containers deployed on my Mac.
>> Basically the team that worked on this project installed the cluster on soft 
>> layer bare metal.
>> The PostgreSQL cluster was hard to test and if a misconfiguration occurred 
>> recreate the cluster from scratch is not easy.
>> Test it was a cumbersome if you consider that we access to the machines with 
>> a complex system hard to describe here.
>> For this reason I ported the cluster on Docker for test purpose. I am not 
>> interested to have it working for months, I just need a proof of concept. 
>> 
>> When the migration works I’ll port everything on bare metal where the size 
>> of resources are ambundant.  
>> 
>> Now I have enough RAM and disk space on my Mac so if you tell me what should 
>> be an acceptable size for several days of running it is ok for me.
>> It is ok also have commands to clean the shm when required.
>> I know I can find them on Google but if you can suggest me these info I’ll 
>> appreciate. I have OS knowledge to do that but I would like to avoid days of 
>> guesswork and try and error if possible.
> 
> 
> I would recommend at least 128MB of space on /dev/shm, 256MB if you can
> spare it. My 'standard' system uses 75MB under normal running allowing
> for one command-line query to run.
> 
> If I read this right then you're reproducing a bare-metal system in
> containers now? so the original systems will have a default /dev/shm
> size which is probably much larger than your containers?
> 
> I'm just checking here that we don't have a regression in memory usage
> as Poki suggested.
> 
> Chrissie
> 
>>> On 25 Jun 2018, at 21:18, Jan Pokorný  wrote:
>>> 
>>> On 25/06/18 19:06 +0200, Salvatore D'angelo wrote:
 Thanks for reply. I scratched my cluster and created it again and
 then migrated as before. This time I uninstalled pacemaker,
 corosync, crmsh and resource agents with make uninstall
 
 then I installed new packages. The problem is the same, when
 I launch:
 corosync-quorumtool -ps
 
 I got: Cannot initialize QUORUM service
 
 Here the log with debug enabled:
 
 
 [18019] pg3 corosyncerror   [QB] couldn't create circular mmap on 
 /dev/shm/qb-cfg-event-18020-18028-23-data
 [18019] pg3 corosyncerror   [QB] qb_rb_open:cfg-event-18020-18028-23: 
 Resource temporarily unavailable (11)
 [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
 /dev/shm/qb-cfg-request-18020-18028-23-header
 [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
 /dev/shm/qb-cfg-response-18020-18028-23-header
 [18019] pg3 corosyncerror   [QB] shm connection FAILED: Resource 
 temporarily unavailable (11)
 [18019] pg3 corosyncerror   [QB] Error in connection setup 
 (18020-18028-23): Resource temporarily unavailable (11)
 
 I tried to check /dev/shm and I am not sure these are the right
 commands, however:
 
 df -h /dev/shm
 Filesystem  Size  Used Avail Use% Mounted on
 shm  64M   16M   49M  24% /dev/shm
 
 ls /dev/shm
 qb-cmap-request-18020-18036-25-dataqb-corosync-blackbox-data
 qb-quorum-request-18020-18095-32-data
 qb-cmap-request-18020-18036-25-header  qb-corosync-blackbox-header  
 qb-quorum-request-18020-18095-32-header
 
 Is 64 Mb enough for /dev/shm. If no, why it worked with previous
 corosync release?
>>> 
>>> For a start, can you try configuring corosync with
>>> --enable-small-memory-footprint switch?
>>> 
>>> Hard to say why the space provisioned to /dev/shm is the direct
>>> opposite of generous (per today's standards), but may be the result
>>> of automatic HW adaptation, and if RAM is so 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Christine Caulfield
On 25/06/18 20:41, Salvatore D'angelo wrote:
> Hi,
> 
> Let me add here one important detail. I use Docker for my test with 5 
> containers deployed on my Mac.
> Basically the team that worked on this project installed the cluster on soft 
> layer bare metal.
> The PostgreSQL cluster was hard to test and if a misconfiguration occurred 
> recreate the cluster from scratch is not easy.
> Test it was a cumbersome if you consider that we access to the machines with 
> a complex system hard to describe here.
> For this reason I ported the cluster on Docker for test purpose. I am not 
> interested to have it working for months, I just need a proof of concept. 
> 
> When the migration works I’ll port everything on bare metal where the size of 
> resources are ambundant.  
> 
> Now I have enough RAM and disk space on my Mac so if you tell me what should 
> be an acceptable size for several days of running it is ok for me.
> It is ok also have commands to clean the shm when required.
> I know I can find them on Google but if you can suggest me these info I’ll 
> appreciate. I have OS knowledge to do that but I would like to avoid days of 
> guesswork and try and error if possible.


I would recommend at least 128MB of space on /dev/shm, 256MB if you can
spare it. My 'standard' system uses 75MB under normal running allowing
for one command-line query to run.

If I read this right then you're reproducing a bare-metal system in
containers now? so the original systems will have a default /dev/shm
size which is probably much larger than your containers?

I'm just checking here that we don't have a regression in memory usage
as Poki suggested.

Chrissie

>> On 25 Jun 2018, at 21:18, Jan Pokorný  wrote:
>>
>> On 25/06/18 19:06 +0200, Salvatore D'angelo wrote:
>>> Thanks for reply. I scratched my cluster and created it again and
>>> then migrated as before. This time I uninstalled pacemaker,
>>> corosync, crmsh and resource agents with make uninstall
>>>
>>> then I installed new packages. The problem is the same, when
>>> I launch:
>>> corosync-quorumtool -ps
>>>
>>> I got: Cannot initialize QUORUM service
>>>
>>> Here the log with debug enabled:
>>>
>>>
>>> [18019] pg3 corosyncerror   [QB] couldn't create circular mmap on 
>>> /dev/shm/qb-cfg-event-18020-18028-23-data
>>> [18019] pg3 corosyncerror   [QB] qb_rb_open:cfg-event-18020-18028-23: 
>>> Resource temporarily unavailable (11)
>>> [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
>>> /dev/shm/qb-cfg-request-18020-18028-23-header
>>> [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
>>> /dev/shm/qb-cfg-response-18020-18028-23-header
>>> [18019] pg3 corosyncerror   [QB] shm connection FAILED: Resource 
>>> temporarily unavailable (11)
>>> [18019] pg3 corosyncerror   [QB] Error in connection setup 
>>> (18020-18028-23): Resource temporarily unavailable (11)
>>>
>>> I tried to check /dev/shm and I am not sure these are the right
>>> commands, however:
>>>
>>> df -h /dev/shm
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> shm  64M   16M   49M  24% /dev/shm
>>>
>>> ls /dev/shm
>>> qb-cmap-request-18020-18036-25-dataqb-corosync-blackbox-data
>>> qb-quorum-request-18020-18095-32-data
>>> qb-cmap-request-18020-18036-25-header  qb-corosync-blackbox-header  
>>> qb-quorum-request-18020-18095-32-header
>>>
>>> Is 64 Mb enough for /dev/shm. If no, why it worked with previous
>>> corosync release?
>>
>> For a start, can you try configuring corosync with
>> --enable-small-memory-footprint switch?
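
For reference, a sketch of using that build-time toggle when building corosync from source (the exact steps are an assumption; a release tarball does not need autogen.sh):

./autogen.sh            # only when building from a git checkout
./configure --enable-small-memory-footprint
make && make install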
>>
>> Hard to say why the space provisioned to /dev/shm is the direct
>> opposite of generous (per today's standards), but may be the result
>> of automatic HW adaptation, and if RAM is so scarce in your case,
>> the above build-time toggle might help.
>>
>> If not, then exponentially increasing size of /dev/shm space is
>> likely your best bet (I don't recommended fiddling with mlockall()
>> and similar measures in corosync).
>>
>> Of course, feel free to raise a regression if you have a reproducible
>> comparison between two corosync (plus possibly different libraries
>> like libqb) versions, one that works and one that won't, in
>> reproducible conditions (like this small /dev/shm, VM image, etc.).
>>
>> -- 
>> Jan (Poki)
>> ___
>> Users mailing list: Users@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Users mailing list: Users@clusterlabs.org

Re: [ClusterLabs] Resources not monitored in SLES11 SP4 (1.1.12-f47ea56)

2018-06-26 Thread Vladislav Bogdanov

26.06.2018 09:14, Ulrich Windl wrote:

Hi!

We just observed some strange effect we cannot explain in SLES 11 SP4 
(pacemaker 1.1.12-f47ea56):
We run about a dozen of Xen PVMs on a three-node cluster (plus some 
infrastructure and monitoring stuff). It worked all well so far, and there was 
no significant change recently.
However when a colleague stopped one VM for maintenance via cluster command, the 
cluster did not notice when the PVM actually was running again (it had been 
started not using the cluster (a bad idea, I know)).


To be on the safe side in such cases you'd probably want to enable an 
additional monitor for the "Stopped" role. The default one covers only the 
"Started" role. The same thing applies to multistate resources, where you 
need several monitor ops, for the "Started/Slave" and "Master" roles.

But this will increase the load.
And, I believe cluster should reprobe a resource on all nodes once you 
change target-role back to "Started".
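
A sketch of what that could look like in crm shell syntax (the resource name, Xen config file and intervals are made up for illustration; the two monitors need different intervals):

crm configure primitive vm1 ocf:heartbeat:Xen \
        params xmfile="/etc/xen/vm1.cfg" \
        op monitor role="Started" interval="600s" timeout="60s" \
        op monitor role="Stopped" interval="620s" timeout="60s"

The role="Stopped" monitor runs on nodes where the resource is supposed to be stopped, so it is the one that would have caught the manually started VM.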



Examining the logs, it seems that the recheck timer popped periodically, but no 
monitor action was run for the VM (the action is configured to run every 10 
minutes).

Actually the only monitor operations found were:
May 23 08:04:13
Jun 13 08:13:03
Jun 25 09:29:04
Then a manual "reprobe" was done, and several monitor operations were run.
Then again I see no more monitor actions in syslog.

What could be the reasons for this? Too many operations defined?

The other message I don't understand is like ": Rolling back scores from 
"

Could it be a new bug introduced in pacemaker, or could it be some 
configuration problem (The status is completely clean however)?

According to the packet changelog, there was no change since Nov 2016...

Regards,
Ulrich


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org



___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Resources not monitored in SLES11 SP4 (1.1.12-f47ea56)

2018-06-26 Thread Ulrich Windl
Hi!

We just observed some strange effect we cannot explain in SLES 11 SP4 
(pacemaker 1.1.12-f47ea56):
We run about a dozen of Xen PVMs on a three-node cluster (plus some 
infrastructure and monitoring stuff). It worked all well so far, and there was 
no significant change recently.
However when a colleague stopped one VM for maintenance via cluster command, the 
cluster did not notice when the PVM actually was running again (it had been 
started not using the cluster (a bad idea, I know)).
Examining the logs, it seems that the recheck timer popped periodically, but no 
monitor action was run for the VM (the action is configured to run every 10 
minutes).

Actually the only monitor operations found were:
May 23 08:04:13
Jun 13 08:13:03
Jun 25 09:29:04
Then a manual "reprobe" was done, and several monitor operations were run.
Then again I see no more monitor actions in syslog.
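
For reference, a sketch of the commands usually used to inspect the configured monitor and to trigger the reprobe mentioned above (crmsh syntax; the resource name is hypothetical):

# show the resource definition, including its monitor interval
crm configure show vm1
# force a probe of all resources on all nodes
crm resource reprobe
# low-level equivalent
crm_resource --reprobe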

What could be the reasons for this? Too many operations defined?

The other message I don't understand is like ": Rolling back 
scores from "

Could it be a new bug introduced in pacemaker, or could it be some 
configuration problem (The status is completely clean however)?

According to the packet changelog, there was no change since Nov 2016...

Regards,
Ulrich


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org