Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-02 Thread Nikhil Utane
It is a Freescale e6500 processor. Nobody here has tried running it in LE
mode, so it is going to take some doing.
We are going to add some debug logs to figure out where corosync
initialization gets stalled.
If you have any suggestions, please let us know.
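
For reference, the kind of thing we have in mind is turning up corosync's
own debug output, e.g. a logging section along these lines in corosync.conf
(directive names as in the stock example config, the logfile path is only an
example), and running "corosync -f" in the foreground to see the last
subsystem that comes up:

    logging {
        to_syslog: yes
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        timestamp: on
        debug: on
        logger_subsys {
            subsys: MAIN
            debug: on
        }
        logger_subsys {
            subsys: TOTEM
            debug: on
        }
    }

We also plan to compare "corosync-cfgtool -s" and "corosync-quorumtool -s"
output on the ppc and x86 nodes to see whether the ring forms and membership
is reached before the "Completed service synchronization" message.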

-Thanks
Nikhil


On Mon, May 2, 2016 at 1:00 PM, Nikhil Utane 
wrote:

> So what I understand what you are saying is, if the HW is bi-endian, then
> enable LE on PPC. Is that right?
> Need to check on that.
>
> On Mon, May 2, 2016 at 12:49 PM, Nikhil Utane  > wrote:
>
>> Sorry about my ignorance but could you pls elaborate what do you mean by
>> "try to ppcle"?
>>
>> Our target platform is ppc so it is BE. We have to get it running only on
>> that.
>> How do we know this is LE/BE issue and nothing else?
>>
>> -Thanks
>> Nikhil
>>
>>
>> On Mon, May 2, 2016 at 12:24 PM, Jan Friesse  wrote:
>>
>>> As your hardware is probably capable of running ppcle and if you have an
 environment
 at hand without too much effort it might pay off to try that.
 There are of course distributions out there support corosync on
 big-endian architectures
 but I don't know if there is an automatized regression for corosync on
 big-endian that
 would catch big-endian-issues right away with something as current as
 your 2.3.5.

>>>
>>> No we are not testing big-endian.
>>>
>>> So totally agree with Klaus. Give a try to ppcle. Also make sure all
>>> nodes are little-endian. Corosync should work in mixed BE/LE environment
>>> but because it's not tested, it may not work (and it's a bug, so if ppcle
>>> works I will try to fix BE).
>>>
>>> Regards,
>>>   Honza
>>>
>>>
>>>
 Regards,
 Klaus

 On 05/02/2016 06:44 AM, Nikhil Utane wrote:

> Re-sending as I don't see my post on the thread.
>
> On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
> >
> wrote:
>
>  Hi,
>
>  Looking for some guidance here as we are completely blocked
>  otherwise :(.
>
>  -Regards
>  Nikhil
>
>  On Fri, Apr 29, 2016 at 6:11 PM, Sriram   > wrote:
>
>  Corrected the subject.
>
>  We went ahead and captured corosync debug logs for our ppc
> board.
>  After log analysis and comparison with the sucessful logs(
>  from x86 machine) ,
>  we didnt find *"[ MAIN  ] Completed service synchronization,
>  ready to provide service.*" in ppc logs.
>  So, looks like corosync is not in a position to accept
>  connection from Pacemaker.
>  Even I tried with the new corosync.conf with no success.
>
>  Any hints on this issue would be really helpful.
>
>  Attaching ppc_notworking.log, x86_working.log, corosync.conf.
>
>  Regards,
>  Sriram
>
>
>
>  On Fri, Apr 29, 2016 at 2:44 PM, Sriram   > wrote:
>
>  Hi,
>
>  I went ahead and made some changes in file system(Like I
>  brought in /etc/init.d/corosync and /etc/init.d/pacemaker,
>  /etc/sysconfig ), After that I was able to run  "pcs
>  cluster start".
>  But it failed with the following error
>   # pcs cluster start
>  Starting Cluster...
>  Starting Pacemaker Cluster Manager[FAILED]
>  Error: unable to start pacemaker
>
>  And in the /var/log/pacemaker.log, I saw these errors
>  pacemakerd: info: mcp_read_config:  cmap connection
>  setup failed: CS_ERR_TRY_AGAIN.  Retrying in 4s
>  Apr 29 08:53:47 [15863] node_cu pacemakerd: info:
>  mcp_read_config:  cmap connection setup failed:
>  CS_ERR_TRY_AGAIN.  Retrying in 5s
>  Apr 29 08:53:52 [15863] node_cu pacemakerd:  warning:
>  mcp_read_config:  Could not connect to Cluster
>  Configuration Database API, error 6
>  Apr 29 08:53:52 [15863] node_cu pacemakerd:   notice:
>  main: Could not obtain corosync config data, exiting
>  Apr 29 08:53:52 [15863] node_cu pacemakerd: info:
>  crm_xml_cleanup:  Cleaning up memory from libxml2
>
>
>  And in the /var/log/Debuglog, I saw these errors coming
>  from corosync
>  20160429 085347.487050  airv_cu
>  daemon.warn corosync[12857]:   [QB] Denied connection,
>  is not ready (12857-15863-14)
>  20160429 

Re: [ClusterLabs] Coming in 1.1.15: Event-driven alerts

2016-05-02 Thread Ken Gaillot
On 04/25/2016 07:28 AM, Lars Ellenberg wrote:
> On Thu, Apr 21, 2016 at 12:50:43PM -0500, Ken Gaillot wrote:
>> Hello everybody,
>>
>> The release cycle for 1.1.15 will be started soon (hopefully tomorrow)!
>>
>> The most prominent feature will be Klaus Wenninger's new implementation
>> of event-driven alerts -- the ability to call scripts whenever
>> interesting events occur (nodes joining/leaving, resources
>> starting/stopping, etc.).
> 
> What exactly is "etc." here?
> What is the comprehensive list
> of which "events" will trigger "alerts"?

The exact list should be documented in Pacemaker Explained before the
final 1.1.15 release. I think it's comparable to what crm_mon -E does
currently. The basic categories are node events, fencing events, and
resource events.

> My guess would be
>  DC election/change
>which does not necessarily imply membership change
>  change in membership
>which includes change in quorum
>  fencing events
>(even failed fencing?)
>  resource start/stop/promote/demote
>   (probably) monitor failure?
>maybe only if some fail-count changes to/from infinity?
>or above a certain threshold?
> 
>  change of maintenance-mode?
>  node standby/online (maybe)?
>  maybe "resource cannot be run anywhere"?

It would certainly be possible to expand alerts to more situations if
there is a need. I think the existing ones will be sufficient for common
use cases though.

> would it be useful to pass in the "transaction ID"
> or other pointer to the recorded cib input at the time
> the "alert" was triggered?

Possibly, though it isn't currently. We do pass a node-local counter and
a subsecond-resolution timestamp, to help with ordering.

> can an alert "observer" (alert script) "register"
> for only a subset of the "alerts"?

Not explicitly, but the alert type is passed in as an environment
variable, so the script can simply exit for "uninteresting" event types.
That's not as efficient since the process must still be spawned, but it
simplifies things.
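
A trivial filter might look like the sketch below (the CRM_alert_* variable
names reflect the current patches and could still change before the release;
the logger call is only a placeholder action):

    #!/bin/sh
    # Only react to fencing events, ignore everything else.
    case "$CRM_alert_kind" in
        fencing) ;;        # handled below
        *) exit 0 ;;       # some other event type: do nothing
    esac

    # placeholder action: record the event in syslog
    logger -t my_alert "fencing event on $CRM_alert_node: $CRM_alert_desc"
    exit 0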

> if so, can this filter be per alert script,
> or per "recipient", or both?
> 
> Thanks,
> 
> Lars
> 

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Node is silently unfenced if transition is very long

2016-05-02 Thread Ken Gaillot
On 04/19/2016 10:47 AM, Vladislav Bogdanov wrote:
> Hi,
> 
> Just found an issue with node is silently unfenced.
> 
> That is quite large setup (2 cluster nodes and 8 remote ones) with
> a plenty of slowly starting resources (lustre filesystem).
> 
> Fencing was initiated due to resource stop failure.
> lustre often starts very slowly due to internal recovery, and some such
> resources were starting in that transition where another resource failed to 
> stop.
> And, as transition did not finish in time specified by the
> "failure-timeout" (set to 9 min), and was not aborted, that stop failure was 
> successfully cleaned.
> There were transition aborts due to attribute changes, after that stop 
> failure happened, but fencing
> was not initiated for some reason.

Unfortunately, that makes sense with the current code. When the failure
timeout expires, it changes the node's fail-count attribute, which aborts
the transition; that triggers a recalculation based on the new state, in
which the fencing is no longer needed. I'll make a note to investigate a
fix, but feel free to file a
bug report at bugs.clusterlabs.org for tracking purposes.

> Node where stop failed was a DC.
> pacemaker is 1.1.14-5a6cdd1 (from fedora, built on EL7)
> 
> Here is log excerpt illustrating the above:
> Apr 19 14:57:56 mds1 pengine[3452]:   notice: Movemdt0-es03a-vg
> (Started mds1 -> mds0)
> Apr 19 14:58:06 mds1 pengine[3452]:   notice: Movemdt0-es03a-vg
> (Started mds1 -> mds0)
> Apr 19 14:58:10 mds1 crmd[3453]:   notice: Initiating action 81: monitor 
> mdt0-es03a-vg_monitor_0 on mds0
> Apr 19 14:58:11 mds1 crmd[3453]:   notice: Initiating action 2993: stop 
> mdt0-es03a-vg_stop_0 on mds1 (local)
> Apr 19 14:58:11 mds1 LVM(mdt0-es03a-vg)[6228]: INFO: Deactivating volume 
> group vg_mdt0_es03a
> Apr 19 14:58:12 mds1 LVM(mdt0-es03a-vg)[6541]: ERROR: Logical volume 
> vg_mdt0_es03a/mdt0 contains a filesystem in use. Can't deactivate volume 
> group "vg_mdt0_es03a" with 1 open logical volume(s)
> [...]
> Apr 19 14:58:30 mds1 LVM(mdt0-es03a-vg)[9939]: ERROR: LVM: vg_mdt0_es03a did 
> not stop correctly
> Apr 19 14:58:30 mds1 LVM(mdt0-es03a-vg)[9943]: WARNING: vg_mdt0_es03a still 
> Active
> Apr 19 14:58:30 mds1 LVM(mdt0-es03a-vg)[9947]: INFO: Retry deactivating 
> volume group vg_mdt0_es03a
> Apr 19 14:58:31 mds1 lrmd[3450]:   notice: mdt0-es03a-vg_stop_0:5865:stderr [ 
> ocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly ]
> [...]
> Apr 19 14:58:31 mds1 lrmd[3450]:   notice: mdt0-es03a-vg_stop_0:5865:stderr [ 
> ocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly ]
> Apr 19 14:58:31 mds1 crmd[3453]:   notice: Operation mdt0-es03a-vg_stop_0: 
> unknown error (node=mds1, call=324, rc=1, cib-update=1695, confirmed=true)
> Apr 19 14:58:31 mds1 crmd[3453]:   notice: mds1-mdt0-es03a-vg_stop_0:324 [ 
> ocf-exit-reason:LVM: vg_mdt0_es03a did not stop 
> correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop 
> correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop 
> correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop 
> correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop 
> correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop 
> correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop 
> correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctl
> Apr 19 14:58:31 mds1 crmd[3453]:  warning: Action 2993 (mdt0-es03a-vg_stop_0) 
> on mds1 failed (target: 0 vs. rc: 1): Error
> Apr 19 14:58:31 mds1 crmd[3453]:  warning: Action 2993 (mdt0-es03a-vg_stop_0) 
> on mds1 failed (target: 0 vs. rc: 1): Error
> Apr 19 15:02:03 mds1 pengine[3452]:  warning: Processing failed op stop for 
> mdt0-es03a-vg on mds1: unknown error (1)
> Apr 19 15:02:03 mds1 pengine[3452]:  warning: Processing failed op stop for 
> mdt0-es03a-vg on mds1: unknown error (1)
> Apr 19 15:02:03 mds1 pengine[3452]:  warning: Node mds1 will be fenced 
> because of resource failure(s)
> Apr 19 15:02:03 mds1 pengine[3452]:  warning: Forcing mdt0-es03a-vg away from 
> mds1 after 100 failures (max=100)
> Apr 19 15:02:03 mds1 pengine[3452]:  warning: Scheduling Node mds1 for STONITH
> Apr 19 15:02:03 mds1 pengine[3452]:   notice: Stop of failed resource 
> mdt0-es03a-vg is implicit after mds1 is fenced
> Apr 19 15:02:03 mds1 pengine[3452]:   notice: Recover mdt0-es03a-vg
> (Started mds1 -> mds0)
> [... many of these ]
> Apr 19 15:07:22 mds1 pengine[3452]:  warning: Processing failed op stop for 
> mdt0-es03a-vg on mds1: unknown error (1)
> Apr 19 15:07:22 mds1 pengine[3452]:  warning: Processing failed op stop for 
> mdt0-es03a-vg on mds1: unknown error (1)
> Apr 19 15:07:22 mds1 pengine[3452]:  warning: Node mds1 will be fenced 
> because of resource failure(s)
> Apr 19 15:07:22 mds1 pengine[3452]:  warning: Forcing mdt0-es03a-vg away from 
> mds1 after 100 failures (max=100)
> Apr 19 15:07:23 mds1 pengine[3452]:  warning: Scheduling Node mds1 for STONITH
> Apr 19 15:07:23 mds1 pengine[3452]:   notice: Stop of failed resource 
> 

Re: [ClusterLabs] Coming in 1.1.15: Event-driven alerts

2016-05-02 Thread Ken Gaillot
On 04/22/2016 05:55 PM, Adam Spiers wrote:
> Ken Gaillot  wrote:
>> On 04/21/2016 06:09 PM, Adam Spiers wrote:
>>> Ken Gaillot  wrote:
 Hello everybody,

 The release cycle for 1.1.15 will be started soon (hopefully tomorrow)!

 The most prominent feature will be Klaus Wenninger's new implementation
 of event-driven alerts -- the ability to call scripts whenever
 interesting events occur (nodes joining/leaving, resources
 starting/stopping, etc.).
>>>
>>> Ooh, that sounds cool!  Can it call scripts after fencing has
>>> completed?  And how is it determined which node the script runs on,
>>> and can that be limited via constraints or similar?
>>
>> Yes, it is called after all "interesting" events (including fencing), and
>> the script can use the provided environment variables to determine what
>> type of event it was.
> 
> Great.  Does the script run on the DC, or is that configurable somehow?

The script runs on all cluster nodes, to give maximum flexibility and
resiliency (during partitions etc.). Scripts must handle ordering and
de-duplication themselves, if needed.

A script that isn't too concerned about partitions might simply check
whether the local node is the DC, and only take action if so, to avoid
duplicates.
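
For instance, a rough sketch (not an official recipe; parsing crmadmin
output like this is fragile, and it assumes cluster node names match what
crm_node -n reports):

    #!/bin/sh
    # Only act if the local node is currently the DC, to avoid duplicates.
    dc=$(crmadmin -D 2>/dev/null | awk '{print $NF}')
    [ "$dc" = "$(crm_node -n)" ] || exit 0

    # ... actual alert handling goes here ...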

We're definitely interested in hearing how people approach these issues.
The possibilities for what an alert script might do are wide open, and
we want to be as flexible as possible at this stage. If the community
settles on certain approaches or finds certain gaps, we can enhance the
support in those areas as needed.
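
For anyone who wants to experiment before the documentation lands: the
current development code hooks scripts up via a CIB section roughly like the
following (illustrative only and still subject to change; the script path
and recipient value are just placeholders):

    <alerts>
      <alert id="alert-1" path="/usr/local/bin/my-alert.sh">
        <recipient id="alert-1-recipient" value="admin@example.com"/>
      </alert>
    </alerts>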

>> We don't notify before events, because at that moment we don't know
>> whether the event will really happen or not. We might try but fail.
> 
> You lost me here ;-)

We only call alert scripts after an event occurs, because we can't
predict the future. :-) For example, we don't know whether a node is
about to join or leave the cluster. Or for fencing, we might try to
fence but be unsuccessful -- and the part of pacemaker that calls the
alert scripts won't even know about fencing initiated outside cluster
control, such as by DLM or a human running stonith_admin.

>>> I'm wondering if it could replace the current fencing_topology hack we
>>> use to invoke fence_compute which starts the workflow for recovering
>>> VMs off dead OpenStack nova-compute nodes.
>>
>> Yes, that is one of the reasons we did this!
> 
> Haha, at this point can I say great minds think alike? ;-)
> 
>> The initial implementation only allowed for one script to be called (the
>> "notification-agent" property), but we quickly found out that someone
>> might need to email an administrator, notify nova-compute, and do other
>> types of handling as well. Making someone write one script that did
>> everything would be too complicated and error-prone (and unsupportable).
>> So we abandoned "notification-agent" and went with this new approach.
>>
>> Coordinate with Andrew Beekhof for the nova-compute alert script, as he
>> already has some ideas for that.
> 
> OK.  I'm sure we'll be able to talk about this more next week in Austin!
> 
>>> Although even if that's possible, maybe there are good reasons to stay
>>> with the fencing_topology approach?
>>>
>>> Within the same OpenStack compute node HA scenario, it strikes me that
>>> this could be used to invoke "nova service-disable" when the
>>> nova-compute service crashes on a compute node and then fails to
>>> restart.  This would eliminate the window in between the crash and the
>>> nova server timing out the nova-compute service - during which it
>>> would otherwise be possible for nova-scheduler to attempt to schedule
>>> new VMs on the compute node with the crashed nova-compute service.
>>>
>>> IIUC, this is one area where masakari is currently more sophisticated
>>> than the approach based on OCF RAs:
>>>
>>> https://github.com/ntt-sic/masakari/blob/master/docs/evacuation_patterns.md#evacuation-patterns
>>>
>>> Does that make sense?
>>
>> Maybe. The script would need to be able to determine based on the
>> provided environment variables whether it's in that situation or not.
> 
> Yep.



Re: [ClusterLabs] Running several instances of a Corosync/Pacemaker cluster on a node

2016-05-02 Thread Ken Gaillot
On 04/26/2016 03:33 AM, Bogdan Dobrelya wrote:
> Is it possible to run several instances of a Corosync/Pacemaker clusters
> on a node? Can a node be a member of several clusters, so they could put
> resources there? I'm sure it's doable with separate nodes or containers,
> but that's not the case.
> 
> My case is to separate data-critical resources, like storage or VIPs,
> from the complex resources like DB or MQ clusters.
> 
> The latter should run with no-quorum-policy=ignore as they know how to
> deal with network partitions/split-brain, use own techniques to protect
> data and don't want external fencing from a Pacemaker, which
> no-quorum-policy/STONITH is.
> 
> The former must use STONITH (or a stop policy, if it's only a VIP), as
> they don't know how to deal with split-brain, for example.

I don't think it's possible, though I could be wrong; it might work if
separate IPs/ports, chroots and node names are used (just shy of a
container ...).

However I suspect it would not meet your goal in any case. DB and MQ
software generally do NOT have sufficient techniques to deal with a
split-brain situation -- either you lose high availability or you
corrupt data. Using no-quorum-policy=stop is fine for handling network
splits, but it does not help if a node becomes unresponsive.

Also note that pacemaker does provide the ability to treat different
resources differently with respect to quorum and fencing, without
needing to run separate clusters. See the "requires" meta-attribute:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_resource_meta_attributes
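
For example, with pcs (just a sketch; the resource names are placeholders,
and the same can be done by editing the CIB directly):

    # let the VIP require fencing, but let the DB run on bare quorum
    pcs resource meta my-vip requires=fencing
    pcs resource meta my-db requires=quorum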

I suspect your motive for this is to be able to run a cluster without
fencing. There are certain failure scenarios that simply are not
recoverable without fencing, regardless of what the application software
can do. There is really only one case in which doing without fencing is
reasonable: when you're willing to lose your data and/or have downtime
when a situation arises that requires fencing.



Re: [ClusterLabs] libqb 0.17.1 - segfault at 1b8

2016-05-02 Thread Jan Pokorný
Hello Radoslaw,

On 02/05/16 11:47 -0500, Radoslaw Garbacz wrote:
> When testing pacemaker I encountered a start error, which seems to be
> related to reported libqb segmentation fault.
> - cluster started and acquired quorum
> - some nodes failed to connect to CIB, and lost membership as a result
> - restart solved the problem
> 
> Segmentation fault reports libqb library in version 0.17.1, a standard
> package provided for CentOS.6.

Chances are that you are running into this nasty bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1114852

> Please let me know if the problem is known, and if  there is a remedy (e.g.
> using the latest libqb).

Try libqb >= 0.17.2.
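
E.g. on CentOS 6 you can check what is installed and update it:

    rpm -q libqb        # shows the installed libqb version
    yum update libqb    # or install a build >= 0.17.2 by hand if the
                        # repositories do not carry one yet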

[...]

> Logs from /var/log/messages:
> 
> Apr 22 15:46:41 (...) pacemakerd[90]:   notice: Additional logging
> available in /var/log/pacemaker.log
> Apr 22 15:46:41 (...) pacemakerd[90]:   notice: Configured corosync to
> accept connections from group 498: Library error (2)

IIRC, that last line ^ was one of the symptoms.

-- 
Jan (Poki)




[ClusterLabs] libqb 0.17.1 - segfault at 1b8

2016-05-02 Thread Radoslaw Garbacz
Hi,

Firstly, thank you for such a great tool.

When testing pacemaker I encountered a start error, which seems to be
related to a reported libqb segmentation fault:
- the cluster started and acquired quorum
- some nodes failed to connect to the CIB, and lost membership as a result
- a restart solved the problem

The segmentation fault points to the libqb library in version 0.17.1, the
standard package provided for CentOS 6.

Please let me know if the problem is known, and if there is a remedy (e.g.
using the latest libqb).
Logs are below.


Thank you in advance,




Logs from /var/log/messages:

Apr 22 15:46:41 (...) pacemakerd[90]:   notice: Additional logging
available in /var/log/pacemaker.log
Apr 22 15:46:41 (...) pacemakerd[90]:   notice: Configured corosync to
accept connections from group 498: Library error (2)
Apr 22 15:46:41 (...) pacemakerd[90]:   notice: Starting Pacemaker
1.1.13-1.el6 (Build: 577898d):  generated-manpages agent-manpages ncurses
libqb-logging libqb-ipc upstart nagios  corosync-native atomic-attrd acls
Apr 22 15:46:41 (...) pacemakerd[90]:   notice: Quorum acquired
Apr 22 15:46:41 (...) pacemakerd[90]:   notice:
pcmk_quorum_notification: Node (...)[3] - state is now member (was (null))
Apr 22 15:46:41 (...) pacemakerd[90]:   notice:
pcmk_quorum_notification: Node (...)[4] - state is now member (was (null))
Apr 22 15:46:41 (...) pacemakerd[90]:   notice:
pcmk_quorum_notification: Node (...)[2] - state is now member (was (null))
Apr 22 15:46:41 (...) pacemakerd[90]:   notice:
pcmk_quorum_notification: Node (...)[1] - state is now member (was (null))
Apr 22 15:46:41 (...) lrmd[94]:   notice: Additional logging available
in /var/log/pacemaker.log
Apr 22 15:46:41 (...) stonith-ng[93]:   notice: Additional logging
available in /var/log/pacemaker.log
Apr 22 15:46:41 (...) cib[92]:   notice: Additional logging available
in /var/log/pacemaker.log
Apr 22 15:46:41 (...) attrd[95]:   notice: Additional logging available
in /var/log/pacemaker.log
Apr 22 15:46:41 (...) stonith-ng[93]:   notice: Connecting to cluster
infrastructure: corosync
Apr 22 15:46:41 (...) pengine[96]:   notice: Additional logging
available in /var/log/pacemaker.log
Apr 22 15:46:41 (...) attrd[95]:   notice: Connecting to cluster
infrastructure: corosync
Apr 22 15:46:41 (...) crmd[97]:   notice: Additional logging available
in /var/log/pacemaker.log
Apr 22 15:46:41 (...) crmd[97]:   notice: CRM Git Version: 1.1.13-1.el6
(577898d)
Apr 22 15:46:41 (...) attrd[95]:error: Could not connect to the
Cluster Process Group API: 11
Apr 22 15:46:41 (...) attrd[95]:error: Cluster connection failed
Apr 22 15:46:41 (...) attrd[95]:   notice: Cleaning up before exit
Apr 22 15:46:41 (...) stonith-ng[93]:   notice: crm_update_peer_proc:
Node (...)[3] - state is now member (was (null))
Apr 22 15:46:41 (...) pacemakerd[90]:error: Managed process 95
(attrd) dumped core
Apr 22 15:46:41 (...) pacemakerd[90]:error: The attrd process
(95) terminated with signal 11 (core=1)
Apr 22 15:46:41 (...) pacemakerd[90]:   notice: Respawning failed child
process: attrd
Apr 22 15:46:41 (...) cib[92]:   notice: Connecting to cluster
infrastructure: corosync
Apr 22 15:46:41 (...) cib[92]:error: Could not connect to the
Cluster Process Group API: 11
Apr 22 15:46:41 (...) cib[92]: crit: Cannot sign in to the
cluster... terminating
Apr 22 15:46:41 (...) kernel: [17169.112132] attrd[95]: segfault at 1b8
ip 7f6fc9dc3181 sp 7ffd7cf668f0 error 4 in
libqb.so.0.17.1[7f6fc9db4000+21000]
Apr 22 15:46:41 (...) pacemakerd[90]:  warning: The cib process
(92) can no longer be respawned, shutting the cluster down.
Apr 22 15:46:41 (...) pacemakerd[90]:   notice: Shutting down Pacemaker
Apr 22 15:46:41 (...) pacemakerd[90]:   notice: Stopping crmd: Sent -15
to process 97
Apr 22 15:46:41 (...) attrd[98]:   notice: Additional logging available
in /var/log/pacemaker.log
Apr 22 15:46:41 (...) crmd[97]:  warning: Couldn't complete CIB
registration 1 times... pause and retry
Apr 22 15:46:41 (...) crmd[97]:   notice: Invoking handler for signal
15: Terminated
Apr 22 15:46:41 (...) crmd[97]:   notice: Requesting shutdown, upper
limit is 120ms
Apr 22 15:46:41 (...) crmd[97]:  warning: FSA: Input I_SHUTDOWN from
crm_shutdown() received in state S_STARTING
Apr 22 15:46:41 (...) crmd[97]:   notice: State transition S_STARTING
-> S_STOPPING [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
Apr 22 15:46:41 (...) crmd[97]:   notice: Disconnecting from Corosync
Apr 22 15:46:41 (...) attrd[98]:   notice: Connecting to cluster
infrastructure: corosync
Apr 22 15:46:41 (...) attrd[98]:error: Could not connect to the
Cluster Process Group API: 11
Apr 22 15:46:41 (...) attrd[98]:error: Cluster connection failed
Apr 22 15:46:41 (...) attrd[98]:   notice: Cleaning up before exit
Apr 22 

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-02 Thread Nikhil Utane
So what I understand you are saying is: if the HW is bi-endian, then
enable LE on PPC. Is that right?
I need to check on that.

On Mon, May 2, 2016 at 12:49 PM, Nikhil Utane 
wrote:

> Sorry about my ignorance but could you pls elaborate what do you mean by
> "try to ppcle"?
>
> Our target platform is ppc so it is BE. We have to get it running only on
> that.
> How do we know this is LE/BE issue and nothing else?
>
> -Thanks
> Nikhil
>
>
> On Mon, May 2, 2016 at 12:24 PM, Jan Friesse  wrote:
>
>> As your hardware is probably capable of running ppcle and if you have an
>>> environment
>>> at hand without too much effort it might pay off to try that.
>>> There are of course distributions out there support corosync on
>>> big-endian architectures
>>> but I don't know if there is an automatized regression for corosync on
>>> big-endian that
>>> would catch big-endian-issues right away with something as current as
>>> your 2.3.5.
>>>
>>
>> No we are not testing big-endian.
>>
>> So totally agree with Klaus. Give a try to ppcle. Also make sure all
>> nodes are little-endian. Corosync should work in mixed BE/LE environment
>> but because it's not tested, it may not work (and it's a bug, so if ppcle
>> works I will try to fix BE).
>>
>> Regards,
>>   Honza
>>
>>
>>
>>> Regards,
>>> Klaus
>>>
>>> On 05/02/2016 06:44 AM, Nikhil Utane wrote:
>>>
 Re-sending as I don't see my post on the thread.

 On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
 >
 wrote:

  Hi,

  Looking for some guidance here as we are completely blocked
  otherwise :(.

  -Regards
  Nikhil

  On Fri, Apr 29, 2016 at 6:11 PM, Sriram > wrote:

  Corrected the subject.

  We went ahead and captured corosync debug logs for our ppc
 board.
  After log analysis and comparison with the sucessful logs(
  from x86 machine) ,
  we didnt find *"[ MAIN  ] Completed service synchronization,
  ready to provide service.*" in ppc logs.
  So, looks like corosync is not in a position to accept
  connection from Pacemaker.
  Even I tried with the new corosync.conf with no success.

  Any hints on this issue would be really helpful.

  Attaching ppc_notworking.log, x86_working.log, corosync.conf.

  Regards,
  Sriram



  On Fri, Apr 29, 2016 at 2:44 PM, Sriram > wrote:

  Hi,

  I went ahead and made some changes in file system(Like I
  brought in /etc/init.d/corosync and /etc/init.d/pacemaker,
  /etc/sysconfig ), After that I was able to run  "pcs
  cluster start".
  But it failed with the following error
   # pcs cluster start
  Starting Cluster...
  Starting Pacemaker Cluster Manager[FAILED]
  Error: unable to start pacemaker

  And in the /var/log/pacemaker.log, I saw these errors
  pacemakerd: info: mcp_read_config:  cmap connection
  setup failed: CS_ERR_TRY_AGAIN.  Retrying in 4s
  Apr 29 08:53:47 [15863] node_cu pacemakerd: info:
  mcp_read_config:  cmap connection setup failed:
  CS_ERR_TRY_AGAIN.  Retrying in 5s
  Apr 29 08:53:52 [15863] node_cu pacemakerd:  warning:
  mcp_read_config:  Could not connect to Cluster
  Configuration Database API, error 6
  Apr 29 08:53:52 [15863] node_cu pacemakerd:   notice:
  main: Could not obtain corosync config data, exiting
  Apr 29 08:53:52 [15863] node_cu pacemakerd: info:
  crm_xml_cleanup:  Cleaning up memory from libxml2


  And in the /var/log/Debuglog, I saw these errors coming
  from corosync
  20160429 085347.487050  airv_cu
  daemon.warn corosync[12857]:   [QB] Denied connection,
  is not ready (12857-15863-14)
  20160429 085347.487067  airv_cu
  daemon.info  corosync[12857]:   [QB
  ] Denied connection, is not ready (12857-15863-14)


  I browsed the code of libqb to find that it is failing in


 https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c

  Line 600 :
  handle_new_connection function

  Line 637:
  if (auth_result == 0 

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-02 Thread Nikhil Utane
Sorry about my ignorance, but could you please elaborate on what you mean
by "try ppcle"?

Our target platform is ppc, so it is BE. We have to get it running only on
that.
How do we know this is an LE/BE issue and nothing else?

-Thanks
Nikhil


On Mon, May 2, 2016 at 12:24 PM, Jan Friesse  wrote:

> As your hardware is probably capable of running ppcle and if you have an
>> environment
>> at hand without too much effort it might pay off to try that.
>> There are of course distributions out there support corosync on
>> big-endian architectures
>> but I don't know if there is an automatized regression for corosync on
>> big-endian that
>> would catch big-endian-issues right away with something as current as
>> your 2.3.5.
>>
>
> No we are not testing big-endian.
>
> So totally agree with Klaus. Give a try to ppcle. Also make sure all nodes
> are little-endian. Corosync should work in mixed BE/LE environment but
> because it's not tested, it may not work (and it's a bug, so if ppcle works
> I will try to fix BE).
>
> Regards,
>   Honza
>
>
>
>> Regards,
>> Klaus
>>
>> On 05/02/2016 06:44 AM, Nikhil Utane wrote:
>>
>>> Re-sending as I don't see my post on the thread.
>>>
>>> On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
>>> >
>>> wrote:
>>>
>>>  Hi,
>>>
>>>  Looking for some guidance here as we are completely blocked
>>>  otherwise :(.
>>>
>>>  -Regards
>>>  Nikhil
>>>
>>>  On Fri, Apr 29, 2016 at 6:11 PM, Sriram >>  > wrote:
>>>
>>>  Corrected the subject.
>>>
>>>  We went ahead and captured corosync debug logs for our ppc
>>> board.
>>>  After log analysis and comparison with the sucessful logs(
>>>  from x86 machine) ,
>>>  we didnt find *"[ MAIN  ] Completed service synchronization,
>>>  ready to provide service.*" in ppc logs.
>>>  So, looks like corosync is not in a position to accept
>>>  connection from Pacemaker.
>>>  Even I tried with the new corosync.conf with no success.
>>>
>>>  Any hints on this issue would be really helpful.
>>>
>>>  Attaching ppc_notworking.log, x86_working.log, corosync.conf.
>>>
>>>  Regards,
>>>  Sriram
>>>
>>>
>>>
>>>  On Fri, Apr 29, 2016 at 2:44 PM, Sriram >>  > wrote:
>>>
>>>  Hi,
>>>
>>>  I went ahead and made some changes in file system(Like I
>>>  brought in /etc/init.d/corosync and /etc/init.d/pacemaker,
>>>  /etc/sysconfig ), After that I was able to run  "pcs
>>>  cluster start".
>>>  But it failed with the following error
>>>   # pcs cluster start
>>>  Starting Cluster...
>>>  Starting Pacemaker Cluster Manager[FAILED]
>>>  Error: unable to start pacemaker
>>>
>>>  And in the /var/log/pacemaker.log, I saw these errors
>>>  pacemakerd: info: mcp_read_config:  cmap connection
>>>  setup failed: CS_ERR_TRY_AGAIN.  Retrying in 4s
>>>  Apr 29 08:53:47 [15863] node_cu pacemakerd: info:
>>>  mcp_read_config:  cmap connection setup failed:
>>>  CS_ERR_TRY_AGAIN.  Retrying in 5s
>>>  Apr 29 08:53:52 [15863] node_cu pacemakerd:  warning:
>>>  mcp_read_config:  Could not connect to Cluster
>>>  Configuration Database API, error 6
>>>  Apr 29 08:53:52 [15863] node_cu pacemakerd:   notice:
>>>  main: Could not obtain corosync config data, exiting
>>>  Apr 29 08:53:52 [15863] node_cu pacemakerd: info:
>>>  crm_xml_cleanup:  Cleaning up memory from libxml2
>>>
>>>
>>>  And in the /var/log/Debuglog, I saw these errors coming
>>>  from corosync
>>>  20160429 085347.487050  airv_cu
>>>  daemon.warn corosync[12857]:   [QB] Denied connection,
>>>  is not ready (12857-15863-14)
>>>  20160429 085347.487067  airv_cu
>>>  daemon.info  corosync[12857]:   [QB
>>>  ] Denied connection, is not ready (12857-15863-14)
>>>
>>>
>>>  I browsed the code of libqb to find that it is failing in
>>>
>>>
>>> https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c
>>>
>>>  Line 600 :
>>>  handle_new_connection function
>>>
>>>  Line 637:
>>>  if (auth_result == 0 &&
>>>  c->service->serv_fns.connection_accept) {
>>>  res = c->service->serv_fns.connection_accept(c,
>>>   c->euid, c->egid);
>>>  }
>>>  if (res != 0) {
>>>  goto send_response;
>>>  }
>>>
>>> 

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-02 Thread Jan Friesse

As your hardware is probably capable of running ppcle, and if you have an
environment at hand without too much effort, it might pay off to try that.
There are of course distributions out there that support corosync on
big-endian architectures, but I don't know if there is an automated
regression test for corosync on big-endian that would catch big-endian
issues right away with something as current as your 2.3.5.


No, we are not testing big-endian.

So I totally agree with Klaus: give ppcle a try. Also make sure all
nodes are little-endian. Corosync should work in a mixed BE/LE environment,
but because it's not tested, it may not work (and that would be a bug, so
if ppcle works I will try to fix BE).
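
A quick way to double-check what each node is actually running (lscpu may
not exist on a minimal embedded image, so treat this only as an example):

    lscpu | grep -i 'byte order'    # reports "Little Endian" or "Big Endian"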


Regards,
  Honza



Regards,
Klaus

On 05/02/2016 06:44 AM, Nikhil Utane wrote:

Re-sending as I don't see my post on the thread.

On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
> wrote:

 Hi,

 Looking for some guidance here as we are completely blocked
 otherwise :(.

 -Regards
 Nikhil

 On Fri, Apr 29, 2016 at 6:11 PM, Sriram > wrote:

 Corrected the subject.

 We went ahead and captured corosync debug logs for our ppc board.
 After log analysis and comparison with the sucessful logs(
 from x86 machine) ,
 we didnt find *"[ MAIN  ] Completed service synchronization,
 ready to provide service.*" in ppc logs.
 So, looks like corosync is not in a position to accept
 connection from Pacemaker.
 Even I tried with the new corosync.conf with no success.

 Any hints on this issue would be really helpful.

 Attaching ppc_notworking.log, x86_working.log, corosync.conf.

 Regards,
 Sriram



 On Fri, Apr 29, 2016 at 2:44 PM, Sriram > wrote:

 Hi,

 I went ahead and made some changes in file system(Like I
 brought in /etc/init.d/corosync and /etc/init.d/pacemaker,
 /etc/sysconfig ), After that I was able to run  "pcs
 cluster start".
 But it failed with the following error
  # pcs cluster start
 Starting Cluster...
 Starting Pacemaker Cluster Manager[FAILED]
 Error: unable to start pacemaker

 And in the /var/log/pacemaker.log, I saw these errors
 pacemakerd: info: mcp_read_config:  cmap connection
 setup failed: CS_ERR_TRY_AGAIN.  Retrying in 4s
 Apr 29 08:53:47 [15863] node_cu pacemakerd: info:
 mcp_read_config:  cmap connection setup failed:
 CS_ERR_TRY_AGAIN.  Retrying in 5s
 Apr 29 08:53:52 [15863] node_cu pacemakerd:  warning:
 mcp_read_config:  Could not connect to Cluster
 Configuration Database API, error 6
 Apr 29 08:53:52 [15863] node_cu pacemakerd:   notice:
 main: Could not obtain corosync config data, exiting
 Apr 29 08:53:52 [15863] node_cu pacemakerd: info:
 crm_xml_cleanup:  Cleaning up memory from libxml2


 And in the /var/log/Debuglog, I saw these errors coming
 from corosync
 20160429 085347.487050  airv_cu
 daemon.warn corosync[12857]:   [QB] Denied connection,
 is not ready (12857-15863-14)
 20160429 085347.487067  airv_cu
 daemon.info  corosync[12857]:   [QB
 ] Denied connection, is not ready (12857-15863-14)


 I browsed the code of libqb to find that it is failing in

 https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c

 Line 600 :
 handle_new_connection function

 Line 637:
 if (auth_result == 0 &&
 c->service->serv_fns.connection_accept) {
 res = c->service->serv_fns.connection_accept(c,
  c->euid, c->egid);
 }
 if (res != 0) {
 goto send_response;
 }

 Any hints on this issue would be really helpful for me to
 go ahead.
 Please let me know if any logs are required,

 Regards,
 Sriram

 On Thu, Apr 28, 2016 at 2:42 PM, Sriram
 > wrote:

 Thanks Ken and Emmanuel.
 Its a big endian machine. I will try with running "pcs
 cluster setup" and "pcs cluster start"
 Inside cluster.py, "service pacemaker start" and
 "service corosync start" are executed to bring up
 pacemaker and corosync.
 Those service scripts and the infrastructure needed 

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-02 Thread Klaus Wenninger
As your hardware is probably capable of running ppcle, and if you have an
environment at hand without too much effort, it might pay off to try that.
There are of course distributions out there that support corosync on
big-endian architectures, but I don't know if there is an automated
regression test for corosync on big-endian that would catch big-endian
issues right away with something as current as your 2.3.5.

Regards,
Klaus

On 05/02/2016 06:44 AM, Nikhil Utane wrote:
> Re-sending as I don't see my post on the thread.
>
> On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
> > wrote:
>
> Hi,
>
> Looking for some guidance here as we are completely blocked
> otherwise :(.
>
> -Regards
> Nikhil
>
> On Fri, Apr 29, 2016 at 6:11 PM, Sriram  > wrote:
>
> Corrected the subject.
>
> We went ahead and captured corosync debug logs for our ppc board.
> After log analysis and comparison with the sucessful logs(
> from x86 machine) ,
> we didnt find *"[ MAIN  ] Completed service synchronization,
> ready to provide service.*" in ppc logs.
> So, looks like corosync is not in a position to accept
> connection from Pacemaker.
> Even I tried with the new corosync.conf with no success.
>
> Any hints on this issue would be really helpful.
>
> Attaching ppc_notworking.log, x86_working.log, corosync.conf.
>
> Regards,
> Sriram
>
>
>
> On Fri, Apr 29, 2016 at 2:44 PM, Sriram  > wrote:
>
> Hi,
>
> I went ahead and made some changes in file system(Like I
> brought in /etc/init.d/corosync and /etc/init.d/pacemaker,
> /etc/sysconfig ), After that I was able to run  "pcs
> cluster start".
> But it failed with the following error
>  # pcs cluster start
> Starting Cluster...
> Starting Pacemaker Cluster Manager[FAILED]
> Error: unable to start pacemaker
>
> And in the /var/log/pacemaker.log, I saw these errors
> pacemakerd: info: mcp_read_config:  cmap connection
> setup failed: CS_ERR_TRY_AGAIN.  Retrying in 4s
> Apr 29 08:53:47 [15863] node_cu pacemakerd: info:
> mcp_read_config:  cmap connection setup failed:
> CS_ERR_TRY_AGAIN.  Retrying in 5s
> Apr 29 08:53:52 [15863] node_cu pacemakerd:  warning:
> mcp_read_config:  Could not connect to Cluster
> Configuration Database API, error 6
> Apr 29 08:53:52 [15863] node_cu pacemakerd:   notice:
> main: Could not obtain corosync config data, exiting
> Apr 29 08:53:52 [15863] node_cu pacemakerd: info:
> crm_xml_cleanup:  Cleaning up memory from libxml2
>
>
> And in the /var/log/Debuglog, I saw these errors coming
> from corosync
> 20160429 085347.487050  airv_cu
> daemon.warn corosync[12857]:   [QB] Denied connection,
> is not ready (12857-15863-14)
> 20160429 085347.487067  airv_cu
> daemon.info  corosync[12857]:   [QB   
> ] Denied connection, is not ready (12857-15863-14)
>
>
> I browsed the code of libqb to find that it is failing in
>
> https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c
>
> Line 600 :
> handle_new_connection function
>
> Line 637:
> if (auth_result == 0 &&
> c->service->serv_fns.connection_accept) {
> res = c->service->serv_fns.connection_accept(c,
>  c->euid, c->egid);
> }
> if (res != 0) {
> goto send_response;
> }
>
> Any hints on this issue would be really helpful for me to
> go ahead.
> Please let me know if any logs are required,
>
> Regards,
> Sriram
>
> On Thu, Apr 28, 2016 at 2:42 PM, Sriram
> > wrote:
>
> Thanks Ken and Emmanuel.
> Its a big endian machine. I will try with running "pcs
> cluster setup" and "pcs cluster start"
> Inside cluster.py, "service pacemaker start" and
> "service corosync start" are executed to bring up
> pacemaker and corosync.
> Those service scripts and the infrastructure needed to
> bring up the processes in the above said manner
> doesn't exist in my board.
> As it is a embedded board with the limited memory,
>