Re: [ClusterLabs] start a resource

2016-05-05 Thread Moiz Arif
Hi Dimitri,
Try cleaning up the fail count for the resource with any of the commands below:

via pcs: pcs resource cleanup rsyncd
via crm: crm resource cleanup rsyncd
Hope it helps. 
Moiz
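
A minimal sketch of that workflow (the resource name comes from the failed
action quoted below; the status commands are only there to verify the result):

    # pcs resource cleanup rsyncd   # clear the recorded failure and re-probe
    # pcs status                    # the Failed Actions entry should be gone
    # crm_mon -1                    # one-shot cluster status view, same check

After the cleanup, the cluster re-checks the resource and starts it again if
the underlying problem has been fixed.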
To: users@clusterlabs.org
From: dmaz...@bmrb.wisc.edu
Date: Thu, 5 May 2016 14:15:09 -0500
Subject: [ClusterLabs] start a resource

Hi all,
 
I'm sure it must be a FAQ, but how do I start a resource? E.g.
 
Failed Actions:
* rsyncd_start_0 on tarpon 'unknown error' (1): call=78,
status=complete, exitreason='Error. "pid file" entry required in the
rsyncd config file by rsyncd OCF RA.',
last-rc-change='Thu May  5 13:55:50 2016', queued=0ms, exec=51ms
 
OK, I fixed the config file, how do I restart rsyncd now?
 
TIA
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
 

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] start a resource

2016-05-05 Thread Dimitri Maziuk
Hi all,

I'm sure it must be a FAQ, but how do I start a resource? E.g.

Failed Actions:
* rsyncd_start_0 on tarpon 'unknown error' (1): call=78,
status=complete, exitreason='Error. "pid file" entry required in the
rsyncd config file by rsyncd OCF RA.',
last-rc-change='Thu May  5 13:55:50 2016', queued=0ms, exec=51ms

OK, I fixed the config file, how do I restart rsyncd now?

TIA
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
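
For the error itself: the rsyncd OCF RA refuses to start without a "pid file"
entry in the rsync daemon config. A minimal rsyncd.conf sketch (the pid path
and module name are assumptions, adjust to your layout):

    pid file = /var/run/rsyncd.pid

    [data]
    path = /srv/data
    read only = false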



___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Unable to run Pacemaker: pcmk_child_exit

2016-05-05 Thread Ken Gaillot
On 05/05/2016 11:25 AM, Nikhil Utane wrote:
> Thanks Ken for your quick response as always.
> 
> But what if I don't want to use quorum? I just want to bring up
> pacemaker + corosync on 1 node to check that it all comes up fine.
> I added corosync_votequorum as you suggested. Additionally I also added
> these 2 lines:
> 
> expected_votes: 2
> two_node: 1

There's actually nothing wrong with configuring a single-node cluster.
You can list just one node in corosync.conf and leave off the above.
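
For example, a minimal single-node corosync.conf along those lines might look
like this (cluster name, node address, and node id are placeholders):

    totem {
        version: 2
        cluster_name: mycluster
    }

    nodelist {
        node {
            ring0_addr: 192.168.1.10
            nodeid: 1
        }
    }

    quorum {
        provider: corosync_votequorum
    }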

> However still pacemaker is not able to run.

There must be other issues involved. Even if pacemaker doesn't have
quorum, it will still run, it just won't start resources.
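
A quick way to see what corosync itself thinks about quorum is the quorum
tool, for example:

    # corosync-quorumtool -s

which prints the vote counts and whether quorum is currently reached.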

> [root@airv_cu root]# pcs cluster start
> Starting Cluster...
> Starting Pacemaker Cluster Manager[FAILED]
> 
> Error: unable to start pacemaker
> 
> Corosync.log:
> *May 05 16:15:20 [16294] airv_cu pacemakerd: info:
> pcmk_quorum_notification: Membership 240: quorum still lost (1)*
> May 05 16:15:20 [16259] airv_cu corosync debug   [QB] Free'ing
> ringbuffer: /dev/shm/qb-cmap-request-16259-16294-21-header
> May 05 16:15:20 [16294] airv_cu pacemakerd:   notice:
> crm_update_peer_state_iter:   pcmk_quorum_notification: Node
> airv_cu[181344357] - state is now member (was (null))
> May 05 16:15:20 [16294] airv_cu pacemakerd: info:
> pcmk_cpg_membership:  Node 181344357 joined group pacemakerd
> (counter=0.0)
> May 05 16:15:20 [16294] airv_cu pacemakerd: info:
> pcmk_cpg_membership:  Node 181344357 still member of group
> pacemakerd (peer=airv_cu, counter=0.0)
> May 05 16:15:20 [16294] airv_cu pacemakerd:  warning: pcmk_child_exit:
>  The cib process (16353) can no longer be respawned, shutting the
> cluster down.
> May 05 16:15:20 [16294] airv_cu pacemakerd:   notice:
> pcmk_shutdown_worker: Shutting down Pacemaker
> 
> The log and conf file is attached.
> 
> -Regards
> Nikhil
> 
> On Thu, May 5, 2016 at 8:04 PM, Ken Gaillot wrote:
> 
> On 05/05/2016 08:36 AM, Nikhil Utane wrote:
> > Hi,
> >
> > Continuing with my adventure to run Pacemaker & Corosync on our
> > big-endian system, I managed to get past the corosync issue for now. But
> > facing an issue in running Pacemaker.
> >
> > Seeing following messages in corosync.log.
> >  pacemakerd:  warning: pcmk_child_exit:  The cib process (2) can no
> > longer be respawned, shutting the cluster down.
> >  pacemakerd:  warning: pcmk_child_exit:  The stonith-ng process (20001)
> > can no longer be respawned, shutting the cluster down.
> >  pacemakerd:  warning: pcmk_child_exit:  The lrmd process (20002) can no
> > longer be respawned, shutting the cluster down.
> >  pacemakerd:  warning: pcmk_child_exit:  The attrd process (20003) can
> > no longer be respawned, shutting the cluster down.
> >  pacemakerd:  warning: pcmk_child_exit:  The pengine process (20004) can
> > no longer be respawned, shutting the cluster down.
> >  pacemakerd:  warning: pcmk_child_exit:  The crmd process (20005) can no
> > longer be respawned, shutting the cluster down.
> >
> > I see following error before these messages. Not sure if this is the 
> cause.
> > May 05 11:26:24 [19998] airv_cu pacemakerd:error:
> > cluster_connect_quorum:   Corosync quorum is not configured
> >
> > I tried removing the quorum block (which is anyways blank) from the conf
> > file but still had the same error.
> 
> Yes, that is the issue. Pacemaker can't do anything if it can't ask
> corosync about quorum. I don't know what the issue is at the corosync
> level, but your corosync.conf should have:
> 
> quorum {
> provider: corosync_votequorum
> }
> 
> 
> > Attaching the log and conf files. Please let me know if there is any
> > obvious mistake or how to investigate it further.
> >
> > I am using pcs cluster start command to start the cluster
> >
> > -Thanks
> > Nikhil

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Unable to run Pacemaker: pcmk_child_exit

2016-05-05 Thread Nikhil Utane
Hi,

Continuing with my adventure to run Pacemaker & Corosync on our big-endian
system, I managed to get past the corosync issue for now, but I am facing an
issue running Pacemaker.

I am seeing the following messages in corosync.log:
 pacemakerd:  warning: pcmk_child_exit:  The cib process (2) can no
longer be respawned, shutting the cluster down.
 pacemakerd:  warning: pcmk_child_exit:  The stonith-ng process (20001) can
no longer be respawned, shutting the cluster down.
 pacemakerd:  warning: pcmk_child_exit:  The lrmd process (20002) can no
longer be respawned, shutting the cluster down.
 pacemakerd:  warning: pcmk_child_exit:  The attrd process (20003) can no
longer be respawned, shutting the cluster down.
 pacemakerd:  warning: pcmk_child_exit:  The pengine process (20004) can no
longer be respawned, shutting the cluster down.
 pacemakerd:  warning: pcmk_child_exit:  The crmd process (20005) can no
longer be respawned, shutting the cluster down.

I see the following error before these messages. Not sure if this is the cause.
May 05 11:26:24 [19998] airv_cu pacemakerd:error:
cluster_connect_quorum:   Corosync quorum is not configured

I tried removing the quorum block (which is anyway blank) from the conf
file but still had the same error.

Attaching the log and conf files. Please let me know if there is any
obvious mistake or how to investigate it further.

I am using the pcs cluster start command to start the cluster.

-Thanks
Nikhil


corosync.log
Description: Binary data


pacemaker.log
Description: Binary data


corosync.conf
Description: Binary data
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: ringid interface FAULTY no resource move

2016-05-05 Thread Rafał Sanocki

For what?

The cluster works fine: when I stop pacemaker, resources go to the second node.
But when the connection is lost, nothing happens.


Failed Actions:
* p_ip_2_monitor_3 on csb01B 'not running' (7): call=47, 
status=complete, exitreason='none',

last-rc-change='Wed May  4 17:34:50 2016', queued=0ms, exec=0ms

I just want to move resources when the connection on one ring is lost.
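
For what it's worth, a faulty ring by itself does not trigger resource
movement; corosync simply keeps talking over the remaining ring. If the
actual goal is to relocate resources on loss of outside connectivity, one
common pattern (not discussed further in this thread; the host list and
resource names below are placeholders) is a cloned ocf:pacemaker:ping
resource plus a location rule, sketched here in crmsh syntax:

    primitive p_ping ocf:pacemaker:ping \
        params host_list="192.168.1.1" multiplier="1000" \
        op monitor interval="10s"
    clone cl_ping p_ping
    location loc_p_ip_2_connected p_ip_2 \
        rule -inf: not_defined pingd or pingd lte 0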




On 2016-05-04 at 15:50, emmanuel segura wrote:

use fencing and drbd fencing handler
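
The DRBD side of that advice is typically a drbd.conf fragment along these
lines (a sketch; the resource name is a placeholder, and the handler scripts
ship with drbd-utils):

    resource r0 {
        disk {
            fencing resource-only;
        }
        handlers {
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
    }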

2016-05-04 14:46 GMT+02:00 Rafał Sanocki:

Resources should move to the second node when any interface is down.




On 2016-05-04 at 14:41, Ulrich Windl wrote:


Rafal Sanocki wrote on 04.05.2016 at 14:14 in message
<78d882b1-a407-31e0-2b9e-b5f8406d4...@gmail.com>:

Hello,
I can't find what I did wrong. I have a 2-node cluster: Corosync, Pacemaker,
DRBD. When I pull out a cable, nothing happens.

"nothing"? The wrong cable?

[...]

Regards,
Ulrich



___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-05 Thread Nikhil Utane
It worked for me. :)
I'll wait for your formal patch but until then I am able to proceed
further. (Don't know if I'll run into something else)

However now encountering issue in pacemaker.

May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:  The
cib process (15224) can no longer be respawned, shutting the cluster down.
May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:  The
stonith-ng process (15225) can no longer be respawned, shutting the cluster
down.
May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:  The
lrmd process (15226) can no longer be respawned, shutting the cluster down.
May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:  The
crmd process (15229) can no longer be respawned, shutting the cluster down.
May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:  The
pengine process (15228) can no longer be respawned, shutting the cluster
down.
May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:  The
attrd process (15227) can no longer be respawned, shutting the cluster down.

Looking into it.

-Thanks
Nikhil

On Thu, May 5, 2016 at 2:58 PM, Jan Friesse wrote:

> Nikhil
>
> Found the root-cause.
>> In file schedwrk.c, the function handle2void() uses a union which was not
>> initialized.
>> Because of that the handle value was computed incorrectly (lower half was
>> garbage).
>>
>>   56 static hdb_handle_t
>>   57 void2handle (const void *v) { union u u={}; u.v = v; return u.h; }
>>   58 static const void *
>>   59 handle2void (hdb_handle_t h) { union u u={}; u.h = h; return u.v; }
>>
>> After initializing (as highlighted), the corosync initialization seems to
>> be going through fine. Will check other things.
>>
>
> Your patch is incorrect and actually doesn't work. As I said (when
> pointing you to schedwrk.c), I will send you proper patch, but fix that
> issue correctly is not easy.
>
> Regards,
>   Honza
>
>
>> -Regards
>> Nikhil
>>
>> On Tue, May 3, 2016 at 7:04 PM, Nikhil Utane wrote:
>>
>> Thanks for your response Dejan.
>>>
>>> I do not know yet whether this has anything to do with endianness.
>>> FWIW, there could be something quirky with the system so keeping all
>>> options open. :)
>>>
>>> I added some debug prints to understand what's happening under the hood.
>>>
>>> *Success case: (on x86 machine): *
>>> [TOTEM ] entering OPERATIONAL state.
>>> [TOTEM ] A new membership (10.206.1.7:137220) was formed. Members
>>> joined:
>>> 181272839
>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
>>> my_high_delivered=0
>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
>>> my_high_delivered=0
>>> [TOTEM ] Delivering 0 to 1
>>> [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
>>> [SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=1
>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=2,
>>> my_high_delivered=1
>>> [TOTEM ] Delivering 1 to 2
>>> [TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
>>> [SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=0
>>> [SYNC  ] Nikhil: Entering sync_barrier_handler
>>> [SYNC  ] Committing synchronization for corosync configuration map access
>>> .
>>> [TOTEM ] Delivering 2 to 4
>>> [TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
>>> [TOTEM ] Delivering MCAST message with seq 4 to pending delivery queue
>>> [CPG   ] comparing: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
>>> [CPG   ] chosen downlist: sender r(0) ip(10.206.1.7) ; members(old:0
>>> left:0)
>>> [SYNC  ] Committing synchronization for corosync cluster closed process
>>> group service v1.01
>>> *[MAIN  ] Completed service synchronization, ready to provide service.*
>>>
>>>
>>> *Failure case: (on ppc)*:
>>>
>>> [TOTEM ] entering OPERATIONAL state.
>>> [TOTEM ] A new membership (10.207.24.101:16) was formed. Members joined:
>>> 181344357
>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
>>> my_high_delivered=0
>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
>>> my_high_delivered=0
>>> [TOTEM ] Delivering 0 to 1
>>> [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
>>> [SYNC  ] Nikhil: Inside sync_deliver_fn header->id=1
>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
>>> my_high_delivered=1
>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
>>> my_high_delivered=1
>>> Above message repeats continuously.
>>>
>>> So it appears that in failure case I do not receive messages with
>>> sequence
>>> number 2-4.
>>> If somebody can throw some ideas that'll help a lot.
>>>
>>> -Thanks
>>> Nikhil
>>>
>>> On Tue, May 3, 2016 at 5:26 PM, Dejan Muhamedagic wrote:
>>>
>>> Hi,

 On Mon, May 02, 2016 at 08:54:09AM +0200, Jan Friesse wrote:

> As your hardware is probably capable of running ppcle and if you have

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-05 Thread Jan Friesse

Nikhil


Found the root-cause.
In file schedwrk.c, the function handle2void() uses a union which was not
initialized.
Because of that the handle value was computed incorrectly (lower half was
garbage).

  56 static hdb_handle_t
  57 void2handle (const void *v) { union u u={}; u.v = v; return u.h; }
  58 static const void *
  59 handle2void (hdb_handle_t h) { union u u={}; u.h = h; return u.v; }

After initializing (as highlighted), the corosync initialization seems to
be going through fine. Will check other things.


Your patch is incorrect and actually doesn't work. As I said (when 
pointing you to schedwrk.c), I will send you a proper patch, but fixing that 
issue correctly is not easy.


Regards,
  Honza



-Regards
Nikhil

On Tue, May 3, 2016 at 7:04 PM, Nikhil Utane wrote:


Thanks for your response Dejan.

I do not know yet whether this has anything to do with endianness.
FWIW, there could be something quirky with the system so keeping all
options open. :)

I added some debug prints to understand what's happening under the hood.

*Success case: (on x86 machine): *
[TOTEM ] entering OPERATIONAL state.
[TOTEM ] A new membership (10.206.1.7:137220) was formed. Members joined:
181272839
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
my_high_delivered=0
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=0
[TOTEM ] Delivering 0 to 1
[TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
[SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=2,
my_high_delivered=1
[TOTEM ] Delivering 1 to 2
[TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
[SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=0
[SYNC  ] Nikhil: Entering sync_barrier_handler
[SYNC  ] Committing synchronization for corosync configuration map access
.
[TOTEM ] Delivering 2 to 4
[TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
[TOTEM ] Delivering MCAST message with seq 4 to pending delivery queue
[CPG   ] comparing: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
[CPG   ] chosen downlist: sender r(0) ip(10.206.1.7) ; members(old:0
left:0)
[SYNC  ] Committing synchronization for corosync cluster closed process
group service v1.01
*[MAIN  ] Completed service synchronization, ready to provide service.*


*Failure case: (on ppc)*:
[TOTEM ] entering OPERATIONAL state.
[TOTEM ] A new membership (10.207.24.101:16) was formed. Members joined:
181344357
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
my_high_delivered=0
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=0
[TOTEM ] Delivering 0 to 1
[TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
[SYNC  ] Nikhil: Inside sync_deliver_fn header->id=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=1
Above message repeats continuously.

So it appears that in failure case I do not receive messages with sequence
number 2-4.
If somebody can throw some ideas that'll help a lot.

-Thanks
Nikhil

On Tue, May 3, 2016 at 5:26 PM, Dejan Muhamedagic wrote:


Hi,

On Mon, May 02, 2016 at 08:54:09AM +0200, Jan Friesse wrote:

As your hardware is probably capable of running ppcle, and if you have an
environment at hand without too much effort, it might pay off to try that.
There are of course distributions out there supporting corosync on
big-endian architectures, but I don't know if there is an automated
regression for corosync on big-endian that would catch big-endian issues
right away with something as current as your 2.3.5.


No, we are not testing big-endian.

So I totally agree with Klaus. Give ppcle a try. Also make sure all
nodes are little-endian. Corosync should work in a mixed BE/LE
environment, but because it's not tested, it may not work (and that's a
bug, so if ppcle works I will try to fix BE).


I tested a cluster consisting of big endian/little endian nodes
(s390 and x86-64), but that was a while ago. IIRC, all relevant
bugs in corosync got fixed at that time. I don't know what the
situation is with the latest version.

Thanks,

Dejan


Regards,
   Honza



Regards,
Klaus

On 05/02/2016 06:44 AM, Nikhil Utane wrote:

Re-sending as I don't see my post on the thread.

On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane wrote:


 Hi,

 Looking for some guidance here as we are completely blocked
 otherwise :(.

 -Regards
 Nikhil

 On Fri, Apr 29, 2016 at 6:11 PM, Sriram wrote:

 Corrected the subject.

 We went ahead and captured corosync debug logs for our ppc board.

 After log analysis and comparison with the successful logs (from the
 x86 machine),
 we 

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-05 Thread Nikhil Utane
Found the root cause.
In file schedwrk.c, the function handle2void() uses a union that was not
initialized. Because of that, the handle value was computed incorrectly
(the lower half was garbage).

 56 static hdb_handle_t
 57 void2handle (const void *v) { union u u={}; u.v = v; return u.h; }
 58 static const void *
 59 handle2void (hdb_handle_t h) { union u u={}; u.h = h; return u.v; }

After initializing it (as highlighted), corosync initialization seems to be
going through fine. I will check other things.
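
As an illustration of the byte layout involved (only a sketch of the symptom,
not a proper fix; the maintainer notes elsewhere in this thread that fixing
it correctly is not easy; it assumes a 32-bit big-endian target with a
64-bit hdb_handle_t):

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t hdb_handle_t;
    union u { hdb_handle_t h; const void *v; };

    int main(void)
    {
        int x = 42;

        union u a;          /* not initialized: bytes outside 'v' keep stack garbage */
        a.v = &x;           /* a 32-bit pointer fills only the HIGH half of 'h' on big-endian */
        printf("%016llx\n", (unsigned long long)a.h);

        union u b = { 0 };  /* zero-initialized: the unused low half reads back as 0 */
        b.v = &x;
        printf("%016llx\n", (unsigned long long)b.h);
        return 0;
    }

On 64-bit x86 the pointer covers the whole union, so nothing is left
uninitialized; on a 32-bit big-endian machine the low half of the handle is
whatever was on the stack, which matches the "lower half was garbage"
observation above.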

-Regards
Nikhil

On Tue, May 3, 2016 at 7:04 PM, Nikhil Utane wrote:

> Thanks for your response Dejan.
>
> I do not know yet whether this has anything to do with endianness.
> FWIW, there could be something quirky with the system so keeping all
> options open. :)
>
> I added some debug prints to understand what's happening under the hood.
>
> *Success case: (on x86 machine): *
> [TOTEM ] entering OPERATIONAL state.
> [TOTEM ] A new membership (10.206.1.7:137220) was formed. Members joined:
> 181272839
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
> my_high_delivered=0
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
> my_high_delivered=0
> [TOTEM ] Delivering 0 to 1
> [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
> [SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=1
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=2,
> my_high_delivered=1
> [TOTEM ] Delivering 1 to 2
> [TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
> [SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=0
> [SYNC  ] Nikhil: Entering sync_barrier_handler
> [SYNC  ] Committing synchronization for corosync configuration map access
> .
> [TOTEM ] Delivering 2 to 4
> [TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
> [TOTEM ] Delivering MCAST message with seq 4 to pending delivery queue
> [CPG   ] comparing: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
> [CPG   ] chosen downlist: sender r(0) ip(10.206.1.7) ; members(old:0
> left:0)
> [SYNC  ] Committing synchronization for corosync cluster closed process
> group service v1.01
> *[MAIN  ] Completed service synchronization, ready to provide service.*
>
>
> *Failure case: (on ppc)*:
> [TOTEM ] entering OPERATIONAL state.
> [TOTEM ] A new membership (10.207.24.101:16) was formed. Members joined:
> 181344357
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
> my_high_delivered=0
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
> my_high_delivered=0
> [TOTEM ] Delivering 0 to 1
> [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
> [SYNC  ] Nikhil: Inside sync_deliver_fn header->id=1
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
> my_high_delivered=1
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
> my_high_delivered=1
> Above message repeats continuously.
>
> So it appears that in failure case I do not receive messages with sequence
> number 2-4.
> If somebody can throw some ideas that'll help a lot.
>
> -Thanks
> Nikhil
>
> On Tue, May 3, 2016 at 5:26 PM, Dejan Muhamedagic wrote:
>
>> Hi,
>>
>> On Mon, May 02, 2016 at 08:54:09AM +0200, Jan Friesse wrote:
>> > >As your hardware is probably capable of running ppcle and if you have
>> an
>> > >environment
>> > >at hand without too much effort it might pay off to try that.
>> > >There are of course distributions out there support corosync on
>> > >big-endian architectures
>> > >but I don't know if there is an automatized regression for corosync on
>> > >big-endian that
>> > >would catch big-endian-issues right away with something as current as
>> > >your 2.3.5.
>> >
>> > No we are not testing big-endian.
>> >
>> > So totally agree with Klaus. Give a try to ppcle. Also make sure all
>> > nodes are little-endian. Corosync should work in mixed BE/LE
>> > environment but because it's not tested, it may not work (and it's a
>> > bug, so if ppcle works I will try to fix BE).
>>
>> I tested a cluster consisting of big endian/little endian nodes
>> (s390 and x86-64), but that was a while ago. IIRC, all relevant
>> bugs in corosync got fixed at that time. Don't know what is the
>> situation with the latest version.
>>
>> Thanks,
>>
>> Dejan
>>
>> > Regards,
>> >   Honza
>> >
>> > >
>> > >Regards,
>> > >Klaus
>> > >
>> > >On 05/02/2016 06:44 AM, Nikhil Utane wrote:
>> > >>Re-sending as I don't see my post on the thread.
>> > >>
>> > >>On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane wrote:
>> > >>
>> > >> Hi,
>> > >>
>> > >> Looking for some guidance here as we are completely blocked
>> > >> otherwise :(.
>> > >>
>> > >> -Regards
>> > >> Nikhil
>> > >>
>> > >> On Fri, Apr 29, 2016 at 6:11 PM, Sriram wrote:
>> > >>
>> > >> Corrected the