Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-17 Thread Nikhil Utane
Hi Honza,

Just checking if you have the official patch available for this issue.

As far as I am concerned, barring a couple of issues, everything seems to
be working fine on our big-endian system. Much relieved. :)

-Thanks
Nikhil


On Thu, May 5, 2016 at 3:24 PM, Nikhil Utane 
wrote:

> It worked for me. :)
> I'll wait for your formal patch but until then I am able to proceed
> further. (Don't know if I'll run into something else)
>
> However now encountering issue in pacemaker.
>
> May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:
>  The cib process (15224) can no longer be respawned, shutting the cluster
> down.
> May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:
>  The stonith-ng process (15225) can no longer be respawned, shutting the
> cluster down.
> May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:
>  The lrmd process (15226) can no longer be respawned, shutting the cluster
> down.
> May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:
>  The crmd process (15229) can no longer be respawned, shutting the cluster
> down.
> May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:
>  The pengine process (15228) can no longer be respawned, shutting the
> cluster down.
> May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:
>  The attrd process (15227) can no longer be respawned, shutting the cluster
> down.
>
> Looking into it.
>
> -Thanks
> Nikhil
>
> On Thu, May 5, 2016 at 2:58 PM, Jan Friesse  wrote:
>
>> Nikhil
>>
>> Found the root-cause.
>>> In file schedwrk.c, the function handle2void() uses a union which was not
>>> initialized.
>>> Because of that the handle value was computed incorrectly (lower half was
>>> garbage).
>>>
>>>   56 static hdb_handle_t
>>>   57 void2handle (const void *v) { union u u={}; u.v = v; return u.h; }
>>>   58 static const void *
>>>   59 handle2void (hdb_handle_t h) { union u u={}; u.h = h; return u.v; }
>>>
>>> After initializing (as highlighted), the corosync initialization seems to
>>> be going through fine. Will check other things.
>>>
>>
>> Your patch is incorrect and actually doesn't work. As I said (when
>> pointing you to schedwrk.c), I will send you a proper patch, but fixing
>> that issue correctly is not easy.
>>
>> Regards,
>>   Honza
>>
>>
>>> -Regards
>>> Nikhil
>>>
>>> On Tue, May 3, 2016 at 7:04 PM, Nikhil Utane <
>>> nikhil.subscri...@gmail.com>
>>> wrote:
>>>
>>> Thanks for your response Dejan.

 I do not know yet whether this has anything to do with endianness.
 FWIW, there could be something quirky with the system, so I'm keeping all
 options open. :)

 I added some debug prints to understand what's happening under the hood.

 *Success case: (on x86 machine): *
 [TOTEM ] entering OPERATIONAL state.
 [TOTEM ] A new membership (10.206.1.7:137220) was formed. Members
 joined:
 181272839
 [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
 my_high_delivered=0
 [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
 my_high_delivered=0
 [TOTEM ] Delivering 0 to 1
 [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
 [SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=1
 [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=2,
 my_high_delivered=1
 [TOTEM ] Delivering 1 to 2
 [TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
 [SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=0
 [SYNC  ] Nikhil: Entering sync_barrier_handler
 [SYNC  ] Committing synchronization for corosync configuration map
 access
 .
 [TOTEM ] Delivering 2 to 4
 [TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
 [TOTEM ] Delivering MCAST message with seq 4 to pending delivery queue
 [CPG   ] comparing: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
 [CPG   ] chosen downlist: sender r(0) ip(10.206.1.7) ; members(old:0
 left:0)
 [SYNC  ] Committing synchronization for corosync cluster closed process
 group service v1.01
 *[MAIN  ] Completed service synchronization, ready to provide service.*


 *Failure case: (on ppc)*:

 [TOTEM ] entering OPERATIONAL state.
 [TOTEM ] A new membership (10.207.24.101:16) was formed. Members
 joined:
 181344357
 [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
 my_high_delivered=0
 [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
 my_high_delivered=0
 [TOTEM ] Delivering 0 to 1
 [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
 [SYNC  ] Nikhil: Inside sync_deliver_fn header->id=1
 [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
 my_high_delivered=1
 [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
 my_high_delivered=1
 Above message repeats continuously.

 So it appears that in the failure case I do not receive messages with
 sequence numbers 2-4.

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-05 Thread Nikhil Utane
It worked for me. :)
I'll wait for your formal patch but until then I am able to proceed
further. (Don't know if I'll run into something else)

However now encountering issue in pacemaker.

May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:  The
cib process (15224) can no longer be respawned, shutting the cluster down.
May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:  The
stonith-ng process (15225) can no longer be respawned, shutting the cluster
down.
May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:  The
lrmd process (15226) can no longer be respawned, shutting the cluster down.
May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:  The
crmd process (15229) can no longer be respawned, shutting the cluster down.
May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:  The
pengine process (15228) can no longer be respawned, shutting the cluster
down.
May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:  The
attrd process (15227) can no longer be respawned, shutting the cluster down.

Looking into it.

-Thanks
Nikhil

On Thu, May 5, 2016 at 2:58 PM, Jan Friesse  wrote:

> Nikhil
>
> Found the root-cause.
>> In file schedwrk.c, the function handle2void() uses a union which was not
>> initialized.
>> Because of that the handle value was computed incorrectly (lower half was
>> garbage).
>>
>>   56 static hdb_handle_t
>>   57 void2handle (const void *v) { union u u={}; u.v = v; return u.h; }
>>   58 static const void *
>>   59 handle2void (hdb_handle_t h) { union u u={}; u.h = h; return u.v; }
>>
>> After initializing (as highlighted), the corosync initialization seems to
>> be going through fine. Will check other things.
>>
>
> Your patch is incorrect and actually doesn't work. As I said (when
> pointing you to schedwrk.c), I will send you a proper patch, but fixing
> that issue correctly is not easy.
>
> Regards,
>   Honza
>
>
>> -Regards
>> Nikhil
>>
>> On Tue, May 3, 2016 at 7:04 PM, Nikhil Utane > >
>> wrote:
>>
>> Thanks for your response Dejan.
>>>
>>> I do not know yet whether this has anything to do with endianness.
>>> FWIW, there could be something quirky with the system, so I'm keeping all
>>> options open. :)
>>>
>>> I added some debug prints to understand what's happening under the hood.
>>>
>>> *Success case: (on x86 machine): *
>>> [TOTEM ] entering OPERATIONAL state.
>>> [TOTEM ] A new membership (10.206.1.7:137220) was formed. Members
>>> joined:
>>> 181272839
>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
>>> my_high_delivered=0
>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
>>> my_high_delivered=0
>>> [TOTEM ] Delivering 0 to 1
>>> [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
>>> [SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=1
>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=2,
>>> my_high_delivered=1
>>> [TOTEM ] Delivering 1 to 2
>>> [TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
>>> [SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=0
>>> [SYNC  ] Nikhil: Entering sync_barrier_handler
>>> [SYNC  ] Committing synchronization for corosync configuration map access
>>> .
>>> [TOTEM ] Delivering 2 to 4
>>> [TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
>>> [TOTEM ] Delivering MCAST message with seq 4 to pending delivery queue
>>> [CPG   ] comparing: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
>>> [CPG   ] chosen downlist: sender r(0) ip(10.206.1.7) ; members(old:0
>>> left:0)
>>> [SYNC  ] Committing synchronization for corosync cluster closed process
>>> group service v1.01
>>> *[MAIN  ] Completed service synchronization, ready to provide service.*
>>>
>>>
>>> *Failure case: (on ppc)*:
>>>
>>> [TOTEM ] entering OPERATIONAL state.
>>> [TOTEM ] A new membership (10.207.24.101:16) was formed. Members joined:
>>> 181344357
>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
>>> my_high_delivered=0
>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
>>> my_high_delivered=0
>>> [TOTEM ] Delivering 0 to 1
>>> [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
>>> [SYNC  ] Nikhil: Inside sync_deliver_fn header->id=1
>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
>>> my_high_delivered=1
>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
>>> my_high_delivered=1
>>> Above message repeats continuously.
>>>
>>> So it appears that in the failure case I do not receive messages with
>>> sequence numbers 2-4.
>>> If somebody can throw some ideas my way, that'll help a lot.
>>>
>>> -Thanks
>>> Nikhil
>>>
>>> On Tue, May 3, 2016 at 5:26 PM, Dejan Muhamedagic 
>>> wrote:
>>>
>>> Hi,

 On Mon, May 02, 2016 at 08:54:09AM +0200, Jan Friesse wrote:

> As your hardware is probably capable of running ppcle, and if you have an
> environment at hand without too much effort, it might pay off to try that.

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-05 Thread Jan Friesse

Nikhil


Found the root-cause.
In file schedwrk.c, the function handle2void() uses a union which was not
initialized.
Because of that the handle value was computed incorrectly (lower half was
garbage).

  56 static hdb_handle_t
  57 void2handle (const void *v) { union u u={}; u.v = v; return u.h; }
  58 static const void *
  59 handle2void (hdb_handle_t h) { union u u={}; u.h = h; return u.v; }

After initializing (as highlighted), the corosync initialization seems to
be going through fine. Will check other things.


Your patch is incorrect and actually doesn't work. As I said (when
pointing you to schedwrk.c), I will send you a proper patch, but fixing
that issue correctly is not easy.
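For illustration only, one conventional way to make such a round trip
byte-order independent is an integer cast instead of type punning. This is
a hedged sketch, not necessarily the fix that will actually be shipped:

    static hdb_handle_t
    void2handle (const void *v) { return (hdb_handle_t)(uintptr_t)v; }
    static const void *
    handle2void (hdb_handle_t h) { return (const void *)(uintptr_t)h; }

The cast widens the pointer by value, so the pointer always lands in the
low-order bits of the handle on both BE and LE; the union instead overlays
raw bytes, whose position within the 64-bit handle depends on endianness.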


Regards,
  Honza



-Regards
Nikhil

On Tue, May 3, 2016 at 7:04 PM, Nikhil Utane 
wrote:


Thanks for your response Dejan.

I do not know yet whether this has anything to do with endianness.
FWIW, there could be something quirky with the system, so I'm keeping all
options open. :)

I added some debug prints to understand what's happening under the hood.

*Success case: (on x86 machine): *
[TOTEM ] entering OPERATIONAL state.
[TOTEM ] A new membership (10.206.1.7:137220) was formed. Members joined:
181272839
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
my_high_delivered=0
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=0
[TOTEM ] Delivering 0 to 1
[TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
[SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=2,
my_high_delivered=1
[TOTEM ] Delivering 1 to 2
[TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
[SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=0
[SYNC  ] Nikhil: Entering sync_barrier_handler
[SYNC  ] Committing synchronization for corosync configuration map access
.
[TOTEM ] Delivering 2 to 4
[TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
[TOTEM ] Delivering MCAST message with seq 4 to pending delivery queue
[CPG   ] comparing: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
[CPG   ] chosen downlist: sender r(0) ip(10.206.1.7) ; members(old:0
left:0)
[SYNC  ] Committing synchronization for corosync cluster closed process
group service v1.01
*[MAIN  ] Completed service synchronization, ready to provide service.*


*Failure case: (on ppc)*:
[TOTEM ] entering OPERATIONAL state.
[TOTEM ] A new membership (10.207.24.101:16) was formed. Members joined:
181344357
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
my_high_delivered=0
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=0
[TOTEM ] Delivering 0 to 1
[TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
[SYNC  ] Nikhil: Inside sync_deliver_fn header->id=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=1
Above message repeats continuously.

So it appears that in the failure case I do not receive messages with
sequence numbers 2-4.
If somebody can throw some ideas my way, that'll help a lot.

-Thanks
Nikhil

On Tue, May 3, 2016 at 5:26 PM, Dejan Muhamedagic 
wrote:


Hi,

On Mon, May 02, 2016 at 08:54:09AM +0200, Jan Friesse wrote:

As your hardware is probably capable of running ppcle, and if you have an
environment at hand without too much effort, it might pay off to try that.
There are of course distributions out there supporting corosync on
big-endian architectures, but I don't know if there is an automated
regression test for corosync on big-endian that would catch big-endian
issues right away with something as current as your 2.3.5.


No, we are not testing big-endian.

So totally agree with Klaus. Give a try to ppcle. Also make sure all
nodes are little-endian. Corosync should work in a mixed BE/LE
environment, but because it's not tested, it may not work (and it's a
bug, so if ppcle works I will try to fix BE).


I tested a cluster consisting of big endian/little endian nodes
(s390 and x86-64), but that was a while ago. IIRC, all relevant
bugs in corosync got fixed at that time. Don't know what the
situation is with the latest version.

Thanks,

Dejan


Regards,
   Honza



Regards,
Klaus

On 05/02/2016 06:44 AM, Nikhil Utane wrote:

Re-sending as I don't see my post on the thread.

On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
mailto:nikhil.subscri...@gmail.com>>

wrote:


 Hi,

 Looking for some guidance here as we are completely blocked
 otherwise :(.

 -Regards
 Nikhil

 On Fri, Apr 29, 2016 at 6:11 PM, Sriram mailto:sriram...@gmail.com>> wrote:

 Corrected the subject.

 We went ahead and captured corosync debug logs for our ppc board.
 After log analysis and comparison with the successful logs (from the
 x86 machine), we didn't find *"[ MAIN  ] Completed service
 synchronization, ready to provide service.*" in the ppc logs.
 

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-05 Thread Nikhil Utane
Found the root-cause.
In file schedwrk.c, the function handle2void() uses a union which was not
initialized.
Because of that the handle value was computed incorrectly (lower half was
garbage).

 56 static hdb_handle_t
 57 void2handle (const void *v) { union u u={}; u.v = v; return u.h; }
 58 static const void *
 59 handle2void (hdb_handle_t h) { union u u={}; u.h = h; return u.v; }

After initializing (as highlighted), the corosync initialization seems to
be going through fine. Will check other things.
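A minimal standalone sketch of the overlay problem (assuming a 32-bit
pointer and a 64-bit hdb_handle_t, as on this target; names mirror the
schedwrk.c snippet above):

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t hdb_handle_t;

    union u { const void *v; hdb_handle_t h; };

    int main(void)
    {
        int x = 42;
        union u u = { 0 };  /* zero-initialized, as in the patch above */
        u.v = &x;           /* writes only sizeof(void *) bytes        */
        /* Both members start at the same address.  On a 32-bit
         * big-endian target the pointer lands in the HIGH half of the
         * 64-bit handle (hence "lower half was garbage" before the
         * initialization); on little-endian it lands in the LOW half.
         * Zero-initializing removes the garbage, but not the
         * architecture-dependent placement. */
        printf("ptr=%p handle=0x%016llx\n",
               (void *)&x, (unsigned long long)u.h);
        return 0;
    }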

-Regards
Nikhil

On Tue, May 3, 2016 at 7:04 PM, Nikhil Utane 
wrote:

> Thanks for your response Dejan.
>
> I do not know yet whether this has anything to do with endianness.
> FWIW, there could be something quirky with the system, so I'm keeping all
> options open. :)
>
> I added some debug prints to understand what's happening under the hood.
>
> *Success case: (on x86 machine): *
> [TOTEM ] entering OPERATIONAL state.
> [TOTEM ] A new membership (10.206.1.7:137220) was formed. Members joined:
> 181272839
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
> my_high_delivered=0
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
> my_high_delivered=0
> [TOTEM ] Delivering 0 to 1
> [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
> [SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=1
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=2,
> my_high_delivered=1
> [TOTEM ] Delivering 1 to 2
> [TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
> [SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=0
> [SYNC  ] Nikhil: Entering sync_barrier_handler
> [SYNC  ] Committing synchronization for corosync configuration map access
> .
> [TOTEM ] Delivering 2 to 4
> [TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
> [TOTEM ] Delivering MCAST message with seq 4 to pending delivery queue
> [CPG   ] comparing: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
> [CPG   ] chosen downlist: sender r(0) ip(10.206.1.7) ; members(old:0
> left:0)
> [SYNC  ] Committing synchronization for corosync cluster closed process
> group service v1.01
> *[MAIN  ] Completed service synchronization, ready to provide service.*
>
>
> *Failure case: (on ppc)*:
> [TOTEM ] entering OPERATIONAL state.
> [TOTEM ] A new membership (10.207.24.101:16) was formed. Members joined:
> 181344357
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
> my_high_delivered=0
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
> my_high_delivered=0
> [TOTEM ] Delivering 0 to 1
> [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
> [SYNC  ] Nikhil: Inside sync_deliver_fn header->id=1
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
> my_high_delivered=1
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
> my_high_delivered=1
> Above message repeats continuously.
>
> So it appears that in the failure case I do not receive messages with
> sequence numbers 2-4.
> If somebody can throw some ideas my way, that'll help a lot.
>
> -Thanks
> Nikhil
>
> On Tue, May 3, 2016 at 5:26 PM, Dejan Muhamedagic 
> wrote:
>
>> Hi,
>>
>> On Mon, May 02, 2016 at 08:54:09AM +0200, Jan Friesse wrote:
>> > >As your hardware is probably capable of running ppcle, and if you have an
>> > >environment at hand without too much effort, it might pay off to try that.
>> > >There are of course distributions out there supporting corosync on
>> > >big-endian architectures, but I don't know if there is an automated
>> > >regression test for corosync on big-endian that would catch big-endian
>> > >issues right away with something as current as your 2.3.5.
>> >
>> > No, we are not testing big-endian.
>> >
>> > So totally agree with Klaus. Give a try to ppcle. Also make sure all
>> > nodes are little-endian. Corosync should work in a mixed BE/LE
>> > environment, but because it's not tested, it may not work (and it's a
>> > bug, so if ppcle works I will try to fix BE).
>>
>> I tested a cluster consisting of big endian/little endian nodes
>> (s390 and x86-64), but that was a while ago. IIRC, all relevant
>> bugs in corosync got fixed at that time. Don't know what the
>> situation is with the latest version.
>>
>> Thanks,
>>
>> Dejan
>>
>> > Regards,
>> >   Honza
>> >
>> > >
>> > >Regards,
>> > >Klaus
>> > >
>> > >On 05/02/2016 06:44 AM, Nikhil Utane wrote:
>> > >>Re-sending as I don't see my post on the thread.
>> > >>
>> > >>On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
>> > >>mailto:nikhil.subscri...@gmail.com>>
>> wrote:
>> > >>
>> > >> Hi,
>> > >>
>> > >> Looking for some guidance here as we are completely blocked
>> > >> otherwise :(.
>> > >>
>> > >> -Regards
>> > >> Nikhil
>> > >>
>> > >> On Fri, Apr 29, 2016 at 6:11 PM, Sriram > > >> > wrote:
>> > >>
>> > >> Corrected the subject.
>> > >>
>> > >> We went ahead and captured corosync debug logs for our ppc
>> board.
>> > >>

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-03 Thread Nikhil Utane
Thanks for your response Dejan.

I do not know yet whether this has anything to do with endianness.
FWIW, there could be something quirky with the system, so I'm keeping all
options open. :)

I added some debug prints to understand what's happening under the hood.

*Success case: (on x86 machine): *
[TOTEM ] entering OPERATIONAL state.
[TOTEM ] A new membership (10.206.1.7:137220) was formed. Members joined:
181272839
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
my_high_delivered=0
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=0
[TOTEM ] Delivering 0 to 1
[TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
[SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=2,
my_high_delivered=1
[TOTEM ] Delivering 1 to 2
[TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
[SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=0
[SYNC  ] Nikhil: Entering sync_barrier_handler
[SYNC  ] Committing synchronization for corosync configuration map access
.
[TOTEM ] Delivering 2 to 4
[TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
[TOTEM ] Delivering MCAST message with seq 4 to pending delivery queue
[CPG   ] comparing: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
[CPG   ] chosen downlist: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
[SYNC  ] Committing synchronization for corosync cluster closed process
group service v1.01
*[MAIN  ] Completed service synchronization, ready to provide service.*


*Failure case: (on ppc)*:
[TOTEM ] entering OPERATIONAL state.
[TOTEM ] A new membership (10.207.24.101:16) was formed. Members joined:
181344357
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
my_high_delivered=0
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=0
[TOTEM ] Delivering 0 to 1
[TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
[SYNC  ] Nikhil: Inside sync_deliver_fn header->id=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=1
Above message repeats continuously.

So it appears that in the failure case I do not receive messages with
sequence numbers 2-4.
If somebody can throw some ideas my way, that'll help a lot.
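For reference, in-order delivery here works roughly like the hedged
sketch below (illustrative names, not the actual totemsrp code): the
loop stalls exactly as in the ppc log when the message following
my_high_delivered never becomes available in the sort queue.

    /* deliver contiguous messages my_high_delivered+1 .. end_point */
    while (my_high_delivered < end_point) {
            msg = sort_queue_get(&regular_sort_queue,
                                 my_high_delivered + 1);
            if (msg == NULL)
                    break;   /* gap: wait for a retransmission */
            deliver_to_services(msg);
            my_high_delivered += 1;
    }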

-Thanks
Nikhil

On Tue, May 3, 2016 at 5:26 PM, Dejan Muhamedagic 
wrote:

> Hi,
>
> On Mon, May 02, 2016 at 08:54:09AM +0200, Jan Friesse wrote:
> > >As your hardware is probably capable of running ppcle, and if you have an
> > >environment at hand without too much effort, it might pay off to try that.
> > >There are of course distributions out there supporting corosync on
> > >big-endian architectures, but I don't know if there is an automated
> > >regression test for corosync on big-endian that would catch big-endian
> > >issues right away with something as current as your 2.3.5.
> >
> > No, we are not testing big-endian.
> >
> > So totally agree with Klaus. Give a try to ppcle. Also make sure all
> > nodes are little-endian. Corosync should work in a mixed BE/LE
> > environment, but because it's not tested, it may not work (and it's a
> > bug, so if ppcle works I will try to fix BE).
>
> I tested a cluster consisting of big endian/little endian nodes
> (s390 and x86-64), but that was a while ago. IIRC, all relevant
> bugs in corosync got fixed at that time. Don't know what the
> situation is with the latest version.
>
> Thanks,
>
> Dejan
>
> > Regards,
> >   Honza
> >
> > >
> > >Regards,
> > >Klaus
> > >
> > >On 05/02/2016 06:44 AM, Nikhil Utane wrote:
> > >>Re-sending as I don't see my post on the thread.
> > >>
> > >>On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
> > >>mailto:nikhil.subscri...@gmail.com>>
> wrote:
> > >>
> > >> Hi,
> > >>
> > >> Looking for some guidance here as we are completely blocked
> > >> otherwise :(.
> > >>
> > >> -Regards
> > >> Nikhil
> > >>
> > >> On Fri, Apr 29, 2016 at 6:11 PM, Sriram  > >> > wrote:
> > >>
> > >> Corrected the subject.
> > >>
> > >> We went ahead and captured corosync debug logs for our ppc
> board.
> > >> After log analysis and comparison with the successful logs (from the
> > >> x86 machine), we didn't find *"[ MAIN  ] Completed service
> > >> synchronization, ready to provide service.*" in the ppc logs.
> > >> So, it looks like corosync is not in a position to accept
> > >> connections from Pacemaker.
> > >> I even tried with a new corosync.conf, with no success.
> > >>
> > >> Any hints on this issue would be really helpful.
> > >>
> > >> Attaching ppc_notworking.log, x86_working.log, corosync.conf.
> > >>
> > >> Regards,
> > >> Sriram
> > >>
> > >>
> > >>
> > >> On Fri, Apr 29, 2016 at 2:44 PM, Sriram  > >> > wrote:
> > >>
> 

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-03 Thread Dejan Muhamedagic
Hi,

On Mon, May 02, 2016 at 08:54:09AM +0200, Jan Friesse wrote:
> >As your hardware is probably capable of running ppcle, and if you have an
> >environment at hand without too much effort, it might pay off to try that.
> >There are of course distributions out there supporting corosync on
> >big-endian architectures, but I don't know if there is an automated
> >regression test for corosync on big-endian that would catch big-endian
> >issues right away with something as current as your 2.3.5.
> 
> No, we are not testing big-endian.
> 
> So totally agree with Klaus. Give a try to ppcle. Also make sure all
> nodes are little-endian. Corosync should work in a mixed BE/LE
> environment, but because it's not tested, it may not work (and it's a
> bug, so if ppcle works I will try to fix BE).

I tested a cluster consisting of big endian/little endian nodes
(s390 and x86-64), but that was a while ago. IIRC, all relevant
bugs in corosync got fixed at that time. Don't know what the
situation is with the latest version.

Thanks,

Dejan

> Regards,
>   Honza
> 
> >
> >Regards,
> >Klaus
> >
> >On 05/02/2016 06:44 AM, Nikhil Utane wrote:
> >>Re-sending as I don't see my post on the thread.
> >>
> >>On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
> >>mailto:nikhil.subscri...@gmail.com>> wrote:
> >>
> >> Hi,
> >>
> >> Looking for some guidance here as we are completely blocked
> >> otherwise :(.
> >>
> >> -Regards
> >> Nikhil
> >>
> >> On Fri, Apr 29, 2016 at 6:11 PM, Sriram  >> > wrote:
> >>
> >> Corrected the subject.
> >>
> >> We went ahead and captured corosync debug logs for our ppc board.
> >> After log analysis and comparison with the successful logs (from the
> >> x86 machine), we didn't find *"[ MAIN  ] Completed service
> >> synchronization, ready to provide service.*" in the ppc logs.
> >> So, it looks like corosync is not in a position to accept
> >> connections from Pacemaker.
> >> I even tried with a new corosync.conf, with no success.
> >>
> >> Any hints on this issue would be really helpful.
> >>
> >> Attaching ppc_notworking.log, x86_working.log, corosync.conf.
> >>
> >> Regards,
> >> Sriram
> >>
> >>
> >>
> >> On Fri, Apr 29, 2016 at 2:44 PM, Sriram  >> > wrote:
> >>
> >> Hi,
> >>
> >> I went ahead and made some changes in the file system (like bringing
> >> in /etc/init.d/corosync, /etc/init.d/pacemaker and /etc/sysconfig).
> >> After that I was able to run "pcs cluster start".
> >> But it failed with the following error
> >>  # pcs cluster start
> >> Starting Cluster...
> >> Starting Pacemaker Cluster Manager[FAILED]
> >> Error: unable to start pacemaker
> >>
> >> And in the /var/log/pacemaker.log, I saw these errors
> >> pacemakerd: info: mcp_read_config:  cmap connection
> >> setup failed: CS_ERR_TRY_AGAIN.  Retrying in 4s
> >> Apr 29 08:53:47 [15863] node_cu pacemakerd: info:
> >> mcp_read_config:  cmap connection setup failed:
> >> CS_ERR_TRY_AGAIN.  Retrying in 5s
> >> Apr 29 08:53:52 [15863] node_cu pacemakerd:  warning:
> >> mcp_read_config:  Could not connect to Cluster
> >> Configuration Database API, error 6
> >> Apr 29 08:53:52 [15863] node_cu pacemakerd:   notice:
> >> main: Could not obtain corosync config data, exiting
> >> Apr 29 08:53:52 [15863] node_cu pacemakerd: info:
> >> crm_xml_cleanup:  Cleaning up memory from libxml2
> >>
> >>
> >> And in the /var/log/Debuglog, I saw these errors coming
> >> from corosync
> >> 20160429 085347.487050  airv_cu
> >> daemon.warn corosync[12857]:   [QB] Denied connection,
> >> is not ready (12857-15863-14)
> >> 20160429 085347.487067  airv_cu
> >> daemon.info  corosync[12857]:   [QB
> >> ] Denied connection, is not ready (12857-15863-14)
> >>
> >>
> >> I browsed the code of libqb to find that it is failing in
> >>
> >> 
> >> https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c
> >>
> >> Line 600 :
> >> handle_new_connection function
> >>
> >> Line 637:
> >> if (auth_result == 0 &&
> >> c->service->serv_fns.connection_accept) {
> >> res = c->service->serv_fns.connection_accept(c,
> >>  c->euid, c->egid);
> >> }
> >> if (res != 0) {
> >> goto send_response;
> >> }
> >>
> >> Any hints on this issue would be really helpful for me to go ahead.

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-02 Thread Nikhil Utane
It is a Freescale e6500 processor. Nobody here has tried running it in LE
mode, so it is going to take some doing.
We are going to add some debug logs to figure out where corosync
initialization gets stalled.
If you have any suggestions, please let us know.

-Thanks
Nikhil
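Corosync's own debug output can be raised through the logging section of
corosync.conf (standard options; the logfile path is illustrative):

    logging {
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        debug: on
        timestamp: on
    }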


On Mon, May 2, 2016 at 1:00 PM, Nikhil Utane 
wrote:

> So what I understand you are saying is: if the HW is bi-endian, then
> enable LE on PPC. Is that right?
> Need to check on that.
>
> On Mon, May 2, 2016 at 12:49 PM, Nikhil Utane  > wrote:
>
>> Sorry about my ignorance, but could you please elaborate on what you mean
>> by "try to ppcle"?
>>
>> Our target platform is ppc, so it is BE. We have to get it running only
>> on that.
>> How do we know this is an LE/BE issue and nothing else?
>>
>> -Thanks
>> Nikhil
>>
>>
>> On Mon, May 2, 2016 at 12:24 PM, Jan Friesse  wrote:
>>
>>> As your hardware is probably capable of running ppcle, and if you have an
>>> environment at hand without too much effort, it might pay off to try that.
>>> There are of course distributions out there supporting corosync on
>>> big-endian architectures, but I don't know if there is an automated
>>> regression test for corosync on big-endian that would catch big-endian
>>> issues right away with something as current as your 2.3.5.
>>>
>>> No, we are not testing big-endian.
>>>
>>> So totally agree with Klaus. Give a try to ppcle. Also make sure all
>>> nodes are little-endian. Corosync should work in a mixed BE/LE
>>> environment, but because it's not tested, it may not work (and it's a
>>> bug, so if ppcle works I will try to fix BE).
>>>
>>> Regards,
>>>   Honza
>>>
>>>
>>>
 Regards,
 Klaus

 On 05/02/2016 06:44 AM, Nikhil Utane wrote:

> Re-sending as I don't see my post on the thread.
>
> On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
> mailto:nikhil.subscri...@gmail.com>>
> wrote:
>
>  Hi,
>
>  Looking for some guidance here as we are completely blocked
>  otherwise :(.
>
>  -Regards
>  Nikhil
>
>  On Fri, Apr 29, 2016 at 6:11 PM, Sriram   > wrote:
>
>  Corrected the subject.
>
>  We went ahead and captured corosync debug logs for our ppc board.
>  After log analysis and comparison with the successful logs (from the
>  x86 machine), we didn't find *"[ MAIN  ] Completed service
>  synchronization, ready to provide service.*" in the ppc logs.
>  So, it looks like corosync is not in a position to accept
>  connections from Pacemaker.
>  I even tried with a new corosync.conf, with no success.
>
>  Any hints on this issue would be really helpful.
>
>  Attaching ppc_notworking.log, x86_working.log, corosync.conf.
>
>  Regards,
>  Sriram
>
>
>
>  On Fri, Apr 29, 2016 at 2:44 PM, Sriram   > wrote:
>
>  Hi,
>
 I went ahead and made some changes in the file system (like bringing
 in /etc/init.d/corosync, /etc/init.d/pacemaker and /etc/sysconfig).
 After that I was able to run "pcs cluster start".
>  But it failed with the following error
>   # pcs cluster start
>  Starting Cluster...
>  Starting Pacemaker Cluster Manager[FAILED]
>  Error: unable to start pacemaker
>
>  And in the /var/log/pacemaker.log, I saw these errors
>  pacemakerd: info: mcp_read_config:  cmap connection
>  setup failed: CS_ERR_TRY_AGAIN.  Retrying in 4s
>  Apr 29 08:53:47 [15863] node_cu pacemakerd: info:
>  mcp_read_config:  cmap connection setup failed:
>  CS_ERR_TRY_AGAIN.  Retrying in 5s
>  Apr 29 08:53:52 [15863] node_cu pacemakerd:  warning:
>  mcp_read_config:  Could not connect to Cluster
>  Configuration Database API, error 6
>  Apr 29 08:53:52 [15863] node_cu pacemakerd:   notice:
>  main: Could not obtain corosync config data, exiting
>  Apr 29 08:53:52 [15863] node_cu pacemakerd: info:
>  crm_xml_cleanup:  Cleaning up memory from libxml2
>
>
>  And in the /var/log/Debuglog, I saw these errors coming
>  from corosync
>  20160429 085347.487050  airv_cu
>  daemon.warn corosync[12857]:   [QB] Denied connection,
>  is not ready (12857-15863-14)
>  20160429 085347.487067  airv_cu
>  daemon.info  corosync[12857]:   [QB
>>  ] Denied connection, is not ready (12857-15863-14)

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-02 Thread Nikhil Utane
So what I understand you are saying is: if the HW is bi-endian, then
enable LE on PPC. Is that right?
Need to check on that.

On Mon, May 2, 2016 at 12:49 PM, Nikhil Utane 
wrote:

> Sorry about my ignorance, but could you please elaborate on what you mean
> by "try to ppcle"?
>
> Our target platform is ppc, so it is BE. We have to get it running only
> on that.
> How do we know this is an LE/BE issue and nothing else?
>
> -Thanks
> Nikhil
>
>
> On Mon, May 2, 2016 at 12:24 PM, Jan Friesse  wrote:
>
>>> As your hardware is probably capable of running ppcle, and if you have an
>>> environment at hand without too much effort, it might pay off to try that.
>>> There are of course distributions out there supporting corosync on
>>> big-endian architectures, but I don't know if there is an automated
>>> regression test for corosync on big-endian that would catch big-endian
>>> issues right away with something as current as your 2.3.5.
>>>
>>
>> No, we are not testing big-endian.
>>
>> So totally agree with Klaus. Give a try to ppcle. Also make sure all
>> nodes are little-endian. Corosync should work in a mixed BE/LE environment,
>> but because it's not tested, it may not work (and it's a bug, so if ppcle
>> works I will try to fix BE).
>>
>> Regards,
>>   Honza
>>
>>
>>
>>> Regards,
>>> Klaus
>>>
>>> On 05/02/2016 06:44 AM, Nikhil Utane wrote:
>>>
 Re-sending as I don't see my post on the thread.

 On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
 mailto:nikhil.subscri...@gmail.com>>
 wrote:

  Hi,

  Looking for some guidance here as we are completely blocked
  otherwise :(.

  -Regards
  Nikhil

  On Fri, Apr 29, 2016 at 6:11 PM, Sriram >>>  > wrote:

  Corrected the subject.

  We went ahead and captured corosync debug logs for our ppc board.
  After log analysis and comparison with the successful logs (from the
  x86 machine), we didn't find *"[ MAIN  ] Completed service
  synchronization, ready to provide service.*" in the ppc logs.
  So, it looks like corosync is not in a position to accept
  connections from Pacemaker.
  I even tried with a new corosync.conf, with no success.

  Any hints on this issue would be really helpful.

  Attaching ppc_notworking.log, x86_working.log, corosync.conf.

  Regards,
  Sriram



  On Fri, Apr 29, 2016 at 2:44 PM, Sriram >>>  > wrote:

  Hi,

  I went ahead and made some changes in the file system (like bringing
  in /etc/init.d/corosync, /etc/init.d/pacemaker and /etc/sysconfig).
  After that I was able to run "pcs cluster start".
  But it failed with the following error
   # pcs cluster start
  Starting Cluster...
  Starting Pacemaker Cluster Manager[FAILED]
  Error: unable to start pacemaker

  And in the /var/log/pacemaker.log, I saw these errors
  pacemakerd: info: mcp_read_config:  cmap connection
  setup failed: CS_ERR_TRY_AGAIN.  Retrying in 4s
  Apr 29 08:53:47 [15863] node_cu pacemakerd: info:
  mcp_read_config:  cmap connection setup failed:
  CS_ERR_TRY_AGAIN.  Retrying in 5s
  Apr 29 08:53:52 [15863] node_cu pacemakerd:  warning:
  mcp_read_config:  Could not connect to Cluster
  Configuration Database API, error 6
  Apr 29 08:53:52 [15863] node_cu pacemakerd:   notice:
  main: Could not obtain corosync config data, exiting
  Apr 29 08:53:52 [15863] node_cu pacemakerd: info:
  crm_xml_cleanup:  Cleaning up memory from libxml2


  And in the /var/log/Debuglog, I saw these errors coming
  from corosync
  20160429 085347.487050  airv_cu
  daemon.warn corosync[12857]:   [QB] Denied connection,
  is not ready (12857-15863-14)
  20160429 085347.487067  airv_cu
  daemon.info  corosync[12857]:   [QB
  ] Denied connection, is not ready (12857-15863-14)


  I browsed the code of libqb to find that it is failing in


 https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c

  Line 600 :
  handle_new_connection function

  Line 637:
  if (auth_result == 0 &&
  c->service->serv_fns.connection_accept) {
>>>  res = c->service->serv_fns.connection_accept(c, c->euid, c->egid);

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-02 Thread Nikhil Utane
Sorry about my ignorance, but could you please elaborate on what you mean
by "try to ppcle"?

Our target platform is ppc, so it is BE. We have to get it running only on
that.
How do we know this is an LE/BE issue and nothing else?

-Thanks
Nikhil
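One quick way to confirm the target's byte order, independent of the
cluster stack (standard C):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t x = 1;
        /* on little-endian the least significant byte comes first */
        puts(*(const uint8_t *)&x ? "little-endian" : "big-endian");
        return 0;
    }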


On Mon, May 2, 2016 at 12:24 PM, Jan Friesse  wrote:

>> As your hardware is probably capable of running ppcle, and if you have an
>> environment at hand without too much effort, it might pay off to try that.
>> There are of course distributions out there supporting corosync on
>> big-endian architectures, but I don't know if there is an automated
>> regression test for corosync on big-endian that would catch big-endian
>> issues right away with something as current as your 2.3.5.
>>
>
> No, we are not testing big-endian.
>
> So totally agree with Klaus. Give a try to ppcle. Also make sure all nodes
> are little-endian. Corosync should work in a mixed BE/LE environment, but
> because it's not tested, it may not work (and it's a bug, so if ppcle works
> I will try to fix BE).
>
> Regards,
>   Honza
>
>
>
>> Regards,
>> Klaus
>>
>> On 05/02/2016 06:44 AM, Nikhil Utane wrote:
>>
>>> Re-sending as I don't see my post on the thread.
>>>
>>> On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
>>> mailto:nikhil.subscri...@gmail.com>>
>>> wrote:
>>>
>>>  Hi,
>>>
>>>  Looking for some guidance here as we are completely blocked
>>>  otherwise :(.
>>>
>>>  -Regards
>>>  Nikhil
>>>
>>>  On Fri, Apr 29, 2016 at 6:11 PM, Sriram >>  > wrote:
>>>
>>>  Corrected the subject.
>>>
>>>  We went ahead and captured corosync debug logs for our ppc board.
>>>  After log analysis and comparison with the successful logs (from the
>>>  x86 machine), we didn't find *"[ MAIN  ] Completed service
>>>  synchronization, ready to provide service.*" in the ppc logs.
>>>  So, it looks like corosync is not in a position to accept
>>>  connections from Pacemaker.
>>>  I even tried with a new corosync.conf, with no success.
>>>
>>>  Any hints on this issue would be really helpful.
>>>
>>>  Attaching ppc_notworking.log, x86_working.log, corosync.conf.
>>>
>>>  Regards,
>>>  Sriram
>>>
>>>
>>>
>>>  On Fri, Apr 29, 2016 at 2:44 PM, Sriram >>  > wrote:
>>>
>>>  Hi,
>>>
>>>  I went ahead and made some changes in the file system (like bringing
>>>  in /etc/init.d/corosync, /etc/init.d/pacemaker and /etc/sysconfig).
>>>  After that I was able to run "pcs cluster start".
>>>  But it failed with the following error
>>>   # pcs cluster start
>>>  Starting Cluster...
>>>  Starting Pacemaker Cluster Manager[FAILED]
>>>  Error: unable to start pacemaker
>>>
>>>  And in the /var/log/pacemaker.log, I saw these errors
>>>  pacemakerd: info: mcp_read_config:  cmap connection
>>>  setup failed: CS_ERR_TRY_AGAIN.  Retrying in 4s
>>>  Apr 29 08:53:47 [15863] node_cu pacemakerd: info:
>>>  mcp_read_config:  cmap connection setup failed:
>>>  CS_ERR_TRY_AGAIN.  Retrying in 5s
>>>  Apr 29 08:53:52 [15863] node_cu pacemakerd:  warning:
>>>  mcp_read_config:  Could not connect to Cluster
>>>  Configuration Database API, error 6
>>>  Apr 29 08:53:52 [15863] node_cu pacemakerd:   notice:
>>>  main: Could not obtain corosync config data, exiting
>>>  Apr 29 08:53:52 [15863] node_cu pacemakerd: info:
>>>  crm_xml_cleanup:  Cleaning up memory from libxml2
>>>
>>>
>>>  And in the /var/log/Debuglog, I saw these errors coming
>>>  from corosync
>>>  20160429 085347.487050  airv_cu
>>>  daemon.warn corosync[12857]:   [QB] Denied connection,
>>>  is not ready (12857-15863-14)
>>>  20160429 085347.487067  airv_cu
>>>  daemon.info  corosync[12857]:   [QB
>>>  ] Denied connection, is not ready (12857-15863-14)
>>>
>>>
>>>  I browsed the code of libqb to find that it is failing in
>>>
>>>
>>> https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c
>>>
>>>  Line 600 :
>>>  handle_new_connection function
>>>
>>>  Line 637:
>>>  if (auth_result == 0 &&
>>>  c->service->serv_fns.connection_accept) {
>>>  res = c->service->serv_fns.connection_accept(c,
>>>   c->euid, c->egid);
>>>  }
>>>  if (res != 0) {
>>>  goto send_response;
>>>  }
>>>
>>>  Any hints on this issue would be really helpful for me to
>>>  go ahead.
>>>  

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-01 Thread Jan Friesse

As your hardware is probably capable of running ppcle, and if you have an
environment at hand without too much effort, it might pay off to try that.
There are of course distributions out there supporting corosync on
big-endian architectures, but I don't know if there is an automated
regression test for corosync on big-endian that would catch big-endian
issues right away with something as current as your 2.3.5.


No, we are not testing big-endian.

So totally agree with Klaus. Give a try to ppcle. Also make sure all
nodes are little-endian. Corosync should work in a mixed BE/LE environment,
but because it's not tested, it may not work (and it's a bug, so if
ppcle works I will try to fix BE).


Regards,
  Honza



Regards,
Klaus

On 05/02/2016 06:44 AM, Nikhil Utane wrote:

Re-sending as I don't see my post on the thread.

On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
mailto:nikhil.subscri...@gmail.com>> wrote:

 Hi,

 Looking for some guidance here as we are completely blocked
 otherwise :(.

 -Regards
 Nikhil

 On Fri, Apr 29, 2016 at 6:11 PM, Sriram mailto:sriram...@gmail.com>> wrote:

 Corrected the subject.

 We went ahead and captured corosync debug logs for our ppc board.
 After log analysis and comparison with the successful logs (from the
 x86 machine), we didn't find *"[ MAIN  ] Completed service
 synchronization, ready to provide service.*" in the ppc logs.
 So, it looks like corosync is not in a position to accept
 connections from Pacemaker.
 I even tried with a new corosync.conf, with no success.

 Any hints on this issue would be really helpful.

 Attaching ppc_notworking.log, x86_working.log, corosync.conf.

 Regards,
 Sriram



 On Fri, Apr 29, 2016 at 2:44 PM, Sriram mailto:sriram...@gmail.com>> wrote:

 Hi,

 I went ahead and made some changes in the file system (like bringing
 in /etc/init.d/corosync, /etc/init.d/pacemaker and /etc/sysconfig).
 After that I was able to run "pcs cluster start".
 But it failed with the following error
  # pcs cluster start
 Starting Cluster...
 Starting Pacemaker Cluster Manager[FAILED]
 Error: unable to start pacemaker

 And in the /var/log/pacemaker.log, I saw these errors
 pacemakerd: info: mcp_read_config:  cmap connection
 setup failed: CS_ERR_TRY_AGAIN.  Retrying in 4s
 Apr 29 08:53:47 [15863] node_cu pacemakerd: info:
 mcp_read_config:  cmap connection setup failed:
 CS_ERR_TRY_AGAIN.  Retrying in 5s
 Apr 29 08:53:52 [15863] node_cu pacemakerd:  warning:
 mcp_read_config:  Could not connect to Cluster
 Configuration Database API, error 6
 Apr 29 08:53:52 [15863] node_cu pacemakerd:   notice:
 main: Could not obtain corosync config data, exiting
 Apr 29 08:53:52 [15863] node_cu pacemakerd: info:
 crm_xml_cleanup:  Cleaning up memory from libxml2


 And in the /var/log/Debuglog, I saw these errors coming
 from corosync
 20160429 085347.487050  airv_cu
 daemon.warn corosync[12857]:   [QB] Denied connection,
 is not ready (12857-15863-14)
 20160429 085347.487067  airv_cu
 daemon.info  corosync[12857]:   [QB
 ] Denied connection, is not ready (12857-15863-14)


 I browsed the code of libqb to find that it is failing in

 https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c

 Line 600 :
 handle_new_connection function

 Line 637:
 if (auth_result == 0 &&
 c->service->serv_fns.connection_accept) {
 res = c->service->serv_fns.connection_accept(c,
  c->euid, c->egid);
 }
 if (res != 0) {
 goto send_response;
 }

 Any hints on this issue would be really helpful for me to
 go ahead.
 Please let me know if any logs are required,

 Regards,
 Sriram

 On Thu, Apr 28, 2016 at 2:42 PM, Sriram
 mailto:sriram...@gmail.com>> wrote:

 Thanks Ken and Emmanuel.
 It's a big endian machine. I will try running "pcs
 cluster setup" and "pcs cluster start".
 Inside cluster.py, "service pacemaker start" and
 "service corosync start" are executed to bring up
 pacemaker and corosync.
 Those service scripts and the infrastructure needed to
 bring up the processes in the above-said manner
 don't exist on my board.

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-01 Thread Klaus Wenninger
As your hardware is probably capable of running ppcle, and if you have an
environment at hand without too much effort, it might pay off to try that.
There are of course distributions out there supporting corosync on
big-endian architectures, but I don't know if there is an automated
regression test for corosync on big-endian that would catch big-endian
issues right away with something as current as your 2.3.5.

Regards,
Klaus

On 05/02/2016 06:44 AM, Nikhil Utane wrote:
> Re-sending as I don't see my post on the thread.
>
> On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
> mailto:nikhil.subscri...@gmail.com>> wrote:
>
> Hi,
>
> Looking for some guidance here as we are completely blocked
> otherwise :(.
>
> -Regards
> Nikhil
>
> On Fri, Apr 29, 2016 at 6:11 PM, Sriram  > wrote:
>
> Corrected the subject.
>
> We went ahead and captured corosync debug logs for our ppc board.
> After log analysis and comparison with the successful logs (from the
> x86 machine), we didn't find *"[ MAIN  ] Completed service
> synchronization, ready to provide service.*" in the ppc logs.
> So, it looks like corosync is not in a position to accept
> connections from Pacemaker.
> I even tried with a new corosync.conf, with no success.
>
> Any hints on this issue would be really helpful.
>
> Attaching ppc_notworking.log, x86_working.log, corosync.conf.
>
> Regards,
> Sriram
>
>
>
> On Fri, Apr 29, 2016 at 2:44 PM, Sriram  > wrote:
>
> Hi,
>
> I went ahead and made some changes in the file system (like bringing
> in /etc/init.d/corosync, /etc/init.d/pacemaker and /etc/sysconfig).
> After that I was able to run "pcs cluster start".
> But it failed with the following error
>  # pcs cluster start
> Starting Cluster...
> Starting Pacemaker Cluster Manager[FAILED]
> Error: unable to start pacemaker
>
> And in the /var/log/pacemaker.log, I saw these errors
> pacemakerd: info: mcp_read_config:  cmap connection
> setup failed: CS_ERR_TRY_AGAIN.  Retrying in 4s
> Apr 29 08:53:47 [15863] node_cu pacemakerd: info:
> mcp_read_config:  cmap connection setup failed:
> CS_ERR_TRY_AGAIN.  Retrying in 5s
> Apr 29 08:53:52 [15863] node_cu pacemakerd:  warning:
> mcp_read_config:  Could not connect to Cluster
> Configuration Database API, error 6
> Apr 29 08:53:52 [15863] node_cu pacemakerd:   notice:
> main: Could not obtain corosync config data, exiting
> Apr 29 08:53:52 [15863] node_cu pacemakerd: info:
> crm_xml_cleanup:  Cleaning up memory from libxml2
>
>
> And in the /var/log/Debuglog, I saw these errors coming
> from corosync
> 20160429 085347.487050  airv_cu
> daemon.warn corosync[12857]:   [QB] Denied connection,
> is not ready (12857-15863-14)
> 20160429 085347.487067  airv_cu
> daemon.info  corosync[12857]:   [QB   
> ] Denied connection, is not ready (12857-15863-14)
>
>
> I browsed the code of libqb to find that it is failing in
>
> https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c
>
> Line 600 :
> handle_new_connection function
>
> Line 637:
> if (auth_result == 0 &&
> c->service->serv_fns.connection_accept) {
> res = c->service->serv_fns.connection_accept(c,
>  c->euid, c->egid);
> }
> if (res != 0) {
> goto send_response;
> }
>
> Any hints on this issue would be really helpful for me to
> go ahead.
> Please let me know if any logs are required,
>
> Regards,
> Sriram
>
> On Thu, Apr 28, 2016 at 2:42 PM, Sriram
> mailto:sriram...@gmail.com>> wrote:
>
> Thanks Ken and Emmanuel.
> It's a big endian machine. I will try running "pcs
> cluster setup" and "pcs cluster start".
> Inside cluster.py, "service pacemaker start" and
> "service corosync start" are executed to bring up
> pacemaker and corosync.
> Those service scripts and the infrastructure needed to
> bring up the processes in the above-said manner
> don't exist on my board.
> As it is an embedded board with limited memory, a
> full-fledged Linux is not installed.
> Just curious to know, what could be the reason pacemaker
> throws that error.

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-01 Thread Nikhil Utane
Re-sending as I don't see my post on the thread.

On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane 
wrote:

> Hi,
>
> Looking for some guidance here as we are completely blocked otherwise :(.
>
> -Regards
> Nikhil
>
> On Fri, Apr 29, 2016 at 6:11 PM, Sriram  wrote:
>
>> Corrected the subject.
>>
>> We went ahead and captured corosync debug logs for our ppc board.
>> After log analysis and comparison with the successful logs (from the x86
>> machine), we didn't find *"[ MAIN  ] Completed service synchronization,
>> ready to provide service.*" in the ppc logs.
>> So, it looks like corosync is not in a position to accept connections from
>> Pacemaker.
>> I even tried with a new corosync.conf, with no success.
>>
>> Any hints on this issue would be really helpful.
>>
>> Attaching ppc_notworking.log, x86_working.log, corosync.conf.
>>
>> Regards,
>> Sriram
>>
>>
>>
>> On Fri, Apr 29, 2016 at 2:44 PM, Sriram  wrote:
>>
>>> Hi,
>>>
>>> I went ahead and made some changes in the file system (like bringing in
>>> /etc/init.d/corosync, /etc/init.d/pacemaker and /etc/sysconfig). After
>>> that I was able to run "pcs cluster start".
>>> But it failed with the following error
>>>  # pcs cluster start
>>> Starting Cluster...
>>> Starting Pacemaker Cluster Manager[FAILED]
>>> Error: unable to start pacemaker
>>>
>>> And in the /var/log/pacemaker.log, I saw these errors
>>> pacemakerd: info: mcp_read_config:  cmap connection setup failed:
>>> CS_ERR_TRY_AGAIN.  Retrying in 4s
>>> Apr 29 08:53:47 [15863] node_cu pacemakerd: info: mcp_read_config:
>>> cmap connection setup failed: CS_ERR_TRY_AGAIN.  Retrying in 5s
>>> Apr 29 08:53:52 [15863] node_cu pacemakerd:  warning: mcp_read_config:
>>> Could not connect to Cluster Configuration Database API, error 6
>>> Apr 29 08:53:52 [15863] node_cu pacemakerd:   notice: main: Could
>>> not obtain corosync config data, exiting
>>> Apr 29 08:53:52 [15863] node_cu pacemakerd: info: crm_xml_cleanup:
>>> Cleaning up memory from libxml2
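The retry behind these log lines is essentially a backoff loop around
cmap_initialize() from <corosync/cmap.h>; a hedged sketch, not the
verbatim pacemaker source (CS_ERR_TRY_AGAIN is value 6 in the cs_error_t
enumeration, matching "error 6" above):

    /* needs <corosync/cmap.h> and <unistd.h> */
    cmap_handle_t h;
    cs_error_t rc = CS_ERR_TRY_AGAIN;
    int delay;

    for (delay = 1; delay <= 5 && rc != CS_OK; delay++) {
            rc = cmap_initialize(&h);
            if (rc == CS_ERR_TRY_AGAIN)
                    sleep(delay);            /* "Retrying in Ns" */
            else if (rc != CS_OK)
                    break;                   /* some other failure */
    }
    /* rc != CS_OK here => "Could not connect to Cluster
     * Configuration Database API, error 6" and pacemakerd exits */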
>>>
>>>
>>> And in the /var/log/Debuglog, I saw these errors coming from corosync
>>> 20160429 085347.487050 airv_cu daemon.warn corosync[12857]:   [QB]
>>> Denied connection, is not ready (12857-15863-14)
>>> 20160429 085347.487067 airv_cu daemon.info corosync[12857]:   [QB]
>>> Denied connection, is not ready (12857-15863-14)
>>>
>>>
>>> I browsed the code of libqb to find that it is failing in
>>>
>>> https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c
>>>
>>> Line 600 :
>>> handle_new_connection function
>>>
>>> Line 637:
>>> if (auth_result == 0 && c->service->serv_fns.connection_accept) {
>>> res = c->service->serv_fns.connection_accept(c,
>>>  c->euid, c->egid);
>>> }
>>> if (res != 0) {
>>> goto send_response;
>>> }
>>>
>>> Any hints on this issue would be really helpful for me to go ahead.
>>> Please let me know if any logs are required,
>>>
>>> Regards,
>>> Sriram
>>>
>>> On Thu, Apr 28, 2016 at 2:42 PM, Sriram  wrote:
>>>
 Thanks Ken and Emmanuel.
 It's a big endian machine. I will try running "pcs cluster setup"
 and "pcs cluster start".
 Inside cluster.py, "service pacemaker start" and "service corosync
 start" are executed to bring up pacemaker and corosync.
 Those service scripts and the infrastructure needed to bring up the
 processes in the above-said manner don't exist on my board.
 As it is an embedded board with limited memory, a full-fledged Linux
 is not installed.
 Just curious to know, what could be the reason pacemaker throws that
 error.



 *"cmap connection setup failed: CS_ERR_TRY_AGAIN.  Retrying in 1s"*
 Thanks for response.

 Regards,
 Sriram.

 On Thu, Apr 28, 2016 at 8:55 AM, Ken Gaillot 
 wrote:

> On 04/27/2016 11:25 AM, emmanuel segura wrote:
> > you need to use pcs to do everything, pcs cluster setup and pcs
> > cluster start, try to use the redhat docs for more information.
>
> Agreed -- pcs cluster setup will create a proper corosync.conf for you.
> Your corosync.conf below uses corosync 1 syntax, and there were
> significant changes in corosync 2. In particular, you don't need the
> file created in step 4, because pacemaker is no longer launched via a
> corosync plugin.
>
> > 2016-04-27 17:28 GMT+02:00 Sriram :
> >> Dear All,
> >>
> >> I'm trying to use pacemaker and corosync for the clustering
> >> requirement that came up recently.
> >> We have cross-compiled corosync, pacemaker and pcs (python) for the
> >> ppc environment (the target board where pacemaker and corosync are
> >> supposed to run).
> >> I'm having trouble bringing up pacemaker in that environment, though
> >> I could successfully bring up corosync.
> >> Any help is welcome.
> >>
> >> I'm using these versions of pacemaker and corosync
> >> [root@node_cu pacemaker]# corosync -v
>>

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-01 Thread Nikhil Utane
Hi,

Looking for some guidance here as we are completely blocked otherwise :(.

-Regards
Nikhil

On Fri, Apr 29, 2016 at 6:11 PM, Sriram  wrote:

> Corrected the subject.
>
> We went ahead and captured corosync debug logs for our ppc board.
> After log analysis and comparison with the successful logs (from the x86
> machine), we didn't find *"[ MAIN  ] Completed service synchronization,
> ready to provide service.*" in the ppc logs.
> So, it looks like corosync is not in a position to accept connections from
> Pacemaker.
> I even tried with a new corosync.conf, with no success.
>
> Any hints on this issue would be really helpful.
>
> Attaching ppc_notworking.log, x86_working.log, corosync.conf.
>
> Regards,
> Sriram
>
>
>
> On Fri, Apr 29, 2016 at 2:44 PM, Sriram  wrote:
>
>> Hi,
>>
>> I went ahead and made some changes in the file system (like bringing in
>> /etc/init.d/corosync, /etc/init.d/pacemaker and /etc/sysconfig). After
>> that I was able to run "pcs cluster start".
>> But it failed with the following error
>>  # pcs cluster start
>> Starting Cluster...
>> Starting Pacemaker Cluster Manager[FAILED]
>> Error: unable to start pacemaker
>>
>> And in the /var/log/pacemaker.log, I saw these errors
>> pacemakerd: info: mcp_read_config:  cmap connection setup failed:
>> CS_ERR_TRY_AGAIN.  Retrying in 4s
>> Apr 29 08:53:47 [15863] node_cu pacemakerd: info: mcp_read_config:
>> cmap connection setup failed: CS_ERR_TRY_AGAIN.  Retrying in 5s
>> Apr 29 08:53:52 [15863] node_cu pacemakerd:  warning: mcp_read_config:
>> Could not connect to Cluster Configuration Database API, error 6
>> Apr 29 08:53:52 [15863] node_cu pacemakerd:   notice: main: Could not
>> obtain corosync config data, exiting
>> Apr 29 08:53:52 [15863] node_cu pacemakerd: info: crm_xml_cleanup:
>> Cleaning up memory from libxml2
>>
>>
>> And in the /var/log/Debuglog, I saw these errors coming from corosync
>> 20160429 085347.487050 airv_cu daemon.warn corosync[12857]:   [QB]
>> Denied connection, is not ready (12857-15863-14)
>> 20160429 085347.487067 airv_cu daemon.info corosync[12857]:   [QB]
>> Denied connection, is not ready (12857-15863-14)
>>
>>
>> I browsed the code of libqb to find that it is failing in
>>
>> https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c
>>
>> Line 600:
>> handle_new_connection function
>>
>> Line 637:
>> if (auth_result == 0 && c->service->serv_fns.connection_accept) {
>>         res = c->service->serv_fns.connection_accept(c,
>>                                                      c->euid, c->egid);
>> }
>> if (res != 0) {
>>         goto send_response;
>> }
>>
>> Any hints on this issue would be really helpful for me to go ahead.
>> Please let me know if any logs are required.
>>
>> Regards,
>> Sriram
>>
>> On Thu, Apr 28, 2016 at 2:42 PM, Sriram  wrote:
>>
>>> Thanks Ken and Emmanuel.
>>> It's a big-endian machine. I will try running "pcs cluster setup"
>>> and "pcs cluster start".
>>> Inside cluster.py, "service pacemaker start" and "service corosync
>>> start" are executed to bring up pacemaker and corosync.
>>> Those service scripts and the infrastructure needed to bring up the
>>> processes in that manner don't exist on my board.
>>> As it is an embedded board with limited memory, a full-fledged Linux is
>>> not installed.
>>> Just curious to know what could be the reason pacemaker throws this
>>> error:
>>>
>>>
>>>
>>> *"cmap connection setup failed: CS_ERR_TRY_AGAIN.  Retrying in 1s"*
>>> Thanks for the response.
>>>
>>> Regards,
>>> Sriram.
>>>
>>> On Thu, Apr 28, 2016 at 8:55 AM, Ken Gaillot 
>>> wrote:
>>>
 On 04/27/2016 11:25 AM, emmanuel segura wrote:
 > you need to use pcs to do everything, pcs cluster setup and pcs
 > cluster start, try to use the redhat docs for more information.

 Agreed -- pcs cluster setup will create a proper corosync.conf for you.
 Your corosync.conf below uses corosync 1 syntax, and there were
 significant changes in corosync 2. In particular, you don't need the
 file created in step 4, because pacemaker is no longer launched via a
 corosync plugin.

 > 2016-04-27 17:28 GMT+02:00 Sriram :
 >> Dear All,
 >>
 >> I'm trying to use pacemaker and corosync for the clustering
 >> requirement that came up recently.
 >> We have cross-compiled corosync, pacemaker and pcs (python) for the ppc
 >> environment (the target board where pacemaker and corosync are supposed
 >> to run).
 >> I'm having trouble bringing up pacemaker in that environment, though
 >> I could successfully bring up corosync.
 >> Any help is welcome.
 >>
 >> I'm using these versions of pacemaker and corosync
 >> [root@node_cu pacemaker]# corosync -v
 >> Corosync Cluster Engine, version '2.3.5'
 >> Copyright (c) 2006-2009 Red Hat, Inc.
 >> [root@node_cu pacemaker]# pacemakerd -$
 >> Pacemaker 1.1.14
 >> Written by Andrew Beekhof
 >>
 >> For running corosync, I did 

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-04-29 Thread Sriram
Corrected the subject.

We went ahead and captured corosync debug logs for our ppc board.
After analyzing the logs and comparing them with the successful logs (from
the x86 machine),
we didn't find *"[ MAIN  ] Completed service synchronization, ready to
provide service."* in the ppc logs.
So it looks like corosync is not in a position to accept connections from
Pacemaker.
I even tried with the new corosync.conf, with no success.
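
For what it's worth, one way to confirm from the board itself whether
corosync ever reaches the ready state is to poke it with the stock
corosync 2.x command-line tools, assuming they were cross-compiled along
with the daemon; until service synchronization completes they should fail
the same way pacemakerd's cmap connection does:

corosync-cfgtool -s    # prints the local node and ring status once corosync is ready
corosync-cmapctl       # dumps the cmap database; fails while corosync is not ready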

Any hints on this issue would be really helpful.

Attaching ppc_notworking.log, x86_working.log, corosync.conf.

Regards,
Sriram



On Fri, Apr 29, 2016 at 2:44 PM, Sriram  wrote:

> Hi,
>
> I went ahead and made some changes in the file system (I brought in
> /etc/init.d/corosync, /etc/init.d/pacemaker and /etc/sysconfig). After
> that I was able to run "pcs cluster start".
> But it failed with the following error:
>  # pcs cluster start
> Starting Cluster...
> Starting Pacemaker Cluster Manager[FAILED]
> Error: unable to start pacemaker
>
> And in the /var/log/pacemaker.log, I saw these errors
> pacemakerd: info: mcp_read_config:  cmap connection setup failed:
> CS_ERR_TRY_AGAIN.  Retrying in 4s
> Apr 29 08:53:47 [15863] node_cu pacemakerd: info: mcp_read_config:
> cmap connection setup failed: CS_ERR_TRY_AGAIN.  Retrying in 5s
> Apr 29 08:53:52 [15863] node_cu pacemakerd:  warning: mcp_read_config:
> Could not connect to Cluster Configuration Database API, error 6
> Apr 29 08:53:52 [15863] node_cu pacemakerd:   notice: main: Could not
> obtain corosync config data, exiting
> Apr 29 08:53:52 [15863] node_cu pacemakerd: info: crm_xml_cleanup:
> Cleaning up memory from libxml2
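
Error 6 in the log above is CS_ERR_TRY_AGAIN from corosync's cs_error_t
enum, i.e. pacemakerd gave up after its retries. The retry loop it performs
can be reproduced with a minimal standalone client against the public cmap
API; a sketch, assuming corosync 2.x headers and linking with -lcmap:

/* probe.c: retry cmap_initialize() while corosync answers CS_ERR_TRY_AGAIN,
 * mimicking the mcp_read_config() backoff seen in the log above. */
#include <stdio.h>
#include <unistd.h>
#include <corosync/cmap.h>

int main(void)
{
    cmap_handle_t handle;
    cs_error_t rc = CS_ERR_TRY_AGAIN;
    int delay;

    for (delay = 1; delay <= 5 && rc == CS_ERR_TRY_AGAIN; delay++) {
        rc = cmap_initialize(&handle);
        if (rc == CS_ERR_TRY_AGAIN) {
            printf("cmap connection setup failed: CS_ERR_TRY_AGAIN. Retrying in %ds\n", delay);
            sleep(delay);
        }
    }
    if (rc != CS_OK) {
        fprintf(stderr, "could not connect to cmap, error %d\n", (int) rc);
        return 1;
    }
    printf("cmap connection established\n");
    cmap_finalize(handle);
    return 0;
}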
>
>
> And in the /var/log/Debuglog, I saw these errors coming from corosync
> 20160429 085347.487050 airv_cu daemon.warn corosync[12857]:   [QB]
> Denied connection, is not ready (12857-15863-14)
> 20160429 085347.487067 airv_cu daemon.info corosync[12857]:   [QB]
> Denied connection, is not ready (12857-15863-14)
>
>
> I browsed the libqb code and found that it is failing in
>
> https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c
>
> Line 600:
> handle_new_connection function
>
> Line 637:
> if (auth_result == 0 && c->service->serv_fns.connection_accept) {
>         res = c->service->serv_fns.connection_accept(c,
>                                                      c->euid, c->egid);
> }
> if (res != 0) {
>         goto send_response;
> }
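
That connection_accept hook is the server-side gate: corosync registers
such a callback through libqb's qb_ipcs API and rejects IPC clients until
it considers itself ready, which is exactly the "Denied connection, is not
ready" lines in the debug log. A purely illustrative sketch of such a
callback (the qb_ipcs types are real libqb API; the ready flag and the
function name are invented for illustration, not corosync's actual symbols):

#include <errno.h>
#include <qb/qbipcs.h>

static int server_ready = 0;   /* application flag, flipped once sync completes */

/* Illustrative accept callback: a non-zero return makes libqb's
 * handle_new_connection() take the send_response path quoted above
 * and log "Denied connection, is not ready"; the client sees a
 * try-again condition and retries. */
static int32_t
my_connection_accept(qb_ipcs_connection_t *c, uid_t uid, gid_t gid)
{
    (void) c; (void) uid; (void) gid;
    return server_ready ? 0 : -EAGAIN;
}

So the denial is only a symptom; the real question is why corosync on the
ppc board never reaches the point where it would start accepting, i.e. why
service synchronization never completes.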
>
> Any hints on this issue would be really helpful for me to go ahead.
> Please let me know if any logs are required.
>
> Regards,
> Sriram
>
> On Thu, Apr 28, 2016 at 2:42 PM, Sriram  wrote:
>
>> Thanks Ken and Emmanuel.
>> It's a big-endian machine. I will try running "pcs cluster setup" and
>> "pcs cluster start".
>> Inside cluster.py, "service pacemaker start" and "service corosync start"
>> are executed to bring up pacemaker and corosync.
>> Those service scripts and the infrastructure needed to bring up the
>> processes in that manner don't exist on my board.
>> As it is an embedded board with limited memory, a full-fledged Linux is
>> not installed.
>> Just curious to know what could be the reason pacemaker throws this
>> error:
>>
>>
>>
>> *"cmap connection setup failed: CS_ERR_TRY_AGAIN.  Retrying in 1s"*
>> Thanks for the response.
>>
>> Regards,
>> Sriram.
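
Incidentally, since the init-script infrastructure is missing on the board,
both daemons can also be started by hand, bypassing pcs and the service
wrappers entirely; a sketch assuming default install paths for corosync 2.x
and pacemaker 1.1:

corosync               # daemonizes by default (use -f to keep it in the foreground)
corosync-cfgtool -s    # optional: check that corosync answers before going on
pacemakerd &           # pacemakerd typically stays in the foreground, so background it

That at least separates "pcs cannot drive the service scripts" from
"corosync refuses the connection".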
>>
>> On Thu, Apr 28, 2016 at 8:55 AM, Ken Gaillot  wrote:
>>
>>> On 04/27/2016 11:25 AM, emmanuel segura wrote:
>>> > you need to use pcs to do everything, pcs cluster setup and pcs
>>> > cluster start, try to use the redhat docs for more information.
>>>
>>> Agreed -- pcs cluster setup will create a proper corosync.conf for you.
>>> Your corosync.conf below uses corosync 1 syntax, and there were
>>> significant changes in corosync 2. In particular, you don't need the
>>> file created in step 4, because pacemaker is no longer launched via a
>>> corosync plugin.
>>>
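
For reference, a two-node version of what that amounts to, assuming
pcs 0.9-era syntax and placeholder node names:

pcs cluster setup --name mycluster node1 node2
pcs cluster start --all

and the corosync.conf it generates has roughly this corosync 2.x shape
(values are illustrative):

totem {
    version: 2
    cluster_name: mycluster
    transport: udpu
}

nodelist {
    node {
        ring0_addr: node1
        nodeid: 1
    }
    node {
        ring0_addr: node2
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_syslog: yes
}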
>>> > 2016-04-27 17:28 GMT+02:00 Sriram :
>>> >> Dear All,
>>> >>
>>> >> I'm trying to use pacemaker and corosync for the clustering
>>> >> requirement that came up recently.
>>> >> We have cross-compiled corosync, pacemaker and pcs (python) for the ppc
>>> >> environment (the target board where pacemaker and corosync are supposed
>>> >> to run).
>>> >> I'm having trouble bringing up pacemaker in that environment, though
>>> >> I could successfully bring up corosync.
>>> >> Any help is welcome.
>>> >>
>>> >> I'm using these versions of pacemaker and corosync
>>> >> [root@node_cu pacemaker]# corosync -v
>>> >> Corosync Cluster Engine, version '2.3.5'
>>> >> Copyright (c) 2006-2009 Red Hat, Inc.
>>> >> [root@node_cu pacemaker]# pacemakerd -$
>>> >> Pacemaker 1.1.14
>>> >> Written by Andrew Beekhof
>>> >>
>>> >> For running corosync, I did the following.
>>> >> 1. Created the following directories:
>>> >> /var/lib/pacemaker
>>> >> /var/lib/corosync
>>> >> /var/lib/pacemaker/cores
>>> >> /var/lib/pacemaker/pengine
>>> >> /var/lib/pacemaker/blackbox
>>> >> /var/lib/pacemaker/cib