Re: [Pacemaker] cib: ERROR: send_ais_message: Not connected to AIS

2014-04-14 Thread Marco Felettigh
On Mon, 14 Apr 2014 14:40:43 +1000
Andrew Beekhof  wrote:

> 
> On 11 Apr 2014, at 10:54 pm, Marco Felettigh  wrote:
> 
> > On Fri, 11 Apr 2014 17:17:57 +1000
> > Andrew Beekhof  wrote:
> > 
> >> 
> >> On 8 Apr 2014, at 8:37 pm, ma...@nucleus.it wrote:
> >> 
> >>> On Tue, 8 Apr 2014 10:49:16 +1000
> >>> Andrew Beekhof  wrote:
> >>> 
> >>>>
> >>>> On 7 Apr 2014, at 8:46 pm, ma...@nucleus.it wrote:
> >>>>
> >>>>> Hi,
> >>>>> in a production environment with 2 nodes ( nodeA , nodeB ) we
> >>>>> had a hardware failure, so we restarted nodeB.
> >>>>> After the restarted nodeB came up we restarted corosync/pacemaker
> >>>>> on it, but for 2 days now the corosync/pacemaker stuff has been
> >>>>> looping.
> >>>>>
> >>>>> crm_mon NodeA:
> >>>>>
> >>>>> Stack: openais
> >>>>> Current DC: nodeA - partition with quorum
> >>>>> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
> >>>>> 2 Nodes configured, 2 expected votes
> >>>>> 17 Resources configured.
> >>>>>
> >>>>>
> >>>>> Online: [ nodeA ]
> >>>>> OFFLINE: [ nodeB ]
> >>>>>
> >>>>>
> >>>>> crm_mon NodeB:
> >>>>>
> >>>>> Stack: openais
> >>>>> Current DC: NONE
> >>>>> 2 Nodes configured, 2 expected votes
> >>>>> 17 Resources configured.
> >>>>>
> >>>>>
> >>>>> OFFLINE: [ nodeA nodeB ]
> >>>>>
> >>>>> This loop on nodeB reports:
> >>>>> crmd: [7149]: debug: do_election_count_vote: Election 3 (owner:
> >>>>> nodeA) lost: vote from nodeA (Age)
> >>>>>
> >>>>> So investigating around I found this message on nodeA:
> >>>>> cib: [28755]: ERROR: send_ais_message: Not connected to AIS
> >>>>>
> >>>>> Now this message is repeated for every operation.
> >>>>> Is it a corosync problem or a cib/pacemaker one?
> >>>>> Any suggestion on what happened?
> >>>>
> >>>> For some reason the cib can't connect to corosync anymore.
> >>>> No software got upgraded recently?
> >>>>
> >>>> Are there any logs from corosync?
> >>>> Which distro is this?
> >>>>
> >>>>> And why did the start of a cluster node crash the DC stuff? :(
> >>>>>
> >>>>>
> >>>>> Bye Marco
> >>>>>
> >>>>> ___
> >>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> >>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>>>
> >>>>> Project Home: http://www.clusterlabs.org
> >>>>> Getting started:
> >>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>>>> Bugs: http://bugs.clusterlabs.org
> >>>>
> >>> 
> >>> Hi,
> >>> the distro is openSUSE 11.1, and there are no updates because
> >>> the distro is out of maintenance.
> >> 
> >> A good reason to be using SLES (or RHEL/CentOS).
> > 
> > Better Gentoo ;)
> > 
> >> 
> >>> We are planning an upgrade, but the interesting thing is to figure
> >>> out the cause of the problem.
> >>> The log is attached; thanks for the support.
> >> 
> >> There's nothing obvious in the logs.  Just that as far as pacemaker
> >> could tell, corosync suddenly went away. Was the corosync process
> >> still running?
> >> 
> > 
> > Yes , corosync was still running .
> 
> Stopping pacemaker and restarting it didn't help?
> 

In the end we restarted both servers and then started
corosync/pacemaker.


Thanks for the support
Marco

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] cib: ERROR: send_ais_message: Not connected to AIS

2014-04-13 Thread Andrew Beekhof

On 11 Apr 2014, at 10:54 pm, Marco Felettigh  wrote:

> On Fri, 11 Apr 2014 17:17:57 +1000
> Andrew Beekhof  wrote:
> 
>> 
>> On 8 Apr 2014, at 8:37 pm, ma...@nucleus.it wrote:
>> 
>>> On Tue, 8 Apr 2014 10:49:16 +1000
>>> Andrew Beekhof  wrote:
>>> 
 
>>>> On 7 Apr 2014, at 8:46 pm, ma...@nucleus.it wrote:
>>>>
>>>>> Hi,
>>>>> in a production environment with 2 nodes ( nodeA , nodeB ) we had
>>>>> a hardware failure, so we restarted nodeB.
>>>>> After the restarted nodeB came up we restarted corosync/pacemaker on
>>>>> it, but for 2 days now the corosync/pacemaker stuff has been
>>>>> looping.
>>>>>
>>>>> crm_mon NodeA:
>>>>>
>>>>> Stack: openais
>>>>> Current DC: nodeA - partition with quorum
>>>>> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
>>>>> 2 Nodes configured, 2 expected votes
>>>>> 17 Resources configured.
>>>>>
>>>>>
>>>>> Online: [ nodeA ]
>>>>> OFFLINE: [ nodeB ]
>>>>>
>>>>>
>>>>> crm_mon NodeB:
>>>>>
>>>>> Stack: openais
>>>>> Current DC: NONE
>>>>> 2 Nodes configured, 2 expected votes
>>>>> 17 Resources configured.
>>>>>
>>>>>
>>>>> OFFLINE: [ nodeA nodeB ]
>>>>>
>>>>> This loop on nodeB reports:
>>>>> crmd: [7149]: debug: do_election_count_vote: Election 3 (owner:
>>>>> nodeA) lost: vote from nodeA (Age)
>>>>>
>>>>> So investigating around I found this message on nodeA:
>>>>> cib: [28755]: ERROR: send_ais_message: Not connected to AIS
>>>>>
>>>>> Now this message is repeated for every operation.
>>>>> Is it a corosync problem or a cib/pacemaker one?
>>>>> Any suggestion on what happened?
>>>>
>>>> For some reason the cib can't connect to corosync anymore.
>>>> No software got upgraded recently?
>>>>
>>>> Are there any logs from corosync?
>>>> Which distro is this?
>>>>
>>>>> And why did the start of a cluster node crash the DC stuff? :(
>>>>>
>>>>>
>>>>> Bye Marco
>>>>>
 
>>> 
>>> Hi,
>>> the distro is openSUSE 11.1, and there are no updates because
>>> the distro is out of maintenance.
>> 
>> A good reason to be using SLES (or RHEL/CentOS).
> 
> Better Gentoo ;)
> 
>> 
>>> We are planning an upgrade, but the interesting thing is to figure
>>> out the cause of the problem.
>>> The log is attached; thanks for the support.
>> 
>> There's nothing obvious in the logs.  Just that as far as pacemaker
>> could tell, corosync suddenly went away. Was the corosync process
>> still running?
>> 
> 
> Yes , corosync was still running .

Stopping pacemaker and restarting it didn't help?





Re: [Pacemaker] cib: ERROR: send_ais_message: Not connected to AIS

2014-04-11 Thread Marco Felettigh
On Fri, 11 Apr 2014 17:17:57 +1000
Andrew Beekhof  wrote:

> 
> On 8 Apr 2014, at 8:37 pm, ma...@nucleus.it wrote:
> 
> > On Tue, 8 Apr 2014 10:49:16 +1000
> > Andrew Beekhof  wrote:
> > 
> >> 
> >> On 7 Apr 2014, at 8:46 pm, ma...@nucleus.it wrote:
> >> 
> >>> Hi,
> >>> in a production environment with 2 nodes ( nodeA , nodeB ) we had
> >>> a hardware failure, so we restarted nodeB.
> >>> After the restarted nodeB came up we restarted corosync/pacemaker on
> >>> it, but for 2 days now the corosync/pacemaker stuff has been
> >>> looping.
> >>> 
> >>> crm_mon NodeA:
> >>> 
> >>> Stack: openais
> >>> Current DC: nodeA - partition with quorum
> >>> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
> >>> 2 Nodes configured, 2 expected votes
> >>> 17 Resources configured.
> >>> 
> >>> 
> >>> Online: [ nodeA ]
> >>> OFFLINE: [ nodeB ]
> >>> 
> >>> 
> >>> crm_mon NodeB:
> >>> 
> >>> Stack: openais
> >>> Current DC: NONE
> >>> 2 Nodes configured, 2 expected votes
> >>> 17 Resources configured.
> >>> 
> >>> 
> >>> OFFLINE: [ nodeA nodeB ]
> >>> 
> >>> This loop on nodeB reports:
> >>> crmd: [7149]: debug: do_election_count_vote: Election 3 (owner:
> >>> nodeA) lost: vote from nodeA (Age)
> >>> 
> >>> So investigating around I found this message on nodeA:
> >>> cib: [28755]: ERROR: send_ais_message: Not connected to AIS
> >>> 
> >>> Now this message is repeated for every operation.
> >>> Is it a corosync problem or a cib/pacemaker one?
> >>> Any suggestion on what happened?
> >> 
> >> For some reason the cib can't connect to corosync anymore.
> >> No software got upgraded recently?
> >> 
> >> Are there any logs from corosync?
> >> Which distro is this?
> >> 
> >>> And why did the start of a cluster node crash the DC stuff? :(
> >>> 
> >>> 
> >>> Bye Marco
> >>> 
> >> 
> > 
> > Hi,
> > the distro is openSUSE 11.1, and there are no updates because
> > the distro is out of maintenance.
> 
> A good reason to be using SLES (or RHEL/CentOS).

Better Gentoo ;)

> 
> > We are planning an upgrade, but the interesting thing is to figure
> > out the cause of the problem.
> > The log is attached; thanks for the support.
> 
> There's nothing obvious in the logs.  Just that as far as pacemaker
> could tell, corosync suddenly went away. Was the corosync process
> still running?
> 

Yes , corosync was still running .




Re: [Pacemaker] cib: ERROR: send_ais_message: Not connected to AIS

2014-04-11 Thread Andrew Beekhof

On 8 Apr 2014, at 8:37 pm, ma...@nucleus.it wrote:

> On Tue, 8 Apr 2014 10:49:16 +1000
> Andrew Beekhof  wrote:
> 
>> 
>> On 7 Apr 2014, at 8:46 pm, ma...@nucleus.it wrote:
>> 
>>> Hi,
>>> in a production environment with 2 nodes ( nodeA , nodeB ) we had a
>>> hardware failure, so we restarted nodeB.
>>> After the restarted nodeB came up we restarted corosync/pacemaker on
>>> it, but for 2 days now the corosync/pacemaker stuff has been looping.
>>> 
>>> crm_mon NodeA:
>>> 
>>> Stack: openais
>>> Current DC: nodeA - partition with quorum
>>> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
>>> 2 Nodes configured, 2 expected votes
>>> 17 Resources configured.
>>> 
>>> 
>>> Online: [ nodeA ]
>>> OFFLINE: [ nodeB ]
>>> 
>>> 
>>> crm_mon NodeB:
>>> 
>>> Stack: openais
>>> Current DC: NONE
>>> 2 Nodes configured, 2 expected votes
>>> 17 Resources configured.
>>> 
>>> 
>>> OFFLINE: [ nodeA nodeB ]
>>> 
>>> This loop on nodeB reports:
>>> crmd: [7149]: debug: do_election_count_vote: Election 3 (owner:
>>> nodeA) lost: vote from nodeA (Age)
>>> 
>>> So investigating around I found this message on nodeA:
>>> cib: [28755]: ERROR: send_ais_message: Not connected to AIS
>>> 
>>> Now this message is repeated for every operation.
>>> Is it a corosync problem or a cib/pacemaker one?
>>> Any suggestion on what happened?
>> 
>> For some reason the cib can't connect to corosync anymore.
>> No software got upgraded recently?
>> 
>> Are there any logs from corosync?
>> Which distro is this?
>> 
>>> And why did the start of a cluster node crash the DC stuff? :(
>>> 
>>> 
>>> Bye Marco
>>> 
>> 
> 
> Hi,
> the distro is openSUSE 11.1, and there are no updates because the
> distro is out of maintenance.

A good reason to be using SLES (or RHEL/CentOS).

> We are planning an upgrade, but the interesting thing is to figure out
> the cause of the problem.
> The log is attached; thanks for the support.

There's nothing obvious in the logs.  Just that as far as pacemaker could tell, 
corosync suddenly went away.
Was the corosync process still running?





Re: [Pacemaker] cib: ERROR: send_ais_message: Not connected to AIS

2014-04-07 Thread Andrew Beekhof

On 7 Apr 2014, at 8:46 pm, ma...@nucleus.it wrote:

> Hi,
> in a production environment with 2 nodes ( nodeA , nodeB ) we had a
> hardware failure, so we restarted nodeB.
> After the restarted nodeB came up we restarted corosync/pacemaker on it,
> but for 2 days now the corosync/pacemaker stuff has been looping.
> 
> crm_mon NodeA:
> 
> Stack: openais
> Current DC: nodeA - partition with quorum
> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
> 2 Nodes configured, 2 expected votes
> 17 Resources configured.
> 
> 
> Online: [ nodeA ]
> OFFLINE: [ nodeB ]
> 
> 
> crm_mon NodeB:
> 
> Stack: openais
> Current DC: NONE
> 2 Nodes configured, 2 expected votes
> 17 Resources configured.
> 
> 
> OFFLINE: [ nodeA nodeB ]
> 
> This loop on nodeB reports:
> crmd: [7149]: debug: do_election_count_vote: Election 3 (owner: nodeA)
> lost: vote from nodeA (Age)
> 
> So investigating around I found this message on nodeA:
> cib: [28755]: ERROR: send_ais_message: Not connected to AIS
> 
> Now this message is repeated for every operation.
> Is it a corosync problem or a cib/pacemaker one?
> Any suggestion on what happened?

For some reason the cib can't connect to corosync anymore.
No software got upgraded recently?

Are there any logs from corosync?
Which distro is this?

> And why did the start of a cluster node crash the DC stuff? :(
> 
> 
> Bye Marco
> 





Re: [Pacemaker] cib connection error

2013-09-23 Thread Andrew Beekhof

On 24/09/2013, at 2:09 AM, Халезов Иван  wrote:

> Hi all,
> 
> I use pacemaker 1.1.9 with corosync 2.3 both built from source.
> My OS is CentOS 6.4 x86_64
> 
> I have about 30 resources of one type managed by my own resource agent. It is 
> necessary for the resource agent to know the utilization parameter of the 
> configured resource. I query for this parameter with the crm_resource utility 
> in the start function of the RA. After I implemented this feature, I got a 
> lot of errors in my logs:
> 
> Sep 23 19:19:47 iblade5 lrmd[7492]:   notice: operation_finished: 
> RESOURCE_start_0:8445:stderr [ Could not establish cib_rw connection: 
> Resource temporarily unavailable (11) ]
> Sep 23 19:19:47 iblade5 lrmd[7492]:   notice: operation_finished: 
> RESOURCE_start_0:8445:stderr [ Error signing on to the CIB service: Transport 
> endpoint is not connected ]
> 
> So only a few resources (about 4 or 5), different ones every time, start 
> correctly (crm_resource returns the needed value during the start action). 
> All the other resources fail to start.
> 
> I think the problem occurs when many (20-30) resources start at the same 
> time, so that there are 20-30 simultaneous queries to the CIB from the 
> resource agents.
> 
> How can I correct this?

I recall talking to NTT about this recently but I forget what they did to make 
progress.
Perhaps you could look for $?=11 and try again.  I _think_ there might have 
been a patch for libqb that resolved it.
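
A minimal sketch of the "look for $?=11 and try again" idea, assuming the failing call really does exit with status 11 ("Resource temporarily unavailable") when the CIB connection is refused. The helper name retry_on_eagain and the RETRY_DELAY variable are made up for illustration; they are not part of Pacemaker.

```shell
# Hypothetical helper: rerun a command while it exits with status 11,
# up to a maximum number of attempts. Any other status is returned as-is.
retry_on_eagain() {
    max_attempts=$1
    shift
    attempt=1
    while :; do
        if "$@"; then
            return 0
        else
            rc=$?
        fi
        # Give up on any failure other than 11, or once the attempts run out.
        if [ "$rc" -ne 11 ] || [ "$attempt" -ge "$max_attempts" ]; then
            return "$rc"
        fi
        attempt=$((attempt + 1))
        # Pause before retrying; the 1 second default is an arbitrary choice.
        sleep "${RETRY_DELAY:-1}"
    done
}
```

In the RA's start function, the existing crm_resource call would be prefixed with something like "retry_on_eagain 5", keeping the arguments already in use. With 20-30 agents starting at once, a small random delay might spread the retries better than a fixed one.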




Re: [Pacemaker] CIB verification failure with any change via crmsh

2013-01-24 Thread Dejan Muhamedagic
On Thu, Jan 24, 2013 at 09:10:33AM +0100, Jacek Konieczny wrote:
> On Thu, 24 Jan 2013 09:04:14 +0100
> Jacek Konieczny  wrote:
> > I should probably upgrade my CIB somehow
> 
> Indeed. 'cibadmin --upgrade --force' solved my problem.
> Thanks for all the hints.

crm(live)configure# help upgrade

If you get the `CIB not supported` error, which typically means
that the current CIB version is coming from the older release,
you may try to upgrade it to the latest revision. The command
to perform the upgrade is:
...

I knew it was somewhere.

Thanks,

Dejan

> Greets,
>   Jacek
> 


Re: [Pacemaker] CIB verification failure with any change via crmsh

2013-01-24 Thread Dejan Muhamedagic
On Thu, Jan 24, 2013 at 09:04:14AM +0100, Jacek Konieczny wrote:
> Hi,
> 
> On Wed, 23 Jan 2013 18:52:20 +0100
> Dejan Muhamedagic  wrote:
> > > 
> > >   
> > 
> > Not sure if id can start with a digit.
> 
> Corosync node id's are always digits-only.
> 
> > This should really work with versions >= v1.2.4
> 
> Yeah… I have looked into the crmsh code and it has explicit support for
> node 'type' attribute in Pacemaker 1.1.8. For some reason this does not
> work for me on this cluster (no such problems on another cluster, which
> was not upgraded, but set up on Pacemaker 1.1 from the beginning).
> 
> > Which schema do you validate against? Look for the validate-with
> > attribute of the cib element. 
> 
> validate-with="pacemaker-1.0"
> 
> 
>   
>   
>   
> 
>   normal
>   member
>   ping
> 
>   
> 
> So no, it is not optional here. But it is optional in the pacemaker-1.1 
> schema.
> So the problem is crmsh uses the wrong schema for the XML it generates…
> 
> # cibadmin -Q | grep validate-with
>  admin_epoch="0" epoch="337" num_updates="158" cib-last-written="Wed Jan 23 
> 15:23:22 2013" dc-uuid="19179712">
> 
> So, the 'validate-with="pacemaker-1.0"' comes from the current CIB. crmsh
> keeps that, but generates Pacemaker 1.1 XML, so the verification fails.
> 
> I should probably upgrade my CIB somehow, but still it seems there is a bug in
> crmsh.

crmsh relies in this case on the pacemaker version. It should
check the schema, but currently the schema support is somewhat
lacking, i.e. it's not possible to get information on whether
this particular attribute is optional or not.

Thanks,

Dejan

> Greets,
>   Jacek
> 


Re: [Pacemaker] CIB verification failure with any change via crmsh

2013-01-24 Thread Jacek Konieczny
On Thu, 24 Jan 2013 09:04:14 +0100
Jacek Konieczny  wrote:
> I should probably upgrade my CIB somehow

Indeed. 'cibadmin --upgrade --force' solved my problem.
Thanks for all the hints.
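
For anyone checking whether they are in the same situation, here is a small sketch that prints which schema a CIB declares. cib_schema is a hypothetical helper, not a Pacemaker tool; the cibadmin invocations in the comments are the ones quoted in this thread.

```shell
# Hypothetical helper: print the value of the first validate-with
# attribute found in CIB XML read from stdin (prints nothing if absent).
cib_schema() {
    sed -n '/validate-with="/{s/.*validate-with="\([^"]*\)".*/\1/p;q;}'
}

# On a cluster node this would be used roughly as follows:
#   cibadmin -Q > /tmp/cib-backup.xml   # keep a copy of the current CIB first
#   cibadmin -Q | cib_schema            # prints e.g. pacemaker-1.0
#   cibadmin --upgrade --force          # the command that solved it here
```

If the printed schema is older than the installed Pacemaker release, the upgrade is probably worth trying.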

Greets,
Jacek



Re: [Pacemaker] CIB verification failure with any change via crmsh

2013-01-24 Thread Jacek Konieczny
Hi,

On Wed, 23 Jan 2013 18:52:20 +0100
Dejan Muhamedagic  wrote:
> > 
> >   
> 
> Not sure if id can start with a digit.

Corosync node id's are always digits-only.

> This should really work with versions >= v1.2.4

Yeah… I have looked into the crmsh code and it has explicit support for
node 'type' attribute in Pacemaker 1.1.8. For some reason this does not
work for me on this cluster (no such problems on another cluster, which
was not upgraded, but set up on Pacemaker 1.1 from the beginning).

> Which schema do you validate against? Look for the validate-with
> attribute of the cib element. 

validate-with="pacemaker-1.0"


  
  
  

  normal
  member
  ping

  

So no, it is not optional here. But it is optional in the pacemaker-1.1 schema.
So the problem is crmsh uses the wrong schema for the XML it generates…

# cibadmin -Q | grep validate-with


So, the 'validate-with="pacemaker-1.0"' comes from the current CIB. crmsh keeps
that, but generates Pacemaker 1.1 XML, so the verification fails.

I should probably upgrade my CIB somehow, but still it seems there is a bug in
crmsh.

Greets,
Jacek



Re: [Pacemaker] CIB verification failure with any change via crmsh

2013-01-23 Thread Dejan Muhamedagic
Hi,

On Wed, Jan 23, 2013 at 04:31:20PM +0100, Jacek Konieczny wrote:
> Hi,
> 
> I have recently upgraded Pacemaker on one of my clusters from
> 1.0.something to 1.1.8 and installed crmsh to manage it as I used to.
> 
> crmsh mostly works for me, until I try to change the configuration with
> 'crm configure'. Any, even trivial change shows verification errors and
> fails to commit:
> 
> > crm(live)configure# commit
> > element instance_attributes: Relax-NG validity error : Expecting an element 
> > nvpair, got nothing
> > element node: Relax-NG validity error : Expecting an element 
> > instance_attributes, got nothing
> > element node: Relax-NG validity error : Element nodes has extra content: 
> > node
> > element configuration: Relax-NG validity error : Invalid sequence in 
> > interleave
> > element instance_attributes: Relax-NG validity error : Element node failed 
> > to validate attributes
> > element cib: Relax-NG validity error : Element cib failed to validate 
> > content
> >error: main: CIB did not pass DTD/schema validation
> > Errors found during check: config not valid
> >   -V may provide more details
> > Do you still want to commit? no
> 
> It seems that crmsh fails to parse the current configuration properly, as:
> 
> crm configure save xml /tmp/saved.xml ; crm_verify -V --xml-file /tmp/saved.xml
> 
> fails the same way:
> 
> > /tmp/saved.xml:19: element instance_attributes: Relax-NG validity error : 
> > Expecting an element nvpair, got nothing
> > /tmp/saved.xml:18: element node: Relax-NG validity error : Expecting an 
> > element instance_attributes, got nothing
> > /tmp/saved.xml:18: element node: Relax-NG validity error : Element nodes 
> > has extra content: node
> > /tmp/saved.xml:3: element configuration: Relax-NG validity error : Invalid 
> > sequence in interleave
> > /tmp/saved.xml:19: element instance_attributes: Relax-NG validity error : 
> > Element node failed to validate attributes
> > /tmp/saved.xml:2: element cib: Relax-NG validity error : Element cib failed 
> > to validate content
> >error: main: CIB did not pass DTD/schema validation
> > Errors found during check: config not valid
> >   -V may provide more details
> 
> 
> But:
> 
> cibadmin -Q > /tmp/good.xml ; crm_verify --xml-file 
> 
> shows no error.
> 
> Any ideas?
> 
> Looking into the 'invalid' XML file gives me no hints, as the line
> 18 is the first  in:
> 
> 
>   

Not sure if id can start with a digit.

> 
>   
> 
>   
>   
> 
>   
> 
>   
> 
> 
> which looks quite right to me.
> 
> Oh… now I see the difference with the current cib. The  elements miss
> the type="normal" attribute. After adding those to the crmsh-generated XML
> everything works. Then it is a crmsh bug, right?

This should really work with versions >= v1.2.4
Which schema do you validate against? Look for the validate-with
attribute of the cib element. Does that schema support optional
type attribute?

Thanks,

Dejan

> And the errors reported by crm_verify are very misleading.
> 
> Greets,
> Jacek
> 


Re: [Pacemaker] CIB verification failure with any change via crmsh

2013-01-23 Thread Jacek Konieczny
On Wed, 23 Jan 2013 16:44:45 +0100
Lars Marowsky-Bree  wrote:

> On 2013-01-23T16:31:20, Jacek Konieczny  wrote:
> 
> > I have recently upgraded Pacemaker on one of my clusters from
> > 1.0.something to 1.1.8 and installed crmsh to manage it as I used
> > to.
> 
> It'd be helpful if you mentioned which crmsh version you installed.
> The errors you get suggest you need to update it.

You are right, I missed the information.

It was crmsh 1.2.1 and the first thing I tried was an upgrade to 1.2.4,
but this did not change a thing. So it is the same with crmsh 1.2.1 and
crmsh 1.2.4.

Greets,
Jacek




Re: [Pacemaker] CIB verification failure with any change via crmsh

2013-01-23 Thread Lars Marowsky-Bree
On 2013-01-23T16:31:20, Jacek Konieczny  wrote:

> I have recently upgraded Pacemaker on one of my clusters from
> 1.0.something to 1.1.8 and installed crmsh to manage it as I used to.

It'd be helpful if you mentioned which crmsh version you installed. The
errors you get suggest you need to update it.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




Re: [Pacemaker] CIB not saved

2012-03-29 Thread Fiorenza Meini


> Normally we log an error at startup if we can't write there... did
> this not happen?


Yes, it happened. I saw a warning while writing the CIB, but only after I
wrote to this mailing list :)


Regards

--

Fiorenza Meini
Spazio Web S.r.l.

V. Dante Alighieri, 10 - 13900 Biella
Tel.: 015.2431982 - 015.9526066
Fax: 015.2522600
Reg. Imprese, CF e P.I.: 02414430021
Iscr. REA: BI - 188936
Iscr. CCIAA: Biella - 188936
Cap. Soc.: 30.000,00 Euro i.v.



Re: [Pacemaker] CIB not saved

2012-03-29 Thread Andrew Beekhof
On Thu, Mar 29, 2012 at 8:45 PM, Fiorenza Meini  wrote:
> On 29/03/2012 10:12, Rasto Levrinc wrote:
>
>> On Thu, Mar 29, 2012 at 9:54 AM, Fiorenza Meini  wrote:
>>>
>>> Hi there,
>>> a strange thing happened to my two-node cluster: I rebooted both machines
>>> at the same time, and when the OS came up again, no resources were
>>> configured anymore, as if it were a fresh installation. Why?
>>> It was explained to me that the configuration of resources managed by
>>> pacemaker should be in a file called cib.xml, but I cannot find it on the
>>> system. Do I have to specify any particular option in the configuration
>>> file?
>>
>>
>> Normally you shouldn't worry about it. cib.xml is stored in
>> /var/lib/heartbeat/crm/ or similar and the directory should have
>> hacluster:haclient permissions. What distro is it and how did you install
>> it?
>>
>> Rasto
>>
>
> Thanks, it was a permission problem.

Normally we log an error at startup if we can't write there... did
this not happen?



Re: [Pacemaker] CIB not saved

2012-03-29 Thread Fiorenza Meini

On 29/03/2012 10:12, Rasto Levrinc wrote:

> On Thu, Mar 29, 2012 at 9:54 AM, Fiorenza Meini  wrote:
>
>> Hi there,
>> a strange thing happened to my two-node cluster: I rebooted both machines at
>> the same time, and when the OS came up again, no resources were configured
>> anymore, as if it were a fresh installation. Why?
>> It was explained to me that the configuration of resources managed by
>> pacemaker should be in a file called cib.xml, but I cannot find it on the
>> system. Do I have to specify any particular option in the configuration file?
>
>
> Normally you shouldn't worry about it. cib.xml is stored in
> /var/lib/heartbeat/crm/ or similar and the directory should have
> hacluster:haclient permissions. What distro is it and how did you install
> it?
>
> Rasto
>


Thanks, it was a permission problem.

Regards
--

Fiorenza Meini
Spazio Web S.r.l.

V. Dante Alighieri, 10 - 13900 Biella
Tel.: 015.2431982 - 015.9526066
Fax: 015.2522600
Reg. Imprese, CF e P.I.: 02414430021
Iscr. REA: BI - 188936
Iscr. CCIAA: Biella - 188936
Cap. Soc.: 30.000,00 Euro i.v.



Re: [Pacemaker] CIB not saved

2012-03-29 Thread Rasto Levrinc
On Thu, Mar 29, 2012 at 9:54 AM, Fiorenza Meini  wrote:
> Hi there,
> a strange thing happened to my two-node cluster: I rebooted both machines at
> the same time, and when the OS came up again, no resources were configured
> anymore, as if it were a fresh installation. Why?
> It was explained to me that the configuration of resources managed by
> pacemaker should be in a file called cib.xml, but I cannot find it on the
> system. Do I have to specify any particular option in the configuration file?

Normally you shouldn't worry about it. cib.xml is stored in
/var/lib/heartbeat/crm/ or similar and the directory should have
hacluster:haclient permissions. What distro is it and how did you install
it?
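
A quick way to check the ownership could look like the sketch below. It assumes GNU stat (for the -c option) and the /var/lib/heartbeat/crm path mentioned above, which may differ per distro; owned_by is a hypothetical helper, not something shipped with Pacemaker.

```shell
# Hypothetical helper: succeed if the given path is owned by the
# expected "user:group" pair. Returns 2 if the path cannot be inspected.
owned_by() {
    path=$1
    expected=$2
    # %U and %G are the owner and group names (GNU coreutils stat).
    actual=$(stat -c '%U:%G' "$path") || return 2
    [ "$actual" = "$expected" ]
}

# Example check on a cluster node:
#   owned_by /var/lib/heartbeat/crm hacluster:haclient || echo "unexpected owner"
```

If the check fails, the cib has probably not been able to write cib.xml there, which would match a configuration that disappears after a reboot.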

Rasto

-- 
Dipl.-Ing. Rastislav Levrinc
rasto.levr...@gmail.com
Linux Cluster Management Console
http://lcmc.sf.net/



Re: [Pacemaker] cib not connected

2011-11-01 Thread Andrew Beekhof
On Tue, Oct 25, 2011 at 4:08 AM, Proskurin Kirill
 wrote:
> Hello.
>
> corosync-1.4.1
> pacemaker-1.1.5
> pacemaker runs with "ver: 1"
>
> I have run into a strange problem and hope someone can help me.
>
> I have a 9-node cluster. All was fine until I needed to reboot a node.
> After the reboot it does not want to come back into the cluster and fails
> with a "not in our membership" error.
>
> It happens with 2 other nodes in this cluster as well.
>
> The network is fine.
> rm -rf /var/lib/heartbeat/crm/* does not help.
>
> I asked for help on IRC and we did this:
> I ran one node with debug enabled for a few seconds and straced the cib
> process. Both are in the links below.
> In the debug logs we found a "cib not connected" error but can't work out
> the reason for it.
>
> Debug logs: http://dl.dropbox.com/u/1932700/corosync.log.debug.gz
> cib strace: http://dl.dropbox.com/u/1932700/cib-starce.log.gz
>
> P.S. I have the same problem on another cluster and "fixed" it by shutting
> down all nodes (corosync + pacemaker), running rm -rf /var/lib/heartbeat/crm/*,
> and starting all nodes up again. But that's not really an option. :-)

Removing the contents of /var/lib/heartbeat/crm won't achieve much.
It's the restart that's getting corosync unstuck.

>
> --
> Best regards,
> Proskurin Kirill
>

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] cib

2010-10-05 Thread Shravan Mishra
Really appreciate your response.
I just wanted to close this thread by saying that we were able to
figure out the problem.

Since pacemaker was running fine on other virtual machines but not on our
appliance, the problem was clearly our runtime environment.

It turns out that the libxml2 library on our appliance was corrupt.
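
For anyone hitting something similar, one cheap first check is whether the
binary's shared libraries resolve at all. The sketch below filters ldd output
for unresolved entries; the function name missing_libs is made up for this
example. Note that a library which is present but corrupt, as in our case,
would not necessarily show up this way, so treat it as a first step only.

```shell
#!/bin/sh
# Hypothetical helper: read ldd output on stdin and print the names of
# libraries that the dynamic loader could not find.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Example call (binary path taken from the backtrace in this thread):
#   ldd /usr/lib64/heartbeat/cib | missing_libs
```

Empty output means every library was at least found on disk.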

It caused a huge amount of churn on our side as we had to meet a
customer deadline.

But the upside was that I took this as an opportunity to step through the
pacemaker code in gdb.

Thanks all

-Shravan


On Tue, Oct 5, 2010 at 5:01 AM, Andrew Beekhof  wrote:
> On Fri, Oct 1, 2010 at 3:45 PM, Shravan Mishra  
> wrote:
>> Hi,
>>
>> Just a quick question, who generates the very first cib.xml when
>> pacemaker processes are initialized?
>
> The cib
>
>>
>> Thanks
>> Shravan
>>
>> On Thu, Sep 30, 2010 at 4:22 AM, Andrew Beekhof  wrote:
>>> On Tue, Sep 28, 2010 at 11:47 AM, Andrew Beekhof  wrote:
 On Mon, Sep 27, 2010 at 6:26 AM, Shravan Mishra
  wrote:
> Thanks Raoul for the response.
>
> Changing the permission to hacluster:haclient did stop that error.
>
> Now I'm hitting another problem whereby cib is failing to start

 Very strange logs.
 Which distribution is this?
>>>
>>>   
>>>
 What does your corosync.conf look like?


> =
> Sep 27 00:16:29 corosync [pcmk  ] info: update_member: Node
> ha2.itactics.com now has process list:
> 00110012 (1114130)
> Sep 27 00:16:29 corosync [pcmk  ] info: update_member: Node
> ha2.itactics.com now has 1 quorum votes (was 0)
> Sep 27 00:16:29 corosync [pcmk  ] info: send_member_notification:
> Sending membership update 100 to 0 children
> Sep 27 00:16:29 corosync [MAIN  ] Completed service synchronization,
> ready to provide service.
> Sep 27 00:16:30 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
> process cib exited (pid=14889, rc=127)
> Sep 27 00:16:30 corosync [pcmk  ] notice: pcmk_wait_dispatch:
> Respawning failed child process: cib
> Sep 27 00:16:30 corosync [pcmk  ] info: spawn_child: Forked child
> 14896 for process cib
> crmd[14893]: 2010/09/27_00:16:30 WARN: do_cib_control: Couldn't
> complete CIB registration 1 times... pause and retry
> Sep 27 00:16:31 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
> process cib exited (pid=14896, rc=127)
> Sep 27 00:16:31 corosync [pcmk  ] notice: pcmk_wait_dispatch:
> Respawning failed child process: cib
> Sep 27 00:16:31 corosync [pcmk  ] info: spawn_child: Forked child
> 14901 for process cib
> Sep 27 00:16:32 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
> process cib exited (pid=14901, rc=1
> ==
>
>
> I have attached the full logs.
>
> We are using  corosync 1.2.8 and pacemaker 1.1.3.
>
>
>  Thanks.
> Shravan
>
>
>
> On Sat, Sep 25, 2010 at 4:36 AM, Raoul Bhatia [IPAX]  
> wrote:
>> On 24.09.2010 21:41, Shravan Mishra wrote:
>>>
>>> crmd[20612]: 2010/09/24_15:29:57 ERROR: crm_log_init_worker: Cannot
>>> change active directory to /var/lib/heartbeat/cores/hacluster:
>>> Permission denied (13)
>>
>> ls -ald /var/lib/heartbeat/cores/hacluster /var/lib/heartbeat/cores/
>> /var/lib/heartbeat/ /var/lib/ /var/
>>
>> is haclient allowed to cd all the way into
>> /var/lib/heartbeat/cores/hacluster ?
>>
>> cheers,
>>
>

Re: [Pacemaker] cib

2010-10-05 Thread Andrew Beekhof
On Fri, Oct 1, 2010 at 3:45 PM, Shravan Mishra  wrote:
> Hi,
>
> Just a quick question, who generates the very first cib.xml when
> pacemaker processes are initialized?

The cib

>
> Thanks
> Shravan
>
> On Thu, Sep 30, 2010 at 4:22 AM, Andrew Beekhof  wrote:
>> On Tue, Sep 28, 2010 at 11:47 AM, Andrew Beekhof  wrote:
>>> On Mon, Sep 27, 2010 at 6:26 AM, Shravan Mishra
>>>  wrote:
 Thanks Raoul for the response.

 Changing the permission to hacluster:haclient did stop that error.

 Now I'm hitting another problem whereby cib is failing to start
>>>
>>> Very strange logs.
>>> Which distribution is this?
>>
>>   
>>
>>> What does your corosync.conf look like?
>>>
>>>
 =
 Sep 27 00:16:29 corosync [pcmk  ] info: update_member: Node
 ha2.itactics.com now has process list:
 00110012 (1114130)
 Sep 27 00:16:29 corosync [pcmk  ] info: update_member: Node
 ha2.itactics.com now has 1 quorum votes (was 0)
 Sep 27 00:16:29 corosync [pcmk  ] info: send_member_notification:
 Sending membership update 100 to 0 children
 Sep 27 00:16:29 corosync [MAIN  ] Completed service synchronization,
 ready to provide service.
 Sep 27 00:16:30 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
 process cib exited (pid=14889, rc=127)
 Sep 27 00:16:30 corosync [pcmk  ] notice: pcmk_wait_dispatch:
 Respawning failed child process: cib
 Sep 27 00:16:30 corosync [pcmk  ] info: spawn_child: Forked child
 14896 for process cib
 crmd[14893]: 2010/09/27_00:16:30 WARN: do_cib_control: Couldn't
 complete CIB registration 1 times... pause and retry
 Sep 27 00:16:31 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
 process cib exited (pid=14896, rc=127)
 Sep 27 00:16:31 corosync [pcmk  ] notice: pcmk_wait_dispatch:
 Respawning failed child process: cib
 Sep 27 00:16:31 corosync [pcmk  ] info: spawn_child: Forked child
 14901 for process cib
 Sep 27 00:16:32 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
 process cib exited (pid=14901, rc=1
 ==


 I have attached the full logs.

 We are using  corosync 1.2.8 and pacemaker 1.1.3.


  Thanks.
 Shravan



 On Sat, Sep 25, 2010 at 4:36 AM, Raoul Bhatia [IPAX]  
 wrote:
> On 24.09.2010 21:41, Shravan Mishra wrote:
>>
>> crmd[20612]: 2010/09/24_15:29:57 ERROR: crm_log_init_worker: Cannot
>> change active directory to /var/lib/heartbeat/cores/hacluster:
>> Permission denied (13)
>
> ls -ald /var/lib/heartbeat/cores/hacluster /var/lib/heartbeat/cores/
> /var/lib/heartbeat/ /var/lib/ /var/
>
> is haclient allowed to cd all the way into
> /var/lib/heartbeat/cores/hacluster ?
>
> cheers,
>



Re: [Pacemaker] cib

2010-10-01 Thread Shravan Mishra
Hi,

Just a quick question, who generates the very first cib.xml when
pacemaker processes are initialized?

Thanks
Shravan

On Thu, Sep 30, 2010 at 4:22 AM, Andrew Beekhof  wrote:
> On Tue, Sep 28, 2010 at 11:47 AM, Andrew Beekhof  wrote:
>> On Mon, Sep 27, 2010 at 6:26 AM, Shravan Mishra
>>  wrote:
>>> Thanks Raoul for the response.
>>>
>>> Changing the permission to hacluster:haclient did stop that error.
>>>
>>> Now I'm hitting another problem whereby cib is failing to start
>>
>> Very strange logs.
>> Which distribution is this?
>
>   
>
>> What does your corosync.conf look like?
>>
>>
>>> =
>>> Sep 27 00:16:29 corosync [pcmk  ] info: update_member: Node
>>> ha2.itactics.com now has process list:
>>> 00110012 (1114130)
>>> Sep 27 00:16:29 corosync [pcmk  ] info: update_member: Node
>>> ha2.itactics.com now has 1 quorum votes (was 0)
>>> Sep 27 00:16:29 corosync [pcmk  ] info: send_member_notification:
>>> Sending membership update 100 to 0 children
>>> Sep 27 00:16:29 corosync [MAIN  ] Completed service synchronization,
>>> ready to provide service.
>>> Sep 27 00:16:30 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
>>> process cib exited (pid=14889, rc=127)
>>> Sep 27 00:16:30 corosync [pcmk  ] notice: pcmk_wait_dispatch:
>>> Respawning failed child process: cib
>>> Sep 27 00:16:30 corosync [pcmk  ] info: spawn_child: Forked child
>>> 14896 for process cib
>>> crmd[14893]: 2010/09/27_00:16:30 WARN: do_cib_control: Couldn't
>>> complete CIB registration 1 times... pause and retry
>>> Sep 27 00:16:31 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
>>> process cib exited (pid=14896, rc=127)
>>> Sep 27 00:16:31 corosync [pcmk  ] notice: pcmk_wait_dispatch:
>>> Respawning failed child process: cib
>>> Sep 27 00:16:31 corosync [pcmk  ] info: spawn_child: Forked child
>>> 14901 for process cib
>>> Sep 27 00:16:32 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
>>> process cib exited (pid=14901, rc=1
>>> ==
>>>
>>>
>>> I have attached the full logs.
>>>
>>> We are using  corosync 1.2.8 and pacemaker 1.1.3.
>>>
>>>
>>>  Thanks.
>>> Shravan
>>>
>>>
>>>
>>> On Sat, Sep 25, 2010 at 4:36 AM, Raoul Bhatia [IPAX]  
>>> wrote:
 On 24.09.2010 21:41, Shravan Mishra wrote:
>
> crmd[20612]: 2010/09/24_15:29:57 ERROR: crm_log_init_worker: Cannot
> change active directory to /var/lib/heartbeat/cores/hacluster:
> Permission denied (13)

 ls -ald /var/lib/heartbeat/cores/hacluster /var/lib/heartbeat/cores/
 /var/lib/heartbeat/ /var/lib/ /var/

 is haclient allowed to cd all the way into
 /var/lib/heartbeat/cores/hacluster ?

 cheers,

>>>


Re: [Pacemaker] cib

2010-09-30 Thread Andrew Beekhof
On Tue, Sep 28, 2010 at 11:47 AM, Andrew Beekhof  wrote:
> On Mon, Sep 27, 2010 at 6:26 AM, Shravan Mishra
>  wrote:
>> Thanks Raoul for the response.
>>
>> Changing the permission to hacluster:haclient did stop that error.
>>
>> Now I'm hitting another problem whereby cib is failing to start
>
> Very strange logs.
> Which distribution is this?

   

> What does your corosync.conf look like?
>
>
>> =
>> Sep 27 00:16:29 corosync [pcmk  ] info: update_member: Node
>> ha2.itactics.com now has process list:
>> 00110012 (1114130)
>> Sep 27 00:16:29 corosync [pcmk  ] info: update_member: Node
>> ha2.itactics.com now has 1 quorum votes (was 0)
>> Sep 27 00:16:29 corosync [pcmk  ] info: send_member_notification:
>> Sending membership update 100 to 0 children
>> Sep 27 00:16:29 corosync [MAIN  ] Completed service synchronization,
>> ready to provide service.
>> Sep 27 00:16:30 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
>> process cib exited (pid=14889, rc=127)
>> Sep 27 00:16:30 corosync [pcmk  ] notice: pcmk_wait_dispatch:
>> Respawning failed child process: cib
>> Sep 27 00:16:30 corosync [pcmk  ] info: spawn_child: Forked child
>> 14896 for process cib
>> crmd[14893]: 2010/09/27_00:16:30 WARN: do_cib_control: Couldn't
>> complete CIB registration 1 times... pause and retry
>> Sep 27 00:16:31 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
>> process cib exited (pid=14896, rc=127)
>> Sep 27 00:16:31 corosync [pcmk  ] notice: pcmk_wait_dispatch:
>> Respawning failed child process: cib
>> Sep 27 00:16:31 corosync [pcmk  ] info: spawn_child: Forked child
>> 14901 for process cib
>> Sep 27 00:16:32 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
>> process cib exited (pid=14901, rc=1
>> ==
>>
>>
>> I have attached the full logs.
>>
>> We are using  corosync 1.2.8 and pacemaker 1.1.3.
>>
>>
>>  Thanks.
>> Shravan
>>
>>
>>
>> On Sat, Sep 25, 2010 at 4:36 AM, Raoul Bhatia [IPAX]  
>> wrote:
>>> On 24.09.2010 21:41, Shravan Mishra wrote:

 crmd[20612]: 2010/09/24_15:29:57 ERROR: crm_log_init_worker: Cannot
 change active directory to /var/lib/heartbeat/cores/hacluster:
 Permission denied (13)
>>>
>>> ls -ald /var/lib/heartbeat/cores/hacluster /var/lib/heartbeat/cores/
>>> /var/lib/heartbeat/ /var/lib/ /var/
>>>
>>> is haclient allowed to cd all the way into
>>> /var/lib/heartbeat/cores/hacluster ?
>>>
>>> cheers,
>>>
>>


Re: [Pacemaker] cib

2010-09-29 Thread Shravan Mishra
Some more info:


root     14170 14166  0 12:23 ?        00:00:00 /usr/lib64/heartbeat/stonithd
nobody   14172 14166  0 12:23 ?        00:00:00 /usr/lib64/heartbeat/lrmd
82       14173 14166  0 12:23 ?        00:00:00 /usr/lib64/heartbeat/attrd
82       14174 14166  0 12:23 ?        00:00:00 /usr/lib64/heartbeat/pengine
82       14175 14166  0 12:23 ?        00:00:00 /usr/lib64/heartbeat/crmd




--lrmd is running as nobody when it should have been root.

I'm not sure why that would happen.
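
The same check can be scripted roughly as below. This is a sketch only: the
function name not_root is made up, and which daemons are expected to run as
root is passed in by the caller rather than known to the script.

```shell
#!/bin/sh
# Hypothetical helper: stdin carries lines of "USER COMMAND"; the arguments
# are daemon names expected to run as root. Prints "name user" for each
# listed daemon that runs as someone else.
not_root() {
    names=" $* "
    awk -v names="$names" '
        {
            n = split($2, parts, "/")   # strip any leading directory
            base = parts[n]
            if (index(names, " " base " ") && $1 != "root")
                print base, $1
        }'
}

# Example call:
#   ps -e -o user= -o comm= | not_root lrmd stonithd
```

With the process list above as input, this would report lrmd running as
nobody.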


Thanks
Shravan

On Wed, Sep 29, 2010 at 10:29 AM, Shravan Mishra
 wrote:
> Hi,
>
>
>
> I did a bt on the core, this is what I found:
>
>
> ==
> Core was generated by `/usr/lib64/heartbeat/cib'.
> Program terminated with signal 11, Segmentation fault.
> [New process 12340]
> #0  0x7f23acc553fa in strncmp () from /lib64/libc.so.6
> (gdb) bt
> #0  0x7f23acc553fa in strncmp () from /lib64/libc.so.6
> #1  0x7f23acf87c39 in __xmlParserInputBufferCreateFilename () from
> /usr/lib64/libxml2.so.2
> #2  0x7f23acf6147b in xmlNewInputFromFile () from /usr/lib64/libxml2.so.2
> #3  0x7f23acf641d4 in xmlCreateURLParserCtxt () from 
> /usr/lib64/libxml2.so.2
> #4  0x7f23acf78f3a in xmlReadFile () from /usr/lib64/libxml2.so.2
> #5  0x7f23ad0167b1 in xmlRelaxNGParse () from /usr/lib64/libxml2.so.2
> #6  0x7f23ae967321 in validate_with_relaxng (doc=0x626020, to_logs=1,
>    relaxng_file=0x7f23ae97ba10
> "/usr/share/pacemaker/pacemaker-1.2.rng") at xml.c:
> #7  0x7f23ae967769 in validate_with (xml=0x6260d0, method=6,
> to_logs=1) at xml.c:2287
> #8  0x7f23ae967b9f in validate_xml (xml_blob=0x6260d0,
> validation=0x626910 "pacemaker-1.2",
>    to_logs=1) at xml.c:2373
> #9  0x00405b23 in readCibXmlFile (dir=0x41b580
> "/var/lib/heartbeat/crm",
>    file=0x41c40a "cib.xml", discard_status=1) at io.c:396
> #10 0x00412285 in startCib (filename=0x41c40a "cib.xml") at main.c:613
> #11 0x00411309 in cib_init () at main.c:408
> #12 0x0041064a in main (argc=1, argv=0x7fff942e0f58) at main.c:218
>
>
> ==
>
>
>
> If it's a fresh install let's say then cib.xml will not exist.
> Then why is it looking for this file on startup.
>
>
> Sincerely
> Shravan
>
>
> On Tue, Sep 28, 2010 at 10:24 AM, Shravan Mishra
>  wrote:
>> Sorry forgot to attach my corosync.conf.
>>
>>
>> =
>> totem {
>>        version: 2
>> #       token: 3000
>> #       token_retransmits_before_loss_const: 10
>> #       join: 60
>> #       consensus: 1500
>> #       vsftype: none
>> #       max_messages: 20
>> #       clear_node_high_bit: yes
>>        secauth: off
>>        threads: 0
>> #       rrp_mode: passive
>>
>>        interface {
>>                ringnumber: 0
>>                bindnetaddr: 192.168.2.0
>>                #mcastaddr: 226.94.1.1
>>                broadcast: yes
>>                mcastport: 5405
>>        }
>> #       interface {
>> #               ringnumber: 1
>> #               bindnetaddr: 172.20.20.0
>>                #mcastaddr: 226.94.1.1
>> #               broadcast: yes
>> #               mcastport: 5405
>> #       }
>> }
>>
>> logging {
>>        fileline: off
>>        to_stderr: yes
>>        to_logfile: yes
>>        to_syslog: yes
>>        logfile: /tmp/corosync.log
>>        debug: off
>>        timestamp: on
>>        logger_subsys {
>>                subsys: AMF
>>                debug: off
>>        }
>> }
>>
>> service {
>>        name: pacemaker
>>        ver: 0
>> }
>>
>> aisexec {
>>        user:root
>>        group: root
>> }
>>
>> amf {
>>        mode: disabled
>> }
>>
>>
>>
>>
>> =
>>
>> On Tue, Sep 28, 2010 at 10:10 AM, Shravan Mishra
>>  wrote:
>>> Hi Andrew,
>>>
>>> I'm attaching another log file, as I reflashed my machine and started
>>> everything from scratch.
>>> Looks like my old system got a little messed up as I was trying to
>>> install the old HA libraries (corosync/pacemaker) that were initially
>>> working for me.
>>>
>>>
>>> Here are the details:
>>>
>>> As of now  I just want to see cib/attrd up so I have only one machine
>>> where I want to see things in a sane state.
>>>
>>> [r...@ha2 ~]# /usr/sbin/corosync -v
>>> Corosync Cluster Engine, version '1.2.8' SVN revision '3035'
>>> Copyright (c) 2006-2009 Red Hat, Inc.
>>>
>>> [r...@ha2 ~]# /usr/lib64/heartbeat/crmd version
>>> CRM Version: 1.1.2 (e0d731c2b1be446b27a73327a53067bf6230fb6a)
>>>
>>>
>>>
>>> Pacemaker version is 1.1, the release based on the above output is
>>> 1.1.2 if I correctly understand.
>>>
>>> This one is showing --
>>>
>>> Sep 27 12:30:45 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
>>> process cib terminated with signal 11 (pid=9216, core=false)
>>>
>>>
>>> Please find corosync logs attached.
>>>
>>> Thanks
>>> Shravan
>>>
>>>
>>> On Tue, Sep 28, 2010 at 5:47 AM, Andrew Beekhof  wrote:
 On Mon, Sep 27, 2010 at 6:26 AM, Shravan Mishra
  wrote:
> Thanks Raoul for the response.
>
> Changing the permission to hacluster:haclient did stop

Re: [Pacemaker] cib

2010-09-29 Thread Shravan Mishra
Hi,



I did a bt on the core, this is what I found:


==
Core was generated by `/usr/lib64/heartbeat/cib'.
Program terminated with signal 11, Segmentation fault.
[New process 12340]
#0  0x7f23acc553fa in strncmp () from /lib64/libc.so.6
(gdb) bt
#0  0x7f23acc553fa in strncmp () from /lib64/libc.so.6
#1  0x7f23acf87c39 in __xmlParserInputBufferCreateFilename () from
/usr/lib64/libxml2.so.2
#2  0x7f23acf6147b in xmlNewInputFromFile () from /usr/lib64/libxml2.so.2
#3  0x7f23acf641d4 in xmlCreateURLParserCtxt () from /usr/lib64/libxml2.so.2
#4  0x7f23acf78f3a in xmlReadFile () from /usr/lib64/libxml2.so.2
#5  0x7f23ad0167b1 in xmlRelaxNGParse () from /usr/lib64/libxml2.so.2
#6  0x7f23ae967321 in validate_with_relaxng (doc=0x626020, to_logs=1,
relaxng_file=0x7f23ae97ba10
"/usr/share/pacemaker/pacemaker-1.2.rng") at xml.c:
#7  0x7f23ae967769 in validate_with (xml=0x6260d0, method=6,
to_logs=1) at xml.c:2287
#8  0x7f23ae967b9f in validate_xml (xml_blob=0x6260d0,
validation=0x626910 "pacemaker-1.2",
to_logs=1) at xml.c:2373
#9  0x00405b23 in readCibXmlFile (dir=0x41b580
"/var/lib/heartbeat/crm",
file=0x41c40a "cib.xml", discard_status=1) at io.c:396
#10 0x00412285 in startCib (filename=0x41c40a "cib.xml") at main.c:613
#11 0x00411309 in cib_init () at main.c:408
#12 0x0041064a in main (argc=1, argv=0x7fff942e0f58) at main.c:218


==



On a fresh install, say, cib.xml will not exist.
So why is it looking for this file on startup?
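
Since frames #1 to #5 of the backtrace are inside libxml2 while it parses
/usr/share/pacemaker/pacemaker-1.2.rng, one way to take pacemaker out of the
picture might be to load the same schema with xmllint, which uses the same
libxml2. A minimal sketch, assuming xmllint is installed; the function name
check_schema is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: validate a document against a RelaxNG schema with
# xmllint. Returns 2 if a prerequisite is missing, otherwise xmllint's own
# exit status.
check_schema() {
    schema=$1
    doc=$2
    [ -r "$schema" ] || { echo "schema not readable: $schema" >&2; return 2; }
    [ -r "$doc" ] || { echo "document not readable: $doc" >&2; return 2; }
    command -v xmllint >/dev/null 2>&1 || { echo "xmllint not installed" >&2; return 2; }
    xmllint --noout --relaxng "$schema" "$doc"
}

# Example call (paths taken from the backtrace above):
#   check_schema /usr/share/pacemaker/pacemaker-1.2.rng /var/lib/heartbeat/crm/cib.xml
```

If xmllint also crashes here, the problem is more likely in libxml2 or the
schema file than in the cib itself.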


Sincerely
Shravan


On Tue, Sep 28, 2010 at 10:24 AM, Shravan Mishra
 wrote:
> Sorry forgot to attach my corosync.conf.
>
>
> =
> totem {
>        version: 2
> #       token: 3000
> #       token_retransmits_before_loss_const: 10
> #       join: 60
> #       consensus: 1500
> #       vsftype: none
> #       max_messages: 20
> #       clear_node_high_bit: yes
>        secauth: off
>        threads: 0
> #       rrp_mode: passive
>
>        interface {
>                ringnumber: 0
>                bindnetaddr: 192.168.2.0
>                #mcastaddr: 226.94.1.1
>                broadcast: yes
>                mcastport: 5405
>        }
> #       interface {
> #               ringnumber: 1
> #               bindnetaddr: 172.20.20.0
>                #mcastaddr: 226.94.1.1
> #               broadcast: yes
> #               mcastport: 5405
> #       }
> }
>
> logging {
>        fileline: off
>        to_stderr: yes
>        to_logfile: yes
>        to_syslog: yes
>        logfile: /tmp/corosync.log
>        debug: off
>        timestamp: on
>        logger_subsys {
>                subsys: AMF
>                debug: off
>        }
> }
>
> service {
>        name: pacemaker
>        ver: 0
> }
>
> aisexec {
>        user:root
>        group: root
> }
>
> amf {
>        mode: disabled
> }
>
>
>
>
> =
>
> On Tue, Sep 28, 2010 at 10:10 AM, Shravan Mishra
>  wrote:
>> Hi Andrew,
>>
>> I'm attaching another log file, as I reflashed my machine and started
>> everything from scratch.
>> Looks like my old system got a little messed up as I was trying to
>> install the old HA libraries (corosync/pacemaker) that were initially
>> working for me.
>>
>>
>> Here are the details:
>>
>> As of now  I just want to see cib/attrd up so I have only one machine
>> where I want to see things in a sane state.
>>
>> [r...@ha2 ~]# /usr/sbin/corosync -v
>> Corosync Cluster Engine, version '1.2.8' SVN revision '3035'
>> Copyright (c) 2006-2009 Red Hat, Inc.
>>
>> [r...@ha2 ~]# /usr/lib64/heartbeat/crmd version
>> CRM Version: 1.1.2 (e0d731c2b1be446b27a73327a53067bf6230fb6a)
>>
>>
>>
>> Pacemaker version is 1.1, the release based on the above output is
>> 1.1.2 if I correctly understand.
>>
>> This one is showing --
>>
>> Sep 27 12:30:45 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
>> process cib terminated with signal 11 (pid=9216, core=false)
>>
>>
>> Please find corosync logs attached.
>>
>> Thanks
>> Shravan
>>
>>
>> On Tue, Sep 28, 2010 at 5:47 AM, Andrew Beekhof  wrote:
>>> On Mon, Sep 27, 2010 at 6:26 AM, Shravan Mishra
>>>  wrote:
 Thanks Raoul for the response.

 Changing the permission to hacluster:haclient did stop that error.

 Now I'm hitting another problem whereby cib is failing to start
>>>
>>> Very strange logs.
>>> Which distribution is this?
>>> What does your corosync.conf look like?
>>>
>>>
 =
 Sep 27 00:16:29 corosync [pcmk  ] info: update_member: Node
 ha2.itactics.com now has process list:
 00110012 (1114130)
 Sep 27 00:16:29 corosync [pcmk  ] info: update_member: Node
 ha2.itactics.com now has 1 quorum votes (was 0)
 Sep 27 00:16:29 corosync [pcmk  ] info: send_member_notification:
 Sending membership update 100 to 0 children
 Sep 27 00:16:29 corosync [MAIN  ] Completed service synchronization,
 ready to provide service.
 Sep 27 00:16:30 corosync [pcmk  ] ERROR: pcmk_wait_disp

Re: [Pacemaker] cib

2010-09-28 Thread Shravan Mishra
Sorry, I forgot to attach my corosync.conf.


=
totem {
        version: 2
#       token: 3000
#       token_retransmits_before_loss_const: 10
#       join: 60
#       consensus: 1500
#       vsftype: none
#       max_messages: 20
#       clear_node_high_bit: yes
        secauth: off
        threads: 0
#       rrp_mode: passive

        interface {
                ringnumber: 0
                bindnetaddr: 192.168.2.0
                #mcastaddr: 226.94.1.1
                broadcast: yes
                mcastport: 5405
        }
#       interface {
#               ringnumber: 1
#               bindnetaddr: 172.20.20.0
                #mcastaddr: 226.94.1.1
#               broadcast: yes
#               mcastport: 5405
#       }
}

logging {
        fileline: off
        to_stderr: yes
        to_logfile: yes
        to_syslog: yes
        logfile: /tmp/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}

service {
        name: pacemaker
        ver: 0
}

aisexec {
        user: root
        group: root
}

amf {
        mode: disabled
}




=
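
As a side note, the "ver" value in the service block matters here: as far as I
understand, with ver: 0 the corosync plugin spawns the pacemaker daemons
itself, which matches the spawn_child lines in the logs. A small sketch for
reading that value from a config file follows; the function name service_ver
is made up, and the parsing is deliberately naive (it assumes one setting per
line and no nested blocks inside service).

```shell
#!/bin/sh
# Hypothetical helper: print the "ver" setting from the service { } block
# of a corosync.conf-style file.
service_ver() {
    awk '
        /^[[:space:]]*#/ { next }                        # skip comment lines
        /^[[:space:]]*service[[:space:]]*[{]/ { in_service = 1; next }
        in_service && /[}]/ { in_service = 0 }           # end of the block
        in_service && /^[[:space:]]*ver[[:space:]]*:/ {
            sub(/^[^:]*:[[:space:]]*/, "")               # drop "ver:" prefix
            sub(/[[:space:]]+$/, "")                     # drop trailing blanks
            print
        }
    ' "$1"
}

# Example call (the usual location, which may differ on your system):
#   service_ver /etc/corosync/corosync.conf
```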

On Tue, Sep 28, 2010 at 10:10 AM, Shravan Mishra
 wrote:
> Hi Andrew,
>
> I'm attaching another log file, as I reflashed my machine and started
> everything from scratch.
> Looks like my old system got a little messed up as I was trying to
> install the old HA libraries (corosync/pacemaker) that were initially
> working for me.
>
>
> Here are the details:
>
> As of now  I just want to see cib/attrd up so I have only one machine
> where I want to see things in a sane state.
>
> [r...@ha2 ~]# /usr/sbin/corosync -v
> Corosync Cluster Engine, version '1.2.8' SVN revision '3035'
> Copyright (c) 2006-2009 Red Hat, Inc.
>
> [r...@ha2 ~]# /usr/lib64/heartbeat/crmd version
> CRM Version: 1.1.2 (e0d731c2b1be446b27a73327a53067bf6230fb6a)
>
>
>
> Pacemaker version is 1.1, the release based on the above output is
> 1.1.2 if I correctly understand.
>
> This one is showing --
>
> Sep 27 12:30:45 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
> process cib terminated with signal 11 (pid=9216, core=false)
>
>
> Please find corosync logs attached.
>
> Thanks
> Shravan
>
>
> On Tue, Sep 28, 2010 at 5:47 AM, Andrew Beekhof  wrote:
>> On Mon, Sep 27, 2010 at 6:26 AM, Shravan Mishra
>>  wrote:
>>> Thanks Raoul for the response.
>>>
>>> Changing the permission to hacluster:haclient did stop that error.
>>>
>>> Now I'm hitting another problem whereby cib is failing to start
>>
>> Very strange logs.
>> Which distribution is this?
>> What does your corosync.conf look like?
>>
>>
>>> =
>>> Sep 27 00:16:29 corosync [pcmk  ] info: update_member: Node
>>> ha2.itactics.com now has process list:
>>> 00110012 (1114130)
>>> Sep 27 00:16:29 corosync [pcmk  ] info: update_member: Node
>>> ha2.itactics.com now has 1 quorum votes (was 0)
>>> Sep 27 00:16:29 corosync [pcmk  ] info: send_member_notification:
>>> Sending membership update 100 to 0 children
>>> Sep 27 00:16:29 corosync [MAIN  ] Completed service synchronization,
>>> ready to provide service.
>>> Sep 27 00:16:30 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
>>> process cib exited (pid=14889, rc=127)
>>> Sep 27 00:16:30 corosync [pcmk  ] notice: pcmk_wait_dispatch:
>>> Respawning failed child process: cib
>>> Sep 27 00:16:30 corosync [pcmk  ] info: spawn_child: Forked child
>>> 14896 for process cib
>>> crmd[14893]: 2010/09/27_00:16:30 WARN: do_cib_control: Couldn't
>>> complete CIB registration 1 times... pause and retry
>>> Sep 27 00:16:31 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
>>> process cib exited (pid=14896, rc=127)
>>> Sep 27 00:16:31 corosync [pcmk  ] notice: pcmk_wait_dispatch:
>>> Respawning failed child process: cib
>>> Sep 27 00:16:31 corosync [pcmk  ] info: spawn_child: Forked child
>>> 14901 for process cib
>>> Sep 27 00:16:32 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
>>> process cib exited (pid=14901, rc=1
>>> ==
>>>
>>>
>>> I have attached the full logs.
>>>
>>> We are using  corosync 1.2.8 and pacemaker 1.1.3.
>>>
>>>
>>>  Thanks.
>>> Shravan
>>>
>>>
>>>
>>> On Sat, Sep 25, 2010 at 4:36 AM, Raoul Bhatia [IPAX]  
>>> wrote:
 On 24.09.2010 21:41, Shravan Mishra wrote:
>
> crmd[20612]: 2010/09/24_15:29:57 ERROR: crm_log_init_worker: Cannot
> change active directory to /var/lib/heartbeat/cores/hacluster:
> Permission denied (13)

 ls -ald /var/lib/heartbeat/cores/hacluster /var/lib/heartbeat/cores/
 /var/lib/heartbeat/ /var/lib/ /var/

 is haclient allowed to cd all the way into
 /var/lib/heartbeat/cores/hacluster ?

 cheers,

>>>

Re: [Pacemaker] cib

2010-09-28 Thread Andrew Beekhof
On Mon, Sep 27, 2010 at 6:26 AM, Shravan Mishra
 wrote:
> Thanks Raoul for the response.
>
> Changing the permission to hacluster:haclient did stop that error.
>
> Now I'm hitting another problem whereby cib is failing to start

Very strange logs.
Which distribution is this?
What does your corosync.conf look like?


> =
> Sep 27 00:16:29 corosync [pcmk  ] info: update_member: Node
> ha2.itactics.com now has process list:
> 00110012 (1114130)
> Sep 27 00:16:29 corosync [pcmk  ] info: update_member: Node
> ha2.itactics.com now has 1 quorum votes (was 0)
> Sep 27 00:16:29 corosync [pcmk  ] info: send_member_notification:
> Sending membership update 100 to 0 children
> Sep 27 00:16:29 corosync [MAIN  ] Completed service synchronization,
> ready to provide service.
> Sep 27 00:16:30 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
> process cib exited (pid=14889, rc=127)
> Sep 27 00:16:30 corosync [pcmk  ] notice: pcmk_wait_dispatch:
> Respawning failed child process: cib
> Sep 27 00:16:30 corosync [pcmk  ] info: spawn_child: Forked child
> 14896 for process cib
> crmd[14893]: 2010/09/27_00:16:30 WARN: do_cib_control: Couldn't
> complete CIB registration 1 times... pause and retry
> Sep 27 00:16:31 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
> process cib exited (pid=14896, rc=127)
> Sep 27 00:16:31 corosync [pcmk  ] notice: pcmk_wait_dispatch:
> Respawning failed child process: cib
> Sep 27 00:16:31 corosync [pcmk  ] info: spawn_child: Forked child
> 14901 for process cib
> Sep 27 00:16:32 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
> process cib exited (pid=14901, rc=1
> ==
>
>
> I have attached the full logs.
>
> We are using  corosync 1.2.8 and pacemaker 1.1.3.
>
>
>  Thanks.
> Shravan
>
>
>
> On Sat, Sep 25, 2010 at 4:36 AM, Raoul Bhatia [IPAX]  wrote:
>> On 24.09.2010 21:41, Shravan Mishra wrote:
>>>
>>> crmd[20612]: 2010/09/24_15:29:57 ERROR: crm_log_init_worker: Cannot
>>> change active directory to /var/lib/heartbeat/cores/hacluster:
>>> Permission denied (13)
>>
>> ls -ald /var/lib/heartbeat/cores/hacluster /var/lib/heartbeat/cores/
>> /var/lib/heartbeat/ /var/lib/ /var/
>>
>> is haclient allowed to cd all the way into
>> /var/lib/heartbeat/cores/hacluster ?
>>
>> cheers,
>>
>

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] cib

2010-09-25 Thread Raoul Bhatia [IPAX]

On 24.09.2010 21:41, Shravan Mishra wrote:
>
> crmd[20612]: 2010/09/24_15:29:57 ERROR: crm_log_init_worker: Cannot
> change active directory to /var/lib/heartbeat/cores/hacluster:
> Permission denied (13)


ls -ald /var/lib/heartbeat/cores/hacluster /var/lib/heartbeat/cores/ 
/var/lib/heartbeat/ /var/lib/ /var/


is haclient allowed to cd all the way into 
/var/lib/heartbeat/cores/hacluster ?
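
The same walk can be sketched in Python, as a rough equivalent of the ls -ald line above. It lists every directory from / down to the target with its mode, owner uid and group gid, plus whether the "others" search bit is set. The helper name describe_path_chain is made up for this note, and the sketch only reports raw permission bits; it does not decide whether a particular user or group may actually enter the directory.

```python
import os
import stat


def describe_path_chain(target):
    """Return one record per directory from / down to target (hypothetical helper)."""
    current = os.path.abspath(target)
    chain = []
    while True:
        chain.append(current)
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    records = []
    for path in reversed(chain):
        st = os.stat(path)
        records.append({
            "path": path,
            "mode": stat.filemode(st.st_mode),  # e.g. 'drwxr-x---'
            "uid": st.st_uid,
            "gid": st.st_gid,
            "other_can_search": bool(st.st_mode & stat.S_IXOTH),
        })
    return records


if __name__ == "__main__":
    # Path taken from the error message in this thread; it may not exist elsewhere.
    target = "/var/lib/heartbeat/cores/hacluster"
    if os.path.isdir(target):
        for rec in describe_path_chain(target):
            print(rec["mode"], rec["uid"], rec["gid"], rec["path"])
```

A directory whose owner, group and "others" bits all deny search to the cluster user anywhere along the chain would produce the "Permission denied (13)" seen above.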


cheers,



Re: [Pacemaker] cib fails to start until host is rebooted

2010-09-17 Thread Michael Smith

Andrew Beekhof wrote:
> I spoke to Steve, and the only thing he could come up with was that
> the group might not be correct.
>
> When the cluster is in this state, please run:
>    ps x -o pid,euser,ruser,egroup,rgroup,command
>
> And compare it to the "normal" output.
>
> Also, confirm that there is only one group named haclient, and one
> user named hacluster.


Thanks, that was the right track. Looks like I fat-fingered a '9' in 
front of the '0' in root's gid in /etc/passwd:


root:x:0:90:root:/root:/bin/bash

gid 90 happens to be owned by haclient. With root's gid fixed, 
everything works as expected.
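
For anyone hitting the same symptom, the two checks from this exchange can be sketched in Python: count how many passwd- or group-style entries carry a given name, and look up a user's primary gid. The parsing is plain colon splitting, the helper names are hypothetical, and apart from the root line quoted above the sample entries are invented for illustration.

```python
def count_entries(text, name):
    """Count colon-separated entries whose first field equals name."""
    return sum(
        1
        for line in text.splitlines()
        if line and not line.startswith("#") and line.split(":")[0] == name
    )


def primary_gid(passwd_text, user):
    """Return the primary gid (4th field) of user, or None if absent."""
    for line in passwd_text.splitlines():
        fields = line.split(":")
        if len(fields) >= 4 and fields[0] == user:
            return int(fields[3])
    return None


# The root line is the one quoted in the message; the rest is made up.
PASSWD = """root:x:0:90:root:/root:/bin/bash
hacluster:x:90:90:cluster user:/var/lib/heartbeat/cores/hacluster:/bin/false
"""
GROUP = """root:x:0:
haclient:x:90:
"""

if __name__ == "__main__":
    print("hacluster users:", count_entries(PASSWD, "hacluster"))
    print("haclient groups:", count_entries(GROUP, "haclient"))
    gid = primary_gid(PASSWD, "root")
    if gid != 0:
        print("warning: root's primary gid is", gid, "rather than 0")
```

On a real system the text would come from /etc/passwd and /etc/group rather than the inline samples.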


Mike



Re: [Pacemaker] cib fails to start until host is rebooted

2010-09-17 Thread Andrew Beekhof
I spoke to Steve, and the only thing he could come up with was that
the group might not be correct.

When the cluster is in this state, please run:
   ps x -o pid,euser,ruser,egroup,rgroup,command

And compare it to the "normal" output.

Also, confirm that there is only one group named haclient, and one
user named hacluster.
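
One way to do the comparison, sketched in Python: capture the ps output once in the working state and once in the failing state, drop the pid column, and list the rows that only appear in the failing capture. The function names and both sample listings are invented for illustration; only the ps column order comes from the command above.

```python
def credential_rows(ps_output):
    """Parse `ps x -o pid,euser,ruser,egroup,rgroup,command` output.

    Returns a set of (euser, ruser, egroup, rgroup, command) tuples,
    skipping the header line and ignoring the pid column.
    """
    rows = set()
    for line in ps_output.strip().splitlines()[1:]:
        fields = line.split(None, 5)
        if len(fields) == 6:
            rows.add(tuple(fields[1:]))
    return rows


def changed_credentials(normal, broken):
    """Rows present in the broken capture but not in the normal one."""
    return sorted(credential_rows(broken) - credential_rows(normal))


# Invented sample captures, not taken from the thread.
NORMAL = """  PID EUSER    RUSER    EGROUP   RGROUP   COMMAND
 5851 root     root     root     root     corosync
"""
BROKEN = """  PID EUSER    RUSER    EGROUP   RGROUP   COMMAND
 6012 root     root     haclient haclient corosync
"""

if __name__ == "__main__":
    for row in changed_credentials(NORMAL, BROKEN):
        print("differs from normal:", row)
```

Ignoring the pid keeps a simple restart from showing up as a difference.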

On Tue, Sep 7, 2010 at 11:03 PM, Michael Smith  wrote:
> Michael Smith wrote:
>>
>> On Mon, 6 Sep 2010, Andrew Beekhof wrote:
>>
>>>>> Is /dev/shm full (or not mounted) by any chance?
>>>>
>>>> No - I tried clearing that out, too.
>>>
>>> And corosync is actually running?
>>
>> Yes, it's logging "[IPC   ] Invalid IPC credentials." when cib tries to
>> connect.
>
> For what it's worth, I have the same problem after updating:
>
>
> cluster-glue-1.0.6-2.1
> corosync-1.2.7-1.1
> openais-1.1.3-1.1
> pacemaker-1.1.2.1-5.1
>
> Mike
>


Re: [Pacemaker] cib fails to start until host is rebooted

2010-09-07 Thread Michael Smith

Michael Smith wrote:
> On Mon, 6 Sep 2010, Andrew Beekhof wrote:
>
>>>> Is /dev/shm full (or not mounted) by any chance?
>>>
>>> No - I tried clearing that out, too.
>>
>> And corosync is actually running?
>
> Yes, it's logging "[IPC   ] Invalid IPC credentials." when cib tries to
> connect.


For what it's worth, I have the same problem after updating:


cluster-glue-1.0.6-2.1
corosync-1.2.7-1.1
openais-1.1.3-1.1
pacemaker-1.1.2.1-5.1

Mike



Re: [Pacemaker] cib fails to start until host is rebooted

2010-09-06 Thread Michael Smith
On Mon, 6 Sep 2010, Andrew Beekhof wrote:

> >> Is /dev/shm full (or not mounted) by any chance?
> >
> > No - I tried clearing that out, too.
> 
> And corosync is actually running?

Yes, it's logging "[IPC   ] Invalid IPC credentials." when cib tries to 
connect.

Mike



Re: [Pacemaker] cib fails to start until host is rebooted

2010-09-05 Thread Andrew Beekhof
On Thu, Sep 2, 2010 at 2:18 PM, Michael Smith  wrote:
> On Thu, 2 Sep 2010, Andrew Beekhof wrote:
>
>> On Mon, Aug 30, 2010 at 10:04 PM, Michael Smith  wrote:
>> > Hi,
>> >
>> > I have a pacemaker/corosync setup on a bunch of fully patched SLES11 SP1
>> > systems. On one of the systems, if I /etc/init.d/openais stop, then
>> > /etc/init.d/openais start, pacemaker fails to come up:
>>
>> Is /dev/shm full (or not mounted) by any chance?
>
> No - I tried clearing that out, too.

And corosync is actually running?



Re: [Pacemaker] cib fails to start until host is rebooted

2010-09-02 Thread Michael Smith
On Thu, 2 Sep 2010, Andrew Beekhof wrote:

> On Mon, Aug 30, 2010 at 10:04 PM, Michael Smith  wrote:
> > Hi,
> >
> > I have a pacemaker/corosync setup on a bunch of fully patched SLES11 SP1
> > systems. On one of the systems, if I /etc/init.d/openais stop, then
> > /etc/init.d/openais start, pacemaker fails to come up:
> 
> Is /dev/shm full (or not mounted) by any chance?

No - I tried clearing that out, too.

Thanks,
Mike



Re: [Pacemaker] cib fails to start until host is rebooted

2010-09-01 Thread Andrew Beekhof
On Mon, Aug 30, 2010 at 10:04 PM, Michael Smith  wrote:
> Hi,
>
> I have a pacemaker/corosync setup on a bunch of fully patched SLES11 SP1
> systems. On one of the systems, if I /etc/init.d/openais stop, then
> /etc/init.d/openais start, pacemaker fails to come up:

Is /dev/shm full (or not mounted) by any chance?
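
A quick way to answer that question, sketched in Python: report whether the path exists, whether it is a mount point, and what fraction of it is in use. The helper name shm_status and the 95% threshold in the example are arbitrary choices for this note, not anything corosync itself checks.

```python
import os
import shutil


def shm_status(path="/dev/shm"):
    """Report whether path is a mount point and how full it is."""
    if not os.path.isdir(path):
        return {"path": path, "exists": False}
    usage = shutil.disk_usage(path)
    return {
        "path": path,
        "exists": True,
        "is_mount": os.path.ismount(path),
        "used_fraction": usage.used / usage.total if usage.total else 0.0,
    }


if __name__ == "__main__":
    status = shm_status()
    print(status)
    if not status["exists"] or not status["is_mount"]:
        print("/dev/shm is missing or not a separate mount")
    elif status["used_fraction"] > 0.95:  # arbitrary threshold
        print("/dev/shm is nearly full")
```
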

>
> Aug 30 15:48:09 xen-test1 cib: [5858]: info: crm_cluster_connect: Connecting
> to OpenAIS
> Aug 30 15:48:09 xen-test1 cib: [5858]: info: init_ais_connection: Creating
> connection to our AIS plugin
>
> Aug 30 15:48:10 xen-test1 corosync[5851]:  [IPC   ] Invalid IPC credentials.
> Aug 30 15:48:10 xen-test1 cib: [5858]: info: init_ais_connection: Connection
> to our AIS plugin (9) failed: unknown (100)
>
> Aug 30 15:48:10 xen-test1 cib: [5858]: CRIT: cib_init: Cannot sign in to the
> cluster... terminating
>
> I've tried rm /var/run/crm/*, but it doesn't help; the only fix is to
> reboot.
>
> I have an strace -f of /etc/init.d/openais start, if that would help.
>
> cluster-glue-1.0.5-0.5.1
> corosync-1.2.1-0.5.1
> libpacemaker3-1.1.2-0.2.1
> libcorosync4-1.2.1-0.5.1
> libopenais3-1.1.2-0.5.19
> pacemaker-1.1.2-0.2.1
> openais-1.1.2-0.5.19
>
> Thanks,
> Mike
>


Re: [Pacemaker] CIB write-to-disk bug?

2010-04-08 Thread Andrew Beekhof
On Fri, Apr 2, 2010 at 4:16 PM, Alan Robertson  wrote:
>> Do it again, with higher log level.  Sorry, no time right now to rebuild
>> your exact thing with your exact gcc and stuff to look at your core file.
>
> You can just download the RPM and extract the objects.  That's what I used.

Spend half a day mirroring the RHEL54 tree and farting around with gdb
to try to get a sensible trace? Not likely.
And please tell me these aren't production machines, you really should
know better than to be using external/ssh outside of CTS.

Back to the logs, it looks like the initial digest is incorrect.

Mar 31 19:02:52 vhost0384 cib: [13294]: info: write_cib_contents:
Wrote version 0.50.0 of the CIB to disk (digest:
316049fa7ee8d2e107573ce7cded07cf)
Mar 31 19:02:52 vhost0384 cib: [13294]: info: retrieveCib: Reading
cluster configuration from: /var/lib/heartbeat/crm/cib.uHFtAW (digest:
/var/lib/heartbeat/crm/cib.GUdD9T)
Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: validate_cib_digest:
Digest comparision failed: expected 316049fa7ee8d2e107573ce7cded07cf
(/var/lib/heartbeat/crm/cib.GUdD9T), calculated
0bac3440f5c42f0f37d22ea7dfe433e8

Based on cib.uHFtAW, the correct digest would appear to be the
calculated one and not the one written to cib.GUdD9T.
Absolutely no idea how that could be the case, is it repeatable?
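
To compare the two digests across several nodes' logs, a small Python sketch like the following pulls the expected and calculated values out of a validate_cib_digest error. The regular expression is written against the message format quoted above and is an assumption about that format, not something taken from the Pacemaker source; the helper name is made up.

```python
import re

DIGEST_RE = re.compile(
    r"expected (?P<expected>[0-9a-f]{32}) "
    r"\((?P<digest_file>[^)]+)\), calculated (?P<calculated>[0-9a-f]{32})"
)


def parse_digest_failure(log_text):
    """Extract expected/calculated digests from a validate_cib_digest error."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    match = DIGEST_RE.search(flat)
    return match.groupdict() if match else None


# Log excerpt as quoted in this thread (the spelling is the log's own).
SAMPLE = """Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: validate_cib_digest:
Digest comparision failed: expected 316049fa7ee8d2e107573ce7cded07cf
(/var/lib/heartbeat/crm/cib.GUdD9T), calculated
0bac3440f5c42f0f37d22ea7dfe433e8"""

if __name__ == "__main__":
    info = parse_digest_failure(SAMPLE)
    print(info)
    print("digests match:", info["expected"] == info["calculated"])
```
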

I do notice though, that the location constraint is recorded in the
cib unformatted (indicating something is amiss):



and the addition of that constraint was also the change that triggered
the behavior.
It also looks related to the link lge posted.  Can you please verify
whether your systems are affected by that bug?

How did you load it btw? There's no record of it in the logs.
This is why we prefer hb_reports containing the info from both machines.

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] CIB write-to-disk bug?

2010-04-02 Thread Lars Ellenberg
On Fri, Apr 02, 2010 at 08:16:32AM -0600, Alan Robertson wrote:
> >Do it again, with higher log level.  Sorry, no time right now to rebuild
> >your exact thing with your exact gcc and stuff to look at your core file.
> 
> You can just download the RPM and extract the objects.  That's what I used.

core files generated on a rhel box are not particularly easy to use on
a debian box... so I did not do much beyond "strings" on it.

anyways, I suggest that you hit (some variant of)
http://markmail.org/message/exsz6rf7vhjntqgu

there is also a patch:
http://markmail.org/message/utjcety2tiu6zaer

and the upstream commit of it
http://git.gnome.org/browse/libxml2/commit/?id=c4ba8a42214c9f1cc16da14f29d63db2d0cec55a

if this guess turns out to be true:
congrats, you have found a bug in libxml2
that has been fixed for over two years!

enterprisey ;-)

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



Re: [Pacemaker] CIB write-to-disk bug?

2010-04-02 Thread Alan Robertson

Lars Ellenberg wrote:
> On Thu, Apr 01, 2010 at 08:27:02AM -0600, Alan Robertson wrote:
>> Lars Ellenberg wrote:
>>> On Thu, Apr 01, 2010 at 12:12:47AM -0600, Alan Robertson wrote:
>>>> OK
>>>>
>>>> Since there was no ssh-as-root between the cluster nodes, I didn't
>>>> send all the logs along from every node in the cluster - and it
>>>> didn't occur to me to look at all of them.
>>>>
>>>> However, the problem has gotten curiouser and curiouser - because ALL
>>>> the nodes in the cluster reported the same problem at the same
>>>> time...
>>>>
>>>> That makes it a lot less likely to be a race condition with the disk
>>>> writing infrastructure...
>>>>
>>>> I've attached the relevant lines from the various machines -
>>>> slightly processed (date stamp format changed and a few other minor
>>>> things).
>>>>
>>>> Let me know if you want me to send all the system logs along...
>>>
>>> There should be core files.
>>> You should be able to get some interesting information out there,
>>> especially "the_cib" and "digest" at the point of abort().
>>>
>>>>> Also, for my reference - what method are you using to compute the
>>>>> digest of the file?  That is, what command should I execute to get
>>>>> the same results?
>>>
>>> It's an md5sum over the xml tree -- not over the formatted ascii buffer,
>>> though, so "md5sum cib.xml" won't do.
>>> I think it is the same as
>>>  echo " $(perl -pe 's/^\s*(.*?)\s*\z/$1/g' cib.whatever)" | md5sum
>>> But there is "cibadmin --md5-sum -x cib.xml",
>>> to use the exact same code path.
>>
>> This is a change from how it used to be (the last time I looked - at
>> least according to my not-always-reliable memory).  Thanks for the
>> update.
>>
>>>> 2010/03/31_19:02:52 vhost0384   [13294]: ERROR: crm_abort:
>>>> write_cib_contents: Triggered fatal assert at io.c:624 :
>>>> retrieveCib(tmp1, tmp2, FALSE) != NULL
>>>
>>> So it did not verify right after it was written.
>>> Can you reproduce?
>>
>> I have no idea.  I didn't do anything much.  Hopefully the test
>> suite does a lot more strenuous things...
>>
>>> The core files may actually contain some hints,
>>> so have a look there.
>>
>> None of them verified.  All the nodes in the cluster failed the test
>> at the same time - and now I have no official CIBs on disk - on any
>> cluster nodes...  I sent Andrew all the CIBs, and all the core
>
> Well, Andrew is on vacation right now... you will have noticed.
>
>> files, and basically everything under /var/lib/heartbeat/ from one
>> machine. They're from the latest official release - so the binaries
>> that match them are readily available.
>
> The strange thing is that your "corrupt" cib.uHFtAW
> contains a <status> thing.  it should not.
> No other cib*.raw or cib.xml does.
>
> Because <status> is explicitly filtered out in write_cib_contents:
>  free_xml_from_parent(the_cib, cib_status_root);
> before
>  write_xml_file(the_cib, tmp1, FALSE),
> so that should never have made it in there.
>
> Something is very wrong somewhere...
>
> Did you manage to get two status sections in there, somehow?
> You tried anything funky with the cib as last action before this failed?

Not that I recall...

> Do it again, with higher log level.  Sorry, no time right now to rebuild
> your exact thing with your exact gcc and stuff to look at your core file.

You can just download the RPM and extract the objects.  That's what I used.

--
Alan Robertson 

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce




Re: [Pacemaker] CIB write-to-disk bug?

2010-04-01 Thread Lars Ellenberg
On Thu, Apr 01, 2010 at 08:27:02AM -0600, Alan Robertson wrote:
> Lars Ellenberg wrote:
> >On Thu, Apr 01, 2010 at 12:12:47AM -0600, Alan Robertson wrote:
> >>OK
> >>
> >>Since there was no ssh-as-root between the cluster nodes, I didn't
> >>send all the logs along from every node in the cluster - and it
> >>didn't occur to me to look at all of them.
> >>
> >>However, the problem has gotten curiouser and curiouser - because ALL
> >>the nodes in the cluster reported the same problem at the same
> >>time...
> >>
> >>That makes it a lot less likely to be a race condition with the disk
> >>writing infrastructure...
> >>
> >>I've attached the relevant lines from the various machines -
> >>slightly processed (date stamp format changed and a few other minor
> >>things).
> >>
> >>Let me know if you want me to send all the system logs along...
> >
> >There should be core files.
> >You should be able to get some interesting information out there,
> >especially "the_cib" and "digest" at the point of abort().
> >
> >>>
> >>>Also, for my reference - what method are you using to compute the
> >>>digest of the file?  That is, what command should I execute to get
> >>>the same results?
> >
> >It's an md5sum over the xml tree -- not over the formatted ascii buffer,
> >though, so "md5sum cib.xml" won't do.
> >I think it is the same as
> > echo " $(perl -pe 's/^\s*(.*?)\s*\z/$1/g' cib.whatever)" | md5sum
> >But there is "cibadmin --md5-sum -x cib.xml",
> >to use the exact same code path.
> 
> This is a change from how it used to be (the last time I looked - at
> least according to my not-always-reliable memory).  Thanks for the
> update.
> 
> 
> >>2010/03/31_19:02:52 vhost0384   [13294]: ERROR: crm_abort:
> >>write_cib_contents: Triggered fatal assert at io.c:624 :
> >>retrieveCib(tmp1, tmp2, FALSE) != NULL
> >
> >So it did not verify right after it was written.
> >Can you reproduce?
> 
> I have no idea.  I didn't do anything much.  Hopefully the test
> suite does a lot more strenuous things...
> 
> >The core files may actually contain some hints,
> >so have a look there.
> 
> None of them verified.  All the nodes in the cluster failed the test
> at the same time - and now I have no official CIBs on disk - on any
> cluster nodes...  I sent Andrew all the CIBs, and all the core

Well, Andrew is on vacation right now... you will have noticed.

> files, and basically everything under /var/lib/heartbeat/ from one
> machine. They're from the latest official release - so the binaries
> that match them are readily available.

The strange thing is that your "corrupt" cib.uHFtAW
contains a <status> thing.  it should not.
No other cib*.raw or cib.xml does.

Because <status> is explicitly filtered out in write_cib_contents:
 free_xml_from_parent(the_cib, cib_status_root);
before
 write_xml_file(the_cib, tmp1, FALSE),
so that should never have made it in there.
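
A hedged way to check other on-disk copies for the same symptom is to parse each file and count status sections directly under the root element. The sketch below uses Python's standard XML parser; the element names (a cib root with configuration and status children) are assumed here rather than quoted from the files in question, and the two sample documents are invented.

```python
import xml.etree.ElementTree as ET


def status_sections(cib_xml):
    """Return the number of direct <status> children of the root element."""
    root = ET.fromstring(cib_xml)
    return len(root.findall("status"))


# Invented minimal documents; a real CIB carries far more content.
ON_DISK = "<cib><configuration/></cib>"
SUSPECT = "<cib><configuration/><status/></cib>"

if __name__ == "__main__":
    for label, doc in (("on disk", ON_DISK), ("suspect", SUSPECT)):
        print(label, "->", status_sections(doc), "status section(s)")
```

Any count other than zero for a file written by write_cib_contents would match the oddity described above.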

Something is very wrong somewhere...

Did you manage to get two status sections in there, somehow?
You tried anything funky with the cib as last action before this failed?

Do it again, with higher log level.  Sorry, no time right now to rebuild
your exact thing with your exact gcc and stuff to look at your core file.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



Re: [Pacemaker] CIB write-to-disk bug?

2010-04-01 Thread Alan Robertson

Florian Haas wrote:
> On 2010-04-01 16:27, Alan Robertson wrote:
>
>> None of them verified.  All the nodes in the cluster failed the test at
>> the same time - and now I have no official CIBs on disk - on any cluster
>> nodes...  I sent Andrew all the CIBs, and all the core files, and
>> basically everything under /var/lib/heartbeat/ from one machine. They're
>> from the latest official release - so the binaries that match them are
>> readily available.
>
> Any particular reason to not create an hb_report tarball and attach that
> to a bug report in the LF bugzilla?


I did create the tarball - and a second one with all the CIBs, core 
files, and so on.  I just didn't create a bug report.  This looks like 
the same bugzilla that Heartbeat uses.  Is that right?


I was kind of hoping someone would have an easy answer ;-).



--
Alan Robertson 

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce




Re: [Pacemaker] CIB write-to-disk bug?

2010-04-01 Thread Florian Haas
On 2010-04-01 16:27, Alan Robertson wrote:

> None of them verified.  All the nodes in the cluster failed the test at
> the same time - and now I have no official CIBs on disk - on any cluster
> nodes...  I sent Andrew all the CIBs, and all the core files, and
> basically everything under /var/lib/heartbeat/ from one machine. They're
> from the latest official release - so the binaries that match them are
> readily available.

Any particular reason to not create an hb_report tarball and attach that
to a bug report in the LF bugzilla?

Cheers,
Florian





Re: [Pacemaker] CIB write-to-disk bug?

2010-04-01 Thread Alan Robertson

Lars Ellenberg wrote:
> On Thu, Apr 01, 2010 at 12:12:47AM -0600, Alan Robertson wrote:
>> OK
>>
>> Since there was no ssh-as-root between the cluster nodes, I didn't
>> send all the logs along from every node in the cluster - and it
>> didn't occur to me to look at all of them.
>>
>> However, the problem has gotten curiouser and curiouser - because ALL
>> the nodes in the cluster reported the same problem at the same
>> time...
>>
>> That makes it a lot less likely to be a race condition with the disk
>> writing infrastructure...
>>
>> I've attached the relevant lines from the various machines -
>> slightly processed (date stamp format changed and a few other minor
>> things).
>>
>> Let me know if you want me to send all the system logs along...
>
> There should be core files.
> You should be able to get some interesting information out there,
> especially "the_cib" and "digest" at the point of abort().
>
>>> Also, for my reference - what method are you using to compute the
>>> digest of the file?  That is, what command should I execute to get
>>> the same results?
>
> It's an md5sum over the xml tree -- not over the formatted ascii buffer,
> though, so "md5sum cib.xml" won't do.
> I think it is the same as
>  echo " $(perl -pe 's/^\s*(.*?)\s*\z/$1/g' cib.whatever)" | md5sum
> But there is "cibadmin --md5-sum -x cib.xml",
> to use the exact same code path.

This is a change from how it used to be (the last time I looked - at
least according to my not-always-reliable memory).  Thanks for the update.

>> 2010/03/31_19:02:52 vhost0384   [13294]: ERROR: crm_abort:
>> write_cib_contents: Triggered fatal assert at io.c:624 :
>> retrieveCib(tmp1, tmp2, FALSE) != NULL
>
> So it did not verify right after it was written.
> Can you reproduce?

I have no idea.  I didn't do anything much.  Hopefully the test suite
does a lot more strenuous things...

> The core files may actually contain some hints,
> so have a look there.

None of them verified.  All the nodes in the cluster failed the test at
the same time - and now I have no official CIBs on disk - on any cluster
nodes...  I sent Andrew all the CIBs, and all the core files, and
basically everything under /var/lib/heartbeat/ from one machine.
They're from the latest official release - so the binaries that match
them are readily available.

Thanks Lars!


--
Alan Robertson 

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce




Re: [Pacemaker] CIB write-to-disk bug?

2010-04-01 Thread Lars Ellenberg
On Thu, Apr 01, 2010 at 12:12:47AM -0600, Alan Robertson wrote:
> OK
> 
> Since there was no ssh-as-root between the cluster nodes, I didn't
> send all the logs along from every node in the cluster - and it
> didn't occur to me to look at all of them.
> 
> However, the problem has gotten curiouser and curiouser - because ALL
> the nodes in the cluster reported the same problem at the same
> time...
> 
> That makes it a lot less likely to be a race condition with the disk
> writing infrastructure...
> 
> I've attached the relevant lines from the various machines -
> slightly processed (date stamp format changed and a few other minor
> things).
> 
> Let me know if you want me to send all the system logs along...

There should be core files.
You should be able to get some interesting information out there,
especially "the_cib" and "digest" at the point of abort().

> >I did not make manual changes on a running CIB. I was using the
> >cluster shell at the time.   The CIB it is complaining about
> >appears to be an intact, valid CIB with contents approximately
> >like they should have been at the time.  By the way, I have a
> >report from another IBMer that they have seen systems that stop
> >writing to their local CIBs.  I'll contact him.
> >
> >Here are some relevant facts:
> >  These machines are virtual guests in a cloud somewhere - operations
> >have somewhat unpredictable latency.  But, nothing too egregious
> >was happening at the time or Heartbeat would have bitched.
> >  I was doing some testing at the time.  I was putting on and
> >taking off constraints using the cluster shell
> >migrate and unmigrate operations.
> >
> >Given that the file looks intact, and I know how the CIB is
> >written to disk (since I originally wrote that code), I wonder if
> >it isn't a versioning issue / race condition.  That is, the code
> >for writing to disk does NOT guarantee when it gets done (assuming
> >you're still using it).  It would be easy to do a checksum on the
> >wrong version compared to the version you thought it should be (or
> >before it completed).
> >
> >Andrew:  You should have already received all the relevant logs to
> >you on a separate email.
> >
> >Also, for my reference - what method are you using to compute the
> >digest of the file?  That is, what command should I execute to get
> >the same results?

It's an md5sum over the xml tree -- not over the formatted ascii buffer,
though, so "md5sum cib.xml" won't do.
I think it is the same as
 echo " $(perl -pe 's/^\s*(.*?)\s*\z/$1/g' cib.whatever)" | md5sum
But there is "cibadmin --md5-sum -x cib.xml",
to use the exact same code path.
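
To illustrate why a plain md5sum of the file differs from a digest that ignores formatting, here is a small Python sketch modeled on the perl one-liner above: it trims whitespace from both ends of every line, joins the lines, and hashes the result with a leading space and a trailing newline, as the echo would add. It is only an approximation for illustration and is not claimed to reproduce the perl pipeline or cibadmin byte for byte; both function names are made up.

```python
import hashlib


def stripped_line_md5(text):
    """MD5 over the text with per-line leading/trailing whitespace removed.

    The lines are joined without separators, prefixed with one space and
    terminated with a newline, mirroring `echo " $(...)"`.
    """
    body = "".join(line.strip() for line in text.splitlines())
    return hashlib.md5((" " + body + "\n").encode("utf-8")).hexdigest()


def raw_md5(text):
    """What a plain `md5sum` over the same bytes would compute."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Two invented documents that differ only in indentation.
    spaces = "<cib>\n  <configuration/>\n</cib>\n"
    tabs = "<cib>\n\t<configuration/>\n</cib>\n"
    print("raw equal:     ", raw_md5(spaces) == raw_md5(tabs))
    print("stripped equal:", stripped_line_md5(spaces) == stripped_line_md5(tabs))
```

Re-indenting a file changes the raw checksum but not the stripped one, which is the property a tree-based digest needs.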

> 2010/03/31_19:02:52   vhost0384   [13294]: ERROR: crm_abort:
> write_cib_contents: Triggered fatal assert at io.c:624 :
> retrieveCib(tmp1, tmp2, FALSE) != NULL

So it did not verify right after it was written.
Can you reproduce?

The core files may actually contain some hints,
so have a look there.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



Re: [Pacemaker] CIB write-to-disk bug?

2010-03-31 Thread Alan Robertson

OK

Since there was no ssh-as-root between the cluster nodes, I didn't send 
all the logs along from every node in the cluster - and it didn't occur 
to me to look at all of them.


However, the problem has gotten curiouser and curiouser - because ALL the
nodes in the cluster reported the same problem at the same time...


That makes it a lot less likely to be a race condition with the disk 
writing infrastructure...


I've attached the relevant lines from the various machines - slightly 
processed (date stamp format changed and a few other minor things).


Let me know if you want me to send all the system logs along...


Alan Robertson wrote:

Hi,

I've run into what looks at first blush to be a CIB bug in writing to disk.

The key messages from this incident are these:


Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: validate_cib_digest: 
Digest comparision failed: expected 316049fa7ee8d2e107573ce7cded07cf 
(/var/lib/heartbeat/crm/cib.GUdD9T), calculated 
0bac3440f5c42f0f37d22ea7dfe433e8
Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: retrieveCib: Checksum of 
/var/lib/heartbeat/crm/cib.uHFtAW failed!  Configuration contents ignored!
Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: retrieveCib: Usually this 
is caused by manual changes, please refer to 
http://clusterlabs.org/wiki/FAQ#cib_changes_detected
Mar 31 19:02:52 vhost0384 cib: [13294]: WARN: retrieveCib: Continuing 
but /var/lib/heartbeat/crm/cib.uHFtAW will NOT used.



I did not make manual changes on a running CIB. I was using the cluster 
shell at the time.   The CIB it is complaining about appears to be an 
intact, valid CIB with contents approximately like they should have been 
at the time.  By the way, I have a report from another IBMer that they 
have seen systems that stop writing to their local CIBs.  I'll contact him.


Here are some relevant facts:
  These machines are virtual guests in a cloud somewhere - operations
have somewhat unpredictable latency.  But, nothing too egregious
was happening at the time or Heartbeat would have bitched.
  I was doing some testing at the time.  I was putting on and
taking off constraints using the cluster shell
migrate and unmigrate operations.

Given that the file looks intact, and I know how the CIB is written to 
disk (since I originally wrote that code), I wonder if it isn't a 
versioning issue / race condition.  That is, the code for writing to 
disk does NOT guarantee when it gets done (assuming you're still using 
it).  It would be easy to do a checksum on the wrong version compared to 
the version you thought it should be (or before it completed).


Andrew:  You should have already received all the relevant logs to you 
on a separate email.


Also, for my reference - what method are you using to compute the digest 
of the file?  That is, what command should I execute to get the same 
results?





--
Alan Robertson 

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce
2010/03/31_19:02:52 vhost0384   [13294]: ERROR: crm_abort: write_cib_contents: Triggered fatal assert at io.c:624 : retrieveCib(tmp1, tmp2, FALSE) != NULL
2010/03/31_19:02:52 vhost0384   [13294]: ERROR: retrieveCib: Checksum of /var/lib/heartbeat/crm/cib.uHFtAW failed!  Configuration contents ignored!
2010/03/31_19:02:52 vhost0384   [13294]: ERROR: retrieveCib: Usually this is caused by manual changes, please refer to http://clusterlabs.org/wiki/FAQ#cib_changes_detected
2010/03/31_19:02:52 vhost0384   [13294]: ERROR: validate_cib_digest: Digest comparision failed: expected 316049fa7ee8d2e107573ce7cded07cf (/var/lib/heartbeat/crm/cib.GUdD9T), calculated 0bac3440f5c42f0f37d22ea7dfe433e8
2010/03/31_19:02:52 vhost0384   [6297]: ERROR: cib_diskwrite_complete: Disabling disk writes after write failure
2010/03/31_19:02:52 vhost0384   [6297]: ERROR: cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
2010/03/31_19:02:52 vhost0384   [6297]: ERROR: Managed write_cib_contents process 13294 dumped core
2010/03/31_19:02:53 vhost0150   [15083]: ERROR: crm_abort: write_cib_contents: Triggered fatal assert at io.c:624 : retrieveCib(tmp1, tmp2, FALSE) != NULL
2010/03/31_19:02:53 vhost0150   [15083]: ERROR: retrieveCib: Checksum of /var/lib/heartbeat/crm/cib.n66oB0 failed!  Configuration contents ignored!
2010/03/31_19:02:53 vhost0150   [15083]: ERROR: retrieveCib: Usually this is caused by manual changes, please refer to http://clusterlabs.org/wiki/FAQ#cib_changes_detected
2010/03/31_19:02:53 vhost0150   [15083]: ERROR: validate_cib_digest: Digest comparision failed: expected 316049fa7ee8d2e107573ce7cded07cf (/var/lib/heartbeat/crm/cib.UJSSzR), calculated 0bac3440f5c42f0f37d22ea7dfe433e8
2010/03/31_19:02:53 vhost0150   [2564]: ERROR: cib_diskwrite_complete: Disabling disk writes after write f

Re: [Pacemaker] cib and attrd processes segfault

2010-02-17 Thread Alessandro Federico
>
>
> Please
> - enable coredumps (set "ulimit -c unlimited" at the top of the
> corosync init file)
> - use hb_report to create a support tarball covering the problem
> - attach the tarball to a new bug:
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> That's the minimum we'd need to be able to assist.
>
>

Ok thank you very much


> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>



-- 
All work and no play makes Jack a dull boy.
   All work and no play makes Jack a dull
 boy. All work and no play makes Jack...
___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] cib and attrd processes segfault

2010-02-16 Thread Andrew Beekhof
On Tue, Feb 16, 2010 at 4:36 PM, Alessandro Federico
 wrote:
> Hi all,
> we have just installed the latest versions of pacemaker/corosync software:
> cluster-glue-1.0.3-1.el5.x86_64
> cluster-glue-libs-1.0.3-1.el5.x86_64
> corosync-1.2.0-1.el5.x86_64
> corosynclib-1.2.0-1.el5.x86_64
> heartbeat-3.0.2-2.el5.x86_64
> heartbeat-libs-3.0.2-2.el5.x86_64
> pacemaker-1.0.7-4.el5.x86_64
> pacemaker-libs-1.0.7-4.el5.x86_64
> resource-agents-1.0.1-1.el5.x86_64
> on a Scientific Linux SL 5.4 box (kernel 2.6.18-128.7.1.el5).
> The configuration file corosync.conf is attached.
> The problem is that we get segfault of cib and attrd processes as soon
> as we start the corosync service (see the attached messages.log file).
> Can anybody help us, please?

Please
- enable coredumps (set "ulimit -c unlimited" at the top of the
corosync init file)
- use hb_report to create a support tarball covering the problem
- attach the tarball to a new bug:
http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
That's the minimum we'd need to be able to assist.



Re: [Pacemaker] cib still leaks in pacemaker-1.0.3

2009-05-22 Thread Andrew Beekhof
And you'll also want this patch for the crmd

diff -r 4619c842d58c crmd/callbacks.c
--- a/crmd/callbacks.c  Fri May 22 16:52:14 2009 +0200
+++ b/crmd/callbacks.c  Fri May 22 21:34:12 2009 +0200
@@ -179,7 +179,6 @@ crmd_ha_msg_callback(HA_Message *hamsg,

} else {
crmd_ha_msg_filter(msg);
-   return;
}

   bail:


On Wed, May 20, 2009 at 2:47 PM, Nikola Ciprich  wrote:
> On Wed, May 20, 2009 at 02:02:52PM +0200, Andrew Beekhof wrote:
>> Ah, well that was pretty obvious.
>> /me humbly apologizes for such a stupid error.
> Hi and thanks! no problem
>
>
>> (It wasn't caught by my own valgrind testing because this function is
>> specific to heartbeat based clusters)
> don't worry, I'm doing a lot of testing for you ;)
> I've already compiled it and deployed it on the testing machines;
> memory usage seems to be pretty low. I'll report back
> in a few days on whether everything is OK.
> thanks a lot once more!
> nik
>
>>
>>
>> Try this:
>>
>> diff -r ea5d0b58c0be cib/callbacks.c
>> --- a/cib/callbacks.c Wed May 20 11:56:39 2009 +0200
>> +++ b/cib/callbacks.c Wed May 20 14:01:30 2009 +0200
>> @@ -1064,6 +1064,7 @@ cib_ha_peer_callback(HA_Message * msg, v
>>  {
>>      xmlNode *xml = convert_ha_message(NULL, msg, __FUNCTION__);
>>      cib_peer_callback(xml, private_data);
>> +    free_xml(xml);
>>  }
>>
>>  void
>>
>>
>>
>>
>> On Tue, May 19, 2009 at 8:24 PM, Andrew Beekhof  wrote:
>> > I'll take a look at the valgrind data.  Thanks!
>> >
>> > On Tue, May 19, 2009 at 6:39 PM, Nikola Ciprich  
>> > wrote:
>> >> Hello,
>> >> sorry to bother again. I've discovered why valgrind didn't
>> >> find anything. It is important to stop the process in order to
>> >> have valgrind finish the analysis. And it seems that there
>> >> really are leaks not only in cib, but also in attrd and crmd.
>> >> I just had a slight look into the code reported by valgrind
>> >> as problematic and though I would certainly need to examine
>> >> it much more to understand it properly, I think there are
>> >> leaks. I'm attaching the valgrind reports, In case You would be
>> >> interested in examining them.
>> >> If I could provide any help, I'll be more than happy.
>> >> (well, I guess I could of course help by sending patches :) but I'm
>> >> afraid this will take me a lot of time, I can try though).
>> >> with best regards
>> >> nik
>> >>
>> >>> Not really. Sorry :(
>> >>>
>> >>
>> >>
>> >
>>
>



Re: [Pacemaker] cib still leaks in pacemaker-1.0.3

2009-05-20 Thread Nikola Ciprich
On Wed, May 20, 2009 at 02:02:52PM +0200, Andrew Beekhof wrote:
> Ah, well that was pretty obvious.
> /me humbly apologizes for such a stupid error.
Hi and thanks! no problem


> (It wasn't caught by my own valgrind testing because this function is
> specific to heartbeat based clusters)
don't worry, I'm doing a lot of testing for you ;)
I've already compiled it and deployed it on the testing machines;
memory usage seems to be pretty low. I'll report back
in a few days on whether everything is OK.
thanks a lot once more!
nik

> 
> 
> Try this:
> 
> diff -r ea5d0b58c0be cib/callbacks.c
> --- a/cib/callbacks.c Wed May 20 11:56:39 2009 +0200
> +++ b/cib/callbacks.c Wed May 20 14:01:30 2009 +0200
> @@ -1064,6 +1064,7 @@ cib_ha_peer_callback(HA_Message * msg, v
>  {
>      xmlNode *xml = convert_ha_message(NULL, msg, __FUNCTION__);
>      cib_peer_callback(xml, private_data);
> +    free_xml(xml);
>  }
> 
>  void
> 
> 
> 
> 
> On Tue, May 19, 2009 at 8:24 PM, Andrew Beekhof  wrote:
> > I'll take a look at the valgrind data.  Thanks!
> >
> > On Tue, May 19, 2009 at 6:39 PM, Nikola Ciprich  
> > wrote:
> >> Hello,
> >> sorry to bother again. I've discovered why valgrind didn't
> >> find anything. It is important to stop the process in order to
> >> have valgrind finish the analysis. And it seems that there
> >> really are leaks not only in cib, but also in attrd and crmd.
> >> I just had a slight look into the code reported by valgrind
> >> as problematic and though I would certainly need to examine
> >> it much more to understand it properly, I think there are
> >> leaks. I'm attaching the valgrind reports, In case You would be
> >> interested in examining them.
> >> If I could provide any help, I'll be more than happy.
> >> (well, I guess I could of course help by sending patches :) but I'm
> >> afraid this will take me a lot of time, I can try though).
> >> with best regards
> >> nik
> >>
> >>> Not really. Sorry :(
> >>>
> >>
> >>
> >
> 



Re: [Pacemaker] cib still leaks in pacemaker-1.0.3

2009-05-20 Thread Andrew Beekhof
Ah, well that was pretty obvious.
/me humbly apologizes for such a stupid error.

(It wasn't caught by my own valgrind testing because this function is
specific to heartbeat based clusters)


Try this:

diff -r ea5d0b58c0be cib/callbacks.c
--- a/cib/callbacks.c   Wed May 20 11:56:39 2009 +0200
+++ b/cib/callbacks.c   Wed May 20 14:01:30 2009 +0200
@@ -1064,6 +1064,7 @@ cib_ha_peer_callback(HA_Message * msg, v
 {
     xmlNode *xml = convert_ha_message(NULL, msg, __FUNCTION__);
     cib_peer_callback(xml, private_data);
+    free_xml(xml);
 }

 void
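
The ownership rule behind this one-line fix can be sketched in isolation: the conversion step allocates a new object, the processing step only borrows it, so the caller that did the conversion has to free it. The names below (parsed_msg, convert_message, handle_message, free_message, live_msgs) are hypothetical stand-ins for illustration, not the Pacemaker API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the xmlNode produced by the conversion step. */
typedef struct parsed_msg {
    char *text;
} parsed_msg;

/* Crude allocation counter so the demo can show nothing is left behind. */
int live_msgs = 0;

/* Allocates a new object; the caller owns the result. */
parsed_msg *convert_message(const char *raw)
{
    parsed_msg *m;
    size_t len;

    assert(raw != NULL);
    m = malloc(sizeof(*m));
    if (m == NULL) {
        return NULL;
    }
    len = strlen(raw) + 1;
    m->text = malloc(len);
    if (m->text == NULL) {
        free(m);
        return NULL;
    }
    memcpy(m->text, raw, len);
    live_msgs++;
    return m;
}

void free_message(parsed_msg *m)
{
    if (m != NULL) {
        free(m->text);
        free(m);
        live_msgs--;
    }
}

/* Only borrows the message; it neither keeps nor frees it. */
size_t handle_message(const parsed_msg *m)
{
    return strlen(m->text);
}

/* Convert, process, then release: the last step is what the patch adds. */
size_t peer_callback(const char *raw)
{
    parsed_msg *m = convert_message(raw);
    size_t n;

    if (m == NULL) {
        return 0;
    }
    n = handle_message(m);
    free_message(m);
    return n;
}
```

Without the free_message() call, every incoming message would leave one object behind, which matches the slow, steady growth reported earlier in the thread.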




On Tue, May 19, 2009 at 8:24 PM, Andrew Beekhof  wrote:
> I'll take a look at the valgrind data.  Thanks!
>
> On Tue, May 19, 2009 at 6:39 PM, Nikola Ciprich  
> wrote:
>> Hello,
>> sorry to bother again. I've discovered why valgrind didn't
>> find anything. It is important to stop the process in order to
>> have valgrind finish the analysis. And it seems that there
>> really are leaks not only in cib, but also in attrd and crmd.
>> I just had a slight look into the code reported by valgrind
>> as problematic and though I would certainly need to examine
>> it much more to understand it properly, I think there are
>> leaks. I'm attaching the valgrind reports, In case You would be
>> interested in examining them.
>> If I could provide any help, I'll be more than happy.
>> (well, I guess I could of course help by sending patches :) but I'm
>> afraid this will take me a lot of time, I can try though).
>> with best regards
>> nik
>>
>>> Not really. Sorry :(
>>>
>>
>>
>



Re: [Pacemaker] cib still leaks in pacemaker-1.0.3

2009-05-19 Thread Andrew Beekhof
I'll take a look at the valgrind data.  Thanks!

On Tue, May 19, 2009 at 6:39 PM, Nikola Ciprich  wrote:
> Hello,
> sorry to bother again. I've discovered why valgrind didn't
> find anything. It is important to stop the process in order to
> have valgrind finish the analysis. And it seems that there
> really are leaks not only in cib, but also in attrd and crmd.
> I just had a slight look into the code reported by valgrind
> as problematic and though I would certainly need to examine
> it much more to understand it properly, I think there are
> leaks. I'm attaching the valgrind reports, In case You would be
> interested in examining them.
> If I could provide any help, I'll be more than happy.
> (well, I guess I could of course help by sending patches :) but I'm
> afraid this will take me a lot of time, I can try though).
> with best regards
> nik
>
>> Not really. Sorry :(
>>
>
>



Re: [Pacemaker] cib still leaks in pacemaker-1.0.3

2009-05-19 Thread Nikola Ciprich
Hello,
sorry to bother you again. I've discovered why valgrind didn't
find anything: the process has to be stopped for valgrind
to finish its analysis. And it seems that there
really are leaks, not only in cib but also in attrd and crmd.
I've only had a brief look at the code valgrind reports
as problematic, and though I would certainly need to examine
it much more to understand it properly, I think there are
leaks. I'm attaching the valgrind reports in case you would be
interested in examining them.
If I can provide any help, I'll be more than happy to.
(Well, I guess I could of course help by sending patches :) but I'm
afraid that would take me a lot of time. I can try, though.)
with best regards
nik
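
Memcheck produces its leak summary when the process terminates, so a long-running daemon needs some way to be stopped before a report appears. A minimal sketch of a stoppable main loop follows; the names stop_requested, install_stop_handler, run_until_stopped and idle_work are hypothetical and not taken from the cib sources.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set from the signal handler, polled by the main loop. */
volatile sig_atomic_t stop_requested = 0;

static void on_stop_signal(int signo)
{
    (void)signo;
    stop_requested = 1;
}

/* Install a SIGTERM handler that only records the request to stop. */
int install_stop_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_stop_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/* Placeholder for one unit of daemon work. */
void idle_work(void)
{
}

/* Call `work` until SIGTERM has been seen, then return to the caller so
 * that it can free what it owns and exit. Returns the iteration count. */
long run_until_stopped(void (*work)(void))
{
    long iterations = 0;

    assert(work != NULL);
    while (!stop_requested) {
        work();
        iterations++;
    }
    return iterations;
}
```

A daemon built this way can be stopped with `kill -TERM <pid>`; whatever is still allocated after its own cleanup is then what the leak summary lists.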

> Not really. Sorry :(
> 

-- 
-
Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.:   +420 596 603 142
fax:+420 596 621 273
mobil:  +420 777 093 799

www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-


valgrind.tar.gz
Description: GNU Zip compressed data


Re: [Pacemaker] cib still leaks in pacemaker-1.0.3

2009-05-18 Thread Andrew Beekhof
On Sat, May 16, 2009 at 10:33 PM, Nikola Ciprich
 wrote:
> Hi guys,
> I was able to enable valgrind on our production cluster today,
> but unfortunately only on the secondary node, I'll be allowed to enable
> it on primary node hopefully during next weekend.
> Unfortunately it seems that valgrind probably won't be of much help here.
> I've got some output from it, but it's only few warnings and it seems
> that growing memory consumption is not really caused by leak, but (maybe)
> only by some growing memory structure. I'm doing one not very nice thing
> in my cluster which might be the culprit:
> I'm monitoring some service by a cron script and periodically changing
> related resource score by the following command:
>
> cibadmin -U -o constraints -X "
>        
>            role="Master">
>               operation="eq" value="${host}"/>
>           
>        
> Is it possible that this could be causing cib growing memory consumption?

Anything is possible, but it would be unlikely.
There's nothing special about that command that would make only it leak.

> Anyways, I'm attaching valgrind output for cib process:
>
> ==14779== My PID = 14779, parent PID = 14766.  Prog and args are:
> ==14779==    /usr/lib64/heartbeat/cib
> ==14779==

> Can this help?

Not really. Sorry :(



Re: [Pacemaker] cib still leaks in pacemaker-1.0.3

2009-05-16 Thread Nikola Ciprich
Hi guys,
I was able to enable valgrind on our production cluster today,
but unfortunately only on the secondary node; I hope to be allowed to enable 
it on the primary node next weekend.
Unfortunately it seems that valgrind probably won't be of much help here.
I've got some output from it, but it's only a few warnings, and it seems 
that the growing memory consumption is not really caused by a leak, but (maybe)
only by some growing memory structure. I'm doing one not very nice thing
in my cluster which might be the culprit:
I'm monitoring a service with a cron script and periodically changing the
related resource score with the following command:

cibadmin -U -o constraints -X "

   
  
   

Is it possible that this could be causing the cib's growing memory consumption?
Anyway, I'm attaching the valgrind output for the cib process:

==14779== My PID = 14779, parent PID = 14766.  Prog and args are:
==14779==    /usr/lib64/heartbeat/cib
==14779==
==14779== Conditional jump or move depends on uninitialised value(s)
==14779==    at 0x674E354: (within /usr/lib64/libxml2.so.2.6.26)
==14779==    by 0x674CDA5: (within /usr/lib64/libxml2.so.2.6.26)
==14779==    by 0x674CD5E: (within /usr/lib64/libxml2.so.2.6.26)
==14779==    by 0x674CD5E: (within /usr/lib64/libxml2.so.2.6.26)
==14779==    by 0x674C77D: (within /usr/lib64/libxml2.so.2.6.26)
==14779==    by 0x6751853: xmlXPathEvalExpression (in /usr/lib64/libxml2.so.2.6.26)
==14779==    by 0x4E3CB58: xpath_search (xml.c:2545)
==14779==    by 0x50567BE: cib_process_xpath (cib_ops.c:880)
==14779==    by 0x5053CB3: cib_process_query (cib_ops.c:49)
==14779==    by 0x5057F3E: cib_perform_op (cib_utils.c:539)
==14779==    by 0x40AFBD: cib_process_command (callbacks.c:843)
==14779==    by 0x40A3FC: cib_process_request (callbacks.c:660)
==14779==    by 0x408E7E: cib_common_callback_worker (callbacks.c:259)
==14779==    by 0x4090EE: cib_common_callback (callbacks.c:315)
==14779==    by 0x408C4C: cib_rw_callback (callbacks.c:206)
==14779==    by 0x5E69858: G_CH_dispatch_int (GSource.c:624)
==14779==    by 0x739FDB3: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.1200.3)
==14779==    by 0x73A2C0C: (within /lib64/libglib-2.0.so.0.1200.3)
==14779==    by 0x73A2F19: g_main_loop_run (in /lib64/libglib-2.0.so.0.1200.3)
==14779==    by 0x40D3F0: cib_init (main.c:508)
==14779==    by 0x40C8AE: main (main.c:217)
==14779== Conditional jump or move depends on uninitialised value(s)
==14779==    at 0x674E354: (within /usr/lib64/libxml2.so.2.6.26)
==14779==    by 0x674CDA5: (within /usr/lib64/libxml2.so.2.6.26)
==14779==    by 0x674C77D: (within /usr/lib64/libxml2.so.2.6.26)
==14779==    by 0x6751853: xmlXPathEvalExpression (in /usr/lib64/libxml2.so.2.6.26)
==14779==    by 0x4E3CB58: xpath_search (xml.c:2545)
==14779==    by 0x50567BE: cib_process_xpath (cib_ops.c:880)
==14779==    by 0x5053CB3: cib_process_query (cib_ops.c:49)
==14779==    by 0x5057F3E: cib_perform_op (cib_utils.c:539)
==14779==    by 0x40AFBD: cib_process_command (callbacks.c:843)
==14779==    by 0x40A3FC: cib_process_request (callbacks.c:660)
==14779==    by 0x408E7E: cib_common_callback_worker (callbacks.c:259)
==14779==    by 0x4090EE: cib_common_callback (callbacks.c:315)
==14779==    by 0x408C4C: cib_rw_callback (callbacks.c:206)
==14779==    by 0x5E69858: G_CH_dispatch_int (GSource.c:624)
==14779==    by 0x739FDB3: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.1200.3)
==14779==    by 0x73A2C0C: (within /lib64/libglib-2.0.so.0.1200.3)
==14779==    by 0x73A2F19: g_main_loop_run (in /lib64/libglib-2.0.so.0.1200.3)
==14779==    by 0x40D3F0: cib_init (main.c:508)
==14779==    by 0x40C8AE: main (main.c:217)
==14779== Syscall param unlink(pathname) points to uninitialised byte(s)
==14779==    at 0x6ACCC27: unlink (in /lib64/libc-2.5.so)
==14779==    by 0x5E6BBC5: socket_destroy_channel (ipcsocket.c:870)
==14779==    by 0x5E6780A: G_CH_destroy_int (GSource.c:677)
==14779==    by 0x739F74C: (within /lib64/libglib-2.0.so.0.1200.3)
==14779==    by 0x739FEB9: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.1200.3)
==14779==    by 0x73A2C0C: (within /lib64/libglib-2.0.so.0.1200.3)
==14779==    by 0x73A2F19: g_main_loop_run (in /lib64/libglib-2.0.so.0.1200.3)
==14779==    by 0x40D3F0: cib_init (main.c:508)
==14779==    by 0x40C8AE: main (main.c:217)
==14779==  Address 0x4092A72 is 2 bytes inside a block of size 110 alloc'd
==14779==    at 0x4C20809: malloc (vg_replace_malloc.c:149)
==14779==    by 0x73A6BFA: g_malloc (in /lib64/libglib-2.0.so.0.1200.3)
==14779==    by 0x5E6B7AE: socket_accept_connection (ipcsocket.c:708)
==14779==    by 0x5E69364: G_WC_dispatch (GSource.c:830)
==14779==    by 0x739FDB3: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.1200.3)
==14779==    by 0x73

Re: [Pacemaker] cib still leaks in pacemaker-1.0.3

2009-05-14 Thread Nikola Ciprich
Hi guys,
so I've got valgrind grinding :)
I had some trouble getting the latest stuff working, so I used heartbeat-2.99.2 
with Dejan's (fixed) patch and --enable-valgrind 
--with-valgrind-log="--log-file=/tmp/crm-%p.valgrind"
and recompiled pacemaker-1.0.3 (without openais, as Andrew suggested).
Now enabling valgrind works!
Unfortunately I don't see the leaks on my testing machine, so I'll have to try 
it directly on the production one. Hopefully I'll have some time to play with it 
tomorrow or during the weekend, so I'll report ASAP.
Thanks a lot for all your help!
best regards
nik

On Thu, May 14, 2009 at 04:12:52PM +0200, Andrew Beekhof wrote:
> On Thu, May 14, 2009 at 3:58 PM, Nikola Ciprich  
> wrote:
> > Hi,
> > Dejan, thanks a lot, I compiled Your version, but crmd with shipped 
> > pacemaker keeps segfaulting
> > with it, and unable to rebuild pacemaker with this heartbeat to get the 
> > -debug package.
> > compilation fails with:
> >
> > plugin.c: In function 'check_message_sanity':
> > plugin.c:1190: warning: format '%d' expects type 'int', but argument 10 has 
> > type 'long unsigned int'
> > plugin.c:1190: warning: format '%d' expects type 'int', but argument 10 has 
> > type 'long unsigned int'
> > gmake[2]: *** [plugin.lo] Error 1
> > gmake[2]: Leaving directory `/home/src/redhat/BUILD/pacemaker/lib/ais'
> > gmake[1]: *** [all-recursive] Error 1
> > gmake[1]: Leaving directory `/home/src/redhat/BUILD/pacemaker/lib'
> > make: *** [all-recursive] Error 1
> > error: Bad exit status from /var/tmp/rpm-tmp.81431 (%build)
> >
> > Could You please send me only the related patch, so I could try compiling 
> > latest stable
> > version? I don't see it in the mercurial...
> 
> When you configure pacemaker, just add the --without-ais option.
> 
> >
> > Andrew thanks for Your patches as well, I'll try them, but honestly I'm a 
> > bit confused,
> > first patch is for heartbeat, right?
> 
> Actually, you probably don't need the second one.  I think it's in 1.0 already.
> 

-- 
-
Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.:   +420 596 603 142
fax:+420 596 621 273
mobil:  +420 777 093 799

www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-



Re: [Pacemaker] cib still leaks in pacemaker-1.0.3

2009-05-14 Thread Andrew Beekhof
On Thu, May 14, 2009 at 3:58 PM, Nikola Ciprich  wrote:
> Hi,
> Dejan, thanks a lot, I compiled Your version, but crmd with shipped pacemaker 
> keeps segfaulting
> with it, and unable to rebuild pacemaker with this heartbeat to get the 
> -debug package.
> compilation fails with:
>
> plugin.c: In function 'check_message_sanity':
> plugin.c:1190: warning: format '%d' expects type 'int', but argument 10 has 
> type 'long unsigned int'
> plugin.c:1190: warning: format '%d' expects type 'int', but argument 10 has 
> type 'long unsigned int'
> gmake[2]: *** [plugin.lo] Error 1
> gmake[2]: Leaving directory `/home/src/redhat/BUILD/pacemaker/lib/ais'
> gmake[1]: *** [all-recursive] Error 1
> gmake[1]: Leaving directory `/home/src/redhat/BUILD/pacemaker/lib'
> make: *** [all-recursive] Error 1
> error: Bad exit status from /var/tmp/rpm-tmp.81431 (%build)
>
> Could You please send me only the related patch, so I could try compiling 
> latest stable
> version? I don't see it in the mercurial...

When you configure pacemaker, just add the --without-ais option.

>
> Andrew thanks for Your patches as well, I'll try them, but honestly I'm a bit 
> confused,
> first patch is for heartbeat, right?

Actually, you probably don't need the second one.  I think it's in 1.0 already.



Re: [Pacemaker] cib still leaks in pacemaker-1.0.3

2009-05-14 Thread Nikola Ciprich
Hi,
Dejan, thanks a lot. I compiled your version, but crmd from the shipped 
pacemaker keeps segfaulting with it, and I'm unable to rebuild pacemaker 
against this heartbeat to get the -debug package.
Compilation fails with:

plugin.c: In function 'check_message_sanity':
plugin.c:1190: warning: format '%d' expects type 'int', but argument 10 has 
type 'long unsigned int'
plugin.c:1190: warning: format '%d' expects type 'int', but argument 10 has 
type 'long unsigned int'
gmake[2]: *** [plugin.lo] Error 1
gmake[2]: Leaving directory `/home/src/redhat/BUILD/pacemaker/lib/ais'
gmake[1]: *** [all-recursive] Error 1
gmake[1]: Leaving directory `/home/src/redhat/BUILD/pacemaker/lib'
make: *** [all-recursive] Error 1
error: Bad exit status from /var/tmp/rpm-tmp.81431 (%build)

Could you please send me just the related patch, so I could try compiling 
the latest stable version? I don't see it in mercurial...

Andrew, thanks for your patches as well. I'll try them, but honestly I'm a bit 
confused: the first patch is for heartbeat, right? And the second one for 
pacemaker? It doesn't seem to apply to either -tip or 1.0.3...
BR
nik


-- 
-
Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.:   +420 596 603 142
fax:+420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-



Re: [Pacemaker] cib still leaks in pacemaker-1.0.3

2009-05-13 Thread Andrew Beekhof
On Wed, May 13, 2009 at 7:41 PM, Dejan Muhamedagic  wrote:
> Hi,
>
> On Wed, May 13, 2009 at 05:36:40PM +0200, Nikola Ciprich wrote:
>> > holy !
>> yes! exactly! :)
>>
>> > sure
>> > in theory you can just add "crm valgrind" instead of "crm yes" in ha.cf
>>
>> hmm, i tried that now, but all I got is:
>> May 13 16:46:16 faxb heartbeat: [1655]: ERROR: Heartbeat was not compiled 
>> with --enable-libc-malloc, "crm valgrind" is therefor not supported.
>>
>> So I wanted to compile myself, but I see this option neither in
>> pacemaker's configure, nor in heartbeat's.  But I noticed
>> --enable-valgrind option for heartbeat configure,
>> but enabling it and recompiling the heartbeat didn't help.  so
>> maybe this part needs some updating?
>
> Looks like it. Just pushed a patch for that. Can you try again
> with the new tarball:
>
> http://hg.linux-ha.org/dev/archive/6467be4d4cb7.tar.bz2

Thanks Dejan!

Nikola, I also suggest the following two patches

diff -r 4038c4644964 configure.in
--- a/configure.in  Wed May 13 17:07:22 2009 +0200
+++ b/configure.in  Wed May 13 20:48:05 2009 +0200
@@ -2799,17 +2799,14 @@ AC_ARG_WITH(valgrind-suppress,
 [ VALGRIND_SUPP="/dev/null" ])

 if test "x" = "x$VALGRIND_LOG"; then
-VALGRIND_LOG="--log-socket=127.0.0.1:1234"
-AC_MSG_NOTICE(Set default Valgrind options to: $VALGRIND_OPTS)
-AC_MSG_NOTICE(Remember to start a receiver on localhost:1234)
+VALGRIND_LOG="--log-file=/tmp/crm-%p.valgrind"
 fi

-AC_PATH_PROG(VALGRIND_BIN, valgrind)
 if test "xyes" = "x$enable_valgrind" -a "x$VALGRIND_BIN" != "x"; then
enable_libc_malloc=yes
 fi

-AC_DEFINE_UNQUOTED(VALGRIND_BIN, "$VALGRIND_BIN", Valgrind command)
+AC_DEFINE_UNQUOTED(VALGRIND_BIN, "valgrind", Valgrind command)
 AC_DEFINE_UNQUOTED(VALGRIND_LOG, "$VALGRIND_LOG", Valgrind logging options)
 AC_DEFINE_UNQUOTED(VALGRIND_SUPP, "$VALGRIND_SUPP", Name of a suppression file to pass to Valgrind)

diff -r 4038c4644964 crm/crmd/subsystems.c
--- a/crm/crmd/subsystems.c Wed May 13 17:07:22 2009 +0200
+++ b/crm/crmd/subsystems.c Wed May 13 20:48:05 2009 +0200
@@ -148,6 +148,7 @@ start_subsystem(struct crm_subsystem_s* 
        unsigned int    j;
        struct rlimit   oflimits;
        const char      *devnull = "/dev/null";
+       const char      *grind = getenv("HA_VALGRIND_ENABLED");

crm_info("Starting sub-system \"%s\"", the_subsystem->name);
set_bit_inplace(fsa_input_register, the_subsystem->flag_required);
@@ -211,7 +212,8 @@ start_subsystem(struct crm_subsystem_s* 
(void)open(devnull, O_WRONLY);  /* Stdout: fd 1 */
(void)open(devnull, O_WRONLY);  /* Stderr: fd 2 */

-   if(getenv("HA_VALGRIND_ENABLED") != NULL) {
+   if(grind != NULL
+  && (crm_is_true(grind) || strstr(grind, the_subsystem->name))) {
char *opts[] = { crm_strdup(VALGRIND_BIN),
 crm_strdup("--show-reachable=yes"),
 crm_strdup("--leak-check=full"),
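
The new condition in the second patch can be read as a small predicate: run a subsystem under valgrind if the setting is a plain "true" value, or if the setting mentions that subsystem by name. A standalone sketch follows; looks_true is a hypothetical stand-in for crm_is_true(), and the set of spellings it accepts is an assumption, not the real implementation.

```c
#include <assert.h>
#include <string.h>
#include <strings.h>

/* Hypothetical stand-in for crm_is_true(); the accepted spellings here
 * are an assumption made for this sketch. */
int looks_true(const char *s)
{
    return s != NULL
        && (strcasecmp(s, "true") == 0 || strcasecmp(s, "yes") == 0
            || strcasecmp(s, "on") == 0 || strcmp(s, "1") == 0);
}

/* Mirrors the shape of the patched condition: an unset value disables
 * valgrind, a true value enables it for every subsystem, and any other
 * value enables it only for subsystems whose name appears in it. */
int should_run_under_valgrind(const char *setting, const char *subsystem)
{
    assert(subsystem != NULL);
    if (setting == NULL) {
        return 0;
    }
    return looks_true(setting) || strstr(setting, subsystem) != NULL;
}
```

Because the name test is a substring match, a setting such as "cib attrd" selects both daemons, but a subsystem whose name is contained in another's would also match.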



Re: [Pacemaker] cib still leaks in pacemaker-1.0.3

2009-05-13 Thread Dejan Muhamedagic
Hi,

On Wed, May 13, 2009 at 05:36:40PM +0200, Nikola Ciprich wrote:
> > holy !
> yes! exactly! :)
> 
> > sure
> > in theory you can just add "crm valgrind" instead of "crm yes" in ha.cf
> 
> hmm, i tried that now, but all I got is:
> May 13 16:46:16 faxb heartbeat: [1655]: ERROR: Heartbeat was not compiled 
> with --enable-libc-malloc, "crm valgrind" is therefor not supported.
> 
> So I wanted to compile myself, but I see this option neither in
> pacemaker's configure, nor in heartbeat's.  But I noticed
> --enable-valgrind option for heartbeat configure,
> but enabling it and recompiling the heartbeat didn't help.  so
> maybe this part needs some updating?

Looks like it. Just pushed a patch for that. Can you try again
with the new tarball:

http://hg.linux-ha.org/dev/archive/6467be4d4cb7.tar.bz2

Thanks,

Dejan

> BR
> nik
> 
> 
> >
> > did this not work?
> >
> >
> 
> 



Re: [Pacemaker] cib still leaks in pacemaker-1.0.3

2009-05-13 Thread Nikola Ciprich
> holy !
yes! exactly! :)

> sure
> in theory you can just add "crm valgrind" instead of "crm yes" in ha.cf

hmm, I tried that now, but all I got is:
May 13 16:46:16 faxb heartbeat: [1655]: ERROR: Heartbeat was not compiled with --enable-libc-malloc, "crm valgrind" is therefor not supported.

So I wanted to compile it myself, but I don't see this option in either 
pacemaker's or heartbeat's configure.
I did notice an --enable-valgrind option in heartbeat's configure, but enabling 
it and recompiling heartbeat didn't help.
So maybe this part needs some updating?
BR
nik


>
> did this not work?
>
>

-- 
-
Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.:   +420 596 603 142
fax:+420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-



Re: [Pacemaker] cib still leaks in pacemaker-1.0.3

2009-05-13 Thread Andrew Beekhof


On May 13, 2009, at 8:28 AM, Nikola Ciprich wrote:


> Hello,
> I reported this some time ago. A few days ago I updated my
> system to pacemaker-1.0.3 + related packages.
> But unfortunately the cib process still seems to be leaking, i.e. its RSS
> memory usage is constantly growing.
> This means we have to restart the whole heartbeat service approximately
> once every two weeks, as the memory usage of the cib process gets to
> ~1.5GB.


holy !

> Some time ago, when I was trying to use valgrind, I had some trouble.
> Andrew, you wrote that you're mostly testing the openais variant, and
> that it's possible heartbeat has some problems being started with
> valgrind. Could you please help me run the cib process under
> valgrind so I can provide a more accurate report?


sure
in theory you can just add "crm valgrind" instead of "crm yes" in ha.cf

did this not work?
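
For tracking the RSS growth described above without restarting anything, the resident set size can be sampled from /proc/<pid>/status on Linux. The sketch below only covers the parsing step; parse_vmrss_kb is a hypothetical helper, and reading the file and choosing the sampling interval are left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: given the text of /proc/<pid>/status, return the
 * VmRSS value in kB, or -1 if no VmRSS line is present. */
long parse_vmrss_kb(const char *status_text)
{
    const char *line = status_text;

    assert(status_text != NULL);
    while (line != NULL && *line != '\0') {
        long kb;

        /* Matches only when the current line starts with "VmRSS:". */
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1) {
            return kb;
        }
        line = strchr(line, '\n');
        if (line != NULL) {
            line++;             /* step past the newline */
        }
    }
    return -1;
}
```

Logging this value for the cib process from cron every few minutes would show whether the growth is steady or tied to particular operations.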
