Re: [Pacemaker] disable failover when doing orderly reboot

2010-04-01 Thread martin . braun
Hi Gerry,

> Stop all resources running on the node going to be shut down.
> That's what you want in the end, isn't it?

That should also work: set the second node to standby and do the reboot on
the primary. When the primary is up again, set the secondary back
online.
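
A minimal sketch of that sequence with the crm shell, assuming a two-node cluster. The node name node2, the PEER and RUN variables, and the two function names are placeholders made up for illustration; none of them come from this thread. Setting RUN=echo prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only. PEER is the node that must NOT take over (hypothetical name).
# RUN=echo turns every command into a dry run.
RUN=${RUN:-}
PEER=${PEER:-node2}

# Before rebooting the primary: put the peer into standby so it cannot
# host resources while the primary is down.
standby_peer() {
    $RUN crm node standby "$PEER"
}

# After the primary has rejoined the cluster: let the peer host resources again.
online_peer() {
    $RUN crm node online "$PEER"
}
```

Call standby_peer before the reboot and online_peer once the primary is back. With the only other node in standby, the resources stay down for the duration of the reboot instead of moving.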

Best,
Martin


"Gerry Kernan"  wrote on 31.03.2010 22:25:05:

> Hi Andreas,
> 
> Thanks for your answer. I was hoping there was some config option
> that would prevent failover when the reboot or shutdown is orderly.
>
> Gerry
> 
> 
> -----Original Message-----
> From: Andreas Mock [mailto:andreas.m...@web.de] 
> Sent: 31 March 2010 17:24
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] disable failover when doing orderly reboot
> 
> Hi Gerry,
> 
> my poor man's answer is:
> Stop all resources running on the node going to be shut down.
> That's what you want in the end, isn't it?
> 
> Best regards
> Andreas Mock
> 
> 
> -
> From: Gerry Kernan
> Sent: 31.03.2010 16:13:46
> To: pacemaker@oss.clusterlabs.org
> Subject: [Pacemaker] disable failover when doing orderly reboot
>
> Hi
>
> Is it possible to disable failover when doing an orderly reboot on
> the primary node so that the resources don't fail over to the standby
> node?
>
> Regards,
> Gerry Kernan
> InfinityIT
> Suite 17 The Mall,
> Beacon court,
> Sandyford,
> Dublin 18.
> p:+353-1-2930090
> f:+353-1-2930137
> 
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker




Re: [Pacemaker] CIB write-to-disk bug?

2010-04-01 Thread Lars Ellenberg
On Thu, Apr 01, 2010 at 12:12:47AM -0600, Alan Robertson wrote:
> OK
> 
> Since there was no ssh-as-root between the cluster nodes, I didn't
> send all the logs along from every node in the cluster - and it
> didn't occur to me to look at all of them.
> 
> However, the problem has gotten curioser and curioser - because ALL
> the nodes in the cluster reported the same problem at the same
> time...
> 
> That makes it a lot less likely to be a race condition with the disk
> writing infrastructure...
> 
> I've attached the relevant lines from the various machines -
> slightly processed (date stamp format changed and a few other minor
> things).
> 
> Let me know if you want me to send all the system logs along...

There should be core files.
You should be able to get some interesting information out of them,
especially "the_cib" and "digest" at the point of abort().

> >I did not make manual changes on a running CIB. I was using the
> >cluster shell at the time.   The CIB it is complaining about
> >appears to be an intact, valid CIB with contents approximately
> >like they should have been at the time.  By the way, I have a
> >report from another IBMer that they have seen systems that stop
> >writing to their local CIBs.  I'll contact him.
> >
> >Here are some relevant facts:
> >  These machines are virtual guests in a cloud somewhere - operations
> >have somewhat unpredictable latency.  But, nothing too egregious
> >was happening at the time or Heartbeat would have bitched.
> >  I was doing some testing at the time.  I was putting on and
> >taking off constraints using the cluster shell
> >migrate and unmigrate operations.
> >
> >Given that the file looks intact, and I know how the CIB is
> >written to disk (since I originally wrote that code), I wonder if
> >it isn't a versioning issue / race condition.  That is, the code
> >for writing to disk does NOT guarantee when it gets done (assuming
> >you're still using it).  It would be easy to do a checksum on the
> >wrong version compared to the version you thought it should be (or
> >before it completed).
> >
> >Andrew:  You should have already received all the relevant logs to
> >you on a separate email.
> >
> >Also, for my reference - what method are you using to compute the
> >digest of the file?  That is, what command should I execute to get
> >the same results?

It's an md5sum over the xml tree -- not over the formatted ascii buffer,
though, so "md5sum cib.xml" won't do.
I think it is the same as
 echo " $(perl -pe 's/^\s*(.*?)\s*\z/$1/g' cib.whatever)" | md5sum
But there is "cibadmin --md5-sum -x cib.xml",
to use the exact same code path.
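
To see why a plain md5sum gives a different answer, here is a small sketch that wraps the one-liner above in a shell function. The name cib_digest_approx is made up, and like the one-liner it is only an approximation of what the cib computes; "cibadmin --md5-sum" remains the way to get the real digest. It assumes perl and md5sum are installed.

```shell
#!/bin/sh
# Approximate the CIB digest of an XML file: strip leading and trailing
# whitespace (newline included) from every line, prefix one space, append a
# newline, then hash. Hypothetical helper; not part of Pacemaker.
# printf is used instead of echo so backslashes in the file are never
# interpreted by the shell.
cib_digest_approx() {
    printf ' %s\n' "$(perl -pe 's/^\s*(.*?)\s*\z/$1/g' "$1")" \
        | md5sum | cut -d' ' -f1
}
```

Two files that differ only in indentation and line breaks produce the same value from this function, while "md5sum" on the raw files does not.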

> 2010/03/31_19:02:52   vhost0384   [13294]: ERROR: crm_abort:
> write_cib_contents: Triggered fatal assert at io.c:624 :
> retrieveCib(tmp1, tmp2, FALSE) != NULL

So it did not verify right after it was written.
Can you reproduce?

The core files may actually contain some hints,
so have a look there.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



Re: [Pacemaker] CIB write-to-disk bug?

2010-04-01 Thread Alan Robertson

Lars Ellenberg wrote:
> On Thu, Apr 01, 2010 at 12:12:47AM -0600, Alan Robertson wrote:
>> OK
>>
>> Since there was no ssh-as-root between the cluster nodes, I didn't
>> send all the logs along from every node in the cluster - and it
>> didn't occur to me to look at all of them.
>>
>> However, the problem has gotten curioser and curioser - because ALL
>> the nodes in the cluster reported the same problem at the same
>> time...
>>
>> That makes it a lot less likely to be a race condition with the disk
>> writing infrastructure...
>>
>> I've attached the relevant lines from the various machines -
>> slightly processed (date stamp format changed and a few other minor
>> things).
>>
>> Let me know if you want me to send all the system logs along...
>
> There should be core files.
> You should be able to get some interesting information out of them,
> especially "the_cib" and "digest" at the point of abort().
>
>>> Also, for my reference - what method are you using to compute the
>>> digest of the file?  That is, what command should I execute to get
>>> the same results?
>
> It's an md5sum over the xml tree -- not over the formatted ascii buffer,
> though, so "md5sum cib.xml" won't do.
> I think it is the same as
>  echo " $(perl -pe 's/^\s*(.*?)\s*\z/$1/g' cib.whatever)" | md5sum
> But there is "cibadmin --md5-sum -x cib.xml",
> to use the exact same code path.


This is a change from how it used to be (the last time I looked - at 
least according to my not-always-reliable memory).  Thanks for the update.




>> 2010/03/31_19:02:52 vhost0384   [13294]: ERROR: crm_abort:
>> write_cib_contents: Triggered fatal assert at io.c:624 :
>> retrieveCib(tmp1, tmp2, FALSE) != NULL
>
> So it did not verify right after it was written.
> Can you reproduce?


I have no idea.  I didn't do anything much.  Hopefully the test suite 
does a lot more strenuous things...



> The core files may actually contain some hints,
> so have a look there.


None of them verified.  All the nodes in the cluster failed the test at 
the same time - and now I have no official CIBs on disk - on any cluster 
nodes...  I sent Andrew all the CIBs, and all the core files, and 
basically everything under /var/lib/heartbeat/ from one machine. 
They're from the latest official release - so the binaries that match 
them are readily available.


Thanks Lars!


--
Alan Robertson 

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce




Re: [Pacemaker] CIB write-to-disk bug?

2010-04-01 Thread Florian Haas
On 2010-04-01 16:27, Alan Robertson wrote:

> None of them verified.  All the nodes in the cluster failed the test at
> the same time - and now I have no official CIBs on disk - on any cluster
> nodes...  I sent Andrew all the CIBs, and all the core files, and
> basically everything under /var/lib/heartbeat/ from one machine. They're
> from the latest official release - so the binaries that match them are
> readily available.

Any particular reason to not create an hb_report tarball and attach that
to a bug report in the LF bugzilla?

Cheers,
Florian





Re: [Pacemaker] CIB write-to-disk bug?

2010-04-01 Thread Alan Robertson

Florian Haas wrote:
> On 2010-04-01 16:27, Alan Robertson wrote:
>> None of them verified.  All the nodes in the cluster failed the test at
>> the same time - and now I have no official CIBs on disk - on any cluster
>> nodes...  I sent Andrew all the CIBs, and all the core files, and
>> basically everything under /var/lib/heartbeat/ from one machine. They're
>> from the latest official release - so the binaries that match them are
>> readily available.
>
> Any particular reason to not create an hb_report tarball and attach that
> to a bug report in the LF bugzilla?


I did create the tarball - and a second one with all the CIBs, core 
files, and so on.  I just didn't create a bug report.  This looks like 
the same bugzilla that Heartbeat uses.  Is that right?


I was kind of hoping someone would have an easy answer ;-).



--
Alan Robertson 

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce




Re: [Pacemaker] CIB write-to-disk bug?

2010-04-01 Thread Lars Ellenberg
On Thu, Apr 01, 2010 at 08:27:02AM -0600, Alan Robertson wrote:
> Lars Ellenberg wrote:
> >On Thu, Apr 01, 2010 at 12:12:47AM -0600, Alan Robertson wrote:
> >>OK
> >>
> >>Since there was no ssh-as-root between the cluster nodes, I didn't
> >>send all the logs along from every node in the cluster - and it
> >>didn't occur to me to look at all of them.
> >>
> >>However, the problem has gotten curioser and curioser - because ALL
> >>the nodes in the cluster reported the same problem at the same
> >>time...
> >>
> >>That makes it a lot less likely to be a race condition with the disk
> >>writing infrastructure...
> >>
> >>I've attached the relevant lines from the various machines -
> >>slightly processed (date stamp format changed and a few other minor
> >>things).
> >>
> >>Let me know if you want me to send all the system logs along...
> >
> >There should be core files.
> >You should be able to get some interesting information out of them,
> >especially "the_cib" and "digest" at the point of abort().
> >
> >>>
> >>>Also, for my reference - what method are you using to compute the
> >>>digest of the file?  That is, what command should I execute to get
> >>>the same results?
> >
> >It's an md5sum over the xml tree -- not over the formatted ascii buffer,
> >though, so "md5sum cib.xml" won't do.
> >I think it is the same as
> > echo " $(perl -pe 's/^\s*(.*?)\s*\z/$1/g' cib.whatever)" | md5sum
> >But there is "cibadmin --md5-sum -x cib.xml",
> >to use the exact same code path.
> 
> This is a change from how it used to be (the last time I looked - at
> least according to my not-always-reliable memory).  Thanks for the
> update.
> 
> 
> >>2010/03/31_19:02:52 vhost0384   [13294]: ERROR: crm_abort:
> >>write_cib_contents: Triggered fatal assert at io.c:624 :
> >>retrieveCib(tmp1, tmp2, FALSE) != NULL
> >
> >So it did not verify right after it was written.
> >Can you reproduce?
> 
> I have no idea.  I didn't do anything much.  Hopefully the test
> suite does a lot more strenuous things...
> 
> >The core files may actually contain some hints,
> >so have a look there.
> 
> None of them verified.  All the nodes in the cluster failed the test
> at the same time - and now I have no official CIBs on disk - on any
> cluster nodes...  I sent Andrew all the CIBs, and all the core

Well, Andrew is on vacation right now... you will have noticed.

> files, and basically everything under /var/lib/heartbeat/ from one
> machine. They're from the latest official release - so the binaries
> that match them are readily available.

The strange thing is that your "corrupt" cib.uHFtAW
contains a <status> section.  It should not.
No other cib*.raw or cib.xml does.

Because <status> is explicitly filtered out in write_cib_contents:
 free_xml_from_parent(the_cib, cib_status_root);
before
 write_xml_file(the_cib, tmp1, FALSE),
so that should never have made it in there.
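
A quick way to check which on-disk copies carry a status section is to grep for the opening tag. This is only a sketch: the function name and the choice to pass the directory as an argument are made up for illustration, and the pattern is a plain text match, not an XML parse.

```shell
#!/bin/sh
# Print every cib* file in the given directory (default: current directory)
# whose text contains an opening <status ...> tag.
# Hypothetical helper; a text match only, not an XML parser.
list_cibs_with_status() {
    dir=${1:-.}
    grep -lE '<status[ />]' "$dir"/cib* 2>/dev/null
}
```

Point it at the directory under /var/lib/heartbeat/ that holds the cib files. On a healthy node it should print nothing.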

Something is very wrong somewhere...

Did you manage to get two status sections in there, somehow?
Did you try anything funky with the cib as the last action before this failed?

Try it again with a higher log level.  Sorry, no time right now to rebuild
your exact thing with your exact gcc and stuff to look at your core file.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



[Pacemaker] debugging levels

2010-04-01 Thread Alan Jones
Hi,
I would like to set debugging levels higher than zero with
pacemaker/corosync.

[r...@fc12-a heartbeat]# ./crmd version
CRM Version: 1.0.5 (ee19d8e83c2a5d45988f1cee36d334a631d84fc7)
[r...@fc12-a heartbeat]# corosync -v
Corosync Cluster Engine, version '1.1.2' SVN revision '2539'
Copyright (c) 2006-2009 Red Hat, Inc.

What I would like is to write log messages only to private files, not to
syslog, and to enable debug log messages from pacemaker, particularly the
pengine. So far I have corosync writing to its own log file and pacemaker
writing to both syslog and, through ha_logd, to ha-log and ha-debug.
However, both outputs are the same.

It seems there are only two configuration options for pacemaker as started
by corosync: use_logd, which I've enabled, and use_mgmtd, which I don't
understand. There is no documentation for the service section of
corosync.conf that I could find, except for lib/ais/plugin.c, which has no
comments.

- How do I turn off output to syslog by pacemaker?
- How do I enable all those nice debug prints in the pacemaker source to go
  to some file?

Thanks!
Alan


Re: [Pacemaker] debugging levels

2010-04-01 Thread Alan Jones
This worked.  Let me know if there is a better way.  From crm.h:

#ifdef ALAN_JONES_WANTS_HIS_LOG_MESSAGES
/* Level-filtered variant; compiled only when the symbol above is defined. */
#define do_crm_log_unlikely(level, fmt, args...) do {                   \
        if(__likely(crm_log_level < (level))) {                         \
            continue;                                                   \
        } else if((level) < LOG_DEBUG_2) {                              \
            cl_log(level, "%s: " fmt, __PRETTY_FUNCTION__ , ##args);    \
        } else {                                                        \
            cl_log(LOG_DEBUG, "debug%d: %s: " fmt,                      \
                   level-LOG_INFO, __PRETTY_FUNCTION__ , ##args);       \
        }                                                               \
    } while(0)
#else
/* Workaround: ignore the level and log every message at LOG_INFO. */
#define do_crm_log_unlikely(level, fmt, args...) do {                   \
        (void)(level);                                                  \
        cl_log(LOG_INFO, fmt, ##args);                                  \
    } while(0)
#endif


On Thu, Apr 1, 2010 at 4:03 PM, Alan Jones  wrote:

> Hi,
> I would like to set debugging levels higher than zero with
> pacemaker/corosync.
>
> [r...@fc12-a heartbeat]# ./crmd version
> CRM Version: 1.0.5 (ee19d8e83c2a5d45988f1cee36d334a631d84fc7)
> [r...@fc12-a heartbeat]# corosync -v
> Corosync Cluster Engine, version '1.1.2' SVN revision '2539'
> Copyright (c) 2006-2009 Red Hat, Inc.
>
> What I would like is to write log messages only to private files, not to
> syslog, and to enable debug log messages from pacemaker, particularly the
> pengine. So far I have corosync writing to its own log file and pacemaker
> writing to both syslog and, through ha_logd, to ha-log and ha-debug.
> However, both outputs are the same.
>
> It seems there are only two configuration options for pacemaker as started
> by corosync: use_logd, which I've enabled, and use_mgmtd, which I don't
> understand. There is no documentation for the service section of
> corosync.conf that I could find, except for lib/ais/plugin.c, which has no
> comments.
>
> - How do I turn off output to syslog by pacemaker?
> - How do I enable all those nice debug prints in the pacemaker source to go
>   to some file?
>
> Thanks!
> Alan
>
>