Re: [Linux-HA] R2 Two-node apache cluster with STONITH

2007-03-21 Thread Alan Robertson
Dejan Muhamedagic wrote:
> On Wed, Mar 21, 2007 at 05:45:14AM -0600, Alan Robertson wrote:
>> Dejan Muhamedagic wrote:
>>> On Tue, Mar 20, 2007 at 10:59:06PM -0600, Alan Robertson wrote:
>>>> Dejan Muhamedagic wrote:
>>>>> On Tue, Mar 20, 2007 at 01:11:21PM -0400, Bjorn Oglefjorn wrote:
>>>>>> Odd.  I've changed that op to be "monitor" and now I get this error:
>>>>>>
>>>>>> Mar 20 13:10:14 test-2 lrmd: [28651]: ERROR: RA lsb:httpd:monitor
>>>>>> (process 29131) failed to redirect stdout for its background child
>>>>>> (daemon) processes. This will likely cause those processes to die
>>>>>> mysteriously at some later time (terminated by signal SIGPIPE).
>>>>> Odd. Don't see this in the code anymore. Which version do you run?
>>>> Dejan:  This is the code which I mentioned that I could no longer find.
>>>>  _I_ didn't remove it ;-).
>>>>
>>>> I think that 2.0.8 did a better job of this, but still had this message
>>>> in it.
>>>>
>>>> It's reasonably painful to find who removed a message, unfortunately...
>>>>  Because I'd like to know why it was removed...  [I'd like the answer to
>>>> be "because I cleaned up the code"]
>>> You can find it here:
>>> http://hg.linux-ha.org/dev/file/875a67e6242a/lrm/lrmd/lrmd.c
>>>
>>> Bjorn: you can just ignore this message.
>> That depends on what version he's running.  If he's running an old
>> enough version, that message was indicative of a rather serious design
>> flaw in the lrm -- and exactly what it predicted would happen to some RAs.
> 
> Yes, you're right, and the bug would be triggered in case the RA
> writes exactly 80 bytes.


Or if it continues to write after the first line has been read.

For example, an hour later, or a day later...
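
The failure mode discussed above can be reproduced in miniature. This is
only a sketch of the general pipe behaviour, not lrmd code: it assumes a
reader that has closed its end of a pipe while a background child still
holds the write end. The function name is made up for illustration.

```python
import os


def write_after_reader_gone() -> bool:
    """Return True if writing to a pipe with no reader fails with EPIPE."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # the reader goes away, e.g. after reading one line
    try:
        os.write(write_fd, b"late output from a background child\n")
    except BrokenPipeError:
        # Python ignores SIGPIPE by default, so the failure surfaces as EPIPE.
        # A daemon with the default SIGPIPE disposition would be terminated.
        return True
    finally:
        os.close(write_fd)
    return False
```

A daemon that redirects its stdout away from the inherited pipe before
going into the background never hits this case, whenever it writes.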



-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
___
Linux-HA mailing list
[EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Replacing Existing Node

2007-03-21 Thread Alan Robertson
Max Hofer wrote:
> On Wednesday 21 March 2007 12:55, Alan Robertson wrote:
>> Andrew Beekhof wrote:
>>> On 3/20/07, Alan Robertson <[EMAIL PROTECTED]> wrote:
>>>> Max Hofer wrote:
>>>>> OK,
>>>>>
>>>>> I lost a day just trying to figure out how to replace a cluster node
>>>>> with a spare part. I thought someone else might need this info, or
>>>>> might know a better way than the one I used.
>>>>>
>>>>> Situation:
>>>>> - cluster with 2 nodes (routing1, routing2)
>>>>> - routing2 should be replaced with a spare part
>>>>> - routing1 and routing2 use a file system on a drbd to share
>>>>>   common data
>>>>>
>>>>> Precondition:
>>>>> - routing2 crashed and hb_uuid is not recoverable
>>>> FYI: It's in the CIB, and also in the hb_uuid files on every machine.
>>>>
>>>>> - spare part is configured to not start heartbeat after power-on
>>>>>
>>>>> Steps I did:
>>>>> * replaced crashed routing2 with spare part (cabling etc.)
>>>>> * powered on routing2
>>>>> * on routing2 invalidate data on drbd device (---> sync from routing1
>>>>> to routing2)
>>>>> * on routing1 delete routing2 (I found a bug that pingd resets to 0
>>>>> when calling hb_delnode ---> see bug #1535)
>>>>> # /usr/lib/heartbeat/hb_delnode routing2 && killall pingd
>>>>> (!!!NOTE: if your cluster configuration triggers a failover on a pingd
>>>>> failure, put the cluster in unmanaged mode, stop pingd, delete the
>>>>> node, then restart pingd and put the cluster back in managed mode)
>>>>> * on routing1 delete the cache of removed hosts (I'm not sure if this
>>>>> step is necessary, but someone on the mailing list explained it has to
>>>>> be done)
>>>>> # rm /var/lib/heartbeat/delhostcache
>>>>> * on routing1 add routing2 again
>>>>> # /usr/lib/heartbeat/hb_addnode routing2
>>>>> * start heartbeat on routing2
>>>>>
>>>>> Finished .
>>>>>
>>>>> What I really find stupid about the whole procedure:
>>>>> * the assumption that the UUID file (/var/lib/heartbeat/hb_uuid) can
>>>>> be reused on the spare part probably never holds (unless you
>>>>> perform a planned replacement ...)
>>>> See note above...
>>>>
>>>>> * this assumption does not work well if the spare part is installed to
>>>>> be a replacement for different cluster nodes. The UUID is created
>>>>> on the very first install of heartbeat (and thus is not part of my
>>>>> configuration data). It would be configuration hell to "save all
>>>>> UUIDs of all clusters after cluster activation" on a system with a
>>>>> couple of nodes
>>>> It's already saved for you - in two places on every machine...
>>>>
>>>> What's missing is the conversion from ASCII to binary.  Could you make a
>>>> bugzilla for that and assign it to me?
>>>>
>>> been there done that:
>>>   crm_uuid -w
> This command returns the ASCII UUID of the currently running node. What I
> need is a command which gives me the binary version of the UUID of the
> node which has to be replaced.
> 
> Example: two nodes N1 and N2. N2 is replaced (because of HD crash)
> 
> So i need to create the binary UUID for N2 on N1 - something like
> 
> crm_uuid -b N2 > hb_uuid_N2
> 
>> Andrew:  Is there a man page or other documentation outside the command
>> for this?

Reading the source code, it looks like crm_uuid -r reads the UUID from
that file and prints it out in ASCII, and crm_uuid -w writes it.

Here's the code for the -w option:



typedef struct cl_uuid_s{
    unsigned char   uuid[16];
}cl_uuid_t;


#define UUID_FILE HA_VARLIBDIR"/"PACKAGE"/hb_uuid"


int
write_hb_uuid(const char *new_value)
{
    int fd;
    int rc;
    cl_uuid_t uuid;
    char *buffer = strdup(new_value);
    rc = cl_uuid_parse(buffer, &uuid);
    if(rc != 0) {
        fprintf(stderr, "Invalid ASCII UUID supplied: %s\n",
            new_value);
        fprintf(stderr, "ASCII UUIDs must be of the form"
            " ---- and contain only letters and"
            " digits\n");
        return 1;
    }

    if ((fd = open(UUID_FILE, O_WRONLY|O_SYNC|O_CREAT, 0644)) < 0) {
        cl_perror("Could not open %s", UUID_FILE);
        return 1;
    }

    if (write(fd, uuid.uuid, UUID_LEN) != UUID_LEN) {
        cl_perror("Could not write UUID to %s", UUID_FILE);
    }

    if (close(fd) < 0) {
        cl_perror("Could not close %s", UUID_FILE);
    }
    return 0;
}


Although IMHO he could have been even more paranoid in writing the UUID
file ;-), I think it does what he said it does...
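
For anyone who wants to prepare the binary file for a node other than the
local one, the conversion can be sketched outside of crm_uuid. This is an
illustration only: it assumes the hb_uuid file holds nothing but the 16 raw
UUID bytes (as the write() call above suggests) and that cl_uuid_parse
stores them in the same order as the text form. The function names, the
output file name and the example UUID are made up.

```python
import os
import uuid

UUID_LEN = 16  # matches the 16-byte uuid[] array in cl_uuid_t


def write_binary_uuid(ascii_uuid: str, path: str) -> None:
    """Parse an ASCII UUID and write its 16 raw bytes to path."""
    raw = uuid.UUID(ascii_uuid).bytes  # raises ValueError on malformed input
    with open(path, "wb") as fh:
        fh.write(raw)
        fh.flush()
        os.fsync(fh.fileno())  # make sure the bytes reach the disk


def read_ascii_uuid(path: str) -> str:
    """Read a 16-byte UUID file and return the hyphenated ASCII form."""
    with open(path, "rb") as fh:
        raw = fh.read()
    if len(raw) != UUID_LEN:
        raise ValueError(f"expected {UUID_LEN} bytes, got {len(raw)}")
    return str(uuid.UUID(bytes=raw))
```

Something like write_binary_uuid("0f1e2d3c-4b5a-6978-8796-a5b4c3d2e1f0",
"hb_uuid_N2") on N1 would produce a file that could then be copied to the
replacement for N2, with the ASCII value taken from the CIB.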




-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] Help to translate.

2007-03-21 Thread Alan Robertson
Mauro Alexandre Nogueira wrote:
> I can help!
> 
> Just tell me how to do it...

Great!

I've attached the other email for you, and have CCed some of my other
Brazilian friends.  It talks about this in more detail.  Let me know if
you have further questions.

Luis Claudio R. Goncalves says:

> There is now a mail list for linux-ha in portuguese, hosted by the people
> from Linux-Chix Brazil. I was planning on calling people from the list to
> help in this quest.

I need someone to coordinate the translation to Brazilian Portuguese, so
that when something is out of date, that person can hand the task of
updating the pt_BR pages to someone - or do it themselves ;-).

That person should subscribe to all changes to the web site so they know
when something is updated...


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
--- Begin Message ---
Hi,

I finished a set of changes to the web site to allow it to better
support non-English content.  It's not perfect, but it's a reasonable
start...

You can see an example of this in Chinese here:
http://linux-ha.org/zh/HomePage_zh

[Note that although the links are in Chinese, in many cases they currently
point at pages in English - this is not a limitation of the system, just a
reflection of how few pages are in Chinese so far...]

The naming convention for non-English pages is:
http://linux-ha.org/{ISO langname}/pagename_{ISO langname}

So, if we had a German Home Page it would be here:
http://linux-ha.org/de/HomePage_de

So, if we had a Spanish Home Page it would be here:
http://linux-ha.org/es/HomePage_es

So, if we had a Brazilian Portuguese Home Page it would be here:
http://linux-ha.org/pt_BR/HomePage_pt_BR

and so on...
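
The naming convention above is mechanical enough to write down as a small
helper. This is just an illustration of the pattern; the function name is
made up and nothing like it is part of the web site.

```python
def localized_page_url(pagename: str, lang: str,
                       base: str = "http://linux-ha.org") -> str:
    """Build the URL of a translated page from its name and language code."""
    return f"{base}/{lang}/{pagename}_{lang}"
```

For example, localized_page_url("HomePage", "pt_BR") gives
http://linux-ha.org/pt_BR/HomePage_pt_BR.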

And, something VERY IMPORTANT which you need to understand...

The content for the web site http://linux-ha.org/ is derived from the
wiki at http://wiki.linux-ha.org/

This is explained in some detail here:
http://linux-ha.org/WikiTransclusion

The minimum set of pages in any given language is those described by
this page:
http://linux-ha.org/WikiTransclusion/SiteComposition


To create or modify content on the wiki, you have to create an account
for yourself, and then log in.  Once you've done that you can modify
pages on the wiki.

To bring a page over to the www site the first time, just reference it.
 To refresh it with a new copy, view the page on the www site you want
to update, then press shift-reload, and the www site will refresh its
cache from the wiki site.

So

We need two kinds of people:

 One or more people to oversee the translation effort and help
 coordinate the translations, and keep versions in sync

 Lots of people to translate pages into various languages


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce

--- End Message ---

Re: [Linux-HA] Replacing Existing Node

2007-03-21 Thread Alan Robertson
Andrew Beekhof wrote:
> On 3/20/07, Alan Robertson <[EMAIL PROTECTED]> wrote:
>> What's missing is the conversion from ASCII to binary.  Could you make a
>> bugzilla for that and assign it to me?
>>
> 
> been there done that:
>   crm_uuid -w

Max:

Would you kindly write this procedure up, and add it to the R2 FAQ on
the web site?

Thanks!


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] Replacing Existing Node

2007-03-21 Thread Alan Robertson
Max Hofer wrote:
> On Wednesday 21 March 2007 15:12, Alan Robertson wrote:
>> Max:
>>
>> Would you kindly write this procedure up, and add it to the R2 FAQ on
>> the web site?
> done 
> http://wiki.linux-ha.org/v2/faq#head-09fdadc641deb9ee88120bb122c49502071b0495
> 
> Could you please review it :
> * for correctness
> * i'm not a native english speaker  ;-)

Nice work!

FYI: I made a couple of minor changes.

Many Thanks!


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] Slave takes over all server IP

2007-03-21 Thread Alan Robertson
Tanveer Chowdhury wrote:
> Thanks for your reply.
> Yes, I have gone through the documentation.
> Say the master node has 2 NICs, one with 192.168.10.1 and another with
> 10.10.0.1, and the slave node has 2 NICs, one with 192.168.10.2 and
> another with 10.10.0.2. The virtual IP is 192.168.10.3. Now when the
> master goes down, the slave's IPs will change from 192.168.10.2 to
> 192.168.10.1 and 10.10.0.1 to 10.10.0.2. And when the master comes back,
> how can it regain its IP?

For an HA cluster, you need both virtual and real IP addresses.

The customers talk to the services by virtual addresses.

The administrators need to talk to specific servers by "real" IP
addresses - and so does heartbeat.

So, in your OS, only configure the fixed/real IP addresses, and ignore
the virtual IP addresses.

In heartbeat, configure the virtual IP addresses for your services.

Heartbeat will move the virtual IP addresses where they need to be, and
ignore the fixed/real/administrative IP addresses.
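
A toy model may make the split clearer. It is not heartbeat code: the Node
class and place_virtual_ip() are made up for illustration, the addresses
are the ones from the question, and failing back to the preferred node when
it returns is assumed here (in a real cluster that is a configuration
choice).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    fixed_ips: tuple                                # set in the OS, never moved
    virtual_ips: set = field(default_factory=set)   # managed by the cluster
    up: bool = True


def place_virtual_ip(vip: str, preferred: Node, fallback: Node) -> Node:
    """Put vip on the preferred node if it is up, otherwise on the fallback."""
    preferred.virtual_ips.discard(vip)
    fallback.virtual_ips.discard(vip)
    target = preferred if preferred.up else fallback
    target.virtual_ips.add(vip)
    return target


master = Node("master", ("192.168.10.1", "10.10.0.1"))
slave = Node("slave", ("192.168.10.2", "10.10.0.2"))
place_virtual_ip("192.168.10.3", master, slave)     # normal operation
```

Only virtual_ips ever changes. The fixed addresses stay where the OS put
them, so the master has nothing to "regain" when it comes back.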

Does that help?


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] questions on resources, iptables, ipaddr2, ocfs2

2007-03-21 Thread Alan Robertson
Dejan Muhamedagic wrote:
> On Tue, Mar 20, 2007 at 08:57:52AM -0600, Alan Robertson wrote:
>> Andrew Beekhof wrote:
>>> On 3/14/07, Florian Heigl <[EMAIL PROTECTED]> wrote:
>>>> Hi,
>>>>
>>>> sorry for the late reply, I was catching up on sleep.
>>>> I've put the logs online now, they hold all messages from starting on
>>>> the first node to the second node successfully joining.
>>>>
>>>> I have debug 1 in my ha.cf; if this is too verbose, I will regenerate
>>>> the logs.
>>> thanks to grep, logs can never be too long :-)
>>>
>>>> http://wartungsfenster.dyndns.org/outbox/messages-domU-bacula1
>>>>
>>>> other than that I have not yet done any changes so that the log is in
>>>> sync with what you read in the last email.
>>>>
>>>> what I know I need to change so far:
>>>> don't use ucast as I want to have >2 nodes, so it will be
>>>> mcast(encrypted) on eth0 and mcast (crc) on eth1.
>>>> I'll remove the ocfs2 bit and make the filesystem RA mount appropriately.
>>>> I'll rework my application start scripts to include the 'monitor'
>>>> action and be more compliant with OCF standards.
>>>>
>>>> Also thank You a lot for explaining the conditions that make something
>>>> unmanaged, I don't know why it happens, but at least I know what it
>>>> means now.
>>> this seems to be an lrm bug.
>>>
>>> the crm can handle it when an RA isn't installed on all nodes, but
>>> apparently the lrm never bothers to tell us this is the case and just
>>> returns "unknown error"
>>>
>>> which means:
>>> * monitor actions for that resource on that node will fail - making it
>>> look active but failed
>>> * stop actions for that resource on that node will fail - making it
>>> unmanaged
>> How is it supposed to tell you?
> 
> I suppose that Andrew meant this: in case an RA is not present (I
> mean the actual RA script) on the node, then the lrm should
> consider any resource of that type stopped and return OK for the
> stop operation and 7 for the monitor op.

Well... He already implemented the fix ;-)

My question was this:  There is no exit code which means "resource agent
not installed".  The OCF doesn't talk about that.  But, it does talk
about "application not installed".  And, since we'd treat them the same,
then it makes perfect sense to not invent some new non-standard exit
code for that (which might screw something _else_ up).
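
To make the behaviour under discussion concrete, here is a sketch of the
mapping. It is not the fix that went into the lrm. It only combines Dejan's
suggestion (stop succeeds, monitor reports "not running") with the existing
OCF "not installed" code for every other operation. The numeric values are
the standard OCF return codes; the function name is made up.

```python
import os

OCF_SUCCESS = 0          # operation succeeded
OCF_ERR_INSTALLED = 5    # required software is not installed
OCF_NOT_RUNNING = 7      # resource is cleanly stopped


def result_when_agent_missing(agent_path: str, operation: str):
    """Return the code to report if the RA script is absent, else None."""
    if os.path.exists(agent_path):
        return None                  # agent is there: run it normally
    if operation == "stop":
        return OCF_SUCCESS           # nothing can be running, so stop is done
    if operation == "monitor":
        return OCF_NOT_RUNNING       # report the resource as stopped
    return OCF_ERR_INSTALLED         # start etc.: cannot run here
```

With this mapping a missing agent no longer turns into a failed stop, which
is what made the resource unmanaged in the report above.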


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] Bug in changelog

2007-03-21 Thread Alan Robertson
Ragnar Kjørstad wrote:
> The following patch fixes the %changelog section in heartbeat.spec.in:
> 
> diff -r 34ecbda23c2a heartbeat.spec.in
> --- a/heartbeat.spec.in Tue Feb 27 13:07:06 2007 +0100
> +++ b/heartbeat.spec.in Wed Feb 28 15:26:07 2007 +0100
> @@ -92,7 +92,7 @@ GUI client for heartbeat clusters
>  %endif
> 
>  %changelog
> -* Tue Dec 09 2007  Alan Robertson <[EMAIL PROTECTED]> (see doc/AUTHORS file)
> +* Tue Jan 09 2007  Alan Robertson <[EMAIL PROTECTED]> (see doc/AUTHORS file)
>  + Version 2.0.8 - bug fixes and enhancements
>    + Allow colocation based on node attributes other than #id
>    + SAPDatabase and SAPInstance resource agents added.
> 
> 
> Dec 09 2007 is in the future, and it causes rpmbuild to fail when adding
> new entries, because they are not in chronological order. The hg log
> includes:
> changeset:   9962:cd6d5d4fafa8
> parent:      9917:78d18bebc528
> user:        Alan Robertson <[EMAIL PROTECTED]>
> date:        Tue Jan 09 09:15:42 2007 -0700
> summary:     Updated spec file in preparation for 2.0.8
> 
> So I'm guessing the correct date should have been Jan 9th.

I changed the date to Jan 9th in Hg.

Sorry for the delay.
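
A check like the following could catch this kind of slip before a release.
It is a rough sketch, not part of the build: it assumes changelog entries
start with "* Weekday Mon DD YYYY" as in the patch above, ignores the
weekday, and uses made-up function names.

```python
from datetime import date, datetime


def changelog_dates(spec_text: str) -> list:
    """Return the dates of the %changelog entries, in file order."""
    dates = []
    in_changelog = False
    for line in spec_text.splitlines():
        if line.strip() == "%changelog":
            in_changelog = True
            continue
        if in_changelog and line.startswith("* "):
            fields = line.split()  # "*", weekday, month, day, year, ...
            stamp = " ".join(fields[2:5])
            dates.append(datetime.strptime(stamp, "%b %d %Y").date())
    return dates


def changelog_problems(spec_text: str, today: date) -> list:
    """List entries dated in the future or out of newest-first order."""
    dates = changelog_dates(spec_text)
    problems = [f"{d} is in the future" for d in dates if d > today]
    for newer, older in zip(dates, dates[1:]):
        if older > newer:
            problems.append(f"{older} appears after {newer}")
    return problems
```

Run against a spec file with the Dec 09 2007 entry and a new entry on top,
it reports both the future date and the ordering violation; with Jan 09
2007 it reports nothing.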


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] Resources failing to start takes down all HB resources

2007-03-21 Thread Alan Robertson
Serge Dubrouski wrote:
> Usually resources are combined into groups if there is no reason to
> run them separately, like IP address, Filesystem and Database server.
> There is no reason to keep IP address up if Database that should
> provide service on that IP is down.
> 
> In your case if resource A, B and C are independent of each other
> probably they shouldn't be combined into a group at all?

B depends on A.  A doesn't depend on B.

C depends on B.  B doesn't depend on C.  A doesn't depend on C.

So, if you can't start C, there's no obvious reason not to run A or B.

If B provides a vital service, then it might be important to also run B
- even if C isn't available.
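
The dependency argument can be written out as a small sketch. It only
illustrates the reasoning, not how the CRM handles groups: a resource may
run if it can start and everything it depends on may run. The function and
its arguments are made up for illustration.

```python
def resources_that_can_run(depends_on: dict, startable: set) -> set:
    """depends_on maps each resource to the resources it needs."""
    result = {}

    def can_run(name, visiting=frozenset()):
        if name in result:
            return result[name]
        if name in visiting:
            return False  # dependency cycle: treat as not runnable
        ok = name in startable and all(
            can_run(dep, visiting | {name})
            for dep in depends_on.get(name, [])
        )
        result[name] = ok
        return ok

    return {name for name in depends_on if can_run(name)}


deps = {"A": [], "B": ["A"], "C": ["B"]}
```

With these dependencies, a C that cannot start leaves A and B runnable,
while an A that cannot start takes B and C down with it.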





-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] Replacing Existing Node

2007-03-22 Thread Alan Robertson
Max Hofer wrote:
> On Wednesday 21 March 2007 14:54, Alan Robertson wrote:
>> Max Hofer wrote:
>>> On Wednesday 21 March 2007 12:55, Alan Robertson wrote:
>>>> Andrew Beekhof wrote:
>>>>> On 3/20/07, Alan Robertson <[EMAIL PROTECTED]> wrote:
>>>>>> What's missing is the conversion from ASCII to binary.  Could you make a
>>>>>> bugzilla for that and assign it to me?
>>>>>>
>>>>> been there done that:
>>>>>   crm_uuid -w
>>> This command returns the ASCII UUID of the currently running node. What I
>>> need is a command which gives me the binary version of the UUID of the
>>> node which has to be replaced.
>>>
>>> Example: two nodes N1 and N2. N2 is replaced (because of HD crash)
>>>
>>> So i need to create the binary UUID for N2 on N1 - something like
>>>
>>> crm_uuid -b N2 > hb_uuid_N2
>>>
>>>> Andrew:  Is there a man page or other documentation outside the command
>>>> for this?
>> Reading the source code, it looks like crm_uuid -r reads the UUID from
>> that file and prints it out in ASCII, and crm_uuid -w writes it.
>

[Linux-HA] Web site and Wiki under construction today (2007/03/23)

2007-03-23 Thread Alan Robertson
Hopefully outages will be brief.

But, there will be outages, and perhaps some incorrectly formatted content.

There is a notice on the web site about this.

Until that notice goes away please do not update the wiki or the web site.

-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Re: [DRBD-user] Heartbeat Version

2007-03-23 Thread Alan Robertson
George H wrote:
> The only reason I can think of for using version 1 with DRBD is due to
> the limiting factor of DRBD. DRBD can only replicate data between 2
> hard disks and so a Version 1 setup of heartbeat works cos it can only
> have 2 nodes (though you can have version 2 with 2 nodes as well)
> 
> On 3/22/07, Hannes Dorbath <[EMAIL PROTECTED]> wrote:
>> Is there a preferred version of Heartbeat to use with DRBD?
>>
>> I've read some people prefer version 1, what is the reason?

I would recommend using the version 2 software for all purposes.

HOWEVER, some people prefer the version 1 configurations (which are
still supported by version 2 software), because they're REALLY easy to
understand and get going.


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] Error in compiling LinxuHA.

2007-03-23 Thread Alan Robertson
Athrun Zara wrote:
> Dear Linux HA experts,
> 
> I am trying to build Linux HA from scratch.
> but I have an error when compiling the package :
> 
> Making all in libnet_util
> Compiling send_arp.c:
> [ERROR]
>  gcc -DHAVE_CONFIG_H -I. -I. -I../../linux-ha -I../../include
> -I../../include -I../../include -I../../linux-ha -I../../linux-ha
> -I../../libltdl -I../../libltdl -pthread
> -I/usr/include/glib-2.0-I/usr/lib/glib-
> 2.0/include -I/usr/include/libxml2 -I/opt/include -Wl,--rpath=/opt/lib
> -Wall
> -Wmissing-prototypes -Wmissing-declarations -Wstrict-prototypes
> -Wdeclaration-after-statement -Wpointer-arith -Wwrite-strings -Wcast-qual
> -Wcast-align -Wbad-function-cast -Winline -Wmissing-format-attribute
> -Wformat=2 -Wformat-security -Wformat-nonliteral -Wno-long-long
> -Wno-strict-aliasing -Werror -ggdb3 -funsigned-char -DVAR_RUN_D=""
> -DVAR_LIB_D="" -DHA_D="" -DHALIB="/opt/lib/heartbeat" -I/opt/include
> -Wl,--rpath=/opt/lib -Wall -Wmissing-prototypes -Wmissing-declarations
> -Wstrict-prototypes -Wdeclaration-afIn file included from
> /opt/include/libnet.h:124,
>   from send_arp.c:37:
>  /opt/include/./libnet/libnet-functions.h:1839: warning: function
> declaration isn't a prototype
>  /opt/include/./libnet/libnet-functions.h:1861: warning: function
> declaration isn't a prototype
>  /opt/include/./libnet/libnet-functions.h:1868: warning: function
> declaration isn't a prototype
>  /opt/include/./libnet/libnet-functions.h:1876: warning: function
> declaration isn't a prototype
>  /opt/include/./libnet/libnet-functions.h:1884: warning: function
> declaration isn't a prototype
> gmake[2]: *** [send_arp.o] Error 1
> gmake[1]: *** [all-recursive] Error 1
> make: *** [all-recursive] Error 1
> 
> --
> Linux : Centos 4.3 with recompiled kernel 2.6.19.1
> Linux HA : ver 2.0.8
> LibNet : ver 1.1.2.1
> 
> Both LinuxHA and Libnet are configured with --prefix=/opt

void libnet_cq_destroy();

It ought to have a (void) instead of the ().

You can either patch those ()'s into (void)'s or you can configure
heartbeat with --disable-fatal-warnings.

Hmmm... I'm not sure why we don't get that error (?).  Obviously you're
doing something differently ;-)  I don't think the CentOS people got
that error.


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] Pingd and Stonith Suicide

2007-03-23 Thread Alan Robertson
Eric HERVE (SERVICES SDIR PRODUCTION) wrote:
> Hello,
> 
> I don't understand how to configure (via the GUI) a ping from the nodes to
> my gateway (for example) so that, if the ping fails, the node which cannot
> reach the network commits suicide.
> 
> I created a "pingd" clone resource with the operation "monitor" but I
> don't know how to use it. There is no attribute for this resource?

Have you run /usr/lib/ocf/resource.d/heartbeat/pingd meta-data ?  I
suspect not, since it produces 86 lines of output, and describes 8
parameters.  Perhaps you're using the GUI?  By default, the GUI only
shows you the mandatory parameters - unless you press the add parameters
box at the bottom of that dialog box.

But the better question is:  What are you trying to accomplish here?
(I have my suspicions, but I don't know for sure)

> I also created a "suicide" resource.

OK.  This is fine.  But we won't use it when connectivity is lost.

Why wouldn't you just want to stop resources and wait for the gateway to
come back up?  We support this behavior directly.  We also support
moving (selected) resources around to machines that do have
connectivity, or leaving them alone if no one has connectivity.

If all machines can't reach it, then they'll just go into suicide loops
until the gateway comes back...

If you enable quorum, and set the policy for loss-of-quorum to stop,
then a node which loses quorum will stop.

I think what you really want is something we don't really have yet, but
wouldn't be too hard to write if you have a day or two...  A quorum
plugin tie-breaker module which breaks a quorum tie if you have
connectivity to ping nodes.


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] linux-ha connection problem, heartbeat 2.0.8

2007-03-23 Thread Alan Robertson
Jon Savian wrote:
> I am having an issue with getting my linux-ha to work.  I use crm_mon
> to see whats going on and i see a message:
> 
> Not connected: Refresh 1s
> 
> I also checked the permissions of my ha.cf, authkeys, and cib.xml
> files.  They look fine.
> 
> Also, when i do a "service heartbeat restart", everything says ok.
> 
> In my logfile i get
> WARN: cib_native_signon: Connection to CIB failed: connection failed
> ERROR: Can't initialize management library.Shutting down.(-1)

That's no doubt not the only error in your logs ;-)

I'd guess you didn't create the hacluster user id, or the haclient group id.


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] How update cib.xml on working cluster ?

2007-03-23 Thread Alan Robertson
Serge Dubrouski wrote:
> http://www.linux-ha.org/v2/AdminTools/cibadmin
> 
> On 3/22/07, Alex Orlov <[EMAIL PROTECTED]> wrote:
>> Hi!
>>
>> I dont understand how update cib.xml on working cluster? Where FAQ ?
>> How can i make changes in this file and distribute it to other
>> cluster-node?

I believe the cibadmin man page has more examples.


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Documentation for constraints

2007-03-23 Thread Alan Robertson
Dejan Muhamedagic wrote:
> On Thu, Mar 22, 2007 at 06:15:41PM +0100, Ragnar Kjørstad wrote:
>> On Thu, Mar 22, 2007 at 02:05:57PM +0100, Dejan Muhamedagic wrote:
>>> On Thu, Mar 22, 2007 at 02:21:23AM +0100, Ragnar Kjørstad wrote:
>> BTW: I just noticed that the DTD is also checked into mercury. So which
>> is the authoriative version? mercury or wiki?
> 
> i think it's wiki, but i'm really not sure. alan?

Depends on your point of view ;-)  Are you a human or a program?   ;-)

But there's something really important about the one in Hg:
It is the one that crm_verify relies on.  Is it annotated?  I don't
remember it being annotated...

However, because DTDs in general are weak verifiers of syntax, and have
no support for semantics, and don't know anything about the CRM, the
formal DTD provides at best a poor description of what goes into the CIB.

It's really the comments (annotations) that tell the tale...

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] errors during make install process of heartbeat 2.0.8

2007-03-23 Thread Alan Robertson
Hamster wrote:
> Hi,
> 
>> It seems to be a deficiency in the configure/make setup. You
>> should try with the ConfigureMe script. Run it without args to see
>> usage.
> 
> Using the ConfigureMe script makes no difference I'm afraid. It seems
> it doesn't create the haclient/hacluster user/group either.
> 
> Thanks anyway! I'll just run adduser in between the make and make
> install processes. That should fix it I hope!

Only the RPM versions do the adduser things.

I changed the install script so that (I think) it's at least checking
for them afterwards, so you won't be in a complete mystery about what's
wrong.


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] HA storage (open iscsi)

2007-03-23 Thread Alan Robertson
Dominik Klein wrote:
> Hello
> 
> I have a working test-setup which uses open iscsi to export a block
> device (say /dev/sdb) to 2 xen hosts. This iscsi device in the hosts
> (say /dev/iscsi) is used as a physical device for the xen guest machines
> and so makes possible migration. Works great.
> 
> Setup looks like this now:
> 
> xen host 1 --\   /- open iscsi server (target)
> switch
> xen host 2 --/
> 
> But now storage is a single point of failure. How can I make iscsi high
> available?
> 
> What I was thinking about might look like this:
> 
> xen host 1 --\   /- open iscsi server 1 (target)
>   \ / |
> switch | maybe replicate /dev/sdb with drbd?
>   / \ |
> xen host 2 --/   \- open iscsi server 2 (target)
> 
> I would be glad if you could share ideas on how to achieve HA for this
> storage. Pointers to documentation would also be great.

It's all based on an earlier version of DRBD, and heartbeat release 1,
but there's an article exactly like that in the PressRoom:

http://linux-ha.org/PressRoom

Simon Brock and Ian Wrigley use off-the-peg open-source software to
build a basic implementation of a SAN using iSCSI, heartbeat and DRBD in
the March 2006 edition of the UK's PC Pro Magazine in an article
entitled SAN on the Cheap.

That should certainly give you an idea of what's possible...

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] linux-ha connection problem, heartbeat 2.0.8

2007-03-23 Thread Alan Robertson
Jon Savian wrote:
> I am having an issue with getting my linux-ha to work.  I use crm_mon
> to see whats going on and i see a message:
> 
> Not connected: Refresh 1s
> 
> I also checked the permissions of my ha.cf, authkeys, and cib.xml
> files.  They look fine.
> 
> Also, when i do a "service heartbeat restart", everything says ok.
> 
> In my logfile i get
> WARN: cib_native_signon: Connection to CIB failed: connection failed
> ERROR: Can't initialize management library.Shutting down.(-1)

I thought I replied to this before, but I guess not...

I'm sure there are more failure messages than that in your log files.

Most probably the group id and user id we need don't exist on your
machine.  That's at the root of most "can't connect" messages in heartbeat.


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Replacing Existing Node

2007-03-23 Thread Alan Robertson
Max Hofer wrote:
> On Thursday 22 March 2007 23:11, Alan Robertson wrote:
>> Max Hofer wrote:
>>> On Wednesday 21 March 2007 14:54, Alan Robertson wrote:
>>>> Max Hofer wrote:
>>>>> On Wednesday 21 March 2007 12:55, Alan Robertson wrote:
>>>>>> Andrew Beekhof wrote:
>>>>>>> On 3/20/07, Alan Robertson <[EMAIL PROTECTED]> wrote:
>>>>>>>> Max Hofer wrote:
>>>>>>>>> OK,
>>>>>>>>>
>>>>>>>>> i lost a day just trying to figure out how to replace a cluster node
>>>>>>>> with
>>>>>>>>> a spare part. I just thought someone else needs this info or maybe
>>>>>>>>> knows a better way as How I did.
>>>>>>>>>
>>>>>>>>> Situation:
>>>>>>>>> - cluster with 2 nodes (routing1, routing2)
>>>>>>>>> - routing2 should be replaced with a spare part
>>>>>>>>> - routing1 and routing2 use a file system on a drbd to share
>>>>>>>>>   common data
>>>>>>>>>
>>>>>>>>> Precondition:
>>>>>>>>> - routing2 crashed and hb_uuid is not recoverable
>>>>>>>> FYI: It's in the CIB, and also in the hb_uuid files on every machine.
>>>>>>>>
>>>>>>>>> - spare part is configured to not start heartbeat after power-on
>>>>>>>>>
>>>>>>>>> Steps I did:
>>>>>>>>> * replaced crashed routing2 with spare part (cabling etc.)
>>>>>>>>> * powered on routing2
>>>>>>>>> * on routing2 invalidate data on drbd device (---> sync from routing1
>>>>>>>>> to routing2)
>>>>>>>>> * on routing1 delete routing2 (I found a bug that pingd resets to 0
>>>>>>>>> when calling hb_delnode ---> see bug #1535)
>>>>>>>>> # /usr/lib/heartbeat/hb_delnode routing2 && killall pingd
>>>>>>>>> (!!!NOTE: if your cluster configuration triggers a failover on a pingd
>>>>>>>>> failure set the cluster in unmanaged mode, stop pingd, delete
>>>>>>>>> the node and then restart pingd, setting the cluster in managed mode
>>>>>>>>> again)
>>>>>>>>> * on routing1 delete removed hostcache (I'm not sure if this setp is
>>>>>>>>> neccessary but someone in the mailing list explained it has to be 
>>>>>>>>> done)
>>>>>>>>> # rm /var/lib/heartbeat/delhostcache
>>>>>>>>> * on routing1 add routing2 again
>>>>>>>>> # /usr/lib/heartbeat/hb_addnode routing2
>>>>>>>>> * start heartbeat on routing2
>>>>>>>>>
>>>>>>>>> Finished .
>>>>>>>>>
>>>>>>>>> What i really find stupid about the whole proccedure:
>>>>>>>>> * the assumption the UUID file (/var/lib/heartbeat/hb_uuid) should can
>>>>>>>>> be used on the spare part is probably never the case (except you
>>>>>>>>> perform a planned replacement ... )
>>>>>>>> See note above...
>>>>>>>>
>>>>>>>>> * this assumption does not work well if the spare part is installed to
>>>>>>>>> be a replacement for different cluster nodes. The UUDI is created
>>>>>>>>> on the veiry first install of heartbeat (and thus is not part of my
>>>>>>>>> configuration data). It would be a cofiguration hell to "save all
>>>>>>>>> UUID of all clusters after cluster actvation" on a system with a
>>>>>>>>> couple nodes
>>>>>>>> It's already saved for you - in two places on every machine...
>>>>>>>>
>>>>>>>> What's missing is the conversion from ASCII to binary.  Could you make 
>>>>>>>> a
>>>>>>>> bugzilla for that and assign it to me?
>>>>>>>>
>>>>>>> been there done that:
>>>>>>>   crm_uuid -w
>>>>> Tthis command returns the ASCII UUID of the currenl

Re: [Linux-HA] WARN: G_SIG_dispatch

2007-03-23 Thread Alan Robertson
Paulo F. Andrade wrote:
> Hello,
> 
> I'm running heartbeat-2.0.8 on a couple of Debian systems.
> 
> And I just noticed that my log files are filled with this message:
> 
> "WARN: G_SIG_dispatch: Dispatch function for SIGCHLD took too long to
> execute: 20 ms (> 10 ms) (GSource: 0x805f880)"
> 
> Their occurring with sporadic intervals some times just seconds between
> them, others a few minutes. What does it mean?


It means I'm being too grumpy on how fast things are happening.

Do all yours say 20 ms?


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] heartbeat 2.0.8 and apache 2, apache fails back when auto_failback is off

2007-03-23 Thread Alan Robertson
Jon Savian wrote:
> I am trying to configure ipfailover for apache in an active/passive
> configuration.  Some lines from my /etc/ha.d/ha.cf file:
> 
> node node1 node2
> auto_failback off
> respawn hacluster /usr/lib64/heartbeat/ipfail
> 
> I generated (using the provided python script) my cib.xml from a
> working haresources file:
> node1 xxx.xxx.xxx.xxx apache
> 
> So now i startup heartbeat on node1 and node2, then use crm_mon to
> monitor these nodes.  Everything looks fine, so i "service heartbeat
> stop" on node1 and node2 takes over apache as it should.
> 
> Great!
> 
> So now i do "service heartbeat start" on node1.  Now apache is auto
> failbacked to node1, while my "Current DC" says node2.
> 
> How should i get apache to stay on node2 instead of reverting to
> node1.  Is there anything i need to add configuration wise?

There is a parameter called resource_stickiness which determines the
propensity for a resource to stay where it is.

You can either set it globally, or separately for each resource.


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [EMAIL PROTECTED]: [Linux-HA] WARN: G_SIG_dispatch]

2007-03-24 Thread Alan Robertson
Dejan Muhamedagic wrote:
> Hi Alan,
> 
> I also wondered about this before.
> 
> The dispatch function is what exactly: a signal handler set by the
> user (i.e. crm, cib, etc), right?
> 
> Can you clue me in? Some web resources?

The mainloop docs kinda suck...

Here's what is going on with respect to the mainloop and these messages...

Mainloop has a simple and uniform paradigm for dealing with events from
sources.

   prepare function -- do whatever you need to do to get ready to
  detect events
   ***mainloop code issues the poll(2) system call to wait
  for events for "long enough"
   check function -- did any events come in?  If so, then return TRUE
   ***mainloop code looks at all the event sources that returned TRUE
  and calls the dispatch functions for them in priority order
   dispatch function -- called if either the prepare or check function
  returns TRUE
  Translating:  Called when your event occurs

We handle death-of-child as a signal.  We have a mainloop source for
handling signals.  Although a signal would interrupt the poll system
call, there is a timing window:  if a signal comes in after the signal
source's prepare function was called and before the poll system call is
issued, we might sleep forever waiting for a signal which we really
already know we have gotten, because there's no way to have poll() poll
on variables being set as well as file descriptors...

So, we put an upper bound on how long we will stay in poll() (the "long
enough" above) to work around this timing window (this is currently 1
second)
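
A rough sketch of that timing window and the bounded poll(), in plain C rather than the real mainloop code. The names (install_sigchld_flag, wait_for_sigchld) are made up for illustration; the only things taken from the description above are that the signal handler just sets a flag and that poll() is never allowed to sleep longer than a fixed bound.

```c
#include <poll.h>
#include <signal.h>
#include <stddef.h>

/* Set by the handler; poll() cannot watch a variable, only fds. */
static volatile sig_atomic_t sigchld_pending = 0;

static void on_sigchld(int signo)
{
    (void)signo;
    sigchld_pending = 1;
}

/* Install the flag-setting handler.  Returns 0 on success. */
int install_sigchld_flag(void)
{
    return signal(SIGCHLD, on_sigchld) == SIG_ERR ? -1 : 0;
}

/*
 * One pass of a simplified loop: "prepare", poll(), "check".
 * Returns 1 if a SIGCHLD was noticed (the dispatch step would run),
 * 0 if the bounded wait expired with nothing to do.
 */
int wait_for_sigchld(int max_wait_ms)
{
    /* prepare: already pending?  Then there is no need to sleep. */
    if (sigchld_pending) {
        sigchld_pending = 0;
        return 1;
    }

    /*
     * Timing window: a signal delivered right here sets the flag but
     * does not interrupt the poll() below, since poll() has not been
     * entered yet.  The timeout bounds how long we can miss it.
     */
    (void)poll(NULL, 0, max_wait_ms);

    /* check: did the signal arrive before or during the sleep? */
    if (sigchld_pending) {
        sigchld_pending = 0;
        return 1;
    }
    return 0;
}
```

With max_wait_ms set to 1000 this corresponds to the 1 second bound: a signal that lands in the window is noticed at most about a second late instead of never.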

Continuing on...
We wrote code to hook the code for SIGCHLD in with our process tracking
(proctrack) code so that all death-of-child events are handled generally
- and a higher-level dispatch function is all the caller has to supply.
 All the scaffolding and glue between the signal handler and the
proctrack "this particular child process died" is automatically put up
and handled - which is quite nice.

Now in the past we had some troubles with some beliefs that certain
events weren't being handled fast enough - and there was a certain
amount of finger-pointing going on about it.  So, we put in some code in
our library code to track various things that we might have been doing
wrong to make things not be handled fast enough.

One of those is the time to handle the death-of-child signal (how long
the dispatch function runs).

Because such beliefs might come up again, and it was a lot of trouble to
put all this checking code in place, it hasn't been removed (and
shouldn't be).

But, it should be toned down some to not be so "fast-twitch" -
especially the death-of-child handling - whose limit is currently only
one clock tick (10 ms).
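
Roughly what such a check amounts to, as a stand-alone sketch. This is not the heartbeat library code: timed_dispatch, sleepy_dispatch and the use of clock_gettime() are assumptions made for illustration only.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() and nanosleep() */
#endif
#include <stdio.h>
#include <time.h>

typedef void (*dispatch_fn)(void *user_data);

/* Milliseconds between two CLOCK_MONOTONIC readings. */
static long elapsed_ms(const struct timespec *start,
                       const struct timespec *end)
{
    return (long)(end->tv_sec - start->tv_sec) * 1000L
         + (end->tv_nsec - start->tv_nsec) / 1000000L;
}

/*
 * Run one dispatch function, measure how long it took, and complain
 * on stderr if it exceeded warn_ms.  Returns the elapsed time in ms.
 */
long timed_dispatch(dispatch_fn fn, void *user_data, long warn_ms)
{
    struct timespec start, end;
    long ms;

    clock_gettime(CLOCK_MONOTONIC, &start);
    fn(user_data);
    clock_gettime(CLOCK_MONOTONIC, &end);

    ms = elapsed_ms(&start, &end);
    if (ms > warn_ms) {
        fprintf(stderr,
                "WARN: dispatch took too long to execute: "
                "%ld ms (> %ld ms)\n", ms, warn_ms);
    }
    return ms;
}

/* Hypothetical dispatch function: pretends to work for *user_data ms. */
void sleepy_dispatch(void *user_data)
{
    long ms = *(const long *)user_data;
    struct timespec delay;

    delay.tv_sec = ms / 1000L;
    delay.tv_nsec = (ms % 1000L) * 1000000L;
    nanosleep(&delay, NULL);
}
```

With warn_ms at 10, a dispatch that takes about 20 ms produces a warning of the same shape as the G_SIG_dispatch ones; with a limit of 30 it would stay quiet.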

I'd like to get this down to the point that only a few of these occur
per day under normal circumstances.

But what it _does_ mean when these happen is that things are running
slow on the particular host.  If the observed delays were really
high, other things would start timing out.


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] WARN: G_SIG_dispatch

2007-03-24 Thread Alan Robertson
Paulo F. Andrade wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> 99% of them say 20 ms. In the last couple of days there are a few (1
> every 100 warnings, more or less) that go beyond, 50, 80 or 90 ms. But
> these are very rare compared to the 20 ms warning that's happening all
> the time.

OK.  I raised the threshold to 30ms instead of 10ms.

The patch is here:
http://hg.linux-ha.org/dev/rev/e201e971f226

Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Documentation for constraints

2007-03-24 Thread Alan Robertson
Dejan Muhamedagic wrote:
> On Fri, Mar 23, 2007 at 02:08:17PM +0100, Andrew Beekhof wrote:
>> On Mar 22, 2007, at 7:40 PM, Dejan Muhamedagic wrote:
>>
>>> On Thu, Mar 22, 2007 at 06:15:41PM +0100, Ragnar Kjørstad wrote:
>>>> On Thu, Mar 22, 2007 at 02:05:57PM +0100, Dejan Muhamedagic wrote:
>>>>
>>>>> On Thu, Mar 22, 2007 at 02:21:23AM +0100, Ragnar Kjørstad wrote:
>>>> BTW: I just noticed that the DTD is also checked into mercury. So  
>>>> which
>>>> is the authoriative version? mercury or wiki?
>>> i think it's wiki, but i'm really not sure. alan?
>> mercurial is
> 
> right. is it possible to edit the dtd in wiki?

I'm pretty sure it is.

> it shouldn't be.

It has to be _possible_ to edit it in the wiki, or it could never be
changed from Mercurial ;-).  On the other hand, it could be replaced
with links to various versions in Hg - maybe one link per recent version
of heartbeat plus one link pointing to the tip.

At this point, I think some corrections have probably been made to the
version in the Wiki that aren't in the Mercurial version.  The reverse
is probably true also...

Andrew??


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] pingd configuration

2007-03-24 Thread Alan Robertson
Modar Nakshbandi wrote:
> Hi all,
> I have upgraded my haresources to cib.xml
> but the problem was that when I remove a network cable no failover happens
> what i need is a simple virtual ip cofiguration in heartbeat v2
> I was using "ping defaultrouter" in v1 to check the health of the
> connection
> 
> any one has a simple template for v2 ip failover please send it to me

It's described in the web site, and also in a tutorial.  If you had
searched using the "Enter search here" site search box on the web site,
this would be the first match:

http://linux-ha.org/v2/faq/pingd

It's what you need.

I just added a simple alias to it in the web site:
http://linux-ha.org/pingd
And I added it to the list of common searches in the right-hand column
(replacing the ipfail link).

-- 
Alan Robertson <[EMAIL PROTECTED]>


XML tools? WAS Re: [Linux-HA] Documentation for constraints

2007-03-25 Thread Alan Robertson
Ragnar Kjørstad wrote:
> On Sat, Mar 24, 2007 at 06:58:48PM -0600, Alan Robertson wrote:
>>>>>> BTW: I just noticed that the DTD is also checked into mercury. So  
>>>>>> which
>>>>>> is the authoriative version? mercury or wiki?
>>>>> i think it's wiki, but i'm really not sure. alan?
>>>> mercurial is
>>> right. is it possible to edit the dtd in wiki?
>> I'm pretty sure it is.
>>
>> ...
>>
>> At this point, I think some corrections have probably been made to the
>> version in the Wiki that aren't in the Mercurial version.  The reverse
>> is probably true also...
> 
> Yes, both are true.
> 
> There are a lot of text in the annotated wiki version not available in
> the Mercurial version, and there are features in the Mercurial version
> that are missing in the wiki.
> 
> Great.
> 
> I assume the Mercurial version was inserted into the wiki at some point?
> how?
> 
> If we can transfer it back to xml format it should be possible to
> perform some kind of merge.

It's worth noting that the annotated DTD in the wiki has features
missing from the online version - like a table of contents, and so on.

If we fix the two to be consistent, they will just get out of sync again.

Since the DTD is regarded as the primary documentation source for R2,
and the wiki version is hard enough to navigate, losing those features
would make it a poorer source of documentation.  And, we certainly don't
need that.

Are there good XML tools for doing table of contents, or indexing or
something like that?

Or could we write a tool to grab the DTD and slap it into the wiki in
wiki format?

Here's what the wiki format looks like:
http://wiki.linux-ha.org/v2/dtd1.0/annotated?action=raw



-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] why number of nodes < 16?

2007-03-26 Thread Alan Robertson
tanghy wrote:
> Hi,
> 
> I am reading Alan's tutorial of Linux-HA. I am wondering why it is
> said in the tutorial that number of
> suppported cluster nodes is less than 16?
> What causes it not able to extend to larger size?

The near-term bottleneck is an internal limit on packet size.  The
bigger your cluster, the bigger the CIB, and the larger the packets it
wants to send.

http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1339

There is no specific limit on cluster size - just a practical one given
the current implementation.

This is very fixable, it's just that no one has the time to fix it right
now.

The longer-term limit is expected to be in the protocol itself.

This could be solved by extending the ideas described in bug 1417...

http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1339


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Replacing Existing Node

2007-03-26 Thread Alan Robertson
Max Hofer wrote:
> On Saturday 24 March 2007 12:14, Ragnar Kjørstad wrote:
>> On Fri, Mar 23, 2007 at 09:31:24PM -0600, Alan Robertson wrote:
>>>> IMHO the default value of hbgenmethod to "file" is a bad choice (if it 
>>>> can not be turned off).
>>> It's very simple to work around, but it is irritating.
>>>
>>> Ragnar Kjørstad gave a suggestion from which would fix this without much
>>> less exposure than turning .  He gave me a patch for it...
>> And Max, if you want to test it out before it's merged, it's available
>> at http://ragnark.vestdata.no/HB_VERS_FILE-2.0.8.patch.
>>
> I will try it out. 
> 
> Can you explain me what it does? 
> 
> (from the patch context it's not visible for me and i don't have the time 
> to dig into the code ATM).
> 
> What i would like to achive is replacing a cluster node without saving any
> files from the old cluster node.
> 
> The UUID problem i solved. The hbgen method is set ATM to "time" which
> fixes the problem for me. (I hope this is save against time changes
> for winter/summer time)


Yes.  The patch just constrains you so that time cannot have gone
backwards (in UTC, not local time), between the time the cluster was
installed and the time the cluster was re-installed.

This is VERY unlikely to happen.


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Hard story how i install HA with GUI in to slackware 11.0

2007-03-26 Thread Alan Robertson
Alex Orlov wrote:
> Hard story how i install HA with GUI in to slackware 11.0 :)
>> http://www.linux-ha.org/GuiGuide

Do you want to put this up on the website?

Could you kindly name it:
http://wiki.linux-ha.org/HOWTO/Slackware
which means it will show up on the web site here:
http://linux-ha.org/HOWTO/Slackware

If you would do this, I would much appreciate it.

Also, you might enjoy this page:
http://linux-ha.org/Education/Newbie/IPaddrScreencast

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Weird auto_failback behaviour

2007-03-26 Thread Alan Robertson
Michael Dodd wrote:
> We're setting up an LVS cluster with two directors for failover.  I'm
> having a hard time getting the behaviour I want from the directors. 
> When the master fails and the secondary picks up, I don't want the
> master to take back resources when it comes back up.  We're using
> co-equal boxes for directors so there's no reason for us to go through
> the service interruption of moving resources back to the failed primary.
> 
> Right now, if we test it be stopping heartbeat ( /etc/init.d/heartbeat
> stop ) on one of the directors, everything seems to work exactly like we
> want it to.
> 
> If we simulate an outage by pulling the ethernet cable on the
> broadcasting nic,   Failover happens correctly, but when we plug the NIC
> back in, that machine seems to take resources back over again.

That would probably be because you've created a split-brain situation,
and heartbeat is recovering from it by restarting the services on both
machines.

http://linux-ha.org/SplitBrain

Generally, you want to avoid a split-brain condition.  If you have
shared storage you REALLY want to avoid it - since it will trash your
data.  http://linux-ha.org/BadThingsWillHappen


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] why number of nodes < 16?

2007-03-26 Thread Alan Robertson
Andrew Beekhof wrote:
> 
> On Mar 26, 2007, at 12:10 PM, tanghy wrote:
> 
>> Hi,
>>
>> I am reading Alan's tutorial of Linux-HA.
>> I am wondering why it is said in the tutorial that number of
>> suppported cluster nodes is less than 16?
>> What causes it not able to extend to larger size?
> 
> basically just the amount of testing we've done.  we've had reports of
> people using more than 16

The highest number we've seen it work successfully with is a few over 20.  BUT
this depends heavily on what's in your CIB.  With a big CIB (lots of
resources, lots of history in the status section), it will cause
problems sooner than it will if you have a smaller CIB.

The limit is really on CIB size.


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Heartbeat compatibility question

2007-03-26 Thread Alan Robertson
Dejan Muhamedagic wrote:
> Hi,
> 
> On Mon, Mar 26, 2007 at 06:42:05PM +0200, Patrick Begou wrote:
>> Hi,
>>
>> Please, is there a backward compatibility between version 2.0.8 and 
>> version 1.2.x of heartbeat ?
>>
>> I'm in a crazy situation: my HA conf runing with Debian Sarge crashes 
>> every one or two weeks (one node or the other) and I'm unable to 
>> understand why for more than one year (changing kernel, configs, drbd 
>> module...etc) ! Many Debian users have told me than the 64bits of debian 
>> is not well maintained and was not a good choice.
>>
>> I'm tired to test new solutions not working better than the previous 
>> ones on these servers in production! At this point I want to change my 
>> OS (may be FC6 X86_64) to use recent versions of heartbeat, drbd, nfs
>> But in an intermediate state I will have one node (active) in 
>> sarge+heartbeat 1.2 and one node (slave for synchronisation) in FC6 and 
>> heartbeat 2.08
> 
> I think that it should work. The newer releases do support old v1
> style configurations. I'd be carefull with it though, because I
> suppose that not many people run such a mix.
> 
>> This week-end my two servers crashes and as I didn't know wich crashes 
>> first I restarted the wrong one first and lost critical datas!
> 
> Hmm, are you sure that your hardware is good and that it is well
> supported under Linux? Haven't you been able to find the reason
> for your computers crashing so often? BTW, you can always try the
> vanilla kernel and then bother people on the kernel list ;-)

Heartbeat is designed so that this should work for every pair of
releases, and it is known to work for many cases where the two releases
are different.

However, as Dejan points out, no one has gone out there and tested
your particular combination of machines.

There was one particular problem we created in this regard, which I'll
mention...

If you have your host names in upper case, there are certain version
boundaries you can't cross without first changing your host names to
lower case.

Once you cross that boundary, it doesn't matter what case your host
names are in.
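
A quick way to see whether this applies to you is to look for node names that are not already all lower case. The following is only a sketch: it assumes the node names come from `node` directives in ha.cf-style text, and the function name `uppercase_nodes` is made up for illustration, not part of Heartbeat.

```python
def uppercase_nodes(ha_cf_text):
    """Return node names from 'node' directives that are not all lower case."""
    flagged = []
    for raw in ha_cf_text.splitlines():
        # Drop anything after a '#' comment marker, then surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "node":
            # A node directive may list more than one name.
            flagged.extend(name for name in parts[1:] if name != name.lower())
    return flagged


sample = "bcast eth1\nnode Node1\nnode node2\n"
print(uppercase_nodes(sample))  # ['Node1']
```

Any name it reports would need to be changed to lower case before crossing one of those version boundaries.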


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] pingd not failing over

2007-03-26 Thread Alan Robertson
value 'true' for cluster option 'startup-fencing'
> Mar 26 08:16:04 node1 pengine: [20392]: info: determine_online_status:
> Node node1 is online
> Mar 26 08:16:04 node1 pengine: [20392]: info: determine_online_status:
> Node node2 is online
> Mar 26 08:16:04 node1 pengine: [20392]: info: group_print: Resource
> Group: group_my_cluster

You're trying to start pingd in two ways - both by the respawn
directive, and also as a resource.

You can't do that.

And, you're not using the attribute that pingd is creating in your CIB.

See http://linux-ha.org/pingd for a sample rule to use a pingd attribute
- or you can see my linux-ha tutorial for similar information:
http://linux-ha.org/HeartbeatTutorials - first tutorial listed
starting at about slide 139...

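Here's the example from the pingd page:

  [XML example stripped by the list archive; the fragment that survives
  in later quotes is: attribute="pingd_score" operation="not_defined"/>]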

In fact, I'm not 100% sure it's right...


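I think the example from the tutorial is a little more general...

  [XML example stripped by the list archive; the fragments that survive
  in later quotes are: score_attribute="pingd" and
  attribute="pingd" operation="defined"/>]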

What this rule says is:

For resource "my_resource", add the value of the pingd attribute
to the score for locating my_resource on a given
machine.
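
As an illustration of what such a rule looks like, here is a small sketch that builds one with Python's standard XML library and prints it. The element names (rsc_location, rule, expression) and the score_attribute/attribute/operation attributes follow the Heartbeat 2 CIB as I understand it; the id values and the function name are made up for the example and are not taken from the tutorial.

```python
import xml.etree.ElementTree as ET


def pingd_location_rule(resource, attribute="pingd"):
    """Build a location constraint that adds the pingd attribute to the score."""
    # The id values below are hypothetical; any unique ids would do.
    loc = ET.Element("rsc_location", id=f"{resource}_connected", rsc=resource)
    rule = ET.SubElement(
        loc, "rule",
        id=f"{resource}_connected_rule",
        score_attribute=attribute,  # use the attribute's value as the score
    )
    ET.SubElement(
        rule, "expression",
        id=f"{resource}_connected_expr",
        attribute=attribute,
        operation="defined",  # only applies on nodes where pingd set a value
    )
    return loc


print(ET.tostring(pingd_location_rule("my_resource"), encoding="unicode"))
```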

For your example flags to pingd, you use a multiplier (-m flag) of 100,
so having access to 0 ping nodes is worth zero, 1 ping node is worth
100 points, 2 ping nodes are worth 200 points, and so on...

So, if one node has access to a ping node and the other one does not
have access to a ping node, then the first node would get 100 added to
its location score, and the second node would have an unchanged location
score.

If the second node otherwise scored as much as 99 points higher than the
first node, the cluster would still locate the resource on the first node.
If you don't like that, you can change your ping count multiplier, write a
different rule, or add a rule.

You can change how much ping node access is worth with the -m flag, or
the "multiplier" attribute in the pingd resource.  Note that you didn't
supply a multiplier attribute in your pingd resource - so it would
default to 1 -- probably not what you wanted...
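
The arithmetic above can be worked through in a few lines. This is a simplified sketch, not the policy engine's real algorithm: it assumes the only inputs are a base location score per node and the pingd contribution (multiplier times reachable ping nodes), and it ignores stickiness and every other rule. The function names are made up.

```python
def pingd_score(reachable_ping_nodes, multiplier=1):
    """Value pingd would contribute: multiplier times reachable ping nodes."""
    return multiplier * reachable_ping_nodes


def preferred_node(base_scores, reachable, multiplier=1):
    """Return (best node, totals) after adding the pingd contribution."""
    totals = {
        node: base_scores[node] + pingd_score(reachable[node], multiplier)
        for node in base_scores
    }
    return max(totals, key=totals.get), totals


# node2 starts 99 points ahead, but only node1 can reach a ping node.
best, totals = preferred_node(
    {"node1": 0, "node2": 99}, {"node1": 1, "node2": 0}, multiplier=100
)
print(best, totals)  # node1 {'node1': 100, 'node2': 99}
```

With the default multiplier of 1, the same inputs give node1 a total of 1 against 99 for node2, so ping connectivity would barely matter.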

And, don't run pingd twice - especially not with different parameters...

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Virtual IP alias gets incremented

2007-03-27 Thread Alan Robertson
Abdul Khader wrote:
> Hi All,
> I am running Heartbeat 1.2.3-2 on Fedora Core 3 with  2.6.11-1.35_FC3smp
> kernel. The problem is, when I do  failover multiple times, the virtual
> interface alias eth0:1 becomes eth0:2 . It keeps on increasing by one
> number on each failover. I have seen virtual alias as big as eth0:12
> 
> Any help or pointers would be great.

There's something on your system causing that.  What it is I couldn't
say.  But we do track which aliases are in use, and make sure we don't
grab one that's already in use.

Something in your environment is interfering with that logic.
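
The bookkeeping described above amounts to picking an alias number that isn't already taken. A minimal sketch of that idea follows. It is not the actual IPaddr code, and `next_free_alias` is a made-up name.

```python
def next_free_alias(interface, aliases_in_use):
    """Return the lowest-numbered alias of `interface` not already in use."""
    n = 0
    while f"{interface}:{n}" in aliases_in_use:
        n += 1
    return f"{interface}:{n}"


print(next_free_alias("eth0", {"eth0:0", "eth0:1"}))  # eth0:2
```

With logic like this, any alias that still appears to be in use at takeover time pushes the new address onto the next number. Whether that is what is happening on this system can't be told from the report.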


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] How to set this up correctly...?

2007-03-27 Thread Alan Robertson
Howard Yuan wrote:
> Okay...I got a working heartbeat setup...but...now I'm curious and
> can't figure a way around this. Wondering if you guys have any idea.
> 
> Currently, I have two SLES running DRBD and Heartbeat. Here are the
> configs:
> 
> Server A LAN 1: 192.168.15.30/255.255.255.0 LAN 2:
> 10.0.0.2/255.255.255.0
> 
> Server B LAN 1: 192.168.15.31/255.255.255.0 LAN 2:
> 10.0.0.4/255.255.255.0
> 
> LAN 1's is connected to the main network switch. LAN 2's are
> connected to each other via a crossover cable. Heartbeat is serving
> up MySQL and a floating IP address of 10.0.0.5/255.255.255.0.
> 
> The question I'm having is...if the crossover cable breaks for any
> reason...heartbeat never fails over (as the servers are technically
> still alive), but the service is no longer accessible (as they're
> trying to access it via 10.0.0.5).
> 
> What is the best way to get around this problem? I want to make it
> where the service is/will still be available whenever ONE of the
> connection is broken..as the fail over doesn't occur until both
> connections die. Any ideas?

Why does the service die when the crossover breaks?

I need more configuration information to give a detailed answer.

But, the short answer is "run an R2 configuration with pingd properly
configured for your problem"


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] haclient.py requires a port to login ?

2007-03-27 Thread Alan Robertson
Karl Hanzel wrote:
> 
> 'Just upgraded to heartbeat-2.0.8-2.el4.centos, and its new companion
> RPMs.
> 
> 'Found the new haclient.py under /usr/lib64/heartbeat-gui/.  It starts
> up fine, but upon login to my running cluster and supplying these args
> to the Login window: 127.0.0.1 / hacluster / hacluster's_pw (which
> always worked previously), i'm now getting:
> 
> -- 
>   Traceback (most recent call last):
> File "/usr/lib64/heartbeat-gui/haclient.py", line 1598, in on_login
>   if not manager.login(server, user, password):
> File "/usr/lib64/heartbeat-gui/haclient.py", line 1943, in login
>   ret = mgmt_connect(ip, username, password, port)
>   TypeError: mgmt_connect() argument 4 must be string, not None
> -- 
> 
> ...and it fails to login/connect.
> 
> If in the "Server(:port):" field of the Login window i specify
> "127.0.0.1:xyz" it makes the login without the above complaint.  That
> "xyz" can be literal... i seem to be able to supply anything (including
> a null string) there.
> 
> So what's the rub ... why do i/we now have to specify a port?  And if we
> do, what's the appropriate port to specify?

You don't have to supply a port...  You _can_ specify a port if you like...

You can check out the screencast here, and see what I mean...
http://linux-ha.org/Education/Newbie/IPaddrScreencast

In other words It Works For Me (tm)  ;-)



-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] pingd not failing over

2007-03-27 Thread Alan Robertson
Terry L. Inzauro wrote:
> Alan Robertson wrote:
>> Daniel Bray wrote:
>>> Hello List,
>>>
>>> I have been unable to get a 2 node active/passive cluster to
>>> auto-failover using pingd.  I was hoping someone could look over my
>>> configs and tell me what I'm missing.  I can manually fail the cluster
>>> over, and it will even auto-fail over if I stop heartbeat on one of the
>>> nodes.  But, what I would like to have happen, is when I unplug the
>>> network cable from node1, everything auto-fails over to node2 and stays
>>> there until I manually fail it back.
>>>
>>> #/etc/ha.d/ha.cf
>>> udpport 6901
>>> autojoin any
>>> crm true
>>> bcast eth1
>>> node node1
>>> node node2
>>> respawn root /sbin/evmsd
>>> apiauth evms uid=hacluster,root
>>> ping 192.168.1.1
>>> respawn root /usr/lib/heartbeat/pingd -m 100 -d 5s
>>>
>>> #/var/lib/heartbeat/crm/cib.xml
>>>   >> ignore_dtd="false" ccm_transition="14" num_peers="2"
>>> cib_feature_revision="1.3"
>>> dc_uuid="e88ed713-ba7b-4c42-8a38-983eada05adb" epoch="14"
>>> num_updates="330" cib-last-written="Mon Mar 26 10:48:31 2007">
>>>
>>>  
>>>
>>>  
>>>>> value="True"/>
>>>>> id="cib-bootstrap-options-symmetric-cluster" value="True"/>
>>>>> name="default-action-timeout" value="60s"/>
>>>>> id="cib-bootstrap-options-default-resource-failure-stickiness"
>>> name="default-resource-failure-stickiness" value="-500"/>
>>>>> id="cib-bootstrap-options-default-resource-stickiness"
>>> name="default-resource-stickiness" value="INFINITY"/>
>>>>> id="cib-bootstrap-options-last-lrm-refresh" value="1174833528"/>
>>>  
>>>
>>>  
>>>  
>>>>> id="e88ed713-ba7b-4c42-8a38-983eada05adb">
>>>  >> id="nodes-e88ed713-ba7b-4c42-8a38-983eada05adb">
>>>
>>>  >> id="standby-e88ed713-ba7b-4c42-8a38-983eada05adb" value="off"/>
>>>
>>>  
>>>
>>>>> id="f6774ed6-4e03-4eb1-9e4a-8aea20c4ee8e">
>>>  >> id="nodes-f6774ed6-4e03-4eb1-9e4a-8aea20c4ee8e">
>>>
>>>  >> id="standby-f6774ed6-4e03-4eb1-9e4a-8aea20c4ee8e" value="off"/>
>>>
>>>  
>>>
>>>  
>>>  
>>>>> resource_stickiness="INFINITY" id="group_my_cluster">
>>>  >> id="resource_my_cluster-data">
>>>>> id="resource_my_cluster-data_instance_attrs">
>>>  
>>>>> id="resource_my_cluster-data_target_role" value="started"/>
>>>>> name="device" value="/dev/sdb1"/>
>>>>> id="9e0a0246-e5cb-4261-9916-ad967772c80b" value="/data"/>
>>>>> name="fstype" value="ext3"/>
>>>  
>>>
>>>  
>>>  >> type="IPaddr" provider="heartbeat">
>>>>> id="resource_my_cluster-IP_instance_attrs">
>>>  
>>>>> name="target_role" value="started"/>
>>>>> name="ip" value="101.202.43.251"/>
>>>  
>>>
>>>  
>>>  >> id="resource_my_cluster-pingd">
>>>>> id="resource_my_cluster-pingd_instance_attrs">
>>>  
>>>>> id="resource_my_cluster-pingd_target_role" value="started"/>
>>>>> name="host_list" value="node1,node2"/>
>>>  
>>>
>>>
>>>  &g

Re: [Linux-HA] pingd not failing over

2007-03-27 Thread Alan Robertson
Daniel Bray wrote:
> On Tue, 2007-03-27 at 10:13 +0200, Andrew Beekhof wrote:
> 
>> You're trying to start pingd by two ways - both by the respawn
>>  directive, and also as a resource.
>>
>>  You can't do that.
>>
>>  And, you're not using the attribute that pingd is creating in your  
>>  CIB.
>>
>>  See http://linux-ha.org/pingd for a sample rule to use a pingd  
>>  attribute
>>  - or you can see my linux-ha tutorial for similar information:
>> http://linux-ha.org/HeartbeatTutorials - first tutorial listed
>> starting at about slide 139...
>>
>>  Here's the example from the pingd page:
>>
>>  
>> 
>>>attribute="pingd_score" operation="not_defined"/>
>> 
>>  
>>
>>  In fact, I'm not 100% sure it's right...
>>
>> it does exactly what the title claims it will:
>>  "Only Run my_resource on Nodes with Access to a Single Ping Node"
>>
>> there are other examples on that page that cover more complicated  
>> scenarios, complete with worked solutions
>>
>>
>>  I think the example from the tutorial is a little more general...
>>
>>  
>>   > score_attribute="pingd" >
>> > attribute="pingd"
>> operation="defined"/>
>>   
>>  
>>
>>
>>  What this rule says is:
>>
>> For resource "my_resource", add the value of the pingd attribute
>> to the amount score for locating my_resource on a given
>> machine.
>>
>>  For your example flags to pingd, you use a multiplier (-m flag) of  
>>  100,
>>  so having access to 0 ping nodes is worth zero, 1 ping nodes is worth
>>  100 points, 2 ping nodes is worth 200 points, and so on...
>>
>>  So, if one node has access to a ping node and the other one does not
>>  have access to a ping node, then the first node would get 100 added to
>>  its location score, and the second node would have an unchanged  
>>  location
>>  score.
>>
>>  If the second node scored as much as 99 points higher than the  
>>  first
>>  node, it would locate the resource on the first node.  If you don't  
>>  like
>>  that, you can change your ping count multiplier, write a different  
>>  rule,
>>  or add a rule.
>>
>>  You can change how much ping node access is worth with the -m flag, or
>>  the "multiplier" attribute in the pingd resource.  Note that you  
>>  didn't
>>  supply a multiplier attribute in your pingd resource - so it would
>>  default to 1 -- probably not what you wanted...
>>
>>  And, don't run pingd twice - especially not with different  
>>  parameters...
> 
> 
> 
> Thanks for the detailed feedback!  
> 
> 
> I've made the alterations you suggested, via the GUI, but saw some
> issues.  I think one of my biggest issues here, is all the documentation
> points to creating XML files by hand, and then "importing" them into the
> cluster.  What I'm trying to do, is strictly use the GUI, so that I can
> do some video documentation for the rest of my team.  Once I have
> everything working, I will distribute the videos so that my team members
> can learn from this.  Unfortunately, things don't always work as
> expected from within the GUI.
> 
> For instance, in your example above:
> 
> 
>
>  attribute="pingd_score" operation="not_defined"/>
>
> 
> 
> You don't have a "value" set for the expression.  In the GUI, it will
> not let you add an expression like that, unless you give it a value.
> Here are my updated config files.

The GUI is running behind where we'd like for it to be, unfortunately
:-(  We added a bunch of features and changed a number of things at a
time when there was little effort being put into the GUI.

By the way, you might be interested in the newly-founded education
sub-project:
http://linux-ha.org/Education
which has a screencast (video) for a simple GUI operation, here:
http://linux-ha.org/Education/Newbie/IPaddrScreencast
I haven't finished adding all the audio for it.  Maybe I can do that today?

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Weird auto_failback behaviour

2007-03-27 Thread Alan Robertson
Michael Dodd wrote:
> Dejan Muhamedagic wrote:
>> On Tue, Mar 27, 2007 at 04:08:27PM +0200, Max Hofer wrote:
>>  
>>> On Tuesday 27 March 2007 13:11, Dejan Muhamedagic wrote:
>>>
>>>> On Tue, Mar 27, 2007 at 10:17:22AM +0200, Andrew Beekhof wrote:
>>>>  
>>>>> On Mar 27, 2007, at 1:19 AM, Michael Dodd wrote:
>>>>>
>>>>>
>>>>>> Alan Robertson wrote:
>>>>>>  
>>>>>>> That would probably be because you've created a split-brain 
>>>>>>> situation,
>>>>>>> and heartbeat is recovering from it by restarting the services
>>>>>>> on  both
>>>>>>> machines.
>>>>>>>
>>>>>>> http://linux-ha.org/SplitBrain
>>>>>>>
>>>>>>> Generally, you want to avoid a split-brain condition.  If you have
>>>>>>> shared storage you REALLY want to avoid it - since it will trash
>>>>>>> your
>>>>>>> data.  http://linux-ha.org/BadThingsWillHappen
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>> Thanks-I wondered if that's what was happening.
>>>>>>
>>>>>> Am I going to need to get STONITH configured for this?  We're not 
>>>>>> doing any kind of resource sharing on realservers, so I'd like to 
>>>>>> avoid the complexity there.   We're looking for something similar
>>>>>> to  what Daniel Bray has mentioned in his recent mail to the list,
>>>>>> but  ideally I'd like to avoid the added complexity of having to
>>>>>> maintain  cib.xml.
>>>>>>   
>>>>> maintain?
>>>>> sure, it's a bit more complex to set up, but what do you mean by
>>>>> maintain?
>>>>> 
>>>> as a matter of fact, you'll be so much better off with the crm
>>>> based cluster (v2) when it comes to maintenance. v1 is definitely
>>>> easier to start with, but once you get the v2 going you'll find it
>>>> more enjoyable for administration.
>>>>   
>>> I agree with you from the point of view of a cluster system
>>> designer/tester but I disagree from the point of view of a customer
>>> (the person who bought the cluster).
>>> 
>>
>> hmm. does the customer have skilled personnel? after all, whoever's
>> going to manage a cluster (any kind of cluster) has to have
>> certain admin skills. it's definitely not like getting a household
>> appliance.
>>
>>  
>>> Lets see what operations a normal sysadmin had to do with heartbeat v1
>>> and compare it to v2:
>>>
>>> heartbeat v1:
>>> * start/stop heartbeat
>>> * make a node standby --> forced switchover to the other node
>>> 
>>
>> yes, and that's more or less _all_ it can do.
>>   
> In our example, this is really all we're looking for.  Are there things
> we'll get from v2 that would  make the additional learning curve
> worthwhile?

The ability to monitor resources.  The ability to have clusters with
more than 2 nodes.


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] How to set this up correctly...?

2007-03-27 Thread Alan Robertson
Howard Yuan wrote:
> "Why does the service die when the crossover breaks?" Because I was
> planning on telling the services to look for MySQL on 10.0.0.5 (the
> floating IP) and if the crossover link breaks, the systems don't know
> how to get to 10.0.0.5 as they're looking to find it via the
> crossover link (LAN 2).

OK.  I understand that.  So, mysql needs connectivity via the crossover
cable.

> "I need more configuration information to give a detailed answer." 
> Hum...what information would you need to understand it better? I can
> try to draw an ASCII picture of the network diagram if you need me
> to.

No, I think that was enough for now.


> "But, the short answer is 'run an R2 configuration with pingd
> properly configured for your problem'" I looked at this for awhile
> and I can't figure out what you mean by "R2."

I mean a "crm yes" configuration.

> Also, on my heartbeat configuration right now, i'm using "crm no" to
> use ipfail, as I found on the mailing list that someone said that
> ipfail doesn't work with crm. Does CRM include a replacement for
> ipfail that works better?

It's called pingd ;-).

I'm not sure that either toolset precisely addresses your problem.

What you likely really want is this:

    If one node is up, run all services there.
    If both nodes are up, and the two nodes can't talk across
    the crossover, then:
        run both services on the machine with better
        connectivity to your clients

It's the dual-connectivity test that you'd really like to have that
pingd won't really handle.

I believe that pingd treats all ping nodes the same.  But, to truly
solve this problem, you need to treat outside ping nodes differently
from inside ping nodes.

You _can_ solve the problem in R2, you'll just have to write your own
pingd replacement - since it doesn't have to be general, it'll be easy
enough to write, but you'll still have to do it...  You could do it all
in the shell if you like...

Then you only tell heartbeat about one of the sets of ping nodes, and
not the other set, and your tool would manage the other set.
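
The decision rule described above could look something like the following. It is a sketch under assumptions, not a pingd replacement: how you find out whether a node is up, whether the crossover works, and how many outside ping nodes each machine can reach is left to your own tool, and all the names are hypothetical.

```python
def choose_node(node_up, crossover_ok, client_reach):
    """Pick the node that should run all services, or None for no preference.

    node_up:      dict of node name -> True if the node is up
    crossover_ok: True if the two nodes can talk across the crossover
    client_reach: dict of node name -> number of outside ping nodes reachable
    """
    up = [name for name, alive in node_up.items() if alive]
    if not up:
        return None          # nothing is up, nothing to decide
    if len(up) == 1:
        return up[0]         # one node up: run all services there
    if not crossover_ok:
        # Both up but the crossover is broken: prefer the machine with
        # better connectivity to the clients.
        return max(up, key=lambda name: client_reach[name])
    return None              # both up and crossover fine: no special preference


print(choose_node({"a": True, "b": True}, False, {"a": 1, "b": 2}))  # b
```

The result would then have to be fed back to the cluster, for example as a node attribute that a location rule scores on.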

But, none of this is likely to make much sense to you unless you
understand the CRM's way of doing things through the rules in the CIB.

This is explained in some detail in my tutorial on R2.  It's the newest
tutorial in the http://linux-ha.org/HeartbeatTutorials page on the web
site.  The relevant section is slides 137-145.

If you haven't used R2 at all, then maybe reviewing the presentation
from the beginning would be good.  There is a 90 minute video covering
some basic things -- given from these slides - and it's linked to from
that same page.  If you have trouble viewing the video directly, then
try the abstract page - it has an embedded video viewer in Java at the
bottom of the web page.

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] why number of nodes < 16?

2007-03-27 Thread Alan Robertson
Aleksandar Lazic wrote:
> Hi,
> 
> On Mon 26.03.2007 18:50, Andrew Beekhof wrote:
>>
>>
>> basically just the amount of testing we've done.  we've had reports of
>> people using more than 16
> 
> Does anybody know who has done this? There is a company that
> needs more than 16 servers and is able to help.
> 
> As far as I know, they are also willing to pay for the help.

I know companies who could help this happen, and if it's done right,
we'd likely take the patches back into the base.

How big does it need to go?


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Replacing Existing Node

2007-03-27 Thread Alan Robertson
Andrew Beekhof wrote:
> 
> On Mar 21, 2007, at 12:55 PM, Alan Robertson wrote:
> 
>> Andrew Beekhof wrote:
>>> On 3/20/07, Alan Robertson <[EMAIL PROTECTED]> wrote:
>>>> Max Hofer wrote:
>>>>> OK,
>>>>>
>>>>> i lost a day just trying to figure out how to replace a cluster node with
>>>>> a spare part. I just thought someone else might need this info or maybe
>>>>> knows a better way than the one I used.
>>>>>
>>>>> Situation:
>>>>> - cluster with 2 nodes (routing1, routing2)
>>>>> - routing2 should be replaced with a spare part
>>>>> - routing1 and routing2 use a file system on a drbd to share
>>>>>   common data
>>>>>
>>>>> Precondition:
>>>>> - routing2 crashed and hb_uuid is not recoverable
>>>>
>>>> FYI: It's in the CIB, and also in the hb_uuid files on every machine.
>>>>
>>>>> - spare part is configured to not start heartbeat after power-on
>>>>>
>>>>> Steps I did:
>>>>> * replaced crashed routing2 with spare part (cabling etc.)
>>>>> * powered on routing2
>>>>> * on routing2 invalidate data on drbd device (---> sync from routing1
>>>>>   to routing2)
>>>>> * on routing1 delete routing2 (I found a bug that pingd resets to 0
>>>>>   when calling hb_delnode ---> see bug #1535)
>>>>>   # /usr/lib/heartbeat/hb_delnode routing2 && killall pingd
>>>>>   (!!!NOTE: if your cluster configuration triggers a failover on a pingd
>>>>>   failure, set the cluster in unmanaged mode, stop pingd, delete
>>>>>   the node and then restart pingd, setting the cluster in managed mode
>>>>>   again)
>>>>> * on routing1 delete removed hostcache (I'm not sure if this step is
>>>>>   necessary but someone on the mailing list explained it has to be
>>>>>   done)
>>>>>   # rm /var/lib/heartbeat/delhostcache
>>>>> * on routing1 add routing2 again
>>>>>   # /usr/lib/heartbeat/hb_addnode routing2
>>>>> * start heartbeat on routing2
>>>>>
>>>>> Finished.
>>>>>
>>>>> What i really find stupid about the whole procedure:
>>>>> * the assumption that the UUID file (/var/lib/heartbeat/hb_uuid) can
>>>>>   be used on the spare part probably never holds (except when you
>>>>>   perform a planned replacement ... )
>>>>
>>>> See note above...
>>>>
>>>>> * this assumption does not work well if the spare part is installed to
>>>>>   be a replacement for different cluster nodes. The UUID is created
>>>>>   on the very first install of heartbeat (and thus is not part of my
>>>>>   configuration data). It would be configuration hell to "save all
>>>>>   UUIDs of all clusters after cluster activation" on a system with a
>>>>>   couple of nodes
>>>>
>>>> It's already saved for you - in two places on every machine...
>>>>
>>>> What's missing is the conversion from ASCII to binary.  Could you make a
>>>> bugzilla for that and assign it to me?
>>>>
>>>
>>> been there done that:
>>>  crm_uuid -w
>>
>> Andrew:  Is there a man page or other documentation outside the command
>> for this?
> 
> it will be in the set novell is making available to us

Does "us" mean novell customers?

As a note, there does need to be a man page specifically, and it needs
to be created with the man page macros through roff.  Man is the
UNIX/Linux documentation standard, unfortunately.
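
For illustration, the ASCII-to-binary conversion mentioned earlier in this thread can be sketched in a few lines. This is only a sketch: it assumes hb_uuid stores the UUID as its raw 16 bytes in the standard field order, which is not confirmed anywhere in this thread, and the function names are made up. On a real cluster, the crm_uuid -w command quoted above is the tool to use.

```python
import uuid


def uuid_text_to_bytes(text):
    """Convert the ASCII form of a UUID (as seen in the CIB) to 16 raw bytes."""
    return uuid.UUID(text).bytes


def uuid_bytes_to_text(raw):
    """Convert 16 raw bytes back to the usual hyphenated ASCII form."""
    return str(uuid.UUID(bytes=raw))


raw = uuid_text_to_bytes("e88ed713-ba7b-4c42-8a38-983eada05adb")
print(len(raw))                 # 16
print(uuid_bytes_to_text(raw))  # e88ed713-ba7b-4c42-8a38-983eada05adb
```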


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Server IP takeover by slave PC

2007-03-27 Thread Alan Robertson
Tanveer Chowdhury wrote:
> Hi all: Thank you for all your help and support. This is what I am
> going to do after reading carefully the documentation.
> 
> Server PC will have 3 NICs, all three with real IPs. Only one
> NIC is aliased to support blocks of internal private network to
> access the Internet. You already know what scenario I am trying to
> achieve here, and I have also attached the document that I will follow with a
> little modification to serve my purpose.  My slave also has 3 NICs
> and will take on exactly the same IPs as the master server when the master
> fails.
> 
> 
> Now my steps are as follows: As heartbeat only looks for services
> which can be started/stopped using the service command, I made a service
> script and put it in /etc/rc./init.d and it works. Below is the
> script:
> 
> #!/bin/sh
> # /etc/init.d/ha
> #
> # some things that run always
> touch /var/lock/ha
> # carry out specific functions when asked to by the system
> case "$1" in
> start)
>     echo " starting ha script to Change IP of Slave"
>     ifconfig eth0 192.168.100.11 netmask 255.255.255.0 up
>     # ... all the IP change commands go here, and also the iptables rules, one by one
>     ;;
> stop)
>     echo " starting ha script to change IP back"
>     # ifconfig eth0 192.168.100.10 netmask 255.255.255.0 up
>     # ... all the IP change commands go here, and also the iptables rules, one by one
>     ;;
> *)
>     echo "Usage:: /etc/init.d/ha {start | stop}"
>     exit 1
>     ;;
> esac
> exit 0
> 
> Now this script has all the IP configuration, and I will put it on
> both master and slave. When the master starts, it sets the IPs as
> required; when the master fails and stops the service, it
> executes the stop function, and the slave takes over the virtual IP and
> executes this script again, setting the same IP configuration as the
> master, though initially the slave had different IP settings than the
> master. Am I right?
> 
> What do you think of this? Will this work, or am I making a mistake? 
> Please let me know. Waiting for your kind response.


Why don't you just use our IPaddr resource that we already supply?

Here's how to configure IPaddr (IPaddr2) using the GUI:
http://wiki.linux-ha.org/Education/Newbie/IPaddrScreencast

Hope that helps...

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Monitoring resources

2007-03-27 Thread Alan Robertson
Michael Fernández M. wrote:
> Hi
> 
> i have Debian Etch, with:
> 
> - heartbeat  2.0.7-2 
> - drbd0.7
> 
> and i need to monitor postfix, apache2, ldap and courier. i already
> have the cluster configured and it works without problems, but i need
> to activate a monitor. I saw something on linux-ha.org about CRM, but
> i do not know how to do it.
> 
> Is there a page with some details about the implementation?

Try this to start with:
http://linux-ha.org/GettingStartedV2
http://linux-ha.org/GettingStartedRevisedV2
http://linux-ha.org/ClusterResourceManager/FAQ#Converting_your_1.x_%28haresources%29_configuration
http://linux-ha.org/ClusterInformationBase/Conversion

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] ip address resources faile to start

2007-03-27 Thread Alan Robertson
Robert Fowler wrote:
> Every time I go to start heartbeat I get an error stating ip addresource
> is stopped and as a result heartbeat fails to start the Virtual IP
> address

> 
> any ideas appreciated

> See also: http://linux-ha.org/ReportingProblems

How about reading the link that's at the bottom of every single posting
on this mailing list and following the directions?

Please provide logs.  I don't know of such a failure condition, so I
need to see the logs.



-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Masters take long time to get back the ip from slave

2007-03-27 Thread Alan Robertson
Austin Rock wrote:
> I have configured HA with the latest version.  When my master goes down, the
> slave automatically takes over the IP in 1 sec, which is what I specified in
> my ha.cf file.  But when the master comes back up, it takes around 40-50
> seconds to take the IP back.  Where can I set that time, so that when the
> master comes up it takes over in 1 sec. only?

I don't have any idea what in your environment might be causing this.
And, without logs, I'll never know.

Have you read this link that's at the bottom of every posting to this list?

> See also: http://linux-ha.org/ReportingProblems

Logs are very important.  No logs == no clue.


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] haclient.py requires a port to login ?

2007-03-27 Thread Alan Robertson
Paulo F. Andrade wrote:
> Actually a similar thing happens on my Mac; the difference is that I do
> have to specify the correct port to connect. Anything else won't do.
> 
> On my Linux systems (running ubuntu and gentoo) I don't have to specify
> the port number, and it defaults to 5560.
> 
> Here's the output when I don't specify a port number:
> Traceback (most recent call last):
>   File "/sw/lib/heartbeat-gui/haclient.py", line 1598, in on_login
>     if not manager.login(server, user, password):
>   File "/sw/lib/heartbeat-gui/haclient.py", line 1943, in login
>     ret = mgmt_connect(ip, username, password, port)
> TypeError: mgmt_connect() argument 4 must be string, not None
> 
> It's no big deal, but it must be a bug.
> Paulo F. Andrade [EMAIL PROTECTED]

Could one of you kindly file a Bugzilla entry for this for me, please?
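
In the meantime, a workaround along the following lines might help.  This
is only a sketch and not the actual haclient.py code: the helper name
normalize_port is hypothetical, and the default of 5560 is simply the value
Paulo reports for his Linux systems.  The traceback suggests that
mgmt_connect() wants the port as a string, so the sketch substitutes the
default whenever the port is None or empty.

```python
DEFAULT_MGMT_PORT = "5560"  # default reported for the Linux systems in this thread


def normalize_port(port):
    """Return the port as a string, falling back to the default.

    Hypothetical helper: mgmt_connect() reportedly rejects None as its
    fourth argument, so None or an empty value is replaced by the default.
    """
    if port is None or str(port).strip() == "":
        return DEFAULT_MGMT_PORT
    return str(port).strip()
```

The caller would then pass normalize_port(port) rather than port itself
as the fourth argument.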

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] pingd not failing over

2007-03-27 Thread Alan Robertson
Daniel Bray wrote:
> On Tue, 2007-03-27 at 13:19 -0600, Alan Robertson wrote:
>> Daniel Bray wrote:
>>> On Tue, 2007-03-27 at 10:13 +0200, Andrew Beekhof wrote:
>>>
>>>> You're trying to start pingd by two ways - both by the respawn
>>>>  directive, and also as a resource.
>>>>
>>>>  You can't do that.
>>>>
>>>>  And, you're not using the attribute that pingd is creating in your  
>>>>  CIB.
>>>>
>>>>  See http://linux-ha.org/pingd for a sample rule to use a pingd  
>>>>  attribute
>>>>  - or you can see my linux-ha tutorial for similar information:
>>>> http://linux-ha.org/HeartbeatTutorials - first tutorial listed
>>>> starting at about slide 139...
>>>>
>>>>  Here's the example from the pingd page:
>>>>
>>>>  
>>>> 
>>>>>>>attribute="pingd_score" operation="not_defined"/>
>>>> 
>>>>  
>>>>
>>>>  In fact, I'm not 100% sure it's right...
>>>>
>>>> it does exactly what the title claims it will:
>>>>  "Only Run my_resource on Nodes with Access to a Single Ping Node"
>>>>
>>>> there are other examples on that page that cover more complicated  
>>>> scenarios, complete with worked solutions
>>>>
>>>>
>>>>  I think the example from the tutorial is a little more general...
>>>>
>>>>  
>>>>   >>> score_attribute="pingd" >
>>>> >>> attribute="pingd"
>>>> operation="defined"/>
>>>>   
>>>>  
>>>>
>>>>
>>>>  What this rule says is:
>>>>
>>>> For resource "my_resource", add the value of the pingd attribute
>>>> to the score for locating my_resource on a given
>>>> machine.
>>>>
>>>>  For your example flags to pingd, you use a multiplier (-m flag) of 100,
>>>>  so having access to 0 ping nodes is worth zero, 1 ping node is worth
>>>>  100 points, 2 ping nodes are worth 200 points, and so on...
>>>>
>>>>  So, if one node has access to a ping node and the other one does not
>>>>  have access to a ping node, then the first node would get 100 added to
>>>>  its location score, and the second node would have an unchanged  
>>>>  location
>>>>  score.
>>>>
>>>>  If the second node otherwise scored as much as 99 points higher than the
>>>>  first node, the resource would still be located on the first node.  If
>>>>  you don't like that, you can change your ping count multiplier, write a
>>>>  different rule, or add a rule.
>>>>
>>>>  You can change how much ping node access is worth with the -m flag, or
>>>>  the "multiplier" attribute in the pingd resource.  Note that you  
>>>>  didn't
>>>>  supply a multiplier attribute in your pingd resource - so it would
>>>>  default to 1 -- probably not what you wanted...
>>>>
>>>>  And, don't run pingd twice - especially not with different  
>>>>  parameters...
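
To make the arithmetic in the quoted explanation concrete, here is a small
sketch in Python.  It is not how the policy engine is implemented; the
function names and the base scores are made up for illustration, and it
only models "base location score plus multiplier times reachable ping
nodes".

```python
def location_score(base_score, reachable_ping_nodes, multiplier=100):
    """Base location score plus the pingd contribution (illustrative model)."""
    return base_score + multiplier * reachable_ping_nodes


def preferred_node(nodes, multiplier=100):
    """Pick the node with the highest total score.

    nodes maps a node name to (base_score, reachable_ping_nodes).
    Ties are not modelled here.
    """
    return max(
        nodes,
        key=lambda name: location_score(*nodes[name], multiplier=multiplier),
    )


# node1 can reach one ping node; node2 reaches none but starts 99 points ahead.
nodes = {"node1": (0, 1), "node2": (99, 0)}
print(preferred_node(nodes))                # node1: 100 beats 99
print(preferred_node(nodes, multiplier=1))  # node2: 99 beats 1
```

The second call shows why a multiplier left at 1 is probably not what you
want: connectivity then barely moves the score.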
>>>
>>>
>>> Thanks for the detailed feedback!  
>>>
>>>
>>> I've made the alterations you suggested, via the GUI, but saw some
>>> issues.  I think one of my biggest issues here, is all the documentation
>>> points to creating XML files by hand, and then "importing" them into the
>>> cluster.  What I'm trying to do, is strictly use the GUI, so that I can
>>> do some video documentation for the rest of my team.  Once I have
>>> everything working, I will distribute the videos so that my team members
>>> can learn from this.  Unfortunately, things don't always work as
>>> expected from within the GUI.
>>>
>>> For instance, in your example above:
>>>
>>> 
>>>
>>>   >>   attribute="pingd_score" operation="not_defined"/>
>>>
>>> 
>>>
>>> You don't have a "value" set for the expression.  In the GUI, it will
>>> not l

Re: [Linux-HA] Re: [Xen-users] Cluster xen servers with san storage

2007-03-29 Thread Alan Robertson
Tijl Van den Broeck wrote:
> For the answers I start from the point of view of heartbeat 2
> (www.linux-ha.org). I do not know about commercial cluster products
> supporting Xen. Crossposting this to linux-ha mailing list as they can
> correct me on possible mistakes :-)
> 
> On 3/28/07, Carl Caum <[EMAIL PROTECTED]> wrote:
>> Hello all.  I have two servers with Xen running.  I also have  a SAN
>> that hold the virtual machines.  I need to know if a few things are
>> possible.
>>
>> 1)  I need to know if failover is possible and how is the best way to
>> accomplish this?  Is it possible to have one dom0 take snapshots of
>> the other dom0 and restore the VMs from the other dom0 if it suddenly
>> stops responding?
> 
> In theory yes, if you wrote your own necessary external plugins for
> doing so. But I presume you don't want to go through those troubles as
> there already is a native Xen OCF Resource Agent in the heartbeat
> project. A good explanation & demonstration for it can be found in the
> "SUSE Linux Enterprise Server 10 - Exploring the High-Availability
> Storage Foundation" presentation from September 2006, more
> specifically pages 137-142. It should also answer questions you might
> have. However if you really want to reinvent the wheel, please stick
> to the guidelines mentioned in the linux-ha wiki.
> 
>>
>> 2)  Is it possible to have a VM do a live migrate to the other server
>> automatically if the current server it's running on is being shutdown
>> for maintenance.
> 
> Live-migration is not yet included in the Xen OCF RA, only "migrate"
> is possible at the moment. Also do NOT trigger a live-migration
> yourself using "xm", as hb2 will think the resource crashed and will
> try to start it again, which leads to ehm... data corruption, to say the
> least :-)
> 
>>
>> 3)  Is it possible to cluster the two servers so they share CPU power
>> and memory resources between the two instances of xen.  So really the
>> VMs are arbitrarily run on both servers simultaneously thus
>> increasing their speed.  If this is possible, what happens if one of
>> the servers suddenly crashes?
> 
> By this you mean... running the same machine in parallel, whilst
> sharing exclusive resources as such would be possible by an
> Active-Active cluster? This would not really be a Xen issue, rather a
> hb2 one with the use of a cluster filesystem, not everything can run
> in A-A mode I think (not an expert on this, check the linux-ha wiki
> for more info).
> To accomplish this, additionally to setting up hb2 between your
> dom0's, you'd also have to set it up between the domU's running inside
> the dom0's.
> 
> There have been discussions on the linux-ha list about the Xen RA (and
> someone who was writing an alternative one for domU's specifically I
> think), I suggest you find 'm & read :-)


I _think_ that the live migration stuff is now in our development
version.  But I'm not 100% sure of that ;-).  If it is, then it'll show
up in a few weeks in a new version.

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] App Failure Induced Failover Behavior Question - 2.0.7

2007-03-29 Thread Alan Robertson
Mohler, Eric (EMOHLER) wrote:
>> I have a question regarding failover behavior of 2.0.7.
>>
>> We're running just 2 nodes on a single LAN and no crossover cable. 
>>
>> We're running with 2.0.7 (R2). Our project has locked our release to
>> 2.0.7. There is no possibility to upgrade to 2.0.8 until it is
>> scheduled for a subsequent release. 
>>
>>
>> I am able to get the following behavior with this (below) config in
>> the event of an ***application failure***: 
>> default_resource_stickiness value="INFINITY"
>> default_resource_failure_stickiness value="-INFINITY"
>>

I cannot read either your ASCII art or your CIB in your email.  CIBs
should typically be sent as attachments.  ASCII art usually comes
through if it isn't too wide, but something in the reply process seems
to have mangled it.

I don't see your original post in my mail client.  Was it recent?

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] IPAddr2 RA and iptables clusterip target

2007-03-29 Thread Alan Robertson
Michael Schwartzkopff wrote:
> Hi,
> 
> I am trying to get a Linux-HA with IPaddr2 resource up and running. I want to 
> use the clusterip target for load sharing between several nodes. I have seen 
> that the concept of using the clusterip target of iptables was planned in the 
> resource agent, but it does not seem to work.
> 
> I am using 1.24 2006/08/09 of the RA. Is there any new version where the 
> problems are fixed?
> 
> Is there any documentation on how to use this feature?

I confess I've never understood how this works.

Isn't there a web paper somewhere on this?  Maybe by Fabio Olive Leite?


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Heatbeat try to load ressource twice, and fail

2007-03-29 Thread Alan Robertson
Dejan Muhamedagic wrote:
> On Wed, Mar 28, 2007 at 04:01:05PM +0200, Benjamin Watine wrote:
>> Hello the list :)
>>
>> I'm trying to get a simple Heartbeat configuration running: 2 nodes,
>> active / passive configuration, running LDAP on a drbd disk.
>>
>> When Heartbeat is launched, it loads slapd, which starts fine. But a few
>> seconds later, heartbeat tries to load slapd a second time, and because
>> slapd is already loaded, it returns an error code and heartbeat fails. You
>> can see the problem in the above log.
> 
> It is sometimes OK for heartbeat to try to start a resource twice
> (though that's probably not the case here). A resource agent
> should be able to handle that, but yours obviously can't, i.e. it
> is not LSB compliant. Pls check this:
> 
> http://www.linux-ha.org/ResourceAgent
> http://www.linux-ha.org/LSBResourceAgent
> 
>> Why is heartbeat trying to load slapd twice? Help would be appreciated :)
> 
> Not sure. Unless we see the configuration. Pls check out this one
> too:
> 
> http://linux-ha.org/ReportingProblems
> 
>> Thank you, and sorry for my english !
> 
> English good, reporting a tad lacking ;-)


Heartbeat would typically try to start a resource twice if the resource
agent didn't report status correctly.  For R1 configurations, read
http://linux-ha.org/HeartbeatResourceAgent for probable problems.
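
As a rough illustration of what "reporting status correctly" means for an
LSB-style script: "status" should exit 0 when the service is running and 3
when it is stopped, and "start" on an already-running service should still
succeed.  The sketch below is a toy model in Python, not a real resource
agent; the class and its in-memory "running" flag are invented for the
example.

```python
LSB_OK = 0            # action succeeded, or 'status' found the service running
LSB_NOT_RUNNING = 3   # 'status' exit code for a service that is not running


class ToyService:
    """In-memory stand-in for a daemon controlled by an init script."""

    def __init__(self):
        self.running = False
        self.launches = 0

    def status(self):
        return LSB_OK if self.running else LSB_NOT_RUNNING

    def start(self):
        if self.status() == LSB_OK:
            return LSB_OK      # already running: report success, launch nothing
        self.launches += 1     # stand-in for actually launching the daemon
        self.running = True
        return LSB_OK

    def stop(self):
        self.running = False   # stopping a stopped service is also a success
        return LSB_OK
```

With a script that behaves like this, a second "start" is harmless: it
returns 0 and the daemon is launched only once.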


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] why number of nodes < 16?

2007-03-29 Thread Alan Robertson
Andrew Beekhof wrote:
> On 3/28/07, Aleksandar Lazic <[EMAIL PROTECTED]> wrote:
>> On Die 27.03.2007 13:45, Alan Robertson wrote:
>> >Aleksandar Lazic wrote:
>> >>
>> >> Does anybody know who have done this because there is a company which
>> >> need more then the 16 servers and are able to help?
>> >>
>> >> As far as I know they also want to pay for the help.
>> >
>> >I know companies who could help this happen, and if it's done right,
>> >we'd likely take the patches back into the base.
>>
>> Please can you send me some info about these companies, maybe offlist,
>> so that I can contact them, thanks.
> 
> SUSE might possibly be interested
> - but you'd obviously have to be running SLES instead of RH :-)
> 
>>
>> >How big does it need to go?
>>
>> I have described the setup here:
>>
>> http://lists.linux-ha.org/pipermail/linux-ha/2007-March/023826.html

OK

So, it sounds like 4 bladecenters - minus a few.  A goal of 28 nodes
isn't out of the question.  Of course, IBM would also be interested if
their services interest you, and if you're a big enough IBM customer.  I
believe that I know people who will likely do this kind of thing for you
and who are trustworthy and do good work, and have forwarded their
information to you.   The people I'd first suggest are our friends at
tummy.com who kindly host our web servers, Mercurial access, DNS, and so
on for us.  If they can't help you, I know there are dozens of
consulting firms on this mailing list.  Feel free to ask the list at
large.  I'm sure you'll find some people who are interested.

The work is probably 1/4 coding effort and 3/4 testing effort -- or
something like that.  The majority of the testing effort would not
require more than 3 computers.  The remainder would require the
largest cluster you can find to test it on.

Our friends at the Sanger Institute have some pretty large clusters for
testing above your size -- if we could engage their help.  I know
they're also interested in larger clusters.

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Re: Hard story how i install HA with GUI in to slackware 11.0

2007-03-29 Thread Alan Robertson
Alex Orlov wrote:
> How fix start mgmtd with option -t... in my system it start only once, after 
> touching config file.
> Next time, after reboot or stop/start it start with  -v option! I put line 
> "respawn" in config...
> 
> I do not wont use auth ... i am a only one person on server and port blocked 
> from everyone.
> 
> PS: PAM auth in slackware are very "stink"...

You might get some messages about the mgmtd daemon being started
multiple times, but if you don't mind the messages, it should work.

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] sub-clusters - heartbeat tunneling

2007-03-29 Thread Alan Robertson
Max Hofer wrote:
> I have a question regarding the heartbeat message exchange.
> 
> Currently I have 2 cluster systems, each consisting of 2 nodes:
> - cluster A consists of nodes A1, A2
> - cluster B consists of nodes B1, B2
> 
> All 4 nodes are attached with bonded interfaces to two LAN
> switches SW1 and SW2 (let's call it the normal LAN).
> 
> A1 and A2 (and B1 and B2) have a direct interconnection where
> the DRBD devices are synchronized, plus a serial cable (let's call it
> the DRBD LAN).
> 
> Thus currently cluster A (and B) use 3 different ways to exchange
> the heartbeat packets:
> - bcast over the DRBD LAN
> - ucast using the normal LAN
> - the serial cable
> 
> Now I have figured out that cluster A needs states/data from cluster B
> (and vice versa) for some fail-over decisions.
> 
> I see 2 possible solutions:
> a) writing a resource agent which polls the state from the other cluster,
> and I use this state
> b) I configure 1 single cib.xml with 2 "sub-clusters"
> 
> By sub-cluster I mean that certain resources run only on cluster A and other
> resources run only on B.
> 
> My question now:
> * what will happen if one of the nodes is disconnected from the normal
> LAN - is the information tunneled over the redundant connections?
> 
> Scenario: A1 is disconnected from SW1. A2 still receives HB packets
> via the serial line and the DRBD LAN. Do B1 and B2 see A1 as dead or
> do they get the information about A1 via A2?
> 
> Maybe a strange scenario, but I have it (and unfortunately I cannot test it
> because of some external constraints  managers! ;-)

I'd suggest starting your thinking by having a single cluster, and using
the heartbeat APIs to exchange messages.  You can even do this at a
shell script level ;-).

Or you could write status information into CIB node attributes - which
are automatically distributed...  And, clone resources notify you when
your peers come and go.

So, a thought would be a clone resource combined with writing short,
simple state into the node attributes.

But, you didn't give enough detail to know which of these would be best,
or whether any of them would work for you.



-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] IPAddr2 RA and iptables clusterip target

2007-03-29 Thread Alan Robertson
Michael Schwartzkopff wrote:
> On Thursday 29 March 2007 12:43, Alan Robertson wrote:
>> Michael Schwartzkopff wrote:
>>> Hi,
>>>
>>> I am trying to get a Linux-HA with IPaddr2 resource up and running. I
>>> want to use the clusterip target for load sharing between several nodes.
>>> I have seen that the concept of using the clusterip target of iptables
>>> was planned in the resource agent, but it does not seem to work.
>>>
>>> I am using 1.24 2006/08/09 of the RA. Is there any new version where the
>>> problems are fixed?
>>>
>>> Is there any documentation on how to use this feature?
>> I confess I've never understood how this works.
>>
>> Isn't there a web paper somewhere on this?  Maybe by Fabio Olive Leite?
> 
> 1) 
> http://flaviostechnotalk.com/wordpress/index.php/2005/06/12/loadbalancer-less-clusters-on-linux/
> 
> 2) Next month's article in Linuxmagazin (sorry, German). But perhaps it will
> be published in Linux Magazine also.

Thanks for the link above.
> 
> 3) I did make an update of the IPaddr2 RA which works for me. Should I send
> it to you?

What does this mean?  How big were the changes?

Please send to the list, and not to me personally.

> 4) At the moment I am struggling with the GUI, which sometimes does not stop
> the resources and does not set the OCF_RESKEY... variables correctly.
> I did not understand the whole process completely and it is not completely
> reproducible. So give me some time to play around here. Perhaps I also have to
> upgrade to 2.0.8. I have 2.0.7 now. Is it worth it?

2.0.8 saw a rearchitecture of some things in the CRM.  I'd consider it.

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] WARN: G_SIG_dispatch

2007-03-29 Thread Alan Robertson
Bjorn Oglefjorn wrote:
> Mar 29 11:56:07 test-1 stonithd: [27339]: WARN: G_SIG_dispatch: Dispatch
> function for SIGCHLD took too long to execute: 350 ms (> 10 ms) (GSource:
> 0x8105820)
> 
> I get this error when the stonithd runs a 'gethosts' for my STONITH device.
> I realize that you have increased this timeout recently, but getting the
> host names from the plugin takes at least 350ms in the case of the DRAC4/I.
> 
> I'm suspecting that this 'warning' is preventing STONITH operations from
> happening.  Does this message mean that test-1 could not STONITH test-2 and
> is looking for another node to do so:
> Mar 29 12:01:27 test-1 stonithd: [27339]: info: Broadcasting the message
> succeeded: require others to stonith node test-2.domain.

It shouldn't cause anything to be lost.

You really need to present messages in context.   One message by itself
(or two) isn't that helpful.

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] R2 Two-node apache cluster with STONITH

2007-03-29 Thread Alan Robertson
Dejan Muhamedagic wrote:
> On Wed, Mar 28, 2007 at 02:33:28PM -0400, Bjorn Oglefjorn wrote:
>> Thanks for the reply Dejan.  My responses are inline.
>> --BO
>>
>> On 3/28/07, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
>>> On Wed, Mar 28, 2007 at 11:29:35AM -0400, Bjorn Oglefjorn wrote:
>>>> I believe I've corrected some issues, but now I'm getting more of this:
>>>> Mar 28 11:02:37 test-1 lrmd: [22008]: ERROR: RA lsb:httpd:monitor
>>> (process
>>>> 24472) failed to redirect stdout for its background child (daemon)
>>>> processes. This will likely cause those processes to die mysteriously at
>>>> some later time (terminated by signal SIGPIPE).
>>> Hmm, I think that this has been addressed as Alan had already
>>> pointed out, probably after the 2.0.7 release. If you can, please
>>> upgrade to 2.0.8.
>>
>> I'd prefer to stick with the package that comes from CentOS extras (2.0.7).
> 
> I'd prefer the other way around :) Unless you have a very good
> support contract with your supplier, but in that case we probably
> would be hearing from them and not from you :)  For better or
> worse (probably the latter, because you people would prefer a more
> stable thing :-/), heartbeat development is very fast and the
> number of things fixed from one to the next release is
> substantial.
> 
>> I don't get this error all the time, so I'm not sure why it's happening.
>> Can someone give me a deeper explanation of what the lrmd doesn't like here?
> 
> I can't. If popen(3), read(2), EAGAIN, and SIGPIPE make any sense
> to you, perhaps you can figure it out :) Seriously though, I think
> it was a bug to treat the EAGAIN the way it has been treated in
> 2.0.7 in lrmd.


I can.

The LRM closed the pipe after the start operation completed.  If there
were any child processes which were still running, their stdout/stderr
would be a pipe with no one to read from it -- ever.

If the service you started with the init script ever wrote even one line
to stdout after the start operation completed, that process would die with
SIGPIPE as the cause of death.

This message WAS CORRECT for the lrmd.  In 2.0.8 we rewrote the lrmd so that
it does not close the pipe until the other end has closed it.

I'm pretty sure that the CentOS people now have a 2.0.8 version.
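
If you want to see the failure mode described above for yourself, the
following Python sketch reproduces it outside of heartbeat.  It is not lrmd
code; the child program and the function name are made up for the
demonstration.  The parent plays the role of the old lrmd by closing its
end of the pipe early, and the child plays a daemon that writes one more
line to stdout afterwards.

```python
import signal
import subprocess
import sys

# Child: restore default SIGPIPE handling (Python ignores it at startup),
# wait for a go-ahead on stdin, then write one "late" line to stdout.
CHILD_CODE = """
import signal, sys, time
signal.signal(signal.SIGPIPE, signal.SIG_DFL)
sys.stdin.readline()
sys.stdout.write("late output\\n")
sys.stdout.flush()
time.sleep(5)   # never reached if the write kills the process
"""


def run_demo():
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    proc.stdout.close()          # the reader goes away, as the old lrmd did
    proc.stdin.write(b"go\n")    # now let the child write its late line
    proc.stdin.flush()
    returncode = proc.wait(timeout=10)
    proc.stdin.close()
    return returncode            # a negative value means "killed by that signal"
```

On a POSIX system run_demo() should return -signal.SIGPIPE, that is, the
child was terminated by SIGPIPE on its first write after the reader
disappeared.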


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] R2 Two-node apache cluster with STONITH

2007-03-29 Thread Alan Robertson
Bjorn Oglefjorn wrote:
> Thanks for the reply Dejan.  My responses are inline.
> --BO
> 
> On 3/28/07, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
>>
>> On Wed, Mar 28, 2007 at 11:29:35AM -0400, Bjorn Oglefjorn wrote:
>> > I believe I've corrected some issues, but now I'm getting more of this:
>> > Mar 28 11:02:37 test-1 lrmd: [22008]: ERROR: RA lsb:httpd:monitor
>> (process
>> > 24472) failed to redirect stdout for its background child (daemon)
>> > processes. This will likely cause those processes to die
>> mysteriously at
>> > some later time (terminated by signal SIGPIPE).
>>
>> Hmm, I think that this has been addressed as Alan had already
>> pointed out, probably after the 2.0.7 release. If you can, please
>> upgrade to 2.0.8.
> 
> 
> I'd prefer to stick with the package that comes from CentOS extras (2.0.7).
> I don't get this error all the time, so I'm not sure why it's happening.
> Can someone give me a deeper explanation of what the lrmd doesn't like
> here?
> 
>> When I attempt to move resources to another node (using crm_standby) I
>> get
>> > these errors:
>> > Mar 28 10:56:04 test-1 crmd: [22011]: info:
>> do_lrm_rsc_op:lrm.cPerforming
>> > op stop on httpd (interval=0ms,
>> key=28:66532759-6190-4321-9be3-07730b15aeae)
>> > Mar 28 10:56:04 test-1 lrmd: [22773]: WARN: For LSB init script, no
>> > additional parameters are needed.
>>
>> Can't say unless you show me this rsc definition, but it seems
>> like bad usage. I found one below, but that one should not cause
>> this problem:
> 
> 
> It's slightly different now (is provider="heartbeat" bad here?):
> 
>  type="httpd-lsb">
>   
>  on_fail="restart"/>
>  on_fail="restart" prereq="fencing"/>
>  prereq="fencing"/>
>   
> 
> 
>> 
>> > 
>> > > on_fail="fence"/>
>> > 
>> > 
>>
>> One thing that looks odd is 5s interval and 20s timeout. The
>> timeout is probably OK, but the interval is a bit exaggerated.
>> What I mean is that, apart from putting extra strain on your host
>> which may or may not be an issue, a 5 seconds monitoring interval
>> won't bring you much, or, in other words, how about your response
>> time in case a problem occurs? Is it of the same order?
> 
> 
> Would it make more sense to have the timeout and interval equal?  I can see
> your point.
> 
>> Mar 28 10:56:04 test-1 lrmd: [22008]: ERROR: RA lsb:httpd:stop (process
>> > 22773) failed to redirect stdout for its background child (daemon)
>> > processes. This will likely cause those processes to die
>> mysteriously at
>> > some later time (terminated by signal SIGPIPE).
>> > Mar 28 10:56:04 test-1 lrmd: [22008]: info: RA output:
>> (httpd:stop:stdout)
>> > httpd (pid 22165 22164 22163 22162 22161 22160 22159 22157 22155) is
>> > running...
>> > Mar 28 10:56:04 test-1 crmd: [22011]: WARN: process_lrm_event:lrm.c LRM
>> > operation (44) stop_0 on httpd Error: (1) unknown error
>>
>> I'd strongly recommend that you use the OCF RA instead of your
>> distributions init script. It is otherwise rather difficult to
>> figure out what this error means apart from the fact that the stop
>> op failed. I wonder why did it show up as WARN and not ERROR.

I agree.  Also, our resource agent monitors apache much better than the
status operation of the LSB init script does.


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] How to prevent node from taking over

2007-03-29 Thread Alan Robertson
Mohler, Eric (EMOHLER) wrote:
> Can anyone help with this scenario? Running 2.0.7.
> 
> Thanks in advance.
> 
> 
> 
> BOX1 App StateBOX2 App State
> ONOFF
> ONOFF
> ONOFF
> 
> (Pull cable for t > deadtime)
> 
> Still ON  ON
> Still ON  ON
> Still ON  ON
> 
> (STOP HA)
> 
> OFF   ON <--- this is desired behavior
> OFF   ON <--- this is desired behavior
> OFF   ON <--- this is desired behavior
> 
> (Replace cable)
> 
> OFF   ON <--- this is desired behavior
> OFF   ON <--- this is desired behavior
> OFF   ON <--- this is desired behavior
> 
> (START HA)
> 
> ONOFF<---this is bad!! As soon as HA
> takes 
> ONOFFcontrol BOX2 closes apps and
> BOX1  
> ONOFFstarts apps. I need BOX2 to
> stay put 
>  BOX1 to simply leave apps
> closed as 
>  just above.


I don't have the foggiest idea what any of your output means.  It doesn't
help that you made your lines too long, so they got wrapped.

I know something is bad, but I can't make any sense of your table.

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] R2 Two-node apache cluster with STONITH

2007-03-30 Thread Alan Robertson
Bjorn Oglefjorn wrote:
> I took a look at the apache RA, but it makes a lot of assumptions about the
> environment which are mostly untrue in Red Hat.  How can I configure
> this RA
> short of making changes to the script?  Can I set environmental variables?
> I tried setting what's shown in the 'meta-data' output, but with no luck.


The environment variables are related to, but not the same as, the names
in meta-data.  The environment variable names have OCF_RESKEY_ prepended
to the front.  So, ipaddr becomes OCF_RESKEY_ipaddr, and so on...

As far as I know, everything that's in there is simply a default.

http://linux-ha.org/OCFResourceAgent

talks about this in more detail.
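
As a small sketch of the naming rule above (Python, purely illustrative):
the helper name ocf_environment is hypothetical, and apart from ipaddr the
parameter names used with it are placeholders rather than real parameters
of any particular agent.  It just prepends OCF_RESKEY_ to each parameter
name and puts the result into an environment dictionary.

```python
import os


def ocf_environment(params, base_env=None):
    """Return a copy of the environment with OCF_RESKEY_<name> set per parameter.

    params maps meta-data parameter names to values; base_env defaults to
    the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name, value in params.items():
        env["OCF_RESKEY_" + name] = str(value)
    return env
```

The resulting dictionary could then be handed to the agent script when
testing it by hand, for example through the env= argument of
subprocess.run().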


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] sub-clusters - heartbeat tunneling

2007-03-30 Thread Alan Robertson
Dejan Muhamedagic wrote:
> On Fri, Mar 30, 2007 at 11:13:48AM +0200, Max Hofer wrote:
>> On Thursday 29 March 2007 22:32, Dejan Muhamedagic wrote:
>>> On Thu, Mar 29, 2007 at 06:36:15PM +0200, Max Hofer wrote:
>>>> I have a question regarding the heartbeat message exchange.
>>>>
>>>> Currently I have 2 cluster systems, each consisting of 2 nodes:
>>>> - cluster A consists of nodes A1, A2
>>>> - cluster B consists of nodes B1, B2
>>>>
>>>> All 4 nodes are attached with bonded interfaces to two LAN
>>>> switches SW1 and SW2 (let's call it the normal LAN).
>>>>
>>>> A1 and A2 (and B1 and B2) have a direct interconnection where
>>>> the DRBD devices are synchronized, plus a serial cable (let's call it
>>>> the DRBD LAN).
>>>>
>>>> Thus currently cluster A (and B) use 3 different ways to exchange
>>>> the heartbeat packets:
>>>> - bcast over the DRBD LAN
>>>> - ucast using the normal LAN
>>>> - the serial cable
>>>> I see 2 possible solutions:
>>>> a) writing a resource agent which polls the state from the other cluster,
>>>> and I use this state
>>> Interesting idea. Not sure how tricky it would be to do right.
>>> Depends also what for you would use that state. I guess to restart
>>> some resources.
>> My problem with this solution is that I do not use the action-serializing
>> effect of the transition engine.
> 
> I don't really get this one.

Me neither...

>>>> b) I configure a single cib.xml with 2 "sub-clusters"
>>> This is an obvious solution, but probably you'd have to do some
>>> rewiring, i.e. all nodes should be equally well connected with
>>> each other.
>> This is out of the question because I cannot interconnect all nodes to each
>> other with a serial cable or with a direct LAN cable.
> 
> You could use a switch in case there are no security issues. I'm
> afraid that they really have to be able to talk to each other.

It is the case that heartbeat expects effectively a star topology.  In
fact, even if we worked around it, we might schedule resources that
violate the topology, because we didn't understand it.

You ever think about using IP routes?  Then everything would just work :-).

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] SMITH?

2007-03-30 Thread Alan Robertson
Eduardo Grosclaude wrote:
> Hello,
> You might be frightened by this question, but I am looking for an effective
> way to Shoot Myself In The Head for testing purposes. Sort of an ungraceful
> "software cold reset"... Is this possible? A panic followed by reboot would
> probably be interesting as well.
> I want to simulate a power brownout while I am not around my cluster. I've
> tried haltsys -f -p but this seems too much of a latency to be a real down.
> In my normal test case, the node would be fully operational by the time I
> issue the SMITH. I have no hardware STONITH device that could be operated
> from within $this_node.
> Sorry about suicidal language.

You'll like the suicide STONITH module then ;-)


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Masters take long time to get back the ip from slave

2007-03-30 Thread Alan Robertson
Dejan Muhamedagic wrote:
> On Fri, Mar 30, 2007 at 11:13:22AM +0530, Austin Rock wrote:
>> Pls. somebody Help me..Thanks
> 
> As Alan said, no logs -- no cigar and no amount of pleading helps :-)


I suspect he sent the logs in a .gz file, but I've been with a customer
most of this week.  Gotta keep those bills paid!


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] OCF RA for openLDAP

2007-03-30 Thread Alan Robertson
Benjamin Watine wrote:
> Hello
> 
> I'm about to write an OCF Resource Agent for openLDAP and I wonder if
> one of you has already written one. I've taken a look on Google, but I
> didn't find one. Also, is there an OCF repository for common services
> (apache, mysql, etc.)?

You can use LSB resource agents, of course...

We supply OCF resource agents for
apache.in        ICP.in         ManageVE.in     Raid1.in        WAS6.in
AudibleAlarm.in  IPaddr2.in     mysql.in        rsyncd.in       WAS.in
ClusterMon.in    IPaddr.in                      SAPDatabase.in  WinPopup.in
db2.in           IPsrcaddr.in   oracle.in       SAPInstance.in  Xen.in
Delay.in         LinuxSCSI.in   oralsnr.in      SendArp.in      Xinetd.in
drbd.in          LVM.in         pgsql.in        ServeRAID.in
Dummy.in         MailTo.in      pingd.in        Stateful.in
EvmsSCC.in       Makefile.in    portblock.in    SysInfo.in
Filesystem.in    ManageRAID.in  Pure-FTPd.in    VIPArip.in

Please ignore the .in part of the names...

Of course, you could ls the directory yourself ;-)

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] heartbeat does not set parameters for the RA

2007-03-30 Thread Alan Robertson
Michael Schwartzkopff wrote:
> Hi,
> 
> I am trying to shut down a clone (clone max = 2) resource with the gui:
> Right mouseclick on the single resource or the clone resource or 
> writing "stopped" to the target role sometimes calls the RA script, but 
> without correctly set up parameters. The only values that are set up are:
> 
> $OCF_RESKEY_CRM_meta_clone and
> $OCF_RESKEY_CRM_meta_clone_max
> 
> All others are missing. Any idea what might be wrong?
> 
> The RA script is ocf/IPaddr2. At the moment I am using version 2.0.7.

Never seen this so far, but clone resources are used a lot less often
than others...

Please supply logs...

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Server IP takeover by slave PC

2007-03-30 Thread Alan Robertson
Dejan Muhamedagic wrote:
> On Thu, Mar 29, 2007 at 09:30:51PM -0700, Tanveer Chowdhury wrote:
>> Finally I am going to implement it on my home PCs, each with 2 NICs, before 
>> attempting it on production servers. One (eth0) is connected to the DSL cable for 
>> internet and the other (eth1) I will use for local intranet connectivity.
>>
>> PC1:  eth0: 10.0.16.111  eth0:0  10.0.16.115  eth1: 192.168.100.10
>> PC2:  eth0: 10.0.16.113  eth1: 192.168.100.20
>>
>> 10.0.16.115 will be my VIP. Questions: Do I have to define
>> eth0:0 on PC1, or is just declaring it in the haresources file OK?
> 
> Don't understand this one, but it is enough to just put it in the
> haresources.
> 
>> Now suppose someone pulls the cable of, say, eth0 / eth1 of
>> PC1; then PC2 will take over, RIGHT? 
> 
> That's what's called split brain, a very difficult thing to solve.
> You want to make sure that there's always connectivity available
> between the nodes. So, you put in more than one connection.
> 
>> Now when PC2 takes over, it will start the services listed
>> in this line: masternode   IPaddr::10.0.16.115 httpd 
>>
>> So instead of httpd I will write a startup script of my own with
>> lots of ipconfig commands and iptables rules in that script's
>> Start and Stop functions and put it in the init.d location.
> 
> That's not an optimal approach, i.e. lumping various stuff
> together in such a way will give everybody a headache. If you really
> need special resources which are _not_ already available (you
> should check that), then implement only them in separate scripts.
> Heartbeat will be happy to deal with multiple resources, so you
> can do something like:
> 
> masternode   IPaddr::10.0.16.115 firewall mega-ip httpd 
> 
>> The
>> purpose of this is that when PC2 takes over, its IPs will be changed to
>> eth0: 10.0.16.113 and eth1: 192.168.100.10.
> 
> eth1? I don't see eth1 in resources.
> 
>> These IPs will be
>> static IPs so I think no problem will arise.  And Alan, sorry I
>> couldn't make use of that flash tutorial of yours. Actually I didn't
>> understand it clearly.
> 
> It was probably for v2 style configs.
> 
>> Waiting for your suggestions.
> 
> Always happy to help, but perhaps you should also invest some time
> and read a few docs. You can start here:
> 
> http://linux-ha.org/GettingStartedWithHeartbeat

...

>>> Alan Robertson  wrote: Tanveer Chowdhury wrote:
>>>> Hi all: Thank you for all your help and support. This is what I am
>>>> going to do after reading carefully the documentation.

I don't see any evidence you did this.  Sorry.  Everything you're asking
is described in the R1 docs - in detail.  Although the R1 docs could be
better they're quite explicit and give examples on the things you're
asking for.

It appears that you've made up your mind how it has to work, and you're
fighting what the docs say, or ignoring what they say because it doesn't
agree with your pre-conceived notion.

This is MUCH simpler than you're making it.

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Problems loggin in with hb_gui

2007-03-30 Thread Alan Robertson
Michael Schwartzkopff wrote:
> Hi,
> 
> I am using openSuSE 10.2. The default install of heartbeat and the gui results in a 
> problem. Logging in to the cluster gives the error:
> 
> "Failed in authentication. User Name or Password my be wrong. or the user 
> doesn't belong to the haclient group."
> 
> Well, I checked all this and everything should be all right.
> /var/log/messages on the server says:
> 
> Mar 28 14:15:18 epidot mgmtd: pam_unix(hbmgmtd:auth): authentication failure; 
> logname= uid=0 euid=0 tty= ruser= rhost=  user=hacluster
> Mar 28 14:15:18 epidot mgmtd: pam_warn(hbmgmtd:auth): 
> function=[pam_sm_authenticate] service=[hbmgmtd] terminal=[] 
> user=[hacluster] ruser=[] rhost=[]
> Mar 28 14:15:20 epidot mgmtd: [3564]: ERROR: on_listen pam auth failed
> 
> If I kill the mgmtd and start it with the option "-t" (test mode, no auth 
> check) all works smoothly.

Yeah.  You're trying to log in as root.  We don't trust root unless
explicitly ordered to.

The message says:
>  "or the user doesn't belong to the haclient group."

99% chance that root isn't a member of group haclient.
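
A quick way to check the group-membership half of that error message is to look at the member list for haclient in /etc/group. The following is only a minimal sketch: the function name, the sample file contents, and the gid 90 are made up for illustration, and it looks only at supplementary members (a user whose primary group is haclient would not show up here).

```python
def group_members(group_file_text, group_name):
    """Return the supplementary members listed for group_name.

    group_file_text is text in /etc/group format:
        name:password:gid:member1,member2,...
    """
    for line in group_file_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) >= 4 and fields[0] == group_name:
            return {member for member in fields[3].split(",") if member}
    return set()


# Hypothetical sample; on a real node, read /etc/group instead.
sample = "root:x:0:\nhaclient:x:90:hacluster\n"

for user in ("root", "hacluster"):
    ok = user in group_members(sample, "haclient")
    print(user, "is" if ok else "is NOT", "listed in haclient")
```

On a real system the same function could be fed `open("/etc/group").read()`.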


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Fail-count, ping-pong

2007-03-30 Thread Alan Robertson
Massi wrote:
> Would you suggest putting 2.0.8 on a production environment already, or
> waiting a bit with 2.0.7?

There are no known regressions between 2.0.7 and 2.0.8.  2.0.8 is in
every way at least as good as 2.0.7.  In fact, this is almost always true...

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Help :- Heartbeat 2.0.2-1 - Failover - MySQL

2007-03-30 Thread Alan Robertson
jugal shah wrote:
> Hi All,
> 
> I have configured heartbeat for fail-over. It works perfectly,
> meaning that when the master server stops or is unavailable, it
> transfers requests to the slave server.
> 
> As per the configuration settings, I have put the following
> configuration in the haresources file.
> 
> redhat.cybage.com 172.19.22.230/24/eth0 smb httpd mysql.in
> 
> Although it works, it stops all of the above services (smb,
> httpd, mysql.in) on the slave computer, and because of that MySQL
> replication fails.
> 
> I have a couple of questions.
> 
> 1. Does heartbeat support MySQL fail-over? (That is, suppose the Linux
> OS is working perfectly but MySQL has failed: in that scenario, does
> heartbeat transfer its requests to the slave computer?)

Release 1 (haresources) configurations don't support monitoring of
resources (things like mysql).
> 
> 2. If it supports Service fail-over what I have to do for it?

Upgrade to release 2 (CIB) configuration -- with all that involves.

> 3. I took ldirectord to be optional, so I haven't installed it.

Fine.  It is.  You took it correctly.


> 4. How does heartbeat know which one is the master server? I have made the
> same configuration on both the master and slave computers (ha.cf,
> haresources, authkeys).

First of all the idea of a "master server" is a bit restrictive, as
heartbeat supports active-active configurations, in which case both
servers can be active at the same time - on different resources.

However, for R1 configurations, for each resource "group" (each line in
haresources), the nominal or preferred master is specified as the first
word on the line.  Even this is ignored if you set auto_failback off.
In this case there is no preferred server at all.
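
To make the "first word on the line" rule concrete, here is a small sketch that splits a single haresources line into the preferred node and its resources. It is an illustration only, not heartbeat's own parser: the function name is hypothetical, and it ignores comments, backslash continuation lines, and the shorthand where a bare IP address stands for an IPaddr resource.

```python
def parse_haresources_line(line):
    """Split one haresources line into (preferred_node, resources).

    Each resource is returned as (name, [args]); arguments are the
    parts separated by '::', as in IPaddr::10.0.16.115.
    """
    words = line.split()
    if not words:
        raise ValueError("empty haresources line")
    preferred_node = words[0]
    resources = []
    for word in words[1:]:
        name, *args = word.split("::")
        resources.append((name, args))
    return preferred_node, resources


node, resources = parse_haresources_line(
    "masternode   IPaddr::10.0.16.115 httpd"
)
print("preferred node:", node)
for name, args in resources:
    print("resource:", name, "args:", args)
```

With auto_failback off the first field still names a node, but as noted above it no longer expresses a preference.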

> I need to configure heartbeat so that even if only a single
> database fails, it will transfer its requests to the other computer.

You need an R2 CIB configuration to do that.

-- 
Alan Robertson <[EMAIL PROTECTED]>


[Linux-HA] Re: Linux-HA Digest, Vol 40, Issue 126 --- Help Heartbeat -2.0.2-1

2007-04-02 Thread Alan Robertson
jugal shah wrote:
> Hi All,
>  
> First of all thank you very much.
>  
> Can anyone guide me on how to do R2 CIB configuration?
>  
> I have configured the ha.cf file with "crm=yes" and it has
> generated a cib.xml file, but I don't understand how to do the
> configuration in the cib.xml file.
>  
> I have done the following things for R2 CIB configuration for MySQL
> fail-over.
>  
> 1. Made an ha.cf file with crm yes, which created the cib.xml
>  
> 2. Created a virtual IP device file in sys-config/network-scripts only on
> the master

You should undo that.  Heartbeat will do that for you and move the
virtual IP around from one machine to another on its own.

> Now I don't have any idea what to do with the cib.xml
>  
> *My overall goal to configure the heartbeat so that, if there is a single
> database fails it will transfer it request to the other computer.*
>  
> If anybody has idea, and links for how to do it please reply me as early
> as possible.

On the right side of every web page, there is a link called
"Configuring Heartbeat".  That page is here:
http://linux-ha.org/ConfiguringHeartbeat
It will point you at a few other documents which you can find here:
    http://linux-ha.org/GettingStartedV2
and http://linux-ha.org/GettingStartedRevisedV2




-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] How to perform hb_standby from a custom heartbeat client ?

2007-04-02 Thread Alan Robertson
Dejan Muhamedagic wrote:
> Hi,
> 
> On Mon, Mar 26, 2007 at 01:40:34PM +0100, Martin Gazak wrote:
>> Good morning,
>> I have a custom heartbeat client (Heartbeat 2.0.7, OpenSuSE Linux 10.2), 
>> which signs on, performs nodewalk and then periodically gets node 
>> statuses, all using routines of hbclient library shipped with heartbeat.
>>
>> Is there any routine/call in hbclient which allows performing an operation 
>> equivalent to the "hb_standby all" command?
>> I am using ha.cf/haresources configuration.
>>
>> I could not find any such call in client_lib.c.
>> I guess it could be achieved by some low level calls like 
>> "sendclustermsg", but I did not find documentation on how to send such a 
>> specific message.
> 
> There's a script which does standby: /usr/lib/heartbeat/hb_standby.
> It's just writing a message to /var/lib/heartbeat/fifo. Wouldn't
> that fit the bill?

You could do it from a shell script just like hb_standby does, or you
could do it with sendclustermsg() just like you suspected.

Understand that this will only work for an R1 system.


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] heartbeat -k hangs

2007-04-02 Thread Alan Robertson
Yan Fitterer wrote:
> I can't see this working. AFAIK, heartbeat 2.x does not support the
> protocols of the 1.x series.
> 
> It sounds like you'll have to setup your 2.x system as a new cluster,
> then put together a good transition process.

Everything in heartbeat 1.2.x is in 2.0.x.

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Getting the status of the node

2007-04-02 Thread Alan Robertson
Mark Eisenblaetter wrote:
> Hello list,
> 
> I'm searching for a tool/script that tells me if the node is active or
> passive.

Heartbeat isn't an active/passive solution.

So there is no "active node" or "passive node".

But hb_status rscstatus will tell you what you want to know - but not
very directly.

Read the man page for it.
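
As a rough illustration of "not very directly": assuming the tool meant here is the cl_status command shipped with heartbeat, and assuming its rscstatus subcommand prints one of local, foreign, all or none (check the man page for your version), a script could translate that into something closer to "active" or "passive". The function names and the wording of the descriptions below are my own, not heartbeat's.

```python
import subprocess

# Assumed meanings of the rscstatus values; verify against the man page.
RSCSTATUS_MEANING = {
    "all": "this node holds all of the cluster's resources",
    "local": "this node holds only the resources it is preferred for",
    "foreign": "this node holds only resources preferred on the other node",
    "none": "this node holds no resources",
}


def describe_rscstatus(output):
    """Translate rscstatus output text into a human-readable sentence."""
    key = output.strip().lower()
    return RSCSTATUS_MEANING.get(key, "unrecognised status: " + repr(key))


def query_rscstatus():
    """Run 'cl_status rscstatus' on a heartbeat node (assumed to be in PATH)."""
    result = subprocess.run(
        ["cl_status", "rscstatus"], capture_output=True, text=True, check=True
    )
    return describe_rscstatus(result.stdout)


print(describe_rscstatus("all\n"))
print(describe_rscstatus("none\n"))
```

Only `describe_rscstatus` is exercised here; `query_rscstatus` needs a running heartbeat node.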


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] which is lying ?

2007-04-02 Thread Alan Robertson
Patrick Begou wrote:
> Maybe this is my answer:
> 
> Dean (now heartbeat 1.2.4/FC6 X86_64) has no master control process
> ps -ef |grep heart
> nobody    5378     1  0 19:12 ?        00:00:00 heartbeat: heartbeat: FIFO reader
> nobody    5379     1  0 19:12 ?        00:00:00 heartbeat: heartbeat: write: bcast eth1
> nobody    5380     1  0 19:12 ?        00:00:00 heartbeat: heartbeat: read: bcast eth1
> nobody    5381     1  0 19:12 ?        00:00:00 heartbeat: heartbeat: write: bcast eth0
> nobody    5382     1  0 19:12 ?        00:00:00 heartbeat: heartbeat: read: bcast eth0
> nobody    5383     1  0 19:12 ?        00:00:00 heartbeat: heartbeat: write: ping_group soufflerie
> nobody    5384     1  0 19:12 ?        00:00:00 heartbeat: heartbeat: read: ping_group soufflerie
> 
> While Ekman has one (heartbeat 1.2.3/Sarge AMD64)
> ekman:~# ps -ef |grep heart
> root      2468     1  0 08:04 ?        00:01:00 heartbeat: heartbeat: master control process
> nobody    2475  2468  0 08:04 ?        00:00:00 heartbeat: heartbeat: FIFO reader
> nobody    2476  2468  0 08:04 ?        00:00:00 heartbeat: heartbeat: write: bcast eth1
> nobody    2477  2468  0 08:04 ?        00:00:01 heartbeat: heartbeat: read: bcast eth1
> nobody    2478  2468  0 08:04 ?        00:00:00 heartbeat: heartbeat: write: bcast eth0
> nobody    2479  2468  0 08:04 ?        00:00:02 heartbeat: heartbeat: read: bcast eth0
> nobody    2480  2468  0 08:04 ?        00:00:02 heartbeat: heartbeat: write: ping_group soufflerie
> nobody    2481  2468  0 08:04 ?        00:00:00 heartbeat: heartbeat: read: ping_group soufflerie
> 
> Is this a bug ?

Yes.

Anytime the heartbeat master control process goes away and leaves other
processes around -- it's a bug.

Logs would tell the tale.
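
For anyone who wants to watch for this condition, the check can be sketched as a scan of `ps -ef` output: heartbeat child processes present, but no "master control process" line. This is only an illustration based on the process titles shown in the listings above; the function name and the sample strings are made up, and it assumes each process appears on one unwrapped line.

```python
def heartbeat_orphaned(ps_output):
    """Return True if heartbeat processes exist without a master control process."""
    hb_lines = [line for line in ps_output.splitlines() if "heartbeat:" in line]
    has_master = any("master control process" in line for line in hb_lines)
    return bool(hb_lines) and not has_master


# Shortened, hypothetical samples modelled on the listings above.
orphaned = (
    "nobody 5378    1 0 19:12 ? 00:00:00 heartbeat: heartbeat: FIFO reader\n"
    "nobody 5379    1 0 19:12 ? 00:00:00 heartbeat: heartbeat: write: bcast eth1\n"
)
healthy = (
    "root   2468    1 0 08:04 ? 00:01:00 heartbeat: heartbeat: master control process\n"
    "nobody 2475 2468 0 08:04 ? 00:00:00 heartbeat: heartbeat: FIFO reader\n"
)

print("orphaned sample:", heartbeat_orphaned(orphaned))
print("healthy sample:", heartbeat_orphaned(healthy))
```

A check like this only detects the symptom; the logs are still needed to find out why the master process went away.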


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] Repeatable simple colocation bug

2007-04-02 Thread Alan Robertson
Martin Fick wrote:
> I have created an example of a simple colocation
> constraint bug that I have run into using one resource
> that is meant to be colocated with two similar
> resources.  I have used the example Stateful ocf agent
> to showcase this bug; this resource simply sets a
> state and maintains it.  I noticed this bug with real
> resources, this isn't just an academic example; :)  I
> just wanted to showcase the problem with a simple
> resource so that the resource itself would not be in
> question.
> 
> First there are two resources, example_A and example_B,
> that are defined.  I have two machines, mwave and
> dell, and each one of these resources ends up starting on
> a separate machine.  I then define a third resource
> example_cAB and define two colocation constraints for
> this resource, one for example_A and one for
> example_B.  
> 
> The expected outcome at this point would be for
> example_A and example_B to migrate to the same node
> and then for example_cAB to be started on this same
> node.  Instead, example_A and B stay frozen where they
> are (separate machines) and example_cAB never starts. 
> Here is the crm_mon output to show this:


You didn't say what version you're running.  You didn't include logs.
Putting files like CIB output as attachments is much appreciated - since
mail user agents do funny things to long lines.

See the link below (included in every email to the list):
> See also: http://linux-ha.org/ReportingProblems 


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] resolving "Dependency loop" error

2007-04-02 Thread Alan Robertson
kisalay wrote:
> Hi,
> 
> I have recently upgraded my system from linux-ha 2.0.7 to 2.0.8.
> Since I have upgraded, i have been seeing some errors/ warnings from
> pengine. I assume that these errors were not checked for in 2.0.7 and more
> checks have been added in 2.0.8.
> Below I paste the whole cib.xml ( for clear reference ) and the warnings /
> errors follow:

Could you kindly send the CIB as an attachment, and not as pasted inline
so that it's not at the mercy of email clients?

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] cib.xml races on initialization

2007-04-02 Thread Alan Robertson
Bernd Schubert wrote:
> Hi,
> 
> I think there's a race condition on initializing an entirely new cib.xml, I 
> will describe it as an example:
> 
> system1 + system2: 
>   1) stop heartbeat
>   2) delete their present status:
> 
> rm -rf /var/lib/heartbeat/crm/c* \
> /var/lib/heartbeat/ccm/ccm \
> /var/lib/heartbeat/register \
> /var/lib/log_daemon \
> /var/run/heartbeat/ccm/ccm \
> /var/run/heartbeat/crm/crmd \
> /var/run/heartbeat/crm/cib_callback \
> /var/run/heartbeat/crm/cib_ro \
> /var/run/heartbeat/crm/cib_rw \
> /var/run/heartbeat/register
> 
> Also deleting /var/lib/heartbeat/crm/cib.xml on system2 does not help for the 
> problem below.
> 
> On system1 a new cib.xml is copied to /var/lib/heartbeat/crm. Now heartbeat 
> is 
> started first on system1, then also on system2, but within a couple of 
> seconds after the start on system1.
> 
> If I see it right, system2 should take over the configuration from system1, 
> however, it rather often happens that system1 deletes its proper cib.xml and 
> gets an empty cib.xml from system2.
> 
> Is there already a proper solution to always push the cib.xml from system1 to 
> system2, if the cib.xml on system2 is empty?
> 
> If not, in principle I would be willing to fix this myself. However, this 
> strongly depends on how well the sources are readable and commented (I have 
> not looked into the sources yet) and first we should discuss on which events 
> pushing the cib.xml from one system to another should happen.

What version of heartbeat are you running?


-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] cib.xml races on initialization

2007-04-02 Thread Alan Robertson
Yan Fitterer wrote:
> Manual manipulation of cib through /var filesystem is explicitly
> discouraged.
> 
> Use the cibadmin tool. Heartbeat will synchronize the cib between all
> nodes automatically for you.

However, the case he's doing - of initializing it before starting - is
very common -- we even do it systematically during our CTS testing.

What he's doing (if I understood it correctly) should be completely safe.

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] New Command Line Tools: Resource Scripts

2007-04-02 Thread Alan Robertson
Martin Fick wrote:
> Hi,
> 
> I have been using heartbeat 2 for a while and tend to
> prefer scripts over GUIs or XML so I have created some
> helper scripts (that I call Resource Scripts) to
> configure/modify resources and their constraints from
> the command line that I would like to share with
> others.  
> 
> I only have a few of these scripts developed and they
> might not in themselves be particularly useful to
> others, but the main shell library and the idea
> probably would be.  With the library as a start, you
> can easily create resource scripts for the types of
> resource that you use.
> 
> 
> 
> A Resource Script is named after the type of resource
> or resource group that it will create and can be
> invoked with very simple options causing resources to
> be defined in the heartbeat CIB.

I'll read the rest of your email in a bit, but PLEASE don't call them
resource scripts.  That term is already taken ;-)  A resource agent is
often called a resource script.

-- 
Alan Robertson <[EMAIL PROTECTED]>


[Linux-HA] Helping out the Linux-HA (heartbeat) project

2007-04-02 Thread Alan Robertson
Hi,

We have lots of different needs to help the project.  Some are coding,
but most aren't.

I've tried to outline a few of them and put them on the web site on this
page:
http://linux-ha.org/HowToContribute

You may think there is nothing you can do to help us, maybe because you
think we don't need the kind of talents you have.

Nothing could be further from the truth!  Check out HowToContribute and
see what I mean!

PS:  Specific coding tasks are not yet up on the web site.  Those are
coming.

I'm going to send out a separate note on the Education project.  Its
objective:  To make heartbeat the best documented open source project.



-- 
Alan Robertson <[EMAIL PROTECTED]>


[Linux-HA] The Heartbeat Education Project

2007-04-02 Thread Alan Robertson
Hi,

For the impatient - view this totally cool link right now;-)
http://linux-ha.org/Education/Newbie/IPaddrScreencast

For the less impatient, and for context about that content and why I
created it, read on...

It's readily apparent that we need good educational / training materials
for Linux-HA.   It is also readily apparent that we don't have them.
Although release 2 is powerful and provides a rich set of constructs for
creating really cool HA systems, it is much harder to learn than it
ought to be, and much more intimidating than it needs to be.

Towards this end, I've started an Education sub-project, to provide
training on Linux-HA - with the intent of making Linux-HA the _best_
documented open source project, and easy to learn.


The home page for the education project is here:
http://linux-ha.org/Education
There are a lot of ideas for what should be covered in the Newbie and
Novice levels.
http://linux-ha.org/Education/Newbie
http://linux-ha.org/Education/Novice

(I need to fix something in the linking for the Journey(wo)man section
-- or change the name).

To give an idea of what I have in mind, I've produced a sample Newbie
level screencast:
http://linux-ha.org/Education/Newbie/IPaddrScreencast

You can watch this screencast in under 10 minutes.  The hope is for all
the educational modules to be 15 minutes or less - each one teaching you
 a task you need to know how to do to get your job done.

I haven't documented how I made this screencast yet, but I'll do that
soon.  But, it was all done with free-as-in-free-beer tools (no, not
free software; the truly Free software so far sucks for this
purpose).

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] cib.xml races on initialization

2007-04-03 Thread Alan Robertson
Andrew Beekhof wrote:
> On 4/3/07, Alan Robertson <[EMAIL PROTECTED]> wrote:
>> Yan Fitterer wrote:
>> > Manual manipulation of cib through /var filesystem is explicitly
>> > discouraged.
>> >
>> > Use the cibadmin tool. Heartbeat will synchronize the cib between all
>> > nodes automatically for you.
>>
>> However, the case he's doing - of initializing it before starting - is
>> very common -- we even do it systematically during our CTS testing.
>>
>> What he's doing (if I understood it correctly) should be completely safe.
> 
> no
> 
> he's starting two node at (essentially) the same time with two
> different configurations (ie. one empty) but (i'm guessing) both with
> the same version.  in such cases the result is semi-random as to which
> version will "win" (probably based on the host name or who won the
> election IIRC).
> 
> try setting admin_epoch="1" in the copy of cib.xml you're copying
> in... that will ensure that it will be recognised as "more recent" and
> always be used in preference to an empty configuration.

OK.  I hear what you're saying - from the point of view of explaining
how the code works.

On the other hand, from an intuitive point of view an automatically
generated empty CIB should never be preferred to one with content.

Is there some way to create empty ones with an epoch of -1 (logically
speaking) or something equivalent?

Or make the default epoch of an existing CIB file with a missing epoch
to be 1 instead of zero?  Does the crm_validate tool catch the missing
epoch and complain/explain about it?
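
To illustrate the version comparison being discussed, here is a sketch of how two CIB copies might be ranked. It assumes the version is carried in the attributes admin_epoch, epoch and num_updates on the top-level cib element, compared in that order, with a missing attribute treated as 0; only admin_epoch is named in this thread, so the other attribute names and the ordering are assumptions. A tie returns None, to reflect the "semi-random" outcome described above. This is not the crm's actual election code.

```python
import xml.etree.ElementTree as ET

VERSION_FIELDS = ("admin_epoch", "epoch", "num_updates")  # assumed names/order


def cib_version(cib_xml):
    """Return the (admin_epoch, epoch, num_updates) tuple of a CIB document."""
    root = ET.fromstring(cib_xml)
    return tuple(int(root.get(field, "0")) for field in VERSION_FIELDS)


def preferred_cib(cib_a, cib_b):
    """Return the CIB with the higher version, or None if they tie."""
    version_a, version_b = cib_version(cib_a), cib_version(cib_b)
    if version_a == version_b:
        return None
    return cib_a if version_a > version_b else cib_b


# Hypothetical minimal documents, not complete CIBs.
empty_cib = '<cib><configuration/></cib>'
copied_in = '<cib admin_epoch="0"><configuration><resources/></configuration></cib>'
bumped = '<cib admin_epoch="1"><configuration><resources/></configuration></cib>'

print("copied-in vs empty:", preferred_cib(copied_in, empty_cib))
print("bumped vs empty wins:", preferred_cib(bumped, empty_cib) == bumped)
```

In this sketch the copied-in file with admin_epoch 0 ties with the empty one, which is the situation reported; bumping admin_epoch to 1 breaks the tie in its favour.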

-- 
Alan Robertson <[EMAIL PROTECTED]>


Re: [Linux-HA] which is lying ?

2007-04-03 Thread Alan Robertson
Patrick Begou wrote:
> 
> Maybe it is also an error of my own. I assumed backward compatibility
> with glib. The binaries require libglib 1.2 and I have libglib-2.0. I
> created a link from libglib 1.2 to libglib-2.0. Maybe that's not a good
> idea.


Yes.  That will cause heartbeat to die immediately.  The reason we have
a dependency on glib2 in recent versions is because glib2 is
incompatible with glib1 from an API perspective, and glib1 is obsolete.

-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] possible bug in send_arp

2007-04-03 Thread Alan Robertson
Michael Schwartzkopff wrote:
> Am Dienstag, 3. April 2007 12:09 schrieb Michael Schwartzkopff:
>> Hi,
>>
>> send_arp --help gives me:
>> usage: send_arp [-i repeatinterval-ms] [-r repeatcount] [-p pidfile] \
>>   device src_ip_addr src_hw_addr broadcast_ip_addr netmask
>>
> 
> Ok, found it myself. The format of the MAC address is without the columns. 
> That makes it more difficult to use it in ressource scripts.

Without the columns... ?

I don't remember the format, but I do remember that the code documents
it as being unusual.

Please note that this command is not advertised externally, so it's
subject to change without notice.

We have 3 different resource agents that use this command,
including a sendarp resource agent.

What resource agent are you using it for?

-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] The Heartbeat Education Project

2007-04-03 Thread Alan Robertson
Michael Schwartzkopff wrote:
> Am Dienstag, 3. April 2007 04:27 schrieb Alan Robertson:
>> Hi,
>>
>> For the impatient - view this totally cool link right now;-)
>>  http://linux-ha.org/Education/Newbie/IPaddrScreencast
> 
> Cool! Thanks.

Please look over the planned course segments.  We're definitely looking
for help in producing them.  If everyone waits for me to get the time to
do them all myself, it will be a very long time before they're finished.

There are several roles people can help with:
    course segment planning and "storyboarding"
    screen captures
    adding captions
    dialog and voice-overs
    quality control
    project oversight

-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


[Linux-HA] Re: [Linux-ha-dev] Reg pingd

2007-04-03 Thread Alan Robertson
[EMAIL PROTECTED] wrote:
>  
>  
> Hi all,
> I am trying to use pingd. 
> I have the following in ha.cf
> respawn root /usr/lib/heartbeat/pingd -m 100 -d 5s 
>  
> And I have attached my cib.xml. I am not sure where I am wrong as I
> am new to xml. 
> I have this cib.xml in both the active & passive member of the
> cluster.
>   My only resource is IPaddr and when pingd is running and when the link
> goes down on the active cluster member, it is not failing over to the
> passive though the link down event is detected by pingd. 
> Plz anyone let me know what could be the issue.
> Regds,
> Kitty.

You are using a different resource attribute name ("default_ping_set"),
but not telling pingd to set that attribute name.

I seem to recall that the default pingd attribute name is pingd.
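
A quick way to spot this kind of mismatch is to list the attribute names
that the constraint rules test and compare them with the name pingd is
setting.  The following is only a rough sketch: the constraint fragment,
its ids and its score are made up for illustration, and the default name
"pingd" is taken from my recollection above, not checked against the code.
It will also flag expressions that have nothing to do with pingd.

```python
import xml.etree.ElementTree as ET

# Hypothetical constraint fragment; ids, resource name and score are invented.
CONSTRAINT = """
<rsc_location id="loc_ip" rsc="IPaddr_1">
  <rule id="loc_ip_rule" score="-INFINITY">
    <expression id="loc_ip_expr" attribute="default_ping_set"
                operation="not_defined"/>
  </rule>
</rsc_location>
"""


def mismatched_attributes(xml_text: str, pingd_attr: str = "pingd") -> list:
    """List attribute names tested by rule expressions that differ from
    the name pingd is assumed to be setting."""
    root = ET.fromstring(xml_text)
    names = {e.get("attribute") for e in root.iter("expression")}
    return sorted(n for n in names if n and n != pingd_attr)


print(mismatched_attributes(CONSTRAINT))  # ['default_ping_set']
```

An empty list means every expression tests the name pingd is assumed to
set; either rename the attribute in the rule or tell pingd to use the
other name.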



-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] The Heartbeat Education Project

2007-04-03 Thread Alan Robertson
Michael Schwartzkopff wrote:
> Am Dienstag, 3. April 2007 14:21 schrieb Sander van Vugt:
> (...)
>> I'm interested to help. What thoughts do you have on organizing this?
>> And if you do not have any thoughts about this, I'm volunteering for the
>> "project oversight" role to start with. Don't want to keep you from
>> developing new stuff for the Heartbeat project ;-)
>>
>> Thoughts?
> 
> Same for me. Could do some screen captures, esp for the new IPaddr2 RA.

Wonderful!!

For the information of others reading this:
Sander and Michael and I still need _LOTS_ of help!  :-D

I'll probably still help with this, if for no other reason than I'm
working some issues for having music intros and closes - with full legal
permission.  The current version of 'wink' does stinky audio - but that
will eventually be fixed.  When it is, I'll try to add a little
musical intro.

Michael:
I would imagine that having an advanced screencast on setting up cluster
IP addresses, including all the IPtables stuff would be great.  We'll
probably still need some reference-level information for the web site so
that people can find the presentation.

Sander:
You have the job!
We need to lay out a good set of courses through the journey(wo)man
level.  Obviously Lars and Andrew and Dejan and I will have more things
to say about what we think people need to know.

We also need to make sure these screencasts can be found by Google search.

I'm sure there's a longish todo list for this project beyond even the
course list.  Document how to make these screencasts, etc.  Please
document the todo on the web site, maybe a page named Education/TODO or
something like that?  That way I can add to it as I think of things...

Thanks to both of you, and to everyone else who hasn't _yet_ volunteered
but will help!


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] timeout monitoring VIP

2007-04-03 Thread Alan Robertson
Christophe Zwecker wrote:
> hi
> 
> we run 2.0.8
> 
> yesterday we saw a timeout in monitoring vip:
> 
> Apr  1 06:14:50 mw-n1 lrmd: [21216]: WARN: on_op_timeout_expired:
> TIMEOUT: operation monitor[222] on ocf::IPaddr::IPaddr_88_198_96_210 for
> client 21219, its parameters: target_role=[started]
> CRM_meta_interval=[5000] ip=[88.198.96.210]
> CRM_meta_op_target_rc=[7] netmask=[29]
> CRM_meta_id=[IPaddr_88_198_96_210_mon] CRM_meta_timeout.
> Apr  1 06:14:50 mw-n1 crmd: [21219]: ERROR: process_lrm_event: LRM
> operation IPaddr_88_198_96_210_monitor_5000 (222) Timed Out
> (timeout=5000ms)
> 
> we had a load of 2 on the machine at that time, but we have had that plenty
> of times before, so I don't know if that could be an issue?
> 
> shall I just increase the timeout and forget about it, or what could
> cause this?

I think I'd increase the timeout and forget about it.  I don't know why,
but every so often the fork/exec/run/exit process takes longer than
you'd think.  It's probably related to cache size and I/O workload, not
to CPU load.
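
If you want a feel for how much the spawn cost varies on your machine,
something like the following rough sketch will do.  It is not what the
lrmd does; it just times a trivial child process a number of times and
prints the median and the worst case.  The function name and the run
count are arbitrary.

```python
import subprocess
import sys
import time


def spawn_times(runs: int = 20) -> list:
    """Wall-clock seconds for each fork/exec/run/exit of a trivial child."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # The child does nothing, so the time is almost all spawn overhead.
        subprocess.run([sys.executable, "-c", "pass"], check=True)
        samples.append(time.perf_counter() - start)
    return samples


samples = spawn_times()
median_ms = sorted(samples)[len(samples) // 2] * 1000
print(f"median {median_ms:.1f} ms, worst {max(samples) * 1000:.1f} ms")
```

Run it while the machine is busy with I/O and compare the worst case
against the monitor timeout you have configured.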

-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] cib.xml races on initialization

2007-04-03 Thread Alan Robertson
Andrew Beekhof wrote:
> On 4/3/07, Alan Robertson <[EMAIL PROTECTED]> wrote:
>> Andrew Beekhof wrote:
>> > On 4/3/07, Alan Robertson <[EMAIL PROTECTED]> wrote:
>> >> Yan Fitterer wrote:
>> >> > Manual manipulation of cib through /var filesystem is explicitly
>> >> > discouraged.
>> >> >
>> >> > Use the cibadmin tool. Heartbeat will synchronize the cib between all
>> >> > nodes automatically for you.
>> >>
>> >> However, the case he's doing - of initializing it before starting - is
>> >> very common -- we even do it systematically during our CTS testing.
>> >>
>> >> What he's doing (if I understood it correctly) should be completely safe.
>> >
>> > no
>> >
>> > he's starting two node at (essentially) the same time with two
>> > different configurations (ie. one empty) but (i'm guessing) both with
>> > the same version.  in such cases the result is semi-random as to which
>> > version will "win" (probably based on the host name or who won the
>> > election IIRC).
>> >
>> > try setting admin_epoch="1" in the copy of cib.xml you're copying
>> > in... that will ensure that it will be recognised as "more recent" and
>> > always be used in preference to an empty configuration.
>>
>> OK.  I hear what you're saying - from the point of view of explaining
>> how the code works.
>>
>> On the other hand, from an intuitive point of view an automatically
>> generated empty CIB should never be preferred to one with content.
> 
> and one with content should never have a version of 0.0.0
> (admin_epoch, epoch, num_updates)
> 
>> Is there some way to create empty ones with an epoch of -1 (logically
>> speaking) or something equivalent?
> 
> that still wont help because most of the time the three version
> attributes are unset.  doing this will just end up defaulting to -1
> instead of zero.

I meant different for the automatically created CIB, versus a manually
created one...  You can certainly put in anything you want into the
automatically created empty one...


This is one of those places where mixing the status section in with the
configuration section hurts us.  Because we immediately update the
automatically created one with status, and it looks like something
substantive was changed.  Sigh...


>> Or make the default epoch of an existing CIB file with a missing epoch
>> to be 1 instead of zero?
> 
> possibly, but the simplest answer is really to just set a proper value
> for admin_epoch
> 
>> Does the crm_validate tool catch the missing
>> epoch and complain/explain about it?
> 
> i think so, but possibly we default those fields before the DTD check is
> run

I'm just trying to see if there's something reasonable we can do to make
this more bulletproof - without deflecting the bullet to our feet ;-).

But thinking about it some more maybe there really isn't anything simple
that can be done because of the fact that we update the version info
ourselves with status information...

-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] New Command Line Tools: Resource Scripts CIBS

2007-04-03 Thread Alan Robertson
Martin Fick wrote:
>> On 4/2/07, Martin Fick <[EMAIL PROTECTED]> wrote:
>>> I have been using heartbeat 2 for a while and tend
>>> to prefer scripts over GUIs or XML so I have 
>>> created some helper scripts (that I call Resource 
>>> Scripts) to configure/modify resources and their 
>>> constraints from the command line that I would like
>>> to share with others.
>>>
>>> I only have a few of these scripts developed and
>>> they might not in themselves be particularly 
>>> useful to others, but the main shell library and 
>>> the idea probably would be.  With the library as a
>>> start, you can easily create resource scripts for 
>>> the types of resource that you use.
> 
> --- Alan Robertson <[EMAIL PROTECTED]> wrote:
> 
>> ... but PLEASE don't call them resource scripts.  
>> That term is already taken ;-)  A resource agent is 
>> often called a resource script.
> 
> I agree. :) I will call them Resource CIB Scripts
> then, or "Resource CIBS" for short, better?  If there
> are not objections, I will change my web page too.  If
> I create constraint plugins, I can then call them
> "Constraint CIBS".

Or you can just call them CIB scripts, or CLI scripts... Those terms are
not taken, and are a little broader...


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] possible bug in send_arp

2007-04-03 Thread Alan Robertson
Michael Schwartzkopff wrote:
> Am Dienstag, 3. April 2007 13:44 schrieb Alan Robertson:
>> Michael Schwartzkopff wrote:
>>> Am Dienstag, 3. April 2007 12:09 schrieb Michael Schwartzkopff:
>>>> Hi,
>>>>
>>>> send_arp --help gives me:
>>>> usage: send_arp [-i repeatinterval-ms] [-r repeatcount] [-p pidfile] \
>>>>   device src_ip_addr src_hw_addr broadcast_ip_addr netmask
>>> Ok, found it myself. The format of the MAC address is without the
>>> columns. That makes it more difficult to use it in ressource scripts.
>> Without the columns... ?
>>
>> I don't remember the format, but I do remember that the code documents
>> it as being unusual.
>>
>> Please note that this command is not advertised externally, so it's
>> subject to change without notice.
>>
>> We have 3 different resource agents that use this command,
>> including a sendarp resource agent.
>>
>> What resource agent are you using it for?
> 
> Hi,
> 
> I am using it for the improved version of IPaddr2 with load sharing via 
> CLUSTERIP target of iptables. I will post the new version as soon as I am 
> satisfied with it (soon!). At the moment it works, but some bugs are left.
> 
> send_arp: I need it to advertise my new MAC address, which is a multicast MAC 
> address, not the normal HW address of the interface. Looking at the source of 
> send_arp I noticed that the right format for the MAC address is:
> 010203040506, instead of 01:02:03:04:05:06.
> 
> If you change the format, I would suggest adding a check so that both 
> formats can be passed to the program.

If you're planning on having us pick up your changes, you have no
worries ;-)

If your changes are clean and clear, and you can give us some simple
emails from your employer, then we'd be happy to pick it up.
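
For a resource script that has to feed send_arp today, the two MAC
formats mentioned above can be reduced to the separator-free one before
the call.  This is only a sketch of that idea in Python; normalize_mac
is a made-up helper, not something shipped with heartbeat, and it is
deliberately lenient about where the separators sit.

```python
import re


def normalize_mac(mac: str) -> str:
    """Return a MAC address as 12 lowercase hex digits without separators."""
    digits = re.sub(r"[:-]", "", mac.strip()).lower()
    if not re.fullmatch(r"[0-9a-f]{12}", digits):
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits


print(normalize_mac("01:02:03:04:05:06"))  # 010203040506
print(normalize_mac("010203040506"))       # 010203040506
```

Anything that does not come out as exactly 12 hex digits raises
ValueError, so a typo is caught before the address is sent out.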


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] R2 Two-node apache cluster with STONITH

2007-04-03 Thread Alan Robertson
Bjorn Oglefjorn wrote:
> Anyone? Help?
> --BO
> 
> On 4/2/07, Bjorn Oglefjorn <[EMAIL PROTECTED]> wrote:
>>
>> Any ideas as to what's going wrong here?

There is so much send/reply/try/fail/fix stuff in the email that I had
trouble following what was going on.

Could you try reposting this cleanly and explain what symptoms you're
seeing?  I just saw "it doesn't work", and that's not very helpful.

> See also: http://linux-ha.org/ReportingProblems 


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] cib.xml races on initialization

2007-04-04 Thread Alan Robertson
Bernd Schubert wrote:
>>> Or make the default epoch of an existing CIB file with a missing epoch
>>> to be 1 instead of zero?
>> possibly, but the simplest answer is really to just set a proper value
>> for admin_epoch
> 
> As far as coding goes, certainly; as far as writing config files goes, I
> don't think so. I guess people usually prefer small files - for code and
> for config files.
> 
> I also think the documentation should be improved; even after reading it
> several times, I still did not understand what admin_epoch is used for.
> 
> 
> The CIB's version is a tuple of admin_epoch, epoch and num_updates (in that 
> order). This is used when applying updates from the master CIB instance. 
> 
> 
> How about changing that into
> 
> The CIB's version x.y.z is a tuple of admin_epoch, epoch and num_updates (in 
> that order). This is used when applying updates from the master CIB instance. 
> If unset, all three values default to zero, so an empty cib.xml would have the 
> version 0.0.0. A heartbeat sysadmin should set admin_epoch to force the usage 
> of a manually created cib.xml over an empty file that heartbeat created 
> automatically, e.g. .

This sounds good to me.

Maybe add "This tuple is used for CIB version comparison, with
admin_epoch being most significant, and num_updates being least
significant.  The CIB never automatically updates the admin_epoch
number.  This element of the tuple is left for the administrator to use
to distinguish his or her version updates"  or something like that for
good measure.  I'd rather be clear but a little redundant than not quite
as clear.
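
To make the ordering concrete, here is a small Python sketch of the
comparison as described in this thread.  It is not the CIB code; the
function names are made up, the attributes are passed in as a plain
dictionary, and a tie simply keeps the first argument, whereas the real
cluster's choice on a tie was described earlier as semi-random.

```python
from typing import Mapping, Tuple


def cib_version(attrs: Mapping[str, str]) -> Tuple[int, int, int]:
    """Return (admin_epoch, epoch, num_updates); unset fields count as 0."""
    return tuple(
        int(attrs.get(key, 0)) for key in ("admin_epoch", "epoch", "num_updates")
    )


def newer(a: Mapping[str, str], b: Mapping[str, str]) -> Mapping[str, str]:
    """Pick the attribute set with the higher version (ties keep `a`)."""
    return a if cib_version(a) >= cib_version(b) else b


# An automatically created CIB whose num_updates was bumped by status updates.
generated_empty = {"num_updates": "5"}
# A hand-written CIB in which the administrator set admin_epoch.
hand_written = {"admin_epoch": "1"}

print(cib_version(generated_empty))                          # (0, 0, 5)
print(cib_version(hand_written))                             # (1, 0, 0)
print(newer(generated_empty, hand_written) is hand_written)  # True
```

Because admin_epoch is compared first, setting it to 1 wins no matter
how many status updates the generated file has accumulated.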

Andrew?  Your thoughts?

Bernd: If you want to change it yourself, the directions for updating
the web site are here:
http://linux-ha.org/HowToUpdateWebsite





-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


[Linux-HA] Re: [Linux-ha-dev] Starting a resource

2007-04-04 Thread Alan Robertson
[EMAIL PROTECTED] wrote:
>  
>  
> Hi all,
> I tried to start the resource IPaddr using OCF script. I could do
> that successfully.
> Now I am trying to start another resource xinetd using lsb. But I am
> not able to start this xinetd service.
> I am attaching the xml file and the log and I am using only one ping
> node. 
>  
> Please let me know what could be the issue.
> Following are the various alternatives I used in the cib.xml file to
> start and monitor xinetd.

1) You probably want to use the xinetd OCF resource agent we supplied
instead of using the xinetd init script.  Most people leave xinetd
running all the time, and just incrementally enable and disable services
- which is what our resource agent does.

2) Heartbeat doesn't think there's any operation called status.  You call
it the monitor operation, and if that operation doesn't exist for the
particular kind of resource agent, then we'll translate it into
something appropriate - like status.
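
The translation in point 2 amounts to something like the following
Python sketch.  It only illustrates the rule as I described it; the
function name and the class strings are made up for the example and
this is not the actual lrmd code.

```python
def effective_operation(ra_class: str, operation: str) -> str:
    """Map the operation named in the CIB to what the agent is asked to do."""
    if ra_class == "lsb" and operation == "monitor":
        # Init scripts have no monitor action, so status is used instead.
        return "status"
    return operation


print(effective_operation("lsb", "monitor"))  # status
print(effective_operation("ocf", "monitor"))  # monitor
```

So in cib.xml the operation is always written as monitor, whatever class
of resource agent sits underneath.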


Of course, this begs the question of "What are you trying to do here?"

And, of course, the other question of "Why aren't you sending simple
configuration questions to the main list instead of the development
list?"  You will notice that I've moved your question to the more
appropriate list.  I suggest you subscribe to it ;-)

So, I would suggest that you answer the question of "what are you trying
to accomplish?"


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] OCF_RESKEY_interval

2007-04-04 Thread Alan Robertson
Bernd Schubert wrote:
> Hi,
> 
> after upgrading from heartbeat-2.0.5 to heartbeat-2.0.8, OCF_RESKEY_interval 
> is not set anymore, which makes our monitoring actions always 
> return ${OCF_NOT_RUNNING}.
> 
> As given in the example 
> http://www.linux-ha.org/ClusterInformationBase/Actions 
> in section "Monitoring Examples", the interval is properly set in the 
> cib.xml.
> 
> 
> 
> Any idea why OCF_RESKEY_interval is not set anymore?

It was improper for it to pass it to the resource agent.  This was
corrected.

It is NOT a parameter. It's an attribute, which is a direction to the HA
system, not to the RA.
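
An agent that has to survive this change should treat the interval as
optional instead of failing when it is missing.  A rough sketch, written
in Python for brevity: the variable name OCF_RESKEY_CRM_meta_interval is
an assumption on my part, based on the CRM_meta_interval=[5000] entry
that lrmd logs among an operation's parameters, and the value is assumed
to be plain milliseconds.

```python
import os
from typing import Mapping, Optional


def monitor_interval_ms(env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return the monitor interval in ms if one was exported, else None."""
    # Assumed new-style name first, then the name that 2.0.5 used to set.
    for name in ("OCF_RESKEY_CRM_meta_interval", "OCF_RESKEY_interval"):
        value = env.get(name)
        if value:
            return int(value)
    return None


print(monitor_interval_ms({"OCF_RESKEY_CRM_meta_interval": "5000"}))  # 5000
print(monitor_interval_ms({}))                                        # None
```

When None comes back, the monitor action should carry on with its normal
check and not report the resource as not running.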


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] STONITH in response to stop failures (suicide or ssh)

2007-04-04 Thread Alan Robertson
Dave Blaschke wrote:
> Christophe Zwecker wrote:
>> Hi,
>>
>> we do not have any stonith hw devices (yet). We recently encountered the
>> problem that a resource could not reliably be stopped (drbd fs
>> wouldn't unmount and mysqld was in state unmanaged).
>>
>> Since then I've been looking around and wanted to try the ssh plugin or
>> suicide. I don't quite understand the practical difference: when would
>> one prefer ssh over suicide?
> Suicide can only reboot itself while ssh can reboot either node. 
> However, neither is really suitable in a production environment since a
> node that needs to be stonith'd may not respond to a request to suicide
> itself or an external ssh command to reboot itself.

I think for what they want, suicide is perfect.

They want a node to kill itself if it can't stop a resource.  The
node is otherwise up and well and happy.

Suicide should work nicely for that...

-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] Masters take long time to get back the ip from slave

2007-04-04 Thread Alan Robertson
gh the other side
no longer owned any resources.

After this, a variety of BadThingsWillHappen - including
destroying shared disk data.

Sounds like the web page was right.  Bad Things Happened.

We're going to try and get you to an initially working cluster, but
since you don't appear to be at all experienced in managing Linux
systems, it will likely be painful, and it won't replace all the
knowledge you appear to lack at this point.  I'm not trying to be
insulting, and if I've insulted you, please forgive me.  I'm just trying
to be realistic.  You really do need to know how to read logs, read
documentation, and manage a Linux system before you try and make one
highly-available.


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce

