Hehehe... so I changed the config to bcast eth0 eth1

and did a /etc/init.d/heartbeat reload

not realizing that this initiates a restart of heartbeat. I thought it would just reload the configs and figure, well hey, I have another eth to go to now... but it brought down the primary, and the secondary took over flawlessly without my knowing. Then an hour later I simply brought heartbeat down and back up on the primary and everything came back... wow... I was impressed that I got no phone calls at all from clients! :)

However, I noticed one thing that can be dangerous: mysql seemed to have lost a few files in its /var/www/mysql?? When the secondary took over, the freshly mounted filesystem was missing some .MYD files for some tables. Only 2 sites were affected, and in a minor way, but what happened to these files? When I brought back the primary, the files were still not there (this is normal, I guess). The errors for the sites disappeared, however, even though the files were still missing...???

Any ideas or suggestions?

Thanks again for all your help... it was exciting to see it work unexpectedly.... :)

P.S. Here is my haresources file, in case you were wondering. I thought it might be due to startup, where the database files are not there (not mounted) but mysql starts...

joe IPaddr::xx.xx.xx.150 drbddisk::mail drbddisk::web \
Filesystem::/dev/drbd0::/var/mail/virtual::ext3::defaults \
Filesystem::/dev/drbd1::/var/www::ext3::defaults \
postfix courier-authdaemon courier-pop courier-imap mysql apache2 proftpd

Rob Morin
Dido Internet Inc.
Montreal,Canada
http://www.dido.ca
514-990-4444



Madd Sauer wrote:
Hello,
On Thu, May 08, 2008 at 08:53:39AM -0400, Rob Morin wrote:
Actually another question....

I would simply add eth1 to the heartbeat ha.cf then? And what's the difference between using mcast vs bcast? I am not sure I understand this.

mcast = multicast
If your router supports multicast routing, these packets will be routed.

bcast = broadcast
Broadcasts are NEVER routed.

ucast = unicast
That's my choice. Broadcasts are mostly trash on the net (imho) and I
always use non-bcast if I can. Unicast is also routed traffic.
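
As a rough illustration of the three transport types, the corresponding ha.cf directives might look like the sketch below. The multicast group and the peer address are placeholders for illustration only, not values from this thread, and normally only one or two of these lines would be used.

```
# bcast: broadcast heartbeats on the listed interface(s); never routed
bcast eth0 eth1

# mcast: device, multicast group, UDP port, TTL, loop
# (225.0.0.1 is a placeholder group)
mcast eth0 225.0.0.1 694 1 0

# ucast: device and the peer node's IP address
# (192.168.5.2 is a placeholder address)
ucast eth0 192.168.5.2
```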

Madd

Thanks a bunch
:)



Dominik Klein wrote:
Rob Morin wrote:
I have not seen my original email get to the list yet... but after looking through the logs I see this on each node...
See below for log excerpts...

My test involved bringing down eth0 only (heartbeat & replication). Should I have also brought down eth1, the public side of joe (primary)?

my conf file is...

logfacility     daemon        # This is deprecated
keepalive 2                   # Interval between heartbeat (HB) packets.
deadtime 60                   # How quickly HB determines a dead node.
warntime 5                    # Time HB will issue a late HB.
initdead 120                  # Time delay needed by HB to report a dead node.
udpport 694                   # UDP port HB uses to communicate between nodes.
#ping 192.168.5.1             # Ping VMware Server host to simulate network resource.
bcast eth0
You only use one connection for heartbeat communication. That is a configuration error.

As you unplugged that interface for testing, you forced a splitbrain situation. Read http://www.linux-ha.org/SplitBrain

Dual split brain, so to speak. Your drbd replication is also done over this link. So not only does heartbeat lose its connection, but drbd does too. In a standard setup, a disconnected secondary drbd device can be promoted regardless of the peer's drbd state.

You might want to read about dopd, too: http://www.drbd.org/users-guide/s-heartbeat-dopd.html It can prevent drbd splitbrain, but you need to have >1 network connection anyways.
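
A minimal sketch of what the ha.cf changes could look like, assuming eth1 is available as a second heartbeat path and that dopd is installed under /usr/lib/heartbeat (the path may differ, for example /usr/lib64/heartbeat on some 64-bit systems). dopd additionally needs a matching outdate-peer handler and fencing setting in drbd.conf, as described in the guide linked above; that part is not shown here.

```
# Send heartbeats over two links so that losing one does not cause split brain
bcast eth0 eth1

# Start the DRBD outdate-peer daemon with heartbeat and let it use the API
respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster
```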

#baud 115200
#serial /dev/ttyS0              # Which interface to use for HB packets.
coredumps true
auto_failback on # Auto promotion of primary node upon return to cluster.
Your comment answers your later question on what will happen when a rebooted (stonith'd) node rejoins the cluster.

Regards
Dominik

node    joe      # Node name must be same as uname -n.
node    stewie      # Node name must be same as uname -n.
###
###
respawn hacluster /usr/lib/heartbeat/ipfail
# Specifies which programs to run at startup
# DO not use the below unless you use the /var/lib/heartbeat/crm/cib/xml config file instead
#crm on
use_logd yes                  # Use system logging.
logfile /var/log/hb.log       # Heartbeat logfile.
debugfile /var/log/heartbeat-debug.log # Debugging logfile.


Primary
--------

May  6 23:04:44 joe heartbeat: [4342]: WARN: node stewie: is dead
May  6 23:04:44 joe heartbeat: [4342]: WARN: No STONITH device configured.
May  6 23:04:44 joe heartbeat: [4342]: WARN: Shared disks are not protected.
May  6 23:04:44 joe heartbeat: [4342]: info: Resources being acquired from stewie.
May  6 23:04:44 joe heartbeat: [4342]: info: Link stewie:eth0 dead.
May  6 23:04:44 joe heartbeat: [4249]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
May  6 23:04:44 joe mach_down[4283]: [4328]: info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
May  6 23:04:44 joe heartbeat: [4342]: info: mach_down takeover complete.
May  6 23:04:44 joe heartbeat: [4342]: debug: StartNextRemoteRscReq(): child count 1
May  6 23:04:44 joe heartbeat: [4250]: info: Local Resource acquisition completed.


Secondary
-----------

May  6 23:04:46 stewie heartbeat: [21820]: info: Resources being acquired from joe.
May  6 23:04:46 stewie heartbeat: [21820]: info: Link joe:eth0 dead.
May  6 23:04:46 stewie heartbeat: [4946]: info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys stewie] to acquire.
May  6 23:04:46 stewie heartbeat: [21825]: ERROR: MSG[4] : [info=req_our_resources()]
May  6 23:05:10 stewie mach_down[4953]: [6063]: info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
May  6 23:05:10 stewie heartbeat: [21820]: info: mach_down takeover complete.
May  6 23:05:10 stewie heartbeat: [21825]: ERROR: MSG[2] : [info=mach_down]
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems