Hello again, 

I am attaching these log lines from /var/log/messages on director1 and 
director2, in case they give a clue: 

*** From /var/log/messages on director1 ***

After typing: "/etc/init.d/heartbeat start" on director1:

Mar  4 10:15:32 director1 ldirectord[22403]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf status
Mar  4 10:15:32 director1 ldirectord[22403]: Exiting with exit_status 3: Exiting from ldirectord status
Mar  4 10:15:33 director1 ldirectord[22440]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf status
Mar  4 10:15:33 director1 ldirectord[22440]: Exiting with exit_status 3: Exiting from ldirectord status
Mar  4 10:15:34 director1 ldirectord[22456]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf start
Mar  4 10:15:34 director1 ldirectord[22456]: Starting Linux Director v1.186-ha-2.1.3 as daemon
Mar  4 10:15:34 director1 ldirectord[22458]: Added virtual server: 172.25.146.31:80
Mar  4 10:15:34 director1 kernel: [144096.978975] IPVS: stopping backup sync thread 22208 ...
Mar  4 10:15:34 director1 kernel: [144097.227697] IPVS: sync thread started: state = MASTER, mcast_ifn = eth0, syncid = 0
Mar  4 10:15:35 director1 ldirectord[22458]: Added fallback server: 127.0.0.1:80 (172.25.146.31:80) (Weight set to 1)
Mar  4 10:15:35 director1 ldirectord[22458]: Quiescent real server: 172.25.146.38:80 (172.25.146.31:80) (Weight set to 0)
Mar  4 10:15:35 director1 ldirectord[22458]: Quiescent real server: 172.25.146.37:80 (172.25.146.31:80) (Weight set to 0)
Mar  4 10:15:36 director1 ldirectord[22458]: Restored real server: 172.25.146.37:80 (172.25.146.31:80) (Weight set to 1)
Mar  4 10:15:36 director1 ldirectord[22458]: Deleted fallback server: 127.0.0.1:80 (172.25.146.31:80)
Mar  4 10:15:36 director1 ldirectord[22458]: Restored real server: 172.25.146.38:80 (172.25.146.31:80) (Weight set to 1)

After typing: "/etc/init.d/heartbeat start" on director2

Mar  4 10:20:16 director1 ldirectord[22859]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf status
Mar  4 10:20:17 director1 ldirectord[22859]: ldirectord for /etc/ha.d/ldirectord.cf is running with pid: 22458
Mar  4 10:20:17 director1 ldirectord[22859]: Exiting from ldirectord status
Mar  4 10:20:17 director1 ldirectord[22875]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf start

After typing: "/etc/init.d/heartbeat stop" on director1

Mar  4 10:26:12 director1 kernel: [144734.478909] IPVS: stopping master sync thread 22530 ...
Mar  4 10:26:12 director1 kernel: [144734.693492] IPVS: sync thread started: state = BACKUP, mcast_ifn = eth0, syncid = 0
Mar  4 10:26:12 director1 ldirectord[23144]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf stop
Mar  4 10:26:13 director1 ldirectord[22458]: Purged real server (stop): 172.25.146.37:80 (172.25.146.31:80)
Mar  4 10:26:13 director1 ldirectord[22458]: Purged real server (stop): 172.25.146.38:80 (172.25.146.31:80)
Mar  4 10:26:13 director1 ldirectord[22458]: Purged virtual server (stop): 172.25.146.31:80
Mar  4 10:26:13 director1 ldirectord[22458]: Linux Director Daemon terminated on signal: TERM


*** From /var/log/messages on director2 ***

After typing: "/etc/init.d/heartbeat start" on director2

Mar  4 10:18:43 director2 ldirectord[23274]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf status
Mar  4 10:18:43 director2 ldirectord[23274]: Exiting with exit_status 3: Exiting from ldirectord status
Mar  4 10:19:09 director2 ldirectord[23905]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf stop

After typing: "/etc/init.d/heartbeat stop" on director1

Mar  4 10:25:07 director2 ldirectord[23996]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf status
Mar  4 10:25:07 director2 ldirectord[23996]: Exiting with exit_status 3: Exiting from ldirectord status
Mar  4 10:25:08 director2 ldirectord[24012]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf start
Mar  4 10:25:08 director2 ldirectord[24012]: Starting Linux Director v1.186-ha-2.1.3 as daemon
Mar  4 10:25:08 director2 ldirectord[24014]: Added virtual server: 172.25.146.31:80
Mar  4 10:25:08 director2 ldirectord[24014]: Added fallback server: 127.0.0.1:80 (172.25.146.31:80) (Weight set to 1)
Mar  4 10:25:09 director2 ldirectord[24014]: Quiescent real server: 172.25.146.38:80 (172.25.146.31:80) (Weight set to 0)
Mar  4 10:25:09 director2 ldirectord[24014]: Quiescent real server: 172.25.146.37:80 (172.25.146.31:80) (Weight set to 0)
Mar  4 10:25:09 director2 ldirectord[24014]: Restored real server: 172.25.146.37:80 (172.25.146.31:80) (Weight set to 1)
Mar  4 10:25:09 director2 ldirectord[24014]: Deleted fallback server: 127.0.0.1:80 (172.25.146.31:80)
Mar  4 10:25:10 director2 ldirectord[24014]: Restored real server: 172.25.146.38:80 (172.25.146.31:80) (Weight set to 1)
Mar  4 10:25:19 director2 ldirectord[24646]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf stop
Mar  4 10:25:20 director2 ldirectord[24014]: Purged real server (stop): 172.25.146.37:80 (172.25.146.31:80)
Mar  4 10:25:20 director2 ldirectord[24014]: Purged real server (stop): 172.25.146.38:80 (172.25.146.31:80)
Mar  4 10:25:20 director2 ldirectord[24014]: Purged virtual server (stop): 172.25.146.31:80
Mar  4 10:25:21 director2 ldirectord[24014]: Linux Director Daemon terminated on signal: TERM

In this last part you can see that director2 starts working properly when 
director1 stops, but about 10 seconds later it stops by itself. Why might this 
be happening?  :_(
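
To make the start/stop sequence easier to follow, the ldirectord invocations 
can be pulled out of /var/log/messages on their own. This is only a minimal 
sketch: the function name ldirectord_timeline is made up for illustration, and 
it assumes the syslog layout shown above (time in the third field, host in the 
fourth, action as the last word of the line).

```shell
#!/bin/bash
# Sketch: print "time host action" for each ldirectord invocation read from stdin.
# Assumes one syslog entry per line, in the format quoted in this message.
ldirectord_timeline() {
    awk '/ldirectord\[[0-9]+\]: Invoking ldirectord invoked as:/ { print $3, $4, $NF }'
}

# Possible usage on a director (the path is the one used in this message):
#   ldirectord_timeline < /var/log/messages
```

Run against the director2 lines above, the last two entries it reports are a 
start at 10:25:08 and a stop at 10:25:19, which is the 11-second window in 
question.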

Thanks in advance...

Best regards, 

           Alejandro

==      
Alejandro Sanchez Merono - alejandro.sanc...@ite.es 
TIC Department 
Institute of Electrical Technology 
Parque Tecnologico de Valencia 
PATERNA (Valencia)
Spain

Tel.: (+34) 96 136 66 70
Fax: (+34) 96 136 66 80
Web: http://www.ite.es
E-mail: i...@ite.es 



-----Original Message-----
From: linux-ha-boun...@lists.linux-ha.org 
[mailto:linux-ha-boun...@lists.linux-ha.org] On behalf of Alejandro Sánchez 
Meroño
Sent: Tuesday, March 3, 2009 16:25
To: linux-ha@lists.linux-ha.org
Subject: [Linux-HA] HA fails when stopping master director

Hello everybody, 

This is Alejandro from Valencia, Spain. I'm glad to join this mailing list, and 
though at present I'm a complete rookie in HA -and a "sophomore" in Linux-, I'd 
like to think that some day I might help others with this subject.

Unfortunately, for now I am the one who needs a helping hand from you...

OK, I'll try to put all the data in order: 

     A) Abstract of the issue: I have configured load balancing and high 
availability with two web servers and two directors, using ldirectord and 
heartbeat. Load balancing works fine, but when testing the HA, if I stop 
heartbeat on the main director, the system switches to the backup director 
but... only for a few seconds!! Then everything is dead. The ha-debug log on 
the main director looks healthy, while the ha-debug log on the backup director 
just repeats the same error hundreds of times (quoted near the end of this 
message).

     B) What I am actually trying to do:  
My main objective is rather simple: obtain load balancing and high availability 
from two mirrored Apache web servers. At present we have just one single web 
server with a rather heavy workload, running important web applications, so 
we need to secure it. Some day we will have four physical servers, two of them 
running as load directors (master and backup) and two of them as replicated web 
servers. But first I must learn how to do it, of course. So I set up a pilot 
system.  

     C) My pilot system: 
I'm working on an Apple Xserve, where I have created four virtual machines. On 
each of them I have installed Ubuntu 8.10. I assigned a static IP to each 
VM, and reserved a virtual IP for accessing the web servers.
So, I have: 
        director1: 172.25.146.32
        director2: 172.25.146.33
        web1: 172.25.146.37
        web2: 172.25.146.38
        Virtual IP: 172.25.146.31
director1 and web1 access the network via eth0, while director2 and web2 do it 
via eth1 (I don't know why, it simply was configured like that when I created 
the virtual machines and installed Ubuntu). 

Each machine has the same /etc/hosts: 
127.0.0.1               localhost
172.25.146.32   director1
172.25.146.33   director2
172.25.146.37   web1
172.25.146.38   web2

     D) What I have installed and configured: 

                D1) Apache and PHP5 on web1 and web2. I can access from the 
browser http://172.25.146.37, and http://172.25.146.38 with no problems. 
                D2) I wrote the following script on director1 and director2: 
/etc/network/if-up.d/loadmodules

###################
#!/bin/bash

echo ip_vs_dh >> /etc/modules
echo ip_vs_ftp >> /etc/modules
echo ip_vs >> /etc/modules
echo ip_vs_lblc >> /etc/modules
echo ip_vs_lblcr >> /etc/modules
echo ip_vs_lc >> /etc/modules
echo ip_vs_nq >> /etc/modules
echo ip_vs_rr >> /etc/modules
echo ip_vs_sed >> /etc/modules
echo ip_vs_sh >> /etc/modules
echo ip_vs_wlc >> /etc/modules
echo ip_vs_wrr >> /etc/modules

modprobe ip_vs_dh
modprobe ip_vs_ftp
modprobe ip_vs
modprobe ip_vs_lblc
modprobe ip_vs_lblcr
modprobe ip_vs_lc
modprobe ip_vs_nq
modprobe ip_vs_rr
modprobe ip_vs_sed
modprobe ip_vs_sh
modprobe ip_vs_wlc
modprobe ip_vs_wrr
######################

But I noticed that after restarting the machines, the modules weren't reloaded. 
So I edited the file /etc/modules and added the lines manually (ip_vs_dh and so 
on)... I don't know whether that was the right thing to do...
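
One side effect of the script above is that every run appends the same twelve 
names to /etc/modules again, so the file grows with duplicates. A minimal 
sketch of a variant that only adds names that are still missing might look like 
this. The function names ensure_listed and register_ipvs_modules are made up 
for illustration; the module list is the one from the script above.

```shell
#!/bin/bash
# Sketch: register the IPVS modules for loading at boot without duplicating entries.

# Module names taken from the script above.
IPVS_MODULES="ip_vs ip_vs_dh ip_vs_ftp ip_vs_lblc ip_vs_lblcr ip_vs_lc ip_vs_nq ip_vs_rr ip_vs_sed ip_vs_sh ip_vs_wlc ip_vs_wrr"

# Append a module name to the file only if no identical line is there yet.
ensure_listed() {
    local file="$1" module="$2"
    grep -qx -- "$module" "$file" 2>/dev/null || echo "$module" >> "$file"
}

# Register every module in the given file (default: /etc/modules).
register_ipvs_modules() {
    local file="${1:-/etc/modules}" m
    for m in $IPVS_MODULES; do
        ensure_listed "$file" "$m"
    done
}

# On a director, as root, one would then run something like:
#   register_ipvs_modules /etc/modules
#   for m in $IPVS_MODULES; do modprobe "$m"; done
```

Running register_ipvs_modules twice leaves the file unchanged the second time, 
so it would be safe to call from a hook that fires on every interface start.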

                D3) On director1 and director2, I did: apt-get install ipvsadm 
ldirectord heartbeat
                D4) Enabled packet forwarding in /etc/sysctl.conf: 
                        net.ipv4.ip_forward = 1
and then 
                        sysctl -p
                D5) The files: ha.cf, haresources, authkeys, ldirectord.cf and 
logd.cf on director1 and director2: 

/etc/ha.d/ha.cf: 

#This is for director1
#On director2, eth0 is replaced by eth1
#
debugfile /var/log/ha-debug
logfile /var/log/ha-log
use_logd yes
logfacility local0
keepalive 1
warntime 10
deadtime 30
initdead 120
udpport 694
ucast eth0 172.25.146.32
ucast eth0 172.25.146.33
auto_failback on
node director1
node director2
ping 172.25.146.1 #gateway
respawn hacluster /usr/lib/heartbeat/ipfail

/etc/ha.d/haresources: 

director1 \
  ldirectord::ldirectord.cf \
  LVSSyncDaemonSwap::master \
  IPaddr2::172.25.146.31/24/eth0/172.25.146.255
#172.25.146.255 is the broadcast address
#On director2, eth0 is replaced by eth1
                
/etc/ha.d/authkeys: (same for director1 and director2) 

auth 3
3 md5 mypassword

/etc/ha.d/ldirectord.cf: (same for director1 and director2)

checktimeout=10
checkinterval=2
autoreload=no
logfile="local0"
quiescent=yes
virtual=172.25.146.31:80
        real=172.25.146.37:80 gate
        real=172.25.146.38:80 gate
        fallback=127.0.0.1:80 gate
        service=http
        request="test.html"
        receive="test"
        scheduler=rr
        protocol=tcp
        checktype=negotiate

/etc/logd.cf

debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility daemon
entity logd
useapphbd no
sendqlen 256
recvqlen 256

                D6) Created the proper /var/www/test.html on web1 and web2

                D7) Typed: 
update-rc.d heartbeat start 75 2 3 4 5 . stop 05 0 1 6 .
update-rc.d -f ldirectord remove
/etc/init.d/ldirectord stop
/etc/init.d/heartbeat start

                D8) I checked: 
ip add sh eth0 on director1, OK
ip add sh eth1 on director2, OK
ldirectord ldirectord.cf status on director1 and director2: running and 
stopped respectively, OK
ipvsadm -L -n on director1 and director2: shows the routing table on director1 
and nothing on director2, OK
/etc/ha.d/resource.d/LVSSyncDaemonSwap master status on director1 and 
director2: running and stopped respectively, OK

                D9) On both web servers, I enabled arp_ignore and arp_announce 
in /etc/sysctl.conf: 
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.eth0.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 1
net.ipv4.conf.eth0.arp_announce = 1
(with eth0 replaced by eth1 on web2). 
And then: sysctl -p
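
Since web1 uses eth0 and web2 uses eth1, it may be worth confirming that the 
values actually took effect on each machine. The following is a minimal sketch 
that reads the live values from /proc; the function name show_arp_settings and 
the PROC_NET_CONF override are made up for illustration.

```shell
#!/bin/bash
# Sketch: print the live arp_ignore / arp_announce values for "all" and one interface.
# PROC_NET_CONF is a hypothetical override, useful for trying the function out;
# by default the real /proc/sys/net/ipv4/conf tree is read.
show_arp_settings() {
    local iface="$1" conf="${PROC_NET_CONF:-/proc/sys/net/ipv4/conf}" scope key
    for scope in all "$iface"; do
        for key in arp_ignore arp_announce; do
            if [ -r "$conf/$scope/$key" ]; then
                echo "net.ipv4.conf.$scope.$key = $(cat "$conf/$scope/$key")"
            else
                echo "net.ipv4.conf.$scope.$key = (not found)"
            fi
        done
    done
}

# Possible usage: show_arp_settings eth0   (on web1)
#                 show_arp_settings eth1   (on web2)
```

A "(not found)" line would suggest that the interface name passed in does not 
exist on that machine.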

                D10) On both web servers, I added the following to 
/etc/network/interfaces: 

auto lo:0
iface lo:0 inet static
        address 172.25.146.31
        netmask 255.255.255.255
        pre-up sysctl -p > /dev/null

And then: ifup lo:0

        E) Done. Final tests: 

                E1) I try to access http://172.25.146.31 on my browser. 
Success. I can check which server is serving with: 
ipvsadm -L -n --stats
Both servers are serving alternately, as expected (round robin -rr- 
algorithm).

                E2) I kill web1. http://172.25.146.31 keeps responding. Same if 
I start web1 again and kill web2. Success.

So I achieved Load Balancing. Let's see what happens with the High Availability.

                E3) I stop heartbeat on director1 with: 
/etc/init.d/heartbeat stop

And... http://172.25.146.31 doesn't answer anymore... Ouch!!!!!!

                E4) OK, OK, wait a second, let's go back: 
/etc/init.d/heartbeat start (on director1)

And http://172.25.146.31 still doesn't answer... Ooooouch!!!!!!

If I run: 
ipvsadm -L -n
no routes are listed anymore (on director1 or on director2).

Feeling miserable, I try on a desperate hunch: 
/etc/init.d/heartbeat start (on director1, again)

And, surprise, http.... is alive again!!

So, if I take director1 down, heartbeat doesn't switch to director2, and if I 
want to bring it back up, I must start heartbeat twice!! (so, "auto_failback 
on" doesn't work either)...

I tried then to put director1 down, and start heartbeat thousands of times on 
director2. Nothing happens anyway... 

So I have achieved Lousy Availability instead!!! :_(

I have attached the ha-debug log files to this e-mail; I guess they will be 
meaningful to more experienced people... Especially the ha-debug of 
director2, which only repeats the same message over and over again: 

ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device

So I sense that something is trying to access director2 through eth0, which 
doesn't exist, as its interface is eth1. But I have gone through every 
configuration file many times and I can't find where the error is.
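
That suspicion can be checked mechanically. The following is a minimal sketch 
that lists the ethN names mentioned in a set of files and reports those that do 
not exist on the machine where it runs. The function name missing_interfaces 
and the SYSNET override are made up for illustration. It only finds literal 
ethN names in the files passed to it, so a default hidden somewhere it was not 
pointed at would go unnoticed.

```shell
#!/bin/bash
# Sketch: report interface names of the form ethN that appear in the given files
# but are not present on this host.
# SYSNET is a hypothetical override for trying the function out; by default the
# kernel's list of interfaces under /sys/class/net is used.
missing_interfaces() {
    local sysnet="${SYSNET:-/sys/class/net}" name
    grep -ohwE 'eth[0-9]+' "$@" 2>/dev/null | sort -u | while read -r name; do
        [ -e "$sysnet/$name" ] || echo "$name"
    done
}

# Possible usage on director2 (paths are the ones used in this message):
#   missing_interfaces /etc/ha.d/ha.cf /etc/ha.d/haresources /etc/ha.d/resource.d/*
```

Any name it prints is referenced in one of those files but has no matching 
device on the host.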

So... please please please, may I get any hint?

Thanks in advance!!!!

Best regards, 

         Alejandro

        