Hello again,

I am enclosing these lines from /var/log/messages on director1 and director2, in case they give a clue:

*** From /var/log/messages on director1 ***

After typing "/etc/init.d/heartbeat start" on director1:

Mar 4 10:15:32 director1 ldirectord[22403]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf status
Mar 4 10:15:32 director1 ldirectord[22403]: Exiting with exit_status 3: Exiting from ldirectord status
Mar 4 10:15:33 director1 ldirectord[22440]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf status
Mar 4 10:15:33 director1 ldirectord[22440]: Exiting with exit_status 3: Exiting from ldirectord status
Mar 4 10:15:34 director1 ldirectord[22456]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf start
Mar 4 10:15:34 director1 ldirectord[22456]: Starting Linux Director v1.186-ha-2.1.3 as daemon
Mar 4 10:15:34 director1 ldirectord[22458]: Added virtual server: 172.25.146.31:80
Mar 4 10:15:34 director1 kernel: [144096.978975] IPVS: stopping backup sync thread 22208 ...
Mar 4 10:15:34 director1 kernel: [144097.227697] IPVS: sync thread started: state = MASTER, mcast_ifn = eth0, syncid = 0
Mar 4 10:15:35 director1 ldirectord[22458]: Added fallback server: 127.0.0.1:80 (172.25.146.31:80) (Weight set to 1)
Mar 4 10:15:35 director1 ldirectord[22458]: Quiescent real server: 172.25.146.38:80 (172.25.146.31:80) (Weight set to 0)
Mar 4 10:15:35 director1 ldirectord[22458]: Quiescent real server: 172.25.146.37:80 (172.25.146.31:80) (Weight set to 0)
Mar 4 10:15:36 director1 ldirectord[22458]: Restored real server: 172.25.146.37:80 (172.25.146.31:80) (Weight set to 1)
Mar 4 10:15:36 director1 ldirectord[22458]: Deleted fallback server: 127.0.0.1:80 (172.25.146.31:80)
Mar 4 10:15:36 director1 ldirectord[22458]: Restored real server: 172.25.146.38:80 (172.25.146.31:80) (Weight set to 1)

After typing "/etc/init.d/heartbeat start" on director2:

Mar 4 10:20:16 director1 ldirectord[22859]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf status
Mar 4 10:20:17 director1 ldirectord[22859]: ldirectord for /etc/ha.d/ldirectord.cf is running with pid: 22458
Mar 4 10:20:17 director1 ldirectord[22859]: Exiting from ldirectord status
Mar 4 10:20:17 director1 ldirectord[22875]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf start

After typing "/etc/init.d/heartbeat stop" on director1:

Mar 4 10:26:12 director1 kernel: [144734.478909] IPVS: stopping master sync thread 22530 ...
Mar 4 10:26:12 director1 kernel: [144734.693492] IPVS: sync thread started: state = BACKUP, mcast_ifn = eth0, syncid = 0
Mar 4 10:26:12 director1 ldirectord[23144]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf stop
Mar 4 10:26:13 director1 ldirectord[22458]: Purged real server (stop): 172.25.146.37:80 (172.25.146.31:80)
Mar 4 10:26:13 director1 ldirectord[22458]: Purged real server (stop): 172.25.146.38:80 (172.25.146.31:80)
Mar 4 10:26:13 director1 ldirectord[22458]: Purged virtual server (stop): 172.25.146.31:80
Mar 4 10:26:13 director1 ldirectord[22458]: Linux Director Daemon terminated on signal: TERM

*** From /var/log/messages on director2 ***

After typing "/etc/init.d/heartbeat start" on director2:

Mar 4 10:18:43 director2 ldirectord[23274]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf status
Mar 4 10:18:43 director2 ldirectord[23274]: Exiting with exit_status 3: Exiting from ldirectord status
Mar 4 10:19:09 director2 ldirectord[23905]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf stop

After typing "/etc/init.d/heartbeat stop" on director1:

Mar 4 10:25:07 director2 ldirectord[23996]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf status
Mar 4 10:25:07 director2 ldirectord[23996]: Exiting with exit_status 3: Exiting from ldirectord status
Mar 4 10:25:08 director2 ldirectord[24012]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf start
Mar 4 10:25:08 director2 ldirectord[24012]: Starting Linux Director v1.186-ha-2.1.3 as daemon
Mar 4 10:25:08 director2 ldirectord[24014]: Added virtual server: 172.25.146.31:80
Mar 4 10:25:08 director2 ldirectord[24014]: Added fallback server: 127.0.0.1:80 (172.25.146.31:80) (Weight set to 1)
Mar 4 10:25:09 director2 ldirectord[24014]: Quiescent real server: 172.25.146.38:80 (172.25.146.31:80) (Weight set to 0)
Mar 4 10:25:09 director2 ldirectord[24014]: Quiescent real server: 172.25.146.37:80 (172.25.146.31:80) (Weight set to 0)
Mar 4 10:25:09 director2 ldirectord[24014]: Restored real server: 172.25.146.37:80 (172.25.146.31:80) (Weight set to 1)
Mar 4 10:25:09 director2 ldirectord[24014]: Deleted fallback server: 127.0.0.1:80 (172.25.146.31:80)
Mar 4 10:25:10 director2 ldirectord[24014]: Restored real server: 172.25.146.38:80 (172.25.146.31:80) (Weight set to 1)
Mar 4 10:25:19 director2 ldirectord[24646]: Invoking ldirectord invoked as: /etc/ha.d/resource.d/ldirectord ldirectord.cf stop
Mar 4 10:25:20 director2 ldirectord[24014]: Purged real server (stop): 172.25.146.37:80 (172.25.146.31:80)
Mar 4 10:25:20 director2 ldirectord[24014]: Purged real server (stop): 172.25.146.38:80 (172.25.146.31:80)
Mar 4 10:25:20 director2 ldirectord[24014]: Purged virtual server (stop): 172.25.146.31:80
Mar 4 10:25:21 director2 ldirectord[24014]: Linux Director Daemon terminated on signal: TERM

In this last part you can see how director2 starts working properly when director1 stops and then, about 10 seconds later, stops by itself. Why might this be happening? :_(

Thanks in advance...
Best regards,
Alejandro

==
Alejandro Sanchez Merono - alejandro.sanc...@ite.es
TIC Department
Institute of Electrical Technology
Parque Tecnologico de Valencia
PATERNA (Valencia) Spain
Tel.: (+34) 96 136 66 70
Fax: (+34) 96 136 66 80
Web: http://www.ite.es
E-mail: i...@ite.es

-----Original message-----
From: linux-ha-boun...@lists.linux-ha.org [mailto:linux-ha-boun...@lists.linux-ha.org] On behalf of Alejandro Sánchez Meroño
Sent: Tuesday, 3 March 2009 16:25
To: linux-ha@lists.linux-ha.org
Subject: [Linux-HA] HA fails when stopping master director

Hello everybody,

Alejandro here, from Valencia, Spain. I'm glad to join this mailing list, and though at present I'm a complete rookie on HA (and a "sophomore" in Linux), I'd like to think that some day I might help others with this subject. Unfortunately, right now it's me who needs a helping hand from you...

OK, I'll try to put all the data in order:

A) Abstract of the issue:

I have configured load balancing and high availability with two web servers and two directors, using ldirectord and heartbeat. Load balancing works fine, but when testing the HA, if I stop heartbeat on the main director, the system switches to the backup director but... only for a few seconds!! Then everything is dead. The ha-debug log on the main director seems happy, while the ha-debug log on the backup director just repeats the same error hundreds of times (quoted near the end of this message).

B) What I am actually trying to do:

My main objective is rather simple: to obtain load balancing and high availability from two mirrored Apache web servers. At present we have just one single web server with a rather heavy workload, running important web applications, so we need to secure it. Some day we will have four physical servers, two of them running as load directors (master and backup) and two of them as replicated web servers. But first I must learn how to do it, of course. So I set up a pilot system.

C) My pilot system:

I'm working on an Apple Xserve, where I have created four virtual machines. On each of them I have installed Ubuntu 8.10. I assigned a static IP to each VM, and reserved a virtual IP for access to the web servers. So, I have:

    director1:  172.25.146.32
    director2:  172.25.146.33
    web1:       172.25.146.37
    web2:       172.25.146.38
    Virtual IP: 172.25.146.31

director1 and web1 access the network via eth0, while director2 and web2 do it via eth1 (I don't know why; it was simply configured like that when I created the virtual machines and installed Ubuntu).

Each machine has the same /etc/hosts:

    127.0.0.1     localhost
    172.25.146.32 director1
    172.25.146.33 director2
    172.25.146.37 web1
    172.25.146.38 web2

D) What I have installed and configured:

D1) Apache and PHP5 on web1 and web2. I can access http://172.25.146.37 and http://172.25.146.38 from the browser with no problems.

D2) I wrote the following script on director1 and director2:

/etc/network/if-up.d/loadmodules
###################
#!/bin/bash
echo ip_vs_dh >> /etc/modules
echo ip_vs_ftp >> /etc/modules
echo ip_vs >> /etc/modules
echo ip_vs_lblc >> /etc/modules
echo ip_vs_lblcr >> /etc/modules
echo ip_vs_lc >> /etc/modules
echo ip_vs_nq >> /etc/modules
echo ip_vs_rr >> /etc/modules
echo ip_vs_sed >> /etc/modules
echo ip_vs_sh >> /etc/modules
echo ip_vs_wlc >> /etc/modules
echo ip_vs_wrr >> /etc/modules
modprobe ip_vs_dh
modprobe ip_vs_ftp
modprobe ip_vs
modprobe ip_vs_lblc
modprobe ip_vs_lblcr
modprobe ip_vs_lc
modprobe ip_vs_nq
modprobe ip_vs_rr
modprobe ip_vs_sed
modprobe ip_vs_sh
modprobe ip_vs_wlc
modprobe ip_vs_wrr
######################

But I noticed that when restarting the machines, the modules weren't reloaded. So I edited /etc/modules and added the lines manually (ip_vs_dh and so on)... I don't know if that was the right thing to do...
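
One side effect of the script in D2 is that every run appends the same twelve names to /etc/modules again. The following is only a minimal sketch of an idempotent alternative, not something tested on these machines; the names MODULES_FILE, IPVS_MODULES, ensure_module_line and register_ipvs_modules are made up for illustration, and the module list is simply copied from the script above.

```shell
#!/bin/bash
# Sketch only: these variable and function names are illustrative,
# not part of heartbeat, ldirectord or ipvsadm.
MODULES_FILE="${MODULES_FILE:-/etc/modules}"

# Module names copied from the loadmodules script in D2.
IPVS_MODULES="ip_vs ip_vs_dh ip_vs_ftp ip_vs_lblc ip_vs_lblcr ip_vs_lc ip_vs_nq ip_vs_rr ip_vs_sed ip_vs_sh ip_vs_wlc ip_vs_wrr"

# Append a module name to the given file only if it is not already listed.
ensure_module_line() {
    local module="$1"
    local file="$2"
    touch "$file"
    # -q quiet, -x match the whole line, -F treat the name as a fixed string
    grep -qxF "$module" "$file" || echo "$module" >> "$file"
}

# Register every IPVS module exactly once in the given file
# (defaults to MODULES_FILE when no argument is passed).
register_ipvs_modules() {
    local file="${1:-$MODULES_FILE}"
    local module
    for module in $IPVS_MODULES; do
        ensure_module_line "$module" "$file"
    done
}
```

Run as root, calling register_ipvs_modules once and then looping over $IPVS_MODULES with modprobe would presumably have the same effect as the original script, without /etc/modules growing on every interface start.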

D3) On director1 and director2, I did:

    apt-get install ipvsadm ldirectord heartbeat

D4) Enabled packet forwarding in /etc/sysctl.conf:

    net.ipv4.ip_forward = 1

and then:

    sysctl -p

D5) The files ha.cf, haresources, authkeys, ldirectord.cf and logd.cf on director1 and director2:

/etc/ha.d/ha.cf:

    # This is for director1
    # Changed eth0 to eth1 on director2
    #
    debugfile /var/log/ha-debug
    logfile /var/log/ha-log
    use_logd yes
    logfacility local0
    keepalive 1
    warntime 10
    deadtime 30
    initdead 120
    updport 694
    ucast eth0 172.25.146.32
    ucast eth0 172.25.146.33
    auto_failback on
    node director1
    node director2
    ping 172.25.146.1 #gateway
    respawn hacluster /usr/lib/heartbeat/ipfail

/etc/ha.d/haresources:

    director1 \
        ldirectord::ldirectord.cf \
        LVSSyncDaemonSwap::master \
        IPaddr2::172.25.146.31/24/eth0/172.25.146.255
    # 172.25.146.255 is the broadcast address
    # changed eth0 to eth1 on director2

/etc/ha.d/authkeys: (same for director1 and director2)

    auth 3
    3 md5 mypassword

/etc/ha.d/ldirectord.cf: (same for director1 and director2)

    checktimeout=10
    checkinterval=2
    autoreload=no
    logfile="local0"
    quiescent=yes
    virtual=172.25.146.31:80
        real=172.25.146.37:80 gate
        real=172.25.146.38:80 gate
        fallback=127.0.0.1:80 gate
        service=http
        request="test.html"
        receive="test"
        scheduler=rr
        protocol=tcp
        checktype=negotiate

/etc/logd.cf:

    debugfile /var/log/ha-debug
    logfile /var/log/ha-log
    logfacility daemon
    entity logd
    useapphbd no
    sendqlen 256
    recvqlen 256

D6) Created the proper /var/www/test.html on web1 and web2.

D7) Typed:

    update-rc.d heartbeat start 75 2 3 4 5 . stop 05 0 1 6 .
    update-rc.d -f ldirectord remove
    /etc/init.d/ldirectord stop
    /etc/init.d/heartbeat start

D8) I checked:

    ip add sh eth0
        on director1: OK
    ip add sh eth1
        on director2: OK
    ldirectord ldirectord.cf status
        on director1 and director2: running and stopped respectively, OK
    ipvsadm -L -n
        on director1 and director2: shows the routing table on director1 and nothing on director2, OK
    /etc/ha.d/resource.d/LVSSyncDaemonSwap master status
        on director1 and director2: running and stopped respectively, OK

D9) On both web servers, I enabled arp_ignore and arp_announce in /etc/sysctl.conf:

    net.ipv4.conf.all.arp_ignore = 1
    net.ipv4.conf.eth0.arp_ignore = 1
    net.ipv4.conf.all.arp_announce = 1
    net.ipv4.conf.eth0.arp_announce = 1

(changed eth0 to eth1 on web2). And then:

    sysctl -p

D10) On both web servers, I added the following to /etc/network/interfaces:

    auto lo:0
    iface lo:0 inet static
        address 172.25.146.31
        netmask 255.255.255.255
        pre-up sysctl -p > /dev/null

And then:

    ifup lo:0

E) Done. Final tests:

E1) I try to access http://172.25.146.31 in my browser. Success. I can check which server is serving with:

    ipvsadm -L -n --stats

Both servers are serving alternately, as expected (round robin, rr, algorithm).

E2) I kill web1. http://172.25.146.31 keeps working. Same if I start web1 again and kill web2. Success. So I have achieved Load Balancing. Let's see what happens with the High Availability.

E3) I stop heartbeat on director1 with:

    /etc/init.d/heartbeat stop

And... http://172.25.146.31 doesn't answer anymore... Ouch!!!!!!

E4) OK, OK, wait a second, let's go back:

    /etc/init.d/heartbeat start    (on director1)

And http://172.25.146.31 still doesn't answer... Ooooouch!!!!!! If I do:

    ipvsadm -L -n

no route appears anymore (on director1 or director2). Feeling miserable, I try, on a hopeless hunch:

    /etc/init.d/heartbeat start    (on director1, again)

And, surprise, http.... is alive again!! So, if I put director1 down, heartbeat doesn't switch to director2, and if I want to bring it up again, I must start heartbeat twice!! (so, "auto_failback on" doesn't work either)... I then tried putting director1 down and starting heartbeat thousands of times on director2. Nothing happens anyway... So I have achieved Lousy Availability instead!!! :_(

I have attached the ha-debug log files to this e-mail; I guess they will be meaningful to more experienced people... Especially the ha-debug of director2, which only repeats the same message over and over again:

    ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed. No such device

So I sense that something is trying to use eth0 on director2, which doesn't exist there, as its interface is eth1. But I have gone over every configuration file many times and I can't find where the error is. So... please please please, may I get any hint?

Thanks in advance!!!!

Best regards,
Alejandro
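
The suspicion about eth0 on director2 could be checked directly on that machine. The lines below are only a rough sketch under a few assumptions: the function names are invented for illustration, the log line is the one quoted above, and the /sys/class/net lookup is Linux-specific. It pulls the interface name out of the error message and reports whether the local kernel knows an interface by that name.

```shell
#!/bin/bash
# Sketch only: these function names are illustrative, not heartbeat tools.

# Print the interface named by a "--mcast-interface=NAME" option in a log line.
mcast_interface_from_log() {
    printf '%s\n' "$1" | sed -n 's/.*--mcast-interface=\([^ ]*\).*/\1/p'
}

# Succeed if the named network interface exists on this machine (Linux sysfs).
interface_exists() {
    [ -e "/sys/class/net/$1" ]
}

# The error line repeated in the ha-debug log of director2.
log_line='ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed. No such device'
iface="$(mcast_interface_from_log "$log_line")"

if interface_exists "$iface"; then
    echo "$iface exists on $(uname -n)"
else
    echo "$iface is missing on $(uname -n)"
fi
```

If this reports the interface as missing on director2 but present on director1, that would be consistent with some resource still using eth0 on director2 even though ha.cf and haresources were changed to eth1 there.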