Hi,

On Mon, Aug 17, 2009 at 07:38:39AM +0530, Abhin GS wrote:
> Hello,
>
> Node1 was choked by a huge "messages" file. We fixed that problem on
> node1, then ran an update for SLES11; every required patch was
> installed properly ("service openais stop" was done before patching).
> We purposely switched off node2 during this exercise to avoid any
> complications.
>
> After the update, we brought the system back online (node2 was still
> kept off) and saw that the machine refused to function. Analysis found
> that the update had changed the contents of /etc/hosts: the node1
> entry had been removed from node1's hosts file for some reason. Even
> after that fix, Pacemaker showed all services as down. A couple of
> reboots - no help. I have attached a forensics report (cib and
> messages) of node1 in node1.tar
Everybody'd be better off using hb_report :)

> Whilst, after our enthusiasm levels went down, we switched off node1
> and brought node2 online. We were happy to see things work well on it
> (we had adjusted the timings - no cleanup was required - though we
> only tested this for one reboot), except for the message that node1 is
> offline. We copied node2's cib using cibadmin -Q, switched it off and
> switched node1 on for cib injection.
>
> On node1 we cleared the Pacemaker config using cibadmin -E --force,
> then injected the cib (after increasing the epoch values) using
> cibadmin -U -x cib.xml. "service openais restart" revealed the
> wonderful fact that node1 was still behaving the same way: no green
> signal except node1 as DC.
>
> Heartbroken, we collected forensic evidence, switched off node1, and
> brought node2 online for further study of its remaining files. Voila -
> node2 came up showing all red, no services running. The only green I
> could see was node2 as DC. Anyway, forensics were done; the files are
> attached herewith for your kind perusal.
>
> Severely broken, we had no energy left after this 5-week effort to
> bring up an HA cluster that will run postgres and apache on a virtual
> IP. We decided to switch off the MSA array - after switching off node2
> (we had lost hope in node1 earlier).

Your fencing (stonith) doesn't work:

Aug 16 16:14:45 node1 stonithd: [3870]: ERROR: Failed to STONITH the node node2: optype=POWEROFF, op_result=TIMEOUT

You'll find a bunch of similar messages. The cluster won't make any
progress if it has to fence a node but can't. The timeout for the
stonith resources (st1/st2) is set to a very low 5 seconds. Make that
at least 1 minute.

Thanks,

Dejan

> 10 minutes later the MSA was brought online, then node2, then node1.
> Node2 became DC and all was green. I really did not understand what
> went wrong, when, or where.
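In crm shell terms, the timeout increase suggested above might be applied like this (a sketch only: `stonith-timeout` and `op_defaults` are standard cluster settings on SLES11's Pacemaker, st1/st2 are the resource names from the thread, and the 60s values are examples rather than anything taken from the attached CIB):

```shell
# Sketch: give fencing more time than the current 5 seconds.
# Cluster-wide fencing timeout:
crm configure property stonith-timeout="60s"
# Default timeout for resource operations, which also covers the
# stonith resources st1/st2 unless they override it:
crm configure op_defaults timeout="60s"
# To inspect or hand-edit the st1/st2 definitions themselves:
crm configure show st1 st2
```

Per-operation `op ... timeout="60s"` clauses on st1/st2 would take precedence over the defaults, so check the existing definitions before relying on the cluster-wide values.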
> I tried to look in the log, but was not able to understand anything
> (lack of confidence after multiple failures).
>
> One observation, which could be right or wrong: node1 will fail to
> function properly if node2 is not available, and vice versa. Node1 now
> has the latest patches, but node2 is still untouched; we didn't have
> the heart to run the update on node2 after experiencing the node1
> affair.
>
> Please throw some light onto our mystery HA project.
>
> Thank you in advance.
>
> Take care,
>
> Abhin
>
>
> On Thu, 2009-08-13 at 14:16 +0200, Andrew Beekhof wrote:
> > First thing I'd do is fix this:
> >
> > Aug 8 13:47:13 node1 cib: [3894]: ERROR: write_xml_file: Cannot write
> > output to /var/lib/heartbeat/crm/cib.XLiyUG: No space left on device
> > (28)
> >
> > then I'd increase the timeouts:
> >
> > Aug 8 13:39:42 node2 crmd: [3803]: ERROR: process_lrm_event: LRM
> > operation fs:1_stop_0 (18) Timed Out (timeout=20000ms)
> > Aug 8 13:45:16 node2 crmd: [3692]: ERROR: process_lrm_event: LRM
> > operation postgres_start_0 (15) Timed Out (timeout=20000ms)
> > Aug 8 13:48:57 node2 crmd: [3692]: ERROR: process_lrm_event: LRM
> > operation fs:0_stop_0 (23) Timed Out (timeout=20000ms)
> > Aug 8 13:53:06 node2 crmd: [3765]: ERROR: process_lrm_event: LRM
> > operation postgres_start_0 (14) Timed Out (timeout=20000ms)
> >
> > Try setting default-action-timeout to something higher than 20s
> >
> > On Wed, Aug 12, 2009 at 11:54 AM, Abhin.G.S - DEUCN <de...@inmail.sk> wrote:
> > >
> > > Hello Andrew,
> > >
> > > On behalf of Ajith, I'm sending you the details.
> > >
> > > /var/log/messages of node2 (truncated) = http://deucn.com/messages_new
> > >
> > > Attachments:
> > >
> > > 1> CIB.xml
> > >
> > > 2> extract of /var/log/messages of node1
> > >
> > > 3> complete /var/log/messages of node2 in zip format
> > >
> > > Please help us.
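Andrew's two suggestions above could be sketched like this (`default-action-timeout` is the cluster property name in this Pacemaker generation; the 120s value is only an example, chosen to be comfortably above the 20000ms the LRM operations are timing out at):

```shell
# Sketch: first make sure the partition holding the CIB has free space,
# since the cib process could not write /var/lib/heartbeat/crm/cib.*:
df -h /var/lib/heartbeat/crm
# Then raise the default operation timeout well above 20s:
crm configure property default-action-timeout="120s"
# Individual "op ... timeout=..." settings on a resource still override
# this default, so slow resources like postgres can get even more time.
```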
> > >
> > > Thank you,
> > >
> > > Warm Regards,
> > >
> > > Abhin.G.S
> > >
> > > ---- Original message ----
> > > From: Andrew Beekhof <and...@beekhof.net>
> > > To: pacemaker@oss.clusterlabs.org
> > > Date: 8/12/2009 12:49:00 PM
> > > Subject: Re: [Pacemaker] Please Help - frequent cleanup is required
> > > for the resources on failover condition
> > >
> > > On Sun, Aug 9, 2009 at 4:41 PM, Ajith Kumar <ajith.kgs...@gmail.com> wrote:
> > >> Hello Everyone,
> > >>
> > >> I was working on a project to create a test cluster using Pacemaker
> > >> on suse11. With the kind help of lmb and beekhof @ #linux-cluster I
> > >> was finally able to put up a two-node cluster using HP ML350g5
> > >> servers, each with two HBAs connected to a MSA2012fcdc.
> > >>
> > >> The cluster resources apache2 and postgresql both require a cleanup
> > >> every time I boot the cluster (this is a test cluster, which is
> > >> switched off at the end of the day - or when I see the level of
> > >> madness in me cross the barrier), on a simulated failover (by
> > >> making the other node stand by), or when I pull the NIC cable of
> > >> one node. The IP address and stonith were working fine as planned,
> > >> but the big boys - apache2 and postgresql - are having trouble and
> > >> I always have to clean up.
> > >>
> > >> I would like to give the log file as an attachment
> > >> (/var/log/messages), but it is 3.2 GB in size
> > >
> > > limit the contents to just one instance of the problem and use bzip
> > >
> > >> and has a lot of repeated entries, which I did not find relevant.
> > >
> > > actually it's the only thing that is relevant
> > >
> > > _______________________________________________
> > > Pacemaker mailing list
> > > Pacemaker@oss.clusterlabs.org
> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > >
> > > ------------------------------------
> > > Abhin.G.S
> > > =========
> > > +91-9895-525880 | +91-471-2437189
> > > D E U C N ® | http://www.deucn.com
> > > ------------------------------------

_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker