Hello William, sorry to bother you, but I would like to know if you can show me your /etc/cluster/cluster.conf.
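In particular I'd like to see the <clusternodes> and <fencedevices> sections. For a two-node cman+pacemaker cluster that only exists to provide membership and fencing for clvmd, I would expect something roughly like the sketch below. It is only a sketch: the cluster name is made up, the node names are guessed from your posts, and the fence_pcmk redirection is an assumption on my part (it is the usual way to hand cman's fencing requests over to pacemaker), so adjust it to whatever you really use:

    <?xml version="1.0"?>
    <cluster name="nevis" config_version="1">
      <!-- two_node="1" lets cman keep quorum with a single surviving node -->
      <cman two_node="1" expected_votes="1"/>
      <clusternodes>
        <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
          <fence>
            <method name="pcmk-redirect">
              <!-- redirect fencing of this node to pacemaker's stonith -->
              <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
            </method>
          </fence>
        </clusternode>
        <clusternode name="orestes-corosync.nevis.columbia.edu" nodeid="2">
          <fence>
            <method name="pcmk-redirect">
              <device name="pcmk" port="orestes-corosync.nevis.columbia.edu"/>
            </method>
          </fence>
        </clusternode>
      </clusternodes>
      <fencedevices>
        <!-- fence_pcmk asks pacemaker/stonith to do the actual power cut -->
        <fencedevice name="pcmk" agent="fence_pcmk"/>
      </fencedevices>
    </cluster>

If the <fence> sections for your nodes are empty, that by itself could explain why dlm/clvmd recovery blocks after a node crash.
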
On 23 March 2012 at 21:50, William Seligman <selig...@nevis.columbia.edu> wrote:

> On 3/22/12 2:43 PM, William Seligman wrote:
> > On 3/20/12 4:55 PM, Lars Ellenberg wrote:
> >> On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote:
> >>> On 3/16/12 12:12 PM, William Seligman wrote:
> >>>> On 3/16/12 7:02 AM, Andreas Kurz wrote:
> >>>>> On 03/15/2012 11:50 PM, William Seligman wrote:
> >>>>>> On 3/15/12 6:07 PM, William Seligman wrote:
> >>>>>>> On 3/15/12 6:05 PM, William Seligman wrote:
> >>>>>>>> On 3/15/12 4:57 PM, emmanuel segura wrote:
> >>>>>>>>
> >>>>>>>>> We can try to understand what happens when clvmd hangs.
> >>>>>>>>>
> >>>>>>>>> Edit /etc/lvm/lvm.conf, change level = 7 in the log section, and
> >>>>>>>>> uncomment this line:
> >>>>>>>>>
> >>>>>>>>> file = "/var/log/lvm2.log"
> >>>>>>>>
> >>>>>>>> Here's the tail end of the file (the original is 1.6M). Because there are
> >>>>>>>> no timestamps in the log, it's hard for me to point you to the point where
> >>>>>>>> I crashed the other system. I think (though I'm not sure) that the crash
> >>>>>>>> happened after the last occurrence of
> >>>>>>>>
> >>>>>>>> cache/lvmcache.c:1484  Wiping internal VG cache
> >>>>>>>>
> >>>>>>>> Honestly, it looks like a wall of text to me. Does it suggest anything to you?
> >>>>>>>
> >>>>>>> Maybe it would help if I included the link to the pastebin where I put the
> >>>>>>> output: <http://pastebin.com/8pgW3Muw>
> >>>>>>
> >>>>>> Could the problem be with lvm+drbd?
> >>>>>>
> >>>>>> In lvm2.log, I see this sequence of lines pre-crash:
> >>>>>>
> >>>>>> device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
> >>>>>> device/dev-io.c:271  /dev/md0: size is 1027968 sectors
> >>>>>> device/dev-io.c:137  /dev/md0: block size is 1024 bytes
> >>>>>> device/dev-io.c:588  Closed /dev/md0
> >>>>>> device/dev-io.c:271  /dev/md0: size is 1027968 sectors
> >>>>>> device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
> >>>>>> device/dev-io.c:137  /dev/md0: block size is 1024 bytes
> >>>>>> device/dev-io.c:588  Closed /dev/md0
> >>>>>> filters/filter-composite.c:31  Using /dev/md0
> >>>>>> device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
> >>>>>> device/dev-io.c:137  /dev/md0: block size is 1024 bytes
> >>>>>> label/label.c:186  /dev/md0: No label detected
> >>>>>> device/dev-io.c:588  Closed /dev/md0
> >>>>>> device/dev-io.c:535  Opened /dev/drbd0 RO O_DIRECT
> >>>>>> device/dev-io.c:271  /dev/drbd0: size is 5611549368 sectors
> >>>>>> device/dev-io.c:137  /dev/drbd0: block size is 4096 bytes
> >>>>>> device/dev-io.c:588  Closed /dev/drbd0
> >>>>>> device/dev-io.c:271  /dev/drbd0: size is 5611549368 sectors
> >>>>>> device/dev-io.c:535  Opened /dev/drbd0 RO O_DIRECT
> >>>>>> device/dev-io.c:137  /dev/drbd0: block size is 4096 bytes
> >>>>>> device/dev-io.c:588  Closed /dev/drbd0
> >>>>>>
> >>>>>> I interpret this: look at /dev/md0, get some info, close; look at /dev/drbd0,
> >>>>>> get some info, close.
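A side note on that scan sequence: on clustered VGs that sit on DRBD it is common to restrict LVM's filter so that only the DRBD device (plus whatever holds your local, non-clustered PVs) is ever scanned, and the DRBD backing disk is not. Purely as an illustration, and assuming /dev/drbd0 is your only clustered PV (you would have to add accept patterns for any local PVs yourself), the devices section of /etc/lvm/lvm.conf could look like:

    devices {
        # accept DRBD devices, reject everything else -- extend the "a|...|"
        # patterns for any local PVs before using this on a real system
        filter = [ "a|^/dev/drbd.*|", "r|.*|" ]
        write_cache_state = 0
    }

That does not explain the hang by itself, but it keeps the scan from touching the backing device underneath drbd0.
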
> >>>>>>
> >>>>>> Post-crash, I see:
> >>>>>>
> >>>>>> device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
> >>>>>> device/dev-io.c:271  /dev/md0: size is 1027968 sectors
> >>>>>> device/dev-io.c:137  /dev/md0: block size is 1024 bytes
> >>>>>> device/dev-io.c:588  Closed /dev/md0
> >>>>>> device/dev-io.c:271  /dev/md0: size is 1027968 sectors
> >>>>>> device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
> >>>>>> device/dev-io.c:137  /dev/md0: block size is 1024 bytes
> >>>>>> device/dev-io.c:588  Closed /dev/md0
> >>>>>> filters/filter-composite.c:31  Using /dev/md0
> >>>>>> device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
> >>>>>> device/dev-io.c:137  /dev/md0: block size is 1024 bytes
> >>>>>> label/label.c:186  /dev/md0: No label detected
> >>>>>> device/dev-io.c:588  Closed /dev/md0
> >>>>>> device/dev-io.c:535  Opened /dev/drbd0 RO O_DIRECT
> >>>>>> device/dev-io.c:271  /dev/drbd0: size is 5611549368 sectors
> >>>>>> device/dev-io.c:137  /dev/drbd0: block size is 4096 bytes
> >>>>>>
> >>>>>> ... and then it hangs. Comparing the two, it looks like it can't close /dev/drbd0.
> >>>>>>
> >>>>>> If I look at /proc/drbd when I crash one node, I see this:
> >>>>>>
> >>>>>> # cat /proc/drbd
> >>>>>> version: 8.3.12 (api:88/proto:86-96)
> >>>>>> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
> >>>>>> r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
> >>>>>>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-----
> >>>>>>     ns:7000064 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
> >>>>>
> >>>>> s----- ... DRBD suspended I/O, most likely because of its fencing policy. For
> >>>>> valid dual-primary setups you have to use the "resource-and-stonith" policy and
> >>>>> a working "fence-peer" handler. In this mode I/O is suspended until fencing of
> >>>>> the peer is successful. The question is why the peer does _not_ also suspend
> >>>>> its I/O, because obviously fencing was not successful ...
> >>>>>
> >>>>> So with a correct DRBD configuration one of your nodes should already have
> >>>>> been fenced because of the connection loss between the nodes (on the drbd
> >>>>> replication link).
> >>>>>
> >>>>> You can use e.g. that nice fencing script:
> >>>>>
> >>>>> http://goo.gl/O4N8f
> >>>>
> >>>> This is the output of "drbdadm dump admin": <http://pastebin.com/kTxvHCtx>
> >>>>
> >>>> So I've got resource-and-stonith. I gather from an earlier thread that
> >>>> obliterate-peer.sh is more-or-less equivalent in functionality to
> >>>> stonith_admin_fence_peer.sh:
> >>>>
> >>>> <http://www.gossamer-threads.com/lists/linuxha/users/78504#78504>
> >>>>
> >>>> At the moment I'm pursuing the possibility that I'm returning the wrong
> >>>> return codes from my fencing agent:
> >>>>
> >>>> <http://www.gossamer-threads.com/lists/linuxha/users/78572>
> >>>
> >>> I cleaned up my fencing agent, making sure its return codes matched those
> >>> returned by other agents in /usr/sbin/fence_, and allowing for some delay
> >>> issues in reading the UPS status. But...
> >>>
> >>>> After that, I'll look at another suggestion with lvm.conf:
> >>>>
> >>>> <http://www.gossamer-threads.com/lists/linuxha/users/78796#78796>
> >>>>
> >>>> Then I'll try DRBD 8.4.1. Hopefully one of these is the source of the issue.
> >>>
> >>> Failure on all three counts.
> >>
> >> May I suggest you double check the permissions on your fence-peer script?
> >> I suspect you may simply have forgotten the "chmod +x".
> >>
> >> Test with "drbdadm fence-peer minor-0" from the command line.
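For reference, the pieces Andreas and Lars are talking about live in the DRBD resource definition. With DRBD 8.3 it would look roughly like this -- the resource name "admin" is taken from the "drbdadm dump admin" above, and the script path is only a placeholder for wherever the fence-peer handler is actually installed:

    resource admin {
      disk {
        # suspend I/O on replication-link loss until the peer has been fenced
        fencing resource-and-stonith;
      }
      handlers {
        # handler invoked on connection loss (and by "drbdadm fence-peer")
        fence-peer "/usr/local/sbin/stonith_admin-fence-peer.sh";
      }
      # (device/disk/address sections as you already have them)
    }

The handler also has to exit with the return codes DRBD expects (7 for "peer has been fenced", if I remember the table in the DRBD documentation correctly); otherwise DRBD keeps I/O suspended even though the peer was actually powered off.
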
> >
> > I still haven't solved the problem, but this advice has gotten me further than
> > before.
> >
> > First, Lars was correct: I did not have execute permissions set on my fence-peer
> > scripts. (D'oh!) I turned them on, but that did not change anything: cman+clvmd
> > still hung on the vgdisplay command if I crashed the peer node.
> >
> > I started up both nodes again (cman+pacemaker+drbd+clvmd) and tried Lars'
> > suggested command. I didn't save the response for this message (d'oh again!) but
> > it said that the fence-peer script had failed.
> >
> > Hmm. The peer was definitely shutting down, so my fencing script is working. I
> > went over it, comparing the return codes to those of the existing scripts, and
> > made some changes. Here's my current script: <http://pastebin.com/nUnYVcBK>.
> >
> > Up until now my fence-peer scripts had been either Lon Hohberger's
> > obliterate-peer.sh or Digimer's rhcs_fence. I decided to try the
> > stonith_admin-fence-peer.sh that Andreas Kurz recommended; unlike the first two
> > scripts, which fence using fence_node, the latter script just calls stonith_admin.
> >
> > When I tried the stonith_admin-fence-peer.sh script, it worked:
> >
> > # drbdadm fence-peer minor-0
> > stonith_admin-fence-peer.sh[10886]: stonith_admin successfully fenced peer
> > orestes-corosync.nevis.columbia.edu.
> >
> > Power was cut on the peer, and the remaining node stayed up. Then I brought up
> > the peer with:
> >
> > stonith_admin -U orestes-corosync.nevis.columbia.edu
> >
> > BUT: when the restored peer came up and started to run cman, clvmd hung on the
> > main node again.
> >
> > After cycling through some more tests, I found that if I brought down the peer
> > with drbdadm, then brought up the peer with no HA services, then started drbd
> > and then cman, the cluster remained intact.
> >
> > If I crashed the peer, the scheme in the previous paragraph didn't work. I bring
> > up drbd, check that the disks are both UpToDate, then bring up cman. At that
> > point the vgdisplay on the main node takes so long to run that clvmd will time
> > out:
> >
> > vgdisplay
> > Error locking on node orestes-corosync.nevis.columbia.edu: Command timed out
> >
> > I timed how long it took vgdisplay to run. I might be able to work around this
> > by setting the timeout on my clvmd resource to 300s, but that seems to be a
> > band-aid for an underlying problem. Any suggestions on what else I could check?
>
> I've done some more tests. Still no solution, just an observation. The "death
> mode" appears to be:
>
> - Two nodes running cman+pacemaker+drbd+clvmd.
> - Take one node down = one remaining node with cman+pacemaker+drbd+clvmd.
> - Start up the dead node. If it ever gets into a state in which it's running cman
>   but not clvmd, clvmd on the uncrashed node hangs.
> - Conversely, if I bring up drbd, make it primary, and then start cman+clvmd,
>   there's no problem on the uncrashed node.
>
> My guess is that clvmd is getting the number of nodes it expects from cman. When
> the formerly dead node starts running cman, the number of cluster nodes goes to
> 2 (I checked with 'cman_tool status'), but the number of nodes running clvmd is
> still 1, hence the hang.
>
> Does this guess make sense?
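It would help to see what the cluster stack itself thinks at that moment. clvmd takes its locks through the DLM, and a DLM lockspace stuck waiting for fencing to complete can look exactly like what you describe: cman sees two nodes, but every LVM command hangs. Next time vgdisplay hangs, could you run something like the following on the surviving node and post the output:

    cman_tool status    # node/vote count and quorum as cman sees it
    cman_tool nodes     # per-node membership state
    fence_tool ls       # fence domain: has the crashed node really been fenced?
    dlm_tool ls         # DLM lockspaces -- look at the state of the "clvmd" lockspace
    group_tool ls       # fence/dlm groups, shows whether any group is stuck in a transition

And of course the /etc/cluster/cluster.conf I asked for above, so we can see how fenced is configured.
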
>
> --
> Bill Seligman              | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ  | mailto://selig...@nevis.columbia.edu
> PO Box 137                 |
> Irvington NY 10533 USA     | http://www.nevis.columbia.edu/~seligman/

--
this is my life and I live it as long as God wills

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems