Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes - SOLVED

2012-03-27 Thread William Seligman
On 3/27/12 4:52 AM, emmanuel segura wrote:

> So is your cluster OK now?

*Laughs* No! There's another problem I have to solve. But it's completely
unrelated to this one. I'll work on it some more, and if I can't solve it I'll
start a new thread.

Thanks for asking, Emmanuel. (I want to prove I can spell your name correctly!)

> Il giorno 27 marzo 2012 00:33, William Seligman > ha scritto:
> 
>> On 3/26/12 5:31 PM, William Seligman wrote:
>>> On 3/26/12 5:17 PM, William Seligman wrote:
 On 3/26/12 4:28 PM, emmanuel segura wrote:
>>
> and i suggest you to start clvmd at boot time
>
> chkconfig clvmd on

 I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I get:

 Mounting GFS2 filesystem (/usr/nevis): invalid device path 
 "/dev/mapper/ADMIN-usr"
[FAILED]

 ... and so on, because the ADMIN volume group was never loaded by 
 clvmd. Without a "vgscan" in there somewhere, the system can't see the
 volume groups on the drbd resource.
>>>
>>> Wait a second... there's an ocf:heartbeat:LVM resource! Testing...
>>
>> Emannuel, you did it!
>>
>> For the sake of future searches, and possibly future documentation, let me 
>> start with my original description of the problem:
>>
>>> I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in
>>> "Clusters From Scratch." Fencing is through forcibly rebooting a node by
>>> cutting and restoring its power via UPS.
>>> 
>>> My fencing/failover tests have revealed a problem. If I gracefully turn
>>> off one node ("crm node standby"; "service pacemaker stop"; "shutdown -r
>>> now") all the resources transfer to the other node with no problems. If I
>>> cut power to one node (as would happen if it were fenced), the lsb::clvmd
>>> resource on the remaining node eventually fails. Since all the other
>>> resources depend on clvmd, all the resources on the remaining node stop
>>> and the cluster is left with nothing running.
>>> 
>>> I've traced why the lsb::clvmd fails: The monitor/status command
>>> includes "vgdisplay", which hangs indefinitely. Therefore the monitor
>>> will always time-out.
>>> 
>>> So this isn't a problem with pacemaker, but with clvmd/dlm: If a node is
>>> cut off, the cluster isn't handling it properly. Has anyone on this list
>>> seen this before? Any ideas?
>>>
>>> Details:
>>>
>>> versions:
>>> Redhat Linux 6.2 (kernel 2.6.32)
>>> cman-3.0.12.1
>>> corosync-1.4.1
>>> pacemaker-1.1.6
>>> lvm2-2.02.87
>>> lvm2-cluster-2.02.87
>>
>> The problem is that clvmd on the main node will hang if there's a 
>> substantive period of time during which the other node returns running cman
>> but not clvmd. I never tracked down why this happens, but there's a
>> practical solution: minimize any interval for which that would be true. To
>> ensure this, take clvmd outside the resource manager's control:
>>
>> chkconfig cman on
>> chkconfig clvmd on
>> chkconfig pacemaker on
>>
>> On RHEL6.2, these services will be started in the above order; clvmd will 
>> start within a few seconds after cman.
>> 
>> Here's my cluster.conf  and the output of 
>> "crm configure show" . The key lines from
>> the latter are:
>>
>> primitive AdminDrbd ocf:linbit:drbd \
>>params drbd_resource="admin"
>> primitive AdminLvm ocf:heartbeat:LVM \
>>params volgrpname="ADMIN" \
>>op monitor interval="30" timeout="100" depth="0"
>> primitive Gfs2 lsb:gfs2
>> group VolumeGroup AdminLvm Gfs2
>> ms AdminClone AdminDrbd \
>>meta master-max="2" master-node-max="1" \
>>clone-max="2" clone-node-max="1" \
>>notify="true" interleave="true"
>> clone VolumeClone VolumeGroup \
>>meta interleave="true"
>> colocation Volume_With_Admin inf: VolumeClone AdminClone:Master
>> order Admin_Before_Volume inf: AdminClone:promote VolumeClone:start
>>
>> What I learned: If one is going to extend the example in "Clusters From 
>> Scratch" to include logical volumes, one must start clvmd at boot time, and
>> include any volume groups in ocf:heartbeat:LVM resources that start before
>> gfs2.
>> 
>> Note the long timeout on the ocf:heartbeat:LVM resource. This is a good 
>> idea because, during the boot of the crashed node, there'll still be an 
>> interval of a few seconds when cman will be running but clvmd won't be.
>> During my tests, the LVM monitor would fail if it checked during that
>> interval with a timeout that was shorter than it took clvmd to start on the
>> crashed node. This was annoying; all resources dependent on AdminLvm would
>> be stopped until AdminLvm recovered (a few more seconds). Increasing the
>> timeout avoids this.
>> 
>> It also means that during any recovery procedure on the crashed node for 
>> which I turn off all the services, I have to minimize the interval between
>> the start of cman and clvmd if I've turned off services at boot; e.g.,
>>
>> service drbd start # ... and fix any split-brain problems or whatever
>> service cman start; service clvmd start # put on one line
>> service pacemaker start

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes - SOLVED

2012-03-27 Thread emmanuel segura
William :-)

So is your cluster OK now?

On 27 March 2012 00:33, William Seligman wrote:

> On 3/26/12 5:31 PM, William Seligman wrote:
> > On 3/26/12 5:17 PM, William Seligman wrote:
> >> On 3/26/12 4:28 PM, emmanuel segura wrote:
>
> >>> and i suggest you to start clvmd at boot time
> >>>
> >>> chkconfig clvmd on
> >>
> >> I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I
> get:
> >>
> >> Mounting GFS2 filesystem (/usr/nevis): invalid device path
> "/dev/mapper/ADMIN-usr"
> >>[FAILED]
> >>
> >> ... and so on, because the ADMIN volume group was never loaded by
> clvmd. Without
> >> a "vgscan" in there somewhere, the system can't see the volume groups
> on the
> >> drbd resource.
> >
> > Wait a second... there's an ocf:heartbeat:LVM resource! Testing...
>
> Emannuel, you did it!
>
> For the sake of future searches, and possibly future documentation, let me
> start
> with my original description of the problem:
>
> > I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in
> "Clusters
> > From Scratch." Fencing is through forcibly rebooting a node by cutting
> and
> > restoring its power via UPS.
> >
> > My fencing/failover tests have revealed a problem. If I gracefully turn
> off one
> > node ("crm node standby"; "service pacemaker stop"; "shutdown -r now")
> all the
> > resources transfer to the other node with no problems. If I cut power to
> one
> > node (as would happen if it were fenced), the lsb::clvmd resource on the
> > remaining node eventually fails. Since all the other resources depend on
> clvmd,
> > all the resources on the remaining node stop and the cluster is left with
> > nothing running.
> >
> > I've traced why the lsb::clvmd fails: The monitor/status command includes
> > "vgdisplay", which hangs indefinitely. Therefore the monitor will always
> time-out.
> >
> > So this isn't a problem with pacemaker, but with clvmd/dlm: If a node is
> cut
> > off, the cluster isn't handling it properly. Has anyone on this list
> seen this
> > before? Any ideas?
> >
> > Details:
> >
> > versions:
> > Redhat Linux 6.2 (kernel 2.6.32)
> > cman-3.0.12.1
> > corosync-1.4.1
> > pacemaker-1.1.6
> > lvm2-2.02.87
> > lvm2-cluster-2.02.87
>
> The problem is that clvmd on the main node will hang if there's a
> substantive
> period of time during which the other node returns running cman but not
> clvmd. I
> never tracked down why this happens, but there's a practical solution:
> minimize
> any interval for which that would be true. To ensure this, take clvmd
> outside
> the resource manager's control:
>
> chkconfig cman on
> chkconfig clvmd on
> chkconfig pacemaker on
>
> On RHEL6.2, these services will be started in the above order; clvmd will
> start
> within a few seconds after cman.
>
> Here's my cluster.conf  and the output of
> "crm
> configure show" . The key lines from the
> latter are:
>
> primitive AdminDrbd ocf:linbit:drbd \
>params drbd_resource="admin"
> primitive AdminLvm ocf:heartbeat:LVM \
>params volgrpname="ADMIN" \
>op monitor interval="30" timeout="100" depth="0"
> primitive Gfs2 lsb:gfs2
> group VolumeGroup AdminLvm Gfs2
> ms AdminClone AdminDrbd \
>meta master-max="2" master-node-max="1" \
>clone-max="2" clone-node-max="1" \
>notify="true" interleave="true"
> clone VolumeClone VolumeGroup \
>meta interleave="true"
> colocation Volume_With_Admin inf: VolumeClone AdminClone:Master
> order Admin_Before_Volume inf: AdminClone:promote VolumeClone:start
>
> What I learned: If one is going to extend the example in "Clusters From
> Scratch"
> to include logical volumes, one must start clvmd at boot time, and include
> any
> volume groups in ocf:heartbeat:LVM resources that start before gfs2.
>
> Note the long timeout on the ocf:heartbeat:LVM resource. This is a good
> idea
> because, during the boot of the crashed node, there'll still be an
> interval of a
> few seconds when cman will be running but clvmd won't be. During my tests,
> the
> LVM monitor would fail if it checked during that interval with a timeout
> that
> was shorter than it took clvmd to start on the crashed node. This was
> annoying;
> all resources dependent on AdminLvm would be stopped until AdminLvm
> recovered (a
> few more seconds). Increasing the timeout avoids this.
>
> It also means that during any recovery procedure on the crashed node for
> which I
> turn off all the services, I have to minimize the interval between the
> start of
> cman and clvmd if I've turned off services at boot; e.g.,
>
> service drbd start # ... and fix any split-brain problems or whatever
> service cman start; service clvmd start # put on one line
> service pacemaker start
>
> I thank everyone on this list who was patient with me as I pounded on this
> problem for two weeks!
> --
> Bill Seligman | Phone: (914) 591-2823

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-27 Thread emmanuel segura
William,

I would like to know whether you have an LVM resource in your Pacemaker
configuration.

Remember that clvmd does not activate VGs or LVs; it propagates the LVM metadata
to all nodes of the cluster.
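
To make the distinction concrete, a rough sketch using the ADMIN volume group from
this thread (clvmd provides clustered locking and metadata coherence; activation is
a separate step, which is roughly what the ocf:heartbeat:LVM agent does for you):

service clvmd start    # clustered locking/metadata daemon only
vgchange -a y ADMIN    # activate the VG -- the step the LVM resource agent performs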



On 26 March 2012 23:17, William Seligman wrote:

> On 3/26/12 4:28 PM, emmanuel segura wrote:
> > Sorry Willian i can't post my config now because i'm at home now  not in
> my
> > job
> >
> > I think it's no a problem if clvm start before drbd, because clvm not
> > needed and devices to start
> >
> > This it's the point, i hope to be clear
> >
> > The introduction of pacemaker in redhat cluster was thinked  for replace
> > rgmanager not whole cluster stack
> >
> > and i suggest you to start clvmd at boot time
> >
> > chkconfig clvmd on
>
> I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I get:
>
> Mounting GFS2 filesystem (/usr/nevis): invalid device path
> "/dev/mapper/ADMIN-usr"
>   [FAILED]
>
> ... and so on, because the ADMIN volume group was never loaded by clvmd.
> Without
> a "vgscan" in there somewhere, the system can't see the volume groups on
> the
> drbd resource.
>
> > Sorry for my bad english :-) i can from a spanish country and all days i
> > speak Italian
>
> I'm sorry that I don't speak more languages! You're the one who's helping
> me;
> it's my task to learn and understand. Certainly your English is better
> than my
> French or Russian.
>
> > Il giorno 26 marzo 2012 22:04, William Seligman <
> selig...@nevis.columbia.edu
> >> ha scritto:
> >
> >> On 3/26/12 3:48 PM, emmanuel segura wrote:
> >>> I know it's normal fence_node doesn't work because the request of fence
> >>> must be redirect to pacemaker stonith
> >>>
> >>> I think call the cluster agents with rgmanager it's really ugly thing,
> i
> >>> never seen a cluster like this
> >>> ==
> >>> If I understand "Pacemaker Explained"  and how
> I'd
> >>> invoke
> >>> clvmd from cman , the clvmd script that would be
> >>> invoked
> >>> by either HA resource manager is exactly the same: /etc/init.d/clvmd.
> >>> ==
> >>>
> >>> clvm doesn't need to be called from rgmanger in the cluster
> configuration
> >>>
> >>> this the boot sequence of redhat daemons
> >>>
> >>> 1:cman, 2:clvm, 3:rgmanager
> >>>
> >>> and if you don't wanna use rgmanager you just replace rgmanager
> >>
> >> I'm sorry, but I don't think I understand what you're suggesting. Do you
> >> suggest
> >> that I start clvmd at boot? That won't work; clvmd won't see the volume
> >> groups
> >> on drbd until drbd is started and promoted to primary.
> >>
> >> May I ask you to post your own cluster.conf on pastebin.com so I can
> see
> >> how you
> >> do it? Along with "crm configure show" if that's relevant for your
> cluster?
> >>
> >>> Il giorno 26 marzo 2012 19:21, William Seligman <
> >> selig...@nevis.columbia.edu
>  ha scritto:
> >>>
>  On 3/24/12 5:40 PM, emmanuel segura wrote:
> > I think it's better you use clvmd with cman
> >
> > I don't now why you use the lsb script of clvm
> >
> > On Redhat clvmd need of cman and you try to running with pacemaker, i
> >> not
> > sure this is the problem but this type of configuration it's so
> strange
> >
> > I made it a virtual cluster with kvm and i not foud a problems
> 
>  While I appreciate the advice, it's not immediately clear that trying
> to
>  eliminate pacemaker would do me any good. Perhaps someone can
> >> demonstrate
>  the
>  error in my reasoning:
> 
>  If I understand "Pacemaker Explained"  and how
> >> I'd
>  invoke
>  clvmd from cman , the clvmd script that would
> be
>  invoked
>  by either HA resource manager is exactly the same: /etc/init.d/clvmd.
> 
>  If I tried to use cman instead of pacemaker, I'd be cutting myself off
>  from the
>  pacemaker features that cman/rgmanager does not yet have available,
> >> such as
>  pacemaker's symlink, exportfs, and clonable IPaddr2 resources.
> 
>  I recognize I've got a strange problem. Given that fence_node doesn't
> >> work
>  but
>  stonith_admin does, I strongly suspect that the problem is caused by
> the
>  behavior of my fencing agent, not the use of pacemaker versus
> rgmanager,
>  nor by
>  how clvmd is being started.
> 
> > Il giorno 24 marzo 2012 13:09, William Seligman <
>  selig...@nevis.columbia.edu
> >> ha scritto:
> >
> >> On 3/24/12 4:47 AM, emmanuel segura wrote:
> >>> How do you configure clvmd?
> >>>
> >>> with cman or with pacemaker?
> >>
> >> Pacemaker. Here's the output of 'crm configure show':
> >> 
> >>
> >>> Il giorno 23 marzo 2012 22:14, Willi

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes - SOLVED

2012-03-26 Thread William Seligman
On 3/26/12 5:31 PM, William Seligman wrote:
> On 3/26/12 5:17 PM, William Seligman wrote:
>> On 3/26/12 4:28 PM, emmanuel segura wrote:

>>> and i suggest you to start clvmd at boot time
>>>
>>> chkconfig clvmd on
>>
>> I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I get:
>>
>> Mounting GFS2 filesystem (/usr/nevis): invalid device path 
>> "/dev/mapper/ADMIN-usr"
>>[FAILED]
>>
>> ... and so on, because the ADMIN volume group was never loaded by clvmd. 
>> Without
>> a "vgscan" in there somewhere, the system can't see the volume groups on the
>> drbd resource.
> 
> Wait a second... there's an ocf:heartbeat:LVM resource! Testing...

Emannuel, you did it!

For the sake of future searches, and possibly future documentation, let me start
with my original description of the problem:

> I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in 
> "Clusters
> From Scratch." Fencing is through forcibly rebooting a node by cutting and
> restoring its power via UPS.
> 
> My fencing/failover tests have revealed a problem. If I gracefully turn off 
> one
> node ("crm node standby"; "service pacemaker stop"; "shutdown -r now") all the
> resources transfer to the other node with no problems. If I cut power to one
> node (as would happen if it were fenced), the lsb::clvmd resource on the
> remaining node eventually fails. Since all the other resources depend on 
> clvmd,
> all the resources on the remaining node stop and the cluster is left with
> nothing running.
> 
> I've traced why the lsb::clvmd fails: The monitor/status command includes
> "vgdisplay", which hangs indefinitely. Therefore the monitor will always 
> time-out.
> 
> So this isn't a problem with pacemaker, but with clvmd/dlm: If a node is cut
> off, the cluster isn't handling it properly. Has anyone on this list seen this
> before? Any ideas?
> 
> Details:
> 
> versions:
> Redhat Linux 6.2 (kernel 2.6.32)
> cman-3.0.12.1
> corosync-1.4.1
> pacemaker-1.1.6
> lvm2-2.02.87
> lvm2-cluster-2.02.87

The problem is that clvmd on the surviving node will hang if there is a substantial
period of time during which the other node is back up and running cman but not yet
clvmd. I never tracked down why this happens, but there is a practical solution:
minimize any interval during which that can be true. To ensure this, take clvmd
outside the resource manager's control:

chkconfig cman on
chkconfig clvmd on
chkconfig pacemaker on

On RHEL6.2, these services will be started in the above order; clvmd will start
within a few seconds after cman.
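
A quick way to confirm that all three init scripts are enabled and will run in that
order (a sketch, assuming the stock RHEL 6 SysV-init layout; the exact S-numbers
depend on the packages installed):

chkconfig --list | egrep 'cman|clvmd|pacemaker'   # all three should be "on" for runlevels 3 and 5
ls /etc/rc3.d/ | egrep 'cman|clvmd|pacemaker'     # the S-numbers give the start order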

Here's my cluster.conf  and the output of "crm
configure show" . The key lines from the latter 
are:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource="admin"
primitive AdminLvm ocf:heartbeat:LVM \
        params volgrpname="ADMIN" \
        op monitor interval="30" timeout="100" depth="0"
primitive Gfs2 lsb:gfs2
group VolumeGroup AdminLvm Gfs2
ms AdminClone AdminDrbd \
        meta master-max="2" master-node-max="1" \
        clone-max="2" clone-node-max="1" \
        notify="true" interleave="true"
clone VolumeClone VolumeGroup \
        meta interleave="true"
colocation Volume_With_Admin inf: VolumeClone AdminClone:Master
order Admin_Before_Volume inf: AdminClone:promote VolumeClone:start
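
Once that configuration is loaded, the standard Pacemaker tools give a quick check
that both nodes are promoted and the clones are running (nothing here is specific
to this setup):

crm_verify -L -V   # check the live CIB for configuration errors
crm_mon -1         # one-shot status; AdminClone should show a Master on both nodes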

What I learned: If one is going to extend the example in "Clusters From Scratch"
to include logical volumes, one must start clvmd at boot time, and include any
volume groups in ocf:heartbeat:LVM resources that start before gfs2.

Note the long timeout on the ocf:heartbeat:LVM resource. This is a good idea
because, during the boot of the crashed node, there will still be an interval of a
few seconds when cman is running but clvmd is not yet. During my tests, the
LVM monitor would fail if it ran during that interval with a timeout shorter
than the time clvmd took to start on the crashed node. This was annoying:
all resources dependent on AdminLvm would be stopped until AdminLvm recovered (a
few more seconds). Increasing the timeout avoids this.
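
To pick a sensible value, it may help to time the same sort of LVM query the agent's
monitor performs while the rebooting node is in its cman-up/clvmd-not-yet window
(a rough sketch; ADMIN is the volume group above):

time vgdisplay ADMIN          # should return well inside the monitor timeout
time vgs --noheadings ADMIN   # a lighter-weight query, if preferred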

It also means that, during any recovery procedure on the crashed node for which I
have turned the services off at boot, I have to minimize the interval between
starting cman and starting clvmd; e.g.,

service drbd start # ... and fix any split-brain problems or whatever
service cman start; service clvmd start # put on one line
service pacemaker start
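
If this comes up often, the cman-to-clvmd gap can be kept short with a small helper
script (a sketch only; the name and path are hypothetical, and it assumes drbd has
already been started and any split-brain repaired by hand):

#!/bin/sh
# /usr/local/sbin/start-ha-stack.sh (hypothetical) -- run only after
# "service drbd start" and any split-brain repair have been done.
set -e
service cman start
service clvmd start       # immediately after cman, no pause in between
service pacemaker start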

I thank everyone on this list who was patient with me as I pounded on this
problem for two weeks!
-- 
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
PO Box 137|
Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/




Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-26 Thread William Seligman
On 3/26/12 5:17 PM, William Seligman wrote:
> On 3/26/12 4:28 PM, emmanuel segura wrote:
>> Sorry Willian i can't post my config now because i'm at home now  not in my
>> job
>>
>> I think it's no a problem if clvm start before drbd, because clvm not
>> needed and devices to start
>>
>> This it's the point, i hope to be clear
>>
>> The introduction of pacemaker in redhat cluster was thinked  for replace
>> rgmanager not whole cluster stack
>>
>> and i suggest you to start clvmd at boot time
>>
>> chkconfig clvmd on
> 
> I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I get:
> 
> Mounting GFS2 filesystem (/usr/nevis): invalid device path 
> "/dev/mapper/ADMIN-usr"
>[FAILED]
> 
> ... and so on, because the ADMIN volume group was never loaded by clvmd. 
> Without
> a "vgscan" in there somewhere, the system can't see the volume groups on the
> drbd resource.

Wait a second... there's an ocf:heartbeat:LVM resource! Testing...
-- 
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
PO Box 137|
Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/




Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-26 Thread William Seligman
On 3/26/12 4:28 PM, emmanuel segura wrote:
> Sorry Willian i can't post my config now because i'm at home now  not in my
> job
> 
> I think it's no a problem if clvm start before drbd, because clvm not
> needed and devices to start
> 
> This it's the point, i hope to be clear
> 
> The introduction of pacemaker in redhat cluster was thinked  for replace
> rgmanager not whole cluster stack
> 
> and i suggest you to start clvmd at boot time
> 
> chkconfig clvmd on

I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I get:

Mounting GFS2 filesystem (/usr/nevis): invalid device path 
"/dev/mapper/ADMIN-usr"
   [FAILED]

... and so on, because the ADMIN volume group was never loaded by clvmd. Without
a "vgscan" in there somewhere, the system can't see the volume groups on the
drbd resource.
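
A few quick ways to see whether the volume group has actually been activated at that
point (plain LVM and device-mapper checks, nothing setup-specific):

vgs ADMIN                      # is the volume group visible at all?
lvs ADMIN                      # are its logical volumes listed?
ls /dev/mapper/ | grep ADMIN   # device nodes appear only once the VG is activated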

> Sorry for my bad english :-) i can from a spanish country and all days i
> speak Italian

I'm sorry that I don't speak more languages! You're the one who's helping me;
it's my task to learn and understand. Certainly your English is better than my
French or Russian.

> Il giorno 26 marzo 2012 22:04, William Seligman > ha scritto:
> 
>> On 3/26/12 3:48 PM, emmanuel segura wrote:
>>> I know it's normal fence_node doesn't work because the request of fence
>>> must be redirect to pacemaker stonith
>>>
>>> I think call the cluster agents with rgmanager it's really ugly thing, i
>>> never seen a cluster like this
>>> ==
>>> If I understand "Pacemaker Explained"  and how I'd
>>> invoke
>>> clvmd from cman , the clvmd script that would be
>>> invoked
>>> by either HA resource manager is exactly the same: /etc/init.d/clvmd.
>>> ==
>>>
>>> clvm doesn't need to be called from rgmanger in the cluster configuration
>>>
>>> this the boot sequence of redhat daemons
>>>
>>> 1:cman, 2:clvm, 3:rgmanager
>>>
>>> and if you don't wanna use rgmanager you just replace rgmanager
>>
>> I'm sorry, but I don't think I understand what you're suggesting. Do you
>> suggest
>> that I start clvmd at boot? That won't work; clvmd won't see the volume
>> groups
>> on drbd until drbd is started and promoted to primary.
>>
>> May I ask you to post your own cluster.conf on pastebin.com so I can see
>> how you
>> do it? Along with "crm configure show" if that's relevant for your cluster?
>>
>>> Il giorno 26 marzo 2012 19:21, William Seligman <
>> selig...@nevis.columbia.edu
 ha scritto:
>>>
 On 3/24/12 5:40 PM, emmanuel segura wrote:
> I think it's better you use clvmd with cman
>
> I don't now why you use the lsb script of clvm
>
> On Redhat clvmd need of cman and you try to running with pacemaker, i
>> not
> sure this is the problem but this type of configuration it's so strange
>
> I made it a virtual cluster with kvm and i not foud a problems

 While I appreciate the advice, it's not immediately clear that trying to
 eliminate pacemaker would do me any good. Perhaps someone can
>> demonstrate
 the
 error in my reasoning:

 If I understand "Pacemaker Explained"  and how
>> I'd
 invoke
 clvmd from cman , the clvmd script that would be
 invoked
 by either HA resource manager is exactly the same: /etc/init.d/clvmd.

 If I tried to use cman instead of pacemaker, I'd be cutting myself off
 from the
 pacemaker features that cman/rgmanager does not yet have available,
>> such as
 pacemaker's symlink, exportfs, and clonable IPaddr2 resources.

 I recognize I've got a strange problem. Given that fence_node doesn't
>> work
 but
 stonith_admin does, I strongly suspect that the problem is caused by the
 behavior of my fencing agent, not the use of pacemaker versus rgmanager,
 nor by
 how clvmd is being started.

> Il giorno 24 marzo 2012 13:09, William Seligman <
 selig...@nevis.columbia.edu
>> ha scritto:
>
>> On 3/24/12 4:47 AM, emmanuel segura wrote:
>>> How do you configure clvmd?
>>>
>>> with cman or with pacemaker?
>>
>> Pacemaker. Here's the output of 'crm configure show':
>> 
>>
>>> Il giorno 23 marzo 2012 22:14, William Seligman <
>> selig...@nevis.columbia.edu
 ha scritto:
>>>
 On 3/23/12 5:03 PM, emmanuel segura wrote:

> Sorry but i would to know if can show me your
 /etc/cluster/cluster.conf

 Here it is: 

> Il giorno 23 marzo 2012 21:50, William Seligman <
 selig...@nevis.columbia.edu
>> ha scritto:
>
>> On 3/22/12 2:43 PM, William Seligman wrote:
>>> On 3/20/12 4:55 PM, Lars Ellenberg wrote:
 On Fri, Mar 16, 2

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-24 Thread emmanuel segura
I think it's better to use clvmd with cman.

I don't know why you use the LSB script for clvmd.

On Red Hat, clvmd needs cman, and you are trying to run it with Pacemaker. I'm not
sure this is the problem, but this type of configuration is very unusual.

I built a virtual cluster with KVM and did not find any problems.

On 24 March 2012 13:09, William Seligman wrote:

> On 3/24/12 4:47 AM, emmanuel segura wrote:
> > How do you configure clvmd?
> >
> > with cman or with pacemaker?
>
> Pacemaker. Here's the output of 'crm configure show':
> 
>
> > Il giorno 23 marzo 2012 22:14, William Seligman <
> selig...@nevis.columbia.edu
> >> ha scritto:
> >
> >> On 3/23/12 5:03 PM, emmanuel segura wrote:
> >>
> >>> Sorry but i would to know if can show me your /etc/cluster/cluster.conf
> >>
> >> Here it is: 
> >>
> >>> Il giorno 23 marzo 2012 21:50, William Seligman <
> >> selig...@nevis.columbia.edu
>  ha scritto:
> >>>
>  On 3/22/12 2:43 PM, William Seligman wrote:
> > On 3/20/12 4:55 PM, Lars Ellenberg wrote:
> >> On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote:
> >>> On 3/16/12 12:12 PM, William Seligman wrote:
>  On 3/16/12 7:02 AM, Andreas Kurz wrote:
> >
> > s- ... DRBD suspended io, most likely because of it's
> > fencing-policy. For valid dual-primary setups you have to use
> > "resource-and-stonith" policy and a working "fence-peer" handler.
> >> In
> > this mode I/O is suspended until fencing of peer was succesful.
>  Question
> > is, why the peer does _not_ also suspend its I/O because
> obviously
> > fencing was not successful .
> >
> > So with a correct DRBD configuration one of your nodes should
> >> already
> > have been fenced because of connection loss between nodes (on
> drbd
> > replication link).
> >
> > You can use e.g. that nice fencing script:
> >
> > http://goo.gl/O4N8f
> 
>  This is the output of "drbdadm dump admin": <
>  http://pastebin.com/kTxvHCtx>
> 
>  So I've got resource-and-stonith. I gather from an earlier thread
> >> that
>  obliterate-peer.sh is more-or-less equivalent in functionality
> with
>  stonith_admin_fence_peer.sh:
> 
>  
> 
>  At the moment I'm pursuing the possibility that I'm returning the
>  wrong return
>  codes from my fencing agent:
> 
>  
> >>>
> >>> I cleaned up my fencing agent, making sure its return code matched
>  those
> >>> returned by other agents in /usr/sbin/fence_, and allowing for some
>  delay issues
> >>> in reading the UPS status. But...
> >>>
>  After that, I'll look at another suggestion with lvm.conf:
> 
>  
> 
>  Then I'll try DRBD 8.4.1. Hopefully one of these is the source of
> >> the
>  issue.
> >>>
> >>> Failure on all three counts.
> >>
> >> May I suggest you double check the permissions on your fence peer
>  script?
> >> I suspect you may simply have forgotten the "chmod +x" .
> >>
> >> Test with "drbdadm fence-peer minor-0" from the command line.
> >
> > I still haven't solved the problem, but this advice has gotten me
>  further than
> > before.
> >
> > First, Lars was correct: I did not have execute permissions set on my
>  fence peer
> > scripts. (D'oh!) I turned them on, but that did not change anything:
>  cman+clvmd
> > still hung on the vgdisplay command if I crashed the peer node.
> >
> > I started up both nodes again (cman+pacemaker+drbd+clvmd) and tried
> >> Lars'
> > suggested command. I didn't save the response for this message (d'oh
>  again!) but
> > it said that the fence-peer script had failed.
> >
> > Hmm. The peer was definitely shutting down, so my fencing script is
>  working. I
> > went over it, comparing the return codes to those of the existing
>  scripts, and
> > made some changes. Here's my current script: <
>  http://pastebin.com/nUnYVcBK>.
> >
> > Up until now my fence-peer scripts had either been Lon Hohberger's
> > obliterate-peer.sh or Digimer's rhcs_fence. I decided to try
> > stonith_admin-fence-peer.sh that Andreas Kurz recommended; unlike the
>  first two
> > scripts, which fence using fence_node, the latter script just calls
>  stonith_admin.
> >
> > When I tried the stonith_admin-fence-peer.sh script, it worked:
> >
> > # drbdadm fence-peer minor-0
> > stonith_admin-fence-peer.sh[10886]: stonith_admin successfully fenced
>  peer
> > orest

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-24 Thread William Seligman
On 3/24/12 4:47 AM, emmanuel segura wrote:
> How do you configure clvmd?
> 
> with cman or with pacemaker?

Pacemaker. Here's the output of 'crm configure show':


> Il giorno 23 marzo 2012 22:14, William Seligman > ha scritto:
> 
>> On 3/23/12 5:03 PM, emmanuel segura wrote:
>>
>>> Sorry but i would to know if can show me your /etc/cluster/cluster.conf
>>
>> Here it is: 
>>
>>> Il giorno 23 marzo 2012 21:50, William Seligman <
>> selig...@nevis.columbia.edu
 ha scritto:
>>>
 On 3/22/12 2:43 PM, William Seligman wrote:
> On 3/20/12 4:55 PM, Lars Ellenberg wrote:
>> On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote:
>>> On 3/16/12 12:12 PM, William Seligman wrote:
 On 3/16/12 7:02 AM, Andreas Kurz wrote:
>
> s- ... DRBD suspended io, most likely because of it's
> fencing-policy. For valid dual-primary setups you have to use
> "resource-and-stonith" policy and a working "fence-peer" handler.
>> In
> this mode I/O is suspended until fencing of peer was succesful.
 Question
> is, why the peer does _not_ also suspend its I/O because obviously
> fencing was not successful .
>
> So with a correct DRBD configuration one of your nodes should
>> already
> have been fenced because of connection loss between nodes (on drbd
> replication link).
>
> You can use e.g. that nice fencing script:
>
> http://goo.gl/O4N8f

 This is the output of "drbdadm dump admin": <
 http://pastebin.com/kTxvHCtx>

 So I've got resource-and-stonith. I gather from an earlier thread
>> that
 obliterate-peer.sh is more-or-less equivalent in functionality with
 stonith_admin_fence_peer.sh:

 

 At the moment I'm pursuing the possibility that I'm returning the
 wrong return
 codes from my fencing agent:

 
>>>
>>> I cleaned up my fencing agent, making sure its return code matched
 those
>>> returned by other agents in /usr/sbin/fence_, and allowing for some
 delay issues
>>> in reading the UPS status. But...
>>>
 After that, I'll look at another suggestion with lvm.conf:

 

 Then I'll try DRBD 8.4.1. Hopefully one of these is the source of
>> the
 issue.
>>>
>>> Failure on all three counts.
>>
>> May I suggest you double check the permissions on your fence peer
 script?
>> I suspect you may simply have forgotten the "chmod +x" .
>>
>> Test with "drbdadm fence-peer minor-0" from the command line.
>
> I still haven't solved the problem, but this advice has gotten me
 further than
> before.
>
> First, Lars was correct: I did not have execute permissions set on my
 fence peer
> scripts. (D'oh!) I turned them on, but that did not change anything:
 cman+clvmd
> still hung on the vgdisplay command if I crashed the peer node.
>
> I started up both nodes again (cman+pacemaker+drbd+clvmd) and tried
>> Lars'
> suggested command. I didn't save the response for this message (d'oh
 again!) but
> it said that the fence-peer script had failed.
>
> Hmm. The peer was definitely shutting down, so my fencing script is
 working. I
> went over it, comparing the return codes to those of the existing
 scripts, and
> made some changes. Here's my current script: <
 http://pastebin.com/nUnYVcBK>.
>
> Up until now my fence-peer scripts had either been Lon Hohberger's
> obliterate-peer.sh or Digimer's rhcs_fence. I decided to try
> stonith_admin-fence-peer.sh that Andreas Kurz recommended; unlike the
 first two
> scripts, which fence using fence_node, the latter script just calls
 stonith_admin.
>
> When I tried the stonith_admin-fence-peer.sh script, it worked:
>
> # drbdadm fence-peer minor-0
> stonith_admin-fence-peer.sh[10886]: stonith_admin successfully fenced
 peer
> orestes-corosync.nevis.columbia.edu.
>
> Power was cut on the peer, the remaining node stayed up. Then I brought
 up the
> peer with:
>
> stonith_admin -U orestes-corosync.nevis.columbia.edu
>
> BUT: When the restored peer came up and started to run cman, the clvmd
 hung on
> the main node again.
>
> After cycling through some more tests, I found that if I brought down
 the peer
> with drbdadm, then brought up with the peer with no HA services, then
 started
> drbd and then cman, the cluster remained intact.
>
> If I crashed the peer, the scheme in the previous para

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-24 Thread emmanuel segura
How do you configure clvmd?

With cman or with Pacemaker?

On 23 March 2012 22:14, William Seligman wrote:

> On 3/23/12 5:03 PM, emmanuel segura wrote:
>
> > Sorry but i would to know if can show me your /etc/cluster/cluster.conf
>
> Here it is: 
>
> > Il giorno 23 marzo 2012 21:50, William Seligman <
> selig...@nevis.columbia.edu
> >> ha scritto:
> >
> >> On 3/22/12 2:43 PM, William Seligman wrote:
> >>> On 3/20/12 4:55 PM, Lars Ellenberg wrote:
>  On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote:
> > On 3/16/12 12:12 PM, William Seligman wrote:
> >> On 3/16/12 7:02 AM, Andreas Kurz wrote:
> >>> On 03/15/2012 11:50 PM, William Seligman wrote:
>  On 3/15/12 6:07 PM, William Seligman wrote:
> > On 3/15/12 6:05 PM, William Seligman wrote:
> >> On 3/15/12 4:57 PM, emmanuel segura wrote:
> >>
> >>> we can try to understand what happen when clvm hang
> >>>
> >>> edit the /etc/lvm/lvm.conf  and change level = 7 in the log
> >> session and
> >>> uncomment this line
> >>>
> >>> file = "/var/log/lvm2.log"
> >>
> >> Here's the tail end of the file (the original is 1.6M). Because
> >> there no times
> >> in the log, it's hard for me to point you to the point where I
> >> crashed the other
> >> system. I think (though I'm not sure) that the crash happened
> >> after the last
> >> occurrence of
> >>
> >> cache/lvmcache.c:1484   Wiping internal VG cache
> >>
> >> Honestly, it looks like a wall of text to me. Does it suggest
> >> anything to you?
> >
> > Maybe it would help if I included the link to the pastebin where
> I
> >> put the
> > output: 
> 
>  Could the problem be with lvm+drbd?
> 
>  In lvm2.conf, I see this sequence of lines pre-crash:
> 
>  device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>  device/dev-io.c:271   /dev/md0: size is 1027968 sectors
>  device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>  device/dev-io.c:588   Closed /dev/md0
>  device/dev-io.c:271   /dev/md0: size is 1027968 sectors
>  device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>  device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>  device/dev-io.c:588   Closed /dev/md0
>  filters/filter-composite.c:31   Using /dev/md0
>  device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>  device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>  label/label.c:186   /dev/md0: No label detected
>  device/dev-io.c:588   Closed /dev/md0
>  device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
>  device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
>  device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
>  device/dev-io.c:588   Closed /dev/drbd0
>  device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
>  device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
>  device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
>  device/dev-io.c:588   Closed /dev/drbd0
> 
>  I interpret this: Look at /dev/md0, get some info, close; look at
> >> /dev/drbd0,
>  get some info, close.
> 
>  Post-crash, I see:
> 
>  evice/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>  device/dev-io.c:271   /dev/md0: size is 1027968 sectors
>  device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>  device/dev-io.c:588   Closed /dev/md0
>  device/dev-io.c:271   /dev/md0: size is 1027968 sectors
>  device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>  device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>  device/dev-io.c:588   Closed /dev/md0
>  filters/filter-composite.c:31   Using /dev/md0
>  device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>  device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>  label/label.c:186   /dev/md0: No label detected
>  device/dev-io.c:588   Closed /dev/md0
>  device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
>  device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
>  device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
> 
>  ... and then it hangs. Comparing the two, it looks like it can't
> >> close /dev/drbd0.
> 
>  If I look at /proc/drbd when I crash one node, I see this:
> 
>  # cat /proc/drbd
>  version: 8.3.12 (api:88/proto:86-96)
>  GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
>  r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
>   0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C
> s-
>  ns:764 nr:0 dw:0 dr:

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-23 Thread William Seligman
On 3/23/12 5:03 PM, emmanuel segura wrote:

> Sorry but i would to know if can show me your /etc/cluster/cluster.conf

Here it is: 

> Il giorno 23 marzo 2012 21:50, William Seligman > ha scritto:
> 
>> On 3/22/12 2:43 PM, William Seligman wrote:
>>> On 3/20/12 4:55 PM, Lars Ellenberg wrote:
 On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote:
> On 3/16/12 12:12 PM, William Seligman wrote:
>> On 3/16/12 7:02 AM, Andreas Kurz wrote:
>>> On 03/15/2012 11:50 PM, William Seligman wrote:
 On 3/15/12 6:07 PM, William Seligman wrote:
> On 3/15/12 6:05 PM, William Seligman wrote:
>> On 3/15/12 4:57 PM, emmanuel segura wrote:
>>
>>> we can try to understand what happen when clvm hang
>>>
>>> edit the /etc/lvm/lvm.conf  and change level = 7 in the log
>> session and
>>> uncomment this line
>>>
>>> file = "/var/log/lvm2.log"
>>
>> Here's the tail end of the file (the original is 1.6M). Because
>> there no times
>> in the log, it's hard for me to point you to the point where I
>> crashed the other
>> system. I think (though I'm not sure) that the crash happened
>> after the last
>> occurrence of
>>
>> cache/lvmcache.c:1484   Wiping internal VG cache
>>
>> Honestly, it looks like a wall of text to me. Does it suggest
>> anything to you?
>
> Maybe it would help if I included the link to the pastebin where I
>> put the
> output: 

 Could the problem be with lvm+drbd?

 In lvm2.conf, I see this sequence of lines pre-crash:

 device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
 device/dev-io.c:271   /dev/md0: size is 1027968 sectors
 device/dev-io.c:137   /dev/md0: block size is 1024 bytes
 device/dev-io.c:588   Closed /dev/md0
 device/dev-io.c:271   /dev/md0: size is 1027968 sectors
 device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
 device/dev-io.c:137   /dev/md0: block size is 1024 bytes
 device/dev-io.c:588   Closed /dev/md0
 filters/filter-composite.c:31   Using /dev/md0
 device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
 device/dev-io.c:137   /dev/md0: block size is 1024 bytes
 label/label.c:186   /dev/md0: No label detected
 device/dev-io.c:588   Closed /dev/md0
 device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
 device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
 device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
 device/dev-io.c:588   Closed /dev/drbd0
 device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
 device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
 device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
 device/dev-io.c:588   Closed /dev/drbd0

 I interpret this: Look at /dev/md0, get some info, close; look at
>> /dev/drbd0,
 get some info, close.

 Post-crash, I see:

 evice/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
 device/dev-io.c:271   /dev/md0: size is 1027968 sectors
 device/dev-io.c:137   /dev/md0: block size is 1024 bytes
 device/dev-io.c:588   Closed /dev/md0
 device/dev-io.c:271   /dev/md0: size is 1027968 sectors
 device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
 device/dev-io.c:137   /dev/md0: block size is 1024 bytes
 device/dev-io.c:588   Closed /dev/md0
 filters/filter-composite.c:31   Using /dev/md0
 device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
 device/dev-io.c:137   /dev/md0: block size is 1024 bytes
 label/label.c:186   /dev/md0: No label detected
 device/dev-io.c:588   Closed /dev/md0
 device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
 device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
 device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes

 ... and then it hangs. Comparing the two, it looks like it can't
>> close /dev/drbd0.

 If I look at /proc/drbd when I crash one node, I see this:

 # cat /proc/drbd
 version: 8.3.12 (api:88/proto:86-96)
 GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
 r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
 ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0
>> ep:1 wo:b oos:0
>>>
>>> s- ... DRBD suspended io, most likely because of it's
>>> fencing-policy. For valid dual-primary setups you have to use
>>> "resource-and-stonith" policy and a working "fence-peer" handler. In
>>> this mode I/O is suspended until fencing of peer was succesful.
>> Questio

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-23 Thread emmanuel segura
Hello William,

Sorry, but I would like to know if you can show me your /etc/cluster/cluster.conf.

On 23 March 2012 21:50, William Seligman wrote:

> On 3/22/12 2:43 PM, William Seligman wrote:
> > On 3/20/12 4:55 PM, Lars Ellenberg wrote:
> >> On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote:
> >>> On 3/16/12 12:12 PM, William Seligman wrote:
>  On 3/16/12 7:02 AM, Andreas Kurz wrote:
> > On 03/15/2012 11:50 PM, William Seligman wrote:
> >> On 3/15/12 6:07 PM, William Seligman wrote:
> >>> On 3/15/12 6:05 PM, William Seligman wrote:
>  On 3/15/12 4:57 PM, emmanuel segura wrote:
> 
> > we can try to understand what happen when clvm hang
> >
> > edit the /etc/lvm/lvm.conf  and change level = 7 in the log
> session and
> > uncomment this line
> >
> > file = "/var/log/lvm2.log"
> 
>  Here's the tail end of the file (the original is 1.6M). Because
> there no times
>  in the log, it's hard for me to point you to the point where I
> crashed the other
>  system. I think (though I'm not sure) that the crash happened
> after the last
>  occurrence of
> 
>  cache/lvmcache.c:1484   Wiping internal VG cache
> 
>  Honestly, it looks like a wall of text to me. Does it suggest
> anything to you?
> >>>
> >>> Maybe it would help if I included the link to the pastebin where I
> put the
> >>> output: 
> >>
> >> Could the problem be with lvm+drbd?
> >>
> >> In lvm2.conf, I see this sequence of lines pre-crash:
> >>
> >> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> >> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
> >> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> >> device/dev-io.c:588   Closed /dev/md0
> >> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
> >> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> >> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> >> device/dev-io.c:588   Closed /dev/md0
> >> filters/filter-composite.c:31   Using /dev/md0
> >> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> >> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> >> label/label.c:186   /dev/md0: No label detected
> >> device/dev-io.c:588   Closed /dev/md0
> >> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
> >> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
> >> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
> >> device/dev-io.c:588   Closed /dev/drbd0
> >> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
> >> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
> >> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
> >> device/dev-io.c:588   Closed /dev/drbd0
> >>
> >> I interpret this: Look at /dev/md0, get some info, close; look at
> /dev/drbd0,
> >> get some info, close.
> >>
> >> Post-crash, I see:
> >>
> >> evice/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> >> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
> >> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> >> device/dev-io.c:588   Closed /dev/md0
> >> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
> >> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> >> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> >> device/dev-io.c:588   Closed /dev/md0
> >> filters/filter-composite.c:31   Using /dev/md0
> >> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> >> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> >> label/label.c:186   /dev/md0: No label detected
> >> device/dev-io.c:588   Closed /dev/md0
> >> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
> >> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
> >> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
> >>
> >> ... and then it hangs. Comparing the two, it looks like it can't
> close /dev/drbd0.
> >>
> >> If I look at /proc/drbd when I crash one node, I see this:
> >>
> >> # cat /proc/drbd
> >> version: 8.3.12 (api:88/proto:86-96)
> >> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
> >> r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
> >>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
> >> ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0
> ep:1 wo:b oos:0
> >
> > s- ... DRBD suspended io, most likely because of it's
> > fencing-policy. For valid dual-primary setups you have to use
> > "resource-and-stonith" policy and a working "fence-peer" handler. In
> > this mode I/O is suspended until fencing of peer was succesful.
> Question
> > is, why the peer does _not_ also suspend its I/O because obviously
> > fencin

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-23 Thread William Seligman
On 3/22/12 2:43 PM, William Seligman wrote:
> On 3/20/12 4:55 PM, Lars Ellenberg wrote:
>> On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote:
>>> On 3/16/12 12:12 PM, William Seligman wrote:
 On 3/16/12 7:02 AM, Andreas Kurz wrote:
> On 03/15/2012 11:50 PM, William Seligman wrote:
>> On 3/15/12 6:07 PM, William Seligman wrote:
>>> On 3/15/12 6:05 PM, William Seligman wrote:
 On 3/15/12 4:57 PM, emmanuel segura wrote:

> we can try to understand what happen when clvm hang
>
> edit the /etc/lvm/lvm.conf  and change level = 7 in the log session 
> and
> uncomment this line
>
> file = "/var/log/lvm2.log"

 Here's the tail end of the file (the original is 1.6M). Because there 
 no times
 in the log, it's hard for me to point you to the point where I crashed 
 the other
 system. I think (though I'm not sure) that the crash happened after 
 the last
 occurrence of

 cache/lvmcache.c:1484   Wiping internal VG cache

 Honestly, it looks like a wall of text to me. Does it suggest anything 
 to you?
>>>
>>> Maybe it would help if I included the link to the pastebin where I put 
>>> the
>>> output: 
>>
>> Could the problem be with lvm+drbd?
>>
>> In lvm2.conf, I see this sequence of lines pre-crash:
>>
>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>> device/dev-io.c:588   Closed /dev/md0
>> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>> device/dev-io.c:588   Closed /dev/md0
>> filters/filter-composite.c:31   Using /dev/md0
>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>> label/label.c:186   /dev/md0: No label detected
>> device/dev-io.c:588   Closed /dev/md0
>> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
>> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
>> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
>> device/dev-io.c:588   Closed /dev/drbd0
>> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
>> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
>> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
>> device/dev-io.c:588   Closed /dev/drbd0
>>
>> I interpret this: Look at /dev/md0, get some info, close; look at 
>> /dev/drbd0,
>> get some info, close.
>>
>> Post-crash, I see:
>>
>> evice/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>> device/dev-io.c:588   Closed /dev/md0
>> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>> device/dev-io.c:588   Closed /dev/md0
>> filters/filter-composite.c:31   Using /dev/md0
>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>> label/label.c:186   /dev/md0: No label detected
>> device/dev-io.c:588   Closed /dev/md0
>> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
>> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
>> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
>>
>> ... and then it hangs. Comparing the two, it looks like it can't close 
>> /dev/drbd0.
>>
>> If I look at /proc/drbd when I crash one node, I see this:
>>
>> # cat /proc/drbd
>> version: 8.3.12 (api:88/proto:86-96)
>> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
>> r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
>>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
>> ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 
>> wo:b oos:0
>
> s- ... DRBD suspended io, most likely because of it's
> fencing-policy. For valid dual-primary setups you have to use
> "resource-and-stonith" policy and a working "fence-peer" handler. In
> this mode I/O is suspended until fencing of peer was succesful. Question
> is, why the peer does _not_ also suspend its I/O because obviously
> fencing was not successful .
>
> So with a correct DRBD configuration one of your nodes should already
> have been fenced because of connection loss between nodes (on drbd
> replication link).
>
> You can use e.g. that nice fencing script:
>
> 
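
For reference, the policy described above corresponds to a drbd.conf fragment along
these lines (a sketch for DRBD 8.3; the handler path is an assumption, and the script
named is the stonith_admin-based one discussed elsewhere in this thread):

resource admin {
  disk {
    fencing resource-and-stonith;   # suspend I/O until the peer has been fenced
  }
  handlers {
    # hypothetical install path for stonith_admin-fence-peer.sh
    fence-peer "/usr/local/sbin/stonith_admin-fence-peer.sh";
  }
  # ... the rest of the resource definition is unchanged
}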

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-22 Thread William Seligman
On 3/22/12 2:49 PM, David Coulson wrote:
> On 3/22/12 2:43 PM, William Seligman wrote:
>>
>> I still haven't solved the problem, but this advice has gotten me further 
>> than
>> before.
>>
>> First, Lars was correct: I did not have execute permissions set on my fence 
>> peer
>> scripts. (D'oh!) I turned them on, but that did not change anything: 
>> cman+clvmd
>> still hung on the vgdisplay command if I crashed the peer node.
>>
> Does cman think the node is fenced? clvmd will block IO until the node is 
> fenced
> properly.

Let's see:

On main node, before crashing the peer node:

corosync-objctl | grep member
runtime.totem.pg.mrp.srp.members.1.ip=r(0) ip(192.168.100.207)
runtime.totem.pg.mrp.srp.members.1.join_count=1
runtime.totem.pg.mrp.srp.members.1.status=joined
runtime.totem.pg.mrp.srp.members.2.ip=r(0) ip(192.168.100.206)
runtime.totem.pg.mrp.srp.members.2.join_count=2
runtime.totem.pg.mrp.srp.members.2.status=joined

Then on peer node:

echo c > /proc/sysrq-trigger

The UPS for the peer node shuts down, which tells me the main node ran the
fencing agent. Now:

corosync-objctl | grep member
runtime.totem.pg.mrp.srp.members.1.ip=r(0) ip(192.168.100.207)
runtime.totem.pg.mrp.srp.members.1.join_count=1
runtime.totem.pg.mrp.srp.members.1.status=joined
runtime.totem.pg.mrp.srp.members.2.ip=r(0) ip(192.168.100.206)
runtime.totem.pg.mrp.srp.members.2.join_count=2
runtime.totem.pg.mrp.srp.members.2.status=left

Looks like cman knows. Is there any other way to check a node's fenced status as
far as cman is concerned?
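
Two other places to look on the cman side (the stock cman/fence utilities on RHEL 6):

cman_tool nodes   # membership as cman sees it ("M" = member, "X" = dead)
fence_tool ls     # fence domain state, including any victim still pending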

-- 
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
PO Box 137|
Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/




Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-22 Thread David Coulson


On 3/22/12 2:43 PM, William Seligman wrote:
>
>
> I still haven't solved the problem, but this advice has gotten me further than
> before.
>
> First, Lars was correct: I did not have execute permissions set on my fence 
> peer
> scripts. (D'oh!) I turned them on, but that did not change anything: 
> cman+clvmd
> still hung on the vgdisplay command if I crashed the peer node.
>
Does cman think the node is fenced? clvmd will block IO until the node 
is fenced properly.


Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-22 Thread William Seligman
On 3/20/12 4:55 PM, Lars Ellenberg wrote:
> On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote:
>> On 3/16/12 12:12 PM, William Seligman wrote:
>>> On 3/16/12 7:02 AM, Andreas Kurz wrote:
 On 03/15/2012 11:50 PM, William Seligman wrote:
> On 3/15/12 6:07 PM, William Seligman wrote:
>> On 3/15/12 6:05 PM, William Seligman wrote:
>>> On 3/15/12 4:57 PM, emmanuel segura wrote:
>>>
 we can try to understand what happen when clvm hang

 edit the /etc/lvm/lvm.conf  and change level = 7 in the log session and
 uncomment this line

 file = "/var/log/lvm2.log"
>>>
>>> Here's the tail end of the file (the original is 1.6M). Because there 
>>> no times
>>> in the log, it's hard for me to point you to the point where I crashed 
>>> the other
>>> system. I think (though I'm not sure) that the crash happened after the 
>>> last
>>> occurrence of
>>>
>>> cache/lvmcache.c:1484   Wiping internal VG cache
>>>
>>> Honestly, it looks like a wall of text to me. Does it suggest anything 
>>> to you?
>>
>> Maybe it would help if I included the link to the pastebin where I put 
>> the
>> output: 
>
> Could the problem be with lvm+drbd?
>
> In lvm2.conf, I see this sequence of lines pre-crash:
>
> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> device/dev-io.c:588   Closed /dev/md0
> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> device/dev-io.c:588   Closed /dev/md0
> filters/filter-composite.c:31   Using /dev/md0
> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> label/label.c:186   /dev/md0: No label detected
> device/dev-io.c:588   Closed /dev/md0
> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
> device/dev-io.c:588   Closed /dev/drbd0
> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
> device/dev-io.c:588   Closed /dev/drbd0
>
> I interpret this: Look at /dev/md0, get some info, close; look at 
> /dev/drbd0,
> get some info, close.
>
> Post-crash, I see:
>
> evice/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> device/dev-io.c:588   Closed /dev/md0
> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> device/dev-io.c:588   Closed /dev/md0
> filters/filter-composite.c:31   Using /dev/md0
> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> label/label.c:186   /dev/md0: No label detected
> device/dev-io.c:588   Closed /dev/md0
> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
>
> ... and then it hangs. Comparing the two, it looks like it can't close 
> /dev/drbd0.
>
> If I look at /proc/drbd when I crash one node, I see this:
>
> # cat /proc/drbd
> version: 8.3.12 (api:88/proto:86-96)
> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
> r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
> ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 
> wo:b oos:0

 s- ... DRBD suspended io, most likely because of it's
 fencing-policy. For valid dual-primary setups you have to use
 "resource-and-stonith" policy and a working "fence-peer" handler. In
 this mode I/O is suspended until fencing of peer was succesful. Question
 is, why the peer does _not_ also suspend its I/O because obviously
 fencing was not successful .

 So with a correct DRBD configuration one of your nodes should already
 have been fenced because of connection loss between nodes (on drbd
 replication link).

 You can use e.g. that nice fencing script:

 http://goo.gl/O4N8f
>>>
>>> This is the output of "drbdadm dump admin": 
>>>
>>> So I've got resource-and-stonith. I gather from an ear

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-20 Thread Lars Ellenberg
On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote:
> On 3/16/12 12:12 PM, William Seligman wrote:
> > On 3/16/12 7:02 AM, Andreas Kurz wrote:
> >> On 03/15/2012 11:50 PM, William Seligman wrote:
> >>> On 3/15/12 6:07 PM, William Seligman wrote:
>  On 3/15/12 6:05 PM, William Seligman wrote:
> > On 3/15/12 4:57 PM, emmanuel segura wrote:
> >
> >> we can try to understand what happen when clvm hang
> >>
> >> edit the /etc/lvm/lvm.conf  and change level = 7 in the log session and
> >> uncomment this line
> >>
> >> file = "/var/log/lvm2.log"
> >
> > Here's the tail end of the file (the original is 1.6M). Because there 
> > no times
> > in the log, it's hard for me to point you to the point where I crashed 
> > the other
> > system. I think (though I'm not sure) that the crash happened after the 
> > last
> > occurrence of
> >
> > cache/lvmcache.c:1484   Wiping internal VG cache
> >
> > Honestly, it looks like a wall of text to me. Does it suggest anything 
> > to you?
> 
>  Maybe it would help if I included the link to the pastebin where I put 
>  the
>  output: 
> >>>
> >>> Could the problem be with lvm+drbd?
> >>>
> >>> In lvm2.conf, I see this sequence of lines pre-crash:
> >>>
> >>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> >>> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
> >>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> >>> device/dev-io.c:588   Closed /dev/md0
> >>> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
> >>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> >>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> >>> device/dev-io.c:588   Closed /dev/md0
> >>> filters/filter-composite.c:31   Using /dev/md0
> >>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> >>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> >>> label/label.c:186   /dev/md0: No label detected
> >>> device/dev-io.c:588   Closed /dev/md0
> >>> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
> >>> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
> >>> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
> >>> device/dev-io.c:588   Closed /dev/drbd0
> >>> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
> >>> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
> >>> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
> >>> device/dev-io.c:588   Closed /dev/drbd0
> >>>
> >>> I interpret this: Look at /dev/md0, get some info, close; look at 
> >>> /dev/drbd0,
> >>> get some info, close.
> >>>
> >>> Post-crash, I see:
> >>>
> >>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> >>> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
> >>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> >>> device/dev-io.c:588   Closed /dev/md0
> >>> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
> >>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> >>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> >>> device/dev-io.c:588   Closed /dev/md0
> >>> filters/filter-composite.c:31   Using /dev/md0
> >>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> >>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> >>> label/label.c:186   /dev/md0: No label detected
> >>> device/dev-io.c:588   Closed /dev/md0
> >>> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
> >>> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
> >>> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
> >>>
> >>> ... and then it hangs. Comparing the two, it looks like it can't close 
> >>> /dev/drbd0.
> >>>
> >>> If I look at /proc/drbd when I crash one node, I see this:
> >>>
> >>> # cat /proc/drbd
> >>> version: 8.3.12 (api:88/proto:86-96)
> >>> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
> >>> r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
> >>>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
> >>> ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 
> >>> wo:b oos:0
> >>
> >> s- ... DRBD suspended io, most likely because of it's
> >> fencing-policy. For valid dual-primary setups you have to use
> >> "resource-and-stonith" policy and a working "fence-peer" handler. In
> >> this mode I/O is suspended until fencing of peer was succesful. Question
> >> is, why the peer does _not_ also suspend its I/O because obviously
> >> fencing was not successful .
> >>
> >> So with a correct DRBD configuration one of your nodes should already
> >> have been fenced because of connection loss between nodes (on drbd
> >> replication link).
> >>
> >> You can use e.g. that nice fencing script:
> >>
> >> http://goo.gl/O4N8f
> > 
> > This is the output of "drbdadm dump admin": 
> > 
> > So I've got resource-and-stonith. I gather from an earlier thread that
> > obliterate-peer.sh is 

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-16 Thread William Seligman
On 3/16/12 12:12 PM, William Seligman wrote:
> On 3/16/12 7:02 AM, Andreas Kurz wrote:
>> On 03/15/2012 11:50 PM, William Seligman wrote:
>>> On 3/15/12 6:07 PM, William Seligman wrote:
 On 3/15/12 6:05 PM, William Seligman wrote:
> On 3/15/12 4:57 PM, emmanuel segura wrote:
>
>> we can try to understand what happen when clvm hang
>>
>> edit the /etc/lvm/lvm.conf  and change level = 7 in the log session and
>> uncomment this line
>>
>> file = "/var/log/lvm2.log"
>
> Here's the tail end of the file (the original is 1.6M). Because there no 
> times
> in the log, it's hard for me to point you to the point where I crashed 
> the other
> system. I think (though I'm not sure) that the crash happened after the 
> last
> occurrence of
>
> cache/lvmcache.c:1484   Wiping internal VG cache
>
> Honestly, it looks like a wall of text to me. Does it suggest anything to 
> you?

 Maybe it would help if I included the link to the pastebin where I put the
 output: 
>>>
>>> Could the problem be with lvm+drbd?
>>>
>>> In lvm2.conf, I see this sequence of lines pre-crash:
>>>
>>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>>> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
>>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>>> device/dev-io.c:588   Closed /dev/md0
>>> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
>>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>>> device/dev-io.c:588   Closed /dev/md0
>>> filters/filter-composite.c:31   Using /dev/md0
>>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>>> label/label.c:186   /dev/md0: No label detected
>>> device/dev-io.c:588   Closed /dev/md0
>>> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
>>> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
>>> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
>>> device/dev-io.c:588   Closed /dev/drbd0
>>> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
>>> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
>>> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
>>> device/dev-io.c:588   Closed /dev/drbd0
>>>
>>> I interpret this: Look at /dev/md0, get some info, close; look at 
>>> /dev/drbd0,
>>> get some info, close.
>>>
>>> Post-crash, I see:
>>>
>>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>>> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
>>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>>> device/dev-io.c:588   Closed /dev/md0
>>> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
>>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>>> device/dev-io.c:588   Closed /dev/md0
>>> filters/filter-composite.c:31   Using /dev/md0
>>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>>> label/label.c:186   /dev/md0: No label detected
>>> device/dev-io.c:588   Closed /dev/md0
>>> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
>>> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
>>> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
>>>
>>> ... and then it hangs. Comparing the two, it looks like it can't close 
>>> /dev/drbd0.
>>>
>>> If I look at /proc/drbd when I crash one node, I see this:
>>>
>>> # cat /proc/drbd
>>> version: 8.3.12 (api:88/proto:86-96)
>>> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
>>> r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
>>>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
>>> ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 
>>> wo:b oos:0
>>
>> s- ... DRBD suspended io, most likely because of it's
>> fencing-policy. For valid dual-primary setups you have to use
>> "resource-and-stonith" policy and a working "fence-peer" handler. In
>> this mode I/O is suspended until fencing of peer was succesful. Question
>> is, why the peer does _not_ also suspend its I/O because obviously
>> fencing was not successful .
>>
>> So with a correct DRBD configuration one of your nodes should already
>> have been fenced because of connection loss between nodes (on drbd
>> replication link).
>>
>> You can use e.g. that nice fencing script:
>>
>> http://goo.gl/O4N8f
> 
> This is the output of "drbdadm dump admin": 
> 
> So I've got resource-and-stonith. I gather from an earlier thread that
> obliterate-peer.sh is more-or-less equivalent in functionality with
> stonith_admin_fence_peer.sh:
> 
> 
> 
> At the moment I'm pursuing the possibility that I'm returning the wrong return
> codes from my fencing agent:
> 
> 

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-16 Thread William Seligman
On 3/16/12 4:53 AM, emmanuel segura wrote:

> for the lvm hang you can use this in your /etc/lvm/lvm.conf
> 
> ignore_suspended_devices = 1
> 
> because i seen in the lvm log,
> 
> ===
> and then it hangs. Comparing the two, it looks like it can't close
> /dev/drbd0
> ===

No, this does not prevent the hang. I tried with both DRBD 8.3.12 and 8.4.1.

> Il giorno 15 marzo 2012 23:50, William Seligman > ha scritto:
> 
>> On 3/15/12 6:07 PM, William Seligman wrote:
>>> On 3/15/12 6:05 PM, William Seligman wrote:
 On 3/15/12 4:57 PM, emmanuel segura wrote:

> we can try to understand what happen when clvm hang
>
> edit the /etc/lvm/lvm.conf  and change level = 7 in the log session and
> uncomment this line
>
> file = "/var/log/lvm2.log"

 Here's the tail end of the file (the original is 1.6M). Because there
>> no times
 in the log, it's hard for me to point you to the point where I crashed
>> the other
 system. I think (though I'm not sure) that the crash happened after the
>> last
 occurrence of

 cache/lvmcache.c:1484   Wiping internal VG cache

 Honestly, it looks like a wall of text to me. Does it suggest anything
>> to you?
>>>
>>> Maybe it would help if I included the link to the pastebin where I put
>> the
>>> output: 
>>
>> Could the problem be with lvm+drbd?
>>
>> In lvm2.conf, I see this sequence of lines pre-crash:
>>
>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>> device/dev-io.c:588   Closed /dev/md0
>> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>> device/dev-io.c:588   Closed /dev/md0
>> filters/filter-composite.c:31   Using /dev/md0
>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>> label/label.c:186   /dev/md0: No label detected
>> device/dev-io.c:588   Closed /dev/md0
>> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
>> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
>> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
>> device/dev-io.c:588   Closed /dev/drbd0
>> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
>> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
>> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
>> device/dev-io.c:588   Closed /dev/drbd0
>>
>> I interpret this: Look at /dev/md0, get some info, close; look at
>> /dev/drbd0,
>> get some info, close.
>>
>> Post-crash, I see:
>>
>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>> device/dev-io.c:588   Closed /dev/md0
>> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>> device/dev-io.c:588   Closed /dev/md0
>> filters/filter-composite.c:31   Using /dev/md0
>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>> label/label.c:186   /dev/md0: No label detected
>> device/dev-io.c:588   Closed /dev/md0
>> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
>> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
>> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
>>
>> ... and then it hangs. Comparing the two, it looks like it can't close
>> /dev/drbd0.
>>
>> If I look at /proc/drbd when I crash one node, I see this:
>>
>> # cat /proc/drbd
>> version: 8.3.12 (api:88/proto:86-96)
>> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
>> r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
>>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
>>ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1
>> wo:b oos:0
>>
>>
>> If I look at /proc/drbd if I bring down one node gracefully (crm node
>> standby),
>> I get this:
>>
>> # cat /proc/drbd
>> version: 8.3.12 (api:88/proto:86-96)
>> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
>> r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
>>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-
>>ns:764 nr:40 dw:40 dr:7036496 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1
>> wo:b
>> oos:0
>>
>> Could it be that drbd can't respond to certain requests from lvm if the
>> state of
>> the peer is DUnknown instead of Outdated?
>>
> Il giorno 15 marzo 2012 20:50, William Seligman <
>> selig...@nevis.columbia.edu
>> ha scritto:
>
>> On 3/15/12 12:55 PM, emmanuel segura wrote:
>>
>>> I don't see any error and the answer for your question it's 

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-16 Thread William Seligman
On 3/16/12 7:02 AM, Andreas Kurz wrote:
> On 03/15/2012 11:50 PM, William Seligman wrote:
>> On 3/15/12 6:07 PM, William Seligman wrote:
>>> On 3/15/12 6:05 PM, William Seligman wrote:
 On 3/15/12 4:57 PM, emmanuel segura wrote:

> we can try to understand what happen when clvm hang
>
> edit the /etc/lvm/lvm.conf  and change level = 7 in the log session and
> uncomment this line
>
> file = "/var/log/lvm2.log"

 Here's the tail end of the file (the original is 1.6M). Because there no 
 times
 in the log, it's hard for me to point you to the point where I crashed the 
 other
 system. I think (though I'm not sure) that the crash happened after the 
 last
 occurrence of

 cache/lvmcache.c:1484   Wiping internal VG cache

 Honestly, it looks like a wall of text to me. Does it suggest anything to 
 you?
>>>
>>> Maybe it would help if I included the link to the pastebin where I put the
>>> output: 
>>
>> Could the problem be with lvm+drbd?
>>
>> In lvm2.conf, I see this sequence of lines pre-crash:
>>
>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>> device/dev-io.c:588   Closed /dev/md0
>> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>> device/dev-io.c:588   Closed /dev/md0
>> filters/filter-composite.c:31   Using /dev/md0
>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>> label/label.c:186   /dev/md0: No label detected
>> device/dev-io.c:588   Closed /dev/md0
>> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
>> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
>> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
>> device/dev-io.c:588   Closed /dev/drbd0
>> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
>> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
>> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
>> device/dev-io.c:588   Closed /dev/drbd0
>>
>> I interpret this: Look at /dev/md0, get some info, close; look at /dev/drbd0,
>> get some info, close.
>>
>> Post-crash, I see:
>>
>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>> device/dev-io.c:588   Closed /dev/md0
>> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>> device/dev-io.c:588   Closed /dev/md0
>> filters/filter-composite.c:31   Using /dev/md0
>> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
>> label/label.c:186   /dev/md0: No label detected
>> device/dev-io.c:588   Closed /dev/md0
>> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
>> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
>> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
>>
>> ... and then it hangs. Comparing the two, it looks like it can't close 
>> /dev/drbd0.
>>
>> If I look at /proc/drbd when I crash one node, I see this:
>>
>> # cat /proc/drbd
>> version: 8.3.12 (api:88/proto:86-96)
>> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
>> r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
>>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
>> ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 
>> wo:b oos:0
> 
> s- ... DRBD suspended io, most likely because of it's
> fencing-policy. For valid dual-primary setups you have to use
> "resource-and-stonith" policy and a working "fence-peer" handler. In
> this mode I/O is suspended until fencing of peer was succesful. Question
> is, why the peer does _not_ also suspend its I/O because obviously
> fencing was not successful .
> 
> So with a correct DRBD configuration one of your nodes should already
> have been fenced because of connection loss between nodes (on drbd
> replication link).
> 
> You can use e.g. that nice fencing script:
> 
> http://goo.gl/O4N8f

This is the output of "drbdadm dump admin": 

So I've got resource-and-stonith. I gather from an earlier thread that
obliterate-peer.sh is more-or-less equivalent in functionality to
stonith_admin_fence_peer.sh:
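
(For reference, the general shape of such a fence-peer handler is sketched
below. This is not the actual script from the thread; the peer's name is
assumed to arrive via one of the DRBD_PEER*/DRBD_PEERS environment variables
that DRBD sets for its handlers:)

  #!/bin/sh
  # sketch of a stonith-based fence-peer handler (illustrative only)
  PEER="$DRBD_PEER"              # adjust to the variable your DRBD version sets
  if stonith_admin --fence "$PEER"; then
      exit 7    # 7 = "peer has been fenced" in DRBD's fence-peer convention
  else
      exit 1    # fencing failed; DRBD keeps I/O suspended
  fi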



At the moment I'm pursuing the possibility that I'm returning the wrong return
codes from my fencing agent:



After that, I'll look at another suggestion with lvm.conf:



Then I'll try DRB

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-16 Thread Andreas Kurz
On 03/15/2012 11:50 PM, William Seligman wrote:
> On 3/15/12 6:07 PM, William Seligman wrote:
>> On 3/15/12 6:05 PM, William Seligman wrote:
>>> On 3/15/12 4:57 PM, emmanuel segura wrote:
>>>
 we can try to understand what happen when clvm hang

 edit the /etc/lvm/lvm.conf  and change level = 7 in the log session and
 uncomment this line

 file = "/var/log/lvm2.log"
>>>
>>> Here's the tail end of the file (the original is 1.6M). Because there no 
>>> times
>>> in the log, it's hard for me to point you to the point where I crashed the 
>>> other
>>> system. I think (though I'm not sure) that the crash happened after the last
>>> occurrence of
>>>
>>> cache/lvmcache.c:1484   Wiping internal VG cache
>>>
>>> Honestly, it looks like a wall of text to me. Does it suggest anything to 
>>> you?
>>
>> Maybe it would help if I included the link to the pastebin where I put the
>> output: 
> 
> Could the problem be with lvm+drbd?
> 
> In lvm2.conf, I see this sequence of lines pre-crash:
> 
> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> device/dev-io.c:588   Closed /dev/md0
> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> device/dev-io.c:588   Closed /dev/md0
> filters/filter-composite.c:31   Using /dev/md0
> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> label/label.c:186   /dev/md0: No label detected
> device/dev-io.c:588   Closed /dev/md0
> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
> device/dev-io.c:588   Closed /dev/drbd0
> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
> device/dev-io.c:588   Closed /dev/drbd0
> 
> I interpret this: Look at /dev/md0, get some info, close; look at /dev/drbd0,
> get some info, close.
> 
> Post-crash, I see:
> 
> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> device/dev-io.c:588   Closed /dev/md0
> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> device/dev-io.c:588   Closed /dev/md0
> filters/filter-composite.c:31   Using /dev/md0
> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> label/label.c:186   /dev/md0: No label detected
> device/dev-io.c:588   Closed /dev/md0
> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
> 
> ... and then it hangs. Comparing the two, it looks like it can't close 
> /dev/drbd0.
> 
> If I look at /proc/drbd when I crash one node, I see this:
> 
> # cat /proc/drbd
> version: 8.3.12 (api:88/proto:86-96)
> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
> r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
> ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b 
> oos:0

s- ... DRBD suspended I/O, most likely because of its
fencing policy. For valid dual-primary setups you have to use the
"resource-and-stonith" policy and a working "fence-peer" handler. In
this mode I/O is suspended until fencing of the peer is successful. The
question is why the peer does _not_ also suspend its I/O, because
obviously fencing was not successful.

So with a correct DRBD configuration one of your nodes should already
have been fenced because of the connection loss between the nodes (on the
drbd replication link).

You can use e.g. that nice fencing script:

http://goo.gl/O4N8f
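
(The matching drbd.conf pieces look roughly like the sketch below; the
resource name and handler path are illustrative, not taken from this thread:)

  resource admin {
    disk {
      fencing resource-and-stonith;
    }
    handlers {
      fence-peer "/usr/lib/drbd/stonith_admin-fence-peer.sh";
      # or obliterate-peer.sh, or any handler that really fences the peer
    }
    # ... rest of the resource definition
  }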

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> 
> If I look at /proc/drbd if I bring down one node gracefully (crm node 
> standby),
> I get this:
> 
> # cat /proc/drbd
> version: 8.3.12 (api:88/proto:86-96)
> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
> r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-
> ns:764 nr:40 dw:40 dr:7036496 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 
> wo:b
> oos:0
> 
> Could it be that drbd can't respond to certain requests from lvm if the state 
> of
> the peer is DUnknown instead of Outdated?
> 
 Il giorno 15 marzo 2012 20:50, William Seligman 
  ha scritto:

> On 3/15/12 12:55 PM, emmanuel se

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-16 Thread emmanuel segura
Hello William

for the lvm hang you can use this in your /etc/lvm/lvm.conf

ignore_suspended_devices = 1

because I saw this in the lvm log:

===
and then it hangs. Comparing the two, it looks like it can't close
/dev/drbd0
===
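
(That option lives in the devices section of lvm.conf; a minimal sketch:)

  devices {
      # skip devices whose I/O is suspended (e.g. a drbd with frozen I/O)
      # instead of hanging on them during scans
      ignore_suspended_devices = 1
      # ... rest of the devices section
  }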



On 15 March 2012 at 23:50, William Seligman wrote:

> On 3/15/12 6:07 PM, William Seligman wrote:
> > On 3/15/12 6:05 PM, William Seligman wrote:
> >> On 3/15/12 4:57 PM, emmanuel segura wrote:
> >>
> >>> we can try to understand what happen when clvm hang
> >>>
> >>> edit the /etc/lvm/lvm.conf  and change level = 7 in the log session and
> >>> uncomment this line
> >>>
> >>> file = "/var/log/lvm2.log"
> >>
> >> Here's the tail end of the file (the original is 1.6M). Because there
> no times
> >> in the log, it's hard for me to point you to the point where I crashed
> the other
> >> system. I think (though I'm not sure) that the crash happened after the
> last
> >> occurrence of
> >>
> >> cache/lvmcache.c:1484   Wiping internal VG cache
> >>
> >> Honestly, it looks like a wall of text to me. Does it suggest anything
> to you?
> >
> > Maybe it would help if I included the link to the pastebin where I put
> the
> > output: 
>
> Could the problem be with lvm+drbd?
>
> In lvm2.conf, I see this sequence of lines pre-crash:
>
> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> device/dev-io.c:588   Closed /dev/md0
> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> device/dev-io.c:588   Closed /dev/md0
> filters/filter-composite.c:31   Using /dev/md0
> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> label/label.c:186   /dev/md0: No label detected
> device/dev-io.c:588   Closed /dev/md0
> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
> device/dev-io.c:588   Closed /dev/drbd0
> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
> device/dev-io.c:588   Closed /dev/drbd0
>
> I interpret this: Look at /dev/md0, get some info, close; look at
> /dev/drbd0,
> get some info, close.
>
> Post-crash, I see:
>
> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> device/dev-io.c:588   Closed /dev/md0
> device/dev-io.c:271   /dev/md0: size is 1027968 sectors
> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> device/dev-io.c:588   Closed /dev/md0
> filters/filter-composite.c:31   Using /dev/md0
> device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:137   /dev/md0: block size is 1024 bytes
> label/label.c:186   /dev/md0: No label detected
> device/dev-io.c:588   Closed /dev/md0
> device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
> device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
> device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
>
> ... and then it hangs. Comparing the two, it looks like it can't close
> /dev/drbd0.
>
> If I look at /proc/drbd when I crash one node, I see this:
>
> # cat /proc/drbd
> version: 8.3.12 (api:88/proto:86-96)
> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
> r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
>ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1
> wo:b oos:0
>
>
> If I look at /proc/drbd if I bring down one node gracefully (crm node
> standby),
> I get this:
>
> # cat /proc/drbd
> version: 8.3.12 (api:88/proto:86-96)
> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
> r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-
>ns:764 nr:40 dw:40 dr:7036496 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1
> wo:b
> oos:0
>
> Could it be that drbd can't respond to certain requests from lvm if the
> state of
> the peer is DUnknown instead of Outdated?
>
> >>> Il giorno 15 marzo 2012 20:50, William Seligman <
> selig...@nevis.columbia.edu
>  ha scritto:
> >>>
>  On 3/15/12 12:55 PM, emmanuel segura wrote:
> 
> > I don't see any error and the answer for your question it's yes
> >
> > can you show me your /etc/cluster/cluster.conf and your crm configure
>  show
> >
> > like that more later i can try to look if i found some fix
> 
>  Thanks for taking a look.
>

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-15 Thread William Seligman
On 3/15/12 6:07 PM, William Seligman wrote:
> On 3/15/12 6:05 PM, William Seligman wrote:
>> On 3/15/12 4:57 PM, emmanuel segura wrote:
>>
>>> we can try to understand what happen when clvm hang
>>>
>>> edit the /etc/lvm/lvm.conf  and change level = 7 in the log session and
>>> uncomment this line
>>>
>>> file = "/var/log/lvm2.log"
>>
>> Here's the tail end of the file (the original is 1.6M). Because there no 
>> times
>> in the log, it's hard for me to point you to the point where I crashed the 
>> other
>> system. I think (though I'm not sure) that the crash happened after the last
>> occurrence of
>>
>> cache/lvmcache.c:1484   Wiping internal VG cache
>>
>> Honestly, it looks like a wall of text to me. Does it suggest anything to 
>> you?
> 
> Maybe it would help if I included the link to the pastebin where I put the
> output: 

Could the problem be with lvm+drbd?

In lvm2.log, I see this sequence of lines pre-crash:

device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
device/dev-io.c:271   /dev/md0: size is 1027968 sectors
device/dev-io.c:137   /dev/md0: block size is 1024 bytes
device/dev-io.c:588   Closed /dev/md0
device/dev-io.c:271   /dev/md0: size is 1027968 sectors
device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
device/dev-io.c:137   /dev/md0: block size is 1024 bytes
device/dev-io.c:588   Closed /dev/md0
filters/filter-composite.c:31   Using /dev/md0
device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
device/dev-io.c:137   /dev/md0: block size is 1024 bytes
label/label.c:186   /dev/md0: No label detected
device/dev-io.c:588   Closed /dev/md0
device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
device/dev-io.c:588   Closed /dev/drbd0
device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
device/dev-io.c:588   Closed /dev/drbd0

I interpret this: Look at /dev/md0, get some info, close; look at /dev/drbd0,
get some info, close.

Post-crash, I see:

device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
device/dev-io.c:271   /dev/md0: size is 1027968 sectors
device/dev-io.c:137   /dev/md0: block size is 1024 bytes
device/dev-io.c:588   Closed /dev/md0
device/dev-io.c:271   /dev/md0: size is 1027968 sectors
device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
device/dev-io.c:137   /dev/md0: block size is 1024 bytes
device/dev-io.c:588   Closed /dev/md0
filters/filter-composite.c:31   Using /dev/md0
device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
device/dev-io.c:137   /dev/md0: block size is 1024 bytes
label/label.c:186   /dev/md0: No label detected
device/dev-io.c:588   Closed /dev/md0
device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes

... and then it hangs. Comparing the two, it looks like it can't close 
/dev/drbd0.

If I look at /proc/drbd when I crash one node, I see this:

# cat /proc/drbd
version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b 
oos:0


If I look at /proc/drbd after I bring down one node gracefully (crm node standby),
I get this:

# cat /proc/drbd
version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-
ns:764 nr:40 dw:40 dr:7036496 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b
oos:0

Could it be that drbd can't respond to certain requests from lvm if the state of
the peer is DUnknown instead of Outdated?
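
(For anyone following along, the peer state can also be read directly with the
standard drbd tools rather than /proc/drbd; these are generic commands, not
output from this thread:)

  drbdadm cstate admin    # connection state, e.g. WFConnection
  drbdadm dstate admin    # disk states, e.g. UpToDate/DUnknown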

>>> Il giorno 15 marzo 2012 20:50, William Seligman >>> ha scritto:
>>>
 On 3/15/12 12:55 PM, emmanuel segura wrote:

> I don't see any error and the answer for your question it's yes
>
> can you show me your /etc/cluster/cluster.conf and your crm configure
 show
>
> like that more later i can try to look if i found some fix

 Thanks for taking a look.

 My cluster.conf: 
 crm configure show: 

 Before you spend a lot of time on the second file, remember that clvmd
 will hang
 whether or not I'm running pacemaker.

> Il giorno 15 marzo 2012 17:42, William Seligman <
 selig...@nevis.columbia.edu
>> ha scritto:
>
>> On 3/15/12 12:15 PM, emmanuel segura wrote:
>>
>>> Ho did you created your volume group
>>
>> pvcreate /dev/drbd0
>> vgcreate -c y ADMIN /dev/drbd0
>> lvcreate -L 200G -n usr ADMIN # ... and so on
>> # "Nevis-HA" is

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-15 Thread William Seligman
On 3/15/12 6:05 PM, William Seligman wrote:
> On 3/15/12 4:57 PM, emmanuel segura wrote:
> 
>> we can try to understand what happen when clvm hang
>>
>> edit the /etc/lvm/lvm.conf  and change level = 7 in the log session and
>> uncomment this line
>>
>> file = "/var/log/lvm2.log"
> 
> Here's the tail end of the file (the original is 1.6M). Because there no times
> in the log, it's hard for me to point you to the point where I crashed the 
> other
> system. I think (though I'm not sure) that the crash happened after the last
> occurrence of
> 
> cache/lvmcache.c:1484   Wiping internal VG cache
> 
> Honestly, it looks like a wall of text to me. Does it suggest anything to you?

Maybe it would help if I included the link to the pastebin where I put the
output: 

>> Il giorno 15 marzo 2012 20:50, William Seligman >> ha scritto:
>>
>>> On 3/15/12 12:55 PM, emmanuel segura wrote:
>>>
 I don't see any error and the answer for your question it's yes

 can you show me your /etc/cluster/cluster.conf and your crm configure
>>> show

 like that more later i can try to look if i found some fix
>>>
>>> Thanks for taking a look.
>>>
>>> My cluster.conf: 
>>> crm configure show: 
>>>
>>> Before you spend a lot of time on the second file, remember that clvmd
>>> will hang
>>> whether or not I'm running pacemaker.
>>>
 Il giorno 15 marzo 2012 17:42, William Seligman <
>>> selig...@nevis.columbia.edu
> ha scritto:

> On 3/15/12 12:15 PM, emmanuel segura wrote:
>
>> Ho did you created your volume group
>
> pvcreate /dev/drbd0
> vgcreate -c y ADMIN /dev/drbd0
> lvcreate -L 200G -n usr ADMIN # ... and so on
> # "Nevis-HA" is the cluster name I used in cluster.conf
> mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr  # ... and so
>>> on
>
>> give me the output of vgs command when the cluster it's up
>
> Here it is:
>
>Logging initialised at Thu Mar 15 12:40:39 2012
>Set umask from 0022 to 0077
>Finding all volume groups
>Finding volume group "ROOT"
>Finding volume group "ADMIN"
>  VG#PV #LV #SN Attr   VSize   VFree
>  ADMIN   1   5   0 wz--nc   2.61t 765.79g
>  ROOT1   2   0 wz--n- 117.16g  0
>Wiping internal VG cache
>
> I assume the "c" in the ADMIN attributes means that clustering is turned
> on?
>
>> Il giorno 15 marzo 2012 17:06, William Seligman <
> selig...@nevis.columbia.edu
>>> ha scritto:
>>
>>> On 3/15/12 11:50 AM, emmanuel segura wrote:
 yes william

 Now try clvmd -d and see what happen

 locking_type = 3 it's lvm cluster lock type
>>>
>>> Since you asked for confirmation, here it is: the output of 'clvmd -d'
>>> just now. . I crashed the other node at
>>> Mar 15 12:02:35, when you see the only additional line of output.
>>>
>>> I don't see any particular difference between this and the previous
>>> result , which suggests that I had
>>> cluster locking enabled before, and still do now.
>>>
 Il giorno 15 marzo 2012 16:15, William Seligman <
>>> selig...@nevis.columbia.edu
> ha scritto:

> On 3/15/12 5:18 AM, emmanuel segura wrote:
>
>> The first thing i seen in your clvmd log it's this
>>
>> =
>>  WARNING: Locking disabled. Be careful! This could corrupt your
>>> metadata.
>> =
>
> I saw that too, and thought the same as you did. I did some checks
> (see below), but some web searches suggest that this message is a
> normal consequence of clvmd initialization; e.g.,
>
> 
>
>> use this command
>>
>> lvmconf --enable-cluster
>>
>> and remember for cman+pacemaker you don't need qdisk
>
> Before I tried your lvmconf suggestion, here was my
>>> /etc/lvm/lvm.conf:
>  and the output of "lvm dumpconfig":
> .
>
> Then I did as you suggested, but with a check to see if anything
> changed:
>
> # cd /etc/lvm/
> # cp lvm.conf lvm.conf.cluster
> # lvmconf --enable-cluster
> # diff lvm.conf lvm.conf.cluster
> #
>
> So the key lines have been there all along:
>locking_type = 3
>fallback_to_local_locking = 0
>
>
>> Il giorno 14 marzo 2012 23:17, William Seligman <
> selig...@nevis.columbia.edu
>>> ha scritto:
>>
>>> On 3/14/12 9:20 AM, emmanuel segu

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-15 Thread William Seligman
On 3/15/12 4:57 PM, emmanuel segura wrote:

> we can try to understand what happen when clvm hang
> 
> edit the /etc/lvm/lvm.conf  and change level = 7 in the log session and
> uncomment this line
> 
> file = "/var/log/lvm2.log"

Here's the tail end of the file (the original is 1.6M). Because there are no
timestamps in the log, it's hard for me to point you to the place where I
crashed the other system. I think (though I'm not sure) that the crash happened
after the last occurrence of

cache/lvmcache.c:1484   Wiping internal VG cache

Honestly, it looks like a wall of text to me. Does it suggest anything to you?

> Il giorno 15 marzo 2012 20:50, William Seligman > ha scritto:
> 
>> On 3/15/12 12:55 PM, emmanuel segura wrote:
>>
>>> I don't see any error and the answer for your question it's yes
>>>
>>> can you show me your /etc/cluster/cluster.conf and your crm configure
>> show
>>>
>>> like that more later i can try to look if i found some fix
>>
>> Thanks for taking a look.
>>
>> My cluster.conf: 
>> crm configure show: 
>>
>> Before you spend a lot of time on the second file, remember that clvmd
>> will hang
>> whether or not I'm running pacemaker.
>>
>>> Il giorno 15 marzo 2012 17:42, William Seligman <
>> selig...@nevis.columbia.edu
 ha scritto:
>>>
 On 3/15/12 12:15 PM, emmanuel segura wrote:

> Ho did you created your volume group

 pvcreate /dev/drbd0
 vgcreate -c y ADMIN /dev/drbd0
 lvcreate -L 200G -n usr ADMIN # ... and so on
 # "Nevis-HA" is the cluster name I used in cluster.conf
 mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr  # ... and so
>> on

> give me the output of vgs command when the cluster it's up

 Here it is:

Logging initialised at Thu Mar 15 12:40:39 2012
Set umask from 0022 to 0077
Finding all volume groups
Finding volume group "ROOT"
Finding volume group "ADMIN"
  VG#PV #LV #SN Attr   VSize   VFree
  ADMIN   1   5   0 wz--nc   2.61t 765.79g
  ROOT1   2   0 wz--n- 117.16g  0
Wiping internal VG cache

 I assume the "c" in the ADMIN attributes means that clustering is turned
 on?

> Il giorno 15 marzo 2012 17:06, William Seligman <
 selig...@nevis.columbia.edu
>> ha scritto:
>
>> On 3/15/12 11:50 AM, emmanuel segura wrote:
>>> yes william
>>>
>>> Now try clvmd -d and see what happen
>>>
>>> locking_type = 3 it's lvm cluster lock type
>>
>> Since you asked for confirmation, here it is: the output of 'clvmd -d'
>> just now. . I crashed the other node at
>> Mar 15 12:02:35, when you see the only additional line of output.
>>
>> I don't see any particular difference between this and the previous
>> result , which suggests that I had
>> cluster locking enabled before, and still do now.
>>
>>> Il giorno 15 marzo 2012 16:15, William Seligman <
>> selig...@nevis.columbia.edu
 ha scritto:
>>>
 On 3/15/12 5:18 AM, emmanuel segura wrote:

> The first thing i seen in your clvmd log it's this
>
> =
>  WARNING: Locking disabled. Be careful! This could corrupt your
>> metadata.
> =

 I saw that too, and thought the same as you did. I did some checks
 (see below), but some web searches suggest that this message is a
 normal consequence of clvmd initialization; e.g.,

 

> use this command
>
> lvmconf --enable-cluster
>
> and remember for cman+pacemaker you don't need qdisk

 Before I tried your lvmconf suggestion, here was my
>> /etc/lvm/lvm.conf:
  and the output of "lvm dumpconfig":
 .

 Then I did as you suggested, but with a check to see if anything
 changed:

 # cd /etc/lvm/
 # cp lvm.conf lvm.conf.cluster
 # lvmconf --enable-cluster
 # diff lvm.conf lvm.conf.cluster
 #

 So the key lines have been there all along:
locking_type = 3
fallback_to_local_locking = 0


> Il giorno 14 marzo 2012 23:17, William Seligman <
 selig...@nevis.columbia.edu
>> ha scritto:
>
>> On 3/14/12 9:20 AM, emmanuel segura wrote:
>>> Hello William
>>>
>>> i did new you are using drbd and i dont't know what type of
>>> configuration you using
>>>
>>> But it's better you try to start clvm with clvmd -d
>>>
>>> like thak we can see what it's the problem
>

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-15 Thread Lars Marowsky-Bree
On 2012-03-15T15:59:21, William Seligman  wrote:

> Could this be an issue? I've noticed that my fencing agent always seems to be
> called with "action=reboot" when a node is fenced. Why is it using 'reboot' 
> and
> not 'off'? Is this the standard, or am I missing a definition somewhere?

Make sure that stonith-timeout is set high enough; if you need at least
21s guaranteed, set it to 60s. (Remember that timeouts are a last line
of defense, and don't speed things up.)

And yes, it uses reboot because that is the default. Look at
stonith-action in the CIB.
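
(Both of those are cluster properties; with the crm shell the settings look
something like the lines below -- the values are examples, not a
recommendation:)

  crm configure property stonith-timeout=60s
  crm configure property stonith-action=off    # the default is "reboot"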


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-15 Thread emmanuel segura
Ok William

we can try to understand what happens when clvmd hangs

edit /etc/lvm/lvm.conf, change level = 7 in the log section, and
uncomment this line:

file = "/var/log/lvm2.log"
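
(The log section of /etc/lvm/lvm.conf should then look roughly like this:)

  log {
      verbose = 0
      file = "/var/log/lvm2.log"
      level = 7        # maximum debug detail
      # ... rest of the log section
  }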

On 15 March 2012 at 23:50, William Seligman wrote:

> On 3/15/12 12:55 PM, emmanuel segura wrote:
>
> > I don't see any error and the answer for your question it's yes
> >
> > can you show me your /etc/cluster/cluster.conf and your crm configure
> show
> >
> > like that more later i can try to look if i found some fix
>
> Thanks for taking a look.
>
> My cluster.conf: 
> crm configure show: 
>
> Before you spend a lot of time on the second file, remember that clvmd
> will hang
> whether or not I'm running pacemaker.
>
> > Il giorno 15 marzo 2012 17:42, William Seligman <
> selig...@nevis.columbia.edu
> >> ha scritto:
> >
> >> On 3/15/12 12:15 PM, emmanuel segura wrote:
> >>
> >>> Ho did you created your volume group
> >>
> >> pvcreate /dev/drbd0
> >> vgcreate -c y ADMIN /dev/drbd0
> >> lvcreate -L 200G -n usr ADMIN # ... and so on
> >> # "Nevis-HA" is the cluster name I used in cluster.conf
> >> mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr  # ... and so
> on
> >>
> >>> give me the output of vgs command when the cluster it's up
> >>
> >> Here it is:
> >>
> >>Logging initialised at Thu Mar 15 12:40:39 2012
> >>Set umask from 0022 to 0077
> >>Finding all volume groups
> >>Finding volume group "ROOT"
> >>Finding volume group "ADMIN"
> >>  VG#PV #LV #SN Attr   VSize   VFree
> >>  ADMIN   1   5   0 wz--nc   2.61t 765.79g
> >>  ROOT1   2   0 wz--n- 117.16g  0
> >>Wiping internal VG cache
> >>
> >> I assume the "c" in the ADMIN attributes means that clustering is turned
> >> on?
> >>
> >>> Il giorno 15 marzo 2012 17:06, William Seligman <
> >> selig...@nevis.columbia.edu
>  ha scritto:
> >>>
>  On 3/15/12 11:50 AM, emmanuel segura wrote:
> > yes william
> >
> > Now try clvmd -d and see what happen
> >
> > locking_type = 3 it's lvm cluster lock type
> 
>  Since you asked for confirmation, here it is: the output of 'clvmd -d'
>  just now. . I crashed the other node at
>  Mar 15 12:02:35, when you see the only additional line of output.
> 
>  I don't see any particular difference between this and the previous
>  result , which suggests that I had
>  cluster locking enabled before, and still do now.
> 
> > Il giorno 15 marzo 2012 16:15, William Seligman <
>  selig...@nevis.columbia.edu
> >> ha scritto:
> >
> >> On 3/15/12 5:18 AM, emmanuel segura wrote:
> >>
> >>> The first thing i seen in your clvmd log it's this
> >>>
> >>> =
> >>>  WARNING: Locking disabled. Be careful! This could corrupt your
> metadata.
> >>> =
> >>
> >> I saw that too, and thought the same as you did. I did some checks
> >> (see below), but some web searches suggest that this message is a
> >> normal consequence of clvmd initialization; e.g.,
> >>
> >> 
> >>
> >>> use this command
> >>>
> >>> lvmconf --enable-cluster
> >>>
> >>> and remember for cman+pacemaker you don't need qdisk
> >>
> >> Before I tried your lvmconf suggestion, here was my
> /etc/lvm/lvm.conf:
> >>  and the output of "lvm dumpconfig":
> >> .
> >>
> >> Then I did as you suggested, but with a check to see if anything
> >> changed:
> >>
> >> # cd /etc/lvm/
> >> # cp lvm.conf lvm.conf.cluster
> >> # lvmconf --enable-cluster
> >> # diff lvm.conf lvm.conf.cluster
> >> #
> >>
> >> So the key lines have been there all along:
> >>locking_type = 3
> >>fallback_to_local_locking = 0
> >>
> >>
> >>> Il giorno 14 marzo 2012 23:17, William Seligman <
> >> selig...@nevis.columbia.edu
>  ha scritto:
> >>>
>  On 3/14/12 9:20 AM, emmanuel segura wrote:
> > Hello William
> >
> > i did new you are using drbd and i dont't know what type of
> > configuration you using
> >
> > But it's better you try to start clvm with clvmd -d
> >
> > like thak we can see what it's the problem
> 
>  For what it's worth, here's the output of running clvmd -d on
>  the node that stays up: 
> 
>  What's probably important in that big mass of output are the
>  last two lines. Up to that point, I have both nodes up and
>  running cman + clvmd; cluster.conf is here:
>  
> 
>  At the time of the next-to-the-last

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-15 Thread William Seligman
On 3/15/12 3:45 PM, Vladislav Bogdanov wrote:
> 15.03.2012 18:43, William Seligman wrote:
>> On 3/15/12 3:43 AM, Vladislav Bogdanov wrote:
>>> 14.03.2012 00:42, William Seligman wrote:
>>> [snip]
 These were the log messages, which show that stonith_admin did its job and 
 CMAN
 was notified of the fencing: .
>>>
>>> Could you please look at the output of 'dlm_tool ls' and 'dlm_tool dump'?
>>>
>>> You probably have 'kern_stop' and 'fencing' flags there. That means that
>>> dlm is unaware that node is fenced.
>>
>> Here's 'dlm_tool ls' with both nodes running cman+clvmd+gfs2:
>> 
>>
>> 'dlm_tool dump': 
>>
>> For comparison, I crashed one node and looked at the same output on the
>> remaining node:
>> dlm_tool ls: 
>> dlm_tool dump:  (the post-crash lines begin at
>> 1331824940)
> 
> Everything is fine there, dlm correctly understands that node is fenced
> and returns to a normal state.
> 
> The only minor issue I see is that fencing took much time - 21 sec.

Hmm. My fencing agent works by toggling the power on a UPS. If all the agent
does is "action=off", it will cut power immediately. But if you tell it
"action=reboot", it will cut the load, wait 10 seconds, then turn the load back
on again; I found I needed that delay because otherwise the UPS might
confuse/overlap/ignore sequential commands.

Could this be an issue? I've noticed that my fencing agent always seems to be
called with "action=reboot" when a node is fenced. Why is it using 'reboot' and
not 'off'? Is this the standard, or am I missing a definition somewhere?
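
(In outline, the action handling in such a UPS agent is just a case switch;
the sketch below uses made-up helper names, it is not the actual agent:)

  case "$action" in
      off)    ups_load_off ;;
      on)     ups_load_on ;;
      reboot) ups_load_off
              sleep 10          # the delay described above
              ups_load_on ;;
      status) ups_status ;;
      *)      exit 1 ;;
  esac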

>>
>> I don't see the "kern_stop" or "fencing" flags. There's another thing I don't
>> see: at the top of 'dlm_tool dump' it displays most of the contents of my
>> cluster.conf file, except for the fencing sections. Here's my cluster.conf 
>> for
>> comparison: 
> 
> It also looks correct (I mean fence_pcmk), but I can be wrong here, I do
> not use cman.
> 
>>
>> cman doesn't see anything wrong in my cluster.conf file:
>>
>> # ccs_config_validate
>> Configuration validates
>>
>> But could there be something that's causing the fencing sections to be 
>> ignored?
>>

 Unfortunately, I still got the gfs2 freeze, so this is not the complete 
 story.
>>>
>>> Both clvmd and gfs2 use dlm. If dlm layer thinks fencing is not
>>> completed, both of them freeze.
>>
>> I did 'grep -E "(dlm|clvm|fenc)" /var/log/messages' and looked at the time I
>> crashed the node: . I see lines that indicate 
>> that
>> pacemaker and drbd are fencing the node, but nothing from dlm or clvmd. Does
>> this indicate what you suggest: Could dlm somehow be ignoring or overlooking 
>> the
>> fencing I put in? Is there any other way to check this?
> 
> No, dlm_controld (and friends) mostly uses different logging method -
> that is what you see in dlm_tool dump.

-- 
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
PO Box 137|
Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/




Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-15 Thread William Seligman
On 3/15/12 12:55 PM, emmanuel segura wrote:

> I don't see any error and the answer for your question it's yes
> 
> can you show me your /etc/cluster/cluster.conf and your crm configure show
> 
> like that more later i can try to look if i found some fix

Thanks for taking a look.

My cluster.conf: 
crm configure show: 

Before you spend a lot of time on the second file, remember that clvmd will hang
whether or not I'm running pacemaker.

> Il giorno 15 marzo 2012 17:42, William Seligman > ha scritto:
> 
>> On 3/15/12 12:15 PM, emmanuel segura wrote:
>>
>>> Ho did you created your volume group
>>
>> pvcreate /dev/drbd0
>> vgcreate -c y ADMIN /dev/drbd0
>> lvcreate -L 200G -n usr ADMIN # ... and so on
>> # "Nevis-HA" is the cluster name I used in cluster.conf
>> mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr  # ... and so on
>>
>>> give me the output of vgs command when the cluster it's up
>>
>> Here it is:
>>
>>Logging initialised at Thu Mar 15 12:40:39 2012
>>Set umask from 0022 to 0077
>>Finding all volume groups
>>Finding volume group "ROOT"
>>Finding volume group "ADMIN"
>>  VG#PV #LV #SN Attr   VSize   VFree
>>  ADMIN   1   5   0 wz--nc   2.61t 765.79g
>>  ROOT1   2   0 wz--n- 117.16g  0
>>Wiping internal VG cache
>>
>> I assume the "c" in the ADMIN attributes means that clustering is turned
>> on?
>>
>>> Il giorno 15 marzo 2012 17:06, William Seligman <
>> selig...@nevis.columbia.edu
 ha scritto:
>>>
 On 3/15/12 11:50 AM, emmanuel segura wrote:
> yes william
>
> Now try clvmd -d and see what happen
>
> locking_type = 3 it's lvm cluster lock type

 Since you asked for confirmation, here it is: the output of 'clvmd -d' 
 just now. . I crashed the other node at
 Mar 15 12:02:35, when you see the only additional line of output.

 I don't see any particular difference between this and the previous
 result , which suggests that I had
 cluster locking enabled before, and still do now.

> Il giorno 15 marzo 2012 16:15, William Seligman <
 selig...@nevis.columbia.edu
>> ha scritto:
>
>> On 3/15/12 5:18 AM, emmanuel segura wrote:
>>
>>> The first thing i seen in your clvmd log it's this
>>>
>>> =
>>>  WARNING: Locking disabled. Be careful! This could corrupt your 
>>> metadata.
>>> =
>>
>> I saw that too, and thought the same as you did. I did some checks
>> (see below), but some web searches suggest that this message is a
>> normal consequence of clvmd initialization; e.g.,
>>
>> 
>>
>>> use this command
>>>
>>> lvmconf --enable-cluster
>>>
>>> and remember for cman+pacemaker you don't need qdisk
>>
>> Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf:
>>  and the output of "lvm dumpconfig":
>> .
>>
>> Then I did as you suggested, but with a check to see if anything
>> changed:
>>
>> # cd /etc/lvm/
>> # cp lvm.conf lvm.conf.cluster
>> # lvmconf --enable-cluster
>> # diff lvm.conf lvm.conf.cluster
>> #
>>
>> So the key lines have been there all along:
>>locking_type = 3
>>fallback_to_local_locking = 0
>>
>>
>>> Il giorno 14 marzo 2012 23:17, William Seligman <
>> selig...@nevis.columbia.edu
 ha scritto:
>>>
 On 3/14/12 9:20 AM, emmanuel segura wrote:
> Hello William
>
> i did new you are using drbd and i dont't know what type of 
> configuration you using
>
> But it's better you try to start clvm with clvmd -d
>
> like thak we can see what it's the problem

 For what it's worth, here's the output of running clvmd -d on
 the node that stays up: 

 What's probably important in that big mass of output are the
 last two lines. Up to that point, I have both nodes up and
 running cman + clvmd; cluster.conf is here:
 

 At the time of the next-to-the-last line, I cut power to the
 other node.

 At the time of the last line, I run "vgdisplay" on the
 remaining node, which hangs forever.

 After a lot of web searching, I found that I'm not the only one
 with this problem. Here's one case that doesn't seem relevant
 to me, since I don't use qdisk:
 .
 Here's one with the same problem with the same OS:
 

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-15 Thread Vladislav Bogdanov
15.03.2012 18:43, William Seligman wrote:
> On 3/15/12 3:43 AM, Vladislav Bogdanov wrote:
>> 14.03.2012 00:42, William Seligman wrote:
>> [snip]
>>> These were the log messages, which show that stonith_admin did its job and 
>>> CMAN
>>> was notified of the fencing: .
>>
>> Could you please look at the output of 'dlm_tool ls' and 'dlm_tool dump'?
>>
>> You probably have 'kern_stop' and 'fencing' flags there. That means that
>> dlm is unaware that node is fenced.
> 
> Here's 'dlm_tool ls' with both nodes running cman+clvmd+gfs2:
> 
> 
> 'dlm_tool dump': 
> 
> For comparison, I crashed one node and looked at the same output on the
> remaining node:
> dlm_tool ls: 
> dlm_tool dump:  (the post-crash lines begin at
> 1331824940)

Everything is fine there; dlm correctly understands that the node has been fenced
and returns to a normal state.

The only minor issue I see is that fencing took a long time - 21 sec.

> 
> I don't see the "kern_stop" or "fencing" flags. There's another thing I don't
> see: at the top of 'dlm_tool dump' it displays most of the contents of my
> cluster.conf file, except for the fencing sections. Here's my cluster.conf for
> comparison: 

It also looks correct (I mean fence_pcmk), but I could be wrong here; I do
not use cman.

> 
> cman doesn't see anything wrong in my cluster.conf file:
> 
> # ccs_config_validate
> Configuration validates
> 
> But could there be something that's causing the fencing sections to be 
> ignored?
> 
>>>
>>> Unfortunately, I still got the gfs2 freeze, so this is not the complete 
>>> story.
>>
>> Both clvmd and gfs2 use dlm. If dlm layer thinks fencing is not
>> completed, both of them freeze.
> 
> I did 'grep -E "(dlm|clvm|fenc)" /var/log/messages' and looked at the time I
> crashed the node: . I see lines that indicate 
> that
> pacemaker and drbd are fencing the node, but nothing from dlm or clvmd. Does
> this indicate what you suggest: Could dlm somehow be ignoring or overlooking 
> the
> fencing I put in? Is there any other way to check this?

No, dlm_controld (and friends) mostly use a different logging method -
that is what you see in dlm_tool dump.


Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-15 Thread emmanuel segura
Hello William

I don't see any error, and the answer to your question is yes.

Can you show me your /etc/cluster/cluster.conf and your "crm configure show" output?

That way I can look them over later and try to find a fix.

On 15 March 2012 17:42, William Seligman  wrote:

> On 3/15/12 12:15 PM, emmanuel segura wrote:
>
> > Ho did you created your volume group
>
> pvcreate /dev/drbd0
> vgcreate -c y ADMIN /dev/drbd0
> lvcreate -L 200G -n usr ADMIN # ... and so on
> # "Nevis-HA" is the cluster name I used in cluster.conf
> mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr  # ... and so on
>
> > give me the output of vgs command when the cluster it's up
>
> Here it is:
>
>Logging initialised at Thu Mar 15 12:40:39 2012
>Set umask from 0022 to 0077
>Finding all volume groups
>Finding volume group "ROOT"
>Finding volume group "ADMIN"
>  VG#PV #LV #SN Attr   VSize   VFree
>  ADMIN   1   5   0 wz--nc   2.61t 765.79g
>  ROOT1   2   0 wz--n- 117.16g  0
>Wiping internal VG cache
>
> I assume the "c" in the ADMIN attributes means that clustering is turned
> on?
>
> > Il giorno 15 marzo 2012 17:06, William Seligman <
> selig...@nevis.columbia.edu
> >> ha scritto:
> >
> >> On 3/15/12 11:50 AM, emmanuel segura wrote:
> >>> yes william
> >>>
> >>> Now try clvmd -d and see what happen
> >>>
> >>> locking_type = 3 it's lvm cluster lock type
> >>
> >> Since you asked for confirmation, here it is: the output of 'clvmd -d'
> >> just now.
> >> . I crashed the other node at Mar 15
> >> 12:02:35,
> >> when you see the only additional line of output.
> >>
> >> I don't see any particular difference between this and the previous
> result
> >> , which suggests that I had cluster
> locking
> >> enabled before, and still do now.
> >>
> >>> Il giorno 15 marzo 2012 16:15, William Seligman <
> >> selig...@nevis.columbia.edu
>  ha scritto:
> >>>
>  On 3/15/12 5:18 AM, emmanuel segura wrote:
> 
> > The first thing i seen in your clvmd log it's this
> >
> > =
> >  WARNING: Locking disabled. Be careful! This could corrupt your
> >> metadata.
> > =
> 
>  I saw that too, and thought the same as you did. I did some checks
> (see
>  below),
>  but some web searches suggest that this message is a normal
> consequence
> >> of
>  clvmd
>  initialization; e.g.,
> 
>  
> 
> > use this command
> >
> > lvmconf --enable-cluster
> >
> > and remember for cman+pacemaker you don't need qdisk
> 
>  Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf:
>   and the output of "lvm dumpconfig":
>  .
> 
>  Then I did as you suggested, but with a check to see if anything
> >> changed:
> 
>  # cd /etc/lvm/
>  # cp lvm.conf lvm.conf.cluster
>  # lvmconf --enable-cluster
>  # diff lvm.conf lvm.conf.cluster
>  #
> 
>  So the key lines have been there all along:
> locking_type = 3
> fallback_to_local_locking = 0
> 
> 
> > Il giorno 14 marzo 2012 23:17, William Seligman <
>  selig...@nevis.columbia.edu
> >> ha scritto:
> >
> >> On 3/14/12 9:20 AM, emmanuel segura wrote:
> >>> Hello William
> >>>
> >>> i did new you are using drbd and i dont't know what type of
>  configuration
> >>> you using
> >>>
> >>> But it's better you try to start clvm with clvmd -d
> >>>
> >>> like thak we can see what it's the problem
> >>
> >> For what it's worth, here's the output of running clvmd -d on the
> node
>  that
> >> stays up: 
> >>
> >> What's probably important in that big mass of output are the last
> two
> >> lines. Up
> >> to that point, I have both nodes up and running cman + clvmd;
>  cluster.conf
> >> is
> >> here: 
> >>
> >> At the time of the next-to-the-last line, I cut power to the other
> >> node.
> >>
> >> At the time of the last line, I run "vgdisplay" on the remaining
> node,
> >> which
> >> hangs forever.
> >>
> >> After a lot of web searching, I found that I'm not the only one with
>  this
> >> problem. Here's one case that doesn't seem relevant to me, since I
> >> don't
> >> use
> >> qdisk:
> >> <
> 
> http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html
> >>> .
> >> Here's one with the same problem with the same OS:
> >> , but with no resolution.
> >>
> >> Out of curiosity, has anyone on this list made a two-node cman+clvmd
> >> cluster
> >> work for them?
> >>
> >>> Il giorno 14 mar

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-15 Thread William Seligman
On 3/15/12 12:15 PM, emmanuel segura wrote:

> Ho did you created your volume group

pvcreate /dev/drbd0
vgcreate -c y ADMIN /dev/drbd0
lvcreate -L 200G -n usr ADMIN # ... and so on
# "Nevis-HA" is the cluster name I used in cluster.conf
mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr  # ... and so on

> give me the output of vgs command when the cluster it's up

Here it is:

Logging initialised at Thu Mar 15 12:40:39 2012
Set umask from 0022 to 0077
Finding all volume groups
Finding volume group "ROOT"
Finding volume group "ADMIN"
  VG    #PV #LV #SN Attr   VSize   VFree
  ADMIN   1   5   0 wz--nc   2.61t 765.79g
  ROOT1   2   0 wz--n- 117.16g  0
Wiping internal VG cache

I assume the "c" in the ADMIN attributes means that clustering is turned on?
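
For reference, a quick way to double-check that flag (assuming the stock LVM2 tools; the exact column layout may vary by version):

vgs -o vg_name,vg_attr ADMIN   # the sixth character of the attr field being 'c' = clustered
vgchange -c y ADMIN            # would set the clustered flag, had it been missing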

> Il giorno 15 marzo 2012 17:06, William Seligman > ha scritto:
> 
>> On 3/15/12 11:50 AM, emmanuel segura wrote:
>>> yes william
>>>
>>> Now try clvmd -d and see what happen
>>>
>>> locking_type = 3 it's lvm cluster lock type
>>
>> Since you asked for confirmation, here it is: the output of 'clvmd -d'
>> just now.
>> . I crashed the other node at Mar 15
>> 12:02:35,
>> when you see the only additional line of output.
>>
>> I don't see any particular difference between this and the previous result
>> , which suggests that I had cluster locking
>> enabled before, and still do now.
>>
>>> Il giorno 15 marzo 2012 16:15, William Seligman <
>> selig...@nevis.columbia.edu
 ha scritto:
>>>
 On 3/15/12 5:18 AM, emmanuel segura wrote:

> The first thing i seen in your clvmd log it's this
>
> =
>  WARNING: Locking disabled. Be careful! This could corrupt your
>> metadata.
> =

 I saw that too, and thought the same as you did. I did some checks (see
 below),
 but some web searches suggest that this message is a normal consequence
>> of
 clvmd
 initialization; e.g.,

 

> use this command
>
> lvmconf --enable-cluster
>
> and remember for cman+pacemaker you don't need qdisk

 Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf:
  and the output of "lvm dumpconfig":
 .

 Then I did as you suggested, but with a check to see if anything
>> changed:

 # cd /etc/lvm/
 # cp lvm.conf lvm.conf.cluster
 # lvmconf --enable-cluster
 # diff lvm.conf lvm.conf.cluster
 #

 So the key lines have been there all along:
locking_type = 3
fallback_to_local_locking = 0


> Il giorno 14 marzo 2012 23:17, William Seligman <
 selig...@nevis.columbia.edu
>> ha scritto:
>
>> On 3/14/12 9:20 AM, emmanuel segura wrote:
>>> Hello William
>>>
>>> i did new you are using drbd and i dont't know what type of
 configuration
>>> you using
>>>
>>> But it's better you try to start clvm with clvmd -d
>>>
>>> like thak we can see what it's the problem
>>
>> For what it's worth, here's the output of running clvmd -d on the node
 that
>> stays up: 
>>
>> What's probably important in that big mass of output are the last two
>> lines. Up
>> to that point, I have both nodes up and running cman + clvmd;
 cluster.conf
>> is
>> here: 
>>
>> At the time of the next-to-the-last line, I cut power to the other
>> node.
>>
>> At the time of the last line, I run "vgdisplay" on the remaining node,
>> which
>> hangs forever.
>>
>> After a lot of web searching, I found that I'm not the only one with
 this
>> problem. Here's one case that doesn't seem relevant to me, since I
>> don't
>> use
>> qdisk:
>> <
 http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html
>>> .
>> Here's one with the same problem with the same OS:
>> , but with no resolution.
>>
>> Out of curiosity, has anyone on this list made a two-node cman+clvmd
>> cluster
>> work for them?
>>
>>> Il giorno 14 marzo 2012 14:02, William Seligman <
>> selig...@nevis.columbia.edu
 ha scritto:
>>>
 On 3/14/12 6:02 AM, emmanuel segura wrote:

  I think it's better you make clvmd start at boot
>
> chkconfig cman on ; chkconfig clvmd on
>

 I've already tried it. It doesn't work. The problem is that my LVM
 information is on the drbd. If I start up clvmd before drbd, it
>> won't
>> find
 the logical volumes.

 I also don't see why that would make a difference (although this
>> could
>> b

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-15 Thread emmanuel segura
Hello William

How did you create your volume group?

Give me the output of the vgs command when the cluster is up.



On 15 March 2012 17:06, William Seligman  wrote:

> On 3/15/12 11:50 AM, emmanuel segura wrote:
> > yes william
> >
> > Now try clvmd -d and see what happen
> >
> > locking_type = 3 it's lvm cluster lock type
>
> Since you asked for confirmation, here it is: the output of 'clvmd -d'
> just now.
> . I crashed the other node at Mar 15
> 12:02:35,
> when you see the only additional line of output.
>
> I don't see any particular difference between this and the previous result
> , which suggests that I had cluster locking
> enabled before, and still do now.
>
> > Il giorno 15 marzo 2012 16:15, William Seligman <
> selig...@nevis.columbia.edu
> >> ha scritto:
> >
> >> On 3/15/12 5:18 AM, emmanuel segura wrote:
> >>
> >>> The first thing i seen in your clvmd log it's this
> >>>
> >>> =
> >>>  WARNING: Locking disabled. Be careful! This could corrupt your
> metadata.
> >>> =
> >>
> >> I saw that too, and thought the same as you did. I did some checks (see
> >> below),
> >> but some web searches suggest that this message is a normal consequence
> of
> >> clvmd
> >> initialization; e.g.,
> >>
> >> 
> >>
> >>> use this command
> >>>
> >>> lvmconf --enable-cluster
> >>>
> >>> and remember for cman+pacemaker you don't need qdisk
> >>
> >> Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf:
> >>  and the output of "lvm dumpconfig":
> >> .
> >>
> >> Then I did as you suggested, but with a check to see if anything
> changed:
> >>
> >> # cd /etc/lvm/
> >> # cp lvm.conf lvm.conf.cluster
> >> # lvmconf --enable-cluster
> >> # diff lvm.conf lvm.conf.cluster
> >> #
> >>
> >> So the key lines have been there all along:
> >>locking_type = 3
> >>fallback_to_local_locking = 0
> >>
> >>
> >>> Il giorno 14 marzo 2012 23:17, William Seligman <
> >> selig...@nevis.columbia.edu
>  ha scritto:
> >>>
>  On 3/14/12 9:20 AM, emmanuel segura wrote:
> > Hello William
> >
> > i did new you are using drbd and i dont't know what type of
> >> configuration
> > you using
> >
> > But it's better you try to start clvm with clvmd -d
> >
> > like thak we can see what it's the problem
> 
>  For what it's worth, here's the output of running clvmd -d on the node
> >> that
>  stays up: 
> 
>  What's probably important in that big mass of output are the last two
>  lines. Up
>  to that point, I have both nodes up and running cman + clvmd;
> >> cluster.conf
>  is
>  here: 
> 
>  At the time of the next-to-the-last line, I cut power to the other
> node.
> 
>  At the time of the last line, I run "vgdisplay" on the remaining node,
>  which
>  hangs forever.
> 
>  After a lot of web searching, I found that I'm not the only one with
> >> this
>  problem. Here's one case that doesn't seem relevant to me, since I
> don't
>  use
>  qdisk:
>  <
> >> http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html
> >.
>  Here's one with the same problem with the same OS:
>  , but with no resolution.
> 
>  Out of curiosity, has anyone on this list made a two-node cman+clvmd
>  cluster
>  work for them?
> 
> > Il giorno 14 marzo 2012 14:02, William Seligman <
>  selig...@nevis.columbia.edu
> >> ha scritto:
> >
> >> On 3/14/12 6:02 AM, emmanuel segura wrote:
> >>
> >>  I think it's better you make clvmd start at boot
> >>>
> >>> chkconfig cman on ; chkconfig clvmd on
> >>>
> >>
> >> I've already tried it. It doesn't work. The problem is that my LVM
> >> information is on the drbd. If I start up clvmd before drbd, it
> won't
>  find
> >> the logical volumes.
> >>
> >> I also don't see why that would make a difference (although this
> could
>  be
> >> part of the confusion): a service is a service. I've tried starting
> up
> >> clvmd inside and outside pacemaker control, with the same problem.
> Why
> >> would starting clvmd at boot make a difference?
> >>
> >>  Il giorno 13 marzo 2012 23:29, William Seligman >>> columbia.edu 
> >>>
>  ha scritto:
> 
> >>>
> >>>  On 3/13/12 5:50 PM, emmanuel segura wrote:
> 
>   So if you using cman why you use lsb::clvmd
> >
> > I think you are very confused
> >
> 
>  I don't dispute that I may be very confused!
> 
>  However, from what I can tell, I still need to run clvmd even if
> 

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-15 Thread William Seligman
On 3/15/12 11:50 AM, emmanuel segura wrote:
> yes william
> 
> Now try clvmd -d and see what happen
> 
> locking_type = 3 it's lvm cluster lock type

Since you asked for confirmation, here it is: the output of 'clvmd -d' just now.
. I crashed the other node at Mar 15 12:02:35,
when you see the only additional line of output.

I don't see any particular difference between this and the previous result
, which suggests that I had cluster locking
enabled before, and still do now.

> Il giorno 15 marzo 2012 16:15, William Seligman > ha scritto:
> 
>> On 3/15/12 5:18 AM, emmanuel segura wrote:
>>
>>> The first thing i seen in your clvmd log it's this
>>>
>>> =
>>>  WARNING: Locking disabled. Be careful! This could corrupt your metadata.
>>> =
>>
>> I saw that too, and thought the same as you did. I did some checks (see
>> below),
>> but some web searches suggest that this message is a normal consequence of
>> clvmd
>> initialization; e.g.,
>>
>> 
>>
>>> use this command
>>>
>>> lvmconf --enable-cluster
>>>
>>> and remember for cman+pacemaker you don't need qdisk
>>
>> Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf:
>>  and the output of "lvm dumpconfig":
>> .
>>
>> Then I did as you suggested, but with a check to see if anything changed:
>>
>> # cd /etc/lvm/
>> # cp lvm.conf lvm.conf.cluster
>> # lvmconf --enable-cluster
>> # diff lvm.conf lvm.conf.cluster
>> #
>>
>> So the key lines have been there all along:
>>locking_type = 3
>>fallback_to_local_locking = 0
>>
>>
>>> Il giorno 14 marzo 2012 23:17, William Seligman <
>> selig...@nevis.columbia.edu
 ha scritto:
>>>
 On 3/14/12 9:20 AM, emmanuel segura wrote:
> Hello William
>
> i did new you are using drbd and i dont't know what type of
>> configuration
> you using
>
> But it's better you try to start clvm with clvmd -d
>
> like thak we can see what it's the problem

 For what it's worth, here's the output of running clvmd -d on the node
>> that
 stays up: 

 What's probably important in that big mass of output are the last two
 lines. Up
 to that point, I have both nodes up and running cman + clvmd;
>> cluster.conf
 is
 here: 

 At the time of the next-to-the-last line, I cut power to the other node.

 At the time of the last line, I run "vgdisplay" on the remaining node,
 which
 hangs forever.

 After a lot of web searching, I found that I'm not the only one with
>> this
 problem. Here's one case that doesn't seem relevant to me, since I don't
 use
 qdisk:
 <
>> http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html>.
 Here's one with the same problem with the same OS:
 , but with no resolution.

 Out of curiosity, has anyone on this list made a two-node cman+clvmd
 cluster
 work for them?

> Il giorno 14 marzo 2012 14:02, William Seligman <
 selig...@nevis.columbia.edu
>> ha scritto:
>
>> On 3/14/12 6:02 AM, emmanuel segura wrote:
>>
>>  I think it's better you make clvmd start at boot
>>>
>>> chkconfig cman on ; chkconfig clvmd on
>>>
>>
>> I've already tried it. It doesn't work. The problem is that my LVM
>> information is on the drbd. If I start up clvmd before drbd, it won't
 find
>> the logical volumes.
>>
>> I also don't see why that would make a difference (although this could
 be
>> part of the confusion): a service is a service. I've tried starting up
>> clvmd inside and outside pacemaker control, with the same problem. Why
>> would starting clvmd at boot make a difference?
>>
>>  Il giorno 13 marzo 2012 23:29, William Seligman>> columbia.edu 
>>>
 ha scritto:

>>>
>>>  On 3/13/12 5:50 PM, emmanuel segura wrote:

  So if you using cman why you use lsb::clvmd
>
> I think you are very confused
>

 I don't dispute that I may be very confused!

 However, from what I can tell, I still need to run clvmd even if
 I'm running cman (I'm not using rgmanager). If I just run cman,
 gfs2 and any other form of mount fails. If I run cman, then clvmd,
 then gfs2, everything behaves normally.

 Going by these instructions:

 
>

 the resources he puts under "cluster control" (rgmanager) I have to
 put under pacemaker con

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-15 Thread emmanuel segura
Yes, William.

Now try clvmd -d and see what happens.

locking_type = 3 is the LVM cluster locking type.

On 15 March 2012 16:15, William Seligman  wrote:

> On 3/15/12 5:18 AM, emmanuel segura wrote:
>
> > The first thing i seen in your clvmd log it's this
> >
> > =
> >  WARNING: Locking disabled. Be careful! This could corrupt your metadata.
> > =
>
> I saw that too, and thought the same as you did. I did some checks (see
> below),
> but some web searches suggest that this message is a normal consequence of
> clvmd
> initialization; e.g.,
>
> 
>
> > use this command
> >
> > lvmconf --enable-cluster
> >
> > and remember for cman+pacemaker you don't need qdisk
>
> Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf:
>  and the output of "lvm dumpconfig":
> .
>
> Then I did as you suggested, but with a check to see if anything changed:
>
> # cd /etc/lvm/
> # cp lvm.conf lvm.conf.cluster
> # lvmconf --enable-cluster
> # diff lvm.conf lvm.conf.cluster
> #
>
> So the key lines have been there all along:
>locking_type = 3
>fallback_to_local_locking = 0
>
>
> > Il giorno 14 marzo 2012 23:17, William Seligman <
> selig...@nevis.columbia.edu
> >> ha scritto:
> >
> >> On 3/14/12 9:20 AM, emmanuel segura wrote:
> >>> Hello William
> >>>
> >>> i did new you are using drbd and i dont't know what type of
> configuration
> >>> you using
> >>>
> >>> But it's better you try to start clvm with clvmd -d
> >>>
> >>> like thak we can see what it's the problem
> >>
> >> For what it's worth, here's the output of running clvmd -d on the node
> that
> >> stays up: 
> >>
> >> What's probably important in that big mass of output are the last two
> >> lines. Up
> >> to that point, I have both nodes up and running cman + clvmd;
> cluster.conf
> >> is
> >> here: 
> >>
> >> At the time of the next-to-the-last line, I cut power to the other node.
> >>
> >> At the time of the last line, I run "vgdisplay" on the remaining node,
> >> which
> >> hangs forever.
> >>
> >> After a lot of web searching, I found that I'm not the only one with
> this
> >> problem. Here's one case that doesn't seem relevant to me, since I don't
> >> use
> >> qdisk:
> >> <
> http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html>.
> >> Here's one with the same problem with the same OS:
> >> , but with no resolution.
> >>
> >> Out of curiosity, has anyone on this list made a two-node cman+clvmd
> >> cluster
> >> work for them?
> >>
> >>> Il giorno 14 marzo 2012 14:02, William Seligman <
> >> selig...@nevis.columbia.edu
>  ha scritto:
> >>>
>  On 3/14/12 6:02 AM, emmanuel segura wrote:
> 
>   I think it's better you make clvmd start at boot
> >
> > chkconfig cman on ; chkconfig clvmd on
> >
> 
>  I've already tried it. It doesn't work. The problem is that my LVM
>  information is on the drbd. If I start up clvmd before drbd, it won't
> >> find
>  the logical volumes.
> 
>  I also don't see why that would make a difference (although this could
> >> be
>  part of the confusion): a service is a service. I've tried starting up
>  clvmd inside and outside pacemaker control, with the same problem. Why
>  would starting clvmd at boot make a difference?
> 
>   Il giorno 13 marzo 2012 23:29, William Seligman > columbia.edu 
> >
> >> ha scritto:
> >>
> >
> >  On 3/13/12 5:50 PM, emmanuel segura wrote:
> >>
> >>  So if you using cman why you use lsb::clvmd
> >>>
> >>> I think you are very confused
> >>>
> >>
> >> I don't dispute that I may be very confused!
> >>
> >> However, from what I can tell, I still need to run clvmd even if
> >> I'm running cman (I'm not using rgmanager). If I just run cman,
> >> gfs2 and any other form of mount fails. If I run cman, then clvmd,
> >> then gfs2, everything behaves normally.
> >>
> >> Going by these instructions:
> >>
> >>  >> https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial>
> >>>
> >>
> >> the resources he puts under "cluster control" (rgmanager) I have to
> >> put under pacemaker control. Those include drbd, clvmd, and gfs2.
> >>
> >> The difference between what I've got, and what's in "Clusters From
> >> Scratch", is in CFS they assign one DRBD volume to a single
> >> filesystem. I create an LVM physical volume on my DRBD resource,
> >> as in the above tutorial, and so I have to start clvmd or the
> >> logical volumes in the DRBD partition won't be recognized.>> Is
> >> there some way to get logical volumes recognized automatically

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-15 Thread William Seligman
On 3/15/12 3:43 AM, Vladislav Bogdanov wrote:
> 14.03.2012 00:42, William Seligman wrote:
> [snip]
>> These were the log messages, which show that stonith_admin did its job and 
>> CMAN
>> was notified of the fencing: .
> 
> Could you please look at the output of 'dlm_tool ls' and 'dlm_tool dump'?
> 
> You probably have 'kern_stop' and 'fencing' flags there. That means that
> dlm is unaware that node is fenced.

Here's 'dlm_tool ls' with both nodes running cman+clvmd+gfs2:


'dlm_tool dump': 

For comparison, I crashed one node and looked at the same output on the
remaining node:
dlm_tool ls: 
dlm_tool dump:  (the post-crash lines begin at
1331824940)

I don't see the "kern_stop" or "fencing" flags. There's another thing I don't
see: at the top of 'dlm_tool dump' it displays most of the contents of my
cluster.conf file, except for the fencing sections. Here's my cluster.conf for
comparison: 

cman doesn't see anything wrong in my cluster.conf file:

# ccs_config_validate
Configuration validates

But could there be something that's causing the fencing sections to be ignored?

>>
>> Unfortunately, I still got the gfs2 freeze, so this is not the complete 
>> story.
> 
> Both clvmd and gfs2 use dlm. If dlm layer thinks fencing is not
> completed, both of them freeze.

I did 'grep -E "(dlm|clvm|fenc)" /var/log/messages' and looked at the time I
crashed the node: . I see lines that indicate that
pacemaker and drbd are fencing the node, but nothing from dlm or clvmd. Does
this indicate what you suggest: Could dlm somehow be ignoring or overlooking the
fencing I put in? Is there any other way to check this?
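
The only other checks I can think of trying (sketched from memory, so the exact options may differ across cman/fenced versions) are:

cman_tool nodes -f             # node states plus the time/agent of the last fence operation
fence_tool ls                  # fence domain state; shows whether a victim is still pending
dlm_tool dump | grep -i fenc   # dlm's own record of fence requests and completions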

-- 
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
PO Box 137|
Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/




Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-15 Thread William Seligman
On 3/15/12 5:18 AM, emmanuel segura wrote:

> The first thing i seen in your clvmd log it's this
> 
> =
>  WARNING: Locking disabled. Be careful! This could corrupt your metadata.
> =

I saw that too, and thought the same as you did. I did some checks (see below),
but some web searches suggest that this message is a normal consequence of clvmd
initialization; e.g.,



> use this command
> 
> lvmconf --enable-cluster
> 
> and remember for cman+pacemaker you don't need qdisk

Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf:
 and the output of "lvm dumpconfig":
.

Then I did as you suggested, but with a check to see if anything changed:

# cd /etc/lvm/
# cp lvm.conf lvm.conf.cluster
# lvmconf --enable-cluster
# diff lvm.conf lvm.conf.cluster
#

So the key lines have been there all along:
locking_type = 3
fallback_to_local_locking = 0
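
Just to be thorough, the settings the tools actually see can be grepped out of the running configuration dump as well (assuming "lvm dumpconfig" behaves the same on this lvm2 build):

lvm dumpconfig | grep -E 'locking_type|fallback_to_local_locking'
# expect something like locking_type=3 and fallback_to_local_locking=0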


> Il giorno 14 marzo 2012 23:17, William Seligman > ha scritto:
> 
>> On 3/14/12 9:20 AM, emmanuel segura wrote:
>>> Hello William
>>>
>>> i did new you are using drbd and i dont't know what type of configuration
>>> you using
>>>
>>> But it's better you try to start clvm with clvmd -d
>>>
>>> like thak we can see what it's the problem
>>
>> For what it's worth, here's the output of running clvmd -d on the node that
>> stays up: 
>>
>> What's probably important in that big mass of output are the last two
>> lines. Up
>> to that point, I have both nodes up and running cman + clvmd; cluster.conf
>> is
>> here: 
>>
>> At the time of the next-to-the-last line, I cut power to the other node.
>>
>> At the time of the last line, I run "vgdisplay" on the remaining node,
>> which
>> hangs forever.
>>
>> After a lot of web searching, I found that I'm not the only one with this
>> problem. Here's one case that doesn't seem relevant to me, since I don't
>> use
>> qdisk:
>> .
>> Here's one with the same problem with the same OS:
>> , but with no resolution.
>>
>> Out of curiosity, has anyone on this list made a two-node cman+clvmd
>> cluster
>> work for them?
>>
>>> Il giorno 14 marzo 2012 14:02, William Seligman <
>> selig...@nevis.columbia.edu
 ha scritto:
>>>
 On 3/14/12 6:02 AM, emmanuel segura wrote:

  I think it's better you make clvmd start at boot
>
> chkconfig cman on ; chkconfig clvmd on
>

 I've already tried it. It doesn't work. The problem is that my LVM
 information is on the drbd. If I start up clvmd before drbd, it won't
>> find
 the logical volumes.

 I also don't see why that would make a difference (although this could
>> be
 part of the confusion): a service is a service. I've tried starting up
 clvmd inside and outside pacemaker control, with the same problem. Why
 would starting clvmd at boot make a difference?

  Il giorno 13 marzo 2012 23:29, William Seligman columbia.edu 
>
>> ha scritto:
>>
>
>  On 3/13/12 5:50 PM, emmanuel segura wrote:
>>
>>  So if you using cman why you use lsb::clvmd
>>>
>>> I think you are very confused
>>>
>>
>> I don't dispute that I may be very confused!
>>
>> However, from what I can tell, I still need to run clvmd even if
>> I'm running cman (I'm not using rgmanager). If I just run cman,
>> gfs2 and any other form of mount fails. If I run cman, then clvmd,
>> then gfs2, everything behaves normally.
>>
>> Going by these instructions:
>>
>> > https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial>
>>>
>>
>> the resources he puts under "cluster control" (rgmanager) I have to
>> put under pacemaker control. Those include drbd, clvmd, and gfs2.
>>
>> The difference between what I've got, and what's in "Clusters From
>> Scratch", is in CFS they assign one DRBD volume to a single
>> filesystem. I create an LVM physical volume on my DRBD resource,
>> as in the above tutorial, and so I have to start clvmd or the
>> logical volumes in the DRBD partition won't be recognized.>> Is
>> there some way to get logical volumes recognized automatically by
>> cman without rgmanager that I've missed?
>>
>
>  Il giorno 13 marzo 2012 22:42, William Seligman<
>>>
>> selig...@nevis.columbia.edu
>>
>>> ha scritto:

>>>
>>>  On 3/13/12 12:29 PM, William Seligman wrote:

> I'm not sure if this is a "Linux-HA" question; please direct
> me to the appropriate list if it's not.
>
> I'm setting up a two-node cman+pacemaker+gfs2 

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-15 Thread emmanuel segura
Hello William

The first thing I see in your clvmd log is this:

=
 WARNING: Locking disabled. Be careful! This could corrupt your metadata.
=

Use this command:

lvmconf --enable-cluster

And remember: for cman+pacemaker you don't need qdisk.
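
As far as I know, lvmconf --enable-cluster only rewrites /etc/lvm/lvm.conf, so you can verify what it changed with something like:

grep -E '^[[:space:]]*locking_type' /etc/lvm/lvm.conf   # should show locking_type = 3 afterwards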

On 14 March 2012 23:17, William Seligman  wrote:

> On 3/14/12 9:20 AM, emmanuel segura wrote:
> > Hello William
> >
> > i did new you are using drbd and i dont't know what type of configuration
> > you using
> >
> > But it's better you try to start clvm with clvmd -d
> >
> > like thak we can see what it's the problem
>
> For what it's worth, here's the output of running clvmd -d on the node that
> stays up: 
>
> What's probably important in that big mass of output are the last two
> lines. Up
> to that point, I have both nodes up and running cman + clvmd; cluster.conf
> is
> here: 
>
> At the time of the next-to-the-last line, I cut power to the other node.
>
> At the time of the last line, I run "vgdisplay" on the remaining node,
> which
> hangs forever.
>
> After a lot of web searching, I found that I'm not the only one with this
> problem. Here's one case that doesn't seem relevant to me, since I don't
> use
> qdisk:
> .
> Here's one with the same problem with the same OS:
> , but with no resolution.
>
> Out of curiosity, has anyone on this list made a two-node cman+clvmd
> cluster
> work for them?
>
> > Il giorno 14 marzo 2012 14:02, William Seligman <
> selig...@nevis.columbia.edu
> >> ha scritto:
> >
> >> On 3/14/12 6:02 AM, emmanuel segura wrote:
> >>
> >>  I think it's better you make clvmd start at boot
> >>>
> >>> chkconfig cman on ; chkconfig clvmd on
> >>>
> >>
> >> I've already tried it. It doesn't work. The problem is that my LVM
> >> information is on the drbd. If I start up clvmd before drbd, it won't
> find
> >> the logical volumes.
> >>
> >> I also don't see why that would make a difference (although this could
> be
> >> part of the confusion): a service is a service. I've tried starting up
> >> clvmd inside and outside pacemaker control, with the same problem. Why
> >> would starting clvmd at boot make a difference?
> >>
> >>  Il giorno 13 marzo 2012 23:29, William Seligman >>> columbia.edu 
> >>>
>  ha scritto:
> 
> >>>
> >>>  On 3/13/12 5:50 PM, emmanuel segura wrote:
> 
>   So if you using cman why you use lsb::clvmd
> >
> > I think you are very confused
> >
> 
>  I don't dispute that I may be very confused!
> 
>  However, from what I can tell, I still need to run clvmd even if
>  I'm running cman (I'm not using rgmanager). If I just run cman,
>  gfs2 and any other form of mount fails. If I run cman, then clvmd,
>  then gfs2, everything behaves normally.
> 
>  Going by these instructions:
> 
>   https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial>
> >
> 
>  the resources he puts under "cluster control" (rgmanager) I have to
>  put under pacemaker control. Those include drbd, clvmd, and gfs2.
> 
>  The difference between what I've got, and what's in "Clusters From
>  Scratch", is in CFS they assign one DRBD volume to a single
>  filesystem. I create an LVM physical volume on my DRBD resource,
>  as in the above tutorial, and so I have to start clvmd or the
>  logical volumes in the DRBD partition won't be recognized.>> Is
>  there some way to get logical volumes recognized automatically by
>  cman without rgmanager that I've missed?
> 
> >>>
> >>>  Il giorno 13 marzo 2012 22:42, William Seligman<
> >
>  selig...@nevis.columbia.edu
> 
> > ha scritto:
> >>
> >
> >  On 3/13/12 12:29 PM, William Seligman wrote:
> >>
> >>> I'm not sure if this is a "Linux-HA" question; please direct
> >>> me to the appropriate list if it's not.
> >>>
> >>> I'm setting up a two-node cman+pacemaker+gfs2 cluster as
> >>> described in "Clusters From Scratch." Fencing is through
> >>> forcibly rebooting a node by cutting and restoring its power
> >>> via UPS.
> >>>
> >>> My fencing/failover tests have revealed a problem. If I
> >>> gracefully turn off one node ("crm node standby"; "service
> >>> pacemaker stop"; "shutdown -r now") all the resources
> >>> transfer to the other node with no problems. If I cut power
> >>> to one node (as would happen if it were fenced), the
> >>> lsb::clvmd resource on the remaining node eventually fails.
> >>> Since all the other resources depend on clvmd, all the
> >>> resources on the remaining node stop and the cluster is left
> >>> with nothing running.
> >>>
> >>> I've tr

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-15 Thread Vladislav Bogdanov
14.03.2012 00:42, William Seligman wrote:
[snip]
> These were the log messages, which show that stonith_admin did its job and 
> CMAN
> was notified of the fencing: .

Could you please look at the output of 'dlm_tool ls' and 'dlm_tool dump'?

You probably have 'kern_stop' and 'fencing' flags there. That means that
dlm is unaware that the node has been fenced.
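
Roughly what I mean (the exact wording differs between dlm versions, so treat this as a sketch):

dlm_tool ls                            # check the lockspaces for clvmd and each gfs2 mount
dlm_tool dump | grep -Ei 'fenc|stop'   # stuck fence requests tend to show up here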

> 
> Unfortunately, I still got the gfs2 freeze, so this is not the complete story.

Both clvmd and gfs2 use dlm. If the dlm layer thinks fencing has not
completed, both of them freeze.

Best,
Vladislav


Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-14 Thread Dimitri Maziuk
On 03/14/2012 05:22 PM, William Seligman wrote:

> Now consider a primary-primary cluster. Both run the same resource. One fails.
> There's no failover here; the other box still runs the resource. In my case, 
> the
> only thing that has to work is cloned cluster IP address, and that I've 
> verified
> to my satisfaction.

That may be true if you offer only completely stateless services over
UDP on the cluster IP address, or if you're running some interesting
network stack on top of IP. According to my (admittedly fading) memory
of Networking 101, TCP-based services don't quite work that way.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu




Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-14 Thread Lars Marowsky-Bree
On 2012-03-14T18:22:42, William Seligman  wrote:

> Now consider a primary-primary cluster. Both run the same resource.
> One fails.  There's no failover here; the other box still runs the
> resource. In my case, the only thing that has to work is cloned
> cluster IP address, and that I've verified to my satisfaction.

There's still an outage while the replication and OCFS2 freeze until the
other side has been shot; OCFS2 then also needs to recover the journal
from the departing node before continuing. Depending on your service,
the service also needs to recover the data from the departed node.

So there's still a brief freeze. Yes, it may be minimally shorter than
an active/passive setup in the rare error case; but you're paying for
this with significantly increased complexity and probably reduced
performance during *normal* operation.

It's your call. And technically I enjoy the distributed active/active
stuff; it's lots of fun, and I like solving problems. But I haven't seen
that many scenarios where the complexity is warranted. (Databases or
cloud deployments being the rare exceptions.)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-14 Thread William Seligman
On 3/14/12 12:43 PM, Dimitri Maziuk wrote:
> On 03/14/2012 11:08 AM, Lars Marowsky-Bree wrote:
>> On 2012-03-14T11:41:53, William Seligman  wrote:
>>
>>> I'm mindful of the issues involved, such as those Lars Ellenberg
>>> brought up in his response. I need something that will failover with
>>> a minimum of fuss. Although I'm encountering one problem after
>>> another, I think I'm closing in on my goal.
>>
>> I doubt this is what you're getting. An active/passive fail-over
>> configuration would likely save you tons of trouble and not perform
>> worse, probably be faster for most workloads.
> 
> Or if you look at it from another angle, if you can't configure your
> resources to start properly at failover, what makes you think you can
> configure a dual-primary any better?

I'll repeat the answer I gave in that other thread, for what it's worth:

Consider two nodes in a primary-secondary cluster. Primary is running a
resource. It fails, so the resource has to failover to secondary.

Now consider a primary-primary cluster. Both run the same resource. One fails.
There's no failover here; the other box still runs the resource. In my case, the
only thing that has to work is cloned cluster IP address, and that I've verified
to my satisfaction.
-- 
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
PO Box 137|
Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/




Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-14 Thread William Seligman
On 3/14/12 9:20 AM, emmanuel segura wrote:
> Hello William
> 
> i did new you are using drbd and i dont't know what type of configuration
> you using
> 
> But it's better you try to start clvm with clvmd -d
> 
> like thak we can see what it's the problem

For what it's worth, here's the output of running clvmd -d on the node that
stays up: 

What's probably important in that big mass of output are the last two lines. Up
to that point, I have both nodes up and running cman + clvmd; cluster.conf is
here: 

At the time of the next-to-the-last line, I cut power to the other node.

At the time of the last line, I run "vgdisplay" on the remaining node, which
hangs forever.

After a lot of web searching, I found that I'm not the only one with this
problem. Here's one case that doesn't seem relevant to me, since I don't use
qdisk:
.
Here's one with the same problem with the same OS:
, but with no resolution.

Out of curiosity, has anyone on this list made a two-node cman+clvmd cluster
work for them?
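
In case anyone wants to reproduce this, my test sequence boils down to the following (the timeout value and log path are just my choices, so treat it as a sketch):

service clvmd stop                          # make sure only the debug instance is running
clvmd -d 2>&1 | tee /tmp/clvmd-debug.log &  # -d keeps clvmd in the foreground with debug output
# ...cut power to the other node, then on the surviving node:
timeout 60 vgdisplay || echo "vgdisplay did not return within 60 seconds"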

> Il giorno 14 marzo 2012 14:02, William Seligman > ha scritto:
> 
>> On 3/14/12 6:02 AM, emmanuel segura wrote:
>>
>>  I think it's better you make clvmd start at boot
>>>
>>> chkconfig cman on ; chkconfig clvmd on
>>>
>>
>> I've already tried it. It doesn't work. The problem is that my LVM
>> information is on the drbd. If I start up clvmd before drbd, it won't find
>> the logical volumes.
>>
>> I also don't see why that would make a difference (although this could be
>> part of the confusion): a service is a service. I've tried starting up
>> clvmd inside and outside pacemaker control, with the same problem. Why
>> would starting clvmd at boot make a difference?
>>
>>  Il giorno 13 marzo 2012 23:29, William Seligman>> columbia.edu 
>>>
 ha scritto:

>>>
>>>  On 3/13/12 5:50 PM, emmanuel segura wrote:

  So if you using cman why you use lsb::clvmd
>
> I think you are very confused
>

 I don't dispute that I may be very confused!

 However, from what I can tell, I still need to run clvmd even if
 I'm running cman (I'm not using rgmanager). If I just run cman,
 gfs2 and any other form of mount fails. If I run cman, then clvmd,
 then gfs2, everything behaves normally.

 Going by these instructions:

 
>

 the resources he puts under "cluster control" (rgmanager) I have to
 put under pacemaker control. Those include drbd, clvmd, and gfs2.

 The difference between what I've got, and what's in "Clusters From
 Scratch", is in CFS they assign one DRBD volume to a single
 filesystem. I create an LVM physical volume on my DRBD resource,
 as in the above tutorial, and so I have to start clvmd or the
 logical volumes in the DRBD partition won't be recognized.>> Is
 there some way to get logical volumes recognized automatically by
 cman without rgmanager that I've missed?

>>>
>>>  Il giorno 13 marzo 2012 22:42, William Seligman<
>
 selig...@nevis.columbia.edu

> ha scritto:
>>
>
>  On 3/13/12 12:29 PM, William Seligman wrote:
>>
>>> I'm not sure if this is a "Linux-HA" question; please direct
>>> me to the appropriate list if it's not.
>>>
>>> I'm setting up a two-node cman+pacemaker+gfs2 cluster as
>>> described in "Clusters From Scratch." Fencing is through
>>> forcibly rebooting a node by cutting and restoring its power
>>> via UPS.
>>>
>>> My fencing/failover tests have revealed a problem. If I
>>> gracefully turn off one node ("crm node standby"; "service
>>> pacemaker stop"; "shutdown -r now") all the resources
>>> transfer to the other node with no problems. If I cut power
>>> to one node (as would happen if it were fenced), the
>>> lsb::clvmd resource on the remaining node eventually fails.
>>> Since all the other resources depend on clvmd, all the
>>> resources on the remaining node stop and the cluster is left
>>> with nothing running.
>>>
>>> I've traced why the lsb::clvmd fails: The monitor/status
>>> command includes "vgdisplay", which hangs indefinitely.
>>> Therefore the monitor will always time-out.
>>>
>>> So this isn't a problem with pacemaker, but with clvmd/dlm:
>>> If a node is cut off, the cluster isn't handling it properly.
>>> Has anyone on this list seen this before? Any ideas?
>>>
 Details:
>
>>
>>> versions:
>>> Redhat Linux 6.2 (kernel 2.6.32)
>>> cman-3.0.12.1
>>> corosync-1.4.1
>>> pacemaker-1.1.6
>>> lvm2-2.02.87
>>> lvm2-cluster-2.02.87
>>>
>>
>> This may be a Linux-HA question after all!
>>
>>>

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-14 Thread Dimitri Maziuk
On 03/14/2012 11:08 AM, Lars Marowsky-Bree wrote:
> On 2012-03-14T11:41:53, William Seligman  wrote:
> 
>> I'm mindful of the issues involved, such as those Lars Ellenberg
>> brought up in his response. I need something that will failover with
>> a minimum of fuss. Although I'm encountering one problem after
>> another, I think I'm closing in on my goal.
> 
> I doubt this is what you're getting. An active/passive fail-over
> configuration would likely save you tons of trouble and not perform
> worse, probably be faster for most workloads.

Or if you look at it from another angle, if you can't configure your
resources to start properly at failover, what makes you think you can
configure a dual-primary any better?

Dima
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu




Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-14 Thread Lars Marowsky-Bree
On 2012-03-14T11:41:53, William Seligman  wrote:

> I'm mindful of the issues involved, such as those Lars Ellenberg
> brought up in his response. I need something that will failover with
> a minimum of fuss. Although I'm encountering one problem after
> another, I think I'm closing in on my goal.

I doubt this is what you're getting. An active/passive fail-over
configuration would likely save you tons of trouble and not perform
worse, probably be faster for most workloads.



Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-14 Thread William Seligman

On 3/14/12 9:26 AM, Lars Marowsky-Bree wrote:

> On 2012-03-14T09:02:59, William Seligman  wrote:
>
> To ask a slightly different question - why? Does your workload require /
> benefit from a dual-primary architecture? Most don't.




I'm mindful of the issues involved, such as those Lars Ellenberg brought up in 
his response. I need something that will failover with a minimum of fuss. 
Although I'm encountering one problem after another, I think I'm closing in on 
my goal.


And if not, at least I'm leaving some interesting threads in Linux-HA for future 
sysadmins to search for.

--
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
PO Box 137|
Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/




Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-14 Thread Lars Marowsky-Bree
On 2012-03-14T09:02:59, William Seligman  wrote:

To ask a slightly different question - why? Does your workload require /
benefit from a dual-primary architecture? Most don't.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-14 Thread emmanuel segura
Hello William

I didn't know you are using drbd, and I don't know what type of configuration
you are using.

But it's better if you try to start clvmd with clvmd -d.

That way we can see what the problem is.

On 14 March 2012 14:02, William Seligman  wrote:

> On 3/14/12 6:02 AM, emmanuel segura wrote:
>
>  I think it's better you make clvmd start at boot
>>
>> chkconfig cman on ; chkconfig clvmd on
>>
>
> I've already tried it. It doesn't work. The problem is that my LVM
> information is on the drbd. If I start up clvmd before drbd, it won't find
> the logical volumes.
>
> I also don't see why that would make a difference (although this could be
> part of the confusion): a service is a service. I've tried starting up
> clvmd inside and outside pacemaker control, with the same problem. Why
> would starting clvmd at boot make a difference?
>
>  Il giorno 13 marzo 2012 23:29, William Seligman> columbia.edu 
>>
>>> ha scritto:
>>>
>>
>>  On 3/13/12 5:50 PM, emmanuel segura wrote:
>>>
>>>  So if you using cman why you use lsb::clvmd

 I think you are very confused

>>>
>>> I don't dispute that I may be very confused!
>>>
>>> However, from what I can tell, I still need to run clvmd even if
>>> I'm running cman (I'm not using rgmanager). If I just run cman,
>>> gfs2 and any other form of mount fails. If I run cman, then clvmd,
>>> then gfs2, everything behaves normally.
>>>
>>> Going by these instructions:
>>>
>>> 
>>> >
>>>
>>> the resources he puts under "cluster control" (rgmanager) I have to
>>> put under pacemaker control. Those include drbd, clvmd, and gfs2.
>>>
>>> The difference between what I've got, and what's in "Clusters From
>>> Scratch", is in CFS they assign one DRBD volume to a single
>>> filesystem. I create an LVM physical volume on my DRBD resource,
>>> as in the above tutorial, and so I have to start clvmd or the
>>> logical volumes in the DRBD partition won't be recognized.>> Is
>>> there some way to get logical volumes recognized automatically by
>>> cman without rgmanager that I've missed?
>>>
>>
>>  Il giorno 13 marzo 2012 22:42, William Seligman<

>>> selig...@nevis.columbia.edu
>>>
 ha scritto:
>

  On 3/13/12 12:29 PM, William Seligman wrote:
>
>> I'm not sure if this is a "Linux-HA" question; please direct
>> me to the appropriate list if it's not.
>>
>> I'm setting up a two-node cman+pacemaker+gfs2 cluster as
>> described in "Clusters From Scratch." Fencing is through
>> forcibly rebooting a node by cutting and restoring its power
>> via UPS.
>>
>> My fencing/failover tests have revealed a problem. If I
>> gracefully turn off one node ("crm node standby"; "service
>> pacemaker stop"; "shutdown -r now") all the resources
>> transfer to the other node with no problems. If I cut power
>> to one node (as would happen if it were fenced), the
>> lsb::clvmd resource on the remaining node eventually fails.
>> Since all the other resources depend on clvmd, all the
>> resources on the remaining node stop and the cluster is left
>> with nothing running.
>>
>> I've traced why the lsb::clvmd fails: The monitor/status
>> command includes "vgdisplay", which hangs indefinitely.
>> Therefore the monitor will always time-out.
>>
>> So this isn't a problem with pacemaker, but with clvmd/dlm:
>> If a node is cut off, the cluster isn't handling it properly.
>> Has anyone on this list seen this before? Any ideas?
>>
> >> Details:

>
>> versions:
>> Redhat Linux 6.2 (kernel 2.6.32)
>> cman-3.0.12.1
>> corosync-1.4.1
>> pacemaker-1.1.6
>> lvm2-2.02.87
>> lvm2-cluster-2.02.87
>>
>
> This may be a Linux-HA question after all!
>
> I ran a few more tests. Here's the output from a typical test of
>
> grep -E "(dlm|gfs2|clvmd|fenc|syslogd)" /var/log/messages
>
> 
>
> It looks like what's happening is that the fence agent (one I
> wrote) is not returning the proper error code when a node
> crashes. According to this page, if a fencing agent fails GFS2
> will freeze to protect the data:
>
>  Linux/6/html/Global_File_System_2/s1-gfs2hand-allnodes.html>
>
> As a test, I tried to fence my test node via standard means:
>
> stonith_admin -F 
> orestes-corosync.nevis.columbia.edu
>
> These were the log messages, which show that stonith_admin did
> its job and CMAN was notified of the
> fencing:
> >.
>

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-14 Thread William Seligman

On 3/14/12 6:02 AM, emmanuel segura wrote:


I think it's better you make clvmd start at boot

chkconfig cman on ; chkconfig clvmd on


I've already tried it. It doesn't work. The problem is that my LVM 
information is on the drbd. If I start up clvmd before drbd, it won't 
find the logical volumes.


I also don't see why that would make a difference (although this could 
be part of the confusion): a service is a service. I've tried starting 
up clvmd inside and outside pacemaker control, with the same problem. 
Why would starting clvmd at boot make a difference?



Il giorno 13 marzo 2012 23:29, William Seligman
ha scritto:



On 3/13/12 5:50 PM, emmanuel segura wrote:


So if you using cman why you use lsb::clvmd

I think you are very confused


I don't dispute that I may be very confused!

However, from what I can tell, I still need to run clvmd even if
I'm running cman (I'm not using rgmanager). If I just run cman,
gfs2 and any other form of mount fails. If I run cman, then clvmd,
then gfs2, everything behaves normally.

Going by these instructions:



the resources he puts under "cluster control" (rgmanager) I have to
put under pacemaker control. Those include drbd, clvmd, and gfs2.

The difference between what I've got, and what's in "Clusters From
Scratch", is in CFS they assign one DRBD volume to a single
filesystem. I create an LVM physical volume on my DRBD resource,
as in the above tutorial, and so I have to start clvmd or the
logical volumes in the DRBD partition won't be recognized.>> Is
there some way to get logical volumes recognized automatically by
cman without rgmanager that I've missed?



Il giorno 13 marzo 2012 22:42, William Seligman<

selig...@nevis.columbia.edu

ha scritto:



On 3/13/12 12:29 PM, William Seligman wrote:

I'm not sure if this is a "Linux-HA" question; please direct
me to the appropriate list if it's not.

I'm setting up a two-node cman+pacemaker+gfs2 cluster as
described in "Clusters From Scratch." Fencing is through
forcibly rebooting a node by cutting and restoring its power
via UPS.

My fencing/failover tests have revealed a problem. If I
gracefully turn off one node ("crm node standby"; "service
pacemaker stop"; "shutdown -r now") all the resources
transfer to the other node with no problems. If I cut power
to one node (as would happen if it were fenced), the
lsb::clvmd resource on the remaining node eventually fails.
Since all the other resources depend on clvmd, all the
resources on the remaining node stop and the cluster is left
with nothing running.

I've traced why the lsb::clvmd fails: The monitor/status
command includes "vgdisplay", which hangs indefinitely.
Therefore the monitor will always time-out.

So this isn't a problem with pacemaker, but with clvmd/dlm:
If a node is cut off, the cluster isn't handling it properly.
Has anyone on this list seen this before? Any ideas?

>> Details:


versions:
Redhat Linux 6.2 (kernel 2.6.32)
cman-3.0.12.1
corosync-1.4.1
pacemaker-1.1.6
lvm2-2.02.87
lvm2-cluster-2.02.87


This may be a Linux-HA question after all!

I ran a few more tests. Here's the output from a typical test of

grep -E "(dlm|gfs2|clvmd|fenc|syslogd)" /var/log/messages



It looks like what's happening is that the fence agent (one I
wrote) is not returning the proper error code when a node
crashes. According to this page, if a fencing agent fails GFS2
will freeze to protect the data:



As a test, I tried to fence my test node via standard means:

stonith_admin -F orestes-corosync.nevis.columbia.edu

These were the log messages, which show that stonith_admin did
its job and CMAN was notified of the
fencing:.

Unfortunately, I still got the gfs2 freeze, so this is not the
complete story.

First things first. I vaguely recall a web page that went over
the STONITH return codes, but I can't locate it again. Is there
any reference to the return codes expected from a fencing
agent, perhaps as function of the state of the fencing device?
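
My working assumption - please correct me if the real agent API differs - is that the agent must exit 0 only when the requested action is confirmed to have succeeded, and exit non-zero in every other case, so the caller never believes in a fence that didn't happen. The shape I have in mind (cut_ups_power and power_is_off are placeholders for the real UPS calls):

#!/bin/sh
# Hypothetical fence-agent skeleton, not the official API spec.
ACTION=reboot
while read line; do                      # agents conventionally read key=value options on stdin
    case "$line" in
        action=*|option=*) ACTION=${line#*=} ;;
        port=*)            PORT=${line#*=} ;;
    esac
done
case "$ACTION" in
    off|reboot)
        cut_ups_power "$PORT" || exit 1
        power_is_off "$PORT"  || exit 1  # never report success without confirming the outlet is off
        exit 0 ;;
    status|monitor)
        # if I remember the convention right: 0 = port on, 2 = port off, 1 = error
        power_is_off "$PORT" && exit 2
        exit 0 ;;
    *) exit 1 ;;
esac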


--
Bill Seligman | mailto://selig...@nevis.columbia.edu
Nevis Labs, Columbia Univ | http://www.nevis.columbia.edu/~seligman/
PO Box 137|
Irvington NY 10533  USA   | Phone: (914) 591-2823



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-14 Thread emmanuel segura
Hello William

I think it's better to make clvmd start at boot:

chkconfig cman on ; chkconfig clvmd on
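
To verify afterwards that both services are enabled for the default runlevels
(a minimal check, nothing more):

# both should show 3:on 4:on 5:on after the chkconfig calls above
chkconfig --list | grep -E '^(cman|clvmd)'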



On 13 March 2012 at 23:29, William Seligman wrote:

> On 3/13/12 5:50 PM, emmanuel segura wrote:
>
> > So if you are using cman, why do you use lsb::clvmd?
> >
> > I think you are very confused
>
> I don't dispute that I may be very confused!
>
> However, from what I can tell, I still need to run clvmd even if I'm
> running
> cman (I'm not using rgmanager). If I just run cman, gfs2 and any other
> form of
> mount fails. If I run cman, then clvmd, then gfs2, everything behaves
> normally.
>
> Going by these instructions:
>
> 
>
> the resources he puts under "cluster control" (rgmanager) I have to put
> under
> pacemaker control. Those include drbd, clvmd, and gfs2.
>
> The difference between what I've got, and what's in "Clusters From
> Scratch", is
> in CFS they assign one DRBD volume to a single filesystem. I create an LVM
> physical volume on my DRBD resource, as in the above tutorial, and so I
> have to
> start clvmd or the logical volumes in the DRBD partition won't be
> recognized.
>
> Is there some way to get logical volumes recognized automatically by cman
> without rgmanager that I've missed?
>
> > On 13 March 2012 at 22:42, William Seligman <selig...@nevis.columbia.edu> wrote:
> >
> >> On 3/13/12 12:29 PM, William Seligman wrote:
> >>> I'm not sure if this is a "Linux-HA" question; please direct me to the
> >>> appropriate list if it's not.
> >>>
> >>> I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in
> >>> "Clusters From Scratch." Fencing is through forcibly rebooting a node
> by
> >>> cutting and restoring its power via UPS.
> >>>
> >>> My fencing/failover tests have revealed a problem. If I gracefully turn
> >>> off one node ("crm node standby"; "service pacemaker stop"; "shutdown
> -r
> >>> now") all the resources transfer to the other node with no problems.
> If I
> >>> cut power to one node (as would happen if it were fenced), the
> lsb::clvmd
> >>> resource on the remaining node eventually fails. Since all the other
> >>> resources depend on clvmd, all the resources on the remaining node stop
> >>> and the cluster is left with nothing running.
> >>>
> >>> I've traced why the lsb::clvmd fails: The monitor/status command
> >>> includes "vgdisplay", which hangs indefinitely. Therefore the monitor
> >>> will always time-out.
> >>>
> >>> So this isn't a problem with pacemaker, but with clvmd/dlm: If a node
> is
> >>> cut off, the cluster isn't handling it properly. Has anyone on this
> list
> >>> seen this before? Any ideas?
> >>> Details:
> >>>
> >>> versions:
> >>> Redhat Linux 6.2 (kernel 2.6.32)
> >>> cman-3.0.12.1
> >>> corosync-1.4.1
> >>> pacemaker-1.1.6
> >>> lvm2-2.02.87
> >>> lvm2-cluster-2.02.87
> >>
> >> This may be a Linux-HA question after all!
> >>
> >> I ran a few more tests. Here's the output from a typical test of
> >>
> >> grep -E "(dlm|gfs2|clvmd|fenc|syslogd)" /var/log/messages
> >>
> >> 
> >>
> >> It looks like what's happening is that the fence agent (one I wrote) is
> >> not returning the proper error code when a node crashes. According to
> this
> >> page, if a fencing agent fails GFS2 will freeze to protect the data:
> >>
> >> <
> http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Global_File_System_2/s1-gfs2hand-allnodes.html
> >
> >>
> >> As a test, I tried to fence my test node via standard means:
> >>
> >> stonith_admin -F orestes-corosync.nevis.columbia.edu
> >>
> >> These were the log messages, which show that stonith_admin did its job
> and
> >> CMAN was notified of the fencing: .
> >>
> >> Unfortunately, I still got the gfs2 freeze, so this is not the complete
> >> story.
> >>
> >> First things first. I vaguely recall a web page that went over the
> STONITH
> >> return codes, but I can't locate it again. Is there any reference to the
> >> return codes expected from a fencing agent, perhaps as function of the
> >> state of the fencing device?
>
> --
> Bill Seligman | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
> PO Box 137|
> Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/
>
>
> ___
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>



-- 
this is my life and I live it for as long as God wills
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-13 Thread William Seligman
On 3/13/12 5:50 PM, emmanuel segura wrote:

> So if you are using cman, why do you use lsb::clvmd?
> 
> I think you are very confused

I don't dispute that I may be very confused!

However, from what I can tell, I still need to run clvmd even if I'm running
cman (I'm not using rgmanager). If I just run cman, gfs2 and any other form of
mount fails. If I run cman, then clvmd, then gfs2, everything behaves normally.

Going by these instructions:



the resources he puts under "cluster control" (rgmanager) I have to put under
pacemaker control. Those include drbd, clvmd, and gfs2.

The difference between what I've got, and what's in "Clusters From Scratch", is
in CFS they assign one DRBD volume to a single filesystem. I create an LVM
physical volume on my DRBD resource, as in the above tutorial, and so I have to
start clvmd or the logical volumes in the DRBD partition won't be recognized.

Is there some way to get logical volumes recognized automatically by cman
without rgmanager that I've missed?

> On 13 March 2012 at 22:42, William Seligman wrote:
> 
>> On 3/13/12 12:29 PM, William Seligman wrote:
>>> I'm not sure if this is a "Linux-HA" question; please direct me to the
>>> appropriate list if it's not.
>>>
>>> I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in
>>> "Clusters From Scratch." Fencing is through forcibly rebooting a node by
>>> cutting and restoring its power via UPS.
>>> 
>>> My fencing/failover tests have revealed a problem. If I gracefully turn
>>> off one node ("crm node standby"; "service pacemaker stop"; "shutdown -r
>>> now") all the resources transfer to the other node with no problems. If I
>>> cut power to one node (as would happen if it were fenced), the lsb::clvmd
>>> resource on the remaining node eventually fails. Since all the other
>>> resources depend on clvmd, all the resources on the remaining node stop
>>> and the cluster is left with nothing running.
>>> 
>>> I've traced why the lsb::clvmd fails: The monitor/status command
>>> includes "vgdisplay", which hangs indefinitely. Therefore the monitor
>>> will always time-out.
>>> 
>>> So this isn't a problem with pacemaker, but with clvmd/dlm: If a node is
>>> cut off, the cluster isn't handling it properly. Has anyone on this list
>>> seen this before? Any ideas?
>>> Details:
>>>
>>> versions:
>>> Redhat Linux 6.2 (kernel 2.6.32)
>>> cman-3.0.12.1
>>> corosync-1.4.1
>>> pacemaker-1.1.6
>>> lvm2-2.02.87
>>> lvm2-cluster-2.02.87
>>
>> This may be a Linux-HA question after all!
>>
>> I ran a few more tests. Here's the output from a typical test of
>>
>> grep -E "(dlm|gfs2|clvmd|fenc|syslogd)" /var/log/messages
>>
>> 
>>
>> It looks like what's happening is that the fence agent (one I wrote) is
>> not returning the proper error code when a node crashes. According to this 
>> page, if a fencing agent fails GFS2 will freeze to protect the data:
>>
>> 
>>
>> As a test, I tried to fence my test node via standard means:
>>
>> stonith_admin -F orestes-corosync.nevis.columbia.edu
>>
>> These were the log messages, which show that stonith_admin did its job and 
>> CMAN was notified of the fencing: .
>> 
>> Unfortunately, I still got the gfs2 freeze, so this is not the complete 
>> story.
>> 
>> First things first. I vaguely recall a web page that went over the STONITH 
>> return codes, but I can't locate it again. Is there any reference to the 
>> return codes expected from a fencing agent, perhaps as function of the
>> state of the fencing device?

-- 
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
PO Box 137|
Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-13 Thread emmanuel segura
Hello William

So if you are using cman, why do you use lsb::clvmd?

I think you are very confused



On 13 March 2012 at 22:42, William Seligman wrote:

> On 3/13/12 12:29 PM, William Seligman wrote:
> > I'm not sure if this is a "Linux-HA" question; please direct me to the
> > appropriate list if it's not.
> >
> > I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in
> "Clusters
> > From Scratch." Fencing is through forcibly rebooting a node by cutting
> and
> > restoring its power via UPS.
> >
> > My fencing/failover tests have revealed a problem. If I gracefully turn
> off one
> > node ("crm node standby"; "service pacemaker stop"; "shutdown -r now")
> all the
> > resources transfer to the other node with no problems. If I cut power to
> one
> > node (as would happen if it were fenced), the lsb::clvmd resource on the
> > remaining node eventually fails. Since all the other resources depend on
> clvmd,
> > all the resources on the remaining node stop and the cluster is left with
> > nothing running.
> >
> > I've traced why the lsb::clvmd fails: The monitor/status command includes
> > "vgdisplay", which hangs indefinitely. Therefore the monitor will always
> time-out.
> >
> > So this isn't a problem with pacemaker, but with clvmd/dlm: If a node is
> cut
> > off, the cluster isn't handling it properly. Has anyone on this list
> seen this
> > before? Any ideas?
> >
> > Details:
> >
> > versions:
> > Redhat Linux 6.2 (kernel 2.6.32)
> > cman-3.0.12.1
> > corosync-1.4.1
> > pacemaker-1.1.6
> > lvm2-2.02.87
> > lvm2-cluster-2.02.87
>
> This may be a Linux-HA question after all!
>
> I ran a few more tests. Here's the output from a typical test of
>
> grep -E "(dlm|gfs2|clvmd|fenc|syslogd)" /var/log/messages
>
> 
>
> It looks like what's happening is that the fence agent (one I wrote) is not
> returning the proper error code when a node crashes. According to this
> page, if
> a fencing agent fails GFS2 will freeze to protect the data:
>
> <
> http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Global_File_System_2/s1-gfs2hand-allnodes.html
> >
>
> As a test, I tried to fence my test node via standard means:
>
> stonith_admin -F orestes-corosync.nevis.columbia.edu
>
> These were the log messages, which show that stonith_admin did its job and
> CMAN
> was notified of the fencing: .
>
> Unfortunately, I still got the gfs2 freeze, so this is not the complete
> story.
>
> First things first. I vaguely recall a web page that went over the STONITH
> return codes, but I can't locate it again. Is there any reference to the
> return
> codes expected from a fencing agent, perhaps as function of the state of
> the
> fencing device?
> --
> Bill Seligman | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
> PO Box 137|
> Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/
>
>
> ___
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>



-- 
this is my life and I live it for as long as God wills
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-13 Thread William Seligman
On 3/13/12 12:29 PM, William Seligman wrote:
> I'm not sure if this is a "Linux-HA" question; please direct me to the
> appropriate list if it's not.
> 
> I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in 
> "Clusters
> From Scratch." Fencing is through forcibly rebooting a node by cutting and
> restoring its power via UPS.
> 
> My fencing/failover tests have revealed a problem. If I gracefully turn off 
> one
> node ("crm node standby"; "service pacemaker stop"; "shutdown -r now") all the
> resources transfer to the other node with no problems. If I cut power to one
> node (as would happen if it were fenced), the lsb::clvmd resource on the
> remaining node eventually fails. Since all the other resources depend on 
> clvmd,
> all the resources on the remaining node stop and the cluster is left with
> nothing running.
> 
> I've traced why the lsb::clvmd fails: The monitor/status command includes
> "vgdisplay", which hangs indefinitely. Therefore the monitor will always 
> time-out.
> 
> So this isn't a problem with pacemaker, but with clvmd/dlm: If a node is cut
> off, the cluster isn't handling it properly. Has anyone on this list seen this
> before? Any ideas?
> 
> Details:
> 
> versions:
> Redhat Linux 6.2 (kernel 2.6.32)
> cman-3.0.12.1
> corosync-1.4.1
> pacemaker-1.1.6
> lvm2-2.02.87
> lvm2-cluster-2.02.87

This may be a Linux-HA question after all!

I ran a few more tests. Here's the output from a typical test of

grep -E "(dlm|gfs2|clvmd|fenc|syslogd)" /var/log/messages



It looks like what's happening is that the fence agent (one I wrote) is not
returning the proper error code when a node crashes. According to this page, if
a fencing agent fails GFS2 will freeze to protect the data:



As a test, I tried to fence my test node via standard means:

stonith_admin -F orestes-corosync.nevis.columbia.edu

These were the log messages, which show that stonith_admin did its job and CMAN
was notified of the fencing: .

Unfortunately, I still got the gfs2 freeze, so this is not the complete story.

First things first. I vaguely recall a web page that went over the STONITH
return codes, but I can't locate it again. Is there any reference to the return
codes expected from a fencing agent, perhaps as function of the state of the
fencing device?
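
While I look for that reference, this is the skeleton I'm assuming: exit 0
only when the node is confirmed fenced, non-zero otherwise. Only a sketch;
the option key names and the ups_* helpers below are placeholders, not real
commands:

#!/bin/bash
# Hypothetical fence agent skeleton. fenced/stonithd pass options as
# key=value pairs on stdin; the contract assumed here is:
#   exit 0  -> node confirmed fenced (power verified off or cycled)
#   exit !0 -> fencing failed, so DLM/GFS2 recovery must not proceed
ups_cut_power()     { echo "would cut power to $1"; }          # placeholder
ups_restore_power() { echo "would restore power to $1"; }      # placeholder
ups_status()        { echo "would query UPS outlet for $1"; }  # placeholder

action="reboot"; node=""
while read -r line; do
    case "$line" in
        action=*)   action="${line#action=}" ;;
        nodename=*) node="${line#nodename=}" ;;   # key names are assumptions
        port=*)     node="${line#port=}" ;;
    esac
done

case "$action" in
    off)            ups_cut_power "$node" || exit 1 ;;
    reboot)         { ups_cut_power "$node" && ups_restore_power "$node"; } || exit 1 ;;
    on)             ups_restore_power "$node" || exit 1 ;;
    status|monitor) ups_status "$node" || exit 1 ;;
    *)              exit 1 ;;
esac
exit 0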
-- 
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
PO Box 137|
Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-13 Thread William Seligman
On 3/13/12 2:49 PM, emmanuel segura wrote:
> Sorry William
> 
> But I think clvmd must be used with
> 
> ocf:lvm2:clvmd
> 
> example:
> 
> crm configure
> primitive clvmd  ocf:lvm2:clvmd params daemon_timeout="30"
> 
> clone cln_clvmd clvmd
> 
> and remember clvmd depends on dlm, so you should do the same for dlm

I don't have an ocf:lvm2:clvmd resource on my system. When I do a web search, it
looks like a resource found on SUSE systems, but not on RHEL distros.

Based on "Clusters From Scratch", I think that if I'm using cman, dlm is
started automatically. I see dlm_controld is running without my explicitly
starting it:

# ps aux | grep dlm_controld
root  2495  0.0  0.0 234688  7564 ?  Ssl  12:32   0:00 dlm_controld

I should have also mentioned that I can duplicate this problem outside
pacemaker. That is, I can start cman, clvmd, and gfs2 manually on both nodes,
cut off power on one node, and clustering fails on the other node. So I suspect
it's not a pacemaker resource problem.
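
Concretely, this is what I do on the surviving node after pulling power from
the other one (a small sketch of the diagnostics I look at):

vgdisplay                                          # hangs here indefinitely
dlm_tool ls                                        # DLM lockspaces: clvmd plus one per GFS2 mount
grep -E "(dlm|gfs2|clvmd|fenc)" /var/log/messages | tail -n 50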

For a moment I thought I might not have used "-p lock_dlm" when I created my
GFS2 filesystems, but I think the output of "gfs2_edit -p sb ..." shows that I
did it correctly: .

When I looked more carefully at my lvm.conf, I saw that I had a typo:

fallback_to_local_locking=4

I changed it to the correct value (according to
):

fallback_to_local_locking=0
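
For the record, the clustered-locking stanza in lvm.conf should now look
roughly like this (a trimmed sketch, assuming the usual locking_type = 3 that
lvmconf --enable-cluster sets):

global {
    locking_type = 3                  # built-in clustered locking via clvmd
    fallback_to_local_locking = 0     # never quietly fall back to local locking
}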

Unfortunately this doesn't solve the problem.

So... any ideas?

> On 13 March 2012 at 17:29, William Seligman wrote:
> 
>> I'm not sure if this is a "Linux-HA" question; please direct me to the 
>> appropriate list if it's not.
>> 
>> I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in 
>> "Clusters From Scratch." Fencing is through forcibly rebooting a node by
>> cutting and restoring its power via UPS.
>> 
>> My fencing/failover tests have revealed a problem. If I gracefully turn off
>> one node ("crm node standby"; "service pacemaker stop"; "shutdown -r now")
>> all the resources transfer to the other node with no problems. If I cut
>> power to one node (as would happen if it were fenced), the lsb::clvmd
>> resource on the remaining node eventually fails. Since all the other
>> resources depend on clvmd, all the resources on the remaining node stop and
>> the cluster is left with nothing running.
>>
>> I've traced why the lsb::clvmd fails: The monitor/status command includes
>> "vgdisplay", which hangs indefinitely. Therefore the monitor will always
>> time-out.
>>
>> So this isn't a problem with pacemaker, but with clvmd/dlm: If a node is 
>> cut off, the cluster isn't handling it properly. Has anyone on this list
>> seen this before? Any ideas?
>>
>> Details:
>>
>> versions:
>> Redhat Linux 6.2 (kernel 2.6.32)
>> cman-3.0.12.1
>> corosync-1.4.1
>> pacemaker-1.1.6
>> lvm2-2.02.87
>> lvm2-cluster-2.02.87
>>
>> cluster.conf: 
>> output of "crm configure show": 
>> output of "lvm dumpconfig": 
>>
>> /var/log/cluster/dlm_controld.log and /var/log/cluster/gfs_controld.log 
>> show nothing. When I shut down power to one nodes (orestes-tb), the output
>> of grep -E "(dlm|gfs2|clvmd)" /var/log/messages is 
>> .
-- 
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
PO Box 137|
Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-13 Thread emmanuel segura
Sorry William

But I think clvmd must be used with

ocf:lvm2:clvmd

example:

crm configure
primitive clvmd  ocf:lvm2:clvmd params daemon_timeout="30"

clone cln_clvmd clvmd

and remember clvmd depends on dlm, so you should do the same for dlm
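
Something along these lines (only a sketch: ocf:pacemaker:controld is the
agent normally used for dlm_controld, ocf:lvm2:clvmd may or may not exist on
your distribution, and the monitor timeouts are just examples; check what
"crm ra list ocf" shows on your nodes):

crm configure
primitive dlm ocf:pacemaker:controld op monitor interval="60" timeout="60"
primitive clvmd ocf:lvm2:clvmd params daemon_timeout="30" op monitor interval="60" timeout="60"
clone cln_dlm dlm meta interleave="true"
clone cln_clvmd clvmd meta interleave="true"
colocation col_clvmd_with_dlm inf: cln_clvmd cln_dlm
order ord_dlm_before_clvmd inf: cln_dlm cln_clvmd
commit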



On 13 March 2012 at 17:29, William Seligman wrote:

> I'm not sure if this is a "Linux-HA" question; please direct me to the
> appropriate list if it's not.
>
> I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in
> "Clusters
> From Scratch." Fencing is through forcibly rebooting a node by cutting and
> restoring its power via UPS.
>
> My fencing/failover tests have revealed a problem. If I gracefully turn
> off one
> node ("crm node standby"; "service pacemaker stop"; "shutdown -r now") all
> the
> resources transfer to the other node with no problems. If I cut power to
> one
> node (as would happen if it were fenced), the lsb::clvmd resource on the
> remaining node eventually fails. Since all the other resources depend on
> clvmd,
> all the resources on the remaining node stop and the cluster is left with
> nothing running.
>
> I've traced why the lsb::clvmd fails: The monitor/status command includes
> "vgdisplay", which hangs indefinitely. Therefore the monitor will always
> time-out.
>
> So this isn't a problem with pacemaker, but with clvmd/dlm: If a node is
> cut
> off, the cluster isn't handling it properly. Has anyone on this list seen
> this
> before? Any ideas?
>
> Details:
>
> versions:
> Redhat Linux 6.2 (kernel 2.6.32)
> cman-3.0.12.1
> corosync-1.4.1
> pacemaker-1.1.6
> lvm2-2.02.87
> lvm2-cluster-2.02.87
>
> cluster.conf: 
> output of "crm configure show": 
> output of "lvm dumpconfig": 
>
> /var/log/cluster/dlm_controld.log and /var/log/cluster/gfs_controld.log
> show
> nothing. When I shut down power to one nodes (orestes-tb), the output of
> grep -E "(dlm|gfs2|clvmd)" /var/log/messages is <
> http://pastebin.com/vjpvCFeN>.
>
> --
> Bill Seligman | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
> PO Box 137|
> Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/
>
>
> ___
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>



-- 
this is my life and I live it for as long as God wills
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems