Re: [Linux-HA] Server becomes unresponsive after node failure

2011-03-11 Thread Dimitri Maziuk
Dejan Muhamedagic wrote:

>  I guess that in some shops you'd need to
> clone yourself or something else; otherwise you just wouldn't scale
> with demand.

Yeah, I keep telling my boss that.

>> ... split-brain 
> So, did you have stonith in place then? ;-)

I have users instead; they come and tell me the Internet is broken when 
that sort of thing happens.

> The cost of decent fencing hardware is nowadays really small.
> And the probability of the power supplies going bad is much
> higher than that of PDU/PSU.

Interestingly enough, out of approx. 300 computer-years here, the kit 
from our server vendor had 3 PSU failures recently, on 3 identical 
machines bought some 4 years ago. (The other failure was a SATA backplane.)

With 4 hardware failures on 60 servers in 5 years, it's hard to justify 
decent fencing hardware even if it were free. And net-connected PDUs 
that can power a full-height server cabinet are actually far from free.

Dima
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Server becomes unresponsive after node failure

2011-03-11 Thread Dejan Muhamedagic
On Wed, Mar 09, 2011 at 11:51:20AM -0600, Dimitri Maziuk wrote:
> Dejan Muhamedagic wrote:
> > On Tue, Mar 08, 2011 at 02:27:52PM -0600, Dimitri Maziuk wrote:
> 
> >> Well, realistically, if the link is a foot of x/over cable and gremlins 
> >> have not been pulling on it, and the NICs aren't falling out of their 
> >> slots, and are half-decent quality hardware, and the drivers aren't 
> >> alpha prototype code, and so on, the chances of it being the "link down" 
> >> case should be fairly low.
> > 
> LOL. BTW, the gremlins I saw doing that were wearing company
> badges and pulling the wrong cables. Realistically, never
> underestimate the human factor :)
> 
> That's why we put locks on our server room doors. So that I am the only 
> gremlin there.

Well, that's good for your cluster too. But it places a bit of
an extra burden on you. I guess that in some shops you'd need to
clone yourself or something else; otherwise you just wouldn't scale
with demand.

> (Last time I saw split-brain was when I myself pulled on the x/over 
> cable. Really. If you have an rj45 connector with the little tab broken 
> off, throw it out and get a new one now. Trust me, it's much cheaper 
> than the alternative.)

So, did you have stonith in place then? ;-)

> Seriously, though, you have to weigh the cost of IPMI daughterboards or 
> net-connected power strips vs the likelihood of losing that cross-over 
> link vs the likelihood of the power strip itself going titsup and taking 
> down your entire cluster. *Then* say "go get a real stonith device".

The cost of decent fencing hardware is nowadays really small.
And the probability of the power supplies going bad is much
higher than that of PDU/PSU. And so on. :)

Cheers,

Dejan

> Dima
> -- 
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
> ___
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Server becomes unresponsive after node failure

2011-03-09 Thread Dimitri Maziuk
Dejan Muhamedagic wrote:
> On Tue, Mar 08, 2011 at 02:27:52PM -0600, Dimitri Maziuk wrote:

>> Well, realistically, if the link is a foot of x/over cable and gremlins 
>> have not been pulling on it, and the NICs aren't falling out of their 
>> slots, and are half-decent quality hardware, and the drivers aren't 
>> alpha prototype code, and so on, the chances of it being the "link down" 
>> case should be fairly low.
> 
> LOL. BTW, the gremlins I saw doing that were wearing company
> badges and pulling the wrong cables. Realistically, never
> underestimate the human factor :)

That's why we put locks on our server room doors. So that I am the only 
gremlin there.

(Last time I saw split-brain was when I myself pulled on the x/over 
cable. Really. If you have an rj45 connector with the little tab broken 
off, throw it out and get a new one now. Trust me, it's much cheaper 
than the alternative.)

Seriously, though, you have to weigh the cost of IPMI daughterboards or 
net-connected power strips vs the likelihood of losing that cross-over 
link vs the likelihood of the power strip itself going titsup and taking 
down your entire cluster. *Then* say "go get a real stonith device".

Dima
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Server becomes unresponsive after node failure

2011-03-09 Thread Dejan Muhamedagic
On Tue, Mar 08, 2011 at 02:27:52PM -0600, Dimitri Maziuk wrote:
> Lars Ellenberg wrote:
> > 
> > Oh, that's easy.  external/ssh pings the victim, and if it does not
> > answer, which will be the case for a down node as well as a down link,
> > stonith is considered to have been successful ;-)
> > 
> > In the "node down" case, this will allow the cluster to proceed,
> > and all is well.
> > 
> > But in the "link down" case, this will allow the cluster to proceed,
> > even though the victim will continue to run its services, causing
> > cluster split brain and data corruption.
> 
> Well, realistically, if the link is a foot of x/over cable and gremlins 
> have not been pulling on it, and the NICs aren't falling out of their 
> slots, and are half-decent quality hardware, and the drivers aren't 
> alpha prototype code, and so on, the chances of it being the "link down" 
> case should be fairly low.

LOL. BTW, the gremlins I saw doing that were wearing company
badges and pulling the wrong cables. Realistically, never
underestimate the human factor :)

Dejan

> Dima
> -- 
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
> ___
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Server becomes unresponsive after node failure

2011-03-08 Thread Dimitri Maziuk
Lars Ellenberg wrote:
> 
> Oh, that's easy.  external/ssh pings the victim, and if it does not
> answer, which will be the case for a down node as well as a down link,
> stonith is considered to have been successful ;-)
> 
> In the "node down" case, this will allow the cluster to proceed,
> and all is well.
> 
> But in the "link down" case, this will allow the cluster to proceed,
> even though the victim will continue to run its services, causing
> cluster split brain and data corruption.

Well, realistically, if the link is a foot of x/over cable and gremlins 
have not been pulling on it, and the NICs aren't falling out of their 
slots, and are half-decent quality hardware, and the drivers aren't 
alpha prototype code, and so on, the chances of it being the "link down" 
case should be fairly low.

Dima
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Server becomes unresponsive after node failure

2011-03-08 Thread Lars Ellenberg
On Tue, Mar 08, 2011 at 05:43:17PM +0100, Dejan Muhamedagic wrote:
> Hi,
> 
> On Tue, Mar 08, 2011 at 05:32:44PM +0100, Sascha Hagedorn wrote:
> > Hi Dejan,
> > 
> > Thank you for your answer. I added an external/ssh stonith resource
> > to test this and it resolved the problem. It wasn't clear to me that
> > the stonith resource does more than shoot the other node. Apparently
> > some cluster parameters are being set too, so the system stays clean.
> > During the test my understanding was that when I cut the power of one
> > node, I wouldn't need a stonith device to shoot it.
> 
> Hmm, I wonder how external/ssh could've solved this particular
> issue, since if you pull the plug it will never be able to fence
> that node.

Oh, that's easy.  external/ssh pings the victim, and if it does not
answer, which will be the case for a down node as well as a down link,
stonith is considered to have been successful ;-)

In the "node down" case, this will allow the cluster to proceed,
and all is well.

But in the "link down" case, this will allow the cluster to proceed,
even though the victim will continue to run its services, causing
cluster split brain and data corruption.

That's why:

> You really need a usable stonith device. external/ssh
> is for testing only.
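
For what it's worth, a real fencing resource in crm syntax looks roughly
like the sketch below -- node names taken from this thread, IPMI-capable
BMCs assumed, addresses and credentials made up, and parameter names
differ per plugin ("stonith -t external/ipmi -n" should list the ones
this one expects):

  primitive st-node1 stonith:external/ipmi \
          params hostname="cluster-node1" ipaddr="10.0.0.1" \
                 userid="admin" passwd="secret" interface="lan" \
          op monitor interval="60m"
  primitive st-node2 stonith:external/ipmi \
          params hostname="cluster-node2" ipaddr="10.0.0.2" \
                 userid="admin" passwd="secret" interface="lan" \
          op monitor interval="60m"
  location l-st-node1 st-node1 -inf: cluster-node1
  location l-st-node2 st-node2 -inf: cluster-node2
  property stonith-enabled="true"

The location constraints just keep a node from being asked to fence
itself.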

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Server becomes unresponsive after node failure

2011-03-08 Thread Dejan Muhamedagic
Hi,

On Tue, Mar 08, 2011 at 05:32:44PM +0100, Sascha Hagedorn wrote:
> Hi Dejan,
> 
> Thank you for your answer. I added an external/ssh stonith resource to test 
> this and it resolved the problem. It wasn't clear to me that the stonith 
> resource does more than shoot the other node. Apparently some cluster 
> parameters are being set too, so the system stays clean. During the test my 
> understanding was that when I cut the power of one node, I wouldn't need a 
> stonith device to shoot it.

Hmm, I wonder how external/ssh could've solved this particular
issue, since if you pull the plug it will never be able to fence
that node. You really need a usable stonith device. external/ssh
is for testing only.
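
(For a test bench, the usual shape is something along these lines --
node names from your cluster, the rest is just a sketch:

  primitive st-ssh stonith:external/ssh \
          params hostlist="cluster-node1 cluster-node2" \
          op monitor interval="60m"
  clone cl_st-ssh st-ssh
  property stonith-enabled="true"

It can only "fence" a node it can still reach over ssh, which is why it
proves nothing about real failure scenarios.)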

Thanks,

Dejan

> 
> Thanks again,
> Sascha
> 
> -Original Message-
> From: linux-ha-boun...@lists.linux-ha.org 
> [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Dejan Muhamedagic
> Sent: Monday, March 7, 2011 16:43
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] Server becomes unresponsive after node failure
> 
> Hi,
> 
> On Mon, Mar 07, 2011 at 10:55:01AM +0100, Sascha Hagedorn wrote:
> > Hello everyone,
> >
> > I am evaluating a two-node cluster setup and I am running into some 
> > problems. The cluster runs a dual-master DRBD disk with an OCFS2 
> > filesystem. Here are the software versions used:
> >
> >
> > -  SLES11 + HAE Extension
> 
> SLE11 is not supported anymore; you'd need to upgrade to SLE11SP1.
> 
> > -  DRBD 8.3.7
> >
> > -  OCFS2 1.4.2
> >
> > -  libdlm 3.00.01
> >
> > -  cluster-glue 1.0.5
> >
> > -  Pacemaker 1.1.2
> >
> > -  OpenAIS 1.1.2
> >
> > The problem occurs when the second node is being powered off instantly by 
> > pulling the power cable.  Shortly after that the load average on the 
> > surviving system goes up at a very high rate, with no CPU utilization until 
> > the server becomes unresponsive. Processes I see in the top list very 
> > frequently are cib, dlm_controld, corosync and ha_logd. Access to the DRBD 
> > partition is not possible, although the crm_mon shows it is being mounted 
> > and all services are running. An "ls" on the DRBD OCFS2 partition results 
> > in a hanging prompt (So does "df" or any other command accessing the 
> > partition).
> 
> You created a split-brain condition, but have no stonith
> resources (and stonith is disabled). That won't work.
> 
> Thanks,
> 
> Dejan
> 
> >
> > crm_mon after the power is cut on cluster-node2:
> >
> > 
> > Last updated: Mon Mar  7 10:32:10 2011
> > Stack: openais
> > Current DC: cluster-node1 - partition WITHOUT quorum
> > Version: 1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5
> > 2 Nodes configured, 2 expected votes
> > 4 Resources configured.
> > 
> >
> > Online: [ cluster-node1 ]
> > OFFLINE: [ cluster-node2 ]
> >
> > Master/Slave Set: ms_drbd
> >  Masters: [ cluster-node1 ]
> >  Stopped: [ p_drbd:1 ]
> > Clone Set: cl_dlm
> >  Started: [ cluster-node1 ]
> >  Stopped: [ p_dlm:1 ]
> > Clone Set: cl_o2cb
> >  Started: [ cluster-node1 ]
> >  Stopped: [ p_o2cb:1 ]
> > Clone Set: cl_fs
> >  Started: [ cluster-node1 ]
> >  Stopped: [ p_fs:1 ]
> >
> > The configuration is as follows:
> >
> > node cluster-node1
> > node cluster-node2
> > primitive p_dlm ocf:pacemaker:controld \
> > op monitor interval="120s"
> > primitive p_drbd ocf:linbit:drbd \
> > params drbd_resource="r0" \
> > operations $id="p_drbd-operations" \
> > op monitor interval="20" role="Master" timeout="20" \
> > op monitor interval="30" role="Slave" timeout="20"
> > primitive p_fs ocf:heartbeat:Filesystem \
> > params device="/dev/drbd0" directory="/data" fstype="ocfs2" \
> > op monitor interval="120s"
> > primitive p_o2cb ocf:ocfs2:o2cb \
> > op monitor interval="120s"
> > ms ms_drbd p_drbd \
> > meta resource-stickines="100" notify="true" master-max="2" 
> > interleave="true"
> > clone cl_dlm p_dlm \
> > meta globally-unique="false" interleave="true"
> > clone cl_fs p_fs \
> > meta interleave="true&

Re: [Linux-HA] Server becomes unresponsive after node failure

2011-03-08 Thread Sascha Hagedorn
Hi Dejan,

Thank you for your answer. I added an external/ssh stonith resource to test 
this and it resolved the problem. It wasn't clear to me that the stonith 
resource does more than shoot the other node. Apparently some cluster 
parameters are being set too, so the system stays clean. During the test my 
understanding was that when I cut the power of one node, I wouldn't need a 
stonith device to shoot it.

Thanks again,
Sascha

-Original Message-
From: linux-ha-boun...@lists.linux-ha.org 
[mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Dejan Muhamedagic
Sent: Monday, March 7, 2011 16:43
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] Server becomes unresponsive after node failure

Hi,

On Mon, Mar 07, 2011 at 10:55:01AM +0100, Sascha Hagedorn wrote:
> Hello everyone,
>
> I am evaluating a two-node cluster setup and I am running into some problems. 
> The cluster runs a dual-master DRBD disk with an OCFS2 filesystem. Here are 
> the software versions used:
>
>
> -  SLES11 + HAE Extension

SLE11 is not supported anymore; you'd need to upgrade to SLE11SP1.

> -  DRBD 8.3.7
>
> -  OCFS2 1.4.2
>
> -  libdlm 3.00.01
>
> -  cluster-glue 1.0.5
>
> -  Pacemaker 1.1.2
>
> -  OpenAIS 1.1.2
>
> The problem occurs when the second node is being powered off instantly by 
> pulling the power cable.  Shortly after that the load average on the 
> surviving system goes up at a very high rate, with no CPU utilization until 
> the server becomes unresponsive. Processes I see in the top list very 
> frequently are cib, dlm_controld, corosync and ha_logd. Access to the DRBD 
> partition is not possible, although the crm_mon shows it is being mounted and 
> all services are running. An "ls" on the DRBD OCFS2 partition results in a 
> hanging prompt (So does "df" or any other command accessing the partition).

You created a split-brain condition, but have no stonith
resources (and stonith is disabled). That won't work.

Thanks,

Dejan

>
> crm_mon after the power is cut on cluster-node2:
>
> 
> Last updated: Mon Mar  7 10:32:10 2011
> Stack: openais
> Current DC: cluster-node1 - partition WITHOUT quorum
> Version: 1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> 
>
> Online: [ cluster-node1 ]
> OFFLINE: [ cluster-node2 ]
>
> Master/Slave Set: ms_drbd
>  Masters: [ cluster-node1 ]
>  Stopped: [ p_drbd:1 ]
> Clone Set: cl_dlm
>  Started: [ cluster-node1 ]
>  Stopped: [ p_dlm:1 ]
> Clone Set: cl_o2cb
>  Started: [ cluster-node1 ]
>  Stopped: [ p_o2cb:1 ]
> Clone Set: cl_fs
>  Started: [ cluster-node1 ]
>  Stopped: [ p_fs:1 ]
>
> The configuration is as follows:
>
> node cluster-node1
> node cluster-node2
> primitive p_dlm ocf:pacemaker:controld \
> op monitor interval="120s"
> primitive p_drbd ocf:linbit:drbd \
> params drbd_resource="r0" \
> operations $id="p_drbd-operations" \
> op monitor interval="20" role="Master" timeout="20" \
> op monitor interval="30" role="Slave" timeout="20"
> primitive p_fs ocf:heartbeat:Filesystem \
> params device="/dev/drbd0" directory="/data" fstype="ocfs2" \
> op monitor interval="120s"
> primitive p_o2cb ocf:ocfs2:o2cb \
> op monitor interval="120s"
> ms ms_drbd p_drbd \
> meta resource-stickines="100" notify="true" master-max="2" 
> interleave="true"
> clone cl_dlm p_dlm \
> meta globally-unique="false" interleave="true"
> clone cl_fs p_fs \
> meta interleave="true" ordered="true"
> clone cl_o2cb p_o2cb \
> meta globally-unique="false" interleave="true"
> colocation co_dlm-drbd inf: cl_dlm ms_drbd:Master
> colocation co_fs-o2cb inf: cl_fs cl_o2cb
> colocation co_o2cb-dlm inf: cl_o2cb cl_dlm
> order o_dlm-o2cb 0: cl_dlm cl_o2cb
> order o_drbd-dlm 0: ms_drbd:promote cl_dlm
> order o_o2cb-fs 0: cl_o2cb cl_fs
> property $id="cib-bootstrap-options" \
> dc-version="1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> stonith-enabled="false" \
> no-quorum-policy="ignore"
>
> Here is a  snippet from /var/log/messages (power cut at 10:32:02):
>
> Mar  7 10:32:03 cluster-node1 ker

Re: [Linux-HA] Server becomes unresponsive after node failure

2011-03-07 Thread Dejan Muhamedagic
Hi,

On Mon, Mar 07, 2011 at 10:55:01AM +0100, Sascha Hagedorn wrote:
> Hello everyone,
> 
> I am evaluating a two-node cluster setup and I am running into some problems. 
> The cluster runs a dual-master DRBD disk with an OCFS2 filesystem. Here are 
> the software versions used:
> 
> 
> -  SLES11 + HAE Extension

SLE11 is not supported anymore; you'd need to upgrade to SLE11SP1.

> -  DRBD 8.3.7
> 
> -  OCFS2 1.4.2
> 
> -  libdlm 3.00.01
> 
> -  cluster-glue 1.0.5
> 
> -  Pacemaker 1.1.2
> 
> -  OpenAIS 1.1.2
> 
> The problem occurs when the second node is being powered off instantly by 
> pulling the power cable.  Shortly after that the load average on the 
> surviving system goes up at a very high rate, with no CPU utilization until 
> the server becomes unresponsive. Processes I see in the top list very 
> frequently are cib, dlm_controld, corosync and ha_logd. Access to the DRBD 
> partition is not possible, although the crm_mon shows it is being mounted and 
> all services are running. An "ls" on the DRBD OCFS2 partition results in a 
> hanging prompt (So does "df" or any other command accessing the partition).

You created a split-brain condition, but have no stonith
resources (and stonith is disabled). That won't work.
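
At the very least stonith has to be enabled and a fencing resource
added. Roughly -- this is only a sketch, and the stonith primitive
itself has to match whatever management hardware (IPMI, iLO, a switched
PDU, ...) the nodes actually have:

  property $id="cib-bootstrap-options" \
          dc-version="1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5" \
          cluster-infrastructure="openais" \
          expected-quorum-votes="2" \
          stonith-enabled="true" \
          no-quorum-policy="ignore"

no-quorum-policy="ignore" is fine for a two-node cluster, but only in
combination with working fencing.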

Thanks,

Dejan

> 
> crm_mon after the power is cut on cluster-node2:
> 
> 
> Last updated: Mon Mar  7 10:32:10 2011
> Stack: openais
> Current DC: cluster-node1 - partition WITHOUT quorum
> Version: 1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> 
> 
> Online: [ cluster-node1 ]
> OFFLINE: [ cluster-node2 ]
> 
> Master/Slave Set: ms_drbd
>  Masters: [ cluster-node1 ]
>  Stopped: [ p_drbd:1 ]
> Clone Set: cl_dlm
>  Started: [ cluster-node1 ]
>  Stopped: [ p_dlm:1 ]
> Clone Set: cl_o2cb
>  Started: [ cluster-node1 ]
>  Stopped: [ p_o2cb:1 ]
> Clone Set: cl_fs
>  Started: [ cluster-node1 ]
>  Stopped: [ p_fs:1 ]
> 
> The configuration is as follows:
> 
> node cluster-node1
> node cluster-node2
> primitive p_dlm ocf:pacemaker:controld \
> op monitor interval="120s"
> primitive p_drbd ocf:linbit:drbd \
> params drbd_resource="r0" \
> operations $id="p_drbd-operations" \
> op monitor interval="20" role="Master" timeout="20" \
> op monitor interval="30" role="Slave" timeout="20"
> primitive p_fs ocf:heartbeat:Filesystem \
> params device="/dev/drbd0" directory="/data" fstype="ocfs2" \
> op monitor interval="120s"
> primitive p_o2cb ocf:ocfs2:o2cb \
> op monitor interval="120s"
> ms ms_drbd p_drbd \
> meta resource-stickines="100" notify="true" master-max="2" 
> interleave="true"
> clone cl_dlm p_dlm \
> meta globally-unique="false" interleave="true"
> clone cl_fs p_fs \
> meta interleave="true" ordered="true"
> clone cl_o2cb p_o2cb \
> meta globally-unique="false" interleave="true"
> colocation co_dlm-drbd inf: cl_dlm ms_drbd:Master
> colocation co_fs-o2cb inf: cl_fs cl_o2cb
> colocation co_o2cb-dlm inf: cl_o2cb cl_dlm
> order o_dlm-o2cb 0: cl_dlm cl_o2cb
> order o_drbd-dlm 0: ms_drbd:promote cl_dlm
> order o_o2cb-fs 0: cl_o2cb cl_fs
> property $id="cib-bootstrap-options" \
> dc-version="1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> stonith-enabled="false" \
> no-quorum-policy="ignore"
> 
> Here is a  snippet from /var/log/messages (power cut at 10:32:02):
> 
> Mar  7 10:32:03 cluster-node1 kernel: [ 4714.838629] r8169: eth0: link down
> Mar  7 10:32:06 cluster-node1 corosync[4300]:   [TOTEM ] A processor failed, 
> forming new configuration.
> Mar  7 10:32:06 cluster-node1 kernel: [ 4717.748011] block drbd0: PingAck did 
> not arrive in time.
> Mar  7 10:32:06 cluster-node1 kernel: [ 4717.748020] block drbd0: peer( 
> Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> 
> DUnknown )
> Mar  7 10:32:06 cluster-node1 kernel: [ 4717.748031] block drbd0: asender 
> terminated
> Mar  7 10:32:06 cluster-node1 kernel: [ 4717.748035] block drbd0: short read 
> expecting header on sock: r=-512
> Mar  7 10:32:06 cluster-node1 kernel: [ 4717.748037] block drbd0: Terminating 
> asender thread
> Mar  7 10:32:06 cluster-node1 kernel: [ 4717.748068] block drbd0: Creating 
> new current UUID
> Mar  7 10:32:06 cluster-node1 kernel: [ 4717.763424] block drbd0: Connection 
> closed
> Mar  7 10:32:06 cluster-node1 kernel: [ 4717.763429] block drbd0: conn( 
> NetworkFailure -> Unconnected )
> Mar  7 10:32:06 cluster-node1 kernel: [ 4717.763434] block drbd0: receiver 
> terminated
> Mar  7 10:32:06 cluster-node1 kernel: [ 4717.763436] block drbd0: Restarting 
> receiver thread
> Mar  7 10:32:06 cluster-node1 kernel: [ 4717.763439] block drbd0: receiver 
> (re)started
> Mar  7 10:32:06 cluster-node1 kernel: [ 4