Re: [ClusterLabs] 2 node cluster dlm/clvm trouble

2018-09-11 Thread Valentin Vidic
On Tue, Sep 11, 2018 at 09:31:13AM -0400, Patrick Whitney wrote:
> But, when I invoke the "human" stonith power device (i.e. I turn the node
> off), the other node collapses...
> 
> In the logs I supplied, I basically do this:
> 
> 1. stonith fence (With fence scsi)

After fence_scsi finishes the node should not show any signs of life.
If it continues to work on the network after this point it can cause
trouble.

> 2. verify UI shows fenced node as stopped
> 3. power off fenced node

Do you use the poweroff command to shut down the node, or do you turn it
off some other way?

If you don't have any other fence plugin you can use, try testing with
meatware. Stonith will wait until you manually confirm with meatclient
that the node is down.

-- 
Valentin
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] 2 node cluster dlm/clvm trouble

2018-09-11 Thread Vladislav Bogdanov

On 11.09.2018 16:31, Patrick Whitney wrote:
> But, when I invoke the "human" stonith power device (i.e. I turn the
> node off), the other node collapses...
>
> In the logs I supplied, I basically do this:
>
> 1. stonith fence (With fence scsi)

At this point DLM on the healthy node is notified that the node was fenced
and expects no connections from DLM on the fenced node. What happens if it
sees such a connection is hidden deep in the code.

> 2. verify UI shows fenced node as stopped

Then I wouldn't trust such a UI.

> 3. power off fenced node
>
> It's only when I shut down the fenced node that the running node falls
> over.
>
> How would using a power fencing agent differ from me manually removing
> power?

There is a delay between the fence success notification to DLM and the
actual power off. With power fencing, the notification goes out after power
is cut.
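
The ordering difference described above can be sketched as a toy model in
Python. This is only an illustration of the reasoning, not DLM or Pacemaker
code: the event names, the "storage" and "power" labels, and both functions
are assumptions made up for this example.

```python
# Toy model of the event ordering; none of these names come from DLM,
# Pacemaker, or any fence agent.

def events(agent_style):
    """Return the assumed order of events for a fencing style."""
    if agent_style == "storage":
        # Storage fencing followed by a manual power off: success is
        # reported while the fenced node is still up and on the network.
        return ["fence_success_notified", "node_still_running", "power_off"]
    if agent_style == "power":
        # Power fencing: success is reported only after power is cut.
        return ["power_off", "fence_success_notified"]
    raise ValueError(f"unknown agent style: {agent_style}")


def stale_window(agent_style):
    """True if the fenced node can still talk after DLM was told it is fenced."""
    order = events(agent_style)
    return order.index("fence_success_notified") < order.index("power_off")


print(stale_window("storage"))  # True
print(stale_window("power"))    # False
```

In this model the "storage" case leaves a window in which the fenced node is
still alive after the notification, which is the delay mentioned above.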




> Thanks (I very much appreciate the discussion!)
>
> Best,
> -Pat
>
> Would it be useful to show logs of what that looks like?
>
> On Tue, Sep 11, 2018 at 9:22 AM Valentin Vidic wrote:
>
> > On Tue, Sep 11, 2018 at 09:13:08AM -0400, Patrick Whitney wrote:
> > > So when the cluster suggests that DLM is shutdown on coro-test-1:
> > > Clone Set: dlm-clone [dlm]
> > >      Started: [ coro-test-2 ]
> > >      Stopped: [ coro-test-1 ]
> > >
> > > ... DLM isn't actually stopped on 1?
> >
> > If you can connect to the node and see dlm services running then
> > it is not stopped:
> >
> > 20101 dlm_controld
> > 20245 dlm_scand
> > 20246 dlm_recv
> > 20247 dlm_send
> > 20248 dlm_recoverd
> >
> > But if you kill the power on the node then it will be gone for sure :)
> >
> > --
> > Valentin
>
> --
> Patrick Whitney
> DevOps Engineer -- Tools




Re: [ClusterLabs] 2 node cluster dlm/clvm trouble

2018-09-11 Thread Patrick Whitney
But, when I invoke the "human" stonith power device (i.e. I turn the node
off), the other node collapses...

In the logs I supplied, I basically do this:

1. stonith fence (With fence scsi)
2. verify UI shows fenced node as stopped
3. power off fenced node

It's only when I shut down the fenced node that the running node falls
over.

How would using a power fencing agent differ from me manually removing
power?

Thanks (I very much appreciate the discussion!)

Best,
-Pat



Would it be useful to show logs of what that looks like?

On Tue, Sep 11, 2018 at 9:22 AM Valentin Vidic wrote:

> On Tue, Sep 11, 2018 at 09:13:08AM -0400, Patrick Whitney wrote:
> > So when the cluster suggests that DLM is shutdown on coro-test-1:
> > Clone Set: dlm-clone [dlm]
> >  Started: [ coro-test-2 ]
> >  Stopped: [ coro-test-1 ]
> >
> > ... DLM isn't actually stopped on 1?
>
> If you can connect to the node and see dlm services running then
> it is not stopped:
>
> 20101 dlm_controld
> 20245 dlm_scand
> 20246 dlm_recv
> 20247 dlm_send
> 20248 dlm_recoverd
>
> But if you kill the power on the node then it will be gone for sure :)
>
> --
> Valentin


-- 
Patrick Whitney
DevOps Engineer -- Tools


Re: [ClusterLabs] 2 node cluster dlm/clvm trouble

2018-09-11 Thread Valentin Vidic
On Tue, Sep 11, 2018 at 04:14:08PM +0300, Vladislav Bogdanov wrote:
> And that is not an easy task sometimes, because main part of dlm runs in
> kernel.
> In some circumstances the only option is to forcibly reset the node.

Exactly, killing the power on the node will stop the DLM code running in
the kernel too.

-- 
Valentin


Re: [ClusterLabs] 2 node cluster dlm/clvm trouble

2018-09-11 Thread Valentin Vidic
On Tue, Sep 11, 2018 at 09:13:08AM -0400, Patrick Whitney wrote:
> So when the cluster suggests that DLM is shutdown on coro-test-1:
> Clone Set: dlm-clone [dlm]
>  Started: [ coro-test-2 ]
>  Stopped: [ coro-test-1 ]
> 
> ... DLM isn't actually stopped on 1?

If you can connect to the node and see dlm services running then
it is not stopped:

20101 dlm_controld
20245 dlm_scand
20246 dlm_recv
20247 dlm_send
20248 dlm_recoverd

But if you kill the power on the node then it will be gone for sure :)
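
As a rough illustration of that check, here is a minimal Python sketch that
scans a process listing for the dlm names above. It assumes the listing is
in the same two-column "PID NAME" form as shown; the helper name
`running_dlm` and the `DLM_NAMES` set are made up for this example and are
not part of any cluster tool.

```python
# Hypothetical helper: decide from a process listing whether DLM is alive.
# DLM_NAMES just repeats the process names listed in this message.
DLM_NAMES = {"dlm_controld", "dlm_scand", "dlm_recv", "dlm_send", "dlm_recoverd"}


def running_dlm(ps_output):
    """Return the DLM process names found in 'PID NAME' formatted lines."""
    found = []
    for line in ps_output.splitlines():
        parts = line.split()
        # Skip blank or malformed lines; only the second column is compared.
        if len(parts) >= 2 and parts[1] in DLM_NAMES:
            found.append(parts[1])
    return found


sample = """\
20101 dlm_controld
20245 dlm_scand
20246 dlm_recv
20247 dlm_send
20248 dlm_recoverd
"""

print(running_dlm(sample))
```

A non-empty result means DLM is not stopped on that node, whatever the
cluster status output says.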

-- 
Valentin


Re: [ClusterLabs] 2 node cluster dlm/clvm trouble

2018-09-11 Thread Vladislav Bogdanov

On 11.09.2018 16:10, Valentin Vidic wrote:
> On Tue, Sep 11, 2018 at 09:02:06AM -0400, Patrick Whitney wrote:
> > What I'm having trouble understanding is why dlm flattens the remaining
> > "running" node when the already fenced node is shutdown...  I'm having
> > trouble understanding how power fencing would cause dlm to behave any
> > differently than just shutting down the fenced node.
>
> fence_scsi just kills the storage on the node, but dlm continues to run
> causing problems for the rest of the cluster nodes.  So it seems some
> other fence agent should be used that would kill dlm too.

And that is not an easy task sometimes, because main part of dlm runs in
kernel.

In some circumstances the only option is to forcibly reset the node.


Re: [ClusterLabs] 2 node cluster dlm/clvm trouble

2018-09-11 Thread Patrick Whitney
So when the cluster suggests that DLM is shutdown on coro-test-1:
Clone Set: dlm-clone [dlm]
 Started: [ coro-test-2 ]
 Stopped: [ coro-test-1 ]

... DLM isn't actually stopped on 1?

Best,
-Pat

On Tue, Sep 11, 2018 at 9:10 AM Valentin Vidic wrote:

> On Tue, Sep 11, 2018 at 09:02:06AM -0400, Patrick Whitney wrote:
> > What I'm having trouble understanding is why dlm flattens the remaining
> > "running" node when the already fenced node is shutdown...  I'm having
> > trouble understanding how power fencing would cause dlm to behave any
> > differently than just shutting down the fenced node.
>
> fence_scsi just kills the storage on the node, but dlm continues to run
> causing problems for the rest of the cluster nodes.  So it seems some
> other fence agent should be used that would kill dlm too.
>
> --
> Valentin


-- 
Patrick Whitney
DevOps Engineer -- Tools


Re: [ClusterLabs] 2 node cluster dlm/clvm trouble

2018-09-11 Thread Valentin Vidic
On Tue, Sep 11, 2018 at 09:02:06AM -0400, Patrick Whitney wrote:
> What I'm having trouble understanding is why dlm flattens the remaining
> "running" node when the already fenced node is shutdown...  I'm having
> trouble understanding how power fencing would cause dlm to behave any
> differently than just shutting down the fenced node.

fence_scsi just kills the storage on the node, but dlm continues to run
causing problems for the rest of the cluster nodes.  So it seems some
other fence agent should be used that would kill dlm too.

-- 
Valentin


Re: [ClusterLabs] 2 node cluster dlm/clvm trouble

2018-09-06 Thread Andrei Borzenkov
06.09.2018 17:36, Patrick Whitney wrote:
> Good Morning Everyone,
> 
> I'm hoping someone with more experience with corosync and pacemaker can see
> what I am doing wrong.
> 
> I've got a test setup of 2 nodes, with dlm and clvm setup as clones, and
> using fence_scsi as my fencing agent.
> 
> I've got it to the point where the cluster is up, and reports it is happy.
> I then began testing fencing.   When issuing 'pcs stonith fence' it appears
> to work; that is, the scsi reservation is pulled and the output of 'pcs
> status' looks sane, and I'm able to access resources on the un-fenced node.
> 
> Things go awry when I shutdown (init 0) the fenced node... my unfenced node
> decides to fence itself (which looks like it was initiated by dlm due to an
> abandoned lockspace).
> 
> I suspect this is due to misconfiguration, since I'm new to the toolset,
> but I'm not quite sure what I need to change.
> 
> Any and all input is appreciated!
> 
> Below is a chronology of events; my corosync config and cib.xml; command
> output; and annotated logs.
> 
> Again, any hints, suggestions, wild guesses, or premonitions are welcomed
> -- I'm stuck!   Please let me know if there is additional information which
> would be helpful.
> 
> Many thanks,
> -Patrick W.
> 
> Sep  6 08:54:14  -- Cluster is up and running; UI reports everything
>    healthy.
> 
> Sep  6 08:55:44  -- 'pcs stonith fence' called against node 1
>    (coro-test-1);
>    UI reports everything as expected -- that is, resources show only
>    running on unfenced node and they're available.
>    Oddly, although the UI says dlm is stopped on fenced node, the
>    dlm_controld is still running.
> 
> Sep  6 09:03:38  -- node 1 is shutdown, and node 2 falls to pieces.
>    - First, corosync sees lost member -- seems like this is
>      appropriate, to me.
>    - Next, dlm_controld calls to fence everything
>    - stonith-ng tries to fence node 1 (but its already fenced!)
>    - dlm closes connection to "node 2" (does dlm "nodes" map to
>      cluster nodes? I'm not sure they do)
>    - clvmd dlm lockspace is now abandoned; cluster attempts to fence
>      the remaining node (But can't because scsi_fence doesn't work
>      like that).
> 
> ***
> **   -- Configuration --
> ***
> root@coro-test-2:~# pcs --version
> 0.9.149
> root@coro-test-2:~# pacemakerd --version
> Pacemaker 1.1.14

I wonder if https://github.com/ClusterLabs/pacemaker/pull/839 is
relevant here.
