Re: [Linux-HA] How do I clear the Failed actions section?

2012-03-08 Thread Helmut Wollmersdorfer

Am 08.03.2012 um 13:33 schrieb William Seligman:

> On 3/8/12 6:53 AM, Helmut Wollmersdorfer wrote:
>>
>> [...]

>>   Master/Slave Set: DrbdClone2
>>       Resource Group: group_drbd2:0
>>           xen_drbd2_1:0  (ocf::linbit:drbd):  Slave xen10 (unmanaged) FAILED
>>           xen_drbd2_2:0  (ocf::linbit:drbd):  Stopped
>>       Masters: [ xen11 ]
>> [...]
>>  xen_drbd2_1:1_promote_0 (node=xen10, call=790, rc=1, status=complete): unknown error
>> [...]
>>  xen_drbd2_1:0_promote_0 (node=xen10, call=1326, rc=-2, status=Timed Out): unknown exec error
>>  xen_drbd2_1:0_stop_0 (node=xen10, call=1348, rc=-2, status=Timed Out): unknown exec error
>>
>> xen11:# crm resource cleanup xen_drbd2_1
>> Error performing operation: The object/attribute does not exist
>> Error performing operation: The object/attribute does not exist
>
> Given the list of resources displayed by crm_mon, the command you  
> need is
>
> crm resource cleanup DrbdClone2

Thx. Works fine.

>
> I can't say whether that will fix your problems, but you won't get  
> the "does not exist" message.
>
> Somewhere in either "Pacemaker Explained" or "Clusters From  
> Scratch", it says that once you clone or ms a resource, you can't  
> refer to that resource as an individual anymore; you have to use the  
> clone/ms name.
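
A minimal sketch of that advice, using the names from this thread; the optional node argument to cleanup is available in the crm shell, though older versions may differ:

  # the primitive name is no longer an addressable object once it is inside an ms:
  crm resource cleanup xen_drbd2_1        # -> "Error performing operation: The object/attribute does not exist"

  # refer to the ms/clone container instead, optionally limited to the failed node:
  crm resource cleanup DrbdClone2
  crm resource cleanup DrbdClone2 xen10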
>
> What I did when faced with a problem like yours is "cat /proc/drbd",  
> look at the lines for the failed drbd, and fix it on my own. Then  
> I'd type the cleanup command for pacemaker to pick up the current  
> state of the resource.

The DRBD resources themselves are fine (see below). For some reason the  
failed action messages in the CRM sometimes do not get cleaned up.

>> [drbdadm status output: two <resource .../> lines, both showing ro2="Secondary" ds1="UpToDate" ds2="UpToDate"; the XML tags were stripped in the archive]

Thx again

Helmut Wollmersdorfer



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] How do I clear the Failed actions section?

2012-03-08 Thread William Seligman

On 3/8/12 6:53 AM, Helmut Wollmersdorfer wrote:


Am 07.03.2012 um 18:01 schrieb Florian Haas:


On Wed, Mar 7, 2012 at 5:51 PM, William Seligman
  wrote:

Again, a disclaimer: I am not an expert.


Your advice was spot on. :)


But what can I do if cleanup is not working, even though everything is running?

# crm status

Last updated: Thu Mar  8 12:27:00 2012
Stack: Heartbeat
Current DC: xen10 (5ab5ba3d-3be5-4763-83e7-90aaa49361a6) - partition
with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, unknown expected votes
12 Resources configured.


Online: [ xen10 xen11 ]

 xen_www        (ocf::heartbeat:Xen):   Started xen11
 Master/Slave Set: DrbdClone1
     Masters: [ xen11 ]
     Slaves: [ xen10 ]
 xen_typo3      (ocf::heartbeat:Xen):   Started xen11
 xen_shopdb     (ocf::heartbeat:Xen):   Started xen10
 xen_admintool  (ocf::heartbeat:Xen):   Started xen11
 xen_cmsdb      (ocf::heartbeat:Xen):   Started xen11
 Master/Slave Set: DrbdClone2
     Resource Group: group_drbd2:0
         xen_drbd2_1:0  (ocf::linbit:drbd):  Slave xen10 (unmanaged) FAILED
         xen_drbd2_2:0  (ocf::linbit:drbd):  Stopped
     Masters: [ xen11 ]
 Master/Slave Set: DrbdClone3
     Masters: [ xen10 ]
     Slaves: [ xen11 ]
 Master/Slave Set: DrbdClone5
     Masters: [ xen11 ]
     Slaves: [ xen10 ]
 Master/Slave Set: DrbdClone6
     Slaves: [ xen11 xen10 ]
 Master/Slave Set: DrbdClone4
     Masters: [ xen11 ]
     Slaves: [ xen10 ]

Failed actions:
    xen_cmsdb_monitor_3000 (node=xen10, call=571, rc=7, status=complete): not running
    xen_drbd1_2:1_promote_0 (node=xen10, call=5205, rc=1, status=complete): unknown error
    xen_drbd2_1:1_promote_0 (node=xen10, call=790, rc=1, status=complete): unknown error
    xen_ns2_monitor_3000 (node=xen10, call=601, rc=7, status=complete): not running
    xen_drbd3_1:1_promote_0 (node=xen10, call=383, rc=-2, status=Timed Out): unknown exec error
    xen_drbd2_1:0_promote_0 (node=xen10, call=1326, rc=-2, status=Timed Out): unknown exec error
    xen_drbd2_1:0_stop_0 (node=xen10, call=1348, rc=-2, status=Timed Out): unknown exec error

xen11:# crm resource cleanup xen_drbd2_1
Error performing operation: The object/attribute does not exist
Error performing operation: The object/attribute does not exist


Given the list of resources displayed by crm_mon, the command you need is

crm resource cleanup DrbdClone2

I can't say whether that will fix your problems, but you won't get the 
"does not exist" message.


Somewhere in either "Pacemaker Explained" or "Clusters From Scratch", it 
says that once you clone or ms a resource, you can't refer to that 
resource as an individual anymore; you have to use the clone/ms name.


What I did when faced with a problem like yours is "cat /proc/drbd", 
look at the lines for the failed drbd, and fix it on my own. Then I'd 
type the cleanup command for pacemaker to pick up the current state of 
the resource.
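
A rough sketch of that manual recovery path, assuming the underlying DRBD resource is named drbd2_1 (the real DRBD resource name isn't given in the thread) and that it has merely lost its connection; the exact repair depends on what /proc/drbd actually shows:

  cat /proc/drbd              # check cs:/ro:/ds: for the failed minor
  drbdadm cstate drbd2_1      # e.g. StandAlone or WFConnection
  drbdadm connect drbd2_1     # re-establish the connection if StandAlone
  drbdadm attach drbd2_1      # re-attach the backing device if Diskless

  # once /proc/drbd looks sane again, let Pacemaker re-probe the resource:
  crm resource cleanup DrbdClone2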



# xm list
Name             ID    Mem  VCPUs  State   Time(s)
Domain-0          0  10051      6  r-----  40648.5
admintool         5   4096      2  -b----   7455.4
cmsdb             3   2048      2  -b----   2106.5
typo3             2   1024      2  -b----   2890.9
www               1   1024      1  -b----    855.0


xen11:# drbdadm status

[drbdadm status XML output stripped in the archive]

Helmut Wollmersdorfer


--
Bill Seligman | mailto://selig...@nevis.columbia.edu
Nevis Labs, Columbia Univ | http://www.nevis.columbia.edu/~seligman/
PO Box 137|
Irvington NY 10533  USA   | Phone: (914) 591-2823



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Apparent problem in pacemaker ordering

2012-03-08 Thread Dejan Muhamedagic
Hi,

On Wed, Mar 07, 2012 at 07:52:16PM -0500, William Seligman wrote:
> On 3/5/12 11:55 AM, William Seligman wrote:
> > On 3/3/12 3:30 PM, William Seligman wrote:
> >> On 3/3/12 2:14 PM, Florian Haas wrote:
> >>> On Sat, Mar 3, 2012 at 6:55 PM, William Seligman
> >>>   wrote:
>  On 3/3/12 12:03 PM, emmanuel segura wrote:
> >
> > are you sure the exportfs agent can be used with an active/active clone?
> 
>  a) I've been through the script. If there's some problem associated with it
>  being cloned, I haven't seen it. (It can't handle globally-unique="true",
>  but I didn't turn that on.)
> >>>
> >>> It shouldn't have a problem with being cloned. Obviously, cloning that
> >>> RA _really_ makes sense only with the export that manages an NFSv4
> >>> virtual root (fsid=0). Otherwise, the export clone has to be hosted on
> >>> a clustered filesystem, and you'd have to have a pNFS implementation
> >>> that doesn't suck (tough to come by on Linux), and if you want that
> >>> sort of replicate, parallel-access NFS you might as well use Gluster.
> >>> The downside of the latter, though, is it's currently NFSv3-only,
> >>> without sideband locking.
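
To illustrate the fsid=0 case Florian is referring to, a cloned NFSv4 virtual-root export could look roughly like this; the directory, clientspec and resource names are made up for the example:

  primitive ExportRoot ocf:heartbeat:exportfs \
      params clientspec="*.example.com" directory="/srv/nfs" fsid="0" \
             options="rw,crossmnt" \
      op monitor interval="30s"
  clone ExportRootClone ExportRoot meta interleave="true"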
> >>
> >> I'll look this over when I have a chance. I think I can get away without a
> >> NFSv4 virtual root because I'm exporting everything to my cluster either
> >> read-only, or only one system at a time will do any writing. Now that
> >> you've warned me, I'll do some more checking.
> >>
>  b) I had similar problems using the exportfs resource in a primary-secondary
>  setup without clones.
> 
>  Why would a resource being cloned create an ordering problem? I haven't set
>  the interleave parameter (even with the documentation I'm not sure what it
>  does) but A before B before C seems pretty clear, even for cloned resources.
> >>>
> >>> As far as what interleave does. Suppose you have two clones, A and B.
> >>> And they're linked with an order constraint, like this:
> >>>
> >>> order A_before_B inf: A B
> >>>
> >>> ... then if interleave is false, _all_ instances of A must be started
> >>> before _any_ instance of B gets to start anywhere in the cluster.
> >>> However if interleave is true, then for any node only the _local_
> >>> instance of A needs to be started before it can start the
> >>> corresponding _local_ instance of B.
> >>>
> >>> In other words, interleave=true is actually the reasonable thing to
> >>> set on all clone instances by default, and I believe the pengine
> >>> actually does use a default of interleave=true on defined clone sets
> >>> since some 1.1.x release (I don't recall which).
> >>
> >> Thanks, Florian. That's a great explanation. I'll probably stick
> >> "interleave=true" on most of my clones just to make sure.
> >>
> >> It explains an error message I've seen in the logs:
> >>
> >> Mar  2 18:15:19 hypatia-tb pengine: [4414]: ERROR: clone_rsc_colocation_rh:
> >> Cannot interleave clone ClusterIPClone and Gfs2Clone because they do not
> >> support the same number of resources per node
> >>
> >> Because ClusterIPClone has globally-unique=true and clone-max=2, it's
> >> possible for both instances to be running on a single node; I've seen this
> >> a few times in my testing when cycling power on one of the nodes.
> >> Interleaving doesn't make sense in such a case.
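
A minimal sketch of what interleave changes in practice, using Dummy placeholders and the A/B names from Florian's explanation rather than Bill's actual configuration:

  primitive a_rsc ocf:pacemaker:Dummy
  primitive b_rsc ocf:pacemaker:Dummy
  clone A a_rsc meta interleave="true"
  clone B b_rsc meta interleave="true"
  # interleave="true": each node starts its local B instance as soon as its
  # local A instance is up; interleave="false": no B instance anywhere starts
  # until every A instance in the cluster is up.
  order A_before_B inf: A B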
> >>
> >>> Bill, seeing as you've already pastebinned your config and crm_mon
> >>> output, could you also pastebin your whole CIB as per "cibadmin -Q"
> >>> output? Thanks.
> >>
> >> Sure: . It doesn't have the exportfs resources in it; I took them out
> >> before leaving for the weekend. If it helps, I'll put them back in and
> >> try to get the "cibadmin -Q" output before any nodes crash.
> >>
> > 
> > For a test, I stuck in an exportfs resource with all the ordering constraints.
> > Here's the "cibadmin -Q" output from that:
> > 
> > 
> > 
> > The output of crm_mon just after doing that, showing resource failure:
> > 
> > 
> > 
> > Then all the resources are stopped:
> > 
> > 
> > 
> > A few seconds later one of the nodes is fenced, but this does not bring up
> > anything:
> > 
> > 
> 
> I believe I have the solution to my stability problem. It doesn't solve the
> issue of ordering, but I think I have a configuration that will survive 
> failover.
> 
> Here's the problem. I had exportfs resources such as:
> 
> primitive ExportUsrNevis ocf:heartbeat:exportfs \
> op start interval="0" timeout="40" \
> op stop interval="0" timeout="45" \
> params clientspec="*.nevis.columbia.edu" directory="/usr/nevis" \
> fsid="20" options="ro,no_root_squash,async"
> 
> I did detailed traces of the execution of exportfs (putting in logger
> commands) and found that t
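
A minimal, hypothetical sketch of that kind of logger instrumentation inside an OCF shell agent; the function body and log tag are illustrative and not the actual exportfs agent code:

  exportfs_start() {
      logger -t exportfs-RA "start: clientspec=$OCF_RESKEY_clientspec dir=$OCF_RESKEY_directory"
      exportfs -o "$OCF_RESKEY_options,fsid=$OCF_RESKEY_fsid" \
          "$OCF_RESKEY_clientspec:$OCF_RESKEY_directory"
      rc=$?
      logger -t exportfs-RA "start: exportfs returned $rc"
      return $rc
  }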

Re: [Linux-HA] How do I clear the Failed actions section?

2012-03-08 Thread Helmut Wollmersdorfer

Am 07.03.2012 um 18:01 schrieb Florian Haas:

> On Wed, Mar 7, 2012 at 5:51 PM, William Seligman
>  wrote:
>> Again, a disclaimer: I am not an expert.
>
> Your advice was spot on. :)

But what can I do if cleanup is not working, even though everything is running?

# crm status

Last updated: Thu Mar  8 12:27:00 2012
Stack: Heartbeat
Current DC: xen10 (5ab5ba3d-3be5-4763-83e7-90aaa49361a6) - partition  
with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, unknown expected votes
12 Resources configured.


Online: [ xen10 xen11 ]

 xen_www        (ocf::heartbeat:Xen):   Started xen11
 Master/Slave Set: DrbdClone1
     Masters: [ xen11 ]
     Slaves: [ xen10 ]
 xen_typo3      (ocf::heartbeat:Xen):   Started xen11
 xen_shopdb     (ocf::heartbeat:Xen):   Started xen10
 xen_admintool  (ocf::heartbeat:Xen):   Started xen11
 xen_cmsdb      (ocf::heartbeat:Xen):   Started xen11
 Master/Slave Set: DrbdClone2
     Resource Group: group_drbd2:0
         xen_drbd2_1:0  (ocf::linbit:drbd):  Slave xen10 (unmanaged) FAILED
         xen_drbd2_2:0  (ocf::linbit:drbd):  Stopped
     Masters: [ xen11 ]
 Master/Slave Set: DrbdClone3
     Masters: [ xen10 ]
     Slaves: [ xen11 ]
 Master/Slave Set: DrbdClone5
     Masters: [ xen11 ]
     Slaves: [ xen10 ]
 Master/Slave Set: DrbdClone6
     Slaves: [ xen11 xen10 ]
 Master/Slave Set: DrbdClone4
     Masters: [ xen11 ]
     Slaves: [ xen10 ]

Failed actions:
    xen_cmsdb_monitor_3000 (node=xen10, call=571, rc=7, status=complete): not running
    xen_drbd1_2:1_promote_0 (node=xen10, call=5205, rc=1, status=complete): unknown error
    xen_drbd2_1:1_promote_0 (node=xen10, call=790, rc=1, status=complete): unknown error
    xen_ns2_monitor_3000 (node=xen10, call=601, rc=7, status=complete): not running
    xen_drbd3_1:1_promote_0 (node=xen10, call=383, rc=-2, status=Timed Out): unknown exec error
    xen_drbd2_1:0_promote_0 (node=xen10, call=1326, rc=-2, status=Timed Out): unknown exec error
    xen_drbd2_1:0_stop_0 (node=xen10, call=1348, rc=-2, status=Timed Out): unknown exec error

xen11:# crm resource cleanup xen_drbd2_1
Error performing operation: The object/attribute does not exist
Error performing operation: The object/attribute does not exist

# xm list
Name             ID    Mem  VCPUs  State   Time(s)
Domain-0          0  10051      6  r-----  40648.5
admintool         5   4096      2  -b----   7455.4
cmsdb             3   2048      2  -b----   2106.5
typo3             2   1024      2  -b----   2890.9
www               1   1024      1  -b----    855.0


xen11:# drbdadm status

[drbdadm status XML output stripped in the archive]

Helmut Wollmersdorfer


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems