[Linux-ha-dev] R: R: [PATCH] Filesystem RA:

2013-04-10 Thread Guglielmo Abbruzzese
Hi Darren,
I am aware STONITH could help, but unfortunately I cannot add such a device to 
the architecture at the moment. 
Furthermore, Sybase seems to be stopped (the start/stop order should already be 
guaranteed by the Resource Group structure):

Resource Group: grp-sdg
 resource_vrt_ip    (ocf::heartbeat:IPaddr2):      Started NODE_A
 resource_lvm       (ocf::heartbeat:LVM):          Started NODE_A
 resource_lvmdir    (ocf::heartbeat:Filesystem):   failed (and so unmanaged)
 resource_sybase    (lsb:sybase):                  stopped
 resource_httpd     (lsb:httpd):                   stopped
 resource_tomcatd   (lsb:tomcatd):                 stopped
 resource_sdgd      (lsb:sdgd):                    stopped
 resource_statd     (lsb:statistiched):            stopped

I'm just wondering: why did the same configuration swap over fine with the 
previous storage? The only difference I can see is the changed multipath 
configuration.

Thanks a lot
G.


-----Original Message-----
From: linux-ha-dev-boun...@lists.linux-ha.org 
[mailto:linux-ha-dev-boun...@lists.linux-ha.org] On Behalf Of Darren Thompson 
(AkurIT)
Sent: Tuesday, 9 April 2013 23:35
To: High-Availability Linux Development List
Subject: Re: [Linux-ha-dev] R: [PATCH] Filesystem RA:

Hi

The correct way for that to have been handled, given your additional detail, 
would have been for the node to have received a STONITH.

Things that you should check:
1. STONITH device configured correctly and operational.
2. The "on-fail" action for the stop operation of any filesystem cluster 
resource should be "fence".
3. Review your constraints to ensure that the order and relationship between 
Sybase and the filesystem resource are correct, so that Sybase is stopped 
first (a minimal sketch follows).
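
For illustration, a minimal crm shell sketch of points 2 and 3 (resource names 
and parameters are taken from the thread; the monitor interval and the explicit 
order constraint are assumptions, and the constraint is redundant if the group 
already lists the filesystem before Sybase):

  primitive resource_lvmdir ocf:heartbeat:Filesystem \
          params device="/dev/VG_SDG_Cluster_RM/LV_SDG_Cluster_RM" \
                 directory="/storage" fstype="ext4" \
          op monitor interval="20s" timeout="40s" \
          op start timeout="180s" \
          op stop timeout="180s" on-fail="fence"

  # stop order is the reverse of start order, so this makes Sybase stop
  # before the filesystem is unmounted
  order ord_fs_then_sybase inf: resource_lvmdir resource_sybase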

Hope this helps

Darren 


Sent from my iPhone

On 09/04/2013, at 11:57 PM, "Guglielmo Abbruzzese"  wrote:

> Hi everybody,
> In my case (very similar to Junko's), when I disconnect the Fibre 
> Channels, the "try_umount" procedure in the Filesystem RA script doesn't work.
> 
> After the configured attempts the active/passive cluster doesn't swap, 
> and the lvmdir resource is flagged as "failed" rather than "stopped".
> 
> I must say that even if I try to umount the /storage resource manually it 
> doesn't work, because Sybase is using some files stored on it (device 
> busy); this is why the RA cannot complete the operation cleanly. 
> Is there a way to force the swap anyway?
> 
> Some things I already tried:
> 1) This very test with a different optical SAN/storage in the past, 
> and the RA could always umount the storage correctly;
> 2) I modified the RA to force the option "umount -l", even though I have 
> an ext4 filesystem rather than NFS;
> 3) I killed the hung processes with the command "fuser -km /storage", 
> but the umount always failed, and after a while I got a kernel panic.
> 
> Is there a way to force the swap anyway, even if the umount is not clean?
> Any suggestion?
> 
> Thanks for your time,
> Regards
> Guglielmo
> 
> P.S. lvmdir resource configuration
> 
> <primitive ... class="ocf" provider="heartbeat" type="Filesystem">
>   <instance_attributes ...>
>     <nvpair ... name="device" value="/dev/VG_SDG_Cluster_RM/LV_SDG_Cluster_RM"/>
>     <nvpair ... name="directory" value="/storage"/>
>     <nvpair ... name="fstype" value="ext4"/>
>   </instance_attributes>
>   <meta_attributes ...>
>     <nvpair ... name="multiple-active" value="stop_start"/>
>     <nvpair ... name="migration-threshold" value="1"/>
>     <nvpair ... name="failure-timeout" value="0"/>
>   </meta_attributes>
>   <operations>
>     <op ... name="monitor" on-fail="restart" requires="nothing" timeout="40s"/>
>     <op ... on-fail="restart" requires="nothing" timeout="180s"/>
>     <op ... on-fail="restart" requires="nothing" timeout="180s"/>
>   </operations>
> </primitive>
> 
> 2012/5/9 Junko IKEDA :
>> Hi,
>> 
>> In my case, the umount succeeds when the Fibre Channel is 
>> disconnected, so it seems that handling the status file caused the 
>> longer failover, as Dejan said.
>> If the umount fails, it will run into a timeout and might trigger a 
>> stonith action; that case also makes sense (though I couldn't see it).
>> 
>> I tried the following setups:
>> 
>> (1) timeout : multipath > RA
>> multipath timeout = 120s
>> Filesystem RA stop timeout = 60s
>> 
>> (2) timeout : multipath < RA
>> multipath timeout = 60s
>> Filesystem RA stop timeout = 120s
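
(For reference, one common way such a multipath queueing timeout is bounded; 
the values below are illustrative only and are not taken from the thread:)

  # /etc/multipath.conf fragment: with polling_interval 10, no_path_retry 12
  # queues I/O for roughly 120s after the last path fails (no_path_retry 6
  # for roughly 60s) before I/O errors are returned to the filesystem
  defaults {
      polling_interval  10
      no_path_retry     12
  }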
>> 
>> In case (1), Filesystem_stop() fails: the hanging FC causes the stop timeout.
>> 
>> In case (2), Filesystem_stop() succeeds.
>> The filesystem is hanging, but lines 758 and 759 succeed (rc=0).
>> The status file is no longer accessible, so in fact it remains on the 
>> filesystem.
>> 
>>  758   if [ -f "$STATUSFILE" ]; then
>>  759       rm -f ${STATUSFILE}
>>  760       if [ $? -ne 0 ]; then
>> 
>> so line 761 might not be called as expected.
>> 
>>  761           ocf_log warn "Failed to remove status file ${STATUSFILE}."
>> 
>> 
>> By the way, my concern is the unexpected stop timeout and the longer 
>> failover time; if OCF_CHECK_LEVEL is set to 20, it would be better to 
>> try to remove the status file just in case.
>> It can handle the case (2) if the user wants to recover this c
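
A minimal shell sketch of that suggestion (this is not the actual RA code; the 
helper name is made up, the use of coreutils timeout is an assumption, and 
ocf_log comes from the RA shell function library):

  # Best-effort removal of the OCF_CHECK_LEVEL=20 status file, bounded so
  # that a hung device cannot turn the cleanup itself into a stop timeout.
  remove_status_file() {
      [ -n "$STATUSFILE" ] || return 0
      if ! timeout 5 sh -c '[ ! -f "$1" ] || rm -f "$1"' sh "$STATUSFILE"; then
          ocf_log warn "Failed to remove status file ${STATUSFILE}."
      fi
      return 0    # never let status-file cleanup fail the stop itself
  }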

[Linux-ha-dev] [PATCH][crmsh] deal with the case-insensitive hostname

2013-04-10 Thread Junko IKEDA
Hi,
I set upper-case hostnames (GUEST03/GUEST04) and am running Pacemaker 1.1.9 +
Corosync 2.3.0.

[root@GUEST04 ~]# crm_mon -1
Last updated: Wed Apr 10 15:12:48 2013
Last change: Wed Apr 10 14:02:36 2013 via crmd on GUEST04
Stack: corosync
Current DC: GUEST04 (3232242817) - partition with quorum
Version: 1.1.9-e8caee8
2 Nodes configured, unknown expected votes
1 Resources configured.


Online: [ GUEST03 GUEST04 ]

 dummy  (ocf::pacemaker:Dummy): Started GUEST03


For example, calling the crm shell with a lower-case hostname:

[root@GUEST04 ~]# crm node standby guest03
ERROR: bad lifetime: guest03

"crm node standby GUEST03" surely works well,
so crm shell just doesn't take into account the hostname conversion.
It's better to accept the both of the upper/lower-case.

"node standby", "node delete", "resource migrate(move)"  get hit with this
issue.
Please see the attached.
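
For illustration, the kind of normalisation involved looks roughly like this 
(this is not the attached patch; the helper name is made up):

  # Case-insensitive match of a user-supplied node name against the names
  # known to the cluster, returning the canonical spelling.
  resolve_node_name() {
      wanted=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
      shift
      for node in "$@"; do
          if [ "$(printf '%s' "$node" | tr '[:upper:]' '[:lower:]')" = "$wanted" ]
          then
              printf '%s\n' "$node"
              return 0
          fi
      done
      return 1
  }

  # e.g. resolve_node_name guest03 GUEST03 GUEST04   ->   GUEST03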

Thanks,
Junko


ignorecase.patch
Description: Binary data


Re: [Linux-ha-dev] R: R: [PATCH] Filesystem RA:

2013-04-10 Thread Darren Thompson (AkurIT)
Hi G.

I personally recommend, as a minimum, that you set up an SBD partition and use 
SBD STONITH. It protects against file/database corruption in the event of an 
issue on the underlying storage.

Hardware (power) STONITH is considered the "best" protection, but I have had 
clusters running for years using just SBD STONITH, and I would not deploy a 
cluster-managed file system without it.

You should also strongly consider setting "fence on stop failure" for the 
same reason. The worst possible corruption can be caused by the cluster having 
a "split brain" due to a partially unmounted file system and another node 
mounting and writing to it at the same time.
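
For reference, a minimal SBD sketch (the device path is a placeholder; the 
stonith agent and parameter names follow the older external/sbd plugin and may 
differ between versions; a hardware or softdog watchdog is also required):

  # initialise the shared 1G partition as an SBD device (destroys its contents)
  sbd -d /dev/disk/by-id/SHARED_LUN-part1 create
  sbd -d /dev/disk/by-id/SHARED_LUN-part1 dump

  # point the sbd daemon at the device (e.g. SBD_DEVICE= in /etc/sysconfig/sbd),
  # then add a fencing resource so Pacemaker can use it
  crm configure primitive stonith-sbd stonith:external/sbd \
          params sbd_device="/dev/disk/by-id/SHARED_LUN-part1"
  crm configure property stonith-enabled="true"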

Regards
D.


On 10/04/2013, at 5:30 PM, "Guglielmo Abbruzzese"  wrote:

> Hi Darren,
> I am aware STONITH could help, but unfortunately I cannot add such a device to 
> the architecture at the moment. 
> Furthermore, Sybase seems to be stopped (the start/stop order should already 
> be guaranteed by the Resource Group structure):
> 
> Resource Group: grp-sdg
> resource_vrt_ip    (ocf::heartbeat:IPaddr2):      Started NODE_A
> resource_lvm       (ocf::heartbeat:LVM):          Started NODE_A
> resource_lvmdir    (ocf::heartbeat:Filesystem):   failed (and so unmanaged)
> resource_sybase    (lsb:sybase):                  stopped
> resource_httpd     (lsb:httpd):                   stopped
> resource_tomcatd   (lsb:tomcatd):                 stopped
> resource_sdgd      (lsb:sdgd):                    stopped
> resource_statd     (lsb:statistiched):            stopped
> 
> I'm just wondering: why did the same configuration swap over fine with the 
> previous storage? The only difference I can see is the changed multipath 
> configuration.
> 
> Thanks a lot
> G.
> 
> 
> -----Original Message-----
> From: linux-ha-dev-boun...@lists.linux-ha.org 
> [mailto:linux-ha-dev-boun...@lists.linux-ha.org] On Behalf Of Darren Thompson 
> (AkurIT)
> Sent: Tuesday, 9 April 2013 23:35
> To: High-Availability Linux Development List
> Subject: Re: [Linux-ha-dev] R: [PATCH] Filesystem RA:
> 
> Hi
> 
> The correct way for that to have been handled, given your additional detail, 
> would have been for the node to have received a STONITH.
> 
> Things that you should check:
> 1. STONITH device configured correctly and operational.
> 2. The "on-fail" action for the stop operation of any filesystem cluster 
> resource should be "fence".
> 3. Review your constraints to ensure that the order and relationship between 
> Sybase and the filesystem resource are correct, so that Sybase is stopped 
> first.
> 
> Hope this helps
> 
> Darren 
> 
> 
> Sent from my iPhone
> 
> On 09/04/2013, at 11:57 PM, "Guglielmo Abbruzzese"  
> wrote:
> 
>> Hi everybody,
>> In my case (very similar to Junko's), when I disconnect the Fibre 
>> Channels, the "try_umount" procedure in the Filesystem RA script doesn't work.
>> 
>> After the configured attempts the active/passive cluster doesn't swap, 
>> and the lvmdir resource is flagged as "failed" rather than "stopped".
>> 
>> I must say that even if I try to umount the /storage resource manually it 
>> doesn't work, because Sybase is using some files stored on it (device 
>> busy); this is why the RA cannot complete the operation cleanly. 
>> Is there a way to force the swap anyway?
>> 
>> Some things I already tried:
>> 1) This very test with a different optical SAN/storage in the past, 
>> and the RA could always umount the storage correctly;
>> 2) I modified the RA to force the option "umount -l", even though I have 
>> an ext4 filesystem rather than NFS;
>> 3) I killed the hung processes with the command "fuser -km /storage", 
>> but the umount always failed, and after a while I got a kernel panic.
>> 
>> Is there a way to force the swap anyway, even if the umount is not clean?
>> Any suggestion?
>> 
>> Thanks for your time,
>> Regards
>> Guglielmo
>> 
>> P.S. lvmdir resource configuration
>> 
>> <primitive ... class="ocf" provider="heartbeat" type="Filesystem">
>>   <instance_attributes ...>
>>     <nvpair ... name="device" value="/dev/VG_SDG_Cluster_RM/LV_SDG_Cluster_RM"/>
>>     <nvpair ... name="directory" value="/storage"/>
>>     <nvpair ... name="fstype" value="ext4"/>
>>   </instance_attributes>
>>   <meta_attributes ...>
>>     <nvpair ... name="multiple-active" value="stop_start"/>
>>     <nvpair ... name="migration-threshold" value="1"/>
>>     <nvpair ... name="failure-timeout" value="0"/>
>>   </meta_attributes>
>>   <operations>
>>     <op ... name="monitor" on-fail="restart" requires="nothing" timeout="40s"/>
>>     <op ... on-fail="restart" requires="nothing" timeout="180s"/>
>>     <op ... on-fail="restart" requires="nothing" timeout="180s"/>
>>   </operations>
>> </primitive>
>> 
>> 2012/5/9 Junko IKEDA :
>>> Hi,
>>> 
>>> In my case, the umount succeeds when the Fibre Channel is 
>>> disconnected, so it seems that handling the status file caused the 
>>> longer failover, as Dejan said.
>>> If the umount fails, it will run into a timeout and might trigger a 
>>> stonith action; that case also makes sense (though I couldn't see it).
>>> 
>>> I tried the following setups:
>>> 
>>> (1) timeout : multipath > RA
>>> multipath timeout = 120s
>>> Filesyste

[Linux-ha-dev] R: R: R: [PATCH] Filesystem RA:

2013-04-10 Thread Guglielmo Abbruzzese
I realize SBD could be the best option in my situation.
So I prepared a 1G partition on the shared storage, and I downloaded the 
sbd-1837fd8cc64a.tar.gz file from the http://hg.linux-ha.org/sbd link.
Just one doubt: do I need to upgrade the corosync/pacemaker versions 
(unfortunately I cannot)?
This is my environment:
  RHEL       6.2      2.6.32-220.el6.x86_64 #1 SMP Wed Nov 9 08:03:13 EST 2011 
                      x86_64 x86_64 x86_64 GNU/Linux
  Pacemaker  1.1.6-3  pacemaker-cli-1.1.6-3.el6.x86_64
                      pacemaker-libs-1.1.6-3.el6.x86_64
                      pacemaker-cluster-libs-1.1.6-3.el6.x86_64
                      pacemaker-1.1.6-3.el6.x86_64
  Corosync   1.4.1-4  corosync-1.4.1-4.el6.x86_64
                      corosynclib-1.4.1-4.el6.x86_64
  DRBD       8.4.1-2  kmod-drbd84-8.4.1-2.el6.elrepo.x86_64
                      drbd84-utils-8.4.1-2.el6.elrepo.x86_64

Will it be enough to just compile and install the source code, or could I run 
into trouble with dependencies or similar issues?
Thanks a lot
G

-----Original Message-----
From: linux-ha-dev-boun...@lists.linux-ha.org 
[mailto:linux-ha-dev-boun...@lists.linux-ha.org] On Behalf Of Darren Thompson 
(AkurIT)
Sent: Wednesday, 10 April 2013 15:44
To: High-Availability Linux Development List
Subject: Re: [Linux-ha-dev] R: R: [PATCH] Filesystem RA:

Hi G.

I personally recommend, as a minimum, that you set up an SBD partition and use 
SBD STONITH. It protects against file/database corruption in the event of an 
issue on the underlying storage.

Hardware (power) STONITH is considered the "best" protection, but I have had 
clusters running for years using just SBD STONITH, and I would not deploy a 
cluster-managed file system without it.

You should also strongly consider setting "fence on stop failure" for the 
same reason. The worst possible corruption can be caused by the cluster having 
a "split brain" due to a partially unmounted file system and another node 
mounting and writing to it at the same time.

Regards
D.


On 10/04/2013, at 5:30 PM, "Guglielmo Abbruzzese"  wrote:

> Hi Darren,
> I am aware STONITH could help, but unfortunately I cannot add such a device to 
> the architecture at the moment. 
> Furthermore, Sybase seems to be stopped (the start/stop order should already 
> be guaranteed by the Resource Group structure):
> 
> Resource Group: grp-sdg
> resource_vrt_ip    (ocf::heartbeat:IPaddr2):      Started NODE_A
> resource_lvm       (ocf::heartbeat:LVM):          Started NODE_A
> resource_lvmdir    (ocf::heartbeat:Filesystem):   failed (and so unmanaged)
> resource_sybase    (lsb:sybase):                  stopped
> resource_httpd     (lsb:httpd):                   stopped
> resource_tomcatd   (lsb:tomcatd):                 stopped
> resource_sdgd      (lsb:sdgd):                    stopped
> resource_statd     (lsb:statistiched):            stopped
> 
> I'm just wondering: why did the same configuration swap over fine with the 
> previous storage? The only difference I can see is the changed multipath 
> configuration.
> 
> Thanks a lot
> G.
> 
> 
> -----Original Message-----
> From: linux-ha-dev-boun...@lists.linux-ha.org 
> [mailto:linux-ha-dev-boun...@lists.linux-ha.org] On Behalf Of Darren 
> Thompson (AkurIT)
> Sent: Tuesday, 9 April 2013 23:35
> To: High-Availability Linux Development List
> Subject: Re: [Linux-ha-dev] R: [PATCH] Filesystem RA:
> 
> Hi
> 
> The correct way for that to have been handled, given your additional detail, 
> would have been for the node to have received a STONITH.
> 
> Things that you should check:
> 1. STONITH device configured correctly and operational.
> 2. The "on-fail" action for the stop operation of any filesystem cluster 
> resource should be "fence".
> 3. Review your constraints to ensure that the order and relationship between 
> Sybase and the filesystem resource are correct, so that Sybase is stopped 
> first.
> 
> Hope this helps
> 
> Darren
> 
> 
> Sent from my iPhone
> 
> On 09/04/2013, at 11:57 PM, "Guglielmo Abbruzzese"  
> wrote:
> 
>> Hi everybody,
>> In my case (very similar to Junko's), when I disconnect the Fibre 
>> Channels, the "try_umount" procedure in the Filesystem RA script doesn't work.
>> 
>> After the configured attempts the active/passive cluster doesn't swap, 
>> and the lvmdir resource is flagged as "failed" rather than "stopped".
>> 
>> I must say that even if I try to umount the /storage resource manually it 
>> doesn't work, because Sybase is using some files stored on it (device 
>> busy); this is why the RA cannot complete the operation cleanly. 
>> Is there a way to force the swap anyway?
>> 
>> Some things I already tried:
>> 1) This very test with a different optical SAN/storage in the past, 
>> and the RA could always umount the storage correctly;
>> 2) I modified the RA to force the option "umount -l", even though I have 
>> an ext4 filesystem rather than NFS;
>> 3) I killed the hung processes with the command "fuser -km /storage", 
>> but the umount always failed, and after a while I got a kernel panic.
>> 
>> Is there a way to force the swap anyway, even if the umount is not clean?
>> An

Re: [Linux-ha-dev] R: R: R: [PATCH] Filesystem RA:

2013-04-10 Thread Darren Thompson (AkurIT)
Guglielmo

Sorry, this is where I have to back out; I'm not familiar with the raw software 
or the Red Hat distro. 
In my case I used SLES 11.1 with the HA extension, and all the software I 
required was included in those two components. 
I have also set up, and continue to use, SLES 10+ clusters, but SBD was not 
available then, so I had to use alternative STONITH drivers (iLO and blade 
centre), although I still have an old cluster that uses SSH STONITH.

D.


Sent from my iPhone

On 11/04/2013, at 3:12 AM, "Guglielmo Abbruzzese"  wrote:

> I realize SBD could be the best option in my situation.
> So I prepared a 1G partition on the shared storage, and I downloaded the 
> sbd-1837fd8cc64a.tar.gz file from the http://hg.linux-ha.org/sbd link.
> Just one doubt: do I need to upgrade the corosync/pacemaker versions 
> (unfortunately I cannot)?
> This is my environment:
>   RHEL       6.2      2.6.32-220.el6.x86_64 #1 SMP Wed Nov 9 08:03:13 EST 2011 
>                       x86_64 x86_64 x86_64 GNU/Linux
>   Pacemaker  1.1.6-3  pacemaker-cli-1.1.6-3.el6.x86_64
>                       pacemaker-libs-1.1.6-3.el6.x86_64
>                       pacemaker-cluster-libs-1.1.6-3.el6.x86_64
>                       pacemaker-1.1.6-3.el6.x86_64
>   Corosync   1.4.1-4  corosync-1.4.1-4.el6.x86_64
>                       corosynclib-1.4.1-4.el6.x86_64
>   DRBD       8.4.1-2  kmod-drbd84-8.4.1-2.el6.elrepo.x86_64
>                       drbd84-utils-8.4.1-2.el6.elrepo.x86_64
> 
> Will it be enough to just compile and install the source code, or could I run 
> into trouble with dependencies or similar issues?
> Thanks a lot
> G
> 
> -----Original Message-----
> From: linux-ha-dev-boun...@lists.linux-ha.org 
> [mailto:linux-ha-dev-boun...@lists.linux-ha.org] On Behalf Of Darren Thompson 
> (AkurIT)
> Sent: Wednesday, 10 April 2013 15:44
> To: High-Availability Linux Development List
> Subject: Re: [Linux-ha-dev] R: R: [PATCH] Filesystem RA:
> 
> Hi G.
> 
> I personally recommend, as a minimum, that you set up an SBD partition and use 
> SBD STONITH. It protects against file/database corruption in the event of an 
> issue on the underlying storage.
> 
> Hardware (power) STONITH is considered the "best" protection, but I have had 
> clusters running for years using just SBD STONITH, and I would not deploy a 
> cluster-managed file system without it.
> 
> You should also strongly consider setting "fence on stop failure" for the 
> same reason. The worst possible corruption can be caused by the cluster having 
> a "split brain" due to a partially unmounted file system and another node 
> mounting and writing to it at the same time.
> 
> Regards
> D.
> 
> 
> On 10/04/2013, at 5:30 PM, "Guglielmo Abbruzzese"  
> wrote:
> 
>> Hi Darren,
>> I am aware STONITH could help, but unfortunately I cannot add such a device to 
>> the architecture at the moment. 
>> Furthermore, Sybase seems to be stopped (the start/stop order should already 
>> be guaranteed by the Resource Group structure):
>> 
>> Resource Group: grp-sdg
>> resource_vrt_ip    (ocf::heartbeat:IPaddr2):      Started NODE_A
>> resource_lvm       (ocf::heartbeat:LVM):          Started NODE_A
>> resource_lvmdir    (ocf::heartbeat:Filesystem):   failed (and so unmanaged)
>> resource_sybase    (lsb:sybase):                  stopped
>> resource_httpd     (lsb:httpd):                   stopped
>> resource_tomcatd   (lsb:tomcatd):                 stopped
>> resource_sdgd      (lsb:sdgd):                    stopped
>> resource_statd     (lsb:statistiched):            stopped
>> 
>> I'm just wondering: why did the same configuration swap over fine with the 
>> previous storage? The only difference I can see is the changed multipath 
>> configuration.
>> 
>> Thanks a lot
>> G.
>> 
>> 
>> -----Original Message-----
>> From: linux-ha-dev-boun...@lists.linux-ha.org 
>> [mailto:linux-ha-dev-boun...@lists.linux-ha.org] On Behalf Of Darren 
>> Thompson (AkurIT)
>> Sent: Tuesday, 9 April 2013 23:35
>> To: High-Availability Linux Development List
>> Subject: Re: [Linux-ha-dev] R: [PATCH] Filesystem RA:
>> 
>> Hi
>> 
>> The correct way for that to have been handled, given your additional detail, 
>> would have been for the node to have received a STONITH.
>> 
>> Things that you should check:
>> 1. STONITH device configured correctly and operational.
>> 2. The "on-fail" action for the stop operation of any filesystem cluster 
>> resource should be "fence".
>> 3. Review your constraints to ensure that the order and relationship between 
>> Sybase and the filesystem resource are correct, so that Sybase is stopped 
>> first.
>> 
>> Hope this helps
>> 
>> Darren
>> 
>> 
>> Sent from my iPhone
>> 
>> On 09/04/2013, at 11:57 PM, "Guglielmo Abbruzzese"  
>> wrote:
>> 
>>> Hi everybody,
>>> In my case (very similar to Junko's), when I disconnect the Fibre 
>>> Channels, the "try_umount" procedure in the Filesystem RA script doesn't work.
>>> 
>>> After the configured attempts the active/passive cluster doesn't 
>>> swap, and the lvmdir resource is flagged as "failed" rather than "stopped".
>>> 
>>> I must say, even if I try to umount the /storage resource manually it 
>>> doesn't work because of sybase is us

Re: [Linux-ha-dev] [PATCH][crmsh] deal with the case-insensitive hostname

2013-04-10 Thread Dejan Muhamedagic
Hi Junko-san,

On Wed, Apr 10, 2013 at 06:13:45PM +0900, Junko IKEDA wrote:
> Hi,
> I set upper-case hostnames (GUEST03/GUEST04) and am running Pacemaker 1.1.9 +
> Corosync 2.3.0.
> 
> [root@GUEST04 ~]# crm_mon -1
> Last updated: Wed Apr 10 15:12:48 2013
> Last change: Wed Apr 10 14:02:36 2013 via crmd on GUEST04
> Stack: corosync
> Current DC: GUEST04 (3232242817) - partition with quorum
> Version: 1.1.9-e8caee8
> 2 Nodes configured, unknown expected votes
> 1 Resources configured.
> 
> 
> Online: [ GUEST03 GUEST04 ]
> 
>  dummy  (ocf::pacemaker:Dummy): Started GUEST03
> 
> 
> For example, calling the crm shell with a lower-case hostname:
> 
> [root@GUEST04 ~]# crm node standby guest03
> ERROR: bad lifetime: guest03

This message looks awkward.

> "crm node standby GUEST03" surely works well,
> so crm shell just doesn't take into account the hostname conversion.
> It's better to accept the both of the upper/lower-case.

Yes, indeed.

> "node standby", "node delete", "resource migrate(move)"  get hit with this
> issue.
> Please see the attached.

The patch looks correct. Many thanks for the contribution!

Cheers,

Dejan

> Thanks,
> Junko





Re: [Linux-ha-dev] ManageVE prints bogus errors to the syslog

2013-04-10 Thread Dejan Muhamedagic
On Wed, Apr 10, 2013 at 12:23:38AM +0200, Lars Ellenberg wrote:
> On Fri, Apr 05, 2013 at 12:39:46PM +0200, Dejan Muhamedagic wrote:
> > Hi Lars,
> > 
> > On Thu, Apr 04, 2013 at 09:28:00PM +0200, Lars Ellenberg wrote:
> > > On Wed, Apr 03, 2013 at 06:25:58PM +0200, Dejan Muhamedagic wrote:
> > > > Hi,
> > > > 
> > > > On Fri, Mar 22, 2013 at 08:41:30AM +0100, Roman Haefeli wrote:
> > > > > Hi,
> > > > > 
> > > > > When stopping a node of our cluster managing a bunch of OpenVZ CTs, I
> > > > > get a lot of such messages in the syslog:
> > > > > 
> > > > > Mar 20 17:20:44 localhost ManageVE[2586]: ERROR: vzctl status 10002 
> > > > > returned: 10002 does not exist.
> > > > > Mar 20 17:20:44 localhost lrmd: [2547]: info: operation monitor[6] on 
> > > > > opensim for client 2550: pid 2586 exited with return code 7
> > > > > 
> > > > > It looks to me as if lrmd is making sure the CT is not running 
> > > > > anymore.
> > > > > However, this triggers ManageVE to print an error.
> > > > 
> > > > Could be. Looking at the RA, there's a bunch of places where the
> > > > status is invoked and where this message could get logged. It
> > > > could be improved. The following patch should help:
> > > > 
> > > > https://github.com/ClusterLabs/resource-agents/commit/ca987afd35226145f48fb31bef911aa3ed3b6015
> > > 
> > > BTW, why call `vzctl | awk` *twice*,
> > > just to get two items out of the vzctl output?
> > > 
> > > How about losing the awk, and the second invocation?
> > > something like this:
> > > (should veexists and vestatus be local as well?)
> > > 
> > > diff --git a/heartbeat/ManageVE b/heartbeat/ManageVE
> > > index 56a3d03..53f9bab 100755
> > > --- a/heartbeat/ManageVE
> > > +++ b/heartbeat/ManageVE
> > > @@ -182,10 +182,12 @@ migrate_from_ve()
> > >  status_ve()
> > >  { 
> > >declare -i retcode
> > > -
> > > -  veexists=`$VZCTL status $VEID 2>/dev/null | $AWK '{print $3}'`
> > > -  vestatus=`$VZCTL status $VEID 2>/dev/null | $AWK '{print $5}'`
> > > +  local vzstatus
> > > +  vzstatus=`$VZCTL status $VEID 2>/dev/null`
> > >retcode=$?
> > > +  set -- $vzstatus
> > > +  veexists=$3
> > > +  vestatus=$5
> > >  
> > >if [[ $retcode != 0 ]]; then
> > >  ocf_log err "vzctl status $VEID returned: $retcode"
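
For context, a tiny standalone illustration of the splitting trick in that 
patch (the sample output line is an assumption about vzctl's format, consistent 
with the $3/$5 fields the RA already reads):

  # simulate `vzctl status <VEID>` output and split it with set --
  vzstatus="CTID 10002 exist mounted running"
  set -- $vzstatus
  veexists=$3    # "exist" or "deleted"
  vestatus=$5    # "running" or "down"
  echo "$veexists $vestatus"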
> > 
> > Well, you do have commit rights, don't you? :)
> 
> Sure, but I don't have a vz handy to test even "obviously correct"
> patches with, before I commit...

Looked correct to me too, but then it wouldn't have been the
first time I got something wrong :D

Maybe the reporter can help with testing. Roman?

Cheers,

Dejan

> 
>   Lars