Re: [Linux-HA] resource unmanaged/failed

2011-12-12 Thread Aleksey V. Kashin
2011/12/12, Andrew Beekhof:
> On Fri, Dec 9, 2011 at 7:46 PM, Aleksey V. Kashin wrote:
>>> How much do they have now?
>>
>> They have 12G RAM.
>
> That seems respectable.
>
>>
>>> How much is in use by the radius servers?
>>
>>              total       used       free     shared    buffers     cached
>> Mem:         12038      11606        431          0          2       6479
>> -/+ buffers/cache:       5124       6913
>> Swap:         7632       3398       4233
>
> That doesn't really answer the question though, you really need to
> find out where the memory is going.
> Although 12 GB is a decent amount of RAM, /if/ a single radius server
> needs 8 GB, then the machine is clearly not going to be able to handle
> 2 of them.
> There's not really anything Pacemaker can do about it.
>

This server also runs an Oracle RDBMS (the database for the radius server),
which generates a large part of the load.

> About the only thing you can do is increase the operation timeouts and
> perhaps play with the realtime and nice values of various processes.
>

I tried increasing "timeout" (how long to wait before declaring that the
action has failed), but that did not work for me. Now I'm testing
"failure-timeout" (how many seconds to wait before acting as if the
failure had not occurred). I'll also try adjusting the process priority
of corosync. Thanks for your advice.
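
Since both the monitor and the stop operation timed out in the status output quoted in this thread, a sketch that sets an explicit timeout on every operation, plus a failure-timeout, might look like the following. It assumes the resource is being defined from scratch with the same parameters as in the original post. The function name, the CRM dry-run variable and all timeout values are made up for illustration and would need tuning against the real load.

```shell
#!/bin/sh
# Hypothetical wrapper around the original "crm configure primitive" call.
# Set CRM=echo to print the command instead of running it (dry run).
define_raddb_ip() {
    op_timeout=${1:-60s}        # assumed value for monitor and start
    stop_timeout=${2:-120s}     # assumed value for stop
    ${CRM:-crm} configure primitive raddb_ip ocf:heartbeat:IPaddr2 \
        params ip=10.99.2.57 cidr_netmask=32 \
        meta failure-timeout=300s \
        op monitor interval=15s timeout="$op_timeout" \
        op start interval=0 timeout="$op_timeout" \
        op stop interval=0 timeout="$stop_timeout"
}
```

Running `CRM=echo define_raddb_ip 90s 180s` prints the command for review without touching the cluster. Expired failures are reportedly only noticed when the cluster next re-evaluates its state (see the cluster-recheck-interval property), so a short failure-timeout does not by itself guarantee a quick recovery.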

>> And now I'm seeing "resource unmanaged/failed" again :(
>
>
>
>>  Resource Group: raddb
>>      raddb_ip   (ocf::heartbeat:IPaddr2):   Started radius1 (unmanaged) FAILED
>>
>> Failed actions:
>>     raddb_ip_monitor_15000 (node=radius1, call=4, rc=-2, status=Timed Out): unknown exec error
>>     raddb_ip_stop_0 (node=radius1, call=5, rc=-2, status=Timed Out): unknown exec error
>> ___
>> Linux-HA mailing list
>> Linux-HA@lists.linux-ha.org
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] resource unmanaged/failed

2011-12-11 Thread Andrew Beekhof
On Fri, Dec 9, 2011 at 7:46 PM, Aleksey V. Kashin wrote:
>> How much do they have now?
>
> They have 12G RAM.

That seems respectable.

>
>> How much is in use by the radius servers?
>
>                   total       used       free     shared    buffers     cached
> Mem:         12038      11606        431          0          2       6479
> -/+ buffers/cache:       5124       6913
> Swap:         7632       3398       4233

That doesn't really answer the question though, you really need to
find out where the memory is going.
Although 12 GB is a decent amount of RAM, /if/ a single radius server
needs 8 GB, then the machine is clearly not going to be able to handle
2 of them.
There's not really anything Pacemaker can do about it.
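
A rough sketch of how one might look for where the memory is going. The function names are made up for this example. `app_used_mb` assumes the older `free -m` column layout shown in this thread (total, used, free, shared, buffers, cached), and `top_rss` assumes the procps version of `ps`, which supports `--sort`.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Pacemaker or corosync.

# Read `free -m` style output on stdin and print used memory minus
# buffers and cache, in MB.
app_used_mb() {
    awk '$1 == "Mem:" { print $3 - $6 - $7 }'
}

# Print the N processes (default 10) with the largest resident set
# size (RSS, in KB), preceded by the ps header line.
top_rss() {
    ps -eo pid,rss,vsz,comm --sort=-rss | head -n "$(( ${1:-10} + 1 ))"
}
```

Typical use would be `free -m | app_used_mb` followed by `top_rss 15`. With the figures posted above the first command gives 5125, which agrees with the 5124 on the `-/+ buffers/cache` line up to rounding. The per-process list is what actually shows whether the radius server or something else holds that memory.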

About the only thing you can do is increase the operation timeouts and
perhaps play with the realtime and nice values of various processes.
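
A sketch of what inspecting and adjusting priorities could look like. The helper names and the nice value of -10 are arbitrary examples, and the daemon names passed in (for instance corosync, lrmd, crmd) are assumptions that should be checked against the local process list. corosync usually requests realtime scheduling for itself at startup, so it may be the Pacemaker daemons that benefit more.

```shell
#!/bin/sh
# Hypothetical helper: show scheduling policy and nice value for every
# running process with one of the given names.
show_prio() {
    for name in "$@"; do
        for pid in $(pidof "$name"); do
            echo "$name (pid $pid):"
            chrt -p "$pid"                  # scheduling policy and priority
            printf 'nice value: '
            ps -o ni= -p "$pid"
        done
    done
}

# Hypothetical helper: lower the nice value of one daemon (needs root).
# Returns non-zero if no such process is running.
raise_prio() {
    pids=$(pidof "$1") || return 1
    renice -n -10 -p $pids
}
```

For example, `show_prio corosync lrmd crmd` before and after `raise_prio lrmd` shows whether the change took effect. Changes made this way are lost when the daemon restarts.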

> And now I'm seeing "resource unmanaged/failed" again :(



>  Resource Group: raddb
>      raddb_ip   (ocf::heartbeat:IPaddr2):   Started radius1 (unmanaged) FAILED
>
> Failed actions:
>     raddb_ip_monitor_15000 (node=radius1, call=4, rc=-2, status=Timed Out): unknown exec error
>     raddb_ip_stop_0 (node=radius1, call=5, rc=-2, status=Timed Out): unknown exec error


[Linux-HA] resource unmanaged/failed

2011-12-09 Thread Aleksey V. Kashin
> How much do they have now?

They have 12G RAM.

> How much is in use by the radius servers?

             total       used       free     shared    buffers     cached
Mem:         12038      11606        431          0          2       6479
-/+ buffers/cache:       5124       6913
Swap:         7632       3398       4233

And now I'm seeing "resource unmanaged/failed" again :(

 Resource Group: raddb
     raddb_ip   (ocf::heartbeat:IPaddr2):   Started radius1 (unmanaged) FAILED

Failed actions:
    raddb_ip_monitor_15000 (node=radius1, call=4, rc=-2, status=Timed Out): unknown exec error
    raddb_ip_stop_0 (node=radius1, call=5, rc=-2, status=Timed Out): unknown exec error


Re: [Linux-HA] resource unmanaged/failed

2011-12-08 Thread Andrew Beekhof
On Wed, Dec 7, 2011 at 9:56 PM, Aleksey V. Kashin wrote:
> I can't increase the RAM on these servers. How can I keep the resource
> from becoming "unmanaged/failed"?
>

How much do they have now?
How much is in use by the radius servers?


Re: [Linux-HA] resource unmanaged/failed

2011-12-08 Thread Dejan Muhamedagic
Hi,
On Wed, Dec 07, 2011 at 04:56:31PM +0600, Aleksey V. Kashin wrote:
> Hello.
> 
> I have two servers (radius1, radius2). I've set up a cluster resource,
> IPaddr2, using the following commands:
> 
> # crm configure property stonith-enabled="false"

For a 2-node cluster disabling stonith is really bad.

> # crm configure property no-quorum-policy="ignore"
> # crm configure primitive raddb_ip ocf:heartbeat:IPaddr2 params ip="10.99.2.57" cidr_netmask="32" op monitor interval="15s"
> # crm configure group raddb raddb_ip
> # crm configure location raddb-prefers-radius1 raddb inf: radius1
> # crm configure rsc_defaults resource-stickiness=101
> 
> All ok.
> 
> But sometimes the load on server radius1 increases and the server starts
> swapping, and at that moment the resource becomes "(unmanaged) FAILED".
> Below is an example of the "unmanaged" resource:
> 
> # crm_mon
> 
> Last updated: Wed Dec  7 14:56:20 2011
> Stack: openais
> Current DC: radius1 - partition with quorum
> Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
> 
> 
> Online: [ radius2 radius1 ]
> 
>   Resource Group: raddb
>       raddb_ip   (ocf::heartbeat:IPaddr2):   Started radius1 (unmanaged) FAILED
> 
> Failed actions:
>     raddb_ip_monitor_15000 (node=radius1, call=4, rc=-2, status=Timed Out): unknown exec error
>     raddb_ip_stop_0 (node=radius1, call=5, rc=-2, status=Timed Out): unknown exec error
> 
> 
> I've posted part of /var/log/syslog (radius1) here:
> http://paste.org/41963
> 
> 
> At that moment the IP address 10.99.2.57 is still up and the server
> responds to requests sent to it. However, sometimes the resource becomes
> completely unavailable and I have to restart corosync. That is very bad.
> 
> I think the resource becomes unmanaged because the server is swapping and
> some of the corosync processes have been swapped out. I tested this
> hypothesis: when the server uses a lot of swap, the resource becomes
> "unmanaged".

corosync gets swapped? How interesting.
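
One way to check whether a daemon really has pages in swap is to read its /proc status file. This is only a sketch: the helper names are invented, and the VmSwap line is only reported by reasonably recent kernels, so on an older system it may simply be absent. A non-zero VmLck value suggests the process has locked memory, which corosync normally attempts at startup.

```shell
#!/bin/sh
# Hypothetical helper: print the VmLck and VmSwap lines from a
# /proc/<pid>/status style file given as the first argument.
swap_and_lock() {
    awk '/^Vm(Swap|Lck):/ { print $1, $2, $3 }' "$1"
}

# Hypothetical helper: report for every running process with this name.
report_swap() {
    for pid in $(pidof "$1"); do
        echo "$1 pid $pid"
        swap_and_lock "/proc/$pid/status"
    done
}
```

`report_swap corosync` and `report_swap lrmd` would then show, per process, how much memory is locked and how much has been swapped out.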

> I use Debian GNU/Linux 5.x with these packages from
> http://people.debian.org/~madkiss/ha/:
> 
> # dpkg -l |grep cluster
> ii  cluster-glue     1.0.7+hg2618-2~bpo50+1  The reusable cluster components for Linux HA
> ii  corosync         1.4.2-1~bpo50+1         Standards-based cluster framework (daemon an
> ii  libcluster-glue  1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries (transitional pac
> ii  libcorosync4     1.4.2-1~bpo50+1         Standards-based cluster framework (libraries
> ii  libcrmcluster1   1.1.5-3~bpo50+1         Pacemaker libraries - CRM
> ii  liblrm2          1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- liblrm2
> ii  libpils2         1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- libpils2
> ii  libplumb2        1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- libplumb2
> ii  libplumbgpl2     1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- libplumbgpl2
> ii  libstonith1      1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- libstonith1
> ii  pacemaker        1.1.5-3~bpo50+1         HA cluster resource manager
> 
> 
> 
> I can't increase the RAM on these servers. How can I keep the resource
> from becoming "unmanaged/failed"?

Buy more memory. If you cannot, then I don't see any point in
using clustering.

Thanks,

Dejan


> With Best Regards.
> Aleksey V. Kashin


[Linux-HA] resource unmanaged/failed

2011-12-08 Thread Aleksey V. Kashin
Hello.

I have two servers (radius1, radius2). I've set up a cluster resource,
IPaddr2, using the following commands:

# crm configure property stonith-enabled="false"
# crm configure property no-quorum-policy="ignore"
# crm configure primitive raddb_ip ocf:heartbeat:IPaddr2 params ip="10.99.2.57" cidr_netmask="32" op monitor interval="15s"
# crm configure group raddb raddb_ip
# crm configure location raddb-prefers-radius1 raddb inf: radius1
# crm configure rsc_defaults resource-stickiness=101

All ok.

But sometimes the load on server radius1 increases and the server starts
swapping, and at that moment the resource becomes "(unmanaged) FAILED".
Below is an example of the "unmanaged" resource:

# crm_mon

Last updated: Wed Dec  7 14:56:20 2011
Stack: openais
Current DC: radius1 - partition with quorum
Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
1 Resources configured.


Online: [ radius2 radius1 ]

  Resource Group: raddb
      raddb_ip   (ocf::heartbeat:IPaddr2):   Started radius1 (unmanaged) FAILED

Failed actions:
    raddb_ip_monitor_15000 (node=radius1, call=4, rc=-2, status=Timed Out): unknown exec error
    raddb_ip_stop_0 (node=radius1, call=5, rc=-2, status=Timed Out): unknown exec error
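
One reading of this output, offered with some caution: after the monitor timed out, the cluster tried to stop the resource, and the stop timed out as well. A failed stop is normally recovered by fencing the node. With stonith disabled, Pacemaker's documented default is to block the resource instead, which crm_mon shows as unmanaged. The helper below is a small sketch (its name is made up) that pulls the timed-out operations out of `crm_mon -1` style output, assuming each failed action is on a single line.

```shell
#!/bin/sh
# Hypothetical helper: read crm_mon one-shot output on stdin and print
# "<operation> <node>" for every failed action whose status is Timed Out.
timed_out_ops() {
    sed -n 's/^ *\([^ ]*\) (node=\([^,]*\),.*status=Timed Out).*/\1 \2/p'
}
```

Typical use would be `crm_mon -1 | timed_out_ops`. Once the load has dropped, `crm resource cleanup raddb_ip` clears the recorded failures so that the cluster re-probes the resource and manages it again.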


I've posted part of /var/log/syslog (radius1) here:
http://paste.org/41963


At that moment the IP address 10.99.2.57 is still up and the server
responds to requests sent to it. However, sometimes the resource becomes
completely unavailable and I have to restart corosync. That is very bad.

I think the resource becomes unmanaged because the server is swapping and
some of the corosync processes have been swapped out. I tested this
hypothesis: when the server uses a lot of swap, the resource becomes
"unmanaged".

I use Debian GNU/Linux 5.x with these packages from
http://people.debian.org/~madkiss/ha/:

# dpkg -l |grep cluster
ii  cluster-glue     1.0.7+hg2618-2~bpo50+1  The reusable cluster components for Linux HA
ii  corosync         1.4.2-1~bpo50+1         Standards-based cluster framework (daemon an
ii  libcluster-glue  1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries (transitional pac
ii  libcorosync4     1.4.2-1~bpo50+1         Standards-based cluster framework (libraries
ii  libcrmcluster1   1.1.5-3~bpo50+1         Pacemaker libraries - CRM
ii  liblrm2          1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- liblrm2
ii  libpils2         1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- libpils2
ii  libplumb2        1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- libplumb2
ii  libplumbgpl2     1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- libplumbgpl2
ii  libstonith1      1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- libstonith1
ii  pacemaker        1.1.5-3~bpo50+1         HA cluster resource manager



I can't increase the RAM on these servers. How can I keep the resource
from becoming "unmanaged/failed"?


With Best Regards.
Aleksey V. Kashin