Re: [Linux-HA] resource unmanaged/failed
2011/12/12, Andrew Beekhof:
> On Fri, Dec 9, 2011 at 7:46 PM, Aleksey V. Kashin wrote:
>>> How much do they have now?
>>
>> They have 12G RAM.
>
> That seems respectable.
>
>>> How much is in use by the radius servers?
>>
>>                      total   used   free  shared  buffers  cached
>> Mem:                 12038  11606    431       0        2    6479
>> -/+ buffers/cache:           5124   6913
>> Swap:                 7632   3398   4233
>
> That doesn't really answer the question though, you really need to
> find out where the memory is going.
> Although 12Gb is a decent amount of RAM, /If/ a single radius server
> needs 8Gb, then the machine is clearly not going to be able to handle
> 2 of them.
> There's not really anything Pacemaker can do about it.

This server also runs Oracle RDBMS (the database for the radius server), which generates a large part of the load.

> About the only thing you can do is increase the operation timeouts and
> perhaps play with the realtime and nice values of various processes.

I tried increasing "timeout" (how long to wait before declaring the action has failed), but that did not work for me. Now I'm testing "failure-timeout" (how many seconds to wait before acting as if the failure had not occurred). I'll also try adjusting the process priority of corosync.

Thanks for your advice.

>> And now I'm seeing again "resource unmanaged/failed" :(
>>
>> Resource Group: raddb
>>     raddb_ip (ocf::heartbeat:IPaddr2): Started radius1 (unmanaged) FAILED
>>
>> Failed actions:
>>     raddb_ip_monitor_15000 (node=radius1, call=4, rc=-2, status=Timed Out): unknown exec error
>>     raddb_ip_stop_0 (node=radius1, call=5, rc=-2, status=Timed Out): unknown exec error

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
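
For reference, the two settings discussed in this message, the per-operation timeout and failure-timeout, could be set on the resource roughly as follows. This is only a sketch in crm shell configuration syntax, as it might appear under `crm configure edit raddb_ip`. The 60s, 120s and 300s values are illustrative assumptions, not values from this thread, and would need tuning against how long IPaddr2 actually takes on the loaded node.

```
primitive raddb_ip ocf:heartbeat:IPaddr2 \
    params ip="10.99.2.57" cidr_netmask="32" \
    op monitor interval="15s" timeout="60s" \
    op stop interval="0" timeout="120s" \
    meta failure-timeout="300s"
```

Because stonith is disabled in this cluster, a stop that times out cannot be recovered by fencing, which is why the resource is left unmanaged. Once the node is responsive again, `crm resource cleanup raddb_ip` clears the failed actions.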
Re: [Linux-HA] resource unmanaged/failed
On Fri, Dec 9, 2011 at 7:46 PM, Aleksey V. Kashin wrote:
>> How much do they have now?
>
> They have 12G RAM.

That seems respectable.

>> How much is in use by the radius servers?
>
>                      total   used   free  shared  buffers  cached
> Mem:                 12038  11606    431       0        2    6479
> -/+ buffers/cache:           5124   6913
> Swap:                 7632   3398   4233

That doesn't really answer the question though, you really need to find out where the memory is going.
Although 12Gb is a decent amount of RAM, /If/ a single radius server needs 8Gb, then the machine is clearly not going to be able to handle 2 of them.
There's not really anything Pacemaker can do about it.

About the only thing you can do is increase the operation timeouts and perhaps play with the realtime and nice values of various processes.

> And now I'm seeing again "resource unmanaged/failed" :(
>
> Resource Group: raddb
>     raddb_ip (ocf::heartbeat:IPaddr2): Started radius1 (unmanaged) FAILED
>
> Failed actions:
>     raddb_ip_monitor_15000 (node=radius1, call=4, rc=-2, status=Timed Out): unknown exec error
>     raddb_ip_stop_0 (node=radius1, call=5, rc=-2, status=Timed Out): unknown exec error
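
One way to start finding out where the memory is going is to rank processes by resident memory. The sketch below assumes a procps-style `ps`; the helper name `top_rss` is made up for illustration and is not part of any package mentioned in this thread.

```shell
# top_rss (hypothetical helper name): print the N rows with the largest
# resident set size. Expects "PID RSS_KiB COMMAND" rows on stdin, so it can
# be fed from ps on a live node or from a saved listing. N defaults to 10.
top_rss() {
    sort -k2,2nr | head -n "${1:-10}"
}

# On a live node (an empty header after "=" suppresses the title row):
#   ps -e -o pid= -o rss= -o comm= | top_rss 10
```

RSS only counts pages currently resident in RAM, so a process whose pages have already been pushed to swap will look smaller here than it really is.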
[Linux-HA] resource unmanaged/failed
> How much do they have now?

They have 12G RAM.

> How much is in use by the radius servers?

                     total   used   free  shared  buffers  cached
Mem:                 12038  11606    431       0        2    6479
-/+ buffers/cache:           5124   6913
Swap:                 7632   3398   4233

And now I'm seeing again "resource unmanaged/failed" :(

Resource Group: raddb
    raddb_ip (ocf::heartbeat:IPaddr2): Started radius1 (unmanaged) FAILED

Failed actions:
    raddb_ip_monitor_15000 (node=radius1, call=4, rc=-2, status=Timed Out): unknown exec error
    raddb_ip_stop_0 (node=radius1, call=5, rc=-2, status=Timed Out): unknown exec error
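
The `-/+ buffers/cache` row of this output is derived from the `Mem:` row: memory held by applications is "used" minus "buffers" minus "cached". A small sketch of that arithmetic with the figures from this message (the function name `app_used_mb` is hypothetical):

```shell
# app_used_mb (hypothetical helper): memory used by applications, in MB,
# given the "used", "buffers" and "cached" columns of the Mem: row.
app_used_mb() {
    used=$1 buffers=$2 cached=$3
    echo $(( used - buffers - cached ))
}

# Figures from the Mem: row quoted in this message.
app_used_mb 11606 2 6479
```

This gives 5125, against the 5124 that free reports; the 1 MB difference presumably comes from each column being rounded separately. So roughly 5 GB of RAM is held by processes, about 6.3 GB is page cache, and a further 3398 MB sits in swap.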
Re: [Linux-HA] resource unmanaged/failed
On Wed, Dec 7, 2011 at 9:56 PM, Aleksey V. Kashin wrote:
> I can't increase ram on this servers. How can I do that resource isn't
> becomes "unmanaged/failed" ?

How much do they have now?
How much is in use by the radius servers?
Re: [Linux-HA] resource unmanaged/failed
Hi,

On Wed, Dec 07, 2011 at 04:56:31PM +0600, Aleksey V. Kashin wrote:
> Hello.
>
> I have two servers (radius1, radius2). I've set up the cluster resource
> - IPaddr2. I used next commands to set up this resource:
>
> # crm configure property stonith-enabled="false"

For a 2-node cluster disabling stonith is really bad.

> # crm configure property no-quorum-policy="ignore"
> # crm configure primitive raddb_ip ocf:heartbeat:IPaddr2 params
> ip="10.99.2.57" cidr_netmask="32" op monitor interval="15s"
> # crm configure group raddb raddb_ip
> # crm configure location raddb-prefers-radius1 raddb inf: radius1
> # crm configure rsc_defaults resource-stickiness=101
>
> All ok.
>
> But sometimes on server radius1 the load is increasing and server is
> swapping and at that moment resource becomes "(unmanaged) FAILED". Below
> I've presented example "unmanaged" resource:
>
> # crm_mon
>
> Last updated: Wed Dec 7 14:56:20 2011
> Stack: openais
> Current DC: radius1 - partition with quorum
> Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
>
> Online: [ radius2 radius1 ]
>
> Resource Group: raddb
>     raddb_ip (ocf::heartbeat:IPaddr2): Started radius1 (unmanaged) FAILED
>
> Failed actions:
>     raddb_ip_monitor_15000 (node=radius1, call=4, rc=-2, status=Timed Out): unknown exec error
>     raddb_ip_stop_0 (node=radius1, call=5, rc=-2, status=Timed Out): unknown exec error
>
> I've presented part of /var/log/syslog (radius1) here -
> http://paste.org/41963
>
> In that moment ip address 10.99.2.57 is alive and server responds to
> requests coming to this ip. However sometimes this resource becomes
> completely unavailable and I restart corosync. It's very bad.
>
> I think resource becomes unmanaged because server is using swap and part
> of corosync processes is in swap. I tested this suggestion and when
> server is using a lot of swap resource becomes "unmanaged".

corosync gets swapped? How interesting.

> I use debian gnu/linux 5.x and this packages -
> http://people.debian.org/~madkiss/ha/:
>
> # dpkg -l |grep cluster
> ii  cluster-glue     1.0.7+hg2618-2~bpo50+1  The reusable cluster components for Linux HA
> ii  corosync         1.4.2-1~bpo50+1         Standards-based cluster framework (daemon an
> ii  libcluster-glue  1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries (transitional pac
> ii  libcorosync4     1.4.2-1~bpo50+1         Standards-based cluster framework (libraries
> ii  libcrmcluster1   1.1.5-3~bpo50+1         Pacemaker libraries - CRM
> ii  liblrm2          1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- liblrm2
> ii  libpils2         1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- libpils2
> ii  libplumb2        1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- libplumb2
> ii  libplumbgpl2     1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- libplumbgpl2
> ii  libstonith1      1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- libstonith1
> ii  pacemaker        1.1.5-3~bpo50+1         HA cluster resource manager
>
> I can't increase ram on this servers. How can I do that resource isn't
> becomes "unmanaged/failed" ?

Buy more memory. If you cannot, then I don't see any point in using clustering.

Thanks,

Dejan

> With Best Regards.
> Aleksey V. Kashin
[Linux-HA] resource unmanaged/failed
Hello.

I have two servers (radius1, radius2). I've set up a cluster resource, IPaddr2, using the following commands:

# crm configure property stonith-enabled="false"
# crm configure property no-quorum-policy="ignore"
# crm configure primitive raddb_ip ocf:heartbeat:IPaddr2 params ip="10.99.2.57" cidr_netmask="32" op monitor interval="15s"
# crm configure group raddb raddb_ip
# crm configure location raddb-prefers-radius1 raddb inf: radius1
# crm configure rsc_defaults resource-stickiness=101

All OK.

But sometimes the load on radius1 increases and the server starts swapping; at that moment the resource becomes "(unmanaged) FAILED". Here is an example of the "unmanaged" resource:

# crm_mon

Last updated: Wed Dec 7 14:56:20 2011
Stack: openais
Current DC: radius1 - partition with quorum
Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
1 Resources configured.

Online: [ radius2 radius1 ]

Resource Group: raddb
    raddb_ip (ocf::heartbeat:IPaddr2): Started radius1 (unmanaged) FAILED

Failed actions:
    raddb_ip_monitor_15000 (node=radius1, call=4, rc=-2, status=Timed Out): unknown exec error
    raddb_ip_stop_0 (node=radius1, call=5, rc=-2, status=Timed Out): unknown exec error

Part of /var/log/syslog from radius1 is here: http://paste.org/41963

At that moment the IP address 10.99.2.57 is alive and the server responds to requests coming to it. However, sometimes the resource becomes completely unavailable and I have to restart corosync. It's very bad.

I think the resource becomes unmanaged because the server is using swap and part of the corosync processes are swapped out. I tested this hypothesis: when the server is using a lot of swap, the resource becomes "unmanaged".

I use Debian GNU/Linux 5.x and these packages from http://people.debian.org/~madkiss/ha/:

# dpkg -l |grep cluster
ii  cluster-glue     1.0.7+hg2618-2~bpo50+1  The reusable cluster components for Linux HA
ii  corosync         1.4.2-1~bpo50+1         Standards-based cluster framework (daemon an
ii  libcluster-glue  1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries (transitional pac
ii  libcorosync4     1.4.2-1~bpo50+1         Standards-based cluster framework (libraries
ii  libcrmcluster1   1.1.5-3~bpo50+1         Pacemaker libraries - CRM
ii  liblrm2          1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- liblrm2
ii  libpils2         1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- libpils2
ii  libplumb2        1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- libplumb2
ii  libplumbgpl2     1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- libplumbgpl2
ii  libstonith1      1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- libstonith1
ii  pacemaker        1.1.5-3~bpo50+1         HA cluster resource manager

I can't increase the RAM on these servers. How can I keep the resource from becoming "unmanaged/failed"?

With Best Regards.
Aleksey V. Kashin