Re: [ClusterLabs] crm_mon memory leak

2015-11-18 Thread Jan Pokorný
On 09/11/15 13:11 +, Karthikeyan Ramasamy wrote:
> root 13405 1  0 13:42 ?00:00:00 /usr/sbin/crm_mon -p 
> /tmp/ClusterMon_SNMP_10.64.109.36.pid -d -i 15 -E 
> /opt/occ/CXP_902_0588_R13B2370/tools/PCSESA.sh -h 
> /tmp/ClusterMon_SNMP_10.64.109.36.html
> root 13566 13405  0 13:42 ?00:00:00 /usr/sbin/crm_mon -p 
> /tmp/ClusterMon_SNMP_10.64.109.36.pid -d -i 15 -E 
> /opt/occ/CXP_902_0588_R13B2370/tools/PCSESA.sh -h 
> /tmp/ClusterMon_SNMP_10.64.109.36.html
> root 13623 13566  0 13:42 ?00:00:00 /usr/sbin/crm_mon -p 
> /tmp/ClusterMon_SNMP_10.64.109.36.pid -d -i 15 -E 
> /opt/occ/CXP_902_0588_R13B2370/tools/PCSESA.sh -h 
> /tmp/ClusterMon_SNMP_10.64.109.36.html
> root 13758 13566  0 13:42 ?00:00:00 /usr/sbin/crm_mon -p 
> /tmp/ClusterMon_SNMP_10.64.109.36.pid -d -i 15 -E 
> /opt/occ/CXP_902_0588_R13B2370/tools/PCSESA.sh -h 
> /tmp/ClusterMon_SNMP_10.64.109.36.html
> root 13784 13623  0 13:42 ?00:00:00 /usr/sbin/crm_mon -p 
> /tmp/ClusterMon_SNMP_10.64.109.36.pid -d -i 15 -E 
> /opt/occ/CXP_902_0588_R13B2370/tools/PCSESA.sh -h 
> /tmp/ClusterMon_SNMP_10.64.109.36.html
> root 14146 13566  0 13:42 ?00:00:00 /usr/sbin/crm_mon -p 
> /tmp/ClusterMon_SNMP_10.64.109.36.pid -d -i 15 -E 
> /opt/occ/CXP_902_0588_R13B2370/tools/PCSESA.sh -h 
> /tmp/ClusterMon_SNMP_10.64.109.36.html
> root 14167 13623  0 13:42 ?00:00:00 /usr/sbin/crm_mon -p 
> /tmp/ClusterMon_SNMP_10.64.109.36.pid -d -i 15 -E 
> /opt/occ/CXP_902_0588_R13B2370/tools/PCSESA.sh -h 
> /tmp/ClusterMon_SNMP_10.64.109.36.html
> root 14193 13784  0 13:42 ?00:00:00 /usr/sbin/crm_mon -p 
> /tmp/ClusterMon_SNMP_10.64.109.36.pid -d -i 15 -E 
> /opt/occ/CXP_902_0588_R13B2370/tools/PCSESA.sh -h 
> /tmp/ClusterMon_SNMP_10.64.109.36.html
> root 14284 13758  0 13:42 ?00:00:00 /usr/sbin/crm_mon -p 
> /tmp/ClusterMon_SNMP_10.64.109.36.pid -d -i 15 -E 
> /opt/occ/CXP_902_0588_R13B2370/tools/PCSESA.sh -h 
> /tmp/ClusterMon_SNMP_10.64.109.36.html
> root 14381 13784  0 13:42 ?00:00:00 /usr/sbin/crm_mon -p 
> /tmp/ClusterMon_SNMP_10.64.109.36.pid -d -i 15 -E 
> /opt/occ/CXP_902_0588_R13B2370/tools/PCSESA.sh -h 
> /tmp/ClusterMon_SNMP_10.64.109.36.html
> root 14469 14284  0 13:42 ?00:00:00 /usr/sbin/crm_mon -p 
> /tmp/ClusterMon_SNMP_10.64.109.36.pid -d -i 15 -E 
> /opt/occ/CXP_902_0588_R13B2370/tools/PCSESA.sh -h 
> /tmp/ClusterMon_SNMP_10.64.109.36.html
> root 14589 13405  0 13:42 ?00:00:00 /usr/sbin/crm_mon -p 
> /tmp/ClusterMon_SNMP_10.64.109.36.pid -d -i 15 -E 
> /opt/occ/CXP_902_0588_R13B2370/tools/PCSESA.sh -h 
> /tmp/ClusterMon_SNMP_10.64.109.36.html
> root 14837 14381  0 13:42 ?00:00:00 /usr/sbin/crm_mon -p 
> /tmp/ClusterMon_SNMP_10.64.109.36.pid -d -i 15 -E 
> /opt/occ/CXP_902_0588_R13B2370/tools/PCSESA.sh -h 
> /tmp/ClusterMon_SNMP_10.64.109.36.html
> root 14860 13566  0 13:42 ?00:00:00 /usr/sbin/crm_mon -p 
> /tmp/ClusterMon_SNMP_10.64.109.36.pid -d -i 15 -E 
> /opt/occ/CXP_902_0588_R13B2370/tools/PCSESA.sh -h 
> /tmp/ClusterMon_SNMP_10.64.109.36.html
> root 14977 14589  0 13:42 ?00:00:00 /usr/sbin/crm_mon -p 
> /tmp/ClusterMon_SNMP_10.64.109.36.pid -d -i 15 -E 
> /opt/occ/CXP_902_0588_R13B2370/tools/PCSESA.sh -h 
> /tmp/ClusterMon_SNMP_10.64.109.36.html
> root 19816 14167  0 13:43 ?00:00:00 /usr/sbin/crm_mon -p 
> /tmp/ClusterMon_SNMP_10.64.109.36.pid -d -i 15 -E 
> /opt/occ/CXP_902_0588_R13B2370/tools/PCSESA.sh -h 
> /tmp/ClusterMon_SNMP_10.64.109.36.html
> root 19845 19816  0 13:43 ?00:00:00 /usr/sbin/crm_mon -p 
> /tmp/ClusterMon_SNMP_10.64.109.36.pid -d -i 15 -E 
> /opt/occ/CXP_902_0588_R13B2370/tools/PCSESA.sh -h 
> /tmp/ClusterMon_SNMP_10.64.109.36.html
> 
> From the above it looks that one crm_mon spawns another crm_mon processes and 
> keeps building.

Yep, see the attached PID scheme.

My guess is that the script PCSESA.sh is in fact an accidental "soft" fork
bomb that could be reduced to something like this t.sh script:

echo -e '#!/bin/sh\nwhile true; do sleep 15; (eval "$0" "$@" &); done' > t.sh
chmod +x t.sh
./t.sh --foo bar

What puzzles me, though, is that the same PID file used in nested
execution is not preventing this sort of recursion, and I am wondering
if "open(..., | O_SYNC)" or explicit fsync after write would be of
any help here (smells like filesystem-level race condition).
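
For illustration of the PID file question above, here is a minimal sketch
of an atomic PID-file guard in shell.  It is not how crm_mon handles its
-p file (that is not shown in this thread); acquire_pidfile is a
hypothetical helper, and the sketch assumes the shell's noclobber mode
creates the file exclusively, as dash and bash typically do.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only: try to claim a PID file.
# Returns 0 if this shell created the file, 1 if it already existed.
acquire_pidfile() {
    pidfile=$1
    # With noclobber (set -C), ">" refuses to overwrite an existing file,
    # so only one caller can succeed in creating it.
    # The subshell keeps noclobber from leaking into the caller.
    if (set -C; echo "$$" > "$pidfile") 2>/dev/null; then
        return 0
    fi
    # Assumption: a stale file left by a dead process is not handled here.
    return 1
}
```

A caller would run something like
"acquire_pidfile /tmp/example.pid || exit 0" before starting its loop, so
that a nested invocation exits instead of spawning another copy.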

> Can you please let us know if there is anything else we have to
> check or still there could be issues with the script?

Karthik, would you be able to provide somewhat reduced version of
PCSESA.sh (as requested by Ken) that still reproduces the issue?

-- 
Jan (Poki)


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

Re: [ClusterLabs] crm_mon memory leak

2015-11-18 Thread Karthikeyan Ramasamy
Hi Jan,
  The looping was caused by a problem in one of our Resource Agents: both its 
start and stop operations failed, and the calls became recursive.  That has 
now been fixed.

Thanks,
Karthik.


Re: [ClusterLabs] crm_mon memory leak

2015-10-30 Thread Ken Gaillot
On 10/30/2015 05:29 AM, Karthikeyan Ramasamy wrote:
> Dear Pacemaker support,
> We are using Pacemaker 1.1.10-14 to implement a service management framework, 
> with high availability on the roadmap.  This Pacemaker version was available 
> through Red Hat for our environments.
> 
> We are running into an issue where Pacemaker causes a node to crash.  The 
> last feature we integrated was SNMP notification.  While listing the 
> processes, we found that crm_mon processes were occupying 58 GB of the 
> available 64 GB when the node crashed.  When we removed that feature, 
> Pacemaker was stable again.
> 
> Section 7.1 of the Pacemaker documentation explains that the SNMP 
> notification agent triggers a crm_mon process at regular intervals.  On 
> checking ClusterLabs for known issues, we found this crm_mon memory leak 
> issue.  Although it is not related, we think that there is some problem 
> with the crm_mon process.
> 
> http://clusterlabs.org/pipermail/users/2015-August/001084.html
> 
> Can you please let us know if there are issues with SNMP notification in 
> Pacemaker, or if there is anything that we could be doing wrong?  Also, any 
> workarounds for this issue, if available, would be very helpful for us.  
> Please help.
> 
> Thanks,
> Karthik.

Are you using ClusterMon with Pacemaker's built-in SNMP capability, or
an external script that generates the SNMP trap?

If you're using the built-in capability, that has to be explicitly
enabled when Pacemaker is compiled. Many distributions (including RHEL)
do not enable it. Run "crm_mon --help"; if it shows a "-S" option, you
have it enabled, otherwise not.
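
The check described above could be scripted roughly as follows.  This is
only a sketch: has_snmp_option is a hypothetical helper that reads help
text on standard input, and it assumes the option appears in that text as
a standalone "-S" token.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the help text on stdin lists a
# standalone "-S" option (case-sensitive, so "-s" does not match).
has_snmp_option() {
    grep -Eq '(^|[[:space:],])-S([[:space:],=]|$)'
}
```

Usage would be along the lines of
"crm_mon --help 2>&1 | has_snmp_option && echo 'built-in SNMP enabled'".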

If you're using an external script to generate the SNMP trap, please
post it (with any sensitive info taken out of course).

The ClusterMon resource will generate a crm_mon at regular intervals,
but it should exit quickly. It sounds like it's not exiting at all,
which is why you see this problem.
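
To see whether crm_mon instances are accumulating rather than exiting,
one could summarise the parent/child relationships from a process
listing.  The following is a rough sketch; children_per_parent is a
hypothetical helper that reads "PID PPID" pairs on standard input and
prints each parent PID with its number of children.

```shell
#!/bin/sh
# Hypothetical helper: read "PID PPID" pairs on stdin and print
# "PPID COUNT" for every parent, sorted numerically by parent PID.
children_per_parent() {
    awk '{ n[$2]++ } END { for (p in n) print p, n[p] }' | sort -n
}
```

On a live system (assuming the procps version of ps) it could be fed with
"ps -C crm_mon -o pid=,ppid= | children_per_parent".  A crm_mon PID that
shows up as the parent of other crm_mon processes suggests that instances
are spawning further instances instead of exiting.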

If you have a RHEL subscription, you can open a support ticket with Red
Hat. Note that stonith must be enabled before Red Hat (and many other
vendors) will support a cluster. Also, you should be able to "yum
update" to a much newer version of Pacemaker to get bugfixes, if you're
using RHEL 6 or 7.

FYI, the latest upstream Pacemaker has a new feature that will be in
1.1.14, allowing it to call an external notification script without
needing a ClusterMon resource.



[ClusterLabs] crm_mon memory leak

2015-10-30 Thread Karthikeyan Ramasamy
Dear Pacemaker support,
We are using Pacemaker 1.1.10-14 to implement a service management framework, 
with high availability on the roadmap.  This Pacemaker version was available 
through Red Hat for our environments.

We are running into an issue where Pacemaker causes a node to crash.  The 
last feature we integrated was SNMP notification.  While listing the 
processes, we found that crm_mon processes were occupying 58 GB of the 
available 64 GB when the node crashed.  When we removed that feature, 
Pacemaker was stable again.

Section 7.1 of the Pacemaker documentation explains that the SNMP 
notification agent triggers a crm_mon process at regular intervals.  On 
checking ClusterLabs for known issues, we found this crm_mon memory leak 
issue.  Although it is not related, we think that there is some problem 
with the crm_mon process.

http://clusterlabs.org/pipermail/users/2015-August/001084.html

Can you please let us know if there are issues with SNMP notification in 
Pacemaker, or if there is anything that we could be doing wrong?  Also, any 
workarounds for this issue, if available, would be very helpful for us.  
Please help.

Thanks,
Karthik.

