Thanks Andrew, much appreciated. I’ll try upgrading to 1.1.11 and report back on how it goes.
On 07/05/2014 01:20, "Andrew Beekhof" <and...@beekhof.net> wrote:

> On 6 May 2014, at 7:47 pm, Greg Murphy <greg.mur...@gamesparks.com> wrote:
>
>> Here you go - I've only run lrmd for 30 minutes since installing the
>> debug package, but hopefully that's enough - if not, let me know and
>> I'll do a longer capture.
>
> I'll keep looking, but almost everything so far seems to be from or
> related to the g_dbus API:
>
> ...
> ==37625==    by 0x6F20E30: g_dbus_proxy_new_for_bus_sync (in /usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.3800.1)
> ==37625==    by 0x507B90B: get_proxy (upstart.c:66)
> ==37625==    by 0x507B9BF: upstart_init (upstart.c:85)
> ==37625==    by 0x507C88E: upstart_job_exec (upstart.c:429)
> ==37625==    by 0x10CE03: lrmd_rsc_dispatch (lrmd.c:879)
> ==37625==    by 0x4E5F112: crm_trigger_dispatch (mainloop.c:105)
> ==37625==    by 0x58A13B5: g_main_context_dispatch (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.3800.1)
> ==37625==    by 0x58A1707: ??? (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.3800.1)
> ==37625==    by 0x58A1B09: g_main_loop_run (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.3800.1)
> ==37625==    by 0x10AC3A: main (main.c:314)
>
> This is going to be called every time an upstart job is run (i.e. on
> every recurring monitor of an upstart resource).
>
> There were several problems with that API and we removed all use of it
> in 1.1.11. I'm quite confident that most, if not all, of the memory
> issues would go away if you upgraded.
>
>> On 06/05/2014 10:08, "Andrew Beekhof" <and...@beekhof.net> wrote:
>>
>>> Oh, any chance you could install the debug packages? It will make the
>>> output even more useful :-)
>>>
>>> On 6 May 2014, at 7:06 pm, Andrew Beekhof <and...@beekhof.net> wrote:
>>>
>>>> On 6 May 2014, at 6:05 pm, Greg Murphy <greg.mur...@gamesparks.com> wrote:
>>>>
>>>>> Attached are the valgrind outputs from two separate runs of lrmd
>>>>> with the suggested variables set. Do they help narrow the issue
>>>>> down?
>>>>
>>>> They do somewhat. I'll investigate. But much of the memory is still
>>>> reachable:
>>>>
>>>> ==26203==  indirectly lost: 17,945,950 bytes in 642,546 blocks
>>>> ==26203==    possibly lost: 2,805 bytes in 60 blocks
>>>> ==26203==  still reachable: 26,104,781 bytes in 544,782 blocks
>>>> ==26203==       suppressed: 8,652 bytes in 176 blocks
>>>> ==26203== Reachable blocks (those to which a pointer was found) are not shown.
>>>> ==26203== To see them, rerun with: --leak-check=full --show-reachable=yes
>>>>
>>>> Could you add --show-reachable=yes to the VALGRIND_OPTS variable?
>>>>
>>>>> Thanks
>>>>>
>>>>> Greg
>>>>>
>>>>> On 02/05/2014 03:01, "Andrew Beekhof" <and...@beekhof.net> wrote:
>>>>>
>>>>>> On 30 Apr 2014, at 9:01 pm, Greg Murphy <greg.mur...@gamesparks.com> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> I'm running a two-node Pacemaker cluster on Ubuntu Saucy (13.10),
>>>>>>> kernel 3.11.0-17-generic and the Ubuntu Pacemaker package, version
>>>>>>> 1.1.10+git20130802-1ubuntu1.
>>>>>>
>>>>>> The problem is that I have no way of knowing what code is/isn't
>>>>>> included in '1.1.10+git20130802-1ubuntu1'.
>>>>>> You could try setting the following in your environment before
>>>>>> starting pacemaker though:
>>>>>>
>>>>>> # Variables for running child daemons under valgrind and/or
>>>>>> # checking for memory problems
>>>>>> G_SLICE=always-malloc
>>>>>> MALLOC_PERTURB_=221 # or 0
>>>>>> MALLOC_CHECK_=3 # or 0,1,2
>>>>>> PCMK_valgrind_enabled=lrmd
>>>>>> VALGRIND_OPTS="--leak-check=full --trace-children=no --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions --gen-suppressions=all"
>>>>>>
>>>>>>> The cluster is configured with a DRBD master/slave set and then a
>>>>>>> failover resource group containing MySQL (along with its DRBD
>>>>>>> filesystem) and a Zabbix Proxy and Agent.
>>>>>>>
>>>>>>> Since I built the cluster around two months ago I've noticed that
>>>>>>> on the active node the memory footprint of lrmd gradually grows to
>>>>>>> quite a significant size. The cluster was last restarted three
>>>>>>> weeks ago, and now lrmd has over 1GB of mapped memory on the
>>>>>>> active node and only 151MB on the passive node. Current excerpts
>>>>>>> from /proc/PID/status are:
>>>>>>>
>>>>>>> Active node
>>>>>>> VmPeak:  1146740 kB
>>>>>>> VmSize:  1146740 kB
>>>>>>> VmLck:         0 kB
>>>>>>> VmPin:         0 kB
>>>>>>> VmHWM:    267680 kB
>>>>>>> VmRSS:    188764 kB
>>>>>>> VmData:  1065860 kB
>>>>>>> VmStk:       136 kB
>>>>>>> VmExe:        32 kB
>>>>>>> VmLib:     10416 kB
>>>>>>> VmPTE:      2164 kB
>>>>>>> VmSwap:   822752 kB
>>>>>>>
>>>>>>> Passive node
>>>>>>> VmPeak:   220832 kB
>>>>>>> VmSize:   155428 kB
>>>>>>> VmLck:         0 kB
>>>>>>> VmPin:         0 kB
>>>>>>> VmHWM:      4568 kB
>>>>>>> VmRSS:      3880 kB
>>>>>>> VmData:    74548 kB
>>>>>>> VmStk:       136 kB
>>>>>>> VmExe:        32 kB
>>>>>>> VmLib:     10416 kB
>>>>>>> VmPTE:       172 kB
>>>>>>> VmSwap:        0 kB
>>>>>>>
>>>>>>> During the last week or so I've taken a couple of snapshots of
>>>>>>> /proc/PID/smaps on the active node, and the heap particularly
>>>>>>> stands out as growing (I have the full outputs captured if
>>>>>>> they'll help):
>>>>>>>
>>>>>>> 20140422
>>>>>>> 7f92e1578000-7f92f218b000 rw-p 00000000 00:00 0   [heap]
>>>>>>> Size:             274508 kB
>>>>>>> Rss:              180152 kB
>>>>>>> Pss:              180152 kB
>>>>>>> Shared_Clean:          0 kB
>>>>>>> Shared_Dirty:          0 kB
>>>>>>> Private_Clean:         0 kB
>>>>>>> Private_Dirty:    180152 kB
>>>>>>> Referenced:       120472 kB
>>>>>>> Anonymous:        180152 kB
>>>>>>> AnonHugePages:         0 kB
>>>>>>> Swap:              91568 kB
>>>>>>> KernelPageSize:        4 kB
>>>>>>> MMUPageSize:           4 kB
>>>>>>> Locked:                0 kB
>>>>>>> VmFlags: rd wr mr mw me ac
>>>>>>>
>>>>>>> 20140423
>>>>>>> 7f92e1578000-7f92f305e000 rw-p 00000000 00:00 0   [heap]
>>>>>>> Size:             289688 kB
>>>>>>> Rss:              184136 kB
>>>>>>> Pss:              184136 kB
>>>>>>> Shared_Clean:          0 kB
>>>>>>> Shared_Dirty:          0 kB
>>>>>>> Private_Clean:         0 kB
>>>>>>> Private_Dirty:    184136 kB
>>>>>>> Referenced:        69748 kB
>>>>>>> Anonymous:        184136 kB
>>>>>>> AnonHugePages:         0 kB
>>>>>>> Swap:             103112 kB
>>>>>>> KernelPageSize:        4 kB
>>>>>>> MMUPageSize:           4 kB
>>>>>>> Locked:                0 kB
>>>>>>> VmFlags: rd wr mr mw me ac
>>>>>>>
>>>>>>> 20140430
>>>>>>> 7f92e1578000-7f92fc01d000 rw-p 00000000 00:00 0   [heap]
>>>>>>> Size:             436884 kB
>>>>>>> Rss:              140812 kB
>>>>>>> Pss:              140812 kB
>>>>>>> Shared_Clean:          0 kB
>>>>>>> Shared_Dirty:          0 kB
>>>>>>> Private_Clean:       744 kB
>>>>>>> Private_Dirty:    140068 kB
>>>>>>> Referenced:        43600 kB
>>>>>>> Anonymous:        140812 kB
>>>>>>> AnonHugePages:         0 kB
>>>>>>> Swap:             287392 kB
>>>>>>> KernelPageSize:        4 kB
>>>>>>> MMUPageSize:           4 kB
>>>>>>> Locked:                0 kB
>>>>>>> VmFlags: rd wr mr mw me ac
>>>>>>>
>>>>>>> I noticed in the release notes for 1.1.10-rc1
>>>>>>> (https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.10-rc1)
>>>>>>> that there was work done to fix "crmd: lrmd: stonithd: fixed
>>>>>>> memory leaks" but I'm not sure which particular bug this was
>>>>>>> related to. (And those fixes should be in the version I'm running
>>>>>>> anyway.)
>>>>>>>
>>>>>>> I've also spotted a few memory leak fixes in
>>>>>>> https://github.com/beekhof/pacemaker, but I'm not sure whether
>>>>>>> they relate to my issue (assuming I have a memory leak and this
>>>>>>> isn't expected behaviour).
>>>>>>>
>>>>>>> Is there additional debugging that I can perform to check whether
>>>>>>> I have a leak, or is there enough evidence to justify upgrading to
>>>>>>> 1.1.11?
>>>>>>>
>>>>>>> Thanks in advance
>>>>>>>
>>>>>>> Greg Murphy
>>>>>
>>>>> <lrmd.tgz>
>>
>> <lrmd-dbg.tgz>

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
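
For anyone hitting the same symptom before upgrading: the backtrace above ends in g_dbus_proxy_new_for_bus_sync() being reached from upstart_job_exec() inside lrmd, i.e. a D-Bus proxy is set up as part of running an upstart job. The sketch below is illustrative only (it is not the actual upstart.c code, and the function name, object path, bus name and interface strings are assumptions), but it shows the general shape of the problem: if a proxy, and the g_dbus state behind it, is created per operation, every recurring monitor of an upstart resource adds to the heap unless that proxy is released again.

    /* Illustrative sketch only; not the actual lrmd/upstart.c code.
     * Build: gcc gdbus-sketch.c $(pkg-config --cflags --libs gio-2.0) */
    #include <gio/gio.h>

    /* Imagine this being called once per operation, e.g. on every
     * recurring monitor of an upstart resource. */
    static void check_upstart_job(const char *job_path)
    {
        GError *error = NULL;

        /* A brand-new proxy (plus the g_dbus machinery behind it) is
         * allocated on every call. */
        GDBusProxy *proxy = g_dbus_proxy_new_for_bus_sync(
            G_BUS_TYPE_SYSTEM,
            G_DBUS_PROXY_FLAGS_NONE,
            NULL,                            /* interface info */
            "com.ubuntu.Upstart",            /* bus name (assumed) */
            job_path,                        /* object path */
            "com.ubuntu.Upstart0_6.Job",     /* interface (assumed) */
            NULL,                            /* cancellable */
            &error);

        if (proxy == NULL) {
            g_clear_error(&error);
            return;
        }

        /* ... call methods / read properties on the job here ... */

        /* Without this unref (or without creating one proxy up front and
         * reusing it for the life of the daemon), every monitor leaves
         * the proxy and its associated allocations behind, so the heap
         * grows a little on each invocation. */
        g_object_unref(proxy);
    }

    int main(void)
    {
        int i;

        /* Pretend the monitor fires a few times; the job path is a
         * made-up example. */
        for (i = 0; i < 3; i++) {
            check_upstart_job("/com/ubuntu/Upstart/jobs/mysql");
        }
        return 0;
    }

In the thread above, valgrind attributes most of the growth to g_dbus internals and reports much of it as "still reachable" rather than "definitely lost"; Pacemaker 1.1.11 removed its use of the g_dbus API altogether, which is why upgrading is expected to stop the growth.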
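On the valgrind side, the reason for asking for --show-reachable=yes is that by default --leak-check=full only details blocks valgrind can no longer find a pointer to ("definitely" or "indirectly" lost). Memory that a long-running daemon keeps piling into caches or global lists stays reachable right up to exit, so it only appears in the summary totals, which is exactly where most of lrmd's reachable 26 MB was sitting. A tiny standalone demo of the two categories (not Pacemaker code; the file name and build line are just examples):

    /* Tiny demo of valgrind's leak categories; not Pacemaker code.
     * Build: gcc reachable-demo.c $(pkg-config --cflags --libs glib-2.0)
     * Run:   G_SLICE=always-malloc valgrind --leak-check=full \
     *            --show-reachable=yes ./a.out */
    #include <stdlib.h>
    #include <glib.h>

    static GList *cache = NULL;  /* global pointer keeps every node reachable */

    int main(void)
    {
        int i;

        /* "definitely lost": the only pointer to the block is overwritten,
         * so valgrind reports it with a full backtrace by default. */
        char *lost = malloc(64);
        lost = NULL;

        /* "still reachable": this list grows on every iteration and is
         * never freed, but pointers to all of it still exist at exit, so
         * with plain --leak-check=full it only shows up in the summary.
         * This is the shape of a cache that grows with every recurring
         * monitor. */
        for (i = 0; i < 1000; i++) {
            cache = g_list_prepend(cache, g_strdup("monitor result"));
        }

        return 0;
    }

Running it with and without --show-reachable=yes shows the difference; G_SLICE=always-malloc matters for the same reason it appears in the environment block above, so that GLib's slice allocator does not hide individual allocations from valgrind.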