This sounds a lot like a problem I have been having once a week or so: https://github.com/ganglia/monitor-core/issues/97
I have this entry in syslog:

246193:Apr 14 23:59:02 lsu02 /usr/sbin/gmetad[25897]: Process XML (LAX Tiggr): XML_ParseBuffer() error at line 75498: no element found

But I can't be sure that that timestamp lines up with when the interactive port stopped working. I tried increasing the number of server_threads, but (anecdotally) that does not appear to have helped.

xmllint currently says everything is a-okay, but I don't know what its output looks like when the interactive port is down.

On 04/05/2013 01:22 PM, Vladimir Vuksan wrote:
> Run the XML output through xmllint e.g. something like
>
>
> nc localhost 8651 | xmllint -
>
> may give you hints.
>
> On Fri, 5 Apr 2013, Ramon Bastiaans wrote:
>
>> Ah. I also suspect some weird gmetric to cause this, but so far have not
>> been able to find it in the XML unfortunately.
>>
>> Well regardless of the cause, I think it should not cause the interactive
>> port to stop responding and for the web interface to hang.
>>
>> Having a quick look at the source of gmetad I was not able to find where
>> this might originate. Perhaps the web interface could fail back to port 8651
>> if port 8652 times out.
>>
>> - Ramon
>>
>> P.S. pbs-python still alive and well. If you mean "Job Monarch" I have been
>> working hard recently on a new release and it is near (99%) finished. ;)
>> pbswebmon is a completely different project which SARA is not associated
>> with or has any role in.
>>
>>
>> As of January 2013, SARA has a new name: SURFsara.
>>
>> ing. Ramon Bastiaans - Senior Systems Programmer - Cluster Computing
>> | Operations, Support & Development | SURFsara | Science Park 140 | 1098 XG
>> Amsterdam | T +31 (0)20 592 30 00 | ramon.bastia...@surfsara.nl |
>> www.surfsara.nl |
>>
>>
>> On 4 apr. 2013, at 18:52, Chris Hunter <chris.hun...@yale.edu> wrote:
>>
>>> Hi,
>>>
>>> We have seen this before (ganglia-gmond 3.2) when there are whitespace
>>> or non-alphanumeric characters in custom gmetrics.
>>>
>>> PS I hope pbs-python/pbswebmon are still active...
>>>
>>>
>>>> Hi,
>>>>
>>>> We have been experiencing a weird issue with gmetad.
>>>>
>>>> I am running gmetad v3.4.0
>>>>
>>>> Once in a while now a XML error seems to occur. Like this:
>>>>
>>>> /usr/sbin/gmetad[12241]: Process XML (LISA Cluster): XML_ParseBuffer()
>>>> error at line 525626:
>>>>
>>>> Besides what is causing that and why, this causing the Ganglia web front
>>>> end to hang and become non responsive.
>>>>
>>>> After checking the gmetad it seems port 8652 is no longer responding to
>>>> queries. This does nothing:
>>>>
>>>> # telnet localhost 8652
>>>> Trying 127.0.0.1...
>>>> Connected to localhost.
>>>> Escape character is '^]'.
>>>> /LISA Cluster
>>>>
>>>> <after about 1 minute>
>>>> Connection closed by foreign host.
>>>>
>>>>
>>>> However port 8651 still works:
>>>>
>>>> # telnet localhost 8651 | wc -l
>>>> Connection closed by foreign host.
>>>> 921410
>>>>
>>>> And when I switch the web frontend from port 8652 back to port 8651
>>>> ($conf['ganglia_port'] = 8651;), the web page responds and works again.
>>>>
>>>> After restarting gmetad port 8652 also becomes responsive again. It almost
>>>> seems gmetad has a thread lost it's way or something.
>>>>
>>>> Any idea what may be causing this (besides the XML error)? It seems weird
>>>> to me if 1 port works and the other does not anymore. It might be a bug.
>>>>
>>>> I have a dump of the XML (from port 8651 before restarting) available for
>>>> who might want it, but it is 42 MB.
>>>>
>>>>
>>>> Kind regards,
>>>> - Ramon.
>>>
>>> _______________________________________________
>>> Ganglia-developers mailing list
>>> Ganglia-developers@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/ganglia-developers
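
One way to find out whether the syslog error lines up with the interactive port going quiet would be to probe both ports on a schedule and log a timestamp with each result. The following is only a rough sketch, not anything that ships with Ganglia: the names `xml_error`, `probe`, and `log_once` are made up for illustration, the `/LISA Cluster` query is simply copied from the telnet session quoted above, and Python's `xml.parsers.expat` module stands in for xmllint as the well-formedness check.

```python
import socket
import time
import xml.parsers.expat
from typing import Optional


def xml_error(data: bytes) -> Optional[str]:
    """Return expat's error message for data, or None if it is well-formed."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as exc:
        return str(exc)
    return None


def probe(host: str, port: int, request: bytes = b"", timeout: float = 10.0) -> str:
    """Connect, optionally send a query, read until EOF, and classify the result.

    The timeout applies to each socket operation, not to the whole exchange.
    """
    chunks = []
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            if request:
                sock.sendall(request)
            while True:
                chunk = sock.recv(65536)
                if not chunk:  # peer closed the connection
                    break
                chunks.append(chunk)
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
    err = xml_error(b"".join(chunks))
    return "ok" if err is None else f"bad xml: {err}"


def log_once(host: str = "localhost") -> None:
    """Probe both ports once and print one timestamped line per port."""
    # Assumption: 8651 dumps XML on connect, 8652 waits for a query line,
    # as in the telnet sessions quoted in this thread.
    checks = [(8651, b""), (8652, b"/LISA Cluster\n")]
    stamp = time.strftime("%b %d %H:%M:%S")
    for port, request in checks:
        print(stamp, port, probe(host, port, request), flush=True)
```

Calling `log_once()` from cron every minute or so and appending the output to a file would give timestamps in roughly the same format as syslog, so a "timeout" on 8652 can be compared directly against the XML_ParseBuffer() entries. An empty or truncated response is reported as "bad xml" because the parser is told it has seen the final chunk.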