This sounds a lot like a problem I have been having once a week or so:
https://github.com/ganglia/monitor-core/issues/97

I have a reference to 246193:Apr 14 23:59:02 lsu02
/usr/sbin/gmetad[25897]: Process XML (LAX Tiggr): XML_ParseBuffer()
error at line 75498: no element found

in syslog, but I can't be sure that the timestamp lines up with when the
interactive port stopped working.  I tried increasing the number of
server_threads, but (anecdotally) that does not appear to have helped.
xmllint currently says everything is a-okay, but I don't know what the
output looks like when the interactive port is down.
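
To pin down whether the syslog timestamp lines up with the moment the
interactive port stops answering, one option is to probe both ports on a
schedule and log the result. The following is a minimal Python sketch, not
anything shipped with Ganglia: the names probe and check_once, the 10 second
timeout, and the "/LAX Tiggr" query (modelled on the "/LISA Cluster" query
quoted below) are all assumptions for illustration.

```python
import socket
import time


def probe(host, port, request=None, timeout=10.0):
    """Return the number of bytes received before the peer closed the
    connection, or None if connecting or reading failed or timed out."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            if request is not None:
                # The interactive port expects a query line; the plain XML
                # port sends its dump without being asked.
                sock.sendall(request.encode() + b"\n")
            total = 0
            while True:
                chunk = sock.recv(65536)
                if not chunk:
                    return total
                total += len(chunk)
    except OSError:
        # Covers refused connections and socket timeouts alike.
        return None


def check_once(host="localhost", query="/LAX Tiggr", timeout=10.0):
    """Probe both gmetad ports and return one syslog-style status line.

    The query string is a hypothetical example; substitute a cluster
    name that exists in your own setup.
    """
    stamp = time.strftime("%b %d %H:%M:%S")
    results = []
    for name, port, request in (("xml", 8651, None),
                                ("interactive", 8652, query)):
        received = probe(host, port, request, timeout)
        # Zero bytes counts as a failure too: the hang described in this
        # thread ends with the connection closing without any data.
        state = "ok (%d bytes)" % received if received else "NOT RESPONDING"
        results.append("%s port %d: %s" % (name, port, state))
    return "%s gmetad-probe: %s" % (stamp, "; ".join(results))
```

Running check_once() from cron once a minute and appending the returned line
to a file would give timestamps to compare against the XML_ParseBuffer()
entries in syslog.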

On 04/05/2013 01:22 PM, Vladimir Vuksan wrote:
> Run the XML output through xmllint e.g. something like
> 
> 
> nc localhost 8651 | xmllint -
> 
> may give you hints.
> 
> On Fri, 5 Apr 2013, Ramon Bastiaans wrote:
> 
>> Ah. I also suspect some weird gmetric of causing this, but so far I have not 
>> been able to find it in the XML, unfortunately.
>>
>> Well, regardless of the cause, I think it should not cause the interactive 
>> port to stop responding and the web interface to hang.
>>
>> Having had a quick look at the gmetad source, I was not able to find where 
>> this might originate. Perhaps the web interface could fall back to port 8651 
>> if port 8652 times out.
>>
>> - Ramon
>>
>> P.S. pbs-python is still alive and well. If you mean "Job Monarch", I have 
>> been working hard recently on a new release and it is nearly (99%) 
>> finished. ;) pbswebmon is a completely different project, which SARA is not 
>> associated with and has no role in.
>>
>>
>> As of January 2013, SARA has a new name: SURFsara.
>>
>> ing. Ramon Bastiaans - Senior Systems Programmer - Cluster Computing
>> | Operations, Support & Development | SURFsara | Science Park 140 | 1098 XG 
>> Amsterdam | T +31 (0)20 592 30 00 | ramon.bastia...@surfsara.nl | 
>> www.surfsara.nl |
>>
>>
>>
>>
>> On 4 apr. 2013, at 18:52, Chris Hunter <chris.hun...@yale.edu> wrote:
>>
>>> Hi,
>>>
>>> We have seen this before (ganglia-gmond 3.2) when there are whitespace
>>> or non-alphanumeric characters in custom gmetrics.
>>>
>>> PS I hope pbs-python/pbswebmon are still active...
>>>
>>>
>>>> Hi,
>>>>
>>>> We have been experiencing a weird issue with gmetad.
>>>>
>>>> I am running gmetad v3.4.0
>>>>
>>>> Once in a while an XML error seems to occur, like this:
>>>>
>>>> /usr/sbin/gmetad[12241]: Process XML (LISA Cluster): XML_ParseBuffer() 
>>>> error at line 525626:
>>>>
>>>> Regardless of what is causing that and why, it causes the Ganglia web 
>>>> front end to hang and become unresponsive.
>>>>
>>>> After checking gmetad, it seems port 8652 is no longer responding to 
>>>> queries. This does nothing:
>>>>
>>>> # telnet localhost 8652
>>>> Trying 127.0.0.1...
>>>> Connected to localhost.
>>>> Escape character is '^]'.
>>>> /LISA Cluster
>>>>
>>>> <after about 1 minute>
>>>> Connection closed by foreign host.
>>>>
>>>>
>>>> However port 8651 still works:
>>>>
>>>> # telnet localhost 8651 | wc -l
>>>> Connection closed by foreign host.
>>>> 921410
>>>>
>>>> And when I switch the web frontend from port 8652 back to port 8651 
>>>> ($conf['ganglia_port'] = 8651;), the web page responds and works again.
>>>>
>>>> After restarting gmetad, port 8652 becomes responsive again. It almost 
>>>> seems as if a gmetad thread has lost its way or something.
>>>>
>>>> Any idea what may be causing this (besides the XML error)? It seems weird 
>>>> to me that one port works while the other no longer does. It might be a bug.
>>>>
>>>> I have a dump of the XML (from port 8651, before restarting) available for 
>>>> anyone who wants it, but it is 42 MB.
>>>>
>>>>
>>>> Kind regards,
>>>> - Ramon.
>>>>
>>> _______________________________________________
>>> Ganglia-developers mailing list
>>> Ganglia-developers@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/ganglia-developers
>>
>>
> 
> 
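
Following up on the two hints quoted above (xmllint, and odd characters in
custom gmetric names), here is a rough Python sketch for checking a saved
dump from port 8651. It assumes metrics appear as <METRIC NAME="..."> 
elements, and the set of "safe" characters is a guess of mine, not a rule
taken from gmetad. Python's xml.parsers.expat wraps the expat library, whose
XML_ParseBuffer() is the call named in the syslog message, so the error
strings should be comparable. The function names are made up for this
example.

```python
import re
import xml.parsers.expat

# Assumed layout: <METRIC NAME="..." VAL="..." ...>. The scan works on raw
# bytes so that it still runs when the dump is not well-formed.
NAME_ATTR = re.compile(rb'<METRIC\b[^>]*?\bNAME="([^"]*)"')
# Conservative whitelist chosen for illustration only.
SAFE_NAME = re.compile(rb'[A-Za-z0-9_.\-]+')


def suspicious_metric_names(xml_bytes):
    """Return (line_number, name) pairs for METRIC names that contain
    characters outside the whitelist (whitespace, punctuation, ...)."""
    hits = []
    for lineno, line in enumerate(xml_bytes.splitlines(), start=1):
        for match in NAME_ATTR.finditer(line):
            name = match.group(1)
            if not SAFE_NAME.fullmatch(name):
                hits.append((lineno, name.decode("utf-8", "replace")))
    return hits


def first_parse_error(xml_bytes):
    """Return (line, column, message) for the first expat error, or None
    if the document parses cleanly."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_bytes, True)
    except xml.parsers.expat.ExpatError as err:
        message = xml.parsers.expat.ErrorString(err.code)
        return err.lineno, err.offset, message
    return None
```

Feeding it the bytes of a dump captured while port 8652 is hanging (for
example, open("dump.xml", "rb").read(), where dump.xml is a placeholder file
name) would show whether the document is cut short or contains an unusual
metric name.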

