steve-

this has been addressed in 2.5.0 (the CVS now).  in 2.4.x it was 
possible to crash gmond by closing the connection at just the right time.  

i found on linux that trying to fdopen() a socket to use it like a stream 
was way, way buggy.  to write the XML out i need to convert the binary 
info in the in-memory hash into text on the fly, and a stdio stream would 
be a great way to do that, but it didn't work reliably.

my workaround was to create the xml_print() function (which in 2.5.0 i 
renamed "buffrd_print").  the error return from xml_print() in 2.4.1 
was not handled correctly.  i'm embarrassed at how bad ./gmond/server.c 
for 2.4.1 is when i look at it now.

i've changed the code significantly in 2.5.0.  please let me know if you 
can crash a 2.5.0 gmond and, if so, how you did it.  i think it's much 
more bullet-proof, but don't trust everything i think.  i've taken the 
2.5.0 gmond, hammered it with requests, and closed the client prematurely, 
and it's been pretty solid.  

to be honest, we should find or build a test suite that beats the hell out 
of what we build, to ensure the quality is good.  

i've been so quiet today because i went flying (i'm working on my private 
pilot certification).  i flew a katana for the first time today and 
decided it's the training plane for me (before, i was flying 152s and 
172s).  my favorite part was flying along the coast near half moon bay (it 
was a red tide) and a full-flap descent from 3000 to 1500 feet to get below 
the SFO bravo airspace.  it felt like i was dive-bombing, the nose was 
pointed so far down.

-matt




Today, Steven Wagner wrote forth saying...

> I've noticed something bizarre in testing the output of my "gappy" Linux 
> data source.  For all I know it's something I'm doing.
> 
> As a little stress test, I decided to try running a large number of 
> connections in a row to see whether the monitoring core handled it 
> gracefully - remember, my 2.4.1 Linux data source has been "timing out" 
> during poll() (according to gmetad) and occasionally crashing ever since I 
> started using the C version of gmetad.
> 
> So, on first one host and then more than one host, I would try
> "telnet hostname 8649 || telnet hostname 8649 || .. " for a good, oh, ten 
> lines or so, and then run them serially or in parallel.  The result kind of 
> surprised me.
> 
> It ran like a top on localhost - screenload after screenload of fast, 
> smooth output.  I expected this.  Actually, this was the behavior I *want* 
> from the monitoring core all the time.
> 
> I switched to the Solaris front-end box and ran the same test.  Ruh roh. 
> After a few iterations the XML feed stopped completely.  I checked the 
> debug output of the monitoring core and it was apparently still trying to 
> send the data.  Not only that, but *the listening threads had stopped!* 
> (this was a mute host)  In fact, the remaining XML listening thread seemed 
> to be looping:
> 
> sent data to host 10.x.y.z
> server_thread() 3076 clientfd = 9
> 
> sent data to host 10.x.y.z
> server_thread() 5126 clientfd = 9
> 
> sent data to host 10.x.y.z
> server_thread() 6151 clientfd = 9
> 
> I broke the connection and tried again.  This time, connection refused. 
> What you say!!  In the debug window, I see that the monitoring core has 
> crashed quietly ("Broken pipe").
> 
> OK, cranked it up again and it runs fine.  I tried it again just to be sure 
> I could reproduce it, and the same thing happened.  Fine.  Tried it from an 
> SGI ...
> 
> Same thing.  On IRIX it takes many tries, though.
> 
> Tried it from another Linux box (identical hardware, different kernel) ...
> 
> Same thing.  Sometimes it happens quickly, but on my last test it took ages.
> 
> When the XML output stops, you can crash the monitoring core simply by 
> closing the connection.  I suspect gmond is crashing because gmetad is 
> timing out on the XML stream and closing the connection out of disgust. 
> Thanks, gmetad!  :P  The stuttering stream could also account for why the 
> poll() or XML parsing fail altogether.
> 
> Uh ... any ideas?
> 
> 
> 
> _______________________________________________
> Ganglia-developers mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/ganglia-developers
> 