steve-
this has been addressed in 2.5.0 (the CVS now). in 2.4.x it was
possible to crash gmond by closing the connection at the right time.
i found on linux that trying to fdopen a socket to use it like a stream
was way, way buggy. to write the XML out i needed to convert the binary
info in the in-memory hash into text on the fly, and a stream would be a
great way to do that, but it didn't work.
my workaround was to create the xml_print() function (which in 2.5.0 i
renamed "buffrd_print"). the error return for xml_print() in 2.4.1
was not handled correctly. i'm embarrassed by how bad ./gmond/server.c
is for 2.4.1 when i look at it.
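the idea behind the buffered approach is to build the whole document in memory, checking every format call, then hand one buffer to write(). an illustrative sketch (xml_buf_t and buffered_print are hypothetical names, not the actual 2.5.0 code):

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    char  *buf;   /* NUL-terminated XML accumulated so far */
    size_t len;   /* bytes used, not counting the NUL */
    size_t cap;   /* bytes allocated */
} xml_buf_t;

/* append formatted text, growing the buffer as needed.
   returns 0 on success, -1 on format/allocation error. */
int buffered_print(xml_buf_t *b, const char *fmt, ...)
{
    for (;;) {
        va_list ap;
        char   *p;
        size_t  newcap;
        int     n;

        va_start(ap, fmt);
        n = vsnprintf(b->buf ? b->buf + b->len : NULL,
                      b->cap - b->len, fmt, ap);
        va_end(ap);
        if (n < 0)
            return -1;                  /* bad format: caller must bail out */
        if ((size_t)n < b->cap - b->len) {
            b->len += (size_t)n;        /* it fit, NUL included */
            return 0;
        }
        newcap = b->cap ? b->cap * 2 : 256;
        while (newcap <= b->len + (size_t)n)
            newcap *= 2;
        p = realloc(b->buf, newcap);
        if (p == NULL)
            return -1;                  /* out of memory */
        b->buf = p;
        b->cap = newcap;
    }
}
```

the key point versus 2.4.1 is that every return value is checked, so a failed print can't silently corrupt the stream.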
i've changed the code significantly in 2.5.0. please let me know if you
can crash a 2.5.0 gmond and how you are able to. i think it's much more
bullet-proof, but don't trust everything i think. i've taken 2.5.0 gmond,
hammered it with requests, and closed clients prematurely, and it's been
pretty solid.
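the premature-close crash is almost certainly SIGPIPE: a write() to a socket the client already closed raises it, and the default action quietly kills the process, which matches the "Broken pipe" exit you saw. a hedged sketch of the failure and the fix, using a socketpair and a hypothetical premature_close_demo():

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>
#include <sys/socket.h>

/* returns 1 if a write after the peer closes shows up as a soft EPIPE
   error instead of a process-killing SIGPIPE */
int premature_close_demo(void)
{
    int sv[2];
    const char xml[] = "<GANGLIA_XML/>\n";

    signal(SIGPIPE, SIG_IGN);     /* without this, write() below kills us */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return 0;
    close(sv[1]);                 /* the "client" hangs up mid-conversation */

    /* the first write may or may not fail depending on buffering; a
       repeat write to a closed peer is guaranteed to return EPIPE */
    (void)write(sv[0], xml, sizeof(xml) - 1);
    if (write(sv[0], xml, sizeof(xml) - 1) < 0 && errno == EPIPE) {
        close(sv[0]);
        return 1;
    }
    close(sv[0]);
    return 0;
}
```

in a threaded server you'd set the disposition once at startup with sigaction() rather than per-call; on linux, send() with MSG_NOSIGNAL gets the same effect per call.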
to be honest, we should find/build a test suite that beats the hell out of
what we build to ensure the quality is good.
i've been so quiet today because i went flying (i'm working on my private
pilot certification). i flew a katana for the first time today and
decided it's the training plane for me (before, i was flying 152s and
172s). my favorite part was flying along the coast near half moon bay (it
was a red tide) and a full-flap descent from 3000 to 1500 feet to get
below the SFO bravo airspace. it felt like i was bomb-diving, the nose
was pointed so far down.
-matt
Today, Steven Wagner wrote forth saying...
> I've noticed something bizarre in testing the output of my "gappy" Linux
> data source. For all I know it's something I'm doing.
>
> As a little stress test, I decided to try running a large number of
> connections in a row to see whether the monitoring core handled it
> gracefully - remember, my 2.4.1 Linux data source has been "timing out"
> during poll() (according to gmetad) and occasionally crashing ever since I
> started using the C version of gmetad.
>
> So, on first one host and then more than one host, I would try
> "telnet hostname 8649 || telnet hostname 8649 || .. " for a good, oh, ten
> lines or so, and then run them serially or in parallel. The result kind of
> surprised me.
>
> It ran like a top on localhost - screenload after screenload of fast,
> smooth output. I expected this. Actually, this was the behavior I *want*
> from the monitoring core all the time.
>
> I switched to the Solaris front-end box and ran the same test. Ruh roh.
> After a few iterations the XML feed stopped completely. I checked the
> debug output of the monitoring core and it was apparently still trying to
> send the data. Not only that, but *the listening threads had stopped!*
> (this was a mute host) In fact, the remaining XML listening thread seemed
> to be looping:
>
> sent data to host 10.x.y.z
> server_thread() 3076 clientfd = 9
>
> sent data to host 10.x.y.z
> server_thread() 5126 clientfd = 9
>
> sent data to host 10.x.y.z
> server_thread() 6151 clientfd = 9
>
> I broke the connection and tried again. This time, connection refused.
> What you say!! In the debug window, I see that the monitoring core has
> crashed quietly ("Broken pipe").
>
> OK, cranked it up again and it runs fine. I tried it again just to be sure
> I could reproduce it, and the same thing happened. Fine. Tried it from an
> SGI ...
>
> Same thing. On IRIX it takes many tries, though.
>
> Tried it from another Linux box (identical hardware, different kernel) ...
>
> Same thing. Sometimes it happens quickly, but on my last test it took ages.
>
> When the XML output stops, you can crash the monitoring core simply by
> closing the connection. I suspect gmond is crashing because gmetad is
> timing out on the XML stream and closing the connection out of disgust.
> Thanks, gmetad! :P The stuttering stream could also account for why the
> poll() or XML parsing fail altogether.
>
> Uh ... any ideas?
>
>
>
> _______________________________________________
> Ganglia-developers mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/ganglia-developers
>