I've noticed something bizarre in testing the output of my "gappy" Linux data source. For all I know it's something I'm doing.

As a little stress test, I decided to try running a large number of connections in a row to see whether the monitoring core handled it gracefully - remember, my 2.4.1 Linux data source has been "timing out" during poll() (according to gmetad) and occasionally crashing ever since I started using the C version of gmetad.

So, on first one host and then several, I would string together
"telnet hostname 8649 || telnet hostname 8649 || .. " for a good, oh, ten lines or so, and then run them serially or in parallel. The result kind of surprised me.

It ran like a top on localhost - screenload after screenload of fast, smooth output. I expected this. Actually, this was the behavior I *want* from the monitoring core all the time.
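For anyone who wants to repeat the test without a wall of telnet commands, here is a rough sketch of the same idea in Python. It only assumes that the monitoring core dumps its XML on port 8649 and then closes the socket; the names `fetch_xml` and `hammer` are made up for illustration, and the 5-second timeout is an arbitrary choice, not anything gmetad uses.

```python
import socket


def fetch_xml(host, port=8649, timeout=5.0):
    """Connect once, read until the server closes the socket, return the bytes."""
    chunks = []
    # The timeout applies to the connect and to every later recv().
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # empty read: the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


def hammer(host, port=8649, attempts=10, timeout=5.0):
    """Run back-to-back connections and record how each one ended."""
    results = []
    for _ in range(attempts):
        try:
            results.append(("ok", len(fetch_xml(host, port, timeout))))
        except socket.timeout:
            # The feed stopped mid-stream (or the connect itself hung).
            results.append(("stalled", 0))
        except OSError as exc:
            # For example "connection refused" once the daemon has died.
            results.append(("error", exc.errno))
    return results
```

Calling `hammer("hostname")` returns one tuple per attempt, so a run of "ok" entries followed by "stalled" and then "error" would match the sequence described in this message.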

I switched to the Solaris front-end box and ran the same test. Ruh roh. After a few iterations the XML feed stopped completely. I checked the debug output of the monitoring core and it was apparently still trying to send the data. Not only that, but *the listening threads had stopped!* (This was a mute host.) In fact, the remaining XML listening thread seemed to be looping:

sent data to host 10.x.y.z
server_thread() 3076 clientfd = 9

sent data to host 10.x.y.z
server_thread() 5126 clientfd = 9

sent data to host 10.x.y.z
server_thread() 6151 clientfd = 9

I broke the connection and tried again. This time, connection refused. What you say!! In the debug window, I saw that the monitoring core had crashed quietly ("Broken pipe").

OK, cranked it up again and it runs fine. I tried it again just to be sure I could reproduce it, and the same thing happened. Fine. Tried it from an SGI ...

Same thing.  On IRIX it takes many tries, though.

Tried it from another Linux box (identical hardware, different kernel) ...

Same thing.  Sometimes it happens quickly, but on my last test it took ages.

When the XML output stops, you can crash the monitoring core simply by closing the connection. I suspect gmond is crashing because gmetad is timing out on the XML stream and closing the connection out of disgust. Thanks, gmetad! :P The stuttering stream could also account for why poll() or the XML parsing fails altogether.
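The quiet "Broken pipe" exit would be consistent with a process writing to a socket after the other end has hung up while SIGPIPE still has its default action, which is to kill the process. That is only a guess about gmond, not something checked against its source. The sketch below shows the general POSIX behaviour with a throwaway child process: `write_after_hangup` and the exit code 42 are invented for the demo, and a local socket pair stands in for the TCP connection.

```python
import signal
import subprocess
import sys

# Child script: one end of a socket pair plays the client and hangs up,
# the other end plays the server and keeps writing.
CHILD = r"""
import signal, socket, sys
if sys.argv[1] == "default":
    # Python ignores SIGPIPE at startup; restore what a plain C program gets.
    signal.signal(signal.SIGPIPE, signal.SIG_DFL)
a, b = socket.socketpair()
b.close()                          # the "client" closes the connection
try:
    a.sendall(b"<GANGLIA_XML>")    # the "server" writes anyway
except BrokenPipeError:
    sys.exit(42)                   # write failed with EPIPE, process survived
"""


def write_after_hangup(mode):
    """Run the child; a negative return code means it was killed by that signal."""
    return subprocess.run([sys.executable, "-c", CHILD, mode]).returncode
```

With `mode` set to "default" the child is killed by SIGPIPE, and with "ignore" the write simply fails with EPIPE and the child gets to carry on. If the guess is right, ignoring SIGPIPE in the daemon and handling the failed write would keep a closed connection from taking the whole process down.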

Uh ... any ideas?
