Martin,
I have now played around with delay code; this version has a quadratic
back-off. The mode for the number of loops was 1, which implies your
assumption is correct. The next most frequent number of loops was 22,
so waiting a very short bit either works, or it doesn't and you have to
wait a lot longer.
One idea we have been throwing around is to not call send() for each
metric, but instead to gather all the data for one host and then send
it as one large buffer. We may run benchmarks on that later and see
which is better in terms of wall-clock time.
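A minimal sketch of the batching idea, assuming each metric is first formatted into one growable buffer and the whole buffer is then handed to a single send call. `strbuf`, `strbuf_appendf`, `metric` and `format_host_metrics` are hypothetical names for illustration, not gmond's code; only the `<METRIC .../>` layout follows the requester output quoted further down in this thread.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical growable buffer used to collect one host's XML. */
typedef struct {
    char  *data;
    size_t len;
    size_t cap;
} strbuf;

/* Hypothetical helper: append printf-style text, growing the buffer as
 * needed. Returns 0 on success, -1 on formatting or allocation failure. */
static int strbuf_appendf(strbuf *sb, const char *fmt, ...)
{
    va_list ap;
    int need;

    va_start(ap, fmt);
    need = vsnprintf(NULL, 0, fmt, ap);   /* measure first */
    va_end(ap);
    if (need < 0)
        return -1;

    if (sb->len + (size_t)need + 1 > sb->cap) {
        size_t cap = sb->cap ? sb->cap : 256;
        while (sb->len + (size_t)need + 1 > cap)
            cap *= 2;
        char *p = realloc(sb->data, cap);
        if (p == NULL)
            return -1;
        sb->data = p;
        sb->cap = cap;
    }

    va_start(ap, fmt);
    vsnprintf(sb->data + sb->len, sb->cap - sb->len, fmt, ap);
    va_end(ap);
    sb->len += (size_t)need;
    return 0;
}

/* Hypothetical metric record; gmond's real structures differ. */
typedef struct {
    const char *name;
    const char *val;
    const char *type;
    const char *units;
} metric;

/* Hypothetical helper: format every metric of one host into sb, so the
 * caller can issue one send for the whole host instead of one per metric. */
static int format_host_metrics(strbuf *sb, const metric *m, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (strbuf_appendf(sb,
                "<METRIC NAME=\"%s\" VAL=\"%s\" TYPE=\"%s\" UNITS=\"%s\"/>\n",
                m[i].name, m[i].val, m[i].type, m[i].units) != 0)
            return -1;
    }
    return 0;
}
```

The `data`/`len` pair would then go to one send call (for example the socket_send_full() wrapper in this message). Whether that actually wins in wall-clock time is what the proposed benchmark would have to show; it trades many small writes for one larger allocation per host.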
Ian
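#include <apr_network_io.h>   /* apr_socket_send() */
#include <apr_time.h>         /* apr_sleep() */

/* This function wraps calls to apr_socket_send() to handle EAGAIN:
 * it retries up to 33 times, sleeping loop * loop * 100 microseconds
 * before each retry (100 us, 400 us, 900 us, ...). */
apr_status_t socket_send_full(apr_socket_t *sock, const char *buf,
                              apr_size_t *len)
{
    apr_status_t rv;
    int loop = 0;
    apr_size_t start_len = *len;   /* remember the requested length */
    apr_interval_time_t t;

    rv = apr_socket_send(sock, buf, len);
    while (loop++ < 33 && APR_STATUS_IS_EAGAIN(rv))
    {
        t = loop * loop * 100;     /* apr_sleep() takes microseconds */
        apr_sleep(t);
        /* apr_socket_send() overwrites *len with the bytes sent,
         * so restore the full length before retrying */
        (*len) = start_len;
        rv = apr_socket_send(sock, buf, len);
    }
    return rv;
}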
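The wrapper above only loops while apr_socket_send() reports EAGAIN. If a call writes part of the buffer, apr_socket_send() returns APR_SUCCESS with a smaller *len, the loop ends, and the rest of the buffer is left to the caller. A minimal sketch of the same back-off that also continues after short writes, written against plain POSIX send() and nanosleep() instead of APR so it compiles on its own; `send_full`, `max_retries` and `base_us` are hypothetical names for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <time.h>

/* Sleep for usec microseconds, resuming if interrupted by a signal. */
static void sleep_us(long usec)
{
    struct timespec ts;
    ts.tv_sec = usec / 1000000L;
    ts.tv_nsec = (usec % 1000000L) * 1000L;
    while (nanosleep(&ts, &ts) == -1 && errno == EINTR)
        ;
}

/* Hypothetical helper: send all len bytes of buf on a non-blocking socket.
 * - short writes: continue from the first unsent byte
 * - EAGAIN/EWOULDBLOCK: sleep base_us * n * n microseconds before the
 *   n-th consecutive retry, giving up after max_retries retries
 * Returns 0 on success or an errno value; *sent gets the bytes written. */
int send_full(int fd, const char *buf, size_t len, size_t *sent,
              int max_retries, long base_us)
{
    size_t off = 0;
    int retries = 0;

    while (off < len) {
        ssize_t n = send(fd, buf + off, len - off, 0);
        if (n >= 0) {
            off += (size_t)n;
            retries = 0;               /* progress made: reset the back-off */
            continue;
        }
        if (errno == EINTR)
            continue;                  /* interrupted before sending: retry now */
        if ((errno == EAGAIN || errno == EWOULDBLOCK) && retries < max_retries) {
            retries++;
            sleep_us(base_us * retries * retries);
            continue;
        }
        *sent = off;
        return errno;                  /* hard error, or still blocked after max_retries */
    }
    *sent = off;
    return 0;
}
```

With the constants in the wrapper above (33 retries, 100 µs × n²), the worst case sleeps 100 × (1² + 2² + ... + 33²) µs = 100 × 12,529 µs, about 1.25 s, before giving up. Resetting the retry counter whenever some bytes were written is a design choice of this sketch, so a slow but still draining client is not abandoned. A real server would also have to deal with SIGPIPE if the client disconnects mid-send.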
Martin Knoblauch wrote:
Hi Ian,
thanks for updating the patch.
Puuhhh. That behaviour you describe is bad indeed. Seems either Cygwin
or M$ are doing something stupid.
One thought - you are calling apr_socket_send() at a high frequency in
that loop. Have you played with inserting some delay code in the loop?
Maybe waiting a ms or so would increase the chance of success?
Cheers
Martin
--- Ian Cunningham <[EMAIL PROTECTED]> wrote:
Martin,
Non-scientific numbers here for you. Connecting to the TCP port 600
times, print_host_metric() called apr_socket_send() at least 90,624
times. Of those 90,624 times, we got stuck in an EAGAIN while loop
1,190 times. On average that while loop looped 29,116.66 times, with a
maximum of 525,705 loops.
Pretty bad in my opinion. But the workaround... works :/
I have refactored all of the apr_socket_sends to use the workaround. I
have it error out if it loops more than 750,000 times *shakes head*.
I've posted a patch to the bug that seems to work; it only bombed out
once in 600 tries.
http://bugzilla.ganglia.info/cgi-bin/bugzilla/attachment.cgi?id=27&action=view
Ian
Martin Knoblauch wrote:
Hi Richard,
correct. I was waiting for a comment from Ian on my concerns about
possible endless loops before committing the patch.
Ian: what do you think? Do you have any data on how often you iterate
those EAGAIN loops?
Cheers
Martin
--- [EMAIL PROTECTED] wrote:
Gee,
I thought that was fixed with this patch:
http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=50
Actually, looking at 3.0.3 gmond.c, it looks like the patch did not make
it into the release; that's a shame.
Even looking at the patch, it looks as if it is a partial fix, because
while the patched metric printing is protected like this (gmond.c,
process_tcp_accept_channel):
<snip>
    rv = print_host_metric(client, metric, now);
    while(rv == EAGAIN)
      {
        rv = print_host_metric(client, metric, now);
      }
    if(rv != APR_SUCCESS)
      {
        goto close_accept_socket;
      }
  }
</snip>
the gmetric printing in the same function is not protected:
<snip>
    /* Send the gmetric info for this particular host */
    for(metric_hi = apr_hash_first(client_context, ((Ganglia_host *)val)->gmetrics);
        metric_hi;
        metric_hi = apr_hash_next(metric_hi))
      {
        void *metric;
        apr_hash_this(metric_hi, NULL, NULL, &metric);
        /* Print each of the metrics from gmetric for this host... */
        if(print_host_gmetric(client, metric, now) != APR_SUCCESS)
          {
            goto close_accept_socket;
          }
      }
</snip>
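One way to avoid protecting one loop but not the other is to route every write through a single retrying helper, so the metric and gmetric loops share the same EAGAIN handling. Unlike the busy `while(rv == EAGAIN)` loop above, this sleeps between attempts and gives up after a bounded number of tries. A minimal, self-contained sketch; `write_fn`, `retry_write` and `write_all` are hypothetical and only stand in for print_host_metric()/print_host_gmetric(), whose real signatures take gmond's own types.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical writer callback standing in for print_host_metric() /
 * print_host_gmetric(): returns 0 on success or an errno value. */
typedef int (*write_fn)(void *ctx, const void *item);

/* Hypothetical helper: call fn(ctx, item), retrying on EAGAIN with a
 * quadratic back-off of base_us * n * n microseconds before the n-th
 * retry, for at most max_retries retries. */
static int retry_write(write_fn fn, void *ctx, const void *item,
                       int max_retries, long base_us)
{
    int rv = fn(ctx, item);

    for (int n = 1; rv == EAGAIN && n <= max_retries; n++) {
        long us = base_us * n * n;
        struct timespec ts = { .tv_sec = us / 1000000L,
                               .tv_nsec = (us % 1000000L) * 1000L };
        nanosleep(&ts, NULL);   /* an early wake-up just retries sooner */
        rv = fn(ctx, item);
    }
    return rv;
}

/* Hypothetical helper: write every item of one list (metrics and
 * gmetrics alike) through retry_write(), stopping at the first error. */
static int write_all(write_fn fn, void *ctx, const void *const *items,
                     size_t count, int max_retries, long base_us)
{
    for (size_t i = 0; i < count; i++) {
        int rv = retry_write(fn, ctx, items[i], max_retries, base_us);
        if (rv != 0)
            return rv;
    }
    return 0;
}
```

This assumes a failed call wrote nothing. If a writer can fail after writing part of an item, retrying the whole item would duplicate output, so the retry would have to resume at the byte level instead.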
It may be best to talk to the original owner of the patch;
I'm not confident enough to submit a patch myself, although I will try
to submit a Bugzilla entry.
kind regards,
Richard
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Gilad Raphaelli
Sent: 14 March 2006 18:35
To: ganglia-developers@lists.sourceforge.net
Subject: [Ganglia-developers] RE: First prerelease of ganglia-3.0.3 ready for testing
I have tried the new release 3.0.3.200602231926
without success on FreeBSD 4.11: the XML is still
truncated when attempting to access the data from a
remote host. Interestingly, this is not the case when
trying from the host running gmond. Based on the
strace, my colleague commented:
Default socket buffer is 64K. It appears that the
socket is non-blocking. That last write is failing
(EAGAIN) because the socket buffer is full. The
application is ignoring that fact and shutting down
the socket. Looks to me like an application bug that
just accidentally works on RHEL.
Please let me know if you need any more information.
Thank you,
Gil
-----------------------------------------------------
Running an strace on gmond (on the target host) while trying to
retrieve the data shows:
71160 write(10, "<METRIC NAME=\"swap_free\" VAL=\"41"..., 124) = 124
71160 write(10, "<METRIC NAME=\"bytes_in\" VAL=\"608"..., 129) = -1 EAGAIN (Resource temporarily unavailable)
71160 shutdown(10, 0 /* receive */) = 0
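The trace matches the colleague's reading: write() returns EAGAIN once the send buffer is full, and gmond then shuts the socket down. Instead of sleeping for a guessed interval, a non-blocking sender can also block in poll() until the kernel reports the socket writable again (APR performs a similar wait inside apr_socket_send() when the socket has a positive timeout set with apr_socket_timeout_set()). A minimal POSIX sketch; `send_all_poll` and its `timeout_ms` parameter are hypothetical, not gmond code.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: write all of buf to a non-blocking socket. When
 * send() would block, wait with poll() for the socket to become writable,
 * for at most timeout_ms milliseconds per wait. Returns 0 on success,
 * ETIMEDOUT if the peer stopped draining data, or another errno value. */
int send_all_poll(int fd, const char *buf, size_t len, int timeout_ms)
{
    size_t off = 0;

    while (off < len) {
        ssize_t n = send(fd, buf + off, len - off, 0);
        if (n >= 0) {
            off += (size_t)n;
            continue;
        }
        if (errno == EINTR)
            continue;
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return errno;

        struct pollfd p = { .fd = fd, .events = POLLOUT };
        int r = poll(&p, 1, timeout_ms);
        if (r == 0)
            return ETIMEDOUT;           /* no buffer space freed in time */
        if (r < 0 && errno != EINTR)
            return errno;
        /* writable (or interrupted): loop around and try send() again */
    }
    return 0;
}
```

Compared with a fixed back-off, the sender wakes up as soon as the client has drained some data, and the timeout bounds how long a stalled client can hold the server.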
What this looks like from the requester (not the exact same transaction):
<METRIC NAME="mem_buffers" VAL="204096" TYPE="uint32" UNITS="KB" TN="119" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
<METRIC NAME="swap_free" VAL="4194136" TYPE="uint32" UNITS="KB" TN="119" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
Connection closed by foreign host.
A normal transaction ends with the closing tag </GANGLIA_XML>.
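Because a truncated reply simply stops mid-stream, a client can tell a complete dump from a cut-off one by checking that the data ends with the closing `</GANGLIA_XML>` tag. A minimal sketch of that check, assuming the whole reply has already been read into memory; `xml_is_complete` is a hypothetical helper, not part of gmond or gmetad.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: return true if buf (len bytes, not necessarily
 * NUL-terminated) ends with </GANGLIA_XML>, ignoring trailing whitespace. */
static bool xml_is_complete(const char *buf, size_t len)
{
    static const char tag[] = "</GANGLIA_XML>";
    const size_t tag_len = sizeof tag - 1;

    while (len > 0 && isspace((unsigned char)buf[len - 1]))
        len--;
    return len >= tag_len && memcmp(buf + len - tag_len, tag, tag_len) == 0;
}
```

A reply like the one above, which ends after the swap_free METRIC element, fails this check, so the client can report a truncated transfer instead of silently using partial data.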
_______________________________________________
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
------------------------------------------------------