Martin,

I have now played around with the delay code; this version backs off with a delay that grows on each retry (loop * loop * 100 microseconds, so quadratic rather than strictly exponential). The mode for the number of loops was 1, which implies your assumption is correct. The next most frequent loop count was 22, so waiting a very short time either works, or it doesn't and you have to wait a lot longer.

One idea we have been throwing around is to not call send() for each metric, but instead gather all the data for one host and then send it as one huge buffer. We may run benchmarks on that later and see what's better in terms of wall-clock time.
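A minimal sketch of that batching idea, in plain C. The names `struct outbuf`, `outbuf_append` and `outbuf_free` are hypothetical and not part of gmond; the assumption is that each metric's XML would be appended to the buffer, and the whole buffer would then be handed to a single send call.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable output buffer; illustration only, not gmond code. */
struct outbuf {
    char  *data;
    size_t len;   /* bytes used */
    size_t cap;   /* bytes allocated */
};

/* Append n bytes, growing the buffer as needed.
   Returns 0 on success, -1 if memory allocation fails. */
static int outbuf_append(struct outbuf *b, const char *src, size_t n)
{
    assert(b != NULL && src != NULL);
    if (b->len + n > b->cap) {
        size_t cap = b->cap ? b->cap : 4096;
        while (cap < b->len + n)
            cap *= 2;
        char *p = realloc(b->data, cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = cap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}

/* Release the buffer and reset it to the empty state. */
static void outbuf_free(struct outbuf *b)
{
    free(b->data);
    b->data = NULL;
    b->len = b->cap = 0;
}
```

With this shape, the per-host loop would only touch memory, and the EAGAIN handling would be needed in one place (the final send) rather than around every metric.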

Ian

/* Wraps apr_socket_send() and retries, with a growing delay, while it returns EAGAIN. */
apr_status_t socket_send_full(apr_socket_t *sock, const char *buf, apr_size_t *len)
{
 apr_status_t rv;
 int loop = 0;
 apr_size_t start_len;
 apr_interval_time_t t;

 /* apr_socket_send() overwrites *len with the bytes actually sent,
    so remember the requested size for the retries */
 start_len = (*len);
 rv = apr_socket_send( sock, buf, len);

 /* give up after 33 retries */
 while (loop++ < 33 && APR_STATUS_IS_EAGAIN(rv))
 {
   t = loop * loop * 100;   /* microseconds: 100, 400, 900, ... */
   apr_sleep(t);
   (*len) = start_len;
   rv = apr_socket_send( sock, buf, len);
 }
 return rv;
}
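
To put numbers on the delay schedule in the function above, here is a small stand-alone sketch. The helper names `backoff_delay_us` and `backoff_total_us` are made up for illustration; the only thing taken from the function is the formula t = loop * loop * 100, with apr_sleep() taking microseconds.

```c
#include <assert.h>

/* Delay before retry number `attempt` (1-based), in microseconds,
   mirroring t = loop * loop * 100 from socket_send_full(). */
static long backoff_delay_us(int attempt)
{
    assert(attempt >= 1);
    return (long)attempt * attempt * 100L;
}

/* Total time slept if every retry from 1 to max_attempts is used. */
static long backoff_total_us(int max_attempts)
{
    long total = 0;
    for (int i = 1; i <= max_attempts; i++)
        total += backoff_delay_us(i);
    return total;
}
```

The first retry waits 100 microseconds and the 22nd waits 48,400. A send that needs 22 retries has slept about 0.38 seconds in total, and one that exhausts all 33 retries has slept about 1.25 seconds before the function gives up.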


Martin Knoblauch wrote:

Hi Ian,

thanks for updating the patch.

Puuhhh. That behaviour you describe is bad indeed. Seems either Cygwin
or M$ are doing something stupid.

One thought - you are calling apr_socket_send() at a high frequency in
that loop. Have you played with inserting some delay code in the loop?
Maybe waiting a ms or so would increase the chance of success?

Cheers
Martin

--- Ian Cunningham <[EMAIL PROTECTED]> wrote:

Martin,

Non-scientific numbers here for you. Connecting to the tcp port 600 times, print_host_metric() called apr_socket_send() at least 90,624 times. Of those 90,624 times, we got stuck in an EAGAIN while loop 1,190 times. On average that while loop looped 29,116.66 times, with a maximum of 525,705 loops.

Pretty bad in my opinion. But the workaround... works :/

I have refactored all of the apr_socket_send() calls to use the workaround.
I have it error out if it loops more than 750,000 times *shakes head*. I've posted a patch to the bug that seems to work; it only bombed out once in 600 tries.

http://bugzilla.ganglia.info/cgi-bin/bugzilla/attachment.cgi?id=27&action=view

Ian

Martin Knoblauch wrote:

Hi Richard,

correct. I was waiting for a comment from Ian on my concerns about
possible endless loops before committing the patch.

Ian: what do you think? Do you have any data on how often you iterate
those EAGAIN loops?

Cheers
Martin


--- [EMAIL PROTECTED] wrote:



Gee,

I thought that was fixed with this patch:
http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=50

Actually, looking at 3.0.3 gmond.c, it looks like the patch did not make it
into the release - that's a shame.

Even looking at the patch, it looks as if it is a partial fix, because while
the patched metric printing is protected like this (gmond.c,
process_tcp_accept_channel):
<snip>
      rv = print_host_metric(client, metric, now);
      while(rv == EAGAIN)
        {
          rv = print_host_metric(client, metric, now);
        }
      if(rv != APR_SUCCESS)
        {
          goto close_accept_socket;
        }
      }
</snip>

the gmetric printing in the same function is not protected:
<snip>

    /* Send the gmetric info for this particular host */
    for(metric_hi = apr_hash_first(client_context, ((Ganglia_host *)val)->gmetrics);
        metric_hi;
        metric_hi = apr_hash_next(metric_hi))
      {
        void *metric;
        apr_hash_this(metric_hi, NULL, NULL, &metric);

        /* Print each of the metrics from gmetric for this host... */
        if(print_host_gmetric(client, metric, now) != APR_SUCCESS)
          {
            goto close_accept_socket;
          }
      }
</snip>

It may be best to talk to the original owner of the patch.
I'm not confident enough to submit a patch myself, although I will try
to submit a bugzilla entry.

kind regards,
Richard

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Gilad Raphaelli
Sent: 14 March 2006 18:35
To: ganglia-developers@lists.sourceforge.net
Subject: [Ganglia-developers] RE: First prerelease of ganglia-3.0.3 ready for testing


I have tried the new release 3.0.3.200602231926
without success on FreeBSD 4.11 - the xml is still
truncated when attempting to access the data from a
remote host.  Interestingly, this is not the case when
trying from the host running gmond.  Based on the
strace, my colleague commented:

Default socket buffer is 64K.  It appears that
socket is non-blocking.  That last write is failing
(EAGAIN) because the socket buffer is full.  The
application is ignoring that fact and shutting down
the socket.  Looks to me like an application bug that
just accidentally works on rhel.
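
A minimal sketch of the handling the colleague describes as missing, written against plain POSIX sockets rather than APR. `send_all` and its parameters are hypothetical names, not gmond code. The assumption is a non-blocking stream socket: when the buffer is full, the sender waits with poll() until the socket is writable again and then resumes from where it stopped, instead of shutting the socket down.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: write all len bytes to a non-blocking socket.
   On EAGAIN, wait up to timeout_ms for the socket to become writable.
   Returns 0 on success, -1 on error or timeout; *sent reports how many
   bytes were written in either case. */
static int send_all(int fd, const char *buf, size_t len,
                    int timeout_ms, size_t *sent)
{
    assert(buf != NULL && sent != NULL);
    *sent = 0;
    while (*sent < len) {
        ssize_t n = send(fd, buf + *sent, len - *sent, 0);
        if (n > 0) {
            *sent += (size_t)n;      /* partial write: continue with the rest */
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;                /* interrupted by a signal: retry */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            int rc = poll(&pfd, 1, timeout_ms);
            if (rc > 0)
                continue;            /* socket reported ready: try send again */
            if (rc < 0 && errno == EINTR)
                continue;
            return -1;               /* timed out, or poll itself failed */
        }
        return -1;                   /* any other send error */
    }
    return 0;
}
```

Unlike a fixed sleep, poll() returns as soon as the peer has drained some of the buffer, and the timeout bounds how long a stalled client can hold up the sender.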

Please let me know if you need any more information.

Thank you,

Gil
-----------------------------------------------------

Running an strace on gmond (on the target host) while
trying to retrieve the data shows:
71160 write(10, "<METRIC NAME=\"swap_free\" VAL=\"41"..., 124) = 124
71160 write(10, "<METRIC NAME=\"bytes_in\" VAL=\"608"..., 129) = -1 EAGAIN (Resource temporarily unavailable)
71160 shutdown(10, 0 /* receive */)     = 0

What this looks like from the requester (not the
exact same transaction):

<METRIC NAME="mem_buffers" VAL="204096" TYPE="uint32" UNITS="KB" TN="119" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
<METRIC NAME="swap_free" VAL="4194136" TYPE="uint32" UNITS="KB" TN="119" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
Connection closed by foreign host.

A normal transaction closes with a closing tag: </GANGLIA_XML>


_______________________________________________
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers







------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de




