On Wednesday 04:26 PM 7/15/2009, Tom Jackson wrote:
Your SF bug report says that you put in a 300 millisecond delay.
Where? Even if you think that such a fix is not good, it would be
helpful to at least know what works.

I've done a great deal of debugging on this that isn't in the bug report, for reasons of brevity. But I did state that the workaround is to "insert a delay before the data starts being read by ns_https{post,get}"--in other words, immediately before the loops commented with "Read the content" in ns_httpspost/ns_httpsget:

----- 8< ----------------------------------------------------------
        #
        # Read the content.
        #

        while 1 {
            set buf [_ns_https_read $timeout $rfd $length]
            append page $buf
            [...]
----- 8< ----------------------------------------------------------

The "after X" statement would go immediately before this while loop.
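
To make the placement concrete, here is a minimal sketch that runs in plain tclsh. It is not the actual ns_httpsget code: fake_read is a hypothetical stand-in for _ns_https_read (the real proc reads from the SSL socket and takes different arguments), read_content is a made-up wrapper name, and the break on an empty read is my assumption about how the elided part of the loop terminates. The 300 is the delay value mentioned from the SF bug report.

```tcl
# Hypothetical stand-in for _ns_https_read: hands back one queued chunk per
# call, then "" once the queue is empty.
proc fake_read {chunksVar} {
    upvar 1 $chunksVar chunks
    set buf [lindex $chunks 0]
    set chunks [lrange $chunks 1 end]
    return $buf
}

# Hypothetical wrapper showing only where the delay sits relative to the loop.
proc read_content {chunks delayMs} {
    # Workaround: the "after X" goes immediately before the read loop.
    after $delayMs

    #
    # Read the content.
    #
    set page ""
    while 1 {
        set buf [fake_read chunks]
        append page $buf
        # Assumed termination condition: stop at the first empty read.
        if {$buf eq ""} break
    }
    return $page
}

set chunks [list [string repeat a 8192] [string repeat b 100]]
puts [string length [read_content $chunks 300]]   ;# 8292
```

In the real procs only the "after" line is new; everything else in the sketch exists just so the example runs outside AOLserver.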

You also talk about truncation, but then the truncation stops if the
received data goes above 81000.

It might be a good idea to narrow down when the bug appears (what byte
value) and when it goes away again. This might suggest something.

I tried that, and it was suggestive but ultimately not much help in debugging the problem. For one thing, the byte values vary by platform, and aren't even consistent on the same platform (i.e., a given byte size might work or fail depending on the run). It's a timing issue, as I said in the bug report. However, if you're curious, this is an analysis of the errors at various byte values taken from our internal bug report for this issue:

----- 8< ----------------------------------------------------------
The error shows up consistently (99.9+% of the time) at 74000 through 81000 bytes (counting by 1000), so I've been using the range of 70000-83000 for testing. Also, some specific testing showed that the errors actually kick in reliably at 73729 bytes; note that 73728=8192*9. And in all the succeeding sizes until the errors stop again, the socket returns exactly 73728 bytes of data regardless of the request size. This particular run of consistent errors stops at 81884 bytes (though there are a few rare successes in that range), which doesn't have any suggestive powers of 2.

So it seems clear that the buffer size affects the reliability in at least two ways: 1) larger sizes are more likely to fail, and 2) certain multiples of 8192 are particularly significant in that they're the last working size before a long stretch of failing sizes (all of which return that last working size). In addition to 73728=8192*9, I verified that this happens at 90112=8192*11 and 106496=8192*13, and that it does NOT happen at 81920=8192*10 or 57344=8192*7. So it would appear that odd multiples of 8192 where the multiplier is >= 9 are the ones that typically start lengthy failure sequences.
----- 8< ----------------------------------------------------------
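
For anyone who wants to check the arithmetic in that excerpt, the boundary sizes are plain multiples of 8192. The proc name boundary_sizes below is made up for illustration; the multipliers are the ones listed in the analysis.

```tcl
# Return block*m for each multiplier m in mults.
proc boundary_sizes {block mults} {
    set sizes {}
    foreach m $mults {
        lappend sizes [expr {$block * $m}]
    }
    return $sizes
}

# Multipliers reported to start a long run of failures (odd, >= 9).
puts [boundary_sizes 8192 {9 11 13}]   ;# 73728 90112 106496

# Multipliers verified NOT to start such a run.
puts [boundary_sizes 8192 {7 10}]      ;# 57344 81920
```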

Note that this analysis only applies to RHEL4 (the byte-size analysis for Mac OS X is similar, but the multipliers and trigger levels are different, though I didn't record the actual values). And even on RHEL4 these aren't the only values that fail--other smaller and larger buffer sizes will fail too, just not as consistently.
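
Since the failures vary from run to run, a size sweep needs to repeat each size several times and count the truncated responses. The following is only a sketch of that idea, not the harness I actually used: sweep and fake_fetch are hypothetical names, and fake_fetch merely simulates the consistent RHEL4 range quoted above (it assumes 81884 is the first size that works again), so its output is not real data. A real run would replace fake_fetch with a wrapper around ns_httpsget that requests a test page of exactly the given size.

```tcl
# Run fetchCmd $runs times for each size in from..to and count how many
# responses came back with the wrong length.
# Returns a flat list: size failures size failures ...
proc sweep {fetchCmd from to step runs} {
    set report {}
    for {set size $from} {$size <= $to} {incr size $step} {
        set failures 0
        for {set i 0} {$i < $runs} {incr i} {
            set body [eval $fetchCmd [list $size]]
            if {[string length $body] != $size} {
                incr failures
            }
        }
        lappend report $size $failures
    }
    return $report
}

# Simulated fetch (an assumption, not a measurement): sizes 73729..81883
# come back truncated to 73728 bytes; every other size is returned in full.
proc fake_fetch {size} {
    if {$size > 73728 && $size < 81884} {
        set size 73728
    }
    return [string repeat x $size]
}

foreach {size failures} [sweep fake_fetch 70000 83000 1000 3] {
    puts "$size: $failures/3 truncated"
}
```

With the simulated fetch, 74000 through 81000 report 3/3 truncated and the other sizes report 0/3. A real sweep would show the scattered, less consistent failures as counts between those extremes.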

- John


--
AOLserver - http://www.aolserver.com/
