On Jan 31, 2008, at 1:43 PM, Matthew Toseland wrote:

> On Thursday 31 January 2008 19:29, Robert Hailey wrote:
>>
>> On Jan 31, 2008, at 12:00 PM, Matthew Toseland wrote:
>>
>>> On Thursday 31 January 2008 17:34, you wrote:
>>>>
>>>> On Jan 31, 2008, at 8:41 AM, Matthew Toseland wrote:
>>>>
>>>>> We are still getting timeouts. [...]
>>>>> Any theories about the most likely cause?
>>>>
>>>> Considering the rather common occurrence of high-ping opennet
>>>> peernodes,
>>>
>>> Oh?
>>
>> Every time I look at my opennet peers, I *always* have at least two
>> with pings greater than 2 seconds. Right now, one with 4.5 secs, and
>> one with 8.9 (the rest are sane).
>
> Hmmm. Doesn't happen for me, although I only have 4 or 5 opennet
> peers.
>
> It seems extraordinarily unlikely that this is real - either this is
> a stats bug, or a message layer bug.

And if it is a message layer bug, that means it may be directly
related to the timeouts.

>>>> In the past while examining the throttle controls, I have suspected
>>>> that (with priority queues) the "90-seconds at full throttle"
>>>> constant might actually reduce to taking on too many concurrent chk
>>>> transfers for them all to complete on time.
>>>
>>> Why? IIRC we include a fudge factor in that calculation, admittedly
>>> it isn't very accurate and should be made more so by using stats on
>>> bandwidth usage...
>>
>> Just that the CHKs all use the same throttle, so they all
>> throttle-down when we accept another CHK transfer.
>
> Well sure, but if the mechanism is working we won't accept enough to
> be a problem.

I'm not saying this is an issue, but when a node is busy the 90-second
standard might actually make the average chk transfer time (over long
distances) always exactly 90 seconds (through the busiest node). Since
the transfer timeout is 120 seconds, this leaves only 30 seconds to
accumulate acceptable latency; by your previous value of 30 hops, this
means one second per hop (1/2 ping time plus coalescing delay?).

Or else, how many transfers are aborted because nodes disconnect, and
would they succeed if the target transfer time were shorter than 90
seconds? Particularly since the CHK is streamed, the traffic up to the
abort is wasted (50% payload?).

>>>>> Do timeouts show up in simulation?
>>>>
>>>> I don't normally watch for them, I've started a new run with
>>>> Accepted & Fatal request timeouts being logged. So far nothing.
>>>
>>> Ok.
>>
>> After running the simulator for two hours w/ ten nodes, I spot
>> exactly one Accepted timeout (17 minutes into the simulation).
>>
>> So the answer is yes... timeouts still occur in the simulator.
>
> Suggests a messaging bug, although it's possible it's an artifact of
> java's lack of thread priorities on *nix (i.e. cpu issues).

I would be more inclined to think it is a messaging bug; it is a beefy
machine, and the timeout occurred some time into the simulation.

>>>>> What can we do to debug this?
>>>>
>>>> Probably:
>>>> (1) a simulated high-ping times seen in the public network at about
>>>> the same rate,
>>>
>>> You mean bugs cause high ping times and high ping times cause
>>> timeouts?
>>>
>>>> (2) a message/link layer stress test complete with
>>>> rekeying/disconnects/and [busy/not-busy] spikes
>>>
>>> This would be a good idea, I dunno how much work would be involved?
>>>
>>> What can I usefully work on in this area? AFAICS:
>>> - The window-grows-while-unused bug.
>>> - More accurate bandwidth liability limiting.
>>> - Debug the not-forwarded detection and make assumeNATed false by
>>>   default.
>>> (Reduce baseload bandwidth usage).
>>>
>>> Anything else? You want to take any of these on?
>>
>> I don't think I can take on a big project right now.
>
> Is there anything I can do?

I am not familiar with the window-grows-while-unused bug, and am not
working on/debugging the message layer right now. It's up to you.

--
Robert Hailey
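
For reference, the latency-budget arithmetic from the message (a
120-second transfer timeout, a 90-second transfer through the busiest
node, 30 hops) can be sketched as below. This is only an illustration:
the function and parameter names are made up here and are not taken
from the Freenet code, and the figures are simply the ones quoted in
the message.

```python
# Hypothetical helper, not Freenet code: how much latency each hop may
# add before a transfer exceeds its timeout.
def per_hop_latency_budget(transfer_timeout_s, transfer_time_s, hops):
    """Return the seconds of latency each hop can contribute."""
    if hops <= 0:
        raise ValueError("hops must be positive")
    # Slack left once the transfer itself has used its share of the timeout.
    slack_s = transfer_timeout_s - transfer_time_s
    return slack_s / hops


# Figures quoted in the message: 120 s timeout, 90 s transfer, 30 hops.
budget = per_hop_latency_budget(transfer_timeout_s=120, transfer_time_s=90, hops=30)
print(budget)  # 1.0 second per hop
```

Under the same assumptions, a shorter target transfer time widens the
budget: a 60-second target would leave 2 seconds per hop.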
