On Thursday 31 January 2008 21:21, Robert Hailey wrote:
>
> >>> Oh?
> >>
> >> Every time I look at my opennet peers, I *always* have at least two
> >> with pings greater than 2 seconds. Right now, one with 4.5 secs, and
> >> one with 8.9 (the rest are sane).
> >
> > Hmmm. Doesn't happen for me, although I only have 4 or 5 opennet peers.
> >
> > It seems extraordinarily unlikely that this is real - either this is
> > a stats bug, or a message layer bug.
>
> And if it is a message layer bug, that means it may be directly
> related to the timeouts.

It's also possible it's just due to nodes being hideously overloaded. Which
can be due to several causes:
- The startup spike. Which can last a long time because we have no request
  resuming.
- Out of memory causing continual garbage collection. Several users have
  reported that this happens after 12 hours or so of uptime.
- ....

>
> >>>> In the past while examining the throttle controls, I have suspected
> >>>> that (with priority queues) the "90-seconds at full throttle" constant
> >>>> might actually reduce to taking on too many concurrent chk transfers
> >>>> for them all to complete on time.
> >>>
> >>> Why? IIRC we include a fudge factor in that calculation, admittedly
> >>> it isn't very accurate and should be made more so by using stats on
> >>> bandwidth usage...
> >>
> >> Just that the CHKs all use the same throttle, so they all
> >> throttle-down when we accept another CHK transfer.
> >
> > Well sure, but if the mechanism is working we won't accept enough to
> > be a problem.
>
> I'm not saying this is an issue, but when a node is busy the
> 90-second-standard might actually make the average chk transfer time
> (over long distances) always exactly 90 seconds (through the busiest
> node). Since the transfer timeout is 120 seconds, this actually leaves
> only 30 seconds to accumulate acceptable latency; by your previous value
> of 30 hops, this means one second per hop (1/2 ping time plus coalescing
> delay?).

Hmmm. Perhaps. So we should reduce the 90 seconds to say 60 seconds? That
might cut actual bandwidth usage...

>
> Or else, how many transfers are aborted because nodes disconnect, and
> if they would succeed if the target transfer time was shorter than 90
> seconds? Particularly as the CHK is streaming, that the traffic up
> unto the abort is wasted (50% payload?).

Hmmm. IIRC that is fatal?

>
> >>>>> Do timeouts show up in simulation?
> >>>>
> >>>> I don't normally watch for them, I've started a new run with Accepted
> >>>> & Fatal request timeouts being logged. So far nothing.
> >>>
> >>> Ok.
> >>
> >> After running the simulator for two hours w/ ten nodes, I spot exactly
> >> one Accepted timeout (17 minutes into the simulation).
> >>
> >> So the answer is yes... timeouts still occur in the simulator.
> >
> > Suggests a messaging bug, although it's possible it's an artifact of
> > java's lack of thread priorities on *nix (i.e. cpu issues).
>
> I would be more inclined to think a messaging bug, it is a beefy
> machine and it occurred some time into the simulation.
>
> >>>>> What can we do to debug this?
> >>>>
> >>>> Probably:
> >>>> (1) a simulated high-ping times seen in the public network at about
> >>>> the same rate,
> >>>
> >>> You mean bugs cause high ping times and high ping times cause
> >>> timeouts?
> >>>
> >>>> (2) a message/link layer stress test complete with
> >>>> rekeying/disconnects/and [busy/not-busy] spikes
> >>>
> >>> This would be a good idea, I dunno how much work would be involved?
> >>>
> >>> What can I usefully work on in this area? AFAICS:
> >>> - The window-grows-while-unused bug.
> >>> - More accurate bandwidth liability limiting.
> >>> - Debug the not-forwarded detection and make assumeNATed false by
> >>> default.
> >>> (Reduce baseload bandwidth usage).
> >>>
> >>> Anything else? You want to take any of these on?
> >>
> >> I don't think I can take on a big project right now.
> >
> > Is there anything I can do?
>
> I am not familiar with the window-grows-while-unused bug, and am not
> working on/debugging the message layer right now. It's up to you.

I will fix the window-grows-while-unused bug. W.r.t. messaging layer bugs,
please explain how to reproduce your simulation; commit whatever source is
needed.
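
A rough sketch of the per-hop latency budget arithmetic from the 90-second discussion above. It assumes only the figures quoted in this thread (120-second transfer timeout, 30 hops, a 90- or 60-second target transfer time). The helper name per_hop_latency_budget is made up for illustration and is not anything in the Freenet source.

```python
def per_hop_latency_budget(transfer_timeout_s, target_transfer_s, hops):
    """Seconds of latency each hop may add before the transfer times out.

    Hypothetical helper: assumes the busiest node stretches the transfer
    to exactly the target time, so only the remainder is left for latency.
    """
    if hops <= 0:
        raise ValueError("hops must be positive")
    slack_s = transfer_timeout_s - target_transfer_s
    return slack_s / hops


# Figures from this thread: 120 s transfer timeout, 30 hops.
for target_s in (90, 60):
    budget_s = per_hop_latency_budget(120, target_s, 30)
    print(f"target {target_s} s -> {budget_s:.1f} s of latency per hop")
```

With a 90-second target this gives one second per hop, as stated above; dropping the target to 60 seconds would double that to two seconds per hop, at the possible cost of bandwidth usage.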
