[freenet-dev] Still getting timeouts

Matthew Toseland Fri, 1 Feb 2008 17:55:24 +0000

On Thursday 31 January 2008 21:21, Robert Hailey wrote:
> 
> >>> Oh?
> >>
> >> Every time I look at my opennet peers, I *always* have at least two
> >> with pings greater than 2 seconds. Right now, one with 4.5 secs, and
> >> one with 8.9 (the rest are sane).
> >
> > Hmmm. Doesn't happen for me, although I only have 4 or 5 opennet peers.
> >
> > It seems extraordinarily unlikely that this is real - either this is a
> > stats bug, or a message layer bug.
> 
> And if it is a message layer bug, that means it may be directly  
> related to the timeouts.


It's also possible that it's just due to nodes being hideously overloaded,
which can have several causes:
- The startup spike, which can last a long time because we have no request
resuming.
- Running out of memory, causing continual garbage collection. Several users
have reported that this happens after 12 hours or so of uptime.
- ....
> 
> >>>> In the past while examining the throttle controls, I have suspected
> >>>> that (with priority queues) the "90-seconds at full throttle"
> >>>> constant might actually reduce to taking on too many concurrent chk
> >>>> transfers for them all to complete on time.
> >>>
> >>> Why? IIRC we include a fudge factor in that calculation; admittedly it
> >>> isn't very accurate and should be made more so by using stats on
> >>> bandwidth usage...
> >>
> >> Just that the CHKs all use the same throttle, so they all throttle-
> >> down when we accept another CHK transfer.
> >
> > Well sure, but if the mechanism is working we won't accept enough to be
> > a problem.
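
To put rough numbers on the shared-throttle point above: if every accepted CHK transfer draws from the same throttle, n concurrent transfers each get about 1/n of the output bandwidth, so each one takes about n times as long. A minimal sketch, assuming an equal split, a 32 KiB CHK payload with no protocol overhead, and hypothetical function names (none of this is the actual node code):

```python
CHK_BLOCK_BYTES = 32 * 1024  # assumed CHK payload size; protocol overhead ignored


def transfer_seconds(concurrent, bandwidth_bytes_per_sec, block_bytes=CHK_BLOCK_BYTES):
    """Time for one transfer when `concurrent` transfers split the throttle equally."""
    if concurrent < 1:
        raise ValueError("need at least one transfer")
    per_transfer_rate = bandwidth_bytes_per_sec / concurrent
    return block_bytes / per_transfer_rate


def max_concurrent(bandwidth_bytes_per_sec, target_seconds=90.0, block_bytes=CHK_BLOCK_BYTES):
    """Largest number of equal-share transfers that all finish within target_seconds."""
    return int(bandwidth_bytes_per_sec * target_seconds // block_bytes)
```

Under these assumptions a 4096 bytes/sec output limit allows 11 concurrent transfers at about 88 seconds each; accepting a twelfth pushes every transfer to about 96 seconds, past the 90-second target.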
> 
> I'm not saying this is an issue, but when a node is busy the 90-second
> standard might actually make the average CHK transfer time (over long
> distances) always exactly 90 seconds (through the busiest node). Since
> the transfer timeout is 120 seconds, this actually leaves only 30
> seconds to accumulate acceptable latency; by your previous value of 30
> hops, this means one second per hop (1/2 ping time plus coalescing
> delay?).

Hmmm. Perhaps. So we should reduce the 90 seconds to, say, 60 seconds? That
might cut actual bandwidth usage...
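
The per-hop arithmetic in the quoted paragraph can be checked with a few lines. This is only a sketch, assuming the figures mentioned in this thread (120-second transfer timeout, 30 hops) and a hypothetical helper name:

```python
def per_hop_latency_budget(timeout_seconds, transfer_seconds, hops):
    """Seconds of latency each hop may add before the transfer timeout fires."""
    if hops < 1:
        raise ValueError("hops must be at least 1")
    # Whatever the transfer itself does not use is left over for latency.
    slack = timeout_seconds - transfer_seconds
    return max(slack, 0.0) / hops


budget_at_90 = per_hop_latency_budget(120, 90, 30)  # current target
budget_at_60 = per_hop_latency_budget(120, 60, 30)  # proposed target
```

With a 90-second target the budget is 1 second per hop; dropping the target to 60 seconds doubles it to 2 seconds per hop.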
> 
> Or else, how many transfers are aborted because nodes disconnect, and
> would they succeed if the target transfer time were shorter than 90
> seconds? Particularly as the CHK is streaming, the traffic up to the
> abort is wasted (50% payload?).

Hmmm. IIRC that is fatal?
> 
> >>>>> Do timeouts show up in simulation?
> >>>>
> >>>> I don't normally watch for them; I've started a new run with Accepted
> >>>> & Fatal request timeouts being logged. So far nothing.
> >>>
> >>> Ok.
> >>
> >> After running the simulator for two hours w/ ten nodes, I spot exactly
> >> one Accepted timeout (17 minutes into the simulation).
> >>
> >> So the answer is yes... timeouts still occur in the simulator.
> >
> > Suggests a messaging bug, although it's possible it's an artifact of
> > Java's lack of thread priorities on *nix (i.e. CPU issues).
> 
> I would be more inclined to think it is a messaging bug; it is a beefy
> machine and it occurred some time into the simulation.
> 
> >>>>> What can we do to debug this?
> >>>>
> >>>> Probably:
> >>>> (1) simulating the high ping times seen in the public network, at
> >>>> about the same rate,
> >>>
> >>> You mean bugs cause high ping times and high ping times cause
> >>> timeouts?
> >>>
> >>>> (2) a message/link layer stress test complete with rekeying,
> >>>> disconnects, and [busy/not-busy] spikes
> >>>
> >>> This would be a good idea, I dunno how much work would be involved?
> >>>
> >>> What can I usefully work on in this area? AFAICS:
> >>> - The window-grows-while-unused bug.
> >>> - More accurate bandwidth liability limiting.
> >>> - Debug the not-forwarded detection and make assumeNATed false by
> >>>   default. (Reduce baseload bandwidth usage.)
> >>>
> >>> Anything else? You want to take any of these on?
> >>
> >> I don't think I can take on a big project right now.
> >
> > Is there anything I can do?
> 
> I am not familiar with the window-grows-while-unused bug, and am not  
> working on/debugging the message layer right now. It's up to you.

I will fix the window-grows-while-unused bug.

W.r.t. messaging layer bugs, please explain how to reproduce your simulation; 
commit whatever source is needed.