On Jan 31, 2008, at 12:00 PM, Matthew Toseland wrote:

> On Thursday 31 January 2008 17:34, you wrote:
>>
>> On Jan 31, 2008, at 8:41 AM, Matthew Toseland wrote:
>>
>>> We are still getting timeouts. [...]
>>> Any theories about the most likely cause?
>>
>> Considering the rather common occurrence of high-ping opennet
>> peernodes,
>
> Oh?

Every time I look at my opennet peers, I *always* have at least two
with pings greater than 2 seconds. Right now, one with 4.5 secs, and
one with 8.9 (the rest are sane).

>> my first suspect is that they are culminated pingtimes and
>> coalescing delays.
>
> Is that possible? 2 minute timeout, say 30 hops, it'd have to be ~ 4
> second round trip per hop, which doesn't happen much does it?

I don't know, I have seen ping times to peers in excess of 12 seconds.
How common are the timeouts anyway?

>> If this is the case, the only way I am aware of to
>> solve it is to favor nodes with low ping times;
>
> Opennet favours nodes that get successful requests. I want to keep
> alchemy out of it as much as possible, since this is what Oskar has
> shown to work, and anyway it ought to effectively balance all the
> other factors - if a node is too slow it won't generate successful
> requests.
>
>> I actually already
>> have a patch for that, although in its present incantation it also
>> favors darknet nodes for routing (easily excised).
>
> :)
>>
>> My only other suspect is a bug in the message/link layer that drops
>> messages.
>
> Entirely possible.
>
> The current link layer sucks, it is in need of a major rewrite.
> There is one major bug which needs to be fixed (we increase the AIMD
> transfer rate to a peer indefinitely while we are not maxing it out,
> and then get a huge spike causing big problems when we do get more
> traffic to send). But more generally, it's not as close to TCP as
> I'd like it to be, and it has severe limits on packets in flight.
> The packet format is also much more verbose than it needs to be.
>
> http://wiki.freenetproject.org/NewTransportLayer
> http://wiki.freenetproject.org/NewPacketFormat
>
> Any such rewrite is highly unlikely to go in before 0.7.0. But if it
> turns out to be relatively urgent it should be done soon after that.
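
To make the window-grows-while-unused problem concrete, here is a toy
sketch. It is not Freenet's link-layer code: the class, method names and
constants are all made up for illustration, and the window is counted in
packets. The only idea it shows is that additive increase should be
skipped when the sender is not actually filling the window, which is
similar in spirit to TCP's congestion window validation.

```python
class AimdWindow:
    """Toy AIMD window in packets. Hypothetical names, not Freenet's API."""

    def __init__(self, initial=4.0, minimum=1.0, only_grow_when_full=True):
        self.window = initial
        self.minimum = minimum
        # Assumption: this flag is the proposed fix; False mimics the bug.
        self.only_grow_when_full = only_grow_when_full

    def on_ack(self, packets_in_flight):
        """Additive increase: roughly +1 packet per window's worth of acks."""
        if self.only_grow_when_full and packets_in_flight < int(self.window):
            # Not window-limited: this ack says nothing about whether
            # the path could carry a larger window, so do not grow.
            return
        self.window += 1.0 / self.window

    def on_loss(self):
        """Multiplicative decrease: halve the window, never below minimum."""
        self.window = max(self.minimum, self.window / 2.0)


def window_after_idle_period(only_grow_when_full, acks=10_000):
    """Trickle one packet at a time, then report the window a burst would see."""
    w = AimdWindow(only_grow_when_full=only_grow_when_full)
    for _ in range(acks):
        w.on_ack(packets_in_flight=1)
    return w.window


if __name__ == "__main__":
    print("guarded:  ", window_after_idle_period(True))
    print("unguarded:", window_after_idle_period(False))
```

Without the guard, a long trickle of traffic inflates the window far
beyond anything the path has been tested at, so the first real burst
goes out at a rate nobody has validated. With the guard, the window
stays where it was.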
>>
>> In the past while examining the throttle controls, I have suspected
>> that (with priority queues) the "90-seconds at full throttle"
>> constant might actually reduce to taking on too many concurrent chk
>> transfers for them all to complete on time.
>
> Why? IIRC we include a fudge factor in that calculation, admittedly
> it isn't very accurate and should be made more so by using stats on
> bandwidth usage...

Just that the CHKs all use the same throttle, so they all throttle-down
when we accept another CHK transfer.

>>> Do timeouts show up in simulation?
>>
>> I don't normally watch for them, I've started a new run with Accepted
>> & Fatal request timeouts being logged. So far nothing.
>
> Ok.

After running the simulator for two hours w/ ten nodes, I spot exactly
one Accepted timeout (17 minutes into the simulation). So the answer is
yes... timeouts still occur in the simulator.

>>> What can we do to debug this?
>>
>> Probably:
>> (1) simulated high ping times, like those seen in the public
>> network, at about the same rate,
>
> You mean bugs cause high ping times and high ping times cause
> timeouts?
>
>> (2) a message/link layer stress test complete with rekeying/
>> disconnects/and [busy/not-busy] spikes
>
> This would be a good idea, I dunno how much work would be involved?
>
> What can I usefully work on in this area? AFAICS:
> - The window-grows-while-unused bug.
> - More accurate bandwidth liability limiting.
> - Debug the not-forwarded detection and make assumeNATed false by
>   default. (Reduce baseload bandwidth usage).
>
> Anything else? You want to take any of these on?

I don't think I can take on a big project right now.

--
Robert Hailey
