On Jan 18, 2008, at 12:55 PM, Matthew Toseland wrote: >> I think that the fatal timeout problem should be nailed down before >> 0.7. >> >> From my investigation I think that it is caused by a problem in the >> link layer (Message->packet->Message) where Messages are either >> dropped or duplicated (albeit, somewhat rarely). > > Do you have any more detail?
Definitionally a fatal timeout is when the last node receives the Ack that the other node is working on it, but nothing else (except maybe non-local RejectOverloads). So that we received an ack, factually all we *know* is that a request handler began to run on the remote machine, but we did not get a responding MessageItem... In theory these are the only possible problems that would cause this in an environment where all the nodes are the same (simulator): (1) RequestHandler & Sender don't match each other. (2) The node finished, but the response was dropped between their RequestHandler and our MessageCore. (3) The node actually did take > 2 minutes to process the request (e.g. fetching from datastore, verify, forwarding queue, coalescing delays, etc). Although I have just recently been looking at the HTLs w/respect to fatal timeouts, I find it super-odd that (at least for this, my simulator setup) they always occur at htl==5 (?!?!). How totally bizarre! HTLs do not start at 5, so they are surely being decremented. It must be that the request stays at 10 to a point, back tracks to 5, and then fatal-timeout's. As best I can tell, all the fatal timeouts in the simulator occur with data which does not exist so it may end up being a whole-network search (a bunch of USK lookups). Certainly if any control packet is dropped in the chain a timeout is certain, and I have gotten a few accepted timeouts in the simulator, but they are much more rare. --Impossible message #1-- Jan 18, 2008 19:43:22:148 (freenet.node.PeerNode$SyncMessageCallback, UDP packet receiver for Darknet port 5008, ERROR): Acknowledged but not sent?! on freenet.node.PeerNode$SyncMessageCallback at 1398044 for freenet .node.DarknetPeerNode at [email protected]:5004 at c8af72...bd6c@12ca580 * To me this indicates a bug in the message system. * If we send a Message we don't think that we have sent, this might make all the sendSync()'s contributers to cause #3. * It does not appear consistently with fatal timeouts, so at best it is a symptom, not a cause. --Impossible message #2-- Jan 18, 2008 20:21:39:517 (freenet.node.RequestSender, RequestSender for UID -8258496701911223604, ERROR): request fatal-timeout (null) htl#5 after accept (1 messages; last=FNPRejectOverload) Jan 18, 2008 20:21:40:926 (freenet.node.RequestSender, RequestSender for UID -8258496701911223604, ERROR): request fatal-timeout (null) htl#5 after accept (1 messages; last=FNPRejectOverload) Jan 18, 2008 20:21:40:927 (freenet.node.RequestHandler, RequestHandler for UID -8258496701911223604, ERROR): requestsender took too long to respond to requestor (2m1s/6) Jan 18, 2008 20:21:41:030 (freenet.node.RequestSender, RequestSender for UID -8258496701911223604, ERROR): request fatal-timeout (null) htl#5 after accept (1 messages; last=FNPRejectOverload) Jan 18, 2008 20:21:41:031 (freenet.node.RequestHandler, RequestHandler for UID -8258496701911223604, ERROR): requestsender took too long to respond to requestor (2m1s/6) Jan 18, 2008 20:21:41:031 (freenet.node.RequestHandler, RequestHandler for UID -3246664023200145547, ERROR): requestsender took too long to respond to requestor (2m/6) Jan 18, 2008 20:21:46:211 (freenet.node.RequestHandler, RequestHandler for UID -8258496701911223604, ERROR): requestsender took too long to respond to requestor (2m6s/1) * I would expect that if one node FATAL_TIMEOUTs, the whole chain should stop (and all return TIMEOUT "/6", but the last one would have returned RNF "/1"). I suppose this means that some requestsender waited *longer* than FETCH_TIMEOUT, and a request handler translated a timeout to a RejectOverload, and then a node waiting even longer than FETCH_TIMEOUT tried to continue searching, but running out of peers returned RNF (???) [a whole 6 seconds more too]. * I would expect that fatal timeouts are cascading and therefore should see: fatal-timeout htl#5, htl#6, htl#7... etc. Not all #5. -- Dropped Messages makes sense, as the occasional AcceptedTimeout (which in a simulator should also be impossible). If the issue is duplicated packets, then it would have to be coupled with poor mutual exclusion on request senders (two running with the same UID, taking each others packets), at least a winnable race condition. That's all I've got so far. -- Robert Hailey -------------- next part -------------- An HTML attachment was scrubbed... URL: <https://emu.freenetproject.org/pipermail/devl/attachments/20080118/1e0c1f5d/attachment.html>
