If you can reproduce it in the simulator, hopefully you can fix it in the simulator ... surely you can prove whether or not a node sent a reply by looking at the logs?
On Friday 18 January 2008 20:50, Robert Hailey wrote:
> On Jan 18, 2008, at 12:55 PM, Matthew Toseland wrote:
>
> >> I think that the fatal timeout problem should be nailed down before
> >> 0.7.
> >>
> >> From my investigation I think that it is caused by a problem in the
> >> link layer (Message->packet->Message) where Messages are either
> >> dropped or duplicated (albeit somewhat rarely).
> >
> > Do you have any more detail?
>
> By definition, a fatal timeout is when the last node receives the Ack
> that the other node is working on it, but nothing else (except maybe
> non-local RejectOverloads).
>
> Given that we received an ack, all we actually *know* is that a request
> handler began to run on the remote machine, but we did not get a
> responding MessageItem.
>
> In theory, in an environment where all the nodes are the same (the
> simulator), these are the only possible problems that could cause this:
> (1) The RequestHandler and RequestSender don't match each other.
> (2) The node finished, but the response was dropped between their
> RequestHandler and our MessageCore.
> (3) The node actually did take more than 2 minutes to process the
> request (e.g. fetching from the datastore, verification, the
> forwarding queue, coalescing delays, etc.).
>
> Having just recently been looking at the HTLs with respect to fatal
> timeouts, I find it super-odd that (at least for this, my simulator
> setup) they always occur at htl==5. How totally bizarre! HTLs do not
> start at 5, so they are surely being decremented. It must be that the
> request stays at 10 up to a point, backtracks to 5, and then
> fatal-timeouts.
>
> As best I can tell, all the fatal timeouts in the simulator occur with
> data which does not exist, so they may end up being whole-network
> searches (a bunch of USK lookups).
>
> Certainly if any control packet is dropped in the chain, a timeout is
> certain. I have also seen a few accepted-timeouts in the simulator,
> but they are much rarer.
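The accepted-but-unanswered state that defines a fatal timeout can be sketched in Java. This is a hypothetical illustration, not Freenet's actual PeerNode/RequestSender code (the class and method names here are invented): a waiter records the Accepted ack, then blocks for up to the fetch timeout on a substantive reply. If only the ack ever arrives, causes (1)-(3) above are indistinguishable from the sender's point of view.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Minimal sketch (hypothetical names, not the real Freenet API) of the
 * state the mail describes: the sender has seen the remote node's
 * Accepted ack, but must still wait for an actual reply. If only the
 * ack arrives within the fetch timeout, the request is a fatal timeout.
 */
class RequestWaiter {
    // Roughly 2 minutes, matching the "2m1s/6" figures in the logs.
    static final long FETCH_TIMEOUT_MS = 2 * 60 * 1000;

    private final ReentrantLock lock = new ReentrantLock();
    private final Condition changed = lock.newCondition();
    private boolean accepted = false; // remote RequestHandler started
    private Object reply = null;      // the responding MessageItem, if any

    /** Called when the Accepted ack arrives from the peer. */
    void onAccepted() {
        lock.lock();
        try { accepted = true; changed.signalAll(); } finally { lock.unlock(); }
    }

    /** Called when a substantive reply (a MessageItem) arrives. */
    void onReply(Object messageItem) {
        lock.lock();
        try { reply = messageItem; changed.signalAll(); } finally { lock.unlock(); }
    }

    /** Returns the reply, or null on a fatal timeout (accepted but never answered). */
    Object waitForReply() throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(FETCH_TIMEOUT_MS);
        lock.lock();
        try {
            while (reply == null) {
                long left = deadline - System.nanoTime();
                if (left <= 0) {
                    // We only know the handler started (accepted == true);
                    // whether the mismatch, the drop, or the slow node
                    // (causes 1-3) is to blame is invisible from here.
                    return null;
                }
                changed.awaitNanos(left); // while-loop guards spurious wakeups
            }
            return reply;
        } finally {
            lock.unlock();
        }
    }
}
```

The while loop around `awaitNanos` is the standard idiom for condition variables: it rechecks the predicate after every wakeup, spurious or not.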
> --Impossible message #1--
>
> Jan 18, 2008 19:43:22:148 (freenet.node.PeerNode$SyncMessageCallback,
> UDP packet receiver for Darknet port 5008, ERROR): Acknowledged but
> not sent?! on freenet.node.PeerNode$SyncMessageCallback at 1398044 for
> freenet.node.DarknetPeerNode at [email protected]:5004 at c8af72...bd6c@12ca580
>
> * To me this indicates a bug in the message system.
> * If a Message is acknowledged that we don't think we have sent, this
> might make all the sendSync()s contributors to cause (3).
> * It does not appear consistently with fatal timeouts, so at best it
> is a symptom, not a cause.
>
> --Impossible message #2--
>
> Jan 18, 2008 20:21:39:517 (freenet.node.RequestSender, RequestSender
> for UID -8258496701911223604, ERROR): request fatal-timeout (null)
> htl#5 after accept (1 messages; last=FNPRejectOverload)
> Jan 18, 2008 20:21:40:926 (freenet.node.RequestSender, RequestSender
> for UID -8258496701911223604, ERROR): request fatal-timeout (null)
> htl#5 after accept (1 messages; last=FNPRejectOverload)
> Jan 18, 2008 20:21:40:927 (freenet.node.RequestHandler, RequestHandler
> for UID -8258496701911223604, ERROR): requestsender took too long to
> respond to requestor (2m1s/6)
> Jan 18, 2008 20:21:41:030 (freenet.node.RequestSender, RequestSender
> for UID -8258496701911223604, ERROR): request fatal-timeout (null)
> htl#5 after accept (1 messages; last=FNPRejectOverload)
> Jan 18, 2008 20:21:41:031 (freenet.node.RequestHandler, RequestHandler
> for UID -8258496701911223604, ERROR): requestsender took too long to
> respond to requestor (2m1s/6)
> Jan 18, 2008 20:21:41:031 (freenet.node.RequestHandler, RequestHandler
> for UID -3246664023200145547, ERROR): requestsender took too long to
> respond to requestor (2m/6)
> Jan 18, 2008 20:21:46:211 (freenet.node.RequestHandler, RequestHandler
> for UID -8258496701911223604, ERROR): requestsender took too long to
> respond to requestor (2m6s/1)
>
> * I would expect that if one node FATAL_TIMEOUTs, the whole
> chain should stop (and all return TIMEOUT "/6", but the last one would
> have returned RNF "/1"). I suppose this means that some requestsender
> waited *longer* than FETCH_TIMEOUT, a request handler translated the
> timeout into a RejectOverload, and then a node waiting even longer than
> FETCH_TIMEOUT tried to continue searching but, running out of peers,
> returned RNF (?) [a whole 6 seconds more, too].
> * I would expect fatal timeouts to cascade, and therefore to see
> fatal-timeout htl#5, htl#6, htl#7... etc. Not all at htl#5.
>
> --
>
> Dropped Messages would make sense of this, as would the occasional
> AcceptedTimeout (which in a simulator should also be impossible).
>
> If the issue is duplicated packets, then it would have to be coupled
> with poor mutual exclusion on request senders (two running with the
> same UID, taking each other's packets); at least a winnable race
> condition.
>
> That's all I've got so far.
>
> --
> Robert Hailey
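The "poor mutual exclusion on request senders" hypothesis above implies a concrete invariant: at most one sender may run per request UID, so that a duplicated packet cannot start a second sender that races the first for its replies. A minimal sketch of such a guard, assuming a hypothetical `UidRegistry` class (Freenet's real UID-locking code may look quite different), built on `ConcurrentHashMap`:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Hypothetical sketch of per-UID mutual exclusion for request senders.
 * putIfAbsent makes registration atomic: if a duplicated packet tries
 * to start a second sender for the same UID, it loses the race and is
 * rejected instead of consuming the first sender's replies.
 */
class UidRegistry {
    private final ConcurrentMap<Long, Object> running = new ConcurrentHashMap<>();

    /** Returns true iff the caller won the UID and may start a sender. */
    boolean tryLock(long uid, Object sender) {
        return running.putIfAbsent(uid, sender) == null;
    }

    /** Releases the UID, but only if this sender still owns it, so a
     *  stale unlock cannot free a UID that a later request has
     *  legitimately claimed. */
    void unlock(long uid, Object sender) {
        running.remove(uid, sender);
    }
}
```

The two-argument `remove(key, value)` is what makes `unlock` safe against a late-arriving duplicate: it only removes the mapping if the value still matches the caller's own sender.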
