If you can reproduce it in the simulator hopefully you can fix it in the 
simulator ... surely you can prove whether or not there was a reply sent by a 
node by looking at the logs?

On Friday 18 January 2008 20:50, Robert Hailey wrote:
> 
> On Jan 18, 2008, at 12:55 PM, Matthew Toseland wrote:
> 
> >> I think that the fatal timeout problem should be nailed down before  
> >> 0.7.
> >>
> >> From my investigation I think that it is caused by a problem in the
> >> link layer (Message->packet->Message) where Messages are either
> >> dropped or duplicated (albeit, somewhat rarely).
> >
> > Do you have any more detail?
> 
> Definitionally a fatal timeout is when the last node receives the Ack  
> that the other node is working on it, but nothing else (except maybe  
> non-local RejectOverloads).
> 
> So that we received an ack, factually all we *know* is that a request  
> handler began to run on the remote machine, but we did not get a  
> responding MessageItem...
> In theory these are the only possible problems that would cause this  
> in an environment where all the nodes are the same (simulator):
> (1) RequestHandler & Sender don't match each other.
> (2) The node finished, but the response was dropped between their  
> RequestHandler and our MessageCore.
> (3) The node actually did take > 2 minutes to process the request  
> (e.g. fetching from datastore, verify, forwarding queue, coalescing  
> delays, etc).
> 
> Although I have just recently been looking at the HTLs w/respect to  
> fatal timeouts, I find it super-odd that (at least for this, my  
> simulator setup) they always occur at htl==5 (?!?!). How totally  
> bizarre! HTLs do not start at 5, so they are surely being decremented.  
> It must be that the request stays at 10 to a point, back tracks to 5,  
> and then fatal-timeout's.
> 
> As best I can tell, all the fatal timeouts in the simulator occur with  
> data which does not exist so it may end up being a whole-network  
> search (a bunch of USK lookups).
> 
> Certainly if any control packet is dropped in the chain a timeout is  
> certain, and I have gotten a few accepted timeouts in the simulator,  
> but they are much more rare.
> 
> --Impossible message #1--
> Jan 18, 2008 19:43:22:148 (freenet.node.PeerNode$SyncMessageCallback,  
> UDP packet receiver for Darknet port 5008, ERROR): Acknowledged but  
> not sent?! on freenet.node.PeerNode$SyncMessageCallback at 1398044 for  
> freenet 
> .node.DarknetPeerNode at [email protected]:5004 at c8af72...bd6c@12ca580
> 
> * To me this indicates a bug in the message system.
> * If we send a Message we don't think that we have sent, this might  
> make all the sendSync()'s contributers to cause #3.
> * It does not appear consistently with fatal timeouts, so at best it  
> is a symptom, not a cause.
> 
> --Impossible message #2--
> Jan 18, 2008 20:21:39:517 (freenet.node.RequestSender, RequestSender  
> for UID -8258496701911223604, ERROR): request fatal-timeout (null)  
> htl#5 after accept (1 messages; last=FNPRejectOverload)
> Jan 18, 2008 20:21:40:926 (freenet.node.RequestSender, RequestSender  
> for UID -8258496701911223604, ERROR): request fatal-timeout (null)  
> htl#5 after accept (1 messages; last=FNPRejectOverload)
> Jan 18, 2008 20:21:40:927 (freenet.node.RequestHandler, RequestHandler  
> for UID -8258496701911223604, ERROR): requestsender took too long to  
> respond to requestor (2m1s/6)
> Jan 18, 2008 20:21:41:030 (freenet.node.RequestSender, RequestSender  
> for UID -8258496701911223604, ERROR): request fatal-timeout (null)  
> htl#5 after accept (1 messages; last=FNPRejectOverload)
> Jan 18, 2008 20:21:41:031 (freenet.node.RequestHandler, RequestHandler  
> for UID -8258496701911223604, ERROR): requestsender took too long to  
> respond to requestor (2m1s/6)
> Jan 18, 2008 20:21:41:031 (freenet.node.RequestHandler, RequestHandler  
> for UID -3246664023200145547, ERROR): requestsender took too long to  
> respond to requestor (2m/6)
> Jan 18, 2008 20:21:46:211 (freenet.node.RequestHandler, RequestHandler  
> for UID -8258496701911223604, ERROR): requestsender took too long to  
> respond to requestor (2m6s/1)
> 
> * I would expect that if one node FATAL_TIMEOUTs, the whole chain  
> should stop (and all return TIMEOUT "/6", but the last one would have  
> returned RNF "/1"). I suppose this means that some requestsender  
> waited *longer* than FETCH_TIMEOUT, and a request handler translated a  
> timeout to a RejectOverload, and then a node waiting even longer than  
> FETCH_TIMEOUT tried to continue searching, but running out of peers  
> returned RNF (???) [a whole 6 seconds more too].
> * I would expect that fatal timeouts are cascading and therefore  
> should see: fatal-timeout htl#5, htl#6, htl#7... etc. Not all #5.
> 
> --
> 
> Dropped Messages makes sense, as the occasional AcceptedTimeout (which  
> in a simulator should also be impossible).
> 
> If the issue is duplicated packets, then it would have to be coupled  
> with poor mutual exclusion on request senders (two running with the  
> same UID, taking each others packets), at least a winnable race  
> condition.
> 
> That's all I've got so far.
> 
> --
> Robert Hailey
> 
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: 
<https://emu.freenetproject.org/pipermail/devl/attachments/20080118/607a11b0/attachment.pgp>

Reply via email to