On Jan 18, 2008, at 12:55 PM, Matthew Toseland wrote:

>> I think that the fatal timeout problem should be nailed down before  
>> 0.7.
>>
>> From my investigation I think that it is caused by a problem in the
>> link layer (Message->packet->Message) where Messages are either
>> dropped or duplicated (albeit, somewhat rarely).
>
> Do you have any more detail?

Definitionally a fatal timeout is when the last node receives the Ack  
that the other node is working on it, but nothing else (except maybe  
non-local RejectOverloads).

So that we received an ack, factually all we *know* is that a request  
handler began to run on the remote machine, but we did not get a  
responding MessageItem...
In theory these are the only possible problems that would cause this  
in an environment where all the nodes are the same (simulator):
(1) RequestHandler & Sender don't match each other.
(2) The node finished, but the response was dropped between their  
RequestHandler and our MessageCore.
(3) The node actually did take > 2 minutes to process the request  
(e.g. fetching from datastore, verify, forwarding queue, coalescing  
delays, etc).

Although I have just recently been looking at the HTLs w/respect to  
fatal timeouts, I find it super-odd that (at least for this, my  
simulator setup) they always occur at htl==5 (?!?!). How totally  
bizarre! HTLs do not start at 5, so they are surely being decremented.  
It must be that the request stays at 10 to a point, back tracks to 5,  
and then fatal-timeout's.

As best I can tell, all the fatal timeouts in the simulator occur with  
data which does not exist so it may end up being a whole-network  
search (a bunch of USK lookups).

Certainly if any control packet is dropped in the chain a timeout is  
certain, and I have gotten a few accepted timeouts in the simulator,  
but they are much more rare.

--Impossible message #1--
Jan 18, 2008 19:43:22:148 (freenet.node.PeerNode$SyncMessageCallback,  
UDP packet receiver for Darknet port 5008, ERROR): Acknowledged but  
not sent?! on freenet.node.PeerNode$SyncMessageCallback at 1398044 for  
freenet 
.node.DarknetPeerNode at [email protected]:5004 at c8af72...bd6c@12ca580

* To me this indicates a bug in the message system.
* If we send a Message we don't think that we have sent, this might  
make all the sendSync()'s contributers to cause #3.
* It does not appear consistently with fatal timeouts, so at best it  
is a symptom, not a cause.

--Impossible message #2--
Jan 18, 2008 20:21:39:517 (freenet.node.RequestSender, RequestSender  
for UID -8258496701911223604, ERROR): request fatal-timeout (null)  
htl#5 after accept (1 messages; last=FNPRejectOverload)
Jan 18, 2008 20:21:40:926 (freenet.node.RequestSender, RequestSender  
for UID -8258496701911223604, ERROR): request fatal-timeout (null)  
htl#5 after accept (1 messages; last=FNPRejectOverload)
Jan 18, 2008 20:21:40:927 (freenet.node.RequestHandler, RequestHandler  
for UID -8258496701911223604, ERROR): requestsender took too long to  
respond to requestor (2m1s/6)
Jan 18, 2008 20:21:41:030 (freenet.node.RequestSender, RequestSender  
for UID -8258496701911223604, ERROR): request fatal-timeout (null)  
htl#5 after accept (1 messages; last=FNPRejectOverload)
Jan 18, 2008 20:21:41:031 (freenet.node.RequestHandler, RequestHandler  
for UID -8258496701911223604, ERROR): requestsender took too long to  
respond to requestor (2m1s/6)
Jan 18, 2008 20:21:41:031 (freenet.node.RequestHandler, RequestHandler  
for UID -3246664023200145547, ERROR): requestsender took too long to  
respond to requestor (2m/6)
Jan 18, 2008 20:21:46:211 (freenet.node.RequestHandler, RequestHandler  
for UID -8258496701911223604, ERROR): requestsender took too long to  
respond to requestor (2m6s/1)

* I would expect that if one node FATAL_TIMEOUTs, the whole chain  
should stop (and all return TIMEOUT "/6", but the last one would have  
returned RNF "/1"). I suppose this means that some requestsender  
waited *longer* than FETCH_TIMEOUT, and a request handler translated a  
timeout to a RejectOverload, and then a node waiting even longer than  
FETCH_TIMEOUT tried to continue searching, but running out of peers  
returned RNF (???) [a whole 6 seconds more too].
* I would expect that fatal timeouts are cascading and therefore  
should see: fatal-timeout htl#5, htl#6, htl#7... etc. Not all #5.

--

Dropped Messages makes sense, as the occasional AcceptedTimeout (which  
in a simulator should also be impossible).

If the issue is duplicated packets, then it would have to be coupled  
with poor mutual exclusion on request senders (two running with the  
same UID, taking each others packets), at least a winnable race  
condition.

That's all I've got so far.

--
Robert Hailey

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
<https://emu.freenetproject.org/pipermail/devl/attachments/20080118/1e0c1f5d/attachment.html>

Reply via email to