On Jan 25, 2008, at 12:28 PM, Matthew Toseland wrote:

> On Friday 25 January 2008 16:49, Robert Hailey wrote:
>>
>> On Jan 24, 2008, at 1:07 PM, Matthew Toseland wrote:
>>
>>>> Well, I do think that this problem *generally* has gone away. A large
>>>> part of the timeouts may have been request coalescing deadlocks. In my
>>>> logs, I no longer see that "requestsender took too long to respond to
>>>> requestor (+2m)", but when I do see that log statement fire, it is
>>>> huge!
>>>>
>>>> Jan 24, 2008 17:05:11:767 (freenet.node.RequestHandler, RequestSender
>>>> for UID 5637402349040790252, ERROR):
>>>> requestsender took too long to respond to requestor (16m10s/3)
>>>> Jan 24, 2008 17:05:14:446 (freenet.node.RequestHandler, RequestSender
>>>> for UID 98827504771122964, ERROR):
>>>> requestsender took too long to respond to requestor (16m8s/3)
>>>> Jan 24, 2008 17:05:14:447 (freenet.node.RequestHandler, RequestSender
>>>> for UID 774454676209630, ERROR):
>>>> requestsender took too long to respond to requestor (16m8s/3)
>>>> Jan 24, 2008 17:23:00:203 (freenet.node.RequestHandler, RequestSender
>>>> for UID 7341907878853950087, ERROR):
>>>> requestsender took too long to respond to requestor (34m33s/4)
>>>>
>>>> Half an hour for one request? Good night!
>>>
>>> This is suspicious: they are all roughly the same period except the
>>> last. I suggest you set log level minor and investigate what happened
>>> by searching for the UID.
>>
>> I've let it run overnight, and the delays keep growing (~15 hours).
>> After poring over a thread dump, I think that you actually just
>> fixed this problem (waiting on chk transfers) with: r17272, r17275.
>
> I think so, although actually in a later commit. Currently debugging
> the fix. :)
>>
>> Nice catch!
>>
>> This is the same node affected by bug#2006, I wonder if this is the
>> root cause.
>
> Maybe...
>
> I wonder if we can ship alpha 2 without this? If we need to include it
> then we need to release 1104 *now* ...

I think it's a big bug, particularly for busy nodes. r17235, which uses
unsmoothed ping times, may need to be reverted before the build.
Before your partial fix (r17275), r17235 *does* allow a node affected
by bug#2006 to keep operating, but it may reject too many requests
from common peers (whose ping times are above normal).
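
To illustrate the smoothed-versus-unsmoothed concern, here is a minimal
sketch. It is not Freenet's code: it assumes "smoothed" means something
like an exponential moving average, and the function names, the 0.2
weight, the 700 ms threshold, and the sample values are all hypothetical.
It only shows how a single slow ping can trip a threshold when raw
samples are compared, while the smoothed value stays below it.

```python
def smoothed(samples, alpha=0.2):
    """Exponentially weighted moving average of ping samples (ms).

    `samples` must be non-empty; `alpha` is a hypothetical smoothing weight.
    """
    avg = samples[0]
    out = [avg]
    for s in samples[1:]:
        # Move the running average a fraction `alpha` toward the new sample.
        avg += alpha * (s - avg)
        out.append(avg)
    return out


def count_rejections(pings, threshold_ms):
    """Count how many ping values exceed the (hypothetical) reject threshold."""
    return sum(1 for p in pings if p > threshold_ms)


# Hypothetical peer: normally 200 ms, with one 1500 ms spike.
raw = [200, 200, 1500, 200]
THRESHOLD_MS = 700  # assumed value, for illustration only

raw_rejects = count_rejections(raw, THRESHOLD_MS)
smooth_rejects = count_rejections(smoothed(raw), THRESHOLD_MS)
print(raw_rejects, smooth_rejects)
```

With these made-up numbers the raw comparison rejects once (on the spike)
and the smoothed comparison never does, since the average only rises to
460 ms. Whether that trade-off is right for a node under bug#2006 is a
separate question.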

--
Robert Hailey

