On Jan 25, 2008, at 12:28 PM, Matthew Toseland wrote:

> On Friday 25 January 2008 16:49, Robert Hailey wrote:
>>
>> On Jan 24, 2008, at 1:07 PM, Matthew Toseland wrote:
>>
>>>> Well, I do think that this problem *generally* has gone away. A large
>>>> part of the timeouts may have been request coalescing deadlocks. In my
>>>> logs, I no longer see that "requestsender took to long to respond to
>>>> requestor (+2m)", but when I do see that log statement fire, it is
>>>> huge!
>>>>
>>>> Jan 24, 2008 17:05:11:767 (freenet.node.RequestHandler, RequestSender for UID 5637402349040790252, ERROR):
>>>> requestsender took too long to respond to requestor (16m10s/3)
>>>> Jan 24, 2008 17:05:14:446 (freenet.node.RequestHandler, RequestSender for UID 98827504771122964, ERROR):
>>>> requestsender took too long to respond to requestor (16m8s/3)
>>>> Jan 24, 2008 17:05:14:447 (freenet.node.RequestHandler, RequestSender for UID 774454676209630, ERROR):
>>>> requestsender took too long to respond to requestor (16m8s/3)
>>>> Jan 24, 2008 17:23:00:203 (freenet.node.RequestHandler, RequestSender for UID 7341907878853950087, ERROR):
>>>> requestsender took too long to respond to requestor (34m33s/4)
>>>>
>>>> Half an hour for one request? Good night!
>>>
>>> This is suspicious, they are all roughly the same period except the
>>> last. I suggest you set log level minor and investigate what happened
>>> by searching for the UID.
>>
>> I've let it run overnight, and they increase all the more (~15 hours).
>> After pouring through a thread dump, I think that you actually just
>> fixed this problem (waiting on chk transfers) with: r17272, r17275.
>
> I think so, although actually in a later commit. Currently debugging the
> fix. :)
>>
>> Nice catch!
>>
>> This is the same node affected by bug#2006, I wonder if this is the
>> root cause.
>
> Maybe...
>
> I wonder if we can ship alpha 2 without this? If we need to include it
> then we need to release 1104 *now* ...
I think it's a big bug, particularly for busy nodes.

r17235 (using unsmoothed ping times) may need to be reverted before the build. Before your partial fix (r17275), r17235 *does* allow a node that has fallen victim to bug#2006 to keep operating, but it may reject too many requests from common peers (whose ping times are higher than normal).

--
Robert Hailey
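
To make the concern about unsmoothed ping times concrete, here is a minimal sketch of the general effect, not the actual r17235 code. It assumes a node rejects a request whenever its current ping estimate exceeds a fixed threshold. The class name, method names, the 1500 ms threshold, and the 0.2 smoothing weight are all hypothetical and chosen only for illustration. With raw samples, every isolated spike triggers a rejection; with an exponential moving average, the same spikes are damped.

```java
// Hypothetical sketch; none of these names or constants come from the Freenet source.
final class PingSmoothingSketch {
    // Assumed rejection threshold in milliseconds (illustrative only).
    static final double REJECT_THRESHOLD_MS = 1500.0;
    // Assumed weight given to each new sample in the moving average.
    static final double ALPHA = 0.2;

    /** Counts samples at which the raw (unsmoothed) ping exceeds the threshold. */
    static int rejectionsUnsmoothed(double[] pingsMs) {
        int rejected = 0;
        for (double p : pingsMs) {
            if (p > REJECT_THRESHOLD_MS) {
                rejected++;
            }
        }
        return rejected;
    }

    /** Counts samples at which an exponential moving average exceeds the threshold. */
    static int rejectionsSmoothed(double[] pingsMs) {
        int rejected = 0;
        // Seed the average with the first sample so it does not start at zero.
        double avg = pingsMs.length > 0 ? pingsMs[0] : 0.0;
        for (double p : pingsMs) {
            avg = ALPHA * p + (1 - ALPHA) * avg;
            if (avg > REJECT_THRESHOLD_MS) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        // Mostly normal ping times with two isolated spikes.
        double[] pings = {200, 250, 4000, 220, 210, 3800, 230};
        System.out.println("unsmoothed rejections: " + rejectionsUnsmoothed(pings));
        System.out.println("smoothed rejections:   " + rejectionsSmoothed(pings));
    }
}
```

The trade-off runs the other way too: a smoothed estimate also takes longer to come back down after a sustained bad period, which may be why the unsmoothed version lets a node hit by bug#2006 keep operating.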
