[ http://issues.apache.org/jira/browse/NUTCH-344?page=all ]
Sami Siren resolved NUTCH-344.
------------------------------

    Fix Version/s: 0.8.1
                   0.9.0
       Resolution: Fixed

I just committed this to the 0.8 branch and trunk, thanks Greg!

> Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-344
>                 URL: http://issues.apache.org/jira/browse/NUTCH-344
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8, 0.9.0, 0.8.1
>         Environment: All
>            Reporter: Greg Kim
>             Fix For: 0.8.1, 0.9.0
>
>         Attachments: cleanExpiredServerBlocks.patch
>
>
> The recent change to the following code in HttpBase.java tends to block
> fetcher threads while one thread busy-waits...
>
>   private static void cleanExpiredServerBlocks() {
>     synchronized (BLOCKED_ADDR_TO_TIME) {
>       while (!BLOCKED_ADDR_QUEUE.isEmpty()) {   <===== LINE 3:
>         String host = (String) BLOCKED_ADDR_QUEUE.getLast();
>         long time = ((Long) BLOCKED_ADDR_TO_TIME.get(host)).longValue();
>         if (time <= System.currentTimeMillis()) {
>           BLOCKED_ADDR_TO_TIME.remove(host);
>           BLOCKED_ADDR_QUEUE.removeLast();
>         }
>       }
>     }
>   }
>
> LINE 3: As long as there are *any* entries in BLOCKED_ADDR_QUEUE, the
> thread that first enters this block busy-waits until the queue becomes empty,
> while all other threads block on the synchronized block. This leads to
> extremely poor fetcher performance.
>
> Since the checkin to respect crawlDelay in robots.txt, we are no longer
> guaranteed that the BLOCKED_ADDR_TO_TIME queue is a FIFO list. The simple fix
> is to iterate the queue once rather than busy-waiting...
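
The "iterate the queue once" idea from the report could look roughly like the sketch below. This is not the attached cleanExpiredServerBlocks.patch, whose contents are not shown in this message. The class name ServerBlockCleaner, the generic field types, and the explicit `now` parameter (added so the method can be exercised with a fixed clock) are assumptions made for illustration; only the two field names and the method name come from the report.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Map;

// Hypothetical stand-in for the relevant part of HttpBase; not the real class.
final class ServerBlockCleaner {

    // Same names as in the report; the types are assumed.
    static final Map<String, Long> BLOCKED_ADDR_TO_TIME = new HashMap<>();
    static final LinkedList<String> BLOCKED_ADDR_QUEUE = new LinkedList<>();

    // Single pass over the queue: drop every entry whose block has expired and
    // leave the rest in place. The loop ends after one traversal, so the lock
    // is held only briefly even when unexpired entries remain.
    static void cleanExpiredServerBlocks(long now) {
        synchronized (BLOCKED_ADDR_TO_TIME) {
            Iterator<String> it = BLOCKED_ADDR_QUEUE.iterator();
            while (it.hasNext()) {
                String host = it.next();
                Long time = BLOCKED_ADDR_TO_TIME.get(host);
                // A missing map entry is treated as expired (an assumption).
                if (time == null || time <= now) {
                    BLOCKED_ADDR_TO_TIME.remove(host);
                    it.remove(); // safe removal during iteration
                }
            }
        }
    }
}
```

Because every entry is checked, the pass does not depend on the queue being ordered by expiry time, which matters once crawlDelay makes the block durations differ per host. A caller would pass System.currentTimeMillis() as `now`.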