[ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507535 ]
Doğacan Güney commented on NUTCH-503:
-------------------------------------

Nice to hear, Emmanuel.

I believe this is ready for committing, but, Vishal, can you add a test case for this? (Though, I am not sure how we can add a test case since this bug only occurs in distributed setups).

> Generator exits incorrectly for small fetchlists
> -------------------------------------------------
>
>                 Key: NUTCH-503
>                 URL: https://issues.apache.org/jira/browse/NUTCH-503
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 0.8, 0.8.1, 0.9.0
>         Environment: Fedora Core 2, JDK 1.6
>            Reporter: Vishal Shah
>             Fix For: 0.8.2
>
>         Attachments: emptyfetchlist.patch, emptyfetchlist.patch
>
>
> I think I found the reason why the generator returns with an empty
> fetchlist for small fetchsizes.
>
> After the first job finishes running, the generator checks the following
> condition to see if it got an empty list:
>
> if (readers == null || readers.length == 0 || !readers[0].next(new FloatWritable())) {
>
> The third condition is incorrect here. In some cases, esp. for small
> fetchlists, the first partition might be empty, but some other partition(s)
> might contain urls. In this case, the Generator is incorrectly assuming that
> all partitions are empty by just looking at the first. This problem could
> also occur when all URLs in the fetchlist are from the same host (or from a
> very small number of hosts, or from a number of hosts that all map to a small
> number of partitions).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
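
The faulty check described in the issue can be illustrated with a small, self-contained sketch. It is not the attached patch and does not use the Hadoop or Nutch APIs: plain `Iterator<String>` objects stand in for the per-partition readers, and the names `FetchlistCheck`, `isEmptyFirstOnly`, `isEmptyAllPartitions`, and `sample` are hypothetical, chosen only for this example. The assumed fix is simply to treat the fetchlist as empty only when every partition is empty.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class FetchlistCheck {

    /**
     * Mirrors the reported condition: only the first partition is inspected.
     * (Assumption: hasNext() stands in for readers[0].next(new FloatWritable()).)
     */
    static boolean isEmptyFirstOnly(List<Iterator<String>> readers) {
        return readers == null || readers.isEmpty() || !readers.get(0).hasNext();
    }

    /** Sketch of a corrected check: empty only if no partition holds a record. */
    static boolean isEmptyAllPartitions(List<Iterator<String>> readers) {
        if (readers == null || readers.isEmpty()) {
            return true;
        }
        for (Iterator<String> reader : readers) {
            if (reader.hasNext()) {
                return false; // at least one partition contains a URL
            }
        }
        return true;
    }

    /** Two partitions: the first is empty, the second holds one URL. */
    static List<Iterator<String>> sample() {
        return Arrays.asList(
            Collections.<String>emptyIterator(),
            Arrays.asList("http://example.com/").iterator());
    }

    public static void main(String[] args) {
        System.out.println("first-only check says empty: " + isEmptyFirstOnly(sample()));
        System.out.println("all-partitions check says empty: " + isEmptyAllPartitions(sample()));
    }
}
```

With the sample input, the first-only check reports an empty fetchlist even though the second partition holds a URL, which is the behaviour the issue describes for small fetchlists or fetchlists dominated by a few hosts. One difference from the real code: `hasNext()` does not consume a record, whereas a reader's `next(...)` call does.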