[ 
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507144
 ] 

Vishal Shah commented on NUTCH-503:
-----------------------------------

Hi Emmanuel,

   Can you please dump the contents of your crawldb with the readdb command 
after injecting your urls? Were the urls injected into the db in the first 
place? Your urlfilters may be filtering them out, or there may be some other 
problem, especially since the third test you did works. It would help to see 
the contents of the crawldb after inject and before generate in each case.


> Generator exits incorrectly for small fetchlists 
> -------------------------------------------------
>
>                 Key: NUTCH-503
>                 URL: https://issues.apache.org/jira/browse/NUTCH-503
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 0.8, 0.8.1, 0.9.0
>         Environment: Fedora Core 2, JDK 1.6
>            Reporter: Vishal Shah
>             Fix For: 0.8.2
>
>         Attachments: emptyfetchlist.patch, emptyfetchlist.patch
>
>
>    I think I found the reason why the generator returns with an empty 
> fetchlist for small fetchsizes. 
>  
>    After the first job finishes running, the generator checks the following 
> condition to see if it got an empty list:
>  
>     if (readers == null || readers.length == 0 || !readers[0].next(new FloatWritable())) {
>  
>   The third condition is incorrect here. In some cases, especially for small 
> fetchlists, the first partition might be empty while other partition(s) 
> contain urls. The Generator then incorrectly assumes that all partitions 
> are empty after looking only at the first. This problem could also occur 
> when all URLs in the fetchlist are from the same host (or from a very small 
> number of hosts, or from a number of hosts that all map to a small number 
> of partitions).
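
To illustrate the point in the description above, here is a minimal,
self-contained sketch of the difference between checking only the first
partition and checking all of them. It is not the attached patch and does not
use the Hadoop classes: PartitionReader, readerOver, isEmptyFirstOnly,
isEmptyAll and samplePartitions are hypothetical names made up for this
example, with PartitionReader standing in for the per-partition reader whose
next() call appears in the condition.

```java
import java.util.Iterator;
import java.util.List;

public class FetchlistCheck {

    /** Hypothetical stand-in for a per-partition reader (not the Hadoop class). */
    interface PartitionReader {
        /** Returns true if an entry was read, false if the partition is exhausted. */
        boolean next();
    }

    /** Builds a reader over an in-memory list of urls (illustration only). */
    static PartitionReader readerOver(List<String> urls) {
        Iterator<String> it = urls.iterator();
        return () -> {
            if (!it.hasNext()) {
                return false;
            }
            it.next();
            return true;
        };
    }

    /** Mirrors the condition from the report: only the first partition is inspected. */
    static boolean isEmptyFirstOnly(PartitionReader[] readers) {
        return readers == null || readers.length == 0 || !readers[0].next();
    }

    /** Assumed alternative: the fetchlist is empty only if every partition is empty. */
    static boolean isEmptyAll(PartitionReader[] readers) {
        if (readers == null) {
            return true;
        }
        for (PartitionReader reader : readers) {
            if (reader.next()) {
                return false; // found at least one entry, so the list is not empty
            }
        }
        return true;
    }

    /** Two partitions: the first is empty, the second holds two urls. */
    static PartitionReader[] samplePartitions() {
        return new PartitionReader[] {
            readerOver(List.of()),
            readerOver(List.of("http://example.com/a", "http://example.com/b")),
        };
    }

    public static void main(String[] args) {
        // Fresh readers for each check, because next() consumes an entry.
        System.out.println("first-only check says empty: "
                + isEmptyFirstOnly(samplePartitions()));
        System.out.println("all-partitions check says empty: "
                + isEmptyAll(samplePartitions()));
    }
}
```

With the sample data, the first-only check reports an empty fetchlist even
though the second partition holds urls, while the all-partitions check does
not. Note that next() consumes an entry, so a real fix would have to account
for the readers having been advanced.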

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
