Try
bin/nutch readdb crawlbooks/crawldb -stats
and see whether the crawldb contains more than ~2500 URLs.

Are you running updatedb after each fetch?
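
If not, URLs discovered during fetching never make it back into the crawldb, so each generate round keeps picking from roughly the same small set. For reference, a rough sketch of one full round using your crawlbooks paths (the segment name below is only a placeholder for whatever generate actually creates):

bin/nutch generate crawlbooks/crawldb crawlbooks/segments -topN 15000
segment=crawlbooks/segments/20070509104954        # the segment generate just created
bin/nutch fetch $segment
bin/nutch updatedb crawlbooks/crawldb $segment    # fold fetch results back into the crawldb
bin/nutch readdb crawlbooks/crawldb -stats        # the URL count should grow after this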

- Espen

derevo wrote:
> hi, 
> 
> I have two servers running Nutch 0.9 and Hadoop.
> 
> I generate a segment:
> 
> bin/nutch generate crawlbooks/crawldb crawlbooks/segments -topN 15000
> 
> In mapred-default.xml I have:
> 
>   <property>
>     <name>mapred.map.tasks</name>
>     <value>2</value>
>   </property>
> 
>   <property>
>     <name>mapred.reduce.tasks</name>
>     <value>2</value>
>   </property>
> 
> 
> Then I run the fetch:
> 
> bin/nutch fetch $segment
> 
> After fetching:
> 
> $ bin/nutch readseg -list $segment
> 
> NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
> 20070509104954  7500       2007-05-09T10:54:33  2007-05-09T10:56:26  2470     2464
> 
> 
> I have also tried -topN 50000 and 100000,
> but the FETCHED count always ends up around 2400-2600.
> 
> I am fetching only my own host; the injected links are of the form
> 
> http://myhost.com/arc/*.txt 
> 
> Thanks


