>Try
>bin/nutch readdb crawlbooks/crawldb -stats
>and see if there are more URLs than ~2500
>Are you doing updatedb after a fetch?
-> Espen


bin/nutch readdb crawlbooks/crawldb -stats
CrawlDb statistics start: crawlbooks/crawldb
Statistics for CrawlDb: crawlbooks/crawldb
TOTAL urls:     152437
retry 0:        152437
min score:      1.0
avg score:      1.0
max score:      1.0
status 1 (db_unfetched):        112795
status 2 (db_fetched):  39640
status 3 (db_gone):     2
CrawlDb statistics: done


Then I try to generate a segment:

bin/nutch generate crawlbooks/crawldb crawlbooks/segments -topN 20000
NAME            GENERATED       FETCHER START           FETCHER END             FETCHED PARSED
20070510040137  10000
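Not sure if it is related, but even with -topN 20000 only 10000 URLs were generated. One hedged check (property name assumed from the stock nutch-default.xml; any override would live in conf/nutch-site.xml) is whether a per-host cap or a URL filter is trimming the generate list:

# hedged check: is a per-host cap limiting the generate list? (-1 usually means unlimited)
grep -A 2 'generate.max.per.host' conf/nutch-default.xml conf/nutch-site.xml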


Next step:

bin/nutch fetch $segment && bin/nutch updatedb crawlbooks/crawldb $segment
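For clarity, $segment here is just the path of the segment generated above, so the step is roughly:

# $segment points at the segment created by the generate step
segment=crawlbooks/segments/20070510040137
bin/nutch fetch $segment && bin/nutch updatedb crawlbooks/crawldb $segment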


bin/nutch readseg -list crawlbooks/segments/20070510040137
NAME            GENERATED       FETCHER START           FETCHER END             FETCHED PARSED
20070510040137  10000           2007-05-10T04:05:41     2007-05-10T04:14:54     3036    2805

ONLY 3036 FETCHED out of the 10000 generated.
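To see why the remaining URLs were not fetched, one option (a sketch, assuming the standard SegmentReader -dump switches; the output directory name is just an example) is to dump only the crawl_fetch part of the segment and look at the per-URL fetch status:

# hedged diagnostic: dump fetch statuses only, skipping content and parse data
bin/nutch readseg -dump crawlbooks/segments/20070510040137 dump_20070510040137 -nocontent -nogenerate -noparse -noparsedata -noparsetext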

The downloaded files are each around 40000 bytes (txt files).
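If the worry is that ~40 KB pages hit a size or timeout cap, these are the stock properties one could check first (names assumed from nutch-default.xml; overrides would go in conf/nutch-site.xml):

# hedged check: per-document size limit and HTTP timeout
grep -A 2 'http.content.limit' conf/nutch-default.xml conf/nutch-site.xml
grep -A 2 'http.timeout' conf/nutch-default.xml conf/nutch-site.xml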

I'm running Nutch on the Hadoop distributed file system (HDFS, formerly NDFS) with MapReduce, across two servers.
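Since the fetch runs as MapReduce tasks on two nodes, another hedged check is to scan the logs on each machine for per-URL fetch errors (log location assumed to be the default logs/ directory of the Nutch/Hadoop install):

# hedged check: scan the logs on both servers for fetch failures
grep -i 'failed' logs/hadoop.log | head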







