> Try
> bin/nutch readdb crawlbooks/crawldb -stats
> and see if there are more URL's than ~2500
> Are you doing updatedb after a fetch? -> Espen
bin/nutch readdb crawlbooks/crawldb -stats

CrawlDb statistics start: crawlbooks/crawldb
Statistics for CrawlDb: crawlbooks/crawldb
TOTAL urls:     152437
retry 0:        152437
min score:      1.0
avg score:      1.0
max score:      1.0
status 1 (db_unfetched):        112795
status 2 (db_fetched):  39640
status 3 (db_gone):     2
CrawlDb statistics: done

Then I try to generate a segment:

bin/nutch generate crawlbooks/crawldb crawlbooks/segments -topN 20000

NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20070510040137  10000

Next step:

bin/nutch fetch $segment && bin/nutch updatedb crawlbooks/crawldb $segment

bin/nutch readseg -list crawlbooks/segments/20070510040137

NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20070510040137  10000      2007-05-10T04:05:41  2007-05-10T04:14:54  3036     2805

ONLY 3036 FETCHED.

The size of one downloaded file is around 40000 bytes (txt files).

I'm using the Nutch Hadoop distributed file system (HDFS, formerly NDFS) and MapReduce (two servers).
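For reference, the full cycle I run per round looks roughly like this (a minimal sketch; setting $segment from ls/tail is just how I pick up the newest segment directory here, on HDFS the name would come from bin/hadoop dfs -ls instead):

    # one fetch round against crawlbooks/crawldb
    bin/nutch generate crawlbooks/crawldb crawlbooks/segments -topN 20000
    segment=`ls -d crawlbooks/segments/* | tail -1`
    bin/nutch fetch $segment
    bin/nutch updatedb crawlbooks/crawldb $segment
    bin/nutch readseg -list $segment

So updatedb does run after every fetch; the question is why only 3036 of the 10000 generated URLs end up fetched.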
