Try bin/nutch readdb crawlbooks/crawldb -stats and see if there are more than ~2500 URLs.
Are you doing updatedb after a fetch? (See the sketch below the quoted message.)

- Espen

derevo wrote:
> Hi,
>
> I have two servers running Nutch 0.9 with Hadoop.
>
> I generate a segment:
>
> bin/nutch generate crawlbooks/crawldb crawlbooks/segments -topN 15000
>
> In mapred-default.xml:
>
> <name>mapred.map.tasks</name>
> <value>2</value>
>
> <name>mapred.reduce.tasks</name>
> <value>2</value>
>
> Then I fetch:
>
> bin/nutch fetch $segment
>
> After fetching:
>
> $ bin/nutch readseg -list $segment
>
> NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
> 20070509104954  7500       2007-05-09T10:54:33  2007-05-09T10:56:26  2470     2464
>
> I also tried -topN 50000 and 100000, but the FETCHED result stays around 2400-2600 every time.
>
> I'm fetching my own host; the injected links are of the form
>
> http://myhost.com/arc/*.txt
>
> Thanks
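If updatedb is never run, the crawldb never learns which URLs were actually fetched and never picks up newly discovered links, so later generate rounds keep selecting from essentially the same URL set. A minimal sketch of one full round, reusing the crawlbooks paths and -topN value from the quoted message (the segment= line is just shorthand for picking up the newest segment directory; on HDFS you would find it with bin/hadoop dfs -ls instead):

  bin/nutch generate crawlbooks/crawldb crawlbooks/segments -topN 15000
  segment=`ls -d crawlbooks/segments/* | tail -1`    # newest segment directory
  bin/nutch fetch $segment
  bin/nutch updatedb crawlbooks/crawldb $segment     # feed fetch results back into the crawldb
  bin/nutch readdb crawlbooks/crawldb -stats         # URL counts should grow between rounds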
