I wanted to do a test vertical crawl (db.ignore.external.links=true) of several dozen sites using "nutch crawl urlDir -threads 10 -depth 6 -topN 32768 -dir /var/nutch/testindex"... FWIW, I ran the crawl on an Athlon 1900 with 1.5GB RAM, and the crawl directory size is about 2.4GB. Maximum memory usage was about 1.6-1.7GB (it went into swap).
This is what I found at the end of hadoop.log when the process finished:

2007-04-12 04:11:24,903 INFO indexer.Indexer - Indexer: done
2007-04-12 04:11:25,138 INFO indexer.DeleteDuplicates - Dedup: starting
2007-04-12 04:11:26,178 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: /var/nutch/testindex/indexes
2007-04-12 04:12:59,636 INFO indexer.DeleteDuplicates - Dedup: done
2007-04-12 04:12:59,637 INFO indexer.IndexMerger - merging indexes to: /var/nutch/testindex/index
2007-04-12 04:12:59,684 INFO indexer.IndexMerger - Adding /var/nutch/testindex/indexes/part-00000
2007-04-12 04:16:09,532 INFO indexer.IndexMerger - done merging
2007-04-12 04:16:09,728 INFO crawl.Crawl - crawl finished: /var/nutch/testindex

It looks to me like everything was in perfect order, but I got the following error when querying the index through the Nutch web UI:

"HTTP Status 404 - /var/nutch/testindex/index/segments (No such file or directory)"

This is what I saw in the /var/nutch/testindex/index directory:

$ ls
_0.fdt  _0.fnm  _0.nrm  _0.tii  segments_2
_0.fdx  _0.frq  _0.prx  _0.tis  segments.gen

Obviously, there is no file named "segments", only segments_2 and segments.gen. Any ideas why that is?

TIA,
t.n.a.

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
