Hi, I'm having trouble getting Nutch 0.9 (recompiled with NUTCH-467 applied) to crawl, and I have tried many of the fixes that have been suggested here on the mailing list. The following is my Nutch output:
crawl started in: crawled-12
rootUrlDir = urls
threads = 10
depth = 3
topN = 20
Injector: starting
Injector: crawlDb: crawled-12/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawled-12/segments/20080220133145
Generator: filtering: false
Generator: topN: 20
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawled-12

I have done/checked the following:

1. I have a valid http.agent.name string specified in nutch-site.xml; as a precaution, I also commented out the http.agent.name <property> section in nutch-default.xml, in case the final configuration was not taking hold. I have also verified this against the job.xml retrieved via the map/reduce web interface on port 50030 of my master node, and the http.agent.name and http.agent.version strings are both present (and non-empty).

2. I have configured my crawl-urlfilter.txt in all manner of ways, and it definitely allows the domains I'm crawling. I have even added "+." at the end of the file to allow everything, but the crawl still does not work.

3.
My logging level has been set to DEBUG, and then to TRACE, and still there are no errors or warnings, except for messages that look like this:

2008-02-20 07:47:55,247 DEBUG conf.Configuration - java.io.IOException: config()
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
    at org.apache.hadoop.dfs.FSConstants.<clinit>(FSConstants.java:120)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:976)
    at org.apache.hadoop.dfs.DFSClient.create(DFSClient.java:276)
    at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.create(DistributedFileSystem.java:143)
    at org.apache.hadoop.fs.ChecksumFileSystem$FSOutputSummer.<init>(ChecksumFileSystem.java:363)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:438)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:346)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:253)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:84)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:78)
    at org.apache.hadoop.fs.ChecksumFileSystem.copyFromLocalFile(ChecksumFileSystem.java:566)
    at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:741)
    at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:102)
    at org.apache.hadoop.fs.FsShell.run(FsShell.java:822)
    at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
    at org.apache.hadoop.fs.FsShell.main(FsShell.java:910)

Having looked at the line in the source where this comes from, it doesn't look like an error to me; it looks more like an indication that a Configuration is being read. Please correct me if I'm wrong.

4. I have tried Hadoop clusters with 1, 2, and 4 slaves.

5. I have tried URL lists with 1, 4, 6, 12, 40, and 46 distinct URLs, in case it was an issue with the minimum number of URLs needed. I seem to remember reading about such an issue on the mailing list, but I cannot find the post anymore; if anyone could point me in the direction of that, it would be helpful.

6.
I have tried setting "crawl.generate.filter" to both true and false in nutch-site.xml; neither works.

7. I have tried running with 10, 1, and 4 threads for the number of map and reduce tasks.

8. There were no OutOfMemoryErrors whatsoever, and system load was not excessive during the crawl.

9. Results from readdb -stats:

CrawlDb statistics start: crawled-12/crawldb
Statistics for CrawlDb: crawled-12/crawldb
TOTAL urls: 46
retry 0:    46
min score:  1.0
avg score:  1.0
max score:  1.0
status 1 (db_unfetched):    46
CrawlDb statistics: done

Any help at all would be much appreciated. Thanks.

Jiaqi Tan
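P.S. For completeness, here are the relevant pieces of my configuration, trimmed down to what was mentioned above. The agent name and version values below are placeholders, not my actual strings.

From nutch-site.xml:

```xml
<!-- trimmed to the agent properties only; <value> contents are placeholders -->
<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>0.9</value>
</property>
```

And the most permissive variant of conf/crawl-urlfilter.txt that I tried (domain restrictions removed, everything accepted):

```
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# accept everything else
+.
```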