[ http://issues.apache.org/jira/browse/NUTCH-117?page=all ]

Piotr Kosiorowski closed NUTCH-117:
-----------------------------------
  Fix Version: 0.7.2-dev
   Resolution: Fixed
    Assign To: Piotr Kosiorowski

Applied fix by Mike. Also reported offlist by Michal Karwanski.

> Crawl crashes with java.io.IOException: already exists:
> C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
> -------------------------------------------------------------------------------------------------------------
>
>          Key: NUTCH-117
>          URL: http://issues.apache.org/jira/browse/NUTCH-117
>      Project: Nutch
>         Type: Bug
>     Versions: 0.7.1, 0.7, 0.6
>  Environment: Windows 2000 P4 1.70GHz 512MB RAM
>               Java 1.5.0_05
>     Reporter: Stephen Cross
>     Assignee: Piotr Kosiorowski
>     Priority: Critical
>      Fix For: 0.7.2-dev
>
> I started a crawl from the command line using nutch 0.7.1:
> nutch-daemon.sh start crawl urls.txt -dir oct18 -threads 4 -depth 20
> After crawling for over 15 hours the crawl crashed with the following
> exception:
> 051019 050543 status: segment 20051019050438, 30 pages, 0 errors, 1589818 bytes, 48020 ms
> 051019 050543 status: 0.6247397 pages/s, 258.65167 kb/s, 52993.934 bytes/page
> 051019 050544 Updating C:\nutch\crawl.intranet\oct18\db
> 051019 050544 Updating for C:\nutch\crawl.intranet\oct18\segments\20051019050438
> 051019 050544 Processing document 0
> 051019 050544 Finishing update
> 051019 050544 Processing pagesByURL: Sorted 47 instructions in 0.02 seconds.
> 051019 050544 Processing pagesByURL: Sorted 2350.0 instructions/second
> Exception in thread "main" java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
>         at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
>         at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
>         at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
>         at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
>         at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
>         at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)
> This was on the 14th segment of the requested depth of 20. A quick
> Google search on the exception brings up a few previous posts with the same error
> but no definitive answer; it seems to have been occurring since nutch 0.6.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira