Hi,
I want to crawl 3 websites (say, a.com, b.com, and c.com).

I want to keep the crawldb and segments of these 3 websites separate.

So I ran these three commands simultaneously:

bash-3.2$ bin/nutch crawl /seeds/aseedurls/urls.txt -dir /crawla -threads 10 -depth 8 -topN 100000
bash-3.2$ bin/nutch crawl /seeds/bseedurls/urls.txt -dir /crawlb -threads 10 -depth 8 -topN 100000
bash-3.2$ bin/nutch crawl /seeds/cseedurls/urls.txt -dir /crawlc -threads 10 -depth 8 -topN 100000

The crawl of a.com completed successfully, but for the other two crawls, I
get an error:

Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:969)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)

hadoop.log says:

2009-08-25 18:32:51,864 WARN  mapred.LocalJobRunner - job_local_0013
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local_0013/attempt_local_0013_m_000000_0/output/spill0.out in any of the configured local directories
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:381)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
        at org.apache.hadoop.mapred.MapOutputFile.getSpillFile(MapOutputFile.java:94)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1186)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:867)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)

Any clues?
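
One thing I suspect: all three crawls run under LocalJobRunner with the same default mapred.local.dir (in local mode it lives under hadoop.tmp.dir, which defaults to /tmp/hadoop-${user.name}), and since each JVM numbers its jobs job_local_0001, job_local_0002, ... independently, the concurrent runs can clobber each other's files under the shared taskTracker/jobcache tree. If I read bin/nutch right, it honors NUTCH_CONF_DIR, so I'm considering giving each crawl its own conf directory with a distinct hadoop.tmp.dir. Untested sketch (the conf-a/b/c copies and the /tmp paths are just names I made up):

bash-3.2$ cp -r conf conf-a   # then set hadoop.tmp.dir to /tmp/hadoop-crawla in conf-a/hadoop-site.xml
bash-3.2$ cp -r conf conf-b   # likewise, /tmp/hadoop-crawlb
bash-3.2$ cp -r conf conf-c   # likewise, /tmp/hadoop-crawlc
bash-3.2$ NUTCH_CONF_DIR=conf-a bin/nutch crawl /seeds/aseedurls/urls.txt -dir /crawla -threads 10 -depth 8 -topN 100000 &
bash-3.2$ NUTCH_CONF_DIR=conf-b bin/nutch crawl /seeds/bseedurls/urls.txt -dir /crawlb -threads 10 -depth 8 -topN 100000 &
bash-3.2$ NUTCH_CONF_DIR=conf-c bin/nutch crawl /seeds/cseedurls/urls.txt -dir /crawlc -threads 10 -depth 8 -topN 100000 &

Does that sound right, or is there a cleaner way to isolate the local job directories?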

Thanks,
Zee
