Hi,

I am trying to use nutch-0.8-dev and I have a problem running a crawl.
I checked out the source from SVN and built a fresh package (ant
package, all went fine). Then I installed Nutch on Linux and made only
minor changes to the nutch-site.xml file (turned on some plugins and
increased several constants), prepared a file with URLs and started
bin/nutch crawl.

This worked for nutch-0.7x, but with nutch-0.8-dev I get the
following exception in the log file:

051220 204248 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/crawl-tool.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
051220 204249 crawl started in: ./crawl.test
051220 204249 rootUrlDir = urls
051220 204249 threads = 10
051220 204249 depth = 6
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/crawl-tool.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
051220 204249 Injector: starting
051220 204249 Injector: crawlDb: ./crawl.test/crawldb
051220 204249 Injector: urlDir: urls
051220 204249 Injector: Converting injected urls to crawl db entries.
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/crawl-tool.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing /home/lukas/nutch/mapred/local/localRunner/job_4zwds6.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml , /home/lukas/nutch/mapred/local/localRunner/job_4zwds6.xml , nutch-site.xml
        at org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:85)
        at org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:95)
        at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:63)
051220 204249 Running job: job_4zwds6
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:102)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:101)

It seems the problem is that Nutch cannot find the
mapred.input.subdir setting in any of the config files. I found that
the mapred.input.dir property is defined in the config for this
particular job (job_4zwds6.xml), with a value equal to the name of my
urls file, but I don't understand where I should define the
mapred.input.subdir property or what value to assign to it (if it
needs to be defined manually at all; note that mapred.input.dir seems
to be configured automatically).
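
To make the question concrete: if the property does have to be set by
hand, I assume the entry would follow the usual property format used in
nutch-site.xml, roughly as sketched here. The value is only a
placeholder, since I don't know what belongs there, and I have not
confirmed that setting it manually is the right fix.

```xml
<!-- Hypothetical entry: the value is a placeholder, not a known-good setting -->
<property>
  <name>mapred.input.subdir</name>
  <value>PLACEHOLDER</value>
</property>
```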

Does anybody know the answer?

P.S. Note that the line numbers shown for InputFormatBase.java in the
exception trace above (85, 95) may differ slightly from the original
source. I inserted some extra LOG.debug() calls there while searching
for the root cause and then removed them again, but I may have left
some extra empty lines behind.

Thanks,
Lukas
