I have installed Nutch 2.2 on a CentOS machine and am running the
following command:
./bin/crawl urls testcrawl "solrfolder" 2
I have tried both the default filter configuration and explicitly
specifying urlfilter-regex in nutch-default.xml (without modifying the
default regex filters). However, it fails each time, and I can see the
exception below in hadoop.log.
As you can see, it looks like nothing has been picked up from seed.txt
in the urls folder (a MalformedURLException normally prints the offending
URL after "no protocol:", and here it prints nothing).
The file has a single entry with the protocol specified, e.g. http://www.google.com
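
To see which kinds of seed line produce this exact message, the behaviour of java.net.URL can be reproduced outside Nutch. The following is only a minimal sketch: the class and method names (SeedLineCheck, parseError) are made up for illustration, and it assumes the injector passes each line of seed.txt to java.net.URL more or less unchanged, which the stack trace suggests but does not prove.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class SeedLineCheck {

    /**
     * Returns the MalformedURLException message that java.net.URL produces
     * for the given seed line, or null if the line parses as a URL.
     */
    public static String parseError(String line) {
        try {
            new URL(line);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A valid entry, an entry without a protocol, and an empty line.
        String[] samples = {"http://www.google.com", "www.google.com", ""};
        for (String s : samples) {
            System.out.println("[" + s + "] -> " + parseError(s));
        }
    }
}
```

For an empty string the message is "no protocol: " with nothing after the colon, which matches the log below. A blank line in seed.txt (for example an empty line after the single entry) would therefore be one candidate to rule out.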
Can anyone shed any light on this?
Regards,
Peter.
2013-06-12 17:00:47,857 INFO crawl.InjectorJob - InjectorJob: starting at 2013-06-12 17:00:47
2013-06-12 17:00:47,858 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: urls
2013-06-12 17:00:48,140 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
2013-06-12 17:00:48,158 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-06-12 17:00:48,206 WARN snappy.LoadSnappy - Snappy native library not loaded
2013-06-12 17:00:48,344 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-06-12 17:00:48,403 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2013-06-12 17:00:48,407 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-06-12 17:00:48,407 WARN mapred.LocalJobRunner - job_local_0001
java.net.MalformedURLException: no protocol:
    at java.net.URL.<init>(URL.java:585)
    at java.net.URL.<init>(URL.java:482)
    at java.net.URL.<init>(URL.java:431)
    at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:44)
    at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:162)
    at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:88)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
2013-06-12 17:00:49,300 ERROR crawl.InjectorJob - InjectorJob: java.lang.RuntimeException: job failed: name=[testcrawl]inject urls, jobid=job_local_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
    at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)