I have installed Nutch 2.2 on a CentOS machine and am running the following command:

    ./bin/crawl urls testcrawl "solrfolder" 2

I have tried both the default filter configuration and explicitly specifying urlfilter-regex
in nutch-default.xml (without modifying the default regex filters).

However, it fails each time, and I can see the exception below in hadoop.log.

As you can see, it looks like nothing was picked up from seed.txt in the urls folder
(a MalformedURLException usually prints the offending URL after "no protocol:").
The file has one entry with the protocol specified, e.g. http://www.google.com
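
To illustrate why the empty text after "no protocol:" caught my eye, here is a minimal standalone sketch of how java.net.URL reports a string it cannot parse. The class name SeedLineCheck and the method parseError are made up for this example and are not part of Nutch. It is only an assumption on my part that the injector ended up passing an empty string (for instance from a blank line in seed.txt) to the URL constructor.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of Nutch: shows what java.net.URL reports
// for a few candidate seed lines.
public class SeedLineCheck {

    // Returns the exception message if java.net.URL rejects the line,
    // or null if the line parses as a URL.
    public static String parseError(String line) {
        try {
            new URL(line);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A well-formed entry, an entry without a protocol, and an empty line.
        String[] samples = { "http://www.google.com", "www.google.com", "" };
        for (String s : samples) {
            System.out.println("[" + s + "] -> " + parseError(s));
        }
    }
}
```

For the empty string the message is "no protocol: " with nothing after the colon, which is the shape of the message in my log, whereas a line such as www.google.com would be echoed back in the message.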

Can anyone shed any light on this?

Regards,
Peter.

2013-06-12 17:00:47,857 INFO crawl.InjectorJob - InjectorJob: starting at 2013-06-12 17:00:47
2013-06-12 17:00:47,858 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: urls
2013-06-12 17:00:48,140 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
2013-06-12 17:00:48,158 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-06-12 17:00:48,206 WARN snappy.LoadSnappy - Snappy native library not loaded
2013-06-12 17:00:48,344 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-06-12 17:00:48,403 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2013-06-12 17:00:48,407 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-06-12 17:00:48,407 WARN  mapred.LocalJobRunner - job_local_0001
java.net.MalformedURLException: no protocol:
       at java.net.URL.<init>(URL.java:585)
       at java.net.URL.<init>(URL.java:482)
       at java.net.URL.<init>(URL.java:431)
       at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:44)
       at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:162)
       at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:88)
       at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
2013-06-12 17:00:49,300 ERROR crawl.InjectorJob - InjectorJob: java.lang.RuntimeException: job failed: name=[testcrawl]inject urls, jobid=job_local_0001
       at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
       at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
       at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
       at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
       at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)

