Hi Peter, please do not hijack threads.

Seed URLs must be fully specified, including the protocol, e.g.:
  http://nutch.apache.org/
but not
  apache.org

Sebastian

On 06/12/2013 05:08 PM, Peter Gaines wrote:
> I have installed version 2.2 of nutch on a CentOS machine and am using the
> following command:
>
> ./bin/crawl urls testcrawl "solrfolder" 2
>
> I have attempted to use the default filter configuration and also explicitly
> specified urlfilter-regex
> in the nutch-default.xml (without modifying the default regex filters).
>
> However it fails each time and I can see the exception below in the
> hadoop.log.
>
> As you can see it looks like it has not picked up anything from the seed.txt
> in the urls folder
> (as MalformedURLException error usually prints the url).
> This file has 1 entry with the protocol specified e.g. http://www.google.com
>
> Can anyone shed any light on this?
>
> Regards,
> Peter.
>
> 2013-06-12 17:00:47,857 INFO crawl.InjectorJob - InjectorJob: starting at
> 2013-06-12 17:00:47
> 2013-06-12 17:00:47,858 INFO crawl.InjectorJob - InjectorJob: Injecting
> urlDir: urls
> 2013-06-12 17:00:48,140 INFO crawl.InjectorJob - InjectorJob: Using class
> org.apache.gora.memory.store.MemStore as the Gora storage class.
> 2013-06-12 17:00:48,158 WARN util.NativeCodeLoader - Unable to load
> native-hadoop library for your
> platform... using builtin-java classes where applicable
> 2013-06-12 17:00:48,206 WARN snappy.LoadSnappy - Snappy native library not
> loaded
> 2013-06-12 17:00:48,344 INFO mapreduce.GoraRecordWriter -
> gora.buffer.write.limit = 10000
> 2013-06-12 17:00:48,403 INFO regex.RegexURLNormalizer - can't find rules for
> scope 'inject', using
> default
> 2013-06-12 17:00:48,407 WARN mapred.FileOutputCommitter - Output path is
> null in cleanup
> 2013-06-12 17:00:48,407 WARN mapred.LocalJobRunner - job_local_0001
> java.net.MalformedURLException: no protocol:
>     at java.net.URL.<init>(URL.java:585)
>     at java.net.URL.<init>(URL.java:482)
>     at java.net.URL.<init>(URL.java:431)
>     at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:44)
>     at
> org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:162)
>     at
> org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:88)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> 2013-06-12 17:00:49,300 ERROR crawl.InjectorJob - InjectorJob:
> java.lang.RuntimeException: job
> failed: name=[testcrawl]inject urls, jobid=job_local_0001
>     at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
>     at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
>     at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
>     at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
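
P.S. To check a seed list against the "fully specified, including protocol" rule before injecting, something like the Python sketch below may help. It is not part of Nutch and does not reproduce the injector's own parsing. `check_seed_lines` is a hypothetical helper, and it makes three assumptions: only http and https seeds are wanted, lines starting with `#` are comments, and the URL is the first whitespace-separated field on a line. It also flags blank lines, since the empty text after "no protocol:" in the trace would be consistent with an empty string reaching the URL parser.

```python
from urllib.parse import urlsplit


def check_seed_lines(lines):
    """Return (line_number, url, reason) for each seed entry that looks unusable.

    Hypothetical helper, not Nutch code. Assumes http/https seeds only,
    '#' comment lines, and the URL as the first field on each line.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            # A blank line would hand an empty string to the URL parser.
            problems.append((number, "", "empty line"))
            continue
        if text.startswith("#"):
            # Assumption: comment lines are skipped.
            continue
        url = text.split()[0]  # ignore any trailing metadata fields
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            problems.append((number, url, "no protocol"))
        elif not parts.netloc:
            problems.append((number, url, "no host"))
    return problems
```

Calling it on the lines of urls/seed.txt, for example `check_seed_lines(open("urls/seed.txt").read().splitlines())`, returns an empty list when every entry passes these checks. An entry such as apache.org is reported with the reason "no protocol".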

