Re: Suffix URLFilter not working

Peter Gaines Wed, 12 Jun 2013 08:10:00 -0700

I have installed version 2.2 of Nutch on a CentOS machine and am using the following command:


    ./bin/crawl urls testcrawl "solrfolder" 2

I have attempted to use the default filter configuration and have also explicitly specified urlfilter-regex
in nutch-default.xml (without modifying the default regex filters).

However, it fails each time, and I can see the exception below in the hadoop.log.

As you can see, it looks like it has not picked up anything from the seed.txt in the urls folder
(a MalformedURLException error usually prints the URL).
This file has one entry with the protocol specified, e.g. http://www.google.com

Can anyone shed any light on this?

Regards,
Peter.

2013-06-12 17:00:47,857 INFO crawl.InjectorJob - InjectorJob: starting at 2013-06-12 17:00:47
2013-06-12 17:00:47,858 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: urls
2013-06-12 17:00:48,140 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
2013-06-12 17:00:48,158 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-06-12 17:00:48,206 WARN snappy.LoadSnappy - Snappy native library not loaded
2013-06-12 17:00:48,344 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-06-12 17:00:48,403 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2013-06-12 17:00:48,407 WARN mapred.FileOutputCommitter - Output path is null in cleanup

2013-06-12 17:00:48,407 WARN  mapred.LocalJobRunner - job_local_0001
java.net.MalformedURLException: no protocol:
       at java.net.URL.<init>(URL.java:585)
       at java.net.URL.<init>(URL.java:482)
       at java.net.URL.<init>(URL.java:431)
       at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:44)
       at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:162)
       at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:88)
       at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
2013-06-12 17:00:49,300 ERROR crawl.InjectorJob - InjectorJob: java.lang.RuntimeException: job failed: name=[testcrawl]inject urls, jobid=job_local_0001
       at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
       at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
       at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
       at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
       at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
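
For reference, the "no protocol:" message with nothing after the colon can be reproduced outside Nutch. The following is only a minimal sketch, assuming the string that reaches java.net.URL is empty (for example a blank line in the seed file); it does not show what InjectorJob or TableUtil actually do internally. The class name NoProtocolDemo, the helper describe, and the sample seed lines are hypothetical and chosen just for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class NoProtocolDemo {

    // Returns the parse error message for a seed line, or "ok" if java.net.URL accepts it.
    @SuppressWarnings("deprecation") // the URL(String) constructor is deprecated on recent JDKs
    static String describe(String seedLine) {
        try {
            new URL(seedLine);
            return "ok";
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Hypothetical seed lines: a well-formed entry, an empty line,
        // and an entry that is missing its scheme.
        String[] seedLines = { "http://www.google.com", "", "www.google.com" };
        for (String line : seedLines) {
            System.out.println("[" + line + "] -> " + describe(line));
        }
    }
}
```

If the empty line behaves here the same way as in the trace, checking seed.txt for blank lines or stray whitespace may be a reasonable first step.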
