Hi Peter,

please do not hijack threads.

Seed URLs must be fully specified including protocol, e.g.:
 http://nutch.apache.org/
but not
 apache.org
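
For illustration, this is how java.net.URL (the class the injector relies
on via TableUtil.reverseUrl, per the stack trace below) rejects a seed line
without a protocol. A minimal standalone sketch, not actual Nutch code:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class SeedUrlCheck {

    // Returns true if the line parses as a complete URL (protocol included).
    static boolean isValidSeed(String line) {
        try {
            new URL(line);
            return true;
        } catch (MalformedURLException e) {
            // For "apache.org" the message is "no protocol: apache.org",
            // the same error the injector logs.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidSeed("http://nutch.apache.org/")); // true
        System.out.println(isValidSeed("apache.org"));               // false
    }
}
```

Note that a blank line in seed.txt fails the same way, with an empty URL
after "no protocol:", which would match the log output quoted below.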

Sebastian

On 06/12/2013 05:08 PM, Peter Gaines wrote:
> I have installed version 2.2 of Nutch on a CentOS machine and am using the 
> following command:
> 
>     ./bin/crawl urls testcrawl "solrfolder" 2
> 
> I have attempted to use the default filter configuration and also explicitly 
> specified urlfilter-regex
> in the nutch-default.xml (without modifying the default regex filters).
> 
> However it fails each time and I can see the exception below in the 
> hadoop.log.
> 
> As you can see, it looks like it has not picked up anything from the seed.txt 
> in the urls folder
> (the MalformedURLException usually prints the offending URL).
> This file has 1 entry with the protocol specified, e.g. http://www.google.com
> 
> Can anyone shed any light on this?
> 
> Regards,
> Peter.
> 
> 2013-06-12 17:00:47,857 INFO  crawl.InjectorJob - InjectorJob: starting at 
> 2013-06-12 17:00:47
> 2013-06-12 17:00:47,858 INFO  crawl.InjectorJob - InjectorJob: Injecting 
> urlDir: urls
> 2013-06-12 17:00:48,140 INFO  crawl.InjectorJob - InjectorJob: Using class
> org.apache.gora.memory.store.MemStore as the Gora storage class.
> 2013-06-12 17:00:48,158 WARN  util.NativeCodeLoader - Unable to load 
> native-hadoop library for your
> platform... using builtin-java classes where applicable
> 2013-06-12 17:00:48,206 WARN  snappy.LoadSnappy - Snappy native library not 
> loaded
> 2013-06-12 17:00:48,344 INFO  mapreduce.GoraRecordWriter - 
> gora.buffer.write.limit = 10000
> 2013-06-12 17:00:48,403 INFO  regex.RegexURLNormalizer - can't find rules for 
> scope 'inject', using
> default
> 2013-06-12 17:00:48,407 WARN  mapred.FileOutputCommitter - Output path is 
> null in cleanup
> 2013-06-12 17:00:48,407 WARN  mapred.LocalJobRunner - job_local_0001
> java.net.MalformedURLException: no protocol:
>        at java.net.URL.<init>(URL.java:585)
>        at java.net.URL.<init>(URL.java:482)
>        at java.net.URL.<init>(URL.java:431)
>        at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:44)
>        at 
> org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:162)
>        at 
> org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:88)
>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>        at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> 2013-06-12 17:00:49,300 ERROR crawl.InjectorJob - InjectorJob: 
> java.lang.RuntimeException: job
> failed: name=[testcrawl]inject urls, jobid=job_local_0001
>        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
>        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
>        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
>        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
> 
> 
