My apologies, I sent it in error.
I have resent the email as a new thread.
BTW, I specified the full protocol, i.e. http://

Regards,
Peter

-----Original Message-----
From: Sebastian Nagel
Sent: Wednesday, June 12, 2013 8:54 PM
To: [email protected]
Subject: Re: Suffix URLFilter not working

Hi Peter,

please do not hijack threads.

Seed URLs must be fully specified including protocol, e.g.:
http://nutch.apache.org/
but not
apache.org
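
A seed list can be checked against this rule before it is injected. What follows is only a rough sketch, not Nutch's own validation: the function name check_seed_lines is made up for illustration, and the default path urls/seed.txt is taken from the setup described in the quoted message. It also flags blank lines, on the assumption that an empty entry is one possible way to end up with no URL printed after "no protocol:".

```python
from urllib.parse import urlsplit


def check_seed_lines(lines):
    """Return (line_number, text, problem) for each entry that is not a full URL.

    Hypothetical helper for illustration; it is not part of Nutch.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            # An empty entry carries no protocol at all.
            problems.append((number, text, "blank line"))
            continue
        parts = urlsplit(text)
        if not parts.scheme:
            # e.g. "apache.org" instead of "http://nutch.apache.org/"
            problems.append((number, text, "no protocol"))
        elif not parts.netloc:
            problems.append((number, text, "no host"))
    return problems


if __name__ == "__main__":
    import sys

    # Assumed default location of the seed file; pass another path as argv[1].
    path = sys.argv[1] if len(sys.argv) > 1 else "urls/seed.txt"
    with open(path, encoding="utf-8") as handle:
        for number, text, problem in check_seed_lines(handle):
            print(f"{path}:{number}: {problem}: {text!r}")
```

Run against the seed file, it prints one line per suspicious entry and nothing if every entry has a protocol and a host.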

Sebastian

On 06/12/2013 05:08 PM, Peter Gaines wrote:
I have installed version 2.2 of Nutch on a CentOS machine and am using the following command:

    ./bin/crawl urls testcrawl "solrfolder" 2

I have attempted to use the default filter configuration and also explicitly specified urlfilter-regex
in the nutch-default.xml (without modifying the default regex filters).

However, it fails each time, and I can see the exception below in hadoop.log.

As you can see, it looks like it has not picked up anything from the seed.txt in the urls folder
(a MalformedURLException usually prints the offending URL).
This file has 1 entry with the protocol specified, e.g. http://www.google.com

Can anyone shed any light on this?

Regards,
Peter.

2013-06-12 17:00:47,857 INFO crawl.InjectorJob - InjectorJob: starting at 2013-06-12 17:00:47
2013-06-12 17:00:47,858 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: urls
2013-06-12 17:00:48,140 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
2013-06-12 17:00:48,158 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-06-12 17:00:48,206 WARN snappy.LoadSnappy - Snappy native library not loaded
2013-06-12 17:00:48,344 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-06-12 17:00:48,403 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2013-06-12 17:00:48,407 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-06-12 17:00:48,407 WARN mapred.LocalJobRunner - job_local_0001
java.net.MalformedURLException: no protocol:
       at java.net.URL.<init>(URL.java:585)
       at java.net.URL.<init>(URL.java:482)
       at java.net.URL.<init>(URL.java:431)
       at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:44)
       at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:162)
       at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:88)
       at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
2013-06-12 17:00:49,300 ERROR crawl.InjectorJob - InjectorJob: java.lang.RuntimeException: job failed: name=[testcrawl]inject urls, jobid=job_local_0001
       at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
       at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
       at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
       at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
       at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
