My apologies, I sent it in error.
I have re-sent the email as a new thread.
BTW, I specified the full protocol, i.e. http://
Regards,
Peter
-----Original Message-----
From: Sebastian Nagel
Sent: Wednesday, June 12, 2013 8:54 PM
To: [email protected]
Subject: Re: Suffix URLFilter not working
Hi Peter,
Please do not hijack threads.
Seed URLs must be fully specified including protocol, e.g.:
http://nutch.apache.org/
but not
apache.org
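
To check a seed list before injecting, something like the following rough sketch may help. It is not part of Nutch: the class name SeedCheck and the method findBadSeeds are made up for illustration, and it assumes each line of urls/seed.txt holds just one URL with no extra metadata. It simply hands every line to java.net.URL, the same class that throws in the stack trace further down this thread, and collects the complaints.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Nutch class: reports seed lines that
// java.net.URL refuses to parse.
public class SeedCheck {

    // Returns one message per rejected line, prefixed with its 1-based line number.
    public static List<String> findBadSeeds(List<String> lines) {
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            // Assumption: the whole (trimmed) line is the URL.
            String url = lines.get(i).trim();
            try {
                new URL(url);
            } catch (MalformedURLException e) {
                problems.add("line " + (i + 1) + ": " + e.getMessage());
            }
        }
        return problems;
    }

    public static void main(String[] args) throws Exception {
        // Default path matches the layout described in this thread; pass another as args[0].
        String path = args.length > 0 ? args[0] : "urls/seed.txt";
        for (String problem : findBadSeeds(Files.readAllLines(Paths.get(path)))) {
            System.out.println(problem);
        }
    }
}
```

An entry such as apache.org is reported as "no protocol: apache.org", while an empty or blank line is reported as "no protocol: " with nothing after the colon. Passing this check only means java.net.URL accepts the line; it says nothing about Nutch's own URL filters.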
Sebastian
On 06/12/2013 05:08 PM, Peter Gaines wrote:
I have installed version 2.2 of Nutch on a CentOS machine and am using
the following command:
./bin/crawl urls testcrawl "solrfolder" 2
I have attempted to use the default filter configuration and have also
explicitly specified urlfilter-regex
in nutch-default.xml (without modifying the default regex filters).
However, it fails each time and I can see the exception below in
hadoop.log.
As you can see, it looks like it has not picked up anything from the
seed.txt in the urls folder
(as a MalformedURLException usually prints the URL).
This file has one entry with the protocol specified, e.g.
http://www.google.com
Can anyone shed any light on this?
Regards,
Peter.
2013-06-12 17:00:47,857 INFO crawl.InjectorJob - InjectorJob: starting at 2013-06-12 17:00:47
2013-06-12 17:00:47,858 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: urls
2013-06-12 17:00:48,140 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
2013-06-12 17:00:48,158 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-06-12 17:00:48,206 WARN snappy.LoadSnappy - Snappy native library not loaded
2013-06-12 17:00:48,344 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-06-12 17:00:48,403 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2013-06-12 17:00:48,407 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-06-12 17:00:48,407 WARN mapred.LocalJobRunner - job_local_0001
java.net.MalformedURLException: no protocol:
    at java.net.URL.<init>(URL.java:585)
    at java.net.URL.<init>(URL.java:482)
    at java.net.URL.<init>(URL.java:431)
    at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:44)
    at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:162)
    at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:88)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
2013-06-12 17:00:49,300 ERROR crawl.InjectorJob - InjectorJob: java.lang.RuntimeException: job failed: name=[testcrawl]inject urls, jobid=job_local_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
    at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)