What are you trying to achieve? The crawl command does not invoke any
plugins itself; it merely chains several Nutch jobs together. The
Nutch jobs themselves, or more specifically their mappers and reducers,
make use of the plugin repository.
On 11/03/2011 01:47 PM, Ashish M wrote:
Which method in Crawl.java triggers the invocation of plugins?
On Nov 3, 2011, at 5:30 AM, Markus Jelsma wrote:
Remove the *parse* directories in the segment and you're good to go.
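
For a segment on the local filesystem, that clean-up could be sketched in
plain Java roughly as follows. This is only an illustration, not Nutch code:
the class SegmentParseCleaner and its method removeParseOutput are
hypothetical names, and the sketch assumes the segment lives on local disk
(on HDFS the equivalent would go through the Hadoop FileSystem API instead).
It deletes every entry directly under the segment directory whose name
matches *parse* and leaves everything else in place.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Nutch: removes earlier parse output
// from a segment stored on the LOCAL filesystem (an assumption).
final class SegmentParseCleaner {

    /** Deletes each entry directly under the segment whose name matches *parse*. */
    static List<String> removeParseOutput(Path segment) throws IOException {
        List<Path> matches = new ArrayList<>();
        // Collect the matches first so nothing is deleted while the stream is open.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(segment, "*parse*")) {
            for (Path entry : entries) {
                matches.add(entry);
            }
        }
        List<String> removed = new ArrayList<>();
        for (Path entry : matches) {
            deleteRecursively(entry);
            removed.add(entry.getFileName().toString());
        }
        Collections.sort(removed); // stable order for reporting
        return removed;
    }

    private static void deleteRecursively(Path root) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order puts children before their parent directory.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: SegmentParseCleaner <segment-path>");
            return;
        }
        System.out.println("Removed: " + removeParseOutput(Paths.get(args[0])));
    }
}
```

Pass the segment path as the only argument. Working on a copy of the
segment first is the safer choice, since the deletion cannot be undone.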
On Thursday 03 November 2011 13:16:40 Ashish Mehrotra wrote:
Hi All,
I am trying to parse already crawled segments using the method --
ParseSegment.parse(seg);
seg is the Path to the existing segment.
This internally fires a new job and the error thrown is --
Exception in thread "main" java.io.IOException: Segment already parsed!
    at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:80)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
What I am trying to do here is parse the already fetched data to test my
HTML Parse Filter. Looks like the above method of ParseSegment gets called
in the normal workflow of crawl, fetch, parse ...
What I have done is modify org.apache.nutch.crawl.Crawl.run() to call
only ParseSegment, and comment out the injector, generator and fetcher
parts. I am calling ParseSegment.parse(segment) in the run() method and
passing the segment name on the command line.
Should I be calling some other method to test my HTML parser filter plugin
without crawling again?
Any pointers would be helpful.
Thanks,
Ashish
--
Markus Jelsma - CTO - Openindex