Hi All,

I am trying to parse already crawled segments by calling

ParseSegment.parse(seg);

where seg is the Path to the existing segment.
This internally submits a new job, which fails with the following error:

Exception in thread "main" java.io.IOException: Segment already parsed!
at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:80)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)

What I am trying to do is re-parse the already fetched data to test my HTML
parse filter.
It looks like the above method of ParseSegment is the one called in the normal
workflow of crawl, fetch, parse ...

What I have done is modify org.apache.nutch.crawl.Crawl.run() so that it calls
only ParseSegment, with the injector, generator and fetcher parts commented
out. I call ParseSegment.parse(segment) in the run() method and pass the
segment name on the command line.
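
For illustration, here is a minimal stand-alone sketch (plain Java, no Nutch or
Hadoop classes) of the kind of check that would produce this error, assuming it
is triggered by parse output directories already present under the segment. The
directory names crawl_parse, parse_data and parse_text and the class name
SegmentParseCheck are assumptions made for this sketch; they are not taken from
the stack trace.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: inspects a segment directory on the
// local filesystem and reports whether parse output appears to exist already.
class SegmentParseCheck {

    // ASSUMPTION: names of the sub-directories a parse job writes into a segment.
    static final String[] PARSE_DIRS = {"crawl_parse", "parse_data", "parse_text"};

    // Returns the assumed parse output directories that exist under the segment.
    static List<Path> existingParseDirs(Path segment) {
        List<Path> found = new ArrayList<>();
        for (String name : PARSE_DIRS) {
            Path dir = segment.resolve(name);
            if (Files.isDirectory(dir)) {
                found.add(dir);
            }
        }
        return found;
    }

    // True if at least one of the assumed parse output directories is present.
    static boolean isAlreadyParsed(Path segment) {
        return !existingParseDirs(segment).isEmpty();
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: SegmentParseCheck <segment_dir>");
            System.exit(1);
        }
        Path segment = Paths.get(args[0]);
        List<Path> found = existingParseDirs(segment);
        if (found.isEmpty()) {
            System.out.println("No parse output found under " + segment);
        } else {
            System.out.println("Parse output already present:");
            for (Path dir : found) {
                System.out.println("  " + dir);
            }
        }
    }
}
```

The sketch only looks at a local directory, so it would not apply as-is to a
segment stored on HDFS. It reads the segment and never modifies or deletes
anything.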

Should I be calling some other method to test my HTML parse filter plugin
without crawling again?

Any pointers would be helpful.

Thanks,
Ashish
