Remove the *parse* directories in the segment (crawl_parse, parse_data and parse_text) and you're good to go.

On Thursday 03 November 2011 13:16:40 Ashish Mehrotra wrote:
> Hi All,
>
> I am trying to parse already crawled segments using the method --
> ParseSegment.parse(seg);
>
> seg is the Path to the existing segment.
> This internally fires a new job and the error thrown is --
>
> Exception in thread "main" java.io.IOException: Segment already parsed!
>     at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:80)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
>
> What I am trying to do here is parse the already fetched data to test my
> HTML Parse Filter. It looks like the above method of ParseSegment gets
> called in the normal workflow of crawl, fetch, parse ...
>
> What I have done is modify org.apache.nutch.crawl.Crawl.run() to call
> only ParseSegment, with the injector, generator and fetcher parts
> commented out. I am calling ParseSegment.parse(segment) in the run()
> method and passing the segment name on the command line.
>
> Should I be calling some other method to test my HTML parse filter plugin
> without crawling again?
>
> Any pointers would be helpful.
>
> Thanks,
> Ashish
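
For a segment on the local filesystem, the removal suggested at the top of this reply could be sketched roughly as below. This is only an illustration: the class name SegmentParseCleaner and its methods are hypothetical and not part of Nutch, and a segment stored on HDFS would need the Hadoop filesystem tools instead of java.nio.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Nutch: deletes every entry in a local
// segment directory whose name matches the glob *parse*.
public class SegmentParseCleaner {

    // Removes the matching entries and returns their names in sorted order.
    static List<String> removeParseDirs(Path segment) throws IOException {
        // Collect the matches first so nothing is deleted while the
        // directory stream is still being iterated.
        List<Path> matches = new ArrayList<>();
        try (DirectoryStream<Path> entries =
                 Files.newDirectoryStream(segment, "*parse*")) {
            for (Path entry : entries) {
                matches.add(entry);
            }
        }
        List<String> removed = new ArrayList<>();
        for (Path match : matches) {
            deleteRecursively(match);
            removed.add(match.getFileName().toString());
        }
        Collections.sort(removed);
        return removed;
    }

    // Deletes a file or directory tree, children before their parents.
    static void deleteRecursively(Path root) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            paths = walk.sorted(Comparator.reverseOrder())
                        .collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: SegmentParseCleaner <segment-dir>");
            return;
        }
        for (String name : removeParseDirs(Paths.get(args[0]))) {
            System.out.println("removed " + name);
        }
    }
}
```

Once those directories are gone, the check in ParseOutputFormat.checkOutputSpecs should no longer report the segment as already parsed, so ParseSegment.parse(seg) can be run again against the fetched content.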
-- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

