[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482371#comment-14482371 ]
Lewis John McGibbney commented on NUTCH-1854:
---------------------------------------------

In the past, I've experienced failed fetch tasks when parsing fails while invoked during fetching. There are various ways to overcome this. As you've said, one is to generate more, smaller fetch lists, so that if a parsing fetcher fails we mitigate against losing large fetch results. You've also noted that simply checking for the parse directory later on is a workaround of sorts, but it does not prevent interruptions in a typical workflow should a parsing fetcher fail.

This is a Nutch gotcha which I've been aware of since my early use of Nutch. It's something that's stuck with me and is probably more habit now than anything else, Chris. The crawl script shadows this behavior, hence the reason it fails when attempting to reparse a segment.

The parsing fetcher is disabled by default based on the underlying assumption that Nutch will be invoked as a breadth-first crawl. This is also reflected in the settings, which ignore internal links but follow external links.

I understand that the goal here is to move towards more of an interactive understanding of Crawldb and Record status, and I am supportive of that. I hope the above provides some context to Azitang and others.

--
*Lewis*

> ./bin/crawl fails with a parsing fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1854
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1854
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.11
>
>
> If you run ./bin/crawl with a parsing fetcher e.g.
> <property>
>   <name>fetcher.parse</name>
>   <value>false</value>
>   <description>If true, fetcher will parse content. Default is false, which means
>   that a separate parsing step is required after fetching is finished.</description>
> </property>
> we get a horrible message as follows
> Exception in thread "main" java.io.IOException: Segment already parsed!
> We could improve this by making logging more complete and by adding a trigger
> to the crawl script which would check for crawl_parse for a given segment and
> then skip parsing if this is present.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
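
The trigger proposed in the issue description (check a segment for crawl_parse and skip parsing if it is present) could be sketched in shell roughly as follows. This is only an illustration, not the patch that was applied: segment_is_parsed and parse_if_needed are hypothetical helper names, and the directory test assumes the segment lives on a local filesystem (on HDFS something like `hadoop fs -test -d` would be needed instead).

```shell
# Sketch only: both helpers are hypothetical and are not part of bin/crawl.

# Succeeds if the segment directory already contains crawl_parse.
# Assumption: local filesystem; an HDFS segment needs a different check.
segment_is_parsed() {
  [ -d "$1/crawl_parse" ]
}

# Runs the given parse command on the segment unless it is already parsed.
# Usage: parse_if_needed <segment_dir> <command> [args...]
parse_if_needed() {
  segment="$1"
  shift
  if segment_is_parsed "$segment"; then
    # Log the reason instead of failing with "Segment already parsed!"
    echo "Skipping parse: $segment/crawl_parse already exists"
  else
    "$@" "$segment"
  fi
}
```

In a crawl loop this might be called as `parse_if_needed "$SEGMENT" bin/nutch parse`, where `$SEGMENT` is assumed to hold the path of the segment just fetched. With a parsing fetcher the segment already has crawl_parse, so the separate parse step is skipped with a log line rather than an exception.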