[ https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748811#action_12748811 ]
Julien Nioche commented on NUTCH-696:
-------------------------------------

The simplest way of avoiding being blocked by such issues is simply to skip the problematic records, so that even if a document fails a task, the other records will still be processed in a subsequent re-run. This can be done with the following Hadoop options:

skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"

The timeout mechanism could be interesting, but it is not really straightforward to implement, so it is probably best to forget about it for now.

> Timeout for Parser
> ------------------
>
>                 Key: NUTCH-696
>                 URL: https://issues.apache.org/jira/browse/NUTCH-696
>             Project: Nutch
>          Issue Type: Wish
>          Components: fetcher
>            Reporter: Julien Nioche
>            Priority: Minor
>
> I found that the parsing sometimes crashes due to a problem on a specific document, which is a bit of a shame, as this blocks the rest of the segment and Hadoop ends up finding that the node does not respond. I was wondering whether it would make sense to have a timeout mechanism for the parsing, so that if a document is not parsed after a time t, it is simply treated as an exception and we can get on with the rest of the process.
> Does that make sense? Where do you think we should implement that, in ParseUtil?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
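
To make the two skip options above concrete: assuming a job driver that builds its own Hadoop Configuration, the same properties could be set programmatically instead of being passed as -D options. This is a sketch, not code from the Nutch tree; the class name is illustrative.

    import org.apache.hadoop.conf.Configuration;

    public class SkipConfigSketch {
      public static Configuration buildConf() {
        Configuration conf = new Configuration();
        // Start skipping records after 2 failed attempts on the same task.
        conf.setInt("mapred.skip.attempts.to.start.skipping", 2);
        // Skip at most one record around each failing position.
        conf.setLong("mapred.skip.map.max.skip.records", 1);
        return conf;
      }
    }

As for the timeout idea itself, one minimal way it could be sketched, e.g. somewhere like ParseUtil, is to run the parse of a single document on a separate thread and abandon it via Future.get with a timeout. The helper below is a generic illustration, not an actual Nutch API; callWithTimeout and its parameters are hypothetical names.

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public class TimedCallSketch {

      // Run 'work' (e.g. the parse of one document) for at most
      // 'timeoutSecs' seconds; on timeout, interrupt it and rethrow so
      // the caller can treat the document as a parse failure and carry
      // on with the rest of the segment.
      public static <T> T callWithTimeout(Callable<T> work, long timeoutSecs)
          throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(work);
        try {
          return future.get(timeoutSecs, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
          future.cancel(true); // best effort: interrupt the parsing thread
          throw e;
        } finally {
          executor.shutdownNow();
        }
      }
    }

A caller would wrap the call to the parser plugin in the Callable and catch TimeoutException alongside the usual parse exceptions. Note the caveat from the comment above still applies: a parser that ignores interruption will keep its thread alive after the timeout, which is part of why this is not straightforward to implement.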