[ https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748811#action_12748811 ]

Julien Nioche commented on NUTCH-696:
-------------------------------------

The simplest way to avoid being blocked by such issues is to skip the 
problematic records, so that even if a document causes a task to fail, the 
other records will still be processed in a subsequent re-run. This can be 
done with the following Hadoop options:

skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D 
mapred.skip.map.max.skip.records=1"

The timeout mechanism could be interesting, but it is not straightforward to 
implement, so it is probably best to set it aside for now.
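
For the record, the usual way to bolt on such a timeout would be to run the 
parse call in a separate thread and wait on it with a deadline. A rough 
sketch, assuming a Nutch Parser instance and a Content object are in scope 
and that parser.getParse(content) is the potentially hanging call (the 
30-second limit is arbitrary):

    import java.util.concurrent.*;

    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<ParseResult> future = executor.submit(new Callable<ParseResult>() {
      public ParseResult call() throws Exception {
        return parser.getParse(content);  // the potentially hanging parse
      }
    });
    try {
      // give up on any document still parsing after 30 seconds
      ParseResult result = future.get(30, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
      future.cancel(true);  // interrupt the parsing thread
      // record the document as a parse failure and carry on
    } finally {
      executor.shutdownNow();
    }

The catch is that cancel(true) only interrupts the thread: a parser stuck in 
a tight loop that never checks the interrupt flag will keep burning CPU, 
which is part of why this is not as straightforward as it looks.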

> Timeout for Parser
> ------------------
>
>                 Key: NUTCH-696
>                 URL: https://issues.apache.org/jira/browse/NUTCH-696
>             Project: Nutch
>          Issue Type: Wish
>          Components: fetcher
>            Reporter: Julien Nioche
>            Priority: Minor
>
> I found that the parsing sometimes crashes due to a problem with a specific 
> document, which is a shame as this blocks the rest of the segment and 
> Hadoop ends up deciding that the node is not responding. I was wondering 
> whether it would make sense to have a timeout mechanism for the parsing, so 
> that if a document is not parsed after a time t, it is simply treated as an 
> exception and we can get on with the rest of the process.
> Does that make sense? Where do you think we should implement it, in 
> ParseUtil?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
