Besides that, we should maybe add some kind of timeout to the URL filter
in general.
I think this is overkill. There is already a Hadoop task timeout.
Is that not sufficient?
No! What happens is that the URL filter hangs and then the complete
task is timed out instead of just skipping this one URL.
After 4 retries the whole job is killed and all fetched data are
lost, in my case 5 million URLs each time. :-(
This was the real cause of the problem described on hadoop-dev.
Instead, I would suggest going a step further by adding a (configurable)
timeout mechanism and skipping bad records during reducing in general.
Processing such big data and losing everything because of just one bad
record is very sad.
As far as I know, Google's MapReduce also skips bad records.
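
To make the filter timeout idea concrete, here is a minimal sketch of what I
have in mind: run the existing filter in a worker thread and give up on a
single URL if it does not answer within a configurable timeout. The names
TimedUrlFilter and UrlFilter below are just illustrative, not existing Nutch
or Hadoop APIs.

import java.util.concurrent.*;

// Illustrative wrapper: runs a URL filter in a worker thread and gives up
// on a single URL if the filter does not return within the timeout.
public class TimedUrlFilter {

  /** Minimal stand-in for a URL filter interface (illustrative only). */
  public interface UrlFilter {
    String filter(String url);
  }

  private final ExecutorService pool = Executors.newSingleThreadExecutor();
  private final long timeoutMillis;

  public TimedUrlFilter(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  /**
   * Applies the given filter to one URL. Returns the filtered URL, or null
   * (meaning "skip this URL") if the filter hangs longer than the timeout.
   */
  public String filter(final UrlFilter filter, final String url) {
    Future<String> result = pool.submit(new Callable<String>() {
      public String call() {
        return filter.filter(url);
      }
    });
    try {
      return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      result.cancel(true);   // try to interrupt the hanging filter thread
      return null;           // skip only this URL, keep the task alive
    } catch (Exception e) {
      return null;           // treat other filter failures as "skip" too
    }
  }
}

One caveat: if cancel(true) cannot actually interrupt the hanging filter
(for example a regex loop that ignores interrupts), the worker thread stays
blocked, so in practice the pool would need to be replaced after a timeout.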
Stefan