Besides that, we should maybe add some kind of timeout to the URL filter
in general.
I think this is overkill. There is already a Hadoop task timeout.
Is that not sufficient?
No! What happens is that the URL filter hangs and then the complete
task is timed out instead of just skipping this one URL.
After 4 retries the whole job is killed and all fetched data are
lost, in my case 5 million URLs each time. :-(
This was the real cause of the problem described on hadoop-dev.
Instead, I would suggest going a step further by adding a (configurable)
timeout mechanism and skipping bad records during reducing in general.
Processing such big data and losing everything because of just one bad
record is very sad.
As far as I know, Google's MapReduce also skips bad records.
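
To make the filter timeout idea concrete, here is a minimal sketch of what I
have in mind: run the existing filter in a worker thread and give up on a
single URL if it does not answer within a configurable timeout. The names
TimedUrlFilter and UrlFilter below are just illustrative, not existing Nutch
or Hadoop APIs.

import java.util.concurrent.*;

// Illustrative wrapper: runs a URL filter in a worker thread and gives up
// on a single URL if the filter does not return within the timeout.
public class TimedUrlFilter {

  /** Minimal stand-in for a URL filter interface (illustrative only). */
  public interface UrlFilter {
    String filter(String url);
  }

  private final ExecutorService pool = Executors.newSingleThreadExecutor();
  private final long timeoutMillis;

  public TimedUrlFilter(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  /**
   * Applies the given filter to one URL. Returns the filtered URL, or null
   * (meaning "skip this URL") if the filter hangs longer than the timeout.
   */
  public String filter(final UrlFilter filter, final String url) {
    Future<String> result = pool.submit(new Callable<String>() {
      public String call() {
        return filter.filter(url);
      }
    });
    try {
      return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      result.cancel(true);   // try to interrupt the hanging filter thread
      return null;           // skip only this URL, keep the task alive
    } catch (Exception e) {
      return null;           // treat other filter failures as "skip" too
    }
  }
}

One caveat: if cancel(true) cannot actually interrupt the hanging filter
(for example a regex loop that ignores interrupts), the worker thread stays
blocked, so in practice the pool would need to be replaced after a timeout.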
Stefan