Github user ottobackwards commented on the issue:
https://github.com/apache/metron/pull/741
@justinleet My question about the necessity of a refactor came from
wanting to handle the exception in the HDFSWriter, where the exceptions
were being caught at the time.
The fact that the tuples are 'unpaired' from their messages, and that we
handle them all together, seems problematic to me if you want to take
per-message or per-source-handle action.
I think your approach removes the immediate problem, although having both
the HDFS writer and the source handler take action, splitting the
'ownership' between them, doesn't feel quite right to me.
@cestella It would be good to know why this is happening, but in truth, any
persistent network connection in a long-lived 'forever' type application needs
to guard against these kinds of errors.
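To illustrate the kind of guard I mean, here is a minimal sketch of retrying an operation over a flaky connection with bounded exponential backoff. This is purely illustrative; the `withRetry` helper and the simulated failure are my own assumptions, not Metron or HDFS APIs:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetryGuard {
    /** Run op, retrying transient failures up to maxAttempts with exponential backoff. */
    public static <T> T withRetry(Callable<T> op, int maxAttempts, long baseBackoffMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {                      // transient network-style failure
                last = e;
                Thread.sleep(baseBackoffMs << (attempt - 1)); // 1x, 2x, 4x, ...
            }
        }
        throw last; // retry budget exhausted: surface the failure to the caller
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulate a connection that fails twice, then recovers.
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new IOException("connection reset");
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The point is that once the budget is exhausted the error must still propagate, so the caller can take the per-source action discussed above rather than the failure being swallowed.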
For reliability reasons, and to get where we want to go, we need clearer
documentation of the failure and recovery states of the writers, especially
the HDFS writer since we are batching. We also need to understand all the
ways the stream can fail, to the extent that that is possible.
I am not sure that limiting it to this one case is enough; there will
still be many ways to end up in a very unclear situation where the
writer is failing continuously but the source handler is not removed from
'service'. Users, as we have seen on the list, will be left to dig through
the system to work their way to this problem.
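One possible shape for "removing a source handler from service" is a simple per-source circuit breaker that trips after N consecutive write failures. This is a hypothetical sketch of the pattern only; the class and method names are illustrative assumptions, not existing Metron code:

```java
// Trips after N consecutive write failures so a continuously failing
// writer is taken out of service instead of retrying forever.
public class WriterCircuitBreaker {
    private final int maxConsecutiveFailures;
    private int consecutiveFailures = 0;
    private boolean open = false; // open = source handler removed from service

    public WriterCircuitBreaker(int maxConsecutiveFailures) {
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /** Record the outcome of one write attempt. */
    public void record(boolean success) {
        if (success) {
            consecutiveFailures = 0;      // any success resets the streak
        } else if (++consecutiveFailures >= maxConsecutiveFailures) {
            open = true;                  // stop routing to this writer; alert operators here
        }
    }

    public boolean isOpen() {
        return open;
    }

    public static void main(String[] args) {
        WriterCircuitBreaker breaker = new WriterCircuitBreaker(3);
        breaker.record(false);
        breaker.record(false);
        System.out.println(breaker.isOpen()); // still closed after 2 failures
        breaker.record(false);
        System.out.println(breaker.isOpen()); // open after the 3rd consecutive failure
    }
}
```

Tripping the breaker is also the natural hook for the alerting strategy mentioned below, since it marks the transition from transient errors to a sustained failure.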
Failure to store is a critical failure of the system, especially in
systems with data retention rules or SLAs on data loss. So in addition to
handling this, we need an alerting strategy.
While this fix (questions pending) improves the symptom, it does not
address the higher-level issue or the severity of this problem.
---