Github user ottobackwards commented on the issue:

    https://github.com/apache/metron/pull/741
  
    @justinleet My question about whether a refactor is necessary came 
from wanting to handle the exception in the HDFSWriter, where the exceptions 
were being caught at the time.
    
    The fact that the tuples are 'unpaired' from their messages, and that we 
handle them all together, seems problematic to me if you want to take per-message 
or per-source-handle action.
    
    I think your approach removes the immediate problem with that, although 
having both the HDFS writer and the source handler itself take action, splitting 
the 'ownership', doesn't feel quite right to me.
    
    @cestella It would be good to know why this is happening, but in truth, any 
persistent network connection in a long-lived, 'forever' type application needs 
to guard against these kinds of errors.
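    To illustrate the kind of guard I mean: a minimal, hypothetical sketch (not Metron code; `withRetries`, the `reconnect` hook, and `maxAttempts` are all names I'm inventing here) of wrapping a write so a broken connection is rebuilt and retried before the failure is surfaced to the caller:

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: retry a write, rebuilding the connection between
// attempts, and rethrow the last failure so the caller can alert or fail
// the tuple explicitly. Assumes maxAttempts >= 1.
public class RetryingWriter {
    public static <T> T withRetries(Callable<T> action,
                                    Runnable reconnect,
                                    int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                // Tear down and recreate the (possibly stale) connection
                // before the next attempt, e.g. reopen the HDFS stream.
                reconnect.run();
            }
        }
        throw last; // surface the failure rather than swallowing it
    }
}
```

    The point is not this exact shape, but that recovery (reconnect/retry) and escalation (rethrow, alert) are explicit, separate steps rather than an exception being caught and dropped somewhere in the middle of the batch path.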
    
    For reliability reasons, and given where we want to get to, we need clearer 
documentation of the failure and recovery states of the writers, especially the 
HDFS writer since we are batching.  We also need to understand all the ways the 
stream can fail, to the extent that that is possible.
    
    I am not sure that limiting it to this one case is enough; there will 
still be many ways to end up in a very unclear situation where the writer is 
failing continuously but the source handler is not removed from 'service'.  
Users, as we have seen on the list, will be left to trace through the system to 
work their way to this problem.
    
    Failure to store is a critical failure of the system, especially in systems 
with data retention rules or SLAs on data loss.  Thus, in addition to handling 
this, we need an alerting strategy.
    
    While this fix (questions pending) is an improvement on the symptom, it 
does not address the higher-level issue or the severity of this problem.

