Hi All,

One of our users is outputting to Cassandra, and they want to handle a Cassandra failure or downtime gracefully from an output operator. Currently many of our database output operators will simply fail and redeploy continually until the database comes back. This is a bad idea for a couple of reasons:
1 - We rely on buffer server spooling to prevent data loss. If the database is down for a long time (several hours or a day), we may run out of space for buffer server spooling, since it spools to local disk and data is purged only after a window is committed. Furthermore, this buffer server problem will exist for all the Streaming Containers in the DAG, not just the one immediately upstream of the output operator, since data is spooled to disk for all operators and removed only once a window is committed.

2 - If there is another failure further upstream in the DAG, upstream operators will be redeployed to a checkpoint less than or equal to the checkpoint of the database operator in the at-least-once case. This could mean redoing several hours' or even a day's worth of computation.

We should support a mechanism that detects when the connection to a database is lost, spools tuples to HDFS using a write-ahead log (WAL), and then writes the contents of the WAL into the database once it comes back online. This would free the local disk on all the nodes in the DAG, consuming extra space only for the data actually destined for the output operator.

Ticket here if anyone is interested in working on it: https://malhar.atlassian.net/browse/MLHR-1951

Thanks,
Tim
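To make the proposal concrete, here is a minimal sketch of the fallback pattern in plain Java. This is not Malhar code: the `WalFallbackWriter` and `Database` names are hypothetical, the "HDFS" WAL is just a local file, and real failure detection, batching, and idempotent replay are all omitted. It only illustrates the shape: write through when the database is up, append to a WAL when it is down, and drain the WAL in order on reconnect.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Illustrative sketch of a WAL-backed output writer. While the database
 * is unreachable, tuples are appended to a write-ahead log (a local file
 * here, standing in for HDFS); once it reconnects, the WAL is replayed
 * into the database before new tuples are written.
 */
class WalFallbackWriter {
    /** Stand-in for a Cassandra session; name and API are hypothetical. */
    interface Database {
        boolean isUp();
        void write(String tuple) throws IOException;
    }

    private final Database db;
    private final Path walPath;

    WalFallbackWriter(Database db, Path walPath) {
        this.db = db;
        this.walPath = walPath;
    }

    /** Write directly when possible; otherwise spool to the WAL. */
    void process(String tuple) throws IOException {
        if (db.isUp()) {
            drainWal();  // replay spooled tuples first, preserving order
            db.write(tuple);
        } else {
            Files.write(walPath, (tuple + "\n").getBytes(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }

    /** Replay the WAL into the database, then discard it. */
    void drainWal() throws IOException {
        if (!Files.exists(walPath)) {
            return;
        }
        for (String line : Files.readAllLines(walPath)) {
            db.write(line);
        }
        Files.delete(walPath);
    }
}
```

In a real operator the drain would also need to cooperate with checkpointing so that replayed tuples are not duplicated beyond the at-least-once guarantee, but the control flow above is the core of the idea.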
