Hi All,

One of our users is outputting to Cassandra, but they want to handle a
Cassandra failure or Cassandra down time gracefully from an output
operator. Currently many of our database operators will simply fail and
redeploy continually until the database comes back. This is a bad idea for
a couple of reasons:

1 - We rely on buffer server spooling to prevent data loss. If the database
is down for a long time (several hours or a day), we may run out of local
disk space for buffer server spooling, since data is purged only after a
window is committed. Furthermore, this buffer server problem will exist for
all the Streaming Containers in the dag, not just the one immediately
upstream from the output operator, since data is spooled to disk for all
operators and removed for a window only once that window is committed.

2 - If there is another failure further upstream in the dag, upstream
operators will be redeployed to a checkpoint less than or equal to the
checkpoint of the database operator in the at-least-once case. This could
mean redoing several hours' or a day's worth of computation.

We should support a mechanism to detect when the connection to a database
is lost, spool to HDFS using a WAL, and then write the contents of the WAL
into the database once it comes back online. This saves local disk space on
all the nodes in the dag, since only the data destined for the output
operator needs to be retained, in the WAL, while the database is down.
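To make the idea concrete, here is a minimal sketch of that failover
behavior. This is illustrative only, not Malhar API: the Sink interface
stands in for the Cassandra connection, and an in-memory deque stands in
for the HDFS WAL.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the proposed behavior: try the database first; if the
// connection is down, spool tuples to a WAL instead of failing the
// operator, and replay the WAL once the database is reachable again.
public class FailoverOutput {
    /** Stand-in for the Cassandra connection; throws while the DB is down. */
    interface Sink { void write(String tuple) throws Exception; }

    private final Sink db;
    private final Deque<String> wal = new ArrayDeque<>(); // stand-in for the HDFS WAL

    FailoverOutput(Sink db) { this.db = db; }

    /** Try the database; on failure, append to the WAL rather than redeploying. */
    void process(String tuple) {
        try {
            replayWal();        // drain any backlog first so ordering is preserved
            db.write(tuple);
        } catch (Exception down) {
            wal.addLast(tuple); // connection lost: spool instead of failing
        }
    }

    /** Replay spooled tuples once the database is back online. */
    void replayWal() throws Exception {
        while (!wal.isEmpty()) {
            db.write(wal.peekFirst()); // write before removing so a failure keeps the tuple
            wal.removeFirst();
        }
    }

    int backlog() { return wal.size(); }
}
```

A real operator would additionally need to make the WAL writes idempotent
or tie them to window commits so replay does not duplicate data.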

Ticket here if anyone is interested in working on it:

https://malhar.atlassian.net/browse/MLHR-1951

Thanks,
Tim