No, you really shouldn't rely on checkpoints if you can't afford to
reprocess from the beginning of your retention (or to lose data and
restart from the latest messages).
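
The usual alternative is to track the offsets yourself instead of
depending on the checkpoint. A minimal sketch, assuming the 0.8 direct
stream API; stream is the DStream returned by createDirectStream, and
saveOffsets is a placeholder for whatever store you use:

import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  // This cast only works on the RDD that comes straight out of
  // createDirectStream, before any transformation loses partition info.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the batch ...

  // saveOffsets is hypothetical: persist topic / partition / untilOffset
  // to ZK, a database, etc., ideally atomically with your results.
  offsetRanges.foreach { o => saveOffsets(o.topic, o.partition, o.untilOffset) }
}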

If you're in a real bind, you might be able to get something out of
the serialized data in the checkpoint, but it'd probably be easier and
faster to just grep through the output logs for the last processed
offsets (assuming you're logging at INFO level). Look for "Computing
topic "; each of those lines should give you the topic, partition, and
the offset range the batch was computed over.

On Mon, Aug 15, 2016 at 4:14 PM, Shifeng Xiao <xiaoshife...@gmail.com> wrote:
> Hi folks,
>
> We are using Kafka + Spark Streaming in our data pipeline, but
> sometimes we have to clean up the checkpoint in HDFS before we restart
> the Spark Streaming application, otherwise the application fails to
> start.
>
> That means we lose data whenever we clean up the checkpoint. Is there
> a way to read the Kafka offsets from the checkpoint, so that we can
> resume processing from those offsets and avoid losing data?
>
> Thanks
