Hi folks, we are running a pipeline and, as the subject says, we are having issues with data loss.
We are using KafkaIO (2.0.4, due to the version of our brokers) with commitOnFinalize, and we would like to understand how this finalization works together with a FileIO. I studied the KafkaIO code and saw that the records are committed to Kafka inside the consumerPollLoop method only when a checkpoint is produced. But when is this checkpoint produced? And how does it cope with windowed data and a FileIO that produces files?

When running with Spark, our batchInterval is 30 seconds, and the pipeline has a fixed window of 1 hour for FileIO to write to HDFS. We are constantly restarting the pipeline (1 to 3 times a day, or YARN reaches its maximum restart attempts and then kills it completely due to network interruptions), and we have detected that we have missing data on HDFS.

Initially we were running without specifying a checkpoint directory (SparkRunner), and we found that on each deployment a random directory was generated under /tmp. Recently we started to use a fixed directory for checkpoints (via --checkpointDir on the Spark runner), but we still have doubts that this will completely solve our data-loss problems when restarting the pipeline multiple times a day (or is our assumption incorrect?).

Please advise whether this use case (data ingestion to HDFS) is something Beam can achieve without losing data from KafkaIO.

Thanks,
JC
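For context, here is a simplified sketch of the shape of the pipeline (broker addresses, topic, output path, and serde choices are illustrative placeholders, not our exact code; "commitOnFinalize" above refers to KafkaIO's commitOffsetsInFinalize() setting):

```java
// Simplified sketch of the pipeline described above; names and paths
// are illustrative only.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class KafkaToHdfs {
  public static void main(String[] args) {
    // Options come from the command line, including --runner=SparkRunner
    // and (now) --checkpointDir=<fixed HDFS path>.
    Pipeline p = Pipeline.create();

    p.apply(KafkaIO.<String, String>read()
            .withBootstrapServers("broker1:9092")           // placeholder
            .withTopic("events")                            // placeholder
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            // Offsets are committed back to Kafka only when the runner
            // finalizes a checkpoint -- this is the behavior we are
            // asking about.
            .commitOffsetsInFinalize()
            .withoutMetadata())
     .apply(Values.<String>create())
     // 1-hour fixed windows, matching the windowed FileIO write below.
     .apply(Window.<String>into(FixedWindows.of(Duration.standardHours(1))))
     .apply(FileIO.<String>write()
            .via(TextIO.sink())
            .to("hdfs://namenode/ingest/output")            // placeholder
            .withNumShards(1));                             // required for unbounded input

    p.run().waitUntilFinish();
  }
}
```

Our question is essentially about the interplay between the commitOffsetsInFinalize() call and the windowed FileIO write: if the pipeline is killed between the file being written and the checkpoint being finalized (or vice versa), can records be lost?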