I'm starting with a very simple pipeline that reads from Kafka ->
simple transformation -> GroupByKey -> persist the data.  We are also
applying some simple windowing/triggering that persists the data after
every 100 elements or every 60 seconds, to balance handling slow
trickles of data against holding too much in memory.  For now I'm just
running with the DirectRunner since this is a small processing problem.
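
A rough sketch of the pipeline, with placeholder names (the bootstrap
servers, the topic, and the two inline DoFns are just illustrative, not
our real code):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.AfterFirst;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class KafkaPipelineSketch {
  public static void main(String[] args) {
    // DirectRunner is Beam's default when no runner is configured.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(KafkaIO.<String, String>read()
            .withBootstrapServers("localhost:9092")   // placeholder
            .withTopic("events")                      // placeholder
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())                       // KV<String, String>
        // the "simple transformation" (identity here, as a stand-in)
        .apply(ParDo.of(new DoFn<KV<String, String>, KV<String, String>>() {
          @ProcessElement
          public void process(ProcessContext c) {
            c.output(c.element());
          }
        }))
        // fire a pane after every 100 elements or 60s, whichever comes first
        .apply(Window.<KV<String, String>>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterFirst.of(
                AfterPane.elementCountAtLeast(100),
                AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(60)))))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())                  // don't accumulate panes in memory
        .apply(GroupByKey.<String, String>create())
        // the persist step -- the part whose failures worry us
        .apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, Void>() {
          @ProcessElement
          public void process(ProcessContext c) {
            // write c.element() to our store here
          }
        }));

    p.run().waitUntilFinish();
  }
}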

Because the persist step can fail, we want to ensure that the Kafka
offsets are not advanced until the data has been successfully
persisted.  Looking at KafkaIO, it seems like our two options for
persisting offsets are (sketched after the list):
* Kafka's enable.auto.commit
* KafkaUnboundedSource checkpointing.
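
As far as I can tell, the first option is just a consumer property
handed through KafkaIO -- a minimal sketch, reusing the same placeholder
topic/servers as above (updateConsumerProperties is the hook in the
KafkaIO version we're on; I believe newer releases call it
withConsumerConfigUpdates):

import com.google.common.collect.ImmutableMap;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

// Option 1: let the underlying Kafka consumer auto-commit offsets on a timer.
KafkaIO.Read<String, String> read =
    KafkaIO.<String, String>read()
        .withBootstrapServers("localhost:9092")   // placeholder
        .withTopic("events")                      // placeholder
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .updateConsumerProperties(ImmutableMap.<String, Object>of(
            ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true"));

// Option 2 needs no configuration here: KafkaUnboundedSource snapshots the
// consumed offsets into a KafkaCheckpointMark, which the runner persists
// and hands back on restore, so reading resumes from the checkpointed
// position.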

The first option would commit offsets prematurely, before we could
guarantee the data was persisted.  Unfortunately I can't find many
details about the checkpointing, so I was wondering whether there is a
way to configure or tune it more appropriately.
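
The closest knob I've spotted so far is commitOffsetsInFinalize(),
which (if I'm reading it right, and assuming the Beam version we're on
exposes it) commits offsets back to Kafka only when the runner
finalizes a checkpoint, rather than on the consumer's timer:

import com.google.common.collect.ImmutableMap;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

// Commit offsets when the runner finalizes a checkpoint, not on a timer.
// A group.id is required here, since the commit goes back to Kafka.
KafkaIO.Read<String, String> read =
    KafkaIO.<String, String>read()
        .withBootstrapServers("localhost:9092")   // placeholder
        .withTopic("events")                      // placeholder
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .updateConsumerProperties(ImmutableMap.<String, Object>of(
            ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group"))  // placeholder
        .commitOffsetsInFinalize();

Even then, I'm not sure checkpoint finalization implies our persist
step actually ran, which is really the guarantee I'm after.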

Specifically, I'm hoping to understand the flow so I can rely on the
built-in KafkaIO functionality without having to write our own offset
management.  Or is it more common to write your own?

Thanks,
Micah
