Hello Experts,

I'm working on a Spark app which reads data from Kafka and persists it in
HBase.
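
For context, the app is roughly along these lines; the ZooKeeper quorum,
consumer group, topic, table and column-family names below are just
placeholders for what I actually use:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaToHBase {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-to-hbase"), Seconds(10))

    // Receiver-based Kafka stream of (key, message) pairs.
    val messages = KafkaUtils.createStream(
      ssc, "zk1:2181", "my-consumer-group", Map("my-topic" -> 1))

    messages.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One HBase table handle per partition.
        val table = new HTable(HBaseConfiguration.create(), "my_table")
        records.foreach { case (key, value) =>
          // Assumes keyed messages; the row-key scheme is a placeholder too.
          val put = new Put(Bytes.toBytes(key))
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("payload"), Bytes.toBytes(value))
          table.put(put)
        }
        table.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}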

The Spark documentation states [1] that in the case of a worker failure we can
lose some data. If that is the case, how can I make my Kafka stream more
reliable? I have seen there is a simple consumer [2], but I'm not sure whether
it has been used or tested extensively.
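
The only receiver-side knob I can see is the storage level passed to
KafkaUtils.createStream. If I'm reading the API right the default is already a
replicated level, so I assume the window described in [1] (data received but
not yet replicated) still applies even with something like this:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// Same placeholders as above. The _2 level asks Spark to keep two copies of
// each received block, which covers losing a worker that holds one copy,
// but not the receiver node dying before the block is replicated.
val replicatedStream = KafkaUtils.createStream(
  ssc, "zk1:2181", "my-consumer-group", Map("my-topic" -> 1),
  StorageLevel.MEMORY_AND_DISK_SER_2)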

I was wondering if there is a way to explicitly acknowledge the Kafka offsets
only once they are replicated in the memory of other worker nodes (if that
isn't already done), to tackle this issue.
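
To make the idea concrete, here is a rough sketch of the kind of receiver I
have in mind, built on the custom Receiver API; fetchBatch() and
commitOffsets() are made-up placeholders around the Kafka consumer API, not
anything that exists today:

import scala.collection.mutable.ArrayBuffer

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Sketch only: commit Kafka offsets only after the blocking store() variant
// has returned, i.e. after Spark has stored the block.
class AckingKafkaReceiver(storageLevel: StorageLevel)
  extends Receiver[String](storageLevel) {

  def onStart(): Unit = {
    new Thread("acking-kafka-receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Shut down the underlying Kafka consumer here.
  }

  private def receive(): Unit = {
    while (!isStopped()) {
      val (records, lastOffset) = fetchBatch()  // hypothetical helper
      store(ArrayBuffer(records: _*))           // blocks until stored by Spark
      commitOffsets(lastOffset)                 // ack only after store() returns
    }
  }

  // Hypothetical helpers; a real version would wrap Kafka's consumer API.
  private def fetchBatch(): (Seq[String], Long) = ???
  private def commitOffsets(offset: Long): Unit = ???
}

Does something like this already happen inside the built-in Kafka receiver, or
would a custom receiver along these lines be the way to go?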

Thanks in advance; any help is appreciated.


   1. *Using any input source that receives data through a network* - For
   network-based data sources like *Kafka* and Flume, the received input
   data is replicated in memory between nodes of the cluster (default
   replication factor is 2). So if a worker node fails, then the system can
   recompute the lost data from the leftover copy of the input data. However,
   if the *worker node where a network receiver was running fails, then a
   tiny bit of data may be lost*, that is, the data received by the system
   but not yet replicated to other node(s). The receiver will be started on a
   different node and it will continue to receive data.
   2. https://github.com/dibbhatt/kafka-spark-consumer

Txz,

Mukesh Jha <me.mukesh....@gmail.com>
