Hello Experts,

I'm working on a Spark app which reads data from Kafka and persists it in HBase.

The Spark documentation states *[1]* that we can lose some data if a worker fails. If that is the case, how can I make my Kafka stream more reliable? I have seen there is a simple consumer *[2]*, but I'm not sure whether it has been used or tested extensively.

I was wondering if there is a way to explicitly acknowledge the Kafka offsets once the data has been replicated in the memory of other worker nodes (if that's not already done) to tackle this issue.

Thanks in advance for any help.

1. *Using any input source that receives data through a network* - For network-based data sources like *Kafka* and Flume, the received input data is replicated in memory between nodes of the cluster (default replication factor is 2). So if a worker node fails, then the system can recompute the lost data from the leftover copy of the input data. However, if the *worker node where a network receiver was running fails, then a tiny bit of data may be lost*, that is, the data received by the system but not yet replicated to other node(s). The receiver will be started on a different node and it will continue to receive data.

2. https://github.com/dibbhatt/kafka-spark-consumer

Thanks,
*Mukesh Jha <me.mukesh....@gmail.com>*
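
P.S. To make the idea concrete, here is a toy sketch of what I mean by acknowledging offsets only after replication. It does not use any real Kafka or Spark API: `ToyReceiver`, `run`, and the list standing in for a Kafka partition are all hypothetical names made up for illustration. It only models the window described in [1], where data has been received but not yet replicated when the receiver's worker dies.

```python
# Toy model only: no real Kafka or Spark API is used here.
class ToyReceiver:
    """Simulates a receiver that fetches records and commits offsets."""

    def __init__(self, log, ack_after_replication):
        self.log = log                  # stand-in for one Kafka partition
        self.ack_after_replication = ack_after_replication
        self.position = 0               # next offset to fetch
        self.committed = 0              # offset a restarted receiver resumes from
        self.buffer = []                # received but not yet replicated
        self.replicated = []            # copies held by another worker

    def receive(self, n):
        batch = self.log[self.position:self.position + n]
        self.buffer.extend(batch)
        self.position += len(batch)
        if not self.ack_after_replication:
            self.committed = self.position  # acknowledge as soon as data arrives

    def replicate(self):
        self.replicated.extend(self.buffer)
        self.buffer.clear()
        if self.ack_after_replication:
            self.committed = self.position  # acknowledge only once a copy is safe

    def crash_and_restart(self):
        self.buffer.clear()                 # unreplicated data dies with the worker
        self.position = self.committed      # new receiver resumes from last ack


def run(ack_after_replication):
    """Return the records that never reached the replicated store."""
    log = list(range(10))
    receiver = ToyReceiver(log, ack_after_replication)
    receiver.receive(4)
    receiver.replicate()
    receiver.receive(3)           # offsets 4..6 received, not yet replicated
    receiver.crash_and_restart()  # the receiver's worker fails here
    receiver.receive(6)
    receiver.replicate()
    return [record for record in log if record not in receiver.replicated]


print("ack on receipt, lost:", run(False))
print("ack after replication, lost:", run(True))
```

In this toy model, acknowledging on receipt loses the records that were in flight at the time of the crash, while acknowledging after replication re-reads them from the committed offset. Whether the real receiver can be made to behave like the second case is exactly what I'm asking.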