Re: Zero Data Loss in Spark with Kafka

2016-10-26 Thread Cody Koeninger
Honestly, I would stay far away from saving offsets in Zookeeper if at all possible. It's better to store them alongside your results. On Wed, Oct 26, 2016 at 10:44 AM, Sunita Arvind wrote: > This is enough to get it to work: > >
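Cody's suggestion of storing offsets alongside the results can be sketched roughly as below. This is a hedged illustration, not code from the thread: it assumes the Spark 1.6-era direct stream API (spark-streaming-kafka, Kafka 0.8), and the broker address, topic name, and output path are made up for the example.

```scala
// Sketch: commit offsets together with results by encoding them in the
// output path, so a restart can recover the last written offsets by
// listing the output directory. Broker, topic, and path are assumptions.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OffsetsWithResults {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("offsets-with-results"), Seconds(30))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // assumed
    val stream = KafkaUtils.createDirectStream[String, String,
      StringDecoder, StringDecoder](ssc, kafkaParams, Set("mytopic")) // assumed topic

    stream.foreachRDD { rdd =>
      // Direct-stream RDDs carry their Kafka offset ranges.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val suffix = offsetRanges
        .map(o => s"${o.topic}-${o.partition}-${o.untilOffset}")
        .mkString("_")
      // Results and offsets land in one atomic write (one directory).
      rdd.map(_._2).saveAsTextFile(s"/data/output/batch_$suffix") // assumed path
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
```

This is the same pattern Sunita arrives at later in the thread with the offset-suffixed Parquet path.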

Re: Zero Data Loss in Spark with Kafka

2016-10-26 Thread Sunita Arvind
This is enough to get it to work: df.save(conf.getString("ParquetOutputPath")+offsetSaved, "parquet", SaveMode.Overwrite) And tests so far (in local env) seem good with the edits. Yet to test on the cluster. Cody, I'd appreciate your thoughts on the edits. Just want to make sure I am not doing an

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
The error in the file I just shared is here: val partitionOffsetPath:String = topicDirs.consumerOffsetDir + "/" + partition._2(0); --> this was previously just partition, and hence there was an error fetching the offset. Still testing. Somehow, Cody, your code never led to a "file already exists" sort of
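The one-character bug above comes down to building the ZK path from the partition id rather than the whole partition tuple. A small sketch of the intended construction, with the group and topic names assumed, using the Kafka 0.8 ZKGroupTopicDirs helper the snippet's consumerOffsetDir comes from:

```scala
// Corrected path construction (group/topic names are assumptions).
import kafka.utils.ZKGroupTopicDirs

val topicDirs = new ZKGroupTopicDirs("my-consumer-group", "mytopic")
// In the original code, partition was a pair; the partition id lives at
// partition._2(0). Interpolating the whole pair produced a bogus ZK path.
val partitionId = 0 // stands in for partition._2(0)
val partitionOffsetPath: String = topicDirs.consumerOffsetDir + "/" + partitionId
```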

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
Attached is the edited code. Am I heading in the right direction? Also, I am missing something due to which it seems to work well as long as the application is running and the files are created right. But as soon as I restart the application, it goes back to fromOffset as 0. Any thoughts? regards

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
Thanks for confirming, Cody. To get to use the library, I had to do: val offsetsStore = new ZooKeeperOffsetsStore(conf.getString("zkHosts"), "/consumers/topics/"+ topics + "/0") It worked well. However, I had to specify the partitionId in the zkPath. If I want the library to pick all the
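Reading offsets for all partitions, instead of hardcoding "/0", can be sketched by listing the children of the topic node and reading each one. This is a hedged sketch, not the library's API: it assumes the ZkClient used by that stack and a /consumers/topics/&lt;topic&gt;/&lt;partitionId&gt; layout matching the zkPath above.

```scala
// Sketch: gather per-partition offsets from ZK. The path layout and the
// "missing node means start at 0" policy are assumptions for illustration.
import org.I0Itec.zkclient.ZkClient
import scala.collection.JavaConverters._

def readAllOffsets(zk: ZkClient, topic: String): Map[Int, Long] = {
  val base = s"/consumers/topics/$topic"
  val partitionIds = zk.getChildren(base).asScala.map(_.toInt)
  partitionIds.map { p =>
    val stored = Option(zk.readData[String](s"$base/$p", true)) // true: no exception if absent
    p -> stored.map(_.toLong).getOrElse(0L)
  }.toMap
}
```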

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Cody Koeninger
You are correct that you shouldn't have to worry about broker id. I'm honestly not sure specifically what else you are asking at this point. On Tue, Oct 25, 2016 at 1:39 PM, Sunita Arvind wrote: > Just re-read the kafka architecture. Something that slipped my mind is, it

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
Just re-read the Kafka architecture. Something that slipped my mind is that it is leader-based, so a topic/partitionId pair will be the same on all the brokers. So we do not need to consider brokerId while storing offsets. Still exploring the rest of the items. regards Sunita On Tue, Oct 25, 2016 at 11:09 AM,

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
Hello Experts, I am trying to use the saving-to-ZK design. Just saw Sudhir's comments that it is an old approach. Any reasons for that? Any issues observed with saving to ZK? The way we are planning to use it is: 1. Following http://aseigneurin.github.io/2016/05/07/spark-kafka-

Re: Zero Data Loss in Spark with Kafka

2016-08-23 Thread Cody Koeninger
See https://github.com/koeninger/kafka-exactly-once On Aug 23, 2016 10:30 AM, "KhajaAsmath Mohammed" wrote: > Hi Experts, > > I am looking for some information on how to achieve zero data loss while > working with kafka and Spark. I have searched online and blogs have >

Re: Zero Data Loss in Spark with Kafka

2016-08-23 Thread Sudhir Babu Pothineni
saving offsets to ZooKeeper is an old approach; checkpointing internally saves the offsets to HDFS/the checkpointing location. more details here: http://spark.apache.org/docs/latest/streaming-kafka-integration.html On Tue, Aug 23, 2016 at 10:30 AM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com>
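The checkpoint-based approach Sudhir refers to hinges on StreamingContext.getOrCreate: on a clean start it builds a fresh context, and after a failure it restores the context (including direct-stream Kafka offsets) from the checkpoint directory. A minimal sketch, with the checkpoint path assumed and the stream setup elided:

```scala
// Sketch: recoverable streaming context via checkpointing.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/myapp" // assumed location

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(
    new SparkConf().setAppName("checkpointed-app"), Seconds(30))
  ssc.checkpoint(checkpointDir)
  // ... set up the Kafka direct stream and outputs here ...
  ssc
}

// First run: calls createContext. After a crash: rebuilds from checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```

Note the thread's caveat: checkpoints are not portable across application code changes, which is one reason Cody recommends storing offsets with the results instead.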

Zero Data Loss in Spark with Kafka

2016-08-23 Thread KhajaAsmath Mohammed
Hi Experts, I am looking for some information on how to achieve zero data loss while working with kafka and Spark. I have searched online and blogs have different answers. Please let me know if anyone has an idea on this. Blog 1: