Re: Easy way to get offset metadata with Spark Streaming API

2017-09-15 Thread Dmitry Naumenko
Nice, thanks again, Michael, for helping out. Dmitry 2017-09-14 21:37 GMT+03:00 Michael Armbrust: > Yep, that is correct. You can also use the query ID which is a GUID that > is stored in the checkpoint and preserved across restarts if you want to > distinguish the

Re: Easy way to get offset metadata with Spark Streaming API

2017-09-14 Thread Michael Armbrust
Yep, that is correct. You can also use the query ID, which is a GUID that is stored in the checkpoint and preserved across restarts, if you want to distinguish the batches from different streams: sqlContext.sparkContext.getLocalProperty(StreamExecution.QUERY_ID_KEY). This was added recently
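
A minimal sketch of how that property can be read from inside a custom Sink (assuming Spark 2.2+; StreamExecution lives in org.apache.spark.sql.execution.streaming, and the println stands in for whatever the sink actually does):

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.execution.streaming.{Sink, StreamExecution}

    class QueryAwareSink(sqlContext: SQLContext) extends Sink {
      override def addBatch(batchId: Long, data: DataFrame): Unit = {
        // The stream execution thread sets the query ID as a local property
        // on the SparkContext, so it is visible here on the driver.
        val queryId = sqlContext.sparkContext.getLocalProperty(StreamExecution.QUERY_ID_KEY)
        // (queryId, batchId) uniquely identifies a batch even when several
        // streaming queries share the same downstream store.
        println(s"query=$queryId batch=$batchId rows=${data.count()}")
      }
    }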

Re: Easy way to get offset metadata with Spark Streaming API

2017-09-13 Thread Michael Armbrust
I think the right way to look at this is that the batchId is just a proxy for offsets that is agnostic to what type of source you are reading from (or how many sources there are). We might call into a custom sink with the same batchId more than once, but it will always contain the same data (there is
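
A rough sketch of what that contract allows on the sink side (the bookkeeping here is deliberately simplified and the write helper is hypothetical; in practice the last committed batchId would be persisted in the external store itself):

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.execution.streaming.Sink

    class IdempotentSink(sqlContext: SQLContext) extends Sink {
      // In a real sink this would be loaded from / stored in the external system.
      @volatile private var lastCommittedBatchId: Long = -1L

      override def addBatch(batchId: Long, data: DataFrame): Unit = {
        // The same batchId may be offered more than once (e.g. after a restart),
        // but it always carries the same data, so a replay can simply be skipped.
        if (batchId > lastCommittedBatchId) {
          writeToExternalStore(data)       // hypothetical downstream write
          lastCommittedBatchId = batchId   // persist atomically with the write in practice
        }
      }

      private def writeToExternalStore(data: DataFrame): Unit = {
        data.collect()  // placeholder; a real sink would write these rows out
      }
    }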

Re: Easy way to get offset metadata with Spark Streaming API

2017-09-13 Thread Dmitry Naumenko
Thanks, I see. However, I guess reading from the checkpoint directory might be less efficient than just preserving the offsets in the Dataset. I have one more question about operation idempotence (hope it helps others get a clear picture). If I read offsets on restart from the RDBMS and manually
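
On the "preserving offsets in the Dataset" side: with the Kafka source, topic, partition and offset are already columns of the streaming DataFrame, so they can be carried along with the payload. A small sketch (broker address and topic name are placeholders, and the spark-sql-kafka-0-10 package must be on the classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("offsets-in-dataset").getOrCreate()

    // Each record from the Kafka source carries its own topic/partition/offset,
    // so the offsets travel with the data instead of being looked up separately.
    val withOffsets = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder
      .option("subscribe", "events")                        // placeholder
      .load()
      .selectExpr("topic", "partition", "offset", "CAST(value AS STRING) AS value")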

Re: Easy way to get offset metadata with Spark Streaming API

2017-09-12 Thread Michael Armbrust
In the checkpoint directory there is a file /offsets/$batchId that holds the offsets serialized as JSON. I would not consider this a public stable API though. Really the only important thing to get exactly once is that you must ensure whatever operation you are doing downstream is idempotent
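
For illustration only (per the caveat above, this is not a public, stable API and the file layout may change between Spark versions), the offsets file for a given batch can be inspected roughly like this; the checkpoint path and batch id are placeholders:

    import scala.io.Source

    val checkpointDir = "/tmp/my-query-checkpoint"  // placeholder: the query's checkpointLocation
    val batchId = 42L                               // placeholder

    // The file starts with a version marker and (in newer versions) a metadata
    // line, followed by one line of JSON-serialized offsets per source.
    val lines = Source.fromFile(s"$checkpointDir/offsets/$batchId").getLines().toList
    lines.foreach(println)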

Re: Easy way to get offset metadata with Spark Streaming API

2017-09-12 Thread Dmitry Naumenko
Thanks for the response, Michael. > You should still be able to get exactly once processing by using the batchId that is passed to the Sink. Could you explain this in more detail, please? Is there some kind of offset manager API that works as a get-offset-by-batch-id lookup table? Dmitry 2017-09-12

Re: Easy way to get offset metadata with Spark Streaming API

2017-09-12 Thread Michael Armbrust
I think that we are going to have to change the Sink API as part of SPARK-20928, which is why I linked these tickets together. I'm still targeting an initial version for Spark 2.3, which should happen sometime towards the end of the year.

Re: Easy way to get offset metadata with Spark Streaming API

2017-09-12 Thread Dmitry Naumenko
Thanks, Cody. Unfortunately, it seems there is no active development right now. Maybe I can step in and help with it somehow? Dmitry 2017-09-11 21:01 GMT+03:00 Cody Koeninger: > https://issues-test.apache.org/jira/browse/SPARK-18258 > > On Mon, Sep 11, 2017 at 7:15

Re: Easy way to get offset metadata with Spark Streaming API

2017-09-11 Thread Cody Koeninger
https://issues-test.apache.org/jira/browse/SPARK-18258 On Mon, Sep 11, 2017 at 7:15 AM, Dmitry Naumenko wrote: > Hi all, > > It started as a discussion in > https://stackoverflow.com/questions/46153105/how-to-get-kafka-offsets-with-spark-structured-streaming-api. > > So

Easy way to get offset metadata with Spark Streaming API

2017-09-11 Thread Dmitry Naumenko
Hi all, It started as a discussion in https://stackoverflow.com/questions/46153105/how-to-get-kafka-offsets-with-spark-structured-streaming-api. So the problem is that there is no support in the public API to obtain the Kafka (or Kinesis) offsets. For example, if you want to save offsets in external
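
To make the gap concrete: in a plain structured streaming job like the sketch below, the writer only ever sees rows (plus a partition id and version in open()), never the Kafka offset ranges that produced them, so there is nothing to hand over to an external offset store. Broker, topic, and paths are placeholders:

    import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}

    val spark = SparkSession.builder().appName("offset-metadata-gap").getOrCreate()

    val kafka = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder
      .option("subscribe", "events")                        // placeholder
      .load()

    val query = kafka.writeStream
      .foreach(new ForeachWriter[Row] {
        def open(partitionId: Long, version: Long): Boolean = true
        def process(row: Row): Unit = { /* write the row to the external store */ }
        def close(error: Throwable): Unit = ()
      })
      .option("checkpointLocation", "/tmp/offset-metadata-checkpoint")  // placeholder
      .start()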