Hi Neil,

My advice would be to write a Structured Streaming connector. The new
Structured Streaming APIs were introduced to handle exactly the issues you
describe.

See this blog
<https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html>

There isn't a Structured Streaming connector for Kinesis yet, but you can
easily write one that uses the underlying batch methods to read from and
write to Kinesis.

Have a look at how I wrote my BigQuery connector here
<http://github.com/samelamin/spark-bigquery>. Plus, the best thing is that
we get a new connector to a widely used data source/sink.
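
Once a connector exists, using it from PySpark would look something like
this. The "kinesis" format name and the option keys below are placeholders
I am making up, not a shipping API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kinesis-structured").getOrCreate()

    # Hypothetical source: format name and options are assumptions until
    # someone writes the connector.
    events = (spark.readStream
              .format("kinesis")                      # assumed short name
              .option("streamName", "ticket-stream")  # assumed option key
              .option("region", "us-east-1")
              .load())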

Hope that helps

Regards
Sam

On Sun, Feb 19, 2017 at 5:53 PM, Neil Maheshwari <
neil.v.maheshw...@gmail.com> wrote:

> Thanks for your response, Ayan.
>
> This could be an option. One complication I see with that approach is that
> I do not want to miss any records that fall between the data we have already
> batched to the data store and the checkpoint. I would still need a mechanism
> for recording the sequence number from the last time the data was batched,
> so I could start the streaming application after that sequence number.
>
> A similar approach could be to batch our data periodically, recording the
> last sequence number of the batch. Then, fetch data from Kinesis using the
> low-level API, reading from the latest sequence number of the batched data
> up until the sequence number of the latest checkpoint from our Spark app. I
> could merge the batched dataset and the dataset fetched from Kinesis's
> low-level API, and use the result as an RDD to prep the job.
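>
> A rough boto3 sketch of that low-level fetch (single shard assumed; the
> stream name and the two sequence-number variables are illustrative):
>
>     import boto3
>
>     kinesis = boto3.client("kinesis", region_name="us-east-1")
>     last_batched_seq = "..."   # recorded by our batch job
>     checkpoint_seq = "..."     # from our Spark app's checkpoint
>
>     # Resume right after the last sequence number the batch job recorded.
>     it = kinesis.get_shard_iterator(
>         StreamName="ticket-stream",
>         ShardId="shardId-000000000000",
>         ShardIteratorType="AFTER_SEQUENCE_NUMBER",
>         StartingSequenceNumber=last_batched_seq)["ShardIterator"]
>
>     records = []
>     while it:
>         resp = kinesis.get_records(ShardIterator=it, Limit=1000)
>         records.extend(resp["Records"])
>         # Stop once we have caught up to the checkpointed sequence number.
>         if any(int(r["SequenceNumber"]) >= int(checkpoint_seq)
>                for r in resp["Records"]):
>             break
>         it = resp.get("NextShardIterator")
>
> sc.parallelize on those records would give the RDD to merge with the
> batched dataset.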
>
> On Feb 19, 2017, at 3:12 AM, ayan guha <guha.a...@gmail.com> wrote:
>
> Hi
>
> AFAIK, Kinesis does not provide any mechanism other than the checkpoint to
> restart from. That makes sense, as it keeps the mechanism generic.
>
> Question: why can't you warm up your data from a data store? Say every 30
> minutes you run a job that aggregates your data to a data store for that
> hour. When you restart the streaming app, it would read from the Dynamo
> checkpoint, but it would also prep an initial RDD from the data store.
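>
> A minimal PySpark sketch of that warm-up, assuming min-price state keyed by
> ticket group and "group,price" text files as the store (paths and names are
> illustrative):
>
>     from pyspark import SparkContext
>     from pyspark.streaming import StreamingContext
>
>     sc = SparkContext(appName="warm-start")
>     ssc = StreamingContext(sc, 10)
>
>     # Initial state from the last aggregation job's output.
>     warm = (sc.textFile("hdfs:///warm/latest")
>               .map(lambda line: line.split(","))
>               .map(lambda kv: (kv[0], float(kv[1]))))
>
>     def update_min(new_prices, current_min):
>         candidates = new_prices + ([current_min] if current_min is not None else [])
>         return min(candidates) if candidates else None
>
>     # prices = the (group, price) DStream from the Kinesis receiver,
>     # defined elsewhere.
>     state = prices.updateStateByKey(update_min, initialRDD=warm)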
>
> Best
> Ayan
> On Sun, 19 Feb 2017 at 8:29 pm, Neil Maheshwari <
> neil.v.maheshw...@gmail.com> wrote:
>
>> Hello,
>>
>> I am building a Spark streaming application that ingests data from an
>> Amazon Kinesis stream. My application keeps track of the minimum price over
>> a window for groups of similar tickets. When I deploy the application, I
>> would like it to start processing from the beginning of the previous hour's
>> This will warm up the state of the application and allow us to deploy our
>> application faster. For example, if I start the application at 3 PM, I
>> would like to process the data retained by Kinesis from 2PM to 3PM, and
>> then continue receiving data going forward. Spark Streaming’s Kinesis
>> receiver, which relies on the Amazon Kinesis Client Library, seems to give
>> me three options for choosing where to read from the stream:
>>
>>    - read from the latest checkpointed sequence number in Dynamo
>>    - start from the oldest record in the stream (TRIM_HORIZON shard
>>    iterator type)
>>    - start from the most recent record in the stream (LATEST shard
>>    iterator type)
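>>
>> For reference, given a StreamingContext ssc, those options map onto the
>> receiver setup like this (the app, stream, and endpoint names below are
>> illustrative):
>>
>>     from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream
>>
>>     stream = KinesisUtils.createStream(
>>         ssc, "ticket-price-app", "ticket-stream",
>>         "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
>>         InitialPositionInStream.TRIM_HORIZON,  # or LATEST
>>         10)  # checkpoint interval in seconds
>>
>> If a checkpoint for "ticket-price-app" already exists in Dynamo, the KCL
>> resumes from it and the initial position is ignored.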
>>
>>
>> Do you have any suggestions on how we could start our application at a
>> specific timestamp or sequence number in the Kinesis stream? Some ideas I
>> had were:
>>
>>    - Create a KCL application that fetches the previous hour's data and
>>    writes it to HDFS. We can create an RDD from that dataset and initialize
>>    our Spark Streaming job with it. The Spark Streaming job's Kinesis
>>    receiver can have the same name as the initial KCL application, and use
>>    that application's checkpoint as the starting point. We're writing our
>>    Spark jobs in Python, so this would require launching the Java MultiLang
>>    daemon, or writing that portion of the application in Java/Scala.
>>    - Before the Spark streaming application starts, we could fetch a shard
>>    iterator using the AT_TIMESTAMP shard iterator type. We could record the
>>    sequence number of the first record returned by this iterator, and
>>    create an entry in Dynamo for our application for that sequence number
>>    (a rough sketch of this fetch follows the list). Our Kinesis receiver
>>    would pick up from this checkpoint. It makes me a little nervous that we
>>    would be faking the Kinesis Client Library's protocol by writing a
>>    checkpoint into Dynamo.
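>>
>> A rough boto3 sketch of the AT_TIMESTAMP fetch (stream name illustrative;
>> the Dynamo write that makes me nervous is deliberately left out):
>>
>>     import boto3
>>     from datetime import datetime, timedelta
>>
>>     kinesis = boto3.client("kinesis", region_name="us-east-1")
>>     start = datetime.utcnow() - timedelta(hours=1)
>>
>>     desc = kinesis.describe_stream(StreamName="ticket-stream")
>>     for shard in desc["StreamDescription"]["Shards"]:
>>         it = kinesis.get_shard_iterator(
>>             StreamName="ticket-stream",
>>             ShardId=shard["ShardId"],
>>             ShardIteratorType="AT_TIMESTAMP",
>>             Timestamp=start)["ShardIterator"]
>>         records = kinesis.get_records(ShardIterator=it, Limit=1)["Records"]
>>         if records:
>>             # The per-shard sequence number we would checkpoint; writing
>>             # it into the KCL lease table is the protocol-faking step.
>>             seq = records[0]["SequenceNumber"]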
>>
>>
>> Thanks in advance!
>>
>> Neil
>>
> --
> Best Regards,
> Ayan Guha
