Just doing a bit of research: it seems we've been beaten to the punch, there's already a connector you can use here <https://github.com/maropu/spark-kinesis-sql-asl/issues/4>
Give it a go, and feel free to give the committer feedback, or better yet send some PRs if it needs them :)

On Sun, Feb 19, 2017 at 9:23 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:

> Hey Neil
>
> No worries! Happy to help you write it if you want, just link me to the
> repo and we can write it together.
>
> Would be fun!
>
> Regards
> Sam
>
> On Sun, 19 Feb 2017 at 21:21, Neil Maheshwari
> <neil.v.maheshw...@gmail.com> wrote:
>
>> Thanks for the advice Sam. I will look into implementing a structured
>> streaming connector.
>>
>> On Feb 19, 2017, at 11:54 AM, Sam Elamin <hussam.ela...@gmail.com> wrote:
>>
>> Hi Neil,
>>
>> My advice would be to write a structured streaming connector. The new
>> structured streaming APIs were brought in to handle exactly the issues
>> you describe.
>>
>> See this blog
>> <https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html>
>>
>> There isn't a structured streaming connector for Kinesis as of yet, but
>> you can easily write one that uses the underlying batch methods to read
>> from and write to Kinesis.
>>
>> Have a look at how I wrote my BigQuery connector here
>> <http://github.com/samelamin/spark-bigquery>. Plus, the best thing is we
>> get a new connector to a highly used data source/sink.
>>
>> Hope that helps
>>
>> Regards
>> Sam
>>
>> On Sun, Feb 19, 2017 at 5:53 PM, Neil Maheshwari
>> <neil.v.maheshw...@gmail.com> wrote:
>>
>> Thanks for your response Ayan.
>>
>> This could be an option. One complication I see with that approach is
>> that I do not want to miss any records that fall between the data we
>> have batched to the data store and the checkpoint. I would still need a
>> mechanism for recording the sequence number of the last time the data
>> was batched, so I could start the streaming application after that
>> sequence number.
>>
>> A similar approach could be to batch our data periodically, recording
>> the last sequence number of each batch. Then, fetch data from Kinesis
>> using the low-level API, reading from the latest sequence number of the
>> batched data up to the sequence number of the latest checkpoint from our
>> Spark app. I could merge the batched dataset with the dataset fetched
>> from Kinesis's low-level API, and use that dataset as an RDD to prep the
>> job.
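A minimal sketch of the gap-fill Neil describes above, using boto3's low-level Kinesis API. The stream name, shard id, and single-shard loop are illustrative assumptions; a real job would enumerate shards (for example via describe_stream) and handle resharding:

    import time
    import boto3

    def fetch_gap_records(stream_name, shard_id, start_seq, stop_seq):
        """Fetch records after the batch's last sequence number (exclusive)
        up to the app's checkpointed sequence number (inclusive)."""
        kinesis = boto3.client("kinesis")
        iterator = kinesis.get_shard_iterator(
            StreamName=stream_name,
            ShardId=shard_id,
            ShardIteratorType="AFTER_SEQUENCE_NUMBER",
            StartingSequenceNumber=start_seq,
        )["ShardIterator"]

        records = []
        while iterator:
            resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
            for record in resp["Records"]:
                records.append(record["Data"])
                if record["SequenceNumber"] == stop_seq:
                    return records  # reached the checkpoint, the gap is closed
            iterator = resp.get("NextShardIterator")
            time.sleep(0.2)  # stay under the 5 GetRecords calls/sec/shard limit
        return records

The fetched records could then be parallelized (sc.parallelize) and unioned with the batched dataset before the streaming job takes over.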
>> On Feb 19, 2017, at 3:12 AM, ayan guha <guha.a...@gmail.com> wrote:
>>
>> Hi
>>
>> AFAIK, Kinesis does not provide any mechanism other than checkpoints to
>> restart from. That makes sense, as it keeps things generic.
>>
>> Question: why can't you warm up your data from a data store? Say every
>> 30 minutes you run a job to aggregate your data to a data store for that
>> hour. When you restart the streaming app, it would read from the Dynamo
>> checkpoint, but it would also prep an initial RDD from the data store.
>>
>> Best
>> Ayan
>>
>> On Sun, 19 Feb 2017 at 8:29 pm, Neil Maheshwari
>> <neil.v.maheshw...@gmail.com> wrote:
>>
>> Hello,
>>
>> I am building a Spark Streaming application that ingests data from an
>> Amazon Kinesis stream. My application keeps track of the minimum price
>> over a window for groups of similar tickets. When I deploy the
>> application, I would like it to start processing from the start of the
>> previous hour's data. This will warm up the state of the application and
>> allow us to deploy our application faster. For example, if I start the
>> application at 3 PM, I would like to process the data retained by
>> Kinesis from 2 PM to 3 PM, and then continue receiving data going
>> forward.
>>
>> Spark Streaming's Kinesis receiver, which relies on the Amazon Kinesis
>> Client Library, seems to give me three options for choosing where to
>> read from the stream:
>>
>> - read from the latest checkpointed sequence number in Dynamo
>> - start from the oldest record in the stream (TRIM_HORIZON shard
>>   iterator type)
>> - start from the most recent record in the stream (LATEST shard
>>   iterator type)
>>
>> Do you have any suggestions on how we could start our application at a
>> specific timestamp or sequence number in the Kinesis stream? Some ideas
>> I had were:
>>
>> - Create a KCL application that fetches the previous hour's data and
>>   writes it to HDFS. We can create an RDD from that dataset and
>>   initialize our Spark Streaming job with it. The Spark Streaming job's
>>   Kinesis receiver can have the same name as the initial KCL
>>   application, and use that application's checkpoint as the starting
>>   point. We're writing our Spark jobs in Python, so this would require
>>   launching the Java MultiLang daemon, or writing that portion of the
>>   application in Java/Scala.
>> - Before the Spark Streaming application starts, we could fetch a shard
>>   iterator using the AT_TIMESTAMP shard iterator type. We could record
>>   the sequence number of the first record returned by this iterator, and
>>   create an entry in Dynamo for our application at that sequence number.
>>   Our Kinesis receiver would pick up from this checkpoint. It makes me a
>>   little nervous that we would be faking the Kinesis Client Library's
>>   protocol by writing a checkpoint into Dynamo.
>>
>> Thanks in advance!
>>
>> Neil
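A minimal sketch of the AT_TIMESTAMP lookup from Neil's second idea, again using boto3. The Dynamo checkpoint write is omitted on purpose, since the KCL lease-table schema is internal and writing to it is exactly the fragile part; the stream and shard names are placeholders:

    import time
    import boto3
    from datetime import datetime, timedelta

    def first_sequence_number_at(stream_name, shard_id, start_time):
        """Return the sequence number of the first record at or after start_time."""
        kinesis = boto3.client("kinesis")
        iterator = kinesis.get_shard_iterator(
            StreamName=stream_name,
            ShardId=shard_id,
            ShardIteratorType="AT_TIMESTAMP",
            Timestamp=start_time,
        )["ShardIterator"]

        while iterator:
            resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
            if resp["Records"]:
                return resp["Records"][0]["SequenceNumber"]
            # empty responses are normal while the iterator advances to the timestamp
            iterator = resp.get("NextShardIterator")
            time.sleep(0.2)
        return None

    # e.g. seed the application one hour back:
    seq = first_sequence_number_at("my-stream", "shardId-000000000000",
                                   datetime.utcnow() - timedelta(hours=1))

If writing the fake checkpoint feels too risky, the sequence number found this way could instead drive the low-level gap-fill sketched earlier in the thread.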