Hello, 

I am building a Spark streaming application that ingests data from an Amazon 
Kinesis stream. My application keeps track of the minimum price over a window 
for groups of similar tickets. When I deploy the application, I would like it 
to start processing at the start of the previous hour's data. This will warm up 
the state of the application and allow us to deploy our application faster. For 
example, if I start the application at 3 PM, I would like to process the data 
retained by Kinesis from 2 PM to 3 PM, and then continue receiving data going 
forward. Spark Streaming’s Kinesis receiver, which relies on the Amazon Kinesis 
Client Library, seems to give me three options for choosing where to read from 
the stream: 
  - read from the latest checkpointed sequence number in Dynamo
  - start from the oldest record in the stream (TRIM_HORIZON shard iterator type)
  - start from the most recent record in the stream (LATEST shard iterator type)

Do you have any suggestions on how we could start our application at a specific 
timestamp or sequence number in the Kinesis stream? Some ideas I had were: 
  1. Create a KCL application that fetches the previous hour's data and writes
     it to HDFS. We can create an RDD from that dataset and initialize our
     Spark Streaming job with it. The Spark Streaming job’s Kinesis receiver
     can have the same name as the initial KCL application, and use that
     application's checkpoint as the starting point. We’re writing our Spark
     jobs in Python, so this would require launching the Java MultiLangDaemon,
     or writing that portion of the application in Java/Scala.
  2. Before the Spark Streaming application starts, we could fetch a shard
     iterator using the AT_TIMESTAMP shard iterator type. We could record the
     sequence number of the first record returned by this iterator, and create
     an entry in Dynamo for our application for that sequence number. Our
     Kinesis receiver would pick up from this checkpoint. It makes me a little
     nervous that we would be faking Kinesis Client Library's protocol by
     writing a checkpoint into Dynamo.
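
To make the AT_TIMESTAMP idea concrete, here is a minimal sketch of the lookup step only, assuming a boto3 Kinesis client is passed in as `client`. The helper names `one_hour_before` and `first_sequence_numbers_at` are made up for illustration. It does not write anything to Dynamo, since the layout of the KCL checkpoint table is exactly the part I am unsure about.

```python
from datetime import datetime, timedelta, timezone


def one_hour_before(now):
    """Return the timestamp one hour before `now` (hypothetical helper)."""
    return now - timedelta(hours=1)


def first_sequence_numbers_at(client, stream_name, start_time, max_polls=10):
    """Map each shard ID to the sequence number of its first record at or
    after `start_time`.

    `client` is assumed to behave like boto3.client("kinesis"). Shards that
    return no records within `max_polls` GetRecords calls are left out.
    """
    result = {}
    # Assumption: the stream is small enough that list_shards needs no paging.
    shards = client.list_shards(StreamName=stream_name)["Shards"]
    for shard in shards:
        shard_id = shard["ShardId"]
        iterator = client.get_shard_iterator(
            StreamName=stream_name,
            ShardId=shard_id,
            ShardIteratorType="AT_TIMESTAMP",
            Timestamp=start_time,
        )["ShardIterator"]
        # GetRecords may return an empty batch even when data exists further
        # along the shard, so poll a bounded number of times.
        for _ in range(max_polls):
            if iterator is None:
                break  # shard is closed and fully read
            response = client.get_records(ShardIterator=iterator, Limit=1)
            records = response["Records"]
            if records:
                result[shard_id] = records[0]["SequenceNumber"]
                break
            iterator = response.get("NextShardIterator")
    return result
```

With real credentials this would be called as `first_sequence_numbers_at(boto3.client("kinesis"), "my-stream", one_hour_before(datetime.now(timezone.utc)))`, where "my-stream" is a placeholder stream name. The result is one starting sequence number per shard, which is what a checkpoint entry would need to carry.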

Thanks in advance!

Neil
