Sure, thanks!!
On Sun, Nov 15, 2015 at 9:13 PM, Cody Koeninger <c...@koeninger.org> wrote:
> Not sure on that, maybe someone else can chime in
>
> On Sat, Nov 14, 2015 at 4:51 AM, kundan kumar <iitr.kun...@gmail.com> wrote:
>
>> Hi Cody,
>>
>> Thanks for the clarification. I will try to come up with some workaround.
>>
>> I have another doubt. When my job is restarted and recovers from the
>> checkpoint, it does the re-partitioning step twice for each 15-minute job
>> until the 2-hour window is complete. After that, the re-partitioning takes
>> place only once.
>>
>> For example: when the job recovers at 16:15, it does re-partitioning for
>> the 16:15 Kafka stream and for the 14:15 Kafka stream as well. Also, all
>> the other intermediate stages are computed for the 10:00 batch. I am using
>> reduceByKeyAndWindow with an inverse function. Once the 2-hour window is
>> complete, i.e. at 18:15, re-partitioning takes place only once. It seems
>> the checkpoint does not retain RDDs older than 2 hours, which is my window
>> duration. Because of this my job takes more time than usual.
>>
>> Is there a way, or some configuration parameter, that would help avoid
>> repartitioning twice?
>>
>> I am attaching the snapshot for the same.
>>
>> Thanks !!
>> Kundan
>>
>> On Fri, Nov 13, 2015 at 8:48 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>>> Unless you change maxRatePerPartition, a batch is going to contain all
>>> of the offsets from the last known processed to the highest available.
>>>
>>> Offsets are not time-based, and Kafka's time-based API currently has
>>> very poor granularity (it's based on the filesystem timestamp of the log
>>> segment). There's a Kafka improvement proposal to add time-based indexing,
>>> but I wouldn't expect it soon.
>>>
>>> Basically, if you want batches to relate to time even while your Spark
>>> job is down, you need an external process to index Kafka and do some
>>> custom work to use that index to generate batches.
>>> Or (preferably) embed a time in your message, and do any time-based
>>> calculations using that time, not the time of processing.
>>>
>>> On Fri, Nov 13, 2015 at 4:36 AM, kundan kumar <iitr.kun...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am using the Spark Streaming checkpointing mechanism and reading
>>>> data from Kafka. The window duration for my application is 2 hours with
>>>> a sliding interval of 15 minutes.
>>>>
>>>> So, my batches run at the following intervals:
>>>> 09:45
>>>> 10:00
>>>> 10:15
>>>> 10:30 and so on
>>>>
>>>> Suppose my running batch dies at 09:55 and I restart the application
>>>> at 12:05. Then the flow is something like this:
>>>>
>>>> At 12:05 it would run the 10:00 batch. Would this read the Kafka
>>>> offsets from the time it went down (or 09:45) to 12:00? Or just up to
>>>> 10:00? Then next would be the 10:15 batch: what would be the offsets as
>>>> input for this batch? ...and so on for all the queued batches.
>>>>
>>>> Basically, my requirement is that when the application is restarted
>>>> at 12:05, it should read the Kafka offsets up to 10:00, then the next
>>>> queued batch takes offsets from 10:00 to 10:15, and so on until all the
>>>> queued batches are processed.
>>>>
>>>> If this is the way offsets are handled for all the queued batches,
>>>> I am fine.
>>>>
>>>> Otherwise, please suggest how this can be done.
>>>>
>>>> Thanks!!!
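[Cody's preferred approach above, embedding a time in each message and doing time-based calculations on that embedded time rather than on processing time, can be sketched as follows. This is a hypothetical plain-Python illustration, not Spark or Kafka API code: it assigns a record to its 15-minute batch window from the timestamp carried in the message, so the assignment is the same no matter when the record is actually consumed.]

```python
from datetime import datetime, timezone

SLIDE_MINUTES = 15  # the thread's sliding interval


def batch_bucket(event_ts, slide_minutes=SLIDE_MINUTES):
    """Map a record's embedded event timestamp to the start of its
    15-minute batch window, independent of processing time."""
    epoch_min = int(event_ts.timestamp() // 60)
    start_min = epoch_min - epoch_min % slide_minutes
    return datetime.fromtimestamp(start_min * 60, tz=timezone.utc)


# A record produced at 09:55 belongs to the 09:45 window, even if the
# job was down and only consumes it hours later, at 12:05.
print(batch_bucket(datetime(2015, 11, 13, 9, 55, tzinfo=timezone.utc)))
# 2015-11-13 09:45:00+00:00
```

Inside a Spark job the same bucketing function would be applied per record (e.g. as the key in a keyed reduce), so that a restart replaying a large offset range still aggregates each record into the window its embedded time dictates.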