Thanks Yu for sharing the use case.

>>If our system has some problem, such as an HDFS issue, and the "first 
batch" and "second batch" are both queued, then when the issue is gone 
these two batches will start together. Will onBatchStarted then be called 
concurrently for these two batches?<<

I am not sure. I have not dug into that detail or faced such a situation 
myself. I do see a method onBatchSubmitted in the listener, and its 
comment reads "/** Called when a batch of jobs has been submitted for 
processing. */". Given that there is an event for batch submission too, I 
think the case you mention is a possible scenario, so you can probably use 
this method in combination with the other two (onBatchStarted and 
onBatchCompleted). Each of the three methods receives an event that 
carries a BatchInfo, and the BatchInfo class has the batchTime you need, 
so you should be able to achieve your task.
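
A minimal sketch of that idea, in plain Python with no Spark dependency. 
The class and method names (FirstBatchTracker, on_batch_submitted, 
is_first_batch) are hypothetical and only mirror the listener callbacks; 
batch times are assumed to be milliseconds. The lock is there so the 
sketch does not depend on whether the callbacks are delivered 
concurrently.

```python
import threading


class FirstBatchTracker:
    """Hypothetical helper: remembers the earliest submitted batch time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._first_batch_time = None  # batch time (ms) of the first batch

    def on_batch_submitted(self, batch_time_ms):
        # Keep the smallest batch time seen so far. Taking the minimum
        # avoids relying on the order in which submissions are reported.
        with self._lock:
            if (self._first_batch_time is None
                    or batch_time_ms < self._first_batch_time):
                self._first_batch_time = batch_time_ms

    def is_first_batch(self, batch_time_ms):
        # True only for the batch whose time matches the recorded minimum.
        with self._lock:
            return self._first_batch_time == batch_time_ms
```

A started-batch callback could then call is_first_batch with its own 
batch time to decide whether it is the literal first batch.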


Thanking You
---------------------------------------------------------------------------------
Praveen Devarao
Spark Technology Centre
IBM India Software Labs
---------------------------------------------------------------------------------
"Courage doesn't always roar. Sometimes courage is the quiet voice at the 
end of the day saying I will try again"



From:   Yu Xie <yuu...@gmail.com>
To:     Praveen Devarao/India/IBM@IBMIN
Cc:     user@spark.apache.org
Date:   21/04/2016 01:40 pm
Subject:        Re: How to know whether I'm in the first batch of spark 
streaming



Thank you Praveen

  In our Spark Streaming job, we write the data to an HDFS directory and 
use the batch time, in YYYYMMDDHHmm00 format, as the directory name.
  So, when we stop the streaming and start it again (we do not use 
checkpoints), in the init of the first batch we write empty directories 
for the batches missed between the stop and the start.
  If the second batch runs faster than the first batch, it gets the 
chance to run the "init". In that case, the directory that the "first 
batch" outputs to is set to an empty directory by the "second batch", 
which corrupts the data.
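
To make the gap filling concrete, here is a rough sketch in plain Python. 
The function names are hypothetical, and it assumes the directory name is 
year, month, day, hour and minute followed by a literal "00". The point is 
that the gap is computed from the first batch's time, so it must not be 
derived from whichever batch happens to run first.

```python
from datetime import datetime, timedelta


def batch_dir_name(batch_time):
    # Assumed layout: year, month, day, hour, minute, then a literal "00".
    return batch_time.strftime("%Y%m%d%H%M") + "00"


def gap_dir_names(last_batch_time, first_batch_time, interval):
    """Directory names for the batches missed while streaming was stopped.

    Both end points are excluded: the last batch before the stop already
    has its directory, and the first batch after the start writes its own.
    """
    names = []
    t = last_batch_time + interval
    while t < first_batch_time:
        names.append(batch_dir_name(t))
        t += interval
    return names
```

For example, with a one-minute interval, a last batch at 13:00 and a first 
batch at 13:03 on 21 April 2016, this yields the names for 13:01 and 13:02 
only.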

  I have a question about the StreamingListener.
  If our system has some problem, such as an HDFS issue, and the "first 
batch" and "second batch" are both queued, then when the issue is gone 
these two batches will start together. Will onBatchStarted then be called 
concurrently for these two batches?

Thank you


On Thu, Apr 21, 2016 at 3:11 PM, Praveen Devarao <praveen...@in.ibm.com> 
wrote:
Hi Yu,

        Could you provide more details on what you are trying to 
initialize, and how? Is this initialization part of the code block in an 
action of the DStream? If the second batch finishes before the first 
batch, wouldn't your results be affected, since the init would not have 
taken place yet (given that you want it in the first batch itself)?

        One way to identify the first batch is to implement the 
StreamingListener trait, which has the methods onBatchStarted and 
onBatchCompleted. These methods should help you determine the first batch 
(the first batch should start first, though the order of completion is 
not guaranteed with concurrentJobs set to more than 1).

        It would be interesting to know your use case. Could you share 
it, if possible?

Thanking You



From:        Yu Xie <yuu...@gmail.com>
To:        user@spark.apache.org
Date:        19/04/2016 01:24 pm
Subject:        How to know whether I'm in the first batch of spark 
streaming




Hi Spark users,

I'm running a Spark Streaming application with concurrentJobs > 1, so 
more than one batch may run at the same time.

Now I would like to do some init work in the first batch, based on the 
"time" of the first batch. So even if the second batch runs faster than 
the first batch, I still need to do the init in the literal "first batch".
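
One possible shape for such a guard, sketched in plain Python under the 
assumption that the first batch's time is already known on the driver. 
InitOnce and ensure are hypothetical names, not part of any Spark API. 
Whichever batch arrives first performs the init, but always with the 
first batch's time, and every other batch waits until it has finished.

```python
import threading


class InitOnce:
    """Hypothetical guard: runs an init function at most once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._done = False

    def ensure(self, first_batch_time_ms, init_fn):
        # The first caller runs init_fn with the first batch's time.
        # Later callers block on the lock until init is done, then skip it.
        with self._lock:
            if not self._done:
                init_fn(first_batch_time_ms)
                self._done = True
```

Each batch would call ensure before writing its output, so no batch can 
write while the init is still in progress.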

Then is there a way that I can know that?
Thank you





