Thank you, Praveen.

  In our Spark Streaming job, we write the data to an HDFS directory and
use the batch time, formatted as YYYYMMDDHHmm00, as the directory name.
  So when we stop the streaming job and start it again (we do not use
checkpointing), the init step of the first batch creates empty directories
for the batch intervals missed between the stop and the start.
  If the second batch runs faster than the first batch, it gets the chance
to run the init instead. In that case, the directory that the "first
batch" will write to is set to an empty directory by the "second
batch", which messes up the data.
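
  For illustration only, here is a rough sketch of the gap-filling part of
that init step, written in plain Python for brevity. The function name
missing_batch_dirs and its parameters are made up for this example, and it
assumes the batch interval is a whole number of minutes. It only computes the
directory names; creating them on HDFS is left out. Note that the result
depends on which batch time is passed in, which is why the init has to see the
time of the literal first batch.

```python
from datetime import datetime, timedelta

# Batch time rendered as YYYYMMDDHHmm00 (seconds fixed to "00").
DIR_FORMAT = "%Y%m%d%H%M00"


def missing_batch_dirs(last_batch, first_batch, interval):
    """Return directory names for the batch times strictly between
    last_batch (last batch written before the stop) and first_batch
    (first batch after the restart). Hypothetical helper."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    names = []
    t = last_batch + interval
    while t < first_batch:  # first_batch's own directory is excluded
        names.append(t.strftime(DIR_FORMAT))
        t += interval
    return names


if __name__ == "__main__":
    gap = missing_batch_dirs(
        datetime(2016, 4, 21, 15, 0),
        datetime(2016, 4, 21, 15, 3),
        timedelta(minutes=1),
    )
    print(gap)  # ['20160421150100', '20160421150200']
```

If the second batch's time (15:04 in this example) were passed as first_batch,
the list would also contain 20160421150300, which is the first batch's own
output directory.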

  I have a question about the StreamingListener.
  Suppose our system has a problem, such as an HDFS issue, and the "first
batch" and the "second batch" are both queued. When the issue is gone, these
two batches will start together. Will onBatchStarted then be called
concurrently for these two batches?
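
  Whatever the answer is, the bookkeeping on the listener side can be written
so that it is safe in both cases. Below is a minimal sketch in plain Python,
not tied to the Spark API: FirstBatchTracker and its method names are
hypothetical. It records the first batch time it is told about under a lock,
so exactly one caller wins even if the callbacks overlap. It assumes the
callback for the literal first batch reaches the tracker before that of any
later batch, which is an assumption here and not a confirmed Spark guarantee.

```python
import threading


class FirstBatchTracker:
    """Remembers the first batch time reported to it (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._first_time = None

    def on_batch_submitted(self, batch_time):
        """Record batch_time if nothing is recorded yet.
        Returns True only for the single call that set the value."""
        with self._lock:
            if self._first_time is None:
                self._first_time = batch_time
                return True
            return False

    def is_first(self, batch_time):
        """True if batch_time equals the recorded first batch time."""
        with self._lock:
            return self._first_time == batch_time


if __name__ == "__main__":
    tracker = FirstBatchTracker()
    tracker.on_batch_submitted(1461222660000)  # first batch (ms since epoch)
    tracker.on_batch_submitted(1461222720000)  # second batch, ignored
    print(tracker.is_first(1461222660000))  # True
    print(tracker.is_first(1461222720000))  # False
```

Each batch can then call is_first with its own batch time and run the init
only when the result is True, regardless of which batch happens to run faster.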

Thank you


On Thu, Apr 21, 2016 at 3:11 PM, Praveen Devarao <praveen...@in.ibm.com>
wrote:

> Hi Yu,
>
>         Could you provide more details on what you are trying to
> initialize, and how? Is this initialization part of the code block in an
> action of the DStream? If the second batch finishes before the first
> batch, wouldn't your results be affected, since the init would not have
> taken place yet (you want it in the first batch itself)?
>
>         One way we could think of to identify the first batch is to
> implement the *StreamingListener* trait, which has the methods
> *onBatchStarted* and *onBatchCompleted*. These methods should help you
> determine the first batch (the first batch will definitely start first,
> though the order of completion is not guaranteed with concurrentJobs set
> to more than 1).
>
>         Would be interesting to know your use case...could you share, if
> possible?
>
> Thanking You
>
> ---------------------------------------------------------------------------------
> Praveen Devarao
> Spark Technology Centre
> IBM India Software Labs
>
> ---------------------------------------------------------------------------------
> "Courage doesn't always roar. Sometimes courage is the quiet voice at the
> end of the day saying I will try again"
>
>
>
> From:        Yu Xie <yuu...@gmail.com>
> To:        user@spark.apache.org
> Date:        19/04/2016 01:24 pm
> Subject:        How to know whether I'm in the first batch of spark
> streaming
> ------------------------------
>
>
>
> hi spark users
>
> I'm running a Spark Streaming application with concurrentJobs > 1, so
> more than one batch may run at the same time.
>
> Now I would like to do some init work in the first batch, based on the
> "time" of the first batch. So even if the second batch runs faster than the
> first batch, I still need to do the init in the literal "first batch".
>
> Is there a way that I can know that?
> Thank you
>
>
>
