Is it possible to achieve serial batching with Spark Streaming?

Example:

I configure the Streaming Context to create a batch every 3 seconds.

Processing of batch #2 takes longer than 3 seconds and creates a
backlog of batches:

batch #1 takes 2s
batch #2 takes 10s
batch #3 takes 2s
batch #4 takes 2s

When testing locally, it seems that the processing of multiple
batches finishes at the same time:

batch #1 finished at 2s
batch #2 finished at 12s
batch #3 finished at 12s (processed in parallel)
batch #4 finished at 15s

How can I delay processing of the next batch, so that it starts only
after processing of the previous batch has completed? (A sketch of
what I suspect is the relevant setting follows the timings below.)

batch #1 finished at 2s
batch #2 finished at 12s
batch #3 finished at 14s (processed serially)
batch #4 finished at 16s
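
For reference, the undocumented spark.streaming.concurrentJobs
property seems relevant: as far as I can tell it caps how many batch
jobs run concurrently, and it already defaults to 1, so I would have
expected serial behaviour out of the box. A minimal sketch of pinning
it explicitly (that this knob is the right lever is an assumption on
my part; the property name is the one read by Spark's JobScheduler):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .setMaster("local[*]")           // local testing, as above
    .setAppName("serial-batching")   // app name is arbitrary
    // undocumented: max number of batch jobs the JobScheduler runs
    // at once; 1 should mean one batch at a time
    .set("spark.streaming.concurrentJobs", "1");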

I want to perform a transformation for every key only once in a given
period of time (e.g. the batch duration). I find all unique keys in a
batch and perform the transformation on each key. To guarantee that
the transformation runs only once per key, only one batch can be in
flight at a time; within that single batch, however, I still want the
processing to run in parallel. (The per-key step is sketched after
the code below.)

context = new JavaStreamingContext(conf, Durations.seconds(10));
stream = context.receiverStream(...); // mapped to key/value pairs (elided)
stream
    .reduceByKey(...)    // one record per unique key in the batch
    .transform(...)      // the per-key transformation described above
    .foreachRDD(output); // output action, runs once per batch
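
To make the per-key requirement concrete, the elided output action
does roughly the following for each batch RDD (perKeyTransform is a
hypothetical placeholder of mine, not the real code; the key/value
types and the rdd variable are assumed in scope):

import scala.Tuple2;

// After reduceByKey the RDD holds exactly one entry per key, so
// invoking the transformation once per entry means once per key.
for (Tuple2<String, Long> entry : rdd.collect()) {
    perKeyTransform(entry._1(), entry._2());
}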

Any ideas or pointers are very welcome.

Thanks!
