Re: Ordering of Batches in Spark streaming

2015-07-14 Thread Tathagata Das
This has been discussed in a number of threads in this mailing list. Here
is a summary.

1. Processing of batch T+1 always starts after all the processing of batch
T has completed. But here a batch is defined as the data received by all
the receivers running in the system within the batch interval. Since all
the data is divided internally into blocks and partitions, there is no clear
mapping between the original order in the sources and the ordering in the
RDDs generated by the batches.

2. However, in the specific case of the Direct Kafka stream, since there is a
one-to-one mapping between the Kafka partition and the RDD partition (of the
RDDs generated by the direct Kafka stream), there is a per-partition
ordering guarantee. For example, if partition 2 of all the direct Kafka RDDs
maps to partition 2 of a Kafka topic, then the data in partition 2 of
consecutive RDDs will be in the same order as it was in Kafka. This is the
special case.
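
A minimal sketch of point 2, in plain Python rather than Spark. Lists stand
in for Kafka partitions and RDD partitions, and the names direct_batch,
topic, and the offset pairs are hypothetical, made up for illustration; none
of them is a Spark or Kafka API. It assumes each batch reads one contiguous
offset range per Kafka partition, which is what the one-to-one mapping
implies.

```python
# Toy model only: plain Python lists stand in for Kafka partitions and
# RDD partitions. None of these names are Spark or Kafka APIs.

def direct_batch(topic, offset_ranges):
    """Build one batch: RDD partition i is a contiguous slice of Kafka partition i."""
    return [topic[p][start:end] for p, (start, end) in enumerate(offset_ranges)]

# Hypothetical topic with three partitions; records are tagged "p<partition>-<offset>".
topic = [[f"p{p}-{o}" for o in range(6)] for p in range(3)]

# Two consecutive batches, each given as (from_offset, until_offset) per partition.
batch_t = direct_batch(topic, [(0, 2), (0, 3), (0, 4)])
batch_t1 = direct_batch(topic, [(2, 6), (3, 6), (4, 6)])

# Reading partition 2 of batch T and then partition 2 of batch T+1
# reproduces the order of Kafka partition 2.
replayed = batch_t[2] + batch_t1[2]
print(replayed == topic[2])
```

The guarantee is per partition only: the sketch says nothing about the
relative order of records that live in different Kafka partitions.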

Here is another relevant thread:
http://mail-archives.us.apache.org/mod_mbox/spark-user/201502.mbox/%3ccao05p7de8dpxs5dyfvrni_yzv22s5z26b9jvyayj-r+pwy5...@mail.gmail.com%3E


On Sun, Jul 12, 2015 at 8:36 PM, anshu shukla anshushuk...@gmail.com
wrote:

 Can anyone shed some light on HOW SPARK DOES *ORDERING OF BATCHES*?

 On Sat, Jul 11, 2015 at 9:19 AM, anshu shukla anshushuk...@gmail.com
 wrote:

 Thanks Ayan,

 I was curious to know *how Spark does it*. Is there any *documentation*
 where I can get the details? Could you please point me to a detailed
 link?

 Maybe it does something like *transactional topologies in Storm*
 (https://storm.apache.org/documentation/Transactional-topologies.html).


 On Sat, Jul 11, 2015 at 9:13 AM, ayan guha guha.a...@gmail.com wrote:

 AFAIK, it is guaranteed that batch t+1 will not start processing until
 batch t is done.

 Ordering within a batch: what do you mean by that? In essence, the (mini)
 batch will get distributed in partitions like a normal RDD, so calling
 rdd.zipWithIndex should give a way to order them by the time they are
 received.

 On Sat, Jul 11, 2015 at 12:50 PM, anshu shukla anshushuk...@gmail.com
 wrote:

 Hey,

 Is there any *guarantee of a fixed ordering among the batches/RDDs*?

 After searching a lot, I found there is no ordering by default (from
 the framework itself), not only *batch-wise* but *also within
 batches*. But I wonder whether anything has changed between older Spark
 versions and Spark 1.4 in this context.

 Any comments, please!

 --
 Thanks & Regards,
 Anshu Shukla






Re: Ordering of Batches in Spark streaming

2015-07-12 Thread anshu shukla
Can anyone shed some light on HOW SPARK DOES *ORDERING OF BATCHES*?

-- 
Thanks & Regards,
Anshu Shukla


Re: Ordering of Batches in Spark streaming

2015-07-10 Thread anshu shukla
Thanks Ayan,

I was curious to know *how Spark does it*. Is there any *documentation*
where I can get the details? Could you please point me to a detailed
link?

Maybe it does something like *transactional topologies in Storm*
(https://storm.apache.org/documentation/Transactional-topologies.html).


-- 
Thanks & Regards,
Anshu Shukla


Re: Ordering of Batches in Spark streaming

2015-07-10 Thread ayan guha
AFAIK, it is guaranteed that batch t+1 will not start processing until batch
t is done.

Ordering within a batch: what do you mean by that? In essence, the (mini)
batch will get distributed in partitions like a normal RDD, so calling
rdd.zipWithIndex should give a way to order them by the time they are
received.
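
A minimal sketch of what the zipWithIndex suggestion relies on, in plain
Python rather than Spark. A list of lists stands in for an RDD's partitions,
and zip_with_index and the sample data are hypothetical names for
illustration only. It mimics the documented index order of
RDD.zipWithIndex (first by partition index, then by position within each
partition), so the indices match arrival order only if the partition layout
itself preserves arrival order, which is an assumption here.

```python
# Toy model only: a list of lists stands in for an RDD's partitions.

def zip_with_index(partitions):
    """Assign indices partition by partition, then by position within a partition."""
    indexed, next_index = [], 0
    for part in partitions:
        out = []
        for item in part:
            out.append((item, next_index))
            next_index += 1
        indexed.append(out)
    return indexed

# Hypothetical arrival order 1..6, spread over two partitions so that
# neighbouring records do not land in the same partition.
partitions = [[1, 3, 5], [2, 4, 6]]

flat = [pair for part in zip_with_index(partitions) for pair in part]
by_index = [item for item, _ in sorted(flat, key=lambda pair: pair[1])]
print(by_index)  # [1, 3, 5, 2, 4, 6]
```

In this layout, sorting by index gives the partition order, not the arrival
order 1..6.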

-- 
Best Regards,
Ayan Guha