A micro batch is an RDD. An RDD has partitions, so different executors can work on different partitions concurrently.
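To make that concrete, here is a minimal sketch in plain Python (not Spark itself; the helper names are made up) of the idea: all events of one time slot form a single batch, and that one batch is split into partitions that workers process concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(batch, num_partitions):
    """Divide one micro batch (all events of one time slot) into partitions."""
    return [batch[i::num_partitions] for i in range(num_partitions)]

def process_partition(partition):
    # Stand-in for the work one executor does on one partition.
    return sum(partition)

# All events that arrived in one time slot -> ONE batch (one RDD in Spark).
batch = list(range(100))

# The single batch is distributed as partitions, not as multiple batches.
partitions = split_into_partitions(batch, num_partitions=4)

# Each partition is processed concurrently, like executors working on
# different partitions of the same RDD.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(process_partition, partitions))

total = sum(partial_sums)  # 4950
```

In real Spark Streaming the number of partitions per batch comes from the receiver configuration or the source (e.g. Kafka topic partitions), not from a helper like this; the sketch only shows why "one batch" and "distributed" are not in conflict.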
Don't think of it as multiple micro-batches within a time slot. It's one RDD per time slot, with multiple partitions.

On Tue, Sep 13, 2016 at 9:01 AM, Daan Debie <debie.d...@gmail.com> wrote:
> Thanks, but that thread does not answer my questions, which are about the
> distributed nature of RDDs vs the small nature of "micro batches", and
> how Spark Streaming distributes work.
>
> On Tue, Sep 13, 2016 at 3:34 PM, Mich Talebzadeh
> <mich.talebza...@gmail.com> wrote:
>>
>> Hi Daan,
>>
>> You may find the thread "Re: Is 'spark streaming' streaming or
>> mini-batch?" helpful. It ran in this forum not long ago.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> On 13 September 2016 at 14:25, DandyDev <debie.d...@gmail.com> wrote:
>>>
>>> Hi all!
>>>
>>> When reading about Spark Streaming and its execution model, I see
>>> diagrams like this a lot:
>>>
>>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27699/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala-31-638.jpg>
>>>
>>> It does a fine job explaining how DStreams consist of micro batches
>>> that are basically RDDs. There are however some things I don't
>>> understand:
>>>
>>> - RDDs are distributed by design, but micro batches are conceptually
>>> small. How/why are these micro batches distributed so that they need
>>> to be implemented as RDDs?
>>> - The above image doesn't explain how Spark Streaming parallelizes
>>> data. According to the image, a stream of events gets broken into
>>> micro batches over the axis of time (time 0 to 1 is a micro batch,
>>> time 1 to 2 is a micro batch, etc.). How does parallelism come into
>>> play here? Is it that even within a "time slot" (e.g. time 0 to 1)
>>> there can be so many events that multiple micro batches for that time
>>> slot will be created and distributed across the executors?
>>>
>>> Clarification would be helpful!
>>>
>>> Daan
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-dividing-DStream-into-mini-batches-tp27699.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org