Ah, that makes it much clearer, thanks! It also brings up an additional question: who/what decides on the partitioning? Does Spark Streaming decide to divide a micro batch/RDD into more than 1 partition based on size? Or is it something that the "source" (SocketStream, KafkaStream etc.) decides?
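As far as I understand it, the partitioning is decided by the source, not by batch size: receiver-based sources (SocketStream etc.) cut the incoming data into blocks every spark.streaming.blockInterval (200 ms by default), and each block becomes one partition of that batch's RDD; the direct Kafka stream instead creates one RDD partition per Kafka topic-partition. A minimal back-of-the-envelope sketch (plain Python, hypothetical helper names, not a Spark API):

```python
def receiver_partitions(batch_interval_ms, block_interval_ms=200):
    # Receiver-based sources cut the stream into blocks every
    # spark.streaming.blockInterval (default 200 ms); each block
    # becomes one partition of the batch RDD.
    return batch_interval_ms // block_interval_ms

def direct_kafka_partitions(topic_partitions):
    # The direct Kafka stream has no receiver: each batch RDD gets
    # exactly one partition per Kafka topic-partition consumed.
    # topic_partitions maps topic name -> number of partitions.
    return sum(topic_partitions.values())

# A 2 s batch with the default 200 ms block interval -> 10 partitions
print(receiver_partitions(2000))               # 10
# Reading one topic with 6 partitions -> 6 partitions per batch RDD
print(direct_kafka_partitions({"events": 6}))  # 6
```

So for receiver streams you can tune parallelism via spark.streaming.blockInterval (or by repartitioning), while for the direct Kafka stream it follows the topic's partition count.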
On Tue, Sep 13, 2016 at 4:26 PM, Cody Koeninger <c...@koeninger.org> wrote:
> A micro batch is an RDD.
>
> An RDD has partitions, so different executors can work on different
> partitions concurrently.
>
> Don't think of that as multiple micro-batches within a time slot.
> It's one RDD within a time slot, with multiple partitions.
>
> On Tue, Sep 13, 2016 at 9:01 AM, Daan Debie <debie.d...@gmail.com> wrote:
> > Thanks, but that thread does not answer my questions, which are about the
> > distributed nature of RDDs vs the small nature of "micro batches" and on
> > how Spark Streaming distributes work.
> >
> > On Tue, Sep 13, 2016 at 3:34 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> >> Hi Daan,
> >>
> >> You may find this link Re: Is "spark streaming" streaming or mini-batch?
> >> helpful. This was a thread in this forum not long ago.
> >>
> >> HTH
> >>
> >> Dr Mich Talebzadeh
> >>
> >> LinkedIn
> >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>
> >> http://talebzadehmich.wordpress.com
> >>
> >> On 13 September 2016 at 14:25, DandyDev <debie.d...@gmail.com> wrote:
> >>> Hi all!
> >>>
> >>> When reading about Spark Streaming and its execution model, I see
> >>> diagrams like this a lot:
> >>>
> >>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27699/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala-31-638.jpg>
> >>>
> >>> It does a fine job explaining how DStreams consist of micro batches
> >>> that are basically RDDs.
> >>> There are however some things I don't understand:
> >>>
> >>> - RDDs are distributed by design, but micro batches are conceptually
> >>> small. How/why are these micro batches distributed, so that they need
> >>> to be implemented as RDDs?
> >>> - The above image doesn't explain how Spark Streaming parallelizes
> >>> data. According to the image, a stream of events gets broken into
> >>> micro batches over the axis of time (time 0 to 1 is a micro batch,
> >>> time 1 to 2 is a micro batch, etc.). How does parallelism come into
> >>> play here? Is it that even within a "time slot" (e.g. time 0 to 1)
> >>> there can be so many events that multiple micro batches for that time
> >>> slot will be created and distributed across the executors?
> >>>
> >>> Clarification would be helpful!
> >>>
> >>> Daan
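Cody's point above ("one RDD within a time slot, with multiple partitions") can be sketched in plain Python for the receiver-based case: events falling in the same batch interval form one RDD, and within that interval the block interval decides how many partitions the RDD has. This is only an illustration of the model; `micro_batches` is a hypothetical helper, not a Spark API:

```python
from collections import defaultdict

def micro_batches(events, batch_interval_ms, block_interval_ms):
    # events: list of (timestamp_ms, payload) pairs.
    # Returns {batch_index: [partition, partition, ...]} where each
    # partition is the list of payloads in one block -- mimicking how a
    # receiver-based stream turns blocks into RDD partitions.
    batches = defaultdict(lambda: defaultdict(list))
    for ts, payload in events:
        batch = ts // batch_interval_ms                   # which RDD (time slot)
        block = (ts % batch_interval_ms) // block_interval_ms  # which partition
        batches[batch][block].append(payload)
    return {b: [blocks[k] for k in sorted(blocks)]
            for b, blocks in sorted(batches.items())}

events = [(0, "a"), (150, "b"), (300, "c"), (1100, "d")]
# 1000 ms batches, 200 ms blocks:
# time slot 0 -> ONE RDD with two partitions: ["a", "b"] and ["c"]
# time slot 1 -> ONE RDD with one partition: ["d"]
print(micro_batches(events, 1000, 200))
```

The key takeaway matches the answer above: a busy time slot doesn't produce extra micro batches; it produces one RDD whose partitions are processed concurrently across executors.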