Hi all!

When reading about Spark Streaming and its execution model, I see diagrams
like this a lot:

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n27699/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala-31-638.jpg>
 

It does a fine job of explaining how DStreams consist of micro batches
that are basically RDDs. There are, however, some things I don't understand:

- RDDs are distributed by design, but micro batches are conceptually small.
How/why are these micro batches distributed so that they need to be
implemented as RDDs?
- The above image doesn't explain how Spark Streaming parallelizes data.
According to the image, a stream of events gets broken into micro batches
over the axis of time (time 0 to 1 is a micro batch, time 1 to 2 is a micro
batch, etc.). How does parallelism come into play here? Is it that even
within a "time slot" (e.g. time 0 to 1) there can be so many events that
multiple micro batches for that time slot will be created and distributed
across the executors? (See the sketch below.)
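
For concreteness, here is a minimal sketch of the kind of job I have been
using to poke at this (the socket source, host, and port are just
placeholders I picked for the example):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchInspect {
  def main(args: Array[String]): Unit = {
    // local[2]: streaming needs at least one thread for the receiver
    // and one for processing.
    val conf = new SparkConf()
      .setAppName("MicroBatchInspect")
      .setMaster("local[2]")

    // Batch interval of 1 second: time 0 to 1 is one batch,
    // time 1 to 2 is the next, and so on.
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)

    // foreachRDD is invoked once per batch interval with a single RDD;
    // printing its partition count shows how that batch is split up.
    lines.foreachRDD { rdd =>
      println(s"batch: ${rdd.getNumPartitions} partitions, " +
        s"${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

From what this prints, it looks like each interval yields exactly one RDD
(whose partitions are what get spread across executors) rather than
multiple micro batches per interval, but I would like confirmation that
I am reading it right.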

Clarification would be helpful!

Daan


