Ah, that makes it much clearer, thanks! It also brings up an additional question: who/what decides on the partitioning? Does Spark Streaming decide to divide a micro batch/RDD into more than 1 partition based on size? Or is it something that the "source" (SocketStream, KafkaStream etc.) decides?
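As far as I understand it, the partitioning is decided by the source, not by batch size: receiver-based sources (SocketStream etc.) cut the incoming data into blocks every spark.streaming.blockInterval (200 ms by default), and each block becomes one partition of that batch's RDD; the direct Kafka stream instead creates one RDD partition per Kafka topic-partition. A minimal back-of-the-envelope sketch (plain Python, hypothetical helper names, not a Spark API):

```python
def receiver_partitions(batch_interval_ms, block_interval_ms=200):
    # Receiver-based sources cut the stream into blocks every
    # spark.streaming.blockInterval (default 200 ms); each block
    # becomes one partition of the batch RDD.
    return batch_interval_ms // block_interval_ms

def direct_kafka_partitions(topic_partitions):
    # The direct Kafka stream has no receiver: each batch RDD gets
    # exactly one partition per Kafka topic-partition consumed.
    # topic_partitions maps topic name -> number of partitions.
    return sum(topic_partitions.values())

# A 2 s batch with the default 200 ms block interval -> 10 partitions
print(receiver_partitions(2000))               # 10
# Reading one topic with 6 partitions -> 6 partitions per batch RDD
print(direct_kafka_partitions({"events": 6}))  # 6
```

So for receiver streams you can tune parallelism via spark.streaming.blockInterval (or by repartitioning), while for the direct Kafka stream it follows the topic's partition count.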
On Tue, Sep 13, 2016 at 4:26 PM, Cody Koeninger <c...@koeninger.org> wrote:
> A micro batch is an RDD.
>
> An RDD has partitions, so different executors can work on different
> partitions concurrently.
>
> Don't think of that as multiple micro-batches within a time slot.
> It's one RDD within a time slot, with multiple partitions.
>
> On Tue, Sep 13, 2016 at 9:01 AM, Daan Debie <debie.d...@gmail.com> wrote:
> > Thanks, but that thread does not answer my questions, which are about the
> > distributed nature of RDDs vs the small nature of "micro batches" and on
> > how Spark Streaming distributes work.
> >
> > On Tue, Sep 13, 2016 at 3:34 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> >> Hi Daan,
> >>
> >> You may find this link Re: Is "spark streaming" streaming or mini-batch?
> >> helpful. This was a thread in this forum not long ago.
> >>
> >> HTH
> >>
> >> Dr Mich Talebzadeh
> >>
> >> LinkedIn
> >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>
> >> http://talebzadehmich.wordpress.com
> >>
> >> On 13 September 2016 at 14:25, DandyDev <debie.d...@gmail.com> wrote:
> >>> Hi all!
> >>>
> >>> When reading about Spark Streaming and its execution model, I see
> >>> diagrams like this a lot:
> >>>
> >>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27699/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala-31-638.jpg>
> >>>
> >>> It does a fine job explaining how DStreams consist of micro batches
> >>> that are basically RDDs.
> >>> There are however some things I don't understand:
> >>>
> >>> - RDDs are distributed by design, but micro batches are conceptually
> >>> small. How/why are these micro batches distributed, so that they need
> >>> to be implemented as RDDs?
> >>> - The above image doesn't explain how Spark Streaming parallelizes
> >>> data. According to the image, a stream of events gets broken into
> >>> micro batches over the axis of time (time 0 to 1 is a micro batch,
> >>> time 1 to 2 is a micro batch, etc.). How does parallelism come into
> >>> play here? Is it that even within a "time slot" (e.g. time 0 to 1)
> >>> there can be so many events that multiple micro batches for that time
> >>> slot will be created and distributed across the executors?
> >>>
> >>> Clarification would be helpful!
> >>>
> >>> Daan
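Cody's point above ("one RDD within a time slot, with multiple partitions") can be sketched in plain Python for the receiver-based case: events falling in the same batch interval form one RDD, and within that interval the block interval decides how many partitions the RDD has. This is only an illustration of the model; `micro_batches` is a hypothetical helper, not a Spark API:

```python
from collections import defaultdict

def micro_batches(events, batch_interval_ms, block_interval_ms):
    # events: list of (timestamp_ms, payload) pairs.
    # Returns {batch_index: [partition, partition, ...]} where each
    # partition is the list of payloads in one block -- mimicking how a
    # receiver-based stream turns blocks into RDD partitions.
    batches = defaultdict(lambda: defaultdict(list))
    for ts, payload in events:
        batch = ts // batch_interval_ms                   # which RDD (time slot)
        block = (ts % batch_interval_ms) // block_interval_ms  # which partition
        batches[batch][block].append(payload)
    return {b: [blocks[k] for k in sorted(blocks)]
            for b, blocks in sorted(batches.items())}

events = [(0, "a"), (150, "b"), (300, "c"), (1100, "d")]
# 1000 ms batches, 200 ms blocks:
# time slot 0 -> ONE RDD with two partitions: ["a", "b"] and ["c"]
# time slot 1 -> ONE RDD with one partition: ["d"]
print(micro_batches(events, 1000, 200))
```

The key takeaway matches the answer above: a busy time slot doesn't produce extra micro batches; it produces one RDD whose partitions are processed concurrently across executors.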