The DStream implementation decides how to produce an RDD for a given
batch time (this is the compute method).

The RDD implementation decides how that data is partitioned (this is
the getPartitions method).

You can look at those methods in DirectKafkaInputDStream and KafkaRDD,
respectively, if you want to see an example.
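As a rough sketch of that split of responsibilities, here is a toy Python model. The class and method names (ToyDirectKafkaDStream, ToyKafkaRDD, get_partitions, etc.) are invented stand-ins, not Spark's real API; the actual implementations are the Scala classes named above. The idea it illustrates: the DStream snapshots what each batch should consume, and the RDD turns that into one partition per Kafka topic-partition.

```python
# Toy model: DStream decides what goes into each batch's RDD;
# the RDD decides how that work is partitioned.
# All names here are hypothetical simplifications of Spark's Scala code.

class ToyKafkaRDD:
    def __init__(self, offset_ranges):
        # one (topic, partition, from_offset, until_offset) per Kafka partition
        self.offset_ranges = offset_ranges

    def get_partitions(self):
        # The RDD decides partitioning: here, one RDD partition
        # per Kafka topic-partition.
        return [{"index": i, "offsets": r}
                for i, r in enumerate(self.offset_ranges)]


class ToyDirectKafkaDStream:
    def __init__(self, latest_offsets):
        # latest_offsets maps (topic, partition) -> latest available offset
        self.current = {tp: 0 for tp in latest_offsets}
        self.latest = latest_offsets

    def compute(self, batch_time):
        # The DStream decides how to produce an RDD for a time:
        # snapshot the offset ranges this batch will consume.
        ranges = [(tp[0], tp[1], self.current[tp], self.latest[tp])
                  for tp in sorted(self.latest)]
        self.current = dict(self.latest)
        return ToyKafkaRDD(ranges)


stream = ToyDirectKafkaDStream({("events", 0): 100, ("events", 1): 50})
rdd = stream.compute(batch_time=1)
print(len(rdd.get_partitions()))  # one RDD partition per Kafka partition -> 2
```

This mirrors why, with the direct Kafka stream, the number of RDD partitions per batch equals the number of Kafka partitions consumed.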

On Tue, Sep 13, 2016 at 9:37 AM, Daan Debie <debie.d...@gmail.com> wrote:
> Ah, that makes it much clearer, thanks!
>
> It also brings up an additional question: who/what decides on the
> partitioning? Does Spark Streaming decide to divide a micro batch/RDD into
> more than 1 partition based on size? Or is it something that the "source"
> (SocketStream, KafkaStream etc.) decides?
>
> On Tue, Sep 13, 2016 at 4:26 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> A micro batch is an RDD.
>>
>> An RDD has partitions, so different executors can work on different
>> partitions concurrently.
>>
>> Don't think of that as multiple micro-batches within a time slot.
>> It's one RDD within a time slot, with multiple partitions.
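To make the "one RDD per time slot, multiple partitions" point concrete, here is a small runnable simulation. The data, partition count, and the use of threads to stand in for executors are all made up for illustration; in Spark the scheduler assigns real partitions to real executors.

```python
# One batch interval -> one RDD -> several partitions, each of which
# a different executor can process concurrently. Threads simulate
# executors here; the numbers are arbitrary.
from concurrent.futures import ThreadPoolExecutor

events = list(range(10))   # everything that arrived in one time slot
num_partitions = 4

# the single micro batch, split into partitions
partitions = [events[i::num_partitions] for i in range(num_partitions)]

def process(partition):
    # stand-in for the work one executor does on one partition
    return sum(partition)

with ThreadPoolExecutor(max_workers=num_partitions) as pool:
    partial_sums = list(pool.map(process, partitions))

print(sum(partial_sums))  # same result as processing the batch serially: 45
```

Note there is still exactly one batch (one list of events) for the time slot; parallelism comes from splitting it, not from creating extra batches.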
>>
>> On Tue, Sep 13, 2016 at 9:01 AM, Daan Debie <debie.d...@gmail.com> wrote:
>> > Thanks, but that thread does not answer my questions, which are about the
>> > distributed nature of RDDs vs. the small nature of "micro batches", and
>> > about how Spark Streaming distributes work.
>> >
>> > On Tue, Sep 13, 2016 at 3:34 PM, Mich Talebzadeh
>> > <mich.talebza...@gmail.com>
>> > wrote:
>> >>
>> >> Hi Daan,
>> >>
>> >> You may find the thread "Re: Is 'spark streaming' streaming or
>> >> mini-batch?" helpful. It was discussed in this forum not long ago.
>> >>
>> >> HTH
>> >>
>> >> Dr Mich Talebzadeh
>> >>
>> >>
>> >>
>> >> LinkedIn
>> >>
>> >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >>
>> >>
>> >>
>> >> http://talebzadehmich.wordpress.com
>> >>
>> >>
>> >> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> >> loss, damage or destruction of data or any other property which may
>> >> arise
>> >> from relying on this email's technical content is explicitly
>> >> disclaimed. The
>> >> author will in no case be liable for any monetary damages arising from
>> >> such
>> >> loss, damage or destruction.
>> >>
>> >>
>> >>
>> >>
>> >> On 13 September 2016 at 14:25, DandyDev <debie.d...@gmail.com> wrote:
>> >>>
>> >>> Hi all!
>> >>>
>> >>> When reading about Spark Streaming and its execution model, I see
>> >>> diagrams
>> >>> like this a lot:
>> >>>
>> >>>
>> >>>
>> >>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27699/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala-31-638.jpg>
>> >>>
>> >>> It does a fine job explaining how DStreams consist of micro batches
>> >>> that are basically RDDs. There are, however, some things I don't
>> >>> understand:
>> >>>
>> >>> - RDDs are distributed by design, but micro batches are conceptually
>> >>> small. How/why are these micro batches distributed, such that they need
>> >>> to be implemented as RDDs?
>> >>> - The above image doesn't explain how Spark Streaming parallelizes
>> >>> data. According to the image, a stream of events gets broken into micro
>> >>> batches over the axis of time (time 0 to 1 is a micro batch, time 1 to
>> >>> 2 is a micro batch, etc.). How does parallelism come into play here? Is
>> >>> it that even within a "time slot" (e.g. time 0 to 1) there can be so
>> >>> many events that multiple micro batches for that time slot will be
>> >>> created and distributed across the executors?
>> >>>
>> >>> Clarification would be helpful!
>> >>>
>> >>> Daan
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> View this message in context:
>> >>>
>> >>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-dividing-DStream-into-mini-batches-tp27699.html
>> >>> Sent from the Apache Spark User List mailing list archive at
>> >>> Nabble.com.
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >>>
>> >>
>> >
>
>
