A micro batch is an RDD. An RDD has partitions, so different executors can work on different partitions concurrently.
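To make that concrete, here is a minimal sketch in plain Python (not Spark itself; the helper names are made up) of the idea: all events of one time slot form a single batch, and that one batch is split into partitions that workers process concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(batch, num_partitions):
    """Divide one micro batch (all events of one time slot) into partitions."""
    return [batch[i::num_partitions] for i in range(num_partitions)]

def process_partition(partition):
    # Stand-in for the work one executor does on one partition.
    return sum(partition)

# All events that arrived in one time slot -> ONE batch (one RDD in Spark).
batch = list(range(100))

# The single batch is distributed as partitions, not as multiple batches.
partitions = split_into_partitions(batch, num_partitions=4)

# Each partition is processed concurrently, like executors working on
# different partitions of the same RDD.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(process_partition, partitions))

total = sum(partial_sums)  # 4950
```

In real Spark Streaming the number of partitions per batch comes from the receiver configuration or the source (e.g. Kafka topic partitions), not from a helper like this; the sketch only shows why "one batch" and "distributed" are not in conflict.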
Don't think of it as multiple micro-batches within a time slot. It's one RDD per time slot, with multiple partitions.

On Tue, Sep 13, 2016 at 9:01 AM, Daan Debie <debie.d...@gmail.com> wrote:
> Thanks, but that thread does not answer my questions, which are about the
> distributed nature of RDDs vs the small nature of "micro batches", and
> how Spark Streaming distributes work.
>
> On Tue, Sep 13, 2016 at 3:34 PM, Mich Talebzadeh
> <mich.talebza...@gmail.com> wrote:
>>
>> Hi Daan,
>>
>> You may find the thread "Re: Is 'spark streaming' streaming or
>> mini-batch?" helpful. It ran in this forum not long ago.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> On 13 September 2016 at 14:25, DandyDev <debie.d...@gmail.com> wrote:
>>>
>>> Hi all!
>>>
>>> When reading about Spark Streaming and its execution model, I see
>>> diagrams like this a lot:
>>>
>>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27699/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala-31-638.jpg>
>>>
>>> It does a fine job explaining how DStreams consist of micro batches
>>> that are basically RDDs. There are however some things I don't
>>> understand:
>>>
>>> - RDDs are distributed by design, but micro batches are conceptually
>>> small. How/why are these micro batches distributed so that they need
>>> to be implemented as RDDs?
>>> - The above image doesn't explain how Spark Streaming parallelizes
>>> data. According to the image, a stream of events gets broken into
>>> micro batches over the axis of time (time 0 to 1 is a micro batch,
>>> time 1 to 2 is a micro batch, etc.). How does parallelism come into
>>> play here? Is it that even within a "time slot" (e.g. time 0 to 1)
>>> there can be so many events that multiple micro batches for that time
>>> slot will be created and distributed across the executors?
>>>
>>> Clarification would be helpful!
>>>
>>> Daan
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-dividing-DStream-into-mini-batches-tp27699.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org