Yeah. If you're looking to reduce over more than one micro-batch/RDD,
there's also reduceByKeyAndWindow.
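
For example, something along these lines (a word-count style sketch; the
socket source, host/port, and the 2-second batch / 10-second window are
just placeholder assumptions):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(new SparkConf().setAppName("WindowedCounts"), Seconds(2))

  val pairs = ssc.socketTextStream("localhost", 9999)
    .flatMap(_.split(" "))
    .map(word => (word, 1))

  // reduceByKey only reduces within each 2-second micro-batch.
  val perBatchCounts = pairs.reduceByKey(_ + _)

  // reduceByKeyAndWindow reduces over the last 10 seconds of micro-batches,
  // recomputed every 2 seconds.
  val windowedCounts = pairs.reduceByKeyAndWindow(
    (a: Int, b: Int) => a + b,  // associative reduce function
    Seconds(10),                // window length
    Seconds(2))                 // slide interval

  windowedCounts.print()
  ssc.start()
  ssc.awaitTermination()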

On Thu, Sep 15, 2016 at 4:27 AM, Daan Debie <debie.d...@gmail.com> wrote:
> I have another (semi-related) question: I see in the documentation that
> DStream has a transformation reduceByKey. Does this work on _all_ elements
> in the stream, as they're coming in, or is this a transformation per
> RDD/micro batch? I assume the latter, otherwise it would be more akin to
> updateStateByKey, right?
>
> On Tue, Sep 13, 2016 at 4:42 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> The DStream implementation decides how to produce an RDD for a given
>> batch time (this is the compute method).
>>
>> The RDD implementation decides how the data is partitioned (this is the
>> getPartitions method).
>>
>> You can look at those methods in DirectKafkaInputDStream and KafkaRDD,
>> respectively, if you want to see an example.
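>>
>> Roughly, the split looks like this (a stripped-down skeleton, not the real
>> Kafka code; the "offset range per partition" idea is just an assumption to
>> keep the example concrete):
>>
>>   import org.apache.spark.{Partition, SparkContext, TaskContext}
>>   import org.apache.spark.rdd.RDD
>>   import org.apache.spark.streaming.{StreamingContext, Time}
>>   import org.apache.spark.streaming.dstream.InputDStream
>>
>>   // One partition of one batch, e.g. an offset range of one source partition.
>>   case class RangePartition(index: Int, from: Long, until: Long) extends Partition
>>
>>   // The RDD decides how the batch is split (getPartitions) and how one
>>   // partition's data is read (compute).
>>   class RangeRDD(sc: SparkContext, ranges: Array[(Long, Long)])
>>     extends RDD[Long](sc, Nil) {
>>
>>     override protected def getPartitions: Array[Partition] =
>>       ranges.zipWithIndex.map { case ((from, until), i) =>
>>         RangePartition(i, from, until): Partition
>>       }
>>
>>     override def compute(split: Partition, context: TaskContext): Iterator[Long] = {
>>       val p = split.asInstanceOf[RangePartition]
>>       (p.from until p.until).iterator  // stand-in for "read this range from the source"
>>     }
>>   }
>>
>>   // The DStream decides which RDD represents the batch for a given time.
>>   class RangeInputDStream(ssc: StreamingContext, partitionsPerBatch: Int)
>>     extends InputDStream[Long](ssc) {
>>
>>     override def start(): Unit = ()
>>     override def stop(): Unit = ()
>>
>>     override def compute(validTime: Time): Option[RDD[Long]] = {
>>       // Pretend each batch covers offsets [t, t + 1000), split evenly.
>>       val t = validTime.milliseconds
>>       val step = 1000L / partitionsPerBatch
>>       val ranges = Array.tabulate(partitionsPerBatch) { i =>
>>         (t + i * step, t + (i + 1) * step)
>>       }
>>       Some(new RangeRDD(ssc.sparkContext, ranges))
>>     }
>>   }
>>
>> The real DirectKafkaInputDStream and KafkaRDD do essentially this, with one
>> RDD partition per Kafka topic-partition's offset range for that batch.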
>>
>> On Tue, Sep 13, 2016 at 9:37 AM, Daan Debie <debie.d...@gmail.com> wrote:
>> > Ah, that makes it much clearer, thanks!
>> >
>> > It also brings up an additional question: who/what decides on the
>> > partitioning? Does Spark Streaming decide to divide a micro batch/RDD
>> > into more than one partition based on size? Or is it something that
>> > the "source" (SocketStream, KafkaStream, etc.) decides?
>> >
>> > On Tue, Sep 13, 2016 at 4:26 PM, Cody Koeninger <c...@koeninger.org>
>> > wrote:
>> >>
>> >> A micro batch is an RDD.
>> >>
>> >> An RDD has partitions, so different executors can work on different
>> >> partitions concurrently.
>> >>
>> >> Don't think of that as multiple micro-batches within a time slot.
>> >> It's one RDD within a time slot, with multiple partitions.
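>> >>
>> >> One quick way to see that (a small sketch; the socket source, host/port,
>> >> and 2-second batch interval are just placeholders):
>> >>
>> >>   import org.apache.spark.SparkConf
>> >>   import org.apache.spark.streaming.{Seconds, StreamingContext}
>> >>
>> >>   val ssc = new StreamingContext(new SparkConf().setAppName("OneRddPerBatch"), Seconds(2))
>> >>   val lines = ssc.socketTextStream("localhost", 9999)
>> >>
>> >>   // Exactly one RDD per 2-second batch; its partitions are what the
>> >>   // executors work on in parallel.
>> >>   lines.foreachRDD { (rdd, time) =>
>> >>     println(s"batch at $time -> one RDD with ${rdd.getNumPartitions} partitions")
>> >>   }
>> >>
>> >>   // If the source yields too few partitions, you can repartition to
>> >>   // spread the work across more executors.
>> >>   val spread = lines.transform(_.repartition(8))
>> >>   spread.count().print()
>> >>
>> >>   ssc.start()
>> >>   ssc.awaitTermination()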
>> >>
>> >> On Tue, Sep 13, 2016 at 9:01 AM, Daan Debie <debie.d...@gmail.com>
>> >> wrote:
>> >> > Thanks, but that thread does not answer my questions, which are about
>> >> > the distributed nature of RDDs vs the small nature of "micro batches",
>> >> > and about how Spark Streaming distributes work.
>> >> >
>> >> > On Tue, Sep 13, 2016 at 3:34 PM, Mich Talebzadeh
>> >> > <mich.talebza...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Hi Daan,
>> >> >>
>> >> >> You may find the earlier thread "Is 'spark streaming' streaming or
>> >> >> mini-batch?" helpful. It was discussed in this forum not long ago.
>> >> >>
>> >> >> HTH
>> >> >>
>> >> >> Dr Mich Talebzadeh
>> >> >>
>> >> >> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >> >> http://talebzadehmich.wordpress.com
>> >> >>
>> >> >> On 13 September 2016 at 14:25, DandyDev <debie.d...@gmail.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> Hi all!
>> >> >>>
>> >> >>> When reading about Spark Streaming and its execution model, I see
>> >> >>> diagrams like this a lot:
>> >> >>>
>> >> >>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27699/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala-31-638.jpg>
>> >> >>>
>> >> >>> It does a fine job explaining how DStreams consist of micro batches
>> >> >>> that are basically RDDs. There are, however, some things I don't
>> >> >>> understand:
>> >> >>>
>> >> >>> - RDDs are distributed by design, but micro batches are conceptually
>> >> >>> small. How/why are these micro batches distributed so that they need
>> >> >>> to be implemented as RDDs?
>> >> >>> - The above image doesn't explain how Spark Streaming parallelizes
>> >> >>> data. According to the image, a stream of events gets broken into
>> >> >>> micro batches along the time axis (time 0 to 1 is a micro batch,
>> >> >>> time 1 to 2 is a micro batch, etc.). How does parallelism come into
>> >> >>> play here? Is it that even within a "time slot" (e.g. time 0 to 1)
>> >> >>> there can be so many events that multiple micro batches for that
>> >> >>> time slot will be created and distributed across the executors?
>> >> >>>
>> >> >>> Clarification would be helpful!
>> >> >>>
>> >> >>> Daan
>> >> >>>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
