Why does Spark need to set log levels

2017-10-09 Thread Daan Debie
Hi all!

I would love to use Spark with a somewhat more modern logging framework
than Log4j 1.2. I have Logback in mind, mostly because it integrates well
with central logging solutions such as the ELK stack. I've read up a bit on
getting Spark 2.0 (which is what I'm currently using) to work with anything
other than Log4j 1.2, and it seems nigh impossible.

If I understood the problem correctly from the various JIRA issues I read,
Spark needs to be able to set log levels programmatically, which the slf4j
API doesn't support, and as a result Spark integrates with Log4j 1.2 at a
deep level.
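
To illustrate the user-visible end of that coupling: a minimal sketch, assuming
Spark 2.x, of the runtime log-level setter on SparkContext, which as far as I
can tell is implemented on top of Log4j 1.2 internals rather than slf4j:

    import org.apache.spark.sql.SparkSession

    object LogLevelDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("log-level-demo") // app name is a placeholder
          .master("local[*]")
          .getOrCreate()

        // Runtime level switch. The slf4j Logger interface has no setLevel(),
        // so a call like this has to reach past the facade into Log4j itself.
        spark.sparkContext.setLogLevel("WARN")

        spark.stop()
      }
    }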

My question: why would Spark want to set log levels programmatically? Why
not leave it to the user of Spark to provide a logging configuration that
suits his/her needs? That way, the offending code that integrates directly
with Log4j could be removed, and Spark could rely solely on the slf4j API,
as any good library should.
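
To make that suggestion concrete, a minimal sketch (hypothetical class and
message) of library code that relies only on the slf4j API: it never touches
levels or appenders, and the application chooses the binding (Logback, Log4j,
or anything else) at deploy time:

    import org.slf4j.{Logger, LoggerFactory}

    object SomeSparkInternal {
      // The slf4j facade exposes logging calls only, never level manipulation.
      private val log: Logger = LoggerFactory.getLogger(getClass)

      def doWork(): Unit = {
        // Whether this is printed, and where it goes, is decided entirely by
        // the user's configuration (e.g. a logback.xml on the classpath).
        log.info("doing work")
      }
    }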

I'm curious about the motivations of the Spark dev team on this!

Daan


Re: Can mapWithState state func be called every batchInterval?

2016-10-11 Thread Daan Debie
That's nice and all, but I'd rather have a solution involving mapWithState
of course :) I'm just wondering why it doesn't support this use case yet.
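
For reference, a hedged sketch of the updateStateByKey route that comes up
below (Scala DStream API; the socket source, port, and checkpoint path are
placeholders). The update function runs for every known key in every batch, so
an empty list of new values is exactly how an "idle" key shows up. For
completeness: mapWithState's StateSpec does have a timeout(), but as far as I
know it fires once per expiring key, not on every interval.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object IdleKeysDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("idle-keys").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5))
        ssc.checkpoint("/tmp/idle-keys-checkpoint") // required for stateful ops

        val events = ssc.socketTextStream("localhost", 9999).map(line => (line, 1))

        // Called every batchInterval for every key Spark knows about;
        // newValues is empty when the key saw no events this interval.
        val totals = events.updateStateByKey { (newValues: Seq[Int], state: Option[Int]) =>
          if (newValues.isEmpty) state // idle key: the state function still ran
          else Some(state.getOrElse(0) + newValues.sum)
        }
        totals.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }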

On Tue, Oct 11, 2016 at 3:41 PM, Cody Koeninger  wrote:

> They're telling you not to use the old function because it's linear on the
> total number of keys, not keys in the batch, so it's slow.
>
> But if that's what you really want, go ahead and do it, and see if it
> performs well enough.
>
> On Oct 11, 2016 6:28 AM, "DandyDev"  wrote:
>
> Hi there,
>
> I've built a Spark Streaming app that accepts certain events from Kafka,
> and I want to keep some state between the events. So I've successfully
> used mapWithState for that. The problem is that I want the state for keys
> to be updated on every batchInterval, because the "lack" of events is also
> significant to the use case. This doesn't seem possible with mapWithState,
> unless I'm missing something.
>
> Previously I looked at updateStateByKey, which says:
> > In every batch, Spark will apply the state update function for all
> > existing keys, regardless of whether they have new data in a batch or not.
>
> That is what I want. However, I've seen several tutorials/blog posts where
> the advice was not to use updateStateByKey anymore, and to use mapWithState
> instead.
>
> So my questions:
>
> - Can the mapWithState state function be called every batchInterval, even
> when no events exist for that interval?
> - If not, is it okay to use updateStateByKey instead? Or will it be
> deprecated in the near future?
> - If mapWithState doesn't support my need, is there another way to
> accomplish the goal of updating state every batchInterval that still uses
> mapWithState, together with some other mechanism?
>
> Thanks in advance!
>


Re: Spark Streaming - dividing DStream into mini batches

2016-09-15 Thread Daan Debie
I have another (semi-related) question: I see in the documentation that
DStream has a reduceByKey transformation. Does this work on _all_ elements
in the stream as they come in, or is it a transformation per RDD/micro
batch? I assume the latter; otherwise it would be more akin to
updateStateByKey, right?
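
A small sketch of the distinction (socket source and checkpoint path are
placeholders): reduceByKey below is computed independently inside each
one-second micro-batch, so counts reset every interval, while updateStateByKey
is what carries totals across batches.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object PerBatchVsStateful {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("per-batch-vs-stateful").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(1))
        ssc.checkpoint("/tmp/per-batch-demo") // needed by updateStateByKey

        val words = ssc.socketTextStream("localhost", 9999)
          .flatMap(_.split(" "))
          .map((_, 1))

        // Per micro-batch: aggregated within one RDD, reset each interval.
        words.reduceByKey(_ + _).print()

        // Across micro-batches: a running total kept in Spark's state store.
        words.updateStateByKey { (vals: Seq[Int], st: Option[Int]) =>
          Some(st.getOrElse(0) + vals.sum)
        }.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }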

On Tue, Sep 13, 2016 at 4:42 PM, Cody Koeninger <c...@koeninger.org> wrote:

> The DStream implementation decides how to produce an RDD for a time
> (this is the compute method)
>
> The RDD implementation decides how to partition things (this is the
> getPartitions method)
>
> You can look at those methods in DirectKafkaInputDStream and KafkaRDD
> respectively if you want to see an example
>


Re: Spark Streaming - dividing DStream into mini batches

2016-09-14 Thread Daan Debie
Thanks for the awesome explanation! It's super clear to me now :)

On Tue, Sep 13, 2016 at 4:42 PM, Cody Koeninger <c...@koeninger.org> wrote:

> The DStream implementation decides how to produce an RDD for a time
> (this is the compute method)
>
> The RDD implementation decides how to partition things (this is the
> getPartitions method)
>
> You can look at those methods in DirectKafkaInputDStream and KafkaRDD
> respectively if you want to see an example
>


Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Daan Debie
Ah, that makes it much clearer, thanks!

It also brings up an additional question: who/what decides on the
partitioning? Does Spark Streaming decide to divide a micro batch/RDD into
more than one partition based on size? Or is it something that the "source"
(SocketStream, KafkaStream, etc.) decides?
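
For what it's worth, a toy sketch of where that decision lives: every RDD
implementation defines its own getPartitions, so the source's RDD (KafkaRDD,
BlockRDD for receiver-based streams, and so on) controls the split; at the
DStream level, compute(validTime) is the hook that produces one such RDD per
batch. The class below is hypothetical and only illustrates the extension
points, assuming the Spark 2.x RDD API:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical RDD: getPartitions (the RDD's own decision) splits the
    // range 0 until n into `slices` pieces; compute produces each piece.
    class ToyRangeRDD(sc: SparkContext, n: Int, slices: Int) extends RDD[Int](sc, Nil) {
      override protected def getPartitions: Array[Partition] =
        Array.tabulate[Partition](slices)(i => new Partition { override def index: Int = i })

      override def compute(split: Partition, context: TaskContext): Iterator[Int] =
        (split.index until n by slices).iterator
    }

With a live SparkContext, new ToyRangeRDD(sc, 100, 4).collect() should run
four tasks, one per partition.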

On Tue, Sep 13, 2016 at 4:26 PM, Cody Koeninger <c...@koeninger.org> wrote:

> A micro batch is an RDD.
>
> An RDD has partitions, so different executors can work on different
> partitions concurrently.
>
> Don't think of that as multiple micro-batches within a time slot.
> It's one RDD within a time slot, with multiple partitions.
>


Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Daan Debie
Thanks, but that thread does not answer my questions, which are about the
distributed nature of RDDs vs. the small size of "micro batches", and about
how Spark Streaming distributes work.
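
As a quick runtime probe of that question (hedged sketch; socket source and
port are placeholders): exactly one RDD shows up per batch interval, and its
partition count, rather than some number of extra micro-batches, is the unit
of parallelism.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object BatchPartitionsProbe {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("probe").setMaster("local[2]"), Seconds(1))

        val stream = ssc.socketTextStream("localhost", 9999)

        // foreachRDD hands us the single RDD built for each interval.
        stream.foreachRDD { rdd =>
          println(s"batch RDD ${rdd.id} has ${rdd.getNumPartitions} partitions")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }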

On Tue, Sep 13, 2016 at 3:34 PM, Mich Talebzadeh wrote:

> Hi Daan,
>
> You may find this link helpful: Re: Is "spark streaming" streaming or
> mini-batch? This was a thread in this forum not long ago.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 13 September 2016 at 14:25, DandyDev  wrote:
>
>> Hi all!
>>
>> When reading about Spark Streaming and its execution model, I see diagrams
>> like this a lot:
>>
>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27699/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala-31-638.jpg>
>>
>> It does a fine job explaining how DStreams consist of micro batches that
>> are basically RDDs. There are, however, some things I don't understand:
>>
>> - RDDs are distributed by design, but micro batches are conceptually
>> small. How/why are these micro batches distributed, such that they need
>> to be implemented as RDDs?
>> - The above image doesn't explain how Spark Streaming parallelizes data.
>> According to the image, a stream of events gets broken into micro batches
>> along the axis of time (time 0 to 1 is a micro batch, time 1 to 2 is a
>> micro batch, etc.). How does parallelism come into play here? Is it that
>> even within a "time slot" (e.g. time 0 to 1) there can be so many events
>> that multiple micro batches for that time slot will be created and
>> distributed across the executors?
>>
>> Clarification would be helpful!
>>
>> Daan
>>