Re: Model abstract class in spark ml

2016-08-30 Thread Mohit Jaggi
I think I figured it out. There is indeed “something deeper in Scala” :-)

abstract class A {
  def a: this.type
}

class AA(i: Int) extends A {
  def a = this
}
The above works OK. But if you return anything other than “this”, you will get 
a compile error.

abstract class A {
  def a: this.type
}

class AA(i: Int) extends A {
  def a = new AA(1)
}
Error:(33, 11) type mismatch;
 found   : com.dataorchard.datagears.AA
 required: AA.this.type
  def a = new AA(1)
  ^

So you have to do:

abstract class A[T <: A[T]]  {
  def a: T
}

class AA(i: Int) extends A[AA] {
  def a = new AA(1)
}
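
To tie this back to spark.ml, here is a hedged sketch (illustrative names only, not Spark's actual source) of why the F-bounded form is useful: it lets a method such as copy() promise to return the concrete subclass, which this.type cannot do when a new instance is returned.

// Sketch only: the F-bounded pattern, loosely modeled on spark.ml's Model.
abstract class MyModel[M <: MyModel[M]] {
  def copy(): M              // subclasses must return their own concrete type
}

class MySimpleModel(val weight: Double) extends MyModel[MySimpleModel] {
  // Returning a *new* instance compiles here; with a this.type return type it would not.
  def copy(): MySimpleModel = new MySimpleModel(weight)
}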


Mohit Jaggi
Founder,
Data Orchard LLC
www.dataorchardllc.com




> On Aug 30, 2016, at 9:51 PM, Mohit Jaggi  wrote:
> 
> thanks Sean. I am cross posting on dev to see why the code was written that 
> way. Perhaps, this.type doesn’t do what is needed.
> 
> Mohit Jaggi
> Founder,
> Data Orchard LLC
> www.dataorchardllc.com 
> 
> 
> 
> 
>> On Aug 30, 2016, at 2:08 PM, Sean Owen wrote:
>> 
>> I think it's imitating, for example, how Enum is declared in Java:
>> 
>> abstract class Enum<E extends Enum<E>>
>> 
>> this is done so that Enum can refer to the actual type of the derived
>> enum class when declaring things like public final int compareTo(E o)
>> to implement Comparable<E>. The type is redundant in a sense, because
>> you effectively have MyEnum extending Enum<MyEnum>.
>> 
>> Java allows this self-referential definition. However Scala has
>> "this.type" for this purpose and (unless I'm about to learn something
>> deeper about Scala) it would have been the better way to express this
>> so that Model methods can for example state that copy() returns a
>> Model of the same concrete type.
>> 
>> I don't know if it can be changed now without breaking compatibility
>> but you're welcome to give it a shot with MiMa to see. It does
>> compile, using this.type.
>> 
>> 
>> On Tue, Aug 30, 2016 at 9:47 PM, Mohit Jaggi wrote:
>>> Folks,
>>> I am having a bit of trouble understanding the following:
>>> 
>>> abstract class Model[M <: Model[M]]
>>> 
>>> Why is M <: Model[M]?
>>> 
>>> Cheers,
>>> Mohit.
>>> 
> 



Re: Model abstract class in spark ml

2016-08-30 Thread Mohit Jaggi
thanks Sean. I am cross posting on dev to see why the code was written that
way. Perhaps, this.type doesn’t do what is needed.

Mohit Jaggi
Founder,
Data Orchard LLC
www.dataorchardllc.com




On Aug 30, 2016, at 2:08 PM, Sean Owen  wrote:

I think it's imitating, for example, how Enum is declared in Java:

abstract class Enum<E extends Enum<E>>

this is done so that Enum can refer to the actual type of the derived
enum class when declaring things like public final int compareTo(E o)
to implement Comparable<E>. The type is redundant in a sense, because
you effectively have MyEnum extending Enum<MyEnum>.

Java allows this self-referential definition. However Scala has
"this.type" for this purpose and (unless I'm about to learn something
deeper about Scala) it would have been the better way to express this
so that Model methods can for example state that copy() returns a
Model of the same concrete type.

I don't know if it can be changed now without breaking compatibility
but you're welcome to give it a shot with MiMa to see. It does
compile, using this.type.


On Tue, Aug 30, 2016 at 9:47 PM, Mohit Jaggi  wrote:

Folks,
I am having a bit of trouble understanding the following:

abstract class Model[M <: Model[M]]

Why is M <: Model[M]?

Cheers,
Mohit.



Re: How to check for No of Records per partition in Dataframe

2016-08-30 Thread vsr
Hi Saurabh, 

You can do the following to print the number of entries in each partition.
You may need to grep executor logs for the counts.

// This runs on the executors, so the output appears in the executor logs.
val rdd = sc.parallelize(1 to 100, 4)
rdd.foreachPartition(it => println("Record count in partition: " + it.size))
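
If you would rather have the counts on the driver than buried in executor logs, a small variation (a sketch; df is assumed to be whatever DataFrame you are inspecting) is:

// Sketch: bring (partitionIndex, recordCount) pairs back to the driver.
val counts = df.rdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()
counts.foreach { case (idx, n) => println(s"Partition $idx: $n records") }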

Hope this is what you are looking for.

Thanks
Srinivas



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-check-for-No-of-Records-per-partition-in-Dataframe-tp18764p18805.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Mesos is now a maven module

2016-08-30 Thread Dongjoon Hyun
Thank you all for the quick fix! :D

Dongjoon.

On Tuesday, August 30, 2016, Michael Gummelt  wrote:

> https://github.com/apache/spark/pull/14885
>
> Thanks
>
> On Tue, Aug 30, 2016 at 11:36 AM, Marcelo Vanzin  > wrote:
>
>> On Tue, Aug 30, 2016 at 11:32 AM, Sean Owen > > wrote:
>> > Ah, I helped miss that. We don't enable -Pyarn for YARN because it's
>> > already always set? I wonder if it makes sense to make that optional
>> > in order to speed up builds, or, maybe I'm missing a reason it's
>> > always essential.
>>
>> YARN is currently handled as part of the Hadoop profiles in
>> dev/run-tests.py; it could potentially be changed to behave like the
>> others (e.g. only enabled when the YARN code changes).
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> 
>>
>>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>


Re: Mesos is now a maven module

2016-08-30 Thread Michael Gummelt
https://github.com/apache/spark/pull/14885

Thanks

On Tue, Aug 30, 2016 at 11:36 AM, Marcelo Vanzin 
wrote:

> On Tue, Aug 30, 2016 at 11:32 AM, Sean Owen  wrote:
> > Ah, I helped miss that. We don't enable -Pyarn for YARN because it's
> > already always set? I wonder if it makes sense to make that optional
> > in order to speed up builds, or, maybe I'm missing a reason it's
> > always essential.
>
> YARN is currently handled as part of the Hadoop profiles in
> dev/run-tests.py; it could potentially be changed to behave like the
> others (e.g. only enabled when the YARN code changes).
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 
Michael Gummelt
Software Engineer
Mesosphere


Re: What are the names of the network protocols used between Spark Driver, Master and Workers?

2016-08-30 Thread kant kodali

OK, I will answer my own question. It looks like it is Netty-based RPC.





On Mon, Aug 29, 2016 9:22 PM, kant kodali kanth...@gmail.com wrote:
What are the names of the network protocols used between Spark Driver, Master
and Workers?

Re: Spark Kerberos proxy user

2016-08-30 Thread Michael Gummelt
Here's one: https://issues.apache.org/jira/browse/SPARK-16742
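
To make the per-job proxy-user idea in the quoted thread below a bit more concrete, here is a minimal driver-side sketch using Hadoop's UserGroupInformation API. It is an illustration under assumptions (valid Kerberos credentials in the process, HDFS proxy-user rules configured), not Spark's built-in mechanism, and it does not by itself propagate the impersonated identity to the executors.

// Sketch only: impersonate `user` for driver-side HDFS access on a shared context.
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

def runAsProxy[T](user: String)(body: => T): T = {
  val ugi = UserGroupInformation.createProxyUser(user, UserGroupInformation.getCurrentUser)
  ugi.doAs(new PrivilegedExceptionAction[T] {
    override def run(): T = body
  })
}

// Hypothetical usage: runAsProxy("alice") { spark.read.parquet("hdfs://...").count() }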

On Tue, Aug 30, 2016 at 3:02 AM, Abel Rincón  wrote:

> Hi again,
>
> Is there any open issue related?
>
> Nowadays, we (Stratio) have an end-to-end solution running, with a Spark
> distribution based on 1.6.2.
>
> Work in progress:
>
> - Create and share our solution documentation.
> - Test Suite for all the stuff.
> - Rebase our code with apache-master branch.
>
> Regards,
>
>
> 2016-08-25 12:10 GMT+02:00 Abel Rincón :
>
>> Hi devs,
>>
>> I'm working (at Stratio) on using Spark over Mesos and standalone, with a
>> kerberized HDFS.
>>
>> We are working to solve these scenarios,
>>
>>
>>- We have a long-running Spark SQL context used concurrently
>>by many users (a Thrift-server-like service called CrossData), and we need access to HDFS
>>data with Kerberos authorization using the proxy-user method. We rely on the HDFS
>>permission system, or on our custom authorizer.
>>
>>
>>- We need to load/write DataFrames using data sources with an HDFS
>>backend (built-in or others) such as JSON, CSV, Parquet, ORC, …, so we want to
>>enable secure (Kerberos) access by configuration only.
>>
>>
>>- We have a scenario where we want to run streaming jobs over
>>kerberized HDFS, for both reads/writes and checkpointing.
>>
>>
>>- We have to load every single RDD that Spark core reads over kerberized
>>HDFS without breaking the Spark API.
>>
>>
>>
>>
>> As you can see, we have a "special" requirement: the need to set the proxy
>> user per job over the same Spark context.
>>
>> Do you have any idea to cover it?
>>
>>
>


-- 
Michael Gummelt
Software Engineer
Mesosphere


Re: Mesos is now a maven module

2016-08-30 Thread Marcelo Vanzin
On Tue, Aug 30, 2016 at 11:32 AM, Sean Owen  wrote:
> Ah, I helped miss that. We don't enable -Pyarn for YARN because it's
> already always set? I wonder if it makes sense to make that optional
> in order to speed up builds, or, maybe I'm missing a reason it's
> always essential.

YARN is currently handled as part of the Hadoop profiles in
dev/run-tests.py; it could potentially be changed to behave like the
others (e.g. only enabled when the YARN code changes).

-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Mesos is now a maven module

2016-08-30 Thread Sean Owen
Ah, I helped miss that. We don't enable -Pyarn for YARN because it's
already always set? I wonder if it makes sense to make that optional
in order to speed up builds, or, maybe I'm missing a reason it's
always essential.

I think it's indeed not setting -Pmesos because no Mesos code was
changed, but I think that script change is necessary as a follow-up,
yes.

Yeah, nothing is in the Jenkins config itself.

On Tue, Aug 30, 2016 at 6:05 PM, Marcelo Vanzin  wrote:
> A quick look shows that maybe dev/sparktestsupport/modules.py needs to
> be modified, and a "build_profile_flags" added to the mesos section
> (similar to hive / hive-thriftserver).
>
> Note not all PR builds will trigger mesos currently, since it's listed
> as an independent module in the above file.
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Performance of loading parquet files into case classes in Spark

2016-08-30 Thread Steve Loughran

On 29 Aug 2016, at 20:58, Julien Dumazert wrote:

Hi Maciek,

I followed your recommendation and benchmarked DataFrame aggregations on 
Datasets. Here is what I got:


// Dataset with RDD-style code
// 34.223s
df.as[A].map(_.fieldToSum).reduce(_ + _)

// Dataset with map and DataFrame sum
// 35.372s
df.as[A].map(_.fieldToSum).agg(sum("value")).collect().head.getAs[Long](0)

Not much of a difference. It seems that as soon as you access the data RDD-style, 
you force the full decoding of each object into a case class, which is super 
costly.

I find this behavior quite normal: as soon as you provide the user with the 
ability to pass a blackbox function, anything can happen, so you have to load 
the whole object. On the other hand, when using SQL-style functions only, 
everything is "white box", so Spark understands what you want to do and can 
optimize.



SQL and the DataFrame code where you are asking for a specific field can be 
handled by the file format itself, optimising the operation. If you ask for 
only one column of Parquet or ORC data, then only that column's data should be 
loaded. And because they store columns together, you save all the IO needed 
to read the discarded columns. Add even more selectiveness (such as ranges 
on values), and you can even get "predicate pushdown", where blocks of the file 
are skipped if the input format reader can determine that none of the values 
there match the predicate's conditions.

You should be able to get away with something like df.select("field") to 
pick just the fields you need first, then stay in code rather than SQL.
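
As a concrete sketch of that suggestion (assuming, as in the benchmark above, a DataFrame df with a numeric column "fieldToSum" and spark.implicits._ in scope):

import org.apache.spark.sql.functions.sum

// Column-pruned aggregate: only one Parquet column is read, nothing is decoded into A.
val total = df.agg(sum("fieldToSum")).first().getAs[Long](0)

// Or prune first, then use RDD-style code on the single column only.
val total2 = df.select("fieldToSum").as[Long].reduce(_ + _)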

Anyway, experiment: it's always more accurate than the opinions of others, 
especially when applied to your own datasets.


Re: KMeans calls takeSample() twice?

2016-08-30 Thread Georgios Samaras
Good catch Shivaram. However, the very next line states:

// this shouldn't happen often because we use a big multiplier for the
initial size

which makes me wonder whether that is really the case, since I am
experimenting heavily right now and have launched 30-40 jobs, and at a
glance I can see takeSample() being called twice in them!

George


On Tue, Aug 30, 2016 at 10:20 AM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> I think takeSample itself runs multiple jobs if the amount of samples
> collected in the first pass is not enough. The comment and code path
> at https://github.com/apache/spark/blob/412b0e8969215411b97efd3d0984dc
> 6cac5d31e0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L508
> should explain when this happens. Also you can confirm this by
> checking if the logWarning shows up in your logs.
>
> Thanks
> Shivaram
>
> On Tue, Aug 30, 2016 at 9:50 AM, Georgios Samaras
>  wrote:
> >
> > -- Forwarded message --
> > From: Georgios Samaras 
> > Date: Tue, Aug 30, 2016 at 9:49 AM
> > Subject: Re: KMeans calls takeSample() twice?
> > To: "Sean Owen [via Apache Spark Developers List]"
> > 
> >
> >
> > I am not sure what you want me to check. Note that I see two
> takeSample()s
> > being invoked every single time I execute KMeans(). In a current job I
> have,
> > I did view the details and updated the:
> >
> > StackOverflow question.
> >
> >
> >
> > On Tue, Aug 30, 2016 at 9:25 AM, Sean Owen [via Apache Spark Developers
> > List]  wrote:
> >>
> >> I'm not sure it's a UI bug; it really does record two different
> >> stages, the second of which executes quickly. I am not sure why that
> >> would happen off the top of my head. I don't see anything that failed
> >> here.
> >>
> >> Digging into those two stages and what they executed might give a clue
> >> to what's really going on there.
> >>
> >> On Tue, Aug 30, 2016 at 5:18 PM, gsamaras <[hidden email]> wrote:
> >> > Yanbo thank you for your reply. So you are saying that this is a bug
> in
> >> > the
> >> > Spark UI in general, and not in the local Spark UI of our cluster,
> where
> >> > I
> >> > work, right?
> >> >
> >> > George
> >>
> >> -
> >> To unsubscribe e-mail: [hidden email]
> >>
> >>
> >>
> >> 
> >> If you reply to this email, your message will be added to the discussion
> >> below:
> >>
> >> http://apache-spark-developers-list.1001551.n3.nabble.com/KMeans-calls-
> takeSample-twice-tp18761p18788.html
> >> To unsubscribe from KMeans calls takeSample() twice?, click here.
> >> NAML
> >
> >
> >
>


Re: KMeans calls takeSample() twice?

2016-08-30 Thread Shivaram Venkataraman
I think takeSample itself runs multiple jobs if the amount of samples
collected in the first pass is not enough. The comment and code path
at 
https://github.com/apache/spark/blob/412b0e8969215411b97efd3d0984dc6cac5d31e0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L508
should explain when this happens. Also you can confirm this by
checking if the logWarning shows up in your logs.

Thanks
Shivaram

On Tue, Aug 30, 2016 at 9:50 AM, Georgios Samaras
 wrote:
>
> -- Forwarded message --
> From: Georgios Samaras 
> Date: Tue, Aug 30, 2016 at 9:49 AM
> Subject: Re: KMeans calls takeSample() twice?
> To: "Sean Owen [via Apache Spark Developers List]"
> 
>
>
> I am not sure what you want me to check. Note that I see two takeSample()s
> being invoked every single time I execute KMeans(). In a current job I have,
> I did view the details and updated the:
>
> StackOverflow question.
>
>
>
> On Tue, Aug 30, 2016 at 9:25 AM, Sean Owen [via Apache Spark Developers
> List]  wrote:
>>
>> I'm not sure it's a UI bug; it really does record two different
>> stages, the second of which executes quickly. I am not sure why that
>> would happen off the top of my head. I don't see anything that failed
>> here.
>>
>> Digging into those two stages and what they executed might give a clue
>> to what's really going on there.
>>
>> On Tue, Aug 30, 2016 at 5:18 PM, gsamaras <[hidden email]> wrote:
>> > Yanbo thank you for your reply. So you are saying that this is a bug in
>> > the
>> > Spark UI in general, and not in the local Spark UI of our cluster, where
>> > I
>> > work, right?
>> >
>> > George
>>
>> -
>> To unsubscribe e-mail: [hidden email]
>>
>>
>>
>> 
>> If you reply to this email, your message will be added to the discussion
>> below:
>>
>> http://apache-spark-developers-list.1001551.n3.nabble.com/KMeans-calls-takeSample-twice-tp18761p18788.html
>> To unsubscribe from KMeans calls takeSample() twice?, click here.
>> NAML
>
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Mesos is now a maven module

2016-08-30 Thread Dongjoon Hyun
Thank you for confirming, Sean and Marcelo!

Bests,
Dongjoon.

On Tue, Aug 30, 2016 at 10:05 AM, Marcelo Vanzin 
wrote:

> A quick look shows that maybe dev/sparktestsupport/modules.py needs to
> be modified, and a "build_profile_flags" added to the mesos section
> (similar to hive / hive-thriftserver).
>
> Note not all PR builds will trigger mesos currently, since it's listed
> as an independent module in the above file.
>
> On Tue, Aug 30, 2016 at 10:01 AM, Sean Owen  wrote:
> > I have the heady power to modify Jenkins jobs now, so I will carefully
> take
> > a look at them and see if any of the config needs -Pmesos. But yeah I
> > thought this should be baked into the script.
> >
> > On Tue, Aug 30, 2016 at 5:56 PM, Dongjoon Hyun 
> wrote:
> >>
> >> Hi, Michael.
> >>
> >> It's a great news!
> >>
> >> BTW, I'm wondering if the Jenkins (SparkPullRequestBuilder) knows this
> new
> >> profile, -Pmesos.
> >>
> >> The PR was passed with the following Jenkins build arguments without
> >> `-Pmesos` option. (at the last test)
> >> ```
> >> [info] Building Spark (w/Hive 1.2.1) using SBT with these arguments:
> >> -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive-thriftserver -Phive
> test:package
> >> streaming-kafka-0-8-assembly/assembly streaming-flume-assembly/assembly
> >> streaming-kinesis-asl-assembly/assembly
> >> ```
> >>
> >> https://amplab.cs.berkeley.edu/jenkins/job/
> SparkPullRequestBuilder/64435/consoleFull
> >>
> >> Also, up to now, Jenkins seems not to use '-Pmesos' for all PRs.
> >>
> >> Bests,
> >> Dongjoon.
> >>
> >>
> >> On Fri, Aug 26, 2016 at 3:19 PM, Michael Gummelt <
> mgumm...@mesosphere.io>
> >> wrote:
> >>>
> >>> If it's separable, then sure.  Consistency is nice.
> >>>
> >>> On Fri, Aug 26, 2016 at 2:14 PM, Jacek Laskowski 
> wrote:
> 
>  Hi Michael,
> 
>  Congrats!
> 
>  BTW What I like about the change the most is that it uses the
>  pluggable interface for TaskScheduler and SchedulerBackend (as
>  introduced by YARN). Think Standalone should follow the steps. WDYT?
> 
>  Pozdrawiam,
>  Jacek Laskowski
>  
>  https://medium.com/@jaceklaskowski/
>  Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>  Follow me at https://twitter.com/jaceklaskowski
> 
> 
>  On Fri, Aug 26, 2016 at 10:20 PM, Michael Gummelt
>   wrote:
>  > Hello devs,
>  >
>  > Much like YARN, Mesos has been refactored into a Maven module.  So
>  > when
>  > building, you must add "-Pmesos" to enable Mesos support.
>  >
>  > The pre-built distributions from Apache will continue to enable
> Mesos.
>  >
>  > PR: https://github.com/apache/spark/pull/14637
>  >
>  > Cheers
>  >
>  > --
>  > Michael Gummelt
>  > Software Engineer
>  > Mesosphere
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> Michael Gummelt
> >>> Software Engineer
> >>> Mesosphere
> >>
> >>
> >
>
>
>
> --
> Marcelo
>


Re: Mesos is now a maven module

2016-08-30 Thread Marcelo Vanzin
A quick look shows that maybe dev/sparktestsupport/modules.py needs to
be modified, and a "build_profile_flags" added to the mesos section
(similar to hive / hive-thriftserver).

Note not all PR builds will trigger mesos currently, since it's listed
as an independent module in the above file.

On Tue, Aug 30, 2016 at 10:01 AM, Sean Owen  wrote:
> I have the heady power to modify Jenkins jobs now, so I will carefully take
> a look at them and see if any of the config needs -Pmesos. But yeah I
> thought this should be baked into the script.
>
> On Tue, Aug 30, 2016 at 5:56 PM, Dongjoon Hyun  wrote:
>>
>> Hi, Michael.
>>
>> It's a great news!
>>
>> BTW, I'm wondering if the Jenkins (SparkPullRequestBuilder) knows this new
>> profile, -Pmesos.
>>
>> The PR was passed with the following Jenkins build arguments without
>> `-Pmesos` option. (at the last test)
>> ```
>> [info] Building Spark (w/Hive 1.2.1) using SBT with these arguments:
>> -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive-thriftserver -Phive test:package
>> streaming-kafka-0-8-assembly/assembly streaming-flume-assembly/assembly
>> streaming-kinesis-asl-assembly/assembly
>> ```
>>
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64435/consoleFull
>>
>> Also, up to now, Jenkins seems not to use '-Pmesos' for all PRs.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Fri, Aug 26, 2016 at 3:19 PM, Michael Gummelt 
>> wrote:
>>>
>>> If it's separable, then sure.  Consistency is nice.
>>>
>>> On Fri, Aug 26, 2016 at 2:14 PM, Jacek Laskowski  wrote:

 Hi Michael,

 Congrats!

 BTW What I like about the change the most is that it uses the
 pluggable interface for TaskScheduler and SchedulerBackend (as
 introduced by YARN). Think Standalone should follow the steps. WDYT?

 Pozdrawiam,
 Jacek Laskowski
 
 https://medium.com/@jaceklaskowski/
 Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
 Follow me at https://twitter.com/jaceklaskowski


 On Fri, Aug 26, 2016 at 10:20 PM, Michael Gummelt
  wrote:
 > Hello devs,
 >
 > Much like YARN, Mesos has been refactored into a Maven module.  So
 > when
 > building, you must add "-Pmesos" to enable Mesos support.
 >
 > The pre-built distributions from Apache will continue to enable Mesos.
 >
 > PR: https://github.com/apache/spark/pull/14637
 >
 > Cheers
 >
 > --
 > Michael Gummelt
 > Software Engineer
 > Mesosphere
>>>
>>>
>>>
>>>
>>> --
>>> Michael Gummelt
>>> Software Engineer
>>> Mesosphere
>>
>>
>



-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Mesos is now a maven module

2016-08-30 Thread Sean Owen
I have the heady power to modify Jenkins jobs now, so I will carefully take
a look at them and see if any of the config needs -Pmesos. But yeah I
thought this should be baked into the script.

On Tue, Aug 30, 2016 at 5:56 PM, Dongjoon Hyun  wrote:

> Hi, Michael.
>
> It's a great news!
>
> BTW, I'm wondering if the Jenkins (SparkPullRequestBuilder) knows this new
> profile, -Pmesos.
>
> The PR was passed with the following Jenkins build arguments without
> `-Pmesos` option. (at the last test)
> ```
> [info] Building Spark (w/Hive 1.2.1) using SBT with these arguments:
>  -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive-thriftserver -Phive test:package
> streaming-kafka-0-8-assembly/assembly streaming-flume-assembly/assembly
> streaming-kinesis-asl-assembly/assembly
> ```
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64435/
> consoleFull
>
> Also, up to now, Jenkins seems not to use '-Pmesos' for all PRs.
>
> Bests,
> Dongjoon.
>
>
> On Fri, Aug 26, 2016 at 3:19 PM, Michael Gummelt 
> wrote:
>
>> If it's separable, then sure.  Consistency is nice.
>>
>> On Fri, Aug 26, 2016 at 2:14 PM, Jacek Laskowski  wrote:
>>
>>> Hi Michael,
>>>
>>> Congrats!
>>>
>>> BTW What I like about the change the most is that it uses the
>>> pluggable interface for TaskScheduler and SchedulerBackend (as
>>> introduced by YARN). Think Standalone should follow the steps. WDYT?
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>>
>>>
>>> On Fri, Aug 26, 2016 at 10:20 PM, Michael Gummelt
>>>  wrote:
>>> > Hello devs,
>>> >
>>> > Much like YARN, Mesos has been refactored into a Maven module.  So when
>>> > building, you must add "-Pmesos" to enable Mesos support.
>>> >
>>> > The pre-built distributions from Apache will continue to enable Mesos.
>>> >
>>> > PR: https://github.com/apache/spark/pull/14637
>>> >
>>> > Cheers
>>> >
>>> > --
>>> > Michael Gummelt
>>> > Software Engineer
>>> > Mesosphere
>>>
>>
>>
>>
>> --
>> Michael Gummelt
>> Software Engineer
>> Mesosphere
>>
>
>


Re: Mesos is now a maven module

2016-08-30 Thread Marcelo Vanzin
Michael added the profile to the build scripts, but maybe some script
or code path was missed...

On Tue, Aug 30, 2016 at 9:56 AM, Dongjoon Hyun  wrote:
> Hi, Michael.
>
> It's a great news!
>
> BTW, I'm wondering if the Jenkins (SparkPullRequestBuilder) knows this new
> profile, -Pmesos.
>
> The PR was passed with the following Jenkins build arguments without
> `-Pmesos` option. (at the last test)
> ```
> [info] Building Spark (w/Hive 1.2.1) using SBT with these arguments:  -Pyarn
> -Phadoop-2.3 -Pkinesis-asl -Phive-thriftserver -Phive test:package
> streaming-kafka-0-8-assembly/assembly streaming-flume-assembly/assembly
> streaming-kinesis-asl-assembly/assembly
> ```
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64435/consoleFull
>
> Also, up to now, Jenkins seems not to use '-Pmesos' for all PRs.
>
> Bests,
> Dongjoon.
>
>
> On Fri, Aug 26, 2016 at 3:19 PM, Michael Gummelt 
> wrote:
>>
>> If it's separable, then sure.  Consistency is nice.
>>
>> On Fri, Aug 26, 2016 at 2:14 PM, Jacek Laskowski  wrote:
>>>
>>> Hi Michael,
>>>
>>> Congrats!
>>>
>>> BTW What I like about the change the most is that it uses the
>>> pluggable interface for TaskScheduler and SchedulerBackend (as
>>> introduced by YARN). Think Standalone should follow the steps. WDYT?
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>>
>>>
>>> On Fri, Aug 26, 2016 at 10:20 PM, Michael Gummelt
>>>  wrote:
>>> > Hello devs,
>>> >
>>> > Much like YARN, Mesos has been refactored into a Maven module.  So when
>>> > building, you must add "-Pmesos" to enable Mesos support.
>>> >
>>> > The pre-built distributions from Apache will continue to enable Mesos.
>>> >
>>> > PR: https://github.com/apache/spark/pull/14637
>>> >
>>> > Cheers
>>> >
>>> > --
>>> > Michael Gummelt
>>> > Software Engineer
>>> > Mesosphere
>>
>>
>>
>>
>> --
>> Michael Gummelt
>> Software Engineer
>> Mesosphere
>
>



-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Mesos is now a maven module

2016-08-30 Thread Dongjoon Hyun
Hi, Michael.

It's great news!

BTW, I'm wondering if the Jenkins (SparkPullRequestBuilder) knows this new
profile, -Pmesos.

The PR was passed with the following Jenkins build arguments without
`-Pmesos` option. (at the last test)
```
[info] Building Spark (w/Hive 1.2.1) using SBT with these arguments:
 -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive-thriftserver -Phive test:package
streaming-kafka-0-8-assembly/assembly streaming-flume-assembly/assembly
streaming-kinesis-asl-assembly/assembly
```
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64435/consoleFull

Also, up to now, Jenkins seems not to use '-Pmesos' for all PRs.

Bests,
Dongjoon.


On Fri, Aug 26, 2016 at 3:19 PM, Michael Gummelt 
wrote:

> If it's separable, then sure.  Consistency is nice.
>
> On Fri, Aug 26, 2016 at 2:14 PM, Jacek Laskowski  wrote:
>
>> Hi Michael,
>>
>> Congrats!
>>
>> BTW What I like about the change the most is that it uses the
>> pluggable interface for TaskScheduler and SchedulerBackend (as
>> introduced by YARN). Think Standalone should follow the steps. WDYT?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Fri, Aug 26, 2016 at 10:20 PM, Michael Gummelt
>>  wrote:
>> > Hello devs,
>> >
>> > Much like YARN, Mesos has been refactored into a Maven module.  So when
>> > building, you must add "-Pmesos" to enable Mesos support.
>> >
>> > The pre-built distributions from Apache will continue to enable Mesos.
>> >
>> > PR: https://github.com/apache/spark/pull/14637
>> >
>> > Cheers
>> >
>> > --
>> > Michael Gummelt
>> > Software Engineer
>> > Mesosphere
>>
>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>


Re: KMeans calls takeSample() twice?

2016-08-30 Thread gsamaras
I am not sure what you want me to check. Note that I see two takeSample()s
being invoked every single time I execute KMeans(). In a current job I
have, I did view the details and updated the:

StackOverflow question.




On Tue, Aug 30, 2016 at 9:25 AM, Sean Owen [via Apache Spark Developers
List]  wrote:

> I'm not sure it's a UI bug; it really does record two different
> stages, the second of which executes quickly. I am not sure why that
> would happen off the top of my head. I don't see anything that failed
> here.
>
> Digging into those two stages and what they executed might give a clue
> to what's really going on there.
>
> On Tue, Aug 30, 2016 at 5:18 PM, gsamaras <[hidden email]
> > wrote:
> > Yanbo thank you for your reply. So you are saying that this is a bug in
> the
> > Spark UI in general, and not in the local Spark UI of our cluster, where
> I
> > work, right?
> >
> > George
>
> -
> To unsubscribe e-mail: [hidden email]
> 
>
>
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
> http://apache-spark-developers-list.1001551.n3.nabble.com/KMeans-calls-
> takeSample-twice-tp18761p18788.html
> To unsubscribe from KMeans calls takeSample() twice?, click here
> 
> .
> NAML
> 
>




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/KMeans-calls-takeSample-twice-tp18761p18789.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Fwd: KMeans calls takeSample() twice?

2016-08-30 Thread Georgios Samaras
-- Forwarded message --
From: Georgios Samaras 
Date: Tue, Aug 30, 2016 at 9:49 AM
Subject: Re: KMeans calls takeSample() twice?
To: "Sean Owen [via Apache Spark Developers List]" <
ml-node+s1001551n18788...@n3.nabble.com>


I am not sure what you want me to check. Note that I see two takeSample()s
being invoked every single time I execute KMeans(). In a current job I
have, I did view the details and updated the:

StackOverflow question.




On Tue, Aug 30, 2016 at 9:25 AM, Sean Owen [via Apache Spark Developers
List]  wrote:

> I'm not sure it's a UI bug; it really does record two different
> stages, the second of which executes quickly. I am not sure why that
> would happen off the top of my head. I don't see anything that failed
> here.
>
> Digging into those two stages and what they executed might give a clue
> to what's really going on there.
>
> On Tue, Aug 30, 2016 at 5:18 PM, gsamaras <[hidden email]
> > wrote:
> > Yanbo thank you for your reply. So you are saying that this is a bug in
> the
> > Spark UI in general, and not in the local Spark UI of our cluster, where
> I
> > work, right?
> >
> > George
>
> -
> To unsubscribe e-mail: [hidden email]
> 
>
>
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
> http://apache-spark-developers-list.1001551.n3.nabble.com/
> KMeans-calls-takeSample-twice-tp18761p18788.html
> To unsubscribe from KMeans calls takeSample() twice?, click here
> 
> .
> NAML
> 
>


Re: Broadcast Variable Life Cycle

2016-08-30 Thread Sean Owen
Yeah, after destroy, accessing the broadcast variable results in an
error. Accessing it after it's unpersisted (on an executor) causes it
to be rebroadcast.
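
A tiny sketch of that difference (assuming an existing SparkContext sc):

val bc = sc.broadcast(Array.fill(1000000)(1L))

bc.unpersist()    // frees the copies cached on executors; the value is re-broadcast on next use
sc.parallelize(1 to 3).map(_ => bc.value.length).collect()   // still works

bc.destroy()      // removes all state for the broadcast, including on the driver
// bc.value       // would now fail with a SparkException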

On Tue, Aug 30, 2016 at 5:12 PM, Jerry Lam  wrote:
> Hi Sean,
>
> Thank you for sharing the knowledge between unpersist and destroy.
> Does that mean unpersist keeps the broadcast variable in the driver whereas
> destroy will delete everything about the broadcast variable like it has
> never existed?
>
> Best Regards,
>
> Jerry
>
>
> On Tue, Aug 30, 2016 at 11:58 AM, Sean Owen  wrote:
>>
>> Yes, although there's a difference between unpersist and destroy,
>> you'll hit the same type of question either way. You do indeed have to
>> reason about when you know the broadcast variable is no longer needed
>> in the face of lazy evaluation, and that's hard.
>>
>> Sometimes it's obvious and you can take advantage of this to
>> proactively free resources. You may have to consider restructuring the
>> computation to allow for more resources to be freed, if this is
>> important to scale.
>>
>> Keep in mind that things that are computed and cached may be lost and
>> recomputed even after their parent RDDs were definitely already
>> computed and don't seem to be needed. This is why unpersist is often
>> the better thing to call because it allows for variables to be
>> rebroadcast if needed in this case. Destroy permanently closes the
>> broadcast.
>>
>> On Tue, Aug 30, 2016 at 4:43 PM, Jerry Lam  wrote:
>> > Hi Sean,
>> >
>> > Thank you for the response. The only problem is that actively managing
>> > broadcast variables require to return the broadcast variables to the
>> > caller
>> > if the function that creates the broadcast variables does not contain
>> > any
>> > action. That is the scope that uses the broadcast variables cannot
>> > destroy
>> > the broadcast variables in many cases. For example:
>> >
>> > ==
>> > def perfromTransformation(rdd: RDD[int]) = {
>> >val sharedMap = sc.broadcast(map)
>> >rdd.map{id =>
>> >   val localMap = sharedMap.vlaue
>> >   (id, localMap(id))
>> >}
>> > }
>> >
>> > def main = {
>> > 
>> > performTransformation(rdd).toDF("id",
>> > "i").write.parquet("dummy_example")
>> > }
>> > ==
>> >
>> > In this example above, we cannot destroy the sharedMap before the
>> > write.parquet is executed because RDD is evaluated lazily. We will get a
>> > exception if I put sharedMap.destroy like this:
>> >
>> > ==
>> > def perfromTransformation(rdd: RDD[int]) = {
>> >val sharedMap = sc.broadcast(map)
>> >val result = rdd.map{id =>
>> >   val localMap = sharedMap.vlaue
>> >   (id, localMap(id))
>> >}
>> >sharedMap.destroy
>> >result
>> > }
>> > ==
>> >
>> > Am I missing something? Are there better way to do this without
>> > returning
>> > the broadcast variables to the main function?
>> >
>> > Best Regards,
>> >
>> > Jerry
>> >
>> >
>> >
>> > On Mon, Aug 29, 2016 at 12:11 PM, Sean Owen  wrote:
>> >>
>> >> Yes you want to actively unpersist() or destroy() broadcast variables
>> >> when they're no longer needed. They can eventually be removed when the
>> >> reference on the driver is garbage collected, but you usually would
>> >> not want to rely on that.
>> >>
>> >> On Mon, Aug 29, 2016 at 4:30 PM, Jerry Lam 
>> >> wrote:
>> >> > Hello spark developers,
>> >> >
>> >> > Anyone can shed some lights on the life cycle of the broadcast
>> >> > variables?
>> >> > Basically, if I have a broadcast variable defined in a loop and for
>> >> > each
>> >> > iteration, I provide a different value.
>> >> > // For example:
>> >> > for(i< 1 to 10) {
>> >> > val bc = sc.broadcast(i)
>> >> > sc.parallelize(Seq(1,2,3)).map{id => val i = bc.value; (id,
>> >> > i)}.toDF("id", "i").write.parquet("/dummy_output")
>> >> > }
>> >> >
>> >> > Do I need to active manage the broadcast variable in this case? I
>> >> > know
>> >> > this
>> >> > example is not real but please imagine this broadcast variable can
>> >> > hold
>> >> > an
>> >> > array of 1M Long.
>> >> >
>> >> > Regards,
>> >> >
>> >> > Jerry
>> >> >
>> >> >
>> >> >
>> >> > On Sun, Aug 21, 2016 at 1:07 PM, Jerry Lam 
>> >> > wrote:
>> >> >>
>> >> >> Hello spark developers,
>> >> >>
>> >> >> Can someone explain to me what is the lifecycle of a broadcast
>> >> >> variable?
>> >> >> When a broadcast variable will be garbage-collected at the
>> >> >> driver-side
>> >> >> and
>> >> >> at the executor-side? Does a spark application need to actively
>> >> >> manage
>> >> >> the
>> >> >> broadcast variables to ensure that it will not run in OOM?
>> >> >>
>> >> >> Best Regards,
>> >> >>
>> >> >> Jerry
>> >> >
>> >> >
>> >
>> >
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: KMeans calls takeSample() twice?

2016-08-30 Thread gsamaras
Yanbo thank you for your reply. So you are saying that this is a bug in the
Spark UI in general, and not in the local Spark UI of our cluster, where I
work, right?

George

On Mon, Aug 29, 2016 at 11:55 PM, Yanbo Liang-2 [via Apache Spark
Developers List]  wrote:

> I ran KMeans with probes and found that takeSample() was actually called
> only once. It looks like this issue was caused by a display mistake in the
> Spark UI.
>
> Thanks
> Yanbo
>
> On Mon, Aug 29, 2016 at 2:34 PM, gsamaras <[hidden email]
> > wrote:
>
>> After reading the internal code of Spark about it, I wasn't able to
>> understand why it calls takeSample() twice? Can someone please explain?
>>
>> There is a relevant StackOverflow question.
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-developers
>> -list.1001551.n3.nabble.com/KMeans-calls-takeSample-twice-tp18761.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe e-mail: [hidden email]
>> 
>>
>>
>
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
> http://apache-spark-developers-list.1001551.n3.nabble.com/KMeans-calls-
> takeSample-twice-tp18761p18768.html
> To unsubscribe from KMeans calls takeSample() twice?, click here
> 
> .
> NAML
> 
>




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/KMeans-calls-takeSample-twice-tp18761p18786.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: Broadcast Variable Life Cycle

2016-08-30 Thread Jerry Lam
Hi Sean,

Thank you for sharing the knowledge between unpersist and destroy.
Does that mean unpersist keeps the broadcast variable in the driver whereas
destroy will delete everything about the broadcast variable like it has
never existed?

Best Regards,

Jerry


On Tue, Aug 30, 2016 at 11:58 AM, Sean Owen  wrote:

> Yes, although there's a difference between unpersist and destroy,
> you'll hit the same type of question either way. You do indeed have to
> reason about when you know the broadcast variable is no longer needed
> in the face of lazy evaluation, and that's hard.
>
> Sometimes it's obvious and you can take advantage of this to
> proactively free resources. You may have to consider restructuring the
> computation to allow for more resources to be freed, if this is
> important to scale.
>
> Keep in mind that things that are computed and cached may be lost and
> recomputed even after their parent RDDs were definitely already
> computed and don't seem to be needed. This is why unpersist is often
> the better thing to call because it allows for variables to be
> rebroadcast if needed in this case. Destroy permanently closes the
> broadcast.
>
> On Tue, Aug 30, 2016 at 4:43 PM, Jerry Lam  wrote:
> > Hi Sean,
> >
> > Thank you for the response. The only problem is that actively managing
> > broadcast variables require to return the broadcast variables to the
> caller
> > if the function that creates the broadcast variables does not contain any
> > action. That is the scope that uses the broadcast variables cannot
> destroy
> > the broadcast variables in many cases. For example:
> >
> > ==
> > def perfromTransformation(rdd: RDD[int]) = {
> >val sharedMap = sc.broadcast(map)
> >rdd.map{id =>
> >   val localMap = sharedMap.vlaue
> >   (id, localMap(id))
> >}
> > }
> >
> > def main = {
> > 
> > performTransformation(rdd).toDF("id",
> > "i").write.parquet("dummy_example")
> > }
> > ==
> >
> > In this example above, we cannot destroy the sharedMap before the
> > write.parquet is executed because RDD is evaluated lazily. We will get a
> > exception if I put sharedMap.destroy like this:
> >
> > ==
> > def perfromTransformation(rdd: RDD[int]) = {
> >val sharedMap = sc.broadcast(map)
> >val result = rdd.map{id =>
> >   val localMap = sharedMap.vlaue
> >   (id, localMap(id))
> >}
> >sharedMap.destroy
> >result
> > }
> > ==
> >
> > Am I missing something? Are there better way to do this without returning
> > the broadcast variables to the main function?
> >
> > Best Regards,
> >
> > Jerry
> >
> >
> >
> > On Mon, Aug 29, 2016 at 12:11 PM, Sean Owen  wrote:
> >>
> >> Yes you want to actively unpersist() or destroy() broadcast variables
> >> when they're no longer needed. They can eventually be removed when the
> >> reference on the driver is garbage collected, but you usually would
> >> not want to rely on that.
> >>
> >> On Mon, Aug 29, 2016 at 4:30 PM, Jerry Lam 
> wrote:
> >> > Hello spark developers,
> >> >
> >> > Anyone can shed some lights on the life cycle of the broadcast
> >> > variables?
> >> > Basically, if I have a broadcast variable defined in a loop and for
> each
> >> > iteration, I provide a different value.
> >> > // For example:
> >> > for(i< 1 to 10) {
> >> > val bc = sc.broadcast(i)
> >> > sc.parallelize(Seq(1,2,3)).map{id => val i = bc.value; (id,
> >> > i)}.toDF("id", "i").write.parquet("/dummy_output")
> >> > }
> >> >
> >> > Do I need to active manage the broadcast variable in this case? I know
> >> > this
> >> > example is not real but please imagine this broadcast variable can
> hold
> >> > an
> >> > array of 1M Long.
> >> >
> >> > Regards,
> >> >
> >> > Jerry
> >> >
> >> >
> >> >
> >> > On Sun, Aug 21, 2016 at 1:07 PM, Jerry Lam 
> wrote:
> >> >>
> >> >> Hello spark developers,
> >> >>
> >> >> Can someone explain to me what is the lifecycle of a broadcast
> >> >> variable?
> >> >> When a broadcast variable will be garbage-collected at the
> driver-side
> >> >> and
> >> >> at the executor-side? Does a spark application need to actively
> manage
> >> >> the
> >> >> broadcast variables to ensure that it will not run in OOM?
> >> >>
> >> >> Best Regards,
> >> >>
> >> >> Jerry
> >> >
> >> >
> >
> >
>


Re: Structured Streaming with Kafka sources/sinks

2016-08-30 Thread Reynold Xin
In this case, simply not much progress has been made, because people might
be busy with other things.

Ofir, it looks like you have spent a non-trivial amount of time thinking about
this topic and have even designed something that works -- can you chime in on
the JIRA ticket with your thoughts and your prototype? That would be
tremendously useful to the project.



On Tue, Aug 30, 2016 at 11:44 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> > I personally find it disappointing that a big chunk of Spark's design
> and development is happening behind closed curtains.
>
> I'm not too familiar with Streaming, but I see design docs and proposals
> for ML and SQL published here and on JIRA all the time, and they are
> discussed extensively.
>
> For example, here are some ML JIRAs with extensive design discussions:
> SPARK-6725 , SPARK-13944
> , SPARK-16365
> 
>
> Nick
>
> On Tue, Aug 30, 2016 at 11:10 AM Cody Koeninger 
> wrote:
>
>> Not that I wouldn't rather have more open communication around this
>> issue...but what are people actually expecting to get out of
>> structured streaming with regard to Kafka?
>>
>> There aren't any realistic pushdown-type optimizations available, and
>> from what I could tell the last time I looked at structured streaming,
>> resolving the event time vs processing time issue was still a ways
>> off.
>>
>> On Tue, Aug 30, 2016 at 1:56 AM, Ofir Manor 
>> wrote:
>> > I personally find it disappointing that a big chunk of Spark's design
>> and
>> > development is happening behind closed curtains. It makes it harder than
>> > necessary for me to work with Spark. We had to improvise in the recent
>> weeks
>> > a temporary solution for reading from Kafka (from Structured Streaming)
>> to
>> > unblock our development, and I feel that if the design and development
>> of
>> > that feature was done in the open, it would have saved us a lot of
>> hassle
>> > (and would reduce the refactoring of our code base).
>> >
>> > It's hard not to compare it to other Apache projects - for example, I believe
>> > most of the Apache Kafka full-time contributors work at a single
>> company,
>> > but they manage as a community to have a very transparent design and
>> > development process, which seems to work great.
>> >
>> > Ofir Manor
>> >
>> > Co-Founder & CTO | Equalum
>> >
>> > Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>> >
>> >
>> > On Mon, Aug 29, 2016 at 10:39 PM, Fred Reiss 
>> wrote:
>> >>
>> >> I think that the community really needs some feedback on the progress
>> of
>> >> this very important task. Many existing Spark Streaming applications
>> can't
>> >> be ported to Structured Streaming without Kafka support.
>> >>
>> >> Is there a design document somewhere?  Or can someone from the
>> DataBricks
>> >> team break down the existing monolithic JIRA issue into smaller steps
>> that
>> >> reflect the current development plan?
>> >>
>> >> Fred
>> >>
>> >>
>> >> On Sat, Aug 27, 2016 at 2:32 PM, Koert Kuipers 
>> wrote:
>> >>>
>> >>> thats great
>> >>>
>> >>> is this effort happening anywhere that is publicly visible? github?
>> >>>
>> >>> On Tue, Aug 16, 2016 at 2:04 AM, Reynold Xin 
>> wrote:
>> 
>>  We (the team at Databricks) are working on one currently.
>> 
>> 
>>  On Mon, Aug 15, 2016 at 7:26 PM, Cody Koeninger 
>>  wrote:
>> >
>> > https://issues.apache.org/jira/browse/SPARK-15406
>> >
>> > I'm not working on it (yet?), never got an answer to the question of
>> > who was planning to work on it.
>> >
>> > On Mon, Aug 15, 2016 at 9:12 PM, Guo, Chenzhao <
>> chenzhao@intel.com>
>> > wrote:
>> > > Hi all,
>> > >
>> > >
>> > >
>> > > I’m trying to write Structured Streaming test code and will deal
>> with
>> > > Kafka
>> > > source. Currently Spark 2.0 doesn’t support Kafka sources/sinks.
>> > >
>> > >
>> > >
>> > > I found some Databricks slides saying that Kafka sources/sinks
>> will
>> > > be
>> > > implemented in Spark 2.0, so is there anybody working on this? And
>> > > when will
>> > > it be released?
>> > >
>> > >
>> > >
>> > > Thanks,
>> > >
>> > > Chenzhao Guo
>> >
>> > 
>> -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>> 
>> >>>
>> >>
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Broadcast Variable Life Cycle

2016-08-30 Thread Sean Owen
Yes, although there's a difference between unpersist and destroy,
you'll hit the same type of question either way. You do indeed have to
reason about when you know the broadcast variable is no longer needed
in the face of lazy evaluation, and that's hard.

Sometimes it's obvious and you can take advantage of this to
proactively free resources. You may have to consider restructuring the
computation to allow for more resources to be freed, if this is
important to scale.

Keep in mind that things that are computed and cached may be lost and
recomputed even after their parent RDDs were definitely already
computed and don't seem to be needed. This is why unpersist is often
the better thing to call because it allows for variables to be
rebroadcast if needed in this case. Destroy permanently closes the
broadcast.
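
To make that concrete, one way to restructure Jerry's example (a sketch; the map value and output path are assumed as in his snippet below) is to run the action in the same scope as the broadcast, so destroy() only happens after the write has finished:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Sketch: force the action while the broadcast is still alive, then destroy it.
def performTransformation(rdd: RDD[Int], spark: SparkSession, map: Map[Int, Int]): Unit = {
  import spark.implicits._
  val sharedMap = spark.sparkContext.broadcast(map)
  rdd.map(id => (id, sharedMap.value(id)))
     .toDF("id", "i")
     .write.parquet("dummy_example")   // the job actually runs here
  sharedMap.destroy()                  // safe now: no later job will need it
}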

On Tue, Aug 30, 2016 at 4:43 PM, Jerry Lam  wrote:
> Hi Sean,
>
> Thank you for the response. The only problem is that actively managing
> broadcast variables require to return the broadcast variables to the caller
> if the function that creates the broadcast variables does not contain any
> action. That is the scope that uses the broadcast variables cannot destroy
> the broadcast variables in many cases. For example:
>
> ==
> def perfromTransformation(rdd: RDD[int]) = {
>val sharedMap = sc.broadcast(map)
>rdd.map{id =>
>   val localMap = sharedMap.vlaue
>   (id, localMap(id))
>}
> }
>
> def main = {
> 
> performTransformation(rdd).toDF("id",
> "i").write.parquet("dummy_example")
> }
> ==
>
> In this example above, we cannot destroy the sharedMap before the
> write.parquet is executed because RDD is evaluated lazily. We will get a
> exception if I put sharedMap.destroy like this:
>
> ==
> def perfromTransformation(rdd: RDD[int]) = {
>val sharedMap = sc.broadcast(map)
>val result = rdd.map{id =>
>   val localMap = sharedMap.vlaue
>   (id, localMap(id))
>}
>sharedMap.destroy
>result
> }
> ==
>
> Am I missing something? Are there better way to do this without returning
> the broadcast variables to the main function?
>
> Best Regards,
>
> Jerry
>
>
>
> On Mon, Aug 29, 2016 at 12:11 PM, Sean Owen  wrote:
>>
>> Yes you want to actively unpersist() or destroy() broadcast variables
>> when they're no longer needed. They can eventually be removed when the
>> reference on the driver is garbage collected, but you usually would
>> not want to rely on that.
>>
>> On Mon, Aug 29, 2016 at 4:30 PM, Jerry Lam  wrote:
>> > Hello spark developers,
>> >
>> > Anyone can shed some lights on the life cycle of the broadcast
>> > variables?
>> > Basically, if I have a broadcast variable defined in a loop and for each
>> > iteration, I provide a different value.
>> > // For example:
>> > for(i< 1 to 10) {
>> > val bc = sc.broadcast(i)
>> > sc.parallelize(Seq(1,2,3)).map{id => val i = bc.value; (id,
>> > i)}.toDF("id", "i").write.parquet("/dummy_output")
>> > }
>> >
>> > Do I need to active manage the broadcast variable in this case? I know
>> > this
>> > example is not real but please imagine this broadcast variable can hold
>> > an
>> > array of 1M Long.
>> >
>> > Regards,
>> >
>> > Jerry
>> >
>> >
>> >
>> > On Sun, Aug 21, 2016 at 1:07 PM, Jerry Lam  wrote:
>> >>
>> >> Hello spark developers,
>> >>
>> >> Can someone explain to me what is the lifecycle of a broadcast
>> >> variable?
>> >> When a broadcast variable will be garbage-collected at the driver-side
>> >> and
>> >> at the executor-side? Does a spark application need to actively manage
>> >> the
>> >> broadcast variables to ensure that it will not run in OOM?
>> >>
>> >> Best Regards,
>> >>
>> >> Jerry
>> >
>> >
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Structured Streaming with Kafka sources/sinks

2016-08-30 Thread Nicholas Chammas
> I personally find it disappointing that a big chunk of Spark's design and
development is happening behind closed curtains.

I'm not too familiar with Streaming, but I see design docs and proposals
for ML and SQL published here and on JIRA all the time, and they are
discussed extensively.

For example, here are some ML JIRAs with extensive design discussions:
SPARK-6725 , SPARK-13944
, SPARK-16365


Nick

On Tue, Aug 30, 2016 at 11:10 AM Cody Koeninger  wrote:

> Not that I wouldn't rather have more open communication around this
> issue...but what are people actually expecting to get out of
> structured streaming with regard to Kafka?
>
> There aren't any realistic pushdown-type optimizations available, and
> from what I could tell the last time I looked at structured streaming,
> resolving the event time vs processing time issue was still a ways
> off.
>
> On Tue, Aug 30, 2016 at 1:56 AM, Ofir Manor  wrote:
> > I personally find it disappointing that a big chunk of Spark's design and
> > development is happening behind closed curtains. It makes it harder than
> > necessary for me to work with Spark. We had to improvise in the recent
> weeks
> > a temporary solution for reading from Kafka (from Structured Streaming)
> to
> > unblock our development, and I feel that if the design and development of
> > that feature was done in the open, it would have saved us a lot of hassle
> > (and would reduce the refactoring of our code base).
> >
> > It's hard not to compare it to other Apache projects - for example, I believe
> > most of the Apache Kafka full-time contributors work at a single company,
> > but they manage as a community to have a very transparent design and
> > development process, which seems to work great.
> >
> > Ofir Manor
> >
> > Co-Founder & CTO | Equalum
> >
> > Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
> >
> >
> > On Mon, Aug 29, 2016 at 10:39 PM, Fred Reiss 
> wrote:
> >>
> >> I think that the community really needs some feedback on the progress of
> >> this very important task. Many existing Spark Streaming applications
> can't
> >> be ported to Structured Streaming without Kafka support.
> >>
> >> Is there a design document somewhere?  Or can someone from the
> DataBricks
> >> team break down the existing monolithic JIRA issue into smaller steps
> that
> >> reflect the current development plan?
> >>
> >> Fred
> >>
> >>
> >> On Sat, Aug 27, 2016 at 2:32 PM, Koert Kuipers 
> wrote:
> >>>
> >>> thats great
> >>>
> >>> is this effort happening anywhere that is publicly visible? github?
> >>>
> >>> On Tue, Aug 16, 2016 at 2:04 AM, Reynold Xin 
> wrote:
> 
>  We (the team at Databricks) are working on one currently.
> 
> 
>  On Mon, Aug 15, 2016 at 7:26 PM, Cody Koeninger 
>  wrote:
> >
> > https://issues.apache.org/jira/browse/SPARK-15406
> >
> > I'm not working on it (yet?), never got an answer to the question of
> > who was planning to work on it.
> >
> > On Mon, Aug 15, 2016 at 9:12 PM, Guo, Chenzhao <
> chenzhao@intel.com>
> > wrote:
> > > Hi all,
> > >
> > >
> > >
> > > I’m trying to write Structured Streaming test code and will deal
> with
> > > Kafka
> > > source. Currently Spark 2.0 doesn’t support Kafka sources/sinks.
> > >
> > >
> > >
> > > I found some Databricks slides saying that Kafka sources/sinks will
> > > be
> > > implemented in Spark 2.0, so is there anybody working on this? And
> > > when will
> > > it be released?
> > >
> > >
> > >
> > > Thanks,
> > >
> > > Chenzhao Guo
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> 
> >>>
> >>
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Broadcast Variable Life Cycle

2016-08-30 Thread Jerry Lam
Hi Sean,

Thank you for the response. The only problem is that actively managing
broadcast variables require to return the broadcast variables to the caller
if the function that creates the broadcast variables does not contain any
action. That is the scope that uses the broadcast variables cannot destroy
the broadcast variables in many cases. For example:

==
def performTransformation(rdd: RDD[Int]) = {
  val sharedMap = sc.broadcast(map)
  rdd.map { id =>
    val localMap = sharedMap.value
    (id, localMap(id))
  }
}

def main = {

performTransformation(rdd).toDF("id",
"i").write.parquet("dummy_example")
}
==

In the example above, we cannot destroy the sharedMap before
write.parquet is executed, because the RDD is evaluated lazily. We will get an
exception if I put sharedMap.destroy like this:

==
def performTransformation(rdd: RDD[Int]) = {
  val sharedMap = sc.broadcast(map)
  val result = rdd.map { id =>
    val localMap = sharedMap.value
    (id, localMap(id))
  }
  sharedMap.destroy
  result
}
==

Am I missing something? Is there a better way to do this without returning
the broadcast variables to the main function?

Best Regards,

Jerry



On Mon, Aug 29, 2016 at 12:11 PM, Sean Owen  wrote:

> Yes you want to actively unpersist() or destroy() broadcast variables
> when they're no longer needed. They can eventually be removed when the
> reference on the driver is garbage collected, but you usually would
> not want to rely on that.
>
> On Mon, Aug 29, 2016 at 4:30 PM, Jerry Lam  wrote:
> > Hello spark developers,
> >
> > Anyone can shed some lights on the life cycle of the broadcast variables?
> > Basically, if I have a broadcast variable defined in a loop and for each
> > iteration, I provide a different value.
> > // For example:
> > for(i< 1 to 10) {
> > val bc = sc.broadcast(i)
> > sc.parallelize(Seq(1,2,3)).map{id => val i = bc.value; (id,
> > i)}.toDF("id", "i").write.parquet("/dummy_output")
> > }
> >
> > Do I need to actively manage the broadcast variable in this case? I know
> > this example is not real, but please imagine this broadcast variable can
> > hold an array of 1M Longs.
> >
> > Regards,
> >
> > Jerry
> >
> >
> >
> > On Sun, Aug 21, 2016 at 1:07 PM, Jerry Lam  wrote:
> >>
> >> Hello spark developers,
> >>
> >> Can someone explain to me what the lifecycle of a broadcast variable is?
> >> When will a broadcast variable be garbage-collected at the driver side
> >> and at the executor side? Does a Spark application need to actively
> >> manage the broadcast variables to ensure that it will not run into OOM?
> >>
> >> Best Regards,
> >>
> >> Jerry
> >
> >
>


Re: Reynold on vacation next two weeks

2016-08-30 Thread Jacek Laskowski
Hi,

Definitely well deserved. Don't check your emails for the 2 weeks. Not even
for a minute :-)

Jacek

On 30 Aug 2016 10:21 a.m., "Reynold Xin"  wrote:

> A lot of people have been pinging me on github and email directly and
> expect instant reply. Just FYI I'm on vacation for two weeks with limited
> internet access.


Re: Structured Streaming with Kafka sources/sinks

2016-08-30 Thread Cody Koeninger
Not that I wouldn't rather have more open communication around this
issue...but what are people actually expecting to get out of
structured streaming with regard to Kafka?

There aren't any realistic pushdown-type optimizations available, and
from what I could tell the last time I looked at structured streaming,
resolving the event time vs processing time issue was still a ways
off.

On Tue, Aug 30, 2016 at 1:56 AM, Ofir Manor  wrote:
> I personally find it disappointing that a big chunk of Spark's design and
> development is happening behind closed curtains. It makes it harder than
> necessary for me to work with Spark. In recent weeks we had to improvise a
> temporary solution for reading from Kafka (from Structured Streaming) to
> unblock our development, and I feel that if the design and development of
> that feature had been done in the open, it would have saved us a lot of
> hassle (and would reduce the refactoring of our code base).
>
> It's hard not to compare it to other Apache projects - for example, I believe
> most of the Apache Kafka full-time contributors work at a single company,
> but they manage as a community to have a very transparent design and
> development process, which seems to work great.
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
>
> On Mon, Aug 29, 2016 at 10:39 PM, Fred Reiss  wrote:
>>
>> I think that the community really needs some feedback on the progress of
>> this very important task. Many existing Spark Streaming applications can't
>> be ported to Structured Streaming without Kafka support.
>>
>> Is there a design document somewhere?  Or can someone from the DataBricks
>> team break down the existing monolithic JIRA issue into smaller steps that
>> reflect the current development plan?
>>
>> Fred
>>
>>
>> On Sat, Aug 27, 2016 at 2:32 PM, Koert Kuipers  wrote:
>>>
>>> thats great
>>>
>>> is this effort happening anywhere that is publicly visible? github?
>>>
>>> On Tue, Aug 16, 2016 at 2:04 AM, Reynold Xin  wrote:

 We (the team at Databricks) are working on one currently.


 On Mon, Aug 15, 2016 at 7:26 PM, Cody Koeninger 
 wrote:
>
> https://issues.apache.org/jira/browse/SPARK-15406
>
> I'm not working on it (yet?), never got an answer to the question of
> who was planning to work on it.
>
> On Mon, Aug 15, 2016 at 9:12 PM, Guo, Chenzhao 
> wrote:
> > Hi all,
> >
> >
> >
> > I’m trying to write Structured Streaming test code and will deal with
> > Kafka
> > source. Currently Spark 2.0 doesn’t support Kafka sources/sinks.
> >
> >
> >
> > I found some Databricks slides saying that Kafka sources/sinks will
> > be
> > implemented in Spark 2.0, so is there anybody working on this? And
> > when will
> > it be released?
> >
> >
> >
> > Thanks,
> >
> > Chenzhao Guo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

>>>
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



ApacheCon Seville CFP closes September 9th

2016-08-30 Thread Rich Bowen
It's traditional. We wait until the last minute to get our talk proposals
in for conferences.

Well, the last minute has arrived. The CFP for ApacheCon Seville closes
on September 9th, which is less than 2 weeks away. It's time to get your
talks in, so that we can make this the best ApacheCon yet.

It's also time to discuss with your developer and user community whether
there's a track of talks that you might want to propose, so that you
have more complete coverage of your project than a talk or two.

For Apache Big Data, the relevant URLs are:
Event details:
http://events.linuxfoundation.org/events/apache-big-data-europe
CFP:
http://events.linuxfoundation.org/events/apache-big-data-europe/program/cfp

For ApacheCon Europe, the relevant URLs are:
Event details: http://events.linuxfoundation.org/events/apachecon-europe
CFP: http://events.linuxfoundation.org/events/apachecon-europe/program/cfp

This year, we'll be reviewing papers "blind" - that is, looking at the
abstracts without knowing who the speaker is. This has been shown to
eliminate the "me and my buddies" nature of many tech conferences,
producing more diversity, and more new speakers. So make sure your
abstracts clearly explain what you'll be talking about.

For further updates about ApacheCon, follow us on Twitter, @ApacheCon,
or drop by our IRC channel, #apachecon on the Freenode IRC network.

-- 
Rich Bowen
WWW: http://apachecon.com/
Twitter: @ApacheCon

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Spark 2.0.1 fails for provided hadoop

2016-08-30 Thread Rishi Mishra
Hi All,
I tried to configure Spark with a MapR Hadoop cluster. For that I built
Spark 2.0 from source with the hadoop-provided option. Then, as per the
documentation, I set my Hadoop libraries in spark-env.sh.
However, I get an error while the SessionCatalog is being created. Please
see the exception stack trace below. The point to note is that the default
scheme for MapR is "maprfs://", hence the error.

I can see that some fixes were made earlier to solve this problem:
https://github.com/apache/spark/pull/13348
But another PR removed that code:
https://github.com/apache/spark/pull/13868/files

If I apply the changes from the first PR mentioned above, it works
perfectly fine.

Is this intentional, or is it a bug?
If it's intentional, does the user always have to run the driver on a
Hadoop cluster node? That might make "some" sense in a production
environment, but it is not very helpful during development.
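
In the meantime, a possible workaround is to point spark.sql.warehouse.dir
explicitly at a location the driver user can create, instead of letting Spark
derive it from the default filesystem scheme. A minimal sketch (the app name
and path below are illustrative, and this assumes the warehouse directory is
the only thing failing):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("mapr-warehouse-workaround")                              // illustrative name
  .config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse")  // any writable location works
  .getOrCreate()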

GIT version on my fork : d16f9a0b7c464728d7b11899740908e23820a797.

Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra



Exception Stack
=

2016-08-29 18:30:17,0869 ERROR JniCommon
fs/client/fileclient/cc/jni_MapRClient.cc:2073 Thread: 18258 mkdirs failed
for /rishim1/POCs/spark-2.0.1-SNAPSHOT-bin-custom-spark/spar, error 13
org.apache.spark.SparkException: Unable to create database default as
failed to create its directory
maprfs:///rishim1/POCs/spark-2.0.1-SNAPSHOT-bin-custom-spark/spark-warehouse
  at
org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.liftedTree1$1(InMemoryCatalog.scala:126)
  at
org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.createDatabase(InMemoryCatalog.scala:120)
  at
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:147)
  at
org.apache.spark.sql.catalyst.catalog.SessionCatalog.(SessionCatalog.scala:89)
  at
org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
  at
org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
  at
org.apache.spark.sql.internal.SessionState$$anon$1.(SessionState.scala:112)
  at
org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
  at
org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
  at
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at
org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:427)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:411)
  ... 48 elided
Caused by: org.apache.hadoop.security.AccessControlException: User
rishim(user id 1000)  has been denied access to create spark-warehouse
  at com.mapr.fs.MapRFileSystem.makeDir(MapRFileSystem.java:1239)
  at com.mapr.fs.MapRFileSystem.mkdirs(MapRFileSystem.java:1259)
  at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1913)
  at
org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.liftedTree1$1(InMemoryCatalog.scala:123)
  ... 62 more


Re: Spark Kerberos proxy user

2016-08-30 Thread Abel Rincón
Hi again,

Is there any related open issue?

Currently, we (Stratio) have an end-to-end solution running, with a Spark
distribution based on 1.6.2.

Work in progress:

- Create and share our solution documentation.
- A test suite for all of the above.
- Rebase our code onto the apache master branch.

Regards,


2016-08-25 12:10 GMT+02:00 Abel Rincón :

> Hi devs,
>
> I'm working (at Stratio) on using Spark over Mesos and standalone, with a
> kerberized HDFS.
>
> We are working to solve the following scenarios:
>
>
>    - We have a long-running Spark SQL context (a Thrift-server-like
>    component called CrossData) used concurrently by many users. We need
>    access to HDFS data with Kerberos authorization using the proxy-user
>    method; we rely on the HDFS permission system, or on our custom
>    authorizer.
>
>
>    - We need to load/write DataFrames using data sources with an HDFS
>    backend (built-in or others) such as JSON, CSV, Parquet, ORC, etc., so
>    we want to enable secure (Kerberos) access by configuration only.
>
>
>    - We have a scenario where we want to run streaming jobs over
>    kerberized HDFS, for both reads/writes and checkpointing.
>
>
>    - We have to load every single RDD that Spark core reads or writes
>    over kerberized HDFS without breaking the Spark API.
>
>
>
>
> As you can see, we have a "special" requirement: the need to set the proxy
> user per job over the same Spark context.
>
> Do you have any ideas on how to cover this?
>
>
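
For reference, the "proxy-user method" mentioned above is Hadoop's standard
impersonation mechanism. A minimal sketch using Hadoop's UserGroupInformation
(the user name and path are illustrative, and this assumes the cluster's
hadoop.proxyuser.* rules allow the service principal to impersonate end users):

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

// The kerberized service identity (obtained via kinit or a keytab login).
val realUser  = UserGroupInformation.getLoginUser()
// Impersonate the end user for this particular request/job.
val proxyUser = UserGroupInformation.createProxyUser("alice", realUser)

val exists = proxyUser.doAs(new PrivilegedExceptionAction[Boolean] {
  override def run(): Boolean = {
    val fs = FileSystem.get(new Configuration())
    // HDFS evaluates permissions as "alice", not as the service principal.
    fs.exists(new Path("/user/alice/data"))
  }
})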


Re: 3Ps for Datasets not available?! (=Parquet Predicate Pushdown)

2016-08-30 Thread Jacek Laskowski
Hi Reynold,

That's what I was told a few times already (most notably by Adam on
Twitter), but I couldn't understand what it meant exactly. It's only now
that I understand what you're saying, Reynold :)

Does this make DataFrame's Column-based or SQL-based queries usually
faster than Datasets with Encoders?

How wrong am I to claim that for Parquet files, Hive tables, and JDBC
tables, using DataFrame + Column/SQL-based queries is usually faster than
Datasets? Is it that Datasets only shine for strongly typed queries against
data sources with no support for optimizations like filter pushdown? I'm
tempted to say that for some data sources DataFrames are always faster
than Datasets. True? What am I missing?

https://twitter.com/jaceklaskowski/status/770554918419755008

Thanks a lot, Reynold, for helping me out to get the gist of it all!

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Aug 30, 2016 at 10:23 AM, Reynold Xin  wrote:
> The UDF is a black box so Spark can't know what it is dealing with. There
> are simple cases in which we can analyze the UDF's byte code and infer what
> it is doing, but it is pretty difficult to do in general.
>
>
> On Tuesday, August 30, 2016, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> I've been playing with UDFs and why they're a blackbox for Spark's
>> optimizer and started with filters to showcase the optimizations in
>> play.
>>
>> My current understanding is that the predicate pushdowns are supported
>> by the following data sources:
>>
>> 1. Hive tables
>> 2. Parquet files
>> 3. ORC files
>> 4. JDBC
>>
> >> While working on examples I came to the conclusion that predicate
> >> pushdown does work for the data sources mentioned above, but only for
> >> DataFrames. That was quite interesting since I was so much into
> >> Datasets as strongly type-safe data abstractions in Spark SQL.
>>
>> Can you help me to find the truth? Any links to videos, articles,
>> commits and such to further deepen my understanding of optimizations
>> in Spark SQL 2.0? I'd greatly appreciate.
>>
>> The following query pushes the filter down to Parquet (see
>> PushedFilters attribute at the bottom)
>>
>> scala> cities.filter('name === "Warsaw").queryExecution.executedPlan
>> res30: org.apache.spark.sql.execution.SparkPlan =
>> *Project [id#196L, name#197]
>> +- *Filter (isnotnull(name#197) && (name#197 = Warsaw))
>>+- *FileScan parquet [id#196L,name#197] Batched: true, Format:
>> ParquetFormat, InputPaths:
>> file:/Users/jacek/dev/oss/spark/cities.parquet, PartitionFilters: [],
>> PushedFilters: [IsNotNull(name), EqualTo(name,Warsaw)], ReadSchema:
>> struct
>>
>> Why does this not work for Datasets? Is the function/lambda too
>> complex? Are there any examples where it works for Datasets? Are we
> >> perhaps trading strong type-safety for optimizations like predicate
> >> pushdown (and the features are yet to come in the next releases of
> >> Spark 2)?
>>
>> scala> cities.as[(Long, String)].filter(_._2 ==
>> "Warsaw").queryExecution.executedPlan
>> res31: org.apache.spark.sql.execution.SparkPlan =
>> *Filter .apply
>> +- *FileScan parquet [id#196L,name#197] Batched: true, Format:
>> ParquetFormat, InputPaths:
>> file:/Users/jacek/dev/oss/spark/cities.parquet, PartitionFilters: [],
>> PushedFilters: [], ReadSchema: struct
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: 3Ps for Datasets not available?! (=Parquet Predicate Pushdown)

2016-08-30 Thread Reynold Xin
The UDF is a black box, so Spark can't know what it is dealing with. There
are simple cases in which we can analyze the UDF's byte code and infer what
it is doing, but it is pretty difficult to do in general.
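
To make the contrast concrete, here is a minimal sketch against the same
cities Parquet file as in the plans quoted below. The Column expression stays
visible to Catalyst and shows up under PushedFilters; the typed lambda
compiles to opaque byte code and does not:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pushdown-sketch").getOrCreate()
import spark.implicits._

val cities = spark.read.parquet("cities.parquet")   // columns: id, name

// Pushed down: the plan shows PushedFilters: [IsNotNull(name), EqualTo(name,Warsaw)]
cities.filter($"name" === "Warsaw").explain()

// Not pushed down: the Scala lambda is a black box to the optimizer
cities.as[(Long, String)].filter(_._2 == "Warsaw").explain()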

On Tuesday, August 30, 2016, Jacek Laskowski  wrote:

> Hi,
>
> I've been playing with UDFs and why they're a blackbox for Spark's
> optimizer and started with filters to showcase the optimizations in
> play.
>
> My current understanding is that the predicate pushdowns are supported
> by the following data sources:
>
> 1. Hive tables
> 2. Parquet files
> 3. ORC files
> 4. JDBC
>
> While working on examples I came to the conclusion that predicate
> pushdown does work for the data sources mentioned above, but only for
> DataFrames. That was quite interesting since I was so much into Datasets
> as strongly type-safe data abstractions in Spark SQL.
>
> Can you help me to find the truth? Any links to videos, articles,
> commits and such to further deepen my understanding of optimizations
> in Spark SQL 2.0? I'd greatly appreciate.
>
> The following query pushes the filter down to Parquet (see
> PushedFilters attribute at the bottom)
>
> scala> cities.filter('name === "Warsaw").queryExecution.executedPlan
> res30: org.apache.spark.sql.execution.SparkPlan =
> *Project [id#196L, name#197]
> +- *Filter (isnotnull(name#197) && (name#197 = Warsaw))
>+- *FileScan parquet [id#196L,name#197] Batched: true, Format:
> ParquetFormat, InputPaths:
> file:/Users/jacek/dev/oss/spark/cities.parquet, PartitionFilters: [],
> PushedFilters: [IsNotNull(name), EqualTo(name,Warsaw)], ReadSchema:
> struct
>
> Why does this not work for Datasets? Is the function/lambda too
> complex? Are there any examples where it works for Datasets? Are we
> perhaps trading strong type-safety for optimizations like predicate
> pushdown (and the features are yet to come in the next releases of
> Spark 2)?
>
> scala> cities.as[(Long, String)].filter(_._2 ==
> "Warsaw").queryExecution.executedPlan
> res31: org.apache.spark.sql.execution.SparkPlan =
> *Filter .apply
> +- *FileScan parquet [id#196L,name#197] Batched: true, Format:
> ParquetFormat, InputPaths:
> file:/Users/jacek/dev/oss/spark/cities.parquet, PartitionFilters: [],
> PushedFilters: [], ReadSchema: struct
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>
>


Reynold on vacation next two weeks

2016-08-30 Thread Reynold Xin
A lot of people have been pinging me on github and email directly and
expect instant reply. Just FYI I'm on vacation for two weeks with limited
internet access.


3Ps for Datasets not available?! (=Parquet Predicate Pushdown)

2016-08-30 Thread Jacek Laskowski
Hi,

I've been playing with UDFs and why they're a blackbox for Spark's
optimizer and started with filters to showcase the optimizations in
play.

My current understanding is that the predicate pushdowns are supported
by the following data sources:

1. Hive tables
2. Parquet files
3. ORC files
4. JDBC

While working on examples I came to the conclusion that predicate
pushdown does work for the data sources mentioned above, but only for
DataFrames. That was quite interesting since I was so much into Datasets
as strongly type-safe data abstractions in Spark SQL.

Can you help me to find the truth? Any links to videos, articles,
commits and such to further deepen my understanding of optimizations
in Spark SQL 2.0? I'd greatly appreciate.

The following query pushes the filter down to Parquet (see
PushedFilters attribute at the bottom)

scala> cities.filter('name === "Warsaw").queryExecution.executedPlan
res30: org.apache.spark.sql.execution.SparkPlan =
*Project [id#196L, name#197]
+- *Filter (isnotnull(name#197) && (name#197 = Warsaw))
   +- *FileScan parquet [id#196L,name#197] Batched: true, Format:
ParquetFormat, InputPaths:
file:/Users/jacek/dev/oss/spark/cities.parquet, PartitionFilters: [],
PushedFilters: [IsNotNull(name), EqualTo(name,Warsaw)], ReadSchema:
struct

Why does this not work for Datasets? Is the function/lambda too
complex? Are there any examples where it works for Datasets? Are we
perhaps trading strong type-safety for optimizations like predicate
pushdown (and the features are yet to come in the next releases of
Spark 2)?

scala> cities.as[(Long, String)].filter(_._2 ==
"Warsaw").queryExecution.executedPlan
res31: org.apache.spark.sql.execution.SparkPlan =
*Filter .apply
+- *FileScan parquet [id#196L,name#197] Batched: true, Format:
ParquetFormat, InputPaths:
file:/Users/jacek/dev/oss/spark/cities.parquet, PartitionFilters: [],
PushedFilters: [], ReadSchema: struct

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Structured Streaming with Kafka sources/sinks

2016-08-30 Thread Ofir Manor
I personally find it disappointing that a big chunk of Spark's design and
development is happening behind closed curtains. It makes it harder than
necessary for me to work with Spark. In recent weeks we had to improvise a
temporary solution for reading from Kafka (from Structured Streaming) to
unblock our development, and I feel that if the design and development of
that feature had been done in the open, it would have saved us a lot of
hassle (and would reduce the refactoring of our code base).

It's hard not to compare it to other Apache projects - for example, I believe
most of the Apache Kafka full-time contributors work at a single company,
but they manage as a community to have a very transparent design and
development process, which seems to work great.

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Mon, Aug 29, 2016 at 10:39 PM, Fred Reiss  wrote:

> I think that the community really needs some feedback on the progress of
> this very important task. Many existing Spark Streaming applications can't
> be ported to Structured Streaming without Kafka support.
>
> Is there a design document somewhere?  Or can someone from the DataBricks
> team break down the existing monolithic JIRA issue into smaller steps that
> reflect the current development plan?
>
> Fred
>
>
> On Sat, Aug 27, 2016 at 2:32 PM, Koert Kuipers  wrote:
>
>> thats great
>>
>> is this effort happening anywhere that is publicly visible? github?
>>
>> On Tue, Aug 16, 2016 at 2:04 AM, Reynold Xin  wrote:
>>
>>> We (the team at Databricks) are working on one currently.
>>>
>>>
>>> On Mon, Aug 15, 2016 at 7:26 PM, Cody Koeninger 
>>> wrote:
>>>
 https://issues.apache.org/jira/browse/SPARK-15406

 I'm not working on it (yet?), never got an answer to the question of
 who was planning to work on it.

 On Mon, Aug 15, 2016 at 9:12 PM, Guo, Chenzhao 
 wrote:
 > Hi all,
 >
 >
 >
 > I’m trying to write Structured Streaming test code and will deal with
 Kafka
 > source. Currently Spark 2.0 doesn’t support Kafka sources/sinks.
 >
 >
 >
 > I found some Databricks slides saying that Kafka sources/sinks will be
 > implemented in Spark 2.0, so is there anybody working on this? And
 when will
 > it be released?
 >
 >
 >
 > Thanks,
 >
 > Chenzhao Guo

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>>
>>
>


Re: KMeans calls takeSample() twice?

2016-08-30 Thread Yanbo Liang
I ran KMeans with probes and found that takeSample() was actually called
only once. It looks like this issue was caused by a misleading display in
the Spark UI.
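
For anyone who wants to reproduce the check, a minimal sketch of one possible
probe (an assumption about how to count the jobs, not necessarily what was
used here): register a SparkListener and count job submissions whose stages
mention takeSample, then compare with what the UI reports:

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

@volatile var takeSampleJobs = 0
sc.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    // Stage names carry the call site, e.g. "takeSample at KMeans.scala:...".
    if (jobStart.stageInfos.exists(_.name.contains("takeSample"))) takeSampleJobs += 1
})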

Thanks
Yanbo

On Mon, Aug 29, 2016 at 2:34 PM, gsamaras 
wrote:

> After reading Spark's internal code for it, I wasn't able to understand
> why it calls takeSample() twice. Can someone please explain?
>
> There is a relevant StackOverflow question.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/KMeans-calls-takeSample-twice-tp18761.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>