Re: Model abstract class in spark ml

2016-08-30 Thread Mohit Jaggi
I think I figured it out. There is indeed "something deeper in Scala" :-) abstract class A { def a: this.type } class AA(i: Int) extends A { def a = this } The above works OK. But if you return anything other than "this", you will get a compile error. abstract class A { def a: this.type
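
A minimal sketch of the behaviour described above, in plain Scala with no Spark dependency. The member names `onlyOnAA` and `ThisTypeDemo` are made up for illustration and are not from the thread; the explicit `this.type` annotation on the override is added here for clarity.

```scala
abstract class A {
  // The result type is the singleton type of the receiver itself.
  def a: this.type
}

class AA(val i: Int) extends A {
  // Returning `this` is the only value that conforms to this.type.
  def a: this.type = this

  // Hypothetical member, used to show that the subclass type is preserved.
  def onlyOnAA: Int = i + 1

  // def broken: this.type = new AA(i)  // does not compile: an AA is not this.type
}

object ThisTypeDemo {
  def main(args: Array[String]): Unit = {
    val x = new AA(1)
    val y: AA = x.a        // statically an AA (in fact x.type), not just an A
    println(y.onlyOnAA)    // prints 2
    println(x.a eq x)      // prints true: the very same instance comes back
  }
}
```

Because the static type of `x.a` is `x.type`, members declared only on `AA` remain reachable after the call, which is what makes this pattern convenient for chained setters.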

Re: Model abstract class in spark ml

2016-08-30 Thread Mohit Jaggi
Thanks, Sean. I am cross-posting on dev to see why the code was written that way. Perhaps this.type doesn't do what is needed. Mohit Jaggi Founder, Data Orchard LLC www.dataorchardllc.com On Aug 30, 2016, at 2:08 PM, Sean Owen wrote: I think it's imitating, for example,

Re: How to check for No of Records per partition in Dataframe

2016-08-30 Thread vsr
Hi Saurabh, You can do the following to print the number of entries in each partition. You may need to grep the executor logs for the counts. val rdd = sc.parallelize(1 to 100, 4) rdd.foreachPartition(it => println("Record count in partition: " + it.size)) Hope this is what you are looking for.
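
An alternative sketch that brings the counts back to the driver instead of printing them in the executor logs. It uses `mapPartitionsWithIndex` rather than the `foreachPartition` call from the message; the object name `PartitionCounts`, the helper `recordsPerPartition`, and the `local[4]` master are assumptions made so the example runs on its own.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object PartitionCounts {
  // Returns (partition index, record count) pairs, collected on the driver.
  def recordsPerPartition[T](rdd: RDD[T]): Array[(Int, Int)] =
    rdd.mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size))).collect()

  def main(args: Array[String]): Unit = {
    // local[4] is an assumption for a standalone run; on a cluster, omit .master(...).
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("PartitionCounts")
      .getOrCreate()

    // Same sample data as in the message: 100 records in 4 partitions.
    val rdd = spark.sparkContext.parallelize(1 to 100, 4)
    recordsPerPartition(rdd).foreach { case (idx, n) =>
      println(s"partition $idx: $n records")
    }
    spark.stop()
  }
}
```

For a DataFrame, `df.rdd` exposes the underlying RDD, so `recordsPerPartition(df.rdd)` should work the same way. Only one small pair per partition is collected, so this stays cheap even for large data.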

Re: Mesos is now a maven module

2016-08-30 Thread Dongjoon Hyun
Thank you all for the quick fix! :D Dongjoon. On Tuesday, August 30, 2016, Michael Gummelt wrote: > https://github.com/apache/spark/pull/14885 > > Thanks > > On Tue, Aug 30, 2016 at 11:36 AM, Marcelo Vanzin

Re: Mesos is now a maven module

2016-08-30 Thread Michael Gummelt
https://github.com/apache/spark/pull/14885 Thanks On Tue, Aug 30, 2016 at 11:36 AM, Marcelo Vanzin wrote: > On Tue, Aug 30, 2016 at 11:32 AM, Sean Owen wrote: > > Ah, I helped miss that. We don't enable -Pyarn for YARN because it's > > already always

Re: What are the names of the network protocols used between Spark Driver, Master and Workers?

2016-08-30 Thread kant kodali
OK, I will answer my own question. It looks like Netty-based RPC. On Mon, Aug 29, 2016 9:22 PM, kant kodali kanth...@gmail.com wrote: What are the names of the network protocols used between Spark Driver, Master and Workers?

Re: Spark Kerberos proxy user

2016-08-30 Thread Michael Gummelt
Here's one: https://issues.apache.org/jira/browse/SPARK-16742 On Tue, Aug 30, 2016 at 3:02 AM, Abel Rincón wrote: > Hi again, > > Is there any open issue related? > > Nowadays, we (stratio) have a end to end running, with a spark > distribution based in 1.6.2. > > Work in

Re: Mesos is now a maven module

2016-08-30 Thread Marcelo Vanzin
On Tue, Aug 30, 2016 at 11:32 AM, Sean Owen wrote: > Ah, I helped miss that. We don't enable -Pyarn for YARN because it's > already always set? I wonder if it makes sense to make that optional > in order to speed up builds, or, maybe I'm missing a reason it's > always

Re: Mesos is now a maven module

2016-08-30 Thread Sean Owen
Ah, I helped miss that. We don't enable -Pyarn for YARN because it's already always set? I wonder if it makes sense to make that optional in order to speed up builds, or, maybe I'm missing a reason it's always essential. I think it's not setting -Pmesos indeed because no Mesos code was changed

Re: Performance of loading parquet files into case classes in Spark

2016-08-30 Thread Steve Loughran
On 29 Aug 2016, at 20:58, Julien Dumazert wrote: Hi Maciek, I followed your recommendation and benchmarked DataFrame aggregations on Dataset. Here is what I got: // Dataset with RDD-style code // 34.223s

Re: KMeans calls takeSample() twice?

2016-08-30 Thread Georgios Samaras
Good catch, Shivaram. However, the very next line states: // this shouldn't happen often because we use a big multiplier for the initial size which makes me wonder whether that is really the case, since I am experimenting heavily right now and I launched 30~40 jobs, and from a glance at them I can

Re: KMeans calls takeSample() twice?

2016-08-30 Thread Shivaram Venkataraman
I think takeSample itself runs multiple jobs if the number of samples collected in the first pass is not enough. The comment and code path at https://github.com/apache/spark/blob/412b0e8969215411b97efd3d0984dc6cac5d31e0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L508 should explain when
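
The "sample again if the first pass falls short" idea can be sketched in plain Scala over a local collection. This is not Spark's implementation: `takeSampleSketch`, the 1.5 oversampling factor, and the returned pass count are all assumptions chosen for illustration.

```scala
import scala.util.Random

object TakeSampleSketch {
  // Hypothetical helper, NOT Spark's code: Bernoulli-sample `data` with an
  // oversampled fraction and repeat the pass until at least `num` items are drawn.
  // Returns the sample together with the number of passes that were needed.
  def takeSampleSketch[T](data: Seq[T], num: Int, seed: Long): (Seq[T], Int) = {
    require(num >= 0 && num <= data.size, "num must be between 0 and data.size")
    val rng = new Random(seed)
    // Oversample so that one pass is usually enough; 1.5 is an arbitrary factor.
    val fraction = math.min(1.0, 1.5 * num / math.max(data.size, 1))
    var passes = 1
    var sample = data.filter(_ => rng.nextDouble() < fraction)
    while (sample.size < num) {
      // The previous pass came up short, so run another one.
      passes += 1
      sample = data.filter(_ => rng.nextDouble() < fraction)
    }
    // Trim the oversampled result down to exactly `num` items.
    (rng.shuffle(sample).take(num), passes)
  }

  def main(args: Array[String]): Unit = {
    val (sample, passes) = takeSampleSketch(1 to 1000, 10, seed = 42L)
    println(s"drew ${sample.size} items in $passes pass(es)")
  }
}
```

In this sketch each extra pass over the data would correspond to an extra job on a cluster, which is why a larger oversampling factor makes a repeat less likely.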

Re: Mesos is now a maven module

2016-08-30 Thread Dongjoon Hyun
Thank you for confirming, Sean and Marcelo! Bests, Dongjoon. On Tue, Aug 30, 2016 at 10:05 AM, Marcelo Vanzin wrote: > A quick look shows that maybe dev/sparktestsupport/modules.py needs to > be modified, and a "build_profile_flags" added to the mesos section > (similar to

Re: Mesos is now a maven module

2016-08-30 Thread Marcelo Vanzin
A quick look shows that maybe dev/sparktestsupport/modules.py needs to be modified, and a "build_profile_flags" added to the mesos section (similar to hive / hive-thriftserver). Note not all PR builds will trigger mesos currently, since it's listed as an independent module in the above file. On

Re: Mesos is now a maven module

2016-08-30 Thread Sean Owen
I have the heady power to modify Jenkins jobs now, so I will carefully take a look at them and see if any of the config needs -Pmesos. But yeah I thought this should be baked into the script. On Tue, Aug 30, 2016 at 5:56 PM, Dongjoon Hyun wrote: > Hi, Michael. > > It's a

Re: Mesos is now a maven module

2016-08-30 Thread Marcelo Vanzin
Michael added the profile to the build scripts, but maybe some script or code path was missed... On Tue, Aug 30, 2016 at 9:56 AM, Dongjoon Hyun wrote: > Hi, Michael. > > It's a great news! > > BTW, I'm wondering if the Jenkins (SparkPullRequestBuilder) knows this new >

Re: Mesos is now a maven module

2016-08-30 Thread Dongjoon Hyun
Hi, Michael. It's great news! BTW, I'm wondering if the Jenkins (SparkPullRequestBuilder) knows this new profile, -Pmesos. The PR passed with the following Jenkins build arguments without the `-Pmesos` option. (at the last test) ``` [info] Building Spark (w/Hive 1.2.1) using SBT with these

Re: KMeans calls takeSample() twice?

2016-08-30 Thread gsamaras
I am not sure what you want me to check. Note that I see two takeSample()s being invoked every single time I execute KMeans(). In a current job I have, I did view the details and updated the StackOverflow question.

Fwd: KMeans calls takeSample() twice?

2016-08-30 Thread Georgios Samaras
-- Forwarded message -- From: Georgios Samaras Date: Tue, Aug 30, 2016 at 9:49 AM Subject: Re: KMeans calls takeSample() twice? To: "Sean Owen [via Apache Spark Developers List]" < ml-node+s1001551n18788...@n3.nabble.com> I am not sure what you want

Re: Broadcast Variable Life Cycle

2016-08-30 Thread Sean Owen
Yeah, after destroy, accessing the broadcast variable results in an error. Accessing it after it's unpersisted (on an executor) causes it to be rebroadcast. On Tue, Aug 30, 2016 at 5:12 PM, Jerry Lam wrote: > Hi Sean, > > Thank you for sharing the knowledge between
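
A small sketch of the two behaviours described in this message, assuming a local Spark session. The object name `BroadcastLifecycle`, the helper `run`, and the sample map are made up for illustration; `unpersist()`, `destroy()` and `value` are the calls on `Broadcast` that the thread is discussing.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import scala.util.Try

object BroadcastLifecycle {
  // Returns: results before unpersist, results after unpersist,
  // and whether reading the value after destroy failed.
  def run(sc: SparkContext): (Seq[String], Seq[String], Boolean) = {
    val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))
    val rdd = sc.parallelize(Seq(1, 2, 3))

    val before = rdd.map(x => lookup.value.getOrElse(x, "?")).collect().toSeq

    // unpersist drops the cached copies on executors; the next use re-sends the data.
    lookup.unpersist()
    val afterUnpersist = rdd.map(x => lookup.value.getOrElse(x, "?")).collect().toSeq

    // destroy removes all state for the variable; any later access is an error.
    lookup.destroy()
    val failedAfterDestroy = Try(lookup.value).isFailure

    (before, afterUnpersist, failedAfterDestroy)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder().master("local[2]").appName("BroadcastLifecycle").getOrCreate()
    val (before, afterUnpersist, failed) = run(spark.sparkContext)
    println(s"before=$before afterUnpersist=$afterUnpersist failedAfterDestroy=$failed")
    spark.stop()
  }
}
```

In local mode the driver and the executor share one JVM, so the rebroadcast after `unpersist()` is not observable here; the sketch only shows that the variable remains usable after `unpersist()` and not after `destroy()`.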

Re: KMeans calls takeSample() twice?

2016-08-30 Thread gsamaras
Yanbo thank you for your reply. So you are saying that this is a bug in the Spark UI in general, and not in the local Spark UI of our cluster, where I work, right? George On Mon, Aug 29, 2016 at 11:55 PM, Yanbo Liang-2 [via Apache Spark Developers List]

Re: Broadcast Variable Life Cycle

2016-08-30 Thread Jerry Lam
Hi Sean, Thank you for explaining the difference between unpersist and destroy. Does that mean unpersist keeps the broadcast variable in the driver whereas destroy will delete everything about the broadcast variable as if it had never existed? Best Regards, Jerry On Tue, Aug 30, 2016 at 11:58 AM,

Re: Structured Streaming with Kafka sources/sinks

2016-08-30 Thread Reynold Xin
In this case simply not much progress has been made, because people might be busy with other stuff. Ofir, it looks like you have spent a non-trivial amount of time thinking about this topic and have even designed something to work -- can you chime in on the JIRA ticket with your thoughts and your

Re: Broadcast Variable Life Cycle

2016-08-30 Thread Sean Owen
Yes, although there's a difference between unpersist and destroy, you'll hit the same type of question either way. You do indeed have to reason about when you know the broadcast variable is no longer needed in the face of lazy evaluation, and that's hard. Sometimes it's obvious and you can take

Re: Structured Streaming with Kafka sources/sinks

2016-08-30 Thread Nicholas Chammas
> I personally find it disappointing that a big chuck of Spark's design and development is happening behind closed curtains. I'm not too familiar with Streaming, but I see design docs and proposals for ML and SQL published here and on JIRA all the time, and they are discussed extensively. For

Re: Broadcast Variable Life Cycle

2016-08-30 Thread Jerry Lam
Hi Sean, Thank you for the response. The only problem is that actively managing broadcast variables requires returning the broadcast variables to the caller if the function that creates the broadcast variables does not contain any action. That is, the scope that uses the broadcast variables cannot

Re: Reynold on vacation next two weeks

2016-08-30 Thread Jacek Laskowski
Hi, Definitely well deserved. Don't check your emails for the 2 weeks. Not even for a minute :-) Jacek On 30 Aug 2016 10:21 a.m., "Reynold Xin" wrote: > A lot of people have been pinging me on github and email directly and > expect instant reply. Just FYI I'm on vacation

Re: Structured Streaming with Kafka sources/sinks

2016-08-30 Thread Cody Koeninger
Not that I wouldn't rather have more open communication around this issue...but what are people actually expecting to get out of structured streaming with regard to Kafka? There aren't any realistic pushdown-type optimizations available, and from what I could tell the last time I looked at

ApacheCon Seville CFP closes September 9th

2016-08-30 Thread Rich Bowen
It's traditional. We wait for the last minute to get our talk proposals in for conferences. Well, the last minute has arrived. The CFP for ApacheCon Seville closes on September 9th, which is less than 2 weeks away. It's time to get your talks in, so that we can make this the best ApacheCon yet.

Spark 2.0.1 fails for provided hadoop

2016-08-30 Thread Rishi Mishra
Hi All, I tried to configure my Spark with a MapR Hadoop cluster. For that I built Spark 2.0 from source with the hadoop-provided option. Then, as per the documentation, I set my Hadoop libraries in spark-env.sh. However, I get an error while SessionCatalog is getting created. Please refer below for exception

Re: Spark Kerberos proxy user

2016-08-30 Thread Abel Rincón
Hi again, Is there any related open issue? Currently, we (Stratio) have an end-to-end setup running, with a Spark distribution based on 1.6.2. Work in progress: - Create and share our solution documentation. - Test Suite for all the stuff. - Rebase our code with apache-master branch. Regards,

Re: 3Ps for Datasets not available?! (=Parquet Predicate Pushdown)

2016-08-30 Thread Jacek Laskowski
Hi Reynold, That's what I was told a few times already (most notably by Adam on Twitter), but I couldn't understand what it meant exactly. It's only now that I understand what you're saying, Reynold :) Does this put DataFrame's Column-based or SQL-based queries usually faster than Datasets with

Re: 3Ps for Datasets not available?! (=Parquet Predicate Pushdown)

2016-08-30 Thread Reynold Xin
The UDF is a black box, so Spark can't know what it is dealing with. There are simple cases in which we can analyze the UDF's bytecode and infer what it is doing, but it is pretty difficult to do in general. On Tuesday, August 30, 2016, Jacek Laskowski wrote: > Hi, > > I've been
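
One way to see the difference is to compare the physical plans of the same predicate written as a Column expression and as a UDF. This is a sketch under assumptions: the object name `PushdownDemo`, the helper `build`, the temporary Parquet path, and the threshold 90 are all made up for illustration, and the exact plan text depends on the Spark version.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object PushdownDemo {
  // Builds the same filter twice over a Parquet file at `path`:
  // once as a Column expression, once as an opaque Scala UDF.
  def build(spark: SparkSession, path: String): (DataFrame, DataFrame) = {
    spark.range(0, 100).write.parquet(path)   // single column named "id"
    val df = spark.read.parquet(path)

    val withColumnExpr = df.filter(col("id") > 90)   // visible to the optimizer
    val isLarge = udf((id: Long) => id > 90)         // opaque function
    val withUdf = df.filter(isLarge(col("id")))
    (withColumnExpr, withUdf)
  }

  def main(args: Array[String]): Unit = {
    // local[2] and the temp directory are assumptions for a standalone run.
    val spark = SparkSession.builder().master("local[2]").appName("PushdownDemo").getOrCreate()
    val path = Files.createTempDirectory("pushdown-demo").resolve("ids.parquet").toString
    val (withColumnExpr, withUdf) = build(spark, path)

    withColumnExpr.explain()   // look for the predicate in the scan's pushed filters
    withUdf.explain()          // the UDF should appear only as a Filter above the scan
    spark.stop()
  }
}
```

Both queries return the same rows; the expectation is only that the first plan can hand the predicate to the Parquet reader while the second cannot, because Spark has no description of what the UDF computes.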

Reynold on vacation next two weeks

2016-08-30 Thread Reynold Xin
A lot of people have been pinging me on GitHub and email directly, expecting an instant reply. Just FYI, I'm on vacation for two weeks with limited internet access.

3Ps for Datasets not available?! (=Parquet Predicate Pushdown)

2016-08-30 Thread Jacek Laskowski
Hi, I've been playing with UDFs and why they're a black box for Spark's optimizer and started with filters to showcase the optimizations in play. My current understanding is that the predicate pushdowns are supported by the following data sources: 1. Hive tables 2. Parquet files 3. ORC files 4.

Re: Structured Streaming with Kafka sources/sinks

2016-08-30 Thread Ofir Manor
I personally find it disappointing that a big chunk of Spark's design and development is happening behind closed curtains. It makes it harder than necessary for me to work with Spark. In recent weeks we had to improvise a temporary solution for reading from Kafka (from Structured Streaming) to

Re: KMeans calls takeSample() twice?

2016-08-30 Thread Yanbo Liang
I ran KMeans with probes and found that takeSample() was actually called only once. It looks like this issue was caused by a display mistake in the Spark UI. Thanks Yanbo On Mon, Aug 29, 2016 at 2:34 PM, gsamaras wrote: > After reading the internal code of Spark about it,