Re: [VOTE] SPIP ML Pipelines in R

2018-06-01 Thread Hossein
Hi Shivaram, We converged on a CRAN release process that seems identical to current SparkR. --Hossein On Thu, May 31, 2018 at 9:10 AM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > Hossein -- Can you clarify what the resolution on the repository / > release issue discussed on

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Nicholas Chammas
pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3 didn’t work for me either (even building with -Phadoop-2.7). I guess I’ve been relying on an unsupported pattern and will need to figure something else out going forward in order to use s3a://. On Fri, Jun 1, 2018 at 9:09 PM Marcelo Vanzin

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Marcelo Vanzin
I have personally never tried to include hadoop-aws that way. But at the very least, I'd try to use the same version of Hadoop as the Spark build (2.7.3 IIRC). I don't really expect a different version to work, and if it did in the past it definitely was not by design. On Fri, Jun 1, 2018 at 5:50
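The version-matching advice above can be sketched as a small helper that derives the hadoop-aws Maven coordinate from the Hadoop version of the Spark build, rather than hard-coding a different one. This is only an illustration: the function names are hypothetical, and 2.7.3 is the version recalled ("IIRC") in the message, not a verified value. Matching versions is a precondition, not a guarantee; elsewhere in this thread the 2.7.3 coordinate is reported as not working either.

```python
import re

# Maven group and artifact for the S3A connector, as used in the thread's
# `pyspark --packages org.apache.hadoop:hadoop-aws:<version>` commands.
HADOOP_AWS_GROUP = "org.apache.hadoop"
HADOOP_AWS_ARTIFACT = "hadoop-aws"


def hadoop_aws_coordinate(hadoop_version):
    """Return the hadoop-aws coordinate for the given Hadoop version.

    Hypothetical helper: it only formats the string and does not check
    that the artifact exists or that it is compatible with a Spark build.
    """
    if not re.fullmatch(r"\d+\.\d+\.\d+", hadoop_version):
        raise ValueError("expected a version like '2.7.3', got %r" % hadoop_version)
    return "%s:%s:%s" % (HADOOP_AWS_GROUP, HADOOP_AWS_ARTIFACT, hadoop_version)


def pyspark_command(hadoop_version):
    """Build the argument list for launching pyspark with hadoop-aws."""
    return ["pyspark", "--packages", hadoop_aws_coordinate(hadoop_version)]


if __name__ == "__main__":
    # Assumption: the Spark build uses Hadoop 2.7.3, as recalled in the message.
    print(" ".join(pyspark_command("2.7.3")))
```

The Hadoop version is passed in as a plain string here; how to discover it for a given Spark build is left out, since the thread does not say.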

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Reynold Xin
+1 On Fri, Jun 1, 2018 at 3:29 PM Marcelo Vanzin wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.3.1. > > Given that I expect at least a few people to be busy with Spark Summit next > week, I'm taking the liberty of setting an extended voting period. The

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Nicholas Chammas
Building with -Phadoop-2.7 didn’t help, and if I remember correctly, building with -Phadoop-2.8 worked with hadoop-aws in the 2.3.0 release, so it appears something has changed since then. I wasn’t familiar with -Phadoop-cloud, but I can try that. My goal here is simply to confirm that this

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Marcelo Vanzin
Using the hadoop-aws package is probably going to be a little more complicated than that. The best bet is to use a custom build of Spark that includes it (use -Phadoop-cloud). Otherwise you're probably looking at some nasty dependency issues, especially if you end up mixing different versions of

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Mark Hamstra
There is no hadoop-2.8 profile. Use hadoop-2.7, which is effectively hadoop-2.7+. On Fri, Jun 1, 2018 at 4:01 PM Nicholas Chammas wrote: > I was able to successfully launch a Spark cluster on EC2 at 2.3.1 RC4 > using Flintrock. However, trying > to load

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Nicholas Chammas
I was able to successfully launch a Spark cluster on EC2 at 2.3.1 RC4 using Flintrock. However, trying to load the hadoop-aws package gave me some errors. $ pyspark --packages org.apache.hadoop:hadoop-aws:2.8.4 :: problems summary :: WARNINGS

Re: [VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-01 Thread Xiao Li
+1 2018-06-01 15:41 GMT-07:00 Xingbo Jiang : > +1 > > 2018-06-01 9:21 GMT-07:00 Xiangrui Meng : > >> Hi all, >> >> I want to call for a vote of SPARK-24374 >> . It introduces a new >> execution mode to Spark, which would help both integration

Re: [VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-01 Thread Xingbo Jiang
+1 2018-06-01 9:21 GMT-07:00 Xiangrui Meng : > Hi all, > > I want to call for a vote of SPARK-24374 > . It introduces a new > execution mode to Spark, which would help both integration with external > DL/AI frameworks and MLlib algorithm

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Marcelo Vanzin
Starting with my own +1 (binding). On Fri, Jun 1, 2018 at 3:28 PM, Marcelo Vanzin wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.3.1. > > Given that I expect at least a few people to be busy with Spark Summit next > week, I'm taking the liberty of setting

[VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Marcelo Vanzin
Please vote on releasing the following candidate as Apache Spark version 2.3.1. Given that I expect at least a few people to be busy with Spark Summit next week, I'm taking the liberty of setting an extended voting period. The vote will be open until Friday, June 8th, at 19:00 UTC (that's 12:00

Re: [VOTE] Spark 2.3.1 (RC3)

2018-06-01 Thread Xiao Li
Sure, I will send the RM an email the next time I make a release-blocker change. Xiao 2018-06-01 13:32 GMT-07:00 Reynold Xin: > Yes everybody please cc the release manager on changes that merit -1. It's > high overhead and let's make this smoother. > > > On Fri, Jun 1, 2018 at 1:28 PM Marcelo

Re: [VOTE] Spark 2.3.1 (RC3)

2018-06-01 Thread Reynold Xin
Yes everybody please cc the release manager on changes that merit -1. It's high overhead and let's make this smoother. On Fri, Jun 1, 2018 at 1:28 PM Marcelo Vanzin wrote: > Xiao, > > This is the third time in this release cycle that this is happening. > Sorry to single out you guys, but can

Re: [VOTE] Spark 2.3.1 (RC3)

2018-06-01 Thread Marcelo Vanzin
Xiao, This is the third time in this release cycle that this is happening. Sorry to single out you guys, but can you please do two things: - do not merge things in 2.3 you're not absolutely sure about - make sure that things you backport to 2.3 are not causing problems - let the RM know about

Re: [VOTE] Spark 2.3.1 (RC3)

2018-06-01 Thread Xiao Li
Based on the plan, the changes in that PR added an extra Aggregate and Expand for common queries such as: SELECT sum(DISTINCT x), avg(DISTINCT x) FROM tab. Both Aggregate and Expand are expensive operators. 2018-06-01 13:24 GMT-07:00 Sean Owen: > Hm, that was merged two days ago, and you decided to

Re: [VOTE] Spark 2.3.1 (RC3)

2018-06-01 Thread Sean Owen
Hm, that was merged two days ago, and you decided to revert it 2 hours ago. It sounds like this was maybe risky to put into 2.3.x during the RC phase, at least. You also don't seem certain whether there's a performance problem; how sure are you? These may all have been the right thing to do

Re: [VOTE] Spark 2.3.1 (RC3)

2018-06-01 Thread Xiao Li
Sorry, I need to say -1. This morning, I found a regression in 2.3.1 and reverted https://github.com/apache/spark/pull/21443. Xiao 2018-06-01 13:09 GMT-07:00 Marcelo Vanzin: > Please vote on releasing the following candidate as Apache Spark version > 2.3.1. > > Given that I expect at least a

[VOTE] Spark 2.3.1 (RC3)

2018-06-01 Thread Marcelo Vanzin
Please vote on releasing the following candidate as Apache Spark version 2.3.1. Given that I expect at least a few people to be busy with Spark Summit next week, I'm taking the liberty of setting an extended voting period. The vote will be open until Friday, June 8th, at 19:00 UTC (that's 12:00

Re: [VOTE] SPIP ML Pipelines in R

2018-06-01 Thread Xiangrui Meng
+1. On Thu, May 31, 2018 at 2:28 PM Joseph Bradley wrote: > Hossein might be slow to respond (OOO), but I just commented on the JIRA. > I'd recommend we follow the same process as the SparkR package. > > +1 on this from me (and I'll be happy to help shepherd it, though Felix > and Shivaram are

[VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-01 Thread Xiangrui Meng
Hi all, I want to call for a vote of SPARK-24374. It introduces a new execution mode to Spark, which would help both integration with external DL/AI frameworks and MLlib algorithm performance. This is one of the follow-ups from a previous

Re: Revisiting Online serving of Spark models?

2018-06-01 Thread Saikat Kanjilal
@Chris This sounds fantastic, please send summary notes for Seattle folks @Felix I work in downtown Seattle, am wondering if we should a tech meetup around model serving in spark at my work or elsewhere close, thoughts? I’m actually in the midst of building microservices to manage models and