Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Reynold Xin
+1
On Wednesday, July 20, 2016, Krishna Sankar wrote:
> +1 (non-binding, of course)
>
> 1. Compiled OS X 10.11.5 (El Capitan) OK Total time: 24:07 min
>    mvn clean package -Pyarn -Phadoop-2.7 -DskipTests
> 2. Tested pyspark, mllib (iPython 4.0)
>    2.0 Spark version is

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Krishna Sankar
+1 (non-binding, of course)
1. Compiled OS X 10.11.5 (El Capitan) OK Total time: 24:07 min
   mvn clean package -Pyarn -Phadoop-2.7 -DskipTests
2. Tested pyspark, mllib (iPython 4.0)
   2.0 Spark version is 2.0.0
   2.1. statistics (min,max,mean,Pearson,Spearman) OK
   2.2. Linear/Ridge/Lasso Regression

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Joseph Gonzalez
+1

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Michael Allman
I've run some tests on real and synthetic parquet data with nested columns, with and without the Hive metastore, on our Spark 1.5, 1.6 and 2.0 versions. I haven't seen any performance surprises, except that Spark 2.0 now does schema inference across all files in a

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Maciej Bryński
@Michael, I answered in Jira and can repeat it here. I think my problem is unrelated to Hive, because I'm using the read.parquet method. I also attached some VisualVM snapshots to SPARK-16321 (I think I should merge both issues). Code profiling suggests a bottleneck when reading the parquet file. I

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Marcin Tustin
I refer to Maciej Bryński's (mac...@brynski.pl) emails of 29 and 30 June 2016 to this list. He said that his benchmarking suggested that Spark 2.0 was slower than 1.6. I'm wondering whether that was ever investigated and, if so, whether the speed is back up. On Wed, Jul 20, 2016 at 12:18 PM,

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Michael Allman
Marcin, I'm not sure what you're referring to. Can you be more specific?
Cheers, Michael
> On Jul 20, 2016, at 9:10 AM, Marcin Tustin wrote:
>
> Whatever happened with the query regarding benchmarks? Is that resolved?
>
> On Tue, Jul 19, 2016 at 10:35 PM, Reynold Xin

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Marcin Tustin
Whatever happened with the query regarding benchmarks? Is that resolved?
On Tue, Jul 19, 2016 at 10:35 PM, Reynold Xin wrote:
> Please vote on releasing the following candidate as Apache Spark version 2.0.0. The vote is open until Friday, July 22, 2016 at 20:00 PDT and

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Shivaram Venkataraman
+1 SHA and MD5 sums match for all binaries. Docs look fine this time around. Built and ran `dev/run-tests` with Java 7 on a Linux machine. No blocker bugs on JIRA, and the only critical bug targeting 2.0.0 is SPARK-16633, which doesn't look like a release blocker. I also checked issues which
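The checksum comparison mentioned in the message above could be scripted roughly as follows. This is only a sketch under stated assumptions, not the procedure the voter used: the helper names `file_digest` and `digest_matches` are hypothetical, the message does not say which SHA variant was checked (so the algorithm is a parameter), and the published digest is assumed to be a plain hex string, possibly upper-case or grouped with spaces.

```python
import hashlib


def file_digest(path, algorithm, chunk_size=1 << 20):
    """Return the hex digest of the file at `path`.

    `algorithm` is any name accepted by hashlib.new, e.g. "md5" or "sha512".
    The file is read in chunks so large release binaries fit in memory.
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def digest_matches(path, algorithm, published_hex):
    """Compare a file's digest with a published hex string.

    Assumption: the published value may be upper-case and may contain
    whitespace between groups of hex digits, so both are normalized away.
    """
    expected = "".join(published_hex.split()).lower()
    return file_digest(path, algorithm) == expected
```

A call such as `digest_matches("release.tgz", "md5", published_value)` (both arguments here are placeholders) returns `True` only when the locally computed digest equals the published one.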

Snappy initialization issue, spark assembly jar missing snappy classes?

2016-07-20 Thread Eugene Morozov
Greetings! We're reading input files with newAPIHadoopFile, configured with a multiline split. Everything's fine, apart from https://issues.apache.org/jira/browse/MAPREDUCE-6549. It looks like the issue is fixed, but in Hadoop 2.7.2, which means we have to download Spark without Hadoop and