branch-1.2 has been cut

2014-11-03 Thread Patrick Wendell
Hi All, I've just cut the release branch for Spark 1.2, consistent with the end of the scheduled feature window for the release. New commits to master will need to be explicitly merged into branch-1.2 in order to be in the release. This begins the transition into a QA period for Spark 1.2, with

Re: sbt scala compiler crashes on spark-sql

2014-11-03 Thread Imran Rashid
Thanks everyone, that worked. I had been cleaning just the sql project, which wasn't enough, but after a full clean of everything it's happy now. Just in case this helps anybody else come up with steps to reproduce, for me the error was always in DataTypeConversions.scala, and I think it *might*

Re: branch-1.2 has been cut

2014-11-03 Thread Nicholas Chammas
Minor question, but when would be the right time to update the default Spark version https://github.com/apache/spark/blob/76386e1a23c55a58c0aeea67820aab2bac71b24b/ec2/spark_ec2.py#L42 in the EC2 script? On Mon, Nov 3, 2014 at 3:55 AM, Patrick Wendell pwend...@gmail.com wrote: Hi All, I've

Re: Surprising Spark SQL benchmark

2014-11-03 Thread ozgun
Hey Patrick, It's Ozgun from Citus Data. We'd like to make these benchmark results fair, and have tried different config settings for SparkSQL over the past month. We picked the best config settings we could find, and also contacted the Spark users list about running TPC-H numbers.

Re: matrix factorization cross validation

2014-11-03 Thread Debasish Das
I added the driver for precisionAt(k: Int) for the MovieLens test cases, although I am a bit confused by the precisionAt(k: Int) code in RankingMetrics.scala. While cross-validating, I am really not sure how to set k: if (labSet.nonEmpty) { val n = math.min(pred.length, k) ... } If I
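
For readers unfamiliar with the metric, here is a minimal sketch of precision at k in plain Python. It mirrors the quoted fragment (n = min(pred.length, k)), but the elided part of that fragment is not shown in the message, so dividing the hit count by k (rather than by n) is an assumption here, and the names precision_at_k and mean_precision_at_k are hypothetical rather than part of any Spark API.

```python
def precision_at_k(predicted, relevant, k):
    """Precision at k for one user.

    predicted: ranked list of item ids, best first.
    relevant:  iterable of ground-truth item ids.
    Assumption: hits are divided by k, even when fewer than k
    predictions are available.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    relevant_set = set(relevant)
    if not relevant_set:
        # No ground truth for this user: score it as 0.0.
        return 0.0
    # Only the first min(len(predicted), k) predictions are inspected.
    n = min(len(predicted), k)
    hits = sum(1 for item in predicted[:n] if item in relevant_set)
    return hits / k


def mean_precision_at_k(pairs, k):
    """Average precision at k over (predicted, relevant) pairs."""
    pairs = list(pairs)
    return sum(precision_at_k(p, r, k) for p, r in pairs) / len(pairs)
```

Under this assumption, a user with fewer than k predictions can never reach a precision of 1.0, which is one reason the choice of k matters during cross validation.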

MatrixFactorizationModel predict(Int, Int) API

2014-11-03 Thread Debasish Das
Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int), but the code fails on userFeatures.lookup(user).head. In computeRmse, MatrixFactorizationModel.predict(RDD[(Int, Int)]) is called, and all the test cases use that API. I can perhaps refactor my code to

Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matt Cheah
Hi everyone, I'm running into more and more cases where too many files are opened when spark.shuffle.consolidateFiles is turned off. I was wondering if this is a common scenario among the rest of the community, and if so, if it is worth considering the setting to be turned on by default. From

Spark shuffle consolidateFiles performance degradation quantification

2014-11-03 Thread Matt Cheah
Hi everyone, I'm running into more and more cases where too many files are opened when spark.shuffle.consolidateFiles is turned off. I was wondering if this is a common scenario among the rest of the community, and if so, if it is worth considering the setting to be turned on by default. From

Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matei Zaharia
In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have better performance while creating fewer files. So I'd suggest trying that too. Matei On Nov 3, 2014, at 6:12 PM, Andrew Or and...@databricks.com wrote: Hey Matt, There's some prior work that compares
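
As a rough, back-of-envelope illustration of why the shuffle settings discussed in this thread affect file counts, the sketch below compares three cases. It is a simplified model, not a measurement: it assumes the hash-based shuffle writes one file per map task per reduce partition, that consolidation reduces this to one file per concurrent task slot per reduce partition, and that the sort-based shuffle writes roughly one data file per map task. The function name and parameters are hypothetical.

```python
def shuffle_file_counts(map_tasks, reduce_partitions, total_cores):
    """Approximate number of shuffle files written across the cluster.

    Simplified model (assumptions, not measured values):
      - hash shuffle, no consolidation: one file per (map task, reducer)
      - hash shuffle, consolidateFiles: one file per (core, reducer)
      - sort shuffle: about one data file per map task
    """
    return {
        "hash": map_tasks * reduce_partitions,
        "hash_consolidated": total_cores * reduce_partitions,
        "sort": map_tasks,
    }


if __name__ == "__main__":
    # Hypothetical job: 1000 map tasks, 200 reducers, 80 cores in total.
    for name, count in shuffle_file_counts(1000, 200, 80).items():
        print(f"{name}: {count}")
```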

Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matei Zaharia
(BTW this had a bug with negative hash codes in 1.1.0 so you should try branch-1.1 for it). Matei On Nov 3, 2014, at 6:28 PM, Matei Zaharia matei.zaha...@gmail.com wrote: In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have better performance while creating fewer

Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Zach Fry
Hey Andrew, Matei, Thanks for responding. For some more context, we were running into "Too many open files" issues where we were seeing this happen immediately after the Collect phase (about 30 seconds into a run) on a decently sized dataset (14 MM rows). The ulimit set in the spark-env was

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-03 Thread Xiangrui Meng
Was the user present in training? We can put a check there and return NaN if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code
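
The check suggested above can be sketched in plain Python, with dictionaries standing in for the model's feature lookups. This illustrates the idea only and is not the MLlib implementation: the function name, the dictionary-based storage, and the extra check on the product side are all assumptions made for this example.

```python
import math


def predict(user_features, product_features, user, product):
    """Dot product of the user and product factor vectors.

    Returns NaN when the user (or, as an added assumption, the product)
    was not seen in training, instead of failing on a missing lookup.
    """
    u = user_features.get(user)
    p = product_features.get(product)
    if u is None or p is None:
        return math.nan
    return sum(a * b for a, b in zip(u, p))


if __name__ == "__main__":
    users = {1: [1.0, 2.0]}
    products = {7: [3.0, 4.0]}
    print(predict(users, products, 1, 7))  # known user and product
    print(predict(users, products, 2, 7))  # unknown user yields NaN
```

Returning NaN keeps a single-pair prediction from throwing, but callers then need to filter NaN values before computing aggregate metrics such as RMSE.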