branch-1.2 has been cut
Hi All, I've just cut the release branch for Spark 1.2, consistent with the end of the scheduled feature window for the release. New commits to master will need to be explicitly merged into branch-1.2 in order to be in the release. This begins the transition into a QA period for Spark 1.2, with a focus on testing and fixes. A few smaller features may still go in as folks wrap up loose ends in the next 48 hours (or for developments in alpha components). To help with QA, I'll try to package up a SNAPSHOT release soon for community testing; this worked well when testing Spark 1.1 before official votes started. I might give it a few days to allow committers to merge in back-logged fixes and other patches that were punted until after the feature freeze. Thanks to everyone who helped author and review patches over the last few weeks! - Patrick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: sbt scala compiler crashes on spark-sql
Thanks everyone, that worked. I had just been cleaning the sql project, which wasn't enough, but after a full clean of everything it's happy now. Just in case this helps anybody else come up with steps to reproduce: for me the error was always in DataTypeConversions.scala, and I think it *might* have started after I did a Maven build as well.
Re: branch-1.2 has been cut
Minor question, but when would be the right time to update the default Spark version https://github.com/apache/spark/blob/76386e1a23c55a58c0aeea67820aab2bac71b24b/ec2/spark_ec2.py#L42 in the EC2 script? On Mon, Nov 3, 2014 at 3:55 AM, Patrick Wendell pwend...@gmail.com wrote: Hi All, I've just cut the release branch for Spark 1.2, consistent with the end of the scheduled feature window for the release.
Re: Surprising Spark SQL benchmark
Hey Patrick, It's Ozgun from Citus Data. We'd like to make these benchmark results fair, and have tried different config settings for SparkSQL over the past month. We picked the best config settings we could find, and also contacted the Spark users list about running TPC-H numbers. http://goo.gl/IU5Hw0 http://goo.gl/WQ1kML http://goo.gl/ihLzgh We also received advice at the Spark Summit '14 to wait until v1.1, and therefore re-ran our tests on SparkSQL 1.1. On the specific optimizations, Marco and Samay from our team have much more context, and I'll let them answer your questions on the different settings we tried. Our intent is to be fair and not misrepresent SparkSQL's performance. On that front, we used publicly available documentation and user lists, and spent about a month trying to get the best Spark performance results. If there are specific optimizations we should have applied and missed, we'd love to be involved with the community in re-running the numbers. Is this email thread the best place to continue the conversation? Best, Ozgun -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Surprising-Spark-SQL-benchmark-tp9041p9073.html
Re: matrix factorization cross validation
I added the drivers for precisionAt(k: Int) for the movielens test cases... although I am a bit confused by the precisionAt(k: Int) code from RankingMetrics.scala. While cross validating, I am really not sure how to set k... if (labSet.nonEmpty) { val n = math.min(pred.length, k) ... } If I make k a function of pred.length, val n = math.min(pred.length, k*pred.length), then I can vary k between 0 and 1 and choose the sweet spot for k on a given dataset, but I am not sure if that is a measure that makes sense for recommendation... MAP is something that makes sense, as it is averaged over the whole test set... On Fri, Oct 31, 2014 at 1:26 AM, Sean Owen so...@cloudera.com wrote: No, excepting approximate methods like LSH to figure out the relatively small set of candidates for the users in the partition, and broadcast or join those. On Fri, Oct 31, 2014 at 5:45 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Sean, re my point earlier, do you know a more efficient way to compute top k for each user, other than to broadcast the item factors? (I guess one can use the new asymmetric LSH paper perhaps to assist) — Sent from Mailbox On Thu, Oct 30, 2014 at 11:24 PM, Sean Owen so...@cloudera.com wrote: MAP is effectively an average over all k from 1 to min(# recommendations, # items rated). Getting the first recommendations right is more important than the last. On Thu, Oct 30, 2014 at 10:21 PM, Debasish Das debasish.da...@gmail.com wrote: Does it make sense to have a user-specific K, or is K considered the same over all users? Intuitively, users who watch more movies should get a higher K than the others...
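To make the two variants being compared concrete, here is a minimal sketch in plain Python (not Spark's RankingMetrics implementation; function names and the choice to divide by the truncated list length n are illustrative) of precision-at-k alongside the fractional-k variant proposed above, where k is a ratio in (0, 1] scaled by the length of each user's prediction list:

```python
import math

def precision_at_k(pred, lab_set, k):
    """Fraction of the top-k predictions that appear in the label set."""
    if not lab_set or not pred:
        return 0.0
    n = min(len(pred), k)  # truncate to the available predictions
    hits = sum(1 for p in pred[:n] if p in lab_set)
    return hits / n

def precision_at_fractional_k(pred, lab_set, ratio):
    """Variant where the cutoff scales with the prediction list:
    n = ceil(ratio * len(pred)), so ratio can be tuned in (0, 1]."""
    if not lab_set or not pred:
        return 0.0
    n = max(1, min(len(pred), math.ceil(ratio * len(pred))))
    hits = sum(1 for p in pred[:n] if p in lab_set)
    return hits / n
```

One consequence worth noting: with the fractional variant, two users with different prediction-list lengths are evaluated at different absolute cutoffs, which is exactly why it may not compare cleanly across users the way MAP does.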
MatrixFactorizationModel predict(Int, Int) API
Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int), but the code fails on userFeatures.lookup(user).head. In computeRmse, MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been called, and in all the test cases that API has been used... I can perhaps refactor my code to do the same, but I was wondering whether people test the lookup(user) version of the code. Do I need to cache the model to make it work? I think right now the default is MEMORY_AND_DISK... Thanks. Deb
Spark shuffle consolidateFiles performance degradation numbers
Hi everyone, I'm running into more and more cases where too many files are opened when spark.shuffle.consolidateFiles is turned off. I was wondering if this is a common scenario among the rest of the community, and if so, whether it is worth considering turning the setting on by default. From the documentation, it seems like performance could be hurt on ext3 file systems. However, what are the concrete numbers of performance degradation that are typically seen? A 2x slowdown in the average job? 3x? Also, what causes the performance degradation on ext3 file systems specifically? Thanks, -Matt Cheah
Re: Spark shuffle consolidateFiles performance degradation numbers
In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have better performance while creating fewer files. So I'd suggest trying that too. Matei On Nov 3, 2014, at 6:12 PM, Andrew Or and...@databricks.com wrote: Hey Matt, There's some prior work that compares consolidation performance on some medium-scale workload: http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf There we noticed about 2x performance degradation in the reduce phase on ext3. I am not aware of any other concrete numbers. Maybe others have more experiences to add. -Andrew 2014-11-03 17:26 GMT-08:00 Matt Cheah mch...@palantir.com: Hi everyone, I'm running into more and more cases where too many files are opened when spark.shuffle.consolidateFiles is turned off.
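For anyone wanting to try the suggestion, both settings can go in spark-defaults.conf; a sketch (the keys are the real Spark 1.1 config names mentioned in this thread, the values shown are just one reasonable combination):

```
spark.shuffle.manager           sort
spark.shuffle.consolidateFiles  true
```

The same keys can equally be passed via SparkConf.set(...) or --conf on spark-submit.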
Re: Spark shuffle consolidateFiles performance degradation numbers
(BTW this had a bug with negative hash codes in 1.1.0, so you should try branch-1.1 for it.) Matei On Nov 3, 2014, at 6:28 PM, Matei Zaharia matei.zaha...@gmail.com wrote: In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have better performance while creating fewer files. So I'd suggest trying that too.
Re: Spark shuffle consolidateFiles performance degradation numbers
Hey Andrew, Matei, Thanks for responding. For some more context, we were running into "Too many open files" issues, where we were seeing this happen immediately after the Collect phase (about 30 seconds into a run) on a decently sized dataset (14 MM rows). The ulimit set in the spark-env was 256,000, which we believe should have been enough, but even with it set at that number, we were still seeing issues. Can you comment on what a good ulimit should be in these cases? We believe what might have caused this is that some process got orphaned without cleaning up its open file handles. However, other than anecdotal evidence and some speculation, we don't have much evidence to expand on this further. We were wondering if we could get some more information about how many files get opened during a shuffle. We discussed that it is going to be around N x M, where N is the number of tasks and M is the number of reducers. Does this sound about right? Are there any other considerations we should be aware of when setting consolidateFiles to true? Thanks, Zach Fry Palantir | Developer Support Engineer z...@palantir.com | 650.226.6338 On 11/3/14, 6:28 PM, Matei Zaharia matei.zaha...@gmail.com wrote: In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have better performance while creating fewer files. So I'd suggest trying that too. Matei On Nov 3, 2014, at 6:12 PM, Andrew Or and...@databricks.com wrote: Hey Matt, There's some prior work that compares consolidation performance on some medium-scale workload: http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf There we noticed about 2x performance degradation in the reduce phase on ext3.
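The N x M reasoning above can be made concrete with a back-of-envelope estimate. This is a sketch of the counting argument only (function names are illustrative, and the consolidated formula assumes files scale with concurrently running map tasks, i.e. cores, rather than total map tasks):

```python
def hash_shuffle_files(map_tasks, reduce_tasks):
    """Without consolidation, each map task writes one file per reducer."""
    return map_tasks * reduce_tasks

def consolidated_shuffle_files(cores_per_executor, executors, reduce_tasks):
    """With consolidation, concurrent tasks reuse file groups, so the
    count scales with total cores rather than total map tasks."""
    return cores_per_executor * executors * reduce_tasks

# A modest job of 1000 map tasks x 1000 reducers already needs a
# million shuffle files without consolidation, which is how a 256,000
# ulimit can be exhausted even on medium-sized data.
million = hash_shuffle_files(1000, 1000)
```

Under those assumptions, 10 executors with 8 cores each and 1000 reducers would need 80,000 files with consolidation versus 1,000,000 without, which suggests why the default ulimit pressure shows up so quickly with consolidation off.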
Re: MatrixFactorizationModel predict(Int, Int) API
Was the user present in training? We can put a check there and return NaN if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int), but the code fails on userFeatures.lookup(user).head.
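The guarded predict being proposed can be sketched in plain Python (this is not the MLlib API; the dict-backed factor maps and function name stand in for the model's userFeatures/productFeatures RDDs):

```python
import math

def predict(user_features, product_features, user, product):
    """Dot product of the user and product factor vectors, or NaN when
    either id was never seen during training."""
    u = user_features.get(user)
    p = product_features.get(product)
    if u is None or p is None:
        return float("nan")  # unseen user/product: no prediction
    return sum(a * b for a, b in zip(u, p))
```

Returning NaN instead of raising keeps single-pair predictions usable for ids outside the training set, mirroring the check suggested above.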