[jira] [Commented] (SPARK-15447) Performance test for ALS in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335956#comment-15335956 ]

Nick Pentreath commented on SPARK-15447:

Finalized results in the linked Google sheet. Also posted raw results in two linked Google docs.

[~mengxr] I didn't manage to run 1 billion ratings but did run 250 million (30 million users, 10 million items, 250 million ratings).

I didn't see any potential performance regression issues for checkpointing changes (comparing RDD-based APIs between 2.0.0 and 1.6.1) or DF changes (comparing DF-based APIs between 2.0.0 and 1.6.1).

I'm resolving this ticket, but let me know if you come up with any questions or concerns.

> Performance test for ALS in Spark 2.0
> -------------------------------------
>
>                 Key: SPARK-15447
>                 URL: https://issues.apache.org/jira/browse/SPARK-15447
>             Project: Spark
>          Issue Type: Task
>          Components: ML
>    Affects Versions: 2.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Nick Pentreath
>            Priority: Critical
>              Labels: QA
>
> We made several changes to ALS in 2.0. It is necessary to run some tests to
> avoid performance regression. We should test (synthetic) datasets from 1
> million ratings to 1 billion ratings.
> cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance
> tests?
> Links:
> [Results spreadsheet|https://docs.google.com/spreadsheets/d/1iX5LisfXcZSTCHp8VPoo5z-eCO85A5VsZDtZ5e475ks/edit?usp=sharing]
> [Raw results for SPARK-14891|https://docs.google.com/document/d/1tlWFCv8zWJuxv_gfAhd-57TKURVyrYkF9v4FLl4Lpn0/edit?usp=sharing]
> [Raw results for SPARK-6716|https://docs.google.com/document/d/12qLLX84Dg-XJAgoSQzmb0-bSncjTHhg7A_JJcQneDiE/edit?usp=sharing]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15447) Performance test for ALS in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1530#comment-1530 ]

Nick Pentreath commented on SPARK-15447:

Almost there - I'll be able to close this off by Friday.
[jira] [Commented] (SPARK-15447) Performance test for ALS in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332578#comment-15332578 ]

Reynold Xin commented on SPARK-15447:

We can close this one now, can't we?
[jira] [Commented] (SPARK-15447) Performance test for ALS in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314441#comment-15314441 ]

Nick Pentreath commented on SPARK-15447:

Added a second tab to the sheet for testing the DF-based API in 2.0.0-SNAPSHOT vs 1.6.1 for SPARK-14891. Again, 2.0 is faster and there is no performance regression.
[jira] [Commented] (SPARK-15447) Performance test for ALS in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308797#comment-15308797 ]

Nick Pentreath commented on SPARK-15447:

Created a Google sheet with initial results: https://docs.google.com/spreadsheets/d/1iX5LisfXcZSTCHp8VPoo5z-eCO85A5VsZDtZ5e475ks/edit?usp=sharing

So far, for SPARK-6717, I've only used {{spark-perf}} to compare the RDD-based APIs (as the checkpointing only impacts the RDD-based {{train}} method). These results show no red flags, and 2.0 is actually faster than 1.6 in general. Checkpointing does add a minor overhead, but this overhead is consistent across the versions and is again better in 2.0.

There is something a little weird about the 1.6 results for the 10 million ratings case, and I'm not sure what's going on there; I've rerun it a few times with the same result. Also, I haven't managed to get to 1 billion ratings yet due to cluster size; I will keep working on it.
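For anyone who wants to try a scaled-down run, synthetic ratings of the kind the ticket asks for could be produced roughly as follows. This is only a minimal sketch, not the generator that {{spark-perf}} uses: the function name `synthetic_ratings`, the uniform sampling of users and items, and the 1.0 to 5.0 rating range are all assumptions made for illustration.

```python
import random
from typing import Iterator, Tuple


def synthetic_ratings(num_users: int, num_items: int, num_ratings: int,
                      seed: int = 42) -> Iterator[Tuple[int, int, float]]:
    """Lazily yield (user, item, rating) triples.

    Hypothetical helper for illustration only. Users and items are drawn
    uniformly at random (an assumption; real rating data is usually skewed),
    ratings fall in the assumed range 1.0-5.0, and duplicate (user, item)
    pairs are possible.
    """
    if num_users <= 0 or num_items <= 0:
        raise ValueError("num_users and num_items must be positive")
    if num_ratings < 0:
        raise ValueError("num_ratings must be non-negative")

    # A private, seeded generator keeps repeated runs comparable.
    rng = random.Random(seed)
    for _ in range(num_ratings):
        user = rng.randrange(num_users)
        item = rng.randrange(num_items)
        rating = round(rng.uniform(1.0, 5.0), 1)
        yield user, item, rating
```

Because the function is a generator, the triples can be streamed to a file without holding them in memory; for example, `synthetic_ratings(30_000, 10_000, 250_000)` keeps the same users : items : ratings proportions as the 250 million ratings run at one thousandth of the size.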
[jira] [Commented] (SPARK-15447) Performance test for ALS in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294116#comment-15294116 ]

Nick Pentreath commented on SPARK-15447:

[~mengxr] Yes, I will aim to run some tests early next week.