[jira] [Commented] (SPARK-20638) Optimize the CartesianRDD to reduce repeatedly data fetching
[ https://issues.apache.org/jira/browse/SPARK-20638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004342#comment-16004342 ]

Teng Jiang commented on SPARK-20638:

A further 88x improvement is shown in my PR comment.
https://github.com/apache/spark/pull/17898#issuecomment-299818394

> Optimize the CartesianRDD to reduce repeatedly data fetching
>
>                 Key: SPARK-20638
>                 URL: https://issues.apache.org/jira/browse/SPARK-20638
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Teng Jiang
>
> In CartesianRDD, group each iterator into multiple groups. Thus, in the second
> iteration, the data will be fetched (num of data)/groupSize times, rather
> than (num of data) times.
> The test results are:
> Test Environment: 3 workers, each with 10 cores, 30G memory, 1 executor
> Test data: users: 480,189, each a 10-dim vector; items: 17770, each
> a 10-dim vector.
> With default CartesianRDD, cartesian time is 2420.7s.
> With this proposal, cartesian time is 45.3s, roughly
> 50x faster than the original method.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
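Editor's sketch of the grouping idea from the issue description quoted above. This is a plain-Python model, not the Scala code from either pull request; `CountingSource`, `naive_cartesian` and `grouped_cartesian` are hypothetical names, and each fresh iteration over the second collection is assumed to stand in for one fetch of that partition.

```python
from itertools import islice
from math import ceil


class CountingSource:
    """Hypothetical stand-in for the second partition; counts re-reads."""

    def __init__(self, items):
        self._items = list(items)
        self.fetches = 0

    def __iter__(self):
        # Assumption: every fresh iteration models one fetch of the data.
        self.fetches += 1
        return iter(self._items)


def naive_cartesian(left, right):
    """Re-read `right` once per element of `left`."""
    for x in left:
        for y in right:
            yield (x, y)


def grouped_cartesian(left, right, group_size):
    """Re-read `right` once per group of up to `group_size` left elements."""
    it = iter(left)
    while True:
        group = list(islice(it, group_size))  # hold one group in memory
        if not group:
            break
        for y in right:
            for x in group:
                yield (x, y)


users = range(10)

items = CountingSource("abc")
naive = sorted(naive_cartesian(users, items))
naive_fetches = items.fetches  # one fetch per user

items = CountingSource("abc")
grouped = sorted(grouped_cartesian(users, items, group_size=4))
grouped_fetches = items.fetches  # one fetch per group of users

print(naive_fetches, grouped_fetches, ceil(len(users) / 4))
```

In this model both variants produce the same set of pairs, but in a different order, and the grouped variant keeps up to `group_size` elements of the first collection in memory at a time.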
[jira] [Commented] (SPARK-20638) Optimize the CartesianRDD to reduce repeatedly data fetching
[ https://issues.apache.org/jira/browse/SPARK-20638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004311#comment-16004311 ]

Apache Spark commented on SPARK-20638:

User 'ConeyLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/17936
[jira] [Commented] (SPARK-20638) Optimize the CartesianRDD to reduce repeatedly data fetching
[ https://issues.apache.org/jira/browse/SPARK-20638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002299#comment-16002299 ]

Teng Jiang commented on SPARK-20638:

I think it is just buffered
[jira] [Commented] (SPARK-20638) Optimize the CartesianRDD to reduce repeatedly data fetching
[ https://issues.apache.org/jira/browse/SPARK-20638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002295#comment-16002295 ]

Sean Owen commented on SPARK-20638:

I am still not clear why grouped() is better than buffering the iterator. Isn't it just {{.buffered}}?
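Editor's sketch on the grouped() versus {{.buffered}} question. As far as the Scala standard library goes, `Iterator.buffered` adds a one-element lookahead (`head`), while `Iterator.grouped(n)` yields chunks of up to n elements. The plain-Python model below only illustrates those two shapes; `Peekable` and `grouped` are hypothetical stand-ins, not Spark or Scala APIs, and the thread itself does not settle which behaviour the commenters had in mind.

```python
from itertools import islice


class Peekable:
    """Hypothetical one-element lookahead, modelled on a buffered iterator."""

    _EMPTY = object()

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._head = self._EMPTY

    def head(self):
        # Inspect the next element without consuming it.
        if self._head is self._EMPTY:
            self._head = next(self._it)
        return self._head

    def __iter__(self):
        return self

    def __next__(self):
        if self._head is not self._EMPTY:
            value, self._head = self._head, self._EMPTY
            return value
        return next(self._it)


def grouped(iterable, size):
    """Hypothetical chunking helper: yield lists of up to `size` elements."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


p = Peekable(range(5))
peeked = p.head()  # looks at the first element only
rest = list(p)     # the peeked element is still delivered
chunks = list(grouped(range(5), 2))

print(peeked, rest, chunks)
```

In this model the lookahead holds a single element at a time, so on its own it does not change how often the other side is re-read; only the chunked form batches several elements per pass.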