[ https://issues.apache.org/jira/browse/SPARK-20638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-20638.
-------------------------------
    Resolution: Won't Fix

> Optimize the CartesianRDD to reduce repeatedly data fetching
> ------------------------------------------------------------
>
>                 Key: SPARK-20638
>                 URL: https://issues.apache.org/jira/browse/SPARK-20638
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Teng Jiang
>
> In CartesianRDD, group each iterator into multiple groups. Then, in the
> second iteration, the data will be fetched (num of data)/groupSize times,
> rather than (num of data) times.
> The test results are:
> Test environment: 3 workers, each with 10 cores, 30G memory, 1 executor
> Test data: users: 480,189, each a 10-dim vector; items: 17,770, each a
> 10-dim vector.
> With the default CartesianRDD, cartesian time is 2420.7s.
> With this proposal, cartesian time is 45.3s.
> 50x faster than the original method.
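
For illustration only, the grouping idea in the description can be sketched in plain Python. This is not the reporter's patch (which is not included in this message) and it is not Spark code. The names grouped, cartesian_naive, cartesian_grouped and CountingSource are hypothetical, and the "fetch" of the second partition is modelled as an ordinary callable that returns a fresh iterator.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def grouped(items: Iterable[A], size: int) -> Iterator[List[A]]:
    """Yield consecutive chunks of at most `size` elements."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def cartesian_naive(
    left: Iterable[A], fetch_right: Callable[[], Iterable[B]]
) -> Iterator[Tuple[A, B]]:
    """Re-fetch the right side once per left element."""
    for x in left:
        for y in fetch_right():
            yield (x, y)


def cartesian_grouped(
    left: Iterable[A], fetch_right: Callable[[], Iterable[B]], group_size: int
) -> Iterator[Tuple[A, B]]:
    """Re-fetch the right side once per group of left elements."""
    for chunk in grouped(left, group_size):
        for y in fetch_right():
            for x in chunk:
                yield (x, y)


class CountingSource:
    """Hypothetical stand-in for a remote partition; counts how often it is read."""

    def __init__(self, data: Iterable[B]) -> None:
        self.data = list(data)
        self.fetches = 0

    def __call__(self) -> Iterator[B]:
        self.fetches += 1
        return iter(self.data)


if __name__ == "__main__":
    users = list(range(10))

    naive_items = CountingSource("abc")
    list(cartesian_naive(users, naive_items))
    print("naive fetches:", naive_items.fetches)      # 10, one per user

    grouped_items = CountingSource("abc")
    list(cartesian_grouped(users, grouped_items, group_size=4))
    print("grouped fetches:", grouped_items.fetches)  # 3, i.e. ceil(10 / 4)
```

Both variants produce the same set of pairs, although the grouped variant emits them in a different order, and it holds up to group_size left-side elements in memory at a time.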