Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/17936

Take rdd1.cartesian(rdd2). For each task we need to pull all the data of rdd1 (or rdd2) from the cluster. If we have n tasks running in parallel on the same executor, that means we pull n duplicate copies of the same data to that executor. This change can reduce the GC pressure and network I/O (and possibly disk I/O, if memory plus the disk store can't keep it all in memory).
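A minimal sketch of the access pattern being described (the RDD sizes, partition counts, and object name below are illustrative, not taken from the PR):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CartesianSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cartesian-sketch").setMaster("local[4]"))

    // Two parent RDDs; every output partition of the cartesian product pairs
    // one partition of rdd1 with one partition of rdd2.
    val rdd1 = sc.parallelize(1 to 1000, numSlices = 4)
    val rdd2 = sc.parallelize(1 to 1000, numSlices = 4)

    // With n such tasks running in parallel on the same executor, the same
    // rdd2 (or rdd1) partition data may be fetched n times over the network,
    // which is the duplication this PR aims to avoid.
    val product = rdd1.cartesian(rdd2)
    println(product.count())

    sc.stop()
  }
}
```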