GitHub user ConeyLiu commented on the issue:

    https://github.com/apache/spark/pull/17936
  
    rdd1.cartesian(rdd2): for each task we need to pull all the data of rdd2
    (or rdd1) from the cluster. If we have n tasks running in parallel in the
    same executor, that means we pull the same data n times to the same
    executor. This change can reduce the GC problem and the network I/O (and
    maybe disk I/O, if the data can't fit entirely in memory).
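
    A minimal sketch (not from this PR; the object name, data sizes, and the
    local[4] master are illustrative assumptions) of the access pattern
    described above:

        import org.apache.spark.{SparkConf, SparkContext}

        object CartesianSketch {
          def main(args: Array[String]): Unit = {
            // Hypothetical local setup standing in for one executor with
            // 4 task slots running in parallel.
            val conf = new SparkConf()
              .setAppName("CartesianSketch")
              .setMaster("local[4]")
            val sc = new SparkContext(conf)

            val rdd1 = sc.parallelize(1 to 1000, numSlices = 4)
            val rdd2 = sc.parallelize(1 to 1000, numSlices = 4)

            // cartesian produces 4 x 4 = 16 tasks; each task pairs one
            // partition of rdd1 with one partition of rdd2. Tasks running
            // concurrently on the same executor can each fetch the same
            // rdd2 partition again, which is the duplicated network I/O
            // described above.
            val pairs = rdd1.cartesian(rdd2)
            println(pairs.count()) // 1000 * 1000 = 1,000,000 pairs

            sc.stop()
          }
        }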

