Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22008#discussion_r208379788
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
    @@ -152,3 +153,45 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
          if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
       }
     }
    +
    +/**
    + * Swaps right and left logical plans of a join when left is bigger than right. This is useful
    + * because underlying cartesian product performs a nested loop, thus if the outer table is
    + * smaller there are less iterator initialization.
    --- End diff --
    
    This is indeed an interesting point. I am not sure how/whether we can measure the cost of creating the involved iterators.
    
    Anyway, this actually optimizes not only the initialization cost for the iterator, but also the overall number of records read/processed. Let's take an example. Imagine that we have a table A with 10M records and a table B with 100 records. The total number of records retrieved is:
    
     - if A is the left table, we process: 10M (all the records from A) + 100 * 10M (all the records from B for every record from A) = 101 * 10M
     - if B is the left table, we process: 100 (all the records from B) + 100 * 10M (all the records from A for each record from B) ≈ 100 * 10M
    
    So in the second case we process (size of A - size of B) fewer records; the same applies to the number of bytes read.
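
    To make the arithmetic concrete, here is a tiny standalone Scala sketch of the counts above (the numbers are just the ones from this example, nothing is measured from Spark):

        // Minimal sketch of the nested-loop record counting used in the example above.
        // It only reproduces the arithmetic; it does not call any Spark API.
        object NestedLoopRecordCount {
          // Records touched by a nested-loop cartesian product: the outer (left) side
          // is scanned once, and the inner side is re-scanned for every outer record.
          def recordsProcessed(outerRecords: Long, innerRecords: Long): Long =
            outerRecords + outerRecords * innerRecords

          def main(args: Array[String]): Unit = {
            val a = 10000000L // table A: 10M records
            val b = 100L      // table B: 100 records

            println(recordsProcessed(a, b)) // A on the left: 10M + 100 * 10M = 1,010,000,000
            println(recordsProcessed(b, a)) // B on the left: 100 + 100 * 10M = 1,000,000,100
            println(a - b)                  // difference: |A| - |B| = 9,999,900 fewer records
          }
        }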
    
    And there is another good point for the second option: Spark is much better at computing/reading 10M records 100 times than 100 records 10M times, as it can exploit its parallelism.
    
    That said, your comment still applies, i.e. there may be cases in which one side is very onerous to compute despite being the one with fewer records involved. Do you have any suggestion about how to estimate this? Thanks.
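
    For what it is worth, a back-of-the-envelope version of the swap (purely an illustration, not this PR's actual code) could look roughly like the sketch below. It assumes Catalyst's sizeInBytes statistic as the only proxy for "bigger", which is exactly the part that may misjudge a side that is expensive per record, and it re-projects the original output order after the swap:

        import org.apache.spark.sql.catalyst.plans.Cross
        import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan, Project}
        import org.apache.spark.sql.catalyst.rules.Rule

        // Hedged sketch only: swap the sides of a cross join when the left side is
        // estimated to be larger, using sizeInBytes as a rough proxy for cost, and
        // restore the original output order with a Project.
        object SwapCrossJoinSides extends Rule[LogicalPlan] {
          override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
            case j @ Join(left, right, Cross, condition)
                if left.stats.sizeInBytes > right.stats.sizeInBytes =>
              Project(j.output, Join(right, left, Cross, condition))
          }
        }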

