Github user jihoonson commented on the pull request:

    https://github.com/apache/tajo/pull/706#issuecomment-134164003
  
    This patch is ready for review. Sorry for a large patch. 
    
    Here are some highlights of changes.
    * Added a session variable to set the limitation of broadcast table size 
for cross join. This value is valid only when ```TEST_BROADCAST_JOIN_ENABLED``` 
is set.
    * Cross join is always executed with broadcast join. To do so, at least one 
input of cross join should be the relation which is smaller than 
```BROADCAST_CROSS_JOIN_THRESHOLD```.
    * Added ```PostLogicalPlanVerifier``` to verify that cross join is 
executable or not.
    * Fixed some bugs in BroadcastJoinRule.
    * Fixed some bugs in QueryTestCaseBase.
    * Removed BNL and NL join executors. Instead, each task executes cross join 
with hash join. This is because one of inputs of cross join is always cached in 
the broadcast cache holder.
    * Improved unique key generation for a scan executor in 
```TaskAttemptContext```.
    
    I've tested cross join performance with a cluster consisting of a master 
and 5 workers. Each worker equips 48 cores, 80 GB memory, and 24 disks. 
    
    #### Data
    * partsupp: 80000000 rows (12.2 GB in TEXT file)
    * supplier_small: 100000 rows (14.1 MB in TEXT file)
    
    #### Query
    ```
    select count(*) from partsupp, supplier_small
    ```
    
    #### Result
    1 hrs, 14 mins, 33 sec is taken with this patch. The above query runs 
forever without this patch because a single worker executes cross join.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

Reply via email to