[GitHub] Ben-Zvi opened a new pull request #1522: Drill 6735: Implement Semi-Join for the Hash-Join operator

GitBox Fri, 02 Nov 2018 19:00:35 -0700

Ben-Zvi opened a new pull request #1522: Drill 6735: Implement Semi-Join for 
the Hash-Join operator
URL: https://github.com/apache/drill/pull/1522
 
 
   This work builds on "DRILL-6798: Planner changes to support semi-join" and 
makes the Hash-Join perform the Semi-Join internally, by not probing/joining 
against any duplicate build-side keys.
   
   The work is broken into three commits. The **first commit**:
      Part of the changes is to pass the `semiJoin` flag down from the planner 
to the operator (via `HashJoinPOP`, etc.). 
   The main change stops probing as soon as the outer row finds its first 
match (no further matches are checked); see `executeProbePhase()`. The loop 
then continues with the next outer row.
      Another change is not to output the (key) columns from the inner side. 
Thus they are skipped when building the output schema, and 
`numberOfBuildSideColumns` is set to zero so copying of the outer columns to 
the outgoing container starts from the first column there.
     Another improvement is not allocating or using the Hash-Join "Helper".
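
   To illustrate the probe-side change described for the first commit, here is a minimal standalone sketch. It is not Drill code: the class `SemiJoinProbeSketch`, the `probe` method, and the string row format are all hypothetical, and a plain `Map` stands in for the hash table. It only shows the idea of stopping at the first match and emitting outer columns only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SemiJoinProbeSketch {
    /**
     * Probes a build-side table (key -> inner rows sharing that key) with each outer key.
     * Illustrative only: names and row formats are hypothetical, not Drill's.
     */
    static List<String> probe(List<Integer> outerKeys,
                              Map<Integer, List<String>> buildSide,
                              boolean semiJoin) {
        List<String> output = new ArrayList<>();
        for (Integer key : outerKeys) {
            List<String> matches = buildSide.get(key);
            if (matches == null) {
                continue; // no match: the outer row is dropped in both modes
            }
            if (semiJoin) {
                // Semi-join: one hit is enough. Emit outer columns only and
                // go straight to the next outer row.
                output.add("outer=" + key);
                continue;
            }
            // Regular inner join: one output row per duplicate inner match.
            for (String inner : matches) {
                output.add("outer=" + key + ",inner=" + inner);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> buildSide = Map.of(
                1, List.of("c"),
                2, List.of("a", "b"));
        List<Integer> outer = List.of(1, 2, 5);
        System.out.println(probe(outer, buildSide, true));
        System.out.println(probe(outer, buildSide, false));
    }
}
```

   With the semi-join flag set, key 2 produces a single output row even though it has two inner matches; with the flag cleared it produces two.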
   The **second commit**: 
     Addresses cases of many key-duplicate inner rows by using the hash-table 
from the very beginning to detect duplicate incoming rows and just skip them 
(i.e., not copy them into the partitions). In case of spilling, the hash table 
is reset, and then reused. In case of no spill, the hash table is used as is 
(no need to build it again).
     A new option was added to control this feature. This feature adds some 
overhead (e.g., hash table resizing), but can save lots of storage space and 
related overhead.
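
   The "skip duplicates" idea of the second commit can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: `SkipDuplicatesSketch` and its methods are hypothetical names, a `HashSet` stands in for the partition's hash table, and "spilling" is modeled by simply clearing the in-memory rows.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SkipDuplicatesSketch {
    // Stand-in for the partition's hash table, used from the first incoming row.
    private final Set<Integer> seenKeys = new HashSet<>();
    // Inner rows actually kept in the (in-memory) partition.
    private final List<Integer> partitionRows = new ArrayList<>();
    private int skipped = 0;

    /** Adds an inner row unless its key was already seen in this partition. */
    void addInnerRow(int key) {
        if (seenKeys.add(key)) {
            partitionRows.add(key); // first occurrence: keep it
        } else {
            skipped++;              // duplicate key: do not copy into the partition
        }
    }

    /** Models a spill: rows leave memory, and the hash table is reset for reuse. */
    void spill() {
        partitionRows.clear();
        seenKeys.clear();
    }

    List<Integer> rows() { return partitionRows; }

    int skippedCount() { return skipped; }

    public static void main(String[] args) {
        SkipDuplicatesSketch partition = new SkipDuplicatesSketch();
        for (int key : new int[] {7, 7, 8, 7}) {
            partition.addInnerRow(key);
        }
        System.out.println(partition.rows() + " skipped=" + partition.skippedCount());
    }
}
```

   When no spill happens, the set already holds every distinct key once the build side is exhausted, which mirrors the point that the hash table need not be built again.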
   The **third commit**: 
     Tries to reduce the overhead of the "skip duplicates" feature by 
collecting initial run-time statistics and then turning the feature off if 
there were not many duplicates (< 20%). The decision is made after reading 
about half of the min-hash-table-size in each partition.
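
   The run-time decision of the third commit might look roughly like the sketch below. Everything here is an assumption made for illustration: the class `AdaptiveSkipSketch`, the `sampleSize` constructor parameter (standing in for "about half of the min-hash-table-size"), and the per-partition bookkeeping. Only the 20% threshold comes from the description above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AdaptiveSkipSketch {
    // Threshold from the PR description: below 20% duplicates, stop skipping.
    static final double MIN_DUPLICATE_FRACTION = 0.20;

    private final int sampleSize;            // hypothetical: rows to read before deciding
    private final Set<Integer> seenKeys = new HashSet<>();
    private final List<Integer> keptRows = new ArrayList<>();
    private int rowsRead = 0;
    private int duplicates = 0;
    private boolean decided = false;
    private boolean skipEnabled = true;

    AdaptiveSkipSketch(int sampleSize) {
        this.sampleSize = sampleSize;
    }

    void addInnerRow(int key) {
        if (!skipEnabled) {
            keptRows.add(key); // feature turned off: no hash-table work per row
            return;
        }
        rowsRead++;
        if (seenKeys.add(key)) {
            keptRows.add(key);
        } else {
            duplicates++;
        }
        if (!decided && rowsRead >= sampleSize) {
            decided = true;
            // Too few duplicates to justify the hash-table overhead: stop skipping.
            if ((double) duplicates / rowsRead < MIN_DUPLICATE_FRACTION) {
                skipEnabled = false;
            }
        }
    }

    boolean isSkipEnabled() { return skipEnabled; }

    List<Integer> keptRows() { return keptRows; }

    public static void main(String[] args) {
        AdaptiveSkipSketch distinct = new AdaptiveSkipSketch(10);
        for (int k = 0; k < 10; k++) {
            distinct.addInnerRow(k); // all keys distinct
        }
        System.out.println("skip still enabled: " + distinct.isSkipEnabled());
    }
}
```

   The decision is taken once per partition and never revisited, which keeps the per-row cost low after the sampling window.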
   
   **Comments:**
   
   1. The Memory-Calculator was not changed, but may need to be: the Hash-Join 
Helper is not used (less memory), but with "skip duplicates" the Hash-Table is 
allocated (and grows) early, as in Hash Agg, and thus needs to be accounted 
for.  
   2. The SI-Intersect operation is somewhat similar (it also ignores inner 
duplicates), but this work did not merge the two (i.e., Intersect still builds 
the Helper, etc.).
   3. Another future possibility is making the Merge-Join support Semi Join.
   4. Also fixed a JsonProperty of HashTableConfig; see Vlad's comment in PR 
#1248 of DRILL-6027.
   5. Here are results from some initial performance testing (embedded mode, on 
a Mac):
       (All tests are simple self-join semi-joins, with no spilling.)
       For **4.8M** rows with *distinct keys*:
       - Old: 31 sec.
       - Semi: 16 sec.
       - Skip-duplicates: 20 sec.
       - With the third commit (stop skipping): 17 sec.
   
       For **2.8M** rows with only **18K** distinct keys (i.e., about 150 
duplicates per key):
       - Old: 2.4 sec.
       - Semi: 3.1 sec.
       - Skip-duplicates: 2.2 sec.
       - With the third commit (stop skipping): 2.4 sec.


