ilooner commented on a change in pull request #1472: DRILL-6747: Fixed IOB in 
HashJoin for inner and right joins on empty probe side
URL: https://github.com/apache/drill/pull/1472#discussion_r219561941
 
 

 ##########
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java
 ##########
 @@ -115,6 +115,14 @@
      *   This value will be returned only after {@link #OK_NEW_SCHEMA} has been
      *   returned at least once (not necessarily <em>immediately</em> after).
      * </p>
+     * <p>
+     *   Also after a RecordBatch returns NONE a RecordBatch should:
+     *   <ul>
+     *     <li>Contain the last valid schema seen by the operator.</li>
+     *     <li>Contain a VectorContainer with empty columns corresponding to 
the last valid schema.</li>
+     *     <li>Return a record count of 0.</li>
+     *   </ul>
+     * </p>
      */
     NONE(false),
 
 Review comment:
   HashJoin needs to first sniff the first data holding batch from the probe 
side in order to compute stats for the memory calculations used to spill. If 
the probe side is empty we will hit NONE. This is unavoidable and this is where 
the problem begins.
   
   In the case of a right join HashJoin still builds HashTables, and the 
HashTables require build and probe batches to be updated simultaneously before 
being built. When the build and probe sides are updated the Hashtable expects a 
vector container with valid columns. If the columns are not present an IOB is 
thrown.
   
   Then later the probing logic is called in HashJoinProbeTemplate, which 
expects the upstream probeBatch to return a valid record count.
   
   There are a few ways to fix this problem:
   
     1. We can refactor the HashTable to update probe and build sides 
independently. This will have some corner cases though, and the HashTable has 
no unit tests, so it's risky. Or we can remove the code generation from the 
HashTable, which will also be somewhat time consuming. Or we could change the 
join logic to not build the HashTable in this case, which is another major 
change. Then we can also refactor HashJoinProbeTemplate to handle an empty 
probe side gracefully, but again HashJoinProbeTemplate is not unit tested so 
this will be time consuming.
   
     2. We leave the HashTable and HashJoinProbeTemplate as is, but use the 
last schema we got from the probe side to build dummy empty vector containers 
for building the HashTable, and for probing in HashJoinProbeTemplate. IMO this 
is a hack. Also there are some corner cases because some Drill operators 
violate the contract that they must first send OK_NEW_SCHEMA and then NONE. 
I've observed in some unit tests some readers directly skip to sending NONE 
without OK_NEW_SCHEMA if there is no data.
   
    3. We define the state of a RecordBatch after it returns NONE to have a 
record count of zero, the last valid schema, and an empty VectorContainer. This 
would be defined should a downstream operator choose to query it. This resolves 
the issue in HashJoin without doing hacks and risky refactoring in a rush. It 
also defines state that was previously undefined, which IMO is a good thing. I 
would also do follow up PRs to make all operators obey this behavior, since it 
will be trivial to do.
   
    -  Of the three options, option 1 is the strictly correct thing to do if 
you don't consider time costs. However, due to the technical debt we've 
accumulated it will be time consuming to do.
    - Option 2 is a hack IMO and just makes things more confusing for the 
future generation of Drillers.
    - Option 3 is a win win. We get a clean resolution to the issue, and we 
define the state of operators after they return NONE. Defining state that was 
previously undefined is always a good thing, especially when it simplifies the 
logic of the system. And I can go through the operators and make sure they do 
this, it should be a trivial change involving zeroing VectorContainers before 
returning NONE and updating record counts to be 0.
     

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

Reply via email to