ilooner commented on a change in pull request #1472: DRILL-6747: Fixed IOB in HashJoin for inner and right joins on empty probe side URL: https://github.com/apache/drill/pull/1472#discussion_r219561941
########## File path: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java ########## @@ -115,6 +115,14 @@ * This value will be returned only after {@link #OK_NEW_SCHEMA} has been * returned at least once (not necessarily <em>immediately</em> after). * </p> + * <p> + * Also after a RecordBatch returns NONE a RecordBatch should: + * <ul> + * <li>Contain the last valid schema seen by the operator.</li> + * <li>Contain a VectorContainer with empty columns corresponding to the last valid schema.</li> + * <li>Return a record count of 0.</li> + * </ul> + * </p> */ NONE(false), Review comment: HashJoin needs to first sniff the first data holding batch from the probe side in order to compute stats for the memory calculations used to spill. If the probe side is empty we will hit NONE. This is unavoidable and this is where the problem begins. In the case of a right join HashJoin still builds HashTables, and the HashTables require build and probe batches to be updated simultaneously before being built. When the build and probe sides are updated the Hashtable expects a vector container with valid columns. If the columns are not present an IOB is thrown. Then later the probing logic is called in HashJoinProbeTemplate, which expects the upstream probeBatch to return a valid record count. There are a few ways to fix this problem: 1. We can refactor the HashTable to update probe and build sides independently. This will have some corner cases though, and the HashTable has no unit tests, so it's risky. Or we can remove the code generation from the HashTable, which will also be somewhat time consuming. Or we could change the join logic to not build the HashTable in this case, which is another major change. Then we can also refactor HashJoinProbeTemplate to handle an empty probe side gracefully, but again HashJoinProbeTemplate is not unit tested so this will be time consuming. 2. We leave the HashTable and HashJoinProbeTemplate as is, but use the last schema we got from the probe side to build dummy empty vector containers for building the HashTable, and for probing in HashJoinProbeTemplate. IMO this is a hack. Also there are some corner cases because some Drill operators violate the contract that they must first send OK_NEW_SCHEMA and then NONE. I've observed in some unit tests some readers directly skip to sending NONE without OK_NEW_SCHEMA if there is no data. 3. We define the state of a RecordBatch after it returns NONE to have a record count of zero, the last valid schema, and an empty VectorContainer. This would be defined should a downstream operator choose to query it. This resolves the issue in HashJoin without doing hacks and risky refactoring in a rush. It also defines state that was previously undefined, which IMO is a good thing. I would also do follow up PRs to make all operators obey this behavior, since it will be trivial to do. - Of the three options, option 1 is the strictly correct thing to do if you don't consider time costs. However, due to the technical debt we've accumulated it will be time consuming to do. - Option 2 is a hack IMO and just makes things more confusing for the future generation of Drillers. - Option 3 is a win win. We get a clean resolution to the issue, and we define the state of operators after they return NONE. Defining state that was previously undefined is always a good thing, especially when it simplifies the logic of the system. And I can go through the operators and make sure they do this, it should be a trivial change involving zeroing VectorContainers before returning NONE and updating record counts to be 0. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services