GitHub user jinfengni opened a pull request:
https://github.com/apache/drill/pull/906
DRILL-5546: Handle schema change exception failure caused by empty input or empty batches.
1. Modify ScanBatch's logic when it iterates over its list of RecordReaders.
1) Skip a RecordReader if it returns 0 rows and presents the same schema. A new
schema (detected by calling Mutator.isNewSchema()) means that a new top-level field
was added, a field was added inside a nested field, or an existing field's type
changed.
2) Implicit columns are added and populated only when the input is not
empty, i.e. the batch contains > 0 rows, or rowCount == 0 with a new schema.
3) ScanBatch returns NONE directly (a "fast NONE") if all its RecordReaders
have empty input and are therefore skipped, instead of returning
OK_NEW_SCHEMA first.
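
A minimal, self-contained sketch of that iteration rule; the names here (EmptyAwareReader, Outcome, nextBatch) are simplified stand-ins, not Drill's actual RecordReader/ScanBatch classes:

```java
import java.util.List;

class ScanIterationSketch {
  enum Outcome { OK, OK_NEW_SCHEMA, NONE }

  interface EmptyAwareReader {
    int next();              // rows read into the current batch
    boolean isNewSchema();   // stands in for Mutator.isNewSchema()
  }

  static Outcome nextBatch(List<EmptyAwareReader> readers) {
    for (EmptyAwareReader reader : readers) {
      int rows = reader.next();
      boolean newSchema = reader.isNewSchema();
      if (rows == 0 && !newSchema) {
        continue;            // 0 rows and an unchanged schema: skip this reader
      }
      // Implicit columns would be added/populated only on this non-empty path.
      return newSchema ? Outcome.OK_NEW_SCHEMA : Outcome.OK;
    }
    // Every reader was empty and skipped: return NONE directly ("fast NONE"),
    // without emitting OK_NEW_SCHEMA first.
    return Outcome.NONE;
  }
}
```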
2. Modify IteratorValidatorBatchIterator to allow
1) a fast NONE (before seeing an OK_NEW_SCHEMA);
2) a batch with an empty list of columns.
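
Illustrative only, a simplified view of the two relaxed checks; this is not the actual IteratorValidatorBatchIterator code:

```java
class ValidatorSketch {
  enum Outcome { OK, OK_NEW_SCHEMA, NONE }

  static void check(Outcome outcome, boolean schemaSeen, int columnCount) {
    if (outcome == Outcome.NONE && !schemaSeen) {
      // Previously an error; now legal: a "fast NONE" from a scan whose
      // inputs were all empty and skipped.
      return;
    }
    if (columnCount == 0) {
      // Also accepted now: a batch carrying an empty list of columns.
      return;
    }
    // ... remaining validation rules unchanged ...
  }
}
```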
3. Modify JsonRecordReader when it gets 0 rows: do not insert a nullable-int
column for 0-row input. Together with ScanBatch, Drill will skip empty JSON
files.
4. Modify binary operators such as join and union to handle a fast NONE on
either one side or both sides. The logic is abstracted into AbstractBinaryRecordBatch,
except for MergeJoin, whose implementation is quite different from the others.
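
A rough sketch of the prefetch decision such a base class could make, assuming simplified stand-in types (Child, Outcome, prefetchFirstBatches are illustrative, not the actual AbstractBinaryRecordBatch API):

```java
class BinaryPrefetchSketch {
  enum Outcome { OK, OK_NEW_SCHEMA, NONE }

  interface Child { Outcome next(); }

  /** Returns false when the operator itself can terminate with NONE. */
  static boolean prefetchFirstBatches(Child left, Child right,
                                      boolean emptySideYieldsEmptyResult) {
    boolean leftEmpty = left.next() == Outcome.NONE;    // fast NONE: no schema at all
    boolean rightEmpty = right.next() == Outcome.NONE;

    if (leftEmpty && rightEmpty) {
      return false;                // both sides empty: this operator emits NONE
    }
    if ((leftEmpty || rightEmpty) && emptySideYieldsEmptyResult) {
      return false;                // one empty side already makes the result empty
    }
    // Otherwise continue the usual setup with the side(s) that did provide a schema.
    return true;
  }
}
```

For example, an inner join would pass emptySideYieldsEmptyResult = true, while union all would pass false and keep processing the non-empty side.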
5. Fix and refactor the union-all operator.
1) Correct the union operator's handling of 0 input rows. Previously it
ignored inputs with 0 rows and put a nullable-int column into the output schema,
which caused various schema change issues in downstream operators. The new
behavior takes 0-row schemas into account when determining the output schema,
in the same way as inputs with > 0 rows. This ensures the Union operator does
not behave like a schema-lossy operator.
2) Add a UnionInputIterator to simplify iterating over the left/right inputs,
removing a significant chunk of duplicated code from the previous
implementation.
The new union-all operator is roughly half the code size of the old one.
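
A simplified idea of such an input iterator, with stand-in types rather than Drill's actual batch classes:

```java
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Queue;

class UnionInputIteratorSketch implements Iterator<UnionInputIteratorSketch.Input> {
  /** One incoming batch together with the side it came from. */
  static final class Input {
    final boolean fromLeft;
    final Object batch;          // stand-in for a record batch, possibly with 0 rows
    Input(boolean fromLeft, Object batch) { this.fromLeft = fromLeft; this.batch = batch; }
  }

  private final Queue<Input> pending = new LinkedList<>();

  UnionInputIteratorSketch(Iterable<Object> leftBatches, Iterable<Object> rightBatches) {
    // Even 0-row batches are enqueued, so their schemas still reach the
    // output-schema computation described in 1) above.
    for (Object b : leftBatches)  pending.add(new Input(true, b));
    for (Object b : rightBatches) pending.add(new Input(false, b));
  }

  @Override public boolean hasNext() { return !pending.isEmpty(); }
  @Override public Input next() { return pending.remove(); }
}
```

With both sides presented as one stream, the operator body needs a single processing loop instead of duplicated left/right code paths.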
6. Introduce UntypedNullVector to handle the convertFromJSON() function when
the input batch contains 0 rows.
Problem: convertFromJSON() differs from regular functions in that it only
knows its output schema after evaluation is performed. When the input has 0
rows, Drill essentially has no way to know the output type and previously
assumed Map type. That worked under the assumption that other operators, such
as Union, would ignore batches with 0 rows, which is no longer the case in the
current implementation.
Solution: use MinorType.NULL as the output type for convertFromJSON() when
the input contains 0 rows. The new UntypedNullVector represents a column with
MinorType.NULL.
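
A rough illustration of the fallback, using a local stand-in enum instead of Drill's real MinorType and vector classes:

```java
class ConvertFromJsonTypeSketch {
  enum MinorType { MAP, NULL /* ... other minor types ... */ }

  static MinorType outputTypeOf(int inputRowCount, MinorType evaluatedType) {
    if (inputRowCount == 0) {
      // No rows were evaluated, so the real output type is unknowable.
      // Previously MAP was assumed; now an untyped NULL column is emitted,
      // backed by the new UntypedNullVector.
      return MinorType.NULL;
    }
    return evaluatedType;   // type discovered by actually evaluating the JSON
  }
}
```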
7. HBaseGroupScan converts the star column into a list of row_key and column
families. HBaseRecordReader should reject the star column, since it expects
the star to have been converted elsewhere.
In HBase, a column family always has map type and a non-rowkey column always
has nullable varbinary type. This ensures that HBaseRecordReaders across
different HBase regions will have the same top-level schema, even if a region
is empty or filter pushdown optimization prunes all of its rows. In other
words, we will not see different top-level schemas from different
HBaseRecordReaders for the same table.
However, this change cannot handle a hard schema change: c1 exists in cf1 in
one region but not in another. Further work is required to handle hard schema
changes.
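
A sketch of the star expansion, using plain strings in place of Drill's SchemaPath/column classes (expandStar and the names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

class HBaseStarExpansionSketch {
  /** Expand "*" into row_key plus one column per known column family. */
  static List<String> expandStar(List<String> requested, List<String> columnFamilies) {
    if (!requested.contains("*")) {
      return requested;                  // nothing to do
    }
    List<String> expanded = new ArrayList<>();
    expanded.add("row_key");
    expanded.addAll(columnFamilies);     // each family is read as a map column
    return expanded;
  }
}
```

Because every reader then sees the same top-level columns (row_key plus the families), empty or fully pruned regions report the same schema as the rest.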
8. Modify scan cost estimation when the query involves the * column. This
removes planning randomness, since previously two different operators could
have the same cost.
9. Add a new flag 'outputProj' to the Project operator to indicate whether the
Project produces the query's final output. Such a Project is added by
TopProjectVisitor to handle a fast NONE when all the inputs to the query are
empty and are skipped:
1) the star column is replaced with an empty list;
2) a regular column reference is replaced with a nullable-int column;
3) an expression goes through ExpressionTreeMaterializer, and the type of the
materialized expression is used as the output type;
4) an OK_NEW_SCHEMA with the schema built by the above logic is returned,
followed by a NONE to the downstream operator.
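
A self-contained sketch of how rules 1)-4) could be applied to build the empty-input schema; the types and helper below are illustrative, not Drill's Project implementation:

```java
import java.util.ArrayList;
import java.util.List;

class OutputProjectSketch {
  enum Kind { STAR, COLUMN_REF, EXPRESSION }

  static final class Expr {
    final Kind kind;
    final String name;
    Expr(Kind kind, String name) { this.kind = kind; this.name = name; }
  }

  /** Returns column descriptions like "name: TYPE" for the empty-input schema. */
  static List<String> buildEmptyInputSchema(List<Expr> projections) {
    List<String> schema = new ArrayList<>();
    for (Expr e : projections) {
      switch (e.kind) {
        case STAR:
          break;                                   // 1) star contributes nothing
        case COLUMN_REF:
          schema.add(e.name + ": NULLABLE INT");   // 2) plain reference -> nullable int
          break;
        case EXPRESSION:
          // 3) in Drill this would run through ExpressionTreeMaterializer;
          //    here a placeholder stands in for the materialized type.
          schema.add(e.name + ": <materialized type>");
          break;
      }
    }
    // 4) the caller emits OK_NEW_SCHEMA with this schema, then NONE.
    return schema;
  }
}
```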
10. Add unit tests for operators handling empty input.
11. Add unit tests for queries whose inputs are all empty.
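
A hypothetical example of the kind of test this covers, written against plain JUnit 4 so it stays self-contained; the query, file names, and runAndCount helper are illustrative, not Drill's actual test framework:

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class EmptyInputSketchTest {
  /** Stand-in for running a query through the engine and counting result rows. */
  private long runAndCount(String sql) {
    // In the real tests this would go through Drill's test query builder.
    return 0;
  }

  @Test
  public void unionAllOfTwoEmptyJsonFilesReturnsNoRows() {
    long rows = runAndCount(
        "select a from dfs.`empty1.json` union all select a from dfs.`empty2.json`");
    assertEquals(0, rows);
  }
}
```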
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jinfengni/incubator-drill DRILL-5546
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/906.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #906
----
commit b0110140f8375809af3deddf9881e64dc1242886
Author: Jinfeng Ni <[email protected]>
Date: 2017-05-17T23:08:00Z
DRILL-5546: Handle schema change exception failure caused by empty input or empty batches.
----