GitHub user sachouche opened a pull request:
https://github.com/apache/drill/pull/1170
DRILL-6223: Fixed several Drillbit failures due to schema changes
Fixed several issues caused by schema changes:
1) Changes in complex data types
A Drill query fails when selecting all columns from a set of complex,
nested Parquet files whose schemas differ.
The Parquet files exhibit differences both at the first level and within
nested data types.
A select * alone will not cause an exception, but adding a LIMIT clause will.
Note also that this issue seems to happen only when multiple Drillbit minor
fragments are involved (concurrency higher than one).
2) Dangling columns (both simple and complex)
This situation can be easily reproduced when:
- A SELECT * query involves input data with different schemas
- LIMIT and/or PROJECT operators are used
- The data is read by more than one minor fragment
- This is because individual readers have logic to handle such use cases,
but downstream operators do not
- So if reader-1 sends one batch with F1, F2, and F3
- And reader-2 then sends a batch with only F2 and F3
- Then the LIMIT and PROJECT operators fail to clean up the dangling
column F1, which causes failures when the copy logic of downstream
operators attempts to copy the stale column F1
- This pull request adds logic to detect and eliminate dangling columns
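The dangling-column scenario above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the code in this pull request and not Drill's API: OutputContainer, removeDanglingColumns, onNewBatch, and batch are hypothetical names, and a batch is modeled as a plain map from column name to values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DanglingColumnSketch {

    /** Hypothetical stand-in for an operator's output container: column name -> values. */
    public static final class OutputContainer {
        public final Map<String, List<Object>> columns = new LinkedHashMap<>();
    }

    /**
     * Removes every output column that the incoming schema no longer carries
     * and returns the names of the removed (dangling) columns.
     */
    public static Set<String> removeDanglingColumns(OutputContainer out, Set<String> incomingSchema) {
        Set<String> dangling = new LinkedHashSet<>(out.columns.keySet());
        dangling.removeAll(incomingSchema);
        out.columns.keySet().removeAll(dangling);
        return dangling;
    }

    /** Prunes stale columns first, then copies the incoming batch into the output container. */
    public static Set<String> onNewBatch(OutputContainer out, Map<String, List<Object>> incoming) {
        Set<String> dangling = removeDanglingColumns(out, incoming.keySet());
        for (Map.Entry<String, List<Object>> column : incoming.entrySet()) {
            out.columns.put(column.getKey(), new ArrayList<>(column.getValue()));
        }
        return dangling;
    }

    /** Builds a toy batch in which every named column holds the same three values. */
    public static Map<String, List<Object>> batch(String... columnNames) {
        Map<String, List<Object>> result = new LinkedHashMap<>();
        for (String name : columnNames) {
            result.put(name, Arrays.<Object>asList(1, 2, 3));
        }
        return result;
    }

    public static void main(String[] args) {
        OutputContainer out = new OutputContainer();
        onNewBatch(out, batch("F1", "F2", "F3"));                 // batch from reader-1
        Set<String> dropped = onNewBatch(out, batch("F2", "F3")); // batch from reader-2
        System.out.println("dropped: " + dropped + ", remaining: " + out.columns.keySet());
    }
}
```

Without the pruning step in onNewBatch, F1 would stay in the output container after the second batch, which is the stale column that the copy logic then trips over.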
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sachouche/drill DRILL-6223
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/1170.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1170
----
commit d986b6c7588c107bb7e49d2fc8eb3f25a60e1214
Author: Salim Achouche <sachouche2@...>
Date: 2018-02-21T02:17:14Z
DRILL-6223: Fixed several Drillbit failures due to schema changes
----