pgaref commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564480514
##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -231,11 +224,9 @@ protected RecordReaderImpl(ReaderImpl fileReader,
if (options.getPreFilterColumnNames() != null) {
for (String colName : options.getPreFilterColumnNames()) {
int expandColId = findColumns(evolution, colName);
+      // If the column is not present in the file then this can be ignored from read.
if (expandColId != -1) {
filterColIds.add(expandColId);
- } else {
Review comment:
The intention of the original code here was to notify the consumer that
the filter cannot be applied and could lead to wrong results (keep in mind
that row-filtering is not best-effort like row-group skipping: the consumer
expects only the rows that actually pass the filter).
This was also a side-effect of the existing code, which made no distinction
between a column missing from the schema and a schema/fileReader mismatch, as
you very well mentioned above.
However, in the former case we can now end up with a pretty generic
exception from findSubtype that does not reveal the root cause:
**IllegalArgumentException("Field " + first + " not found in " +
current.toString())**
I believe the best thing to do here (even if it's a bit ugly) is to capture
the exception and rethrow it with an appropriate message.
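A minimal, stand-alone sketch of that capture-and-rethrow idea (the stand-in method, class name, and wrapper message below are illustrative, not the actual ORC code):

```java
// Sketch only: findSubtypeStandIn mimics TypeDescription.findSubtype,
// which throws a generic IllegalArgumentException for a missing field.
public class FindColumnSketch {

  static int findSubtypeStandIn(String columnName) {
    // Generic message that does not reveal the root cause.
    throw new IllegalArgumentException(
        "Field " + columnName + " not found in struct<...>");
  }

  // Capture the generic exception and rethrow with a message that names
  // the filter column, so the consumer knows why the filter was rejected.
  static int findColumns(String columnName) {
    try {
      return findSubtypeStandIn(columnName);
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException(
          "Filter column '" + columnName
              + "' is not present in the reader schema", e);
    }
  }
}
```

Passing the original exception as the cause keeps the full stack trace available while the outer message carries the root cause.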
##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -104,17 +104,10 @@
*/
static int findColumns(SchemaEvolution evolution,
Review comment:
I believe it is safe to assume that the column should always be part of
the readerSchema; if that's not the case and there is a column/schema
mismatch, the caller (Spark/Hive) will be responsible for handling the
RuntimeException.
I like this approach because it's cleaner (and distinguishes between the two
cases), so +1 for that -- but let's first make sure we add the behavior change
as part of the method doc.
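For the method doc, something along these lines could capture the behavior change (the wording is a suggestion, not the actual ORC javadoc):

```java
/**
 * Finds the id of the file-schema column that matches the given
 * reader-schema column name.
 *
 * @param evolution the mapping between reader and file schema
 * @param columnName the column name to look up in the reader schema
 * @return the file column id, or -1 if the column exists in the reader
 *         schema but is missing from the file
 * @throws IllegalArgumentException if the column is not part of the
 *         reader schema (previously this case also returned -1)
 */
static int findColumns(SchemaEvolution evolution, String columnName)
```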
##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -104,17 +104,10 @@
*/
static int findColumns(SchemaEvolution evolution,
String columnName) {
- try {
-      TypeDescription readerColumn = evolution.getReaderBaseSchema().findSubtype(
- columnName, evolution.isSchemaEvolutionCaseAware);
- TypeDescription fileColumn = evolution.getFileType(readerColumn);
- return fileColumn == null ? -1 : fileColumn.getId();
- } catch (IllegalArgumentException e) {
- if (LOG.isDebugEnabled()){
- LOG.debug("{}", e.getMessage());
- }
- return -1;
- }
+ TypeDescription readerColumn = evolution.getReaderBaseSchema().findSubtype(
+ columnName, evolution.isSchemaEvolutionCaseAware);
+ TypeDescription fileColumn = evolution.getFileType(readerColumn);
+ return fileColumn == null ? -1 : fileColumn.getId();
Review comment:
+1 on the above -- let's add a new test or extend
testSchemaEvolutionMissingColumn, where the "missing" column is not part of
schema evolution, and assert for the expected exception.
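A self-contained sketch of the assertion shape such a test could take (the real test would go through an ORC reader with a filter on a column absent from the reader schema; the helper below is only a stand-in for that lookup):

```java
import java.util.Set;

// Stand-in for the lookup: the real test would exercise
// RecordReaderImpl.findColumns via a reader whose filter names a column
// that is not part of the reader schema.
public class MissingFilterColumnSketch {

  static int findColumnsStandIn(Set<String> readerSchema, String col) {
    if (!readerSchema.contains(col)) {
      throw new IllegalArgumentException(
          "Field " + col + " not found in reader schema");
    }
    return 0;
  }

  // Expected-exception check, JUnit-free for the sketch; a real test
  // would use assertThrows and inspect the exception message.
  static boolean throwsForMissingColumn(String col) {
    try {
      findColumnsStandIn(Set.of("x", "y"), col);
      return false;
    } catch (IllegalArgumentException e) {
      return e.getMessage().contains(col);
    }
  }
}
```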
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]