pgaref commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564480514
##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -231,11 +224,9 @@ protected RecordReaderImpl(ReaderImpl fileReader,
if (options.getPreFilterColumnNames() != null) {
for (String colName : options.getPreFilterColumnNames()) {
int expandColId = findColumns(evolution, colName);
+      // If the column is not present in the file then this can be ignored from read.
if (expandColId != -1) {
filterColIds.add(expandColId);
- } else {
Review comment:
The intention of the original code here was to notify the consumer that
the filter cannot be applied and could lead to wrong results (keep in mind
that row-filtering is not best-effort like row-group skipping: the consumer
expects only the rows that actually pass the filter).
This was also a side-effect of the existing code, which made no distinction
between a column missing from the schema and a schema/fileReader mismatch, as
you very well mentioned above.
However, in the former case we can now end up with a pretty generic
exception from findSubtype that does not reveal the root cause:
**IllegalArgumentException("Field " + first + " not found in " +
current.toString())**
I believe the best thing to do here (even if it's a bit ugly) is to capture
the exception and rethrow it with an appropriate message.
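A minimal, stand-alone sketch of that capture-and-rethrow idea (the stand-in method, class name, and wrapper message below are illustrative, not the actual ORC code):

```java
// Sketch only: findSubtypeStandIn mimics TypeDescription.findSubtype,
// which throws a generic IllegalArgumentException for a missing field.
public class FindColumnSketch {

  static int findSubtypeStandIn(String columnName) {
    // Generic message that does not reveal the root cause.
    throw new IllegalArgumentException(
        "Field " + columnName + " not found in struct<...>");
  }

  // Capture the generic exception and rethrow with a message that names
  // the filter column, so the consumer knows why the filter was rejected.
  static int findColumns(String columnName) {
    try {
      return findSubtypeStandIn(columnName);
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException(
          "Filter column '" + columnName
              + "' is not present in the reader schema", e);
    }
  }
}
```

Passing the original exception as the cause keeps the full stack trace available while the outer message carries the root cause.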
##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -104,17 +104,10 @@
*/
static int findColumns(SchemaEvolution evolution,
Review comment:
I believe it is safe to assume that the column should always be part of
the readerSchema; if that's not the case and there is a column/schema
mismatch, the caller (Spark/Hive) will be responsible for handling the
RuntimeException.
I like this approach because it's cleaner (and distinguishes between the two
cases), so +1 for that -- but let's first make sure we add the behavior change
as part of the method doc.
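For the method doc, something along these lines could capture the behavior change (the wording is a suggestion, not the actual ORC javadoc):

```java
/**
 * Finds the id of the file-schema column that matches the given
 * reader-schema column name.
 *
 * @param evolution the mapping between reader and file schema
 * @param columnName the column name to look up in the reader schema
 * @return the file column id, or -1 if the column exists in the reader
 *         schema but is missing from the file
 * @throws IllegalArgumentException if the column is not part of the
 *         reader schema (previously this case also returned -1)
 */
static int findColumns(SchemaEvolution evolution, String columnName)
```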
##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -104,17 +104,10 @@
*/
static int findColumns(SchemaEvolution evolution,
String columnName) {
- try {
-      TypeDescription readerColumn = evolution.getReaderBaseSchema().findSubtype(
- columnName, evolution.isSchemaEvolutionCaseAware);
- TypeDescription fileColumn = evolution.getFileType(readerColumn);
- return fileColumn == null ? -1 : fileColumn.getId();
- } catch (IllegalArgumentException e) {
- if (LOG.isDebugEnabled()){
- LOG.debug("{}", e.getMessage());
- }
- return -1;
- }
+ TypeDescription readerColumn = evolution.getReaderBaseSchema().findSubtype(
+ columnName, evolution.isSchemaEvolutionCaseAware);
+ TypeDescription fileColumn = evolution.getFileType(readerColumn);
+ return fileColumn == null ? -1 : fileColumn.getId();
Review comment:
+1 on the above -- let's add a new test or extend
testSchemaEvolutionMissingColumn, where the "missing" column is not part of
schema evolution, and assert for the expected exception.
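A self-contained sketch of the assertion shape such a test could take (the real test would go through an ORC reader with a filter on a column absent from the reader schema; the helper below is only a stand-in for that lookup):

```java
import java.util.Set;

// Stand-in for the lookup: the real test would exercise
// RecordReaderImpl.findColumns via a reader whose filter names a column
// that is not part of the reader schema.
public class MissingFilterColumnSketch {

  static int findColumnsStandIn(Set<String> readerSchema, String col) {
    if (!readerSchema.contains(col)) {
      throw new IllegalArgumentException(
          "Field " + col + " not found in reader schema");
    }
    return 0;
  }

  // Expected-exception check, JUnit-free for the sketch; a real test
  // would use assertThrows and inspect the exception message.
  static boolean throwsForMissingColumn(String col) {
    try {
      findColumnsStandIn(Set.of("x", "y"), col);
      return false;
    } catch (IllegalArgumentException e) {
      return e.getMessage().contains(col);
    }
  }
}
```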
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]