[GitHub] [iceberg] wypoon commented on a change in pull request #1508: Use schema at the time of the snapshot when reading a snapshot.

GitBox Wed, 07 Oct 2020 22:51:18 -0700


wypoon commented on a change in pull request #1508:
URL: https://github.com/apache/iceberg/pull/1508#discussion_r501463178




##########
File path: spark2/src/main/java/org/apache/iceberg/spark/source/Reader.java
##########
@@ -165,13 +165,23 @@
           TableProperties.PARQUET_BATCH_SIZE, 
TableProperties.PARQUET_BATCH_SIZE_DEFAULT));
   }
 
+  private Schema getTableSchema() {
+    if (snapshotId != null) {
+      return table.schemaForSnapshot(snapshotId);
+    } else if (asOfTimestamp != null) {
+      return table.schemaForSnapshotAsOfTime(asOfTimestamp);
+    } else {
+      return table.schema();
+    }
+  }
+
   private Schema lazySchema() {
     if (schema == null) {
       if (requestedSchema != null) {
         // the projection should include all columns that will be returned, 
including those only used in filters
-        this.schema = SparkSchemaUtil.prune(table.schema(), requestedSchema, 
filterExpression(), caseSensitive);
+        this.schema = SparkSchemaUtil.prune(getTableSchema(), requestedSchema, 
filterExpression(), caseSensitive);
       } else {
-        this.schema = table.schema();
+        this.schema = getTableSchema();

Review comment:
       This is the only point I don't understand. As far as I can tell, 
`BaseTableScan` is constructed with the user's requested schema. If the user is 
wanting to read from an old snapshot, then they should be supplying a requested 
schema compatible with the snapshot's schema. I couldn't see anything I needed 
to change in `BaseTableScan`. If I'm missing something here, can you please 
point out to me the parts I need to change?
   
   I added a new test case where a requested schema is used, and that revealed 
a bug in the spark2 `IcebergSource#createReader`. In that method, there is a 
call to `SparkSchemaUtil.convert(Schema, StructType)` whose only purpose seems 
to be to check that the requested `StructType` does not contain fields in the 
`Schema` (for then the call would cause an exception), and here I needed to use 
the snapshot `Schema` rather than the table `Schema`. Other than that, the 
projection returned by the scan appears to be correct.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] wypoon commented on a change in pull request #1508: Use schema at the time of the snapshot when reading a snapshot.

Reply via email to