rdblue commented on code in PR #9366:
URL: https://github.com/apache/iceberg/pull/9366#discussion_r1439065448
##########
core/src/main/java/org/apache/iceberg/avro/GenericAvroReader.java:
##########
@@ -69,62 +99,107 @@ public void setRowPositionSupplier(Supplier<Long> posSupplier) {
   @Override
   public T read(T reuse, Decoder decoder) throws IOException {
-    return DecoderResolver.resolveAndRead(decoder, readSchema, fileSchema, reader, reuse);
+    return reader.read(decoder, reuse);
   }
-  private static class ReadBuilder extends AvroSchemaVisitor<ValueReader<?>> {
-    private final ClassLoader loader;
+  private class ResolvingReadBuilder extends AvroWithPartnerVisitor<Type, ValueReader<?>> {
+    private final Map<Type, Schema> avroSchemas;
-    private ReadBuilder(ClassLoader loader) {
-      this.loader = loader;
+    private ResolvingReadBuilder(Types.StructType expectedType, String rootName) {
+      this.avroSchemas = AvroSchemaUtil.convertTypes(expectedType, rootName);
     }
     @Override
-    @SuppressWarnings("unchecked")
-    public ValueReader<?> record(Schema record, List<String> names, List<ValueReader<?>> fields) {
-      try {
-        Class<?> recordClass =
-            DynClasses.builder().loader(loader).impl(record.getFullName()).buildChecked();
-        if (IndexedRecord.class.isAssignableFrom(recordClass)) {
-          return ValueReaders.record(fields, (Class<? extends IndexedRecord>) recordClass, record);
+    public ValueReader<?> record(Type partner, Schema record, List<ValueReader<?>> fieldResults) {
+      Types.StructType expected = partner != null ? partner.asStructType() : null;
+      Map<Integer, Integer> idToPos = idToPos(expected);
+
+      List<Pair<Integer, ValueReader<?>>> readPlan = Lists.newArrayList();
+      List<Schema.Field> fileFields = record.getFields();
+      for (int pos = 0; pos < fileFields.size(); pos += 1) {
+        Schema.Field field = fileFields.get(pos);
+        ValueReader<?> fieldReader = fieldResults.get(pos);
+        Integer fieldId = AvroSchemaUtil.fieldId(field);
+        Integer projectionPos = idToPos.remove(fieldId);
+
+        Object constant = idToConstant.get(fieldId);
+        if (projectionPos != null && constant != null) {
+          readPlan.add(Pair.of(projectionPos, ValueReaders.replaceWithConstant(fieldReader, constant)));
+        } else {
+          readPlan.add(Pair.of(projectionPos, fieldReader));
+        }
+      }
+
+      // handle any expected columns that are not in the data file
+      for (Map.Entry<Integer, Integer> idAndPos : idToPos.entrySet()) {
+        int fieldId = idAndPos.getKey();
+        int pos = idAndPos.getValue();
+
+        Object constant = idToConstant.get(fieldId);
+        Types.NestedField field = expected.field(fieldId);
+        if (constant != null) {
+          readPlan.add(Pair.of(pos, ValueReaders.constant(constant)));
+        } else if (fieldId == MetadataColumns.IS_DELETED.fieldId()) {

Review Comment:
Yes, I think this is better than how we did it before. Previously, we would [inject these fields in the `StructReader`](https://github.com/apache/iceberg/pull/9366/files#diff-08597bac062ea934a49ebeb66da72f459dea1da642e1df2cdd2c8ca6884fed4dL578-L586), but there are improvements that we can make to that approach:
1. It isn't the struct reader's responsibility to skip or change the readers that are passed to it. Read "planning" should be done here, where the read and file schemas are both present.
2. It wasn't clear what readers should be passed in or produced here, given that readers might be replaced.
3. This unifies how new optional fields are handled with how row position and metadata fields are handled. It also sets up future default value handling using a constant reader, which is one of the reasons for making these changes.
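To make the planning idea concrete, here is a minimal sketch of a struct reader that consumes a read plan of (projection position, reader) pairs like the one built above. It is illustrative only, not the PR's actual implementation: the class name `PlannedStructReader` and the `reuseOrCreate`/`set` hooks are assumptions introduced for the example, while `ValueReader`, `Pair`, and `Decoder` are the existing Iceberg and Avro types used in the diff.

```java
// Illustrative sketch, not the PR implementation. Assumes Iceberg's
// ValueReader<T> (read(Decoder, Object)) and Pair (first()/second());
// PlannedStructReader, reuseOrCreate, and set are hypothetical names.
import java.io.IOException;
import java.util.List;

import org.apache.avro.io.Decoder;
import org.apache.iceberg.avro.ValueReader;
import org.apache.iceberg.util.Pair;

abstract class PlannedStructReader<S> implements ValueReader<S> {
  // each plan entry maps a file column to its projected position; a null
  // position means the column is decoded to keep the stream aligned but
  // its value is discarded
  private final List<Pair<Integer, ValueReader<?>>> readPlan;

  protected PlannedStructReader(List<Pair<Integer, ValueReader<?>>> readPlan) {
    this.readPlan = readPlan;
  }

  @Override
  public S read(Decoder decoder, Object reuse) throws IOException {
    S struct = reuseOrCreate(reuse);
    for (Pair<Integer, ValueReader<?>> planned : readPlan) {
      Object value = planned.second().read(decoder, null);
      if (planned.first() != null) {
        // projected field: constants, metadata columns, and missing optional
        // fields all flow through this same path because planning already
        // substituted the appropriate readers
        set(struct, planned.first(), value);
      }
    }
    return struct;
  }

  protected abstract S reuseOrCreate(Object reuse);

  protected abstract void set(S struct, int pos, Object value);
}
```

Framed this way, the struct reader only walks the plan; deciding which readers to replace with constants (row position, the `IS_DELETED` metadata column, or future default values) stays in the planning step, where both the file schema and the expected schema are available.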