nsivabalan commented on code in PR #11381: URL: https://github.com/apache/hudi/pull/11381#discussion_r1623751675
########## hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java: ########## @@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema incomingSchema, InternalSche typeChange.updateColumnType(col, inComingInternalSchema.findType(col)); }); + // mark columns missing from incoming schema as nullable + Set<String> visited = new HashSet<>(); + diffFromOldSchema.stream() + // ignore meta fields + .filter(col -> !META_FIELD_NAMES.contains(col)) + .sorted() + .forEach(col -> { + // if parent is marked as nullable, only update the parent and not all the missing children field + String parent = TableChangesHelper.getParentName(col); + if (!visited.contains(parent)) { + typeChange.updateColumnNullability(col, true); + } + visited.add(col); + }); Review Comment: I do see that we call this only when the config value is true within HoodieSchemaUtils. but I am trying to be future proof and avoid mis-use of the method. ########## hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java: ########## @@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema incomingSchema, InternalSche typeChange.updateColumnType(col, inComingInternalSchema.findType(col)); }); + // mark columns missing from incoming schema as nullable + Set<String> visited = new HashSet<>(); + diffFromOldSchema.stream() + // ignore meta fields + .filter(col -> !META_FIELD_NAMES.contains(col)) + .sorted() + .forEach(col -> { + // if parent is marked as nullable, only update the parent and not all the missing children field + String parent = TableChangesHelper.getParentName(col); + if (!visited.contains(parent)) { + typeChange.updateColumnNullability(col, true); + } + visited.add(col); + }); Review Comment: hey @the-other-tim-brown : where is the check for "hoodie.write.set.null.for.missing.columns". I don't see it being used or checked within AvroSchemaEvolutionUtils.reconcileSchema() I am wondering how can we be sure that all callers will be calling this method only when "hoodie.write.set.null.for.missing.columns" is set to true. ########## hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java: ########## @@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema incomingSchema, InternalSche typeChange.updateColumnType(col, inComingInternalSchema.findType(col)); }); + // mark columns missing from incoming schema as nullable + Set<String> visited = new HashSet<>(); Review Comment: hey @jonvex : when we added support for "hoodie.write.set.null.for.missing.columns", how did we miss this. the java docs for this method does call out that nulls will be injected for missing cols. but prior to this patch, I don't see that happening. ########## hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java: ########## @@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema incomingSchema, InternalSche typeChange.updateColumnType(col, inComingInternalSchema.findType(col)); }); + // mark columns missing from incoming schema as nullable + Set<String> visited = new HashSet<>(); Review Comment: do you think we can rename 2 variable names to make it more comprehensible or w/o ambiguity. diffFromOldSchema -> missingFromIncoming diffFromEvolutionColumns -> additionsToTableSchema or additionsToOldSchema -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org