Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]

2024-06-11 Thread via GitHub


nsivabalan merged PR #11381:
URL: https://github.com/apache/hudi/pull/11381


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]

2024-06-09 Thread via GitHub


the-other-tim-brown commented on code in PR #11381:
URL: https://github.com/apache/hudi/pull/11381#discussion_r1632525375


##
hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java:
##
@@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema 
incomingSchema, InternalSche
   typeChange.updateColumnType(col, inComingInternalSchema.findType(col));
 });
 
+// mark columns missing from incoming schema as nullable
+Set visited = new HashSet<>();
+diffFromOldSchema.stream()
+// ignore meta fields
+.filter(col -> !META_FIELD_NAMES.contains(col))
+.sorted()
+.forEach(col -> {
+  // if parent is marked as nullable, only update the parent and not 
all the missing children field
+  String parent = TableChangesHelper.getParentName(col);
+  if (!visited.contains(parent)) {
+typeChange.updateColumnNullability(col, true);
+  }
+  visited.add(col);
+});

Review Comment:
   @nsivabalan I've updated the PR to include the boolean



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]

2024-06-09 Thread via GitHub


hudi-bot commented on PR #11381:
URL: https://github.com/apache/hudi/pull/11381#issuecomment-2157083778

   
   ## CI report:
   
   * 0d1802d42d4b67cc791cbd8d8c4619dd7a52d319 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24320)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]

2024-06-09 Thread via GitHub


hudi-bot commented on PR #11381:
URL: https://github.com/apache/hudi/pull/11381#issuecomment-2156963475

   
   ## CI report:
   
   * 7ac5620ea218b34184ba918f6197339f2f695eb9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24317)
 
   * 0d1802d42d4b67cc791cbd8d8c4619dd7a52d319 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24320)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]

2024-06-09 Thread via GitHub


hudi-bot commented on PR #11381:
URL: https://github.com/apache/hudi/pull/11381#issuecomment-2156937303

   
   ## CI report:
   
   * 7ac5620ea218b34184ba918f6197339f2f695eb9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24317)
 
   * 0d1802d42d4b67cc791cbd8d8c4619dd7a52d319 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]

2024-06-09 Thread via GitHub


hudi-bot commented on PR #11381:
URL: https://github.com/apache/hudi/pull/11381#issuecomment-2156821392

   
   ## CI report:
   
   * 7ac5620ea218b34184ba918f6197339f2f695eb9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24317)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]

2024-06-09 Thread via GitHub


hudi-bot commented on PR #11381:
URL: https://github.com/apache/hudi/pull/11381#issuecomment-2156819182

   
   ## CI report:
   
   * abd7d6583d8d4469ded1cf054c395e86e14eeef0 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24190)
 
   * 7ac5620ea218b34184ba918f6197339f2f695eb9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]

2024-06-04 Thread via GitHub


nsivabalan commented on code in PR #11381:
URL: https://github.com/apache/hudi/pull/11381#discussion_r1625868147


##
hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java:
##
@@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema 
incomingSchema, InternalSche
   typeChange.updateColumnType(col, inComingInternalSchema.findType(col));
 });
 
+// mark columns missing from incoming schema as nullable
+Set visited = new HashSet<>();
+diffFromOldSchema.stream()
+// ignore meta fields
+.filter(col -> !META_FIELD_NAMES.contains(col))
+.sorted()
+.forEach(col -> {
+  // if parent is marked as nullable, only update the parent and not 
all the missing children field
+  String parent = TableChangesHelper.getParentName(col);
+  if (!visited.contains(parent)) {
+typeChange.updateColumnNullability(col, true);
+  }
+  visited.add(col);
+});

Review Comment:
   yes, I prefer adding a boolean arg to avoid mis-use of the method. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]

2024-06-04 Thread via GitHub


nsivabalan commented on code in PR #11381:
URL: https://github.com/apache/hudi/pull/11381#discussion_r1625866918


##
hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java:
##
@@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema 
incomingSchema, InternalSche
   typeChange.updateColumnType(col, inComingInternalSchema.findType(col));
 });
 
+// mark columns missing from incoming schema as nullable
+Set visited = new HashSet<>();

Review Comment:
   gotcha.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]

2024-06-02 Thread via GitHub


the-other-tim-brown commented on code in PR #11381:
URL: https://github.com/apache/hudi/pull/11381#discussion_r1623759939


##
hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java:
##
@@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema 
incomingSchema, InternalSche
   typeChange.updateColumnType(col, inComingInternalSchema.findType(col));
 });
 
+// mark columns missing from incoming schema as nullable
+Set visited = new HashSet<>();
+diffFromOldSchema.stream()
+// ignore meta fields
+.filter(col -> !META_FIELD_NAMES.contains(col))
+.sorted()
+.forEach(col -> {
+  // if parent is marked as nullable, only update the parent and not 
all the missing children field
+  String parent = TableChangesHelper.getParentName(col);
+  if (!visited.contains(parent)) {
+typeChange.updateColumnNullability(col, true);
+  }
+  visited.add(col);
+});

Review Comment:
   This actually requires updates in quite a few places, I will wait until 
there is alignment and then I can update



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]

2024-06-02 Thread via GitHub


the-other-tim-brown commented on code in PR #11381:
URL: https://github.com/apache/hudi/pull/11381#discussion_r1623757086


##
hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java:
##
@@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema 
incomingSchema, InternalSche
   typeChange.updateColumnType(col, inComingInternalSchema.findType(col));
 });
 
+// mark columns missing from incoming schema as nullable
+Set visited = new HashSet<>();

Review Comment:
   @nsivabalan the current behavior will set null values for the missing data 
in the rows themselves but it does not set the column as nullable in the 
schema, that's what this patch is aiming to fix.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]

2024-06-02 Thread via GitHub


the-other-tim-brown commented on code in PR #11381:
URL: https://github.com/apache/hudi/pull/11381#discussion_r1623756369


##
hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java:
##
@@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema 
incomingSchema, InternalSche
   typeChange.updateColumnType(col, inComingInternalSchema.findType(col));
 });
 
+// mark columns missing from incoming schema as nullable
+Set visited = new HashSet<>();

Review Comment:
   I like using `incoming` and `tableSchema` personally, I can change this 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]

2024-06-02 Thread via GitHub


the-other-tim-brown commented on code in PR #11381:
URL: https://github.com/apache/hudi/pull/11381#discussion_r1623755994


##
hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java:
##
@@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema 
incomingSchema, InternalSche
   typeChange.updateColumnType(col, inComingInternalSchema.findType(col));
 });
 
+// mark columns missing from incoming schema as nullable
+Set visited = new HashSet<>();
+diffFromOldSchema.stream()
+// ignore meta fields
+.filter(col -> !META_FIELD_NAMES.contains(col))
+.sorted()
+.forEach(col -> {
+  // if parent is marked as nullable, only update the parent and not 
all the missing children field
+  String parent = TableChangesHelper.getParentName(col);
+  if (!visited.contains(parent)) {
+typeChange.updateColumnNullability(col, true);
+  }
+  visited.add(col);
+});

Review Comment:
   We can pass that through as a boolean, does that sound good to you?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]

2024-06-02 Thread via GitHub


nsivabalan commented on code in PR #11381:
URL: https://github.com/apache/hudi/pull/11381#discussion_r1623751675


##
hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java:
##
@@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema 
incomingSchema, InternalSche
   typeChange.updateColumnType(col, inComingInternalSchema.findType(col));
 });
 
+// mark columns missing from incoming schema as nullable
+Set visited = new HashSet<>();
+diffFromOldSchema.stream()
+// ignore meta fields
+.filter(col -> !META_FIELD_NAMES.contains(col))
+.sorted()
+.forEach(col -> {
+  // if parent is marked as nullable, only update the parent and not 
all the missing children field
+  String parent = TableChangesHelper.getParentName(col);
+  if (!visited.contains(parent)) {
+typeChange.updateColumnNullability(col, true);
+  }
+  visited.add(col);
+});

Review Comment:
   I do see that we call this only when the config value is true within 
HoodieSchemaUtils. but I am trying to be future proof and avoid mis-use of the 
method. 
   



##
hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java:
##
@@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema 
incomingSchema, InternalSche
   typeChange.updateColumnType(col, inComingInternalSchema.findType(col));
 });
 
+// mark columns missing from incoming schema as nullable
+Set visited = new HashSet<>();
+diffFromOldSchema.stream()
+// ignore meta fields
+.filter(col -> !META_FIELD_NAMES.contains(col))
+.sorted()
+.forEach(col -> {
+  // if parent is marked as nullable, only update the parent and not 
all the missing children field
+  String parent = TableChangesHelper.getParentName(col);
+  if (!visited.contains(parent)) {
+typeChange.updateColumnNullability(col, true);
+  }
+  visited.add(col);
+});

Review Comment:
   hey @the-other-tim-brown : where is the check for 
"hoodie.write.set.null.for.missing.columns". I don't see it being used or 
checked within AvroSchemaEvolutionUtils.reconcileSchema() 
   
   I am wondering how can we be sure that all callers will be calling this 
method only when "hoodie.write.set.null.for.missing.columns" is set to true.
   



##
hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java:
##
@@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema 
incomingSchema, InternalSche
   typeChange.updateColumnType(col, inComingInternalSchema.findType(col));
 });
 
+// mark columns missing from incoming schema as nullable
+Set visited = new HashSet<>();

Review Comment:
   hey @jonvex : when we added support for 
"hoodie.write.set.null.for.missing.columns", how did we miss this. the java 
docs for this method does call out that nulls will be injected for missing 
cols. but prior to this patch, I don't see that happening. 
   



##
hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java:
##
@@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema 
incomingSchema, InternalSche
   typeChange.updateColumnType(col, inComingInternalSchema.findType(col));
 });
 
+// mark columns missing from incoming schema as nullable
+Set visited = new HashSet<>();

Review Comment:
   do you think we can rename 2 variable names to make it more comprehensible 
or w/o ambiguity. 
   
   diffFromOldSchema -> missingFromIncoming
   diffFromEvolutionColumns -> additionsToTableSchema or additionsToOldSchema 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]

2024-06-02 Thread via GitHub


hudi-bot commented on PR #11381:
URL: https://github.com/apache/hudi/pull/11381#issuecomment-2144206868

   
   ## CI report:
   
   * abd7d6583d8d4469ded1cf054c395e86e14eeef0 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24190)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]

2024-06-02 Thread via GitHub


hudi-bot commented on PR #11381:
URL: https://github.com/apache/hudi/pull/11381#issuecomment-2144163955

   
   ## CI report:
   
   * abd7d6583d8d4469ded1cf054c395e86e14eeef0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24190)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]

2024-06-02 Thread via GitHub


hudi-bot commented on PR #11381:
URL: https://github.com/apache/hudi/pull/11381#issuecomment-2144158079

   
   ## CI report:
   
   * abd7d6583d8d4469ded1cf054c395e86e14eeef0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org