Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]
nsivabalan merged PR #11381: URL: https://github.com/apache/hudi/pull/11381 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]
the-other-tim-brown commented on code in PR #11381: URL: https://github.com/apache/hudi/pull/11381#discussion_r1632525375 ## hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java: ## @@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema incomingSchema, InternalSche typeChange.updateColumnType(col, inComingInternalSchema.findType(col)); }); +// mark columns missing from incoming schema as nullable +Set visited = new HashSet<>(); +diffFromOldSchema.stream() +// ignore meta fields +.filter(col -> !META_FIELD_NAMES.contains(col)) +.sorted() +.forEach(col -> { + // if parent is marked as nullable, only update the parent and not all the missing children field + String parent = TableChangesHelper.getParentName(col); + if (!visited.contains(parent)) { +typeChange.updateColumnNullability(col, true); + } + visited.add(col); +}); Review Comment: @nsivabalan I've updated the PR to include the boolean -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]
hudi-bot commented on PR #11381: URL: https://github.com/apache/hudi/pull/11381#issuecomment-2157083778 ## CI report: * 0d1802d42d4b67cc791cbd8d8c4619dd7a52d319 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24320) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]
hudi-bot commented on PR #11381: URL: https://github.com/apache/hudi/pull/11381#issuecomment-2156963475 ## CI report: * 7ac5620ea218b34184ba918f6197339f2f695eb9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24317) * 0d1802d42d4b67cc791cbd8d8c4619dd7a52d319 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24320) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]
hudi-bot commented on PR #11381: URL: https://github.com/apache/hudi/pull/11381#issuecomment-2156937303 ## CI report: * 7ac5620ea218b34184ba918f6197339f2f695eb9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24317) * 0d1802d42d4b67cc791cbd8d8c4619dd7a52d319 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]
hudi-bot commented on PR #11381: URL: https://github.com/apache/hudi/pull/11381#issuecomment-2156821392 ## CI report: * 7ac5620ea218b34184ba918f6197339f2f695eb9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24317) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]
hudi-bot commented on PR #11381: URL: https://github.com/apache/hudi/pull/11381#issuecomment-2156819182 ## CI report: * abd7d6583d8d4469ded1cf054c395e86e14eeef0 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24190) * 7ac5620ea218b34184ba918f6197339f2f695eb9 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]
nsivabalan commented on code in PR #11381: URL: https://github.com/apache/hudi/pull/11381#discussion_r1625868147 ## hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java: ## @@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema incomingSchema, InternalSche typeChange.updateColumnType(col, inComingInternalSchema.findType(col)); }); +// mark columns missing from incoming schema as nullable +Set visited = new HashSet<>(); +diffFromOldSchema.stream() +// ignore meta fields +.filter(col -> !META_FIELD_NAMES.contains(col)) +.sorted() +.forEach(col -> { + // if parent is marked as nullable, only update the parent and not all the missing children field + String parent = TableChangesHelper.getParentName(col); + if (!visited.contains(parent)) { +typeChange.updateColumnNullability(col, true); + } + visited.add(col); +}); Review Comment: yes, I prefer adding a boolean arg to avoid mis-use of the method. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]
nsivabalan commented on code in PR #11381: URL: https://github.com/apache/hudi/pull/11381#discussion_r1625866918 ## hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java: ## @@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema incomingSchema, InternalSche typeChange.updateColumnType(col, inComingInternalSchema.findType(col)); }); +// mark columns missing from incoming schema as nullable +Set visited = new HashSet<>(); Review Comment: gotcha. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]
the-other-tim-brown commented on code in PR #11381: URL: https://github.com/apache/hudi/pull/11381#discussion_r1623759939 ## hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java: ## @@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema incomingSchema, InternalSche typeChange.updateColumnType(col, inComingInternalSchema.findType(col)); }); +// mark columns missing from incoming schema as nullable +Set visited = new HashSet<>(); +diffFromOldSchema.stream() +// ignore meta fields +.filter(col -> !META_FIELD_NAMES.contains(col)) +.sorted() +.forEach(col -> { + // if parent is marked as nullable, only update the parent and not all the missing children field + String parent = TableChangesHelper.getParentName(col); + if (!visited.contains(parent)) { +typeChange.updateColumnNullability(col, true); + } + visited.add(col); +}); Review Comment: This actually requires updates in quite a few places, I will wait until there is alignment and then I can update -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]
the-other-tim-brown commented on code in PR #11381: URL: https://github.com/apache/hudi/pull/11381#discussion_r1623757086 ## hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java: ## @@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema incomingSchema, InternalSche typeChange.updateColumnType(col, inComingInternalSchema.findType(col)); }); +// mark columns missing from incoming schema as nullable +Set visited = new HashSet<>(); Review Comment: @nsivabalan the current behavior will set null values for the missing data in the rows themselves but it does not set the column as nullable in the schema, that's what this patch is aiming to fix. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]
the-other-tim-brown commented on code in PR #11381: URL: https://github.com/apache/hudi/pull/11381#discussion_r1623756369 ## hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java: ## @@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema incomingSchema, InternalSche typeChange.updateColumnType(col, inComingInternalSchema.findType(col)); }); +// mark columns missing from incoming schema as nullable +Set visited = new HashSet<>(); Review Comment: I like using `incoming` and `tableSchema` personally, I can change this -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]
the-other-tim-brown commented on code in PR #11381: URL: https://github.com/apache/hudi/pull/11381#discussion_r1623755994 ## hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java: ## @@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema incomingSchema, InternalSche typeChange.updateColumnType(col, inComingInternalSchema.findType(col)); }); +// mark columns missing from incoming schema as nullable +Set visited = new HashSet<>(); +diffFromOldSchema.stream() +// ignore meta fields +.filter(col -> !META_FIELD_NAMES.contains(col)) +.sorted() +.forEach(col -> { + // if parent is marked as nullable, only update the parent and not all the missing children field + String parent = TableChangesHelper.getParentName(col); + if (!visited.contains(parent)) { +typeChange.updateColumnNullability(col, true); + } + visited.add(col); +}); Review Comment: We can pass that through as a boolean, does that sound good to you? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]
nsivabalan commented on code in PR #11381: URL: https://github.com/apache/hudi/pull/11381#discussion_r1623751675 ## hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java: ## @@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema incomingSchema, InternalSche typeChange.updateColumnType(col, inComingInternalSchema.findType(col)); }); +// mark columns missing from incoming schema as nullable +Set visited = new HashSet<>(); +diffFromOldSchema.stream() +// ignore meta fields +.filter(col -> !META_FIELD_NAMES.contains(col)) +.sorted() +.forEach(col -> { + // if parent is marked as nullable, only update the parent and not all the missing children field + String parent = TableChangesHelper.getParentName(col); + if (!visited.contains(parent)) { +typeChange.updateColumnNullability(col, true); + } + visited.add(col); +}); Review Comment: I do see that we call this only when the config value is true within HoodieSchemaUtils. but I am trying to be future proof and avoid mis-use of the method. ## hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java: ## @@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema incomingSchema, InternalSche typeChange.updateColumnType(col, inComingInternalSchema.findType(col)); }); +// mark columns missing from incoming schema as nullable +Set visited = new HashSet<>(); +diffFromOldSchema.stream() +// ignore meta fields +.filter(col -> !META_FIELD_NAMES.contains(col)) +.sorted() +.forEach(col -> { + // if parent is marked as nullable, only update the parent and not all the missing children field + String parent = TableChangesHelper.getParentName(col); + if (!visited.contains(parent)) { +typeChange.updateColumnNullability(col, true); + } + visited.add(col); +}); Review Comment: hey @the-other-tim-brown : where is the check for "hoodie.write.set.null.for.missing.columns". I don't see it being used or checked within AvroSchemaEvolutionUtils.reconcileSchema() I am wondering how can we be sure that all callers will be calling this method only when "hoodie.write.set.null.for.missing.columns" is set to true. ## hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java: ## @@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema incomingSchema, InternalSche typeChange.updateColumnType(col, inComingInternalSchema.findType(col)); }); +// mark columns missing from incoming schema as nullable +Set visited = new HashSet<>(); Review Comment: hey @jonvex : when we added support for "hoodie.write.set.null.for.missing.columns", how did we miss this. the java docs for this method does call out that nulls will be injected for missing cols. but prior to this patch, I don't see that happening. ## hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java: ## @@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema incomingSchema, InternalSche typeChange.updateColumnType(col, inComingInternalSchema.findType(col)); }); +// mark columns missing from incoming schema as nullable +Set visited = new HashSet<>(); Review Comment: do you think we can rename 2 variable names to make it more comprehensible or w/o ambiguity. diffFromOldSchema -> missingFromIncoming diffFromEvolutionColumns -> additionsToTableSchema or additionsToOldSchema -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]
hudi-bot commented on PR #11381: URL: https://github.com/apache/hudi/pull/11381#issuecomment-2144206868 ## CI report: * abd7d6583d8d4469ded1cf054c395e86e14eeef0 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24190) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]
hudi-bot commented on PR #11381: URL: https://github.com/apache/hudi/pull/11381#issuecomment-2144163955 ## CI report: * abd7d6583d8d4469ded1cf054c395e86e14eeef0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24190) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]
hudi-bot commented on PR #11381: URL: https://github.com/apache/hudi/pull/11381#issuecomment-2144158079 ## CI report: * abd7d6583d8d4469ded1cf054c395e86e14eeef0 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] [HUDI-7826] Make column nullable when setNullForMissingColumns is true [hudi]
the-other-tim-brown opened a new pull request, #11381: URL: https://github.com/apache/hudi/pull/11381 ### Change Logs When deducing the writer schema, we should mark any missing columns as nullable when hoodie.write.set.null.for.missing.columns is set to true ### Impact Allows smoother operation when upstream data source drops a column. This is useful when consuming from a CDC datasource that may have a column dropped from a schema. ### Risk level (write none, low medium or high below) Low ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none"._ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org