Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
codope merged PR #9834: URL: https://github.com/apache/hudi/pull/9834 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
hudi-bot commented on PR #9834: URL: https://github.com/apache/hudi/pull/9834#issuecomment-1781570288 ## CI report: * de2b5c95028029ff06d1f360763ca3f83c661ff3 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20497) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
hudi-bot commented on PR #9834: URL: https://github.com/apache/hudi/pull/9834#issuecomment-1781558140 ## CI report: * de2b5c95028029ff06d1f360763ca3f83c661ff3 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
hudi-bot commented on PR #9834: URL: https://github.com/apache/hudi/pull/9834#issuecomment-1780446730 ## CI report: * de2b5c95028029ff06d1f360763ca3f83c661ff3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20497) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
hudi-bot commented on PR #9834: URL: https://github.com/apache/hudi/pull/9834#issuecomment-1780406926 ## CI report: * d28ebc812328746cb530a35db70df43e67c6ffc2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20250) * de2b5c95028029ff06d1f360763ca3f83c661ff3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20497) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
hudi-bot commented on PR #9834: URL: https://github.com/apache/hudi/pull/9834#issuecomment-1780377139 ## CI report: * d28ebc812328746cb530a35db70df43e67c6ffc2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20250) * de2b5c95028029ff06d1f360763ca3f83c661ff3 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
harsh1231 commented on code in PR #9834: URL: https://github.com/apache/hudi/pull/9834#discussion_r1372547228 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/SanitizationUtils.java: ## @@ -120,6 +120,11 @@ public static Dataset sanitizeColumnNamesForAvro(Dataset inputDataset, return targetDataset; } + public static Dataset sanitizeColumnNamesForAvro(Dataset inputDataset, TypedProperties props) { Review Comment: This test has coverage for above method https://github.com/apache/hudi/blob/de2b5c95028029ff06d1f360763ca3f83c661ff3/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestSourceFormatAdapter.java#L131 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
codope commented on code in PR #9834: URL: https://github.com/apache/hudi/pull/9834#discussion_r1351799159 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/SanitizationUtils.java: ## @@ -120,6 +120,11 @@ public static Dataset sanitizeColumnNamesForAvro(Dataset inputDataset, return targetDataset; } + public static Dataset sanitizeColumnNamesForAvro(Dataset inputDataset, TypedProperties props) { Review Comment: Is there a unit test for this util method? ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/RowSource.java: ## @@ -42,9 +43,10 @@ public RowSource(TypedProperties props, JavaSparkContext sparkContext, SparkSess protected final InputBatch> fetchNewData(Option lastCkptStr, long sourceLimit) { Pair>, String> res = fetchNextBatch(lastCkptStr, sourceLimit); return res.getKey().map(dsr -> { + Dataset sanitizedDSR = SanitizationUtils.sanitizeColumnNamesForAvro(dsr, props); Review Comment: What's DSR? Let's avoid abbreviations in variable names unless it's too verbose? Maybe, here the variable name can simply be `sanitizedRows` ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/SanitizationUtils.java: ## @@ -120,6 +120,11 @@ public static Dataset sanitizeColumnNamesForAvro(Dataset inputDataset, return targetDataset; } + public static Dataset sanitizeColumnNamesForAvro(Dataset inputDataset, TypedProperties props) { +return getShouldSanitize(props) ? sanitizeColumnNamesForAvro(inputDataset, getInvalidCharMask(props)) Review Comment: rename `getShouldSanitize` to `shouldSanitize`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
hudi-bot commented on PR #9834: URL: https://github.com/apache/hudi/pull/9834#issuecomment-1753904489 ## CI report: * d28ebc812328746cb530a35db70df43e67c6ffc2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20250) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
hudi-bot commented on PR #9834: URL: https://github.com/apache/hudi/pull/9834#issuecomment-1753411489 ## CI report: * bd94b2cc68c5c339394bcaf9b077093eb6e9f18e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20243) * d28ebc812328746cb530a35db70df43e67c6ffc2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20250) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
hudi-bot commented on PR #9834: URL: https://github.com/apache/hudi/pull/9834#issuecomment-1753400924 ## CI report: * bd94b2cc68c5c339394bcaf9b077093eb6e9f18e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20243) * d28ebc812328746cb530a35db70df43e67c6ffc2 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
harsh1231 commented on code in PR #9834: URL: https://github.com/apache/hudi/pull/9834#discussion_r1350564789 ## hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestSourceFormatAdapter.java: ## @@ -136,16 +146,23 @@ public void testJsonSanitization(String unsanitizedDataFile, String sanitizedDat public static class TestRowDataSource extends Source> { private final InputBatch> batch; +private final SanitizationUtils rowSanitizer; public TestRowDataSource(TypedProperties props, JavaSparkContext sparkContext, SparkSession sparkSession, SchemaProvider schemaProvider, InputBatch> batch) { super(props, sparkContext, sparkSession, schemaProvider, SourceType.ROW); this.batch = batch; + this.rowSanitizer = new SanitizationUtils(); } @Override protected InputBatch> fetchNewData(Option lastCkptStr, long sourceLimit) { - return batch; + return batch.getBatch().map(dsr -> { +Dataset tranformed = SanitizationUtils.sanitizeColumnNamesForAvro(dsr, getInvalidCharMask(props)); Review Comment: added to cover actual RowSource . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
the-other-tim-brown commented on code in PR #9834: URL: https://github.com/apache/hudi/pull/9834#discussion_r1350528910 ## hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestSourceFormatAdapter.java: ## @@ -136,16 +146,23 @@ public void testJsonSanitization(String unsanitizedDataFile, String sanitizedDat public static class TestRowDataSource extends Source> { private final InputBatch> batch; +private final SanitizationUtils rowSanitizer; public TestRowDataSource(TypedProperties props, JavaSparkContext sparkContext, SparkSession sparkSession, SchemaProvider schemaProvider, InputBatch> batch) { super(props, sparkContext, sparkSession, schemaProvider, SourceType.ROW); this.batch = batch; + this.rowSanitizer = new SanitizationUtils(); } @Override protected InputBatch> fetchNewData(Option lastCkptStr, long sourceLimit) { - return batch; + return batch.getBatch().map(dsr -> { +Dataset tranformed = SanitizationUtils.sanitizeColumnNamesForAvro(dsr, getInvalidCharMask(props)); Review Comment: Can we add some testing that executes the new path in RowSource then? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
harsh1231 commented on code in PR #9834: URL: https://github.com/apache/hudi/pull/9834#discussion_r1350521264 ## hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestSourceFormatAdapter.java: ## @@ -136,16 +146,23 @@ public void testJsonSanitization(String unsanitizedDataFile, String sanitizedDat public static class TestRowDataSource extends Source> { private final InputBatch> batch; +private final SanitizationUtils rowSanitizer; public TestRowDataSource(TypedProperties props, JavaSparkContext sparkContext, SparkSession sparkSession, SchemaProvider schemaProvider, InputBatch> batch) { super(props, sparkContext, sparkSession, schemaProvider, SourceType.ROW); this.batch = batch; + this.rowSanitizer = new SanitizationUtils(); } @Override protected InputBatch> fetchNewData(Option lastCkptStr, long sourceLimit) { - return batch; + return batch.getBatch().map(dsr -> { +Dataset tranformed = SanitizationUtils.sanitizeColumnNamesForAvro(dsr, getInvalidCharMask(props)); Review Comment: We are bypassing RowSource in Unit test . SourceFormatAdapter is not doing any sanitization . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
the-other-tim-brown commented on code in PR #9834: URL: https://github.com/apache/hudi/pull/9834#discussion_r1350512605 ## hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestSourceFormatAdapter.java: ## @@ -136,16 +146,23 @@ public void testJsonSanitization(String unsanitizedDataFile, String sanitizedDat public static class TestRowDataSource extends Source> { private final InputBatch> batch; +private final SanitizationUtils rowSanitizer; public TestRowDataSource(TypedProperties props, JavaSparkContext sparkContext, SparkSession sparkSession, SchemaProvider schemaProvider, InputBatch> batch) { super(props, sparkContext, sparkSession, schemaProvider, SourceType.ROW); this.batch = batch; + this.rowSanitizer = new SanitizationUtils(); } @Override protected InputBatch> fetchNewData(Option lastCkptStr, long sourceLimit) { - return batch; + return batch.getBatch().map(dsr -> { +Dataset tranformed = SanitizationUtils.sanitizeColumnNamesForAvro(dsr, getInvalidCharMask(props)); Review Comment: or should we have this test class just override `fetchNextBatch`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
the-other-tim-brown commented on code in PR #9834: URL: https://github.com/apache/hudi/pull/9834#discussion_r1350509191 ## hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestSourceFormatAdapter.java: ## @@ -136,16 +146,23 @@ public void testJsonSanitization(String unsanitizedDataFile, String sanitizedDat public static class TestRowDataSource extends Source> { private final InputBatch> batch; +private final SanitizationUtils rowSanitizer; public TestRowDataSource(TypedProperties props, JavaSparkContext sparkContext, SparkSession sparkSession, SchemaProvider schemaProvider, InputBatch> batch) { super(props, sparkContext, sparkSession, schemaProvider, SourceType.ROW); this.batch = batch; + this.rowSanitizer = new SanitizationUtils(); } @Override protected InputBatch> fetchNewData(Option lastCkptStr, long sourceLimit) { - return batch; + return batch.getBatch().map(dsr -> { +Dataset tranformed = SanitizationUtils.sanitizeColumnNamesForAvro(dsr, getInvalidCharMask(props)); Review Comment: Why do we want the sanitizer here? Shouldn't we be checking that the source format adapter is doing the sanitization? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
hudi-bot commented on PR #9834: URL: https://github.com/apache/hudi/pull/9834#issuecomment-1752743925 ## CI report: * bd94b2cc68c5c339394bcaf9b077093eb6e9f18e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20243) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
hudi-bot commented on PR #9834: URL: https://github.com/apache/hudi/pull/9834#issuecomment-1752569635 ## CI report: * bd94b2cc68c5c339394bcaf9b077093eb6e9f18e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20243) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
hudi-bot commented on PR #9834: URL: https://github.com/apache/hudi/pull/9834#issuecomment-1752555349 ## CI report: * bd94b2cc68c5c339394bcaf9b077093eb6e9f18e UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org