Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-26 Thread via GitHub


codope merged PR #9834:
URL: https://github.com/apache/hudi/pull/9834


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-26 Thread via GitHub


hudi-bot commented on PR #9834:
URL: https://github.com/apache/hudi/pull/9834#issuecomment-1781570288

   
   ## CI report:
   
   * de2b5c95028029ff06d1f360763ca3f83c661ff3 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20497)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-26 Thread via GitHub


hudi-bot commented on PR #9834:
URL: https://github.com/apache/hudi/pull/9834#issuecomment-1781558140

   
   ## CI report:
   
   * de2b5c95028029ff06d1f360763ca3f83c661ff3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9834:
URL: https://github.com/apache/hudi/pull/9834#issuecomment-1780446730

   
   ## CI report:
   
   * de2b5c95028029ff06d1f360763ca3f83c661ff3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20497)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9834:
URL: https://github.com/apache/hudi/pull/9834#issuecomment-1780406926

   
   ## CI report:
   
   * d28ebc812328746cb530a35db70df43e67c6ffc2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20250)
 
   * de2b5c95028029ff06d1f360763ca3f83c661ff3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20497)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9834:
URL: https://github.com/apache/hudi/pull/9834#issuecomment-1780377139

   
   ## CI report:
   
   * d28ebc812328746cb530a35db70df43e67c6ffc2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20250)
 
   * de2b5c95028029ff06d1f360763ca3f83c661ff3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-25 Thread via GitHub


harsh1231 commented on code in PR #9834:
URL: https://github.com/apache/hudi/pull/9834#discussion_r1372547228


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/SanitizationUtils.java:
##
@@ -120,6 +120,11 @@ public static Dataset 
sanitizeColumnNamesForAvro(Dataset inputDataset,
 return targetDataset;
   }
 
+  public static Dataset sanitizeColumnNamesForAvro(Dataset 
inputDataset, TypedProperties props) {

Review Comment:
   This test has coverage for above method 
https://github.com/apache/hudi/blob/de2b5c95028029ff06d1f360763ca3f83c661ff3/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestSourceFormatAdapter.java#L131



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-10 Thread via GitHub


codope commented on code in PR #9834:
URL: https://github.com/apache/hudi/pull/9834#discussion_r1351799159


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/SanitizationUtils.java:
##
@@ -120,6 +120,11 @@ public static Dataset 
sanitizeColumnNamesForAvro(Dataset inputDataset,
 return targetDataset;
   }
 
+  public static Dataset sanitizeColumnNamesForAvro(Dataset 
inputDataset, TypedProperties props) {

Review Comment:
   Is there a unit test for this util method?



##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/RowSource.java:
##
@@ -42,9 +43,10 @@ public RowSource(TypedProperties props, JavaSparkContext 
sparkContext, SparkSess
   protected final InputBatch> fetchNewData(Option 
lastCkptStr, long sourceLimit) {
 Pair>, String> res = fetchNextBatch(lastCkptStr, 
sourceLimit);
 return res.getKey().map(dsr -> {
+  Dataset sanitizedDSR = 
SanitizationUtils.sanitizeColumnNamesForAvro(dsr, props);

Review Comment:
   What's DSR? Let's avoid abbreviations in variable names unless it's too 
verbose? Maybe, here the variable name can simply be `sanitizedRows`



##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/SanitizationUtils.java:
##
@@ -120,6 +120,11 @@ public static Dataset 
sanitizeColumnNamesForAvro(Dataset inputDataset,
 return targetDataset;
   }
 
+  public static Dataset sanitizeColumnNamesForAvro(Dataset 
inputDataset, TypedProperties props) {
+return getShouldSanitize(props) ? sanitizeColumnNamesForAvro(inputDataset, 
getInvalidCharMask(props))

Review Comment:
   rename `getShouldSanitize` to `shouldSanitize`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-09 Thread via GitHub


hudi-bot commented on PR #9834:
URL: https://github.com/apache/hudi/pull/9834#issuecomment-1753904489

   
   ## CI report:
   
   * d28ebc812328746cb530a35db70df43e67c6ffc2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20250)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-09 Thread via GitHub


hudi-bot commented on PR #9834:
URL: https://github.com/apache/hudi/pull/9834#issuecomment-1753411489

   
   ## CI report:
   
   * bd94b2cc68c5c339394bcaf9b077093eb6e9f18e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20243)
 
   * d28ebc812328746cb530a35db70df43e67c6ffc2 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20250)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-09 Thread via GitHub


hudi-bot commented on PR #9834:
URL: https://github.com/apache/hudi/pull/9834#issuecomment-1753400924

   
   ## CI report:
   
   * bd94b2cc68c5c339394bcaf9b077093eb6e9f18e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20243)
 
   * d28ebc812328746cb530a35db70df43e67c6ffc2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-09 Thread via GitHub


harsh1231 commented on code in PR #9834:
URL: https://github.com/apache/hudi/pull/9834#discussion_r1350564789


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestSourceFormatAdapter.java:
##
@@ -136,16 +146,23 @@ public void testJsonSanitization(String 
unsanitizedDataFile, String sanitizedDat
 
   public static class TestRowDataSource extends Source> {
 private final InputBatch> batch;
+private final SanitizationUtils rowSanitizer;
 
 public TestRowDataSource(TypedProperties props, JavaSparkContext 
sparkContext, SparkSession sparkSession,
  SchemaProvider schemaProvider, 
InputBatch> batch) {
   super(props, sparkContext, sparkSession, schemaProvider, SourceType.ROW);
   this.batch = batch;
+  this.rowSanitizer = new SanitizationUtils();
 }
 
 @Override
 protected InputBatch> fetchNewData(Option 
lastCkptStr, long sourceLimit) {
-  return batch;
+  return batch.getBatch().map(dsr -> {
+Dataset tranformed = 
SanitizationUtils.sanitizeColumnNamesForAvro(dsr, getInvalidCharMask(props));

Review Comment:
   added to cover actual RowSource . 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-09 Thread via GitHub


the-other-tim-brown commented on code in PR #9834:
URL: https://github.com/apache/hudi/pull/9834#discussion_r1350528910


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestSourceFormatAdapter.java:
##
@@ -136,16 +146,23 @@ public void testJsonSanitization(String 
unsanitizedDataFile, String sanitizedDat
 
   public static class TestRowDataSource extends Source> {
 private final InputBatch> batch;
+private final SanitizationUtils rowSanitizer;
 
 public TestRowDataSource(TypedProperties props, JavaSparkContext 
sparkContext, SparkSession sparkSession,
  SchemaProvider schemaProvider, 
InputBatch> batch) {
   super(props, sparkContext, sparkSession, schemaProvider, SourceType.ROW);
   this.batch = batch;
+  this.rowSanitizer = new SanitizationUtils();
 }
 
 @Override
 protected InputBatch> fetchNewData(Option 
lastCkptStr, long sourceLimit) {
-  return batch;
+  return batch.getBatch().map(dsr -> {
+Dataset tranformed = 
SanitizationUtils.sanitizeColumnNamesForAvro(dsr, getInvalidCharMask(props));

Review Comment:
   Can we add some testing that executes the new path in RowSource then?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-09 Thread via GitHub


harsh1231 commented on code in PR #9834:
URL: https://github.com/apache/hudi/pull/9834#discussion_r1350521264


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestSourceFormatAdapter.java:
##
@@ -136,16 +146,23 @@ public void testJsonSanitization(String 
unsanitizedDataFile, String sanitizedDat
 
   public static class TestRowDataSource extends Source> {
 private final InputBatch> batch;
+private final SanitizationUtils rowSanitizer;
 
 public TestRowDataSource(TypedProperties props, JavaSparkContext 
sparkContext, SparkSession sparkSession,
  SchemaProvider schemaProvider, 
InputBatch> batch) {
   super(props, sparkContext, sparkSession, schemaProvider, SourceType.ROW);
   this.batch = batch;
+  this.rowSanitizer = new SanitizationUtils();
 }
 
 @Override
 protected InputBatch> fetchNewData(Option 
lastCkptStr, long sourceLimit) {
-  return batch;
+  return batch.getBatch().map(dsr -> {
+Dataset tranformed = 
SanitizationUtils.sanitizeColumnNamesForAvro(dsr, getInvalidCharMask(props));

Review Comment:
   We are bypassing RowSource in Unit test . 
   SourceFormatAdapter is not doing any sanitization . 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-09 Thread via GitHub


the-other-tim-brown commented on code in PR #9834:
URL: https://github.com/apache/hudi/pull/9834#discussion_r1350512605


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestSourceFormatAdapter.java:
##
@@ -136,16 +146,23 @@ public void testJsonSanitization(String 
unsanitizedDataFile, String sanitizedDat
 
   public static class TestRowDataSource extends Source> {
 private final InputBatch> batch;
+private final SanitizationUtils rowSanitizer;
 
 public TestRowDataSource(TypedProperties props, JavaSparkContext 
sparkContext, SparkSession sparkSession,
  SchemaProvider schemaProvider, 
InputBatch> batch) {
   super(props, sparkContext, sparkSession, schemaProvider, SourceType.ROW);
   this.batch = batch;
+  this.rowSanitizer = new SanitizationUtils();
 }
 
 @Override
 protected InputBatch> fetchNewData(Option 
lastCkptStr, long sourceLimit) {
-  return batch;
+  return batch.getBatch().map(dsr -> {
+Dataset tranformed = 
SanitizationUtils.sanitizeColumnNamesForAvro(dsr, getInvalidCharMask(props));

Review Comment:
   or should we have this test class just override `fetchNextBatch`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-09 Thread via GitHub


the-other-tim-brown commented on code in PR #9834:
URL: https://github.com/apache/hudi/pull/9834#discussion_r1350509191


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestSourceFormatAdapter.java:
##
@@ -136,16 +146,23 @@ public void testJsonSanitization(String 
unsanitizedDataFile, String sanitizedDat
 
   public static class TestRowDataSource extends Source> {
 private final InputBatch> batch;
+private final SanitizationUtils rowSanitizer;
 
 public TestRowDataSource(TypedProperties props, JavaSparkContext 
sparkContext, SparkSession sparkSession,
  SchemaProvider schemaProvider, 
InputBatch> batch) {
   super(props, sparkContext, sparkSession, schemaProvider, SourceType.ROW);
   this.batch = batch;
+  this.rowSanitizer = new SanitizationUtils();
 }
 
 @Override
 protected InputBatch> fetchNewData(Option 
lastCkptStr, long sourceLimit) {
-  return batch;
+  return batch.getBatch().map(dsr -> {
+Dataset tranformed = 
SanitizationUtils.sanitizeColumnNamesForAvro(dsr, getInvalidCharMask(props));

Review Comment:
   Why do we want the sanitizer here? Shouldn't we be checking that the source 
format adapter is doing the sanitization?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-09 Thread via GitHub


hudi-bot commented on PR #9834:
URL: https://github.com/apache/hudi/pull/9834#issuecomment-1752743925

   
   ## CI report:
   
   * bd94b2cc68c5c339394bcaf9b077093eb6e9f18e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20243)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-09 Thread via GitHub


hudi-bot commented on PR #9834:
URL: https://github.com/apache/hudi/pull/9834#issuecomment-1752569635

   
   ## CI report:
   
   * bd94b2cc68c5c339394bcaf9b077093eb6e9f18e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20243)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-09 Thread via GitHub


hudi-bot commented on PR #9834:
URL: https://github.com/apache/hudi/pull/9834#issuecomment-1752555349

   
   ## CI report:
   
   * bd94b2cc68c5c339394bcaf9b077093eb6e9f18e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org