wombatu-kun commented on code in PR #18914:
URL: https://github.com/apache/hudi/pull/18914#discussion_r3411377231


##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/ErrorTableAwareChainedTransformer.java:
##########
@@ -55,8 +55,9 @@ public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession, Datas
     for (TransformerInfo transformerInfo : transformers) {
       Transformer transformer = transformerInfo.getTransformer();
       dataset = transformer.apply(jsc, sparkSession, dataset, transformerInfo.getProperties(properties, transformers));
-      // validate in every stage to ensure ErrorRecordColumn not dropped by one of the transformer and added by next transformer.
-      ErrorTableUtils.validate(dataset);
+      // Re-inject _corrupt_record if the transformer dropped it (e.g. custom JAR transformers
+      // that do column projection like ColumnFilter with mode=include).
+      dataset = ErrorTableUtils.addNullValueErrorTableCorruptRecordColumn(dataset);

Review Comment:
   Confirmed reachable in production: StreamSync applies the chain, then calls 
processErrorEvents with CUSTOM_TRANSFORMER_FAILURE, and that extraction in 
SourceFormatAdapter keys off _corrupt_record being non-null. If an earlier 
transformer marks rows and a later one projects the column away, re-injecting 
it as null here makes every row match the isNull filter in processErrorEvents. 
The marked rows then flow into the main write path: they are not just missing 
from the error table, they are silently written to the target table. The new 
test testCorruptRecordReInjectedAfterTransformerDropsIt sets up exactly this 
populate-then-drop case (t1 marks, t2 drops) yet asserts only column presence 
and count, so it passes whether or not the marked data survives. It would be 
safer to extract the column before re-injecting, or at minimum to log a WARN 
when a non-null column is dropped and to assert the row outcome in that test.
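
   To make the failure mode concrete, here is a minimal plain-Java sketch that models rows as maps instead of a Spark Dataset. It is not Hudi code: the class and every method in it (`mark`, `project`, `reinjectAsNull`, `selectByCorrupt`) are hypothetical stand-ins for t1, t2, the re-injection step, and the null/non-null split described above. The "safer" variant shows one possible reading of "extract before re-injecting", assuming error rows can be split off after each stage.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the populate-then-drop case; rows are maps, not Spark Rows.
public class CorruptRecordReinjectSketch {
    static final String CORRUPT = "_corrupt_record";

    // Three input rows with ids 1, -2, 3 and a null _corrupt_record column.
    static List<Map<String, Object>> sampleRows() {
        List<Map<String, Object>> rows = new ArrayList<>();
        for (int id : new int[] {1, -2, 3}) {
            Map<String, Object> row = new HashMap<>();
            row.put("id", id);
            row.put(CORRUPT, null);
            rows.add(row);
        }
        return rows;
    }

    // Stand-in for t1: marks rows with a negative id as corrupt.
    static List<Map<String, Object>> mark(List<Map<String, Object>> rows) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            Map<String, Object> copy = new HashMap<>(row);
            if (((Integer) copy.get("id")) < 0) {
                copy.put(CORRUPT, "bad id: " + copy.get("id"));
            }
            out.add(copy);
        }
        return out;
    }

    // Stand-in for t2: a projection that drops the _corrupt_record column.
    static List<Map<String, Object>> project(List<Map<String, Object>> rows) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            Map<String, Object> copy = new HashMap<>(row);
            copy.remove(CORRUPT);
            out.add(copy);
        }
        return out;
    }

    // Stand-in for the re-injection step: adds the column as null when it is missing.
    static List<Map<String, Object>> reinjectAsNull(List<Map<String, Object>> rows) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            Map<String, Object> copy = new HashMap<>(row);
            if (!copy.containsKey(CORRUPT)) {
                copy.put(CORRUPT, null);
            }
            out.add(copy);
        }
        return out;
    }

    // Splits rows by whether _corrupt_record is null (main path) or non-null (error path).
    static List<Map<String, Object>> selectByCorrupt(List<Map<String, Object>> rows, boolean wantNull) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            if ((row.get(CORRUPT) == null) == wantNull) {
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Naive chain: mark, drop, re-inject as null. The marked row is no longer distinguishable.
        List<Map<String, Object>> naive = reinjectAsNull(project(mark(sampleRows())));
        System.out.println("naive main path: " + selectByCorrupt(naive, true).size());
        System.out.println("naive error path: " + selectByCorrupt(naive, false).size());

        // Safer variant (assumption): split off error rows after t1, before t2 can drop the column.
        List<Map<String, Object>> marked = mark(sampleRows());
        List<Map<String, Object>> errors = selectByCorrupt(marked, false);
        List<Map<String, Object>> clean = reinjectAsNull(project(selectByCorrupt(marked, true)));
        System.out.println("safer main path: " + selectByCorrupt(clean, true).size());
        System.out.println("safer error path: " + errors.size());
    }
}
```

   In this model the naive chain sends all three rows, including the one t1 marked, down the main path and none to the error path, while the variant that extracts first keeps the marked row out of the main path. An assertion on those row outcomes, rather than on column presence and count alone, is the kind of check the test would need in order to fail on the regression.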



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
