yihua commented on code in PR #18478:
URL: https://github.com/apache/hudi/pull/18478#discussion_r3051177842


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java:
##########
@@ -130,6 +131,12 @@ public HoodieData<WriteStatus> compact(
     TaskContextSupplier taskContextSupplier = table.getTaskContextSupplier();
     // if this is a MDT, set up the instant range of log reader just like regular MDT snapshot reader.
     Option<InstantRange> instantRange = CompactHelpers.getInstance().getInstantRange(metaClient);
+    // Since we are using merge handle here, we can directly query the write schema from conf
+    // Write handle provides an option to use overridden write schema as well which is not used by merge handle
+    Schema writerSchema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(config.getWriteSchema()), config.allowOperationMetadataField());

Review Comment:
   🤖 Same gap as before: `hasTimestampMillisField` returns true only for MILLIS-precision timestamps, while `needsLogicalTypeRepair` also handles the case where the file schema is a plain LONG but the table schema declares `LocalTimestampMicros` (non-UTC-adjusted). If a write schema contains `LocalTimestampMicros` fields, the flag stays unset and the downstream parquet read support could skip a repair it actually needs. Could you either broaden this check to cover every timestamp logical type that `needsLogicalTypeRepair` handles, or add a comment explaining why `LocalTimestampMicros` cannot appear here?
   
   🤖 *This review comment was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.*



##########
hudi-hadoop-common/src/main/java/org/apache/hudi/io/hadoop/HoodieAvroParquetReader.java:
##########
@@ -172,7 +173,12 @@ private ClosableIterator<IndexedRecord> getIndexedRecordIteratorInternal(Schema
     //       sure that in case the file-schema is not equal to read-schema we'd still
     //       be able to read that file (in case projection is a proper one)
     Configuration hadoopConf = storage.getConf().unwrapCopyAs(Configuration.class);
-    Schema repairedFileSchema = getRepairedSchema(getSchema(), schema);
+    Schema repairedFileSchema;

Review Comment:
   🤖 This read uses `defaultValue=true`, while the builder on line 192 uses `defaultValue=false`. When the conf flag isn't explicitly set (e.g., on reads that don't go through merge/compaction), the file schema gets repaired here but the parquet read support may skip its repair. Is the asymmetry intentional, i.e., is the Avro-level schema repair always safe and cheap enough to keep on by default, while the parquet-level repair is the expensive one being optimized away?
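   To make the asymmetry concrete, here is a hedged illustration using a hypothetical conf key (the PR's actual property name may differ): when the caller never sets the flag, the two layers resolve it to opposite values.

   ```java
   import org.apache.hadoop.conf.Configuration;

   public class RepairFlagDefaults {
     // Hypothetical key for illustration only; not the PR's real property name.
     private static final String REPAIR_KEY = "hoodie.timestamp.repair.needed";

     public static void main(String[] args) {
       Configuration conf = new Configuration();
       // The flag is never set, e.g., a read path outside merge/compaction.
       boolean avroSideRepairs = conf.getBoolean(REPAIR_KEY, true);     // mirrors defaultValue=true here
       boolean parquetSideRepairs = conf.getBoolean(REPAIR_KEY, false); // mirrors defaultValue=false on line 192
       // Prints avro=true parquet=false: the layers disagree when the flag is unset.
       System.out.println("avro=" + avroSideRepairs + " parquet=" + parquetSideRepairs);
     }
   }
   ```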
   
   🤖 *This review comment was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.*



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java:
##########
@@ -80,15 +80,20 @@ public void runMerge(HoodieTable<?, ?, ?, ?> table,
     HoodieBaseFile baseFile = mergeHandle.baseFileForMerge();
 
     HoodieRecord.HoodieRecordType recordType = table.getConfig().getRecordMerger().getRecordType();
+    AvroSchemaUtils.setLogicalTimestampRepairNeeded(table.getStorageConf());

Review Comment:
   🤖 `setLogicalTimestampRepairNeeded` is called unconditionally here, so the optimization (skipping the repair when the schema has no timestamp fields) never applies on the merge path. Was this intentional? If so, a brief comment explaining why the merge path can't benefit from the optimization (e.g., the writer schema isn't available at this point, or the base file schema could differ) would help.
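   If the writer schema is in fact resolvable here (the open question above), a minimal sketch of making the call conditional; `hasAnyTimestampField` is a hypothetical recursive check like the one sketched in the first comment, and this fragment would slot into `runMerge`:

   ```java
   // Sketch only: assumes config.getWriteSchema() is the right schema to
   // inspect on the merge path, which this comment questions.
   Schema writerSchema = new Schema.Parser().parse(table.getConfig().getWriteSchema());
   if (hasAnyTimestampField(writerSchema)) { // hypothetical helper
     AvroSchemaUtils.setLogicalTimestampRepairNeeded(table.getStorageConf());
   }
   ```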
   
   🤖 *This review comment was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.*



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
