hudi-agent commented on code in PR #18375:
URL: https://github.com/apache/hudi/pull/18375#discussion_r3198725216
##########
hudi-common/src/main/java/org/apache/hudi/common/table/read/UpdateProcessor.java:
##########
@@ -135,19 +139,59 @@ protected BufferedRecord<T> handleNonDeletes(BufferedRecord<T> previousRecord, B
     if (previousRecord == null) {
       // special case for payloads when there is no previous record
       HoodieSchema recordSchema = readerContext.getRecordContext().decodeAvroSchema(mergedRecord.getSchemaId());
-      GenericRecord record = readerContext.getRecordContext().convertToAvroRecord(mergedRecord.getRecord(), recordSchema);
-      HoodieAvroRecord hoodieRecord = new HoodieAvroRecord<>(null, HoodieRecordUtils.loadPayload(payloadClass, record, mergedRecord.getOrderingValue()));
-      try {
-        if (hoodieRecord.shouldIgnore(recordSchema, properties)) {
-          return null;
-        } else {
-          HoodieSchema readerSchema = readerContext.getSchemaHandler().getRequestedSchema();
-          // If the record schema is different from the reader schema, rewrite the record using the payload methods to ensure consistency with legacy writer paths
-          hoodieRecord.rewriteRecordWithNewSchema(recordSchema, properties, readerSchema).toIndexedRecord(readerSchema, properties)
-              .ifPresent(rewrittenRecord -> mergedRecord.replaceRecord(readerContext.getRecordContext().convertAvroRecord(rewrittenRecord.getData())));
+      GenericRecord originalAvro = mergedRecord.getOriginalAvroRecord();
Review Comment:
🤖 nit: with the new originalAvro branch, `handleNonDeletes` is now ~50 lines
and four levels deep (if → else → try → if/else). Could you extract the else
branch into a helper like `rewriteAndReplaceFromPayload(mergedRecord,
recordSchema)` that returns the boolean "emit/suppress" decision? That would
let the top of the method read as a clean two-way switch on `originalAvro !=
null`.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
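A rough shape for the suggested extraction (a simplified, self-contained sketch; `MergedRecord`, `rewriteAndReplaceFromPayload`, and the stub fields are hypothetical stand-ins, not the actual Hudi classes):

```java
// Simplified sketch of the refactor suggested above: the top of
// handleNonDeletes becomes a two-way switch on originalAvro != null, and the
// payload path moves into a helper returning the emit/suppress decision.
class HandleNonDeletesSketch {

  static final class MergedRecord {
    Object originalAvro;   // non-null when extraction already produced an Avro record
    boolean rewritten;     // set when the payload path replaced the record
  }

  // Helper owning the payload path: true = emit the record, false = suppress it.
  static boolean rewriteAndReplaceFromPayload(MergedRecord merged) {
    // ...load payload, check shouldIgnore(...), rewrite with the reader schema...
    merged.rewritten = true;
    return true;
  }

  // The top of the method now reads as a clean two-way switch.
  static MergedRecord handleNonDeletes(MergedRecord merged) {
    if (merged.originalAvro != null) {
      return merged;   // decode directly from the original Avro record
    }
    return rewriteAndReplaceFromPayload(merged) ? merged : null;
  }
}
```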
##########
hudi-common/src/main/java/org/apache/hudi/common/engine/ExtractedData.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.engine;
+
+import org.apache.avro.generic.GenericRecord;
+
+import javax.annotation.Nullable;
+
+/**
+ * Result of {@link RecordContext#extractDataFromRecord} carrying the engine-native row plus,
+ * when extraction went through an Avro payload (e.g. {@code ExpressionPayload} in SQL MERGE INTO),
+ * the original Avro record produced by {@code HoodieRecord#toIndexedRecord}.
+ *
+ * <p>The original Avro record is propagated to {@link org.apache.hudi.common.table.read.BufferedRecord}
+ * so downstream code (e.g. {@code UpdateProcessor}) can decode the payload bytes with the correct
+ * source schema instead of re-serializing the engine-native row through a possibly-mismatched schema.
+ */
+public final class ExtractedData<T> {
+
+ private final T data;
+ @Nullable
+ private final GenericRecord originalAvroRecord;
+
+ private ExtractedData(T data, @Nullable GenericRecord originalAvroRecord) {
+ this.data = data;
+ this.originalAvroRecord = originalAvroRecord;
+ }
+
+ public static <T> ExtractedData<T> of(T data) {
+ return new ExtractedData<>(data, null);
+ }
+
+  public static <T> ExtractedData<T> of(T data, @Nullable GenericRecord originalAvroRecord) {
Review Comment:
🤖 nit: the two static factories `of(T)` and `of(T, GenericRecord)` differ only
by the optional avro hint. Could you drop the single-arg overload and just have
callers pass `null` (or keep the two-arg `of(T, GenericRecord)` and give the
no-hint variant a clearly distinct name, e.g. `withoutAvroHint(T)`)? As written,
picking between them at call sites is a small distraction since `of(x)` and
`of(x, null)` produce identical objects.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
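For reference, the single-factory shape could look like the following (a self-contained sketch; `Object` stands in for `GenericRecord` so the example compiles on its own, and the class name is hypothetical):

```java
// Self-contained sketch of the single-factory variant; Object stands in
// for org.apache.avro.generic.GenericRecord.
final class ExtractedDataSketch<T> {
  private final T data;
  private final Object originalAvroRecord;  // nullable Avro hint

  private ExtractedDataSketch(T data, Object originalAvroRecord) {
    this.data = data;
    this.originalAvroRecord = originalAvroRecord;
  }

  // Single factory: callers with no Avro hint simply pass null, which makes
  // the "no hint" case explicit at every call site instead of hiding it
  // behind an overload that produces an identical object.
  static <T> ExtractedDataSketch<T> of(T data, Object originalAvroRecord) {
    return new ExtractedDataSketch<>(data, originalAvroRecord);
  }

  T getData() { return data; }
  Object getOriginalAvroRecord() { return originalAvroRecord; }
}
```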
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieReadHandle.java:
##########
@@ -68,14 +70,18 @@ protected HoodieBaseFile getLatestBaseFile() {
}
protected HoodieFileReader createNewFileReader() throws IOException {
- return HoodieIOFactory.getIOFactory(hoodieTable.getStorage())
- .getReaderFactory(this.config.getRecordMerger().getRecordType())
- .getFileReader(config, getLatestBaseFile().getStoragePath());
+ return createNewFileReader(getLatestBaseFile());
}
   protected HoodieFileReader createNewFileReader(HoodieBaseFile hoodieBaseFile) throws IOException {
+    String extension = hoodieBaseFile.getStoragePath().getFileExtension();
+    HoodieFileFormat format = HoodieFileFormat.fromFileExtensionOrNull(extension);
+    HoodieRecord.HoodieRecordType mergerRecordType = this.config.getRecordMerger().getRecordType();
+    HoodieRecord.HoodieRecordType recordType = format != null
Review Comment:
🤖 nit: this same `extension → fromFileExtensionOrNull → resolveRecordType,
fall back to merger type` snippet now appears here, in
`HoodieIndexUtils.filterKeysFromFile`, and in
`HoodieFileWriterFactory.getFileWriter`. Could you fold it into a static helper
on `HoodieFileFormat` (e.g. `resolveRecordTypeForExtension(String,
HoodieRecordType fallback)`)? The three call sites would become a one-liner and
stay in lockstep.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
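One possible shape for the shared helper (a self-contained sketch with simplified stand-ins for `HoodieFileFormat` and `HoodieRecordType`; the extension strings and the `LANCE`-specific record type are illustrative assumptions, not the real Hudi values):

```java
// Simplified stand-in for HoodieFileFormat demonstrating the suggested
// resolveRecordTypeForExtension helper; the record types and extensions
// here are illustrative only.
enum FileFormatSketch {
  PARQUET(".parquet"), ORC(".orc"), HFILE(".hfile"), LANCE(".lance");

  enum RecordType { AVRO, SPARK, LANCE_NATIVE }

  private final String extension;

  FileFormatSketch(String extension) {
    this.extension = extension;
  }

  static FileFormatSketch fromFileExtensionOrNull(String ext) {
    for (FileFormatSketch format : values()) {
      if (format.extension.equals(ext)) {
        return format;
      }
    }
    return null;
  }

  // The one-liner each of the three call sites would use: resolve the record
  // type for a file extension, falling back to the merger's record type when
  // the extension is unknown or imposes no format-native type.
  static RecordType resolveRecordTypeForExtension(String ext, RecordType fallback) {
    FileFormatSketch format = fromFileExtensionOrNull(ext);
    return format == LANCE ? RecordType.LANCE_NATIVE : fallback;
  }
}
```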
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -327,7 +327,11 @@ public class HoodieWriteConfig extends HoodieConfig {
public static final ConfigProperty<HoodieFileFormat> BASE_FILE_FORMAT =
ConfigProperty
.key("hoodie.base.file.format")
.defaultValue(HoodieFileFormat.PARQUET)
-      .withValidValues(HoodieFileFormat.PARQUET.name(), HoodieFileFormat.ORC.name(), HoodieFileFormat.HFILE.name())
+ .withValidValues(
+ HoodieFileFormat.PARQUET.name(),
+ HoodieFileFormat.ORC.name(),
+ HoodieFileFormat.HFILE.name(),
+ HoodieFileFormat.LANCE.name())
Review Comment:
_⚠️ Potential issue_ | _🟠 Major_
**LANCE is now accepted here, but `getMaxFileSize()` still rejects it.**
Lines 2369-2379 in this same class still only handle `PARQUET`, `ORC`, and
`HFILE` before throwing. After advertising `HoodieFileFormat.LANCE` as a valid
base format, any path that derives max file size from
`config.getBaseFileFormat()` can now fail at runtime for Lance tables. Please
either add a `LANCE` branch there or block `LANCE` until its sizing behavior is
defined.
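A minimal sketch of the two options described above (stand-in enum and byte sizes; the real method reads per-format config keys, and any Lance-specific sizing key is an assumption since it does not exist yet):

```java
// Stand-in for the getMaxFileSize switch; sizes and the Lance branch are
// illustrative assumptions, not real Hudi config values.
class MaxFileSizeSketch {
  enum Format { PARQUET, ORC, HFILE, LANCE }

  static long getMaxFileSize(Format format) {
    switch (format) {
      case PARQUET:
      case ORC:
      case HFILE:
        return 120L * 1024 * 1024;   // existing per-format max-file-size keys
      case LANCE:
        // Option 1: add a branch backed by a new lance-specific sizing key.
        return 120L * 1024 * 1024;
      default:
        // Option 2 (for unhandled formats): fail fast with a clear message so
        // the format cannot be selected until its sizing behavior is defined.
        throw new IllegalStateException("Max file size undefined for format: " + format);
    }
  }
}
```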
<details>
<summary>🤖 Prompt for AI Agents</summary>
```
Verify each finding against the current code and only fix it if needed.
In `@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java`
around lines 322 - 326, The config now advertises HoodieFileFormat.LANCE as a
valid base format but getMaxFileSize (in class HoodieWriteConfig) still only
handles PARQUET, ORC and HFILE and will throw for LANCE; update getMaxFileSize
to either add a LANCE branch with the appropriate sizing policy (e.g., derive
from config.getBaseFileFormat() or a new lance-specific config key) or
explicitly reject/validate HoodieFileFormat.LANCE earlier so it cannot be used
until sizing behavior is defined; locate the getMaxFileSize method and the
usages of config.getBaseFileFormat() and implement the LANCE case or add a
clear validation/error message preventing LANCE selection.
```
</details>
<!-- This is an auto-generated comment by CodeRabbit -->
— *CodeRabbit*
([original](https://github.com/hudi-agent/hudi/pull/3#discussion_r3108757025))
(source:comment#3108757025)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]