hudi-agent commented on code in PR #18877:
URL: https://github.com/apache/hudi/pull/18877#discussion_r3408854284
##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/ClusteringOperator.java:
##########
@@ -169,6 +169,8 @@ public void open() throws Exception {
this.table = writeClient.getHoodieTable();
this.schema = HoodieSchemaConverter.convertToSchema(rowType);
+ // update write schema to include metadata fields.
Review Comment:
🤖 nit: the comment says "to include metadata fields" but the schema being
set here comes straight from `HoodieSchemaConverter.convertToSchema(rowType)` —
no `addMetadataFields` call in sight. Could you clarify what this comment is
actually describing? As written it reads like the schema contains metadata
fields when it may not, which could mislead whoever touches this next.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/storage/row/RowDataParquetWriteSupport.java:
##########
@@ -38,21 +36,21 @@
*/
public class RowDataParquetWriteSupport extends WriteSupport<RowData> {
- private final RowType rowType;
- private final MessageType schema;
+ protected final HoodieSchema schema;
Review Comment:
🤖 nit: the field used to be named `hoodieSchema` as the constructor
parameter, but now both the field and the parameter are just `schema`. Right
next to `fileSchema` (the Parquet `MessageType`), the unqualified `schema` is a
bit ambiguous — could you keep the name `hoodieSchema` for the field to make it
immediately clear which schema type each field holds?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/util/HoodieSchemaConverter.java:
##########
@@ -238,6 +279,30 @@ public static HoodieSchema convertToSchema(LogicalType
logicalType, String rowNa
return nullable ? HoodieSchema.createNullable(schema) : schema;
}
+ /**
+ * Validates that all configured VECTOR columns resolve to top-level fields
of the row schema.
+ *
+ * <p>Invoked by {@link #convertToSchema(LogicalType, String, String)}
before schema inference so
+ * an unknown column is rejected instead of being silently ignored during
conversion.
+ *
+ * @param logicalType Flink logical type
+ * @param vectorColumns parsed vector columns (normalized column name to
dimension), may be empty
+ */
+ private static void validateVectorColumns(LogicalType logicalType,
Map<String, Integer> vectorColumns) {
+ if (vectorColumns.isEmpty()) {
+ return;
+ }
+ List<String> normalizedFieldNames = ((RowType)
logicalType).getFieldNames().stream()
+ .map(fieldName -> fieldName.toLowerCase(Locale.ROOT))
+ .collect(Collectors.toList());
+ vectorColumns.keySet().stream()
+ .filter(vectorColumn -> !normalizedFieldNames.contains(vectorColumn))
+ .findFirst()
+ .ifPresent(vectorColumn -> {
Review Comment:
🤖 nit: using `ifPresent(x -> { throw ... })` to signal an error is
non-idiomatic — `ifPresent` reads as a "do something with a value" operation,
not an exception gate. Have you considered `findFirst().ifPresent(col -> ...)`
→ just storing the Optional and throwing explicitly if `isPresent()`? Something
like `Optional<String> unknown = ...; if (unknown.isPresent()) throw new
IllegalArgumentException(...)` makes the control flow immediately obvious.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]