voonhous commented on code in PR #18837:
URL: https://github.com/apache/hudi/pull/18837#discussion_r3340193901


##########
hudi-trino-plugin/src/main/java/io/trino/plugin/hudi/HudiUtil.java:
##########
@@ -321,37 +341,33 @@ public static Schema.Field getFieldFromSchema(String 
columnName, Schema schema)
                 "Failed to get column " + columnName + " from table schema");
     }
 
-    public static List<HiveColumnHandle> 
prependHudiMetaColumns(List<HiveColumnHandle> dataColumns)
+    public static List<HiveColumnHandle> 
prependHudiMetaAndOrderingColumns(HudiTableHandle tableHandle, 
List<HiveColumnHandle> dataColumns)
     {
-        //For efficient lookup
-        Set<String> dataColumnNames = dataColumns.stream()
+        Set<String> existingColumns = dataColumns.stream()
                 .map(HiveColumnHandle::getName)
-                .collect(Collectors.toSet());
-
-        // If all Hudi meta columns are already present, return the original 
list
-        if (dataColumnNames.containsAll(HOODIE_META_COLUMNS)) {
-            return dataColumns;
-        }
-
-        // Identify only the meta columns that are missing from dataColumns to 
avoid duplicates
-        List<String> missingMetaColumns = HOODIE_META_COLUMNS.stream()
-                .filter(metaColumn -> !dataColumnNames.contains(metaColumn))
-                .toList();
+                .collect(Collectors.toCollection(HashSet::new));
 
         List<HiveColumnHandle> columns = new ArrayList<>();
 
-        // Create and prepend the new HiveColumnHandles for the missing meta 
columns
-        columns.addAll(IntStream.range(0, missingMetaColumns.size())
-                .boxed()
-                .map(i -> new HiveColumnHandle(
-                        missingMetaColumns.get(i),
+        // Add missing Hudi meta columns first
+        for (int i = 0; i < HOODIE_META_COLUMNS.size(); i++) {
+            String metaColumn = HOODIE_META_COLUMNS.get(i);
+            if (existingColumns.add(metaColumn)) { // add() returns false if 
already present
+                columns.add(new HiveColumnHandle(
+                        metaColumn,
                         i,

Review Comment:
   Only a logical identifier here - it is never used for ordering/position 
lookups, so the skipped-duplicate case does not cause a mismatch. Traced 
through the page-source path:
   
   - Default resolution is by name: `shouldUseParquetColumnNames` defaults to 
`true`, so `createPageSource` resolves columns via `getBaseColumnName()` and 
never consults `baseHiveColumnIndex`.
   - The by-index path remaps by name anyway: when `useColumnNames=false`, 
`remapColumnIndicesToPhysical` overwrites each handle index with the physical 
index looked up by column name from the file schema, so the `i` assigned here 
is discarded before it is used as an index.
   - Output column order is driven by list position (the order handles are 
added to the list), and `constructSchema` builds the requested schema from 
column names - not from `baseHiveColumnIndex`.
   
   So skipping a duplicate meta column cannot produce a positional mismatch.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to