[PR] [VL] Support mapping columns by index for ORC and Parquet files [incubator-gluten]

via GitHub Fri, 12 Sep 2025 12:30:10 -0700


kevinwilfong opened a new pull request, #10697:
URL: https://github.com/apache/incubator-gluten/pull/10697


   ## What changes are proposed in this pull request?
   
   In our data warehouse we support schema evolution by column index rather 
than by name. For example, if a Hive table has schema a, b, c but a partition 
has schema c, a, b, we don't reorder the partition's columns to match by name. 
Instead we read partition column c as table column a, partition column a as 
table column b, and partition column b as table column c.
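
   To make the two mapping modes concrete, here is a minimal Python sketch using the schemas from the example above. The function names are hypothetical and the code only models the semantics; it is not how Velox implements its readers. Treating a missing column as None is likewise an assumption of the sketch.

```python
def map_by_name(table_schema, file_schema, row):
    """Resolve each table column by finding the same name in the file."""
    by_name = dict(zip(file_schema, row))
    # Assumption of this sketch: a column absent from the file reads as None.
    return {col: by_name.get(col) for col in table_schema}


def map_by_index(table_schema, file_schema, row):
    """Resolve each table column by position; file column names are ignored."""
    # Assumption of this sketch: a column past the end of the file reads as None.
    return {
        col: (row[i] if i < len(row) else None)
        for i, col in enumerate(table_schema)
    }


table_schema = ["a", "b", "c"]
file_schema = ["c", "a", "b"]  # partition written with a different column order
row = [1, 2, 3]                # values in file order: c=1, a=2, b=3

by_name = map_by_name(table_schema, file_schema, row)
by_index = map_by_index(table_schema, file_schema, row)
```

   With name-based mapping, table column a receives the value stored under file column a. With index-based mapping it receives whatever sits in the first position of the file, which here is the value written as c.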
   
   Velox supports this when the HiveConfig options hive.orc.use-column-names 
and hive.parquet.use-column-names are set to false for ORC and Parquet files 
respectively. Gluten currently hard-codes both to true. This change adds two 
configs to Gluten's VeloxConfig, 
spark.gluten.sql.columnar.backend.velox.orcUseColumnNames and 
spark.gluten.sql.columnar.backend.velox.parquetUseColumnNames, and plumbs them 
through to the HiveConfig in Velox.
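
   As a rough illustration of how a job might opt in, the Python sketch below builds the two settings and renders them as `--conf` arguments for spark-submit. The config keys are the ones named in this PR. The helper names are hypothetical, and the sketch assumes that "false" selects index-based mapping, mirroring the Velox options.

```python
# Prefix shared by the two Gluten config keys introduced in this PR.
GLUTEN_VELOX_PREFIX = "spark.gluten.sql.columnar.backend.velox."


def index_mapping_conf(orc=True, parquet=True):
    """Build the settings that turn off name-based mapping per file format."""
    conf = {}
    if orc:
        conf[GLUTEN_VELOX_PREFIX + "orcUseColumnNames"] = "false"
    if parquet:
        conf[GLUTEN_VELOX_PREFIX + "parquetUseColumnNames"] = "false"
    return conf


def to_submit_args(conf):
    """Render a config dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(index_mapping_conf(orc=True, parquet=False))
```

   Keeping the two formats independent matters because a table can mix ORC and Parquet partitions, and only one of them may need index-based mapping.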
   
   In addition, we need to pass the full table schema to the HiveTableHandle, 
as this is how Velox determines the index of each column. I updated 
VeloxIteratorApi to set the FileSchema on the LocalFilesNodes it generates 
when index-based mapping is enabled for the file's format, and 
VeloxPlanConverter/SubstraitToVeloxPlan to propagate it to the 
HiveTableHandle when present.
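
   The per-split decision described above can be sketched in Python as follows. This models the logic only and is not the actual VeloxIteratorApi code: the function name is hypothetical, and defaulting an unset config to "true" is an assumption based on the value Gluten hard-coded before this change.

```python
ORC_KEY = "spark.gluten.sql.columnar.backend.velox.orcUseColumnNames"
PARQUET_KEY = "spark.gluten.sql.columnar.backend.velox.parquetUseColumnNames"


def needs_table_schema(file_format, conf):
    """Return True when splits of this format are read by column index,
    in which case the full table schema has to travel with the split."""
    key_by_format = {"orc": ORC_KEY, "parquet": PARQUET_KEY}
    key = key_by_format.get(file_format.lower())
    if key is None:
        # Other formats are unaffected by these two configs.
        return False
    # Assumption: an unset config behaves like "true" (map by name).
    return conf.get(key, "true").lower() == "false"


conf = {ORC_KEY: "false"}
orc_needs_schema = needs_table_schema("ORC", conf)
parquet_needs_schema = needs_table_schema("parquet", conf)
```

   Because the check depends on the format of each file rather than of the table, it can only be answered once the splits are known.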
   
   Note that I considered setting the schema once in the ReadRel rather than 
in each LocalFilesNode. However, that meant we could no longer read from 
tables containing column types we don't support, even when those columns 
aren't read, because they would still have to be propagated to the 
HiveTableHandle. Since a partition's file format doesn't always match the 
table's file format, we don't know whether we need the schema until we 
generate the splits, at which point it's too late to update the plan. See 
https://github.com/apache/incubator-gluten/pull/10569
   
   ## How was this patch tested?
   
   Added tests for ORC and Parquet files where the column names in the table 
don't match the column names in the file, and verified we can still read them 
by index when the new configs select index-based mapping.
   



