kevinwilfong opened a new pull request, #10697: URL: https://github.com/apache/incubator-gluten/pull/10697
## What changes are proposed in this pull request?

In our data warehouse we support schema evolution by column index rather than by name. For example, if a Hive table has schema `a, b, c` but the partition has schema `c, a, b`, we don't reorder the columns from the partition; we read partition column `c` as column `a`, partition column `a` as column `b`, and so on.

Velox supports this when the configs `hive.orc.use-column-names` and `hive.parquet.use-column-names` in the HiveConfig are set to false for ORC and Parquet files respectively. Currently both are hard-coded to true in Gluten.

This change adds two configs to Gluten's VeloxConfig, `spark.gluten.sql.columnar.backend.velox.orcUseColumnNames` and `spark.gluten.sql.columnar.backend.velox.parquetUseColumnNames`, and plumbs them to the HiveConfig in Velox.

In addition, we need to pass the full table schema to the HiveTableHandle, as this is how Velox determines the index of each column. I updated VeloxIteratorApi to set the FileSchema for the LocalFilesNodes it generates if necessary (that is, if the config is enabled for the format of the file), and VeloxPlanConverter/SubstraitToVeloxPlan to propagate this to the HiveTableHandle when present.

Note that I considered setting the schema in the ReadRel rather than in each LocalFilesNode. However, this meant we could no longer read from tables containing column types we don't support, even when we don't read those columns, because we would still need to propagate them to the HiveTableHandle. Since partition file formats don't always match table file formats, we don't know whether we need the schema until we generate the splits, at which point it's too late to update the plan. See https://github.com/apache/incubator-gluten/pull/10569

## How was this patch tested?

Added tests for ORC and Parquet files where the column names in the table don't match the column names in the file, and verified we could still read them by index when the flags are enabled.
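To make the index-based resolution described in the proposal concrete, here is a minimal Python sketch of the two mapping modes. It is a simplified model, not the Velox or Gluten implementation: the function `resolve_columns`, its parameters, and the choice to return `None` for a column missing from the file are all assumptions made for illustration. It does show why the full table schema is required: the position of a requested column can only be computed against the complete table column list, even when only a subset is read.

```python
from typing import Dict, List, Optional


def resolve_columns(
    table_schema: List[str],
    file_schema: List[str],
    requested: List[str],
    use_column_names: bool,
) -> Dict[str, Optional[str]]:
    """Map each requested table column to the file column it is read from.

    Hypothetical helper for illustration only; not part of Velox or Gluten.
    """
    mapping: Dict[str, Optional[str]] = {}
    for name in requested:
        if use_column_names:
            # Name-based: read the file column that has the same name, if any.
            mapping[name] = name if name in file_schema else None
        else:
            # Index-based: the column's position in the FULL table schema
            # selects the file column at the same position.
            idx = table_schema.index(name)
            # Assumption: a position past the end of the file schema is
            # treated as a missing column (None).
            mapping[name] = file_schema[idx] if idx < len(file_schema) else None
    return mapping


# The example from the description: table schema a, b, c; partition schema c, a, b.
table = ["a", "b", "c"]
partition = ["c", "a", "b"]

by_index = resolve_columns(table, partition, ["a", "b"], use_column_names=False)
by_name = resolve_columns(table, partition, ["a", "b"], use_column_names=True)

print(by_index)  # {'a': 'c', 'b': 'a'}
print(by_name)   # {'a': 'a', 'b': 'b'}
```

With `use_column_names=False`, table column `a` is read from partition column `c` and table column `b` from partition column `a`, matching the behavior described above. Requesting only `b` still needs the whole `table` list to find that `b` sits at position 1.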
