kevinwilfong commented on code in PR #10697:
URL: https://github.com/apache/incubator-gluten/pull/10697#discussion_r2392188301
##########
backends-velox/src/test/scala/org/apache/gluten/execution/VeloxScanSuite.scala:
##########
@@ -207,4 +207,78 @@ class VeloxScanSuite extends VeloxWholeStageTransformerSuite {
}
}
}
+
+ test("parquet index based schema evolution") {
+ withSQLConf(VeloxConfig.PARQUET_USE_COLUMN_NAMES.key -> "false") {
+ withTempDir {
+ dir =>
+ val path = dir.getCanonicalPath
+ spark
+ .range(2)
+ .selectExpr("id as a", "cast(id + 10 as string) as b")
+ .write
+ .mode("overwrite")
+ .parquet(path)
+
+ withTable("test") {
+          sql(s"""create table test (c long, d string, e float) using parquet options
+            |(path '$path')""".stripMargin)
+ var df = sql("select c, d from test")
+ checkAnswer(df, Seq(Row(0L, "10"), Row(1L, "11")))
+
+ df = sql("select d from test")
+ checkAnswer(df, Seq(Row("10"), Row("11")))
+
+ df = sql("select c from test")
+ checkAnswer(df, Seq(Row(0L), Row(1L)))
+
+ df = sql("select d, c from test")
+ checkAnswer(df, Seq(Row("10", 0L), Row("11", 1L)))
+
+ df = sql("select c, d, e from test")
+ checkAnswer(df, Seq(Row(0L, "10", null), Row(1L, "11", null)))
+
+ df = sql("select e, d, c from test")
+ checkAnswer(df, Seq(Row(null, "10", 0L), Row(null, "11", 1L)))
+ }
+ }
+ }
+ }
+
+ test("ORC index based schema evolution") {
+ withSQLConf(VeloxConfig.ORC_USE_COLUMN_NAMES.key -> "false") {
+ withTempDir {
+ dir =>
+ val path = dir.getCanonicalPath
+ spark
+ .range(2)
+ .selectExpr("id as a", "cast(id + 10 as string) as b")
+ .write
+ .mode("overwrite")
+ .orc(path)
+
+ withTable("test") {
+          sql(s"""create table test (c long, d string, e float) using orc options
+            |(path '$path')""".stripMargin)
+ var df = sql("select c, d from test")
+ checkAnswer(df, Seq(Row(0L, "10"), Row(1L, "11")))
Review Comment:
See my comment here: https://github.com/apache/incubator-gluten/pull/10697#issuecomment-3325592859

I will reuse most of the code in this change to support Text files, which are supported in Vanilla Spark and already supported in Gluten with ClickHouse. The configs are the only part that doesn't fully align with Vanilla Spark. In Vanilla Spark we support mapping columns by index by adding custom readers that resolve columns by position instead of by name; upstream Spark partially supports this for ORC (https://issues.apache.org/jira/browse/SPARK-32864). With Velox we simply do this by setting a config in the connector. The C++ code in Gluten is very rigid and doesn't allow us to adjust connector configs to our needs without forking files. Users who don't want this behavior can simply leave the configs unset.
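For reference, the positional resolution these tests exercise can be sketched in a few lines. This is a hypothetical illustration only (the object and method names are mine, not a Gluten or Velox API): with the file written as (a, b) and the table declared as (c, d, e), each table column reads the file column at the same ordinal, and table columns past the end of the file schema have no source, so they read as null:

```scala
// Hypothetical sketch of index-based schema evolution; names here are
// illustrative and do not correspond to actual Gluten/Velox code.
object IndexBasedMapping {
  // Table column i reads file column i; table columns beyond the file
  // schema (like `e` in the tests above) map to None, i.e. read as null.
  def resolveByIndex(
      tableCols: Seq[String],
      fileCols: Seq[String]): Seq[(String, Option[String])] =
    tableCols.zipWithIndex.map { case (col, i) => (col, fileCols.lift(i)) }

  def main(args: Array[String]): Unit = {
    // File written with columns (a, b); table declared as (c, d, e).
    println(resolveByIndex(Seq("c", "d", "e"), Seq("a", "b")))
    // c reads a, d reads b, e has no backing column.
  }
}
```

By-name resolution, in contrast, would pair columns only where identifiers match, which is why the tests above only produce rows when the name-based configs are disabled.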
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]