kevinwilfong commented on code in PR #10697:
URL: https://github.com/apache/incubator-gluten/pull/10697#discussion_r2392188301
##########
backends-velox/src/test/scala/org/apache/gluten/execution/VeloxScanSuite.scala:
##########
@@ -207,4 +207,78 @@ class VeloxScanSuite extends VeloxWholeStageTransformerSuite {
}
}
}
+
+ test("parquet index based schema evolution") {
+ withSQLConf(VeloxConfig.PARQUET_USE_COLUMN_NAMES.key -> "false") {
+ withTempDir {
+ dir =>
+ val path = dir.getCanonicalPath
+ spark
+ .range(2)
+ .selectExpr("id as a", "cast(id + 10 as string) as b")
+ .write
+ .mode("overwrite")
+ .parquet(path)
+
+ withTable("test") {
+          sql(s"""create table test (c long, d string, e float) using parquet options
+            |(path '$path')""".stripMargin)
+ var df = sql("select c, d from test")
+ checkAnswer(df, Seq(Row(0L, "10"), Row(1L, "11")))
+
+ df = sql("select d from test")
+ checkAnswer(df, Seq(Row("10"), Row("11")))
+
+ df = sql("select c from test")
+ checkAnswer(df, Seq(Row(0L), Row(1L)))
+
+ df = sql("select d, c from test")
+ checkAnswer(df, Seq(Row("10", 0L), Row("11", 1L)))
+
+ df = sql("select c, d, e from test")
+ checkAnswer(df, Seq(Row(0L, "10", null), Row(1L, "11", null)))
+
+ df = sql("select e, d, c from test")
+ checkAnswer(df, Seq(Row(null, "10", 0L), Row(null, "11", 1L)))
+ }
+ }
+ }
+ }
+
+ test("ORC index based schema evolution") {
+ withSQLConf(VeloxConfig.ORC_USE_COLUMN_NAMES.key -> "false") {
+ withTempDir {
+ dir =>
+ val path = dir.getCanonicalPath
+ spark
+ .range(2)
+ .selectExpr("id as a", "cast(id + 10 as string) as b")
+ .write
+ .mode("overwrite")
+ .orc(path)
+
+ withTable("test") {
+          sql(s"""create table test (c long, d string, e float) using orc options
+            |(path '$path')""".stripMargin)
+ var df = sql("select c, d from test")
+ checkAnswer(df, Seq(Row(0L, "10"), Row(1L, "11")))
Review Comment:
See my comment here: https://github.com/apache/incubator-gluten/pull/10697#issuecomment-3325592859

I will reuse most of the code in this change to support Text files, which are supported in Vanilla Spark and already supported in Gluten with ClickHouse. The configs are the only part that doesn't fully align with Vanilla Spark. In Vanilla Spark we support mapping columns by index by adding custom readers that resolve columns by position instead of by name; upstream Spark partially supports this for ORC (https://issues.apache.org/jira/browse/SPARK-32864). With Velox we simply do this by setting a config in the connector. The C++ code in Gluten is very rigid and doesn't allow us to adjust connector configs to our needs without forking files. Users who don't want this behavior can simply leave the configs unset.
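For reference, the positional resolution these tests exercise can be sketched in a few lines. This is a hypothetical illustration only (the object and method names are mine, not a Gluten or Velox API): with the file written as (a, b) and the table declared as (c, d, e), each table column reads the file column at the same ordinal, and table columns past the end of the file schema have no source, so they read as null:

```scala
// Hypothetical sketch of index-based schema evolution; names here are
// illustrative and do not correspond to actual Gluten/Velox code.
object IndexBasedMapping {
  // Table column i reads file column i; table columns beyond the file
  // schema (like `e` in the tests above) map to None, i.e. read as null.
  def resolveByIndex(
      tableCols: Seq[String],
      fileCols: Seq[String]): Seq[(String, Option[String])] =
    tableCols.zipWithIndex.map { case (col, i) => (col, fileCols.lift(i)) }

  def main(args: Array[String]): Unit = {
    // File written with columns (a, b); table declared as (c, d, e).
    println(resolveByIndex(Seq("c", "d", "e"), Seq("a", "b")))
    // c reads a, d reads b, e has no backing column.
  }
}
```

By-name resolution, in contrast, would pair columns only where identifiers match, which is why the tests above only produce rows when the name-based configs are disabled.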
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]