[ https://issues.apache.org/jira/browse/SPARK-27960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856955#comment-16856955 ]
Ryan Blue commented on SPARK-27960:
-----------------------------------

[~Gengliang.Wang], FYI

> DataSourceV2 ORC implementation doesn't handle schemas correctly
> ----------------------------------------------------------------
>
>                 Key: SPARK-27960
>                 URL: https://issues.apache.org/jira/browse/SPARK-27960
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.3
>            Reporter: Ryan Blue
>            Priority: Major
>
> While testing SPARK-27919 ([#24768|https://github.com/apache/spark/pull/24768]), I tried to use the v2 ORC implementation to validate a v2 catalog that delegates to the session catalog. The ORC implementation fails the following test case because it cannot infer a schema (there is no data), but it should be using the schema used to create the table.
> Test case:
> {code}
> test("CreateTable: test ORC source") {
>   spark.conf.set("spark.sql.catalog.session", classOf[V2SessionCatalog].getName)
>   spark.sql(s"CREATE TABLE table_name (id bigint, data string) USING $orc2")
>
>   val testCatalog = spark.catalog("session").asTableCatalog
>   val table = testCatalog.loadTable(Identifier.of(Array(), "table_name"))
>
>   assert(table.name == "orc ") // <-- should this be table_name?
>   assert(table.partitioning.isEmpty)
>   assert(table.properties == Map(
>     "provider" -> orc2,
>     "database" -> "default",
>     "table" -> "table_name").asJava)
>   assert(table.schema == new StructType().add("id", LongType).add("data", StringType)) // <-- fail
>
>   val rdd = spark.sparkContext.parallelize(table.asInstanceOf[InMemoryTable].rows)
>   checkAnswer(spark.internalCreateDataFrame(rdd, table.schema), Seq.empty)
> }
> {code}
> Error:
> {code}
> Unable to infer schema for ORC. It must be specified manually.;
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.;
>   at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$5(FileTable.scala:61)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:61)
>   at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:54)
>   at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:67)
>   at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:65)
>   at org.apache.spark.sql.sources.v2.DataSourceV2SQLSuite.$anonfun$new$5(DataSourceV2SQLSuite.scala:82)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
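The stack trace shows `FileTable.dataSchema` failing because it only knows how to infer a schema from data files, while the table in the report was created with an explicit schema (`id bigint, data string`). The fallback order the report argues for can be sketched as follows. This is a hypothetical illustration, not Spark's actual `FileTable` code; the names `resolveSchema`, `Field`, and the stand-in `StructType` are invented for the example.

```scala
// Hypothetical sketch of the schema-resolution order the report argues for.
// None of these names come from Spark's real API.
object SchemaResolution {
  // Minimal stand-ins for Spark's StructType/StructField, for illustration only.
  final case class Field(name: String, dataType: String)
  final case class StructType(fields: Seq[Field])

  // Prefer the schema the table was created with; only fall back to inferring
  // from data files; fail with the reported error only when neither exists.
  def resolveSchema(
      userSpecified: Option[StructType],
      inferFromFiles: () => Option[StructType]): StructType =
    userSpecified
      .orElse(inferFromFiles())
      .getOrElse(throw new IllegalArgumentException(
        "Unable to infer schema for ORC. It must be specified manually."))
}
```

With this ordering, a freshly created empty table still resolves to its declared schema: `resolveSchema(Some(declared), () => None)` returns `declared` even though there are no files to infer from, which is exactly the case the failing test exercises.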