[ https://issues.apache.org/jira/browse/SPARK-27960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856955#comment-16856955 ]
Ryan Blue commented on SPARK-27960:
-----------------------------------

[~Gengliang.Wang], FYI

> DataSourceV2 ORC implementation doesn't handle schemas correctly
> ----------------------------------------------------------------
>
>                 Key: SPARK-27960
>                 URL: https://issues.apache.org/jira/browse/SPARK-27960
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.3
>            Reporter: Ryan Blue
>            Priority: Major
>
> While testing SPARK-27919 ([#24768|https://github.com/apache/spark/pull/24768]), I tried to use the v2 ORC implementation to validate a v2 catalog that delegates to the session catalog. The ORC implementation fails the following test case because it cannot infer a schema (there is no data), but it should be using the schema used to create the table.
> Test case:
> {code}
> test("CreateTable: test ORC source") {
>   spark.conf.set("spark.sql.catalog.session", classOf[V2SessionCatalog].getName)
>   spark.sql(s"CREATE TABLE table_name (id bigint, data string) USING $orc2")
>
>   val testCatalog = spark.catalog("session").asTableCatalog
>   val table = testCatalog.loadTable(Identifier.of(Array(), "table_name"))
>
>   assert(table.name == "orc ") // <-- should this be table_name?
>   assert(table.partitioning.isEmpty)
>   assert(table.properties == Map(
>     "provider" -> orc2,
>     "database" -> "default",
>     "table" -> "table_name").asJava)
>   assert(table.schema == new StructType().add("id", LongType).add("data", StringType)) // <-- fail
>
>   val rdd = spark.sparkContext.parallelize(table.asInstanceOf[InMemoryTable].rows)
>   checkAnswer(spark.internalCreateDataFrame(rdd, table.schema), Seq.empty)
> }
> {code}
> Error:
> {code}
> Unable to infer schema for ORC. It must be specified manually.;
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.;
>   at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$5(FileTable.scala:61)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:61)
>   at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:54)
>   at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:67)
>   at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:65)
>   at org.apache.spark.sql.sources.v2.DataSourceV2SQLSuite.$anonfun$new$5(DataSourceV2SQLSuite.scala:82)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
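The stack trace shows `FileTable.dataSchema` failing because it only knows how to infer a schema from data files, while the table in the report was created with an explicit schema (`id bigint, data string`). The fallback order the report argues for can be sketched as follows. This is a hypothetical illustration, not Spark's actual `FileTable` code; the names `resolveSchema`, `Field`, and the stand-in `StructType` are invented for the example.

```scala
// Hypothetical sketch of the schema-resolution order the report argues for.
// None of these names come from Spark's real API.
object SchemaResolution {
  // Minimal stand-ins for Spark's StructType/StructField, for illustration only.
  final case class Field(name: String, dataType: String)
  final case class StructType(fields: Seq[Field])

  // Prefer the schema the table was created with; only fall back to inferring
  // from data files; fail with the reported error only when neither exists.
  def resolveSchema(
      userSpecified: Option[StructType],
      inferFromFiles: () => Option[StructType]): StructType =
    userSpecified
      .orElse(inferFromFiles())
      .getOrElse(throw new IllegalArgumentException(
        "Unable to infer schema for ORC. It must be specified manually."))
}
```

With this ordering, a freshly created empty table still resolves to its declared schema: `resolveSchema(Some(declared), () => None)` returns `declared` even though there are no files to infer from, which is exactly the case the failing test exercises.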