[ https://issues.apache.org/jira/browse/SPARK-16518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375455#comment-15375455 ]
Dongjoon Hyun commented on SPARK-16518:
---------------------------------------

+1 :)

> Schema Compatibility of Parquet Data Source
> -------------------------------------------
>
>                 Key: SPARK-16518
>                 URL: https://issues.apache.org/jira/browse/SPARK-16518
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Xiao Li
>
> Currently, we are not checking the schema compatibility. Different file
> formats behave differently. This JIRA just summarizes what I observed for
> parquet data source tables.
>
> *Scenario 1 Data type mismatch*:
> The existing schema is {{(col1 int, col2 string)}}
> The schema of appending dataset is {{(col1 int, col2 int)}}
> *Case 1*: _when {{spark.sql.parquet.mergeSchema}} is {{false}}_, the error we
> got:
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure:
>  Lost task 0.0 in stage 4.0 (TID 4, localhost): java.lang.NullPointerException
>     at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:231)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:62)
> {noformat}
> *Case 2*: _when {{spark.sql.parquet.mergeSchema}} is {{true}}_, the error we
> got:
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): org.apache.spark.SparkException:
>  Failed merging schema of file file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-4c2f0b69-ee05-4be1-91f0-0e54f89f2308/part-r-00000-6b76638c-a624-444c-9479-3c8e894cb65e.snappy.parquet:
> root
>  |-- a: integer (nullable = false)
>  |-- b: string (nullable = true)
> {noformat}
>
> *Scenario 2 More columns in append dataset*:
> The existing schema is {{(col1 int, col2 string)}}
> The schema of appending dataset is {{(col1 int, col2 string, col3 int)}}
> *Case 1*: _when {{spark.sql.parquet.mergeSchema}} is {{false}}_, the schema
> of the resultset is {{(col1 int, col2 string)}}.
> *Case 2*: _when {{spark.sql.parquet.mergeSchema}} is {{true}}_, the schema of
> the resultset is {{(col1 int, col2 string, col3 int)}}.
>
> *Scenario 3 Less columns in append dataset*:
> The existing schema is {{(col1 int, col2 string)}}
> The schema of appending dataset is {{(col1 int)}}
> *Case 1*: _when {{spark.sql.parquet.mergeSchema}} is {{false}}_, the
> schema of the resultset is {{(col1 int, col2 string)}}.
> *Case 2*: _when {{spark.sql.parquet.mergeSchema}} is {{true}}_, the schema
> of the resultset is {{(col1 int)}}.
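
For illustration only: the issue states that schema compatibility is currently not checked before an append. The plain-Python sketch below shows roughly what such a pre-append check could report for the three scenarios above. It is not Spark code and does not reflect how Spark would implement the check; `compare_schemas` and `is_compatible` are hypothetical names, schemas are modeled as ordered (column, type) pairs, and types are compared as plain strings.

```python
from typing import Dict, List, Tuple

# Hypothetical model: a schema is an ordered list of (column name, type name) pairs.
Schema = List[Tuple[str, str]]


def compare_schemas(existing: Schema, appending: Schema) -> Dict[str, list]:
    """Report how the appending schema differs from the existing one."""
    existing_types = dict(existing)
    appending_types = dict(appending)

    # Columns present in both schemas but declared with different types (Scenario 1).
    type_mismatches = [
        (name, existing_types[name], dtype)
        for name, dtype in appending
        if name in existing_types and existing_types[name] != dtype
    ]
    # Columns only in the appending dataset (Scenario 2).
    extra_columns = [name for name, _ in appending if name not in existing_types]
    # Columns only in the existing table (Scenario 3).
    missing_columns = [name for name, _ in existing if name not in appending_types]

    return {
        "type_mismatches": type_mismatches,
        "extra_columns": extra_columns,
        "missing_columns": missing_columns,
    }


def is_compatible(report: Dict[str, list]) -> bool:
    """Strict policy (an assumption of this sketch): any difference is incompatible."""
    return not any(report.values())


EXISTING: Schema = [("col1", "int"), ("col2", "string")]

SCENARIOS: Dict[str, Schema] = {
    "Scenario 1 (type mismatch)": [("col1", "int"), ("col2", "int")],
    "Scenario 2 (more columns)": [("col1", "int"), ("col2", "string"), ("col3", "int")],
    "Scenario 3 (fewer columns)": [("col1", "int")],
}

for label, appending_schema in SCENARIOS.items():
    report = compare_schemas(EXISTING, appending_schema)
    print(label, report, "compatible:", is_compatible(report))
```

A check like this would let the append fail up front with a descriptive message rather than with the NullPointerException shown in Scenario 1. Whether extra or missing columns should be rejected or tolerated is a policy decision that this sketch leaves open.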