[ 
https://issues.apache.org/jira/browse/SPARK-31574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096805#comment-17096805
 ] 

Pablo Langa Blanco commented on SPARK-31574:
--------------------------------------------

Spark has functionality to read multiple Parquet files with different 
*compatible* schemas; it is disabled by default.

[https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging]

The problem in the example you propose is that int and string are incompatible 
data types, so schema merging is not going to work.
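To make the compatibility point concrete, here is a minimal sketch in plain Python (no Spark required) of why schema merging reconciles *compatible* type changes but must fail for int vs. string. The field names and the promotion table are illustrative assumptions, a simplified stand-in for Spark's actual type-widening rules, not its implementation:

```python
# Upcasts that a schema merge can reconcile (assumption: a simplified
# subset of Spark's type-widening rules, for illustration only).
COMPATIBLE_UPCASTS = {
    ("int", "long"),
    ("float", "double"),
    ("int", "double"),
}

def merge_field(type_a, type_b):
    """Return the merged type, or raise if the types cannot be reconciled."""
    if type_a == type_b:
        return type_a
    if (type_a, type_b) in COMPATIBLE_UPCASTS:
        return type_b
    if (type_b, type_a) in COMPATIBLE_UPCASTS:
        return type_a
    raise TypeError(f"incompatible types: {type_a} vs {type_b}")

def merge_schemas(schema_v1, schema_v2):
    """Merge two {column: type} schemas column by column."""
    merged = dict(schema_v1)
    for col, typ in schema_v2.items():
        merged[col] = merge_field(merged[col], typ) if col in merged else typ
    return merged

# Compatible evolution: int widened to long merges cleanly.
print(merge_schemas({"C1": "int"}, {"C1": "long"}))   # {'C1': 'long'}

# The reporter's case: int vs. string cannot be merged.
try:
    merge_schemas({"C1": "int"}, {"C1": "string"})
except TypeError as e:
    print(e)  # incompatible types: int vs string
```

In actual Spark, merging of compatible schemas is enabled per read with `spark.read.option("mergeSchema", "true").parquet(path)` or globally with `spark.sql.parquet.mergeSchema`, as described in the linked documentation; an int-to-string change still fails, because that is a rewrite, not a widening.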

> Schema evolution in spark while using the storage format as parquet
> -------------------------------------------------------------------
>
>                 Key: SPARK-31574
>                 URL: https://issues.apache.org/jira/browse/SPARK-31574
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: sharad Gupta
>            Priority: Major
>
> Hi Team,
>  
> Use case:
> Suppose there is a table T1 with a column C1 whose datatype is int in schema 
> version 1. When first onboarding table T1, I wrote a couple of Parquet files 
> with this schema version 1, with Parquet as the underlying file format.
> In schema version 2, the C1 column's datatype changed from int to string, 
> so data is now written with schema version 2 in Parquet.
> As a result, some Parquet files are written with schema version 1 and some 
> with schema version 2.
> Problem statement:
> 1. We are not able to execute the below command from Spark SQL:
> ```ALTER TABLE T1 CHANGE C1 C1 string```
> 2. As a workaround, I went to Hive and altered the table to change the 
> datatype, since that is supported in Hive, and then tried to read the data 
> in Spark. It gives me this error:
> ```
> Caused by: java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
>   at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:44)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
>   at 
> org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:372)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)```
>  
> 3. We suspect this happens because the underlying Parquet file is written 
> with integer type while we are reading from a table whose column has been 
> changed to string type.
> How you can reproduce this:
> Spark SQL
> 1. Create a table from Spark SQL with one column of datatype int, stored 
> as Parquet.
> 2. Put some data into the table.
> 3. You can see the data if you select from the table.
> Hive
> 1. Change the datatype from int to string with an ALTER command.
> 2. Try to read the data; you will be able to read it here even after 
> changing the datatype.
> Spark SQL
> 1. Try to read the data from here; now you will see the error.
> The question is how to handle schema evolution in Spark while using 
> Parquet as the storage format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
