[ https://issues.apache.org/jira/browse/SPARK-31574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096805#comment-17096805 ]
Pablo Langa Blanco commented on SPARK-31574:
--------------------------------------------

Spark has functionality to read multiple Parquet files with different *compatible* schemas, and it is disabled by default: [https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging]

The problem in the example you propose is that int and string are incompatible data types, so schema merging is not going to work.
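For example (a minimal sketch; the paths and column names are hypothetical), schema merging handles the compatible case where one file adds a column:

```scala
import org.apache.spark.sql.SparkSession

object MergeSchemaDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("merge-schema-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Two Parquet files with *compatible* schemas: the second adds a column.
    Seq((1, "a")).toDF("id", "v1").write.parquet("/tmp/merge_demo/key=1")
    Seq((2, "b", 3.0)).toDF("id", "v1", "v2").write.parquet("/tmp/merge_demo/key=2")

    // Schema merging is off by default; enable it per read with the
    // mergeSchema option (or globally via spark.sql.parquet.mergeSchema).
    val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/merge_demo")
    merged.printSchema() // id, v1, v2, key -- v2 is null for rows from file 1

    // If the same column were written as int in one file and string in the
    // other, this read would fail instead: those types cannot be merged.
  }
}
```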
> Schema evolution in spark while using the storage format as parquet
> -------------------------------------------------------------------
>
>                 Key: SPARK-31574
>                 URL: https://issues.apache.org/jira/browse/SPARK-31574
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: sharad Gupta
>            Priority: Major
>
> Hi Team,
>
> Use case:
> Suppose there is a table T1 with a column C1 whose datatype is int in schema version 1. While first onboarding table T1, I wrote a couple of Parquet files with schema version 1.
> In schema version 2, the datatype of C1 changed from int to string, and new data is now written with schema version 2.
> So some Parquet files are written with schema version 1 and some with schema version 2.
>
> Problem statement:
> 1. We are not able to execute the below command from Spark SQL:
> ```ALTER TABLE T1 CHANGE C1 C1 string```
> 2. As a workaround I went to Hive and changed the datatype with an ALTER command, since that is supported in Hive, and then tried to read the data in Spark. That gives the following error:
> ```
> Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
>   at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:44)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
>   at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:372)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> ```
> 3. I suspect this happens because the underlying Parquet files are written with integer type while we are reading from a table whose column was changed to string type.
>
> How to reproduce:
> Spark SQL
> 1. Create a table with one column of datatype int, stored as Parquet.
> 2. Put some data into the table.
> 3. You can see the data if you select from the table.
> Hive
> 1. Change the datatype from int to string with an ALTER command.
> 2. Try to read the data; it is still readable even after changing the datatype.
> Spark SQL
> 1. Try to read the data from here; now you will see the error.
>
> Now the question is how to solve schema evolution in Spark while using Parquet as the storage format.
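For reference, the reproduction quoted above can be scripted roughly as follows (a sketch, assuming a Hive-enabled SparkSession; the table name is illustrative, and the Hive step has to run outside Spark):

```scala
import org.apache.spark.sql.SparkSession

object EvolutionRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-31574-repro")
      .enableHiveSupport() // metastore-backed table, as in the report
      .getOrCreate()

    // Spark SQL steps 1-3: int column, Parquet storage, some rows.
    spark.sql("CREATE TABLE t1 (c1 INT) STORED AS PARQUET")
    spark.sql("INSERT INTO t1 VALUES (1), (2)")
    spark.sql("SELECT * FROM t1").show() // works

    // Hive steps 1-2, run in Hive because Spark SQL rejects the statement:
    //   ALTER TABLE t1 CHANGE c1 c1 string;
    // Hive can still read the table after this.

    // Spark SQL again: this read now fails with the
    // PlainIntegerDictionary UnsupportedOperationException, because the
    // existing files store c1 as int while the table schema says string.
    spark.sql("SELECT * FROM t1").show()
  }
}
```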