[ https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261398#comment-17261398 ]
Attila Zsolt Piros commented on SPARK-26836: -------------------------------------------- I am working on a this > Columns get switched in Spark SQL using Avro backed Hive table if schema > evolves > -------------------------------------------------------------------------------- > > Key: SPARK-26836 > URL: https://issues.apache.org/jira/browse/SPARK-26836 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.3.1, 2.4.0 > Environment: I tested with Hive and HCatalog which runs on version > 2.3.4 and with Spark 2.3.1 and 2.4 > Reporter: Tamás Németh > Priority: Critical > Labels: correctness > Attachments: doctors.avro, doctors_evolved.avro, > doctors_evolved.json, original.avsc > > > I have a hive avro table where the avro schema is stored on s3 next to the > avro files. > In the table definiton the avro.schema.url always points to the latest > partition's _schema.avsc file which is always the lates schema. (Avro schemas > are backward and forward compatible in a table) > When new data comes in, I always add a new partition where the > avro.schema.url properties also set to the _schema.avsc which was used when > it was added and of course I always update the table avro.schema.url property > to the latest one. > Querying this table works fine until the schema evolves in a way that a new > optional property is added in the middle. > When this happens then after the spark sql query the columns in the old > partition gets mixed up and it shows the wrong data for the columns. > If I query the table with Hive then everything is perfectly fine and it gives > me back the correct columns for the partitions which were created the old > schema and for the new which was created the evolved schema. > > Here is how I could reproduce with the > [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro] > example data in sql test suite. > # I have created two partition folder: > {code:java} > [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/ > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors > /dt=2019-02-05/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-05/doctors.avro > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors > /dt=2019-02-06/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-06/doctors_evolved.avro{code} > Here the first partition had data which was created with the schema before > evolving and the second one had the evolved one. (the evolved schema is the > same as in your testcase except I moved the extra_field column to the last > from the second and I generated two lines of avro data with the evolved > schema. > # I have created a hive table with the following command: > > {code:java} > CREATE EXTERNAL TABLE `default.doctors` > PARTITIONED BY ( > `dt` string > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > WITH SERDEPROPERTIES ( > 'avro.schema.url'='s3://somelocation/doctors/ > /dt=2019-02-06/_schema.avsc') > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > LOCATION > 's3://somelocation/doctors/' > TBLPROPERTIES ( > 'transient_lastDdlTime'='1538130975'){code} > > Here as you can see the table schema url points to the latest schema > 3. I ran an msck _repair table_ to pick up all the partitions. > Fyi: If I run my select * query from here then everything is fine and no > columns switch happening. > 4. Then I changed the first partition's avro.schema.url url to points to the > schema which is under the partition folder (non-evolved one -> > s3://somelocation/doctors/ > /dt=2019-02-05/_schema.avsc) > Then if you ran a _select * from default.spark_test_ then the columns will be > mixed up (on the data below the first name column becomes the extra_field > column. I guess because in the latest schema it is the second column): > > {code:java} > number,extra_field,first_name,last_name,dt > 6,Colin,Baker,null,2019-02-05 > 3,Jon,Pertwee,null,2019-02-05 > 4,Tom,Baker,null,2019-02-05 > 5,Peter,Davison,null,2019-02-05 > 11,Matt,Smith,null,2019-02-05 > 1,William,Hartnell,null,2019-02-05 > 7,Sylvester,McCoy,null,2019-02-05 > 8,Paul,McGann,null,2019-02-05 > 2,Patrick,Troughton,null,2019-02-05 > 9,Christopher,Eccleston,null,2019-02-05 > 10,David,Tennant,null,2019-02-05 > 21,fishfinger,Jim,Baker,2019-02-06 > 24,fishfinger,Bean,Pertwee,2019-02-06 > {code} > If I try the same query from Hive and not from spark sql then everything is > fine and it never switches the columns. > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org