[jira] [Commented] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
[ https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17262681#comment-17262681 ] Apache Spark commented on SPARK-26836: -- User 'attilapiros' has created a pull request for this issue: https://github.com/apache/spark/pull/31133 > Columns get switched in Spark SQL using Avro backed Hive table if schema > evolves > > > Key: SPARK-26836 > URL: https://issues.apache.org/jira/browse/SPARK-26836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0 > Environment: I tested with Hive and HCatalog which runs on version > 2.3.4 and with Spark 2.3.1 and 2.4 >Reporter: Tamás Németh >Priority: Critical > Labels: correctness > Attachments: doctors.avro, doctors_evolved.avro, > doctors_evolved.json, original.avsc > > > I have a hive avro table where the avro schema is stored on s3 next to the > avro files. > In the table definiton the avro.schema.url always points to the latest > partition's _schema.avsc file which is always the lates schema. (Avro schemas > are backward and forward compatible in a table) > When new data comes in, I always add a new partition where the > avro.schema.url properties also set to the _schema.avsc which was used when > it was added and of course I always update the table avro.schema.url property > to the latest one. > Querying this table works fine until the schema evolves in a way that a new > optional property is added in the middle. > When this happens then after the spark sql query the columns in the old > partition gets mixed up and it shows the wrong data for the columns. > If I query the table with Hive then everything is perfectly fine and it gives > me back the correct columns for the partitions which were created the old > schema and for the new which was created the evolved schema. > > Here is how I could reproduce with the > [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro] > example data in sql test suite. > # I have created two partition folder: > {code:java} > [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/ > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors > /dt=2019-02-05/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-05/doctors.avro > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors > /dt=2019-02-06/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-06/doctors_evolved.avro{code} > Here the first partition had data which was created with the schema before > evolving and the second one had the evolved one. (the evolved schema is the > same as in your testcase except I moved the extra_field column to the last > from the second and I generated two lines of avro data with the evolved > schema. > # I have created a hive table with the following command: > > {code:java} > CREATE EXTERNAL TABLE `default.doctors` > PARTITIONED BY ( > `dt` string > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > WITH SERDEPROPERTIES ( > 'avro.schema.url'='s3://somelocation/doctors/ > /dt=2019-02-06/_schema.avsc') > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > LOCATION > 's3://somelocation/doctors/' > TBLPROPERTIES ( > 'transient_lastDdlTime'='1538130975'){code} > > Here as you can see the table schema url points to the latest schema > 3. I ran an msck _repair table_ to pick up all the partitions. > Fyi: If I run my select * query from here then everything is fine and no > columns switch happening. > 4. Then I changed the first partition's avro.schema.url url to points to the > schema which is under the partition folder (non-evolved one -> > s3://somelocation/doctors/ > /dt=2019-02-05/_schema.avsc) > Then if you ran a _select * from default.spark_test_ then the columns will be > mixed up (on the data below the first name column becomes the extra_field > column. I guess because in the latest schema it is the second column): > > {code:java} > number,extra_field,first_name,last_name,dt > 6,Colin,Baker,null,2019-02-05 > 3,Jon,Pertwee,null,2019-02-05 > 4,Tom,Baker,null,2019-02-05 > 5,Peter,Davison,null,2019-02-05 > 11,Matt,Smith,null,2019-02-05 > 1,William,Hartnell,null,2019-02-05 > 7,Sylvester,McCoy,null,2019-02-05 > 8,Paul,McGann,null,2019-02-05 > 2,Patrick,Troughton,null,2019-02-05 > 9,Christopher,Eccleston,null,2019-02-05 > 10,David,Tennant,null,2
[jira] [Commented] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
[ https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261398#comment-17261398 ] Attila Zsolt Piros commented on SPARK-26836: I am working on a this > Columns get switched in Spark SQL using Avro backed Hive table if schema > evolves > > > Key: SPARK-26836 > URL: https://issues.apache.org/jira/browse/SPARK-26836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0 > Environment: I tested with Hive and HCatalog which runs on version > 2.3.4 and with Spark 2.3.1 and 2.4 >Reporter: Tamás Németh >Priority: Critical > Labels: correctness > Attachments: doctors.avro, doctors_evolved.avro, > doctors_evolved.json, original.avsc > > > I have a hive avro table where the avro schema is stored on s3 next to the > avro files. > In the table definiton the avro.schema.url always points to the latest > partition's _schema.avsc file which is always the lates schema. (Avro schemas > are backward and forward compatible in a table) > When new data comes in, I always add a new partition where the > avro.schema.url properties also set to the _schema.avsc which was used when > it was added and of course I always update the table avro.schema.url property > to the latest one. > Querying this table works fine until the schema evolves in a way that a new > optional property is added in the middle. > When this happens then after the spark sql query the columns in the old > partition gets mixed up and it shows the wrong data for the columns. > If I query the table with Hive then everything is perfectly fine and it gives > me back the correct columns for the partitions which were created the old > schema and for the new which was created the evolved schema. > > Here is how I could reproduce with the > [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro] > example data in sql test suite. > # I have created two partition folder: > {code:java} > [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/ > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors > /dt=2019-02-05/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-05/doctors.avro > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors > /dt=2019-02-06/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-06/doctors_evolved.avro{code} > Here the first partition had data which was created with the schema before > evolving and the second one had the evolved one. (the evolved schema is the > same as in your testcase except I moved the extra_field column to the last > from the second and I generated two lines of avro data with the evolved > schema. > # I have created a hive table with the following command: > > {code:java} > CREATE EXTERNAL TABLE `default.doctors` > PARTITIONED BY ( > `dt` string > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > WITH SERDEPROPERTIES ( > 'avro.schema.url'='s3://somelocation/doctors/ > /dt=2019-02-06/_schema.avsc') > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > LOCATION > 's3://somelocation/doctors/' > TBLPROPERTIES ( > 'transient_lastDdlTime'='1538130975'){code} > > Here as you can see the table schema url points to the latest schema > 3. I ran an msck _repair table_ to pick up all the partitions. > Fyi: If I run my select * query from here then everything is fine and no > columns switch happening. > 4. Then I changed the first partition's avro.schema.url url to points to the > schema which is under the partition folder (non-evolved one -> > s3://somelocation/doctors/ > /dt=2019-02-05/_schema.avsc) > Then if you ran a _select * from default.spark_test_ then the columns will be > mixed up (on the data below the first name column becomes the extra_field > column. I guess because in the latest schema it is the second column): > > {code:java} > number,extra_field,first_name,last_name,dt > 6,Colin,Baker,null,2019-02-05 > 3,Jon,Pertwee,null,2019-02-05 > 4,Tom,Baker,null,2019-02-05 > 5,Peter,Davison,null,2019-02-05 > 11,Matt,Smith,null,2019-02-05 > 1,William,Hartnell,null,2019-02-05 > 7,Sylvester,McCoy,null,2019-02-05 > 8,Paul,McGann,null,2019-02-05 > 2,Patrick,Troughton,null,2019-02-05 > 9,Christopher,Eccleston,null,2019-02-05 > 10,David,Tennant,null,2019-02-05 > 21,fishfinger,Jim,Baker,2019-02-06 > 24,fishfinger,Bean,Pertwee,2019-02-06 >
[jira] [Commented] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
[ https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047170#comment-17047170 ] Hyukjin Kwon commented on SPARK-26836: -- I am lowering the priority to Critical as it's at least not a regression and doesn't look blocking Spark 3.0; however, indeed we should treat correctness issues at least Critical+. > Columns get switched in Spark SQL using Avro backed Hive table if schema > evolves > > > Key: SPARK-26836 > URL: https://issues.apache.org/jira/browse/SPARK-26836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0 > Environment: I tested with Hive and HCatalog which runs on version > 2.3.4 and with Spark 2.3.1 and 2.4 >Reporter: Tamás Németh >Priority: Critical > Labels: correctness > Attachments: doctors.avro, doctors_evolved.avro, > doctors_evolved.json, original.avsc > > > I have a hive avro table where the avro schema is stored on s3 next to the > avro files. > In the table definiton the avro.schema.url always points to the latest > partition's _schema.avsc file which is always the lates schema. (Avro schemas > are backward and forward compatible in a table) > When new data comes in, I always add a new partition where the > avro.schema.url properties also set to the _schema.avsc which was used when > it was added and of course I always update the table avro.schema.url property > to the latest one. > Querying this table works fine until the schema evolves in a way that a new > optional property is added in the middle. > When this happens then after the spark sql query the columns in the old > partition gets mixed up and it shows the wrong data for the columns. > If I query the table with Hive then everything is perfectly fine and it gives > me back the correct columns for the partitions which were created the old > schema and for the new which was created the evolved schema. > > Here is how I could reproduce with the > [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro] > example data in sql test suite. > # I have created two partition folder: > {code:java} > [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/ > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors > /dt=2019-02-05/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-05/doctors.avro > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors > /dt=2019-02-06/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-06/doctors_evolved.avro{code} > Here the first partition had data which was created with the schema before > evolving and the second one had the evolved one. (the evolved schema is the > same as in your testcase except I moved the extra_field column to the last > from the second and I generated two lines of avro data with the evolved > schema. > # I have created a hive table with the following command: > > {code:java} > CREATE EXTERNAL TABLE `default.doctors` > PARTITIONED BY ( > `dt` string > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > WITH SERDEPROPERTIES ( > 'avro.schema.url'='s3://somelocation/doctors/ > /dt=2019-02-06/_schema.avsc') > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > LOCATION > 's3://somelocation/doctors/' > TBLPROPERTIES ( > 'transient_lastDdlTime'='1538130975'){code} > > Here as you can see the table schema url points to the latest schema > 3. I ran an msck _repair table_ to pick up all the partitions. > Fyi: If I run my select * query from here then everything is fine and no > columns switch happening. > 4. Then I changed the first partition's avro.schema.url url to points to the > schema which is under the partition folder (non-evolved one -> > s3://somelocation/doctors/ > /dt=2019-02-05/_schema.avsc) > Then if you ran a _select * from default.spark_test_ then the columns will be > mixed up (on the data below the first name column becomes the extra_field > column. I guess because in the latest schema it is the second column): > > {code:java} > number,extra_field,first_name,last_name,dt > 6,Colin,Baker,null,2019-02-05 > 3,Jon,Pertwee,null,2019-02-05 > 4,Tom,Baker,null,2019-02-05 > 5,Peter,Davison,null,2019-02-05 > 11,Matt,Smith,null,2019-02-05 > 1,William,Hartnell,null,2019-02-05 > 7,Sylvester,McCoy,null,2019-02-05 > 8,Paul,McGann,null,2019-02-05 > 2,Patrick,Troughton,null,2019-02-05 > 9,Christo
[jira] [Commented] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
[ https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17040853#comment-17040853 ] Wenchen Fan commented on SPARK-26836: - cc [~Gengliang.Wang] > Columns get switched in Spark SQL using Avro backed Hive table if schema > evolves > > > Key: SPARK-26836 > URL: https://issues.apache.org/jira/browse/SPARK-26836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0 > Environment: I tested with Hive and HCatalog which runs on version > 2.3.4 and with Spark 2.3.1 and 2.4 >Reporter: Tamás Németh >Priority: Blocker > Labels: correctness > Attachments: doctors.avro, doctors_evolved.avro, > doctors_evolved.json, original.avsc > > > I have a hive avro table where the avro schema is stored on s3 next to the > avro files. > In the table definiton the avro.schema.url always points to the latest > partition's _schema.avsc file which is always the lates schema. (Avro schemas > are backward and forward compatible in a table) > When new data comes in, I always add a new partition where the > avro.schema.url properties also set to the _schema.avsc which was used when > it was added and of course I always update the table avro.schema.url property > to the latest one. > Querying this table works fine until the schema evolves in a way that a new > optional property is added in the middle. > When this happens then after the spark sql query the columns in the old > partition gets mixed up and it shows the wrong data for the columns. > If I query the table with Hive then everything is perfectly fine and it gives > me back the correct columns for the partitions which were created the old > schema and for the new which was created the evolved schema. > > Here is how I could reproduce with the > [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro] > example data in sql test suite. > # I have created two partition folder: > {code:java} > [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/ > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors > /dt=2019-02-05/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-05/doctors.avro > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors > /dt=2019-02-06/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-06/doctors_evolved.avro{code} > Here the first partition had data which was created with the schema before > evolving and the second one had the evolved one. (the evolved schema is the > same as in your testcase except I moved the extra_field column to the last > from the second and I generated two lines of avro data with the evolved > schema. > # I have created a hive table with the following command: > > {code:java} > CREATE EXTERNAL TABLE `default.doctors` > PARTITIONED BY ( > `dt` string > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > WITH SERDEPROPERTIES ( > 'avro.schema.url'='s3://somelocation/doctors/ > /dt=2019-02-06/_schema.avsc') > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > LOCATION > 's3://somelocation/doctors/' > TBLPROPERTIES ( > 'transient_lastDdlTime'='1538130975'){code} > > Here as you can see the table schema url points to the latest schema > 3. I ran an msck _repair table_ to pick up all the partitions. > Fyi: If I run my select * query from here then everything is fine and no > columns switch happening. > 4. Then I changed the first partition's avro.schema.url url to points to the > schema which is under the partition folder (non-evolved one -> > s3://somelocation/doctors/ > /dt=2019-02-05/_schema.avsc) > Then if you ran a _select * from default.spark_test_ then the columns will be > mixed up (on the data below the first name column becomes the extra_field > column. I guess because in the latest schema it is the second column): > > {code:java} > number,extra_field,first_name,last_name,dt > 6,Colin,Baker,null,2019-02-05 > 3,Jon,Pertwee,null,2019-02-05 > 4,Tom,Baker,null,2019-02-05 > 5,Peter,Davison,null,2019-02-05 > 11,Matt,Smith,null,2019-02-05 > 1,William,Hartnell,null,2019-02-05 > 7,Sylvester,McCoy,null,2019-02-05 > 8,Paul,McGann,null,2019-02-05 > 2,Patrick,Troughton,null,2019-02-05 > 9,Christopher,Eccleston,null,2019-02-05 > 10,David,Tennant,null,2019-02-05 > 21,fishfinger,Jim,Baker,2019-02-06 > 24,fishfinger,Bean,Pertwee,2019-02-06 > {code} > If I try
[jira] [Commented] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
[ https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1686#comment-1686 ] Dongjoon Hyun commented on SPARK-26836: --- cc [~dbtsai] > Columns get switched in Spark SQL using Avro backed Hive table if schema > evolves > > > Key: SPARK-26836 > URL: https://issues.apache.org/jira/browse/SPARK-26836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0 > Environment: I tested with Hive and HCatalog which runs on version > 2.3.4 and with Spark 2.3.1 and 2.4 >Reporter: Tamas Nemeth >Priority: Major > Labels: correctness > Attachments: doctors.avro, doctors_evolved.avro, > doctors_evolved.json, original.avsc > > > I have a hive avro table where the avro schema is stored on s3 next to the > avro files. > In the table definiton the avro.schema.url always points to the latest > partition's _schema.avsc file which is always the lates schema. (Avro schemas > are backward and forward compatible in a table) > When new data comes in, I always add a new partition where the > avro.schema.url properties also set to the _schema.avsc which was used when > it was added and of course I always update the table avro.schema.url property > to the latest one. > Querying this table works fine until the schema evolves in a way that a new > optional property is added in the middle. > When this happens then after the spark sql query the columns in the old > partition gets mixed up and it shows the wrong data for the columns. > If I query the table with Hive then everything is perfectly fine and it gives > me back the correct columns for the partitions which were created the old > schema and for the new which was created the evolved schema. > > Here is how I could reproduce with the > [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro] > example data in sql test suite. > # I have created two partition folder: > {code:java} > [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/ > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors > /dt=2019-02-05/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-05/doctors.avro > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors > /dt=2019-02-06/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-06/doctors_evolved.avro{code} > Here the first partition had data which was created with the schema before > evolving and the second one had the evolved one. (the evolved schema is the > same as in your testcase except I moved the extra_field column to the last > from the second and I generated two lines of avro data with the evolved > schema. > # I have created a hive table with the following command: > > {code:java} > CREATE EXTERNAL TABLE `default.doctors` > PARTITIONED BY ( > `dt` string > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > WITH SERDEPROPERTIES ( > 'avro.schema.url'='s3://somelocation/doctors/ > /dt=2019-02-06/_schema.avsc') > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > LOCATION > 's3://somelocation/doctors/' > TBLPROPERTIES ( > 'transient_lastDdlTime'='1538130975'){code} > > Here as you can see the table schema url points to the latest schema > 3. I ran an msck _repair table_ to pick up all the partitions. > Fyi: If I run my select * query from here then everything is fine and no > columns switch happening. > 4. Then I changed the first partition's avro.schema.url url to points to the > schema which is under the partition folder (non-evolved one -> > s3://somelocation/doctors/ > /dt=2019-02-05/_schema.avsc) > Then if you ran a _select * from default.spark_test_ then the columns will be > mixed up (on the data below the first name column becomes the extra_field > column. I guess because in the latest schema it is the second column): > > {code:java} > number,extra_field,first_name,last_name,dt > 6,Colin,Baker,null,2019-02-05 > 3,Jon,Pertwee,null,2019-02-05 > 4,Tom,Baker,null,2019-02-05 > 5,Peter,Davison,null,2019-02-05 > 11,Matt,Smith,null,2019-02-05 > 1,William,Hartnell,null,2019-02-05 > 7,Sylvester,McCoy,null,2019-02-05 > 8,Paul,McGann,null,2019-02-05 > 2,Patrick,Troughton,null,2019-02-05 > 9,Christopher,Eccleston,null,2019-02-05 > 10,David,Tennant,null,2019-02-05 > 21,fishfinger,Jim,Baker,2019-02-06 > 24,fishfinger,Bean,Pertwee,2019-02-06 > {code} > If I try the s
[jira] [Commented] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
[ https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868414#comment-16868414 ] david ravet commented on SPARK-26836: - Hi, We encounter the same issue with Spark 2.2.0 when reading avro from a partition where avro.schema.url point to a new schema (forward compatibility) but stored avro files were in older schema. > Columns get switched in Spark SQL using Avro backed Hive table if schema > evolves > > > Key: SPARK-26836 > URL: https://issues.apache.org/jira/browse/SPARK-26836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0 > Environment: I tested with Hive and HCatalog which runs on version > 2.3.4 and with Spark 2.3.1 and 2.4 >Reporter: Tamas Nemeth >Priority: Major > Labels: correctness > Attachments: doctors.avro, doctors_evolved.avro, > doctors_evolved.json, original.avsc > > > I have a hive avro table where the avro schema is stored on s3 next to the > avro files. > In the table definiton the avro.schema.url always points to the latest > partition's _schema.avsc file which is always the lates schema. (Avro schemas > are backward and forward compatible in a table) > When new data comes in, I always add a new partition where the > avro.schema.url properties also set to the _schema.avsc which was used when > it was added and of course I always update the table avro.schema.url property > to the latest one. > Querying this table works fine until the schema evolves in a way that a new > optional property is added in the middle. > When this happens then after the spark sql query the columns in the old > partition gets mixed up and it shows the wrong data for the columns. > If I query the table with Hive then everything is perfectly fine and it gives > me back the correct columns for the partitions which were created the old > schema and for the new which was created the evolved schema. > > Here is how I could reproduce with the > [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro] > example data in sql test suite. > # I have created two partition folder: > {code:java} > [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/ > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors > /dt=2019-02-05/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-05/doctors.avro > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors > /dt=2019-02-06/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-06/doctors_evolved.avro{code} > Here the first partition had data which was created with the schema before > evolving and the second one had the evolved one. (the evolved schema is the > same as in your testcase except I moved the extra_field column to the last > from the second and I generated two lines of avro data with the evolved > schema. > # I have created a hive table with the following command: > > {code:java} > CREATE EXTERNAL TABLE `default.doctors` > PARTITIONED BY ( > `dt` string > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > WITH SERDEPROPERTIES ( > 'avro.schema.url'='s3://somelocation/doctors/ > /dt=2019-02-06/_schema.avsc') > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > LOCATION > 's3://somelocation/doctors/' > TBLPROPERTIES ( > 'transient_lastDdlTime'='1538130975'){code} > > Here as you can see the table schema url points to the latest schema > 3. I ran an msck _repair table_ to pick up all the partitions. > Fyi: If I run my select * query from here then everything is fine and no > columns switch happening. > 4. Then I changed the first partition's avro.schema.url url to points to the > schema which is under the partition folder (non-evolved one -> > s3://somelocation/doctors/ > /dt=2019-02-05/_schema.avsc) > Then if you ran a _select * from default.spark_test_ then the columns will be > mixed up (on the data below the first name column becomes the extra_field > column. I guess because in the latest schema it is the second column): > > {code:java} > number,extra_field,first_name,last_name,dt > 6,Colin,Baker,null,2019-02-05 > 3,Jon,Pertwee,null,2019-02-05 > 4,Tom,Baker,null,2019-02-05 > 5,Peter,Davison,null,2019-02-05 > 11,Matt,Smith,null,2019-02-05 > 1,William,Hartnell,null,2019-02-05 > 7,Sylvester,McCoy,null,2019-02-05 > 8,Paul,McGann,null,2019-02-05 > 2,Patrick,Troughton,null,2019-02-0
[jira] [Commented] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
[ https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764965#comment-16764965 ] Tamas Nemeth commented on SPARK-26836: -- And one more thing. I also got this warning which I did not get if I run on the table where the partitions do not contain the avro.schema.url property. {code:java} 19/02/11 14:39:40 WARN AvroDeserializer: Received different schemas. Have to re-encode: {"type":"record","name":"doctors","namespace":"testing.hive.avro.serde","fields":[{"name":"number","type":"int","doc":"Order of playing the role"},{"name":"extra_field","type":"string","default":"fishfingers and custard","doc:":"an extra field not in the original file"},{"name":"first_name","type":"string","doc":"first name of actor playing role"},{"name":"last_name","type":"string","doc":"last name of actor playing role"}]} SIZE{-3ac2eea4:168dcc8e145:-8000=org.apache.hadoop.hive.serde2.avro.AvroDeserializer$SchemaReEncoder@1429dfec} ID -3ac2eea4:168dcc8e145:-8000{code} > Columns get switched in Spark SQL using Avro backed Hive table if schema > evolves > > > Key: SPARK-26836 > URL: https://issues.apache.org/jira/browse/SPARK-26836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0 > Environment: I tested with Hive and HCatalog which runs on version > 2.3.4 and with Spark 2.3.1 and 2.4 >Reporter: Tamas Nemeth >Priority: Major > Labels: correctness > Attachments: doctors.avro, doctors_evolved.avro, > doctors_evolved.json, original.avsc > > > I have a hive avro table where the avro schema is stored on s3 next to the > avro files. > In the table definiton the avro.schema.url always points to the latest > partition's _schema.avsc file which is always the lates schema. (Avro schemas > are backward and forward compatible in a table) > When new data comes in, I always add a new partition where the > avro.schema.url properties also set to the _schema.avsc which was used when > it was added and of course I always update the table avro.schema.url property > to the latest one. > Querying this table works fine until the schema evolves in a way that a new > optional property is added in the middle. > When this happens then after the spark sql query the columns in the old > partition gets mixed up and it shows the wrong data for the columns. > If I query the table with Hive then everything is perfectly fine and it gives > me back the correct columns for the partitions which were created the old > schema and for the new which was created the evolved schema. > > Here is how I could reproduce with the > [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro] > example data in sql test suite. > # I have created two partition folder: > {code:java} > [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/ > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors > /dt=2019-02-05/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-05/doctors.avro > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors > /dt=2019-02-06/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-06/doctors_evolved.avro{code} > Here the first partition had data which was created with the schema before > evolving and the second one had the evolved one. (the evolved schema is the > same as in your testcase except I moved the extra_field column to the last > from the second and I generated two lines of avro data with the evolved > schema. > # I have created a hive table with the following command: > > {code:java} > CREATE EXTERNAL TABLE `default.doctors` > PARTITIONED BY ( > `dt` string > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > WITH SERDEPROPERTIES ( > 'avro.schema.url'='s3://somelocation/doctors/ > /dt=2019-02-06/_schema.avsc') > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > LOCATION > 's3://somelocation/doctors/' > TBLPROPERTIES ( > 'transient_lastDdlTime'='1538130975'){code} > > Here as you can see the table schema url points to the latest schema > 3. I ran an msck _repair table_ to pick up all the partitions. > Fyi: If I run my select * query from here then everything is fine and no > columns switch happening. > 4. Then I changed the first partition's avro.schema.url url to points to the > schema which is under the partition folder (non-evolved one -> > s3://somelocation/doct
[jira] [Commented] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
[ https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764933#comment-16764933 ] Tamas Nemeth commented on SPARK-26836: -- On hive directly the query returns with the correct resultset regardless if _avro.schema.url_ is set for the partitions or not : {code:java} hive> select * from spark_test; OK 6 fishfingers and custard Colin Baker 2019-02-05 3 fishfingers and custard Jon Pertwee 2019-02-05 4 fishfingers and custard Tom Baker 2019-02-05 5 fishfingers and custard Peter Davison 2019-02-05 11 fishfingers and custard Matt Smith 2019-02-05 1 fishfingers and custard William Hartnell 2019-02-05 7 fishfingers and custard Sylvester McCoy 2019-02-05 8 fishfingers and custard Paul McGann 2019-02-05 2 fishfingers and custard Patrick Troughton 2019-02-05 9 fishfingers and custard Christopher Eccleston 2019-02-05 10 fishfingers and custard David Tennant 2019-02-05 21 fishfinger Jim Baker 2019-02-06 24 fishfinger Bean Pertwee 2019-02-06 Time taken: 4.291 seconds, Fetched: 13 row(s){code} > Columns get switched in Spark SQL using Avro backed Hive table if schema > evolves > > > Key: SPARK-26836 > URL: https://issues.apache.org/jira/browse/SPARK-26836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0 > Environment: I tested with Hive and HCatalog which runs on version > 2.3.4 and with Spark 2.3.1 and 2.4 >Reporter: Tamas Nemeth >Priority: Major > Labels: correctness > Attachments: doctors.avro, doctors_evolved.avro, > doctors_evolved.json, original.avsc > > > I have a hive avro table where the avro schema is stored on s3 next to the > avro files. > In the table definiton the avro.schema.url always points to the latest > partition's _schema.avsc file which is always the lates schema. (Avro schemas > are backward and forward compatible in a table) > When new data comes in, I always add a new partition where the > avro.schema.url properties also set to the _schema.avsc which was used when > it was added and of course I always update the table avro.schema.url property > to the latest one. > Querying this table works fine until the schema evolves in a way that a new > optional property is added in the middle. > When this happens then after the spark sql query the columns in the old > partition gets mixed up and it shows the wrong data for the columns. > If I query the table with Hive then everything is perfectly fine and it gives > me back the correct columns for the partitions which were created the old > schema and for the new which was created the evolved schema. > > Here is how I could reproduce with the > [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro] > example data in sql test suite. > # I have created two partition folder: > {code:java} > [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/ > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors > /dt=2019-02-05/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-05/doctors.avro > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors > /dt=2019-02-06/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-06/doctors_evolved.avro{code} > Here the first partition had data which was created with the schema before > evolving and the second one had the evolved one. (the evolved schema is the > same as in your testcase except I moved the extra_field column to the last > from the second and I generated two lines of avro data with the evolved > schema. > # I have created a hive table with the following command: > > {code:java} > CREATE EXTERNAL TABLE `default.doctors` > PARTITIONED BY ( > `dt` string > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > WITH SERDEPROPERTIES ( > 'avro.schema.url'='s3://somelocation/doctors/ > /dt=2019-02-06/_schema.avsc') > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > LOCATION > 's3://somelocation/doctors/' > TBLPROPERTIES ( > 'transient_lastDdlTime'='1538130975'){code} > > Here as you can see the table schema url points to the latest schema > 3. I ran an msck _repair table_ to pick up all the partitions. > Fyi: If I run my select * query from here then everything is fine and no > columns switch happening. > 4. Then I changed the first partition's avro.schema.url url to points to the > schema which is under the partition folder (non-evolved one
[jira] [Commented] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
[ https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764719#comment-16764719 ] Gengliang Wang commented on SPARK-26836: [~dongjoon]I don't think the issue is related to spark-avro lib. [~treff7es] Can you try reproducing the issue on Hive directly and see what the behavior is? > Columns get switched in Spark SQL using Avro backed Hive table if schema > evolves > > > Key: SPARK-26836 > URL: https://issues.apache.org/jira/browse/SPARK-26836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0 > Environment: I tested with Hive and HCatalog which runs on version > 2.3.4 and with Spark 2.3.1 and 2.4 >Reporter: Tamas Nemeth >Priority: Major > Labels: correctness > Attachments: doctors.avro, doctors_evolved.avro, > doctors_evolved.json, original.avsc > > > I have a hive avro table where the avro schema is stored on s3 next to the > avro files. > In the table definiton the avro.schema.url always points to the latest > partition's _schema.avsc file which is always the lates schema. (Avro schemas > are backward and forward compatible in a table) > When new data comes in, I always add a new partition where the > avro.schema.url properties also set to the _schema.avsc which was used when > it was added and of course I always update the table avro.schema.url property > to the latest one. > Querying this table works fine until the schema evolves in a way that a new > optional property is added in the middle. > When this happens then after the spark sql query the columns in the old > partition gets mixed up and it shows the wrong data for the columns. > If I query the table with Hive then everything is perfectly fine and it gives > me back the correct columns for the partitions which were created the old > schema and for the new which was created the evolved schema. > > Here is how I could reproduce with the > [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro] > example data in sql test suite. > # I have created two partition folder: > {code:java} > [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/ > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors > /dt=2019-02-05/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-05/doctors.avro > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors > /dt=2019-02-06/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-06/doctors_evolved.avro{code} > Here the first partition had data which was created with the schema before > evolving and the second one had the evolved one. (the evolved schema is the > same as in your testcase except I moved the extra_field column to the last > from the second and I generated two lines of avro data with the evolved > schema. > # I have created a hive table with the following command: > > {code:java} > CREATE EXTERNAL TABLE `default.doctors` > PARTITIONED BY ( > `dt` string > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > WITH SERDEPROPERTIES ( > 'avro.schema.url'='s3://somelocation/doctors/ > /dt=2019-02-06/_schema.avsc') > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > LOCATION > 's3://somelocation/doctors/' > TBLPROPERTIES ( > 'transient_lastDdlTime'='1538130975'){code} > > Here as you can see the table schema url points to the latest schema > 3. I ran an msck _repair table_ to pick up all the partitions. > Fyi: If I run my select * query from here then everything is fine and no > columns switch happening. > 4. Then I changed the first partition's avro.schema.url url to points to the > schema which is under the partition folder (non-evolved one -> > s3://somelocation/doctors/ > /dt=2019-02-05/_schema.avsc) > Then if you ran a _select * from default.spark_test_ then the columns will be > mixed up (on the data below the first name column becomes the extra_field > column. I guess because in the latest schema it is the second column): > > {code:java} > number,extra_field,first_name,last_name,dt > 6,Colin,Baker,null,2019-02-05 > 3,Jon,Pertwee,null,2019-02-05 > 4,Tom,Baker,null,2019-02-05 > 5,Peter,Davison,null,2019-02-05 > 11,Matt,Smith,null,2019-02-05 > 1,William,Hartnell,null,2019-02-05 > 7,Sylvester,McCoy,null,2019-02-05 > 8,Paul,McGann,null,2019-02-05 > 2,Patrick,Troughton,null,2019-02-05 > 9,Christopher,Eccleston,null,20
[jira] [Commented] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
[ https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763625#comment-16763625 ] Tamas Nemeth commented on SPARK-26836: -- In the meantime I checked if using the same hive version in Spark would solve this issue by setting: {code:java} spark.sql.hive.metastore.version 2.3.3 spark.sql.hive.metastore.jars /etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*::/etc/tez/conf:/usr/lib/tez/*:/usr/lib/tez/lib/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/lib/hive/lib/* {code} The issue still exists regardless the hive metastore version spark is using. > Columns get switched in Spark SQL using Avro backed Hive table if schema > evolves > > > Key: SPARK-26836 > URL: https://issues.apache.org/jira/browse/SPARK-26836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0 > Environment: I tested with Hive and HCatalog which runs on version > 2.3.4 and with Spark 2.3.1 and 2.4 >Reporter: Tamas Nemeth >Priority: Major > Labels: correctness > Attachments: doctors.avro, doctors_evolved.avro, > doctors_evolved.json, original.avsc > > > I have a hive avro table where the avro schema is stored on s3 next to the > avro files. > In the table definiton the avro.schema.url always points to the latest > partition's _schema.avsc file which is always the lates schema. (Avro schemas > are backward and forward compatible in a table) > When new data comes in, I always add a new partition where the > avro.schema.url properties also set to the _schema.avsc which was used when > it was added and of course I always update the table avro.schema.url property > to the latest one. > Querying this table works fine until the schema evolves in a way that a new > optional property is added in the middle. > When this happens then after the spark sql query the columns in the old > partition gets mixed up and it shows the wrong data for the columns. > If I query the table with Hive then everything is perfectly fine and it gives > me back the correct columns for the partitions which were created the old > schema and for the new which was created the evolved schema. > > Here is how I could reproduce with the > [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro] > example data in sql test suite. > # I have created two partition folder: > {code:java} > [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/ > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors > /dt=2019-02-05/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-05/doctors.avro > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors > /dt=2019-02-06/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-06/doctors_evolved.avro{code} > Here the first partition had data which was created with the schema before > evolving and the second one had the evolved one. (the evolved schema is the > same as in your testcase except I moved the extra_field column to the last > from the second and I generated two lines of avro data with the evolved > schema. > # I have created a hive table with the following command: > > {code:java} > CREATE EXTERNAL TABLE `default.doctors` > PARTITIONED BY ( > `dt` string > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > WITH SERDEPROPERTIES ( > 'avro.schema.url'='s3://somelocation/doctors/ > /dt=2019-02-06/_schema.avsc') > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > LOCATION > 's3://somelocation/doctors/' > TBLPROPERTIES ( > 'transient_lastDdlTime'='1538130975'){code} > > Here as you can see the table schema url points to the latest schema > 3. I ran an msck _repair table_ to pick up all the partitions. > Fyi: If I run my select * query from here then everything is fine and no > columns switch happening. >
[jira] [Commented] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
[ https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763438#comment-16763438 ] Tamas Nemeth commented on SPARK-26836: -- I just tried with Spark 2.3.1 and the same issue. I did not specify any Avro library because it is a Hive Avro table originally. I guess it is using Hive's Avro serde for serialization/deserialization (or maybe I'm not right :)). > Columns get switched in Spark SQL using Avro backed Hive table if schema > evolves > > > Key: SPARK-26836 > URL: https://issues.apache.org/jira/browse/SPARK-26836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: I tested with Hive and HCatalog version of 2.3.4 on > Spark 2.4 >Reporter: Tamas Nemeth >Priority: Major > Labels: correctness > Attachments: doctors.avro, doctors_evolved.avro, > doctors_evolved.json, original.avsc > > > I have a hive avro table where the avro schema is stored on s3 next to the > avro files. > In the table definiton the avro.schema.url always points to the latest > partition's _schema.avsc file which is always the lates schema. (Avro schemas > are backward and forward compatible in a table) > When new data comes in, I always add a new partition where the > avro.schema.url properties also set to the _schema.avsc which was used when > it was added and of course I always update the table avro.schema.url property > to the latest one. > Querying this table works fine until the schema evolves in a way that a new > optional property is added in the middle. > When this happens then after the spark sql query the columns in the old > partition gets mixed up and it shows the wrong data for the columns. > If I query the table with Hive then everything is perfectly fine and it gives > me back the correct columns for the partitions which were created the old > schema and for the new which was created the evolved schema. > > Here is how I could reproduce with the > [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro] > example data in sql test suite. > # I have created two partition folder: > {code:java} > [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/ > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors > /dt=2019-02-05/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-05/doctors.avro > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors > /dt=2019-02-06/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-06/doctors_evolved.avro{code} > Here the first partition had data which was created with the schema before > evolving and the second one had the evolved one. (the evolved schema is the > same as in your testcase except I moved the extra_field column to the last > from the second and I generated two lines of avro data with the evolved > schema. > # I have created a hive table with the following command: > > {code:java} > CREATE EXTERNAL TABLE `default.doctors` > PARTITIONED BY ( > `dt` string > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > WITH SERDEPROPERTIES ( > 'avro.schema.url'='s3://somelocation/doctors/ > /dt=2019-02-06/_schema.avsc') > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > LOCATION > 's3://somelocation/doctors/' > TBLPROPERTIES ( > 'transient_lastDdlTime'='1538130975'){code} > > Here as you can see the table schema url points to the latest schema > 3. I ran an msck _repair table_ to pick up all the partitions. > Fyi: If I run my select * query from here then everything is fine and no > columns switch happening. > 4. Then I changed the first partition's avro.schema.url url to points to the > schema which is under the partition folder (non-evolved one -> > s3://somelocation/doctors/ > /dt=2019-02-05/_schema.avsc) > Then if you ran a _select * from default.spark_test_ then the columns will be > mixed up (on the data below the first name column becomes the extra_field > column. I guess because in the latest schema it is the second column): > > {code:java} > number,extra_field,first_name,last_name,dt > 6,Colin,Baker,null,2019-02-05 > 3,Jon,Pertwee,null,2019-02-05 > 4,Tom,Baker,null,2019-02-05 > 5,Peter,Davison,null,2019-02-05 > 11,Matt,Smith,null,2019-02-05 > 1,William,Hartnell,null,2019-02-05 > 7,Sylvester,McCoy,null,2019-02-05 > 8,Paul,McGann,null,2019-02-05 > 2,Patrick,Troughton,null,2019-02-05
[jira] [Commented] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
[ https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763393#comment-16763393 ] Dongjoon Hyun commented on SPARK-26836: --- Thank you for reporting, [~treff7es]. Did you see this issue before at Spark 2.3.x with Databricks Avro libraries? cc [~Gengliang.Wang] > Columns get switched in Spark SQL using Avro backed Hive table if schema > evolves > > > Key: SPARK-26836 > URL: https://issues.apache.org/jira/browse/SPARK-26836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: I tested with Hive and HCatalog version of 2.3.4 on > Spark 2.4 >Reporter: Tamas Nemeth >Priority: Major > Attachments: doctors.avro, doctors_evolved.avro, > doctors_evolved.json, original.avsc > > > I have a hive avro table where the avro schema is stored on s3 next to the > avro files. > In the table definiton the avro.schema.url always points to the latest > partition's _schema.avsc file which is always the lates schema. (Avro schemas > are backward and forward compatible in a table) > When new data comes in, I always add a new partition where the > avro.schema.url properties also set to the _schema.avsc which was used when > it was added and of course I always update the table avro.schema.url property > to the latest one. > Querying this table works fine until the schema evolves in a way that a new > optional property is added in the middle. > When this happens then after the spark sql query the columns in the old > partition gets mixed up and it shows the wrong data for the columns. > If I query the table with Hive then everything is perfectly fine and it gives > me back the correct columns for the partitions which were created the old > schema and for the new which was created the evolved schema. > > Here is how I could reproduce with the > [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro] > example data in sql test suite. > # I have created two partition folder: > {code:java} > [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/ > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors > /dt=2019-02-05/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-05/doctors.avro > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors > /dt=2019-02-06/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-06/doctors_evolved.avro{code} > Here the first partition had data which was created with the schema before > evolving and the second one had the evolved one. (the evolved schema is the > same as in your testcase except I moved the extra_field column to the last > from the second and I generated two lines of avro data with the evolved > schema. > # I have created a hive table with the following command: > > {code:java} > CREATE EXTERNAL TABLE `default.doctors` > PARTITIONED BY ( > `dt` string > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > WITH SERDEPROPERTIES ( > 'avro.schema.url'='s3://somelocation/doctors/ > /dt=2019-02-06/_schema.avsc') > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > LOCATION > 's3://somelocation/doctors/' > TBLPROPERTIES ( > 'transient_lastDdlTime'='1538130975'){code} > > Here as you can see the table schema url points to the latest schema > 3. I ran an msck _repair table_ to pick up all the partitions. > Fyi: If I run my select * query from here then everything is fine and no > columns switch happening. > 4. Then I changed the first partition's avro.schema.url url to points to the > schema which is under the partition folder (non-evolved one -> > s3://somelocation/doctors/ > /dt=2019-02-05/_schema.avsc) > Then if you ran a _select * from default.spark_test_ then the columns will be > mixed up (on the data below the first name column becomes the extra_field > column. I guess because in the latest schema it is the second column): > > {code:java} > number,extra_field,first_name,last_name,dt > 6,Colin,Baker,null,2019-02-05 > 3,Jon,Pertwee,null,2019-02-05 > 4,Tom,Baker,null,2019-02-05 > 5,Peter,Davison,null,2019-02-05 > 11,Matt,Smith,null,2019-02-05 > 1,William,Hartnell,null,2019-02-05 > 7,Sylvester,McCoy,null,2019-02-05 > 8,Paul,McGann,null,2019-02-05 > 2,Patrick,Troughton,null,2019-02-05 > 9,Christopher,Eccleston,null,2019-02-05 > 10,David,Tennant,null,2019-02-05 > 21,fishfinger,Jim,Baker,2019-02-06 > 24,fish