[ https://issues.apache.org/jira/browse/HIVE-14789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mithun Radhakrishnan reassigned HIVE-14789:
-------------------------------------------

    Assignee: Mithun Radhakrishnan

> Avro Table-reads bork when using SerDe-generated table-schema.
> --------------------------------------------------------------
>
>                 Key: HIVE-14789
>                 URL: https://issues.apache.org/jira/browse/HIVE-14789
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 1.2.1, 2.0.1
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
>
> AvroSerDe allows one to skip the table-columns in a table-definition when
> creating a table, as long as the TBLPROPERTIES include a valid
> {{avro.schema.url}} or {{avro.schema.literal}}. The table-columns are
> inferred by processing the Avro schema file/literal.
> The problem is that the inferred schema might not be congruent with the
> actual schema in the Avro schema file/literal. Consider the following table
> definition:
> {code:sql}
> CREATE TABLE avro_schema_break_1
> ROW FORMAT
> SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> TBLPROPERTIES ('avro.schema.literal'='{
>   "type": "record",
>   "name": "Messages",
>   "namespace": "net.myth",
>   "fields": [
>     {
>       "name": "header",
>       "type": [
>         "null",
>         {
>           "type": "record",
>           "name": "HeaderInfo",
>           "fields": [
>             {
>               "name": "inferred_event_type",
>               "type": [
>                 "null",
>                 "string"
>               ],
>               "default": null
>             },
>             {
>               "name": "event_type",
>               "type": [
>                 "null",
>                 "string"
>               ],
>               "default": null
>             },
>             {
>               "name": "event_version",
>               "type": [
>                 "null",
>                 "string"
>               ],
>               "default": null
>             }
>           ]
>         }
>       ]
>     },
>     {
>       "name": "messages",
>       "type": {
>         "type": "array",
>         "items": {
>           "name": "MessageInfo",
>           "type": "record",
>           "fields": [
>             {
>               "name": "message_id",
>               "type": [
>                 "null",
>                 "string"
>               ],
>               "doc": "Message-ID"
>             },
>             {
>               "name": "received_date",
>               "type": [
>                 "null",
>                 "long"
>               ],
>               "doc": "Received Date"
>             },
>             {
>               "name": "sent_date",
>               "type": [
>                 "null",
>                 "long"
>               ]
>             },
>             {
>               "name": "from_name",
>               "type": [
>                 "null",
>                 "string"
>               ]
>             },
>             {
>               "name": "flags",
>               "type": [
>                 "null",
>                 {
>                   "type": "record",
>                   "name": "Flags",
>                   "fields": [
>                     {
>                       "name": "is_seen",
>                       "type": [
>                         "null",
>                         "boolean"
>                       ],
>                       "default": null
>                     },
>                     {
>                       "name": "is_read",
>                       "type": [
>                         "null",
>                         "boolean"
>                       ],
>                       "default": null
>                     },
>                     {
>                       "name": "is_flagged",
>                       "type": [
>                         "null",
>                         "boolean"
>                       ],
>                       "default": null
>                     }
>                   ]
>                 }
>               ],
>               "default": null
>             }
>           ]
>         }
>       }
>     }
>   ]
> }');
> {code}
> This produces a table with the following schema:
> {noformat}
> 2016-09-19T13:23:42,934 DEBUG [0ce7e586-13ea-4390-ac2a-6dac36e8a216 main]
> hive.log: DDL: struct avro_schema_break_1 {
> struct<inferred_event_type:string,event_type:string,event_version:string>
> header,
> list<struct<message_id:string,received_date:i64,sent_date:i64,from_name:string,flags:struct<is_seen:bool,is_read:bool,is_flagged:bool>>>
> messages}
> {noformat}
> Data written to this table by Pig's {{AvroStorage}}, using the Avro schema
> from {{avro.schema.literal}}, cannot be read by Hive using the generated
> table schema.
> This is the exception one sees:
> {noformat}
> java.io.IOException: org.apache.avro.AvroTypeException: Found net.myth.HeaderInfo, expecting union
>   at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:521)
>   at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
>   at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
>   at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2019)
>   at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
>   at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
>   at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:1162)
>   at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:1136)
>   at org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:172)
>   at org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:104)
>   at org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver(TestCliDriver.java:59)
>   ...
> {noformat}
> The only way to read this table is by using the attached
> {{avro.schema.literal}} or {{avro.schema.url}}. This has implications for
> systems where data could be produced externally to Hive. It also has
> repercussions for table-replication using Falcon/GDM, in that the schema
> file/literal needs to be replicated.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
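
The description above says the inferred table schema might not be congruent with the Avro schema it came from. One way to picture this is the minimal Python sketch below. It is not Hive's AvroSerDe code: `hive_type` is a hypothetical helper, the schema is a trimmed-down version of the one in the report, and the mapping (dropping the `"null"` branch of a two-branch union) is an assumption based on the DDL shown in the issue. Under that assumption, a nullable record and a non-nullable record map to the same column type, so the original union structure cannot be recovered from the column type alone.

```python
# Simplified illustration only; this is NOT Hive's AvroSerDe implementation.
# Assumed primitive mapping from Avro type names to Hive type names.
PRIMITIVES = {
    "string": "string",
    "long": "bigint",
    "boolean": "boolean",
}


def hive_type(avro_type):
    """Map a small subset of Avro schemas to a Hive type string (hypothetical helper)."""
    if isinstance(avro_type, str):
        return PRIMITIVES[avro_type]
    if isinstance(avro_type, list):
        # A union: collapse ["null", T] to T, which discards nullability.
        branches = [t for t in avro_type if t != "null"]
        if len(branches) != 1:
            raise ValueError('this sketch only handles ["null", T] unions')
        return hive_type(branches[0])
    kind = avro_type["type"]
    if kind == "record":
        fields = ",".join(
            f"{f['name']}:{hive_type(f['type'])}" for f in avro_type["fields"]
        )
        return f"struct<{fields}>"
    if kind == "array":
        return f"array<{hive_type(avro_type['items'])}>"
    # e.g. {"type": "string"}
    return hive_type(kind)


# Trimmed-down HeaderInfo record from the issue's schema.
header_info = {
    "type": "record",
    "name": "HeaderInfo",
    "fields": [
        {"name": "event_type", "type": ["null", "string"], "default": None},
    ],
}

nullable_header = ["null", header_info]  # as declared in the issue
required_header = header_info            # a different Avro schema

# Both Avro schemas yield the same Hive column type.
print(hive_type(nullable_header))  # struct<event_type:string>
print(hive_type(required_header))  # struct<event_type:string>
```

Because the two inputs are indistinguishable after the mapping, any Avro schema regenerated from the column types has to guess where unions belong. That is consistent with the report's point that the original schema file/literal must travel with the table.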