Inconsistent handling of schema in Avro tables

Todd Lipcon Wed, 11 Jul 2018 16:51:42 -0700

Hey folks,

I'm trying to understand the current behavior of tables that contain
partitions of mixed format, specifically when one or more partitions are
stored as Avro. Impala seems to be doing a number of things which I find
surprising, and I'm not sure if they are intentional or should be
considered bugs.


*Surprise 1*: the _presence_ of an Avro-formatted partition can change the
table schema
https://gist.github.com/74bdef8a69b558763e4453ac21313649

- create a table that is Parquet-formatted, but with an 'avro.schema.url'
property
- the Avro schema is ignored, and we see whatever schema we specified (*makes
sense, because the table is Parquet*)
- add a partition
- set the new partition's format to Avro
- refresh the table
- the schema for the table now reflects the Avro schema, because it has at
least one Avro partition

*Surprise 2*: the above is inconsistent with Hive and Spark

Hive seems to still reflect the table-level defined schema, and ignore the
avro.schema.url property in this mixed scenario. That is to say, with the
state set up by the above, we have the following behavior:

Impala:
- uses the external avro schema for all table-level info, SELECT *, etc.
- "compute stats" detects the inconsistency and tells the user to recreate
the table.
- if some existing partitions (e.g. in Parquet) aren't compatible with that
avro schema, the backend returns errors that there are missing columns
in the Parquet data files

Hive:
- uses the table-level schema defined in the HMS for describe, etc
- queries like 'select *' again use the table-level HMS schema. The
underlying reader that reads the Avro partition seems to use the defined
external Avro schema, resulting in nulls for missing columns.
- computing stats (analyze table mixedtable partition (y=1) compute stats
for columns) seems to end up only recording stats against the column
defined in the table-level schema.

Spark:
- DESCRIBE TABLE shows the table-level info
- select * fails, because apparently Spark doesn't support multi-format
tables at all (it tries to read the Avro files as Parquet files)


It seems to me that Hive's behavior is a bit better. *I'd like to propose
we treat this as a bug and move to the following behavior:*

- if a table's properties indicate it's an avro table, parse and adopt the
external avro schema as the table schema
- if a table's properties indicate it's _not_ an avro table, but there is
an external avro schema defined in the table properties, then parse the
avro schema and include it in the TableDescriptor (for use by avro
partitions) but do not adopt it as the table schema.
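
As a rough illustration of the two rules above, here is a minimal Python
sketch contrasting the current behavior described earlier with the proposed
one. Everything in it is an assumption made for illustration: the Table
class, its fields (file_format, hms_columns, partition_formats), the
load_avro_columns helper, the schema URL, and the column names are all
hypothetical and are not Impala's actual catalog code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for an external schema store (URL -> column names).
FAKE_SCHEMAS: Dict[str, List[str]] = {
    "hdfs:///schemas/example.avsc": ["a", "b"],
}


@dataclass
class Table:
    """Simplified, hypothetical view of a table's catalog metadata."""
    file_format: str                  # table-level format, e.g. "PARQUET"
    hms_columns: List[str]            # columns declared in the HMS
    properties: Dict[str, str] = field(default_factory=dict)
    partition_formats: List[str] = field(default_factory=list)


def load_avro_columns(url: str) -> List[str]:
    """Placeholder for fetching and parsing the external Avro schema."""
    return FAKE_SCHEMAS[url]


def current_schema(t: Table) -> List[str]:
    """Behavior as described above: any Avro partition flips the schema.

    Note that this has to inspect every partition's format.
    """
    url = t.properties.get("avro.schema.url")
    if url and (t.file_format == "AVRO" or "AVRO" in t.partition_formats):
        return load_avro_columns(url)
    return t.hms_columns


def proposed_schema(t: Table) -> Tuple[List[str], Optional[List[str]]]:
    """Proposed behavior: returns (table schema, Avro schema for partitions).

    Only table-level metadata is consulted; partitions are never examined.
    """
    url = t.properties.get("avro.schema.url")
    avro_cols = load_avro_columns(url) if url else None
    if t.file_format == "AVRO" and avro_cols is not None:
        # Avro table: adopt the external schema as the table schema.
        return avro_cols, avro_cols
    # Non-Avro table: keep the HMS schema, carry the Avro schema alongside.
    return t.hms_columns, avro_cols


mixed = Table(
    file_format="PARQUET",
    hms_columns=["x"],
    properties={"avro.schema.url": "hdfs:///schemas/example.avsc"},
    partition_formats=["PARQUET", "AVRO"],
)
print(current_schema(mixed))
print(proposed_schema(mixed))
```

In this sketch proposed_schema never reads partition_formats, which is the
property that matters for the metadata-loading point below.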

The added benefit of the above proposal (and the reason why I started
looking into this in the first place) is that, in order to service a simple
query like DESCRIBE, our current behavior requires all partition metadata
to be loaded to know whether there is any avro-formatted partition. With
the proposed new behavior, we can avoid looking at all partitions. This is
important for any metadata design which supports fine-grained loading of
metadata to the coordinator.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera
