Hey folks, I'm trying to understand the current behavior of tables that contain partitions of mixed format, specifically when one or more partitions is stored as Avro. Impala seems to be doing a number of things which I find surprising, and I'm not sure if they are intentional or should be considered bugs.

*Surprise 1*: the _presence_ of an Avro-formatted partition can change the table schema

https://gist.github.com/74bdef8a69b558763e4453ac21313649

- create a table that is Parquet-formatted, but with an 'avro.schema.url' property
- the Avro schema is ignored, and we see whatever schema we specified (*makes sense, because the table is Parquet*)
- add a partition
- set the new partition's format to Avro
- refresh the table
- the schema for the table now reflects the Avro schema, because it has at least one Avro partition

*Surprise 2*: the above is inconsistent with Hive and Spark

Hive seems to still reflect the table-level defined schema, and ignore the avro.schema.url property in this mixed scenario. That is to say, with the state set up by the above, we have the following behavior:

Impala:
- uses the external Avro schema for all table-level info, SELECT *, etc.
- "compute stats" detects the inconsistency and tells the user to recreate the table.
- if some existing partitions (e.g. in Parquet) aren't compatible with that Avro schema, the backend returns errors saying that columns are missing from the Parquet data files

Hive:
- uses the table-level schema defined in the HMS for describe, etc.
- queries like 'select *' again use the table-level HMS schema. The underlying reader that reads the Avro partition seems to use the defined external Avro schema, resulting in nulls for missing columns.
- computing stats (analyze table mixedtable partition (y=1) compute stats for columns) seems to end up only recording stats against the columns defined in the table-level schema.

Spark:
- DESCRIBE TABLE shows the table-level info
- select * fails, because apparently Spark doesn't support multi-format tables at all (it tries to read the Avro files as Parquet files)

It seems to me that Hive's behavior is a bit better. *I'd like to propose we treat this as a bug and move to the following behavior:*

- if a table's properties indicate it's an Avro table, parse and adopt the external Avro schema as the table schema
- if a table's properties indicate it's _not_ an Avro table, but there is an external Avro schema defined in the table properties, then parse the Avro schema and include it in the TableDescriptor (for use by Avro partitions) but do not adopt it as the table schema.

The added benefit of the above proposal (and the reason why I started looking into this in the first place) is that, in order to service a simple query like DESCRIBE, our current behavior requires all partition metadata to be loaded to know whether there is any Avro-formatted partition. With the proposed new behavior, we can avoid looking at all partitions. This is important for any metadata design which supports fine-grained loading of metadata to the coordinator.

-Todd
--
Todd Lipcon
Software Engineer, Cloudera
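
For concreteness, the proposed rule could be sketched roughly as follows. This is an illustrative Python sketch only, not Impala code: `ResolvedSchema`, `resolve_schema`, and all of the parameter names are hypothetical, and columns are reduced to plain name lists. The point it tries to show is that the decision depends only on table-level properties, so no partition metadata has to be loaded.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ResolvedSchema:
    # Columns reported for DESCRIBE, SELECT *, etc. (hypothetical field name).
    table_columns: List[str]
    # Columns parsed from the external Avro schema, kept only so that
    # Avro-formatted partitions can be read; None if no Avro schema is set.
    avro_columns: Optional[List[str]]


def resolve_schema(
    table_is_avro: bool,
    hms_columns: List[str],
    avro_schema_columns: Optional[List[str]],
) -> ResolvedSchema:
    """Sketch of the proposed behavior, using table-level inputs only."""
    if avro_schema_columns is None:
        # No external Avro schema in the table properties: use the HMS schema.
        return ResolvedSchema(list(hms_columns), None)
    if table_is_avro:
        # Avro table: adopt the external Avro schema as the table schema.
        return ResolvedSchema(list(avro_schema_columns), list(avro_schema_columns))
    # Non-Avro table with an Avro schema property: keep the HMS schema as the
    # table schema, and carry the Avro schema along for Avro partitions.
    return ResolvedSchema(list(hms_columns), list(avro_schema_columns))


# Parquet table that carries an 'avro.schema.url' property, as in Surprise 1.
mixed = resolve_schema(False, ["x", "y"], ["x", "y", "z"])
print(mixed.table_columns)
print(mixed.avro_columns)
```

Under this sketch, the mixed table from Surprise 1 keeps the schema it was created with, regardless of how many Avro partitions are added later.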
