Re: Inconsistent handling of schema in Avro tables

Todd Lipcon Wed, 11 Jul 2018 21:32:57 -0700

Turns out it's even a bit more messy. The presence of one or more avro
partitions can change the types of existing columns, even if there is no
explicit avro schema specified for the table:
https://gist.github.com/5018d6ff50f846c72762319eb7cf5ca8


Not quite sure how to handle this one in a world where we don't load all of
the partitions up front. Perhaps the best approach is to just throw an
error and then provide a command for the user to "re-sync" the schema to
the appropriate avro-supported types? Hive provides ALTER TABLE <foo>
UPDATE COLUMNS for something like this, though still I don't think that
would iterate over all partitions in the case of a mixed table.
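
A minimal sketch of the "throw an error, then re-sync" idea, assuming only the
partition being read is inspected. Every name here is hypothetical and nothing
in it is Impala or Hive code. The promotion table in particular is an
illustrative assumption and is not taken from the gist.

```python
# Hypothetical sketch; none of these names exist in Impala or Hive.
# ASSUMPTION: illustrative promotions only. The real set of type changes is
# whatever the gist above shows, which is not reproduced here.
ASSUMED_AVRO_PROMOTIONS = {"TINYINT": "INT", "SMALLINT": "INT"}


class SchemaOutOfSyncError(Exception):
    """Raised when an Avro partition would silently change column types."""


def avro_supported_type(declared_type):
    """Return the type an Avro partition would expose for a declared type."""
    normalized = declared_type.upper()
    return ASSUMED_AVRO_PROMOTIONS.get(normalized, normalized)


def find_mismatches(columns):
    """Map column name -> (declared, avro_supported) for columns that differ."""
    mismatches = {}
    for name, declared in columns.items():
        supported = avro_supported_type(declared)
        if supported != declared.upper():
            mismatches[name] = (declared.upper(), supported)
    return mismatches


def check_partition(columns, partition_format):
    """Fail loudly rather than rewrite table-level types implicitly."""
    if partition_format.upper() != "AVRO":
        return  # non-Avro partitions never trigger the check
    mismatches = find_mismatches(columns)
    if mismatches:
        raise SchemaOutOfSyncError(
            "columns need re-sync before reading Avro partitions: %s"
            % sorted(mismatches))


def resync_columns(columns):
    """The explicit 're-sync' step: rewrite declared types once, on request."""
    return {name: avro_supported_type(t) for name, t in columns.items()}
```

Because the check runs per partition as it is touched, it does not need the
full partition list, which is the constraint described above.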

-Todd

On Wed, Jul 11, 2018 at 9:03 PM, Bharath Vissapragada <
[email protected]> wrote:

> Agreed.
>
> On Wed, Jul 11, 2018 at 8:55 PM Todd Lipcon <[email protected]>
> wrote:
>
> > Your commit message there makes sense, Bharath -- we should set
> > 'avroSchema' in the descriptor in case any referenced partition is avro,
> > because the scanner needs that info. However, we don't need to also
> > override the table-level schema. So, I think we can preserve the fix that
> > you made while also making the behavior less surprising.
> >
> > -Todd
> >
> > On Wed, Jul 11, 2018 at 8:21 PM, Bharath Vissapragada <
> > [email protected]> wrote:
> >
> > > I added this functionality
> > > <https://github.com/apache/impala/commit/49610e2cfa40aa10b626c5ae41d7f0d99d7cabc5>
> > > where adding an Avro partition in a mixed partition table resets the
> > > table-level schema. While I don't exactly remember why we chose this
> > > path, I do recall that we debated quite a bit about Avro schema
> > > evolution causing schema inconsistencies across partitions. AFAICT
> > > there is no specific reason Impala chose to differ from Hive. Now that
> > > I see your email, Hive's behavior makes more sense to me, especially in
> > > the context of lazy loading of metadata.
> > >
> > > Also, agree with Edward that the whole mixed partitions + Avro schema
> > > evolution is a mess and I doubt if any serious user relies on a
> > > specific behavior.
> > >
> > > On Wed, Jul 11, 2018 at 7:48 PM Edward Capriolo <[email protected]>
> > > wrote:
> > >
> > > > I know that Hive can deal with schema being different per partition,
> > > > but I really hesitate to understand why someone would want to do this.
> > > > If someone asked me to support a mixed avro/parquet table I would
> > > > suggest they create a view. If they kept insisting I would reply "Well
> > > > it is your funeral."
> > > >
> > > > On Wed, Jul 11, 2018 at 7:51 PM, Todd Lipcon <[email protected]>
> > > > wrote:
> > > >
> > > > > Hey folks,
> > > > >
> > > > > I'm trying to understand the current behavior of tables that contain
> > > > > partitions of mixed format, specifically when one or more partitions
> > > > > is stored as Avro. Impala seems to be doing a number of things which
> > > > > I find surprising, and I'm not sure if they are intentional or should
> > > > > be considered bugs.
> > > > >
> > > > > *Surprise 1*: the _presence_ of an Avro-formatted partition can
> > > > > change the table schema
> > > > > https://gist.github.com/74bdef8a69b558763e4453ac21313649
> > > > >
> > > > > - create a table that is Parquet-formatted, but with an
> > > > > 'avro.schema.url' property
> > > > > - the Avro schema is ignored, and we see whatever schema we specified
> > > > > (*makes sense, because the table is Parquet*)
> > > > > - add a partition
> > > > > - set the new partition's format to Avro
> > > > > - refresh the table
> > > > > - the schema for the table now reflects the Avro schema, because it
> > > > > has at least one Avro partition
> > > > >
> > > > > *Surprise 2*: the above is inconsistent with Hive and Spark
> > > > >
> > > > > Hive seems to still reflect the table-level defined schema, and ignore
> > > > > the avro.schema.url property in this mixed scenario. That is to say,
> > > > > with the state set up by the above, we have the following behavior:
> > > > >
> > > > > Impala:
> > > > > - uses the external avro schema for all table-level info, SELECT *,
> > > > > etc.
> > > > > - "compute stats" detects the inconsistency and tells the user to
> > > > > recreate the table.
> > > > > - if some existing partitions (eg in Parquet) aren't compatible with
> > > > > that avro schema, errors result from the backend that there are
> > > > > missing columns in the Parquet data files
> > > > >
> > > > > Hive:
> > > > > - uses the table-level schema defined in the HMS for describe, etc
> > > > > - queries like 'select *' again use the table-level HMS schema. The
> > > > > underlying reader that reads the Avro partition seems to use the
> > > > > defined external Avro schema, resulting in nulls for missing columns.
> > > > > - computing stats (analyze table mixedtable partition (y=1) compute
> > > > > stats for columns) seems to end up only recording stats against the
> > > > > column defined in the table-level schema.
> > > > >
> > > > > Spark:
> > > > > - DESCRIBE TABLE shows the table-level info
> > > > > - select * fails, because apparently Spark doesn't support
> > > > > multi-format tables at all (it tries to read the avro files as a
> > > > > parquet file)
> > > > >
> > > > >
> > > > > It seems to me that Hive's behavior is a bit better. *I'd like to
> > > > > propose we treat this as a bug and move to the following behavior:*
> > > > >
> > > > > - if a table's properties indicate it's an avro table, parse and
> > > > > adopt the external avro schema as the table schema
> > > > > - if a table's properties indicate it's _not_ an avro table, but there
> > > > > is an external avro schema defined in the table properties, then parse
> > > > > the avro schema and include it in the TableDescriptor (for use by avro
> > > > > partitions) but do not adopt it as the table schema.
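
The two rules proposed above can be sketched roughly as follows. This is a
simplified illustration under stated assumptions, not Impala's frontend code:
`TableDescriptorSketch` and `resolve_schema` are hypothetical names, and an
Avro schema is modeled as a plain list of (name, type) pairs rather than a
parsed Avro record.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Column = Tuple[str, str]  # (column name, column type); a simplification


@dataclass
class TableDescriptorSketch:
    """Hypothetical stand-in for the descriptor sent to the backend."""
    table_columns: List[Column]         # what DESCRIBE / SELECT * would use
    avro_schema: Optional[List[Column]]  # carried for Avro partition scanners


def resolve_schema(table_is_avro: bool,
                   declared_columns: List[Column],
                   external_avro_schema: Optional[List[Column]]
                   ) -> TableDescriptorSketch:
    """Decide the table schema from table-level properties only."""
    if table_is_avro and external_avro_schema is not None:
        # Rule 1: an Avro table adopts the external Avro schema.
        return TableDescriptorSketch(list(external_avro_schema),
                                     list(external_avro_schema))
    # Rule 2: a non-Avro table keeps its declared columns; the Avro schema,
    # if any, is only carried along for Avro partitions.
    carried = (list(external_avro_schema)
               if external_avro_schema is not None else None)
    return TableDescriptorSketch(list(declared_columns), carried)
```

Note that `resolve_schema` takes no partition information at all, which is the
property that lets a DESCRIBE be answered without loading partition metadata.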
> > > > >
> > > > > The added benefit of the above proposal (and the reason why I started
> > > > > looking into this in the first place) is that, in order to service a
> > > > > simple query like DESCRIBE, our current behavior requires all
> > > > > partition metadata to be loaded to know whether there is any
> > > > > avro-formatted partition. With the proposed new behavior, we can avoid
> > > > > looking at all partitions. This is important for any metadata design
> > > > > which supports fine-grained loading of metadata to the coordinator.
> > > > >
> > > > > -Todd
> > > > > --
> > > > > Todd Lipcon
> > > > > Software Engineer, Cloudera
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera
