Hi folks,

I would like to discuss the topic of schema evolution in Hudi as I think we
could improve user experience here a bit.

Currently, we have two schema evolution "modes" available:
1. "old" out of the box schema evolution rules,
2. "new" schema on read evolution rules.

Out of the box schema evolution allows us to add nullable columns at the
end of a struct (root or nested), we can not modify column order in the
incoming batch and Hudi will not resolve it for the user. It also allows
for some limited data types evolution.

On the other hand schema on read evolution rules allow adding, reordering
and dropping columns. It also allows for pretty flexible data types
evolution.

>From my experience and from some discussions in Hudi slack emerge a few
use-cases for schema evolution. I want to focus on "new" schema on read
context.

1. Automatic/dynamic schema evolution - in this mode user can provide
partial record schema, and table schema will be automatically evolved so
the user can "just" write to Hudi (but honoring schema evolution rules).
This could be supported for data frame write (upsert, insert) and MERGE
INTO SQL statement for both INSERT *, UPDATE * as well as for partial
inserts and updates used in MERGE INTO statement. No columns should
be dropped from the table schema. Reordering columns should be taken care
of etc.

2. Enforce schema on write - in some use-cases users do not want to evolve
table schema automatically. In this case, the target table schema should be
used on write.

Currently, it feels like Hudi is not fully consistent in this matter:
- MERGE INTO enforces schema on write (target table schema) and drops
additional columns if needed,
- for UPSERT/INSERT when schema on read and reconcile schema are enabled it
does automatic schema evolution (missing columns are added, schema is
resolved tobe compatible target table schema).
- for UPSERT/INSERT when out-of-the box schema evolution is used and
reconcile schema is enabled wider schema is accepted and target table
schema is evolved accordingly or if the incoming schema is narrower, then
the latest table schema is used. There are issues though when a new column
is added and a column is missing or if column order is mixed in the
incoming batch.

>From the user perspective, it would be good to focus on the new schema on
read evolution rules and introduce a new config:
hoodie.schema.evolution.strategy: merge [1] or enforce [2]

That being said it can be a good idea to preserve reconcile.schema config
just for out-of-the box schema evolution scenarios and to preserve behavior.


Best Regards,
Daniel Kaźmirski

Reply via email to