Hi folks, I would like to discuss the topic of schema evolution in Hudi as I think we could improve user experience here a bit.
Currently, we have two schema evolution "modes" available: 1. "old" out of the box schema evolution rules, 2. "new" schema on read evolution rules. Out of the box schema evolution allows us to add nullable columns at the end of a struct (root or nested), we can not modify column order in the incoming batch and Hudi will not resolve it for the user. It also allows for some limited data types evolution. On the other hand schema on read evolution rules allow adding, reordering and dropping columns. It also allows for pretty flexible data types evolution. >From my experience and from some discussions in Hudi slack emerge a few use-cases for schema evolution. I want to focus on "new" schema on read context. 1. Automatic/dynamic schema evolution - in this mode user can provide partial record schema, and table schema will be automatically evolved so the user can "just" write to Hudi (but honoring schema evolution rules). This could be supported for data frame write (upsert, insert) and MERGE INTO SQL statement for both INSERT *, UPDATE * as well as for partial inserts and updates used in MERGE INTO statement. No columns should be dropped from the table schema. Reordering columns should be taken care of etc. 2. Enforce schema on write - in some use-cases users do not want to evolve table schema automatically. In this case, the target table schema should be used on write. Currently, it feels like Hudi is not fully consistent in this matter: - MERGE INTO enforces schema on write (target table schema) and drops additional columns if needed, - for UPSERT/INSERT when schema on read and reconcile schema are enabled it does automatic schema evolution (missing columns are added, schema is resolved tobe compatible target table schema). - for UPSERT/INSERT when out-of-the box schema evolution is used and reconcile schema is enabled wider schema is accepted and target table schema is evolved accordingly or if the incoming schema is narrower, then the latest table schema is used. There are issues though when a new column is added and a column is missing or if column order is mixed in the incoming batch. >From the user perspective, it would be good to focus on the new schema on read evolution rules and introduce a new config: hoodie.schema.evolution.strategy: merge [1] or enforce [2] That being said it can be a good idea to preserve reconcile.schema config just for out-of-the box schema evolution scenarios and to preserve behavior. Best Regards, Daniel Kaźmirski