Hi! Great, it's high time to do something about the config mess in Hudi.

> On Jul 31, 2025, at 08:14, Shiyan Xu <[email protected]> wrote:
> 
> Hi all,
> 
> Since config names are the first thing users see when working with Hudi and
> directly impact user and dev experience, we should pay careful attention to
> keeping them standardized and easy to remember and use. I wanted to start
> this thread to raise some points so we can establish a set of standards and
> create a migration path.
> 
> 1. Plural vs Singular
> 
> If a config accepts multiple values, its name should be plural where
> applicable. For example, since Hudi 1.1 supports multiple ordering fields,
> we should make `hoodie.datasource.write.precombine.field` plural. To show
> we take this seriously, treat this kind of misleading config name (singular
> but accepting multiple values) as a bug.
> 
> 2. Namespaces
> 
> Always start with `hoodie.<function area>.` as the namespace to denote the
> area the config serves. For example, `hoodie.table.*` is always a
> table config, `hoodie.write.*` is meant for writers to set, `hoodie.read.*`
> is meant for query engines to use,
> `hoodie.<compaction|clustering|cleaning|indexing>.*` always denotes table
> service specific configs, `hoodie.<storage>.*` indicates configs that
> control storage layer settings, `hoodie.table.metadata.*` is specific for
> the metadata table.
> 
> Keep these namespaces a fixed set of constants (a mandatory enum for
> composing config names), and do not casually change the words, like
> `compaction` vs `compact`, `cleaning` vs `clean`.
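
To make the "mandatory enum" idea concrete, here is a rough sketch of mine (not existing Hudi code) of how a fixed namespace set could be checked. The `Namespace` enum and `namespace_of` helper are hypothetical names, and the members only mirror the areas listed above; I am assuming `hoodie.<storage>.*` means a literal `storage` namespace.

```python
from enum import Enum


class Namespace(Enum):
    """Hypothetical fixed set of function areas, taken from the list above."""
    TABLE = "table"
    WRITE = "write"
    READ = "read"
    COMPACTION = "compaction"
    CLUSTERING = "clustering"
    CLEANING = "cleaning"
    INDEXING = "indexing"
    STORAGE = "storage"  # assumption: `hoodie.<storage>.*` means this literal word


def namespace_of(key: str) -> Namespace:
    """Return the namespace of a config key, or raise ValueError if it is not standard."""
    parts = key.split(".")
    if len(parts) < 3 or parts[0] != "hoodie":
        raise ValueError(f"config key must look like hoodie.<area>.<name>: {key!r}")
    # Enum lookup by value raises ValueError for words outside the fixed set,
    # e.g. `compact` instead of `compaction`.
    return Namespace(parts[1])
```

With this, `namespace_of("hoodie.compaction.some_option")` resolves to `Namespace.COMPACTION`, while a key starting with `hoodie.compact.` is rejected, which is exactly the kind of drift the rule is meant to catch.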
> 
> 3. snake_case
> 
> Use `.` to delimit functionally distinct words and `_` (snake_case) to
> connect a meaningful phrase. For example:
> 
> - `hoodie.table.recordkey.fields` should be
> `hoodie.table.record_key.fields`, as `recordkey` is not one word and should
> follow snake_case.
> - `hoodie.table.keygenerator.class` should be
> `hoodie.table.key_generator.class`, for a similar reason.
> - `hoodie.table.index.defs.path` should be `hoodie.table.index_defs.path`:
> "index defs" together names one thing, whereas reading "index" and "defs"
> separately conveys no meaningful information about this config.
> - `hoodie.file.group.reader.enabled` should be
> `hoodie.file_group.reader.enabled`, for a similar reason.
> 
> 4. `hoodie.properties` only for catalog/table configs
> 
> Only keep catalog/table configs in `hoodie.properties`; keep configs like
> `hoodie.datasource.write.*` out of it, and add new table configs for those
> that do not have a table config alias. For example, remove
> `hoodie.datasource.write.hive_style_partitioning` and put
> `hoodie.table.hive_style_partitioning` there instead.
> 
> 5. Improve naming case by case
> 
> Some examples to consider:
> - Move all `hoodie.datasource.write.*` configs to `hoodie.write.*` to keep
> things shorter
> - End all feature-switching configs with `enabled`; do not mix in `enable`
> - Move all meta/hive-sync related configs to `hoodie.catalog.sync.*`,
> clearly stating that they work with catalogs and that the function is "sync"
> 
> 6. Standardize shorthand property names in SQL TBLPROPERTIES
> 
> Everyone's first example of running Hudi has contained something like this:
> 
> TBLPROPERTIES (
>  primaryKey = 'id',
>  preCombineField = 'ts'
> );
> 
> Let's fix it:
> 
> - "record key" is the Hudi term, so people should not have to remember that
> "primary key" means "record key"; the plural rule applies here too
> - "ordering field" is the newer term, so let's deprecate the term
> "pre-combine field"; the plural rule applies here too
> - again, snake_case all the way, so it should look like the example below
> (omitting the `hoodie.table.` namespace) so people can easily associate the
> shorthand with the full name:
> 
> TBLPROPERTIES (
>  record_key.fields = 'id',
>  ordering.fields = 'ts'
> );
> 
> - in cases where non-table configs need to be put in TBLPROPERTIES(), we
> can just omit `hoodie.` since we have `USING HUDI` in the SQL, so it should
> support shorthand keys such as `read.*`, `write.*`, and `storage.*`
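
If I read the shorthand rules in item 6 correctly, the expansion could look roughly like the sketch below. This is only my interpretation, not existing Hudi behavior: `expand_shorthand` and `NON_TABLE_PREFIXES` are hypothetical names, and I am assuming that any key without a `read.`, `write.`, or `storage.` prefix is a table config.

```python
# Assumption: only these non-table areas are allowed as shorthand prefixes.
NON_TABLE_PREFIXES = ("read.", "write.", "storage.")


def expand_shorthand(key: str) -> str:
    """Expand a TBLPROPERTIES shorthand key into its full `hoodie.*` name."""
    if key.startswith("hoodie."):
        # Already a full name; leave it untouched.
        return key
    if key.startswith(NON_TABLE_PREFIXES):
        # Non-table configs only drop the leading `hoodie.`.
        return "hoodie." + key
    # Everything else is treated as a table config that dropped `hoodie.table.`.
    return "hoodie.table." + key
```

So `record_key.fields` would map to `hoodie.table.record_key.fields`, and a `write.*` key would only get `hoodie.` prepended. One consequence of this reading is that no table config could ever start with `read.`, `write.`, or `storage.` after `hoodie.table.`, otherwise the shorthand becomes ambiguous.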
> 
> 7. Address discrepancies between Flink options and Spark options
> 
> Do a one-time sweep of Flink configs that diverge from Spark configs, and
> align them with the standards we're setting. The goals are:
> 
> - All `hoodie.*` configs should be engine-agnostic and universally accepted
> by all engines when applicable
> - Any engine-specific config should be owned by the engine, and start with
> `hudi.` (like how the Trino Hudi connector does now)
> 
> 
> About migration: we should start adding new config names while keeping the
> old ones compatible as aliases. That means, throughout the codebase, config
> variables will contain the standard strings as the names, and any
> user-provided config will be translated to its new name if applicable.
> 
> We don't really want to fail writers/readers just because of old config
> names, so we can keep the aliases for quite some time, but there have to be
> deprecation warnings from now on, and we should drop the aliases at some
> major release (like 2.0 or 3.0). Before that, any table version upgrade
> should strive to rename the configs in `hoodie.properties` as per the
> standards to evangelize the new names.
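
For the migration path, a minimal sketch of the alias translation as I understand it, assuming a plain lookup table from old to new names. The `ALIASES` map and `normalize_configs` are hypothetical (not Hudi's actual config classes); the three renames are the ones proposed earlier in this thread. I also assume that when a user sets both the old and the new name, the new name wins.

```python
import warnings

# Old name -> standard name, using renames proposed in this thread.
ALIASES = {
    "hoodie.table.recordkey.fields": "hoodie.table.record_key.fields",
    "hoodie.table.keygenerator.class": "hoodie.table.key_generator.class",
    "hoodie.datasource.write.hive_style_partitioning":
        "hoodie.table.hive_style_partitioning",
}


def normalize_configs(user_configs: dict) -> dict:
    """Translate deprecated config names to standard ones, warning on each alias."""
    normalized = {}
    for key, value in user_configs.items():
        if key in ALIASES:
            new_key = ALIASES[key]
            warnings.warn(
                f"Config '{key}' is deprecated; use '{new_key}' instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            # Assumption: an explicitly set standard name takes precedence,
            # so the alias only fills in a value that is not already present.
            normalized.setdefault(new_key, value)
        else:
            normalized[key] = value
    return normalized
```

The rest of the codebase would then only ever see the standard names, and dropping the aliases at a major release is just a matter of emptying the table.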
> 
> Best,
> Shiyan
