This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push: new 98f2ca487bd [HUDI-6854][DOCS] Change default payload type to HOODIE_AVRO_DEFAULT (#11551) 98f2ca487bd is described below commit 98f2ca487bd53eb4edd05187a9a7a7d58140db79 Author: Vova Kolmakov <wombatu...@gmail.com> AuthorDate: Tue Jul 2 07:28:38 2024 +0700 [HUDI-6854][DOCS] Change default payload type to HOODIE_AVRO_DEFAULT (#11551) --- website/docs/basic_configurations.md | 63 ++++++------- website/docs/configurations.md | 173 ++++++++++++++++++----------------- 2 files changed, 120 insertions(+), 116 deletions(-) diff --git a/website/docs/basic_configurations.md b/website/docs/basic_configurations.md index 9d579738ca9..08d5ac717d9 100644 --- a/website/docs/basic_configurations.md +++ b/website/docs/basic_configurations.md @@ -1,7 +1,7 @@ --- title: Basic Configurations summary: This page covers the basic configurations you may use to write/read Hudi tables. This page only features a subset of the most frequently used configurations. For a full list of all configs, please visit the [All Configurations](/docs/configurations) page. -last_modified_at: 2024-06-06T12:59:56.064 +last_modified_at: 2024-07-01T15:09:57.63 --- @@ -33,36 +33,37 @@ Configurations of the Hudi Table like type of ingestion, storage formats, hive t [**Basic Configs**](#Hudi-Table-Basic-Configs-basic-configs) -| Config Name | Default | Description [...] -| ------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- [...] 
-| [hoodie.bootstrap.base.path](#hoodiebootstrapbasepath) | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi table<br />`Config Param: BOOTSTRAP_BASE_PATH` [...] -| [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name that will be used for incremental query.If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database<br />`Config Param: DATABASE_NAME` [...] -| [hoodie.table.checksum](#hoodietablechecksum) | (N/A) | Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading table config.<br />`Config Param: TABLE_CHECKSUM`<br />`Since Version: 0.11.0` [...] -| [hoodie.table.create.schema](#hoodietablecreateschema) | (N/A) | Schema used when creating the table, for the first time.<br />`Config Param: CREATE_SCHEMA` [...] -| [hoodie.table.keygenerator.class](#hoodietablekeygeneratorclass) | (N/A) | Key Generator class property for the hoodie table<br />`Config Param: KEY_GENERATOR_CLASS_NAME` [...] -| [hoodie.table.metadata.partitions](#hoodietablemetadatapartitions) | (N/A) | Comma-separated list of metadata partitions that have been completely built and in-sync with data table. These partitions are ready for use by the readers<br />`Config Param: TABLE_METADATA_PARTITIONS`<br />`Since Version: 0.11.0` [...] -| [hoodie.table.metadata.partitions.inflight](#hoodietablemetadatapartitionsinflight) | (N/A) | Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.<br />`Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT`<br />`Since Version: 0.11.0` [...] -| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with Hive. Needs to be same across runs.<br />`Config Param: NAME` [...] 
-| [hoodie.table.partition.fields](#hoodietablepartitionfields) | (N/A) | Fields used to partition the table. Concatenated values of these fields are used as the partition path, by invoking toString()<br />`Config Param: PARTITION_FIELDS` [...] -| [hoodie.table.precombine.field](#hoodietableprecombinefield) | (N/A) | Field used in preCombining before actual write. By default, when two records have the same key value, the largest value for the precombine field determined by Object.compareTo(..), is picked.<br />`Config Param: PRECOMBINE_FIELD` [...] -| [hoodie.table.recordkey.fields](#hoodietablerecordkeyfields) | (N/A) | Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.<br />`Config Param: RECORDKEY_FIELDS` [...] -| [hoodie.table.secondary.indexes.metadata](#hoodietablesecondaryindexesmetadata) | (N/A) | The metadata of secondary indexes<br />`Config Param: SECONDARY_INDEXES_METADATA`<br />`Since Version: 0.13.0` [...] -| [hoodie.timeline.layout.version](#hoodietimelinelayoutversion) | (N/A) | Version of timeline used, by the table.<br />`Config Param: TIMELINE_LAYOUT_VERSION` [...] -| [hoodie.archivelog.folder](#hoodiearchivelogfolder) | archived | path under the meta folder, to store archived timeline instants at.<br />`Config Param: ARCHIVELOG_FOLDER` [...] -| [hoodie.bootstrap.index.class](#hoodiebootstrapindexclass) | org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex | Implementation to use, for mapping base files to bootstrap base file, that contain actual data.<br />`Config Param: BOOTSTRAP_INDEX_CLASS_NAME` [...] -| [hoodie.bootstrap.index.enable](#hoodiebootstrapindexenable) | true | Whether or not, this is a bootstrapped table, with bootstrap base data and an mapping index defined, default true.<br />`Config Param: BOOTSTRAP_INDEX_ENABLE` [...] 
-| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | org.apache.hudi.common.model.OverwriteWithLatestAvroPayload | Payload class to use for performing compactions, i.e merge delta logs with current base file and then produce a new base file.<br />`Config Param: PAYLOAD_CLASS_NAME` [...] -| [hoodie.compaction.record.merger.strategy](#hoodiecompactionrecordmergerstrategy) | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which has the same merger strategy id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0` [...] -| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)<br />`Config Param: HIVE_STYLE_PARTITIONING_ENABLE` [...] -| [hoodie.partition.metafile.use.base.format](#hoodiepartitionmetafileusebaseformat) | false | If true, partition metafiles are saved in the same format as base-files for this dataset (e.g. Parquet / ORC). If false (default) partition metafiles are saved as properties files.<br />`Config Param: PARTITION_METAFILE_USE_BASE_FORMAT` [...] -| [hoodie.populate.meta.fields](#hoodiepopulatemetafields) | true | When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing<br />`Config Param: POPULATE_META_FIELDS` [...] -| [hoodie.table.base.file.format](#hoodietablebasefileformat) | PARQUET | Base file format to store all the base file data.<br />`Config Param: BASE_FILE_FORMAT` [...] 
-| [hoodie.table.cdc.enabled](#hoodietablecdcenabled) | false | When enable, persist the change data if necessary, and can be queried as a CDC query mode.<br />`Config Param: CDC_ENABLED`<br />`Since Version: 0.13.0` [...] -| [hoodie.table.cdc.supplemental.logging.mode](#hoodietablecdcsupplementalloggingmode) | DATA_BEFORE_AFTER | org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode: Change log capture supplemental logging mode. The supplemental log is used for accelerating the generation of change log details. OP_KEY_ONLY: Only keeping record keys in the supplemental logs, so the reader needs to figure out the update before image [...] -| [hoodie.table.log.file.format](#hoodietablelogfileformat) | HOODIE_LOG | Log format used for the delta logs.<br />`Config Param: LOG_FILE_FORMAT` [...] -| [hoodie.table.timeline.timezone](#hoodietabletimelinetimezone) | LOCAL | User can set hoodie commit timeline timezone, such as utc, local and so on. local is default<br />`Config Param: TIMELINE_TIMEZONE` [...] -| [hoodie.table.type](#hoodietabletype) | COPY_ON_WRITE | The table type for the underlying data, for this write. This can’t change between writes.<br />`Config Param: TYPE` [...] -| [hoodie.table.version](#hoodietableversion) | ZERO | Version of table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes.<br />`Config Param: VERSION` [...] +| Config Name | Default | Description [...] +| ------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- [...] 
+| [hoodie.bootstrap.base.path](#hoodiebootstrapbasepath) | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi table<br />`Config Param: BOOTSTRAP_BASE_PATH` [...] +| [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name that will be used for incremental query.If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database<br />`Config Param: DATABASE_NAME` [...] +| [hoodie.table.checksum](#hoodietablechecksum) | (N/A) | Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading table config.<br />`Config Param: TABLE_CHECKSUM`<br />`Since Version: 0.11.0` [...] +| [hoodie.table.create.schema](#hoodietablecreateschema) | (N/A) | Schema used when creating the table, for the first time.<br />`Config Param: CREATE_SCHEMA` [...] +| [hoodie.table.keygenerator.class](#hoodietablekeygeneratorclass) | (N/A) | Key Generator class property for the hoodie table<br />`Config Param: KEY_GENERATOR_CLASS_NAME` [...] +| [hoodie.table.metadata.partitions](#hoodietablemetadatapartitions) | (N/A) | Comma-separated list of metadata partitions that have been completely built and in-sync with data table. These partitions are ready for use by the readers<br />`Config Param: TABLE_METADATA_PARTITIONS`<br />`Since Version: 0.11.0` [...] +| [hoodie.table.metadata.partitions.inflight](#hoodietablemetadatapartitionsinflight) | (N/A) | Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.<br />`Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT`<br />`Since Version: 0.11.0` [...] +| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with Hive. Needs to be same across runs.<br />`Config Param: NAME` [...] 
+| [hoodie.table.partition.fields](#hoodietablepartitionfields) | (N/A) | Fields used to partition the table. Concatenated values of these fields are used as the partition path, by invoking toString()<br />`Config Param: PARTITION_FIELDS` [...] +| [hoodie.table.precombine.field](#hoodietableprecombinefield) | (N/A) | Field used in preCombining before actual write. By default, when two records have the same key value, the largest value for the precombine field determined by Object.compareTo(..), is picked.<br />`Config Param: PRECOMBINE_FIELD` [...] +| [hoodie.table.recordkey.fields](#hoodietablerecordkeyfields) | (N/A) | Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.<br />`Config Param: RECORDKEY_FIELDS` [...] +| [hoodie.table.secondary.indexes.metadata](#hoodietablesecondaryindexesmetadata) | (N/A) | The metadata of secondary indexes<br />`Config Param: SECONDARY_INDEXES_METADATA`<br />`Since Version: 0.13.0` [...] +| [hoodie.timeline.layout.version](#hoodietimelinelayoutversion) | (N/A) | Version of timeline used, by the table.<br />`Config Param: TIMELINE_LAYOUT_VERSION` [...] +| [hoodie.archivelog.folder](#hoodiearchivelogfolder) | archived | path under the meta folder, to store archived timeline instants at.<br />`Config Param: ARCHIVELOG_FOLDER` [...] +| [hoodie.bootstrap.index.class](#hoodiebootstrapindexclass) | org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex | Implementation to use, for mapping base files to bootstrap base file, that contain actual data.<br />`Config Param: BOOTSTRAP_INDEX_CLASS_NAME` [...] +| [hoodie.bootstrap.index.enable](#hoodiebootstrapindexenable) | true | Whether or not, this is a bootstrapped table, with bootstrap base data and an mapping index defined, default true.<br />`Config Param: BOOTSTRAP_INDEX_ENABLE` [...] 
+| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | org.apache.hudi.common.model.DefaultHoodieRecordPayload | Payload class to use for performing compactions, i.e merge delta logs with current base file and then produce a new base file.<br />`Config Param: PAYLOAD_CLASS_NAME` [...] +| [hoodie.compaction.payload.type](#hoodiecompactionpayloadtype) | HOODIE_AVRO_DEFAULT | org.apache.hudi.common.model.RecordPayloadType: Payload to use for merging records AWS_DMS_AVRO: Provides support for seamlessly applying changes captured via Amazon Database Migration Service onto S3. HOODIE_AVRO: A payload to wrap a existing Hoodie Avro Record. Useful to create a HoodieRecord over existing Gener [...] +| [hoodie.compaction.record.merger.strategy](#hoodiecompactionrecordmergerstrategy) | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which has the same merger strategy id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0` [...] +| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)<br />`Config Param: HIVE_STYLE_PARTITIONING_ENABLE` [...] +| [hoodie.partition.metafile.use.base.format](#hoodiepartitionmetafileusebaseformat) | false | If true, partition metafiles are saved in the same format as base-files for this dataset (e.g. Parquet / ORC). If false (default) partition metafiles are saved as properties files.<br />`Config Param: PARTITION_METAFILE_USE_BASE_FORMAT` [...] +| [hoodie.populate.meta.fields](#hoodiepopulatemetafields) | true | When enabled, populates all meta fields. 
When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing<br />`Config Param: POPULATE_META_FIELDS` [...] +| [hoodie.table.base.file.format](#hoodietablebasefileformat) | PARQUET | Base file format to store all the base file data.<br />`Config Param: BASE_FILE_FORMAT` [...] +| [hoodie.table.cdc.enabled](#hoodietablecdcenabled) | false | When enable, persist the change data if necessary, and can be queried as a CDC query mode.<br />`Config Param: CDC_ENABLED`<br />`Since Version: 0.13.0` [...] +| [hoodie.table.cdc.supplemental.logging.mode](#hoodietablecdcsupplementalloggingmode) | DATA_BEFORE_AFTER | org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode: Change log capture supplemental logging mode. The supplemental log is used for accelerating the generation of change log details. OP_KEY_ONLY: Only keeping record keys in the supplemental logs, so the reader needs to figure out the update before image [...] +| [hoodie.table.log.file.format](#hoodietablelogfileformat) | HOODIE_LOG | Log format used for the delta logs.<br />`Config Param: LOG_FILE_FORMAT` [...] +| [hoodie.table.timeline.timezone](#hoodietabletimelinetimezone) | LOCAL | User can set hoodie commit timeline timezone, such as utc, local and so on. local is default<br />`Config Param: TIMELINE_TIMEZONE` [...] +| [hoodie.table.type](#hoodietabletype) | COPY_ON_WRITE | The table type for the underlying data, for this write. This can’t change between writes.<br />`Config Param: TYPE` [...] +| [hoodie.table.version](#hoodietableversion) | ZERO | Version of table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes.<br />`Config Param: VERSION` [...] 
--- ## Spark Datasource Configs {#SPARK_DATASOURCE} diff --git a/website/docs/configurations.md b/website/docs/configurations.md index 278be1f5afa..a6814502cdc 100644 --- a/website/docs/configurations.md +++ b/website/docs/configurations.md @@ -5,7 +5,7 @@ permalink: /docs/configurations.html summary: This page covers the different ways of configuring your job to write/read Hudi tables. At a high level, you can control behaviour at few levels. toc_min_heading_level: 2 toc_max_heading_level: 4 -last_modified_at: 2024-06-06T12:59:56.026 +last_modified_at: 2024-07-01T15:09:57.588 --- @@ -54,36 +54,37 @@ Configurations of the Hudi Table like type of ingestion, storage formats, hive t [**Basic Configs**](#Hudi-Table-Basic-Configs-basic-configs) -| Config Name | Default | Description [...] -| ------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- [...] -| [hoodie.bootstrap.base.path](#hoodiebootstrapbasepath) | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi table<br />`Config Param: BOOTSTRAP_BASE_PATH` [...] -| [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name that will be used for incremental query.If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database<br />`Config Param: DATABASE_NAME` [...] -| [hoodie.table.checksum](#hoodietablechecksum) | (N/A) | Table checksum is used to guard against partial writes in HDFS. 
It is added as the last entry in hoodie.properties and then used to validate while reading table config.<br />`Config Param: TABLE_CHECKSUM`<br />`Since Version: 0.11.0` [...] -| [hoodie.table.create.schema](#hoodietablecreateschema) | (N/A) | Schema used when creating the table, for the first time.<br />`Config Param: CREATE_SCHEMA` [...] -| [hoodie.table.keygenerator.class](#hoodietablekeygeneratorclass) | (N/A) | Key Generator class property for the hoodie table<br />`Config Param: KEY_GENERATOR_CLASS_NAME` [...] -| [hoodie.table.metadata.partitions](#hoodietablemetadatapartitions) | (N/A) | Comma-separated list of metadata partitions that have been completely built and in-sync with data table. These partitions are ready for use by the readers<br />`Config Param: TABLE_METADATA_PARTITIONS`<br />`Since Version: 0.11.0` [...] -| [hoodie.table.metadata.partitions.inflight](#hoodietablemetadatapartitionsinflight) | (N/A) | Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.<br />`Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT`<br />`Since Version: 0.11.0` [...] -| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with Hive. Needs to be same across runs.<br />`Config Param: NAME` [...] -| [hoodie.table.partition.fields](#hoodietablepartitionfields) | (N/A) | Fields used to partition the table. Concatenated values of these fields are used as the partition path, by invoking toString()<br />`Config Param: PARTITION_FIELDS` [...] -| [hoodie.table.precombine.field](#hoodietableprecombinefield) | (N/A) | Field used in preCombining before actual write. By default, when two records have the same key value, the largest value for the precombine field determined by Object.compareTo(..), is picked.<br />`Config Param: PRECOMBINE_FIELD` [...] 
-| [hoodie.table.recordkey.fields](#hoodietablerecordkeyfields) | (N/A) | Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.<br />`Config Param: RECORDKEY_FIELDS` [...] -| [hoodie.table.secondary.indexes.metadata](#hoodietablesecondaryindexesmetadata) | (N/A) | The metadata of secondary indexes<br />`Config Param: SECONDARY_INDEXES_METADATA`<br />`Since Version: 0.13.0` [...] -| [hoodie.timeline.layout.version](#hoodietimelinelayoutversion) | (N/A) | Version of timeline used, by the table.<br />`Config Param: TIMELINE_LAYOUT_VERSION` [...] -| [hoodie.archivelog.folder](#hoodiearchivelogfolder) | archived | path under the meta folder, to store archived timeline instants at.<br />`Config Param: ARCHIVELOG_FOLDER` [...] -| [hoodie.bootstrap.index.class](#hoodiebootstrapindexclass) | org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex | Implementation to use, for mapping base files to bootstrap base file, that contain actual data.<br />`Config Param: BOOTSTRAP_INDEX_CLASS_NAME` [...] -| [hoodie.bootstrap.index.enable](#hoodiebootstrapindexenable) | true | Whether or not, this is a bootstrapped table, with bootstrap base data and an mapping index defined, default true.<br />`Config Param: BOOTSTRAP_INDEX_ENABLE` [...] -| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | org.apache.hudi.common.model.OverwriteWithLatestAvroPayload | Payload class to use for performing compactions, i.e merge delta logs with current base file and then produce a new base file.<br />`Config Param: PAYLOAD_CLASS_NAME` [...] -| [hoodie.compaction.record.merger.strategy](#hoodiecompactionrecordmergerstrategy) | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which has the same merger strategy id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0` [...] 
-| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)<br />`Config Param: HIVE_STYLE_PARTITIONING_ENABLE` [...] -| [hoodie.partition.metafile.use.base.format](#hoodiepartitionmetafileusebaseformat) | false | If true, partition metafiles are saved in the same format as base-files for this dataset (e.g. Parquet / ORC). If false (default) partition metafiles are saved as properties files.<br />`Config Param: PARTITION_METAFILE_USE_BASE_FORMAT` [...] -| [hoodie.populate.meta.fields](#hoodiepopulatemetafields) | true | When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing<br />`Config Param: POPULATE_META_FIELDS` [...] -| [hoodie.table.base.file.format](#hoodietablebasefileformat) | PARQUET | Base file format to store all the base file data.<br />`Config Param: BASE_FILE_FORMAT` [...] -| [hoodie.table.cdc.enabled](#hoodietablecdcenabled) | false | When enable, persist the change data if necessary, and can be queried as a CDC query mode.<br />`Config Param: CDC_ENABLED`<br />`Since Version: 0.13.0` [...] -| [hoodie.table.cdc.supplemental.logging.mode](#hoodietablecdcsupplementalloggingmode) | DATA_BEFORE_AFTER | org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode: Change log capture supplemental logging mode. The supplemental log is used for accelerating the generation of change log details. OP_KEY_ONLY: Only keeping record keys in the supplemental logs, so the reader needs to figure out the update before image [...] 
-| [hoodie.table.log.file.format](#hoodietablelogfileformat) | HOODIE_LOG | Log format used for the delta logs.<br />`Config Param: LOG_FILE_FORMAT` [...] -| [hoodie.table.timeline.timezone](#hoodietabletimelinetimezone) | LOCAL | User can set hoodie commit timeline timezone, such as utc, local and so on. local is default<br />`Config Param: TIMELINE_TIMEZONE` [...] -| [hoodie.table.type](#hoodietabletype) | COPY_ON_WRITE | The table type for the underlying data, for this write. This can’t change between writes.<br />`Config Param: TYPE` [...] -| [hoodie.table.version](#hoodietableversion) | ZERO | Version of table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes.<br />`Config Param: VERSION` [...] +| Config Name | Default | Description [...] +| ------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- [...] +| [hoodie.bootstrap.base.path](#hoodiebootstrapbasepath) | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi table<br />`Config Param: BOOTSTRAP_BASE_PATH` [...] +| [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name that will be used for incremental query.If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database<br />`Config Param: DATABASE_NAME` [...] +| [hoodie.table.checksum](#hoodietablechecksum) | (N/A) | Table checksum is used to guard against partial writes in HDFS. 
It is added as the last entry in hoodie.properties and then used to validate while reading table config.<br />`Config Param: TABLE_CHECKSUM`<br />`Since Version: 0.11.0` [...] +| [hoodie.table.create.schema](#hoodietablecreateschema) | (N/A) | Schema used when creating the table, for the first time.<br />`Config Param: CREATE_SCHEMA` [...] +| [hoodie.table.keygenerator.class](#hoodietablekeygeneratorclass) | (N/A) | Key Generator class property for the hoodie table<br />`Config Param: KEY_GENERATOR_CLASS_NAME` [...] +| [hoodie.table.metadata.partitions](#hoodietablemetadatapartitions) | (N/A) | Comma-separated list of metadata partitions that have been completely built and in-sync with data table. These partitions are ready for use by the readers<br />`Config Param: TABLE_METADATA_PARTITIONS`<br />`Since Version: 0.11.0` [...] +| [hoodie.table.metadata.partitions.inflight](#hoodietablemetadatapartitionsinflight) | (N/A) | Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.<br />`Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT`<br />`Since Version: 0.11.0` [...] +| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with Hive. Needs to be same across runs.<br />`Config Param: NAME` [...] +| [hoodie.table.partition.fields](#hoodietablepartitionfields) | (N/A) | Fields used to partition the table. Concatenated values of these fields are used as the partition path, by invoking toString()<br />`Config Param: PARTITION_FIELDS` [...] +| [hoodie.table.precombine.field](#hoodietableprecombinefield) | (N/A) | Field used in preCombining before actual write. By default, when two records have the same key value, the largest value for the precombine field determined by Object.compareTo(..), is picked.<br />`Config Param: PRECOMBINE_FIELD` [...] 
+| [hoodie.table.recordkey.fields](#hoodietablerecordkeyfields) | (N/A) | Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.<br />`Config Param: RECORDKEY_FIELDS` [...] +| [hoodie.table.secondary.indexes.metadata](#hoodietablesecondaryindexesmetadata) | (N/A) | The metadata of secondary indexes<br />`Config Param: SECONDARY_INDEXES_METADATA`<br />`Since Version: 0.13.0` [...] +| [hoodie.timeline.layout.version](#hoodietimelinelayoutversion) | (N/A) | Version of timeline used, by the table.<br />`Config Param: TIMELINE_LAYOUT_VERSION` [...] +| [hoodie.archivelog.folder](#hoodiearchivelogfolder) | archived | path under the meta folder, to store archived timeline instants at.<br />`Config Param: ARCHIVELOG_FOLDER` [...] +| [hoodie.bootstrap.index.class](#hoodiebootstrapindexclass) | org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex | Implementation to use, for mapping base files to bootstrap base file, that contain actual data.<br />`Config Param: BOOTSTRAP_INDEX_CLASS_NAME` [...] +| [hoodie.bootstrap.index.enable](#hoodiebootstrapindexenable) | true | Whether or not, this is a bootstrapped table, with bootstrap base data and an mapping index defined, default true.<br />`Config Param: BOOTSTRAP_INDEX_ENABLE` [...] +| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | org.apache.hudi.common.model.DefaultHoodieRecordPayload | Payload class to use for performing compactions, i.e merge delta logs with current base file and then produce a new base file.<br />`Config Param: PAYLOAD_CLASS_NAME` [...] +| [hoodie.compaction.payload.type](#hoodiecompactionpayloadtype) | HOODIE_AVRO_DEFAULT | org.apache.hudi.common.model.RecordPayloadType: Payload to use for merging records AWS_DMS_AVRO: Provides support for seamlessly applying changes captured via Amazon Database Migration Service onto S3. HOODIE_AVRO: A payload to wrap a existing Hoodie Avro Record. 
Useful to create a HoodieRecord over existing Gener [...] +| [hoodie.compaction.record.merger.strategy](#hoodiecompactionrecordmergerstrategy) | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which has the same merger strategy id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0` [...] +| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)<br />`Config Param: HIVE_STYLE_PARTITIONING_ENABLE` [...] +| [hoodie.partition.metafile.use.base.format](#hoodiepartitionmetafileusebaseformat) | false | If true, partition metafiles are saved in the same format as base-files for this dataset (e.g. Parquet / ORC). If false (default) partition metafiles are saved as properties files.<br />`Config Param: PARTITION_METAFILE_USE_BASE_FORMAT` [...] +| [hoodie.populate.meta.fields](#hoodiepopulatemetafields) | true | When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing<br />`Config Param: POPULATE_META_FIELDS` [...] +| [hoodie.table.base.file.format](#hoodietablebasefileformat) | PARQUET | Base file format to store all the base file data.<br />`Config Param: BASE_FILE_FORMAT` [...] +| [hoodie.table.cdc.enabled](#hoodietablecdcenabled) | false | When enable, persist the change data if necessary, and can be queried as a CDC query mode.<br />`Config Param: CDC_ENABLED`<br />`Since Version: 0.13.0` [...] 
+| [hoodie.table.cdc.supplemental.logging.mode](#hoodietablecdcsupplementalloggingmode) | DATA_BEFORE_AFTER | org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode: Change log capture supplemental logging mode. The supplemental log is used for accelerating the generation of change log details. OP_KEY_ONLY: Only keeping record keys in the supplemental logs, so the reader needs to figure out the update before image [...] +| [hoodie.table.log.file.format](#hoodietablelogfileformat) | HOODIE_LOG | Log format used for the delta logs.<br />`Config Param: LOG_FILE_FORMAT` [...] +| [hoodie.table.timeline.timezone](#hoodietabletimelinetimezone) | LOCAL | User can set hoodie commit timeline timezone, such as utc, local and so on. local is default<br />`Config Param: TIMELINE_TIMEZONE` [...] +| [hoodie.table.type](#hoodietabletype) | COPY_ON_WRITE | The table type for the underlying data, for this write. This can’t change between writes.<br />`Config Param: TYPE` [...] +| [hoodie.table.version](#hoodietableversion) | ZERO | Version of table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes.<br />`Config Param: VERSION` [...] [**Advanced Configs**](#Hudi-Table-Basic-Configs-advanced-configs) @@ -181,58 +182,59 @@ Options useful for writing tables via `write.format.option(...)` [**Advanced Configs**](#Write-Options-advanced-configs) -| Config Name | Default | Description [...] -| ------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- [...] 
-| [hoodie.datasource.hive_sync.serde_properties](#hoodiedatasourcehive_syncserde_properties) | (N/A) | Serde properties to hive table.<br />`Config Param: HIVE_TABLE_SERDE_PROPERTIES` [...] -| [hoodie.datasource.hive_sync.table_properties](#hoodiedatasourcehive_synctable_properties) | (N/A) | Additional properties to store with table.<br />`Config Param: HIVE_TABLE_PROPERTIES` [...] -| [hoodie.datasource.overwrite.mode](#hoodiedatasourceoverwritemode) | (N/A) | Controls whether overwrite use dynamic or static mode, if not configured, respect spark.sql.sources.partitionOverwriteMode<br />`Config Param: OVERWRITE_MODE`<br />`Since Version: 0.14.0` [...] -| [hoodie.datasource.write.partitions.to.delete](#hoodiedatasourcewritepartitionstodelete) | (N/A) | Comma separated list of partitions to delete. Allows use of wildcard *<br />`Config Param: PARTITIONS_TO_DELETE` [...] -| [hoodie.datasource.write.table.name](#hoodiedatasourcewritetablename) | (N/A) | Table name for the datasource write. Also used to register the table into meta stores.<br />`Config Param: TABLE_NAME` [...] -| [hoodie.datasource.compaction.async.enable](#hoodiedatasourcecompactionasyncenable) | true | Controls whether async compaction should be turned on for MOR table writing.<br />`Config Param: ASYNC_COMPACT_ENABLE` [...] -| [hoodie.datasource.hive_sync.assume_date_partitioning](#hoodiedatasourcehive_syncassume_date_partitioning) | false | Assume partitioning is yyyy/MM/dd<br />`Config Param: HIVE_ASSUME_DATE_PARTITION` [...] -| [hoodie.datasource.hive_sync.auto_create_database](#hoodiedatasourcehive_syncauto_create_database) | true | Auto create hive database if does not exists<br />`Config Param: HIVE_AUTO_CREATE_DATABASE` [...] -| [hoodie.datasource.hive_sync.base_file_format](#hoodiedatasourcehive_syncbase_file_format) | PARQUET | Base file format for the sync.<br />`Config Param: HIVE_BASE_FILE_FORMAT` [...] 
-| [hoodie.datasource.hive_sync.batch_num](#hoodiedatasourcehive_syncbatch_num) | 1000 | The number of partitions one batch when synchronous partitions to hive.<br />`Config Param: HIVE_BATCH_SYNC_PARTITION_NUM` [...] -| [hoodie.datasource.hive_sync.bucket_sync](#hoodiedatasourcehive_syncbucket_sync) | false | Whether sync hive metastore bucket specification when using bucket index.The specification is 'CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS'<br />`Config Param: HIVE_SYNC_BUCKET_SYNC` [...] -| [hoodie.datasource.hive_sync.create_managed_table](#hoodiedatasourcehive_synccreate_managed_table) | false | Whether to sync the table as managed table.<br />`Config Param: HIVE_CREATE_MANAGED_TABLE` [...] -| [hoodie.datasource.hive_sync.database](#hoodiedatasourcehive_syncdatabase) | default | The name of the destination database that we should sync the hudi table to.<br />`Config Param: HIVE_DATABASE` [...] -| [hoodie.datasource.hive_sync.ignore_exceptions](#hoodiedatasourcehive_syncignore_exceptions) | false | Ignore exceptions when syncing with Hive.<br />`Config Param: HIVE_IGNORE_EXCEPTIONS` [...] -| [hoodie.datasource.hive_sync.partition_extractor_class](#hoodiedatasourcehive_syncpartition_extractor_class) | org.apache.hudi.hive.MultiPartKeysValueExtractor | Class which implements PartitionValueExtractor to extract the partition values, default 'org.apache.hudi.hive.MultiPartKeysValueExtractor'.<br />`Config Param: HIVE_PARTITION_EXTRACTOR_CLASS` [...] -| [hoodie.datasource.hive_sync.partition_fields](#hoodiedatasourcehive_syncpartition_fields) | | Field in the table to use for determining hive partition columns.<br />`Config Param: HIVE_PARTITION_FIELDS` [...] -| [hoodie.datasource.hive_sync.password](#hoodiedatasourcehive_syncpassword) | hive | hive password to use<br />`Config Param: HIVE_PASS` [...] 
-| [hoodie.datasource.hive_sync.skip_ro_suffix](#hoodiedatasourcehive_syncskip_ro_suffix) | false | Skip the _ro suffix for Read optimized table, when registering<br />`Config Param: HIVE_SKIP_RO_SUFFIX_FOR_READ_OPTIMIZED_TABLE` [...] -| [hoodie.datasource.hive_sync.support_timestamp](#hoodiedatasourcehive_syncsupport_timestamp) | false | ‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. Disabled by default for backward compatibility. NOTE: On Spark entrypoints, this is defaulted to TRUE<br />`Config Param: HIVE_SUPPORT_TIMESTAMP_TYPE` [...] -| [hoodie.datasource.hive_sync.sync_as_datasource](#hoodiedatasourcehive_syncsync_as_datasource) | true | <br />`Config Param: HIVE_SYNC_AS_DATA_SOURCE_TABLE` [...] -| [hoodie.datasource.hive_sync.sync_comment](#hoodiedatasourcehive_syncsync_comment) | false | Whether to sync the table column comments while syncing the table.<br />`Config Param: HIVE_SYNC_COMMENT` [...] -| [hoodie.datasource.hive_sync.table](#hoodiedatasourcehive_synctable) | unknown | The name of the destination table that we should sync the hudi table to.<br />`Config Param: HIVE_TABLE` [...] -| [hoodie.datasource.hive_sync.use_jdbc](#hoodiedatasourcehive_syncuse_jdbc) | true | Use JDBC when hive synchronization is enabled<br />`Config Param: HIVE_USE_JDBC` [...] -| [hoodie.datasource.hive_sync.use_pre_apache_input_format](#hoodiedatasourcehive_syncuse_pre_apache_input_format) | false | Flag to choose InputFormat under com.uber.hoodie package instead of org.apache.hudi package. Use this when you are in the process of migrating from com.uber.hoodie to org.apache.hudi. Stop using this after you migrated the table definition to org.apache.hudi input format<br />`Co [...] -| [hoodie.datasource.hive_sync.username](#hoodiedatasourcehive_syncusername) | hive | hive user name to use<br />`Config Param: HIVE_USER` [...] 
-| [hoodie.datasource.insert.dup.policy](#hoodiedatasourceinsertduppolicy) | none | **Note** This is only applicable to Spark SQL writing.<br />When operation type is set to "insert", users can optionally enforce a dedup policy. This policy will be employed when records being ingested already exists in storage. Default policy is none and no action will be [...] -| [hoodie.datasource.meta_sync.condition.sync](#hoodiedatasourcemeta_syncconditionsync) | false | If true, only sync on conditions like schema change or partition change.<br />`Config Param: HIVE_CONDITIONAL_SYNC` [...] -| [hoodie.datasource.write.commitmeta.key.prefix](#hoodiedatasourcewritecommitmetakeyprefix) | _ | Option keys beginning with this prefix, are automatically added to the commit/deltacommit metadata. This is useful to store checkpointing information, in a consistent way with the hudi timeline<br />`Config Param: COMMIT_METADATA_KEYPREFIX` [...] -| [hoodie.datasource.write.drop.partition.columns](#hoodiedatasourcewritedroppartitioncolumns) | false | When set to true, will not write the partition columns into hudi. By default, false.<br />`Config Param: DROP_PARTITION_COLUMNS` [...] -| [hoodie.datasource.write.insert.drop.duplicates](#hoodiedatasourcewriteinsertdropduplicates) | false | If set to true, records from the incoming dataframe will not overwrite existing records with the same key during the write operation. <br /> **Note** Just for Insert operation in Spark SQL writing since 0.14.0, users can switch to the config `hoodie.datasource.insert.dup.po [...] -| [hoodie.datasource.write.keygenerator.class](#hoodiedatasourcewritekeygeneratorclass) | org.apache.hudi.keygen.SimpleKeyGenerator | Key generator class, that implements `org.apache.hudi.keygen.KeyGenerator`<br />`Config Param: KEYGENERATOR_CLASS_NAME` [...] 
-| [hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled](#hoodiedatasourcewritekeygeneratorconsistentlogicaltimestampenabled) | false | When set to true, consistent value will be generated for a logical timestamp type column, like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so as not to break the pipeline that deploy either fully row-writer path or non [...] -| [hoodie.datasource.write.partitionpath.urlencode](#hoodiedatasourcewritepartitionpathurlencode) | false | Should we url encode the partition path value, before creating the folder structure.<br />`Config Param: URL_ENCODE_PARTITIONING` [...] -| [hoodie.datasource.write.payload.class](#hoodiedatasourcewritepayloadclass) | org.apache.hudi.common.model.OverwriteWithLatestAvroPayload | Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL in-effective<br />`Config Param: PAYLOAD_CLASS_NAME` [...] -| [hoodie.datasource.write.reconcile.schema](#hoodiedatasourcewritereconcileschema) | false | This config controls how writer's schema will be selected based on the incoming batch's schema as well as existing table's one. When schema reconciliation is DISABLED, incoming batch's schema will be picked as a writer-schema (therefore updating table's schema). When schema recon [...] -| [hoodie.datasource.write.record.merger.impls](#hoodiedatasourcewriterecordmergerimpls) | org.apache.hudi.common.model.HoodieAvroRecordMerger | List of HoodieMerger implementations constituting Hudi's merging strategy -- based on the engine used. These merger impls will filter by hoodie.datasource.write.record.merger.strategy Hudi will pick most efficient implementation to perform merging/combining of the records (during [...] 
-| [hoodie.datasource.write.record.merger.strategy](#hoodiedatasourcewriterecordmergerstrategy) | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which has the same merger strategy id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0` [...] -| [hoodie.datasource.write.row.writer.enable](#hoodiedatasourcewriterowwriterenable) | true | When set to true, will perform write operations directly using the spark native `Row` representation, avoiding any additional conversion costs.<br />`Config Param: ENABLE_ROW_WRITER` [...] -| [hoodie.datasource.write.streaming.checkpoint.identifier](#hoodiedatasourcewritestreamingcheckpointidentifier) | default_single_writer | A stream identifier used for HUDI to fetch the right checkpoint(`batch id` to be more specific) corresponding this writer. Please note that keep the identifier an unique value for different writer if under multi-writer scenario. If the value is not set, will only keep the checkpo [...] -| [hoodie.datasource.write.streaming.disable.compaction](#hoodiedatasourcewritestreamingdisablecompaction) | false | By default for MOR table, async compaction is enabled with spark streaming sink. By setting this config to true, we can disable it and the expectation is that, users will schedule and execute compaction in a different process/job altogether. Some users may wish to run it separate [...] -| [hoodie.datasource.write.streaming.ignore.failed.batch](#hoodiedatasourcewritestreamingignorefailedbatch) | false | Config to indicate whether to ignore any non exception error (e.g. writestatus error) within a streaming microbatch. Turning this on, could hide the write status errors while the spark checkpoint moves ahead.So, would recommend users to use this with caution.<br />`Config Param: [...] 
-| [hoodie.datasource.write.streaming.retry.count](#hoodiedatasourcewritestreamingretrycount) | 3 | Config to indicate how many times streaming job should retry for a failed micro batch.<br />`Config Param: STREAMING_RETRY_CNT` [...] -| [hoodie.datasource.write.streaming.retry.interval.ms](#hoodiedatasourcewritestreamingretryintervalms) | 2000 | Config to indicate how long (by millisecond) before a retry should issued for failed microbatch<br />`Config Param: STREAMING_RETRY_INTERVAL_MS` [...] -| [hoodie.meta.sync.client.tool.class](#hoodiemetasyncclienttoolclass) | org.apache.hudi.hive.HiveSyncTool | Sync tool class name used to sync to metastore. Defaults to Hive.<br />`Config Param: META_SYNC_CLIENT_TOOL_CLASS_NAME` [...] -| [hoodie.spark.sql.insert.into.operation](#hoodiesparksqlinsertintooperation) | insert | Sql write operation to use with INSERT_INTO spark sql command. This comes with 3 possible values, bulk_insert, insert and upsert. bulk_insert is generally meant for initial loads and is known to be performant compared to insert. But bulk_insert may not do small file management. I [...] -| [hoodie.spark.sql.optimized.writes.enable](#hoodiesparksqloptimizedwritesenable) | true | Controls whether spark sql prepped update, delete, and merge are enabled.<br />`Config Param: SPARK_SQL_OPTIMIZED_WRITES`<br />`Since Version: 0.14.0` [...] -| [hoodie.sql.bulk.insert.enable](#hoodiesqlbulkinsertenable) | false | When set to true, the sql insert statement will use bulk insert. This config is deprecated as of 0.14.0. Please use hoodie.spark.sql.insert.into.operation instead.<br />`Config Param: SQL_ENABLE_BULK_INSERT` [...] -| [hoodie.sql.insert.mode](#hoodiesqlinsertmode) | upsert | Insert mode when insert data to pk-table. The optional modes are: upsert, strict and non-strict.For upsert mode, insert statement do the upsert operation for the pk-table which will update the duplicate record.For strict mode, insert statement will keep the primary key uniqueness [...] 
-| [hoodie.streamer.source.kafka.value.deserializer.class](#hoodiestreamersourcekafkavaluedeserializerclass) | io.confluent.kafka.serializers.KafkaAvroDeserializer | This class is used by kafka client to deserialize the records<br />`Config Param: KAFKA_AVRO_VALUE_DESERIALIZER_CLASS`<br />`Since Version: 0.9.0` [...] -| [hoodie.write.set.null.for.missing.columns](#hoodiewritesetnullformissingcolumns) | false | When a nullable column is missing from incoming batch during a write operation, the write operation will fail schema compatibility check. Set this option to true will make the missing column be filled with null values to successfully complete the write operation.<br />`Config P [...] +| Config Name | Default | Description [...] +| ------------------------------------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- [...] +| [hoodie.datasource.hive_sync.serde_properties](#hoodiedatasourcehive_syncserde_properties) | (N/A) | Serde properties to hive table.<br />`Config Param: HIVE_TABLE_SERDE_PROPERTIES` [...] +| [hoodie.datasource.hive_sync.table_properties](#hoodiedatasourcehive_synctable_properties) | (N/A) | Additional properties to store with table.<br />`Config Param: HIVE_TABLE_PROPERTIES` [...] +| [hoodie.datasource.overwrite.mode](#hoodiedatasourceoverwritemode) | (N/A) | Controls whether overwrite use dynamic or static mode, if not configured, respect spark.sql.sources.partitionOverwriteMode<br />`Config Param: OVERWRITE_MODE`<br />`Since Version: 0.14.0` [...] 
+| [hoodie.datasource.write.partitions.to.delete](#hoodiedatasourcewritepartitionstodelete) | (N/A) | Comma separated list of partitions to delete. Allows use of wildcard *<br />`Config Param: PARTITIONS_TO_DELETE` [...] +| [hoodie.datasource.write.table.name](#hoodiedatasourcewritetablename) | (N/A) | Table name for the datasource write. Also used to register the table into meta stores.<br />`Config Param: TABLE_NAME` [...] +| [hoodie.datasource.compaction.async.enable](#hoodiedatasourcecompactionasyncenable) | true | Controls whether async compaction should be turned on for MOR table writing.<br />`Config Param: ASYNC_COMPACT_ENABLE` [...] +| [hoodie.datasource.hive_sync.assume_date_partitioning](#hoodiedatasourcehive_syncassume_date_partitioning) | false | Assume partitioning is yyyy/MM/dd<br />`Config Param: HIVE_ASSUME_DATE_PARTITION` [...] +| [hoodie.datasource.hive_sync.auto_create_database](#hoodiedatasourcehive_syncauto_create_database) | true | Auto create hive database if does not exists<br />`Config Param: HIVE_AUTO_CREATE_DATABASE` [...] +| [hoodie.datasource.hive_sync.base_file_format](#hoodiedatasourcehive_syncbase_file_format) | PARQUET | Base file format for the sync.<br />`Config Param: HIVE_BASE_FILE_FORMAT` [...] +| [hoodie.datasource.hive_sync.batch_num](#hoodiedatasourcehive_syncbatch_num) | 1000 | The number of partitions one batch when synchronous partitions to hive.<br />`Config Param: HIVE_BATCH_SYNC_PARTITION_NUM` [...] +| [hoodie.datasource.hive_sync.bucket_sync](#hoodiedatasourcehive_syncbucket_sync) | false | Whether sync hive metastore bucket specification when using bucket index.The specification is 'CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS'<br />`Config Param: HIVE_SYNC_BUCKET_SYNC` [...] +| [hoodie.datasource.hive_sync.create_managed_table](#hoodiedatasourcehive_synccreate_managed_table) | false | Whether to sync the table as managed table.<br />`Config Param: HIVE_CREATE_MANAGED_TABLE` [...] 
+| [hoodie.datasource.hive_sync.database](#hoodiedatasourcehive_syncdatabase) | default | The name of the destination database that we should sync the hudi table to.<br />`Config Param: HIVE_DATABASE` [...] +| [hoodie.datasource.hive_sync.ignore_exceptions](#hoodiedatasourcehive_syncignore_exceptions) | false | Ignore exceptions when syncing with Hive.<br />`Config Param: HIVE_IGNORE_EXCEPTIONS` [...] +| [hoodie.datasource.hive_sync.partition_extractor_class](#hoodiedatasourcehive_syncpartition_extractor_class) | org.apache.hudi.hive.MultiPartKeysValueExtractor | Class which implements PartitionValueExtractor to extract the partition values, default 'org.apache.hudi.hive.MultiPartKeysValueExtractor'.<br />`Config Param: HIVE_PARTITION_EXTRACTOR_CLASS` [...] +| [hoodie.datasource.hive_sync.partition_fields](#hoodiedatasourcehive_syncpartition_fields) | | Field in the table to use for determining hive partition columns.<br />`Config Param: HIVE_PARTITION_FIELDS` [...] +| [hoodie.datasource.hive_sync.password](#hoodiedatasourcehive_syncpassword) | hive | hive password to use<br />`Config Param: HIVE_PASS` [...] +| [hoodie.datasource.hive_sync.skip_ro_suffix](#hoodiedatasourcehive_syncskip_ro_suffix) | false | Skip the _ro suffix for Read optimized table, when registering<br />`Config Param: HIVE_SKIP_RO_SUFFIX_FOR_READ_OPTIMIZED_TABLE` [...] +| [hoodie.datasource.hive_sync.support_timestamp](#hoodiedatasourcehive_syncsupport_timestamp) | false | ‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. Disabled by default for backward compatibility. NOTE: On Spark entrypoints, this is defaulted to TRUE<br />`Config Param: HIVE_SUPPORT_TIMESTAMP_TYPE` [...] +| [hoodie.datasource.hive_sync.sync_as_datasource](#hoodiedatasourcehive_syncsync_as_datasource) | true | <br />`Config Param: HIVE_SYNC_AS_DATA_SOURCE_TABLE` [...] 
+| [hoodie.datasource.hive_sync.sync_comment](#hoodiedatasourcehive_syncsync_comment) | false | Whether to sync the table column comments while syncing the table.<br />`Config Param: HIVE_SYNC_COMMENT` [...] +| [hoodie.datasource.hive_sync.table](#hoodiedatasourcehive_synctable) | unknown | The name of the destination table that we should sync the hudi table to.<br />`Config Param: HIVE_TABLE` [...] +| [hoodie.datasource.hive_sync.use_jdbc](#hoodiedatasourcehive_syncuse_jdbc) | true | Use JDBC when hive synchronization is enabled<br />`Config Param: HIVE_USE_JDBC` [...] +| [hoodie.datasource.hive_sync.use_pre_apache_input_format](#hoodiedatasourcehive_syncuse_pre_apache_input_format) | false | Flag to choose InputFormat under com.uber.hoodie package instead of org.apache.hudi package. Use this when you are in the process of migrating from com.uber.hoodie to org.apache.hudi. Stop using this after you migrated the table definition to org.apache.hudi input format<br />`Config [...] +| [hoodie.datasource.hive_sync.username](#hoodiedatasourcehive_syncusername) | hive | hive user name to use<br />`Config Param: HIVE_USER` [...] +| [hoodie.datasource.insert.dup.policy](#hoodiedatasourceinsertduppolicy) | none | **Note** This is only applicable to Spark SQL writing.<br />When operation type is set to "insert", users can optionally enforce a dedup policy. This policy will be employed when records being ingested already exists in storage. Default policy is none and no action will be tak [...] +| [hoodie.datasource.meta_sync.condition.sync](#hoodiedatasourcemeta_syncconditionsync) | false | If true, only sync on conditions like schema change or partition change.<br />`Config Param: HIVE_CONDITIONAL_SYNC` [...] +| [hoodie.datasource.write.commitmeta.key.prefix](#hoodiedatasourcewritecommitmetakeyprefix) | _ | Option keys beginning with this prefix, are automatically added to the commit/deltacommit metadata. 
This is useful to store checkpointing information, in a consistent way with the hudi timeline<br />`Config Param: COMMIT_METADATA_KEYPREFIX` [...] +| [hoodie.datasource.write.drop.partition.columns](#hoodiedatasourcewritedroppartitioncolumns) | false | When set to true, will not write the partition columns into hudi. By default, false.<br />`Config Param: DROP_PARTITION_COLUMNS` [...] +| [hoodie.datasource.write.insert.drop.duplicates](#hoodiedatasourcewriteinsertdropduplicates) | false | If set to true, records from the incoming dataframe will not overwrite existing records with the same key during the write operation. <br /> **Note** Just for Insert operation in Spark SQL writing since 0.14.0, users can switch to the config `hoodie.datasource.insert.dup.policy [...] +| [hoodie.datasource.write.keygenerator.class](#hoodiedatasourcewritekeygeneratorclass) | org.apache.hudi.keygen.SimpleKeyGenerator | Key generator class, that implements `org.apache.hudi.keygen.KeyGenerator`<br />`Config Param: KEYGENERATOR_CLASS_NAME` [...] +| [hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled](#hoodiedatasourcewritekeygeneratorconsistentlogicaltimestampenabled) | false | When set to true, consistent value will be generated for a logical timestamp type column, like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so as not to break the pipeline that deploy either fully row-writer path or non row [...] +| [hoodie.datasource.write.partitionpath.urlencode](#hoodiedatasourcewritepartitionpathurlencode) | false | Should we url encode the partition path value, before creating the folder structure.<br />`Config Param: URL_ENCODE_PARTITIONING` [...] +| [hoodie.datasource.write.payload.class](#hoodiedatasourcewritepayloadclass) | org.apache.hudi.common.model.DefaultHoodieRecordPayload | Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. 
This will render any value set for PRECOMBINE_FIELD_OPT_VAL in-effective<br />`Config Param: PAYLOAD_CLASS_NAME` [...] +| [hoodie.datasource.write.payload.type](#hoodiedatasourcewritepayloadtype) | HOODIE_AVRO_DEFAULT | org.apache.hudi.common.model.RecordPayloadType: Payload to use for merging records AWS_DMS_AVRO: Provides support for seamlessly applying changes captured via Amazon Database Migration Service onto S3. HOODIE_AVRO: A payload to wrap a existing Hoodie Avro Record. Useful to cr [...] +| [hoodie.datasource.write.reconcile.schema](#hoodiedatasourcewritereconcileschema) | false | This config controls how writer's schema will be selected based on the incoming batch's schema as well as existing table's one. When schema reconciliation is DISABLED, incoming batch's schema will be picked as a writer-schema (therefore updating table's schema). When schema reconcili [...] +| [hoodie.datasource.write.record.merger.impls](#hoodiedatasourcewriterecordmergerimpls) | org.apache.hudi.common.model.HoodieAvroRecordMerger | List of HoodieMerger implementations constituting Hudi's merging strategy -- based on the engine used. These merger impls will filter by hoodie.datasource.write.record.merger.strategy Hudi will pick most efficient implementation to perform merging/combining of the records (during upd [...] +| [hoodie.datasource.write.record.merger.strategy](#hoodiedatasourcewriterecordmergerstrategy) | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which has the same merger strategy id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0` [...] +| [hoodie.datasource.write.row.writer.enable](#hoodiedatasourcewriterowwriterenable) | true | When set to true, will perform write operations directly using the spark native `Row` representation, avoiding any additional conversion costs.<br />`Config Param: ENABLE_ROW_WRITER` [...] 
+| [hoodie.datasource.write.streaming.checkpoint.identifier](#hoodiedatasourcewritestreamingcheckpointidentifier) | default_single_writer | A stream identifier used for HUDI to fetch the right checkpoint(`batch id` to be more specific) corresponding this writer. Please note that keep the identifier an unique value for different writer if under multi-writer scenario. If the value is not set, will only keep the checkpoint [...] +| [hoodie.datasource.write.streaming.disable.compaction](#hoodiedatasourcewritestreamingdisablecompaction) | false | By default for MOR table, async compaction is enabled with spark streaming sink. By setting this config to true, we can disable it and the expectation is that, users will schedule and execute compaction in a different process/job altogether. Some users may wish to run it separately t [...] +| [hoodie.datasource.write.streaming.ignore.failed.batch](#hoodiedatasourcewritestreamingignorefailedbatch) | false | Config to indicate whether to ignore any non exception error (e.g. writestatus error) within a streaming microbatch. Turning this on, could hide the write status errors while the spark checkpoint moves ahead.So, would recommend users to use this with caution.<br />`Config Param: STRE [...] +| [hoodie.datasource.write.streaming.retry.count](#hoodiedatasourcewritestreamingretrycount) | 3 | Config to indicate how many times streaming job should retry for a failed micro batch.<br />`Config Param: STREAMING_RETRY_CNT` [...] +| [hoodie.datasource.write.streaming.retry.interval.ms](#hoodiedatasourcewritestreamingretryintervalms) | 2000 | Config to indicate how long (by millisecond) before a retry should issued for failed microbatch<br />`Config Param: STREAMING_RETRY_INTERVAL_MS` [...] +| [hoodie.meta.sync.client.tool.class](#hoodiemetasyncclienttoolclass) | org.apache.hudi.hive.HiveSyncTool | Sync tool class name used to sync to metastore. Defaults to Hive.<br />`Config Param: META_SYNC_CLIENT_TOOL_CLASS_NAME` [...] 
+| [hoodie.spark.sql.insert.into.operation](#hoodiesparksqlinsertintooperation) | insert | Sql write operation to use with INSERT_INTO spark sql command. This comes with 3 possible values, bulk_insert, insert and upsert. bulk_insert is generally meant for initial loads and is known to be performant compared to insert. But bulk_insert may not do small file management. If yo [...] +| [hoodie.spark.sql.optimized.writes.enable](#hoodiesparksqloptimizedwritesenable) | true | Controls whether spark sql prepped update, delete, and merge are enabled.<br />`Config Param: SPARK_SQL_OPTIMIZED_WRITES`<br />`Since Version: 0.14.0` [...] +| [hoodie.sql.bulk.insert.enable](#hoodiesqlbulkinsertenable) | false | When set to true, the sql insert statement will use bulk insert. This config is deprecated as of 0.14.0. Please use hoodie.spark.sql.insert.into.operation instead.<br />`Config Param: SQL_ENABLE_BULK_INSERT` [...] +| [hoodie.sql.insert.mode](#hoodiesqlinsertmode) | upsert | Insert mode when insert data to pk-table. The optional modes are: upsert, strict and non-strict.For upsert mode, insert statement do the upsert operation for the pk-table which will update the duplicate record.For strict mode, insert statement will keep the primary key uniqueness con [...] +| [hoodie.streamer.source.kafka.value.deserializer.class](#hoodiestreamersourcekafkavaluedeserializerclass) | io.confluent.kafka.serializers.KafkaAvroDeserializer | This class is used by kafka client to deserialize the records<br />`Config Param: KAFKA_AVRO_VALUE_DESERIALIZER_CLASS`<br />`Since Version: 0.9.0` [...] +| [hoodie.write.set.null.for.missing.columns](#hoodiewritesetnullformissingcolumns) | false | When a nullable column is missing from incoming batch during a write operation, the write operation will fail schema compatibility check. Set this option to true will make the missing column be filled with null values to successfully complete the write operation.<br />`Config Param [...] 
---
@@ -936,7 +938,8 @@ Configurations that control write behavior on Hudi tables. These can be directly
 | [hoodie.consistency.check.max_checks](#hoodieconsistencycheckmax_checks) | 7 | Maximum number of checks, for consistency of written data.<br />`Config Param: MAX_CONSISTENCY_CHECKS` [...]
 | [hoodie.consistency.check.max_interval_ms](#hoodieconsistencycheckmax_interval_ms) | 300000 | Max time to wait between successive attempts at performing consistency checks<br />`Config Param: MAX_CONSISTENCY_CHECK_INTERVAL_MS` [...]
 | [hoodie.datasource.write.keygenerator.type](#hoodiedatasourcewritekeygeneratortype) | SIMPLE | **Note** This is being actively worked on. Please use `hoodie.datasource.write.keygenerator.class` instead. org.apache.hudi.keygen.constant.KeyGeneratorType: Key generator type, indicating the key generator class to use, that implements `org.apache.hudi.keygen.KeyGenerator`. SIMPLE(default) [...]
-| [hoodie.datasource.write.payload.class](#hoodiedatasourcewritepayloadclass) | org.apache.hudi.common.model.OverwriteWithLatestAvroPayload | Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL in-effective<br />`Config Param: WRITE_PAYLOAD_CLASS_NAME` [...]
+| [hoodie.datasource.write.payload.class](#hoodiedatasourcewritepayloadclass) | org.apache.hudi.common.model.DefaultHoodieRecordPayload | Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL in-effective<br />`Config Param: WRITE_PAYLOAD_CLASS_NAME` [...]
+| [hoodie.datasource.write.payload.type](#hoodiedatasourcewritepayloadtype) | HOODIE_AVRO_DEFAULT | org.apache.hudi.common.model.RecordPayloadType: Payload to use for merging records AWS_DMS_AVRO: Provides support for seamlessly applying changes captured via Amazon Database Migration Service onto S3. HOODIE_AVRO: A payload to wrap a existing Hoodie Avro Record. Useful to create a HoodieRe [...]
 | [hoodie.datasource.write.record.merger.impls](#hoodiedatasourcewriterecordmergerimpls) | org.apache.hudi.common.model.HoodieAvroRecordMerger | List of HoodieMerger implementations constituting Hudi's merging strategy -- based on the engine used. These merger impls will filter by hoodie.datasource.write.record.merger.strategy Hudi will pick most efficient implementation to perform merging/combining of the records (during update, readin [...]
 | [hoodie.datasource.write.record.merger.strategy](#hoodiedatasourcewriterecordmergerstrategy) | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which has the same merger strategy id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0` [...]
 | [hoodie.datasource.write.schema.allow.auto.evolution.column.drop](#hoodiedatasourcewriteschemaallowautoevolutioncolumndrop) | false | Controls whether table's schema is allowed to automatically evolve when incoming batch's schema can have any of the columns dropped. By default, Hudi will not allow this kind of (auto) schema evolution. Set this config to true to allow table's schema to be updated automatically when columns are [...]
@@ -1691,7 +1694,7 @@ Payload related configs, that can be leveraged to control merges based on specif | Config Name | Default | Description | | ---------------------------------------------------------------- | ------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | org.apache.hudi.common.model.OverwriteWithLatestAvroPayload | This needs to be same as class used during insert/upserts. Just like writing, compaction also uses the record payload class to merge records in the log against each other, merge again with the base file and produce the final record to be written after compaction.<br />`Config Param: PAYLOAD_CLASS_NAME` | +| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | org.apache.hudi.common.model.DefaultHoodieRecordPayload | This needs to be same as class used during insert/upserts. Just like writing, compaction also uses the record payload class to merge records in the log against each other, merge again with the base file and produce the final record to be written after compaction.<br />`Config Param: PAYLOAD_CLASS_NAME` | | [hoodie.payload.event.time.field](#hoodiepayloadeventtimefield) | ts | Table column/field name to derive timestamp associated with the records. This canbe useful for e.g, determining the freshness of the table.<br />`Config Param: EVENT_TIME_FIELD` | | [hoodie.payload.ordering.field](#hoodiepayloadorderingfield) | ts | Table column/field name to order records that have the same key, before merging and writing to storage.<br />`Config Param: ORDERING_FIELD` | ---
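The documented default for both payload-class configs moves from `OverwriteWithLatestAvroPayload` to `DefaultHoodieRecordPayload`, and the compaction description notes that the compaction class needs to match the one used during insert/upserts. As a minimal sketch (not taken from the commit), the two keys could be set together so they cannot drift apart. The config keys and class names are copied from the diff; the helper name `payload_options` is hypothetical, and whether pinning the earlier class is needed at all depends on the Hudi version in use.

```python
# Config keys and class names are copied from the diff above.
# `payload_options` is a hypothetical helper for illustration, not part of Hudi.
WRITE_PAYLOAD_KEY = "hoodie.datasource.write.payload.class"
COMPACTION_PAYLOAD_KEY = "hoodie.compaction.payload.class"

OLD_DEFAULT = "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload"
NEW_DEFAULT = "org.apache.hudi.common.model.DefaultHoodieRecordPayload"


def payload_options(payload_class: str = NEW_DEFAULT) -> dict:
    """Return options that use one payload class for both writes and compaction."""
    return {
        WRITE_PAYLOAD_KEY: payload_class,
        COMPACTION_PAYLOAD_KEY: payload_class,
    }


# Explicitly pin the previously documented default instead of relying on defaults.
opts = payload_options(OLD_DEFAULT)
for key, value in sorted(opts.items()):
    print(f"{key}={value}")
```

The resulting dictionary is plain key/value pairs, so it can be merged into whatever set of write options a job already passes to Hudi.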