nsivabalan commented on code in PR #10612:
URL: https://github.com/apache/hudi/pull/10612#discussion_r1479356893
##########
website/docs/configurations.md:
##########
@@ -127,59 +127,59 @@ Options useful for writing tables via `write.format.option(...)`
 
 [**Advanced Configs**](#Write-Options-advanced-configs)
 
-| Config Name | Default | Description |
-| ----------- | ------- | ----------- |
-| [hoodie.datasource.hive_sync.serde_properties](#hoodiedatasourcehive_syncserde_properties) | (N/A) | Serde properties to hive table.<br />`Config Param: HIVE_TABLE_SERDE_PROPERTIES` |
-| [hoodie.datasource.hive_sync.table_properties](#hoodiedatasourcehive_synctable_properties) | (N/A) | Additional properties to store with table.<br />`Config Param: HIVE_TABLE_PROPERTIES` |
-| [hoodie.datasource.overwrite.mode](#hoodiedatasourceoverwritemode) | (N/A) | Controls whether overwrite use dynamic or static mode, if not configured, respect spark.sql.sources.partitionOverwriteMode<br />`Config Param: OVERWRITE_MODE`<br />`Since Version: 0.14.0` |
-| [hoodie.datasource.write.partitions.to.delete](#hoodiedatasourcewritepartitionstodelete) | (N/A) | Comma separated list of partitions to delete. Allows use of wildcard *<br />`Config Param: PARTITIONS_TO_DELETE` |
-| [hoodie.datasource.write.table.name](#hoodiedatasourcewritetablename) | (N/A) | Table name for the datasource write. Also used to register the table into meta stores.<br />`Config Param: TABLE_NAME` |
-| [hoodie.datasource.compaction.async.enable](#hoodiedatasourcecompactionasyncenable) | true | Controls whether async compaction should be turned on for MOR table writing.<br />`Config Param: ASYNC_COMPACT_ENABLE` |
-| [hoodie.datasource.hive_sync.assume_date_partitioning](#hoodiedatasourcehive_syncassume_date_partitioning) | false | Assume partitioning is yyyy/MM/dd<br />`Config Param: HIVE_ASSUME_DATE_PARTITION` |
-| [hoodie.datasource.hive_sync.auto_create_database](#hoodiedatasourcehive_syncauto_create_database) | true | Auto create hive database if does not exists<br />`Config Param: HIVE_AUTO_CREATE_DATABASE` |
-| [hoodie.datasource.hive_sync.base_file_format](#hoodiedatasourcehive_syncbase_file_format) | PARQUET | Base file format for the sync.<br />`Config Param: HIVE_BASE_FILE_FORMAT` |
-| [hoodie.datasource.hive_sync.batch_num](#hoodiedatasourcehive_syncbatch_num) | 1000 | The number of partitions one batch when synchronous partitions to hive.<br />`Config Param: HIVE_BATCH_SYNC_PARTITION_NUM` |
-| [hoodie.datasource.hive_sync.bucket_sync](#hoodiedatasourcehive_syncbucket_sync) | false | Whether sync hive metastore bucket specification when using bucket index.The specification is 'CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS'<br />`Config Param: HIVE_SYNC_BUCKET_SYNC` |
-| [hoodie.datasource.hive_sync.create_managed_table](#hoodiedatasourcehive_synccreate_managed_table) | false | Whether to sync the table as managed table.<br />`Config Param: HIVE_CREATE_MANAGED_TABLE` |
-| [hoodie.datasource.hive_sync.database](#hoodiedatasourcehive_syncdatabase) | default | The name of the destination database that we should sync the hudi table to.<br />`Config Param: HIVE_DATABASE` |
-| [hoodie.datasource.hive_sync.ignore_exceptions](#hoodiedatasourcehive_syncignore_exceptions) | false | Ignore exceptions when syncing with Hive.<br />`Config Param: HIVE_IGNORE_EXCEPTIONS` |
-| [hoodie.datasource.hive_sync.partition_extractor_class](#hoodiedatasourcehive_syncpartition_extractor_class) | org.apache.hudi.hive.MultiPartKeysValueExtractor | Class which implements PartitionValueExtractor to extract the partition values, default 'org.apache.hudi.hive.MultiPartKeysValueExtractor'.<br />`Config Param: HIVE_PARTITION_EXTRACTOR_CLASS` |
-| [hoodie.datasource.hive_sync.partition_fields](#hoodiedatasourcehive_syncpartition_fields) |  | Field in the table to use for determining hive partition columns.<br />`Config Param: HIVE_PARTITION_FIELDS` |
-| [hoodie.datasource.hive_sync.password](#hoodiedatasourcehive_syncpassword) | hive | hive password to use<br />`Config Param: HIVE_PASS` |
-| [hoodie.datasource.hive_sync.skip_ro_suffix](#hoodiedatasourcehive_syncskip_ro_suffix) | false | Skip the _ro suffix for Read optimized table, when registering<br />`Config Param: HIVE_SKIP_RO_SUFFIX_FOR_READ_OPTIMIZED_TABLE` |
-| [hoodie.datasource.hive_sync.support_timestamp](#hoodiedatasourcehive_syncsupport_timestamp) | false | ‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. Disabled by default for backward compatibility.<br />`Config Param: HIVE_SUPPORT_TIMESTAMP_TYPE` |
-| [hoodie.datasource.hive_sync.sync_as_datasource](#hoodiedatasourcehive_syncsync_as_datasource) | true | <br />`Config Param: HIVE_SYNC_AS_DATA_SOURCE_TABLE` |
-| [hoodie.datasource.hive_sync.sync_comment](#hoodiedatasourcehive_syncsync_comment) | false | Whether to sync the table column comments while syncing the table.<br />`Config Param: HIVE_SYNC_COMMENT` |
-| [hoodie.datasource.hive_sync.table](#hoodiedatasourcehive_synctable) | unknown | The name of the destination table that we should sync the hudi table to.<br />`Config Param: HIVE_TABLE` |
-| [hoodie.datasource.hive_sync.use_jdbc](#hoodiedatasourcehive_syncuse_jdbc) | true | Use JDBC when hive synchronization is enabled<br />`Config Param: HIVE_USE_JDBC` |
-| [hoodie.datasource.hive_sync.use_pre_apache_input_format](#hoodiedatasourcehive_syncuse_pre_apache_input_format) | false | Flag to choose InputFormat under com.uber.hoodie package instead of org.apache.hudi package. Use this when you are in the process of migrating from com.uber.hoodie to org.apache.hudi. Stop using this after you migrated the table definition to org.apache.hudi input format<br />`Config Param: HIVE_USE_PRE_APACHE_INPUT_FORMAT` |
-| [hoodie.datasource.hive_sync.username](#hoodiedatasourcehive_syncusername) | hive | hive user name to use<br />`Config Param: HIVE_USER` |
-| [hoodie.datasource.insert.dup.policy](#hoodiedatasourceinsertduppolicy) | none | When operation type is set to "insert", users can optionally enforce a dedup policy. This policy will be employed when records being ingested already exists in storage. Default policy is none and no action will be taken. Another option is to choose "drop", on which matching records from incoming will be dropped and the rest will be ingested. Third option is "fail" which will fail the write operation when same records are re-ingested. In other words, a given record as deduced by the key generation policy can be ingested only once to the target table of interest.<br />`Config Param: INSERT_DUP_POLICY`<br />`Since Version: 0.14.0` |
-| [hoodie.datasource.meta_sync.condition.sync](#hoodiedatasourcemeta_syncconditionsync) | false | If true, only sync on conditions like schema change or partition change.<br />`Config Param: HIVE_CONDITIONAL_SYNC` |
-| [hoodie.datasource.write.commitmeta.key.prefix](#hoodiedatasourcewritecommitmetakeyprefix) | _ | Option keys beginning with this prefix, are automatically added to the commit/deltacommit metadata. This is useful to store checkpointing information, in a consistent way with the hudi timeline<br />`Config Param: COMMIT_METADATA_KEYPREFIX` |
-| [hoodie.datasource.write.drop.partition.columns](#hoodiedatasourcewritedroppartitioncolumns) | false | When set to true, will not write the partition columns into hudi. By default, false.<br />`Config Param: DROP_PARTITION_COLUMNS` |
-| [hoodie.datasource.write.insert.drop.duplicates](#hoodiedatasourcewriteinsertdropduplicates) | false | If set to true, records from the incoming dataframe will not overwrite existing records with the same key during the write operation. This config is deprecated as of 0.14.0. Please use hoodie.datasource.insert.dup.policy instead.<br />`Config Param: INSERT_DROP_DUPS` |
-| [hoodie.datasource.write.keygenerator.class](#hoodiedatasourcewritekeygeneratorclass) | org.apache.hudi.keygen.SimpleKeyGenerator | Key generator class, that implements `org.apache.hudi.keygen.KeyGenerator`<br />`Config Param: KEYGENERATOR_CLASS_NAME` |
-| [hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled](#hoodiedatasourcewritekeygeneratorconsistentlogicaltimestampenabled) | false | When set to true, consistent value will be generated for a logical timestamp type column, like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so as not to break the pipeline that deploy either fully row-writer path or non row-writer path. For example, if it is kept disabled then record key of timestamp type with value `2016-12-29 09:54:00` will be written as timestamp `2016-12-29 09:54:00.0` in row-writer path, while it will be written as long value `1483023240000000` in non row-writer path. If enabled, then the timestamp value will be written in both the cases.<br />`Config Param: KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED` |
-| [hoodie.datasource.write.new.columns.nullable](#hoodiedatasourcewritenewcolumnsnullable) | false | When a non-nullable column is added to datasource during a write operation, the write operation will fail schema compatibility check. Set this option to true will make the newly added column nullable to successfully complete the write operation.<br />`Config Param: MAKE_NEW_COLUMNS_NULLABLE`<br />`Since Version: 0.14.0` |
-| [hoodie.datasource.write.partitionpath.urlencode](#hoodiedatasourcewritepartitionpathurlencode) | false | Should we url encode the partition path value, before creating the folder structure.<br />`Config Param: URL_ENCODE_PARTITIONING` |
-| [hoodie.datasource.write.payload.class](#hoodiedatasourcewritepayloadclass) | org.apache.hudi.common.model.OverwriteWithLatestAvroPayload | Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL in-effective<br />`Config Param: PAYLOAD_CLASS_NAME` |
-| [hoodie.datasource.write.reconcile.schema](#hoodiedatasourcewritereconcileschema) | false | This config controls how writer's schema will be selected based on the incoming batch's schema as well as existing table's one. When schema reconciliation is DISABLED, incoming batch's schema will be picked as a writer-schema (therefore updating table's schema). When schema reconciliation is ENABLED, writer-schema will be picked such that table's schema (after txn) is either kept the same or extended, meaning that we'll always prefer the schema that either adds new columns or stays the same. This enables us, to always extend the table's schema during evolution and never lose the data (when, for ex, existing column is being dropped in a new batch)<br />`Config Param: RECONCILE_SCHEMA` |
-| [hoodie.datasource.write.record.merger.impls](#hoodiedatasourcewriterecordmergerimpls) | org.apache.hudi.common.model.HoodieAvroRecordMerger | List of HoodieMerger implementations constituting Hudi's merging strategy -- based on the engine used. These merger impls will filter by hoodie.datasource.write.record.merger.strategy Hudi will pick most efficient implementation to perform merging/combining of the records (during update, reading MOR table, etc)<br />`Config Param: RECORD_MERGER_IMPLS`<br />`Since Version: 0.13.0` |
-| [hoodie.datasource.write.record.merger.strategy](#hoodiedatasourcewriterecordmergerstrategy) | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which has the same merger strategy id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0` |
-| [hoodie.datasource.write.row.writer.enable](#hoodiedatasourcewriterowwriterenable) | true | When set to true, will perform write operations directly using the spark native `Row` representation, avoiding any additional conversion costs.<br />`Config Param: ENABLE_ROW_WRITER` |
-| [hoodie.datasource.write.streaming.checkpoint.identifier](#hoodiedatasourcewritestreamingcheckpointidentifier) | default_single_writer | A stream identifier used for HUDI to fetch the right checkpoint(`batch id` to be more specific) corresponding this writer. Please note that keep the identifier an unique value for different writer if under multi-writer scenario. If the value is not set, will only keep the checkpoint info in the memory. This could introduce the potential issue that the job is restart(`batch id` is lost) while spark checkpoint write fails, causing spark will retry and rewrite the data.<br />`Config Param: STREAMING_CHECKPOINT_IDENTIFIER`<br />`Since Version: 0.13.0` |
-| [hoodie.datasource.write.streaming.disable.compaction](#hoodiedatasourcewritestreamingdisablecompaction) | false | By default for MOR table, async compaction is enabled with spark streaming sink. By setting this config to true, we can disable it and the expectation is that, users will schedule and execute compaction in a different process/job altogether. Some users may wish to run it separately to manage resources across table services and regular ingestion pipeline and so this could be preferred on such cases.<br />`Config Param: STREAMING_DISABLE_COMPACTION`<br />`Since Version: 0.14.0` |
-| [hoodie.datasource.write.streaming.ignore.failed.batch](#hoodiedatasourcewritestreamingignorefailedbatch) | false | Config to indicate whether to ignore any non exception error (e.g. writestatus error) within a streaming microbatch. Turning this on, could hide the write status errors while the spark checkpoint moves ahead.So, would recommend users to use this with caution.<br />`Config Param: STREAMING_IGNORE_FAILED_BATCH` |
-| [hoodie.datasource.write.streaming.retry.count](#hoodiedatasourcewritestreamingretrycount) | 3 | Config to indicate how many times streaming job should retry for a failed micro batch.<br />`Config Param: STREAMING_RETRY_CNT` |
-| [hoodie.datasource.write.streaming.retry.interval.ms](#hoodiedatasourcewritestreamingretryintervalms) | 2000 | Config to indicate how long (by millisecond) before a retry should issued for failed microbatch<br />`Config Param: STREAMING_RETRY_INTERVAL_MS` |
-| [hoodie.meta.sync.client.tool.class](#hoodiemetasyncclienttoolclass) | org.apache.hudi.hive.HiveSyncTool | Sync tool class name used to sync to metastore. Defaults to Hive.<br />`Config Param: META_SYNC_CLIENT_TOOL_CLASS_NAME` |
+| Config Name | Default | Description |

Review Comment:
   why entire file is shown as changed?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
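---

Editor's note: the table under review documents options that are supplied through the Spark datasource writer. As a minimal sketch of that usage in Scala, assuming an existing DataFrame `df` plus an example table name and base path (none of which come from the PR itself):

```scala
// Sketch: passing a few of the write options from the table above through
// the Spark datasource writer. `df`, "example_table", and the base path
// are illustrative assumptions, not values taken from the PR.
import org.apache.spark.sql.SaveMode

df.write.format("hudi")
  // Also used to register the table into meta stores (TABLE_NAME).
  .option("hoodie.datasource.write.table.name", "example_table")
  // "insert" operation plus the 0.14.0 dedup policy: drop incoming records
  // whose keys already exist in storage instead of ingesting them again.
  .option("hoodie.datasource.write.operation", "insert")
  .option("hoodie.datasource.insert.dup.policy", "drop")
  // Sync the table to the "default" Hive database (HIVE_DATABASE); the
  // enabling flag itself lives outside the excerpt quoted in this diff.
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.hive_sync.database", "default")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/example_table")
```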