yihua commented on code in PR #18867:
URL: https://github.com/apache/hudi/pull/18867#discussion_r3314756778


##########
website/docs/cleaning.md:
##########
@@ -50,10 +51,47 @@ Hudi cleaner currently supports the below cleaning policies to keep a certain nu
 be retained are cleaned. Currently you can configure by parameter [`hoodie.clean.hours.retained`](https://hudi.apache.org/docs/configurations/#hoodiecleanerhoursretained).
 The corresponding Flink related config is [`clean.retain_hours`](https://hudi.apache.org/docs/configurations/#cleanretain_hours).
 
+#### Empty Clean Commits for Append-Only Tables
+
+Append-only tables never accumulate updates, so the cleaner's `earliest_commit_to_retain` pointer never advances —
+causing the cleaner to scan the full table history on every run. Hudi 1.2.0 introduced periodic _empty clean commits_
+to advance this pointer even when there is nothing to delete.
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.write.empty.clean.interval.hours` | `-1` (disabled) | Interval in hours at which an empty clean commit is created. `-1` disables the feature. Must be `-1` or `>= 1`. When enabled, the cleaner advances `earliest_commit_to_retain` so that subsequent clean plans only scan partitions modified after the last empty clean's pointer. |
+
+#### Capping the Number of Commits Cleaned per Run
+
+Since 1.2.0, you can limit how many commits are cleaned in a single clean run, which is useful for controlling job
+duration on tables that have fallen significantly behind on cleaning.
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.clean.max.commits.to.clean` | `Long.MAX_VALUE` (unbounded) | Maximum number of commits cleaned in a single clean commit. Applicable when the cleaning policy is `KEEP_LATEST_COMMITS` or `KEEP_LATEST_BY_HOURS`. Must be `>= 1`. |
+
 ### Configs
 For details about all possible configurations and their default values see the [configuration docs](https://hudi.apache.org/docs/next/configurations/#Clean-Configs).
 For Flink related configs refer [here](https://hudi.apache.org/docs/next/configurations/#FLINK_SQL).
 
+#### Driver-Side Planning Optimization
+
+Hudi 1.2.0 introduced a driver-local planning mode to prevent OOM during clean planning on large metadata-table
+partitions (such as `record_index`).
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.clean.optimize.using.local.engine.context` | `true` | When enabled, clean planning for metadata tables and non-partitioned datasets runs on the driver only (local engine context), avoiding OOM on executor memory caused by large `record_index` partitions during file listing. |
+
+#### MDT Cleaner Inherits Data-Table Policy

Review Comment:
   These are advanced features that should only stay in configuration docs.

Review Comment:
   Fixed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
