yihua commented on code in PR #18867:
URL: https://github.com/apache/hudi/pull/18867#discussion_r3314756778


##########
website/docs/cleaning.md:
##########
@@ -50,10 +51,47 @@ Hudi cleaner currently supports the below cleaning policies to keep a certain nu
   be retained are cleaned. Currently you can configure by parameter [`hoodie.clean.hours.retained`](https://hudi.apache.org/docs/configurations/#hoodiecleanerhoursretained).
   The corresponding Flink related config is [`clean.retain_hours`](https://hudi.apache.org/docs/configurations/#cleanretain_hours).
 
+#### Empty Clean Commits for Append-Only Tables
+
+Append-only tables never accumulate updates, so the cleaner's 
`earliest_commit_to_retain` pointer never advances —
+causing the cleaner to scan the full table history on every run. Hudi 1.2.0 introduced periodic _empty clean commits_
+to advance this pointer even when there is nothing to delete.
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.write.empty.clean.interval.hours` | `-1` (disabled) | Interval in hours at which an empty clean commit is created. `-1` disables the feature. Must be `-1` or `>= 1`. When enabled, the cleaner advances `earliest_commit_to_retain` so that subsequent clean plans only scan partitions modified after the last empty clean's pointer. |
+
+#### Capping the Number of Commits Cleaned per Run
+
+Since 1.2.0, you can limit how many commits are cleaned in a single clean run, which is useful for controlling job
+duration on tables that have fallen significantly behind on cleaning.
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.clean.max.commits.to.clean` | `Long.MAX_VALUE` (unbounded) | Maximum number of commits cleaned in a single clean commit. Applicable when the cleaning policy is `KEEP_LATEST_COMMITS` or `KEEP_LATEST_BY_HOURS`. Must be `>= 1`. |
+
 ### Configs
 For details about all possible configurations and their default values see the [configuration docs](https://hudi.apache.org/docs/next/configurations/#Clean-Configs).
 For Flink related configs refer [here](https://hudi.apache.org/docs/next/configurations/#FLINK_SQL).
 
+#### Driver-Side Planning Optimization
+
+Hudi 1.2.0 introduced a driver-local planning mode to prevent OOM during clean planning on large metadata-table
+partitions (such as `record_index`).
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.clean.optimize.using.local.engine.context` | `true` | When enabled, clean planning for metadata tables and non-partitioned datasets runs on the driver only (local engine context), avoiding OOM on executor memory caused by large `record_index` partitions during file listing. |
+
+#### MDT Cleaner Inherits Data-Table Policy

Review Comment:
   These are advanced features that should be documented only in the configuration docs.
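
For readers following the thread, here is a rough sketch of how the three options in the hunk above might be collected and sanity-checked before a write. The option names, defaults, and range constraints come from the tables in the diff. The values `24` and `50` are purely illustrative, and `validate_clean_options` is a hypothetical helper, not part of Hudi.

```python
# Option names and constraints are taken from the diff above.
# The chosen values (24 hours, 50 commits) are illustrative assumptions.
CLEAN_OPTIONS = {
    "hoodie.write.empty.clean.interval.hours": "24",
    "hoodie.clean.max.commits.to.clean": "50",
    "hoodie.clean.optimize.using.local.engine.context": "true",
}


def validate_clean_options(opts):
    """Hypothetical helper: return a list of constraint violations.

    Mirrors the ranges stated in the diff:
      - empty clean interval must be -1 (disabled) or >= 1
      - max commits to clean must be >= 1 when set
    """
    errors = []

    # Default of -1 means the empty-clean feature is disabled.
    interval = int(opts.get("hoodie.write.empty.clean.interval.hours", "-1"))
    if interval != -1 and interval < 1:
        errors.append("hoodie.write.empty.clean.interval.hours must be -1 or >= 1")

    # Unset means unbounded, so only check the value when it is present.
    max_commits = opts.get("hoodie.clean.max.commits.to.clean")
    if max_commits is not None and int(max_commits) < 1:
        errors.append("hoodie.clean.max.commits.to.clean must be >= 1")

    return errors


if __name__ == "__main__":
    print(validate_clean_options(CLEAN_OPTIONS))
```

With a Spark DataFrame writer, a dictionary like this would typically be passed through `.options(**CLEAN_OPTIONS)` alongside the usual Hudi write options. Hudi is expected to enforce these ranges itself, so a check like this is only a convenience for catching typos early.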



##########
website/docs/cleaning.md:
##########

Review Comment:
   Fixed.


