hudi-agent commented on code in PR #18869:
URL: https://github.com/apache/hudi/pull/18869#discussion_r3315149627
##########
website/docs/indexes.md:
##########
@@ -211,13 +211,38 @@ for more details. All these, support the index types mentioned [above](#addition
#### Flink based configs
-For Flink DataStream and Flink SQL, Bucket index and Flink state index are supported.
+For Flink DataStream and Flink SQL, Bucket index, Flink state index, and record-level index are supported.
Following are the basic configs that control the indexing behavior. Please refer [the Flink configurations](configurations.md#Flink-Options-advanced-configs) for advanced configs.
-| Config Name | Default | Description |
-|----------------------------|------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| index.type | FLINK_STATE (Optional) | Index type of Flink write job, default is using state backed index. Possible values:<br /> <ul><li>FLINK_STATE</li><li>BUCKET</li></ul><br /> `Config Param: INDEX_TYPE` |
-| hoodie.index.bucket.engine | SIMPLE (Optional) | org.apache.hudi.index.HoodieIndex$BucketIndexEngineType: Determines the type of bucketing or hashing to use when `hoodie.index.type` is set to `BUCKET`. Possible Values: <br /> <ul><li>SIMPLE</li><li>CONSISTENT_HASHING</li></ul> |
+| Config Name | Default | Description |
+|-------------------------------------------------------------|------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| index.type | FLINK_STATE (Optional) | Index type of Flink write job, default is using state backed index. Possible values:<br /> <ul><li>FLINK_STATE</li><li>BUCKET</li><li>GLOBAL_RECORD_LEVEL_INDEX</li><li>RECORD_LEVEL_INDEX</li></ul><br /> `Config Param: INDEX_TYPE` |
+| hoodie.index.bucket.engine | SIMPLE (Optional) | org.apache.hudi.index.HoodieIndex$BucketIndexEngineType: Determines the type of bucketing or hashing to use when `hoodie.index.type` is set to `BUCKET`. Possible Values: <br /> <ul><li>SIMPLE</li><li>CONSISTENT_HASHING</li></ul> |
+| metadata.enabled | true (Optional) | Enables the metadata table. Required for Flink record-level index lookups. |
Review Comment:
🤖 The default for `index.bootstrap.rocksdb.path` is listed as `(Optional)`
with no value, but in `FlinkOptions.INDEX_BOOTSTRAP_ROCKSDB_PATH` it has a real
default of `FileIOUtils.getDefaultSpillableMapBasePath()` (typically the system
temp directory). Could you update the Default column to reflect this, since
users relying on the documented "no default" may be surprised that a path is
implicitly chosen for them?
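For illustration, here is a minimal Python sketch of the behavior described above: an unset path falls back to an implicit base directory, and each task manager gets its own subdirectory. It is a model, not Hudi code. The helper name `resolve_rocksdb_dir` and the `hudi-rli-` prefix are hypothetical, and the system temp directory only stands in for whatever `FileIOUtils.getDefaultSpillableMapBasePath()` actually returns.

```python
import os
import tempfile
import uuid


def resolve_rocksdb_dir(conf: dict) -> str:
    """Model how an unset index.bootstrap.rocksdb.path still yields a real path."""
    # Assumption: the system temp dir approximates the implicit default base path.
    base = conf.get("index.bootstrap.rocksdb.path") or tempfile.gettempdir()
    # The docs say each task manager creates a unique subdirectory under the base;
    # the "hudi-rli-" prefix here is made up for the example.
    return os.path.join(base, f"hudi-rli-{uuid.uuid4().hex}")


print(resolve_rocksdb_dir({}))
print(resolve_rocksdb_dir({"index.bootstrap.rocksdb.path": "/data/rocksdb"}))
```

The point for the docs is that the first call never returns an empty value, so a blank Default column understates what happens on disk.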
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
website/docs/indexes.md:
##########
@@ -211,13 +211,38 @@ for more details. All these, support the index types mentioned [above](#addition
#### Flink based configs
-For Flink DataStream and Flink SQL, Bucket index and Flink state index are supported.
+For Flink DataStream and Flink SQL, Bucket index, Flink state index, and record-level index are supported.
Following are the basic configs that control the indexing behavior. Please refer [the Flink configurations](configurations.md#Flink-Options-advanced-configs) for advanced configs.
-| Config Name | Default | Description |
-|----------------------------|------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| index.type | FLINK_STATE (Optional) | Index type of Flink write job, default is using state backed index. Possible values:<br /> <ul><li>FLINK_STATE</li><li>BUCKET</li></ul><br /> `Config Param: INDEX_TYPE` |
-| hoodie.index.bucket.engine | SIMPLE (Optional) | org.apache.hudi.index.HoodieIndex$BucketIndexEngineType: Determines the type of bucketing or hashing to use when `hoodie.index.type` is set to `BUCKET`. Possible Values: <br /> <ul><li>SIMPLE</li><li>CONSISTENT_HASHING</li></ul> |
+| Config Name | Default | Description |
+|-------------------------------------------------------------|------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| index.type | FLINK_STATE (Optional) | Index type of Flink write job, default is using state backed index. Possible values:<br /> <ul><li>FLINK_STATE</li><li>BUCKET</li><li>GLOBAL_RECORD_LEVEL_INDEX</li><li>RECORD_LEVEL_INDEX</li></ul><br /> `Config Param: INDEX_TYPE` |
+| hoodie.index.bucket.engine | SIMPLE (Optional) | org.apache.hudi.index.HoodieIndex$BucketIndexEngineType: Determines the type of bucketing or hashing to use when `hoodie.index.type` is set to `BUCKET`. Possible Values: <br /> <ul><li>SIMPLE</li><li>CONSISTENT_HASHING</li></ul> |
Review Comment:
🤖 The default column shows `false (Optional)` but the description explains
Flink may override this to `true` for `GLOBAL_RECORD_LEVEL_INDEX` at runtime.
This dual default is easy to miss — it might help to make the override behavior
more prominent (e.g., "Static default: `false`; runtime default for
`GLOBAL_RECORD_LEVEL_INDEX`: `true` when unset"), so users don't assume the
static default applies in all cases.
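To make the override concrete, here is a minimal Python sketch of the resolution order as the row's description states it: an explicit setting wins, otherwise global RLI turns bootstrap on, otherwise the static default applies. It models the documented behavior only. The helper name `resolve_bootstrap_enabled` and the plain-dict config are hypothetical, not Hudi APIs.

```python
# Illustrative model only; not Hudi code. Keys and values come from the docs table.
STATIC_DEFAULT = False


def resolve_bootstrap_enabled(conf: dict) -> bool:
    """Return the effective value of index.bootstrap.enabled."""
    if "index.bootstrap.enabled" in conf:
        # An explicit user setting always wins, including an explicit false.
        return str(conf["index.bootstrap.enabled"]).lower() == "true"
    if conf.get("index.type") == "GLOBAL_RECORD_LEVEL_INDEX":
        # Unset plus global RLI: the docs say Flink enables bootstrap by default.
        return True
    # Every other index type keeps the static default.
    return STATIC_DEFAULT


print(resolve_bootstrap_enabled({"index.type": "GLOBAL_RECORD_LEVEL_INDEX"}))
```

Three inputs and two possible defaults are hard to convey in a single `false (Optional)` cell, which is why spelling out both defaults in the table seems worthwhile.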
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
website/docs/indexes.md:
##########
@@ -211,13 +211,38 @@ for more details. All these, support the index types mentioned [above](#addition
#### Flink based configs
-For Flink DataStream and Flink SQL, Bucket index and Flink state index are supported.
+For Flink DataStream and Flink SQL, Bucket index, Flink state index, and record-level index are supported.
Following are the basic configs that control the indexing behavior. Please refer [the Flink configurations](configurations.md#Flink-Options-advanced-configs) for advanced configs.
-| Config Name | Default | Description |
-|----------------------------|------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| index.type | FLINK_STATE (Optional) | Index type of Flink write job, default is using state backed index. Possible values:<br /> <ul><li>FLINK_STATE</li><li>BUCKET</li></ul><br /> `Config Param: INDEX_TYPE` |
-| hoodie.index.bucket.engine | SIMPLE (Optional) | org.apache.hudi.index.HoodieIndex$BucketIndexEngineType: Determines the type of bucketing or hashing to use when `hoodie.index.type` is set to `BUCKET`. Possible Values: <br /> <ul><li>SIMPLE</li><li>CONSISTENT_HASHING</li></ul> |
+| Config Name | Default | Description |
+|-------------------------------------------------------------|------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| index.type | FLINK_STATE (Optional) | Index type of Flink write job, default is using state backed index. Possible values:<br /> <ul><li>FLINK_STATE</li><li>BUCKET</li><li>GLOBAL_RECORD_LEVEL_INDEX</li><li>RECORD_LEVEL_INDEX</li></ul><br /> `Config Param: INDEX_TYPE` |
+| hoodie.index.bucket.engine | SIMPLE (Optional) | org.apache.hudi.index.HoodieIndex$BucketIndexEngineType: Determines the type of bucketing or hashing to use when `hoodie.index.type` is set to `BUCKET`. Possible Values: <br /> <ul><li>SIMPLE</li><li>CONSISTENT_HASHING</li></ul> |
+| metadata.enabled | true (Optional) | Enables the metadata table. Required for Flink record-level index lookups. |
+| index.global.enabled | true (Optional) | Whether to update the old partition path when the same record key arrives with a different partition path. This must be `true` for `GLOBAL_RECORD_LEVEL_INDEX` and is set to `false` for `RECORD_LEVEL_INDEX`. |
+| index.bootstrap.enabled | false (Optional) | When `index.type=GLOBAL_RECORD_LEVEL_INDEX`, controls whether Flink bootstraps the global index into a local RocksDB backend. If not explicitly set for global RLI, Flink enables bootstrap by default. Set to `false` to force native metadata-table RLI access. |
+| index.bootstrap.rocksdb.path | (Optional) | Local directory path for the RocksDB backend used when `index.bootstrap.enabled=true`. Each task manager creates a unique subdirectory under this path. |
+| index.rli.cache.size | 256 (Optional) | Maximum memory, in MB, allocated for the record-level index cache per bucket-assign task. Applies to native metadata-table RLI access and partitioned RLI caches. |
+| index.rli.cache.concurrent.partitions.num | 2 (Optional) | Expected number of partitions whose partitioned RLI caches are updated concurrently. Used to size each partition cache when historical cache usage is unavailable. |
Review Comment:
🤖 The `index.write.tasks` row shows default `(N/A)`. Since this option
resolves to the execution environment parallelism at runtime, it might read
more clearly as a default of "execution environment parallelism" (matching how
`read.tasks` and similar Flink configs are typically described), rather than
`(N/A)` which could read as "unsupported."
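As a small illustration of why `(N/A)` undersells the behavior, the Python sketch below models the fallback the description gives: an unset `index.write.tasks` resolves to the execution environment parallelism. The helper name `resolve_index_write_tasks` and the dict-based config are hypothetical and only mirror the wording of the row.

```python
from typing import Optional


def resolve_index_write_tasks(conf: dict, env_parallelism: int) -> int:
    """Effective parallelism for tasks that write record-level index records."""
    configured: Optional[str] = conf.get("index.write.tasks")
    if configured is None:
        # Not set: the docs say this defaults to the execution environment parallelism.
        return env_parallelism
    return int(configured)


print(resolve_index_write_tasks({}, env_parallelism=4))
print(resolve_index_write_tasks({"index.write.tasks": "8"}, env_parallelism=4))
```

The option always resolves to a concrete number, so a descriptive default in the table would match what users see at runtime.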
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
website/docs/indexes.md:
##########
@@ -211,13 +211,38 @@ for more details. All these, support the index types mentioned [above](#addition
#### Flink based configs
-For Flink DataStream and Flink SQL, Bucket index and Flink state index are supported.
+For Flink DataStream and Flink SQL, Bucket index, Flink state index, and record-level index are supported.
Following are the basic configs that control the indexing behavior. Please refer [the Flink configurations](configurations.md#Flink-Options-advanced-configs) for advanced configs.
-| Config Name | Default | Description |
-|----------------------------|------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| index.type | FLINK_STATE (Optional) | Index type of Flink write job, default is using state backed index. Possible values:<br /> <ul><li>FLINK_STATE</li><li>BUCKET</li></ul><br /> `Config Param: INDEX_TYPE` |
-| hoodie.index.bucket.engine | SIMPLE (Optional) | org.apache.hudi.index.HoodieIndex$BucketIndexEngineType: Determines the type of bucketing or hashing to use when `hoodie.index.type` is set to `BUCKET`. Possible Values: <br /> <ul><li>SIMPLE</li><li>CONSISTENT_HASHING</li></ul> |
+| Config Name | Default | Description |
+|-------------------------------------------------------------|------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| index.type | FLINK_STATE (Optional) | Index type of Flink write job, default is using state backed index. Possible values:<br /> <ul><li>FLINK_STATE</li><li>BUCKET</li><li>GLOBAL_RECORD_LEVEL_INDEX</li><li>RECORD_LEVEL_INDEX</li></ul><br /> `Config Param: INDEX_TYPE` |
+| hoodie.index.bucket.engine | SIMPLE (Optional) | org.apache.hudi.index.HoodieIndex$BucketIndexEngineType: Determines the type of bucketing or hashing to use when `hoodie.index.type` is set to `BUCKET`. Possible Values: <br /> <ul><li>SIMPLE</li><li>CONSISTENT_HASHING</li></ul> |
+| metadata.enabled | true (Optional) | Enables the metadata table. Required for Flink record-level index lookups. |
+| index.global.enabled | true (Optional) | Whether to update the old partition path when the same record key arrives with a different partition path. This must be `true` for `GLOBAL_RECORD_LEVEL_INDEX` and is set to `false` for `RECORD_LEVEL_INDEX`. |
Review Comment:
🤖 The description says this cache size "Applies to native metadata-table RLI
access and partitioned RLI caches." The FlinkOptions Javadoc adds an important
detail: the per-checkpoint cache size is dynamically computed based on
historical checkpoint averages. It might help to mention this dynamic sizing
here so users understand `index.rli.cache.size` is an upper bound rather than a
fixed allocation.
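A rough Python sketch of what "upper bound rather than fixed allocation" could mean in practice follows. The formula is an assumption made for illustration, not the actual Hudi computation: with history available, the budget follows the historical average but never exceeds `index.rli.cache.size`; without history, the cap is split across `index.rli.cache.concurrent.partitions.num`. The helper name `partition_cache_budget_mb` is hypothetical.

```python
from typing import Sequence


def partition_cache_budget_mb(
    max_cache_mb: float,
    concurrent_partitions: int,
    history_mb: Sequence[float],
) -> float:
    """Assumed sizing rule: historical average, capped by the configured maximum."""
    if history_mb:
        # Assumption: use the mean cache usage observed over past checkpoints.
        average = sum(history_mb) / len(history_mb)
        return min(max_cache_mb, average)
    # No history yet: divide the cap evenly across the expected concurrent partitions.
    return max_cache_mb / concurrent_partitions


print(partition_cache_budget_mb(256, 2, []))
print(partition_cache_budget_mb(256, 2, [100, 140]))
```

Whatever the real formula is, the documented value behaves as a ceiling in this model, which is the detail the row currently omits.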
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
website/docs/indexes.md:
##########
@@ -211,13 +211,38 @@ for more details. All these, support the index types mentioned [above](#addition
#### Flink based configs
-For Flink DataStream and Flink SQL, Bucket index and Flink state index are supported.
+For Flink DataStream and Flink SQL, Bucket index, Flink state index, and record-level index are supported.
Following are the basic configs that control the indexing behavior. Please refer [the Flink configurations](configurations.md#Flink-Options-advanced-configs) for advanced configs.
-| Config Name | Default | Description |
-|----------------------------|------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| index.type | FLINK_STATE (Optional) | Index type of Flink write job, default is using state backed index. Possible values:<br /> <ul><li>FLINK_STATE</li><li>BUCKET</li></ul><br /> `Config Param: INDEX_TYPE` |
-| hoodie.index.bucket.engine | SIMPLE (Optional) | org.apache.hudi.index.HoodieIndex$BucketIndexEngineType: Determines the type of bucketing or hashing to use when `hoodie.index.type` is set to `BUCKET`. Possible Values: <br /> <ul><li>SIMPLE</li><li>CONSISTENT_HASHING</li></ul> |
+| Config Name | Default | Description |
+|-------------------------------------------------------------|------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| index.type | FLINK_STATE (Optional) | Index type of Flink write job, default is using state backed index. Possible values:<br /> <ul><li>FLINK_STATE</li><li>BUCKET</li><li>GLOBAL_RECORD_LEVEL_INDEX</li><li>RECORD_LEVEL_INDEX</li></ul><br /> `Config Param: INDEX_TYPE` |
+| hoodie.index.bucket.engine | SIMPLE (Optional) | org.apache.hudi.index.HoodieIndex$BucketIndexEngineType: Determines the type of bucketing or hashing to use when `hoodie.index.type` is set to `BUCKET`. Possible Values: <br /> <ul><li>SIMPLE</li><li>CONSISTENT_HASHING</li></ul> |
+| metadata.enabled | true (Optional) | Enables the metadata table. Required for Flink record-level index lookups. |
+| index.global.enabled | true (Optional) | Whether to update the old partition path when the same record key arrives with a different partition path. This must be `true` for `GLOBAL_RECORD_LEVEL_INDEX` and is set to `false` for `RECORD_LEVEL_INDEX`. |
+| index.bootstrap.enabled | false (Optional) | When `index.type=GLOBAL_RECORD_LEVEL_INDEX`, controls whether Flink bootstraps the global index into a local RocksDB backend. If not explicitly set for global RLI, Flink enables bootstrap by default. Set to `false` to force native metadata-table RLI access. |
+| index.bootstrap.rocksdb.path | (Optional) | Local directory path for the RocksDB backend used when `index.bootstrap.enabled=true`. Each task manager creates a unique subdirectory under this path. |
+| index.rli.cache.size | 256 (Optional) | Maximum memory, in MB, allocated for the record-level index cache per bucket-assign task. Applies to native metadata-table RLI access and partitioned RLI caches. |
+| index.rli.cache.concurrent.partitions.num | 2 (Optional) | Expected number of partitions whose partitioned RLI caches are updated concurrently. Used to size each partition cache when historical cache usage is unavailable. |
+| index.rli.lookup.minibatch.size | 1000 (Optional) | Maximum number of input records buffered for mini-batch record-index lookup. Mini-batching reduces individual metadata-table lookup calls for native global RLI access. |
+| index.rli.write.buffer.size | 100 (Optional) | Maximum memory, in MB, for the index record writer buffer. When the threshold is reached, Flink flushes index records to avoid OOM. |
+| index.write.tasks | (N/A) | Parallelism for tasks that write record-level index records. Defaults to the execution environment parallelism when not set. |
+| metadata.compaction.schedule.enabled | true (Optional) | Schedules metadata table compaction plans. |
Review Comment:
🤖 The doc states "Flink ingestion does not support deferred RLI
initialization, so keep this set to `false` for Flink RLI writes." Could you
confirm where this restriction is enforced in Flink? If a user sets
`hoodie.metadata.record.level.index.defer.init=true` with Flink today, does the
write fail loudly, or does it silently fall back? A pointer to the failure mode
(or to the relevant validator) would make this guidance more actionable for
on-call debugging.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
website/docs/indexes.md:
##########
@@ -211,13 +211,38 @@ for more details. All these, support the index types mentioned [above](#addition
#### Flink based configs
-For Flink DataStream and Flink SQL, Bucket index and Flink state index are supported.
+For Flink DataStream and Flink SQL, Bucket index, Flink state index, and record-level index are supported.
Following are the basic configs that control the indexing behavior. Please refer [the Flink configurations](configurations.md#Flink-Options-advanced-configs) for advanced configs.
-| Config Name | Default | Description |
-|----------------------------|------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| index.type | FLINK_STATE (Optional) | Index type of Flink write job, default is using state backed index. Possible values:<br /> <ul><li>FLINK_STATE</li><li>BUCKET</li></ul><br /> `Config Param: INDEX_TYPE` |
-| hoodie.index.bucket.engine | SIMPLE (Optional) | org.apache.hudi.index.HoodieIndex$BucketIndexEngineType: Determines the type of bucketing or hashing to use when `hoodie.index.type` is set to `BUCKET`. Possible Values: <br /> <ul><li>SIMPLE</li><li>CONSISTENT_HASHING</li></ul> |
+| Config Name | Default | Description |
+|-------------------------------------------------------------|------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| index.type | FLINK_STATE (Optional) | Index type of Flink write job, default is using state backed index. Possible values:<br /> <ul><li>FLINK_STATE</li><li>BUCKET</li><li>GLOBAL_RECORD_LEVEL_INDEX</li><li>RECORD_LEVEL_INDEX</li></ul><br /> `Config Param: INDEX_TYPE` |
+| hoodie.index.bucket.engine | SIMPLE (Optional) | org.apache.hudi.index.HoodieIndex$BucketIndexEngineType: Determines the type of bucketing or hashing to use when `hoodie.index.type` is set to `BUCKET`. Possible Values: <br /> <ul><li>SIMPLE</li><li>CONSISTENT_HASHING</li></ul> |
+| metadata.enabled | true (Optional) | Enables the metadata table. Required for Flink record-level index lookups. |
+| index.global.enabled | true (Optional) | Whether to update the old partition path when the same record key arrives with a different partition path. This must be `true` for `GLOBAL_RECORD_LEVEL_INDEX` and is set to `false` for `RECORD_LEVEL_INDEX`. |
+| index.bootstrap.enabled | false (Optional) | When `index.type=GLOBAL_RECORD_LEVEL_INDEX`, controls whether Flink bootstraps the global index into a local RocksDB backend. If not explicitly set for global RLI, Flink enables bootstrap by default. Set to `false` to force native metadata-table RLI access. |
+| index.bootstrap.rocksdb.path | (Optional) | Local directory path for the RocksDB backend used when `index.bootstrap.enabled=true`. Each task manager creates a unique subdirectory under this path. |
Review Comment:
🤖 The FlinkOptions Javadoc for `index.rli.lookup.minibatch.size` notes that
1000 is also the minimum — if a smaller value is configured, the default is
used instead. Worth surfacing here so users tuning down the value aren't
surprised that small settings are silently ignored.
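The floor described above can be sketched in a few lines of Python. This models the behavior as the Javadoc is summarized in this comment (values below 1000 fall back to the default); the helper name `effective_minibatch_size` is hypothetical and this is not the Hudi implementation.

```python
from typing import Optional

# 1000 is both the documented default and, per the Javadoc note, the minimum.
DEFAULT_MINIBATCH_SIZE = 1000


def effective_minibatch_size(configured: Optional[int] = None) -> int:
    """Value actually used for index.rli.lookup.minibatch.size."""
    if configured is None or configured < DEFAULT_MINIBATCH_SIZE:
        # Unset or too small: the default is used instead of the configured value.
        return DEFAULT_MINIBATCH_SIZE
    return configured


print(effective_minibatch_size(200))
print(effective_minibatch_size(5000))
```

A user who sets 200 to reduce buffering still gets batches of 1000 under this rule, so a short "minimum: 1000" note in the row would save some debugging.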
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>