fhan688 opened a new pull request, #18897:
URL: https://github.com/apache/hudi/pull/18897
### Describe the issue this Pull Request addresses
This PR wires Flink simple bucket index writes to the existing
timeline-service-backed remote partitioner path.
Today, the Flink simple bucket index routes records using a local
bucket-to-task mapping. When writer parallelism changes, bucket ownership can
shift across subtasks, so the writer may route records and load bucket file
groups inconsistently. Hudi already has remote partition helper support in the
timeline service and write config, but the Flink simple bucket index write
pipeline does not use it.
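To illustrate how a locally computed mapping can shift bucket ownership, here
is a minimal sketch that assumes a plain `bucketId % parallelism` mapping. This
is a deliberate simplification for illustration, not Hudi's actual formula, and
the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class BucketOwnershipShift {
    // Assumption: a simplified local mapping, bucket -> subtask by modulo.
    // Hudi's real mapping is more involved; this only shows the effect.
    static int ownerTask(int bucketId, int parallelism) {
        return bucketId % parallelism;
    }

    // Returns the buckets whose owning subtask differs between two
    // writer parallelism settings.
    static List<Integer> movedBuckets(int numBuckets, int oldParallelism, int newParallelism) {
        List<Integer> moved = new ArrayList<>();
        for (int bucketId = 0; bucketId < numBuckets; bucketId++) {
            if (ownerTask(bucketId, oldParallelism) != ownerTask(bucketId, newParallelism)) {
                moved.add(bucketId);
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        // 8 buckets, writer parallelism reduced from 4 to 3.
        System.out.println(movedBuckets(8, 4, 3)); // prints [3, 4, 5, 6, 7]
    }
}
```

Under this simplified mapping, five of the eight buckets change owner when
parallelism drops from 4 to 3, which is the kind of shift that a centrally
assigned mapping is meant to avoid.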
This PR enables Flink writers to use remote bucket partition assignment
when `hoodie.bucket.index.remote.partitioner.enable=true`.
### Summary and Changelog
This PR adds remote partitioner support for Flink simple bucket index.
Changes:
- Add Flink option `hoodie.bucket.index.remote.partitioner.enable`, backed
by the existing core bucket remote partitioner config key.
- Add Flink option resolution logic to enable remote partitioning only for
simple bucket index with embedded timeline server enabled.
- Add `BucketIndexRemotePartitioner` for Flink `partitionCustom`, using
`RemotePartitionHelper` and `NumBucketsFunction`.
- Wire the remote partitioner into Flink bulk insert and streaming write
bucket index pipelines.
- Make `BucketStreamWriteFunction` use remote partition assignment when
deciding whether a bucket belongs to the current writer subtask.
- Propagate the Flink option into `HoodieWriteConfig`.
- Include the remote partitioner flag in `EmbeddedTimelineService` reuse
identity to avoid sharing a timeline service across incompatible writer
configs.
- Add unit tests and an integration test covering the enabled remote
partitioner path.
No code was copied from external sources.
### Impact
This is a user-facing Flink write feature, disabled by default.
When `hoodie.bucket.index.remote.partitioner.enable=true`, Flink simple
bucket index writers use the timeline service to determine bucket-to-task
assignment. This improves bucket routing consistency for simple bucket index
writes, especially when writer parallelism changes.
Public API/config impact:
- Adds Flink support for `hoodie.bucket.index.remote.partitioner.enable`.
Storage format impact:
- None.
Performance impact:
- Remote partitioning introduces timeline-service lookups during bucket
routing. The option is disabled by default and only applies to simple bucket
index when embedded timeline server is enabled.
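
The gating described above can be sketched as follows. This is a hedged
illustration, not the resolver code from this PR: the class and method names
are hypothetical, and every key name other than
`hoodie.bucket.index.remote.partitioner.enable` is an assumption. The sketch
also requires each value to be set explicitly rather than relying on defaults.

```java
import java.util.HashMap;
import java.util.Map;

class RemotePartitionerGate {
    // Key taken from this PR description.
    static final String REMOTE_PARTITIONER_KEY = "hoodie.bucket.index.remote.partitioner.enable";
    // Assumed key names, used here for illustration only.
    static final String INDEX_TYPE_KEY = "index.type";
    static final String BUCKET_ENGINE_KEY = "hoodie.index.bucket.engine";
    static final String EMBEDDED_TIMELINE_KEY = "hoodie.embed.timeline.server";

    // Remote partitioning is used only when the flag is on, the index is a
    // simple bucket index, and the embedded timeline server is enabled.
    static boolean useRemotePartitioner(Map<String, String> conf) {
        boolean requested = Boolean.parseBoolean(conf.getOrDefault(REMOTE_PARTITIONER_KEY, "false"));
        boolean bucketIndex = "BUCKET".equalsIgnoreCase(conf.get(INDEX_TYPE_KEY));
        boolean simpleEngine = "SIMPLE".equalsIgnoreCase(conf.get(BUCKET_ENGINE_KEY));
        boolean timelineServer = Boolean.parseBoolean(conf.getOrDefault(EMBEDDED_TIMELINE_KEY, "false"));
        return requested && bucketIndex && simpleEngine && timelineServer;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(REMOTE_PARTITIONER_KEY, "true");
        conf.put(INDEX_TYPE_KEY, "BUCKET");
        conf.put(BUCKET_ENGINE_KEY, "SIMPLE");
        conf.put(EMBEDDED_TIMELINE_KEY, "true");
        System.out.println(useRemotePartitioner(conf)); // prints true

        // Without the embedded timeline server the flag has no effect.
        conf.put(EMBEDDED_TIMELINE_KEY, "false");
        System.out.println(useRemotePartitioner(conf)); // prints false
    }
}
```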
### Risk Level
medium
This changes Flink bucket index routing behavior when the new option is
enabled. The default behavior is unchanged because the option defaults to
`false`.
Risk mitigation:
- The feature is gated to simple bucket index.
- The feature requires embedded timeline server to be enabled.
- Added unit tests for option resolution, write config propagation, remote
partition calculation, and Flink write client config.
- Added an integration test for Flink bucket stream writes with remote
partitioner enabled.
Verification:
- `mvn -pl hudi-flink-datasource/hudi-flink -am -Dcheckstyle.skip=true
-DskipITs -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
-Dtest=TestOptionsResolver,TestStreamerUtil,TestFlinkWriteClients,TestBucketIndexRemotePartitioner
test`
- `mvn -pl hudi-flink-datasource/hudi-flink -am -Dcheckstyle.skip=true
-DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
-Dtest=ITTestBucketStreamWrite#testRemotePartitioner test`
### Documentation Update
Required.
The Hudi website/config documentation should be updated to describe Flink
support for:
- `hoodie.bucket.index.remote.partitioner.enable`
The documentation should mention that this option applies to Flink simple
bucket index writes, requires embedded timeline server, and defaults to `false`.
### Contributor's checklist
- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable