fhan688 opened a new pull request, #18897:
URL: https://github.com/apache/hudi/pull/18897

   ### Describe the issue this Pull Request addresses
   
     This PR wires Flink simple bucket index writes to the existing 
timeline-service-backed remote partitioner path.
   
     Today, the Flink simple bucket index routes records using a bucket-to-task 
mapping that each job computes locally. When writer parallelism changes, bucket 
ownership can shift across subtasks, so the writer can route records and load 
bucket file groups inconsistently. Hudi already has remote partition helper 
support in the timeline service and write config, but the Flink simple bucket 
index write pipeline does not use it.
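
     To illustrate how ownership can shift, here is a minimal sketch that assumes the local mapping is a plain `bucketId % parallelism`. That formula is a simplification chosen for illustration (the real Flink mapping in Hudi may also factor in the partition path), and the class and method names are hypothetical, not Hudi APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not Hudi code.
final class LocalBucketRouting {

  // Assumed local mapping: the owner subtask depends only on the bucket id
  // and the current writer parallelism.
  static int ownerTask(int bucketId, int parallelism) {
    return Math.floorMod(bucketId, parallelism);
  }

  // Lists the buckets whose owner differs between two parallelism settings.
  static List<Integer> movedBuckets(int numBuckets, int oldParallelism, int newParallelism) {
    List<Integer> moved = new ArrayList<>();
    for (int bucket = 0; bucket < numBuckets; bucket++) {
      if (ownerTask(bucket, oldParallelism) != ownerTask(bucket, newParallelism)) {
        moved.add(bucket);
      }
    }
    return moved;
  }

  public static void main(String[] args) {
    // With 8 buckets, rescaling from 4 to 3 subtasks changes the owner of 5 buckets.
    System.out.println(movedBuckets(8, 4, 3)); // prints [3, 4, 5, 6, 7]
  }
}
```

     Under this assumed mapping, any component that still relies on the old parallelism would disagree with the new owner for the listed buckets.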
   
     This PR enables Flink writers to use remote bucket partition assignment 
when `hoodie.bucket.index.remote.partitioner.enable=true`.
   
   ### Summary and Changelog
   
     This PR adds remote partitioner support for Flink simple bucket index.
   
     Changes:
     - Add Flink option `hoodie.bucket.index.remote.partitioner.enable`, backed 
by the existing core bucket remote partitioner config key.
     - Add Flink option resolution logic that enables remote partitioning only 
for the simple bucket index with the embedded timeline server enabled.
     - Add `BucketIndexRemotePartitioner` for Flink `partitionCustom`, using 
`RemotePartitionHelper` and `NumBucketsFunction`.
     - Wire the remote partitioner into Flink bulk insert and streaming write 
bucket index pipelines.
     - Make `BucketStreamWriteFunction` use remote partition assignment when 
deciding whether a bucket belongs to the current writer subtask.
     - Propagate the Flink option into `HoodieWriteConfig`.
     - Include the remote partitioner flag in the `EmbeddedTimelineService` 
reuse identity to avoid sharing a timeline service across incompatible writer 
configs.
     - Add unit tests and an integration test covering the enabled remote 
partitioner path.
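
     The gating rule described in the list above can be sketched as a single predicate. This is a hedged illustration only: `RemotePartitionerGate`, `useRemotePartitioner`, and the string values used for the index type and bucket engine are assumptions for this example, not the resolver added by this PR.

```java
import java.util.Locale;

// Hypothetical sketch of the gating rule; not the actual Hudi option resolver.
final class RemotePartitionerGate {

  static boolean useRemotePartitioner(
      boolean remotePartitionerEnabled,      // value of hoodie.bucket.index.remote.partitioner.enable
      String indexType,                      // assumed value for a bucket index: "BUCKET"
      String bucketEngine,                   // assumed value for the simple engine: "SIMPLE"
      boolean embeddedTimelineServerEnabled) {
    boolean simpleBucketIndex =
        "BUCKET".equals(normalize(indexType)) && "SIMPLE".equals(normalize(bucketEngine));
    // All three conditions must hold; otherwise the local mapping stays in use.
    return remotePartitionerEnabled && simpleBucketIndex && embeddedTimelineServerEnabled;
  }

  // Treats null as empty and compares case-insensitively.
  private static String normalize(String value) {
    return value == null ? "" : value.trim().toUpperCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    System.out.println(useRemotePartitioner(true, "BUCKET", "SIMPLE", true));             // true
    System.out.println(useRemotePartitioner(true, "BUCKET", "CONSISTENT_HASHING", true)); // false
    System.out.println(useRemotePartitioner(true, "BUCKET", "SIMPLE", false));            // false
  }
}
```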
   
     No code was copied from external sources.
   
   ### Impact
   
     This is a user-facing Flink write feature, disabled by default.
   
     When `hoodie.bucket.index.remote.partitioner.enable=true`, Flink simple 
bucket index writers use the timeline service to determine bucket-to-task 
assignment. This improves bucket routing consistency for simple bucket index 
writes, especially when writer parallelism changes.
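
     The consistency argument can be sketched as follows: the shuffle that routes a record and the writer that checks bucket ownership both ask the same assignment source, so they cannot disagree. The code is a minimal sketch under stated assumptions. `BucketAssigner` and `InMemoryAssigner` are hypothetical stand-ins for the timeline-service-backed helper, and the offset-based assignment is assumed logic for illustration, not a description of how the timeline service computes assignments.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not Hudi code.
final class SharedBucketAssignment {

  // Single source of truth for bucket ownership.
  interface BucketAssigner {
    int taskFor(String partitionPath, int bucketId, int parallelism);
  }

  // In-memory stand-in for a remote service. Assumed logic: each partition
  // path gets a start offset in arrival order, and its buckets are spread
  // across subtasks starting from that offset.
  static final class InMemoryAssigner implements BucketAssigner {
    private final Map<String, Integer> startOffsets = new HashMap<>();

    @Override
    public synchronized int taskFor(String partitionPath, int bucketId, int parallelism) {
      Integer offset = startOffsets.get(partitionPath);
      if (offset == null) {
        offset = startOffsets.size();
        startOffsets.put(partitionPath, offset);
      }
      return Math.floorMod(offset + bucketId, parallelism);
    }
  }

  // Shuffle side: the writer subtask a record is sent to.
  static int route(BucketAssigner assigner, String partitionPath, int bucketId, int parallelism) {
    return assigner.taskFor(partitionPath, bucketId, parallelism);
  }

  // Writer side: whether the given subtask owns the bucket.
  static boolean ownsBucket(
      BucketAssigner assigner, String partitionPath, int bucketId, int parallelism, int taskId) {
    return assigner.taskFor(partitionPath, bucketId, parallelism) == taskId;
  }

  public static void main(String[] args) {
    BucketAssigner assigner = new InMemoryAssigner();
    int parallelism = 4;
    boolean consistent = true;
    for (String partition : new String[] {"dt=2024-01-01", "dt=2024-01-02"}) {
      for (int bucket = 0; bucket < 8; bucket++) {
        int task = route(assigner, partition, bucket, parallelism);
        consistent &= ownsBucket(assigner, partition, bucket, parallelism, task);
      }
    }
    System.out.println(consistent); // prints true
  }
}
```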
   
     Public API/config impact:
     - Adds Flink support for `hoodie.bucket.index.remote.partitioner.enable`.
   
     Storage format impact:
     - None.
   
     Performance impact:
     - Remote partitioning introduces timeline-service lookups during bucket 
routing. The option is disabled by default and only applies to the simple bucket 
index when the embedded timeline server is enabled.
   
   ### Risk Level
   
     medium
   
     This changes Flink bucket index routing behavior when the new option is 
enabled. The default behavior is unchanged because the option defaults to 
`false`.
   
     Risk mitigation:
     - The feature is gated to simple bucket index.
     - The feature requires the embedded timeline server to be enabled.
     - Added unit tests for option resolution, write config propagation, remote 
partition calculation, and Flink write client config.
     - Added an integration test for Flink bucket stream writes with remote 
partitioner enabled.
   
     Verification:
     - `mvn -pl hudi-flink-datasource/hudi-flink -am -Dcheckstyle.skip=true -DskipITs -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dtest=TestOptionsResolver,TestStreamerUtil,TestFlinkWriteClients,TestBucketIndexRemotePartitioner test`
     - `mvn -pl hudi-flink-datasource/hudi-flink -am -Dcheckstyle.skip=true -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dtest=ITTestBucketStreamWrite#testRemotePartitioner test`
   
   ### Documentation Update
   
     Required.
   
     The Hudi website/config documentation should be updated to describe Flink 
support for:
   
     - `hoodie.bucket.index.remote.partitioner.enable`
   
     The documentation should mention that this option applies to Flink simple 
bucket index writes, requires the embedded timeline server, and defaults to 
`false`.
   
   ### Contributor's checklist
   
     - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
     - [x] Enough context is provided in the sections above
     - [x] Adequate tests were added if applicable

