danny0405 opened a new pull request, #19022:
URL: https://github.com/apache/hudi/pull/19022
### Describe the issue this Pull Request addresses
Flink `WriteProfile` and `DeltaWriteProfile` refresh average record size
from commit or delta-commit metadata. When the current timeline has no eligible
commit metadata for size estimation, the refresh path could fall back to the
configured default record size estimate, even if the profile had already
learned a better value from a preceding eligible commit.
This can make `recordsPerBucket` jump back to the default estimate after a
reload, affecting insert bucket sizing for Flink COW and MOR writes. This PR
keeps the preceding profiled average size when a later refresh cannot derive a
valid estimate from eligible commits or delta commits.
### Summary and Changelog
Flink write profiles now preserve the last successful average record size
estimate across reloads when no eligible commit metadata is available for the
current estimation pass.
- Updated `WriteProfile.averageBytesPerRecord()` to seed estimation from the
existing `avgSize` once initialized, falling back to
`COPY_ON_WRITE_RECORD_SIZE_ESTIMATE` only before any profiled value exists.
- Updated `DeltaWriteProfile.averageBytesPerRecord()` with the same reuse
behavior for MOR commit and delta-commit estimation.
- Moved average-size logging into `WriteProfile.recordProfile()` so the
refreshed value is logged once after the profile is updated.
- Added regression coverage in `TestBucketAssigner` for:
- `testWriteProfileReusesPreviousAvgSizeWhenNoEligibleCommitOnReload`
-
`testDeltaWriteProfileReusesPreviousAvgSizeWhenNoEligibleDeltaCommitOnReload`
### Impact
This affects Flink write bucket sizing for COW and MOR tables when profile
reloads encounter commits or delta commits that are not eligible for
record-size estimation. The change avoids reverting to the configured default
estimate after a better preceding estimate has already been learned.
There are no public API, config, storage format, or compatibility changes.
Performance impact is negligible; the change only reuses an already cached
in-memory value during profile refresh.
### Risk Level
low
The change is limited to Flink `WriteProfile` and `DeltaWriteProfile`
average record size estimation. The main behavioral risk is retaining a stale
estimate longer when no eligible metadata exists, but that is the intended
behavior and is safer than reverting to a less accurate configured default.
Once eligible commit metadata is available again, the estimate is refreshed
from metadata as before.
Verified with:
`mvn -pl hudi-flink-datasource/hudi-flink -am -Dtest=TestBucketAssigner
-Dsurefire.failIfNoSpecifiedTests=false test -DskipITs`
Result: `TestBucketAssigner` ran 16 tests with 0 failures.
### Documentation Update
none
This is an internal Flink write profile estimation fix with no new
user-facing configuration, API, or documented behavior change.
### Contributor's checklist
- [ ] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]