JNSimba opened a new pull request, #64566:
URL: https://github.com/apache/doris/pull/64566
## Proposed changes
### Problem
In `FROM ... TO ...` (at-least-once) CDC streaming jobs, the snapshot split-admission step (`advanceSplitsIfNeed` → `advanceSplits`) admits only **one table's** batch of splits per effective scheduler tick, and the tick interval is bound to `max_interval`. When a job syncs many small tables (e.g. 1000 tables), the splitting pace (~1 table/tick) falls far behind the consumer, so the overall snapshot is much slower than syncing a single large table.
### Change
Turn the single per-tick `advanceSplits()` call into a **bounded loop** inside one scheduler-task invocation. Within a tick, the loop keeps admitting splits until one of the following holds:
- splitting is complete (`noMoreSplits()`), or
- the pending (produced-but-unconsumed) queue reaches `MAX_PENDING_SPLITS` (512), a safety valve bounding the FE-side backlog, or
- the loop approaches the next interval tick (`deadline = now + interval - margin`), at which point it yields so that the pre-armed next tick is scheduled normally and resumes the remaining work.
A no-progress round breaks the loop to avoid spinning. A new `SourceOffsetProvider.pendingSplitCount()` exposes the pending queue depth (the JDBC provider returns `remainingSplits.size()` under the splits lock).
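
The loop described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `SplitSource`, `ToySource`, the injected `clockMs`, and the `int` return value of `advanceSplits()` (used here to detect a no-progress round) are all assumptions made for the example. Only the names `advanceSplitsIfNeed`, `advanceSplits`, `noMoreSplits`, `pendingSplitCount`, `remainingSplits`, and `MAX_PENDING_SPLITS` (512) come from the description.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.LongSupplier;

public class BoundedSplitAdmission {
    // Cap quoted in the PR description.
    static final int MAX_PENDING_SPLITS = 512;

    /** Hypothetical, reduced view of the provider; not the real Doris interface. */
    interface SplitSource {
        boolean noMoreSplits();
        int pendingSplitCount();
        /** Admits one table's batch; returning the admitted count is an assumption. */
        int advanceSplits();
    }

    /** Toy provider: every table contributes a fixed number of splits. */
    static final class ToySource implements SplitSource {
        private final ReentrantLock splitsLock = new ReentrantLock();
        private final Deque<String> remainingSplits = new ArrayDeque<>();
        private final int splitsPerTable;
        private int tablesLeft;

        ToySource(int tables, int splitsPerTable) {
            this.tablesLeft = tables;
            this.splitsPerTable = splitsPerTable;
        }

        @Override
        public boolean noMoreSplits() {
            splitsLock.lock();
            try {
                return tablesLeft == 0;
            } finally {
                splitsLock.unlock();
            }
        }

        @Override
        public int pendingSplitCount() {
            // Queue depth is read under the same lock that guards the queue.
            splitsLock.lock();
            try {
                return remainingSplits.size();
            } finally {
                splitsLock.unlock();
            }
        }

        @Override
        public int advanceSplits() {
            splitsLock.lock();
            try {
                if (tablesLeft == 0) {
                    return 0;
                }
                tablesLeft--;
                for (int i = 0; i < splitsPerTable; i++) {
                    remainingSplits.add("table" + tablesLeft + "-split" + i);
                }
                return splitsPerTable;
            } finally {
                splitsLock.unlock();
            }
        }
    }

    /** Runs one tick's bounded loop and returns the number of productive rounds. */
    static int advanceSplitsIfNeed(SplitSource source, LongSupplier clockMs,
                                   long intervalMs, long marginMs) {
        long deadline = clockMs.getAsLong() + intervalMs - marginMs;
        int rounds = 0;
        while (!source.noMoreSplits()
                && source.pendingSplitCount() < MAX_PENDING_SPLITS
                && clockMs.getAsLong() < deadline) {
            if (source.advanceSplits() <= 0) {
                break; // no-progress round: stop instead of spinning
            }
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        ToySource source = new ToySource(1000, 1);
        int rounds = advanceSplitsIfNeed(source, System::currentTimeMillis, 10_000, 1_000);
        System.out.println("rounds=" + rounds + " pending=" + source.pendingSplitCount());
    }
}
```

In this toy setup nothing consumes the queue, so with 1000 single-split tables the loop should stop at the pending cap rather than finishing all tables in one tick. Injecting the clock keeps the deadline exit testable without sleeping.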
### Safety
- No new persisted fields; FE-restart replay is unchanged (CDC split progress is still rebuilt from `chunk_list` meta).
- The deadline yield preserves the existing single-flight scheduling: the next interval tick is not coalesced away.
- The 512 cap bounds FE memory / `chunk_list` write bursts for extreme table counts.
### Tests
- New UT `StreamingInsertJobAdvanceSplitsTest` covers the loop exit conditions (loops until done / stops at pending cap / breaks on a no-progress round / entry guard).
- Existing `*_async_split_*` regression suites (multi-table, uneven string PK, restart-fe, pause-resume; pg / mysql / tvf) exercise splitting correctness, replay and pause/resume through the new loop.