mrproliu opened a new pull request, #1168:
URL: https://github.com/apache/skywalking-banyandb/pull/1168
## Background
The lifecycle migration moves data down the storage tiers (hot → warm →
cold). For
multi-target groups it uses **row-replay**: it reads each source part and
re-publishes
every row through the normal Write API so the target tier re-shards and
re-indexes it.
This PR fixes two production problems found while migrating real SkyWalking
metrics data
(`sw_metricsHour` / `sw_metricsMinute`): the data nodes were **OOM-killed**
during
row-replay, and parts with unrecoverable rows caused **hard migration
failures**.
## Problem 1 — OOM during row-replay
Migrating large measure parts drove a data node's heap to **~1.5 GB** and
the pod was
OOM-killed, stalling the whole migration. Three things compounded:
1. The dump reader materialized a part's column data far beyond a single
block — the
decoder's internal buffer grew to hold the whole part's string data.
2. Every replayed row allocated a **fresh marshal buffer**, so a part with
millions of
rows produced millions of short-lived large allocations.
3. There was **no bound on in-flight batches** — rows were marshaled and
queued faster
than they were confirmed, so sent-but-unconfirmed payloads piled up in
memory.
### How it's fixed
The row-replay path is rebuilt around **bounded, reusable memory**:
1. **Streaming columnar iterator** — the part is walked one block at a time
(`IterateColumnar`) using reusable scratch buffers. The decoder is reset
per block, so
the decode buffer is bounded to a single block instead of the whole part.
2. **Size-classed marshal buffer pool** (`row_replay_bufpool.go`) — each row
borrows one
reusable buffer grouped by power-of-two size class; the whole batch's
buffers are
returned together once `Publish` has put every body on the wire. Grouping
by size class
keeps a small body from pinning a large buffer, and lazy reclamation
drops classes left
idle for several flushes so a transient burst of large rows is released
rather than
hoarded.
3. **Bounded batch + bounded confirmation pipeline** — a batch flushes when
it reaches
either `rowReplayMaxBatchRows` rows or **`rowReplayMaxBatchBytes` (32
MiB)** of marshaled
bytes, and a confirmation pipeline caps how many batches may be
sent-but-unconfirmed
(`rowReplayMaxInflight`). Together these put a hard ceiling on
outstanding buffers
regardless of part size.
**Result:** peak heap drops from **~1.5 GB to ~200–300 MB**, independent of
part size.
## Problem 2 — hard failure on unresolvable rows
Some source parts can't have their rows republished because the entity
identity can't be
recovered:
- **`rebuild-failed`** — the series-index (sidx) has a gap (no
`EntityValues` for a
`seriesID`); the part's entity columns are present but no candidate's
recomputed hash
matches, so the entity can't be rebuilt.
- **`incomplete-part`** — the part carries no entity tag columns at all, so
neither sidx
nor column rebuild can recover the entity.
Previously this **hard-failed** the part/segment migration.
### How it's handled
These rows are now **skipped and retained** instead of failing the migration:
- The unresolved rows are skipped; the rest of the part migrates normally
and the part is
marked complete.
- A structured error is recorded with a bounded sample of the skipped series
(part,
`seriesID`, reason) so an operator can locate the exact source part — not
just a count.
- The **source segment is retained** (excluded from post-migration
deletion), so skipped
rows are never lost — they stay on the source tier and can still be
queried.
## Error reporting consolidation (related)
All migration errors are consolidated into a single flat list of
**structured, stage-aware
entries** (`source_stage` / `target_stage` / `group` / `catalog` / `scope` /
`segment` /
`shard` / `part` / `interval`), replacing the previous nested error maps.
The report still
exposes them under named buckets, and the `summary.*.errors` counts derive
from the same
single source.
## Validation (real cluster, full hot→warm→cold migration)
- **Migration: 288/288 parts, 100% complete, 0 hard failures, 0 node-level
send errors.**
~40M rows replayed.
- **Memory: 0 OOM-kills, 0 restarts** across all data/lifecycle containers
during the
migration window. The active row-replay process peaked at **~205 MiB** (vs
~1.5 GB
before).
- **Skip-and-retain: ~1.1M rows** skipped as `series-index gap /
incomplete-part`, each
recorded as a structured error with the source segment retained.
- **Data is queryable** after migration via the gRPC `MeasureService/Query`:
old time
ranges return **0 on `hot`** (correctly migrated out) and **full results
on `warm` and
`cold`**.
- [ ] If this pull request closes/resolves/fixes an existing issue, replace
the issue number. Fixes apache/skywalking#<issue number>.
- [x] Update the [`CHANGES`
log](https://github.com/apache/skywalking-banyandb/blob/main/CHANGES.md).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]