mrproliu opened a new pull request, #1181: URL: https://github.com/apache/skywalking-banyandb/pull/1181
## What

Fixes the recurring OOM-kill of the `backup` sidecar on large (cold-tier) data nodes, and speeds up snapshot uploads so a full backup finishes well within its schedule interval.

## Why (root cause)

A full cold backup uploads a large number of tiny snapshot files to remote storage **sequentially**, which can take far longer than the `@hourly` schedule. The shared scheduler (`pkg/timestamp/scheduler.go`) abandons (but does **not cancel**) an action that exceeds its internal 5-minute timeout, then schedules the next run anyway. The abandoned backup goroutine keeps running and holding memory, so consecutive runs **overlap and stack up**. When the `backup` container is given a small memory budget, the overlapping runs exceed it and the process is OOM-killed, while each run logs `action timed out` and keeps uploading.

## Changes

All changes are scoped to the backup module; the shared scheduler is intentionally left untouched.

- **Prevent overlap (the OOM fix).** A `backupInFlight` guard in the backup scheduler callback skips a scheduled run while the previous one is still in progress, so only one backup runs at a time and memory stays bounded to a single run.
- **Parallelize small-file uploads.** `backupSnapshot` now streams the snapshot via `filepath.Walk` (never materializing the full local file list) and uploads files `< 5 MiB` concurrently via `errgroup` with a bounded limit; larger files are uploaded sequentially to keep peak write-buffer memory bounded. Concurrency is configurable via a new `--upload-concurrency` flag (default `8`).
- **Single-request GCS upload (feature).** For seekable sources (the backup's `*os.File`), the checksum is computed in a first pass and written as object metadata together with the object in one request, removing the per-object metadata `Update` round-trip. Objects smaller than one chunk also use a single-request (non-resumable) upload.
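
For readers who want a concrete picture of the first two changes, here is a minimal sketch, not the code in this PR. The names `inFlightGuard`, `errBackupInFlight`, `file`, `smallFileLimit`, and `uploadAll` are hypothetical. To stay within the standard library it uses a channel semaphore and `sync.WaitGroup` where the PR uses `errgroup`, and it takes a slice of files where the PR streams them from `filepath.Walk`.

```go
package main

import (
	"errors"
	"sync"
	"sync/atomic"
)

// errBackupInFlight is a hypothetical sentinel; the PR does not say how a
// skipped run is reported.
var errBackupInFlight = errors.New("previous backup still in progress")

// inFlightGuard lets at most one run proceed at a time.
type inFlightGuard struct {
	running atomic.Bool
}

// Run executes fn unless another Run call is still executing, in which case
// it returns errBackupInFlight immediately without calling fn.
func (g *inFlightGuard) Run(fn func() error) error {
	if !g.running.CompareAndSwap(false, true) {
		return errBackupInFlight
	}
	defer g.running.Store(false)
	return fn()
}

// smallFileLimit mirrors the 5 MiB threshold named in the description.
const smallFileLimit = 5 << 20

// file is a hypothetical stand-in for one snapshot file.
type file struct {
	Path string
	Size int64
}

// uploadAll runs upload for every file. Files below smallFileLimit run in
// goroutines, with at most limit of them in flight; larger files are uploaded
// one at a time in the calling goroutine. It returns the first error seen.
// Unlike an errgroup with a context, it does not stop early on failure.
func uploadAll(files []file, limit int, upload func(file) error) error {
	if limit < 1 {
		limit = 1
	}
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	record := func(err error) {
		if err != nil {
			once.Do(func() { firstErr = err })
		}
	}
	sem := make(chan struct{}, limit) // counting semaphore for small uploads
	for _, f := range files {
		if f.Size >= smallFileLimit {
			record(upload(f)) // large file: sequential path
			continue
		}
		sem <- struct{}{} // blocks while limit small uploads are running
		wg.Add(1)
		go func(f file) {
			defer wg.Done()
			defer func() { <-sem }()
			record(upload(f))
		}(f)
	}
	wg.Wait()
	return firstErr
}
```

Under these assumptions, a scheduler callback would wrap the backup as `guard.Run(func() error { return uploadAll(files, 8, upload) })`, so a tick that fires during a long run is skipped rather than stacked on top of it.
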
The checksum helper is shared via a new `checksum.Verifier.Sum`, and writer creation/chunk-size selection is shared via a single `gcsFS.newWriter`.

## Testing

- `TestBackupSnapshotConcurrent`: parallel small-file path, sequential large-file path, and orphan deletion together (run with `-race`).
- `TestBackupSnapshotUploadError`: a failed upload surfaces an error and orphaned remote files are not deleted.
- `TestSum`: the shared checksum helper, including parity with the streaming path.
- `go build`, `golangci-lint`, and `go test -race` are green for the affected packages.

- [ ] If this pull request closes/resolves/fixes an existing issue, replace the issue number. Fixes apache/skywalking#<issue number>.
- [x] Update the [`CHANGES` log](https://github.com/apache/skywalking-banyandb/blob/main/CHANGES.md).
