mrproliu opened a new pull request, #1181:
URL: https://github.com/apache/skywalking-banyandb/pull/1181

   ## What
   
   Fixes the recurring OOM-kill of the `backup` sidecar on large (cold-tier) 
data nodes, and speeds up snapshot uploads so a full backup finishes well 
within its schedule interval.
   
   ## Why (root cause)
   
   A full cold backup uploads a large number of tiny snapshot files to remote 
storage **sequentially**, which can take far longer than the `@hourly` 
schedule interval. The shared scheduler (`pkg/timestamp/scheduler.go`) 
abandons, but does **not cancel**, an action that exceeds its internal 
5-minute timeout, then schedules the next run anyway. Each abandoned run logs 
`action timed out`, yet its goroutine keeps uploading and holding memory, so 
consecutive runs **overlap and stack up**. When the `backup` container is 
given a small memory budget, the overlapping runs exceed it and the process 
is OOM-killed.
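
The overlap described above can be reproduced in miniature. The sketch below is not the code in `pkg/timestamp/scheduler.go`; `runAbandoning` and `peakOverlap` are hypothetical names, and the durations are scaled down from minutes to milliseconds. It only illustrates that a timeout without cancellation leaves the earlier goroutine alive while the next one starts.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runAbandoning is a hypothetical stand-in for a scheduler that stops waiting
// after timeout but never cancels the action it started.
func runAbandoning(action func(), timeout time.Duration) (timedOut bool) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		action()
	}()
	select {
	case <-done:
		return false
	case <-time.After(timeout):
		return true // the goroutine above is still running
	}
}

// peakOverlap fires `runs` scheduled ticks back to back and reports the
// largest number of actions that were alive at the same time.
func peakOverlap(runs int, actionDur, timeout time.Duration) int32 {
	var active, peak atomic.Int32
	var wg sync.WaitGroup
	for i := 0; i < runs; i++ {
		wg.Add(1)
		runAbandoning(func() {
			defer wg.Done()
			n := active.Add(1)
			for { // record a new maximum if this run raised it
				p := peak.Load()
				if n <= p || peak.CompareAndSwap(p, n) {
					break
				}
			}
			time.Sleep(actionDur) // simulated long upload
			active.Add(-1)
		}, timeout)
	}
	wg.Wait()
	return peak.Load()
}

func main() {
	// Action far longer than the timeout: every tick stacks another run.
	fmt.Println("slow action, peak:", peakOverlap(3, 300*time.Millisecond, 10*time.Millisecond))
	// Action shorter than the timeout: runs never overlap.
	fmt.Println("fast action, peak:", peakOverlap(3, time.Millisecond, time.Second))
}
```

When the action outlasts the timeout, every tick adds one more live goroutine, which is the memory growth pattern behind the OOM-kill.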
   
   ## Changes
   
   All changes are scoped to the backup module; the shared scheduler is 
intentionally left untouched.
   
   - **Prevent overlap (the OOM fix).** A `backupInFlight` guard in the backup 
scheduler callback skips a scheduled run while the previous one is still in 
progress, so only one backup runs at a time and memory use stays bounded to 
that of a single run.
   - **Parallelize small-file uploads.** `backupSnapshot` now streams the 
snapshot via `filepath.Walk` (never materializing the full local file list) and 
uploads files `< 5 MiB` concurrently via `errgroup` with a bounded limit; 
larger files are uploaded sequentially to keep peak write-buffer memory 
bounded. Concurrency is configurable via a new `--upload-concurrency` flag 
(default `8`).
   - **Single-request GCS upload (feature).** For seekable sources (the 
backup's `*os.File`), the checksum is computed in a first pass and written as 
object metadata together with the object in one request, removing the 
per-object metadata `Update` round-trip. Objects smaller than one chunk also 
use a single-request (non-resumable) upload. The checksum helper is shared via 
a new `checksum.Verifier.Sum`, and writer creation/chunk-size selection is 
shared via a single `gcsFS.newWriter`.
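
A minimal sketch of the overlap guard from the first bullet, assuming `backupInFlight` is an atomic boolean flipped with compare-and-swap. The name comes from the description above; its type, and the `tryBackup` wrapper, are assumptions made for illustration and may differ from the actual change.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// backupInFlight is true while a backup run is in progress.
// Assumption: the real guard may use a different type or live elsewhere.
var backupInFlight atomic.Bool

// tryBackup runs backup unless another run is still in progress.
// It reports whether the backup actually ran. (Hypothetical helper.)
func tryBackup(backup func()) bool {
	if !backupInFlight.CompareAndSwap(false, true) {
		return false // previous run still active: skip this tick
	}
	defer backupInFlight.Store(false)
	backup()
	return true
}

func main() {
	var innerRan bool
	outerRan := tryBackup(func() {
		// A tick that fires while the first run is active is skipped.
		innerRan = tryBackup(func() {})
	})
	fmt.Println("outer ran:", outerRan, "inner ran:", innerRan)
	fmt.Println("next tick ran:", tryBackup(func() {}))
}
```

Skipping rather than queueing the tick means a slow backup delays the next one instead of stacking behind it.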
   
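
The small-file/large-file split can be sketched with the standard library alone. The description above uses `errgroup` with a bounded limit; this sketch swaps in a channel semaphore plus `sync.WaitGroup` so it has no external dependency. `uploadTree`, `demo`, and the `upload` callback are hypothetical names, not the PR's `backupSnapshot`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// smallFileLimit is the 5 MiB threshold named in the description.
const smallFileLimit = 5 << 20

// uploadTree walks root and calls upload for every regular file. Files below
// smallFileLimit run on up to `concurrency` goroutines; larger files are
// uploaded one at a time from the walking goroutine. upload is a hypothetical
// stand-in for the remote-storage call.
func uploadTree(root string, concurrency int, upload func(path string) error) error {
	sem := make(chan struct{}, concurrency) // counting semaphore
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	walkErr := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.Mode().IsRegular() {
			return nil
		}
		if info.Size() >= smallFileLimit {
			return upload(path) // large file: sequential
		}
		sem <- struct{}{} // blocks while `concurrency` uploads are in flight
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() { <-sem }()
			if err := upload(path); err != nil {
				once.Do(func() { firstErr = err })
			}
		}()
		return nil
	})
	wg.Wait()
	if walkErr != nil {
		return walkErr
	}
	return firstErr
}

// demo writes `files` tiny files to a temp dir, uploads them with a fake
// uploader, and reports how many uploads ran and the peak number in flight.
func demo(files, concurrency int) (uploaded, peak int, err error) {
	dir, err := os.MkdirTemp("", "snapshot")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)
	for i := 0; i < files; i++ {
		name := filepath.Join(dir, fmt.Sprintf("part-%03d.bin", i))
		if err := os.WriteFile(name, []byte("x"), 0o600); err != nil {
			return 0, 0, err
		}
	}
	var mu sync.Mutex
	active := 0
	err = uploadTree(dir, concurrency, func(string) error {
		mu.Lock()
		active++
		uploaded++
		if active > peak {
			peak = active
		}
		mu.Unlock()
		mu.Lock()
		active--
		mu.Unlock()
		return nil
	})
	return uploaded, peak, err
}

func main() {
	uploaded, peak, err := demo(20, 4)
	fmt.Println("uploaded:", uploaded, "peak within limit:", peak <= 4, "err:", err)
}
```

Because the walk callback blocks on the semaphore, the file list is never held in memory: at most `concurrency` small files are in flight while the walk is paused.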
   ## Testing
   
   - `TestBackupSnapshotConcurrent` — parallel small-file path, sequential 
large-file path, and orphan deletion together (run with `-race`).
   - `TestBackupSnapshotUploadError` — a failed upload surfaces an error and 
orphaned remote files are not deleted.
   - `TestSum` — the shared checksum helper, including parity with the 
streaming path.
   - `go build`, `golangci-lint`, and `go test -race` are green for the 
affected packages.
   
   
   - [ ] If this pull request closes/resolves/fixes an existing issue, replace 
the issue number. Fixes apache/skywalking#<issue number>.
   - [x] Update the [`CHANGES` 
log](https://github.com/apache/skywalking-banyandb/blob/main/CHANGES.md).
   

