snehith-justdial opened a new issue, #13912:
URL: https://github.com/apache/skywalking/issues/13912

   ### Search before asking
   
   - [x] I had searched in the [issues](https://github.com/apache/skywalking/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### Apache SkyWalking Component
   
   BanyanDB (apache/skywalking-banyandb)
   
   ### What happened
   
   All data nodes panic and crash (exit code 2). A "background merger takes too long" warning appears 30–60 minutes before the panic:
   
   
   
   {"level":"warn","module":"SW_JDCORE_TRACE.SEGID-20260613.SHARD4","beforeTotalCount":2468627,"afterTotalCount":2468627,"beforePartCount":15,"elapsed":1105669.876047,"time":"2026-06-13T21:46:07Z","message":"background merger takes too long"}
   
   {"level":"panic","time":"2026-06-13T22:19:41Z","message":"offset 512565614 must be equal to bytesRead 512563919"}
   panic: offset 512565614 must be equal to bytesRead 512563919
   
   goroutine ... [running]:
   banyand/trace.(*partIter).init(...)
       /mnt/d/worktree/release-0.10.2/banyand/trace/part_iter.go:359 +0xf5
   banyand/trace.(*merger).mergeParts(...)
       /mnt/d/worktree/release-0.10.2/banyand/trace/merger.go:417 +0x79e
   banyand/trace.(*merger).merge(...)
       /mnt/d/worktree/release-0.10.2/banyand/trace/merger.go:340 +0x42a
   banyand/trace.(*merger).run(...)
       /mnt/d/worktree/release-0.10.2/banyand/trace/merger.go:118 +0x145
   banyand/trace.(*merger).run(...)
       /mnt/d/worktree/release-0.10.2/banyand/trace/merger.go:104 +0x125
   banyand/trace.(*merger).runMerge(...)
       /mnt/d/worktree/release-0.10.2/banyand/trace/merger.go:78 +0x1f9
   banyand/trace.(*merger).background(...)
       /mnt/d/worktree/release-0.10.2/banyand/trace/merger.go:90 +0x271
   
   A second variant was also observed (nil pointer dereference):
   
   
   panic: runtime error: invalid memory address or nil pointer dereference
   banyand/trace.(*dataBlock).marshal(...)
       banyand/trace/block_metadata.go:48
   banyand/trace.(*merger).flush(...)
       banyand/trace/merger.go:422
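
The interval between the warning and the panic can be read from the `time` fields of the two JSON log lines above. A minimal Go sketch for measuring it follows. The `logEntry` type and `gapBetween` helper are hypothetical names used only for illustration, and the warning line is trimmed to the fields the sketch needs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// logEntry holds only the fields this sketch needs from a JSON log line.
// Hypothetical type, not part of BanyanDB.
type logEntry struct {
	Level   string    `json:"level"`
	Time    time.Time `json:"time"` // time.Time decodes RFC 3339 strings
	Message string    `json:"message"`
}

// gapBetween parses two JSON log lines and returns how much later
// the second line was emitted than the first.
func gapBetween(first, second string) (time.Duration, error) {
	var a, b logEntry
	if err := json.Unmarshal([]byte(first), &a); err != nil {
		return 0, err
	}
	if err := json.Unmarshal([]byte(second), &b); err != nil {
		return 0, err
	}
	return b.Time.Sub(a.Time), nil
}

func main() {
	// Timestamps and messages copied from the log lines in this report.
	warn := `{"level":"warn","time":"2026-06-13T21:46:07Z","message":"background merger takes too long"}`
	pnc := `{"level":"panic","time":"2026-06-13T22:19:41Z","message":"offset 512565614 must be equal to bytesRead 512563919"}`

	gap, err := gapBetween(warn, pnc)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(gap) // prints 33m34s
}
```

For the pair of lines quoted here the gap is 33m34s, which falls within the 30–60 minute window described above.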
   
   ### What you expected to happen
   
   Background trace compaction completes without error.
   
   ### How to reproduce
   
   Environment:
   
   - BanyanDB: 0.10.0, 0.10.1, 0.10.2 (all affected, confirmed independently)
   - OAP: 10.4.0
   - Deployment: 3-node cluster with 3 liaison pods + 3 data-hot pods (StatefulSet via Helm chart v0.6.0)
   - Trace shards: 6
   - Storage: static local PVs, 1500Gi trace volume per data node
   - Ingestion rate: ~2.5 million requests/minute (trace segments written to BanyanDB)
   
   Steps:
   
   - Install a fresh BanyanDB cluster (3 data nodes, 3 liaison nodes)
   - Connect OAP 10.4.0 with SW_STORAGE_BANYANDB_TRACE_SHARD_NUM: 6
   - Run continuous trace ingestion at ~2M+ spans/minute
   - Wait ~34–47 hours for background compaction to trigger on the accumulated data
   
   ### Anything else
   
   - After a crash and restart, data nodes stabilize for another ~33 hours before the panic recurs as trace data accumulates again.
   - The panic was first seen with the image tagged 0.10.0 (build path: /mnt/d/release-0.11.0/) and then confirmed in the genuine 0.10.2 build (build path: /mnt/d/worktree/release-0.10.2/).
   - The assertion at part_iter.go:359 fails when offset != bytesRead, suggesting either a data race in the concurrent writer/flusher or byte-level corruption in the part file format during high-throughput writes.
   
   ### Are you willing to submit a pull request to fix on your own?
   
   - [ ] Yes I am willing to submit a pull request on my own!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   

