snehith-justdial opened a new issue, #13912: URL: https://github.com/apache/skywalking/issues/13912
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/skywalking/issues?q=is%3Aissue) and found no similar issues.

### Apache SkyWalking Component

BanyanDB (apache/skywalking-banyandb)

### What happened

All data nodes panic and crash (Exit Code 2). The "background merger takes too long" warning appears 30–60 minutes before the panic, followed by:

    {"level":"warn","module":"SW_JDCORE_TRACE.SEGID-20260613.SHARD4","beforeTotalCount":2468627,"afterTotalCount":2468627,"beforePartCount":15,"elapsed":1105669.876047,"time":"2026-06-13T21:46:07Z","message":"background merger takes too long"}
    {"level":"panic","time":"2026-06-13T22:19:41Z","message":"offset 512565614 must be equal to bytesRead 512563919"}
    panic: offset 512565614 must be equal to bytesRead 512563919

    goroutine ... [running]:
    banyand/trace.(*partIter).init(...)
        /mnt/d/worktree/release-0.10.2/banyand/trace/part_iter.go:359 +0xf5
    banyand/trace.(*merger).mergeParts(...)
        /mnt/d/worktree/release-0.10.2/banyand/trace/merger.go:417 +0x79e
    banyand/trace.(*merger).merge(...)
        /mnt/d/worktree/release-0.10.2/banyand/trace/merger.go:340 +0x42a
    banyand/trace.(*merger).run(...)
        /mnt/d/worktree/release-0.10.2/banyand/trace/merger.go:118 +0x145
    banyand/trace.(*merger).run(...)
        /mnt/d/worktree/release-0.10.2/banyand/trace/merger.go:104 +0x125
    banyand/trace.(*merger).runMerge(...)
        /mnt/d/worktree/release-0.10.2/banyand/trace/merger.go:78 +0x1f9
    banyand/trace.(*merger).background(...)
        /mnt/d/worktree/release-0.10.2/banyand/trace/merger.go:90 +0x271

A second variant was also observed (nil pointer):

    panic: runtime error: invalid memory address or nil pointer dereference
    banyand/trace.(*dataBlock).marshal(...)
        banyand/trace/block_metadata.go:48
    banyand/trace.(*merger).flush(...)
        banyand/trace/merger.go:422

### What you expected to happen

Background trace compaction completes without error.

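As a reading aid for the first panic message: it reads like a contiguity check, where each block's recorded offset is expected to equal the number of bytes consumed so far. The sketch below is a minimal Python illustration of that kind of check under that assumption. It is not BanyanDB's actual Go code, and `BlockMeta` and `check_contiguous` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BlockMeta:
    """Hypothetical metadata entry: where a block starts and how long it is."""
    offset: int
    size: int


def check_contiguous(blocks: Iterable[BlockMeta]) -> int:
    """Return the total bytes covered by the blocks.

    Raises ValueError if any block does not start exactly where the
    previous one ended (assumed invariant, mirroring the panic text).
    """
    bytes_read = 0
    for block in blocks:
        if block.offset != bytes_read:
            raise ValueError(
                f"offset {block.offset} must be equal to bytesRead {bytes_read}"
            )
        bytes_read += block.size
    return bytes_read
```

In the panic reported above, the recorded offset is 1,695 bytes ahead of the bytes read (512565614 - 512563919), which under this reading would mean a gap or a wrong size somewhere earlier in the metadata.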
### How to reproduce

- BanyanDB: 0.10.0, 0.10.1, 0.10.2 (all affected, confirmed independently)
- OAP: 10.4.0
- Deployment: 3-node cluster with 3 liaison pods + 3 data-hot pods (StatefulSet via Helm chart v0.6.0)
- Trace shards: 6
- Storage: static local PVs, 1500Gi trace volume per data node
- Ingestion rate: ~2.5 million requests/minute (trace segments written to BanyanDB)
- Fresh BanyanDB cluster install (3 data nodes, 3 liaison nodes)
- Connect OAP 10.4.0 with SW_STORAGE_BANYANDB_TRACE_SHARD_NUM: 6
- Run continuous trace ingestion at ~2M+ spans/minute
- Wait ~34–47 hours for background compaction to trigger on accumulated data

### Anything else

- After crash + restart, data nodes stabilize for another ~33 hours before the panic recurs as trace data accumulates again.
- The panic was first seen with the image tagged 0.10.0 (build path: /mnt/d/release-0.11.0/) and then confirmed in the genuine 0.10.2 build (path: /mnt/d/worktree/release-0.10.2/).
- The assertion at part_iter.go:359 fails when offset != bytesRead, suggesting either a data race in the concurrent writer/flusher or a byte-level corruption in the part file format during high-throughput writes.

### Are you willing to submit a pull request to fix on your own?

- [ ] Yes I am willing to submit a pull request on my own!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
