wu-sheng opened a new pull request, #132:
URL: https://github.com/apache/skywalking-nodejs/pull/132
## Problem
The Node.js agent OOMs when the SkyWalking collector is unreachable
(apache/skywalking#13764).
With the backend down, the trace report loop fires every second; each gRPC
call fails **failfast** with `14 UNAVAILABLE ... ECONNREFUSED`, and
`logger.error('Failed to report trace data', error)` hands winston a multi-KB
error (full stack) on every tick. Those log objects accumulate in winston's
internal stream buffer — the `WritableState.bufferedRequest` linked list seen
in the reporter's heap dump — faster than the transport drains them, until the
heap is exhausted.
## Fix (commit 1)
- **Gate the report loop on connectivity.** Skip `collect()` while the
channel is not `READY`, so the bounded buffer is retained during an outage
instead of a stream failing and segments being dropped on every tick. grpc-js
already reconnects on its own with exponential backoff (subchannel
`BackoffTimeout`, 1s→120s). Checked against grpc-js internals:
`getConnectivityState(true)` only nudges a connection out of `IDLE` (no
reconnect storm), and failfast picks fail immediately on `TRANSIENT_FAILURE`.
- **Throttle + slim failure logging** in the Trace and Heartbeat clients via
a small `throttled()` helper: at most one line per 30s with a `suppressed`
count, reduced to the error `code`/`message` so no stack is retained.
- **Fix dead env parsing** for `SW_AGENT_MAX_BUFFER_SIZE` /
`SW_AGENT_TRACE_TIMEOUT` — `Number.isSafeInteger(process.env.X)` on a string is
always `false`, so those overrides never applied and always fell back to
defaults.
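
For readers skimming the description, the three changes above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ChannelLike`, `shouldReport`, `intFromEnv`, and the exact signature of `throttled()` are hypothetical, and the channel states are modelled as strings where grpc-js uses a numeric enum.

```typescript
// Hypothetical stand-in for a grpc-js channel; the real API returns a numeric
// connectivity-state enum rather than these string labels.
type ChannelState = 'IDLE' | 'CONNECTING' | 'READY' | 'TRANSIENT_FAILURE' | 'SHUTDOWN';
interface ChannelLike {
  getConnectivityState(tryToConnect: boolean): ChannelState;
}

// Gate: only collect and send while the channel is READY. Passing `true`
// asks an IDLE channel to start connecting.
function shouldReport(channel: ChannelLike): boolean {
  return channel.getConnectivityState(true) === 'READY';
}

interface ErrorLike {
  code?: unknown;
  message?: unknown;
}
type SlimMeta = { code: unknown; message: unknown; suppressed: number };

// Throttle: emit at most one line per interval, carrying only code/message
// plus the number of calls suppressed since the last emitted line.
// The `now` parameter is an assumption added here to make the sketch testable.
function throttled(
  log: (msg: string, meta: SlimMeta) => void,
  intervalMs: number = 30000,
  now: () => number = Date.now,
): (msg: string, err: ErrorLike) => void {
  let last = Number.NEGATIVE_INFINITY;
  let suppressed = 0;
  return (msg, err) => {
    const t = now();
    if (t - last < intervalMs) {
      suppressed++;
      return;
    }
    // Copy only the small fields so the stack is not kept alive by the logger.
    log(msg, { code: err.code, message: err.message, suppressed });
    last = t;
    suppressed = 0;
  };
}

// Env parsing: convert the string first, then validate. Checking the raw
// string with Number.isSafeInteger is always false, which was the dead path.
function intFromEnv(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === '') {
    return fallback;
  }
  const n = Number(raw);
  return Number.isSafeInteger(n) ? n : fallback;
}
```

A caller would wrap its logger once, for example `const logFailure = throttled((m, meta) => logger.error(m, meta))`, and read overrides with something like `intFromEnv(process.env.SW_AGENT_MAX_BUFFER_SIZE, defaultSize)`, where `defaultSize` stands for whatever default the agent already uses.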
## Dependency upgrade + CI (commit 2)
- `@grpc/grpc-js` `^1.6.7` → `^1.14.4` (API-compatible across 1.x; no source
changes needed).
- `grpc-tools` `^1.11.1` → `^1.13.1` — 1.13.x ships a darwin-arm64 prebuilt;
1.11.2 had none and was uninstallable on Apple Silicon.
- `grpc_tools_node_protoc_ts` → `^5.3.3`.
- `build.yaml` Node matrix `10/12/14/16/18` → `14/16/18/20` (grpc-js 1.13+
requires Node ≥12.10; matches `test.yaml`; Node 10/12 are EOL).
- `package-lock.json` regenerated.
## Verification
- `npm install` (native, arm64), `npm run build` (`tsc`), and `npm run lint`
all exit 0 with grpc-js 1.14.4.
- Not yet validated through the plugin trace-validation tests
(mock-collector) — those need Docker; CI runs them on this PR. The change only
alters behavior when the collector is unreachable, so trace content in the
happy path is unchanged.
## Notes for reviewers
- Commit 3 adds a Claude Code skill under `.claude/` that documents the build
pipeline; drop it if you'd rather not carry assistant tooling upstream.
- The `package-lock.json` diff includes a lockfileVersion 1 → 3 migration
(modern npm); happy to regenerate as v1 to keep the diff small if preferred.
Related to apache/skywalking#13764
🤖 Generated with [Claude Code](https://claude.com/claude-code)