qianye1001 commented on issue #10397: URL: https://github.com/apache/rocketmq/issues/10397#issuecomment-4560519111
## 🤖 Automated Fix Proposal (v1) I've analyzed this issue against the latest upstream code and generated a fix specification. **Code Base:** c9e51d625e1d (develop) ### Summary **Root Cause:** When TLS certificates are hot-reloaded, both `NettyRemotingServer.loadSslContext()` and `ProxyAndTlsProtocolNegotiator.loadSslContext()` overwrite the `sslContext` field without releasing the old instance. Since `OpenSslContext` allocates native off-heap memory (SSL_CTX, X509 chain, EVP_PKEY), this native memory leaks on every certificate rotation cycle. **Fix Strategy:** 1. Save the old `SslContext` reference before building the replacement 2. After successful assignment of the new context, call `ReferenceCountUtil.release(oldSslContext)` to free native memory 3. Use "build new, then release old" ordering to ensure `sslContext` is never null or prematurely released **Files to Modify:** - `remoting/src/main/java/org/apache/rocketmq/remoting/netty/NettyRemotingServer.java` — add old context release in `loadSslContext()` - `proxy/src/main/java/org/apache/rocketmq/proxy/grpc/ProxyAndTlsProtocolNegotiator.java` — add `volatile` + old context release in `loadSslContext()` **Risk Assessment:** LOW — straightforward resource lifecycle fix following Netty's reference-counting pattern. Fully backward compatible. ### ⚠️ Duplicate Issue Notice This issue describes the **same root cause** as issue #10395. An existing fix has been proposed in PR #10396 (branch `fix/issue-10395`). If PR #10396 is merged, this issue can be closed as a duplicate. If the community prefers a separate or revised fix for this issue, please provide feedback below. <details> <summary>Full Specification (click to expand)</summary> ```markdown # Fix Specification: TLS Certificate Hot-Reload Native Memory Leak **Issue:** #10397 - TLS certificate hot-reload leaks native memory due to unreleased old SslContext **Related Issue:** #10395 (duplicate describing the same root cause) **Existing PR:** #10396 (branch `fix/issue-10395`, commit `7d55555`) --- ## 1. Root Cause Analysis When TLS certificates are hot-reloaded, both `NettyRemotingServer.loadSslContext()` and `ProxyAndTlsProtocolNegotiator.loadSslContext()` overwrite the `sslContext` field with a newly built `SslContext` instance **without releasing the previous one**. Netty's `SslContext` (especially when using `SslProvider.OPENSSL` / `SslProvider.OPENSSL_REFCNT`) wraps native OpenSSL resources managed via reference counting. When the old reference is simply discarded: - The native SSL_CTX and associated buffers are never freed. - Each reload accumulates leaked native memory proportional to the size of the certificate chain and private key material. - Over time this leads to OOM or process termination by the OS OOM-killer. **Affected code paths:** | Location | Line | Behavior | |----------|------|----------| | `NettyRemotingServer.loadSslContext()` | remoting/.../NettyRemotingServer.java:186 | `sslContext = TlsHelper.buildSslContext(false);` overwrites without release | | `ProxyAndTlsProtocolNegotiator.loadSslContext()` | proxy/.../ProxyAndTlsProtocolNegotiator.java:118,131 | `sslContext = GrpcSslContexts.forServer(...)...build();` overwrites without release | The `sslContext` field in `NettyRemotingAbstract` is already declared `volatile` (line 121), so `NettyRemotingServer` is thread-safe for reads. However, in `ProxyAndTlsProtocolNegotiator` the field is `private static` without `volatile`, creating a potential visibility issue across threads. --- ## 2. Fix Strategy (Step by Step) ### NettyRemotingServer.loadSslContext() 1. Before building the new `SslContext`, save the current value: `SslContext oldSslContext = this.sslContext;` 2. Build the new context via `TlsHelper.buildSslContext(false)`. 3. Assign the new context to `this.sslContext`. 4. If `oldSslContext != null`, call `ReferenceCountUtil.release(oldSslContext)` to decrement its reference count and free native resources. 5. Wrap the release in a try-catch to avoid disrupting the reload if release fails (log warning). ### ProxyAndTlsProtocolNegotiator.loadSslContext() 1. Add `volatile` modifier to the `private static SslContext sslContext` field for thread-safe visibility. 2. At the start of `loadSslContext()`, save: `SslContext oldSslContext = sslContext;` 3. Build the new context (both test-mode and production paths already produce a new `SslContext`). 4. Assign to `sslContext`. 5. After successful assignment, release the old context: `if (oldSslContext != null) ReferenceCountUtil.release(oldSslContext);` 6. Wrap the release in a try-catch to avoid disrupting the reload. ### Ordering Rationale "Build new, then release old" ensures: - `sslContext` is neve ``` </details> ### Next Steps - Reply `/approve` to proceed with generating a PR (or to endorse the existing PR #10396 as the fix for both issues) - Reply `/revise <feedback>` to request changes to the fix approach - Reply `/reject` to close this proposal *This proposal will expire in 72 hours if no response is received.* -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
