qianye1001 commented on issue #10398:
URL: https://github.com/apache/rocketmq/issues/10398#issuecomment-4560601394

   ## 🤖 Automated Fix Proposal (v1)
   
   I've analyzed this issue against the latest upstream code and generated a 
fix specification.
   
   **Code Base:** c9e51d625e1d (develop)
   
   ### Summary
   
   **Root Cause:** When TLS certificates are hot-reloaded, both 
`NettyRemotingServer.loadSslContext()` and 
`ProxyAndTlsProtocolNegotiator.loadSslContext()` overwrite the `sslContext` 
field without releasing the old instance. Since `OpenSslContext` is a 
`ReferenceCounted` object backed by native memory, each reload leaks ~100KB–1MB 
of native (off-heap) memory per reload cycle.
   
   **Fix Strategy:**
   1. Save the old `SslContext` reference before building the replacement
   2. After successful assignment of the new context, call 
`ReferenceCountUtil.release(oldSslContext)` to free native memory
   3. Use "build new, then release old" ordering to ensure `sslContext` is 
never null or prematurely released
   4. Add `volatile` to `ProxyAndTlsProtocolNegotiator.sslContext` field for 
thread safety
   
   **Files to Modify:**
   - 
`remoting/src/main/java/org/apache/rocketmq/remoting/netty/NettyRemotingServer.java`
 — add old context release in `loadSslContext()`
   - 
`proxy/src/main/java/org/apache/rocketmq/proxy/grpc/ProxyAndTlsProtocolNegotiator.java`
 — add `volatile` + old context release in `loadSslContext()`
   
   **Risk Assessment:** LOW — straightforward resource lifecycle fix following 
Netty's reference-counting pattern. Fully backward compatible.
   
   **Note:** A previous fix (PR #10396 for issue #10395) was closed without 
merge. This spec proposes an independent fix for the same root cause.
   
   <details>
   <summary>Full Specification (click to expand)</summary>
   
   ```markdown
   # Fix Spec: Issue #10398 — TLS Certificate Hot-Reload Leaks Native Memory
   
   ## 1. Root Cause Analysis
   
   When TLS certificates are hot-reloaded, a new `SslContext` is created and 
assigned to the field, but the **old** `SslContext` is never released. Since 
Netty's `SslContext` (especially with OpenSSL provider) is a `ReferenceCounted` 
object backed by native memory, each reload leaks the previous context's native 
allocation.
   
   **Two independent sites exhibit this bug:**
   
   | Location | File | Line | Issue |
   |----------|------|------|-------|
   | Remoting (Broker/Namesrv) | `remoting/.../NettyRemotingServer.java` | 186 
| Old `sslContext` overwritten without `release()` |
   | Proxy (gRPC) | `proxy/.../ProxyAndTlsProtocolNegotiator.java` | 118/131 | 
Old `sslContext` overwritten without `release()`; field also not `volatile` 
(visibility issue across threads) |
   
   **Impact:** Every certificate reload leaks ~100KB–1MB of native (off-heap) 
memory per reload cycle. Under frequent reloads or long-running processes, this 
leads to OOM or container kills.
   
   ## 2. Fix Strategy
   
   ### Step 1: Fix `NettyRemotingServer.loadSslContext()`
   
   1. Before assigning a new value to `sslContext`, capture the old reference.
   2. Build the new `SslContext` into a local variable.
   3. Assign the new context to the field (field is already `volatile` in 
parent `NettyRemotingAbstract`).
   4. Release the old context via `ReferenceCountUtil.release(oldSslContext)` 
inside a try-catch to prevent release failures from disrupting the reload flow.
   
   ### Step 2: Fix `ProxyAndTlsProtocolNegotiator.loadSslContext()`
   
   1. Add `volatile` to the `sslContext` field declaration to ensure 
cross-thread visibility.
   2. Before assigning a new value, capture the old reference into a local 
variable.
   3. Build the new `SslContext` into a local variable.
   4. Assign the new context to the static field.
   5. Release the old context via `ReferenceCountUtil.release(oldSslContext)` 
inside a try-catch.
   
   ### Step 3: Import `ReferenceCountUtil`
   
   Add `import io.netty.util.ReferenceCountUtil;` to both files.
   
   ## 3. Files to Modify
   
   ### File 1: 
`remoting/src/main/java/org/apache/rocketmq/remoting/netty/NettyRemotingServer.java`
   
   **Add import:**
   ```java
   import io.netty.util.ReferenceCountUtil;
   ```
   
   **Replace `loadSslContext()` method (lines 180–192):**
   
   ```java
   public void loadSslContext() {
       TlsMode tlsMode = TlsSystemConfig.tlsMode;
       log.info("Server is running in TLS {} mode", tlsMode.getName());
   
       if (tlsMode != TlsMode.DISABLED) {
           try {
               SslContext newSslContext = TlsHelper.buildSslContext(false);
               SslContext oldSslContext = this.sslContext;
               this.sslContext = newSslContext;
               if (oldSslContext != null) {
                   try {
                       ReferenceCountUtil.release(oldSslContext);
                       log.info("Old SslContext released for server");
                   } catch (Exception e) {
                       log.warn("Failed to release old SslContext for server", 
e);
                   }
               }
               log.info("SslContext created for server");
           } catch (CertificateException | IOException e) {
               log.error("Failed to create SslContext for server", e);
           }
       }
   }
   ```
   
   ### File 2: 
`proxy/src/main/java/org/apache/rocketmq/proxy/grpc/ProxyAndTlsProtocolNegotiator.java`
   
   **Add import:**
   ```java
   import io.netty.util.ReferenceCountUtil;
   ```
   
   **Change field declaration (line 80):**
   ```java
   // Before:
   private static SslContext sslContext;
   
   // After:
   private static volatile SslContext sslContext;
   ```
   
   **Replace `loadSslContext()` method (lines 106–139):**
   
   ```java
   public static void loadSslContext() throws CertificateException, IOException 
{
       ProxyConfig proxyConfig = ConfigurationManager.getProxyConfig();
       SslProvider provider;
       if (OpenSsl.isAvailable()) {
           provider = SslProvider.OPENSSL;
           log.info("Using OpenSSL provider");
       } else {
           provider = SslProvider.JDK;
           log.info("Using JDK SSL provider");
       }
       SslContext newSslContext;
       if (proxyConfig.isTlsTestModeEnable()) {
           SelfSignedCertificate selfSignedCertificate = new 
SelfSignedCertificate();
           newSslContext = 
GrpcSslContexts.forServer(selfSignedCertificate.certificate(), 
selfSignedCertificate.privateKey())
               .sslProvider(provider)
               .trustManager(InsecureTrustManagerFactory.INSTANCE)
               .clientAuth(ClientAuth.NONE)
               .build();
       } else {
           String tlsCertPath = 
ConfigurationManager.getProxyConfig().getTlsCertPath();
           String tlsKeyPath = 
ConfigurationManager.getProxyConfig().getTlsKeyPath();
           String tlsKeyPassword = 
ConfigurationManager.getProxyConfig().getTlsKeyPassword();
           try (InputStream serverKeyInputStream = Files.newInputStream(
               Paths.get(tlsKeyPath));
                InputStream serverCertificateStream = Files.newInputStream(
                    Paths.get(tlsCertPath))) {
               newSslContext = 
GrpcSslContexts.forServer(serverCertificateStream,
                       serverKeyInputStream,
                       StringUtils.isNotBlank(tlsKeyPassword) ? tlsKeyPassword 
: null)
                   .trustManager(InsecureTrustManagerFactory.INSTANCE)
                   .clientAuth(ClientAuth.NONE)
                   .build();
           }
       }
       SslContext oldSslContext = sslContext;
       sslContext = newSslContext;
       if (oldSslContext != null) {
           try {
               ReferenceCountUtil.release(oldSslContext);
               log.info("Old SslContext released for proxy server");
           } catch (Exception e) {
               log.warn("Failed to release old SslContext for proxy server", e);
           }
       }
   }
   ```
   
   ## 4. Test Plan
   
   ### Unit Tests
   
   1. **`NettyRemotingServerTest.testLoadSslContextReleasesOldContext()`**
      - Create a `NettyRemotingServer` with TLS enabled.
      - Call `loadSslContext()` twice.
      - Verify the old `SslContext`'s `refCnt()` is 0 after the second call.
   
   2. 
**`ProxyAndTlsProtocolNegotiatorTest.testLoadSslContextReleasesOldContext()`**
      - Call `loadSslContext()` twice (using test mode).
      - Verify the first `SslContext`'s `refCnt()` is 0 after the second call.
   
   3. **`ProxyAndTlsProtocolNegotiatorTest.testSslContextFieldIsVolatile()`**
      - Use reflection to verify the `sslContext` field has the `volatile` 
modifier.
   
   ### Integration Tests
   
   4. **Memory leak verification:**
      - In an existing TLS integration test, invoke the hot-reload path 100 
times.
      - Assert that native memory usage (via 
`PlatformDependent.usedDirectMemory()`) does not grow linearly.
   
   ### Manual Verification
   
   5. Start broker with TLS enabled; trigger cert reload via file watcher; 
confirm:
      - New connections use new cert (via `openssl s_client`).
      - `jcmd <pid> VM.native_memory` shows stable native memory.
   
   ## 5. Backward Compatibility
   
   | Aspect | Assessment |
   |--------|------------|
   | API surface | No public API changes. `loadSslContext()` signatures 
unchanged. |
   | Behavior | Existing connections using the old `SslHandler` are NOT 
affected — `SslHandler` holds its own reference to the `SslEngine`, which is 
independent of the `SslContext` lifecycle once created. |
   | Configuration | No config changes required. |
   | Dependencies | No new dependencies — `ReferenceCountUtil` is already in 
Netty's `io.netty:netty-common` which is a transitive dependency. |
   
   ## 6. Risk Assessment
   
   | Risk | Likelihood | Mitigation |
   |------|-----------|------------|
   | Releasing `SslContext` while `newHandler()` is in flight | Low — the new 
context is assigned before the old is released; in-flight `SslHandler` 
instances hold their own refcount | Use local variable swap pattern: assign new 
first, then release old |
   | `release()` throws exception | Low — only possible if refCnt is already 0 
(double-release) | Wrapped in try-catch; logged as WARN |
   | JDK `SslContext` (non-refcounted) passed to `release()` | None — 
`ReferenceCountUtil.release()` is a no-op for non-`ReferenceCounted` objects | 
Safe by design |
   | Thread-safety of proxy static field | Low after fix — `volatile` ensures 
visibility | Field made `volatile` |
   | Regression in existing TLS handshake | Low — only lifecycle management 
changed, not handshake logic | Covered by existing TLS integration tests |
   
   **Overall risk: LOW** — The fix is minimal, well-isolated, and follows 
established Netty patterns for `ReferenceCounted` resource management.
   
   ```
   
   </details>
   
   ### Next Steps
   
   - Reply `/approve` to proceed with generating a PR
   - Reply `/revise <feedback>` to request changes to the fix approach
   - Reply `/reject` to close this proposal
   
   *This proposal will expire in 72 hours if no response is received.*
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to