davidzollo commented on issue #10669:
URL: https://github.com/apache/seatunnel/issues/10669#issuecomment-4162094666

   Thanks @knight6236 for this thorough and well-structured analysis! I've 
cross-checked the findings against the current codebase and can confirm the 
observations are accurate. Below are my thoughts:
   
   ## Confirming the Observations
   
   ### 1. `URLClassLoader.close()` is never called
   
   I verified that in `DefaultClassLoaderService.releaseClassLoader()`, when 
the reference count reaches zero, the ClassLoader is removed from the cache and 
`recycleClassLoaderFromThread()` sets the TCCL to `null` for any thread still 
holding the old loader — but `URLClassLoader.close()` is never invoked. The 
`close()` method of `DefaultClassLoaderService` itself simply clears the two 
internal maps (`classLoaderCache` and `classLoaderReferenceCount`).
   
   This means JAR file handles are not released deterministically, and 
Metaspace reclamation relies entirely on GC. For long-running Engine nodes 
running hundreds or thousands of jobs, this is a real concern.
   
   ### 2. Reflective `addURL` in `AbstractPluginDiscovery`
   
   Confirmed. The `DEFAULT_URL_TO_CLASSLOADER` BiConsumer in 
`AbstractPluginDiscovery` uses `ReflectionUtils.invoke(classLoader, "addURL", 
url)` to mutate existing ClassLoaders at runtime. This makes the effective 
classpath of a loader non-deterministic and creates potential for cross-job 
pollution when the same process handles different plugin combinations.
   
   ### 3. TCCL not restored in some connectors
   
   I found several connectors that set the TCCL without a `try-finally` 
restoration pattern. For example:
   - **Iceberg** (`IcebergCatalogLoader.loadCatalog()`) — sets TCCL but never 
restores it
   - **Paimon** (`PaimonCatalogLoader.loadCatalog()`) — same pattern
   
   Meanwhile, core engine operations (e.g., `SinkPrepareCommitOperation`, 
`SourceNoMoreElementOperation`) correctly use `try-finally` blocks. This 
inconsistency suggests there's no enforced framework-level convention — each 
module handles it independently.
   
   ### 4. No JDBC driver deregistration
   
   I found no `DriverManager.deregisterDriver()` calls in the codebase. Drivers 
registered by connector-loaded ClassLoaders will remain in the 
`DriverManager`'s registry, holding references that prevent ClassLoader GC.
   
   ## Thoughts on the Proposal
   
   The phased approach makes a lot of sense. I particularly appreciate that 
it's designed to be **non-breaking** and compatible with existing 
`classloader-cache-mode` settings.
   
   **Phase 1 (Explicit close):** This is the highest-impact, lowest-risk 
change. When `cacheMode = false` and the reference count reaches zero, calling 
`((URLClassLoader) classLoader).close()` after `recycleClassLoaderFromThread()` 
would be a straightforward improvement. For `cacheMode = true`, we'd need to be 
careful — perhaps only close when the service itself shuts down.
   
   **Phase 2 (Stabilize boundaries):** Strongly agree. The `addURL` pattern is 
especially problematic because it was already restricted in JDK 16+ (the 
`URLClassLoader.addURL()` method is no longer reflectively accessible by 
default). Resolving all URLs upfront during ClassLoader construction would be 
both cleaner and forward-compatible with modern JDKs.
   
   **Phase 3 (Consolidate TCCL/JDBC):** A framework-level utility like 
`ClassLoaderContext` (try-with-resources) would standardize the pattern across 
all connectors. This should also include paired JDBC driver deregistration 
during ClassLoader release.
   
   **Phase 4 (Verifiable reclamation):** `WeakReference` + `ReferenceQueue` 
tracking would be excellent for CI/integration tests. Exposing a metric (e.g., 
"active ClassLoader count") on the REST API would also help operators monitor 
Metaspace health in production.
   
   ## Regarding the Open Questions
   
   1. We have observed Metaspace growth in long-running Zeta Engine deployments 
with frequent job scheduling. It typically manifests after hundreds of job 
cycles, and the workaround has been periodic node restarts.
   
   2. For `classloader-cache-mode = true`, I think explicit `close()` should be 
deferred to `DefaultClassLoaderService.close()` (node/service shutdown). The 
caching contract implies the loader lives as long as the service. But we could 
further explore a TTL-based eviction strategy for idle loaders.
   
   3. There is currently no unified convention. Each connector and engine 
module handles TCCL independently. This is a gap worth closing with a 
standardized API.
   
   ---
   
   Overall, I think this is a valuable discussion and the phased improvement 
path is well-thought-out. I'd be happy to collaborate on the implementation. 
Would you be interested in starting with a PR for Phase 1?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to