knight6236 opened a new issue, #10669:
URL: https://github.com/apache/seatunnel/issues/10669

   ## Background
   
   SeaTunnel already provides solid foundations for plugin loading and 
execution:
   
   - Centralized ClassLoader management via `ClassLoaderService`
   - Dynamic loading and execution of connectors
   - Certain class isolation mechanisms (e.g., child‑first loader)
   
   From a pure functionality perspective, this works well and covers most 
current use cases.
   
   However, from the perspective of **long‑running Engine nodes (repeated job 
scheduling, varying plugin combinations)**, the lifecycle management and 
isolation boundaries of ClassLoaders still exhibit some “non‑deterministic” 
behavior. I recently conducted a systematic review of the ClassLoader‑related 
code and would like to share some observations and a possible evolution path. 
Feedback and discussion are highly welcome.
   
   ---
   
   ## Gap
   
   ### 1. Semantics of “release” vs “close”
   
   `DefaultClassLoaderService.releaseClassLoader()` removes the ClassLoader 
from the cache and performs some thread‑local cleanup when the reference count 
reaches zero, but **it never calls `URLClassLoader.close()` explicitly**. The 
service’s own `close()` only clears the internal map.
   
   **Potential impacts**:
   - JAR file handles are released only when GC decides to do so
   - On Windows, JAR files may remain locked, preventing deletion or replacement
   - Metaspace reclamation is not deterministic; over time it may accumulate
   
   ### 2. Classloader boundaries can still change at runtime
   
   Several code paths still inject dependencies into the current ClassLoader 
via `addURL` at runtime, for example:
   
   - Reflection‑based `addURL` in `AbstractPluginDiscovery`
   - Plugin dependency injection into the current loader in Flink execution 
paths
   
   **Potential impacts**:
   - The actual classloader boundary depends not only on the loader structure 
but also on runtime behavior
   - When multiple jobs reuse the same process with different plugin sets, 
“historical contamination” may occur
   - Any attempt to verify unloading is weakened by this mutable boundary
   
   ### 3. Unmanaged residual references
   
   The codebase uses multiple patterns for thread‑context ClassLoaders 
(synchronous, asynchronous, cross‑thread). Some places:
   
   - Do not restore the original TCCL in a `finally` block
   - Restore the wrong baseline when switching across threads
   
   In addition, typical ClassLoader‑holding points such as JDBC driver 
registration (e.g., TDengine) or connectors that directly set TCCL without 
restoration are not yet managed in a unified way.
   
   > See the appendix for concrete code references.
   
   ---
   
   ## Goal
   
   Improve SeaTunnel for long‑running deployments in terms of:
   
   - **Controllable ClassLoader lifecycle**
   - **Stable classloading boundaries**
   - **Predictable resource reclamation and verifiability**
   
   Move from “ClassLoader works” to “ClassLoader is governable”.
   
   ---
   
   ## Proposal
   
   The following is a **progressive, non‑breaking** improvement path, which can 
be implemented in phases.
   
   ### Phase 1: Close the lifecycle explicitly
   - **Goal**: Resource release no longer depends on GC; instead it is 
triggered explicitly.
   - **Idea**: For `URLClassLoader` instances created by SeaTunnel, call 
`close()` when the reference count reaches zero.
   - **Acceptance**: When caching is off, JAR files can be deleted immediately 
after release; Metaspace does not grow with job count.
   
   ### Phase 2: Stabilize the classloading boundary
   - **Goal**: Eliminate runtime `addURL` from production paths.
   - **Idea**: Remove reflective `addURL` calls; determine the complete 
classpath before creating the loader.
   - **Acceptance**: The same loader instance has a constant classpath across 
time; no cross‑job pollution.
   
   ### Phase 3: Consolidate residual references
   - **Goal**: Unified handling of TCCL, JDBC drivers, threads, and 
ThreadLocals.
   - **Idea**:
     - Encapsulate TCCL switching with try‑with‑resources
     - Pair JDBC driver registration with deregistration
     - Clearly define thread ownership for background threads
   - **Acceptance**: All TCCL switches restore the original context; 
`DriverManager` does not hold outdated loaders; threads can be cleaned up.
   
   ### Phase 4 (optional): Enable verifiable reclamation
   - **Goal**: Provide engineering evidence that a ClassLoader has actually 
been collected.
   - **Idea**: Use `WeakReference` + `ReferenceQueue` to track loaders, or 
expose simple runtime metrics (e.g., number of live loaders).
   - **Acceptance**: After repeated job runs, old loaders can be observed to be 
reclaimed.
   
   All phases are designed to be compatible with existing configurations such 
as `classloader-cache-mode`.
   
   ---
   
   ## Non-goals
   
   This discussion does not aim to:
   
   - Introduce breaking changes (phase 1 aims for backward compatibility)
   - Provide full runtime‑equivalent validation
   - Change existing connector semantics
   - Restructure the plugin model or ClassLoader architecture
   
   ---
   
   ## Open Questions
   
   I would love to hear the community’s experience and thoughts:
   
   1. Have you encountered **Metaspace growth, JAR file locks, or Windows 
file‑replacement issues** in production?
   2. With `classloader-cache-mode = true`, would explicit `close()` conflict 
with the caching design? (I have some ideas for compatibility but want to 
understand the design intent first.)
   3. Is there already a community‑wide convention for handling TCCL and JDBC 
driver registration, or is each module currently responsible for its own 
cleanup?
   
   ---
   
   ## Appendix (Code references & diagnostics)
   
   <details>
   <summary>Click to expand: Key code locations (based on 
2.3.13-release)</summary>
   
   ### 1. Release path without explicit close
   - `DefaultClassLoaderService.releaseClassLoader()` – [line 
103‑132](https://github.com/apache/seatunnel/blob/2.3.13-release/seatunnel-engine/seatunnel-engine-core/src/main/java/org/apache/seatunnel/engine/core/classloader/DefaultClassLoaderService.java#L103-L132)
 – no `close()` call
   - `DefaultClassLoaderService.close()` – [line 
193‑197](https://github.com/apache/seatunnel/blob/2.3.13-release/seatunnel-engine/seatunnel-engine-core/src/main/java/org/apache/seatunnel/engine/core/classloader/DefaultClassLoaderService.java#L193-L197)
 – only clears map
   
   ### 2. Runtime addURL
   - `AbstractPluginDiscovery.addURL` – [line 
80‑87](https://github.com/apache/seatunnel/blob/2.3.13-release/seatunnel-plugin-discovery/src/main/java/org/apache/seatunnel/plugin/discovery/AbstractPluginDiscovery.java#L80-L87)
 – reflective `addURL`
   - `AbstractPluginDiscovery` fallback – [line 
220‑233](https://github.com/apache/seatunnel/blob/2.3.13-release/seatunnel-plugin-discovery/src/main/java/org/apache/seatunnel/plugin/discovery/AbstractPluginDiscovery.java#L220-L233)
 – `URLClassLoader` without `close`
   - Flink starter related: 
[`FlinkAbstractPluginExecuteProcessor`](https://github.com/apache/seatunnel/blob/2.3.13-release/seatunnel-core/seatunnel-flink-starter/seatunnel-flink-starter-common/src/main/java/org/apache/seatunnel/core/starter/flink/execution/FlinkAbstractPluginExecuteProcessor.java#L40-L69)
 and 
[`FlinkExecution`](https://github.com/apache/seatunnel/blob/2.3.13-release/seatunnel-core/seatunnel-flink-starter/seatunnel-flink-starter-common/src/main/java/org/apache/seatunnel/core/starter/flink/execution/FlinkExecution.java#L194-L195)
   
   ### 3. TCCL not restored / residual references
   - `TaskExecutionService` – [line 
783‑816](https://github.com/apache/seatunnel/blob/2.3.13-release/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/TaskExecutionService.java#L783-L816)
 – cooperative worker TCCL without `finally`
   - `RequestSplitOperation` – [line 
54‑63](https://github.com/apache/seatunnel/blob/2.3.13-release/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/task/operation/source/RequestSplitOperation.java#L54-L63)
 – TCCL switch without try/finally
   - `SourceRegisterOperation` – [line 
60‑86](https://github.com/apache/seatunnel/blob/2.3.13-release/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/task/operation/source/SourceRegisterOperation.java#L60-L86)
 – cross‑thread restoration wrong baseline
   - `NotifyTaskRestoreOperation` – [line 
88‑125](https://github.com/apache/seatunnel/blob/2.3.13-release/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/checkpoint/operation/NotifyTaskRestoreOperation.java#L88-L125)
 – same issue
   - `IcebergCatalogLoader` – [line 
60](https://github.com/apache/seatunnel/blob/2.3.13-release/seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/IcebergCatalogLoader.java#L60)
 – direct `setContextClassLoader` without restore
   - `PaimonCatalogLoader` – [line 
78](https://github.com/apache/seatunnel/blob/2.3.13-release/seatunnel-connectors-v2/connector-paimon/src/main/java/org/apache/seatunnel/connectors/seatunnel/paimon/catalog/PaimonCatalogLoader.java#L78)
 – same
   - `LanceCatalogLoader` – [line 
49](https://github.com/apache/seatunnel/blob/2.3.13-release/seatunnel-connectors-v2/connector-lance/src/main/java/org/apache/seatunnel/connectors/seatunnel/lance/catalog/LanceCatalogLoader.java#L49)
 – same
   - `TDengineUtil` – [line 
46‑56](https://github.com/apache/seatunnel/blob/2.3.13-release/seatunnel-connectors-v2/connector-tdengine/src/main/java/org/apache/seatunnel/connectors/seatunnel/tdengine/utils/TDengineUtil.java#L46-L56)
 – `DriverManager.registerDriver` without deregistration
   
   </details>
   
   <details>
   <summary>Click to expand: Diagnostic methods (how to detect whether a 
ClassLoader is reclaimed)</summary>
   
   - **WeakReference + ReferenceQueue**: Create a WeakReference when a loader 
is released; poll the queue to know if it has been collected.
   - **JMX**: Monitor `java.lang.ClassLoading` for loaded class count.
   - **Heap dump**: Use MAT to search for `SeaTunnelChildFirstClassLoader` 
instances and analyze GC roots.
   
   </details>
   
   ---
   
   The observations above are based on reading the code; I may have missed or 
misinterpreted something. If you have encountered similar issues in production, 
or if there are design decisions I am not aware of, please feel free to share. 
Let’s discuss! 🙌


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to