guowangy opened a new pull request, #11896: URL: https://github.com/apache/gluten/pull/11896
## What changes are proposed in this pull request?

Fix SIGSEGV on `CPUThreadPool` threads during HDFS scan, caused by `DetachCurrentThread` poisoning `libhdfs.so`'s TLS-cached `JNIEnv*`.

Fixes #11895

### Root cause

`libhdfs.so` caches the `JNIEnv*` per thread in a two-level TLS structure:

- **Fast path**: `static __thread ThreadLocalState *quickTlsEnv` (ELF linker-initialized, zero-cost read, **no re-validation**)
- **Slow path**: `pthread_getspecific(gTlsKey)` (mutex-protected, taken only on the first call per thread)

After the first `AttachCurrentThread` on a `CPUThreadPool` thread, libhdfs caches the `JNIEnv*` in `quickTlsEnv`. The fast path returns this pointer on all subsequent calls without checking its validity.

Two Gluten destructors called `vm_->DetachCurrentThread()` unconditionally after JNI cleanup:

- `JniColumnarBatchIterator::~JniColumnarBatchIterator()` (`cpp/core/jni/JniCommon.cc`)
- `JavaInputStreamAdaptor::Close()` (`cpp/core/jni/JniWrapper.cc`)

Both used `attachCurrentThreadAsDaemonOrThrow()` followed by an unconditional `DetachCurrentThread()`. This was not a proper attach/detach pair: `attachCurrentThreadAsDaemonOrThrow` attaches only if the thread is not already attached, but `DetachCurrentThread` ran regardless, detaching threads that libhdfs had attached. This freed the JVM's `JavaThread` object while `quickTlsEnv` still held the stale pointer.

The crash sequence:

1. `CPUThreadPool21` runs a preload task → libhdfs calls `AttachCurrentThread` and caches the `JNIEnv*` in `quickTlsEnv`
2. Object cleanup runs on the same thread → a Gluten destructor calls `DetachCurrentThread` → the `JavaThread` is freed, but `quickTlsEnv` still holds the stale pointer
3. `CPUThreadPool21` runs the next preload task → `hdfsGetPathInfo()` → the libhdfs fast path returns the stale env → `jni_NewStringUTF(stale_env)` → **SIGSEGV**

`CPUThreadPool` threads live for the entire executor JVM lifetime (created once in `VeloxBackend::initConnector`, destroyed only in `VeloxBackend::tearDown`), so the stale pointer persists across all subsequent queries.

Confirmed by core dump analysis (`core.CPUThreadPool21.1770392` from TPC-DS on YARN):

- `RDI = 0x0` (NULL `JavaThread*`, set by `block_if_vm_exited`)
- `R12 = 0x7f3003a52200` (stale `JNIEnv*` from the libhdfs TLS cache)
- Memory at the stale env holds JVM method-resolution data (reused memory), not a valid JNI function table

### Fix

Remove `DetachCurrentThread()` from both destructors, based on the following considerations:

1. **Broken ownership**: the original code detached threads it did not attach; `attachCurrentThreadAsDaemonOrThrow` is conditional (a no-op if already attached), but `DetachCurrentThread` was unconditional
2. **Daemon threads** (`AttachCurrentThreadAsDaemon`) do not block JVM shutdown
3. **`folly::ThreadLocal<HdfsFile>` destructors** call `hdfsCloseFile()` (JNI) during thread exit; detaching mid-lifetime would crash these too
4. **libhdfs's `pthread_key` destructor** (`hdfsThreadDestructor`) calls `DetachCurrentThread` when the thread actually exits, so proper cleanup still happens at thread exit when libhdfs is in use
5. **Non-HDFS paths** (S3, GCS, local): libhdfs is absent, no `pthread_key` is registered, and daemon-attached threads are safely reclaimed when the executor JVM exits

## How was this patch tested?

- `JniThreadDetachTest.testIteratorDestructorDoesNotDetachThread`: on a native `std::thread` (simulating `CPUThreadPool`), saves the `JNIEnv*` (simulating the libhdfs TLS cache), creates and destroys a real `JniColumnarBatchIterator`, then reuses the saved env for `FindClass` (simulating libhdfs's next `hdfsGetPathInfo`). **With the bug, SIGSEGV crashes the JVM; with the fix, it succeeds normally.**
- Existing surefire tests pass (`mvn surefire:test -pl backends-velox`)
- Full TPC-DS benchmark on HDFS: no more `CPUThreadPool` SIGSEGV

## Was this patch authored or co-authored using generative AI tooling?

Co-authored-by: Claude Opus 4.6
