jackylee-ch opened a new pull request, #12100:
URL: https://github.com/apache/gluten/pull/12100
## What changes are proposed in this pull request?
When loading `libgluten.dylib` on macOS arm64, the JVM aborts during the
`System.loadLibrary` call with:
```
ERROR: flag 'flagfile' was defined more than once
(in files '.../gflags.cc' and '.../gflags.cc')
... is being linked both statically and dynamically
```
The root cause is dyld weak-symbol coalescing across two dylibs that each
contain their own copy of gflags:
| Dylib | gflags origin |
|-------|---------------|
| `libvelox.dylib` | static `libgflags.a` baked in via Folly (Velox builds Folly with `-DGFLAGS_SHARED=FALSE`) |
| `libgluten.dylib` | dynamic `libgflags.dylib` pulled transitively through `glog::glog` / `Folly::folly` `INTERFACE_LINK_LIBRARIES` |
On macOS, dyld coalesces the weak C++ function-local-static guard inside
`FlagRegistry::GlobalRegistry()` between the two dylibs. Both copies then
register `--flagfile` against the same registry and gflags' duplicate-flag
check aborts the process before any user code runs.
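
The failure mode described above can be modelled with a toy registry. This is only a sketch of the outcome, not gflags' actual code: `ToyFlagRegistry`, `sharedRegistry`, `loadGflagsCopy`, and `loadTwoCopies` are hypothetical names, and the sketch throws where real gflags aborts the process.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for gflags' flag registry; not the real class.
class ToyFlagRegistry {
 public:
  // Throws on a duplicate name, where real gflags would abort the process.
  void registerFlag(const std::string& name) {
    if (!names_.insert(name).second) {
      throw std::runtime_error("flag '" + name + "' was defined more than once");
    }
  }
  std::size_t size() const { return names_.size(); }

 private:
  std::set<std::string> names_;
};

// Models the coalesced case: every caller resolves to this one instance,
// the way both gflags copies end up sharing a single registry.
inline ToyFlagRegistry& sharedRegistry() {
  static ToyFlagRegistry registry;  // function-local static
  return registry;
}

// Each "copy" of gflags registers its built-in flag when it is loaded.
inline void loadGflagsCopy(ToyFlagRegistry& registry) {
  registry.registerFlag("flagfile");
}

// Returns true if two copies can load against the given registries.
inline bool loadTwoCopies(ToyFlagRegistry& first, ToyFlagRegistry& second) {
  try {
    loadGflagsCopy(first);
    loadGflagsCopy(second);
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

With two separate registries both loads succeed; when both arguments are `sharedRegistry()`, the second registration of `flagfile` is rejected, which corresponds to the abort seen at load time.
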
Linux is unaffected because (a) ELF does not coalesce weak symbols across
shared objects by default, and (b) Gluten already uses `symbols.map` to
control the export surface of `libgluten.so`. macOS has no version-script
equivalent, so this PR uses a different mechanism. All Darwin-specific
logic is gated on `APPLE` / `CMAKE_SYSTEM_NAME STREQUAL "Darwin"`; Linux
and Windows build and link semantics are untouched.
The fix has five parts that all need to be in place to fully eliminate the
abort across the production load path *and* the test executables:
1. **`cpp/CMake/Findglog.cmake`** — On Darwin, prefer the static
`libglog.a` and force `gflags_component=static`. When both archives
are available we replace the imported `google::glog` target with an
`INTERFACE IMPORTED` target whose `INTERFACE_LINK_OPTIONS` carry
`LINKER:-load_hidden,<libglog.a>` and
`LINKER:-load_hidden,<libgflags.a>`. `-load_hidden` is the Apple ld64
flag that gives every symbol pulled from the archive *hidden*
visibility, which prevents dyld from coalescing them across dylibs.
We resolve the static gflags archive path by inspecting
`IMPORTED_LOCATION_RELEASE / _NOCONFIG / *` on
`gflags::gflags_static`.
2. **`cpp/core/utils/GflagsStubDarwin.cc` (new)** — Exports a no-op
`google::HandleCommandLineHelpFlags` with default visibility. Velox's
archive of gflags pulls `gflags.cc.o` but never references
`gflags_reporting.cc.o`, so once `-load_hidden` makes the real copy
invisible, the dynamic linker would fail to resolve this symbol at
dlopen time. The stub resolves it from `libgluten.dylib` instead.
3. **`cpp/core/CMakeLists.txt`** — Conditionally adds the stub to the
`gluten` target on `APPLE`.
4. **`cpp/velox/CMakeLists.txt`** — On Darwin, links `google::glog` as
`PUBLIC` on the `velox` target so its `INTERFACE_LINK_OPTIONS`
propagate through `libvelox.dylib` to test binaries and benchmarks.
The default PRIVATE linkage on `gluten` is intentional for Linux
(`symbols.map` handles it), but on Darwin `Folly::folly`'s
`INTERFACE_LINK_LIBRARIES` pulls `libgflags.a` into `libvelox.dylib`
and any test executables with default visibility, reviving the same
dual-registration abort at test startup.
5. **`cpp/velox/compute/VeloxBackend.cc`** — Guards
`google::InitGoogleLogging` with `IsGoogleLoggingInitialized()` and
makes `VeloxBackend::create()` idempotent. Multi-suite gtest binaries
on macOS re-enter `VeloxBackend::init` from each `SetUpTestSuite`;
without these guards, the second entry trips glog's
`"You called InitGoogleLogging() twice!"` check and Gluten's
`Registry "Required object already registered"` check.
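
The guard pattern from part 5 can be sketched as follows. This is a minimal, single-threaded illustration and not the code in `VeloxBackend.cc`: the `toy` namespace, `loggingInitialized`, `initLogging`, and `Backend` are hypothetical stand-ins for glog and the Gluten backend.

```cpp
#include <cassert>
#include <stdexcept>

namespace toy {

// Hypothetical stand-in for a logging library that rejects double init.
inline bool& loggingInitialized() {
  static bool initialized = false;
  return initialized;
}

inline void initLogging() {
  if (loggingInitialized()) {
    throw std::logic_error("initLogging() called twice");
  }
  loggingInitialized() = true;
}

// Hypothetical backend whose create() is safe to call repeatedly.
class Backend {
 public:
  static Backend& create() {
    // Guard: only initialize logging if nobody has done so yet.
    if (!loggingInitialized()) {
      initLogging();
    }
    // Constructed on the first call; later calls return the same object.
    static Backend instance;
    return instance;
  }

 private:
  Backend() = default;
};

}  // namespace toy
```

Calling `toy::Backend::create()` from several test suites returns the same instance each time and never reaches the double-init check. A real implementation that can be entered from multiple threads would also need to synchronize the check on the logging state.
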
## How was this patch tested?
Built on macOS 14 arm64 with Apple Clang 17 and the Homebrew toolchain.
**Symbol audit (after the fix):**
```
$ nm -g libvelox.dylib | grep "google.*ParseCommandLine"
(empty)
$ nm libvelox.dylib | awk '/FlagRegistry/ {print $2}' | sort | uniq -c
3 b
21 t
```
All `FlagRegistry` symbols are lowercase (`t` = local text, `b` = local
bss); none are exported across the dylib boundary, so dyld has nothing
to coalesce.
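
The same audit could be automated over captured `nm` output. The helper below is a sketch under stated assumptions: `countExternalSymbols` is a hypothetical name, it assumes the usual `<address> <type> <name>` line layout, and it treats an uppercase type letter other than `U` (undefined) as an exported definition.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Counts defined, external symbols in `nm` output whose name contains
// `needle`. Lowercase type letters (local symbols) and `U` (undefined
// references) are not counted.
inline int countExternalSymbols(const std::string& nmOutput,
                                const std::string& needle) {
  std::istringstream lines(nmOutput);
  std::string line;
  int external = 0;
  while (std::getline(lines, line)) {
    std::istringstream fields(line);
    std::vector<std::string> tokens;
    for (std::string token; fields >> token;) {
      tokens.push_back(token);
    }
    if (tokens.size() < 2) {
      continue;  // not a symbol line
    }
    // The type letter is the second-to-last field, the name is the last.
    const std::string& type = tokens[tokens.size() - 2];
    const std::string& name = tokens.back();
    if (type.size() != 1 || name.find(needle) == std::string::npos) {
      continue;
    }
    const unsigned char letter = static_cast<unsigned char>(type[0]);
    if (std::isupper(letter) && type[0] != 'U') {
      ++external;
    }
  }
  return external;
}
```

A result of `0` for the needle `FlagRegistry` corresponds to the audit outcome shown above, where every matching symbol is `t` or `b`.
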
**Behavioral validation:**
- Before the fix, `dlopen("libgluten.dylib")` aborts before any test
reaches `main()`.
- After the fix, `cpp/build/velox/tests/velox_shuffle_writer_test` runs
5436 / 5436 cases cleanly on macOS 14 arm64.
- Spark 3.5 + Velox backend Java JUnit canaries (the JNI-only suites
that exercise native load without query execution) all pass on macOS
arm64:
- `org.apache.gluten.utils.VeloxBloomFilterTest`
- `org.apache.gluten.columnarbatch.ColumnarBatchTest`
- `org.apache.gluten.backendsapi.VeloxListenerApiTest`
- `org.apache.gluten.fs.OnHeapFileSystemTest`
- `org.apache.gluten.vectorized.ArrowColumnVectorTest`
- Full ctest of `cpp/build` reports 5574 / 5585 passing. The 11 failures
are upstream Velox issues exposed by the recent `dft-2026_05_13` bump
(HYPERLOGLOG cast registration tightening, `Type::equivalent()`
regression on identically-printed ROW types) and are not caused by
this PR.
**Linux:**
- Linux x86_64 build green; all changes are gated behind `APPLE` /
Darwin checks, so no behavioral change on Linux is expected. Local
Ubuntu build verified clean.
## Was this patch authored or co-authored using generative AI tooling?
Co-authored with Claude (Sonnet/Opus) via Claude Code 1.x.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]