mbutrovich opened a new issue, #4456: URL: https://github.com/apache/datafusion-comet/issues/4456
## Problem `CometS3CredentialDispatcher` caches one provider instance per `(FQCN, dispatchKey, catalogProperties)` triple in `KEY_TO_HANDLE` and `INSTANCES` for the lifetime of the executor JVM. Today eviction only happens via the JVM shutdown hook. Catalogs that vend per-table credentials inject those credentials into the table's FileIO property bag, which Comet captures into `catalogProperties` at plan time (see comment in `IcebergReflection.scala` `getFileIOProperties`). The Iceberg REST + STS AssumeRole pattern (e.g. [Apache Gravitino's `S3TokenCredential`](https://gravitino.apache.org/docs/0.9.0-incubating/security/credential-vending) with default 1h TTL) is the canonical example. Each `loadTable` returns a fresh session token, so each unique table → a new `InstanceKey` → a new cached provider instance. On long-running JVMs (Spark Connect, Thrift Server) the maps grow without bound. Discussion thread: https://github.com/apache/datafusion-comet/pull/4309#discussion_r3290519239 ## Why a plain LRU is not the fix Native `CometS3CredentialBridge` instances hold the handle by value and are reused across scans. Time- or size-based eviction can invalidate a handle that a live bridge still references mid-job, surfacing as opaque credential failures from `object_store` or `opendal`. ## Proposed approach Refcounted handle lifecycle driven by the native side: - Java: add a JNI-callable `releaseHandle(long)` that decrements a refcount and, at zero, removes the entry from both maps and calls `provider.close()` (swallowing exceptions, matching the shutdown-hook precedent). - Native: wrap the handle in a struct whose `Drop` calls `releaseHandle`. The bridge itself is already shared via `Arc`, so the wrapping struct's `Drop` fires when the last `Arc` goes away. - Guard against re-entry during JVM shutdown (the existing shutdown hook already drains everything) and during panic unwind. The `ensureInitialized` path stays as-is: `computeIfAbsent` returns the existing handle and increments the refcount; first-time insertion sets count to 1. ## Acceptance criteria - `KEY_TO_HANDLE.size()` does not grow unboundedly under per-table-vended-credential workloads. - Steady-state path (same triple reused across many scans) keeps a single cached instance. - Concurrent `ensureInitialized` + `releaseHandle` against the same key does not double-free or resurrect a released entry. - Provider `close()` exceptions on release are swallowed and logged. - Existing `CometS3CredentialBridgeSuite` and dispatcher unit tests continue to pass. ## Out of scope - Driver→executor refresh of `catalogProperties` for long-running scans. That side of the AssumeRole story is addressed by composing with Spark's `HadoopDelegationTokenProvider`, documented in `docs/source/user-guide/latest/s3-credential-providers.md` and `docs/source/contributor-guide/s3-credential-provider-design.md`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
