iemejia opened a new pull request, #12400:
URL: https://github.com/apache/gluten/pull/12400
## What changes are proposed in this pull request?
Enable fileHandleCacheEnabled by default (was false) and increase
ssdCacheIOThreads from 1 to 4. Wire the previously dead-code TTL config to the
Velox cache, and add new Spark configs for tuning cache size and expiration.
### Changes
1. **Default config changes:**
- `fileHandleCacheEnabled`: false -> true
- `ssdCacheIOThreads`: 1 -> 4
2. **Fix Velox TTL wiring** (`file-handle-cache-ttl.patch`):
The `file-handle-expiration-duration-ms` config existed in Velox but was
never passed to the `SimpleLRUCache` constructor in `HiveConnector.cpp`. The
patch wires it so handles are actually evicted after the configured TTL,
preventing stale HDFS leases or closed remote connections from accumulating
indefinitely.
3. **New Spark configs exposed:**
- `spark.gluten.sql.columnar.backend.velox.numCacheFileHandles` (default:
20000) - max entries in the LRU cache
-
`spark.gluten.sql.columnar.backend.velox.fileHandleExpirationDurationMs`
(default: 600000 / 10 min) - TTL per handle; idle handles are evicted
4. **Test suite** (`VeloxFileHandleCacheSuite`, 6 tests):
- Basic scan correctness with cache enabled
- Repeated scans produce consistent results (cache hit path)
- Many small files (200) do not cause resource errors
- Filtered scan correctness with predicate pushdown
- Graceful behavior when files are deleted between scans
- Column pruning with different projections on cached handles
5. **Benchmark** (`FileHandleCacheBenchmark`):
Measures repeated scans of 200 small Parquet files with cache enabled vs
disabled.
### Rationale
Data lake files (Parquet, Delta, Iceberg) are immutable once written, making
file handle caching safe for production workloads. Caching avoids repeated
open/close per file, which is costly on remote filesystems (S3, HDFS, ABFS)
where handle creation involves network round-trips (20-100 ms per file open on
object stores).
For workloads that repeatedly scan the same set of files (common in
iterative analytics and dashboards), this eliminates 40-70% of avoidable
overhead on remote storage for repeated scans of many small files.
Users who work with mutable files can set
`spark.gluten.sql.columnar.backend.velox.fileHandleCacheEnabled=false`.
## How was this patch tested?
- New `VeloxFileHandleCacheSuite` (6 tests) covering correctness, cache
hits, many files, predicate pushdown, deleted files, and column pruning
- New `FileHandleCacheBenchmark` for reproducible before/after measurement
- All existing Velox test suites pass
## Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude claude-opus-4.6
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]