gerashegalov opened a new issue, #16153: URL: https://github.com/apache/iceberg/issues/16153
### Feature Request / Improvement ## Problem Statement When different compute engines or hardware accelerators (e.g., GPU via RAPIDS Accelerator and CPU) read the same Iceberg table concurrently, they need different values for `read.split.target-size`. GPU readers benefit from significantly larger splits (e.g., 2GB) to saturate device memory bandwidth, while CPU readers perform better with the current default (128MB). Today there are only two ways to control `read.split.target-size`: 1. **Table property** (`read.split.target-size`) -- requires DDL (`ALTER TABLE ... SET TBLPROPERTIES`), affects all readers globally, and is unsuitable when GPU and CPU workloads hit the same table simultaneously. 2. **Read option** (`split-size`) -- requires source code changes to every `DataFrameReader` call site, which is impractical for ad-hoc SQL queries and shared notebooks. Hardware accelerators like [RAPIDS Accelerator for Apache Spark](https://nvidia.github.io/spark-rapids/) are designed as drop-in replacements that require no application code changes, so requiring per-call-site read options defeats this benefit. Neither approach allows a Spark session to declare "all my reads should use split size X" without modifying table metadata or application code. ## Proposed Solution Add a Spark **session configuration** key that overrides the table property for split size: ``` spark.sql.iceberg.split-size ``` This follows the existing pattern used by other Iceberg session configs such as `spark.sql.iceberg.vectorization.enabled` and `spark.sql.iceberg.data-planning-mode`. The resolution precedence would be: 1. Read option (`split-size`) 2. Table-scoped session configuration (`spark.sql.iceberg.split-size.<catalog>.<database>.<table>`) 3. Global session configuration (`spark.sql.iceberg.split-size`) 4. Table property (`read.split.target-size`) 5. Default (128MB) ### Usage ```sql -- GPU session: use 2GB splits for all table reads SET spark.sql.iceberg.split-size = 2147483648; -- CPU session: keep the default or set a different value SET spark.sql.iceberg.split-size = 134217728; -- Override split size for a specific table in this session SET spark.sql.iceberg.split-size.my_catalog.my_db.large_table = 1073741824; ``` No DDL or code changes needed -- each session gets its own split size, with optional per-table granularity. ### Alternative: Engine-scoped table property overrides A further extension could add an engine or hardware type qualifier to the `read.split.target-size` table property itself (e.g., `read.split.target-size.gpu`), allowing a single table to declare optimal split sizes for different hardware profiles without session-level configuration. This would let table owners encode hardware-aware defaults directly in metadata. ## Scope - Spark only (all supported shims: 3.4, 3.5, 4.0) - Files affected: `SparkSQLProperties`, `SparkReadConf`, `SparkConfParser` - Backward compatible: no behavior change unless the new session key is explicitly set ### Query engine Spark ### Willingness to contribute - [x] I can contribute this improvement/feature independently - [x] I would be willing to contribute this improvement/feature with guidance from the Iceberg community - [ ] I cannot contribute this improvement/feature at this time -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
