gerashegalov opened a new issue, #16153:
URL: https://github.com/apache/iceberg/issues/16153

   ### Feature Request / Improvement
   
   ## Problem Statement
   
   When different compute engines or hardware accelerators (e.g., GPU via 
RAPIDS Accelerator and CPU) read the same Iceberg table concurrently, they need 
different values for `read.split.target-size`. GPU readers benefit from 
significantly larger splits (e.g., 2GB) to saturate device memory bandwidth, 
while CPU readers perform better with the current default (128MB).
   
   Today there are only two ways to control `read.split.target-size`:
   
   1. **Table property** (`read.split.target-size`) -- requires DDL (`ALTER 
TABLE ... SET TBLPROPERTIES`), affects all readers globally, and is unsuitable 
when GPU and CPU workloads hit the same table simultaneously.
   2. **Read option** (`split-size`) -- requires source code changes to every 
`DataFrameReader` call site, which is impractical for ad-hoc SQL queries and 
shared notebooks. Hardware accelerators like [RAPIDS Accelerator for Apache 
Spark](https://nvidia.github.io/spark-rapids/) are designed as drop-in 
replacements that require no application code changes, so requiring 
per-call-site read options defeats this benefit.
   
   Neither approach allows a Spark session to declare "all my reads should use 
split size X" without modifying table metadata or application code.
   
   ## Proposed Solution
   
   Add a Spark **session configuration** key that overrides the table property 
for split size:
   
   ```
   spark.sql.iceberg.split-size
   ```
   
   This follows the existing pattern used by other Iceberg session configs such 
as `spark.sql.iceberg.vectorization.enabled` and 
`spark.sql.iceberg.data-planning-mode`.
   
   The resolution precedence, from highest to lowest, would be:
   
   1. Read option (`split-size`)
   2. Table-scoped session configuration 
(`spark.sql.iceberg.split-size.<catalog>.<database>.<table>`)
   3. Global session configuration (`spark.sql.iceberg.split-size`)
   4. Table property (`read.split.target-size`)
   5. Default (128MB)
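
The precedence order above can be sketched as a simple first-match lookup. This is only an illustration of the proposed ordering, not Iceberg's actual implementation: the function `resolve_split_size` and its plain-dict arguments are hypothetical stand-ins for whatever `SparkReadConf` and `SparkConfParser` would do.

```python
from typing import Mapping

# 128MB default, as stated in the issue text.
DEFAULT_SPLIT_SIZE = 128 * 1024 * 1024


def resolve_split_size(
    read_options: Mapping[str, str],
    session_conf: Mapping[str, str],
    table_props: Mapping[str, str],
    table_ident: str,
) -> int:
    """Return the first split size found, checking sources in precedence order.

    Hypothetical helper: all four arguments are plain mappings used for
    illustration, and table_ident is "<catalog>.<database>.<table>".
    """
    candidates = [
        read_options.get("split-size"),                                   # 1. read option
        session_conf.get(f"spark.sql.iceberg.split-size.{table_ident}"),  # 2. table-scoped session conf
        session_conf.get("spark.sql.iceberg.split-size"),                 # 3. global session conf
        table_props.get("read.split.target-size"),                        # 4. table property
    ]
    for value in candidates:
        if value is not None:
            return int(value)
    return DEFAULT_SPLIT_SIZE                                             # 5. default
```

With this ordering, a session that sets only the global key overrides the table property, while an explicit read option still wins over everything else.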
   
   ### Usage
   
   ```sql
   -- GPU session: use 2GB splits for all table reads
   SET spark.sql.iceberg.split-size = 2147483648;
   
   -- CPU session: keep the default or set a different value
   SET spark.sql.iceberg.split-size = 134217728;
   
   -- Override split size for a specific table in this session
   SET spark.sql.iceberg.split-size.my_catalog.my_db.large_table = 1073741824;
   ```
   
   No DDL or code changes needed -- each session gets its own split size, with 
optional per-table granularity.
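
For reference, the raw byte values in the `SET` statements above are plain powers-of-two multiples. The short sketch below derives them; the dictionary keys are labels chosen here for illustration only.

```python
MB = 1024 ** 2
GB = 1024 ** 3

# Byte values used in the usage example, keyed by illustrative labels.
split_sizes = {
    "gpu_session": 2 * GB,           # 2147483648
    "cpu_session": 128 * MB,         # 134217728
    "large_table_override": 1 * GB,  # 1073741824
}

for name, size in split_sizes.items():
    print(f"{name}: {size}")

# 2GB is one past the largest signed 32-bit integer.
INT32_MAX = 2 ** 31 - 1
print(split_sizes["gpu_session"] > INT32_MAX)
```

Because 2GB (2147483648) is one more than the largest signed 32-bit integer, the value has to be handled as a 64-bit number wherever it is parsed.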
   
   ### Alternative: Engine-scoped table property overrides
   
   A further extension could add an engine or hardware type qualifier to the 
`read.split.target-size` table property itself (e.g., 
`read.split.target-size.gpu`), allowing a single table to declare optimal split 
sizes for different hardware profiles without session-level configuration. This 
would let table owners encode hardware-aware defaults directly in metadata.
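
A rough sketch of how such a qualified property might be looked up, assuming a fallback to the unqualified key when no engine-specific value exists. The function `engine_split_size`, the `engine` argument, and the fallback behavior are assumptions made here for illustration; the issue does not specify them.

```python
from typing import Mapping, Optional

BASE_KEY = "read.split.target-size"
DEFAULT_SPLIT_SIZE = 128 * 1024 * 1024  # 128MB default from the issue text


def engine_split_size(table_props: Mapping[str, str], engine: Optional[str] = None) -> int:
    """Prefer an engine-qualified table property, else the plain one, else the default.

    Hypothetical helper: `engine` would be a qualifier such as "gpu".
    """
    if engine is not None:
        qualified = table_props.get(f"{BASE_KEY}.{engine}")
        if qualified is not None:
            return int(qualified)
    base = table_props.get(BASE_KEY)
    return int(base) if base is not None else DEFAULT_SPLIT_SIZE
```

Under this sketch, a table with both `read.split.target-size` and `read.split.target-size.gpu` set would serve the larger value only to readers that identify as `gpu`.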
   
   ## Scope
   
   - Spark only (all supported versions: 3.4, 3.5, 4.0)
   - Files affected: `SparkSQLProperties`, `SparkReadConf`, `SparkConfParser`
   - Backward compatible: no behavior change unless the new session key is 
explicitly set
   
   
   ### Query engine
   
   Spark
   
   ### Willingness to contribute
   
   - [x] I can contribute this improvement/feature independently
   - [x] I would be willing to contribute this improvement/feature with 
guidance from the Iceberg community
   - [ ] I cannot contribute this improvement/feature at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

