gerashegalov opened a new pull request, #16154: URL: https://github.com/apache/iceberg/pull/16154
Closes #16153

### What changes were made in this PR?

Add a new Spark session configuration key `spark.sql.iceberg.split-size` that allows overriding the `read.split.target-size` table property at the session level without requiring DDL changes to table metadata or source code changes to read call sites.

This is particularly useful when GPU and CPU workloads read the same Iceberg table concurrently: GPU sessions benefit from significantly larger splits (e.g. 2GB) while CPU sessions perform better with the default 128MB. Hardware accelerators like [RAPIDS Accelerator for Apache Spark](https://nvidia.github.io/spark-rapids/) are designed as drop-in replacements requiring no application code changes, so a session-level knob is essential.

### Changes

**All Spark shims (v3.4, v3.5, v4.0):**
- `SparkSQLProperties`: add `SPLIT_SIZE = "spark.sql.iceberg.split-size"` constant
- `SparkReadConf`: add `.sessionConf(SparkSQLProperties.SPLIT_SIZE)` to both `splitSize()` and `splitSizeOption()` parser chains; update Javadoc to document 5-level precedence
- `SparkConfParser`: store `Table.name()` as `tableName` and in `ConfParser.parse()` try a table-qualified session key (`<key>.<tableName>`) before the global session key

**v3.5 only:**
- `TestSparkWriteConf`: add 4 tests for table-scoped session conf resolution

### Resolution precedence

1. Read option (`split-size`)
2. Table-scoped session conf (`spark.sql.iceberg.split-size.<catalog>.<db>.<table>`)
3. Global session conf (`spark.sql.iceberg.split-size`)
4. Table property (`read.split.target-size`)
5. Default (128MB)

### How was this patch tested?

4 new unit tests in `TestSparkWriteConf` (v3.5):
- table-scoped session key takes precedence over global
- global session key works when no table-scoped key is set
- read option takes precedence over table-scoped session key
- table-scoped session key takes precedence over table property
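The five-level resolution order described in this PR can be modeled in a few lines. The Python sketch below is illustrative only and is not the code in this PR, which changes the Java `SparkConfParser`. The function name `resolve_split_size`, the plain-dict inputs, and the example table name `cat.db.events` are hypothetical, and all values are assumed to be byte counts.

```python
from typing import Mapping

# Keys taken from the PR description.
SPLIT_SIZE_OPTION = "split-size"                     # per-read option
SPLIT_SIZE_SESSION = "spark.sql.iceberg.split-size"  # session key proposed in this PR
SPLIT_SIZE_TABLE_PROP = "read.split.target-size"     # table property
SPLIT_SIZE_DEFAULT = 128 * 1024 * 1024               # 128MB default


def resolve_split_size(
    table_name: str,
    read_options: Mapping[str, str],
    session_conf: Mapping[str, str],
    table_props: Mapping[str, str],
) -> int:
    """Hypothetical model of the precedence list; returns a size in bytes."""
    candidates = [
        read_options.get(SPLIT_SIZE_OPTION),                     # 1. read option
        session_conf.get(f"{SPLIT_SIZE_SESSION}.{table_name}"),  # 2. table-scoped session conf
        session_conf.get(SPLIT_SIZE_SESSION),                    # 3. global session conf
        table_props.get(SPLIT_SIZE_TABLE_PROP),                  # 4. table property
    ]
    for value in candidates:
        if value is not None:
            return int(value)
    return SPLIT_SIZE_DEFAULT                                    # 5. default


# Example: a GPU session sets a 2GB global split size and a 512MB
# override for one table (assumed name "cat.db.events").
gpu_session = {
    SPLIT_SIZE_SESSION: str(2 * 1024**3),
    f"{SPLIT_SIZE_SESSION}.cat.db.events": str(512 * 1024**2),
}
table_props = {SPLIT_SIZE_TABLE_PROP: str(256 * 1024**2)}

scoped = resolve_split_size("cat.db.events", {}, gpu_session, table_props)
global_only = resolve_split_size("cat.db.other", {}, gpu_session, table_props)
```

In this model, `scoped` resolves through the table-scoped key and `global_only` falls through to the global session key, mirroring the first two test cases listed in the PR; a CPU session with an empty session conf would fall back to the table property or the default.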
