zhztheplayer opened a new pull request, #13433:
URL: https://github.com/apache/iceberg/pull/13433
A patch to make the API `SparkBatch.createReaderFactory` customizable.
### Reason
Users may need to customize the Spark partition reader deeply, without going
through Iceberg's built-in `BaseReader` routine. For example, in the
Apache Gluten project we translate and send a whole Iceberg
`SparkInputPartition` to the native layer for Velox to process; the
remaining code in `BaseReader` / `BaseBatchReader` is of little help there.
`SparkBatch.createReaderFactory` turns out to be a good cut-in point
for this customization because the returned object is a Spark
`PartitionReaderFactory`, which is a stable developer API.
### The Change
(Only Spark 4.0 code is affected in this PR.)
A new Spark option is added:
```
spark.sql.iceberg.partition-reader-factory.provider
```
whose default value is:
```
org.apache.iceberg.spark.source.BaseSparkPartitionReaderFactoryProvider
```
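As a sketch of the intended usage (the provider class name below is a
hypothetical user implementation, not part of this PR), the option can be set
when building the Spark session:
```java
import org.apache.spark.sql.SparkSession;

public class CustomReaderFactoryExample {
  public static void main(String[] args) {
    SparkSession spark =
        SparkSession.builder()
            // Point Iceberg at a user-provided implementation of
            // SparkPartitionReaderFactoryProvider (must be on the classpath).
            .config(
                "spark.sql.iceberg.partition-reader-factory.provider",
                "com.example.MyPartitionReaderFactoryProvider")
            .getOrCreate();
    spark.stop();
  }
}
```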
The previous partition-reader creation logic is moved from `SparkBatch` into
the default provider implementation, `BaseSparkPartitionReaderFactoryProvider`,
so users can supply their own implementation to replace it.
`BaseSparkPartitionReaderFactoryProvider` itself can also be instantiated
programmatically on the user side for fallback purposes: if the user
implementation cannot handle the input Spark partition, it can delegate
further processing back to the default provider (see the sketch after the
conf definition below).
The `SparkPartitionReaderFactoryProvider` API looks like:
```java
public interface SparkPartitionReaderFactoryProvider {
  PartitionReaderFactory createReaderFactory(SparkPartitionReaderFactoryConf conf);
}
```
To maximize forward compatibility, the method takes a single
`SparkPartitionReaderFactoryConf` parameter rather than multiple individual
ones. `SparkPartitionReaderFactoryConf` is currently defined as:
```java
@Value.Immutable
public interface SparkPartitionReaderFactoryConf {
  SparkReadConf readConf();

  Schema expectedSchema();

  List<? extends ScanTaskGroup<?>> taskGroups();
}
```
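A minimal sketch of the fallback pattern described above, assuming the
provider interfaces live in `org.apache.iceberg.spark.source` alongside the
default implementation; `canHandle` and `createNativeReaderFactory` are
hypothetical placeholders for project-specific logic:
```java
import org.apache.iceberg.spark.source.BaseSparkPartitionReaderFactoryProvider;
import org.apache.iceberg.spark.source.SparkPartitionReaderFactoryConf;
import org.apache.iceberg.spark.source.SparkPartitionReaderFactoryProvider;
import org.apache.spark.sql.connector.read.PartitionReaderFactory;

public class MyPartitionReaderFactoryProvider implements SparkPartitionReaderFactoryProvider {
  // The default provider, kept around for delegation.
  private final SparkPartitionReaderFactoryProvider fallback =
      new BaseSparkPartitionReaderFactoryProvider();

  @Override
  public PartitionReaderFactory createReaderFactory(SparkPartitionReaderFactoryConf conf) {
    if (canHandle(conf)) {
      // Build and return the project-specific factory here, e.g. one that
      // hands the whole SparkInputPartition to a native engine.
      return createNativeReaderFactory(conf);
    }
    // Fall back to Iceberg's default reader-creation logic.
    return fallback.createReaderFactory(conf);
  }

  private boolean canHandle(SparkPartitionReaderFactoryConf conf) {
    // Placeholder: inspect conf.taskGroups() / conf.expectedSchema() to
    // decide whether the custom reader supports this scan.
    return false;
  }

  private PartitionReaderFactory createNativeReaderFactory(SparkPartitionReaderFactoryConf conf) {
    throw new UnsupportedOperationException("placeholder for the custom factory");
  }
}
```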
As development proceeds and new parameters need to be added, user
implementations of `SparkPartitionReaderFactoryProvider` won't have to change,
because new values are exposed only as additional methods on the conf class.
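For illustration only: if a later release were to add a hypothetical accessor
to the conf, a provider that reads only the original three accessors would
keep compiling and running unchanged.
```java
@Value.Immutable
public interface SparkPartitionReaderFactoryConf {
  SparkReadConf readConf();

  Schema expectedSchema();

  List<? extends ScanTaskGroup<?>> taskGroups();

  // Hypothetical future addition: providers written against the earlier
  // interface are unaffected, since they simply never call it.
  Map<String, String> extraOptions();
}
```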