zhztheplayer opened a new pull request, #13433:
URL: https://github.com/apache/iceberg/pull/13433
A patch to make the API `SparkBatch.createReaderFactory` customizable.
### Reason
Users may need to customize the Spark partition reader deeply, without going
through Iceberg's built-in `BaseReader` routine. For example, in the
Apache Gluten project we translate and send a whole Iceberg
`SparkInputPartition` to the native layer for Velox to process; the
remaining code in `BaseReader` / `BaseBatchReader` is of little help there.
`SparkBatch.createReaderFactory` turns out to be a good cut-in point
for this customization because the returned object is a Spark
`PartitionReaderFactory`, which is a stable developer API.
### The Change
(Only Spark 4.0 code is affected in this PR.)
A new Spark option is added:
```
spark.sql.iceberg.partition-reader-factory.provider
```
whose default value is:
```
org.apache.iceberg.spark.source.BaseSparkPartitionReaderFactoryProvider
```
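As a sketch of the intended usage (the provider class name below is a
hypothetical user implementation, not part of this PR), the option can be set
when building the Spark session:
```java
import org.apache.spark.sql.SparkSession;

public class CustomReaderFactoryExample {
  public static void main(String[] args) {
    SparkSession spark =
        SparkSession.builder()
            // Point Iceberg at a user-provided implementation of
            // SparkPartitionReaderFactoryProvider (must be on the classpath).
            .config(
                "spark.sql.iceberg.partition-reader-factory.provider",
                "com.example.MyPartitionReaderFactoryProvider")
            .getOrCreate();
    spark.stop();
  }
}
```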
The previous partition-reader creation logic is moved from `SparkBatch` into
the default provider implementation, `BaseSparkPartitionReaderFactoryProvider`,
so users can supply their own implementation to replace it.
`BaseSparkPartitionReaderFactoryProvider` itself can also be instantiated
programmatically on the user side for fallback purposes: if the user
implementation cannot handle the input Spark partition, it can delegate
further processing back to the default provider (see the sketch after the
conf definition below).
The `SparkPartitionReaderFactoryProvider` API looks like:
```java
public interface SparkPartitionReaderFactoryProvider {
  PartitionReaderFactory createReaderFactory(SparkPartitionReaderFactoryConf conf);
}
```
To maximize forward compatibility, the method takes a single
`SparkPartitionReaderFactoryConf` parameter rather than multiple individual
ones. `SparkPartitionReaderFactoryConf` is currently defined as:
```java
@Value.Immutable
public interface SparkPartitionReaderFactoryConf {
  SparkReadConf readConf();

  Schema expectedSchema();

  List<? extends ScanTaskGroup<?>> taskGroups();
}
```
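A minimal sketch of the fallback pattern described above, assuming the
provider interfaces live in `org.apache.iceberg.spark.source` alongside the
default implementation; `canHandle` and `createNativeReaderFactory` are
hypothetical placeholders for project-specific logic:
```java
import org.apache.iceberg.spark.source.BaseSparkPartitionReaderFactoryProvider;
import org.apache.iceberg.spark.source.SparkPartitionReaderFactoryConf;
import org.apache.iceberg.spark.source.SparkPartitionReaderFactoryProvider;
import org.apache.spark.sql.connector.read.PartitionReaderFactory;

public class MyPartitionReaderFactoryProvider implements SparkPartitionReaderFactoryProvider {
  // The default provider, kept around for delegation.
  private final SparkPartitionReaderFactoryProvider fallback =
      new BaseSparkPartitionReaderFactoryProvider();

  @Override
  public PartitionReaderFactory createReaderFactory(SparkPartitionReaderFactoryConf conf) {
    if (canHandle(conf)) {
      // Build and return the project-specific factory here, e.g. one that
      // hands the whole SparkInputPartition to a native engine.
      return createNativeReaderFactory(conf);
    }
    // Fall back to Iceberg's default reader-creation logic.
    return fallback.createReaderFactory(conf);
  }

  private boolean canHandle(SparkPartitionReaderFactoryConf conf) {
    // Placeholder: inspect conf.taskGroups() / conf.expectedSchema() to
    // decide whether the custom reader supports this scan.
    return false;
  }

  private PartitionReaderFactory createNativeReaderFactory(SparkPartitionReaderFactoryConf conf) {
    throw new UnsupportedOperationException("placeholder for the custom factory");
  }
}
```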
As development proceeds and new parameters need to be added, user
implementations of `SparkPartitionReaderFactoryProvider` won't have to change,
because new values are exposed only as additional methods on the conf class.
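For illustration only: if a later release were to add a hypothetical accessor
to the conf, a provider that reads only the original three accessors would
keep compiling and running unchanged.
```java
@Value.Immutable
public interface SparkPartitionReaderFactoryConf {
  SparkReadConf readConf();

  Schema expectedSchema();

  List<? extends ScanTaskGroup<?>> taskGroups();

  // Hypothetical future addition: providers written against the earlier
  // interface are unaffected, since they simply never call it.
  Map<String, String> extraOptions();
}
```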