[ https://issues.apache.org/jira/browse/SPARK-34423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283107#comment-17283107 ]

Omer Ozarslan commented on SPARK-34423:
---------------------------------------

If this sounds good, I'd be happy to submit a PR. Thanks.

> Allow FileTable.fileIndex to be reused for custom partition schema in 
> DataSourceV2 read path
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34423
>                 URL: https://issues.apache.org/jira/browse/SPARK-34423
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.1
>            Reporter: Omer Ozarslan
>            Priority: Minor
>
> It is currently possible to provide a custom partition schema in the 
> DataSourceV2 read path with custom implementations of 
> PartitioningAwareFileIndex/PartitionSpec and by overriding fileIndex in a 
> subclass of FileTable. However, since fileIndex is a lazy val, it cannot be 
> reused from the subclass (i.e. super.fileIndex does not compile).
> [https://github.com/apache/spark/blob/e0053853c90d39ef6de9d59fb933525e20bae1fa/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala#L44-L61]
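> For reference, the linked lazy val boils down to roughly the following (a 
> condensed sketch of the linked revision, not the exact code; imports and 
> some details are elided):
> {code:java}
> lazy val fileIndex: PartitioningAwareFileIndex = {
>   val caseSensitiveMap = options.asCaseSensitiveMap.asScala.toMap
>   // Hadoop Configurations are case sensitive.
>   val hadoopConf = sparkSession.sessionState.newHadoopConfWithOptions(caseSensitiveMap)
>   if (FileStreamSink.hasMetadata(paths, hadoopConf, sparkSession.sessionState.conf)) {
>     // Reading the results of a streaming query: use the metadata log.
>     new MetadataLogFileIndex(sparkSession, new Path(paths.head),
>       options.asScala.toMap, userSpecifiedSchema)
>   } else {
>     // Non-streaming file source: glob the paths (a private DataSource API)
>     // and build an InMemoryFileIndex over the resulting root paths.
>     val rootPathsSpecified = DataSource.checkAndGlobPathIfNecessary(paths, hadoopConf,
>       checkEmptyGlobPath = true, checkFilesExist = true)
>     val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)
>     new InMemoryFileIndex(sparkSession, rootPathsSpecified, caseSensitiveMap,
>       userSpecifiedSchema, fileStatusCache)
>   }
> }
> {code}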
> Duplicating this code in the subclass is possible but somewhat hacky; e.g. 
> the DataSource globbing function it relies on is private API. I was 
> wondering whether this logic could be refactored into something like this:
> {code:java}
> def createFileIndex(): PartitioningAwareFileIndex = {
>   ...[current fileIndex logic]...
> }
> 
> lazy val fileIndex: PartitioningAwareFileIndex = createFileIndex()
> {code}
> This would allow reusing the fileIndex logic downstream by wrapping it with 
> custom implementations, e.g. as sketched below.
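> As a purely hypothetical illustration (CustomTable and 
> CustomPartitioningFileIndex are invented names, not Spark API, and the 
> constructor arguments are elided), a subclass could then wrap the base 
> index instead of rebuilding it:
> {code:java}
> class CustomTable(...) extends FileTable(...) {
>   // Reuse FileTable's index construction and layer the custom
>   // partition schema on top for the read path.
>   override def createFileIndex(): PartitioningAwareFileIndex =
>     new CustomPartitioningFileIndex(super.createFileIndex(), customPartitionSchema)
> }
> {code}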
> (Note that this proposal covers a custom partition schema in the read path 
> only; the write path is out of scope.)



