Omer Ozarslan created SPARK-34423:
-------------------------------------

             Summary: Allow FileTable.fileIndex to be reused for custom 
partition schema in DataSourceV2 read path
                 Key: SPARK-34423
                 URL: https://issues.apache.org/jira/browse/SPARK-34423
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.1
            Reporter: Omer Ozarslan


It is currently possible to provide a custom partition schema in the 
DataSourceV2 read path by implementing a custom 
PartitioningAwareFileIndex/PartitionSpec and overriding fileIndex in a subclass 
of FileTable. However, because fileIndex is a lazy val, the subclass cannot 
reuse the parent implementation (i.e. super.fileIndex is not allowed).

[https://github.com/apache/spark/blob/e0053853c90d39ef6de9d59fb933525e20bae1fa/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala#L44-L61]

Duplicating this code in the subclass is possible but somewhat hacky, e.g. the 
DataSource globbing function is private API. I was wondering whether this logic 
could be refactored into something like this:
{code:java}
def createFileIndex(): PartitioningAwareFileIndex = {
  ...[current fileIndex logic]...
}

lazy val fileIndex: PartitioningAwareFileIndex = createFileIndex()
{code}
This would allow downstream code to reuse the fileIndex logic by wrapping it in 
custom implementations.
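
To illustrate the wrapping idea, here is a minimal sketch using simplified 
stand-in types rather than Spark's real classes. FileIndexLike, 
DefaultFileIndex, BaseTable, CustomPartitionIndex and CustomTable are all 
hypothetical names introduced for this example, and marking the hook as 
protected is an assumption, not part of the proposal. The point it shows is 
that a subclass can call super.createFileIndex() because it is a def, whereas 
super.fileIndex on a lazy val is rejected by the Scala 2 compiler.

```scala
// Simplified stand-ins for illustration only; these are NOT Spark's classes.
trait FileIndexLike {
  def partitionColumns: Seq[String]
  def files: Seq[String]
}

// Plays the role of the index that the existing fileIndex logic builds.
class DefaultFileIndex(paths: Seq[String]) extends FileIndexLike {
  def partitionColumns: Seq[String] = Seq.empty
  def files: Seq[String] = paths
}

// Plays the role of FileTable after the proposed refactoring.
abstract class BaseTable(paths: Seq[String]) {
  // Proposed hook: the logic that currently lives inside the lazy val.
  // (The protected modifier is an assumption made for this sketch.)
  protected def createFileIndex(): FileIndexLike = new DefaultFileIndex(paths)

  // The lazy val simply delegates to the overridable factory method.
  lazy val fileIndex: FileIndexLike = createFileIndex()
}

// Hypothetical wrapper that supplies a custom partition schema while
// delegating file listing to the index built by the parent class.
class CustomPartitionIndex(underlying: FileIndexLike, cols: Seq[String])
    extends FileIndexLike {
  def partitionColumns: Seq[String] = cols
  def files: Seq[String] = underlying.files
}

class CustomTable(paths: Seq[String]) extends BaseTable(paths) {
  // super.createFileIndex() compiles because createFileIndex is a def.
  override protected def createFileIndex(): FileIndexLike =
    new CustomPartitionIndex(super.createFileIndex(), Seq("year", "month"))
}

object Demo {
  def main(args: Array[String]): Unit = {
    val table = new CustomTable(Seq("/data/a.parquet", "/data/b.parquet"))
    println(table.fileIndex.partitionColumns)
    println(table.fileIndex.files)
  }
}
```

With this shape the subclass never touches the globbing logic itself; it only 
decorates whatever index the parent builds.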

(Note that this proposed change covers custom partition schemas in the read 
path only. The write path is out of scope.)


