[ 
https://issues.apache.org/jira/browse/SPARK-40955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40955:
---------------------------------
    Summary: Allow DSV2 Predicate pushdown in FileScanBuilder.pushedDataFilter  
 (was: allow DSV2 Predicate pushdown in FileScanBuilder.pushedDataFilter )

> Allow DSV2 Predicate pushdown in FileScanBuilder.pushedDataFilter 
> ------------------------------------------------------------------
>
>                 Key: SPARK-40955
>                 URL: https://issues.apache.org/jira/browse/SPARK-40955
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output, SQL
>    Affects Versions: 3.3.1
>            Reporter: RJ Marcus
>            Priority: Major
>
> +Overall:+
> Allow {{FileScanBuilder}} to push {{Predicate}} instead of {{Filter}} for data 
> filters being pushed down to the source. This would allow new (arbitrary) DSV2 
> {{Predicates}} to be pushed down to the file source.
> Hello spark developers,
> Thank you in advance for reading. Please excuse me if I make mistakes; this 
> is my first time working on apache/spark internals. I am asking these 
> questions to better understand whether my proposed changes fall within the 
> intended scope of Data Source V2 API functionality.
> +Motivation / Background:+
> I am working on a branch in 
> [apache/incubator-sedona|https://github.com/apache/incubator-sedona] to 
> extend its support of geoparquet files to include predicate pushdown of 
> postGIS style spatial predicates (e.g. `ST_Contains()`) that can take 
> advantage of spatial info in file metadata. We would like to inherit as much 
> as possible from the Parquet classes (because geoparquet basically just adds 
> a binary geometry column). However, {{FileScanBuilder.scala}} appears to be 
> missing some functionality I need for DSV2 {{Predicates}}.
> +My understanding of the problem so far:+
> The ST_* {{Expression}} must be detected as a pushable predicate 
> (ParquetScanBuilder.scala:71) and passed as a {{pushedDataFilter}} to the 
> {{ParquetPartitionReaderFactory}}, where it is translated into a 
> (user-defined) {{FilterPredicate}}.
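To make the pushdown split concrete, here is a hypothetical, simplified model of the flow described above. None of these types ({{Expr}}, {{GreaterThan}}, {{SpatialContains}}, {{isPushableAsV1Filter}}) are Spark's; they only stand in for how a scan builder partitions incoming expressions into "pushed" and "post-scan" filters:

```scala
// Hypothetical stand-in types; not Spark's API.
sealed trait Expr
case class GreaterThan(col: String, v: Int) extends Expr
case class SpatialContains(col: String, wkt: String) extends Expr // stand-in for ST_Contains

// With a Filter-typed API the source only recognizes a closed set of shapes,
// so the spatial predicate cannot be pushed and must be evaluated post-scan.
def isPushableAsV1Filter(e: Expr): Boolean = e match {
  case _: GreaterThan     => true
  case _: SpatialContains => false // no sources.Filter can represent it
}

val (pushed, postScan) =
  Seq[Expr](GreaterThan("id", 5), SpatialContains("geom", "POLYGON((0 0,1 0,1 1,0 0))"))
    .partition(isPushableAsV1Filter)
```

The point of the sketch: under the current {{Filter}}-typed signature, the spatial predicate always lands in the post-scan bucket, even when file metadata could answer it.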
> The [Filter class is 
> sealed|https://github.com/apache/spark/blob/branch-3.3/sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala],
>  so the Sedona package can’t define new Filters. The DSV2 {{Predicate}} appears 
> to be the preferred mechanism for this (referred to as a “V2 Filter” in 
> SPARK-39966). However, {{pushedDataFilters}} in FileScanBuilder.scala is of 
> type {{sources.Filter}}.
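As a minimal illustration (not Spark's actual source) of why the sealed hierarchy blocks third-party extension: all subclasses of a sealed trait must be declared in the same file, so an external package such as Sedona cannot add a new Filter kind. The names below are made up for the example:

```scala
// Simplified stand-in for a sealed filter hierarchy.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class GreaterThan(attribute: String, value: Any) extends Filter

// In any other file, `case class StContains(...) extends Filter` would fail
// to compile: "illegal inheritance from sealed trait Filter".
val f: Filter = EqualTo("id", 1)

// The upside of sealing: consumers can match exhaustively over all cases.
val pretty = f match {
  case EqualTo(a, v)     => s"$a = $v"
  case GreaterThan(a, v) => s"$a > $v"
}
```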
> Some recent work (SPARK-39139) added the ability to detect user-defined 
> functions in {{DataSourceV2Strategy.translateFilterV2() > 
> V2ExpressionBuilder.generateExpression()}}, which I think could accomplish 
> detection correctly if {{FileScanBuilder}} called 
> {{DataSourceV2Strategy.translateFilterV2()}} instead of 
> {{DataSourceStrategy.translateFilter()}}.
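A hedged sketch of the contrast between the two translation paths mentioned above. The real entry points are {{DataSourceStrategy.translateFilter}} (producing {{sources.Filter}}) and {{DataSourceV2Strategy.translateFilterV2}} (producing a connector {{Predicate}}); the types below are simplified stand-ins, not Spark's API:

```scala
// Simplified stand-ins for Catalyst expressions and the two target types.
sealed trait CatalystExpr
case class Gt(col: String, v: Int) extends CatalystExpr
case class Udf(name: String, col: String) extends CatalystExpr // e.g. ST_Contains

sealed trait V1Filter
case class V1Gt(col: String, v: Int) extends V1Filter

// A general, named predicate, analogous to the V2 Predicate(name, children).
case class V2Predicate(name: String, args: Seq[String])

// V1-style path: closed set of Filter shapes, so a UDF translates to None
// and never reaches pushedDataFilters.
def translateFilter(e: CatalystExpr): Option[V1Filter] = e match {
  case Gt(c, v) => Some(V1Gt(c, v))
  case _: Udf   => None
}

// V2-style path: arbitrary named predicates survive translation, so the
// UDF can be offered to the source for pushdown.
def translateFilterV2(e: CatalystExpr): Option[V2Predicate] = e match {
  case Gt(c, v)  => Some(V2Predicate(">", Seq(c, v.toString)))
  case Udf(n, c) => Some(V2Predicate(n, Seq(c)))
}
```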
> However, changing {{FileScanBuilder}} to use {{Predicate}} instead of 
> {{Filter}} would require many changes to all file-based data sources. I don’t 
> want to spend effort making sweeping changes if the current behavior of Spark 
> is intentional.
>  
> +Concluding Questions:+
> Should {{FileScanBuilder}} push {{Predicate}} instead of {{Filter}} for data 
> filters being pushed down to the source? Or should this live in a new 
> {{FileScanBuilderV2}}?
> If not, how can a developer of a data source push down a new (or user 
> defined) predicate to the file source?
> Thank you again for reading. Pending feedback, I will start working on a PR 
> for this functionality.
> [~beliefer] [~cloud_fan] [~huaxingao] have worked on DSV2-related Spark 
> issues and I welcome your input. Please ignore this if I "@" you incorrectly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
