[ https://issues.apache.org/jira/browse/SPARK-40955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

RJ Marcus updated SPARK-40955:
------------------------------
    Description: 
+Overall:+


Allow {{FileScanBuilder}} to push {{Predicate}} instead of {{Filter}} for the data filters pushed down to the source. This would allow new (arbitrary) DS V2 {{Predicate}}s to be pushed down to the file source.


Hello Spark developers,

Thank you in advance for reading. Please excuse any mistakes; this is my first time working on apache/spark internals. I am asking these questions to better understand whether my proposed changes fall within the intended scope of the Data Source V2 API.


+Motivation / Background:+

I am working on a branch in [apache/incubator-sedona|https://github.com/apache/incubator-sedona] to extend its support for GeoParquet files with predicate pushdown of PostGIS-style spatial predicates (e.g. {{ST_Contains()}}) that can take advantage of spatial information in file metadata. We would like to inherit as much as possible from the Parquet classes (because GeoParquet essentially just adds a binary geometry column). However, {{FileScanBuilder.scala}} appears to be missing some functionality I need for DS V2 {{Predicates}}.
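
For concreteness, this is the kind of workload motivating the request (a schematic sketch; the {{geoparquet}} source name and the Sedona function registration are assumptions of mine, not existing Spark API):

{code:scala}
// Schematic sketch only. Assumes a SparkSession `spark` with the Sedona SQL
// functions registered and a GeoParquet reader exposed as "geoparquet".
val points = spark.read.format("geoparquet").load("/data/points")

// Today a filter like this is evaluated after the scan. The goal is for
// ST_Contains to reach the file source as a pushed predicate, so row groups
// whose bounding boxes fall outside the polygon can be skipped using the
// spatial metadata in the files.
points
  .filter("ST_Contains(ST_GeomFromWKT('POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))'), geometry)")
  .count()
{code}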


+My understanding of the problem so far:+

The ST_* {{Expression}} must be detected as a pushable predicate (ParquetScanBuilder.scala:71) and passed as a {{pushedDataFilter}} to the {{ParquetPartitionReaderFactory}}, where it will be translated into a (user-defined) {{FilterPredicate}}.
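
For reference, my reading of the current hook (a simplified paraphrase of {{FileScanBuilder.pushFilters()}} on branch-3.3, not verbatim):

{code:scala}
// Simplified paraphrase of FileScanBuilder.pushFilters() on branch-3.3.
// Data filters are translated into the sealed sources.Filter here, which is
// why an arbitrary DS V2 Predicate never reaches pushDataFilters().
override def pushFilters(filters: Seq[Expression]): Seq[Expression] = {
  val (partitionFilters, dataFilters) =
    DataSourceUtils.getPartitionFiltersAndDataFilters(partitionSchema, filters)
  this.partitionFilters = partitionFilters
  this.dataFilters = dataFilters

  val translatedFilters = mutable.ArrayBuffer.empty[sources.Filter]
  for (filterExpr <- dataFilters) {
    // V1 translation path: produces sources.Filter, not a V2 Predicate.
    DataSourceStrategy.translateFilter(filterExpr, supportNestedPredicatePushdown = true)
      .foreach(translatedFilters += _)
  }
  // Subclasses (e.g. ParquetScanBuilder) choose which filters they accept.
  pushedDataFilters = pushDataFilters(translatedFilters.toArray)

  // File sources re-evaluate filters after the scan, so all data filters are
  // returned as residual (post-scan) filters.
  dataFilters
}
{code}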

The [Filter class is sealed|https://github.com/apache/spark/blob/branch-3.3/sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala], so the Sedona package cannot define new {{Filter}}s; the DS V2 {{Predicate}} appears to be the preferred mechanism for this task (referred to as a "V2 Filter" in SPARK-39966). However, {{pushedDataFilters}} in {{FileScanBuilder.scala}} is of type {{sources.Filter}}.
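
By contrast, the V2 {{Predicate}} is an open name-plus-children representation, so a third-party engine could in principle express a spatial predicate without extending any sealed hierarchy. A hypothetical sketch (the {{ST_CONTAINS}} name and the WKB-encoded literal are my assumptions, not anything Spark defines):

{code:scala}
import org.apache.spark.sql.connector.expressions.{Expression, FieldReference, LiteralValue}
import org.apache.spark.sql.connector.expressions.filter.Predicate
import org.apache.spark.sql.types.BinaryType

// Hypothetical: a spatial predicate as a DS V2 Predicate. Predicate accepts
// an arbitrary name plus children, whereas sources.Filter is a closed set.
// (FieldReference and LiteralValue are private[sql]; code outside Spark
// would go through the public Expressions factory instead.)
val queryWindowWkb: Array[Byte] = Array.emptyByteArray // placeholder WKB bytes

val stContains: Predicate = new Predicate(
  "ST_CONTAINS",
  Array[Expression](
    LiteralValue(queryWindowWkb, BinaryType), // query geometry literal
    FieldReference("geometry")                // geometry column reference
  ))
{code}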

Some recent work (SPARK-39139) added the ability to detect user-defined functions in {{DataSourceV2Strategy.translateFilterV2()}} > {{V2ExpressionBuilder.generateExpression()}}, which I think could perform this detection correctly if {{FileScanBuilder}} called {{DataSourceV2Strategy.translateFilterV2()}} instead of {{DataSourceStrategy.translateFilter()}} (sketched below).
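
Concretely, the change I have in mind would look roughly like this (a hypothetical sketch against branch-3.3; {{pushedDataPredicates}} and {{pushDataPredicates()}} do not exist today, they are names I made up to mirror the existing hook):

{code:scala}
// Hypothetical replacement for the translation loop in
// FileScanBuilder.pushFilters(): route data filters through the V2
// translation (which, since SPARK-39139, can also recognize user-defined
// functions) instead of the V1 sources.Filter translation.
val translatedPredicates = mutable.ArrayBuffer.empty[Predicate]
for (filterExpr <- dataFilters) {
  DataSourceV2Strategy.translateFilterV2(filterExpr)
    .foreach(translatedPredicates += _)
}
// Hypothetical new hook mirroring pushDataFilters(), typed on Predicate.
pushedDataPredicates = pushDataPredicates(translatedPredicates.toArray)
{code}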

However, changing {{FileScanBuilder}} to use {{Predicate}} instead of {{Filter}} would require changes to every file-based data source. I don't want to spend effort on sweeping changes if the current behavior of Spark is intentional.

 

+Concluding Questions:+

Should {{FileScanBuilder}} push {{Predicate}} instead of {{Filter}} for the data filters pushed down to the source? Or should that happen in a new {{FileScanBuilderV2}}?

If not, how can a data source developer push a new (or user-defined) predicate down to the file source?

Thank you again for reading. Pending feedback, I will start working on a PR for 
this functionality.



[~beliefer] [~cloud_fan] [~huaxingao] have worked on DSV2-related Spark issues, and I welcome your input. Please ignore this if I "@" you incorrectly.



> allow DSV2 Predicate pushdown in FileScanBuilder.pushedDataFilter 
> ------------------------------------------------------------------
>
>                 Key: SPARK-40955
>                 URL: https://issues.apache.org/jira/browse/SPARK-40955
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output, SQL
>    Affects Versions: 3.3.1
>            Reporter: RJ Marcus
>            Priority: Major
>


