[ https://issues.apache.org/jira/browse/SPARK-40955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-40955:
---------------------------------
    Summary: Allow DSV2 Predicate pushdown in FileScanBuilder.pushedDataFilter  (was: allow DSV2 Predicate pushdown in FileScanBuilder.pushedDataFilter )

> Allow DSV2 Predicate pushdown in FileScanBuilder.pushedDataFilter
> ------------------------------------------------------------------
>
>                 Key: SPARK-40955
>                 URL: https://issues.apache.org/jira/browse/SPARK-40955
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output, SQL
>    Affects Versions: 3.3.1
>            Reporter: RJ Marcus
>            Priority: Major
>
> +Overall:+
> Allow {{FileScanBuilder}} to push {{Predicate}} instead of {{Filter}} for data filters being pushed down to the source. This would allow new (arbitrary) DSV2 {{Predicates}} to be pushed down to the file source.
>
> Hello Spark developers,
> Thank you in advance for reading. Please excuse any mistakes; this is my first time working on apache/spark internals. I am asking these questions to better understand whether my proposed changes fall within the intended scope of the Data Source V2 API.
>
> +Motivation / Background:+
> I am working on a branch in [apache/incubator-sedona|https://github.com/apache/incubator-sedona] to extend its support for GeoParquet files to include predicate pushdown of PostGIS-style spatial predicates (e.g. {{ST_Contains()}}) that can take advantage of spatial info in file metadata. We would like to inherit as much as possible from the Parquet classes (because GeoParquet essentially just adds a binary geometry column). However, {{FileScanBuilder.scala}} appears to be missing some functionality I need for DSV2 {{Predicates}}.
>
> +My understanding of the problem so far:+
> The {{ST_*}} {{Expression}} must be detected as a pushable predicate (ParquetScanBuilder.scala:71) and passed as a {{pushedDataFilter}} to the {{ParquetPartitionReaderFactory}}, where it will be translated into a (user-defined) {{FilterPredicate}}.
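The core difficulty (a sealed {{sources.Filter}} trait versus the open V2 {{Predicate}} shape) can be sketched with a self-contained toy model. These class names are simplified stand-ins, not the real Spark types, which live in {{org.apache.spark.sql.sources}} and {{org.apache.spark.sql.connector.expressions.filter}}:

```scala
// Toy model of the DSV1 sources.Filter hierarchy: the trait is sealed,
// so every subclass must live in the same file. A third-party package
// such as Sedona cannot add an ST_Contains filter to this closed set.
sealed trait ToyFilter
case class EqualTo(attribute: String, value: Any) extends ToyFilter
case class GreaterThan(attribute: String, value: Any) extends ToyFilter

// Toy model of the DSV2 Predicate: a function name plus child
// expressions. Because the set of names is open-ended, any function,
// including a spatial predicate, can be represented without touching
// Spark's source tree.
case class ToyPredicate(name: String, children: Seq[String])

// A spatial predicate and a classic comparison expressed uniformly:
val spatial = ToyPredicate("ST_CONTAINS", Seq("geometry", "queryWindow"))
val classic = ToyPredicate("=", Seq("id", "42"))
```

This is only meant to illustrate why the sealed hierarchy blocks the Sedona use case; the real {{Predicate}} carries typed {{Expression}} children rather than strings.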
> The [Filter class is sealed|https://github.com/apache/spark/blob/branch-3.3/sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala], so the sedona package can't define new Filters; a DSV2 {{Predicate}} appears to be the preferred mechanism for this task (referred to as a "V2 Filter" in SPARK-39966). However, {{pushedDataFilters}} in FileScanBuilder.scala is of type {{sources.Filter}}.
>
> Recent work (SPARK-39139) added the ability to detect user-defined functions in {{DataSourceV2Strategy.translateFilterV2()}} -> {{V2ExpressionBuilder.generateExpression()}}, which I think could handle detection correctly if {{FileScanBuilder}} called {{DataSourceV2Strategy.translateFilterV2()}} instead of {{DataSourceStrategy.translateFilter()}}.
>
> However, changing {{FileScanBuilder}} to use {{Predicate}} instead of {{Filter}} would require many changes to all file-based data sources. I don't want to spend effort making sweeping changes if the current behavior of Spark is intentional.
>
> +Concluding Questions:+
> Should {{FileScanBuilder}} push {{Predicate}} instead of {{Filter}} for data filters being pushed down to the source? Or perhaps in a {{FileScanBuilderV2}}?
> If not, how can a developer of a data source push down a new (or user-defined) predicate to the file source?
>
> Thank you again for reading. Pending feedback, I will start working on a PR for this functionality.
> [~beliefer] [~cloud_fan] [~huaxingao] have worked on DSV2-related Spark issues and I welcome your input. Please ignore this if I "@" you incorrectly.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
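The translation asymmetry the reporter describes (the Filter-based path rejects expressions it does not recognize, while the V2 path can name a user-defined function) can be sketched as a toy, self-contained model. The expression and predicate types below are hypothetical simplifications, not Spark's actual {{Expression}} or {{Predicate}} classes:

```scala
// Minimal expression tree standing in for Catalyst expressions.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Lit(value: String) extends Expr
case class Udf(name: String, args: Seq[Expr]) extends Expr

// Stand-in for the open-ended V2 Predicate (name + children).
case class Pred(name: String, children: Seq[Expr])

// Filter-style translation: only a fixed set of shapes is pushable,
// so a user-defined function falls through and is not pushed down.
def translateV1(e: Expr): Option[String] = e match {
  case Udf(_, _) => None            // unknown to the sealed Filter set
  case Attr(n)   => Some(n)
  case Lit(v)    => Some(v)
}

// Predicate-style translation: a UDF maps to a named Predicate, so
// arbitrary functions (e.g. spatial ones) survive the pushdown path.
def translateV2(e: Expr): Option[Pred] = e match {
  case Udf(name, args) => Some(Pred(name.toUpperCase, args))
  case _               => None
}
```

In real Spark the corresponding entry points are {{DataSourceStrategy.translateFilter()}} and {{DataSourceV2Strategy.translateFilterV2()}}; this sketch only mirrors the shape of the decision, not their actual signatures.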