Scott Schenkein created SPARK-55869:
---------------------------------------

             Summary:  Extensible predicate pushdown for DataSource V2
                 Key: SPARK-55869
                 URL: https://issues.apache.org/jira/browse/SPARK-55869
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Scott Schenkein


h2. Problem

DataSource V2 predicate pushdown is limited to a fixed, closed set of expression types hardcoded in {{V2ExpressionBuilder}}. Data source authors who need to push predicates involving:

# *Custom operators* (e.g. {{my_col indexquery 'param'}}, {{my_boolean_function(col1, col2, 'p')}})
# *Builtin Spark expressions not in the pushdown whitelist* (e.g. {{RLIKE}}, {{LIKE}}, {{ILIKE}})

are forced to resort to fragile workarounds: intercepting the logical plan via {{SparkSessionExtensions}}-injected rules, using thread-local state to smuggle filter information past the optimizer, and effectively reimplementing pushdown outside Spark's designed architecture.

 

These hacks are:

* Brittle across Spark versions
* Invisible to Spark's query planning (no EXPLAIN output, no metrics)
* Unable to participate in Spark's post-scan filter safety net
* Duplicative: every data source author reinvents the same machinery

 

h2. Proposed Solution

Three independently adoptable layers:

 

*Layer 1: Capability-gated builtin predicate translation*

A new {{SupportsPushDownPredicateCapabilities}} interface (a mix-in for {{ScanBuilder}}) lets data sources declare which additional V2 predicate names they support (e.g. {{"RLIKE"}}, {{"LIKE"}}). {{V2ExpressionBuilder}} uses this set to conditionally translate builtin Catalyst expressions that currently have no match case.

Tier 1 builtins: {{LIKE}}, {{RLIKE}}, {{IS_NAN}}, {{ARRAY_CONTAINS}}, {{MAP_CONTAINS_KEY}}.
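As a hedged sketch of the Layer 1 gate (the interface and method names follow this proposal; the stand-in types and translation helper are illustrative assumptions, not Spark's actual API):

```java
import java.util.Set;

// Stand-in for org.apache.spark.sql.connector.read.ScanBuilder.
interface ScanBuilder {}

// Proposed mix-in: the scan builder declares extra V2 predicate names
// (beyond the fixed whitelist) that it can evaluate.
interface SupportsPushDownPredicateCapabilities extends ScanBuilder {
    Set<String> supportedPredicateNames();
}

// Toy stand-in for a translated V2 Predicate.
record V2Predicate(String name, String[] args) {}

class CapabilityGateSketch {
    // Builtins outside the fixed whitelist translate only when the source
    // declares the capability; otherwise translation returns null and
    // Spark keeps the filter for post-scan evaluation.
    static V2Predicate translateRLike(ScanBuilder b, String col, String regex) {
        if (b instanceof SupportsPushDownPredicateCapabilities caps
                && caps.supportedPredicateNames().contains("RLIKE")) {
            return new V2Predicate("RLIKE", new String[] {col, regex});
        }
        return null;
    }
}
```

Because the gate defaults to "not pushed", existing sources that do not implement the mix-in see no behavior change.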

 

*Layer 2: Custom predicate functions*

A new {{SupportsCustomPredicates}} interface (a mix-in for {{Table}}) lets data sources declare custom boolean predicate functions with dot-qualified canonical names (e.g. {{"com.mycompany.INDEXQUERY"}}). A new analyzer rule ({{ResolveCustomPredicates}}) resolves these during analysis. A new {{CustomPredicateExpression}} Catalyst node translates to a namespaced V2 {{Predicate}} during pushdown. A safety rule ({{EnsureCustomPredicatesPushed}}) fails queries if a custom predicate wasn't pushed.
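A minimal sketch of the Layer 2 surface, with stand-in types and assumed signatures (the descriptor fields and the safety-check shape are illustrative; only the names come from this proposal):

```java
import java.util.List;

// Stand-in for org.apache.spark.sql.connector.catalog.Table.
interface Table {}

// Describes one custom boolean predicate function; the fields shown
// (canonical name, arity) are assumptions for this sketch.
record CustomPredicateDescriptor(String canonicalName, int arity) {}

// Proposed mix-in: the table declares its pushable custom predicates
// under dot-qualified canonical names.
interface SupportsCustomPredicates extends Table {
    List<CustomPredicateDescriptor> customPredicates();
}

class EnsureCustomPredicatesPushedSketch {
    // The post-optimizer safety rule in miniature: per the proposal,
    // a query fails loudly when a declared custom predicate was not
    // pushed, rather than silently scanning everything.
    static void check(List<String> unpushed) {
        if (!unpushed.isEmpty()) {
            throw new IllegalStateException(
                "Custom predicates were not pushed down: " + unpushed);
        }
    }
}
```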

 

*Layer 3: Custom infix operator syntax*

A helper base class, {{CustomOperatorParserExtension}}, simplifies parser extensions that rewrite custom infix operators (e.g. {{col INDEXQUERY 'param'}}) to function-call syntax, composing cleanly with Layer 2.
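As a toy illustration of the rewrite itself (a real {{CustomOperatorParserExtension}} would hook the parser via {{SparkSessionExtensions}}; this regex stand-in only shows the transformation, and the {{indexquery}} operator is hypothetical):

```java
import java.util.regex.Pattern;

class InfixOperatorRewriteSketch {
    // col INDEXQUERY 'param'  ->  indexquery(col, 'param')
    // The resulting function-call form then resolves through Layer 2.
    private static final Pattern INDEXQUERY = Pattern.compile(
        "(\\w+)\\s+indexquery\\s+('[^']*')", Pattern.CASE_INSENSITIVE);

    static String rewrite(String sql) {
        return INDEXQUERY.matcher(sql).replaceAll("indexquery($1, $2)");
    }
}
```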

 

h2. Key Design Decisions

 

* Custom predicate names use dot-qualified canonical names to avoid collision with Spark builtins
* {{CustomPredicateExpression}} uses {{CodegenFallback}} (not {{Unevaluable}}) with a post-optimizer safety rule
* Layer 1 capabilities gate only predicate-level translation; scalar expression gating is unnecessary
* The analyzer rule is registered in the {{postHocResolutionRules}} batch
* Custom predicates produce a namespaced {{Predicate}} directly (no {{BOOLEAN_EXPRESSION}} wrapper)
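The first decision can be enforced mechanically: builtin V2 predicate names ({{"RLIKE"}}, {{"="}}, {{"AND"}}, ...) contain no dot, so a dot-qualified requirement rules out collisions. A hypothetical validation helper (not part of the proposal's API):

```java
class CustomPredicateNameSketch {
    // Builtin V2 predicate names contain no dot, so requiring a
    // dot-qualified canonical name keeps custom predicates out of
    // the builtin namespace.
    static String validate(String canonicalName) {
        if (!canonicalName.contains(".")) {
            throw new IllegalArgumentException(
                "Custom predicate names must be dot-qualified: " + canonicalName);
        }
        return canonicalName;
    }
}
```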

 

h2. Validation

 

The design was validated against a production DSv2 connector (IndexTables4Spark) that currently uses ~585 lines of ThreadLocal + WeakHashMap + logical-plan interception hacks for custom predicate pushdown. The proposed design eliminates all of these hack patterns.

 

h2. Implementation Plan

 

||Phase||Scope||
|Phase 1|{{SupportsPushDownPredicateCapabilities}} + {{V2ExpressionBuilder}} capability gating + Tier 1 builtins|
|Phase 2|{{SupportsCustomPredicates}} + {{CustomPredicateDescriptor}} + analyzer rule + safety rule + Tier 2 builtins|
|Phase 3|{{CustomOperatorParserExtension}} helper + documentation|
|Phase 4|JDBC reference implementation|


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
