[
https://issues.apache.org/jira/browse/SPARK-56395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zhidong Qu updated SPARK-56395:
-------------------------------
Description:
h2. Q1. What are you trying to do? Articulate your objectives using absolutely
no jargon.
Add a new JOIN syntax to Spark SQL that lets users find the closest matches
between two tables. For each row in one table (the "query" side), the system
finds the top-K most similar or closest rows in another table (the "base"
side), ranked by a user-provided scoring expression.
For example: given a table of user profiles and a table of products, find the
10 most similar products for each user based on their embeddings. Or given a
table of points of interest and a table of areas, find the 5 geographically
nearest areas for each point.
This is a building block for use cases like semantic search, recommendation
systems, retrieval-augmented generation (RAG), and geospatial nearest-neighbor
queries — all expressed in standard SQL rather than requiring custom
application code or external services.
h2. Q2. What problem is this proposal NOT designed to solve?
This proposal adds the SQL syntax and brute-force (exact) execution path for
top-K ranking joins. It does *NOT* cover:
* *Vector index creation or management.* Index DDL and indexed approximate
nearest neighbor (ANN) execution are planned as subsequent work. The syntax is
designed to accommodate indexes transparently — when indexes exist, {{APPROX}}
queries can leverage them without changing the query — but index support itself
is out of scope.
* *New distance or similarity functions.* The proposal relies on existing
scalar functions (e.g., {{{}vector_cosine_similarity{}}},
{{{}vector_l2_distance{}}}). Adding new scoring functions is orthogonal.
* *Cross-table filter predicates.* Predicates like {{ON t.price <= q.budget}}
combined with NEAREST BY are a planned future extension but not part of this
proposal.
* *Threshold-based retrieval.* Threshold retrieval for returning all
candidates within a given distance rather than a fixed count can be added as a
future extension but not part of this proposal.
h2. Q3. How is it done today, and what are the limits of current practice?
Today, users simulate top-K ranking joins in Spark SQL using one of two
patterns:
*Approach 1: CROSS JOIN + window function*
sql
SELECT *
FROM (
SELECT q.{*}, t.{*},
ROW_NUMBER() OVER (
PARTITION BY q.id
ORDER BY vector_cosine_similarity(q.embedding, t.embedding) DESC
) AS rn
FROM query_table q
CROSS JOIN base_table t
) ranked
WHERE rn <= 10;
*Approach 2: CROSS JOIN + {{max_by}} / {{min_by}} with K overload*
sql
SELECT
q.id,
max_by(t.id, vector_cosine_similarity(q.embedding, t.embedding), 10) AS top_ids
FROM query_table q
CROSS JOIN base_table t
GROUP BY q.id;
*Limitations:*
* *Complexity.* Approach 1 requires nested subqueries, window functions, and
manual ranking logic — verbose, error-prone, and difficult for non-expert SQL
users. Approach 2 is more concise but packs results into arrays, requiring an
additional {{LATERAL}} explode step to recover individual rows with all
base-side columns. Neither pattern is self-explanatory.
* *No path to optimization.* Because the intent (top-K nearest neighbor) is
buried inside generic CROSS JOIN + ranking patterns, the optimizer has no
semantic signal to apply specialized execution strategies. There is no way to
transparently benefit from future index-based approximate search without
rewriting the query.
h2. Q4. What is new in your approach and why do you think it will be successful?
The approach extends standard SQL {{JOIN}} syntax with a {{NEAREST ... BY}}
clause rather than introducing a table-valued function (TVF). This is a
deliberate design choice:
* *Composability.* JOIN syntax naturally supports batch (multi-query) search —
each row on the left side drives an independent top-K search on the right side.
TVF-based approaches in other systems require workarounds like {{CROSS APPLY}}
for batch queries.
* *Pluggable ranking.* The {{BY DISTANCE <expr>}} / {{BY SIMILARITY <expr>}}
clause accepts any scalar expression, not just vector similarity. This makes
the same syntax usable for vector search, geospatial nearest-neighbor, BM25
text relevance, or composite scoring — without needing separate functions for
each modality.
* *Explicit search contract.* The {{APPROX}} / {{EXACT}} keywords make the
search algorithm contract explicit in the query. This ensures that index
creation or deletion never silently changes query results — only queries that
opt into {{APPROX}} are affected. This is a key differentiator from systems
where index presence implicitly changes behavior.
h2. Q5. Who cares? If you are successful, what difference will it make?
* *AI/ML practitioners* get a native SQL primitive for semantic search, RAG
pipelines, and recommendation systems without leaving SQL or depending on
external vector databases.
* *Data analysts* can express "find the top-K closest matches" as a single
JOIN clause instead of a complex CROSS JOIN + window function pattern.
* *The Spark ecosystem* closes a competitive gap with BigQuery, Snowflake, SQL
Server, and PostgreSQL, all of which offer dedicated vector/nearest-neighbor
search capabilities.
* *Future work* (indexed ANN, threshold-based retrieval, ON-clause filters)
builds directly on the syntax and semantics established here.
h2. Q6. What are the risks?
* *Syntax commitment.* Extending JOIN syntax is a language-level change that
is difficult to reverse. Mitigation: competitive analysis and scrutinized SQL
API review.
* *Performance expectations.* Without index support (out of scope for this
proposal), brute-force execution on large tables may disappoint users expecting
sub-linear performance. Mitigation: the {{APPROX}} / {{EXACT}} keywords are
part of the syntax from day one, establishing the contract for future
index-based optimization. Documentation will clearly state the initial
execution model.
Risks are low overall. The feature enables strictly new workloads — it does not
modify existing JOIN semantics or execution paths. Existing queries are
unaffected.
h2. Q7. How long will it take?
Estimated 3–4 months for the scoped work:
||Phase||Duration||Description||
|SQL parsing & analysis|1 week|ANTLR grammar extension, logical plan node,
analysis rules|
|Brute-force execution|1 week|Query resolution and optimization, LEFT OUTER /
INNER semantics|
|Spark Connect & DataFrame API|1 week|Proto definitions, DataFrame DSL, PySpark
bindings|
|Docs, testing & benchmarking|1 week|SQL reference docs, end-to-end tests,
performance characterization|
h2. Q8. What are the mid-term and final "exams" to check for success?
*Mid-term:* The basic syntax parses, analyzes, and executes correctly for
single-vector and batch vector similarity search using brute-force evaluation.
The ad-hoc and batch examples from the function doc execute end-to-end.
*Final:* Full feature parity with the scoped proposal — INNER and LEFT OUTER
join modes, APPROX and EXACT keywords, DISTANCE and SIMILARITY ranking
directions, and SQL reference documentation merged. Performance on
representative benchmarks (single-query and batch top-K on tables with 1M+
rows) is characterized and documented.
h2. Appendix A. Proposed API Changes
h3. Syntax
sql
FROM <query_table>
[LEFT OUTER | INNER] JOIN <base_table>
{APPROX | EXACT} NEAREST <num_results>
BY \{DISTANCE <distance_expression> | SIMILARITY <similarity_expression>}
h3. Arguments
||Argument||Description||
|{{query_table}}|The driving (left) table. Each row generates a separate top-K
search against the base table. Can be a table reference, subquery, or CTE.|
|{{base_table}}|The target (right) table to search. Can be a table reference,
subquery, or CTE.|
|{{LEFT OUTER \| INNER}}|Optional. INNER (default) drops query rows with no
matches. LEFT OUTER returns all query rows; base-side columns are NULL when no
candidates exist.|
|{{APPROX \| EXACT}}|Required (one of APPROX or EXACT). APPROX allows the
optimizer to choose faster approximate strategies (e.g., indexed ANN in
future). EXACT forces brute-force evaluation. EXACT fails for inherently
nondeterministic expressions.|
|{{NEAREST <num_results>}}|Compile-time constant integer expression. Maximum
results per query row. Capped at 100,000 following
{{{}max_by{}}}/{{{}min_by{}}} with K overload limit. Defaults to 1.|
|{{BY DISTANCE <expr>}}|Scalar expression for distance ranking — smallest
values rank first. Must return an orderable type.|
|{{BY SIMILARITY <expr>}}|Scalar expression for similarity ranking — largest
values rank first. Must return an orderable type.|
h3. Returns
All columns from the query table and all columns from the base table.
h3. Examples
sql
– Batch recommendations
SELECT q.user_id, t.*
FROM users q
INNER JOIN products t
APPROX NEAREST 10 BY SIMILARITY vector_cosine_similarity(q.embedding,
t.embedding)
– Pre-filtered search
SELECT q.user_id, t.*
FROM users q
INNER JOIN (SELECT * FROM products WHERE price >= 500 AND country = 'EU') AS t
APPROX NEAREST 10 BY SIMILARITY vector_cosine_similarity(q.embedding,
t.embedding)
h3. Planned Future Extensions (out of scope)
* threshold-based retrieval instead of fixed count
* cross-table filter predicates (e.g., {{{}ON t.price <= q.budget{}}})
* Index DDL for ANN acceleration
h3. Backward Compatibility
This proposal introduces new syntax only. No existing queries, plans, or
behavior are affected. The new {{NEAREST}} keyword is contextual within JOIN
clauses and does not conflict with existing reserved words.
----
h2. Appendix B. Design Sketch
h3. Logical Plan
A new logical plan node NearestByJoin captures the join type (INNER/LEFT
OUTER), search mode (APPROX/EXACT), num_results, ranking direction
(DISTANCE/SIMILARITY), ranking expression, and child plans for the query and
base sides.
h3. Physical Plan
NearestByJoin will be rewritten into existing physical operators including
JOIN, [max_by / min_by overloaded by
K|https://github.com/apache/spark/pull/54134], and [vector similarity /
distance expressions|https://github.com/apache/spark/pull/53481]. No new
physical operators are introduced. Longer term speaking, we may benefit from
writing a dedicated fused physical operator that avoids materializing a full
cartesian product for performance improvements, but that is out of scope for
this proposal and initial implementation.
h3. Analysis
The analyzer validates that num_results is a foldable positive integer, the BY
expression returns an orderable type, and EXACT mode is not used with
nondeterministic expressions. Standard column resolution applies to the BY
expression, which may reference columns from both sides.
----
h2. Appendix C. Rejected Designs
h3. Table-Valued Function (TVF)
An earlier design used a {{VECTOR_SEARCH(base_table, query_table, ...)}} TVF.
This was rejected because:
* The function signature becomes unwieldy when supporting multiple distance
types, search modes, and filter options as named parameters.
* JOIN syntax is more natural for SQL users and composes cleanly with CTEs,
subqueries, and standard SQL clauses.
h3. USING clause with distance expression
An intermediate design used {{JOIN ... USING <distance_expression>}} with a
{{NEAREST}} clause. This was rejected because {{USING}} has established
semantics in SQL (equi-join on identically named columns) and overloading it
for distance expressions would be confusing.
h3. ORDER BY + LIMIT pattern
Relying on {{ORDER BY similarity_function(...) LIMIT K}} was considered but
rejected because it only supports single-query search naturally. Batch search
requires verbose CROSS JOIN + window function patterns, and the optimizer has
no semantic signal to apply specialized execution strategies.
h3. LATERAL Subquery
A LATERAL subquery approach - where each query row drives an ORDER BY distance
LIMIT K subquery against the base table - was considered. This composes well
for batch search and is the pattern used by pgvector. However, LATERAL has
well-defined semantics: the subquery executes deterministically per row. This
makes it inherently an exact search construct with no room for approximate
semantics — there is no natural way to express "the optimizer may skip
candidates or use an ANN index" within LATERAL's execution contract. Since the
ability to opt into approximate search (APPROX) is a core design goal, LATERAL
was rejected as the primary syntax.
was:
h2. Q1. What are you trying to do? Articulate your objectives using absolutely
no jargon.
Add a new JOIN syntax to Spark SQL that lets users find the closest matches
between two tables. For each row in one table (the "query" side), the system
finds the top-K most similar or closest rows in another table (the "base"
side), ranked by a user-provided scoring expression.
For example: given a table of user profiles and a table of products, find the
10 most similar products for each user based on their embeddings. Or given a
table of points of interest and a table of areas, find the 5 geographically
nearest areas for each point.
This is a building block for use cases like semantic search, recommendation
systems, retrieval-augmented generation (RAG), and geospatial nearest-neighbor
queries — all expressed in standard SQL rather than requiring custom
application code or external services.
h2. Q2. What problem is this proposal NOT designed to solve?
This proposal adds the SQL syntax and brute-force (exact) execution path for
top-K ranking joins. It does *not* cover:
* *Vector index creation or management.* Index DDL and indexed approximate
nearest neighbor (ANN) execution are planned as subsequent work. The syntax is
designed to accommodate indexes transparently — when indexes exist, {{APPROX}}
queries can leverage them without changing the query — but index support itself
is out of scope.
* *New distance or similarity functions.* The proposal relies on existing
scalar functions (e.g., {{{}vector_cosine_similarity{}}},
{{{}vector_l2_distance{}}}). Adding new scoring functions is orthogonal.
* *Cross-table filter predicates in ON clauses.* Predicates like {{ON t.price
<= q.budget}} combined with NEAREST BY are a planned future extension but not
part of this proposal.
* *Threshold-based retrieval.* A {{WITHIN <threshold>}} clause (return all
candidates within a given distance rather than a fixed count) is a planned
future extension.
h2. Q3. How is it done today, and what are the limits of current practice?
Today, users simulate top-K ranking joins in Spark SQL using one of two
patterns:
*Approach 1: CROSS JOIN + window function*
sql
SELECT *
FROM (
SELECT q.*, t.*,
ROW_NUMBER() OVER (
PARTITION BY q.id
ORDER BY vector_cosine_similarity(q.embedding, t.embedding) DESC
) AS rn
FROM query_table q
CROSS JOIN base_table t
) ranked
WHERE rn <= 10;
*Approach 2: CROSS JOIN + {{max_by}} / {{min_by}} with K overload*
sql
SELECT
q.id,
max_by(t.id, vector_cosine_similarity(q.embedding, t.embedding), 10) AS
top_ids
FROM query_table q
CROSS JOIN base_table t
GROUP BY q.id;
*Limitations:*
* *Complexity.* Approach 1 requires nested subqueries, window functions, and
manual ranking logic — verbose, error-prone, and difficult for non-expert SQL
users. Approach 2 is more concise but packs results into arrays, requiring an
additional {{LATERAL}} explode step to recover individual rows with all
base-side columns. Neither pattern is self-explanatory.
* *No path to optimization.* Because the intent (top-K nearest neighbor) is
buried inside generic CROSS JOIN + ranking patterns, the optimizer has no
semantic signal to apply specialized execution strategies. There is no way to
transparently benefit from future index-based approximate search without
rewriting the query.
h2. Q4. What is new in your approach and why do you think it will be successful?
The approach extends standard SQL {{JOIN}} syntax with a {{NEAREST ... BY}}
clause rather than introducing a table-valued function (TVF). This is a
deliberate design choice:
* *Composability.* JOIN syntax naturally supports batch (multi-query) search —
each row on the left side drives an independent top-K search on the right side.
TVF-based approaches in other systems require workarounds like {{CROSS APPLY}}
for batch queries.
* *Pluggable ranking.* The {{BY DISTANCE <expr>}} / {{BY SIMILARITY <expr>}}
clause accepts any scalar expression, not just vector similarity. This makes
the same syntax usable for vector search, geospatial nearest-neighbor, BM25
text relevance, or composite scoring — without needing separate functions for
each modality.
* *Explicit search contract.* The {{APPROX}} / {{EXACT}} keywords make the
search algorithm contract explicit in the query. This ensures that index
creation or deletion never silently changes query results — only queries that
opt into {{APPROX}} are affected. This is a key differentiator from systems
where index presence implicitly changes behavior.
h2. Q5. Who cares? If you are successful, what difference will it make?
* *AI/ML practitioners* get a native SQL primitive for semantic search, RAG
pipelines, and recommendation systems without leaving SQL or depending on
external vector databases.
* *Data analysts* can express "find the top-K closest matches" as a single
JOIN clause instead of a complex CROSS JOIN + window function pattern.
* *The Spark ecosystem* closes a competitive gap with BigQuery, Snowflake, SQL
Server, and PostgreSQL, all of which offer dedicated vector/nearest-neighbor
search capabilities.
* *Future work* (indexed ANN, threshold-based retrieval, ON-clause filters)
builds directly on the syntax and semantics established here.
h2. Q6. What are the risks?
* *Syntax commitment.* Extending JOIN syntax is a language-level change that
is difficult to reverse. Mitigation: competitive analysis and scrutinized SQL
API review.
* *Performance expectations.* Without index support (out of scope for this
proposal), brute-force execution on large tables may disappoint users expecting
sub-linear performance. Mitigation: the {{APPROX}} / {{EXACT}} keywords are
part of the syntax from day one, establishing the contract for future
index-based optimization. Documentation will clearly state the initial
execution model.
Risks are low overall. The feature enables strictly new workloads — it does not
modify existing JOIN semantics or execution paths. Existing queries are
unaffected.
h2. Q7. How long will it take?
Estimated 3–4 months for the scoped work:
||Phase||Duration||Description||
|SQL parsing & analysis|1 week|ANTLR grammar extension, logical plan node,
analysis rules|
|Brute-force execution|1 week|Query resolution and optimization, LEFT OUTER /
INNER semantics|
|Spark Connect & DataFrame API|1 week|Proto definitions, DataFrame DSL, PySpark
bindings|
|Docs, testing & benchmarking|1 week|SQL reference docs, end-to-end tests,
performance characterization|
h2. Q8. What are the mid-term and final "exams" to check for success?
*Mid-term:* The basic syntax parses, analyzes, and executes correctly for
single-vector and batch vector similarity search using brute-force evaluation.
The ad-hoc and batch examples from the function doc execute end-to-end.
*Final:* Full feature parity with the scoped proposal — INNER and LEFT OUTER
join modes, APPROX and EXACT keywords, DISTANCE and SIMILARITY ranking
directions, Spark Connect and PySpark support, and SQL reference documentation
merged. Performance on representative benchmarks (single-query and batch top-K
on tables with 1M+ rows) is characterized and documented.
h2. Appendix A. Proposed API Changes
h3. Syntax
sql
FROM <query_table>
[LEFT OUTER | INNER] JOIN <base_table>
[APPROX | EXACT] NEAREST <num_results>
BY \{DISTANCE <distance_expression> | SIMILARITY <similarity_expression>}
h3. Arguments
||Argument||Description||
|{{query_table}}|The driving (left) table. Each row generates a separate top-K
search against the base table. Can be a table reference, subquery, or CTE.|
|{{base_table}}|The target (right) table to search. Can be a table reference,
subquery, or CTE.|
|{{LEFT OUTER \| INNER}}|Optional. INNER (default) drops query rows with no
matches. LEFT OUTER returns all query rows; base-side columns are NULL when no
candidates exist.|
|{{APPROX \| EXACT}}|Optional, defaults to APPROX. APPROX allows the optimizer
to choose faster approximate strategies (e.g., indexed ANN in future). EXACT
forces brute-force evaluation. EXACT fails for inherently nondeterministic
expressions.|
|{{NEAREST <num_results>}}|Compile-time constant integer expression. Maximum
results per query row. Capped at 100,000 following
{{{}max_by{}}}/{{{}min_by{}}} with K overload limit. Defaults to 1.|
|{{BY DISTANCE <expr>}}|Scalar expression for distance ranking — smallest
values rank first. Must return an orderable type.|
|{{BY SIMILARITY <expr>}}|Scalar expression for similarity ranking — largest
values rank first. Must return an orderable type.|
h3. Returns
All columns from the query table and all columns from the base table.
h3. Examples
sql
-- Batch recommendations
SELECT q.user_id, t.*
FROM users q
INNER JOIN products t
APPROX NEAREST 10 BY SIMILARITY vector_cosine_similarity(q.embedding,
t.embedding)
-- Pre-filtered search
SELECT q.user_id, t.*
FROM users q
INNER JOIN (SELECT * FROM products WHERE price >= 500 AND country = 'EU') AS t
APPROX NEAREST 10 BY SIMILARITY vector_cosine_similarity(q.embedding,
t.embedding)
h3. Planned Future Extensions (out of scope)
* {{WITHIN <threshold>}} — threshold-based retrieval instead of fixed count
* {{ON <predicate>}} — cross-table filter predicates (e.g., {{{}ON t.price <=
q.budget{}}})
* Index DDL for ANN acceleration
h3. Backward Compatibility
This proposal introduces new syntax only. No existing queries, plans, or
behavior are affected. The new {{NEAREST}} keyword is contextual within JOIN
clauses and does not conflict with existing reserved words.
----
h2. Appendix B. Design Sketch
h3. Logical Plan
A new logical plan node {{NearestByJoin}} captures the join type (INNER/LEFT
OUTER), search mode (APPROX/EXACT), num_results, ranking direction
(DISTANCE/SIMILARITY), ranking expression, and child plans for the query and
base sides.
h3. Analysis
The analyzer validates that {{num_results}} is a foldable positive integer, the
BY expression returns an orderable type, and EXACT mode is not used with
nondeterministic expressions. Standard column resolution applies to the BY
expression, which may reference columns from both sides.
h3. Execution (Brute-Force)
For each partition of the query side, the executor evaluates the BY expression
against all rows of the base side and retains the top-K results using min_by /
max_by operator with K overload.
The APPROX keyword has no effect on execution in this initial phase
(brute-force is used in all cases). When index support is added in a subsequent
phase, APPROX queries against indexed tables will transparently use ANN search
for indexed data files and brute-force for unindexed data files.
h3. Separability
The optimizer pattern-matches the BY expression to determine future index
eligibility. For index-eligible routing, the expression must be a recognized
function with separable arguments — one argument bound entirely to the query
side and one bound entirely to the base side. Non-separable expressions are
valid but will always use brute-force execution.
----
h2. Appendix C. Rejected Designs
h3. Table-Valued Function (TVF)
An earlier design used a {{VECTOR_SEARCH(base_table, query_table, ...)}} TVF.
This was rejected because:
* The function signature becomes unwieldy when supporting multiple distance
types, search modes, and filter options as named parameters.
* JOIN syntax is more natural for SQL users and composes cleanly with CTEs,
subqueries, and standard SQL clauses.
h3. USING clause with distance expression
An intermediate design used {{JOIN ... USING <distance_expression>}} with a
{{NEAREST}} clause. This was rejected because {{USING}} has established
semantics in SQL (equi-join on identically named columns) and overloading it
for distance expressions would be confusing.
h3. ORDER BY + LIMIT pattern
Relying on {{ORDER BY similarity_function(...) LIMIT K}} was considered but
rejected because it only supports single-query search naturally. Batch search
requires verbose CROSS JOIN + window function patterns, and the optimizer has
no semantic signal to apply specialized execution strategies.
> SPIP: NEAREST BY Top-K Ranking Join
> -----------------------------------
>
> Key: SPARK-56395
> URL: https://issues.apache.org/jira/browse/SPARK-56395
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Affects Versions: 4.2.0
> Reporter: Zhidong Qu
> Priority: Major
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> h2. Q1. What are you trying to do? Articulate your objectives using
> absolutely no jargon.
> Add a new JOIN syntax to Spark SQL that lets users find the closest matches
> between two tables. For each row in one table (the "query" side), the system
> finds the top-K most similar or closest rows in another table (the "base"
> side), ranked by a user-provided scoring expression.
> For example: given a table of user profiles and a table of products, find the
> 10 most similar products for each user based on their embeddings. Or given a
> table of points of interest and a table of areas, find the 5 geographically
> nearest areas for each point.
> This is a building block for use cases like semantic search, recommendation
> systems, retrieval-augmented generation (RAG), and geospatial
> nearest-neighbor queries — all expressed in standard SQL rather than
> requiring custom application code or external services.
> h2. Q2. What problem is this proposal NOT designed to solve?
> This proposal adds the SQL syntax and brute-force (exact) execution path for
> top-K ranking joins. It does *NOT* cover:
> * *Vector index creation or management.* Index DDL and indexed approximate
> nearest neighbor (ANN) execution are planned as subsequent work. The syntax
> is designed to accommodate indexes transparently — when indexes exist,
> {{APPROX}} queries can leverage them without changing the query — but index
> support itself is out of scope.
> * *New distance or similarity functions.* The proposal relies on existing
> scalar functions (e.g., {{{}vector_cosine_similarity{}}},
> {{{}vector_l2_distance{}}}). Adding new scoring functions is orthogonal.
> * *Cross-table filter predicates.* Predicates like {{ON t.price <=
> q.budget}} combined with NEAREST BY are a planned future extension but not
> part of this proposal.
> * *Threshold-based retrieval.* Threshold retrieval for returning all
> candidates within a given distance rather than a fixed count can be added as
> a future extension but not part of this proposal.
> h2. Q3. How is it done today, and what are the limits of current practice?
> Today, users simulate top-K ranking joins in Spark SQL using one of two
> patterns:
> *Approach 1: CROSS JOIN + window function*
> sql
> SELECT *
> FROM (
> SELECT q.{*}, t.{*},
> ROW_NUMBER() OVER (
> PARTITION BY q.id
> ORDER BY vector_cosine_similarity(q.embedding, t.embedding) DESC
> ) AS rn
> FROM query_table q
> CROSS JOIN base_table t
> ) ranked
> WHERE rn <= 10;
> *Approach 2: CROSS JOIN + {{max_by}} / {{min_by}} with K overload*
> sql
> SELECT
> q.id,
> max_by(t.id, vector_cosine_similarity(q.embedding, t.embedding), 10) AS
> top_ids
> FROM query_table q
> CROSS JOIN base_table t
> GROUP BY q.id;
> *Limitations:*
> * *Complexity.* Approach 1 requires nested subqueries, window functions, and
> manual ranking logic — verbose, error-prone, and difficult for non-expert SQL
> users. Approach 2 is more concise but packs results into arrays, requiring an
> additional {{LATERAL}} explode step to recover individual rows with all
> base-side columns. Neither pattern is self-explanatory.
> * *No path to optimization.* Because the intent (top-K nearest neighbor) is
> buried inside generic CROSS JOIN + ranking patterns, the optimizer has no
> semantic signal to apply specialized execution strategies. There is no way to
> transparently benefit from future index-based approximate search without
> rewriting the query.
> h2. Q4. What is new in your approach and why do you think it will be
> successful?
> The approach extends standard SQL {{JOIN}} syntax with a {{NEAREST ... BY}}
> clause rather than introducing a table-valued function (TVF). This is a
> deliberate design choice:
> * *Composability.* JOIN syntax naturally supports batch (multi-query) search
> — each row on the left side drives an independent top-K search on the right
> side. TVF-based approaches in other systems require workarounds like {{CROSS
> APPLY}} for batch queries.
> * *Pluggable ranking.* The {{BY DISTANCE <expr>}} / {{BY SIMILARITY <expr>}}
> clause accepts any scalar expression, not just vector similarity. This makes
> the same syntax usable for vector search, geospatial nearest-neighbor, BM25
> text relevance, or composite scoring — without needing separate functions for
> each modality.
> * *Explicit search contract.* The {{APPROX}} / {{EXACT}} keywords make the
> search algorithm contract explicit in the query. This ensures that index
> creation or deletion never silently changes query results — only queries that
> opt into {{APPROX}} are affected. This is a key differentiator from systems
> where index presence implicitly changes behavior.
> h2. Q5. Who cares? If you are successful, what difference will it make?
> * *AI/ML practitioners* get a native SQL primitive for semantic search, RAG
> pipelines, and recommendation systems without leaving SQL or depending on
> external vector databases.
> * *Data analysts* can express "find the top-K closest matches" as a single
> JOIN clause instead of a complex CROSS JOIN + window function pattern.
> * *The Spark ecosystem* closes a competitive gap with BigQuery, Snowflake,
> SQL Server, and PostgreSQL, all of which offer dedicated
> vector/nearest-neighbor search capabilities.
> * *Future work* (indexed ANN, threshold-based retrieval, ON-clause filters)
> builds directly on the syntax and semantics established here.
> h2. Q6. What are the risks?
> * *Syntax commitment.* Extending JOIN syntax is a language-level change that
> is difficult to reverse. Mitigation: competitive analysis and scrutinized SQL
> API review.
> * *Performance expectations.* Without index support (out of scope for this
> proposal), brute-force execution on large tables may disappoint users
> expecting sub-linear performance. Mitigation: the {{APPROX}} / {{EXACT}}
> keywords are part of the syntax from day one, establishing the contract for
> future index-based optimization. Documentation will clearly state the initial
> execution model.
> Risks are low overall. The feature enables strictly new workloads — it does
> not modify existing JOIN semantics or execution paths. Existing queries are
> unaffected.
> h2. Q7. How long will it take?
> Estimated 3–4 months for the scoped work:
> ||Phase||Duration||Description||
> |SQL parsing & analysis|1 week|ANTLR grammar extension, logical plan node,
> analysis rules|
> |Brute-force execution|1 week|Query resolution and optimization, LEFT OUTER /
> INNER semantics|
> |Spark Connect & DataFrame API|1 week|Proto definitions, DataFrame DSL,
> PySpark bindings|
> |Docs, testing & benchmarking|1 week|SQL reference docs, end-to-end tests,
> performance characterization|
> h2. Q8. What are the mid-term and final "exams" to check for success?
> *Mid-term:* The basic syntax parses, analyzes, and executes correctly for
> single-vector and batch vector similarity search using brute-force
> evaluation. The ad-hoc and batch examples from the function doc execute
> end-to-end.
> *Final:* Full feature parity with the scoped proposal — INNER and LEFT OUTER
> join modes, APPROX and EXACT keywords, DISTANCE and SIMILARITY ranking
> directions, and SQL reference documentation merged. Performance on
> representative benchmarks (single-query and batch top-K on tables with 1M+
> rows) is characterized and documented.
> h2. Appendix A. Proposed API Changes
> h3. Syntax
> sql
> FROM <query_table>
> [LEFT OUTER | INNER] JOIN <base_table>
> {APPROX | EXACT} NEAREST <num_results>
> BY \{DISTANCE <distance_expression> | SIMILARITY <similarity_expression>}
> h3. Arguments
> ||Argument||Description||
> |{{query_table}}|The driving (left) table. Each row generates a separate
> top-K search against the base table. Can be a table reference, subquery, or
> CTE.|
> |{{base_table}}|The target (right) table to search. Can be a table reference,
> subquery, or CTE.|
> |{{LEFT OUTER \| INNER}}|Optional. INNER (default) drops query rows with no
> matches. LEFT OUTER returns all query rows; base-side columns are NULL when
> no candidates exist.|
> |{{APPROX \| EXACT}}|Required (one of APPROX or EXACT). APPROX allows the
> optimizer to choose faster approximate strategies (e.g., indexed ANN in
> future). EXACT forces brute-force evaluation. EXACT fails for inherently
> nondeterministic expressions.|
> |{{NEAREST <num_results>}}|Compile-time constant integer expression. Maximum
> results per query row. Capped at 100,000 following
> {{{}max_by{}}}/{{{}min_by{}}} with K overload limit. Defaults to 1.|
> |{{BY DISTANCE <expr>}}|Scalar expression for distance ranking — smallest
> values rank first. Must return an orderable type.|
> |{{BY SIMILARITY <expr>}}|Scalar expression for similarity ranking — largest
> values rank first. Must return an orderable type.|
> h3. Returns
> All columns from the query table and all columns from the base table.
> h3. Examples
> sql
> – Batch recommendations
> SELECT q.user_id, t.*
> FROM users q
> INNER JOIN products t
> APPROX NEAREST 10 BY SIMILARITY vector_cosine_similarity(q.embedding,
> t.embedding)
> – Pre-filtered search
> SELECT q.user_id, t.*
> FROM users q
> INNER JOIN (SELECT * FROM products WHERE price >= 500 AND country = 'EU') AS t
> APPROX NEAREST 10 BY SIMILARITY vector_cosine_similarity(q.embedding,
> t.embedding)
> h3. Planned Future Extensions (out of scope)
> * threshold-based retrieval instead of fixed count
> * cross-table filter predicates (e.g., {{{}ON t.price <= q.budget{}}})
> * Index DDL for ANN acceleration
> h3. Backward Compatibility
> This proposal introduces new syntax only. No existing queries, plans, or
> behavior are affected. The new {{NEAREST}} keyword is contextual within JOIN
> clauses and does not conflict with existing reserved words.
> ----
> h2. Appendix B. Design Sketch
> h3. Logical Plan
> A new logical plan node NearestByJoin captures the join type (INNER/LEFT
> OUTER), search mode (APPROX/EXACT), num_results, ranking direction
> (DISTANCE/SIMILARITY), ranking expression, and child plans for the query and
> base sides.
> h3. Physical Plan
> NearestByJoin will be rewritten into existing physical operators including
> JOIN, [max_by / min_by overloaded by
> K|https://github.com/apache/spark/pull/54134], and [vector similarity /
> distance expressions|https://github.com/apache/spark/pull/53481]. No new
> physical operators are introduced. Longer term speaking, we may benefit from
> writing a dedicated fused physical operator that avoids materializing a full
> cartesian product for performance improvements, but that is out of scope for
> this proposal and initial implementation.
> h3. Analysis
> The analyzer validates that num_results is a foldable positive integer, the
> BY expression returns an orderable type, and EXACT mode is not used with
> nondeterministic expressions. Standard column resolution applies to the BY
> expression, which may reference columns from both sides.
> ----
> h2. Appendix C. Rejected Designs
> h3. Table-Valued Function (TVF)
> An earlier design used a {{VECTOR_SEARCH(base_table, query_table, ...)}} TVF.
> This was rejected because:
> * The function signature becomes unwieldy when supporting multiple distance
> types, search modes, and filter options as named parameters.
> * JOIN syntax is more natural for SQL users and composes cleanly with CTEs,
> subqueries, and standard SQL clauses.
> h3. USING clause with distance expression
> An intermediate design used {{JOIN ... USING <distance_expression>}} with a
> {{NEAREST}} clause. This was rejected because {{USING}} has established
> semantics in SQL (equi-join on identically named columns) and overloading it
> for distance expressions would be confusing.
> h3. ORDER BY + LIMIT pattern
> Relying on {{ORDER BY similarity_function(...) LIMIT K}} was considered but
> rejected because it only supports single-query search naturally. Batch search
> requires verbose CROSS JOIN + window function patterns, and the optimizer has
> no semantic signal to apply specialized execution strategies.
> h3. LATERAL Subquery
> A LATERAL subquery approach - where each query row drives an ORDER BY
> distance LIMIT K subquery against the base table - was considered. This
> composes well for batch search and is the pattern used by pgvector. However,
> LATERAL has well-defined semantics: the subquery executes deterministically
> per row. This makes it inherently an exact search construct with no room for
> approximate semantics — there is no natural way to express "the optimizer may
> skip candidates or use an ANN index" within LATERAL's execution contract.
> Since the ability to opt into approximate search (APPROX) is a core design
> goal, LATERAL was rejected as the primary syntax.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]