zhidongqu-db commented on code in PR #55629:
URL: https://github.com/apache/spark/pull/55629#discussion_r3184629440
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##########
@@ -657,6 +657,34 @@ trait CheckAnalysis extends LookupCatalog with QueryErrorsBase with PlanToString
messageParameters = Map.empty)
}
+ // Reject streaming inputs early. The optimizer rewrite groups by a `__qid` derived
+ // from `MonotonicallyIncreasingID()` and feeds it to a global `Aggregate`, which
+ // Spark turns into a stateful streaming aggregation. Because MID restarts per
+ // micro-batch, `__qid` values collide across batches, and the stateful aggregate
+ // silently merges state from old batches into new rows that share the same key --
+ // producing wrong top-K results. Failing at analysis time is clearer than letting
+ // this slip through. Streaming support is tracked as a follow-up; resolving it does
+ // not require streaming-aware MID and is likely to come from a different grouping
+ // strategy or a dedicated physical operator.
+ case j: NearestByJoin if j.isStreaming =>
+ j.failAnalysis(
+ errorClass = "NEAREST_BY_JOIN.STREAMING_NOT_SUPPORTED",
+ messageParameters = Map.empty)
+
+ case j @ NearestByJoin(_, _, _, _, _, rankingExpression, _)
+ if !RowOrdering.isOrderable(rankingExpression.dataType) =>
+ j.failAnalysis(
+ errorClass = "NEAREST_BY_JOIN.NON_ORDERABLE_RANKING_EXPRESSION",
+ messageParameters = Map(
+ "expression" -> toSQLExpr(rankingExpression),
+ "type" -> toSQLType(rankingExpression.dataType)))
+
+ case j @ NearestByJoin(_, _, _, false, _, rankingExpression, _)
+ if !rankingExpression.deterministic =>
+ j.failAnalysis(
Review Comment:
I think this depends on how we define EXACT semantics here. The SPIP
explicitly states that EXACT with a non-deterministic ordering expression
should fail. The intent is for the EXACT keyword to express deterministic
ordering given a deterministic input and a deterministic scoring expression.
If the scoring expression is non-deterministic in the first place (e.g.
LLM-generated scores), the query fails and the user should use APPROX
instead, since that keyword explicitly does not imply deterministic results.
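
The rule described above can be sketched in isolation as follows. This is a minimal illustration only: `Mode`, `RankingExpr`, and `checkRanking` are hypothetical stand-ins rather than the Catalyst classes in this PR, and the sketch assumes that the `false` in the pattern match of the hunk corresponds to EXACT mode.

```scala
object NearestByCheckSketch {
  // Hypothetical stand-ins for illustration; not Spark's real types.
  sealed trait Mode
  case object Exact extends Mode
  case object Approx extends Mode

  // A ranking expression reduced to its SQL text and a determinism flag.
  final case class RankingExpr(sql: String, deterministic: Boolean)

  // Returns Left(message) when the query should be rejected at analysis time,
  // Right(()) when the mode/expression combination is allowed.
  def checkRanking(mode: Mode, expr: RankingExpr): Either[String, Unit] =
    (mode, expr.deterministic) match {
      case (Exact, false) =>
        Left(s"EXACT requires a deterministic ranking expression, got: ${expr.sql}")
      case _ =>
        Right(())
    }
}
```

Under these assumptions, `checkRanking(Exact, RankingExpr("llm_score(q, d)", deterministic = false))` is rejected, while the same expression with `Approx` passes, which matches the intent that only EXACT promises deterministic results.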