sigmod commented on code in PR #55629:
URL: https://github.com/apache/spark/pull/55629#discussion_r3184211391
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##########
@@ -657,6 +657,34 @@ trait CheckAnalysis extends LookupCatalog with QueryErrorsBase with PlanToString
messageParameters = Map.empty)
}
+ // Reject streaming inputs early. The optimizer rewrite groups by a `__qid` derived
+ // from `MonotonicallyIncreasingID()` and feeds it to a global `Aggregate`, which
+ // Spark turns into a stateful streaming aggregation. Because MID restarts per
+ // micro-batch, `__qid` values collide across batches, and the stateful aggregate
+ // silently merges state from old batches into new rows that share the same key --
+ // producing wrong top-K results. Failing at analysis time is clearer than letting
+ // this slip through. Streaming support is tracked as a follow-up; resolving it does
+ // not require streaming-aware MID and is likely to come from a different grouping
+ // strategy or a dedicated physical operator.
+ case j: NearestByJoin if j.isStreaming =>
+ j.failAnalysis(
+ errorClass = "NEAREST_BY_JOIN.STREAMING_NOT_SUPPORTED",
+ messageParameters = Map.empty)
+
+ case j @ NearestByJoin(_, _, _, _, _, rankingExpression, _)
+ if !RowOrdering.isOrderable(rankingExpression.dataType) =>
+ j.failAnalysis(
+ errorClass = "NEAREST_BY_JOIN.NON_ORDERABLE_RANKING_EXPRESSION",
+ messageParameters = Map(
+ "expression" -> toSQLExpr(rankingExpression),
+ "type" -> toSQLType(rankingExpression.dataType)))
+
+ case j @ NearestByJoin(_, _, _, false, _, rankingExpression, _)
+ if !rankingExpression.deterministic =>
+ j.failAnalysis(
Review Comment:
Do we have to fail this case?
Wouldn't we still call the result of the following query an "exact result"
rather than an "approximate result"?
> SELECT any_value(t.v)
> FROM t
I view them as:
- exact results: can be deterministic or non-deterministic, but have
well-defined semantics w.r.t. input/output.
- approx results: have no well-defined semantics w.r.t. input/output.
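
To make the distinction above concrete, here is a minimal sketch in plain Scala (not Spark code). `anyValue` and `satisfiesContract` are hypothetical names introduced only for illustration. They model an `any_value`-style aggregate whose output varies from run to run but always satisfies a precise input/output contract: the result is some element of the input group.

```scala
import scala.util.Random

object AnyValueSketch {
  // Hypothetical stand-in for an `any_value`-style aggregate (an assumption,
  // not Spark's implementation): it may return any element of the group.
  def anyValue[A](group: Seq[A], rng: Random): Option[A] =
    if (group.isEmpty) None else Some(group(rng.nextInt(group.length)))

  // The input/output contract: a non-empty group yields one of its own
  // elements, and an empty group yields no value.
  def satisfiesContract[A](group: Seq[A], result: Option[A]): Boolean =
    result match {
      case Some(v) => group.contains(v)
      case None    => group.isEmpty
    }

  val t: Seq[Int] = Seq(3, 1, 4, 1, 5)

  // Different seeds may pick different elements (non-deterministic output),
  // yet every run can be checked against the same contract.
  val runs: Seq[Option[Int]] = (1 to 100).map(seed => anyValue(t, new Random(seed)))
  val allValid: Boolean = runs.forall(r => satisfiesContract(t, r))
}
```

Under this reading, a non-deterministic ranking expression would resemble `anyValue`: the output may differ between runs, but each output can still be checked against a stated contract.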
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]