zhidongqu-db commented on PR #55629:
URL: https://github.com/apache/spark/pull/55629#issuecomment-4374836208
Thanks everyone for the thorough review. Posting a single update on what's
going into the next push and what's being tracked separately, based on all the
threads so far.
**Going into the next push (code):**
- Remove the structural skip in `CheckCartesianProducts` — agreeing with
@sigmod and earlier discussion that NEAREST BY queries should fail under
`crossJoin.enabled = false` rather than auto-pass. Users opt in by setting the
conf to `true`. This also makes the Project-injection edge case (when ranking
is nondeterministic) moot, since there's no skip pattern to bypass.
- Move `NearestByJoin` (companion + class) into its own file under
`plans/logical/`.
- Extract the NEAREST BY block in `AstBuilder.withJoinRelation` into a
helper method.
- Complete the cut-off comment on `allowNonDeterministicExpression` in
`NearestByJoin`.
- Add `@param` Scaladoc on the `NearestByJoin` case class fields.
- Doc update in `sql-ref-syntax-qry-select-join.md` to call out determinism
semantics and that the ranking expression is evaluated for every (left, right)
pair (so side-effecting UDFs run K×N times).
- Tighten left-side nullability handling in `NearestByJoin.output` for
symmetry with the right side (both will declare nullable, since `First` widens
left-column nullability in the rewrite anyway).
- Lowercase `nearest-by` in error messages — keeping consistency with the
other six `NEAREST_BY_JOIN.*` sub-classes.
**Going into the next push (tests):**
- `SELECT *` smoke test (per @sigmod) verifying the output schema is clean —
no `__qid`, no `__nearest_matches__`, no struct leakage.
- Negative test under `spark.sql.crossJoin.enabled = false` confirming
NEAREST BY now requires opt-in.
- (Already added in the last commit: distance+leftouter, EXACT≡APPROX
rewrite equivalence, k=1 / k=`MaxNumResults`, self-join via
`DeduplicateRelations`, nondeterministic-ranking + Project-injection shape,
deterministic-ranking does NOT inject Project.)
**Reply-only (no code change needed):**
- @dtenedor #29 (LATERAL_JOIN_NEAREST_BY scope): lateral *column alias*
usage in queries over NEAREST BY results is orthogonal — only the explicit
\`JOIN LATERAL ... NEAREST BY\` form is rejected.
- @dtenedor #31 (move \`NearestByJoinType\` to \`sql/api\`): would require
lifting \`JoinType\` to \`sql/api\` first, so deferring. Threading the constant
through the parser error helper is the local fix.
- @dtenedor #26/#27 (error-message UX): current \`NUM_RESULTS_OUT_OF_RANGE\`
already states the valid range; will revisit if this proves confusing in
practice. original right-side exprIds still live in the right child of the
synthetic Join, so reusing them would create duplicate exprIds in the same
plan. \`attrMapping\` carries the old→new mapping so upstream references stay
resolved.
- @sigmod #47 (\`qidAttr\` leakage): \`__qid\` is in \`groupingExpressions\`
only, never in \`aggregateExpressions\`, so it doesn't appear in the rewritten
plan's output. The schema-fidelity concern (left-nullability widening from
\`First\`) will be addressed by the symmetric-nullability change above.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]