[ https://issues.apache.org/jira/browse/SPARK-45099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Huw updated SPARK-45099: ------------------------ Description: When performing a 'using' join with a sort hint in a full outer, the ResolveNaturalAndUsingJoin will kick in and build a new join with Equality conditions and a Projection like this: {quote}val joinedCols = joinPairs.map \{ case (l, r) => Alias(Coalesce(Seq(l, r)), l.name)() } {quote} There's nothing wrong with this per se, but, SortMergeJoinExec has it's output ordering for a full outer join as empty, even though these join pairs in their final coalesced form actually are ordered. This means that code like this: {quote}frames.reduceLeft { case (l, r) => l.join(r.hint("merge"), usingColumns = Seq("a", "b"), joinType = "outer") }{{{}{}}}{quote} Given a non empty list of frames, will not 'stream' without a shuffle step, as each join forgets its sort order. Ideally this whole operation wouldn't require any shuffles if all the frames are grouped and sorted by the keys. was: When performing a 'using' join with a sort hint in a full outer, the ResolveNaturalAndUsingJoin will kick in and build a new join with Equality conditions and a Projection like this: {quote}val joinedCols = joinPairs.map \{ case (l, r) => Alias(Coalesce(Seq(l, r)), l.name)() } {quote} There's nothing wrong with this per se, but, SortMergeJoinExec has it's output ordering for a full outer join as empty, even though these join pairs in their final coalesced form actually are ordered. This means that code like this: {quote}{{frames.reduceLeft }}{{{}{ case (l, r) => l.join(r.hint("merge"), usingColumns = Seq("a", "b"), joinType = "outer") }{}}}{{{}{}}}{quote} Given a non empty list of frames, will not 'stream' without a shuffle step, as each join forgets its sort order. Ideally this whole operation wouldn't require any shuffles if all the frames are grouped and sorted by the keys. > SortMergeExec with Outer using join forgets sort information > ------------------------------------------------------------ > > Key: SPARK-45099 > URL: https://issues.apache.org/jira/browse/SPARK-45099 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.4.1 > Reporter: Huw > Priority: Minor > > When performing a 'using' join with a sort hint in a full outer, the > ResolveNaturalAndUsingJoin will kick in and build a new join with Equality > conditions and a Projection like this: > {quote}val joinedCols = joinPairs.map \{ case (l, r) => Alias(Coalesce(Seq(l, > r)), l.name)() } > {quote} > There's nothing wrong with this per se, but, SortMergeJoinExec has it's > output ordering for a full outer join as empty, even though these join pairs > in their final coalesced form actually are ordered. > This means that code like this: > {quote}frames.reduceLeft { > case (l, r) => l.join(r.hint("merge"), usingColumns = Seq("a", "b"), > joinType = "outer") > }{{{}{}}}{quote} > Given a non empty list of frames, will not 'stream' without a shuffle step, > as each join forgets its sort order. > Ideally this whole operation wouldn't require any shuffles if all the frames > are grouped and sorted by the keys. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org