[ 
https://issues.apache.org/jira/browse/SPARK-45099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huw updated SPARK-45099:
------------------------
    Description: 
When performing a 'using' join with a sort hint in a full outer, the 
ResolveNaturalAndUsingJoin will kick in and build a new join with Equality 
conditions and a Projection like this:
{quote}val joinedCols = joinPairs.map \{ case (l, r) => Alias(Coalesce(Seq(l, 
r)), l.name)() }
{quote}
There's nothing wrong with this per se, but, SortMergeJoinExec has it's output 
ordering for a full outer join as empty, even though these join pairs in their 
final coalesced form actually are ordered.

This means that code like this:
{quote}{{frames.reduceLeft }}{{{}{
  case (l, r) => l.join(r.hint("merge"), usingColumns = Seq("a", "b"), joinType 
= "outer")
}{}}}{{{}{}}}{quote}
Given a non empty list of frames, will not 'stream' without a shuffle step, as 
each join forgets its sort order.

Ideally this whole operation wouldn't require any shuffles if all the frames 
are grouped and sorted by the keys.

  was:
When performing a 'using' join with a sort hint in a full outer, the 
ResolveNaturalAndUsingJoin will kick in and build a new join with Equality 
conditions and a Projection like this:
{quote}val joinedCols = joinPairs.map \{ case (l, r) => Alias(Coalesce(Seq(l, 
r)), l.name)() }
{quote}
There's nothing wrong with this per se, but, SortMergeJoinExec has it's output 
ordering for a full outer join as empty, even though these join pairs in their 
final coalesced form actually are ordered.

This means that code like this:
{quote}frames.reduceLeft({ case (l, r) =>   l.join(r.hint("merge"), 
usingColumns = Seq("a", "b"), joinType = "outer") })
{quote}
Given a non empty list of frames, will not 'stream' without a shuffle step, as 
each join forgets its sort order.

Ideally this whole operation wouldn't require any shuffles if all the frames 
are grouped and sorted by the keys.


> SortMergeExec with Outer using join forgets sort information
> ------------------------------------------------------------
>
>                 Key: SPARK-45099
>                 URL: https://issues.apache.org/jira/browse/SPARK-45099
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.1
>            Reporter: Huw
>            Priority: Minor
>
> When performing a 'using' join with a sort hint in a full outer, the 
> ResolveNaturalAndUsingJoin will kick in and build a new join with Equality 
> conditions and a Projection like this:
> {quote}val joinedCols = joinPairs.map \{ case (l, r) => Alias(Coalesce(Seq(l, 
> r)), l.name)() }
> {quote}
> There's nothing wrong with this per se, but, SortMergeJoinExec has it's 
> output ordering for a full outer join as empty, even though these join pairs 
> in their final coalesced form actually are ordered.
> This means that code like this:
> {quote}{{frames.reduceLeft }}{{{}{
>   case (l, r) => l.join(r.hint("merge"), usingColumns = Seq("a", "b"), 
> joinType = "outer")
> }{}}}{{{}{}}}{quote}
> Given a non empty list of frames, will not 'stream' without a shuffle step, 
> as each join forgets its sort order.
> Ideally this whole operation wouldn't require any shuffles if all the frames 
> are grouped and sorted by the keys.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to