[ 
https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600820#comment-15600820
 ] 

Tejas Patil commented on SPARK-18067:
-------------------------------------

The predicate `value1 === value2` is pushed down into the join condition, which 
makes sense on its own. However, the optimizer could still have avoided the 
extra shuffle, since the children are already partitioned by `key`.

Here is what's happening: 
https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L185

{noformat}
val useExistingPartitioning = children.zip(requiredChildDistributions).forall {
  case (child, distribution) =>
    child.outputPartitioning.guarantees(
      createPartitioning(distribution, maxChildrenNumPartitions))
}
{noformat}

The child's `outputPartitioning` is `HashPartitioning(expressions = [key], 
numPartitions = 200)`, while the partitioning returned by `createPartitioning()` 
is `HashPartitioning(expressions = [value, key], numPartitions = 200)`.

Since they don't match, an extra shuffle is added.
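
To make the failing check concrete, here is a minimal, dependency-free sketch 
(my simplification, not Spark's actual classes, which live in 
`org.apache.spark.sql.catalyst.plans.physical`): one hash partitioning 
guarantees another only if it hashes the same expressions into the same number 
of partitions, so partitioning by `[key]` cannot guarantee partitioning by 
`[value, key]`.

```scala
// Simplified model of the HashPartitioning.guarantees semantics
// (hypothetical stand-in for Spark's Catalyst classes).
case class HashPartitioning(expressions: Seq[String], numPartitions: Int) {
  // A hash partitioning guarantees another only when it hashes exactly the
  // same expressions into exactly the same number of partitions.
  def guarantees(other: HashPartitioning): Boolean =
    expressions == other.expressions && numPartitions == other.numPartitions
}

object Demo extends App {
  val childOutput = HashPartitioning(Seq("key"), 200)
  val required    = HashPartitioning(Seq("value", "key"), 200)

  // Rows co-located by hash(key) are not necessarily co-located by
  // hash(value, key), so no guarantee holds and a shuffle is inserted.
  println(childOutput.guarantees(required))                          // false
  println(childOutput.guarantees(HashPartitioning(Seq("key"), 200))) // true
}
```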


> Adding filter after SortMergeJoin creates unnecessary shuffle
> -------------------------------------------------------------
>
>                 Key: SPARK-18067
>                 URL: https://issues.apache.org/jira/browse/SPARK-18067
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Paul Jones
>            Priority: Minor
>
> Basic setup
> {code}
> scala> case class Data1(key: String, value1: Int)
> scala> case class Data2(key: String, value2: Int)
> scala> val partition1 = sc.parallelize(1 to 100000).map(x => Data1(s"$x", x))
>     .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> scala> val partition2 = sc.parallelize(1 to 100000).map(x => Data2(s"$x", x))
>     .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> {code}
> Join on key
> {code}
> scala> partition1.join(partition2, "key").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [key#0], [key#12]
>    :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 10000, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>    +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 10000, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we get a super efficient join with no shuffle.
> But if we add a filter our join gets less efficient and we end up with a 
> shuffle.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" === 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [value1#1,key#0], [value2#13,key#12]
>    :- Sort [value1#1 ASC,key#0 ASC], false, 0
>    :  +- TungstenExchange hashpartitioning(value1#1,key#0,200), None
>    :     +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 10000, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>    +- Sort [value2#13 ASC,key#12 ASC], false, 0
>       +- TungstenExchange hashpartitioning(value2#13,key#12,200), None
>          +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 10000, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we can avoid the shuffle if we use a filter expression that can't be 
> pushed into the join.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" >= 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- Filter (value1#1 >= value2#13)
>    +- SortMergeJoin [key#0], [key#12]
>       :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 10000, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>       +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 10000, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> What's the best way to avoid the filter pushdown here?
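
One possible workaround (my suggestion, not something stated in this issue): 
express the equality as a UDF. Like the `>=` example above, a UDF call is not 
an `EqualTo` predicate, so the planner cannot fold it into the join keys and it 
should remain a post-join `Filter`, leaving the join keyed on `key` alone. A 
sketch, assuming the `partition1`/`partition2` DataFrames from the setup above; 
verify the resulting plan with `.explain` on your Spark version:

```scala
// Hypothetical workaround: an opaque UDF keeps the equality out of the
// join keys, so the existing partitioning on `key` can be reused.
import org.apache.spark.sql.functions.udf

val sameValue = udf((a: Int, b: Int) => a == b)
partition1.join(partition2, "key")
  .filter(sameValue($"value1", $"value2"))
  .explain
```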



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
