[ https://issues.apache.org/jira/browse/SPARK-19468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-19468: ------------------------------------ Assignee: Apache Spark > Dataset slow because of unnecessary shuffles > -------------------------------------------- > > Key: SPARK-19468 > URL: https://issues.apache.org/jira/browse/SPARK-19468 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.1.0 > Reporter: koert kuipers > Assignee: Apache Spark > > we noticed that some algos we ported from rdd to dataset are significantly > slower, and the main reason seems to be more shuffles that we successfully > avoid for rdds by careful partitioning. this seems to be dataset specific as > it works ok for dataframe. > see also here: > http://blog.hydronitrogen.com/2016/05/13/shuffle-free-joins-in-spark-sql/ > it kind of boils down to this... if i partition and sort dataframes that get > used for joins repeatedly i can avoid shuffles: > {noformat} > System.setProperty("spark.sql.autoBroadcastJoinThreshold", "-1") > val df1 = Seq((0, 0), (1, 1)).toDF("key", "value") > > .repartition(col("key")).sortWithinPartitions(col("key")).persist(StorageLevel.DISK_ONLY) > val df2 = Seq((0, 0), (1, 1)).toDF("key2", "value2") > > .repartition(col("key2")).sortWithinPartitions(col("key2")).persist(StorageLevel.DISK_ONLY) > val joined = df1.join(df2, col("key") === col("key2")) > joined.explain > == Physical Plan == > *SortMergeJoin [key#5], [key2#27], Inner > :- InMemoryTableScan [key#5, value#6] > : +- InMemoryRelation [key#5, value#6], true, 10000, StorageLevel(disk, 1 > replicas) > : +- *Sort [key#5 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(key#5, 4) > : +- LocalTableScan [key#5, value#6] > +- InMemoryTableScan [key2#27, value2#28] > +- InMemoryRelation [key2#27, value2#28], true, 10000, > StorageLevel(disk, 1 replicas) > +- *Sort [key2#27 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(key2#27, 4) > +- LocalTableScan [key2#27, value2#28] > {noformat} > notice how the persisted dataframes are not shuffled or sorted anymore before > being used in the join. however if i try to do the same with dataset i have > no luck: > {noformat} > val ds1 = Seq((0, 0), (1, 1)).toDS > > .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY) > val ds2 = Seq((0, 0), (1, 1)).toDS > > .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY) > val joined1 = ds1.joinWith(ds2, ds1("_1") === ds2("_1")) > joined1.explain > == Physical Plan == > *SortMergeJoin [_1#105._1], [_2#106._1], Inner > :- *Sort [_1#105._1 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(_1#105._1, 4) > : +- *Project [named_struct(_1, _1#83, _2, _2#84) AS _1#105] > : +- InMemoryTableScan [_1#83, _2#84] > : +- InMemoryRelation [_1#83, _2#84], true, 10000, > StorageLevel(disk, 1 replicas) > : +- *Sort [_1#83 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(_1#83, 4) > : +- LocalTableScan [_1#83, _2#84] > +- *Sort [_2#106._1 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(_2#106._1, 4) > +- *Project [named_struct(_1, _1#100, _2, _2#101) AS _2#106] > +- InMemoryTableScan [_1#100, _2#101] > +- InMemoryRelation [_1#100, _2#101], true, 10000, > StorageLevel(disk, 1 replicas) > +- *Sort [_1#83 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(_1#83, 4) > +- LocalTableScan [_1#83, _2#84] > {noformat} > notice how my persisted Datasets are shuffled and sorted again. part of the > issue seems to be in joinWith, which does some preprocessing that seems to > confuse the planner. if i change the joinWith to join (which returns a > dataframe) it looks a little better in that only one side gets shuffled > again, but still not optimal: > {noformat} > val ds1 = Seq((0, 0), (1, 1)).toDS > > .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY) > val ds2 = Seq((0, 0), (1, 1)).toDS > > .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY) > val joined1 = ds1.join(ds2, ds1("_1") === ds2("_1")) > joined1.explain > == Physical Plan == > *SortMergeJoin [_1#83], [_1#100], Inner > :- InMemoryTableScan [_1#83, _2#84] > : +- InMemoryRelation [_1#83, _2#84], true, 10000, StorageLevel(disk, 1 > replicas) > : +- *Sort [_1#83 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(_1#83, 4) > : +- LocalTableScan [_1#83, _2#84] > +- *Sort [_1#100 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(_1#100, 4) > +- InMemoryTableScan [_1#100, _2#101] > +- InMemoryRelation [_1#100, _2#101], true, 10000, > StorageLevel(disk, 1 replicas) > +- *Sort [_1#83 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(_1#83, 4) > +- LocalTableScan [_1#83, _2#84] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org