Hello,

Data about my Spark job is below. My source data is only 916.2 MB (stage 1)
and 231.2 MB (stage 0), but when I join the two data sets (stage 2) it takes
a very long time, and the shuffle read is 614.6 GB. Is this expected? Both
data sets produce 200 partitions.
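
As a first check, here is a minimal sketch of how I can print the physical
plan to confirm the planner is doing a shuffle join rather than a broadcast
join (hc is my HiveContext, and query is the same SQL used in the snippet
further down):

// Print the physical plan: a shuffle join shows up as ShuffledHashJoin
// (or SortMergeJoin), a broadcast join as BroadcastHashJoin.
hc.sql(query).explain()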

Stage 2: saveAsTable at Driver.scala:269
  Submitted: 2015/10/22 18:48:12 | Duration: 2.3 h | Tasks: 200/200
  Shuffle Read: 614.6 GB
  http://sparkhs.rfiserve.net:18080/history/application_1437606252645_1034031/stages/stage?id=2&attempt=0

Stage 1: saveAsTable at Driver.scala:269
  Submitted: 2015/10/22 18:46:02 | Duration: 2.1 min | Tasks: 8/8
  Input: 916.2 MB | Shuffle Write: 3.9 MB
  http://sparkhs.rfiserve.net:18080/history/application_1437606252645_1034031/stages/stage?id=1&attempt=0

Stage 0: saveAsTable at Driver.scala:269
  Submitted: 2015/10/22 18:46:02 | Duration: 35 s | Tasks: 3/3
  Input: 231.2 MB | Shuffle Write: 4.8 MB
  http://sparkhs.rfiserve.net:18080/history/application_1437606252645_1034031/stages/stage?id=0&attempt=0

I am running Spark 1.4.1, and the code snippet that joins the two data sets
is:

hc.sql(query)
  .mapPartitions { iter =>
    iter.map {
      case Row(
        ...
        ...
        ...
      ) => ...
    }
  }
  .toDF()
  .groupBy("id_1", "id_2", "day_hour", "day_hour_2")
  .agg($"id_1", $"id_2", $"day_hour", $"day_hour_2",
    sum("attr1").alias("attr1"), sum("attr2").alias("attr2"))


Please advise on how to reduce the shuffle and speed this up.
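
One idea I had, since the smaller input is only a few hundred MB: raise
spark.sql.autoBroadcastJoinThreshold above its size so the planner can
broadcast that side and skip the shuffle entirely. A minimal sketch (the
512 MB value is just an example, and this only kicks in if the planner
knows the table's size, e.g. from Hive statistics):

// Allow tables up to 512 MB to be broadcast (default threshold: 10 MB).
hc.sql("SET spark.sql.autoBroadcastJoinThreshold=" + (512L * 1024 * 1024))

val joined = hc.sql(query)   // same join as above
joined.explain()             // look for BroadcastHashJoin in the plan

Does that make sense here, or is there a better way?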


~Pratik
