Hi,

Any input on this? I'm willing to add more instrumentation and run experiments if anyone has ideas.

On Mon, May 4, 2015 at 11:27 AM, Akshat Aranya <aara...@gmail.com> wrote:
> Hi,
>
> I have been investigating scheduling delays in Spark and I found some
> unexplained anomalies. In my use case, I have two stages after
> collapsing the transformations: the first is a mapPartitions() and the
> second is a sortByKey(). I found that the task serialization for the
> first stage takes much longer than the second.
>
> 1. mapPartitions() - this launches 256 tasks in 603 ms (avg. 2.363
>    ms). Each task serializes to 1220 bytes.
> 2. sortByKey() - this launches 64 tasks in 12 ms (avg. 0.187 ms). Each
>    task serializes to 1139 bytes.
>
> Note that the serialized size of the task is similar, but the avg.
> scheduling time is very different. I also instrumented the code to
> print out the serialization time, and it seems like it is indeed the
> serialization that takes much longer. This seemed weird to me because
> the biggest part of the Task, the taskBinary is actually directly
> copied from a byte array.
>
> Any explanation of why this happens?
>
> Thanks,
> Akshat
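
For anyone who wants to reproduce this kind of measurement outside Spark, here is a minimal sketch that times each serialization individually rather than only reporting the average. It is not Spark's scheduler code: `SerializationTimer`, `FakeTask`, and the 1024-byte payload are hypothetical stand-ins, and it assumes plain Java serialization via `ObjectOutputStream`. Recording the timings in launch order would show whether the cost is spread evenly across the 256 tasks or concentrated in a few of them.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationTimer {

    /** Hypothetical stand-in for a task: two small ids plus a pre-built byte array. */
    static final class FakeTask implements Serializable {
        private static final long serialVersionUID = 1L;
        final int stageId;
        final int partitionId;
        final byte[] taskBinary;

        FakeTask(int stageId, int partitionId, byte[] taskBinary) {
            this.stageId = stageId;
            this.partitionId = partitionId;
            this.taskBinary = taskBinary;
        }
    }

    /** Serializes obj with plain Java serialization and returns the resulting bytes. */
    static byte[] serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.toByteArray();
    }

    /** Returns the elapsed nanoseconds of each of n serializations, in launch order. */
    static long[] timeSerializations(int n, int payloadBytes) throws IOException {
        byte[] payload = new byte[payloadBytes]; // shared payload, like a common task binary
        long[] nanos = new long[n];
        for (int i = 0; i < n; i++) {
            FakeTask task = new FakeTask(0, i, payload);
            long start = System.nanoTime();
            serialize(task);
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    public static void main(String[] args) throws IOException {
        long[] nanos = timeSerializations(256, 1024);
        long total = 0;
        for (long t : nanos) {
            total += t;
        }
        // Print the first and last timings next to the average so skew is visible.
        System.out.printf("first=%d us, last=%d us, avg=%.1f us%n",
                nanos[0] / 1000, nanos[nanos.length - 1] / 1000,
                total / 1000.0 / nanos.length);
    }
}
```

The same per-task timing could be wrapped around the real serialization call in the scheduler; the numbers from this standalone sketch say nothing about Spark itself and are only useful as a baseline for how long serializing an object of roughly this size takes.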