Maybe it's the JIT? During the first stage, the scheduler and serialization code hasn't been JIT-compiled yet, so it is still running in the interpreter.

On Wed, May 13, 2015 at 9:18 AM, Akshat Aranya <aara...@gmail.com> wrote:
> Hi,
>
> Any input on this? I'm willing to instrument further and experiment
> if there are any ideas.
>
> On Mon, May 4, 2015 at 11:27 AM, Akshat Aranya <aara...@gmail.com> wrote:
> > Hi,
> >
> > I have been investigating scheduling delays in Spark and I found some
> > unexplained anomalies. In my use case, I have two stages after
> > collapsing the transformations: the first is a mapPartitions() and the
> > second is a sortByKey(). I found that the task serialization for the
> > first stage takes much longer than the second.
> >
> > 1. mapPartitions() - this launches 256 tasks in 603 ms (avg. 2.363
> > ms). Each task serializes to 1220 bytes.
> > 2. sortByKey() - this launches 64 tasks in 12 ms (avg. 0.187 ms). Each
> > task serializes to 1139 bytes.
> >
> > Note that the serialized size of the task is similar, but the avg.
> > scheduling time is very different. I also instrumented the code to
> > print out the serialization time, and it seems like it is indeed the
> > serialization that takes much longer. This seemed weird to me because
> > the biggest part of the Task, the taskBinary is actually directly
> > copied from a byte array.
> >
> > Any explanation of why this happens?
> >
> > Thanks,
> > Akshat
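
One way to check the JIT guess outside Spark is to time repeated batches of Java serialization in a fresh JVM and see whether the first batch is much slower than the later ones. The sketch below is only an illustration under assumptions: FakeTask, JitWarmup, and the 1 KB payload are hypothetical stand-ins and not Spark's Task or scheduler code, and the absolute timings will vary by machine and JVM.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class JitWarmup {
    // Hypothetical stand-in for a task descriptor; NOT Spark's Task class.
    static class FakeTask implements Serializable {
        private static final long serialVersionUID = 1L;
        final int stageId;
        final int partitionId;
        final byte[] taskBinary; // assumed payload, roughly 1 KB as in the thread

        FakeTask(int stageId, int partitionId, byte[] taskBinary) {
            this.stageId = stageId;
            this.partitionId = partitionId;
            this.taskBinary = taskBinary;
        }
    }

    // Serialize one object with plain Java serialization.
    static byte[] serialize(Object obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Time `batches` consecutive batches of `batchSize` serializations each.
    // Returns the elapsed nanoseconds per batch, in execution order.
    static long[] timeBatches(int batches, int batchSize) {
        byte[] payload = new byte[1024];
        long[] nanosPerBatch = new long[batches];
        long sink = 0; // consume results so the work cannot be optimized away
        for (int b = 0; b < batches; b++) {
            long start = System.nanoTime();
            for (int i = 0; i < batchSize; i++) {
                sink += serialize(new FakeTask(b, i, payload)).length;
            }
            nanosPerBatch[b] = System.nanoTime() - start;
        }
        if (sink == 0) {
            throw new IllegalStateException("unexpected empty serialized output");
        }
        return nanosPerBatch;
    }

    public static void main(String[] args) {
        int batchSize = 256; // same task count as the first stage in the thread
        long[] nanos = timeBatches(6, batchSize);
        for (int b = 0; b < nanos.length; b++) {
            double ms = nanos[b] / 1e6;
            System.out.printf("batch %d: %.3f ms total, %.4f ms per task%n",
                    b, ms, ms / batchSize);
        }
    }
}
```

If warm-up is the cause, batch 0 should stand out and the later batches should settle to a much lower per-task time. Running the same class with the HotSpot flag -Xint (interpreter only) should keep every batch slow, and -XX:+PrintCompilation shows when the serialization methods get compiled. Class loading on first use also contributes to the first batch, so this separates "cold JVM" from "steady state" rather than isolating the JIT alone.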