This is the stack trace of the worker thread:

org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:150)
org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:130)
org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:60)
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
org.apache.spark.scheduler.Task.run(Task.scala:64)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
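For context, and reading the frames against the Spark 1.x source rather than anything specific to this job: HashShuffleReader.read calls Aggregator.combineValuesByKey when the shuffle dependency has an aggregator but no map-side combining, and combineValuesByKey inserts every fetched record into an ExternalAppendOnlyMap, whose per-record update is AppendOnlyMap.changeValue. A minimal sketch of the kind of operation that produces exactly these frames (the data, key layout, and object name below are made-up placeholders, not our actual job):

import org.apache.spark.{SparkConf, SparkContext}

object ChangeValueHotPathSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("changeValue-hot-path-sketch").setMaster("local[2]"))

    // Placeholder (key, value) records; the real job processes ~0.5 TB over several stages.
    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 1000, i.toLong))

    // groupByKey runs without map-side combining, so on the reduce side each fetched record
    // goes through Aggregator.combineValuesByKey -> ExternalAppendOnlyMap.insertAll ->
    // AppendOnlyMap.changeValue, i.e. the frames in the trace above.
    val grouped = pairs.groupByKey(100)

    grouped.count()
    sc.stop()
  }
}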
On 8 May 2015 at 22:12, Josh Rosen <rosenvi...@gmail.com> wrote:

> Do you have any more specific profiling data that you can share? I'm
> curious to know where AppendOnlyMap.changeValue is being called from.
>
> On Fri, May 8, 2015 at 1:26 PM, Michal Haris <michal.ha...@visualdna.com>
> wrote:
>
>> +dev
>> On 6 May 2015 10:45, "Michal Haris" <michal.ha...@visualdna.com> wrote:
>>
>> > Just wanted to check if somebody has seen similar behaviour or knows
>> > what we might be doing wrong. We have a relatively complex Spark
>> > application which processes half a terabyte of data at various stages.
>> > We have profiled it in several ways and everything seems to point to
>> > one place where 90% of the time is spent: AppendOnlyMap.changeValue.
>> > The job scales and is relatively faster than its map-reduce
>> > alternative, but it still feels slower than it should be. I suspect
>> > too much spill, but I haven't seen any improvement from increasing the
>> > number of partitions to 10k. Any ideas would be appreciated.

--
Michal Haris
Technical Architect
direct line: +44 (0) 207 749 0229
www.visualdna.com | t: +44 (0) 207 734 7033
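On the spill suspicion above: the quickest check is the per-stage shuffle spill metrics in the web UI, and the two levers in Spark 1.x are the shuffle memory fraction and the reduce-side partition count. The sketch below uses the Spark 1.x property names with illustrative values only; nothing in it is taken from the actual job:

import org.apache.spark.{SparkConf, SparkContext}

object SpillTuningSketch {
  def main(args: Array[String]): Unit = {
    // Spark 1.x property names; the values are guesses to experiment with.
    val conf = new SparkConf()
      .setAppName("spill-tuning-sketch")
      .setMaster("local[2]")
      // Fraction of the executor heap that shuffle aggregation maps may use before they
      // spill to disk (default 0.2 in Spark 1.x); if you raise it you may want to lower
      // spark.storage.memoryFraction to keep headroom for cached blocks.
      .set("spark.shuffle.memoryFraction", "0.4")

    val sc = new SparkContext(conf)

    // The other lever mentioned in the thread is the reduce-side partition count:
    // more partitions mean smaller per-task aggregation maps and, in principle,
    // fewer spills per task.
    val pairs = sc.parallelize(1 to 100000).map(i => (i % 100, i))
    pairs.groupByKey(10000).count()

    sc.stop()
  }
}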