Hi All,

I am running my Spark job via spark-submit (yarn-client mode). In this PoC phase I am loading a ~200 MB TSV file and doing computation over strings, generating a few small output files (KB range). The eventual goal is to load a ~250 GB input file instead of the 200 MB one and run the same analytics.

If I specify a large number of partitions in sc.textFile, the job proceeds, but slowly. I have been experimenting with the partition count, and in this run I didn't set it in my code at all. The job still runs very slowly; I see the messages below on the console, and a CDH screen capture showing the metrics is attached. This run started around 1 pm. The disk throughput and IOs are quite low. I am sure Spark can handle an order of magnitude more than this small dataset. Any pointers to parameter settings or debugging tips would be great.
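For context on what I've been trying: rather than hard-coding a partition count, I've been considering deriving it from input size. This is just a sketch of the sizing arithmetic (PartitionHint and partitionsFor are my own hypothetical helper, not a Spark API), aiming for roughly HDFS-block-sized partitions with a floor so small inputs still spread across cores:

```scala
// Hypothetical helper for picking a minPartitions value for sc.textFile.
// Assumption: ~128 MB per partition is a reasonable target (HDFS block size).
object PartitionHint {
  def partitionsFor(inputBytes: Long,
                    targetMB: Int = 128,
                    minPartitions: Int = 8): Int = {
    val target = targetMB.toLong * 1024 * 1024
    // Ceiling division, but never fewer than minPartitions.
    math.max(minPartitions, ((inputBytes + target - 1) / target).toInt)
  }
}
```

With this, a 200 MB file would still get the floor of 8 partitions, while 250 GB would get about 2000, e.g. `sc.textFile(path, PartitionHint.partitionsFor(fileSizeBytes))`.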
Console messages:
14/08/06 13:06:07 INFO TaskSetManager: Finished TID 2 in 19714 ms on pzxdcc0151.x.y.org (progress: 1/2)
14/08/06 13:06:08 INFO TaskSetManager: Starting task 3.0:0 as TID 4 on executor 2: pzxdcc0248.x.y.org (NODE_LOCAL)
14/08/06 13:06:08 INFO TaskSetManager: Serialized task 3.0:0 as 2375 bytes in 0 ms
14/08/06 13:06:08 INFO DAGScheduler: Completed ShuffleMapTask(2, 1)
14/08/06 13:06:08 INFO TaskSetManager: Finished TID 1 in 20276 ms on pzxdcc0248.x.y.org (progress: 2/2)
14/08/06 13:06:08 INFO DAGScheduler: Stage 2 (map at TwoScores.scala:53) finished in 20.287 s
14/08/06 13:06:08 INFO YarnClientClusterScheduler: Removed TaskSet 2.0, whose tasks have all completed, from pool
14/08/06 13:06:08 INFO DAGScheduler: looking for newly runnable stages
14/08/06 13:06:08 INFO DAGScheduler: running: Set(Stage 3)
14/08/06 13:06:08 INFO DAGScheduler: waiting: Set(Stage 1)
14/08/06 13:06:08 INFO DAGScheduler: failed: Set()
14/08/06 13:06:08 INFO DAGScheduler: Missing parents for Stage 1: List(Stage 3)