Hi All,

I am running my Spark job via spark-submit (yarn-client mode). In this PoC phase I am loading a ~200 MB TSV file and doing computation over strings, generating a few small output files (KB range). The eventual goal is to load a ~250 GB input file instead of the 200 MB one and run the same analytics.

If I specify a large number of partitions in sc.textFile, the job proceeds, but slowly. I have been experimenting with the partition count, and in this run I didn't set it in my code at all. The job still runs very slowly; I see the messages below on the console, and a CDH screen capture showing the metrics is attached. This run started around 1 pm. The disk throughput and IOs are quite low. I am sure Spark can handle an order of magnitude more than this small dataset. Any pointers to parameter settings or debugging tips would be great.
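For context on what I've been trying: rather than hard-coding a partition count, I've been considering deriving it from input size. This is just a sketch of the sizing arithmetic (PartitionHint and partitionsFor are my own hypothetical helper, not a Spark API), aiming for roughly HDFS-block-sized partitions with a floor so small inputs still spread across cores:

```scala
// Hypothetical helper for picking a minPartitions value for sc.textFile.
// Assumption: ~128 MB per partition is a reasonable target (HDFS block size).
object PartitionHint {
  def partitionsFor(inputBytes: Long,
                    targetMB: Int = 128,
                    minPartitions: Int = 8): Int = {
    val target = targetMB.toLong * 1024 * 1024
    // Ceiling division, but never fewer than minPartitions.
    math.max(minPartitions, ((inputBytes + target - 1) / target).toInt)
  }
}
```

With this, a 200 MB file would still get the floor of 8 partitions, while 250 GB would get about 2000, e.g. `sc.textFile(path, PartitionHint.partitionsFor(fileSizeBytes))`.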
Console messages:
14/08/06 13:06:07 INFO TaskSetManager: Finished TID 2 in 19714 ms on pzxdcc0151.x.y.org (progress: 1/2)
14/08/06 13:06:08 INFO TaskSetManager: Starting task 3.0:0 as TID 4 on executor 2: pzxdcc0248.x.y.org (NODE_LOCAL)
14/08/06 13:06:08 INFO TaskSetManager: Serialized task 3.0:0 as 2375 bytes in 0 ms
14/08/06 13:06:08 INFO DAGScheduler: Completed ShuffleMapTask(2, 1)
14/08/06 13:06:08 INFO TaskSetManager: Finished TID 1 in 20276 ms on pzxdcc0248.x.y.org (progress: 2/2)
14/08/06 13:06:08 INFO DAGScheduler: Stage 2 (map at TwoScores.scala:53) finished in 20.287 s
14/08/06 13:06:08 INFO YarnClientClusterScheduler: Removed TaskSet 2.0, whose tasks have all completed, from pool
14/08/06 13:06:08 INFO DAGScheduler: looking for newly runnable stages
14/08/06 13:06:08 INFO DAGScheduler: running: Set(Stage 3)
14/08/06 13:06:08 INFO DAGScheduler: waiting: Set(Stage 1)
14/08/06 13:06:08 INFO DAGScheduler: failed: Set()
14/08/06 13:06:08 INFO DAGScheduler: Missing parents for Stage 1: List(Stage 3)