The variability in task completion times could be caused by variability in the amount of work that those tasks perform rather than slow or faulty nodes.
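One way to check is to quantify how skewed the input graph itself is, for example by looking at the in-degree distribution of the link graph. A minimal sketch, assuming the edges are already available as an RDD of (source, destination) pairs; the edges name and its layout are just illustrative, not your actual code:

    import org.apache.spark.SparkContext._   // pair-RDD implicits (Spark 0.8)

    // Rough skew check: count inlinks per destination vertex and inspect
    // the heaviest ones. Assumes edges: RDD[(String, String)] holds
    // (source, destination) pairs; adjust the key type to your graph.
    val inDegrees = edges
      .map { case (_, dst) => (dst, 1L) }
      .reduceByKey(_ + _)

    // Top 20 most-pointed-to pages; if a handful of them dominate, expect
    // both message and shuffle skew in the iterative stages.
    inDegrees.map(_.swap)
      .sortByKey(ascending = false)
      .take(20)
      .foreach { case (count, page) => println(page + "\t" + count) }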
For PageRank, consider a link graph that contains a few disproportionately popular webpages with many inlinks (such as Yahoo.com). These high-degree nodes can cause significant communication imbalance in a Pregel-like model because they receive and send many messages. If you look at the distribution of shuffled data sizes, does it exhibit skew similar to the task completion times? The PowerGraph paper gives a good overview of the challenges posed by these kinds of large-scale natural graphs and develops techniques to split up and parallelize the processing of high-degree nodes: http://graphlab.org/powergraph-presented-at-osdi/

On Thu, Dec 5, 2013 at 6:54 AM, Mayuresh Kunjir <mayuresh.kun...@gmail.com> wrote:

> Thanks Jay for your response. Stragglers are a big problem here. I am
> seeing such tasks in many stages of the workflow on a consistent basis.
> It's not due to any particular nodes being slow, since the slow tasks are
> observed on all the nodes at different points in time. The distribution
> of task completion times is too skewed for my liking. GC delays are a
> possible reason, but I am just speculating.
>
> ~Mayuresh
>
>
> On Thu, Dec 5, 2013 at 5:31 AM, huangjay <ja...@live.cn> wrote:
>
>> Hi,
>>
>> Maybe you need to check those nodes. They are very slow:
>>
>>   3487  SUCCESS  PROCESS_LOCAL  ip-10-60-150-111.ec2.internal  2013/12/01 02:11:38  17.7 m  16.3 m  23.3 MB
>>   3447  SUCCESS  PROCESS_LOCAL  ip-10-12-54-63.ec2.internal    2013/12/01 02:11:26  20.1 m  13.9 m  50.9 MB
>>
>> On Dec 1, 2013, at 10:59 AM, "Mayuresh Kunjir" <mayuresh.kun...@gmail.com> wrote:
>>
>> I tried passing the DISK_ONLY storage level to Bagel's run method. It's
>> running without any error (so far) but is too slow. I am attaching details
>> for a stage corresponding to the second iteration of my algorithm (foreach at
>> Bagel.scala:237 <http://ec2-54-234-176-171.compute-1.amazonaws.com:4040/stages/stage?id=23>).
>> It's been running for more than 35 minutes. I am noticing very high GC time
>> for some tasks. Listing the setup parameters below:
>>
>> #nodes = 16
>> SPARK_WORKER_MEMORY = 13G
>> SPARK_MEM = 13G
>> RDD storage fraction = 0.5
>> degree of parallelism = 192 (16 nodes * 4 cores each * 3)
>> Serializer = Kryo
>> Vertex data size after serialization = ~12G (probably too high, but it's
>> the bare minimum required for the algorithm.)
>>
>> I would be grateful if you could suggest some further optimizations or
>> point out reasons why/if Bagel is not suitable for this data size. I need
>> to scale my cluster further and am not feeling confident at all looking
>> at this.
>>
>> Thanks and regards,
>> ~Mayuresh
>>
>>
>> On Sat, Nov 30, 2013 at 3:07 PM, Mayuresh Kunjir <mayuresh.kun...@gmail.com> wrote:
>>
>>> Hi Spark users,
>>>
>>> I am running a PageRank-style algorithm on Bagel and bumping into "out
>>> of memory" issues with it.
>>>
>>> Referring to the following table, rdd_120 is the RDD of vertices,
>>> serialized and compressed in memory. On each iteration, Bagel deserializes
>>> the compressed RDD; e.g. rdd_126 is the uncompressed version of rdd_120
>>> persisted in memory and on disk. As iterations keep piling on, the cached
>>> partitions start getting evicted. The moment an rdd_120 partition gets
>>> evicted, it necessitates a recomputation and performance goes for a toss.
>>> Although we don't need the uncompressed RDDs from previous iterations,
>>> they are the last ones to get evicted thanks to the LRU policy.
>>>
>>> Should I make Bagel use DISK_ONLY persistence? How much of a performance
>>> hit would that be?
>>> Or maybe there is a better solution here.
>>>
>>> Storage:
>>>
>>>   RDD Name  Storage Level                            Cached Partitions  Fraction Cached  Size in Memory  Size on Disk
>>>   rdd_83    Memory Serialized 1x Replicated          23                 12%              83.7 MB         0.0 B
>>>   rdd_95    Memory Serialized 1x Replicated          23                 12%              2.5 MB          0.0 B
>>>   rdd_120   Memory Serialized 1x Replicated          25                 13%              761.1 MB        0.0 B
>>>   rdd_126   Disk Memory Deserialized 1x Replicated   192                100%             77.9 GB         1016.5 MB
>>>   rdd_134   Disk Memory Deserialized 1x Replicated   185                96%              60.8 GB         475.4 MB
>>>
>>> Thanks and regards,
>>> ~Mayuresh
>>
>> <BigFrame - Details for Stage 23.htm>
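Coming back to the storage table above: with the RDD storage fraction of 0.5 mentioned earlier, the 16 workers can cache roughly 16 x 13 GB x 0.5, i.e. about 104 GB, while a single iteration's deserialized vertex RDD (rdd_126) is already about 78 GB. Two consecutive iterations therefore cannot both stay cached, and evicting the serialized vertex RDD is exactly what LRU will end up doing. If the root problem is that every iteration's deserialized RDD stays cached, one option is to hand-roll the superstep loop and drop the previous iteration explicitly instead of relying on eviction. A rough sketch only (the VertexState class, the PageRank-style update, and all the names in it are placeholders, not the actual Bagel or BigFrame code):

    import org.apache.spark.SparkContext._          // pair-RDD implicits (Spark 0.8)
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // Placeholder vertex state; the real vertex class will differ.
    case class VertexState(rank: Double, outDegree: Int)

    def runRanks(vertices: RDD[(Long, VertexState)],
                 edges: RDD[(Long, Long)],          // (source, destination)
                 numIters: Int): RDD[(Long, VertexState)] = {
      var verts = vertices.persist(StorageLevel.MEMORY_AND_DISK_SER)
      for (i <- 1 to numIters) {
        // Messages: each source spreads rank / outDegree along its edges,
        // combined per destination (reduceByKey does map-side combining).
        val msgs = edges.join(verts)
          .map { case (_, (dst, v)) => (dst, v.rank / v.outDegree) }
          .reduceByKey(_ + _)

        // Apply messages (a PageRank-style update, purely illustrative).
        val newVerts = verts.leftOuterJoin(msgs)
          .mapValues { case (v, sum) =>
            VertexState(0.15 + 0.85 * sum.getOrElse(0.0), v.outDegree)
          }
          .persist(StorageLevel.MEMORY_AND_DISK_SER)

        newVerts.count()   // materialize the new iteration first
        verts.unpersist()  // then drop the old one instead of waiting for LRU
        verts = newVerts
      }
      verts
    }

The same loop works with StorageLevel.DISK_ONLY if memory pressure remains a problem, and keeping the cached copies serialized (MEMORY_AND_DISK_SER) tends to be much easier on the garbage collector than the deserialized caching seen for rdd_126 above.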