Oops! I forgot to excerpt the errors and warnings from that file:

15/02/12 08:02:03 ERROR TaskSchedulerImpl: Lost executor 4 on compute-0-3.wright: remote Akka client disassociated
15/02/12 08:03:00 WARN TaskSetManager: Lost task 1.0 in stage 28.0 (TID 37, compute-0-1.wright): java.lang.OutOfMemoryError: GC overhead limit exceeded

15/02/12 08:05:06 WARN TaskSetManager: Lost task 0.0 in stage 31.1 (TID 48, compute-0-2.wright): FetchFailed(BlockManagerId(0, wright.cs.umass.edu, 60837), shuffleId=0, mapId=1, reduceId=1, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to wright.cs.umass.edu/128.119.241.146:60837
Caused by: java.io.IOException: Failed to connect to wright.cs.umass.edu/128.119.241.146:60837
Caused by: java.net.ConnectException: Connection refused: wright.cs.umass.edu/128.119.241.146:60837

I see "Lost executor" messages on all nodes, not just the 0-3 one, so it's not node-specific. Any ideas about how to fix this?

Thanks again, matt

On Feb 12, 2015, at 10:46 AM, Matthew Cornell wrote:

> Hi Folks,
>
> I'm running a five-step path-following algorithm on a movie graph with
> 120K vertices and 400K edges. The graph has vertices for actors,
> directors, movies, users, and user ratings, and my Scala code is walking
> the path "rating > movie > rating > user > rating". There are 75K rating
> nodes, and each has ~100 edges. My program iterates over each path item,
> calling aggregateMessages() and then joinVertices() each time, and then
> processes that result on the next iteration. The program never finishes
> the second 'rating' step, which makes sense: IIUC from my
> back-of-the-napkin estimate, the intermediate result would have ~4B
> active vertices.
>
> Spark is version 1.2.0, running in standalone mode on a small cluster of
> five hosts: four compute nodes and a head node. The compute nodes have 4
> cores and 32GB RAM each, and the head node has 32 cores and 128GB RAM.
> After restarting Spark just now, the Master web UI shows 15 workers (5
> dead), two per node, with cores and memory listed as "32 (0 Used)" and
> "125.0 GB (0.0 B Used)" for the two head-node workers, and "4 (0 Used)"
> and "30.5 GB (0.0 B Used)" for the 8 workers running on the compute
> nodes. (Note: I don't understand why it's configured to run two workers
> per node.) The small Spark example programs run to completion.
>
> I've listed the console output at http://pastebin.com/DPECKgQ9 (I'm
> running in spark-shell).
>
> I hope you can provide some advice on things to try next (e.g.,
> configuration vars). My guess is that the cluster is running out of
> memory, though I think it has adequate aggregate RAM to handle this app.
>
> Thanks very much -- matt
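For concreteness, here is a minimal sketch of the kind of aggregateMessages()/joinVertices() loop the quoted message describes, using Spark 1.2's GraphX API. It is an illustration under assumptions, not Matt's actual code: the (kind, count) vertex attribute, the `path` sequence, the `walk` helper, and the assumption that edges point from each path step toward the next are all invented for the example.

    import org.apache.spark.graphx._

    // Vertex attribute assumed for this sketch: the node's kind plus the
    // number of partial paths currently ending at that vertex.
    type V = (String, Long)

    // The path from the post: rating > movie > rating > user > rating.
    val path = Seq("rating", "movie", "rating", "user", "rating")

    // `graph` is assumed to exist already as a Graph[V, Int], with the
    // count set to 1L on "rating" vertices and 0L everywhere else, and
    // with edges oriented from each path step toward the next.
    def walk(graph: Graph[V, Int]): Graph[V, Int] =
      path.sliding(2).foldLeft(graph) { case (g, Seq(from, to)) =>
        // Each vertex of kind `from` with a nonzero count sends that
        // count across edges to adjacent vertices of kind `to`.
        val msgs: VertexRDD[Long] = g.aggregateMessages[Long](
          ctx =>
            if (ctx.srcAttr._1 == from && ctx.dstAttr._1 == to &&
                ctx.srcAttr._2 > 0L)
              ctx.sendToDst(ctx.srcAttr._2),
          _ + _ // merge: sum the counts arriving at the same vertex
        )
        // Reset every count, then fold the received sums back in, so
        // that only vertices reached at this step remain active.
        g.mapVertices { case (_, (kind, _)) => (kind, 0L) }
          .joinVertices(msgs) { case (_, (kind, _), sum) => (kind, sum) }
      }

Note that each step materializes the whole frontier as a VertexRDD before the next hop begins, which is consistent with the blow-up Matt estimates at the second 'rating' step.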
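On the two-workers-per-node observation: in standalone mode the number of workers per machine is set by SPARK_WORKER_INSTANCES in conf/spark-env.sh, and the memory each worker may hand out is SPARK_WORKER_MEMORY; how much each executor actually requests is a separate application setting, spark.executor.memory. A sketch of where these knobs live -- the values shown are placeholders, not recommendations, and the master URL (head node, default port 7077) is a guess:

    # conf/spark-env.sh (standalone mode) -- placeholder values
    SPARK_WORKER_INSTANCES=2   # two workers per machine, as seen in the UI
    SPARK_WORKER_MEMORY=30g    # memory a worker can allocate to executors

    # at application launch: how much of that each executor requests
    ./bin/spark-shell --master spark://wright.cs.umass.edu:7077 \
        --conf spark.executor.memory=24g

Whether raising these values would help here is a separate question; this only shows where the settings live.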