Oops! I forgot to excerpt the errors and warnings from that file:

15/02/12 08:02:03 ERROR TaskSchedulerImpl: Lost executor 4 on compute-0-3.wright: remote Akka client disassociated
15/02/12 08:03:00 WARN TaskSetManager: Lost task 1.0 in stage 28.0 (TID 37, compute-0-1.wright): java.lang.OutOfMemoryError: GC overhead limit exceeded

15/02/12 08:05:06 WARN TaskSetManager: Lost task 0.0 in stage 31.1 (TID 48, compute-0-2.wright): FetchFailed(BlockManagerId(0, wright.cs.umass.edu, 60837), shuffleId=0, mapId=1, reduceId=1, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to wright.cs.umass.edu/128.119.241.146:60837
Caused by: java.io.IOException: Failed to connect to wright.cs.umass.edu/128.119.241.146:60837
Caused by: java.net.ConnectException: Connection refused: wright.cs.umass.edu/128.119.241.146:60837

I see "Lost executor" messages on all nodes, not just the 0-3 one, so it's not node-specific. Any ideas about how to fix this?

Thanks again, matt

On Feb 12, 2015, at 10:46 AM, Matthew Cornell wrote:

> Hi Folks,
>
> I'm running a five-step path-following algorithm on a movie graph with
> 120K vertices and 400K edges. The graph has vertices for actors,
> directors, movies, users, and user ratings, and my Scala code is walking
> the path "rating > movie > rating > user > rating". There are 75K rating
> nodes, and each has ~100 edges. My program iterates over each path item,
> calling aggregateMessages() and then joinVertices() each time, and then
> processes that result on the next iteration. The program never finishes
> the second 'rating' step, which makes sense: IIUC from my
> back-of-the-napkin estimate, the intermediate result would have ~4B
> active vertices.
>
> Spark is version 1.2.0, running in standalone mode on a small cluster of
> five hosts: four compute nodes and a head node. The compute nodes have 4
> cores and 32GB RAM each, and the head node has 32 cores and 128GB RAM.
> After restarting Spark just now, the Master web UI shows 15 workers (5
> dead), two per node, with cores and memory listed as "32 (0 Used)" and
> "125.0 GB (0.0 B Used)" for the two head-node workers, and "4 (0 Used)"
> and "30.5 GB (0.0 B Used)" for the 8 workers running on the compute
> nodes. (Note: I don't understand why it's configured to run two workers
> per node.) The small Spark example programs run to completion.
>
> I've listed the console output at http://pastebin.com/DPECKgQ9 (I'm
> running in spark-shell).
>
> I hope you can provide some advice on things to try next (e.g.,
> configuration vars). My guess is that the cluster is running out of
> memory, though I think it has adequate aggregate RAM to handle this app.
>
> Thanks very much -- matt
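For concreteness, here is a minimal sketch of the kind of aggregateMessages()/joinVertices() loop the quoted message describes, using Spark 1.2's GraphX API. It is an illustration under assumptions, not Matt's actual code: the (kind, count) vertex attribute, the `path` sequence, the `walk` helper, and the assumption that edges point from each path step toward the next are all invented for the example.

    import org.apache.spark.graphx._

    // Vertex attribute assumed for this sketch: the node's kind plus the
    // number of partial paths currently ending at that vertex.
    type V = (String, Long)

    // The path from the post: rating > movie > rating > user > rating.
    val path = Seq("rating", "movie", "rating", "user", "rating")

    // `graph` is assumed to exist already as a Graph[V, Int], with the
    // count set to 1L on "rating" vertices and 0L everywhere else, and
    // with edges oriented from each path step toward the next.
    def walk(graph: Graph[V, Int]): Graph[V, Int] =
      path.sliding(2).foldLeft(graph) { case (g, Seq(from, to)) =>
        // Each vertex of kind `from` with a nonzero count sends that
        // count across edges to adjacent vertices of kind `to`.
        val msgs: VertexRDD[Long] = g.aggregateMessages[Long](
          ctx =>
            if (ctx.srcAttr._1 == from && ctx.dstAttr._1 == to &&
                ctx.srcAttr._2 > 0L)
              ctx.sendToDst(ctx.srcAttr._2),
          _ + _ // merge: sum the counts arriving at the same vertex
        )
        // Reset every count, then fold the received sums back in, so
        // that only vertices reached at this step remain active.
        g.mapVertices { case (_, (kind, _)) => (kind, 0L) }
          .joinVertices(msgs) { case (_, (kind, _), sum) => (kind, sum) }
      }

Note that each step materializes the whole frontier as a VertexRDD before the next hop begins, which is consistent with the blow-up Matt estimates at the second 'rating' step.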
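On the two-workers-per-node observation: in standalone mode the number of workers per machine is set by SPARK_WORKER_INSTANCES in conf/spark-env.sh, and the memory each worker may hand out is SPARK_WORKER_MEMORY; how much each executor actually requests is a separate application setting, spark.executor.memory. A sketch of where these knobs live -- the values shown are placeholders, not recommendations, and the master URL (head node, default port 7077) is a guess:

    # conf/spark-env.sh (standalone mode) -- placeholder values
    SPARK_WORKER_INSTANCES=2   # two workers per machine, as seen in the UI
    SPARK_WORKER_MEMORY=30g    # memory a worker can allocate to executors

    # at application launch: how much of that each executor requests
    ./bin/spark-shell --master spark://wright.cs.umass.edu:7077 \
        --conf spark.executor.memory=24g

Whether raising these values would help here is a separate question; this only shows where the settings live.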