Re: Missing chosen worker error on particular superstep

Kaushik Patnaik Wed, 04 Dec 2013 22:51:30 -0800

Rob,

I implemented all your suggestions and was able to run a graph with 7K
nodes under 5mins. I believe the difference in run times maybe because of
the structure of the graph.

However, when I moved to a larger graph ~27K nodes and 352K edges, Giraph
showed the same error after 50 mins of processing. Is this a timeout issue
? I believe the standard timeout value in Hadoop is 10mins (600000 ms).
I dont know why Giraph should fail now that I am enabling message write to
the disk. Have you faced this issue ?

I will be testing it out on a much larger cluster tomorrow. I hope this
does not repeat there.

Syslogs -

2013-12-04 21:33:35,096 INFO
org.apache.giraph.partition.PartitionUtils: analyzePartitionStats:
Vertices - Mean: 27770, Min: Worker(hostname=localhost, MRtaskID=1,
port=30001) - 27770, Max: Worker(hostname=localhost, MRtaskID=1,
port=30001) - 27770
2013-12-04 21:33:35,096 INFO
org.apache.giraph.partition.PartitionUtils: analyzePartitionStats:
Edges - Mean: 704609, Min: Worker(hostname=localhost, MRtaskID=1,
port=30001) - 704609, Max: Worker(hostname=localhost, MRtaskID=1,
port=30001) - 704609
2013-12-04 21:33:35,184 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out
of 1 workers finished on superstep 2 on path
/_hadoopBsp/job_201312042044_0003/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
2013-12-04 21:33:35,184 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList:
Waiting on [localhost_1]
2013-12-04 21:38:35,202 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out
of 1 workers finished on superstep 2 on path
/_hadoopBsp/job_201312042044_0003/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
2013-12-04 21:38:35,202 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList:
Waiting on [localhost_1]
2013-12-04 21:43:35,208 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out
of 1 workers finished on superstep 2 on path
/_hadoopBsp/job_201312042044_0003/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
2013-12-04 21:43:35,208 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList:
Waiting on [localhost_1]
2013-12-04 21:48:35,214 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out
of 1 workers finished on superstep 2 on path
/_hadoopBsp/job_201312042044_0003/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
2013-12-04 21:48:35,214 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList:
Waiting on [localhost_1]
2013-12-04 21:53:35,220 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out
of 1 workers finished on superstep 2 on path
/_hadoopBsp/job_201312042044_0003/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
2013-12-04 21:53:35,220 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList:
Waiting on [localhost_1]
2013-12-04 21:58:35,225 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out
of 1 workers finished on superstep 2 on path
/_hadoopBsp/job_201312042044_0003/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
2013-12-04 21:58:35,226 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList:
Waiting on [localhost_1]
2013-12-04 22:03:35,231 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out
of 1 workers finished on superstep 2 on path
/_hadoopBsp/job_201312042044_0003/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
2013-12-04 22:03:35,231 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList:
Waiting on [localhost_1]
2013-12-04 22:08:35,237 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out
of 1 workers finished on superstep 2 on path
/_hadoopBsp/job_201312042044_0003/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
2013-12-04 22:08:35,237 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList:
Waiting on [localhost_1]
2013-12-04 22:11:42,170 ERROR
org.apache.giraph.master.BspServiceMaster: superstepChosenWorkerAlive:
Missing chosen worker Worker(hostname=localhost, MRtaskID=1,
port=30001) on superstep 2
2013-12-04 22:11:42,171 INFO org.apache.giraph.master.MasterThread:
masterThread: Coordination of superstep 2 took 2297.944 seconds ended
with state WORKER_FAILURE and is now on superstep 2
2013-12-04 22:11:42,222 FATAL
org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint: No
last good checkpoints can be found, killing the job.

On Wed, Dec 4, 2013 at 4:34 AM, Rob Vesse <rve...@dotnetrdf.org> wrote:

> Kaushik
>
> I'm also running a triangle counting program of my own at the moment and
> have encountered similar problems.
>
> Missing working appears to indicate that one of the workers failed (AFAIK)
> and you'll need to look at the logs for the other map tasks to determine
> why this is.  You appear to have provided those logs and it looks to be the
> same problem I had of out of memory.  Your algorithm appears to be roughly
> similar to mine so what you'll be hitting is that even with this relatively
> small graph you get a massive explosion of messages by the later super
> steps which exhausts memory (in my graph the theoretical maximum messages
> by the last super step was ~3 billion).
>
> In order for me to get a successful run on a single node cluster you'll
> likely have to do some/all of the following:
>
>    - Enable out of core messages by adding appropriate configuration, see
>    http://giraph.apache.org/ooc.html (I set max messages in memory to
>    10,000)
>    - Increase the heap size for the map reduce processes, this can be
>    done with the mapred.child.java.opts setting (I used –Xmx768M –Xms256M).
>     Ideally you should pick values such that the max memory times the number
>    of workers will fit into available RAM as otherwise you'll start hitting
>    swap files and grind to a halt
>    - Increase the number of workers you use (the –w argument you pass to
>    Giraph), you'll need to have also configured your Hadoop cluster to support
>    at least one more map tasks than the number of workers (if not using
>    external Zookeeper) or at least as many map tasks as the number of workers
>    if using external Zookeeper.  Bear in mind that using more workers doesn't
>    necessarily help that much on a single node cluster but it does mean you
>    can keep the heap size settings down lower.
>
> Even with all this my algorithm took 16 minutes to run on a graph with 10K
> vertices and 150K edges, if you have better luck I'd be interested to know.
>
> Hope this helps,
>
> Rob
>
> From: Kaushik Patnaik <kaushikpatn...@gmail.com>
> Reply-To: <user@giraph.apache.org>
> Date: Wednesday, 4 December 2013 04:18
> To: <user@giraph.apache.org>
> Subject: Missing chosen worker error on particular superstep
>
> Hi,
>
> I am getting a missing chosen worker error upon running a custom program
> (triangle counting) on a graph dataset of about 7K nodes and 200K edges.
> The program runs fine on the first two supersteps, but throws a missing
> chosen worker error in the third superstep. In the third superstep each
> node sends messages to all its neighbors.
>
> The program runs fine and outputs results on test datasets of size 100
> nodes and 1000 edges, still I have pasted it below in case I am missing
> something.
>
> I am running the job on a single cluster machine though, could that be a
> problem ?? Will increasing the "mapreduce.task.timeout" help in any
> manner ?
> I also see a Java Heap Source error in the second map task of the same
> run. Are these errors connected ?
>
> Regards
> Kaushik
>
> syslogs from attempt_201312031749_0011_m_000000_0
>
> 2013-12-03 23:10:50,661 INFO org.apache.giraph.master.MasterThread: 
> masterThread: Coordination of superstep 1 took 100.399 seconds ended with 
> state THIS_SUPERSTEP_DONE and is now on superstep 2
> 2013-12-03 23:10:50,774 INFO org.apache.giraph.comm.netty.NettyClient: 
> connectAllAddresses: Successfully added 0 connections, (0 total connected) 0 
> failed, 0 failures total.
> 2013-12-03 23:10:50,774 INFO org.apache.giraph.partition.PartitionBalancer: 
> balancePartitionsAcrossWorkers: Using algorithm static
> 2013-12-03 23:10:50,775 INFO org.apache.giraph.partition.PartitionUtils: 
> analyzePartitionStats: Vertices - Mean: 7115, Min: Worker(hostname=localhost, 
> MRtaskID=1, port=30001) - 7115, Max: Worker(hostname=localhost, MRtaskID=1, 
> port=30001) - 7115
> 2013-12-03 23:10:50,775 INFO org.apache.giraph.partition.PartitionUtils: 
> analyzePartitionStats: Edges - Mean: 201524, Min: Worker(hostname=localhost, 
> MRtaskID=1, port=30001) - 201524, Max: Worker(hostname=localhost, MRtaskID=1, 
> port=30001) - 201524
> 2013-12-03 23:10:50,806 INFO org.apache.giraph.master.BspServiceMaster: 
> barrierOnWorkerList: 0 out of 1 workers finished on superstep 2 on path 
> /_hadoopBsp/job_201312031749_0011/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
> 2013-12-03 23:10:50,806 INFO org.apache.giraph.master.BspServiceMaster: 
> barrierOnWorkerList: Waiting on [localhost_1]
> 2013-12-03 23:11:01,756 ERROR org.apache.giraph.master.BspServiceMaster: 
> superstepChosenWorkerAlive: Missing chosen worker Worker(hostname=localhost, 
> MRtaskID=1, port=30001) on superstep 2
> 2013-12-03 23:11:01,757 INFO org.apache.giraph.master.MasterThread: 
> masterThread: Coordination of superstep 2 took 11.096 seconds ended with 
> state WORKER_FAILURE and is now on superstep 2
> 2013-12-03 23:11:01,761 FATAL org.apache.giraph.master.BspServiceMaster: 
> getLastGoodCheckpoint: No last good checkpoints can be found, killing the job.
>
> attempt_201312031749_0011_m_000001_0
>
> java.lang.IllegalStateException: run: Caught an unrecoverable exception 
> waitFor: ExecutionException occurred while waiting for 
> org.apache.giraph.utils.ProgressableUtils$FutureWaitable@61db327f
>       at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:101)
>       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>       at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>       at org.apache.hadoop.mapred.Child.main(Child.java:253)
> Caused by: java.lang.IllegalStateException: waitFor: ExecutionException 
> occurred while waiting for 
> org.apache.giraph.utils.ProgressableUtils$FutureWaitable@61db327f
>       at 
> org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:181)
>       at 
> org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:139)
>       at 
> org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:124)
>       at 
> org.apache.giraph.utils.ProgressableUtils.getFutureResult(ProgressableUtils.java:87)
>       at 
> org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:221)
>       at 
> org.apache.giraph.graph.GraphTaskManager.processGraphPartitions(GraphTaskManager.java:736)
>       at 
> org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:284)
>       at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:91)
>       ... 7 more
> Caused by: java.util.concurrent.ExecutionException: 
> java.lang.OutOfMemoryError: Java heap space
>       at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:262)
>       at java.util.concurrent.FutureTask.get(FutureTask.java:119)
>       at 
> org.apache.giraph.utils.ProgressableUtils$FutureWaitable.waitFor(ProgressableUtils.java:300)
>       at 
> org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:173)
>
>
> Program -
>
> public class TriangleCounting extends BasicComputation<
>     DoubleWritable, DoubleWritable, FloatWritable, DoubleWritable> {
>
>   /** Class logger */
>   private static final Logger LOG = Logger.getLogger(TriangleCounting.class);
>
>   @Override
>   public void compute(
>       Vertex<DoubleWritable, DoubleWritable, FloatWritable> vertex,
>       Iterable<DoubleWritable> messages) throws IOException {
>               
>   /** First Superstep releases messages to vertexIds() whose value is greater 
> than its value. Both VertexId and Message are double **/                
>   if (getSuperstep() == 0) {
>     for (Edge<DoubleWritable, FloatWritable> edge: vertex.getEdges()) {
>         if (edge.getTargetVertexId().compareTo(vertex.getId()) == 1) {
>               sendMessage(edge.getTargetVertexId(), vertex.getId());
>         if (LOG.isDebugEnabled()) {
>           LOG.debug("Vertex " + vertex.getId() + " sent message " + 
> vertex.getId() + " to vertex " + edge.getTargetVertexId());
>         }
>         System.out.println("Vertex " + vertex.getId() + " sent message " + 
> vertex.getId() + " to vertex " + edge.getTargetVertexId());
>         }
>     }
>   }
>
>   /** Second superstep releases messages to message.get() < vertex.getId() < 
> targetVertexId() **/
>   if (getSuperstep() == 1) {
>       for (DoubleWritable message: messages) {
>           for (Edge<DoubleWritable, FloatWritable> edge: vertex.getEdges()) {
>                 if (edge.getTargetVertexId().compareTo(vertex.getId()) + 
> vertex.getId().compareTo(message) == 2) {
>                   sendMessage(edge.getTargetVertexId(), message);
>               if (LOG.isDebugEnabled()) {
>                 LOG.debug("Vertex " + vertex.getId() + " sent message " + 
> message + " to vertex " + edge.getTargetVertexId());
>               }
>               System.out.println("Vertex " + vertex.getId() + " sent message 
> " + message + " to vertex " + edge.getTargetVertexId());
>                 }
>           }
>     }
>   }
>   /** Sends messages to all its neighbours, the messages it receives **/
>   if (getSuperstep() == 2) {
>     for (DoubleWritable message: messages) {
>               sendMessageToAllEdges(vertex, message);
>       }
>   }
>
>   if (getSuperstep() == 3) {
>       double Value = 0.0;
>         for (DoubleWritable message: messages) {
>                 if (vertex.getId().equals(message)) {
>               Value += 1.0;
>               System.out.println("Vertex " + vertex.getId() + " received 
> message " + message);
>             }
>       }
>       vertex.setValue(new DoubleWritable(Value));
>   }
>
>   vertex.voteToHalt();
>   }
> }
>
>

Re: Missing chosen worker error on particular superstep

Reply via email to