Re: about fault tolerance in Giraph

2013-03-18 Thread Yuanyuan Tian
I have tried increasing the number of retries to 10, but the result is the 
same: no retry at all. Isn't retrying failed tasks the default behavior for 
Hadoop? Why isn't it working in the case of Giraph? 

Here is the message from master: 

2013-03-18 18:23:30,628 ERROR org.apache.giraph.graph.BspServiceMaster: 
checkWorkers: Did not receive enough processes in time (only 54 of 55 
required).  This occurs if you do not have enough map tasks available 
simultaneously on your Hadoop instance to fulfill the number of requested 
workers.
2013-03-18 18:23:30,628 FATAL org.apache.giraph.graph.BspServiceMaster: 
coordinateSuperstep: Not enough healthy workers for superstep 12
2013-03-18 18:23:30,629 INFO org.apache.giraph.graph.BspServiceMaster: 
setJobState: 
{"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on 
superstep 12
2013-03-18 18:23:30,649 FATAL org.apache.giraph.graph.BspServiceMaster: 
failJob: Killing job job_201303181655_0004
2013-03-18 18:23:30,703 FATAL org.apache.giraph.graph.GraphMapper: 
uncaughtException: OverrideExceptionHandler on thread 
org.apache.giraph.graph.MasterThread, msg = null, exiting...
java.lang.NullPointerException
 at 
org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1411)
 at 
org.apache.giraph.graph.MasterThread.run(MasterThread.java:111)
2013-03-18 18:23:30,705 WARN org.apache.giraph.zk.ZooKeeperManager: 
onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper 
process.


All workers except the one that threw the expected exception report the 
following error:

2013-03-18 18:20:54,107 ERROR org.apache.zookeeper.ClientCnxn: Error while 
calling watcher 
java.lang.RuntimeException: process: Disconnected from ZooKeeper, cannot 
recover - WatchedEvent state:Disconnected type:None path:null
 at 
org.apache.giraph.graph.BspService.process(BspService.java:974)
 at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
 at 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
2013-03-18 18:20:55,110 INFO org.apache.zookeeper.ClientCnxn: Opening 
socket connection to server idp30.almaden.ibm.com/172.16.0.30:22181
2013-03-18 18:20:55,111 WARN org.apache.zookeeper.ClientCnxn: Session 
0x13d8037f818 for server null, unexpected error, closing socket 
connection and attempting reconnect
java.net.ConnectException: Connection refused
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native 
Method)
 at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
 at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2013-03-18 18:20:55,218 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2013-03-18 18:20:55,254 INFO org.apache.hadoop.io.nativeio.NativeIO: 
Initialized cache for UID to User mapping with a cache timeout of 14400 
seconds.
2013-03-18 18:20:55,254 INFO org.apache.hadoop.io.nativeio.NativeIO: Got 
UserName ytian for UID 3005 from the native implementation
2013-03-18 18:20:55,257 WARN org.apache.hadoop.mapred.Child: Error running 
child
java.lang.IllegalStateException: startSuperstep: KeeperException getting 
assignments
 at 
org.apache.giraph.graph.BspServiceWorker.startSuperstep(BspServiceWorker.java:928)
 at 
org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:649)
 at 
org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:891)
 at 
org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
 at java.security.AccessController.doPrivileged(Native 
Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
 at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: 
KeeperErrorCode = ConnectionLoss for 
/_hadoopBsp/job_201303181655_0004/_applicationAttemptsDir/0/_superstepDir/2/_partitionAssignments
 at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
 at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
 at 
org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:809)
 at 
org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:837)
 at 
org.apache.giraph.graph.BspServiceWorker.startSuperstep(BspServiceWorker.java:909)

Yuanyuan



From:   Avery Ching 
To: user@giraph.apache.org
Cc: Yuanyuan Tian/Almaden/IBM@IBMUS
Date:   03/18/2013 03:05 PM
Subject:Re: about fault tolerance in Gi

Re: about fault tolerance in Giraph

2013-03-18 Thread Avery Ching
How many retries did you set for Hadoop map task failures?  You might want 
to try 10.


Avery


Re: about fault tolerance in Giraph

2013-03-18 Thread Yuanyuan Tian
Hi Avery,

I was just testing how Giraph handles fault tolerance. I wrote a simple 
algorithm that ran without a problem. Then I artificially added a 
line of code to throw an IOException in the 12th superstep when the 
taskID is 0001 and attempt ID is . The job threw the expected 
IOException, but it could not recover from it. There was no retry of the 
failed task, even though there were empty map slots left in the cluster. 
Eventually, the whole job failed after a timeout.

Yuanyuan
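
For readers following along, the fault injection described in this message could be sketched roughly as below. This is only a minimal, self-contained illustration, not the code used in the original test: the class `FaultInjector` and its methods are hypothetical names, and the attempt ID is left as a constructor parameter because the message does not state its value. The check would be called at the start of each superstep in the worker's compute path.

```java
import java.io.IOException;

// Hypothetical helper, not part of Giraph: decides whether to inject a fault.
class FaultInjector {
    private final long failSuperstep; // superstep at which to fail (12 in the message)
    private final int failTaskId;     // map task that should fail (0001 in the message)
    private final int failAttemptId;  // attempt that should fail (not stated in the message)

    FaultInjector(long failSuperstep, int failTaskId, int failAttemptId) {
        this.failSuperstep = failSuperstep;
        this.failTaskId = failTaskId;
        this.failAttemptId = failAttemptId;
    }

    // True only when superstep, task ID and attempt ID all match the target.
    boolean shouldFail(long superstep, int taskId, int attemptId) {
        return superstep == failSuperstep
            && taskId == failTaskId
            && attemptId == failAttemptId;
    }

    // Throws the injected IOException when the current position matches the target.
    void maybeFail(long superstep, int taskId, int attemptId) throws IOException {
        if (shouldFail(superstep, taskId, attemptId)) {
            throw new IOException("Injected fault at superstep " + superstep
                + ", task " + taskId + ", attempt " + attemptId);
        }
    }
}

class FaultInjectorDemo {
    public static void main(String[] args) {
        // The attempt ID here is an arbitrary placeholder chosen for the demo.
        int placeholderAttempt = 0;
        FaultInjector injector = new FaultInjector(12, 1, placeholderAttempt);
        try {
            injector.maybeFail(11, 1, placeholderAttempt); // earlier superstep: no fault
            injector.maybeFail(12, 1, placeholderAttempt); // matches: throws
        } catch (IOException e) {
            System.out.println("Caught: " + e.getMessage());
        }
    }
}
```

Matching on the attempt ID as well as the task ID means that a retried attempt, which runs under a different attempt ID, would not hit the injected fault a second time.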




Re: about fault tolerance in Giraph

2013-03-18 Thread Avery Ching

Hi Yuanyuan,

We haven't tested this feature in a while.  But it should work. What did 
the job report about why it failed?


Avery


Re: about fault tolerance in Giraph

2013-03-18 Thread Yuanyuan Tian
Can anyone help me answer the question?

Yuanyuan



From:   Yuanyuan Tian/Almaden/IBM@IBMUS
To: user@giraph.apache.org
Date:   03/15/2013 02:05 PM
Subject:about fault tolerance in Giraph



Hi 

I was testing the fault tolerance of Giraph on a long-running job. I 
noticed that when one of the workers threw an exception, the whole job 
failed without retrying the task, even though I had turned on 
checkpointing and there were available map slots in my cluster. Why wasn't 
the fault tolerance mechanism working? 

I was running a version of Giraph downloaded sometime in June 2012 and I 
used Netty for the communication layer. 

Thanks, 

Yuanyuan 


Re: Connected components output format

2013-03-18 Thread Maja Kabiljo
Hi Wasim,

Two things:
- TextVertexWriter is not a static class, so VertexWithComponentWriter 
shouldn't be either
- TextVertexWriter only has a default constructor, and you don't have to create 
RecordWriter

Maja
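
The first point above is a general Java rule: a nested class that extends a non-static inner class needs an enclosing instance, so it cannot be declared `static`. The sketch below illustrates this with stand-in classes only. `LineOutputFormat`, `LineWriter`, `ComponentOutputFormat` and `ComponentWriter` are hypothetical names that mimic the shape of `TextVertexOutputFormat` and `TextVertexWriter`; they are not the Giraph classes, and a plain list stands in for the Hadoop `RecordWriter`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for TextVertexOutputFormat (NOT the Giraph class).
abstract class LineOutputFormat<I, V> {
    // Collected output lines; stands in for the underlying record writer.
    protected final List<String> lines = new ArrayList<>();

    abstract LineWriter createWriter();

    // Non-static inner class: each instance belongs to an enclosing
    // LineOutputFormat instance. It has only the default constructor.
    abstract class LineWriter {
        abstract void writeVertex(I id, V value);

        // Writes one line through the enclosing format's output.
        protected void emit(String line) {
            lines.add(line);
        }
    }
}

class ComponentOutputFormat extends LineOutputFormat<Integer, Integer> {
    @Override
    LineWriter createWriter() {
        // No writer object is passed in; the default constructor is enough.
        return new ComponentWriter();
    }

    // Must NOT be static: it extends the inherited inner class LineWriter,
    // so it needs an enclosing ComponentOutputFormat instance. Declaring it
    // static would be rejected by the compiler.
    class ComponentWriter extends LineWriter {
        @Override
        void writeVertex(Integer id, Integer value) {
            // Vertex id, a tab, then the component id.
            emit(id + "\t" + value);
        }
    }
}

class ComponentOutputDemo {
    public static void main(String[] args) {
        ComponentOutputFormat format = new ComponentOutputFormat();
        format.createWriter().writeVertex(7, 3);
        System.out.println(format.lines);
    }
}
```

Applied to the code in the question, the same shape would mean dropping `static` from `VertexWithComponentWriter`, removing its constructor, and returning `new VertexWithComponentWriter()` from `createVertexWriter`, assuming the Giraph version in use matches the description above.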

From: Wasim Mohammad <wasim@gmail.com>
Reply-To: "user@giraph.apache.org" <user@giraph.apache.org>
Date: Sunday, March 17, 2013 6:21 AM
To: "user@giraph.apache.org" <user@giraph.apache.org>
Subject: Connected components output format


Please tell me what is wrong with this code. It is giving me a compilation error.

package org.apache.giraph.io;

import org.apache.giraph.graph.Vertex;
import org.apache.giraph.io.VertexWriter;
import org.apache.giraph.io.formats.TextVertexOutputFormat;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

import java.io.IOException;

/**
 * Text-based {@link org.apache.giraph.graph.VertexOutputFormat} for usage with
 * {@link ConnectedComponentsVertex}
 *
 * Each line consists of a vertex and its associated component (represented by
 * the smallest vertex id in the component)
 */
public class VertexWithComponentTextOutputFormat extends
    TextVertexOutputFormat {

  @Override
  public TextVertexWriter //
  createVertexWriter(TaskAttemptContext context)
      throws IOException, InterruptedException {
    RecordWriter recordWriter =
        textOutputFormat.getRecordWriter(context);
    return new VertexWithComponentWriter(recordWriter);
  }

  static class VertexWithComponentWriter extends
      TextVertexWriter /**/ {

    public VertexWithComponentWriter(RecordWriter writer) {
      super(writer);
    }

    @Override
    public void writeVertex(Vertex vertex) throws IOException,
        InterruptedException {
      StringBuilder output = new StringBuilder();
      output.append(vertex.getId().get());
      output.append('\t');
      output.append(vertex.getValue().get());
      getRecordWriter().write(new Text(output.toString()), null);
    }

  }
}


Thanks,
M.Vasimuddin