Hi Everyone, I have a shortest path implementation that completes and outputs the correct results to a counter, but then hangs after the last superstep and is eventually killed by Hadoop.

Here's the output from the console:

[main-SendThread(localhost.localdomain:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost.localdomain/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
[main-SendThread(localhost.localdomain:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established to localhost.localdomain/127.0.0.1:2181, initiating session
[main-SendThread(localhost.localdomain:2181)] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server localhost.localdomain/127.0.0.1:2181, sessionid = 0x1451fc674a30007, negotiated timeout = 40000
14/04/04 22:19:44 INFO job.JobProgressTracker: Data from 1 workers - Storing data: 0 out of 11 vertices stored; 0 out of 1 partitions stored; min free memory on worker 1 - 119.73MB, average 119.73MB
14/04/04 22:19:45 INFO mapred.JobClient: map 100% reduce 0%
14/04/04 22:19:49 INFO job.JobProgressTracker: Data from 1 workers - Storing data: 0 out of 11 vertices stored; 0 out of 1 partitions stored; min free memory on worker 1 - 119.73MB, average 119.73MB
14/04/04 22:19:54 INFO job.JobProgressTracker: Data from 1 workers - Storing data: 0 out of 11 vertices stored; 0 out of 1 partitions stored; min free memory on worker 1 - 119.44MB, average 119.44MB

This is the stack trace I see in Hadoop after the job is killed:

Caused by: java.lang.IllegalStateException: waitFor: ExecutionException occurred while waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@43349eef
    at org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:193)
    at org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:151)
    at org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:136)
    at org.apache.giraph.utils.ProgressableUtils.getFutureResult(ProgressableUtils.java:99)
    at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:233)
    at org.apache.giraph.worker.BspServiceWorker.saveVertices(BspServiceWorker.java:1033)
    at org.apache.giraph.worker.BspServiceWorker.cleanup(BspServiceWorker.java:1179)
    at org.apache.giraph.graph.GraphTaskManager.cleanup(GraphTaskManager.java:843)
    at org.apache.giraph.graph.GraphMapper.cleanup(GraphMapper.java:81)
    at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:93)
    ... 7 more
Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/prototype/giraph/twitter-path-result/_temporary/_attempt_201404012018_0003_m_000001_0/part-m-00001 for DFSClient_attempt_201404012018_0003_m_000001_0_-1149212770_1 on client 127.0.0.1 because current leaseholder is trying to recreate file.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1452)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1324)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1266)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:668)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:647)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)

I realize that the root cause appears to be within Hadoop and not Giraph, but I am wondering if there is a Giraph configuration parameter I am missing. In researching the HDFS exception (not many posts on this, BTW), I found one responder who opined that this exception is due to speculative execution being enabled.

Also, I tested a standard Map/Reduce job writing to the same datablock and it worked fine, so I don't think HDFS itself is the problem (corrupt datablock, etc.).

Any ideas?

--John