Please check the ZooKeeper logs to find the cause of the ConnectionLossException. There are many possibilities, such as a full GC, heavy swap usage, or an expired session.
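Whatever the root cause turns out to be, ConnectionLoss is generally a transient client-side error, so one option for long runs is to retry the failing call with a backoff. Below is only a minimal sketch of that idea in plain Java (8 or later), not how Hama's sync client actually behaves. `RetrySketch`, `callWithRetry`, and `TransientFailure` are made-up names; `TransientFailure` is a stand-in for `KeeperException.ConnectionLossException` so the example runs without ZooKeeper on the classpath.

```java
import java.util.concurrent.Callable;

public class RetrySketch {

    /** Hypothetical stand-in for KeeperException.ConnectionLossException. */
    public static class TransientFailure extends Exception {
        public TransientFailure(String message) {
            super(message);
        }
    }

    /**
     * Runs the operation and retries it when it throws TransientFailure.
     * At most maxAttempts calls are made; the wait between calls starts at
     * initialDelayMs and doubles after every failed attempt.
     */
    public static <T> T callWithRetry(Callable<T> op, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (TransientFailure e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the last failure
                }
                Thread.sleep(delay);
                delay *= 2;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated operation: fails twice, then succeeds on the third call.
        String result = callWithRetry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new TransientFailure("connection lost");
            }
            return "created";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

One caveat if this were applied to a real znode create: after a connection loss the client cannot tell whether the first attempt reached the server, so a retry may find that the node already exists and has to treat that case as success.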
I guess the answer will be in the sentence "stopped working after 4600 supersteps".

On Mon, Jun 17, 2013 at 6:11 PM, Sascha Jonas <[email protected]> wrote:
> The servers are reserved for Apache Hama, so there is no other network
> traffic. I tested it on three other PCs at another location but with the
> same configuration and got the same errors :(
>
> On Sun, 16.06.2013, 16:44, Chia-Hung Lin wrote:
>> Have you checked whether the underlying network is busy when the error
>> happens?
>>
>> I can't be very sure, but the symptom suggests that heavy network
>> traffic leads to the lost zk connection.
>>
>>
>>
>> On 16 June 2013 20:22, Sascha Jonas <[email protected]>
>> wrote:
>>> Hey,
>>>
>>> I am using Apache Hama on a small cluster with two computers. It's working
>>> fine with a small number of supersteps, but every time I try it with
>>> lots of iterations, e.g. 10000, it crashes.
>>>
>>> Right now it stopped working after 4600 supersteps. 8 of 16 tasks are
>>> still running while the log shows some errors.
>>>
>>> I am using Apache Hama 0.6 and the built-in ZooKeeper. Should I go with a
>>> newer Hama or ZooKeeper version?
>>>
>>> 13/06/16 00:14:14 ERROR sync.ZKSyncClient: Error creating zk path /bsp/job_201306091733_0009/sync/4276
>>> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /bsp/job_201306091733_0009/sync/4276
>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>     at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>>>     at org.apache.hama.bsp.sync.ZKSyncClient.createZnode(ZKSyncClient.java:138)
>>>     at org.apache.hama.bsp.sync.ZKSyncClient.writeNode(ZKSyncClient.java:290)
>>>     at org.apache.hama.bsp.sync.ZooKeeperSyncClientImpl.enterBarrier(ZooKeeperSyncClientImpl.java:99)
>>>     at org.apache.hama.bsp.BSPPeerImpl.enterBarrier(BSPPeerImpl.java:474)
>>>     at org.apache.hama.bsp.BSPPeerImpl.sync(BSPPeerImpl.java:428)
>>>     at de.distMLP.Base_MLP_Trainer.calculateAndWriteCost(Base_MLP_Trainer.java:90)
>>>     at de.distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer.bsp(Train_MultilayerPerceptron.java:57)
>>>     at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:168)
>>>     at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
>>>     at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1262)
>>> 13/06/16 00:14:15 ERROR distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer: org.apache.hama.bsp.sync.SyncException
>>> org.apache.hama.bsp.sync.SyncException
>>>     at org.apache.hama.bsp.sync.ZooKeeperSyncClientImpl.enterBarrier(ZooKeeperSyncClientImpl.java:137)
>>>     at org.apache.hama.bsp.BSPPeerImpl.enterBarrier(BSPPeerImpl.java:474)
>>>     at org.apache.hama.bsp.BSPPeerImpl.sync(BSPPeerImpl.java:428)
>>>     at de.distMLP.Base_MLP_Trainer.calculateAndWriteCost(Base_MLP_Trainer.java:90)
>>>     at de.distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer.bsp(Train_MultilayerPerceptron.java:57)
>>>     at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:168)
>>>     at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
>>>     at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1262)
>>>
>>
>
>

--
Best Regards, Edward J. Yoon
@eddieyoon
