Have you checked whether the underlying network is busy when the error happens? I can't be sure, but the symptom suggests that heavy network traffic is causing the ZooKeeper connection loss.

On 16 June 2013 20:22, Sascha Jonas <[email protected]> wrote:
> Hey,
>
> I am using Apache Hama on a small cluster with two computers. It works
> fine with a small number of supersteps, but every time I try a large
> number of iterations, e.g. 10000, it crashes.
>
> Right now it stopped working after 4600 supersteps. 8 of 16 tasks are
> still running while the log shows some errors.
>
> I am using Apache Hama 0.6 and the built-in ZooKeeper. Should I go with a
> newer Hama or ZooKeeper version?
>
> 13/06/16 00:14:14 ERROR sync.ZKSyncClient: Error creating zk path /bsp/job_201306091733_0009/sync/4276
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /bsp/job_201306091733_0009/sync/4276
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>     at org.apache.hama.bsp.sync.ZKSyncClient.createZnode(ZKSyncClient.java:138)
>     at org.apache.hama.bsp.sync.ZKSyncClient.writeNode(ZKSyncClient.java:290)
>     at org.apache.hama.bsp.sync.ZooKeeperSyncClientImpl.enterBarrier(ZooKeeperSyncClientImpl.java:99)
>     at org.apache.hama.bsp.BSPPeerImpl.enterBarrier(BSPPeerImpl.java:474)
>     at org.apache.hama.bsp.BSPPeerImpl.sync(BSPPeerImpl.java:428)
>     at de.distMLP.Base_MLP_Trainer.calculateAndWriteCost(Base_MLP_Trainer.java:90)
>     at de.distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer.bsp(Train_MultilayerPerceptron.java:57)
>     at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:168)
>     at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
>     at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1262)
> 13/06/16 00:14:15 ERROR distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer: org.apache.hama.bsp.sync.SyncException
> org.apache.hama.bsp.sync.SyncException
>     at org.apache.hama.bsp.sync.ZooKeeperSyncClientImpl.enterBarrier(ZooKeeperSyncClientImpl.java:137)
>     at org.apache.hama.bsp.BSPPeerImpl.enterBarrier(BSPPeerImpl.java:474)
>     at org.apache.hama.bsp.BSPPeerImpl.sync(BSPPeerImpl.java:428)
>     at de.distMLP.Base_MLP_Trainer.calculateAndWriteCost(Base_MLP_Trainer.java:90)
>     at de.distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer.bsp(Train_MultilayerPerceptron.java:57)
>     at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:168)
>     at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
>     at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1262)
