Hey,
I am using Apache Hama on a small cluster of two computers. It works
fine with a small number of supersteps, but every time I try a large
number of iterations (e.g. 10000), it crashes.
Right now it stopped working after 4600 supersteps. 8 of the 16 tasks are
still running, while the log shows some errors.
I am using Apache Hama 0.6 and the built-in ZooKeeper. Should I go with a
newer Hama or ZooKeeper version?
13/06/16 00:14:14 ERROR sync.ZKSyncClient: Error creating zk path /bsp/job_201306091733_0009/sync/4276
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /bsp/job_201306091733_0009/sync/4276
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at org.apache.hama.bsp.sync.ZKSyncClient.createZnode(ZKSyncClient.java:138)
    at org.apache.hama.bsp.sync.ZKSyncClient.writeNode(ZKSyncClient.java:290)
    at org.apache.hama.bsp.sync.ZooKeeperSyncClientImpl.enterBarrier(ZooKeeperSyncClientImpl.java:99)
    at org.apache.hama.bsp.BSPPeerImpl.enterBarrier(BSPPeerImpl.java:474)
    at org.apache.hama.bsp.BSPPeerImpl.sync(BSPPeerImpl.java:428)
    at de.distMLP.Base_MLP_Trainer.calculateAndWriteCost(Base_MLP_Trainer.java:90)
    at de.distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer.bsp(Train_MultilayerPerceptron.java:57)
    at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:168)
    at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
    at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1262)
13/06/16 00:14:15 ERROR distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer: org.apache.hama.bsp.sync.SyncException
org.apache.hama.bsp.sync.SyncException
    at org.apache.hama.bsp.sync.ZooKeeperSyncClientImpl.enterBarrier(ZooKeeperSyncClientImpl.java:137)
    at org.apache.hama.bsp.BSPPeerImpl.enterBarrier(BSPPeerImpl.java:474)
    at org.apache.hama.bsp.BSPPeerImpl.sync(BSPPeerImpl.java:428)
    at de.distMLP.Base_MLP_Trainer.calculateAndWriteCost(Base_MLP_Trainer.java:90)
    at de.distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer.bsp(Train_MultilayerPerceptron.java:57)
    at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:168)
    at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
    at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1262)