Network load can depend on bandwidth, packet size, frequency of communication, etc., even when the servers are reserved instances. For example, suppose only 2 servers are running in a network and server A floods its peer, server B, with messages (large messages or a high send rate). That may leave server B unresponsive or unable to respond in time.
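To make the bandwidth / message size / frequency relationship concrete, here is a rough back-of-the-envelope sketch. The class name, the 1 Gbit/s link speed, and the message sizes and rates are all hypothetical numbers chosen for illustration, not measurements from this cluster, and the calculation ignores protocol overhead and anything ZooKeeper-specific.

```java
// Rough estimate of how much of a link one sender consumes.
// All figures used in main() are hypothetical, for illustration only.
final class LinkLoad {

    /** Fraction of link capacity offered by one sender (1.0 means the link is full). */
    static double utilization(long messageBytes, double messagesPerSecond, double linkBitsPerSecond) {
        double offeredBitsPerSecond = messageBytes * 8.0 * messagesPerSecond;
        return offeredBitsPerSecond / linkBitsPerSecond;
    }

    /** True when the sender offers at least as much traffic as the link can carry. */
    static boolean isSaturated(long messageBytes, double messagesPerSecond, double linkBitsPerSecond) {
        return utilization(messageBytes, messagesPerSecond, linkBitsPerSecond) >= 1.0;
    }

    public static void main(String[] args) {
        double linkBitsPerSecond = 1e9;   // assumed 1 Gbit/s link
        long messageBytes = 1_000_000L;   // assumed 1 MB per message

        // Same message size, two different send rates.
        for (double rate : new double[] {50.0, 200.0}) {
            System.out.printf("rate=%.0f msg/s utilization=%.2f saturated=%b%n",
                    rate,
                    utilization(messageBytes, rate, linkBitsPerSecond),
                    isSaturated(messageBytes, rate, linkBitsPerSecond));
        }
    }
}
```

With these assumed numbers the lower rate leaves headroom on the link while the higher rate offers more traffic than the link can carry, which is the kind of situation where a peer (or a ZooKeeper heartbeat sharing the same link) may not get a response through in time.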
On 17 June 2013 18:24, Edward J. Yoon <[email protected]> wrote:
> Please see the zookeeper logs to figure out the reason of
> ConnectionLossException. There are many possibilities such as FullGC,
> heavy swap space usage, or session expired.
>
> I guess, the answer will be in the sentence "stopped working after
> 4600 supersteps".
>
> On Mon, Jun 17, 2013 at 6:11 PM, Sascha Jonas
> <[email protected]> wrote:
>> The servers are reserved for Apache Hama, so there is no other network
>> traffic. I tested it on three other PCs at another location but with the
>> same configuration and got the same errors :(
>>
>> On Sun, 16.06.2013, 16:44, Chia-Hung Lin wrote:
>>> Have you checked if underlying network traffic is busy when error happens?
>>>
>>> Can't be very sure but the symptom seems to be the heavy network
>>> traffic leads to the zk connection lost.
>>>
>>> On 16 June 2013 20:22, Sascha Jonas <[email protected]>
>>> wrote:
>>>> Hey,
>>>>
>>>> iam using Apache Hama on a small cluster with two computers. Its working
>>>> fine with a small number of supersteps but every time i am trying with
>>>> lots of iterations e.g. 10000 it crashes.
>>>>
>>>> Right now it stopped working after 4600 supersteps. 8 from 16 Tasks are
>>>> still running while the log shows some errors.
>>>>
>>>> Iam using Apache Hama 0.6 and the builtin Zookeeper. Should i go with a
>>>> newer Hama or Zookeeper version?
>>>>
>>>> 13/06/16 00:14:14 ERROR sync.ZKSyncClient: Error creating zk path
>>>> /bsp/job_201306091733_0009/sync/4276
>>>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>> KeeperErrorCode = ConnectionLoss for
>>>> /bsp/job_201306091733_0009/sync/4276
>>>>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>>>>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>>   at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>>>>   at org.apache.hama.bsp.sync.ZKSyncClient.createZnode(ZKSyncClient.java:138)
>>>>   at org.apache.hama.bsp.sync.ZKSyncClient.writeNode(ZKSyncClient.java:290)
>>>>   at org.apache.hama.bsp.sync.ZooKeeperSyncClientImpl.enterBarrier(ZooKeeperSyncClientImpl.java:99)
>>>>   at org.apache.hama.bsp.BSPPeerImpl.enterBarrier(BSPPeerImpl.java:474)
>>>>   at org.apache.hama.bsp.BSPPeerImpl.sync(BSPPeerImpl.java:428)
>>>>   at de.distMLP.Base_MLP_Trainer.calculateAndWriteCost(Base_MLP_Trainer.java:90)
>>>>   at de.distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer.bsp(Train_MultilayerPerceptron.java:57)
>>>>   at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:168)
>>>>   at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
>>>>   at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1262)
>>>> 13/06/16 00:14:15 ERROR
>>>> distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer:
>>>> org.apache.hama.bsp.sync.SyncException
>>>> org.apache.hama.bsp.sync.SyncException
>>>>   at org.apache.hama.bsp.sync.ZooKeeperSyncClientImpl.enterBarrier(ZooKeeperSyncClientImpl.java:137)
>>>>   at org.apache.hama.bsp.BSPPeerImpl.enterBarrier(BSPPeerImpl.java:474)
>>>>   at org.apache.hama.bsp.BSPPeerImpl.sync(BSPPeerImpl.java:428)
>>>>   at de.distMLP.Base_MLP_Trainer.calculateAndWriteCost(Base_MLP_Trainer.java:90)
>>>>   at de.distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer.bsp(Train_MultilayerPerceptron.java:57)
>>>>   at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:168)
>>>>   at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
>>>>   at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1262)
>>>>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
