Hi,

It looks like a ZooKeeper connection problem. Please check whether ZooKeeper is running and that every task can connect to it. I would recommend stopping the firewall while debugging, and please use the latest 0.7.0 release.
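A quick way to verify connectivity from each groom machine is a small probe like the sketch below (the connect string is only an assumption; use the value of hama.zookeeper.quorum from your hama-site.xml):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkProbe {
  public static void main(String[] args) throws Exception {
    // Example quorum address only -- replace with your hama.zookeeper.quorum value.
    String quorum = args.length > 0 ? args[0] : "master-host:2181";
    final CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper(quorum, 15000, new Watcher() {
      public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.SyncConnected) {
          connected.countDown();
        }
      }
    });
    if (connected.await(10, TimeUnit.SECONDS)) {
      // Listing the root znode proves the session is actually usable.
      System.out.println("Connected. Children of /: " + zk.getChildren("/", false));
    } else {
      System.out.println("Could not connect to " + quorum + " within 10 seconds.");
    }
    zk.close();
  }
}

If the probe fails only on the machine whose tasks freeze, that points at the firewall or the quorum setting rather than at the job itself.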
--
Best Regards, Edward J. Yoon

-----Original Message-----
From: Behroz Sikander [mailto:[email protected]]
Sent: Monday, June 29, 2015 7:34 AM
To: [email protected]
Subject: Re: Groomserer BSPPeerChild limit

To figure out the issue, I was trying something else and found another weird issue. It might be a bug in Hama, but I am not sure. Both of the following lines throw an exception:

System.out.println(peer.getPeerName(0)); // Exception
System.out.println(peer.getNumPeers());  // Exception

[time] ERROR bsp.BSPTask: Error running bsp setup and bsp function.
[time] java.lang.RuntimeException: All peer names could not be retrieved!
    at org.apache.hama.bsp.sync.ZooKeeperSyncClientImpl.getAllPeerNames(ZooKeeperSyncClientImpl.java:305)
    at org.apache.hama.bsp.BSPPeerImpl.initPeerNames(BSPPeerImpl.java:544)
    at org.apache.hama.bsp.BSPPeerImpl.getNumPeers(BSPPeerImpl.java:538)
    at testHDFS.EVADMMBsp.setup(EVADMMBsp.java:58)
    at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:170)
    at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
    at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1243)
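For reference, those two calls sit in my setup() method, roughly as in the sketch below (illustrative only, not the actual EVADMMBsp code). The trace shows getNumPeers() going through BSPPeerImpl.initPeerNames() and ZooKeeperSyncClientImpl.getAllPeerNames(), so both calls need the complete peer list from ZooKeeper and fail when some peers never register.

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

// Illustrative sketch -- the real job uses its own key/value/message types.
public class PeerInfoBSP extends
    BSP<NullWritable, NullWritable, NullWritable, NullWritable, NullWritable> {

  @Override
  public void setup(
      BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, NullWritable> peer)
      throws IOException {
    // Both calls resolve the complete list of peer names via ZooKeeper.
    System.out.println(peer.getPeerName(0));
    System.out.println(peer.getNumPeers());
  }

  @Override
  public void bsp(
      BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, NullWritable> peer)
      throws IOException, SyncException, InterruptedException {
    // No-op for this sketch.
  }
}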
On Sun, Jun 28, 2015 at 6:45 PM, Behroz Sikander <[email protected]> wrote:

> I think I have more information on the issue. I did some debugging and found something quite strange.
>
> If I start my job with 6 tasks (3 tasks run on MACHINE1 and 3 tasks on MACHINE2):
>
> - The 3 tasks on MACHINE1 are frozen, and the strange thing is that the processes do not even enter the setup() function of the BSP class. I have print statements in the setup() function and nothing is printed; I get empty log files of zero size.
>
> drwxrwxr-x  2 behroz behroz 4096 Jun 28 16:29 .
> drwxrwxr-x 99 behroz behroz 4096 Jun 28 16:28 ..
> -rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000000_0.err
> -rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000000_0.log
> -rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000001_0.err
> -rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000001_0.log
> -rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000002_0.err
> -rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000002_0.log
>
> - On MACHINE2, the code enters the setup() function of the BSP class and prints output. See the size of the files generated. How is it possible that the code enters setup() in 3 tasks but not in the others?
>
> drwxrwxr-x  2 behroz behroz 4096 Jun 28 16:39 .
> drwxrwxr-x 82 behroz behroz 4096 Jun 28 16:39 ..
> -rw-rw-r--  1 behroz behroz  659 Jun 28 16:39 attempt_201506281639_0001_000003_0.err
> -rw-rw-r--  1 behroz behroz 1441 Jun 28 16:39 attempt_201506281639_0001_000003_0.log
> -rw-rw-r--  1 behroz behroz  659 Jun 28 16:39 attempt_201506281639_0001_000004_0.err
> -rw-rw-r--  1 behroz behroz 1368 Jun 28 16:39 attempt_201506281639_0001_000004_0.log
> -rw-rw-r--  1 behroz behroz  659 Jun 28 16:39 attempt_201506281639_0001_000005_0.err
> -rw-rw-r--  1 behroz behroz 1441 Jun 28 16:39 attempt_201506281639_0001_000005_0.log
>
> - The Hama groom log on MACHINE1 (which is frozen) shows:
> [time] INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201506281639_0001_000001_0' has started.
> [time] INFO org.apache.hama.bsp.GroomServer: Launch 3 tasks.
> [time] INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201506281639_0001_000002_0' has started.
> [time] INFO org.apache.hama.bsp.GroomServer: Launch 3 tasks.
> [time] INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201506281639_0001_000000_0' has started.
>
> - The Hama groom log on MACHINE2 shows:
> [time] INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201506281639_0001_000003_0' has started.
> [time] INFO org.apache.hama.bsp.GroomServer: Launch 3 tasks.
> [time] INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201506281639_0001_000004_0' has started.
> [time] INFO org.apache.hama.bsp.GroomServer: Launch 3 tasks.
> [time] INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201506281639_0001_000005_0' has started.
> [time] INFO org.apache.hama.bsp.GroomServer: Task attempt_201506281639_0001_000004_0 is done.
> [time] INFO org.apache.hama.bsp.GroomServer: Task attempt_201506281639_0001_000003_0 is done.
> [time] INFO org.apache.hama.bsp.GroomServer: Task attempt_201506281639_0001_000005_0 is done.
>
> Any clue what might be going wrong?
>
> Regards,
> Behroz
>
> On Sat, Jun 27, 2015 at 1:13 PM, Behroz Sikander <[email protected]> wrote:
>
>> Here is the log file from that folder:
>>
>> 15/06/27 11:10:34 INFO ipc.Server: Starting Socket Reader #1 for port 61001
>> 15/06/27 11:10:34 INFO ipc.Server: IPC Server Responder: starting
>> 15/06/27 11:10:34 INFO ipc.Server: IPC Server listener on 61001: starting
>> 15/06/27 11:10:34 INFO ipc.Server: IPC Server handler 0 on 61001: starting
>> 15/06/27 11:10:34 INFO ipc.Server: IPC Server handler 1 on 61001: starting
>> 15/06/27 11:10:34 INFO ipc.Server: IPC Server handler 2 on 61001: starting
>> 15/06/27 11:10:34 INFO ipc.Server: IPC Server handler 3 on 61001: starting
>> 15/06/27 11:10:34 INFO message.HamaMessageManagerImpl: BSPPeer address:b178b33b16cc port:61001
>> 15/06/27 11:10:34 INFO ipc.Server: IPC Server handler 4 on 61001: starting
>> 15/06/27 11:10:34 INFO sync.ZKSyncClient: Initializing ZK Sync Client
>> 15/06/27 11:10:34 INFO sync.ZooKeeperSyncClientImpl: Start connecting to Zookeeper! At b178b33b16cc/172.17.0.7:61001
>> 15/06/27 11:10:37 INFO ipc.Server: Stopping server on 61001
>> 15/06/27 11:10:37 INFO ipc.Server: IPC Server handler 0 on 61001: exiting
>> 15/06/27 11:10:37 INFO ipc.Server: Stopping IPC Server listener on 61001
>> 15/06/27 11:10:37 INFO ipc.Server: IPC Server handler 1 on 61001: exiting
>> 15/06/27 11:10:37 INFO ipc.Server: IPC Server handler 2 on 61001: exiting
>> 15/06/27 11:10:37 INFO ipc.Server: Stopping IPC Server Responder
>> 15/06/27 11:10:37 INFO ipc.Server: IPC Server handler 3 on 61001: exiting
>> 15/06/27 11:10:37 INFO ipc.Server: IPC Server handler 4 on 61001: exiting
>>
>> And my console shows the following output. Hama is frozen right now.
>> 15/06/27 11:10:32 INFO bsp.BSPJobClient: Running job: job_201506262331_0003
>> 15/06/27 11:10:35 INFO bsp.BSPJobClient: Current supersteps number: 0
>> 15/06/27 11:10:38 INFO bsp.BSPJobClient: Current supersteps number: 2
>>
>> On Sat, Jun 27, 2015 at 1:07 PM, Edward J. Yoon <[email protected]> wrote:
>>
>>> Please check the task logs in the $HAMA_HOME/logs/tasklogs folder.
>>>
>>> On Sat, Jun 27, 2015 at 8:03 PM, Behroz Sikander <[email protected]> wrote:
>>> > Yea, I also thought that. I ran the program through Eclipse with 20 tasks and it works fine.
>>> >
>>> > On Sat, Jun 27, 2015 at 1:00 PM, Edward J. Yoon <[email protected]> wrote:
>>> >
>>> >> > When I run the PI example, it uses 9 tasks and runs fine. When I run my
>>> >> > program with 3 tasks, everything runs fine. But when I increase the tasks
>>> >> > (to 4) by using "setNumBspTask", Hama freezes. I do not understand what
>>> >> > can go wrong.
>>> >>
>>> >> It looks like a program bug. Have you run your program in local mode?
>>> >>
>>> >> On Sat, Jun 27, 2015 at 8:03 AM, Behroz Sikander <[email protected]> wrote:
>>> >> > Hi,
>>> >> > In the current thread, I mentioned 3 issues. Issues 1 and 3 are resolved, but issue number 2 is still giving me headaches.
>>> >> >
>>> >> > My problem:
>>> >> > My cluster now consists of 3 machines, each of them properly configured (apparently). From my master machine, when I start Hadoop and Hama, I can see the processes started on the other 2 machines. If I check the maximum tasks that my cluster can support, I get 9 (3 tasks per machine).
>>> >> >
>>> >> > When I run the PI example, it uses 9 tasks and runs fine. When I run my program with 3 tasks, everything runs fine. But when I increase the tasks (to 4) by using "setNumBspTask", Hama freezes. I do not understand what can go wrong.
>>> >> >
>>> >> > I checked the log files and things look fine. I just sometimes get an exception that Hama was not able to delete the system directory (bsp.system.dir) defined in hama-site.xml.
>>> >> >
>>> >> > Any help or clue would be great.
>>> >> >
>>> >> > Regards,
>>> >> > Behroz Sikander
>>> >> >
>>> >> > On Thu, Jun 25, 2015 at 1:13 PM, Behroz Sikander <[email protected]> wrote:
>>> >> >
>>> >> >> Thank you :)
>>> >> >>
>>> >> >> On Thu, Jun 25, 2015 at 12:14 AM, Edward J. Yoon <[email protected]> wrote:
>>> >> >>
>>> >> >>> Hi,
>>> >> >>>
>>> >> >>> You can get the maximum number of available tasks with code like the following:
>>> >> >>>
>>> >> >>> BSPJobClient jobClient = new BSPJobClient(conf);
>>> >> >>> ClusterStatus cluster = jobClient.getClusterStatus(true);
>>> >> >>>
>>> >> >>> // Set to maximum
>>> >> >>> bsp.setNumBspTask(cluster.getMaxTasks());
>>> >> >>>
>>> >> >>> On Wed, Jun 24, 2015 at 11:20 PM, Behroz Sikander <[email protected]> wrote:
>>> >> >>> > Hi,
>>> >> >>> > 1) Thank you for this.
>>> >> >>> > 2) Here are the images. I will look into the log files of the PI example.
>>> >> >>> >
>>> >> >>> > Result of JPS command on slave:
>>> >> >>> > http://s17.postimg.org/gpwe2bbfj/Screen_Shot_2015_06_22_at_7_23_31_PM.png
>>> >> >>> >
>>> >> >>> > Result of JPS command on Master:
>>> >> >>> > http://s14.postimg.org/s9922em5p/Screen_Shot_2015_06_22_at_7_23_42_PM.png
>>> >> >>> >
>>> >> >>> > 3) In my current case, I do not have any input submitted to the job. At run time, I fetch data directly from HDFS. So I am looking for something like BSPJob.setMaxNumBspTask().
>>> >> >>> >
>>> >> >>> > Regards,
>>> >> >>> > Behroz
>>> >> >>> >
>>> >> >>> > On Tue, Jun 23, 2015 at 12:57 AM, Edward J. Yoon <[email protected]> wrote:
>>> >> >>> >
>>> >> >>> >> Hello,
>>> >> >>> >>
>>> >> >>> >> 1) You can get the filesystem URI from a configuration using "FileSystem fs = FileSystem.get(conf);".
>>> >> >>> >> Of course, the fs.defaultFS property should be in hama-site.xml:
>>> >> >>> >>
>>> >> >>> >> <property>
>>> >> >>> >>   <name>fs.defaultFS</name>
>>> >> >>> >>   <value>hdfs://host1.mydomain.com:9000/</value>
>>> >> >>> >>   <description>
>>> >> >>> >>     The name of the default file system. Either the literal string "local" or a host:port for HDFS.
>>> >> >>> >>   </description>
>>> >> >>> >> </property>
>>> >> >>> >>
>>> >> >>> >> 2) The 'bsp.tasks.maximum' property is the number of tasks per node. It looks like a cluster configuration issue. Please run the Pi example and look at the logs for more details. NOTE: you cannot attach images to the mailing list, so I can't see them.
>>> >> >>> >>
>>> >> >>> >> 3) You can use the BSPJob.setNumBspTask(int) method. If input is provided, the number of BSP tasks is basically driven by the number of DFS blocks. I'll fix it to be more flexible in HAMA-956.
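>>> >> >>> >> To illustrate 1): with fs.defaultFS set in hama-site.xml, the job code can obtain the filesystem without a hard-coded URI. A minimal sketch (the path below is only illustrative, not from your job):
>>> >> >>> >>
>>> >> >>> >> import org.apache.hadoop.conf.Configuration;
>>> >> >>> >> import org.apache.hadoop.fs.FileSystem;
>>> >> >>> >> import org.apache.hadoop.fs.Path;
>>> >> >>> >> import org.apache.hama.HamaConfiguration;
>>> >> >>> >>
>>> >> >>> >> public class FsFromConf {
>>> >> >>> >>   public static void main(String[] args) throws Exception {
>>> >> >>> >>     // HamaConfiguration loads hama-default.xml and hama-site.xml,
>>> >> >>> >>     // so FileSystem.get(conf) resolves fs.defaultFS automatically.
>>> >> >>> >>     Configuration conf = new HamaConfiguration();
>>> >> >>> >>     FileSystem fs = FileSystem.get(conf);
>>> >> >>> >>     // Illustrative path only.
>>> >> >>> >>     System.out.println(fs.exists(new Path("/user/behroz/input")));
>>> >> >>> >>   }
>>> >> >>> >> }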
>>> >> >>> >> Thanks!
>>> >> >>> >>
>>> >> >>> >> On Tue, Jun 23, 2015 at 2:33 AM, Behroz Sikander <[email protected]> wrote:
>>> >> >>> >> > Hi,
>>> >> >>> >> > Recently, I moved from a single-machine setup to a 2-machine setup. I was successfully able to run my job that uses HDFS to get data. I have 3 trivial questions.
>>> >> >>> >> >
>>> >> >>> >> > 1- To access HDFS, I have to manually give the IP address of the server running HDFS. I thought that Hama would automatically pick it up from the configuration, but it does not. I am probably doing something wrong. Right now my code works by using the following:
>>> >> >>> >> >
>>> >> >>> >> > FileSystem fs = FileSystem.get(new URI("hdfs://server_ip:port/"), conf);
>>> >> >>> >> >
>>> >> >>> >> > 2- On my master server, when I start Hama it automatically starts Hama on the slave machine (all good). Both master and slave are set as groomservers. This means that I have 2 servers to run my job, which means that I can open more BSPPeerChild processes. If I submit my jar with 3 BSP tasks, everything works fine. But when I move to 4 tasks, Hama freezes. Here is the result of the JPS command on the slave.
>>> >> >>> >> >
>>> >> >>> >> > Result of JPS command on Master
>>> >> >>> >> >
>>> >> >>> >> > You can see that it is only opening tasks on the slave but not on the master.
>>> >> >>> >> >
>>> >> >>> >> > Note: I tried to change the bsp.tasks.maximum property in hama-default.xml to 4, but I still get the same result.
>>> >> >>> >> >
>>> >> >>> >> > 3- I want my cluster to open as many BSPPeerChild processes as possible. Is there any setting I can use to achieve that? Or does Hama pick up the values from hama-default.xml to open tasks?
>>> >> >>> >> >
>>> >> >>> >> > Regards,
>>> >> >>> >> > Behroz Sikander
>>> >> >>> >>
>>> >> >>> >> --
>>> >> >>> >> Best Regards, Edward J. Yoon
>>> >> >>>
>>> >> >>> --
>>> >> >>> Best Regards, Edward J. Yoon
>>> >>
>>> >> --
>>> >> Best Regards, Edward J. Yoon
>>>
>>> --
>>> Best Regards, Edward J. Yoon
