I think I have more information on the issue. I did some debugging and found something quite strange.
If I open my job with 6 tasks (3 tasks will run on MACHINE1 and 3 tasks on MACHINE2):

- The 3 tasks on MACHINE1 are frozen, and the strange thing is that the processes do not even enter the setup() function of the BSP class. I have print statements in the setup() function and nothing is printed. I get empty files with zero size:

drwxrwxr-x  2 behroz behroz 4096 Jun 28 16:29 .
drwxrwxr-x 99 behroz behroz 4096 Jun 28 16:28 ..
-rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000000_0.err
-rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000000_0.log
-rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000001_0.err
-rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000001_0.log
-rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000002_0.err
-rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000002_0.log

- On MACHINE2, the code enters the setup() function of the BSP class and prints output. See the size of the files generated. How is it possible that the code can enter the BSP in 3 tasks but not in the others?

drwxrwxr-x  2 behroz behroz 4096 Jun 28 16:39 .
drwxrwxr-x 82 behroz behroz 4096 Jun 28 16:39 ..
-rw-rw-r--  1 behroz behroz  659 Jun 28 16:39 attempt_201506281639_0001_000003_0.err
-rw-rw-r--  1 behroz behroz 1441 Jun 28 16:39 attempt_201506281639_0001_000003_0.log
-rw-rw-r--  1 behroz behroz  659 Jun 28 16:39 attempt_201506281639_0001_000004_0.err
-rw-rw-r--  1 behroz behroz 1368 Jun 28 16:39 attempt_201506281639_0001_000004_0.log
-rw-rw-r--  1 behroz behroz  659 Jun 28 16:39 attempt_201506281639_0001_000005_0.err
-rw-rw-r--  1 behroz behroz 1441 Jun 28 16:39 attempt_201506281639_0001_000005_0.log

- The Hama groom log file on MACHINE1 (which is frozen) shows:

[time] INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201506281639_0001_000001_0' has started.
[time] INFO org.apache.hama.bsp.GroomServer: Launch 3 tasks.
[time] INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201506281639_0001_000002_0' has started.
[time] INFO org.apache.hama.bsp.GroomServer: Launch 3 tasks.
[time] INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201506281639_0001_000000_0' has started.

- The Hama groom log file on MACHINE2 shows:

[time] INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201506281639_0001_000003_0' has started.
[time] INFO org.apache.hama.bsp.GroomServer: Launch 3 tasks.
[time] INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201506281639_0001_000004_0' has started.
[time] INFO org.apache.hama.bsp.GroomServer: Launch 3 tasks.
[time] INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201506281639_0001_000005_0' has started.
[time] INFO org.apache.hama.bsp.GroomServer: Task attempt_201506281639_0001_000004_0 is *done*.
[time] INFO org.apache.hama.bsp.GroomServer: Task attempt_201506281639_0001_000003_0 is *done*.
[time] INFO org.apache.hama.bsp.GroomServer: Task attempt_201506281639_0001_000005_0 is *done*.

Any clue what might be going wrong?
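A zero-size .log/.err pair like the ones above suggests the child JVM died or hung before it ever reached setup(). A quick, hedged sketch for spotting such attempts on the frozen machine (the tasklogs path and attempt-file naming are assumptions based on the listings above; adjust HAMA_HOME to your install):

```shell
# Sketch: list attempt logs that are still empty.
# HAMA_HOME and the tasklogs layout are assumptions; adjust as needed.
HAMA_HOME="${HAMA_HOME:-/opt/hama}"
find "$HAMA_HOME/logs/tasklogs" -type f -name 'attempt_*' -size 0 2>/dev/null | sort
```

Any attempt listed here never produced output, which matches the tasks that do not print from setup().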
Regards,
Behroz

On Sat, Jun 27, 2015 at 1:13 PM, Behroz Sikander <[email protected]> wrote:

> Here is the log file from that folder
>
> 15/06/27 11:10:34 INFO ipc.Server: Starting Socket Reader #1 for port 61001
> 15/06/27 11:10:34 INFO ipc.Server: IPC Server Responder: starting
> 15/06/27 11:10:34 INFO ipc.Server: IPC Server listener on 61001: starting
> 15/06/27 11:10:34 INFO ipc.Server: IPC Server handler 0 on 61001: starting
> 15/06/27 11:10:34 INFO ipc.Server: IPC Server handler 1 on 61001: starting
> 15/06/27 11:10:34 INFO ipc.Server: IPC Server handler 2 on 61001: starting
> 15/06/27 11:10:34 INFO ipc.Server: IPC Server handler 3 on 61001: starting
> 15/06/27 11:10:34 INFO message.HamaMessageManagerImpl: BSPPeer address:b178b33b16cc port:61001
> 15/06/27 11:10:34 INFO ipc.Server: IPC Server handler 4 on 61001: starting
> 15/06/27 11:10:34 INFO sync.ZKSyncClient: Initializing ZK Sync Client
> 15/06/27 11:10:34 INFO sync.ZooKeeperSyncClientImpl: Start connecting to Zookeeper! At b178b33b16cc/172.17.0.7:61001
> 15/06/27 11:10:37 INFO ipc.Server: Stopping server on 61001
> 15/06/27 11:10:37 INFO ipc.Server: IPC Server handler 0 on 61001: exiting
> 15/06/27 11:10:37 INFO ipc.Server: Stopping IPC Server listener on 61001
> 15/06/27 11:10:37 INFO ipc.Server: IPC Server handler 1 on 61001: exiting
> 15/06/27 11:10:37 INFO ipc.Server: IPC Server handler 2 on 61001: exiting
> 15/06/27 11:10:37 INFO ipc.Server: Stopping IPC Server Responder
> 15/06/27 11:10:37 INFO ipc.Server: IPC Server handler 3 on 61001: exiting
> 15/06/27 11:10:37 INFO ipc.Server: IPC Server handler 4 on 61001: exiting
>
> And my console shows the following output. Hama is frozen right now.
>
> 15/06/27 11:10:32 INFO bsp.BSPJobClient: Running job: job_201506262331_0003
> 15/06/27 11:10:35 INFO bsp.BSPJobClient: Current supersteps number: 0
> 15/06/27 11:10:38 INFO bsp.BSPJobClient: Current supersteps number: 2
>
> On Sat, Jun 27, 2015 at 1:07 PM, Edward J. Yoon <[email protected]> wrote:
>
>> Please check the task logs in the $HAMA_HOME/logs/tasklogs folder.
>>
>> On Sat, Jun 27, 2015 at 8:03 PM, Behroz Sikander <[email protected]> wrote:
>>
>>> Yeah, I also thought that. I ran the program through Eclipse with 20 tasks and it works fine.
>>>
>>> On Sat, Jun 27, 2015 at 1:00 PM, Edward J. Yoon <[email protected]> wrote:
>>>
>>>>> When I run the PI example, it uses 9 tasks and runs fine. When I run my
>>>>> program with 3 tasks, everything runs fine. But when I increase the tasks
>>>>> (to 4) by using "setNumBspTask", Hama freezes. I do not understand what
>>>>> can go wrong.
>>>>
>>>> It looks like a program bug. Have you run your program in local mode?
>>>>
>>>> On Sat, Jun 27, 2015 at 8:03 AM, Behroz Sikander <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>> In the current thread, I mentioned 3 issues. Issues 1 and 3 are resolved, but issue number 2 is still giving me headaches.
>>>>>
>>>>> My problem:
>>>>> My cluster now consists of 3 machines, each of them (apparently) properly configured. From my master machine, when I start Hadoop and Hama, I can see the processes started on the other 2 machines. If I check the maximum tasks that my cluster can support, I get 9 (3 tasks on each machine).
>>>>>
>>>>> When I run the PI example, it uses 9 tasks and runs fine. When I run my program with 3 tasks, everything runs fine. But when I increase the tasks (to 4) by using "setNumBspTask", Hama freezes. I do not understand what can go wrong.
>>>>>
>>>>> I checked the log files and things look fine. I just sometimes get an exception that Hama was not able to delete the system directory (bsp.system.dir) defined in hama-site.xml.
>>>>>
>>>>> Any help or clue would be great.
>>>>>
>>>>> Regards,
>>>>> Behroz Sikander
>>>>>
>>>>> On Thu, Jun 25, 2015 at 1:13 PM, Behroz Sikander <[email protected]> wrote:
>>>>>
>>>>>> Thank you :)
>>>>>>
>>>>>> On Thu, Jun 25, 2015 at 12:14 AM, Edward J. Yoon <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> You can get the maximum number of available tasks with code like the following:
>>>>>>>
>>>>>>> BSPJobClient jobClient = new BSPJobClient(conf);
>>>>>>> ClusterStatus cluster = jobClient.getClusterStatus(true);
>>>>>>>
>>>>>>> // Set to maximum
>>>>>>> bsp.setNumBspTask(cluster.getMaxTasks());
>>>>>>>
>>>>>>> On Wed, Jun 24, 2015 at 11:20 PM, Behroz Sikander <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> 1) Thank you for this.
>>>>>>>> 2) Here are the images. I will look into the log files of the PI example.
>>>>>>>>
>>>>>>>> Result of JPS command on slave:
>>>>>>>> http://s17.postimg.org/gpwe2bbfj/Screen_Shot_2015_06_22_at_7_23_31_PM.png
>>>>>>>>
>>>>>>>> Result of JPS command on Master:
>>>>>>>> http://s14.postimg.org/s9922em5p/Screen_Shot_2015_06_22_at_7_23_42_PM.png
>>>>>>>>
>>>>>>>> 3) In my current case, I do not have any input submitted to the job. During run time, I directly fetch data from HDFS. So, I am looking for something like BSPJob.set*Max*NumBspTask().
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Behroz
>>>>>>>>
>>>>>>>> On Tue, Jun 23, 2015 at 12:57 AM, Edward J. Yoon <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> 1) You can get the filesystem URI from a configuration using "FileSystem fs = FileSystem.get(conf);". Of course, the fs.defaultFS property should be in hama-site.xml:
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>>   <name>fs.defaultFS</name>
>>>>>>>>>   <value>hdfs://host1.mydomain.com:9000/</value>
>>>>>>>>>   <description>
>>>>>>>>>     The name of the default file system. Either the literal string
>>>>>>>>>     "local" or a host:port for HDFS.
>>>>>>>>>   </description>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> 2) The 'bsp.tasks.maximum' is the number of tasks per node. It looks like a cluster configuration issue. Please run the Pi example and look at the logs for more details. NOTE: you cannot attach images to the mailing list, so I can't see them.
>>>>>>>>>
>>>>>>>>> 3) You can use the BSPJob.setNumBspTask(int) method. If input is provided, the number of BSP tasks is basically driven by the number of DFS blocks. I'll fix it to be more flexible in HAMA-956.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> On Tue, Jun 23, 2015 at 2:33 AM, Behroz Sikander <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> Recently, I moved from a single-machine setup to a 2-machine setup. I was successfully able to run my job that uses HDFS to get data. I have 3 trivial questions.
>>>>>>>>>>
>>>>>>>>>> 1- To access HDFS, I have to manually give the IP address of the server running HDFS. I thought that Hama would automatically pick it up from the configuration, but it does not. I am probably doing something wrong. Right now my code works by using the following.
>>>>>>>>>>
>>>>>>>>>> FileSystem fs = FileSystem.get(new URI("hdfs://server_ip:port/"), conf);
>>>>>>>>>>
>>>>>>>>>> 2- On my master server, when I start Hama it automatically starts Hama on the slave machine (all good). Both master and slave are set as groomservers. This means that I have 2 servers to run my job, which means that I can open more BSPPeerChild processes. If I submit my jar with 3 BSP tasks, everything works fine. But when I move to 4 tasks, Hama freezes. Here is the result of the JPS command on the slave.
>>>>>>>>>>
>>>>>>>>>> Result of JPS command on Master
>>>>>>>>>>
>>>>>>>>>> You can see that it is only opening tasks on the slaves but not on the master.
>>>>>>>>>>
>>>>>>>>>> Note: I tried to change the bsp.tasks.maximum property in hama-default.xml to 4, but got the same result.
>>>>>>>>>>
>>>>>>>>>> 3- I want my cluster to open as many BSPPeerChild processes as possible. Is there any setting I can use to achieve that? Or does Hama pick up the values from hama-default.xml to open tasks?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Behroz Sikander
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards, Edward J. Yoon
>>>>
>>>> --
>>>> Best Regards, Edward J. Yoon
>>
>> --
>> Best Regards, Edward J. Yoon
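A note on the bsp.tasks.maximum experiment quoted above: in the Hadoop-family configuration model, hama-default.xml holds the shipped defaults, and site-specific overrides normally belong in hama-site.xml on each groomserver (followed by a restart of the grooms). A hedged sketch of the override (the value 4 is just the number discussed in this thread):

```xml
<!-- In hama-site.xml on each groomserver; site settings override
     hama-default.xml, which may be replaced on upgrade. -->
<property>
  <name>bsp.tasks.maximum</name>
  <value>4</value>
  <description>Maximum number of BSP tasks run concurrently per groom.</description>
</property>
```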
