I think I have more information on the issue. I did some debugging and
found something quite strange.

If I run my job with 6 tasks (3 tasks run on MACHINE1 and the other 3 run
on MACHINE2):

 - The 3 tasks on MACHINE1 are frozen, and the strange thing is that the
processes do not even enter the setup function of the BSP class. I have print
statements in the setup function of the BSP class and nothing is printed.
The task log files are empty (zero size):

drwxrwxr-x  2 behroz behroz 4096 Jun 28 16:29 .
drwxrwxr-x 99 behroz behroz 4096 Jun 28 16:28 ..
-rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000000_0.err
-rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000000_0.log
-rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000001_0.err
-rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000001_0.log
-rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000002_0.err
-rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000002_0.log

- On MACHINE2, the code enters the setup function of the BSP class and prints
its output. Note the sizes of the generated files. How is it possible that
3 tasks enter the BSP code and the other 3 do not?

drwxrwxr-x  2 behroz behroz 4096 Jun 28 16:39 .
drwxrwxr-x 82 behroz behroz 4096 Jun 28 16:39 ..
-rw-rw-r--  1 behroz behroz  659 Jun 28 16:39 attempt_201506281639_0001_000003_0.err
-rw-rw-r--  1 behroz behroz 1441 Jun 28 16:39 attempt_201506281639_0001_000003_0.log
-rw-rw-r--  1 behroz behroz  659 Jun 28 16:39 attempt_201506281639_0001_000004_0.err
-rw-rw-r--  1 behroz behroz 1368 Jun 28 16:39 attempt_201506281639_0001_000004_0.log
-rw-rw-r--  1 behroz behroz  659 Jun 28 16:39 attempt_201506281639_0001_000005_0.err
-rw-rw-r--  1 behroz behroz 1441 Jun 28 16:39 attempt_201506281639_0001_000005_0.log

- The Hama Groom log file on MACHINE1 (which is frozen) shows:
[time] INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201506281639_0001_000001_0' has started.
[time] INFO org.apache.hama.bsp.GroomServer: Launch 3 tasks.
[time] INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201506281639_0001_000002_0' has started.
[time] INFO org.apache.hama.bsp.GroomServer: Launch 3 tasks.
[time] INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201506281639_0001_000000_0' has started.

- The Hama Groom log file on MACHINE2 shows:
[time] INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201506281639_0001_000003_0' has started.
[time] INFO org.apache.hama.bsp.GroomServer: Launch 3 tasks.
[time] INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201506281639_0001_000004_0' has started.
[time] INFO org.apache.hama.bsp.GroomServer: Launch 3 tasks.
[time] INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201506281639_0001_000005_0' has started.
[time] INFO org.apache.hama.bsp.GroomServer: Task attempt_201506281639_0001_000004_0 is *done*.
[time] INFO org.apache.hama.bsp.GroomServer: Task attempt_201506281639_0001_000003_0 is *done*.
[time] INFO org.apache.hama.bsp.GroomServer: Task attempt_201506281639_0001_000005_0 is *done*.
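
One way to cross-check Groom logs like these is to list every attempt that
logged "has started" but never "is done". Below is a minimal sketch in plain
Java (no Hama classes). The class name GroomLogCheck, the method name and the
regular expression are assumptions derived only from the log lines pasted
above; they are not part of Hama, and the real log format may differ.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hama: finds attempts that started but never finished.
public class GroomLogCheck {
    // Matches "Task 'attempt_..._0' has started." and "Task attempt_..._0 is *done*."
    // \s+ lets a match span a line break, since mail clients may wrap the log lines.
    private static final Pattern EVENT = Pattern.compile(
        "Task\\s+'?(attempt_[0-9_]+)'?\\s+(has started|is \\*?done\\*?)");

    /** Returns the attempt IDs that were started but never reported as done. */
    public static Set<String> startedButNotDone(String logText) {
        Set<String> pending = new LinkedHashSet<>();
        Matcher m = EVENT.matcher(logText);
        while (m.find()) {
            String attemptId = m.group(1);
            if (m.group(2).equals("has started")) {
                pending.add(attemptId);      // seen a start event
            } else {
                pending.remove(attemptId);   // seen the matching done event
            }
        }
        return pending;
    }

    public static void main(String[] args) {
        // Small excerpt in the same shape as the logs quoted in this mail.
        String sample = String.join("\n",
            "[time] INFO org.apache.hama.bsp.GroomServer: Task",
            "'attempt_201506281639_0001_000003_0' has started.",
            "[time] INFO org.apache.hama.bsp.GroomServer: Launch 3 tasks.",
            "[time] INFO org.apache.hama.bsp.GroomServer: Task",
            "attempt_201506281639_0001_000003_0 is *done*.",
            "[time] INFO org.apache.hama.bsp.GroomServer: Task",
            "'attempt_201506281639_0001_000000_0' has started.");
        System.out.println(startedButNotDone(sample));
    }
}
```

Fed the two excerpts above, a check like this would flag attempts 000000 to
000002 (started, never done) and nothing for 000003 to 000005, which matches
what the empty task log files suggest.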

Any clue what might be going wrong?

Regards,
Behroz



On Sat, Jun 27, 2015 at 1:13 PM, Behroz Sikander <[email protected]> wrote:

> Here is the log file from that folder
>
> 15/06/27 11:10:34 INFO ipc.Server: Starting Socket Reader #1 for port 61001
> 15/06/27 11:10:34 INFO ipc.Server: IPC Server Responder: starting
> 15/06/27 11:10:34 INFO ipc.Server: IPC Server listener on 61001: starting
> 15/06/27 11:10:34 INFO ipc.Server: IPC Server handler 0 on 61001: starting
> 15/06/27 11:10:34 INFO ipc.Server: IPC Server handler 1 on 61001: starting
> 15/06/27 11:10:34 INFO ipc.Server: IPC Server handler 2 on 61001: starting
> 15/06/27 11:10:34 INFO ipc.Server: IPC Server handler 3 on 61001: starting
> 15/06/27 11:10:34 INFO message.HamaMessageManagerImpl: BSPPeer address:b178b33b16cc port:61001
> 15/06/27 11:10:34 INFO ipc.Server: IPC Server handler 4 on 61001: starting
> 15/06/27 11:10:34 INFO sync.ZKSyncClient: Initializing ZK Sync Client
> 15/06/27 11:10:34 INFO sync.ZooKeeperSyncClientImpl: Start connecting to Zookeeper! At b178b33b16cc/172.17.0.7:61001
> 15/06/27 11:10:37 INFO ipc.Server: Stopping server on 61001
> 15/06/27 11:10:37 INFO ipc.Server: IPC Server handler 0 on 61001: exiting
> 15/06/27 11:10:37 INFO ipc.Server: Stopping IPC Server listener on 61001
> 15/06/27 11:10:37 INFO ipc.Server: IPC Server handler 1 on 61001: exiting
> 15/06/27 11:10:37 INFO ipc.Server: IPC Server handler 2 on 61001: exiting
> 15/06/27 11:10:37 INFO ipc.Server: Stopping IPC Server Responder
> 15/06/27 11:10:37 INFO ipc.Server: IPC Server handler 3 on 61001: exiting
> 15/06/27 11:10:37 INFO ipc.Server: IPC Server handler 4 on 61001: exiting
>
>
> And my console shows the following output. Hama is frozen right now.
> 15/06/27 11:10:32 INFO bsp.BSPJobClient: Running job: job_201506262331_0003
> 15/06/27 11:10:35 INFO bsp.BSPJobClient: Current supersteps number: 0
> 15/06/27 11:10:38 INFO bsp.BSPJobClient: Current supersteps number: 2
>
> On Sat, Jun 27, 2015 at 1:07 PM, Edward J. Yoon <[email protected]>
> wrote:
>
>> Please check the task logs in $HAMA_HOME/logs/tasklogs folder.
>>
>> On Sat, Jun 27, 2015 at 8:03 PM, Behroz Sikander <[email protected]>
>> wrote:
>> > Yeah, I thought so too. I ran the program through Eclipse with 20 tasks
>> > and it works fine.
>> >
>> > On Sat, Jun 27, 2015 at 1:00 PM, Edward J. Yoon <[email protected]>
>> > wrote:
>> >
>> >> > When I run the PI example, it uses 9 tasks and runs fine. When I run my
>> >> > program with 3 tasks, everything runs fine. But when I increase the tasks
>> >> > (to 4) by using "setNumBspTask", Hama freezes. I do not understand what can
>> >> > go wrong.
>> >>
>> >> It looks like a program bug. Have you run your program in local mode?
>> >>
>> >> On Sat, Jun 27, 2015 at 8:03 AM, Behroz Sikander <[email protected]>
>> >> wrote:
>> >> > Hi,
>> >> > In the current thread, I mentioned 3 issues. Issues 1 and 3 are resolved, but
>> >> > issue number 2 is still giving me headaches.
>> >> >
>> >> > My problem:
>> >> > My cluster now consists of 3 machines, each of them properly configured
>> >> > (apparently). When I start Hadoop and Hama from my master machine, I can
>> >> > see the processes started on the other 2 machines. If I check the maximum
>> >> > number of tasks that my cluster can support, I get 9 (3 tasks on each machine).
>> >> >
>> >> > When I run the PI example, it uses 9 tasks and runs fine. When I run my
>> >> > program with 3 tasks, everything runs fine. But when I increase the tasks
>> >> > (to 4) by using "setNumBspTask", Hama freezes. I do not understand what can
>> >> > go wrong.
>> >> >
>> >> > I checked the log files and things look fine. I just sometimes get an
>> >> > exception that Hama was not able to delete the system directory
>> >> > (bsp.system.dir) defined in hama-site.xml.
>> >> >
>> >> > Any help or clue would be great.
>> >> >
>> >> > Regards,
>> >> > Behroz Sikander
>> >> >
>> >> > On Thu, Jun 25, 2015 at 1:13 PM, Behroz Sikander <[email protected]> wrote:
>> >> >
>> >> >> Thank you :)
>> >> >>
>> >> >> On Thu, Jun 25, 2015 at 12:14 AM, Edward J. Yoon <[email protected]> wrote:
>> >> >>
>> >> >>> Hi,
>> >> >>>
>> >> >>> You can get the maximum number of available tasks with code like the following:
>> >> >>>
>> >> >>>     BSPJobClient jobClient = new BSPJobClient(conf);
>> >> >>>     ClusterStatus cluster = jobClient.getClusterStatus(true);
>> >> >>>
>> >> >>>     // Set to maximum
>> >> >>>     bsp.setNumBspTask(cluster.getMaxTasks());
>> >> >>>
>> >> >>>
>> >> >>> On Wed, Jun 24, 2015 at 11:20 PM, Behroz Sikander <[email protected]> wrote:
>> >> >>> > Hi,
>> >> >>> > 1) Thank you for this.
>> >> >>> > 2) Here are the images. I will look into the log files of the PI example.
>> >> >>> >
>> >> >>> > *Result of JPS command on slave*
>> >> >>> > http://s17.postimg.org/gpwe2bbfj/Screen_Shot_2015_06_22_at_7_23_31_PM.png
>> >> >>> >
>> >> >>> > *Result of JPS command on Master*
>> >> >>> > http://s14.postimg.org/s9922em5p/Screen_Shot_2015_06_22_at_7_23_42_PM.png
>> >> >>> >
>> >> >>> > 3) In my current case, I do not have any input submitted to the job.
>> >> >>> > At run time, I fetch data directly from HDFS. So, I am looking for
>> >> >>> > something like BSPJob.set*Max*NumBspTask().
>> >> >>> >
>> >> >>> > Regards,
>> >> >>> > Behroz
>> >> >>> >
>> >> >>> >
>> >> >>> >
>> >> >>> > On Tue, Jun 23, 2015 at 12:57 AM, Edward J. Yoon <[email protected]> wrote:
>> >> >>> >
>> >> >>> >> Hello,
>> >> >>> >>
>> >> >>> >> 1) You can get the filesystem URI from a configuration using
>> >> >>> >> "FileSystem fs = FileSystem.get(conf);". Of course, the fs.defaultFS
>> >> >>> >> property should be in hama-site.xml:
>> >> >>> >>
>> >> >>> >>   <property>
>> >> >>> >>     <name>fs.defaultFS</name>
>> >> >>> >>     <value>hdfs://host1.mydomain.com:9000/</value>
>> >> >>> >>     <description>
>> >> >>> >>       The name of the default file system. Either the literal string
>> >> >>> >>       "local" or a host:port for HDFS.
>> >> >>> >>     </description>
>> >> >>> >>   </property>
>> >> >>> >>
>> >> >>> >> 2) 'bsp.tasks.maximum' is the number of tasks per node. It looks like a
>> >> >>> >> cluster configuration issue. Please run the Pi example and look at the
>> >> >>> >> logs for more details. NOTE: you cannot attach images to the mailing
>> >> >>> >> list, so I can't see them.
>> >> >>> >>
>> >> >>> >> 3) You can use the BSPJob.setNumBspTask(int) method. If input is
>> >> >>> >> provided, the number of BSP tasks is basically driven by the number of
>> >> >>> >> DFS blocks. I'll fix it to be more flexible in HAMA-956.
>> >> >>> >>
>> >> >>> >> Thanks!
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> On Tue, Jun 23, 2015 at 2:33 AM, Behroz Sikander <[email protected]> wrote:
>> >> >>> >> > Hi,
>> >> >>> >> > Recently, I moved from a single-machine setup to a 2-machine setup. I was
>> >> >>> >> > able to successfully run my job, which uses HDFS to get data. I have 3
>> >> >>> >> > trivial questions:
>> >> >>> >> >
>> >> >>> >> > 1- To access HDFS, I have to manually give the IP address of the server
>> >> >>> >> > running HDFS. I thought that Hama would automatically pick it up from the
>> >> >>> >> > configuration, but it does not. I am probably doing something wrong. Right
>> >> >>> >> > now my code works by using the following:
>> >> >>> >> >
>> >> >>> >> > FileSystem fs = FileSystem.get(new URI("hdfs://server_ip:port/"), conf);
>> >> >>> >> >
>> >> >>> >> > 2- On my master server, when I start Hama it automatically starts Hama on
>> >> >>> >> > the slave machine (all good). Both master and slave are set as groomservers.
>> >> >>> >> > This means that I have 2 servers to run my job, so I can open
>> >> >>> >> > more BSPPeerChild processes. If I submit my jar with 3 BSP tasks,
>> >> >>> >> > everything works fine. But when I move to 4 tasks, Hama freezes. Here is the
>> >> >>> >> > result of the JPS command on the slave.
>> >> >>> >> >
>> >> >>> >> >
>> >> >>> >> > Result of JPS command on Master
>> >> >>> >> >
>> >> >>> >> >
>> >> >>> >> >
>> >> >>> >> > You can see that it is only opening tasks on slaves but not on master.
>> >> >>> >> >
>> >> >>> >> > Note: I tried to change the bsp.tasks.maximum property in hama-default.xml
>> >> >>> >> > to 4 but still got the same result.
>> >> >>> >> >
>> >> >>> >> > 3- I want my cluster to open as many BSPPeerChild processes as possible. Is
>> >> >>> >> > there any setting I can use to achieve that? Or does Hama pick up the
>> >> >>> >> > values from hama-default.xml to open tasks?
>> >> >>> >> >
>> >> >>> >> >
>> >> >>> >> > Regards,
>> >> >>> >> >
>> >> >>> >> > Behroz Sikander
>> >> >>> >>
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> --
>> >> >>> >> Best Regards, Edward J. Yoon
>> >> >>> >>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> --
>> >> >>> Best Regards, Edward J. Yoon
>> >> >>>
>> >> >>
>> >> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Best Regards, Edward J. Yoon
>> >>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>>
>
>
