More info on this: I noticed that only 2 machines were failing with OutOfMemory. After some digging, I found that the swap space was 0 on these 2 machines, while the others had 1 GB of swap. I added swap to these machines and it worked. But, as expected, the next run of the algorithm with more data crashed again. This time GroomChildProcess crashed with the following log message:
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007fa100000, 42467328, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 42467328 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/behroz/Documents/Packages/tmp_data/hama_tmp/bsp/local/groomServer/attempt_201509040050_0004_000006_0/work/hs_err_pid28850.log

My slave machines have 8 GB of RAM, 4 CPUs, a 20 GB hard drive and 1 GB of swap. I run 3 groom child processes, each taking 2 GB of RAM. Apart from the GroomChildProcess instances, I also have GroomServer, DataNode and TaskManager running on each slave. After assigning 2 GB of RAM to each of the 3 child groom processes (6 GB in total), only 2 GB of RAM is left for everything else. Do you think this is the problem?

Regards,
Behroz

On Thu, Sep 3, 2015 at 11:39 PM, Behroz Sikander wrote:

Ok, I found a strange thing. In my hadoop folder, I found a new file named "hs_err_pid4919.log" inside the $HADOOP_HOME directory.

The contents of the file are:

# Increase physical memory or swap space
# Check if swap backing store is full
# Use 64 bit Java on a 64 bit OS
# Decrease Java heap size (-Xmx/-Xms)
# Decrease number of Java threads
# Decrease Java thread stack sizes (-Xss)
# Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
# Out of Memory Error (os_linux.cpp:2809), pid=4919, tid=140564483778304
#
# JRE version: OpenJDK Runtime Environment (7.0_79-b14) (build 1.7.0_79-b14)
# Java VM: OpenJDK 64-Bit Server VM (24.79-b02 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 2.5.6
# Distribution: Ubuntu 14.04 LTS, package 7u79-2.5.6-0ubuntu1.14.04.1
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#

--------------- T H R E A D ---------------

Current thread (0x00007fd7c0438800): JavaThread "PacketResponder: BP-1786576942-141.40.254.14-1441293753577:blk_1074136820_396012, type=HAS_DOWNSTREAM_IN_PIPELINE" daemon [_thread_new, id=11943, stack(0x00007fd7b80fa000,0x00007fd7b81fb000)]

Stack: [0x00007fd7b80fa000,0x00007fd7b81fb000], sp=0x00007fd7b81f9be0, free space=1022k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)

I think my DataNode process is crashing. I now know that it is an out-of-memory error, but I am not sure of the reason.

On Thu, Sep 3, 2015 at 10:25 PM, Behroz Sikander wrote:

Ok. HA = High Availability?

I am also trying to solve the following problem, but I do not understand why I get the exception, because my algorithm does not send a lot of data to the master:
'BSP task process exit with nonzero status of 1'

Each slave node processes some data and sends back a double array of size 96 to the master machine. Recently, I was testing the algorithm on 8000 files when it crashed. This means that 8000 double arrays of size 96 are sent to the master to process. Once the master receives all the data, it gets out of sync and starts the processing again. Here is the calculation:

8000 * 96 * 8 bytes (the size of a double) = 6,144,000 bytes ≈ 6.1 MB

I am not sure, but this does not seem like a lot of data, and I think the message manager that you mentioned should be able to handle it.

Regards,
Behroz

On Tue, Sep 1, 2015 at 1:07 PM, Edward J. Yoon wrote:

I'm reading the GroomServer code and its taskMonitorService. It seems related to cluster HA.

On Sat, Aug 29, 2015 at 1:16 PM, Edward J.
Yoon wrote:

> If my Groom Child Process fails for some reason, the processes are not killed automatically

I also experienced this problem before. I guess that if one of the processes crashes with OutOfMemory, the other processes wait for it indefinitely. This is a bug.

On Sat, Aug 29, 2015 at 1:02 AM, Behroz Sikander wrote:

Just another quick question. If my Groom Child Process fails for some reason, the processes are not killed automatically. If I run the jps command, I can still see something like "3791 GroomServer$BSPPeerChild". Is this the expected behavior?

I am using the latest Hama version (0.7.0).
Regards,
Behroz

On Fri, Aug 28, 2015 at 4:12 PM, Behroz Sikander wrote:

Ok, I will try it out.

No, actually I am learning a lot by facing these problems. It is actually a good thing :D

Regards,
Behroz

On Fri, Aug 28, 2015 at 5:52 AM, Edward J. Yoon wrote:

> message managers. Hmmm, I will recheck my logic related to messages. Btw

Serialization (like GraphJobMessage) is a good idea. It stores multiple messages in serialized form in a single object to reduce memory usage and RPC overhead.

> what is the limit of these message managers? How much data at a single time can they handle?

It depends on memory.

> P.S. Each day, as I am moving towards a big cluster, I am running into problems (a lot of them :D).

Haha, sorry for the inconvenience, and thanks for your reports.

On Fri, Aug 28, 2015 at 11:25 AM, Behroz Sikander wrote:

Ok. So, I do have a memory problem. I will try to scale out.

> Each task processor has two message managers, one for outgoing and one for incoming. All these are handled in memory, so it sometimes requires a large memory space.

So, you mean that before barrier synchronization, I have a lot of data in the message managers. Hmmm, I will recheck my logic related to messages. Btw, what is the limit of these message managers? How much data can they handle at a single time?

P.S. Each day, as I am moving towards a big cluster, I am running into problems (a lot of them :D).

Regards,
Behroz Sikander

On Fri, Aug 28, 2015 at 4:04 AM, Edward J. Yoon wrote:

> for 3 Groom child process + 2GB for Ubuntu OS). Is this correct understanding?

and,

> on a big dataset. I think these exceptions have something to do with Ubuntu OS killing the hama process due to lack of memory. So, I was curious about

Yes, you're right.

Each task processor has two message managers, one for outgoing and one for incoming. All these are handled in memory, so it sometimes requires a large memory space. To solve the OutOfMemory issue, you should scale out your cluster by increasing the number of nodes and job tasks, or optimize your algorithm. Another option is a disk-spillable message manager; this is not supported yet.

On Fri, Aug 28, 2015 at 10:45 AM, Behroz Sikander wrote:

Hi,
Yes.
According to hama-default.xml, each machine will open 3 processes with 2 GB of memory each. This means that my VMs need at least 8 GB of memory (2 GB each for the 3 groom child processes + 2 GB for the Ubuntu OS). Is this correct understanding?

I recently ran into the following exceptions when I was trying to run Hama on a big dataset. I think these exceptions have something to do with the Ubuntu OS killing the Hama process due to lack of memory. So, I was curious about my configuration:

'BSP task process exit with nonzero status of 137.'
'BSP task process exit with nonzero status of 1'

Regards,
Behroz

On Fri, Aug 28, 2015 at 3:04 AM, Edward J. Yoon wrote:

Hi,

You can change the maximum tasks per node by setting the property below in hama-site.xml. :-)

<property>
  <name>bsp.tasks.maximum</name>
  <value>3</value>
  <description>The maximum number of BSP tasks that will be run
  simultaneously by a groom server.</description>
</property>

On Fri, Aug 28, 2015 at 5:18 AM, Behroz Sikander wrote:

Hi,
Recently, I noticed that my Hama deployment is only opening 3 processes per machine. This is because of the configuration settings in the default Hama file.

My question is: why 3 and not 5 or 7? What criteria should be considered if I want to increase the value?

Regards,
Behroz

--
Best Regards, Edward J. Yoon
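The two memory figures debated in this thread can be sanity-checked with plain shell arithmetic. Nothing below touches Hama itself; the numbers are the ones quoted in the messages (8 GB nodes, 3 tasks at 2 GB heap, 8000 arrays of 96 doubles):

```shell
# Per-slave memory budget from the thread: an 8 GB node running
# 3 BSP child tasks (bsp.tasks.maximum) with a 2 GB heap each.
echo $(( 8 - 3 * 2 ))      # prints 2 -> GB left for GroomServer, DataNode,
                           #             TaskManager and the OS combined

# Raw message payload: 8000 result arrays, each a double[96], 8 bytes per double.
echo $(( 8000 * 96 * 8 ))  # prints 6144000 -> ~6.1 MB of raw payload
```

Note that the ~6.1 MB is only the raw payload: each message also carries Java object and RPC overhead inside the in-memory message manager, and the 2 GB of headroom must additionally cover native (non-heap) memory for all four JVMs, which is exactly what the os::commit_memory / malloc failure above complains about.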

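Since zero swap was the trigger on the two failing slaves, a minimal check-and-enable sequence for Ubuntu 14.04 looks like the sketch below; the swap-file path and the 1 GB size are illustrative assumptions, not values taken from the thread:

```shell
# Inspect current memory and swap; the failing slaves showed a swap total of 0.
free -m
swapon -s

# Create and enable a 1 GB swap file (path and size are examples only).
sudo fallocate -l 1G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Persist the swap file across reboots.
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
```

As a side note, 'BSP task process exit with nonzero status of 137' decodes as 128 + 9, i.e. the task was killed with SIGKILL; that is consistent with the kernel OOM killer terminating the process, rather than a Java-level OutOfMemoryError.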