Ok, I found a strange thing: in my Hadoop folder ($HADOOP_HOME), there is a new file named "hs_err_pid4919.log".
The contents of the file are:

# Increase physical memory or swap space
# Check if swap backing store is full
# Use 64 bit Java on a 64 bit OS
# Decrease Java heap size (-Xmx/-Xms)
# Decrease number of Java threads
# Decrease Java thread stack sizes (-Xss)
# Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2809), pid=4919, tid=140564483778304
#
# JRE version: OpenJDK Runtime Environment (7.0_79-b14) (build 1.7.0_79-b14)
# Java VM: OpenJDK 64-Bit Server VM (24.79-b02 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 2.5.6
# Distribution: Ubuntu 14.04 LTS, package 7u79-2.5.6-0ubuntu1.14.04.1
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#

---------------  T H R E A D  ---------------

Current thread (0x00007fd7c0438800): JavaThread "PacketResponder: BP-1786576942-141.40.254.14-1441293753577:blk_1074136820_396012, type=HAS_DOWNSTREAM_IN_PIPELINE" daemon [_thread_new, id=11943, stack(0x00007fd7b80fa000,0x00007fd7b81fb000)]

Stack: [0x00007fd7b80fa000,0x00007fd7b81fb000],  sp=0x00007fd7b81f9be0,  free space=1022k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)

I think my DataNode process is crashing. I now know it is an out-of-memory error, but I am not sure of the cause.

On Thu, Sep 3, 2015 at 10:25 PM, Behroz Sikander <[email protected]> wrote:
> ok. HA = High Availability ?
>
> I am also trying to solve the following problem. But I do not understand
> why I get the exception, because my algorithm does not send a lot of data
> to the master.
> 'BSP task process exit with nonzero status of 1'
>
> Each slave node processes some data and sends back a double array of size
> 96 to the master machine. Recently, I was testing the algorithm on 8000
> files when it crashed.
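As an aside on the hs_err header quoted above: it notes that core dumps were disabled. A quick sketch for enabling them before the next crash, assuming a bash shell and that the DataNode JVM is started from (and so inherits the limits of) that same shell:

```shell
ulimit -c            # current core-file size limit; 0 means core dumps are off
ulimit -c unlimited  # lift the limit for this shell session only
# Restart the DataNode from this same shell so the JVM inherits the new limit.
```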
> This means that 8000 double arrays of size 96 are
> sent to the master to process. Once the master receives all the data, it gets
> out of sync and starts the processing again. Here is the calculation:
>
> 8000 * 96 * 8 (size of a double) = 6144000 bytes = ~6.144 MB
>
> I am not sure, but this does not seem like a lot of data, and I think the
> message manager that you mentioned should be able to handle it.
>
> Regards,
> Behroz
>
> On Tue, Sep 1, 2015 at 1:07 PM, Edward J. Yoon <[email protected]> wrote:
>
>> I'm reading the GroomServer code and its taskMonitorService. It seems
>> related to cluster HA.
>>
>> On Sat, Aug 29, 2015 at 1:16 PM, Edward J. Yoon <[email protected]> wrote:
>>
>> If my Groom Child Process fails for some reason, the processes are not
>> killed automatically
>> >
>> > I also experienced this problem before. I guess, if one of the processes
>> > crashes with OutOfMemory, the other processes wait for it indefinitely.
>> > This is a bug.
>> >
>> > On Sat, Aug 29, 2015 at 1:02 AM, Behroz Sikander <[email protected]> wrote:
>> >> Just another quick question. If my Groom Child Process fails for some
>> >> reason, the processes are not killed automatically. If I run the JPS
>> >> command, I can still see something like "3791 GroomServer$BSPPeerChild".
>> >> Is this the expected behavior ?
>> >>
>> >> I am using the latest Hama version (0.7.0).
>> >> Regards,
>> >> Behroz
>> >>
>> >> On Fri, Aug 28, 2015 at 4:12 PM, Behroz Sikander <[email protected]> wrote:
>> >>
>> >>> Ok, I will try it out.
>> >>>
>> >>> No, actually I am learning a lot by facing these problems. It is
>> >>> actually a good thing :D
>> >>>
>> >>> Regards,
>> >>> Behroz
>> >>>
>> >>> On Fri, Aug 28, 2015 at 5:52 AM, Edward J. Yoon <[email protected]> wrote:
>> >>>
>> >>>> > message managers. Hmmm, I will recheck my logic related to messages. Btw
>> >>>>
>> >>>> Serialization (like GraphJobMessage) is a good idea.
>> >>>> It stores multiple
>> >>>> messages in serialized form in a single object to reduce memory
>> >>>> usage and RPC overhead.
>> >>>>
>> >>>> > what is the limit of these message managers ? How much data at a single
>> >>>> > time can they handle ?
>> >>>>
>> >>>> It depends on memory.
>> >>>>
>> >>>> > P.S. Each day, as I move towards a bigger cluster, I run into
>> >>>> > problems (a lot of them :D).
>> >>>>
>> >>>> Haha, sorry for the inconvenience, and thanks for your reports.
>> >>>>
>> >>>> On Fri, Aug 28, 2015 at 11:25 AM, Behroz Sikander <[email protected]> wrote:
>> >>>> > Ok. So, I do have a memory problem. I will try to scale out.
>> >>>> >
>> >>>> > >> Each task processor has two message managers, one for outgoing and one
>> >>>> > >> for incoming. All these are handled in memory, so it sometimes
>> >>>> > >> requires large memory space.
>> >>>> > So, you mean that before barrier synchronization, I have a lot of data in
>> >>>> > the message managers. Hmmm, I will recheck my logic related to messages. Btw,
>> >>>> > what is the limit of these message managers ? How much data at a single
>> >>>> > time can they handle ?
>> >>>> >
>> >>>> > P.S. Each day, as I move towards a bigger cluster, I run into
>> >>>> > problems (a lot of them :D).
>> >>>> >
>> >>>> > Regards,
>> >>>> > Behroz Sikander
>> >>>> >
>> >>>> > On Fri, Aug 28, 2015 at 4:04 AM, Edward J. Yoon <[email protected]> wrote:
>> >>>> >
>> >>>> >> > for 3 Groom child processes + 2GB for the Ubuntu OS). Is this a correct
>> >>>> >> > understanding ?
>> >>>> >>
>> >>>> >> and,
>> >>>> >>
>> >>>> >> > on a big dataset. I think these exceptions have something to do with Ubuntu
>> >>>> >> > OS killing the Hama process due to lack of memory. So, I was curious about
>> >>>> >>
>> >>>> >> Yes, you're right.
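The GraphJobMessage-style bundling Edward describes earlier in this message can be sketched in plain java.io. This is a hypothetical helper for illustration only, not Hama's actual implementation: it packs many small double[] messages into one serialized buffer, so peers exchange one object per sync instead of thousands.

```java
import java.io.*;

// Sketch of message bundling: serialize many double[] payloads into a single
// byte[] (and back), so N small messages become one object on the wire.
// Hypothetical class, not Hama's real GraphJobMessage.
public class MessageBundle {

    public static byte[] pack(double[][] messages) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(messages.length);            // number of bundled messages
        for (double[] m : messages) {
            out.writeInt(m.length);               // length prefix per message
            for (double d : m) out.writeDouble(d);
        }
        out.flush();
        return bos.toByteArray();
    }

    public static double[][] unpack(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        double[][] messages = new double[in.readInt()][];
        for (int i = 0; i < messages.length; i++) {
            messages[i] = new double[in.readInt()];
            for (int j = 0; j < messages[i].length; j++) {
                messages[i][j] = in.readDouble();
            }
        }
        return messages;
    }
}
```

A real message manager would additionally stream the buffer through the RPC layer and might compress it; the point here is only the memory/RPC saving of one object versus thousands.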
>> >>>> >>
>> >>>> >> Each task processor has two message managers, one for outgoing and
>> >>>> >> one for incoming. All these are handled in memory, so it sometimes
>> >>>> >> requires large memory space. To solve the OutOfMemory issue, you
>> >>>> >> should scale out your cluster by increasing the number of nodes and
>> >>>> >> job tasks, or optimize your algorithm. Another option is a
>> >>>> >> disk-spillable message manager. This is not supported yet.
>> >>>> >>
>> >>>> >> On Fri, Aug 28, 2015 at 10:45 AM, Behroz Sikander <[email protected]> wrote:
>> >>>> >> > Hi,
>> >>>> >> > Yes. According to hama-default.xml, each machine will open 3 processes with
>> >>>> >> > 2GB memory each. This means that my VMs need at least 8GB memory (2GB each
>> >>>> >> > for 3 Groom child processes + 2GB for the Ubuntu OS). Is this a correct
>> >>>> >> > understanding ?
>> >>>> >> >
>> >>>> >> > I recently ran into the following exceptions when I was trying to run Hama
>> >>>> >> > on a big dataset. I think these exceptions have something to do with Ubuntu
>> >>>> >> > OS killing the Hama process due to lack of memory. So, I was curious about
>> >>>> >> > my configuration.
>> >>>> >> > 'BSP task process exit with nonzero status of 137.'
>> >>>> >> > 'BSP task process exit with nonzero status of 1'
>> >>>> >> >
>> >>>> >> > Regards,
>> >>>> >> > Behroz
>> >>>> >> >
>> >>>> >> > On Fri, Aug 28, 2015 at 3:04 AM, Edward J. Yoon <[email protected]> wrote:
>> >>>> >> >
>> >>>> >> >> Hi,
>> >>>> >> >>
>> >>>> >> >> You can change the max tasks per node by setting the below property in
>> >>>> >> >> hama-site.xml.
>> >>>> >> >> :-)
>> >>>> >> >>
>> >>>> >> >> <property>
>> >>>> >> >>   <name>bsp.tasks.maximum</name>
>> >>>> >> >>   <value>3</value>
>> >>>> >> >>   <description>The maximum number of BSP tasks that will be run
>> >>>> >> >>   simultaneously by a groom server.</description>
>> >>>> >> >> </property>
>> >>>> >> >>
>> >>>> >> >> On Fri, Aug 28, 2015 at 5:18 AM, Behroz Sikander <[email protected]> wrote:
>> >>>> >> >> > Hi,
>> >>>> >> >> > Recently, I noticed that my Hama deployment is only opening 3 processes
>> >>>> >> >> > per machine. This is because of the configuration settings in the
>> >>>> >> >> > default Hama file.
>> >>>> >> >> >
>> >>>> >> >> > My question is: why 3, and not 5 or 7 ? What criteria should be
>> >>>> >> >> > considered if I want to increase the value ?
>> >>>> >> >> >
>> >>>> >> >> > Regards,
>> >>>> >> >> > Behroz
>> >>>> >> >>
>> >>>> >> >> --
>> >>>> >> >> Best Regards, Edward J. Yoon
>> >>>> >>
>> >>>> >> --
>> >>>> >> Best Regards, Edward J. Yoon
>> >>>>
>> >>>> --
>> >>>> Best Regards, Edward J. Yoon
>> >>>
>> >
>> > --
>> > Best Regards, Edward J. Yoon
>>
>> --
>> Best Regards, Edward J. Yoon
>
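A footnote on the size estimate earlier in the thread: the arithmetic checks out, with one nitpick. The 8 in "8000 * 96 * 8" is the size of a double (64 bits), not a float, which is what the Double arrays actually contain. A two-line sanity check:

```java
public class PayloadEstimate {
    public static void main(String[] args) {
        // 8000 messages x 96 doubles x 8 bytes per double (a Java double is 64-bit)
        long totalBytes = 8000L * 96L * 8L;
        System.out.println(totalBytes + " bytes = ~" + (totalBytes / 1_000_000.0) + " MB");
        // prints: 6144000 bytes = ~6.144 MB
    }
}
```

Note this is only the raw payload; the on-wire size will be somewhat larger once per-message serialization and RPC framing overhead are added, but it should still be well within what an in-memory message manager can hold.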
