Hi, this is the exception I have been getting in the map-reduce job:

java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
        at org.apache.hadoop.util.Shell.run(Shell.java:134)
        at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:321)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1199)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:857)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
        at org.apache.hadoop.mapred.Child.main(Child.java:155)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
        at java.lang.ProcessImpl.start(ProcessImpl.java:65)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
        ... 10 more
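The "error=12" here comes from fork(): when Hadoop shells out to "bash" (via DF.getAvailable above), the child process momentarily needs as much virtual address space as the parent JVM, so a large -Xmx on a box with little swap makes the fork fail. A rough sketch of the two remedies usually suggested — the slot count and heap size below are assumptions to tune for your nodes, not recommendations:

```shell
# Option 1: let the kernel overcommit virtual memory so fork() from a
# large JVM can succeed (run as root; 1 = always overcommit):
#   sysctl -w vm.overcommit_memory=1

# Option 2: size the per-task heap to the task slots actually configured.
# With 3 map + 3 reduce slots per node (as in the config quoted below),
# the worst case is 6 concurrent child JVMs:
slots=6
heap_mb=1024                 # hypothetical -Xmx1024m per child task
echo $((slots * heap_mb))    # 6144 MB total, well under 7.5 GB of RAM
```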
On Fri, Apr 17, 2009 at 10:09 PM, Rakhi Khatwani <rakhi.khatw...@gmail.com> wrote:

> Hi,
> It's been several days since we have been trying to stabilize
> hadoop/hbase on an EC2 cluster, but we have failed to do so.
> We still come across frequent region server failures, scanner timeout
> exceptions, OS-level deadlocks, etc.
>
> And today, while listing the tables on hbase, I got the following
> exception:
>
> hbase(main):001:0> list
> 09/04/17 13:57:18 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
> 09/04/17 13:57:19 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
> 09/04/17 13:57:20 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
> 09/04/17 13:57:20 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Zzzzz...
> 09/04/17 13:57:20 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could not be reached after 1 tries, giving up.
> 09/04/17 13:57:21 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
> 09/04/17 13:57:22 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
> 09/04/17 13:57:23 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
> 09/04/17 13:57:23 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Zzzzz...
> 09/04/17 13:57:23 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could not be reached after 1 tries, giving up.
> 09/04/17 13:57:26 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
> 09/04/17 13:57:27 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
> 09/04/17 13:57:28 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
> 09/04/17 13:57:28 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Zzzzz...
> 09/04/17 13:57:28 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could not be reached after 1 tries, giving up.
> 09/04/17 13:57:29 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
> 09/04/17 13:57:30 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
> 09/04/17 13:57:31 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
> 09/04/17 13:57:31 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Zzzzz...
> 09/04/17 13:57:31 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could not be reached after 1 tries, giving up.
> 09/04/17 13:57:34 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
> 09/04/17 13:57:35 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
> 09/04/17 13:57:36 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
> 09/04/17 13:57:36 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Zzzzz...
>
> But if I check the UI, the hbase master is still up (I tried refreshing it
> several times).
>
> I have also been getting a lot of other exceptions from time to time, including
> region servers going down (which happens very frequently and causes heavy
> data loss, on production data at that), scanner timeout exceptions,
> cannot-allocate-memory exceptions, etc.
>
> I am working on an Amazon EC2 Large cluster with 6 nodes,
> with each node having the following hardware configuration:
>
> - Large Instance: 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores
>   with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit
>   platform
>
> I am using hadoop-0.19.0 and hbase 0.19.0 (resynced to all the nodes and
> made sure that there is a symbolic link to hadoop-site.xml from hbase/conf).
>
> The following is my configuration in hadoop-site.xml:
>
> <configuration>
>
>   <property>
>     <name>hadoop.tmp.dir</name>
>     <value>/mnt/hadoop</value>
>   </property>
>
>   <property>
>     <name>fs.default.name</name>
>     <value>hdfs://domU-12-31-39-00-E5-D2.compute-1.internal:50001</value>
>   </property>
>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>domU-12-31-39-00-E5-D2.compute-1.internal:50002</value>
>   </property>
>
>   <property>
>     <name>tasktracker.http.threads</name>
>     <value>80</value>
>   </property>
>
>   <property>
>     <name>mapred.tasktracker.map.tasks.maximum</name>
>     <value>3</value>
>   </property>
>
>   <property>
>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>     <value>3</value>
>   </property>
>
>   <property>
>     <name>mapred.output.compress</name>
>     <value>true</value>
>   </property>
>
>   <property>
>     <name>mapred.output.compression.type</name>
>     <value>BLOCK</value>
>   </property>
>
>   <property>
>     <name>dfs.client.block.write.retries</name>
>     <value>3</value>
>   </property>
>
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value>-Xmx4096m</value>
>   </property>
>
> I gave this a high value since the RAM on each node is 7.5 GB, though I am
> not sure of this setting.
> ** I got the Cannot Allocate Memory exception for the first time after
> making this change. Going through the archives, someone suggested enabling
> memory overcommit; I am not sure of that either. **
>
>   <property>
>     <name>dfs.datanode.max.xcievers</name>
>     <value>4096</value>
>   </property>
>
> As suggested by some of you,
> I guess this solved the data xceivers exception on hadoop.
>
>   <property>
>     <name>dfs.datanode.handler.count</name>
>     <value>10</value>
>   </property>
>
>   <property>
>     <name>mapred.task.timeout</name>
>     <value>0</value>
>     <description>The number of milliseconds before a task will be
>     terminated if it neither reads an input, writes an output, nor
>     updates its status string.
>     </description>
>   </property>
>
> This property has been set because I have been getting a lot of
> "Cannot report in 602 seconds... killing" exceptions.
>
>   <property>
>     <name>mapred.tasktracker.expiry.interval</name>
>     <value>360000</value>
>     <description>Expert: The time-interval, in milliseconds, after which
>     a tasktracker is declared 'lost' if it doesn't send heartbeats.
>     </description>
>   </property>
>
>   <property>
>     <name>dfs.datanode.socket.write.timeout</name>
>     <value>0</value>
>   </property>
>
> This is to avoid socket timeout exceptions.
>
>   <property>
>     <name>dfs.replication</name>
>     <value>5</value>
>     <description>Default block replication.
>     The actual number of replications can be specified when the file is
>     created. The default is used if replication is not specified at
>     create time.
>     </description>
>   </property>
>
>   <property>
>     <name>mapred.job.reuse.jvm.num.tasks</name>
>     <value>-1</value>
>     <description>How many tasks to run per jvm. If set to -1, there is
>     no limit.
>     </description>
>   </property>
>
> </configuration>
>
> And the following is the configuration in hbase-site.xml:
>
> <configuration>
>
>   <property>
>     <name>hbase.master</name>
>     <value>domU-12-31-39-00-E5-D2.compute-1.internal:60000</value>
>   </property>
>
>   <property>
>     <name>hbase.rootdir</name>
>     <value>hdfs://domU-12-31-39-00-E5-D2.compute-1.internal:50001/hbase</value>
>   </property>
>
>   <property>
>     <name>hbase.regionserver.lease.period</name>
>     <value>12600000</value>
>     <description>HRegion server lease period in milliseconds. Default is
>     60 seconds.
>     Clients must report in within this period else they are
>     considered dead.</description>
>   </property>
>
> I have set this because there is a map-reduce program which takes almost
> 3-4 minutes to process a row, 7 minutes in the worst case, so the value
> was calculated as (7*60*1000) * 30 = 12600000,
> where 7*60*1000 is the time to process a row in ms,
> and 30 is the default hbase scanner caching.
> So I shouldn't be getting scanner timeout exceptions.
>
> ** I made this change today, and I haven't come across a scanner timeout
> exception today. **
>
>   <property>
>     <name>hbase.master.lease.period</name>
>     <value>3600000</value>
>     <description>HMaster server lease period in milliseconds. Default is
>     120 seconds. Region servers must report in within this period else
>     they are considered dead. On a loaded cluster, you may need to up
>     this period.</description>
>   </property>
>
> </configuration>
>
> Any suggestions on changes to the configuration?
>
> My main concern is the region servers going down from time to time, which
> happens very frequently; because of it my map-reduce tasks hang and the
> entire application fails :(
>
> I have tried almost all the suggestions you mentioned except separating
> the datanodes from the computational nodes, which I plan to do tomorrow.
> Has it been tried before?
> And what would be your recommendation? How many nodes should I use as
> datanodes and how many as computational nodes?
>
> I am hoping that the cluster will be stable by tomorrow :)
>
> Thanks a ton,
> Raakhi
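For what it's worth, the lease-period arithmetic quoted above checks out; a quick sketch, using the figures from the mail (7-minute worst case per row, default scanner caching of 30):

```shell
ms_per_row=$((7 * 60 * 1000))    # worst-case 7 minutes per row, in ms
caching=30                       # default hbase scanner caching
echo $((ms_per_row * caching))   # prints 12600000, the configured lease
```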