Hi, this is the exception I have been getting in the map-reduce job:

java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
        at org.apache.hadoop.util.Shell.run(Shell.java:134)
        at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:321)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1199)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:857)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
        at org.apache.hadoop.mapred.Child.main(Child.java:155)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
        at java.lang.ProcessImpl.start(ProcessImpl.java:65)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
        ... 10 more
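The "error=12" here comes from fork(): when Hadoop shells out to "bash" (via DF.getAvailable above), the child process momentarily needs as much virtual address space as the parent JVM, so a large -Xmx on a box with little swap makes the fork fail. A rough sketch of the two remedies usually suggested — the slot count and heap size below are assumptions to tune for your nodes, not recommendations:

```shell
# Option 1: let the kernel overcommit virtual memory so fork() from a
# large JVM can succeed (run as root; 1 = always overcommit):
#   sysctl -w vm.overcommit_memory=1

# Option 2: size the per-task heap to the task slots actually configured.
# With 3 map + 3 reduce slots per node (as in the config quoted below),
# the worst case is 6 concurrent child JVMs:
slots=6
heap_mb=1024                 # hypothetical -Xmx1024m per child task
echo $((slots * heap_mb))    # 6144 MB total, well under 7.5 GB of RAM
```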
On Fri, Apr 17, 2009 at 10:09 PM, Rakhi Khatwani <rakhi.khatw...@gmail.com> wrote:

> Hi,
> It's been several days since we have been trying to stabilize
> hadoop/hbase on an EC2 cluster, but we have failed to do so.
> We still come across frequent region server failures, scanner timeout
> exceptions, OS-level deadlocks, etc.
>
> And today, while listing the tables on hbase, I got the following
> exception:
>
> hbase(main):001:0> list
> 09/04/17 13:57:18 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
> 09/04/17 13:57:19 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
> 09/04/17 13:57:20 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
> 09/04/17 13:57:20 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Zzzzz...
> 09/04/17 13:57:20 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could not be reached after 1 tries, giving up.
> 09/04/17 13:57:21 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
> 09/04/17 13:57:22 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
> 09/04/17 13:57:23 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
> 09/04/17 13:57:23 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Zzzzz...
> 09/04/17 13:57:23 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could not be reached after 1 tries, giving up.
> 09/04/17 13:57:26 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
> 09/04/17 13:57:27 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
> 09/04/17 13:57:28 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
> 09/04/17 13:57:28 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Zzzzz...
> 09/04/17 13:57:28 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could not be reached after 1 tries, giving up.
> 09/04/17 13:57:29 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
> 09/04/17 13:57:30 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
> 09/04/17 13:57:31 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
> 09/04/17 13:57:31 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Zzzzz...
> 09/04/17 13:57:31 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could not be reached after 1 tries, giving up.
> 09/04/17 13:57:34 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
> 09/04/17 13:57:35 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
> 09/04/17 13:57:36 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
> 09/04/17 13:57:36 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Zzzzz...
>
> But if I check the UI, the hbase master is still up (I tried refreshing it
> several times).
>
> I have also been getting a lot of other exceptions from time to time, including
> region servers going down (which happens very frequently and causes heavy
> data loss, on production data at that), scanner timeout exceptions,
> cannot-allocate-memory exceptions, etc.
>
> I am working on an Amazon EC2 Large cluster with 6 nodes,
> with each node having the following hardware configuration:
>
> - Large Instance: 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores
>   with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit
>   platform
>
> I am using hadoop-0.19.0 and hbase 0.19.0 (resynced to all the nodes and
> made sure that there is a symbolic link to hadoop-site.xml from hbase/conf).
>
> The following is my configuration in hadoop-site.xml:
>
> <configuration>
>
>   <property>
>     <name>hadoop.tmp.dir</name>
>     <value>/mnt/hadoop</value>
>   </property>
>
>   <property>
>     <name>fs.default.name</name>
>     <value>hdfs://domU-12-31-39-00-E5-D2.compute-1.internal:50001</value>
>   </property>
>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>domU-12-31-39-00-E5-D2.compute-1.internal:50002</value>
>   </property>
>
>   <property>
>     <name>tasktracker.http.threads</name>
>     <value>80</value>
>   </property>
>
>   <property>
>     <name>mapred.tasktracker.map.tasks.maximum</name>
>     <value>3</value>
>   </property>
>
>   <property>
>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>     <value>3</value>
>   </property>
>
>   <property>
>     <name>mapred.output.compress</name>
>     <value>true</value>
>   </property>
>
>   <property>
>     <name>mapred.output.compression.type</name>
>     <value>BLOCK</value>
>   </property>
>
>   <property>
>     <name>dfs.client.block.write.retries</name>
>     <value>3</value>
>   </property>
>
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value>-Xmx4096m</value>
>   </property>
>
> I gave this a high value since the RAM on each node is 7.5 GB, though I am
> not sure of this setting.
> ** I got the Cannot Allocate Memory exception for the first time after
> making this change. Going through the archives, someone suggested enabling
> memory overcommit; I am not sure of that either. **
>
>   <property>
>     <name>dfs.datanode.max.xcievers</name>
>     <value>4096</value>
>   </property>
>
> As suggested by some of you,
> I guess this solved the data xceivers exception on hadoop.
>
>   <property>
>     <name>dfs.datanode.handler.count</name>
>     <value>10</value>
>   </property>
>
>   <property>
>     <name>mapred.task.timeout</name>
>     <value>0</value>
>     <description>The number of milliseconds before a task will be
>     terminated if it neither reads an input, writes an output, nor
>     updates its status string.
>     </description>
>   </property>
>
> This property has been set because I have been getting a lot of
> "Cannot report in 602 seconds... killing" exceptions.
>
>   <property>
>     <name>mapred.tasktracker.expiry.interval</name>
>     <value>360000</value>
>     <description>Expert: The time-interval, in milliseconds, after which
>     a tasktracker is declared 'lost' if it doesn't send heartbeats.
>     </description>
>   </property>
>
>   <property>
>     <name>dfs.datanode.socket.write.timeout</name>
>     <value>0</value>
>   </property>
>
> This is to avoid socket timeout exceptions.
>
>   <property>
>     <name>dfs.replication</name>
>     <value>5</value>
>     <description>Default block replication.
>     The actual number of replications can be specified when the file is
>     created. The default is used if replication is not specified at
>     create time.
>     </description>
>   </property>
>
>   <property>
>     <name>mapred.job.reuse.jvm.num.tasks</name>
>     <value>-1</value>
>     <description>How many tasks to run per jvm. If set to -1, there is
>     no limit.
>     </description>
>   </property>
>
> </configuration>
>
> And the following is the configuration in hbase-site.xml:
>
> <configuration>
>
>   <property>
>     <name>hbase.master</name>
>     <value>domU-12-31-39-00-E5-D2.compute-1.internal:60000</value>
>   </property>
>
>   <property>
>     <name>hbase.rootdir</name>
>     <value>hdfs://domU-12-31-39-00-E5-D2.compute-1.internal:50001/hbase</value>
>   </property>
>
>   <property>
>     <name>hbase.regionserver.lease.period</name>
>     <value>12600000</value>
>     <description>HRegion server lease period in milliseconds. Default is
>     60 seconds.
>     Clients must report in within this period else they are
>     considered dead.</description>
>   </property>
>
> I have set this because there is a map-reduce program which takes almost
> 3-4 minutes to process a row, 7 minutes in the worst case, so the value
> was calculated as (7*60*1000) * 30 = 12600000,
> where 7*60*1000 is the time to process a row in ms,
> and 30 is the default hbase scanner caching.
> So I shouldn't be getting scanner timeout exceptions.
>
> ** I made this change today, and I haven't come across a scanner timeout
> exception today. **
>
>   <property>
>     <name>hbase.master.lease.period</name>
>     <value>3600000</value>
>     <description>HMaster server lease period in milliseconds. Default is
>     120 seconds. Region servers must report in within this period else
>     they are considered dead. On a loaded cluster, you may need to up
>     this period.</description>
>   </property>
>
> </configuration>
>
> Any suggestions on changes to the configuration?
>
> My main concern is the region servers going down from time to time, which
> happens very frequently; because of it my map-reduce tasks hang and the
> entire application fails :(
>
> I have tried almost all the suggestions you mentioned except separating
> the datanodes from the computational nodes, which I plan to do tomorrow.
> Has it been tried before?
> And what would be your recommendation? How many nodes should I use as
> datanodes and how many as computational nodes?
>
> I am hoping that the cluster will be stable by tomorrow :)
>
> Thanks a ton,
> Raakhi
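For what it's worth, the lease-period arithmetic quoted above checks out; a quick sketch, using the figures from the mail (7-minute worst case per row, default scanner caching of 30):

```shell
ms_per_row=$((7 * 60 * 1000))    # worst-case 7 minutes per row, in ms
caching=30                       # default hbase scanner caching
echo $((ms_per_row * caching))   # prints 12600000, the configured lease
```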