Re: Ec2 instability

2009-04-21 Thread Tim Hawkins
I would be interested in understanding what problems you are having.
We are using 0.19.0 in production on EC2, running Nutch and a set of
custom apps in a mixed workload on a farm of 5 instances.



On 17 Apr 2009, at 18:05, Ted Coyle wrote:


Rakhi,
I'd suggest going to 0.19.1 for both HBase and Hadoop.

We had so many problems with 0.19.0 on EC2 that we couldn't use it.
We are having problems with name resolution and the generic startup
scripts with the 0.19.1 release, but nothing that is a show stopper.

Ted



Re: Ec2 instability

2009-04-18 Thread Andrew Purtell

Hi,

This is an OS level exception. Your node is out of memory
even to fork a process. 

How many instances do you currently have allocated? Have
you increased the number of instances over time to try and
spread the load of your application around? How many
concurrent mapper and/or reducer processes do you execute
on a node? Can you characterize the memory usage of your
mappers and reducers? Are you running other processes
external to hadoop/hbase which consume a lot of memory? Are
you running Ganglia or similar to track and characterize
resource usage over time? 

You may find you are trying to solve a 100 node problem
with 10.

   - Andy
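
A rough sketch of the sizing arithmetic behind the questions above, not a description of anyone's actual configuration. The 7.5 GB per node and the 4-5 concurrent tasks are figures reported elsewhere in this thread; the daemon and per-task heap sizes are assumed placeholders, and `memory_headroom_mb` is a hypothetical helper name.

```python
import errno

# error=12 in the stack trace is the OS errno ENOMEM ("Cannot allocate memory").
assert errno.errorcode[12] == "ENOMEM"


def memory_headroom_mb(total_mb, daemon_heaps_mb, task_slots, task_heap_mb):
    """Return the MB left once daemon heaps and all task JVM heaps are counted.

    A negative result means the node is oversubscribed on heap alone,
    before counting the OS, non-heap JVM memory, or any forked processes.
    """
    committed = sum(daemon_heaps_mb) + task_slots * task_heap_mb
    return total_mb - committed


# 7.5 GB per node and 5 concurrent tasks come from the thread; the heap
# sizes below are ASSUMED values for illustration only.
headroom = memory_headroom_mb(
    total_mb=7680,                       # 7.5 GB
    daemon_heaps_mb=[1000, 1000, 1000],  # assumed: three daemons at 1000 MB each
    task_slots=5,
    task_heap_mb=1024,                   # assumed per-task child JVM heap
)
print(headroom)
```

If the result is near zero or negative, a fork from a large JVM can plausibly fail in the way the stack trace shows; reducing concurrent tasks or heap sizes, or adding nodes, are the obvious levers.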

 From: Rakhi Khatwani
 Subject: Re: Ec2 instability
 To: hbase-u...@hadoop.apache.org, core-user@hadoop.apache.org
 Date: Friday, April 17, 2009, 9:44 AM
 Hi,
 This is the exception I have been getting in MapReduce:

 java.io.IOException: Cannot run program bash: java.io.IOException: error=12, Cannot allocate memory
   at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
   at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
   at org.apache.hadoop.util.Shell.run(Shell.java:134)
   at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
   at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:321)
   at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
   at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1199)
   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:857)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
   at org.apache.hadoop.mapred.Child.main(Child.java:155)
 Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
   at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
   at java.lang.ProcessImpl.start(ProcessImpl.java:65)
   at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
   ... 10 more



  


Re: Ec2 instability

2009-04-18 Thread Rakhi Khatwani
 Hi,

I have 6 instances allocated.
I haven't tried adding more instances because I have a maximum of 30,000 rows in
my HBase tables. What do you recommend?
I have at most 4-5 concurrent map/reduce tasks on one node.
How do we characterize the memory usage of mappers and reducers?
I am running Spinn3r in addition to the regular Hadoop/HBase processes, but Spinn3r is
being called from one of my map tasks.
I am not running Ganglia or any other program to characterize resource usage
over time.

Thanks,
Raakhi
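
One possible way to start characterizing per-task memory, sketched under the assumption that the nodes run Linux: sample the resident set size of each task JVM from `/proc/<pid>/status`. The helper names `parse_status_kb` and `rss_kb` are hypothetical, and the sample text is made up for illustration.

```python
def parse_status_kb(status_text, field="VmRSS"):
    """Return the value in kB of `field` from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == field:
            parts = rest.split()
            return int(parts[0]) if parts else None
    return None


def rss_kb(pid):
    """Read the resident set size (kB) of a live process. Linux only."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_status_kb(fh.read())


# Made-up sample in the layout of /proc/<pid>/status, for illustration only.
sample = "Name:\tjava\nVmSize:\t 1500000 kB\nVmRSS:\t  512000 kB\n"
print(parse_status_kb(sample))
print(parse_status_kb(sample, "VmSize"))
```

Calling `rss_kb` periodically for each map and reduce child process while a job runs gives a crude usage profile; a monitoring system such as Ganglia does the same job more thoroughly.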








Re: Ec2 instability

2009-04-17 Thread Rakhi Khatwani
Hi,
This is the exception I have been getting in MapReduce:

java.io.IOException: Cannot run program bash: java.io.IOException: error=12, Cannot allocate memory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
    at org.apache.hadoop.util.Shell.run(Shell.java:134)
    at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:321)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
    at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1199)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:857)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
    at java.lang.ProcessImpl.start(ProcessImpl.java:65)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
    ... 10 more



On Fri, Apr 17, 2009 at 10:09 PM, Rakhi Khatwani
rakhi.khatw...@gmail.com wrote:

 Hi,
 It has been several days since we started trying to stabilize
 Hadoop/HBase on an EC2 cluster, but we have failed to do so.
 We still come across frequent region server failures, scanner timeout
 exceptions, OS-level deadlocks, etc.

 Today, while listing the tables in HBase, I got the following
 exception:

 hbase(main):001:0> list
 09/04/17 13:57:18 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
 09/04/17 13:57:19 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
 09/04/17 13:57:20 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
 09/04/17 13:57:20 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Z...
 09/04/17 13:57:20 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could not be reached after 1 tries, giving up.
 09/04/17 13:57:21 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
 09/04/17 13:57:22 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
 09/04/17 13:57:23 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
 09/04/17 13:57:23 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Z...
 09/04/17 13:57:23 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could not be reached after 1 tries, giving up.
 09/04/17 13:57:26 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
 09/04/17 13:57:27 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
 09/04/17 13:57:28 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
 09/04/17 13:57:28 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Z...
 09/04/17 13:57:28 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could not be reached after 1 tries, giving up.
 09/04/17 13:57:29 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
 09/04/17 13:57:30 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
 09/04/17 13:57:31 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
 09/04/17 13:57:31 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Z...
 09/04/17 13:57:31 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could not be reached after 1 tries, giving up.
 09/04/17 13:57:34 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
 09/04/17 13:57:35 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
 09/04/17 13:57:36 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
 09/04/17 13:57:36 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Z...

 But if I check the UI, the HBase master is still up (I tried refreshing it
 several times).

 I have also been getting a lot of exceptions from time to time, including
 region servers going down (which happens very frequently and causes
 heavy data loss, on production data at that), scanner timeout
 exceptions, cannot-allocate-memory exceptions, etc.

 I am working on an Amazon EC2 cluster of 6 Large instances,
 with each node having the following hardware configuration:

- Large Instance 7.5 GB of memory, 4 EC2 

RE: Ec2 instability

2009-04-17 Thread Ted Coyle
Rakhi,
I'd suggest going to 0.19.1 for both HBase and Hadoop.

We had so many problems with 0.19.0 on EC2 that we couldn't use it.
We are having problems with name resolution and the generic startup
scripts with the 0.19.1 release, but nothing that is a show stopper.

Ted

