On Wed, May 7, 2008 at 2:45 PM, Chris K Wensel <[EMAIL PROTECTED]> wrote:
> Hi James
>
> Were you able to start all the nodes in the same 'availability zone'? You
> using the new AMI kernels?

After I saw your note, I restarted new instances with the new kernels
(aki-b51cf9dc and ari-b31cf9da) and made sure everything was in the
same availability zone.

> If you are using the contrib/ec2 scripts, you might upgrade (just the
> scripts) to
> http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.17/src/contrib/ec2/

I'll take a look at these - I've been doing it by hand.

Hairong wrote:
> Taking the timeout out is very dangerous. It may cause your application to
> hang. You could change the timeout parameter to a larger number.

Thanks - removing the timeout did seem like a bad idea.  With the new
kernels, I'm seeing timeout errors like this:

java.net.SocketTimeoutException: timed out waiting for rpc response
        at org.apache.hadoop.ipc.Client.call(Client.java:514)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:198)
        at org.apache.hadoop.dfs.$Proxy5.mkdirs(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at org.apache.hadoop.dfs.$Proxy5.mkdirs(Unknown Source)
        at org.apache.hadoop.dfs.DFSClient.mkdirs(DFSClient.java:550)
        at org.apache.hadoop.dfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:184)
        at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:982)
        at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:1429)
        at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.launchTask(TaskTracker.java:1493)
        at org.apache.hadoop.mapred.TaskTracker.launchTaskForJob(TaskTracker.java:700)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:693)
        at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1282)
        at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:923)
        at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1318)
        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2210)

I'll experiment with increasing the timeout.

I'm using the machine running the namenode to run maps as well - could
that be a source of my problem?  The load is fairly high, essentially
no idle time.  8 cores per machine, so I've got 8 maps running.  I'm
guessing I'd be better off running 80 smaller machines instead of 20
larger ones for the same price, but we haven't been approved for more
than 20 instances yet.  Given that I'm not seeing any idle time, I'm
assuming that I'm CPU-bound, not IO-bound.

Cpu(s): 89.6%us,  5.7%sy,  0.0%ni,  0.6%id,  0.0%wa,  0.0%hi,  0.1%si,  4.0%st
Mem:  15736360k total, 14935708k used,   800652k free,   237980k buffers
Swap:        0k total,        0k used,        0k free,  7545100k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28955 james     21   0 1308m 750m 9440 S  121  4.9   0:36.61 /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
28989 james     18   0 1298m 725m 9376 S  120  4.7   0:30.48 /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
29029 james     18   0 1349m 504m 9376 S  117  3.3   0:24.55 /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
29059 james     18   0 1301m 313m 9428 S   81  2.0   0:16.51 /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
25658 james     20   0 1293m 277m 9204 S    8  1.8   0:29.98 /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
25756 james     19   0 1286m 412m 9204 S    3  2.7   0:30.66 /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
25688 james     19   0 1286m 332m 9204 S    2  2.2   0:28.69 /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
 1141 james     24   0 2332m 281m 8932 S    1  1.8   3:56.17 /usr/lib/jvm/java-6-sun-1.6.0.03/bin/java -Xmx2000m -Dcom.sun.management.jmxremote -Dhadoop.log.dir=/home/james/dev/hadoop/logs -Dhadoop.log.file=hadoop-james-jobtracker-domU-1
25724 james     19   0 1286m 386m 9204 S    1  2.5   0:28.96 /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
  822 james     24   0 2306m  91m 8912 S    0  0.6   3:15.12 /usr/lib/jvm/java-6-sun-1.6.0.03/bin/java -Xmx2000m -Dcom.sun.management.jmxremote -Dhadoop.log.dir=/home/james/dev/hadoop/logs -Dhadoop.log.file=hadoop-james-namenode-domU-12-
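
For reference, here is a rough sketch of how the Cpu(s) line above can be
read mechanically to back up the CPU-bound guess.  The helper names and
the 10% thresholds are hypothetical, chosen only for illustration; they
are not defaults of top or Hadoop.

```python
import re

# Sample line copied verbatim from the top output above.
CPU_LINE = ("Cpu(s): 89.6%us,  5.7%sy,  0.0%ni,  0.6%id,  "
            "0.0%wa,  0.0%hi,  0.1%si,  4.0%st")


def parse_cpu_line(line):
    """Map top's field names (us, sy, id, wa, ...) to percentages."""
    pairs = re.findall(r"([\d.]+)%(\w+)", line)
    return {name: float(value) for value, name in pairs}


def classify(fields, iowait_threshold=10.0, idle_threshold=10.0):
    """Crude heuristic; both thresholds are arbitrary assumptions."""
    if fields["wa"] >= iowait_threshold:
        return "io-bound"        # CPUs are mostly waiting on disk/network
    if fields["id"] < idle_threshold:
        return "cpu-bound"       # almost no idle time and little iowait
    return "underutilized"


fields = parse_cpu_line(CPU_LINE)
busy = fields["us"] + fields["sy"]   # user + system time
print(classify(fields), round(busy, 1), fields["st"])
```

On this sample, iowait (wa) is 0.0% and idle (id) is 0.6%, so the
heuristic reports "cpu-bound".  The st field is steal time, i.e. time
the hypervisor spent on other guests, which is worth keeping an eye on
for virtualized instances.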

FYI, I'm using JRuby to do the work in the map tasks.  It's working well so far.

-- 
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com
