On Wed, May 7, 2008 at 2:45 PM, Chris K Wensel <[EMAIL PROTECTED]> wrote:
> Hi James
>
> Were you able to start all the nodes in the same 'availability zone'? You
> using the new AMI kernels?

After I saw your note, I restarted new instances with the new kernels
(aki-b51cf9dc and ari-b31cf9da) and made sure everything was in the same
availability zone.

> If you are using the contrib/ec2 scripts, you might upgrade (just the
> scripts) to
> http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.17/src/contrib/ec2/

I'll take a look at these - I've been doing it by hand.

Hairong wrote:
> Taking the timeout out is very dangerous. It may cause your application to
> hang. You could change the timeout parameter to a larger number.

Thanks - reducing the timeout did seem like a bad idea.

With the new kernels, I'm seeing timeout errors like this:

java.net.SocketTimeoutException: timed out waiting for rpc response
    at org.apache.hadoop.ipc.Client.call(Client.java:514)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:198)
    at org.apache.hadoop.dfs.$Proxy5.mkdirs(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at org.apache.hadoop.dfs.$Proxy5.mkdirs(Unknown Source)
    at org.apache.hadoop.dfs.DFSClient.mkdirs(DFSClient.java:550)
    at org.apache.hadoop.dfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:184)
    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:982)
    at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:1429)
    at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.launchTask(TaskTracker.java:1493)
    at org.apache.hadoop.mapred.TaskTracker.launchTaskForJob(TaskTracker.java:700)
    at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:693)
    at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1282)
    at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:923)
    at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1318)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2210)

I'll experiment with increasing the timeout.

I'm using the machine running the namenode to run maps as well - could
that be a source of my problem? The load is fairly high, essentially no
idle time. 8 cores per machine, so I've got 8 maps running. I'm guessing
I'd be better off running 80 smaller machines instead of 20 larger ones
for the same price, but we haven't been approved for more than 20
instances yet.

Given that I'm not seeing any idle time, I'm assuming that I'm CPU not
IO-bound.

Cpu(s): 89.6%us,  5.7%sy,  0.0%ni,  0.6%id,  0.0%wa,  0.0%hi,  0.1%si,  4.0%st
Mem:  15736360k total, 14935708k used,   800652k free,   237980k buffers
Swap:        0k total,        0k used,        0k free,  7545100k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28955 james     21   0 1308m 750m 9440 S  121  4.9   0:36.61 /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
28989 james     18   0 1298m 725m 9376 S  120  4.7   0:30.48 /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
29029 james     18   0 1349m 504m 9376 S  117  3.3   0:24.55 /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
29059 james     18   0 1301m 313m 9428 S   81  2.0   0:16.51 /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
25658 james     20   0 1293m 277m 9204 S    8  1.8   0:29.98 /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
25756 james     19   0 1286m 412m 9204 S    3  2.7   0:30.66 /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
25688 james     19   0 1286m 332m 9204 S    2  2.2   0:28.69 /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
 1141 james     24   0 2332m 281m 8932 S    1  1.8   3:56.17 /usr/lib/jvm/java-6-sun-1.6.0.03/bin/java -Xmx2000m -Dcom.sun.management.jmxremote -Dhadoop.log.dir=/home/james/dev/hadoop/logs -Dhadoop.log.file=hadoop-james-jobtracker-domU-1
25724 james     19   0 1286m 386m 9204 S    1  2.5   0:28.96 /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
  822 james     24   0 2306m  91m 8912 S    0  0.6   3:15.12 /usr/lib/jvm/java-6-sun-1.6.0.03/bin/java -Xmx2000m -Dcom.sun.management.jmxremote -Dhadoop.log.dir=/home/james/dev/hadoop/logs -Dhadoop.log.file=hadoop-james-namenode-domU-12-

FYI, I'm using JRuby to do the work in the map tasks. It's working well
so far.

--
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com
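
P.S. For anyone wanting to repeat the CPU-versus-IO check on the top summary above, here is a rough sketch in Python. It only parses the "Cpu(s):" line quoted in this message. The function names and both thresholds (5% idle, 10% iowait) are illustrative assumptions, not anything from Hadoop or top, so treat the result as a hint rather than a diagnosis.

```python
import re

def parse_cpu_line(line):
    """Turn a top 'Cpu(s):' summary line into a {field: percent} dict."""
    pairs = re.findall(r"([\d.]+)%(\w+)", line)
    return {name: float(value) for value, name in pairs}

def classify(cpu, idle_threshold=5.0, iowait_threshold=10.0):
    """Crude heuristic; both thresholds are assumed values for illustration."""
    if cpu.get("wa", 0.0) >= iowait_threshold:
        return "io-bound"        # CPUs are mostly waiting on disk or network
    if cpu.get("id", 0.0) <= idle_threshold:
        return "cpu-bound"       # almost no idle time and little iowait
    return "underutilized"

# The summary line exactly as it appears in the top output above.
CPU_LINE = ("Cpu(s): 89.6%us, 5.7%sy, 0.0%ni, 0.6%id, "
            "0.0%wa, 0.0%hi, 0.1%si, 4.0%st")

cpu = parse_cpu_line(CPU_LINE)
print(classify(cpu))  # prints "cpu-bound"
```

With 0.6% idle and 0.0% iowait, the sample lands in the CPU-bound branch, which matches the guess above. The 4.0%st (steal) field is time the hypervisor gave to other guests, so it might be worth watching on shared hardware as well.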