Re: How to make HOD apply more than one core on each machine?
Song, I know that is the way to set the capacity of each node; however, I want to know how we can tell the Torque manager that we will run more than one mapred task on each machine. If we don't do this, Torque will assign the other cores on the machine to other tasks, which may cause competition for cores. Do you know how to solve this?

If I understand correctly, what you want is that when a physical node is allocated via HOD by the Torque resource manager, that node should not be shared by other jobs. Is that correct? Looking on the web, I found that schedulers like Maui / Moab that are typically used with Torque allow for this. In particular, this link may be particularly useful: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2009-May/039949.html. It talks about a NODEACCESSPOLICY configuration in Maui that is described here: http://www.clusterresources.com/products/maui/docs/5.3nodeaccess.shtml. Setting this policy to SINGLEJOB seems to solve your problem. Can you check if this meets your requirement?
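Based on the Maui documentation linked above, this would be a one-line setting in maui.cfg (a sketch; the exact file location depends on your Maui install):

```
# maui.cfg -- allow only one job per node at a time, so nodes
# allocated via HOD are not shared with other Torque jobs
NODEACCESSPOLICY SINGLEJOB
```

Maui needs to be restarted (or reconfigured) for the change to take effect.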
Re: Distributed Cache with New API
Thanks. That clears it up. Larry

On Fri, Apr 16, 2010 at 1:05 AM, Amareshwari Sri Ramadasu amar...@yahoo-inc.com wrote:

Hi,

@Ted, the code below is internal code. Users are not expected to call DistributedCache.getLocalCache(), and they cannot use it anyway: they do not know all the parameters.

@Larry, DistributedCache was not changed to use the new API in branch 0.20. The change was made only from branch 0.21 onward. See MAPREDUCE-898 (https://issues.apache.org/jira/browse/MAPREDUCE-898). If you are using branch 0.20, you are encouraged to use the deprecated JobConf itself. You can try the following change in your code: change the line

DistributedCache.addCacheFile(new Path(args[0]).toUri(), conf);

to

DistributedCache.addCacheFile(new Path(args[0]).toUri(), job.getConfiguration());

Thanks
Amareshwari

On 4/16/10 2:27 AM, Ted Yu yuzhih...@gmail.com wrote:

Please take a look at the loop starting at line 158 in TaskRunner.java:

  p[i] = DistributedCache.getLocalCache(files[i], conf, new Path(baseDir),
      fileStatus, false, Long.parseLong(fileTimestamps[i]),
      new Path(workDir.getAbsolutePath()), false);
}
DistributedCache.setLocalFiles(conf, stringifyPathArray(p));

I think the confusing part is that DistributedCache.getLocalCacheFiles() is paired with DistributedCache.setLocalFiles().

Cheers

On Thu, Apr 15, 2010 at 1:16 PM, Larry Compton lawrence.comp...@gmail.com wrote:

Ted, thanks. I have looked at that example. The javadocs for DistributedCache still refer to deprecated classes, like JobConf. I'm trying to use the revised API. Larry

On Thu, Apr 15, 2010 at 4:07 PM, Ted Yu yuzhih...@gmail.com wrote:

Please see the sample within src/core/org/apache/hadoop/filecache/DistributedCache.java:

 * JobConf job = new JobConf();
 * DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);

On Thu, Apr 15, 2010 at 12:56 PM, Larry Compton lawrence.comp...@gmail.com wrote:

I'm trying to use the distributed cache in a MapReduce job written to the new API (org.apache.hadoop.mapreduce.*).
In my Tool class, a file path is added to the distributed cache as follows:

public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    Job job = new Job(conf, "Job");
    ...
    DistributedCache.addCacheFile(new Path(args[0]).toUri(), conf);
    ...
    return job.waitForCompletion(true) ? 0 : 1;
}

The setup() method in my mapper tries to read the path as follows:

protected void setup(Context context) throws IOException {
    Path[] paths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
}

But paths is null. I'm assuming I'm setting up the distributed cache incorrectly. I've seen a few hints in previous mailing-list postings indicating that the distributed cache is accessed via the Job and JobContext objects in the revised API, but the javadocs don't seem to support that. Thanks. Larry
o.a.h.mapreduce API and SequenceFile encoding format
Hey Folks, No luck on IRC; trying here: I was playing around with 0.20.x and SequenceFileOutputFormat. The documentation doesn't specify any particular file encoding, but I had assumed it was some sort of raw binary format. After inspecting the output, I see that was a false assumption... the file encoding appears to be ASCII hex pairs with space delimiters. Is this accurate? After trolling the javadocs some more, I found SequenceFileAsBinaryOutputFormat, but that class doesn't appear to have an analog in the new o.a.h.mapreduce API. Is it just not ported, or am I missing some other method of specifying file encoding in the new mapreduce APIs? Thanks, Bo
Jetty returning 404s for everything
I have a cluster running Cloudera's 0.20.1+152-1 version of Hadoop. All was well, but there was an unfortunate power outage that affected just the namenode. Everything seemed largely normal upon resumption (I did have to recreate the local version of hadoop.tmp.dir to get the namenode to start), but now I find that none of the status webpages is working: Jetty is returning 404s for everything. The actual JobTracker appears fine: I am able to submit jobs and get results. Here's what I see:

$ telnet localhost 50030
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET /jobtracker.jsp HTTP/1.1
Host: localhost

HTTP/1.1 404 /jobtracker.jsp
Content-Type: text/html; charset=iso-8859-1
Cache-Control: must-revalidate,no-cache,no-store
Content-Length: 1412
Server: Jetty(6.1.14)

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 404 /jobtracker.jsp</title>
</head>
<body><h2>HTTP ERROR: 404</h2><pre>/jobtracker.jsp</pre>
<p>RequestURI=/jobtracker.jsp</p><p><i><small><a href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p>
</body>
</html>
^]
telnet> quit

In contrast, another cluster running the slightly more up-to-date 0.20.1+169.68.1 returns what you'd expect, e.g.

$ telnet localhost 50030
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET /jobtracker.jsp HTTP/1.1
Host: localhost

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Set-Cookie: JSESSIONID=12c1udmu09jok;Path=/
Content-Length: 2851
Server: Jetty(6.1.14)

<html>
<head>
<title>hdp-nn-pri Hadoop Map/Reduce Administration</title>
<link rel="stylesheet" type="text/css" href="/static/hadoop.css">
<link rel="icon" type="image/vnd.microsoft.icon" href="/static/images/favicon.ico" />
<script type="text/javascript" src="/static/jobtracker.js"></script>
</head>
<body>
<h1>hdp-nn-pri Hadoop Map/Reduce Administration</h1>
. . . etc.
I assume this stuff is under the control of the webapp directory, and that appears identical between the two clusters: I did a recursive diff. Anyway, I've looked at a bunch of things and don't see any problems, so I'm kind of at wits' end currently. Any suggestions would be most appreciated. -- Robert Crocombe
Splitting input for mapper and contiguous data
As I may have mentioned, my main goal currently is the processing of physiologic data using Hadoop and MR. The steps are:

1. Convert ADC units to physical units (input is (sample num, raw value); output is (sample num, physical value)).
2. Perform a peak detection to detect the systolic blood pressure (input is (sample num, physical value); output is (sample num, physical value), but the output is only a subset of the input).
3. Calculate the central tendency measure using a sliding window (mapper input is (sample num, physical value); mapper output is (window ID, (sample num, physical value)); reducer output is (window ID, central tendency measurement at different radii)).

Each of the above steps builds upon the result of the previous. So, for the first two steps, I have been doing everything in the mapper and specified 0 reduce tasks. For the last step, I am performing calculations on a sliding window of N points, skipping forward M points for the next window, where N > M. To implement this, I have a mapper that outputs all of the (x, y) points (the value) for a particular key (the window ID). The reducer then performs the calculations on each window's data.

Everything works pretty well, except that I noticed the splitting of the input across different mappers affects the final output. Due to the nature of the calculations, this doesn't affect the end result very much. However, I'm trying to make sure I understand everything properly, and I want to see if there is a better/proper way of implementing something like this. I'm guessing the problem comes from the fact that I'm trying to use contiguous data points to create a window of N points. The window ID is just the first sample num encountered for the window. As a result, the first sample num encountered will change for everything but the first map task, when compared to a serial execution. Thanks! --Andrew
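The windowing behavior described above can be sketched to show exactly why split placement changes the output. This is a minimal stand-alone illustration (Python here purely for demonstration, not the poster's actual MR code): the window ID is the sample num of the first point in each window, so a mapper that starts mid-stream produces different IDs and window boundaries than a serial pass would.

```python
def window_ids(samples, n, m):
    """Build overlapping windows of n points, advancing m points each
    time (n > m). Each window's ID is the sample num of its first point,
    mirroring the scheme described in the post."""
    windows = {}
    i = 0
    while i + n <= len(samples):
        win = samples[i:i + n]
        win_id = win[0][0]  # sample num of the first point encountered
        windows[win_id] = win
        i += m
    return windows

# Ten (sample num, physical value) points
samples = [(i, i * 0.1) for i in range(10)]

# Serial pass over the whole stream
whole = window_ids(samples, n=4, m=2)

# The same data split across two "mappers": the second split starts
# mid-stream, so its window IDs and boundaries no longer line up
split = {}
for part in (samples[:5], samples[5:]):
    split.update(window_ids(part, n=4, m=2))
```

Here `whole` has window IDs {0, 2, 4, 6} while `split` has {0, 5}, so the reducer sees different windows. One common remedy (a design suggestion, not from the thread) is to derive the window ID from the sample number itself, e.g. assigning each sample to the windows whose start offsets are multiples of M covering it, so IDs are split-independent.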
Extremely slow HDFS after upgrade
I have two clusters upgraded to CDH2. One is performing fine, and the other is EXTREMELY slow. Some jobs that formerly took 90 seconds take 20 to 50 minutes. It is an HDFS issue from what I can tell. The simple DFS benchmark with one map task shows the problem clearly. I have looked at every difference I can find and am wondering where else to look to track this down.

The disks on all nodes in the cluster check out -- capable of 75MB/sec minimum with a 'dd' write test. top / iostat do not show any significant CPU usage or iowait times on any machines in the cluster during the test. ifconfig does not report any dropped packets or other errors on any machine in the cluster. dmesg has nothing interesting.

The poorly performing cluster is on a slightly newer CentOS version:
Poor: 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux (CentOS 5.4, recent patches)
Good: 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux (CentOS 5.3, I think)

The performance is always poor, not sporadically poor. It is poor with M/R tasks as well as non-M/R HDFS clients (i.e. sqoop).

Poor performance cluster (no other jobs active during the test):
---
$ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write -nrFiles 1 -fileSize 2000
10/04/16 12:53:13 INFO mapred.FileInputFormat: nrFiles = 1
10/04/16 12:53:13 INFO mapred.FileInputFormat: fileSize (MB) = 2000
10/04/16 12:53:13 INFO mapred.FileInputFormat: bufferSize = 100
10/04/16 12:53:14 INFO mapred.FileInputFormat: creating control file: 2000 mega bytes, 1 files
10/04/16 12:53:14 INFO mapred.FileInputFormat: created control files for: 1 files
10/04/16 12:53:14 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/04/16 12:53:15 INFO mapred.FileInputFormat: Total input paths to process : 1
10/04/16 12:53:15 INFO mapred.JobClient: Running job: job_201004091928_0391
10/04/16 12:53:16 INFO mapred.JobClient: map 0% reduce 0%
10/04/16 13:42:30 INFO mapred.JobClient: map 100% reduce 0%
10/04/16 13:43:06 INFO mapred.JobClient: map 100% reduce 100%
10/04/16 13:43:07 INFO mapred.JobClient: Job complete: job_201004091928_0391
[snip]
10/04/16 13:43:07 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
10/04/16 13:43:07 INFO mapred.FileInputFormat: Date & time: Fri Apr 16 13:43:07 PDT 2010
10/04/16 13:43:07 INFO mapred.FileInputFormat: Number of files: 1
10/04/16 13:43:07 INFO mapred.FileInputFormat: Total MBytes processed: 2000
10/04/16 13:43:07 INFO mapred.FileInputFormat: Throughput mb/sec: 0.678296742615553
10/04/16 13:43:07 INFO mapred.FileInputFormat: Average IO rate mb/sec: 0.6782967448234558
10/04/16 13:43:07 INFO mapred.FileInputFormat: IO rate std deviation: 9.568803140552889E-5
10/04/16 13:43:07 INFO mapred.FileInputFormat: Test exec time sec: 2992.913

Good performance cluster (other jobs active during the test):
---
$ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write -nrFiles 1 -fileSize 2000
10/04/16 12:50:52 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
TestFDSIO.0.0.4
10/04/16 12:50:52 INFO mapred.FileInputFormat: nrFiles = 1
10/04/16 12:50:52 INFO mapred.FileInputFormat: fileSize (MB) = 2000
10/04/16 12:50:52 INFO mapred.FileInputFormat: bufferSize = 100
10/04/16 12:50:52 INFO mapred.FileInputFormat: creating control file: 2000 mega bytes, 1 files
10/04/16 12:50:52 INFO mapred.FileInputFormat: created control files for: 1 files
10/04/16 12:50:52 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/04/16 12:50:53 INFO mapred.FileInputFormat: Total input paths to process : 1
10/04/16 12:50:54 INFO mapred.JobClient: Running job: job_201003311607_4098
10/04/16 12:50:55 INFO mapred.JobClient: map 0% reduce 0%
10/04/16 12:51:22 INFO mapred.JobClient: map 100% reduce 0%
10/04/16 12:51:32 INFO mapred.JobClient: map 100% reduce 100%
10/04/16 12:51:32 INFO mapred.JobClient: Job complete: job_201003311607_4098
[snip]
10/04/16 12:51:32 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
10/04/16 12:51:32 INFO mapred.FileInputFormat: Date & time: Fri Apr 16 12:51:32 PDT 2010
10/04/16 12:51:32 INFO mapred.FileInputFormat: Number of files: 1
10/04/16 12:51:32 INFO mapred.FileInputFormat: Total MBytes processed: 2000
10/04/16 12:51:32 INFO mapred.FileInputFormat: Throughput mb/sec: 92.47699634715865
10/04/16 12:51:32 INFO mapred.FileInputFormat: Average IO rate mb/sec: 92.47699737548828
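For scale, the two runs can be reduced to a single slowdown factor from the figures in the logs above (2000 MB written; exec times and throughputs as reported):

```python
# Figures copied from the TestDFSIO logs above
total_mb = 2000
slow_exec_secs = 2992.913             # poor cluster: ~50-minute job
slow_throughput = 0.678296742615553   # poor cluster, MB/s as reported
fast_throughput = 92.47699634715865   # good cluster, MB/s as reported

# Overall rate including job startup/teardown overhead -- close to the
# reported per-task throughput, so the write itself is the bottleneck
overall_rate = total_mb / slow_exec_secs   # ~0.67 MB/s

# The good cluster is roughly two orders of magnitude faster
slowdown = fast_throughput / slow_throughput
```

That is a ~136x difference, far too large to be explained by contention from the other jobs running on the good cluster during its test.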
Re: Extremely slow HDFS after upgrade
Hey Scott,

This is indeed really strange... if you do a straight hadoop fs -put with dfs.replication set to 1 from one of the DNs, does it upload slow? That would cut out the network from the equation.

-Todd

On Fri, Apr 16, 2010 at 5:29 PM, Scott Carey sc...@richrelevance.com wrote:
[snip]
Re: Extremely slow HDFS after upgrade
Ok, so here is a ... fun result. I have dfs.replication.min set to 2, so I can't just do

hadoop fs -Ddfs.replication=1 -put someFile someFile

since that will fail. So here are two results that are fascinating:

$ time hadoop fs -Ddfs.replication=3 -put test.tar test.tar
real    1m53.237s
user    0m1.952s
sys     0m0.308s

$ time hadoop fs -Ddfs.replication=2 -put test.tar test.tar
real    0m1.689s
user    0m1.763s
sys     0m0.315s

The file is 77MB and so is two blocks. The test with replication level 3 is slow about 9 out of 10 times. When it is slow, it is sometimes 28 seconds, sometimes 2 minutes. It was fast one time... The test with replication level 2 is fast in 40 out of 40 tests. This is a development cluster with 8 nodes. It looks like a replication level of 3 or more causes trouble.

Looking more closely at the logs, it seems that certain datanodes (but not all) cause large delays if they are in the middle of an HDFS write chain. So, a write that goes A -> B -> C is fast if B is a good node and C a bad node. If it's A -> C -> B then it's slow. So, I can say that some nodes, but not all, are doing something wrong when in the middle of a write chain. If I do a replication = 2 write on one of these bad nodes, it's always slow.

So the good news is I can identify the bad nodes and decommission them. The bad news is this still doesn't make a lot of sense, and 40% of the nodes have the issue. Worse, on a couple of nodes the behavior in the replication = 2 case is not consistent -- sometimes the first block is fast. So it may depend on not just the source, but the source -> target combination in the chain. At this point, I suspect something completely broken at the network level, perhaps even routing. Why it would show up after an upgrade is yet to be determined, but the upgrade did include some config changes and OS updates.

Thanks Todd!
-Scott

On Apr 16, 2010, at 5:34 PM, Todd Lipcon wrote:
[snip]
Re: Extremely slow HDFS after upgrade
More info -- this is not a Hadoop issue. The network performance issue can be replicated with SSH only on the links where Hadoop has a problem, and only in the direction with a problem.

HDFS is slow to transfer data in certain directions from certain machines. So, for example, copying from node C to D may be slow, but not in the other direction, from D to C. Likewise, although only 3 of 8 nodes have this problem, it is not universal. For example, node C might have trouble copying data to 5 of the 7 other nodes, and node G might have trouble with all 7 other nodes. No idea what it is yet, but SSH exhibits the same issue -- only on those specific point-to-point links, in one specific direction.

-Scott

On Apr 16, 2010, at 7:10 PM, Scott Carey wrote:
[snip]
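A one-direction throughput probe independent of both Hadoop and SSH can help pin down this kind of asymmetric link problem (SSH adds its own crypto overhead). Below is a minimal sketch, not from the thread: in real use you would run the receiver on one node and point the sender at it, then swap directions; here both ends run in one process over loopback to keep it self-contained.

```python
import socket
import threading
import time

def measure_one_way(host="127.0.0.1", port=0, payload_mb=16):
    """Send payload_mb of zeros over a plain TCP socket and return
    (bytes received, MB/s). Both ends run locally here; to probe a real
    link, split the receiver half onto the remote node."""
    srv = socket.socket()
    srv.bind((host, port))        # port=0: let the OS pick a free port
    srv.listen(1)
    actual_port = srv.getsockname()[1]
    received = []

    def recv_all():
        conn, _ = srv.accept()
        total = 0
        while True:
            chunk = conn.recv(65536)
            if not chunk:
                break
            total += len(chunk)
        conn.close()
        received.append(total)

    t = threading.Thread(target=recv_all)
    t.start()

    payload = b"\x00" * (1024 * 1024)   # 1 MB per send
    snd = socket.socket()
    snd.connect((host, actual_port))
    start = time.time()
    for _ in range(payload_mb):
        snd.sendall(payload)
    snd.close()
    t.join()
    srv.close()
    elapsed = time.time() - start
    return received[0], payload_mb / max(elapsed, 1e-9)
```

On a healthy gige link a probe like this should report on the order of 100 MB/s; a directionally broken link would show the same lopsided numbers Scott sees with SSH and HDFS.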
Re: Extremely slow HDFS after upgrade
Checked link autonegotiation with ethtool? Sometimes gige will autoneg to 10mb half duplex if there's a bad cable, NIC, or switch port.

-Todd

On Fri, Apr 16, 2010 at 8:08 PM, Scott Carey sc...@richrelevance.com wrote:
[snip]
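ethtool is the canonical check here; as a scriptable alternative for sweeping every node at once, the negotiated speed and duplex are also exposed in Linux sysfs. A small sketch (interface names are illustrative; a gige port that autonegotiated badly would report speed 10 and duplex "half"):

```python
import os

def link_status(iface, sysfs="/sys/class/net"):
    """Best-effort read of NIC speed (Mb/s) and duplex mode from Linux
    sysfs -- the same values ethtool reports. Returns None if the
    interface is missing or doesn't expose them (e.g. loopback)."""
    base = os.path.join(sysfs, iface)
    try:
        with open(os.path.join(base, "speed")) as f:
            speed = int(f.read().strip())
        with open(os.path.join(base, "duplex")) as f:
            duplex = f.read().strip()
    except (OSError, ValueError):
        return None
    return {"speed_mbps": speed, "duplex": duplex}

# Example sweep (hypothetical interface name):
#   status = link_status("eth0")
#   if status and (status["speed_mbps"] < 1000 or status["duplex"] != "full"):
#       print("bad autoneg:", status)
```

Running this from a cron job or a parallel-ssh sweep across the cluster would flag any node whose link silently renegotiated after the upgrade reboots.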