Re: Extremely slow HDFS after upgrade
All links check out as full-duplex gigabit with ifconfig and ethtool. ifconfig reports no dropped packets or retransmits. A tcpdump shows no retransmits, but it does show significantly smaller packets and fewer outstanding packets between ACKs. But iperf in UDP mode shows a consistent 1.5% lost-datagram rate in one direction -- I can push 900+ Mbits/sec of UDP over the flawed links with 1.5% loss, and 0% loss the other way. So it appears that the Linux TCP flow control is throttling back the window size due to these losses.

Time to have someone replace all the network cables.

Thanks for the ideas Todd,
-Scott

On Apr 16, 2010, at 8:25 PM, Todd Lipcon wrote:

Checked link autonegotiation with ethtool? Sometimes gige will autoneg to 10mb half duplex if there's a bad cable, NIC, or switch port.
-Todd

[snip]
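For reference, a minimal sketch of the kind of iperf UDP test Scott describes above. The hostnames and the 900 Mbit/s offered rate are placeholders, and exact flags can differ between iperf versions; the server-side report is where the lost/total datagram count shows up:

# On the receiving node (hypothetical name nodeD), start a UDP server:
$ iperf -s -u

# On the sending node (hypothetical name nodeC), offer ~900 Mbits/sec of UDP for 30 seconds:
$ iperf -c nodeD -u -b 900M -t 30

# Swap the client and server roles to measure the reverse direction;
# an asymmetric loss rate points at the link rather than the hosts.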
Re: Extremely slow HDFS after upgrade
Hey Scott,

This is indeed really strange... if you do a straight hadoop fs -put with dfs.replication set to 1 from one of the DNs, does it upload slow? That would cut out the network from the equation.

-Todd

On Fri, Apr 16, 2010 at 5:29 PM, Scott Carey sc...@richrelevance.com wrote:

I have two clusters upgraded to CDH2. One is performing fine, and the other is EXTREMELY slow. Some jobs that formerly took 90 seconds now take 20 to 50 minutes.

It is an HDFS issue from what I can tell. The simple DFS benchmark with one map task shows the problem clearly. I have looked at every difference I can find and am wondering where else to look to track this down.

The disks on all nodes in the cluster check out -- capable of 75MB/sec minimum with a 'dd' write test. top / iostat do not show any significant CPU usage or iowait times on any machine in the cluster during the test. ifconfig does not report any dropped packets or other errors on any machine in the cluster. dmesg has nothing interesting.

The poorly performing cluster is on a slightly newer CentOS version:
Poor: 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux (CentOS 5.4, recent patches)
Good: 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux (CentOS 5.3, I think)

The performance is always poor, not sporadically poor. It is poor with M/R tasks as well as non-M/R HDFS clients (e.g. sqoop).

Poor performance cluster (no other jobs active during the test):
-----------------------------------------------------------------
$ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write -nrFiles 1 -fileSize 2000
10/04/16 12:53:13 INFO mapred.FileInputFormat: nrFiles = 1
10/04/16 12:53:13 INFO mapred.FileInputFormat: fileSize (MB) = 2000
10/04/16 12:53:13 INFO mapred.FileInputFormat: bufferSize = 100
10/04/16 12:53:14 INFO mapred.FileInputFormat: creating control file: 2000 mega bytes, 1 files
10/04/16 12:53:14 INFO mapred.FileInputFormat: created control files for: 1 files
10/04/16 12:53:14 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/04/16 12:53:15 INFO mapred.FileInputFormat: Total input paths to process : 1
10/04/16 12:53:15 INFO mapred.JobClient: Running job: job_201004091928_0391
10/04/16 12:53:16 INFO mapred.JobClient:  map 0% reduce 0%
10/04/16 13:42:30 INFO mapred.JobClient:  map 100% reduce 0%
10/04/16 13:43:06 INFO mapred.JobClient:  map 100% reduce 100%
10/04/16 13:43:07 INFO mapred.JobClient: Job complete: job_201004091928_0391
[snip]
10/04/16 13:43:07 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
10/04/16 13:43:07 INFO mapred.FileInputFormat:            Date & time: Fri Apr 16 13:43:07 PDT 2010
10/04/16 13:43:07 INFO mapred.FileInputFormat:        Number of files: 1
10/04/16 13:43:07 INFO mapred.FileInputFormat: Total MBytes processed: 2000
10/04/16 13:43:07 INFO mapred.FileInputFormat:      Throughput mb/sec: 0.678296742615553
10/04/16 13:43:07 INFO mapred.FileInputFormat: Average IO rate mb/sec: 0.6782967448234558
10/04/16 13:43:07 INFO mapred.FileInputFormat:  IO rate std deviation: 9.568803140552889E-5
10/04/16 13:43:07 INFO mapred.FileInputFormat:     Test exec time sec: 2992.913

Good performance cluster (other jobs active during the test):
-----------------------------------------------------------------
$ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write -nrFiles 1 -fileSize 2000
10/04/16 12:50:52 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
TestFDSIO.0.0.4
10/04/16 12:50:52 INFO mapred.FileInputFormat: nrFiles = 1
10/04/16 12:50:52 INFO mapred.FileInputFormat: fileSize (MB) = 2000
10/04/16 12:50:52 INFO mapred.FileInputFormat: bufferSize = 100
10/04/16 12:50:52 INFO mapred.FileInputFormat: creating control file: 2000 mega bytes, 1 files
10/04/16 12:50:52 INFO mapred.FileInputFormat: created control files for: 1 files
10/04/16 12:50:52 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/04/16 12:50:53 INFO mapred.FileInputFormat: Total input paths to process : 1
10/04/16 12:50:54 INFO mapred.JobClient: Running job: job_201003311607_4098
10/04/16 12:50:55 INFO mapred.JobClient:  map 0% reduce 0%
10/04/16 12:51:22 INFO mapred.JobClient:  map 100% reduce 0%
10/04/16 12:51:32 INFO mapred.JobClient:  map 100% reduce 100%
10/04/16 12:51:32 INFO mapred.JobClient: Job complete: job_201003311607_4098
[snip]
10/04/16 12:51:32 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
10/04/16 12:51:32 INFO mapred.FileInputFormat:            Date & time: Fri Apr 16
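For reference, the 'dd' write test mentioned in Scott's message above can be approximated with something like the following. This is only a sketch: the data-directory path is a placeholder, and it assumes the dd on these boxes supports oflag=direct (drop that flag if not, at the cost of measuring the page cache as well as the disk):

# Write 2GB directly to one datanode data disk; dd prints the MB/s on completion.
$ dd if=/dev/zero of=/data/1/dd-test.tmp bs=1M count=2000 oflag=direct
$ rm /data/1/dd-test.tmp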
Re: Extremely slow HDFS after upgrade
Ok, so here is a ... fun result. I have dfs.replication.min set to 2, so I can't just do

$ hadoop fs -Ddfs.replication=1 -put someFile someFile

since that will fail. So here are two results that are fascinating:

$ time hadoop fs -Ddfs.replication=3 -put test.tar test.tar
real    1m53.237s
user    0m1.952s
sys     0m0.308s

$ time hadoop fs -Ddfs.replication=2 -put test.tar test.tar
real    0m1.689s
user    0m1.763s
sys     0m0.315s

The file is 77MB and so is two blocks. The test with replication level 3 is slow about 9 out of 10 times. When it is slow it is sometimes 28 seconds, sometimes 2 minutes. It was fast one time... The test with replication level 2 is fast in 40 out of 40 tests. This is a development cluster with 8 nodes. It looks like a replication level of 3 or more causes trouble.

Looking more closely at the logs, it seems that certain datanodes (but not all) cause large delays when they are in the middle of an HDFS write chain. So, a write that goes A -> B -> C is fast if B is a good node and C a bad node; if it's A -> C -> B then it's slow. So I can say that some nodes, but not all, are doing something wrong when they are in the middle of a write chain. If I do a replication = 2 write on one of these bad nodes, it's always slow.

So the good news is I can identify the bad nodes and decommission them. The bad news is this still doesn't make a lot of sense, and 40% of the nodes have the issue. Worse, on a couple of nodes the behavior in the replication = 2 case is not consistent -- sometimes the first block is fast. So it may depend not just on the source, but on the source -> target combination in the chain.

At this point, I suspect something completely broken at the network level, perhaps even routing. Why it would show up after an upgrade is yet to be determined, but the upgrade did include some config changes and OS updates.

Thanks Todd!
-Scott

On Apr 16, 2010, at 5:34 PM, Todd Lipcon wrote:

Hey Scott, This is indeed really strange... if you do a straight hadoop fs -put with dfs.replication set to 1 from one of the DNs, does it upload slow? That would cut out the network from the equation.
-Todd

[snip]
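For anyone who wants to repeat the timed comparison above, a minimal sketch of a loop over both replication levels. The repetition count and the per-run destination names are assumptions for illustration, not part of the original test; test.tar is the same 77MB file:

# Time five uploads at replication 2 and 3, using a fresh destination name each run.
for rep in 2 3; do
  for i in 1 2 3 4 5; do
    ( time hadoop fs -Ddfs.replication=$rep -put test.tar test.tar.$rep.$i ) 2>&1 | grep real
  done
done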
Re: Extremely slow HDFS after upgrade
More info -- this is not a Hadoop issue. The network performance issue can be replicated with SSH, only on the links where Hadoop has a problem, and only in the direction with a problem.

HDFS is slow to transfer data in certain directions from certain machines. So, for example, copying from node C to node D may be slow, but not in the other direction, from D to C. Likewise, although only 3 of 8 nodes have this problem, it is not universal: node C might have trouble copying data to 5 of the 7 other nodes, while node G might have trouble with all 7 other nodes.

No idea what it is yet, but SSH exhibits the same issue -- only on those specific point-to-point links, in one specific direction.

-Scott

On Apr 16, 2010, at 7:10 PM, Scott Carey wrote:

[snip]
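A minimal sketch of the kind of per-link SSH check described above, pushing a fixed amount of data in each direction between a pair of nodes. The hostnames are placeholders, and SSH encryption overhead keeps the absolute rate below line speed, but the asymmetry between directions should still be visible in dd's throughput report:

# From node C: push 500MB to nodeD and let dd report the rate when the pipe drains.
$ dd if=/dev/zero bs=1M count=500 | ssh nodeD 'cat > /dev/null'

# Run the same command from nodeD back toward nodeC to measure the reverse direction.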
Re: Extremely slow HDFS after upgrade
Checked link autonegotiation with ethtool? Sometimes gige will autoneg to 10mb half duplex if there's a bad cable, NIC, or switch port.

-Todd

On Fri, Apr 16, 2010 at 8:08 PM, Scott Carey sc...@richrelevance.com wrote:

[snip]
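For reference, a minimal sketch of the ethtool checks Todd is suggesting. The interface name eth0 is a placeholder, and the counters printed by -S vary by NIC driver:

# Show negotiated speed, duplex, and auto-negotiation state for the interface.
$ ethtool eth0

# Driver-level NIC statistics; look for rx/tx error and CRC counters that ifconfig may not surface.
$ ethtool -S eth0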