Checked link autonegotiation with ethtool? Sometimes GigE will autonegotiate down to 10Mb half duplex if there's a bad cable, NIC, or switch port.
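One way to check that across a cluster is to reduce each node's `ethtool` output to its negotiated speed/duplex and flag anything that isn't gigabit full duplex. A minimal sketch (the interface name `eth0` and the exact output format are assumptions; verify against your hardware):

```shell
# Reduce `ethtool` output to "SPEED DUPLEX", e.g. "1000Mb/s Full".
# A link that negotiated "10Mb/s Half" points at a bad cable/NIC/port.
parse_link() {
  awk '/Speed:/ {s = $2} /Duplex:/ {d = $2} END {print s, d}'
}

# Demo against canned output; on a live node: ethtool eth0 | parse_link
printf 'Speed: 10Mb/s\nDuplex: Half\n' | parse_link   # -> 10Mb/s Half
```

Run it over ssh on every datanode and anything that does not report `1000Mb/s Full` is a candidate for the bad links described below.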
-Todd

On Fri, Apr 16, 2010 at 8:08 PM, Scott Carey <sc...@richrelevance.com> wrote:

> More info -- this is not a Hadoop issue.
>
> The network performance issue can be replicated with SSH only on the links
> where Hadoop has a problem, and only in the direction with a problem.
>
> HDFS is slow to transfer data in certain directions from certain machines.
>
> So, for example, copying from node C to D may be slow, but not in the other
> direction, from D to C. Likewise, although only 3 of 8 nodes have this
> problem, it is not universal. For example, node C might have trouble
> copying data to 5 of the 7 other nodes, and node G might have trouble with
> all 7 other nodes.
>
> No idea what it is yet, but SSH exhibits the same issue -- only on those
> specific point-to-point links, in one specific direction.
>
> -Scott
>
> On Apr 16, 2010, at 7:10 PM, Scott Carey wrote:
>
> > Ok, so here is a ... fun result.
> >
> > I have dfs.replication.min set to 2, so I can't just do
> >   hadoop fs -Ddfs.replication=1 -put someFile someFile
> > since that will fail.
> >
> > So here are two results that are fascinating:
> >
> > $ time hadoop fs -Ddfs.replication=3 -put test.tar test.tar
> > real 1m53.237s
> > user 0m1.952s
> > sys 0m0.308s
> >
> > $ time hadoop fs -Ddfs.replication=2 -put test.tar test.tar
> > real 0m1.689s
> > user 0m1.763s
> > sys 0m0.315s
> >
> > The file is 77MB, and so is two blocks.
> > The test with replication level 3 is slow about 9 out of 10 times. When
> > it is slow it sometimes takes 28 seconds, sometimes 2 minutes. It was
> > fast one time...
> > The test with replication level 2 is fast in 40 out of 40 tests.
> >
> > This is a development cluster with 8 nodes.
> >
> > It looks like a replication level of 3 or more causes trouble. Looking
> > more closely at the logs, it seems that certain datanodes (but not all)
> > cause large delays if they are in the middle of an HDFS write chain.
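The "slow about 9 out of 10 times" observation is easier to pin down with a small harness that runs the same timed put repeatedly, so the rep=2 and rep=3 timing distributions can be compared side by side. A sketch (the `hadoop fs -put` line is the one from the test above; the demo uses a stand-in command):

```shell
# Run a command N times and print each run's wall-clock time.
time_runs() {
  n=$1; shift
  i=1
  while [ "$i" -le "$n" ]; do
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    echo "run $i: $(( $(date +%s) - start ))s"
    i=$((i + 1))
  done
}

# Stand-in command for the demo; on the cluster it would be e.g.:
#   time_runs 10 hadoop fs -Ddfs.replication=3 -put test.tar test.tar
time_runs 2 sleep 1
```

A bimodal rep=3 distribution (some runs in seconds, some in minutes) with a tight rep=2 distribution is exactly the "certain datanodes in the chain" pattern described above.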
> > So, a write that goes from A > B > C is fast if B is a good node and C a
> > bad node. If it's A > C > B then it's slow.
> >
> > So, I can say that some nodes, but not all, are doing something wrong
> > when in the middle of a write chain. If I do a replication = 2 write on
> > one of these bad nodes, it's always slow.
> >
> > So the good news is I can identify the bad nodes and decommission them.
> > The bad news is this still doesn't make a lot of sense, and 40% of the
> > nodes have the issue. Worse, on a couple of nodes the behavior in the
> > replication = 2 case is not consistent -- sometimes the first block is
> > fast. So it may depend on not just the source, but the source <> target
> > combination in the chain.
> >
> > At this point, I suspect something completely broken at the network
> > level, perhaps even routing. Why it would show up after an upgrade is
> > yet to be determined, but the upgrade did include some config changes
> > and OS updates.
> >
> > Thanks Todd!
> >
> > -Scott
> >
> > On Apr 16, 2010, at 5:34 PM, Todd Lipcon wrote:
> >
> >> Hey Scott,
> >>
> >> This is indeed really strange... if you do a straight "hadoop fs -put"
> >> with dfs.replication set to 1 from one of the DNs, does it upload slow?
> >> That would cut out the network from the equation.
> >>
> >> -Todd
> >>
> >> On Fri, Apr 16, 2010 at 5:29 PM, Scott Carey <sc...@richrelevance.com> wrote:
> >>
> >>> I have two clusters upgraded to CDH2. One is performing fine, and the
> >>> other is EXTREMELY slow.
> >>>
> >>> Some jobs that formerly took 90 seconds now take 20 to 50 minutes.
> >>>
> >>> It is an HDFS issue from what I can tell.
> >>>
> >>> The simple DFS benchmark with one map task shows the problem clearly.
> >>> I have looked at every difference I can find and am wondering where
> >>> else to look to track this down.
> >>> The disks on all nodes in the cluster check out -- capable of 75MB/sec
> >>> minimum with a 'dd' write test.
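Since the slowness is per directed source <> target pair, a full sweep over every ordered node pair makes the bad links visible outside Hadoop. A sketch with hypothetical hostnames: it streams one 64MB block's worth of zeros from src to dst over ssh and reports integer MB/s (BatchMode/ConnectTimeout keep a dead host from hanging the sweep):

```shell
# Integer MB/s from a byte count and elapsed seconds.
mbs() { echo $(( $1 / 1048576 / $2 )); }

NODES="node-a node-b node-c"   # hypothetical; list your datanodes
BYTES=67108864                 # 64MB, one HDFS block

for src in $NODES; do
  for dst in $NODES; do
    [ "$src" = "$dst" ] && continue
    start=$(date +%s)
    ssh -o BatchMode=yes -o ConnectTimeout=2 "$src" \
        "dd if=/dev/zero bs=1M count=64 2>/dev/null" \
      | ssh -o BatchMode=yes -o ConnectTimeout=2 "$dst" "cat > /dev/null"
    secs=$(( $(date +%s) - start )); [ "$secs" -gt 0 ] || secs=1
    echo "$src -> $dst: $(mbs "$BYTES" "$secs") MB/s"
  done
done
```

A directed pair that is an order of magnitude slower than its reverse matches the one-way slowness described above, and the slow `src` nodes are the decommission candidates.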
> >>> top / iostat do not show any significant CPU usage or iowait times on
> >>> any machines in the cluster during the test.
> >>> ifconfig does not report any dropped packets or other errors on any
> >>> machine in the cluster. dmesg has nothing interesting.
> >>> The poorly performing cluster is on a slightly newer CentOS version:
> >>> Poor: 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64
> >>> x86_64 x86_64 GNU/Linux (CentOS 5.4, recent patches)
> >>> Good: 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64
> >>> x86_64 GNU/Linux (CentOS 5.3, I think)
> >>> The performance is always poor, not sporadically poor. It is poor with
> >>> M/R tasks as well as non-M/R HDFS clients (i.e. sqoop).
> >>>
> >>> Poor performance cluster (no other jobs active during the test):
> >>> ---------------------------------------------------
> >>> $ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write -nrFiles 1 -fileSize 2000
> >>> 10/04/16 12:53:13 INFO mapred.FileInputFormat: nrFiles = 1
> >>> 10/04/16 12:53:13 INFO mapred.FileInputFormat: fileSize (MB) = 2000
> >>> 10/04/16 12:53:13 INFO mapred.FileInputFormat: bufferSize = 1000000
> >>> 10/04/16 12:53:14 INFO mapred.FileInputFormat: creating control file: 2000 mega bytes, 1 files
> >>> 10/04/16 12:53:14 INFO mapred.FileInputFormat: created control files for: 1 files
> >>> 10/04/16 12:53:14 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> >>> 10/04/16 12:53:15 INFO mapred.FileInputFormat: Total input paths to process : 1
> >>> 10/04/16 12:53:15 INFO mapred.JobClient: Running job: job_201004091928_0391
> >>> 10/04/16 12:53:16 INFO mapred.JobClient: map 0% reduce 0%
> >>> 10/04/16 13:42:30 INFO mapred.JobClient: map 100% reduce 0%
> >>> 10/04/16 13:43:06 INFO mapred.JobClient: map 100% reduce 100%
> >>> 10/04/16 13:43:07 INFO mapred.JobClient: Job complete: job_201004091928_0391
> >>> [snip]
> >>> 10/04/16 13:43:07 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
> >>> 10/04/16 13:43:07 INFO mapred.FileInputFormat:            Date & time: Fri Apr 16 13:43:07 PDT 2010
> >>> 10/04/16 13:43:07 INFO mapred.FileInputFormat:        Number of files: 1
> >>> 10/04/16 13:43:07 INFO mapred.FileInputFormat: Total MBytes processed: 2000
> >>> 10/04/16 13:43:07 INFO mapred.FileInputFormat:      Throughput mb/sec: 0.678296742615553
> >>> 10/04/16 13:43:07 INFO mapred.FileInputFormat: Average IO rate mb/sec: 0.6782967448234558
> >>> 10/04/16 13:43:07 INFO mapred.FileInputFormat:  IO rate std deviation: 9.568803140552889E-5
> >>> 10/04/16 13:43:07 INFO mapred.FileInputFormat:     Test exec time sec: 2992.913
> >>> ------------------------------------
> >>>
> >>> Good performance cluster (other jobs active during the test):
> >>> ---------------------------------
> >>> hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write -nrFiles 1 -fileSize 2000
> >>> 10/04/16 12:50:52 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
> >>> TestFDSIO.0.0.4
> >>> 10/04/16 12:50:52 INFO mapred.FileInputFormat: nrFiles = 1
> >>> 10/04/16 12:50:52 INFO mapred.FileInputFormat: fileSize (MB) = 2000
> >>> 10/04/16 12:50:52 INFO mapred.FileInputFormat: bufferSize = 1000000
> >>> 10/04/16 12:50:52 INFO mapred.FileInputFormat: creating control file: 2000 mega bytes, 1 files
> >>> 10/04/16 12:50:52 INFO mapred.FileInputFormat: created control files for: 1 files
> >>> 10/04/16 12:50:52 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> >>> 10/04/16 12:50:53 INFO mapred.FileInputFormat: Total input paths to process : 1
> >>> 10/04/16 12:50:54 INFO mapred.JobClient: Running job: job_201003311607_4098
> >>> 10/04/16 12:50:55 INFO mapred.JobClient: map 0% reduce 0%
> >>> 10/04/16 12:51:22 INFO mapred.JobClient: map 100% reduce 0%
> >>> 10/04/16 12:51:32 INFO mapred.JobClient: map 100% reduce 100%
> >>> 10/04/16 12:51:32 INFO mapred.JobClient: Job complete: job_201003311607_4098
> >>> [snip]
> >>> 10/04/16 12:51:32 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
> >>> 10/04/16 12:51:32 INFO mapred.FileInputFormat:            Date & time: Fri Apr 16 12:51:32 PDT 2010
> >>> 10/04/16 12:51:32 INFO mapred.FileInputFormat:        Number of files: 1
> >>> 10/04/16 12:51:32 INFO mapred.FileInputFormat: Total MBytes processed: 2000
> >>> 10/04/16 12:51:32 INFO mapred.FileInputFormat:      Throughput mb/sec: 92.47699634715865
> >>> 10/04/16 12:51:32 INFO mapred.FileInputFormat: Average IO rate mb/sec: 92.47699737548828
> >>> 10/04/16 12:51:32 INFO mapred.FileInputFormat:  IO rate std deviation: 0.008579127784432736
> >>> 10/04/16 12:51:32 INFO mapred.FileInputFormat:     Test exec time sec: 39.522
> >>> -------------------------------
> >>>
> >>> On the namenode logs during the write, there are about 2 minute
> >>> intervals between block writes (64MB):
> >>> 2010-04-16 12:56:41,081 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /benchmarks/TestDFSIO/io_data/test_io_0. blk_5304238373340657846_291044
> >>> 2010-04-16 12:58:15,144 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.142:50010 is added to blk_5304238373340657846_291044 size 67108864
> >>> 2010-04-16 12:58:15,145 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.140:50010 is added to blk_5304238373340657846_291044 size 67108864
> >>> 2010-04-16 12:58:15,147 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.146:50010 is added to blk_5304238373340657846_291044 size 67108864
> >>> 2010-04-16 12:58:15,148 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /benchmarks/TestDFSIO/io_data/test_io_0. blk_-4700875175596279045_291044
> >>> 2010-04-16 12:58:15,792 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.147:50010 is added to blk_-4700875175596279045_291044 size 67108864
> >>> 2010-04-16 12:58:15,792 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.149:50010 is added to blk_-4700875175596279045_291044 size 67108864
> >>> 2010-04-16 12:58:15,794 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.146:50010 is added to blk_-4700875175596279045_291044 size 67108864
> >>> 2010-04-16 12:58:15,795 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /benchmarks/TestDFSIO/io_data/test_io_0. blk_1059681682487510599_291044
> >>>
> >>> On the node where the map task ran, its datanode log has similar
> >>> chunks of time while it receives the block:
> >>> 2010-04-16 12:55:08,303 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-7985493313278868469_291044 src: /10.3.0.146:48168 dest: /10.3.0.146:50010
> >>> 2010-04-16 12:56:42,281 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.3.0.146:48168, dest: /10.3.0.146:50010, bytes: 67108864, op: HDFS_WRITE, cliID: DFSClient_attempt_201004091928_0391_m_000000_0, offset: 0, srvID: DS-350876050-10.3.0.146-50010-1265338485169, blockid: blk_-7985493313278868469_291044, duration: 93972113000
> >>> 2010-04-16 12:56:42,282 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_-7985493313278868469_291044 terminating
> >>> 2010-04-16 12:56:42,285 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_5304238373340657846_291044 src: /10.3.0.146:48175 dest: /10.3.0.146:50010
> >>> 2010-04-16 12:58:16,350 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.3.0.146:48175, dest: /10.3.0.146:50010, bytes: 67108864, op: HDFS_WRITE, cliID: DFSClient_attempt_201004091928_0391_m_000000_0, offset: 0, srvID: DS-350876050-10.3.0.146-50010-1265338485169, blockid: blk_5304238373340657846_291044, duration: 94059819000
> >>> 2010-04-16 12:58:16,350 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_5304238373340657846_291044 terminating
> >>> 2010-04-16 12:58:16,353 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-4700875175596279045_291044 src: /10.3.0.146:48182 dest: /10.3.0.146:50010
> >>> 2010-04-16 12:58:16,997 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.3.0.146:48182, dest: /10.3.0.146:50010, bytes: 67108864, op: HDFS_WRITE, cliID: DFSClient_attempt_201004091928_0391_m_000000_0, offset: 0, srvID: DS-350876050-10.3.0.146-50010-1265338485169, blockid: blk_-4700875175596279045_291044, duration: 637761000
> >>> 2010-04-16 12:58:16,997 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_-4700875175596279045_291044 terminating
> >>>
> >>> On a different datanode it looks essentially the same but with fewer
> >>> blocks, since all blocks are written to the local node once but only
> >>> (replication - 1)/(cluster size - 1) times as often elsewhere.
> >>>
> >>> I grabbed a few stack traces of the running map task, but I don't
> >>> think there are any surprises there -- it's waiting on an HDFS write:
> >>> "DataStreamer for file /benchmarks/TestDFSIO/io_data/test_io_0 block blk_6860895546192904948_291044" daemon prio=10 tid=0x00002aab08244000 nid=0x1031 runnable [0x0000000040c7a000]
> >>>    java.lang.Thread.State: RUNNABLE
> >>>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> >>>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
> >>>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
> >>>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
> >>>         - locked <0x00002aaab363dc88> (a sun.nio.ch.Util$1)
> >>>         - locked <0x00002aaab363dc78> (a java.util.Collections$UnmodifiableSet)
> >>>         - locked <0x00002aaab363da50> (a sun.nio.ch.EPollSelectorImpl)
> >>>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
> >>>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
> >>>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> >>>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> >>>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> >>>         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> >>>         - locked <0x00002aaaf89023c0> (a java.io.BufferedOutputStream)
> >>>         at java.io.DataOutputStream.write(DataOutputStream.java:90)
> >>>
> >>> "main" prio=10 tid=0x0000000055fb6000 nid=0x1017 waiting for monitor entry [0x000000004112c000]
> >>>    java.lang.Thread.State: BLOCKED (on object monitor)
> >>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:3006)
> >>>         - waiting to lock <0x00002aaac36bb440> (a java.util.LinkedList)
> >>>         - locked <0x00002aaac36bae60> (a org.apache.hadoop.hdfs.DFSClient$DFSOutputStream)
> >>>         at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:150)
> >>>
> >>> I can get some stack traces from the namenode and datanodes during
> >>> this too. Any other suggestions? Is there anything in a 0.19.2 to
> >>> 0.20.1+ upgrade that might end up improperly configured that would
> >>> have this sort of symptom? It's almost like HDFS just decides to
> >>> sleep for a second between each 64K network transfer. Non-Hadoop
> >>> network transfer (i.e. scp) seems fine.
> >>>
> >>> Another bit of information -- it's not just HDFS writes; the read
> >>> test is almost equally awful (1.9MB/sec).
> >>>
> >>> The Map task running the read test is in this stack trace:
> >>> "main" prio=10 tid=0x00000000554b3000 nid=0x3376 runnable [0x0000000040a1d000]
> >>>    java.lang.Thread.State: RUNNABLE
> >>>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> >>>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
> >>>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
> >>>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
> >>>         - locked <0x00002aaaf7641740> (a sun.nio.ch.Util$1)
> >>>         - locked <0x00002aaaf7641730> (a java.util.Collections$UnmodifiableSet)
> >>>         - locked <0x00002aaaf7641508> (a sun.nio.ch.EPollSelectorImpl)
> >>>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
> >>>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
> >>>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> >>>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> >>>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> >>>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> >>>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> >>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> >>>         - locked <0x00002aaaf76400f8> (a java.io.BufferedInputStream)
> >>>         at java.io.DataInputStream.read(DataInputStream.java:132)
> >>>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
> >>>         at org.apache.hadoop.hdfs.DFSClient$BlockReader.readChunk(DFSClient.java:1283)
> >>>
> >>> The namenode is not doing anything as far as I can tell in the stack
> >>> traces.
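The clienttrace lines quoted above already contain enough to compute per-block throughput: `bytes` divided by `duration`, with duration in nanoseconds. A small sketch of that reduction, useful for scanning a whole datanode log for slow blocks:

```shell
# Print MB/s for every DataNode clienttrace line read from stdin.
block_rate() {
  awk -F'[ ,]+' '/clienttrace/ {
    for (i = 1; i <= NF; i++) {
      if ($i == "bytes:")    b  = $(i + 1)
      if ($i == "duration:") ns = $(i + 1)
    }
    # bytes -> MB, nanoseconds -> seconds
    printf "%.2f MB/s\n", (b / 1048576) / (ns / 1e9)
  }'
}

# The slow block above: 67108864 bytes in 93972113000 ns
echo 'clienttrace: bytes: 67108864, op: HDFS_WRITE, duration: 93972113000' \
  | block_rate   # -> 0.68 MB/s, matching the TestDFSIO result
```

Feeding each node's datanode log through `block_rate` gives a quick per-node view of which write chains are the ~0.68 MB/s ones and which are the ~100 MB/s ones.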
> >>> The datanode that is being accessed for data at any given time seems
> >>> to be in this stack trace:
> >>> "org.apache.hadoop.hdfs.server.datanode.dataxcei...@7b90a2d" daemon prio=10 tid=0x00002aaaf43b8000 nid=0x79cc runnable [0x000000004433c000]
> >>>    java.lang.Thread.State: RUNNABLE
> >>>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> >>>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
> >>>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
> >>>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
> >>>         - locked <0x00002aaae7ee8440> (a sun.nio.ch.Util$1)
> >>>         - locked <0x00002aaae7ee8428> (a java.util.Collections$UnmodifiableSet)
> >>>         - locked <0x00002aaae7ee8098> (a sun.nio.ch.EPollSelectorImpl)
> >>>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
> >>>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
> >>>         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
> >>>         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> >>>         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> >>>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
> >>>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:401)
> >>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
> >>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
> >>>         at java.lang.Thread.run(Thread.java:619)
> >>>
> >>> So the behavior is as if the network is extremely slow, but it seems
> >>> to only affect Hadoop. Any ideas?
> >>>
> >>> Thanks,
> >>>
> >>> -Scott
> >>
> >> --
> >> Todd Lipcon
> >> Software Engineer, Cloudera

--
Todd Lipcon
Software Engineer, Cloudera