Hey Scott,

This is indeed really strange... if you do a straight "hadoop fs -put" with
dfs.replication set to 1 from one of the DNs, does it still upload slowly? That
would cut out the network from the equation.
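
Something like this, run from a shell on one of the DNs (the local file and the
HDFS target path below are just placeholders):

  time hadoop fs -D dfs.replication=1 -put /tmp/2gb-testfile /benchmarks/put-test

If that single-replica, all-local write is still slow, the network between the
nodes is probably not the culprit; if it's fast, that points back at the write
pipeline.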

-Todd

On Fri, Apr 16, 2010 at 5:29 PM, Scott Carey <sc...@richrelevance.com>wrote:

> I have two clusters upgraded to CDH2.   One is performing fine, and the
> other is EXTREMELY slow.
>
> Some jobs that formerly took 90 seconds, take 20 to 50 minutes.
>
> It is an HDFS issue from what I can tell.
>
> The simple DFS benchmark with one map task shows the problem clearly.  I
> have looked at every difference I can find and am wondering where else to
> look to track this down.
> The disks on all nodes in the cluster check out -- capable of 75MB/sec
> minimum with a 'dd' write test.
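> (i.e. something along these lines, with the illustrative path replaced by an
> actual data-disk mount:  dd if=/dev/zero of=/data/1/ddtest bs=1M count=2000 )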
> top / iostat do not show any significant CPU usage or iowait times on any
> machines in the cluster during the test.
> ifconfig does not report any dropped packets or other errors on any machine
> in the cluster.  dmesg has nothing interesting.
> The poorly performing cluster is on a slightly newer CentOS version:
> Poor: 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64 x86_64
> x86_64 GNU/Linux  (CentOS 5.4, recent patches)
> Good: 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64
> x86_64 GNU/Linux  (CentOS 5.3, I think)
> The performance is always poor, not sporadically poor.  It is poor with M/R
> tasks as well as non-M/R HDFS clients (e.g. Sqoop).
>
> Poor performance cluster (no other jobs active during the test):
> ---------------------------------------------------
> $ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write
> -nrFiles 1 -fileSize 2000
> 10/04/16 12:53:13 INFO mapred.FileInputFormat: nrFiles = 1
> 10/04/16 12:53:13 INFO mapred.FileInputFormat: fileSize (MB) = 2000
> 10/04/16 12:53:13 INFO mapred.FileInputFormat: bufferSize = 1000000
> 10/04/16 12:53:14 INFO mapred.FileInputFormat: creating control file: 2000
> mega bytes, 1 files
> 10/04/16 12:53:14 INFO mapred.FileInputFormat: created control files for: 1
> files
> 10/04/16 12:53:14 WARN mapred.JobClient: Use GenericOptionsParser for
> parsing the arguments. Applications should implement Tool for the same.
> 10/04/16 12:53:15 INFO mapred.FileInputFormat: Total input paths to process
> : 1
> 10/04/16 12:53:15 INFO mapred.JobClient: Running job: job_201004091928_0391
> 10/04/16 12:53:16 INFO mapred.JobClient:  map 0% reduce 0%
> 10/04/16 13:42:30 INFO mapred.JobClient:  map 100% reduce 0%
> 10/04/16 13:43:06 INFO mapred.JobClient:  map 100% reduce 100%
> 10/04/16 13:43:07 INFO mapred.JobClient: Job complete:
> job_201004091928_0391
> [snip]
> 10/04/16 13:43:07 INFO mapred.FileInputFormat: ----- TestDFSIO ----- :
> write
> 10/04/16 13:43:07 INFO mapred.FileInputFormat:            Date & time: Fri
> Apr 16 13:43:07 PDT 2010
> 10/04/16 13:43:07 INFO mapred.FileInputFormat:        Number of files: 1
> 10/04/16 13:43:07 INFO mapred.FileInputFormat: Total MBytes processed: 2000
> 10/04/16 13:43:07 INFO mapred.FileInputFormat:      Throughput mb/sec:
> 0.678296742615553
> 10/04/16 13:43:07 INFO mapred.FileInputFormat: Average IO rate mb/sec:
> 0.6782967448234558
> 10/04/16 13:43:07 INFO mapred.FileInputFormat:  IO rate std deviation:
> 9.568803140552889E-5
> 10/04/16 13:43:07 INFO mapred.FileInputFormat:     Test exec time sec:
> 2992.913
> ------------------------------------
>
> Good performance cluster (other jobs active during the test):
> ---------------------------------
> hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write
> -nrFiles 1 -fileSize 2000
> 10/04/16 12:50:52 WARN conf.Configuration: DEPRECATED: hadoop-site.xml
> found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use
> core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of
> core-default.xml, mapred-default.xml and hdfs-default.xml respectively
> TestFDSIO.0.0.4
> 10/04/16 12:50:52 INFO mapred.FileInputFormat: nrFiles = 1
> 10/04/16 12:50:52 INFO mapred.FileInputFormat: fileSize (MB) = 2000
> 10/04/16 12:50:52 INFO mapred.FileInputFormat: bufferSize = 1000000
> 10/04/16 12:50:52 INFO mapred.FileInputFormat: creating control file: 2000
> mega bytes, 1 files
> 10/04/16 12:50:52 INFO mapred.FileInputFormat: created control files for: 1
> files
> 10/04/16 12:50:52 WARN mapred.JobClient: Use GenericOptionsParser for
> parsing the arguments. Applications should implement Tool for the same.
> 10/04/16 12:50:53 INFO mapred.FileInputFormat: Total input paths to process
> : 1
> 10/04/16 12:50:54 INFO mapred.JobClient: Running job: job_201003311607_4098
> 10/04/16 12:50:55 INFO mapred.JobClient:  map 0% reduce 0%
> 10/04/16 12:51:22 INFO mapred.JobClient:  map 100% reduce 0%
> 10/04/16 12:51:32 INFO mapred.JobClient:  map 100% reduce 100%
> 10/04/16 12:51:32 INFO mapred.JobClient: Job complete:
> job_201003311607_4098
> [snip]
> 10/04/16 12:51:32 INFO mapred.FileInputFormat: ----- TestDFSIO ----- :
> write
> 10/04/16 12:51:32 INFO mapred.FileInputFormat:            Date & time: Fri
> Apr 16 12:51:32 PDT 2010
> 10/04/16 12:51:32 INFO mapred.FileInputFormat:        Number of files: 1
> 10/04/16 12:51:32 INFO mapred.FileInputFormat: Total MBytes processed: 2000
> 10/04/16 12:51:32 INFO mapred.FileInputFormat:      Throughput mb/sec:
> 92.47699634715865
> 10/04/16 12:51:32 INFO mapred.FileInputFormat: Average IO rate mb/sec:
> 92.47699737548828
> 10/04/16 12:51:32 INFO mapred.FileInputFormat:  IO rate std deviation:
> 0.008579127784432736
> 10/04/16 12:51:32 INFO mapred.FileInputFormat:     Test exec time sec:
> 39.522
> -------------------------------
>
> In the namenode log during the write, there are roughly 2-minute intervals
> between 64MB block allocations:
> 2010-04-16 12:56:41,081 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.allocateBlock: /benchmarks/TestDFSIO/io_data/test_io_0.
> blk_5304238373340657846_291044
> 2010-04-16 12:58:15,144 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated: 10.3.0.142:50010 is added to
> blk_5304238373340657846_291044 size 67108864
> 2010-04-16 12:58:15,145 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated: 10.3.0.140:50010 is added to
> blk_5304238373340657846_291044 size 67108864
> 2010-04-16 12:58:15,147 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated: 10.3.0.146:50010 is added to
> blk_5304238373340657846_291044 size 67108864
> 2010-04-16 12:58:15,148 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.allocateBlock: /benchmarks/TestDFSIO/io_data/test_io_0.
> blk_-4700875175596279045_291044
> 2010-04-16 12:58:15,792 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated: 10.3.0.147:50010 is added to
> blk_-4700875175596279045_291044 size 67108864
> 2010-04-16 12:58:15,792 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated: 10.3.0.149:50010 is added to
> blk_-4700875175596279045_291044 size 67108864
> 2010-04-16 12:58:15,794 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated: 10.3.0.146:50010 is added to
> blk_-4700875175596279045_291044 size 67108864
> 2010-04-16 12:58:15,795 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.allocateBlock: /benchmarks/TestDFSIO/io_data/test_io_0.
> blk_1059681682487510599_291044
>
> On the node where the map task ran, the datanode log shows similarly long
> stretches while it receives each block:
> 2010-04-16 12:55:08,303 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_-7985493313278868469_291044 src: /10.3.0.146:48168 dest: /
> 10.3.0.146:50010
> 2010-04-16 12:56:42,281 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.3.0.146:48168, dest: /10.3.0.146:50010, bytes: 67108864, op:
> HDFS_WRITE, cliID: DFSClient_attempt_201004091928_0391_m_000000_0, offset:
> 0, srvID: DS-350876050-10.3.0.146-50010-1265338485169, blockid:
> blk_-7985493313278868469_291044, duration: 93972113000
> 2010-04-16 12:56:42,282 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block
> blk_-7985493313278868469_291044 terminating
> 2010-04-16 12:56:42,285 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_5304238373340657846_291044 src: /10.3.0.146:48175 dest: /
> 10.3.0.146:50010
> 2010-04-16 12:58:16,350 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.3.0.146:48175, dest: /10.3.0.146:50010, bytes: 67108864, op:
> HDFS_WRITE, cliID: DFSClient_attempt_201004091928_0391_m_000000_0, offset:
> 0, srvID: DS-350876050-10.3.0.146-50010-1265338485169, blockid:
> blk_5304238373340657846_291044, duration: 94059819000
> 2010-04-16 12:58:16,350 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block
> blk_5304238373340657846_291044 terminating
> 2010-04-16 12:58:16,353 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_-4700875175596279045_291044 src: /10.3.0.146:48182 dest: /
> 10.3.0.146:50010
> 2010-04-16 12:58:16,997 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.3.0.146:48182, dest: /10.3.0.146:50010, bytes: 67108864, op:
> HDFS_WRITE, cliID: DFSClient_attempt_201004091928_0391_m_000000_0, offset:
> 0, srvID: DS-350876050-10.3.0.146-50010-1265338485169, blockid:
> blk_-4700875175596279045_291044, duration: 637761000
> 2010-04-16 12:58:16,997 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block
> blk_-4700875175596279045_291044 terminating
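>
> (A quick sanity check on those numbers, assuming the 'duration' field is in
> nanoseconds: 67108864 bytes / ~94 sec is roughly 0.68 MB/sec, which lines up
> with the TestDFSIO write throughput reported above.)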
>
>
> On a different datanode it looks essentially the same, but with fewer blocks,
> since every block is written once to the local node while each remote node
> receives only about (replication - 1)/(cluster size - 1) of the blocks.
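> (For example, with replication 3 and a hypothetical 6-node cluster, each
> remote node would see roughly (3 - 1)/(6 - 1) = 2/5 of the blocks.)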
>
> I grabbed a few stack traces of the running map task, but I don't think
> there are any surprises there -- it's waiting on an HDFS write:
> "DataStreamer for file /benchmarks/TestDFSIO/io_data/test_io_0 block
> blk_6860895546192904948_291044" daemon prio=10 tid=0x00002aab08244000
> nid=0x1031 runnable [0x0000000040c7a000]
>   java.lang.Thread.State: RUNNABLE
>        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>        - locked <0x00002aaab363dc88> (a sun.nio.ch.Util$1)
>        - locked <0x00002aaab363dc78> (a
> java.util.Collections$UnmodifiableSet)
>        - locked <0x00002aaab363da50> (a sun.nio.ch.EPollSelectorImpl)
>        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>        at
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
>        at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>        at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>        at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>        - locked <0x00002aaaf89023c0> (a java.io.BufferedOutputStream)
>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>
> "main" prio=10 tid=0x0000000055fb6000 nid=0x1017 waiting for monitor entry
> [0x000000004112c000]
>   java.lang.Thread.State: BLOCKED (on object monitor)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:3006)
>        - waiting to lock <0x00002aaac36bb440> (a java.util.LinkedList)
>        - locked <0x00002aaac36bae60> (a
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream)
>        at
> org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:150)
>
>
>
> I can get some stack traces from the namenode and datanodes during this
> too.  Any other suggestions?  Is there anything in a 0.19.2 to 0.20.1+
> upgrade that might end up improperly configured and cause this sort of
> symptom?  It's almost as if HDFS just decides to sleep for a second between
> each 64K network transfer.  Non-Hadoop network transfer (e.g. scp) seems
> fine.
>
> Another bit of information -- it's not just HDFS writes; the read test is
> almost equally awful (1.9 MB/sec).
>
> The Map task running the read test is in this stack trace:
> "main" prio=10 tid=0x00000000554b3000 nid=0x3376 runnable
> [0x0000000040a1d000]
>   java.lang.Thread.State: RUNNABLE
>        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>        - locked <0x00002aaaf7641740> (a sun.nio.ch.Util$1)
>        - locked <0x00002aaaf7641730> (a
> java.util.Collections$UnmodifiableSet)
>        - locked <0x00002aaaf7641508> (a sun.nio.ch.EPollSelectorImpl)
>        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>        at
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
>        at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>        at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>        at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>        at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>        - locked <0x00002aaaf76400f8> (a java.io.BufferedInputStream)
>        at java.io.DataInputStream.read(DataInputStream.java:132)
>        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
>        at
> org.apache.hadoop.hdfs.DFSClient$BlockReader.readChunk(DFSClient.java:1283)
>
>
> The namenode is not doing anything, as far as I can tell from its stack
> traces.  The datanode being read from at any given time seems to be in this
> stack trace:
> "org.apache.hadoop.hdfs.server.datanode.dataxcei...@7b90a2d" daemon
> prio=10 tid=0x00002aaaf43b8000 nid=0x79cc runnable [0x000000004433c000]
>   java.lang.Thread.State: RUNNABLE
>        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>        - locked <0x00002aaae7ee8440> (a sun.nio.ch.Util$1)
>        - locked <0x00002aaae7ee8428> (a
> java.util.Collections$UnmodifiableSet)
>        - locked <0x00002aaae7ee8098> (a sun.nio.ch.EPollSelectorImpl)
>        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>        at
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
>        at
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
>        at
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>        at
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:401)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
>        at java.lang.Thread.run(Thread.java:619)
>
>
> So the behavior is as if the network is extremely slow, but it seems to
> only affect Hadoop.  Any ideas?
>
> Thanks,
>
> -Scott
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
