Re: Extremely slow HDFS after upgrade

2010-04-17 Thread Scott Carey
All links check out as full-duplex gigabit with ifconfig and ethtool.
ifconfig reports no dropped packets or errors.  A tcpdump shows no
retransmits, but it does show significantly smaller packets and fewer
outstanding packets between ACKs.

But iperf in UDP mode shows a consistent 1.5% lost datagram rate in one
direction -- I can push 900 Mbit/s or more of UDP over the flawed links with
1.5% loss, and see 0% loss the other way.  So it appears that the Linux TCP
stack is throttling back its congestion window due to these losses.  Time to
have someone replace all the network cables.
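
For anyone else chasing something similar, the kind of iperf (2.x) invocation
that surfaces the loss is roughly the following -- hostnames are placeholders,
not our actual nodes:

# on the receiving node:
$ iperf -s -u
# on the sending node, push ~900 Mbit/s of UDP for 30 seconds and report loss:
$ iperf -c nodeD -u -b 900M -t 30 -i 5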

Thanks for the ideas Todd,

-Scott

On Apr 16, 2010, at 8:25 PM, Todd Lipcon wrote:

 Checked link autonegotiation with ethtool?  Sometimes gigE will autonegotiate
 down to 10 Mb half duplex if there's a bad cable, NIC, or switch port.

 -Todd

 On Fri, Apr 16, 2010 at 8:08 PM, Scott Carey sc...@richrelevance.com wrote:

 More info -- this is not a Hadoop issue.

 The network performance issue can be replicated with SSH only on the links
 where Hadoop has a problem, and only in the direction with a problem.

 HDFS is slow to transfer data in certain directions from certain machines.

 So, for example, copying from node C to D may be slow, but not the other
 direction, from D to C.  Likewise, although only 3 of 8 nodes have this
 problem, it is not universal.  For example, node C might have trouble
 copying data to 5 of the 7 other nodes, and node G might have trouble with
 all 7 other nodes.

 No idea what it is yet, but SSH exhibits the same issue -- only in those
 specific point-to-point links in one specific direction.

 -Scott

 On Apr 16, 2010, at 7:10 PM, Scott Carey wrote:

 Ok, so here is a ... fun result.

 I have dfs.replication.min set to 2, so I can't just do
 hadoop fs -Ddfs.replication=1 -put someFile someFile
 since that will fail.

 So here are two results that are fascinating:

 $ time hadoop fs -Ddfs.replication=3 -put test.tar test.tar
 real    1m53.237s
 user    0m1.952s
 sys     0m0.308s

 $ time hadoop fs -Ddfs.replication=2 -put test.tar test.tar
 real    0m1.689s
 user    0m1.763s
 sys     0m0.315s



 The file is 77 MB, so it is two blocks.
 The test with replication level 3 is slow about 9 out of 10 times.  When it
 is slow it sometimes takes 28 seconds, sometimes 2 minutes.  It was fast one
 time...
 The test with replication level 2 is fast in 40 out of 40 tests.

 This is a development cluster with 8 nodes.

 It looks like a replication level of 3 or more causes trouble.  Looking
 more closely at the logs, it seems that certain datanodes (but not all)
 cause large delays if they are in the middle of an HDFS write chain.  So, a
 write that goes A -> B -> C is fast if B is a good node and C a bad node.
 If it's A -> C -> B then it's slow.

 So, I can say that some nodes, but not all, are doing something wrong when
 in the middle of a write chain.  If I do a replication = 2 write on one of
 these bad nodes, it's always slow.

 So the good news is I can identify the bad nodes and decommission them.
 The bad news is that this still doesn't make a lot of sense, and 40% of the
 nodes have the issue.  Worse, on a couple of nodes the behavior in the
 replication = 2 case is not consistent -- sometimes the first block is fast.
 So it may depend not just on the source, but on the source and target
 combination in the chain.


 At this point, I suspect something completely broken at the network
 level, perhaps even routing.  Why it would show up after an upgrade is yet
 to be determined, but the upgrade did include some config changes and OS
 updates.

 Thanks Todd!

 -Scott


 On Apr 16, 2010, at 5:34 PM, Todd Lipcon wrote:

 Hey Scott,

 This is indeed really strange... if you do a straight hadoop fs -put
 with
 dfs.replication set to 1 from one of the DNs, does it upload slow? That
 would cut out the network from the equation.

 -Todd

 On Fri, Apr 16, 2010 at 5:29 PM, Scott Carey sc...@richrelevance.com
 wrote:

 I have two clusters upgraded to CDH2.   One is performing fine, and the
 other is EXTREMELY slow.

 Some jobs that formerly took 90 seconds now take 20 to 50 minutes.

 It is an HDFS issue from what I can tell.

 The simple DFS benchmark with one map task shows the problem clearly.
 I
 have looked at every difference I can find and am wondering where else
 to
 look to track this down.
 The disks on all nodes in the cluster check out -- capable of 75MB/sec
 minimum with a 'dd' write test.
 top / iostat do not show any significant CPU usage or iowait times on
 any
 machines in the cluster during the test.
 ifconfig does not report any dropped packets or other errors on any
 machine
 in the cluster.  dmesg has nothing interesting.
 The poorly performing cluster is on a slightly newer CentOS version:
 Poor: 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64
 x86_64
 x86_64 GNU/Linux  (CentOS 5.4, recent patches)
 Good: 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64
 x86_64 GNU/Linux  (CentOS 5.3, I think)
 The 

Extremely slow HDFS after upgrade

2010-04-16 Thread Scott Carey
I have two clusters upgraded to CDH2.   One is performing fine, and the other 
is EXTREMELY slow.

Some jobs that formerly took 90 seconds now take 20 to 50 minutes.

It is an HDFS issue from what I can tell.

The simple DFS benchmark with one map task shows the problem clearly.  I have 
looked at every difference I can find and am wondering where else to look to 
track this down.
The disks on all nodes in the cluster check out -- capable of 75 MB/sec minimum
with a 'dd' write test (a sketch of the check is below).
top / iostat do not show any significant CPU usage or iowait times on any 
machines in the cluster during the test.
ifconfig does not report any dropped packets or other errors on any machine in 
the cluster.  dmesg has nothing interesting.
The poorly performing cluster is on a slightly newer CentOS version:
Poor: 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64 x86_64 
x86_64 GNU/Linux  (CentOS 5.4, recent patches)
Good: 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 
GNU/Linux  (CentOS 5.3, I think)
The performance is always poor, not sporadically poor.  It is poor with M/R 
tasks as well as non-M/R HDFS clients (i.e. sqoop).
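
A sketch of one way to run that 'dd' write check -- the path and size here are
illustrative, not the exact ones used:

$ dd if=/dev/zero of=/data/ddtest.tmp bs=1M count=2000 oflag=direct
$ rm -f /data/ddtest.tmp
# oflag=direct bypasses the page cache, so the reported MB/s reflects the disk.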

Poor performance cluster (no other jobs active during the test):
---
$ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write 
-nrFiles 1 -fileSize 2000
10/04/16 12:53:13 INFO mapred.FileInputFormat: nrFiles = 1
10/04/16 12:53:13 INFO mapred.FileInputFormat: fileSize (MB) = 2000
10/04/16 12:53:13 INFO mapred.FileInputFormat: bufferSize = 100
10/04/16 12:53:14 INFO mapred.FileInputFormat: creating control file: 2000 mega 
bytes, 1 files
10/04/16 12:53:14 INFO mapred.FileInputFormat: created control files for: 1 
files
10/04/16 12:53:14 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
the arguments. Applications should implement Tool for the same.
10/04/16 12:53:15 INFO mapred.FileInputFormat: Total input paths to process : 1
10/04/16 12:53:15 INFO mapred.JobClient: Running job: job_201004091928_0391
10/04/16 12:53:16 INFO mapred.JobClient:  map 0% reduce 0%
10/04/16 13:42:30 INFO mapred.JobClient:  map 100% reduce 0%
10/04/16 13:43:06 INFO mapred.JobClient:  map 100% reduce 100%
10/04/16 13:43:07 INFO mapred.JobClient: Job complete: job_201004091928_0391
[snip]
10/04/16 13:43:07 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
10/04/16 13:43:07 INFO mapred.FileInputFormat:            Date & time: Fri Apr 16 13:43:07 PDT 2010
10/04/16 13:43:07 INFO mapred.FileInputFormat:Number of files: 1
10/04/16 13:43:07 INFO mapred.FileInputFormat: Total MBytes processed: 2000
10/04/16 13:43:07 INFO mapred.FileInputFormat:  Throughput mb/sec: 
0.678296742615553
10/04/16 13:43:07 INFO mapred.FileInputFormat: Average IO rate mb/sec: 
0.6782967448234558
10/04/16 13:43:07 INFO mapred.FileInputFormat:  IO rate std deviation: 
9.568803140552889E-5
10/04/16 13:43:07 INFO mapred.FileInputFormat: Test exec time sec: 2992.913


Good performance cluster (other jobs active during the test):
-
hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write 
-nrFiles 1 -fileSize 2000
10/04/16 12:50:52 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in 
the classpath. Usage of hadoop-site.xml is deprecated. Instead use 
core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of 
core-default.xml, mapred-default.xml and hdfs-default.xml respectively
TestFDSIO.0.0.4
10/04/16 12:50:52 INFO mapred.FileInputFormat: nrFiles = 1
10/04/16 12:50:52 INFO mapred.FileInputFormat: fileSize (MB) = 2000
10/04/16 12:50:52 INFO mapred.FileInputFormat: bufferSize = 100
10/04/16 12:50:52 INFO mapred.FileInputFormat: creating control file: 2000 mega 
bytes, 1 files
10/04/16 12:50:52 INFO mapred.FileInputFormat: created control files for: 1 
files
10/04/16 12:50:52 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
the arguments. Applications should implement Tool for the same.
10/04/16 12:50:53 INFO mapred.FileInputFormat: Total input paths to process : 1
10/04/16 12:50:54 INFO mapred.JobClient: Running job: job_201003311607_4098
10/04/16 12:50:55 INFO mapred.JobClient:  map 0% reduce 0%
10/04/16 12:51:22 INFO mapred.JobClient:  map 100% reduce 0%
10/04/16 12:51:32 INFO mapred.JobClient:  map 100% reduce 100%
10/04/16 12:51:32 INFO mapred.JobClient: Job complete: job_201003311607_4098
[snip]
10/04/16 12:51:32 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
10/04/16 12:51:32 INFO mapred.FileInputFormat:            Date & time: Fri Apr 16 12:51:32 PDT 2010
10/04/16 12:51:32 INFO mapred.FileInputFormat:Number of files: 1
10/04/16 12:51:32 INFO mapred.FileInputFormat: Total MBytes processed: 2000
10/04/16 12:51:32 INFO mapred.FileInputFormat:  Throughput mb/sec: 
92.47699634715865
10/04/16 12:51:32 INFO mapred.FileInputFormat: Average IO rate mb/sec: 
92.47699737548828

Re: Extremely slow HDFS after upgrade

2010-04-16 Thread Todd Lipcon
Hey Scott,

This is indeed really strange... if you do a straight hadoop fs -put with
dfs.replication set to 1 from one of the DNs, does it upload slow? That
would cut out the network from the equation.

-Todd

On Fri, Apr 16, 2010 at 5:29 PM, Scott Carey sc...@richrelevance.com wrote:

 I have two clusters upgraded to CDH2.   One is performing fine, and the
 other is EXTREMELY slow.

 Some jobs that formerly took 90 seconds now take 20 to 50 minutes.

 It is an HDFS issue from what I can tell.

 The simple DFS benchmark with one map task shows the problem clearly.  I
 have looked at every difference I can find and am wondering where else to
 look to track this down.
 The disks on all nodes in the cluster check out -- capable of 75MB/sec
 minimum with a 'dd' write test.
 top / iostat do not show any significant CPU usage or iowait times on any
 machines in the cluster during the test.
 ifconfig does not report any dropped packets or other errors on any machine
 in the cluster.  dmesg has nothing interesting.
 The poorly performing cluster is on a slightly newer CentOS version:
 Poor: 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64 x86_64
 x86_64 GNU/Linux  (CentOS 5.4, recent patches)
 Good: 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64
 x86_64 GNU/Linux  (CentOS 5.3, I think)
 The performance is always poor, not sporadically poor.  It is poor with M/R
 tasks as well as non-M/R HDFS clients (i.e. sqoop).

 Poor performance cluster (no other jobs active during the test):
 ---
 $ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write
 -nrFiles 1 -fileSize 2000
 10/04/16 12:53:13 INFO mapred.FileInputFormat: nrFiles = 1
 10/04/16 12:53:13 INFO mapred.FileInputFormat: fileSize (MB) = 2000
 10/04/16 12:53:13 INFO mapred.FileInputFormat: bufferSize = 100
 10/04/16 12:53:14 INFO mapred.FileInputFormat: creating control file: 2000
 mega bytes, 1 files
 10/04/16 12:53:14 INFO mapred.FileInputFormat: created control files for: 1
 files
 10/04/16 12:53:14 WARN mapred.JobClient: Use GenericOptionsParser for
 parsing the arguments. Applications should implement Tool for the same.
 10/04/16 12:53:15 INFO mapred.FileInputFormat: Total input paths to process
 : 1
 10/04/16 12:53:15 INFO mapred.JobClient: Running job: job_201004091928_0391
 10/04/16 12:53:16 INFO mapred.JobClient:  map 0% reduce 0%
 10/04/16 13:42:30 INFO mapred.JobClient:  map 100% reduce 0%
 10/04/16 13:43:06 INFO mapred.JobClient:  map 100% reduce 100%
 10/04/16 13:43:07 INFO mapred.JobClient: Job complete:
 job_201004091928_0391
 [snip]
 10/04/16 13:43:07 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
 10/04/16 13:43:07 INFO mapred.FileInputFormat:            Date & time: Fri Apr 16 13:43:07 PDT 2010
 10/04/16 13:43:07 INFO mapred.FileInputFormat:Number of files: 1
 10/04/16 13:43:07 INFO mapred.FileInputFormat: Total MBytes processed: 2000
 10/04/16 13:43:07 INFO mapred.FileInputFormat:  Throughput mb/sec:
 0.678296742615553
 10/04/16 13:43:07 INFO mapred.FileInputFormat: Average IO rate mb/sec:
 0.6782967448234558
 10/04/16 13:43:07 INFO mapred.FileInputFormat:  IO rate std deviation:
 9.568803140552889E-5
 10/04/16 13:43:07 INFO mapred.FileInputFormat: Test exec time sec:
 2992.913
 

 Good performance cluster (other jobs active during the test):
 -
 hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write
 -nrFiles 1 -fileSize 2000
 10/04/16 12:50:52 WARN conf.Configuration: DEPRECATED: hadoop-site.xml
 found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use
 core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of
 core-default.xml, mapred-default.xml and hdfs-default.xml respectively
 TestFDSIO.0.0.4
 10/04/16 12:50:52 INFO mapred.FileInputFormat: nrFiles = 1
 10/04/16 12:50:52 INFO mapred.FileInputFormat: fileSize (MB) = 2000
 10/04/16 12:50:52 INFO mapred.FileInputFormat: bufferSize = 100
 10/04/16 12:50:52 INFO mapred.FileInputFormat: creating control file: 2000
 mega bytes, 1 files
 10/04/16 12:50:52 INFO mapred.FileInputFormat: created control files for: 1
 files
 10/04/16 12:50:52 WARN mapred.JobClient: Use GenericOptionsParser for
 parsing the arguments. Applications should implement Tool for the same.
 10/04/16 12:50:53 INFO mapred.FileInputFormat: Total input paths to process
 : 1
 10/04/16 12:50:54 INFO mapred.JobClient: Running job: job_201003311607_4098
 10/04/16 12:50:55 INFO mapred.JobClient:  map 0% reduce 0%
 10/04/16 12:51:22 INFO mapred.JobClient:  map 100% reduce 0%
 10/04/16 12:51:32 INFO mapred.JobClient:  map 100% reduce 100%
 10/04/16 12:51:32 INFO mapred.JobClient: Job complete:
 job_201003311607_4098
 [snip]
 10/04/16 12:51:32 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
 10/04/16 12:51:32 INFO mapred.FileInputFormat:            Date & time: Fri Apr 16 

Re: Extremely slow HDFS after upgrade

2010-04-16 Thread Scott Carey
Ok, so here is a ... fun result.

I have dfs.replication.min set to 2, so I can't just do
hadoop fs -Ddfs.replication=1 -put someFile someFile
since that will fail.

So here are two results that are fascinating:

$ time hadoop fs -Ddfs.replication=3 -put test.tar test.tar
real    1m53.237s
user    0m1.952s
sys     0m0.308s

$ time hadoop fs -Ddfs.replication=2 -put test.tar test.tar
real    0m1.689s
user    0m1.763s
sys     0m0.315s



The file is 77 MB, so it is two blocks.
The test with replication level 3 is slow about 9 out of 10 times.  When it is
slow it sometimes takes 28 seconds, sometimes 2 minutes.  It was fast one time...
The test with replication level 2 is fast in 40 out of 40 tests.

This is a development cluster with 8 nodes.

It looks like a replication level of 3 or more causes trouble.  Looking more
closely at the logs, it seems that certain datanodes (but not all) cause large
delays if they are in the middle of an HDFS write chain.  So, a write that goes
A -> B -> C is fast if B is a good node and C a bad node.  If it's A -> C -> B
then it's slow.

So, I can say that some nodes, but not all, are doing something wrong when in
the middle of a write chain.  If I do a replication = 2 write on one of these
bad nodes, it's always slow.

So the good news is I can identify the bad nodes and decommission them.  The
bad news is that this still doesn't make a lot of sense, and 40% of the nodes
have the issue.  Worse, on a couple of nodes the behavior in the replication = 2
case is not consistent -- sometimes the first block is fast.  So it may depend
not just on the source, but on the source and target combination in the chain.
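
One way to double-check which datanodes ended up in a given file's write
pipeline after the fact is fsck with block locations (the path below is just an
example):

$ hadoop fsck /user/scott/test.tar -files -blocks -locations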


At this point, I suspect something completely broken at the network level, 
perhaps even routing.  Why it would show up after an upgrade is yet to be 
determined, but the upgrade did include some config changes and OS updates.
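
Some quick checks worth running on a suspect node pair to rule routing and NIC
errors in or out (the interface name and address below are placeholders):

$ ip route get 10.0.0.14        # route the kernel actually picks for the target node
$ traceroute 10.0.0.14
$ ethtool -S eth0 | grep -iE 'err|drop'   # per-NIC error/drop counters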

Thanks Todd!

-Scott


On Apr 16, 2010, at 5:34 PM, Todd Lipcon wrote:

 Hey Scott,

 This is indeed really strange... if you do a straight hadoop fs -put with
 dfs.replication set to 1 from one of the DNs, does it upload slow? That
 would cut out the network from the equation.

 -Todd

 On Fri, Apr 16, 2010 at 5:29 PM, Scott Carey sc...@richrelevance.com wrote:

 I have two clusters upgraded to CDH2.   One is performing fine, and the
 other is EXTREMELY slow.

 Some jobs that formerly took 90 seconds now take 20 to 50 minutes.

 It is an HDFS issue from what I can tell.

 The simple DFS benchmark with one map task shows the problem clearly.  I
 have looked at every difference I can find and am wondering where else to
 look to track this down.
 The disks on all nodes in the cluster check out -- capable of 75MB/sec
 minimum with a 'dd' write test.
 top / iostat do not show any significant CPU usage or iowait times on any
 machines in the cluster during the test.
 ifconfig does not report any dropped packets or other errors on any machine
 in the cluster.  dmesg has nothing interesting.
 The poorly performing cluster is on a slightly newer CentOS version:
 Poor: 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64 x86_64
 x86_64 GNU/Linux  (CentOS 5.4, recent patches)
 Good: 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64
 x86_64 GNU/Linux  (CentOS 5.3, I think)
 The performance is always poor, not sporadically poor.  It is poor with M/R
 tasks as well as non-M/R HDFS clients (i.e. sqoop).

 Poor performance cluster (no other jobs active during the test):
 ---
 $ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write
 -nrFiles 1 -fileSize 2000
 10/04/16 12:53:13 INFO mapred.FileInputFormat: nrFiles = 1
 10/04/16 12:53:13 INFO mapred.FileInputFormat: fileSize (MB) = 2000
 10/04/16 12:53:13 INFO mapred.FileInputFormat: bufferSize = 100
 10/04/16 12:53:14 INFO mapred.FileInputFormat: creating control file: 2000
 mega bytes, 1 files
 10/04/16 12:53:14 INFO mapred.FileInputFormat: created control files for: 1
 files
 10/04/16 12:53:14 WARN mapred.JobClient: Use GenericOptionsParser for
 parsing the arguments. Applications should implement Tool for the same.
 10/04/16 12:53:15 INFO mapred.FileInputFormat: Total input paths to process
 : 1
 10/04/16 12:53:15 INFO mapred.JobClient: Running job: job_201004091928_0391
 10/04/16 12:53:16 INFO mapred.JobClient:  map 0% reduce 0%
 10/04/16 13:42:30 INFO mapred.JobClient:  map 100% reduce 0%
 10/04/16 13:43:06 INFO mapred.JobClient:  map 100% reduce 100%
 10/04/16 13:43:07 INFO mapred.JobClient: Job complete:
 job_201004091928_0391
 [snip]
 10/04/16 13:43:07 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
 10/04/16 13:43:07 INFO mapred.FileInputFormat:            Date & time: Fri Apr 16 13:43:07 PDT 2010
 10/04/16 13:43:07 INFO mapred.FileInputFormat:Number of files: 1
 10/04/16 13:43:07 INFO mapred.FileInputFormat: Total MBytes processed: 2000
 10/04/16 13:43:07 INFO mapred.FileInputFormat:  Throughput 

Re: Extremely slow HDFS after upgrade

2010-04-16 Thread Scott Carey
More info -- this is not a Hadoop issue.

The network performance issue can be replicated with SSH only on the links 
where Hadoop has a problem, and only in the direction with a problem.

HDFS is slow to transfer data in certain directions from certain machines.

So, for example, copying from node C to D may be slow, but not the other
direction, from D to C.  Likewise, although only 3 of 8 nodes have this problem,
it is not universal.  For example, node C might have trouble copying data to 5 
of the 7 other nodes, and node G might have trouble with all 7 other nodes.

No idea what it is yet, but SSH exhibits the same issue -- only in those 
specific point-to-point links in one specific direction.
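
For reference, a simple way to exercise one direction of a specific link over
SSH is a pipe like the one below (hostnames are placeholders); dd prints the
achieved throughput when it finishes:

$ dd if=/dev/zero bs=1M count=500 | ssh nodeD 'cat > /dev/null'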

-Scott

On Apr 16, 2010, at 7:10 PM, Scott Carey wrote:

 Ok, so here is a ... fun result.

 I have dfs.replication.min set to 2, so I can't just do
 hadoop fs -Ddfs.replication=1 -put someFile someFile
 since that will fail.

 So here are two results that are fascinating:

 $ time hadoop fs -Ddfs.replication=3 -put test.tar test.tar
 real    1m53.237s
 user    0m1.952s
 sys     0m0.308s

 $ time hadoop fs -Ddfs.replication=2 -put test.tar test.tar
 real    0m1.689s
 user    0m1.763s
 sys     0m0.315s



 The file is 77 MB, so it is two blocks.
 The test with replication level 3 is slow about 9 out of 10 times.  When it is
 slow it sometimes takes 28 seconds, sometimes 2 minutes.  It was fast one
 time...
 The test with replication level 2 is fast in 40 out of 40 tests.

 This is a development cluster with 8 nodes.

 It looks like a replication level of 3 or more causes trouble.  Looking
 more closely at the logs, it seems that certain datanodes (but not all) cause
 large delays if they are in the middle of an HDFS write chain.  So, a write
 that goes A -> B -> C is fast if B is a good node and C a bad node.  If
 it's A -> C -> B then it's slow.

 So, I can say that some nodes, but not all, are doing something wrong when in
 the middle of a write chain.  If I do a replication = 2 write on one of these
 bad nodes, it's always slow.

 So the good news is I can identify the bad nodes and decommission them.  The
 bad news is that this still doesn't make a lot of sense, and 40% of the nodes
 have the issue.  Worse, on a couple of nodes the behavior in the
 replication = 2 case is not consistent -- sometimes the first block is fast.
 So it may depend not just on the source, but on the source and target
 combination in the chain.


 At this point, I suspect something completely broken at the network level, 
 perhaps even routing.  Why it would show up after an upgrade is yet to be 
 determined, but the upgrade did include some config changes and OS updates.

 Thanks Todd!

 -Scott


 On Apr 16, 2010, at 5:34 PM, Todd Lipcon wrote:

 Hey Scott,

 This is indeed really strange... if you do a straight hadoop fs -put with
 dfs.replication set to 1 from one of the DNs, does it upload slow? That
 would cut out the network from the equation.

 -Todd

 On Fri, Apr 16, 2010 at 5:29 PM, Scott Carey sc...@richrelevance.com wrote:

 I have two clusters upgraded to CDH2.   One is performing fine, and the
 other is EXTREMELY slow.

 Some jobs that formerly took 90 seconds now take 20 to 50 minutes.

 It is an HDFS issue from what I can tell.

 The simple DFS benchmark with one map task shows the problem clearly.  I
 have looked at every difference I can find and am wondering where else to
 look to track this down.
 The disks on all nodes in the cluster check out -- capable of 75MB/sec
 minimum with a 'dd' write test.
 top / iostat do not show any significant CPU usage or iowait times on any
 machines in the cluster during the test.
 ifconfig does not report any dropped packets or other errors on any machine
 in the cluster.  dmesg has nothing interesting.
 The poorly performing cluster is on a slightly newer CentOS version:
 Poor: 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64 x86_64
 x86_64 GNU/Linux  (CentOS 5.4, recent patches)
 Good: 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64
 x86_64 GNU/Linux  (CentOS 5.3, I think)
 The performance is always poor, not sporadically poor.  It is poor with M/R
 tasks as well as non-M/R HDFS clients (i.e. sqoop).

 Poor performance cluster (no other jobs active during the test):
 ---
 $ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write
 -nrFiles 1 -fileSize 2000
 10/04/16 12:53:13 INFO mapred.FileInputFormat: nrFiles = 1
 10/04/16 12:53:13 INFO mapred.FileInputFormat: fileSize (MB) = 2000
 10/04/16 12:53:13 INFO mapred.FileInputFormat: bufferSize = 100
 10/04/16 12:53:14 INFO mapred.FileInputFormat: creating control file: 2000
 mega bytes, 1 files
 10/04/16 12:53:14 INFO mapred.FileInputFormat: created control files for: 1
 files
 10/04/16 12:53:14 WARN mapred.JobClient: Use GenericOptionsParser for
 parsing the arguments. Applications should implement Tool for the same.
 10/04/16 12:53:15 

Re: Extremely slow HDFS after upgrade

2010-04-16 Thread Todd Lipcon
Checked link autonegotiation with ethtool?  Sometimes gigE will autonegotiate
down to 10 Mb half duplex if there's a bad cable, NIC, or switch port.
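
Something like this on each node will show what the link actually negotiated
(the interface name is a guess):

$ ethtool eth0 | grep -E 'Speed|Duplex|Auto-negotiation'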

-Todd

On Fri, Apr 16, 2010 at 8:08 PM, Scott Carey sc...@richrelevance.com wrote:

 More info -- this is not a Hadoop issue.

 The network performance issue can be replicated with SSH only on the links
 where Hadoop has a problem, and only in the direction with a problem.

 HDFS is slow to transfer data in certain directions from certain machines.

 So, for example, copying from node C to D may be slow, but not the other
 direction, from D to C.  Likewise, although only 3 of 8 nodes have this
 problem, it is not universal.  For example, node C might have trouble
 copying data to 5 of the 7 other nodes, and node G might have trouble with
 all 7 other nodes.

 No idea what it is yet, but SSH exhibits the same issue -- only in those
 specific point-to-point links in one specific direction.

 -Scott

 On Apr 16, 2010, at 7:10 PM, Scott Carey wrote:

  Ok, so here is a ... fun result.
 
  I have dfs.replication.min set to 2, so I can't just do
  hadoop fs -Ddfs.replication=1 -put someFile someFile
  since that will fail.
 
  So here are two results that are fascinating:
 
  $ time hadoop fs -Ddfs.replication=3 -put test.tar test.tar
  real    1m53.237s
  user    0m1.952s
  sys     0m0.308s

  $ time hadoop fs -Ddfs.replication=2 -put test.tar test.tar
  real    0m1.689s
  user    0m1.763s
  sys     0m0.315s
 
 
 
  The file is 77 MB, so it is two blocks.
  The test with replication level 3 is slow about 9 out of 10 times.  When
 it is slow it sometimes takes 28 seconds, sometimes 2 minutes.  It was fast one
 time...
  The test with replication level 2 is fast in 40 out of 40 tests.
 
  This is a development cluster with 8 nodes.
 
  It looks like a replication level of 3 or more causes trouble.  Looking
 more closely at the logs, it seems that certain datanodes (but not all)
 cause large delays if they are in the middle of an HDFS write chain.  So, a
 write that goes A -> B -> C is fast if B is a good node and C a bad node.
  If it's A -> C -> B then it's slow.

  So, I can say that some nodes, but not all, are doing something wrong when
 in the middle of a write chain.  If I do a replication = 2 write on one of
 these bad nodes, it's always slow.
 
  So the good news is I can identify the bad nodes and decommission them.
  The bad news is that this still doesn't make a lot of sense, and 40% of the
 nodes have the issue.  Worse, on a couple of nodes the behavior in the
 replication = 2 case is not consistent -- sometimes the first block is fast.
  So it may depend not just on the source, but on the source and target
 combination in the chain.
 
 
  At this point, I suspect something completely broken at the network
 level, perhaps even routing.  Why it would show up after an upgrade is yet
 to be determined, but the upgrade did include some config changes and OS
 updates.
 
  Thanks Todd!
 
  -Scott
 
 
  On Apr 16, 2010, at 5:34 PM, Todd Lipcon wrote:
 
  Hey Scott,
 
  This is indeed really strange... if you do a straight hadoop fs -put
 with
  dfs.replication set to 1 from one of the DNs, does it upload slow? That
  would cut out the network from the equation.
 
  -Todd
 
  On Fri, Apr 16, 2010 at 5:29 PM, Scott Carey sc...@richrelevance.com
 wrote:
 
  I have two clusters upgraded to CDH2.   One is performing fine, and the
  other is EXTREMELY slow.
 
  Some jobs that formerly took 90 seconds now take 20 to 50 minutes.
 
  It is an HDFS issue from what I can tell.
 
  The simple DFS benchmark with one map task shows the problem clearly.
  I
  have looked at every difference I can find and am wondering where else
 to
  look to track this down.
  The disks on all nodes in the cluster check out -- capable of 75MB/sec
  minimum with a 'dd' write test.
  top / iostat do not show any significant CPU usage or iowait times on
 any
  machines in the cluster during the test.
  ifconfig does not report any dropped packets or other errors on any
 machine
  in the cluster.  dmesg has nothing interesting.
  The poorly performing cluster is on a slightly newer CentOS version:
  Poor: 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64
 x86_64
  x86_64 GNU/Linux  (CentOS 5.4, recent patches)
  Good: 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64
  x86_64 GNU/Linux  (CentOS 5.3, I think)
  The performance is always poor, not sporadically poor.  It is poor with
 M/R
  tasks as well as non-M/R HDFS clients (i.e. sqoop).
 
  Poor performance cluster (no other jobs active during the test):
  ---
  $ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO
 -write
  -nrFiles 1 -fileSize 2000
  10/04/16 12:53:13 INFO mapred.FileInputFormat: nrFiles = 1
  10/04/16 12:53:13 INFO mapred.FileInputFormat: fileSize (MB) = 2000
  10/04/16 12:53:13 INFO mapred.FileInputFormat: bufferSize = 100
  10/04/16 12:53:14 INFO