Re: What will we encounter if we add a lot of nodes into the current cluster?

2009-08-12 Thread Ted Dunning
If you add these nodes, data will be put on them as you add data to the
cluster.

Soon after adding the nodes you should rebalance the storage to avoid age-related
surprises in how files are arranged in your cluster.

Other than that, your addition should cause little in the way of surprises.
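
[For reference, a typical way to kick off that rebalancing once the new nodes have joined. The threshold is how far, in percent of capacity, a datanode's utilization may deviate from the cluster average before blocks get moved; 10 is a common starting point, not a recommendation for any particular cluster:]

```shell
# Run on a node with the Hadoop client configured (often the namenode host).
# The balancer moves blocks from over-utilized to under-utilized datanodes
# until every node is within the threshold of the cluster-wide average.
bin/hadoop balancer -threshold 10

# It is safe to interrupt (Ctrl-C) and re-run later; it picks up where
# the cluster currently stands.
```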

On Tue, Aug 11, 2009 at 11:00 PM, yang song hadoop.ini...@gmail.com wrote:

 Dear all,
 I'm sorry to disturb you.
 Our cluster has 200 nodes now. In order to improve its capacity, we hope
 to add 60 nodes to the current cluster. However, we don't know what will
 happen if we add so many nodes at the same time. Could you give me some
 tips and notes? During the process, which parts should we pay the most
 attention to?
 Thank you!

 P.S. Our environment is hadoop-0.19.1, jdk1.6.0_06, Linux Red Hat
 Enterprise 4.0




-- 
Ted Dunning, CTO
DeepDyve


Re: What will we encounter if we add a lot of nodes into the current cluster?

2009-08-12 Thread Aaron Kimball
Also, if you haven't yet configured rack awareness, now's a good time to
start :)
- Aaron
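
[For anyone reading this in the archives: rack awareness in this generation of Hadoop is enabled by pointing the `topology.script.file.name` property (in hadoop-site.xml) at a script that maps datanode addresses to rack paths. A minimal sketch, assuming a purely hypothetical addressing scheme where the second octet identifies the rack — your mapping will differ:]

```shell
#!/bin/sh
# Hypothetical topology script: Hadoop invokes it with one or more
# datanode IPs/hostnames as arguments and expects one rack path
# (e.g. /rack1) per argument on stdout.
rack_for() {
  case "$1" in
    10.1.*) echo "/rack1" ;;
    10.2.*) echo "/rack2" ;;
    *)      echo "/default-rack" ;;
  esac
}

for node in "$@"; do
  rack_for "$node"
done
```

[Point `topology.script.file.name` at the script and restart the namenode; anything the script doesn't recognize falls back to /default-rack, which is also the behavior when no script is configured at all.]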



Re: What will we encounter if we add a lot of nodes into the current cluster?

2009-08-12 Thread yang song
Thank you for the explanation.

I'm trying the balancer tool (bin/hadoop balancer -t xxx). However, the
data transfer is so slow that it will take a very long time.
Is there a good way to speed it up?

What's more, I have a question. We rarely use the existing data in the
cluster, so rebalancing the existing data doesn't seem worthwhile.
So I intend not to rebalance the data. Is that reasonable?

Thank you.



Re: What will we encounter if we add a lot of nodes into the current cluster?

2009-08-12 Thread Harish Mallipeddi

I think if you add the new nodes and don't rebalance, Hadoop is probably
going to favor the new nodes over the older ones when placing all the new
data you write into HDFS. As a result, even your new data will probably not
be spread evenly across the cluster. If you're going to run m/r jobs on
that new data, it's a good idea to have it distributed evenly across the
cluster.

-- 
Harish Mallipeddi
http://blog.poundbang.in


Re: What will we encounter if we add a lot of nodes into the current cluster?

2009-08-12 Thread Ted Dunning
There is a parameter (dfs.balance.bandwidthPerSec) that limits the
rebalancing bandwidth.  The default is rather low.

See http://developer.yahoo.com/hadoop/tutorial/module2.html#rebalancing
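
[Concretely, the limit lives in the configuration (hadoop-site.xml in 0.19) and is read by the datanodes at startup, so they need a restart for a new value to take effect. The figure below (roughly 10 MB/s, up from what I believe is a 1 MB/s default) is only illustrative; pick a cap your network and jobs can tolerate:]

```xml
<!-- hadoop-site.xml: per-datanode bandwidth cap for balancer traffic -->
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>10485760</value> <!-- bytes/sec; default 1048576 (1 MB/s) -->
</property>
```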


-- 
Ted Dunning, CTO
DeepDyve