Hi, Thanks for the idea, I will give this a try and report back.
My worry is: if we decommission a small node (one at a time), will it move the data to the larger nodes or choke the other smaller nodes? In principle it should distribute the blocks, but the point is that it is not distributing them the way we expect it to, so do you think this may cause further problems?

---------

On Mar 24, 2013, at 3:37 PM, Jamal B <jm151...@gmail.com> wrote:

> Then I think the only way around this would be to decommission the smaller
> nodes, 1 at a time, and ensure that the blocks are moved to the larger nodes.
> And once complete, bring back in the smaller nodes, but maybe only after you
> tweak the rack topology to match your disk layout more than network layout to
> compensate for the unbalanced nodes.
>
> Just my 2 cents
>
>
> On Sun, Mar 24, 2013 at 4:31 PM, Tapas Sarangi <tapas.sara...@gmail.com>
> wrote:
> Thanks. We have a 1-1 configuration of drives and folders in all the
> datanodes.
>
> -Tapas
>
> On Mar 24, 2013, at 3:29 PM, Jamal B <jm151...@gmail.com> wrote:
>
>> On both types of nodes, what is your dfs.data.dir set to? Does it specify
>> multiple folders on the same set of drives or is it 1-1 between folder and
>> drive? If it's set to multiple folders on the same drives, it is probably
>> multiplying the amount of "available capacity" incorrectly in that it
>> assumes a 1-1 relationship between folder and total capacity of the drive.
>>
>>
>> On Sun, Mar 24, 2013 at 3:01 PM, Tapas Sarangi <tapas.sara...@gmail.com>
>> wrote:
>> Yes, thanks for pointing that out, but I already know that it is completing
>> the balancing when exiting, otherwise it shouldn't exit.
>> Your answer doesn't solve the problem I mentioned earlier in my message.
>> 'hdfs' is stalling and hadoop is not writing unless space is cleared up from
>> the cluster even though "df" shows the cluster has about 500 TB of free
>> space.
>>
>> -------
>>
>>
>> On Mar 24, 2013, at 1:54 PM, Balaji Narayanan (பாலாஜி நாராயணன்)
>> <bal...@balajin.net> wrote:
>>
>>> -setBalancerBandwidth <bandwidth in bytes per second>
>>>
>>> So the value is bytes per second. If it is running and exiting, it means it
>>> has completed the balancing.
>>>
>>>
>>> On 24 March 2013 11:32, Tapas Sarangi <tapas.sara...@gmail.com> wrote:
>>> Yes, we are running balancer, though a balancer process runs for almost a
>>> day or more before exiting and starting over.
>>> Current dfs.balance.bandwidthPerSec value is set to 2x10^9. I assume that's
>>> bytes so about 2 GigaByte/sec. Shouldn't that be reasonable? If it is in
>>> bits then we have a problem.
>>> What's the unit for "dfs.balance.bandwidthPerSec"?
>>>
>>> -----
>>>
>>> On Mar 24, 2013, at 1:23 PM, Balaji Narayanan (பாலாஜி நாராயணன்)
>>> <li...@balajin.net> wrote:
>>>
>>>> Are you running balancer? If balancer is running and if it is slow, try
>>>> increasing the balancer bandwidth
>>>>
>>>>
>>>> On 24 March 2013 09:21, Tapas Sarangi <tapas.sara...@gmail.com> wrote:
>>>> Thanks for the follow up. I don't know whether the attachment will pass
>>>> through this mailing list, but I am attaching a pdf that contains the
>>>> usage of all live nodes.
>>>>
>>>> All nodes starting with letter "g" are the ones with smaller storage space
>>>> whereas nodes starting with letter "s" have larger storage space. As you
>>>> will see, most of the "gXX" nodes are completely full whereas "sXX" nodes
>>>> have a lot of unused space.
>>>>
>>>> Recently, we have been facing a crisis frequently, as 'hdfs' goes into a
>>>> mode where it is not able to write any further even though the total space
>>>> available in the cluster is about 500 TB. We believe this has something to
>>>> do with the way it is balancing the nodes, but don't understand the
>>>> problem yet. Maybe the attached PDF will help some of you (experts) to see
>>>> what is going wrong here...
>>>>
>>>> Thanks
>>>> ------
>>>>
>>>>>
>>>>> The balancer knows about the topology, but when it calculates balancing
>>>>> it operates only with nodes, not with racks.
>>>>> You can see how it works in Balancer.java, in BalancerDatanode, at about
>>>>> line 509.
>>>>>
>>>>> I was wrong about 350 TB and 35 TB; it calculates it this way:
>>>>>
>>>>> For example:
>>>>> cluster_capacity=3.5 PB
>>>>> cluster_dfsused=2 PB
>>>>>
>>>>> avgutil=cluster_dfsused/cluster_capacity*100=57.14% of cluster capacity
>>>>> used.
>>>>> Then we know the average node utilization
>>>>> (node_dfsused/node_capacity*100). The balancer thinks that all is good if
>>>>> avgutil+10 > node_utilization >= avgutil-10.
>>>>>
>>>>> The ideal case is that every node uses avgutil of its capacity, but for a
>>>>> 12 TB node that is only 6.5 TB and for a 72 TB node it is about 40 TB.
>>>>>
>>>>> The balancer can't help you.
>>>>>
>>>>> Show me http://namenode.rambler.ru:50070/dfsnodelist.jsp?whatNodes=LIVE
>>>>> if you can.
>>>>>
>>>>>
>>>>>> In the ideal case with replication factor 2, with two nodes of 12 TB and
>>>>>> 72 TB, you will be able to have only 12 TB of replicated data.
>>>>>
>>>>> Yes, this is true for exactly two nodes in the cluster with 12 TB and 72
>>>>> TB, but not true for more than two nodes in the cluster.
>>>>>
>>>>>>
>>>>>> The best way, in my opinion, is to use multiple racks. Nodes in a rack
>>>>>> must have identical capacity. Racks must have identical capacity.
>>>>>> For example:
>>>>>>
>>>>>> rack1: 1 node with 72 TB
>>>>>> rack2: 6 nodes with 12 TB
>>>>>> rack3: 3 nodes with 24 TB
>>>>>>
>>>>>> It helps with balancing, because a duplicated block must be on another
>>>>>> rack.
>>>>>>
>>>>>
>>>>> The same question I asked earlier in this message: do multiple racks
>>>>> with the default threshold for the balancer minimize the difference
>>>>> between racks?
>>>>>
>>>>>> Why did you select hdfs? Maybe lustre, cephfs or another is a better
>>>>>> choice.
>>>>>>
>>>>>
>>>>> It wasn't my decision, and I probably can't change it now. I am new to
>>>>> this cluster and trying to understand a few issues. I will explore other
>>>>> options as you mentioned.
>>>>>
>>>>> --
>>>>> http://balajin.net/blog
>>>>> http://flic.kr/balajijegan
>>>
>>>
>>> --
>>> http://balajin.net/blog
>>> http://flic.kr/balajijegan
>>
>
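
The balancer arithmetic quoted in the thread can be sketched in a few lines of Python. This is only an illustration of the calculation as it is described in the messages above (cluster average utilization plus a 10-point threshold), not the logic in Balancer.java. The function names, the decimal conversion (1 PB = 1000 TB), and the sample node values are assumptions made for the example.

```python
# Sketch of the balancer arithmetic as described in the thread.
# Function names and sample values are hypothetical, not taken from Hadoop.

def utilization(dfs_used_tb: float, capacity_tb: float) -> float:
    """Return used space as a percentage of capacity."""
    return dfs_used_tb / capacity_tb * 100.0


def is_balanced(node_util: float, avg_util: float, threshold: float = 10.0) -> bool:
    """True when avg_util - threshold <= node_util < avg_util + threshold."""
    return avg_util - threshold <= node_util < avg_util + threshold


def ideal_used_tb(capacity_tb: float, avg_util: float) -> float:
    """Space a node would hold if it sat exactly at the cluster average."""
    return capacity_tb * avg_util / 100.0


# Cluster totals from the thread, assuming 1 PB = 1000 TB.
avg_util = utilization(2000.0, 3500.0)

# A completely full 12 TB node versus a 72 TB node holding 40 TB.
full_small_node = is_balanced(utilization(12.0, 12.0), avg_util)
large_node = is_balanced(utilization(40.0, 72.0), avg_util)

print(f"average utilization: {avg_util:.2f}%")
print(f"ideal data on a 12 TB node: {ideal_used_tb(12.0, avg_util):.2f} TB")
print(f"ideal data on a 72 TB node: {ideal_used_tb(72.0, avg_util):.2f} TB")
print(f"full 12 TB node within threshold: {full_small_node}")
print(f"72 TB node at 40 TB within threshold: {large_node}")
```

Under these assumptions, a completely full 12 TB node falls outside the threshold window, while a 72 TB node holding 40 TB falls inside it.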