Re: How do I configure a Partitioner in the new API?
Hello, On Wed, May 4, 2011 at 9:25 PM, W.P. McNeill bill...@gmail.com wrote: I only wanted the context as a way of getting at the configuration, so making the class implement Configurable will solve my problem. Good to know. I believe this ought to be documented. I've opened https://issues.apache.org/jira/browse/MAPREDUCE-2474 for the same. -- Harsh J
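
For reference, a minimal sketch of the pattern being discussed, with illustrative class and property names (not from the thread): a new-API Partitioner that implements Configurable, so the framework hands it the job Configuration through setConf() when it instantiates the class.

    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class ConfiguredPartitioner extends Partitioner<Text, IntWritable>
        implements Configurable {

      private Configuration conf;
      private int salt = 0;

      @Override
      public void setConf(Configuration conf) {
        // Called by the framework (via ReflectionUtils.newInstance) right after construction.
        this.conf = conf;
        this.salt = conf.getInt("example.partition.salt", 0); // illustrative property name
      }

      @Override
      public Configuration getConf() {
        return conf;
      }

      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Placeholder partitioning logic that uses the configured value.
        int hash = key.toString().hashCode() + salt;
        return (hash & Integer.MAX_VALUE) % numPartitions;
      }
    }

The job would register it with job.setPartitionerClass(ConfiguredPartitioner.class) and set the property on the job's Configuration before submission.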
Use different process method to process data by input file name?
Hi guys, As the topic shows, how can I use different processing methods for data according to the input file name in the map function? I.e., can I get the name of the input file that the line currently being processed belongs to? Austin
Re: Use different process method to process data by input file name?
Moving this to mapreduce-user@ since that is more appropriate for hadoop-mapreduce questions (bcc: common-user@). 2011/5/5 王志强 wangzhiqi...@360.cn: Hi, guys As the topic shows, how can I use different process methods to process data according to input file name in map function? Ie, May I get the input file name that current process line belong to? Austin You can get the filename using these additional JVM properties of the MapTasks: http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Task+JVM+Reuse [Look at the table right below the link anchor for all properties, and map.input.file in particular] -- Harsh J
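
For anyone landing on this thread later, a minimal sketch of using that property (old "mapred" API, matching the tutorial linked above; the class name and per-extension branching are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class FileAwareMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private String inputFile;

      @Override
      public void configure(JobConf job) {
        // "map.input.file" is one of the per-task properties listed in the tutorial table.
        inputFile = job.get("map.input.file");
      }

      @Override
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        if (inputFile != null && inputFile.endsWith(".csv")) {  // illustrative branch
          // process one format here...
        } else {
          // ...and the other format here
        }
      }
    }

With the new mapreduce API, an alternative is to cast context.getInputSplit() to FileSplit and read the file name from getPath().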
Re: Use different process method to process data by input file name?
Hi, I think MultipleInputs may fit your need. Yaozhen 在 2011 5 5 17:43,王志强 wangzhiqi...@360.cn写道: Hi, guys As the topic shows, how can I use different process methods to process data according to input file name in map function? Ie, May I get the input file name that current process line belong to? Austin
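
For completeness, a sketch of that suggestion (old "mapred" API; the driver, paths and mapper class names are made up for illustration). Each input path gets its own mapper class inside the driver's run() method, so the per-file logic doesn't need an if/else in a single mapper:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.lib.MultipleInputs;

    JobConf conf = new JobConf(MyDriver.class);  // MyDriver is illustrative
    // Records from /data/logs go to LogLineMapper, records from /data/users to UserRecordMapper.
    MultipleInputs.addInputPath(conf, new Path("/data/logs"),
        TextInputFormat.class, LogLineMapper.class);
    MultipleInputs.addInputPath(conf, new Path("/data/users"),
        TextInputFormat.class, UserRecordMapper.class);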
How does Hadoop parse input files into (Key,Value) pairs?
Hi, As we know, a Hadoop mapper takes its input as (Key,Value) pairs and generates intermediate (Key,Value) pairs, and usually we give the input to our mapper as a text file. How does Hadoop understand this and parse our input text file into (Key,Value) pairs? Usually our mapper looks like this: public void map(LongWritable key, Text value, OutputCollector<Text, Text> outputCollector, Reporter reporter) throws IOException { String word = value.toString(); // some lines of code } So if I pass any text file as input, it takes every line as the VALUE to the mapper, on which I do some processing and put it to the OutputCollector. But how does Hadoop parse my text file into (Key,Value) pairs, and how can we tell Hadoop what (key,value) it should give to the mapper? Thanks.
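
The short answer, added here for reference: the job's InputFormat and its RecordReader decide what the (key, value) pairs are. With the default TextInputFormat, the key is the line's byte offset in the file (a LongWritable) and the value is the line itself (a Text). A minimal old-API sketch of choosing the input format (the driver class name is illustrative):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.TextInputFormat;

    JobConf conf = new JobConf(MyJob.class);  // MyJob is illustrative

    // Default: each record is one line; key = byte offset (LongWritable), value = the line (Text).
    conf.setInputFormat(TextInputFormat.class);

    // Alternative: split each line at the first tab; key = text before the tab, value = the rest.
    // conf.setInputFormat(KeyValueTextInputFormat.class);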
Re: Cluster hard drive ratios
On 04/05/11 19:59, Matt Goeke wrote: Mike, Thanks for the response. It looks like this discussion forked on the CDH list so I have two different conversations now. Also, you're dead on that one of the presentations I was referencing was Ravi's. With your setup I agree that it would have made no sense to go the 2.5 drive route given it would have forced you into the 500-750GB SATA drives and all it would allow is more spindles but less capacity at a higher cost. The servers we have been considering are actually the R710's so dual hexacore with 12 spindles of actual capacity is more of a 1:1 in terms of cores to spindles vs the 2:1 I have been reviewing. My original issue attempted to focus more around at what point do you actually see a plateau in write performance of cores:spindles but since we are headed that direction anyway it looks like it was more to sate curiosity than driving specifications.

Some people are using this as it gives the best storage density. You can also go for single hexacore servers, as in a big cluster the savings there translate into even more storage. It all depends on the application.

As to your point, I forgot to include the issue of rebalancing in the original email but you are absolutely right. That was another major concern especially as we would get closer to filling capacity of a 24TB box. I think the original plan was bonded GbE but I think our infrastructure team has told us 10GbE would be standard.

1. If you want to play with bonded GbE then I have some notes I can send you - it's harder than you think.
2. I don't know anyone who is running 10GbE + Hadoop, though I see hints that StumbleUpon are doing this with Arista switches. You'd have to have a very chatty app or 10GbE on the mainboard to justify it.
3. I do know of installations with 24TB HDD and GbE; yes, the overhead of a node failure is higher, but with fewer nodes, P(failure) may be lower. The big fear is loss-of-rack, which can come from ToR switch failure or from network config errors. Hadoop isn't partition aware and will treat a rack outage as the loss of 40+ servers, try to replicate all that data, and that's when you're in trouble (look at the AWS EBS outage for an example cascade failure).
4. There are JIRA issues for better handling of drive failure, including hotswapping and rebalancing data within a single machine.
5. I'd like support for the ability to say a node is going down, don't replicate, and the same for a rack, to ease maintenance.

-Steve
distcp performing much better for rebalancing than dedicated balancer
Hi, On our 15-node cluster (1GB ethernet and 4x1TB disk per node) I noticed that distcp does a much better job at rebalancing than the dedicated balancer does. We needed to decommission 11 nodes, so that prior to rebalancing we had 4 used and 11 empty nodes. The 4 used nodes had about 25% usage each. Most of our files are of average size: we have about 500K files in 280K blocks and 800K blocks total (blocksize is 64MB). So I changed dfs.balance.bandwidthPerSec to 800100100 and restarted the cluster. I started the balancer tool and noticed that it moved about 200GB in 1 hour (I grepped the balancer log for "Need to move"). After stopping the balancer I started a distcp. This tool copied 900GB in just 45 minutes, with an average replication of 2, so its total throughput was around 2.4 TB/hour. Fair enough, it is not purely rebalancing because the 4 overused nodes also get new blocks, but still it performs much better. Munin confirms the much higher disk/ethernet throughput of the distcp. Are these characteristics to be expected? Either way, can the balancer be boosted even more (aside from the dfs.balance.bandwidthPerSec property)? Ferdy.
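
For reference, the bandwidth cap mentioned above is a per-datanode setting; something like the following in hdfs-site.xml (the value is bytes per second, so 800100100 is roughly 800 MB/s, effectively uncapped on gigabit), applied with the restart Ferdy describes:

    <property>
      <name>dfs.balance.bandwidthPerSec</name>
      <value>800100100</value>
    </property>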
Re: distcp performing much better for rebalancing than dedicated balancer
Did you explicitly start a balancer or did you decommission the nodes using dfs.hosts.exclude and a dfsadmin -refreshNodes? On Thu, May 5, 2011 at 14:30, Ferdy Galema ferdy.gal...@kalooga.com wrote: Hi, On our 15node cluster (1GB ethernet and 4x1TB disk per node) I noticed that distcp does a much better job at rebalancing than the dedicated balancer does. We needed to decommision 11 nodes, so that prior to rebalancing we had 4 used and 11 empty nodes. The 4 used nodes had about 25% usage each. Most of our files are of average size: We have about 500K files in 280K blocks and 800K blocks total (blocksize is 64MB). So I changed dfs.balance.bandwidthPerSec to 800100100 and restarted the cluster. Started the balancer tool and I noticed that the it moved about 200GB in 1 hour. (I grepped the balancer log for Need to move). After stopping the balancer I started a distcp. This tool copied 900GB in just 45 minutes, with an average replication of 2 so it's total throughput was around 2.4 TB/hour. Fair enough, it is not purely rebalancing because the 4 overused nodes also get new blocks, still it performs much better. Munin confirms the much higher disk/ethernet throughputs of the distcp. Are these characteristics to be expected? Either way, can the balancer be boosted even more? (Aside the dfs.balance.bandwidthPerSec property). Ferdy.
including jar files in hadoop job
Hi all, Can anyone tell me how to include jar files while running a Hadoop job? The version is 0.21.0. I used the -libjars option in Hadoop 0.20.2 and it was working, but in this version it's not working. Please help me in this regard. Thanks Regards Nakul chakrapani
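
For reference, the usual form of the 0.20-style invocation (jar name, driver class and paths are illustrative). -libjars is parsed by GenericOptionsParser, so it must appear before the job's own arguments and the driver has to run through ToolRunner for it to be honoured; whether 0.21.0 actually picks it up may be a separate problem:

    hadoop jar myjob.jar com.example.MyDriver \
        -libjars /path/to/dep1.jar,/path/to/dep2.jar \
        /input /output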
Can we access NameNode HDFS from slave Nodes ??
hey, Can we access NameNode's hdfs on our slave machines ?? I am just running command hadoop dfs -ls on my slave machine ( running tasktracker and Datanode), and its giving me the following output : hadoop@ub12:~$ hadoop dfs -ls 11/05/05 18:31:54 INFO ipc.Client: Retrying connect to server: ub13/ 162.192.100.53:54310. Already tried 0 time(s). 11/05/05 18:31:55 INFO ipc.Client: Retrying connect to server: ub13/ 162.192.100.53:54310. Already tried 1 time(s). 11/05/05 18:31:56 INFO ipc.Client: Retrying connect to server: ub13/ 162.192.100.53:54310. Already tried 2 time(s). 11/05/05 18:31:57 INFO ipc.Client: Retrying connect to server: ub13/ 162.192.100.53:54310. Already tried 3 time(s). 11/05/05 18:31:58 INFO ipc.Client: Retrying connect to server: ub13/ 162.192.100.53:54310. Already tried 4 time(s). 11/05/05 18:31:59 INFO ipc.Client: Retrying connect to server: ub13/ 162.192.100.53:54310. Already tried 5 time(s). 11/05/05 18:32:00 INFO ipc.Client: Retrying connect to server: ub13/ 162.192.100.53:54310. Already tried 6 time(s). 11/05/05 18:32:01 INFO ipc.Client: Retrying connect to server: ub13/ 162.192.100.53:54310. Already tried 7 time(s). 11/05/05 18:32:02 INFO ipc.Client: Retrying connect to server: ub13/ 162.192.100.53:54310. Already tried 8 time(s). 11/05/05 18:32:03 INFO ipc.Client: Retrying connect to server: ub13/ 162.192.100.53:54310. Already tried 9 time(s). Bad connection to FS. command aborted. I just restarted my Master Node ( and run start-all.sh ) The output on my master node is hadoop@ub13:/usr/local/hadoop$ start-all.sh starting namenode, logging to /usr/local/hadoop/hadoop/bin/../logs/hadoop-hadoop-namenode-ub13.out ub11: starting datanode, logging to /usr/local/hadoop/hadoop/bin/../logs/hadoop-hadoop-datanode-ub11.out ub10: starting datanode, logging to /usr/local/hadoop/hadoop/bin/../logs/hadoop-hadoop-datanode-ub10.out ub12: starting datanode, logging to /usr/local/hadoop/hadoop/bin/../logs/hadoop-hadoop-datanode-ub12.out ub13: starting datanode, logging to /usr/local/hadoop/hadoop/bin/../logs/hadoop-hadoop-datanode-ub13.out ub13: starting secondarynamenode, logging to /usr/local/hadoop/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-ub13.out starting jobtracker, logging to /usr/local/hadoop/hadoop/bin/../logs/hadoop-hadoop-jobtracker-ub13.out ub10: starting tasktracker, logging to /usr/local/hadoop/hadoop/bin/../logs/hadoop-hadoop-tasktracker-ub10.out ub11: starting tasktracker, logging to /usr/local/hadoop/hadoop/bin/../logs/hadoop-hadoop-tasktracker-ub11.out ub12: starting tasktracker, logging to /usr/local/hadoop/hadoop/bin/../logs/hadoop-hadoop-tasktracker-ub12.out ub13: starting tasktracker, logging to /usr/local/hadoop/hadoop/bin/../logs/hadoop-hadoop-tasktracker-ub13.out hadoop@ub13:/usr/local/hadoop$ jps 6471 NameNode 7070 Jps 6875 JobTracker 6632 DataNode 7030 TaskTracker 6795 SecondaryNameNode Thanks, Praveenesh
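
For reference: yes, any machine with the Hadoop client configured can talk to the NameNode's HDFS. The retries above mean the client on ub12 cannot reach ub13 on port 54310, typically because the NameNode is not up yet, the address is wrong, or a firewall is in the way. The client finds the NameNode through fs.default.name in core-site.xml, roughly as below; the host and port shown simply mirror the log output.

    <property>
      <name>fs.default.name</name>
      <value>hdfs://ub13:54310</value>
    </property>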
Re: distcp performing much better for rebalancing than dedicated balancer
The decommissioning was performed with solely refreshNodes, but that's somewhat irrelevant because the balancing tests were performed after I re-added the 11 empty nodes. (FYI the drives were formatted with another unix fs). Though I did notice that the decommissioning shows about the same metrics as that of the balancer test afterwards, not very fast that is. On 05/05/2011 02:57 PM, Mathias Herberts wrote: Did you explicitely start a balancer or did you decommission the nodes using dfs.hosts.exclude and a dfsadmin -refreshNodes? On Thu, May 5, 2011 at 14:30, Ferdy Galemaferdy.gal...@kalooga.com wrote: Hi, On our 15node cluster (1GB ethernet and 4x1TB disk per node) I noticed that distcp does a much better job at rebalancing than the dedicated balancer does. We needed to decommision 11 nodes, so that prior to rebalancing we had 4 used and 11 empty nodes. The 4 used nodes had about 25% usage each. Most of our files are of average size: We have about 500K files in 280K blocks and 800K blocks total (blocksize is 64MB). So I changed dfs.balance.bandwidthPerSec to 800100100 and restarted the cluster. Started the balancer tool and I noticed that the it moved about 200GB in 1 hour. (I grepped the balancer log for Need to move). After stopping the balancer I started a distcp. This tool copied 900GB in just 45 minutes, with an average replication of 2 so it's total throughput was around 2.4 TB/hour. Fair enough, it is not purely rebalancing because the 4 overused nodes also get new blocks, still it performs much better. Munin confirms the much higher disk/ethernet throughputs of the distcp. Are these characteristics to be expected? Either way, can the balancer be boosted even more? (Aside the dfs.balance.bandwidthPerSec property). Ferdy.
Re: distcp performing much better for rebalancing than dedicated balancer
I figured out what caused the slow balancing. Starting the balancer with a too small threshold will decrease the speed dramatically: ./start-balancer.sh -threshold 0.01 2011-05-05 17:17:04,132 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 1.26 GBbytes in this iteration 2011-05-05 17:17:36,684 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 1.26 GBbytes in this iteration 2011-05-05 17:18:09,737 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 1.26 GBbytes in this iteration 2011-05-05 17:18:41,977 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 1.26 GBbytes in this iteration as opposed to: ./start-balancer.sh 2011-05-05 17:19:01,676 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 40 GBbytes in this iteration 2011-05-05 17:21:36,800 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 30 GBbytes in this iteration 2011-05-05 17:24:13,191 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 30 GBbytes in this iteration I'd expect setting the granularity would not affect speed, just the stopping threshold. Perhaps a bug? On 05/05/2011 03:43 PM, Ferdy Galema wrote: The decommissioning was performed with solely refreshNodes, but that's somewhat irrelevant because the balancing tests were performed after I re-added the 11 empty nodes. (FYI the drives were formatted with another unix fs). Though I did notice that the decommissioning shows about the same metrics as that of the balancer test afterwards, not very fast that is. On 05/05/2011 02:57 PM, Mathias Herberts wrote: Did you explicitely start a balancer or did you decommission the nodes using dfs.hosts.exclude and a dfsadmin -refreshNodes? On Thu, May 5, 2011 at 14:30, Ferdy Galemaferdy.gal...@kalooga.com wrote: Hi, On our 15node cluster (1GB ethernet and 4x1TB disk per node) I noticed that distcp does a much better job at rebalancing than the dedicated balancer does. We needed to decommision 11 nodes, so that prior to rebalancing we had 4 used and 11 empty nodes. The 4 used nodes had about 25% usage each. Most of our files are of average size: We have about 500K files in 280K blocks and 800K blocks total (blocksize is 64MB). So I changed dfs.balance.bandwidthPerSec to 800100100 and restarted the cluster. Started the balancer tool and I noticed that the it moved about 200GB in 1 hour. (I grepped the balancer log for Need to move). After stopping the balancer I started a distcp. This tool copied 900GB in just 45 minutes, with an average replication of 2 so it's total throughput was around 2.4 TB/hour. Fair enough, it is not purely rebalancing because the 4 overused nodes also get new blocks, still it performs much better. Munin confirms the much higher disk/ethernet throughputs of the distcp. Are these characteristics to be expected? Either way, can the balancer be boosted even more? (Aside the dfs.balance.bandwidthPerSec property). Ferdy.
Re: distcp performing much better for rebalancing than dedicated balancer
Ferdy - that is interesting. I would expect lower threshold = more data to move around (or equal to default 10%) Try with a whole integer, we regularly run balancer, -threshold 1 (to balance to 1%), maybe the decimal is throwing a wrench at hadoop. EF On 5 May 2011 09:27, Ferdy Galema ferdy.gal...@kalooga.com wrote: I figured out what caused the slow balancing. Starting the balancer with a too small threshold will decrease the speed dramatically: ./start-balancer.sh -threshold 0.01 2011-05-05 17:17:04,132 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 1.26 GBbytes in this iteration 2011-05-05 17:17:36,684 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 1.26 GBbytes in this iteration 2011-05-05 17:18:09,737 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 1.26 GBbytes in this iteration 2011-05-05 17:18:41,977 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 1.26 GBbytes in this iteration as opposed to: ./start-balancer.sh 2011-05-05 17:19:01,676 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 40 GBbytes in this iteration 2011-05-05 17:21:36,800 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 30 GBbytes in this iteration 2011-05-05 17:24:13,191 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 30 GBbytes in this iteration I'd expect setting the granularity would not affect speed, just the stopping threshold. Perhaps a bug? On 05/05/2011 03:43 PM, Ferdy Galema wrote: The decommissioning was performed with solely refreshNodes, but that's somewhat irrelevant because the balancing tests were performed after I re-added the 11 empty nodes. (FYI the drives were formatted with another unix fs). Though I did notice that the decommissioning shows about the same metrics as that of the balancer test afterwards, not very fast that is. On 05/05/2011 02:57 PM, Mathias Herberts wrote: Did you explicitely start a balancer or did you decommission the nodes using dfs.hosts.exclude and a dfsadmin -refreshNodes? On Thu, May 5, 2011 at 14:30, Ferdy Galemaferdy.gal...@kalooga.com wrote: Hi, On our 15node cluster (1GB ethernet and 4x1TB disk per node) I noticed that distcp does a much better job at rebalancing than the dedicated balancer does. We needed to decommision 11 nodes, so that prior to rebalancing we had 4 used and 11 empty nodes. The 4 used nodes had about 25% usage each. Most of our files are of average size: We have about 500K files in 280K blocks and 800K blocks total (blocksize is 64MB). So I changed dfs.balance.bandwidthPerSec to 800100100 and restarted the cluster. Started the balancer tool and I noticed that the it moved about 200GB in 1 hour. (I grepped the balancer log for Need to move). After stopping the balancer I started a distcp. This tool copied 900GB in just 45 minutes, with an average replication of 2 so it's total throughput was around 2.4 TB/hour. Fair enough, it is not purely rebalancing because the 4 overused nodes also get new blocks, still it performs much better. Munin confirms the much higher disk/ethernet throughputs of the distcp. Are these characteristics to be expected? Either way, can the balancer be boosted even more? (Aside the dfs.balance.bandwidthPerSec property). Ferdy.
Re: How do I configure a Partitioner in the new API?
The other thing you want to document is that the setConf() function is called by the Hadoop reflection utilities whenever it creates a Configurable object. I guess that's the only way it could happen, but I still wasn't sure how this worked until I stepped through it in the debugger. You also should document under what circumstances getConf() would be called. I'm still not clear what the scenario would be. On Thu, May 5, 2011 at 2:16 AM, Harsh J ha...@cloudera.com wrote: Hello, On Wed, May 4, 2011 at 9:25 PM, W.P. McNeill bill...@gmail.com wrote: I only wanted the context as a way of getting at the configuration, so making the class implement Configurable will solve my problem. Good to know. I believe this ought to be documented. I've opened https://issues.apache.org/jira/browse/MAPREDUCE-2474 for the same. -- Harsh J
Re: Cluster hard drive ratios
a node (or rack) is going down, don't replicate == DataNode Decommissioning. This feature is available. The current usage is to add the hosts to be decommissioned to the exclusion file named in dfs.hosts.exclude, then use DFSAdmin to invoke -refreshNodes. (Search for decommission in DFSAdmin source code.) NN will stop using these servers as replication targets, and will re-replicate all their replicas to other hosts that are still in service. The count of nodes that are in the process of being decommissioned is reported in the NN status web page. --Matt On May 5, 2011, at 4:48 AM, Steve Loughran wrote: On 04/05/11 19:59, Matt Goeke wrote: Mike, Thanks for the response. It looks like this discussion forked on the CDH list so I have two different conversations now. Also, you're dead on that one of the presentations I was referencing was Ravi's. With your setup I agree that it would have made no sense to go the 2.5 drive route given it would have forced you into the 500-750GB SATA drives and all it would allow is more spindles but less capacity at a higher cost. The servers we have been considering are actually the R710's so dual hexacore with 12 spindles of actual capacity is more of a 1:1 in terms of cores to spindles vs the 2:1 I have been reviewing. My original issue attempted to focus more around at what point do you actually see a plateau in write performance of cores:spindles but since we are headed that direction anyway it looks like it was more to sate curiosity than driving specifications. some people are using this as it gives best storage density. You can also go for single hexacore servers as in a big cluster the savings there translate into even more storage. It all depends on the application. As to your point, I forgot to include the issue of rebalancing in the original email but you are absolutely right. That was another major concern especially as we would get closer to filling capacity of a 24TB box. I think the original plan was bonded GBe but I think our infrastructure team has told us 10GBe would be standard. 1. If you want to play with bonded GBe then I have some notes I can send you -its harder than you think. 2. I don't know anyone who is running 10 GBe + Hadoop, though I see hints that StumbleUpon are doing this with Arista switches. You'd have to have a very chatty app or 10GBe on the mainboard to justify it. 3. I do know of installations with 24TB HDD and GBe, yes, the overhead of a node failure is higher. But with less nodes, P(failure) may be lower. The big fear is loss-of-rack, which can come from ToR switch failure or from network config errors. Hadoop isn't partition aware will treat a rack outage as the loss of 40+ servers, try to replicate all that data, and that's when you're in trouble (look at the AWS EBS outage for an example cascade failure). 4. There are JIRA issues for better handling of drive failure, including hotswapping and rebalancing data within a single machine. 5. I'd like support for the ability to say a node is going down, don't replicate, and the same for a rack, to ease maintenance. -Steve
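
Concretely, the steps Matt describes look roughly like this (the exclude-file path is illustrative): name an exclude file in the NameNode's hdfs-site.xml, add the hostnames to be decommissioned to that file, then tell the NameNode to re-read it.

    <property>
      <name>dfs.hosts.exclude</name>
      <value>/etc/hadoop/conf/dfs.exclude</value>
    </property>

    # after adding the hostnames to /etc/hadoop/conf/dfs.exclude:
    hadoop dfsadmin -refreshNodes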
Re: distcp performing much better for rebalancing than dedicated balancer
I actually tried 1% right after I ran the balancer with the default threshold. The data moved was the same as default. So in short this is what I tried: default: fast (lots of data moved; 30 to 40GB every iteration) 1%: fast (same as above) 0.01%: slow (it moves only 1.26GB in a iteration, for hours long the exact same amount) At the moment the cluster is already fully balanced. On 05/05/2011 05:46 PM, Eric Fiala wrote: Ferdy - that is interesting. I would expect lower threshold = more data to move around (or equal to default 10%) Try with a whole integer, we regularly run balancer, -threshold 1 (to balance to 1%), maybe the decimal is throwing a wrench at hadoop. EF On 5 May 2011 09:27, Ferdy Galemaferdy.gal...@kalooga.com wrote: I figured out what caused the slow balancing. Starting the balancer with a too small threshold will decrease the speed dramatically: ./start-balancer.sh -threshold 0.01 2011-05-05 17:17:04,132 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 1.26 GBbytes in this iteration 2011-05-05 17:17:36,684 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 1.26 GBbytes in this iteration 2011-05-05 17:18:09,737 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 1.26 GBbytes in this iteration 2011-05-05 17:18:41,977 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 1.26 GBbytes in this iteration as opposed to: ./start-balancer.sh 2011-05-05 17:19:01,676 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 40 GBbytes in this iteration 2011-05-05 17:21:36,800 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 30 GBbytes in this iteration 2011-05-05 17:24:13,191 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 30 GBbytes in this iteration I'd expect setting the granularity would not affect speed, just the stopping threshold. Perhaps a bug? On 05/05/2011 03:43 PM, Ferdy Galema wrote: The decommissioning was performed with solely refreshNodes, but that's somewhat irrelevant because the balancing tests were performed after I re-added the 11 empty nodes. (FYI the drives were formatted with another unix fs). Though I did notice that the decommissioning shows about the same metrics as that of the balancer test afterwards, not very fast that is. On 05/05/2011 02:57 PM, Mathias Herberts wrote: Did you explicitely start a balancer or did you decommission the nodes using dfs.hosts.exclude and a dfsadmin -refreshNodes? On Thu, May 5, 2011 at 14:30, Ferdy Galemaferdy.gal...@kalooga.com wrote: Hi, On our 15node cluster (1GB ethernet and 4x1TB disk per node) I noticed that distcp does a much better job at rebalancing than the dedicated balancer does. We needed to decommision 11 nodes, so that prior to rebalancing we had 4 used and 11 empty nodes. The 4 used nodes had about 25% usage each. Most of our files are of average size: We have about 500K files in 280K blocks and 800K blocks total (blocksize is 64MB). So I changed dfs.balance.bandwidthPerSec to 800100100 and restarted the cluster. Started the balancer tool and I noticed that the it moved about 200GB in 1 hour. (I grepped the balancer log for Need to move). After stopping the balancer I started a distcp. This tool copied 900GB in just 45 minutes, with an average replication of 2 so it's total throughput was around 2.4 TB/hour. Fair enough, it is not purely rebalancing because the 4 overused nodes also get new blocks, still it performs much better. Munin confirms the much higher disk/ethernet throughputs of the distcp. 
Are these characteristics to be expected? Either way, can the balancer be boosted even more? (Aside the dfs.balance.bandwidthPerSec property). Ferdy.
London Hadoop May Meetup: Clusters in the Cloud
Hey, Our May meet-up is on next Thursday 12th at Skills Matter, this time with talks on using Hadoop clusters in the cloud. We've got Matt Wood from Amazon Web Services coming to talk about their Elastic MapReduce service and how you can use it to get on-demand Hadoop clusters. Lars George from Cloudera will also be coming over to give us a talk on the Apache Whirr project and how you can use it to orchestrate Hadoop and other distributed systems on EC2 and other cloud providers. We'll also have pizza and beer again this time, thanks to Cloudera, one of our sponsors (sorry about the lack of them last month). If you would like to come you will need to register here: http://skillsmatter.com/event/cloud-grid/hadoopug-may/js-1766 I'm also starting to organise June's meet-up and still need some talks for the evening, so please let me know if you are interested in giving a talk on anything Hadoop related. Thanks, Dan Harvey