Ok... So riddle me this... I currently have a replication factor of 3. I reset it to two.
What do you have to do to get the replication factor of 3 down to 2? Do I just try to rebalance the nodes?
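For what it's worth, lowering the factor on data that is already in HDFS is normally just a recursive setrep; the NameNode then deletes the excess replicas on its own, so no explicit rebalance is needed for that step. A minimal sketch, assuming you want it cluster-wide (the path is only an example):

    # drop replication to 2 for everything under /, recursively
    hadoop dfs -setrep -R 2 /

    # optional sanity check; fsck keeps reporting over-replicated blocks until the deletions finish
    hadoop fsck / -blocks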
The point is that you are looking at a very small cluster. You may want to start the new cluster with a replication factor of 2 and then, when the data is moved over, increase it to a factor of 3. Or maybe not.

I do a distcp to copy the data, and after each distcp I do an fsck for a sanity check and then remove the files I copied. As I gain more room, I can then slowly drop nodes, do an fsck, rebalance, and repeat. Even though this is a dev cluster, the OP wants to retain the data.

There are other options depending on the amount and size of the new hardware. I mean, make one machine a RAID 5 machine and copy data to it, clearing off the cluster. If 8 TB was the amount of disk used, that would be about 2.67 TB of actual data at a replication factor of 3. Let's say 3 TB. Going RAID 5, how much disk is that? So you could fit it on one machine, depending on hardware, or maybe 2 machines... Now you can rebuild the initial cluster and then move the data back. Then rebuild those machines.

Lots of options... ;-)

Sent from a remote device. Please excuse any typos...

Mike Segel
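On the arithmetic: at a replication factor of 3, 8 TB of raw disk usage is roughly 2.67 TB of unique data, and RAID 5 gives up one disk's worth of capacity to parity, so for example four 1 TB drives would yield about 3 TB usable. The copy-verify-delete loop described above would look roughly like this per directory, assuming both clusters speak the same HDFS version (hostnames and paths below are placeholders, not from the thread):

    # copy one directory from the old cluster to the new one
    hadoop distcp hdfs://old-nn:8020/data/dir1 hdfs://new-nn:8020/data/dir1

    # sanity-check the copy (run with the new cluster's configuration)
    hadoop fsck /data/dir1 -files -blocks

    # then reclaim the space (run with the old cluster's configuration)
    hadoop dfs -rmr /data/dir1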
On May 3, 2012, at 11:26 AM, Suresh Srinivas <sur...@hortonworks.com> wrote:

> This probably is a more relevant question for the CDH mailing lists. That said,
> what Edward is suggesting seems reasonable: reduce the replication factor,
> decommission some of the nodes, create a new cluster with those nodes,
> and do a distcp.
>
> Could you share with us the reasons you want to migrate from Apache 205?
>
> Regards,
> Suresh
>
> On Thu, May 3, 2012 at 8:25 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>> Honestly that is a hassle; going from 205 to cdh3u3 is probably more
>> of a cross-grade than an upgrade or downgrade. I would just stick it
>> out. But yes, like Michael said, two clusters on the same gear and
>> distcp. If you are using RF=3 you could also lower your replication to
>> RF=2 ('hadoop dfs -setrep 2') to clear headroom as you are moving
>> stuff.
>>
>>
>> On Thu, May 3, 2012 at 7:25 AM, Michel Segel <michael_se...@hotmail.com> wrote:
>>> Ok... When you get your new hardware...
>>>
>>> Set up one server as your new NN, JT, SN.
>>> Set up the others as DNs.
>>> (Cloudera CDH3u3)
>>>
>>> On your existing cluster...
>>> Remove your old log files, temp files on HDFS, anything you don't need.
>>> This should give you some more space.
>>> Start copying some of the directories/files to the new cluster.
>>> As you gain space, decommission a node, rebalance, add the node to the new cluster...
>>>
>>> It's a slow process.
>>>
>>> Should I remind you to make sure you up your bandwidth setting, and to
>>> clean up the HDFS directories when you repurpose the nodes?
>>>
>>> Does this make sense?
>>>
>>> Sent from a remote device. Please excuse any typos...
>>>
>>> Mike Segel
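The decommission-and-rebalance step Michel describes looks roughly like this on a 0.20-era cluster; the exclude-file path, hostname, and threshold below are only examples:

    # the file named by dfs.hosts.exclude in hdfs-site.xml; add the node being retired
    echo "dn5.example.com" >> /etc/hadoop/conf/excludes

    # tell the NameNode to start decommissioning it
    hadoop dfsadmin -refreshNodes

    # once 'hadoop dfsadmin -report' shows the node as Decommissioned,
    # spread the remaining blocks back out
    hadoop balancer -threshold 5

The "bandwidth setting" he mentions is presumably dfs.balance.bandwidthPerSec (bytes per second), which caps how fast the balancer moves blocks and defaults to a modest 1 MB/s.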
>>> On May 3, 2012, at 5:46 AM, Austin Chungath <austi...@gmail.com> wrote:
>>>
>>>> Yeah I know :-)
>>>> and this is not a production cluster ;-) and yes there is more hardware
>>>> coming :-)
>>>>
>>>> On Thu, May 3, 2012 at 4:10 PM, Michel Segel <michael_se...@hotmail.com> wrote:
>>>>
>>>>> Well, you've kind of painted yourself into a corner...
>>>>> Not sure why you didn't get a response from the Cloudera lists, but it's a
>>>>> generic question...
>>>>>
>>>>> 8 out of 10 TB. Are you talking effective storage or actual disks?
>>>>> And please tell me you've already ordered more hardware... Right?
>>>>>
>>>>> And please tell me this isn't your production cluster...
>>>>>
>>>>> (Strong hint to Strata and Cloudera... You really want to accept my
>>>>> upcoming proposal talk... ;-)
>>>>>
>>>>> Sent from a remote device. Please excuse any typos...
>>>>>
>>>>> Mike Segel
>>>>>
>>>>> On May 3, 2012, at 5:25 AM, Austin Chungath <austi...@gmail.com> wrote:
>>>>>
>>>>>> Yes. This was first posted on the Cloudera mailing list. There were no
>>>>>> responses.
>>>>>>
>>>>>> But this is not related to Cloudera as such.
>>>>>>
>>>>>> CDH3 uses Apache Hadoop 0.20 as the base. My data is in Apache
>>>>>> Hadoop 0.20.205.
>>>>>>
>>>>>> There is an upgrade namenode option when we are migrating to a higher
>>>>>> version, say from 0.20 to 0.20.205,
>>>>>> but here I am downgrading from 0.20.205 to 0.20 (CDH3).
>>>>>> Is this possible?
>>>>>>
>>>>>>
>>>>>> On Thu, May 3, 2012 at 3:25 PM, Prashant Kommireddi <prash1...@gmail.com> wrote:
>>>>>>
>>>>>>> Seems like a matter of upgrade. I am not a Cloudera user so would not know
>>>>>>> much, but you might find some help moving this to the Cloudera mailing list.
>>>>>>>
>>>>>>> On Thu, May 3, 2012 at 2:51 AM, Austin Chungath <austi...@gmail.com> wrote:
>>>>>>>
>>>>>>>> There is only one cluster. I am not copying between clusters.
>>>>>>>>
>>>>>>>> Say I have a cluster running Apache 0.20.205 with 10 TB storage capacity
>>>>>>>> and about 8 TB of data.
>>>>>>>> Now how can I migrate the same cluster to use CDH3 and use that same 8 TB
>>>>>>>> of data?
>>>>>>>>
>>>>>>>> I can't copy 8 TB of data using distcp because I have only 2 TB of free
>>>>>>>> space.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, May 3, 2012 at 3:12 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> you can actually look at distcp
>>>>>>>>>
>>>>>>>>> http://hadoop.apache.org/common/docs/r0.20.0/distcp.html
>>>>>>>>>
>>>>>>>>> but this means that you have two different sets of clusters available to do
>>>>>>>>> the migration
>>>>>>>>>
>>>>>>>>> On Thu, May 3, 2012 at 12:51 PM, Austin Chungath <austi...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the suggestions,
>>>>>>>>>> My concern is that I can't actually copyToLocal from the DFS because the
>>>>>>>>>> data is huge.
>>>>>>>>>>
>>>>>>>>>> Say if my Hadoop was 0.20 and I am upgrading to 0.20.205, I can do a
>>>>>>>>>> namenode upgrade. I don't have to copy data out of the DFS.
>>>>>>>>>>
>>>>>>>>>> But here I have Apache Hadoop 0.20.205 and I want to use CDH3 now,
>>>>>>>>>> which is based on 0.20.
>>>>>>>>>> Now it is actually a downgrade, as 0.20.205's namenode info has to be used
>>>>>>>>>> by 0.20's namenode.
>>>>>>>>>>
>>>>>>>>>> Any idea how I can achieve what I am trying to do?
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>> On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I can think of the following options:
>>>>>>>>>>>
>>>>>>>>>>> 1) write simple get-and-put code which gets the data from the DFS and loads
>>>>>>>>>>> it into the new DFS
>>>>>>>>>>> 2) see if distcp between the two versions is compatible
>>>>>>>>>>> 3) this is what I had done (and my data was hardly a few hundred GB)...
>>>>>>>>>>> I did a dfs -copyToLocal and then on the new grid did a copyFromLocal
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 3, 2012 at 11:41 AM, Austin Chungath <austi...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> I am migrating from Apache Hadoop 0.20.205 to CDH3u3.
>>>>>>>>>>>> I don't want to lose the data that is in the HDFS of Apache Hadoop
>>>>>>>>>>>> 0.20.205.
>>>>>>>>>>>> How do I migrate to CDH3u3 but keep the data that I have on 0.20.205?
>>>>>>>>>>>> What are the best practices/techniques to do this?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>> Austin
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar
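On the original question of getting the data from 0.20.205 into CDH3u3: when two HDFS versions cannot talk RPC to each other, distcp is normally run on the destination cluster with the source read over HFTP (the source NameNode's HTTP port, 50070 by default). A sketch with placeholder hostnames:

    # run from the CDH3u3 (destination) cluster; hftp is read-only, so the copy
    # can only go from the old cluster to the new one
    hadoop distcp hftp://old-nn.example.com:50070/data hdfs://new-nn.example.com:8020/data

This still needs enough free space on the destination side, so it fits the two-cluster, decommission-as-you-go approach discussed earlier in the thread rather than an in-place conversion.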