Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3
DistCp$DuplicationException: Invalid input, there are duplicated files in the sources: hftp://ub13:50070/tmp/Rtmp1BU9Kb/file6abc6ccb6551/_logs/history, hftp://ub13:50070/tmp/Rtmp3yCJhu/file1ca96d9331/_logs/history

Any idea what the problem is here? They are different files, so how are they conflicting? Thanks, Regards

On Tue, May 8, 2012 at 11:52 PM, Adam Faris afa...@linkedin.com wrote: Hi Austin, I'm glad that helped out. Regarding the -p flag for distcp, here's the online documentation: http://hadoop.apache.org/common/docs/current/distcp.html#Option+Index You can also get this info from running 'hadoop distcp' without any flags.

-p[rbugp]  Preserve
           r: replication number
           b: block size
           u: user
           g: group
           p: permission

-- Adam

On May 7, 2012, at 10:55 PM, Austin Chungath wrote: Thanks Adam, that was very helpful. Your second point solved my problems :-) The hdfs port number was wrong. I didn't use the option -ppgu; what does it do?

On Mon, May 7, 2012 at 8:07 PM, Adam Faris afa...@linkedin.com wrote: Hi Austin, I don't know about using CDH3, but we use distcp for moving data between different versions of apache grids, and several things come to mind.

1) You should use the -i flag to ignore checksum differences on the blocks. I'm not 100% sure, but I want to say hftp doesn't support checksums on the blocks as they go across the wire.
2) You should read from hftp but write to hdfs. Also make sure to check your port numbers. For example, I can read from hftp on port 50070 and write to hdfs on port 9000. You'll find the hftp port in hdfs-site.xml and the hdfs port in core-site.xml on apache releases.
3) Do you have security (Kerberos) enabled on 0.20.205? Does CDH3 support security? If security is enabled on 0.20.205 and CDH3 does not support security, you will need to disable security on 0.20.205, because you are unable to write from a secure to an unsecured grid.
4) Use the -m flag to limit your mappers so you don't DDOS your network backbone.
5) Why isn't your vendor helping you with the data migration? :)

Otherwise something like this should get you going:

hadoop distcp -i -ppgu -log /tmp/mylog -m 20 hftp://mynamenode.grid.one:50070/path/to/my/src/data hdfs://mynamenode.grid.two:9000/path/to/my/dst

-- Adam

On May 7, 2012, at 4:29 AM, Nitin Pawar wrote: Things to check:
1) when you launch distcp jobs, all the datanodes of the older hdfs are live and connected
2) when you launch distcp, no data is being written/moved/deleted in hdfs
3) you can use the option -log to log errors into a directory and use -i to ignore errors

Also you can try using distcp with the hdfs protocol instead of hftp ... for more you can refer to https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/d0d99ad9f1554edd If it failed there should be some error.

On Mon, May 7, 2012 at 4:44 PM, Austin Chungath austi...@gmail.com wrote: ok that was a lame mistake.

$ hadoop distcp hftp://localhost:50070/tmp hftp://localhost:60070/tmp_copy

I had spelled hdfs instead of hftp.

$ hadoop distcp hftp://localhost:50070/docs/index.html hftp://localhost:60070/user/hadoop
12/05/07 16:38:09 INFO tools.DistCp: srcPaths=[hftp://localhost:50070/docs/index.html]
12/05/07 16:38:09 INFO tools.DistCp: destPath=hftp://localhost:60070/user/hadoop
With failures, global counters are inaccurate; consider running with -i
Copy failed: java.io.IOException: Not supported
        at org.apache.hadoop.hdfs.HftpFileSystem.delete(HftpFileSystem.java:457)
        at org.apache.hadoop.tools.DistCp.fullyDelete(DistCp.java:963)
        at org.apache.hadoop.tools.DistCp.copy(DistCp.java:672)
        at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)

Any idea why this error is coming? I am copying one file from 0.20.205 (/docs/index.html) to cdh3u3 (/user/hadoop). Thanks, Regards, Austin

On Mon, May 7, 2012 at 3:57 PM, Austin Chungath austi...@gmail.com wrote: Thanks, so I decided to try and move using distcp.

$ hadoop distcp hdfs://localhost:54310/tmp hdfs://localhost:8021/tmp_copy
12/05/07 14:57:38 INFO tools.DistCp: srcPaths=[hdfs://localhost:54310/tmp]
12/05/07 14:57:38 INFO tools.DistCp: destPath=hdfs://localhost:8021/tmp_copy
With failures, global counters are inaccurate; consider running with -i
Copy failed: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch. (client = 63, server = 61)

I found that we can do distcp like the above only if both are of the same hadoop version, so I tried
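By way of illustration (not a command from this thread), a distcp invocation along the lines Adam describes might look like the sketch below. The hostnames, ports and paths are placeholders, and listing only the directories you actually need keeps transient /tmp job directories out of the source list, which may avoid complaints like the DuplicationException quoted above.

# Illustrative sketch only: read over hftp from the 0.20.205 namenode, write to hdfs on the
# CDH3u3 namenode. -i ignores checksum mismatches, -ppgu preserves permission/group/user,
# -m caps the number of copy mappers. oldnamenode/newnamenode and the paths are placeholders.
hadoop distcp -i -ppgu -log /tmp/distcp_logs -m 20 \
    hftp://oldnamenode:50070/user \
    hftp://oldnamenode:50070/data \
    hdfs://newnamenode:9000/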
Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3
Thanks, so I decided to try and move using distcp.

$ hadoop distcp hdfs://localhost:54310/tmp hdfs://localhost:8021/tmp_copy
12/05/07 14:57:38 INFO tools.DistCp: srcPaths=[hdfs://localhost:54310/tmp]
12/05/07 14:57:38 INFO tools.DistCp: destPath=hdfs://localhost:8021/tmp_copy
With failures, global counters are inaccurate; consider running with -i
Copy failed: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch. (client = 63, server = 61)

I found that we can do distcp like the above only if both are of the same hadoop version, so I tried:

$ hadoop distcp hftp://localhost:50070/tmp hdfs://localhost:60070/tmp_copy
12/05/07 15:02:44 INFO tools.DistCp: srcPaths=[hftp://localhost:50070/tmp]
12/05/07 15:02:44 INFO tools.DistCp: destPath=hdfs://localhost:60070/tmp_copy

But this process seems to hang at this stage. What might I be doing wrong?

hftp://dfs.http.address/path
hftp://localhost:50070 is the dfs.http.address of 0.20.205
hdfs://localhost:60070 is the dfs.http.address of cdh3u3

Thanks and regards, Austin

On Fri, May 4, 2012 at 4:30 AM, Michel Segel michael_se...@hotmail.com wrote: Ok... So riddle me this... I currently have a replication factor of 3. I reset it to two. What do you have to do to get the replication factor of 3 down to 2? Do I just try to rebalance the nodes?

The point is that you are looking at a very small cluster. You may want to start the new cluster with a replication factor of 2 and then, when the data is moved over, increase it to a factor of 3. Or maybe not. I do a distcp to copy the data, and after each distcp I do an fsck for a sanity check and then remove the files I copied. As I gain more room, I can then slowly drop nodes, do an fsck, rebalance and then repeat. Even though this is a dev cluster, the OP wants to retain the data.

There are other options depending on the amount and size of new hardware. I mean, make one machine a RAID 5 machine and copy data to it, clearing off the cluster. If 8 TB was the amount of disk used, that would be about 2.7 TB of actual data (8 TB divided by a replication factor of 3); let's say 3 TB. Going RAID 5, how much disk is that? So you could fit it on one machine, depending on hardware, or maybe 2 machines... Now you can rebuild the initial cluster and then move the data back. Then rebuild those machines. Lots of options... ;-) Sent from a remote device. Please excuse any typos... Mike Segel

On May 3, 2012, at 11:26 AM, Suresh Srinivas sur...@hortonworks.com wrote: This probably is a more relevant question for the CDH mailing lists. That said, what Edward is suggesting seems reasonable: reduce the replication factor, decommission some of the nodes, create a new cluster with those nodes and do a distcp. Could you share with us the reasons you want to migrate from Apache 205? Regards, Suresh

On Thu, May 3, 2012 at 8:25 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Honestly that is a hassle; going from 205 to cdh3u3 is probably more of a cross-grade than an upgrade or downgrade. I would just stick it out. But yes, like Michael said, two clusters on the same gear and distcp. If you are using RF=3 you could also lower your replication to RF=2 ('hadoop dfs -setrep 2') to clear headroom as you are moving stuff.

On Thu, May 3, 2012 at 7:25 AM, Michel Segel michael_se...@hotmail.com wrote: Ok... When you get your new hardware... Set up one server as your new NN, JT, SN. Set up the others as DNs. (Cloudera CDH3u3) On your existing cluster... remove your old log files, temp files on HDFS, anything you don't need. This should give you some more space.

Start copying some of the directories/files to the new cluster. As you gain space, decommission a node, rebalance, add the node to the new cluster... It's a slow process. Should I remind you to make sure you up your bandwidth setting, and to clean up the hdfs directories when you repurpose the nodes? Does this make sense? Sent from a remote device. Please excuse any typos... Mike Segel

On May 3, 2012, at 5:46 AM, Austin Chungath austi...@gmail.com wrote: Yeah I know :-) and this is not a production cluster ;-) and yes there is more hardware coming :-)

On Thu, May 3, 2012 at 4:10 PM, Michel Segel michael_se...@hotmail.com wrote: Well, you've kind of painted yourself into a corner... Not sure why you didn't get a response from the Cloudera lists, but it's a generic question... 8 out of 10 TB. Are you talking effective storage or actual disks? And please tell me you've already ordered more hardware... Right? And please tell me this isn't your production cluster... (Strong hint to Strata and Cloudera... You really want to accept my upcoming proposal talk... ;-) Sent from a remote device. Please excuse any typos... Mike Segel

On May 3, 2012, at 5:25 AM, Austin Chungath austi...@gmail.com wrote: Yes. This was first posted on the cloudera mailing
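A rough sketch of the shrink-copy-verify loop Michel and Edward describe above, for reference only; every hostname and path is a placeholder, and whether it is safe to drop replication depends on the cluster.

# Illustrative sketch only; oldnn, newnn and /somedir are placeholders.
hadoop fs -setrep -R 2 /                             # on the old cluster: free headroom by dropping RF 3 -> 2
hadoop distcp hftp://oldnn:50070/somedir hdfs://newnn:9000/somedir
hadoop fsck /somedir -files -blocks                  # on the new cluster: sanity-check the copy
hadoop fs -rmr /somedir                              # back on the old cluster: remove what was copied
# ...then decommission a freed-up datanode, add it to the new cluster, rebalance, and repeat.
hadoop balancer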
Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3
ok that was a lame mistake.

$ hadoop distcp hftp://localhost:50070/tmp hftp://localhost:60070/tmp_copy

I had spelled hdfs instead of hftp.

$ hadoop distcp hftp://localhost:50070/docs/index.html hftp://localhost:60070/user/hadoop
12/05/07 16:38:09 INFO tools.DistCp: srcPaths=[hftp://localhost:50070/docs/index.html]
12/05/07 16:38:09 INFO tools.DistCp: destPath=hftp://localhost:60070/user/hadoop
With failures, global counters are inaccurate; consider running with -i
Copy failed: java.io.IOException: Not supported
        at org.apache.hadoop.hdfs.HftpFileSystem.delete(HftpFileSystem.java:457)
        at org.apache.hadoop.tools.DistCp.fullyDelete(DistCp.java:963)
        at org.apache.hadoop.tools.DistCp.copy(DistCp.java:672)
        at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)

Any idea why this error is coming? I am copying one file from 0.20.205 (/docs/index.html) to cdh3u3 (/user/hadoop). Thanks, Regards, Austin

On Mon, May 7, 2012 at 3:57 PM, Austin Chungath austi...@gmail.com wrote: Thanks, so I decided to try and move using distcp.

$ hadoop distcp hdfs://localhost:54310/tmp hdfs://localhost:8021/tmp_copy
12/05/07 14:57:38 INFO tools.DistCp: srcPaths=[hdfs://localhost:54310/tmp]
12/05/07 14:57:38 INFO tools.DistCp: destPath=hdfs://localhost:8021/tmp_copy
With failures, global counters are inaccurate; consider running with -i
Copy failed: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch. (client = 63, server = 61)

I found that we can do distcp like the above only if both are of the same hadoop version, so I tried:

$ hadoop distcp hftp://localhost:50070/tmp hdfs://localhost:60070/tmp_copy
12/05/07 15:02:44 INFO tools.DistCp: srcPaths=[hftp://localhost:50070/tmp]
12/05/07 15:02:44 INFO tools.DistCp: destPath=hdfs://localhost:60070/tmp_copy

But this process seems to hang at this stage. What might I be doing wrong?

hftp://dfs.http.address/path
hftp://localhost:50070 is the dfs.http.address of 0.20.205
hdfs://localhost:60070 is the dfs.http.address of cdh3u3

Thanks and regards, Austin

On Fri, May 4, 2012 at 4:30 AM, Michel Segel michael_se...@hotmail.com wrote: Ok... So riddle me this... I currently have a replication factor of 3. I reset it to two. What do you have to do to get the replication factor of 3 down to 2? Do I just try to rebalance the nodes?

The point is that you are looking at a very small cluster. You may want to start the new cluster with a replication factor of 2 and then, when the data is moved over, increase it to a factor of 3. Or maybe not. I do a distcp to copy the data, and after each distcp I do an fsck for a sanity check and then remove the files I copied. As I gain more room, I can then slowly drop nodes, do an fsck, rebalance and then repeat. Even though this is a dev cluster, the OP wants to retain the data.

There are other options depending on the amount and size of new hardware. I mean, make one machine a RAID 5 machine and copy data to it, clearing off the cluster. If 8 TB was the amount of disk used, that would be about 2.7 TB of actual data (8 TB divided by a replication factor of 3); let's say 3 TB. Going RAID 5, how much disk is that? So you could fit it on one machine, depending on hardware, or maybe 2 machines... Now you can rebuild the initial cluster and then move the data back. Then rebuild those machines. Lots of options... ;-) Sent from a remote device. Please excuse any typos... Mike Segel

On May 3, 2012, at 11:26 AM, Suresh Srinivas sur...@hortonworks.com wrote: This probably is a more relevant question for the CDH mailing lists. That said, what Edward is suggesting seems reasonable: reduce the replication factor, decommission some of the nodes, create a new cluster with those nodes and do a distcp. Could you share with us the reasons you want to migrate from Apache 205? Regards, Suresh

On Thu, May 3, 2012 at 8:25 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Honestly that is a hassle; going from 205 to cdh3u3 is probably more of a cross-grade than an upgrade or downgrade. I would just stick it out. But yes, like Michael said, two clusters on the same gear and distcp. If you are using RF=3 you could also lower your replication to RF=2 ('hadoop dfs -setrep 2') to clear headroom as you are moving stuff.

On Thu, May 3, 2012 at 7:25 AM, Michel Segel michael_se...@hotmail.com wrote: Ok... When you get your new hardware... Set up one server as your new NN, JT, SN. Set up the others as DNs. (Cloudera CDH3u3) On your existing cluster... remove your old log files, temp files on HDFS, anything you don't need. This should give you some more space. Start copying some of the directories/files to the new cluster. As you gain space, decommission a node, rebalance, add the node to the new cluster... It's
Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3
Thanks Adam, that was very helpful. Your second point solved my problems :-) The hdfs port number was wrong. I didn't use the option -ppgu; what does it do?

On Mon, May 7, 2012 at 8:07 PM, Adam Faris afa...@linkedin.com wrote: Hi Austin, I don't know about using CDH3, but we use distcp for moving data between different versions of apache grids, and several things come to mind.

1) You should use the -i flag to ignore checksum differences on the blocks. I'm not 100% sure, but I want to say hftp doesn't support checksums on the blocks as they go across the wire.
2) You should read from hftp but write to hdfs. Also make sure to check your port numbers. For example, I can read from hftp on port 50070 and write to hdfs on port 9000. You'll find the hftp port in hdfs-site.xml and the hdfs port in core-site.xml on apache releases.
3) Do you have security (Kerberos) enabled on 0.20.205? Does CDH3 support security? If security is enabled on 0.20.205 and CDH3 does not support security, you will need to disable security on 0.20.205, because you are unable to write from a secure to an unsecured grid.
4) Use the -m flag to limit your mappers so you don't DDOS your network backbone.
5) Why isn't your vendor helping you with the data migration? :)

Otherwise something like this should get you going:

hadoop distcp -i -ppgu -log /tmp/mylog -m 20 hftp://mynamenode.grid.one:50070/path/to/my/src/data hdfs://mynamenode.grid.two:9000/path/to/my/dst

-- Adam

On May 7, 2012, at 4:29 AM, Nitin Pawar wrote: Things to check:
1) when you launch distcp jobs, all the datanodes of the older hdfs are live and connected
2) when you launch distcp, no data is being written/moved/deleted in hdfs
3) you can use the option -log to log errors into a directory and use -i to ignore errors

Also you can try using distcp with the hdfs protocol instead of hftp ... for more you can refer to https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/d0d99ad9f1554edd If it failed there should be some error.

On Mon, May 7, 2012 at 4:44 PM, Austin Chungath austi...@gmail.com wrote: ok that was a lame mistake.

$ hadoop distcp hftp://localhost:50070/tmp hftp://localhost:60070/tmp_copy

I had spelled hdfs instead of hftp.

$ hadoop distcp hftp://localhost:50070/docs/index.html hftp://localhost:60070/user/hadoop
12/05/07 16:38:09 INFO tools.DistCp: srcPaths=[hftp://localhost:50070/docs/index.html]
12/05/07 16:38:09 INFO tools.DistCp: destPath=hftp://localhost:60070/user/hadoop
With failures, global counters are inaccurate; consider running with -i
Copy failed: java.io.IOException: Not supported
        at org.apache.hadoop.hdfs.HftpFileSystem.delete(HftpFileSystem.java:457)
        at org.apache.hadoop.tools.DistCp.fullyDelete(DistCp.java:963)
        at org.apache.hadoop.tools.DistCp.copy(DistCp.java:672)
        at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)

Any idea why this error is coming? I am copying one file from 0.20.205 (/docs/index.html) to cdh3u3 (/user/hadoop). Thanks, Regards, Austin

On Mon, May 7, 2012 at 3:57 PM, Austin Chungath austi...@gmail.com wrote: Thanks, so I decided to try and move using distcp.

$ hadoop distcp hdfs://localhost:54310/tmp hdfs://localhost:8021/tmp_copy
12/05/07 14:57:38 INFO tools.DistCp: srcPaths=[hdfs://localhost:54310/tmp]
12/05/07 14:57:38 INFO tools.DistCp: destPath=hdfs://localhost:8021/tmp_copy
With failures, global counters are inaccurate; consider running with -i
Copy failed: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch. (client = 63, server = 61)

I found that we can do distcp like the above only if both are of the same hadoop version, so I tried:

$ hadoop distcp hftp://localhost:50070/tmp hdfs://localhost:60070/tmp_copy
12/05/07 15:02:44 INFO tools.DistCp: srcPaths=[hftp://localhost:50070/tmp]
12/05/07 15:02:44 INFO tools.DistCp: destPath=hdfs://localhost:60070/tmp_copy

But this process seemed to hang at this stage. What might I be doing wrong?

hftp://dfs.http.address/path
hftp://localhost:50070 is the dfs.http.address of 0.20.205
hdfs://localhost:60070 is the dfs.http.address of cdh3u3

Thanks and regards, Austin

On Fri, May 4, 2012 at 4:30 AM, Michel Segel michael_se...@hotmail.com wrote: Ok... So riddle me this... I currently have a replication factor of 3. I reset it to two. What do you have to do to get the replication factor of 3 down to 2? Do I just try to rebalance the nodes? The point is that you are looking at a very small cluster. You may want to start the new cluster with a replication factor of 2 and then, when the data is moved over, increase it to a factor
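Since the wrong-port mixup comes up twice in this thread, here is a hedged illustration of how one might double-check which port belongs to which scheme before running distcp. The config file locations are typical for Apache 0.20.x and CDH3 installs, but may differ on a given machine.

# Illustrative only. On the source (0.20.205) side, hftp:// uses the namenode HTTP port:
grep -A 1 dfs.http.address $HADOOP_CONF_DIR/hdfs-site.xml       # e.g. 0.0.0.0:50070
# On the destination (CDH3u3) side, hdfs:// uses the namenode RPC port from fs.default.name:
grep -A 1 fs.default.name /etc/hadoop/conf/core-site.xml        # e.g. hdfs://newnn:8020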
Best practice to migrate HDFS from 0.20.205 to CDH3u3
Hi, I am migrating from Apache hadoop 0.20.205 to CDH3u3. I don't want to lose the data that is in the HDFS of Apache hadoop 0.20.205. How do I migrate to CDH3u3 but keep the data that I have on 0.20.205? What are the best practices/techniques to do this? Thanks, Regards, Austin
Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3
Thanks for the suggestions. My concern is that I can't actually copyToLocal from the dfs because the data is huge. Say my hadoop was 0.20 and I am upgrading to 0.20.205: I can do a namenode upgrade and don't have to copy data out of dfs. But here I have Apache hadoop 0.20.205 and I want to use CDH3, which is based on 0.20. Now it is actually a downgrade, as 0.20.205's namenode info has to be used by 0.20's namenode. Any idea how I can achieve what I am trying to do? Thanks.

On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar nitinpawar...@gmail.com wrote: I can think of the following options:
1) write simple get and put code which gets the data from the old DFS and loads it into the new dfs
2) see if distcp between both versions is compatible
3) this is what I had done (and my data was hardly a few hundred GB): did a dfs -copyToLocal and then in the new grid did a copyFromLocal

On Thu, May 3, 2012 at 11:41 AM, Austin Chungath austi...@gmail.com wrote: Hi, I am migrating from Apache hadoop 0.20.205 to CDH3u3. I don't want to lose the data that is in the HDFS of Apache hadoop 0.20.205. How do I migrate to CDH3u3 but keep the data that I have on 0.20.205? What are the best practices/techniques to do this? Thanks, Regards, Austin

-- Nitin Pawar
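For completeness, Nitin's option 3 is essentially the following two commands. This is only a sketch: it is practical only when the data fits on local or staging disk, and all paths are placeholders.

# Illustrative only; assumes /mnt/staging has enough space for the data.
hadoop fs -copyToLocal /user/mydata /mnt/staging/mydata      # run against the 0.20.205 cluster
hadoop fs -copyFromLocal /mnt/staging/mydata /user/mydata    # run against the new CDH3u3 cluster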
Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3
There is only one cluster; I am not copying between clusters. Say I have a cluster running apache 0.20.205 with 10 TB storage capacity that has about 8 TB of data. Now how can I migrate the same cluster to use cdh3 and use that same 8 TB of data? I can't copy 8 TB of data using distcp because I have only 2 TB of free space.

On Thu, May 3, 2012 at 3:12 PM, Nitin Pawar nitinpawar...@gmail.com wrote: You can actually look at distcp: http://hadoop.apache.org/common/docs/r0.20.0/distcp.html But this means that you have two different sets of clusters available to do the migration.

On Thu, May 3, 2012 at 12:51 PM, Austin Chungath austi...@gmail.com wrote: Thanks for the suggestions. My concern is that I can't actually copyToLocal from the dfs because the data is huge. Say my hadoop was 0.20 and I am upgrading to 0.20.205: I can do a namenode upgrade and don't have to copy data out of dfs. But here I have Apache hadoop 0.20.205 and I want to use CDH3, which is based on 0.20. Now it is actually a downgrade, as 0.20.205's namenode info has to be used by 0.20's namenode. Any idea how I can achieve what I am trying to do? Thanks.

On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar nitinpawar...@gmail.com wrote: I can think of the following options:
1) write simple get and put code which gets the data from the old DFS and loads it into the new dfs
2) see if distcp between both versions is compatible
3) this is what I had done (and my data was hardly a few hundred GB): did a dfs -copyToLocal and then in the new grid did a copyFromLocal

On Thu, May 3, 2012 at 11:41 AM, Austin Chungath austi...@gmail.com wrote: Hi, I am migrating from Apache hadoop 0.20.205 to CDH3u3. I don't want to lose the data that is in the HDFS of Apache hadoop 0.20.205. How do I migrate to CDH3u3 but keep the data that I have on 0.20.205? What are the best practices/techniques to do this? Thanks, Regards, Austin

-- Nitin Pawar
-- Nitin Pawar
Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3
Yes. This was first posted on the cloudera mailing list. There were no responses, but this is not related to cloudera as such; cdh3 uses apache hadoop 0.20 as its base. My data is in apache hadoop 0.20.205. There is an upgrade namenode option when we are migrating to a higher version, say from 0.20 to 0.20.205, but here I am downgrading from 0.20.205 to 0.20 (cdh3). Is this possible?

On Thu, May 3, 2012 at 3:25 PM, Prashant Kommireddi prash1...@gmail.com wrote: Seems like a matter of upgrade. I am not a Cloudera user so would not know much, but you might find some help moving this to the Cloudera mailing list.

On Thu, May 3, 2012 at 2:51 AM, Austin Chungath austi...@gmail.com wrote: There is only one cluster; I am not copying between clusters. Say I have a cluster running apache 0.20.205 with 10 TB storage capacity that has about 8 TB of data. Now how can I migrate the same cluster to use cdh3 and use that same 8 TB of data? I can't copy 8 TB of data using distcp because I have only 2 TB of free space.

On Thu, May 3, 2012 at 3:12 PM, Nitin Pawar nitinpawar...@gmail.com wrote: You can actually look at distcp: http://hadoop.apache.org/common/docs/r0.20.0/distcp.html But this means that you have two different sets of clusters available to do the migration.

On Thu, May 3, 2012 at 12:51 PM, Austin Chungath austi...@gmail.com wrote: Thanks for the suggestions. My concern is that I can't actually copyToLocal from the dfs because the data is huge. Say my hadoop was 0.20 and I am upgrading to 0.20.205: I can do a namenode upgrade and don't have to copy data out of dfs. But here I have Apache hadoop 0.20.205 and I want to use CDH3, which is based on 0.20. Now it is actually a downgrade, as 0.20.205's namenode info has to be used by 0.20's namenode. Any idea how I can achieve what I am trying to do? Thanks.

On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar nitinpawar...@gmail.com wrote: I can think of the following options:
1) write simple get and put code which gets the data from the old DFS and loads it into the new dfs
2) see if distcp between both versions is compatible
3) this is what I had done (and my data was hardly a few hundred GB): did a dfs -copyToLocal and then in the new grid did a copyFromLocal

On Thu, May 3, 2012 at 11:41 AM, Austin Chungath austi...@gmail.com wrote: Hi, I am migrating from Apache hadoop 0.20.205 to CDH3u3. I don't want to lose the data that is in the HDFS of Apache hadoop 0.20.205. How do I migrate to CDH3u3 but keep the data that I have on 0.20.205? What are the best practices/techniques to do this? Thanks, Regards, Austin

-- Nitin Pawar
-- Nitin Pawar
Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3
Yeah I know :-) and this is not a production cluster ;-) and yes there is more hardware coming :-)

On Thu, May 3, 2012 at 4:10 PM, Michel Segel michael_se...@hotmail.com wrote: Well, you've kind of painted yourself into a corner... Not sure why you didn't get a response from the Cloudera lists, but it's a generic question... 8 out of 10 TB. Are you talking effective storage or actual disks? And please tell me you've already ordered more hardware... Right? And please tell me this isn't your production cluster... (Strong hint to Strata and Cloudera... You really want to accept my upcoming proposal talk... ;-) Sent from a remote device. Please excuse any typos... Mike Segel

On May 3, 2012, at 5:25 AM, Austin Chungath austi...@gmail.com wrote: Yes. This was first posted on the cloudera mailing list. There were no responses, but this is not related to cloudera as such; cdh3 uses apache hadoop 0.20 as its base. My data is in apache hadoop 0.20.205. There is an upgrade namenode option when we are migrating to a higher version, say from 0.20 to 0.20.205, but here I am downgrading from 0.20.205 to 0.20 (cdh3). Is this possible?

On Thu, May 3, 2012 at 3:25 PM, Prashant Kommireddi prash1...@gmail.com wrote: Seems like a matter of upgrade. I am not a Cloudera user so would not know much, but you might find some help moving this to the Cloudera mailing list.

On Thu, May 3, 2012 at 2:51 AM, Austin Chungath austi...@gmail.com wrote: There is only one cluster; I am not copying between clusters. Say I have a cluster running apache 0.20.205 with 10 TB storage capacity that has about 8 TB of data. Now how can I migrate the same cluster to use cdh3 and use that same 8 TB of data? I can't copy 8 TB of data using distcp because I have only 2 TB of free space.

On Thu, May 3, 2012 at 3:12 PM, Nitin Pawar nitinpawar...@gmail.com wrote: You can actually look at distcp: http://hadoop.apache.org/common/docs/r0.20.0/distcp.html But this means that you have two different sets of clusters available to do the migration.

On Thu, May 3, 2012 at 12:51 PM, Austin Chungath austi...@gmail.com wrote: Thanks for the suggestions. My concern is that I can't actually copyToLocal from the dfs because the data is huge. Say my hadoop was 0.20 and I am upgrading to 0.20.205: I can do a namenode upgrade and don't have to copy data out of dfs. But here I have Apache hadoop 0.20.205 and I want to use CDH3, which is based on 0.20. Now it is actually a downgrade, as 0.20.205's namenode info has to be used by 0.20's namenode. Any idea how I can achieve what I am trying to do? Thanks.

On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar nitinpawar...@gmail.com wrote: I can think of the following options:
1) write simple get and put code which gets the data from the old DFS and loads it into the new dfs
2) see if distcp between both versions is compatible
3) this is what I had done (and my data was hardly a few hundred GB): did a dfs -copyToLocal and then in the new grid did a copyFromLocal

On Thu, May 3, 2012 at 11:41 AM, Austin Chungath austi...@gmail.com wrote: Hi, I am migrating from Apache hadoop 0.20.205 to CDH3u3. I don't want to lose the data that is in the HDFS of Apache hadoop 0.20.205. How do I migrate to CDH3u3 but keep the data that I have on 0.20.205? What are the best practices/techniques to do this? Thanks, Regards, Austin

-- Nitin Pawar
-- Nitin Pawar
how to add more than one user to hadoop with DFS permissions?
I have a 2-node cluster running hadoop 0.20.205. There is only one user, username: hadoop, of group: hadoop. What is the easiest way to add one more user, say hadoop1, with DFS permissions set to true? I did the following to create a user on the master node:

sudo adduser --ingroup hadoop hadoop1

My aim is to have hadoop run in such a way that each user's input and output data is accessible only to the owner (chmod 700). I have played around with the configuration properties for some time now but to no end. It would be great if someone could tell me which configuration file properties I should change to achieve this. Thanks, Austin
Re: how to add more than one user to hadoop with DFS permissions?
Thanks Harsh :)

On Sat, Mar 10, 2012 at 10:12 PM, Harsh J ha...@cloudera.com wrote: Austin,

1. Enable HDFS permissions. In hdfs-site.xml, set dfs.permissions to true.
2. To commission any new user, as the HDFS admin (the user who runs the NameNode process), run:
   hadoop fs -mkdir /user/username
   hadoop fs -chown username:username /user/username
3. For default file/dir permissions to be 700, tweak the dfs.umaskmode property.

Much of this is also documented in the permissions guide: http://hadoop.apache.org/common/docs/r0.20.2/hdfs_permissions_guide.html

On Sat, Mar 10, 2012 at 9:59 PM, Austin Chungath austi...@gmail.com wrote: I have a 2-node cluster running hadoop 0.20.205. There is only one user, username: hadoop, of group: hadoop. What is the easiest way to add one more user, say hadoop1, with DFS permissions set to true? I did the following to create a user on the master node: sudo adduser --ingroup hadoop hadoop1 My aim is to have hadoop run in such a way that each user's input and output data is accessible only to the owner (chmod 700). I have played around with the configuration properties for some time now but to no end. It would be great if someone could tell me which configuration file properties I should change to achieve this. Thanks, Austin -- Harsh J
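Putting Harsh's steps together for the hadoop1 user from the original question, the full sequence might look like the sketch below. The 077 umask value and the hadoop group assignment are illustrative assumptions, not details from the thread.

# Illustrative sketch; run the hadoop fs commands as the HDFS admin user.
sudo adduser --ingroup hadoop hadoop1          # OS account on the cluster nodes
hadoop fs -mkdir /user/hadoop1
hadoop fs -chown hadoop1:hadoop /user/hadoop1
hadoop fs -chmod 700 /user/hadoop1             # home directory readable only by its owner
# In hdfs-site.xml: dfs.permissions = true, and e.g. dfs.umaskmode = 077 so new files default to owner-only.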
fairscheduler : group.name | Please edit patch to work for 0.20.205
Can someone have a look at the patch in MAPREDUCE-2457 and see if it can be modified to work for 0.20.205? I am very new to java and have no idea what's going on in that patch. If you have any pointers for me, I will see if I can do it on my own. Thanks, Austin

On Fri, Mar 2, 2012 at 7:15 PM, Austin Chungath austi...@gmail.com wrote: I tried the patch from MAPREDUCE-2457 but it didn't work for my hadoop 0.20.205. Are you sure this patch will work for 0.20.205? According to the description, the patch works for 0.21 and 0.22, and it says that 0.20 supports group.name without this patch... So does this patch also apply to 0.20.205? Thanks, Austin

On Thu, Mar 1, 2012 at 11:24 PM, Harsh J ha...@cloudera.com wrote: The group.name scheduler support was introduced in https://issues.apache.org/jira/browse/HADOOP-3892 but may have been broken by the security changes present in 0.20.205. You'll need the fix presented in https://issues.apache.org/jira/browse/MAPREDUCE-2457 to have group.name support.

On Thu, Mar 1, 2012 at 6:42 PM, Austin Chungath austi...@gmail.com wrote: I am running the fair scheduler on hadoop 0.20.205.0: http://hadoop.apache.org/common/docs/r0.20.205.0/fair_scheduler.html The above page talks about the property mapred.fairscheduler.poolnameproperty, which I can set to group.name. The default is user.name, and when a user submits a job the fair scheduler assigns each user's job to a pool which has the name of the user. I am trying to change it to group.name so that the job is submitted to a pool which has the name of the user's linux group. Thus all jobs from any user in a specific group go to the same pool instead of an individual pool for every user. But group.name doesn't seem to work; has anyone tried this before? user.name and mapred.job.queue.name work. Is group.name supported in 0.20.205.0? I don't see it mentioned in the docs. Thanks, Austin -- Harsh J
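As an aside, checking whether the JIRA patch even applies to a 0.20.205 source tree is straightforward. The filename below stands in for whatever attachment the JIRA issue carries, and the build step is only a sketch (0.20.x built with ant, but the exact targets depend on the checkout).

# Illustrative only; MAPREDUCE-2457.patch stands in for the actual JIRA attachment.
cd hadoop-0.20.205.0
patch -p0 --dry-run < MAPREDUCE-2457.patch    # see which hunks fail against this tree
patch -p0 < MAPREDUCE-2457.patch              # apply if the dry run looks sane
ant                                           # rebuild (contrib/fairscheduler targets may differ)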
Re: fairscheduler : group.name doesn't work, please help
I tried the patch from MAPREDUCE-2457 but it didn't work for my hadoop 0.20.205. Are you sure this patch will work for 0.20.205? According to the description, the patch works for 0.21 and 0.22, and it says that 0.20 supports group.name without this patch... So does this patch also apply to 0.20.205? Thanks, Austin

On Thu, Mar 1, 2012 at 11:24 PM, Harsh J ha...@cloudera.com wrote: The group.name scheduler support was introduced in https://issues.apache.org/jira/browse/HADOOP-3892 but may have been broken by the security changes present in 0.20.205. You'll need the fix presented in https://issues.apache.org/jira/browse/MAPREDUCE-2457 to have group.name support.

On Thu, Mar 1, 2012 at 6:42 PM, Austin Chungath austi...@gmail.com wrote: I am running the fair scheduler on hadoop 0.20.205.0: http://hadoop.apache.org/common/docs/r0.20.205.0/fair_scheduler.html The above page talks about the property mapred.fairscheduler.poolnameproperty, which I can set to group.name. The default is user.name, and when a user submits a job the fair scheduler assigns each user's job to a pool which has the name of the user. I am trying to change it to group.name so that the job is submitted to a pool which has the name of the user's linux group. Thus all jobs from any user in a specific group go to the same pool instead of an individual pool for every user. But group.name doesn't seem to work; has anyone tried this before? user.name and mapred.job.queue.name work. Is group.name supported in 0.20.205.0? I don't see it mentioned in the docs. Thanks, Austin -- Harsh J
Re: Hadoop fair scheduler doubt: allocate jobs to pool
Thanks, I will be trying the suggestions and will get back to you soon.

On Thu, Mar 1, 2012 at 8:09 PM, Dave Shine dave.sh...@channelintelligence.com wrote: I've just started playing with the Fair Scheduler. To specify the pool at job submission time, you set the mapred.fairscheduler.pool property on the JobConf to the name of the pool you want the job to use. Dave

-----Original Message-----
From: Merto Mertek [mailto:masmer...@gmail.com]
Sent: Thursday, March 01, 2012 9:33 AM
To: common-user@hadoop.apache.org
Subject: Re: Hadoop fair scheduler doubt: allocate jobs to pool

From the fairscheduler docs I assume the following should work:

<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>pool.name</value>
</property>
<property>
  <name>pool.name</name>
  <value>${mapreduce.job.group.name}</value>
</property>

which means that the default pool will be the group of the user that has submitted the job. In your case I think that allocations.xml is correct. If you want to explicitly assign a job to a specific pool from your allocations.xml file, you can define it as follows:

Configuration conf3 = conf;
conf3.set("pool.name", "pool3");   // conf.set("property.name", "value")

Let me know if it works..

On 29 February 2012 14:18, Austin Chungath austi...@gmail.com wrote: How can I set the fair scheduler such that all jobs submitted from a particular user group go to a pool with the group name? I have set up the fair scheduler and I have two users: A and B (belonging to the user group hadoop). When these users submit hadoop jobs, the jobs from A go to a pool named A and the jobs from B go to a pool named B. I want them to go to a pool with their group name, so I tried adding the following to mapred-site.xml:

<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>group.name</value>
</property>

But instead the jobs now go to the default pool. I want the jobs submitted by A and B to go to the pool named hadoop. How do I do that? Also, how can I explicitly set a job to any specified pool? I have set the allocation file (fair-scheduler.xml) like this:

<allocations>
  <pool name="hadoop">
    <minMaps>1</minMaps>
    <minReduces>1</minReduces>
    <maxMaps>3</maxMaps>
    <maxReduces>3</maxReduces>
  </pool>
  <userMaxJobsDefault>5</userMaxJobsDefault>
</allocations>

Any help is greatly appreciated. Thanks, Austin
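To make the submission-time idea concrete under Merto's configuration above (where mapred.fairscheduler.poolnameproperty is set to pool.name), a job could be pinned to a pool on the command line roughly as follows. This is a sketch: the jar and driver class names are placeholders, and it assumes the driver uses ToolRunner so that -D options reach the job configuration.

# Illustrative only; my-job.jar and MyDriver are placeholders.
hadoop jar my-job.jar MyDriver -D pool.name=hadoop /user/in /user/out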
Re: Hadoop fair scheduler doubt: allocate jobs to pool
Hi, I tried what you had said. I added the following to mapred-site.xml:

<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>pool.name</value>
</property>
<property>
  <name>pool.name</name>
  <value>${mapreduce.job.group.name}</value>
</property>

Funny enough, it created a pool with the literal name ${mapreduce.job.group.name}, so I tried ${mapred.job.group.name} and ${group.name}, all to the same effect. But when I did ${user.name} it worked, and created a pool with the user name!

On Thu, Mar 1, 2012 at 8:03 PM, Merto Mertek masmer...@gmail.com wrote: From the fairscheduler docs I assume the following should work:

<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>pool.name</value>
</property>
<property>
  <name>pool.name</name>
  <value>${mapreduce.job.group.name}</value>
</property>

which means that the default pool will be the group of the user that has submitted the job. In your case I think that allocations.xml is correct. If you want to explicitly assign a job to a specific pool from your allocations.xml file, you can define it as follows:

Configuration conf3 = conf;
conf3.set("pool.name", "pool3");   // conf.set("property.name", "value")

Let me know if it works..

On 29 February 2012 14:18, Austin Chungath austi...@gmail.com wrote: How can I set the fair scheduler such that all jobs submitted from a particular user group go to a pool with the group name? I have set up the fair scheduler and I have two users: A and B (belonging to the user group hadoop). When these users submit hadoop jobs, the jobs from A go to a pool named A and the jobs from B go to a pool named B. I want them to go to a pool with their group name, so I tried adding the following to mapred-site.xml:

<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>group.name</value>
</property>

But instead the jobs now go to the default pool. I want the jobs submitted by A and B to go to the pool named hadoop. How do I do that? Also, how can I explicitly set a job to any specified pool? I have set the allocation file (fair-scheduler.xml) like this:

<allocations>
  <pool name="hadoop">
    <minMaps>1</minMaps>
    <minReduces>1</minReduces>
    <maxMaps>3</maxMaps>
    <maxReduces>3</maxReduces>
  </pool>
  <userMaxJobsDefault>5</userMaxJobsDefault>
</allocations>

Any help is greatly appreciated. Thanks, Austin
Hadoop fair scheduler doubt: allocate jobs to pool
How can I set the fair scheduler such that all jobs submitted from a particular user group go to a pool with the group name? I have set up the fair scheduler and I have two users: A and B (belonging to the user group hadoop). When these users submit hadoop jobs, the jobs from A go to a pool named A and the jobs from B go to a pool named B. I want them to go to a pool with their group name, so I tried adding the following to mapred-site.xml:

<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>group.name</value>
</property>

But instead the jobs now go to the default pool. I want the jobs submitted by A and B to go to the pool named hadoop. How do I do that? Also, how can I explicitly set a job to any specified pool? I have set the allocation file (fair-scheduler.xml) like this:

<allocations>
  <pool name="hadoop">
    <minMaps>1</minMaps>
    <minReduces>1</minReduces>
    <maxMaps>3</maxMaps>
    <maxReduces>3</maxReduces>
  </pool>
  <userMaxJobsDefault>5</userMaxJobsDefault>
</allocations>

Any help is greatly appreciated. Thanks, Austin
Re: hadoop streaming : need help in using custom key value separator
Thanks Subir, -D stream.mapred.output.field.separator=* is not an available option, my bad. What I should have done is: -D stream.map.output.field.separator=*

On Tue, Feb 28, 2012 at 2:36 PM, Subir S subir.sasiku...@gmail.com wrote: http://hadoop.apache.org/common/docs/current/streaming.html#Customizing+How+Lines+are+Split+into+Key%2FValue+Pairs Read this link, your options are wrong below.

On Tue, Feb 28, 2012 at 1:13 PM, Austin Chungath austi...@gmail.com wrote: When I am using more than one reducer in hadoop streaming, where I am using my custom separator rather than the tab, it looks like the hadoop shuffling process is not happening as it should. This is the reducer output when I am using '\t' to separate the key-value pair that is output from the mapper.

Output from reducer 1:
10321,22
23644,37
41231,42
23448,20
12325,39
71234,20

Output from reducer 2:
24123,43
33213,46
11321,29
21232,32

The above output is as expected: the first column is the key and the second is the count. There are 10 unique keys; 6 of them are in the output of the first reducer and the remaining 4 are in the second reducer's output. But now, when I use a custom separator for the key-value pair output from my mapper (here I am using '*' as the separator):

-D stream.mapred.output.field.separator=* -D mapred.reduce.tasks=2

Output from reducer 1:
10321,5
21232,19
24123,16
33213,28
23644,21
41231,12
23448,18
11321,29
12325,24
71234,9

Output from reducer 2:
10321,17
21232,13
33213,18
23644,16
41231,30
23448,2
24123,27
12325,15
71234,11

Now both reducers are getting all the keys: part of the values go to reducer 1 and part go to reducer 2. Why is it behaving like this when I am using a custom separator? Shouldn't each reducer get a unique key after the shuffling? I am using Hadoop 0.20.205.0, and below is the command that I am using to run hadoop streaming. Are there some more options that I should specify for hadoop streaming to work properly when I am using a custom separator?

hadoop jar $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar -D stream.mapred.output.field.separator=* -D mapred.reduce.tasks=2 -mapper ./map.py -reducer ./reducer.py -file ./map.py -file ./reducer.py -input /user/inputdata -output /user/outputdata -verbose

Any help is much appreciated. Thanks, Austin
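For reference, the corrected invocation implied by the follow-up above differs only in the property name; this is a sketch rather than a command from the thread, and it keeps the original job's scripts and paths.

# Illustrative sketch: same job as above, with the supported property
# stream.map.output.field.separator instead of stream.mapred.output.field.separator,
# so the '*' before it is treated as the key and partitioning works per key.
hadoop jar $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar \
    -D stream.map.output.field.separator='*' \
    -D mapred.reduce.tasks=2 \
    -mapper ./map.py -reducer ./reducer.py \
    -file ./map.py -file ./reducer.py \
    -input /user/inputdata -output /user/outputdata -verbose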
hadoop streaming : need help in using custom key value separator
When I am using more than one reducer in hadoop streaming, where I am using my custom separator rather than the tab, it looks like the hadoop shuffling process is not happening as it should. This is the reducer output when I am using '\t' to separate the key-value pair that is output from the mapper.

Output from reducer 1:
10321,22
23644,37
41231,42
23448,20
12325,39
71234,20

Output from reducer 2:
24123,43
33213,46
11321,29
21232,32

The above output is as expected: the first column is the key and the second is the count. There are 10 unique keys; 6 of them are in the output of the first reducer and the remaining 4 are in the second reducer's output. But now, when I use a custom separator for the key-value pair output from my mapper (here I am using '*' as the separator):

-D stream.mapred.output.field.separator=* -D mapred.reduce.tasks=2

Output from reducer 1:
10321,5
21232,19
24123,16
33213,28
23644,21
41231,12
23448,18
11321,29
12325,24
71234,9

Output from reducer 2:
10321,17
21232,13
33213,18
23644,16
41231,30
23448,2
24123,27
12325,15
71234,11

Now both reducers are getting all the keys: part of the values go to reducer 1 and part go to reducer 2. Why is it behaving like this when I am using a custom separator? Shouldn't each reducer get a unique key after the shuffling? I am using Hadoop 0.20.205.0, and below is the command that I am using to run hadoop streaming. Are there some more options that I should specify for hadoop streaming to work properly when I am using a custom separator?

hadoop jar $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar -D stream.mapred.output.field.separator=* -D mapred.reduce.tasks=2 -mapper ./map.py -reducer ./reducer.py -file ./map.py -file ./reducer.py -input /user/inputdata -output /user/outputdata -verbose

Any help is much appreciated. Thanks, Austin