Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

2012-05-09 Thread Austin Chungath
$DuplicationException: Invalid input, there are duplicated files in the
sources: hftp://ub13:50070/tmp/Rtmp1BU9Kb/file6abc6ccb6551/_logs/history,
hftp://ub13:50070/tmp/Rtmp3yCJhu/file1ca96d9331/_logs/history

Any idea what the problem is here?
They are different files; how are they conflicting?

Thanks & Regards

On Tue, May 8, 2012 at 11:52 PM, Adam Faris afa...@linkedin.com wrote:

 Hi Austin,

 I'm glad that helped out.  Regarding the -p flag for distcp, here's the
 online documentation

 http://hadoop.apache.org/common/docs/current/distcp.html#Option+Index

 You can also get this info from running 'hadoop distcp' without any flags.
 
 -p[rbugp]   Preserve
   r: replication number
   b: block size
   u: user
   g: group
   p: permission
 

 -- Adam
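
 In other words, -ppgu in the command further down asks distcp to keep
 permissions, group and user ownership on the copied files, while -prbugp
 would preserve everything listed above. A hypothetical invocation with
 placeholder namenode hostnames:

 hadoop distcp -i -prbugp -m 20 \
   hftp://old-namenode:50070/path/to/src \
   hdfs://new-namenode:9000/path/to/dst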

 On May 7, 2012, at 10:55 PM, Austin Chungath wrote:

  Thanks Adam,
 
  That was very helpful. Your second point solved my problems :-)
  The hdfs port number was wrong.
   I didn't use the option -ppgu; what does it do?
 
 
 
  On Mon, May 7, 2012 at 8:07 PM, Adam Faris afa...@linkedin.com wrote:
 
  Hi Austin,
 
  I don't know about using CDH3, but we use distcp for moving data between
  different versions of apache grids and several things come to mind.
 
  1) you should use the -i flag to ignore checksum differences on the
  blocks.  I'm not 100% sure, but I want to say hftp doesn't support checksums on
 the
  blocks as they go across the wire.
 
  2) you should read from hftp but write to hdfs.  Also make sure to check
  your port numbers.   For example I can read from hftp on port 50070 and
  write to hdfs on port 9000.  You'll find the hftp port in hdfs-site.xml
 and
  hdfs in core-site.xml on apache releases.
 
  3) Do you have security (kerberos) enabled on 0.20.205? Does CDH3
 support
  security?  If security is enabled on 0.20.205 and CDH3 does not support
  security, you will need to disable security on 0.20.205.  This is
 because
  you are unable to write from a secure to unsecured grid.
 
  4) use the -m flag to limit your mappers so you don't DDOS your network
  backbone.
 
  5) why isn't your vendor helping you with the data migration? :)
 
  Otherwise something like this should get you going.
 
   hadoop distcp -i -ppgu -log /tmp/mylog -m 20 \
   hftp://mynamenode.grid.one:50070/path/to/my/src/data \
   hdfs://mynamenode.grid.two:9000/path/to/my/dst
 
  -- Adam
 
  On May 7, 2012, at 4:29 AM, Nitin Pawar wrote:
 
  things to check
 
  1) when you launch distcp jobs, all the datanodes of the older hdfs are
  live and connected
  2) when you launch distcp, no data is being written/moved/deleted in hdfs
  3) you can use the -log option to log errors into a directory and use -i
  to ignore errors
 
  also you can try using distcp with the hdfs protocol instead of hftp ... for
  more you can refer to
 
 
 https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/d0d99ad9f1554edd
 
 
 
   If it failed, there should be some error.
  On Mon, May 7, 2012 at 4:44 PM, Austin Chungath austi...@gmail.com
  wrote:
 
  ok that was a lame mistake.
  $ hadoop distcp hftp://localhost:50070/tmp
  hftp://localhost:60070/tmp_copy
  I had spelled hdfs instead of hftp
 
  $ hadoop distcp hftp://localhost:50070/docs/index.html
  hftp://localhost:60070/user/hadoop
  12/05/07 16:38:09 INFO tools.DistCp:
  srcPaths=[hftp://localhost:50070/docs/index.html]
  12/05/07 16:38:09 INFO tools.DistCp:
  destPath=hftp://localhost:60070/user/hadoop
  With failures, global counters are inaccurate; consider running with
 -i
  Copy failed: java.io.IOException: Not supported
  at
 org.apache.hadoop.hdfs.HftpFileSystem.delete(HftpFileSystem.java:457)
  at org.apache.hadoop.tools.DistCp.fullyDelete(DistCp.java:963)
  at org.apache.hadoop.tools.DistCp.copy(DistCp.java:672)
  at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
  at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
 
   Any idea why I am getting this error?
  I am copying one file from 0.20.205 (/docs/index.html ) to cdh3u3
  (/user/hadoop)
 
   Thanks & Regards,
  Austin
 
  On Mon, May 7, 2012 at 3:57 PM, Austin Chungath austi...@gmail.com
  wrote:
 
  Thanks,
 
  So I decided to try and move using distcp.
 
  $ hadoop distcp hdfs://localhost:54310/tmp
  hdfs://localhost:8021/tmp_copy
  12/05/07 14:57:38 INFO tools.DistCp:
  srcPaths=[hdfs://localhost:54310/tmp]
  12/05/07 14:57:38 INFO tools.DistCp:
  destPath=hdfs://localhost:8021/tmp_copy
  With failures, global counters are inaccurate; consider running with
 -i
  Copy failed: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol
  org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch.
  (client
  =
  63, server = 61)
 
  I found that we can do distcp like above only if both are of the same
  hadoop version.
  so I tried

Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

2012-05-07 Thread Austin Chungath
Thanks,

So I decided to try and move using distcp.

$ hadoop distcp hdfs://localhost:54310/tmp hdfs://localhost:8021/tmp_copy
12/05/07 14:57:38 INFO tools.DistCp: srcPaths=[hdfs://localhost:54310/tmp]
12/05/07 14:57:38 INFO tools.DistCp: destPath=hdfs://localhost:8021/tmp_copy
With failures, global counters are inaccurate; consider running with -i
Copy failed: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol
org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch. (client =
63, server = 61)

I found that we can do distcp like above only if both are of the same
hadoop version.
so I tried:

$ hadoop distcp hftp://localhost:50070/tmp hdfs://localhost:60070/tmp_copy
12/05/07 15:02:44 INFO tools.DistCp: srcPaths=[hftp://localhost:50070/tmp]
12/05/07 15:02:44 INFO tools.DistCp:
destPath=hdfs://localhost:60070/tmp_copy

But this process seems to hang at this stage. What might I be doing
wrong?

hftp://dfs.http.address/path
hftp://localhost:50070 is dfs.http.address of 0.20.205
hdfs://localhost:60070 is dfs.http.address of cdh3u3
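
One way to double-check which port speaks which protocol is to grep the
configs on each cluster (the conf paths below are assumptions for a typical
Apache-style install):

# hftp reads go through the namenode web port (dfs.http.address)
grep -A1 dfs.http.address conf/hdfs-site.xml
# hdfs:// writes go to the namenode RPC port (fs.default.name, core-site.xml)
grep -A1 fs.default.name conf/core-site.xml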

Thanks and regards,
Austin


On Fri, May 4, 2012 at 4:30 AM, Michel Segel michael_se...@hotmail.comwrote:

 Ok... So riddle me this...
 I currently have a replication factor of 3.
 I reset it to two.

 What do you have to do to get the replication factor of 3 down to 2?
 Do I just try to rebalance the nodes?

 The point is that you are looking at a very small cluster.
 You may want to start the new cluster with a replication factor of 2 and
 then when the data is moved over, increase it to a factor of 3. Or maybe
 not.

 I do a distcp to copy the data, and after each distcp I do an fsck for a
 sanity check and then remove the files I copied. As I gain more room, I can
 then slowly drop nodes, do an fsck, rebalance and then repeat.

 Even though this is a dev cluster, the OP wants to retain the data.

 There are other options depending on the amount and size of new hardware.
 I mean, make one machine a RAID 5 machine and copy data to it, clearing off
 the cluster.

 If 8TB was the amount of disk used, that would be about 2.7 TB of actual data (at 3x replication).
 Let's say 3TB. Going raid 5, how much disk is that?  So you could fit it
 on one machine, depending on hardware, or maybe 2 machines...  Now you can
 rebuild initial cluster and then move data back. Then rebuild those
 machines. Lots of options... ;-)

 Sent from a remote device. Please excuse any typos...

 Mike Segel

 On May 3, 2012, at 11:26 AM, Suresh Srinivas sur...@hortonworks.com
 wrote:

  This probably is a more relevant question in CDH mailing lists. That
 said,
  what Edward is suggesting seems reasonable. Reduce replication factor,
  decommission some of the nodes and create a new cluster with those nodes
  and do distcp.
 
  Could you share with us the reasons you want to migrate from Apache 205?
 
  Regards,
  Suresh
 
  On Thu, May 3, 2012 at 8:25 AM, Edward Capriolo edlinuxg...@gmail.com
 wrote:
 
  Honestly that is a hassle; going from 205 to cdh3u3 is probably more
  of a cross-grade than an upgrade or downgrade. I would just stick it
  out. But yes, like Michael said, two clusters on the same gear and
  distcp. If you are using RF=3 you could also lower your replication to
  rf=2 ('hadoop dfs -setrep 2') to clear headroom as you are moving
  stuff.
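
  A rough sketch of that headroom trick (the target path is an assumption;
  -setrep needs a path, and -R recurses into directories):

  hadoop dfs -setrep -R 2 /user/mydata   # lower replication on data you are about to move
  hadoop fsck /                          # sanity check: average replication, under-replicated blocks
  hadoop dfsadmin -report                # see how much headroom was freed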
 
 
  On Thu, May 3, 2012 at 7:25 AM, Michel Segel michael_se...@hotmail.com
 
  wrote:
  Ok... When you get your new hardware...
 
  Set up one server as your new NN, JT, SN.
  Set up the others as a DN.
  (Cloudera CDH3u3)
 
  On your existing cluster...
   Remove your old log files and temp files on HDFS, anything you don't need.
  This should give you some more space.
  Start copying some of the directories/files to the new cluster.
  As you gain space, decommission a node, rebalance, add node to new
  cluster...
 
  It's a slow process.
 
   Should I remind you to make sure you up your bandwidth setting, and to
  clean up the hdfs directories when you repurpose the nodes?
 
  Does this make sense?
 
  Sent from a remote device. Please excuse any typos...
 
  Mike Segel
 
  On May 3, 2012, at 5:46 AM, Austin Chungath austi...@gmail.com
 wrote:
 
  Yeah I know :-)
  and this is not a production cluster ;-) and yes there is more
 hardware
  coming :-)
 
  On Thu, May 3, 2012 at 4:10 PM, Michel Segel 
 michael_se...@hotmail.com
  wrote:
 
   Well, you've kind of painted yourself into a corner...
  Not sure why you didn't get a response from the Cloudera lists, but
  it's a
  generic question...
 
  8 out of 10 TB. Are you talking effective storage or actual disks?
  And please tell me you've already ordered more hardware.. Right?
 
  And please tell me this isn't your production cluster...
 
   (Strong hint to Strata and Cloudera... You really want to accept my
  upcoming proposal talk... ;-)
 
 
  Sent from a remote device. Please excuse any typos...
 
  Mike Segel
 
  On May 3, 2012, at 5:25 AM, Austin Chungath austi...@gmail.com
  wrote:
 
  Yes. This was first posted on the cloudera mailing

Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

2012-05-07 Thread Austin Chungath
ok that was a lame mistake.
$ hadoop distcp hftp://localhost:50070/tmp hftp://localhost:60070/tmp_copy
I had spelled hdfs instead of hftp

$ hadoop distcp hftp://localhost:50070/docs/index.html
hftp://localhost:60070/user/hadoop
12/05/07 16:38:09 INFO tools.DistCp:
srcPaths=[hftp://localhost:50070/docs/index.html]
12/05/07 16:38:09 INFO tools.DistCp:
destPath=hftp://localhost:60070/user/hadoop
With failures, global counters are inaccurate; consider running with -i
Copy failed: java.io.IOException: Not supported
at org.apache.hadoop.hdfs.HftpFileSystem.delete(HftpFileSystem.java:457)
at org.apache.hadoop.tools.DistCp.fullyDelete(DistCp.java:963)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:672)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)

Any idea why I am getting this error?
I am copying one file from 0.20.205 (/docs/index.html ) to cdh3u3
(/user/hadoop)

Thanks & Regards,
Austin

On Mon, May 7, 2012 at 3:57 PM, Austin Chungath austi...@gmail.com wrote:

 Thanks,

 So I decided to try and move using distcp.

 $ hadoop distcp hdfs://localhost:54310/tmp hdfs://localhost:8021/tmp_copy
 12/05/07 14:57:38 INFO tools.DistCp: srcPaths=[hdfs://localhost:54310/tmp]
 12/05/07 14:57:38 INFO tools.DistCp:
 destPath=hdfs://localhost:8021/tmp_copy
 With failures, global counters are inaccurate; consider running with -i
 Copy failed: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol
 org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch. (client =
 63, server = 61)

 I found that we can do distcp like above only if both are of the same
 hadoop version.
 so I tried:

 $ hadoop distcp hftp://localhost:50070/tmp hdfs://localhost:60070/tmp_copy
 12/05/07 15:02:44 INFO tools.DistCp: srcPaths=[hftp://localhost:50070/tmp]
 12/05/07 15:02:44 INFO tools.DistCp:
 destPath=hdfs://localhost:60070/tmp_copy

 But this process seems to hang at this stage. What might I be doing
 wrong?

 hftp://dfs.http.address/path
 hftp://localhost:50070 is dfs.http.address of 0.20.205
 hdfs://localhost:60070 is dfs.http.address of cdh3u3

 Thanks and regards,
 Austin


 On Fri, May 4, 2012 at 4:30 AM, Michel Segel michael_se...@hotmail.comwrote:

 Ok... So riddle me this...
 I currently have a replication factor of 3.
 I reset it to two.

 What do you have to do to get the replication factor of 3 down to 2?
 Do I just try to rebalance the nodes?

 The point is that you are looking at a very small cluster.
 You may want to start the new cluster with a replication factor of 2 and
 then when the data is moved over, increase it to a factor of 3. Or maybe
 not.

 I do a distcp to copy the data, and after each distcp I do an fsck for a
 sanity check and then remove the files I copied. As I gain more room, I can
 then slowly drop nodes, do an fsck, rebalance and then repeat.

 Even though this is a dev cluster, the OP wants to retain the data.

 There are other options depending on the amount and size of new hardware.
 I mean, make one machine a RAID 5 machine and copy data to it, clearing off
 the cluster.

 If 8TB was the amount of disk used, that would be about 2.7 TB of actual data (at 3x replication).
 Let's say 3TB. Going raid 5, how much disk is that?  So you could fit it
 on one machine, depending on hardware, or maybe 2 machines...  Now you can
 rebuild initial cluster and then move data back. Then rebuild those
 machines. Lots of options... ;-)

 Sent from a remote device. Please excuse any typos...

 Mike Segel

 On May 3, 2012, at 11:26 AM, Suresh Srinivas sur...@hortonworks.com
 wrote:

  This probably is a more relevant question in CDH mailing lists. That
 said,
  what Edward is suggesting seems reasonable. Reduce replication factor,
  decommission some of the nodes and create a new cluster with those nodes
  and do distcp.
 
  Could you share with us the reasons you want to migrate from Apache 205?
 
  Regards,
  Suresh
 
  On Thu, May 3, 2012 at 8:25 AM, Edward Capriolo edlinuxg...@gmail.com
 wrote:
 
   Honestly that is a hassle; going from 205 to cdh3u3 is probably more
   of a cross-grade than an upgrade or downgrade. I would just stick it
   out. But yes, like Michael said, two clusters on the same gear and
   distcp. If you are using RF=3 you could also lower your replication to
   rf=2 ('hadoop dfs -setrep 2') to clear headroom as you are moving
   stuff.
 
 
  On Thu, May 3, 2012 at 7:25 AM, Michel Segel 
 michael_se...@hotmail.com
  wrote:
  Ok... When you get your new hardware...
 
  Set up one server as your new NN, JT, SN.
  Set up the others as a DN.
  (Cloudera CDH3u3)
 
  On your existing cluster...
   Remove your old log files and temp files on HDFS, anything you don't need.
  This should give you some more space.
  Start copying some of the directories/files to the new cluster.
  As you gain space, decommission a node, rebalance, add node to new
  cluster...
 
  It's

Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

2012-05-07 Thread Austin Chungath
Thanks Adam,

That was very helpful. Your second point solved my problems :-)
The hdfs port number was wrong.
I didn't use the option -ppgu; what does it do?



On Mon, May 7, 2012 at 8:07 PM, Adam Faris afa...@linkedin.com wrote:

 Hi Austin,

 I don't know about using CDH3, but we use distcp for moving data between
 different versions of apache grids and several things come to mind.

 1) you should use the -i flag to ignore checksum differences on the
 blocks.  I'm not 100% sure, but I want to say hftp doesn't support checksums on the
 blocks as they go across the wire.

 2) you should read from hftp but write to hdfs.  Also make sure to check
 your port numbers.   For example I can read from hftp on port 50070 and
 write to hdfs on port 9000.  You'll find the hftp port in hdfs-site.xml and
 hdfs in core-site.xml on apache releases.

 3) Do you have security (kerberos) enabled on 0.20.205? Does CDH3 support
 security?  If security is enabled on 0.20.205 and CDH3 does not support
 security, you will need to disable security on 0.20.205.  This is because
 you are unable to write from a secure to unsecured grid.

 4) use the -m flag to limit your mappers so you don't DDOS your network
 backbone.

 5) why isn't your vendor helping you with the data migration? :)

 Otherwise something like this should get you going.

  hadoop distcp -i -ppgu -log /tmp/mylog -m 20 \
  hftp://mynamenode.grid.one:50070/path/to/my/src/data \
  hdfs://mynamenode.grid.two:9000/path/to/my/dst

 -- Adam

 On May 7, 2012, at 4:29 AM, Nitin Pawar wrote:

  things to check
 
   1) when you launch distcp jobs, all the datanodes of the older hdfs are
   live and connected
   2) when you launch distcp, no data is being written/moved/deleted in hdfs
   3) you can use the -log option to log errors into a directory and use -i
   to ignore errors
 
   also you can try using distcp with the hdfs protocol instead of hftp ... for
   more you can refer to
 
 https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/d0d99ad9f1554edd
 
 
 
   If it failed, there should be some error.
  On Mon, May 7, 2012 at 4:44 PM, Austin Chungath austi...@gmail.com
 wrote:
 
  ok that was a lame mistake.
  $ hadoop distcp hftp://localhost:50070/tmp
 hftp://localhost:60070/tmp_copy
  I had spelled hdfs instead of hftp
 
  $ hadoop distcp hftp://localhost:50070/docs/index.html
  hftp://localhost:60070/user/hadoop
  12/05/07 16:38:09 INFO tools.DistCp:
  srcPaths=[hftp://localhost:50070/docs/index.html]
  12/05/07 16:38:09 INFO tools.DistCp:
  destPath=hftp://localhost:60070/user/hadoop
  With failures, global counters are inaccurate; consider running with -i
  Copy failed: java.io.IOException: Not supported
  at org.apache.hadoop.hdfs.HftpFileSystem.delete(HftpFileSystem.java:457)
  at org.apache.hadoop.tools.DistCp.fullyDelete(DistCp.java:963)
  at org.apache.hadoop.tools.DistCp.copy(DistCp.java:672)
  at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
  at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
 
   Any idea why I am getting this error?
  I am copying one file from 0.20.205 (/docs/index.html ) to cdh3u3
  (/user/hadoop)
 
   Thanks & Regards,
  Austin
 
  On Mon, May 7, 2012 at 3:57 PM, Austin Chungath austi...@gmail.com
  wrote:
 
  Thanks,
 
  So I decided to try and move using distcp.
 
  $ hadoop distcp hdfs://localhost:54310/tmp
 hdfs://localhost:8021/tmp_copy
  12/05/07 14:57:38 INFO tools.DistCp:
  srcPaths=[hdfs://localhost:54310/tmp]
  12/05/07 14:57:38 INFO tools.DistCp:
  destPath=hdfs://localhost:8021/tmp_copy
  With failures, global counters are inaccurate; consider running with -i
  Copy failed: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol
  org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch.
 (client
  =
  63, server = 61)
 
  I found that we can do distcp like above only if both are of the same
  hadoop version.
  so I tried:
 
  $ hadoop distcp hftp://localhost:50070/tmp
  hdfs://localhost:60070/tmp_copy
  12/05/07 15:02:44 INFO tools.DistCp:
  srcPaths=[hftp://localhost:50070/tmp]
  12/05/07 15:02:44 INFO tools.DistCp:
  destPath=hdfs://localhost:60070/tmp_copy
 
   But this process seems to hang at this stage. What might I be
 doing
  wrong?
 
  hftp://dfs.http.address/path
  hftp://localhost:50070 is dfs.http.address of 0.20.205
  hdfs://localhost:60070 is dfs.http.address of cdh3u3
 
  Thanks and regards,
  Austin
 
 
  On Fri, May 4, 2012 at 4:30 AM, Michel Segel 
 michael_se...@hotmail.com
  wrote:
 
  Ok... So riddle me this...
  I currently have a replication factor of 3.
  I reset it to two.
 
  What do you have to do to get the replication factor of 3 down to 2?
  Do I just try to rebalance the nodes?
 
  The point is that you are looking at a very small cluster.
   You may want to start the new cluster with a replication factor of 2
 and
  then when the data is moved over, increase it to a factor

Best practice to migrate HDFS from 0.20.205 to CDH3u3

2012-05-03 Thread Austin Chungath
Hi,
I am migrating from Apache hadoop 0.20.205 to CDH3u3.
I don't want to lose the data that is in the HDFS of Apache hadoop
0.20.205.
How do I migrate to CDH3u3 but keep the data that I have on 0.20.205?
What are the best practices/techniques to do this?

Thanks & Regards,
Austin


Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

2012-05-03 Thread Austin Chungath
Thanks for the suggestions,
My concern is that I can't actually copyToLocal from the dfs because the
data is huge.

Say my hadoop was 0.20 and I were upgrading to 0.20.205: I could do a
namenode upgrade and wouldn't have to copy data out of dfs.

But here I have Apache hadoop 0.20.205 and I want to use CDH3 now,
which is based on 0.20.
Now it is actually a downgrade, as 0.20.205's namenode info has to be used
by 0.20's namenode.

Any idea how I can achieve what I am trying to do?

Thanks.

On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar nitinpawar...@gmail.comwrote:

 i can think of the following options:

 1) write a simple get-and-put program which gets the data from the old DFS
 and loads it into the new dfs
 2) see if distcp between both versions is compatible
 3) this is what I had done (and my data was hardly a few hundred GB) .. did a
 dfs -copyToLocal and then in the new grid did a copyFromLocal (a rough
 sketch follows below)
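
 A rough sketch of option 3, assuming there is enough local disk to stage the
 data (all paths here are made up):

 # on the old 0.20.205 cluster
 hadoop dfs -copyToLocal /path/on/old/hdfs /local/staging/dir
 # then, with the client pointed at the new CDH3 cluster
 hadoop dfs -copyFromLocal /local/staging/dir /path/on/new/hdfs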

 On Thu, May 3, 2012 at 11:41 AM, Austin Chungath austi...@gmail.com
 wrote:

  Hi,
  I am migrating from Apache hadoop 0.20.205 to CDH3u3.
  I don't want to lose the data that is in the HDFS of Apache hadoop
  0.20.205.
  How do I migrate to CDH3u3 but keep the data that I have on 0.20.205.
  What is the best practice/ techniques to do this?
 
   Thanks & Regards,
  Austin
 



 --
 Nitin Pawar



Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

2012-05-03 Thread Austin Chungath
There is only one cluster. I am not copying between clusters.

Say I have a cluster running apache 0.20.205 with 10 TB storage capacity
and about 8 TB of data.
Now how can I migrate the same cluster to use cdh3 and keep that same 8 TB
of data?

I can't copy 8 TB of data using distcp because I have only 2 TB of free
space.


On Thu, May 3, 2012 at 3:12 PM, Nitin Pawar nitinpawar...@gmail.com wrote:

 you can actually look at the distcp

 http://hadoop.apache.org/common/docs/r0.20.0/distcp.html

  but this means that you have two different clusters available to do
  the migration

 On Thu, May 3, 2012 at 12:51 PM, Austin Chungath austi...@gmail.com
 wrote:

  Thanks for the suggestions,
  My concerns are that I can't actually copyToLocal from the dfs because
 the
  data is huge.
 
  Say if my hadoop was 0.20 and I am upgrading to 0.20.205 I can do a
  namenode upgrade. I don't have to copy data out of dfs.
 
  But here I am having Apache hadoop 0.20.205 and I want to use CDH3 now,
  which is based on 0.20
  Now it is actually a downgrade as 0.20.205's namenode info has to be used
  by 0.20's namenode.
 
  Any idea how I can achieve what I am trying to do?
 
  Thanks.
 
  On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar nitinpawar...@gmail.com
  wrote:
 
   i can think of following options
  
   1) write a simple get and put code which gets the data from DFS and
 loads
   it in dfs
   2) see if the distcp  between both versions are compatible
   3) this is what I had done (and my data was hardly few hundred GB) ..
  did a
   dfs -copyToLocal and then in the new grid did a copyFromLocal
  
   On Thu, May 3, 2012 at 11:41 AM, Austin Chungath austi...@gmail.com
   wrote:
  
Hi,
I am migrating from Apache hadoop 0.20.205 to CDH3u3.
I don't want to lose the data that is in the HDFS of Apache hadoop
0.20.205.
How do I migrate to CDH3u3 but keep the data that I have on 0.20.205.
What is the best practice/ techniques to do this?
   
     Thanks & Regards,
Austin
   
  
  
  
   --
   Nitin Pawar
  
 



 --
 Nitin Pawar



Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

2012-05-03 Thread Austin Chungath
Yes. This was first posted on the cloudera mailing list. There were no
responses.

But this is not related to cloudera as such.

cdh3 uses apache hadoop 0.20 as its base. My data is in apache
hadoop 0.20.205.

There is an upgrade namenode option when we are migrating to a higher
version, say from 0.20 to 0.20.205,
but here I am downgrading from 0.20.205 to 0.20 (cdh3).
Is this possible?


On Thu, May 3, 2012 at 3:25 PM, Prashant Kommireddi prash1...@gmail.comwrote:

 Seems like a matter of upgrade. I am not a Cloudera user so would not know
 much, but you might find some help moving this to Cloudera mailing list.

 On Thu, May 3, 2012 at 2:51 AM, Austin Chungath austi...@gmail.com
 wrote:

  There is only one cluster. I am not copying between clusters.
 
  Say I have a cluster running apache 0.20.205 with 10 TB storage capacity
  and has about 8 TB of data.
  Now how can I migrate the same cluster to use cdh3 and use that same 8 TB
  of data.
 
  I can't copy 8 TB of data using distcp because I have only 2 TB of free
  space
 
 
  On Thu, May 3, 2012 at 3:12 PM, Nitin Pawar nitinpawar...@gmail.com
  wrote:
 
   you can actually look at the distcp
  
   http://hadoop.apache.org/common/docs/r0.20.0/distcp.html
  
   but this means that you have two different set of clusters available to
  do
   the migration
  
   On Thu, May 3, 2012 at 12:51 PM, Austin Chungath austi...@gmail.com
   wrote:
  
Thanks for the suggestions,
My concerns are that I can't actually copyToLocal from the dfs
 because
   the
data is huge.
   
Say if my hadoop was 0.20 and I am upgrading to 0.20.205 I can do a
namenode upgrade. I don't have to copy data out of dfs.
   
But here I am having Apache hadoop 0.20.205 and I want to use CDH3
 now,
which is based on 0.20
Now it is actually a downgrade as 0.20.205's namenode info has to be
  used
by 0.20's namenode.
   
Any idea how I can achieve what I am trying to do?
   
Thanks.
   
On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar 
 nitinpawar...@gmail.com
wrote:
   
 i can think of following options

 1) write a simple get and put code which gets the data from DFS and
   loads
 it in dfs
 2) see if the distcp  between both versions are compatible
 3) this is what I had done (and my data was hardly few hundred GB)
 ..
did a
 dfs -copyToLocal and then in the new grid did a copyFromLocal

 On Thu, May 3, 2012 at 11:41 AM, Austin Chungath 
 austi...@gmail.com
  
 wrote:

  Hi,
  I am migrating from Apache hadoop 0.20.205 to CDH3u3.
  I don't want to lose the data that is in the HDFS of Apache
 hadoop
  0.20.205.
  How do I migrate to CDH3u3 but keep the data that I have on
  0.20.205.
  What is the best practice/ techniques to do this?
 
       Thanks & Regards,
  Austin
 



 --
 Nitin Pawar

   
  
  
  
   --
   Nitin Pawar
  
 



Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

2012-05-03 Thread Austin Chungath
Yeah I know :-)
and this is not a production cluster ;-) and yes there is more hardware
coming :-)

On Thu, May 3, 2012 at 4:10 PM, Michel Segel michael_se...@hotmail.comwrote:

  Well, you've kind of painted yourself into a corner...
 Not sure why you didn't get a response from the Cloudera lists, but it's a
 generic question...

 8 out of 10 TB. Are you talking effective storage or actual disks?
 And please tell me you've already ordered more hardware.. Right?

 And please tell me this isn't your production cluster...

  (Strong hint to Strata and Cloudera... You really want to accept my
 upcoming proposal talk... ;-)


 Sent from a remote device. Please excuse any typos...

 Mike Segel

 On May 3, 2012, at 5:25 AM, Austin Chungath austi...@gmail.com wrote:

  Yes. This was first posted on the cloudera mailing list. There were no
  responses.
 
  But this is not related to cloudera as such.
 
  cdh3 is based on apache hadoop 0.20 as the base. My data is in apache
  hadoop 0.20.205
 
  There is an upgrade namenode option when we are migrating to a higher
  version say from 0.20 to 0.20.205
  but here I am downgrading from 0.20.205 to 0.20 (cdh3)
  Is this possible?
 
 
  On Thu, May 3, 2012 at 3:25 PM, Prashant Kommireddi prash1...@gmail.com
 wrote:
 
  Seems like a matter of upgrade. I am not a Cloudera user so would not
 know
  much, but you might find some help moving this to Cloudera mailing list.
 
  On Thu, May 3, 2012 at 2:51 AM, Austin Chungath austi...@gmail.com
  wrote:
 
  There is only one cluster. I am not copying between clusters.
 
  Say I have a cluster running apache 0.20.205 with 10 TB storage
 capacity
  and has about 8 TB of data.
  Now how can I migrate the same cluster to use cdh3 and use that same 8
 TB
  of data.
 
  I can't copy 8 TB of data using distcp because I have only 2 TB of free
  space
 
 
  On Thu, May 3, 2012 at 3:12 PM, Nitin Pawar nitinpawar...@gmail.com
  wrote:
 
  you can actually look at the distcp
 
  http://hadoop.apache.org/common/docs/r0.20.0/distcp.html
 
  but this means that you have two different set of clusters available
 to
  do
  the migration
 
  On Thu, May 3, 2012 at 12:51 PM, Austin Chungath austi...@gmail.com
  wrote:
 
  Thanks for the suggestions,
  My concerns are that I can't actually copyToLocal from the dfs
  because
  the
  data is huge.
 
  Say if my hadoop was 0.20 and I am upgrading to 0.20.205 I can do a
  namenode upgrade. I don't have to copy data out of dfs.
 
  But here I am having Apache hadoop 0.20.205 and I want to use CDH3
  now,
  which is based on 0.20
  Now it is actually a downgrade as 0.20.205's namenode info has to be
  used
  by 0.20's namenode.
 
  Any idea how I can achieve what I am trying to do?
 
  Thanks.
 
  On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar 
  nitinpawar...@gmail.com
  wrote:
 
  i can think of following options
 
  1) write a simple get and put code which gets the data from DFS and
  loads
  it in dfs
  2) see if the distcp  between both versions are compatible
  3) this is what I had done (and my data was hardly few hundred GB)
  ..
  did a
  dfs -copyToLocal and then in the new grid did a copyFromLocal
 
  On Thu, May 3, 2012 at 11:41 AM, Austin Chungath 
  austi...@gmail.com
 
  wrote:
 
  Hi,
  I am migrating from Apache hadoop 0.20.205 to CDH3u3.
  I don't want to lose the data that is in the HDFS of Apache
  hadoop
  0.20.205.
  How do I migrate to CDH3u3 but keep the data that I have on
  0.20.205.
  What is the best practice/ techniques to do this?
 
       Thanks & Regards,
  Austin
 
 
 
 
  --
  Nitin Pawar
 
 
 
 
 
  --
  Nitin Pawar
 
 
 



how to add more than one user to hadoop with DFS permissions?

2012-03-10 Thread Austin Chungath
I have a 2 node cluster running hadoop 0.20.205. There is only one user,
username: hadoop, of group: hadoop.
What is the easiest way to add one more user, say hadoop1, with DFS
permissions set to true?

I did the following to create a user in the master node.
sudo adduser --ingroup hadoop hadoop1

My aim is to have hadoop run in such a way that each user's input and output
data is accessible only to the owner (chmod 700).
I have played around with the configuration properties for some time now but to
no end.

It would be great if someone could tell me which configuration file
properties I should change to achieve this.

Thanks,
Austin


Re: how to add more than one user to hadoop with DFS permissions?

2012-03-10 Thread Austin Chungath
Thanks Harsh :)

On Sat, Mar 10, 2012 at 10:12 PM, Harsh J ha...@cloudera.com wrote:

 Austin,

 1. Enable HDFS Permissions. In hdfs-site.xml, set dfs.permissions as
 true.

 2. To commission any new user, as HDFS admin (the user who runs the
 NameNode process), run:
 hadoop fs -mkdir /user/username
 hadoop fs -chown username:username /user/username

 3. For default file/dir permissions to be 700, tweak the dfs.umaskmode
 property.

 Much of this is also documented at the permissions guide:
 http://hadoop.apache.org/common/docs/r0.20.2/hdfs_permissions_guide.html
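
 For the hadoop1 user from the original question, step 2 plus an explicit
 chmod for the 700 goal would look roughly like this (run as the HDFS admin
 user; the group matches the earlier 'adduser --ingroup hadoop hadoop1'):

 hadoop fs -mkdir /user/hadoop1
 hadoop fs -chown hadoop1:hadoop /user/hadoop1   # or hadoop1:hadoop1 for a per-user group
 hadoop fs -chmod 700 /user/hadoop1              # keep the home directory private to its owner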

 On Sat, Mar 10, 2012 at 9:59 PM, Austin Chungath austi...@gmail.com
 wrote:
  I have a 2 node cluster running hadoop 0.20.205. There is only one user ,
  username: hadoop of group: hadoop.
  What is the easiest way to add one more user say hadoop1 with DFS
  permissions set as true?
 
  I did the following to create a user in the master node.
  sudo adduser --ingroup hadoop hadoop1
 
  My aim is to have hadoop run in such a way that each user input and
 output
  data is accessible only to the owner (chmod 700).
  I did play around with the configuration properties for sometime now but
 to
  no end.
 
  It would be great if some one could tell me what are the configuration
 file
  properties that I should change to achieve this?
 
  Thanks,
  Austin



 --
 Harsh J



fairscheduler : group.name | Please edit patch to work for 0.20.205

2012-03-05 Thread Austin Chungath
Can someone have a look at the patch MAPREDUCE-2457 and see if it can be
modified to work for 0.20.205?
I am very new to java and have no idea what's going on in that patch. If
you have any pointers for me, I will see if I can do it on my own.

Thanks,
Austin

On Fri, Mar 2, 2012 at 7:15 PM, Austin Chungath austi...@gmail.com wrote:

 I tried the patch MAPREDUCE-2457 but it didn't work for my hadoop 0.20.205.
 Are you sure this patch will work for 0.20.205?
 According to the description it says that the patch works for 0.21 and
 0.22 and it says that 0.20 supports group.name without this patch...

 So does this patch also apply to 0.20.205?

 Thanks,
 Austin

  On Thu, Mar 1, 2012 at 11:24 PM, Harsh J ha...@cloudera.com wrote:

 The group.name scheduler support was introduced in
 https://issues.apache.org/jira/browse/HADOOP-3892 but may have been
 broken by the security changes present in 0.20.205. You'll need the
 fix presented in  https://issues.apache.org/jira/browse/MAPREDUCE-2457
 to have group.name support.
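
 A very rough sketch of trying that patch against a 0.20.205 source tree
 (the directory name and ant target are assumptions, and as the follow-up in
 this thread notes, the patch may not apply cleanly to 0.20.205):

 cd hadoop-0.20.205.0                          # unpacked source tree
 patch -p0 --dry-run < MAPREDUCE-2457.patch    # check whether it applies before touching anything
 patch -p0 < MAPREDUCE-2457.patch
 ant compile-contrib                           # rebuild contrib, including the fair scheduler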

 On Thu, Mar 1, 2012 at 6:42 PM, Austin Chungath austi...@gmail.com
 wrote:
   I am running fair scheduler on hadoop 0.20.205.0
 
  http://hadoop.apache.org/common/docs/r0.20.205.0/fair_scheduler.html
  The above page talks about the following property
 
  *mapred.fairscheduler.poolnameproperty*
  which I can set to *group.name*
  The default is user.name and when a user submits a job the fair
 scheduler
  assigns each user's job to a pool which has the name of the user.
  I am trying to change it to group.name so that the job is submitted to
 a
  pool which has the name of the user's linux group. Thus all jobs from
 any
  user from a specific group go to the same pool instead of an individual
  pool for every user.
  But *group.name* doesn't seem to work, has anyone tried this before?
 
  *user.name* and *mapred.job.queue.name* work. Is group.name supported in
  0.20.205.0? I don't see it mentioned in the docs.
 
  Thanks,
  Austin



 --
 Harsh J





Re: fairscheduler : group.name doesn't work, please help

2012-03-02 Thread Austin Chungath
I tried the patch MAPREDUCE-2457 but it didn't work for my hadoop 0.20.205.
Are you sure this patch will work for 0.20.205?
According to the description it says that the patch works for 0.21 and 0.22
and it says that 0.20 supports group.name without this patch...

So does this patch also apply to 0.20.205?

Thanks,
Austin

On Thu, Mar 1, 2012 at 11:24 PM, Harsh J ha...@cloudera.com wrote:

 The group.name scheduler support was introduced in
 https://issues.apache.org/jira/browse/HADOOP-3892 but may have been
 broken by the security changes present in 0.20.205. You'll need the
 fix presented in  https://issues.apache.org/jira/browse/MAPREDUCE-2457
 to have group.name support.

 On Thu, Mar 1, 2012 at 6:42 PM, Austin Chungath austi...@gmail.com
 wrote:
   I am running fair scheduler on hadoop 0.20.205.0
 
  http://hadoop.apache.org/common/docs/r0.20.205.0/fair_scheduler.html
  The above page talks about the following property
 
  *mapred.fairscheduler.poolnameproperty*
  which I can set to *group.name*
  The default is user.name and when a user submits a job the fair
 scheduler
  assigns each user's job to a pool which has the name of the user.
  I am trying to change it to group.name so that the job is submitted to a
  pool which has the name of the user's linux group. Thus all jobs from any
  user from a specific group go to the same pool instead of an individual
  pool for every user.
  But *group.name* doesn't seem to work, has anyone tried this before?
 
  *user.name* and *mapred.job.queue.name* work. Is group.name supported in
  0.20.205.0? I don't see it mentioned in the docs.
 
  Thanks,
  Austin



 --
 Harsh J



Re: Hadoop fair scheduler doubt: allocate jobs to pool

2012-03-01 Thread Austin Chungath
Thanks,
I will be trying the suggestions and will get back to you soon.

On Thu, Mar 1, 2012 at 8:09 PM, Dave Shine 
dave.sh...@channelintelligence.com wrote:

 I've just started playing with the Fair Scheduler.  To specify the pool at
 job submission time you set the mapred.fairscheduler.pool property on the
 Job Conf to the name of the pool you want the job to use.
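
 If the job is driven through ToolRunner/GenericOptionsParser, the same
 property can also be passed on the command line; a hypothetical example
 (the jar, class and paths are made up, and whether mapred.fairscheduler.pool
 is honoured depends on the fair scheduler version you run):

 hadoop jar myjob.jar com.example.MyJob \
   -D mapred.fairscheduler.pool=hadoop \
   /input /output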

 Dave


 -Original Message-
 From: Merto Mertek [mailto:masmer...@gmail.com]
 Sent: Thursday, March 01, 2012 9:33 AM
 To: common-user@hadoop.apache.org
 Subject: Re: Hadoop fair scheduler doubt: allocate jobs to pool

 From the fairscheduler docs I assume the following should work:

 <property>
   <name>mapred.fairscheduler.poolnameproperty</name>
   <value>pool.name</value>
 </property>

 <property>
   <name>pool.name</name>
   <value>${mapreduce.job.group.name}</value>
 </property>

 which means that the default pool will be the group of the user that has
 submitted the job. In your case I think that allocations.xml is correct. If
  you want to explicitly assign a job to a specific pool from your
  allocation.xml file, you can define it as follows:

  Configuration conf3 = conf;
  conf3.set("pool.name", "pool3"); // conf.set("property.name", "value")

 Let me know if it works..


 On 29 February 2012 14:18, Austin Chungath austi...@gmail.com wrote:

  How can I set the fair scheduler such that all jobs submitted from a
  particular user group go to a pool with the group name?
 
  I have setup fair scheduler and I have two users: A and B (belonging
  to the user group hadoop)
 
   When these users submit hadoop jobs, the jobs from A go to a pool
   named A and the jobs from B go to a pool named B.
    I want them to go to a pool with their group name, so I tried adding
  the following to mapred-site.xml:
 
   <property>
     <name>mapred.fairscheduler.poolnameproperty</name>
     <value>group.name</value>
   </property>
 
  But instead the jobs now go to the default pool.
  I want the jobs submitted by A and B to go to the pool named hadoop.
  How do I do that?
   Also, how can I explicitly set a job to any specified pool?
 
  I have set the allocation file (fair-scheduler.xml) like this:
 
   <allocations>
     <pool name="hadoop">
       <minMaps>1</minMaps>
       <minReduces>1</minReduces>
       <maxMaps>3</maxMaps>
       <maxReduces>3</maxReduces>
     </pool>
     <userMaxJobsDefault>5</userMaxJobsDefault>
   </allocations>
 
  Any help is greatly appreciated.
  Thanks,
  Austin
 




Re: Hadoop fair scheduler doubt: allocate jobs to pool

2012-03-01 Thread Austin Chungath
Hi,
I tried what you had said. I added the following to mapred-site.xml:


<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>pool.name</value>
</property>

<property>
  <name>pool.name</name>
  <value>${mapreduce.job.group.name}</value>
</property>

Funnily enough, it created a pool with the literal name
${mapreduce.job.group.name}, so I tried ${mapred.job.group.name} and
${group.name}, all to the same effect.

But when I did ${user.name} it worked! It created a pool with the user
name.



On Thu, Mar 1, 2012 at 8:03 PM, Merto Mertek masmer...@gmail.com wrote:

 From the fairscheduler docs I assume the following should work:

  <property>
    <name>mapred.fairscheduler.poolnameproperty</name>
    <value>pool.name</value>
  </property>

  <property>
    <name>pool.name</name>
    <value>${mapreduce.job.group.name}</value>
  </property>

 which means that the default pool will be the group of the user that has
 submitted the job. In your case I think that allocations.xml is correct. If
  you want to explicitly assign a job to a specific pool from your
  allocation.xml file, you can define it as follows:

  Configuration conf3 = conf;
  conf3.set("pool.name", "pool3"); // conf.set("property.name", "value")

 Let me know if it works..


 On 29 February 2012 14:18, Austin Chungath austi...@gmail.com wrote:

  How can I set the fair scheduler such that all jobs submitted from a
  particular user group go to a pool with the group name?
 
  I have setup fair scheduler and I have two users: A and B (belonging to
 the
  user group hadoop)
 
   When these users submit hadoop jobs, the jobs from A go to a pool named A
   and the jobs from B go to a pool named B.
    I want them to go to a pool with their group name, so I tried adding the
  following to mapred-site.xml:
 
   <property>
     <name>mapred.fairscheduler.poolnameproperty</name>
     <value>group.name</value>
   </property>
 
  But instead the jobs now go to the default pool.
  I want the jobs submitted by A and B to go to the pool named hadoop.
 How
  do I do that?
   Also, how can I explicitly set a job to any specified pool?
 
  I have set the allocation file (fair-scheduler.xml) like this:
 
   <allocations>
     <pool name="hadoop">
       <minMaps>1</minMaps>
       <minReduces>1</minReduces>
       <maxMaps>3</maxMaps>
       <maxReduces>3</maxReduces>
     </pool>
     <userMaxJobsDefault>5</userMaxJobsDefault>
   </allocations>
 
  Any help is greatly appreciated.
  Thanks,
  Austin
 



Hadoop fair scheduler doubt: allocate jobs to pool

2012-02-29 Thread Austin Chungath
How can I set the fair scheduler such that all jobs submitted from a
particular user group go to a pool with the group name?

I have set up the fair scheduler and I have two users: A and B (belonging to the
user group hadoop).

When these users submit hadoop jobs, the jobs from A go to a pool named A
and the jobs from B go to a pool named B.
 I want them to go to a pool with their group name, so I tried adding the
following to mapred-site.xml:

<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>group.name</value>
</property>

But instead the jobs now go to the default pool.
I want the jobs submitted by A and B to go to the pool named hadoop. How
do I do that?
Also, how can I explicitly set a job to any specified pool?

I have set the allocation file (fair-scheduler.xml) like this:

<allocations>
  <pool name="hadoop">
    <minMaps>1</minMaps>
    <minReduces>1</minReduces>
    <maxMaps>3</maxMaps>
    <maxReduces>3</maxReduces>
  </pool>
  <userMaxJobsDefault>5</userMaxJobsDefault>
</allocations>

Any help is greatly appreciated.
Thanks,
Austin


Re: hadoop streaming : need help in using custom key value separator

2012-02-28 Thread Austin Chungath
Thanks Subir,

-D stream.mapred.output.field.separator=* is not an available option, my
bad.
What I should have done is:

-D stream.map.output.field.separator=*
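
For reference, the earlier command with the corrected option would then look
roughly like this (same jar and paths as in the original post):

hadoop jar $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar \
  -D stream.map.output.field.separator=* \
  -D mapred.reduce.tasks=2 \
  -mapper ./map.py \
  -reducer ./reducer.py \
  -file ./map.py \
  -file ./reducer.py \
  -input /user/inputdata \
  -output /user/outputdata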
On Tue, Feb 28, 2012 at 2:36 PM, Subir S subir.sasiku...@gmail.com wrote:


 http://hadoop.apache.org/common/docs/current/streaming.html#Customizing+How+Lines+are+Split+into+Key%2FValue+Pairs

 Read this link, your options are wrong below.



 On Tue, Feb 28, 2012 at 1:13 PM, Austin Chungath austi...@gmail.com
 wrote:

  When I am using more than one reducer in hadoop streaming where I am
 using
   my custom separator rather than the tab, it looks like the hadoop
 shuffling
  process is not happening as it should.
 
  This is the reducer output when I am using '\t' to separate my key value
  pair that is output from the mapper.
 
  *output from reducer 1:*
  10321,22
  23644,37
  41231,42
  23448,20
  12325,39
  71234,20
  *output from reducer 2:*
  24123,43
  33213,46
  11321,29
  21232,32
 
   the above output is as expected: the first column is the key and the
   second value is the count. There are 10 unique keys and 6 of them are in
   the output of the first reducer and the remaining 4 in the second
   reducer's output.
 
   But now I use a custom separator for my key-value pair output from
   my mapper. Here I am using '*' as the separator:
  -D stream.mapred.output.field.separator=*
  -D mapred.reduce.tasks=2
 
  *output from reducer 1:*
  10321,5
  21232,19
  24123,16
  33213,28
  23644,21
  41231,12
  23448,18
  11321,29
  12325,24
  71234,9
  *output from reducer 2:*
   10321,17
  21232,13
  33213,18
  23644,16
  41231,30
  23448,2
  24123,27
  12325,15
  71234,11
 
   Now both the reducers are getting all the keys, and part of the values go
   to reducer 1 and part of the values go to reducer 2.
   Why is it behaving like this when I am using a custom separator?
   Shouldn't each reducer get its own unique keys after the shuffle?
   I am using Hadoop 0.20.205.0 and below is the command that I am using to
   run hadoop streaming. Are there more options that I should specify for
   hadoop streaming to work properly if I am using a custom separator?
 
  hadoop jar
  $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar
  -D stream.mapred.output.field.separator=*
  -D mapred.reduce.tasks=2
  -mapper ./map.py
  -reducer ./reducer.py
  -file ./map.py
  -file ./reducer.py
  -input /user/inputdata
  -output /user/outputdata
  -verbose
 
 
  Any help is much appreciated,
  Thanks,
  Austin
 



hadoop streaming : need help in using custom key value separator

2012-02-27 Thread Austin Chungath
When I am using more than one reducer in hadoop streaming, where I am using
my custom separator rather than the tab, it looks like the hadoop shuffling
process is not happening as it should.

This is the reducer output when I am using '\t' to separate my key value
pair that is output from the mapper.

*output from reducer 1:*
10321,22
23644,37
41231,42
23448,20
12325,39
71234,20
*output from reducer 2:*
24123,43
33213,46
11321,29
21232,32

the above output is as expected: the first column is the key and the second
value is the count. There are 10 unique keys and 6 of them are in the output of
the first reducer and the remaining 4 in the second reducer's output.

But now I use a custom separator for my key-value pair output from my
mapper. Here I am using '*' as the separator:
-D stream.mapred.output.field.separator=*
-D mapred.reduce.tasks=2

*output from reducer 1:*
10321,5
21232,19
24123,16
33213,28
23644,21
41231,12
23448,18
11321,29
12325,24
71234,9
*output from reducer 2:*
10321,17
21232,13
33213,18
23644,16
41231,30
23448,2
24123,27
12325,15
71234,11

Now both the reducers are getting all the keys, and part of the values go to
reducer 1 and part of the values go to reducer 2.
Why is it behaving like this when I am using a custom separator? Shouldn't
each reducer get its own unique keys after the shuffle?
I am using Hadoop 0.20.205.0 and below is the command that I am using to
run hadoop streaming. Are there more options that I should specify for
hadoop streaming to work properly if I am using a custom separator?

hadoop jar
$HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar
-D stream.mapred.output.field.separator=*
-D mapred.reduce.tasks=2
-mapper ./map.py
-reducer ./reducer.py
-file ./map.py
-file ./reducer.py
-input /user/inputdata
-output /user/outputdata
-verbose


Any help is much appreciated,
Thanks,
Austin