Re: Copy Vs DistCP

2013-04-14 Thread Mathias Herberts
That was a hidden shameless plug Ted ;-)

The main disadvantage of fs -cp is that all data has to transit via the
machine you issue the command on. Depending on the size of the data you want
to copy, that can be a killer. DistCp is distributed, as its name implies, so
there is no bottleneck of this kind.
On Apr 14, 2013 6:15 AM, Ted Dunning tdunn...@maprtech.com wrote:


 Lance,

 Never say never.

 Linux programs can read from the right kind of Hadoop cluster without
 using FUSE.





Re: Copy Vs DistCP

2013-04-14 Thread Ted Dunning
Inline


On Sun, Apr 14, 2013 at 1:13 AM, Mathias Herberts 
mathias.herbe...@gmail.com wrote:

 That was a hidden shameless plug Ted ;-)


Well, I will admit it was a shameless correction to Lance's absolute and
incorrect claim.


 The main disadvantage of fs -cp is that all data has to transit via the
 machine you issue the command on. Depending on the size of the data you want
 to copy, that can be a killer. DistCp is distributed, as its name implies, so
 there is no bottleneck of this kind.


This is absolutely true.  Distcp dominates cp for large copies.  On the
other hand, cp dominates distcp for convenience.

In my own experience, I love cp when copying relatively small amounts of
data (tens of GB), where the available bandwidth of about a GB/s allows the
copy to complete in less time than it takes distcp to get started.

At larger sizes (hundreds of GB and up), the startup time of distcp doesn't
matter because once it gets going, it moves data much faster.
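
A rough way to see where the crossover lies is a back-of-envelope estimate. The sketch below is illustrative Python and has nothing to do with Hadoop's own code. The 1 GB/s client bandwidth is the figure quoted above; the 30-second job startup and the 5 GB/s aggregate distcp rate are made-up assumptions that you would replace with measurements from your own cluster.

```python
def cp_seconds(size_gb, client_gb_per_s=1.0):
    """Time for a single-client copy, limited by the client's bandwidth."""
    return size_gb / client_gb_per_s


def distcp_seconds(size_gb, startup_s=30.0, aggregate_gb_per_s=5.0):
    """Time for a distributed copy: fixed job startup plus parallel transfer.

    startup_s and aggregate_gb_per_s are assumed values, not measurements.
    """
    return startup_s + size_gb / aggregate_gb_per_s


def break_even_gb(client_gb_per_s=1.0, startup_s=30.0, aggregate_gb_per_s=5.0):
    """Size at which both approaches take the same time."""
    # size / client = startup + size / aggregate, solved for size.
    return startup_s / (1.0 / client_gb_per_s - 1.0 / aggregate_gb_per_s)


if __name__ == "__main__":
    for size in (10, 100, 1000):
        print(size, "GB:", cp_seconds(size), "s via cp,",
              distcp_seconds(size), "s via distcp")
    print("break-even at about", break_even_gb(), "GB")
```

With these assumed numbers the break-even falls in the tens of GB, which matches the experience described above, but the real figure depends entirely on the cluster.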






Re: Copy Vs DistCP

2013-04-14 Thread Mathias Herberts

 This is absolutely true.  Distcp dominates cp for large copies.  On the
other hand cp dominates distcp for convenience.

 In my own experience, I love cp when copying relatively small amounts of
data (tens of GB), where the available bandwidth of about a GB/s allows the
copy to complete in less time than it takes distcp to get started.

 At larger sizes (100's of GB and up), the startup time of distcp doesn't
matter because once it gets going, it moves data much faster.

Maybe we could put together a 'fs -smartcp' which chooses wisely between
copy and distcp depending on file size.
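
A minimal sketch of what such a wrapper might look like, written in Python for illustration. The function name and the 50 GiB threshold are hypothetical; no 'smartcp' command exists in Hadoop. The caller is assumed to already know the total size of the source, and the function only builds the command line without running it.

```python
def choose_copy_command(src, dst, size_bytes, threshold_bytes=50 * 1024**3):
    """Return the argv list for whichever tool the size heuristic picks.

    threshold_bytes is an assumed cut-off, not a recommended value.
    """
    if size_bytes < threshold_bytes:
        # Small copy: single-client copy, no job startup cost.
        return ["hdfs", "dfs", "-cp", src, dst]
    # Large copy: distributed copy run as a MapReduce job.
    return ["hadoop", "distcp", src, dst]


if __name__ == "__main__":
    print(choose_copy_command("/some_location/file", "/new_location/", 2 * 1024**3))
    print(choose_copy_command("/some_location", "/new_location", 500 * 1024**3))
```

The returned list could be handed to subprocess.run by the caller; keeping the decision separate from the execution makes the heuristic easy to test.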


Re: Copy Vs DistCP

2013-04-14 Thread Ted Dunning
On Sun, Apr 14, 2013 at 10:33 AM, Mathias Herberts 
mathias.herbe...@gmail.com wrote:


 
  This is absolutely true.  Distcp dominates cp for large copies.  On the
 other hand cp dominates distcp for convenience.
 
  In my own experience, I love cp when copying relatively small amounts of
 data (tens of GB), where the available bandwidth of about a GB/s allows the
 copy to complete in less time than it takes distcp to get started.
 
  At larger sizes (hundreds of GB and up), the startup time of distcp doesn't
 matter because once it gets going, it moves data much faster.

 Maybe we could put together a 'fs -smartcp' which chooses wisely between
 copy and distcp depending on file size.


Uh... hmm...

This is a good suggestion.  Obvious in fact.  In retrospect.

I would also suggest that the new command be called distcp.


Re: Copy Vs DistCP

2013-04-13 Thread Ted Dunning
Lance,

Never say never.

Linux programs can read from the right kind of Hadoop cluster without using
FUSE.




On Fri, Apr 12, 2013 at 10:15 AM, Lance Norskog goks...@gmail.com wrote:

  Shell 'cp' only works if you use 'fuse', which makes the HDFS file system
 visible as a Unix mounted file system. Otherwise, Unix programs cannot read
 or write HDFS files.


Re: Copy Vs DistCP

2013-04-12 Thread Lance Norskog
Shell 'cp' only works if you use 'fuse', which makes the HDFS file 
system visible as a Unix mounted file system. Otherwise, Unix programs 
cannot read or write HDFS files.


On 04/11/2013 09:52 AM, KayVajj wrote:

Summing up, which of these would you recommend for copying?

1) DistCP
2) shell cp command
3) Using the File System API (FileUtil, to be precise) inside a Java program
4) An MR job with an identity mapper and no reducer (maybe this is what
DistCP does)



I did not run any comparisons as my dev cluster is just a two node 
cluster and not sure how this would perform on a production cluster.


Kay



Re: Copy Vs DistCP

2013-04-11 Thread Hemanth Yamijala
AFAIK, the cp command works entirely from the DFS client. It reads bytes from
the InputStream created when the source file is opened and writes them to the
OutputStream of the destination file. It does not work at the level of data
blocks. The io.file.buffer.size configuration property sets the size of the
buffer used in the copy; it is 4096 bytes by default.
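
The read/write loop described above can be sketched as follows. This is a Python stand-in using in-memory streams, not Hadoop's Java implementation; the function name is hypothetical, and the only detail taken from the thread is the 4096-byte buffer that mirrors the io.file.buffer.size default.

```python
import io

BUFFER_SIZE = 4096  # mirrors the io.file.buffer.size default mentioned above


def copy_stream(src, dst, buffer_size=BUFFER_SIZE):
    """Copy src to dst one buffer at a time and return the bytes copied."""
    total = 0
    while True:
        chunk = src.read(buffer_size)
        if not chunk:
            # An empty read means the source is exhausted.
            break
        dst.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    source = io.BytesIO(b"x" * 10000)
    target = io.BytesIO()
    print(copy_stream(source, target), "bytes copied")
```

Every byte passes through the process running the loop, which is why the machine issuing the command becomes the bottleneck for large copies.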

Thanks
Hemanth


On Thu, Apr 11, 2013 at 9:42 AM, KayVajj vajjalak...@gmail.com wrote:

 If the cp command is not parallel, how does it work for a file partitioned
 across various data nodes?



Re: Copy Vs DistCP

2013-04-11 Thread Azuryy Yu
Yes, you are right.


On Thu, Apr 11, 2013 at 3:40 PM, Hemanth Yamijala yhema...@thoughtworks.com
 wrote:

 AFAIK, the cp command works fully from the DFS client. It reads bytes from
 the InputStream created when the file is opened and writes the same to the
 OutputStream of the file. It does not work at the level of data blocks. A
 configuration io.file.buffer.size is used as the size of the buffer used in
 copy - set to 4096 by default.

 Thanks
 Hemanth



Re: Copy Vs DistCP

2013-04-11 Thread KayVajj
Summing up, which of these would you recommend for copying?

1) DistCP
2) shell cp command
3) Using the File System API (FileUtil, to be precise) inside a Java program
4) An MR job with an identity mapper and no reducer (maybe this is what DistCP
does)


I did not run any comparisons, as my dev cluster is just a two-node cluster
and I am not sure how these would perform on a production cluster.

Kay


On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas jayunit...@gmail.com wrote:

 Yes, makes sense...  cp is serialized and simpler, and does not rely on
 the JobTracker, whereas distcp only submits a job and waits for
 completion.
 So it can fail if tasks start to fail or time out.
  I have seen distcp fail and hang before, albeit not often.

 Sent from my iPhone

 On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov apivova...@gmail.com
 wrote:

 If the cluster is busy with other jobs, distcp will wait for free map slots.
 Regular cp is more reliable and predictable, especially if you need to copy
 just several GB.

Re: Copy Vs DistCP

2013-04-11 Thread Azuryy Yu
DistCP is preferable for your requirements.


On Fri, Apr 12, 2013 at 12:52 AM, KayVajj vajjalak...@gmail.com wrote:

 Summing up, which of these would you recommend for copying?

 1) DistCP
 2) shell cp command
 3) Using the File System API (FileUtil, to be precise) inside a Java program
 4) An MR job with an identity mapper and no reducer (maybe this is what DistCP
 does)


 I did not run any comparisons, as my dev cluster is just a two-node cluster
 and I am not sure how these would perform on a production cluster.

 Kay



Re: Copy Vs DistCP

2013-04-10 Thread 麦树荣
Hi,

I think it's better to use cp within the same cluster and DistCp between
clusters. The cp command is a Hadoop-internal parallel process and will not
copy files locally.


麦树荣

From: KayVajj vajjalak...@gmail.com
Date: 2013-04-11 06:20
To: user@hadoop.apache.org
Subject: Copy Vs DistCP
I have a few questions regarding the usage of DistCP for copying files in the
same cluster.


1) Which one is better within the same cluster, and what factors (like file
size etc.) would influence the usage of one over the other?

2) When we run a cp command like the one below from a client node of the
cluster (not a data node), how does the cp command work?
 i) like an MR job
ii) copy the files locally and then copy them back to the new location.

Example of the copy command

hdfs dfs -cp /some_location/file /new_location/

Thanks, your responses are appreciated.

-- Kay


Re: Copy Vs DistCP

2013-04-10 Thread Jay Vyas
DistCP is a full-blown MapReduce job (mapper only, where the mappers do a
fully parallel copy to the destination).
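
To make the mapper-only idea concrete, here is a simplified Python sketch of how a list of files might be divided among parallel copy tasks. It is not DistCp's actual splitting code; it assumes the goal is to give each mapper a similar number of bytes, and the function name and greedy strategy are illustrative choices.

```python
import heapq


def split_among_mappers(file_sizes, num_mappers):
    """Greedily assign files to mappers so each gets a similar byte total.

    file_sizes maps a path to its size in bytes. Returns one list of paths
    per mapper.
    """
    # Min-heap of (bytes assigned so far, mapper index).
    heap = [(0, i) for i in range(num_mappers)]
    heapq.heapify(heap)
    assignments = [[] for _ in range(num_mappers)]
    # Place the largest files first; each goes to the least-loaded mapper.
    for path, size in sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        assignments[idx].append(path)
        heapq.heappush(heap, (load + size, idx))
    return assignments


if __name__ == "__main__":
    sizes = {"/data/a": 100, "/data/b": 60, "/data/c": 40, "/data/d": 10}
    for mapper, paths in enumerate(split_among_mappers(sizes, 2)):
        print("mapper", mapper, "copies", paths)
```

Each mapper would then run its own read/write loop over its share of the files, so the aggregate throughput scales with the number of mappers rather than with one client's bandwidth.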

CP appears (correct me if I'm wrong) to simply invoke the FileSystem and
issue a copy command for every source file.

I have an additional question: how is CP, which is internal to a cluster,
optimized (if at all)?


On Wed, Apr 10, 2013 at 6:20 PM, KayVajj vajjalak...@gmail.com wrote:

 I have a few questions regarding the usage of DistCP for copying files in
 the same cluster.


 1) Which one is better within the same cluster, and what factors (like file
 size etc.) would influence the usage of one over the other?

 2) When we run a cp command like the one below from a client node of the
 cluster (not a data node), how does the cp command work?
  i) like an MR job
 ii) copy the files locally and then copy them back to the new location.

 Example of the copy command

 hdfs dfs -cp /some_location/file /new_location/

 Thanks, your responses are appreciated.

 -- Kay




-- 
Jay Vyas
http://jayunit100.blogspot.com


Re: Copy Vs DistCP

2013-04-10 Thread KayVajj
The File System copy utility copies files byte by byte, if I'm not wrong.
Could the cp command instead work with blocks and move them, which could be
significantly more efficient?


Also, how does the cp command work if the file is distributed across different
data nodes?

Thanks
Kay


On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas jayunit...@gmail.com wrote:

 DistCP is a full-blown MapReduce job (mapper only, where the mappers do a
 fully parallel copy to the destination).

 CP appears (correct me if I'm wrong) to simply invoke the FileSystem and
 issue a copy command for every source file.

 I have an additional question: how is CP which is internal to a cluster
 optimized (if at all) ?




Re: Copy Vs DistCP

2013-04-10 Thread KayVajj
If the cp command is not parallel, how does it work for a file partitioned
across various data nodes?


On Wed, Apr 10, 2013 at 6:30 PM, Azuryy Yu azury...@gmail.com wrote:

 The cp command is not parallel. It just calls FileSystem, even though
 DFSClient has multiple threads.

 DistCp can work well on the same cluster.

