Re: Copy Vs DistCP

Ted Dunning Sat, 13 Apr 2013 21:15:34 -0700

Lance,

Never say never.


Linux programs can read from the right kind of Hadoop cluster without using
FUSE.




On Fri, Apr 12, 2013 at 10:15 AM, Lance Norskog <goks...@gmail.com> wrote:

>  Shell 'cp' only works if you use 'fuse', which makes the HDFS file system
> visible as a Unix mounted file system. Otherwise, Unix programs cannot read
> or write HDFS files.
>
> On 04/11/2013 09:52 AM, KayVajj wrote:
>
>    Summing up, what would be the recommendations for copying?
>
>  1) DistCP
>  2) shell cp command
>  3) Using the File System API (FileUtils to be precise) inside a Java program
>  4) An MR job with an identity mapper and no reducer (maybe this is what
> DistCP does)
>
>
>  I did not run any comparisons, as my dev cluster is just a two-node
> cluster and I am not sure how this would perform on a production cluster.
>
>  Kay
>
>
> On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <jayunit...@gmail.com> wrote:
>
>>  Yes, makes sense... cp is serial and simpler, and does not rely on the
>> JobTracker, whereas distcp only submits a job and waits for
>> completion.
>> So it can fail if tasks start to fail or time out.
>>  I have seen distcp fail and hang before, albeit not often.
>>
>> Sent from my iPhone
>>
>> On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov <apivova...@gmail.com>
>> wrote:
>>
>>   If the cluster is busy with other jobs, distcp will wait for free map
>> slots. Regular cp is more reliable and predictable, especially if you need
>> to copy just several GB.
>> On Apr 10, 2013 6:31 PM, "Azuryy Yu" <azury...@gmail.com> wrote:
>>
>>>  The cp command is not parallel. It just calls FileSystem, even though
>>> DFSClient is multithreaded.
>>>
>>>  DistCp can work well on the same cluster.
>>>
>>>
>>>  On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <vajjalak...@gmail.com> wrote:
>>>
>>>>  The File System copy utility copies files byte by byte, if I'm not
>>>> wrong. Could it be that the cp command works with blocks and moves
>>>> them, which could be significantly more efficient?
>>>>
>>>>
>>>>  Also, how does the cp command work if the file is distributed across
>>>> different data nodes?
>>>>
>>>>  Thanks
>>>>  Kay
>>>>
>>>>
>>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <jayunit...@gmail.com> wrote:
>>>>
>>>>>  DistCP is a full-blown MapReduce job (mapper only, where the mappers
>>>>> do a "fully" parallel copy to the destination).
>>>>>
>>>>> CP appears (correct me if I'm wrong) to simply invoke the FileSystem
>>>>> and issue a copy command for every source file.
>>>>>
>>>>>  I have an additional question: how is a CP that is internal to a
>>>>> cluster optimized (if at all)?
>>>>>
>>>>>
>>>>>
>>>>>  On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <shurong....@qunar.com> wrote:
>>>>>
>>>>>>  Hi,
>>>>>>
>>>>>> I think it's better to use cp within the same cluster and DistCP
>>>>>> between clusters. The cp command is a Hadoop-internal parallel process
>>>>>> and will not copy files locally.
>>>>>>
>>>>>> ------------------------------
>>>>>>  麦树荣
>>>>>>
>>>>>>  *From:* KayVajj <vajjalak...@gmail.com>
>>>>>> *Date:* 2013-04-11 06:20
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* Copy Vs DistCP
>>>>>>        I have a few questions regarding the usage of DistCP for
>>>>>> copying files in the same cluster.
>>>>>>
>>>>>>
>>>>>> 1) Which one is better within the same cluster, and what factors (like
>>>>>> file size etc.) would influence the usage of one over the other?
>>>>>>
>>>>>>  2) When we run a cp command like the one below from a client node of the
>>>>>> cluster (not a data node), how does the cp command work?
>>>>>>       i) like an MR job
>>>>>>      ii) copies the files locally and then copies them back to the new
>>>>>> location.
>>>>>>
>>>>>>  Example of the copy command
>>>>>>
>>>>>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>>>
>>>>>>  Thanks, your responses are appreciated.
>>>>>>
>>>>>>  -- Kay
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>  --
>>>>> Jay Vyas
>>>>> http://jayunit100.blogspot.com
>>>>>
>>>>
>>>>
>>>
>
>
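The contrast drawn in the thread (cp issues one copy per source file from a
single client, while DistCP spreads the copying across map tasks) can be
sketched with a local-filesystem analogy. This is only an illustration under
stated assumptions: it uses Python's standard library rather than any Hadoop
API, a thread pool stands in for the map tasks, and the names copy_serial,
copy_parallel and make_sample_files are hypothetical helpers, not part of
Hadoop.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def make_sample_files(directory, count=3):
    """Create a few small files to copy (hypothetical test data)."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        path = directory / f"part-{i:05d}"
        path.write_bytes(f"record {i}\n".encode() * 1000)
        paths.append(path)
    return paths


def copy_serial(sources, dest_dir):
    """cp-like: one client copies the files one after another."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sources:
        target = dest_dir / Path(src).name
        shutil.copyfile(src, target)  # all bytes flow through this process
        copied.append(target)
    return copied


def copy_parallel(sources, dest_dir, workers=4):
    """Loosely DistCP-like: each file becomes a task for a pool of workers.

    Assumption: threads stand in for map tasks; real DistCP runs them as a
    MapReduce job and therefore depends on free map slots.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    def copy_one(src):
        target = dest_dir / Path(src).name
        shutil.copyfile(src, target)
        return target

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the inputs
        return list(pool.map(copy_one, sources))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        sources = make_sample_files(root / "src")
        print(len(copy_serial(sources, root / "serial")), "files copied serially")
        print(len(copy_parallel(sources, root / "parallel")), "files copied in parallel")
```

Both functions produce the same files; the difference is only in who does the
work. The serial version needs nothing beyond the client itself, which matches
the remark above that cp is simpler and more predictable for a few GB, while
the pooled version pays a scheduling cost that is worthwhile mainly for large
copies.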
