Lance,

Never say never. Linux programs can read from the right kind of Hadoop cluster without using FUSE.

On Fri, Apr 12, 2013 at 10:15 AM, Lance Norskog <goks...@gmail.com> wrote:

> Shell 'cp' only works if you use 'fuse', which makes the HDFS file system
> visible as a Unix mounted file system. Otherwise, Unix programs cannot read
> or write HDFS files.
>
> On 04/11/2013 09:52 AM, KayVajj wrote:
>
> Summing up, what would be the recommendations for copy?
>
> 1) DistCP
> 2) shell cp command
> 3) Using the File System API (FileUtils to be precise) inside of a Java program
> 4) An MR job with an Identity Mapper and no Reducer (maybe this is what
> DistCP does)
>
> I did not run any comparisons as my dev cluster is just a two-node
> cluster, and I am not sure how this would perform on a production cluster.
>
> Kay
>
> On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <jayunit...@gmail.com> wrote:
>
>> Yes, makes sense... cp is serialized and simpler, and does not rely on
>> the jobtracker, whereas distcp actually only submits a job and waits for
>> completion.
>> So it can fail if tasks start to fail or time out.
>> I have seen distcp fail and hang before, albeit not often.
>>
>> Sent from my iPhone
>>
>> On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov <apivova...@gmail.com>
>> wrote:
>>
>> If the cluster is busy with other jobs, distcp will wait for free map
>> slots. Regular cp is more reliable and predictable, especially if you need
>> to copy just several GB.
>>
>> On Apr 10, 2013 6:31 PM, "Azuryy Yu" <azury...@gmail.com> wrote:
>>
>>> The cp command is not parallel; it just calls FileSystem, even if
>>> DFSClient has multiple threads.
>>>
>>> DistCp can work well on the same cluster.
>>>
>>> On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <vajjalak...@gmail.com> wrote:
>>>
>>>> The File System Copy utility copies files byte by byte, if I'm not
>>>> wrong. Could it be possible that the cp command works with blocks and moves
>>>> them, which could be significantly more efficient?
>>>>
>>>> Also, how does the cp command work if the file is distributed across
>>>> different data nodes?
>>>>
>>>> Thanks
>>>> Kay
>>>>
>>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <jayunit...@gmail.com> wrote:
>>>>
>>>>> DistCP is a full-blown mapreduce job (mapper only, where the mappers
>>>>> do a "fully" parallel copy to the destination).
>>>>>
>>>>> CP appears (correct me if I'm wrong) to simply invoke the FileSystem
>>>>> and issue a copy command for every source file.
>>>>>
>>>>> I have an additional question: how is CP, which is internal to a
>>>>> cluster, optimized (if at all)?
>>>>>
>>>>> On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <shurong....@qunar.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I think it's better to use Copy within the same cluster and distCP
>>>>>> between clusters, and the cp command is a hadoop internal parallel process
>>>>>> and will not copy files locally.
>>>>>>
>>>>>> ------------------------------
>>>>>> 麦树荣
>>>>>>
>>>>>> *From:* KayVajj <vajjalak...@gmail.com>
>>>>>> *Date:* 2013-04-11 06:20
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* Copy Vs DistCP
>>>>>>
>>>>>> I have a few questions regarding the usage of DistCP for
>>>>>> copying files in the same cluster.
>>>>>>
>>>>>> 1) Which one is better within the same cluster, and what factors (like
>>>>>> file size etc.) would influence the usage of one over the other?
>>>>>>
>>>>>> 2) When we run a cp command like the one below from a client node of the
>>>>>> cluster (not a data node), how does the cp command work?
>>>>>> i) like an MR job
>>>>>> ii) copy files locally and then copy them back to the new
>>>>>> location.
>>>>>>
>>>>>> Example of the copy command
>>>>>>
>>>>>> hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>>>
>>>>>> Thanks, your responses are appreciated.
>>>>>>
>>>>>> -- Kay
>>>>>
>>>>> --
>>>>> Jay Vyas
>>>>> http://jayunit100.blogspot.com
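
As one illustration of reading HDFS data from an ordinary program without a FUSE mount, here is a minimal sketch that builds a WebHDFS read URL. This is only one possible route and not necessarily what "the right kind of Hadoop cluster" refers to at the top of this message. It assumes WebHDFS is enabled on the cluster and that the NameNode HTTP port is 50070 (a common default at the time; adjust for your cluster). The function name and the host name are hypothetical, chosen for the example.

```python
from urllib.parse import quote, urlencode


def webhdfs_open_url(namenode_host, hdfs_path, port=50070, user=None):
    """Build a WebHDFS URL that opens (reads) a file stored in HDFS.

    Assumptions: WebHDFS is enabled on the cluster and `port` is the
    NameNode HTTP port. This helper is illustrative, not a Hadoop API.
    """
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be an absolute HDFS path")
    params = {"op": "OPEN"}
    if user is not None:
        # Simple (non-Kerberos) authentication passes the user this way.
        params["user.name"] = user
    return "http://{}:{}/webhdfs/v1{}?{}".format(
        namenode_host, port, quote(hdfs_path), urlencode(params)
    )


if __name__ == "__main__":
    # Hypothetical host and path, for illustration only.
    print(webhdfs_open_url("namenode.example.com", "/data/file.txt"))
```

Any HTTP client can then fetch the resulting URL, for example curl with the -L option so that it follows the NameNode's redirect to a DataNode. Like the shell cp discussed above, this is a single-stream read, so it does not give the parallelism of DistCp.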