Kevin,

Slurper can help here:
https://github.com/alexholmes/hdfs-file-slurper
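If Slurper doesn't fit, one quick way to parallelize the puts from the client is GNU xargs. A minimal sketch, assuming the NFS mount point is /mnt/nfsdata and the target directory /data/incoming already exists in HDFS (both names are placeholders):

```shell
# Feed every file under the NFS mount to `hadoop fs -put`,
# running up to 8 copies concurrently (-P 8). -print0/-0 keeps
# filenames with spaces safe; -n 1 gives each put one file.
find /mnt/nfsdata -type f -print0 \
  | xargs -0 -n 1 -P 8 -I {} hadoop fs -put {} /data/incoming/
```

Tune -P to whatever the client's NIC and the NFS server can sustain; past a point the bottleneck is the single client's bandwidth, not HDFS.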

BR,
 Alexander 


> On 13 Feb 2015, at 14:28, Kevin <kevin.macksa...@gmail.com> wrote:
> 
> Hi,
> 
> I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a thousand or 
> so files into HDFS, which totals roughly 1 TB. The cluster will be isolated 
> on its own private LAN with a single client machine that is connected to the 
> Hadoop cluster as well as the public network. The data that needs to be 
> copied into HDFS is mounted as an NFS on the client machine.
> 
> I can run `hadoop fs -put` concurrently on the client machine to try and 
> increase the throughput.
> 
> If these files were able to be accessed by each node in the Hadoop cluster, 
> then I could write a MapReduce job to copy a number of files from the network 
> into HDFS. I could not find anything in the documentation saying that 
> `distcp` works with locally hosted files (its code in the tools package 
> doesn't show any sign of it either) - but I wouldn't expect it to.
> 
> In general, are there any other ways of copying a very large number of 
> client-local files to HDFS? I searched the mail archives for a similar 
> question and didn't come across one. I'm sorry if this is a duplicate 
> question.
> 
> Thanks for your time,
> Kevin
