Kevin, Slurper can help here: https://github.com/alexholmes/hdfs-file-slurper
BR,
Alexander

> On 13 Feb 2015, at 14:28, Kevin <kevin.macksa...@gmail.com> wrote:
>
> Hi,
>
> I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a thousand or
> so files into HDFS, totaling roughly 1 TB. The cluster will be isolated
> on its own private LAN, with a single client machine that is connected to
> both the Hadoop cluster and the public network. The data that needs to be
> copied into HDFS is mounted over NFS on the client machine.
>
> I can run `hadoop fs -put` concurrently on the client machine to try to
> increase the throughput.
>
> If each node in the Hadoop cluster could access these files, then I could
> write a MapReduce job to copy a number of files from the network into HDFS.
> I could not find anything in the documentation saying that `distcp` works
> with locally hosted files (its code in the tools package doesn't show any
> sign of it either) - but I wouldn't expect it to.
>
> In general, are there any other ways of copying a very large number of
> client-local files to HDFS? I searched the mail archives for a similar
> question and didn't come across one. I'm sorry if this is a duplicate
> question.
>
> Thanks for your time,
> Kevin
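
For reference, the concurrent `hadoop fs -put` idea from the quoted message could be sketched roughly as below. This is only an illustration, not something taken from the thread or from Slurper: the function names, the worker count of 8, and the paths in the usage note are all assumptions, and it assumes the `hadoop` command is on the client's PATH.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_put_command(local_path, hdfs_dir):
    """Return the argv list for uploading one local file into hdfs_dir."""
    return ["hadoop", "fs", "-put", str(local_path), hdfs_dir]


def put_all(local_dir, hdfs_dir, workers=8, runner=subprocess.call):
    """Upload every regular file in local_dir, at most `workers` at a time.

    `runner` takes an argv list and returns an exit code; it defaults to
    subprocess.call and is a parameter only so it can be swapped out
    (hypothetical design choice, e.g. for a dry run).
    Returns a dict mapping each local path to the exit code of its upload.
    """
    files = sorted(p for p in Path(local_dir).iterdir() if p.is_file())
    commands = [build_put_command(p, hdfs_dir) for p in files]
    # Threads are enough here: each one just waits on a child process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        exit_codes = list(pool.map(runner, commands))
    return {str(p): code for p, code in zip(files, exit_codes)}
```

With hypothetical paths, `put_all("/mnt/nfs/data", "/user/kevin/data", workers=8)` would start up to eight uploads at once; any file with a non-zero exit code can be retried. The right worker count depends on where the bottleneck is (the NFS mount, the client's network link, or the cluster), so it is worth measuring rather than guessing.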