Hi Kevin,

What is the network throughput between:
1. the NFS server and the client machine?
2. the client machine and the datanodes?
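If you haven't measured those yet, here is a rough way to check, assuming
the share is mounted at /mnt/nfs and iperf is installed on both ends (both
assumptions on my part):

    # Sequential read throughput from the NFS mount (any large file will do):
    dd if=/mnt/nfs/some-large-file of=/dev/null bs=1M count=1024

    # Raw TCP throughput from the client to a datanode; start `iperf -s`
    # on the datanode first:
    iperf -c datanode1 -t 30

Whichever of those two numbers is smaller is the ceiling for your copy.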
Alex

On Feb 13, 2015 5:29 AM, "Kevin" <kevin.macksa...@gmail.com> wrote:
> Hi,
>
> I am setting up a Hadoop cluster (CDH 5.1.3) and I need to copy a thousand
> or so files into HDFS, totaling roughly 1 TB. The cluster will be isolated
> on its own private LAN, with a single client machine connected both to the
> Hadoop cluster and to the public network. The data that needs to be copied
> into HDFS is mounted on the client machine via NFS.
>
> I can run `hadoop fs -put` concurrently on the client machine to try to
> increase throughput.
>
> If these files could be accessed by each node in the Hadoop cluster, I
> could write a MapReduce job to copy a number of files from the network
> into HDFS. I could not find anything in the documentation saying that
> `distcp` works with locally hosted files (its code in the tools package
> doesn't show any sign of it either) - but I wouldn't expect it to.
>
> In general, are there any other ways to copy a very large number of
> client-local files to HDFS? I searched the mail archives for a similar
> question and didn't come across one. I'm sorry if this is a duplicate
> question.
>
> Thanks for your time,
> Kevin
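On the concurrent `hadoop fs -put` idea above: a minimal sketch using
xargs, assuming the share is mounted at /mnt/nfs and /data is the target
HDFS directory (both hypothetical paths; tune -P to however many streams
the NFS link can actually sustain):

    # One hadoop fs -put per file, up to 8 running in parallel:
    find /mnt/nfs -maxdepth 1 -type f -print0 | \
        xargs -0 -P 8 -I {} hadoop fs -put {} /data/

Past a handful of streams you are likely saturating the NFS side rather
than HDFS, which is why the throughput numbers above matter.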