Hi Kevin,

Have a look at Apache Flume. It is designed for efficiently collecting, aggregating, and moving large amounts of data into HDFS.

http://flume.apache.org/FlumeUserGuide.html
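
A minimal sketch of what that could look like in your case, assuming a spooling-directory source watching the NFS mount and an HDFS sink (the agent name, paths, and NameNode URI below are placeholders you would adjust):

    # Hypothetical agent config; adjust the spool directory and NameNode URI.
    cat > nfs-to-hdfs.conf <<'EOF'
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1

    # Watch the NFS mount for new files to ingest.
    a1.sources.r1.type     = spooldir
    a1.sources.r1.spoolDir = /mnt/nfs/data
    a1.sources.r1.channels = c1

    # Durable channel so events survive an agent restart.
    a1.channels.c1.type = file

    # Write events to HDFS as plain data rather than SequenceFiles.
    a1.sinks.k1.type          = hdfs
    a1.sinks.k1.hdfs.path     = hdfs://namenode:8020/data/incoming
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.channel       = c1
    EOF

    flume-ng agent --conf conf --conf-file nfs-to-hdfs.conf --name a1

One thing to keep in mind: Flume is event-oriented, so the HDFS sink rolls its output files by size, time, or event count rather than preserving the original file boundaries.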

On 02/13/2015 03:28 PM, Kevin wrote:
Hi,

I am setting up a Hadoop cluster (CDH 5.1.3) and I need to copy a thousand or so files, totaling roughly 1 TB, into HDFS. The cluster will be isolated on its own private LAN, with a single client machine connected to both the Hadoop cluster and the public network. The data that needs to be copied into HDFS is on an NFS mount on the client machine.

I can run `hadoop fs -put` concurrently on the client machine to try to increase throughput.
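
Something along these lines is what I had in mind (the paths and the degree of parallelism are just placeholders):

    # Run 8 uploads at a time from the NFS mount into an HDFS directory.
    find /mnt/nfs/data -type f -print0 \
      | xargs -0 -P 8 -I{} hadoop fs -put {} /data/incoming

Since the thousand or so files average around 1 GB each, the per-file JVM startup cost should be small relative to the transfer time.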

If these files could be accessed by each node in the Hadoop cluster, then I could write a MapReduce job to copy a number of files from the network into HDFS. I could not find anything in the documentation saying that `distcp` works with locally hosted files (its code in the tools package doesn't show any sign of it either), but I wouldn't expect it to.
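
Purely for illustration, if the NFS export were mounted at the same path on every node, I would expect that idea to look something like this (the NameNode address and paths are made up):

    # Hypothetical: assumes /mnt/nfs/data is visible at the same path on every node,
    # which is not the case in my setup.
    hadoop distcp file:///mnt/nfs/data hdfs://namenode:8020/data/incoming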

In general, are there any other ways of copying a very large number of client-local files to HDFS? I searched the mail archives for a similar question but didn't come across one. I'm sorry if this is a duplicate question.

Thanks for your time,
Kevin

--
Regards,
Ahmed Ossama
