Hi Kevin,
Have a look at Apache Flume; it is designed for collecting, aggregating, and moving large amounts of data into HDFS.
http://flume.apache.org/FlumeUserGuide.html
On 02/13/2015 03:28 PM, Kevin wrote:
Hi,
I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a
thousand or so files into HDFS, totaling roughly 1 TB. The cluster
will be isolated on its own private LAN, with a single client machine
that is connected to both the Hadoop cluster and the public network.
The data that needs to be copied into HDFS is mounted via NFS on the
client machine.
I can run `hadoop fs -put` concurrently on the client machine to try
to increase the throughput.
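That concurrent-put idea can be sketched as a small wrapper. Everything here is an assumption on my part: GNU xargs with `-P` on the client, placeholder paths, and a hypothetical `HADOOP_PUT` hook so it can be dry-run without a cluster:

```shell
# Run up to $jobs `hadoop fs -put` processes at once over all files
# under $src. HADOOP_PUT can be overridden (e.g. with `echo`) to do a
# dry run when no cluster is reachable.
parallel_put() {
  src=$1        # directory holding the source files (e.g. the NFS mount)
  dest=$2       # target HDFS directory
  jobs=${3:-8}  # concurrent uploads; tune to what the client's NIC sustains
  # -I {} makes xargs launch one put per file, keeping $jobs running at once.
  find "$src" -type f -print0 |
    xargs -0 -P "$jobs" -I {} ${HADOOP_PUT:-hadoop fs -put} {} "$dest"
}

# Example with hypothetical paths:
# parallel_put /mnt/nfs/data /user/kevin/data 8
```

The client machine's single NIC is still the bottleneck either way; parallelism mostly helps amortize per-file overhead for the smaller files.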
If these files were able to be accessed by each node in the Hadoop
cluster, then I could write a MapReduce job to copy a number of files
from the network into HDFS. I could not find anything in the
documentation saying that `distcp` works with locally hosted files
(its code in the tools package doesn't show any sign of it either) -
but I wouldn't expect it to.
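For what it's worth, a hypothetical invocation for that scenario would look like the following; the mount point and NameNode address are assumptions, and it only applies if the NFS share were mounted at the identical path on every node that runs a map task:

```shell
# Hypothetical: each mapper reads its assigned files through the local
# file:// path, so /mnt/nfs/data would have to be mounted at that same
# path on every node in the cluster -- not just the client.
hadoop distcp file:///mnt/nfs/data hdfs://namenode:8020/user/kevin/data
```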
In general, are there any other ways of copying a very large number of
client-local files to HDFS? I searched the mail archives for a
similar question but didn't come across one. I'm sorry if this is a
duplicate question.
Thanks for your time,
Kevin
--
Regards,
Ahmed Ossama