This is something that I asked recently :)

Here's a list of what I can think of

1. On the remote data box, stream the file over SSH straight into HDFS:
   cat filetobesent | ssh hadoopmaster 'hadoop fs -put - dstinhdfs'

2. On the remote data box, configure core-site.xml so that fs.default.name points to 
hdfs://namenode:port, then run a plain "hadoop fs -copyFromLocal" or "hadoop fs -put" 
as usual. This works as long as the namenode is reachable from your data box, either 
directly or over a VPN (see the core-site.xml snippet after this list).

3. HDFS-aware GridFTP; you can read more about it in Brian Bockelman's reply:

http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201003.mbox/%3c2506096c-c00d-40ec-8751-4abd8f040...@cse.unl.edu%3e

4. You can write an HDFS-aware data transfer tool that runs on the data box: it reads 
the data locally, sends it over the network to a partner process on the namenode side, 
and writes it directly into the Hadoop cluster (see the sketch after this list).

5. Any other ideas?
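
For option 2, the override on the data box would look roughly like this 
(hdfs://namenode:port is just a placeholder for your real namenode address):

  <!-- core-site.xml on the data box: make the remote cluster the default filesystem -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode:port</value>
  </property>

With that in place, a plain "hadoop fs -put filetobesent dstinhdfs" run on the data box 
streams straight into the cluster.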
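
For option 4, here is a minimal sketch of the HDFS-writing half of such a tool, built on 
the FileSystem API. The class name and buffer size are just placeholders, and it assumes 
the Hadoop client jars plus a core-site.xml for the target cluster are on the classpath 
of whichever box runs it; if that box can be the data box itself, you don't need the 
partner process at all:

  import java.io.BufferedInputStream;
  import java.io.FileInputStream;
  import java.io.InputStream;
  import java.io.OutputStream;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  // Hypothetical helper: copy one local file directly into HDFS, no temp copy.
  public class DirectHdfsPut {
    public static void main(String[] args) throws Exception {
      String localSrc = args[0];                 // e.g. filetobesent
      String hdfsDst  = args[1];                 // e.g. hdfs://namenode:port/dstinhdfs

      Configuration conf = new Configuration();  // picks up core-site.xml from the classpath
      FileSystem fs = FileSystem.get(new Path(hdfsDst).toUri(), conf);

      InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
      OutputStream out = fs.create(new Path(hdfsDst));  // client streams blocks to the datanodes

      IOUtils.copyBytes(in, out, 4096, true);    // 4096-byte buffer; true closes both streams
    }
  }

This is essentially the same code path "hadoop fs -put" goes through, so the bytes travel 
from the local stream to the datanodes without ever landing in a local staging copy.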

Thanks,

Michael

--- On Fri, 3/5/10, zenMonkey <numan.sal...@gmail.com> wrote:

From: zenMonkey <numan.sal...@gmail.com>
Subject: Copying files between two remote hadoop clusters
To: hadoop-u...@lucene.apache.org
Date: Friday, March 5, 2010, 4:25 PM


I want to write a script that pulls data (flat files) from a remote machine
and pushes that into its hadoop cluster.

At the moment, it is done in two steps:

1 - Secure copy the remote files
2 - Put the files into HDFS

I was wondering if it is possible to optimize this by avoiding the copy to the local fs
and instead writing directly to HDFS. I am not sure if this is something the Hadoop
tools already provide.

Thanks for any help.
