On Mar 2, 2010, at 3:51 PM, jiang licht wrote:

> Thanks, Brian.
>
> There is no certificate/grid infrastructure for us yet. But I guess I can still use gridftp, having noticed the following on its FAQ page: GridFTP can be run in a mode using standard SSH security credentials. It can also be run in anonymous mode and with username/password authentication.
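The SSH-credential mode mentioned in that FAQ is normally driven with sshftp:// URLs rather than gsiftp:// ones. A minimal sketch, assuming globus-gridftp-server is installed on the destination box, sshftp access is enabled there, and you already have ssh key access; the host name and paths are placeholders:

  # Push a local file over GridFTP using plain ssh credentials - no grid
  # certificates involved; authentication rides on the existing ssh setup.
  globus-url-copy file:///data/bulk/file1.dat \
      sshftp://user@remote-box.example.com/staging/file1.dat

Whether this mode interoperates with the HDFS plugin discussed later in the thread is not covered here.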
Ah, I guess I've never gone down that road - I always use SSL certificates.

> I am wondering how gridftp can be used in a generic scenario: transfer bulk data from a box (not in the hadoop cluster) to a remote hadoop cluster at a regular interval (maybe hourly or every couple of minutes). So I guess I can install the gridftp server on the hadoop name node and the gridftp client on the remote data box. But to bypass the intermediate step of keeping a local copy on the hadoop name node, I need something like the plugin you mentioned. Is that correct?

This is correct.

> Since I don't have the plugin you have, I found a helpful article here that might address the problem:
>
> http://osg-test2.unl.edu/documentation/hadoop/gridftp-hdfs

You can get the plugin here if you're using the same Hadoop version as us:

https://twiki.grid.iu.edu/bin/view/Storage/Hadoop

The source code is at svn://t2.unl.edu/brian/gridftp_hdfs (browse it online at http://t2.unl.edu:8094/browser/gridftp_hdfs).

Heck, I've even updated the RPM .spec file to be compatible with the Cloudera packaging of 0.20.x (we plan on moving to Cloudera's distribution for that version). I haven't tested it lately against Cloudera's packaging, though.

> It seems to me that it can write data directly to hadoop (although I don't know exactly how). But I am not sure how to direct the gridftp client to write data to hadoop - something like "globus-url-copy localurl hdfs://hadoopnamenode/pathinhdfs"? Otherwise, there might be some mapping on the gridftp server side to relay data to hadoop.

No, it would be something like this:

globus-url-copy localurl gsiftp://gridftp-server.example.com/path/in/hdfs

I would never recommend running any server on the hadoop namenode besides the hadoop namenode. :)

> I think this is interesting if it works. Basically, this is a "push" mode.
>
> Even better would be a "pull" mode: I still want something built into hadoop (so it runs in map/reduce) that acts like "hadoop distcp s3://123:4...@nutch/ hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998" or "hadoop distcp -f filelistA hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998", where filelistA looks like
>
> s3://123:4...@nutch/file1
> s3://123:4...@nutch/fileN
>
> So, just like accessing local files, we might have something like "hadoop distcp file://remotehost/path hdfs://namenode/path" or "hadoop distcp -f filelistB hdfs://hostname/path", where filelistB looks like
>
> file://remotehost/path1/file1
> file://remotehost/path2/fileN
>
> (file:// works for the local file system, but in this case it would point to a remote file system, or be replaced with something like remote://). So some middleware would sit on the remote host and the namenode to exchange data - in this case gridftp? - if they agree on protocols (ports, etc.)
>
> If security is an issue, data can be gpg-encrypted before doing a "distcp". This might serve you better.

It's probably a fairly minor modification to distcp? Doesn't sound too hard to code.

Brian

> Thanks,
> --
>
> Michael
>
> --- On Tue, 3/2/10, Brian Bockelman <bbock...@cse.unl.edu> wrote:
>
> From: Brian Bockelman <bbock...@cse.unl.edu>
> Subject: Re: bulk data transfer to HDFS remotely (e.g. via wan)
> To: common-user@hadoop.apache.org
> Date: Tuesday, March 2, 2010, 3:00 PM
>
> Hey Michael,
>
> We've developed a GridFTP server plugin that writes directly into Hadoop, so there's no intermediate data staging required. You can just use your favorite GridFTP client on the source machine and transfer it directly into Hadoop.
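A rough sketch of that push mode, combined with the gpg idea quoted above - assuming the gridftp-hdfs plugin linked earlier is serving the HDFS path on the server side, and treating the host name, passphrase file, and paths as placeholders:

  # Encrypt locally, then push straight into HDFS through the GridFTP/HDFS plugin.
  gpg --batch --symmetric --passphrase-file /etc/transfer.pass \
      --output /data/outgoing/part-0001.gpg /data/outgoing/part-0001
  globus-url-copy -p 4 file:///data/outgoing/part-0001.gpg \
      gsiftp://gridftp-server.example.com/user/incoming/part-0001.gpg

Here -p 4 asks for four parallel data streams. Decryption would then happen inside the cluster (e.g. hadoop fs -cat piped into gpg -d) before the data is used.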
> Globus GridFTP can do checksums as it goes, but I haven't tried it - it might not work with our plugin. The GridFTP server does not need to co-exist with any Hadoop processes - it just needs a network connection to the WAN and a network connection to the LAN.
>
> The GridFTP server is automatically installed with our yum packaging, along with our organization's CA certs. If this is a one-off transfer - or you don't already have the CA certificate/grid infrastructure available in your organization - you might be better served by another solution.
>
> The setup works well for us because (a) the other 40 sites use GridFTP as a common protocol, (b) we have a long history of using GridFTP, and (c) we need to transfer many TB on a daily basis.
>
> Brian
>
> On Mar 2, 2010, at 12:10 PM, jiang licht wrote:
>
>> Hi Brian,
>>
>> Thanks a lot for sharing your experience. Here I have some questions to bother you with for more help :)
>>
>> So, that basically means data transfer in your case is a 2-step job: first, use gridftp to make a local copy of the data on the target; second, load the data into the target cluster with something like "hadoop fs -put". If this is correct, I am wondering whether this consumes too much disk space on your target box (since the data sits in a local file system before being distributed to the hadoop cluster). Also, do you do an integrity check for each file transferred (one straightforward method might be a 'cksum' or similar comparison, but is that doable in terms of efficiency)?
>>
>> I am not familiar with gridftp except that I know it is a better choice than scp, sftp, etc. in that it can tune tcp settings and create parallel transfers. So I want to know whether it keeps a log of which files have been successfully transferred and which have not, and whether gridftp does a file integrity check. Right now, I only have one box for data storage (not in the hadoop cluster) and want to transfer that data to hadoop. Can I just install gridftp on this box and on the name node box to enable gridftp transfer from the first to the second?
>>
>> Thanks,
>> --
>>
>> Michael
>>
>> --- On Tue, 3/2/10, Brian Bockelman <bbock...@cse.unl.edu> wrote:
>>
>> From: Brian Bockelman <bbock...@cse.unl.edu>
>> Subject: Re: bulk data transfer to HDFS remotely (e.g. via wan)
>> To: common-user@hadoop.apache.org
>> Date: Tuesday, March 2, 2010, 8:38 AM
>>
>> Hey Michael,
>>
>> distcp does a MapReduce job to transfer data between two clusters - but it might not be acceptable security-wise for your setup.
>>
>> Locally, we use gridftp between two clusters (not necessarily Hadoop!) and a protocol called SRM to load-balance between gridftp servers. GridFTP was selected because it is common in our field, and we already have the certificate infrastructure well set up.
>>
>> GridFTP is fast too - many Gbps is not too hard.
>>
>> YMMV
>>
>> Brian
>>
>> On Mar 2, 2010, at 1:30 AM, jiang licht wrote:
>>
>>> I am considering a basic task of loading data into a hadoop cluster in this scenario: the hadoop cluster and the bulk data reside on different boxes, e.g. connected via LAN or WAN.
>>>
>>> An example of this is moving data from amazon s3 to ec2, which is supported in the latest hadoop by specifying s3(n)://authority/path in distcp.
>>>
>>> But generally speaking, what is the best way to load data into a hadoop cluster from a remote box?
>>> Clearly, in this scenario, it is unreasonable to copy data to the local name node and then issue some command like "hadoop fs -copyFromLocal" to put the data in the cluster (besides this, the choice of data transfer tool is also a factor: scp or sftp, gridftp, ..., compression and encryption, ...).
>>>
>>> I am not aware of generic support for fetching data from a remote box (like that for s3 or s3n), so I am thinking about the following solution (run on the remote boxes to push data to hadoop):
>>>
>>> cat datafile | ssh hadoopbox 'hadoop fs -put - dst'
>>>
>>> There are pros (it is simple and does the job without storing a local copy of each data file and then running a command like 'hadoop fs -copyFromLocal') and cons (it obviously needs many such pipelines running in parallel to speed up the job, at the cost of creating processes on the remote machines to read data and maintaining ssh connections; so if the data files are small, it is better to archive the small files into a tar file before calling 'cat'). As an alternative to using 'cat', a program could be written to keep reading data files and dumping them to stdin in parallel.
>>>
>>> Any comments about this, or thoughts about a better solution?
>>>
>>> Thanks,
>>> --
>>> Michael
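For completeness, a rough sketch of the streaming approach from the quoted message above (the cat | ssh 'hadoop fs -put - dst' pipeline), with the two refinements mentioned there: bundling small files into a tar stream and running several pipelines in parallel. The host name, paths, and the limit of 4 parallel jobs are placeholders, and password-less ssh to a box with a configured hadoop client is assumed:

  # Bundle a directory of small files and stream it into HDFS with no local staging copy.
  tar czf - /data/small-files | \
      ssh hadoopbox 'hadoop fs -put - /incoming/small-files.tar.gz'

  # Push large files in parallel, one ssh pipeline per file, at most 4 at a time.
  ls /data/big/*.dat | xargs -P4 -I{} sh -c \
      'cat "{}" | ssh hadoopbox "hadoop fs -put - /incoming/$(basename {})"'

Each pipeline still pays the ssh encryption and process-setup cost, so for the many-TB-per-day volumes mentioned earlier in the thread a GridFTP-style transfer is likely to scale better.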