On Mar 2, 2010, at 3:51 PM, jiang licht wrote:

> Thanks, Brian.
>
> There is no certificate/grid infrastructure for us yet. But I guess I can still use gridftp, having noticed the following on its FAQ page: GridFTP can be run in a mode using standard SSH security credentials. It can also be run in anonymous mode and with username/password authentication.
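The SSH-credential mode mentioned in that FAQ is normally driven with sshftp:// URLs rather than gsiftp:// ones. A minimal sketch, assuming globus-gridftp-server is installed on the destination box, sshftp access is enabled there, and you already have ssh key access; the host name and paths are placeholders:

  # Push a local file over GridFTP using plain ssh credentials - no grid
  # certificates involved; authentication rides on the existing ssh setup.
  globus-url-copy file:///data/bulk/file1.dat \
      sshftp://user@remote-box.example.com/staging/file1.dat

Whether this mode interoperates with the HDFS plugin discussed later in the thread is not covered here.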
Ah, I guess I've never gone down that road - I always use SSL certificates.

> I am wondering how gridftp can be used in a generic scenario: transfer bulk data from a box (not in the hadoop cluster) to a remote hadoop cluster at a regular interval (maybe hourly or every couple of minutes). So I guess I can install the gridftp server on the hadoop name node and the gridftp client on the remote data box. But to bypass the intermediate step of keeping a local copy on the hadoop name node, I need something like the plugin you mentioned. Is that correct?

This is correct.

> Since I don't have the plugin you have, I found a helpful article here that might address the problem:
>
> http://osg-test2.unl.edu/documentation/hadoop/gridftp-hdfs

You can get the plugin here if you're using the same Hadoop version as us:

https://twiki.grid.iu.edu/bin/view/Storage/Hadoop

The source code is at svn://t2.unl.edu/brian/gridftp_hdfs (browse it online at http://t2.unl.edu:8094/browser/gridftp_hdfs).

Heck, I've even updated the RPM .spec file to be compatible with the Cloudera packaging of 0.20.x (we plan on moving to Cloudera's distribution for that version). I haven't tested it lately against Cloudera's packaging, though.

> It seems to me that it can write data directly to hadoop (although I don't know exactly how). But I am not sure how to direct the gridftp client to write data to hadoop - something like "globus-url-copy localurl hdfs://hadoopnamenode/pathinhdfs"? Otherwise, there might be some mapping on the gridftp server side to relay data to hadoop.

No, it would be something like this:

globus-url-copy localurl gsiftp://gridftp-server.example.com/path/in/hdfs

I would never recommend running any server on the hadoop namenode besides the hadoop namenode. :)

> I think this is interesting if it works. Basically, this is a "push" mode.
>
> Even better would be a "pull" mode: I still want something built into hadoop (so it runs in map/reduce) that acts like "hadoop distcp s3://123:4...@nutch/ hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998" or "hadoop distcp -f filelistA hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998", where filelistA looks like
>
> s3://123:4...@nutch/file1
> s3://123:4...@nutch/fileN
>
> So, just like accessing local files, we might have something like "hadoop distcp file://remotehost/path hdfs://namenode/path" or "hadoop distcp -f filelistB hdfs://hostname/path", where filelistB looks like
>
> file://remotehost/path1/file1
> file://remotehost/path2/fileN
>
> (file:// works for the local file system, but in this case it would point to a remote file system, or be replaced with something like remote://). So some middleware would sit on the remote host and the namenode to exchange data - in this case gridftp? - if they agree on protocols (ports, etc.)
>
> If security is an issue, data can be gpg-encrypted before doing a "distcp". This might serve you better.

It's probably a fairly minor modification to distcp? Doesn't sound too hard to code.

Brian

> Thanks,
> --
>
> Michael
>
> --- On Tue, 3/2/10, Brian Bockelman <bbock...@cse.unl.edu> wrote:
>
> From: Brian Bockelman <bbock...@cse.unl.edu>
> Subject: Re: bulk data transfer to HDFS remotely (e.g. via wan)
> To: common-user@hadoop.apache.org
> Date: Tuesday, March 2, 2010, 3:00 PM
>
> Hey Michael,
>
> We've developed a GridFTP server plugin that writes directly into Hadoop, so there's no intermediate data staging required. You can just use your favorite GridFTP client on the source machine and transfer it directly into Hadoop.
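A rough sketch of that push mode, combined with the gpg idea quoted above - assuming the gridftp-hdfs plugin linked earlier is serving the HDFS path on the server side, and treating the host name, passphrase file, and paths as placeholders:

  # Encrypt locally, then push straight into HDFS through the GridFTP/HDFS plugin.
  gpg --batch --symmetric --passphrase-file /etc/transfer.pass \
      --output /data/outgoing/part-0001.gpg /data/outgoing/part-0001
  globus-url-copy -p 4 file:///data/outgoing/part-0001.gpg \
      gsiftp://gridftp-server.example.com/user/incoming/part-0001.gpg

Here -p 4 asks for four parallel data streams. Decryption would then happen inside the cluster (e.g. hadoop fs -cat piped into gpg -d) before the data is used.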
> Globus GridFTP can do checksums as it goes, but I haven't tried it - it might not work with our plugin. The GridFTP server does not need to co-exist with any Hadoop processes - it just needs a network connection to the WAN and a network connection to the LAN.
>
> The GridFTP server is automatically installed with our yum packaging, along with our organization's CA certs. If this is a one-off transfer - or you don't already have the CA certificate/grid infrastructure available in your organization - you might be better served by another solution.
>
> The setup works well for us because (a) the other 40 sites use GridFTP as a common protocol, (b) we have a long history of using GridFTP, and (c) we need to transfer many TB on a daily basis.
>
> Brian
>
> On Mar 2, 2010, at 12:10 PM, jiang licht wrote:
>
>> Hi Brian,
>>
>> Thanks a lot for sharing your experience. Here I have some questions to bother you with for more help :)
>>
>> So, that basically means data transfer in your case is a 2-step job: first, use gridftp to make a local copy of the data on the target; second, load the data into the target cluster with something like "hadoop fs -put". If this is correct, I am wondering whether this consumes too much disk space on your target box (since the data sits in a local file system before being distributed to the hadoop cluster). Also, do you do an integrity check for each file transferred (one straightforward method might be a 'cksum' or similar comparison, but is that doable in terms of efficiency)?
>>
>> I am not familiar with gridftp except that I know it is a better choice than scp, sftp, etc. in that it can tune tcp settings and create parallel transfers. So I want to know whether it keeps a log of which files have been successfully transferred and which have not, and whether gridftp does a file integrity check. Right now, I only have one box for data storage (not in the hadoop cluster) and want to transfer that data to hadoop. Can I just install gridftp on this box and on the name node box to enable gridftp transfer from the first to the second?
>>
>> Thanks,
>> --
>>
>> Michael
>>
>> --- On Tue, 3/2/10, Brian Bockelman <bbock...@cse.unl.edu> wrote:
>>
>> From: Brian Bockelman <bbock...@cse.unl.edu>
>> Subject: Re: bulk data transfer to HDFS remotely (e.g. via wan)
>> To: common-user@hadoop.apache.org
>> Date: Tuesday, March 2, 2010, 8:38 AM
>>
>> Hey Michael,
>>
>> distcp does a MapReduce job to transfer data between two clusters - but it might not be acceptable security-wise for your setup.
>>
>> Locally, we use gridftp between two clusters (not necessarily Hadoop!) and a protocol called SRM to load-balance between gridftp servers. GridFTP was selected because it is common in our field, and we already have the certificate infrastructure well set up.
>>
>> GridFTP is fast too - many Gbps is not too hard.
>>
>> YMMV
>>
>> Brian
>>
>> On Mar 2, 2010, at 1:30 AM, jiang licht wrote:
>>
>>> I am considering a basic task of loading data into a hadoop cluster in this scenario: the hadoop cluster and the bulk data reside on different boxes, e.g. connected via LAN or WAN.
>>>
>>> An example of this is moving data from amazon s3 to ec2, which is supported in the latest hadoop by specifying s3(n)://authority/path in distcp.
>>>
>>> But generally speaking, what is the best way to load data into a hadoop cluster from a remote box?
>>> Clearly, in this scenario, it is unreasonable to copy data to the local name node and then issue some command like "hadoop fs -copyFromLocal" to put the data in the cluster (besides this, the choice of data transfer tool is also a factor: scp or sftp, gridftp, ..., compression and encryption, ...).
>>>
>>> I am not aware of generic support for fetching data from a remote box (like that for s3 or s3n), so I am thinking about the following solution (run on the remote boxes to push data to hadoop):
>>>
>>> cat datafile | ssh hadoopbox 'hadoop fs -put - dst'
>>>
>>> There are pros (it is simple and does the job without storing a local copy of each data file and then running a command like 'hadoop fs -copyFromLocal') and cons (it obviously needs many such pipelines running in parallel to speed up the job, at the cost of creating processes on the remote machines to read data and maintaining ssh connections; so if the data files are small, it is better to archive the small files into a tar file before calling 'cat'). As an alternative to using 'cat', a program could be written to keep reading data files and dumping them to stdin in parallel.
>>>
>>> Any comments about this, or thoughts about a better solution?
>>>
>>> Thanks,
>>> --
>>> Michael
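For completeness, a rough sketch of the streaming approach from the quoted message above (the cat | ssh 'hadoop fs -put - dst' pipeline), with the two refinements mentioned there: bundling small files into a tar stream and running several pipelines in parallel. The host name, paths, and the limit of 4 parallel jobs are placeholders, and password-less ssh to a box with a configured hadoop client is assumed:

  # Bundle a directory of small files and stream it into HDFS with no local staging copy.
  tar czf - /data/small-files | \
      ssh hadoopbox 'hadoop fs -put - /incoming/small-files.tar.gz'

  # Push large files in parallel, one ssh pipeline per file, at most 4 at a time.
  ls /data/big/*.dat | xargs -P4 -I{} sh -c \
      'cat "{}" | ssh hadoopbox "hadoop fs -put - /incoming/$(basename {})"'

Each pipeline still pays the ssh encryption and process-setup cost, so for the many-TB-per-day volumes mentioned earlier in the thread a GridFTP-style transfer is likely to scale better.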