For SGE, the best choice is Globus (GRAM / GridFTP) or OpenPBS. GridFTP provides high-bandwidth distributed data transfer. Additionally, to maintain data integrity, you can set up GridFTP to share data between SGE and Hadoop HDFS.
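As a rough illustration of what I mean (not a drop-in setup): the small Python sketch below stages a file from a GridFTP endpoint onto local scratch with globus-url-copy and then pushes it into HDFS with the hadoop CLI. The hostname, port, and paths are made-up placeholders, and it assumes both globus-url-copy and the hadoop command are available on the node doing the transfer.

#!/usr/bin/env python
# Rough sketch: stage a file from a GridFTP endpoint onto local scratch
# with globus-url-copy, then push it into HDFS with "hadoop fs -put".
# The endpoint, host, and paths below are hypothetical placeholders.
import os
import subprocess

GRIDFTP_SRC = "gsiftp://sge-head.example.org:2811/data/run42/results.dat"  # hypothetical
LOCAL_STAGE = "/tmp/results.dat"                                            # hypothetical
HDFS_DEST = "/user/sge/run42/results.dat"                                   # hypothetical

def run(cmd):
    # Run a command and fail loudly if it returns non-zero.
    if subprocess.call(cmd) != 0:
        raise RuntimeError("command failed: %s" % " ".join(cmd))

# 1. Pull the file from the GridFTP server onto local scratch space.
run(["globus-url-copy", GRIDFTP_SRC, "file://" + LOCAL_STAGE])

# 2. Push the staged copy into HDFS.
run(["hadoop", "fs", "-put", LOCAL_STAGE, HDFS_DEST])

# 3. Remove the local staging copy.
os.remove(LOCAL_STAGE)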
--nitesh

On Sat, Nov 14, 2009 at 4:20 AM, Jeff Hammerbacher <ham...@cloudera.com> wrote:
> Hey Dmitry,
>
> As Mike states, I think HDFS is a great fit for your use case. I have never
> deployed any of the below systems into production, but I have seen some
> complaints about the stability of GlusterFS (e.g.
> http://gluster.org/pipermail/gluster-users/2009-October/003193.html), and
> Lustre can be complex to set up and maintain. If you already have HDFS
> expertise in house, you'll probably be fine with FUSE and HDFS.
>
> Regards,
> Jeff
>
> On Fri, Nov 13, 2009 at 2:12 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
> > If you are looking for a large distributed file system with posix locking,
> > look at:
> >
> > glusterfs
> > lustre
> > ocfs2
> > redhat GFS
> >
> > Edward
> >
> > On Fri, Nov 13, 2009 at 5:07 PM, Michael Thomas <tho...@hep.caltech.edu> wrote:
> > > Hi Dmitry,
> > >
> > > I still stand by my original statement. We do use fuse_dfs for reading data
> > > on all of the worker nodes. We don't use it much for writing data, but only
> > > because our project's data model was never designed to use a posix
> > > filesystem for writing data, only reading.
> > >
> > > --Mike
> > >
> > > On 11/13/2009 02:04 PM, Dmitry Pushkarev wrote:
> > >>
> > >> Mike,
> > >>
> > >> I guess what I said referred to the use of fuse_hdfs as a general solution.
> > >> If we were to use native APIs, that'd be perfect. But we basically need to
> > >> mount it as a place where programs can simultaneously dump large amounts
> > >> of data.
> > >>
> > >> -----Original Message-----
> > >> From: Michael Thomas [mailto:tho...@hep.caltech.edu]
> > >> Sent: Friday, November 13, 2009 2:00 PM
> > >> To: common-user@hadoop.apache.org
> > >> Subject: Re: Alternative distributed filesystem.
> > >>
> > >> On 11/13/2009 01:56 PM, Dmitry Pushkarev wrote:
> > >>>
> > >>> Dear Hadoop users,
> > >>>
> > >>> One of our hadoop clusters is being converted to SGE to run a very specific
> > >>> application, and we're thinking about how to utilize the huge hard drives
> > >>> that are there. Since there will be no hadoop installed on these nodes, we're
> > >>> looking for an alternative distributed filesystem that will have decent
> > >>> concurrent read/write performance (compared to HDFS) for large amounts of
> > >>> data. Using a single file store like NAS RAID arrays proved to be very
> > >>> ineffective when someone is pushing gigabytes of data onto them.
> > >>>
> > >>> What other systems can we look at? We would like that FS to be mounted on
> > >>> every node and open source; hopefully we'd like to have POSIX compliance and
> > >>> decent random access performance (though that isn't critical).
> > >>>
> > >>> HDFS doesn't fit the bill because mounting it via fuse_dfs and using it without
> > >>> any mapred jobs (i.e. data will typically be pushed from 1-2 nodes at most,
> > >>> at different times) seems slightly "ass-backward" to say the least.
> > >>
> > >> I would hardly call it ass-backwards. I know of at least 3 HPC clusters
> > >> that use only the HDFS component of Hadoop to serve 500TB+ of data to
> > >> 100+ worker nodes.
> > >>
> > >> As a cluster filesystem, HDFS works pretty darn well.
> > >>
> > >> --Mike

--
Nitesh Bhatia
"Life is never perfect. It just depends where you draw the line."
http://www.linkedin.com/in/niteshbhatia http://www.twitter.com/niteshbhatia