For SGE, the best choice is Globus (GRAM / GridFTP) or OpenPBS. GridFTP
provides high-bandwidth distributed data transfer. Additionally, to maintain
data integrity, you can set up GridFTP to share data between SGE and
Hadoop HDFS.
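
As a rough sketch of what that staging step could look like (the host names,
paths and the script itself are made up, assuming the Globus client tools and
the Hadoop CLI are installed):

    #!/usr/bin/env python
    # Hypothetical staging helper: pull a file from a GridFTP server on
    # the SGE side, then push it into HDFS. All names are placeholders.
    import subprocess

    def stage_to_hdfs(gridftp_url, local_path, hdfs_path):
        # Needs a valid grid proxy (grid-proxy-init) and globus-url-copy
        # on the PATH; file:// destinations take an absolute path.
        subprocess.check_call(["globus-url-copy", gridftp_url,
                               "file://" + local_path])
        # Copy the staged file into HDFS with the standard Hadoop CLI.
        subprocess.check_call(["hadoop", "fs", "-put",
                               local_path, hdfs_path])

    stage_to_hdfs("gsiftp://sge-head.example.org/data/run42.dat",
                  "/tmp/run42.dat", "/user/nitesh/run42.dat")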

--nitesh

On Sat, Nov 14, 2009 at 4:20 AM, Jeff Hammerbacher <ham...@cloudera.com> wrote:

> Hey Dmitry,
>
> As Mike states, I think HDFS is a great fit for your use case. I have never
> deployed any of the below systems into production, but I have seen some
> complaints about the stability of GlusterFS (e.g.
> http://gluster.org/pipermail/gluster-users/2009-October/003193.html), and
> Lustre can be complex to set up and maintain. If you already have HDFS
> expertise in house, you'll probably be fine with FUSE and HDFS.
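>
> As a rough illustration (the namenode URI, mount point and file path below
> are placeholders, and the exact fuse_dfs wrapper name varies between
> releases), the mount-then-read flow looks something like:
>
>     import subprocess
>
>     # Mount HDFS at a local path via the contrib fuse_dfs module
>     # (assumes fuse_dfs is built and the wrapper script is on PATH).
>     subprocess.check_call(["fuse_dfs_wrapper.sh",
>                            "dfs://namenode.example.org:9000", "/mnt/hdfs"])
>
>     # Once mounted, any process can read it like a POSIX filesystem.
>     with open("/mnt/hdfs/user/dmitry/sample.dat", "rb") as f:
>         header = f.read(1024)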
>
> Regards,
> Jeff
>
> On Fri, Nov 13, 2009 at 2:12 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
> > If you are looking for a large distributed filesystem with POSIX
> > locking, look at:
> >
> > glusterfs
> > lustre
> > ocfs2
> > redhat GFS
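> >
> > As a quick illustration of the advisory POSIX locking those buy you (the
> > path below is made up), once the filesystem is mounted on each node:
> >
> >     import fcntl
> >
> >     # Advisory byte-range lock; on a cluster FS like GFS or OCFS2 this
> >     # coordinates writers across nodes. The path is a placeholder.
> >     with open("/mnt/cluster/shared.log", "a") as f:
> >         fcntl.lockf(f, fcntl.LOCK_EX)  # blocks until lock is granted
> >         f.write("one writer at a time\n")
> >         fcntl.lockf(f, fcntl.LOCK_UN)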
> >
> > Edward
> > On Fri, Nov 13, 2009 at 5:07 PM, Michael Thomas <tho...@hep.caltech.edu> wrote:
> > > Hi Dmitry,
> > >
> > > I still stand by my original statement.  We do use fuse_dfs for
> > > reading data on all of the worker nodes.  We don't use it much for
> > > writing data, but only because our project's data model was never
> > > designed to use a POSIX filesystem for writing data, only reading.
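> > >
> > > A minimal sketch of that split (the paths below are invented): writes
> > > go through the native Hadoop client, reads through the fuse mount:
> > >
> > >     import subprocess
> > >
> > >     # Write side: push data through the native client rather than fuse.
> > >     subprocess.check_call(["hadoop", "fs", "-put",
> > >                            "/tmp/results.dat", "/user/mike/results.dat"])
> > >
> > >     # Read side: worker nodes just open the fuse-mounted path.
> > >     with open("/mnt/hdfs/user/mike/results.dat", "rb") as f:
> > >         data = f.read()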
> > >
> > > --Mike
> > >
> > > On 11/13/2009 02:04 PM, Dmitry Pushkarev wrote:
> > >>
> > >> Mike,
> > >>
> > >> I guess what I said referred to the use of fuse_dfs as a general
> > >> solution. If we were to use the native APIs, that'd be perfect. But
> > >> we basically need to mount it as a place where programs can
> > >> simultaneously dump large amounts of data.
> > >>
> > >> -----Original Message-----
> > >> From: Michael Thomas [mailto:tho...@hep.caltech.edu]
> > >> Sent: Friday, November 13, 2009 2:00 PM
> > >> To: common-user@hadoop.apache.org
> > >> Subject: Re: Alternative distributed filesystem.
> > >>
> > >> On 11/13/2009 01:56 PM, Dmitry Pushkarev wrote:
> > >>>
> > >>> Dear Hadoop users,
> > >>>
> > >>>
> > >>>
> > >>> One of our hadoop clusters is being converted to SGE to run a very
> > >>> specific application, and we're thinking about how to utilize the
> > >>> huge hard drives that are there. Since there will be no hadoop
> > >>> installed on these nodes, we're looking for an alternative
> > >>> distributed filesystem that will have decent concurrent read/write
> > >>> performance (compared to HDFS) for large amounts of data. Using a
> > >>> single file store, like NAS RAID arrays, proved to be very
> > >>> ineffective when someone is pushing gigabytes of data onto them.
> > >>>
> > >>>
> > >>>
> > >>> What other systems can we look at? We would like the FS to be
> > >>> mounted on every node and to be open source; ideally we'd also like
> > >>> POSIX compliance and decent random access performance (though that
> > >>> isn't critical).
> > >>>
> > >>> HDFS doesn't fit the bill because mounting it via fuse_dfs and
> > >>> using it without any mapred jobs (i.e. data will typically be
> > >>> pushed from 1-2 nodes at most, at different times) seems slightly
> > >>> "ass-backward" to say the least.
> > >>
> > >> I would hardly call it ass-backwards.  I know of at least 3 HPC
> > >> clusters that use only the HDFS component of Hadoop to serve 500TB+
> > >> of data to 100+ worker nodes.
> > >>
> > >> As a cluster filesystem, HDFS works pretty darn well.
> > >>
> > >> --Mike
> > >>
> > >>
> > >>
> > >
> > >
> > >
> >
>



-- 
Nitesh Bhatia

"Life is never perfect. It just depends where you draw the line."

http://www.linkedin.com/in/niteshbhatia
http://www.twitter.com/niteshbhatia
