Hi
I wanted to propose IBM's General Parallel File System™ (GPFS™) as the
distributed filesystem. GPFS™ is well known for its scalability and
leading file system performance, and it is now IBM's premier storage
virtualization solution. More information is available at
http://www-03.ibm.com/systems/clusters/software/gpfs/
GPFS™ is POSIX compliant and designed for general workloads; its
performance is tuned for random-access I/O patterns.

We at IBM have been running Hadoop on GPFS™ and have seen performance
numbers comparable to HDFS. We are in the process of contributing the GPFS
plugin to the Hadoop code base, at which point you will also be able to run
Map-Reduce jobs natively on GPFS™ alongside other applications on the
same cluster.

The following papers discuss the performance comparison between HDFS and
GPFS™ and the changes made to GPFS™ to support map-reduce workloads.

Rajagopal Ananthanarayanan, Karan Gupta, Prashant Pandey, Himabindu Pucha,
Prasenjit Sarkar, Mansi Shah, Renu Tewari. Cloud analytics: Do we really
need to reinvent the storage stack? Workshop on Hot Topics in Cloud
Computing (HotCloud '09), June 15, 2009, San Diego, CA.

Guanying Wang, Ali R. Butt (Virginia Tech), Prashant Pandey, Karan Gupta
(IBM Almaden). Using Realistic Simulation for Performance Analysis of
MapReduce Setups. Workshop on Large-Scale System and Application
Performance (LSAP), June 10, 2009, Munich.

Guanying Wang, Ali R. Butt, Prashant Pandey, and Karan Gupta. A Simulation
Approach to Evaluating Design Decisions in MapReduce Setups. IEEE/ACM
International Symposium on Modelling, Analysis and Simulation of Computer
and Telecommunication Systems (MASCOTS), 2009.

--Reshu

----- Forwarded by David Fallside/Santa Teresa/IBM on 11/13/2009 04:12 PM -----

From:     "Dmitry Pushkarev" <u...@stanford.edu>
To:       <common-user@hadoop.apache.org>
Date:     11/13/2009 01:56 PM
Subject:  Alternative distributed filesystem.
Reply-To: common-u...@hadoop.apache.org




Dear Hadoop users,



One of our hadoop clusters is being converted to SGE to run a very specific
application, and we're thinking about how to utilize the huge hard drives
on those nodes. Since Hadoop will not be installed on these nodes, we're
looking for an alternative distributed filesystem with decent concurrent
read/write performance (compared to HDFS) for large amounts of data. Using
a single file store, such as a NAS RAID array, has proved very ineffective
when someone pushes gigabytes of data onto it.



What other systems can we look at? We would like the filesystem to be
mountable on every node and open source; ideally it would also offer POSIX
compliance and decent random-access performance (though that isn't critical).

HDFS doesn't fit the bill because mounting it via fuse_dfs and using it
without any mapred jobs (i.e. data will typically be pushed from 1-2 nodes
at most, at different times) seems slightly "ass-backward" to say the least.



Thanks.
