Hi Stack,

Re Lustre use: I'm not a hardware infrastructure type of guy, but I can tell you that we have a very fast interconnect for access into the global filesystem:
"The Olympus Infiniband topology is a combination of 2:1 oversubscribed 36-port leaf switches and direct links into a 648-port core QLogic QDR Infiniband switch."

I am not really worried about loss of data locality and slower speed of access to the HBase tables. That is, this is not (yet) a production environment for multiple users with real-time access. Though I think it would work - it's been quite stable, for one thing, and I have not noticed any speed problem in retrieving records. But I have not done any serious timings, and currently we are not stressing HBase, in that the warehouse is being used by just a few bioinformaticians, not the general community, so to speak.

I'm happy to simply have the data gathered in one place that provides scalability and for which I can easily write custom analytics programs - programs that I can build upon and that won't have to be moved to another database framework down the line. As the warehouse grows, I do plan on doing some testing, comparing HBase access using local disk storage vs. Lustre. But that's for when I have more time, and the warehouse is large enough for some real testing.

We also have the option of putting *everything* into Lustre - both the HBase tables and all the temp HDFS file storage used by our MapReduce programs - so, no local disk use at all. I'm curious as to how well that would work. Possibly quite well, but no testing yet. Want to try that. It should be a pretty simple switch - our olympus support people have already constructed alternate starting points that load all the libs into Lustre instead of each local disk - but I've got other, more immediate work to do first.

BTW - the Dept of Energy's new five-year systems biology knowledgebase project - the largest single bioinformatics project at DOE, I believe - is using Hadoop for several things in its multiple backends. See http://kbase.science.energy.gov/.
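For anyone wondering how an HBase-on-Lustre setup can be wired up in general terms: since Lustre is POSIX-mounted on every node, HBase can be pointed straight at the mount with a file:// URI rather than an hdfs:// one, so no custom FS implementation is needed. A minimal sketch - the path /lustre/hbase below is a hypothetical mount point for illustration, not our actual olympus configuration:

```xml
<!-- hbase-site.xml: minimal sketch for running HBase on a POSIX-mounted
     filesystem such as Lustre, bypassing HDFS for the table storage.
     /lustre/hbase is a hypothetical mount point, not the real olympus path. -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///lustre/hbase</value>
  </property>
</configuration>
```

The all-Lustre variant mentioned above would additionally move the MapReduce scratch space onto the mount (e.g., pointing hadoop.tmp.dir and mapred.local.dir at Lustre paths in core-site.xml/mapred-site.xml). For the local-disk-vs-Lustre timing comparison, HBase's bundled PerformanceEvaluation tool is one easy starting point, e.g. `hbase org.apache.hadoop.hbase.PerformanceEvaluation randomRead 10`.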
I believe that Michael Schatz at Cold Spring Harbor Lab is heading up the Hadoop work, with clusters at Lawrence Berkeley, Argonne National Lab, and Oak Ridge. Not sure how HBase fits in - they are getting into some NoSQL work, but I'm not sure what they'll be using. HBase, I hope, but I don't know.

Ron

Ronald Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568
email: ronald.tay...@pnnl.gov

-----Original Message-----
From: saint....@gmail.com [mailto:saint....@gmail.com] On Behalf Of Stack
Sent: Monday, July 02, 2012 1:37 PM
To: user@hbase.apache.org
Subject: Re: Powered By Page

On Mon, Jul 2, 2012 at 8:19 PM, Taylor, Ronald C <ronald.tay...@pnnl.gov> wrote:
> Pacific Northwest National Laboratory (www.pnl.gov) - Hadoop and HBase
> (Cloudera distribution) are being used within PNNL's Computational Biology &
> Bioinformatics Group for a systems biology data warehouse project that
> integrates high throughput proteomics and transcriptomics data sets coming
> from instruments in the Environmental Molecular Sciences Laboratory, a US
> Department of Energy national user facility located at PNNL. The data sets
> are being merged and annotated with other public genomics information in the
> data warehouse environment, with Hadoop analysis programs operating on the
> annotated data in the HBase tables. This work is hosted by olympus, a large
> PNNL institutional computing cluster
> (http://www.pnl.gov/news/release.aspx?id=908), with the HBase tables being
> stored in olympus's Lustre file system.
>

Thats a cool one. I put it up (I put it in place of the powerset entry -- smile).

How's that Lustre hookup work Ronald? You did your own FS implementation for it?

Good stuff,
Thanks.
St.Ack