Hi Stack,

Re Lustre use: I'm not a hardware infrastructure type of guy, but I can tell 
you that we have a very fast interconnect for access into the global filesystem:

"The Olympus Infiniband topology is a combination of 2:1 oversubscribed 36 port 
leaf switches and direct links into a 648 port core Qlogic QDR Infiniband 
switch."

I am not really worried about the loss of data locality or slower access to the
HBase tables. That is, this is not (yet) a production environment for multiple
users with real-time access. Though I think it would work - it's been quite
stable, for one thing, and I have not noticed any speed problem in retrieving
records. But I have not done any serious timings, and currently we are not
stressing HBase, in that the warehouse is being used by just a few
bioinformaticians, not the general community, so to speak. I'm happy simply to
have the data gathered in one place that scales and for which I can easily
write custom analytics programs - programs I can build upon and that won't have
to be moved to another database framework down the line.
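
When I do get to serious timings, the first pass would probably be as simple as
the sketch below - a loop of Gets against one of our tables, run once with the
tables on local disk and once on Lustre. The table name, column family, and
row-key pattern are made up for illustration; it's just the standard HBase Java
client API, nothing specific to our setup:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GetTimer {
        public static void main(String[] args) throws Exception {
            // Picks up hbase-site.xml from the classpath.
            Configuration conf = HBaseConfiguration.create();
            // "proteomics" and the "d" family are placeholder names.
            HTable table = new HTable(conf, "proteomics");
            int n = 10000;
            long start = System.currentTimeMillis();
            for (int i = 0; i < n; i++) {
                Get get = new Get(Bytes.toBytes("row-" + i));
                get.addFamily(Bytes.toBytes("d"));
                Result r = table.get(get); // result ignored; we only care about round-trip time
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(n + " gets in " + elapsed + " ms ("
                    + (elapsed / (double) n) + " ms per get)");
            table.close();
        }
    }

Nothing fancy, but it should be enough to show whether Lustre adds any
noticeable per-get latency once the warehouse is big enough for the numbers to
mean something.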

As the warehouse grows, I do plan on doing some testing, comparing HBase access
using local disk storage vs Lustre. But that's for when I have more time and
the warehouse is large enough for some real testing. We also have the option of
putting *everything* into Lustre, both the HBase tables and all the temp HDFS
file storage used by our MapReduce programs - so no local disk use at all. I'm
curious as to how well that would work. Possibly quite well, but no testing
yet. I want to try that. It should be a pretty simple switch - our Olympus
support people have already constructed alternate starting points that load all
the libs into Lustre instead of onto each local disk - but I've got other, more
immediate work to do first.
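
For what that switch would look like, the core of it is just pointing HBase's
root directory at a file:// URI on the Lustre mount instead of at HDFS -
something along the lines of the hbase-site.xml fragment below. The
/lustre/olympus/... paths are made up for illustration; the real mount points
are whatever our support folks have set up:

    <!-- hbase-site.xml sketch: run HBase directly against the shared Lustre
         mount instead of HDFS. Paths are illustrative only. -->
    <property>
      <name>hbase.rootdir</name>
      <value>file:///lustre/olympus/hbase</value>
    </property>
    <property>
      <name>hbase.tmp.dir</name>
      <value>/lustre/olympus/hbase-tmp</value>
    </property>

The MapReduce side would be the analogous change - pointing the default
filesystem (fs.default.name) and the job staging/temp directories at the same
mount - and that's the part I haven't exercised yet.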

BTW - the Dept of Energy's new five-year systems biology knowledgebase project
- the largest single bioinformatics project at DOE, I believe - is using Hadoop
for several things in its multiple backends. See
http://kbase.science.energy.gov/. I believe that Michael Schatz at Cold Spring
Harbor Lab is heading up the Hadoop work, with clusters at Lawrence Berkeley,
Argonne National Lab, and Oak Ridge. Not sure how HBase fits in - they are
getting into some NoSQL work, but I'm not sure what they'll be using. HBase, I
hope, but I don't know.

 Ron

Ronald Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568
email: ronald.tay...@pnnl.gov


-----Original Message-----
From: saint....@gmail.com [mailto:saint....@gmail.com] On Behalf Of Stack
Sent: Monday, July 02, 2012 1:37 PM
To: user@hbase.apache.org
Subject: Re: Powered By Page

On Mon, Jul 2, 2012 at 8:19 PM, Taylor, Ronald C <ronald.tay...@pnnl.gov> wrote:
> Pacific Northwest National Laboratory (www.pnl.gov) - Hadoop and HBase 
> (Cloudera distribution) are being used within PNNL's Computational Biology & 
> Bioinformatics Group for a systems biology data warehouse project that 
> integrates high throughput proteomics and transcriptomics data sets coming 
> from instruments in the Environmental  Molecular Sciences Laboratory, a US 
> Department of Energy national user facility located at PNNL. The data sets 
> are being merged and annotated with other public genomics information in the 
> data warehouse environment, with Hadoop analysis programs operating on the 
> annotated data in the HBase tables. This work is hosted by olympus, a large 
> PNNL institutional computing cluster 
> (http://www.pnl.gov/news/release.aspx?id=908) , with the HBase tables being 
> stored in olympus's Lustre file system.
>

That's a cool one.  I put it up (I put it in place of the powerset entry --
smile).

How's that Lustre hookup work Ronald?  You did your own FS implementation for 
it?

Good stuff,
Thanks.
St.Ack
