Hey! I strongly disagree with Tatsuya's assessment of HBase, specifically the points below:
On Wed, Oct 14, 2009 at 12:31 AM, Tatsuya Kawano <tatsuy...@snowcocoa.info> wrote:
> HI Keith,
>
> On Wed, Oct 14, 2009 at 11:58 AM, Keith Thomas <keith.tho...@gmail.com> wrote:
>> Am I correct in understanding that a farm of EC2 instances with Hadoop and
>> HBase installed and configured individually by myself are the quickest and
>> most effective way to progress with this effort?
>
> Well, you're not wrong. To run HBase on Amazon Web Services, you
> should use EC2 instances and configure them by yourself. Make sure you
> pick Extra Large instances from EC2 (see:
> http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A8), and you may
> also want EBS volumes as the storage devices rather than S3. (S3 is
> good for archiving data)
>
>
> But...
>
> Are you really sure you want to use HBase for your Grail based web
> application on the cloud? I would definitely recommend MySQL which
> should be more suitable for both web applications and Amazon Web
> Services environment. HBase is not a cloud database and is currently
> more suitable for batch processing with billions of records.

This is not a correct assessment. First off, what does it mean to be a "cloud database"? Secondly, HBase is suitable for serving real-time queries, and that is a major use case for us here at StumbleUpon.

>
> If you use HBase for this purpose, you will
>
> -- loose the Object Relational Mapping support from Grails.
> -- have to take care of database transactions and secondary indices by
> yourself.

You do "lose" the transactions (if you even used them) and you may have to maintain secondary indexes, but you gain a flexible, schema-less, column-oriented datastore that scales far beyond anything MySQL can do.

> -- likely suffered from a latency of data retrieval, unless you use memcached.

This is not correct - HBase has good caching built in, and takes full advantage of Linux's disk buffer cache.
This is much more effective than caching in MySQL, because it is easier to get more RAM across 10-20 machines (or more) than in 1-2 machines.

> -- need more server resources than MySQL. MySQL can run on 1 EC2
> instance, while HBase requires about 12 EC2 instances (2 for masters
> and DFS namenodes, 5 for region servers and DFS datanodes, 5 for
> ZooKeeper)

Again, this is not entirely correct; you are over-specifying quite a bit. 3 ZK nodes are fine, and they should be able to run on the "master" nodes. You also reveal a misunderstanding by suggesting to the OP that you can run the namenode on 2 hosts and that is that. The situation for HDFS is (unfortunately) more complicated than that.

It is entirely possible to run an HBase cluster on 4 EC2 instances: 1 master and 3 datanodes. Maybe even fewer, but then you are sacrificing data reliability.

I appreciate your enthusiasm for HBase, but please don't mislead our users so badly!

Thanks,
-ryan

>
>
> Is there any special reason to use HBase for you web application?
>
> Thanks,
>
> --
> Tatsuya Kawano (Mr.)
> Tokyo, Japan
>
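
For illustration only, the two cluster layouts discussed in this thread can be compared with a small sketch. The role names and counts are taken from the messages above. The helper names are hypothetical, and the replication check assumes HDFS's default replication factor of 3, which is one plausible reading of why fewer than 3 datanodes would sacrifice data reliability.

```python
# Hypothetical helpers for comparing the cluster layouts from this thread.
# Role names and counts come from the messages; the function names and the
# replication check are illustrative assumptions, not part of HBase or HDFS.

HDFS_DEFAULT_REPLICATION = 3  # default value of dfs.replication in HDFS


def total_instances(layout):
    """Sum the instance counts of a layout given as {role: count}."""
    return sum(layout.values())


def can_hold_full_replicas(datanodes, replication=HDFS_DEFAULT_REPLICATION):
    """True if every replica of a block can be placed on a distinct datanode."""
    return datanodes >= replication


# Layout proposed in the quoted message.
quoted_layout = {
    "masters_and_namenodes": 2,
    "region_servers_and_datanodes": 5,
    "zookeeper": 5,
}

# Smaller layout from the reply: 1 master plus 3 datanodes.
small_layout = {
    "master": 1,
    "datanodes": 3,
}

print(total_instances(quoted_layout))
print(total_instances(small_layout))
print(can_hold_full_replicas(small_layout["datanodes"]))
print(can_hold_full_replicas(2))
```

Under these assumptions the quoted layout totals 12 instances and the smaller one 4, and dropping below 3 datanodes means a block can no longer get 3 replicas on distinct machines.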