Thank you, Zahoor. Two more comments:
1. After reading the materials you sent me, I am confused about how a Bloom filter saves I/O during random reads. Suppose I am not using a Bloom filter: to find out whether a row (or row key) exists, we need to search the index block, which sits near the end of an HFile. That search is a binary search in memory (I think the index block is always in memory; please feel free to correct me if I am wrong), so it should be pretty fast. With a Bloom filter, we could be a bit faster by looking up the Bloom filter bit vector in memory. Since both the index block binary search and the Bloom filter lookup are done in memory (no I/O is involved), what kind of I/O is saved? :-)

2.
> One Hadoop job doing random reads is perfectly fine. But since you said
> "Handling directly user traffic"... I assumed you wanted to expose HBase
> independently to every client request, thereby having as many connections
> as the number of simultaneous requests.

Sorry, I need to confirm this point again. I think you mean that establishing a new connection for each request is not good, and that using a connection pool or asynchronous I/O is preferred?

regards,
Lin

On Tue, Aug 21, 2012 at 10:45 PM, jmozah <[email protected]> wrote:

> > 1. I know only the very basics of Bloom filters, which are used to detect
> > whether an item is in a set. How do I use Bloom filters in HBase to improve
> > random read performance? Could you show me an example? Thanks.
>
> This will help omit loading the blocks (thereby saving IO and cache churn)
> that do not have the given row.
> For more on bloom, see
> 1 - https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf
> 2 - http://www.quora.com/How-are-bloom-filters-used-in-HBase
>
> > 2. "Also more client connections is one more issue that might infest
> > you" -- supposing I am doing random reads from a Hadoop job to access HBase,
> > do you mean using multiple client connections from the Hadoop job is good
> > or not good? Sorry, I am a bit lost. :-)
>
> One Hadoop job doing random reads is perfectly fine. But since you said
> "Handling directly user traffic"... I assumed you wanted to expose HBase
> independently to every client request, thereby having as many connections
> as the number of simultaneous requests.
>
> > 3. "asynchbase will help you" -- does HBase support an asynchronous API?
> > Sorry, I cannot find it. I would appreciate it if you could point me to
> > the APIs you are referring to.
>
> Not the default HTable API. asynchbase is another client for HBase. Read
> more about asynchbase here (https://github.com/stumbleupon/asynchbase)
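
On question 1 (what I/O a Bloom filter saves), here is a toy sketch in Python, not HBase code. It rests on one point: the in-memory block index only says which block a row key would fall into, not whether the key is actually there, so without a Bloom filter that candidate block still has to be loaded from each store file before the lookup can answer "not found". ToyHFile, BloomFilter, count_block_reads, and the block_reads counter are hypothetical names made up for this illustration. The sketch also assumes that a get checks every file and that every block read goes to disk (no block cache).

```python
import bisect
import hashlib


class BloomFilter:
    """Tiny Bloom filter: k hash positions per key in an m-slot bit vector."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


class ToyHFile:
    """Hypothetical stand-in for one store file: sorted blocks, index, Bloom filter."""

    def __init__(self, row_keys, block_size=4):
        keys = sorted(row_keys)
        self.blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
        self.index = [block[0] for block in self.blocks]  # first key per block, in memory
        self.bloom = BloomFilter()
        for key in keys:
            self.bloom.add(key)
        self.block_reads = 0  # simulated disk reads

    def _read_block(self, block_no):
        self.block_reads += 1  # stands in for the disk I/O of loading a data block
        return self.blocks[block_no]

    def get(self, key, use_bloom):
        if use_bloom and not self.bloom.might_contain(key):
            return False  # skip this file without loading any block
        block_no = bisect.bisect_right(self.index, key) - 1  # in-memory binary search
        if block_no < 0:
            return False  # key sorts before the first block
        return key in self._read_block(block_no)  # only the block itself can confirm


def count_block_reads(files, key, use_bloom):
    """Look the key up in every file; return (found, total simulated block reads)."""
    for f in files:
        f.block_reads = 0
    found = any([f.get(key, use_bloom) for f in files])  # list forces a check of all files
    return found, sum(f.block_reads for f in files)


files = [
    ToyHFile([f"row{n:04d}" for n in range(0, 100, 2)]),  # even rows
    ToyHFile([f"row{n:04d}" for n in range(1, 100, 2)]),  # odd rows
    ToyHFile([f"row{n:04d}" for n in range(100, 200)]),
]

for key in ("row0042", "row0042x"):  # one existing key, one missing key
    print(key, "without bloom:", count_block_reads(files, key, use_bloom=False))
    print(key, "with bloom:   ", count_block_reads(files, key, use_bloom=True))
```

Without the filter, the index search in the "odd rows" file still points at a candidate block for row0042, and that block is loaded only to discover the row is not in it. With the filter, such files are normally skipped before any block is touched, apart from occasional false positives. If this picture is right, the saving should grow with the number of store files per region and with the share of lookups for rows that do not exist.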
