If you need to search by both row and column qualifier, you can pick a row+col bloom (ROWCOL) to help you skip blocks.
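To make the difference concrete, here is a rough Python sketch (not HBase code) of why a row-only bloom cannot rule out a file for a missing qualifier while a row+col bloom usually can. The `BloomFilter` class, the `rowcol_key` helper and the `\x00` separator are all made up for illustration; HBase builds its bloom keys differently internally.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def rowcol_key(row, qualifier):
    # Hypothetical key layout for the sketch: row, separator, qualifier.
    return row + b"\x00" + qualifier


# Cells stored in one (imaginary) HFile.
cells = [(b"row1", b"colA"), (b"row2", b"colB")]

row_bloom = BloomFilter()      # keyed on row only
rowcol_bloom = BloomFilter()   # keyed on row + qualifier
for row, qualifier in cells:
    row_bloom.add(row)
    rowcol_bloom.add(rowcol_key(row, qualifier))

# Query for a qualifier that does not exist in a row that does exist.
must_read_with_row_bloom = row_bloom.might_contain(b"row1")
must_read_with_rowcol_bloom = rowcol_bloom.might_contain(
    rowcol_key(b"row1", b"colZ"))
```

With the row-only bloom the file still has to be read, because `row1` is in it; the row+col bloom lets the reader skip it unless a false positive occurs, which is always possible with a bloom.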
./Zahoor@iPad

On 22-Aug-2012, at 10:58 PM, "Pamecha, Abhishek" <apame...@x.com> wrote:

> Great explanation. Maybe diverging from the thread's original question,
> but could you also explain the difference, if any, between searching
> for a rowkey [that you mentioned below] and searching for a specific
> column qualifier? Are there any optimizations for column qualifier
> search too, or does one just need to load all blocks that match the
> rowkey criteria and then scan each one of them from start to end?
>
> Thanks,
> Abhishek
>
>
> -----Original Message-----
> From: Anoop Sam John [mailto:anoo...@huawei.com]
> Sent: Wednesday, August 22, 2012 5:35 AM
> To: user@hbase.apache.org; J Mohamed Zahoor
> Subject: RE: Using HBase serving to replace memcached
>
>> I could be wrong. I think HFile index block (which is located at the
>> end of HFile) is a binary search tree containing all row-key values
>> (of the HFile) in the binary search tree. Searching a specific
>> row-key in the binary search tree could easily find whether a
>> row-key exists (some node in the tree has the same row-key value) or
>> not. Why do we need to load every block to find if the row exists?
>
> I think there is some confusion regarding the blooms and the block
> index. I will try to clarify this point.
>
> The block index will be there with every HFile. Within an HFile the
> data is written as multiple blocks, and HBase reads data from the
> HDFS layer only block by block. The block index contains information
> about the blocks within that HFile. This information includes the
> start and end rowkeys which reside in each particular block and block
> information like the offset of that block, its length, etc.
>
> Now when a request comes for getting a rowkey 'x', all the HFiles
> within that region need to be checked. [The KV can be present in any
> of the HFiles.] In order to know which block within an HFile this row
> will be present in, the block index is used. This block index will
> always be there in memory. This lookup tells only the possible block
> in which the row is present. HBase will load that block and read
> through it to get the row we are interested in.
>
> The bloom has information about each and every row added into that
> HFile. [The block index won't have info about each and every row.]
> This bloom information will always be there in memory. So when a read
> request to get row 'x' in an HFile comes, first the bloom is checked
> to see whether this row is in this file or not. If it is not there,
> as per the bloom, no block at all will be fetched. But if the bloom
> is not enabled, we might find one block which has a row range such
> that 'x' falls within it, and HBase will load that block. So using
> blooms can avoid this IO. Hope this is clear for you now.
>
> -Anoop-
> ________________________________________
> From: Lin Ma [lin...@gmail.com]
> Sent: Wednesday, August 22, 2012 5:41 PM
> To: J Mohamed Zahoor; user@hbase.apache.org
> Subject: Re: Using HBase serving to replace memcached
>
> Thanks Zahoor,
>
> I read through the document you referred to. I am confused about what
> leaf-level index, intermediate-level index and root-level index mean.
> I would appreciate it if you could give more details on what they
> are, or point me to the related documents.
>
> BTW: the document you pointed me to is very good; however, I am
> missing some basic background on the 3 terms I mentioned above. :-)
>
> regards,
> Lin
>
> On Wed, Aug 22, 2012 at 12:51 PM, J Mohamed Zahoor <jmo...@gmail.com> wrote:
>
>>> I could be wrong. I think HFile index block (which is located at
>>> the end of HFile) is a binary search tree containing all row-key
>>> values (of the HFile) in the binary search tree. Searching a
>>> specific row-key in the binary search tree could easily find
>>> whether a row-key exists (some node in the tree has the same
>>> row-key value) or not. Why do we need to load every block to find
>>> if the row exists?
>>
>> Hmm...
>> It is a multilevel index. Only the root indexes (Data, Meta, etc.)
>> are loaded when a region is opened. The rest of the tree
>> (intermediate and leaf indexes) is present at each block level.
>> I am assuming HFile v2 here for the discussion.
>> Read this for more clarity: http://hbase.apache.org/book/apes03.html
>>
>> Nice discussion. You made me read a lot of things. :-) Now I will
>> dig into the code and check this out.
>>
>> ./Zahoor
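
For anyone following the thread, the read path described above (bloom check first, then a block index lookup that yields one candidate block, then a scan of that block) can be sketched roughly as below. This is a toy model, not HBase code: `ToyHFile`, `IndexEntry` and `blocks_loaded` are invented names, the index is a flat single-level list rather than the multilevel HFile v2 index, and an exact Python set stands in for the row bloom (a real bloom can return false positives).

```python
import bisect
from dataclasses import dataclass


@dataclass
class IndexEntry:
    first_row: bytes   # first rowkey stored in the block
    offset: int        # here simply the block number (assumption)
    length: int        # number of rows in the block


class ToyHFile:
    def __init__(self, rows, rows_per_block=2):
        rows = sorted(rows)
        # Data is written as a sequence of sorted blocks.
        self.blocks = [rows[i:i + rows_per_block]
                       for i in range(0, len(rows), rows_per_block)]
        # One index entry per block; the index never lists every row.
        self.index = [IndexEntry(block[0], n, len(block))
                      for n, block in enumerate(self.blocks)]
        self.first_rows = [entry.first_row for entry in self.index]
        # Stand-in for the row bloom: knows about every row in the file.
        self.row_filter = set(rows)
        self.blocks_loaded = 0

    def get(self, row, use_bloom=True):
        # 1. Bloom check: a "no" means no block is fetched at all.
        if use_bloom and row not in self.row_filter:
            return False
        # 2. Block index: binary search for the only block that could hold row.
        i = bisect.bisect_right(self.first_rows, row) - 1
        if i < 0:
            return False  # row sorts before the first block
        # 3. Load that block and scan it for the row.
        entry = self.index[i]
        self.blocks_loaded += 1
        return row in self.blocks[entry.offset]


hfile = ToyHFile([b"a", b"c", b"e", b"g"])
found_c = hfile.get(b"c")                        # present: one block loaded
missing_no_bloom = hfile.get(b"d", use_bloom=False)  # absent: block loaded anyway
missing_with_bloom = hfile.get(b"d")             # absent: bloom avoids the load
```

The lookup for `d` without the bloom still loads the block covering the range `a` to `c`, which is exactly the IO the bloom saves.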