If you need to search by both row key and column qualifier, you can pick the 
row+col (ROWCOL) bloom to help you skip blocks.
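
As a rough illustration of the difference between a row bloom and a row+col bloom, here is a minimal Python sketch. It is a toy model, not HBase code: `ToyBloom`, `bloom_key`, and `build_bloom` are hypothetical names, and the key layout is an assumption chosen only to show that a row+col bloom is keyed on the (row, qualifier) pair while a row bloom is keyed on the row alone.

```python
import hashlib


class ToyBloom:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def bloom_key(row, qualifier, bloom_type):
    """Build the key stored in the bloom (assumed layout, for illustration)."""
    if bloom_type == "ROW":
        return row  # the qualifier is ignored
    if bloom_type == "ROWCOL":
        # Length-prefix the row so (row, qualifier) pairs stay unambiguous.
        return len(row).to_bytes(2, "big") + row + qualifier
    raise ValueError("unknown bloom type: " + bloom_type)


def build_bloom(cells, bloom_type):
    bloom = ToyBloom()
    for row, qualifier in cells:
        bloom.add(bloom_key(row, qualifier, bloom_type))
    return bloom


# Cells written to one hypothetical file.
cells = [(b"row1", b"name"), (b"row1", b"email"), (b"row2", b"name")]
row_bloom = build_bloom(cells, "ROW")
rowcol_bloom = build_bloom(cells, "ROWCOL")
```

A row bloom can only say whether `row1` may be in the file, so a lookup for a qualifier that was never written to `row1` still passes the filter. A row+col bloom is asked about the exact (row, qualifier) pair, so such a lookup is very likely to be rejected before any block is read.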

./Zahoor@iPad

On 22-Aug-2012, at 10:58 PM, "Pamecha, Abhishek" <apame...@x.com> wrote:

> Great explanation. This may be diverging from the thread's original 
> question, but could you also explain the difference, if any, between 
> searching for a rowkey [as you mentioned below] and searching for a specific 
> column qualifier? Are there any optimizations for column qualifier search 
> too, or does one just need to load all blocks that match the rowkey criteria 
> and then scan each of them from start to end?
> 
> Thanks,
> Abhishek
> 
> 
> -----Original Message-----
> From: Anoop Sam John [mailto:anoo...@huawei.com] 
> Sent: Wednesday, August 22, 2012 5:35 AM
> To: user@hbase.apache.org; J Mohamed Zahoor
> Subject: RE: Using HBase serving to replace memcached
> 
>> I could be wrong. I think HFile index block (which is located at the end 
>> of HFile) is a binary search tree containing all row-key values (of the 
>> HFile) in the binary search tree. Searching a specific row-key in the 
>> binary search tree could easily find whether a row-key exists (some 
>> node in the tree has the same row-key value) or not. Why we need load 
>> every block to find if the row exists?
> 
> I think there is some confusion about the blooms and the block index. I will 
> try to clarify this point.
> 
> Every HFile has a block index. Within an HFile the data is written as 
> multiple blocks, and HBase reads data from the HDFS layer one block at a 
> time. The block index holds information about the blocks within that HFile: 
> the start and end rowkeys that reside in each block, plus block details such 
> as its offset and length. When a request comes in to get a rowkey 'x', all 
> the HFiles within that region need to be checked [the KV can be present in 
> any of the HFiles]. To find which block within an HFile may hold the row, 
> the block index is used. The block index is always kept in memory. The 
> lookup only tells us the one block in which the row could be present; HBase 
> then loads that block and reads through it to get the row we are interested 
> in.
> 
> The bloom has information about each and every row added to that HFile [the 
> block index won't have info about each and every row]. The bloom information 
> is also always kept in memory. So when a read request for row 'x' reaches an 
> HFile, the bloom is checked first to see whether the row is in this file. If 
> the bloom says it is not, no block at all is fetched. If the bloom is not 
> enabled, we might find a block whose row range covers 'x', and HBase will 
> load that block even though the row is not in it. So using blooms can avoid 
> this IO. Hope this is clear now.
> 
> -Anoop-
> ________________________________________
> From: Lin Ma [lin...@gmail.com]
> Sent: Wednesday, August 22, 2012 5:41 PM
> To: J Mohamed Zahoor; user@hbase.apache.org
> Subject: Re: Using HBase serving to replace memcached
> 
> Thanks Zahoor,
> 
> I read through the document you referred to, but I am confused about what 
> leaf-level index, intermediate-level index and root-level index mean. I 
> would appreciate it if you could give more details on what they are, or 
> point me to the related documents.
> 
> BTW: the document you pointed me to is very good, but I am missing some basic 
> background on the 3 terms I mentioned above. :-)
> 
> regards,
> Lin
> 
> On Wed, Aug 22, 2012 at 12:51 PM, J Mohamed Zahoor <jmo...@gmail.com> wrote:
> 
>>> I could be wrong. I think HFile index block (which is located at the end 
>>> of HFile) is a binary search tree containing all row-key values (of the 
>>> HFile) in the binary search tree. Searching a specific row-key in the 
>>> binary search tree could easily find whether a row-key exists (some 
>>> node in the tree has the same row-key value) or not. Why we need load 
>>> every block to find if the row exists?
>>> 
>>> 
>> Hmm...
>> It is a multilevel index. Only the root indexes (data, meta, etc.) are 
>> loaded when a region is opened. The rest of the tree (intermediate and 
>> leaf indexes) is stored at the block level.
>> I am assuming HFile v2 here for the discussion.
>> Read this for more clarity: http://hbase.apache.org/book/apes03.html
>> 
>> Nice discussion. You made me read a lot of things. :-) Now I will dig into 
>> the code and check this out.
>> 
>> ./Zahoor
>> 
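
The read path described in the quoted explanation (check the bloom first, then use the block index to pick the single candidate block) can be sketched in Python as below. This is a simplified model, not HBase's implementation: the index layout (first row key, offset, length), the sample keys, and the function names are all assumptions, and a plain set stands in for the bloom, so it has no false positives.

```python
from bisect import bisect_right

# Hypothetical block index for one file: (first_row_key, offset, length),
# sorted by first_row_key. Real index entries may carry different fields.
block_index = [
    (b"apple", 0, 4096),
    (b"grape", 4096, 4096),
    (b"peach", 8192, 2048),
]
first_keys = [entry[0] for entry in block_index]

# Rows actually stored in the file; used here as an exact stand-in for a bloom.
rows_in_file = {b"apple", b"banana", b"grape", b"peach", b"plum"}


def candidate_block(row):
    """Binary-search the index for the only block whose range could hold row."""
    i = bisect_right(first_keys, row) - 1
    if i < 0:
        return None  # row sorts before the first key in the file
    return block_index[i]


def blocks_to_read(row, bloom=None):
    """Return the blocks that would be loaded for a get on this file."""
    if bloom is not None and row not in bloom:
        return []  # bloom says the row is absent: no block IO at all
    block = candidate_block(row)
    return [] if block is None else [block]
```

For a row such as `b"kiwi"`, which falls inside the key range of the `grape` block but was never written, `blocks_to_read(b"kiwi")` still returns that block, while `blocks_to_read(b"kiwi", bloom=rows_in_file)` returns an empty list. That difference is the IO the bloom saves.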
