On Wed, Aug 22, 2012 at 10:20 AM, Pamecha, Abhishek <apame...@x.com> wrote:

> So then a GET query means one needs to look in every HFile where key falls
> within the min/max range of the file.
>
> From another parallel thread, I gather that an HFile is composed of blocks, and
> that a block is, I think, the atomic unit of persisted data in HDFS (please
> correct me if not).
>
> And that each block for a HFile has a range of keys. My key can satisfy
> the range for the block and yet may not be present. So, all the blocks that
> match the range for my key, will need to be scanned. There is one block
> index per HFile which sorts blocks by key ranges. This index helps in
> reducing the number of blocks to scan by extracting only those blocks whose
> ranges satisfy the key.
>
> In this case, if puts arrive in random order, each block may have a similar
> range, and it may turn out that HBase needs to scan every block in the
> file. This may not be good for performance.
>
> I just want to validate my understanding.
>
>
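
A note on the block question above: keys inside a single HFile are sorted, so its blocks cover disjoint key ranges, and the block index can be binary-searched to find the one block that could hold a given key. The Python below is only a toy sketch of that idea, not HBase code; the helper names (build_blocks, build_block_index, find_candidate_block) and the tiny block size are made up for illustration.

```python
from bisect import bisect_right

def build_blocks(sorted_keys, block_size):
    """Split an already-sorted key list into consecutive blocks."""
    return [sorted_keys[i:i + block_size]
            for i in range(0, len(sorted_keys), block_size)]

def build_block_index(blocks):
    """Toy block index: the first key of every block, in file order."""
    return [block[0] for block in blocks]

def find_candidate_block(index, key):
    """Return the position of the only block that could hold `key`,
    or None if `key` sorts before the first key in the file."""
    pos = bisect_right(index, key) - 1
    return pos if pos >= 0 else None

def get(blocks, index, key):
    """Look up `key` by reading at most one block."""
    pos = find_candidate_block(index, key)
    if pos is None:
        return False
    return key in blocks[pos]

# Keys arrive in arbitrary order but are sorted before being written.
keys = sorted(["A", "B", "J", "M", "C", "D", "K", "P"])
blocks = build_blocks(keys, block_size=3)   # hypothetical block size
index = build_block_index(blocks)
```

With this layout, a lookup for "J" touches only the second block, and a lookup for a missing key such as "E" also touches exactly one block before reporting a miss. The overlap problem shows up across files, not across blocks within one file.
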
If you have such a use case, I think the best practice is to use Bloom filters.
In general, I think it's a good idea to enable Bloom filters at least at the
row level.
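
To illustrate why a row-level Bloom filter helps here, below is a minimal Python sketch. It is not the HBase implementation: the BloomFilter and StoreFile classes, the bit-array size, and the SHA-256 based hashing are all assumptions chosen to keep the example short. The point is only that a filter can answer "definitely not in this file" without reading any block.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter; sizes and hashing scheme are illustrative only."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive several bit positions per key by salting the hash input.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))

class StoreFile:
    """Hypothetical stand-in for one sorted, immutable file of row keys."""

    def __init__(self, row_keys):
        self.rows = sorted(row_keys)
        self.bloom = BloomFilter()
        for row in self.rows:
            self.bloom.add(row)

def files_to_read(store_files, row):
    """Keep only files whose key range and Bloom filter both admit `row`."""
    return [f for f in store_files
            if f.rows[0] <= row <= f.rows[-1] and f.bloom.might_contain(row)]

file1 = StoreFile(["A", "B", "J", "M"])
file2 = StoreFile(["C", "D", "K", "P"])
candidates = files_to_read([file1, file2], "K")
```

Row "K" falls inside the min/max range of both files, so a range check alone would read both. The filter never gives a false negative, so file2 is always among the candidates; file1 is skipped unless the filter happens to report a false positive.
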

> Thanks,
> Abhishek
>
>
> -----Original Message-----
> From: lars hofhansl [mailto:lhofha...@yahoo.com]
> Sent: Tuesday, August 21, 2012 5:55 PM
> To: user@hbase.apache.org
> Subject: Re: HBase Put
>
> That is correct.
>
>
>
> ________________________________
> From: "Pamecha, Abhishek" <apame...@x.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>; lars hofhansl <
> lhofha...@yahoo.com>
> Sent: Tuesday, August 21, 2012 4:45 PM
> Subject: RE: HBase Put
>
> Hi Lars,
>
> Thanks for the explanation. I still have a little doubt:
>
> Based on your description, given gets do a merge sort, the data on disk is
> not kept sorted across files, but just sorted within a file.
>
> So, basically if on two separate days, say these keys get inserted:
>
> Day1: File1:   A B J M
> Day2: File2:  C D K P
>
> Then each file is sorted within itself, but scanning both files will
> require HBase to use a merge sort to produce a sorted result. Right?
>
> Also, File 1 and File2 are immutable, and during compactions, File 1 and
> File2 are compacted and sorted using merge sort to a bigger File3. Is that
> correct too?
>
> Thanks,
> Abhishek
>
>
> -----Original Message-----
> From: lars hofhansl [mailto:lhofha...@yahoo.com]
> Sent: Tuesday, August 21, 2012 4:07 PM
> To: user@hbase.apache.org
> Subject: Re: HBase Put
>
> In a nutshell:
> - Puts are collected in memory (in a sorted data structure)
> - When the collected data reaches a certain size it is flushed to a new
> file (which is sorted)
> - Gets do a merge sort between the various files that have been created
> - To keep the number of files in check, they are periodically compacted
> into fewer, larger files
>
>
> So the data files (HFiles) are immutable once written; changes are batched
> in memory first.
>
> -- Lars
>
>
>
> ________________________________
> From: "Pamecha, Abhishek" <apame...@x.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>
> Sent: Tuesday, August 21, 2012 4:00 PM
> Subject: HBase Put
>
> Hi
>
> I have a question about the HBase Put call. In the scenario where data is
> inserted with column qualifiers in no particular order, how does HBase
> maintain sortedness wrt column qualifiers in its store files/blocks?
>
> I checked the code base and I can see checks<
> https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L319>
> being made for lexicographic insertion of key-value pairs. But I can't
> seem to find out how the key-offset is calculated in the first place?
>
> Also, given that HDFS is by nature append-only, how do randomly ordered
> keys make their way into sorted order? Is it only during minor/major
> compactions that this sortedness gets applied, and is there a small window
> during which data is not sorted?
>
>
> Thanks,
> Abhishek
>
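
For readers following the thread, the write path summarized in Lars's reply (sorted in-memory buffer, flush to a new sorted immutable file, merge on read, periodic compaction) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not HBase code: the Store class, its flush_threshold parameter, and the use of plain string keys are hypothetical, and versions, deletes, and duplicate keys are ignored.

```python
import heapq
from bisect import insort

class Store:
    """Toy model: a sorted in-memory buffer plus immutable sorted files."""

    def __init__(self, flush_threshold=4):
        self.flush_threshold = flush_threshold
        self.memstore = []   # kept sorted on every insert
        self.files = []      # each file is an immutable, sorted tuple

    def put(self, key):
        # Insert in sorted position, so arrival order does not matter.
        insort(self.memstore, key)
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffer out as a new sorted file, then start fresh.
        if self.memstore:
            self.files.append(tuple(self.memstore))
            self.memstore = []

    def scan(self):
        # Every input is already sorted, so a k-way merge yields sorted output.
        return list(heapq.merge(self.memstore, *self.files))

    def compact(self):
        # Merge all existing files into one larger sorted file.
        if len(self.files) > 1:
            self.files = [tuple(heapq.merge(*self.files))]

store = Store(flush_threshold=4)
for key in ["M", "A", "J", "B"]:   # day 1, arbitrary arrival order
    store.put(key)
for key in ["P", "C", "K", "D"]:   # day 2, arbitrary arrival order
    store.put(key)
```

After these puts the store holds two files, (A, B, J, M) and (C, D, K, P), matching the Day1/Day2 example in the thread. Calling store.scan() merges them into A B C D J K M P, and store.compact() rewrites them as a single sorted file. In this model data is sorted at every point: in memory, in each file, and in every read.
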
