I read something interesting about this in "HBase: The Definitive Guide" (TDG).

Page 344:
"The StoreScanner class combines the store files and memstore that the
Store instance contains. It is also where the exclusion happens, based
on the Bloom filter, or the timestamp. If you are asking for versions
that are not more than 30 minutes old, for example, you can skip all
storage files that are older than one hour: they will not contain
anything of interest. See "Key Design" on page 357 for details on the
exclusion, and how to make use of it."

So I guess it doesn't have to read all the HFiles? But I don't know
whether HBase really uses the timestamp of each row or the date of the
file. My impression is that when I execute the scan it reads everything,
but I don't know why. I think there's something else going on here that
I'm not seeing.
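
If the exclusion really is timestamp-based, then a scan that carries a
time range should trigger it. A minimal sketch of what I mean with the
0.94 Java client (the table name is just an example):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class TimeRangeScan {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "table1");
    try {
      long now = System.currentTimeMillis();
      Scan scan = new Scan();
      // Ask only for cells written in the last 30 minutes; store files
      // whose time range is entirely older should be skippable.
      scan.setTimeRange(now - 30 * 60 * 1000L, now);
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          System.out.println(r);
        }
      } finally {
        scanner.close();
      }
    } finally {
      table.close();
    }
  }
}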


2014-04-11 13:05 GMT+02:00 gortiz <gor...@pragsis.com>:

> Sorry, I didn't get why it should read all the versions and not just
> the newest if they're sorted and you didn't specify any timestamp in
> your filter.
>
>
>
> On 11/04/14 12:13, Anoop John wrote:
>
>> In the storage layer (HFiles in HDFS), all versions of a particular
>> cell are stored together (yes, the KeyValues have to be
>> lexicographically ordered). So during a scan we will have to read all
>> the version data. The storage layer itself doesn't know anything about
>> versions.
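>>
>> To make the ordering concrete, a minimal sketch (my own illustration,
>> with made-up data) that sorts KeyValues with the comparator HBase uses:
>>
>> import java.util.Arrays;
>> import org.apache.hadoop.hbase.KeyValue;
>> import org.apache.hadoop.hbase.util.Bytes;
>>
>> public class KvOrdering {
>>   public static void main(String[] args) {
>>     byte[] row = Bytes.toBytes("row1");
>>     byte[] cf = Bytes.toBytes("cf");
>>     byte[] q = Bytes.toBytes("q");
>>     KeyValue[] kvs = {
>>       new KeyValue(row, cf, q, 1000L, Bytes.toBytes("v1")),
>>       new KeyValue(row, cf, q, 3000L, Bytes.toBytes("v3")),
>>       new KeyValue(row, cf, q, 2000L, Bytes.toBytes("v2")),
>>     };
>>     // The comparator orders by row, family, qualifier, then timestamp
>>     // descending: all versions of a cell sit together, newest first.
>>     Arrays.sort(kvs, KeyValue.COMPARATOR);
>>     for (KeyValue kv : kvs) {
>>       System.out.println(kv + " => " + Bytes.toString(kv.getValue()));
>>     }
>>   }
>> }
>>
>> The scanner has to walk past every one of those adjacent versions to
>> get to the next column or row.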
>>
>> -Anoop-
>>
>> On Fri, Apr 11, 2014 at 3:33 PM, gortiz <gor...@pragsis.com> wrote:
>>
>>> Yes, I have tried with two different values for the max versions
>>> setting: 1000, and the maximum integer value.
>>>
>>> But I want to keep those versions; I don't want to keep just 3.
>>> Imagine that I want to record a new version each minute and store a
>>> whole day: that's 1440 versions.
>>>
>>> Why is HBase going to read all the versions? I thought that if you
>>> don't indicate any version it just reads the newest and skips the
>>> rest. It doesn't make much sense to read all of them if the data is
>>> sorted and the newest version is stored at the top.
>>>
>>>
>>>
>>> On 11/04/14 11:54, Anoop John wrote:
>>>
>>>> What is the max versions setting you have configured for your
>>>> table's column family? When you set such a value, HBase has to keep
>>>> all those versions, and during a scan it will read all of them. In
>>>> 0.94 the default value for max versions is 3. I guess you have set a
>>>> bigger value. If you have not, would you mind testing after a major
>>>> compaction?
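>>>>
>>>> If it helps, a sketch with the 0.94 client API of how to lower the
>>>> setting and trigger a major compaction (table and CF names are
>>>> placeholders):
>>>>
>>>> import org.apache.hadoop.conf.Configuration;
>>>> import org.apache.hadoop.hbase.HBaseConfiguration;
>>>> import org.apache.hadoop.hbase.HColumnDescriptor;
>>>> import org.apache.hadoop.hbase.client.HBaseAdmin;
>>>>
>>>> public class MaxVersions {
>>>>   public static void main(String[] args) throws Exception {
>>>>     Configuration conf = HBaseConfiguration.create();
>>>>     HBaseAdmin admin = new HBaseAdmin(conf);
>>>>     try {
>>>>       HColumnDescriptor hcd = new HColumnDescriptor("cf");
>>>>       hcd.setMaxVersions(3); // keep only 3 versions per cell
>>>>       admin.disableTable("table1");
>>>>       admin.modifyColumn("table1", hcd);
>>>>       admin.enableTable("table1");
>>>>       // rewrite the store files so excess versions are dropped
>>>>       admin.majorCompact("table1");
>>>>     } finally {
>>>>       admin.close();
>>>>     }
>>>>   }
>>>> }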
>>>>
>>>> -Anoop-
>>>>
>>>> On Fri, Apr 11, 2014 at 1:01 PM, gortiz <gor...@pragsis.com> wrote:
>>>>
>>>>> The last test I have done is to reduce the number of versions to
>>>>> 100. So right now I have 100 rows with 100 versions each.
>>>>> The times are (I got the same times for block sizes of 64KB and 1MB):
>>>>> 100 rows x 1000 versions + block cache -> 80s.
>>>>> 100 rows x 1000 versions + no block cache -> 70s.
>>>>>
>>>>> 100 rows x *100* versions + block cache -> 7.3s.
>>>>> 100 rows x *100* versions + no block cache -> 6.1s.
>>>>>
>>>>> What's the reason for this? I assumed HBase was smart enough not to
>>>>> consider old versions and to just check the newest, but I reduced
>>>>> the size (in versions) by 10x and got a 10x performance improvement.
>>>>>
>>>>> The filter is:
>>>>> scan 'filters', {FILTER => "ValueFilter(=, 'binary:5')",
>>>>> STARTROW => '1010000000000000000000000000000000000101',
>>>>> STOPROW => '6010000000000000000000000000000000000201'}
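>>>>>
>>>>> For reference, the same scan through the Java API as I understand
>>>>> it (a sketch; it may not match my shell session exactly):
>>>>>
>>>>> import org.apache.hadoop.hbase.client.Scan;
>>>>> import org.apache.hadoop.hbase.filter.BinaryComparator;
>>>>> import org.apache.hadoop.hbase.filter.CompareFilter;
>>>>> import org.apache.hadoop.hbase.filter.ValueFilter;
>>>>> import org.apache.hadoop.hbase.util.Bytes;
>>>>>
>>>>> public class FilterScan {
>>>>>   // Builds the same scan as the shell command above.
>>>>>   public static Scan build() {
>>>>>     Scan scan = new Scan(
>>>>>         Bytes.toBytes("1010000000000000000000000000000000000101"),
>>>>>         Bytes.toBytes("6010000000000000000000000000000000000201"));
>>>>>     scan.setFilter(new ValueFilter(CompareFilter.CompareOp.EQUAL,
>>>>>         new BinaryComparator(Bytes.toBytes("5"))));
>>>>>     return scan;
>>>>>   }
>>>>> }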
>>>>>
>>>>>
>>>>>
>>>>> On 11/04/14 09:04, gortiz wrote:
>>>>>
>>>>>> Well, I guessed that, but it doesn't make much sense because it's
>>>>>> so slow. Right now I only have 100 rows with 1000 versions each.
>>>>>> I have checked the size of the dataset and each row is about 700KB
>>>>>> (around 7GB in total, 100 rows x 1000 versions). So it should only
>>>>>> check 100 rows x 700KB = 70MB, since it just checks the newest
>>>>>> version. How can it spend so much time checking that quantity of
>>>>>> data?
>>>>>>
>>>>>> I'm generating the dataset again with a bigger block size
>>>>>> (previously it was 64KB; now it's going to be 1MB). I could try
>>>>>> tuning the scanner caching and batching parameters, but I don't
>>>>>> think they're going to make much difference.
>>>>>>
>>>>>> Another test I want to do is to generate the same dataset with
>>>>>> just 100 versions. It should take around the same time, right? Or
>>>>>> am I wrong?
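>>>>>>
>>>>>> In case it helps, this is the kind of scanner tuning I mean (a
>>>>>> sketch; the values are arbitrary examples):
>>>>>>
>>>>>> import org.apache.hadoop.hbase.client.Scan;
>>>>>>
>>>>>> public class TunedScan {
>>>>>>   public static Scan build() {
>>>>>>     Scan scan = new Scan();
>>>>>>     scan.setCaching(500);   // rows fetched per RPC round trip
>>>>>>     scan.setBatch(100);     // cap cells per Result for wide rows
>>>>>>     scan.setMaxVersions(1); // explicitly ask for newest version only
>>>>>>     return scan;
>>>>>>   }
>>>>>> }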
>>>>>>
>>>>>> On 10/04/14 18:08, Ted Yu wrote:
>>>>>>
>>>>>>> It should be the newest version of each value.
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 10, 2014 at 9:55 AM, gortiz <gor...@pragsis.com> wrote:
>>>>>>>
>>>>>>>> Another little question: with the filter I'm using, do I check
>>>>>>>> all the versions, or just the newest? I'm wondering whether,
>>>>>>>> when I do a scan over the whole table, I'm looking for the value
>>>>>>>> "5" in the whole dataset or just in the newest version of each
>>>>>>>> value.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/04/14 16:52, gortiz wrote:
>>>>>>>>
>>>>>>>>> I was trying to check the behaviour of HBase. The cluster is a
>>>>>>>>> group of old computers: one master and five slaves, each one
>>>>>>>>> with 2GB, so 12GB in total.
>>>>>>>>> The table has a column family with 1000 columns and each column
>>>>>>>>> with 100 versions.
>>>>>>>>> There's another column family with four columns and one image
>>>>>>>>> of 100KB. (I've tried without this column family as well.)
>>>>>>>>> The table is partitioned manually across all the slaves, so the
>>>>>>>>> data is balanced in the cluster.
>>>>>>>>>
>>>>>>>>> I'm executing this sentence in HBase 0.94.6:
>>>>>>>>> scan 'table1', {FILTER => "ValueFilter(=, 'binary:5')"}
>>>>>>>>> My lease and RPC timeouts are set to three minutes.
>>>>>>>>> Since it's a full scan of the table, I have been playing with
>>>>>>>>> the BLOCKCACHE as well (just disabling and enabling it, not
>>>>>>>>> changing its size). I thought it was going to cause too many GC
>>>>>>>>> calls; I'm not sure about this point.
>>>>>>>>>
>>>>>>>>> I know that it's not the best way to use HBase; it's just a
>>>>>>>>> test. I think it's not working because the hardware isn't
>>>>>>>>> enough, although I would like to try some kind of tuning to
>>>>>>>>> improve it.
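>>>>>>>>>
>>>>>>>>> For reference, toggling the block cache for a single scan from
>>>>>>>>> the Java client (a sketch of what my test does):
>>>>>>>>>
>>>>>>>>> import org.apache.hadoop.hbase.client.Scan;
>>>>>>>>> import org.apache.hadoop.hbase.filter.BinaryComparator;
>>>>>>>>> import org.apache.hadoop.hbase.filter.CompareFilter;
>>>>>>>>> import org.apache.hadoop.hbase.filter.ValueFilter;
>>>>>>>>> import org.apache.hadoop.hbase.util.Bytes;
>>>>>>>>>
>>>>>>>>> public class FullScanNoCache {
>>>>>>>>>   public static Scan build() {
>>>>>>>>>     Scan scan = new Scan();
>>>>>>>>>     // avoid churning the LRU block cache on a full scan
>>>>>>>>>     scan.setCacheBlocks(false);
>>>>>>>>>     scan.setFilter(new ValueFilter(CompareFilter.CompareOp.EQUAL,
>>>>>>>>>         new BinaryComparator(Bytes.toBytes("5"))));
>>>>>>>>>     return scan;
>>>>>>>>>   }
>>>>>>>>> }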
>>>>>>>>>
>>>>>>>>> On 10/04/14 14:21, Ted Yu wrote:
>>>>>>>>>
>>>>>>>>>> Can you give us a bit more information:
>>>>>>>>>>
>>>>>>>>>> - HBase release you're running
>>>>>>>>>> - What filters are used for the scan
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> On Apr 10, 2014, at 2:36 AM, gortiz <gor...@pragsis.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I got this error when I executed a full scan with filters on
>>>>>>>>>>> a table:
>>>>>>>>>>>
>>>>>>>>>>> Caused by: java.lang.RuntimeException:
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.LeaseException:
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.LeaseException: lease
>>>>>>>>>>> '-4165751462641113359' does not exist
>>>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:231)
>>>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2482)
>>>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>>>>   at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
>>>>>>>>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1428)
>>>>>>>>>>> I have read about increasing the lease time and the RPC
>>>>>>>>>>> timeout, but it's not working. What else could I try? The
>>>>>>>>>>> table isn't too big. I have been checking the logs from the
>>>>>>>>>>> GC, the HMaster, and some RegionServers, and I didn't see
>>>>>>>>>>> anything weird. I also tried a couple of different caching
>>>>>>>>>>> values.
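>>>>>>>>>>>
>>>>>>>>>>> For reference, this is how I'm raising the timeouts on the
>>>>>>>>>>> client side (a sketch; as far as I understand, the lease
>>>>>>>>>>> period must also be raised in the region servers'
>>>>>>>>>>> hbase-site.xml, since the lease is enforced there):
>>>>>>>>>>>
>>>>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>>>>> import org.apache.hadoop.hbase.HBaseConfiguration;
>>>>>>>>>>>
>>>>>>>>>>> public class Timeouts {
>>>>>>>>>>>   public static Configuration configure() {
>>>>>>>>>>>     Configuration conf = HBaseConfiguration.create();
>>>>>>>>>>>     // scanner lease period (three minutes)
>>>>>>>>>>>     conf.setLong("hbase.regionserver.lease.period", 180000L);
>>>>>>>>>>>     // keep the RPC timeout >= the lease period so the client
>>>>>>>>>>>     // doesn't give up before the server does
>>>>>>>>>>>     conf.setLong("hbase.rpc.timeout", 180000L);
>>>>>>>>>>>     return conf;
>>>>>>>>>>>   }
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> Lowering the scanner caching should also reduce the time each
>>>>>>>>>>> next() call spends on the server, which is what the lease is
>>>>>>>>>>> timing.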
>>>>>>>>>>>
>
> --
> *Guillermo Ortiz*
> /Big Data Developer/
>
> Telf.: +34 917 680 490
> Fax: +34 913 833 301
> C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
>
> _http://www.bidoop.es_
>
>
