In the storage layer (HFiles in HDFS), all versions of a particular cell
are stored together (yes, as lexicographically ordered KVs). So during a
scan we have to read all the version data; the storage layer itself knows
nothing about version semantics.
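
Concretely, here is a minimal Java sketch (HBase 0.94 client API; the row,
family, and qualifier names are invented for illustration) of the KeyValue
ordering HFiles use: all versions of a cell sort adjacently, newest
timestamp first, which is why a scan has to walk past every version.

import java.util.Arrays;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

public class KvOrdering {
    public static void main(String[] args) {
        byte[] row = Bytes.toBytes("row1");
        byte[] cf = Bytes.toBytes("cf");
        byte[] col = Bytes.toBytes("col");
        // Three versions of the same cell, written out of order.
        KeyValue[] kvs = {
            new KeyValue(row, cf, col, 1L, Bytes.toBytes("v1")),
            new KeyValue(row, cf, col, 3L, Bytes.toBytes("v3")),
            new KeyValue(row, cf, col, 2L, Bytes.toBytes("v2")),
        };
        // Sort with the comparator HFiles use: row, then column, then
        // timestamp descending.
        Arrays.sort(kvs, KeyValue.COMPARATOR);
        for (KeyValue kv : kvs) {
            // Prints 3, 2, 1: versions are contiguous, newest first,
            // so a scan steps through every version of a cell.
            System.out.println(kv.getTimestamp());
        }
    }
}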

-Anoop-

On Fri, Apr 11, 2014 at 3:33 PM, gortiz <gor...@pragsis.com> wrote:

> Yes, I have tried two different values for the max-versions setting:
> 1000 and the maximum integer value.
>
> But I want to keep those versions; I don't want to keep just 3.
> Imagine that I want to record a new version each minute and store a
> day's worth: that is 1440 versions.
>
> Why does HBase read all the versions? I thought that if you don't ask
> for specific versions it just reads the newest and skips the rest. It
> doesn't make much sense to read all of them if the data is sorted and
> the newest version is stored at the top.
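
As a reference point, a minimal sketch of asking for only the newest
version with the 0.94 Java client (table name 'table1' assumed for
illustration; error handling trimmed). Note that setMaxVersions(1), the
default, controls what is returned to the client; the region server still
has to step over the older KVs on disk, which is the cost being discussed
here.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class NewestVersionScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "table1");
        Scan scan = new Scan();
        // Only the newest version per cell is returned to the client;
        // older versions are still read and skipped server-side.
        scan.setMaxVersions(1);
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                System.out.println(r);
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}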
>
>
>
> On 11/04/14 11:54, Anoop John wrote:
>
>> What max-versions setting have you used for your table's CF? When you
>> set such a value, HBase has to keep all those versions, and during a
>> scan it will read all of them. In 0.94 the default for max versions is
>> 3. I guess you have set some bigger value. If you have not, would you
>> mind testing after a major compaction?
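
A hedged sketch of both checks suggested here, using the 0.94 HBaseAdmin
API (table name 'table1' assumed from the thread): print each CF's
max-versions setting, then request a major compaction, which rewrites the
HFiles and drops versions beyond the configured maximum.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckVersionsAndCompact {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc =
            admin.getTableDescriptor(Bytes.toBytes("table1"));
        for (HColumnDescriptor cf : desc.getFamilies()) {
            System.out.println(cf.getNameAsString()
                + " maxVersions=" + cf.getMaxVersions());
        }
        // Asynchronous request; excess versions are physically dropped
        // only once the major compaction has actually run.
        admin.majorCompact("table1");
        admin.close();
    }
}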
>>
>> -Anoop-
>>
>> On Fri, Apr 11, 2014 at 1:01 PM, gortiz <gor...@pragsis.com> wrote:
>>
>>> The last test I have done is to reduce the number of versions to 100.
>>> So, right now, I have 100 rows with 100 versions each.
>>> Times are (I got the same times for block sizes of 64KB and 1MB):
>>> 100row-1000versions + blockcache -> 80s.
>>> 100row-1000versions + no blockcache -> 70s.
>>>
>>> 100row-*100*versions + blockcache -> 7.3s.
>>> 100row-*100*versions + no blockcache -> 6.1s.
>>>
>>> What's the reason for this? I guessed HBase was smart enough not to
>>> consider old versions and just check the newest. But I reduced the
>>> size (in versions) by 10x and got a 10x performance improvement.
>>>
>>> The filter is: scan 'filters', {FILTER => "ValueFilter(=, 'binary:5')",
>>> STARTROW => '1010000000000000000000000000000000000101',
>>> STOPROW => '6010000000000000000000000000000000000201'}
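
For completeness, the same scan expressed with the 0.94 Java client (a
sketch; the table name and row keys are copied from the shell command
above, error handling trimmed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.ValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilterScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "filters");
        Scan scan = new Scan(
            Bytes.toBytes("1010000000000000000000000000000000000101"),
            Bytes.toBytes("6010000000000000000000000000000000000201"));
        // Keep only cells whose value equals the bytes of "5".
        scan.setFilter(new ValueFilter(CompareFilter.CompareOp.EQUAL,
            new BinaryComparator(Bytes.toBytes("5"))));
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                System.out.println(r);
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}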
>>>
>>>
>>>
>>> On 11/04/14 09:04, gortiz wrote:
>>>
>>>> Well, I guessed that, but it doesn't make much sense because it's so
>>>> slow. Right now I only have 100 rows with 1000 versions each.
>>>> I have checked the size of the dataset and each row is about 700KB
>>>> (around 7GB, 100 rows x 1000 versions). So it should only check 100
>>>> rows x 700KB = 70MB, since it just checks the newest version. How can
>>>> it spend so much time checking this quantity of data?
>>>>
>>>> I'm generating the dataset again with a bigger block size (previously
>>>> it was 64KB; now it's going to be 1MB). I could try tuning the scanner
>>>> caching and batching parameters, but I don't think they will make much
>>>> difference.
>>>>
>>>> Another test I want to do is to generate the same dataset with just
>>>> 100 versions. It should take around the same time, right? Or am I wrong?
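
If you do try the scanner knobs, this is what they look like on the 0.94
Java client (a sketch; the values are arbitrary examples, not
recommendations):

import org.apache.hadoop.hbase.client.Scan;

public class ScanTuning {
    public static void main(String[] args) {
        Scan scan = new Scan();
        // Rows fetched per RPC: fewer round trips, but each next() call
        // must complete within the scanner lease period.
        scan.setCaching(100);
        // Max columns per Result: bounds client memory on very wide rows.
        scan.setBatch(100);
        System.out.println(scan);
    }
}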
>>>>
>>>> On 10/04/14 18:08, Ted Yu wrote:
>>>>
>>>>> It should be the newest version of each value.
>>>>>
>>>>> Cheers
>>>>>
>>>>>
>>>>> On Thu, Apr 10, 2014 at 9:55 AM, gortiz <gor...@pragsis.com> wrote:
>>>>>
>>>>>> Another little question: with the filter I'm using, do I check all
>>>>>> the versions, or just the newest? Because I'm wondering whether,
>>>>>> when I do a scan over the whole table, I'm looking for the value
>>>>>> "5" in the whole dataset or just in the newest version of each
>>>>>> value.
>>>>>>
>>>>>>
>>>>>> On 10/04/14 16:52, gortiz wrote:
>>>>>>
>>>>>>> I was trying to check the behaviour of HBase. The cluster is a
>>>>>>> group of old computers: one master and five slaves, each with 2GB,
>>>>>>> so 12GB in total.
>>>>>>> The table has a column family with 1000 columns, each column with
>>>>>>> 100 versions.
>>>>>>> There's another column family with four columns and one image of
>>>>>>> 100KB. (I've tried without this column family as well.)
>>>>>>> The table is partitioned manually across all the slaves, so data
>>>>>>> is balanced in the cluster.
>>>>>>>
>>>>>>> I'm executing this statement in HBase 0.94.6: *scan 'table1',
>>>>>>> {FILTER => "ValueFilter(=, 'binary:5')"}*
>>>>>>> My lease and RPC timeouts are three minutes.
>>>>>>> Since it's a full scan of the table, I have been playing with the
>>>>>>> BLOCKCACHE as well (just disabling and enabling it, not changing
>>>>>>> its size). I thought it was going to cause too many GC calls; I'm
>>>>>>> not sure about this point.
>>>>>>>
>>>>>>> I know it's not the best way to use HBase; it's just a test. I
>>>>>>> think it's not working because the hardware isn't enough, although
>>>>>>> I would like to try some kind of tuning to improve it.
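
Per-scan block-cache control on the 0.94 client looks like this (a
minimal sketch; table name taken from the thread). Disabling it for
one-off full scans is the usual advice, so the scan doesn't churn the LRU
block cache and add GC pressure:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class FullScanNoCache {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "table1");
        Scan scan = new Scan();
        // Don't fill the block cache with blocks this scan reads once.
        scan.setCacheBlocks(false);
        ResultScanner scanner = table.getScanner(scan);
        try {
            while (scanner.next() != null) {
                // Drain the scanner; a real job would process each Result.
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}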
>>>>>>>
>>>>>>> On 10/04/14 14:21, Ted Yu wrote:
>>>>>>>
>>>>>>>> Can you give us a bit more information:
>>>>>>>> - the HBase release you're running
>>>>>>>> - what filters are used for the scan
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> On Apr 10, 2014, at 2:36 AM, gortiz <gor...@pragsis.com> wrote:
>>>>>>>>
>>>>>>>>> I got this error when I executed a full scan with filters on a
>>>>>>>>> table.
>>>>>>>>>
>>>>>>>>> Caused by: java.lang.RuntimeException: org.apache.hadoop.hbase.regionserver.LeaseException: org.apache.hadoop.hbase.regionserver.LeaseException: lease '-4165751462641113359' does not exist
>>>>>>>>>     at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:231)
>>>>>>>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2482)
>>>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>>     at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
>>>>>>>>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1428)
>>>>>>>>>
>>>>>>>>> I have read about increasing the lease time and RPC time, but
>>>>>>>>> it's not working. What else could I try? The table isn't too
>>>>>>>>> big. I have been checking the GC, HMaster, and some RegionServer
>>>>>>>>> logs and I didn't see anything weird. I also tried a couple of
>>>>>>>>> caching values.
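
For reference, the two timeout properties usually raised for this in 0.94
are shown below, as a sketch with the three-minute values mentioned in
the thread. A LeaseException typically means the client spent longer than
the lease period between two scanner.next() calls, which large caching
values make more likely. Note that hbase.regionserver.lease.period is
read by the region server, so it really belongs in the server-side
hbase-site.xml; setting it client-side here only illustrates the property
names.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class TimeoutConfig {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Scanner lease period, ms (server-side setting in hbase-site.xml).
        conf.setLong("hbase.regionserver.lease.period", 180000L);
        // Client RPC timeout, ms; should be at least the lease period.
        conf.setLong("hbase.rpc.timeout", 180000L);
        System.out.println(conf.get("hbase.rpc.timeout"));
    }
}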
