In the storage layer (HFiles in HDFS), all versions of a particular cell sit together, since KVs have to be stored lexicographically ordered; the versions of one cell sort adjacent, newest timestamp first. So during a scan we have to read all the version data. The storage layer by itself knows nothing about version semantics.
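You can see this from the client side with a raw scan, which surfaces every stored version, newest timestamp first. A quick sketch against the 0.94 client API ("table1" below is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class RawVersionScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "table1");  // placeholder table name

    Scan scan = new Scan();
    scan.setMaxVersions();  // no-arg form asks for all stored versions
    scan.setRaw(true);      // raw scan: bypass version filtering, delete markers included

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        for (KeyValue kv : r.raw()) {
          // KVs of one cell come back adjacent, sorted newest first
          System.out.println(kv + " ts=" + kv.getTimestamp());
        }
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}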
-Anoop-

On Fri, Apr 11, 2014 at 3:33 PM, gortiz <gor...@pragsis.com> wrote:

> Yes, I have tried with two different values for that max versions
> setting: 1000 and the maximum integer value.
>
> But I want to keep those versions; I don't want to keep just 3 versions.
> Imagine that I want to record a new version each minute and store a
> day's worth: that is 1440 versions.
>
> Why is HBase going to read all the versions? I thought that if you don't
> indicate any versions it just reads the newest and skips the rest. It
> doesn't make much sense to read all of them if the data is sorted and
> the newest version is stored at the top.
>
> On 11/04/14 11:54, Anoop John wrote:
>
>> What is the max versions setting you have done for your table CF? When
>> you set such a value, HBase has to keep all those versions, and during
>> a scan it will read all of them. In 0.94 the default value for max
>> versions is 3. I guess you have set some bigger value. If you have
>> not, mind testing after a major compaction?
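>>
>> If you want to do that from code rather than the shell, a rough sketch
>> with the 0.94 admin API ("table1" and "cf" below are placeholders for
>> your table and CF):
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.hbase.HBaseConfiguration;
>> import org.apache.hadoop.hbase.HColumnDescriptor;
>> import org.apache.hadoop.hbase.client.HBaseAdmin;
>> import org.apache.hadoop.hbase.util.Bytes;
>>
>> public class ResetMaxVersions {
>>   public static void main(String[] args) throws Exception {
>>     Configuration conf = HBaseConfiguration.create();
>>     HBaseAdmin admin = new HBaseAdmin(conf);
>>     HColumnDescriptor cf = admin.getTableDescriptor(Bytes.toBytes("table1"))
>>         .getFamily(Bytes.toBytes("cf"));
>>     cf.setMaxVersions(3);            // back to the 0.94 default
>>     admin.disableTable("table1");    // safest to alter the schema offline
>>     admin.modifyColumn("table1", cf);
>>     admin.enableTable("table1");
>>     // Major compaction is asynchronous; the extra versions are
>>     // physically purged once it completes.
>>     admin.majorCompact("table1");
>>     admin.close();
>>   }
>> }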
>>
>> -Anoop-
>>
>> On Fri, Apr 11, 2014 at 1:01 PM, gortiz <gor...@pragsis.com> wrote:
>>
>>> The last test I have done was to reduce the number of versions to
>>> 100, so right now I have 100 rows with 100 versions each.
>>> The times are (I got the same times for a block size of 64KB and 1MB):
>>>
>>> 100 rows, 1000 versions, block cache on  -> 80s
>>> 100 rows, 1000 versions, block cache off -> 70s
>>> 100 rows, 100 versions, block cache on   -> 7.3s
>>> 100 rows, 100 versions, block cache off  -> 6.1s
>>>
>>> What is the reason for this? I guessed HBase was smart enough not to
>>> consider old versions and just check the newest. But I reduced the
>>> size (in versions) by 10x and got a 10x performance improvement.
>>>
>>> The filter is: scan 'filters', {FILTER => "ValueFilter(=, 'binary:5')",
>>> STARTROW => '1010000000000000000000000000000000000101',
>>> STOPROW => '6010000000000000000000000000000000000201'}
>>>
>>> On 11/04/14 09:04, gortiz wrote:
>>>
>>>> Well, I guessed that, but it doesn't make much sense because it's so
>>>> slow. Right now I only have 100 rows with 1000 versions each.
>>>> I have checked the size of the dataset and each row is about 700KB
>>>> (around 7GB, 100 rows x 1000 versions). So it should only check
>>>> 100 rows x 700KB = 70MB, since it just checks the newest version.
>>>> How can it spend so much time checking that quantity of data?
>>>>
>>>> I'm generating the dataset again with a bigger block size (previously
>>>> it was 64KB; now it's going to be 1MB). I could try tuning the
>>>> scanning and batching parameters, but I don't think they're going to
>>>> make much difference.
>>>>
>>>> Another test I want to do is to generate the same dataset with just
>>>> 100 versions. It should take around the same time, right? Or am I
>>>> wrong?
>>>>
>>>> On 10/04/14 18:08, Ted Yu wrote:
>>>>
>>>>> It should be the newest version of each value.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Thu, Apr 10, 2014 at 9:55 AM, gortiz <gor...@pragsis.com> wrote:
>>>>>
>>>>>> Another little question: with the filter I'm using, do I check all
>>>>>> the versions, or just the newest? I'm wondering whether, when I
>>>>>> scan the whole table, I'm looking for the value "5" in the whole
>>>>>> dataset or just in the newest version of each value.
>>>>>>
>>>>>> On 10/04/14 16:52, gortiz wrote:
>>>>>>
>>>>>>> I was trying to check the behaviour of HBase. The cluster is a
>>>>>>> group of old computers: one master and five slaves, each one with
>>>>>>> 2GB, so 12GB in total.
>>>>>>> The table has a column family with 1000 columns, each column with
>>>>>>> 100 versions.
>>>>>>> There's another column family with four columns and one image of
>>>>>>> 100KB. (I've tried without this column family as well.)
>>>>>>> The table is partitioned manually across all the slaves, so the
>>>>>>> data is balanced in the cluster.
>>>>>>>
>>>>>>> I'm executing this sentence in HBase 0.94.6:
>>>>>>>
>>>>>>> scan 'table1', {FILTER => "ValueFilter(=, 'binary:5')"}
>>>>>>>
>>>>>>> My lease and RPC times are set to three minutes.
>>>>>>> Since it's a full scan of the table, I have been playing with the
>>>>>>> BLOCKCACHE as well (just disabling and enabling it, not changing
>>>>>>> its size); I thought it was going to cause too many calls to the
>>>>>>> GC. I'm not sure about this point.
>>>>>>>
>>>>>>> I know that this isn't the best way to use HBase; it's just a
>>>>>>> test. I think it's not working because the hardware isn't enough,
>>>>>>> although I would like to try some kind of tuning to improve it.
>>>>>>>
>>>>>>> On 10/04/14 14:21, Ted Yu wrote:
>>>>>>>
>>>>>>>> Can you give us a bit more information:
>>>>>>>>
>>>>>>>> HBase release you're running
>>>>>>>> What filters are used for the scan
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> On Apr 10, 2014, at 2:36 AM, gortiz <gor...@pragsis.com> wrote:
>>>>>>>>
>>>>>>>>> I got this error when I executed a full scan with filters over
>>>>>>>>> a table:
>>>>>>>>>
>>>>>>>>> Caused by: java.lang.RuntimeException:
>>>>>>>>> org.apache.hadoop.hbase.regionserver.LeaseException:
>>>>>>>>> org.apache.hadoop.hbase.regionserver.LeaseException: lease
>>>>>>>>> '-4165751462641113359' does not exist
>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:231)
>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2482)
>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>>   at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
>>>>>>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1428)
>>>>>>>>>
>>>>>>>>> I have read about increasing the lease time and the RPC time,
>>>>>>>>> but it's not working. What else could I try? The table isn't too
>>>>>>>>> big. I have been checking the logs from the GC, the HMaster and
>>>>>>>>> some RegionServers and I didn't see anything weird. I also tried
>>>>>>>>> a couple of caching values.
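>>>>>>>>>
>>>>>>>>> For reference, this is the kind of thing I have been trying (a
>>>>>>>>> sketch against the 0.94 client API; the table name and the
>>>>>>>>> caching value are just examples):
>>>>>>>>>
>>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>>> import org.apache.hadoop.hbase.HBaseConfiguration;
>>>>>>>>> import org.apache.hadoop.hbase.client.HTable;
>>>>>>>>> import org.apache.hadoop.hbase.client.Result;
>>>>>>>>> import org.apache.hadoop.hbase.client.ResultScanner;
>>>>>>>>> import org.apache.hadoop.hbase.client.Scan;
>>>>>>>>> import org.apache.hadoop.hbase.filter.BinaryComparator;
>>>>>>>>> import org.apache.hadoop.hbase.filter.CompareFilter;
>>>>>>>>> import org.apache.hadoop.hbase.filter.ValueFilter;
>>>>>>>>> import org.apache.hadoop.hbase.util.Bytes;
>>>>>>>>>
>>>>>>>>> public class LeaseFriendlyScan {
>>>>>>>>>   public static void main(String[] args) throws Exception {
>>>>>>>>>     // Server side, I raised these in hbase-site.xml on the
>>>>>>>>>     // region servers (values in ms):
>>>>>>>>>     //   hbase.regionserver.lease.period = 180000
>>>>>>>>>     //   hbase.rpc.timeout = 180000
>>>>>>>>>     Configuration conf = HBaseConfiguration.create();
>>>>>>>>>     HTable table = new HTable(conf, "table1");
>>>>>>>>>     Scan scan = new Scan();
>>>>>>>>>     scan.setFilter(new ValueFilter(CompareFilter.CompareOp.EQUAL,
>>>>>>>>>         new BinaryComparator(Bytes.toBytes("5"))));
>>>>>>>>>     // A smaller caching value makes each next() RPC return
>>>>>>>>>     // sooner, so a slow, filter-heavy scan is less likely to
>>>>>>>>>     // outlive its scanner lease.
>>>>>>>>>     scan.setCaching(10);
>>>>>>>>>     ResultScanner scanner = table.getScanner(scan);
>>>>>>>>>     try {
>>>>>>>>>       for (Result result : scanner) {
>>>>>>>>>         System.out.println(result);
>>>>>>>>>       }
>>>>>>>>>     } finally {
>>>>>>>>>       scanner.close();
>>>>>>>>>       table.close();
>>>>>>>>>     }
>>>>>>>>>   }
>>>>>>>>> }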
>
> --
> *Guillermo Ortiz*
> /Big Data Developer/
>
> Telf.: +34 917 680 490
> Fax: +34 913 833 301
>
> C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
>
> _http://www.bidoop.es_