Okay, thank you, I'll check it this Monday. I didn't know that Scan checks all the versions. So I was checking every column and every version, although it only showed me the newest version because I hadn't specified anything with the VERSIONS attribute. It makes sense that it takes so long.
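For reference, a shell scan returns only the newest cell version unless the VERSIONS attribute is given; a minimal sketch, assuming a hypothetical table 'table1' with a column family 'cf1':

```
scan 'table1', {COLUMNS => 'cf1', VERSIONS => 10}
```

Without VERSIONS => n, only the latest version of each cell appears in the output, even though older versions are still read and stored.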
2014-04-11 16:57 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:

> In your previous example:
>
>   scan 'table1', {FILTER => "ValueFilter(=, 'binary:5')"}
>
> there was no expression w.r.t. timestamp. See the following javadoc from
> Scan.java:
>
>   * To only retrieve columns within a specific range of version timestamps,
>   * execute {@link #setTimeRange(long, long) setTimeRange}.
>   * <p>
>   * To only retrieve columns with a specific timestamp, execute
>   * {@link #setTimeStamp(long) setTimestamp}.
>
> You can use one of the above methods to make your scan more selective.
>
> ValueFilter#filterKeyValue(Cell) doesn't utilize the advanced features of
> ReturnCode. You can refer to:
>
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/Filter.ReturnCode.html
>
> You can take a look at SingleColumnValueFilter#filterKeyValue() for an
> example of how the various ReturnCodes are used to speed up a scan.
>
> Cheers
>
> On Fri, Apr 11, 2014 at 8:40 AM, Guillermo Ortiz <konstt2...@gmail.com> wrote:
>
>> I read something interesting about it in HBase: The Definitive Guide.
>>
>> Page 344:
>> "The StoreScanner class combines the store files and memstore that the
>> Store instance contains. It is also where the exclusion happens, based
>> on the Bloom filter, or the timestamp. If you are asking for versions
>> that are not more than 30 minutes old, for example, you can skip all
>> storage files that are older than one hour: they will not contain
>> anything of interest. See 'Key Design' on page 357 for details on the
>> exclusion, and how to make use of it."
>>
>> So I guess it doesn't have to read all the HFiles? But I don't know
>> whether HBase really uses the timestamp of each row or the date of the
>> file. I guess that when I execute the scan it reads everything, but I
>> don't know why. I think there's something else here that I'm not seeing.
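Ted's setTimeRange suggestion can also be expressed directly in the shell via the TIMERANGE attribute; a sketch, with placeholder epoch-millisecond bounds rather than values from this thread:

```
scan 'table1', {FILTER => "ValueFilter(=, 'binary:5')",
  TIMERANGE => [1397174400000, 1397178000000]}
```

This is what lets the scan exclude store files whose time range falls entirely outside the requested window, per the TDG passage quoted above.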
>> 2014-04-11 13:05 GMT+02:00 gortiz <gor...@pragsis.com>:
>>
>>> Sorry, I didn't get why it should read all the timestamps and not just
>>> the newest if they're sorted and you didn't specify any timestamp in
>>> your filter.
>>>
>>> On 11/04/14 12:13, Anoop John wrote:
>>>
>>>> In the storage layer (HFiles in HDFS), all versions of a particular
>>>> cell stay together. (Yes, it has to be lexicographically ordered
>>>> KVs.) So during a scan we will have to read all the version data. At
>>>> this storage layer it doesn't know about the versions stuff, etc.
>>>>
>>>> -Anoop-
>>>>
>>>> On Fri, Apr 11, 2014 at 3:33 PM, gortiz <gor...@pragsis.com> wrote:
>>>>
>>>>> Yes, I have tried two different values for the number of versions:
>>>>> 1000, and the maximum value for integers.
>>>>>
>>>>> But I want to keep those versions; I don't want to keep just 3.
>>>>> Imagine that I want to record a new version each minute and store a
>>>>> day's worth: that's 1440 versions.
>>>>>
>>>>> Why is HBase going to read all the versions? I thought that if you
>>>>> don't indicate any versions, it just reads the newest and skips the
>>>>> rest. It doesn't make much sense to read all of them if the data is
>>>>> sorted, and the newest version is stored at the top.
>>>>>
>>>>> On 11/04/14 11:54, Anoop John wrote:
>>>>>
>>>>>> What is the max versions setting you have configured for your
>>>>>> table's CF? When you set such a value, HBase has to keep all those
>>>>>> versions, and during a scan it will read all of them. In the 0.94
>>>>>> version the default value for max versions is 3. I guess you have
>>>>>> set some bigger value; if you have not, mind testing after a major
>>>>>> compaction?
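Anoop's suggestion (check the column family's max versions setting, then re-test after a major compaction) might look like this in the 0.94 shell; a sketch with a hypothetical table 'table1' and family 'cf1', and note that altering a table on 0.94 usually requires disabling it first:

```
disable 'table1'
alter 'table1', {NAME => 'cf1', VERSIONS => 3}
enable 'table1'
major_compact 'table1'
```

After the major compaction rewrites the store files, cell versions beyond the new limit are physically dropped, which is what makes the re-test meaningful.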
>>>>>>
>>>>>> -Anoop-
>>>>>>
>>>>>> On Fri, Apr 11, 2014 at 1:01 PM, gortiz <gor...@pragsis.com> wrote:
>>>>>>
>>>>>>> The last test I have done is to reduce the number of versions to
>>>>>>> 100. So, right now, I have 100 rows with 100 versions each.
>>>>>>> Times are (I got the same times for block sizes of 64KB and 1MB):
>>>>>>>
>>>>>>> 100 rows, 1000 versions + block cache    -> 80s
>>>>>>> 100 rows, 1000 versions + no block cache -> 70s
>>>>>>> 100 rows, *100* versions + block cache    -> 7.3s
>>>>>>> 100 rows, *100* versions + no block cache -> 6.1s
>>>>>>>
>>>>>>> What's the reason for this? I guessed HBase was smart enough not
>>>>>>> to consider old versions and to check only the newest. But I
>>>>>>> reduced the size (in versions) by 10x and I got a 10x gain in
>>>>>>> performance.
>>>>>>>
>>>>>>> The filter is:
>>>>>>> scan 'filters', {FILTER => "ValueFilter(=, 'binary:5')",
>>>>>>>   STARTROW => '1010000000000000000000000000000000000101',
>>>>>>>   STOPROW => '6010000000000000000000000000000000000201'}
>>>>>>>
>>>>>>> On 11/04/14 09:04, gortiz wrote:
>>>>>>>
>>>>>>>> Well, I guessed that, which doesn't make much sense given how
>>>>>>>> slow it is. Right now I only have 100 rows with 1000 versions
>>>>>>>> each.
>>>>>>>> I have checked the size of the dataset and each row is about
>>>>>>>> 700KB (around 7GB total, 100 rows x 1000 versions). So it should
>>>>>>>> only check 100 rows x 700KB = 70MB, since it just checks the
>>>>>>>> newest version. How can it spend so much time checking that
>>>>>>>> quantity of data?
>>>>>>>>
>>>>>>>> I'm generating the dataset again with a bigger block size
>>>>>>>> (previously it was 64KB; now it's going to be 1MB). I could try
>>>>>>>> tuning the scanner caching and batching parameters, but I don't
>>>>>>>> think they're going to matter much.
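The scanner caching and batching parameters mentioned above can also be set per scan from the shell; a sketch reusing the scan from this thread, with placeholder values for CACHE and BATCH:

```
scan 'filters', {FILTER => "ValueFilter(=, 'binary:5')",
  STARTROW => '1010000000000000000000000000000000000101',
  STOPROW  => '6010000000000000000000000000000000000201',
  CACHE => 100, BATCH => 100, CACHE_BLOCKS => false}
```

CACHE controls how many rows each RPC fetches, BATCH caps the number of cells returned per Result, and CACHE_BLOCKS => false avoids churning the block cache during a one-off full scan.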
>>>>>>>>
>>>>>>>> Another test I want to do is to generate the same dataset with
>>>>>>>> just 100 versions. It should take around the same time, right?
>>>>>>>> Or am I wrong?
>>>>>>>>
>>>>>>>> On 10/04/14 18:08, Ted Yu wrote:
>>>>>>>>
>>>>>>>>> It should be the newest version of each value.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> On Thu, Apr 10, 2014 at 9:55 AM, gortiz <gor...@pragsis.com> wrote:
>>>>>>>>>
>>>>>>>>>> Another little question: with the filter I'm using, do I check
>>>>>>>>>> all the versions, or just the newest? Because I'm wondering
>>>>>>>>>> whether, when I do a scan over the whole table, I'm looking for
>>>>>>>>>> the value "5" in the whole dataset or just in the newest
>>>>>>>>>> version of each value.
>>>>>>>>>>
>>>>>>>>>> On 10/04/14 16:52, gortiz wrote:
>>>>>>>>>>
>>>>>>>>>>> I was trying to check the behaviour of HBase. The cluster is a
>>>>>>>>>>> group of old computers: one master, five slaves, each one with
>>>>>>>>>>> 2GB, so 12GB in total.
>>>>>>>>>>> The table has a column family with 1000 columns, and each
>>>>>>>>>>> column has 100 versions.
>>>>>>>>>>> There's another column family with four columns and one image
>>>>>>>>>>> of 100KB. (I've tried without this column family as well.)
>>>>>>>>>>> The table is partitioned manually across all the slaves, so
>>>>>>>>>>> data is balanced in the cluster.
>>>>>>>>>>>
>>>>>>>>>>> I'm executing this sentence in HBase 0.94.6:
>>>>>>>>>>> scan 'table1', {FILTER => "ValueFilter(=, 'binary:5')"}
>>>>>>>>>>> My time for lease and RPC is three minutes.
>>>>>>>>>>> Since it's a full scan of the table, I have been playing with
>>>>>>>>>>> the BLOCKCACHE as well (just disabling and enabling it, not
>>>>>>>>>>> changing its size). I thought that it was going to cause too
>>>>>>>>>>> many calls to the GC. I'm not sure about this point.
>>>>>>>>>>>
>>>>>>>>>>> I know that it's not the best way to use HBase; it's just a
>>>>>>>>>>> test. I think that it's not working because the hardware isn't
>>>>>>>>>>> enough, although I would like to try some kind of tuning to
>>>>>>>>>>> improve it.
>>>>>>>>>>>
>>>>>>>>>>> On 10/04/14 14:21, Ted Yu wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Can you give us a bit more information:
>>>>>>>>>>>>
>>>>>>>>>>>> - the HBase release you're running
>>>>>>>>>>>> - what filters are used for the scan
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>>
>>>>>>>>>>>> On Apr 10, 2014, at 2:36 AM, gortiz <gor...@pragsis.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I got this error when I executed a full scan with filters
>>>>>>>>>>>>> over a table:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Caused by: java.lang.RuntimeException:
>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.LeaseException:
>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.LeaseException: lease
>>>>>>>>>>>>> '-4165751462641113359' does not exist
>>>>>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:231)
>>>>>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2482)
>>>>>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>>>>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>>>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>>>>>>   at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
>>>>>>>>>>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1428)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have read about increasing the lease time and the RPC
>>>>>>>>>>>>> time, but it's not working. What else could I try? The table
>>>>>>>>>>>>> isn't too big. I have been checking the logs from the GC,
>>>>>>>>>>>>> HMaster, and some RegionServers, and I didn't see anything
>>>>>>>>>>>>> weird. I tried a couple of caching values as well.
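The lease and RPC times mentioned in the thread are normally raised in hbase-site.xml on the region servers; a sketch for 0.94, using 180000 ms to match the "three minutes" quoted earlier (the RPC timeout is usually kept at least as large as the lease period):

```
<property>
  <name>hbase.regionserver.lease.period</name>
  <value>180000</value>
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>180000</value>
</property>
```

A lease expiration during a scan often means the client spent longer than the lease period processing one batch of rows; lowering the scanner caching so each next() call returns sooner can help as much as raising the timeouts.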
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> *Guillermo Ortiz*
>>>>>>>>>>>>> /Big Data Developer/
>>>>>>>>>>>>>
>>>>>>>>>>>>> Telf.: +34 917 680 490
>>>>>>>>>>>>> Fax: +34 913 833 301
>>>>>>>>>>>>> C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
>>>>>>>>>>>>> http://www.bidoop.es