Okay, thank you, I'll check it this Monday. I didn't know that Scan checks
all the versions. So I was scanning every column and every version, even
though it only showed me the newest version because I didn't set the
VERSIONS attribute. It makes sense that it takes so long.
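
If I've understood correctly, the Java equivalent would be something like
this (just a sketch against the 0.94-era client API):

import org.apache.hadoop.hbase.client.Scan;

Scan scan = new Scan();
// Same as the shell default (no VERSIONS attribute): only the newest
// version of each cell is returned, but, as explained below, the scanner
// still reads over the older versions stored in the HFiles unless a
// time range lets it exclude them.
scan.setMaxVersions(1);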


2014-04-11 16:57 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:

> In your previous example:
> scan 'table1', {FILTER => "ValueFilter(=, 'binary:5')"}
>
> there was no expression w.r.t. timestamp. See the following javadoc from
> Scan.java:
>
>  * To only retrieve columns within a specific range of version timestamps,
>  * execute {@link #setTimeRange(long, long) setTimeRange}.
>  * <p>
>  * To only retrieve columns with a specific timestamp, execute
>  * {@link #setTimeStamp(long) setTimestamp}.
>
> You can use one of the above methods to make your scan more selective.
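>
> For instance, something like this (just a sketch, not tested; adjust the
> time window and filter to your case):
>
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.filter.BinaryComparator;
> import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
> import org.apache.hadoop.hbase.filter.ValueFilter;
> import org.apache.hadoop.hbase.util.Bytes;
>
> Scan scan = new Scan();
> long now = System.currentTimeMillis();
> // Only consider cells written in the last 30 minutes; store files whose
> // time range lies entirely outside this window can be excluded.
> scan.setTimeRange(now - 30 * 60 * 1000L, now);
> scan.setFilter(new ValueFilter(CompareOp.EQUAL,
>     new BinaryComparator(Bytes.toBytes("5"))));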
>
>
> ValueFilter#filterKeyValue(Cell) doesn't utilize the advanced features of
> ReturnCode. You can refer to:
>
>
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/Filter.ReturnCode.html
>
> You can take a look at SingleColumnValueFilter#filterKeyValue() for an
> example of how the various ReturnCodes are used to speed up a scan.
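>
> For illustration, a hand-rolled sketch (this is not the actual
> SingleColumnValueFilter code, and a real custom filter would also need
> serialization support so it can be deployed to the region servers) that
> skips the older versions of each column:
>
> import org.apache.hadoop.hbase.Cell;
> import org.apache.hadoop.hbase.CellUtil;
> import org.apache.hadoop.hbase.filter.FilterBase;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class NewestVersionValueFilter extends FilterBase {
>   private final byte[] target;
>
>   public NewestVersionValueFilter(byte[] target) {
>     this.target = target;
>   }
>
>   @Override
>   public ReturnCode filterKeyValue(Cell cell) {
>     // Versions of a column are sorted newest first, so the first cell we
>     // see is the newest; whether it matches or not, we can jump straight
>     // to the next column instead of reading the older versions.
>     if (Bytes.equals(CellUtil.cloneValue(cell), target)) {
>       return ReturnCode.INCLUDE_AND_NEXT_COL;  // keep newest, skip older
>     }
>     return ReturnCode.NEXT_COL;                // no match: skip older too
>   }
> }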
>
> Cheers
>
>
> On Fri, Apr 11, 2014 at 8:40 AM, Guillermo Ortiz <konstt2...@gmail.com> wrote:
>
> > I read something interesting about it in HBase TDG.
> >
> > Page 344:
> > The StoreScanner class combines the store files and memstore that the
> > Store instance contains. It is also where the exclusion happens, based
> > on the Bloom filter, or the timestamp. If you are asking for versions
> > that are not more than 30 minutes old, for example, you can skip all
> > storage files that are older than one hour: they will not contain
> > anything of interest. See "Key Design" on page 357 for details on the
> > exclusion, and how to make use of it.
> >
> > So I guess it doesn't have to read all the HFiles? But I don't know
> > whether HBase really uses the timestamp of each row or the date of the
> > file. I guess that when I execute the scan it reads everything, but I
> > don't know why. I think there's something else here that I'm not
> > seeing.
> >
> >
> > 2014-04-11 13:05 GMT+02:00 gortiz <gor...@pragsis.com>:
> >
> > > Sorry, I didn't get why it should read all the timestamps and not
> > > just the newest if they're sorted and you didn't specify any
> > > timestamp in your filter.
> > >
> > >
> > >
> > > On 11/04/14 12:13, Anoop John wrote:
> > >
> > >> In the storage layer (HFiles in HDFS), all versions of a particular
> > >> cell are stored together (the KVs have to be lexicographically
> > >> ordered), so during a scan we have to read all the version data. The
> > >> storage layer knows nothing about the versions setting.
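> > >>
> > >> For example (hypothetical row and values, using the client API's
> > >> KeyValue comparator):
> > >>
> > >> import org.apache.hadoop.hbase.KeyValue;
> > >> import org.apache.hadoop.hbase.util.Bytes;
> > >>
> > >> byte[] row = Bytes.toBytes("row1");
> > >> byte[] fam = Bytes.toBytes("cf");
> > >> byte[] qual = Bytes.toBytes("c1");
> > >> KeyValue newer = new KeyValue(row, fam, qual, 2000L, Bytes.toBytes("v2"));
> > >> KeyValue older = new KeyValue(row, fam, qual, 1000L, Bytes.toBytes("v1"));
> > >> // Within a column, timestamps sort descending: the newer KV comes
> > >> // first, but the older one sits physically right behind it in the
> > >> // HFile, so the scanner still reads over it.
> > >> assert KeyValue.COMPARATOR.compare(newer, older) < 0;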
> > >>
> > >> -Anoop-
> > >>
> > >> On Fri, Apr 11, 2014 at 3:33 PM, gortiz <gor...@pragsis.com> wrote:
> > >>
> > >>> Yes, I have tried two different values for the number of versions:
> > >>> 1000 and the maximum integer value.
> > >>>
> > >>> But I want to keep those versions; I don't want to keep just 3.
> > >>> Imagine that I want to record a new version each minute and store a
> > >>> day's worth: that's 1440 versions.
> > >>>
> > >>> Why is HBase going to read all the versions? I thought that if you
> > >>> don't indicate any versions, it just reads the newest and skips the
> > >>> rest. It doesn't make much sense to read all of them if the data is
> > >>> sorted and the newest version is stored at the top.
> > >>>
> > >>>
> > >>>
> > >>> On 11/04/14 11:54, Anoop John wrote:
> > >>>
> > >>>> What is the max versions setting you have for your table's CF?
> > >>>> When you set such a value, HBase has to keep all those versions,
> > >>>> and during a scan it will read all of them. In 0.94 the default
> > >>>> value for max versions is 3; I guess you have set some bigger
> > >>>> value. If you have not, would you mind testing after a major
> > >>>> compaction?
> > >>>>
> > >>>> -Anoop-
> > >>>>
> > >>>> On Fri, Apr 11, 2014 at 1:01 PM, gortiz <gor...@pragsis.com> wrote:
> > >>>>
> > >>>>> The last test I have done is to reduce the number of versions to
> > >>>>> 100. So, right now, I have 100 rows with 100 versions each.
> > >>>>> Times are (I got the same times for block sizes of 64KB and 1MB):
> > >>>>> 100 rows, 1000 versions + blockcache -> 80s.
> > >>>>> 100 rows, 1000 versions + no blockcache -> 70s.
> > >>>>>
> > >>>>> 100 rows, *100* versions + blockcache -> 7.3s.
> > >>>>> 100 rows, *100* versions + no blockcache -> 6.1s.
> > >>>>>
> > >>>>> What's the reason for this? I guessed HBase was smart enough not
> > >>>>> to consider old versions and just check the newest, but I reduced
> > >>>>> the size (in versions) by 10x and got a 10x performance
> > >>>>> improvement.
> > >>>>>
> > >>>>> The filter is: scan 'filters', {FILTER => "ValueFilter(=,
> > >>>>> 'binary:5')", STARTROW => '1010000000000000000000000000000000000101',
> > >>>>> STOPROW => '6010000000000000000000000000000000000201'}
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On 11/04/14 09:04, gortiz wrote:
> > >>>>>
> > >>>>>> Well, I guessed that, but it doesn't make much sense because
> > >>>>>> it's so slow. I only have 100 rows with 1000 versions each right
> > >>>>>> now. I have checked the size of the dataset and each row is
> > >>>>>> about 700KB (around 7GB, 100 rows x 1000 versions). So it should
> > >>>>>> only check 100 rows x 700KB = 70MB, since it just checks the
> > >>>>>> newest version. How can it spend so much time checking that
> > >>>>>> quantity of data?
> > >>>>>>
> > >>>>>> I'm generating the dataset again with a bigger block size
> > >>>>>> (previously it was 64KB; now it's going to be 1MB). I could try
> > >>>>>> tuning the scanner caching and batching parameters, but I don't
> > >>>>>> think they're going to matter much.
> > >>>>>>
> > >>>>>> Another test I want to do is to generate the same dataset with
> > >>>>>> just 100 versions. It should take around the same time, right?
> > >>>>>> Or am I wrong?
> > >>>>>>
> > >>>>>> On 10/04/14 18:08, Ted Yu wrote:
> > >>>>>>
> > >>>>>>> It should be the newest version of each value.
> > >>>>>>
> > >>>>>>> Cheers
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Thu, Apr 10, 2014 at 9:55 AM, gortiz <gor...@pragsis.com>
> > wrote:
> > >>>>>>>
> > >>>>>>>> Another little question: with the filter I'm using, do I
> > >>>>>>>> check all the versions or just the newest? I'm wondering
> > >>>>>>>> whether, when I do a scan over the whole table, I'm looking
> > >>>>>>>> for the value "5" in the whole dataset or just in the newest
> > >>>>>>>> version of each value.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On 10/04/14 16:52, gortiz wrote:
> > >>>>>>>>
> > >>>>>>>>> I was trying to check the behaviour of HBase. The cluster is
> > >>>>>>>>> a group of old computers: one master, five slaves, each one
> > >>>>>>>>> with 2GB, so 12GB in total.
> > >>>>>>>>> The table has a column family with 1000 columns, and each
> > >>>>>>>>> column has 100 versions.
> > >>>>>>>>> There's another column family with four columns and one
> > >>>>>>>>> image of 100KB. (I've tried without this column family as
> > >>>>>>>>> well.)
> > >>>>>>>>> The table is partitioned manually across all the slaves, so
> > >>>>>>>>> the data is balanced in the cluster.
> > >>>>>>>>>
> > >>>>>>>>> I'm executing this sentence in HBase 0.94.6: *scan 'table1',
> > >>>>>>>>> {FILTER => "ValueFilter(=, 'binary:5')"}*
> > >>>>>>>>> My lease and RPC timeouts are three minutes.
> > >>>>>>>>> Since it's a full scan of the table, I have been playing with
> > >>>>>>>>> the BLOCKCACHE as well (just disabling and enabling it, not
> > >>>>>>>>> changing its size). I thought it was going to cause too many
> > >>>>>>>>> GC calls. I'm not sure about this point.
> > >>>>>>>>>
> > >>>>>>>>> I know that it's not the best way to use HBase; it's just a
> > >>>>>>>>> test. I think it's not working because the hardware isn't
> > >>>>>>>>> enough, although I would like to try some kind of tuning to
> > >>>>>>>>> improve it.
> > >>>>>>>>>
> > >>>>>>>>> On 10/04/14 14:21, Ted Yu wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Can you give us a bit more information:
> > >>>>>>>>>>
> > >>>>>>>>>> HBase release you're running
> > >>>>>>>>>> What filters are used for the scan
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks
> > >>>>>>>>>>
> > >>>>>>>>>> On Apr 10, 2014, at 2:36 AM, gortiz <gor...@pragsis.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> I got this error when I executed a full scan with filters
> > >>>>>>>>>>> on a table:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Caused by: java.lang.RuntimeException:
> > >>>>>>>>>>> org.apache.hadoop.hbase.regionserver.LeaseException:
> > >>>>>>>>>>> org.apache.hadoop.hbase.regionserver.LeaseException: lease
> > >>>>>>>>>>> '-4165751462641113359' does not exist
> > >>>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:231)
> > >>>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2482)
> > >>>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >>>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >>>>>>>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >>>>>>>>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
> > >>>>>>>>>>>   at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
> > >>>>>>>>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1428)
> > >>>>>>>>>>>
> > >>>>>>>>>>> I have read about increasing the lease time and the RPC
> > >>>>>>>>>>> time, but it's not working. What else could I try? The
> > >>>>>>>>>>> table isn't too big. I have been checking the GC, HMaster,
> > >>>>>>>>>>> and some RegionServer logs, and I didn't see anything
> > >>>>>>>>>>> weird. I also tried a couple of scanner caching values.
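> > >>>>>>>>>>>
> > >>>>>>>>>>> For example, one of the things I tried with caching was
> > >>>>>>>>>>> something like this (a Java sketch; the exact value was a
> > >>>>>>>>>>> guess):
> > >>>>>>>>>>>
> > >>>>>>>>>>> import org.apache.hadoop.hbase.client.Scan;
> > >>>>>>>>>>>
> > >>>>>>>>>>> Scan scan = new Scan();
> > >>>>>>>>>>> // Fewer rows per RPC: each ResultScanner.next() batch
> > >>>>>>>>>>> // returns sooner, so the scanner lease is renewed before
> > >>>>>>>>>>> // it can expire on the region server.
> > >>>>>>>>>>> scan.setCaching(10);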
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>> --
> > >>>>>>>> *Guillermo Ortiz*
> > >>>>>>>> /Big Data Developer/
> > >>>>>>>>
> > >>>>>>>> Telf.: +34 917 680 490
> > >>>>>>>> Fax: +34 913 833 301
> > >>>>>>>>
> > >>>>>>>> C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
> > >>>>>>>>
> > >>>>>>>> _http://www.bidoop.es_
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>> --
> > >>>>> *Guillermo Ortiz*
> > >>>>> /Big Data Developer/
> > >>>>>
> > >>>>> Telf.: +34 917 680 490
> > >>>>> Fax: +34 913 833 301
> > >>>>>
> > >>>>> C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
> > >>>>>
> > >>>>> _http://www.bidoop.es_
> > >>>>>
> > >>>>>
> > >>>>>
> > >>> --
> > >>> *Guillermo Ortiz*
> > >>> /Big Data Developer/
> > >>>
> > >>> Telf.: +34 917 680 490
> > >>> Fax: +34 913 833 301
> > >>>
> > >>> C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
> > >>>
> > >>> _http://www.bidoop.es_
> > >>>
> > >>>
> > >>>
> > >
> > > --
> > > *Guillermo Ortiz*
> > > /Big Data Developer/
> > >
> > > Telf.: +34 917 680 490
> > > Fax: +34 913 833 301
> > > C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
> > >
> > > _http://www.bidoop.es_
> > >
> > >
> >
>
