Re: setTimeRange and setMaxVersions seem to be inefficient

Jerry Lam Wed, 29 Aug 2012 06:59:58 -0700

Hi Lars:

Thanks for spending time discussing this with me. I appreciate it.


I tried to implement the setMaxVersions(1) inside the filter as follows:

@Override
public ReturnCode filterKeyValue(KeyValue kv) {
    // Check whether this is the same qualifier as the one that was included
    // previously. If yes, jump to the next column.
    if (previousIncludedQualifier != null &&
            Bytes.compareTo(previousIncludedQualifier, kv.getQualifier()) == 0) {
        previousIncludedQualifier = null;
        return ReturnCode.NEXT_COL;
    }
    // Another condition that makes the jump go further using a hint.
    if (Bytes.compareTo(this.qualifier, kv.getQualifier()) == 0) {
        LOG.info("Match found.");
        return ReturnCode.SEEK_NEXT_USING_HINT;
    }
    // Include this KV in the result and keep track of the included qualifier
    // so that the next version of the same qualifier will be excluded.
    previousIncludedQualifier = kv.getQualifier();
    return ReturnCode.INCLUDE;
}

Does this look reasonable, or is there a better way to achieve this? It
would be nice to have ReturnCode.INCLUDE_AND_NEXT_COL for this case, though.
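
For reference, the per-column behaviour of the filter above can be modelled
without an HBase cluster. The sketch below is a stand-alone approximation, not
HBase code: ReturnCode is a local enum standing in for Filter.ReturnCode, and
VersionSkipDemo, FirstVersionFilter, and scan are hypothetical names. It
assumes the framework honours NEXT_COL by skipping the remaining versions of
the current column, and it leaves out the SEEK_NEXT_USING_HINT branch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Stand-alone model; none of these types are the real HBase classes.
class VersionSkipDemo {
    // Local stand-in for the subset of Filter.ReturnCode used here.
    enum ReturnCode { INCLUDE, NEXT_COL }

    // Hypothetical filter: includes the newest version of a column, then
    // answers NEXT_COL when it sees the next version of the same column.
    static class FirstVersionFilter {
        private byte[] previousIncludedQualifier;
        int calls; // number of filterKeyValue invocations

        ReturnCode filterKeyValue(byte[] qualifier) {
            calls++;
            if (previousIncludedQualifier != null
                    && Arrays.equals(previousIncludedQualifier, qualifier)) {
                previousIncludedQualifier = null;
                return ReturnCode.NEXT_COL;
            }
            previousIncludedQualifier = qualifier;
            return ReturnCode.INCLUDE;
        }
    }

    // Feeds `columns` columns with `versions` versions each through the
    // filter and returns {filter calls, included KVs}.
    static int[] scan(int columns, int versions) {
        FirstVersionFilter filter = new FirstVersionFilter();
        int included = 0;
        for (int c = 0; c < columns; c++) {
            byte[] qualifier = ("col-" + c).getBytes(StandardCharsets.UTF_8);
            for (int v = 0; v < versions; v++) {
                if (filter.filterKeyValue(qualifier) == ReturnCode.NEXT_COL) {
                    break; // modelled framework skips the rest of this column
                }
                included++;
            }
        }
        return new int[] { filter.calls, included };
    }

    public static void main(String[] args) {
        int[] result = scan(2, 1000);
        System.out.println("calls=" + result[0] + " included=" + result[1]);
    }
}
```

With 2 columns of 1000 versions each, the model invokes the filter 4 times and
includes 2 KVs. Each column costs one INCLUDE plus one extra call that only
returns NEXT_COL, which is the call an INCLUDE_AND_NEXT_COL return code would
save. One caveat, assuming the real filter keeps the same state: if
previousIncludedQualifier is not also cleared in Filter.reset(), a column with
only one version could cause the same qualifier in the next row to be skipped.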

Best Regards,

Jerry


On Wed, Aug 29, 2012 at 2:09 AM, lars hofhansl <lhofha...@yahoo.com> wrote:

> Hi Jerry,
>
> my answer will be the same again:
> Some folks will want the max versions set by the client to be before
> filters and some folks will want it to restrict the end result.
> It's not possible to have it both ways. Your filter needs to do the right
> thing.
>
>
> There's a lot of discussion around this in HBASE-5104.
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Jerry Lam <chiling...@gmail.com>
> To: user@hbase.apache.org; lars hofhansl <lhofha...@yahoo.com>
> Sent: Tuesday, August 28, 2012 1:52 PM
> Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
>
> Hi Lars:
>
> I see. Please refer to the inline comment below.
>
> Best Regards,
>
> Jerry
>
> On Tue, Aug 28, 2012 at 2:21 PM, lars hofhansl <lhofha...@yahoo.com>
> wrote:
>
> > What I was saying was: It depends. :)
> >
> > First off, how do you get to 1000 versions? In 0.94++ older versions are
> > pruned upon flush, so you need 333 flushes (assuming 3 versions on the CF)
> > to get 1000 versions.
> >
>
> I forgot that the default number of versions to keep is 3. If this is what
> people use most of the time, then yes, you are right for this type of
> scenario, where the number of versions to keep per column is small.
>
> > By that time some compactions will have happened and you're back to close
> > to 3 versions (maybe 9, 12, or 15 or so, depending on how many store files
> > you have).
> >
> > Now, if you have that many versions because you set VERSIONS=>1000
> > in your CF... Then imagine you have 100 columns with 1000 versions each.
> >
>
> Yes, imagine I set VERSIONS => Long.MAX_VALUE (i.e. I will manage the
> versioning myself)
>
> > In your scenario below you'd do 100000 comparisons if the filter would be
> > evaluated after the version counting. But only 1100 with the current code.
> > (or at least in that ball park)
> >
>
> This is where I don't quite understand what you mean.
>
> If the framework counted the number of ReturnCode.INCLUDE results and then
> stopped feeding KeyValues into the filterKeyValue method once it reached the
> count specified in setMaxVersions (i.e. 1 for the case we discussed),
> shouldn't there be just 100 comparisons (at most) instead of 1100?
> Maybe I don't understand how the current approach works...
>
>
>
> >
> > The gist is: One can construct scenarios where one approach is better than
> > the other. Only one order is possible.
> > If you write a custom filter and you care about these things you should
> > use the seek hints.
> >
> > -- Lars
> >
> >
> > ----- Original Message -----
> > From: Jerry Lam <chiling...@gmail.com>
> > To: user@hbase.apache.org; lars hofhansl <lhofha...@yahoo.com>
> > Cc:
> > Sent: Tuesday, August 28, 2012 7:17 AM
> > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> >
> > Hi Lars:
> >
> > Thanks for the reply.
> > I need to understand whether I have misunderstood the perceived
> > inefficiency, because it seems you don't see it quite the same way.
> >
> > Let's say, as an example, we have 1 row with 2 columns (col-1 and col-2)
> > in a table and each column has 1000 versions. Using the following code
> > (the code might have errors and might not compile):
> > /**
> > * This is a very simple use case of a ColumnPrefixFilter.
> > * In fact, all other filters that make use of filterKeyValue will see
> > * similar performance problems, which concern me when the number of
> > * versions per column could be huge.
> >
> > Filter filter = new ColumnPrefixFilter(Bytes.toBytes("col-2"));
> > Scan scan = new Scan();
> > scan.setFilter(filter);
> > ResultScanner scanner = table.getScanner(scan);
> > for (Result result : scanner) {
> >     for (KeyValue kv : result.raw()) {
> >         System.out.println("KV: " + kv + ", Value: " +
> >         Bytes.toString(kv.getValue()));
> >     }
> > }
> > scanner.close();
> > */
> >
> > Implicitly, the number of versions returned per column is 1 (the latest
> > version). A user might expect that only 2 comparisons against the column
> > prefix are needed (1 for col-1 and 1 for col-2), but in fact the
> > filterKeyValue method in ColumnPrefixFilter is invoked 1001 times (1 for
> > col-1 and 1000 for col-2, i.e. 1 per version), because all versions of
> > the column obviously share the same prefix. For col-1, it will skip ahead
> > using SEEK_NEXT_USING_HINT, which should skip the remaining 999 versions
> > of col-1.
> >
> > In summary, the 1000 comparisons (5000 byte comparisons) against the
> > column prefix "col-2" are wasted because only 1 version is returned to the
> > user. Also, I believe this inefficiency is hidden from the user code, but
> > it affects all filters that use filterKeyValue as the main entry point for
> > filtering KVs. Do we have a case for improving HBase to handle this
> > inefficiency? :) It seems valid unless you prove otherwise.
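
The filter call count described in this scenario can be checked with a small
stand-alone model. This is only a sketch under stated assumptions, not the
HBase implementation: ReturnCode is a local stand-in enum, PrefixScanModel,
filterColumn, and countFilterCalls are hypothetical names, and the seek hint
is simplified to "jump to the next column", which matches the two-column
example because col-2 directly follows col-1.

```java
import java.nio.charset.StandardCharsets;

// Stand-alone model of a prefix filter driven by a scan; not HBase code.
class PrefixScanModel {
    // Local stand-in for the return codes mentioned in the thread.
    enum ReturnCode { INCLUDE, SEEK_NEXT_USING_HINT, NEXT_ROW }

    // Compares the qualifier against the prefix byte by byte.
    static ReturnCode filterColumn(byte[] prefix, byte[] qualifier) {
        int n = Math.min(prefix.length, qualifier.length);
        for (int i = 0; i < n; i++) {
            int diff = (qualifier[i] & 0xff) - (prefix[i] & 0xff);
            if (diff < 0) return ReturnCode.SEEK_NEXT_USING_HINT; // before prefix
            if (diff > 0) return ReturnCode.NEXT_ROW;             // past prefix
        }
        // Common part is equal: a match only if the qualifier is long enough.
        return qualifier.length >= prefix.length
                ? ReturnCode.INCLUDE
                : ReturnCode.SEEK_NEXT_USING_HINT;
    }

    // Counts filter invocations for one row whose columns (given in sorted
    // order) each hold `versions` versions.
    static int countFilterCalls(String prefix, String[] columns, int versions) {
        byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
        int calls = 0;
        for (String column : columns) {
            byte[] q = column.getBytes(StandardCharsets.UTF_8);
            for (int v = 0; v < versions; v++) {
                calls++;
                ReturnCode rc = filterColumn(p, q);
                if (rc == ReturnCode.NEXT_ROW) return calls;
                if (rc == ReturnCode.SEEK_NEXT_USING_HINT) break; // simplified hint
                // INCLUDE: the next version of the same column is fed in.
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        String[] columns = { "col-1", "col-2" };
        System.out.println(countFilterCalls("col-2", columns, 1000));
    }
}
```

In this model the filter is called once for col-1 and once per version for
col-2, so the total grows linearly with the number of versions of the matching
column even though only one version is returned.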
> >
> > Best Regards,
> >
> > Jerry
> >
> >
> >
> > On Tue, Aug 28, 2012 at 12:54 AM, lars hofhansl <lhofha...@yahoo.com>
> > wrote:
> >
> > > First off regarding "inefficiency"... If version counting would happen
> > > first and then filter were executed we'd have folks "complaining" about
> > > inefficiencies as well:
> > > ("Why does the code have to go through the versioning stuff when my
> > > filter filters the row/column/version anyway?")  ;-)
> > >
> > >
> > > For your problem, you want to make use of "seek hints"...
> > >
> > > In addition to INCLUDE you can return NEXT_COL, NEXT_ROW, or even
> > > SEEK_NEXT_USING_HINT from Filter.filterKeyValue(...).
> > >
> > > That way the scanning framework will know to skip ahead to the next
> > > column, row, or a KV of your choosing. (see Filter.filterKeyValue and
> > > Filter.getNextKeyHint).
> > >
> > > (as an aside, it would probably be nice if Filters also had
> > > INCLUDE_AND_NEXT_COL, INCLUDE_AND_NEXT_ROW, internally used by
> > > StoreScanner)
> > >
> > > Have a look at ColumnPrefixFilter as an example.
> > > I also wrote a short post here:
> > > http://hadoop-hbase.blogspot.com/2012/01/filters-in-hbase-or-intra-row-scanning.html
> > >
> > > Does that help?
> > >
> > > -- Lars
> > >
> > >
> > > ----- Original Message -----
> > > From: Jerry Lam <chiling...@gmail.com>
> > > To: "user@hbase.apache.org" <user@hbase.apache.org>
> > > Cc: "user@hbase.apache.org" <user@hbase.apache.org>
> > > Sent: Monday, August 27, 2012 5:59 PM
> > > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> > >
> > > Hi Lars:
> > >
> > > Thanks for confirming the inefficiency of the implementation for this
> > > case. In my case, a column can have more than 10K versions, so I need a
> > > quick way to stop the scan from digging through the column once there is
> > > a match (ReturnCode.INCLUDE). It would be nice to have a ReturnCode that
> > > can notify the framework to stop and go to the next column once the
> > > number of versions specified in setMaxVersions is met.
> > >
> > > For now, I guess I have to hack it in the custom filter (i.e. I keep the
> > > count myself)? If you have a better way to achieve this, please share :)
> > >
> > > Best Regards,
> > >
> > > Jerry
> > >
> > > Sent from my iPad (sorry for spelling mistakes)
> > >
> > > On 2012-08-27, at 20:11, lars hofhansl <lhofha...@yahoo.com> wrote:
> > >
> > > > Currently filters are evaluated before we do version counting.
> > > >
> > > > Here's a comment from ScanQueryMatcher.java:
> > > >     /**
> > > >      * Filters should be checked before checking column trackers. If
> we
> > > do
> > > >      * otherwise, as was previously being done, ColumnTracker may
> > > increment its
> > > >      * counter for even that KV which may be discarded later on by
> > > Filter. This
> > > >      * would lead to incorrect results in certain cases.
> > > >      */
> > > >
> > > >
> > > > So this is by design. (Doesn't mean it's correct or desirable,
> though.)
> > > >
> > > > -- Lars
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: Jerry Lam <chiling...@gmail.com>
> > > > To: user <user@hbase.apache.org>
> > > > Cc:
> > > > Sent: Monday, August 27, 2012 2:40 PM
> > > > Subject: setTimeRange and setMaxVersions seem to be inefficient
> > > >
> > > > Hi HBase community:
> > > >
> > > > > I tried to use setTimeRange and setMaxVersions to limit the number of
> > > > > KVs returned per column. The behaviour is as I would expect, that is,
> > > > > setTimeRange(0, T + 1) and setMaxVersions(1) will give me ONE version
> > > > > of a KV with a timestamp that is less than or equal to T.
> > > > > However, I noticed that all versions of the KeyValue for a particular
> > > > > column are processed through a custom filter I implemented even though
> > > > > I specify setMaxVersions(1) and setTimeRange(0, T+1). I expected that
> > > > > if ONE KV of a particular column has ReturnCode.INCLUDE, the framework
> > > > > will jump to the next COL instead of iterating through all versions of
> > > > > the column.
> > > > >
> > > > > Can someone confirm whether this is the expected behaviour (iterating
> > > > > through all versions of a column before setMaxVersions takes effect)?
> > > > > If this is the expected behaviour, what is your recommendation to speed
> > > > > this up?
> > > >
> > > > Best Regards,
> > > >
> > > > Jerry
> > > >
> > >
> > >
> >
> >
>
