Hi Lars:

Thanks for spending time discussing this with me. I appreciate it.

I tried to implement the setMaxVersions(1) behaviour inside the filter as follows:

    @Override
    public ReturnCode filterKeyValue(KeyValue kv) {
        // Check whether this is the same qualifier as the one that was
        // included previously. If yes, jump to the next column.
        if (previousIncludedQualifier != null
                && Bytes.compareTo(previousIncludedQualifier, kv.getQualifier()) == 0) {
            previousIncludedQualifier = null;
            return ReturnCode.NEXT_COL;
        }
        // Another condition that makes the jump go further using a hint.
        if (Bytes.compareTo(this.qualifier, kv.getQualifier()) == 0) {
            LOG.info("Matched Found.");
            return ReturnCode.SEEK_NEXT_USING_HINT;
        }
        // Include this KV in the result and keep track of the included
        // qualifier so that the next version of the same qualifier is excluded.
        previousIncludedQualifier = kv.getQualifier();
        return ReturnCode.INCLUDE;
    }

Does this look reasonable, or is there a better way to achieve this? It would be nice to have ReturnCode.INCLUDE_AND_NEXT_COL for this case though.

Best Regards,

Jerry

On Wed, Aug 29, 2012 at 2:09 AM, lars hofhansl <lhofha...@yahoo.com> wrote:

> Hi Jerry,
>
> my answer will be the same again:
> Some folks will want the max versions set by the client to be applied before
> filters and some folks will want it to restrict the end result.
> It's not possible to have it both ways. Your filter needs to do the right
> thing.
>
> There's a lot of discussion around this in HBASE-5104.
>
> -- Lars
>
> ________________________________
> From: Jerry Lam <chiling...@gmail.com>
> To: user@hbase.apache.org; lars hofhansl <lhofha...@yahoo.com>
> Sent: Tuesday, August 28, 2012 1:52 PM
> Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
>
> Hi Lars:
>
> I see. Please refer to the inline comment below.
>
> Best Regards,
>
> Jerry
>
> On Tue, Aug 28, 2012 at 2:21 PM, lars hofhansl <lhofha...@yahoo.com> wrote:
>
> > What I was saying was: It depends. :)
> >
> > First off, how do you get to 1000 versions?
> > In 0.94++ older versions are pruned upon flush, so you need 333 flushes
> > (assuming 3 versions on the CF) to get 1000 versions.
>
> I forgot that the default number of versions to keep is 3. If this is what
> people use most of the time, then yes, you are right for this type of
> scenario, where the number of versions to keep per column is small.
>
> > By that time some compactions will have happened and you're back to close
> > to 3 versions (maybe 9, 12, or 15 or so, depending on how many store files
> > you have).
> >
> > Now, if you have that many versions because you set VERSIONS=>1000
> > in your CF... Then imagine you have 100 columns with 1000 versions each.
>
> Yes, imagine I set VERSIONS => Long.MAX_VALUE (i.e. I will manage the
> versioning myself).
>
> > In your scenario below you'd do 100000 comparisons if the filter would be
> > evaluated after the version counting. But only 1100 with the current code
> > (or at least in that ball park).
>
> This is where I don't quite understand what you mean.
>
> If the framework counts the number of ReturnCode.INCLUDE results and then
> stops feeding KeyValues into the filterKeyValue method once it reaches the
> count specified in setMaxVersions (i.e. 1 for the case we discussed),
> shouldn't that be just 100 comparisons (at most) instead of 1100?
> Maybe I don't understand how the current code works...
>
> > The gist is: One can construct scenarios where one approach is better than
> > the other. Only one order is possible.
> > If you write a custom filter and you care about these things you should
> > use the seek hints.
> >
> > -- Lars
> >
> >
> > ----- Original Message -----
> > From: Jerry Lam <chiling...@gmail.com>
> > To: user@hbase.apache.org; lars hofhansl <lhofha...@yahoo.com>
> > Cc:
> > Sent: Tuesday, August 28, 2012 7:17 AM
> > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> >
> > Hi Lars:
> >
> > Thanks for the reply.
> > I need to understand whether I have misunderstood the perceived
> > inefficiency, because it seems you don't see it quite the same way.
> >
> > Let's say, as an example, we have 1 row with 2 columns (col-1 and col-2)
> > in a table and each column has 1000 versions. Using the following code
> > (the code might have errors and might not compile):
> >
> > /**
> >  * This is a very simple use case of a ColumnPrefixFilter.
> >  * In fact all other filters that make use of filterKeyValue will see
> >  * similar performance problems to the ones I am concerned about when the
> >  * number of versions per column could be huge.
> >  */
> > Filter filter = new ColumnPrefixFilter(Bytes.toBytes("col-2"));
> > Scan scan = new Scan();
> > scan.setFilter(filter);
> > ResultScanner scanner = table.getScanner(scan);
> > for (Result result : scanner) {
> >     for (KeyValue kv : result.raw()) {
> >         System.out.println("KV: " + kv + ", Value: " +
> >             Bytes.toString(kv.getValue()));
> >     }
> > }
> > scanner.close();
> >
> > Implicitly, the number of versions per column that is going to be returned
> > is 1 (the latest version). A user might expect that only 2 comparisons of
> > the column prefix are needed (1 for col-1 and 1 for col-2), but in fact the
> > scan calls the filterKeyValue method in ColumnPrefixFilter 1001 times (once
> > for col-1 and 1000 times for col-2, 1 per version), because all versions of
> > the column have the same prefix, for obvious reasons. For col-1, it will
> > skip ahead using SEEK_NEXT_USING_HINT, which should skip the remaining 999
> > versions of col-1.
> >
> > In summary, the 1000 comparisons (5000 byte comparisons) for the column
> > prefix "col-2" are wasted because only 1 version is returned to the user.
> > Also, I believe this inefficiency is hidden from the user code, but it
> > affects all filters that use filterKeyValue as the main mechanism for
> > filtering KVs. Do we have a case for improving HBase to handle this
> > inefficiency? :) It seems valid unless you prove otherwise.
> >
> > Best Regards,
> >
> > Jerry
> >
> >
> > On Tue, Aug 28, 2012 at 12:54 AM, lars hofhansl <lhofha...@yahoo.com> wrote:
> >
> > > First off regarding "inefficiency"... If version counting would happen
> > > first and then filters were executed we'd have folks "complaining" about
> > > inefficiencies as well
> > > ("Why does the code have to go through the versioning stuff when my
> > > filter filters the row/column/version anyway?") ;-)
> > >
> > >
> > > For your problem, you want to make use of "seek hints"...
> > >
> > > In addition to INCLUDE you can return NEXT_COL, NEXT_ROW, or even
> > > SEEK_NEXT_USING_HINT from Filter.filterKeyValue(...).
> > >
> > > That way the scanning framework will know to skip ahead to the next
> > > column, row, or a KV of your choosing (see Filter.filterKeyValue and
> > > Filter.getNextKeyHint).
> > >
> > > (As an aside, it would probably be nice if Filters also had
> > > INCLUDE_AND_NEXT_COL, INCLUDE_AND_NEXT_ROW, internally used by
> > > StoreScanner.)
> > >
> > > Have a look at ColumnPrefixFilter as an example.
> > > I also wrote a short post here:
> > > http://hadoop-hbase.blogspot.com/2012/01/filters-in-hbase-or-intra-row-scanning.html
> > >
> > > Does that help?
> > >
> > > -- Lars
> > >
> > >
> > > ----- Original Message -----
> > > From: Jerry Lam <chiling...@gmail.com>
> > > To: "user@hbase.apache.org" <user@hbase.apache.org>
> > > Cc: "user@hbase.apache.org" <user@hbase.apache.org>
> > > Sent: Monday, August 27, 2012 5:59 PM
> > > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> > >
> > > Hi Lars:
> > >
> > > Thanks for confirming the inefficiency of the implementation for this
> > > case. In my case, a column can have more than 10K versions, so I need a
> > > quick way to stop the scan from digging through the column once there is
> > > a match (ReturnCode.INCLUDE).
> > > It would be nice to have a ReturnCode that can notify
> > > the framework to stop and go to the next column once the number of
> > > versions specified in setMaxVersions is met.
> > >
> > > For now, I guess I have to hack it in the custom filter (i.e. I keep the
> > > count myself)? If you have a better way to achieve this, please share :)
> > >
> > > Best Regards,
> > >
> > > Jerry
> > >
> > > Sent from my iPad (sorry for spelling mistakes)
> > >
> > > On 2012-08-27, at 20:11, lars hofhansl <lhofha...@yahoo.com> wrote:
> > >
> > > > Currently filters are evaluated before we do version counting.
> > > >
> > > > Here's a comment from ScanQueryMatcher.java:
> > > >     /**
> > > >      * Filters should be checked before checking column trackers. If we do
> > > >      * otherwise, as was previously being done, ColumnTracker may increment its
> > > >      * counter for even that KV which may be discarded later on by Filter. This
> > > >      * would lead to incorrect results in certain cases.
> > > >      */
> > > >
> > > >
> > > > So this is by design. (Doesn't mean it's correct or desirable, though.)
> > > >
> > > > -- Lars
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: Jerry Lam <chiling...@gmail.com>
> > > > To: user <user@hbase.apache.org>
> > > > Cc:
> > > > Sent: Monday, August 27, 2012 2:40 PM
> > > > Subject: setTimeRange and setMaxVersions seem to be inefficient
> > > >
> > > > Hi HBase community:
> > > >
> > > > I tried to use setTimeRange and setMaxVersions to limit the number of
> > > > KVs returned per column. The behaviour is as I would expect, that is,
> > > > setTimeRange(0, T + 1) and setMaxVersions(1) will give me ONE version
> > > > of a KV with a timestamp that is less than or equal to T.
> > > > However, I noticed that all versions of the KeyValue for a particular
> > > > column are processed through a custom filter I implemented even though
> > > > I specify setMaxVersions(1) and setTimeRange(0, T+1).
> > > > I expected that if ONE
> > > > KV of a particular column has ReturnCode.INCLUDE, the framework will
> > > > jump to the next COL instead of iterating through all versions of the
> > > > column.
> > > >
> > > > Can someone confirm whether this is the expected behaviour (iterating
> > > > through all versions of a column before setMaxVersions takes effect)?
> > > > If this is the expected behaviour, what is your recommendation to speed
> > > > this up?
> > > >
> > > > Best Regards,
> > > >
> > > > Jerry
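
To make the call counts discussed in this thread concrete, here is a toy model in plain Java. It is only a sketch under stated assumptions: Cell, OneVersionFilter, buildRow and scan are hypothetical names invented for illustration, not HBase classes, and the scan loop merely imitates how a scanner might react to NEXT_COL. The seek-hint branch of the filter at the top of the thread is left out, and the model covers a single row only.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these types are HBase classes.
class VersionSkipSketch {
    enum ReturnCode { INCLUDE, NEXT_COL }

    // Hypothetical stand-in for a KeyValue: just a qualifier and a timestamp.
    static final class Cell {
        final String qualifier;
        final long timestamp;

        Cell(String qualifier, long timestamp) {
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    // Same bookkeeping as the filter in the thread, without the seek-hint branch.
    static final class OneVersionFilter {
        private String previousIncludedQualifier;
        int calls = 0; // how often the filter was consulted

        ReturnCode filterKeyValue(Cell cell) {
            calls++;
            // A second version of the column we just included: ask to move on.
            if (cell.qualifier.equals(previousIncludedQualifier)) {
                previousIncludedQualifier = null;
                return ReturnCode.NEXT_COL;
            }
            previousIncludedQualifier = cell.qualifier;
            return ReturnCode.INCLUDE;
        }
    }

    // One row: columns in qualifier order, newest version first within a column.
    static List<Cell> buildRow(int columns, int versions) {
        List<Cell> row = new ArrayList<>();
        for (int c = 0; c < columns; c++) {
            for (int v = 0; v < versions; v++) {
                row.add(new Cell(String.format("col-%04d", c), versions - v));
            }
        }
        return row;
    }

    // Assumed scanner behaviour: NEXT_COL skips every remaining version of the column.
    static List<Cell> scan(List<Cell> row, OneVersionFilter filter) {
        List<Cell> included = new ArrayList<>();
        int i = 0;
        while (i < row.size()) {
            Cell cell = row.get(i);
            if (filter.filterKeyValue(cell) == ReturnCode.INCLUDE) {
                included.add(cell);
                i++;
            } else {
                String skipped = cell.qualifier;
                while (i < row.size() && row.get(i).qualifier.equals(skipped)) {
                    i++;
                }
            }
        }
        return included;
    }

    public static void main(String[] args) {
        OneVersionFilter filter = new OneVersionFilter();
        List<Cell> result = scan(buildRow(100, 1000), filter);
        System.out.println("returned=" + result.size() + ", filter calls=" + filter.calls);
    }
}
```

In this model a row of 100 columns with 1000 versions each costs two filter calls per column (one INCLUDE, then one NEXT_COL), so 200 calls for 100 returned cells. A hypothetical INCLUDE_AND_NEXT_COL return code would presumably bring that down to one call per column. The model also ignores row boundaries; in a real filter the remembered qualifier would probably need to be cleared between rows (for example in Filter.reset()), otherwise a column that ends one row and starts the next could be skipped by mistake.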