-- BlackBerry® from Mobistar ---

-----Original Message-----
From: Clint Morgan <[email protected]>
Date: Thu, 28 May 2009 13:45:57
To: <[email protected]>
Subject: Re: Filter use cases

Looks good to me. +1

On Thu, May 28, 2009 at 12:26 AM, Ryan Rawson <[email protected]> wrote:

> Thanks all,
>
> The old RowFilterInterface will _sort of_ work. The new code will call
> filterRowKey(byte[],int,int) and filterAllRemaining(). I unit tested the
> RowInclusiveStop and Prefix filters along with the WhileMatchRowFilter to
> wrap them. Tests pass.
>
> More complex filters such as ColumnMatchFilter won't work, and need to be
> ported to the new API before 0.20 (maybe tomorrow, eh?). Stop filter is
> not necessary as a stop-row is built into the Scan specification now.
>
> We might need to wrap some of the client-API to take existing use cases and
> translate them into the new code. Eg: detect a stop-row-filter and use the
> Scan(start,end) new-API instead, etc, etc.
>
> On Wed, May 27, 2009 at 4:35 PM, Andrew Purtell <[email protected]>
> wrote:
>
> > +1 on this API. Looks good.
> >
> > ________________________________
> > From: Ryan Rawson <[email protected]>
> > To: [email protected]; [email protected]
> > Sent: Wednesday, May 27, 2009 12:06:31 AM
> > Subject: Re: Filter use cases
> >
> > Here is a suggested API. I included the call flow in the interface docs
> > as well.
> >
> > I dropped rowProcessed() since only PageRowFilter used it, and it can get
> > the data some other way. I also dropped processAlways(). This seems like
> > internal workings of RowFilterSet, and should ideally be maintained there.
> >
> > This row filter interface supports 1 feature we can't right now:
> > - filter up to N columns, skip the rest.
> >
> > Right now we can do that, but not efficiently.
> >
> > Remember, as we write filters, columns are seen in sorted order. To be
> > efficient, at all steps we need to take advantage of the sorted order of
> > things.
> >
> > /**
> >  * Interface for row and column filters directly applied within the
> >  * regionserver.
> >  * A filter can expect the following call sequence:
> >  *
> >  * - reset();
> >  * - filterAllRemaining() -> true indicates scan is over, false, keep
> >  *   going on.
> >  * - filterRowKey(byte[],int,int); -> true to drop this row
> >  *   if false, we will also call:
> >  * - filterValue(KeyValue); -> ReturnCode saying whether to include or
> >  *   skip this key/value, or to skip to the next row
> >  * - filterRow(); -> last chance to drop entire row based on the sequence
> >  *   of filterValue() calls. Eg: filter a row if it doesn't contain a
> >  *   specified column.
> >  *
> >  * Filter instances are created one per region/scan.
> >  */
> > public interface NewRowFilterInterface extends Writable {
> >   /**
> >    * Reset the state of the filter between rows.
> >    */
> >   public void reset();
> >
> >   /**
> >    * Filters a row based on the row key. If this returns true, the entire
> >    * row will be excluded. If false, each KeyValue in the row will be
> >    * passed to filterValue() below.
> >    *
> >    * @param buffer buffer containing row key
> >    * @param offset offset into buffer where row key starts
> >    * @param length length of the row key
> >    * @return true to remove the entire row, false to include the row
> >    *         (maybe).
> >    */
> >   public boolean filterRowKey(byte [] buffer, int offset, int length);
> >
> >   /**
> >    * If this returns true, the scan will terminate.
> >    *
> >    * @return true to end scan, false to continue.
> >    */
> >   public boolean filterAllRemaining();
> >
> >   /**
> >    * A way to filter based on the column family, column qualifier and/or
> >    * the column value. Return code is described below. This allows
> >    * filters to filter only a certain number of columns, then terminate
> >    * without matching every column.
> >    *
> >    * @param v the KeyValue in question
> >    * @return code as described below
> >    */
> >   public ReturnCode filterValue(KeyValue v);
> >
> >   /**
> >    * Return codes for filterValue().
> >    */
> >   public enum ReturnCode {
> >     /**
> >      * Include the KeyValue
> >      */
> >     INCLUDE,
> >     /**
> >      * Skip this KeyValue
> >      */
> >     SKIP,
> >     /**
> >      * Done with columns, skip to next row. Note that filterRow() will
> >      * still be called.
> >      */
> >     NEXT_ROW,
> >   };
> >
> >   /**
> >    * Last chance to veto row based on previous filterValue() calls. The
> >    * filter needs to retain state, then return a particular value for
> >    * this call, if it wishes to exclude a row when a certain column is
> >    * missing (for example).
> >    *
> >    * @return true to exclude row, false to include row.
> >    */
> >   public boolean filterRow();
> >
> > }
> >
> >
> > On Tue, May 26, 2009 at 11:39 PM, Jonathan Gray <[email protected]>
> > wrote:
> >
> > > This sounds like a good initial approach for a new filter interface.
> > >
> > > +1 on moving forward with what you propose, allowing for modifications
> > > as we reimplement and integrate.
> > >
> > > Good stuff, Ryan!
> > >
> > > JG
> > >
> > > On Tue, May 26, 2009 11:28 pm, Ryan Rawson wrote:
> > > > Hi all,
> > > >
> > > > With HBASE-1304, it's time to normalize and review our filter API.
> > > >
> > > > Here are a few givens:
> > > > - all calls must be byte[] buffer, int offset, int length
> > > > - maybe we can have calls for KeyValue (which encodes all parts of
> > > >   the key & value as per the name)
> > > > - we'd like to get rid of the calls:
> > > > -- boolean filterRow(final SortedMap<byte [], Cell> columns);
> > > > -- boolean filterRow(final List<KeyValue> results);
> > > > These calls are expensive, and there is no reason to have them.
> > > >
> > > > Here is a proposal, imagine a filter will see this sequence of calls:
> > > > - reset()
> > > > - filterRowKey(byte[],int,int) - true to include row, false to skip
> > > >   to next row
> > > > - filterKeyValue(KeyValue) - true to include key/value, false to skip
> > > > -- can choose to filter on family, qualifier, value, anything really.
> > > > - filterRow() - true to include entire row, false to post-hoc veto
> > > >   row
> > > >
> > > > In this case one could implement the "filterIfColumnMissing" feature
> > > > of ColumnValueFilter by carrying state and returning false from
> > > > filterRow() to veto the row based on the columns/values we didn't
> > > > see.
> > > >
> > > > In any of these cases, all these functions will be called quite
> > > > frequently, so efficiency of the code is paramount. It's probable
> > > > that filterRowKey() will be 'cached' by the calling code, but
> > > > filterKeyValue() is called for nearly every single value we would
> > > > normally return (ie: it's applied _AFTER_ column matching and version
> > > > and timestamp and delete tracking).
> > > >
> > > > The goal is to:
> > > > (a) make the implementation easy and performant
> > > > (b) make the API normative and easy to code for
> > > > (c) make everything work
> > > >
> > > > Thoughts?
> > > > -ryan
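To make the call sequence of the suggested API (the later version, where true means "drop") concrete, here is a minimal Java sketch of the two use cases mentioned in the thread: "filter up to N columns, skip the rest" and "veto a row if a column is missing". Everything in it is an assumption made for illustration and is not HBase code: SketchKeyValue stands in for KeyValue, SketchRowFilter mirrors the proposed interface without Writable, and SketchScanner is a hypothetical driver that only imitates the documented call order.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for KeyValue: just a row key and a column name.
final class SketchKeyValue {
    final String row;
    final String column;
    SketchKeyValue(String row, String column) { this.row = row; this.column = column; }
}

// Mirrors the interface proposed in the thread, minus Writable.
interface SketchRowFilter {
    enum ReturnCode { INCLUDE, SKIP, NEXT_ROW }
    void reset();
    boolean filterAllRemaining();
    boolean filterRowKey(byte[] buffer, int offset, int length);
    ReturnCode filterValue(SketchKeyValue v);
    boolean filterRow();
}

// "Filter up to N columns, skip the rest": include N values, then NEXT_ROW.
final class ColumnLimitFilter implements SketchRowFilter {
    private final int limit;
    private int seen;
    ColumnLimitFilter(int limit) { this.limit = limit; }
    public void reset() { seen = 0; }
    public boolean filterAllRemaining() { return false; }
    public boolean filterRowKey(byte[] buffer, int offset, int length) { return false; }
    public ReturnCode filterValue(SketchKeyValue v) {
        if (seen >= limit) return ReturnCode.NEXT_ROW;
        seen++;
        return ReturnCode.INCLUDE;
    }
    public boolean filterRow() { return false; }
}

// Carries state across filterValue() calls and vetoes the row in filterRow()
// when the required column was never seen.
final class RequireColumnFilter implements SketchRowFilter {
    private final String required;
    private boolean found;
    RequireColumnFilter(String required) { this.required = required; }
    public void reset() { found = false; }
    public boolean filterAllRemaining() { return false; }
    public boolean filterRowKey(byte[] buffer, int offset, int length) { return false; }
    public ReturnCode filterValue(SketchKeyValue v) {
        if (v.column.equals(required)) found = true;
        return ReturnCode.INCLUDE;
    }
    public boolean filterRow() { return !found; }
}

// Hypothetical driver imitating the documented order: reset(),
// filterAllRemaining(), filterRowKey(), filterValue() per value, filterRow().
final class SketchScanner {
    static List<String> scan(List<SketchKeyValue> sorted, SketchRowFilter filter) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            String row = sorted.get(i).row;
            int end = i;
            while (end < sorted.size() && sorted.get(end).row.equals(row)) end++;
            filter.reset();
            if (filter.filterAllRemaining()) break;           // scan is over
            byte[] key = row.getBytes(StandardCharsets.UTF_8);
            if (!filter.filterRowKey(key, 0, key.length)) {    // true would drop the row
                List<String> kept = new ArrayList<>();
                for (int j = i; j < end; j++) {
                    SketchKeyValue kv = sorted.get(j);
                    SketchRowFilter.ReturnCode rc = filter.filterValue(kv);
                    if (rc == SketchRowFilter.ReturnCode.NEXT_ROW) break;
                    if (rc == SketchRowFilter.ReturnCode.INCLUDE) kept.add(kv.row + "/" + kv.column);
                }
                if (!filter.filterRow()) out.addAll(kept);     // last-chance veto
            }
            i = end;
        }
        return out;
    }
}
```

The driver buffers a row's included values until filterRow() has had its say, which is one simple way to honour a post-hoc veto. How the real scanner handles this is not stated in the thread.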
