On Thursday, March 8, 2012 7:01:49 AM, "Adam Fuchs" <[email protected]> wrote:
> Yes, yes, yes, this is going to be a very useful feature set! (I told
> Andie all about it and she agreed whole-heartedly)
>
> I think that step one needs to be figuring out how to expose this in
> the API, and the iterator interface is the place to start. Once we
> have defined an abstraction layer, we can experiment with lots of
> different implementations at the RFile layer. If we are going to
> broadly extend these locality group-type filtering optimizations, it
> might make sense to drop the specialization for column family
> filtering that is part of the SortedKeyValueIterator seek method.
> Then we could support column family filtering, timestamp filtering,
> cell-level security filtering, etc. as separate iterators. The
> specialization for column family filtering is our current mechanism
> for optimizing that operation in the RFile, but we could be a little
> smarter about how we do this.
>
> What I'm suggesting is that when we construct an iterator tree we
> look for iterators on top of the RFile reader that we can collapse
> and implement as part of the RFile reader. So, if a column family
> filtering iterator is on top of the RFile then we can grab its set of
> column families and replace it with the filtered RFile reader. If we
> add a little knowledge about commutativity of iterators then we can
> even collapse filters that are not directly on top of the RFile
> reader (like there might be a merging iterator between the RFile
> reader and the column family filtering iterator). One way we could
> implement this is by changing the factory method that generates
> iterators. When this method calls the init method on a newly
> constructed iterator it can instead push that iterator down through
> the tree and return the source iterator instead. We might be able to
> specialize the iterator environment to signal the optimization and
> avoid any changes to the API here.
>
> Once we get to the point of optimizing the RFile, I think what we
> might find is that the RFile entries are naturally grouped by time
> into blocks in many cases. A simple timestamp-based block filter
> might be optimal in these cases. This is what I was talking about
> with introducing extra features (timestamp ranges, etc) into the
> RFile index. I think it also makes sense to include some aggregate
> cell-level security markings here.
>
> One other thing to think about: I like the simpler iterator
> interface, but there are some implications to modifying the column
> family filter set during a query that might be tricky. Does anybody
> change the column family set mid-query now, anyway? Is that something
> we would want to support for timestamps or other filters?

There are iterators that change the column family filter set, so I'm
wary of automatically deciding which iterators can be pulled down into
the file.

Billie
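
To make the trade-off concrete, here is a rough sketch of the collapsing
idea from the quoted message, with a guard for the concern about filter
sets that change mid-query. Everything in it is hypothetical and
simplified for illustration: SketchIterator, SketchFileReader,
SketchFamilyFilter, SketchIteratorFactory and the fixedForQuery flag are
made-up names, not the Accumulo SortedKeyValueIterator or RFile APIs.
The factory only pushes a filter into the reader when the filter sits
directly on the reader and declares that its family set will not change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified stand-in for an iterator; NOT the Accumulo interface.
interface SketchIterator {
    List<String[]> scan(); // each entry is {columnFamily, value}
}

// Stand-in for a file reader that can filter column families itself.
final class SketchFileReader implements SketchIterator {
    private final List<String[]> entries;
    private final Set<String> families; // null means "no filter"

    SketchFileReader(List<String[]> entries, Set<String> families) {
        this.entries = entries;
        this.families = families;
    }

    // Returns a reader over the same entries that filters internally.
    SketchFileReader withFamilies(Set<String> wanted) {
        return new SketchFileReader(entries, wanted);
    }

    @Override
    public List<String[]> scan() {
        List<String[]> out = new ArrayList<>();
        for (String[] e : entries) {
            if (families == null || families.contains(e[0])) {
                out.add(e);
            }
        }
        return out;
    }
}

// Column family filtering expressed as a separate iterator.
final class SketchFamilyFilter implements SketchIterator {
    final SketchIterator source;
    final Set<String> families;
    final boolean fixedForQuery; // assumption: false if the set may change mid-query

    SketchFamilyFilter(SketchIterator source, Set<String> families, boolean fixedForQuery) {
        this.source = source;
        this.families = families;
        this.fixedForQuery = fixedForQuery;
    }

    @Override
    public List<String[]> scan() {
        List<String[]> out = new ArrayList<>();
        for (String[] e : source.scan()) {
            if (families.contains(e[0])) {
                out.add(e);
            }
        }
        return out;
    }
}

final class SketchIteratorFactory {
    // Collapse the filter into the reader only when it is safe to do so;
    // otherwise leave the iterator tree as it was built.
    static SketchIterator build(SketchFamilyFilter filter) {
        if (filter.fixedForQuery && filter.source instanceof SketchFileReader) {
            return ((SketchFileReader) filter.source).withFamilies(filter.families);
        }
        return filter;
    }
}
```

In this sketch the opt-in flag is the whole point: an iterator that
rewrites its column family set during a scan would leave fixedForQuery
false and stay in the tree untouched, so nothing is pulled down into the
file automatically. Handling a merging iterator between the filter and
the reader would need the commutativity knowledge mentioned in the
quote, which the sketch does not attempt.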
