On Tue, Feb 10, 2009 at 9:58 PM, Paul Davis <paul.joseph.da...@gmail.com> wrote: > On Tue, Feb 10, 2009 at 10:19 PM, Jeff Hinrichs - DM&T > <dunde...@gmail.com> wrote: >> Just a few comment to get things started. >> >> On Tue, Feb 10, 2009 at 5:59 PM, Paul Davis <paul.joseph.da...@gmail.com> >> wrote: >>> I've been contemplating implementing a new feature that I've been >>> wanting for awhile. There's been some talk of implementing view >>> intersections for a bit now so I figured I'd try and give a summary of >>> what the feature would entail in terms of functionality and then the >>> necessary bits required for an implementation. >>> >>> So the original idea for view intersections was exactly what the name >>> entails: Show me the intersection between two views for a given set of >>> view query parameters. After thinking about different methods of >>> implementation I think we can extend this to be more powerful and >>> generally applicable. >>> >>> Major Hurdle 1 >>> ============ >>> >>> The first necessary bit of ground work would be to implement an >>> optional value index on views. The more I thought about intersecting >>> views the more I realized it was starting to look pointless. Ignoring >>> something along the lines of group_level=N in that we can join on >>> array prefixes, all views being joined would require exactly the same >>> key. Which begs the question, why not just create 1 view that emits >>> the output of the two you want intersected. >> >> I would argue that returning a simple list of docids that meet the >> requirement should suffice -- in fact, the views a and b need not be >> homogenous so returning anything beyond docids could end up being a >> bigger problem than the intersection itself. >> >> For instance, say we want the intersection of the documents who have >> both "blue" and "fuzzy" tags so we use >> a = /_view/tags/byval?key="blue" >> b = /_view/tags/byval?key="fuzzy" >> >> intersection(a,b) >> >> Now we want to limit that to things named Harold. >> >> c=/_view/name/first?key="Harold" >> >> intersection(intersection(a,b),c) >> >> Which gives us a list of docid's that contain Blue, Fuzzy things named >> Harold. >> >> However, the values returned by view a and view b are the same, >> however the values returned by view c might be completely different. >> So returning a view with varying values might not be very helpful >> (This is where I am not seeing why more than returning a list of >> docid's would be appropriate. Of course I am most likely missing the >> point.) Only returning intersections of similar views would not be as >> interesting a returning intersections of dissimilar views. >> >> >>> >>> I couldn't get past this for a long time until I heard Chris Anderson >>> pondering adding a btree index to the values in a view. The obvious >>> downfalls of the extra space and computation usage are there, but >>> making it optional should solve any qualms in that respect. >>> >>> Given an index on a value we're now able to chain together arbitrary >>> views using either the key or value as well as limit the intersection >>> by any combination of key and value. >>> >>> As a side benefit, we would also get the select views by value >>> restriction as well. I'm thinking it'd be as transparent as adding a >>> [start|end]value and [start|end]value_docid set of URL parameters. I >>> haven't followed this train of thought too far into the code yet, but >>> something approximating that should be fairly doable. A thought occurs >>> that limiting view results by both key and value could be interesting >>> in terms of implementation. Not sure if I'd force it through the >>> intersection API or not. >>> >>> Caveats that come to mind are that this would break binary >>> compatibility for all generated views. It wouldn't require a >>> dump/reload, but it might come as a surprise to people upgrading that >>> all their views are regenerating. >>> >>> Major Hurdle 2 >>> ============ >>> >>> Implementing the view intersection API. First off, it probably needs a >>> new name. Once we have intersections working, unions, subtractions, >>> and the NxM one who's name escapes me (cross product floats up but >>> sounds not right) should be trivially implementable. >>> >>> The underlying implementation for this is basically a large merge sort >>> running over the view btree's. If you read about the merge step in >>> map/reduce/merge that's basically what I've got in my head. >>> >>> The biggest issue that I've found in getting this implemented >>> (excluding a value index) is that I'd need to write a new btree >>> traversal method that used iterators instead of a fold mechanism. This >>> shouldn't be overly difficult to implement. >>> >>> Beyond that then it's basically up to the HTTP interface in parameter >>> parsing and error checking. For passing parameters I'm thinking along >>> the line of a JSON body posted (Premptive: any RESTafarians should >>> reference the long discussion on multi-get before writing about how >>> this isn't RESTful). >> >> posting json documents seems to be required and beyond argument given >> the technical size limits of a GET request >> >>> >>> Also, not sure if it's obvious but I'd plan on allowing arbitrarily >>> nested conditions, ie, "intersection(union(a, b), c)" type of >>> operations. There's a subtle detail in the sort order and thus >>> corresponding btree traversal that might come into play there. I can >>> punt and make the entire request use one sort order, as in the >>> previous example can't specify different sort directions for the two >>> nested operations because you'd get a (presumably) zero overlap in the >>> end. I'm pretty sure if we force all btrees to be traversed in the >>> same direction for each request we don't lose any functionality >>> though. >> 3) Since it is a set operation, as long as operator precedence is >> observed, the order of processing or traversing an index within the >> set should make no difference to the result. The intersection of the >> sets {1, 2, 3} and {2, 3, 4} == {3, 1, 2} and {4, 2, 3} == {2, 3} == >> {3, 2} >> >>> Comments >>> ========= >>> >>> That's the general outline I've got in my head right now. I'm pretty >>> sure I can see 95% of the implementation, but it's possible I'm >>> missing a finer detail somewhere. If you've got questions or comments >>> let's hear them. If there's no general objection then I can probably >>> get to starting an implementation at the end of this week. >>> >>> Thanks, >>> Paul Davis >>> >> >> Hat in hand, >> >> Jeff >> > > Jeff, > > You bring up two good points I forgot to mention: > > 1. What is actually returned? > > I see two possibilities for this. Either we return some json structure > that includes the key/value pairs from all views involved (per row) or > just a list of docids. I can see either being useful, and the ideas I > have for implementation doesn't rule out either approach. Perhaps this > is a query time parameter? After reading the previous response from Dean and some more reflection. I am starting to see the point of returning values too. It would be up to the original views to decide on what to return. If I just wanted docid's I could create my views to emit less information. ;)
> 2. Do we support intersection of docid's somehow? > > More specifically, do we concentrate on just getting a list of docid's > that meet the criterion, or allow for the possibility that we can have > multiple rows from a single docid? The ambiguity being that a single > row may include more than one docid as well as a single docid may be > repeated. For now I can see arguments for both approaches, but I have > suspicions that going for rows that may possibly contain repeated > docids will be more efficient because we don't have to track which > docids have been seen etc. This might affect #1 as well. Not that the implementation needs to completely true to the idea of "sets" (A set is a collection of distinct objects, considered as an object in its own right. ) So maybe an additional operator of unique(a) might be added later, leaving the option to the user. [unique(a) returns unique docids/values from view a or something that looks like a view, i.e. intersection(a, b), perhaps by sorting on docid and then filtering on n != n-1, it would do away with extensive bookkeeping at the cost of a sort on docid and a scan] > Implementation details related to the operation ordering should > prevent inequivalent sets with different orderings. Ie, all operations > are on sorted sets, so all results should be sorted. I haven't done > the math, but on the surface I think that's right. +1 > > Also, the fuzzy ideas I have for the posted JSON body will probably > lead to an exactly specified operator ordering. Ie, for every operator > you'll have to specify exactly 2 operands. Using the result of an > operand as an operand is permitted though. Ie. You can never say A + B > + C - D, only (((A+B)+C)-D). Though that's probably me just being > unimaginitive with the JSON constraints. I'm big on explicit so no argument from me ;" > > Thanks, > Paul Davis > Regards, Jeff Hinrichs "Explicit is better than implicit" - Zen of Python