Re: View Intersections

Jeff Hinrichs - DM&T Tue, 10 Feb 2009 20:45:10 -0800

On Tue, Feb 10, 2009 at 9:58 PM, Paul Davis <paul.joseph.da...@gmail.com> wrote:
> On Tue, Feb 10, 2009 at 10:19 PM, Jeff Hinrichs - DM&T
> <dunde...@gmail.com> wrote:
>> Just a few  comment to get things started.
>>
>> On Tue, Feb 10, 2009 at 5:59 PM, Paul Davis <paul.joseph.da...@gmail.com> 
>> wrote:
>>> I've been contemplating implementing a new feature that I've been
>>> wanting for awhile. There's been  some talk of implementing view
>>> intersections for a bit now so I figured I'd try and give a summary of
>>> what the feature would entail in terms of functionality and then the
>>> necessary bits required for an implementation.
>>>
>>> So the original idea for view intersections was exactly what the name
>>> entails: Show me the intersection between two views for a given set of
>>> view query parameters. After thinking about different methods of
>>> implementation I think we can extend this to be more powerful and
>>> generally applicable.
>>>
>>> Major Hurdle 1
>>> ============
>>>
>>> The first necessary bit of ground work would be to implement an
>>> optional value index on views. The more I thought about intersecting
>>> views the more I realized it was starting to look pointless. Ignoring
>>> something along the lines of group_level=N in that we can join on
>>> array prefixes, all views being joined would require exactly the same
>>> key. Which begs the question, why not just create 1 view that emits
>>> the output of the two you want intersected.
>>
>>  I would argue that returning a simple list of docids that meet the
>> requirement should suffice -- in fact, the views a and b need not be
>> homogenous so returning anything beyond docids could end up being a
>> bigger problem than the intersection itself.
>>
>> For instance, say we want the intersection of the documents who have
>> both "blue" and "fuzzy" tags so we use
>> a = /_view/tags/byval?key="blue"
>> b = /_view/tags/byval?key="fuzzy"
>>
>> intersection(a,b)
>>
>> Now we want to limit that to things named Harold.
>>
>> c=/_view/name/first?key="Harold"
>>
>> intersection(intersection(a,b),c)
>>
>> Which gives us a list of docid's that contain Blue, Fuzzy things named 
>> Harold.
>>
>> However, the values returned by view a and view b are the same,
>> however the values returned by view c might be completely different.
>> So returning a view with varying values might not be very helpful
>> (This is where I am not seeing why more than returning a list of
>> docid's would be appropriate. Of course I am most likely missing the
>> point.)  Only returning intersections of similar views would not be as
>> interesting a returning intersections of dissimilar views.
>>
>>
>>>
>>> I couldn't get past this for a long time until I heard Chris Anderson
>>> pondering adding a btree index to the values in a view. The obvious
>>> downfalls of the extra space and computation usage are there, but
>>> making it optional should solve any qualms in that respect.
>>>
>>> Given an index on a value we're now able to chain together arbitrary
>>> views using either the key or value as well as limit the intersection
>>> by any combination of key and value.
>>>
>>> As a side benefit, we would also get the select views by value
>>> restriction as well. I'm thinking it'd be as transparent as adding a
>>> [start|end]value and [start|end]value_docid set of URL parameters. I
>>> haven't followed this train of thought too far into the code yet, but
>>> something approximating that should be fairly doable. A thought occurs
>>> that limiting view results by both key and value could be interesting
>>> in terms of implementation. Not sure if I'd force it through the
>>> intersection API or not.
>>>
>>> Caveats that come to mind are that this would break binary
>>> compatibility for all generated views. It wouldn't require a
>>> dump/reload, but it might come as a surprise to people upgrading that
>>> all their views are regenerating.
>>>
>>> Major Hurdle 2
>>> ============
>>>
>>> Implementing the view intersection API. First off, it probably needs a
>>> new name. Once we have intersections working, unions, subtractions,
>>> and the NxM one who's name escapes me (cross product floats up but
>>> sounds not right) should be trivially implementable.
>>>
>>> The underlying implementation for this is basically a large merge sort
>>> running over the view btree's. If you read about the merge step in
>>> map/reduce/merge that's basically what I've got in my head.
>>>
>>> The biggest issue that I've found in getting this implemented
>>> (excluding a value index) is that I'd need to write a new btree
>>> traversal method that used iterators instead of a fold mechanism. This
>>> shouldn't be overly difficult to implement.
>>>
>>> Beyond that then it's basically up to the HTTP interface in parameter
>>> parsing and error checking. For passing parameters I'm thinking along
>>> the line of a JSON body posted (Premptive: any RESTafarians should
>>> reference the long discussion on multi-get before writing about how
>>> this isn't RESTful).
>>
>>  posting json documents seems to be required and beyond argument given
>> the technical size limits of a GET request
>>
>>>
>>> Also, not sure if it's obvious but I'd plan on allowing arbitrarily
>>> nested conditions, ie, "intersection(union(a, b), c)" type of
>>> operations. There's a subtle detail in the sort order and thus
>>> corresponding btree traversal that might come into play there. I can
>>> punt and make the entire request use one sort order, as in the
>>> previous example can't specify different sort directions for the two
>>> nested operations because you'd get a (presumably) zero overlap in the
>>> end. I'm pretty sure if we force all btrees to be traversed in the
>>> same direction for each request we don't lose any functionality
>>> though.
>> 3) Since it is a set operation, as long as operator precedence is
>> observed, the order of processing or traversing an index within the
>> set should make no difference to the result.  The intersection of the
>> sets {1, 2, 3} and {2, 3, 4} == {3, 1, 2} and {4, 2, 3}  == {2, 3} ==
>> {3, 2}
>>
>>> Comments
>>> =========
>>>
>>> That's the general outline I've got in my head right now. I'm pretty
>>> sure I can see 95% of the implementation, but it's possible I'm
>>> missing a finer detail somewhere. If you've got questions or comments
>>> let's hear them. If there's no general objection then I can probably
>>> get to starting an implementation at the end of this week.
>>>
>>> Thanks,
>>> Paul Davis
>>>
>>
>> Hat in hand,
>>
>> Jeff
>>
>
> Jeff,
>
> You bring up two good points I forgot to mention:
>
> 1. What is actually returned?
>
> I see two possibilities for this. Either we return some json structure
> that includes the key/value pairs from all views involved (per row) or
> just a list of docids. I can see either being useful, and the ideas I
> have for implementation doesn't rule out either approach. Perhaps this
> is a query time parameter?
After reading the previous response from Dean and some more
reflection.  I am starting to see the point of returning values too.
It would be up to the original views to decide on what to return.  If
I just wanted docid's I could create my views to emit less
information. ;)


> 2. Do we support intersection of docid's somehow?
>
> More specifically, do we concentrate on just getting a list of docid's
> that meet the criterion, or allow for the possibility that we can have
> multiple rows from a single docid? The ambiguity being that a single
> row may include more than one docid as well as a single docid may be
> repeated. For now I can see arguments for both approaches, but I have
> suspicions that going for rows that may possibly contain repeated
> docids will be more efficient because we don't have to track which
> docids have been seen etc. This might affect #1 as well.
Not that the implementation needs to completely true to the idea of
"sets" (A set is a collection of distinct objects, considered as an
object in its own right. ) So maybe an additional operator of
unique(a) might be added later, leaving the option to the user.
[unique(a) returns unique docids/values from view a or something that
looks like a view, i.e. intersection(a, b), perhaps by sorting on
docid and then filtering on n != n-1, it would do away with extensive
bookkeeping at the cost of a sort on docid and a scan]

> Implementation details related to the operation ordering should
> prevent inequivalent sets with different orderings. Ie, all operations
> are on sorted sets, so all results should be sorted. I haven't done
> the math, but on the surface I think that's right.
+1
>
> Also, the fuzzy ideas I have for the posted JSON body will probably
> lead to an exactly specified operator ordering. Ie, for every operator
> you'll have to specify exactly 2 operands. Using the result of an
> operand as an operand is permitted though. Ie. You can never say A + B
> + C - D, only (((A+B)+C)-D). Though that's probably me just being
> unimaginitive with the JSON constraints.

I'm big on explicit so no argument from me ;"
>
> Thanks,
> Paul Davis
>
Regards,

Jeff Hinrichs
"Explicit is better than implicit" - Zen of Python

Re: View Intersections

Reply via email to