Fwd: Secondary Indexes - Feedback?

Gordon Tillman Thu, 17 Nov 2011 06:45:26 -0800

I forgot to CC the mailing list with this response.

--g

From: Gordon Tillman <[email protected]>
Subject: Re: Secondary Indexes - Feedback?
Date: November 16, 2011 14:55:00 CST
To: Rusty Klophaus <[email protected]>

On Nov 16, 2011, at 13:53 , Rusty Klophaus wrote:

> Hi Gordon,
> 
> Thanks for your feedback! Some follow up questions below:
> 
> For example, with search I can specify something like this to generate my 
> input to map-reduce:
> 
> p:foo AND t:bar  (give me all the objects whose parent "p" is foo and that 
> have tag "t" of bar).
> 
> So that would get fed to map-reduce where additional processing (think 
> filtering, sorting, pagination) is done.
> 
> I can do the same thing with secondary indexes but would have to move some of 
> that into map.  
> 
> So in this case I would use secondary indexes to grab all of the items whose 
> parent "p" is "foo".  This would generate the input phase and at that point I 
> would have to use map to filter out all of the items that did not contain the 
> tag "t"  of "bar".
> 
> It is doable, but not as performant as I think it could be.
> 
> So to be clear, this is mainly about performance, not convenience? In other 
> words, you don't mind writing your own map function, so long as it is fast?

That is correct; I don't mind doing that at all.  We already have a bunch of 
M/R code, and it's all in Erlang, so it is pretty fast.  Here is where my 
comment about speed came from.

Assume theoretical objects that have these fields: parent, tag, date, data.  
Stored in JSON.  Our goal is to retrieve all objects where parent="foo", 
tag="bar",  and date<20111116 in a M/R job.  

(1) We could do this:  use "input: bucket" (full key listing), and do all of 
the filtering in a map phase.

(2) Conversely, if using search to index our data we could use search as the 
input phase:

"parent:foo AND tag:bar AND date:[00000000 TO 20111115]"

and you are pretty much done.

Everything else is somewhere in between.  So when using secondary indexes we 
can pick one of those three fields to generate the input phase (say 
parent_bin=foo) and do the rest of the filtering in a map phase.

I am operating on the assumption that option (1) is the slowest and option (2) 
is the fastest, so that the solution using secondary indexes would fall 
somewhere in between.  I am probably oversimplifying, but that is what 
motivated my remark with regard to speed.

> Also, let's say that part 1 of the query is getting a list of keys where "p" 
> == "foo", part 2 is turning those keys into objects, and part 3 is filtering 
> those objects. Are all parts too slow for your application, or is only a 
> specific part of the query too slow?
> 
> Hope that makes sense, this is a nuanced point.

In the example above I would combine parts 2 and 3 into one map phase.  I would 
extract a JSON representation of each object, something like this (the 
assumption here, of course, is that allow_mult = false for the bucket in 
question):

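%% Decode a Riak object's value as JSON.  Returns null if the object is
%% missing or its content type is not application/json.
%% Assumes allow_mult = false, so the object has a single value.
get_json({error, notfound}) ->
    null;
get_json(RiakObject) ->
    ObjMD = riak_object:get_metadata(RiakObject),
    case dict:find(<<"content-type">>, ObjMD) of
        {ok, "application/json"} ->
            mochijson2:decode(riak_object:get_value(RiakObject));
        _ ->
            %% Missing content type, or any other content type.
            null
    end.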

I would then check to see if the object meets all of the filter criteria.  If 
not, return []; otherwise return whatever subset of the JSON data is required.  
Since there is no other map phase following this one, I don't have to return 
[[bucket, key]]; I can just return the data.
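
A minimal sketch of that filter step, written against the term shape that 
mochijson2:decode produces ({struct, Proplist} with binary keys).  The module 
name filter_sketch, the function names, and the assumption that dates are 
stored as <<"YYYYMMDD">> binaries are illustrative and not taken from the 
actual code; the field names and criteria come from the example above.

```erlang
-module(filter_sketch).
-export([filter_json/1, matches/1]).

%% Hypothetical filter for the example in this thread:
%% parent = "foo", tag = "bar", date < 20111116.
%% Assumes dates are <<"YYYYMMDD">> binaries, so plain binary
%% comparison orders them chronologically.
matches({struct, Props}) ->
    Date = proplists:get_value(<<"date">>, Props),
    proplists:get_value(<<"parent">>, Props) =:= <<"foo">>
        andalso proplists:get_value(<<"tag">>, Props) =:= <<"bar">>
        andalso is_binary(Date)
        andalso Date < <<"20111116">>;
matches(_Other) ->
    false.

%% Return [] for rejected (or undecodable) objects, or a one-element
%% list holding the "data" field for accepted ones.
filter_json({struct, Props} = Json) ->
    case matches(Json) of
        true -> [proplists:get_value(<<"data">>, Props)];
        false -> []
    end;
filter_json(_NullOrOther) ->
    [].
```

Inside a map function this would presumably be called as 
filter_sketch:filter_json(get_json(RiakObject)), so that objects failing any 
criterion contribute nothing to the phase output.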

So really, the only thing that would possibly result in slower performance 
would be that the initial set of objects generated during the input phase would 
be larger when using secondary indexes as opposed to using search.

> 
> -- 
> Rusty Klophaus (@rustyio)
> Basho Technologies, Inc.
> www.basho.com
> 

Honestly, Rusty, I don't think that is the biggest performance issue that I'm 
worried about.  I'm really interested in being able to implement distributed 
reduce phases (specifically to do a partial sort) and then have that output 
handled by a final reduce phase that could perform an efficient merge sort and 
stream results back to the client.  That would be really cool!
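
The idea can be illustrated outside of Riak with plain lists: each partition 
sorts its own results, and a final step merges the already-sorted runs.  This 
is only a local simulation under that assumption; the module and function 
names are hypothetical, it does not use any Riak API, and it does not model 
streaming results back to the client.

```erlang
-module(merge_sketch).
-export([partial_sort/1, final_merge/1, simulate/1]).

%% Stand-in for a per-partition ("distributed") reduce phase:
%% sort only the values that this partition holds.
partial_sort(Values) ->
    lists:sort(Values).

%% Stand-in for the final reduce phase: merge runs that are already
%% sorted.  lists:merge/1 requires every sublist to be sorted.
final_merge(SortedRuns) ->
    lists:merge(SortedRuns).

%% Simulate the whole pipeline over a list of per-partition value lists.
simulate(Partitions) ->
    final_merge([partial_sort(P) || P <- Partitions]).
```

The point of the split is that the final phase only merges sorted runs 
rather than re-sorting the whole result set.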

--g

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
