newbie questions: sorted keys, ADT primitives, and link manipulation

Gary William Flake Sat, 15 Jan 2011 16:23:23 -0800

I am building a backend for a web service and Riak looks to be a strong fit
for my needs at this point.  However, there is one really simple requirement
that I can't figure out how to implement on Riak with any sort of
efficiency.  To simplify the question, suppose that I want a twitter-like
service that has billions of smallish documents that enter into the backend
over a period of time.  Typical read pattern will be:


1. List the last N documents
2. Given a document, see the N that came before this one.
3. From a random point, see the N documents before this one.

Hence, what I really want is to be able to access the documents in sorted
order by timestamp.  I know that I can do all of this with a map/reduce job,
but this solution literally requires the backend to scan each and every
document which seems less than ideal.

Is there a generally accepted "right" way to do this on Riak?  Or is there
no known way to do this gracefully, and I need to be prepared to roll my own
secondary indices?  Or, can this be handled with a Riak Search range query?

###

The second topic is concerned with finding an elegant way to address the
first question.  One way to solve the problem above is to use a bucket as a
poor man's list.  At the application layer, when I insert a new document, I
could create a new k:v pair in my "list" bucket with the key being an
integer corresponding to this document's position in the corpus, and the
value being the actual key in the document bucket.  With this scheme, I
could satisfy the scenario above with the following observations:

1. Once I have the list key for a document, I could find the N before it in
O(N) time which is an improvement over the map/reduce solution which is
O(size of corpus) / O(cluster size).

2. This solution assumes atomicity that doesn't exist in practice (e.g., the
operations "new id <- number of docs", and "add a new doc with key new id"
have to complete as an atomic operation if multiple clients are attempting
an insert).

3. This solution also wants a multiget styled function to map index values
to true document keys, and then to get the resulting document.

I know what you are thinking: just use a serialized JSON array as the value
of the list that you want, then just get it, deserialize it, mutate it, then
update its value.  But this is horrific because each insertion takes
O(corpus size) time.

Bottom line: life sucks no matter which route we take.

What I think is needed are redis styled objects that support O(1) atomic
operations.  For example, if I simply had a k:v pair where the value was a
list which supported insert/delete front/rear, then we are golden.

###

This brings me to my third question / observation.  In scanning over the
archives of this list, I see that others have requested this feature before,
but I haven't seen any mention of the feature being considered.  (Please,
please!  I hope I am wrong).

So, I want to offer what I think is the minimal feature request that
supports the scenarios that I am trying to achieve.  Basically, I want to
build off of links, allowing them to generalize into container types.  My
assumptions:

1. Links have a natural order in that they alway come out in the same order
that they are created (at least that's what the python client API states,
but I can't confirm it in the Riak docs).

2. Links seem to support mutation operations, meaning that I can add a link
and remove a link in O(1) time (and not in time proportional to the number
of links on the record).

If both of these are true, then all that is needed is to add simple O(1)
mutation operations allowing one to treat a records links as a special
container that references to other records.  Insert/delete front/rear will
give you stacks and queues.  A small addition to the API would support sets.

###

That's all for now.  I hope someone can chime in with something to add on
top.  My real hope is that I've misunderstood the capabilities of Riak and
that what I want to do can be done today.  But if that's not the case, I am
curious if others have the same needs and if they know how to work around
them.

Thanks,
-- GWF

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

newbie questions: sorted keys, ADT primitives, and link manipulation

Reply via email to