Re: Map Reduce Requirements

Brian Rowe Tue, 23 Aug 2011 08:12:46 -0700

I'm a little late to the party, but the way I've been handling
marshaling is to use an explicit map/reduce phase that performs the
marshaling and/or data massaging. You can chain map phases together by
using the special bucket/key pair {none,none} and passing the
intermediate data via the KeyData. This also makes the phases more
portable if you wish to reuse them in other situations. I wrote a
blog post about chaining phases a while back, which might be useful:
http://cartesianfaith.wordpress.com/2011/07/27/mapreduce-tips-and-tricks-in-riak/
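
For illustration only, here is a rough sketch of what a job with two
chained Erlang map phases might look like when submitted as JSON to
Riak's MapReduce HTTP interface. The module and function names
(my_phases, marshal, summarize) and the bucket name are hypothetical,
and the Erlang functions themselves (the part that would emit
{none,none} with the intermediate data in the KeyData) are not shown.

```python
import json


def chained_map_job(bucket, phases):
    """Build a MapReduce job spec with one Erlang map phase per
    (module, function) pair. Only the last phase keeps its results,
    so earlier phases just feed the next one."""
    query = []
    for i, (module, function) in enumerate(phases):
        is_last = i == len(phases) - 1
        query.append({
            "map": {
                "language": "erlang",
                "module": module,      # hypothetical module name
                "function": function,  # hypothetical function name
                "keep": is_last,
            }
        })
    return {"inputs": bucket, "query": query}


# Hypothetical phases: the first marshals the raw objects, the second
# works on the marshaled data handed over by the first.
job = chained_map_job("trades", [("my_phases", "marshal"),
                                 ("my_phases", "summarize")])
print(json.dumps(job, indent=2))
```

Keeping the marshaling step as its own phase is what makes it easy to
drop into other jobs: only the list of phases changes.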


HTH,
Brian


On Tue, Aug 23, 2011 at 10:01 AM, Jeremiah Peschka
<[email protected]> wrote:
> On Aug 22, 2011, at 8:50 PM, bill robertson wrote:
>
>> I wonder if it would be feasible to deploy an erlang web-service in the riak 
>> node's webmachine instance that could translate meta-data into Erlang funs 
>> and drive the map reduce operation that way. I'm not sure if I could get 
>> around having specific knowledge of the protobuf structures baked into that 
>> code, but I don't think it matters in this case.
>>
>> I also wonder how much 1.0 will change this picture.
>>
>> > Additionally, are secondary indexes meta-data?  i.e. If I built some 
>> > secondary indices, these are stored in some form internal to Riak, and 
>> > therefore available for query regardless of the type of data it's 
>> > associated with. Is this correct?
>>
>> Secondary indexes are a separate physical structure, or so I gather. (Rusty 
>> could be full of lies.) They're stored separately from the initial data and 
>> not as metadata in the object headers. So, yes, you can store whatever you 
>> want in secondary indexes and query it however you want, provided there's an 
>> API that supports what you're doing.
>>
>> Would secondary indexes eliminate the need for key-filtering? Logically, it 
>> would seem that you could do the same thing with indexes, but do they have 
>> similar performance characteristics?  (i.e. does one suck more than the other?)
>
> Key filters will always perform a list-keys operation, meaning that they 
> result in an in-memory scan of all keys in the key space.
>
> Not knowing entirely how indexes are implemented internally (reading the 
> source is on my TO DO list), I can only guess from my experience with other 
> databases how this would work. Indexes generally work best when you have a 
> low search cardinality - when you're seeking only a few records from the 
> index. As long as you can structure secondary indexes to answer the questions 
> you're asking, then indexes make it easy to perform fast queries.
>
> The difference comes in based on your storage mechanism. With bitcask, all 
> keys are in memory so that list-keys scan only happens between RAM and CPU 
> and isn't THAT expensive of an operation. If indexes are not a 
> memory-resident structure, then a scan of an index (when you're doing a 
> search that's some kind of substring or ends-with operation) will be 
> painfully slow, much like when you have to perform a table scan in an RDBMS.
>
> The upside of key filtering, and composite key names in general, is that you 
> can create meaningful keys that you can assemble on the fly. For example, to 
> get yesterday's trades of Ford stock on the NYSE (assuming you have a trades 
> bucket), you could get at yesterday's trading history through something like 
> http://my_riak_server:8091/riak/trades/NYSE:F:20110822
> Being able to perform ad hoc seeks like that is really powerful.
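
As a small sketch of the composite-key idea above: the helper names
(trade_key, object_url) are made up for illustration, and the
EXCHANGE:SYMBOL:YYYYMMDD layout and the host/port are simply taken from
the example URL, not from any Riak convention.

```python
from datetime import date


def trade_key(exchange, symbol, day):
    """Assemble a composite key of the form EXCHANGE:SYMBOL:YYYYMMDD."""
    return f"{exchange}:{symbol}:{day:%Y%m%d}"


def object_url(host, bucket, key):
    """Build the URL for fetching one object by bucket and key."""
    return f"http://{host}/riak/{bucket}/{key}"


# Ford trades on the NYSE for 22 Aug 2011, as in the example above.
key = trade_key("NYSE", "F", date(2011, 8, 22))
print(object_url("my_riak_server:8091", "trades", key))
```

Because the key is computed from values the client already knows, the
lookup is a direct fetch and needs no key listing at all.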
>
> TL;DR - key filters and secondary indexes serve different purposes.
>
>>
>> Thanks again,
>> Bill Robertson
>
>
> ---
> Jeremiah Peschka - Founder, Brent Ozar PLF, LLC
> Microsoft SQL Server MVP
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
