Hi,

Instead of updating a summary object on every insert, you could also precompute
by periodically aggregating data for a specific time period. If you are using
LevelDB and e.g. have a secondary index that contains a timestamp, you could
create an hourly summary through a batch job that runs every hour. You could
then also roll these hourly summary objects up into larger daily objects once a
day. This allows you to spread out the computation effort and gives you
different levels of granularity for analysis.
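
As a rough sketch of what such an hourly batch job might look like with the
Python client (the bucket names, the 'created_int' index and the event
structure are my own assumptions here, not anything prescribed by Riak):

    import time
    import riak

    client = riak.RiakClient(pb_port=8087, protocol='pbc')
    events = client.bucket('events')
    summaries = client.bucket('hourly_summaries')

    def summarise_hour(vendor_id, hour_start):
        """Aggregate one hour of events for a vendor into a single summary object."""
        hour_end = hour_start + 3599
        # Range query on an assumed integer secondary index holding the event timestamp
        keys = events.get_index('created_int', hour_start, hour_end)
        total = 0
        count = 0
        for key in keys:
            event = events.get(key).data  # assumed JSON, e.g. {'vendor': 'vendor42', 'amount': 25}
            if event and event.get('vendor') == vendor_id:
                total += event.get('amount', 0)
                count += 1
        # One summary object per vendor and hour, e.g. vendor42_2013112214
        summary_key = '%s_%s' % (vendor_id, time.strftime('%Y%m%d%H', time.gmtime(hour_start)))
        summaries.new(summary_key, data={'total': total, 'count': count}).store()

A second job could then read the 24 hourly objects for a day and write a single
daily summary in the same way.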

While in a relational database you usually store each event as a separate row
in a table, storing each event as a separate object in Riak is not always the
optimal approach. When modelling data in Riak it is important to consider how
you will need to access and query the data. As Riak is at its core a key-value
store, and accessing data directly by key is the most efficient and scalable
query method, you ideally want the vast majority of querying to be done
directly by key. To achieve this it is often recommended to use semantic keys
that describe the data instead of automatically generated UUIDs.
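
To illustrate (purely as a sketch with a made-up bucket and key), a semantic
key means the application can compute the key from information it already has
and fetch the object with a single GET, with no index lookup in between:

    import riak

    client = riak.RiakClient(pb_port=8087, protocol='pbc')
    bucket = client.bucket('vendor_events')

    # The key is derived from data the application already knows,
    # so no secondary index or MapReduce query is needed to find it.
    key = 'vendor42_20131122'
    obj = bucket.get(key)
    if obj.exists:
        print(obj.data)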

As retrieving a large number of individual keys is considerably less efficient 
than retrieving a single larger object containing the same data, it often makes 
sense to de-normalise data in Riak. In your case you might e.g. consider 
creating objects that each contain the events of a specific type for a vendor 
over a set time period. These parameters, which identify the data held in the 
object, should be used to create the key. This could take the form <vendor 
id>_<event type>_<date/time in YYYYMMDDHH24 form>. Instead of inserting each 
new event as a separate record in the database, you would update the 
appropriate object. Although this results in a bit more work up front, it 
allows you to retrieve data much more efficiently.
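
A minimal sketch of that write path, again with the Python client and an event
format I have made up for illustration, could look roughly like this:

    import time
    import riak

    client = riak.RiakClient(pb_port=8087, protocol='pbc')
    bucket = client.bucket('vendor_events')

    def record_event(vendor_id, event_type, amount):
        """Append an event to the object covering the current hour
        instead of writing it under its own key."""
        hour = time.strftime('%Y%m%d%H', time.gmtime())
        key = '%s_%s_%s' % (vendor_id, event_type, hour)  # <vendor id>_<event type>_<YYYYMMDDHH24>

        obj = bucket.get(key)
        if not obj.exists:
            obj.data = {'events': [], 'total': 0}
            obj.content_type = 'application/json'

        obj.data['events'].append({'amount': amount, 'ts': int(time.time())})
        obj.data['total'] += amount
        obj.store()

    record_event('vendor42', 'sale', 25)

Bear in mind that this read-modify-write pattern can produce siblings if
several writers hit the same key concurrently, so you would want to resolve
conflicts in your application or serialise writers per key.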

Examples of how this can be done in real scenarios can be found in these 
presentations:

Temetra:  http://basho.com/riak-at-temetra/
Boundary: http://boundary.com/blog/2012/08/21/boundary-techtalk-large-scale-olap-with-kobayashi/

I hope this gives you a better idea of how data can be modelled in Riak and how 
it differs from working with an RDBMS. If you can share more details about your 
use case, your data and how you need to access and query it, we would be happy 
to bounce ideas around and help you find a solution that works for you.

Best regards,

Christian




On 22 Nov 2013, at 22:34, Hector Castro <[email protected]> wrote:

> On Thu, Nov 21, 2013 at 4:47 PM, NC <[email protected]> wrote:
>> Our use-case is very similar to what Chris has described till now. I am new
>> to the riak store and have a background with RDBMS.
>> 
>> Going over this thread, there was a suggestion to pre-compute things. I am
>> trying to understand what pre-compute exactly means. Does it mean using pre
>> or post commit hooks to perform aggregation as different events enter our
>> system? Or does it mean running map reduce jobs in the background to
>> precompute the aggregations?
>> 
>> A brief background on our use-case. We have vendors in our system that get
>> millions of events every week. Every two weeks, we sum the amount on all the
>> events for the vendor to generate an invoice. Querying millions of events
>> for the vendor using secondary indices or key filters doesn't seem feasible
>> in riak. I am wondering if we can use post-commit hooks so that as events
>> enter our system, we maintain a real-time account for the vendor, adding and
>> subtracting things on the go. When the time comes to create an invoice, we
>> just look at the account to find the amount to pay to the vendor.
> 
> You're on the right track. Avoiding the commit hooks might work better though.
> 
> Precomputing in the context of your use case could look something like:
> 
> At write time, store the event data in a key/value pair, but also
> create another key that has an invoice sum for a specific vendor along
> with a date:
> 
> INVOICE_SUM:VENDOR_NAME:DATE
> 
> At the end of two weeks, you can get all of the keys that make up two
> weeks for a specific vendor and sum them up in your client-side code:
> 
> INVOICE_SUM:VENDORX:20131122 = 5
> INVOICE_SUM:VENDORX:20131121 = 10
> INVOICE_SUM:VENDORX:20131120 = 54
> 
> Then do the same for the next vendor. If it makes sense to roll up the
> sums at something larger than a day, you can do that too.
> 
>> My questions are: can we even use post-commit hook in that manner where we
>> insert / update multiple records? Is there a different way to design such a
>> schema that I am missing?
>> 
>> Thanks.
>> 
>> 
>> 

