Christian and Jeremiah, thanks for your responses. See my comments inline below.

> Date: Fri, 21 Sep 2012 10:46:36 +0100
> From: Christian Dahlqvist <christ...@whitenode.com>
> ....
> account <https://github.com/whitenode/riak_mapreduce_utils> and I also
> submitted them to Basho Contrib yesterday for review.

Thanks, these functions look interesting and could be exactly what I need.

> ...
> you seem to have a quite deep hierarchy, maybe a mixture of links (for
> records and relationships not changing often) and 2i "links" may work?

You're right. The top levels aren't changing that much, and the multiple
writes resulting from an insert are no problem there. It's only the "leaves
of the tree" and maybe one level higher that would benefit from using 2i
instead of links.

> Date: Fri, 21 Sep 2012 08:20:05 -0700
> From: Jeremiah Peschka <jeremiah.pesc...@gmail.com>
> ....
> Another consideration of links is that links will be returned in the HTTP 
> headers. If you have too many links, then you can blow away the max size 
> limits on an HTTP header and bad things will happen.

Thanks for the warning. Note that I have tested creating more than 1,000
links using both HTTP and Protobuf and things still appeared to work.

> ...
> What's the big picture problem that you're trying to solve? Are you trying to
> determine the fastest way to traverse a rigid hierarchy in your data?

There are several things. I'm primarily using Riak to store sensor data,
adding a few GB every day. One thing I want to make sure of is that I can
traverse all the data. Since key listing was clearly a no-go (especially
before 2i existed), I keep a hierarchy so I can get to all keys and all data.
This allows me to run batch jobs to gather statistics or otherwise work with
the data, e.g. shift stale/"cold" data to other storage, basically
implementing a multi-tier storage system. If I do move the data, I still keep
a URI and encryption key in Riak but move the data itself to S3 (by the way:
Amazon Glacier looks interesting...). I also run jobs that service the
devices; these run continuously in the background.
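
A minimal sketch of the tiering step described above, for illustration only.
The record fields (`key`, `payload`, `last_access`, `encryption_key`), the
`tier_record` helper, the 90-day threshold, and the `fake_upload` stand-in for
an S3 upload are all hypothetical; the payload is assumed to be encrypted
already.

```python
from datetime import datetime, timedelta

def tier_record(record, now, max_age, upload):
    """Return the record unchanged if it is still warm; otherwise upload the
    payload and return a stub holding only the URI and encryption key."""
    if now - record["last_access"] < max_age:
        return record
    uri = upload(record["key"], record["payload"])
    return {
        "key": record["key"],
        "last_access": record["last_access"],
        "cold_uri": uri,                              # where the data now lives
        "encryption_key": record["encryption_key"],   # still kept in Riak
    }

def fake_upload(key, payload):
    # Hypothetical stand-in for a real S3 upload; only builds a URI.
    return f"s3://example-archive/{key}"

now = datetime(2012, 9, 21)
cold = {
    "key": "dev42/2012-01-01",
    "payload": b"...",
    "last_access": now - timedelta(days=200),
    "encryption_key": "k1",
}
stub = tier_record(cold, now, timedelta(days=90), fake_upload)
```

The stub is what would be written back to Riak in place of the full record.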

Another thing is that I want operators to be able to drill down easily to
resolve problems. So looking at the "USA page" they would see each state, and
for each state it would list the number of "Operational" devices and "Offline"
devices ("GROUP BY device_status"), as well as the longest time since last
contact ("MIN(device_last_contact)"). E.g. something like this:

State1 98 Operational, 2 Offline, 3 minutes since last contact.
State2 100 Operational, 4 hours since last contact.

Clicking on State1 would drill down/zoom in and a similar page would display
for this level, and so on. Obviously these queries should be fast since the
user is waiting for a response. When a user drills down to the device level,
other information would appear, such as how much data is stored, and they
could drill down to individual dates and records or run various reports.
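
A minimal sketch of the per-state roll-up described above, computed in plain
Python over in-memory records. The field names (`state`, `status`,
`last_contact`) and the `roll_up` helper are assumptions made for this
example, not an actual schema or a Riak API.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

def roll_up(devices, now):
    """Group devices by state, count them per status, and report the longest
    time since last contact within each state."""
    counts = defaultdict(Counter)
    oldest = {}
    for d in devices:
        state = d["state"]
        counts[state][d["status"]] += 1   # GROUP BY device_status
        # MIN(device_last_contact): the earliest timestamp is the longest silence
        if state not in oldest or d["last_contact"] < oldest[state]:
            oldest[state] = d["last_contact"]
    return {
        state: {
            "counts": dict(counts[state]),
            "longest_silence": now - oldest[state],
        }
        for state in counts
    }

now = datetime(2012, 9, 21, 12, 0, 0)
devices = [
    {"state": "State1", "status": "Operational",
     "last_contact": now - timedelta(minutes=1)},
    {"state": "State1", "status": "Offline",
     "last_contact": now - timedelta(minutes=3)},
    {"state": "State2", "status": "Operational",
     "last_contact": now - timedelta(hours=4)},
]
summary = roll_up(devices, now)
```

The same function could be applied at each level of the hierarchy by grouping
on a different field.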

> ....
> If that's the case, why not store the intermediate data as secondary indexes 
> on the device itself? Then you can simply run a query to determine which 
> devices are in the US rather than walk across multiple buckets. With 
> sufficient secondary indexes at your intermediate levels, you should be able 
> to easily recompute your various roll ups for reporting as the underlying 
> data changes and still get quick reporting without having to traverse the 
> existing buckets. 

I'll do some testing keeping lots of indexes and see how that works. And maybe 
this is the best way forward, especially if we were to start with a clean sheet 
today. I was hoping for an easy transition to using 2i instead of links (and 
maybe with the MR functions from Christian Dahlqvist that is possible).
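
For reference, a rough sketch of what tagging a device with secondary indexes
and querying one of them looks like over Riak's HTTP interface. It only builds
the request headers and URL and sends nothing. The bucket name `devices`, the
index names, and the node address are assumptions for illustration.

```python
from urllib.parse import quote

RIAK = "http://localhost:8098"  # assumed local node on Riak's default HTTP port

def index_headers(indexes):
    """Build the x-riak-index-* headers sent along with a PUT of the object.
    All values are treated as binary (string) indexes, hence the _bin suffix."""
    return {f"x-riak-index-{name}_bin": value for name, value in indexes.items()}

def index_query_url(bucket, index, value):
    """Build the URL for an exact-match 2i query on a binary index."""
    return (f"{RIAK}/buckets/{quote(bucket, safe='')}"
            f"/index/{quote(index, safe='')}_bin/{quote(value, safe='')}")

# Hypothetical index names mirroring the levels of the hierarchy.
headers = index_headers({"country": "US", "state": "State1", "status": "Offline"})
url = index_query_url("devices", "state", "State1")
```

A GET on that URL returns the matching keys, which could then be fed into a
MapReduce job.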

Thanks,
Timo

