Christian and Jeremiah, thanks for your responses. See my comments inline below.
> Date: Fri, 21 Sep 2012 10:46:36 +0100
> From: Christian Dahlqvist <christ...@whitenode.com>
> ....
> account <https://github.com/whitenode/riak_mapreduce_utils> and I also
> submitted them to Basho Contrib yesterday for review.

Thanks, these functions look interesting and could be exactly what I need.

> ...
> you seem to have a quite deep hierarchy, maybe a mixture of links (for
> records and relationships not changing often) and 2i "links" may work?

You're right. The top levels don't change much, and the resulting multiple writes from an insert are no problem. It's only the leaves of the tree, and perhaps one level higher, that would benefit from using 2i instead of links.

> Date: Fri, 21 Sep 2012 08:20:05 -0700
> From: Jeremiah Peschka <jeremiah.pesc...@gmail.com>
> ....
> Another consideration of links is that links will be returned in the HTTP
> headers. If you have too many links, then you can blow away the max size
> limits on an HTTP header and bad things will happen.

Thanks for the warning. Note that I have tested creating more than 1,000 links using both HTTP and Protobuf, and things still appeared to work.

> ...
> What's the big picture problem that you're trying to solve? Are you trying to
> determine the fastest way to traverse a rigid hierarchy in your data?

There are several things. I'm primarily using Riak to store sensor data, adding a few GB every day. One thing I want to make sure of is that I can traverse all data; since key listing was clearly a no-go (especially before 2i existed), I keep a hierarchy so I can get to all keys/all data. This allows me to run batch jobs to gather statistics or otherwise work with the data, e.g. shift stale/"cold" data to another storage tier, basically implementing a multi-tier storage system. If I do move the data, I still keep a URI and encryption key in Riak but move the data itself to S3 (by the way: Amazon Glacier looks interesting...).
I also run jobs that service the devices; these run continuously in the background.

Another thing is that I want operators to easily drill down to resolve problems. Looking at the "USA page", they would see each state, and for each state it would list the number of "Operational" devices and "Offline" devices ("GROUP BY device_status"), as well as the longest time since last contact ("MIN(device_last_contact)"). E.g. something like this:

State1   98 Operational, 2 Offline, 3 minutes since last contact.
State2  100 Operational, 4 hours since last contact.

Clicking on State1 would drill down/zoom in, and a similar page would display for that level, et cetera. Obviously these queries should be fast, since the user is waiting for a response. When a user drills down to the device level, other information would appear, such as how much data is stored, with the ability to drill down to individual dates and records and to run various reports.

> ....
> If that's the case, why not store the intermediate data as secondary indexes
> on the device itself? Then you can simply run a query to determine which
> devices are in the US rather than walk across multiple buckets. With
> sufficient secondary indexes at your intermediate levels, you should be able
> to easily recompute your various roll ups for reporting as the underlying
> data changes and still get quick reporting without having to traverse the
> existing buckets.

I'll do some testing keeping lots of indexes and see how that works. This may well be the best way forward, especially if we were starting with a clean sheet today. I was hoping for an easy transition from links to 2i (and maybe with the MR functions from Christian Dahlqvist that is possible).

Thanks,
Timo

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
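For the curious: the per-state rollup itself (GROUP BY device_status plus the longest time since last contact) is cheap once the device records for a level have been fetched; the hard part is fetching them quickly. A small Python sketch over plain dicts (the `device_status` and `device_last_contact` field names mirror the pseudo-SQL above; the record layout is otherwise my own illustration):

```python
# Sketch of the drill-down rollup for one level: count devices by status
# and find the longest elapsed time since last contact
# (i.e. MIN(device_last_contact) expressed as an age).
from collections import Counter
from datetime import datetime, timedelta

def rollup(devices, now):
    """Summarize one drill-down level from its fetched device records."""
    by_status = Counter(d["device_status"] for d in devices)
    stalest = max(now - d["device_last_contact"] for d in devices)
    return by_status, stalest

now = datetime(2012, 9, 21, 12, 0, 0)
state1 = [
    {"device_status": "Operational",
     "device_last_contact": now - timedelta(minutes=1)},
    {"device_status": "Operational",
     "device_last_contact": now - timedelta(minutes=3)},
    {"device_status": "Offline",
     "device_last_contact": now - timedelta(hours=4)},
]

counts, stalest = rollup(state1, now)
print(counts["Operational"], counts["Offline"], stalest)
# -> 2 1 4:00:00
```

Since operators are waiting on these pages, I would precompute such summaries per level in a background job rather than aggregate on every page view.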