Keith, I have an idea that might work for you. It's a bit vague, but I'd be glad to put together a more concrete example if you like.
Use secondary indexes to tag each entry with the device id. You can then find all of the entries for a given device by using the secondary index to feed a simple map-phase operation that returns only the entries you want; i.e., those in a given time range. In addition, to easily find all of the registered device ids, you can create one entry per device. The key can be almost anything (even the device id itself, if you encode it properly -- hash it), and you could tag each of those entries with a secondary index whose field is something like "type" and whose value is "deviceid". The value of each entry could be a simple text/plain body containing the device id of the registered device.

--gordon

On Nov 12, 2011, at 16:19, Keith Irwin wrote:

> Folks--
>
> (Apologies up front for the length of this.)
>
> I'm wondering if you can let me know whether Riak is a good fit for the simple
> not-quite-key-value scenario described below. MongoDB or (say) Postgresql
> seem a more natural fit conceptually, but I really, really like Riak's
> distribution strategy.
>
> ## context
>
> The basic overview is this:
>
> 50K devices push data once a second to web services which need to store that
> data in short-term storage (Riak). Once an hour, a sweeper needs to take an
> hour's worth of data per device (if there is any) and ship it off to long-term
> storage, then delete it from short-term storage. Ideally, there'd only ever be
> slightly more than an hour's worth of data in short-term storage for any given
> device. The goal is to write down the data as simply and safely as possible,
> with little or no processing of that data.
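Gordon's scheme above might be sketched against Riak's HTTP interface like this: each data entry is PUT with two secondary-index headers (the device id as a `_bin` index, the timestamp as an `_int` index), and one small registry entry per device is tagged `type_bin = deviceid` so the set of device ids stays enumerable. The bucket names, key scheme, and `localhost:8098` endpoint are illustrative assumptions, not anything from the thread; this sketch only builds the requests and does not talk to a cluster.

```python
# Sketch (assumptions: bucket names "devicedata"/"deviceids", key scheme
# "<deviceid>-<timestamp>", default Riak HTTP endpoint). Builds request
# URL/headers/body tuples; wiring them to an HTTP client is left out.
import hashlib

RIAK = "http://localhost:8098"  # assumed default Riak HTTP port

def data_entry_request(device_id, timestamp, blob):
    """Build the PUT that stores one second of data, tagged via 2i."""
    key = "%s-%d" % (device_id, timestamp)
    url = "%s/buckets/devicedata/keys/%s" % (RIAK, key)
    headers = {
        "Content-Type": "application/octet-stream",
        # 2i tags: _bin for exact-match queries, _int for range queries
        "x-riak-index-deviceid_bin": device_id,
        "x-riak-index-timestamp_int": str(timestamp),
    }
    return url, headers, blob

def registry_entry_request(device_id):
    """Build the PUT that registers a device id (one entry per device)."""
    key = hashlib.sha1(device_id.encode()).hexdigest()  # hashed id as key
    url = "%s/buckets/deviceids/keys/%s" % (RIAK, key)
    headers = {
        "Content-Type": "text/plain",
        # tag every registry entry so they are all findable with one query
        "x-riak-index-type_bin": "deviceid",
    }
    return url, headers, device_id

url, headers, _ = data_entry_request("dev42", 1321141160, b"\x00" * 2048)
print(url)
print(headers["x-riak-index-deviceid_bin"])
```

Listing all registered device ids is then a single exact-match index query, `GET /buckets/deviceids/index/type_bin/deviceid`, rather than a bucket/key listing.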
>
> Each second's worth of data is:
>
> * A device identifier
> * A timestamp (epoch seconds, integer) for the slice of time the data
>   represents
> * An opaque blob of binary data (2 to 4k)
>
> Once an hour, I'd like to do something like:
>
> * For each device:
>   * Find (and concat) all the data between time1 and time2 (an hour).
>   * Move that data to long-term storage (not Riak) as a single blob.
>   * Delete that data from Riak.
>
> For an SQL db, this is a really simple problem, conceptually. You can have a
> table with three columns: device-id, timestamp, blob. You can index the first
> two columns, roll up the data easily enough, and then delete it via single
> SQL statements (or buffer as needed). The harder part is partitioning,
> replication, etc.
>
> For MongoDB, it's also fairly simple. Just use a document with the same
> device-id, timestamp, and binary-array data (as JSON), make sure the indexes
> are declared, and query/delete just as in SQL. MongoDB provides sharding,
> replica sets, recovery, etc. Setup, while less complicated than for an RDBMS,
> still seems way more complicated than necessary.
>
> These solutions also provide sorting (which, while nice, isn't a requirement
> for my case).
>
> ## question
>
> I've been reading the Riak docs, and I'm just not sure whether this simple
> "queryable" case can really fit all that well. I'm not so concerned about
> having to send 50K deletes to remove the data. I'm more concerned about being
> able to find it. Given what I've written above, I may be blocked conceptually
> by the index/query mentality such that I'm just not seeing the Riak way of
> doing things.
>
> Anyway, I can tag (via the secondary index feature) each blob of data with
> the device-id and the timestamp. I could then do a range query similar to:
>
> GET /buckets/devices/index/timestamp/start/end
>
> However, this doesn't allow me to group based on device-id.
> I could create a separate bucket for every device, such that I could do:
>
> GET /buckets/device-id/index/timestamp/start/end
>
> but if I do this, how can I get a list of the device-ids I need so that I can
> construct that specific URL? The docs say listing buckets and keys is
> problematic.
>
> Might be that Riak just isn't a good fit for this sort of thing, especially
> given that I want to use it for short-term transient data, and that's fine.
> But I wanted to ask you all just to make sure that I'm not missing something
> somewhere.
>
> For instance, might link walking help? How about a map/reduce to find a
> unique list of device-ids within a given time-horizon, and a streaming map
> job to gather the data for export? Does that seem pretty reasonable?
>
> Thanks!
>
> Keith
>
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
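The map/reduce Keith asks about could be fed by the deviceid secondary index, with a JavaScript map phase that filters on a timestamp embedded in each key -- roughly what Gordon suggests. Below is a sketch of the `/mapred` job body. The bucket name (`devicedata`) and the `<deviceid>-<timestamp>` key scheme are illustrative assumptions, not from the thread; the resulting JSON would be POSTed to `/mapred` with `Content-Type: application/json`.

```python
# Sketch: build a Riak MapReduce job that takes all entries for one
# device (via the assumed deviceid_bin index) and keeps only those whose
# key-embedded timestamp falls in [start, end).
import json

def hourly_export_job(device_id, start, end):
    """Return the JSON body for one device's hourly-sweep map job."""
    # JavaScript map phase: parse the timestamp off the end of the key
    # (assumes keys like "<deviceid>-<timestamp>") and filter by range.
    map_js = (
        "function(v) { "
        "var ts = parseInt(v.key.split('-').pop(), 10); "
        "if (ts >= %d && ts < %d) return [v.values[0].data]; "
        "return []; }"
    ) % (start, end)
    return json.dumps({
        # 2i exact-match input: every entry tagged with this device id
        "inputs": {"bucket": "devicedata",
                   "index": "deviceid_bin",
                   "key": device_id},
        "query": [{"map": {"language": "javascript",
                           "source": map_js,
                           "keep": True}}],
    })

body = hourly_export_job("dev42", 1321138800, 1321142400)
print(body)
```

Running one such job per device sidesteps the problematic bucket/key listing: the sweeper first enumerates device ids from the registry entries Gordon describes, then issues this job per device for the hour being exported.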
