Re: Data modelling questions

AM Sun, 22 Feb 2015 18:05:09 -0800

Hi Jason, Christopher.

This is supposed to be append-only, time-limited data. I only intend to save about 2 weeks' worth of data (which is yet another thing I need to figure out, i.e. how to vacate older data).

Re: querying, for the most part the system will be building out hourly reports based on geo, build and location information, so I need a model that allows me to aggregate by timestamp + [each of geo, build, location], or else do the aggregation on the fly during ingestion.
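
To make the "on the fly during ingestion" option concrete, here is a minimal sketch of counting messages per hour and per dimension while reading one batch of log lines. The field names (timestamp as Unix seconds, geo, build, location) and the function names are assumptions for illustration, not the actual log format, and the sketch only counts messages rather than computing any particular report.

```python
import json
from collections import Counter
from datetime import datetime, timezone

# Assumed field names; the real log formats may name these differently.
DIMENSIONS = ("geo", "build", "location")


def hour_bucket(ts):
    """Truncate a Unix timestamp (seconds) to its UTC hour, e.g. '2015-02-22T18'."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H")


def aggregate(lines):
    """Count messages per (hour, dimension name, dimension value)."""
    counts = Counter()
    for line in lines:
        msg = json.loads(line)
        hour = hour_bucket(msg["timestamp"])
        for dim in DIMENSIONS:
            counts[(hour, dim, msg[dim])] += 1
    return counts
```

Each (hour, dimension, value) entry could then be written out as one derived aggregate per hourly report, with the raw messages remaining the source of truth.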

Ingestion is yet another area where I have some flexibility, as it is a batch job: log files get dropped on S3 and we get notified (usually on an hourly basis, some logs on a 10-minute basis). I can massage the data further, but I am concerned that every place where I buffer is another opportunity for losing data, and I would like to avoid reprocessing as much as possible.

Messages will already have the timestamp and msg-id, and I will mostly be interested in aggregates. In some very rare cases I expect to be able to simply run map-reduce jobs for custom queries.


Given that, does my current model look reasonable?

Thanks.
AM


On 2/21/15 6:40 PM, Jason Campbell wrote:

I have the same questions as Christopher.

Does this data need to change, or is it write-once?
What information do you have when querying?
  - Will you already have timestamp and msg-id?
  - If not, you may want to consider aggregating everything into a single key.  
This is easier if the data isn’t changing.
What data will you typically be querying?
  - Will you typically be looking for a single element of data, or aggregates 
(graphing or mapping for example)?
  - If aggregates, what fields are you aggregating on (timestamp, geo, 
location, etc) and which will be fixed?

The aggregate question may need a little more explanation, so I will use an 
example.

I have been working on time-series data with my key being: 
<node-id>:<metric-id>:<timestamp>
Node-id and metric-id are fixed, they will never be merged in an aggregate way, 
and I have them before querying.
Timestamp is my aggregate value, I may need a single timestamp, or hundreds of 
thousands of timestamps (to draw a graph).  For this reason, I grouped my 
metrics by 5 minute block instead of one key per timestamp.  I also created 
aggregates with relevant averages and such for 1 hour, 1 day and 1 month to 
reduce the amount of key lookups for large graphs.
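
As a rough illustration of the 5-minute grouping described above, the sketch below maps a timestamp to the key of its block and lists the keys needed to cover a time range. It assumes Unix timestamps in seconds and that the timestamp part of the key is the start of the block; the function names are hypothetical and this is not necessarily how the keys were actually built.

```python
BLOCK_SECONDS = 300  # 5-minute blocks


def block_start(ts, width=BLOCK_SECONDS):
    """Round a Unix timestamp (seconds) down to the start of its block."""
    return ts - (ts % width)


def block_key(node_id, metric_id, ts, width=BLOCK_SECONDS):
    """Key of the block holding ts, in <node-id>:<metric-id>:<timestamp> form."""
    return f"{node_id}:{metric_id}:{block_start(ts, width)}"


def keys_for_range(node_id, metric_id, start, end, width=BLOCK_SECONDS):
    """All block keys needed to cover the inclusive range [start, end]."""
    first = block_start(start, width)
    return [f"{node_id}:{metric_id}:{t}" for t in range(first, end + 1, width)]
```

With this layout a one-hour graph needs 12 key lookups instead of one per data point, and the same functions with a larger width would name the 1 hour or 1 day aggregates.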

So it depends what visualisations you want.  If you are going to be mapping the 
most recent data based on the geo or location, I would include aggregates for 
that.  If you are more interested in timestamp, group by that.  Because Riak 
doesn’t have multi-key consistency, though, also choose a canonical source of 
data.  If you store the same data in multiple keys, they will diverge at some 
point.  Decide now which is the real source and which are derived; it will 
make your life easier when fixing data later.

Also keep in mind typical periods and data size.  There was no point for me to 
create a 1 minute increment since the 5 minute data was an acceptable size.  
Sure it’s a waste to transmit 4 minutes of data I don’t need, but it’s measured 
in milliseconds (mainly unserialising JSON in my app), so it doesn’t matter to 
me and makes larger aggregates much more performant.

On 22 Feb 2015, at 03:44, Christopher Meiklejohn <cmeiklej...@basho.com> wrote:

On Feb 20, 2015, at 5:35 PM, AM <ams....@gmail.com> wrote:

Hi All.

I am currently looking at using Riak as a data store for time series data. 
Currently we get about 1.5T of data in JSON format that I intend to persist in 
Riak. I am having some difficulty figuring out how to model it such that I can 
fulfill the use cases I have been handed.

The data is provided in several types of log formats with some common fields:

- timestamp
- geo
- s/w build #
- location #

- .... whole bunch of other key value pairs.

For the most part I will need to provide aggregated views based on geo. There 
are some views based on s/w build # and location #. The aggregation will be on 
an hourly basis.

The model that I came up with:

<log-format-type>[<hour>][<timestamp>-<msg-id>]: <json-body>
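
One possible reading of the model above, sketched in code: each message is addressed by its log format type, the UTC hour it falls in, and a timestamp plus msg-id pair. How these three parts map onto Riak buckets and keys is not stated here, so the function below only builds the parts; the function name and the hour format are assumptions.

```python
from datetime import datetime, timezone


def model_key(log_type, ts, msg_id):
    """Return (log-format-type, hour, '<timestamp>-<msg-id>') for one message.

    ts is assumed to be a Unix timestamp in seconds; the hour is rendered
    as YYYYMMDDHH in UTC, which is an illustrative choice only.
    """
    hour = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y%m%d%H")
    return (log_type, hour, f"{ts}-{msg_id}")
```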

Hi AM,

Additionally, it would be great if you could provide additional information on 
how you plan on querying both the original and aggregated values.  Querying is 
usually the most difficult part to get right in Riak, and your query pattern 
will be very important in establishing the best way to lay out this data on 
disk.

- Chris

Christopher Meiklejohn
Senior Software Engineer
Basho Technologies, Inc.
cmeiklej...@basho.com


_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com



