Re: Data modelling questions

Jason Campbell Mon, 23 Feb 2015 13:36:24 -0800

Thanks for the info.

The model looks reasonable, but something I would worry about is the 
availability of the key data.  For example, the timestamps and msg-ids should 
be known without key-listing Riak (which is always a very slow operation).  
There are several options for this: you can maintain your own index (Riak 
CRDT sets work very well for this), use 2i, or use Riak Search.


The other thing I’m worried about is something I’ve run into with my data.  If 
you create a key per message as you have indicated, the object stored under 
each key can be very small, and you end up aggregating thousands of keys for 
any reasonable query.  For pulling large amounts of data out of Riak, try to 
keep object sizes between about 100KB and 1MB.  Riak is still very responsive 
at those sizes, and there 
isn’t much parsing overhead even if you are only interested in one of the 
messages.  For me, that means grouping data into fixed 5 minute blocks.  It 
will obviously vary depending on message size and number of messages, but I 
wouldn’t go with a key per message unless the messages are >10KB.  Grouping by 
timestamp also gives the advantage that any client can know the keys to query 
in advance since they are fixed.  You said a 10 minute range is ideal, so if 
you can manage to group your data into 10 minute keys, that would likely give 
the best performance when querying.
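
As a rough sketch of the fixed-block idea (not code from my system), here is 
how a client could compute the keys for a time range without asking Riak.  The 
key layout, the names block_start, block_key and keys_for_range, and the "access" 
log type are all hypothetical, and timestamps are assumed to be timezone-aware.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 10-minute blocks, matching the range mentioned above.
BLOCK_SECONDS = 600


def block_start(ts: datetime) -> datetime:
    """Round a timezone-aware timestamp down to the start of its block."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % BLOCK_SECONDS, tz=timezone.utc)


def block_key(log_type: str, ts: datetime) -> str:
    """Hypothetical key layout: <log-format-type>:<block start in UTC>."""
    return f"{log_type}:{block_start(ts):%Y%m%dT%H%M}"


def keys_for_range(log_type: str, start: datetime, end: datetime) -> list:
    """Return every block key overlapping [start, end), oldest first."""
    keys = []
    current = block_start(start)
    while current < end:
        keys.append(block_key(log_type, current))
        current += timedelta(seconds=BLOCK_SECONDS)
    return keys


start = datetime(2015, 2, 23, 13, 36, 24, tzinfo=timezone.utc)
end = datetime(2015, 2, 23, 14, 0, 0, tzinfo=timezone.utc)
range_keys = keys_for_range("access", start, end)
```

Each key in range_keys can then be fetched directly, so a query over any range 
costs one GET per block rather than a key listing.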

For grouping data, I would recommend using Riak sets and serialised JSON 
strings.  As long as you don’t have exact duplicate messages, it works very 
well, and allows Riak to resolve conflicts automatically.
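
To illustrate why this works, here is a small in-memory sketch.  It does not 
talk to Riak: plain Python sets stand in for a Riak set, and merging by union 
is a simplification that only holds for add-only writes.  The names serialise 
and merge and the sample messages are made up for the example.

```python
import json


def serialise(message: dict) -> str:
    """Serialise deterministically so equal messages give equal strings."""
    return json.dumps(message, sort_keys=True, separators=(",", ":"))


def merge(*replicas: set) -> set:
    """Stand-in for conflict resolution: add-only sets converge by union."""
    merged = set()
    for replica in replicas:
        merged |= replica
    return merged


# Two writers add messages to the same time block concurrently.
replica_a = {
    serialise({"msg-id": 1, "geo": "us"}),
    serialise({"msg-id": 2, "geo": "eu"}),
}
replica_b = {
    serialise({"geo": "eu", "msg-id": 2}),  # same message, different field order
    serialise({"msg-id": 3, "geo": "us"}),
}

block = merge(replica_a, replica_b)
messages = sorted((json.loads(s) for s in block), key=lambda m: m["msg-id"])
```

The message written by both sides appears once after the merge, which is also 
why exact duplicate messages would be silently collapsed.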

As far as those aggregate metrics (for graphing and alerting), I would 
definitely store those in a separate bucket, and group them by 10 minute 
intervals.  The full data keys should only be used for unplanned queries (Riak 
MR jobs), and anything you know you will need should ideally be generated when 
loading the data initially.
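
A minimal sketch of generating those aggregates while loading, assuming each 
message is a dict with an integer epoch-seconds "timestamp" field and a "geo" 
field (both assumptions about your format).  The function name aggregate_blocks 
is hypothetical, and writing each result to the separate bucket is left out.

```python
from collections import Counter, defaultdict


def aggregate_blocks(messages, field="geo", block_seconds=600):
    """Count messages per value of `field` within each fixed time block.

    Returns {block_start_epoch: {field_value: count}}; each inner dict is
    small enough to store as one object per block.
    """
    blocks = defaultdict(Counter)
    for message in messages:
        ts = message["timestamp"]
        blocks[ts - ts % block_seconds][message[field]] += 1
    return {start: dict(counts) for start, counts in blocks.items()}


sample = [
    {"timestamp": 0, "geo": "us"},
    {"timestamp": 30, "geo": "eu"},
    {"timestamp": 599, "geo": "us"},
    {"timestamp": 600, "geo": "eu"},
]
aggregates = aggregate_blocks(sample)
```

Graphs and alerts would then read only these small per-block objects, leaving 
the full data keys for the unplanned queries.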

Hope this helps, let me know if you have any other questions.

Jason

> On 24 Feb 2015, at 05:24, AM <ams....@gmail.com> wrote:
> 
> On 2/22/15 6:16 PM, Jason Campbell wrote:
>> Coming at this from another angle, if you already have a permanent data 
>> store, and you are only reporting on each hour at a time, can you run the 
>> reports based on the log itself?
>> A lot of Riak’s advantage comes from the stability and availability of data 
>> storage, but S3 is already doing that for you.  Riak can store the data, but 
>> I’m not sure what benefit it serves from my understanding of your problem.
>> 
>> Aggregates are usually quite small (even with more advanced things like 
>> histograms), so it’s relatively easy to parse a log line-by-line and produce 
>> aggregates in-memory for a report.
>> 
>> Can you give a bit more detail on why you are using Riak?
> 
> For the most part yes, we are using EMR at the moment, but some of the 
> reasons I want to go down that road are:
> 
> - We are not quite 'big data' (using the definition that I can process 60 
> mins of my data on an 8 core 16G machine in under 40 mins) and EMR is 
> actually 'slower' for us than just running it locally on a large machine. 
> That brings its own stability and maintenance issues for us. It would be much 
> nicer if the data was stored reliably and in a format that was query-able 
> quickly instead of having to reprocess things.
> 
> - The data is compressed and we actually waste quite a bit of time 
> decompressing it for EMR which is yet another issue if we have to re-process 
> due to single machine durability issues.
> 
> - We want to be able to drive graphs and alerts off of the data whose 
> granularity is most likely going to be of the order of 10 mins. These are 
> just counters on a single time dimension so I am assuming that if I get the 
> model right this will be easy. Yes we can do this via EMR but it also 
> requires additional moving parts that we would have to manage.
> 
> - We have certain BI use cases (as yet not clearly defined) that riak MR 
> would be quite useful and faster for us.
> 
> All in all Riak appears to offer the sweet spot of reliability, data 
> management and querying tools such that all we would have to be concerned 
> about is the actual cluster itself.
> 
> Thanks.
> AM
>> Hope this helps,
>> Jason
>> 
>>> On 23 Feb 2015, at 13:03, AM <ams....@gmail.com> wrote:
>>> 
>>> Hi Jason, Christopher.
>>> 
>>> This is supposed to be append-only, time-limited data. I only intend to 
>>> save about 2 weeks worth of data (which is yet another thing I need to 
>>> figure out, ie how to vacate older data).
>>> 
>>> Re: querying, for the most part the system will be building out hourly 
>>> reports based on geo, build and location information so I need to have a 
>>> model that allows me to aggregate by timestamp + 
>>> [each-of-geo-build-location] or just do it on the fly during ingestion.
>>> 
>>> Ingestion is yet another thing where I have some flexibility as it is a 
>>> batch job, ie log files get dropped on S3 and we get notified (usually on 
>>> an hourly basis, some logs on a 10-min basis) so I can massage it further 
>>> but I am concerned that every place where I buffer is another opportunity 
>>> for losing data and I would like to avoid reprocessing as much as possible.
>>> 
>>> Messages will already have the timestamp and msg-id and I will mostly be 
>>> interested in aggregates. In some very rare cases I expect to be able to 
>>> simply run map-reduce jobs for custom queries.
>>> 
>>> Given that, does my current model look reasonable?
>>> 
>>> Thanks.
>>> AM
>>> 
>>> 
>>> On 2/21/15 6:40 PM, Jason Campbell wrote:
>>>> I have the same questions as Christopher.
>>>> 
>>>> Does this data need to change, or is it write-once?
>>>> What information do you have when querying?
>>>>  - Will you already have timestamp and msg-id?
>>>>  - If not, you may want to consider aggregating everything into a single 
>>>> key.  This is easier if the data isn’t changing.
>>>> What data will you typically be querying?
>>>>  - Will you typically be looking for a single element of data, or 
>>>> aggregates (graphing or mapping for example)?
>>>>  - If aggregates, what fields are you aggregating on (timestamp, geo, 
>>>> location, etc) and which will be fixed?
>>>> 
>>>> The aggregate question may need a little more explanation, so I will use 
>>>> an example.
>>>> 
>>>> I have been working on time-series data with my key being: 
>>>> <node-id>:<metric-id>:<timestamp>
>>>> Node-id and metric-id are fixed, they will never be merged in an aggregate 
>>>> way, and I have them before querying.
>>>> Timestamp is my aggregate value, I may need a single timestamp, or 
>>>> hundreds of thousands of timestamps (to draw a graph).  For this reason, I 
>>>> grouped my metrics by 5 minute block instead of one key per timestamp.  I 
>>>> also created aggregates with relevant averages and such for 1 hour, 1 day 
>>>> and 1 month to reduce the amount of key lookups for large graphs.
>>>> 
>>>> So it depends what visualisations you want.  If you are going to be 
>>>> mapping the most recent data based on the geo or location, I would include 
>>>> aggregates for that.  If you are more interested in timestamp, group by 
>>>> that.  Because Riak doesn’t have multi-key consistency though, also choose 
>>>> a canonical source of data.  If you store the same data in multiple keys, 
>>>> they will diverge at some point.  Decide now which is the real source, and 
>>>> which are derived, it will make your life easier when fixing data later.
>>>> 
>>>> Also keep in mind typical periods and data size.  There was no point for 
>>>> me to create a 1 minute increment since the 5 minute data was an 
>>>> acceptable size.  Sure it’s a waste to transmit 4 minutes of data I don’t 
>>>> need, but it’s measured in milliseconds (mainly unserialising JSON in my 
>>>> app), so it doesn’t matter to me and makes larger aggregates much more 
>>>> performant.
>>>> 
>>>>> On 22 Feb 2015, at 03:44, Christopher Meiklejohn <cmeiklej...@basho.com> 
>>>>> wrote:
>>>>> 
>>>>> 
>>>>>> On Feb 20, 2015, at 5:35 PM, AM <ams....@gmail.com> wrote:
>>>>>> 
>>>>>> Hi All.
>>>>>> 
>>>>>> I am currently looking at using Riak as a data store for time series 
>>>>>> data. Currently we get about 1.5T of data in JSON format that I intend 
>>>>>> to persist in Riak. I am having some difficulty figuring out how to 
>>>>>> model it such that I can fulfill the use cases I have been handed.
>>>>>> 
>>>>>> The data is provided in several types of log formats with some common 
>>>>>> fields:
>>>>>> 
>>>>>> - timestamp
>>>>>> - geo
>>>>>> - s/w build #
>>>>>> - location #
>>>>>> 
>>>>>> - .... whole bunch of other key value pairs.
>>>>>> 
>>>>>> For the most part I will need to provide aggregated views based on geo. 
>>>>>> There are some views based on s/w build # and location #. The 
>>>>>> aggregation will be on an hourly basis.
>>>>>> 
>>>>>> The model that I came up with:
>>>>>> 
>>>>>> <log-format-type>[<hour>][<timestamp>-<msg-id>]: <json-body>
>>>>> Hi AM,
>>>>> 
>>>>> Additionally, it would be great if you could provide additional 
>>>>> information on how you plan on querying both the original and aggregated 
>>>>> values.  Querying is usually the most difficult part to get right in 
>>>>> Riak, and your query pattern will be very important in establishing the 
>>>>> best way to lay out this data on disk.
>>>>> 
>>>>> - Chris
>>>>> 
>>>>> Christopher Meiklejohn
>>>>> Senior Software Engineer
>>>>> Basho Technologies, Inc.
>>>>> cmeiklej...@basho.com
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> riak-users mailing list
>>>>> riak-users@lists.basho.com
>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> 

