On Aug 11, 2011, at 7:49 AM, Paul O wrote:

> However, I'm still a bit in the dark regarding some aspects of storing
> everything in one big database. Are these records stored in an "append
> only" fashion? If they are, then how can the range query be done
> without considering potentially a huge number of records? If they are
> not "append only" then the DB might have to shuffle data around which
> would degrade performance or store an index.

Having one monolithic table could result in a lot of random reads, since data
will likely be written in insertion order. With appropriate indexing, you can
reduce the randomness considerably. In a worst-case scenario, your database will
effectively say "Up yours, I'm not doing this many random reads; it's going to
be cheaper to scan the whole table."
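To make "appropriate indexing" concrete, here's a rough sketch (the table and
column names are made up for illustration) of a composite index that lets a
range query for one source walk the index in order instead of doing a random
read per matching row:

```sql
-- Hypothetical schema: one big table of readings from many sources.
-- The composite index orders entries by (source, time), so a range
-- query on a single source becomes a mostly sequential index scan.
CREATE INDEX idx_readings_source_ts
    ON readings (source_id, recorded_at);

-- The kind of range query the index serves:
SELECT recorded_at, value
FROM readings
WHERE source_id = 42
  AND recorded_at BETWEEN '2011-01-01' AND '2011-02-01';
```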

> 
> Anyway, I'll read some more on this, it's probably just my ignorance
> regarding how those databases work, but if the DBs can be append-only
> then I need one DB per data source and even then I have to contend
> with data insertions in the past (a requirement which I hope I made
> explicit in previous emails.) And if I have to use multiple DBs are
> these solutions able to juggle multiple DBs, open/close them quickly,
> etc.?

This depends a lot on your database back end. Neither Oracle nor PostgreSQL 
imposes a physical insertion order other than the order the rows come in - they 
use append-only heap structures on disk. PostgreSQL compounds this by storing 
multiple versions of a row in the main table (e.g. when a row is modified).

SQL Server and MySQL use index-ordered tables, and Oracle can be told to do this 
on a table-by-table basis. An index-ordered table is often referred to as a 
table with a clustered index (Oracle calls it an index-organized table). Ciprian 
mentioned this a few times in his benchmark.
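As a sketch of what that looks like in practice (table and column names are
hypothetical):

```sql
-- SQL Server: the clustered index defines the physical order of the table,
-- so rows for one source over a time range are stored contiguously.
CREATE CLUSTERED INDEX cix_readings
    ON readings (source_id, recorded_at);

-- Oracle equivalent: an index-organized table, which requires that the
-- ordering key be the primary key.
CREATE TABLE readings (
    source_id   NUMBER,
    recorded_at TIMESTAMP,
    value       NUMBER,
    CONSTRAINT pk_readings PRIMARY KEY (source_id, recorded_at)
) ORGANIZATION INDEX;
```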

A good solution to your problem would be to create a single table per data 
source and instruct the database to leave a bit of extra space on each data 
page (by setting a fill factor of around 80%). That gives you some room for 
inserts of historical records, and you'd want to adjust your fill factor as 
real-world usage evolves. By putting each data source in a separate physical table, 
you can do a lot of interesting things with storage techniques and whatnot, but 
you also get the advantage of only having to perform reads on a single table at 
a time to get data from any given source. With a monolithic table, you can run 
into all kinds of problems with data statistics and accurately predicting the 
kind of data access you need, especially as your database grows in size.
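The fill factor setting above might look like this (PostgreSQL and SQL Server
syntax; the per-source table name is made up):

```sql
-- PostgreSQL: leave ~20% of each heap page free for out-of-order inserts.
CREATE TABLE readings_source_42 (
    recorded_at TIMESTAMP NOT NULL,
    value       DOUBLE PRECISION
) WITH (fillfactor = 80);

-- SQL Server: set the fill factor on the clustered index instead.
CREATE CLUSTERED INDEX cix_readings_source_42
    ON readings_source_42 (recorded_at)
    WITH (FILLFACTOR = 80);
```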

If you need to perform analytical queries across all of your data, you can 
always build views to help you out there, too.
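For example, a view that unions the per-source tables back into one logical
table (table names hypothetical; with many sources you'd generate this DDL):

```sql
CREATE VIEW all_readings AS
      SELECT 1 AS source_id, recorded_at, value FROM readings_source_1
UNION ALL
      SELECT 2, recorded_at, value FROM readings_source_2
UNION ALL
      SELECT 3, recorded_at, value FROM readings_source_3;
```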

All of these features vary slightly from vendor to vendor, and some of them 
even have specific features to make this kind of design easier to implement and 
administer. This is why Ciprian and I were talking about an RDBMS solution 
requiring domain specific knowledge of a single product/vendor. The advantage 
of something like Riak is that you stand up some nodes and you're going to get 
relatively predictable performance. 

I hate to scare you off from any solution, but Ciprian has experience building 
this on top of Riak core and I have experience building this solution on top of 
an RDBMS.

> 
> I hope I'm not too off-list-topic with this and hope at least some
> other people find the discussion useful.

I find data storage systems to be fascinating :)

> 
> Regards,
> 
> Paul
> 
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

---
Jeremiah Peschka - Founder, Brent Ozar PLF, LLC
 Microsoft SQL Server MVP


