On 15 Jan 2011, at 1:41, bubba postgres wrote:

> I've been googling, but haven't found a good answer to what I should do if I 
> want to store time series in Postgres.
> My current solution is store serialized (compressed) blobs of data.
> (So for example store 1 day worth of 1 minute samples (~1440 samples) stored 
> as one row in a bytea. (Plus meta data)
> It would be nice if I could use 1 sample per column,(because updating 
> individual columns/samples is clear to me) but postgres doesn't compress the 
> row (which is bad because of high amount of repetitive data.. Easily 10X 
> bigger.

Not an easy problem to solve without knowing more about your data, like what 
the nature of this repetition is, how repetitive it is, etc. It also much 
depends on how you intend to use this data later on. Are there gaps in your 
data ranges? How accurate does your data need to be stored? Etc.

For a decision you're probably the only person who'll be close enough to the 
data and how it will be used.

That said, assuming your data has significant _sequential_ repetition (That's 
just how I would phrase it; meaning that a range of values in sequence are the 
same), it may be an idea to just store differences larger than 0 and the length 
of a sample of equal values. If such ranges are small, this would just add 
overhead though.

You could extend that approach by fitting curves to ranges of values with a 
similar tendency and store those curves instead. Retrieving single values from 
those is of course a little harder, plus you probably will get some inaccuracy 
when you can't fit curves exactly.

Hard to say how effective this would be though.

> I've been considering a Double[] array, which would get compressed, but 
> before I start down that path (I suppose I need to make some storedprocs to 
> update individual samples), has anyone built anything like this? Any open 
> source projects I should look at?


This is the safest approach I think. You don't have to rely on any assumptions 
on how your data behaves over time. It's also quite predictable in terms of 
database and storage requirements, you won't run into surprises here. Just the 
usual time-related stuff (DST-changes and such).

One change I'd probably make is to store them like this:
CREATE TABLE sample (
        start   timestamp,
        length  int,
        values  double[]
);

That way you don't need to reserve space for longer gaps. For example, if a day 
ends "early" because data stopped coming in, you can just store a shorter day 
(length < 1440 minutes), or if you missed data at the start of the day you can 
make it start later (once the first sample arrives).

Querying them isn't too tough either, although determining the array index you 
need based on a timestamp is somewhat tricky I just found out.
Finding the record is pretty easy though, that's just:

        WHERE now() BETWEEN start AND start + length * interval '1 minute';

I'd advise putting an index on the latter formula ;)


There is one possible problem I'm seeing with this approach though: Updating 
values requires the whole row to be rewritten (MVCC).
If your arrays are wide enough, then updating those rows will take longer and 
longer because the amount of data that needs to be written each time is 
increasing.
Seeing that your data-samples apparently arrive at 1-minute intervals you'll 
probably be safe, but if you're going to add significant processing around 
updating a row, then you could run into a race where the previous version of 
the record is still being processed.
There are ways around that, but they only complicate matters more and probably 
hurt performance, so that's probably best left for if it's needed at all.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.


!DSPAM:737,4d318b0811872055413410!



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Reply via email to