Re: [GENERAL] Help needed structuring Postgresql correlation query

Alban Hertroys Tue, 21 Jun 2016 05:30:33 -0700

> On 19 Jun 2016, at 10:58, Tim Smith <randomdev4+postg...@gmail.com> wrote:
> 
> Hi,
> 
> My postgresql-fu is not good enough to write a query to achieve this
> (some may well say r is a better suited tool to achieve this !).
> 
> I need to calculate what I would call a correlation window on a time
> series of data, my table looks like this :
> 
> create table data(data_date date,data_measurement numeric);
> insert into data values('2016-01-01',16.23);
> <etc>
> insert into data values('2016-06-19',30.54);
> 
> My "target sample" would be the N most recent samples in the table
> (e.g. 20, the most recent 20 days)
> 
> My "potential sample" would be a moving window of size N (the same
> size N as above), starting at T0 (i.e. 2016-01-01 in this example) and
> incrementing by one (i.e. 2016-01-01 + 20, then 2016-01-02+20 etc),
> but the "target sample" would obviously be excluded.
> 
> The output needs to display window date range (or at least the start
> date of the "potential sample" window) and the result
> corr(target,potential).
> 
> Hope that makes sense


Something like this could do the trick (untested):

        with recursive sample (nr, start_date) as (
                select 1 as nr, data_date as start_date, SUM(data_measurement) 
as total
                from generate_series(0, 19) range(step)
                left join data on (data_date = start_date + range.step)

                union all

                select nr + 1, sample.start_date +1, SUM(data_measurement) as 
total
                from sample
                join generate_series(0, 19) range(step)
                left join data on (data_date = start_date +1 + range.step)
                where start_date +1 +19 <= (select MAX(data_date) from data)
                group by 1, 2
        )
        select * from sample where start_date >= '2016-01-01';

Not sure how best to go about parameterising sample size N, a stored function 
seems like a good option.


Another approach would be to move a (cumulative) window-function with 20 items 
over your data set and for each row subtract the first value of the previous 
window from the total of the current window (that is, assuming you're 
calculating a SUM of data_measurement for each window of 20 records).

Visually that looks something like this for sample size 4:
sample 1: (A + B + C + D)
sample 2: (A + B + C + D) + E - A = (B + C + D + E)
sample 3: (B + C + D + E) + F - B = (C + D + E + F)
etc.

To accomplish this, you calculate two cumulative totals (often misnamed as 
running totals, but AFAIK that's something different), one from the start, and 
one lagging N rows behind (you can use the lag() window function for that) and 
subtract the two.

Good luck!

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Help needed structuring Postgresql correlation query

Reply via email to