Hi,

[I'm not sure whether this discussion belongs on numpy-discussion or scipy-dev]

In day-to-day time series analysis I regularly need to look at the data's autocorrelation ("acorr" or "acf", depending on the software package). The most straightforward function available to me is matplotlib.pyplot.acorr. However, for a moderately long time series (say of length 10**5) it takes a huge amount of time just to display the autocorrelation values within a small range of time lags. The main reason is that it relies on np.correlate(x, x, mode=2), while only a few lags are needed.
(I guess mode=2 is an (old-fashioned?) way to set mode='full'.)
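A quick sanity check seems to confirm this guess (as far as I can tell, the integer codes 0/1/2 are legacy aliases for 'valid'/'same'/'full'):

import numpy as np

x = np.random.normal(size=8)
# if mode=2 is indeed a legacy alias for mode='full', these should match:
assert np.allclose(np.correlate(x, x, 2), np.correlate(x, x, mode='full'))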

I know that np.correlate's performance issue has been discussed already, and there is a *somewhat* related ticket (http://projects.scipy.org/numpy/ticket/1260). I noticed in the ticket's second comment the following remark by Josef: "Maybe a truncated convolution/correlation would be good". I'll come back to this soon.

I made an example script, "acf_timing.py", to support my point with some timing data:

In IPython:
>>> run acf_timing.py # imports statsmodels' acf, defines 2 other acf implementations, and creates example data 10**5 samples long

%time l,c = mpl_acf(a, 10)
CPU times: user 8.69 s, sys: 0.00 s, total: 8.69 s
Wall time: 11.18 s # pretty long...

%time c = sm_acf(a, 10)
CPU times: user 8.76 s, sys: 0.01 s, total: 8.78 s
Wall time: 10.79 s # long as well. statsmodels has a similar underlying implementation: http://statsmodels.sourceforge.net/generated/scikits.statsmodels.tsa.stattools.acf.html#scikits.statsmodels.tsa.stattools.acf

# Now, a better option: use the FFT-based convolution
%time c=sm_acf(a, 10,fft=True)
CPU times: user 0.03 s, sys: 0.01 s, total: 0.04 s
Wall time: 0.07 s
# Fast, but I'm not sure about the memory implications of using the FFT, though.
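For reference, here is a minimal numpy-only sketch of the FFT approach (my own take on the classic Wiener-Khinchin recipe, not statsmodels' actual code):

import numpy as np

def fft_acf(x, maxlags=10):
    '''FFT-based autocorrelation, keeping only lags 0..maxlags.'''
    x = x - x.mean()
    N = len(x)
    # zero-pad to at least 2*N so the circular correlation does not wrap
    nfft = 2 ** int(np.ceil(np.log2(2 * N)))
    X = np.fft.rfft(x, nfft)
    c = np.fft.irfft(X * np.conj(X), nfft)[:maxlags + 1]
    c /= c[0]  # normalize so that the lag-0 value is 1
    return np.arange(maxlags + 1), c

The extra memory is essentially the zero-padded array and its spectrum (a few times the size of x), so it should stay reasonable even for 10**5 samples.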

# The naive option: just compute the acf at the lags that are needed
%time l,c = naive_acf(a, 10)
CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
Wall time: 0.01 s
# Iterative computation. Pretty silly but very fast.
# (Of course, this naive implementation won't scale nicely to a large number of lags.)

Now comes (at last) the question: what should be done about this performance issue?
 - should there be a truncated np.convolve/np.correlate function, as Josef suggested? (a rough sketch of what it could look like follows below)
 - or should people in need of autocorrelation find workarounds, because this use case is not big enough to call for a change in np.convolve?
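To make the first option a bit more concrete, here is the kind of truncated helper I have in mind (the name, signature and direct O(N*maxlag) summation are purely my own sketch, nothing official):

import numpy as np

def truncated_correlate(x, y, maxlag):
    '''Hypothetical: cross-correlation restricted to lags |k| <= maxlag.
    Real-valued inputs of equal length assumed, so no conjugation is done.'''
    N = len(x)
    lags = np.arange(-maxlag, maxlag + 1)
    out = np.empty(len(lags))
    for i, lag in enumerate(lags):
        if lag >= 0:
            out[i] = np.dot(x[lag:], y[:N - lag])
        else:
            out[i] = np.dot(x[:N + lag], y[-lag:])
    return lags, out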

I really feel this question is about *where* a change should be implemented (numpy, scipy.signal, matplotlib?) so that it makes sense while not breaking the 10^10 lines of numpy-related code out there...

Best,
Pierre

#!/usr/bin/python
# -*- coding: UTF-8 -*-
""" Autocorrelation timing study
Pierre Haessig — February 2012
"""

from __future__ import division, print_function
import numpy as np
import matplotlib.pyplot as plt
from scikits.statsmodels.tsa.stattools import acf as sm_acf


def mpl_acf(x, maxlags=10):
    '''Matplotlib autocorrelation implementation.
    freely copy-pasted and adapted from https://raw.github.com/matplotlib/matplotlib/master/lib/matplotlib/axes.py
    
    Return value is a tuple (*lags*, *c*) where:
    - *lags* is a length 2*maxlags+1 lag vector
    - *c* is the corresponding 2*maxlags+1 autocorrelation vector
    '''
    Nx = len(x)
    x = x - x.mean() # instead of detrend call

    c = np.correlate(x, x, mode=2) # mode=2 is the (legacy) integer code for mode='full'
    
    # Normalization (for autocorrelation both dot products are the same,
    # so this is just np.dot(x, x)):
    c /= np.sqrt(np.dot(x, x) * np.dot(x, x))

    lags = np.arange(-maxlags,maxlags+1)
    c = c[Nx-1-maxlags:Nx+maxlags]
    return lags, c


def naive_acf(x, maxlags=10):
    '''naive autocorrelation implementation
    It only computes the correlation at the required lags *one by one* 
    (works well for a small number of lags)
    
    Return value is a tuple (*lags*, *c*) where:
    - *lags* is a length 2*maxlags+1 lag vector
    - *c* is the corresponding 2*maxlags+1 autocorrelation vector
    '''
    N = x.size
    c = np.zeros(2*maxlags+1)
    x = x - x.mean() # copy, to avoid modifying the caller's array in place
    var_x = x.var()
    lags = np.arange(-maxlags, maxlags+1)
    for lag in lags:
        if lag<0:
            c[lag+maxlags] = (x[:N+lag] * x[-lag:]).sum()/N/var_x
            # (a complex conjugate would be needed here for complex-valued x)
        elif lag==0:
            c[lag+maxlags] = 1.
        else: # lag > 0
            c[lag+maxlags] = (x[lag:] * x[:N-lag]).sum()/N/var_x
    
    return lags, c

N = 10**5
a = np.random.normal(size=N)
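
# Quick timing harness, to reproduce the IPython %time comparison above
# (plain time.time() here instead of %time):
if __name__ == '__main__':
    import time
    cases = [('mpl_acf', lambda: mpl_acf(a, 10)),
             ('sm_acf', lambda: sm_acf(a, 10)),
             ('sm_acf fft', lambda: sm_acf(a, 10, fft=True)),
             ('naive_acf', lambda: naive_acf(a, 10))]
    for name, fn in cases:
        tic = time.time()
        fn()
        print('%-12s %.3f s' % (name, time.time() - tic))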
