Hi,
[I'm not sure whether this discussion belongs on numpy-discussion or
scipy-dev]
In day-to-day time series analysis I regularly need to look at the data
autocorrelation ("acorr" or "acf", depending on the software package).
The most readily available function I have is matplotlib.pyplot.acorr.
However, for a moderately long time series (say of length 10**5) it
takes a huge amount of time just to display the autocorrelation values
within a small range of time lags.
The main reason is that it relies on np.correlate(x, x, mode=2) even
though only a few lags are needed.
(mode=2 is indeed just the old integer spelling of mode='full'.)
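A quick interactive check confirms that equivalence (my own illustration,
with arbitrary toy data):
>>> import numpy as np
>>> x = np.array([1., 2., 3., 4.])
>>> np.array_equal(np.correlate(x, x, mode=2), np.correlate(x, x, mode='full'))
True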
I know that np.correlate's performance has been discussed before, and
there is a *somewhat* related ticket
(http://projects.scipy.org/numpy/ticket/1260). I noticed in the ticket's
change number 2 the following comment by Josef: "Maybe a truncated
convolution/correlation would be good". I'll come back to this soon.
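Before that, to make the suggestion concrete, here is a minimal sketch of
the *semantics* such a truncated correlation could have (the maxlag
argument is hypothetical; nothing like it exists in numpy today). It just
slices the central lags out of the full result, so it specifies the
contract rather than a fast implementation:

import numpy as np

def truncated_correlate(x, y, maxlag):
    """Hypothetical truncated correlation: the 2*maxlag+1 central lags of
    np.correlate(x, y, mode='full'), assuming len(x) == len(y) == n."""
    n = len(x)
    full = np.correlate(x, y, mode='full')    # O(n**2) reference computation
    return full[n - 1 - maxlag : n + maxlag]  # lags -maxlag .. +maxlag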
I made an example script, "acf_timing.py", to support my point with some
timing data:
In IPython:
>>> run acf_timing.py # imports statsmodels' acf, defines two other
acf implementations, and creates example data 10**5 samples long
%time l,c = mpl_acf(a, 10)
CPU times: user 8.69 s, sys: 0.00 s, total: 8.69 s
Wall time: 11.18 s # pretty long...
%time c = sm_acf(a, 10)
CPU times: user 8.76 s, sys: 0.01 s, total: 8.78 s
Wall time: 10.79 s # long as well. statsmodels has a similar underlying
implementation
# http://statsmodels.sourceforge.net/generated/scikits.statsmodels.tsa.stattools.acf.html#scikits.statsmodels.tsa.stattools.acf
# Now, a better option: use the FFT convolution
%time c=sm_acf(a, 10,fft=True)
CPU times: user 0.03 s, sys: 0.01 s, total: 0.04 s
Wall time: 0.07 s
# Fast, but I'm not sure about the memory implications of using fft,
# though (see the sketch after these timings).
# The naive option: just compute the acf at the lags that are needed
%time l,c = naive_acf(a, 10)
CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
Wall time: 0.01 s
# Iterative computation. Pretty silly but very fast
# (Of course, this naive implementation won't scale nicely to a large
# number of lags; a vectorized variant is sketched at the end of the
# attached script.)
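For reference, here is a minimal sketch of how an FFT-based acf works (my
own illustration, not statsmodels' actual code), which also shows where
the extra memory goes: the input is zero-padded to at least 2*N so that
the circular correlation computed via the FFT matches the linear one.

import numpy as np

def fft_acf(x, maxlags=10):
    """FFT-based autocorrelation sketch (nonnegative lags only; the acf
    of real data is symmetric anyway)."""
    N = x.size
    x = x - x.mean()
    nfft = 2 ** int(np.ceil(np.log2(2 * N)))  # next power of two >= 2*N
    X = np.fft.rfft(x, n=nfft)                # zero-pads x up to nfft samples
    c = np.fft.irfft(X * np.conj(X), n=nfft)[:maxlags + 1]
    c /= N * x.var()                          # biased normalization, c[0] -> 1
    return np.arange(maxlags + 1), c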
Now comes (at last) the question: what should be done about this
performance issue?
- should there be a truncated np.convolve/np.correlate function, as
Josef suggested?
- or should people in need of autocorrelation find their own workarounds,
because this use case is not big enough to call for a change in np.convolve?
I really feel this question is about *where* a change should be
implemented (numpy, scipy.signal, matplotlib?) so that it makes sense
while not breaking 10^10 lines of numpy-related code...
Best,
Pierre
#!/usr/bin/python
# -*- coding: UTF-8 -*-
""" Autocorrelation timing study
Pierre Haessig - February 2012
"""
from __future__ import division, print_function
import numpy as np
import matplotlib.pyplot as plt
from scikits.statsmodels.tsa.stattools import acf as sm_acf
def mpl_acf(x, maxlags=10):
    '''Matplotlib autocorrelation implementation,
    freely copy-pasted and adapted from
    https://raw.github.com/matplotlib/matplotlib/master/lib/matplotlib/axes.py

    Return value is a tuple (*lags*, *c*) where:
    - *lags* is a length-2*maxlags+1 lag vector
    - *c* is the 2*maxlags+1 autocorrelation vector
    '''
    Nx = len(x)
    x = x - x.mean()  # instead of a detrend call
    c = np.correlate(x, x, mode=2)  # mode=2 is the integer code for mode='full'
    # Normalization (both factors are x here, so this is just np.dot(x, x)):
    c /= np.sqrt(np.dot(x, x) * np.dot(x, x))
    lags = np.arange(-maxlags, maxlags + 1)
    c = c[Nx - 1 - maxlags:Nx + maxlags]  # keep only the central 2*maxlags+1 lags
    return lags, c
def naive_acf(x, maxlags=10):
    '''Naive autocorrelation implementation:
    computes the correlation at the required lags only, *one by one*
    (works well for a small number of lags).

    Return value is a tuple (*lags*, *c*) where:
    - *lags* is a length-2*maxlags+1 lag vector
    - *c* is the 2*maxlags+1 autocorrelation vector
    '''
    N = x.size
    c = np.zeros(2*maxlags + 1)
    x = x - x.mean()  # copy, so the caller's array is not modified in place
    var_x = x.var()
    lags = np.arange(-maxlags, maxlags + 1)
    for lag in lags:
        if lag < 0:
            c[lag + maxlags] = (x[:N + lag] * x[-lag:]).sum() / N / var_x
            # (for complex data, a complex conjugate would be needed here)
        elif lag == 0:
            c[lag + maxlags] = 1.
        else:  # lag > 0
            c[lag + maxlags] = (x[lag:] * x[:N - lag]).sum() / N / var_x
    return lags, c
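# --- added sketch (not in the original script) ---------------------------
# A vectorized alternative to naive_acf: trimming maxlags samples off each
# end of the second argument makes np.correlate(..., mode='valid') return
# exactly 2*maxlags+1 values, i.e. the lags -maxlags..+maxlags. The cost is
# still O(N*maxlags), but the loop over lags runs in C instead of Python.
def valid_acf(x, maxlags=10):
    N = x.size
    x = x - x.mean()
    c = np.correlate(x, x[maxlags:N - maxlags], mode='valid')
    # every output value sums N - 2*maxlags products, hence this
    # normalization (it differs from the biased estimators above by
    # a relative amount of order maxlags/N)
    c /= (N - 2*maxlags) * x.var()
    lags = np.arange(-maxlags, maxlags + 1)
    return lags, c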
N = 10**5
a = np.random.normal(size=N)
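# Quick consistency check (also an addition, not in the original script):
# the matplotlib-style and naive implementations should agree on all lags.
if __name__ == '__main__':
    lags, c_mpl = mpl_acf(a, 10)
    _, c_naive = naive_acf(a, 10)
    print("mpl_acf vs naive_acf agree:", np.allclose(c_mpl, c_naive))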