sturlamolden wrote:
> robert wrote:
> 
>> Think of such example: A drunken (x,y) 2D walker is supposed to walk along a 
>> diagonal, but he makes frequent and unpredictable pauses/slow motion. You 
>> get x,y coordinates in 1 per second. His speed and time pattern at all do 
>> not matter - you just want to know how well he keeps his track.
> 
> 
> In which case you have time series data, i.e. regular samples from p(t)
> = [ x(t), y(t) ]. Time series have some sort of autocorrelation in the
> samples as well, which must be taken into account. Even though you could
> weight each point by the drunkard's speed, a correlation or linear
> regression would still not make any sense here, as such analyses are
> based on the assumption of no autocorrelation in the samples or the
> residuals. Correlation has no meaning if y[t] is correlated with
> y[t+1], and regression has no meaning if the residual e[t] is
> correlated with the residual e[t+1].
> 
> A state space model could e.g. be applicable. You could estimate the
> path of the drunkard using a Kalman filter to compute a Taylor series
> expansion p(t) = p0 + v*t + 0.5*a*t**2 + ... for the path at each step
> p(t). When you have estimates for the state parameters p0, v, and a, you
> can compute some sort of measure for the drunkard's deviation from his
> ideal path.
> 
> However, if you don't have time series data, you should not treat your
> data as such.
> 
> If you don't know how your data is generated, there is no way to deal
> with them correctly. If the samples are time series they must be
> treated as such; if they are not, they should not be. If the samples are
> i.i.d., each point counts equally; if they are not, they do not. If
> you have clumped data due to time series or lack of i.i.d., you must
> deal with that. However, data can be i.i.d. and clumped, if the
> underlying distribution is clumped. In order to determine the cause,
> you must consider how your data are generated and how your data are
> sampled. You need meta-information about your data to determine this.
> Neither Matlab nor Octave will help you with this, and it is certainly not a
> weakness of NumPy as you implied in your original post. There is no way
> to put magic into any numerical computation. Statistics always require
> formulation of specific assumptions about the data. If you cannot think
> clearly about your data, then that is the problem you must solve.

Yes, in the example of the drunkard time series it's possible to move to a
better model - yet even there it is very expensive (relative to the statistical
improvement) to worry too much about the best model for such a guy :-).
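
Just to make that concrete for myself (my reading of the state-space
suggestion, not necessarily the intended setup): a minimal NumPy sketch of a
constant-velocity Kalman filter with dt = 1 s and hand-picked noise levels,
where the "deviation measure" at the end is simply the distance of the filtered
positions from the ideal diagonal y = x.

import numpy as np

def kalman_cv(zs, q=0.05, r=4.0):
    """Constant-velocity Kalman filter for 2D positions sampled once per second.
    zs : (n, 2) array of measured [x, y]; returns filtered positions (n, 2).
    q and r (process/measurement noise levels) are hand-picked guesses."""
    dt = 1.0
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], float)      # state transition for [x, y, vx, vy]
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], float)       # we only observe the position
    Q = q * np.eye(4)
    R = r * np.eye(2)
    x = np.array([zs[0, 0], zs[0, 1], 0.0, 0.0])
    P = np.eye(4)
    out = []
    for z in zs:
        x = F @ x                             # predict
        P = F @ P @ F.T + Q
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)   # update
        x = x + K @ (z - H @ x)
        P = (np.eye(4) - K @ H) @ P
        out.append(x[:2].copy())
    return np.array(out)

rng = np.random.default_rng(0)
t = np.arange(200.0)
walk = np.c_[t, t] + rng.normal(0.0, 2.0, (200, 2))    # noisy walk along y = x
est = kalman_cv(walk)
dev = np.abs(est[:, 0] - est[:, 1]) / np.sqrt(2.0)     # distance from the diagonal
print("mean deviation from the ideal diagonal:", dev.mean())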

In the field of data mining, with many data tracks and typically digging first
for multiple but smaller correlations - without a practical bottom-up model - I
think one regularly falls back to a certain basic case, maybe the most basic
model for data at all: that basic model is possibly that of a "hunter" who
mostly waits, but acts only when rare goodies are in front of him.
Again, in the most basic case, when you have 2D x,y data in front of you without
a reliable time path or the like, you see this: a density distribution of
points. There is possibly a linear correlation on the largest scale - which is
what you are interested in - but the points also show inhomogeneity/clumping,
and this raises the question of its influence on r_err. What now? One sees
clearly that it is nonsense to do just plain averaging stats.
I think this case is the most basic default for data - even compared to the
common textbook i.i.d. case.  In fact, one can regard this kind of stats, which
respects the mere (inhomogeneous) data density itself, as a (kind of
simple/independent) auto-Bayesian stats vs. dumb averaging.
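
To show roughly what I mean by that, a sketch of the two ingredients -
scipy.stats.gaussian_kde is just one arbitrary choice of density estimator, and
density_weights/weighted_r are my own made-up helper names, not anything that
already exists in numpy/scipy:

import numpy as np
from scipy.stats import gaussian_kde

def density_weights(x, y):
    """Inverse-density weights for 2D samples: a clump of points shares its
    weight instead of dominating the statistic (any local density estimate
    would do instead of gaussian_kde)."""
    xy = np.vstack([x, y])
    dens = gaussian_kde(xy)(xy)            # estimated density at each sample
    w = 1.0 / dens
    return w / w.sum()

def weighted_r(x, y, w):
    """Pearson r with per-point weights; equal weights give back plain r."""
    w = np.asarray(w, float) / np.sum(w)
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    return cov / np.sqrt(np.sum(w * (x - mx) ** 2) * np.sum(w * (y - my) ** 2))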

I think one can almost always use this "Bayesian density weighter/filter" as a
better option than mere averaging stats in that case of x,y correlation, when
there is obviously an interesting correlation but you are too lazy .. or in
principle unable .. to work out a model on the physics level. The latter
requirement is in fact what any averaging stats cries out for at any price -
but how often can you do that in real-world applications ...

(In reality there is no way to eliminate auto-correlation in the composition of
data anyway. Everything and everybody lies :-) )

That's where a top-down (model-free) Bayesian stats approach will pay off: in
the earlier extreme example of criminal data duplication - I'm sure - it will
totally neutralize the attack without question. In the drunkard time-series
example it will tell me very reliably how well this guy keeps his track -
without the need for a complex model. In the case of a good i.i.d. data
distribution it will tell me the same as simple stats. Just good news ...
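
A toy version of the duplication case, to illustrate what I expect from the
density weighting (only a sketch; in this crude form the neutralization comes
out approximate rather than perfect):

import numpy as np
from scipy.stats import gaussian_kde

def wr(x, y, w):
    # weighted Pearson r, same formula as weighted_r above, just compressed
    w = w / w.sum()
    mx, my = np.sum(w * x), np.sum(w * y)
    c = np.sum(w * (x - mx) * (y - my))
    return c / np.sqrt(np.sum(w * (x - mx) ** 2) * np.sum(w * (y - my) ** 2))

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, 200)
y = x + rng.normal(0.0, 1.0, 200)               # honest linear data
x_bad = np.r_[x, np.full(300, 8.0)]             # "criminal" duplication:
y_bad = np.r_[y, np.full(300, 2.0)]             # one off-trend point, 300 copies

xy = np.vstack([x_bad, y_bad])
w_dens = 1.0 / gaussian_kde(xy)(xy)             # inverse-density weights

print("clean r              :", wr(x, y, np.ones_like(x)))
print("duplicated, plain r  :", wr(x_bad, y_bad, np.ones_like(x_bad)))
print("duplicated, density r:", wr(x_bad, y_bad, w_dens))
# the density-weighted r should land much closer to the clean value than the
# plain one does: the 300 copies end up sharing the weight of one small
# neighbourhood instead of dominating the statistic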

Thus I can perhaps state it like this now: I have the problem of estimating the
linear correlation (coefficient with error) on x,y data with respect to the
(most general assumption) "Bayesian" background of an inhomogeneous data
distribution. Therefore I'm seeking a (fast/efficient/approximate) formula for
r/r_err. I guess the formula for r does not change (much) compared to the one
for simple averaging stats, but the formula for r_err will.

Maybe it's already easy with some existing means of numpy/scipy. Maybe not. I'm
far from finding the (efficient) math myself, but I know what I want - and I can
tell whether a formula really does it.
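
For the r_err part, my current best guess is Kish's effective sample size
plugged into the usual Fisher-z error - an assumption on my side, not an exact
result for clumped, density-weighted data:

import numpy as np

def r_err_weighted(r, w):
    """Rough 1-sigma error for a weighted correlation coefficient r.
    Uses Kish's effective sample size  n_eff = (sum w)**2 / sum(w**2)
    in the Fisher-z approximation  se(z) = 1/sqrt(n_eff - 3);
    a first-order guess only."""
    w = np.asarray(w, float)
    n_eff = w.sum() ** 2 / np.sum(w ** 2)    # "how many points the weights are worth"
    dz = 1.0 / np.sqrt(max(n_eff - 3.0, 1.0))
    z = np.arctanh(r)
    return (np.tanh(z + dz) - np.tanh(z - dz)) / 2.0

For equal weights n_eff = n, so this collapses to the standard 1/sqrt(n-3)
error - which at least fits the guess that r itself hardly changes while r_err
does.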


Robert
-- 
http://mail.python.org/mailman/listinfo/python-list
