sturlamolden wrote:
> robert wrote:
>
>> Think of such example: A drunken (x,y) 2D walker is supposed to walk
>> along a diagonal, but he makes frequent and unpredictable pauses/slow
>> motion. You get x,y coordinates at 1 per second. His speed and time
>> pattern do not matter at all - you just want to know how well he keeps
>> to his track.
>
> In which case you have time series data, i.e. regular samples from
> p(t) = [x(t), y(t)]. Time series have some sort of autocorrelation in
> the samples as well, which must be taken into account. Even though you
> could weight each point by the drunkard's speed, a correlation or
> linear regression would still not make any sense here, as such analyses
> are based on the assumption of no autocorrelation in the samples or the
> residuals. Correlation has no meaning if y[t] is correlated with
> y[t+1], and regression has no meaning if the residual e[t] is
> correlated with the residual e[t+1].
>
> A state space model could e.g. be applicable. You could estimate the
> path of the drunkard using a Kalman filter to compute a Taylor series
> expansion p(t) = p0 + v*t + 0.5*a*t**2 + ... for the path at each step
> p(t). When you have estimates for the state parameters p0, v, and a,
> you can compute some sort of measure for the drunkard's deviation from
> his ideal path.
>
> However, if you don't have time series data, you should not treat your
> data as such.
>
> If you don't know how your data are generated, there is no way to deal
> with them correctly. If the samples are time series, they must be
> treated as such; if they are not, they should not be. If the samples
> are i.i.d., each point counts equally much; if they are not, they do
> not. If you have clumped data due to time-series structure or lack of
> i.i.d. sampling, you must deal with that. However, data can be i.i.d.
> and clumped, if the underlying distribution is clumped. In order to
> determine the cause, you must consider how your data are generated and
> how they are sampled. You need meta-information about your data to
> determine this. Neither Matlab nor Octave will help you with this, and
> it is certainly not a weakness of NumPy as you implied in your original
> post. There is no way to put magic into any numerical computation.
> Statistics always require formulation of specific assumptions about the
> data. If you cannot think clearly about your data, then that is the
> problem you must solve.
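Just to check that I follow the suggestion: a minimal numpy sketch of that
state-space idea could look like the following - a constant-acceleration
Kalman filter with unit sampling interval and guessed noise levels q and r.
The function and parameter names here are only illustrative, nothing from
numpy/scipy itself:

import numpy as np

def kalman_track(xy, dt=1.0, q=0.05, r=1.0):
    """Constant-acceleration Kalman filter over 2D fixes taken every dt seconds.

    xy : (N, 2) array of observed (x, y) positions
    q  : assumed process-noise scale (how erratic the walker is)
    r  : assumed measurement-noise variance of the position fixes
    Returns the filtered states, shape (N, 6): [x, y, vx, vy, ax, ay].
    """
    xy = np.asarray(xy, dtype=float)
    # Per-axis transition for the expansion p(t+dt) = p + v*dt + 0.5*a*dt**2.
    F1 = np.array([[1.0, dt, 0.5 * dt ** 2],
                   [0.0, 1.0, dt],
                   [0.0, 0.0, 1.0]])
    F = np.kron(F1, np.eye(2))          # same model applied to x and y
    H = np.zeros((2, 6))                # only the positions are observed
    H[0, 0] = H[1, 1] = 1.0
    Q = q * np.eye(6)                   # crude process-noise covariance
    R = r * np.eye(2)                   # measurement-noise covariance

    s = np.zeros(6)
    s[:2] = xy[0]                       # start at the first fix, zero v and a
    P = np.eye(6)
    out = np.empty((len(xy), 6))
    for i, z in enumerate(xy):
        s = F @ s                       # predict
        P = F @ P @ F.T + Q
        S = H @ P @ H.T + R             # update with the new fix
        K = P @ H.T @ np.linalg.inv(S)
        s = s + K @ (z - H @ s)
        P = (np.eye(6) - K @ H) @ P
        out[i] = s
    return out

From the filtered positions one could then take, e.g., the mean
perpendicular distance |x - y| / sqrt(2) from the ideal diagonal y = x as a
"how well does he keep his track" number.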
Yes, in the drunkard time-series example it is possible to move to a better
model - yet even there it is very expensive (relative to the improvement in
the statistics) to worry too much about the best model for such a guy :-)

In data mining, with many data tracks and typically digging first for
multiple but smaller correlations - without a practical bottom-up model - I
think one regularly falls back to a certain basic case, maybe the most basic
model for data at all: that of a "hunter" who mostly waits and only acts
when rare goodies are in front of him.

Again in the most basic case, when you have 2D x,y data in front of you
without a reliable time path or the like, what you see is a density
distribution of points. There is possibly a linear correlation on the
largest scale - which is what you are interested in - but the points also
show inhomogeneity/clumping, and that raises the question of its influence
on r_err. What now? It is clearly nonsense to do just plain averaging
statistics. I think this case is the most basic default for data - even
compared to the common textbook i.i.d. case. In fact, one can see such
statistics, which respect the (inhomogeneous) data density itself, as a kind
of simple/independent auto-Bayesian statistics versus dumb averaging.

I think one can almost always apply this "Bayesian density weighter/filter"
as the better option compared to mere averaging statistics in that case of
x,y correlation where there is obviously an interesting correlation, but
where you are too lazy - or in principle unable - to tease out a model at
the physics level. The latter is in fact what any averaging statistics cries
for at any price - but how often can you do that in real-world
applications ... (In reality there is anyway no way to eliminate
autocorrelation in the composition of data. Everything and everybody
lies :-))

That is where a top-down (model-free) Bayesian statistics approach will pay
off: in the earlier extreme example of criminal data duplication, I'm sure
it would completely neutralize the attack. In the drunkard time-series
example it would tell me quite reliably how well this guy keeps to his
track - without the need for a complex model. In the case of well-behaved
i.i.d. data it would tell me the same as simple statistics. Just good
news ...

So perhaps I can now state it like this: I have the problem of estimating
the linear correlation (coefficient with error) on x,y data under the (most
general) "Bayesian" assumption of an inhomogeneous data distribution.
Therefore I am looking for a (fast/efficient/approximate) formula for r and
r_err. I guess the formula for r does not change (much) compared to simple
averaging statistics, but the formula for r_err will. Maybe it is already
easy with existing means of numpy/scipy, maybe not. I am far from working
out the (efficient) math myself, but I know what I want - and I can tell
whether a formula really does it.
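To make the idea a little more concrete, here is a rough numpy/scipy sketch
of one way such a density weighting could look: scipy's gaussian_kde for the
local point density, inverse density as the weight, and a plain bootstrap
for r_err. The function names and the choice of KDE/bootstrap are only
illustrative - a sketch of the direction, not the formula I am asking for:

import numpy as np
from scipy.stats import gaussian_kde

def weighted_r(x, y, w):
    """Pearson r computed with per-point weights w."""
    w = w / w.sum()
    mx, my = np.sum(w * x), np.sum(w * y)
    cxx = np.sum(w * (x - mx) ** 2)
    cyy = np.sum(w * (y - my) ** 2)
    cxy = np.sum(w * (x - mx) * (y - my))
    return cxy / np.sqrt(cxx * cyy)

def density_weighted_r(x, y, n_boot=1000, seed=None):
    """r and a bootstrap r_err with points down-weighted by local density.

    Each point gets weight 1/density, where the density is a Gaussian KDE
    of the (x, y) cloud, so a clumped (or duplicated) region counts roughly
    once instead of once per point.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    dens = gaussian_kde(np.vstack([x, y]))(np.vstack([x, y]))
    w = 1.0 / dens
    r = weighted_r(x, y, w)
    boot = np.empty(n_boot)
    for i in range(n_boot):             # resample points, reuse their weights
        idx = rng.integers(0, len(x), len(x))
        boot[i] = weighted_r(x[idx], y[idx], w[idx])
    return r, boot.std()

On well-spread i.i.d. data the weights come out nearly flat and r is close
to np.corrcoef(x, y)[0, 1]; on data with a heavy clump the clump's weight is
shared among its points, which is roughly the "density filter" behaviour I
mean. Whether the bootstrap spread is the right r_err is of course exactly
the open question.

Robert

-- 
http://mail.python.org/mailman/listinfo/python-list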