Re: [Numpy-discussion] 2d binning and linear regression
> What exactly are you trying to fit? It is rather bad practice to fit
> a model to summarized data, because you lose the uncertainty in the
> original data.
>
> If you define your boxes, you can loop through each box directly and
> fit the equation:
>
>     model = mu + beta1*obs
>
> The extension is to fit the larger equation:
>
>     model = mu + boxes + beta1*obs + beta2*obs*boxes
>
> where "boxes" is an indicator (dummy) variable for each box. Since you
> are only interested in the box-by-model term, you can probably use a
> model of the form:
>
>     model = mu + boxes + beta2*obs*boxes
>
> However, these models assume that the residual variance is identical
> for all boxes. (That can be addressed with a mixed-model approach.)
>
> Bruce

Bruce,

I am trying to determine spatially based linear corrections for surface
winds in order to force a wave model. The basic idea is: use satellite
observations to determine the errors in the winds, and reduce them by
applying a linear correction prior to forcing the wave model.

I am not sure I understand what you are saying; I am possibly trying to
do exactly what you are describing, i.e. for each box, gather
observations, determine a linear correction, and apply it to the model:

    model = a*x + b

Does that make sense?

Cheers
Tom

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
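[Editorial note: the per-box correction Tom describes can be sketched as a
simple loop with np.polyfit. This is the slow baseline approach discussed
later in the thread; the data, bin edges, and the "true" coefficients
a = 0.9, b = 0.5 below are synthetic stand-ins, not values from the thread.]

```python
import numpy as np

# Synthetic stand-ins for the satellite-track data (illustrative only):
# obs is a noisy linear function of model, so each box's fit should
# recover roughly a = 0.9, b = 0.5.
rng = np.random.default_rng(0)
n = 3000
lat = rng.uniform(50, 56, n)
lon = rng.uniform(190, 200, n)
model = rng.uniform(0, 10, n)
obs = 0.9 * model + 0.5 + rng.normal(0.0, 0.2, n)

latedges = np.arange(50, 58, 2)    # 3 lat bins
lonedges = np.arange(190, 202, 2)  # 5 lon bins

# One linear correction per 2x2 degree box: obs ~= a*model + b
coeffs = {}
for i in range(len(latedges) - 1):
    for j in range(len(lonedges) - 1):
        in_box = ((lat >= latedges[i]) & (lat < latedges[i + 1]) &
                  (lon >= lonedges[j]) & (lon < lonedges[j + 1]))
        if in_box.sum() >= 2:  # need at least two points to fit a line
            a, b = np.polyfit(model[in_box], obs[in_box], 1)
            coeffs[(i, j)] = (a, b)
```

With ~200 points per box this recovers the coefficients well, but the
double loop over boxes is exactly what the rest of the thread tries to
vectorize away.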
Re: [Numpy-discussion] 2d binning and linear regression
> The basic idea is in "polyfit on multiple data points" on the
> numpy-discussion mailing list, April 2009.
>
> In this case, calculations have to be done by groups.
>
> Subtract the mean (this needs to be replaced by group demeaning):
>
>     modeldm = model - model.mean()
>     obsdm = obs - obs.mean()
>
>     xx, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*modeldm,
>                                         bins=(latedges, lonedges))
>     xy, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*obsdm,
>                                         bins=(latedges, lonedges))
>
>     slopes = xy/xx   # slopes by group
>
> Expand slopes to the length of the original array:
>
>     predicted = model - obs * slopes_expanded
>     ...
>
> The main point is to get the group functions, for demeaning, ... for
> the 2d labels (and get the labels out of histogramdd).
>
> I'm out of time (off to the airport soon), but I can look into it next
> weekend.
>
> Josef

Thanks Josef, I will chase up the April list...

If I understand what you have done above, this returns the slope of
best-fit lines forced through the origin, is that right?

Have a great trip!

Tom
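[Editorial note: Josef's outline can be completed with np.digitize and
np.bincount for the group demeaning he mentions. Tom's reading is right
for globally demeaned data; with *per-group* demeaning, as below, you get
the ordinary least-squares slope and intercept per bin. A minimal sketch
on synthetic data (the data, edges, and helper names are assumptions, not
from the thread):]

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
lat = rng.uniform(50, 56, n)
lon = rng.uniform(190, 200, n)
model = rng.uniform(0, 10, n)
obs = 0.9 * model + 0.5 + rng.normal(0.0, 0.2, n)

latedges = np.arange(50, 58, 2)
lonedges = np.arange(190, 202, 2)

# 1. A flat 2-d bin label per point, via digitize on each axis
li = np.digitize(lat, latedges) - 1
lj = np.digitize(lon, lonedges) - 1
ncols = len(lonedges) - 1
nbins = (len(latedges) - 1) * ncols
label = li * ncols + lj

# 2. Group means, expanded back to point length by fancy indexing
cnt = np.bincount(label, minlength=nbins)
safe = np.maximum(cnt, 1)  # guard empty bins
mmean = np.bincount(label, weights=model, minlength=nbins) / safe
omean = np.bincount(label, weights=obs, minlength=nbins) / safe
mdm = model - mmean[label]  # group-demeaned model
odm = obs - omean[label]    # group-demeaned obs

# 3. Per-bin OLS slope and intercept of obs ~= slope*model + intercept
xx = np.bincount(label, weights=mdm * mdm, minlength=nbins)
xy = np.bincount(label, weights=mdm * odm, minlength=nbins)
slope = np.where(xx > 0, xy / np.maximum(xx, 1e-300), np.nan)
intercept = omean - slope * mmean
```

No Python loop over bins: the only passes over the big arrays are the
bincount/digitize calls.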
Re: [Numpy-discussion] 2d binning and linear regression
> Are you doing something like np.polyfit(model, obs, 1)?
>
> If you are using polyfit with deg=1, i.e. fitting a straight line,
> then this could also be calculated using the weights in histogram2d.
>
> histogram2d (histogramdd) uses np.digitize and np.bincount, so I'm
> surprised if the histogram2d version is much faster. If a quick
> reading of histogramdd is correct, the main improvement would be to
> get the labels "xy" out of it, so it can be used repeatedly with
> np.bincount.
>
> Josef

Thanks Josef,

From my limited understanding, you are right that the histogram approach
is much faster, because it doesn't have to keep reading the arrays over
and over.

I am using np.polyfit(model, obs, 1). I couldn't work out a way to do
these regressions using histogram2d and weights, but you think it can be
done? This would be great!

Cheers
Tom
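[Editorial note: one way to do what Josef suggests, using only weighted
histogram2d calls, is to accumulate the per-bin sufficient statistics of
simple linear regression and apply the closed-form least-squares formulas.
A hedged sketch on synthetic data (all names and values below are
illustrative assumptions):]

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
lat = rng.uniform(50, 56, n)
lon = rng.uniform(190, 200, n)
model = rng.uniform(0, 10, n)
obs = 0.9 * model + 0.5 + rng.normal(0.0, 0.2, n)

latedges = np.arange(50, 58, 2)
lonedges = np.arange(190, 202, 2)
bins = (latedges, lonedges)

# Per-bin sufficient statistics via weighted 2-d histograms:
# N, sum(x), sum(y), sum(x*x), sum(x*y) with x = model, y = obs
N, _, _ = np.histogram2d(lat, lon, bins=bins)
Sx, _, _ = np.histogram2d(lat, lon, weights=model, bins=bins)
Sy, _, _ = np.histogram2d(lat, lon, weights=obs, bins=bins)
Sxx, _, _ = np.histogram2d(lat, lon, weights=model * model, bins=bins)
Sxy, _, _ = np.histogram2d(lat, lon, weights=model * obs, bins=bins)

# Closed-form least squares per bin: obs ~= slope*model + intercept
denom = N * Sxx - Sx * Sx
slope = np.where(denom > 0,
                 (N * Sxy - Sx * Sy) / np.where(denom > 0, denom, 1),
                 np.nan)
intercept = np.where(N > 0, (Sy - slope * Sx) / np.where(N > 0, N, 1),
                     np.nan)
```

This gives the same result as running np.polyfit(model, obs, 1) in each
box, with five histogram passes instead of a Python loop over boxes.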
[Numpy-discussion] 2d binning and linear regression
Hi All,

I have a problem involving lat/lon data. Basically, I am evaluating
numerical weather model data against satellite data, and trying to
produce gridded plots of various statistics. There are various steps
involved, but basically I get to the point where I have four arrays of
the same length, containing lat, lon, model and observed values
respectively along the path of the satellite, e.g.:

    lat   = [ 50.32  50.78  51.24  51.82  52.55  53.15  53.75  54.28  54.79  55.16 ... ]
    lon   = [ 192.83 193.38 193.94 194.67 195.65 196.49 197.35 198.15 198.94 199.53 ... ]
    obs   = [ 1.42   1.35   1.55   1.50   1.59   1.76   2.15   1.90   1.55   0.73 ... ]
    model = [ 1.67   1.68   1.70   1.79   2.04   2.36   2.53   2.38   2.149  1.57 ... ]

I then want to calculate statistics based on bins of, say, 2 x 2 degree
boxes. These arrays are very large, on the order of 10^6 elements.

For ease of explanation, I will focus initially on bias. My first
approach was to loop through the lat/lon boxes and use np.where
statements to extract all the values of the model and observations
within the given box, then calculate the bias as the mean of the
difference. This was obviously very slow. histogram2d provides a FAR
better way to do this, i.e.:

    import numpy as np

    latedges = np.arange(-90, 90, 2)
    lonedges = np.arange(0, 360, 2)

    diff = model - obs
    grid_N, xedges, yedges = np.histogram2d(lat, lon,
                                            bins=(latedges, lonedges))
    grid_bias_sum, xedges, yedges = np.histogram2d(lat, lon, weights=diff,
                                                   bins=(latedges, lonedges))
    grid_bias = grid_bias_sum / grid_N

I now want to determine the linear regression coefficients for each box
after fitting a least-squares linear regression to the data in each bin.
I have been looking at using np.digitize to extract the bin indices, but
haven't had much success trying to do this in two dimensions. I am back
to looping through the lat and lon box values, using np.where to extract
the observations and model data within each box, and using np.polyfit to
obtain the regression coefficients. This is, of course, very slow.
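[Editorial note: one detail worth flagging in the bias snippet above is
that grid_bias_sum / grid_N divides by zero in any bin the satellite track
never crosses. A runnable version with an empty-bin guard, on synthetic
data (the data and the built-in 0.3 bias are illustrative assumptions):]

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
lat = rng.uniform(50, 56, n)
lon = rng.uniform(190, 200, n)
obs = rng.uniform(1, 3, n)
model = obs + 0.3 + rng.normal(0.0, 0.1, n)  # model biased high by ~0.3

latedges = np.arange(50, 58, 2)
lonedges = np.arange(190, 202, 2)

diff = model - obs
grid_N, xedges, yedges = np.histogram2d(lat, lon,
                                        bins=(latedges, lonedges))
grid_sum, xedges, yedges = np.histogram2d(lat, lon, weights=diff,
                                          bins=(latedges, lonedges))

# Guard empty bins: NaN where no observations fell, mean difference elsewhere
grid_bias = np.where(grid_N > 0, grid_sum / np.maximum(grid_N, 1), np.nan)
```

Each cell of grid_bias is then the mean model-minus-obs difference for
that 2x2 degree box, and empty boxes are NaN rather than a runtime
warning.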
Can anyone advise a smarter, vectorized way to do this? Any thoughts
would be greatly appreciated.

Thanks in advance
Tom