Re: [Numpy-discussion] 2d binning and linear regression
On 06/22/2010 02:58 PM, josef.p...@gmail.com wrote:
> Using dummy variables as Bruce proposes works very well if there is not
> a very large number of groups; otherwise I think the memory requirements
> and the size of the arrays would be very costly in terms of performance.
>
> Josef

There is always a tradeoff of memory vs. speed. It is too easy to be too
clever, only to find that a more brute-force approach is considerably
faster. You want to avoid Python code and array indexing as much as
possible, since those can be major speed bumps. Also, some approaches have
hidden memory costs (histogram2d calls histogramdd([x, y], ...)).

If memory is an issue, then you need to decide how to handle it, because
there are many ways around it. For example, the biggest memory usage in my
code should be related to the design matrix and to creating the normal
equations. At worst (which requires passing by value rather than by
reference) it would be two 2-d arrays, with the number of rows being the
number of observations and the number of columns being two times the
number of groups.

If you have scipy, then sparse matrices can reduce the memory footprint of
the design matrix and could be faster. There are also ways to construct
the design matrix and the normal equations either observation by
observation (essential for very large data sets) or in pieces (using the
within-group information: number of observations, sum of 'model' and sum
of 'model*model').

If your boxes have homogeneous variance (which is already being assumed),
you can first divide the data into groups of boxes and then loop over each
group, reusing the existing arrays as needed.

Bruce
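A minimal sketch of the sparse design matrix mentioned above (this is not
from the original post; the function name sparse_design and the assumption
of 0-based integer box labels are illustrative only):

import numpy as np
import scipy.sparse as sp

def sparse_design(box, x):
    """Per-box dummy columns plus per-box interaction (x) columns,
    stored as a sparse matrix to keep the memory footprint small."""
    n = len(x)
    ngroups = int(box.max()) + 1
    rows = np.arange(n)
    # one 1 per row, in the column of that row's box
    dummies = sp.csr_matrix((np.ones(n), (rows, box)), shape=(n, ngroups))
    # x in the column of that row's box (interaction term)
    inter = sp.csr_matrix((np.asarray(x, dtype=float), (rows, box)),
                          shape=(n, ngroups))
    return sp.hstack([dummies, inter]).tocsr()

Usage sketch: the normal equations X'X beta = X'y stay small (2*ngroups by
2*ngroups) even when the number of observations is large:

X = sparse_design(box, x)
XtX = (X.T * X).toarray()   # dense, but only (2*ngroups, 2*ngroups)
Xty = X.T * y
beta = np.linalg.solve(XtX, Xty)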
Re: [Numpy-discussion] 2d binning and linear regression
> the basic idea is in "polyfit on multiple data points" on the
> numpy-discussion mailing list, April 2009.
>
> In this case, calculations have to be done by groups:
>
> # subtract mean (this needs to be replaced by group demeaning)
> modeldm = model - model.mean()
> obsdm = obs - obs.mean()
>
> xx, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*modeldm,
>                                     bins=(latedges, lonedges))
> xy, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*obsdm,
>                                     bins=(latedges, lonedges))
>
> slopes = xy/xx   # slopes by group
>
> # expand slopes to length of original array
> predicted = model - obs * slopes_expanded
> ...
>
> the main point is to get the group functions, for demeaning, ... for the
> 2d labels (and get the labels out of histogramdd).
>
> I'm out of time (off to the airport soon), but I can look into it next
> weekend.
>
> Josef

Thanks Josef, I will chase up the April list...

If I understand what you have done above, this returns the slope of best
fit lines forced through the origin, is that right?

Have a great trip!

Tom
Re: [Numpy-discussion] 2d binning and linear regression
> What exactly are you trying to fit? It is rather bad practice to fit a
> model to summarized data, because you lose the uncertainty in the
> original data.
>
> If you define your boxes, you can loop through directly on each box and
> even fit the equation:
>
> model = mu + beta1*obs
>
> The extension is to fit the larger equation:
>
> model = mu + boxes + beta1*obs + beta2*obs*boxes
>
> where 'boxes' is an indicator or dummy variable for each box. Since you
> are only interested in the box-by-model term, you can probably use this
> type of model:
>
> model = mu + boxes + beta2*obs*boxes
>
> However, these models assume that the residual variance is identical for
> all boxes. (That is solved by a mixed-model approach.)
>
> Bruce

Bruce,

I am trying to determine spatially based linear corrections for surface
winds in order to force a wave model. The basic idea is to use
observations from satellites to determine the errors in the wind, and to
reduce them by applying a linear correction prior to forcing the wave
model.

I am not sure I understand what you are saying; I am possibly trying to do
what you are describing, i.e. for each box, gather observations, determine
a linear correction, and apply it to the model:

model = a*x + b

Does that make sense?

Cheers
Tom
Re: [Numpy-discussion] 2d binning and linear regression
On 06/22/2010 09:13 AM, Tom Durrant wrote:
> I am not sure I understand what you are saying; I am possibly trying to
> do what you are describing, i.e. for each box, gather observations,
> determine a linear correction, and apply it to the model:
> model = a*x + b
> Does that make sense?

I used the data you gave and may have swapped 'model' and 'x' here - the
code should work if you switch these. First I assume that you can create a
variable 'box' based on the lat/lon data - I just created one from the
first values.

As I understand it, your problem is essentially an analysis of covariance
with one factor (box) and one regressor (x). So I used some technical
knowledge about the parameterization of the normal equations that may not
be true in general for all models, as it depends on finding 'equivalent
models'. Just print out the different arrays for a small example, as it is
rather hard to describe using words alone.

Basically you can create a design matrix that just includes dummy
variables for each box and the interactions between that dummy variable
and your 'x' array. This amounts to the equation:

y = box + box*x

where y is your 'model' array, box is a dummy variable for box (the number
of columns is the number of boxes), and x is your 'x' array. Note that
there is no general or overall intercept here, because you are not
interested in that. Rather, you are interested in the 'intercept' for each
box, which comes from the corresponding solution.

The code provides a function to create the dummy variables for each box
and a second function to compute the interactions between that dummy
variable and your 'x' array - these functions originated in
pystatsmodels. After that I form and solve the normal equations to get the
standard errors of the solutions and other useful statistics. Note that
the residual variance (MSE) is probably the pooled residual variance
across all boxes (I am too lazy to check). Also, if you are not interested
in the standard errors etc., then you should probably use a more efficient
solver available in numpy.

I (hopefully) reshaped the solutions ('beta') and standard errors ('pSE')
so the rows are for each box, the first column is the 'intercept' and the
second column is the 'slope'.
The output is:

$ python reg_box.py
Residual Sum of Squares 0.0306718851814
Residual Mean Sum of Squares 0.00511198086357
RSquared 0.975657233983
Estimated Intercept and regression for each box
[[ 0.69044944  0.44569288]
 [-1.53272813  1.43358273]]
Estimated standard error of the Intercept and regression for each box
[[ 0.41081864  0.23061509]
 [ 0.21123387  0.095004  ]]

For box=1: a=0.45 (se=0.23) and b=0.69 (se=0.41)
For box=2: a=1.43 (se=0.095) and b=-1.53 (se=0.21)

Bruce
(If you have questions, you can contact me off list if you want.)

import numpy as np

y = np.array([1.42, 1.35, 1.55, 1.50, 1.59, 1.76, 2.15, 1.90, 1.55, 0.73])
x = np.array([1.67, 1.68, 1.70, 1.79, 2.04, 2.36, 2.53, 2.38, 2.149, 1.57])
box = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])

def data2dummy(x):
    """Create dummy (indicator) variables for each unique group in x."""
    if not isinstance(x, np.ndarray):  # if not an ndarray, attempt to convert it
        x = np.asarray(x)
    if len(x.shape) > 1:
        raise ValueError('Too many columns')
    groups = np.unique(x)
    return (x[:, None] == groups).astype(int), groups

def design_int(A, B):
    """Compute interaction columns between dummy matrix A and regressor(s) B."""
    if len(B.shape) > 1:
        rowB, colB = B.shape
        ab = A * B[:, 0].reshape(rowB, 1)
    else:
        rowB = B.shape[0]
        colB = 1
        ab = A * B[:].reshape(rowB, 1)
    for col in range(1, colB):
        ncol = B[:, col].reshape(rowB, 1)
        ab = np.hstack((ab, A * ncol))
    return ab

dbox, lbox = data2dummy(box)                      # create dummy variables for box
Xdesign = np.hstack((dbox, design_int(dbox, x)))  # create design matrix

# Create components of the normal equations
YY = np.dot(y.T, y)
XX = np.dot(Xdesign.T, Xdesign)
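The script is cut off at this point in the archive. A minimal sketch of
how the remaining steps might look, continuing from the arrays built above
(this is a reconstruction, not Bruce's original code; the names XY, beta,
RSS, MSE, R2 and bSE are assumptions):

XY = np.dot(Xdesign.T, y)
XXinv = np.linalg.pinv(XX)                # (X'X)^-1
beta = np.dot(XXinv, XY)                  # solutions: intercepts, then slopes

n, p = Xdesign.shape
RSS = YY - np.dot(beta, XY)               # residual sum of squares
MSE = RSS / (n - p)                       # residual mean square
R2 = 1.0 - RSS / (YY - n * y.mean()**2)   # R-squared about the mean
bSE = np.sqrt(MSE * np.diag(XXinv))       # standard errors of the solutions

nbox = dbox.shape[1]
est = np.column_stack((beta[:nbox], beta[nbox:]))  # rows: boxes; cols: intercept, slope
se = np.column_stack((bSE[:nbox], bSE[nbox:]))

With the data above this reproduces the reported residual mean square
(0.0307 / 6 = 0.00511) and R-squared, so the reconstruction appears
consistent with the printed output.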
Re: [Numpy-discussion] 2d binning and linear regression
On Tue, Jun 22, 2010 at 10:09 AM, Tom Durrant thdurr...@gmail.com wrote:
> Thanks Josef, I will chase up the April list...
>
> If I understand what you have done above, this returns the slope of best
> fit lines forced through the origin, is that right?

Not if both variables, model and obs, are demeaned first. Demeaning
removes any effect of a constant and only the slope is left over, which
can be obtained from the ratio xy/xx. But to get an independent intercept
per group, the demeaning has to be done by group.

What's the size of your problem - how many groups, or how many separate
regressions? Demeaning by group has a setup cost in this case, so the main
speed benefit would come if you calculate the digitize and label
generation that histogram2d does only once, and reuse it in later
calculations.

Using dummy variables as Bruce proposes works very well if there is not a
very large number of groups; otherwise I think the memory requirements and
the size of the arrays would be very costly in terms of performance.

Josef
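A minimal sketch of the per-group demeaning and slope calculation
described above, using 1-d integer group labels and np.bincount (this is
an illustration, not part of the original exchange; the function name and
the label construction are assumptions):

import numpy as np

def group_slopes(labels, model, obs):
    """Per-group slope and intercept of obs ~ model, demeaned by group.
    labels is a 1-d array of non-negative integer group ids per point."""
    counts = np.bincount(labels).astype(float)
    gmean_model = np.bincount(labels, weights=model) / counts
    gmean_obs = np.bincount(labels, weights=obs) / counts
    modeldm = model - gmean_model[labels]     # demeaning by group
    obsdm = obs - gmean_obs[labels]
    xx = np.bincount(labels, weights=modeldm * modeldm)
    xy = np.bincount(labels, weights=modeldm * obsdm)
    slopes = xy / xx                          # empty groups give nan
    intercepts = gmean_obs - slopes * gmean_model
    return slopes, intercepts

The labels could be built as lat_idx * nlon + lon_idx from digitized
lat/lon indices, or extracted from histogramdd as suggested above.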
Re: [Numpy-discussion] 2d binning and linear regression
On Sun, Jun 20, 2010 at 10:57 PM, Tom Durrant thdurr...@gmail.com wrote:
> I am using np.polyfit(model, obs, 1). I couldn't work out a way to do
> these regressions using histogram2d and weights, but you think it can be
> done? This would be great!

the basic idea is in "polyfit on multiple data points" on the
numpy-discussion mailing list, April 2009.

In this case, calculations have to be done by groups:

# subtract mean (this needs to be replaced by group demeaning)
modeldm = model - model.mean()
obsdm = obs - obs.mean()

xx, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*modeldm,
                                    bins=(latedges, lonedges))
xy, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*obsdm,
                                    bins=(latedges, lonedges))

slopes = xy/xx   # slopes by group

# expand slopes to length of original array
predicted = model - obs * slopes_expanded
...

the main point is to get the group functions, for demeaning, ... for the
2d labels (and get the labels out of histogramdd).

I'm out of time (off to the airport soon), but I can look into it next
weekend.

Josef
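To make the "expand slopes to length of original array" step above
concrete, a minimal sketch (not part of the original message; the clipping
of digitize indices at the edges is an assumption and may differ slightly
from histogram2d's edge handling):

import numpy as np

# one (row, col) box index per observation, using the same bin edges
lat_idx = np.clip(np.digitize(lat, latedges) - 1, 0, len(latedges) - 2)
lon_idx = np.clip(np.digitize(lon, lonedges) - 1, 0, len(lonedges) - 2)
slopes_expanded = slopes[lat_idx, lon_idx]   # one slope per observation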
Re: [Numpy-discussion] 2d binning and linear regression
On 06/20/2010 03:24 AM, Tom Durrant wrote:
> I now want to determine the linear regression coefficients for each box
> after fitting a least squares linear regression to the data in each bin.
> [...]
> Can anyone advise a smarter, vectorized way to do this?

What exactly are you trying to fit? It is rather bad practice to fit a
model to summarized data, because you lose the uncertainty in the original
data.

If you define your boxes, you can loop through directly on each box and
even fit the equation:

model = mu + beta1*obs

The extension is to fit the larger equation:

model = mu + boxes + beta1*obs + beta2*obs*boxes

where 'boxes' is an indicator or dummy variable for each box. Since you
are only interested in the box-by-model term, you can probably use this
type of model:

model = mu + boxes + beta2*obs*boxes

However, these models assume that the residual variance is identical for
all boxes. (That is solved by a mixed-model approach.)

Bruce
[Numpy-discussion] 2d binning and linear regression
Hi All,

I have a problem involving lat/lon data. Basically, I am evaluating
numerical weather model data against satellite data, and trying to produce
gridded plots of various statistics. There are various steps involved with
this, but basically, I get to the point where I have four arrays of the
same length, containing lat, lon, model and observed values respectively
along the path of the satellite, e.g.:

lat   = [ 50.32  50.78  51.24  51.82  52.55  53.15  53.75  54.28  54.79  55.16 ... ]
lon   = [ 192.83 193.38 193.94 194.67 195.65 196.49 197.35 198.15 198.94 199.53 ... ]
obs   = [ 1.42   1.35   1.55   1.50   1.59   1.76   2.15   1.90   1.55   0.73  ... ]
model = [ 1.67   1.68   1.70   1.79   2.04   2.36   2.53   2.38   2.149  1.57  ... ]

I then want to calculate statistics based on bins of, say, 2 x 2 degree
boxes. These arrays are very large, on the order of 10^6 elements. For
ease of explanation, I will focus initially on bias.

My first approach was to loop through the lat/lon boxes and use np.where
statements to extract all the values of the model and observations within
the given box, and calculate the bias as the mean of the difference. This
was obviously very slow. histogram2d provided a FAR better way to do this,
i.e.:

import numpy as np

latedges = np.arange(-90, 90, 2)
lonedges = np.arange(0, 360, 2)

diff = model - obs
grid_N, xedges, yedges = np.histogram2d(lat, lon,
                                        bins=(latedges, lonedges))
grid_bias_sum, xedges, yedges = np.histogram2d(lat, lon, weights=diff,
                                               bins=(latedges, lonedges))
grid_bias = grid_bias_sum / grid_N

I now want to determine the linear regression coefficients for each box
after fitting a least squares linear regression to the data in each bin. I
have been looking at using np.digitize to extract the bin indices, but
haven't had much success trying to do this in two dimensions. I am back to
looping through the lat and lon box values, using np.where to extract the
observations and model data within that box, and using np.polyfit to
obtain the regression coefficients. This is, of course, very slow.

Can anyone advise a smarter, vectorized way to do this? Any thoughts would
be greatly appreciated.

Thanks in advance
Tom
Re: [Numpy-discussion] 2d binning and linear regression
On Sun, Jun 20, 2010 at 4:24 AM, Tom Durrant thdurr...@gmail.com wrote:
> I now want to determine the linear regression coefficients for each box
> after fitting a least squares linear regression to the data in each bin.
> [...]
> Can anyone advise a smarter, vectorized way to do this?

For a general linear regression problem, there wouldn't be much that I can
see that can be done. If there are many small regression problems, then
sometimes stacking them into one big sparse least squares problem can be
faster; it's faster to solve, but not always faster to create in the first
place.

Are you doing something like np.polyfit(model, obs, 1)?

If you are using polyfit with deg=1, i.e. fitting a straight line, then
this could also be calculated using the weights in histogram2d.

histogram2d (histogramdd) uses np.digitize and np.bincount, so I'm
surprised if the histogram2d version is much faster. If a quick reading of
histogramdd is correct, the main improvement would be to get the labels xy
out of it, so they can be used repeatedly with np.bincount.

Josef
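A minimal sketch of building those flat bin labels once with np.digitize
and reusing them with np.bincount, using the arrays from the original post
(this is an illustration, not code from the thread; the clipping at the
last edge is an assumption):

import numpy as np

lat_idx = np.clip(np.digitize(lat, latedges) - 1, 0, len(latedges) - 2)
lon_idx = np.clip(np.digitize(lon, lonedges) - 1, 0, len(lonedges) - 2)
nlat = len(latedges) - 1
nlon = len(lonedges) - 1
labels = lat_idx * nlon + lon_idx          # one integer label per observation

counts = np.bincount(labels, minlength=nlat * nlon)
bias_sum = np.bincount(labels, weights=model - obs, minlength=nlat * nlon)
grid_bias = (bias_sum / counts).reshape(nlat, nlon)   # empty boxes give nan

The same labels array can then be reused for any further per-box statistic
(sums of squares, cross products, etc.) without re-binning.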
Re: [Numpy-discussion] 2d binning and linear regression
> are you doing something like np.polyfit(model, obs, 1)?
>
> If you are using polyfit with deg=1, i.e. fitting a straight line, then
> this could also be calculated using the weights in histogram2d.
>
> histogram2d (histogramdd) uses np.digitize and np.bincount, so I'm
> surprised if the histogram2d version is much faster. If a quick reading
> of histogramdd is correct, the main improvement would be to get the
> labels xy out of it, so they can be used repeatedly with np.bincount.
>
> Josef

Thanks Josef,

From my limited understanding, you are right: the histogram is much faster
because it doesn't have to keep reading in the array over and over.

I am using np.polyfit(model, obs, 1). I couldn't work out a way to do
these regressions using histogram2d and weights, but you think it can be
done? This would be great!

Cheers
Tom
Re: [Numpy-discussion] 2d binning on regular grid
Hello Andreas,

please see this as a side remark. A colleague of mine made me aware of a
very beautiful approach to covering a sphere with evenly spaced points:

http://healpix.jpl.nasa.gov/

Since you want to calculate mean and stddev, to my understanding a grid in
longitude/latitude is problematic without proper weighting factors. Do you
use weighting factors? If yes, of what kind?

For Healpix, there exists a Python binding, but I never worked with it.

Friedrich
Re: [Numpy-discussion] 2d binning on regular grid
On Wed, Jun 2, 2010 at 11:40 AM, Andreas Hilboll li...@hilboll.de wrote:
> My data is on a fine global grid, say .125x.125 degrees. I'm looking for
> a way to do calculations on coarser grids, e.g.
> * calculate means()
> * calculate std()
> on a, say, 2.5x3.75 degree grid.
> [...]
> However, this is very crude, and I was wondering if there's any more
> elegant way to do it ...

I thought maybe there is something in ndimage for this, but since nobody
mentions it, maybe not.

If there are no memory problems and my interpretation of the question is
correct, then something like this might work:

>>> x = np.arange(16).reshape(4, -1)
>>> d = np.kron(np.eye(2), np.ones(2))
>>> s = np.dot(np.dot(d, x), d.T)
>>> m = np.dot(np.dot(d, x), d.T) / 4
>>> v = np.dot(np.dot(d, x**2), d.T) / 4 - m**2

>>> x
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
>>> s
array([[ 10.,  18.],
       [ 42.,  50.]])
>>> m
array([[  2.5,   4.5],
       [ 10.5,  12.5]])
>>> v
array([[ 4.25,  4.25],
       [ 4.25,  4.25]])
>>> x[:2, :2].sum()
10
>>> x[:2, :2].mean()
2.5
>>> x[:2, :2].var()
4.25

Josef
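An alternative sketch of the same block aggregation using reshape instead
of the kron matrix (my addition, not from the thread; it assumes the
coarse block size divides the fine grid shape exactly):

import numpy as np

def block_stats(data, nlat_block, nlon_block):
    """Mean and std over non-overlapping blocks of a regular 2-d grid."""
    nlat, nlon = data.shape
    blocks = data.reshape(nlat // nlat_block, nlat_block,
                          nlon // nlon_block, nlon_block)
    means = blocks.mean(axis=(1, 3))
    stds = blocks.std(axis=(1, 3))
    return means, stds

For the 0.125-degree grid in the question, a 2.5 x 3.75 degree target grid
corresponds to nlat_block=20, nlon_block=30.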
Re: [Numpy-discussion] 2D binning
On 06/02/2010 04:52 AM, josef.p...@gmail.com wrote:
> On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincus zachary.pin...@yale.edu wrote:
>>> I guess it's as fast as I'm going to get. I don't really see any other
>>> way. BTW, the lat/lons are integers.
>>> -Mathew
>>
>> You could (in C or Cython) try a brain-dead hashtable with no collision
>> detection:
>>
>> for lat, long, data in dataset:
>>     bin = (lat ^ long) % num_bins
>>     hashtable[bin] = update_incremental_mean(hashtable[bin], data)
>>
>> You'll of course want to do some experiments to see if your data are
>> sufficiently sparse and/or you can afford a large enough hashtable array
>> that you won't get spurious hash collisions. Adding error-checking to
>> ensure that there are no collisions would be pretty trivial (just keep
>> a table of the lat/long for each hash value, which you'll need anyway,
>> and check that different lat/long pairs don't get assigned the same
>> bin).
>>
>> Zach
>>
>>>> Hi,
>>>> Can anyone think of a clever (non-looping) solution to the following?
>>>> I have a list of latitudes, a list of longitudes, and a list of data
>>>> values. All lists are the same length. I want to compute an average
>>>> of data values for each lat/lon pair, e.g. if lat[1001] == lat[2001]
>>>> and lon[1001] == lon[2001], then
>>>> data[1001] = (data[1001] + data[2001])/2.
>>>> Looping is going to take way too long.
>>>
>>> As a start, are the equal lat/lon pairs exactly equal (i.e. either not
>>> floating-point, or floats that will always compare equal, that is, the
>>> floating-point bit patterns will be guaranteed to be identical) or
>>> approximately equal to float tolerance?
>>>
>>> If you're in the approx-equal case, then look at the KD-tree in scipy
>>> for doing near-neighbors queries.
>>>
>>> If you're in the exact-equal case, you could consider hashing the
>>> lat/lon pairs or something. At least then the looping is O(N) and not
>>> O(N^2):
>>>
>>> import collections
>>> grouped = collections.defaultdict(list)
>>> for lt, ln, da in zip(lat, lon, data):
>>>     grouped[(lt, ln)].append(da)
>>> averaged = dict((ltln, numpy.mean(da)) for ltln, da in grouped.items())
>>>
>>> Is that fast enough?
>>> Zach
>
> If the lat/lon can be converted to a 1d label as Wes suggested, then in a
> similar timing exercise ndimage was the fastest:
> http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html
> (this was for python 2.4; also later I found np.bincount, which requires
> that the labels are consecutive integers, but is as fast as ndimage)
>
> I don't know how it would compare to the new suggestions.
>
> Josef

And as you said your lats and lons are integers, you could simply do

ll = lat*1000 + lon

to get unique 'hashes' or '1d labels' for your lat/lon pairs, as a lat or
lon will never exceed 360 (degrees). After that, either use the ndimage
approach, or you could use histogramming with weighting by data values and
divide by the histogram without weighting, or just loop.

Vincent
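A small Python sketch of the "brain-dead hashtable with incremental means"
idea quoted above (my illustration; the collision check and the running
mean update are spelled out explicitly, and lat, lon, data are assumed to
be existing sequences of integers/floats):

import numpy as np

num_bins = 2 ** 20
counts = np.zeros(num_bins, dtype=np.int64)
means = np.zeros(num_bins, dtype=np.float64)
keys = [None] * num_bins        # lat/lon actually stored in each slot

for lt, ln, da in zip(lat, lon, data):
    b = (lt ^ ln) % num_bins
    if keys[b] is None:
        keys[b] = (lt, ln)
    elif keys[b] != (lt, ln):
        raise ValueError('hash collision; enlarge num_bins or handle chains')
    counts[b] += 1
    means[b] += (da - means[b]) / counts[b]   # incremental (running) mean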
Re: [Numpy-discussion] 2D binning
Why not simply use a set?

uniquePoints = set(zip(lats, lons))

Ben Root

On Wed, Jun 2, 2010 at 2:41 AM, Vincent Schut sc...@sarvision.nl wrote:
> And as you said your lats and lons are integers, you could simply do
> ll = lat*1000 + lon
> to get unique 'hashes' or '1d labels' for your lat/lon pairs, as a lat or
> lon will never exceed 360 (degrees). After that, either use the ndimage
> approach, or you could use histogramming with weighting by data values
> and divide by the histogram without weighting, or just loop.
>
> Vincent
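A short sketch of the 1-d label idea quoted above combined with
np.bincount, computing the per-pair means in one pass (my addition, not
from the thread; it assumes non-negative integer lat/lon, as stated
earlier in the discussion):

import numpy as np

ll = lat * 1000 + lon                    # unique label per integer lat/lon pair
sums = np.bincount(ll, weights=data)     # sum of data for each label
counts = np.bincount(ll)                 # number of points for each label
present = counts > 0
mean_per_label = sums[present] / counts[present]
labels_present = np.nonzero(present)[0]  # label value for each mean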
[Numpy-discussion] 2d binning on regular grid
Hi there,

I'm interested in the solution to a special case of the parallel thread
'2D binning', which is going on at the moment.

My data is on a fine global grid, say .125x.125 degrees. I'm looking for a
way to do calculations on coarser grids, e.g.

* calculate means()
* calculate std()
* ...

on a, say, 2.5x3.75 degree grid.

One very crude approach would be to iterate through latitudes and
longitudes, like this:

import numpy as np
import numpy.ma as ma

latstep_orig = .125
lonstep_orig = .125
data_orig = np.arange((180./latstep_orig) * (360./lonstep_orig)).reshape(
    (int(180./latstep_orig), int(360./lonstep_orig)))

latstep_new = 2.5
lonstep_new = 3.75
latstep = int(latstep_new / latstep_orig)
lonstep = int(lonstep_new / lonstep_orig)
print 'one new lat equals', latstep, 'new lats'
print 'one new lon equals', lonstep, 'new lons'

result = ma.zeros((int(180./latstep_new), int(360./lonstep_new)))
latidx = 0
while latidx*latstep_new < 180.:
    lonidx = 0
    while lonidx*lonstep_new < 360.:
        m = np.mean(
            data_orig[latidx*latstep:(latidx+1)*latstep,
                      lonidx*lonstep:(lonidx+1)*lonstep])
        result[latidx, lonidx] = m
        lonidx += 1
    latidx += 1

However, this is very crude, and I was wondering if there's any more
elegant way to do it ...

Thanks for your insight!

A.
Re: [Numpy-discussion] 2D binning
On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut sc...@sarvision.nl wrote:
> And as you said your lats and lons are integers, you could simply do
> ll = lat*1000 + lon
> to get unique 'hashes' or '1d labels' for your lat/lon pairs [...]
> After that, either use the ndimage approach, or you could use
> histogramming with weighting by data values and divide by the histogram
> without weighting, or just loop.
>
> Vincent

I was curious about how fast ndimage was for this operation, so here's the
complete function.
import collections

import numpy as np
import scipy.ndimage as ndi

N = 1
lat = np.random.randint(0, 360, N)
lon = np.random.randint(0, 360, N)
data = np.random.randn(N)

def group_mean(lat, lon, data):
    indexer = np.lexsort((lon, lat))
    lat = lat.take(indexer)
    lon = lon.take(indexer)
    sorted_data = data.take(indexer)

    keys = 1000 * lat + lon
    unique_keys = np.unique(keys)

    result = ndi.mean(sorted_data, labels=keys, index=unique_keys)
    decoder = keys.searchsorted(unique_keys)
    return dict(zip(zip(lat.take(decoder), lon.take(decoder)), result))

Appears to be about 13x faster (and could be made faster still) than the
naive version on my machine:

def group_mean_naive(lat, lon, data):
    grouped = collections.defaultdict(list)
    for lt, ln, da in zip(lat, lon, data):
        grouped[(lt, ln)].append(da)
    averaged = dict((ltln, np.mean(da)) for ltln, da in grouped.items())
    return averaged

I had to get the latest scipy trunk to not get an error from ndimage.mean.
Re: [Numpy-discussion] 2D binning
thanks. I am also getting an error in ndi.mean. Were you getting the error
"RuntimeError: data type not supported"?

-Mathew

On Wed, Jun 2, 2010 at 9:40 AM, Wes McKinney wesmck...@gmail.com wrote:
> I was curious about how fast ndimage was for this operation, so here's
> the complete function.
> [...]
Re: [Numpy-discussion] 2D binning
On Wed, Jun 2, 2010 at 1:23 PM, Mathew Yeates mat.yea...@gmail.com wrote:
> thanks. I am also getting an error in ndi.mean. Were you getting the
> error "RuntimeError: data type not supported"?
>
> -Mathew

[The body of this reply is cut off in the archive; only the quoted text
remains.]
Re: [Numpy-discussion] 2D binning
I'm on Windows, using a precompiled binary. I never built numpy/scipy on
Windows.

On Wed, Jun 2, 2010 at 10:45 AM, Wes McKinney wesmck...@gmail.com wrote:
> [...]
Re: [Numpy-discussion] 2D binning
On Wed, Jun 2, 2010 at 2:09 PM, Mathew Yeates mat.yea...@gmail.com wrote: I'm on Windows, using a precompiled binary.

I never built numpy/scipy on Windows. The ndimage measurements code has recently been rewritten. ndimage is very fast, but the old version has insufficient type checking and may crash on wrong inputs. I managed to work with it in the past on Windows. Maybe you could check the dtypes of the arguments to ndi.mean (preferably in an interpreter session where you don't mind a crash). I don't remember the type restrictions, but there are/were several tickets for it. Josef

On Wed, Jun 2, 2010 at 10:45 AM, Wes McKinney wesmck...@gmail.com wrote: On Wed, Jun 2, 2010 at 1:23 PM, Mathew Yeates mat.yea...@gmail.com wrote: thanks. I am also getting an error in ndi.mean -- Were you getting the error "RuntimeError: data type not supported"?

On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut sc...@sarvision.nl wrote: [snip] And as you said your lats and lons are integers, you could simply do ll = lat*1000 + lon to get unique 'hashes' or '1d labels' for your lat/lon pairs, as a lat or lon will never exceed 360 (degrees). After that, either use the ndimage approach, or you could use histogramming with weighting by data values and dividing by the histogram without weighting, or just loop. Vincent

[rest of the quoted thread snipped; the earlier messages appear in full below]
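For illustration, a minimal sketch of the histogram-with-weights averaging Vincent describes, assuming integer lat/lon in [0, 360) and 1d arrays named lat, lon, data:

import numpy as np

edges = np.arange(0, 361)                                         # one bin per integer degree
sums, xe, ye = np.histogram2d(lat, lon, bins=(edges, edges), weights=data)   # sum of data per cell
counts, xe, ye = np.histogram2d(lat, lon, bins=(edges, edges))               # observations per cell
means = sums / np.where(counts == 0, 1, counts)                   # avoid division by zero in empty cells

Empty cells are left at zero here rather than NaN; that choice is arbitrary.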
Re: [Numpy-discussion] 2D binning
On 6/2/2010 2:32 PM, Mathew Yeates wrote: Nope. This version didn't work either.

If you're on Python 2.6 the binary here might work for you: http://www.lfd.uci.edu/~gohlke/pythonlibs/ It looks recent enough to have the rewritten ndimage.
Re: [Numpy-discussion] 2D binning
On 6/2/2010 2:32 PM, Mathew Yeates wrote: Nope. This version didn't work either.

Please note that in order to use the ndimage package from http://www.lfd.uci.edu/~gohlke/pythonlibs/#ndimage you have to use "import ndimage as ndi" instead of "import scipy.ndimage as ndi". The code posted by Wes works for me on Windows after that change. -- Christoph
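In other words, with that standalone build the only change to the earlier snippets would be the import line:

import ndimage as ndi      # instead of: import scipy.ndimage as ndi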
Re: [Numpy-discussion] 2D binning
On 1/06/2010 10:51 PM, Wes McKinney wrote: [snip] This is a pretty good example of the group-by problem that will hopefully work its way into a future edition of NumPy.

Wes (or anyone else), please can you elaborate on any plans for groupby? I've made my own modification to numpy.bincount for doing groupby-type operations, but haven't contributed anything back to the numpy community. Is there any interest from European-based people in working on groupby etc. at the EuroSciPy sprints in July? If others are interested, maybe we could work out requirements beforehand and then do some coding in Paris. Cheers, Stephen
Re: [Numpy-discussion] 2D binning
On Wed, Jun 2, 2010 at 6:18 PM, Stephen Simmons m...@stevesimmons.com wrote: Wes (or anyone else), please can you elaborate on any plans for groupby? [snip]

I know we're planning to discuss this at SciPy in Austin and will hopefully have a game plan. We should coordinate the discussions / implementation somewhere. I've implemented some groupby functionality in pandas, but I guess at a higher level than what's been proposed to add to NumPy. On the pystatsmodels mailing list we've been starting to have some discussions about statistically-oriented data structures in general; I'll be sending an e-mail to the NumPy mailing list soon to start a broader discussion, which will hopefully lead to some good things at the conferences.
[Numpy-discussion] 2D binning
Hi, Can anyone think of a clever (non-looping) solution to the following? I have a list of latitudes, a list of longitudes, and a list of data values. All lists are the same length. I want to compute an average of the data values for each lat/lon pair. E.g. if lat[1001] == lat[2001] and lon[1001] == lon[2001], then data[1001] = (data[1001] + data[2001]) / 2. Looping is going to take waaay too long. Mathew
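For concreteness, a sketch of the brute-force approach the question wants to avoid (this is illustration only, not Mathew's code, and it assumes lat, lon, data are numpy arrays):

import numpy as np

averaged = {}
for i in range(len(lat)):
    # scans the full arrays for every element: O(N^2) comparisons overall
    mask = (lat == lat[i]) & (lon == lon[i])
    averaged[(lat[i], lon[i])] = data[mask].mean()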
Re: [Numpy-discussion] 2D binning
On Tue, Jun 1, 2010 at 1:07 PM, Mathew Yeates mat.yea...@gmail.com wrote: Can anyone think of a clever (non-looping) solution to the following? [snip]

Looping over N and searching in N would be O(N^2). But would it be too long just to loop once, i.e. O(N)? If you need to find all the averages you could try something like this:

from collections import defaultdict

def aggdict(x):
    "Convert [(1, 'a'), (2, 'b'), (1, 'A')] to {1: ['a', 'A'], 2: ['b']}."
    d = defaultdict(list)
    for k, v in x:
        d[k].append(v)
    return dict(d)
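A possible way to apply aggdict to the lat/lon problem (a sketch; the array names lat, lon, data are taken from the original question):

import numpy as np

# group values by (lat, lon) pair, then average each group
grouped = aggdict(zip(zip(lat, lon), data))
averages = dict((key, np.mean(vals)) for key, vals in grouped.items())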
Re: [Numpy-discussion] 2D binning
Can anyone think of a clever (non-looping) solution to the following? [snip]

As a start, are the equal lat/lon pairs exactly equal (i.e. either not floating-point, or floats that will always compare equal, that is, floating-point bit-patterns guaranteed to be identical), or only approximately equal to float tolerance? If you're in the approx-equal case, then look at the KD-tree in scipy for doing near-neighbor queries.

If you're in the exact-equal case, you could consider hashing the lat/lon pairs or something. At least then the looping is O(N) and not O(N^2):

import collections
import numpy

grouped = collections.defaultdict(list)
for lt, ln, da in zip(lat, lon, data):
    grouped[(lt, ln)].append(da)
averaged = dict((ltln, numpy.mean(da)) for ltln, da in grouped.items())

Is that fast enough? Zach
Re: [Numpy-discussion] 2D binning
On Tue, Jun 1, 2010 at 4:49 PM, Zachary Pincus zachary.pin...@yale.edu wrote: If you're in the exact-equal case, you could consider hashing the lat/lon pairs or something. [snip]

This is a pretty good example of the group-by problem that will hopefully work its way into a future edition of NumPy. Given that, a good approach would be to produce a unique key from the lat and lon vectors, and pass that off to the groupby routine (when it exists).
Re: [Numpy-discussion] 2D binning
I guess it's as fast as I'm going to get. I don't really see any other way. (BTW, the lat/lons are integers.) -Mathew

On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincus zachary.pin...@yale.edu wrote: [snip]
Re: [Numpy-discussion] 2D binning
I guess it's as fast as I'm going to get. I don't really see any other way. (BTW, the lat/lons are integers.)

You could (in C or Cython) try a brain-dead hashtable with no collision detection:

for lat, long, data in dataset:
    bin = (lat ^ long) % num_bins
    hashtable[bin] = update_incremental_mean(hashtable[bin], data)

You'll of course want to do some experiments to see if your data are sufficiently sparse and/or you can afford a large enough hashtable array that you won't get spurious hash collisions. Adding error-checking to ensure that there are no collisions would be pretty trivial (just keep a table of the lat/long for each hash value, which you'll need anyway, and check that different lat/long pairs don't get assigned the same bin). Zach
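A plain-Python sketch of that idea, with the collision check Zach mentions bolted on (Zach suggests C or Cython; the names and bin count here are assumptions, not his code):

import numpy as np

num_bins = 4096
counts = np.zeros(num_bins, dtype=np.int64)
means = np.zeros(num_bins)
seen = -np.ones(num_bins, dtype=np.int64)      # remembered key per bin, for collision checks

for la, lo, d in zip(lat, lon, data):
    key = la * 1000 + lo                       # exact identity of the (lat, lon) pair
    b = (la ^ lo) % num_bins                   # the hash
    if seen[b] != -1 and seen[b] != key:
        raise ValueError("hash collision: enlarge num_bins or handle collisions")
    seen[b] = key
    counts[b] += 1
    means[b] += (d - means[b]) / counts[b]     # incremental mean update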
Re: [Numpy-discussion] 2D binning
On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincus zachary.pin...@yale.edu wrote: You could (in C or Cython) try a brain-dead hashtable with no collision detection: [snip]

If the lat/lon pairs can be converted to a 1d label as Wes suggested, then in a similar timing exercise ndimage was the fastest: http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html (that was for python 2.4; later I also found np.bincount, which requires that the labels are consecutive integers, but is as fast as ndimage). I don't know how it would compare to the new suggestions. Josef
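A minimal sketch of the np.bincount variant Josef mentions, with np.unique used to produce the consecutive integer labels it requires (array names are assumptions; integer lat/lon with lon < 1000):

import numpy as np

key = lat * 1000 + lon                                        # 1d label per (lat, lon) pair
unique_keys, labels = np.unique(key, return_inverse=True)     # consecutive labels 0..k-1
group_means = np.bincount(labels, weights=data) / np.bincount(labels)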
Re: [Numpy-discussion] 2D binning
On Tue, Jun 1, 2010 at 1:51 PM, Wes McKinney wesmck...@gmail.com wrote: This is a pretty good example of the group-by problem that will hopefully work its way into a future edition of NumPy. [snip]

Meanwhile, groupby from itertools will work, but it might be a bit slower since it'll have to convert every row to a tuple and group in a list.

import numpy as np
import itertools

# fake data
N = 10000   # the original sample size was lost in the archive; any multiple of 250 works
lats = np.repeat(180 * (np.random.ranf(N / 250) - 0.5), 250)
lons = np.repeat(360 * (np.random.ranf(N / 250) - 0.5), 250)
np.random.shuffle(lats)
np.random.shuffle(lons)
vals = np.arange(N)

# sort rows so that equal (lat, lon) pairs are adjacent, as groupby requires
inds = np.lexsort((lons, lats))
sorted_lats = lats[inds]
sorted_lons = lons[inds]
sorted_vals = vals[inds]
llv = np.array((sorted_lats, sorted_lons, sorted_vals)).T

for (lat, lon), group in itertools.groupby(llv, lambda row: tuple(row[:2])):
    group_vals = [g[-1] for g in group]
    print lat, lon, np.mean(group_vals)

# make sure the mean for the last lat/lon from the loop matches the mean
# for that lat/lon from the original data.
tests_idx, = np.where((lats == lat) & (lons == lon))
assert np.mean(vals[tests_idx]) == np.mean(group_vals)