[Numpy-discussion] 2d binning and linear regression
Hi All,

I have a problem involving lat/lon data. Basically, I am evaluating numerical weather model data against satellite data, and trying to produce gridded plots of various statistics. There are various steps involved, but basically I get to the point where I have four arrays of the same length, containing lat, lon, model and observed values respectively along the path of the satellite, e.g.:

lat   = [ 50.32  50.78  51.24  51.82  52.55  53.15  53.75  54.28  54.79  55.16 ... ]
lon   = [ 192.83 193.38 193.94 194.67 195.65 196.49 197.35 198.15 198.94 199.53 ... ]
obs   = [ 1.42   1.35   1.55   1.50   1.59   1.76   2.15   1.90   1.55   0.73  ... ]
model = [ 1.67   1.68   1.70   1.79   2.04   2.36   2.53   2.38   2.149  1.57  ... ]

I then want to calculate statistics based on bins of, say, 2 x 2 degree boxes. These arrays are very large, on the order of 10^6 elements. For ease of explanation, I will focus initially on bias.

My first approach was to loop through the lat/lon boxes and use np.where statements to extract all the values of the model and observations within the given box, and calculate the bias as the mean of the difference. This was obviously very slow. histogram2d provided a FAR better way to do this, i.e.:

import numpy as np

latedges = np.arange(-90, 90, 2)
lonedges = np.arange(0, 360, 2)

diff = model - obs
grid_N, xedges, yedges = np.histogram2d(lat, lon,
                                        bins=(latedges, lonedges))
grid_bias_sum, xedges, yedges = np.histogram2d(lat, lon, weights=diff,
                                               bins=(latedges, lonedges))
grid_bias = grid_bias_sum / grid_N

I now want to determine the linear regression coefficients for each box after fitting a least-squares linear regression to the data in each bin. I have been looking at using np.digitize to extract the bin indices, but haven't had much success trying to do this in two dimensions. I am back to looping through the lat and lon box values, using np.where to extract the observations and model data within each box, and using np.polyfit to obtain the regression coefficients (roughly the sketch below). This is, of course, very slow.

Can anyone advise a smarter, vectorized way to do this? Any thoughts would be greatly appreciated.

Thanks in advance
Tom
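For reference, a simplified version of the slow loop (the slope/intercept names are just illustrative; it assumes the latedges/lonedges and data arrays above):

nlat = len(latedges) - 1
nlon = len(lonedges) - 1
slope = np.full((nlat, nlon), np.nan)
intercept = np.full((nlat, nlon), np.nan)

for i in range(nlat):
    for j in range(nlon):
        # points falling in this 2 x 2 degree box
        inbox = np.where((lat >= latedges[i]) & (lat < latedges[i + 1]) &
                         (lon >= lonedges[j]) & (lon < lonedges[j + 1]))[0]
        if inbox.size > 1:
            # straight-line least-squares fit of obs against model
            slope[i, j], intercept[i, j] = np.polyfit(model[inbox],
                                                      obs[inbox], 1)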
Re: [Numpy-discussion] 2d binning and linear regression
On Sun, Jun 20, 2010 at 4:24 AM, Tom Durrant thdurr...@gmail.com wrote:
> [...] I now want to determine the linear regression coefficients for each
> box after fitting a least-squares linear regression to the data in each
> bin. I have been looking at using np.digitize to extract the bin indices,
> but haven't had much success trying to do this in two dimensions. I am
> back to looping through the lat and lon box values, using np.where to
> extract the observations and model data within each box, and using
> np.polyfit to obtain the regression coefficients. This is, of course,
> very slow. Can anyone advise a smarter, vectorized way to do this?

For a general linear regression problem, there wouldn't be much that I can see that can be done. If there are many small regression problems, then sometimes stacking them into one big sparse least-squares problem can be faster; it's faster to solve, but not always faster to set up in the first place.

Are you doing something like np.polyfit(model, obs, 1)?

If you are using polyfit with deg=1, i.e. fitting a straight line, then this could also be calculated using the weights in histogram2d.

histogram2d (histogramdd) uses np.digitize and np.bincount, so I'm surprised if the histogram2d version is much faster. If a quick reading of histogramdd is correct, the main improvement would be to get the labels xy out of it, so they can be used repeatedly with np.bincount.

Josef
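For deg=1, the fit reduces to per-bin sums that np.bincount can accumulate in one pass over the data. A minimal sketch of that idea (untested, with illustrative names; it assumes the latedges/lonedges arrays from the original post and that all points fall inside the edges):

import numpy as np

# flat bin label for each observation: which 2 x 2 degree box it falls in
ilat = np.digitize(lat, latedges) - 1
ilon = np.digitize(lon, lonedges) - 1
labels = ilat * (len(lonedges) - 1) + ilon
nbins = (len(latedges) - 1) * (len(lonedges) - 1)

# accumulate the sufficient statistics of a straight-line fit in each bin
n   = np.bincount(labels, minlength=nbins)
sx  = np.bincount(labels, weights=model,         minlength=nbins)
sy  = np.bincount(labels, weights=obs,           minlength=nbins)
sxx = np.bincount(labels, weights=model * model, minlength=nbins)
sxy = np.bincount(labels, weights=model * obs,   minlength=nbins)

# closed-form least-squares slope and intercept per bin
# (nan where a bin is empty or the fit is undetermined)
denom = n * sxx - sx * sx
with np.errstate(invalid='ignore', divide='ignore'):
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n

slope = slope.reshape(len(latedges) - 1, len(lonedges) - 1)
intercept = intercept.reshape(len(latedges) - 1, len(lonedges) - 1)

The labels array only needs to be built once, and can then be reused for any per-bin statistic.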
[Numpy-discussion] Multiplying numpy floats and python lists
I came across some strange behavior when multiplying numpy floats and python lists: the list is returned unchanged:

In [18]: np.float64(1.2) * [1, 2]
Out[18]: [1, 2]

On the other hand, multiplying an array scalar and a python list gives the expected answer:

In [19]: np.array(1.2) * [1, 2]
Out[19]: array([ 1.2,  2.4])

Finally, multiplying a python float and a python list gives an error:

In [20]: 1.2 * [1, 2]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/Users/Tony/<ipython console> in <module>()
TypeError: can't multiply sequence by non-int of type 'float'

These last two results (with the array scalar and the python float) both seem appropriate, but the numpy-float behavior is surprising. Is this a bug?

-Tony
Re: [Numpy-discussion] Multiplying numpy floats and python lists
Tony S Yu wrote:
> I came across some strange behavior when multiplying numpy floats and
> python lists: the list is returned unchanged:
>
> In [18]: np.float64(1.2) * [1, 2]
> Out[18]: [1, 2]

Apparently the np.float64 object is cast to an int, and python's * is used:

In [7]: np.float64(2.5) * [1,2]
Out[7]: [1, 2, 1, 2]

In [8]: np.float64(3.8) * [1,2]
Out[8]: [1, 2, 1, 2, 1, 2]

Warren
Re: [Numpy-discussion] Multiplying numpy floats and python lists
Sun, 2010-06-20 at 13:56 -0400, Tony S Yu wrote:
> I came across some strange behavior when multiplying numpy floats and
> python lists: the list is returned unchanged:
>
> In [18]: np.float64(1.2) * [1, 2]
> Out[18]: [1, 2]

Probably a bug: it seems to round the result first to an integer, and then do the usual Python thing (try for example np.float64(2.2)). I'd suggest filing a bug ticket.

Looking at the CPython source code, the issue seems to be that np.float64 for some reason passes PyIndex_Check. But (without whipping out gdb) I can't understand how this can be: the float64 scalar type is not supposed to have a non-NULL pointer in the nb_index slot. Indeed, a np.float64.__index__ method does not exist, and the following does not work:

[1, 2, 3][np.float64(2.2)]

Strange.

--
Pauli Virtanen
[Numpy-discussion] _wrapit bug?
Hi,

I was just digging through the fromnumeric.py code when I noticed that the function _wrapit uses asarray() rather than asanyarray(). I don't know if this function really gets used anymore (as it gets called in special situations), but it would seem to be an issue, as it would cause arrays such as masked arrays to lose their type. For example, if _wrapit were called for a max() function on a masked array, the max returned would be the ndarray's max, not the masked array's max.

Let me know if this is a non-issue; I just thought I ought to point it out.

Thanks,
Ben Root
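A quick illustration of the asarray/asanyarray difference, with hypothetical values:

import numpy as np
import numpy.ma as ma

a = ma.masked_array([1.0, 5.0, 3.0], mask=[False, True, False])

type(np.asanyarray(a))   # MaskedArray -- the subclass is preserved
type(np.asarray(a))      # ndarray -- the mask is silently dropped

a.max()                  # 3.0, the masked maximum
np.asarray(a).max()      # 5.0, the masked value leaks into the result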
Re: [Numpy-discussion] Multiplying numpy floats and python lists
On Jun 20, 2010, at 2:28 PM, Pauli Virtanen wrote:
> Probably a bug: it seems to round the result first to an integer, and
> then do the usual Python thing (try for example np.float64(2.2)). I'd
> suggest filing a bug ticket.

Done: http://projects.scipy.org/numpy/ticket/1516
Re: [Numpy-discussion] Building Numpy: could not read symbols Bad value error
On Sun, Jun 20, 2010 at 1:19 AM, Charles R Harris charlesr.har...@gmail.com wrote:
> Have you checked the actual configuration that was used to compile lapack,
> *not* the command line flags? I never had success with the automatic
> builds using the compressed files. Also check the actual installed
> libraries.
<snip>

Yeah, the make.inc does have the -fPIC option:

FORTRAN  = gfortran
OPTS     = -O2 -fPIC
DRVOPTS  = $(OPTS)
NOOPT    = -O0 -fPIC
LOADER   = gfortran
LOADOPTS =

And of course I see that -fPIC is actually being used:

...
gfortran -O2 -fPIC -c zsysvx.f -o zsysvx.o
gfortran -O2 -fPIC -c zsytf2.f -o zsytf2.o
gfortran -O2 -fPIC -c zsytrf.f -o zsytrf.o
...

--
Warm regards,
Michael Green
Re: [Numpy-discussion] Compiling NumPy on 64 bit Solaris 10 (sparc)
Hi Ralf,

Thanks for the information. I'll have a look at it, and see if it solves the issue.

Cheers,
Chris

On 18 June 2010 01:53, Ralf Gommers ralf.gomm...@googlemail.com wrote:
> On Tue, Jun 15, 2010 at 11:06 AM, Chris LeBlanc crlebl...@gmail.com wrote:
>> Hi,
>>
>> Firstly, thanks to everyone that has helped bring NumPy to the point it
>> is today. It's an excellent bit of software.
>>
>> I've recently managed to get NumPy to compile on a 64 bit Solaris 10
>> (sparc) machine. We've embedded 64 bit Python in some of our C code
>> using the Python C-API, and therefore require a 64 bit version of NumPy.
>> I had to use a bit of a hack to get NumPy to compile in 64 bit mode, and
>> I'm just wondering if there's a more elegant way of doing it?
>>
>> If I set CFLAGS to -m64 -xcode=pic32, and CC to /usr/bin/cc, it will
>> raise the following error:
>>
>> ... snipped most of backtrace ...
>>   File "/home/leblanc/source/numpy/numpy/distutils/command/build_src.py",
>> line 385, in generate_sources
>>     source = func(extension, build_dir)
>>   File "numpy/core/setup.py", line 675, in get_mathlib_info
>>     raise RuntimeError("Broken toolchain: cannot link a simple C program")
>>
>> However, if I create a wrapper script and set CC to the path of the
>> wrapper, then it will build normally and create a NumPy module I can use
>> from the 64 bit version of Python installed on the system
>> (/usr/bin/sparcv9/python). Here's the wrapper:
>>
>> #!/bin/sh
>> /usr/bin/cc -m64 -xcode=pic32 $*
>>
>> I did a bit of googling and couldn't find much information on this issue
>> (which is why I wanted to post this message, so others could find it).
>> I'm satisfied with the results, but wondering if there's a better way to
>> do it.
>
> It's because CFLAGS does not append to but overrides the default flags.
> Your wrapper script manages to append flags. Does the patch in
> http://projects.scipy.org/numpy/ticket/1382 work for you? If so it can be
> applied I think.
>
> Cheers,
> Ralf
Re: [Numpy-discussion] timeseries - dates prior to 1970
Pierre GM pgmdevlist at gmail.com writes:

> On Jun 10, 2010, at 7:16 PM, Bevan Jenkins wrote:
>> Hello, I have posted previously about dates prior to 1900, but this
>> seems to be a separate issue. The error message is definitely different.
>> I cannot seem to convert a timeseries from one frequency ('D') to
>> another ('H') when I use dates prior to 1970, as shown below. This works
>> fine when I use a date after 1970. Is this something that can be easily
>> fixed, or is there a workaround that I can use? Thanks
>
> I've started a git branch
> (http://github.com/pierregm/scikits.timeseries-sandbox) to test some
> changes on scikits.timeseries, in particular improved support of
> high-frequency dates before the unix epoch. Please give it a try, let me
> know how it goes. If all's fine, I'll port the changes to the SVN site
> (which until further notice remains the official source for the scikits).

I downloaded the git branch; I had to use the "Download pierregm/scikits.timeseries-sandbox at master" link to get a zip file, as I have not managed to get Git to work at Uni yet. After deleting the old scikit, I built and installed the new version, but I am still getting the same error.

I noticed you had some new tests dealing with high-frequency dates before the unix epoch. When I run ts.test() I get:

Ran 0 tests in 0.000s

OK

So I imagine I am doing something incorrectly.
Re: [Numpy-discussion] timeseries - dates prior to 1970
On Jun 20, 2010, at 7:28 PM, Bevan Jenkins wrote:
> I downloaded the git branch; I had to use the "Download
> pierregm/scikits.timeseries-sandbox at master" link to get a zip file, as
> I have not managed to get Git to work at Uni yet. After deleting the old
> scikit, I built and installed the new version, but I am still getting the
> same error.

Note that it's a work in progress. I'll let you know when the problem you experienced is fixed.

> I noticed you had some new tests dealing with high-frequency dates before
> the unix epoch. When I run ts.test() I get:
>
> Ran 0 tests in 0.000s
>
> OK
>
> So I imagine I am doing something incorrectly.

Not necessarily. It works on my side (but I install using python setup.py develop...). I'll try and see where it goes wrong.
Re: [Numpy-discussion] Building Numpy: could not read symbols Bad value error
On 06/21/2010 06:30 AM, Michael Green wrote:
> Yeah, the make.inc does have the -fPIC option:
> [... make.inc and compile lines as quoted above ...]

Can you check it is actually used for the file which causes the link failure (in atlas)?

David
Re: [Numpy-discussion] 2d binning and linear regression
> Are you doing something like np.polyfit(model, obs, 1)?
>
> If you are using polyfit with deg=1, i.e. fitting a straight line, then
> this could also be calculated using the weights in histogram2d.
>
> histogram2d (histogramdd) uses np.digitize and np.bincount, so I'm
> surprised if the histogram2d version is much faster. If a quick reading
> of histogramdd is correct, the main improvement would be to get the
> labels xy out of it, so they can be used repeatedly with np.bincount.
>
> Josef

Thanks Josef,

From my limited understanding, you are right: the histogram approach is much faster because it avoids repeatedly scanning the full arrays for each box.

I am using np.polyfit(model, obs, 1). I couldn't work out a way to do these regressions using histogram2d and weights, but you think it can be done? This would be great!

Cheers
Tom
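A sketch of the weighted-histogram variant Josef describes (untested; the grid_* names are illustrative, and it accumulates the same per-bin sums a bincount-based approach would, just via repeated histogram2d calls):

import numpy as np

bins = (latedges, lonedges)

# per-bin sums needed for a straight-line fit of obs against model
grid_n,   _, _ = np.histogram2d(lat, lon, bins=bins)
grid_sx,  _, _ = np.histogram2d(lat, lon, bins=bins, weights=model)
grid_sy,  _, _ = np.histogram2d(lat, lon, bins=bins, weights=obs)
grid_sxx, _, _ = np.histogram2d(lat, lon, bins=bins, weights=model * model)
grid_sxy, _, _ = np.histogram2d(lat, lon, bins=bins, weights=model * obs)

# closed-form least-squares slope and intercept per 2 x 2 degree box
# (nan where a box is empty or the fit is undetermined)
denom = grid_n * grid_sxx - grid_sx ** 2
with np.errstate(invalid='ignore', divide='ignore'):
    grid_slope = (grid_n * grid_sxy - grid_sx * grid_sy) / denom
    grid_intercept = (grid_sy - grid_slope * grid_sx) / grid_n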