Re: [Numpy-discussion] 2D binning
On 06/02/2010 04:52 AM, josef.p...@gmail.com wrote:

On Tue, Jun 1, 2010 at 1:49 PM, Mathew Yeates wrote:

Hi. Can anyone think of a clever (non-looping) solution to the following? I have a list of latitudes, a list of longitudes, and a list of data values, all of the same length. I want to compute an average of the data values for each lat/lon pair; e.g. if lat[1001], lon[1001] == lat[2001], lon[2001], then data[1001] = (data[1001] + data[2001]) / 2. Looping is going to take way too long.

Zachary Pincus zachary.pin...@yale.edu replied:

As a start, are the equal lat/lon pairs exactly equal (i.e. either not floating-point, or floats that will always compare equal, meaning the floating-point bit patterns are guaranteed to be identical), or only approximately equal to float tolerance? If you're in the approx-equal case, then look at the KD-tree in scipy for doing near-neighbors queries. If you're in the exact-equal case, you could consider hashing the lat/lon pairs or something. At least then the looping is O(N) and not O(N^2):

    import collections
    grouped = collections.defaultdict(list)
    for lt, ln, da in zip(lat, lon, data):
        grouped[(lt, ln)].append(da)
    averaged = dict((ltln, numpy.mean(da)) for ltln, da in grouped.items())

Is that fast enough?

Mathew Yeates replied:

I guess it's as fast as I'm going to get. I don't really see any other way. (BTW, the lat/lons are integers.) -Mathew

On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincus zachary.pin...@yale.edu wrote:

You could (in C or Cython) try a brain-dead hashtable with no collision detection:

    for lat, long, data in dataset:
        bin = (lat ^ long) % num_bins
        hashtable[bin] = update_incremental_mean(hashtable[bin], data)

You'll of course want to do some experiments to see if your data are sufficiently sparse and/or you can afford a large enough hashtable array that you won't get spurious hash collisions. Adding error checking to ensure that there are no collisions would be pretty trivial (just keep a table of the lat/long for each hash value, which you'll need anyway, and check that different lat/long pairs don't get assigned the same bin). Zach

josef.p...@gmail.com replied:

If the lat/lon can be converted to a 1d label, as Wes suggested, then in a similar timing exercise ndimage was the fastest: http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html (This was for python 2.4; later I also found np.bincount, which requires that the labels are consecutive integers, but is as fast as ndimage.) I don't know how it would compare to the new suggestions. Josef

And as you said your lats and lons are integers, you could simply do ll = lat*1000 + lon to get unique 'hashes' or '1d labels' for your lat/lon pairs, as a lat or lon will never exceed 360 (degrees). After that, either use the ndimage approach, or use histogramming with weighting by data values and divide by the histogram without weighting, or just loop.

Vincent
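For concreteness, here is a minimal sketch of the histogram-based averaging Vincent describes, assuming integer lat/lon values in [0, 360); the random arrays are just illustrative stand-ins for the real data:

    import numpy as np

    lat = np.random.randint(0, 360, 10000)
    lon = np.random.randint(0, 360, 10000)
    data = np.random.randn(10000)

    ll = lat * 1000 + lon                                  # unique 1d label per pair
    edges = np.arange(ll.min(), ll.max() + 2)              # one bin per possible label
    sums, _ = np.histogram(ll, bins=edges, weights=data)   # weighted: per-bin data sums
    counts, _ = np.histogram(ll, bins=edges)               # unweighted: per-bin counts
    occupied = counts > 0
    means = sums[occupied] / counts[occupied]              # mean per occupied lat/lon pair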
Re: [Numpy-discussion] Technicalities of the SVN - GIT transition
On Wed, Jun 2, 2010 at 3:09 AM, Matthew Brett matthew.br...@gmail.com wrote:

> Maybe you could put up a list of those people whose emails you need to scrape (obviously without the email addresses) and ask for opt-out? Or opt-in if you think that's better?

So here is the list of authors:

abaecker, alan.mcintyre, ariver, cdavid, chanley, charris, chris.barker, chris.burns, christoph.weidemann, cookedm, darren.dale, dhuard, dmorrill, edschofield, ej, eric, fperez, ilan, jarrod.millman, jmiller, joe, jswhit, kern, madhusudancs, martin, matthew.brett, mdroe, nobody, oliphant, patmiller, paul.ivanov, pearu, peridot, pierregm, prabhu, ptvirtan, rc, rgommers, rkern, sasha, skip, stefan, test, tim_hochberg, timl, travis, travo, wfspotz

(For wfspotz and matthew.brett, I removed the email suffix.) I guess rkern and kern are the same person, as well as travo/oliphant.

cheers, David
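As background, duplicate identities like rkern/kern are usually collapsed with an author-map file fed to the conversion tool. A hypothetical fragment (the addresses below are placeholders, not anyone's real email):

    # authors.txt -- hypothetical svn-to-git author map
    rkern = Robert Kern <rkern@example.com>
    kern = Robert Kern <rkern@example.com>
    oliphant = Travis Oliphant <travis@example.com>
    travo = Travis Oliphant <travis@example.com>

which would then be passed along the lines of: git svn clone --authors-file=authors.txt <svn-url>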
Re: [Numpy-discussion] 2D binning
Why not simply use a set?

    uniquePoints = set(zip(lats, lons))

Ben Root

On Wed, Jun 2, 2010 at 2:41 AM, Vincent Schut sc...@sarvision.nl wrote:

> [snip -- quoted thread above]
[Numpy-discussion] 2d binning on regular grid
Hi there,

I'm interested in the solution to a special case of the parallel thread '2D binning', which is going on at the moment. My data is on a fine global grid, say .125x.125 degrees. I'm looking for a way to do calculations on coarser grids, e.g.:

* calculate mean()
* calculate std()
* ...

on a, say, 2.5x3.75 degree grid. One very crude approach would be to iterate through latitudes and longitudes, like this:

    import numpy as np
    import numpy.ma as ma

    latstep_orig = .125
    lonstep_orig = .125
    nlat_orig = int(180. / latstep_orig)
    nlon_orig = int(360. / lonstep_orig)
    data_orig = np.arange(nlat_orig * nlon_orig).reshape((nlat_orig, nlon_orig))

    latstep_new = 2.5
    lonstep_new = 3.75
    latstep = int(latstep_new / latstep_orig)
    lonstep = int(lonstep_new / lonstep_orig)
    print 'one new lat equals', latstep, 'orig lats'
    print 'one new lon equals', lonstep, 'orig lons'

    result = ma.zeros((int(180. / latstep_new), int(360. / lonstep_new)))
    latidx = 0
    while latidx * latstep_new < 180.:
        lonidx = 0
        while lonidx * lonstep_new < 360.:
            m = np.mean(data_orig[latidx*latstep:(latidx+1)*latstep,
                                  lonidx*lonstep:(lonidx+1)*lonstep])
            result[latidx, lonidx] = m
            lonidx += 1
        latidx += 1

However, this is very crude, and I was wondering if there's any more elegant way to do it ... Thanks for your insight! A.
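One less crude alternative is a sketch along these lines, assuming the coarse steps divide the fine grid evenly (as they do here: 2.5/.125 = 20 and 3.75/.125 = 30): fold each block into extra axes with reshape, then reduce over those axes.

    import numpy as np

    nlat, nlon = 1440, 2880       # 180/.125, 360/.125
    latstep, lonstep = 20, 30     # 2.5/.125, 3.75/.125
    data_orig = np.arange(nlat * nlon, dtype=float).reshape(nlat, nlon)

    # each 20x30 block becomes axes 1 and 3 of a 4d view
    blocks = data_orig.reshape(nlat // latstep, latstep, nlon // lonstep, lonstep)
    means = blocks.mean(axis=3).mean(axis=1)    # (72, 96) block means

    # std() must see each whole block at once, not two chained passes,
    # so flatten each block before reducing
    flat = blocks.transpose(0, 2, 1, 3).reshape(nlat // latstep, nlon // lonstep, -1)
    stds = flat.std(axis=2)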
Re: [Numpy-discussion] 2D binning
On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut sc...@sarvision.nl wrote:

> [snip -- quoted thread above]

I was curious about how fast ndimage was for this operation, so here's the complete function.
    import collections

    import numpy as np
    import scipy.ndimage as ndi

    N = 1  # as printed in the archive; evidently truncated -- the timing presumably used a much larger sample
    lat = np.random.randint(0, 360, N)
    lon = np.random.randint(0, 360, N)
    data = np.random.randn(N)

    def group_mean(lat, lon, data):
        # sort everything by (lat, lon) so equal keys are adjacent
        indexer = np.lexsort((lon, lat))
        lat = lat.take(indexer)
        lon = lon.take(indexer)
        sorted_data = data.take(indexer)

        # collapse each pair to one integer label, then average per label
        keys = 1000 * lat + lon
        unique_keys = np.unique(keys)
        result = ndi.mean(sorted_data, labels=keys, index=unique_keys)

        # map each label back to its (lat, lon) tuple for the output dict
        decoder = keys.searchsorted(unique_keys)
        return dict(zip(zip(lat.take(decoder), lon.take(decoder)), result))

Appears to be about 13x faster (and could be made faster still) than the naive version on my machine:

    def group_mean_naive(lat, lon, data):
        grouped = collections.defaultdict(list)
        for lt, ln, da in zip(lat, lon, data):
            grouped[(lt, ln)].append(da)
        averaged = dict((ltln, np.mean(da)) for ltln, da in grouped.items())
        return averaged

I had to get the latest scipy trunk to not get an error from ndimage.mean.
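For what it's worth, this kind of comparison is easy to reproduce with a small harness like the following (the repeat count is arbitrary, and the functions and arrays are the ones defined above):

    import timeit

    t_ndi = timeit.timeit(lambda: group_mean(lat, lon, data), number=10)
    t_naive = timeit.timeit(lambda: group_mean_naive(lat, lon, data), number=10)
    print 'ndimage: %.3fs  naive: %.3fs  speedup: %.1fx' % (t_ndi, t_naive, t_naive / t_ndi)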
Re: [Numpy-discussion] Technicalities of the SVN - GIT transition
Hi,

> Maybe you could put up a list of those people whose emails you need to scrape (obviously without the email addresses) and ask for opt-out? Or opt-in if you think that's better?
>
> So here is the list of authors: [snip]

Do y'all think opt-in? Or opt-out? If it's opt-in, I guess you'll catch most of the current committers, and most of the others you'll lose, but maybe that's good enough.

See you, Matthew
Re: [Numpy-discussion] 2D binning
Thanks. I am also getting an error in ndi.mean. Were you getting the error "RuntimeError: data type not supported"?

-Mathew

On Wed, Jun 2, 2010 at 9:40 AM, Wes McKinney wesmck...@gmail.com wrote:

> [snip -- quoted thread and group_mean code above]
Re: [Numpy-discussion] 2D binning
I'm on Windows, using a precompiled binary. I never built numpy/scipy on Windows.

On Wed, Jun 2, 2010 at 10:45 AM, Wes McKinney wesmck...@gmail.com wrote:

> [snip -- quoted thread above]
Re: [Numpy-discussion] 2D binning
On Wed, Jun 2, 2010 at 2:09 PM, Mathew Yeates mat.yea...@gmail.com wrote:

> I'm on Windows, using a precompiled binary. I never built numpy/scipy on Windows.

The ndimage measurements code has recently been rewritten. ndimage is very fast, but the old version has insufficient type checking and may crash on wrong inputs. I managed to work with it in the past on Windows. Maybe you could try to check the dtypes of the arguments for ndi.mean (preferably in an interpreter session where you don't mind if it crashes). I don't remember the type restrictions, but there are/were several tickets for it.

Josef

> [snip -- rest of quoted thread above]
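A minimal sketch of the kind of dtype check Josef suggests, assuming the restriction is roughly float64 data with integer labels (the exact supported types would need to be confirmed against the tickets he mentions; the arrays here are stand-ins):

    import numpy as np
    import scipy.ndimage as ndi

    data = np.random.randn(100)
    keys = np.random.randint(0, 10, 100)
    unique_keys = np.unique(keys)

    # inspect the dtypes first, then coerce to conservative choices
    print data.dtype, keys.dtype
    result = ndi.mean(data.astype(np.float64),
                      labels=keys.astype(np.int32),
                      index=unique_keys.astype(np.int32))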
Re: [Numpy-discussion] 2D binning
On 6/2/2010 2:32 PM, Mathew Yeates wrote:

> Nope. This version didn't work either.

If you're on Python 2.6, the binary here might work for you: http://www.lfd.uci.edu/~gohlke/pythonlibs/ It looks recent enough to have the rewritten ndimage.
[Numpy-discussion] masked constant
I'm about to commit some changes in the numpy trunk regarding the masked constant. Namely, 'masked' is now its own object. It inherits from MaskedArray, but doesn't have a fill_value and is represented slightly differently:

    repr(ma.masked)
    'masked'

The reason behind that is to reduce the headaches of new users (who tend to wonder why a masked item has a different fill_value than the original array (that's because it's a *constant*)). Any objections? (I could have silently committed that a while ago, but I don't want to mess w/ anybody's supplies of chill-pills...) P.
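For readers less familiar with the constant, a small illustration of how it is typically used (Python 2 session; the new repr is the change described above):

    import numpy.ma as ma

    a = ma.array([1, 2, 3])
    a[0] = ma.masked           # assigning the constant masks an entry
    print a                    # [-- 2 3]
    print a[0] is ma.masked    # True: indexing a masked entry returns the constant
    print repr(ma.masked)      # 'masked'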
Re: [Numpy-discussion] 2D binning
On 6/2/2010 2:32 PM, Mathew Yeates wrote:

> Nope. This version didn't work either.

Please note that in order to use the ndimage package from http://www.lfd.uci.edu/~gohlke/pythonlibs/#ndimage you have to use "import ndimage as ndi" instead of "import scipy.ndimage as ndi". The code posted by Wes works for me on Windows after that change.

-- Christoph
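If code needs to run against both layouts, a guarded import along these lines would cover it (a sketch):

    try:
        import scipy.ndimage as ndi   # regular scipy installation
    except ImportError:
        import ndimage as ndi         # standalone package from the page above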
Re: [Numpy-discussion] 2D binning
On 1/06/2010 10:51 PM, Wes McKinney wrote:

> [snip] This is a pretty good example of the group-by problem that will hopefully work its way into a future edition of NumPy.

Wes (or anyone else), please can you elaborate on any plans for groupby? I've made my own modification to numpy.bincount for doing groupby-type operations but not contributed anything back to the numpy community.

Is there any interest from European-based people to work on groupby etc. at the Euro SciPy sprints in July? If others are interested, maybe we could work out requirements beforehand and then do some coding in Paris.

Cheers, Stephen
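As a point of reference for any requirements discussion, the core of a bincount-based groupby is only a few lines (a sketch; np.unique's return_inverse provides the consecutive integer labels that bincount requires):

    import numpy as np

    keys = np.array([3, 5, 3, 7, 5, 3])             # arbitrary group keys
    values = np.array([1., 2., 3., 4., 5., 6.])

    uniques, labels = np.unique(keys, return_inverse=True)  # labels run 0..k-1
    sums = np.bincount(labels, weights=values)               # per-group sums
    counts = np.bincount(labels)                             # per-group sizes
    means = sums / counts                                    # aligned with uniques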
[Numpy-discussion] 2 bugs related to isinf and isfinite generate crazy warnings
http://www.mail-archive.com/numpy-discussion@scipy.org/msg23912.html

On some systems--but evidently not for most numpy users, or there would have been a steady stream of screams--the appearance of np.inf in any call to np.isfinite or np.isinf yields this:

    In [1]: import numpy as np
    In [2]: np.isinf(np.inf)
    Warning: invalid value encountered in isinf
    Out[2]: True

This generates streams of warnings if np.isinf or np.isfinite is applied to an array with many inf values. The problem is a combination of two bugs:

1) When building with setup.py, but perhaps not with scons (which I haven't tried yet), NPY_HAVE_DECL_ISFINITE and friends are never defined, even though they should be--this is all on ubuntu 10.4, in my case, and isfinite and isinf most definitely are in math.h. It looks to me like the only mechanism for defining these is in SConstruct.

2) So, with no access to the built-in isinf etc., npy_math.h falls back on the compact but slow macros that replace the much nicer ones that were in numpy a year or so ago. Here is the relevant part of npy_math.h:

    #ifndef NPY_HAVE_DECL_ISFINITE
    #define npy_isfinite(x) !npy_isnan((x) + (-x))
    #else
    #define npy_isfinite(x) isfinite((x))
    #endif

    #ifndef NPY_HAVE_DECL_ISINF
    #define npy_isinf(x) (!npy_isfinite(x) && !npy_isnan(x))
    #else
    #define npy_isinf(x) isinf((x))
    #endif

npy_isinf calls npy_isfinite, and npy_isfinite calls npy_isnan on (x) + (-x). If x is inf, that expression evaluates to a nan, so the return value is correct--but generating that nan sets the INVALID flag, hence the warning.

Sorry I don't have a patch to offer. Eric
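Until the build bug is fixed, the warning can at least be silenced locally on the user side; a workaround, not a fix:

    import numpy as np

    a = np.array([0.0, np.inf, -np.inf, np.nan])
    with np.errstate(invalid='ignore'):   # suppress the spurious INVALID warning
        finite = np.isfinite(a)
        infinite = np.isinf(a)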
Re: [Numpy-discussion] 2D binning
On Wed, Jun 2, 2010 at 6:18 PM, Stephen Simmons m...@stevesimmons.com wrote:

> [snip] Is there any interest from European-based people to work on groupby etc. at the Euro SciPy sprints in July? If others are interested, maybe we could work out requirements beforehand and then do some coding in Paris.

I know we're planning to discuss at SciPy in Austin and will hopefully have a gameplan. We should coordinate the discussions / implementation somewhere. I've implemented some groupby functionality in pandas, but I guess at a higher level than what's been proposed to add to NumPy. On the pystatsmodels mailing list we've been starting to have some discussions about statistically-oriented data structures in general -- I'll be sending an e-mail to the NumPy mailing list soon to start a broader discussion, which will hopefully lead to some good things at the conferences.
Re: [Numpy-discussion] 2 bugs related to isinf and isfinite generate crazy warnings
On Wed, Jun 2, 2010 at 7:11 PM, Eric Firing efir...@hawaii.edu wrote:

> [snip -- bug report above]

Open a ticket. Since this looks fixable, we should get it fixed for 1.5.

Chuck
Re: [Numpy-discussion] Technicalities of the SVN - GIT transition
On Wed, Jun 2, 2010 at 10:08 AM, Matthew Brett matthew.br...@gmail.com wrote:

> Do y'all think opt-in? Or opt-out? If it's opt-in I guess you'll catch most of the current committers, and most of the others you'll lose, but maybe that's good enough.

I think it should be opt-in. How would opt-out work? Would someone create new accounts for all the contributors and then give them access? If that is the case, I would really rather avoid this. If someone wants an account on github, they should register an account on their own. Or am I missing something?

Thanks, Jarrod
Re: [Numpy-discussion] Technicalities of the SVN - GIT transition
On 06/03/2010 10:24 AM, Jarrod Millman wrote:

> I think it should be opt-in. How would opt-out work? Would someone create new accounts for all the contributors and then give them access?

Just to be clear, this has nothing to do with accounts on github, or any registered thing. This is *only* about username/email as recognized by git itself (as recorded in the commit objects). Several people already answered me privately to give me their email; I think that's the way to go.

cheers, David
Re: [Numpy-discussion] 2 bugs related to isinf and isfinite generate crazy warnings
On Wed, Jun 2, 2010 at 7:11 PM, Eric Firing efir...@hawaii.edu wrote:

> [snip -- bug report above]

I'm guessing that the problem is that in gcc isinf is a macro; the relevant functions are actually __isinff, __isinf, and __isinfl. The library itself yields:

    [char...@ubuntu lib]$ nm -D libc.so.6 | grep isinf
    00032d10 T __isinf
    00033110 T __isinff
    00033420 T __isinfl
    00032d10 W isinf
    00033110 W isinff
    00033420 W isinfl

where the symbols without the underscores are weak. I have no idea what the implications of that are ;)

Chuck
Re: [Numpy-discussion] 2 bugs related to isinf and isfinite generate crazy warnings
On 06/03/2010 10:11 AM, Eric Firing wrote:

> [snip]
> 1) When building with setup.py, but perhaps not with scons (which I haven't tried yet), NPY_HAVE_DECL_ISFINITE and friends are never defined, even though they should be [snip]

Actually, there is a bug in setup.py to detect those for python >= 2.6; I have fixed this.

> 2) So, with no access to the built-in isinf etc., npy_math.h falls back on the compact but slow macros that replace the much nicer ones that were in numpy a year or so ago. [snip]

Which one are you thinking about: the ones using fpclassify? Could you show the code where the current version is slower?

cheers, David
Re: [Numpy-discussion] 2 bugs related to isinf and isfinite generate crazy warnings
On Wed, Jun 2, 2010 at 7:11 PM, Eric Firing efir...@hawaii.edu wrote:

> [snip -- bug report above]

More precisely, it looks like HAVE_DECL_ISINF and friends are defined in Python.h, hence NPY_HAVE_DECL_ISINF doesn't get defined. Hmm... So maybe

Chuck
Re: [Numpy-discussion] 2 bugs related to isinf and isfinite generate crazy warnings
On Wed, Jun 2, 2010 at 9:00 PM, David da...@silveregg.co.jp wrote:

> [snip] Actually, there is a bug in setup.py to detect those for python >= 2.6, I have fixed this.

Beat me to it ;)

Chuck
Re: [Numpy-discussion] 2 bugs related to isinf and isfinite generate crazy warnings
On 06/02/2010 05:00 PM, David wrote:

> Actually, there is a bug in setup.py to detect those for python >= 2.6, I have fixed this.

David, thank you.

> Which one are you thinking about: the ones using fpclassify? Could you show the code where the current version is slower?

OK, my memory was off as usual. What I was remembering was actually this: http://currents.soest.hawaii.edu/hgstage/hgwebdir.cgi/mpl_hg/file/4ab6f2159bea/matplotlib/src/MPL_isnan.h

So the macro versions were in mpl, and derived from numarray, but apparently were never in numpy. In any case, in MPL_isnan.h, isfinite is fast, because it involves only a single bitwise-and plus comparison; isnan and isinf are slower, because they involve about twice as much work. Given that the present numpy isfinite macro is broken--it raises an fp error--there is not much point in talking about its performance.

What I don't know is whether there is some reason not to use the MPL_isnan.h macros as the backups in numpy, for platforms that do not have their own isnan, isfinite, and isinf. I do know (or think...) that two or three years ago I timed numpy.isnan and numpy.isfinite, and found that the latter was notably faster than the former, consistent with what I would have expected if the implementation was as in MPL_isnan.h. Presumably, at that time my numpy build was using the system versions, and maybe they are implemented in a similar way. Looking in /usr/include/bits/*.h, I gave up trying to figure out what the system macros really are doing.

Eric
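That sort of timing is easy to redo; something along these lines (the array size and repeat count are arbitrary, and the numbers will of course vary by platform and build):

    import timeit

    setup = "import numpy as np; x = np.random.randn(1000000)"
    print timeit.timeit('np.isnan(x)', setup=setup, number=100)
    print timeit.timeit('np.isfinite(x)', setup=setup, number=100)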
Re: [Numpy-discussion] 2 bugs related to isinf and isfinite generate crazy warnings
On 06/03/2010 01:02 PM, Eric Firing wrote:

> In any case, in MPL_isnan.h, isfinite is fast, because it involves only a single bitwise-and plus comparison; isnan and isinf are slower, because they involve about twice as much work.

Concerning speed, the problem is that it depends a lot on the compiler/architecture. When I timed some implementations of isnan, just changing from Pentium 4 to Pentium M with the exact same compiler/options gave mixed results. It also depends on the input: some methods are faster for finite numbers but slower for nan/inf, etc.

> Given that the present numpy isfinite macro is broken--it raises an fp error--there is not much point in talking about its performance.

Indeed :)

> What I don't know is whether there is some reason not to use the MPL_isnan.h macros as the backups in numpy, for platforms that do not have their own isnan, isfinite, and isinf.

The problem is that it is actually quite some work. Float and double are not so difficult (two cases to deal with for endianness), but long double is a huge PITA: we already have 5 different implementations to deal with, and we are missing some (linux PPC).

> Presumably, at that time my numpy build was using the system versions, and maybe they are implemented in a similar way. Looking in /usr/include/bits/*.h, I gave up trying to figure out what the system macros really are doing.

You have to look into the sysdeps/ directory inside the glibc sources (on Linux). They use clever tricks to avoid branching, but when I benchmarked those, things like x != x were much faster on the computer I tested this on. Most likely very CPU dependent.

cheers, David
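The x != x trick David mentions is easy to see at the Python level, since NaN is the only floating-point value that compares unequal to itself:

    import numpy as np

    x = np.array([1.0, np.inf, np.nan])
    print np.isnan(x)   # [False False  True]
    print x != x        # same mask: NaN is the only self-unequal value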