On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut <sc...@sarvision.nl> wrote:
> On 06/02/2010 04:52 AM, josef.p...@gmail.com wrote:
>> On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincus <zachary.pin...@yale.edu> wrote:
>>>> I guess it's as fast as I'm going to get. I don't really see any
>>>> other way. BTW, the lat/lons are integers.
>>>
>>> You could (in C or Cython) try a brain-dead "hashtable" with no
>>> collision detection:
>>>
>>> for lat, long, data in dataset:
>>>     bin = (lat ^ long) % num_bins
>>>     hashtable[bin] = update_incremental_mean(hashtable[bin], data)
>>>
>>> You'll of course want to do some experiments to see if your data are
>>> sufficiently sparse and/or you can afford a large enough hashtable
>>> array that you won't get spurious hash collisions. Adding
>>> error-checking to ensure that there are no collisions would be pretty
>>> trivial (just keep a table of the lat/lon for each hash value, which
>>> you'll need anyway, and check that different lat/lon pairs don't get
>>> assigned the same bin).
>>>
>>> Zach
>>>
>>>> -Mathew
>>>>
>>>> On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincus <zachary.pin...@yale.edu> wrote:
>>>>> Hi,
>>>>> Can anyone think of a clever (non-looping) solution to the following?
>>>>>
>>>>> I have a list of latitudes, a list of longitudes, and a list of data
>>>>> values. All lists are the same length.
>>>>>
>>>>> I want to compute an average of data values for each lat/lon pair,
>>>>> e.g. if lat[1001], lon[1001] == lat[2001], lon[2001], then
>>>>> data[1001] = (data[1001] + data[2001]) / 2.
>>>>>
>>>>> Looping is going to take wayyyy too long.
>>>>
>>>> As a start, are the "equal" lat/lon pairs exactly equal (i.e. either
>>>> not floating point, or floats that will always compare equal, that is,
>>>> the floating-point bit patterns are guaranteed to be identical), or
>>>> approximately equal to float tolerance?
>>>>
>>>> If you're in the approx-equal case, then look at the KD-tree in scipy
>>>> for doing near-neighbors queries.
>>>>
>>>> If you're in the exact-equal case, you could consider hashing the
>>>> lat/lon pairs or something. At least then the looping is O(N) and not
>>>> O(N^2):
>>>>
>>>> import collections
>>>> grouped = collections.defaultdict(list)
>>>> for lt, ln, da in zip(lat, lon, data):
>>>>     grouped[(lt, ln)].append(da)
>>>>
>>>> averaged = dict((ltln, numpy.mean(da)) for ltln, da in
>>>>                 grouped.items())
>>>>
>>>> Is that fast enough?
>>>>
>>>> Zach
>>
>> If the lat/lon pairs can be converted to a 1d label as Wes suggested,
>> then in a similar timing exercise ndimage was the fastest:
>> http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html
>
> And as you said your lats and lons are integers, you could simply do
>
> ll = lat*1000 + lon
>
> to get unique 'hashes' or '1d labels' for your lat/lon pairs, as a lat or
> lon will never exceed 360 (degrees).
>
> After that, either use the ndimage approach, or you could use
> histogramming with weighting by the data values and divide by the
> histogram without weighting, or just loop.
>
> Vincent
>
>> (This was for Python 2.4; later I also found np.bincount, which
>> requires that the labels are consecutive integers, but is as fast as
>> ndimage.)
>>
>> I don't know how it would compare to the new suggestions.
>>
>> Josef
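For concreteness, Zach's no-collision-check hash sketch might be fleshed out
like this (a minimal sketch: num_bins is a tunable guess of mine, and an
inline running-mean update stands in for his update_incremental_mean; as he
warns, collisions are silently merged):

import numpy as np

num_bins = 2 ** 20  # oversized guess; tune so spurious collisions are unlikely

counts = np.zeros(num_bins, dtype=np.int64)  # observations seen per bin
means = np.zeros(num_bins)                   # running mean per bin

for lt, ln, da in zip(lat, lon, data):
    b = (lt ^ ln) % num_bins  # Zach's hash; note XOR is symmetric, so
                              # (lt, ln) and (ln, lt) land in the same bin
    counts[b] += 1
    means[b] += (da - means[b]) / counts[b]  # incremental mean update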
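Vincent's 1d-label trick combined with the np.bincount route Josef mentions
might look like the following (a minimal sketch, assuming non-negative
integer lats and lons below 1000; the variable names are mine). It is the
same computation as weighting a histogram by the data values and dividing by
the unweighted histogram:

import numpy as np

keys = 1000 * lat + lon                  # Vincent's unique '1d label' per pair

sums = np.bincount(keys, weights=data)   # per-label sum of the data values
counts = np.bincount(keys)               # per-label observation count

occupied = counts > 0                    # only labels that actually occur
means = sums[occupied] / counts[occupied]
unique_keys = np.nonzero(occupied)[0]    # key // 1000 recovers lat, key % 1000 lon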
I was curious about how fast ndimage was for this operation, so here's the
complete function:

import collections

import numpy as np
import scipy.ndimage as ndi

N = 10000
lat = np.random.randint(0, 360, N)
lon = np.random.randint(0, 360, N)
data = np.random.randn(N)

def group_mean(lat, lon, data):
    # Sort rows by (lat, lon) so the labels come out in sorted order and
    # searchsorted can decode them back to coordinate pairs below.
    indexer = np.lexsort((lon, lat))
    lat = lat.take(indexer)
    lon = lon.take(indexer)
    sorted_data = data.take(indexer)

    # Encode each (lat, lon) pair as a single integer label.
    keys = 1000 * lat + lon
    unique_keys = np.unique(keys)

    # Mean of the data values within each label group.
    result = ndi.mean(sorted_data, labels=keys, index=unique_keys)

    # Find the first row carrying each unique label to recover its
    # (lat, lon) pair.
    decoder = keys.searchsorted(unique_keys)
    return dict(zip(zip(lat.take(decoder), lon.take(decoder)), result))

It appears to be about 13x faster (and could be made faster still) than the
naive version on my machine:

def group_mean_naive(lat, lon, data):
    # Collect every data value for each (lat, lon) pair, then average.
    grouped = collections.defaultdict(list)
    for lt, ln, da in zip(lat, lon, data):
        grouped[(lt, ln)].append(da)
    averaged = dict((ltln, np.mean(da)) for ltln, da in grouped.items())
    return averaged

I had to use the latest scipy trunk to avoid an error from ndimage.mean.
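A quick sanity check along these lines can confirm that the two functions
agree, up to floating-point rounding:

fast = group_mean(lat, lon, data)
slow = group_mean_naive(lat, lon, data)

assert set(fast) == set(slow)  # same set of (lat, lon) keys
assert all(np.allclose(fast[k], slow[k]) for k in fast)  # same means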