Hi, I have been investigating Ticket #605 'Incorrect behavior of numpy.histogram' (http://scipy.org/scipy/numpy/ticket/605 ).
The fix for this ticket depends on what behavior is expected at the bin limits, and different applications expect different behavior. Consequently, I think that feedback from the community is important. I have attached a modified histogram function, which I tried on a very simple and obvious example:

r = numpy.array([1,2,2,3,3,3,4,4,4,4,5,5,5,5,5])
dbin = [2,3,4]

The current ('Default') behavior gives the counts as array([2, 3, 9]). Here the values less than 2 are ignored and the last bin contains all values greater than or equal to 4.

1) Should the first bin also contain all values below the first limit, and the last bin contain all values greater than or equal to the last limit? This gives the counts as array([3, 3, 9]) (I termed this 'Accumulate' in the output).

2) Should any values outside the range of the bins be excluded? That is, remove any value that is smaller than the lowest bin limit or larger than the highest bin limit. This gives the counts as array([2, 3, 4]) (I termed this 'Exclude' in the output).

3) Should there be extra bins for these values? While I did not implement this option, it would give the counts as array([1,2,3,4,5]).

4) Is there some other expectation?

Thanks for any input,
Bruce
import numpy

def histo(a, bins=10, range=None, normed=False, variant=0):
    """Compute the histogram from a set of data.

    Parameters:

        a : array
            The data to histogram. n-D arrays will be flattened.

        bins : int or sequence of floats
            If an int, then the number of equal-width bins in the given range.
            Otherwise, a sequence of the lower bound of each bin.

        range : (float, float)
            The lower and upper range of the bins. If not provided, then
            (a.min(), a.max()) is used. Values outside of this range are
            allocated to the closest bin.

        normed : bool
            If False, the result array will contain the number of samples in
            each bin. If True, the result array is the value of the
            probability *density* function at the bin normalized such that the
            *integral* over the range is 1. Note that the sum of all of the
            histogram values will not usually be 1; it is not a probability
            *mass* function.

        variant : int
            How values outside the bin limits are counted:
            0 ('Default'), 1 ('Accumulate') or 2 ('Exclude').

    Returns:

        hist : array
            The values of the histogram. See `normed` for a description of the
            possible semantics.

        lower_edges : float array
            The lower edges of each bin.

    SeeAlso:

        histogramdd

    """
    a = numpy.asarray(a).ravel()

    if (range is not None):
        mn, mx = range
        if (mn > mx):
            raise AttributeError('max must be larger than min in range parameter.')

    if not numpy.iterable(bins):
        if range is None:
            range = (a.min(), a.max())
        mn, mx = [mi+0.0 for mi in range]
        if mn == mx:
            mn -= 0.5
            mx += 0.5
        bins = numpy.linspace(mn, mx, bins, endpoint=False)
    else:
        bins = numpy.asarray(bins)
        if (bins[1:]-bins[:-1] < 0).any():
            raise AttributeError('bins must increase monotonically.')

    # best block size probably depends on processor cache size
    block = 65536
    n = numpy.sort(a[:block]).searchsorted(bins)
    # numpy.arange is used because the built-in range is shadowed by the
    # `range` parameter
    for i in numpy.arange(block, a.size, block):
        n += numpy.sort(a[i:i+block]).searchsorted(bins)

    # number of samples below the first limit and above the last limit
    nlow = len(numpy.where(a < bins[0])[0])
    nhigh = len(numpy.where(a > bins[-1])[0])

    if variant == 1:
        # Accumulate: add the samples below the first limit to the first bin
        n = numpy.concatenate([n, [len(a)]])
        n = n[1:]-n[:-1]
        n[0] = n[0]+(len(a)-n.sum())
    elif variant == 2:
        # Exclude: drop samples below the first limit and above the last limit
        n = numpy.concatenate([n, [len(a)-nhigh]])
        n = n[1:]-n[:-1]
    else:
        # Default: samples below the first limit are ignored
        n = numpy.concatenate([n, [len(a)]])
        n = n[1:]-n[:-1]

    if normed:
        db = bins[1] - bins[0]
        return 1.0/(a.size*db) * n, bins
    else:
        return n, bins

r = numpy.array([1,2,2,3,3,3,4,4,4,4,5,5,5,5,5])
dbin = [2,3,4]
print('Default:', histo(r, bins=dbin, normed=False, variant=0))
print('Accumulate:', histo(r, bins=dbin, normed=False, variant=1))
print('Exclude: ', histo(r, bins=dbin, normed=False, variant=2))
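
Option 3 (extra bins for the out-of-range values) is not implemented in the function above. A minimal sketch of how it might look is given below. The name histo_extra_bins is hypothetical, and the sketch assumes that the last limit acts as an inclusive upper bound for the last regular bin, as in the 'Exclude' variant. It is only an illustration of the idea, not a proposed patch.

```python
import numpy as np

def histo_extra_bins(a, bins):
    """Hypothetical sketch of option 3: one extra bin below the first
    limit and one extra bin above the last limit."""
    a = np.sort(np.asarray(a).ravel())
    bins = np.asarray(bins)
    # Index of the first sample >= each limit (start of each regular bin).
    lower = a.searchsorted(bins, side='left')
    # Number of samples <= the last limit (assumed inclusive upper bound).
    upper = a.searchsorted(bins[-1], side='right')
    # Cumulative positions: start, regular bin starts, end of the last
    # regular bin, end of the data. Differences give the per-bin counts.
    edges = np.concatenate([[0], lower, [upper, a.size]])
    return np.diff(edges)

r = np.array([1,2,2,3,3,3,4,4,4,4,5,5,5,5,5])
dbin = [2,3,4]
print('Extra bins:', histo_extra_bins(r, dbin))  # Extra bins: [1 2 3 4 5]
```

The first and last entries of the result are the counts for the two extra bins, so the result has len(bins)+2 entries in total.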