[Numpy-discussion] Ticket #605 Incorrect behavior of numpy.histogram

Bruce Southey Sat, 05 Apr 2008 11:02:15 -0700

Hi,
I have been investigating Ticket #605 'Incorrect behavior of
numpy.histogram' (http://scipy.org/scipy/numpy/ticket/605 ).


The fix for this ticket really depends on what the expectations are
for the bin limits, and different applications expect different behavior.
Consequently, I think that feedback from the community is important.

I have attached a modified histogram function where I use a very
simple and obvious example:
r= numpy.array([1,2,2,3,3,3,4,4,4,4,5,5,5,5,5])
dbin=[2,3,4]

The current (Default) behavior provides the counts as array([2, 3,
9]). Here the values less than 2 are ignored and the last bin contains
all values greater than or equal to 4.
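
As a minimal sketch of where the Default counts come from, the arithmetic can be reproduced outside the attached function. This assumes a current Python 3 and NumPy rather than the Python 2 of the attachment; the names `below`, `cumulative` and `default_counts` are illustrative only.

```python
import numpy as np

r = np.array([1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5])
dbin = np.array([2, 3, 4])

# For each lower bound, count the samples strictly below it.
below = np.sort(r).searchsorted(dbin)            # 1, 3, 6
# Append the total sample count so the last bin runs to the end of the data.
cumulative = np.concatenate([below, [r.size]])   # 1, 3, 6, 15
# Differences of the cumulative counts are the per-bin counts.
default_counts = np.diff(cumulative)
print(default_counts)  # [2 3 9]
```

The single value below 2 is counted in `below[0]` but never enters a difference, which is why it disappears from the result.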

1) Should the first bin contain all values less than or equal to the
value of the first limit and the last bin contain all values greater
than the value of the last limit?
This produced the counts as: array([3, 3, 9]) (I termed this
'Accumulate' in the output).

2) Should any values outside the range of the bins be excluded?
That is, remove any value that is smaller than the lowest bin limit or
larger than the highest bin limit.
This produced the counts as: array([2, 3, 4]) (I termed this 'Exclude'
in the output)

3) Should there be extra bins for these values?
While I did not implement this option, it would provide the counts as:
array([1,2,3,4,5])

4) Is there some other expectation?
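
The options above can be compared side by side with a small sketch. The helper `bin_counts` and its `mode` names are hypothetical and are not part of NumPy; the "extra" mode is only one possible reading of option 3, assuming one additional bin below the first limit and one above the last. It is written for a current Python 3 and NumPy.

```python
import numpy as np

def bin_counts(a, bounds, mode="default"):
    """Count samples per bin under one of the policies discussed above.

    Hypothetical helper for illustration; not the NumPy implementation.
    """
    a = np.sort(np.asarray(a).ravel())
    bounds = np.asarray(bounds)
    below = a.searchsorted(bounds)        # samples strictly below each bound
    n_low = below[0]                      # samples below the first bound
    # Samples strictly above the last bound.
    n_high = a.size - a.searchsorted(bounds[-1], side="right")

    # Default: low values are ignored, the last bin is open-ended.
    default = np.diff(np.concatenate([below, [a.size]]))
    if mode == "default":
        return default
    if mode == "accumulate":
        # Low values are folded into the first bin.
        out = default.copy()
        out[0] += n_low
        return out
    # Exclude: values outside [bounds[0], bounds[-1]] are dropped.
    exclude = np.diff(np.concatenate([below, [a.size - n_high]]))
    if mode == "exclude":
        return exclude
    if mode == "extra":
        # Out-of-range values get their own bins at either end.
        return np.concatenate([[n_low], exclude, [n_high]])
    raise ValueError("unknown mode: %r" % (mode,))

r = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]
for mode in ("default", "accumulate", "exclude", "extra"):
    print(mode, bin_counts(r, [2, 3, 4], mode))
# default [2 3 9]
# accumulate [3 3 9]
# exclude [2 3 4]
# extra [1 2 3 4 5]
```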

Thanks for any input,
Bruce

import numpy
def histo(a, bins=10, range=None, normed=False, variant=0):
    """Compute the histogram from a set of data.

    Parameters:

        a : array
            The data to histogram. n-D arrays will be flattened.

        bins : int or sequence of floats
            If an int, then the number of equal-width bins in the given range.
            Otherwise, a sequence of the lower bound of each bin.

        range : (float, float)
            The lower and upper range of the bins. If not provided, then
            (a.min(), a.max()) is used. Values outside of this range are
            allocated to the closest bin.

        normed : bool
            If False, the result array will contain the number of samples in
            each bin.  If True, the result array is the value of the
            probability *density* function at the bin normalized such that the
            *integral* over the range is 1. Note that the sum of all of the
            histogram values will not usually be 1; it is not a probability
            *mass* function.

    Returns:

        hist : array
            The values of the histogram. See `normed` for a description of the
            possible semantics.

        lower_edges : float array
            The lower edges of each bin.

    SeeAlso:

        histogramdd

    """
    a = numpy.asarray(a).ravel()
    
    if (range is not None):
        mn, mx = range
        if (mn > mx):
            raise AttributeError('max must be larger than min in range parameter.')

    if not numpy.iterable(bins):
        if range is None:
            range = (a.min(), a.max())
        mn, mx = [mi+0.0 for mi in range]
        if mn == mx:
            mn -= 0.5
            mx += 0.5
        bins = numpy.linspace(mn, mx, bins, endpoint=False)
    else:
        bins = numpy.asarray(bins)
        if (bins[1:]-bins[:-1] < 0).any():
            raise AttributeError('bins must increase monotonically.')

    # best block size probably depends on processor cache size
    block = 65536
    n = numpy.sort(a[:block]).searchsorted(bins)
    for i in xrange(block, a.size, block):
        n += numpy.sort(a[i:i+block]).searchsorted(bins)
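    # Number of samples below the first bound and above the last bound.
    nlow= len(numpy.where(a<bins[0])[0])
    nhigh=len(numpy.where(a>bins[-1])[0])
    if variant==1:
        # 'Accumulate': samples below the first bound go into the first bin.
        n = numpy.concatenate([n, [len(a)]])
        n = n[1:]-n[:-1]
        n[0]=n[0]+(len(a)-n.sum())
    elif variant==2:
        # 'Exclude': samples above the last bound are dropped.
        n = numpy.concatenate([n, [len(a)-nhigh]])
        n = n[1:]-n[:-1]
    else:
        # 'Default': the last bin collects everything at or above the last bound.
        n = numpy.concatenate([n, [len(a)]])
        n = n[1:]-n[:-1]
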
    if normed:
        db = bins[1] - bins[0]
        return 1.0/(a.size*db) * n, bins
    else:
        return n, bins


r= numpy.array([1,2,2,3,3,3,4,4,4,4,5,5,5,5,5])
dbin=[2,3,4]

print 'Default:', histo(r, bins=dbin, normed=False, variant=0)
print 'Accumulate:', histo(r, bins=dbin, normed=False, variant=1)
print 'Exclude: ', histo(r, bins=dbin, normed=False, variant=2)
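
The blockwise counting inside histo relies on the fact that the number of samples below a bound is additive across blocks, so sorting each block separately gives the same totals as sorting the whole array. A minimal sketch of that check follows, assuming a current Python 3 and NumPy; `blocked_searchsorted` is a hypothetical name used only for this illustration.

```python
import numpy as np

def blocked_searchsorted(a, bounds, block=65536):
    """Sum per-block counts of samples strictly below each bound."""
    a = np.asarray(a).ravel()
    bounds = np.asarray(bounds)
    total = np.zeros(bounds.shape, dtype=np.intp)
    for start in range(0, a.size, block):
        # Sort one block at a time to keep the working set small.
        total += np.sort(a[start:start + block]).searchsorted(bounds)
    return total

rng = np.random.default_rng(0)
data = rng.normal(size=200_000)
edges = np.array([-1.0, 0.0, 1.0])

blocked = blocked_searchsorted(data, edges, block=4096)
full = np.sort(data).searchsorted(edges)
print(np.array_equal(blocked, full))  # True
```

The block size only affects speed and memory use, not the resulting counts.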
