Re: [Numpy-discussion] Ticket #605 Incorrect behavior of numpy.histogram

David Huard Mon, 07 Apr 2008 19:14:00 -0700

> On Apr 7, 2008, at 4:14 PM, LB wrote:
> > +1 for axis and +1 for a keyword to define what to do with values
> > outside the range.
> >
> > For the keyword, ather than 'outliers', I would propose 'discard' or
> > 'exclude', because it could be used to describe the four
> > possibilities :
> >  - discard='low'      => values lower than the range are discarded,
> > values higher are added to the last bin
> >   - discard='up'       => values higher than the range are discarded,
> > values lower are added to the first bin
> >   - discard='out'      => values out of the range are discarded
> >   - discard=None    => values outside of this range are allocated to
> > the closest bin
> >




Suppose you set bins=5, range=[0,10], discard=None, should the returned bins
be [0,2,4,6,810] or [-inf, 2, 4, 6, 8, inf] ?
Now suppose normed=True, what should be the density for the first and last
bin ? It seems to me it should be zero since we are assuming that the bins
extend to -infinity and infinity, but then, taking the outliers into account
seems pretty useless.

Overall, I think "discard" is a confusing option with little added value.
Getting the outliers is simply a matter of defining the bin edges explictly,
ie [-inf, x0, x1, ..., xn, inf].

In any case, attached is a version of histogram implementing the axis and
discard keywords. I'd really prefer though if we dumped the discard option.

David

>
>
>
>
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>

def histogram(a, bins=10, range=None, normed=False, discard='out', axis=None):
    """Compute the histogram from a set of data.

    Parameters:

        a : array
            The data to histogram.

        bins : int or sequence of floats
            If an int, then the number of equal-width bins in the given range.
            Otherwise, a sequence of the lower bound of each bin.

        range : (float, float)
            The lower and upper range of the bins. If not provided, then
            (a.min(), a.max()) is used. Values outside of this range are
            allocated according to the discard keyword.

        normed : bool
            If False, the result array will contain the number of samples in
            each bin.  If True, the result array is the value of the
            probability *density* function at the bin normalized such that the
            *integral* over the range is 1. Note that the sum of all of the
            histogram values will not usually be 1; it is not a probability
            *mass* function.

        discard : out, low, high, None
            With out, values outside range are not tallied, using low (high),
            values lower (greater) than range are discarded, and values higher
            (lower) than range are tallied in the closest bin. Using None,
            values outside of range are stored in the closest bin.
           
        axis : None or int
            Axis along which histogram is performed. If None, applies on the
            entire array.
       
    Returns:

        hist : array
            The values of the histogram. See `normed` for a description of the
            possible semantics.

        edges : float array
            The bins edges.

    SeeAlso:

        histogramdd

    """
    a = asarray(a).ravel()

    if (range is not None):
        mn, mx = range
        if (mn > mx):
            raise AttributeError, 'max must be larger than min in range parameter.'

    if not iterable(bins):
        if range is None:
            range = (a.min(), a.max())
        mn, mx = [mi+0.0 for mi in range]
        if mn == mx:
            mn -= 0.5
            mx += 0.5
        bins = linspace(mn, mx, bins+1, endpoint=True)
    else:
        bins = asarray(bins)
        if (np.diff(bins) < 0).any():
            raise AttributeError, 'bins must increase monotonically.'
      
    if discard is None:
        bins = np.r_[-np.inf, bins[1:-1], np.inf]
    elif discard == 'low':
        bins = np.r_[bins[:-1], np.inf]
    elif discard == 'high':
        bins = np.r_[-np.inf, bins[1:]]
    elif discard == 'out':
        pass
    else:
        raise ValueError, 'discard keyword not in None, out, high, low : %s'%discard
      
    if axis is None:
        return histogram1d(a.ravel(), bins, normed), bins
    else:
        return np.apply_along_axis(histogram1d, axis, a, bins, normed), bins
       

def histogram1d(a, bins, normed):
    """Internal usage function to compute an histogram on a 1D array.
   
    Parameters:
      a : array
          The data to histogram.
      bins : sequence
          The edges of the bins.
      normed : bool
          If false, return the number of samples falling into each bin. If true, return
          the density of the sample in each bin.
    """
    # best block size probably depends on processor cache size
    block = 65536
    n = np.zeros(bins.shape, int)
    for i in xrange(0, a.size, block):
        sa = sort(a[:block])
        n += np.r_[sa.searchsorted(bins[:-1], 'left'), sa.searchsorted(bins[-1], 'right')]
   
    n = np.diff(n)
   
    if normed:
        db = np.diff(bins)
        return n/db
    else:
        return n

class TestHistogram(NumpyTestCase):
    def check_simple(self):
        n=100
        v=rand(n)
        (a,b)=histogram(v)
        #check if the sum of the bins equals the number of samples
        assert_equal(sum(a,axis=0), n)
        #check that the bin counts are evenly spaced when the data is from a
        # linear function
        (a,b)=histogram(linspace(0,10,100))
        assert_array_equal(a, 10)
       
    def check_variable_bin_width(self):
        a = arange(10)+.5
        h,b = histogram(a, bins=[0,7, 10])
        assert_array_equal(h, array([7,3]))
       
        h,b = histogram(a, bins=[0,7, 10], normed=True)
        assert_array_equal(h, array([1, 1]))
       
    def check_outliers(self):
        a = arange(10)+.5
        h,b = histogram(a, range=[1,9])
        assert_equal(h.sum(),8)
       
    def check_discard(self):
        a = arange(10)+.5
        h,b = histogram(a, range=[1,9], discard='low')
        assert_equal(h[-1], 2)
        h,b = histogram(a, range=[1,9], discard='high')
        assert_equal(h[0], 2)
        h,b = histogram(a, range=[1,9], discard=None)
        assert_equal(h[0], 2)
        assert_equal(h[-1], 2)
        h,b = histogram(a, range=[1,9], normed=True, discard=None)
        assert_equal(h[0], 0)

_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion

Re: [Numpy-discussion] Ticket #605 Incorrect behavior of numpy.histogram

Reply via email to