Re: [Numpy-discussion] histogram complete makeover

2006-10-20 Thread David Huard
Thanks for the comments.

Here is the code for the new histogram, tests included. I'll wait for comments or suggestions before submitting a patch (numpy / scipy?).

Cheers,

David

2006/10/18, Tim Hochberg <[EMAIL PROTECTED]>:
> My $0.02:
>
> If histogram is going to get a makeover, particularly one that makes it
> more complex than at present, it should probably be moved to SciPy.
> Failing that, it should be moved to a submodule of numpy with similar
> statistical tools. Preferably with consistent interfaces for all of the
> functions.

-
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
___
Numpy-discussion mailing list
Numpy-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/numpy-discussion

# License: Scipy compatible
# Author: David Huard, 2006
from numpy import *


def histogram(a, bins=10, range=None, normed=False, weights=None, axis=None):
    """histogram(a, bins=10, range=None, normed=False, weights=None, axis=None)
                                                                   -> H, dict

    Return the distribution of sample.

    Parameters
    ----------
    a:       Array sample.
    bins:    Number of bins, or
             an array of bin edges, in which case the range is not used.
    range:   Lower and upper bin edges, default: [min, max].
    normed:  Boolean, if False, return the number of samples in each bin,
             if True, return a frequency distribution.
    weights: Sample weights.
    axis:    Specifies the dimension along which the histogram is computed.
             Defaults to None, which aggregates the entire sample array.

    Output
    ------
    H:            The number of samples in each bin.
                  If normed is True, H is a frequency distribution.
    dict{
    'edges':      The bin edges, including the rightmost edge.
    'upper':      Upper outliers.
    'lower':      Lower outliers.
    'bincenters': Center of bins.
    }

    Examples
    --------
    x = random.rand(100,10)
    H, Dict = histogram(x, bins=10, range=[0,1], normed=True)
    H2, Dict = histogram(x, bins=10, range=[0,1], normed=True, axis=0)

    See also: histogramnd
    """

    a = asarray(a)
    if axis is None:
        a = atleast_1d(a.ravel())
        axis = 0

    # Bin edges.
    if not iterable(bins):
        if range is None:
            range = (a.min(), a.max())
        mn, mx = [mi+0.0 for mi in range]
        if mn == mx:
            mn -= 0.5
            mx += 0.5
        edges = linspace(mn, mx, bins+1, endpoint=True)
    else:
        edges = asarray(bins, float)

    dedges = diff(edges)
    decimal = int(-log10(dedges.min())+6)
    bincenters = edges[:-1] + dedges/2.

    # apply_along_axis accepts only one array input, but we need to pass the
    # weights along with the sample. The strategy here is to concatenate the
    # weights array along axis, so the passed array contains [sample, weights].
    # The array is then split back in  __hist1d.
    if weights is not None:
        aw = concatenate((a, weights), axis)
        weighted = True
    else:
        aw = a
        weighted = False

    count = apply_along_axis(__hist1d, axis, aw, edges, decimal, weighted)

    # Outlier count
    upper = count.take(array([-1]), axis)
    lower = count.take(array([0]), axis)

    # Non-outlier count
    core = a.ndim*[slice(None)]
    core[axis] = slice(1, -1)
    # Index with a tuple: recent numpy releases reject a list of slices.
    hist = count[tuple(core)]

    if normed:
        normalize = lambda x: atleast_1d(x/(x*dedges).sum())
        hist = apply_along_axis(normalize, axis, hist)

    return hist, {'edges':edges, 'lower':lower, 'upper':upper, \
        'bincenters':bincenters}

 
def __hist1d(aw, edges, decimal, weighted):
    """Internal routine to compute the 1d histogram.
    aw: sample, [weights]
    edges: bin edges
    decimal: approximation to put values lying on the rightmost edge in the last
             bin.
    weighted: Means that the weights are appended to array a.
    Return the bin count or frequency if normed.
    """
    nbin = edges.shape[0]+1
    if weighted:
        count = zeros(nbin, dtype=float)
        a,w = hsplit(aw,2)
        w = w/w.mean()
    else:
        a = aw
        count = zeros(nbin, dtype=int)
        w = None

    binindex = digitize(a, edges)

    # Values that fall on an edge are put in the right bin.
    # For the rightmost bin, we want values equal to the right
    # edge to be counted in the last bin, and not as an outlier.
    on_edge = where(around(a,decimal) == around(edges[-1], decimal))[0]
    binindex[on_edge] -= 1

    # Count the number
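
The attached code breaks off at this point in the archive. Purely as an illustration of the digitize-then-count idea described in the comments above, and not as the author's actual code, the sketch below assigns each value a bin index with `digitize`, moves values lying exactly on the rightmost edge into the last regular bin, and tallies the indices with `bincount`. The helper name `count_with_outliers` is hypothetical, and the sketch assumes exact equality on the right edge instead of the rounding to `decimal` places used in the attachment.

```python
import numpy as np

def count_with_outliers(sample, edges, weights=None):
    """Hypothetical helper: counts laid out as
    [lower outliers, bin 1, ..., bin N, upper outliers]."""
    sample = np.asarray(sample, dtype=float)
    edges = np.asarray(edges, dtype=float)

    # digitize returns 0 for values below edges[0] and len(edges) for
    # values at or above edges[-1]; regular bins get indices 1..N.
    binindex = np.digitize(sample, edges)

    # Values exactly on the rightmost edge belong to the last regular bin,
    # not to the upper outliers (exact comparison, an assumption here).
    binindex[sample == edges[-1]] -= 1

    # minlength guarantees one slot per bin even when some are empty.
    return np.bincount(binindex, weights=weights, minlength=len(edges) + 1)

# Two bins, [0, 0.5) and [0.5, 1], plus one outlier on each side.
counts = count_with_outliers([-1.0, 0.0, 0.5, 1.0, 2.0], [0.0, 0.5, 1.0])
print(counts)
```

The first and last entries of `counts` are the lower and upper outlier tallies; the entries in between are the in-range histogram.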

Re: [Numpy-discussion] histogram complete makeover

2006-10-18 Thread Tim Hochberg

My $0.02:

If histogram is going to get a makeover, particularly one that makes it 
more complex than at present, it should probably be moved to SciPy. 
Failing that, it should be moved to a submodule of numpy with similar 
statistical tools. Preferably with consistent interfaces for all of the 
functions.




Re: [Numpy-discussion] histogram complete makeover

2006-10-18 Thread Erin Sheldon
On 10/17/06, David Huard <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I'd like to poll the list to see what people want from numpy.histogram(),
> since I'm currently writing a contender.
>
> My main complaints with the current version are:
> 1. upper outliers are stored in the last bin, while lower outliers are not
> counted at all,
> 2. cannot use weights.
>
> The new histogram function is well under way (it addresses these issues and
> adds an axis keyword),
> but I want to know what is the preferred behavior regarding the function
> output, and your
> willingness to introduce a new behavior that will break some code.
>
> Given a number of bins N and range (min, max), histogram constructs linearly
> spaced bin edges
> b0 (out-of-range)  | b1 | b2 | b3 | ... | bN | bN+1 out-of-range
> and may return:
>
> A.  H = array([N_b0, N_b1, ..., N_bN,  N_bN+1])
> The out-of-range values are the first and last values of the array. The
> returned array hence has length N+2.
>
> B.  H = array([N_b0 + N_b1, N_b2, ..., N_bN + N_bN+1])
> The lower and upper out-of-range values are added to the first and last bin
> respectively.
>
> C.  H = array([N_b1, ..., N_bN + N_bN+1])
> Current behavior: the upper out-of-range values are added to the last bin.
>
> D.  H = array([N_b1, N_b2, ..., N_bN]),
> Lower and upper out-of-range values are given after the histogram array.
>
> Ideally, the new function would not break the common usage: H =
> histogram(x)[0], so this excludes A.  B and C are not acceptable in my
> opinion, so only D remains, with the downside that the outliers are not
> returned. A solution might be to add a keyword full_output=False, which when
> set to True, returns the out-of-range values in a dictionary.
>
> Also, the current function returns -> H, ledges
>  where ledges is the array of left bin edges (N).
>  I propose returning the complete array of edges (N+1), including the
> rightmost edge. This is a little bit impractical for plotting, as the edges
> array does not have the same length as the histogram array, but allows the
> use of user-defined non-uniform bins.
>
> Opinions, suggestions ?

I dislike the current behavior.  I don't want the histogram
to count anything outside the range I specify.

It would also be nice to allow specification of a binsize
which would be used if number of bins wasn't sent.

Personally, since I don't have any code yet that uses
histogram, I feel like edges could be returned in a
keyword.  Perhaps in a dictionary with other useful items, such
as bin middles, mean of the data in bins and other statistics, or
whatever, which would only be calculated if the keyword
dict was sent.

Hopefully Google and sourceforge are playing nice and
you will see this within a day of sending.
Erin
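
A minimal sketch of the two suggestions above, under stated assumptions: `edges_from_binsize` is a hypothetical helper (it is not part of numpy or of David's code) that builds the N+1 bin edges from a bin width, and the `extras` dictionary shows how left edges and bin centers could be derived from those edges for plotting. The last edge may fall beyond the requested upper bound when the range is not a multiple of the bin size.

```python
import math
import numpy as np

def edges_from_binsize(lower, upper, binsize):
    """Hypothetical helper: N+1 equally spaced edges of width `binsize`
    that cover [lower, upper]."""
    nbins = max(1, int(math.ceil((upper - lower) / binsize)))
    return lower + binsize * np.arange(nbins + 1)

edges = edges_from_binsize(0.0, 1.0, 0.25)

# Optional items that could be returned in a dictionary alongside the counts.
extras = {
    'edges': edges,                                   # N+1 values
    'left_edges': edges[:-1],                         # N values, same length as H
    'bincenters': edges[:-1] + np.diff(edges) / 2.0,  # N values
}
print(extras)
```

Both `left_edges` and `bincenters` have the same length as the histogram array, which sidesteps the plotting inconvenience of the N+1 edge array.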



Re: [Numpy-discussion] histogram complete makeover

2006-10-17 Thread Neal Becker
David Huard wrote:

> Hi all,
> 
> I'd like to poll the list to see what people want from numpy.histogram(),
> since I'm currently writing a contender.
> 
> My main complaints with the current version are:
> 1. upper outliers are stored in the last bin, while lower outliers are not
> counted at all,
> 2. cannot use weights.
> 
> The new histogram function is well under way (it addresses these issues and
> adds an axis keyword),
> but I want to know what is the preferred behavior regarding the function
> output, and your
> willingness to introduce a new behavior that will break some code.
> 
> Given a number of bins N and range (min, max), histogram constructs
> linearly spaced bin edges
> b0 (out-of-range)  | b1 | b2 | b3 | ... | bN | bN+1 out-of-range
> and may return:
> 
> A.  H = array([N_b0, N_b1, ..., N_bN,  N_bN+1])
> The out-of-range values are the first and last values of the array. The
> returned array hence has length N+2.
> 
> B.  H = array([N_b0 + N_b1, N_b2, ..., N_bN + N_bN+1])
> The lower and upper out-of-range values are added to the first and last
> bin respectively.
> 
> C.  H = array([N_b1, ..., N_bN + N_bN+1])
> Current behavior: the upper out-of-range values are added to the last bin.
> 
> D.  H = array([N_b1, N_b2, ..., N_bN]),
> Lower and upper out-of-range values are given after the histogram array.
> 
> Ideally, the new function would not break the common usage: H =
> histogram(x)[0], so this excludes A.  B and C are not acceptable in my
> opinion, so only D remains, with the downside that the outliers are not
> returned. A solution might be to add a keyword full_output=False, which
> when set to True, returns the out-of-range values in a dictionary.
> 
> Also, the current function returns -> H, ledges
> where ledges is the array of left bin edges (N).
> I propose returning the complete array of edges (N+1), including the
> rightmost edge. This is a little bit impractical for plotting, as the
> edges array does not have the same length as the histogram array, but
> allows the use of user-defined non-uniform bins.
> 
> Opinions, suggestions ?
> 
> David

I have my own histogram that might interest you.  The core is modern c++,
with boost::python wrapper.

Out-of-bounds behavior is programmable.  I'll send it to you if you are
interested.




[Numpy-discussion] histogram complete makeover

2006-10-17 Thread David Huard
Hi all,

I'd like to poll the list to see what people want from numpy.histogram(),
since I'm currently writing a contender.

My main complaints with the current version are:
1. upper outliers are stored in the last bin, while lower outliers are not
   counted at all,
2. cannot use weights.

The new histogram function is well under way (it addresses these issues and
adds an axis keyword), but I want to know what is the preferred behavior
regarding the function output, and your willingness to introduce a new
behavior that will break some code.

Given a number of bins N and range (min, max), histogram constructs linearly
spaced bin edges
b0 (out-of-range)  | b1 | b2 | b3 | ... | bN | bN+1 out-of-range
and may return:

A.  H = array([N_b0, N_b1, ..., N_bN,  N_bN+1])
The out-of-range values are the first and last values of the array. The
returned array hence has length N+2.

B.  H = array([N_b0 + N_b1, N_b2, ..., N_bN + N_bN+1])
The lower and upper out-of-range values are added to the first and last bin
respectively.

C.  H = array([N_b1, ..., N_bN + N_bN+1])
Current behavior: the upper out-of-range values are added to the last bin.

D.  H = array([N_b1, N_b2, ..., N_bN]),
Lower and upper out-of-range values are given after the histogram array.

Ideally, the new function would not break the common usage: H =
histogram(x)[0], so this excludes A.  B and C are not acceptable in my
opinion, so only D remains, with the downside that the outliers are not
returned. A solution might be to add a keyword full_output=False, which when
set to True, returns the out-of-range values in a dictionary.

Also, the current function returns -> H, ledges
where ledges is the array of left bin edges (N).
I propose returning the complete array of edges (N+1), including the
rightmost edge. This is a little bit impractical for plotting, as the
edges array does not have the same length as the histogram array, but
allows the use of user-defined non-uniform bins.

Opinions, suggestions?

David
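
To make the four return conventions A to D concrete, here is a small sketch that derives each of them from one full count array laid out as [N_b0, N_b1, ..., N_bN, N_bN+1]. The numbers in `full` are made up for illustration, and the variable names are assumptions, not part of any numpy API.

```python
import numpy as np

# Made-up counts for N = 3 bins:
# [lower outliers, b1, b2, b3, upper outliers]
full = np.array([2, 5, 7, 3, 1])

# A: outliers kept as the first and last entries (length N+2).
option_a = full.copy()

# B: lower outliers folded into the first bin, upper into the last.
option_b = full[1:-1].copy()
option_b[0] += full[0]
option_b[-1] += full[-1]

# C: only the upper outliers folded into the last bin; lower ones dropped.
option_c = full[1:-1].copy()
option_c[-1] += full[-1]

# D: in-range counts only, with the outliers reported separately.
option_d = full[1:-1].copy()
outliers = {'lower': full[0], 'upper': full[-1]}

print(option_a, option_b, option_c, option_d, outliers)
```

Only D keeps `histogram(x)[0]` equal to the in-range counts while still letting the outliers be reported on request.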