On Fri, 2010-10-22 at 13:39 -0500, Ryan May wrote:
> Thanks for that. This actually led me here:
> http://en.wikipedia.org/wiki/Histogram which gives a bunch of
> different ways to estimate the number of bins/binsize. It might be
> worth looking at one of these in general. However, ironically enough,
> these wouldn't actually give the original poster the desired
> results--the binsizes would lead to lots of bins, many of which would
> be empty due to the integer data. In fact, it seems that all of these
> methods are going to break down due to integer data. I guess you could
> take the ceiling of the calculated binsize...anyone have an opinion on
> whether calculating binsize/nbins would be a step forward over leaving
> the default (of 10) and letting the user calculate if they like?

Integer histograms are a different beast altogether. For integer data it
is easy to define a natural bin width: 1. The only sensible alternatives
are integer multiples of that.

import numpy as np
import matplotlib.pyplot as plt

data = np.int32(np.rint(200*np.random.randn(10000)))
dmin, dmax = data.min(), data.max()
axis = np.arange(dmin, dmax+1)
hist = np.zeros((dmax-dmin+1,), dtype=np.int32)
# Unfortunately the shortcut hist[data-dmin] += 1 does not work: with
# repeated indices, the fancy-indexed assignment only increments each
# bin once, no matter how often that index occurs in data.
# Explicit loop:
for item in data:
    hist[item-dmin] += 1

plt.plot(axis, hist)
plt.show()
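For what it's worth, np.bincount does the same counting without the
explicit Python loop (a small sketch reusing data from above; the shift
makes the values non-negative, which bincount requires):

shifted = data - data.min()        # bincount needs non-negative integers
hist_fast = np.bincount(shifted)   # same counts as the explicit loop above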

This histogram can easily be regrouped to any sensible bin size, since a
width of 1 is the finest possible increment. With floats you have to do
things the hard way, because there is no such thing as a natural bin size.
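As a rough sketch of that "hard way" (using the Freedman-Diaconis rule
from the Wikipedia page Ryan mentioned; the choice of rule is only for
illustration, not a recommendation):

fdata = 200.0*np.random.randn(10000)              # continuous (float) data
q75, q25 = np.percentile(fdata, [75, 25])         # interquartile range
width = 2.0*(q75 - q25)/len(fdata)**(1.0/3.0)     # Freedman-Diaconis bin width
nbins = int(np.ceil((fdata.max() - fdata.min())/width))
fhist, fedges = np.histogram(fdata, bins=nbins)
plt.plot((fedges[:-1] + fedges[1:])/2.0, fhist)   # plot counts at bin centres
plt.show()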

And yes, the np.histogram() function is much faster.

hist2 = np.histogram(data, bins=data.max()-data.min())  # returns (counts, edges)
plt.plot(hist2[1][0:-1]+0.5, hist2[0])                   # plot counts at bin centres
plt.show()

I don't like putting the data values on the bin boundaries, though, since
in this case it is perfectly clear what the bins should be.
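If you want the integers away from the boundaries, you can pass explicit
bin edges offset by half a unit (a sketch reusing data from the first
snippet):

edges = np.arange(data.min() - 0.5, data.max() + 1.5)   # edges between integers
hist3, edges3 = np.histogram(data, bins=edges)
plt.plot(edges3[:-1] + 0.5, hist3)                       # integers at bin centres
plt.show()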

Yes, this is not so much a hard suggestion as a line of thought. Treating
integer data differently from pseudo-continuous data is, in my view, the
natural way to build histograms. Scaling (grouping bins) could then be
done to ensure that the most populated bin contains at least 4*ndata/nbins
points (yes, this fails for uniformly distributed data).
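
A rough sketch of what I mean by grouping, reusing the unit-width hist
from the first snippet (the 4*ndata/nbins target is only the heuristic
above, not something I would insist on):

def regroup(hist, factor):
    """Merge every `factor` consecutive unit-width bins into one."""
    n = (len(hist) + factor - 1)//factor              # number of grouped bins
    padded = np.zeros(n*factor, dtype=hist.dtype)     # pad so the length divides evenly
    padded[:len(hist)] = hist
    return padded.reshape(n, factor).sum(axis=1)

# e.g. increase `factor` until regroup(hist, factor).max() reaches the
# 4*ndata/nbins target mentioned above.
coarse = regroup(hist, 5)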

Maarten
-- 
KNMI, De Bilt
T: 030 2206 747
E: maarten.sn...@knmi.nl
Room B 2.42

