Though widely used, Sturges' rule is a poor choice. See, e.g., the following paper (chosen because it is freely available and easy to understand even for non-mathematicians like most of us here):

http://robjhyndman.com/papers/sturges.pdf

Essential excerpt:

----------
Alternative rules for constructing histograms include Scott's (1979) rule for the class width:

    h = 3.5*s*n**(-1/3)

and Freedman and Diaconis's (1981) rule for the class width:

    h = 2*(IQ)*n**(-1/3)

where s is the sample standard deviation and IQ is the sample interquartile range. [** obviously stands for power]

Either of these are just as simple to use as Sturges' rule, but are well-founded in statistical theory. Sturges' rule has probably survived as long as it has because, for moderate n (less than 200), it gives similar results to the alternative rules above (see Scott, 1992, p.56), and so produces reasonable histograms. However, it does not work for large n. The problem with Sturges' rule is that its derivation is wrong. It is a rule which no longer deserves a place in statistics textbooks or as a default in statistical computer packages.
----------

So, here's a good chance for PSPP to implement, with minimum effort, something better than many commercial stats packages (BTW, several of which I admire and have been using daily for a looong time, while PSPP still has a looong way to go before ...).

Cordial regards and all the best with the further advancement of the PSPP project,

Gaj Vidmar

---
Assist. Prof. Gaj Vidmar, PhD
University Rehabilitation Institute, Republic of Slovenia
& Univ. of Ljubljana, Fac. of Medicine, Inst. for Biostatistics and Medical Informatics

"Ben Pfaff" <[email protected]> wrote in message news:[email protected]...
> [cleaning out my old email]
>
> Erik Frebold <[email protected]> writes:
>
>> 2. Re: Histogram-- I was puzzled as to why this function would
>> assign only six bins to an n=1000 dataset. Upon investigation,
>> looks like the number of bins is assigned by gsl, right? (which
>> I assume would use something suitable like Sturges' formula,
>> though I wasn't able to find the actual stretch of code) So is
>> it likely just that the pspp part of the process might not be
>> quite ready for primetime yet? No problem if this is the case--
>> I can use matplotlib or similar for now.
>
> I took a look at this. PSPP always uses exactly 11 bins for
> every histogram! This seems less than optimal.
>
> I pushed out a commit that uses Sturges' formula, with a minimum
> of 5 buckets.
> --
> Ben Pfaff
> http://benpfaff.org

_______________________________________________
Pspp-users mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/pspp-users
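For anyone who wants to try the rules quoted in the excerpt above, here is a minimal Python sketch using only the standard library. It is not PSPP's or gsl's code. The function names are made up for illustration. Sturges' formula is taken in its usual form k = 1 + log2(n), rounded up, which the excerpt itself does not spell out. The quartile convention is simply the default of statistics.quantiles, which is an assumption on my part and not something the paper prescribes.

```python
import math
import statistics


def sturges_bins(n):
    """Number of bins by Sturges' rule, assumed here as ceil(1 + log2(n))."""
    return math.ceil(1 + math.log2(n))


def scott_width(data):
    """Class width by Scott's (1979) rule: h = 3.5 * s * n**(-1/3)."""
    s = statistics.stdev(data)  # sample standard deviation
    return 3.5 * s * len(data) ** (-1 / 3)


def fd_width(data):
    """Class width by Freedman and Diaconis's (1981) rule: h = 2 * IQ * n**(-1/3)."""
    # Quartiles use the default ('exclusive') method; other conventions
    # give slightly different IQ values for small samples.
    q1, _, q3 = statistics.quantiles(data, n=4)
    return 2 * (q3 - q1) * len(data) ** (-1 / 3)


def bins_from_width(data, h):
    """Convert a class width into a bin count covering the data range."""
    return math.ceil((max(data) - min(data)) / h)
```

Note that sturges_bins depends only on n (for n = 1000 it gives 11 bins whatever the data look like), whereas the two width-based rules also take the spread of the data into account.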
