Re: [R] Displaying a distribution -- was: Combining two histograms

Spencer Graves Wed, 02 Feb 2005 10:48:56 -0800

There are PP plots and QQ plots for any distribution, and I've experimented a little (though not much) with PP plots and with QQ plots for uniform, Student's t, chi-square and F distributions. I've found qqnorm plots very useful, but I don't recall learning much from any other probability plots.

In data mining situations, I've computed hundreds of p-values, possibly associated with Student's t or F or log(likelihood ratio) approximate chi-squares. I've found it useful to convert them to normal scores via qnorm(p), then make a normal plot of those scores. Points off the line in the lower tail are statistically significant. Points on the line are consistent with chance alone. If the slope of the line differs from one, or the line is not centered at zero, there may be hidden components of variance or serial dependence of various kinds that I'm not modeling properly in computing the p-values. This "p-value plot" seems to provide a subtle check and first-order correction for a variety of violations of assumptions like these.
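A minimal sketch of such a p-value plot, assuming simulated p-values rather than anything from a real analysis. The mix of 480 null and 20 "signal" p-values, and the use of the median and IQR to gauge the center and slope of the line, are illustrative assumptions and not part of the description above.

```r
# Sketch of a "p-value plot"; all data here are simulated for illustration.
set.seed(1)
p_null   <- runif(480)            # p-values from true null hypotheses
p_signal <- rbeta(20, 1, 200)     # hypothetical real effects: very small p-values
p <- c(p_null, p_signal)

z <- qnorm(p)                     # normal scores; small p gives large negative z

# Normal plot of the scores with the reference line y = x (slope 1, center 0).
qqnorm(z, main = "Normal plot of qnorm(p)")
abline(0, 1, lty = 2)

# Rough robust check of center and slope from the bulk of the scores.
# Median and IQR are an assumed choice, made to resist the tail points.
center <- median(z)
slope  <- IQR(z) / (qnorm(0.75) - qnorm(0.25))
c(center = center, slope = slope)
```

Under these assumptions the bulk of the points should follow the dashed line, with the simulated effects dropping below it in the lower left. A slope well away from one or a center well away from zero would be the cue to revisit how the p-values were computed.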

Comments?

Best Wishes,
spencer

p.s. Regarding normal plots with millions of points: I find them still useful. However, we need some kind of heuristic to decimate excess points so we still get the same visual image without a plot object that consumes gigabytes on the hard drive, hours to plot, and can't be exported to PowerPoint, for example.
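One possible decimation heuristic, sketched under stated assumptions: keep every point in the two tails, where individual observations matter, and thin the middle on a grid evenly spaced in theoretical normal quantiles. The helper name thin_qq and its defaults (n_tail = 100, n_mid = 2000) are hypothetical choices for illustration, not an established method.

```r
# Hypothetical helper: thin a normal QQ plot to a few thousand points.
# Keeps the n_tail smallest and largest order statistics exactly and
# samples the middle on a grid evenly spaced in theoretical quantiles.
thin_qq <- function(x, n_tail = 100, n_mid = 2000) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  q <- qnorm(ppoints(n))                 # theoretical quantiles, all points
  if (n <= 2 * n_tail + n_mid) {
    return(data.frame(theoretical = q, sample = x))   # nothing to thin
  }
  mid  <- (n_tail + 1):(n - n_tail)      # indices of the middle block
  grid <- seq(q[mid[1]], q[mid[length(mid)]], length.out = n_mid)
  # findInterval gives, for each grid value, the last middle index at or below it
  pick <- unique(n_tail + findInterval(grid, q[mid]))
  keep <- c(seq_len(n_tail), pick, (n - n_tail + 1):n)
  data.frame(theoretical = q[keep], sample = x[keep])
}

set.seed(2)
x <- rnorm(1e6)                          # simulated data, one million values
d <- thin_qq(x)
nrow(d)                                  # at most 2 * n_tail + n_mid rows

plot(d$theoretical, d$sample,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
abline(0, 1, lty = 2)
```

Because the extremes are kept exactly, the thinned plot should look much like the full one at screen resolution, while the plot object holds a couple of thousand points instead of a million.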

Berton Gunter wrote:

May I take this off topic a little to seek collective wisdom (and so feel
free to reply privately).

The catalyst is Deepayan's remark:

Histograms were appropriate for drawing density estimates by hand in the good old days, but I can imagine very few situations where I would not prefer to use smoother density estimates when I have the computational power to do so.

Deepayan


Generally, I agree; but the appearance and thus one's perception and
interpretation of both histograms and density plots depend upon the
parameters chosen for the display (bin boundaries for histograms; bandwidth
and kernel for density plots). Important data peculiarities like arbitrary
rounding, favoring of certain values, resolution limitations, and so forth
are therefore often lost. I would instead advocate that simple quantile
plots -- plot(ppoints(x),sort(x)) -- or perhaps normal qqplots always be the
first plot used to explore univariate data distributions. I believe this
conforms to Bill Cleveland's recommendations, who says in the first sentence
on p. 17 of VISUALIZING DATA on visualizing univariate data: "Quantiles are
essential to visualizing distributions."
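A small sketch of the comparison, assuming simulated data with two of the peculiarities mentioned above built in: rounding to integers and a preference for multiples of 5. The lognormal parameters and sample size are arbitrary assumptions for illustration.

```r
# Simulated data with rounding and favored values (illustrative only).
set.seed(3)
x <- round(rlnorm(300, meanlog = 3, sdlog = 0.4))   # rounded to integers
i <- sample(length(x), 100)
x[i] <- 5 * round(x[i] / 5)                         # favor multiples of 5

op <- par(mfrow = c(1, 3))
plot(density(x), main = "Density estimate")         # smooths the ties away
plot(ppoints(x), sort(x),                           # simple quantile plot
     xlab = "Fraction of data", ylab = "Sorted x", main = "Quantile plot")
qqnorm(x)                                           # normal QQ plot
qqline(x)
par(op)
```

In the two quantile plots, tied values show up as flat steps, and the favored multiples of 5 as longer steps, whereas the density estimate with its default bandwidth tends to smooth them out.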

While it is true that many people may be unfamiliar with quantile plots, I
think we need to improve modern statistical practice not only by abandoning
histograms in favor of density plots, but also by always first using
quantile plots and explaining why this is necessary.

Difficult issue: What should one do when there are, say, a million
values?

Alternative views?


-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box

______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

