Hi, people.

Within ?hist (using R 2.3.0), one reads:

    density: values f^(x[i]), as estimated density values. If
             'all(diff(breaks) == 1)', they are the relative
             frequencies 'counts/n' and in general satisfy
             sum[i; f^(x[i]) (b[i+1]-b[i])] = 1, where
             b[i] = 'breaks[i]'.

I trip on this explanation each time I read it. Some R guardians will be tempted to say that since R itself does not trip, I am necessarily the problem :-). But still, notwithstanding and nevertheless, maybe these few lines of documentation could be improved.

The "f^(x[i])" bit is somewhat cryptic and not explained. It suggests that there are as many densities as there are possible "i" values, and since "i" indexes "x", it indirectly suggests that length(density) == length(x), which cannot be right. The "sum[i; ...]" has to be taken up to the number of cells, not the number of "x" values. Because "x[i]" is rather meaningless in this context, it would be better to avoid it.

The "^" may mean that "x[i]" is an index of "f", some kind of TeX device for shifting the notation. It may also mean "hat", to suggest that the density is an approximation. But an approximation of what? Of course, I understand there is an unstated model by which "density" estimates the density of some continuous distribution out of which the "x" values were sampled before the "hist()" function was called. But "x" is not necessarily a sample from a continuum: it may well be the whole population, in which case the densities in the histogram are exact and not an approximation. So it might be simpler to drop the "^" as well.

The concept of relative frequency is explained for equal-width cells only, and not otherwise. It is not reused elsewhere in "?hist", so it is not that useful, and we could use "d" instead of "f". Finally, writing "breaks[i+1]-breaks[i]" is simpler and clearer than introducing an intermediate "b[i]" device. Let's drop it.

Let me suggest a simpler rewriting of these few lines, using humbler notation while being more precise. Let's start with something like:

    density: For each cell i, density[i] is the proportion of all
             x[] which get sorted into that cell, divided by the
             cell width. So, the value of
             'sum(density * diff(breaks))' is 1.

and improve on it.

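To make the proposed wording concrete, here is a small sketch that checks it against what hist() returns when the cells have unequal widths. The data and breaks are invented purely for illustration, and I am assuming the default 'right = TRUE' and 'include.lowest = TRUE' behaviour.

```r
# Hypothetical data and breaks, chosen only for illustration.
x <- c(0.5, 1.2, 1.7, 2.1, 2.4, 3.3, 4.8, 6.5, 7.9, 9.0)
breaks <- c(0, 1, 2, 4, 10)          # deliberately unequal cell widths

h <- hist(x, breaks = breaks, plot = FALSE)

n <- length(x)
widths <- diff(h$breaks)

# Proposed wording: density[i] is the proportion of all x[] sorted
# into cell i, divided by the width of that cell.
d <- (h$counts / n) / widths

# The hand-computed values agree with what hist() reports.
stopifnot(isTRUE(all.equal(d, h$density)))

# The densities, weighted by cell widths, sum to 1.
stopifnot(isTRUE(all.equal(sum(h$density * diff(h$breaks)), 1)))

# There is one density per cell, not one per x value.
stopifnot(length(h$density) == length(h$breaks) - 1)
```

Note that length(h$density) is the number of cells, which is the point the "f^(x[i])" notation obscures.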
-- 
François Pinard   http://pinard.progiciels-bpi.ca

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel