Re: [R] histogram first bar wrong position
> On 22 Dec 2016, at 18:08, William Dunlap via R-help wrote:
>
> As a practical matter, 'continuous' data must be discretized, so if you
> have long vectors of it you will run into this problem.

Yep, and it is a bit unfortunate that hist() tries to use "pretty" breakpoints, so that you will have data points on the boundaries, causing all the left/right/endpoint business to come into play. The truehist() function in MASS does somewhat better.

For the case at hand, things are much improved by setting the breaks explicitly:

    hist(y, freq=TRUE, col='red', breaks=0.5:6.5)

but as pointed out by others, it is a much better idea to do

    plot(factor(y, levels=1:6))

or similar.

Incidentally, what is the most handy way to get a plot with percentages instead of counts? This works, but seems a bit ham-fisted:

    barplot(prop.table(table(factor(y, levels=1:6))))

-pd

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd@cbs.dk  Priv: pda...@gmail.com

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
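On the percentages question in the message above, one possible variation is sketched below. It is not from the thread; it only names the intermediate steps and scales the proportions to percent. The seed and the variable names f and pct are illustrative assumptions.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
y <- ceiling(6 * runif(100))      # simulated dice rolls, as in the original post

# Explicit levels keep any face that happened never to be rolled
f <- factor(y, levels = 1:6)

# prop.table() turns the counts into proportions; multiply by 100 for percent
pct <- 100 * prop.table(table(f))

barplot(pct, ylab = "Percent", col = "red")
```

Whether this is any less ham-fisted than the one-liner is a matter of taste; the computation is the same.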
Re: [R] histogram first bar wrong position
> William Dunlap on Thu, 22 Dec 2016 09:08:35 -0800 writes:

> As a practical matter, 'continuous' data must be discretized, so if you
> have long vectors of it you will run into this problem.

Yes, it is true that on the computer and in statistics we never have continuous data in the strict sense.

My point was and still is that a histogram is the wrong graphical tool for visualizing a distribution on a small finite set, such as the dice rolls 'itpro' has used.

And yes, if (s)he used something like

    dice <- ceiling(6 * runif(100))

and really prefers to use hist() over (something like)

    plot(table(dice), lwd = 6)

then an appropriate graphic would rather be

    hist(dice, freq=TRUE, col="orange", breaks = (31:(6*32))/32)

(and the default breaks for sample size N = 100'000 are indeed relatively close to that because, as we both know, the number of default breaks grows (slowly) with N).

For me, histograms are a (poor, but easy to understand and explain) version of density estimates (where the underlying density is w.r.t. the Lebesgue measure or similar).

Now back to large / long vectors of data: If you need to bin large vectors, you will hopefully be binning into 100's or 1000's of bins (because 1000 is still much smaller than "large"), and then you have actually computed the data for a histogram yourself already; so I personally would again prefer not to use hist(), but to write my own "3 line" function that returns a "histogram" object which I'd call plot(.) on.

So, providing such a short function may be useful, notably on the ?hist help page?

Martin Maechler,
ETH Zurich
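The short function suggested above might look roughly like the following sketch. The name as_histogram and its arguments are hypothetical and not from the thread; the component names (breaks, counts, density, mids, xname, equidist) follow the value described in ?hist.

```r
# Hypothetical helper: wrap precomputed bin counts in an object of class
# "histogram", so that plot() dispatches to the histogram plot method.
as_histogram <- function(breaks, counts, xname = "x") {
  stopifnot(length(breaks) == length(counts) + 1L,
            !is.unsorted(breaks, strictly = TRUE))
  widths <- diff(breaks)
  structure(list(breaks   = breaks,
                 counts   = counts,
                 density  = counts / (sum(counts) * widths),  # integrates to 1
                 mids     = breaks[-1L] - widths / 2,
                 xname    = xname,
                 equidist = diff(range(widths)) < 1e-7 * mean(widths)),
            class = "histogram")
}

set.seed(1)                                  # arbitrary seed
dice <- ceiling(6 * runif(100))
# The binning is done "by hand" here with tabulate(); any fast binning would do
h <- as_histogram(breaks = 0.5:6.5, counts = tabulate(dice, nbins = 6))
plot(h, col = "orange", main = "Dice rolls")
```

The point of the design is that the expensive binning step stays under the caller's control, and only the cheap plotting step is delegated to the existing plot method.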
Re: [R] histogram first bar wrong position
William has lifted the lid on the essence of the problem, which is that in R the way that breaks (and therefore counts) in a histogram are evaluated is an area of long grass with lurking snakes!

To get a glimpse of this, have a look at

    ?hist

and in the section "Arguments", look at "breaks", "freq", "right". Also see under "Details". And, as suggested under "See also", look at

    ?nclass.Sturges

As William suggests, if you know what class intervals you want, create them yourself! For your example (with N=100), look at:

    hist(y,freq=TRUE, col='red', breaks=0.5+(0:6))

or

    hist(y,freq=TRUE, col='red', breaks=0.25+(0:12)/2)

Hoping this helps!
Best wishes,
Ted.

-
E-Mail: (Ted Harding)
Date: 22-Dec-2016 Time: 17:23:26
This message was sent by XFMail
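To see why N=10 and N=100 give differently placed bars, it may help to mimic how the default breaks come about. The sketch below assumes, as described in ?hist, that the default number of classes comes from nclass.Sturges() and is then only a suggestion passed on to pretty(); the fixed range c(1, 6) assumes that both a 1 and a 6 were rolled.

```r
# Sturges' rule depends only on the sample size
n10  <- nclass.Sturges(numeric(10))    # suggested number of classes for N = 10
n100 <- nclass.Sturges(numeric(100))   # suggested number of classes for N = 100

# The suggestion is turned into "pretty" breakpoints over the data range
b10  <- pretty(c(1, 6), n = n10)
b100 <- pretty(c(1, 6), n = n100)

b10    # unit-width bins starting at 1, so the values 1 and 2 share the first bin
b100   # half-unit bins, so every second bin is empty except at the left end
```

In both cases the breakpoints coincide with possible data values, which is exactly the situation the explicit breaks above avoid.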
Re: [R] histogram first bar wrong position
As a practical matter, 'continuous' data must be discretized, so if you have long vectors of it you will run into this problem.

Bill Dunlap
TIBCO Software
wdunlap tibco.com
Re: [R] histogram first bar wrong position
Looking at the return value of hist will show you what is happening:

> x <- rep(1:6,10*(6:1))
> z <- hist(x, freq=TRUE)
> z
$breaks
 [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0

$counts
 [1] 60 50  0 40  0 30  0 20  0 10
...

The first bin is [1, 1.5], including both endpoints, while the other bins include only the upper endpoint. I recommend defining your own breakpoints, ones that don't include possible data points, as in

> print(hist(x, breaks=seq(min(x)-0.5, max(x)+0.5, by=1), freq=TRUE))
$breaks
[1] 0.5 1.5 2.5 3.5 4.5 5.5 6.5

$counts
[1] 60 50 40 30 20 10
...

S+ had a 'factor' method for hist() that did this sort of thing, but R does not.

Bill Dunlap
TIBCO Software
wdunlap tibco.com
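The endpoint rule described above can be checked directly by flipping the `right` argument of hist(). This is only an illustrative sketch; the variable names b, closed_right and closed_left are made up for it.

```r
x <- rep(1:6, 10 * (6:1))            # 60 ones, 50 twos, ..., 10 sixes
b <- seq(1, 6, by = 0.5)             # the default breaks shown above

# right = TRUE (the default): bins are (a, b], except the first, which is [a, b]
closed_right <- hist(x, breaks = b, right = TRUE,  plot = FALSE)$counts

# right = FALSE: bins are [a, b), except the last, which is [a, b]
closed_left  <- hist(x, breaks = b, right = FALSE, plot = FALSE)$counts

rbind(closed_right, closed_left)
```

With right = TRUE the 1s and 2s land in adjacent bins at the left edge; with right = FALSE it is the 5s and 6s that end up side by side at the right edge. Either way, one end of the plot looks irregular as long as the breaks fall on data values.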
Re: [R] histogram first bar wrong position
> itpro on Thu, 22 Dec 2016 16:17:28 +0300 writes:

> So, first bar is always in wrong position. How do I fix it to make
> perfectly spaced bars?

Don't use histograms *at all* for such discrete integer data.

    N <- rpois(100, 5)
    plot(table(N), lwd = 4)

Histograms should only be used for continuous data (or discrete data with "many" possible values).

It's a pain to see them so often "misused" for data like the 'N' above.

Martin Maechler,
ETH Zurich
[R] histogram first bar wrong position
Hi, everyone.

I stumbled upon weird histogram behaviour.

Consider this "dice emulator":
Step 1: Generate a uniform random array x of size N.
Step 2: Multiply each item by six and round up to the next integer to get numbers 1 to 6.
Step 3: Plot a histogram.

> x<-runif(N)
> y<-ceiling(x*6)
> hist(y,freq=TRUE, col='orange')

Now what I get with N=10

> x<-runif(10)
> y<-ceiling(x*6)
> hist(y,freq=TRUE, col='green')

At first glance it looks OK.

Now try N=100

> x<-runif(100)
> y<-ceiling(x*6)
> hist(y,freq=TRUE, col='red')

Now the first bar is not where it should be.
Hmm. Look again at the N=10 histogram... The first bar is not where I want it either; it's only less striking due to the narrow bars.

So, the first bar is always in the wrong position. How do I fix it to get perfectly spaced bars?