Re: [R] Histogram omitting/collapsing groups
On Jan 1, 2012, at 07:40, Joshua Wiley wrote:

> If you just want a plot of the frequencies at each hour why not just call barplot on the output of table? Histograms create bins and count in those, which doesn't sound like what you're after.

Exactly. If what you want is a barplot, make a barplot; histograms are for continuous data. Just remember that you may need to set the levels explicitly in case of empty groups: barplot(table(factor(x,levels=0:23))). (This is irrelevant with 100K data samples, but not with 100 of them).

That being said, the fact that hist() tends to create breakpoints which coincide with data points due to discretization is arguably a bit of a design error, but it is age-old and hard to change now. One way out is to use truehist() from MASS, another is to explicitly set the breaks to intermediate values, as in

    hist(x, breaks=seq(-.5, 23.5, 1))

> Cheers,
>
> Josh
>
> On Dec 31, 2011, at 21:37, jim holtman jholt...@gmail.com wrote:
>
>> Fast fingers; notice that there is still a problem in the counts; I was only looking at the last. Happy New Year -- up too late.
Re: [R] Histogram omitting/collapsing groups
On Sun, Jan 1, 2012 at 5:29 AM, peter dalgaard pda...@gmail.com wrote:

> Exactly. If what you want is a barplot, make a barplot; histograms are for continuous data. Just remember that you may need to set the levels explicitly in case of empty groups: barplot(table(factor(x,levels=0:23))). (This is irrelevant with 100K data samples, but not with 100 of them).
>
> That being said, the fact that hist() tends to create breakpoints which coincide with data points due to discretization is arguably a bit of a design error, but it is age-old and hard to change now. One way out is to use truehist() from MASS, another is to explicitly set the breaks to intermediate values, as in
>
>     hist(x, breaks=seq(-.5, 23.5, 1))

Thanks, everybody. I'll definitely switch to barplot.

As for continuous, it's all relative. Even the most continuous dataset at a scale that looks pretty to humans may have gaps between the values when you zoom in a lot.

Aren

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
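The two remedies quoted above (a barplot over a factor with explicit levels, and a histogram with half-integer breaks) can be sketched in base R as follows. The vector `hours` is made-up illustration data, not data from the thread; it is deliberately tiny so that most hours are empty groups.

```r
# Hypothetical data: only four observations, so most hours never occur.
hours <- c(0, 0, 5, 23)

# table() alone reports only the hours that actually occur ...
counts_seen <- table(hours)

# ... while a factor with explicit levels keeps all 24 hours, empty ones as 0.
counts_all <- table(factor(hours, levels = 0:23))

# Breaks at half-integers put each integer hour in the middle of its own bin,
# so no observation can fall on a bin boundary.
h <- hist(hours, breaks = seq(-0.5, 23.5, by = 1), plot = FALSE)

# Both approaches give the same 24 counts.
stopifnot(length(counts_seen) == 3,
          length(counts_all) == 24,
          all(h$counts == as.vector(counts_all)))

# Bar chart of counts per hour, including the empty hours.
barplot(counts_all, xlab = "hour", ylab = "count")
```

With a large sample every hour is likely to occur, so the explicit levels matter mainly for small samples, as the quoted message notes.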
Re: [R] Histogram omitting/collapsing groups
Hi Aren,

I was busy thinking about how to make what you wanted, and I missed that you were working with hours from a day. That being the case, you may think about a circular graph. The attached plots show two different ways of working with the same data.

Cheers,

Josh

    set.seed(10)
    x <- sample(0:23, 10000, TRUE, prob = sin(0:23)+1)

    require(ggplot2) # graphing package

    ## regular barplot
    ## stat = "identity" uses the tabulated Freq values as bar heights;
    ## recent ggplot2 versions need it when y is mapped
    p <- ggplot(as.data.frame(table(x)), aes(x = x, y = Freq)) +
      geom_bar(stat = "identity")

    ## using circular coordinates
    p2 <- p + coord_polar()

    ## print them
    print(p)
    print(p2)

    ## just if you're interested, the code to
    ## put the two plots side by side
    require(grid)

    dev.new(height = 6, width = 12)
    grid.newpage()
    pushViewport(vpList(
      viewport(x = 0, width = .5, just = "left", name = "barplot"),
      viewport(x = .5, width = .5, just = "left", name = "windrose")))
    seekViewport("barplot")
    grid.draw(ggplotGrob(p))
    seekViewport("windrose")
    grid.draw(ggplotGrob(p2))

--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/

[Attachment: plots.pdf (Adobe PDF document)]
Re: [R] Histogram omitting/collapsing groups
This is helpful, although I can't seem to adapt it to my own data.

If I run your sample as is, I do get the nice graphs. However, this doesn't work: (Assume you already have a data frame dallas with 2057980 rows. It has column offense_hour, and each row has a value between 0 and 23, inclusive.)

    p <- ggplot(as.data.frame(table(dallas$offense_hour)), aes(x = dallas$offense_hour, y = Freq)) + geom_bar()
    print(p)
    Error in data.frame(x = c(9, 8, 10, 9, 10, 15, 11, 13, 0, 16, 13, 20, :
      arguments imply differing number of rows: 2057980, 24

Seems like dallas$offense_hour corresponds to x in your example. I'm confused why yours works even though your x has 10,000 values, yet mine fails complaining that the row count is way off. Either way, the length of x or dallas$offense_hour grossly exceeds 24.

Aren
Re: [R] Histogram omitting/collapsing groups
Sorry, that was probably a really confusing example...too many xs floating around.

    set.seed(10)
    rawdata <- sample(0:23, 10000, TRUE, prob = sin(0:23)+1)

    ## do this step first for your data
    tableddata <- as.data.frame(table(rawdata))

    ## use these names in ggplot
    colnames(tableddata)

    require(ggplot2)
    ## stat = "identity" uses the tabulated Freq values as bar heights;
    ## recent ggplot2 versions need it when y is mapped
    p <- ggplot(tableddata, aes(x = rawdata, y = Freq)) +
      geom_bar(stat = "identity")

Cheers,

Josh
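To make the row-count mismatch in this exchange concrete, here is a small base-R sketch of what as.data.frame(table(...)) returns. The data frame `dallas` below is a made-up stand-in with 1000 rows, and the column name `hour_num` is hypothetical; neither comes from the thread.

```r
# Hypothetical stand-in for the real data: 1000 hours between 0 and 23.
set.seed(1)
dallas <- data.frame(offense_hour = sample(0:23, 1000, replace = TRUE))

# Tabulating collapses the raw rows to one row per distinct hour.
# Naming the argument gives the first column a readable name instead of "Var1".
tabled <- as.data.frame(table(offense_hour = dallas$offense_hour))

colnames(tabled)   # "offense_hour" "Freq"

# The tabulated frame is tiny compared with the raw column, which is why
# mixing columns of `tabled` with dallas$offense_hour fails on row counts.
c(raw = nrow(dallas), tabled = nrow(tabled))

# The hour column is a factor; go through character to recover the numbers.
tabled$hour_num <- as.integer(as.character(tabled$offense_hour))

# Any plotting call should refer only to columns of `tabled`, for example:
barplot(tabled$Freq, names.arg = as.character(tabled$offense_hour))
```

The same idea applies to the ggplot call: every name inside aes() should be a column of the data frame passed as the first argument.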
Re: [R] Histogram omitting/collapsing groups
Thanks. That did it!

And I get it now--in your original example, aes(x = x, y = Freq), x refers to the column name in as.data.frame(table(x)), not the x vector(?) you created.

Aren
Re: [R] Histogram omitting/collapsing groups
Hi,

I think you're not understanding quite what's going on with hist. Reread the help, and take a look at this small example. The solution I'd use is the last item.

    x <- rep(1:10, times=1:10)
    table(x)
    x
     1  2  3  4  5  6  7  8  9 10
     1  2  3  4  5  6  7  8  9 10

    hist(x, plot=FALSE, right=TRUE)$counts
    [1]  3  3  4  5  6  7  8  9 10
    hist(x, plot=FALSE, right=TRUE)$breaks
     [1]  1  2  3  4  5  6  7  8  9 10
    hist(x, plot=FALSE, right=TRUE)$mids
    [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5

    hist(x, plot=FALSE, right=FALSE)$counts
    [1]  1  2  3  4  5  6  7  8 19
    hist(x, plot=FALSE, right=FALSE)$breaks
     [1]  1  2  3  4  5  6  7  8  9 10
    hist(x, plot=FALSE, right=FALSE)$mids
    [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5

    hist(x, plot=FALSE, breaks=seq(.5, 10.5, by=1))$counts
     [1]  1  2  3  4  5  6  7  8  9 10
    hist(x, plot=FALSE, breaks=seq(.5, 10.5, by=1))$breaks
     [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5
    hist(x, plot=FALSE, breaks=seq(.5, 10.5, by=1))$mids
     [1]  1  2  3  4  5  6  7  8  9 10

Sarah

On Sat, Dec 31, 2011 at 10:25 AM, Aren Cambre a...@arencambre.com wrote:

> I have two large datasets (156K and 2.06M records). Each row has the hour that an event happened, represented by an integer from 0 to 23.
>
> R's histogram is combining some data.
>
> Here's the command I ran to get the histogram:
>
>     histinfo <- hist(crashes$hour, right=FALSE)
>
> Here's histinfo:
>
>     histinfo
>     $breaks
>      [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
>
>     $counts
>      [1]  4755  4618  5959  3292  2378  2715  4592  6144  6860  5598  5601  6596  7152  7490  8166
>     [16]  9758 11301 11745  9943  7494  6272  6220 11669
>
>     $intensities
>      [1] 0.03041876 0.02954234 0.03812101 0.02105963 0.01521258 0.01736844 0.02937602 0.03930449
>      [9] 0.04388490 0.03581161 0.03583081 0.04219604 0.04575289 0.04791515 0.05223967 0.06242403
>     [17] 0.07229494 0.07513530 0.06360752 0.04794074 0.04012334 0.03979068 0.07464911
>
>     $density
>      [1] 0.03041876 0.02954234 0.03812101 0.02105963 0.01521258 0.01736844 0.02937602 0.03930449
>      [9] 0.04388490 0.03581161 0.03583081 0.04219604 0.04575289 0.04791515 0.05223967 0.06242403
>     [17] 0.07229494 0.07513530 0.06360752 0.04794074 0.04012334 0.03979068 0.07464911
>
>     $mids
>      [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5
>     [19] 18.5 19.5 20.5 21.5 22.5
>
>     $xname
>     [1] "crashes$hour"
>
>     $equidist
>     [1] TRUE
>
>     attr(,"class")
>     [1] "histogram"
>
> Note how the last value in counts is 11669. It's relevant to the output of table(crashes$hour):
>
>         0     1     2     3     4     5     6     7     8     9    10    11    12    13    14
>      4755  4618  5959  3292  2378  2715  4592  6144  6860  5598  5601  6596  7152  7490  8166
>        15    16    17    18    19    20    21    22    23
>      9758 11301 11745  9943  7494  6272  6220  6000  5669
>
> Notice how the sum of 22 and 23 from table(crashes$hour) is 11669?
>
> Is that correct for the histogram to combine hours 22 and 23? Since I specified right = FALSE, I figured there's no way 23 would be combined with 22?
>
> Adding breaks=24 to the hist makes no difference; it's still stuck at 23 breaks. I also tried breaks=25 and 23 and several other values, in case I am misinterpreting breaks's meaning, but none of them make a difference.
>
> I imagine this is a n00b question, so my apologies if this is obvious.
>
> Aren

--
Sarah Goslee
http://www.functionaldiversity.org
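The behaviour asked about in this message can be reproduced on a small scale. The sketch below uses a made-up vector `hours` in which hour h occurs h + 1 times, so merged bins are easy to spot. The last call, integer breaks extended to 24, is an alternative that nobody in the thread proposed and is shown only as one further option.

```r
# Hypothetical data: hour h occurs h + 1 times, so every count is distinct.
hours <- rep(0:23, times = 1:24)

# A single number for `breaks` is only a suggestion: hist() chooses "pretty"
# breakpoints, which for this range are the integers 0..23 (23 bins).
h_left <- hist(hours, breaks = 24, right = FALSE, plot = FALSE)
h_left$breaks

# right = FALSE makes the bins [a, b), but include.lowest = TRUE (the default)
# closes the last bin as [22, 23], so hours 22 and 23 share it.
tail(h_left$counts, 1)    # 23 + 24 = 47

# With the default right = TRUE the bins are (a, b] and the first bin is
# closed as [0, 1], so hours 0 and 1 share it instead.
h_right <- hist(hours, breaks = 24, plot = FALSE)
h_right$counts[1]         # 1 + 2 = 3

# A vector of breaks is used as given. Adding a 24th upper edge with
# right = FALSE yields bins [0,1), [1,2), ..., [23,24]: one per hour.
h_fix <- hist(hours, breaks = 0:24, right = FALSE, plot = FALSE)
stopifnot(all(h_fix$counts == 1:24))
```

This is why trying breaks=23, 24 or 25 changed nothing: each of those suggestions leads to the same integer breakpoints from 0 to 23.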
Re: [R] Histogram omitting/collapsing groups
Here is a test I ran and looks fine, but then I created the data, so it might have something to do with your data:

    x <- sample(0:23, 100000, TRUE)
    a <- hist(x, breaks = 24)
    a[1:5]
    $breaks
     [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

    $counts
     [1] 8262 4114 4186 4106 4153 4234 4206 4155 4157 4203 4186 4158 4132 4139 4231 4216 4158 4054 4185 4153
    [21] 4281 4110 4221

    $intensities
     [1] 0.08262 0.04114 0.04186 0.04106 0.04153 0.04234 0.04206 0.04155 0.04157 0.04203 0.04186 0.04158
    [13] 0.04132 0.04139 0.04231 0.04216 0.04158 0.04054 0.04185 0.04153 0.04281 0.04110 0.04221

    $density
     [1] 0.08262 0.04114 0.04186 0.04106 0.04153 0.04234 0.04206 0.04155 0.04157 0.04203 0.04186 0.04158
    [13] 0.04132 0.04139 0.04231 0.04216 0.04158 0.04054 0.04185 0.04153 0.04281 0.04110 0.04221

    $mids
     [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5 18.5 19.5
    [21] 20.5 21.5 22.5

    table(x)
    x
       0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
    4168 4094 4114 4186 4106 4153 4234 4206 4155 4157 4203 4186 4158 4132 4139 4231 4216 4158 4054 4185 4153
      21   22   23
    4281 4110 4221

--
Jim Holtman
Data Munger Guru

What is the problem that you are
Re: [R] Histogram omitting/collapsing groups
Fast fingers; notice that there is still a problem in the counts; I was only looking at the last. Happy New Year -- up too late.
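A merged bin like the one noticed in this message is easy to overlook by eye. As a rough sketch, the counts from hist() can be compared with a tabulation of the same data. The helper name `bins_match_table` and the vector `x` are hypothetical, and the helper assumes integer-valued data.

```r
# Hypothetical helper (not from the thread): TRUE when hist() yields exactly
# one bin per integer between min(x) and max(x), with matching counts.
bins_match_table <- function(x, ...) {
  h <- hist(x, plot = FALSE, ...)
  # Explicit levels keep integers that never occur, counted as 0.
  tab <- as.vector(table(factor(x, levels = min(x):max(x))))
  length(h$counts) == length(tab) && all(h$counts == tab)
}

# Made-up hour data covering 0..23.
x <- rep(0:23, times = c(5, 3, rep(2, 22)))

# breaks = 24 leads to integer breakpoints 0..23, so two hours share a bin.
merged <- bins_match_table(x, breaks = 24)

# Half-integer breaks give one bin per hour.
separate <- bins_match_table(x, breaks = seq(-0.5, 23.5, by = 1))

stopifnot(!merged, separate)
```

Comparing whole vectors rather than only the last element catches a merge at either end of the range.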
Re: [R] Histogram omitting/collapsing groups
If you just want a plot of the frequencies at each hour why not just call barplot on the output of table? Histograms create bins and count in those, which doesn't sound like what you're after. Cheers, Josh On Dec 31, 2011, at 21:37, jim holtman jholt...@gmail.com wrote: Fast fingers; notice that there is still a problem in the counts; I was only looking at the last. Happy New Year -- up too late. On Sun, Jan 1, 2012 at 12:33 AM, jim holtman jholt...@gmail.com wrote: Here is a test I ran and looks fine, but then I created the data, so it might have something to do with your data: x - sample(0:23, 10, TRUE) a - hist(x, breaks = 24) a[1:5] $breaks [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 $counts [1] 8262 4114 4186 4106 4153 4234 4206 4155 4157 4203 4186 4158 4132 4139 4231 4216 4158 4054 4185 4153 [21] 4281 4110 4221 $intensities [1] 0.08262 0.04114 0.04186 0.04106 0.04153 0.04234 0.04206 0.04155 0.04157 0.04203 0.04186 0.04158 [13] 0.04132 0.04139 0.04231 0.04216 0.04158 0.04054 0.04185 0.04153 0.04281 0.04110 0.04221 $density [1] 0.08262 0.04114 0.04186 0.04106 0.04153 0.04234 0.04206 0.04155 0.04157 0.04203 0.04186 0.04158 [13] 0.04132 0.04139 0.04231 0.04216 0.04158 0.04054 0.04185 0.04153 0.04281 0.04110 0.04221 $mids [1] 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5 18.5 19.5 [21] 20.5 21.5 22.5 table(x) x 0123456789 10 11 12 13 14 15 16 17 18 19 20 4168 4094 4114 4186 4106 4153 4234 4206 4155 4157 4203 4186 4158 4132 4139 4231 4216 4158 4054 4185 4153 21 22 23 4281 4110 4221 On Sat, Dec 31, 2011 at 11:20 AM, Sarah Goslee sarah.gos...@gmail.com wrote: Hi, I think you're not understanding quite what's going on with hist. Reread the help, and take a look at this small example. The solution I'd use is the last item. 
x <- rep(1:10, times=1:10)
table(x)
x
 1  2  3  4  5  6  7  8  9 10
 1  2  3  4  5  6  7  8  9 10

hist(x, plot=FALSE, right=TRUE)$counts
[1]  3  3  4  5  6  7  8  9 10
hist(x, plot=FALSE, right=TRUE)$breaks
 [1]  1  2  3  4  5  6  7  8  9 10
hist(x, plot=FALSE, right=TRUE)$mids
[1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5

hist(x, plot=FALSE, right=FALSE)$counts
[1]  1  2  3  4  5  6  7  8 19
hist(x, plot=FALSE, right=FALSE)$breaks
 [1]  1  2  3  4  5  6  7  8  9 10
hist(x, plot=FALSE, right=FALSE)$mids
[1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5

hist(x, plot=FALSE, breaks=seq(.5, 10.5, by=1))$counts
 [1]  1  2  3  4  5  6  7  8  9 10
hist(x, plot=FALSE, breaks=seq(.5, 10.5, by=1))$breaks
 [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5
hist(x, plot=FALSE, breaks=seq(.5, 10.5, by=1))$mids
 [1]  1  2  3  4  5  6  7  8  9 10

Sarah

On Sat, Dec 31, 2011 at 10:25 AM, Aren Cambre a...@arencambre.com wrote:

I have two large datasets (156K and 2.06M records). Each row has the hour that an event happened, represented by an integer from 0 to 23. R's histogram is combining some data.
Here's the command I ran to get the histogram:

histinfo <- hist(crashes$hour, right=FALSE)

Here's histinfo:

histinfo
$breaks
 [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

$counts
 [1]  4755  4618  5959  3292  2378  2715  4592  6144  6860  5598  5601  6596  7152  7490  8166
[16]  9758 11301 11745  9943  7494  6272  6220 11669

$intensities
 [1] 0.03041876 0.02954234 0.03812101 0.02105963 0.01521258 0.01736844 0.02937602 0.03930449
 [9] 0.04388490 0.03581161 0.03583081 0.04219604 0.04575289 0.04791515 0.05223967 0.06242403
[17] 0.07229494 0.07513530 0.06360752 0.04794074 0.04012334 0.03979068 0.07464911

$density
 [1] 0.03041876 0.02954234 0.03812101 0.02105963 0.01521258 0.01736844 0.02937602 0.03930449
 [9] 0.04388490 0.03581161 0.03583081 0.04219604 0.04575289 0.04791515 0.05223967 0.06242403
[17] 0.07229494 0.07513530 0.06360752 0.04794074 0.04012334 0.03979068 0.07464911

$mids
 [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5
[19] 18.5 19.5 20.5 21.5 22.5

$xname
[1] "crashes$hour"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

Note how the last value in counts is 11669. It's relevant to the output of table(crashes$hour):

    0     1     2     3     4     5     6     7     8     9    10    11    12    13    14
 4755  4618  5959  3292  2378  2715  4592  6144  6860  5598  5601  6596  7152  7490  8166
   15    16    17    18    19    20    21    22    23
 9758 11301 11745  9943  7494  6272  6220  6000  5669

Notice how the sum of 22 and 23 from table(crashes$hour) is 11669? Is that correct for the histogram to combine hours 22 and 23? Since I specified right = FALSE, I figured there's no way 23 would be combined with 22?

Adding breaks=24 to the hist makes no difference; it's still stuck at 23 breaks. I also tried breaks=25 and 23 and several other values, in case I am