Re: [R] histogram first bar wrong position

2016-12-23 Thread peter dalgaard

> On 22 Dec 2016, at 18:08, William Dunlap via R-help wrote:
> 
> As a practical matter, 'continuous' data must be discretized, so if you
> have long vectors of it you will run into this problem.

Yep, and it is a bit unfortunate that hist() tries to use "pretty" breakpoints, 
so that you will have data points on the boundaries, causing all the 
left/right/endpoint business to come into play. The truehist() function in MASS 
does somewhat better. 

For the case at hand, things are much improved by setting the breaks explicitly:

hist(y,freq=TRUE, col='red', breaks=0.5:6.5)

but as pointed out by others, it is a much better idea to do

plot(factor(y, levels=1:6))

or similar. 
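
A minimal sketch of the two suggestions above, assuming y holds simulated dice rolls (the seed and sample size are illustrative choices, not from the thread). It checks that half-integer breaks and a table of a factor with fixed levels give the same counts:

```r
# Simulated dice rolls; seed and N are assumptions made for illustration.
set.seed(1)
y <- ceiling(runif(100) * 6)

# Breaks at the half-integers: no data point can sit on a bin boundary,
# so the left/right/endpoint rules never come into play.
h <- hist(y, breaks = 0.5:6.5, plot = FALSE)

# Tabulating a factor with fixed levels keeps faces that never occurred.
tab <- table(factor(y, levels = 1:6))

print(h$counts)
print(tab)
```

Both objects can be passed to plot(); the table version draws one bar per face rather than one bar per interval.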

Incidentally, what is the most handy way to get a plot with percentages instead 
of counts? This works, but seems a bit ham-fisted:

barplot(prop.table(table(factor(y, levels=1:6))))
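
A sketch of the percentage version, again assuming y holds simulated dice rolls (seed, N, and axis labels are illustrative). prop.table() returns proportions, so they are scaled by 100 before plotting:

```r
# Simulated dice rolls; seed and N are assumptions made for illustration.
set.seed(1)
y <- ceiling(runif(100) * 6)

# Proportions per face, scaled to percentages.
pct <- 100 * prop.table(table(factor(y, levels = 1:6)))

print(pct)
barplot(pct, xlab = "face", ylab = "percent of rolls")
```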

-pd

> 
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
> 
> On Thu, Dec 22, 2016 at 8:19 AM, Martin Maechler wrote:
> 
>>> itpro  
>>>on Thu, 22 Dec 2016 16:17:28 +0300 writes:
>> 
>>> Hi, everyone.
>>> I stumbled upon weird histogram behaviour.
>> 
>>> Consider this "dice emulator":
>>> Step 1: Generate uniform random array x of size N.
>>> Step 2: Multiply each item by six and round to next bigger integer
>> to get numbers 1 to 6.
>>> Step 3: Plot histogram.
>> 
>>>> x<-runif(N)
>>>> y<-ceiling(x*6)
>>>> hist(y,freq=TRUE, col='orange')
>> 
>> 
>>> Now what I get with N=10
>> 
>>>> x<-runif(10)
>>>> y<-ceiling(x*6)
>>>> hist(y,freq=TRUE, col='green')
>> 
>>> At first glance looks OK.
>> 
>>> Now try N=100
>> 
>>>> x<-runif(100)
>>>> y<-ceiling(x*6)
>>>> hist(y,freq=TRUE, col='red')
>> 
>>> Now first bar is not where it should be.
>>> Hmm. Look again to 10 histogram... First bar is not where I want
>> it, it's only less striking due to narrow bars.
>> 
>>> So, first bar is always in wrong position. How do I fix it to make
>> perfectly spaced bars?
>> 
>> Don't use histograms *at all* for such discrete integer data.
>> 
>> N <- rpois(100, 5)
>> plot(table(N), lwd = 4)
>> 
>> Histograms should only be used for continuous data (or discrete data
>> with "many" possible values).
>> 
>> It's a pain to see them so often "misused" for data like the 'N' above.
>> 
>> Martin Maechler,
>> ETH Zurich
>> 
>> __
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>> 
> 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd@cbs.dk  Priv: pda...@gmail.com



Re: [R] histogram first bar wrong position

2016-12-23 Thread Martin Maechler
> William Dunlap 
> on Thu, 22 Dec 2016 09:08:35 -0800 writes:

> As a practical matter, 'continuous' data must be discretized, so if you
> have long vectors of it you will run into this problem.

> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com

Yes, it is true that on the computer and in statistics we never have
continuous data in the strict sense.

My point was, and still is, that a histogram is the wrong graphical
tool for visualizing a distribution
on a small finite set, e.g., the dice rolls that 'itpro' used.

And yes, if (s)he used something like

   dice <- ceiling(6 * runif(100))

and really prefers to use  hist() over (something like)

   plot(table(dice), lwd = 6)

then an appropriate graphic would rather be

  hist(dice, freq=TRUE, col="orange", breaks = (31:(6*32))/32)

(and the default breaks for sample size N = 100'000 are indeed
 relatively close to that because, as we both know, the number of
 default breaks grows (slowly) with N).
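
A small check of the fine breaks above, assuming simulated dice rolls (seed and N are illustrative). With bins of width 1/32, each face lands in its own narrow bin whose right edge is the face value:

```r
# Simulated dice rolls; seed and N are assumptions made for illustration.
set.seed(1)
dice <- ceiling(6 * runif(100))

# Breaks from 31/32 to 6 in steps of 1/32, as in the hist() call above.
fine_breaks <- (31:(6 * 32)) / 32
h <- hist(dice, breaks = fine_breaks, plot = FALSE)

# Only the bins ending exactly at a face value are occupied.
occupied <- h$counts > 0
print(h$breaks[-1][occupied])  # right edges of the non-empty bins
print(h$counts[occupied])      # counts per face
```

The bars then look like thin spikes at 1, ..., 6, which is close to what plot(table(dice)) draws.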

For me, histograms are a (poor but easy to understand and
explain) version of density estimates (where the underlying
density is with respect to the Lebesgue measure or similar).

Now back to large / long vectors of data:
If you need to bin large vectors, you will hopefully be binning
into hundreds or thousands of bins (because 1000 is still much
smaller than "large"), and then you have already computed the
data for a histogram yourself; so I personally would
again prefer not to use hist(), but to write my own "3 line"
function that returns a "histogram" object which I'd call plot(.) on.

So, providing such a short function may be useful, notably
on the ?hist help page?
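
One possible shape for such a function, as a sketch only. The name as_histogram and the binning via findInterval() are assumptions made for illustration; nothing like this is provided by R itself. It fills in the components that hist() returns (breaks, counts, density, mids, xname, equidist) so that plot() dispatches to the "histogram" method:

```r
# Hypothetical helper (not part of R): wrap already-binned data in an
# object of class "histogram".
as_histogram <- function(breaks, counts, xname = "x") {
  stopifnot(length(breaks) == length(counts) + 1L,
            !is.unsorted(breaks, strictly = TRUE))
  widths <- diff(breaks)
  structure(
    list(breaks   = breaks,
         counts   = counts,
         density  = counts / (sum(counts) * widths),
         mids     = breaks[-1L] - widths / 2,
         xname    = xname,
         equidist = diff(range(widths)) < 1e-7 * mean(widths)),
    class = "histogram")
}

# Example: bin a long vector once, then plot the precomputed counts.
# Note that findInterval() uses left-closed bins, unlike hist()'s default.
set.seed(1)
x <- rnorm(1e6)
breaks <- seq(floor(min(x)), ceiling(max(x)), length.out = 201)
counts <- tabulate(findInterval(x, breaks, rightmost.closed = TRUE),
                   nbins = 200)
h <- as_histogram(breaks, counts)
plot(h)
```

The binning step is kept outside the helper on purpose, so the counts can come from any source (a database query, a streaming pass, and so on).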

Martin Maechler,
ETH Zurich


> On Thu, Dec 22, 2016 at 8:19 AM, Martin Maechler 
> wrote:

>> > itpro  
>> > on Thu, 22 Dec 2016 16:17:28 +0300 writes:
>> 
>> > Hi, everyone.
>> > I stumbled upon weird histogram behaviour.
>> 
>> > Consider this "dice emulator":
>> > Step 1: Generate uniform random array x of size N.
>> > Step 2: Multiply each item by six and round to next bigger integer
>> to get numbers 1 to 6.
>> > Step 3: Plot histogram.
>> 
>> >> x<-runif(N)
>> >> y<-ceiling(x*6)
>> >> hist(y,freq=TRUE, col='orange')
>> 
>> 
>> > Now what I get with N=10
>> 
>> >> x<-runif(10)
>> >> y<-ceiling(x*6)
>> >> hist(y,freq=TRUE, col='green')
>> 
>> > At first glance looks OK.
>> 
>> > Now try N=100
>> 
>> >> x<-runif(100)
>> >> y<-ceiling(x*6)
>> >> hist(y,freq=TRUE, col='red')
>> 
>> > Now first bar is not where it should be.
>> > Hmm. Look again to 10 histogram... First bar is not where I want
>> it, it's only less striking due to narrow bars.
>> 
>> > So, first bar is always in wrong position. How do I fix it to make
>> perfectly spaced bars?
>> 
>> Don't use histograms *at all* for such discrete integer data.
>> 
>> N <- rpois(100, 5)
>> plot(table(N), lwd = 4)
>> 
>> Histograms should only be used for continuous data (or discrete data
>> with "many" possible values).
>> 
>> It's a pain to see them so often "misused" for data like the 'N' above.
>> 
>> Martin Maechler,
>> ETH Zurich
>> 


Re: [R] histogram first bar wrong position

2016-12-22 Thread Ted Harding
William has lifted the lid on the essence of the problem, which is
that in R the way that breaks (and therefore counts) in a histogram
are evaluated is an area of long grass with lurking snakes!

To get a glimpse of this, have a look at
  ?hist
and in the section "Arguments", look at "breaks", "freq", "right".
Also see under "Details".

and, as suggested under "See also", look at
  ?nclass.Sturges
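
To see where the default breaks come from, here is a sketch using the same data as in William's example quoted below. The Sturges rule gives a suggested number of classes, and that number is only a hint to pretty(), which picks "round" breakpoints covering the range:

```r
# Same data as in William's example (six values, N = 210).
x <- rep(1:6, 10 * (6:1))

# Sturges' rule: ceiling(log2(N) + 1) classes.
k <- nclass.Sturges(x)
print(k)

# pretty() treats k as a suggestion and returns round breakpoints.
print(pretty(range(x), n = k))

# The breaks that hist() actually uses by default.
print(hist(x, plot = FALSE)$breaks)
```

With integer data, round breakpoints are exactly where the data sit, which is why the endpoint rules matter here.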

As William suggests, if you know what class intervals you want,
create them yourself! For your example (with N=100), look at:

   hist(y,freq=TRUE, col='red', breaks=0.5+(0:6))

or

   hist(y,freq=TRUE, col='red', breaks=0.25+(0:12)/2)

Hoping this helps!
Best wishes,
Ted.


On 22-Dec-2016 16:36:34 William Dunlap via R-help wrote:
> Looking at the return value of hist will show you what is happening:
> 
>> x <- rep(1:6,10*(6:1))
>> z <- hist(x, freq=TRUE)
>> z
> $breaks
>  [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0
> 
> $counts
>  [1] 60 50  0 40  0 30  0 20  0 10
> ...
> 
> The first bin is [1, 1.5], including both endpoints, while the other
> bins include only the upper endpoint.  I recommend defining your
> own breakpoints, ones that don't include possible data points, as in
> 
>> print(hist(x, breaks=seq(min(x)-0.5, max(x)+0.5, by=1), freq=TRUE))
> $breaks
> [1] 0.5 1.5 2.5 3.5 4.5 5.5 6.5
> 
> $counts
> [1] 60 50 40 30 20 10
> ...
> 
> S+ had a 'factor' method for hist() that did this sort of thing, but R does
> not.
> 
> 
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
> 
> On Thu, Dec 22, 2016 at 5:17 AM, itpro  wrote:
> 
>> Hi, everyone.
>>
>>
>> I stumbled upon weird histogram behaviour.
>>
>> Consider this "dice emulator":
>> Step 1: Generate uniform random array x of size N.
>> Step 2: Multiply each item by six and round to next bigger integer to get
>> numbers 1 to 6.
>> Step 3: Plot histogram.
>>
>> > x<-runif(N)
>> > y<-ceiling(x*6)
>> > hist(y,freq=TRUE, col='orange')
>>
>>
>> Now what I get with N=10
>>
>> > x<-runif(10)
>> > y<-ceiling(x*6)
>> > hist(y,freq=TRUE, col='green')
>>
>> At first glance looks OK.
>>
>> Now try N=100
>>
>> > x<-runif(100)
>> > y<-ceiling(x*6)
>> > hist(y,freq=TRUE, col='red')
>>
>> Now first bar is not where it should be.
>> Hmm. Look again to 10 histogram... First bar is not where I want it,
>> it's only less striking due to narrow bars.
>>
>> So, first bar is always in wrong position. How do I fix it to make
>> perfectly spaced bars?

-
E-Mail: (Ted Harding) 
Date: 22-Dec-2016  Time: 17:23:26
This message was sent by XFMail



Re: [R] histogram first bar wrong position

2016-12-22 Thread William Dunlap via R-help
As a practical matter, 'continuous' data must be discretized, so if you
have long vectors of it you will run into this problem.

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Thu, Dec 22, 2016 at 8:19 AM, Martin Maechler  wrote:

> > itpro  
> > on Thu, 22 Dec 2016 16:17:28 +0300 writes:
>
> > Hi, everyone.
> > I stumbled upon weird histogram behaviour.
>
> > Consider this "dice emulator":
> > Step 1: Generate uniform random array x of size N.
> > Step 2: Multiply each item by six and round to next bigger integer
> to get numbers 1 to 6.
> > Step 3: Plot histogram.
>
> >> x<-runif(N)
> >> y<-ceiling(x*6)
> >> hist(y,freq=TRUE, col='orange')
>
>
> > Now what I get with N=10
>
> >> x<-runif(10)
> >> y<-ceiling(x*6)
> >> hist(y,freq=TRUE, col='green')
>
> > At first glance looks OK.
>
> > Now try N=100
>
> >> x<-runif(100)
> >> y<-ceiling(x*6)
> >> hist(y,freq=TRUE, col='red')
>
> > Now first bar is not where it should be.
> > Hmm. Look again to 10 histogram... First bar is not where I want
> it, it's only less striking due to narrow bars.
>
> > So, first bar is always in wrong position. How do I fix it to make
> perfectly spaced bars?
>
> Don't use histograms *at all* for such discrete integer data.
>
>  N <- rpois(100, 5)
>  plot(table(N), lwd = 4)
>
> Histograms should only be used for continuous data (or discrete data
> with "many" possible values).
>
> It's a pain to see them so often "misused" for data like the 'N' above.
>
> Martin Maechler,
> ETH Zurich
>


Re: [R] histogram first bar wrong position

2016-12-22 Thread William Dunlap via R-help
Looking at the return value of hist will show you what is happening:

> x <- rep(1:6,10*(6:1))
> z <- hist(x, freq=TRUE)
> z
$breaks
 [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0

$counts
 [1] 60 50  0 40  0 30  0 20  0 10
...

The first bin is [1, 1.5], including both endpoints, while the other
bins include only the upper endpoint.  I recommend defining your
own breakpoints, ones that don't include possible data points, as in

> print(hist(x, breaks=seq(min(x)-0.5, max(x)+0.5, by=1), freq=TRUE))
$breaks
[1] 0.5 1.5 2.5 3.5 4.5 5.5 6.5

$counts
[1] 60 50 40 30 20 10
...

S+ had a 'factor' method for hist() that did this sort of thing, but R does
not.
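
The endpoint behaviour described above can be checked directly. This sketch reuses the same x and compares the default (right-closed) bins, left-closed bins via right = FALSE, and the half-integer breaks:

```r
x <- rep(1:6, 10 * (6:1))

# Default: bins are (a, b], except the first, which also includes its
# lower endpoint. Values 1 and 1.5 would share the first bin.
h_right <- hist(x, plot = FALSE)

# right = FALSE: bins are [a, b), except the last, which also includes
# its upper endpoint. Now the irregular spacing moves to the last bar.
h_left <- hist(x, right = FALSE, plot = FALSE)

# Half-integer breaks: no data point is ever on a boundary.
h_half <- hist(x, breaks = seq(min(x) - 0.5, max(x) + 0.5, by = 1),
               plot = FALSE)

print(h_right$counts)
print(h_left$counts)
print(h_half$counts)
```

Only the third variant gives one bar per value with even spacing, independent of the right and include.lowest settings.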


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Thu, Dec 22, 2016 at 5:17 AM, itpro  wrote:

> Hi, everyone.
>
>
> I stumbled upon weird histogram behaviour.
>
> Consider this "dice emulator":
> Step 1: Generate uniform random array x of size N.
> Step 2: Multiply each item by six and round to next bigger integer to get
> numbers 1 to 6.
> Step 3: Plot histogram.
>
> > x<-runif(N)
> > y<-ceiling(x*6)
> > hist(y,freq=TRUE, col='orange')
>
>
> Now what I get with N=10
>
> > x<-runif(10)
> > y<-ceiling(x*6)
> > hist(y,freq=TRUE, col='green')
>
> At first glance looks OK.
>
> Now try N=100
>
> > x<-runif(100)
> > y<-ceiling(x*6)
> > hist(y,freq=TRUE, col='red')
>
> Now first bar is not where it should be.
> Hmm. Look again to 10 histogram... First bar is not where I want it,
> it's only less striking due to narrow bars.
>
> So, first bar is always in wrong position. How do I fix it to make
> perfectly spaced bars?
>


Re: [R] histogram first bar wrong position

2016-12-22 Thread Martin Maechler
> itpro  
> on Thu, 22 Dec 2016 16:17:28 +0300 writes:

> Hi, everyone.
> I stumbled upon weird histogram behaviour.

> Consider this "dice emulator":
> Step 1: Generate uniform random array x of size N.
> Step 2: Multiply each item by six and round to next bigger integer to get 
numbers 1 to 6.
> Step 3: Plot histogram.

>> x<-runif(N)
>> y<-ceiling(x*6)
>> hist(y,freq=TRUE, col='orange')


> Now what I get with N=10

>> x<-runif(10)
>> y<-ceiling(x*6)
>> hist(y,freq=TRUE, col='green')

> At first glance looks OK.

> Now try N=100

>> x<-runif(100)
>> y<-ceiling(x*6)
>> hist(y,freq=TRUE, col='red')

> Now first bar is not where it should be.
> Hmm. Look again to 10 histogram... First bar is not where I want it, 
it's only less striking due to narrow bars.

> So, first bar is always in wrong position. How do I fix it to make 
perfectly spaced bars?

Don't use histograms *at all* for such discrete integer data.

 N <- rpois(100, 5)
 plot(table(N), lwd = 4)

Histograms should only be used for continuous data (or discrete data
with "many" possible values).

It's a pain to see them so often "misused" for data like the 'N' above.

Martin Maechler,
ETH Zurich



[R] histogram first bar wrong position

2016-12-22 Thread itpro
Hi, everyone.


I stumbled upon weird histogram behaviour.

Consider this "dice emulator":
Step 1: Generate uniform random array x of size N.
Step 2: Multiply each item by six and round to next bigger integer to get 
numbers 1 to 6.
Step 3: Plot histogram.

> x<-runif(N)
> y<-ceiling(x*6)
> hist(y,freq=TRUE, col='orange')


Now what I get with N=10

> x<-runif(10)
> y<-ceiling(x*6)
> hist(y,freq=TRUE, col='green')

At first glance looks OK.

Now try N=100

> x<-runif(100)
> y<-ceiling(x*6)
> hist(y,freq=TRUE, col='red')

Now the first bar is not where it should be.
Hmm. Look again at the N=10 histogram... The first bar is not where I want it; it's 
only less striking due to the narrow bars.

So, the first bar is always in the wrong position. How do I fix it to get perfectly 
spaced bars?




