Re: [R] Off topic --- underdispersed (pseudo) binomial data.

2021-03-27 Thread Abby Spurdle
Further to yesterday's posts:
I think the "n" value would be the maximum possible number of jumps,
not the number of students.
In theory, the minimum possible number is zero, so the distributions
are more binomial-like than they look.
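
To illustrate why the choice of "n" matters, here is a minimal sketch (not part of the original simulation) of how the binomial reference variance in Rolf's expression, mean(X)*(1 - mean(X)/n), changes with n for a fixed sample mean. The helper name binom.ref.var and the numbers used are illustrative assumptions.

```r
# Binomial reference variance for a sample mean m and size parameter n,
# i.e. the right-hand side of var(X) < mean(X) * (1 - mean(X) / n).
# (binom.ref.var is a hypothetical helper, not from the original post.)
binom.ref.var <- function(m, n)
    m * (1 - m / n)

m <- 35                  # illustrative mean number of jumps
n <- c(50, 100, 200)     # candidate values for "n"
ref <- binom.ref.var(m, n)
print(data.frame(n = n, ref.var = ref))
```

A smaller n lowers the reference variance, so the same sample is less likely to be classed as underdispersed.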

Also, there was a mistake in my comments.
The jump is from non-A-grade to A-grade, not non-pass to pass.


On Sat, Mar 27, 2021 at 10:00 PM Abby Spurdle  wrote:
>
> Sorry.
> I just realized, after posting, that the "n" value in the dispersion
> calculation isn't correct.
> I'll have to revisit the simulation, tomorrow.
>
> On Sat, Mar 27, 2021 at 9:11 PM Abby Spurdle  wrote:
> >
> > Hi Rolf,
> >
> > Let's say we have a course called Corgiology 101, with a single moderated 
> > exam.
> > And let's say the moderators transform initial exam scores, such that
> > there are fixed percentages of pass rates and A grades.
> >
> > Rather than count the number of passes, we can count the number of "jumps".
> > That is, the number of people that pass the corgiology exam after
> > moderation, that would not have passed without moderation.
> >
> > I've created a function to test for underdispersion, based on your 
> > expression.
> > (I hope I got it right).
> >
> > Then I've gone on to create simulations, using both constant and
> > nonconstant class sizes.
> > The nonconstant simulations apply an (approx) discrete scaling
> > transformation, referred to previously.
> >
> > We can see from the examples that there are a lot of these jumps.
> > And more importantly, they appear to be underdispersed.
> >
> > code
> > PASS.SCORE <- 0.5
> > A.SCORE <- 0.8
> >
> > #target parameters
> > PASS.RATE <- 0.8
> > A.RATE <- 0.2
> > #unmoderated parameters
> > UNMOD.MEAN.SCORE <- 0.65
> > UNMOD.SD.SCORE <- 0.075
> >
> > NCLASSES <- 2000
> > NSTUD.CONST <- 200
> > NSTUD.NONCONST.LIMS <- c (50, 800)
> >
> > sim.njump <- function (nstud, mean0=UNMOD.MEAN.SCORE, sd0=UNMOD.SD.SCORE,
> > pass.score=PASS.SCORE, a.score=A.SCORE,
> > pass.rate=PASS.RATE, a.rate=A.RATE)
> > {   x <- rnorm (nstud, mean0, sd0)
> > q <- quantile (x, 1 - c (pass.rate, a.rate), names=FALSE)
> > dq <- diff (q)
> > q <- (a.score - pass.score) / dq * q
> > y <- pass.score - q [1] + (a.score - pass.score) / dq * x
> > sum (x < a.score & y >= a.score)
> > }
> >
> > sim.nclasses <- function (nclasses, nstud, nstud.std)
> > {   nstud <- rep_len (nstud, nclasses)
> > njump <- integer (nclasses)
> > for (i in 1:nclasses)
> > njump [i] <- sim.njump (nstud [i])
> > if (missing (nstud.std) )
> > njump
> > else
> > round (nstud.std / nstud * njump)
> > }
> >
> > is.under <- function (x, n)
> > var (x) < mean (x) * (1 - mean (x) / n)
> >
> > njump.hom <- sim.nclasses (NCLASSES, NSTUD.CONST)
> > nstud <- round (runif (NCLASSES, NSTUD.NONCONST.LIMS [1],
> > NSTUD.NONCONST.LIMS [2]) )
> > njump.het <- sim.nclasses (NCLASSES, nstud, NSTUD.CONST)
> >
> > under.hom <- is.under (njump.hom, NSTUD.CONST)
> > under.het <- is.under (njump.het, NSTUD.CONST)
> > main.hom <- paste0 ("const class size (under=", under.hom, ")")
> > main.het <- paste0 ("diff class sizes (under=", under.het, ")")
> >
> > p0 <- par (mfrow = c (2, 1) )
> > hist (njump.hom, main=main.hom)
> > hist (njump.het, main=main.het)
> > par (p0)
> > code
> >
> > best,
> > B.
> >
> >
> > On Thu, Mar 25, 2021 at 2:33 PM Rolf Turner  wrote:
> > >
> > >
> > > I would like a real-life example of a data set which one might think to
> > > model by a binomial distribution, but which is substantially
> > > underdispersed. I.e. a sample X = {X_1, X_2, ..., X_N} where each X_i
> > > is an integer between 0 and n (n known a priori) such that var(X) <<
> > > mean(X)*(1 - mean(X)/n).
> > >
> > > Does anyone know of any such examples?  Do any exist?  I've done
> > > a perfunctory web search, and had a look at "A Handbook of Small
> > > Data Sets" by Hand, Daly, Lunn, et al., and drawn a blank.
> > >
> > > I've seen on the web some references to underdispersed "pseudo-Poisson"
> > > data, but not to underdispersed "pseudo-binomial" data.  And of course
> > > there's lots of *over* dispersed stuff.  But that's not what I want.
> > >
> > > I can *simulate* data sets of the sort that I am looking for (so far the
> > > only ideas I've had for doing this are pretty simplistic and
> > > artificial) but I'd like to get my hands on a *real* example, if
> > > possible.
> > >
> > > Grateful for any pointers/suggestions.
> > >
> > > cheers,
> > >
> > > Rolf Turner
> > >
> > > --
> > > Honorary Research Fellow
> > > Department of Statistics
> > > University of Auckland
> > > Phone: +64-9-373-7599 ext. 88276
> > >
> > > __
> > > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide 
> > > http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.

Re: [R] Off topic --- underdispersed (pseudo) binomial data.

2021-03-27 Thread Abby Spurdle
Sorry.
I just realized, after posting, that the "n" value in the dispersion
calculation isn't correct.
I'll have to revisit the simulation, tomorrow.

On Sat, Mar 27, 2021 at 9:11 PM Abby Spurdle  wrote:
>
> Hi Rolf,
>
> Let's say we have a course called Corgiology 101, with a single moderated 
> exam.
> And let's say the moderators transform initial exam scores, such that
> there are fixed percentages of pass rates and A grades.
>
> Rather than count the number of passes, we can count the number of "jumps".
> That is, the number of people that pass the corgiology exam after
> moderation, that would not have passed without moderation.
>
> I've created a function to test for underdispersion, based on your expression.
> (I hope I got it right).
>
> Then I've gone on to create simulations, using both constant and
> nonconstant class sizes.
> The nonconstant simulations apply an (approx) discrete scaling
> transformation, referred to previously.
>
> We can see from the examples that there are a lot of these jumps.
> And more importantly, they appear to be underdispersed.
>
> code
> PASS.SCORE <- 0.5
> A.SCORE <- 0.8
>
> #target parameters
> PASS.RATE <- 0.8
> A.RATE <- 0.2
> #unmoderated parameters
> UNMOD.MEAN.SCORE <- 0.65
> UNMOD.SD.SCORE <- 0.075
>
> NCLASSES <- 2000
> NSTUD.CONST <- 200
> NSTUD.NONCONST.LIMS <- c (50, 800)
>
> sim.njump <- function (nstud, mean0=UNMOD.MEAN.SCORE, sd0=UNMOD.SD.SCORE,
> pass.score=PASS.SCORE, a.score=A.SCORE,
> pass.rate=PASS.RATE, a.rate=A.RATE)
> {   x <- rnorm (nstud, mean0, sd0)
> q <- quantile (x, 1 - c (pass.rate, a.rate), names=FALSE)
> dq <- diff (q)
> q <- (a.score - pass.score) / dq * q
> y <- pass.score - q [1] + (a.score - pass.score) / dq * x
> sum (x < a.score & y >= a.score)
> }
>
> sim.nclasses <- function (nclasses, nstud, nstud.std)
> {   nstud <- rep_len (nstud, nclasses)
> njump <- integer (nclasses)
> for (i in 1:nclasses)
> njump [i] <- sim.njump (nstud [i])
> if (missing (nstud.std) )
> njump
> else
> round (nstud.std / nstud * njump)
> }
>
> is.under <- function (x, n)
> var (x) < mean (x) * (1 - mean (x) / n)
>
> njump.hom <- sim.nclasses (NCLASSES, NSTUD.CONST)
> nstud <- round (runif (NCLASSES, NSTUD.NONCONST.LIMS [1],
> NSTUD.NONCONST.LIMS [2]) )
> njump.het <- sim.nclasses (NCLASSES, nstud, NSTUD.CONST)
>
> under.hom <- is.under (njump.hom, NSTUD.CONST)
> under.het <- is.under (njump.het, NSTUD.CONST)
> main.hom <- paste0 ("const class size (under=", under.hom, ")")
> main.het <- paste0 ("diff class sizes (under=", under.het, ")")
>
> p0 <- par (mfrow = c (2, 1) )
> hist (njump.hom, main=main.hom)
> hist (njump.het, main=main.het)
> par (p0)
> code
>
> best,
> B.
>
>
> On Thu, Mar 25, 2021 at 2:33 PM Rolf Turner  wrote:
> >
> >
> > I would like a real-life example of a data set which one might think to
> > model by a binomial distribution, but which is substantially
> > underdispersed. I.e. a sample X = {X_1, X_2, ..., X_N} where each X_i
> > is an integer between 0 and n (n known a priori) such that var(X) <<
> > mean(X)*(1 - mean(X)/n).
> >
> > Does anyone know of any such examples?  Do any exist?  I've done
> > a perfunctory web search, and had a look at "A Handbook of Small
> > Data Sets" by Hand, Daly, Lunn, et al., and drawn a blank.
> >
> > I've seen on the web some references to underdispersed "pseudo-Poisson"
> > data, but not to underdispersed "pseudo-binomial" data.  And of course
> > there's lots of *over* dispersed stuff.  But that's not what I want.
> >
> > I can *simulate* data sets of the sort that I am looking for (so far the
> > only ideas I've had for doing this are pretty simplistic and
> > artificial) but I'd like to get my hands on a *real* example, if
> > possible.
> >
> > Grateful for any pointers/suggestions.
> >
> > cheers,
> >
> > Rolf Turner
> >
> > --
> > Honorary Research Fellow
> > Department of Statistics
> > University of Auckland
> > Phone: +64-9-373-7599 ext. 88276
> >



Re: [R] Off topic --- underdispersed (pseudo) binomial data.

2021-03-27 Thread Abby Spurdle
Hi Rolf,

Let's say we have a course called Corgiology 101, with a single moderated exam.
And let's say the moderators transform initial exam scores, such that
there are fixed percentages of pass rates and A grades.

Rather than count the number of passes, we can count the number of "jumps".
That is, the number of people that pass the corgiology exam after
moderation, that would not have passed without moderation.

I've created a function to test for underdispersion, based on your expression.
(I hope I got it right).

Then I've gone on to create simulations, using both constant and
nonconstant class sizes.
The nonconstant simulations apply an (approx) discrete scaling
transformation, referred to previously.

We can see from the examples that there are a lot of these jumps.
And more importantly, they appear to be underdispersed.

code
#score thresholds, on the moderated scale
PASS.SCORE <- 0.5
A.SCORE <- 0.8

#target parameters
PASS.RATE <- 0.8
A.RATE <- 0.2
#unmoderated parameters
UNMOD.MEAN.SCORE <- 0.65
UNMOD.SD.SCORE <- 0.075

NCLASSES <- 2000
NSTUD.CONST <- 200
NSTUD.NONCONST.LIMS <- c (50, 800)

#simulate one class of nstud students, and count the "jumps"
sim.njump <- function (nstud, mean0=UNMOD.MEAN.SCORE, sd0=UNMOD.SD.SCORE,
    pass.score=PASS.SCORE, a.score=A.SCORE,
    pass.rate=PASS.RATE, a.rate=A.RATE)
{   #unmoderated scores
    x <- rnorm (nstud, mean0, sd0)
    #unmoderated scores at the pass and A-grade cutoffs
    q <- quantile (x, 1 - c (pass.rate, a.rate), names=FALSE)
    dq <- diff (q)
    q <- (a.score - pass.score) / dq * q
    #moderated scores: linear map sending the two cutoffs
    #to pass.score and a.score
    y <- pass.score - q [1] + (a.score - pass.score) / dq * x
    #students below a.score before moderation, and at or above it after
    sum (x < a.score & y >= a.score)
}

#jump counts for nclasses classes
#if nstud.std is given, counts are rescaled to that class size
sim.nclasses <- function (nclasses, nstud, nstud.std)
{   nstud <- rep_len (nstud, nclasses)
    njump <- integer (nclasses)
    for (i in 1:nclasses)
        njump [i] <- sim.njump (nstud [i])
    if (missing (nstud.std) )
        njump
    else
        round (nstud.std / nstud * njump)
}

#TRUE if the sample variance is below the binomial variance, for size n
is.under <- function (x, n)
    var (x) < mean (x) * (1 - mean (x) / n)

#constant class sizes
njump.hom <- sim.nclasses (NCLASSES, NSTUD.CONST)
#nonconstant class sizes, rescaled to NSTUD.CONST
nstud <- round (runif (NCLASSES, NSTUD.NONCONST.LIMS [1],
    NSTUD.NONCONST.LIMS [2]) )
njump.het <- sim.nclasses (NCLASSES, nstud, NSTUD.CONST)

under.hom <- is.under (njump.hom, NSTUD.CONST)
under.het <- is.under (njump.het, NSTUD.CONST)
main.hom <- paste0 ("const class size (under=", under.hom, ")")
main.het <- paste0 ("diff class sizes (under=", under.het, ")")

#histograms of both simulations, one above the other
p0 <- par (mfrow = c (2, 1) )
hist (njump.hom, main=main.hom)
hist (njump.het, main=main.het)
par (p0)
code

best,
B.


On Thu, Mar 25, 2021 at 2:33 PM Rolf Turner  wrote:
>
>
> I would like a real-life example of a data set which one might think to
> model by a binomial distribution, but which is substantially
> underdispersed. I.e. a sample X = {X_1, X_2, ..., X_N} where each X_i
> is an integer between 0 and n (n known a priori) such that var(X) <<
> mean(X)*(1 - mean(X)/n).
>
> Does anyone know of any such examples?  Do any exist?  I've done
> a perfunctory web search, and had a look at "A Handbook of Small
> Data Sets" by Hand, Daly, Lunn, et al., and drawn a blank.
>
> I've seen on the web some references to underdispersed "pseudo-Poisson"
> data, but not to underdispersed "pseudo-binomial" data.  And of course
> there's lots of *over* dispersed stuff.  But that's not what I want.
>
> I can *simulate* data sets of the sort that I am looking for (so far the
> only ideas I've had for doing this are pretty simplistic and
> artificial) but I'd like to get my hands on a *real* example, if
> possible.
>
> Grateful for any pointers/suggestions.
>
> cheers,
>
> Rolf Turner
>
> --
> Honorary Research Fellow
> Department of Statistics
> University of Auckland
> Phone: +64-9-373-7599 ext. 88276
>



Re: [R] Off topic --- underdispersed (pseudo) binomial data.

2021-03-26 Thread Duncan Murdoch

On 25/03/2021 10:25 p.m., Rolf Turner wrote:
>
> On Fri, 26 Mar 2021 13:41:00 +1300
> Abby Spurdle  wrote:
>
> > I haven't checked this, but I guess that the number of students that
> > *pass* a particular exam/subject, per semester would be like that.
> >
> > e.g.
> > Let's say you have a course in maximum likelihood, that's taught once
> > per year to 3rd year students, and a few postgrads.
> > You could count the number of passes, each year.
> >
> > If you assume a near-constant probability of passing in each
> > exam/semester: Then I would assume it would follow the distribution
> > that you're requesting.
>
> Thanks Abby.  I've experimented (simulated) a wee bit and found
> that if I keep the numbers of students (undergrad and grad) exactly
> constant, then the results are underdispersed.  However if the
> numbers are allowed to vary then the results are overdispersed.
>
> It seems that the universe is very reluctant to produce underdispersed
> pseudo-binomial data!

I'd expect underdispersion to happen in competitive situations:  if 
subject A succeeds, that makes it less likely that other subjects will 
also succeed.


An extreme case is a contest winner.  With some contests there will 
always be one winner (a little too-underdispersed for you, probably), 
but others allow a small amount of variation.


For example, sports events that allow ties.  This page 
https://en.wikipedia.org/wiki/List_of_ties_for_medals_at_the_Olympics 
seems to indicate that speed skating had a lot of ties up until 1980.
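
A minimal sketch of the competitive mechanism, under an assumed setup that is not from the thread: K prizes are handed out at random, without replacement, among M competitors, and X counts the prize-winners in one team of g members. Each prize won by a team member leaves fewer for the others, so X is hypergeometric and should be underdispersed relative to a binomial with size g. The values of M, g, K and N are illustrative.

```r
# Assumed setup: K prizes drawn without replacement among M competitors;
# x counts the winners belonging to one team of g members (0 <= x <= g).
set.seed(1)
M <- 50      # competitors
g <- 10      # team size (the binomial "n")
K <- 20      # prizes
N <- 5000    # simulated contests

# rhyper(nn, m, n, k): m "white" balls (team), n "black" (others), k drawn
x <- rhyper(N, m = g, n = M - g, k = K)

# Ratio of the sample variance to the binomial reference variance;
# for this setup theory gives (M - g) / (M - 1), which is below 1.
ratio <- var(x) / (mean(x) * (1 - mean(x) / g))
print(ratio)
```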


Duncan Murdoch



Re: [R] Off topic --- underdispersed (pseudo) binomial data.

2021-03-25 Thread Rolf Turner


On Fri, 26 Mar 2021 13:41:00 +1300
Abby Spurdle  wrote:

> I haven't checked this, but I guess that the number of students that
> *pass* a particular exam/subject, per semester would be like that.
> 
> e.g.
> Let's say you have a course in maximum likelihood, that's taught once
> per year to 3rd year students, and a few postgrads.
> You could count the number of passes, each year.
> 
> If you assume a near-constant probability of passing in each
> exam/semester: Then I would assume it would follow the distribution
> that you're requesting.



Thanks Abby.  I've experimented (simulated) a wee bit and found
that if I keep the numbers of students (undergrad and grad) exactly
constant, then the results are underdispersed.  However if the
numbers are allowed to vary then the results are overdispersed.

It seems that the universe is very reluctant to produce underdispersed
pseudo-binomial data!
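
The thread does not say how this simulation was set up, so the following is only a sketch of one mechanism that behaves this way: two groups with different pass probabilities. With fixed group sizes the total is a sum of two binomials, whose variance is below that of a single binomial with the same mean; letting the sizes vary adds between-year variance. The pass probabilities, group sizes and the helper disp.ratio are all assumptions.

```r
set.seed(1)
N <- 10000                  # simulated years
p.ug <- 0.5; p.pg <- 0.95   # assumed pass probabilities (undergrad, postgrad)

# Sample variance relative to the binomial reference variance for size n
disp.ratio <- function(x, n)
    var(x) / (mean(x) * (1 - mean(x) / n))

# (a) group sizes held exactly constant: 15 + 15 students every year
x.fixed <- rbinom(N, 15, p.ug) + rbinom(N, 15, p.pg)
ratio.fixed <- disp.ratio(x.fixed, n = 30)

# (b) group sizes vary between 10 and 20 each, from year to year;
#     n is taken as the largest possible class size, 40
size.ug <- sample(10:20, N, replace = TRUE)
size.pg <- sample(10:20, N, replace = TRUE)
x.vary <- rbinom(N, size.ug, p.ug) + rbinom(N, size.pg, p.pg)
ratio.vary <- disp.ratio(x.vary, n = 40)

print(c(fixed = ratio.fixed, varying = ratio.vary))
```

A ratio below 1 indicates underdispersion and a ratio above 1 overdispersion.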

cheers,

Rolf

-- 
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276



Re: [R] Off topic --- underdispersed (pseudo) binomial data.

2021-03-25 Thread Abby Spurdle
I haven't checked this, but I guess that the number of students that
*pass* a particular exam/subject, per semester would be like that.

e.g.
Let's say you have a course in maximum likelihood, that's taught once
per year to 3rd year students, and a few postgrads.
You could count the number of passes, each year.

If you assume a near-constant probability of passing in each exam/semester:
Then I would assume it would follow the distribution that you're requesting.

If there is a significant change in the number of students:
Let's say that fewer and fewer students study maximum likelihood because
they would rather study "advanced" R programming for "data science"
with "large data". Then you might be able to apply some sort of
discrete-scaling transformation to the number of passes each semester.
This would allow you to pretend that the number of people studying
maximum likelihood is the same, and no one is studying other
apparently more important subjects.
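
A minimal sketch of one possible discrete-scaling transformation, and of its effect on dispersion. Multiplying a binomial count by a constant c multiplies its mean by c but its variance by c^2, so pass counts rescaled from a class of m students to a standard size n.std have roughly n.std/m times the binomial variance. The function scaled.ratio and all parameter values are assumptions, not taken from the thread.

```r
set.seed(1)
N <- 5000        # simulated semesters per scenario
p <- 0.8         # assumed constant pass probability
n.std <- 200     # standard class size to rescale to

# Dispersion ratio after rescaling pass counts from classes of size m
scaled.ratio <- function(m) {
    x <- rbinom(N, size = m, prob = p)   # passes in a class of m students
    y <- round(n.std / m * x)            # rescaled to n.std students
    var(y) / (mean(y) * (1 - mean(y) / n.std))
}

ratio.big   <- scaled.ratio(800)   # scaled down: expect roughly 200/800
ratio.small <- scaled.ratio(50)    # scaled up:   expect roughly 200/50
print(c(big = ratio.big, small = ratio.small))
```

Under these assumptions, counts scaled down from larger classes look underdispersed, and counts scaled up from smaller classes look overdispersed.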


On Thu, Mar 25, 2021 at 2:33 PM Rolf Turner  wrote:
>
>
> I would like a real-life example of a data set which one might think to
> model by a binomial distribution, but which is substantially
> underdispersed. I.e. a sample X = {X_1, X_2, ..., X_N} where each X_i
> is an integer between 0 and n (n known a priori) such that var(X) <<
> mean(X)*(1 - mean(X)/n).
>
> Does anyone know of any such examples?  Do any exist?  I've done
> a perfunctory web search, and had a look at "A Handbook of Small
> Data Sets" by Hand, Daly, Lunn, et al., and drawn a blank.
>
> I've seen on the web some references to underdispersed "pseudo-Poisson"
> data, but not to underdispersed "pseudo-binomial" data.  And of course
> there's lots of *over* dispersed stuff.  But that's not what I want.
>
> I can *simulate* data sets of the sort that I am looking for (so far the
> only ideas I've had for doing this are pretty simplistic and
> artificial) but I'd like to get my hands on a *real* example, if
> possible.
>
> Grateful for any pointers/suggestions.
>
> cheers,
>
> Rolf Turner
>
> --
> Honorary Research Fellow
> Department of Statistics
> University of Auckland
> Phone: +64-9-373-7599 ext. 88276
>



Re: [R] Off topic --- underdispersed (pseudo) binomial data.

2021-03-24 Thread Rolf Turner


On Wed, 24 Mar 2021 18:45:01 -0700
 wrote:

> "X = {X_1, X_2, ..., X_N} where each X_i
> is an integer between 0 and n (n known a priori)"
> 
> That is a multinomial, not a binomial distribution. A binomial
> distribution can have only two values, success or failure.
> 
> What have I misunderstood?

And then, following up:

> Oh, I think I get what you mean -- you are drawing repeated samples
> from a binomial with n trials and you are counting the number of
> successes for each.

Yes.  Exactly.  Sorry if my post was unclear.
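
As a baseline, here is a minimal sketch of that setup with genuinely binomial data: N samples, each counting the successes in n trials, compared against the expression mean(X)*(1 - mean(X)/n). The values of N, n and p are illustrative.

```r
# Repeated draws from a genuine binomial: each x[i] counts the successes
# in n trials. N, n and p are illustrative, not taken from the thread.
set.seed(42)
N <- 10000   # number of samples
n <- 20      # trials per sample, known a priori
p <- 0.3     # success probability

x <- rbinom(N, size = n, prob = p)

# Ratio of the sample variance to the binomial reference variance;
# values well below 1 would indicate underdispersion.
ratio <- var(x) / (mean(x) * (1 - mean(x) / n))
print(ratio)
```

For binomial data the ratio should be close to 1. The data sets being asked for are those where it is substantially smaller.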

cheers,

Rolf Turner

-- 
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
