Re: [Rd] Bug in model.matrix.default for higher-order interaction encoding when specific model terms are missing

2017-11-04 Thread Tyler
Hi Arie,

I understand what you're saying. The following excerpt from the book
shows that F_j does not refer exclusively to categorical factors: "...the
rule does not do anything special for them, and it remains valid, in a
trivial sense, whenever any of the F_j is numeric rather than categorical."
Since F_j refers to both categorical and numeric variables, the behavior of
model.matrix is not consistent with the heuristic.
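
For concreteness, here is a minimal sketch of the behavior in question
(hypothetical data, with X1 and X2 numeric and X3 a three-level factor):

set.seed(1)
d <- data.frame(X1 = rnorm(12), X2 = rnorm(12),
                X3 = factor(rep(c("a", "b", "c"), 4)))

# With the lower-order interaction X1:X2 dropped, X3 enters the three-way
# interaction via dummy variables (one column per level of X3):
colnames(model.matrix(~ (X1 + X2 + X3)^3 - X1:X2, d))

# With the full formula, X3 is coded by contrasts throughout:
colnames(model.matrix(~ (X1 + X2 + X3)^3, d))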

Best regards,
Tyler

On Sat, Nov 4, 2017 at 6:50 AM, Arie ten Cate  wrote:

> Hello Tyler,
>
> I rephrase my previous mail, as follows:
>
> In your example, T_i = X1:X2:X3. Let F_j = X3. (The numerical
> variables X1 and X2 are not encoded at all.) Then T_{i(j)} = X1:X2,
> which in the example is dropped from the model. Hence the X3 in T_i
> must be encoded by dummy variables, as indeed it is.
>
>   Arie
>
>
> On Thu, Nov 2, 2017 at 4:11 PM, Tyler  wrote:
> > Hi Arie,
> >
> > The book on which this behavior is based does not use "factor" (in this
> > section) to refer to a categorical factor. I will again point to this
> > sentence, from page 40, in the same section and referring to the behavior
> > in question, which shows that F_j is not limited to categorical factors:
> > "Numeric variables appear in the computations as themselves, uncoded.
> > Therefore, the rule does not do anything special for them, and it remains
> > valid, in a trivial sense, whenever any of the F_j is numeric rather than
> > categorical."
> >
> > Note the "... whenever any of the F_j is numeric rather than categorical."
> > Factor here is used in the more general sense of the word, not referring to
> > the R type "factor." The behavior of R does not match the heuristic that
> > it's citing.
> >
> > Best regards,
> > Tyler
> >
> > On Thu, Nov 2, 2017 at 2:51 AM, Arie ten Cate  wrote:
> >>
> >> Hello Tyler,
> >>
> >> Thank you for searching for, and finding, the basic description of the
> >> behavior of R in this matter.
> >>
> >> I think your example is in agreement with the book.
> >>
> >> But let me first note the following. You write: "F_j refers to a
> >> factor (variable) in a model and not a categorical factor". However:
> >> "a factor is a vector object used to specify a discrete
> >> classification" (start of chapter 4 of "An Introduction to R".) You
> >> might also see the description of the R function factor().
> >>
> >> You note that the book says about a factor F_j:
> >>   "... F_j is coded by contrasts if T_{i(j)} has appeared in the
> >> formula and by dummy variables if it has not"
> >>
> >> You find:
> >>"However, the example I gave demonstrated that this dummy variable
> >> encoding only occurs for the model where the missing term is the
> >> numeric-numeric interaction, ~(X1+X2+X3)^3-X1:X2."
> >>
> >> We have here T_i = X1:X2:X3. Also: F_j = X3 (the only factor). Then
> >> T_{i(j)} = X1:X2, which is dropped from the model. Hence the X3 in T_i
> >> must be encoded by dummy variables, as indeed it is.
> >>
> >>   Arie
> >>
> >> On Tue, Oct 31, 2017 at 4:01 PM, Tyler  wrote:
> >> > Hi Arie,
> >> >
> >> > Thank you for your further research into the issue.
> >> >
> >> > Regarding Stata: On the other hand, JMP gives model matrices that use
> >> > the main effects contrasts in computing the higher-order interactions,
> >> > without the dummy variable encoding. I verified this both by analyzing
> >> > the linear model given in my first example and noting that JMP has one
> >> > more degree of freedom than R for the same model, and by looking at the
> >> > generated model matrices. It's easy to find a design where JMP will
> >> > allow us to fit our model with goodness-of-fit estimates and R will
> >> > not, due to the extra degree(s) of freedom required. Let's keep the
> >> > conversation limited to R.
> >> >
> >> > I want to refocus on my original bug report, which was not for a
> >> > missing main effects term, but rather for a missing lower-order
> >> > interaction term. The behavior of model.matrix.default() for a missing
> >> > main effects term is a nice example to demonstrate how model.matrix
> >> > encodes with dummy variables instead of contrasts, but it doesn't
> >> > demonstrate the inconsistent behavior my bug report highlighted.
> >> >
> >> > I went looking for documentation on this behavior, and the issue stems
> >> > not from model.matrix.default() but from the terms() function's
> >> > interpretation of the formula. This "clever" replacement of contrasts
> >> > by dummy variables to maintain marginality (presuming that's the
> >> > reason) is not described anywhere in the documentation for either the
> >> > model.matrix() or the terms() function. In order to find a description
> >> > of the behavior, I had to look in the underlying C code, buried above
> >> > the "TermCode" function of the
> 

Re: [Rd] Extreme bunching of random values from runif with Mersenne-Twister seed

2017-11-04 Thread Radford Neal
In the code below, you seem to be essentially using the random number
generator to implement a hash function.  This isn't a good idea.

My impression is that pseudo-random number generation methods are
generally evaluated by whether the sequence produced from any seed
"appears" to be random.  Informally, there may be some effort to make
long sequences started with seeds 1, 2, 3, etc. appear unrelated,
since that is a common use pattern when running a simulation several
times to check on variability.  But you are relying on the FIRST
number from each sequence being apparently unrelated to the seed.  
I think few or none of the people designing pseudo-random number
generators evaluate their methods by that criterion.

There is, however, a large literature on hash functions, which is
what you should look at.
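
As one possible direction, here is a rough sketch that derives the value
from a hash instead of the RNG stream (this assumes the 'digest' package
is acceptable; the particular mapping and constants are only illustrative):

library(digest)

# Map an ID to a value in [17, 26) via a hash rather than via set.seed/runif:
hash_unif <- function(id, min = 17, max = 26) {
  h <- digest(id, algo = "md5")                     # 32-character hex string
  u <- strtoi(substr(h, 1, 7), base = 16L) / 16^7   # first 28 bits -> [0, 1)
  min + u * (max - min)
}

sapply(c(86548915L, 86551615L, 86566163L), hash_unif)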

But if you want a quick fix, perhaps looking not at the first number
in the sequence, but rather (say) the 10th, might be preferable.
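
A minimal sketch of that quick fix, using a few of the seeds from the
quoted code below:

seeds <- c(86548915L, 86551615L, 86566163L, 86577411L, 86584144L)

random_values <- sapply(seeds, function(x) {
  set.seed(x)
  runif(10, 17, 26)[10]   # use the 10th draw rather than the 1st
})
summary(random_values)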

   Radford Neal


> > seeds = c(86548915L, 86551615L, 86566163L, 86577411L, 86584144L, 86584272L,
> +   86620568L, 86724613L, 86756002L, 86768593L, 86772411L, 86781516L,
> +   86794389L, 86805854L, 86814600L, 86835092L, 86874179L, 86876466L,
> +   86901193L, 86987847L, 86988080L)
> >
> > random_values = sapply(seeds, function(x) {
> +   set.seed(x)
> +   y = runif(1, 17, 26)
> +   return(y)
> + })
> >
> > summary(random_values)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>   25.13   25.36   25.66   25.58   25.83   25.94



[Rd] Possible bug with a regex pattern

2017-11-04 Thread Etienne Sanchez
Dear list,

The following behaviour looks like a bug:

grepl("ab{,2}c", "abbbc")
# [1] TRUE

I would expect either FALSE or an "invalid regex" error.
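
For comparison, a quick sketch of variants that behave as I would expect
(results as I understand the TRE and PCRE engines; worth double-checking):

grepl("ab{,2}c", "abbbc")               # TRUE  -- the surprising TRE result
grepl("ab{0,2}c", "abbbc")              # FALSE -- explicit lower bound
grepl("ab{,2}c", "abbbc", perl = TRUE)  # FALSE -- PCRE does not treat {,2}
                                        #          as a quantifier here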

More details can be found in the question I originally asked here:
https://stackoverflow.com/q/4664/6656269

I've also opened an issue on the TRE github:
https://github.com/laurikari/tre/issues/62


Best,

Etienne



[Rd] ans[nas] <- NA in 'ifelse' (was: ifelse() woes ... can we agree on a ifelse2() ?)

2017-11-04 Thread Suharto Anggono Suharto Anggono via R-devel
The removal of
ans[nas] <- NA
from the code of function 'ifelse' in R has not been committed (yet). Why?


On Mon, 28/11/16, Martin Maechler  wrote:

 Subject: Re: [Rd] ifelse() woes ... can we agree on a ifelse2() ?

 Cc: R-devel@r-project.org, maech...@stat.math.ethz.ch
 Date: Monday, 28 November, 2016, 10:00 PM
 
> Suharto Anggono Suharto Anggono via R-devel 
> on Sat, 26 Nov 2016 17:14:01 + writes:

...


> On current 'ifelse' code in R:

> * The part
> ans[nas] <- NA
> could be omitted because NA's are already in place.
> If the part is removed, variable 'nas' is no longer used.

I agree that this seems logical.  If I apply the change, R's own
full checks do not seem affected, and I may try to commit that
change and "wait and see".


...



Re: [Rd] Bug in model.matrix.default for higher-order interaction encoding when specific model terms are missing

2017-11-04 Thread Arie ten Cate
Hello Tyler,

I rephrase my previous mail, as follows:

In your example, T_i = X1:X2:X3. Let F_j = X3. (The numerical
variables X1 and X2 are not encoded at all.) Then T_{i(j)} = X1:X2,
which in the example is dropped from the model. Hence the X3 in T_i
must be encoded by dummy variables, as indeed it is.
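
The coding decision can also be inspected on the terms object itself; a
small sketch (my reading of ?terms.object is that in the "factors"
attribute a 1 marks coding by contrasts and a 2 coding by dummy variables):

# Purely symbolic; no data are needed for the coding decision itself:
attr(terms(~ (X1 + X2 + X3)^3 - X1:X2), "factors")
# In the X1:X2:X3 column, X3 is marked 2 (dummy variables) because its
# margin X1:X2 is absent from the formula, while X1 and X2 are marked 1.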

  Arie


On Thu, Nov 2, 2017 at 4:11 PM, Tyler  wrote:
> Hi Arie,
>
> The book on which this behavior is based does not use "factor" (in this
> section) to refer to a categorical factor. I will again point to this
> sentence, from page 40, in the same section and referring to the behavior
> in question, which shows that F_j is not limited to categorical factors:
> "Numeric variables appear in the computations as themselves, uncoded.
> Therefore, the rule does not do anything special for them, and it remains
> valid, in a trivial sense, whenever any of the F_j is numeric rather than
> categorical."
>
> Note the "... whenever any of the F_j is numeric rather than categorical."
> Factor here is used in the more general sense of the word, not referring to
> the R type "factor." The behavior of R does not match the heuristic that
> it's citing.
>
> Best regards,
> Tyler
>
> On Thu, Nov 2, 2017 at 2:51 AM, Arie ten Cate  wrote:
>>
>> Hello Tyler,
>>
>> Thank you for searching for, and finding, the basic description of the
>> behavior of R in this matter.
>>
>> I think your example is in agreement with the book.
>>
>> But let me first note the following. You write: "F_j refers to a
>> factor (variable) in a model and not a categorical factor". However:
>> "a factor is a vector object used to specify a discrete
>> classification" (start of chapter 4 of "An Introduction to R".) You
>> might also see the description of the R function factor().
>>
>> You note that the book says about a factor F_j:
>>   "... F_j is coded by contrasts if T_{i(j)} has appeared in the
>> formula and by dummy variables if it has not"
>>
>> You find:
>>"However, the example I gave demonstrated that this dummy variable
>> encoding only occurs for the model where the missing term is the
>> numeric-numeric interaction, ~(X1+X2+X3)^3-X1:X2."
>>
>> We have here T_i = X1:X2:X3. Also: F_j = X3 (the only factor). Then
>> T_{i(j)} = X1:X2, which is dropped from the model. Hence the X3 in T_i
>> must be encoded by dummy variables, as indeed it is.
>>
>>   Arie
>>
>> On Tue, Oct 31, 2017 at 4:01 PM, Tyler  wrote:
>> > Hi Arie,
>> >
>> > Thank you for your further research into the issue.
>> >
>> > Regarding Stata: On the other hand, JMP gives model matrices that use
>> > the main effects contrasts in computing the higher-order interactions,
>> > without the dummy variable encoding. I verified this both by analyzing
>> > the linear model given in my first example and noting that JMP has one
>> > more degree of freedom than R for the same model, and by looking at the
>> > generated model matrices. It's easy to find a design where JMP will
>> > allow us to fit our model with goodness-of-fit estimates and R will
>> > not, due to the extra degree(s) of freedom required. Let's keep the
>> > conversation limited to R.
>> >
>> > I want to refocus on my original bug report, which was not for a
>> > missing main effects term, but rather for a missing lower-order
>> > interaction term. The behavior of model.matrix.default() for a missing
>> > main effects term is a nice example to demonstrate how model.matrix
>> > encodes with dummy variables instead of contrasts, but it doesn't
>> > demonstrate the inconsistent behavior my bug report highlighted.
>> >
>> > I went looking for documentation on this behavior, and the issue stems
>> > not from model.matrix.default() but from the terms() function's
>> > interpretation of the formula. This "clever" replacement of contrasts
>> > by dummy variables to maintain marginality (presuming that's the
>> > reason) is not described anywhere in the documentation for either the
>> > model.matrix() or the terms() function. In order to find a description
>> > of the behavior, I had to look in the underlying C code, buried above
>> > the "TermCode" function of the "model.c" file, which says:
>> >
>> > "TermCode decides on the encoding of a model term. Returns 1 if variable
>> > ``whichBit'' in ``thisTerm'' is to be encoded by contrasts and 2 if it
>> > is to
>> > be encoded by dummy variables.  This is decided using the heuristic
>> > described in Statistical Models in S, page 38."
>> >
>> > I do not have a copy of this book, and I suspect most R users do not
>> > either. Thankfully, however, some of the pages describing this behavior
>> > were available as part of Amazon's "Look Inside" feature--but if not for
>> > that, I would have no idea what heuristic R was using. Since those pages
>> > could be made unavailable by Amazon at any time, at the very least we
>> > have a problem with