Hi Francesco,

(cc-ing the list so that this is part of the archives that others can use,
and so that others can chime in/correct, etc. I seem to have accidentally
not pressed reply all in my first reply to you. I've removed your email in
case you for some reason hadn't intended it to be shared with the list.)

Thank you for the additional information. I am getting a clearer picture.
It seems like the correct outcome is overall *un*common in your data.
Assuming that you
treatment-coded the group variable, two of the groups have correct answers
on about 26% of the trials and one group (Group3) has correct answers on
about 10% of the trials. Does that sound about right? So that's a pretty
decent effect for 40 subjects and 75 items. It is not surprising that the
contrast of Group3 reaches significance. If your theory *pre*dicted this,
rather than you having conducted many tests and then having arrived at this
significance (cf. guidelines in e.g. Simmons et al 2011, which are much
more important than all of this effect size back and forth), it strikes me
as the type of effect that should pass muster.
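
In case it's useful, here is a quick way to double-check those proportions.
This is just a sketch that assumes the priming data frame and the column
names from your glmer call below, and that Correct is coded as 0/1:

library(dplyr)

# proportion of correct responses and number of trials per group
priming %>%
  group_by(Group) %>%
  summarise(prop_correct = mean(Correct), n_trials = n())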

Now, just to be sure that you're doing the right thing: if each group sees
all items, you should have by-item slopes for group. Sometimes the answer
to this requires thinking about what you mean by "items". E.g., if the same
token has a different item ID because it was shown to another group, that
would usually mean that you're coding item wrong. You should then change
the item coding so that all stimuli that were meant to form an item by
design have the same item ID. If group is then within-item, you should add
the random slope for group by item.
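
Concretely, and just as a sketch based on the variable names in your model
below (whether this exact random effect structure is warranted, or will
converge, is something you'd have to check), that would look like:

mod2 <- glmer(Correct ~ Group + (1 | Participant) + (1 + Group | Item),
              data = priming, family = binomial)
# (1 + Group | Item) adds the by-item random slope for group; this assumes
# that all stimuli belonging to the same designed item now share one item ID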

So much for that. Now, about the power analyses. Here are some thoughts on
this issue:

   1. Yes, what you are doing is an observed power analysis (you already
   have the data), and yes ideally you should conduct power analyses *prior*
   to running the experiment. I can't emphasize enough just how insightful
   this can be. At the same time, I sympathize with the situation you find
   yourself in. Often we learn about how we should do things after the fact,
   and that doesn't necessarily make our work uninformative or less relevant
   for scientific goals.
   2. In your email you seem to contrast the bad reputation of observed
   power sims with the desirability of effect size measures (perhaps that's
   not what you meant). So just to be clear: effect size measures are just as
   much based on observed data. I.e., they have all the same weaknesses that
   the observed power analysis has.
   3. For what it's worth, one can ameliorate the downsides of observed
   power analyses by adequately incorporating uncertainty about the effect
   size into the power simulation. Since I don't really use simr, I don't
   know whether that package does that, but I would assume so (at least to
   the extent that non-Bayesian analyses capture the relevant uncertainties).
   This would be worth checking. You can also always provide additional
   analyses, e.g., by taking the lower (as in closer to zero) bound of the
   95% CI of your effect as the effect size measure (see the sketch after
   this list for one way to do this). I think of that as a quick but
   effective way to get an idea of the range of power one might expect. For
   an example, see Melis et al 2017 "Satellite- vs. Verb-Framing
   Underpredicts Nonverbal Motion Categorization: Insights from a Large
   Language Sample and Simulations", Figures 4-5.
   4. Of course, 20 simulations would be way too few. We usually conduct at
   least 1000. In your case, though, you will have close to ceiling power
   provided your item specification doesn't have to change.
   5. Apologies, but I don't have time to go through the tutorial right
   now. Since I'm not using simr, I can't say much about it, but perhaps
   others can. Could it be that you coded Correct as logical, or as
   character? When I just tried to run a binomial glmer power simulation,
   there was no issue, so that is not a limitation.
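
Regarding points 3 and 4, here is a rough sketch of what such a simulation
could look like without simr, using only lme4. I'm reusing the variable
names from your model below; the coefficient label "GroupGroup3" is a guess
on my part (check summary(mod1) for the actual name), and the whole thing
is meant as an illustration, not a recipe:

library(lme4)

# 95% Wald CIs for the fixed effects (quick; profile or bootstrap CIs
# would be slower but arguably better)
ci <- confint(mod1, method = "Wald")

# set the Group3 contrast to the CI bound that is closer to zero
beta <- fixef(mod1)
bounds <- ci["GroupGroup3", ]
beta["GroupGroup3"] <- bounds[which.min(abs(bounds))]

# simulate new responses under that smaller effect, refit the model, and
# record the p-value of the Group3 contrast (expect convergence warnings
# on some simulated data sets)
n_sims <- 1000
p_vals <- replicate(n_sims, {
  y_sim <- simulate(mod1, newparams = list(beta = beta,
                                           theta = getME(mod1, "theta")))[[1]]
  fit <- refit(mod1, y_sim)
  coef(summary(fit))["GroupGroup3", "Pr(>|z|)"]
})
mean(p_vals < .05)  # estimated power for the Group3 contrast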

Florian


On Sat, Jun 27, 2020 at 11:20 AM Francesco Romano <fbroman...@gmail.com>
wrote:
[...]

>
> On Sat, Jun 27, 2020 at 12:36 AM Florian Jaeger <timegu...@gmail.com>
> wrote:
>
>> Dear Francesco,
>>
>> I have not (yet) read the Brysbaert and Stevens paper, but as you said,
>> effect sizes can mean all kinds of things. In your case, it seems that the
>> reviewers' question is about replicability. In that case, all you need to
>> know is how likely this effect is to replicate. I think what people usually
>> mean by this is how likely the effect is to replicate in pretty much the
>> same type of experiment (e.g., even Cohen's d, etc. do depend on the
>> paradigm, dependent variable, and task, among many other things). For this,
>> I personally find power estimates (or Bayesian extensions of it) more
>> intuitive and more informative than a single effect size measure. There are
>> quite a few examples out there by now of how you can explore in more or
>> less depth how much power you have under various assumptions about your
>> data (any effect size measure will make assumptions, too). You can do so
>> while considering uncertainty about the true size of the effect of interest
>> as well as any other parameters in the model. The simr package is one
>> example, or you can create something more directly made for your own goals.
>> Thanks to the tidyverse, dplyr, purrr, this is now possible without too
>> much coding work. Please let me know whether I'm misunderstanding. I can
>> send you references to papers that have power simulations in them (or
>> example code), if that is of interest.
>>
>> Finally, let me say that there definitely is *not* any easy formula like
>> 4 participants per parameter. The recommended number of participants would
>> depend on the size of the effect you are seeking to test. It is true that
>> many, say, self-paced reading experiments from the 90s seem to have 4-6
>> items per design condition and at least as many subjects. But a) many of
>> these studies took on large effects on simple variables (or variables that
>> everyone agrees to pretend are simple, like RTs) and b) standards have
>> advanced. Also, since you're running a mixed-effect logistic regression,
>> your power will depend on the mean of the data (see e.g., Dixon, 2008; Xie
>> et al 2020; Bicknell et al., 2020 for examples): for mean proportions
>> closer to 1 or 0, you have less power. If participants in your data have
>> attentional lapses, this further exacerbates the problem. If you could send
>> the output of the regression, and explain what you meant by n and k, I
>> might be able to say more.
>>
>> hth somewhat?
>>
>> Florian
>>
>>
>>
>>
>>
>>
>> On Fri, Jun 26, 2020 at 6:06 PM Francesco Romano <fbroman...@gmail.com>
>> wrote:
>>
>>> Dear Florian and all,
>>>
>>> I am returning to the issue of effect sizes as I now have some data to
>>> hand. I must admit that I am still confused as to what the best way is to
>>> act upon a reviewer's request to include effect sizes. Their request is
>>> motivated by skepticism about sample size as they maintain the effects are
>>> likely to be too small to be replicated. This, however, cannot be
>>> determined unless effect sizes are obtained, according to them.
>>>
>>> There are different views out there regarding effect sizes and what they
>>> mean. Some look for Cohen's d, others for an R or Rsq, but how these should
>>> be extracted with glmer objects is not explained anywhere, especially with
>>> more complex models. The best source to date is
>>>
>>> Brysbaert, M., & Stevens, M. (2018). Power Analysis and Effect Size in
>>> Mixed Effects Models: A Tutorial. *Journal of Cognition*, *1*(1), 9.
>>> DOI: http://doi.org/10.5334/joc.10
>>>
>>> Their data and R strings are stored at https://osf.io/fhrc6/
>>>
>>> Following their method:
>>>
>>> #running a glmer with a binary DV and one factor with 3 levels:
>>>
>>> > mod1<-glmer(Correct~Group+(1|Participant)+(1|Item), data  = priming,
>>> family = binomial)
>>>
>>> #NOTE: Random slope for group by participants was not included as it did
>>> not improve fit
>>>
>>> #determining main effect via car package
>>> > car::Anova(mod1, type = 3)
>>> Analysis of Deviance Table (Type III Wald chisquare tests)
>>>
>>> Response: Correct
>>>              Chisq Df Pr(>Chisq)
>>> (Intercept) 21.599  1  3.360e-06 ***
>>> Group       30.413  2  2.489e-07 ***
>>> ---
>>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>>
>>> #determine the effect size of the main effect for group via the r2glmm
>>> package
>>> > r2beta(model = mod1, method = 'sgv', data = priming)
>>>   Effect   Rsq upper.CL lower.CL
>>> 1  Model 0.045    0.060    0.032
>>> 3 Group3 0.033    0.046    0.022
>>> 2 Group2 0.000    0.002    0.000
>>>
>>> #results after applying sum-to-0 coding since the factor has 3 levels
>>> > r2beta(model = mod1, method = 'sgv', data = priming)
>>>   Effect   Rsq upper.CL lower.CL
>>> 1  Model 0.045    0.060    0.032
>>> 2 Group1 0.010    0.017    0.004
>>> 3 Group2 0.009    0.016    0.004
>>>
>>> If Rsq is to be interpreted like a Cohen's D, where values between .4
>>> and .8 are meaningful, the Rsq obtained are rather small which would lead
>>> me to conclude that I either have too few participants, too few items or
>>> both. But *n* = 40 and *k* = 3332 and as far as I can recall from one
>>> of Florian's tutorials, a reliable way of estimating whether your
>>> participant sample is large enough is to count 4 participants per
>>> parameter. As there are currently 3 parameters set to the model, the
>>> minimum *n *would be 16, provided a sufficient number of *k *is in
>>> place.
>>>
>>> Any takers?
>>>
>>> Francesco Romano PhD
>>>
>>>
>>>
>>>
>>>
>>>
