Hi Francesco, (cc-ing the list so that this becomes part of the archives that others can use, and so that others can chime in, correct me, etc. I seem to have accidentally not hit reply-all in my first reply to you. I've removed your email address in case you hadn't intended it to be shared with the list.)
thank you for the additional information. I am getting a clearer picture. It seems that the correct outcome is overall *un*common: assuming that you treatment-coded the group variable, two of the groups have correct answers on about 26% of the trials and one group (Group3) on about 10%. Does that sound about right? If so, that's a pretty decent effect for 40 subjects and 75 items, and it is not surprising that the contrast for Group3 reaches significance. If your theory *pre*dicted this, rather than you having conducted many tests and then arrived at this significance (cf. the guidelines in e.g. Simmons et al 2011, which are much more important than all of this effect size back and forth), it strikes me as the type of effect that should pass muster.

Now, just to be sure that you're doing the right thing: if each group sees all items, you should have by-item slopes for group. Sometimes answering this requires thinking about what you mean by "items". E.g., if the same token has a different item ID because it was shown to another group, that would usually mean that you're coding item wrong. You should then change the item coding so that all stimuli that were meant to form an item by design have the same item ID. If group is then within-item, you should add the random slope for group by item. So much about that.

Now about the power analyses. Here are some thoughts on this issue:

1. Yes, what you are doing is an observed power analysis (you already have the data), and yes, ideally you should conduct power analyses *prior* to running the experiment. I can't emphasize enough just how insightful that can be. At the same time, I sympathize with the situation you find yourself in. Often we learn how we should do things after the fact, and that doesn't necessarily make our work uninformative or less relevant for scientific goals.
2. In your email you seem to contrast the bad reputation of observed power sims with the desirability of effect size measures (perhaps that's not what you meant). So just to be clear: effect size measures are just as much based on observed data, i.e., they have all the same weaknesses that the observed power analysis has.

3. For what it's worth, one can ameliorate the downsides of observed power analyses by adequately incorporating uncertainty about the effect size into the power simulation. Since I don't really use simr, I don't know whether that package does that, but I would assume so (at least to the extent that non-Bayesian analyses capture the relevant uncertainties). This would be worth checking. You can also always provide additional analyses, e.g., by taking the lower (as in closer to zero) bound of the 95% CI of your effect as the effect size measure. I think of that as a quick but effective way to get an idea of the range of power one might expect. For an example, see Melis et al 2017 "Satellite- vs. Verb-Framing Underpredicts Nonverbal Motion Categorization: Insights from a Large Language Sample and Simulations", Figures 4-5.

4. Of course, 20 simulations would be way too few. We usually conduct at least 1000. In your case, though, you will have close to ceiling power, provided your item specification doesn't have to change.

5. Apologies, but I don't have time to go through the tutorial right now. Since I'm not using simr, I can't say much about it, but perhaps others can. Did you perhaps code Correct as logical, or as character? When I just tried to run a binomial glmer powerSim there was no issue, so a binary DV is not a limitation.

Florian

On Sat, Jun 27, 2020 at 11:20 AM Francesco Romano <fbroman...@gmail.com> wrote: [...]
>
> On Sat, Jun 27, 2020 at 12:36 AM Florian Jaeger <timegu...@gmail.com>
> wrote:
>
>> Dear Francesco,
>>
>> I have not (yet) read the Brysbaert and Stevens paper, but as you said,
>> effect sizes can mean all kinds of things.
In your case, it seems that the
>> reviewers' question is about replicability. In that case, all you need to
>> know is how likely this effect is to replicate. I think what people usually
>> mean by this is how likely the effect is to replicate in pretty much the
>> same type of experiment (e.g., even Cohen's d, etc. do depend on the
>> paradigm, dependent variable, and task, among many other things). For this,
>> I personally find power estimates (or Bayesian extensions of it) more
>> intuitive and more informative than a single effect size measure. There's
>> quite a few examples out there by now, as to how you can explore in more or
>> less depth how much power you have under various assumptions about your
>> data (any effect size measure will make assumptions, too). You can do so
>> while considering uncertainty about the true size of the effect of interest
>> as well as any other parameters in the model. The simr package is one
>> example, or you can create something more directly made for your own goals.
>> Thanks to the tidyverse, dplyr, purrr, this is now possible without too
>> much coding work. Please let me know whether I'm misunderstanding. I can
>> send you references to papers that have power simulations in them (or
>> example code), if that is of interest.
>>
>> Finally, let me say that there definitely is *not* any easy formula like
>> 4 participants per parameter. The recommended number of participants would
>> depend on the size of the effect you are seeking to test. It is true that
>> many, say, self-paced reading experiments from the 90s seem to have 4-6
>> items per design condition and at least as many subjects. But a) many of
>> these studies took on large effects on simple variables (or variables that
>> everyone agrees to pretend are simple, like RTs) and b) standards have
>> advanced.
Also, since you're running a mixed-effect logistic regression,
>> your power will depend on the mean of the data (see e.g., Dixon, 2008; Xie
>> et al 2020; Bicknell et al., 2020 for examples): for mean proportions
>> closer to 1 or 0, you have less power. If participants in your data have
>> attentional lapses, this further exacerbates the problem. If you could send
>> the output of the regression, and explain what you meant by n and k, I
>> might be able to say more.
>>
>> hth somewhat?
>>
>> Florian
>>
>> On Fri, Jun 26, 2020 at 6:06 PM Francesco Romano <fbroman...@gmail.com>
>> wrote:
>>
>>> Dear Florian and all,
>>>
>>> I am returning to the issue of effect sizes as I now have some data to
>>> hand. I must admit that I am still confused as to what the best way is to
>>> act upon a reviewer's request to include effect sizes. Their request is
>>> motivated by skepticism about sample size, as they maintain the effects are
>>> likely to be too small to be replicated. This, however, cannot be
>>> determined unless effect sizes are obtained, according to them.
>>>
>>> There are different views out there regarding effect sizes and what they
>>> mean. Some look for Cohen's d, others for an R or Rsq, but how these should
>>> be extracted from glmer objects is not explained anywhere, especially with
>>> more complex models. The best source to date is
>>>
>>> Brysbaert, M., & Stevens, M. (2018). Power Analysis and Effect Size in
>>> Mixed Effects Models: A Tutorial. *Journal of Cognition*, *1*(1), 9.
>>> DOI: http://doi.org/10.5334/joc.10
>>>
>>> Their data and R code are stored at https://osf.io/fhrc6/
>>>
>>> Following their method:
>>>
>>> #running a glmer with a binary DV and one factor with 3 levels:
>>>
>>> > mod1<-glmer(Correct~Group+(1|Participant)+(1|Item), data = priming,
>>> family = binomial)
>>>
>>> #NOTE: Random slope for group by participants was not included as it did
>>> not improve fit
>>>
>>> #determining main effect via car package
>>> > car::Anova(mod1, type = 3)
>>> Analysis of Deviance Table (Type III Wald chisquare tests)
>>>
>>> Response: Correct
>>> Chisq Df Pr(>Chisq)
>>> (Intercept) 21.599 1 3.360e-06 ***
>>> Group 30.413 2 2.489e-07 ***
>>> ---
>>> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>>
>>> #determine the effect size of the main effect for group via the r2glmm
>>> package
>>> > r2beta(model = mod1, method = 'sgv', data = priming)
>>> Effect Rsq upper.CL lower.CL
>>> 1 Model 0.045 0.060 0.032
>>> 3 Group3 0.033 0.046 0.022
>>> 2 Group2 0.000 0.002 0.000
>>>
>>> #results after applying sum-to-0 coding since the factor has 3 levels
>>> > r2beta(model = mod1, method = 'sgv', data = priming)
>>> Effect Rsq upper.CL lower.CL
>>> 1 Model 0.045 0.060 0.032
>>> 2 Group1 0.010 0.017 0.004
>>> 3 Group2 0.009 0.016 0.004
>>>
>>> If Rsq is to be interpreted like a Cohen's d, where values between .4
>>> and .8 are meaningful, the Rsq obtained are rather small, which would lead
>>> me to conclude that I either have too few participants, too few items, or
>>> both.
But *n* = 40 and *k* = 3332, and as far as I can recall from one
>>> of Florian's tutorials, a reliable way of estimating whether your
>>> participant sample is large enough is to count 4 participants per
>>> parameter. As there are currently 3 parameters set to the model, the
>>> minimum *n* would be 16, provided a sufficient number of *k* is in
>>> place.
>>>
>>> Any takers?
>>>
>>> Francesco Romano PhD
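A minimal sketch of the by-item random slope specification Florian describes above, assuming the variable names from Francesco's model (`Correct`, `Group`, `Participant`, `Item`, `priming`) and assuming that, after recoding item IDs, every item is seen by all three groups:

```r
library(lme4)

# If group is within-item after recoding, add a by-item random slope
# for Group. Group varies between participants, so a by-participant
# slope for Group is not possible.
mod2 <- glmer(
  Correct ~ Group + (1 | Participant) + (1 + Group | Item),
  data = priming, family = binomial
)

# Likelihood-ratio comparison against the intercept-only random structure:
anova(mod1, mod2)
```

If the slope model fails to converge, simplifying the random-effect structure (e.g., removing the intercept-slope correlations) is a common fallback.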
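A rough sketch of the conservative power simulation from points 3-4 of Florian's reply, using only lme4 rather than simr. The coefficient name `GroupGroup3` is an assumption (it depends on the factor's level labels; check `rownames(coef(summary(mod1)))`), and this assumes `mod1` and `priming` as in Francesco's code:

```r
library(lme4)

# 1) Conservative effect size: the bound of the 95% Wald CI for the
#    Group3 contrast that lies closer to zero.
ci <- confint(mod1, parm = "beta_", method = "Wald")
b  <- fixef(mod1)["GroupGroup3"]
b_conservative <- if (b < 0) ci["GroupGroup3", 2] else ci["GroupGroup3", 1]

beta_new <- fixef(mod1)
beta_new["GroupGroup3"] <- b_conservative

# 2) Simulate responses from the fitted model with the reduced effect,
#    refit, and record whether the contrast reaches significance.
n_sims <- 1000   # 20 would be far too few (point 4 above)
p_vals <- replicate(n_sims, {
  y   <- simulate(mod1, newparams = list(beta = beta_new))[[1]]
  fit <- refit(mod1, y)
  coef(summary(fit))["GroupGroup3", "Pr(>|z|)"]
})

mean(p_vals < .05)   # estimated power under the conservative effect size
```

Refitting 1000 glmers takes a while; running a small batch first to check that everything works, then scaling up (possibly in parallel), is the usual approach.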