Hi Tibor,

I don't have a 'solution', but here are some thoughts that *might* be
helpful. My first thought is that Bayesian tests (in your case, it seems,
a Bayesian GLMM with a Bernoulli link, which you could run through, e.g.,
Paul Buerkner's brms package) might be better suited for tests of the
null. For an introduction and discussion, see, e.g., Wagenmakers 2007,
which includes references to different ways to test the null in a
Bayesian framework. If you're interested in that direction, let me know
and I can recommend some papers.
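
Just to give a flavor, here is a rough sketch of what such a model might
look like in brms. This is only a sketch under made-up names: I'm
assuming a data frame d with a binary 2AFC response 'choice', a coded
predictor 'condition', and 'subject'/'item' grouping factors; the prior
would of course need thought for your design.

library(brms)

# Sketch only; 'd', 'choice', 'condition', 'subject', 'item' are
# placeholders for your data frame and variables.
fit <- brm(
  choice ~ condition + (1 + condition | subject) + (1 + condition | item),
  data = d,
  family = bernoulli(link = "logit"),
  prior = set_prior("normal(0, 1)", class = "b"),  # weakly informative slope prior
  sample_prior = "yes",                            # needed for hypothesis() below
  cores = 4
)

# Savage-Dickey-style evidence ratio for the point null 'condition = 0'
# (the coefficient name depends on how 'condition' is coded):
hypothesis(fit, "condition = 0")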

For your power question, I don't think there's a 'correct' answer, but I
find it useful to turn the problem around. I'd keep in mind that it's an
(understandable) shortcoming of the theories that predict the difference
that they do not specify the predicted effect size ("understandable"
because it's difficult to develop a theory as well as *all of its link
functions* to different types of observable behaviors; *very* few
psycholinguistic or linguistic theories even specify *any* quantitative
link function). So you could instead formulate the question in terms of
"if the predicted relative badness is comparable to the relative badness
of [enter other, ideally better understood and well-studied,
structure/phenomenon here], then my study had X% power to detect the
effect. If the badness is only half that, I had XXX% power." That gives
other researchers something they can intuitively compare to. It would
also let you draw on the literature on other structures (ideally studies
using the same type of 2AFC task and similar fillers [e.g., if you use
'good' fillers, the ideal study to compare to would have done so, too]).
Even better would be to find a few studies on the same topic and extract
the average effect size, which reduces the dependency on any individual
result. If you go this route, I would also discuss in the paper that
results reported in the literature tend to have inflated estimates due to
the 'significance filter' (see, e.g., Shravan's recent review of this
issue).

Of course, nothing would keep you from doing this for a few different
structures, providing your readers with a range of power estimates.
Imagine, for example, a graph with effect size (magnitude of the
estimate) on the x-axis and power on the y-axis (for the number of
subjects and items you have), with points along the power curve labeled
by other studies/phenomena/structures (this wouldn't be the power of
those studies, but the power you would have had assuming the estimates
from those studies).
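
For what it's worth, a rough sketch of how such a curve could be computed
with simr: fix the slope to a series of hypothetical effect sizes (on the
log-odds scale) and simulate power for each. Purely illustrative; 'fit'
stands for your fitted glmer model, 'condition' for the fixed effect of
interest, and the grid of effect sizes would come from the studies you
pick.

library(simr)  # assumes 'fit' is a glmer() model with a fixed effect 'condition'

# Hypothetical effect sizes (log odds), e.g. estimates taken from other studies
effect_sizes <- c(-0.2, -0.4, -0.6, -0.8, -1.0)

power_at <- function(b, nsim = 200) {
  m <- fit
  fixef(m)["condition"] <- b                   # pretend the true effect is b
  ps <- powerSim(m, test = fixed("condition"), nsim = nsim)
  summary(ps)$mean                             # proportion of significant simulations
}

powers <- sapply(effect_sizes, power_at)

plot(abs(effect_sizes), powers, type = "b",
     xlab = "assumed effect size (|log odds|)",
     ylab = "power, given your subjects and items")
# Individual published estimates could then be added as labeled points.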

Another route you can take is to estimate the 'minimally detectable'
difference in the paradigm, and ask how much power you had to detect that
effect. I think, though, that this approach is more fraught with problems
than it appears at first sight. I also find that you don't stand to gain that
much insight from it, compared to assessing power relative to effects
observed in previous work.
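
If you did want to quantify it anyway, one rough operationalisation
(with all the caveats of the simulation sketch above, whose
'effect_sizes' and 'powers' it reuses) would be the smallest assumed
effect that reaches some conventional power level:

# Smallest assumed effect (in |log odds|) with simulated power of at least 80%;
# reuses 'effect_sizes' and 'powers' from the sketch above.
detectable <- abs(effect_sizes)[powers >= 0.80]
if (length(detectable) > 0) min(detectable) else NA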

Finally, I think that 'just' doubling the estimate from your study to do a
power analysis for that effect size is rather uninformative.

Sorry to not have anything more concrete / directly helpful to say!

Florian

On Wed, May 6, 2020 at 7:30 AM Tibor Kiss <tibor.k...@ruhr-uni-bochum.de>
wrote:

> Hi everybody,
>
> here is another try at effect sizes, this time from the perspective of
> power calculations.
>
> I hope that the following characterization is sufficient: In Germanic
> languages, part of the VP can be topicalized, leaving other parts behind,
> as illustrated in (1):
>
> (1)     a.      Ohne einen Lappen gespült hat sie in einer Kaffeepause.
>                 without a rag rinsed has she during a coffee break
>                 ’She rinsed without a rag during a coffee break.’
>         b.      In einer Kaffeepause gespült hat sie ohne einen Lappen.
>                 (same as (1a), only with the order of the temporal and
>                 instrumental PPs reversed)
>
> There are some claims in the literature (e.g. Frey and Pittner 1998,
> though not directed at the specific examples here) that speakers should
> judge (1a) significantly better than (1b). There is also a more general,
> almost tacit claim that examples like (1a) are derived from verb-final
> structures that are structurally identical, and moreover that a reversal
> of the PPs in the verb-final clause is ok. So, both (2a) and (2b) are
> supposed to be fine.
>
> (2)     a.      … dass sie in einer Kaffeepause ohne einen Lappen gespült hat.
>                 that she during a coffee break without a rag rinsed has
>         b.      … dass sie [ohne einen Lappen]_i in einer Kaffeepause t_i gespült hat.
>                 (same as (2a) with scrambling of the PP)
>
> But it would be impossible to have a derivation where you first have a
> reversal of the PPs and then a partial VP topicalization, because the
> topicalized phrase contains a trace (or, in more contemporary parlance: a
> copy), which cannot be linked to the (scrambled) antecedent. So, only (2a)
> could be an input to yield (1a), while (2b) would be an input to yield
> (1b), but then the trace is lost … [It is not relevant for the present
> purposes that there might be alternative analyses which do not make use of
> scrambling/traces/copies.]
>
> This is all classical generative linguistics, but here comes my question.
> We can translate all this into two hypotheses, H0 stating that there is no
> difference between structures of type (1) and structures of type (2) in
> terms of acceptability judgments, and HA stating that there is. The
> question is: what should be the minimal effect size that we accept in this
> case? I dare say that a conservative guess (the least we should get)
> would be something like an odds ratio of 2. In terms of a
> two-alternative forced-choice study (Sprouse et al. 2013), where
> subjects have to pick the one of the two presented sentences that they
> consider more natural, we would thus expect that the order (1a)/(2a)
> would make it twice as likely that that example is picked.
>
> It turns out that the difference between (1) and (2) is not significant
> (H0 cannot be rejected), according to a GLMM (estimate: -0.33, p > 0.1).
> The question remains whether the experiment had enough power to find a
> sufficiently large effect. Following the logic sketched above, and using
> simr::powerSim and simr::powerCurve, I have changed the estimate from -0.33
> to -0.65, which amounts to an (inverted) odds ratio of 1.91, i.e. still
> below the threshold of 2 proposed above. powerCurve shows a power of 85%
> (CI: 79.28-89.65) for the pertinent factor. Given that this is based on
> a very conservative effect size (< 2), the experiment will surely have
> enough power to detect larger effect sizes.
>
> My (rather general) question now is: is the logic of proposing a rather
> low effect size sound? Are there general assumptions about the expected
> effect sizes in judgement studies? The interesting thing here (or so it
> seems to me) is that the linguistic argumentation leads one to hope for
> a non-rejection of H0, so non-significance is a result, and needs
> corroboration.
>
> With kind regards
>
>
> Tibor
>
>
> ———————————————————
> Prof. Dr. Tibor Kiss
> Sprachwissenschaftliches Institut
> Ruhr-Universität Bochum
>
