Re: [discovery] How to measure disagreement between human judges in discernatron?

Justin Ormont Mon, 31 Oct 2016 13:13:43 -0700

Did you add any honey-pot answers? These are answers where you already know
the correct grade quite well (because many judges agree), or where it is
very obvious (q=Obama, results=[en:Presidency of Barack Obama, en:A4 Paper]).


I've used these in two ways: as a pre-test before the judgment session
starts, to check that the judge understands the instructions, and mixed in
at random during the session, to weed out judges who select answers randomly.
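
A minimal Python sketch of that honey-pot check, in case it helps. Everything
named here is an assumption for illustration: the HONEYPOTS table, the grade
labels, the (judge, query, result, grade) tuple layout, and the 0.8 accuracy
threshold are hypothetical, not how discernatron stores or scores anything.

```python
from collections import defaultdict

# Hypothetical gold labels: (query, result) -> expected grade.
HONEYPOTS = {
    ("Obama", "en:Presidency of Barack Obama"): "relevant",
    ("Obama", "en:A4 Paper"): "irrelevant",
}


def honeypot_accuracy(judgments):
    """Return each judge's share of correct grades on honey-pot pairs.

    judgments: iterable of (judge, query, result, grade) tuples (assumed layout).
    Judges who never saw a honey-pot pair are left out of the result.
    """
    hits = defaultdict(int)
    seen = defaultdict(int)
    for judge, query, result, grade in judgments:
        expected = HONEYPOTS.get((query, result))
        if expected is None:
            continue  # not a honey-pot pair
        seen[judge] += 1
        hits[judge] += grade == expected
    return {judge: hits[judge] / seen[judge] for judge in seen}


def flag_judges(judgments, min_accuracy=0.8):
    """List judges whose honey-pot accuracy falls below min_accuracy."""
    accuracy = honeypot_accuracy(judgments)
    return sorted(judge for judge, acc in accuracy.items() if acc < min_accuracy)
```

The same function covers both uses: run it on the pre-test answers before the
session, or on the honey-pot pairs that were mixed into the session.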

Investigating the labels (individual query-result pair) with the most
disagreement may be useful, along with the judges with the most
disagreement.
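
One way to rank both could be sketched in Python as below. This is only an
illustration under assumptions: the (judge, item, grade) tuple layout and the
function name are hypothetical, and "disagreement" is taken here to mean the
fraction of judge pairs on an item that gave different grades, which is a
simple stand-in rather than the alpha calculation itself.

```python
from collections import defaultdict
from itertools import combinations


def disagreement_report(judgments):
    """Score disagreement per item and per judge.

    judgments: iterable of (judge, item, grade), where item is a
    (query, result) pair (assumed layout).
    Returns (per_item, per_judge); each value is the fraction of judge
    pairs involving that item or judge that gave different grades.
    """
    by_item = defaultdict(dict)
    for judge, item, grade in judgments:
        by_item[item][judge] = grade

    per_item = {}
    judge_pairs = defaultdict(int)
    judge_diffs = defaultdict(int)
    for item, grades in by_item.items():
        pairs = list(combinations(sorted(grades), 2))
        if not pairs:
            continue  # a single judgment cannot disagree with anything
        diffs = 0
        for a, b in pairs:
            differ = grades[a] != grades[b]
            diffs += differ
            for judge in (a, b):
                judge_pairs[judge] += 1
                judge_diffs[judge] += differ
        per_item[item] = diffs / len(pairs)

    per_judge = {j: judge_diffs[j] / judge_pairs[j] for j in judge_pairs}
    return per_item, per_judge
```

Sorting either dictionary by value in descending order gives the labels or
judges to look at first.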

--justin

On Mon, Oct 31, 2016 at 7:43 AM, Trey Jones <[email protected]> wrote:

> Interesting stats, Erik. Thanks for sharing these.
>
> More clarity in the documentation is always good.
>
> For some of the negative alpha agreement values, a couple of possible
> sources come to mind. There could be bad faith actors, who either didn't
> really try very hard, or purposely put in incorrect or random values. There
> could also be genuine disagreement between the scorers about the relevance
> of the results—David and I discussed one that we both scored, and we
> disagreed like that. I can see where he was coming from, but it wasn't how
> I thought of it. In both of these cases, additional scores would help.
>
> One thing I noticed that has been inconsistent in my own scoring is that
> early on when I got a mediocre query (i.e., I wouldn't expect any *really*
> good results), I tended to grade on a curve. I'd give "Relevant" to the
> best result even if it wasn't actually a great result. After grading a
> couple of queries for which there were clearly *no* good results (i.e.,
> *everything* was irrelevant), I think I stopped grading on a curve.
>
> My point there is that's one place we could improve the documentation:
> explicitly state that not every query has good results. It's okay to not
> have any result rated as "relevant"—or this could already be in the docs,
> and the problem is that no one reads them. :(
>
> Another thing that Erik has suggested was trying to filter out wildly
> non-encyclopedic queries (like "SANTA CLAUS PRINT OUT 3D PAPERTOYS"), and
> maybe really vague queries (like "antonio parent"), but that's potentially
> more work than filtering PII, and much more subjective.
>
> It might also be informative to review some of the scores for the negative
> alphas and see if something obvious is going on, in which case we'd know
> the alpha calculation is doing its job.
>
>
> On Thu, Oct 27, 2016 at 7:21 PM, Erik Bernhardson <
> [email protected]> wrote:
>
>> To follow up a little here, I implemented Krippendorff's Alpha and ran it
>> against all the data we currently have in discernatron. The distribution
>> looks something like:
>>
>> constraint                count
>> alpha >= 0.80                11
>> 0.667 <= alpha < 0.80        18
>> 0.500 <= alpha < 0.667       20
>> 0.333 <= alpha < 0.500       26
>> 0 <= alpha < 0.333           43
>> alpha < 0                    31
>>
>> This is a much lower level of agreement than I was expecting. The
>> literature suggests 0.80 as a reliable cutoff, and 0.667 as a cutoff from
>> which you can draw tentative conclusions. An alpha below 0 indicates less
>> agreement than would be expected by chance, which suggests we need to
>> re-evaluate the instructions and make them clearer (probably true).
>>
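For readers unfamiliar with the statistic, here is a rough Python sketch of
Krippendorff's alpha, computed as 1 - D_o/D_e from a coincidence matrix. It is
not the implementation that was run against discernatron: the function name,
the list-of-lists input, and the default nominal distance (0 if two grades
match, 1 otherwise) are assumptions, and the thread does not say which
distance metric was actually used.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha(units, distance=lambda a, b: float(a != b)):
    """Krippendorff's alpha for a list of units.

    units: list of lists, the grades given to each item; the number of
    judges may vary per item (assumed layout).
    distance: difference function between two grades; the default is the
    nominal metric.
    Returns None when alpha is undefined (nothing pairable, or no variation).
    """
    # Coincidence matrix: each ordered pair of grades within a unit
    # contributes 1 / (m - 1), where m is the number of grades in the unit.
    coincidences = Counter()
    for grades in units:
        m = len(grades)
        if m < 2:
            continue  # a single grade carries no agreement information
        for a, b in permutations(grades, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        return None

    observed = sum(w * distance(a, b) for (a, b), w in coincidences.items()) / n
    expected = sum(
        totals[a] * totals[b] * distance(a, b) for a in totals for b in totals
    ) / (n * (n - 1))
    if expected == 0:
        return None  # every grade is identical, so chance agreement is total
    return 1 - observed / expected
```

With this sketch, judges who always agree get an alpha of 1, while judges who
systematically give opposite grades on every item get a negative alpha.
Passing something like `distance=lambda a, b: (a - b) ** 2` would treat
numeric grades as interval data instead of nominal labels.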
>>
> _______________________________________________
> discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery
>
>

