Re: [discovery] How to measure disagreement between human judges in discernatron?

Erik Bernhardson Tue, 01 Nov 2016 15:24:02 -0700

On Mon, Oct 31, 2016 at 1:13 PM, Justin Ormont <[email protected]>
wrote:


> Did you add any honey-pot answers? Answers where you know the results
> quite well (via many judges agreeing), or are very obvious (q=Obama,
> results=[en:Presidency of Barack Obama, en:A4 Paper]).
>
> I've set these up as a pre-test before starting the judgment session to
> check that the judge understands the instructions, and also included them
> at random during the session to weed out judges who select answers randomly.
>
We don't have any honey-pot answers yet. We had thought about it, but had
hoped that since users get no real benefit from doing a bad job (no
payments, no leaderboard to get on) it wouldn't be necessary. We may have
to re-evaluate that, though; it seems to be a common way to deal with
crowd-sourced data.
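
A minimal sketch of how such a honey-pot check could work. Nothing like this
exists in discernatron yet; the tuple layout (judge, query, result, score),
the 0 to 3 score scale, the function names and the 0.8 threshold are all
assumptions made for illustration.

```python
from collections import defaultdict


def honeypot_accuracy(judgments, gold, tolerance=0):
    """Return {judge: fraction of honey-pot items scored as expected}.

    judgments: iterable of (judge, query, result, score) tuples (assumed layout).
    gold: {(query, result): expected_score} for the honey-pot items only.
    tolerance: how far a score may be from the expected one and still count.
    """
    hits = defaultdict(int)
    seen = defaultdict(int)
    for judge, query, result, score in judgments:
        expected = gold.get((query, result))
        if expected is None:
            continue  # not a honey-pot item, ignore it here
        seen[judge] += 1
        if abs(score - expected) <= tolerance:
            hits[judge] += 1
    return {judge: hits[judge] / seen[judge] for judge in seen}


def unreliable_judges(judgments, gold, min_accuracy=0.8, tolerance=0):
    """List judges whose honey-pot accuracy falls below min_accuracy."""
    accuracy = honeypot_accuracy(judgments, gold, tolerance)
    return sorted(j for j, acc in accuracy.items() if acc < min_accuracy)


# Hypothetical data, using the q=Obama example from above (0 = irrelevant, 3 = relevant).
gold = {
    ("Obama", "Presidency of Barack Obama"): 3,
    ("Obama", "A4 Paper"): 0,
}
judgments = [
    ("careful", "Obama", "Presidency of Barack Obama", 3),
    ("careful", "Obama", "A4 Paper", 0),
    ("random", "Obama", "Presidency of Barack Obama", 0),
    ("random", "Obama", "A4 Paper", 2),
]
flagged = unreliable_judges(judgments, gold)
```

Judges who only ever see non-honey-pot items are simply absent from the
result, so the check says nothing about them either way.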


> Investigating the labels (individual query-result pair) with the most
> disagreement may be useful, along with the judges with the most
> disagreement.
>

Good idea, will be looking into it soon.
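
A rough sketch of what that investigation could look like: rank query-result
pairs by the variance of their scores, and rank judges by how far they sit,
on average, from the mean score of the items they rated. The tuple layout
(judge, query, result, score) and the function name are assumptions, and
variance is just one possible measure of disagreement.

```python
from collections import defaultdict
from statistics import mean, pvariance


def rank_disagreement(judgments):
    """Return (items, judges), each sorted from most to least disagreement.

    judgments: iterable of (judge, query, result, score) tuples (assumed layout).
    items:  [((query, result), score_variance), ...]
    judges: [(judge, mean_abs_deviation_from_item_mean), ...]
    """
    by_item = defaultdict(list)
    for judge, query, result, score in judgments:
        by_item[(query, result)].append((judge, score))

    item_spread = {}
    judge_deviations = defaultdict(list)
    for item, scored in by_item.items():
        scores = [score for _, score in scored]
        if len(scores) < 2:
            continue  # a single score cannot disagree with anything
        item_spread[item] = pvariance(scores)
        item_mean = mean(scores)
        for judge, score in scored:
            judge_deviations[judge].append(abs(score - item_mean))

    items = sorted(item_spread.items(), key=lambda kv: kv[1], reverse=True)
    judges = sorted(
        ((judge, mean(devs)) for judge, devs in judge_deviations.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )
    return items, judges


# Hypothetical scores on a 0-3 scale.
sample = [
    ("a", "q1", "r1", 3), ("b", "q1", "r1", 3), ("c", "q1", "r1", 0),
    ("a", "q1", "r2", 1), ("b", "q1", "r2", 1), ("c", "q1", "r2", 1),
]
items, judges = rank_disagreement(sample)
```

The item mean includes the judge's own score, so with very few judges per
item the deviations are pulled towards zero; treat the ranking as a pointer
to what to review by hand, not as a verdict.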

>
> --justin
>
> On Mon, Oct 31, 2016 at 7:43 AM, Trey Jones <[email protected]> wrote:
>
>> Interesting stats, Erik. Thanks for sharing these.
>>
>> More clarity in the documentation is always good.
>>
>> For some of the negative alpha agreement values, a couple of possible
>> sources come to mind. There could be bad faith actors, who either didn't
>> really try very hard, or purposely put in incorrect or random values. There
>> could also be genuine disagreement between the scorers about the relevance
>> of the results—David and I discussed one that we both scored, and we
>> disagreed like that. I can see where he was coming from, but it wasn't how
>> I thought of it. In both of these cases, additional scores would help.
>>
>> One thing I noticed that has been inconsistent in my own scoring is that
>> early on when I got a mediocre query (i.e., I wouldn't expect any
>> *really* good results), I tended to grade on a curve. I'd give
>> "Relevant" to the best result even if it wasn't actually a great result.
>> After grading a couple of queries for which there were clearly *no* good
>> results (i.e., *everything* was irrelevant), I think I stopped grading
>> on a curve.
>>
>> My point there is that's one place we could improve the documentation:
>> explicitly state that not every query has good results. It's okay to not
>> have any result rated as "relevant"—or this could already be in the docs,
>> and the problem is that no one reads them. :(
>>
>> Another thing Erik has suggested is trying to filter out wildly
>> non-encyclopedic queries (like "SANTA CLAUS PRINT OUT 3D PAPERTOYS"), and
>> maybe really vague queries (like "antonio parent"), but that's potentially
>> more work than filtering PII, and much more subjective.
>>
>> It might also be informative to review some of the scores for the
>> negative alphas and see if something obvious is going on, in which case
>> we'd know the alpha calculation is doing its job.
>>
>>
>> On Thu, Oct 27, 2016 at 7:21 PM, Erik Bernhardson <
>> [email protected]> wrote:
>>
>>> To follow up a little here, I implemented Krippendorff's Alpha and ran
>>> it against all the data we currently have in discernatron. The distribution
>>> looks something like:
>>>
>>> constraint              count
>>> alpha >= 0.80              11
>>> 0.667 <= alpha < 0.80      18
>>> 0.500 <= alpha < 0.667     20
>>> 0.333 <= alpha < 0.500     26
>>> 0 <= alpha < 0.333         43
>>> alpha < 0                  31
>>>
>>> This is a much lower level of agreement than I was expecting. The
>>> literature suggests 0.80 as a reliable cutoff, and 0.667 as a cutoff from
>>> which you can draw tentative conclusions. An alpha below 0 indicates less
>>> agreement than would be expected by chance, which suggests we need to
>>> re-evaluate the instructions and make them clearer (probably true).
>>>
>>>
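
For reference, a minimal sketch of Krippendorff's alpha computed from
pairwise disagreement, alpha = 1 - D_o / D_e. This is not the code that was
run against discernatron: the function names, the input layout (one list of
scores per rated result, missing judgments simply left out) and the choice
between a nominal and a squared-difference (interval) distance are
assumptions made for illustration.

```python
from itertools import permutations


def nominal_distance(a, b):
    """Scores either match or they do not."""
    return 0.0 if a == b else 1.0


def interval_distance(a, b):
    """Squared difference, so a 0-vs-3 split counts more than 2-vs-3."""
    return float((a - b) ** 2)


def krippendorff_alpha(units, distance=interval_distance):
    """Krippendorff's alpha for a collection of rated units.

    units: iterable of lists, each holding the scores given to one item.
    Units with fewer than two scores are not pairable and are skipped.
    """
    pairable = [list(u) for u in units if len(u) >= 2]
    values = [v for u in pairable for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one unit with two or more scores")

    # Observed disagreement: ordered pairs within each unit, weighted by 1/(m_u - 1).
    observed = sum(
        sum(distance(a, b) for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n

    # Expected disagreement: ordered pairs across all pairable values.
    expected = sum(distance(a, b) for a, b in permutations(values, 2)) / (n * (n - 1))
    if expected == 0:
        raise ValueError("all scores are identical; alpha is undefined")

    return 1.0 - observed / expected


# Two judges who agree on every item give alpha = 1.0.
agree = krippendorff_alpha([[1, 1], [2, 2], [3, 3]])
# Two judges who split the same way on every item give a negative alpha (about -0.5).
split = krippendorff_alpha([[0, 1], [0, 1]], distance=nominal_distance)
```

The second example shows how a negative alpha arises: the scores within each
item differ more than scores drawn at random from the whole pool would.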
>> _______________________________________________
>> discovery mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/discovery
