Re: [discovery] How to measure disagreement between human judges in discernatron?

Erik Bernhardson Fri, 28 Oct 2016 00:21:47 -0700

Thanks for the links! This is exactly what I was looking for. After
reviewing some of the options I'm going to do a first try with
Krippendorff's Alpha. Its ability to handle missing data from some graders
as well as being applicable down to n=2 seems promising.
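
As a rough illustration of what that first try could look like, here is a
minimal sketch of Krippendorff's Alpha in plain Python. It is not
Discernatron code: the function name, the input layout (one list of scores
per query/result pair, with None for a grader who did not score it) and the
choice of the interval distance (squared difference) are all assumptions made
for the example. The interval distance is used because it penalizes a 2 vs 0
disagreement more than a 3 vs 2 one; an ordinal distance would be another
reasonable choice.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval metric (squared difference).

    `units` is a hypothetical input layout: one list per rated item, holding
    each grader's score, with None where a grader gave no score.
    Returns NaN when all pairable scores are identical (alpha is undefined).
    """
    # Drop missing scores, then keep only items rated by at least two graders.
    pairable = [[v for v in unit if v is not None] for unit in units]
    pairable = [unit for unit in pairable if len(unit) >= 2]

    values = [v for unit in pairable for v in unit]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one item with two or more scores")

    def delta(a, b):
        # Interval metric: distance grows with the squared gap between scores.
        return (a - b) ** 2

    # Observed disagreement: pairs of scores within the same item.
    d_o = sum(
        sum(delta(a, b) for a, b in permutations(unit, 2)) / (len(unit) - 1)
        for unit in pairable
    ) / n

    # Expected disagreement: pairs of scores drawn from all items pooled.
    d_e = sum(delta(a, b) for a, b in permutations(values, 2)) / (n * (n - 1))

    if d_e == 0:
        return float("nan")
    return 1.0 - d_o / d_e


if __name__ == "__main__":
    # Two graders, four results for one query; made-up scores on the 0..3 scale.
    scores = [[0, 0], [3, 2], [2, 0], [1, None]]
    print(krippendorff_alpha_interval(scores))
```

An alpha of 1 means perfect agreement, 0 means agreement no better than
chance, and negative values mean systematic disagreement. What threshold
should trigger requesting two more scores is still an open question and is
not settled by this sketch.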


On Oct 26, 2016 11:37 AM, "Justin Ormont" <[email protected]> wrote:

> You're in the area of:
> https://en.wikipedia.org/wiki/Inter-rater_reliability
>
> --justin
>
> On Wed, Oct 26, 2016 at 11:31 AM, Jonathan Morgan <[email protected]>
> wrote:
>
>> Disclaimer: I'm not a math nerd, and I don't know the history of
>> Discernatron very well.
>>
>> ...but re: your second specialized concern, have you considered running
>> some more sophisticated inter-rater reliability statistics to get a better
>> sense of the degree of disagreement (controlling for random chance)? See
>> for example: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402032/
>>
>> - Jonathan
>>
>> On Wed, Oct 26, 2016 at 11:21 AM, Erik Bernhardson <
>> [email protected]> wrote:
>>
>>> For a little backstory, in discernatron multiple judges provide scores
>>> from 0 to 3 for results. Typically we only request a single query to be
>>> reviewed by two judges. We would like to measure the level of disagreement
>>> between these two judges, and if it crosses some threshold get two more
>>> scores, so we can then measure disagreement in the group of 4. Somehow
>>> though, we need to define how to measure that level of disagreement and
>>> what the threshold for needing more scores is.
>>>
>>> Some specialized concerns:
>>> * It is probably important to include not just that the users gave
>>> different values, but also how far apart they are. The difference between a
>>> 3 and a 2 is much smaller than between a 2 and a 0.
>>> * If the users agree that 80% of the results are all 0, but disagree on
>>> the last 20%, even though the average disagreement is low it's probably
>>> still important? Might be worthwhile to take all the agreements about
>>> irrelevant results and remove them before calculating disagreement? Not
>>> sure...
>>>
>>> I know we have a few math nerds here on the list, so hoping someone has
>>> a few ideas.
>>>
>>> _______________________________________________
>>> discovery mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/discovery
>>>
>>>
>>
>>
>> --
>> Jonathan T. Morgan
>> Senior Design Researcher
>> Wikimedia Foundation
>> User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>>

