Did you add any honey-pot answers? That is, answers where you already know the results quite well (via many judges agreeing), or where the right rating is very obvious (q=Obama, results=[en:Presidency of Barack Obama, en:A4 Paper]).

I've set these up as a pre-test before starting the judgment session to
check that the judge understands the instructions, and randomly included to
weed out judges that randomly select answers. Investigating the labels
(individual query-result pair) with the most disagreement may be useful,
along with the judges with the most disagreement.

--justin

On Mon, Oct 31, 2016 at 7:43 AM, Trey Jones <[email protected]> wrote:

> Interesting stats, Erik. Thanks for sharing these.
>
> More clarity in the documentation is always good.
>
> For some of the negative alpha agreement values, a couple of possible
> sources come to mind. There could be bad faith actors, who either didn't
> really try very hard, or purposely put in incorrect or random values. There
> could also be genuine disagreement between the scorers about the relevance
> of the results -- David and I discussed one that we both scored, and we
> disagreed like that. I can see where he was coming from, but it wasn't how
> I thought of it. In both of these cases, additional scores would help.
>
> One thing I noticed that has been inconsistent in my own scoring is that
> early on when I got a mediocre query (i.e., I wouldn't expect any *really*
> good results), I tended to grade on a curve. I'd give "Relevant" to the
> best result even if it wasn't actually a great result. After grading a
> couple of queries for which there were clearly *no* good results (i.e.,
> *everything* was irrelevant), I think I stopped grading on a curve.
>
> My point there is that's one place we could improve the documentation:
> explicitly state that not every query has good results. It's okay to not
> have any result rated as "relevant" -- or this could already be in the docs,
> and the problem is that no one reads them. :(
>
> Another thing that Erik has suggested was trying to filter out wildly
> non-encyclopedic queries (like "SANTA CLAUS PRINT OUT 3D PAPERTOYS"), and
> maybe really vague queries (like "antonio parent"), but that's potentially
> more work than filtering PII, and much more subjective.
>
> It might also be informative to review some of the scores for the negative
> alphas and see if something obvious is going on, in which case we'd know
> the alpha calculation is doing its job.
>
>
> On Thu, Oct 27, 2016 at 7:21 PM, Erik Bernhardson <
> [email protected]> wrote:
>
>> To follow up a little here, i implemented Krippendorff's Alpha and ran it
>> against all the data we currently have in discernatron, the distribution
>> looks something like:
>>
>> constraint                 count
>> alpha >= 0.80                 11
>> 0.667 <= alpha < 0.80         18
>> 0.500 <= alpha < 0.667        20
>> 0.333 <= alpha < 0.500        26
>> 0 <= alpha < 0.333            43
>> alpha < 0                     31
>>
>> This is a much lower level of agreement than i was expecting. The
>> literature suggests 0.80 as a reliable cutoff, and 0.667 as a cutoff from
>> which you can draw tentative conclusions. Below 0 indicates there is less
>> agreement than random chance, and we need to re-evaluate the instructions
>> to be more clear (probably true).
>>
>>
> _______________________________________________
> discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery
>
>
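
For readers who want to see how a per-query alpha like the ones in Erik's table might be computed, here is a minimal sketch of Krippendorff's Alpha. It is not Erik's implementation, which is not shown in this thread. It assumes the nominal distance metric (labels either match or they don't); graded relevance scores might be better served by an ordinal or interval metric. The function names, the input layout (one list of labels per rated result), and the example labels are all hypothetical. The band boundaries are the ones from the table.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """units: list of lists; each inner list holds the labels the judges
    gave to one item (here, one query-result pair). Nominal metric only."""
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # an item with a single rating says nothing about agreement
        # Every ordered pair of ratings from different judges, weighted 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least two pairable ratings")

    # Observed and expected disagreement (both scaled by n, which cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("only one label was used; alpha is undefined")
    return 1.0 - observed / expected


def alpha_band(alpha):
    """Map an alpha value onto the bands used in the table above."""
    for lower, name in [(0.80, "alpha >= 0.80"),
                        (0.667, "0.667 <= alpha < 0.80"),
                        (0.500, "0.500 <= alpha < 0.667"),
                        (0.333, "0.333 <= alpha < 0.500"),
                        (0.0, "0 <= alpha < 0.333")]:
        if alpha >= lower:
            return name
    return "alpha < 0"
```

Two judges who rate two results identically get alpha = 1, while two judges who systematically swap labels land below zero, which is the "less agreement than random chance" case.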
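
The honey-pot check described at the top of this message could look roughly like the sketch below. It scores each judge only on query-result pairs whose answer is already known, then flags judges who fall under a threshold. Everything here is an assumption for illustration: the tuple layout, the label strings, the function names, and the 0.8 / 2-pair defaults are hypothetical and are not taken from discernatron.

```python
from collections import defaultdict


def honeypot_accuracy(judgments, gold):
    """judgments: iterable of (judge, query, result, label) tuples.
    gold: dict mapping (query, result) to the known-correct label.
    Returns {judge: (accuracy, number_of_honeypots_seen)}."""
    hits = defaultdict(int)
    seen = defaultdict(int)
    for judge, query, result, label in judgments:
        expected = gold.get((query, result))
        if expected is None:
            continue  # ordinary pair, not a honey-pot
        seen[judge] += 1
        if label == expected:
            hits[judge] += 1
    return {judge: (hits[judge] / seen[judge], seen[judge]) for judge in seen}


def flag_judges(judgments, gold, min_accuracy=0.8, min_seen=2):
    """Return judges who saw at least min_seen honey-pots and scored
    below min_accuracy on them (both thresholds are arbitrary defaults)."""
    scores = honeypot_accuracy(judgments, gold)
    return sorted(
        judge
        for judge, (accuracy, count) in scores.items()
        if count >= min_seen and accuracy < min_accuracy
    )


# Hypothetical data built around the q=Obama example.
gold = {
    ("Obama", "en:Presidency of Barack Obama"): "relevant",
    ("Obama", "en:A4 Paper"): "irrelevant",
}
judgments = [
    ("judge_a", "Obama", "en:Presidency of Barack Obama", "relevant"),
    ("judge_a", "Obama", "en:A4 Paper", "irrelevant"),
    ("judge_b", "Obama", "en:Presidency of Barack Obama", "irrelevant"),
    ("judge_b", "Obama", "en:A4 Paper", "irrelevant"),
]
suspect_judges = flag_judges(judgments, gold)
```

A flagged judge is only a candidate for review. A low score can also mean the instructions were unclear, which is the other explanation raised in this thread.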
