More randomly ordered thoughts on this. I'm sure they altogether amount to a 
lot of work, and I don't expect your team to do it all. I just offer it as a 
list of things that might be interesting for you guys to look at.

Earlier, you said you didn't have the resources to conduct a study with human 
subjects. I just wanted to point out again that you may be overestimating the 
time it takes to put one of those together. Putting together a web site where 
people can go and volunteer to evaluate the results of your algorithm on a page 
they know well would probably require a fraction of the time it took your team 
to develop the algorithm itself (I'm sure dealing with that much data and 
figuring out who wrote which contiguous parts of the text took a lot of 
tinkering). I'm sure it would not be hard to convince editors and reviewers on 
Wikipedia to volunteer to review 30-50 pages through such a site. You could set 
up the experiment so that the reviewer reviews a page WITHOUT any of your 
colourings, and then you compute the overlap between their changes and the 
segments that your system thought were untrustworthy. By doing it that way, you 
would avoid the issue of favourable or unfavourable evaluator bias towards the 
system (because the evaluator does not know which segments the system deems 
unreliable). Also, you would be catching both false positives and false 
negatives (whereas the way I evaluated the system, I could only catch false 
positives).
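
In case it helps make the idea concrete, here is a rough sketch (in Python, 
with made-up names) of what I mean by computing the overlap, assuming both the 
reviewer's edits and your system's flags can be reduced to character ranges on 
the same revision:

    # Hypothetical sketch: overlap between reviewer edits and flagged segments.
    # Both inputs are lists of (start, end) character offsets on one revision.

    def to_char_set(ranges):
        """Expand (start, end) ranges into a set of character offsets."""
        chars = set()
        for start, end in ranges:
            chars.update(range(start, end))
        return chars

    def overlap_scores(reviewer_edits, flagged_segments):
        """Precision/recall of the flagged segments against the reviewer's edits."""
        edited = to_char_set(reviewer_edits)
        flagged = to_char_set(flagged_segments)
        hit = len(edited & flagged)
        precision = hit / len(flagged) if flagged else 0.0  # flagged text that got edited
        recall = hit / len(edited) if edited else 0.0       # edited text that was flagged
        return precision, recall

    # Made-up example: reviewer touched chars 100-180, system flagged 150-250.
    print(overlap_scores([(100, 180)], [(150, 250)]))  # -> (0.3, 0.375)

Averaging those two numbers over a few dozen reviewed pages would already tell 
you a lot.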

Another thought is that maybe you should not evaluate the system's ability to 
rate trustworthiness of **segments**, but rather rate the trustworthiness of 
whole pages. In other words, it could be that if you focus the user's attention 
on pages that have a large proportion of red in them, you would have very few 
false positives on that task (of course you might have lots of false negatives 
too, but it's still better than what we have now, which is NOTHING). For a task
like that, you would of course have to compare your system to a naïve 
implementation which, for example, uses a page's "age" (i.e. elapsed time
since initial creation), or the number of edits by different people, or the 
number of visits by different people, as an indication of trustworthiness. Have 
you looked at how your measure correlates with the review board's evaluation of 
page quality?
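
To give an idea of what that comparison could look like (the field names here 
are purely hypothetical, and I'm assuming the review board's ratings can be 
turned into a numeric label), a rank correlation per measure is probably the 
simplest thing to report:

    # Hypothetical sketch: how well does each page-level measure rank pages by
    # quality? 'pages' is a list of dicts with made-up keys.
    from scipy.stats import spearmanr

    def rank_correlations(pages):
        labels = [p["quality_label"] for p in pages]  # e.g. review board rating, 1-5
        for feature in ("red_fraction", "age_days", "n_distinct_editors"):
            values = [p[feature] for p in pages]
            rho, pval = spearmanr(values, labels)
            print(f"{feature}: Spearman rho = {rho:.2f} (p = {pval:.3f})")

If the proportion of red beats the naive measures on that table, that's already 
a result worth reporting.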

Earlier, you also said you didn't think the algorithm could do a good job at 
predicting what parts of the text are questionable because so many good 
contributions are made by occasional one-off contributors and anonymous 
authors. Maybe all this means is that you need to set your threshold for 
colouring at a higher value. In other words, only colour those parts that have 
been written by people who are KNOWN to be poor contributors. Also, for 
anonymous contributors, do you treat all of them as one big "user", or do you 
try to distinguish by IP address? Have you tried eliminating anonymous 
contributions from your processing altogether? Have you tried eliminating 
contributors who only made contributions to fewer than N pages? How do these 
things affect the values of the "internal" metrics you mentioned in your 
previous email?
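
Just to make those variations concrete, here is a hypothetical sketch (the 
thresholds and author attributes are all made up; the real plumbing obviously 
depends on how your pipeline represents authors):

    # Hypothetical sketch of the filtering / thresholding variations above.

    MIN_PAGES = 5          # ignore contributors seen on fewer than N pages
    TRUST_THRESHOLD = 0.2  # only colour text whose author trust falls below this

    def is_known_poor_contributor(author_id, is_anonymous, pages_edited, author_trust):
        """Should text by this author be coloured as untrustworthy?"""
        if is_anonymous:
            # variant 1: never colour anonymous contributions (as done here)
            # variant 2: key trust by IP address instead of one big "anonymous" user
            return False
        if pages_edited.get(author_id, 0) < MIN_PAGES:
            return False  # not enough history to call them a KNOWN poor contributor
        return author_trust.get(author_id, 1.0) < TRUST_THRESHOLD

    # Made-up example: an author with plenty of history and very low trust.
    print(is_known_poor_contributor("alice", False, {"alice": 12}, {"alice": 0.05}))  # True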

Finally, it may be that this tool is more useful for reviewers and editors than 
for readers of Wikipedia. So, what would be good metrics for reviewers?
* Precision/Recall of pages that are low quality (see the sketch after this 
list).
* Precision/Recall of the segments within those low-quality pages that are 
themselves low quality.
* Productivity boost when reviewing pages using this system vs. not. For 
example, does a reviewer using this system end up doing more edits per hour 
than a reviewer who does not? 
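
For the first two metrics, the computation is the same once you have sets of 
ids; a hypothetical sketch (the ids below are made up):

    # Hypothetical sketch: precision/recall of the system's flags against a
    # reviewer-labelled set (works the same for page ids or segment ids).

    def precision_recall(flagged, truly_low_quality):
        tp = len(flagged & truly_low_quality)
        precision = tp / len(flagged) if flagged else 0.0
        recall = tp / len(truly_low_quality) if truly_low_quality else 0.0
        return precision, recall

    # Made-up page ids:
    print(precision_recall({1, 2, 3, 7}, {2, 3, 5, 7, 11}))  # -> (0.75, 0.6)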

That's it for now. Like I said, I'm sure they altogether amount to a lot of 
work, and I don't expect your team to do it all. I just offer it as a list of 
things that might be interesting for you guys to look at.

Cheers,

Alain
