> You are evaluating the coloring against a performance criterion that is not 
> the one we designed it for.
>
> Our coloring gives orange color to new information that has been added by 
> low-reputation authors.  New information by 
> high-reputation authors is light orange.  As the information is revised, it 
> gains trust. 
>
> Thus, our coloring answers the question, intuitively: has this information 
> been revised already?  Have reputable 
> authors looked at it? 
> 
> You are asking the question: how much of the information colored orange is 
> questionable? 
> This is a different question, and one we will never be able to do well on, 
> for the simple reason that a lot of the correct factual information on 
> Wikipedia is well known to come from occasional contributors, including 
> anonymous authors, and those occasional and anonymous contributors will have 
> low reputation in most conceivable reputation systems. 

Before I go further, let me reiterate that I think your work is excellent and 
has the potential for adding huge value to the wiki world. If I didn't think 
so, I wouldn't bother writing this message.

I think it's important to evaluate a system like this in terms of a metric that 
captures some sort of value added to some category of wiki end user.

The system you are trying to build could provide HUGE value for the end user, 
if it could allow him to tell with a certain amount of certainty (say, > 60%) 
which parts of an article are questionable and which parts are not. This is the 
metric I used in my admittedly very small test (note: I'm sure it's not the 
only metric that could be used to measure end-user value).

Based on that very preliminary test, it seems your system does not do a great 
job at that, and you seem to say that you don't think it could. 

That's OK. I'm sure there is SOMETHING that this system can do for the end 
user, because the "internal" performance metrics you list in your message seem 
to indicate that there is some substance to the predictions of the algorithm.

> We do not plan to do any large-scale human study.  For one, we don't have the 
> resources.  

A study with human judges does not have to be large scale. I would guess 30 
subjects would do the trick. 

> For another, in the very limited tests we did, the notion of "questionable" 
> was so subjective that our data contained a HUGE amount of noise.  We rated 
> edits as -1 (bad), 0 (neutral), or +1 (good).  The probability that two of us 
> agreed was somewhere below 60%.  We decided this was not a good way to go. 

That's interesting. I would have expected a large amount of agreement based on 
my assumption that the majority of edits are either clearly Good or Neutral. In 
other words, I would have expected judges to disagree only on the "iffy" 
portion of the edits, but since I assume that this is a small portion of all 
edits, you would still have large agreement. I guess my assumptions are wrong.

Is the story the same if you look at only two categories: Reject (= your {-1} 
set) and Keep (your {0, +1} set)?
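
To make that concrete, here is roughly the check I have in mind, in Python with 
made-up ratings (the data, the helper names, and the 10-edit sample are all 
mine, purely for illustration, not anything from your study):

# Hypothetical illustration: compare inter-rater agreement on the
# original {-1, 0, +1} labels vs. the collapsed {Reject, Keep} labels.
from collections import Counter

def raw_agreement(a, b):
    """Fraction of edits on which the two raters gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement (Cohen's kappa) between two raters."""
    n = len(a)
    p_obs = raw_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    p_exp = sum((ca[k] / n) * (cb[k] / n) for k in set(a) | set(b))
    return (p_obs - p_exp) / (1 - p_exp)

# Made-up ratings of 10 edits by two judges (-1 = bad, 0 = neutral, +1 = good).
judge1 = [-1, 0, 1, 1, 0, -1, 0, 1, 0, 1]
judge2 = [-1, 1, 1, 0, 0, -1, 1, 1, 0, 0]

# Collapse to binary: Reject = {-1}, Keep = {0, +1}.
keep1 = ["Reject" if r == -1 else "Keep" for r in judge1]
keep2 = ["Reject" if r == -1 else "Keep" for r in judge2]

print("3-way agreement:", raw_agreement(judge1, judge2),
      "kappa:", round(cohens_kappa(judge1, judge2), 2))
print("2-way agreement:", raw_agreement(keep1, keep2),
      "kappa:", round(cohens_kappa(keep1, keep2), 2))

With these invented numbers the three-way agreement is 0.6 but the two-way 
agreement is perfect, which is exactly the pattern I would expect if judges 
mostly disagree on "neutral" vs. "good" rather than on what should be rejected. 
If your real data looks anything like that, the binary labels might be usable 
after all.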

> The results of our data-driven evaluation on a random sample of 1000 articles 
> with at least 200 revisions each showed 
> that (quoting from our paper): 
> * Recall of deletions. We consider the recall of low-trust as a predictor for 
> deletions. We show that text in the 
> lowest 50% of trust values constitutes only 3.4% of the text of articles, yet 
> corresponds to 66% of the  text that is 
> deleted from one revision to the next.
> * Precision of deletions.  We consider the precision of low-trust as a 
> predictor for deletions. We show that text 
> that is in the bottom half of trust values has a probability of 33% of being 
> deleted in the very next revision, 
> in contrast with the 1.9% probability for general text.  The deletion 
> probability rises to 62% for text in the 
> bottom 20% of trust values.
> * Trust of average vs. deleted text. We consider the trust distribution of 
> all text, compared to the trust 
> distribution of the text that is deleted.  We show that 90% of the text 
> overall had trust of at least 76%, while the 
> average trust for deleted text was 33%.
> * Trust as a predictor of lifespan.  We select words uniformly at random, and 
> we consider the statistical correlation 
> between the trust of the word at the moment of sampling, and the future 
> lifespan of the word.  We show that words 
> with the highest trust have an expected future lifespan that is 4.5 times 
> longer than words with no trust.  We remark 
> that this is a proper test, since the trust at the time of sampling depends 
> only on the history of the word prior to 
> sampling.

Those measures tell me that there is definitely something to the algorithm, and 
I am trying to help you define what value it could provide, and to which kind 
of end user.
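
Just to make sure I am reading the recall/precision numbers correctly, here is 
my own reconstruction of the measure in Python. This is NOT your evaluation 
code; the names are mine, and the normalization of trust to [0, 1] (so that 
"bottom half of trust values" means trust below 0.5) is my assumption:

# My reconstruction of the recall/precision-of-deletions measure, assuming
# trust is normalized to [0, 1].
def deletion_recall_precision(words, threshold=0.5):
    """
    `words` is a list of (trust, deleted_in_next_revision) pairs sampled
    over consecutive revision pairs.  Returns (recall, precision) of low
    trust as a predictor of deletion.
    """
    low_trust = [(t, d) for t, d in words if t < threshold]
    deleted = [(t, d) for t, d in words if d]

    # Recall: what fraction of the deleted text had been flagged low-trust?
    recall = sum(1 for t, d in deleted if t < threshold) / max(len(deleted), 1)
    # Precision: what fraction of the low-trust text was deleted in the
    # very next revision?
    precision = sum(1 for t, d in low_trust if d) / max(len(low_trust), 1)
    return recall, precision

# Toy example: (trust, was_deleted_in_next_revision)
sample = [(0.9, False), (0.2, True), (0.1, True), (0.95, False),
          (0.4, False), (0.85, True), (0.3, True), (0.99, False)]
print(deletion_recall_precision(sample))  # -> (0.75, 0.75)

If that is roughly what you computed, the same harness could be reused 
unchanged for the baseline I mention below, which would make the comparison 
cheap.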

One concern I have, though: have you compared your system to a naïve 
implementation which simply uses the edit's "age" as a measure of its 
trustworthiness? In other words, don't worry about who created the edit or 
modified it. Just worry about how long it's been there. 
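
To make that concrete, the baseline I have in mind is something as simple as 
the sketch below (the saturation constant and the function name are arbitrary 
choices of mine); it could be scored with the same recall/precision-of-deletions 
measures as your reputation-based trust:

# Naive baseline: ignore author reputation entirely and score each word
# only by how long it has survived.
def age_based_trust(revisions_survived, saturation=10):
    """
    Map a word's age (number of revisions it has survived) to a score in
    [0, 1].  After `saturation` revisions the word is considered fully
    trusted; the value 10 is an arbitrary choice for illustration.
    """
    return min(revisions_survived, saturation) / saturation

print(age_based_trust(0))   # 0.0 -> just inserted, would be bright orange
print(age_based_trust(7))   # 0.7
print(age_based_trust(25))  # 1.0 -> fully trusted

If the reputation-based trust clearly beats this baseline on the four measures 
you list, that would be a very convincing argument that author reputation, and 
not just survival time, is doing the work.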

Alain

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wiki-research-l
