Brian wrote:
> Delirium, you do make it sound as if merely having the tagged dataset
> solves the entire problem. But there are really multiple problems. One
> is learning to classify what you have been told is in the dataset
> (e.g., that all instances of this rule in the edit history *really
> are* vandalism). The other is learning about new reasons that this
> edit is vandalism based on all the other occurrences of vandalism and
> non-vandalism and a sophisticated pre-parse of all the content that
> breaks it down into natural language features. Finally, you then wish
> to use this system to bootstrap a vandalism detection system that can
> generalize to entirely new instances of vandalism.
>
> Generally speaking, it is not true that you can only draw conclusions
> about what is immediately available in your dataset. It is true that,
> with the exception of people, machine learning systems struggle with
> generalization.
My point is mainly that using the *results* of an automated rule system as *input* to a machine-learning algorithm won't constitute training on "vandalism", but on "what the current rule set considers vandalism".

I don't see a particularly good reason to find new reasons an edit is vandalism for edits we already predict correctly. What we want is new discriminators for the edits we *don't* predict correctly. And for those, you can't use the labels given by the current rules as training data: if the current rule set produces false positives, those become positives in your training set, and if it has false negatives, those become negatives in your training set.

I suppose it could be used for proposing hypotheses to human discriminators. For example, you could propose a new feature X if you find that 95% of the time the existing rule set flags edits with feature X as vandalism, and then determine by human inspection that the remaining 5% were false negatives, so feature X really should be a new "this is vandalism" feature. But you need that human inspection--- you can't automatically discriminate between rules that improve the filter set's performance and rules that decrease it if your labeled data set is the one with the mistakes in it.

-Mark

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
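To make the proposed workflow concrete, here is a minimal Python sketch of the hypothesis-proposal step described above: measure how often the *current* rule set agrees with a candidate feature, and if agreement is high, hand the disagreeing edits to a human rather than trusting the rule-derived labels. The function name, the 95% threshold, and the predicate arguments are all illustrative, not any existing MediaWiki or filter API.

```python
def propose_feature(edits, has_feature, rule_flags_vandalism, threshold=0.95):
    """Evaluate candidate feature X against the current rule set's labels.

    edits                -- iterable of edit identifiers
    has_feature(e)       -- True if edit e exhibits candidate feature X
    rule_flags_vandalism(e) -- True if the current rule set flags e

    Returns (agreement_rate, edits_needing_human_review). The second list
    is non-empty only when agreement clears the threshold: those are the
    edits the rules did NOT flag, which are either true negatives (so X
    is a bad rule) or the rule set's own false negatives (so X is a good
    new rule). Only a human can tell which, because the labels themselves
    may contain the mistakes.
    """
    with_feature = [e for e in edits if has_feature(e)]
    if not with_feature:
        return 0.0, []
    flagged = sum(1 for e in with_feature if rule_flags_vandalism(e))
    agreement = flagged / len(with_feature)
    needs_review = [e for e in with_feature if not rule_flags_vandalism(e)]
    return agreement, (needs_review if agreement >= threshold else [])
```

For instance, with 20 edits carrying the feature of which the rules flag 19, agreement is exactly 0.95 and the single unflagged edit is queued for human inspection; below the threshold, nothing is proposed and no review work is generated.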