To be clear about what I mean, here is a specific counter-example to the claim that this is just "reconstructing that rule set." Suppose you run the AbuseFilter rules over the entire history of the wiki to generate a dataset of positive and negative examples of vandalism edits. You should then *throw the rules away* and attempt to discover, more or less blind, features that correctly separate the vandalism classes.
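To make that concrete, here is a minimal sketch of the two steps in Python with scikit-learn. The rule function, the sample edits, and the classifier choice are all stand-ins of my own, not the real AbuseFilter interface:

# A rough sketch of the two-step idea above, using scikit-learn.
# rule_flags_edit() and the sample edits are made-up stand-ins for the
# AbuseFilter rule set and the real edit history, not an actual API.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def rule_flags_edit(diff_text):
    """Stand-in for running the AbuseFilter rules over one edit."""
    return "buy cheap" in diff_text.lower() or diff_text.isupper()

# Step 1: sweep the edit history once, labelling each edit with the rules.
edit_history = [
    "Fixed a typo in the infobox.",
    "BUY CHEAP PILLS ONLINE",
    "Added a citation to the 2007 paper.",
    "buy cheap watches here",
]
labels = [rule_flags_edit(edit) for edit in edit_history]

# Step 2: throw the rules away and learn from generic text features only,
# so the model can pick up regularities the rules never encoded.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(edit_history, labels)

# Score an edit that the rules do not match verbatim but that shares
# features with the rule-labelled positives.
print(model.predict_proba(["cheap pills, great deal"])[:, 1])

The point of the sketch is that the learned model never sees the rules themselves, only their labels, so anything it generalizes to comes from the discovered features.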
The key, then, is feature discovery, and a machine system has the potential to do this more effectively than a human, by virtue of its ability to read the entire encyclopedia.

On Thu, Mar 19, 2009 at 2:30 PM, Brian <brian.min...@colorado.edu> wrote:
> I presented a talk at Wikimania 2007 that espoused the virtues of
> combining human measures of content with automatically determined
> measures in order to generalize to unseen instances. Unfortunately, all
> those Wikimania talks seem to have been lost. It was related to this
> article on predicting the quality ratings provided by the Wikipedia
> Editorial Team:
>
> Rassbach, L., Pincock, T., Mingus, B. (2007). "Exploring the
> Feasibility of Automatically Rating Online Article Quality"
> http://upload.wikimedia.org/wikipedia/wikimania2007/d/d3/RassbachPincockMingus07.pdf
>
> Delirium, you do make it sound as if merely having the tagged dataset
> solves the entire problem. But there are really multiple problems. One
> is learning to classify what you have been told is in the dataset
> (e.g., that all instances of this rule in the edit history *really
> are* vandalism). Another is learning new reasons that an edit is
> vandalism, based on all the other occurrences of vandalism and
> non-vandalism and a sophisticated pre-parse of all the content that
> breaks it down into natural-language features. Finally, you then wish
> to use this system to bootstrap a vandalism detection system that can
> generalize to entirely new instances of vandalism.
>
> The primary way of doing this is to use positive and *negative*
> examples of vandalism in conjunction with their features. A good set
> of example features is an article's or an edit's conformance with the
> Wikipedia Manual of Style. I never implemented the entire MoS, but I
> did do quite a bit of it, and it is quite indicative of quality.
>
> Generally speaking, it is not true that you can only draw conclusions
> about what is immediately available in your dataset. It is true,
> however, that machine learning systems, unlike people, struggle with
> generalization.
>
> On Thu, Mar 19, 2009 at 6:03 AM, Delirium <delir...@hackish.org> wrote:
>> Brian wrote:
>>> This extension is very important for training machine-learning
>>> vandalism detection bots. Recently published systems use only hundreds
>>> of examples of vandalism in training - not nearly enough to
>>> distinguish between the variety found in Wikipedia or generalize to
>>> new, unseen forms of vandalism. A large set of human-created rules
>>> could be run against all previous edits in order to create a massive
>>> vandalism dataset.
>>
>> As a machine-learning person, this seems like a somewhat problematic
>> idea --- generating training examples *from a rule set* and then
>> learning on them is just a very roundabout way of reconstructing that
>> rule set. What you really want is a large dataset of human-labeled
>> examples of vandalism / non-vandalism that *can't* currently be
>> distinguished reliably by rules, so that you can throw a
>> machine-learning algorithm at the problem of trying to come up with
>> some.
>>
>> -Mark
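For what it's worth, here is a rough sketch of the kind of hand-crafted Manual of Style conformance features mentioned above. The specific checks are illustrative guesses of my own, not the feature set from the paper:

# Illustrative MoS-conformance features for one edit's text. The checks
# below are guesses at plausible style signals, not the original set.
import re

def mos_features(text):
    words = text.split()
    return {
        # The MoS discourages ALL-CAPS emphasis in article prose.
        "caps_ratio": sum(w.isupper() for w in words) / max(len(words), 1),
        # Repeated punctuation ("!!!", "???") is a common vandalism marker.
        "repeat_punct": len(re.findall(r"[!?]{2,}", text)),
        # First-person pronouns rarely belong in encyclopedic writing.
        "first_person": len(re.findall(r"\b(I|me|my|we|our)\b", text)),
        # Bare external links instead of bracketed/cited references.
        "bare_links": len(re.findall(r"(?<!\[)\bhttps?://", text)),
    }

print(mos_features("THIS ARTICLE SUCKS!!! I fixed it, see http://example.com"))

Feature vectors like this, computed for both positive and negative examples, are what the classifier actually consumes.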