I just wanted to be really clear about what I mean by a specific
counter-example to this just being a case of "reconstructing that
rule set." Suppose you run the AbuseFilter rules over the entire
history of the wiki in order to generate a dataset of positive and
negative examples of vandalism edits. You should then *throw the
rules away* and attempt to discover features that separate the edits
into the two classes correctly, more or less blind.
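
As a rough sketch of the labelling step (purely illustrative:
rule_matches() below is a hypothetical stand-in for running the
AbuseFilter rule set against one edit, not the extension's real API):

import csv

def rule_matches(diff_text):
    # Hypothetical black box wrapping the AbuseFilter rule set;
    # returns True if any rule fires on this edit's diff.
    raise NotImplementedError

def label_history(edits, out_path="labels.csv"):
    """edits yields (rev_id, diff_text) pairs from a history dump."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["rev_id", "is_vandalism"])
        for rev_id, diff_text in edits:
            # The rules are consulted exactly once, to produce the
            # labels, and play no further part in training.
            writer.writerow([rev_id, int(rule_matches(diff_text))])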

The key, then, is feature discovery, and a machine system has the
potential to do this more effectively than a human, by virtue of its
ability to read the entire encyclopedia.
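
To make the feature-discovery step concrete, here is a minimal
sketch, assuming scikit-learn and the labels file from the sketch
above; word n-gram features over the edit diffs are just one example
of what the system could read for itself, not a claim about the best
feature space:

import csv
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def load_dataset(labels_path, diff_lookup):
    """diff_lookup maps rev_id -> diff text, e.g. from a dump."""
    texts, labels = [], []
    with open(labels_path) as f:
        for row in csv.DictReader(f):
            texts.append(diff_lookup[row["rev_id"]])
            labels.append(int(row["is_vandalism"]))
    return texts, labels

def discover_features(texts, labels):
    # Generic word n-gram features: nothing here encodes the
    # AbuseFilter rules that produced the labels.
    vec = TfidfVectorizer(ngram_range=(1, 2), min_df=5)
    X = vec.fit_transform(texts)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te)))
    # The interesting output: which features the model discovered on
    # its own, independently of how the labels were generated.
    names = vec.get_feature_names_out()
    top = np.argsort(clf.coef_[0])[-20:]
    print("strongest vandalism indicators:",
          [names[i] for i in top[::-1]])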

On Thu, Mar 19, 2009 at 2:30 PM, Brian <brian.min...@colorado.edu> wrote:
> I presented a talk at Wikimania 2007 that espoused the virtues of
> combining human measures of content with automatically determined
> measures in order to generalize to unseen instances. Unfortunately all
> those Wikimania talks seem to have been lost. It was related to this
> article on predicting the quality ratings provided by the Wikipedia
> Editorial Team:
>
> Rassbach, L., Pincock, T., Mingus, B. (2007). "Exploring the
> Feasibility of Automatically Rating Online Article Quality"
> http://upload.wikimedia.org/wikipedia/wikimania2007/d/d3/RassbachPincockMingus07.pdf
>
> Delirium, you do make it sound as if merely having the tagged dataset
> solves the entire problem. But there are really multiple problems. One
> is learning to classify what you have been told is in the dataset
> (e.g., that all instances of this rule in the edit history *really
> are* vandalism). The other is learning about new reasons that this
> edit is vandalism based on all the other occurrences of vandalism and
> non-vandalism and a sophisticated pre-parse of all the content that
> breaks it down into natural language features.  Finally, you then wish
> to use this system to bootstrap a vandalism detection system that can
> generalize to entirely new instances of vandalism.
>
> The primary way of doing this is to use positive and *negative*
> examples of vandalism in conjunction with their features. A good set
> of example features is an article's or an edit's conformance with the
> Wikipedia Manual of Style. I never implemented the entire MoS, but I
> did do quite a bit of it and it is quite indicative of quality.
>
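
(To illustrate the kind of MoS-conformance features described just
above: a toy sketch, with checks I made up for illustration rather
than the features from the paper.)

import re

def mos_features(wikitext):
    # A few Manual-of-Style-flavoured signals for one revision.
    # These particular checks are illustrative only.
    words = wikitext.split()
    n = max(len(words), 1)
    return {
        # Shouting and repeated punctuation are discouraged by the MoS.
        "frac_all_caps": sum(w.isupper() and len(w) > 3
                             for w in words) / n,
        "repeated_punct": len(re.findall(r"[!?]{2,}", wikitext)),
        # Rough proxies for sourcing and structure.
        "n_refs": len(re.findall(r"<ref", wikitext)),
        "n_headings": len(re.findall(r"^==.+==\s*$", wikitext, re.M)),
        "n_wikilinks": wikitext.count("[["),
    }
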
> Generally speaking, it is not true that you can only draw conclusions
> about what is immediately available in your dataset. It is true that,
> unlike people, machine learning systems struggle with
> generalization.
>
> On Thu, Mar 19, 2009 at 6:03 AM, Delirium <delir...@hackish.org> wrote:
>> Brian wrote:
>>> This extension is very important for training machine learning
>>> vandalism detection bots. Recently published systems use only hundreds
>>> of examples of vandalism in training - not nearly enough to
>>> distinguish between the variety found in Wikipedia or generalize to
>>> new, unseen forms of vandalism. A large set of human created rules
>>> could be run against all previous edits in order to create a massive
>>> vandalism dataset.
>> As a machine-learning person, this seems like a somewhat problematic
>> idea--- generating training examples *from a rule set* and then learning
>> on them is just a very roundabout way of reconstructing that rule set.
>> What you really want is a large dataset of human-labeled examples of
>> vandalism / non-vandalism that *can't* currently be distinguished
>> reliably by rules, so you can throw a machine-learning algorithm at the
>> problem of trying to come up with some.
>>
>> -Mark
>>
>

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
