Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Platonides
Andrew Garrett wrote:
> On Thu, Mar 19, 2009 at 11:54 AM, Platonides  wrote:
>> PS: Why isn't there a link to Special:AbuseFilter/history/$id on the
>> filter view?
> 
> There is.

Oops. I was looking for it on the top bar, not at the bottom. I stand
corrected.


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread David Gerard
2009/3/19 Aryeh Gregor :
> On Thu, Mar 19, 2009 at 5:26 PM, Brian  wrote:

>> A general point - there is a *lot* of information contained in edits
>> that AbuseFilter cannot practically characterize due to the complexity
>> of language and the subtlety of certain types of abuse. A system with
>> access to natural language features  (and wikitext features) could
>> theoretically detect them.

> And how poorly would *that* perform, if the current AbuseFilter
> already has performance problems?  :)


Research box, toolserver cluster! :-D


- d.


Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Aryeh Gregor
On Thu, Mar 19, 2009 at 5:26 PM, Brian  wrote:
> A general point - there is a *lot* of information contained in edits
> that AbuseFilter cannot practically characterize due to the complexity
> of language and the subtlety of certain types of abuse. A system with
> access to natural language features  (and wikitext features) could
> theoretically detect them.

And how poorly would *that* perform, if the current AbuseFilter
already has performance problems?  :)


Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Brian
Ultimately we need a system that integrates information from multiple
sources, such as WikiTrust, AbuseFilter and the Wikipedia Editorial
Team.

A general point - there is a *lot* of information contained in edits
that AbuseFilter cannot practically characterize due to the complexity
of language and the subtlety of certain types of abuse. A system with
access to natural language features  (and wikitext features) could
theoretically detect them. My quality research group considered
including features relating to the [[Thematic relation]]s found in an
article (we have access to a thematic role parser), which could
potentially be used to detect bad writing - an indicator that the edit
contains vandalism.

On Thu, Mar 19, 2009 at 3:17 PM, Delirium  wrote:
> But if your training data is
> the output of the previous rule set, you aren't going to be able to
> *improve* on its performance without some additional information (or
> built-in inductive bias).
>
> -Mark
>



Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Delirium
Brian wrote:
> I just wanted to be really clear about what I mean as a specific
> counter-example to this just being an example of "reconstructing that
> rule set." Suppose you use the AbuseFilter rules on the entire history
> of the wiki in order to generate a dataset of positive and negative
> examples of vandalism edits. You should then *throw the rules away*
> and attempt to discover  features that separate the vandalism into
> classes correctly, more or less in the blind.

That's precisely the case where you're attempting to reconstruct the 
original rule set (or some work-alike). If you had positive and negative 
examples that were actually "known good" examples of edits that really 
are vandalism, and really aren't vandalism, then yes you could turn 
loose an algorithm to generalize over them to discover a discriminator 
between the "is vandalism" and "isn't vandalism" classes. But if your 
labels are from the output of the existing AbuseFilter, then your 
training classes are really "is flagged by the AbuseFilter" and "is not 
flagged by the AbuseFilter", and any machine-learning algorithm will try 
to generalize the examples in a way that discriminates *those* classes. 
To the extent the AbuseFilter actually does flag vandalism accurately, 
you'll learn a concept approximating that of vandalism. But to the 
extent it doesn't (e.g. if it systematically mis-labels certain kinds of 
edits), you'll learn the same flaws.

That might not be useless--- you might recover a more concise rule set 
that replicates the original performance. But if your training data is 
the output of the previous rule set, you aren't going to be able to 
*improve* on its performance without some additional information (or 
built-in inductive bias).
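A toy sketch of this point (entirely invented data, with a single
"shoutiness" feature standing in for the rule set; not real AbuseFilter
output): fitting a threshold to the rule's own labels just recovers the
rule, and cannot beat the rule's accuracy against ground truth.

```python
import random

random.seed(42)

# Toy edits: each has a "shoutiness" score; real vandalism tends to be shoutier.
def make_edit():
    vandal = random.random() < 0.3
    score = random.gauss(0.7 if vandal else 0.3, 0.15)
    return vandal, score

edits = [make_edit() for _ in range(2000)]

# Stand-in for the existing rule set: it flags only very shouty edits
# (threshold 0.6), so it systematically misses subtler vandalism.
rule_flag = [score > 0.6 for _, score in edits]

def accuracy(threshold, labels):
    return sum((score > threshold) == label
               for (_, score), label in zip(edits, labels)) / len(edits)

# "Train" on the rule's output by picking the threshold that best matches
# the rule's labels -- this simply reconstructs the rule.
learned = max((t / 100 for t in range(100)),
              key=lambda t: accuracy(t, rule_flag))

truth = [vandal for vandal, _ in edits]
print("learned threshold:", learned)
print("agreement with rule:", accuracy(learned, rule_flag))
print("accuracy vs. truth (learned):", accuracy(learned, truth))
print("accuracy vs. truth (rule):", accuracy(0.6, truth))
```

The learned model agrees with the rule almost perfectly, and its accuracy
against the true labels is the rule's accuracy: the flaws carry over.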

-Mark



Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Delirium
Brian wrote:
> Delirium, you do make it sound as if merely having the tagged dataset
> solves the entire problem. But there are really multiple problems. One
> is learning to classify what you have been told is in the dataset
> (e.g., that all instances of this rule in the edit history *really
> are* vandalism). The other is learning about new reasons that this
> edit is vandalism based on all the other occurrences of vandalism and
> non-vandalism and a sophisticated pre-parse of all the content that
> breaks it down into natural language features.  Finally, you then wish
> to use this system to bootstrap a vandalism detection system that can
> generalize to entirely new instances of vandalism.
> 
> Generally speaking, it is not true that you can only draw conclusions
> about what is immediately available in your dataset. It is true that,
> with the exception of people, machine learning systems struggle with
> generalization.

My point is mainly that using the *results* of an automated rule system 
as *input* to a machine-learning algorithm won't constitute training on 
"vandalism", but on "what the current rule set considers vandalism". I 
don't see a particularly good reason to find new reasons an edit is 
vandalism for edits that we already correctly predict. What we want is 
new discriminators for edits we *don't* correctly predict. And for 
those, you can't use the labels-given-by-the-current rules as the 
training data, since if the current rule set produces false positives, 
those are now positives in your training set; and if the rule set has 
false negatives, those are now negatives in your training set.

I suppose it could be used for proposing hypotheses to human 
discriminators. For example, you can propose new feature X, if you find 
that 95% of the time the existing rule set flags edits with feature X as 
vandalism, and by human inspection determine that the remaining 5% were 
false negatives, so actually feature X should be a new "this is 
vandalism" feature. But you need that human inspection--- you can't 
automatically discriminate between rules that improve the filter set's 
performance and rules that decrease it if your labeled data set is the 
one with the mistakes in it.
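The 95%/5% workflow might be sketched like this (toy numbers and a
hypothetical feature X; not an actual AbuseFilter interface): compute
agreement between a candidate feature and the existing filters, and queue
the disagreements for humans rather than trusting the rule's labels.

```python
# Toy labeled data: (has_feature_x, flagged_by_current_filters).
# 95 of 100 feature-X edits are flagged; the other 5 may be the rule
# set's false negatives, so they go to humans rather than auto-labeling.
edits = [(True, True)] * 95 + [(True, False)] * 5 + [(False, False)] * 900

with_x = [flagged for has_x, flagged in edits if has_x]
agreement = sum(with_x) / len(with_x)

to_review = []
if agreement >= 0.9:
    # High agreement: propose feature X, but queue the disagreements
    # for human inspection instead of auto-promoting the feature.
    to_review = [e for e in edits if e[0] and not e[1]]

print(f"feature X agreement: {agreement:.0%}")
print(f"edits queued for human review: {len(to_review)}")
```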

-Mark



Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Brian
I just wanted to be really clear about what I mean as a specific
counter-example to this just being an example of "reconstructing that
rule set." Suppose you use the AbuseFilter rules on the entire history
of the wiki in order to generate a dataset of positive and negative
examples of vandalism edits. You should then *throw the rules away*
and attempt to discover features that separate the vandalism into
classes correctly, more or less blindly.

The key, then, is feature discovery, and a machine system has the
potential to do this more effectively than a human by virtue of its
ability to read the entire encyclopedia.

On Thu, Mar 19, 2009 at 2:30 PM, Brian  wrote:
> I presented a talk at Wikimania 2007 that espoused the virtues of
> combining human measures of content with automatically determined
> measures in order to generalize to unseen instances. Unfortunately all
> those Wikimania talks seem to have been lost. It was related to this
> article on predicting the quality ratings provided by the Wikipedia
> Editorial Team:
>
> Rassbach, L., Pincock, T., Mingus., B (2007). "Exploring the
> Feasibility of Automatically Rating Online Article Quality"
> http://upload.wikimedia.org/wikipedia/wikimania2007/d/d3/RassbachPincockMingus07.pdf
>
> Delirium, you do make it sound as if merely having the tagged dataset
> solves the entire problem. But there are really multiple problems. One
> is learning to classify what you have been told is in the dataset
> (e.g., that all instances of this rule in the edit history *really
> are* vandalism). The other is learning about new reasons that this
> edit is vandalism based on all the other occurrences of vandalism and
> non-vandalism and a sophisticated pre-parse of all the content that
> breaks it down into natural language features.  Finally, you then wish
> to use this system to bootstrap a vandalism detection system that can
> generalize to entirely new instances of vandalism.
>
> The primary way of doing this is to use positive and *negative*
> examples of vandalism in conjunction with their features. A good set
> of example features is an article or an edit's conformance with the
> Wikipedia Manual of Style. I never implemented the entire MoS, but I
> did do quite a bit of it and it is quite indicative of quality.
>
> Generally speaking, it is not true that you can only draw conclusions
> about what is immediately available in your dataset. It is true that,
> with the exception of people, machine learning systems struggle with
> generalization.
>
> On Thu, Mar 19, 2009 at 6:03 AM, Delirium  wrote:
>> Brian wrote:
>>> This extension is very important for training  machine learning
>>> vandalism detection bots. Recently published systems use only hundreds
>>> of examples of vandalism in training - not nearly enough to
>>> distinguish between the variety found in Wikipedia or generalize to
>>> new, unseen forms of vandalism. A large set of human created rules
>>> could be run against all previous edits in order to create a massive
>>> vandalism dataset.
>> As a machine-learning person, this seems like a somewhat problematic
>> idea--- generating training examples *from a rule set* and then learning
>> on them is just a very roundabout way of reconstructing that rule set.
>> What you really want is a large dataset of human-labeled examples of
>> vandalism / non-vandalism that *can't* currently be distinguished
>> reliably by rules, so you can throw a machine-learning algorithm at the
>> problem of trying to come up with some.
>>
>> -Mark
>>
>>



Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Brian
I presented a talk at Wikimania 2007 that espoused the virtues of
combining human measures of content with automatically determined
measures in order to generalize to unseen instances. Unfortunately all
those Wikimania talks seem to have been lost. It was related to this
article on predicting the quality ratings provided by the Wikipedia
Editorial Team:

Rassbach, L., Pincock, T., Mingus., B (2007). "Exploring the
Feasibility of Automatically Rating Online Article Quality"
http://upload.wikimedia.org/wikipedia/wikimania2007/d/d3/RassbachPincockMingus07.pdf

Delirium, you do make it sound as if merely having the tagged dataset
solves the entire problem. But there are really multiple problems. One
is learning to classify what you have been told is in the dataset
(e.g., that all instances of this rule in the edit history *really
are* vandalism). The other is learning about new reasons that this
edit is vandalism based on all the other occurrences of vandalism and
non-vandalism and a sophisticated pre-parse of all the content that
breaks it down into natural language features.  Finally, you then wish
to use this system to bootstrap a vandalism detection system that can
generalize to entirely new instances of vandalism.

The primary way of doing this is to use positive and *negative*
examples of vandalism in conjunction with their features. A good set
of example features is an article or an edit's conformance with the
Wikipedia Manual of Style. I never implemented the entire MoS, but I
did do quite a bit of it and it is quite indicative of quality.

Generally speaking, it is not true that you can only draw conclusions
about what is immediately available in your dataset. It is true that,
with the exception of people, machine learning systems struggle with
generalization.

On Thu, Mar 19, 2009 at 6:03 AM, Delirium  wrote:
> Brian wrote:
>> This extension is very important for training  machine learning
>> vandalism detection bots. Recently published systems use only hundreds
>> of examples of vandalism in training - not nearly enough to
>> distinguish between the variety found in Wikipedia or generalize to
>> new, unseen forms of vandalism. A large set of human created rules
>> could be run against all previous edits in order to create a massive
>> vandalism dataset.
> As a machine-learning person, this seems like a somewhat problematic
> idea--- generating training examples *from a rule set* and then learning
> on them is just a very roundabout way of reconstructing that rule set.
> What you really want is a large dataset of human-labeled examples of
> vandalism / non-vandalism that *can't* currently be distinguished
> reliably by rules, so you can throw a machine-learning algorithm at the
> problem of trying to come up with some.
>
> -Mark
>
>



Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Brion Vibber
On 3/19/09 12:21 PM, Alex wrote:
> Yes, in one filter (filter 32) I've been watching, it was taking
> 90-120ms for what seemed like simple checks (action, editcount,
> difference in bytes), so I moved the editcount check last, in case it
> had to pull that from the DB. The time dropped to ~3ms, but a couple
> hours later with no changes to the order and it's up to 20ms.

Well, a couple notes here:

The runtime of a filter will depend on what it's filtering -- large 
pages or pages with lots of links are more likely to take longer.

It probably makes sense to give some min/max/mean/average times or 
something... and a plot over time might be very helpful as well to help 
filter out (or show up!) outliers.
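A rolling-window summary along these lines might look like the following
sketch (illustrative only: `FilterProfiler` and the sample runtimes are
invented, not the actual AbuseFilter profiler):

```python
from collections import deque
from statistics import mean, median

class FilterProfiler:
    """Rolling runtime stats for one filter (illustrative sketch)."""
    def __init__(self, window=1000):
        self.samples = deque(maxlen=window)  # keep only the last `window` runs

    def record(self, ms):
        self.samples.append(ms)

    def summary(self):
        s = sorted(self.samples)
        return {
            "n": len(s),
            "min": s[0],
            "median": median(s),
            "mean": round(mean(s), 2),
            "p95": s[int(0.95 * (len(s) - 1))],  # outliers show up here
            "max": s[-1],
        }

prof = FilterProfiler()
for ms in [3, 4, 5, 3, 4, 120, 5, 4, 3, 90]:  # noisy runtimes like filter 32's
    prof.record(ms)
print(prof.summary())
```

With noisy runtimes, the median stays near the typical cost while the mean
and max are dominated by the occasional slow run, which is exactly why a
single reported number jumps around.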

-- brion



Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Alex
Robert Rohde wrote:
> On Wed, Mar 18, 2009 at 8:00 PM, Andrew Garrett  wrote:
> 
>> To help a bit more with performance, I've also added a profiler within
>> the interface itself. Hopefully this will encourage self-policing with
>> regard to filter performance.
> 
> Based on personal observations, the self-profiling is quite noisy.
> Sometimes a filter will report one value (say 5 ms), then five minutes
> later report a value 20 times larger, and a few minutes after that it
> jumps back down.
> 
> Assuming that this behavior is a result of variations in the filter
> workload (and not some sort of profiling bug), it would be useful if
> you could increase the profiling window to better average over those
> fluctuations.  Right now it is hard to tell which rules are slow or
> not because the numbers aren't very stable.
> 

Yes, in one filter (filter 32) I've been watching, it was taking
90-120ms for what seemed like simple checks (action, editcount,
difference in bytes), so I moved the editcount check last, in case it
had to pull that from the DB. The time dropped to ~3ms, but a couple
hours later with no changes to the order and it's up to 20ms.

Related to this: It would be nice if there was a chart or something
comparing how expensive certain variables and functions are.

-- 
Alex (wikipedia:en:User:Mr.Z-man)



Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Soxred93
Cobi (owner of ClueBot) and his roommate Crispy have already been
working hard to build this specific dataset, but they've been hampered
by a lack of contributors. The page is here:
http://en.wikipedia.org/wiki/User:Crispy1989#New_Dataset_Contribution_Interface

X!

On Mar 19, 2009, at 8:15 AM, Tei wrote:

> On Thu, Mar 19, 2009 at 1:03 PM, Delirium   
> wrote:
>
>> Brian wrote:
>>> This extension is very important for training  machine learning
>>> vandalism detection bots. Recently published systems use only  
>>> hundreds
>>> of examples of vandalism in training - not nearly enough to
>>> distinguish between the variety found in Wikipedia or generalize to
>>> new, unseen forms of vandalism. A large set of human created rules
>>> could be run against all previous edits in order to create a massive
>>> vandalism dataset.
>> As a machine-learning person, this seems like a somewhat problematic
>> idea--- generating training examples *from a rule set* and then  
>> learning
>> on them is just a very roundabout way of reconstructing that rule  
>> set.
>> What you really want is a large dataset of human-labeled examples of
>> vandalism / non-vandalism that *can't* currently be distinguished
>> reliably by rules, so you can throw a machine-learning algorithm  
>> at the
>> problem of trying to come up with some.
>>
>
> Since there's already a database, this sounds like it could be done by
> flagging edits as "vandalism" and then reading the existing database
> information to extract details like the IP, a diff of the change, etc.
> That way, humans define what "vandalism" is, and the machine can learn
> the meaning.
>
> This may need a button or something, so users can report this and the
> database can flag the edit.
>
>
> -- 
> --
> ℱin del ℳensaje.



Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Robert Rohde
On Wed, Mar 18, 2009 at 8:00 PM, Andrew Garrett  wrote:

>
> To help a bit more with performance, I've also added a profiler within
> the interface itself. Hopefully this will encourage self-policing with
> regard to filter performance.

Based on personal observations, the self-profiling is quite noisy.
Sometimes a filter will report one value (say 5 ms), then five minutes
later report a value 20 times larger, and a few minutes after that it
jumps back down.

Assuming that this behavior is a result of variations in the filter
workload (and not some sort of profiling bug), it would be useful if
you could increase the profiling window to better average over those
fluctuations.  Right now it is hard to tell which rules are slow or
not because the numbers aren't very stable.

-Robert Rohde



Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Brion Vibber
On 3/19/09 5:15 AM, Tei wrote:
> Since there's already a database, this sounds like it could be done by
> flagging edits as "vandalism" and then reading the existing database
> information to extract details like the IP, a diff of the change, etc.
> That way, humans define what "vandalism" is, and the machine can learn
> the meaning.
>
> This may need a button or something, so users can report this and the
> database can flag the edit.

*nod*

Part of the infrastructure for AbuseFilter was adding a tag marker 
system for edits and log entries, so filters can tag an event as 
potentially needing more review.

(This is different from say Flagged Revisions, which attempts to mark up 
a version of a page as having a certain overall state -- it's a *page* 
thing; here individual actions can be tagged based only on their own 
internal changes, so similar *events* happening anywhere can be called 
up in a search for human review.)


It would definitely be useful to allow readers to provide similar 
feedback, much as many photo and video sharing sites allow visitors to 
flag something as 'inappropriate' which puts it into a queue for admins 
to look at more closely.

So far we don't have a manual tagging interface (and the tag-filtering 
views are disabled pending some query fixes), but the infrastructure is 
laid in. :)

-- brion



Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Brion Vibber
On Mar 18, 2009, at 20:00, Andrew Garrett  wrote:
>
>>
> To help a bit more with performance, I've also added a profiler within
> the interface itself. Hopefully this will encourage self-policing with
> regard to filter performance.

Awesome!

Maybe we could use that for templates too ... ;)

-- Brion

>
>
> -- 
> Andrew Garrett
>



Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Tei
On Thu, Mar 19, 2009 at 1:03 PM, Delirium  wrote:

> Brian wrote:
> > This extension is very important for training  machine learning
> > vandalism detection bots. Recently published systems use only hundreds
> > of examples of vandalism in training - not nearly enough to
> > distinguish between the variety found in Wikipedia or generalize to
> > new, unseen forms of vandalism. A large set of human created rules
> > could be run against all previous edits in order to create a massive
> > vandalism dataset.
> As a machine-learning person, this seems like a somewhat problematic
> idea--- generating training examples *from a rule set* and then learning
> on them is just a very roundabout way of reconstructing that rule set.
> What you really want is a large dataset of human-labeled examples of
> vandalism / non-vandalism that *can't* currently be distinguished
> reliably by rules, so you can throw a machine-learning algorithm at the
> problem of trying to come up with some.
>

Since there's already a database, this sounds like it could be done by
flagging edits as "vandalism" and then reading the existing database
information to extract details like the IP, a diff of the change, etc.
That way, humans define what "vandalism" is, and the machine can learn
the meaning.

This may need a button or something, so users can report this and the
database can flag the edit.


-- 
--
ℱin del ℳensaje.

Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Delirium
Brian wrote:
> This extension is very important for training  machine learning
> vandalism detection bots. Recently published systems use only hundreds
> of examples of vandalism in training - not nearly enough to
> distinguish between the variety found in Wikipedia or generalize to
> new, unseen forms of vandalism. A large set of human created rules
> could be run against all previous edits in order to create a massive
> vandalism dataset.
As a machine-learning person, this seems like a somewhat problematic 
idea--- generating training examples *from a rule set* and then learning 
on them is just a very roundabout way of reconstructing that rule set. 
What you really want is a large dataset of human-labeled examples of 
vandalism / non-vandalism that *can't* currently be distinguished 
reliably by rules, so you can throw a machine-learning algorithm at the 
problem of trying to come up with some.

-Mark




Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-18 Thread Robert Rohde
On Wed, Mar 18, 2009 at 8:00 PM, Andrew Garrett  wrote:

> I've disabled a filter or two which were taking well in excess of
> 150ms to run, and seemed to be targeted at specific vandals, without
> any hits. The culprit seemed to be running about 20 regexes to
> determine if an IP is in a particular range, where one call to
> ip_in_range would suffice. Of course, this is also a documentation
> issue which I'm working on.


ip_in_range
rmwhitespace
rmspecials
? :
if then else end
contains

and probably some others appear in SVN but not in the dropdown list
that I assume most people are using to locate options.

-Robert Rohde



Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-18 Thread Andrew Garrett
Tim Starling wrote:
> Robert Rohde wrote:
>> For Andrew or anyone else that knows, can we assume that the filter is
>> smart enough that if the first part of an AND clause fails then the
>> other parts don't run (or similarly if the first part of an OR
>> succeeds)?  If so, we can probably optimize rules by doing easy checks
>> first before complex ones.
>
> No, everything will be evaluated.

I've written and deployed branch optimisation code, which reduced
run-time by about one third.

>> Note that the problem with rule 48 was that added_links triggers a
>> complete parse of the pre-edit page text. It could be replaced by a
>> check against the externallinks table. No amount of clever shortcut
>> evaluation would have made it fast.

I've fixed this to use the DB instead for that particular context.

On Thu, Mar 19, 2009 at 11:54 AM, Platonides  wrote:
> PS: Why isn't there a link to Special:AbuseFilter/history/$id on the
> filter view?

There is.

I've disabled a filter or two which were taking well in excess of
150ms to run, and seemed to be targeted at specific vandals, without
any hits. The culprit seemed to be running about 20 regexes to
determine if an IP is in a particular range, where one call to
ip_in_range would suffice. Of course, this is also a documentation
issue which I'm working on.
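For illustration, here is the same trade-off expressed with Python's
stdlib rather than AbuseFilter syntax (the address and range are made up;
`ip_in_range` is the AbuseFilter function mentioned above, approximated
here by a CIDR containment test):

```python
import ipaddress
import re

ip = "192.0.2.57"

# The slow approach: many regexes to approximate one IP range.
range_regexes = [re.compile(rf"^192\.0\.2\.{block}\d$") for block in range(10)]
by_regex = any(r.match(ip) for r in range_regexes)

# One containment test against the CIDR range -- the moral equivalent of
# ip_in_range, done with a single parse instead of ~10 regex matches:
by_cidr = ipaddress.ip_address(ip) in ipaddress.ip_network("192.0.2.0/24")

print(by_regex, by_cidr)
```

Both approaches agree on the answer; the difference is purely cost and
clarity, which is the documentation point being made.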

To help a bit more with performance, I've also added a profiler within
the interface itself. Hopefully this will encourage self-policing with
regard to filter performance.

-- 
Andrew Garrett


Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-18 Thread Platonides
Tim Starling wrote:
> Robert Rohde wrote:
>> For Andrew or anyone else that knows, can we assume that the filter is
>> smart enough that if the first part of an AND clause fails then the
>> other parts don't run (or similarly if the first part of an OR
>> succeeds)?  If so, we can probably optimize rules by doing easy checks
>> first before complex ones.
> 
> No, everything will be evaluated.
> 
> Note that the problem with rule 48 was that added_links triggers a
> complete parse of the pre-edit page text. It could be replaced by a
> check against the externallinks table. No amount of clever shortcut
> evaluation would have made it fast.
> 
> -- Tim Starling

With branch optimization, placing the check !("autoconfirmed" in
USER_GROUPS) and namespace at the beginning would avoid checking the
added_links at all (and thus the parse).
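The effect of branch optimisation can be illustrated outside the
AbuseFilter language: Python's `and` short-circuits the way the optimised
evaluator is described as doing. The check functions below are invented
stand-ins (the expensive one represents the full page parse that
added_links would trigger):

```python
calls = {"cheap": 0, "expensive": 0}

def not_autoconfirmed(user_is_autoconfirmed):
    calls["cheap"] += 1               # cheap: a flag already in memory
    return not user_is_autoconfirmed

def added_links_match():
    calls["expensive"] += 1           # expensive: stands in for a full parse
    return True

# With short-circuiting AND, the cheap check runs first and the
# expensive one is skipped whenever the cheap one returns False.
for _ in range(100):
    hit = not_autoconfirmed(True) and added_links_match()

print(calls)
```

For a stream of autoconfirmed editors, the expensive check never runs at
all, which is why check ordering matters so much.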

Another option could be to automatically optimize based on the cost of
each rule.

PS: Why isn't there a link to Special:AbuseFilter/history/$id on the
filter view?




Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-18 Thread Soxred93
However, that simply disallows them all. On enwiki, the blanking
filter warns the user and lets them go through with it after
confirmation.

X!

On Mar 18, 2009, at 4:51 PM, jida...@jidanni.org wrote:

> AG> frown on page-blanking
>
> For now I just stop them on my wikis with
> $wgSpamRegex=array('/^\B$/');
> I haven't tried fancier solutions yet.
>




Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-18 Thread Brian
This extension is very important for training machine-learning
vandalism-detection bots. Recently published systems use only hundreds
of examples of vandalism in training - not nearly enough to
distinguish between the variety found in Wikipedia or to generalize to
new, unseen forms of vandalism. A large set of human-created rules
could be run against all previous edits in order to create a massive
vandalism dataset. If one includes positive and negative examples of
vandalism in training, practically the entire text of the history of
Wikipedia can be used as the training set, possibly creating a
remarkable bot.

On Wed, Mar 18, 2009 at 6:34 AM, Andrew Garrett  wrote:
> I am pleased to announce that the Abuse Filter [1] has been activated
> on English Wikipedia!
>
> The Abuse Filter is an extension to the MediaWiki [2] software that
> powers Wikipedia allowing automatic "filters" or "rules" to be run
> against every edit, and to take actions if any of those rules are
> triggered. It is designed to combat vandalism which is simple and
> pattern-based, from blanking pages to complicated evasive page-move
> vandalism.
>
> We've already seen some pretty cool uses for the Abuse Filter. While
> there are filters for the obvious personal attacks [3], many of our
> filters are there just to identify common newbie mistakes such as
> page-blanking [4], give the users a friendly warning [5] and ask them
> if they really want to submit their edits.
>
> The best part is that these friendly "soft" warning messages seem to
> work in passively changing user behaviour. Just the suggestion that we
> frown on page-blanking was enough to stop 56 of the 78 matches [6] of
> that filter when I checked. If you look closely, you'll even find that
> many of the users took our advice and redirected the page or did
> something else more constructive instead.
>
> I'm very pleased at my work being used so well on English Wikipedia,
> and I'm looking forward to seeing some quality filters in the near
> future! While at the moment, some of the harsher actions such as
> blocking are disabled on Wikimedia, we're hoping that the filters
> developed will be good enough that we can think about activating them
> in the future.
>
> If anybody has any questions or concerns about the Abuse Filter, feel
> free to file a bug [7], contact me on IRC (werdna on
> irc.freenode.net), post on my user talk page, or send me an email at
> agarrett at wikimedia.org
>
> [1] http://www.mediawiki.org/wiki/Extension:AbuseFilter
> [2] http://www.mediawiki.org
> [3] http://en.wikipedia.org/wiki/Special:AbuseFilter/9
> [4] http://en.wikipedia.org/wiki/Special:AbuseFilter/3
> [5] http://en.wikipedia.org/wiki/MediaWiki:Abusefilter-warning-blanking
> [6] 
> http://en.wikipedia.org/w/index.php?title=Special:AbuseLog&wpSearchFilter=3
> [7] http://bugzilla.wikimedia.org
>
> --
> Andrew Garrett
>

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-18 Thread jidanni
AG> frown on page-blanking

For now I just stop them on my wikis with
$wgSpamRegex=array('/^\B$/');
I haven't tried fancier solutions yet.



Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-18 Thread Tim Starling
Robert Rohde wrote:
> For Andrew or anyone else that knows, can we assume that the filter is
> smart enough that if the first part of an AND clause fails then the
> other parts don't run (or similarly if the first part of an OR
> succeeds)?  If so, we can probably optimize rules by doing easy checks
> first before complex ones.

No, everything will be evaluated.

Note that the problem with rule 48 was that added_links triggers a
complete parse of the pre-edit page text. It could be replaced by a
check against the externallinks table. No amount of clever shortcut
evaluation would have made it fast.
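Tim's distinction can be illustrated with a toy sketch (Python, not
AbuseFilter's actual rule language): with short-circuit `and`, putting
a cheap guard first skips the expensive clause on most edits, whereas
an evaluator that runs every clause pays the full cost on every save.

```python
# Toy illustration of short-circuit evaluation: the expensive check
# only runs when the cheap one passes.
calls = {"cheap": 0, "expensive": 0}

def is_new_user(edit):            # cheap: a simple field lookup
    calls["cheap"] += 1
    return edit["user_age_days"] < 4

def adds_external_link(edit):     # stand-in for a costly full parse
    calls["expensive"] += 1
    return "http://" in edit["added_text"]

edits = [{"user_age_days": 400, "added_text": "plain prose"}
         for _ in range(100)]

# Established users fail the cheap test, so the expensive clause
# is never evaluated under short-circuit semantics.
flagged = [e for e in edits if is_new_user(e) and adds_external_link(e)]
```

As Tim notes, though, this only helps rules whose expensive clause is
usually skipped; a clause that must parse the page on most saves stays
slow regardless of ordering.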

-- Tim Starling




Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-18 Thread Brion Vibber
On 3/18/09 12:59 PM, Tim Starling wrote:
> Brion Vibber wrote:
>> On 3/18/09 5:34 AM, Andrew Garrett wrote:
>>> I am pleased to announce that the Abuse Filter [1] has been activated
>>> on English Wikipedia!
>> I've temporarily disabled it as we're seeing some performance problems
>> saving edits at peak time today. Need to make sure there's functional
>> per-filter profiling before re-enabling so we can confirm if one of the
>> 55 active filters (!) is particularly bad or if we need to do overall
>> optimization.
>
> Done, took less than five minutes. Re-enabled.
>
> We're still profiling at ~700ms CPU time per page save, with no
> particular rule dominant. Disabling 20 of them would help.

Not bad for a first production pass on the madness that is enwiki! :D

-- brion



Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-18 Thread Robert Rohde
On Wed, Mar 18, 2009 at 12:59 PM, Tim Starling  wrote:
> Brion Vibber wrote:
>> On 3/18/09 5:34 AM, Andrew Garrett wrote:
>>> I am pleased to announce that the Abuse Filter [1] has been activated
>>> on English Wikipedia!
>>
>> I've temporarily disabled it as we're seeing some performance problems
>> saving edits at peak time today. Need to make sure there's functional
>> per-filter profiling before re-enabling so we can confirm if one of the
>> 55 active filters (!) is particularly bad or if we need to do overall
>> optimization.
>
> Done, took less than five minutes. Re-enabled.
>
> We're still profiling at ~700ms CPU time per page save, with no
> particular rule dominant. Disabling 20 of them would help.

For Andrew or anyone else that knows, can we assume that the filter is
smart enough that if the first part of an AND clause fails then the
other parts don't run (or similarly if the first part of an OR
succeeds)?  If so, we can probably optimize rules by doing easy checks
first before complex ones.

-Robert Rohde



Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-18 Thread Tim Starling
Brion Vibber wrote:
> On 3/18/09 5:34 AM, Andrew Garrett wrote:
>> I am pleased to announce that the Abuse Filter [1] has been activated
>> on English Wikipedia!
> 
> I've temporarily disabled it as we're seeing some performance problems 
> saving edits at peak time today. Need to make sure there's functional 
> per-filter profiling before re-enabling so we can confirm if one of the 
> 55 active filters (!) is particularly bad or if we need to do overall 
> optimization.

Done, took less than five minutes. Re-enabled.

We're still profiling at ~700ms CPU time per page save, with no
particular rule dominant. Disabling 20 of them would help.

-- Tim Starling




Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-18 Thread Robert Rohde
On Wed, Mar 18, 2009 at 12:43 PM, Brion Vibber  wrote:
> On 3/18/09 5:34 AM, Andrew Garrett wrote:
>> I am pleased to announce that the Abuse Filter [1] has been activated
>> on English Wikipedia!
>
> I've temporarily disabled it as we're seeing some performance problems
> saving edits at peak time today. Need to make sure there's functional
> per-filter profiling before re-enabling so we can confirm if one of the
> 55 active filters (!) is particularly bad or if we need to do overall
> optimization.

For a 45-minute window, one specific filter was timing out the server
every time someone tried to save a large page like WP:AN/I.

We found and disabled that one, but more detailed load stats would
definitely be useful.

-Robert Rohde



Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-18 Thread Brion Vibber
On 3/18/09 5:34 AM, Andrew Garrett wrote:
> I am pleased to announce that the Abuse Filter [1] has been activated
> on English Wikipedia!

I've temporarily disabled it as we're seeing some performance problems 
saving edits at peak time today. Need to make sure there's functional 
per-filter profiling before re-enabling so we can confirm if one of the 
55 active filters (!) is particularly bad or if we need to do overall 
optimization.

-- brion



[Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-18 Thread Andrew Garrett
I am pleased to announce that the Abuse Filter [1] has been activated
on English Wikipedia!

The Abuse Filter is an extension to the MediaWiki [2] software that
powers Wikipedia, allowing automatic "filters" or "rules" to be run
against every edit, and to take actions if any of those rules are
triggered. It is designed to combat vandalism which is simple and
pattern-based, from blanking pages to complicated evasive page-move
vandalism.

We've already seen some pretty cool uses for the Abuse Filter. While
there are filters for the obvious personal attacks [3], many of our
filters are there just to identify common newbie mistakes such as
page-blanking [4], give the users a friendly warning [5] and ask them
if they really want to submit their edits.

The best part is that these friendly "soft" warning messages seem to
work in passively changing user behaviour. Just the suggestion that we
frown on page-blanking was enough to stop 56 of the 78 matches [6] of
that filter when I checked. If you look closely, you'll even find that
many of the users took our advice and redirected the page or did
something else more constructive instead.

I'm very pleased at my work being used so well on English Wikipedia,
and I'm looking forward to seeing some quality filters in the near
future! While at the moment, some of the harsher actions such as
blocking are disabled on Wikimedia, we're hoping that the filters
developed will be good enough that we can think about activating them
in the future.

If anybody has any questions or concerns about the Abuse Filter, feel
free to file a bug [7], contact me on IRC (werdna on
irc.freenode.net), post on my user talk page, or send me an email at
agarrett at wikimedia.org

[1] http://www.mediawiki.org/wiki/Extension:AbuseFilter
[2] http://www.mediawiki.org
[3] http://en.wikipedia.org/wiki/Special:AbuseFilter/9
[4] http://en.wikipedia.org/wiki/Special:AbuseFilter/3
[5] http://en.wikipedia.org/wiki/MediaWiki:Abusefilter-warning-blanking
[6] http://en.wikipedia.org/w/index.php?title=Special:AbuseLog&wpSearchFilter=3
[7] http://bugzilla.wikimedia.org

-- 
Andrew Garrett
