Re: [Wikitech-l] Getting the list of Page Titles and Redirects of Wikipedia

2009-03-19 Thread Roan Kattouw
Platonides schreef:
 (it's helpfully provided in the API result . . . actually, what
 does it mean that Portal and Portal talk are canonical? shouldn't
 there be no canonical attribute if the namespace is custom?).
 
 Agree. Portal and Portal talk could still be acceptable, since the
 namespace ids 100-101 are more or less reserved for portals across the
 wikis.
 What is scarier is seeing ns id=102 canonical=Cookbook on
 enwikibooks, whereas the same ns 102 means Wikiproject on some pedias.
 
 Since the API provides namespacealiases linked to the id, not to the
 informal canonical name, I see no reason to keep the canonical
 parameter on the extra ns.
 
This was brought up before in bug 16672 comment #5. My reply was:

  b) custom namespaces shouldn't have a canonical name
Maybe, maybe not; I see arguments for and against. But since 
$wgCanonicalNamespaceNames contains canonical names for custom namespaces too and 
since removing the canonical attribute for some namespaces but not 
others would violate expectations and be a breaking change, I'll just 
keep stuff the way it is. Regardless of whether custom namespaces should 
or shouldn't have a canonical name, removing it from the API output 
isn't worth the trouble.

Roan Kattouw (Catrope)
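
(For reference, the query under discussion -- the API parameters are real;
the output is abridged and illustrative:)

  api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases&format=xml

  <ns id="100" canonical="Portal">Portal</ns>
  <ns id="101" canonical="Portal talk">Portal talk</ns>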



Re: [Wikitech-l] Serving as xhtml+xml

2009-03-19 Thread Alexandre Emsenhuber

On 19.3.2009 4:46, "lee worden" won...@riseup.net wrote:

 Attached is a patch for the skins directory that allows changing the
 Content-type dynamically.  After applying this patch, if any code sets the
 global $wgServeAsXHTML to true, the page will be output with the xhtml+xml
 content type.  This seems to work fine with the existing MW XHTML pages.

Can't you set $wgMimeType = 'application/xhtml+xml'; in LocalSettings.php to
serve pages with application/xhtml+xml content type?

Alexandre Emsenhuber (ialex)
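
(A sketch of a conditional variant, in case switching $wgMimeType
unconditionally is too blunt -- the Accept-header check is my assumption,
not part of Alexandre's suggestion:)

# LocalSettings.php: only switch the MIME type for clients that
# explicitly advertise XHTML support; browsers that don't (e.g. IE)
# would otherwise offer application/xhtml+xml pages as downloads.
if ( isset( $_SERVER['HTTP_ACCEPT'] )
    && strpos( $_SERVER['HTTP_ACCEPT'], 'application/xhtml+xml' ) !== false
) {
    $wgMimeType = 'application/xhtml+xml';
}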





Re: [Wikitech-l] Serving as xhtml+xml

2009-03-19 Thread Aryeh Gregor
2009/3/18 lee worden won...@riseup.net:
 Attached is a patch for the skins directory that allows changing the
 Content-type dynamically.  After applying this patch, if any code sets the
 global $wgServeAsXHTML to true, the page will be output with the xhtml+xml
 content type.  This seems to work fine with the existing MW XHTML pages.

This mailing list doesn't accept attachments.  Patches should be
submitted at bugzilla.wikimedia.org.

 This has been done before, for instance in the ASCIIMath4Wiki extension [2].
  I don't want to change the Content-type unconditionally, though, only some
 of the time, so that we can serve texvc-style images to browsers or users
 that don't like the modified content type.

Note that this will interfere with any kind of HTML caching, such as
Squid or file cache, and with the parser cache as well.  It won't work
correctly in most well-configured MediaWiki installs unless you make
sure to fragment the parser cache appropriately, at a minimum.
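
(As an illustration, a minimal sketch of such fragmentation via the
PageRenderingHash hook, treating the patch's $wgServeAsXHTML as the
switch -- an assumption on my part; Squid and the file cache would still
need separate handling:)

$wgHooks['PageRenderingHash'][] = 'wfXhtmlRenderingHash';

# Append the XHTML flag to the parser cache key, so HTML and XHTML
# renderings of the same page are cached as separate entries.
function wfXhtmlRenderingHash( &$confstr ) {
    global $wgServeAsXHTML;
    if ( !empty( $wgServeAsXHTML ) ) {
        $confstr .= '!xhtml=1';
    }
    return true;
}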

 The patch is made on the 1:1.13.3-1ubuntu1 mediawiki package (from Ubuntu
 9.04), and only modifies Monobook.php and Modern.php.  There are other skins
 in my installation here, but they don't seem to work very well and I didn't
 see where to make the change.

Well, that's okay for a first stab at the implementation, if we want
to include this at all (hard to say without seeing the patch).  Not
your fault our skin system is a complete mess.  There are likely to be
some merge conflicts, though -- 1.13 is very old.  It's best to submit
patches based on trunk.

 Is there a better way to make MathML work in MW?  Might this option be
 included in a future MW release?  Any feedback or alternative suggestions
 are welcome.

Under "Math" in preferences on Wikipedia, there's already a "MathML if
possible (experimental)" option.  I'm not sure if it actually does
anything, though . . . if not, I guess it should be removed.


Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Brion Vibber
On Mar 18, 2009, at 20:00, Andrew Garrett and...@epstone.net wrote:


 To help a bit more with performance, I've also added a profiler within
 the interface itself. Hopefully this will encourage self-policing with
 regard to filter performance.

Awesome!

Maybe we could use that for templates too ... ;)

-- Brion



 -- 
 Andrew Garrett




Re: [Wikitech-l] Serving as xhtml+xml

2009-03-19 Thread Brion Vibber
What's to patch? This is already a configuration variable, just set it...

-- brion vibber (brion @ wikimedia.org)

On Mar 18, 2009, at 21:24, Andrew Garrett agarr...@wikimedia.org wrote:

 2009/3/19 lee worden won...@riseup.net:
 I'm at work on a MW extension that, among other things, uses LaTeXML [1]
 to make XHTML from full LaTeX documents.  One feature is the option to
 render the equations in MathML, which requires the skins to be patched
 so that they output the page as Content-type: application/xhtml+xml
 instead of text/html.

 Attached is a patch for the skins directory that allows changing the
 Content-type dynamically.  After applying this patch, if any code sets
 the global $wgServeAsXHTML to true, the page will be output with the
 xhtml+xml content type.  This seems to work fine with the existing MW
 XHTML pages.

 It should be done with the ParserOutput instead.

 This mailing list does not accept attachments

 -- 
 Andrew Garrett
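
(A sketch of the ParserOutput route Andrew suggests, assuming the
OutputPageParserOutput hook; the flag name on the ParserOutput object is
hypothetical:)

$wgHooks['OutputPageParserOutput'][] = 'wfXhtmlFromParserOutput';

# The extension's parser hook would set a custom flag on the output it
# produces, e.g. $parser->mOutput->mServeAsXhtml = true; (hypothetical
# member).  Because the flag lives on the ParserOutput, it survives the
# parser cache, and no skin needs to be patched.
function wfXhtmlFromParserOutput( &$out, $parserOutput ) {
    global $wgMimeType;
    if ( !empty( $parserOutput->mServeAsXhtml ) ) {
        $wgMimeType = 'application/xhtml+xml';
    }
    return true;
}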




Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Brion Vibber
On 3/19/09 5:15 AM, Tei wrote:
 since there's already a database, this sounds like it could be done by
 flagging edits as vandalism, and then reading the existing database
 information to extract details like IP, a diff of the change, etc.  That
 way, humans define what vandalism is, and the machine can learn the
 meaning.

 this may need a button or something, so users can report this and the
 database can flag the edit

*nod*

Part of the infrastructure for AbuseFilter was adding a tag marker 
system for edits and log entries, so filters can tag an event as 
potentially needing more review.

(This is different from, say, Flagged Revisions, which attempts to mark up
a version of a page as having a certain overall state -- it's a *page*
thing; here individual actions can be tagged based only on their own
internal changes, so similar *events* happening anywhere can be called
up in a search for human review.)


It would definitely be useful to allow readers to provide similar
feedback, much as many photo and video sharing sites allow visitors to
flag something as 'inappropriate', which puts it into a queue for admins
to look at more closely.

So far we don't have a manual tagging interface (and the tag-filtering 
views are disabled pending some query fixes), but the infrastructure is 
laid in. :)

-- brion



Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Robert Rohde
On Wed, Mar 18, 2009 at 8:00 PM, Andrew Garrett and...@epstone.net wrote:
snip

 To help a bit more with performance, I've also added a profiler within
 the interface itself. Hopefully this will encourage self-policing with
 regard to filter performance.

Based on personal observations, the self-profiling is quite noisy.
Sometimes a filter will report one value (say 5 ms), only for the same
filter to report a value 20 times larger five minutes later, and then
jump back down a few minutes after that.

Assuming that this behavior is a result of variations in the filter
workload (and not some sort of profiling bug), it would be useful if
you could increase the profiling window to better average over those
fluctuations.  Right now it is hard to tell which rules are slow or
not because the numbers aren't very stable.

-Robert Rohde
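
(One cheap way to get a longer effective window -- a hypothetical
exponentially weighted moving average over the per-filter samples,
instead of reporting the latest numbers directly:)

# Hypothetical smoothing helper: $alpha near 0 gives a long, stable
# window; near 1 it tracks the instantaneous (noisy) samples.
function wfSmoothedFilterTime( $previousAvgMs, $newSampleMs, $alpha = 0.1 ) {
    return ( 1 - $alpha ) * $previousAvgMs + $alpha * $newSampleMs;
}

# A 5 ms average barely moves on a single 100 ms spike:
# wfSmoothedFilterTime( 5.0, 100.0 ) returns 14.5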



Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Soxred93
Cobi (owner of ClueBot) and his roommate Crispy have already been
working hard to build this specific dataset, but they've been held back
by a lack of contributors. The page is here:
http://en.wikipedia.org/wiki/User:Crispy1989#New_Dataset_Contribution_Interface

X!

On Mar 19, 2009, at 8:15 AM, Tei wrote:

 On Thu, Mar 19, 2009 at 1:03 PM, Delirium delir...@hackish.org wrote:

 Brian wrote:
 This extension is very important for training machine learning
 vandalism detection bots. Recently published systems use only hundreds
 of examples of vandalism in training - not nearly enough to
 distinguish between the variety found in Wikipedia or generalize to
 new, unseen forms of vandalism. A large set of human created rules
 could be run against all previous edits in order to create a massive
 vandalism dataset.
 As a machine-learning person, this seems like a somewhat problematic
 idea--- generating training examples *from a rule set* and then
 learning on them is just a very roundabout way of reconstructing that
 rule set. What you really want is a large dataset of human-labeled
 examples of vandalism / non-vandalism that *can't* currently be
 distinguished reliably by rules, so you can throw a machine-learning
 algorithm at the problem of trying to come up with some.


 since there's already a database, this sounds like it could be done by
 flagging edits as vandalism, and then reading the existing database
 information to extract details like IP, a diff of the change, etc.  That
 way, humans define what vandalism is, and the machine can learn the
 meaning.

 this may need a button or something, so users can report this and the
 database can flag the edit


 -- 
 --
 End of the message.



Re: [Wikitech-l] Serving as xhtml+xml

2009-03-19 Thread lee worden

This has been done before, for instance in the ASCIIMath4Wiki extension [2].
 I don't want to change the Content-type unconditionally, though, only some
of the time, so that we can serve texvc-style images to browsers or users
that don't like the modified content type.


Note that this will interfere with any kind of HTML caching, such as
Squid or file cache, and with the parser cache as well.  It won't work
correctly in most well-configured MediaWiki installs unless you make
sure to fragment the parser cache appropriately, at a minimum.


The patch seems to work with the simple caching on the system I'm using to
test.  It changes more than just the Content-type - the DOCTYPE is
different, there's an <?xml ... ?> declaration before the document, and a
'Vary: Accept' header is sent.


I got all that from another author.  I'll look further into how to do it
without patches.  It's true I can change the Content-type by setting
$wgMimeType dynamically, but that only works on the first hit; cached
pages arrive as text/html.
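
(The cached-as-text/html behaviour is what the 'Vary: Accept' header is
meant to address at the HTTP layer -- a plain-PHP sketch of the
negotiation; MediaWiki's own parser and file caches would still need
fragmenting, as Aryeh notes:)

# Pick the content type per request and tell HTTP caches (e.g. Squid)
# that the response depends on the Accept header; without Vary, the
# first variant cached is the one everybody gets.
$accept = isset( $_SERVER['HTTP_ACCEPT'] ) ? $_SERVER['HTTP_ACCEPT'] : '';
if ( strpos( $accept, 'application/xhtml+xml' ) !== false ) {
    header( 'Content-Type: application/xhtml+xml; charset=UTF-8' );
} else {
    header( 'Content-Type: text/html; charset=UTF-8' );
}
header( 'Vary: Accept' );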

Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Brian
I presented a talk at Wikimania 2007 that espoused the virtues of
combining human measures of content with automatically determined
measures in order to generalize to unseen instances. Unfortunately all
those Wikimania talks seem to have been lost. It was related to this
article on predicting the quality ratings provided by the Wikipedia
Editorial Team:

Rassbach, L., Pincock, T., Mingus, B. (2007). Exploring the
Feasibility of Automatically Rating Online Article Quality.
http://upload.wikimedia.org/wikipedia/wikimania2007/d/d3/RassbachPincockMingus07.pdf

Delirium, you do make it sound as if merely having the tagged dataset
solves the entire problem. But there are really multiple problems. One
is learning to classify what you have been told is in the dataset
(e.g., that all instances matching a rule in the edit history *really
are* vandalism). Another is learning new reasons that an edit is
vandalism, based on all the other occurrences of vandalism and
non-vandalism and a sophisticated pre-parse of all the content that
breaks it down into natural language features.  Finally, you then wish
to use this system to bootstrap a vandalism detection system that can
generalize to entirely new instances of vandalism.

The primary way of doing this is to use positive and *negative*
examples of vandalism in conjunction with their features. A good set
of example features is an article's or an edit's conformance with the
Wikipedia Manual of Style. I never implemented the entire MoS, but I
did do quite a bit of it, and it is quite indicative of quality.

Generally speaking, it is not true that you can only draw conclusions
about what is immediately available in your dataset. It is true that,
with the exception of people, machine learning systems struggle with
generalization.

On Thu, Mar 19, 2009 at 6:03 AM, Delirium delir...@hackish.org wrote:
 Brian wrote:
 This extension is very important for training machine learning
 vandalism detection bots. Recently published systems use only hundreds
 of examples of vandalism in training - not nearly enough to
 distinguish between the variety found in Wikipedia or generalize to
 new, unseen forms of vandalism. A large set of human created rules
 could be run against all previous edits in order to create a massive
 vandalism dataset.
 As a machine-learning person, this seems like a somewhat problematic
 idea--- generating training examples *from a rule set* and then learning
 on them is just a very roundabout way of reconstructing that rule set.
 What you really want is a large dataset of human-labeled examples of
 vandalism / non-vandalism that *can't* currently be distinguished
 reliably by rules, so you can throw a machine-learning algorithm at the
 problem of trying to come up with some.

 -Mark






Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Delirium
Brian wrote:
 Delirium, you do make it sound as if merely having the tagged dataset
 solves the entire problem. But there are really multiple problems. One
 is learning to classify what you have been told is in the dataset
 (e.g., that all instances matching a rule in the edit history *really
 are* vandalism). Another is learning new reasons that an edit is
 vandalism, based on all the other occurrences of vandalism and
 non-vandalism and a sophisticated pre-parse of all the content that
 breaks it down into natural language features.  Finally, you then wish
 to use this system to bootstrap a vandalism detection system that can
 generalize to entirely new instances of vandalism.
 
 Generally speaking, it is not true that you can only draw conclusions
 about what is immediately available in your dataset. It is true that,
 with the exception of people, machine learning systems struggle with
 generalization.

My point is mainly that using the *results* of an automated rule system
as *input* to a machine-learning algorithm won't constitute training on
vandalism, but on what the current rule set considers vandalism. I
don't see a particularly good reason to find new reasons an edit is
vandalism for edits that we already correctly predict. What we want is
new discriminators for edits we *don't* correctly predict. And for
those, you can't use the labels given by the current rules as the
training data: if the current rule set produces false positives,
those are now positives in your training set; and if the rule set has
false negatives, those are now negatives in your training set.

I suppose it could be used for proposing hypotheses to human
discriminators. For example, you can propose new feature X if you find
that 95% of the time the existing rule set flags edits with feature X as
vandalism, and by human inspection determine that the remaining 5% were
false negatives, so feature X should actually become a new "this is
vandalism" feature. But you need that human inspection--- you can't
automatically discriminate between rules that improve the filter set's
performance and rules that decrease it if your labeled data set is the
one with the mistakes in it.

-Mark
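
(The review loop Mark describes, as a hypothetical sketch -- all names
are mine; the point is that edits disagreeing with the rule labels go to
humans rather than straight into the training set:)

# For a candidate feature, measure agreement with the existing rule set
# and queue the disagreeing edits for human review.
function wfEvaluateCandidateFeature( array $edits, $featureFn ) {
    $flagged = 0;
    $withFeature = 0;
    $needsHumanReview = array();
    foreach ( $edits as $edit ) {
        if ( !call_user_func( $featureFn, $edit ) ) {
            continue;  # feature absent; irrelevant to this hypothesis
        }
        $withFeature++;
        if ( $edit['flaggedByRules'] ) {
            $flagged++;
        } else {
            # a possible false negative of the current rule set
            $needsHumanReview[] = $edit;
        }
    }
    $agreement = $withFeature ? $flagged / $withFeature : 0.0;
    return array( $agreement, $needsHumanReview );
}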



Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Delirium
Brian wrote:
 I just wanted to be really clear about what I mean as a specific
 counter-example to this just being an example of reconstructing that
 rule set. Suppose you use the AbuseFilter rules on the entire history
 of the wiki in order to generate a dataset of positive and negative
 examples of vandalism edits. You should then *throw the rules away*
 and attempt to discover features that separate the vandalism into
 classes correctly, more or less in the blind.

That's precisely the case where you're attempting to reconstruct the
original rule set (or some work-alike). If you had positive and negative
examples that were actually known good examples of edits that really
are vandalism, and really aren't vandalism, then yes, you could turn
loose an algorithm to generalize over them to discover a discriminator
between the "is vandalism" and "isn't vandalism" classes. But if your
labels are from the output of the existing AbuseFilter, then your
training classes are really "is flagged by the AbuseFilter" and "is not
flagged by the AbuseFilter", and any machine-learning algorithm will try
to generalize the examples in a way that discriminates *those* classes.
To the extent the AbuseFilter actually does flag vandalism accurately,
you'll learn a concept approximating that of vandalism. But to the
extent it doesn't (e.g. if it systematically mis-labels certain kinds of
edits), you'll learn the same flaws.

That might not be useless--- you might recover a more concise rule set 
that replicates the original performance. But if your training data is 
the output of the previous rule set, you aren't going to be able to 
*improve* on its performance without some additional information (or 
built-in inductive bias).

-Mark



Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Brian
Ultimately we need a system that integrates information from multiple
sources, such as WikiTrust, AbuseFilter and the Wikipedia Editorial
Team.

A general point - there is a *lot* of information contained in edits
that AbuseFilter cannot practically characterize due to the complexity
of language and the subtlety of certain types of abuse. A system with
access to natural language features (and wikitext features) could
theoretically detect them. My quality research group considered
including features relating to the [[Thematic relation]]s found in an
article (we have access to a thematic role parser), which could
potentially be used to detect bad writing - itself indicative of an
edit containing vandalism.

On Thu, Mar 19, 2009 at 3:17 PM, Delirium delir...@hackish.org wrote:
 But if your training data is
 the output of the previous rule set, you aren't going to be able to
 *improve* on its performance without some additional information (or
 built-in inductive bias).

 -Mark





Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Aryeh Gregor
On Thu, Mar 19, 2009 at 5:26 PM, Brian brian.min...@colorado.edu wrote:
 A general point - there is a *lot* of information contained in edits
 that AbuseFilter cannot practically characterize due to the complexity
 of language and the subtlety of certain types of abuse. A system with
 access to natural language features (and wikitext features) could
 theoretically detect them.

And how poorly would *that* perform, if the current AbuseFilter
already has performance problems?  :)


Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread David Gerard
2009/3/19 Aryeh Gregor simetrical+wikil...@gmail.com:
 On Thu, Mar 19, 2009 at 5:26 PM, Brian brian.min...@colorado.edu wrote:

 A general point - there is a *lot* of information contained in edits
 that AbuseFilter cannot practically characterize due to the complexity
 of language and the subtlety of certain types of abuse. A system with
 access to natural language features (and wikitext features) could
 theoretically detect them.

 And how poorly would *that* perform, if the current AbuseFilter
 already has performance problems?  :)


Research box, toolserver cluster! :-D


- d.


Re: [Wikitech-l] Abuse Filter extension activated on English Wikipedia

2009-03-19 Thread Platonides
Andrew Garrett wrote:
 On Thu, Mar 19, 2009 at 11:54 AM, Platonides platoni...@gmail.com wrote:
 PS: Why isn't there a link to Special:AbuseFilter/history/$id on the
 filter view?
 
 There is.

Oops. I was looking for it on the top bar, not at the bottom. I stand
corrected.

