[Wikitech-l] URLs that aren't cool...

2009-07-28 Thread Paul Houle
I've been looking at the id structure of dbpedia and wikipedia and 
finally found an example where case sensitivity issues really bite.

Cases like this with a "redirect" are a little obnoxious,

http://en.wikipedia.org/wiki/New_York_City
http://en.wikipedia.org/wiki/New_york_city

largely because there isn't a redirect...  The same page gets displayed 
at each URL. (OK, the "redirect" has a little extra text at the top 
saying that it's a redirect.)

dbpedia has separate resource pages for the above cases,  so at least 
it's explaining the situation clearly -- reasoning systems that work 
with dbpedia need to be able to read this.

Here's a case that's just plain bad...

http://en.wikipedia.org/wiki/Direct_instruction
http://en.wikipedia.org/wiki/Direct_Instruction

Last time I looked there were about 10,000 Wikipedia URLs that varied 
only by case.  In this particular one,  it's two articles about the same 
topic,  but there could be some cases where the two articles are about 
something different.
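
A minimal sketch of how such case-only duplicates could be found, assuming a plain list of page titles is already at hand. The function name and the sample titles are illustrative only; this is not how dbpedia or MediaWiki does it.

```python
from collections import defaultdict


def case_collisions(titles):
    """Group titles that differ only by letter case."""
    groups = defaultdict(set)
    for title in titles:
        # str.casefold() applies Unicode default (language-neutral) case folding.
        groups[title.casefold()].add(title)
    # Keep only the folded keys shared by more than one distinct title.
    return {key: sorted(variants)
            for key, variants in groups.items()
            if len(variants) > 1}


# Hypothetical sample input.
titles = ["New_York_City", "Direct_instruction", "Direct_Instruction", "Paris"]
collisions = case_collisions(titles)
print(collisions)
```

Whether two colliding titles are the same topic or different ones still has to be decided by a person; the grouping only finds the candidates.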

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] URLs that aren't cool...

2009-07-28 Thread Aryeh Gregor
On Tue, Jul 28, 2009 at 11:53 AM, Paul Houle wrote:
> I've been looking at the id structure of dbpedia and wikipedia and
> finally found an example where case sensitivity issues really bite.

We should keep in mind that case isn't so clear-cut if you move away
from English, though -- is "groß" the same as "GROSS" and thus the
same as "gross"?  How about languages that don't even have bijections
between uppercase and lowercase if you stick to the same dialect?
(I'm pretty sure there are some; don't some languages strip diacritics
from uppercase letters?)  There's probably some Unicode standard on
normalization with respect to case, but it's not actually so simple in
an international context.
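
As an illustration of the "groß" question, here is a small sketch using Python's built-in string methods, which follow Unicode's default, language-independent case mappings. It only shows what that one default does; it does not settle what a wiki should treat as equal.

```python
samples = ["groß", "GROSS", "gross"]

# Default Unicode case folding maps all three spellings to the same string.
folded = [s.casefold() for s in samples]

# Uppercasing is not reversible: "ß" uppercases to "SS", which lowercases to "ss".
round_trip = "groß".upper().lower()

# Plain lowercasing leaves "ß" alone, so lower() and casefold() disagree here.
lowered = "groß".lower()

print(folded, round_trip, lowered)
```

So "is groß the same as gross" depends on which operation is picked as the comparison key: casefold() says yes, lower() says no.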

That said, I think case-insensitivity would be a good thing to support
in the long run, optionally, and that it would probably be suitable
for all Wikipedias.  Or at least almost all, if there are languages
out there where case insensitivity is a real headache -- hopefully
not, since most languages don't have letter case at all.  At any rate
it would be good on enwiki.

But it would require a lot of tedious and error-prone conversion of
old code.  Everything tends to assume that a)
$title->getPrefixedText() is what should be displayed to the user, but
b) two titles are equal if and only if their
$title->getPrefixedText()s are equal.  Likewise for
$title->getPrefixedDbKey().  Those would need to be systematically and
thoroughly fixed.  We'd also have to add a field to the page table or
such to store the normalized form of the title, and fiddle with the
indexes appropriately, and update all other tables to use the
normalized form.  A lot of work.
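
To make the two assumptions concrete, here is a rough sketch in Python rather than MediaWiki's PHP. The Title class, its methods, and same_page() are hypothetical stand-ins, not MediaWiki's actual Title API; the point is only that the displayed form and the comparison key become separate things.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Title:
    """Hypothetical stand-in for a page title; not MediaWiki's Title class."""
    text: str  # display form, e.g. "New York City"

    def db_key(self):
        # Storage form: spaces replaced by underscores.
        return self.text.replace(" ", "_")

    def normalized_key(self):
        # Assumed comparison key: the storage form, case-folded.
        return self.db_key().casefold()


def same_page(a, b):
    # Compare by normalized key instead of by displayed text.
    return a.normalized_key() == b.normalized_key()


a = Title("New York City")
b = Title("New york city")
print(a.text == b.text, same_page(a, b))
```

Every place that currently compares display text or DB keys directly would have to go through something like same_page(), which is where the tedious conversion comes from.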

(But at least we could get rid of the silly Text/DbKey distinction
while we're doing this.  I've heard recent MySQL versions actually
support storage of ASCII space characters in text fields!)


Re: [Wikitech-l] URLs that aren't cool...

2009-07-28 Thread Mark Williamson
Case insensitivity shouldn't be a problem for any language, as long as
you do it properly.

Turkish and other languages using dotless i, for example, will need a
special rule - Turkish lowercase dotted i capitalizes to a capital
dotted İ while lowercase undotted ı capitalizes to regular undotted I.
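
A sketch of what such a special rule might look like, assuming the caller already knows the text is Turkish. Python's str.upper() and str.lower() are locale-independent, so the two Turkish pairs are handled by hand here; the function names are made up for illustration.

```python
def turkish_upper(text):
    # Handle the two Turkish-specific pairs first (i -> İ, ı -> I),
    # then let the default mapping uppercase everything else.
    return text.replace("i", "İ").replace("ı", "I").upper()


def turkish_lower(text):
    # Inverse direction: İ -> i and I -> ı before the default lowercasing.
    return text.replace("İ", "i").replace("I", "ı").lower()


print(turkish_upper("istanbul"))  # Turkish rule keeps the dot
print("istanbul".upper())         # default rule drops it
print(turkish_lower("IĞDIR"))
```

The same input gives different results under the Turkish and the default rules, so the language has to be known before the title is folded.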

skype: node.ue



Re: [Wikitech-l] URLs that aren't cool...

2009-07-28 Thread Aryeh Gregor
On Tue, Jul 28, 2009 at 12:52 PM, Mark Williamson wrote:
> Case insensitivity shouldn't be a problem for any language, as long as
> you do it properly.
>
> Turkish and other languages using dotless i, for example, will need a
> special rule - Turkish lowercase dotted i capitalizes to a capital
> dotted İ while lowercase undotted ı capitalizes to regular undotted I.

And so what if a wiki is multilingual and you don't know what language
the page name is in?  What if a Turkish wiki contains some English
page names as loan words, for instance?


Re: [Wikitech-l] URLs that aren't cool...

2009-07-28 Thread Brion Vibber
On 7/28/09 10:04 AM, Aryeh Gregor wrote:
> On Tue, Jul 28, 2009 at 12:52 PM, Mark Williamson  wrote:
>> Case insensitivity shouldn't be a problem for any language, as long as
>> you do it properly.
>>
>> Turkish and other languages using dotless i, for example, will need a
>> special rule - Turkish lowercase dotted i capitalizes to a capital
>> dotted İ while lowercase undotted ı capitalizes to regular undotted I.
>
> And so what if a wiki is multilingual and you don't know what language
> the page name is in?  What if a Turkish wiki contains some English
> page names as loan words, for instance?

Indeed, good handling of case-insensitive matchings would be a big win 
for human usability, but it's not easy to get right in all cases.

The main problems are:

1) Conflicts when we really do consider something separate, but the case 
folding rules match them together

2) Language-specific case folding rules in a multilingual environment

Turkish I with/without dot and German ß not always matching to SS are 
the primary examples off the top of my head. Also, some languages tend 
to drop accent markers in capital form (eg, Spanish). What can or should 
we do here?


A nearer-term help would be to go ahead and implement what we talked 
about a billion years ago but never got around to -- a decent "did you 
mean X?" message to display when you go to an empty page but there's 
something similar nearby.

If it's at least trivial to click through from [[New york city]] to 
[[New York City]], that's better than having to search for it anew.

Of course we have some case-insensitive matching for near-matches on 
"go" searches... we could pull from that easily. [Note this is done via 
TitleKey for full case-insensitivity at present... and it probably 
doesn't handle Turkish correctly yet.]

-- brion


Re: [Wikitech-l] URLs that aren't cool...

2009-07-28 Thread Mark Williamson
Since when does Spanish drop accent markers in capital form? If you
have seen anybody do this, it is just a misspelling. For example:
http://es.wikipedia.org/wiki/Ópera or
http://es.wikipedia.org/wiki/África or
http://es.wikipedia.org/wiki/Océano_Índico

I have been told that Greek drops accents in capital form but this may
not be true. Other than that, though, I am not acquainted with any
language that does such a thing (but of course that doesn't mean none
exist).

Mark

skype: node.ue





Re: [Wikitech-l] URLs that aren't cool...

2009-07-28 Thread Tei
The related Wikipedia article says that it was an urban legend:

http://es.wikipedia.org/wiki/Acentuaci%C3%B3n_de_las_may%C3%BAsculas

So it is wrong to drop these accents.





-- 
--
ℱin del ℳensaje.


Re: [Wikitech-l] URLs that aren't cool...

2009-07-28 Thread Brion Vibber
On 7/28/09 10:30 AM, Tei wrote:
> The related Wikipedia article says that it was an urban legend:
>
> http://es.wikipedia.org/wiki/Acentuaci%C3%B3n_de_las_may%C3%BAsculas

Dang! I've been taken in again by exposure to real-world practice 
instead of what's correct. ;)

(In any case, handling that case nicely is wise too.)

-- brion



Re: [Wikitech-l] URLs that aren't cool...

2009-07-28 Thread Roan Kattouw
2009/7/28 Mark Williamson :
> Since when does Spanish drop accent markers in capital form? If you
> have seen anybody do this, it is just a misspelling. For example:
> http://es.wikipedia.org/wiki/Ópera or
> http://es.wikipedia.org/wiki/África or
> http://es.wikipedia.org/wiki/Océano_Índico
>
> I have been told that Greek drops accents in capital form but this may
> not be true. Other than that, though, I am not acquainted with any
> language that does such a thing (but of course that doesn't mean none
> exist).
>
Frisian (fy) does drop accents in capitals, FWIW.

Roan Kattouw (Catrope)



Re: [Wikitech-l] URLs that aren't cool...

2009-07-28 Thread Helder Geovane Gomes de Lima
2009/7/28 Brion Vibber :
> A nearer-term help would be to go ahead and implement what we talked
> about a billion years ago but never got around to -- a decent "did you
> mean X?" message to display when you go to an empty page but there's
> something similar nearby.
>
> If it's at least trivial to click through from [[New york city]] to
> [[New York City]], that's better than having to search for it anew.

I think this would be really good to implement, since it would also
help us when creating and following interwiki links (see also
point 3, which I was talking about here:
http://lists.wikimedia.org/pipermail/wikitech-l/2009-July/044007.html)

Helder



Re: [Wikitech-l] URLs that aren't cool...

2009-07-28 Thread Platonides
Brion Vibber wrote:
> On 7/28/09 10:30 AM, Tei wrote:
>> The related Wikipedia article says that it was an urban legend:
>>
>> http://es.wikipedia.org/wiki/Acentuaci%C3%B3n_de_las_may%C3%BAsculas
> 
> Dang! I've been taken in again by exposure to real-world practice 
> instead of what's correct. ;)

Once upon a time, mechanical typewriters weren't able to properly
accent capital letters.

> (In any case, handling that case nicely is wise too.)
> 
> -- brion

On the Spanish Wikipedia there are some bots creating redirects from titles
lowercased with accents dropped, so that the article shows up when
searching without the exact spelling.
I don't really like it, but where the software doesn't work, users get
inventive.




Re: [Wikitech-l] URLs that aren't cool...

2009-07-28 Thread Nikola Smolenski
On Tuesday 28 July 2009 19:16:22 Brion Vibber wrote:
> Indeed, good handling of case-insensitive matchings would be a big win
> for human usability, but it's not easy to get right in all cases.
>
> The main problems are:
>
> 1) Conflicts when we really do consider something separate, but the case
> folding rules match them together
>
> 2) Language-specific case folding rules in a multilingual environment
>
> Turkish I with/without dot and German ß not always matching to SS are
> the primary examples off the top of my head. Also, some languages tend
> to drop accent markers in capital form (eg, Spanish). What can or should
> we do here?

Similar to an automatic redirect, we could build an automatic disambiguation 
page. For example, someone on srwiki going to [[Dj]] would get:

Did you mean:

* [[Đ]]
* [[DJ]]
* [[D.J.]]

> A nearer-term help would be to go ahead and implement what we talked
> about a billion years ago but never got around to -- a decent "did you
> mean X?" message to display when you go to an empty page but there's
> something similar nearby.

Was thinking a lot about this. The best solution I thought of would be to add 
a column to page table "page_title_canonical". When an article is 
created/moved, this canonical title is built from the real title. When an 
article is looked up, if there is no match in page_title, build the canonical 
title from the URL and see if there is a match in page_title_canonical and if 
yes, display "did you mean X" or even go there automatically as if from a 
redirect (if there is only one match) or "did you mean *X, *X1" if there are 
multiple matches.
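
The lookup flow described above can be sketched roughly as follows. The in-memory set stands in for the page table, the canonical argument stands in for whatever function fills page_title_canonical, and the result labels are invented for the example. A real implementation would query an indexed column instead of scanning.

```python
def resolve(requested, pages, canonical):
    """Decide what to do for a requested title.

    pages     -- set of existing exact titles (stand-in for page_title)
    canonical -- function mapping a title to its canonical form
    """
    if requested in pages:
        return ("show", [requested])
    key = canonical(requested)
    # Stand-in for "SELECT ... WHERE page_title_canonical = key".
    matches = sorted(t for t in pages if canonical(t) == key)
    if not matches:
        return ("missing", [])
    if len(matches) == 1:
        return ("redirect", matches)      # or show "did you mean X"
    return ("did_you_mean", matches)      # "did you mean *X, *X1"


# Hypothetical data; str.casefold is only a placeholder canonical form.
pages = {"New_York_City", "Direct_instruction", "Direct_Instruction"}
print(resolve("New_york_city", pages, str.casefold))
print(resolve("direct_instruction", pages, str.casefold))
```
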

This canonical title would be made like this:
* Remove disambiguator from the title if it exists
* Remove punctuation and the like
* Transliterate the title to Latin alphabet
* Transliterate to pure ASCII
* Lowercase
* Order the words alphabetically
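
A rough sketch of these steps using only Python's standard library. Two things are assumptions of the sketch: the disambiguator is taken to be a trailing parenthetical, and the transliteration-to-Latin step is skipped because the standard library has nothing for it. The ASCII step just strips combining marks and drops whatever is left over.

```python
import re
import unicodedata


def canonical_title(title):
    title = title.replace("_", " ")
    # 1. Remove a trailing parenthetical disambiguator, e.g. "Mercury (planet)".
    title = re.sub(r"\s*\([^()]*\)\s*$", "", title)
    # 2. Remove punctuation (all Unicode "P*" categories).
    title = "".join(ch for ch in title
                    if not unicodedata.category(ch).startswith("P"))
    # 3. Transliteration to the Latin alphabet is NOT attempted here.
    # 4. To ASCII: decompose accented letters, then drop non-ASCII code points.
    title = unicodedata.normalize("NFKD", title)
    title = title.encode("ascii", "ignore").decode("ascii")
    # 5. Lowercase, then 6. order the words alphabetically.
    return " ".join(sorted(title.lower().split()))


print(canonical_title("Océano_Índico"))
print(canonical_title("D.J."), canonical_title("DJ"), canonical_title("Dj"))
print(canonical_title("Dog bites man") == canonical_title("Man bites dog"))
```

In this sketch [[Đ]] canonicalizes to an empty string rather than to "dj", because mapping Đ to Dj needs a language-specific transliteration table. The word-ordering step also makes titles that differ only in word order collide.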

What could possibly go wrong?

Note that this would also be very helpful for non-Latin wikis - people often 
want Latin-only URLs since non-Latin URLs are too long. I also recall a 
recent discussion about a wiki in a language with nonstandard spelling (nds?) 
where they use bots to create dozens or even hundreds of redirects to an 
article title - this would also make that unneeded.


Re: [Wikitech-l] URLs that aren't cool...

2009-07-28 Thread Andrew Dunbar
2009/7/29 Nikola Smolenski :
> Was thinking a lot about this. The best solution I thought of would be to add
> a column to page table "page_title_canonical". When an article is
> created/moved, this canonical title is built from the real title. When an
> article is looked up, if there is no match in page_title, build the canonical
> title from the URL and see if there is a match in page_title_canonical and if
> yes, display "did you mean X" or even go there automatically as if from a
> redirect (if there is only one match) or "did you mean *X, *X1" if there are
> multiple matches.

I actually did make this extension a couple of years ago, intended for the
English Wiktionary where we manually add an {{also}} template to the
top of pages to link to other pages whose titles differ in minor ways
such as capitalization, hyphenation, apostrophes, accents, periods. I
think I had it working with Hebrew and Arabic and a few other exotic
languages besides.

It was running on Brion's test box for some time but getting little
interest. It's been offline and unmaintained since Brion moved and I
did a couple of overseas trips.

http://www.mediawiki.org/wiki/Extension:DidYouMean
http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/DidYouMean/
https://bugzilla.wikimedia.org/show_bug.cgi?id=8648

It hooked all ways to create, delete, or move a page to maintain a
separate table of normalized page titles, which it consulted when
displaying a page.
The code for display was designed for compatibility with the
then-current Wiktionary templates and would need to be implemented in
a more general way.
A core version would probably just add a field to the existing table.
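
For readers who want the general shape of that approach, here is a minimal sketch in Python. It is not the DidYouMean extension's code: the class, the method names, and the normalization rule are all hypothetical, and a dictionary stands in for the separate database table.

```python
class NormalizedTitleTable:
    """Keeps a normalized-title index in sync with page create/delete/move."""

    def __init__(self, normalize):
        self._normalize = normalize
        self._by_key = {}  # normalized key -> set of real titles

    def on_create(self, title):
        self._by_key.setdefault(self._normalize(title), set()).add(title)

    def on_delete(self, title):
        key = self._normalize(title)
        variants = self._by_key.get(key)
        if variants is None:
            return
        variants.discard(title)
        if not variants:
            del self._by_key[key]

    def on_move(self, old_title, new_title):
        self.on_delete(old_title)
        self.on_create(new_title)

    def similar_to(self, title):
        # Other existing titles sharing this title's normalized key.
        return sorted(self._by_key.get(self._normalize(title), set()) - {title})


def simple_normalize(title):
    # Assumed rule: ignore case, hyphens, apostrophes and periods.
    return title.casefold().replace("-", "").replace("'", "").replace(".", "")


table = NormalizedTitleTable(simple_normalize)
for page in ["US", "us", "U.S."]:
    table.on_create(page)
print(table.similar_to("us"))
```
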

Andrew Dunbar (hippietrail)


-- 
http://wiktionarydev.leuksman.com http://linguaphile.sf.net


Re: [Wikitech-l] URLs that aren't cool...

2009-07-29 Thread Tim Starling
Aryeh Gregor wrote:
> (But at least we could get rid of the silly Text/DbKey distinction
> while we're doing this.  I've heard recent MySQL versions actually
> support storage of ASCII space characters in text fields!)

Apparently this poor design choice was made due to some bogus concept
of backwards compatibility with UseMod, or some similarly crappy wiki
engine that stores articles in the filesystem, with filenames chosen
to avoid distressing shellscript fanboys.

-- Tim Starling

