Re: [Wikitech-l] search=steven+tyler gets Steven_tyler

Aryeh Gregor Sun, 15 May 2011 08:02:35 -0700

On Fri, May 13, 2011 at 7:57 PM, Daniel Friesen
<li...@nadir-seen-fire.com> wrote:
> Doesn't look that bad...
> - Some arcane maintenance scripts.
> - Some .js that can't interact with Title working with urls.
> - The expected User, Title, Parser, file related, etc... core api stuff
> that's easy to tweak.
> - Some hardcoded stuff for namespaces which could be improved, but
> actually isn't all that applicable to what we're trying to fix.
> - Some special pages cleaning up inputs where we might want to provide
> something inside Title for that.

Except that there are who knows how many other places in the code that
make such assumptions but aren't so easily found by searching.

On Fri, May 13, 2011 at 11:33 PM, Andrew Dunbar <hippytr...@gmail.com> wrote:
> I'm almost positive Azeri has the same dotless i issue and perhaps
> some of the other Turkic languages of Central Asia. One solution is to
> do accent/diacritic normalization too as part of the canonicalization.

The dotless-i issue affects "Turkic (Turkish/Azerbaijani)" text,
according to <http://userguide.icu-project.org/transforms/casemappings>.
This is a well-studied issue with existing standards, and we're not
going to do better than what the Unicode Consortium has come up with.

You cannot fix the problem by doing accent/diacritic normalization.
"i" and "I" are the same letter in English but different letters in
Turkish.  You cannot get around that.  We'd need to have a separate
case-folding algorithm for Turkish wikis, or make them use one that's
incorrect for their language.
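
To make that concrete, here is a rough Python sketch (not MediaWiki code;
the function names are made up for illustration). It assumes Python's
built-in str.casefold() as the language-neutral folding, and a
hand-rolled Turkic variant that maps I to dotless ı and İ to i before
lowercasing. It also shows that stripping combining marks never turns
dotless ı into i, and that it merges İ with I, which are different
letters in Turkish.

```python
import unicodedata

DOTLESS_I = "\u0131"        # ı, LATIN SMALL LETTER DOTLESS I
CAPITAL_DOTTED_I = "\u0130"  # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE


def default_fold(s: str) -> str:
    """Language-neutral case folding (what a non-Turkic wiki would want)."""
    return s.casefold()


def turkic_lower(s: str) -> str:
    """Hypothetical Turkic-aware lowercasing: I -> ı, İ -> i, then the rest."""
    return s.replace("I", DOTLESS_I).replace(CAPITAL_DOTTED_I, "i").lower()


def strip_marks(s: str) -> str:
    """Decompose to NFD and drop combining marks (accent stripping)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


if __name__ == "__main__":
    word = "ISPARTA"
    print(default_fold(word))   # English-style folding gives "isparta"
    print(turkic_lower(word))   # Turkic folding gives "ısparta"

    # Accent stripping leaves dotless ı untouched, so it never matches "i",
    # and it collapses İ to I, conflating two distinct Turkish letters.
    print(strip_marks(DOTLESS_I) == "i")
    print(strip_marks(CAPITAL_DOTTED_I) == "I")
```

The two folding functions disagree on plain ASCII "I", so no single
normalization step applied afterwards can satisfy both English and
Turkish wikis at once.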

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
