On Tuesday 17 August 2010 20:37:44 Aryeh Gregor wrote:
> The code is currently enabled in trunk and is still awaiting review.
> It's basically complete, but there are some issues left:
>
> * What sortkey algorithm to use?  Currently it just ASCII uppercases
> the words, which is okay for a proof-of-concept but doesn't actually
> solve bug 164.

For some time now, I have been thinking about a stupidly simple solution:

php -r 'for($i = 0; $i < 65536; $i++) { if($i >= 0xD800 && $i <= 0xDFFF)
continue; echo pack("nx", $i); echo "\n"; }' | iconv -f ucs-2be -t utf8 | sort |
php -r '$i = 0; foreach(file("php://stdin") as $v) { echo var_export(substr($v,
0, -1)) . " => \"" . str_pad(base_convert($i, 10, 36), 4, "0", STR_PAD_LEFT) .
"\",\n"; $i++; }'

This, more or less, should:

- Print every Unicode (UCS-2 only) character on its own line
- Sort that according to the current locale
- Print a PHP array that maps each Unicode character (UTF-8 encoded) to a
four-digit base36 number giving its position in the sorted order

If a UTF-8 string is encoded with this array, the resulting strings should
sort exactly as they would in the locale, using nothing more than plain ASCII
sorting. Or am I missing something big? (Apart from context sensitivity, but
that occurs relatively rarely, and this should still be better than what we
have now.)
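
To make the encoding step concrete, here is a minimal sketch of how such an array might be applied. The table entries, the function name makeSortkey() and the 'zzzz' fallback for unmapped characters are all made up for illustration; a real table would come from the pipeline above.

```php
<?php
// Hypothetical excerpt of the generated table: character => 4-digit base36 rank.
// These ranks are invented for the example, not real pipeline output.
$sortkeyTable = array(
    'a' => '0001',
    'A' => '0002',
    'á' => '0003',
    'b' => '0004',
    'z' => '0005',
);

// Build an ASCII sortkey by replacing every character with its rank.
// Assumes $text is valid UTF-8 (and that this file is saved as UTF-8).
function makeSortkey($text, array $table) {
    // Split the UTF-8 string into individual characters.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $key = '';
    foreach ($chars as $ch) {
        // Assumption: characters missing from the table sort last as 'zzzz',
        // which is larger than any rank the table can contain.
        $key .= isset($table[$ch]) ? $table[$ch] : 'zzzz';
    }
    return $key;
}

// Sort some words by comparing their sortkeys byte by byte.
$words = array('zab', 'áb', 'ab', 'b');
usort($words, function ($x, $y) use ($sortkeyTable) {
    return strcmp(makeSortkey($x, $sortkeyTable), makeSortkey($y, $sortkeyTable));
});
echo implode(' ', $words), "\n";
```

Because every character becomes exactly four ASCII characters, comparing two keys byte by byte compares the original strings character by character in table order. With this toy table, "áb" lands between "ab" and "b", whereas a plain byte-wise sort of the UTF-8 strings would put it after "zab".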

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
