Re: [PHP-DEV] Unicode support

Aleksey Tulinov Tue, 14 Oct 2014 15:57:50 -0700

On 15/10/14 00:04, Rowan Collins wrote:

Rowan,

Back to combining characters, i dig the idea of introducing graphemes,
but i think a French person would write the word "noël" using a
precomposed character. I'm using the French keyboard at
https://translate.google.com/#fr/. "ë" is Shift + "^", then "e", and it
produces precomposed U+00EB.


You don't even need to rely on the input method using the combined form,
Unicode includes an algorithm for normalisation to this form (where such
composites are coded), known as NFC.
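
To illustrate the point about NFC, here is a minimal sketch using Python's
standard unicodedata module (the choice of Python and the variable names
are assumptions made for illustration, not anything proposed in this
thread). It shows the decomposed spelling of "noël" being composed into
the same code points a French keyboard would produce.

```python
import unicodedata

# 'e' followed by U+0308 COMBINING DIAERESIS (decomposed spelling)
decomposed = "noe\u0308l"
# single code point U+00EB LATIN SMALL LETTER E WITH DIAERESIS
precomposed = "no\u00ebl"

# NFC composes base letter + combining mark where a composite is encoded
nfc = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(precomposed), len(nfc))  # 5 4 4
print(nfc == precomposed)                           # True
```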

The problem with NFC is that it's not only composition, but
decomposition + reordering + re-composition. I know about the NFC quick
check, but the issue is what happens if the check fails and the string
needs transformation: this would be very challenging, if not impossible,
to do while keeping the string immutable and without introducing an
internal representation of that string.
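
A small sketch of why NFC is more than composition, again assuming
Python's unicodedata module (is_normalized needs Python 3.8 or later; the
sample string is an illustrative choice, not one from this thread). The
input already contains a precomposed "ë", yet NFC still has to decompose
it, reorder the marks by combining class, and recompose a different
composite.

```python
import unicodedata

# Precomposed U+00EB followed by U+0323 COMBINING DOT BELOW
s = "\u00eb\u0323"

# The check fails, so the string would have to be transformed
print(unicodedata.is_normalized("NFC", s))  # False

# Decomposition + reordering: diaeresis (class 230) moves after
# dot below (class 220)
nfd = unicodedata.normalize("NFD", s)
print([hex(ord(c)) for c in nfd])  # ['0x65', '0x323', '0x308']

# Re-composition: 'e' + dot below compose to U+1EB9, and the
# diaeresis is left over as a separate combining mark
nfc = unicodedata.normalize("NFC", s)
print([hex(ord(c)) for c in nfc])  # ['0x1eb9', '0x308']
```

The result has the same length as the input but entirely different code
points, which is why an in-place or purely lazy approach is awkward.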

Internal representation and string modifications bring overhead which
might eventually render the implementation unusable for a range of
applications.

On the other side, language-specific characters which can be
precomposed are likely to be precomposed.

If a script doesn't have a precomposed equivalent, then this grapheme
will always be in the same decomposed form and collation will work.
Substring search will also work, because the needle will be decomposed
in the same way as the haystack.


No, it won't. You won't get false negatives as long as both strings are
normalised to the same form (whether that is NFC or NFD), but you will
get false positives. For instance, searching for the substring "e" would
not match a combined ë, but it would match an uncombined sequence with e
at its base (e.g. with two diacritics).

Normalising to NFD (fully de-composed) would at least mean that "e"
consistently matched all graphemes with "e" at their base, but is not a
lossless operation, so performing it implicitly is probably not a good
idea.
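
The search behaviour described above can be sketched as follows, assuming
Python's unicodedata module. The sample strings and the helper
contains_standalone are hypothetical, and the helper only approximates
grapheme boundaries by looking at combining classes; it is not a full
grapheme segmentation. U+0332 COMBINING LOW LINE is used because it has
no precomposed form with "e".

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

def contains_standalone(haystack: str, needle: str) -> bool:
    """True if needle occurs and is not followed by a combining mark."""
    start = haystack.find(needle)
    while start != -1:
        end = start + len(needle)
        if end == len(haystack) or unicodedata.combining(haystack[end]) == 0:
            return True
        start = haystack.find(needle, start + 1)
    return False

word = "no\u00ebl"      # precomposed e with diaeresis
underlined = "e\u0332"  # 'e' + COMBINING LOW LINE, stays decomposed in NFC

# Under NFC a plain "e" matches one accented grapheme but not the other
print("e" in nfc(word))        # False
print("e" in nfc(underlined))  # True

# Under NFD "e" matches every grapheme with "e" at its base, but the
# stored code points are no longer the ones that were supplied
print("e" in nfd(word))   # True
print(nfd(word) == word)  # False

# Rejecting matches that are followed by a combining mark avoids the
# false positive in this simple case
print(contains_standalone(nfc(underlined), "e"))  # False
print(contains_standalone("hello", "e"))          # True
```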

Good point. That's what i meant by border-line case. Could you possibly
point me to a specific example of such a false positive? I'm interested
in a well-formed UTF-8 string. I believe the "noël" test is ill-formed
UTF-8 and doesn't conform to the shortest-form requirement.
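
Whether the decomposed spelling is well-formed can be checked
mechanically. The sketch below assumes Python's strict UTF-8 codec as the
checker and uses illustrative variable names. Well-formedness and the
shortest-form rule are defined on byte sequences, so the check is applied
to the encoded bytes, with an overlong encoding shown for contrast.

```python
# Decomposed spelling: 'e' followed by U+0308 COMBINING DIAERESIS
decomposed = "noe\u0308l"
encoded = decomposed.encode("utf-8")
print(encoded)  # b'noe\xcc\x88l'

# Strict decoding succeeds and round-trips the original code points
round_trip = encoded.decode("utf-8", errors="strict")
print(round_trip == decomposed)  # True

# For contrast: an overlong (non-shortest-form) encoding of '/'
overlong = b"\xc0\xaf"
try:
    overlong.decode("utf-8", errors="strict")
    overlong_rejected = False
except UnicodeDecodeError:
    overlong_rejected = True
print(overlong_rejected)  # True
```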

It's pretty meaningless to say you support Unicode, but only the easy
bits. You might as well just tag each string with one of the pages of
ISO-8859.

As far as i'm concerned, the Unicode specification does not require
implementing all annexes or even supporting the entire character set to
be conformant. I think there are always trade-offs involved, depending
on what is more important for you.


--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
