Re: [PHP-DEV] Unicode support

Rowan Collins Tue, 14 Oct 2014 14:09:14 -0700

On 14/10/2014 20:51, Andrea Faulds wrote:

If you want length in characters, you probably need to implement your own 
algorithm, as it really depends on your specific use case.

I disagree, Unicode has very well-defined algorithms for these things, and the average PHP developer (or even PHP framework developer) is unlikely to do better.

It will, however, always produce valid UTF8 strings for output. That’s better 
than standard string functions which can mangle UTF8.

They will be valid UTF-8 sequences, but they may not be meaningful strings - you might truncate halfway along a set of combining diacritics, or worse, halfway through a Korean syllable character (3 codepoints, 1 grapheme).
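
To make the difference concrete, here is a small sketch using functions that already exist in mbstring and intl (not the API under discussion in this thread). The sample word is an arbitrary choice, written with combining accents so that one grapheme spans two code points.

```php
<?php
// Requires the mbstring and intl extensions.
// "résumé" spelled with combining acute accents (U+0301) after each "e".
$word = "re\u{0301}sume\u{0301}";

// Cutting by code point keeps the bytes valid but strands the accent:
// the result is "re", with the combining mark silently dropped.
$byCodePoint = mb_substr($word, 0, 2, 'UTF-8');

// Cutting by grapheme keeps "r" plus the whole "e + accent" cluster.
$byGrapheme = grapheme_substr($word, 0, 2);

var_dump(mb_check_encoding($byCodePoint, 'UTF-8')); // valid UTF-8 either way
var_dump($byCodePoint === "re");
var_dump($byGrapheme === "re\u{0301}");
```

Both results are well-formed UTF-8; only the grapheme-based cut preserves what a reader would call the second character.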

I may also want to say $string->getByteStringWithMaxLength('UTF-8', 20) to fit an exact 
number of graphemes into a 20-byte binary space; something that neither 
$string->substring(0, 20)->getByteString('UTF-8') nor substr( 
$string->getByteString('UTF-8'), 0, 20 ) can do.

I’m not sure quite how you’d do that.


Nor am I, that's why I want the library to do it for me! :P

More seriously, a simple algorithm is easy enough to design - serialize your abstract string into bytes one grapheme at a time, tracking the current and previous lengths. If current length exceeds the maximum, track back to the previous length and return; otherwise, continue until all graphemes are serialized.
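
A minimal sketch of that loop, assuming the input is already a UTF-8 byte string and the intl extension is loaded. The function name truncateToMaxBytes() is hypothetical, chosen for illustration, and is not part of any proposed API. Instead of appending and then tracking back, it checks the length before appending, which should be equivalent. intl's grapheme_extract() also has a GRAPHEME_EXTR_MAXBYTES mode that looks like it covers the same case directly.

```php
<?php
// Hypothetical helper, for illustration only. Requires the intl extension.
// Returns the longest prefix of $utf8 that consists of whole grapheme
// clusters and fits within $maxBytes bytes.
function truncateToMaxBytes(string $utf8, int $maxBytes): string
{
    $result = '';
    $offset = 0;
    $next   = 0;
    $total  = strlen($utf8);

    while ($offset < $total) {
        // Extract exactly one grapheme cluster starting at byte $offset;
        // $next is set to the byte offset where the following cluster begins.
        $cluster = grapheme_extract($utf8, 1, GRAPHEME_EXTR_COUNT, $offset, $next);
        if ($cluster === false || $cluster === '') {
            break; // invalid input, or nothing left to extract
        }
        // Stop before the cluster that would push us over the byte budget.
        if (strlen($result) + strlen($cluster) > $maxBytes) {
            break;
        }
        $result .= $cluster;
        $offset  = $next;
    }

    return $result;
}

// Usage: "a", then "e" + combining acute (3 bytes), then "b".
$sample = "ae\u{0301}b";
$fits3  = truncateToMaxBytes($sample, 3); // the accented cluster does not fit
$fits4  = truncateToMaxBytes($sample, 4); // now it does
```

With a 3-byte limit, a plain substr() would cut between the two bytes of the combining accent, while this version stops after "a" and returns fewer bytes than the limit allows.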

Sure. But just handling code points safely is hard enough as it is. This 
handles that. It doesn’t handle characters, sure, but it’s a start. And for 
many applications, you do not need to handle characters.

We already have mbstring and intl for doing various things "a bit better"; the goal of more centralised support should be to do them as well as possible, not be just another variation that doesn't quite get there.


--
Rowan Collins
[IMSoP]


--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
