On 14/10/2014 20:51, Andrea Faulds wrote:
If you want length in characters, you probably need to implement your own 
algorithm, as it really depends on your specific use case.

I disagree: Unicode has very well-defined algorithms for these things, and the average PHP developer (or even PHP framework developer) is unlikely to do better.

It will, however, always produce valid UTF8 strings for output. That’s better 
than standard string functions which can mangle UTF8.

They will be valid UTF-8 sequences, but they may not be meaningful strings - you might truncate halfway along a set of combining diacritics, or worse, halfway through a Korean syllable written as decomposed jamo (3 code points, 1 grapheme).


I may also want to say $string->getByteStringWithMaxLength('UTF-8', 20) to fit an exact 
number of graphemes into a 20-byte binary space; something that neither 
$string->substring(0, 20)->getByteString('UTF-8') nor substr( 
$string->getByteString('UTF-8'), 0, 20 ) can do.
I’m not sure quite how you’d do that.

Nor am I, that's why I want the library to do it for me! :P

More seriously, a simple algorithm is easy enough to design - serialize your abstract string into bytes one grapheme at a time, tracking the current and previous lengths. If current length exceeds the maximum, track back to the previous length and return; otherwise, continue until all graphemes are serialized.
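
As a rough illustration of that loop, here is a minimal sketch in today's PHP, assuming the intl extension is loaded and the input is already valid UTF-8. The function name utf8_truncate_graphemes is made up for this example and is not an existing or proposed API; grapheme_extract() is the real intl function used to pull one cluster at a time.

```php
<?php
// Hypothetical helper (name invented for illustration): returns the longest
// prefix of $utf8 that is made of whole grapheme clusters and fits in
// $maxBytes bytes. Assumes ext/intl is available and $utf8 is valid UTF-8.
function utf8_truncate_graphemes(string $utf8, int $maxBytes): string
{
    $out = '';
    $offset = 0;               // byte offset of the next cluster to read
    $total = strlen($utf8);    // strlen() counts bytes, not characters

    while ($offset < $total) {
        // Extract exactly one grapheme cluster starting at $offset;
        // $next receives the byte offset just past that cluster.
        $cluster = grapheme_extract($utf8, 1, GRAPHEME_EXTR_COUNT, $offset, $next);
        if ($cluster === false || $cluster === '') {
            break;
        }
        // Stop before the cluster that would push us over the limit,
        // which keeps the previous (still valid) length.
        if (strlen($out) + strlen($cluster) > $maxBytes) {
            break;
        }
        $out .= $cluster;
        $offset = $next;
    }

    return $out;
}
```

With this, utf8_truncate_graphemes("ae\u{0301}", 3) keeps only "a", because the "e" plus its combining acute accent needs 3 bytes on its own and is dropped as a unit rather than split. The intl documentation also describes a GRAPHEME_EXTR_MAXBYTES mode for grapheme_extract() that seems to cover the same case in a single call, which is rather my point: this belongs in the library.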

Sure. But just handling code points safely is hard enough as it is. This 
handles that. It doesn’t handle characters, sure, but it’s a start. And for 
many applications, you do not need to handle characters.

We already have mbstring and intl for doing various things "a bit better"; the goal of more centralised support should be to do them as well as possible, not be just another variation that doesn't quite get there.

--
Rowan Collins
[IMSoP]


--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
