Re: [PHP-DEV] [RFC] UString

Derick Rethans Sun, 01 Mar 2015 12:39:27 -0800

On Sat, 28 Feb 2015, Rowan Collins wrote:

> On 28/02/2015 06:48, Joe Watkins wrote:
> > Morning internals,
> > 
> >      This is just a quick note to announce my intention to ready this RFC
> > for voting next week.
> > 
> >      I know I'm a little late maybe, I was real sick most of last week, so
> > couldn't do anything useful.
> > 
> >      A couple of us intend to fix outstanding issues on github and those
> > raised here, tidy the RFC and open the vote for 7.
> > 
> >     I would ask anyone interested to scan through this thread and announce
> > concerns that are not mentioned asap.
> 
> I still think this class is trying to do several jobs, and not doing any of
> them very well, and I fear that people will see this class and expect it to
> solve problems which it actually ignores.
> 
> Here are some concrete use cases I would like a simple interface to solve for
> me:
> 
> - Take text from an ISO 8859-2 data source, pass it through generic text
> filters, and pass it to a UTF-16 data target.
> - Given a long string of Unicode text, give me a valid UTF-8 string which fits
> into a buffer with fixed byte size; i.e. give me the largest number of whole
> code points which fit into that number of bytes once encoded.
> - As above, but without stripping diacritics off the last character of the
> resulting string, i.e. give me the largest number of whole graphemes which
> fit.
> - Split a string into equal sized chunks of readable characters (graphemes),
> regardless of how many bytes or code points each chunk contains.
> 
> UString currently falls short of all of these:
> 
> - I can specify my input encoding (in the constructor or helper method,
> over-riding a static default, which is equivalent to ext/mbstring's global
> setting), but not my output encoding (there is no method to ask for a byte
> representation other than a string cast, which by definition has no
> parameters).


Yeah, there should be an output method to convert to a target encoding.
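
As a rough sketch of what such an output method would have to do, the
conversion can be written today with ext/mbstring. The helper name
toEncoding() is hypothetical and is not part of the RFC; the sketch
assumes ext/mbstring is loaded and that the internal form is UTF-8.

```php
<?php
// Sketch only: toEncoding() is a hypothetical helper, not a UString method.
// Assumes ext/mbstring is loaded and $utf8 holds valid UTF-8.
function toEncoding(string $utf8, string $targetEncoding): string
{
    return mb_convert_encoding($utf8, $targetEncoding, 'UTF-8');
}

// Input arrives as ISO-8859-2: byte 0xB3 is U+0142 in that code page.
$latin2 = "\xB3";
$utf8   = mb_convert_encoding($latin2, 'UTF-8', 'ISO-8859-2');

// Output leaves as UTF-16 (big endian, no byte order mark).
$utf16  = toEncoding($utf8, 'UTF-16BE');

echo bin2hex($utf8), "\n";  // c582
echo bin2hex($utf16), "\n"; // 0142
```

A string cast cannot take the target encoding as an argument, which is
why a separate method is needed.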

> - I can ask for a fixed number of code points, but don't know how many bytes
> these will take until I cast to a UTF-8 string.

As I said before, indexes into strings should not be done on code 
points, as the following would then break the characters:

$s = new Text("Ås"); // decomposed: U+0041 followed by U+030A
echo $s->substring(1);

The output would be:    ̊  

Whereas:

$s = new Text("Ås"); // precomposed: U+00C5
echo $s->substring(1);

would output "s".

Which is not what people would expect.
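
The difference can be reproduced with existing functions, here
mb_substr() (code point offsets) against grapheme_substr() (grapheme
offsets). This is only an illustration with ext/mbstring and ext/intl
assumed to be loaded; it does not use the proposed class. Output is
hex-encoded so the combining mark stays visible.

```php
<?php
// Assumes ext/mbstring and ext/intl are loaded.
$decomposed  = "A\u{030A}s"; // "A" + COMBINING RING ABOVE + "s"
$precomposed = "\u{00C5}s";  // LATIN CAPITAL LETTER A WITH RING ABOVE + "s"

// Code point offsets: the decomposed form is cut between "A" and its ring.
echo bin2hex(mb_substr($decomposed, 1, null, 'UTF-8')), "\n";  // cc8a73
echo bin2hex(mb_substr($precomposed, 1, null, 'UTF-8')), "\n"; // 73

// Grapheme offsets: both forms yield just "s".
echo bin2hex(grapheme_substr($decomposed, 1)), "\n";  // 73
echo bin2hex(grapheme_substr($precomposed, 1)), "\n"; // 73
```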

> - I can't manipulate anything at the grapheme level at all, even though this
> is the most meaningful level of operation in most cases.

Yes - graphemes should be the base blocks, not code points.
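
For the fixed-size buffer case mentioned earlier, a grapheme-based
truncation could look roughly like this. fitGraphemes() is a
hypothetical helper, not something the RFC defines; the sketch assumes
ext/intl and ext/mbstring are loaded and that the input is valid UTF-8.

```php
<?php
// Sketch only: fitGraphemes() is a hypothetical helper, not a UString method.
// Assumes ext/intl and ext/mbstring are loaded and $text is valid UTF-8.
function fitGraphemes(string $text, int $maxBytes): string
{
    // GRAPHEME_EXTR_MAXBYTES treats the size argument as a byte limit and
    // stops at the last grapheme boundary that still fits.
    return (string) grapheme_extract($text, $maxBytes, GRAPHEME_EXTR_MAXBYTES);
}

$text = "e\u{0301}e\u{0301}"; // two graphemes, 3 bytes each in UTF-8

// Whole code points only: the second "e" survives but loses its accent.
echo bin2hex(mb_strcut($text, 0, 5, 'UTF-8')), "\n"; // 65cc8165

// Whole graphemes only: the second "e" is dropped together with its accent.
echo bin2hex(fitGraphemes($text, 5)), "\n"; // 65cc81
```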

> Things it does do:
> 
> - a handful of methods give meaningful international text support: toUpper(),
> toLower(),  trim()
> - some methods could be done on byte strings if I ensure they're all in UTF-8:
> replace(), contains(), startsWith(), endsWith(), repeat()

That doesn't always work when you have graphemes, or text in different 
normalisation forms. I.e., it should consider Å (U+00C5) and Å (U+0041 + 
U+030A) the same for contains and startsWith; in other words, it should 
handle normalisation for comparison.
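
A minimal sketch of that idea, normalising both sides to NFC before a
byte search. containsNormalized() is a hypothetical name, and ext/intl
is assumed to be loaded.

```php
<?php
// Sketch only: containsNormalized() is a hypothetical helper.
// Assumes ext/intl is loaded and both arguments are valid UTF-8.
function containsNormalized(string $haystack, string $needle): bool
{
    $h = Normalizer::normalize($haystack, Normalizer::FORM_C);
    $n = Normalizer::normalize($needle, Normalizer::FORM_C);

    return strpos($h, $n) !== false;
}

$precomposed = "\u{00C5}ngstr\u{00F6}m"; // starts with U+00C5
$decomposed  = "A\u{030A}";              // U+0041 + U+030A

var_dump(strpos($precomposed, $decomposed) !== false);   // bool(false)
var_dump(containsNormalized($precomposed, $decomposed)); // bool(true)
```

Normalising only makes the two spellings compare equal; it does not by
itself stop a match from starting or ending inside a grapheme.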

> - there may be limited situations where I want to dive into the code points
> which make up a string, although I can't think of many: $length, pad(),
> indexOf(), lastIndexOf(), charAt(), replaceSlice()

Break iterators on either code points, or graphemes, might work here?
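
ext/intl already exposes both kinds of iterator, so something along
these lines is possible today. This only illustrates the existing
IntlBreakIterator API, not a proposed UString interface; ext/intl is
assumed to be loaded, and the 'en' locale is an arbitrary choice.

```php
<?php
// Assumes ext/intl is loaded.
$text = "A\u{030A}s"; // "A" + COMBINING RING ABOVE + "s"

// Grapheme boundaries ("character" instance in ICU terms).
$graphemes = IntlBreakIterator::createCharacterInstance('en');
$graphemes->setText($text);
$graphemeParts = iterator_to_array($graphemes->getPartsIterator(), false);

// Code point boundaries.
$codePoints = IntlBreakIterator::createCodePointInstance();
$codePoints->setText($text);
$codePointParts = iterator_to_array($codePoints->getPartsIterator(), false);

echo count($graphemeParts), "\n";  // 2
echo count($codePointParts), "\n"; // 3
```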

> - remaining methods avoid me creating invalid UTF-8, but don't help me 
> much with real-life text: chunk(), split(), substring()
> - I can ask what codepage my Unicode string is in; I don't even 
> understand what this means
> 
> I think an efficient OO wrapper around ICU is a great idea, but more 
> thought needs to go into what methods are exposed, and how people are 
> going to use them in real code.

Yes - I agree. I think this current proposal is a good start, but it 
needs to be worked out a little bit more before we vote on it, however 
much I would like to see something like this in PHP.

cheers,
Derick

-- 
http://derickrethans.nl | http://xdebug.org
Like Xdebug? Consider a donation: http://xdebug.org/donate.php
twitter: @derickr and @xdebug
Posted with an email client that doesn't mangle email: alpine

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
