Re: [PHP-DEV] [RFC] UString

Rowan Collins Sat, 28 Feb 2015 09:59:07 -0800

On 28/02/2015 06:48, Joe Watkins wrote:

Morning internals,


     This is just a quick note to announce my intention to ready this RFC
for voting next week.

     I know I'm a little late maybe, I was real sick most of last week, so
couldn't do anything useful.

     A couple of us intend to fix outstanding issues on github and those
raised here, tidy the RFC and open the vote for 7.

    I would ask anyone interested to scan through this thread and announce
concerns that are not mentioned asap.

I still think this class is trying to do several jobs, and not doing any of them very well, and I fear that people will see this class and expect it to solve problems which it actually ignores.

Here are some concrete use cases I would like a simple interface to solve for me:

- Take text from an ISO 8859-2 data source, pass it through generic text filters, and pass it to a UTF-16 data target.
- Given a long string of Unicode text, give me a valid UTF-8 string which fits into a buffer with fixed byte size; i.e. give me the largest number of whole code points which fit into that number of bytes once encoded.
- As above, but without stripping diacritics off the last character of the resulting string, i.e. give me the largest number of whole graphemes which fit.
- Split a string into equal sized chunks of readable characters (graphemes), regardless of how many bytes or code points each chunk contains.
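To make the last three use cases concrete, here is a rough sketch of the behaviour I have in mind, written in Python purely for illustration; it is not UString or ICU code. All function names are hypothetical, and clusters() only approximates graphemes as a base code point followed by any combining marks, where a real implementation would use full Unicode grapheme segmentation as ICU provides.

```python
import unicodedata


def truncate_code_points(text, max_bytes):
    """Longest prefix of whole code points whose UTF-8 form fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return encoded
    cut = max_bytes
    # UTF-8 continuation bytes look like 0b10xxxxxx; back up until the cut
    # lands on the first byte of a code point.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1
    return encoded[:cut]


def clusters(text):
    """Yield approximate graphemes (ASSUMPTION: base code point + combining marks)."""
    cluster = ""
    for ch in text:
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster


def truncate_clusters(text, max_bytes):
    """Longest prefix of whole clusters whose UTF-8 form fits in max_bytes."""
    out = bytearray()
    for cluster in clusters(text):
        piece = cluster.encode("utf-8")
        if len(out) + len(piece) > max_bytes:
            break
        out += piece
    return bytes(out)


def chunk_clusters(text, size):
    """Split text into chunks of `size` clusters, whatever their byte length."""
    units = list(clusters(text))
    return ["".join(units[i:i + size]) for i in range(0, len(units), size)]
```

For the string "cafe" followed by U+0301 (combining acute) and "!", a 5-byte budget yields "cafe" under the code point rule, silently dropping the accent, and "caf" under the cluster rule.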


UString currently falls short of all of these:

- I can specify my input encoding (in the constructor or helper method, overriding a static default, which is equivalent to ext/mbstring's global setting), but not my output encoding (there is no method to ask for a byte representation other than a string cast, which by definition has no parameters).
- I can ask for a fixed number of code points, but don't know how many bytes these will take until I cast to a UTF-8 string.
- I can't manipulate anything at the grapheme level at all, even though this is the most meaningful level of operation in most cases.
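What I am asking for in the first point could look roughly like the following. This is again Python for illustration only, and transcode() and its parameters are hypothetical names, not anything in the RFC. The property that matters is that both the source and the target encoding are explicit arguments, with the text filter operating on decoded text in between.

```python
def transcode(data, source, target, text_filter=None):
    """Decode bytes from `source`, optionally filter the text, encode to `target`."""
    text = data.decode(source)
    if text_filter is not None:
        # The filter sees Unicode text and never needs to know either encoding.
        text = text_filter(text)
    return text.encode(target)


# Bytes as they might arrive from an ISO 8859-2 data source.
incoming = "\u0141\u00f3d\u017a".encode("iso-8859-2")

# Upper-case the text and hand it to a UTF-16 (little-endian, no BOM) target.
outgoing = transcode(incoming, "iso-8859-2", "utf-16-le", str.upper)
```

Here the four input bytes become eight output bytes, two per UTF-16 code unit, and at no point does the caller rely on a global default encoding.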


Things it does do:

- a handful of methods give meaningful international text support: toUpper(), toLower(), trim()
- some methods could be done on byte strings if I ensure they're all in UTF-8: replace(), contains(), startsWith(), endsWith(), repeat()
- there may be limited situations where I want to dive into the code points which make up a string, although I can't think of many: $length, pad(), indexOf(), lastIndexOf(), charAt(), replaceSlice()
- remaining methods avoid me creating invalid UTF-8, but don't help me much with real-life text: chunk(), split(), substring()
- I can ask what codepage my Unicode string is in; I don't even understand what this means
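As a sketch of why the second group needs no special class: UTF-8 is self-synchronising, so a valid UTF-8 needle can only match a valid UTF-8 haystack on code point boundaries. The Python below (illustrative only, not PHP) runs these operations on plain byte strings. It assumes both sides use the same normalisation form, which byte comparison does not check.

```python
# "naive cafe" with i-diaeresis and e-acute, both precomposed, as UTF-8 bytes.
haystack = "na\u00efve caf\u00e9".encode("utf-8")
needle = "\u00e9".encode("utf-8")

# Byte-level equivalents of contains(), startsWith(), endsWith(),
# replace() and repeat().
found = needle in haystack
starts = haystack.startswith("na\u00ef".encode("utf-8"))
ends = haystack.endswith(needle)
replaced = haystack.replace(needle, b"e")
repeated = needle * 3

# The results are still valid UTF-8 and decode cleanly.
replaced_text = replaced.decode("utf-8")
repeated_text = repeated.decode("utf-8")

# Length is where bytes and code points diverge.
byte_length = len(haystack)
code_point_length = len(haystack.decode("utf-8"))
```

Only the length differs between the two views: 12 bytes against 10 code points.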

I think an efficient OO wrapper around ICU is a great idea, but more thought needs to go into what methods are exposed, and how people are going to use them in real code.


Regards,
--
Rowan Collins
[IMSoP]

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
