Hi Joe,

On Sun, Mar 1, 2015 at 6:14 PM, Yasuo Ohgaki <yohg...@ohgaki.net> wrote:

> On Sat, Feb 28, 2015 at 3:48 PM, Joe Watkins <pthre...@pthreads.org>
> wrote:
>
>>     This is just a quick note to announce my intention to ready this RFC
>> for voting next week.
>>
>>     I know I'm a little late maybe, I was real sick most of last week, so
>> couldn't do anything useful.
>>
>>     A couple of us intend to fix outstanding issues on github and those
>> raised here, tidy the RFC and open the vote for 7.
>>
>>    I would ask anyone interested to scan through this thread and announce
>> concerns that are not mentioned asap.
>>
>
> I appreciate your proposal!
> Rowan pointed out some important things. I don't understand details as I
> don't read your code yet. I'll try to read and comment in a few days.
>

I guess you would like to start voting today or tomorrow, so I briefly read
your code.
I think your approach is good. I like that UString is always UTF-8 by
default, regardless of other settings such as default_charset or
internal_encoding.

I see a few key APIs missing that would be critical for multibyte character
handling, such as string length, string width, normalization, string
conversions like Zenkaku to Hankaku, and an encoding (codepage) converter.
However, all of these may be added later, as they are already implemented
in ICU.
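
To illustrate the kind of operations I mean, here is a rough sketch using
today's mbstring functions. These are existing procedural functions, not
UString methods, and the variable names are only for illustration.
Normalization is left out because it needs the intl extension's Normalizer
class.

```php
<?php
// Existing mbstring functions, shown only to illustrate the operations;
// none of these are UString methods.
$s = "\u{65E5}\u{672C}\u{8A9E}";          // "日本語": three full-width characters

$chars = mb_strlen($s, 'UTF-8');          // length in code points
$width = mb_strwidth($s, 'UTF-8');        // display columns (full-width counts as 2)

// Zenkaku (full-width) "ＡＢＣ１２３" to Hankaku (half-width) alphanumerics.
$zenkaku = "\u{FF21}\u{FF22}\u{FF23}\u{FF11}\u{FF12}\u{FF13}";
$hankaku = mb_convert_kana($zenkaku, 'a', 'UTF-8');

// Encoding (codepage) conversion: UTF-8 to Shift_JIS and back.
$sjis  = mb_convert_encoding($s, 'SJIS', 'UTF-8');
$round = mb_convert_encoding($sjis, 'UTF-8', 'SJIS');
```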

I think it may be better for UString to use UTF-8 always, to make users'
lives a little simpler.
Your constructor only has a codepage setting, which is used as the
UString's own codepage in order to support other codepages (encodings).

Rather than supporting various encodings, I think the constructor needs an
encoding (codepage) conversion feature. The codepage parameter would be
better used as a "from encoding (codepage)" parameter, so that any encoding
(codepage) is converted to UTF-8. If the conversion fails, it should raise
an exception. It would be good to have a forgiving API for malformed
strings, but only if the user explicitly asks for it.

The constructor could be:

   public function __construct([string $string [, string $source_codepage
[, string $substitute_char]]]);

$source_codepage is the encoding (codepage) of the source string, and
$string is always converted to UTF-8.
If $substitute_char is omitted, an exception is raised for an invalid
$string.
If $substitute_char is specified (it can be the empty string ''), $string
is converted according to $source_codepage and any invalid byte sequences
in $string are simply removed or replaced.

With this constructor, the string stored in a UString object is always
valid UTF-8. Any character encoding (including UTF-16/32 and the 200
encoding names supported by ICU) may be used for the source string.
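
A minimal sketch of these semantics, to make the idea concrete. It is not
the RFC's implementation: the class name Utf8String is hypothetical, and it
uses mbstring instead of ICU, so the accepted encoding names are mbstring's
rather than ICU's. It also assumes $substitute_char, when non-empty, is a
single valid UTF-8 character.

```php
<?php
// Hypothetical sketch only: not the RFC's UString class. mbstring stands
// in for ICU here, so encoding names follow mbstring, not ICU.
final class Utf8String
{
    private $value; // always valid UTF-8

    public function __construct(
        string $string = '',
        string $source_encoding = 'UTF-8',
        ?string $substitute_char = null
    ) {
        if ($substitute_char === null) {
            // Strict mode: refuse input that is malformed in its source encoding.
            if (!mb_check_encoding($string, $source_encoding)) {
                throw new InvalidArgumentException(
                    "Malformed $source_encoding input"
                );
            }
            $this->value = mb_convert_encoding($string, 'UTF-8', $source_encoding);
            return;
        }

        // Forgiving mode: '' drops invalid bytes, anything else replaces them.
        // Assumes a non-empty $substitute_char is one valid UTF-8 character.
        $previous = mb_substitute_character();
        mb_substitute_character(
            $substitute_char === '' ? 'none' : mb_ord($substitute_char, 'UTF-8')
        );
        try {
            $this->value = mb_convert_encoding($string, 'UTF-8', $source_encoding);
        } finally {
            // Restore the global mbstring setting we temporarily changed.
            mb_substitute_character($previous);
        }
    }

    public function __toString(): string
    {
        return $this->value;
    }
}
```

With this sketch, new Utf8String("caf\xE9", 'ISO-8859-1') holds the UTF-8
bytes for "café", new Utf8String("a\xFFb") throws, and
new Utf8String("a\xFFb", 'UTF-8', '') keeps only the valid bytes.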

Since there would be no variable codepage setting for UString objects, the
following could be removed:

    public static function getDefaultCodepage();
    public static function setDefaultCodepage(string $codepage);

ICU uses "codepage" to mean "character encoding", but it may be better to
say "character encoding", as people are not used to ICU terminology.

These are my thoughts. I didn't read your code carefully, so I might be
wrong. Please correct me if I'm mistaken.

I suppose there are other people working on simpler Unicode string
libraries.
I would like to hear their opinions.

BTW, we really need byte_len(). strlen() is just a confusing API... It's
outside the scope of this RFC, though.
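
To show what I mean: byte_len() below is a hypothetical helper, not an
existing PHP function. It only gives strlen()'s byte-counting behaviour an
unambiguous name, next to mb_strlen() for comparison.

```php
<?php
// Hypothetical helper: byte_len() does not exist in PHP. It only gives
// strlen()'s byte-counting behaviour an unambiguous name.
function byte_len(string $s): int
{
    return strlen($s);
}

$s = "\u{65E5}\u{672C}\u{8A9E}";        // "日本語" encoded as UTF-8

$byteCount = byte_len($s);              // number of bytes
$charCount = mb_strlen($s, 'UTF-8');    // number of code points
```

For this three-character string the two counts differ, which is exactly
where the name strlen() misleads people.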

Regards,

--
Yasuo Ohgaki
yohg...@ohgaki.net
