Hi Joe,

On Sun, Mar 1, 2015 at 6:14 PM, Yasuo Ohgaki <yohg...@ohgaki.net> wrote:
> On Sat, Feb 28, 2015 at 3:48 PM, Joe Watkins <pthre...@pthreads.org>
> wrote:
>
>> This is just a quick note to announce my intention to ready this RFC
>> for voting next week.
>>
>> I know I'm a little late maybe, I was real sick most of last week, so
>> couldn't do anything useful.
>>
>> A couple of us intend to fix outstanding issues on github and those
>> raised here, tidy the RFC and open the vote for 7.
>>
>> I would ask anyone interested to scan through this thread and announce
>> concerns that are not mentioned asap.
>>
>
> I appreciate your proposal!
> Rowan pointed out some important things. I don't understand details as I
> don't read your code yet. I'll try to read and comment in a few days.
>

I guess you would like to start voting today or tomorrow, so I read your code briefly.

I think your approach is good. I like that UString is UTF-8 by default regardless of other settings, i.e. default_charset and internal_encoding.

I see a few missing key APIs that would be critical for multibyte character handling, such as string length, string width, normalization, string conversions like Zenkaku to Hankaku, and an encoding (codepage) converter. However, all of these could be added later, as they are already implemented in ICU.

I think it may be better for UString to use UTF-8 always, to make users' lives a little simpler. Your constructor only has a codepage setting, which is used as the UString's codepage to support other codepages (encodings). Rather than supporting various encodings, I think the constructor needs an encoding (codepage) conversion feature. The codepage parameter would be better used as a "from encoding (codepage)" parameter, so that any encoding (codepage) is converted to UTF-8. If the conversion fails, it should raise an exception. It would also be good to have a forgiving API for malformed strings, used only if the user explicitly asks for it.
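A rough sketch of the semantics described above, written in Python purely for illustration. It is not the RFC's code: the class below is a hypothetical stand-in, and the parameter names, the error-handler name, and the use of an optional substitute argument as the "forgiving" switch are all assumptions.

```python
import codecs


class UString:
    """Hypothetical stand-in: always stores valid UTF-8 internally."""

    def __init__(self, data=b"", source_encoding="utf-8", substitute_char=None):
        if substitute_char is None:
            # Strict mode: any invalid byte sequence raises UnicodeDecodeError.
            text = data.decode(source_encoding, errors="strict")
        else:
            # Forgiving mode, only when the caller asks for it: each invalid
            # sequence is replaced by substitute_char ("" simply drops it).
            def handler(exc):
                if isinstance(exc, UnicodeDecodeError):
                    return substitute_char, exc.end
                raise exc

            # Assumed handler name. The registry is global, so this sketch
            # is not safe to call from several threads at once.
            codecs.register_error("ustring-substitute", handler)
            text = data.decode(source_encoding, errors="ustring-substitute")
        # Whatever the source encoding was, the stored bytes are UTF-8.
        self.utf8 = text.encode("utf-8")


if __name__ == "__main__":
    print(UString(b"caf\xe9", "latin-1").utf8)       # Latin-1 input
    print(UString(b"ab\xffcd", "utf-8", "?").utf8)   # invalid byte replaced
    print(UString(b"ab\xffcd", "utf-8", "").utf8)    # invalid byte removed
```

The point of the design is that the source encoding is consumed once, at construction time, so nothing downstream ever has to ask which encoding an object is in.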
The constructor could be

  public function __construct([string $string [, string $source_codepage [, string $substitute_char]]]);

$source_codepage is the encoding (codepage) of the source string, and $string is always converted to UTF-8. If $substitute_char is omitted, an exception is raised for an invalid $string. If $substitute_char is specified (it can be '', the empty string), $string is converted according to $source_codepage and any invalid byte sequences in $string are simply removed or replaced.

With this constructor, the string stored in a UString object is always valid UTF-8. Any character encoding (including UTF-16/32 and the 200 encoding names supported by ICU) may be used for the source string.

Since there would be no variable codepage setting for UString objects, the following could be removed:

  public static function getDefaultCodepage();
  public static function setDefaultCodepage(string $codepage);

ICU uses "codepage" to mean "character encoding", but it may be better to say "character encoding", as people are not used to ICU terminology.

This is what I thought. I didn't read your code carefully, so I might be wrong. Please correct me if I'm mistaken.

I suppose there are other people working on simpler Unicode-string-based libraries. I would like to hear their opinions.

BTW, we really need byte_len(). strlen() is just a confusing API... It's outside the scope of this RFC, though.

Regards,

--
Yasuo Ohgaki
yohg...@ohgaki.net