On Thu, Aug 29, 2013 at 12:28 AM, Bennie Kloosteman <[email protected]> wrote:
> ...And Unicode is not even official here; officially you should use ASCII
> and then use an encoding scheme, GB or GBK (Unicode can't do the newer chars,
> so they do this encoding on top of Unicode anyway and suffer a double whammy
> because the encoded chars are wider).

I've run into this problem in Japanese as well. The result is that proper
eastern-language I18N ends up forced into byte[] instead of UTF-8 strings
anyhow.

I'm in favor of UTF-8 strings, and also of "chunky" strings in which
sub-runs are encoded using the most efficient encoding for the run.

Back at egroups/yahoo-groups, we used a UTF-8-compatible "chunk-marked"
encoding we called ME8, written by Gaku Ueda. It allowed marking a chunk
with a specific charset encoding, which solved some of the issues Bennie
mentioned.

I thought there was a more public draft written up about it, but the best
I could find is this; it explains how the marker sequence was craftily
chosen to be distinguishable from UTF-8 sequences:

http://dj1.willowmail.com/~jeske/_drop/ME8_chunked_charset_encoding.txt

Regarding the lack of direct string[i] indexing: in all of the email/web
I18N work I've done, strings are nearly always stream-processed, so there
is no need for random access.
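
For concreteness, here is a minimal Go sketch of the chunk-marked idea:
a string is a sequence of runs, each tagged with the charset its bytes are
encoded in, so each sub-run can use whatever encoding is most compact for it.
This only illustrates the concept; it is not the actual ME8 byte-level format
(how ME8's marker sequences avoid colliding with UTF-8 is described in the
linked note, not here), and the Chunk/ChunkedString names and charset labels
are made up for the example.

// chunked.go: conceptual sketch of a chunk-marked string, not the ME8 format.
package main

import "fmt"

// Chunk is one run of bytes stored in a single charset.
type Chunk struct {
	Charset string // e.g. "UTF-8", "GBK" (labels are illustrative)
	Bytes   []byte
}

// ChunkedString stores sub-runs in whatever encoding suits each run.
type ChunkedString []Chunk

// Len returns the total payload size in bytes, ignoring any chunk markers.
func (cs ChunkedString) Len() int {
	n := 0
	for _, c := range cs {
		n += len(c.Bytes)
	}
	return n
}

func main() {
	s := ChunkedString{
		{Charset: "UTF-8", Bytes: []byte("Hello, ")},
		{Charset: "GBK", Bytes: []byte{0xC4, 0xE3, 0xBA, 0xC3}}, // "ni hao" in GBK
	}
	fmt.Println("payload bytes:", s.Len())
	for _, c := range s {
		fmt.Printf("chunk charset=%s len=%d\n", c.Charset, len(c.Bytes))
	}
}

A real implementation would serialize the charset tag as an escape sequence
inside the byte stream (as ME8 did) rather than as a separate struct field;
the struct form is just the easiest way to show the shape of the data.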
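
And a small Go sketch of the stream-processing point: UTF-8 text stored as
bytes is naturally consumed as an in-order stream of code points, so the lack
of O(1) string[i] character indexing rarely matters in practice. Go's
range-over-string performs exactly this decoding; the sample string is
arbitrary.

// stream.go: iterate over code points instead of indexing by character.
package main

import "fmt"

func main() {
	s := "héllo, 世界"
	// i is the byte offset where each code point starts; r is the decoded rune.
	for i, r := range s {
		fmt.Printf("byte offset %d: %c (U+%04X)\n", i, r, r)
	}
	// Note: s[3] yields a raw byte, not the 4th character; character-level
	// random access would need an O(n) scan or a separately built index.
}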
