On Fri, Jun 24, 2011 at 4:08 AM, Dave Cridland <d...@cridland.net> wrote:

> 1) Processing software may have decoded the UTF-8 into "something", making
> it awkward to manage.
>
> 2) Referring to UTF-8 octets means we have silly states where we could edit
> inside characters. It's even possible this may be used intentionally, in
> some languages.
>
> So I'd say that we should refer to characters in a string, and deal with
> Unicode code-points in the abstract. I'd expect that implementations would
> convert this internally into whatever made sense for them.
>

That's what I did in v0.0.2 of the specification already, but it made it
necessary to explain which string format is meant, which in turn meant saying
it is based on UTF-16 strings. Unfortunately, the very same string has a
different length depending on the programming language's native Unicode
storage format (not the wire transmission format before XML processing):

UTF8: String.Length("Québec") == 7
UTF16: String.Length("Québec") == 6

Now, when we start using Chinese characters outside the BMP, UTF16 and UCS4
also diverge for exactly the same Chinese character:

UTF16: String.Length("<character outside the BMP>") == 2
UCS4: String.Length("<character outside the BMP>") == 1
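To make the divergence concrete, here is a minimal sketch in Java (whose
native String is UTF-16); the concrete non-BMP character U+20000 is only an
illustrative choice, not anything taken from the specification:

import java.nio.charset.StandardCharsets;

public class LengthDemo {
    public static void main(String[] args) {
        String s = "Québec";
        // Octets in the UTF-8 wire encoding: 7 ("é" takes two octets)
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 7
        // UTF-16 code units (Java's native String.length()): 6
        System.out.println(s.length());                                // 6
        // Unicode code points: 6
        System.out.println(s.codePointCount(0, s.length()));           // 6

        // A CJK character outside the BMP (U+20000, illustrative only)
        String cjk = new String(Character.toChars(0x20000));
        System.out.println(cjk.length());                              // 2 (surrogate pair)
        System.out.println(cjk.codePointCount(0, cjk.length()));       // 1
    }
}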

This plays unfortunate havoc with String.Insert and String.Delete, and it
hurts interoperability. We therefore need one consistent method, which is why
we had to go with "Unicode code points".
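As a rough illustration of what addressing by Unicode code points could look
like on top of a UTF-16 string, here is a hedged sketch; the helper name
insertAtCodePoint is hypothetical and not part of any specification:

public class CodePointEdit {
    // Hypothetical helper: insert text at a position counted in Unicode
    // code points, translating it to a UTF-16 char index first.
    static String insertAtCodePoint(String s, int codePointIndex, String text) {
        int charIndex = s.offsetByCodePoints(0, codePointIndex);
        return new StringBuilder(s).insert(charIndex, text).toString();
    }

    public static void main(String[] args) {
        // "a" + a non-BMP character (U+20000) + "b": 3 code points, 4 chars
        String s = "a" + new String(Character.toChars(0x20000)) + "b";
        // Code-point index 2 points just before "b", even though the
        // corresponding UTF-16 char index is 3.
        System.out.println(insertAtCodePoint(s, 2, "X"));
    }
}

The point is simply that a protocol-level index given in code points stays
stable across UTF-8, UTF-16 and UCS-4 implementations, while each
implementation translates it into its own internal unit.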
