Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.

Scott Robison Tue, 08 Jul 2014 15:01:25 -0700

On Tue, Jul 8, 2014 at 3:50 PM, Stephan Beal <sgb...@googlemail.com> wrote:

> Interesting question/option, but i have no answer. Something to possibly
> consider?
>
> (sent from a mobile device - please excuse brevity, typos, and
> top-posting)
> ----- stephan beal
> http://wanderinghorse.net
> On Jul 8, 2014 11:43 PM, "Andy Bradford" <amb-fos...@bradfords.org> wrote:
>
>> Thus said Stephan Beal on Tue, 08 Jul 2014 21:37:50 +0200:
>>
>> > No characters between 128 and 255  are valid UTF-8, to avoid confusion
>> > with the many encodings which use that range.
>>
>> If no characters between 128 and 255 are valid UTF-8, and they can never
>> be valid UTF-8  characters, and are used by many  encodings, why doesn't
>> Fossil simply ignore them when they  are committed? I guess I'm confused
>> why they are being treated specially as to warrant either a setting or a
>> prompt to continue.
>>
>
Maybe this will help, and my apologies if I appear to be talking down to
anyone. I'm just trying to be clear:

Unicode assigns code points from 0x000000 to 0x10FFFF (17 planes of 65,536
code points each). 0xE8 is a legitimate Unicode code point: "LATIN SMALL
LETTER E WITH GRAVE" (the same value it has in ISO-8859-1).

If you use UTF-32 to encode it, you'll wind up with a four-byte integer
0x000000E8 (stored little- or big-endian depending on the platform).

If you use UTF-16 to encode it, you'll wind up with a two-byte integer
0x00E8 (again varying by endianness).

If you use UTF-8 to encode it, you'll wind up with the two-byte sequence
0xC3 0xA8 (or 0b110[00011] 0b10[101000], where the bracketed binary digits
are the bits of the original 0xE8 value, zero-padded to 11 bits).
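
If it helps to see the three encodings side by side, here is a small Python
sketch. It only uses Python's built-in codecs; the helper
encode_two_byte_utf8 is a name I made up for illustration, and it covers
only the two-byte UTF-8 range (0x80 to 0x7FF), not the full algorithm:

```python
CODE_POINT = 0xE8            # LATIN SMALL LETTER E WITH GRAVE
ch = chr(CODE_POINT)

# Built-in codecs, with the byte order stated explicitly.
utf32_be = ch.encode("utf-32-be")
utf32_le = ch.encode("utf-32-le")
utf16_be = ch.encode("utf-16-be")
utf16_le = ch.encode("utf-16-le")
utf8 = ch.encode("utf-8")


def encode_two_byte_utf8(code_point: int) -> bytes:
    """Hand-rolled two-byte UTF-8 encoding (hypothetical helper).

    Valid only for code points 0x80..0x7FF, which need 11 payload bits:
    5 bits go in the lead byte (110xxxxx), 6 in the continuation (10xxxxxx).
    """
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the two-byte UTF-8 range")
    lead = 0b11000000 | (code_point >> 6)
    continuation = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, continuation])


manual_utf8 = encode_two_byte_utf8(CODE_POINT)

for name, data in [("UTF-32BE", utf32_be), ("UTF-32LE", utf32_le),
                   ("UTF-16BE", utf16_be), ("UTF-16LE", utf16_le),
                   ("UTF-8", utf8), ("UTF-8 (manual)", manual_utf8)]:
    print(f"{name:15} {data.hex(' ')}")
```

The manual helper and the built-in codec should agree on C3 A8.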

So the code point (character) 0xE8 can appear in UTF-8, but the byte 0xE8 is
not the same thing as the code point. In UTF-8 the byte 0xE8 is the lead byte
of a three-byte sequence, so on its own it is invalid.
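
To make the byte versus code point distinction concrete, here is a minimal
Python check. The function is_valid_utf8 is a hypothetical helper built on
Python's strict UTF-8 decoder; it is not how Fossil performs its own check:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


lone_byte = bytes([0xE8])        # how ISO-8859-1 stores the character
encoded = bytes([0xC3, 0xA8])    # how UTF-8 stores code point 0xE8

print(is_valid_utf8(lone_byte))  # lone lead byte, no continuation bytes
print(is_valid_utf8(encoded))    # well-formed two-byte sequence

# Both byte strings name the same character under the right decoder.
same_character = lone_byte.decode("iso-8859-1") == encoded.decode("utf-8")
print(same_character)
```

A file saved as ISO-8859-1 that contains this character therefore holds the
lone byte 0xE8, which a strict UTF-8 decoder rejects.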

-- 
Scott Robison

_______________________________________________
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
