On Tue, Jul 8, 2014 at 3:50 PM, Stephan Beal <sgb...@googlemail.com> wrote:
> Interesting question/option, but i have no answer. Something to possibly
> consider?
>
> (sent from a mobile device - please excuse brevity, typos, and
> top-posting)
> ----- stephan beal
> http://wanderinghorse.net
> On Jul 8, 2014 11:43 PM, "Andy Bradford" <amb-fos...@bradfords.org> wrote:
>
>> Thus said Stephan Beal on Tue, 08 Jul 2014 21:37:50 +0200:
>>
>> > No characters between 128 and 255 are valid UTF-8, to avoid confusion
>> > with the many encodings which use that range.
>>
>> If no characters between 128 and 255 are valid UTF-8, and they can never
>> be valid UTF-8 characters, and are used by many encodings, why doesn't
>> Fossil simply ignore them when they are committed? I guess I'm confused
>> why they are being treated specially as to warrant either a setting or a
>> prompt to continue.
>>
>

Maybe this will help, and my apologies if I appear to be talking down to
anyone. I'm just trying to be clear:

Unicode is a way of expressing code points from 0x000000 to 0x10FFFF (17
planes of 65,536 code points). 0xE8 is a legitimate Unicode code point for
"LATIN SMALL LETTER E WITH GRAVE" (essentially the same thing as in
ISO-8859-1).

If you use UTF-32 to encode it, you'll wind up with a four-byte integer
0x000000E8 (either little- or big-endian encoded depending on the platform).

If you use UTF-16 to encode it, you'll wind up with a two-byte integer
0x00E8 (again varying by endianness).

If you use UTF-8 to encode it, you'll wind up with the two-byte sequence
0xC3 0xA8 (or 0b110[00011] 0b10[101000], where the bracketed binary digits
are the bits of the original 0xE8 value).

So the code point (character) 0xE8 can appear in UTF-8, but the byte 0xE8 is
not the same thing as the code point.

--
Scott Robison
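
To make the three encodings above concrete, here is a minimal Python sketch. It is an illustration only, not anything taken from Fossil's source; the helper names `payload_bits` and `is_valid_utf8` are hypothetical, and it relies solely on Python's built-in codecs.

```python
# U+00E8, LATIN SMALL LETTER E WITH GRAVE
ch = "\u00e8"

# Explicit byte orders are used so that no byte-order mark is prepended.
utf32_be = ch.encode("utf-32-be")  # four bytes
utf16_be = ch.encode("utf-16-be")  # two bytes, big-endian
utf16_le = ch.encode("utf-16-le")  # two bytes, little-endian
utf8 = ch.encode("utf-8")          # two bytes: lead byte + continuation byte


def payload_bits(encoded: bytes) -> int:
    """Recover the code point from a two-byte UTF-8 sequence (hypothetical helper)."""
    lead, cont = encoded
    # A two-byte sequence has the form 110xxxxx 10xxxxxx.
    assert lead >> 5 == 0b110 and cont >> 6 == 0b10
    return ((lead & 0b11111) << 6) | (cont & 0b111111)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf32_be.hex(), utf16_be.hex(), utf16_le.hex(), utf8.hex())
# prints: 000000e8 00e8 e800 c3a8
print(hex(payload_bits(utf8)))
# prints: 0xe8
print(is_valid_utf8(b"\xe8"), is_valid_utf8(b"\xc3\xa8"))
# prints: False True
```

The last line shows the distinction drawn above: the lone byte 0xE8 is rejected as UTF-8 (it has the bit pattern of a lead byte for a three-byte sequence), while the two-byte sequence 0xC3 0xA8 decodes to the code point 0xE8.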
_______________________________________________
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users