Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.

Ron W Tue, 22 Jul 2014 10:23:37 -0700

On Tue, Jul 22, 2014 at 1:01 PM, Stephan Beal <sgb...@googlemail.com> wrote:


> One would think i'd be more conscious of how i throw around byte vs
> character :/. i'm still not clear on the whole char-vs-code point bit,
> though.
>

Code points also include non-character entities such as "zero width
non-breaking space", "soft hyphen" and what some ASCII users would call
"control characters". (And the Byte Order Mark.)

(By accident of having a friend who works for a non-profit org, I have done
some web related programming on occasion. This has resulted in me learning
far more about code points and Unicode than I ever wanted.)
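
A small sketch of the point about non-character code points, using Python's
standard unicodedata module. The helper name describe() is made up for this
illustration. It reports the Unicode general category of a code point: "Cc"
is a control code, "Cf" is a format code, and "Lu" is an ordinary uppercase
letter.

```python
import unicodedata

def describe(cp):
    """Return (label, general category, name) for one code point."""
    ch = chr(cp)
    # Control codes have no formal Unicode name, so supply a default.
    name = unicodedata.name(ch, "<no name>")
    return (f"U+{cp:04X}", unicodedata.category(ch), name)

# A letter, the ASCII BEL control, soft hyphen, and the Byte Order Mark.
for cp in (0x0041, 0x0007, 0x00AD, 0xFEFF):
    print(*describe(cp))
```

Only the first of the four is something most people would call a character,
yet all four are valid code points.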


>
>
>> FWIW, FYI, UTF-8 has an optional Byte Order Mark, 0xEF 0xBB 0xBF, that can
>> appear at the beginning of a file. This is just the UTF-8 encoding of code
>> point U+FEFF, which is the actual Unicode Byte Order Mark. For UTF-8,
>> this mark is really only useful as a suggestion that the following text
>> might be UTF-8 encoded Unicode. For UTF-16 and UTF-32 encodings, this mark
>> tells the receiver of the text the order of bytes within the 16
>> or 32 bit encoding units (presuming that the file is actually UTF-16 or 32
>> encoded text).
>>
>
> AFAIK a BOM is not recommended for UTF-8, because it's (except for the use
> you point out) meaningless and confuses so many tools. That's (partially)
> what Wikipedia says, anyway (and i didn't write it).
>

As usual, it depends on circumstances. It is far preferable for the
encoding of a file or byte stream to be identified by some kind of metadata.
Failing that, a mark can help with identifying the encoding.
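
For illustration, here is a minimal sketch of that fallback in Python. The
function name sniff_bom() and the encoding labels it returns are assumptions
made for this example, not anything Fossil does. The BOM constants come from
the standard codecs module. A match is only a hint: a file that happens to
start with these bytes is not guaranteed to be in that encoding.

```python
import codecs

# Check longer marks first: the UTF-32 LE mark (FF FE 00 00) begins with
# the same two bytes as the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return (encoding hint, BOM length), or (None, 0) if no mark is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

# The UTF-8 mark is simply U+FEFF encoded as UTF-8.
assert "\ufeff".encode("utf-8") == codecs.BOM_UTF8 == b"\xef\xbb\xbf"
```

A caller would strip the returned number of bytes before decoding, and fall
back to metadata or a default encoding when the hint is None.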

_______________________________________________
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
