[On the use of the UTF-8 signature, aka the BOM, at the start of a UTF-8 file]

On Sat, Jun 24, 2017 at 1:59 PM, Russ Allbery <r...@debian.org> wrote:
> Russ Allbery <r...@debian.org> writes:
>
>> I did a bit more research, and apparently this approach has become more
>> blessed again..
>
> Okay, I experimented with this, but unfortunately less displays the BOM at
> the start of the file as a very ugly reverse-video <U+FEFF> at the top of
> the screen.
>
> I think this is arguably a bug in less; this is a control character in a
> sense, but the whole point is for it to be invisible, particularly when
> it's the first character of the file.  Nonetheless, that's how less
> currently behaves.  My feeling is that good display in less is a more
> important use case for us than enabling this autorecognition in web
> browsers (which will normally be viewing the HTML versions).
>
> Given that, I think the right fix here is to fix the declared charset on
> www.debian.org for these files.

I hadn't looked at less output on the file.  After doing that, I agree
that this is a bug in less.  I just emailed the following to
bug-l...@gnu.org:


When less is run on a text file that contains UTF-8 characters, it
appears to interpret and display them properly.  However, if the file
begins with the UTF-8 signature (the UTF-8 encoding of the Byte Order
Mark, U+FEFF), less displays it in inverted video at the start of the
first line as "<EF><BB><BF>".

Please have less detect this sequence and not display it, given that
other UTF-8-encoded data in a text file is not displayed like that.
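
To make the request concrete, here is a minimal Python sketch of the
detection I have in mind.  It is only an illustration and not a patch
for less; the function name strip_utf8_signature and the sample text
are my own inventions.  It relies on the fact that U+FEFF encodes to
the three bytes EF BB BF in UTF-8.

```python
import codecs

# codecs.BOM_UTF8 is the three-byte UTF-8 signature, b'\xef\xbb\xbf'.
UTF8_SIGNATURE = codecs.BOM_UTF8


def strip_utf8_signature(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature, if one is present.

    Hypothetical helper for illustration; only the very start of the
    input is examined, so a U+FEFF later in the data is left alone.
    """
    if data.startswith(UTF8_SIGNATURE):
        return data[len(UTF8_SIGNATURE):]
    return data


if __name__ == "__main__":
    # Made-up sample: a signature followed by one line of text.
    sample = "\ufeffDebian Policy\n".encode("utf-8")
    print(sample[:3].hex())              # efbbbf
    print(strip_utf8_signature(sample))  # b'Debian Policy\n'
```

A pager would presumably do the equivalent check once, on the first
three bytes of the file, before rendering the first line.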

The latest Unicode Standard was released on 20 June 2017.  You can
download it at

http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf

At the bottom of p. 67, there is a section called "Unicode Signature".
That section shows that it is acceptable to use the UTF-8 version of
the Byte Order Mark.  In the past, that was not the case.

Part of the impetus behind the change is likely the World Wide Web.
HTML5 browsers are required to recognize the UTF-8 signature at the
start of a plain text file and, if it is present, to interpret the
remainder of the file as UTF-8.  You can see mention of this at

https://www.w3.org/International/questions/qa-byte-order-mark

which contains this paragraph: "In HTML5 browsers are required to
recognize the UTF-8 BOM and use it to detect the encoding of the page,
and recent versions of major browsers handle the BOM as expected when
used for UTF-8 encoded pages."
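
As a rough Python sketch of that behavior (an illustration under my
own assumptions, not how any particular browser implements it): if the
data starts with the UTF-8 signature, decode it as UTF-8; otherwise
fall back to some other charset.  The function name
decode_with_signature and the ISO-8859-1 fallback are hypothetical
choices made for the example.

```python
import codecs


def decode_with_signature(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode data as UTF-8 when it starts with the UTF-8 signature.

    Hypothetical helper: 'fallback' stands in for whatever charset
    would otherwise be assumed; ISO-8859-1 is only an example default.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Drop the three signature bytes, then decode the rest as UTF-8.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    return data.decode(fallback)


if __name__ == "__main__":
    text = "café"
    with_signature = codecs.BOM_UTF8 + text.encode("utf-8")
    without_signature = text.encode("utf-8")
    print(decode_with_signature(with_signature))     # café
    print(decode_with_signature(without_signature))  # cafÃ©
```

The second call shows the failure mode the signature avoids: the same
UTF-8 bytes, read under the wrong charset, come out garbled.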

Thank you for your consideration,


Paul Hardy
