[On the use of the UTF-8 signature, aka the BOM, at the start of a UTF-8 file]
On Sat, Jun 24, 2017 at 1:59 PM, Russ Allbery <r...@debian.org> wrote:
> Russ Allbery <r...@debian.org> writes:
>
>> I did a bit more research, and apparently this approach has become more
>> blessed again..
>
> Okay, I experimented with this, but unfortunately less displays the BOM at
> the start of the file as a very ugly reverse-video <U+FEFF> at the top of
> the screen.
>
> I think this is arguably a bug in less; this is a control character in a
> sense, but the whole point is for it to be invisible, particularly when
> it's the first character of the file. Nonetheless, that's how less
> currently behaves. My feeling is that good display in less is a more
> important use case for us than enabling this autorecognition in web
> browsers (which will normally be viewing the HTML versions).
>
> Given that, I think the right fix here is to fix the declared charset on
> www.debian.org for these files.

I hadn't looked at less output on the file. After doing that, I agree
that this is a bug in less. I just emailed the following to
bug-l...@gnu.org:

If using less on a text file that contains embedded UTF-8 characters,
less seems to properly interpret and display the UTF-8. However, if
that UTF-8 file begins with the UTF-8 signature (aka the UTF-8 version
of the Byte Order Mark, U+FEFF), less displays it with inverted video
at the start of the first line as "<EF><BB><BF>".

Please have less detect this sequence and not display it, given that
other UTF-8-encoded data in a text file is not displayed like that.

The latest Unicode Standard was released on 20 June 2017. You can
download it at

http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf

At the bottom of p. 67, there is a section called "Unicode Signature".
That section shows that it is acceptable to use the UTF-8 version of
the Byte Order Mark. In the past, that was not the case.

Part of the impetus behind the change is likely the World Wide Web.
HTML5 browsers are required to recognize the UTF-8 signature at the
start of a plain text file and if present, then to interpret the
remainder of the file as a UTF-8 file. You can see mention of this at

https://www.w3.org/International/questions/qa-byte-order-mark

which contains this paragraph:

"In HTML5 browsers are required to recognize the UTF-8 BOM and use it
to detect the encoding of the page, and recent versions of major
browsers handle the BOM as expected when used for UTF-8 encoded pages."

Thank you for your consideration,

Paul Hardy
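
For illustration only (this was not part of the message sent to the less
maintainers), the detection requested above can be sketched in Python.
This is a minimal sketch, not how less or any browser actually implements
it. The function name strip_utf8_signature and the sample text are
hypothetical; codecs.BOM_UTF8 and the "utf-8-sig" codec come from the
Python standard library.

```python
import codecs


def strip_utf8_signature(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature (EF BB BF), if present.

    Hypothetical helper for illustration; only a signature at the very
    start of the data is treated specially.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical sample: a UTF-8 file that begins with the signature.
sample = codecs.BOM_UTF8 + "Debian Policy".encode("utf-8")

# Strip the signature by hand, then decode as ordinary UTF-8.
print(strip_utf8_signature(sample).decode("utf-8"))

# The standard "utf-8-sig" codec skips a leading signature while decoding.
print(sample.decode("utf-8-sig"))
```

A file without the signature passes through unchanged, so a reader that
applies this check does not need to know in advance whether the
signature is present.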