Re: what to do with multiple BOMs

MRAB Thu, 19 Aug 2021 11:58:34 -0700

On 2021-08-19 14:07, Robin Becker wrote:

Channeling unicode text experts and xml people:


I have an xml entity with initial bytes ff fe ff fe which the file command says is
UTF-16, little-endian text.

I agree, but what should be done about the additional BOM?

A test output made many years ago seems to keep the extra BOM. The xml context 
is


xml file 014.xml
<!DOCTYPE doc [
<!ELEMENT doc (#PCDATA)>
<!ENTITY e SYSTEM "014.ent">
]>
<doc>&e;</doc>

the entity file 014.ent is bombomdata

b'\xff\xfe\xff\xfed\x00a\x00t\x00a\x00'

The old saved test output of processing is

b'<doc>\xef\xbb\xbfdata</doc>'

which seems to imply that the extra BOM in the entity has been kept and 
processed into a different BOM meaning utf8.

I think the test file is wrong and that multiple BOM chars in the entity should 
have been removed.

Am I right?

The use of a BOM b'\xef\xbb\xbf' at the start of a UTF-8 file is a Windows thing. It's not used on non-Windows systems. Putting it in the middle, e.g. b'<doc>\xef\xbb\xbfdata</doc>', just looks wrong.

It looks like the contents of a UTF-8 file, with a BOM because it originated on a Windows system, were read in without stripping the BOM first.
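
For anyone who wants to poke at the bytes, here is a minimal sketch using only Python's standard codecs. It shows how the 'utf-16' and 'utf-8-sig' codecs treat a BOM, using the entity bytes quoted above. It says nothing about what an XML parser ought to do with the second BOM; the names ent, text, out and utf8_file are just illustrative.

```python
import codecs

# Bytes of 014.ent as quoted in the original post:
# two UTF-16-LE BOMs followed by "data".
ent = b'\xff\xfe\xff\xfed\x00a\x00t\x00a\x00'

# The 'utf-16' codec consumes only the first BOM; the second one
# survives in the decoded text as the character U+FEFF.
text = ent.decode('utf-16')
assert text == '\ufeffdata'

# Encoding that leftover U+FEFF as UTF-8 gives the bytes ef bb bf,
# which is the same byte sequence as a UTF-8 BOM.
out = b'<doc>' + text.encode('utf-8') + b'</doc>'
assert out == b'<doc>\xef\xbb\xbfdata</doc>'

# For a UTF-8 file that starts with a BOM, 'utf-8-sig' strips it on
# decode, while plain 'utf-8' keeps it as U+FEFF.
utf8_file = codecs.BOM_UTF8 + b'data'
assert utf8_file.decode('utf-8-sig') == 'data'
assert utf8_file.decode('utf-8') == '\ufeffdata'
```

Either way, a b'\xef\xbb\xbf' that ends up in the middle of the output is a U+FEFF that was not stripped on the way in.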

--
https://mail.python.org/mailman/listinfo/python-list
