Re: what to do with multiple BOMs

Richard Damon Thu, 19 Aug 2021 11:58:46 -0700

By the rules of Unicode, that character (U+FEFF), if it is not the very first 
character of the file, should be treated as a “zero-width non-breaking space”; 
it is NOT a BOM character there.


Its presence in the files is almost certainly an error, caused by broken 
software or by software processing files in a manner that it wasn’t designed 
for.
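
As a rough illustration (a sketch, not a statement of what any particular XML
parser does), Python's own "utf-16" codec behaves this way on the entity bytes
quoted below: it consumes only the first BOM, and the second one survives as
an ordinary U+FEFF character. The names raw, text, serialized and
strip_stray_bom are mine, and dropping the stray character is just one
possible cleanup policy, not something Unicode or XML requires.

```python
import unicodedata

# Entity bytes quoted in the original message: two UTF-16-LE BOMs, then "data".
raw = b'\xff\xfe\xff\xfed\x00a\x00t\x00a\x00'

# The "utf-16" codec uses the first BOM to pick the byte order and drops it.
# The second FF FE is decoded as a normal character, U+FEFF.
text = raw.decode('utf-16')
print(repr(text))                  # '\ufeffdata'
print(unicodedata.name(text[0]))   # ZERO WIDTH NO-BREAK SPACE

# Re-encoding as UTF-8 turns that surviving U+FEFF into EF BB BF,
# which is the byte sequence seen in the old saved test output.
serialized = ('<doc>' + text + '</doc>').encode('utf-8')
print(serialized)                  # b'<doc>\xef\xbb\xbfdata</doc>'


def strip_stray_bom(s: str) -> str:
    """Hypothetical cleanup helper: remove every U+FEFF left after decoding."""
    return s.replace('\ufeff', '')


print(strip_stray_bom(text))       # data
```

So the old test output is consistent with the extra U+FEFF having been kept as
character data rather than reinterpreted as a second BOM.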

> On Aug 19, 2021, at 1:48 PM, Robin Becker <[email protected]> wrote:
> 
> Channeling unicode text experts and xml people:
> 
> I have an xml entity with initial bytes ff fe ff fe which the file command
> says is UTF-16, little-endian text.
> 
> I agree, but what should be done about the additional BOM?
> 
> A test output made many years ago seems to keep the extra BOM. The xml
> context is
> 
> 
> xml file 014.xml
> <!DOCTYPE doc [
> <!ELEMENT doc (#PCDATA)>
> <!ENTITY e SYSTEM "014.ent">
> ]>
> <doc>&e;</doc>
> 
> the entity file 014.ent is bombomdata
> 
> b'\xff\xfe\xff\xfed\x00a\x00t\x00a\x00'
> 
> The old saved test output of processing is
> 
> b'<doc>\xef\xbb\xbfdata</doc>'
> 
> which seems to imply that the extra BOM in the entity has been kept and
> processed into a different BOM meaning utf8.
> 
> I think the test file is wrong and that multiple BOM chars in the entity
> should have been removed.
> 
> Am I right?
> --
> Robin Becker
> 
> -- 
> https://mail.python.org/mailman/listinfo/python-list

