By the rules of Unicode, that character (U+FEFF), if it is not the very first character of the file, should be treated as a “zero-width non-breaking space”; it is NOT a BOM character there.
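As a rough illustration, here is a minimal sketch using Python's built-in "utf-16" codec on the entity bytes quoted from the original message. The variable names are mine, and this is an assumption about how a decoder might behave, not necessarily what the XML processor in question does. It shows that only the first ff fe is consumed as a BOM, and the second one comes through as an ordinary U+FEFF character:

```python
# Bytes of 014.ent as quoted in the original message: ff fe, ff fe, then "data".
raw = b'\xff\xfe\xff\xfed\x00a\x00t\x00a\x00'

# The "utf-16" codec uses the first two bytes as the BOM to pick the byte
# order and drops them; everything after that is decoded as text.
text = raw.decode("utf-16")

# The second ff fe therefore survives as the character U+FEFF.
print(text == "\ufeffdata")   # True
print(len(text))              # 5

# Re-encoding that text as UTF-8 turns U+FEFF into the bytes ef bb bf,
# which look like a UTF-8 BOM but are just the encoded character.
utf8 = text.encode("utf-8")
print(utf8)                   # b'\xef\xbb\xbfdata'
```

That matches the ef bb bf seen in the old saved test output: it is most likely the second U+FEFF carried through as data, not a freshly written UTF-8 BOM.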
The presence of that extra U+FEFF in the files is almost certainly an error, probably caused by broken software or by software processing files in a manner that it wasn’t designed for.

> On Aug 19, 2021, at 1:48 PM, Robin Becker <ro...@reportlab.com> wrote:
>
> Channeling unicode text experts and xml people:
>
> I have xml entity with initial bytes ff fe ff fe which the file command says is
> UTF-16, little-endian text.
>
> I agree, but what should be done about the additional BOM?
>
> A test output made many years ago seems to keep the extra BOM. The xml context is
>
>
> xml file 014.xml
> <!DOCTYPE doc [
> <!ELEMENT doc (#PCDATA)>
> <!ENTITY e SYSTEM "014.ent">
> ]>
> <doc>&e;</doc>
>
> the entity file 014.ent is bombomdata
>
> b'\xff\xfe\xff\xfed\x00a\x00t\x00a\x00'
>
> The old saved test output of processing is
>
> b'<doc>\xef\xbb\xbfdata</doc>'
>
> which seems to imply that the extra BOM in the entity has been kept and
> processed into a different BOM meaning utf8.
>
> I think the test file is wrong and that multiple BOM chars in the entity
> should have been removed.
>
> Am I right?
> --
> Robin Becker
>
> --
> https://mail.python.org/mailman/listinfo/python-list