Re: MS/Unix BOM FAQ again (small fix)

Markus Scherer Wed, 10 Apr 2002 11:11:16 -0700

ICU's "UTF-16" converter does not try to auto-detect the BOM because this seems to be 
something that the _application_ has to decide, not the _converter_ that the 
application instantiates.
This converter name is (currently) only a convenience alias for "use the UTF-16 byte 
serialization that is normally used on this machine".
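
To illustrate what "the serialization normally used on this machine" means, here is a 
minimal Python sketch. It is not ICU code; platform_utf16_name() is a hypothetical 
helper, and the names it returns are Python codec names, not ICU converter names.

```python
import sys

def platform_utf16_name():
    """Return the Python codec name for UTF-16 in this machine's byte order.

    Hypothetical helper for illustration only; it mirrors the idea of a
    platform-endian "UTF-16" alias and never looks at the data for a BOM.
    """
    return "utf-16-le" if sys.byteorder == "little" else "utf-16-be"

# Encoding with the explicit-endian codec writes no BOM.
encoded = "A".encode(platform_utf16_name())
```

On a little-endian machine this yields the bytes 41 00, on a big-endian one 00 41.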


As this discussion shows, whether an initial FF FE or FE FF is interpreted as a 
BOM/signature, as ZWNBSP, or as U+FFFE depends on the protocol and on what other 
information is available.

If a BOM can be expected, then the application should inspect the first few bytes with 
something like ICU's ucnv_detectUnicodeSignature().
This function in turn will provide a string "UTF-16BE" or "UTF-16LE" or "SCSU" or 
"UTF-8" or... and tell how many bytes to skip for the signature.
Then the application can instantiate a converter - not just one of the UTF-16*E but 
possibly a different one.
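
The flow described above (inspect the first bytes, get an encoding name plus a skip 
length, then pick a decoder) can be sketched roughly as follows. This is a Python 
analogue for illustration, not ICU's implementation: detect_signature() and 
decode_with_policy() are hypothetical names, the signature table covers only a few 
encodings (no SCSU), and the fallback encoding is an assumption that a real protocol 
would have to specify.

```python
import codecs

# Longest signatures first, so that UTF-32LE (FF FE 00 00) is not
# mistaken for UTF-16LE (FF FE).
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def detect_signature(data):
    """Return (codec_name, signature_length), or (None, 0) if none matches."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def decode_with_policy(data, expect_bom, fallback="utf-16-be"):
    """The application, not the decoder, decides whether to look for a BOM.

    `fallback` is an assumed default used when no signature is expected
    or none is found.
    """
    if expect_bom:
        name, skip = detect_signature(data)
        if name is not None:
            # Skip the signature bytes and decode the rest.
            return data[skip:].decode(name)
    # No detection: the initial bytes are ordinary data.
    return data.decode(fallback)
```

With expect_bom set, the bytes FF FE 68 00 69 00 decode to "hi". Without it, an initial 
FE FF decoded as UTF-16BE stays in the text as U+FEFF, which is the ZWNBSP reading.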

This has been the consensus for a while.
The implementation could be changed if the consensus in the ICU team changes.

markus
