Folks, I’m sorry to interrupt, but I’ve just woken up to 11 posts in this 
thread and I see a lot of inaccurate 'facts' posted here.  Rather than 
respond to statements in individual posts (which would unfairly single out 
some people as being less accurate than others) I’d like to post facts 
straight from Unicode.org and let you reassess some of the things written 
earlier.

Position of BOM
---------------

A Byte Order Mark is valid only at the beginning of a data stream, so you 
never need to scan a file for one.  If you find the byte sequence of a BOM 
in the middle of a data stream, it is not a BOM: handle it as the Unicode 
character it decodes to in the current encoding (U+FEFF ZERO WIDTH NO-BREAK 
SPACE).  There is no Unicode sequence which means "The encoding is changing.  
The next sequence is the new BOM."
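
As a minimal sketch, this is how a C routine might consume a UTF-8 BOM only 
at offset zero, leaving a FEFF sequence found anywhere else untouched as 
character data.  The function name skip_utf8_bom is mine, not from the FAQ.

    #include <stddef.h>
    #include <string.h>

    /* EF BB BF is the UTF-8 encoding of U+FEFF.  At the start of a
     * buffer it is a BOM; anywhere else it is ZERO WIDTH NO-BREAK
     * SPACE and must be kept as ordinary data. */
    const char *skip_utf8_bom(const char *data, size_t len) {
        if (len >= 3 && memcmp(data, "\xEF\xBB\xBF", 3) == 0) {
            return data + 3;   /* point past the BOM */
        }
        return data;           /* no BOM at offset zero */
    }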

If you look at the first few bytes of a file and can’t identify one of the 
BOMs, there isn’t a (valid) BOM for that data stream and you can assume the 
default, which is UTF-8.  This default exists to allow plain ASCII text in a 
data stream designed for Unicode, since ASCII text is already valid UTF-8.  
If you do not implement it, your software will fail on inputs from small 
chipsets or old APIs which can handle only ASCII.

What BOMs indicate
------------------

BOMs indicate both which UTF encoding is in use and the byte order.  In 
other words, you can not only tell UTF-16LE from UTF-16BE, but you can also 
tell UTF-32LE from UTF-16LE.  To identify the encoding, check the beginning 
of the data stream for these five sequences, testing them in the order 
listed, since FF FE is also the first two bytes of the UTF-32LE BOM (a 
detection sketch in C follows the table):

00 00 FE FF     UTF-32, big-endian
FF FE 00 00     UTF-32, little-endian
FE FF           UTF-16, big-endian
FF FE           UTF-16, little-endian
EF BB BF        UTF-8
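
As a minimal, non-normative sketch, here is a C detector which tests the 
five sequences in the order listed and falls back to the UTF-8 default 
described earlier when no BOM is present.  The type and function names 
(encoding, detect_encoding, ENC_*) are mine, not from the Unicode FAQ.

    #include <stddef.h>
    #include <string.h>

    typedef enum {
        ENC_UTF32BE, ENC_UTF32LE, ENC_UTF16BE, ENC_UTF16LE, ENC_UTF8
    } encoding;

    /* Returns the detected encoding and stores the BOM length in
     * *bom_len.  The four-byte UTF-32 patterns are tested before the
     * UTF-16 ones because FF FE also begins the UTF-32LE BOM. */
    encoding detect_encoding(const unsigned char *p, size_t len,
                             size_t *bom_len) {
        if (len >= 4 && memcmp(p, "\x00\x00\xFE\xFF", 4) == 0) {
            *bom_len = 4; return ENC_UTF32BE;
        }
        if (len >= 4 && memcmp(p, "\xFF\xFE\x00\x00", 4) == 0) {
            *bom_len = 4; return ENC_UTF32LE;
        }
        if (len >= 2 && memcmp(p, "\xFE\xFF", 2) == 0) {
            *bom_len = 2; return ENC_UTF16BE;
        }
        if (len >= 2 && memcmp(p, "\xFF\xFE", 2) == 0) {
            *bom_len = 2; return ENC_UTF16LE;
        }
        if (len >= 3 && memcmp(p, "\xEF\xBB\xBF", 3) == 0) {
            *bom_len = 3; return ENC_UTF8;
        }
        *bom_len = 0;
        return ENC_UTF8;   /* no BOM: assume the default, UTF-8 */
    }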

As you can see, having a data stream start with FF FE does not definitely 
tell you that it’s a UTF-16 data stream: those same two bytes also begin the 
UTF-32LE BOM.  Be careful.  Also be careful of software/protocols/APIs which 
assume that 00 bytes indicate the end of a data stream.

As you can see, although the BOM for each 16 or 32 bit format is the same 
size as that format's code unit (two or four bytes), this is not true of the 
BOM for UTF-8, which is three bytes long.  Be careful.

How to handle BOMs in software/protocols/APIs
---------------------------------------------

Establish whether each field can handle all kinds of Unicode and understand 
BOMs, or whether the field understands only one kind of Unicode.  If the 
latter, state this in the documentation, including which kind of Unicode it 
understands.

There is no convention for "This software understands both UTF-16BE and 
UTF-16LE but nothing else."  If it handles any BOMs, it should handle all 
five.  However, it can handle them by identifying, for example, UTF-32BE and 
returning an error indicating that it can’t handle any encoding which isn’t 
16 bit.  A sketch of this approach follows.
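
As an illustration only, here is such an entry point in C, reusing the 
hypothetical detect_encoding() helper sketched above: a field which 
understands only UTF-16 but still recognises all five BOMs, rejecting the 
others with a clear error instead of misreading them.  The names 
handle_utf16_field and ERR_UNSUPPORTED_ENCODING are mine.

    #define ERR_UNSUPPORTED_ENCODING (-1)

    /* Sketch: a field that processes only UTF-16 input.  It identifies
     * the encoding first, then rejects what it cannot handle. */
    int handle_utf16_field(const unsigned char *p, size_t len) {
        size_t bom_len;
        encoding enc = detect_encoding(p, len, &bom_len);
        if (enc != ENC_UTF16BE && enc != ENC_UTF16LE) {
            return ERR_UNSUPPORTED_ENCODING;  /* identified, but not 16 bit */
        }
        /* ... process the UTF-16 payload starting at p + bom_len ... */
        return 0;
    }

Note that under this sketch a BOM-less stream defaults to UTF-8 and is 
therefore rejected; whether that is right for your API is a design decision.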

Try to be consistent across all fields in your protocol/API.

References:

<http://unicode.org/faq/utf_bom.html>