Folks, I’m sorry to interrupt, but I’ve just woken up to 11 posts in this thread and I see a lot of inaccurate 'facts' posted here. Rather than pick up on statements in individual posts (which would unfairly single out some people as less accurate than others), I’d like to post facts straight from Unicode.org and let you reassess some of the things written earlier.
Position of BOM
---------------

A Byte Order Mark is valid only at the beginning of a data stream. You never need to scan a file for it. If you find the byte sequence of a BOM in the middle of a data stream, it is not a BOM and you should handle it as Unicode characters in the current encoding (for example, ZERO WIDTH NO-BREAK SPACE). There is no Unicode sequence which means "The encoding is changing. What follows is the new BOM."

If you look at the first few bytes of a data stream and cannot identify one of the BOMs, there isn't (a valid) one for that data stream and you can assume the default, which is UTF-8. This is done to allow the use of ASCII text in a data stream which was designed for Unicode. If you do not implement it, your software will fail for input produced by small chipsets or old APIs which can handle only ASCII.

What BOMs indicate
------------------

BOMs indicate both which form of UTF is in use and the byte order. In other words, you can not only tell UTF-16LE from UTF-16BE, but you can also tell UTF-32LE from UTF-16LE. To identify the encoding, check the beginning of the data stream for these five sequences, starting from the first one listed:

00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian
FE FF          UTF-16, big-endian
FF FE          UTF-16, little-endian
EF BB BF       UTF-8

As you can see, having a data stream start with FF FE does not definitely tell you that it's a UTF-16 data stream: it may be the start of the UTF-32LE BOM, which is why the UTF-32 sequences must be checked first. Be careful. Also be careful of software/protocols/APIs which assume that 00 bytes indicate the end of a data stream. And note that although the BOMs for the 16- and 32-bit forms are the same length as one code unit of those forms, this is not true of the three-byte BOM for UTF-8. Be careful.

How to handle BOMs in software/protocols/APIs
---------------------------------------------

Establish whether each field can handle all kinds of Unicode and understands BOMs, or whether the field understands only one kind of Unicode.
If the latter, state this in the documentation, including which kind of Unicode it understands. There is no convention for saying "This software understands both UTF-16BE and UTF-16LE but nothing else." If it handles any BOMs, it should handle all five. However, it can handle them by identifying, for example, UTF-32BE and returning an error indicating that it cannot handle encodings which aren't 16-bit. Try to be consistent across all fields in your protocol/API.

References: <http://unicode.org/faq/utf_bom.html>

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users