I didn’t mean to imply you had to scan the whole content for a BOM, but rather 
for illegal characters in the absence of a BOM.

On 6/26/17, 10:02 AM, "sqlite-users on behalf of Simon Slavin" 
<[email protected] on behalf of 
[email protected]> wrote:

    Folks, I’m sorry to interrupt but I’ve just woken up to 11 posts in this 
thread and I see a lot of inaccurate 'facts' posted here.  Rather than pick up 
on statements in individual posts (which would unfairly pick on some people as 
being less accurate than others) I’d like to post facts straight from 
Unicode.org and let you reassess some of the things written earlier.
    
    Position of BOM
    ---------------
    
    A Byte Order Mark is valid only at the beginning of a data stream.  You 
never need to scan a file for it.  If you find the byte sequence of a BOM in 
the middle of a datastream, it’s not a BOM and you should handle it as an 
ordinary character in the current encoding (U+FEFF, ZERO WIDTH NO-BREAK 
SPACE).  There is no Unicode sequence which means "Encoding is changing.  The 
next sequence is the new BOM."
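
    To make that concrete, here is a minimal sketch in C (the helper name is 
mine, not from Unicode.org): after decoding to code points, drop U+FEFF only 
when it is the very first code point; anywhere else, keep it.

        #include <stddef.h>
        #include <stdint.h>

        /* Sketch: U+FEFF is a BOM only at offset 0.  A U+FEFF anywhere
           else is ZERO WIDTH NO-BREAK SPACE, an ordinary character,
           and must be kept.  Returns the new length. */
        static size_t strip_leading_bom(uint32_t *cp, size_t n) {
            if (n > 0 && cp[0] == 0xFEFF) {
                for (size_t i = 1; i < n; i++)
                    cp[i - 1] = cp[i];
                return n - 1;
            }
            return n;
        }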
    
    If you look at the first few bytes of a file and can’t identify one of 
the BOMs, there isn’t a valid one for that data stream and you can assume the 
default, which is UTF-8.  This default exists to allow plain ASCII text in a 
datastream designed for Unicode.  If you do not implement it, your software 
will fail on input from small chipsets or old APIs which can handle only 
ASCII.
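
    The reason the default is safe: every 7-bit ASCII byte is already a 
complete, valid UTF-8 sequence, so BOM-less ASCII input decodes correctly 
under it.  A quick sketch of that check in C (the function name is mine):

        #include <stddef.h>

        /* Sketch: bytes <= 0x7F are simultaneously valid ASCII and
           valid single-byte UTF-8 sequences. */
        static int is_plain_ascii(const unsigned char *p, size_t n) {
            for (size_t i = 0; i < n; i++)
                if (p[i] > 0x7F) return 0;
            return 1;
        }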
    
    What BOMs indicate
    ------------------
    
    BOMs indicate both which type of UTF is in use and the byte order.  In 
other words you can not only tell UTF-16LE from UTF-16BE, but you can also 
tell UTF-32LE from UTF-16LE.  To identify the encoding, check the beginning 
of the datastream for these five sequences, starting from the first one 
listed (the four-byte sequences must be tested first, since FF FE is a prefix 
of FF FE 00 00):
    
    00 00 FE FF UTF-32, big-endian
    FF FE 00 00 UTF-32, little-endian
    FE FF       UTF-16, big-endian
    FF FE       UTF-16, little-endian
    EF BB BF    UTF-8
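
    A minimal sketch of that check in C (detect_bom and the type names are 
mine, not from Unicode.org): the four-byte sequences are tested first, and 
when nothing matches the result falls back to the UTF-8 default described 
earlier.

        #include <stddef.h>
        #include <string.h>

        typedef enum {
            ENC_UTF8, ENC_UTF16BE, ENC_UTF16LE, ENC_UTF32BE, ENC_UTF32LE
        } encoding;

        typedef struct {
            encoding enc;    /* detected (or default) encoding        */
            size_t bom_len;  /* bytes to skip; 0 means no BOM present */
        } bom_result;

        static bom_result detect_bom(const unsigned char *p, size_t n) {
            /* Order matters: FF FE 00 00 must be tried before FF FE. */
            if (n >= 4 && memcmp(p, "\x00\x00\xFE\xFF", 4) == 0)
                return (bom_result){ENC_UTF32BE, 4};
            if (n >= 4 && memcmp(p, "\xFF\xFE\x00\x00", 4) == 0)
                return (bom_result){ENC_UTF32LE, 4};
            if (n >= 2 && memcmp(p, "\xFE\xFF", 2) == 0)
                return (bom_result){ENC_UTF16BE, 2};
            if (n >= 2 && memcmp(p, "\xFF\xFE", 2) == 0)
                return (bom_result){ENC_UTF16LE, 2};
            if (n >= 3 && memcmp(p, "\xEF\xBB\xBF", 3) == 0)
                return (bom_result){ENC_UTF8, 3};
            /* No recognisable BOM: assume the default, UTF-8. */
            return (bom_result){ENC_UTF8, 0};
        }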
    
    As you can see, having a datastream start with FF FE does not definitely 
tell you that it’s a UTF-16 datastream; those bytes may equally be the start 
of the UTF-32LE BOM.  Be careful.  Also be careful of software/protocols/APIs 
which assume that 00 bytes indicate the end of a datastream.
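
    For example, with the detect_bom sketch above, the bytes FF FE 00 00 2A 
00 00 00 are reported as UTF-32LE with a four-byte BOM; a checker which tried 
the two-byte sequences first would wrongly report UTF-16LE and then see 
spurious NUL characters.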
    
    As you can see, although the BOMs for the 16- and 32-bit formats are the 
same size as those formats’ code units, this is not true of UTF-8: its BOM is 
three bytes long.  Be careful.
    
    How to handle BOMs in software/protocols/APIs
    ---------------------------------------------
    
    Establish whether each field can handle all kinds of Unicode and 
understands BOMs, or whether the field understands only one kind of Unicode.  
If the latter, state this in the documentation, including which kind of Unicode 
it understands.
    
    There is no convention for "This software understands both UTF-16BE and 
UTF-16LE but nothing else".  If it handles any BOMs, it should handle all 
five.  It may, however, handle some of them by identifying the encoding (for 
example UTF-32BE) and returning an error saying that it can’t handle 
encodings which aren’t 16-bit.
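
    A sketch of that pattern in C (the error codes and function name are 
mine; bom_result and detect_bom are from the earlier sketch): the field 
recognises all five BOMs, but refuses the 32-bit encodings explicitly instead 
of misreading their bytes.

        #include <stddef.h>

        #define OK            0
        #define ERR_NOT_16BIT 1   /* recognised BOM, unsupported encoding */

        /* Hypothetical entry point: accept UTF-8 and both UTF-16 byte
           orders; identify UTF-32 input and return a clear error. */
        static int open_text_field(const unsigned char *p, size_t n) {
            bom_result r = detect_bom(p, n);   /* earlier sketch */
            if (r.enc == ENC_UTF32BE || r.enc == ENC_UTF32LE)
                return ERR_NOT_16BIT;
            /* ... decode p + r.bom_len as encoding r.enc ... */
            return OK;
        }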
    
    Try to be consistent across all fields in your protocol/API.
    
    References:
    
    <http://unicode.org/faq/utf_bom.html>

_______________________________________________
sqlite-users mailing list
[email protected]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
