Hi all,

Great to hear so many views on detecting encoding.
I would also like to share something related to detecting UTF-8 encoding.
As most of you know, we can check whether a stream of bytes is UTF-8
encoded: every character must match one of the byte sequences below.
If a byte does not fit any of these patterns, we simply consider the stream
not to be UTF-8.

Unicode range                   UTF-8 encoded bytes
U-00000000 - U-0000007F:        0xxxxxxx
U-00000080 - U-000007FF:        110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF:        1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF:        11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF:        111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF:        1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
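
For illustration, here is a minimal validation sketch in C along those
lines. It checks only the lead-byte / continuation-byte structure from the
table; it does not reject overlong encodings, and note that RFC 3629 has
since restricted UTF-8 to at most four bytes (code points up to U+10FFFF),
so the 5- and 6-byte rows above are the original, now-obsolete definition.

#include <stddef.h>

/* Minimal sketch: returns 1 if buf[0..len) is well-formed UTF-8
 * according to the bit patterns in the table above, 0 otherwise.
 * Overlong encodings and the surrogate range are NOT rejected here. */
static int is_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;                            /* trailing 10xxxxxx bytes */
        if      (c <= 0x7F)          extra = 0;  /* 0xxxxxxx            */
        else if ((c & 0xE0) == 0xC0) extra = 1;  /* 110xxxxx            */
        else if ((c & 0xF0) == 0xE0) extra = 2;  /* 1110xxxx            */
        else if ((c & 0xF8) == 0xF0) extra = 3;  /* 11110xxx            */
        else if ((c & 0xFC) == 0xF8) extra = 4;  /* 111110xx (obsolete) */
        else if ((c & 0xFE) == 0xFC) extra = 5;  /* 1111110x (obsolete) */
        else return 0;                           /* invalid lead byte   */

        if (extra >= len - i) return 0;          /* sequence is truncated */
        for (size_t j = 1; j <= extra; j++)
            if ((buf[i + j] & 0xC0) != 0x80)     /* must be 10xxxxxx */
                return 0;
        i += extra + 1;
    }
    return 1;
}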
Similarly, using the above principle, we can write our own functions that
convert wide characters to UTF-8 and vice versa (a sketch of one direction
follows below). As far as I can tell, this will work (am I right??).
This approach will surely help because we don't have to rely on the library:
for example, some UTF-8 conversion functions require the locale to be set to
an xxx.UTF-8 locale, so using them creates a dependency on such a locale
being available.
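
Here is a sketch of the wide-char-to-UTF-8 direction, assuming the wide
character holds a Unicode code point (true for glibc's 4-byte wchar_t, but
not on Windows, where wchar_t is a UTF-16 unit and surrogate pairs would
need extra handling). It is restricted here to the 4-byte RFC 3629 range;
extending it to the 5- and 6-byte rows of the table is mechanical.

#include <stddef.h>

/* Sketch: encode one code point as UTF-8 (RFC 3629 range only).
 * Writes 1-4 bytes into out[] and returns the count, or 0 on error.
 * cp is taken as an unsigned value to sidestep wchar_t signedness. */
static size_t cp_to_utf8(unsigned long cp, unsigned char *out)
{
    if (cp <= 0x7F) {                   /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                  /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                 /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                           /* beyond the Unicode range */
}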

But there is one concern. In some cases the UTF-8 byte stream starts with a
BOM: for example, when we read bytes from a text file saved with Notepad
(using the UTF-8 option) on Win2K, the actual text starts only after the
first few bytes (I suppose the first 3 bytes).
So how do we detect whether the byte stream starts with a BOM or not?
In other words, do the first few bytes represent a BOM or the actual text?
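
For what it's worth, the UTF-8 BOM is U+FEFF encoded as the fixed three
bytes EF BB BF, so a simple prefix check works. Strictly speaking, the
bytes alone cannot tell a BOM apart from text that genuinely begins with
U+FEFF (zero-width no-break space), so the usual convention is to treat a
leading EF BB BF as a BOM and skip it. A sketch:

#include <stddef.h>

/* The UTF-8 BOM is U+FEFF encoded as the three bytes EF BB BF.
 * Returns the number of bytes to skip: 3 if a BOM is present, else 0. */
static size_t skip_utf8_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
    return 0;
}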

with regards
( DC )
deepak chand rathore 
