Re: detecting encoding in plain text (related to utf8)

Markus Scherer Wed, 14 Jan 2004 10:59:10 -0800

Deepak Chand Rathore wrote:

> Unicode range                 UTF-8 encoded bytes
> U-00000000 - U-0000007F:      0xxxxxxx
> U-00000080 - U-000007FF:      110xxxxx 10xxxxxx
> U-00000800 - U-0000FFFF:      1110xxxx 10xxxxxx 10xxxxxx

> ...

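This table is not correct. Bit patterns like these also match byte sequences that are not legal UTF-8, such as non-shortest forms (C0 80) and encoded surrogates (ED A0 80). Please check Unicode 3.2 or Unicode 4.0 for the correct table:

Table 3.1B, "Legal UTF-8 Byte Sequences", in
http://www.unicode.org/reports/tr28/#3_1_conformance (Unicode 3.2)
or the Conformance chapter in http://www.unicode.org/versions/Unicode4.0.0/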
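
For illustration only, here is a minimal Python sketch of a validity check driven by the byte ranges of the legal-sequence table, as I understand that table. The names `_WELL_FORMED` and `is_well_formed_utf8` are made up for this example and come from no library; check the ranges against the standard before relying on them.

```python
# Byte ranges for legal UTF-8 sequences, one row per lead-byte range.
# Each row is: (lead range, trail range, trail range, ...).
# Assumption: these ranges mirror the legal-sequence table in the standard.
_WELL_FORMED = [
    ((0x00, 0x7F),),
    ((0xC2, 0xDF), (0x80, 0xBF)),
    ((0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)),
    ((0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)),
    ((0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)),
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in data matches a row of the table."""
    i = 0
    n = len(data)
    while i < n:
        # Lead-byte ranges are disjoint, so at most one row can match.
        for pattern in _WELL_FORMED:
            lo, hi = pattern[0]
            if lo <= data[i] <= hi:
                break
        else:
            return False  # byte cannot start a legal sequence (80..C1, F5..FF)
        if i + len(pattern) > n:
            return False  # sequence is truncated at the end of the input
        for offset, (lo, hi) in enumerate(pattern[1:], start=1):
            if not lo <= data[i + offset] <= hi:
                return False  # trail byte outside the range allowed for this lead
        i += len(pattern)
    return True
```

With this sketch, `is_well_formed_utf8(b"\xc0\x80")` and `is_well_formed_utf8(b"\xed\xa0\x80")` both return False, although both sequences fit the bit patterns in the quoted table.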

> But there is one concern. In some cases the UTF-8 byte stream starts with a
> BOM (e.g. when we read bytes from a text file saved with Notepad's UTF-8
> option on Windows 2000); the actual text starts after the first few bytes
> (I suppose the first 3 bytes).
> So how do we detect whether the byte stream starts with a BOM, i.e. whether
> the first few bytes represent a BOM or the actual text?

There is a whole FAQ section on this topic at http://www.unicode.org/faq/utf_bom.html#BOM
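
As a rough sketch of the usual convention (not a substitute for the FAQ): the UTF-8 form of U+FEFF is the three bytes EF BB BF, so a reader can test for that prefix and skip it. The helper name `split_utf8_bom` is hypothetical; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library.

```python
import codecs
from typing import Tuple


def split_utf8_bom(data: bytes) -> Tuple[bool, bytes]:
    """Return (has_bom, payload), where payload is data without a leading BOM.

    Assumption: an initial EF BB BF is treated as a signature, not as text.
    """
    if data.startswith(codecs.BOM_UTF8):  # codecs.BOM_UTF8 == b"\xef\xbb\xbf"
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


# The built-in "utf-8-sig" codec applies the same rule while decoding:
# it drops one leading BOM if present and otherwise behaves like "utf-8".
text = b"\xef\xbb\xbfabc".decode("utf-8-sig")
```

A stream that has only the first one or two of those bytes is not treated as having a BOM here; it is passed through unchanged.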

Best regards,
markus

--
Opinions expressed here may not reflect my company's positions unless otherwise noted.
