> ...unicode range            utf 8 encoded bytes
> U-00000000 - U-0000007F:   0xxxxxxx
> U-00000080 - U-000007FF:   110xxxxx 10xxxxxx
> U-00000800 - U-0000FFFF:   1110xxxx 10xxxxxx 10xxxxxx
This table is not correct. Please check Unicode 3.2 or Unicode 4 for the correct table.
See Table 3.1B, "Legal UTF-8 Byte Sequences", in http://www.unicode.org/reports/tr28/#3_1_conformance (Unicode 3.2), or the Conformance chapter in http://www.unicode.org/versions/Unicode4.0.0/ (Unicode 4.0).
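To show what a check against that table could look like, here is a minimal sketch in Python. The byte ranges are my own transcription of the legal-sequence table, and the names LEGAL_SEQUENCES and is_legal_utf8 are made up for this example, so please verify the ranges against the standard before relying on them.

```python
# Each row: (range of the lead byte, [allowed range for each trailing byte]).
# Transcribed from the legal UTF-8 byte sequence table; verify against the
# standard before use.
LEGAL_SEQUENCES = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEC), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xED, 0xED), [(0x80, 0x9F), (0x80, 0xBF)]),
    ((0xEE, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]


def is_legal_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in data matches a row of the table."""
    i = 0
    while i < len(data):
        lead = data[i]
        for (lo, hi), trail_ranges in LEGAL_SEQUENCES:
            if lo <= lead <= hi:
                break
        else:
            # Bytes 80..C1 and F5..FF never start a legal sequence.
            return False
        trail = data[i + 1 : i + 1 + len(trail_ranges)]
        if len(trail) != len(trail_ranges):
            return False  # sequence is truncated at the end of the input
        for byte, (tlo, thi) in zip(trail, trail_ranges):
            if not tlo <= byte <= thi:
                return False
        i += 1 + len(trail_ranges)
    return True
```

The restricted second-byte ranges are what the simple bit-pattern table misses: they rule out non-shortest forms such as C0 80 and the surrogate range ED A0..BF.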
> But there is one concern. In some cases the UTF-8 byte stream starts with a BOM. For example, when we read bytes from a text file saved with Notepad (using the UTF-8 option) on Windows 2000, the actual text starts only after the first few bytes (I suppose the first 3 bytes). So how do we detect whether the byte stream starts with a BOM, that is, whether the first few bytes represent a BOM or the actual text?
There is a whole FAQ section on this topic at http://www.unicode.org/faq/utf_bom.html#BOM
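In short, the UTF-8 form of the BOM (U+FEFF) is the three bytes EF BB BF, so the check is a comparison of the first three bytes. A minimal sketch in Python, assuming the whole stream is already in memory and that a leading EF BB BF is to be treated as a signature rather than as text (split_utf8_bom is a name made up for this example):

```python
import codecs


def split_utf8_bom(data: bytes):
    """Return (has_bom, payload), where payload is data without a leading BOM."""
    # codecs.BOM_UTF8 is the byte string b"\xef\xbb\xbf".
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data
```

For example, split_utf8_bom(b"\xef\xbb\xbfhello") gives (True, b"hello"), and input without the signature comes back unchanged. Python's built-in "utf-8-sig" codec performs the same skip while decoding.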
Best regards, markus
-- Opinions expressed here may not reflect my company's positions unless otherwise noted.