Re: Distinguishing between ASCII and UTF8

Jeff Massung Wed, 06 Oct 2010 13:29:43 -0700

On Wed, Oct 6, 2010 at 3:23 PM, Richard Gaskin
<ambassa...@fourthworld.com> wrote:


> I have an app that needs to auto-detect Unicode and plain text, and render
> them correctly based on that auto-detection.
>
> I have the UTF16 stuff working, but with UTF8 I have a problem:  there is
> no BOM to let me know if it's Unicode, and some plain text files will
> occasionally have high-ASCII values in them (like the dagger symbol).
>
> What patterns should I be looking for in the binary data of a file to
> distinguish UTF8 from plain text?
>
>
Sorry, Richard, but I believe you are out of luck here. The idea behind UTF8
is that it's indistinguishable from ASCII for byte values 0-127. You may be
able to scan the files and, if they are large enough, try to deduce something
from them to know which they are. For example:

On Windows, lines should end with "\r\n" (13, 10). If they do, it could very
well be a text file.
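
A rough sketch of the line-ending check in Python. The function name is made
up for illustration, and a high ratio is only a hint that the file is
Windows-style text, not proof of any particular encoding:

```python
def crlf_line_ratio(data: bytes) -> float:
    """Return the fraction of line feeds (10) preceded by a carriage return (13)."""
    total_lf = data.count(b"\n")
    if total_lf == 0:
        # No line feeds at all, so there is nothing to measure.
        return 0.0
    # Every "\r\n" pair contains exactly one line feed.
    return data.count(b"\r\n") / total_lf
```

A value close to 1.0 means nearly every line ends in "\r\n".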

In ASCII text there should never be a NUL (byte 0) anywhere. There are
likely many 0-byte values in any appreciably large UTF16 file (though not in
UTF8, which only produces a 0 byte for the NUL character itself). Byte 8
(backspace), byte 7 (the bell), and probably a few other control bytes should
likewise not appear in plain text.
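
The control-byte check could look something like this in Python. The name and
the exact set of "suspicious" bytes are assumptions for the sake of the
example: here it is everything below 32 except tab, line feed, form feed and
carriage return.

```python
# Control bytes that ordinary text files rarely contain: everything below
# 0x20 except tab (9), line feed (10), form feed (12), carriage return (13).
# This particular set is an assumption chosen for illustration.
SUSPICIOUS = frozenset(range(0x20)) - {9, 10, 12, 13}


def count_suspicious_bytes(data: bytes) -> int:
    """Count bytes such as NUL (0), bell (7) and backspace (8)."""
    return sum(1 for b in data if b in SUSPICIOUS)
```

For plain text the count should be 0. ASCII-range text stored as UTF16, on the
other hand, carries a 0 byte for every character, so the count climbs quickly.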

If the proportion of bytes that have the high bit (0x80) set is extremely low
(far below 1%) then most likely it's ASCII.
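
As a sketch, the high-bit count might be done like this in Python. Both
function names are hypothetical, and the 0.01 default simply mirrors the
"under 1%" guess above rather than any tested cutoff:

```python
def high_bit_ratio(data: bytes) -> float:
    """Return the fraction of bytes with the high bit (0x80) set."""
    if not data:
        return 0.0
    high = sum(1 for b in data if b & 0x80)
    return high / len(data)


def looks_mostly_ascii(data: bytes, threshold: float = 0.01) -> bool:
    """Guess "ASCII" when fewer than `threshold` of the bytes are high-bit bytes."""
    return high_bit_ratio(data) < threshold
```

This only says how rare the high-bit bytes are. It does not say which encoding
the few that remain belong to.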

HTH,

Jeff M.
_______________________________________________
use-revolution mailing list
use-revolution@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
