On Wed, Oct 6, 2010 at 3:23 PM, Richard Gaskin <ambassa...@fourthworld.com> wrote:
> I have an app that needs to auto-detect Unicode and plain text, and render
> them correctly based on that auto-detection.
>
> I have the UTF16 stuff working, but with UTF8 I have a problem: there is
> no BOM to let me know if it's Unicode, and some plain text files will
> occasionally have high-ASCII values in them (like the dagger symbol).
>
> What patterns should I be looking for in the binary data of a file to
> distinguish UTF8 from plain text?

Sorry, Richard, but I believe you are out of luck here. The idea behind UTF8 is that, for the values 0-127, it's byte-for-byte identical to ASCII, so a file that sticks to those values looks the same either way.

You may be able to scan the files and, if they are large enough, try to deduce something from them to know which they are. For example:

On Windows, lines should end with "\r\n" (bytes 13, 10). If you see those, it could very well be a text file.

In ASCII text there will never be a NUL (byte 0) anywhere. There are likely many 0-byte values in any appreciably large UTF16 file. The same goes for byte 8 (backspace), byte 7 (the bell), and probably a few other control codes.

If the number of bytes that have the high bit (0x80) set is extremely low (<<< 1%), then most likely it's ASCII.

HTH,

Jeff M.
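P.S. Here is a rough sketch of the byte-level checks described above, in Python rather than Revolution, just to make the idea concrete. The function names (byte_stats, guess_kind) and the 0.1% cutoff are my own assumptions for illustration, not anything standard, and the result is only a guess, never a definite answer.

```python
def byte_stats(data: bytes) -> dict:
    """Collect the simple byte-level clues from a file's raw contents."""
    total = len(data)
    # Count bytes with the high bit (0x80) set.
    high = sum(1 for b in data if b & 0x80)
    return {
        "nul_bytes": data.count(b"\x00"),
        "bell_or_backspace": data.count(b"\x07") + data.count(b"\x08"),
        "crlf_lines": data.count(b"\r\n"),
        "high_bit_ratio": high / total if total else 0.0,
    }


def guess_kind(data: bytes, high_bit_cutoff: float = 0.001) -> str:
    """Return a rough guess. high_bit_cutoff is an arbitrary assumption."""
    stats = byte_stats(data)
    # NUL, bell, or backspace bytes should not show up in ordinary ASCII text.
    if stats["nul_bytes"] or stats["bell_or_backspace"]:
        return "probably not plain ASCII text"
    # Very few high-bit bytes: most likely ASCII with the odd stray symbol.
    if stats["high_bit_ratio"] <= high_bit_cutoff:
        return "most likely ASCII"
    return "undetermined"
```

You would read the file in binary mode and pass the bytes to guess_kind(); the "crlf_lines" count from byte_stats() is there as an extra hint and is not used in the guess itself.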