Re: Determine encoding of file

Michael Ash Fri, 30 Jul 2010 20:52:09 -0700

On Fri, Jul 30, 2010 at 6:24 PM, Nick Zitzmann <n...@chronosnet.com> wrote:
>
> On Jul 30, 2010, at 4:09 PM, Dave DeLong wrote:
>
>> Hi everyone,
>>
>> I have a seemingly simple question, but I haven't been able to figure it out.
>>
>> Given a file, how can I determine the NSStringEncoding of the file, without 
>> reading the entire file into memory?  (If the file isn't a text file, then 
>> defaulting to NSUTF8StringEncoding is just fine, since my code will only 
>> work properly if I'm working with text files anyway)
>>
>> I've found this: 
>> http://www.macosxguru.net/article.php?story=20030808081801868 but it seems 
>> ridiculously complex...
>
> Check the first two bytes. If they are 0xFEFF or 0xFFFE, then it is 
> guaranteed to be in Unicode (UTF-16) format. Otherwise, it can be in pretty 
> much any format, since pretty much every format that is not Unicode doesn't 
> use identifiers of any sort.


A nitpick: starting with those two bytes is a *strong suggestion* that
it's UTF-16, but it could just be, say, a Latin-1 file that starts
with "þÿ", or a random binary file that happens to start with that
byte sequence.
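
To make the "strong suggestion" point concrete, here is a minimal sketch of a byte-order-mark check that only needs the first few bytes of the file. It is written in Python for brevity rather than Cocoa, and the name bom_hint is made up for illustration. It returns a hint, not a verdict:

```python
import codecs

# Signatures checked by this sketch. Only UTF-8 and UTF-16 are covered;
# UTF-32 is deliberately left out to keep the example small.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def bom_hint(prefix: bytes):
    """Return the encoding suggested by a leading BOM, or None.

    `prefix` is the first few bytes of the file. A match is only a
    hint: a Latin-1 or binary file can start with the same bytes.
    """
    for bom, name in _BOMS:
        if prefix.startswith(bom):
            return name
    return None
```

Note that a Latin-1 file beginning with "þÿ" starts with the bytes FE FF, so this check reports utf-16-be for it as well, which is exactly the false positive described above.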

One fact that can be extremely useful for this sort of thing but
which seems to be little-known: due to the structure of UTF-8 it's
rare for a file to be valid UTF-8 by accident. Random data, or data
that isn't intended to be structured like UTF-8, is extremely unlikely
to happen to match the structure required by UTF-8 by coincidence.
Thus, if a file parses as UTF-8, you can be pretty confident that it
was supposed to be interpreted in that encoding.
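
As a rough sketch of how that check could be done without loading the whole file, the following Python reads the file in fixed-size chunks and feeds them to an incremental UTF-8 decoder. The function name looks_like_utf8 and the default chunk size are assumptions chosen for illustration, not anything from Cocoa:

```python
import codecs

def looks_like_utf8(path, chunk_size=64 * 1024):
    """Return True if the whole file decodes as strict UTF-8.

    Only `chunk_size` bytes are held in memory at a time. The
    incremental decoder keeps track of multi-byte sequences that
    happen to be split across chunk boundaries.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                decoder.decode(chunk)
        # final=True makes a truncated trailing sequence an error.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

If this returns False, falling back to some other guess (Latin-1, the user's preferred encoding, and so on) is still up to the caller; the sketch only answers the "is it plausibly UTF-8" question.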

The same is, alas, not true of UTF-16.
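
A small illustration of the difference, again as a Python sketch with a made-up helper name (decodes_as): plain ASCII text of even length decodes as UTF-16 without complaint, while Latin-1 text with an accented character fails to decode as UTF-8.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes in `encoding` without error."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True

ascii_text = b"hi there"                   # 8 bytes of plain ASCII
latin1_text = "caf\u00e9".encode("latin-1")  # ends with the byte E9

# ASCII bytes pair up into valid (if meaningless) UTF-16 code units.
utf16_accepts_ascii = decodes_as(ascii_text, "utf-16-le")

# A lone E9 byte is an incomplete UTF-8 sequence, so this is rejected.
utf8_accepts_latin1 = decodes_as(latin1_text, "utf-8")
```

Here utf16_accepts_ascii comes out True and utf8_accepts_latin1 comes out False, which is why a successful UTF-16 decode tells you much less than a successful UTF-8 decode.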

Mike
_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)
