On Sun, 15 Nov 2009 04:34:10 -0800, Chris Rebert wrote:

>>> I'm looking for a way to be able to load a generic file from the
>>> system and understand if it is plain text.
>>> The mimetypes module has some nice methods, but for example it's not
>>> working for files without an extension.
>>>
>>> Any suggestion?
>>
>> You could use the "file" command. It's normally installed by default on
>> Unix systems, but you can get a Windows version from:
> 
> FWIW, IIRC the heuristic `file` uses to check whether a file is text
> or not is whether it contains any null bytes; if it does, it
> classifies it as binary (i.e. not text).

"file" provides more granularity than that, recognising many specific
formats, both text and binary.

First, it uses "magic number" checks, looking for known signature bytes
(e.g. "#!" or "JFIF") at or near the beginning of the file. If those checks
fail, it checks for common text encodings. If those also fail, it reports
"data".
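To make that ordering concrete, here is a rough Python sketch of the same three stages. It is not how "file" is actually implemented: the two-entry signature table, the list of encodings, the result strings and the names classify() and looks_like_text() are all my own assumptions, chosen only for illustration.

```python
import codecs

# Hypothetical, deliberately tiny signature table:
# (offset, signature bytes, description). The real "file" database is far larger.
SIGNATURES = [
    (0, b"#!", "script text executable"),
    (6, b"JFIF", "JPEG image data"),  # "JFIF" follows the 6-byte JPEG/APP0 header
]

# Encodings tried for data without a byte-order mark (an illustrative choice).
TEXT_ENCODINGS = ["ascii", "utf-8"]


def looks_like_text(head, encoding):
    """Return True if head decodes cleanly and holds no control characters."""
    decoder = codecs.getincrementaldecoder(encoding)()
    try:
        # final=False tolerates a multi-byte character cut off at the end.
        text = decoder.decode(head, final=False)
    except UnicodeDecodeError:
        return False
    return all(ch.isprintable() or ch in "\t\n\r\f" for ch in text)


def classify(head):
    """Classify the leading bytes of a file: known format, text, or data."""
    if not head:
        return "empty"
    # Stage 1: magic-number checks.
    for offset, signature, description in SIGNATURES:
        if head[offset:offset + len(signature)] == signature:
            return description
    # Stage 2: common text encodings.
    for encoding in TEXT_ENCODINGS:
        if looks_like_text(head, encoding):
            return encoding + " text"
    # UTF-16 is only attempted when a byte-order mark is present.
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        if looks_like_text(head, "utf-16"):
            return "utf-16 text"
    # Stage 3: nothing matched.
    return "data"


def classify_file(path, size=4096):
    """Classify a file on disk by reading only its first size bytes."""
    with open(path, "rb") as f:
        return classify(f.read(size))
```

Calling classify_file() on a path with no extension works the same as on any other, since only the content is examined.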

Also, UTF-16-encoded text is recognised as text, even though it may
contain a high proportion of NUL bytes.
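
A quick way to see why a bare "any NUL byte means binary" test misfires on UTF-16 is the snippet below. The has_nul() helper is a hypothetical stand-in for that heuristic, not something taken from "file" itself.

```python
def has_nul(data):
    """Hypothetical 'binary if it contains a NUL byte' heuristic."""
    return b"\x00" in data


text = "plain ASCII-range text\n"
encoded = text.encode("utf-16-le")

# Each ASCII-range character becomes two bytes, the second of which is NUL.
nul_fraction = encoded.count(b"\x00") / len(encoded)

# The heuristic flags the data as binary although it decodes to ordinary text.
flagged_as_binary = has_nul(encoded)
round_trips = encoded.decode("utf-16-le") == text
```

For text in the ASCII range, half of the encoded bytes are NUL, so the heuristic rejects it even though the decode round-trips cleanly.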


-- 
http://mail.python.org/mailman/listinfo/python-list
