Re: How to know if a file is a text file

2009-11-15 Thread Nobody
On Sun, 15 Nov 2009 13:49:54 +0100, Luca wrote:

> I was quite sure that this is not a very simple task. Right now, searching
> only within the ASCII encoding is not enough for me (my native language
> falls outside it :-)
> Checking every single byte could be a good solution...
>
> I can start by using the mimetypes module and, if the file has no
> extension, check the bytes one by one, as the "file" command commonly does.
> Better: I can use the "file" command if available.

Another possible solution:

Universal Encoding Detector
Character encoding auto-detection in Python 2 and 3

http://chardet.feedparser.org/
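
A minimal sketch of how chardet might be used, assuming the third-party
package is installed. The function name guess_encoding and the 0.5
confidence threshold are made up for illustration. chardet.detect() takes
a bytes object and returns a dict that includes an "encoding" guess and a
"confidence" score; the guess is statistical, so it is a hint rather than
proof.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None


def guess_encoding(data, min_confidence=0.5):
    """Return chardet's best encoding guess for `data` (bytes), or None.

    None means chardet is not installed, found no plausible encoding, or
    was less confident than `min_confidence` (an arbitrary threshold).
    """
    if chardet is None:
        return None
    result = chardet.detect(data)
    if result["encoding"] is None or result["confidence"] < min_confidence:
        return None
    return result["encoding"]
```

A caller would typically read only the first few kilobytes of the file
and pass those in, since detection on a large file can be slow.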

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to know if a file is a text file

2009-11-15 Thread Nobody
On Sun, 15 Nov 2009 04:34:10 -0800, Chris Rebert wrote:

>>> I'm looking for a way to load a generic file from the system and
>>> determine whether it is plain text.
>>> The mimetypes module has some nice methods but, for example, it doesn't
>>> work for files without an extension.
>>>
>>> Any suggestions?
>>
>> You could use the "file" command. It's normally installed by default on
>> Unix systems, but you can get a Windows version from:
> 
> FWIW, IIRC the heuristic `file` uses to check whether a file is text
> or not is whether it contains any null bytes; if it does, it
> classifies it as binary (i.e. not text).

"file" provides more granularity than that, recognising many specific
formats, both text and binary.

First, it uses "magic number" checks, checking for known signature bytes
(e.g. "#!" or "JFIF") at the beginning of the file. If those checks fail
it checks for common text encodings. If those also fail, it reports "data".

Also, UTF-16-encoded text is recognised as text, even though it may
contain a high proportion of NUL bytes.
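
The order of checks described above could be sketched roughly as follows.
This is a simplified illustration, not the actual implementation of
"file": the three-entry signature table, the function name classify and
the returned labels are all made up for the example, and the real command
consults a much larger magic database.

```python
import codecs

# Tiny illustrative table: (offset, signature bytes, description).
SIGNATURES = [
    (0, b"#!", "script text"),
    (6, b"JFIF", "JPEG image"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
]


def classify(data):
    """Classify a bytes object using signatures first, then text encodings."""
    if not data:
        return "empty"
    # Step 1: "magic number" checks against known signatures.
    for offset, signature, description in SIGNATURES:
        if data[offset:offset + len(signature)] == signature:
            return description
    # Step 2: common text encodings. UTF-16 with a BOM counts as text even
    # though it usually contains many NUL bytes.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        try:
            data.decode("utf-16")
            return "UTF-16 text"
        except UnicodeDecodeError:
            pass
    elif b"\x00" not in data:
        try:
            data.decode("ascii")
            return "ASCII text"
        except UnicodeDecodeError:
            pass
        try:
            data.decode("utf-8")
            return "UTF-8 text"
        except UnicodeDecodeError:
            pass
    # Step 3: nothing matched.
    return "data"
```

For instance, classify("ciao".encode("utf-16")) is reported as text by
this sketch even though half of its bytes are NUL.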




Re: How to know if a file is a text file

2009-11-15 Thread Luca
On Sat, Nov 14, 2009 at 6:51 PM, Philip Semanchuk wrote:
> Hi Luca,
> You have to define what you mean by "text" file. It might seem obvious, but
> it's not.
>
> Do you mean just ASCII text? Or will you accept Unicode too? Unicode text
> can be more difficult to detect because you have to guess the file's
> encoding (unless it has a BOM; most don't).
>
> And do you need to verify that every single byte in the file is "text"? What
> if the file is 1GB, do you still want to examine every single byte?
>
> If you give us your own (specific!) definition of what "text" means, or
> perhaps a description of the problem you're trying to solve, then maybe we
> can help you better.
>

Thanks all.

I was quite sure that this is not a very simple task. Right now, searching
only within the ASCII encoding is not enough for me (my native language
falls outside it :-)
Checking every single byte could be a good solution...

I can start by using the mimetypes module and, if the file has no
extension, check the bytes one by one, as the "file" command commonly does.
Better: I can use the "file" command if available.
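
The plan above might look something like this in Python. It is only a
sketch under stated assumptions: the function name is_text_file, the
4096-byte sample size and the choice of UTF-8 as the only candidate
encoding are illustrative, and a file in another encoding (Latin-1, for
example) would be rejected.

```python
import codecs
import mimetypes

# Control characters commonly found in plain text: tab, LF, CR, form feed.
ALLOWED_CONTROLS = "\t\n\r\f"


def is_text_file(path, blocksize=4096):
    """Guess whether `path` is text: extension first, then content."""
    mime, _ = mimetypes.guess_type(path)
    if mime is not None:
        # The extension is known: trust it without opening the file.
        return mime.startswith("text/")
    # No usable extension: sample the start of the file.
    with open(path, "rb") as f:
        block = f.read(blocksize)
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off at the
        # end of the sampled block.
        text = decoder.decode(block, final=False)
    except UnicodeDecodeError:
        return False
    # Valid UTF-8 can still hold NUL and other control characters.
    return not any(ch < " " and ch not in ALLOWED_CONTROLS for ch in text)
```

Note that some textual formats have MIME types outside "text/" (such as
application/json), so the extension branch is deliberately conservative.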

Again: thanks all!

-- 
-- luca


Re: How to know if a file is a text file

2009-11-15 Thread Chris Rebert
On Sun, Nov 15, 2009 at 4:06 AM, Nobody wrote:
> On Sat, 14 Nov 2009 17:02:29 +0100, Luca Fabbri wrote:
>
>> I'm looking for a way to load a generic file from the system and
>> determine whether it is plain text.
>> The mimetypes module has some nice methods but, for example, it doesn't
>> work for files without an extension.
>>
>> Any suggestions?
>
> You could use the "file" command. It's normally installed by default on
> Unix systems, but you can get a Windows version from:

FWIW, IIRC the heuristic `file` uses to check whether a file is text
or not is whether it contains any null bytes; if it does, it
classifies it as binary (i.e. not text).
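
Whatever "file" itself does, the NUL-byte heuristic is easy to try out in
Python. A minimal sketch, with hypothetical function names and an
arbitrary 8192-byte sample size:

```python
def has_nul(data):
    """Return True if the bytes object `data` contains a NUL byte."""
    return b"\x00" in data


def looks_binary(path, blocksize=8192):
    """Apply the NUL-byte heuristic to the first `blocksize` bytes of a file."""
    with open(path, "rb") as f:
        return has_nul(f.read(blocksize))
```

The heuristic is crude: UTF-16 text such as "testo".encode("utf-16")
contains NUL bytes and would be reported as binary.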

Cheers,
Chris
--
http://blog.rebertia.com


Re: How to know if a file is a text file

2009-11-15 Thread Nobody
On Sat, 14 Nov 2009 17:02:29 +0100, Luca Fabbri wrote:

> I'm looking for a way to load a generic file from the system and
> determine whether it is plain text.
> The mimetypes module has some nice methods but, for example, it doesn't
> work for files without an extension.
>
> Any suggestions?

You could use the "file" command. It's normally installed by default on
Unix systems, but you can get a Windows version from:

http://gnuwin32.sourceforge.net/packages/file.htm
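
From Python, the command can be driven through the subprocess module. A
minimal sketch, assuming a version of "file" that supports the --brief
and --mime-type options; the wrapper name file_mime_type is made up for
illustration:

```python
import shutil
import subprocess


def file_mime_type(path):
    """Ask the external "file" command for the MIME type of `path`.

    Returns a string such as "text/plain", or None if "file" is not
    installed or exits with an error.
    """
    exe = shutil.which("file")
    if exe is None:
        return None
    try:
        proc = subprocess.run(
            # "--" stops option parsing, so odd file names are safe.
            [exe, "--brief", "--mime-type", "--", path],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return proc.stdout.strip() or None
```

A caller could then treat any result starting with "text/" as text, and
fall back to another method when the function returns None.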



Re: How to know if a file is a text file

2009-11-14 Thread Philip Semanchuk


On Nov 14, 2009, at 11:02 AM, Luca Fabbri wrote:


> Hi all.
>
> I'm looking for a way to load a generic file from the system and
> determine whether it is plain text.
> The mimetypes module has some nice methods but, for example, it doesn't
> work for files without an extension.


Hi Luca,
You have to define what you mean by "text" file. It might seem  
obvious, but it's not.


Do you mean just ASCII text? Or will you accept Unicode too? Unicode  
text can be more difficult to detect because you have to guess the  
file's encoding (unless it has a BOM; most don't).
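
When a BOM is present, checking for it is straightforward with the
constants in the standard codecs module. A small sketch (the function
name encoding_from_bom is made up for illustration); it returns the name
of a codec that consumes the BOM while decoding:

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
# (UTF-16 LE text whose first character is U+0000 is therefore ambiguous.)
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def encoding_from_bom(data):
    """Return a BOM-aware codec name for `data` (bytes), or None if no BOM."""
    for bom, codec_name in BOMS:
        if data.startswith(bom):
            return codec_name
    return None
```

A result of None says nothing either way: the file may still be text in
an encoding that has to be guessed.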


And do you need to verify that every single byte in the file is  
"text"? What if the file is 1GB, do you still want to examine every  
single byte?
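
For large files, one option is to read in chunks and optionally stop
after a fixed number of bytes. A sketch using an incremental decoder from
the standard codecs module; the function name decodes_as and the
parameters chunk_size and max_bytes are illustrative choices:

```python
import codecs


def decodes_as(path, encoding="utf-8", chunk_size=64 * 1024, max_bytes=None):
    """Return True if the file (or its first `max_bytes`) decodes cleanly."""
    decoder = codecs.getincrementaldecoder(encoding)()
    seen = 0
    with open(path, "rb") as f:
        while max_bytes is None or seen < max_bytes:
            chunk = f.read(chunk_size)
            if not chunk:
                # End of file: flush the decoder so that a truncated
                # final character is reported as an error.
                try:
                    decoder.decode(b"", final=True)
                except UnicodeDecodeError:
                    return False
                return True
            seen += len(chunk)
            try:
                # The decoder keeps characters split across chunks.
                decoder.decode(chunk)
            except UnicodeDecodeError:
                return False
    # Stopped at max_bytes: the sampled prefix decoded without errors.
    return True
```

With max_bytes set, a 1GB file costs no more to check than a small one,
at the price of missing problems that appear later in the file.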


If you give us your own (specific!) definition of what "text" means,  
or perhaps a description of the problem you're trying to solve, then  
maybe we can help you better.


Cheers
Philip