Re: How to know if a file is a text file
On Sun, 15 Nov 2009 13:49:54 +0100, Luca wrote:

> I was quite sure that this is not a very simple task. Right now, searching
> only within the ASCII encoding is not enough for me (my native language is
> outside this encoding :-)
> Checking every single byte can be a good solution...
>
> I can start by using the mimetypes module and, if the file has no
> extension, check the bytes one by one, as the "file" command (commonly) does.
> Better: I can use the "file" command if available.

Another possible solution:

Universal Encoding Detector
Character encoding auto-detection in Python 2 and 3
http://chardet.feedparser.org/

--
http://mail.python.org/mailman/listinfo/python-list
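A minimal sketch of how the detector suggested above might be used. It assumes the third-party chardet package is installed and degrades to returning None when it is not. The helper name guess_encoding and the min_confidence threshold are illustrative assumptions, not part of chardet.

```python
try:
    import chardet  # third-party package; assumed to be installed
except ImportError:
    chardet = None


def guess_encoding(data: bytes, min_confidence: float = 0.5):
    """Return a guessed encoding name for data, or None if unsure.

    min_confidence is an arbitrary illustrative threshold.
    """
    if chardet is None:
        return None
    result = chardet.detect(data)  # dict with 'encoding' and 'confidence'
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding and confidence >= min_confidence:
        return encoding
    return None
```

A result of None would then be treated as "probably not text", or passed on to some other check.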
Re: How to know if a file is a text file
On Sun, 15 Nov 2009 04:34:10 -0800, Chris Rebert wrote:

>>> I'm looking for a way to load a generic file from the
>>> system and determine whether it is plain text.
>>> The mimetypes module has some nice methods, but, for example, it
>>> doesn't work for files without an extension.
>>>
>>> Any suggestions?
>>
>> You could use the "file" command. It's normally installed by default on
>> Unix systems, but you can get a Windows version from:
>
> FWIW, IIRC the heuristic `file` uses to check whether a file is text
> or not is whether it contains any null bytes; if it does, it
> classifies it as binary (i.e. not text).

"file" provides more granularity than that, recognising many specific
formats, both text and binary.

First, it uses "magic number" checks, looking for known signature bytes
(e.g. "#!" or "JFIF") at the beginning of the file. If those checks fail,
it checks for common text encodings. If those also fail, it reports "data".

Also, UTF-16-encoded text is recognised as text, even though it may
contain a high proportion of NUL bytes.
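The layered approach described above (signatures first, then text encodings, then "data") could be sketched roughly as follows. This is only an illustration of the idea, not how "file" is actually implemented: the SIGNATURES table, the labels, and the function names are all hypothetical, and the real command relies on a much larger magic database.

```python
import codecs

# Hypothetical, tiny signature table for illustration only.
SIGNATURES = [
    (b"#!", "script text"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"%PDF-", "PDF document"),
]


def looks_like_text(text: str) -> bool:
    """True if every character is printable or common whitespace."""
    return all(ch.isprintable() or ch in "\t\n\r\f" for ch in text)


def classify(data: bytes) -> str:
    # Step 1: known signature bytes at the start of the data.
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    # Step 2a: UTF-16 is only attempted when a BOM is present, because
    # almost any even-length byte string decodes as UTF-16.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        try:
            text = data.decode("utf-16")
        except UnicodeDecodeError:
            return "data"
        return "UTF-16 text" if looks_like_text(text) else "data"
    # Step 2b: other common text encodings, strictest first.
    for encoding, label in (("ascii", "ASCII text"), ("utf-8", "UTF-8 text")):
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue
        if looks_like_text(text):
            return label
    # Step 3: nothing matched.
    return "data"
```

Note that the UTF-16 case is accepted as text even though its bytes contain many NULs, which matches the last point of the message.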
Re: How to know if a file is a text file
On Sat, Nov 14, 2009 at 6:51 PM, Philip Semanchuk wrote:

> Hi Luca,
> You have to define what you mean by "text" file. It might seem obvious, but
> it's not.
>
> Do you mean just ASCII text? Or will you accept Unicode too? Unicode text
> can be more difficult to detect because you have to guess the file's
> encoding (unless it has a BOM; most don't).
>
> And do you need to verify that every single byte in the file is "text"? What
> if the file is 1GB, do you still want to examine every single byte?
>
> If you give us your own (specific!) definition of what "text" means, or
> perhaps a description of the problem you're trying to solve, then maybe we
> can help you better.
>

Thanks all. I was quite sure that this is not a very simple task.

Right now, searching only within the ASCII encoding is not enough for me
(my native language is outside this encoding :-)
Checking every single byte can be a good solution...

I can start by using the mimetypes module and, if the file has no
extension, check the bytes one by one, as the "file" command (commonly) does.
Better: I can use the "file" command if available.

Again: thanks all!

--
-- luca
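The plan outlined above might look something like this sketch. Several things here are assumptions: the function names are made up, the "file" command is assumed to accept the --brief and --mime-type options (older versions may not), and the byte-level fallback only recognises UTF-8 (which includes ASCII), so text in other encodings would need a real encoding detector.

```python
import codecs
import mimetypes
import shutil
import subprocess


def sample_looks_like_text(sample: bytes) -> bool:
    """Fallback check: no NUL bytes and valid UTF-8 (assumption)."""
    if b"\x00" in sample:
        return False
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off at the end
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


def is_text_file(path: str, sample_size: int = 4096) -> bool:
    # 1. Trust the extension when mimetypes knows it.
    mime, _ = mimetypes.guess_type(path)
    if mime is not None:
        return mime.startswith("text/")
    # 2. Otherwise ask the "file" command, if it is on the PATH.
    file_cmd = shutil.which("file")
    if file_cmd:
        result = subprocess.run(
            [file_cmd, "--brief", "--mime-type", path],
            capture_output=True, text=True, check=False,
        )
        answer = result.stdout.strip()
        if result.returncode == 0 and answer:
            return answer.startswith("text/")
    # 3. Last resort: inspect the first bytes ourselves.
    with open(path, "rb") as f:
        return sample_looks_like_text(f.read(sample_size))
```

Only a bounded sample is read in the fallback, so a very large file is not scanned byte by byte.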
Re: How to know if a file is a text file
On Sun, Nov 15, 2009 at 4:06 AM, Nobody wrote:

> On Sat, 14 Nov 2009 17:02:29 +0100, Luca Fabbri wrote:
>
>> I'm looking for a way to load a generic file from the
>> system and determine whether it is plain text.
>> The mimetypes module has some nice methods, but, for example, it
>> doesn't work for files without an extension.
>>
>> Any suggestions?
>
> You could use the "file" command. It's normally installed by default on
> Unix systems, but you can get a Windows version from:

FWIW, IIRC the heuristic `file` uses to check whether a file is text
or not is whether it contains any null bytes; if it does, it
classifies it as binary (i.e. not text).

Cheers,
Chris
--
http://blog.rebertia.com
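The NUL-byte heuristic mentioned above is simple to sketch. This is only an illustration of that rule, not necessarily what `file` itself does; the function names and the sample size are assumptions.

```python
def has_nul_bytes(data: bytes) -> bool:
    """True if the data contains at least one NUL (zero) byte."""
    return b"\x00" in data


def is_probably_binary(path: str, sample_size: int = 8192) -> bool:
    """Apply the NUL-byte rule to the first sample_size bytes of a file."""
    with open(path, "rb") as f:
        return has_nul_bytes(f.read(sample_size))
```

One known weakness of this rule: UTF-16 encoded text contains NUL bytes for every ASCII character, so it would be classified as binary.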
Re: How to know if a file is a text file
On Sat, 14 Nov 2009 17:02:29 +0100, Luca Fabbri wrote:

> I'm looking for a way to load a generic file from the
> system and determine whether it is plain text.
> The mimetypes module has some nice methods, but, for example, it
> doesn't work for files without an extension.
>
> Any suggestions?

You could use the "file" command. It's normally installed by default on
Unix systems, but you can get a Windows version from:

http://gnuwin32.sourceforge.net/packages/file.htm
Re: How to know if a file is a text file
On Nov 14, 2009, at 11:02 AM, Luca Fabbri wrote:

> Hi all.
>
> I'm looking for a way to load a generic file from the
> system and determine whether it is plain text.
> The mimetypes module has some nice methods, but, for example, it
> doesn't work for files without an extension.

Hi Luca,
You have to define what you mean by "text" file. It might seem
obvious, but it's not.

Do you mean just ASCII text? Or will you accept Unicode too? Unicode
text can be more difficult to detect because you have to guess the
file's encoding (unless it has a BOM; most don't).

And do you need to verify that every single byte in the file is
"text"? What if the file is 1GB, do you still want to examine every
single byte?

If you give us your own (specific!) definition of what "text" means,
or perhaps a description of the problem you're trying to solve, then
maybe we can help you better.

Cheers
Philip
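Two of the points above, the BOM and the cost of reading a very large file, can be illustrated with a short sketch. The names detect_bom and read_sample and the 64 KiB default are assumptions chosen for the example; the BOM constants come from the standard codecs module.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32-LE BOM begins with
# the same two bytes as the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_bom(sample: bytes):
    """Return a codec name if the sample starts with a known BOM, else None."""
    for bom, encoding in BOMS:
        if sample.startswith(bom):
            return encoding
    return None


def read_sample(path: str, size: int = 64 * 1024) -> bytes:
    """Read at most size bytes, so a 1GB file is never loaded whole."""
    with open(path, "rb") as f:
        return f.read(size)
```

Since most files have no BOM, a None result says nothing either way; the encoding still has to be guessed by other means.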