On Wed, Nov 11, 2020 at 6:52 AM Barry Scott <ba...@barrys-emacs.org> wrote:
>
>
>
> > On 10 Nov 2020, at 19:30, Eli the Bearded <*@eli.users.panix.com> wrote:
> >
> > In comp.lang.python, Chris Angelico <ros...@gmail.com> wrote:
> >> Eli the Bearded <*@eli.users.panix.com> wrote:
> >>> Read first N lines of a file. If all parse as valid UTF-8, consider it 
> >>> text.
> >>> That's probably the rough method file(1) and Perl's -T use. (In
> >>> particular allow no nulls. Maybe allow ISO-8859-1.)
> >> ISO-8859-1 is basically "allow any byte values", so all you'd be doing
> >> is checking for a lack of NUL bytes.
>
> NUL check does not work for windows UTF-16 files.

Yeah, so if you're expecting UTF-16, you would have to do the decode
to text first, and the check for NULs second. One of the big
advantages of UTF-8 is that you can do the checks in either order.
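A minimal sketch of that order-independence (the function name and logic are illustrative, not from the thread): with UTF-8, the NUL scan can run on the raw bytes either before or after the decode check, while a valid UTF-16 text file still fails a byte-level NUL scan.

```python
def looks_like_utf8_text(sample: bytes) -> bool:
    """Illustrative heuristic: call it text if the bytes contain no
    NULs and decode as UTF-8.  The two checks are independent, so
    either one can run first."""
    if b"\x00" in sample:          # NUL scan works on the raw bytes
        return False
    try:
        sample.decode("utf-8")     # decode check, in whichever order
        return True
    except UnicodeDecodeError:
        return False

# Why this breaks for UTF-16: even plain ASCII text is full of NULs
# once encoded, e.g. "hi".encode("utf-16-le") == b"h\x00i\x00".
```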

> >> And let's be honest here, there aren't THAT many binary files that
> >> manage to contain a total of zero NULs, so you won't get many false
> >> hits :)
>
> There is the famous EICAR anti-virus test file, which is a valid 8086
> DOS program consisting entirely of printable ASCII.

Yes. I didn't say "none", I said "aren't many" :) There's
fundamentally no way to know whether something is or isn't text based
on its contents alone; raw audio data might just happen to look like
an RFC822 email, it's just really really unlikely.

> > There's always the issue of how much to read before deciding.
>
> Simply read it all; after all, you have to scan the whole file to do
> the replacement.

If the script's assuming it'll mostly work on small text files, it
might be very annoying to suddenly read in a 4GB blob of video file
just to find out that it's not text. But since we're talking
heuristics here, reading in a small chunk of the file is going to give
an extremely high chance of recognizing a binary file, with a
relatively small cost.
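A sketch of that chunked heuristic (the names, chunk size, and the truncation handling are my assumptions, not code from this thread): read only the first few KB, reject on NULs or a UTF-8 decode failure, but tolerate a multi-byte character that the chunk boundary happened to split.

```python
def probably_text(path: str, chunk_size: int = 8192) -> bool:
    """Illustrative sketch: inspect only the first chunk_size bytes,
    so a 4GB video file is rejected without reading it all in."""
    with open(path, "rb") as f:
        sample = f.read(chunk_size)
    if b"\x00" in sample:          # binary files almost always contain NULs
        return False
    try:
        sample.decode("utf-8")
        return True
    except UnicodeDecodeError as e:
        # Tolerate a multi-byte sequence cut off at the chunk boundary:
        # a UTF-8 character is at most 4 bytes, so a genuine truncation
        # starts within the last 3 bytes of the sample.
        return e.start >= len(sample) - 3
```

Like any heuristic, this trades certainty for cost: a decode error in the final 3 bytes is treated as truncation even if it is genuinely invalid, but that is the same "extremely high chance, small cost" trade-off described above.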

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
