On Wed, Nov 11, 2020 at 6:52 AM Barry Scott <ba...@barrys-emacs.org> wrote:
>
> > On 10 Nov 2020, at 19:30, Eli the Bearded <*@eli.users.panix.com> wrote:
> >
> > In comp.lang.python, Chris Angelico <ros...@gmail.com> wrote:
> >> Eli the Bearded <*@eli.users.panix.com> wrote:
> >>> Read first N lines of a file. If all parse as valid UTF-8, consider it
> >>> text.
> >>> That's probably the rough method file(1) and Perl's -T use. (In
> >>> particular allow no nulls. Maybe allow ISO-8859-1.)
> >> ISO-8859-1 is basically "allow any byte values", so all you'd be doing
> >> is checking for a lack of NUL bytes.
>
> NUL check does not work for windows UTF-16 files.
Yeah, so if you're expecting UTF-16, you would have to do the decode to
text first, and the check for NULs second. One of the big advantages of
UTF-8 is that you can do the checks in either order.

> >> And let's be honest here, there aren't THAT many binary files that
> >> manage to contain a total of zero NULs, so you won't get many false
> >> hits :)
>
> There is the famous EICAR virus test file that is a valid 8086 program for
> DOS that is printing ASCII.

Yes. I didn't say "none", I said "aren't many" :) There's fundamentally
no way to know whether something is or isn't text based on its contents
alone; raw audio data might just happen to look like an RFC822 email,
it's just really really unlikely.

> > There's always the issue of how much to read before deciding.
>
> Simple read it all, after all you have to scan all the file to do the
> replacement.

If the script's assuming it'll mostly work on small text files, it might
be very annoying to suddenly read in a 4GB blob of video file just to
find out that it's not text. But since we're talking heuristics here,
reading in a small chunk of the file is going to give an extremely high
chance of recognizing a binary file, with a relatively small cost.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
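
[Editor's note: for readers who want to try the heuristic the thread is
describing (read a small leading chunk, reject on NUL bytes, then see if
it decodes as UTF-8), here is a minimal sketch in Python. The function
name looks_like_text, the 8 KiB chunk size, and the trailing-byte
trimming detail are illustrative choices, not anything from the thread.]

import sys

def looks_like_text(path, chunk_size=8192):
    """Guess whether a file is text by inspecting its first chunk_size
    bytes. This is only a heuristic; as noted above, there is no way to
    know for certain from the contents alone."""
    with open(path, "rb") as f:
        chunk = f.read(chunk_size)

    # An empty file is trivially "text".
    if not chunk:
        return True

    # NUL bytes essentially never occur in single-byte text encodings.
    # (This also rejects UTF-16/UTF-32 files, which are full of NULs.)
    if b"\x00" in chunk:
        return False

    # Try to decode the chunk as UTF-8. Because we may have cut the file
    # in the middle of a multi-byte sequence, trim up to 3 trailing bytes
    # before concluding the data is not valid UTF-8.
    for trim in range(4):
        try:
            chunk[: len(chunk) - trim].decode("utf-8")
            return True
        except UnicodeDecodeError:
            continue

    # Per Eli's "maybe allow ISO-8859-1", one could return True here for
    # any NUL-free chunk instead; that accepts essentially any byte values.
    return False

if __name__ == "__main__":
    for name in sys.argv[1:]:
        print(name, "text" if looks_like_text(name) else "binary")

[Reading only the first chunk keeps the cost small for the 4GB-video-file
case Chris mentions, at the price of occasionally misclassifying a file
whose first kilobytes happen to look like text.]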