[fossil-users] Question about binary file detection

Bill Thiede Thu, 19 May 2011 18:45:59 -0700

I had a seemingly normal Python file checked in, and when I went to view the
artifact, fossil claimed it was binary.  I tracked it down to a line like this
in my source:


sys.stdout.write('^H') 

Here the ^H is 0x08, the backspace character.  It seems my otherwise
text-looking file was marked binary by mimetype_from_content() in src/doc.c.

I know text/binary detection can be a tricky problem with UTF-8/16/32, so I'm
wondering if the isBinary check is really the best solution.  I locally patched
my fossil to look only for a NUL byte (0x00), and that seems to work pretty
well.  Embedding a NUL in source code probably happens less often than
embedding some of the other non-printable characters.  And I would suspect
your average binary file has a few NULs in it.
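
A minimal sketch of the NUL-only idea, in Python for illustration. This is not the attached patch and not fossil's C code; the function name looks_binary_nul is hypothetical.

```python
def looks_binary_nul(data: bytes) -> bool:
    """Hypothetical helper: call content binary only if it contains a NUL byte."""
    return b"\x00" in data


# A text file with an embedded backspace (0x08), like the one described above.
source = b"import sys\nsys.stdout.write('\x08')\n"

# A made-up binary-looking blob that happens to contain NUL bytes.
blob = b"\x7fELF\x02\x01\x01\x00\x00\x00"

print(looks_binary_nul(source))  # the backspace alone does not trigger the check
print(looks_binary_nul(blob))    # the NUL bytes do
```

One caveat with this approach: ASCII text stored as UTF-16 or UTF-32 contains 0x00 bytes, so a NUL-only check would still classify such files as binary.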

Should that check stay as is, become a NUL-based check (patch attached), or
maybe use some sort of probability-based metric (e.g. treat the file as text
if <5% of it is non-printable)?  That last option is still not great for UTF-*
if we don't have a Unicode table of printable characters handy.
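
The probability-based option could look roughly like the sketch below, again in Python for illustration only. The function names, the 5% default threshold, and the choice to count only ASCII control bytes (other than tab, LF, and CR) as non-printable are all assumptions. Bytes at 0x80 and above are treated as printable so that UTF-8 multi-byte sequences are not penalized, which is a simplification and not a real Unicode printability test.

```python
# Control bytes that commonly appear in text and should not count against it.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return


def nonprintable_ratio(data: bytes) -> float:
    """Fraction of bytes that are ASCII control characters (hypothetical metric)."""
    if not data:
        return 0.0
    bad = sum(
        1
        for b in data
        if (b < 0x20 and b not in ALLOWED_CONTROLS) or b == 0x7F
    )
    return bad / len(data)


def looks_binary_ratio(data: bytes, threshold: float = 0.05) -> bool:
    """Call content binary when the non-printable fraction reaches the threshold."""
    return nonprintable_ratio(data) >= threshold


source = b"import sys\nsys.stdout.write('\x08')\n"  # one backspace in 33 bytes
print(looks_binary_ratio(source))            # below the 5% threshold
print(looks_binary_ratio(bytes(range(32))))  # almost entirely control bytes
```

A very short file with a single control character could still cross the threshold, so a real implementation would probably need a minimum size or an absolute count as well.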

Your thoughts?

Bill
_______________________________________________
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
