* Chris Lamb <la...@debian.org> [2023-10-11 14:51]:
> Niels Thykier wrote:
>
> > Digging a bit deeper, it turns out that `file -i` correctly classifies
> > the changelog as `text/plain; charset=utf-8`. That is, `file` knows it
> > is text and I suspect `diffoscope` should try `file -i` as well when it
> > gets an unknown result from `file`.
>
> By "unknown result" I assume you mean that diffoscope cannot match
> the file type with any known comparator. :) Indeed, diffoscope
> doesn't recognise the bogus "Message Sequence Chart" so it falls
> back to using a hexdump as you intuited.

I would argue that this is a bug in file(1), as Magdir/communications
uses a "string" test, which is for binary files. If this is a text file,
not a binary format, it should be forcing a text file test by using
"string/t" instead.

That said, this is likely not the only such bug (I already encountered
one before [1]), so the suggestion below makes sense to me.

> I've got some WIP code that will treat unknown file types as text if
> they have a MIME type of text/plain. This avoids the use of hexdump
> with the examples you sent over at least.
>
> Do you think I should be further limiting that conditional to a
> whitelist of safe encodings, too? (eg. "utf-8" and "us-ascii", etc.)

I don't think we need to handle encodings differently from how we
already handle files identified as text by file(1): the TextFile
comparator tries to guess the encoding, but falls back to a hexdump for
e.g. euc-jp encoded files which are identified as "unknown-8bit" by
File.guess_encoding(), resulting in a LookupError from codecs.open().

- Fay

[1] https://mailman.astron.com/pipermail/file/2023-February/001132.html
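
P.S. For illustration only, here is a minimal sketch of the fallback
behaviour discussed in this thread. It is not diffoscope's actual code:
choose_comparison() and its parameters are hypothetical names, and it
decodes in-memory bytes with bytes.decode() rather than going through
codecs.open(). The point is just that an unrecognised encoding name such
as "unknown-8bit" raises LookupError, which can be caught to fall back
to a hexdump, so no separate whitelist of encodings is required.

```python
def choose_comparison(mime_type, encoding, data):
    """Hypothetical helper: return ("text", str) or ("hexdump", str).

    mime_type and encoding stand in for what `file -i` would report;
    data is the raw file content as bytes.
    """
    if mime_type == "text/plain":
        try:
            # An unknown codec name raises LookupError; bytes that do not
            # match a known codec raise UnicodeDecodeError.
            return "text", data.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            pass  # fall through to the hexdump path
    # Stand-in for a real hexdump: just the hex digits of the content.
    return "hexdump", data.hex()
```

With this sketch, a text/plain file reported as "utf-8" is compared as
text, while one reported as "unknown-8bit" ends up on the hexdump path
without any encoding-specific conditional.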