Bug#1053668: diffoscope: Consider using `file -i` as fallback for unknown file output
Version: 251 FC Stegerman wrote: > I would argue that this is a bug in file(1) as Magdir/communications > uses a "string" test, which is for binary files. If this is a text > file, not a binary format, it should be forcing a text file test by > using "string/t" instead. > > That said, this is likely not the only such bug (I already encountered > one before [1]), so the suggestion below makes sense to me. Yes to both things. This was actually implemented and fixed last week, although I didn't encode the entry in debian/changelog correctly so this bug wasn't closed… so closing it now in BCC. :) Thanks to you both. Best wishes, -- ,''`. : :' : Chris Lamb `. `'` la...@debian.org chris-lamb.co.uk `-
Bug#1053668: diffoscope: Consider using `file -i` as fallback for unknown file output
* Chris Lamb [2023-10-11 14:51]: > Niels Thykier wrote: > > > Digging a bit deeper, it turns out that `file -i` correctly classifies > > the changelog as `text/plain; charset=utf-8`. That is, `file` knows it > > is text and I suspect `diffoscope` should try `file -i` as well when it > > gets an unknown result from `file`. > > By "unknown result" I assume you mean that diffoscope cannot match > the file type with any known comparator. :) Indeed, diffoscope > doesn't recognise the bogus "Message Sequence Chart" so it falls > back to using a hexdump as you intuited. I would argue that this is a bug in file(1) as Magdir/communications uses a "string" test, which is for binary files. If this is a text file, not a binary format, it should be forcing a text file test by using "string/t" instead. That said, this is likely not the only such bug (I already encountered one before [1]), so the suggestion below makes sense to me. > I've got some WIP code that will treat unknown file types as text if > they have a MIME type of text/plain. This avoids the use of hexdump > with the examples you sent over at least. > > Do you think I should be further limiting that conditional to a > whitelist of safe encodings, too? (eg. "utf-8" and "us-ascii", etc.) I don't think we need to handle encodings differently from how we already handle files identified as text by file(1): the TextFile comparator tries to guess the encoding, but falls back to a hexdump for e.g. euc-jp encoded files which are identified as "unknown-8bit" by File.guess_encoding(), resulting in a LookupError from codecs.open(). - Fay [1] https://mailman.astron.com/pipermail/file/2023-February/001132.html
Bug#1053668: diffoscope: Consider using `file -i` as fallback for unknown file output
Chris Lamb: Niels Thykier wrote: Digging a bit deeper, it turns out that `file -i` correctly classifies the changelog as `text/plain; charset=utf-8`. That is, `file` knows it is text and I suspect `diffoscope` should try `file -i` as well when it gets an unknown result from `file`. By "unknown result" I assume you mean that diffoscope cannot match the file type with any known comparator. :) Indeed, diffoscope doesn't recognise the bogus "Message Sequence Chart" so it falls back to using a hexdump as you intuited. Correct. I've got some WIP code that will treat unknown file types as text if they have a MIME type of text/plain. This avoids the use of hexdump with the examples you sent over at least. Sounds good. :) Do you think I should be further limiting that conditional to a whitelist of safe encodings, too? (eg. "utf-8" and "us-ascii", etc.) Regards, I am not sure what to do here. Maybe you want to normalize the encoding first for more reliable diffs. If one side is utf-8 and the other is utf-16 or something more exotic, normalized the encoding before the diff might produce more readable diffs as long as diffoscope somewhere denotes the encoding difference. But honestly, I feel I should defer to your experience on this particular corner case. Best regards, Niels
Bug#1053668: diffoscope: Consider using `file -i` as fallback for unknown file output
Niels Thykier wrote: > Digging a bit deeper, it turns out that `file -i` correctly classifies > the changelog as `text/plain; charset=utf-8`. That is, `file` knows it > is text and I suspect `diffoscope` should try `file -i` as well when it > gets an unknown result from `file`. By "unknown result" I assume you mean that diffoscope cannot match the file type with any known comparator. :) Indeed, diffoscope doesn't recognise the bogus "Message Sequence Chart" so it falls back to using a hexdump as you intuited. I've got some WIP code that will treat unknown file types as text if they have a MIME type of text/plain. This avoids the use of hexdump with the examples you sent over at least. Do you think I should be further limiting that conditional to a whitelist of safe encodings, too? (eg. "utf-8" and "us-ascii", etc.) Regards, -- ,''`. : :' : Chris Lamb `. `'` la...@debian.org chris-lamb.co.uk `-
Bug#1053668: diffoscope: Consider using `file -i` as fallback for unknown file output
Package: diffoscope Version: 250 Severity: wishlist X-Debbugs-Cc: ni...@thykier.net Hi, I noticed that `diffoscope` used `hexdump -C` based diffs for the debian/changelog in the `mscgen` package. My first bet was that `file` would produce incorrect output and indeed, `file` classifies that changelog as a `Message Sequence Chart` rather than text. This is now filed as 1053666. Digging a bit deeper, it turns out that `file -i` correctly classifies the changelog as `text/plain; charset=utf-8`. That is, `file` knows it is text and I suspect `diffoscope` should try `file -i` as well when it gets an unknown result from `file`. This bug report obviously assumes that the `hexdump -C` like output is triggered because `diffoscope` uses `file` for determining how to analyze the changelog. If it uses something else, then there is some other bug in play that makes `diffoscope` treat the `mscgen` changelog as a binary file. Here are two samples files that `file` considers to be `Message Sequence Chart (chart)` and `text/plain; charset=us-ascii` with -i, in case it is useful for a test: ``` msc { a, b; } ``` ``` msc { c, d; } ``` Best regards, Niels