Bug#1053668: diffoscope: Consider using `file -i` as fallback for unknown file output

2023-10-17 Thread Chris Lamb
Version: 251

FC Stegerman wrote:

> I would argue that this is a bug in file(1) as Magdir/communications
> uses a "string" test, which is for binary files.  If this is a text
> file, not a binary format, it should be forcing a text file test by
> using "string/t" instead.
>
> That said, this is likely not the only such bug (I already encountered
> one before [1]), so the suggestion below makes sense to me.

Yes to both things. This was actually implemented and fixed last week,
although I didn't encode the entry in debian/changelog correctly so
this bug wasn't closed… so closing it now in BCC. :)

Thanks to you both.


Best wishes,

-- 
  ,''`.
 : :'  : Chris Lamb
 `. `'`  la...@debian.org  chris-lamb.co.uk
   `-



Bug#1053668: diffoscope: Consider using `file -i` as fallback for unknown file output

2023-10-15 Thread FC Stegerman
* Chris Lamb  [2023-10-11 14:51]:
> Niels Thykier wrote:
> 
> > Digging a bit deeper, it turns out that `file -i` correctly classifies 
> > the changelog as `text/plain; charset=utf-8`.  That is, `file` knows it 
> > is text and I suspect `diffoscope` should try `file -i` as well when it 
> > gets an unknown result from `file`.
> 
> By "unknown result" I assume you mean that diffoscope cannot match
> the file type with any known comparator. :)  Indeed, diffoscope
> doesn't recognise the bogus "Message Sequence Chart" so it falls
> back to using a hexdump as you intuited.

I would argue that this is a bug in file(1) as Magdir/communications
uses a "string" test, which is for binary files.  If this is a text
file, not a binary format, it should be forcing a text file test by
using "string/t" instead.

That said, this is likely not the only such bug (I already encountered
one before [1]), so the suggestion below makes sense to me.

> I've got some WIP code that will treat unknown file types as text if
> they have a MIME type of text/plain. This avoids the use of hexdump
> with the examples you sent over at least.
> 
> Do you think I should be further limiting that conditional to a
> whitelist of safe encodings, too? (eg. "utf-8" and "us-ascii", etc.)

I don't think we need to handle encodings differently from how we
already handle files identified as text by file(1): the TextFile
comparator tries to guess the encoding, but falls back to a hexdump
for e.g. euc-jp encoded files which are identified as "unknown-8bit"
by File.guess_encoding(), resulting in a LookupError from
codecs.open().

- Fay

[1] https://mailman.astron.com/pipermail/file/2023-February/001132.html



Bug#1053668: diffoscope: Consider using `file -i` as fallback for unknown file output

2023-10-11 Thread Niels Thykier

Chris Lamb:

Niels Thykier wrote:


Digging a bit deeper, it turns out that `file -i` correctly classifies
the changelog as `text/plain; charset=utf-8`.  That is, `file` knows it
is text and I suspect `diffoscope` should try `file -i` as well when it
gets an unknown result from `file`.


By "unknown result" I assume you mean that diffoscope cannot match
the file type with any known comparator. :)  Indeed, diffoscope
doesn't recognise the bogus "Message Sequence Chart" so it falls
back to using a hexdump as you intuited.



Correct.


I've got some WIP code that will treat unknown file types as text if
they have a MIME type of text/plain. This avoids the use of hexdump
with the examples you sent over at least.



Sounds good. :)


Do you think I should be further limiting that conditional to a
whitelist of safe encodings, too? (eg. "utf-8" and "us-ascii", etc.)


Regards,



I am not sure what to do here.  Maybe you want to normalize the encoding 
first for more reliable diffs.  If one side is utf-8 and the other is 
utf-16 or something more exotic, normalized the encoding before the diff 
might produce more readable diffs as long as diffoscope somewhere 
denotes the encoding difference.  But honestly, I feel I should defer to 
your experience on this particular corner case.


Best regards,
Niels



Bug#1053668: diffoscope: Consider using `file -i` as fallback for unknown file output

2023-10-11 Thread Chris Lamb
Niels Thykier wrote:

> Digging a bit deeper, it turns out that `file -i` correctly classifies 
> the changelog as `text/plain; charset=utf-8`.  That is, `file` knows it 
> is text and I suspect `diffoscope` should try `file -i` as well when it 
> gets an unknown result from `file`.

By "unknown result" I assume you mean that diffoscope cannot match
the file type with any known comparator. :)  Indeed, diffoscope
doesn't recognise the bogus "Message Sequence Chart" so it falls
back to using a hexdump as you intuited.

I've got some WIP code that will treat unknown file types as text if
they have a MIME type of text/plain. This avoids the use of hexdump
with the examples you sent over at least.

Do you think I should be further limiting that conditional to a
whitelist of safe encodings, too? (eg. "utf-8" and "us-ascii", etc.)


Regards,

-- 
  ,''`.
 : :'  : Chris Lamb
 `. `'`  la...@debian.org  chris-lamb.co.uk
   `-



Bug#1053668: diffoscope: Consider using `file -i` as fallback for unknown file output

2023-10-08 Thread Niels Thykier

Package: diffoscope
Version: 250
Severity: wishlist
X-Debbugs-Cc: ni...@thykier.net

Hi,

I noticed that `diffoscope` used `hexdump -C` based diffs for the 
debian/changelog in the `mscgen` package.


My first bet was that `file` would produce incorrect output and indeed, 
`file` classifies that changelog as a `Message Sequence Chart` rather 
than text.  This is now filed as 1053666.


Digging a bit deeper, it turns out that `file -i` correctly classifies 
the changelog as `text/plain; charset=utf-8`.  That is, `file` knows it 
is text and I suspect `diffoscope` should try `file -i` as well when it 
gets an unknown result from `file`.


This bug report obviously assumes that the `hexdump -C` like output is 
triggered because `diffoscope` uses `file` for determining how to 
analyze the changelog.  If it uses something else, then there is some 
other bug in play that makes `diffoscope` treat the `mscgen` changelog 
as a binary file.


Here are two samples files that `file` considers to be `Message Sequence 
Chart (chart)` and `text/plain; charset=us-ascii` with -i, in case it is 
useful for a test:


```
msc {
  a, b;
}
```
```
msc {
  c, d;
}
```



Best regards,
Niels