At 2024-10-03T18:35:37-0400, Dave wrote:
> Follow-up Comment #2:
>
> [comment #1 comment #1:]
> > [comment #0 original submission:]
> > > Unfortunately, preconv looks only at the first two lines of a file
> > > for encoding information.
> >
> > Only if the file isn't seekable...
>
> preconv looks at 0 lines if the file isn't seekable, and 2 lines if it
> is. Per its man page: "If the input stream is seekable, check the
> first two input lines for a GNU Emacs file-local variable identifying
> the character encoding." Under no circumstances will preconv find the
> tag if it appears after the first two lines.
Hmm, right. Thanks for reminding me. I feel pulled in several
directions lately...
> I don't desire any change in preconv. I merely desire to change
> shipped groff files to give preconv a greater chance of getting the
> encoding right.
This is fine if it doesn't fool Emacs into ignoring the local variables
at the end of the file and making the overall file editing experience
_worse_ for people who _do_ have uchardet installed.
I reckon I'll test that.
> Putting the "coding:" tag in the first two lines, where preconv will
> find it, is a small change to two shipped files and no executables.
>
> > Hmm, can't reproduce a problem here with _groff_ 1.23.0 or Git HEAD.
>
> Ah, probably you have a uchardet library, which is preconv's next step
> after checking the first two lines for an encoding tag.
I assuredly do.
> > Can you do some experiments with `preconv -d` and see what it says?
>
> Sure. On a UTF-8 terminal, absent uchardet, preconv guesses the wrong
> encoding for groff_mmse.7.man:
>
> $ fgrep 'coding: ' contrib/mm/groff_mmse.7.man
> .\" coding: latin-1
> $ echo $LC_CTYPE
> en_US.utf8
> $ preconv -d contrib/mm/groff_mmse.7.man > /dev/null
> fallback encoding: 'UTF-8'
> processing 'contrib/mm/groff_mmse.7.man'
> no coding tag
> could not detect encoding with uchardet
> encoding used: 'UTF-8'
> incomplete UTF-8 sequence(s) in input stream: replacing each such sequence
> with 0xFFFD
> $ preconv --version
> GNU preconv (groff) version 1.23.0.1624-4d251-dirty with iconv support and
> without uchardet support
>
> And on a latin-1 terminal, it guesses the wrong encoding for
> meintro_fr.me.in:
>
> $ fgrep 'coding: ' doc/meintro_fr.me.in
> .\" coding: utf-8
> $ echo $LC_CTYPE
> en_US.iso88591
> $ preconv -d doc/meintro_fr.me.in > /dev/null
> fallback encoding: 'ISO-8859-1'
> processing 'doc/meintro_fr.me.in'
> no coding tag
> could not detect encoding with uchardet
> encoding used: 'ISO-8859-1'
>
> Putting the coding: tag at the tops of the files, following the
> examples of the two .mom files I cited, fixes both of these.
Hrm, yup. If that provokes GNU Emacs into bad ergonomics as noted
above, it may be time to migrate at least these two files to UTF-8 in
the source tree. That day is coming one way or the other...
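
For anyone following along, here is a rough Python sketch of the
"first two lines only" behavior described above. It is not preconv's
actual code: the function name, the regular expression, and the
assumption that the tag uses the Emacs first-line form
"-*- coding: NAME -*-" are all mine, for illustration only.

```python
import re

# Hypothetical illustration, NOT preconv's implementation.
# Assumes the tag is written in the Emacs first-line form,
# e.g.  .\" -*- mode: nroff; coding: latin-1; -*-
TAG = re.compile(r"-\*-.*?\bcoding:\s*([A-Za-z0-9._-]+).*?-\*-")


def find_coding_tag(text, max_lines=2):
    """Return the encoding named in the first `max_lines` lines, else None."""
    for line in text.splitlines()[:max_lines]:
        match = TAG.search(line)
        if match:
            return match.group(1)
    # A tag further down (say, in a trailing "Local Variables:" block)
    # is never examined, so the caller must fall back to something else.
    return None
```

Under these assumptions, a tag on line 1 or 2 is found, while the same
file with its tag only in a trailing local-variables block yields None,
which matches the "no coding tag" lines in the transcripts quoted above.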
> {savane: user = 108747; tracker = bugs; item = 66287}
