Update of bug #66287 (group groff):
Status: Need Info => None
Assigned to: barx => None
_______________________________________________________
Follow-up Comment #2:
[comment #1 comment #1:]
> [comment #0 original submission:]
> > Unfortunately, preconv looks only at the first two lines of a file for
> > encoding information.
>
> Only if the file isn't seekable...
preconv looks at 0 lines if the file isn't seekable, and 2 lines if it is.
Per its man page: "If the input stream is seekable, check the first two input
lines for a GNU Emacs file-local variable identifying the character encoding."
Under no circumstances will preconv find the tag if it appears after the
first two lines.
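
To see whether a given file's tag falls inside that two-line window, a rough check along these lines may help. It is only a sketch: the function name is made up for illustration, and it greps for the literal string "coding:" rather than reproducing preconv's actual tag parsing.

```shell
# Hypothetical helper (not part of groff): succeed if the string "coding:"
# occurs within the first two lines of the file named by $1.
# LC_ALL=C keeps grep from complaining about bytes that are invalid in
# the current locale, e.g. Latin-1 bytes under a UTF-8 locale.
has_early_coding_tag() {
    head -n 2 "$1" | LC_ALL=C grep -q 'coding:'
}
```

For example, `has_early_coding_tag contrib/mm/groff_mmse.7.man && echo found || echo missing` reports whether the tag sits where preconv looks for it.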
> `preconv` is a preprocessor.... For it to behave as you desire,
I don't desire any change in preconv. I merely desire to change shipped groff
files to give preconv a greater chance of getting the encoding right.
> I think the status quo is the best we can do for shipped files
> without heavily refactoring preconv and potentially doing
> violence to the pipeline/filter concept.
Putting the "coding:" tag in the first two lines, where preconv will find it,
is a small change to two shipped files and no executables.
> Hmm, can't reproduce a problem here with _groff_ 1.23.0 or Git HEAD.
Ah, your preconv was probably built with uchardet support. Trying uchardet is
preconv's next step after checking the first two lines for an encoding tag.
> Can you do some experiments with `preconv -d` and see what it says?
Sure. On a UTF-8 terminal, absent uchardet, preconv guesses the wrong
encoding for groff_mmse.7.man:
$ fgrep 'coding: ' contrib/mm/groff_mmse.7.man
.\" coding: latin-1
$ echo $LC_CTYPE
en_US.utf8
$ preconv -d contrib/mm/groff_mmse.7.man > /dev/null
fallback encoding: 'UTF-8'
processing 'contrib/mm/groff_mmse.7.man'
no coding tag
could not detect encoding with uchardet
encoding used: 'UTF-8'
incomplete UTF-8 sequence(s) in input stream: replacing each such sequence
with 0xFFFD
$ preconv --version
GNU preconv (groff) version 1.23.0.1624-4d251-dirty with iconv support and
without uchardet support
And on a latin-1 terminal, it guesses the wrong encoding for
meintro_fr.me.in:
$ fgrep 'coding: ' doc/meintro_fr.me.in
.\" coding: utf-8
$ echo $LC_CTYPE
en_US.iso88591
$ preconv -d doc/meintro_fr.me.in > /dev/null
fallback encoding: 'ISO-8859-1'
processing 'doc/meintro_fr.me.in'
no coding tag
could not detect encoding with uchardet
encoding used: 'ISO-8859-1'
Putting the coding: tag at the tops of the files, following the examples of
the two .mom files I cited, fixes both of these.
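
For illustration only, such a change could be sketched as below. The helper name is hypothetical, and the tag format is an assumption based on the Emacs first-line file-variable convention (`-*- coding: NAME; -*-`) wrapped in a roff comment. It is not the actual patch, and the exact tag text should match whatever the .mom files use.

```shell
# Hypothetical helper (not part of groff): prepend a roff comment line
# carrying an Emacs-style first-line coding tag to the file named by $1.
# $2 is the encoding name, e.g. latin-1 or utf-8.
prepend_coding_tag() {
    src=$1
    enc=$2
    out=$(mktemp) || return 1
    {
        # Assumed tag format; adjust to match the shipped .mom files.
        printf '.\\" -*- coding: %s; -*-\n' "$enc"
        cat "$src"
    } > "$out" &&
    # Copy back over the original so its permissions are preserved.
    cat "$out" > "$src" &&
    rm -f "$out"
}
```

Running `prepend_coding_tag doc/meintro_fr.me.in utf-8` and then repeating the `preconv -d` experiment above should show whether the tag is picked up.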
_______________________________________________________
Reply to this item at:
<https://savannah.gnu.org/bugs/?66287>
_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
