[bug #66287] help preconv guess the correct encoding of shipped files

Dave Thu, 03 Oct 2024 15:35:55 -0700

Update of bug #66287 (group groff):

                  Status:               Need Info => None                   
             Assigned to:                    barx => None


    _______________________________________________________

Follow-up Comment #2:

[comment #1 comment #1:]
> [comment #0 original submission:]
> > Unfortunately, preconv looks only at the first two lines of a file for
> > encoding information.
> 
> Only if the file isn't seekable...

preconv looks at 0 lines if the file isn't seekable, and 2 lines if it is.
Per its man page: "If the input stream is seekable, check the first two input
lines for a GNU Emacs file-local variable identifying the character encoding."
Under no circumstances will preconv find the tag if it appears after the
first two lines.
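
To see whether a given file's tag falls inside that two-line window, something
like the following sketch may help.  The function name is made up for
illustration, and the plain "coding:" substring match is an assumption that is
looser than whatever parsing preconv itself does.

```shell
# Hypothetical helper (not part of groff): print any "coding:" line found
# within the first two lines of the file, prefixed with its line number.
# Returns non-zero when neither of the first two lines mentions "coding:".
# Assumption: a plain substring match, not preconv's real tag parser.
coding_in_first_two() {
    head -n 2 -- "$1" | grep -n -e 'coding:'
}

# Example use (path taken from this report):
#   coding_in_first_two contrib/mm/groff_mmse.7.man || echo "tag not in first two lines"
```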

> `preconv` is a preprocessor....  For it to behave as you desire,

I don't desire any change in preconv.  I merely desire to change shipped groff
files to give preconv a greater chance of getting the encoding right.

> I think the status quo is the best we can do for shipped files
> without heavily refactoring preconv and potentially doing
> violence to the pipeline/filter concept.

Putting the "coding:" tag in the first two lines, where preconv will find it,
is a small change to two shipped files and no executables.

> Hmm, can't reproduce a problem here with _groff_ 1.23.0 or Git HEAD.

Ah, probably you have a uchardet library, which is preconv's next step after
checking the first two lines for an encoding tag.

> Can you do some experiments with `preconv -d` and see what it says?

Sure.  On a UTF-8 terminal, absent uchardet, preconv guesses the wrong
encoding for groff_mmse.7.man:

$ fgrep 'coding: ' contrib/mm/groff_mmse.7.man 
.\" coding: latin-1
$ echo $LC_CTYPE
en_US.utf8
$ preconv -d contrib/mm/groff_mmse.7.man > /dev/null
fallback encoding: 'UTF-8'
processing 'contrib/mm/groff_mmse.7.man'
  no coding tag
  could not detect encoding with uchardet
  encoding used: 'UTF-8'
  incomplete UTF-8 sequence(s) in input stream: replacing each such sequence
with 0xFFFD
$ preconv --version
GNU preconv (groff) version 1.23.0.1624-4d251-dirty with iconv support and
without uchardet support

And on a latin-1 terminal, it guesses the wrong encoding for
meintro_fr.me.in:

$ fgrep 'coding: ' doc/meintro_fr.me.in
.\" coding: utf-8
$ echo $LC_CTYPE
en_US.iso88591
$ preconv -d doc/meintro_fr.me.in > /dev/null
fallback encoding: 'ISO-8859-1'
processing 'doc/meintro_fr.me.in'
  no coding tag
  could not detect encoding with uchardet
  encoding used: 'ISO-8859-1'

Putting the coding: tag at the tops of the files, following the examples of
the two .mom files I cited, fixes both of these.
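
For checking whether other shipped files have the same problem, a rough audit
could look like the sketch below.  The function name is hypothetical, and
matching the bare string "coding:" is an assumption; it will also flag files
that merely mention that string in running text, so treat its output as
candidates to inspect by hand.

```shell
# Hypothetical helper (not part of groff): list each named file that contains
# "coding:" somewhere but not within its first two lines, i.e. where a
# two-line scan would miss the tag.
# Assumption: plain substring match, so false positives are possible.
late_coding_tags() {
    for f in "$@"; do
        if grep -q -e 'coding:' -- "$f" &&
           ! head -n 2 -- "$f" | grep -q -e 'coding:'; then
            printf '%s\n' "$f"
        fi
    done
}

# Example use (paths taken from this report):
#   late_coding_tags contrib/mm/groff_mmse.7.man doc/meintro_fr.me.in
```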


    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?66287>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
