http://qa.mandrakesoft.com/show_bug.cgi?id=4212





------- Additional Comments From [EMAIL PROTECTED]  2003-22-07 19:12 -------
Here's the response I've received from groff's maintainer (summary: it's by
design, it's the less pager that doesn't handle Unicode hyphens correctly, but a
more correct workaround than mine has already been implemented in SuSE):
----------------------------------------------------------

>> I've found a bug in Groff 1.18's UTF 8 device definitions. If that's
>> already known, please forgive me.  I couldn't locate any info in the
>> mailing list archive.


It is not a bug.  It is a well known `feature'.


>> I'm using Mandrake Linux 9.1, there's an installer option "Use
>> Unicode by default".


Your problem is related to Mandrake.  For example, SuSE has a
workaround in recent distributions (see below).


>> However, there is a problem with some UTF-8 locales with regards to
>> man pages.  The problem exhibits itself in hyphens (e.g. in the
>> option names) being displayed incorrectly and being unsearchable
>> (the "minus" character from the keyboard doesn't match them).


`Unsearchable' is the right word.


>> This is due to the fact that the groff utility that's used for
>> formatting pages (when called from the nroff shell script) formats
>> "\-" sequence in the source input as Unicode character "0x2212", and
>> "-" character as Unicode character "0x2010" instead of the
>> backward-compatible minus sign (which has code "0x002D" for
>> compatibility with ASCII).


This is intentional, and I won't change it.  From the Unicode point of
view my implementation is correct.  The very problem is that most
software doesn't support proper Unicode searching, that is, if you
enter a `-' on the keyboard, it should also find U+2212 and U+2010
(and some other characters too).
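
To illustrate what "proper Unicode searching" could mean here, the following
is a minimal Python sketch, not anything groff or less actually implements.
The names DASH_FOLD, fold_dashes and find_folded are hypothetical, and only
the two characters named above are folded; a real implementation would cover
more.

```python
# Hypothetical sketch: fold Unicode dash look-alikes to the ASCII
# hyphen-minus before comparing, so a '-' typed on the keyboard matches them.
DASH_FOLD = str.maketrans({
    "\u2010": "-",  # HYPHEN, what groff's utf8 device emits for "-"
    "\u2212": "-",  # MINUS SIGN, what it emits for "\-"
})


def fold_dashes(text: str) -> str:
    """Return text with U+2010 and U+2212 replaced by U+002D."""
    return text.translate(DASH_FOLD)


def find_folded(haystack: str, needle: str) -> int:
    """Return the index of needle in haystack after dash folding, or -1.

    Folding replaces one character with one character, so the returned
    index is also valid in the unfolded haystack.
    """
    return fold_dashes(haystack).find(fold_dashes(needle))


line = "  \u2212h, \u2212\u2212human\u2010readable"
print(line.find("-h"))          # -1: a plain search misses the option
print(find_folded(line, "-h"))  # 2: the folded search finds it
```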


>> The hyphen sign "0x2212" isn't handled properly by either the less
>> viewer, or the output terminal and as a result it's displayed with a
>> leading garbage character and can't be input from the keyboard when
>> searching in the manual page (so that e.g. it isn't possible to
>> search for "-h" option when reading the manual for ls).


Hmm, I've called xterm with

  LANG=en_US.UTF-8 \
  xterm -fn "-misc-fixed-medium-r-normal--20-200-75-75-c-100-iso10646-1" -u8

(I'm still using xterm from XFree86 4.2.0), and inside this xterm I
did

  man groff_man

and both the minus and hyphen are displayed correctly.  I have the
following environment settings:

  LESS="-MM -S -R"
  LESSBINFMT="*n%c"
  LESSCHARDEF=8bcccbcc18b.
  LESSKEY=/etc/lesskey.bin

So it seems to be a misconfiguration on your side.


>> The problem is solved by modifying groff's font descriptions for the
>> utf8 device so that the standard, ASCII-compatible "0x002D"
>> character code is used instead of "0x2212" for the hyphen sequence
>> ("\-").


As mentioned above, this is only a temporary workaround until other
software really supports Unicode.

In SuSE, the following code has been added to the troffrc
configuration file:

  .if '\*[.T]'utf8' \{\
  .  char \- \N'45'
  .  char  - \N'45'
  .  char  ' \N'39'
  .\}

which is currently the best solution.


    Werner




------- Reminder: -------
assigned_to: [EMAIL PROTECTED]
status: UNCONFIRMED
creation_date: 
description: 
In Mandrake 9.1, there's an installer option "Use Unicode by default".

This option causes UTF-8 versions of locales to be used, which finally moves the
distribution in the direction of unifying diverse character set encoding
standards into a single, well-known and well understood encoding: UTF-8.
However, there is a problem with some UTF-8 locales with regards to man pages.
The problem exhibits itself in hyphens (e.g. in the option names) being
displayed incorrectly and being unsearchable (the "minus" character from the
keyboard doesn't match them).

This is due to the fact that the groff utility that's used for formatting pages
(when called from the nroff shell script) formats "\-" sequence in the source
input as Unicode character "0x2212", and "-" character as Unicode character
"0x2010" instead of the backward-compatible minus sign (which has code "0x002D"
for compatibility with ASCII).
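
For reference, the three code points mentioned above can be inspected with a
few lines of Python. This is only an illustrative sketch (the describe helper
is a made-up name); it shows each character's official Unicode name and its
UTF-8 byte sequence, which is what the pager and terminal actually receive.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (code point label, Unicode name, UTF-8 bytes) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8"))


# U+002D is what the keyboard produces; the other two are what groff's
# utf8 device outputs for "-" and "\-" respectively.
for ch in ("\u002d", "\u2010", "\u2212"):
    print(*describe(ch))
```

Only U+002D is a single byte; the other two are three-byte sequences, which
is why a byte-wise search for "-" cannot match them.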

The character "0x2212" (strictly speaking the Unicode minus sign, not a hyphen)
isn't handled properly by either the less viewer or the output terminal. As a
result, it's displayed with a leading garbage character and can't be typed from
the keyboard when searching in the manual page (so that, e.g., it isn't possible
to search for the "-h" option when reading the manual for ls).

Among others, the "en_US.UTF-8" locale is influenced by this bug. OTOH, some
other locales (e.g. "pl") aren't influenced by it because the nroff wrapper has
a quick hack which switches from UTF-8 to legacy encodings (like ISO-8859-2) for
those locales, since man pages are still encoded in non-UTF8 charsets. See the
source of /usr/bin/nroff script for details.

The problem is solved by modifying groff's font descriptions for the utf8 device
so that the standard, ASCII-compatible "0x002D" character code is used instead
of "0x2212" for the hyphen sequence ("\-").

The font settings for the utf8 device are in the
/usr/share/groff/1.18.1/font/devutf8/ directory, in the files R (for regular
text), B (for bold), I and BI (for italic and bold-italic respectively).

I'm attaching a patch that makes the change.

Check that the patch applies cleanly:
# cd /
# patch -p1 --dry-run < path/to/patch/devutf8_hyphen.patch

Apply the patch:
# cd /
# patch -p1 < path/to/patch/devutf8_hyphen.patch

Test by executing "man ls" and "man mount" in the en_US.UTF-8 locale. All
hyphens should be OK, and you should be able to search for an option, e.g. "-v".
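
The manual check above could also be backed by a small script. The following
Python sketch is only an illustration under my own assumptions: the names
UNICODE_DASHES, count_unicode_dashes and is_clean are hypothetical, and the
sample strings stand in for formatted man page text that you would capture
yourself.

```python
# Hypothetical sketch: count the non-ASCII dashes left in formatted text.
UNICODE_DASHES = {
    "\u2010": "HYPHEN",
    "\u2212": "MINUS SIGN",
}


def count_unicode_dashes(text: str) -> dict:
    """Return how often each Unicode dash occurs in text, keyed by name."""
    return {name: text.count(ch) for ch, name in UNICODE_DASHES.items()}


def is_clean(text: str) -> bool:
    """Return True if text contains neither U+2010 nor U+2212."""
    return not any(count_unicode_dashes(text).values())


before = "ls \u2212l \u2212\u2212all"  # sample of unpatched output
after = "ls -l --all"                  # sample of patched output
print(count_unicode_dashes(before), is_clean(before))
print(count_unicode_dashes(after), is_clean(after))
```

With the patch applied, text produced for the utf8 device would be expected
to pass is_clean as far as the "\-" sequence is concerned.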

Please test it and, if you find it to be correct, apply it and (if needed)
forward it to the groff maintainers (http://www.gnu.org/directory/GNU/groff.html).
