Bug#514963: utf-8 man pages not handled properly sometimes

2009-08-18 Thread Colin Watson
tags 514963 fixed-upstream
thanks

On Thu, Feb 12, 2009 at 12:57:10PM +0100, Michal Čihař wrote:
> On Thu, 12 Feb 2009 11:43:42 +,
> Colin Watson wrote:
> > position 325 isn't representable in ISO-8859-2. Unfortunately, manconv
> > isn't currently smart enough to distinguish between "conversion failed
> > because this isn't valid UTF-8" and "conversion failed because this bit
> > of UTF-8 isn't available in the target encoding", and therefore it falls
> > back to recoding from ISO-8859-2 to ISO-8859-2 (i.e. a no-op) and then
> > you see the mess when it tries to interpret UTF-8 as if it were
> > ISO-8859-2.
> > 
> > I think it might be possible to fix this, albeit more slowly, by
> > recoding the page to UCS-4, which should always succeed as long as the
> > text matches the input encoding being tried, and then recoding from
> > there to ISO-8859-2 and just throwing away characters that don't fit.
> > Alternatively, by the time we've done that we might have a groff that
> > supports UTF-8 input!
> 
> Yes, that would be great.

Although we now have groff 1.20.1 in unstable so you should no longer
notice the effects of this bug, I've fixed it anyway for man-db 2.5.6.

Tue Aug 18 09:47:50 BST 2009  Colin Watson  

* src/manconv.c (try_iconv): Convert text to UTF-8 and then (if
  necessary) to the target encoding. This allows us to distinguish
  between "text not in input encoding" and "characters not
  representable in output encoding" (Debian bug #514963).
* src/tests/manconv-2: Add test for this and some other possible
  encoding-handling bugs in manconv.
* src/tests/Makefile.am (TESTS): Add manconv-2.
* NEWS: Document this.
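
The two-stage approach described in the changelog entry above can be
sketched roughly as follows. This is an illustrative Python sketch, not
the C code in src/manconv.c: the function name two_stage_convert and the
sample string are hypothetical, Python's internal Unicode string stands
in for the intermediate UTF-8 text, and errors="ignore" stands in for
iconv's //IGNORE suffix.

```python
def two_stage_convert(data: bytes, candidates, target: str) -> bytes:
    """Try each candidate input encoding in order (hypothetical helper)."""
    for source in candidates:
        try:
            # Stage 1: this fails only if the bytes are not valid in `source`.
            text = data.decode(source)
        except UnicodeDecodeError:
            continue  # "text not in input encoding": try the next candidate
        # Stage 2: "characters not representable in output encoding" are
        # dropped here instead of triggering a fallback to another source.
        return text.encode(target, errors="ignore")
    raise ValueError("input matches none of the candidate encodings")


# UTF-8 page containing a middle dot (U+00B7), which ISO-8859-2 lacks.
page = "gammu\u00b7slu\u017eba".encode("utf-8")
converted = two_stage_convert(page, ["utf-8", "iso-8859-2"], "iso-8859-2")
print(converted.decode("iso-8859-2"))
```

Because the first stage only tests whether the input is valid in the
candidate encoding, a page that is valid UTF-8 is never misread as
ISO-8859-2 just because one character cannot be written in the target.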

Thanks for your report,

-- 
Colin Watson   [cjwat...@debian.org]



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#514963: utf-8 man pages not handled properly sometimes

2009-02-12 Thread Michal Čihař
Hi

On Thu, 12 Feb 2009 11:43:42 +,
Colin Watson wrote:

> I tried to warn about this problem in the policy manual:
> 
>  Due to limitations in current implementations, all characters in the
>  manual page source should be representable in the usual legacy
>  encoding for that language, even if the file is actually encoded in
>  UTF-8.
> 
> ... and you can see the problem like this:
> 
>   $ iconv -f UTF-8 -t ISO-8859-2 gammu-smsd-mysql.7 >/dev/null
>   iconv: illegal input sequence at position 325
> 
> In other words, what's happening here is that the middle dot (U+00B7) at

Oops, those came from misusing a translation tool and are already
fixed upstream. That's why I had such a hard time finding what
actually causes the problem when trying to reproduce it with current
SVN. Unfortunately, the diff between those versions was too big to spot
these dots...

> position 325 isn't representable in ISO-8859-2. Unfortunately, manconv
> isn't currently smart enough to distinguish between "conversion failed
> because this isn't valid UTF-8" and "conversion failed because this bit
> of UTF-8 isn't available in the target encoding", and therefore it falls
> back to recoding from ISO-8859-2 to ISO-8859-2 (i.e. a no-op) and then
> you see the mess when it tries to interpret UTF-8 as if it were
> ISO-8859-2.
> 
> I think it might be possible to fix this, albeit more slowly, by
> recoding the page to UCS-4, which should always succeed as long as the
> text matches the input encoding being tried, and then recoding from
> there to ISO-8859-2 and just throwing away characters that don't fit.
> Alternatively, by the time we've done that we might have a groff that
> supports UTF-8 input!

Yes, that would be great.

> For the meantime, you can work around this problem by ensuring that your
> manual page passes 'iconv -f UTF-8 -t ISO-8859-2 gammu-smsd-mysql.7
> >/dev/null'.

Thanks, will do that.

-- 
Michal Čihař | http://cihar.com | http://blog.cihar.com




Bug#514963: utf-8 man pages not handled properly sometimes

2009-02-12 Thread Colin Watson
retitle 514963 manconv fails to distinguish between "text not in input encoding" and "characters not representable in output encoding"
found 514963 2.5.3-3
user man...@packages.debian.org
usertags 514963 target-2.5.5
thanks

On Thu, Feb 12, 2009 at 12:03:01PM +0100, Michal Čihař wrote:
> I noticed this issue when some translated man pages from the gammu package
> (currently in experimental) were not displayed properly. They are all
> properly encoded in UTF-8, and man has no problem showing them locally. But
> once they are installed into /usr/share/man/cs/, ISO-8859-2 detection
> sometimes fails and manconv starts to think that some of the pages are in
> ISO-8859-2 instead of UTF-8.
> 
> From the debug logs, I found out that /usr/lib/man-db/manconv -f
> utf-8:iso-8859-2 -t ISO-8859-2//IGNORE is called, and for some pages
> it thinks the man page is in ISO-8859-2 instead of UTF-8. If the man
> page is not in /usr/share/man/cs/, iso-8859-2 is absent from the -f
> (from) charset list and the man page is shown correctly.
> 
> I'm attaching example of such man page.

I tried to warn about this problem in the policy manual:

 Due to limitations in current implementations, all characters in the
 manual page source should be representable in the usual legacy
 encoding for that language, even if the file is actually encoded in
 UTF-8.

... and you can see the problem like this:

  $ iconv -f UTF-8 -t ISO-8859-2 gammu-smsd-mysql.7 >/dev/null
  iconv: illegal input sequence at position 325

In other words, what's happening here is that the middle dot (U+00B7) at
position 325 isn't representable in ISO-8859-2. Unfortunately, manconv
isn't currently smart enough to distinguish between "conversion failed
because this isn't valid UTF-8" and "conversion failed because this bit
of UTF-8 isn't available in the target encoding", and therefore it falls
back to recoding from ISO-8859-2 to ISO-8859-2 (i.e. a no-op) and then
you see the mess when it tries to interpret UTF-8 as if it were
ISO-8859-2.
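
The failure mode described above can be reproduced in miniature. The
following is a small Python sketch, not manconv itself; the variable
names are made up for illustration. It shows that the middle dot is
valid UTF-8 but has no ISO-8859-2 equivalent, and what results when its
UTF-8 bytes are read as if they were ISO-8859-2.

```python
# U+00B7 MIDDLE DOT is two bytes in UTF-8.
middle_dot = "\u00b7".encode("utf-8")

# A strict conversion to ISO-8859-2 fails, although the input is valid UTF-8.
try:
    "\u00b7".encode("iso-8859-2")
    representable = True
except UnicodeEncodeError:
    representable = False

# The no-op fallback effectively reads the UTF-8 bytes as ISO-8859-2,
# turning one character into two unrelated ones.
mojibake = middle_dot.decode("iso-8859-2")

print(representable, len(mojibake))
```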

I think it might be possible to fix this, albeit more slowly, by
recoding the page to UCS-4, which should always succeed as long as the
text matches the input encoding being tried, and then recoding from
there to ISO-8859-2 and just throwing away characters that don't fit.
Alternatively, by the time we've done that we might have a groff that
supports UTF-8 input!

For the meantime, you can work around this problem by ensuring that your
manual page passes 'iconv -f UTF-8 -t ISO-8859-2 gammu-smsd-mysql.7
>/dev/null'.
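
To locate the offending characters rather than just detect that some
exist, a check along these lines may help. It is a sketch in Python
under the assumption that the page has already been read as UTF-8 text;
the function name unrepresentable is hypothetical, and it reports
character offsets, which will not match the byte position that iconv
prints.

```python
def unrepresentable(text: str, encoding: str):
    """Return (character offset, character) pairs that `encoding` cannot hold."""
    problems = []
    for offset, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            problems.append((offset, char))
    return problems


# Sample line with a middle dot (U+00B7) between two Czech-safe words.
sample = "gammu\u00b7slu\u017eba"
print(unrepresentable(sample, "iso-8859-2"))
```

An empty result means every character fits in the legacy encoding, which
is the condition the iconv command above tests for.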

Thanks,

-- 
Colin Watson   [cjwat...@debian.org]





Bug#514963: utf-8 man pages not handled properly sometimes

2009-02-12 Thread Michal Čihař
Package: man-db
Version: 2.5.2-4
Severity: normal

Hi

I noticed this issue when some translated man pages from the gammu package
(currently in experimental) were not displayed properly. They are all
properly encoded in UTF-8, and man has no problem showing them locally. But
once they are installed into /usr/share/man/cs/, ISO-8859-2 detection
sometimes fails and manconv starts to think that some of the pages are in
ISO-8859-2 instead of UTF-8.

From the debug logs, I found out that /usr/lib/man-db/manconv -f
utf-8:iso-8859-2 -t ISO-8859-2//IGNORE is called, and for some pages
it thinks the man page is in ISO-8859-2 instead of UTF-8. If the man
page is not in /usr/share/man/cs/, iso-8859-2 is absent from the -f
(from) charset list and the man page is shown correctly.

I'm attaching example of such man page.

-- 
Michal Čihař | http://cihar.com | http://blog.cihar.com


-- System Information:
Debian Release: 5.0
  APT prefers unstable
  APT policy: (500, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.25.20-0.1-default (SMP w/2 CPU cores)
Locale: LANG=cs_CZ.UTF-8, LC_CTYPE=cs_CZ.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages man-db depends on:
ii  bsdmainutils           6.1.10            collection of more utilities from
ii  debconf [debconf-2.0]  1.5.24            Debian configuration management sy
ii  dpkg                   1.14.25           Debian package management system
ii  groff-base             1.18.1.1-21       GNU troff text-formatting system (
ii  libc6                  2.7-18            GNU C Library: Shared libraries
ii  libgdbm3               1.8.3-4           GNU dbm database routines (runtime
ii  zlib1g                 1:1.2.3.3.dfsg-12 compression library - runtime

man-db recommends no packages.

Versions of packages man-db suggests:
ii  elinks [www-browser]   0.12~pre2.dfsg0-1 advanced text-mode WWW browser
ii  epiphany-gecko [www-br 2.22.3-9          Intuitive GNOME web browser - Geck
pn  groff                                    (no description available)
ii  iceweasel [www-browser 3.0.6-1           lightweight web browser based on M
ii  less                   418-1             Pager program similar to more
ii  links [www-browser]    2.2-1             Web browser running in text mode
ii  w3m [www-browser]      0.5.2-2+b1        WWW browsable pager with excellent

-- debconf information:
* man-db/install-setuid: false

.\"***
.\"
.\" This file was generated with po4a. Translate the source file.
.\"
.\"***
.TH GAMMU\-SMSD\-MYSQL 7 "Leden 8, 2009" "Gammu 1.23.0" "Dokumentace Gammu"
.SH JMÉNO

.P
gammu\-smsd\-mysql·\-·služba pro gammu\-smsd(1) používající k ukládání zpráv
databázový server MySQL

.SH POPIS
gammu\-smsd(1) podporuje několik služeb. Aktuálně použitá je zvolená v
konfiguračním souboru gammu\-smsdrc(5).

Služba MYSQL ukládá všechna data na databázovém serveru MySQL, jehož
parametry jsou zadány v konfiguračním souboru (viz gammu\-smsdrc(5)· pro
popis těchto parametrů).

.SS "Přijímání zpráv"

Přijaté zprávy jsou ukládány v tabulce inbox.

.SS "Odesílání zpráv"

Zprávy k odeslání jsou čteny z tabulky outbox a jejich případné další části
z tabulky outbox_multipart.

.SS "Popis tabulek"

.TP 
\fBdaemon\fP

Informace o běžících démonech.

.TP 
\fBgammu\fP

Tato tabulka obsahuje jedinou hodnotu \- verzi databázového schématu.

.TP 
\fBinbox\fP

Tabulka, ve které jsou ukládány přijaté zprávy.

.TP 
\fBoutbox\fP

Zprávy určené k odeslání by měly být uloženy v této tabulce. Pokud zpráva
obsahuje více částí, další části jsou uloženy v tabulce outbox_multipart.

.TP 
\fBoutbox_multipart\fP

Data pro odchozí zprávy, které jsou z více částí.

.TP 
\fBphones\fP

Informace o připojených telefonech. Tato tabulka je pravidelně obnovována a
můžete v ní najít informace jako stav baterie nebo síla signálu.

.TP 
\fBsentitems\fP

Informace o odeslaných zprávách a jejich stavu, pokud jsou zapnuty
doručenky.

.TP 
\fBpbk\fP

SMSD tuto tabulku v současné době nepoužívá, je zde jen pro použití v
aplikaci.

.TP 
\fBpbk_groups\fP

SMSD tuto tabulku v současné době nepoužívá, je zde jen pro použití v
aplikaci.

.SH PŘÍKLAD

SQL skript potřebný pro vytvoření všech tabulek je obsažen v dokumentaci
Gammu. Ta také obsahuje pár PHP skriptů pro práci s databází.

.SH "DALŠÍ INFORMACE"
gammu\-smsd(1), gammu\-smsdrc(5), gammu(1), gammurc(5)
.SH AUTOR
gammu\-smsd a tuto manuálovou stránku napsal Michal Čihař
.
.SH COPYRIGHT
Copyright \(co 2009 Michal Čihař a další autoři.  Licence GPLv2: GNU GPL
verze 2 
.br
Tento program je volný software; můžet