Re: Bug#467249: FW by [EMAIL PROTECTED] : Bug#467249: man-db: over sensitive on the spell of locale

2008-03-01 Thread Colin Watson
On Thu, Feb 28, 2008 at 10:10:32PM +, brian m. carlson wrote:
> On Thu, Feb 28, 2008 at 09:30:55PM +, Colin Watson wrote:
> >On Thu, Feb 28, 2008 at 09:21:41PM +0100, Adam Borowski wrote:
> >man-db really does have some special-casing here. Trust me. It was
> >necessary at the time. There are a finite number of known aliases for
> >the very small number of locales in question, and until it becomes
> >unnecessary I will simply support those.
> >
> >(And I agree that it should go away, but can't easily just yet.)
> 
> Is there some way to query what character set a locale uses?

Yes, nl_langinfo (CODESET).
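
A minimal sketch of that query, assuming a Unix system where Python's
locale module is available (it wraps the same C call). The helper name
codeset_of is made up for illustration and is not part of man-db:

```python
import locale


def codeset_of(locale_name):
    """Return the codeset reported for locale_name, or None if the
    C library refuses to set that locale (hypothetical helper)."""
    saved = locale.setlocale(locale.LC_CTYPE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
        # Same call as nl_langinfo (CODESET) in C.
        return locale.nl_langinfo(locale.CODESET)
    except locale.Error:
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


if __name__ == "__main__":
    for name in ("zh_CN.UTF-8", "zh_CN.utf8", "C"):
        print(name, "->", codeset_of(name))
```

What gets printed depends on which locales are generated on the machine,
so no output is shown here.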

> If not, I think that man-db should default to UTF-8 (since that *is*
> the standard on Debian) and handle exceptions to that.  Processing an
> ASCII manpage as UTF-8 is a no-op.  And it's pretty easy to tell if
> something isn't valid UTF-8, and man-db can handle that as it normally
> would.

Please review the changes that I made in man-db 2.5.0 and 2.5.1, which I
think make this speculation unnecessary.

> AIUI, PostScript doesn't have UTF-8 support either, yet it seems to work 
> just fine.  Anyway, newer versions of groff have a conversion tool that 
> maps UTF-8 (or any arbitrary character set) input into glyph names.  But 
> Debian's groff has been very heavily patched with support for kinsoku 
> shori (prohibition character handling) and so we cannot simply update to 
> a newer version.  Believe me, if it were that easy, I'm sure Colin would 
> have done it.

Indeed so (I have tried before). I've had it with special-cased hacks to
groff - I want either something that goes upstream, or else to stick
with what we have until something *can* go upstream. I'm finished with
nasty typographically-unsound workarounds.

> >Are you working with Brian M. Carlson on this? He has been working on a
> >solution acceptable to groff upstream, which is, frankly, the only way I
> >want to go now. He has already made substantial progress with character
> >class support.
> 
> Please be aware that I have little time with school right now, so this 
> may not be implemented soon.  In fact, it may not be ready in time for 
> lenny's release.  I will sit down and work on it some more soon, but my 
> time is limited.  If people want more information on my plan of attack, 
> please do let me know, and I'll be happy to share.

Drat. Understood, though. I do follow the groff list (when my spam
filters haven't decided that it's statistically all spam ...) and do
hope to find time to build something useful on top of the work you've
posted there already.

Cheers,

-- 
Colin Watson   [EMAIL PROTECTED]


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]



Re: FW by [EMAIL PROTECTED] : Bug#467249: man-db: over sensitive on the spell of locale

2008-02-29 Thread Colin Watson
On Fri, Feb 29, 2008 at 08:52:12AM +0100, Petter Reinholdtsen wrote:
> [Michelle Konzack]
> > It seems there is a common problem with setting up the correct Unicode
> > locale on systems.  As the poster of the attached message has written,
> > he has set his locale to "zh_CN.utf8", which is wrong, but as he has
> > also written, the output of "locale -a" shows it.
> 
> 'locale -a' does not show that the locale is working; it just shows what
> is set in the environment.

locale will indicate whether a locale is working by means of the
messages it prints on standard error if it isn't:

  $ LANG=zz_ZZ.utf8 LC_ALL=zz_ZZ.utf8 locale
  locale: Cannot set LC_CTYPE to default locale: No such file or directory
  locale: Cannot set LC_MESSAGES to default locale: No such file or directory
  locale: Cannot set LC_ALL to default locale: No such file or directory
  LANG=zz_ZZ.utf8
  [...]
  $ LANG=zh_CN.utf8 LC_ALL=zh_CN.utf8 locale
  LANG=zh_CN.utf8
  [...]

(The same goes for 'locale -a', which in fact does not show what is set
in the environment at all, but does what it is documented to do: "Write
names of available locales.")

> Use 'locale charmap' to check that the locale is working and that the
> correct character set is selected.  If it returns 'ANSI_X3.4-1968'
> (which is ASCII), the locale isn't working (unless it is a locale that
> uses ASCII, which is not very likely).  If it shows 'UTF-8', the locale
> settings are working.

  $ LANG=zh_CN.utf8 locale charmap
  UTF-8

> In the case you describe, I believe the only fix is to get the user to
> stop using an invalid and non-existing locale,

It is neither an invalid nor a non-existing locale. It is not in the
form documented in /usr/share/i18n/SUPPORTED, but it is in the canonical
form into which glibc normalises it internally. See the
_nl_normalize_codeset function in glibc/intl/l10nflist.c.
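
As a rough sketch of what that normalisation amounts to (written from
memory of the glibc routine, not copied from it; the names
normalize_codeset and same_locale are hypothetical): only ASCII letters
and digits of the codeset are kept, letters are lower-cased, and a
codeset consisting only of digits gets an "iso" prefix.

```python
import string

_ASCII_ALNUM = set(string.ascii_letters + string.digits)


def normalize_codeset(codeset):
    """Approximate glibc's codeset normalisation (illustrative only)."""
    kept = [c for c in codeset if c in _ASCII_ALNUM]
    if kept and all(c.isdigit() for c in kept):
        # A purely numeric codeset such as "8859-1" becomes "iso88591".
        return "iso" + "".join(kept)
    return "".join(kept).lower()


def same_locale(a, b):
    """Compare two locale names, ignoring how the codeset is spelt."""
    def split(name):
        base, _, rest = name.partition(".")
        codeset, _, modifier = rest.partition("@")
        return base, normalize_codeset(codeset), modifier
    return split(a) == split(b)


if __name__ == "__main__":
    print(normalize_codeset("UTF-8"))
    print(same_locale("zh_CN.UTF-8", "zh_CN.utf8"))
```

A comparison along these lines treats zh_CN.UTF-8 and zh_CN.utf8 as the
same locale, which is the behaviour the bug report asks for.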

> and instead use the correct locale string, which I would suspect is
> 'zh_CN.UTF-8'.  The only workaround to this would be to rewrite glibc
> and locales, and it does not seem useful to me.

This is not a glibc bug. This is not a locales bug. This is a man-db
bug. I am both the Debian maintainer and the upstream maintainer and I
have accepted the bug (see the bug log). I wish people would stop trying
to argue that it isn't a bug or that it is a bug somewhere else.

Cheers,

-- 
Colin Watson   [EMAIL PROTECTED]





Re: FW by [EMAIL PROTECTED] : Bug#467249: man-db: over sensitive on the spell of locale

2008-02-28 Thread Petter Reinholdtsen
[Michelle Konzack]
> Hello Maintainers,
>
> It seems there is a common problem with setting up the correct Unicode
> locale on systems.  As the poster of the attached message has written,
> he has set his locale to "zh_CN.utf8", which is wrong, but as he has
> also written, the output of "locale -a" shows it.

'locale -a' does not show that the locale is working; it just shows what
is set in the environment.  Use 'locale charmap' to check that the
locale is working and that the correct character set is selected.  If
it returns 'ANSI_X3.4-1968' (which is ASCII), the locale isn't working
(unless it is a locale that uses ASCII, which is not very likely).  If
it shows 'UTF-8', the locale settings are working.

In the case you describe, I believe the only fix is to get the user to
stop using an invalid and non-existing locale, and instead use the
correct locale string, which I would suspect is 'zh_CN.UTF-8'.  The
only workaround to this would be to rewrite glibc and locales, and it
does not seem useful to me.

Happy hacking,
-- 
Petter Reinholdtsen





Re: Bug#467249: FW by [EMAIL PROTECTED] : Bug#467249: man-db: over sensitive on the spell of locale

2008-02-28 Thread brian m. carlson

On Thu, Feb 28, 2008 at 09:30:55PM +, Colin Watson wrote:

>On Thu, Feb 28, 2008 at 09:21:41PM +0100, Adam Borowski wrote:
>man-db really does have some special-casing here. Trust me. It was
>necessary at the time. There are a finite number of known aliases for
>the very small number of locales in question, and until it becomes
>unnecessary I will simply support those.
>
>(And I agree that it should go away, but can't easily just yet.)


Is there some way to query what character set a locale uses?  If not, I 
think that man-db should default to UTF-8 (since that *is* the standard 
on Debian) and handle exceptions to that.  Processing an ASCII manpage 
as UTF-8 is a no-op.  And it's pretty easy to tell if something isn't 
valid UTF-8, and man-db can handle that as it normally would.
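
A minimal sketch of that idea, not man-db's actual code: try UTF-8 first
and fall back to a legacy encoding only when decoding fails. The helper
name decode_page and the ISO-8859-1 fallback are assumptions made for
illustration.

```python
def decode_page(raw, fallback="iso-8859-1"):
    """Decode manpage bytes, preferring UTF-8 (hypothetical helper).

    Returns (text, encoding_used). Pure ASCII input is valid UTF-8, so it
    passes through the first branch unchanged.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: treat it as the legacy encoding instead.
        return raw.decode(fallback), fallback


if __name__ == "__main__":
    print(decode_page(b".TH LS 1"))
    print(decode_page("\u4e2d\u6587".encode("utf-8")))
    print(decode_page(b"caf\xe9"))
```

The caveat is that some legacy-encoded text can happen to be valid UTF-8,
so a check like this is a heuristic rather than a guarantee.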


Of course, I'm not contributing code, so my opinion is worth what you 
paid for it.



>> Too bad, groff doesn't have real Unicode support, and supports only several
>> special-cased locales (which may then be transcoded as UTF-8, but they still
>> get wrapped into their old-style charsets).


AIUI, PostScript doesn't have UTF-8 support either, yet it seems to work 
just fine.  Anyway, newer versions of groff have a conversion tool that 
maps UTF-8 (or any arbitrary character set) input into glyph names.  But 
Debian's groff has been very heavily patched with support for kinsoku 
shori (prohibition character handling) and so we cannot simply update to 
a newer version.  Believe me, if it were that easy, I'm sure Colin would 
have done it.



>Are you working with Brian M. Carlson on this? He has been working on a
>solution acceptable to groff upstream, which is, frankly, the only way I
>want to go now. He has already made substantial progress with character
>class support.


Please be aware that I have little time with school right now, so this 
may not be implemented soon.  In fact, it may not be ready in time for 
lenny's release.  I will sit down and work on it some more soon, but my 
time is limited.  If people want more information on my plan of attack, 
please do let me know, and I'll be happy to share.


In fact, I'm off to hack some more on groff right now.

--
brian m. carlson / brian with sandals: Houston, Texas, US
+1 713 440 7475 | http://crustytoothpaste.ath.cx/~bmc | My opinion only
troff on top of XML: http://crustytoothpaste.ath.cx/~bmc/code/thwack
OpenPGP: RSA v4 4096b 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187




Re: Bug#467249: FW by [EMAIL PROTECTED] : Bug#467249: man-db: over sensitive on the spell of locale

2008-02-28 Thread Colin Watson
On Thu, Feb 28, 2008 at 09:21:41PM +0100, Adam Borowski wrote:
> On Thu, Feb 28, 2008 at 10:42:30AM +0100, Michelle Konzack wrote:
> > It seems there is a common problem with setting up the correct Unicode
> > locale on systems.  As the poster of the attached message has written,
> > he has set his locale to "zh_CN.utf8", which is wrong, but as he has
> > also written, the output of "locale -a" shows it.
> 
> No matter which way the _locale_ is spelt (including "vi_VI" without even
> the word "utf" inside),

Irrelevant to this bug, as you'll see if you look at the code.

> the _charset_ is UTF-8.  No program should ever look at the locale's
> name, as it has more quirks like this.  Checking the charset will get
> you what you want.
> 
> > I think, there should be a global solution for this, since patching
> > man-db is worthless.
> 
> Actually, it's groff that's at fault here.  Mostly.

man-db really does have some special-casing here. Trust me. It was
necessary at the time. There are a finite number of known aliases for
the very small number of locales in question, and until it becomes
unnecessary I will simply support those.

(And I agree that it should go away, but can't easily just yet.)

Please don't drag groff into this bug. I really hate it when bugs drift
wildly off their original (accurately-constrained) topic despite
attempts to haul them back. It makes them impossible to keep organised.

> > $ LANG=zh_CN.UTF-8 man --warnings -l ls.zh_CN.1 > /dev/null
> > $ LANG=zh_CN.utf8 man --warnings -l ls.zh_CN.1 > /dev/null
> > :9: warning: can't find special character `u013F'
> > :9: warning: can't find special character `u011A'
> > :9: warning: can't find special character `u021D'
> > :11: warning: can't find special character `u0321'
> > :11: warning: can't find special character `u04AA'
> > :12: warning: can't find special character `u0461'
> > // snip
> 
> Too bad, groff doesn't have real Unicode support, and supports only several
> special-cased locales (which may then be transcoded as UTF-8, but they still
> get wrapped into their old-style charsets).
> 
> Instead of changing the special-case recognition, I would instead completely
> skip special-casing and just treat all characters equally.  Including, but
> not limited to, u013F and u0461.

Are you working with Brian M. Carlson on this? He has been working on a
solution acceptable to groff upstream, which is, frankly, the only way I
want to go now. He has already made substantial progress with character
class support.

Treating all characters equally will absolutely not be acceptable to
groff upstream. groff is a typesetter and needs to know about properties
of characters.
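
To illustrate the kind of property meant here, a deliberately simplified
sketch (it is not how groff implements character classes or kinsoku
shori; both helper names are hypothetical). It uses Python's unicodedata
module to ask whether a character is wide and whether a line may start
with it.

```python
import unicodedata


def is_wide(ch):
    """True if the character occupies two columns on a terminal."""
    return unicodedata.east_asian_width(ch) in {"W", "F"}


def may_break_before(ch):
    """Simplified kinsoku-style rule: do not start a line with closing
    or other punctuation (general categories Pe, Pf, Po)."""
    return unicodedata.category(ch) not in {"Pe", "Pf", "Po"}


if __name__ == "__main__":
    for ch in ("A", "\u4e2d", "\u3002"):
        print(repr(ch), is_wide(ch), may_break_before(ch))
```

A typesetter that treated every character identically could not make
either decision, which is the objection raised above.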

Cheers,

-- 
Colin Watson   [EMAIL PROTECTED]





Re: FW by [EMAIL PROTECTED] : Bug#467249: man-db: over sensitive on the spell of locale

2008-02-28 Thread Adam Borowski
On Thu, Feb 28, 2008 at 10:42:30AM +0100, Michelle Konzack wrote:
> Hello Maintainers,
> 
> It seems there is a common problem with setting up the correct Unicode
> locale on systems.  As the poster of the attached message has written,
> he has set his locale to "zh_CN.utf8", which is wrong, but as he has
> also written, the output of "locale -a" shows it.

No matter which way the _locale_ is spelt (including "vi_VI" without even
the word "utf" inside), the _charset_ is UTF-8.  No program should ever look
at the locale's name, as it has more quirks like this.  Checking the charset
will get you what you want.

> I think, there should be a global solution for this, since patching
> man-db is worthless.

Actually, it's groff that's at fault here.  Mostly.

> $ LANG=zh_CN.UTF-8 man --warnings -l ls.zh_CN.1 > /dev/null
> $ LANG=zh_CN.utf8 man --warnings -l ls.zh_CN.1 > /dev/null
> :9: warning: can't find special character `u013F'
> :9: warning: can't find special character `u011A'
> :9: warning: can't find special character `u021D'
> :11: warning: can't find special character `u0321'
> :11: warning: can't find special character `u04AA'
> :12: warning: can't find special character `u0461'
> // snip

Too bad, groff doesn't have real Unicode support, and supports only several
special-cased locales (which may then be transcoded as UTF-8, but they still
get wrapped into their old-style charsets).

Instead of changing the special-case recognition, I would instead completely
skip special-casing and just treat all characters equally.  Including, but
not limited to, u013F and u0461.


I've done some initial work on this, but unfortunately I'm dead busy right
now.  For "show me the code": working pan-Unicode groff and man-db, not yet
good enough even to submit to Colin, are at
deb-src http://angband.pl/debian sid main
(just don't look inside, they're too ugly to live).  On the upside, on tty
everything but RTL (Hebrew/Arabic) works just fine, including CJK,
Vietnamese, Devanagari and cuneiform, even all together in one manpage (try
"man utf8test").  What's lacking is support for html (should be trivial), ps
(aargh...) and other devices.
I'm afraid I can do nothing at least until late Friday...  but it looks like
we may be able to help Colin squash at least this bastion of locale
dependency.

-- 
1KB // Microsoft corollary to Hanlon's razor:
//  Never attribute to stupidity what can be
//  adequately explained by malice.





FW by [EMAIL PROTECTED] : Bug#467249: man-db: over sensitive on the spell of locale

2008-02-28 Thread Michelle Konzack
Hello Maintainers,

It seems there is a common problem with setting up the correct Unicode
locale on systems.  As the poster of the attached message has written,
he has set his locale to "zh_CN.utf8", which is wrong, but as he has
also written, the output of "locale -a" shows it.

I have many customers with the same problem...

I think, there should be a global solution for this, since patching
man-db is worthless.

Please discuss this problem and keep me in the MFT.

Thanks, Greetings and nice Day
Michelle Konzack
Systemadministrator
Tamay Dogan Network
Debian GNU/Linux Consultant



- Forwarded message from LI Daobing <[EMAIL PROTECTED]> -

Date: Sun, 24 Feb 2008 11:51:44 +0800
From: LI Daobing <[EMAIL PROTECTED]>
To: Debian Bug Tracking System <[EMAIL PROTECTED]>
Subject: Bug#467249: man-db: over sensitive on the spell of locale
X-PTS-Package: man-db
X-Debian-PR-Package: man-db

Package: man-db
Version: 2.5.1-2
Severity: important


When the locale is set to zh_CN.UTF-8, I can view a Chinese manpage with
"man -l ls.zh_CN.1", but when the locale is set to zh_CN.utf8, I get a lot
of garbage characters on the screen.

The name printed by "locale -a" is zh_CN.utf8, so many users set their
locale to utf8 instead of UTF-8.

ls.zh_CN.1 is attached.

Please fix it, thanks.

Alternatively, you can reproduce this bug with the following commands. Thanks.

$ LANG=zh_CN.UTF-8 man --warnings -l ls.zh_CN.1 > /dev/null
$ LANG=zh_CN.utf8 man --warnings -l ls.zh_CN.1 > /dev/null
:9: warning: can't find special character `u013F'
:9: warning: can't find special character `u011A'
:9: warning: can't find special character `u021D'
:11: warning: can't find special character `u0321'
:11: warning: can't find special character `u04AA'
:12: warning: can't find special character `u0461'
// snip


- End forwarded message ---





-- 
Linux-User #280138 with the Linux Counter, http://counter.li.org/
# Debian GNU/Linux Consultant #
Michelle Konzack     Apt. 917                  ICQ #328449886
+49/177/9351947      50, rue de Soultz         MSN LinuxMichi
+33/6/61925193       67100 Strasbourg/France   IRC #Debian (irc.icq.com)

