Re: Bug#467249: FW by [EMAIL PROTECTED] : Bug#467249: man-db: over sensitive on the spell of locale

2008-03-01 Thread Colin Watson
On Thu, Feb 28, 2008 at 10:10:32PM +0000, brian m. carlson wrote:
> On Thu, Feb 28, 2008 at 09:30:55PM +0000, Colin Watson wrote:
> >On Thu, Feb 28, 2008 at 09:21:41PM +0100, Adam Borowski wrote:
> >man-db really does have some special-casing here. Trust me. It was
> >necessary at the time. There are a finite number of known aliases for
> >the very small number of locales in question, and until it becomes
> >unnecessary I will simply support those.
> >
> >(And I agree that it should go away, but can't easily just yet.)
> 
> Is there some way to query what character set a locale uses?

Yes, nl_langinfo (CODESET).
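
For anyone unfamiliar with the call, here is a minimal sketch of that query using Python's wrapper around the C function. The helper name `locale_codeset` is made up for illustration; `locale.setlocale`, `locale.nl_langinfo` and `locale.CODESET` are standard-library names on Unix-like systems. What it reports depends on the C library and on which locales are installed.

```python
import locale


def locale_codeset(name):
    """Return the character set of locale `name` via nl_langinfo(CODESET).

    `name` is a locale name such as "zh_CN.utf8" or "C".
    Hypothetical helper, for illustration only.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # remember the current setting
    try:
        # Raises locale.Error if the locale is not available on this system.
        locale.setlocale(locale.LC_CTYPE, name)
        return locale.nl_langinfo(locale.CODESET)
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # always restore
```

On a system where both spellings are installed, `locale_codeset("zh_CN.utf8")` and `locale_codeset("zh_CN.UTF-8")` would be expected to report the same charset, which is the point of asking the C library rather than parsing the name.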

> If not, I think that man-db should default to UTF-8 (since that *is*
> the standard on Debian) and handle exceptions to that.  Processing an
> ASCII manpage as UTF-8 is a no-op.  And it's pretty easy to tell if
> something isn't valid UTF-8, and man-db can handle that as it normally
> would.

Please review the changes that I made in man-db 2.5.0 and 2.5.1, which I
think make this speculation unnecessary.

> AIUI, PostScript doesn't have UTF-8 support either, yet it seems to work 
> just fine.  Anyway, newer versions of groff have a conversion tool that 
> maps UTF-8 (or any arbitrary character set) input into glyph names.  But 
> Debian's groff has been very heavily patched with support for kinsoku 
> shori (prohibition character handling) and so we cannot simply update to 
> a newer version.  Believe me, if it were that easy, I'm sure Colin would 
> have done it.

Indeed so (I have tried before). I've had it with special-cased hacks to
groff - I want either something that goes upstream, or else to stick
with what we have until something *can* go upstream. I'm finished with
nasty typographically-unsound workarounds.

> >Are you working with Brian M. Carlson on this? He has been working on a
> >solution acceptable to groff upstream, which is, frankly, the only way I
> >want to go now. He has already made substantial progress with character
> >class support.
> 
> Please be aware that I have little time with school right now, so this 
> may not be implemented soon.  In fact, it may not be ready in time for 
> lenny's release.  I will sit down and work on it some more soon, but my 
> time is limited.  If people want more information on my plan of attack, 
> please do let me know, and I'll be happy to share.

Drat. Understood, though. I do follow the groff list (when my spam
filters haven't decided that it's statistically all spam ...) and do
hope to find time to build something useful on top of the work you've
posted there already.

Cheers,

-- 
Colin Watson   [EMAIL PROTECTED]


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]



Re: Bug#467249: FW by [EMAIL PROTECTED] : Bug#467249: man-db: over sensitive on the spell of locale

2008-02-28 Thread brian m. carlson

On Thu, Feb 28, 2008 at 09:30:55PM +0000, Colin Watson wrote:
>On Thu, Feb 28, 2008 at 09:21:41PM +0100, Adam Borowski wrote:
>man-db really does have some special-casing here. Trust me. It was
>necessary at the time. There are a finite number of known aliases for
>the very small number of locales in question, and until it becomes
>unnecessary I will simply support those.
>
>(And I agree that it should go away, but can't easily just yet.)


Is there some way to query what character set a locale uses?  If not, I 
think that man-db should default to UTF-8 (since that *is* the standard 
on Debian) and handle exceptions to that.  Processing an ASCII manpage 
as UTF-8 is a no-op.  And it's pretty easy to tell if something isn't 
valid UTF-8, and man-db can handle that as it normally would.
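
A minimal sketch of that idea in Python, which is not man-db's actual code: try a strict UTF-8 decode first and fall back to a legacy encoding only when it fails. The function name `decode_page` and the ISO-8859-1 default for `fallback` are assumptions chosen for illustration.

```python
def decode_page(data: bytes, fallback: str = "iso-8859-1"):
    """Decode manual page bytes, preferring UTF-8.

    Returns (text, encoding_used). Pure ASCII input is already valid
    UTF-8, so it goes through the first branch unchanged.
    Hypothetical helper, for illustration only.
    """
    try:
        # Strict decoding raises UnicodeDecodeError on invalid sequences.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumed legacy encoding; a real tool would pick this per locale.
        return data.decode(fallback), fallback
```

This is only a heuristic: a short run of legacy-encoded bytes can occasionally happen to form valid UTF-8, so the check is strong evidence rather than proof.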


Of course, I'm not contributing code, so my opinion is worth what you 
paid for it.



>> Too bad, groff doesn't have real Unicode support, and supports only several
>> special-cased locales (which may then be transcoded as UTF-8, but they still
>> get wrapped into their old-style charsets).


AIUI, PostScript doesn't have UTF-8 support either, yet it seems to work 
just fine.  Anyway, newer versions of groff have a conversion tool that 
maps UTF-8 (or any arbitrary character set) input into glyph names.  But 
Debian's groff has been very heavily patched with support for kinsoku 
shori (prohibition character handling) and so we cannot simply update to 
a newer version.  Believe me, if it were that easy, I'm sure Colin would 
have done it.



>Are you working with Brian M. Carlson on this? He has been working on a
>solution acceptable to groff upstream, which is, frankly, the only way I
>want to go now. He has already made substantial progress with character
>class support.


Please be aware that I have little time with school right now, so this 
may not be implemented soon.  In fact, it may not be ready in time for 
lenny's release.  I will sit down and work on it some more soon, but my 
time is limited.  If people want more information on my plan of attack, 
please do let me know, and I'll be happy to share.


In fact, I'm off to hack some more on groff right now.

--
brian m. carlson / brian with sandals: Houston, Texas, US
+1 713 440 7475 | http://crustytoothpaste.ath.cx/~bmc | My opinion only
troff on top of XML: http://crustytoothpaste.ath.cx/~bmc/code/thwack
OpenPGP: RSA v4 4096b 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187




Re: Bug#467249: FW by [EMAIL PROTECTED] : Bug#467249: man-db: over sensitive on the spell of locale

2008-02-28 Thread Colin Watson
On Thu, Feb 28, 2008 at 09:21:41PM +0100, Adam Borowski wrote:
> On Thu, Feb 28, 2008 at 10:42:30AM +0100, Michelle Konzack wrote:
> > It seems there is a common problem while setting up the correct UNICODE
> > locale in systems.  As the poster in the attached message has written,
> > he has set up his locale to "zh_CN.utf8" which is wrong, but as he has
> > written too, the output of "locale -a" shows it.
> 
> No matter which way the _locale_ is spelt (including "vi_VI" without even the
> word "utf" inside),

Irrelevant to this bug, as you'll see if you look at the code.

> the _charset_ is UTF-8.  No program ever should look at the locale's
> name, as it has more quirks like this.  Checking the charset will get
> you what you want.
> 
> > I think, there should be a global solution for this, since patching
> > man-db is worthless.
> 
> Actually, it's groff that's at fault here.  Mostly.

man-db really does have some special-casing here. Trust me. It was
necessary at the time. There are a finite number of known aliases for
the very small number of locales in question, and until it becomes
unnecessary I will simply support those.

(And I agree that it should go away, but can't easily just yet.)
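
To illustrate what treating "utf8" and "UTF-8" as spellings of the same codeset can look like, here is a rough Python sketch. It is not man-db's code and does not reproduce its alias list: it swaps in Python's own codec alias registry (`codecs.lookup`) as a stand-in, and the helper name `codeset_from_name` is hypothetical.

```python
import codecs


def codeset_from_name(locale_name: str):
    """Extract and canonicalise the codeset part of a locale name.

    "zh_CN.utf8" and "zh_CN.UTF-8" both give "utf-8". Returns None when
    the name carries no codeset part or Python does not know the codeset.
    Hypothetical helper, for illustration only.
    """
    base = locale_name.split("@", 1)[0]  # drop a modifier such as "@euro"
    if "." not in base:
        return None  # nothing to normalise in a name like "zh_CN"
    codeset = base.split(".", 1)[1]
    try:
        return codecs.lookup(codeset).name  # resolves spelling aliases
    except LookupError:
        return None
```

A name-based check like this says nothing about locale names that carry no codeset at all; for those, only the C library's answer (nl_langinfo (CODESET)) is reliable.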

Please don't drag groff into this bug. I really hate it when bugs drift
wildly off their original (accurately-constrained) topic despite
attempts to haul them back. It makes them impossible to keep organised.

> > $ LANG=zh_CN.UTF-8 man --warnings -l ls.zh_CN.1 > /dev/null
> > $ LANG=zh_CN.utf8 man --warnings -l ls.zh_CN.1 > /dev/null
> > :9: warning: can't find special character `u013F'
> > :9: warning: can't find special character `u011A'
> > :9: warning: can't find special character `u021D'
> > :11: warning: can't find special character `u0321'
> > :11: warning: can't find special character `u04AA'
> > :12: warning: can't find special character `u0461'
> > // snip
> 
> Too bad, groff doesn't have real Unicode support, and supports only several
> special-cased locales (which may then be transcoded as UTF-8, but they still
> get wrapped into their old-style charsets).
> 
> Instead of changing the special-case recognition, I would instead completely
> skip special-casing and just treat all characters equally.  Including, but
> not limited to, u013F and u0461.

Are you working with Brian M. Carlson on this? He has been working on a
solution acceptable to groff upstream, which is, frankly, the only way I
want to go now. He has already made substantial progress with character
class support.

Treating all characters equally will absolutely not be acceptable to
groff upstream. groff is a typesetter and needs to know about properties
of characters.

Cheers,

-- 
Colin Watson   [EMAIL PROTECTED]

