Re: Bug#467249: man-db/groff and locales

2008-03-02 Thread Adam Borowski
I see that my (inept but working) patches are not welcome right now.  So,
I'll leave groff alone; just let me answer the issues raised.

On Sat, Mar 01, 2008 at 11:56:28PM +0000, Colin Watson wrote:
 On Fri, Feb 29, 2008 at 12:32:29AM +0100, Adam Borowski wrote:
  On Thu, Feb 28, 2008 at 10:10:32PM +0000, brian m. carlson wrote:
   On Thu, Feb 28, 2008 at 09:30:55PM +0000, Colin Watson wrote:
   man-db really does have some special-casing here. Trust me. It was
   necessary at the time. There are a finite number of known aliases for
   the very small number of locales in question, and until it becomes
   unnecessary I will simply support those.
  
  Of course, the encodings of _source_ pages are something we can't get away from.
  
  But for all intermediate steps, I don't see any reason to not go to a
  well-known encoding, do everything there and finally convert to whatever
  locale is set -- and you don't even need to name the charset there.
  
  Special-casing _output_ locales seems quite strange to me.
 
 /* An ugly special case is needed here. The utf8 device normally
  * takes ISO-8859-1 input. However, with the multibyte patch, when
  * recoding from CJK character sets it takes UTF-8 input instead.
  * This is evil, but there's not much that can be done about it
  * apart from waiting for groff 2.0.
  */

The idea is to make it take UTF-8 input _always_.  Either hard-coded as in
Red Hat, or settable with -Kcharset as in upstream groff.

   (And I agree that it should go away, but can't easily just yet.)
  
  Could you tell us what keeps us with all the old cruft?
 
 Sanity. I am not interested in making the groff package even more
 incredibly difficult to update to a new upstream in the future.

Having the outside API (i.e., -K and the expected charsets) be more in line with
current upstream sounds like something that would make upgrading _easier_.
If most of the groff 1.18 patches cannot be ported to 1.19, I would call at
least bringing the outside interfaces together a good thing.

 
 Official groff does not yet support proper CJK typography. Until that is
 in place it is not a viable replacement.

Yet it does support every other language save for Arabic and Hebrew.  And
unless I'm missing something, it's just word-wrapping that's amiss.  I'm not
sure what the extent of kinsoku shori is -- but if its description in
Wikipedia is accurate, it could be done by injecting a separator character
like U+200B ZERO WIDTH SPACE between characters that allow a line break, and
then using the normal rules for scripts with explicit spaces.
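
To make the idea concrete, here is a rough sketch in Python (not groff code).
It assumes that "wide" means the three ranges groff 1.18 special-cases, and
the two prohibition sets are tiny hypothetical samples, nowhere near a full
kinsoku shori rule set:

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def is_wide_cjk(ch):
    """Rough test: the three ranges groff 1.18 treats as double width."""
    cp = ord(ch)
    return (0x2E00 <= cp <= 0x9FFF
            or 0xAC00 <= cp <= 0xD7AF
            or 0xFF00 <= cp <= 0xFFEF)


# Hypothetical, deliberately incomplete sample sets.
# Ideographic comma, full stop, closing brackets: must not start a line.
NO_BREAK_BEFORE = set("\u3001\u3002\u300d\u300f\uff09")
# Opening brackets: must not end a line.
NO_BREAK_AFTER = set("\u300c\u300e\uff08")


def inject_break_opportunities(text):
    """Insert ZWSP between adjacent wide characters where a break is allowed."""
    out = []
    for i, ch in enumerate(text):
        if i > 0:
            prev = text[i - 1]
            if (is_wide_cjk(prev) and is_wide_cjk(ch)
                    and prev not in NO_BREAK_AFTER
                    and ch not in NO_BREAK_BEFORE):
                out.append(ZWSP)
        out.append(ch)
    return "".join(out)
```

After this pass a formatter that only breaks lines at spaces could treat ZWSP
as one more (invisible) space.  Whether that is enough for proper Japanese
typography is exactly the part I'm unsure about.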

But again, if you have already done some research, I'd better leave you
alone.

 
 I think I'm fairly clearly active in man-db; could you please accept that
 I have my reasons beyond laziness,

Uhm... neither Brian Carlson nor I have accused you of laziness.  Heck, I
think that you have done a bunch of great work in man-db recently --
allowing uniformly encoded sources in particular.  I just offered some help
with following through -- full Unicode support would be a logical next step.

 and look up what has been said on this topic over and over again in the
 past?)

Indeed, I've only looked at past debian-devel threads and the BTS;
there's probably lots of wisdom I missed on the groff lists.  I was misled
by the impression I took from a previous discussion that groff 1.19 is a
no-no for us.
 

 I am honestly not willing to support a backport of -K/preconv to our
 groff package,

That's sad, but if groff 1.19 will indeed be deemed acceptable soon, you're
probably right.

 I appreciate your research into this. But please, I beg you, focus your
 energies on upstream. There is really not much left to do; Brian's done
 the heavy lifting of character class support (or most of it, anyway),
 and now somebody just needs to take the specialised typographic rules
 and make them sufficiently general for inclusion.
 
 I hope you will take my advice born of nearly seven years of maintaining
 groff in Debian.

Ok.  Since groff is a really tangled, complex beast that would take a lot of
time to understand well enough, I think I'll go pester someone else now.
There are a lot of other places with flaky non-ASCII support in Debian.  For
example, if you use a JFS partition, d-i fails to add iocharset=utf8 to fstab,
which makes non-ASCII filenames break badly.  And so on, and so on...


Cheers and schtuff,
-- 
1KB // Microsoft corollary to Hanlon's razor:
//  Never attribute to stupidity what can be
//  adequately explained by malice.


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Re: Bug#467249: man-db/groff and locales

2008-03-01 Thread Colin Watson
On Fri, Feb 29, 2008 at 12:32:29AM +0100, Adam Borowski wrote:
 On Thu, Feb 28, 2008 at 10:10:32PM +0000, brian m. carlson wrote:
  On Thu, Feb 28, 2008 at 09:30:55PM +0000, Colin Watson wrote:
  man-db really does have some special-casing here. Trust me. It was
  necessary at the time. There are a finite number of known aliases for
  the very small number of locales in question, and until it becomes
  unnecessary I will simply support those.
 
 Of course, the encodings of _source_ pages are something we can't get away from.
 
 But for all intermediate steps, I don't see any reason to not go to a
 well-known encoding, do everything there and finally convert to whatever
 locale is set -- and you don't even need to name the charset there.
 
 Special-casing _output_ locales seems quite strange to me.

/* An ugly special case is needed here. The utf8 device normally
 * takes ISO-8859-1 input. However, with the multibyte patch, when
 * recoding from CJK character sets it takes UTF-8 input instead.
 * This is evil, but there's not much that can be done about it
 * apart from waiting for groff 2.0.
 */

  (And I agree that it should go away, but can't easily just yet.)
 
 Could you tell us what keeps us with all the old cruft?

Sanity. I am not interested in making the groff package even more
incredibly difficult to update to a new upstream in the future.

Official groff does not yet support proper CJK typography. Until that is
in place it is not a viable replacement.

(I'm also really fed up of explaining this again and again. I think I'm
fairly clearly active in man-db; could you please accept that I have my
reasons beyond laziness, and look up what has been said on this topic
over and over again in the past?)

  Is there some way to query what character set a locale uses?  If not, I 
  think that man-db should default to UTF-8 (since that *is* the standard 
  on Debian) and handle exceptions to that.  Processing an ASCII manpage 
  as UTF-8 is a no-op.  And it's pretty easy to tell if something isn't 
  valid UTF-8, and man-db can handle that as it normally would.
 
 AOL.  I agree with Brian 100%.  As you already added code to detect if the
 source is valid UTF-8 or not, all that needs to be done is using UTF-8
 instead of ISO-8859-1 as the intermediate format.

There is a lot more to it than that or upstream would be recommending
that already; the version of groff we are using does not have the
internal capabilities that are needed (our changes are a band-aid at
best). Reading this thread may be a helpful summary:

  http://www.mail-archive.com/[EMAIL PROTECTED]/msg01378.html

In short, I am not interested in doing this on top of our current groff
package. I want to do it on top of a whole new upstream that actually
has the features we need with an upstream maintainer prepared to support
them (note that nobody has stepped forward to do any maintenance work on
the Debian multibyte patch for years). Doing that without also
forward-porting our patches for features such as kinsoku shori would
introduce regressions. Forward-porting these patches hackily is
incredibly difficult (I've tried). Forward-porting those patches in a
way that is consistent with upstream's direction (i.e. reimplementing
them) is essentially Brian's work.

 I see.  So, in the very short term, groff would be able to output PostScript
 only for limited locales.  That's no regression.
 
 And on tty and html, which are 99.99% of uses of man, suddenly all bugs like
 man iso-8859-2, Kanji names in English manpages, regressions in KOI-8R
 (#424655) or no support for Indic scripts would disappear overnight with a
 minimal patch.

I would love to have these new features, but I want them on top of a
sane, supportable upstream release. I am sick of the mess we have now
and don't want to make it worse. I also want to actually have us
contribute something useful to groff upstream beyond confused users
showing up on their mailing list and having to be told that this is a
weirdness of Debian's groff package.

I am honestly not willing to support a backport of -K/preconv to our
groff package, with all of the other Unicode support that should come
along with it in order to do a good job. I also enjoy maintaining this
stuff too much to resign. Therefore I must encourage you to help
upstream with the last few pieces needed in order to get this all merged
properly.

Finally, I suspect you'll find that e.g. the specialised kerning code
that's in Debian's groff for proper rendering of ASCII/EUC-JP boundaries
will cause problems with generalised UTF-8 rendering unless properly
forward-ported. I'm fairly sure there are more such examples; that's
just the first I could find easily having been away from that particular
code for a while. If you don't speak all the languages in question, you
might not notice this kind of thing on casual inspection of the output.
Typography involves more than just getting all the characters into 

Re: Bug#467249: man-db/groff and locales

2008-02-28 Thread Adam Borowski
On Thu, Feb 28, 2008 at 10:10:32PM +0000, brian m. carlson wrote:
 On Thu, Feb 28, 2008 at 09:30:55PM +0000, Colin Watson wrote:
 On Thu, Feb 28, 2008 at 09:21:41PM +0100, Adam Borowski wrote:
 man-db really does have some special-casing here. Trust me. It was
 necessary at the time. There are a finite number of known aliases for
 the very small number of locales in question, and until it becomes
 unnecessary I will simply support those.

Of course, the encodings of _source_ pages are something we can't get away from.

But for all intermediate steps, I don't see any reason to not go to a
well-known encoding, do everything there and finally convert to whatever
locale is set -- and you don't even need to name the charset there.

Special-casing _output_ locales seems quite strange to me.
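
As an illustration of "not naming the charset", here is a small Python
sketch of the pipeline I have in mind.  It is not man-db code: recode_page
and locale_codeset are names made up for this example, and the formatting
step is only a comment.  The one real facility it relies on is the C
library's nl_langinfo(CODESET), which is Unix-only and which Python exposes
as locale.nl_langinfo:

```python
import locale


def locale_codeset():
    """Ask the C library which charset the current locale uses."""
    # Adopt LANG / LC_* from the environment; raises locale.Error if the
    # requested locale is not installed.
    locale.setlocale(locale.LC_CTYPE, "")
    return locale.nl_langinfo(locale.CODESET)


def recode_page(raw, source_encoding, output_codeset=None):
    """Source charset -> UTF-8 intermediate -> whatever the locale wants."""
    # Only the source encoding has to be known by name.
    internal = raw.decode(source_encoding).encode("utf-8")

    # (All formatting would happen here, always on UTF-8.)

    if output_codeset is None:
        output_codeset = locale_codeset()
    # Characters the output charset lacks become '?' instead of aborting.
    return internal.decode("utf-8").encode(output_codeset, errors="replace")
```

With output_codeset left at None, the caller never mentions a charset for
the output side at all; the explicit parameter is there only so the function
can be tried without changing the environment.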

 (And I agree that it should go away, but can't easily just yet.)

Could you tell us what keeps us with all the old cruft?  By adding a
groff-1.19-like -Kcharset option to our groff, I was able to replace all
special-casing except for the source encoding.  In my ugly preliminary code
most functions in src/encodings.c start with 'return UTF-8;' -- and it seems
to work just fine in all locales I tested, which include zh_CN.GB2312 and
similar.

It's very likely I missed something, as I hardly know anything about groff,
but at first glance, at least, ripping away most of the file seems to be a
win.

 Is there some way to query what character set a locale uses?  If not, I 
 think that man-db should default to UTF-8 (since that *is* the standard 
 on Debian) and handle exceptions to that.  Processing an ASCII manpage 
 as UTF-8 is a no-op.  And it's pretty easy to tell if something isn't 
 valid UTF-8, and man-db can handle that as it normally would.

AOL.  I agree with Brian 100%.  As you already added code to detect if the
source is valid UTF-8 or not, all that needs to be done is using UTF-8
instead of ISO-8859-1 as the intermediate format.
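
The detection step is small enough to sketch.  This is Python rather than
man-db's actual C code, and both the function name and the ISO-8859-1
fallback are assumptions made for the example:

```python
def decode_page(raw, legacy_encoding="iso-8859-1"):
    """Try UTF-8 first; fall back to the page's legacy charset.

    Returns the decoded text and the encoding that was actually used.
    Pure ASCII is valid UTF-8, so it takes the first branch unchanged.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(legacy_encoding), legacy_encoding
```

For instance, decode_page(b"caf\xe9") falls back to ISO-8859-1, because a
lone 0xE9 byte is not a complete UTF-8 sequence.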

 Too bad, groff doesn't have real Unicode support, and supports only
 several special-cased locales (which may then be transcoded as UTF-8,
 but they still get wrapped into their old-style charsets).
 
 AIUI, PostScript doesn't have UTF-8 support either, yet it seems to work 
 just fine.  Anyway, newer versions of groff have a conversion tool that 
 maps UTF-8 (or any arbitrary character set) input into glyph names.
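
 A minimal sketch of that mapping, in Python and only as an approximation of
 what the real tool does: ASCII passes through, everything else is written
 as a groff \[uXXXX] glyph-name escape.  The function name is made up for
 the example, and the real converter handles more than this (detecting the
 input encoding, for one):

```python
def to_groff_escapes(text):
    """Replace every non-ASCII character with a \\[uXXXX] glyph escape."""
    out = []
    for ch in text:
        cp = ord(ch)
        # At least four uppercase hex digits; more for code points > U+FFFF.
        out.append(ch if cp < 0x80 else "\\[u%04X]" % cp)
    return "".join(out)
```

 The output is plain ASCII, which is why the formatter itself never needs to
 read UTF-8 for this to work.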

I see.  So, in the very short term, groff would be able to output PostScript
only for limited locales.  That's no regression.

And on tty and html, which are 99.99% of uses of man, suddenly all bugs like
man iso-8859-2, Kanji names in English manpages, regressions in KOI-8R
(#424655) or no support for Indic scripts would disappear overnight with a
minimal patch.


 Are you working with Brian M. Carlson on this?

Not yet, I preferred to have some code to show first.

 He has been working on a solution acceptable to groff upstream, which is,
 frankly, the only way I want to go now. He has already made substantial
 progress with character class support.

Sounds great.  And that's the way to go.

For example, when selecting width, groff 1.18 does:
  u2E00..u9FFF 48 0
  uAC00..uD7AF 48 0
  uFF00..uFFEF 48 0
which supports only CJK.

My temporary solution has a hard-coded table (to minimize patching code):
  u0100..u10FF 24 0
  u1100..u115F 48 0
  u1160..u2328 24 0
  u2329..u232A 48 0
  u232B..u2E7F 24 0
  [...]
  u10000..u1FFFD 24 0
  u20000..u2FFFD 48 0
  u30000..u3FFFD 48 0
  u40000..u100000 24 0
This covers all the other code ranges as well, and stays forward-compatible
for when proper character class support and other goodies go in.
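
For illustration only, this is how such a range table can be searched.  It
is a Python sketch rather than the actual patch; it uses just the first five
ranges quoted above, reads the first number as the width (48 being double
24), and ignores the trailing 0 column, whose meaning I'm not asserting
here:

```python
import bisect

# (first code point, last code point, width), sorted and non-overlapping.
WIDTH_RANGES = [
    (0x0100, 0x10FF, 24),
    (0x1100, 0x115F, 48),
    (0x1160, 0x2328, 24),
    (0x2329, 0x232A, 48),
    (0x232B, 0x2E7F, 24),
]
_STARTS = [start for start, _end, _width in WIDTH_RANGES]


def glyph_width(cp, default=None):
    """Binary-search the table; return default when no range covers cp."""
    i = bisect.bisect_right(_STARTS, cp) - 1
    if i >= 0:
        _start, end, width = WIDTH_RANGES[i]
        if cp <= end:
            return width
    return default
```

A sorted table plus a binary search keeps the lookup cheap however many
ranges are listed, so growing from three CJK ranges to full coverage costs
next to nothing at run time.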
 


 Please be aware that I have little time with school right now, so this 
 may not be implemented soon.  In fact, it may not be ready in time for 
 lenny's release.  I will sit down and work on it some more soon, but my 
 time is limited.  If people want more information on my plan of attack, 
 please do let me know, and I'll be happy to share.

Likewise, I'm nearly unavailable for the next two days.  I'll be able to
help later, but bear in mind that groff is not my area of expertise, and I
plan only minimal changes.


