Re: Bug#467249: man-db/groff and locales
I see that my (inept but working) patches are not welcome right now. So, I'll leave groff alone; just let me answer the issues raised.

On Sat, Mar 01, 2008 at 11:56:28PM +, Colin Watson wrote:
On Fri, Feb 29, 2008 at 12:32:29AM +0100, Adam Borowski wrote:
On Thu, Feb 28, 2008 at 10:10:32PM +, brian m. carlson wrote:
On Thu, Feb 28, 2008 at 09:30:55PM +, Colin Watson wrote:

man-db really does have some special-casing here. Trust me. It was necessary at the time. There are a finite number of known aliases for the very small number of locales in question, and until it becomes unnecessary I will simply support those.

Of course, encodings for _source_ pages are something we can't get away from. But for all intermediate steps, I don't see any reason to not go to a well-known encoding, do everything there and finally convert to whatever locale is set -- and you don't even need to name the charset there. Special-casing _output_ locales seems quite strange to me.

/* An ugly special case is needed here. The utf8 device normally
 * takes ISO-8859-1 input. However, with the multibyte patch, when
 * recoding from CJK character sets it takes UTF-8 input instead.
 * This is evil, but there's not much that can be done about it
 * apart from waiting for groff 2.0. */

The idea is to make it take UTF-8 input _always_. Either hard-coded as in Red Hat, or settable with -Kcharset as in upstream groff.

(And I agree that it should go away, but can't easily just yet.)

Could you tell us what keeps us with all the old cruft?

Sanity. I am not interested in making the groff package even more incredibly difficult to update to a new upstream in the future.

Having the outside API (i.e., -K and expected charsets) be more in line with current upstream sounds like something that would make upgrading _easier_. If most of the groff-1.18 patches cannot be ported to 1.19, I would label at least bringing the outside interfaces together a good thing.

Official groff does not yet support proper CJK typography. Until that is in place it is not a viable replacement.

Yet it does support every other language save for Arabic and Hebrew. And unless I'm missing something, it's just word-wrapping that's amiss. I'm not sure what the extent of kinsoku shori is -- but if its description in Wikipedia is accurate, it could be done by injecting a separator character like U+200B ZERO WIDTH SPACE between chars that allow word wrap and then using the normal rules for scripts with explicit spaces. But again, if you have already done some research, I'd better leave you alone.

I think I'm fairly clearly active in man-db; could you please accept that I have my reasons beyond laziness,

Uhm... neither I nor Brian Carlson has accused you of laziness. Heck, I think that you have done a bunch of great work in man-db recently -- allowing uniformly encoded sources in particular. I just offered some help with following through -- full Unicode support would be a logical next step.

and look up what has been said on this topic over and over again in the past?)

Indeed, I've taken a look only at past debian-devel threads and the BTS; there's probably lots of wisdom I missed on new groff lists. I was fooled by an impression I had taken from a previous discussion that groff-1.19 is a no-no for us.

I am honestly not willing to support a backport of -K/preconv to our groff package,

That's sad, but if indeed groff-1.19 will be deemed acceptable soon, you're probably right.

I appreciate your research into this. But please, I beg you, focus your energies on upstream. There is really not much left to do; Brian's done the heavy lifting of character class support (or most of it, anyway), and now somebody just needs to take the specialised typographic rules and make them sufficiently general for inclusion. I hope you will take my advice born of nearly seven years of maintaining groff in Debian.

Ok. Since groff is a really tangled, complex beast that would take a lot of time to understand well enough, I think I'll go pester someone else now. There are a lot of other places with flaky non-ASCII support in Debian. Like, if you use a JFS partition, d-i fails to add iocharset=utf8 in fstab, making non-ASCII filenames lose badly. And so on, so on...

Cheers and schtuff,
--
1KB // Microsoft corollary to Hanlon's razor:
// Never attribute to stupidity what can be
// adequately explained by malice.
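The zero-width-space idea mentioned in this message can be illustrated with a rough Python sketch. This is not groff code and not anything from the patches under discussion; the function name and the two tiny character sets are hypothetical, and real kinsoku shori rule sets are considerably larger. It only shows the principle: mark every allowed break between wide characters with U+200B, so that a wrapper written for space-separated scripts can treat those marks as break opportunities.

```python
import unicodedata

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# Illustrative subsets only (assumption): real kinsoku rule sets are larger.
NO_LINE_START = set("、。，．）」』！？ー")  # must not begin a line
NO_LINE_END = set("（「『")                  # must not end a line


def is_wide(ch):
    """True for East Asian Wide/Fullwidth characters (CJK ideographs, kana, ...)."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def inject_break_opportunities(text):
    """Insert ZWSP between adjacent wide characters where a line break is allowed."""
    out = []
    for prev, cur in zip(text, text[1:]):
        out.append(prev)
        if (is_wide(prev) and is_wide(cur)
                and prev not in NO_LINE_END
                and cur not in NO_LINE_START):
            out.append(ZWSP)
    out.append(text[-1:])  # last character, or "" for empty input
    return "".join(out)
```

Text without wide characters passes through unchanged, so English pages would be unaffected by such a pass.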
Re: Bug#467249: man-db/groff and locales
On Fri, Feb 29, 2008 at 12:32:29AM +0100, Adam Borowski wrote:
On Thu, Feb 28, 2008 at 10:10:32PM +, brian m. carlson wrote:
On Thu, Feb 28, 2008 at 09:30:55PM +, Colin Watson wrote:

man-db really does have some special-casing here. Trust me. It was necessary at the time. There are a finite number of known aliases for the very small number of locales in question, and until it becomes unnecessary I will simply support those.

Of course, encodings for _source_ pages are something we can't get away from. But for all intermediate steps, I don't see any reason to not go to a well-known encoding, do everything there and finally convert to whatever locale is set -- and you don't even need to name the charset there. Special-casing _output_ locales seems quite strange to me.

/* An ugly special case is needed here. The utf8 device normally
 * takes ISO-8859-1 input. However, with the multibyte patch, when
 * recoding from CJK character sets it takes UTF-8 input instead.
 * This is evil, but there's not much that can be done about it
 * apart from waiting for groff 2.0. */

(And I agree that it should go away, but can't easily just yet.)

Could you tell us what keeps us with all the old cruft?

Sanity. I am not interested in making the groff package even more incredibly difficult to update to a new upstream in the future. Official groff does not yet support proper CJK typography. Until that is in place it is not a viable replacement.

(I'm also really fed up of explaining this again and again. I think I'm fairly clearly active in man-db; could you please accept that I have my reasons beyond laziness, and look up what has been said on this topic over and over again in the past?)

Is there some way to query what character set a locale uses? If not, I think that man-db should default to UTF-8 (since that *is* the standard on Debian) and handle exceptions to that. Processing an ASCII manpage as UTF-8 is a no-op. And it's pretty easy to tell if something isn't valid UTF-8, and man-db can handle that as it normally would.

AOL. I agree with Brian 100%. As you already added code to detect if the source is valid UTF-8 or not, all that needs to be done is using UTF-8 instead of ISO-8859-1 as the intermediate format.

There is a lot more to it than that or upstream would be recommending that already; the version of groff we are using does not have the internal capabilities that are needed (our changes are a band-aid at best). Reading this thread may be a helpful summary:

http://www.mail-archive.com/[EMAIL PROTECTED]/msg01378.html

In short, I am not interested in doing this on top of our current groff package. I want to do it on top of a whole new upstream that actually has the features we need with an upstream maintainer prepared to support them (note that nobody has stepped forward to do any maintenance work on the Debian multibyte patch for years). Doing that without also forward-porting our patches for features such as kinsoku shori would introduce regressions. Forward-porting these patches hackily is incredibly difficult (I've tried). Forward-porting those patches in a way that is consistent with upstream's direction (i.e. reimplementing them) is essentially Brian's work.

I see. So, in the very short term, groff would be able to output PostScript only for limited locales. That's no regression. And on tty and html, which are 99.99% of uses of man, suddenly all bugs like man iso-8859-2, Kanji names in English manpages, regressions in KOI-8R (#424655) or no support for Indic scripts would disappear overnight with a minimal patch.

I would love to have these new features, but I want them on top of a sane, supportable upstream release. I am sick of the mess we have now and don't want to make it worse. I also want to actually have us contribute something useful to groff upstream beyond confused users showing up on their mailing list and having to be told that this is a weirdness of Debian's groff package.

I am honestly not willing to support a backport of -K/preconv to our groff package, with all of the other Unicode support that should come along with it in order to do a good job. I also enjoy maintaining this stuff too much to resign. Therefore I must encourage you to help upstream with the last few pieces needed in order to get this all merged properly.

Finally, I suspect you'll find that e.g. the specialised kerning code that's in Debian's groff for proper rendering of ASCII/EUC-JP boundaries will cause problems with generalised UTF-8 rendering unless properly forward-ported. I'm fairly sure there are more such examples; that's just the first I could find easily having been away from that particular code for a while. If you don't speak all the languages in question, you might not notice this kind of thing on casual inspection of the output. Typography involves more than just getting all the characters into
Re: Bug#467249: man-db/groff and locales
On Thu, Feb 28, 2008 at 10:10:32PM +, brian m. carlson wrote:
On Thu, Feb 28, 2008 at 09:30:55PM +, Colin Watson wrote:
On Thu, Feb 28, 2008 at 09:21:41PM +0100, Adam Borowski wrote:

man-db really does have some special-casing here. Trust me. It was necessary at the time. There are a finite number of known aliases for the very small number of locales in question, and until it becomes unnecessary I will simply support those.

Of course, encodings for _source_ pages are something we can't get away from. But for all intermediate steps, I don't see any reason to not go to a well-known encoding, do everything there and finally convert to whatever locale is set -- and you don't even need to name the charset there. Special-casing _output_ locales seems quite strange to me.

(And I agree that it should go away, but can't easily just yet.)

Could you tell us what keeps us with all the old cruft? By adding a groff-1.19-like -Kcharset to our groff, I was able to replace all special-casing except for source. In my ugly preliminary code most functions in src/encodings.c start with 'return UTF-8;' -- and it seems to work just fine in all locales I tested, which include zh_CN.GB2312 and similar.

It's very likely I missed something, I hardly know anything about groff, but at least at first glance, ripping away most of the file seems to be a win.

Is there some way to query what character set a locale uses? If not, I think that man-db should default to UTF-8 (since that *is* the standard on Debian) and handle exceptions to that. Processing an ASCII manpage as UTF-8 is a no-op. And it's pretty easy to tell if something isn't valid UTF-8, and man-db can handle that as it normally would.

AOL. I agree with Brian 100%. As you already added code to detect if the source is valid UTF-8 or not, all that needs to be done is using UTF-8 instead of ISO-8859-1 as the intermediate format.
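The "try UTF-8 first, fall back otherwise" approach suggested above, together with the question about querying a locale's character set, can be sketched in Python roughly as follows. This is not man-db's actual code (man-db is written in C); the function names and the ISO-8859-1 default fallback are assumptions made for illustration. On POSIX systems the charset query corresponds to the C call nl_langinfo(CODESET), which Python exposes as locale.nl_langinfo(locale.CODESET).

```python
import locale


def decode_page(raw, fallback="iso-8859-1"):
    """Decode page source: strict UTF-8 first, else a legacy charset.

    Pure ASCII input is valid UTF-8, so treating it as UTF-8 changes nothing.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


def locale_charset():
    """Charset of the user's locale (POSIX only: wraps nl_langinfo(CODESET))."""
    locale.setlocale(locale.LC_CTYPE, "")
    return locale.nl_langinfo(locale.CODESET)


def render_for_locale(raw, charset):
    """Use Unicode as the intermediate form and convert only at the very end."""
    text = decode_page(raw)
    # Characters the target charset cannot represent become "?".
    return text.encode(charset, errors="replace")
```

A caller would pass `locale_charset()` as the `charset` argument; nothing in between needs to know which locale is in use.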
Too bad, groff doesn't have real Unicode support, and supports only several special-cased locales (which may then be transcoded as UTF-8, but they still get wrapped into their old-style charsets).

AIUI, PostScript doesn't have UTF-8 support either, yet it seems to work just fine. Anyway, newer versions of groff have a conversion tool that maps UTF-8 (or any arbitrary character set) input into glyph names.

I see. So, in the very short term, groff would be able to output PostScript only for limited locales. That's no regression. And on tty and html, which are 99.99% of uses of man, suddenly all bugs like man iso-8859-2, Kanji names in English manpages, regressions in KOI-8R (#424655) or no support for Indic scripts would disappear overnight with a minimal patch.

Are you working with Brian M. Carlson on this?

Not yet, I preferred to have some code to show first.

He has been working on a solution acceptable to groff upstream, which is, frankly, the only way I want to go now. He has already made substantial progress with character class support.

Sounds great. And that's the way to go. For example, when selecting width, groff 1.18 does:

u2E00..u9FFF 48 0
uAC00..uD7AF 48 0
uFF00..uFFEF 48 0

which supports only CJK. My temporary solution has a hard-coded table (to minimize patching code):

u0100..u10FF 24 0
u1100..u115F 48 0
u1160..u2328 24 0
u2329..u232A 48 0
u232B..u2E7F 24 0
[...]
u1..u1FFFD 24 0
u2..u2FFFD 48 0
u3..u3FFFD 48 0
u4..u10 24 0

This supports all other code ranges, and is forward-compatible with proper character class support and other goodies when they go in.

Please be aware that I have little time with school right now, so this may not be implemented soon. In fact, it may not be ready in time for lenny's release. I will sit down and work on it some more soon, but my time is limited. If people want more information on my plan of attack, please do let me know, and I'll be happy to share.

Likewise, I'm nearly unavailable for the next two days. I'll be able to help later, but bear in mind that groff is not my area of expertise, and I plan only minimal changes.

--
1KB // Microsoft corollary to Hanlon's razor:
// Never attribute to stupidity what can be
// adequately explained by malice.
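For readers unfamiliar with range-based width tables like the one quoted in this message, here is a small Python sketch of how such a lookup might work. It is not groff's implementation; the function name is hypothetical, only the first five rows of the table are copied, and the sketch assumes the second column is the width being selected while ignoring the third column. Code points outside the excerpt return None rather than a guessed width.

```python
from bisect import bisect_right

# (first, last, width) rows taken from the start of the quoted table.
WIDTH_RANGES = [
    (0x0100, 0x10FF, 24),
    (0x1100, 0x115F, 48),
    (0x1160, 0x2328, 24),
    (0x2329, 0x232A, 48),
    (0x232B, 0x2E7F, 24),
]
# Sorted range starts, used for binary search.
STARTS = [first for first, _, _ in WIDTH_RANGES]


def width_for(codepoint):
    """Return the table width for a code point, or None if it is not covered."""
    i = bisect_right(STARTS, codepoint) - 1  # last range starting at or before it
    if i >= 0:
        first, last, width = WIDTH_RANGES[i]
        if codepoint <= last:
            return width
    return None
```

For example, `width_for(0x1100)` (a Hangul Jamo code point) yields the double width 48, while `width_for(0x0142)` yields 24.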