Re: lists.debian.org de-localization
Hi,

(Remember, the topic is that http://lists.debian.org pages sometimes use 8bit characters, which may break all content after the character when east Asian users browse the pages.)

From: Josip Rodin [EMAIL PROTECTED]
Subject: Re: lists.debian.org de-localization
Date: Sun, 12 Jan 2003 04:14:45 +0100

> On Sun, Jan 12, 2003 at 10:38:52AM +0900, Tomohiro KUBOTA wrote:
> > However, I don't think this can be a solution now because it will take a very long time that the version will be stable, then the stable version will be adopted into unstable/testing version of Debian distribution, then the distribution will become stable (released), and then the stable distribution will be adopted to master.debian.org .
>
> Actually, we use a non-.deb mhonarc on lists.d.o so this isn't a problem per se.

A new version of MHonArc (2.6.0) was released recently, which I think can solve all encoding-related problems by converting everything into UTF-8.

> This, on the other hand, is a hassle to handle (backporting or installation into subdirs). master.d.o is scheduled to be upgraded to woody after samosa. That's all I know. shrug

Any new information?

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: lists.debian.org de-localization
Hi,

From: Josip Rodin [EMAIL PROTECTED]
Subject: Re: lists.debian.org de-localization
Date: Sun, 12 Jan 2003 04:14:45 +0100

> This, on the other hand, is a hassle to handle (backporting or installation into subdirs). master.d.o is scheduled to be upgraded to woody after samosa. That's all I know. shrug

This is good news. Then I will work later on support for various encodings.

Anyway, I don't expect the new master.d.o to have the development version of MHonArc (with the encoding-assuming feature for raw 8bit headers), even if it comes from a non-Debian-package version. Thus I think we will need some method to handle raw 8bit headers.

Here is a filter to convert 8bit characters (assumed to be KOI8-R) to &#xxx; expressions, which I wrote by imitating iso8859.pl, CharEnt.pm, and UTF8.pm . This filter is used for raw 7bit/8bit strings. Since the 7bit part of KOI8-R is identical to ASCII, it doesn't harm legal ASCII headers.

The filter is to be installed into org/lists.debian.org/mhonarc/share/mhonarc/MHonArc/DEBIAN.pm and doesn't depend on the version of MHonArc or Debian.

## DEBIAN.pm by Tomohiro KUBOTA [EMAIL PROTECTED]
##
## CHARSETCONVERTER module that assumes the input string to be KOI8-R
## and converts it into &#xxx; expressions, where xxx is the decimal
## Unicode codepoint.
package DEBIAN;

%US_ASCII_To_Ent = (
    #--
    # Hex Code  Entity Ref   # ISO external entity and description
    #--
    0x22, "&quot;",     # ISOnum : Quotation mark
    0x26, "&amp;",      # ISOnum : Ampersand
    0x3C, "&lt;",       # ISOnum : Less-than sign
    0x3E, "&gt;",       # ISOnum : Greater-than sign
);

%KOI8_R_To_Ent = (
    #--
    # Hex Code  Entity Ref   # ISO external entity and description
    #--
    0x80, "&#9472;",    # BOX DRAWINGS LIGHT HORIZONTAL
    0x81, "&#9474;",    # BOX DRAWINGS LIGHT VERTICAL
    0x82, "&#9484;",    # BOX DRAWINGS LIGHT DOWN AND RIGHT
    0x83, "&#9488;",    # BOX DRAWINGS LIGHT DOWN AND LEFT
    0x84, "&#9492;",    # BOX DRAWINGS LIGHT UP AND RIGHT
    0x85, "&#9496;",    # BOX DRAWINGS LIGHT UP AND LEFT
    0x86, "&#9500;",    # BOX DRAWINGS LIGHT VERTICAL AND RIGHT
    0x87, "&#9508;",    # BOX DRAWINGS LIGHT VERTICAL AND LEFT
    0x88, "&#9516;",    # BOX DRAWINGS LIGHT DOWN AND HORIZONTAL
    0x89, "&#9524;",    # BOX DRAWINGS LIGHT UP AND HORIZONTAL
    0x8a, "&#9532;",    # BOX DRAWINGS LIGHT VERTICAL AND HORIZONTAL
    0x8b, "&#9600;",    # UPPER HALF BLOCK
    0x8c, "&#9604;",    # LOWER HALF BLOCK
    0x8d, "&#9608;",    # FULL BLOCK
    0x8e, "&#9612;",    # LEFT HALF BLOCK
    0x8f, "&#9616;",    # RIGHT HALF BLOCK
    0x90, "&#9617;",    # LIGHT SHADE
    0x91, "&#9618;",    # MEDIUM SHADE
    0x92, "&#9619;",    # DARK SHADE
    0x93, "&#8992;",    # TOP HALF INTEGRAL
    0x94, "&#9632;",    # BLACK SQUARE
    0x95, "&#8729;",    # BULLET OPERATOR
    0x96, "&#8730;",    # SQUARE ROOT
    0x97, "&#8776;",    # ALMOST EQUAL TO
    0x98, "&#8804;",    # LESS-THAN OR EQUAL TO
    0x99, "&#8805;",    # GREATER-THAN OR EQUAL TO
    0x9a, "&#160;",     # NO-BREAK SPACE
    0x9b, "&#8993;",    # BOTTOM HALF INTEGRAL
    0x9c, "&#176;",     # DEGREE SIGN
    0x9d, "&#178;",     # SUPERSCRIPT TWO
    0x9e, "&#183;",     # MIDDLE DOT
    0x9f, "&#247;",     # DIVISION SIGN
    0xa0, "&#9552;",    # BOX DRAWINGS DOUBLE HORIZONTAL
    0xa1, "&#9553;",    # BOX DRAWINGS DOUBLE VERTICAL
    0xa2, "&#9554;",    # BOX DRAWINGS DOWN SINGLE AND RIGHT DOUBLE
    0xa3, "&#1105;",    # CYRILLIC SMALL LETTER IO
    0xa4, "&#9555;",    # BOX DRAWINGS DOWN DOUBLE AND RIGHT SINGLE
    0xa5, "&#9556;",    # BOX DRAWINGS DOUBLE DOWN AND RIGHT
    0xa6, "&#9557;",    # BOX DRAWINGS DOWN SINGLE AND LEFT DOUBLE
    0xa7, "&#9558;",    # BOX DRAWINGS DOWN DOUBLE AND LEFT SINGLE
    0xa8, "&#9559;",    # BOX DRAWINGS DOUBLE DOWN AND LEFT
    0xa9, "&#9560;",    # BOX DRAWINGS UP SINGLE AND RIGHT DOUBLE
    0xaa, "&#9561;",    # BOX DRAWINGS UP DOUBLE AND RIGHT SINGLE
    0xab, "&#9562;",    # BOX DRAWINGS DOUBLE UP AND RIGHT
    0xac, "&#9563;",    # BOX DRAWINGS UP SINGLE AND LEFT DOUBLE
    0xad, "&#9564;",    # BOX DRAWINGS UP DOUBLE AND LEFT SINGLE
    0xae, "&#9565;",    # BOX DRAWINGS DOUBLE UP AND LEFT
    0xaf, "&#9566;",    # BOX DRAWINGS VERTICAL SINGLE AND RIGHT DOUBLE
    0xb0, "&#9567;",    # BOX DRAWINGS VERTICAL DOUBLE AND RIGHT SINGLE
    0xb1, "&#9568;",    # BOX DRAWINGS DOUBLE VERTICAL AND RIGHT
    0xb2, "&#9569;",    # BOX
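The archived copy of DEBIAN.pm breaks off partway through the table. As a rough cross-check of the idea, here is a minimal sketch in Python (not the Perl that MHonArc itself loads) that treats raw header bytes as KOI8-R and emits decimal &#xxx; references. The function name koi8r_to_ncr is hypothetical, and the sketch relies on Python's built-in koi8_r codec rather than a hand-written table.

```python
import html


def koi8r_to_ncr(raw: bytes) -> str:
    """Assume raw bytes are KOI8-R and return pure-ASCII HTML text.

    Hypothetical helper for illustration; not part of MHonArc.
    """
    # Decode with Python's built-in KOI8-R codec.
    text = raw.decode("koi8_r")
    # Escape the markup-significant ASCII characters (&, <, >, quotes).
    escaped = html.escape(text, quote=True)
    # Replace every remaining non-ASCII character with a decimal
    # numeric character reference such as &#9472;.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(koi8r_to_ncr(b"\x80"))   # first entry of the table above
    print(koi8r_to_ncr(b"\xa3"))   # CYRILLIC SMALL LETTER IO
```

Because the output contains only ASCII, a browser that guesses a Japanese or Chinese encoding for the page has no stray lead byte that could swallow the markup following the name.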
Re: lists.debian.org de-localization
Hi,

From: Tomohiro KUBOTA [EMAIL PROTECTED]
Subject: Re: lists.debian.org de-localization
Date: Tue, 07 Jan 2003 21:45:05 +0900 (JST)

> I think more important problem is how to deal with raw 8bit mail headers without encoding specification or encodings which are not supported by the current set-up but used in Debian mailing lists (GB2312, BIG5, and KOI8-R).

I heard that the current development version of MHonArc has a feature to assume that raw 8bit characters are in some specified encoding. However, I don't think this can be a solution now, because it will take a very long time for that version to become stable, then for the stable version to be adopted into the unstable/testing version of the Debian distribution, then for the distribution to become stable (released), and then for the stable distribution to be adopted on master.debian.org .

Anyway, I can write a KOI8-R -> SGML entity (or &#xxx; expression) filter very easily. My plan is to assume raw 8bit characters to be KOI8-R Russian, and I think this can be achieved easily.

The remaining problem is how to handle unsupported encodings such as GB2312 and Big5. I found that the current set-up of lists.debian.org mhonarc converts GB2312 and Big5 into raw 8bit streams (or, one could say, 16bit streams, because these encodings are multibyte), and they also cause encoding conflicts and loss of the following </em>. Thus I'd like these encodings to be converted into &#xxx; expressions. (Also, debian-esperanto people may want to use ISO-8859-3 and UTF-8.)

I found master.debian.org:/org/lists.debian.org/mhonarc/share/mhonarc/MHonArc/UTF8.pm but I don't think this will work well, because it depends on the Unicode::MapUTF8 module, which has been available as the libunicode-maputf8-perl package only since Woody, whereas master.debian.org runs Potato. Then, I might be able to write an original filter using libtext-unicode-perl, but that package is also available only since Woody. I don't know any other ways.

Any suggestions?

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: lists.debian.org de-localization
On Sun, Jan 12, 2003 at 10:38:52AM +0900, Tomohiro KUBOTA wrote:
> However, I don't think this can be a solution now because it will take a very long time that the version will be stable, then the stable version will be adopted into unstable/testing version of Debian distribution, then the distribution will become stable (released), and then the stable distribution will be adopted to master.debian.org .

Actually, we use a non-.deb mhonarc on lists.d.o so this isn't a problem per se.

> I found master.debian.org:/org/lists.debian.org/mhonarc/share/mhonarc/MHonArc/UTF8.pm but I don't think this will work well because it depends on Unicode::MapUTF8 module which is available as libunicode-maputf8-perl package since Woody, where master.debian.org is Potato. Then, I might be able to write an original filter using libtext-unicode-perl but the package is also available since Woody.

This, on the other hand, is a hassle to handle (backporting or installation into subdirs). master.d.o is scheduled to be upgraded to woody after samosa. That's all I know. shrug

--
2. That which causes joy or happiness.
Re: lists.debian.org de-localization
On Tue, Jan 07, 2003 at 09:29:33AM +0900, Tomohiro KUBOTA wrote:
> I have an idea about an easy modification to old list pages. Add the following line to all http://lists.debian.org/*/*/threads.html , http://lists.debian.org/*/*/maillist.html , http://lists.debian.org/*/*/subject.html , and http://lists.debian.org/*/*/author.html

Hm, but doesn't the section on character sets cover the mails themselves as well? There are a bit under twenty thousand indices, which is a large amount by itself, but the mails themselves are really problematic.

--
2. That which causes joy or happiness.
Re: lists.debian.org de-localization
Hi,

From: Josip Rodin [EMAIL PROTECTED]
Subject: Re: lists.debian.org de-localization
Date: Tue, 7 Jan 2003 11:41:36 +0100

> Hm, but doesn't the section on character sets cover the mails themselves as well? There are a bit under twenty thousand indices, which is a large amount by itself, but the mails themselves are really problematic.

I think the priority of the mails themselves is much lower than that of the index pages, because an 8bit character in only one mail affects the index for the whole month of a list.

Anyway, I don't think the regeneration or modification is needed *now*, because our solution is not yet perfect. It is obviously a waste of machine time to modify the pages both *now* and again in the future when a better solution becomes available.

I think a more important problem is how to deal with raw 8bit mail headers without an encoding specification, or with encodings which are not supported by the current set-up but are used in Debian mailing lists (GB2312, BIG5, and KOI8-R).

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: lists.debian.org de-localization
Hi,

From: Stephen J. Turnbull [EMAIL PROTECTED]
Subject: Re: lists.debian.org de-localization (Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages)
Date: Sun, 05 Jan 2003 16:10:02 +0900

> This is a fairly small sample (about 100 subscribers, 25 regular posters). However, the Russian spam I've seen (isn't it funny how you can identify spam even though you can't read the language it's written in?) invariably fails either the addressee tests (implicit, too many), the known spam software test, or the HTML-only test. So (FWIW) I've disabled the 8-bit test and so far the Russian subscribers are happy.

IMO, in such a case, allowing raw 8bit mails is better than rejecting them (i.e., its merit is larger than its demerit).

Again, speaking about lists.debian.org, my original idea is to assume all raw 8bit characters to be ISO-8859-1, though I don't know whether this is technically possible. In this case, Russian people will be annoyed when browsing lists.debian.org pages.

If it is possible to set an assumed encoding for each mailing list, that of the debian-russian list will be KOI8-R, that of debian-chinese-gb will be GB2312, and so on, and all others ISO-8859-1. I also hope there are some UTF-8 filters. (There seems to be a writer who uses a UTF-8 name (From:) in debian-esperanto.)

However, I don't know at all about MHonArc

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: lists.debian.org de-localization
On Mon, Jan 06, 2003 at 10:09:11AM +0900, Tomohiro KUBOTA wrote:
> Hi,
>
> From: [EMAIL PROTECTED] (Denis Barbier)
> Subject: Re: lists.debian.org de-localization (Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages)
> Date: Sun, 5 Jan 2003 15:33:41 +0100
>
> > > Why not use iso_8859::str2sgml; instead of mhonarc::htmlize for iso-8859-1?
> > [...]
> > Sounds like a very good idea.
>
> Who should I ask for this modification?
>
> (permission of klecker:/org/www.debian.org/cron/people_scripts/people.pl is 755 owner=joy group=debwww, but I don't know whether klecker is the right place to do because I checked klecker just by chance. I also checked gluck(=www.debian.org) and master but they don't have the file. Where can I find a document on how /org/* are processed?)

'joy' is Josip Rodin. He recently put these files into CVS:
  http://cvs.debian.org/cron/?cvsroot=webwml
but I do not know how permissions are managed. IMO you should wait for his (or any other webmaster's) approval before applying changes.

There is no extra documentation; you have to read the scripts and the README files to learn how it works. Basically, pages are generated on klecker (aka www-master) and copied to gluck.

Denis
Re: lists.debian.org de-localization
On Jan 06, Tomohiro KUBOTA [EMAIL PROTECTED] wrote:

> IMO, in such a case, allowing raw 8bit mails is better (i.e., its merit is larger than its demerit) than disabling them.
>
> Again, speaking about lists.debian.org, my original idea is to assume all 8bit raw characters to be ISO-8859-1, though I don't know this is technically possible or not.

This is not needed, only spammers put raw latin-1 characters in mail headers.

> In this case, Russian people will be annoyed browsing lists.debian.org pages.

This would be wrong, Russians do not use ISO-8859-1 anyway.

--
ciao,
Marco
Re: lists.debian.org de-localization
Hi,

From: Marco d'Itri [EMAIL PROTECTED]
Subject: Re: lists.debian.org de-localization
Date: Mon, 6 Jan 2003 13:34:17 +0100

> > Again, speaking about lists.debian.org, my original idea is to assume all 8bit raw characters to be ISO-8859-1, though I don't know this is technically possible or not.
>
> This is not needed, only spammers put raw latin-1 characters in mail headers.

The key point is that when we receive a mail with raw 8bit characters, we don't have an easy and reliable method to tell whether the characters are from ISO-8859-1, KOI8-R, or another character set.

Anyway, in the debian-russian mailing list, raw 8bit characters in mail headers should be allowed, and they should be assumed to be KOI8-R when building lists.debian.org pages.

In any case, using raw 8bit characters in lists.debian.org pages must be avoided (so that the pages are not broken), and thus raw 8bit characters in mail headers must be converted into something (or must be deleted). An easy way is to assume *all* raw 8bit characters to be KOI8-R and convert them into SGML entities. However, I don't know whether there are other languages where a certain number of non-spammer people use raw 8bit characters. If there are, they will complain about this idea.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: lists.debian.org de-localization
Tomohiro KUBOTA [EMAIL PROTECTED]:

> The key point is that when we receive a mail with raw 8bit characters, we don't have an easy and reliable method to tell the characters are from ISO-8859-1 or KOI8-R or other character sets.

If the headers contain 8-bit octets and are valid as UTF-8, it's fairly safe to assume that they really are UTF-8. Otherwise, you could look for a Content-Type field or make it depend on the mailing list.

> An easy way is to assume *all* raw 8bit characters to be KOI8-R and convert into SGML entity. However, I don't know whether there are some other languages where a certain amount of non-spammer people use raw 8bit characters. If they exist, they will complain on this idea.

I thought some Japanese non-spammers use iso-2022-jp in headers, which isn't 8-bit, but it isn't us-ascii, either. Am I out of date?

Edmund
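The detection order Edmund describes (valid UTF-8 first, then a declared charset, then a per-list default) can be sketched as follows. This is Python for illustration only, not MHonArc code; the function name, the LIST_DEFAULTS table, and the final ISO-8859-1 fallback are assumptions pieced together from suggestions made elsewhere in this thread.

```python
from typing import Optional

# Hypothetical per-list defaults, following the lists named in the thread.
LIST_DEFAULTS = {
    "debian-russian": "koi8_r",
    "debian-chinese-gb": "gb2312",
    "debian-chinese-big5": "big5",
}


def guess_header_charset(raw: bytes, list_name: str,
                         declared: Optional[str] = None) -> str:
    """Guess the charset of a raw (non-RFC-2047) header value."""
    # Pure 7-bit headers need no guessing.
    if all(byte < 0x80 for byte in raw):
        return "ascii"
    # 8-bit text that happens to be well-formed UTF-8 is very likely UTF-8.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Otherwise trust a charset taken from the Content-Type field, if any.
    if declared:
        return declared
    # Last resort: a default that depends on the mailing list.
    return LIST_DEFAULTS.get(list_name, "iso-8859-1")


if __name__ == "__main__":
    print(guess_header_charset(b"\xf0\xd2\xc9\xd7\xc5\xd4", "debian-russian"))
```

The UTF-8 test comes first because random KOI8-R or Latin-1 text rarely forms valid UTF-8 multi-byte sequences, so a false positive is unlikely.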
Re: lists.debian.org de-localization
On Mon, Jan 06, 2003 at 10:09:11AM +0900, Tomohiro KUBOTA wrote:
> > > Why not use iso_8859::str2sgml; instead of mhonarc::htmlize for iso-8859-1?
> > [...]
> > Sounds like a very good idea.
>
> Who should I ask for this modification?

This is the right place to ask, I was watching the discussion and waiting for eventual objections. The subthreads don't seem to be overly decisive, but the change seems like an oversight to me so I guess I'll apply it.

> (permission of klecker:/org/www.debian.org/cron/people_scripts/people.pl

I have no idea how you came from mhonarc to people.pl, but okay. :)

> is 755 owner=joy group=debwww, but I don't know whether klecker is the right place to do because I checked klecker just by chance. I also checked gluck(=www.debian.org) and master but they don't have the file.

The rule is that foo.debian.org is on the host that has the A record for the said address, in the directory /org/foo.debian.org. Therefore lists.debian.org stuff would be in master.debian.org's /org/lists.debian.org, and that is indeed where it's at.

--
2. That which causes joy or happiness.
Re: lists.debian.org de-localization
Hi,

From: Edmund GRIMLEY EVANS [EMAIL PROTECTED]
Subject: Re: lists.debian.org de-localization
Date: Mon, 6 Jan 2003 13:45:47 +

> If the headers contain 8-bit octets and are valid as UTF-8, it's fairly safe to assume that they really are UTF-8. Otherwise, you could look for a Content-Type field or make it depend on the mailing list.

A good idea, but I think people who use UTF-8 today are those who know character encodings well and don't send raw 8bit headers.

> I thought some Japanese non-spammers use iso-2022-jp in headers, which isn't 8-bit, but it isn't us-ascii, either. Am I out of date?

Sometimes I see raw iso-2022-jp headers. However, fortunately, there are no Japanese mailing lists in Debian. (debian-japanese is an English mailing list.) Moreover, MHonArc does not seem to have a feature to convert Japanese into SGML entities or &#xxx; expressions, so we cannot support Japanese headers anyhow.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: lists.debian.org de-localization
On Mon, Jan 06, 2003 at 03:09:47PM +0100, Josip Rodin wrote:
> > > > Why not use iso_8859::str2sgml; instead of mhonarc::htmlize for iso-8859-1?
> > > [...]
> > > Sounds like a very good idea.
> >
> > Who should I ask for this modification?
>
> This is the right place to ask, I was watching the discussion and waiting for eventual objections. The subthreads don't seem to be overly decisive, but the change seems like an oversight to me so I guess I'll apply it.

--- debian.rc~	Sun Sep 22 05:27:19 2002
+++ debian.rc	Mon Jan  6 08:05:07 2003
@@ -5,7 +5,7 @@
 <CharsetConverters>
 plain;		mhonarc::htmlize;
 us-ascii;	mhonarc::htmlize;
-iso-8859-1;	mhonarc::htmlize;
+iso-8859-1;	iso_8859::str2sgml;	iso8859.pl
 iso-8859-2;	iso_8859::str2sgml;	iso8859.pl
 iso-8859-3;	iso_8859::str2sgml;	iso8859.pl
 iso-8859-4;	iso_8859::str2sgml;	iso8859.pl
@@ -16,7 +16,7 @@
 iso-8859-9;	iso_8859::str2sgml;	iso8859.pl
 iso-8859-10;	iso_8859::str2sgml;	iso8859.pl
 iso-2022-jp;	iso_2022_jp::str2html;	iso2022jp.pl
-latin1;		mhonarc::htmlize;
+latin1;		iso_8859::str2sgml;	iso8859.pl
 latin2;		iso_8859::str2sgml;	iso8859.pl
 latin3;		iso_8859::str2sgml;	iso8859.pl
 latin4;		iso_8859::str2sgml;	iso8859.pl

--
2. That which causes joy or happiness.
Re: lists.debian.org de-localization
Hi,

From: Josip Rodin [EMAIL PROTECTED]
Subject: Re: lists.debian.org de-localization
Date: Mon, 6 Jan 2003 15:09:47 +0100

> > (permission of klecker:/org/www.debian.org/cron/people_scripts/people.pl
>
> I have no idea how you came from mhonarc to people.pl, but okay. :)

Ah, I was confused. It is from another thread on the debian-www list about a similar 8bit problem on http://www.debian.org/devel/people.ja.html . Please read the recent thread named "automatically-generated ISO-8859-1 characters in mulbibyte webpages" for details.

Thank you for committing the modification of debian.rc . Does the change affect future archives only? Or all past and future archives?

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: lists.debian.org de-localization
On Mon, Jan 06, 2003 at 11:42:44PM +0900, Tomohiro KUBOTA wrote:
> Thank you for committing the modification of debian.rc . Does the change affect future archives only? Or all past and future archives?

Future only. Is there a pressing need to regenerate the old mails? I would rather avoid it...

--
2. That which causes joy or happiness.
Re: lists.debian.org de-localization
Hi,

From: Josip Rodin [EMAIL PROTECTED]
Subject: Re: lists.debian.org de-localization
Date: Mon, 6 Jan 2003 16:07:49 +0100

> Future only. Is there a pressing need to regenerate the old mails? I would rather avoid it...

I have an idea about an easy modification to old list pages. Add the following line to all
http://lists.debian.org/*/*/threads.html ,
http://lists.debian.org/*/*/maillist.html ,
http://lists.debian.org/*/*/subject.html , and
http://lists.debian.org/*/*/author.html

  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

with the following exceptions:

for debian-chinese-gb,
  <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
for debian-chinese-big5,
  <meta http-equiv="Content-Type" content="text/html; charset=big5">
for debian-russian,
  <meta http-equiv="Content-Type" content="text/html; charset=koi8-r">

I expect that this can be done with much less machine time than regenerating all of the above pages with MHonArc, because it doesn't need to access any individual mails.

I am also now reading the MHonArc documents so that the above headers can be added to future pages as well (or, if everything can be migrated to UTF-8, that will be much better).

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
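The per-list meta line proposed above could be added to already generated index pages without touching the individual mails. The following is a minimal Python sketch of that idea, working on raw bytes so that existing 8bit content is left untouched. The function name and the assumption that every page contains a <head> tag are mine; the charset table follows the proposal in this message.

```python
import re

# Exceptions named in the proposal; every other list gets iso-8859-1.
LIST_CHARSETS = {
    "debian-chinese-gb": "gb2312",
    "debian-chinese-big5": "big5",
    "debian-russian": "koi8-r",
}

HEAD_OPEN = re.compile(rb"(?i)(<head(?:\s[^>]*)?>\s*)")


def add_charset_meta(page: bytes, list_name: str) -> bytes:
    """Insert a Content-Type meta line right after <head>.

    Hypothetical helper: pages that already declare http-equiv,
    or that have no <head> tag, are returned unchanged.
    """
    if b"http-equiv" in page.lower():
        return page
    charset = LIST_CHARSETS.get(list_name, "iso-8859-1")
    meta = ('<meta http-equiv="Content-Type" '
            'content="text/html; charset=%s">\n' % charset).encode("ascii")
    # A function replacement avoids backslash-escape processing of `meta`.
    return HEAD_OPEN.sub(lambda m: m.group(1) + meta, page, count=1)


if __name__ == "__main__":
    sample = b"<html><head>\n<title>threads</title></head><body>\xf0</body></html>"
    print(add_charset_meta(sample, "debian-russian"))
```

Running it twice over the same page is harmless, since the second pass sees the existing http-equiv line and leaves the page alone.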
Re: lists.debian.org de-localization
Hi,

From: Marco d'Itri [EMAIL PROTECTED]
Subject: Re: lists.debian.org de-localization
Date: Tue, 7 Jan 2003 01:10:29 +0100

> On Jan 06, Tomohiro KUBOTA [EMAIL PROTECTED] wrote:
>
> > > This is not needed, only spammers put raw latin-1 characters in mail headers.
> >
> > The key point is that when we receive a mail with raw 8bit characters,
>
> The key point is that we should not even accept mail with raw 8bit characters in the headers.

I agree with you, but that is an idealized solution. As Stephen said, there are people who use raw 8bit characters (intended to be KOI8-R). If you could force them to use the right MUAs, I would fully agree with you.

Anyway, in the current set-up of lists.debian.org, encodings such as GB2312 and BIG5 (used in debian-chinese-gb and debian-chinese-big5, respectively) are not supported and are processed just like raw 8bit characters. We also have to deal with them.

I am now interested in MHonArc::UTF8.pm . I had been thinking that it converts all UTF-8 characters (besides ASCII) into &#xxx; expressions and doesn't support east Asian encodings, which was wrong. It seems to convert *from* all non-UTF8 encodings *to* UTF-8, and it seems to support east Asian encodings as well (because Unicode::MapUTF8 supports them).

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
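The "convert everything to UTF-8" approach attributed to UTF8.pm above can be illustrated with a short Python sketch. It does not reproduce UTF8.pm or Unicode::MapUTF8; it uses Python's own codecs, and the function name and the ISO-8859-1 fallback for unknown charset names are assumptions.

```python
def to_utf8(raw: bytes, charset: str, fallback: str = "iso-8859-1") -> bytes:
    """Transcode text from a declared charset to UTF-8.

    Hypothetical helper. Undecodable bytes become U+FFFD instead of
    raising, so one broken mail cannot abort a whole archive run.
    """
    try:
        text = raw.decode(charset, errors="replace")
    except LookupError:
        # Python does not know this charset name: fall back.
        text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")


if __name__ == "__main__":
    for name, data in [("KOI8-R", b"\xf0\xd2\xc9\xd7\xc5\xd4"),
                       ("GB2312", "\u4e2d\u6587".encode("gb2312")),
                       ("BIG5", "\u4e2d\u6587".encode("big5"))]:
        print(name, to_utf8(data, name))
```

With this approach every index page can declare charset=utf-8, so Russian, Chinese, and Latin-1 names can share one page without conflicting.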
Re: lists.debian.org de-localization (Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages)
</lurk>

>>>>> "Marco" == Marco d'Itri [EMAIL PROTECTED] writes:

    Marco> It would be *MUCH* better to just refuse these
    Marco> messages. Most of them are spam anyway. At least in my
    Marco> country (and in all western europe, I think) raw latin-1
    Marco> characters in headers are never found outside of spam
    Marco> messages.

He did say Russian. On xemacs-users-ru, which is dedicated to Russian-language posts, about half the users use RFC-2047 encoded-words, and the rest are split evenly between ASCII-only and 8-bit Cyrillic. Raw Cyrillic in headers is used by some of the more sophisticated users, too, surprisingly enough.

This is a fairly small sample (about 100 subscribers, 25 regular posters). However, the Russian spam I've seen (isn't it funny how you can identify spam even though you can't read the language it's written in?) invariably fails either the addressee tests (implicit, too many), the known spam software test, or the HTML-only test. So (FWIW) I've disabled the 8-bit test and so far the Russian subscribers are happy.

I will also say I've seen a fair amount of dumbquotes from MS-encumbered posters, and the occasional accented Latin character from French and German posters (although those are quite rare, but not quite nonexistent).

    Marco> /^Subject: .*[^[:print:]]{8}/ REJECT Your mailer is not \
    Marco> RFC 2047 compliant

If you're going to do that, 8 is probably too many (SPC is not an 8-bit character---I find 3 works well) and the reason should be failure to comply with RFC 2822. AFAIK 2047 does not prohibit 8-bit characters, it simply provides a mechanism to encode them in environments where they are prohibited.

<lurk>

--
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
 Ask not how you can do free software business;
  ask what your business can do for free software.
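The effect of the run length discussed above (8 versus 3) can be tried out with a small Python sketch. It is only an approximation of the quoted header check: the POSIX class [^[:print:]] also matches control characters, while this sketch looks at high-bit bytes only, and the function name is hypothetical.

```python
import re


def has_raw_8bit_run(header_line: bytes, min_run: int = 3) -> bool:
    """Return True if a Subject line holds min_run consecutive 8-bit bytes.

    Narrower than the quoted rule: only bytes 0x80-0xFF count here.
    """
    pattern = re.compile(rb"^Subject: .*[\x80-\xff]{%d}" % min_run)
    return pattern.search(header_line) is not None


if __name__ == "__main__":
    raw_cyrillic = b"Subject: \xf0\xd2\xc9\xd7\xc5\xd4 \xcd\xc9\xd2"
    encoded_word = b"Subject: =?koi8-r?B?8NLJ18XU?="
    print(has_raw_8bit_run(raw_cyrillic, 8))   # the space breaks the run
    print(has_raw_8bit_run(raw_cyrillic, 3))
    print(has_raw_8bit_run(encoded_word, 3))
```

The sample subject has runs of six and three 8-bit bytes separated by a space, which is why a threshold of 8 misses it and a threshold of 3 catches it, while the RFC 2047 encoded-word passes either way.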
Re: lists.debian.org de-localization (Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages)
On Sun, Jan 05, 2003 at 10:18:48AM +0900, Tomohiro KUBOTA wrote:
[...]
> <CharsetConverters>
> plain;		mhonarc::htmlize;
> us-ascii;	mhonarc::htmlize;
> iso-8859-1;	mhonarc::htmlize;
> iso-8859-2;	iso_8859::str2sgml;	iso8859.pl
> iso-8859-3;	iso_8859::str2sgml;	iso8859.pl
>
> Why not use iso_8859::str2sgml; instead of mhonarc::htmlize for iso-8859-1? (Though I am new to MHonArc, I imagine that iso_8859::str2sgml converts ISO-8859 8bit characters into SGML entity like &ouml;.)
[...]

Sounds like a very good idea.

Denis
Re: lists.debian.org de-localization
Hi,

From: [EMAIL PROTECTED] (Denis Barbier)
Subject: Re: lists.debian.org de-localization (Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages)
Date: Sun, 5 Jan 2003 15:33:41 +0100

> > Why not use iso_8859::str2sgml; instead of mhonarc::htmlize for iso-8859-1?
> [...]
> Sounds like a very good idea.

Who should I ask for this modification?

(The permission of klecker:/org/www.debian.org/cron/people_scripts/people.pl is 755 owner=joy group=debwww, but I don't know whether klecker is the right place to do this, because I checked klecker just by chance. I also checked gluck (=www.debian.org) and master, but they don't have the file. Where can I find a document on how /org/* are processed?)

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
lists.debian.org de-localization (Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages)
Hi,

From: Tomohiro KUBOTA [EMAIL PROTECTED]
Subject: Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages
Date: Fri, 03 Jan 2003 09:06:43 +0900 (JST)

BTW, I found similar trouble in lists.debian.org pages.

In thread-list pages or date-list pages like http://lists.debian.org/debian-devel/2002/debian-devel-200212/threads.html, there is no charset specification. In such cases, web browsers will interpret these pages according to the user's preference. Naturally, Japanese people configure web browsers to assume a Japanese encoding for pages without a charset specification.

On the other hand, the thread-list pages show senders' names in <em> format, and therefore a </em> tag follows the name. If the last letter of the name is 8bit, the tag is broken. The result is that everything that follows is shown in <em> (italic) format.

The test is easy: please configure your browser to assume a Japanese encoding for pages without a charset specification and load the above page.

However, in this case, the solution is a bit complicated. All mails should have encoding information in MIME format. Thus, the best solution would be to parse MIME. On the other hand, the simplest makeshift solution is to add charset=iso8859-1 to all pages, but there are mailing lists where most of the 8bit characters are Cyrillic and so on.

I found that MHonArc has a feature to solve this problem.
http://www.mhonarc.org/MHonArc/doc/faq/mime.html#nonascii

I checked /org/lists.debian.org/mhonarc/debian.rc and found that it seems to assume that any 8bit characters are ISO-8859-1.

<CharsetConverters>
plain;		mhonarc::htmlize;
us-ascii;	mhonarc::htmlize;
iso-8859-1;	mhonarc::htmlize;
iso-8859-2;	iso_8859::str2sgml;	iso8859.pl
iso-8859-3;	iso_8859::str2sgml;	iso8859.pl

Why not use iso_8859::str2sgml; instead of mhonarc::htmlize for iso-8859-1? (Though I am new to MHonArc, I imagine that iso_8859::str2sgml converts ISO-8859 8bit characters into SGML entities like &ouml;.)

It would be nice if we could convert raw 8bit mail headers (though they are illegal, they sometimes occur and may break the lists.debian.org pages) to SGML entities by assuming they are ISO-8859-1. Since this may annoy Russian (and other non-ISO-8859-1) people who happen to use MUAs which generate illegal mail headers with 8bit characters without a charset specification, I'd like to hear from people from various countries.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
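What the message above imagines iso_8859::str2sgml doing for ISO-8859-1 can be sketched in Python as follows. This is not the iso8859.pl implementation; it is an illustration that uses the entity table from Python's standard html.entities module, with a hypothetical function name.

```python
from html.entities import codepoint2name


def latin1_to_sgml(raw: bytes) -> str:
    """Assume ISO-8859-1 input and return pure-ASCII text with entities.

    In ISO-8859-1 the byte value equals the Unicode code point, so the
    byte can be looked up in the entity table directly.
    """
    out = []
    for byte in raw:
        name = codepoint2name.get(byte)
        if name is not None:
            # Covers &quot; &amp; &lt; &gt; and the named Latin-1 entities.
            out.append("&%s;" % name)
        elif byte < 0x80:
            out.append(chr(byte))
        else:
            # 0x80-0x9F have no entity names; emit a numeric reference.
            out.append("&#%d;" % byte)
    return "".join(out)


if __name__ == "__main__":
    print(latin1_to_sgml(b"J\xf6rg"))
```

Since the result is 7bit only, the closing tag after a sender's name stays intact whatever encoding the reader's browser assumes.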
Re: lists.debian.org de-localization (Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages)
On Jan 05, Tomohiro KUBOTA [EMAIL PROTECTED] wrote:

> It would be nice if we can convert raw 8bit mail headers (though it is illegal; it sometimes happens and may cause breaking the lists.debian.org pages) to SGML entities by assuming they are ISO-8859-1. Since this may annoy Russian (and other non-ISO-8859-1) people who happen to use MUAs which generates illegal mail headers with 8bit characters without charset specification, I'd like to hear from people from various countries.

It would be *MUCH* better to just refuse these messages. Most of them are spam anyway. At least in my country (and in all western europe, I think) raw latin-1 characters in headers are never found outside of spam messages.

/^Subject: .*[^[:print:]]{8}/ REJECT Your mailer is not RFC 2047 compliant

--
ciao,
Marco