Re: lists.debian.org de-localization

2003-02-12 Thread Tomohiro KUBOTA
Hi,

(Remember, the topic is that http://lists.debian.org pages sometimes
use 8bit characters which may break all contents after the character
when east Asian users browse the pages.)

From: Josip Rodin [EMAIL PROTECTED]
Subject: Re: lists.debian.org de-localization
Date: Sun, 12 Jan 2003 04:14:45 +0100

 On Sun, Jan 12, 2003 at 10:38:52AM +0900, Tomohiro KUBOTA wrote:
  However, I don't think this can be a solution now because it will take a
  very long time that the version will be stable, then the stable version
  will be adopted into unstable/testing version of Debian distribution, then
  the distribution will become stable (released), and then the stable
  distribution will be adopted to master.debian.org .
 
 Actually, we use a non-.deb mhonarc on lists.d.o so this isn't a problem
 per se.

A new version of MHonArc (2.6.0) was released recently, which I think
can solve all encoding-related problems by converting everything into
UTF-8.


 This, on the other hand, is a hassle to handle (backporting or installation
 into subdirs). master.d.o is scheduled to be upgraded to woody after samosa.
 That's all I know. shrug

Any new information?

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: lists.debian.org de-localization

2003-01-12 Thread Tomohiro KUBOTA
Hi,

From: Josip Rodin [EMAIL PROTECTED]
Subject: Re: lists.debian.org de-localization
Date: Sun, 12 Jan 2003 04:14:45 +0100

 This, on the other hand, is a hassle to handle (backporting or installation
 into subdirs). master.d.o is scheduled to be upgraded to woody after samosa.
 That's all I know. shrug

This is good news.  Then I will work on support for various encodings later.

Anyway, I don't expect the new master.d.o to have the development version
of MHonArc (the one that can assume an encoding for raw 8bit headers), even
though MHonArc there is installed from a non-Debian package.  Thus I think
we will need some method of our own to handle raw 8bit headers.

Here is a filter to convert 8bit characters (assumed to be KOI8-R) to
&#xxx; expressions, which I wrote by imitating iso8859.pl, CharEnt.pm,
and UTF8.pm.  This filter is used for raw 7bit/8bit strings.  Since the
7bit part of KOI8-R is identical to ASCII, it doesn't harm legal ASCII
headers.  The filter is to be installed as
org/lists.debian.org/mhonarc/share/mhonarc/MHonArc/DEBIAN.pm and doesn't
depend on the version of MHonArc or Debian.
##  DEBIAN.pm by Tomohiro KUBOTA [EMAIL PROTECTED]
##
##  CHARSETCONVERTER module that assumes the input string to be KOI8-R
##  and converts it into &#xxx; expressions where xxx is the decimal
##  Unicode codepoint.

package DEBIAN;

%US_ASCII_To_Ent = (
  #--
  # Hex Code    Entity Ref    # ISO external entity and description
  #--
0x22,   "&quot;",   # ISOnum : Quotation mark
0x26,   "&amp;",    # ISOnum : Ampersand
0x3C,   "&lt;",     # ISOnum : Less-than sign
0x3E,   "&gt;",     # ISOnum : Greater-than sign
);

%KOI8_R_To_Ent = (
  #--
  # Hex Code    Entity Ref    # ISO external entity and description
  #--
0x80,   "&#9472;",  # BOX DRAWINGS LIGHT HORIZONTAL
0x81,   "&#9474;",  # BOX DRAWINGS LIGHT VERTICAL
0x82,   "&#9484;",  # BOX DRAWINGS LIGHT DOWN AND RIGHT
0x83,   "&#9488;",  # BOX DRAWINGS LIGHT DOWN AND LEFT
0x84,   "&#9492;",  # BOX DRAWINGS LIGHT UP AND RIGHT
0x85,   "&#9496;",  # BOX DRAWINGS LIGHT UP AND LEFT
0x86,   "&#9500;",  # BOX DRAWINGS LIGHT VERTICAL AND RIGHT
0x87,   "&#9508;",  # BOX DRAWINGS LIGHT VERTICAL AND LEFT
0x88,   "&#9516;",  # BOX DRAWINGS LIGHT DOWN AND HORIZONTAL
0x89,   "&#9524;",  # BOX DRAWINGS LIGHT UP AND HORIZONTAL
0x8a,   "&#9532;",  # BOX DRAWINGS LIGHT VERTICAL AND HORIZONTAL
0x8b,   "&#9600;",  # UPPER HALF BLOCK
0x8c,   "&#9604;",  # LOWER HALF BLOCK
0x8d,   "&#9608;",  # FULL BLOCK
0x8e,   "&#9612;",  # LEFT HALF BLOCK
0x8f,   "&#9616;",  # RIGHT HALF BLOCK
0x90,   "&#9617;",  # LIGHT SHADE
0x91,   "&#9618;",  # MEDIUM SHADE
0x92,   "&#9619;",  # DARK SHADE
0x93,   "&#8992;",  # TOP HALF INTEGRAL
0x94,   "&#9632;",  # BLACK SQUARE
0x95,   "&#8729;",  # BULLET OPERATOR
0x96,   "&#8730;",  # SQUARE ROOT
0x97,   "&#8776;",  # ALMOST EQUAL TO
0x98,   "&#8804;",  # LESS-THAN OR EQUAL TO
0x99,   "&#8805;",  # GREATER-THAN OR EQUAL TO
0x9a,   "&#160;",   # NO-BREAK SPACE
0x9b,   "&#8993;",  # BOTTOM HALF INTEGRAL
0x9c,   "&#176;",   # DEGREE SIGN
0x9d,   "&#178;",   # SUPERSCRIPT TWO
0x9e,   "&#183;",   # MIDDLE DOT
0x9f,   "&#247;",   # DIVISION SIGN
0xa0,   "&#9552;",  # BOX DRAWINGS DOUBLE HORIZONTAL
0xa1,   "&#9553;",  # BOX DRAWINGS DOUBLE VERTICAL
0xa2,   "&#9554;",  # BOX DRAWINGS DOWN SINGLE AND RIGHT DOUBLE
0xa3,   "&#1105;",  # CYRILLIC SMALL LETTER IO
0xa4,   "&#9555;",  # BOX DRAWINGS DOWN DOUBLE AND RIGHT SINGLE
0xa5,   "&#9556;",  # BOX DRAWINGS DOUBLE DOWN AND RIGHT
0xa6,   "&#9557;",  # BOX DRAWINGS DOWN SINGLE AND LEFT DOUBLE
0xa7,   "&#9558;",  # BOX DRAWINGS DOWN DOUBLE AND LEFT SINGLE
0xa8,   "&#9559;",  # BOX DRAWINGS DOUBLE DOWN AND LEFT
0xa9,   "&#9560;",  # BOX DRAWINGS UP SINGLE AND RIGHT DOUBLE
0xaa,   "&#9561;",  # BOX DRAWINGS UP DOUBLE AND RIGHT SINGLE
0xab,   "&#9562;",  # BOX DRAWINGS DOUBLE UP AND RIGHT
0xac,   "&#9563;",  # BOX DRAWINGS UP SINGLE AND LEFT DOUBLE
0xad,   "&#9564;",  # BOX DRAWINGS UP DOUBLE AND LEFT SINGLE
0xae,   "&#9565;",  # BOX DRAWINGS DOUBLE UP AND LEFT
0xaf,   "&#9566;",  # BOX DRAWINGS VERTICAL SINGLE AND RIGHT DOUBLE
0xb0,   "&#9567;",  # BOX DRAWINGS VERTICAL DOUBLE AND RIGHT SINGLE
0xb1,   "&#9568;",  # BOX DRAWINGS DOUBLE VERTICAL AND RIGHT
0xb2,   "&#9569;",  # BOX

Re: lists.debian.org de-localization

2003-01-11 Thread Tomohiro KUBOTA
Hi,

From: Tomohiro KUBOTA [EMAIL PROTECTED]
Subject: Re: lists.debian.org de-localization
Date: Tue, 07 Jan 2003 21:45:05 +0900 (JST)

 I think more important problem is how to deal with raw 8bit mail
 headers without encoding specification or encodings which are not
 supported by the current set-up but used in Debian mailing lists
 (GB2312, BIG5, and KOI8-R).

I heard that the current development version of MHonArc has a feature
to treat raw 8bit characters as being in a specified encoding.  However,
I don't think this can be a solution now, because it will take a very
long time: first that version has to become stable, then the stable
version has to enter the unstable/testing version of the Debian
distribution, then that distribution has to become stable (released),
and finally the stable distribution has to be installed on
master.debian.org.

Anyway, I can write a KOI8-R -> SGML entity (or &#xxx; expression)
filter very easily.  My plan is to assume that raw 8bit characters are
KOI8-R Russian, and I think this can be achieved easily.
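
As a rough illustration of that plan, here is a minimal sketch in Python
rather than Perl.  It is not the DEBIAN.pm filter itself: the function name
koi8r_to_ncr is made up for this example, and it relies on Python's built-in
koi8_r codec instead of a hand-written table.  Every byte is assumed to be
KOI8-R, ASCII is passed through (with the HTML special characters escaped),
and everything else becomes a decimal &#xxx; expression.

```python
import html


def koi8r_to_ncr(raw: bytes) -> str:
    """Treat every byte as KOI8-R and return HTML-safe 7bit text.

    Hypothetical helper for illustration; not part of MHonArc.
    """
    pieces = []
    # KOI8-R assigns a character to all 256 byte values, so decoding
    # never fails, whatever the sender really meant.
    for ch in raw.decode("koi8_r"):
        if ord(ch) < 0x80:
            # The 7bit half of KOI8-R is plain ASCII; only escape the
            # characters that are special in HTML.
            pieces.append(html.escape(ch, quote=True))
        else:
            # Everything else becomes a decimal numeric character reference.
            pieces.append(f"&#{ord(ch)};")
    return "".join(pieces)
```

Because the output is pure ASCII, the result can be placed in a page of any
charset without disturbing the tags around it.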

The remaining problem is how to handle unsupported encodings such as
GB2312 and Big5.  I found that the current set-up of lists.debian.org
mhonarc passes GB2312 and Big5 through as raw 8bit streams (or, one could
say, 16bit streams, because these encodings are multibyte), and they also
cause encoding conflicts and the loss of the following "<" in "</em>".
Thus I'd like these encodings to be converted into &#xxx; expressions too.

(Also, debian-esperanto people may want to use ISO-8859-3 and UTF-8.)

I found
master.debian.org:/org/lists.debian.org/mhonarc/share/mhonarc/MHonArc/UTF8.pm
but I don't think this will work well, because it depends on the
Unicode::MapUTF8 module, which has been available as the
libunicode-maputf8-perl package only since Woody, whereas
master.debian.org runs Potato.

Alternatively, I might be able to write my own filter using
libtext-unicode-perl, but that package is also available only since Woody.


I don't know any other ways.  Any suggestions?

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: lists.debian.org de-localization

2003-01-11 Thread Josip Rodin
On Sun, Jan 12, 2003 at 10:38:52AM +0900, Tomohiro KUBOTA wrote:
 However, I don't think this can be a solution now because it will take a
 very long time that the version will be stable, then the stable version
 will be adopted into unstable/testing version of Debian distribution, then
 the distribution will become stable (released), and then the stable
 distribution will be adopted to master.debian.org .

Actually, we use a non-.deb mhonarc on lists.d.o so this isn't a problem
per se.

 I found
 master.debian.org:/org/lists.debian.org/mhonarc/share/mhonarc/MHonArc/UTF8.pm
 but I don't think this will work well because it depends on
 Unicode::MapUTF8 module which is available as libunicode-maputf8-perl
 package since Woody, where master.debian.org is Potato.
 
 Then, I might be able to write an original filter using libtext-unicode-perl
 but the package is also available since Woody.

This, on the other hand, is a hassle to handle (backporting or installation
into subdirs). master.d.o is scheduled to be upgraded to woody after samosa.
That's all I know. shrug

-- 
 2. That which causes joy or happiness.



Re: lists.debian.org de-localization

2003-01-07 Thread Josip Rodin
On Tue, Jan 07, 2003 at 09:29:33AM +0900, Tomohiro KUBOTA wrote:
 I have an idea about an easy modification to old list pages.
 
 Add the following line to all 
 http://lists.debian.org/*/*/threads.html ,
 http://lists.debian.org/*/*/maillist.html ,
 http://lists.debian.org/*/*/subject.html , and
 http://lists.debian.org/*/*/author.html

Hm, but doesn't the section on character sets cover the mails themselves as
well? There are a bit under twenty thousand indices, which is a large amount
by itself, but the mails themselves are really problematic.

-- 
 2. That which causes joy or happiness.



Re: lists.debian.org de-localization

2003-01-07 Thread Tomohiro KUBOTA
Hi,

From: Josip Rodin [EMAIL PROTECTED]
Subject: Re: lists.debian.org de-localization
Date: Tue, 7 Jan 2003 11:41:36 +0100

 Hm, but doesn't the section on character sets cover the mails themselves as
 well? There are a bit under twenty thousand indices, which is a large amount
 by itself, but the mails themselves are really problematic.

I think the priority of the mails themselves is much lower than that
of the index pages, because an 8bit character in just one mail affects
the index of a whole month of a list.

Anyway, I don't think the regeneration or modification is needed *now*,
because our solution is not yet perfect.  It is obviously a waste of
machine time to modify the pages both *now* and again in the future,
when a better solution becomes available.

I think the more important problem is how to deal with raw 8bit mail
headers that carry no encoding specification, or that use encodings
which the current set-up does not support but which are used in Debian
mailing lists (GB2312, BIG5, and KOI8-R).

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: lists.debian.org de-localization

2003-01-06 Thread Tomohiro KUBOTA
Hi,

From: Stephen J. Turnbull [EMAIL PROTECTED]
Subject: Re: lists.debian.org de-localization (Re: automatically-generated 
ISO-8859-1 characters in mulbibyte webpages)
Date: Sun, 05 Jan 2003 16:10:02 +0900

 This is a fairly small sample (about 100 subscribers, 25 regular
 posters).  However, the Russian spam I've seen (isn't it funny how you
 can identify spam even though you can't read the language it's written
 in?) invariably fails either the addressee tests (implicit, too many),
 the known spam software test, or the HTML-only test.  So (FWIW) I've
 disabled the 8-bit test and so far the Russian subscribers are happy.

IMO, in such a case, allowing raw 8bit mails is better than rejecting
them (i.e., the benefit outweighs the cost).

Again, speaking about lists.debian.org, my original idea is to assume
all raw 8bit characters to be ISO-8859-1, though I don't know whether
this is technically possible.  In this case, Russian people will be
annoyed when browsing lists.debian.org pages.

If it is possible to set an assumed encoding for each mailing list,
that of the debian-russian list would be KOI8-R, that of debian-chinese-gb
would be GB2312, and so on, with ISO-8859-1 for all others.
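
To make the per-list idea concrete, here is a small Python sketch.  The
table and function names are invented for this example and do not correspond
to any MHonArc setting; the Big5 entry for debian-chinese-big5 is included
as one guess at what "and so on" would cover.

```python
# Hypothetical per-list defaults; names are illustrative only.
LIST_DEFAULT_CHARSET = {
    "debian-russian": "koi8_r",
    "debian-chinese-gb": "gb2312",
    "debian-chinese-big5": "big5",
}
FALLBACK_CHARSET = "iso-8859-1"


def assumed_charset(list_name: str) -> str:
    """Return the encoding to assume for undeclared 8bit text on a list."""
    return LIST_DEFAULT_CHARSET.get(list_name, FALLBACK_CHARSET)


def decode_raw_header(raw: bytes, list_name: str) -> str:
    """Decode a raw 8bit header using the list's assumed encoding.

    Bytes that are invalid in the assumed encoding are replaced rather
    than raising, so one odd mail cannot stop the index from building.
    """
    return raw.decode(assumed_charset(list_name), errors="replace")
```

The same six bytes then read as Cyrillic on debian-russian and as accented
Latin letters on any list that falls back to ISO-8859-1.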

I also hope there are some UTF-8 filters.  (There seems to be a poster
in debian-esperanto whose From: name is in UTF-8.)

However, I know nothing at all about MHonArc.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: lists.debian.org de-localization

2003-01-06 Thread Denis Barbier
On Mon, Jan 06, 2003 at 10:09:11AM +0900, Tomohiro KUBOTA wrote:
 Hi,
 
 From: [EMAIL PROTECTED] (Denis Barbier)
 Subject: Re: lists.debian.org de-localization (Re: automatically-generated 
 ISO-8859-1 characters in mulbibyte webpages)
 Date: Sun, 5 Jan 2003 15:33:41 +0100
 
   Why not use iso_8859::str2sgml; instead of mhonarc::htmlize for 
   iso-8859-1?
 [...]
  Sounds like a very good idea.
 
 Who should I ask for this modification?
 
 (permission of klecker:/org/www.debian.org/cron/people_scripts/people.pl
 is 755 owner=joy group=debwww, but I don't know whether klecker is the
 right place to do because I checked klecker just by chance.  I also
 checked gluck(=www.debian.org) and master but they don't have the file.
 Where can I find a document on how /org/* are processed?)

'joy' is Josip Rodin.  He recently put these files into CVS:
   http://cvs.debian.org/cron/?cvsroot=webwml
but I do not know how permissions are managed.  IMO you should
wait for his (or any other webmaster's) approval before applying
changes.
There is no extra documentation, you have to read scripts and the
README files to learn how it works.  Basically pages are generated
on klecker (aka www-master) and copied to gluck.

Denis



Re: lists.debian.org de-localization

2003-01-06 Thread Marco d'Itri
On Jan 06, Tomohiro KUBOTA [EMAIL PROTECTED] wrote:

 IMO, in such a case, allowing raw 8bit mails is better (i.e., its merit
 is larger than its demerit) than disabling them.
 
 Again, speaking about lists.debian.org, my original idea is to assume
 all 8bit raw characters to be ISO-8859-1, though I don't know this is
 technically possible or not.
This is not needed, only spammers put raw latin-1 characters in mail
headers.

  In this case, Russian people will be
 annoyed browsing lists.debian.org pages.
This would be wrong, Russians do not use ISO-8859-1 anyway.

-- 
ciao,
Marco



Re: lists.debian.org de-localization

2003-01-06 Thread Tomohiro KUBOTA
Hi,

From: Marco d'Itri [EMAIL PROTECTED]
Subject: Re: lists.debian.org de-localization
Date: Mon, 6 Jan 2003 13:34:17 +0100

  Again, speaking about lists.debian.org, my original idea is to assume
  all 8bit raw characters to be ISO-8859-1, though I don't know this is
  technically possible or not.
 This is not needed, only spammers put raw latin-1 characters in mail
 headers.

The key point is that when we receive a mail with raw 8bit characters,
we don't have an easy and reliable method to tell whether the characters
are from ISO-8859-1, KOI8-R, or another character set.

Anyway, in the debian-russian mailing list, raw 8bit characters in mail
headers should be allowed and they should be assumed to be KOI8-R
when building lists.debian.org pages.

In any case, using raw 8bit characters in lists.debian.org pages
must be avoided (so that the pages are not broken), and thus raw
8bit characters in mail headers must be converted into something
(or must be deleted).

An easy way is to assume *all* raw 8bit characters to be KOI8-R and
convert them into SGML entities.  However, I don't know whether there
are other languages in which a significant number of non-spammers
use raw 8bit characters.  If there are, those people will complain
about this idea.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/






Re: lists.debian.org de-localization

2003-01-06 Thread Edmund GRIMLEY EVANS
Tomohiro KUBOTA [EMAIL PROTECTED]:

 The key point is that when we receive a mail with raw 8bit characters,
 we don't have an easy and reliable method to tell the characters are
 from ISO-8859-1 or KOI8-R or other character sets.

If the headers contain 8-bit octets and are valid as UTF-8, it's
fairly safe to assume that they really are UTF-8. Otherwise, you could
look for a Content-Type field or make it depend on the mailing list.
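
A minimal Python sketch of that decision order, for illustration only: the
function name and its parameters are hypothetical, "declared" stands for
whatever charset a Content-Type field supplied (if any), and "list_default"
for a per-list assumption.

```python
from typing import Optional


def guess_header_charset(raw: bytes,
                         declared: Optional[str] = None,
                         list_default: str = "iso-8859-1") -> str:
    """Guess the charset of a raw header, in the order suggested above."""
    if all(byte < 0x80 for byte in raw):
        return "us-ascii"            # nothing to guess
    try:
        # Strict decoding: random 8bit text in a legacy charset is very
        # unlikely to form valid UTF-8 multi-byte sequences by accident.
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Not UTF-8: trust a declared charset, else the list's default.
    return declared or list_default
```

The check is cheap, and it only changes the outcome for headers that
already contain 8bit octets.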

 An easy way is to assume *all* raw 8bit characters to be KOI8-R and
 convert into SGML entity.  However, I don't know whether there are
 some other languages where a certain amount of non-spammer people
 use raw 8bit characters.  If they exist, they will complain on this
 idea.

I thought some Japanese non-spammers use iso-2022-jp in headers, which
isn't 8-bit, but it isn't us-ascii, either. Am I out of date?

Edmund



Re: lists.debian.org de-localization

2003-01-06 Thread Josip Rodin
On Mon, Jan 06, 2003 at 10:09:11AM +0900, Tomohiro KUBOTA wrote:
   Why not use iso_8859::str2sgml; instead of mhonarc::htmlize for 
   iso-8859-1?
 [...]
  Sounds like a very good idea.
 
 Who should I ask for this modification?

This is the right place to ask; I was watching the discussion and waiting
for possible objections. The subthreads don't seem to be overly decisive,
but the current setting seems like an oversight to me, so I guess I'll
apply the change.

 (permission of klecker:/org/www.debian.org/cron/people_scripts/people.pl

I have no idea how you came from mhonarc to people.pl, but okay. :)

 is 755 owner=joy group=debwww, but I don't know whether klecker is the
 right place to do because I checked klecker just by chance.  I also
 checked gluck(=www.debian.org) and master but they don't have the file.

The rule is that foo.debian.org is on the host that has the A record for
that address, in the directory /org/foo.debian.org. Therefore the
lists.debian.org stuff would be in master.debian.org's
/org/lists.debian.org, and that is indeed where it is.

-- 
 2. That which causes joy or happiness.



Re: lists.debian.org de-localization

2003-01-06 Thread Tomohiro KUBOTA
Hi,

From: Edmund GRIMLEY EVANS [EMAIL PROTECTED]
Subject: Re: lists.debian.org de-localization
Date: Mon, 6 Jan 2003 13:45:47 +

 If the headers contain 8-bit octets and are valid as UTF-8, it's
 fairly safe to assume that they really are UTF-8. Otherwise, you could
 look for a Content-Type field or make it depend on the mailing list.

A good idea, but I think people who use UTF-8 today are those who
know character encodings well and so don't send raw 8bit headers.


 I thought some Japanese non-spammers use iso-2022-jp in headers, which
 isn't 8-bit, but it isn't us-ascii, either. Am I out of date?

Sometimes I see raw iso-2022-jp headers.  However, fortunately,
there are no Japanese mailing lists in Debian.  (debian-japanese
is an English mailing list.)  Moreover, MHonArc does not seem to have
a feature to convert Japanese into SGML entities or &#xxx; expressions,
so we cannot support Japanese headers anyway.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: lists.debian.org de-localization

2003-01-06 Thread Josip Rodin
On Mon, Jan 06, 2003 at 03:09:47PM +0100, Josip Rodin wrote:
Why not use iso_8859::str2sgml; instead of mhonarc::htmlize for 
iso-8859-1?
  [...]
   Sounds like a very good idea.
  
  Who should I ask for this modification?
 
 This is the right place to ask, I was watching the discussion and waiting
 for eventual objections. The subthreads don't seem to be overly decisive,
 but the change seems like an oversight to me so I guess I'll apply it.

--- debian.rc~  Sun Sep 22 05:27:19 2002
+++ debian.rc   Mon Jan  6 08:05:07 2003
@@ -5,7 +5,7 @@
 CharsetConverters
 plain;  mhonarc::htmlize;
 us-ascii;   mhonarc::htmlize;
-iso-8859-1; mhonarc::htmlize;
+iso-8859-1; iso_8859::str2sgml; iso8859.pl
 iso-8859-2; iso_8859::str2sgml; iso8859.pl
 iso-8859-3; iso_8859::str2sgml; iso8859.pl
 iso-8859-4; iso_8859::str2sgml; iso8859.pl
@@ -16,7 +16,7 @@
 iso-8859-9; iso_8859::str2sgml; iso8859.pl
 iso-8859-10;iso_8859::str2sgml; iso8859.pl
 iso-2022-jp;iso_2022_jp::str2html;  iso2022jp.pl
-latin1; mhonarc::htmlize;
+latin1; iso_8859::str2sgml; iso8859.pl
 latin2; iso_8859::str2sgml; iso8859.pl
 latin3; iso_8859::str2sgml; iso8859.pl
 latin4; iso_8859::str2sgml; iso8859.pl

-- 
 2. That which causes joy or happiness.



Re: lists.debian.org de-localization

2003-01-06 Thread Tomohiro KUBOTA
Hi,

From: Josip Rodin [EMAIL PROTECTED]
Subject: Re: lists.debian.org de-localization
Date: Mon, 6 Jan 2003 15:09:47 +0100

  (permission of klecker:/org/www.debian.org/cron/people_scripts/people.pl
 
 I have no idea how you came from mhonarc to people.pl, but okay. :)

Ah, I was confused.  That file is from another thread on the debian-www
list about a similar 8bit problem on
http://www.debian.org/devel/people.ja.html .

Please read the recent thread named "automatically-generated ISO-8859-1
characters in mulbibyte webpages" for details.


Thank you for committing the modification of debian.rc.  Does the change
affect future archives only?  Or all past and future archives?

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: lists.debian.org de-localization

2003-01-06 Thread Josip Rodin
On Mon, Jan 06, 2003 at 11:42:44PM +0900, Tomohiro KUBOTA wrote:
 Thank you for committing the modification of debian.rc .  Does the change
 affect future archives only?  Or all past and future archives?

Future only. Is there a pressing need to regenerate the old mails?
I would rather avoid it...

-- 
 2. That which causes joy or happiness.



Re: lists.debian.org de-localization

2003-01-06 Thread Tomohiro KUBOTA
Hi,

From: Josip Rodin [EMAIL PROTECTED]
Subject: Re: lists.debian.org de-localization
Date: Mon, 6 Jan 2003 16:07:49 +0100

 Future only. Is there a pressing need to regenerate the old mails?
 I would rather avoid it...

I have an idea about an easy modification to old list pages.

Add the following line to all 
http://lists.debian.org/*/*/threads.html ,
http://lists.debian.org/*/*/maillist.html ,
http://lists.debian.org/*/*/subject.html , and
http://lists.debian.org/*/*/author.html

   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

with the following exceptions:

for debian-chinese-gb,

   <meta http-equiv="Content-Type" content="text/html; charset=gb2312">

for debian-chinese-big5,

   <meta http-equiv="Content-Type" content="text/html; charset=big5">

for debian-russian,

   <meta http-equiv="Content-Type" content="text/html; charset=koi8-r">

I expect that this takes much less machine time than regenerating all
of the above pages with MHonArc, because it doesn't need to access any
individual mails.
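
The edit to each index page could look roughly like the following Python
sketch.  It is only an illustration under some assumptions: the function
name is made up, the page is assumed to contain an opening head tag, and
reading and writing the files (which would have to be done byte-for-byte,
since the pages contain raw 8bit characters) is left out.

```python
import re

META_TEMPLATE = '<meta http-equiv="Content-Type" content="text/html; charset={}">'

# The exceptions named above; every other list gets iso-8859-1.
LIST_CHARSET = {
    "debian-chinese-gb": "gb2312",
    "debian-chinese-big5": "big5",
    "debian-russian": "koi8-r",
}

HEAD_OPEN = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)
HAS_CHARSET = re.compile(r"<meta[^>]*charset=", re.IGNORECASE)


def add_charset_meta(page: str, list_name: str) -> str:
    """Insert a charset meta line right after the opening head tag.

    Pages that already declare a charset, or that have no head tag,
    are returned unchanged, so the function is safe to run twice.
    """
    if HAS_CHARSET.search(page):
        return page
    tag = META_TEMPLATE.format(LIST_CHARSET.get(list_name, "iso-8859-1"))
    return HEAD_OPEN.sub(lambda m: m.group(0) + "\n" + tag, page, count=1)
```

Only the index page itself is touched; no mail has to be parsed again.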

I am also reading the MHonArc documentation now, so that the above
headers can be added to future pages as well (or, if everything can be
migrated to UTF-8, that will be much better).

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: lists.debian.org de-localization

2003-01-06 Thread Tomohiro KUBOTA
Hi,

From: Marco d'Itri [EMAIL PROTECTED]
Subject: Re: lists.debian.org de-localization
Date: Tue, 7 Jan 2003 01:10:29 +0100

 On Jan 06, Tomohiro KUBOTA [EMAIL PROTECTED] wrote:
 
   This is not needed, only spammers put raw latin-1 characters in mail
   headers.
  The key point is that when we receive a mail with raw 8bit characters,
 The key point is that we should not even accept mail with raw 8bit
 characters in the headers.

I agree with you, but that is an ideal solution rather than a practical
one.  As Stephen said, there are people who use raw 8bit characters
(intended to be KOI8-R).  If you could force them to use proper MUAs,
I would fully agree with you.

Anyway, in the current set-up of lists.debian.org, encodings such as
GB2312 and BIG5 (used in debian-chinese-gb and debian-chinese-big5,
respectively) are not supported and are processed just like raw 8bit
characters.  We also have to deal with them.

I am now interested in MHonArc::UTF8.pm.  I had been thinking
that it converts all UTF-8 characters (besides ASCII) into &#xxx;
expressions and doesn't support east Asian languages, which was wrong.
It seems to convert *from* all non-UTF-8 encodings *to* UTF-8
and seems to support east Asian languages as well (because
Unicode::MapUTF8 supports east Asian encodings).
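
The "everything to UTF-8" direction can be sketched in a few lines of
Python.  This is not how UTF8.pm is implemented; the function name and the
fallback parameter are assumptions for this example, and Python's own
codecs stand in for Unicode::MapUTF8.

```python
import codecs


def to_utf8(raw: bytes, charset: str, fallback: str = "iso-8859-1") -> bytes:
    """Re-encode text from its declared charset to UTF-8.

    An unknown charset name falls back to `fallback`; bytes that are
    invalid in the chosen charset become U+FFFD instead of raising.
    """
    try:
        codecs.lookup(charset)       # raises LookupError for unknown names
    except LookupError:
        charset = fallback
    return raw.decode(charset, errors="replace").encode("utf-8")
```

With every message converted this way, an index page can carry a single
charset=utf-8 declaration regardless of which lists contributed to it.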

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: lists.debian.org de-localization (Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages)

2003-01-05 Thread Stephen J. Turnbull
</lurk>
 Marco == Marco d'Itri [EMAIL PROTECTED] writes:

Marco It would be *MUCH* better to just refuse these
Marco messages. Most of them are spam anyway.  At least in my
Marco country (and in all western europe, I think) raw latin-1
Marco characters in headers are never found outside of spam
Marco messages.

He did say Russian.  On xemacs-users-ru, which is dedicated to
Russian-language posts, about half the users use RFC-2047 encoded-words,
and the rest are split evenly between ASCII-only and 8-bit Cyrillic.
Raw Cyrillic in headers is used by some of the more sophisticated
users, too, surprisingly enough.

This is a fairly small sample (about 100 subscribers, 25 regular
posters).  However, the Russian spam I've seen (isn't it funny how you
can identify spam even though you can't read the language it's written
in?) invariably fails either the addressee tests (implicit, too many),
the known spam software test, or the HTML-only test.  So (FWIW) I've
disabled the 8-bit test and so far the Russian subscribers are happy.

I will also say I've seen a fair amount of dumbquotes from MS-encumbered
posters, and the occasional accented Latin character from French and
German posters (those are quite rare, but not nonexistent).

Marco /^Subject: .*[^[:print:]]{8}/   REJECT Your mailer is not \
Marco RFC 2047 compliant

If you're going to do that, 8 is probably too many (SPC is not an
8-bit character---I find 3 works well) and the reason should be
failure to comply with RFC 2822.  AFAIK 2047 does not prohibit 8-bit
characters, it simply provides a mechanism to encode them in
environments where they are prohibited.
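
The effect of the run length is easy to try out.  The following Python
sketch is an approximation, not the mail server rule itself: it matches only
bytes 0x80-0xff, whereas [^[:print:]] also covers control characters, and
the function name and min_run parameter are invented for this example.

```python
import re


def has_raw_8bit_run(header_line: bytes, min_run: int = 3) -> bool:
    """True if a Subject line contains min_run consecutive 8bit bytes."""
    # Byte pattern: "Subject: ", anything, then a run of 8bit octets.
    pattern = re.compile(rb"^Subject: .*[\x80-\xff]{%d}" % min_run)
    return pattern.search(header_line) is not None
```

A KOI8-R subject made of a six-letter and a three-letter word is caught with
a run length of 3 but slips past a run length of 8, because the space between
the words is a 7bit character and breaks the run.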
<lurk>

-- 
Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp
University of TsukubaTennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.



Re: lists.debian.org de-localization (Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages)

2003-01-05 Thread Denis Barbier
On Sun, Jan 05, 2003 at 10:18:48AM +0900, Tomohiro KUBOTA wrote:
[...]
  CharsetConverters
  plain;  mhonarc::htmlize;
  us-ascii;   mhonarc::htmlize;
  iso-8859-1; mhonarc::htmlize;
  iso-8859-2; iso_8859::str2sgml; iso8859.pl
  iso-8859-3; iso_8859::str2sgml; iso8859.pl
 
 Why not use iso_8859::str2sgml; instead of mhonarc::htmlize for iso-8859-1?
 
 (Though I am new to MHonArc, I imagine that iso_8859::str2sgml converts
 ISO-8859 8bit characters into SGML entity like &ouml;.)
[...]

Sounds like a very good idea.

Denis



Re: lists.debian.org de-localization

2003-01-05 Thread Tomohiro KUBOTA
Hi,

From: [EMAIL PROTECTED] (Denis Barbier)
Subject: Re: lists.debian.org de-localization (Re: automatically-generated 
ISO-8859-1 characters in mulbibyte webpages)
Date: Sun, 5 Jan 2003 15:33:41 +0100

  Why not use iso_8859::str2sgml; instead of mhonarc::htmlize for iso-8859-1?
[...]
 Sounds like a very good idea.

Who should I ask for this modification?

(permission of klecker:/org/www.debian.org/cron/people_scripts/people.pl
is 755 owner=joy group=debwww, but I don't know whether klecker is the
right place to do this, because I checked klecker just by chance.  I also
checked gluck(=www.debian.org) and master but they don't have the file.
Where can I find a document on how /org/* are processed?)

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




lists.debian.org de-localization (Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages)

2003-01-04 Thread Tomohiro KUBOTA
Hi,

From: Tomohiro KUBOTA [EMAIL PROTECTED]
Subject: Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages
Date: Fri, 03 Jan 2003 09:06:43 +0900 (JST)

 BTW, I found similar trouble in lists.debian.org pages.  In thread-list
 pages or date-list pages like
 
   http://lists.debian.org/debian-devel/2002/debian-devel-200212/threads.html,
 
 there is no charset specification.  In such cases, web browsers will
 assume an encoding for these pages according to user preference.
 Naturally, Japanese people configure web browsers to assume a Japanese
 encoding for pages without charset specification.  On the other hand,
 the thread-list pages show senders' names in <em> format, and therefore
 a </em> tag follows the name.  If the last letter of the name is 8bit,
 the tag is broken.  The result is that everything that follows is shown
 in <em> (italic) format.

 The test is easy: please configure your browser to assume Japanese
 encoding for pages without charset specification and load the above
 page.


 However, in this case, the solution is a bit complicated.  All mails
 should have encoding information in MIME format.  Thus, the best
 solution would be to parse MIME.  On the other hand, the simplest
 makeshift solution is to add charset=iso-8859-1 to all pages,
 but there are mailing lists where most of the 8bit characters are
 Cyrillic, and so on.
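
The broken </em> tag described above can be reproduced with a toy model in
Python.  This is a deliberately simplified assumption about how a browser
set to a double-byte encoding reads the page: any byte of 0x80 or above is
taken as a lead byte and swallows the byte after it.  Real decoders differ
in detail, and the function name is invented for this example.

```python
def naive_double_byte_view(data: bytes) -> str:
    """Show what a naive double-byte decoder would see in raw 8bit text."""
    seen = []
    i = 0
    while i < len(data):
        if data[i] >= 0x80:
            # Lead byte: it and the following byte form one (unknown)
            # character, shown here as U+FFFD.
            seen.append("\ufffd")
            i += 2
        else:
            seen.append(chr(data[i]))
            i += 1
    return "".join(seen)
```

Fed the bytes of <em>Jos\xe9</em>, the model consumes the "<" of the closing
tag together with the 0xe9 byte, which leaves "/em>" as ordinary text and
the <em> element open.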


I found that MHonArc has a feature to solve this problem.

  http://www.mhonarc.org/MHonArc/doc/faq/mime.html#nonascii

I checked /org/lists.debian.org/mhonarc/debian.rc and found
that it seems to assume that any 8bit characters are ISO-8859-1.

 CharsetConverters
 plain;  mhonarc::htmlize;
 us-ascii;   mhonarc::htmlize;
 iso-8859-1; mhonarc::htmlize;
 iso-8859-2; iso_8859::str2sgml; iso8859.pl
 iso-8859-3; iso_8859::str2sgml; iso8859.pl

Why not use iso_8859::str2sgml; instead of mhonarc::htmlize for iso-8859-1?

(Though I am new to MHonArc, I imagine that iso_8859::str2sgml converts
ISO-8859 8bit characters into SGML entities like &ouml;.)
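
For readers who want to see what such a conversion amounts to, here is a
minimal Python sketch for the ISO-8859-1 case.  It is a guess at the
behaviour, not a port of iso_8859::str2sgml; the function name is
hypothetical, and Python's html.entities table stands in for the Perl
mapping tables.

```python
from html.entities import codepoint2name


def latin1_to_entities(raw: bytes) -> str:
    """Convert ISO-8859-1 bytes to pure-ASCII text with SGML entities."""
    pieces = []
    for byte in raw:                     # in ISO-8859-1, byte == code point
        name = codepoint2name.get(byte)  # e.g. 0xf6 -> "ouml", 0x3c -> "lt"
        if name is not None:
            pieces.append(f"&{name};")
        elif byte >= 0x80:
            # 0x80-0x9f have no named entity; use a numeric reference.
            pieces.append(f"&#{byte};")
        else:
            pieces.append(chr(byte))
    return "".join(pieces)
```

The output contains no 8bit bytes at all, so it cannot collide with a
following tag whatever encoding the browser assumes.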

It would be nice if we could convert raw 8bit mail headers (they are
illegal, but they do occur and can break the lists.debian.org pages)
to SGML entities by assuming they are ISO-8859-1.  Since this may
annoy Russian (and other non-ISO-8859-1) people who happen to use MUAs
which generate illegal mail headers with 8bit characters and no charset
specification, I'd like to hear from people in various countries.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: lists.debian.org de-localization (Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages)

2003-01-04 Thread Marco d'Itri
On Jan 05, Tomohiro KUBOTA [EMAIL PROTECTED] wrote:

 It would be nice if we can convert raw 8bit mail headers (though it is
 illegal; it sometimes happens and may cause breaking the lists.debian.org
 pages) to SGML entities by assuming they are ISO-8859-1.  Since this may
 annoy Russian (and other non-ISO-8859-1) people who happen to use MUAs
 which generates illegal mail headers with 8bit characters without charset
 specification, I'd like to hear from people from various countries.
It would be *MUCH* better to just refuse these messages. Most of them
are spam anyway.
At least in my country (and in all of western Europe, I think) raw latin-1
characters in headers are never found outside of spam messages.

/^Subject: .*[^[:print:]]{8}/   REJECT Your mailer is not RFC 2047 compliant

-- 
ciao,
Marco