Re: Long name rocks! But how about *.ecm?

SADAHIRO Tomoyuki Mon, 25 Mar 2002 07:23:02 -0800


On Mon, 25 Mar 2002 21:56:08 +0900
Dan Kogai <[EMAIL PROTECTED]> wrote:

> On Monday, March 25, 2002, at 09:37 , Nick Ing-Simmons wrote:
> >>
> >>  in trouble?  Or perl on such systems are smart enough to load
> >> UNIVERSA.pm (I guess this is the case).
> >
> > They load UNIVERSAL.pm and the OS truncates it and finds UNIVERSA.pm.
> >
> >>   Size reduction was a byproduct of */Makefile.PL linting.
> >>   As for "Encode::Supports", there is another concern in perldoc;  is
> >> perldoc smart enough to 8.3-ize filenames?
> >
> > Same logic as above works - name passed to OS is still the long one.
> 
>    Okay, I am convinced that we should stick with the original, long, 
> user-friendly names but how about ucm-transitions?
>    As of Encode-0.98, there are so many duped tables under Encode/ and I 
> want to tidy it up if possible.  Well, for this I will wait what 
> Sadahiro-san has to say....

Hmm... I'm not opposed to it.

IMO, a more significant point might be
which encodings are worth implementing in the core distribution.
In other words, it would be better to assess each encoding
that is currently supported only by Encode::Tcl.

AFAIK, such encodings include ISO-2022-JP-2 and ISO-2022-CN
(defined by 2022-jp2.enc and 2022-cn.enc, respectively).

But encoding to them may seem weird,
since the same character is defined in more than one of their charsets.

Say, here is an example of ISO-2022-CN cited from RFC 1922.

      Example: the hex sequence

         1b 24 29 41 0e 3d 3b 3b 3b 1b 24 29 47 47 28 5f 50 0f

      represents the Chinese word for "Interchange" (jiao huan) twice;

where <3d 3b 3b 3b> is "jiao huan" in GB (GB 2312-80),
  and <47 28 5f 50> is "jiao huan" in CNS (CNS 11643 plane-1).

Decoding it then gives "\x{4ea4}\x{6362}\x{4ea4}\x{63db}".
Both occurrences of "jiao" map to the same Unicode code point (U+4EA4)!

Encoding "\x{4ea4}\x{6362}\x{4ea4}\x{63db}" back to ISO-2022-CN
will give the following hex sequence, which differs from the original:

   1b 24 29 41 0e 3d 3b 3b 3b 3d 3b 1b 24 29 47 5f 50 0f

where <3d 3b 3b 3b 3d 3b> is "jiao huan jiao" in GB,
  and <5f 50> is "huan" in CNS.

How about it?
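
The round trip above can be sketched in a few lines of Python. This is only
an illustration, not the Encode::Tcl code: the two tables hold just the four
double-byte codes from the RFC 1922 example (they are not real GB 2312 or
CNS 11643 tables), the function names are made up, and the encoder assumes a
GB-before-CNS preference, which is what the encoded output above implies.
ISO-2022-CN's rule about repeating designations after a line break is ignored.

```python
# Toy tables: only the four double-byte codes from the RFC 1922 example.
GB = {b"\x3d\x3b": "\u4ea4", b"\x3b\x3b": "\u6362"}      # GB 2312-80
CNS1 = {b"\x47\x28": "\u4ea4", b"\x5f\x50": "\u63db"}    # CNS 11643 plane 1

# Designation escapes (ESC $ ) A and ESC $ ) G), in assumed preference order.
CHARSETS = [(b"\x1b$)A", GB), (b"\x1b$)G", CNS1)]
SO, SI = 0x0E, 0x0F  # shift out to the designated set / shift in to ASCII


def decode(data: bytes) -> str:
    tables = dict(CHARSETS)
    out, table, shifted, i = [], None, False, 0
    while i < len(data):
        byte = data[i]
        if byte == 0x1B:                 # four-byte designation escape
            table = tables[data[i:i + 4]]
            i += 4
        elif byte in (SO, SI):           # shift state change
            shifted = byte == SO
            i += 1
        elif shifted:                    # double-byte character
            out.append(table[data[i:i + 2]])
            i += 2
        else:                            # plain ASCII
            out.append(chr(byte))
            i += 1
    return "".join(out)


def encode(text: str) -> bytes:
    out, current, shifted = bytearray(), None, False
    for ch in text:
        if ord(ch) < 0x80:
            if shifted:
                out.append(SI)
                shifted = False
            out.append(ord(ch))
            continue
        # First charset in preference order that contains the character.
        esc, code = next(
            (esc, code)
            for esc, table in CHARSETS
            for code, uni in table.items()
            if uni == ch
        )
        if esc != current:               # new designation needed
            out += esc
            current = esc
        if not shifted:
            out.append(SO)
            shifted = True
        out += code
    if shifted:
        out.append(SI)
    return bytes(out)


original = bytes.fromhex("1b 24 29 41 0e 3d 3b 3b 3b 1b 24 29 47 47 28 5f 50 0f")
text = decode(original)
reencoded = encode(text)
print(reencoded.hex(" "))
print(reencoded == original)
```

The re-encoded bytes differ from the original because the second "jiao" is
found in GB first, yet both byte sequences decode to the same string.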

ISO-2022-JP-2 is more confusing still, as it has JIS/GB/KS characters.
Many kanji/hanzi/hanja are *triplicated*!
(Of course, the triplicates also include hiragana, katakana, Greek, etc.)

Tagging may be a solution for distinguishing the languages,
but is it truly useful?

NOTE
  In Encode::Tcl::Escape::encode(), each character
  is looked up in the charsets in the order they are listed in the .enc file.

  For example, according to 2022-jp2.enc,
  jis0212 is preferred to gb2312,
  and gb2312 to ksc5601.

E
name            iso2022-jp2
init            {}
final           {}
ascii           \x1b(B
ascii           \x1b(J
jis0208         \x1b$B
jis0208         \x1b$@
jis0212         \x1b$(D
gb2312          \x1b$A
ksc5601         \x1b$(C
7bit-latin1     \x1b.A
7bit-greek      \x1b.F
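
The lookup order described in the NOTE can be sketched as follows. This is
Python for illustration only, not the Encode::Tcl::Escape code:
`preference_order` and `pick_charset` are made-up names, and the coverage
sets are tiny hand-written stand-ins for the real charset tables.

```python
# The .enc listing from this message, kept as raw text.
ENC_TEXT = r"""E
name            iso2022-jp2
init            {}
final           {}
ascii           \x1b(B
ascii           \x1b(J
jis0208         \x1b$B
jis0208         \x1b$@
jis0212         \x1b$(D
gb2312          \x1b$A
ksc5601         \x1b$(C
7bit-latin1     \x1b.A
7bit-greek      \x1b.F
"""

HEADER_KEYS = {"name", "init", "final"}


def preference_order(enc_text):
    """Return charset names in the order they first appear in the listing."""
    order = []
    for line in enc_text.splitlines():
        parts = line.split()
        # Skip the type line ("E") and the name/init/final header lines.
        if len(parts) != 2 or parts[0] in HEADER_KEYS:
            continue
        if parts[0] not in order:
            order.append(parts[0])
    return order


def pick_charset(ch, order, coverage):
    """Return the first charset in `order` whose coverage set contains `ch`."""
    for name in order:
        if ch in coverage.get(name, ()):
            return name
    return None


# Hypothetical toy coverage, NOT real tables: U+4EA4 is listed in three
# charsets, U+6362 (a simplified form) only in gb2312.
TOY_COVERAGE = {
    "jis0208": {"\u4ea4"},
    "gb2312": {"\u4ea4", "\u6362"},
    "ksc5601": {"\u4ea4"},
}

order = preference_order(ENC_TEXT)
print(order)
print(pick_charset("\u4ea4", order, TOY_COVERAGE))
print(pick_charset("\u6362", order, TOY_COVERAGE))
```

With this order, a character present in several charsets always comes out
in the one listed earliest, so the language of the original text is lost.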

>    At least euc-jp must be in *.ucm because it contains triple-bytes (JIS 
> X 0212), which Encode::Tcl used to handle via Encode::Tcl::Extended but 
> now ::Extended is gone....

Well, I've already agreed to that:
http://www.xray.mpe.mpg.de/mailing-lists/perl-unicode/2002-03/msg00076.html

> Dan the Encode Maintainer

Regards,
SADAHIRO Tomoyuki
