Re: [Encode] euc-jp vs euc-jisx0213

2002-04-30 Thread Dan Kogai

On Monday, April 29, 2002, at 07:38 , SADAHIRO Tomoyuki wrote:
> I doubt whether users of 'euc-jp' will
> assume it to be a combination with JIS X 0213.

They don't have to because 'euc-jp' behaves exactly the same as before 
so long as the charset is in ASCII/JISX(0201|0208|0212).

> Such a mixing would prevent warning/croaking
> for appearance of code points that are not defined
> originally (meaning w/o X 0213), wouldn't it?

That was my biggest concern but I have decided to go ahead with euc-jp 
to (partially) support JIS X 0213 and the reason is simple;  Encode::JP 
is already too big to differentiate between various euc-jp.  In such 
cases, we should settle for the most 'comprehensive' version.

Even the term 'euc-jp' is too ambiguous for many;  At first it didn't 
include G3 and some say they must be clearly marked as something like 
'euc-jp-classic' (no 0212 support) vs 'euc-jp-modern' and so forth (then 
our current euc-jp should be marked as 'euc-jp-postmodern' :).  It would 
be nice if we can go that way like 7bit-JIS/ISO-2022-JP/ISO-2022-JP-1 
but for euc-jp we have to have a whole ucm for each.

This is definitely a todo for Perl 5.8.1 and up and I have already come 
up with a solution;  the future Encode (Encode II) will support 
CES-generator;  that is, you can express euc-jp not as a whole big 
table but a combination of tables.  That will also reduce the duplicates 
found in vendor mappings.  It will be a complete rewrite of encengine.c.

But that requires not only code but also an extension of the UCM format, so give 
me more time (and Perl 5.8.0!).

Dan the Encode Maintainer




[Encode] Encode-JIS2K-0.01 uploaded to CPAN

2002-04-30 Thread Dan Kogai

Folks,

   I gotta go in 5 minutes so I just dump the README file after the sig.

Dan the Encode Maintainer

NAME
Encode::JIS2K - JIS X 0213 (aka JIS 2000) Encodings

INSTALLATION

To install this module type the following:

perl Makefile.PL
make
make test
make install

SYNOPSIS
  use Encode::JIS2K;
  use Encode qw/encode decode/;
  $euc_2k = encode("euc-jisx0213", $utf8);
  $utf8   = decode("euc-jisx0213", $euc_jp);

ABSTRACT
This module implements encodings that cover the JIS X 0213
charset (AKA JIS 2000, hence the module name).  Encodings
supported are as follows.

  Canonical     Alias                                 Description
  ----------------------------------------------------------------------
  euc-jisx0213  qr/\beuc.*jp[ \-]?(?:2000|2k)$/i      EUC-JISX0213
                qr/\bjp.*euc[ \-]?(2000|2k)$/i
                qr/\bujis[ \-]?(?:2000|2k)$/i
  shiftjisx0213 qr/\bshift.*jis(?:2000|2k)$/i         Shift_JISX0213
                qr/\bsjis[ \-]?(?:2000|2k)$/i
  iso-2022-jp-3
  jis0213-1-raw                                       JIS X 0213 plane 1, raw format
  jis0213-2-raw                                       JIS X 0213 plane 2, raw format
  ----------------------------------------------------------------------
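
As a rough illustration of how the alias patterns in the table behave, here is a standalone sketch that matches a few names against them.  The helper canonical_for() and the list layout are made up for this example (Encode resolves aliases through its own machinery), and the sjis pattern is written as qr/\bsjis[ \-]?(?:2000|2k)$/i on the assumption that this is what the listing intends.

```perl
use strict;
use warnings;

# Alias patterns copied from the table; the layout of this list and the
# helper below are assumptions made for the example only.
my @aliases = (
    [ qr/\beuc.*jp[ \-]?(?:2000|2k)$/i, 'euc-jisx0213'  ],
    [ qr/\bjp.*euc[ \-]?(2000|2k)$/i,   'euc-jisx0213'  ],
    [ qr/\bujis[ \-]?(?:2000|2k)$/i,    'euc-jisx0213'  ],
    [ qr/\bshift.*jis(?:2000|2k)$/i,    'shiftjisx0213' ],
    [ qr/\bsjis[ \-]?(?:2000|2k)$/i,    'shiftjisx0213' ],
);

# Return the canonical name for the first pattern that matches, or undef.
sub canonical_for {
    my ($name) = @_;
    for my $pair (@aliases) {
        my ($re, $canonical) = @$pair;
        return $canonical if $name =~ $re;
    }
    return;
}

for my $name (qw(euc-jp-2000 ujis2k SJIS2K euc-jp)) {
    my $canonical = canonical_for($name);
    printf "%-12s => %s\n", $name,
        defined $canonical ? $canonical : '(no match)';
}
```

Plain 'euc-jp' deliberately matches none of the patterns, since every alias requires a trailing "2000" or "2k".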
  


DESCRIPTION
To find out how to use this module in detail, see the
Encode manpage.

What is JIS X 0213 anyway?
Simply put, JIS X 0213 is a rework and reorganization of
JIS X 0208 and JIS X 0212.  It consists of two 94x94
planes which roughly correspond as follows:

  JIS X 0213 Plane 1 = JIS X 0208 + extension
  JIS X 0213 Plane 2 = JIS X 0212 reorganized + extension

And here is the character repertoire thereof at a glance.

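                    # of codepoints   Kuten Ku (rows) used
  ----------------------------------------------------------
  JIS X 0208         6,879            1..8,16..83
  JIS X 0213-1       8,762            1..94 (all!)
  JIS X 0212         6,067            2,6..7,9..11,16..77
  JIS X 0213-2       2,436            1,3..5,8,12..15,78..94
  ----------------------------------------------------------
  (JIS X0213 Total) 11,197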

JIS X 0213 was designed to extend JIS X 0208 and JIS X
0212 without being incompatible with (classic) EUC-JP and
Shift_JIS.  The following characteristics result from
that design.

o JIS X 0213 plane 1 is (almost) a superset of JIS X 0208.
  However, with Unicode 3.2.0 the mappings differ in 3
  codepoints.

    Kuten   JIS X 0208 -> Unicode            JIS X 0213 -> Unicode
    ----------------------------------------------------------------
    1-1-17  <UFFE3> # FULLWIDTH MACRON       <U203E> # OVERLINE
    1-1-29  <U2014> # EM DASH                <U2015> # HORIZONTAL BAR
    1-1-79  <UFFE5> # FULLWIDTH YEN SIGN     <U00A5> # YEN SIGN

o By the same token, JIS X 0213 plane 2 contains JIS Dai-4
  Suijun Kanji (JIS Kanji Repertoire Level 4).  This
  allows EUC-JP's G3 to contain both JIS X 0212 and JIS X
  0213 plane 2.

  However, JIS X 0212:1990 already contains many of Dai-4
  Suijun Kanji so EUC's G3 is subject to containing
  duplicate mappings.

o Because of Halfwidth Katakana, Shift_JIS mapping has
  always been tricky, and now it is even trickier.  Here is
  a regex that matches a Shift_JISX0213 sequence (note: you
  have to use bytes to make it work!)

    $re_valid_shiftjisx0213 =
      qr/^(?:
         [\x00-\x7f] |                             # ASCII or
         [\xa1-\xdf] |                             # JIS X 0201 KANA or
         [\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc]  # JIS X 0213
         )+$/xo;
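
A minimal sketch of how the pattern might be used on raw octets.  The helper name looks_like_shiftjisx0213() is hypothetical, and the check is structural only: it says nothing about whether a given two-byte sequence is actually assigned in JIS X 0213.

```perl
use strict;
use warnings;

# The pattern from the text, applied to octets (not decoded characters).
my $re_valid_shiftjisx0213 = qr/^(?:
      [\x00-\x7f] |                             # ASCII or
      [\xa1-\xdf] |                             # JIS X 0201 KANA or
      [\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc]  # two-byte JIS X 0213
    )+$/x;

# Hypothetical helper: true if $octets consists only of byte sequences
# that the pattern accepts.
sub looks_like_shiftjisx0213 {
    my ($octets) = @_;
    return 0 unless defined $octets;
    return $octets =~ $re_valid_shiftjisx0213 ? 1 : 0;
}

for my $sample ("abc", "\x82\xa0", "\xb1", "\x82", "\xff") {
    printf "%-8s %s\n",
        join('', map { sprintf '%02x', ord } split //, $sample),
        looks_like_shiftjisx0213($sample) ? 'ok' : 'not ok';
}
```

A lone lead byte such as "\x82" is rejected because the two-byte alternative needs its trail byte.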

Note on EUC-JISX0213 (vs. EUC-JP)

As of Encode-1.64, 'euc-jp' does support euc-jisx0213 for
decoding.  However, 'euc-jp' in Encode and 'euc-jisx0213'
differ as follows;

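                    euc-jp                    euc-jisx0213
  ------------------------------------------------------------------
  Decodes           (0201-K|0208|0212|0213)   ditto
  Round-Trip  (|0)  JIS X (0201-K|0208|0212)  JIS X (0201-K|0213)
  Decode Only (|3)  those only found in 0213  those only found in 0212
  ------------------------------------------------------------------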

AUTHORS
Dan Kogai [EMAIL PROTECTED]

COPYRIGHT
Copyright 2002 by Dan Kogai [EMAIL PROTECTED].

This program is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.

See http://www.perl.com/perl/misc/Artistic.html

SEE ALSO
the Encode manpage, the Encode::JP manpage

Encode doesn't like undef

2002-04-30 Thread Paul Marquess

This is with Encode 1.64

$ perl5.7.3 -w -MEncode -e 'Encode::encode_utf8(undef)'
Use of uninitialized value in subroutine entry at
/tmp/bleed/lib/perl5/5.7.3/sun4-solaris/Encode.pm line 183.

I don't know Encode well enough to check if there are any other places this
will strike.

Paul




Apache::GuessCharset - Encode::Guess + I18N::Charset

2002-04-30 Thread Tatsuhiko Miyagawa

Hi,

I've made a tiny PerlFixupHandler to guess files' encodings and
automatically add a charset attribute to Content-Type according to the
guess, using Encode::Guess and I18N::Charset.

This module can be a powerful replacement/supplement for Apache's
Add*Charset stuff. The module can be downloaded from
http://bulknews.net/lib/archives/Apache-GuessCharset-0.01.tar.gz

Well, this module uses I18N::Charset internally to translate encoding
names into IANA registered names (and its own table to convert them to
preferred MIME again). How about porting this module into Encode,
which might be useful?

Thanks,

--
Tatsuhiko Miyagawa [EMAIL PROTECTED]


NAME
Apache::GuessCharset - adds HTTP charset by guessing file's encoding

SYNOPSIS
  PerlModule Apache::GuessCharset
  SetHandler perl-script
  PerlFixupHandler Apache::GuessCharset

  # how many bytes to read for guessing (default 512)
  PerlSetVar GuessCharsetBufferSize 1024

  # list of encoding suspects
  PerlSetVar GuessCharsetSuspects euc-jp
  PerlAddVar GuessCharsetSuspects shiftjis
  PerlAddVar GuessCharsetSuspects 7bit-jis

DESCRIPTION
Apache::GuessCharset is an Apache handler which adds an HTTP charset
attribute by automatically guessing files' encodings via Encode::Guess.

CONFIGURATION
This module uses the following configuration variables.

GuessCharsetSuspects
a list of encodings for Encode::Guess to check. See the
Encode::Guess manpage for details.

GuessCharsetBufferSize
specifies how many bytes for this module to read from source file,
to properly guess encodings. default is 512.
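
The guessing step that these two variables feed can be sketched as follows.  This is not the handler's own code: guess_name() is a hypothetical helper, and the 512-octet cut simply mirrors the documented default buffer size.

```perl
use strict;
use warnings;
use Encode ();
use Encode::Guess;   # exports guess_encoding()

# Hypothetical helper: return the name Encode uses for the guessed
# encoding, or undef when the guess fails or is too ambiguous.
sub guess_name {
    my ($octets, @suspects) = @_;
    my $enc = guess_encoding($octets, @suspects);
    return ref $enc ? $enc->name : undef;   # on failure $enc is an error string
}

my @suspects = qw(euc-jp shiftjis 7bit-jis);
my $sample   = Encode::encode('euc-jp', "\x{65e5}\x{672c}\x{8a9e}");

# Only look at the first 512 octets, as the handler does by default.
my $name = guess_name(substr($sample, 0, 512), @suspects);
print defined $name ? "guessed: $name\n" : "could not guess\n";
```

Short samples are often valid in more than one suspect encoding, in which case guess_encoding() returns an error string rather than an object, so callers need the undef branch.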

AUTHOR
Tatsuhiko Miyagawa [EMAIL PROTECTED]

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

SEE ALSO
the Encode::Guess manpage, the Apache::File manpage





Re: Apache::GuessCharset - Encode::Guess + I18N::Charset

2002-04-30 Thread Tatsuhiko Miyagawa

At Tue, 30 Apr 2002 09:45:30 -0400,
Thurn, Martin (Intranet) [EMAIL PROTECTED] wrote:

> > preferred MIME again). How about porting this module into Encode,
> > which might be useful?   ^^^
>
> What exactly do you have in mind?
> Add the charset names used by Encode?
> That would be great... if I ever get time to play with 5.8.0 etc...?

Something like that.

My first thought was "how do I get the IANA registered name for the current
encoding via Encode?". Then I searched CPAN and came upon your
I18N::Charset. It would be nice if I18N::Charset worked seamlessly with
Encode.

But this would be something for 5.8.1.
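
For readers on a much later Encode than the one discussed in this thread: encoding objects there provide a mime_name method, which covers at least part of this need.  The sketch assumes such a release is installed, and mime_charset_for() is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: map a name or alias that Encode understands to its
# MIME-style name, or undef if Encode does not know the encoding.
sub mime_charset_for {
    my ($name) = @_;
    my $enc = find_encoding($name) or return;
    return $enc->mime_name;
}

for my $name (qw(euc-jp shiftjis utf8 no-such-encoding)) {
    my $mime = mime_charset_for($name);
    printf "%-18s => %s\n", $name, defined $mime ? $mime : '(unknown)';
}
```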

-- 
Tatsuhiko Miyagawa [EMAIL PROTECTED]



Re: Encode doesn't like undef

2002-04-30 Thread Dan Kogai

On Tuesday, April 30, 2002, at 07:14 , Paul Marquess wrote:
> This is with Encode 1.64
>
> $ perl5.7.3 -w -MEncode -e 'Encode::encode_utf8(undef)'
> Use of uninitialized value in subroutine entry at
> /tmp/bleed/lib/perl5/5.7.3/sun4-solaris/Encode.pm line 183.
>
> I don't know Encode well enough to check if there are any other places
> this will strike.

I think we'd better leave it that way;  It needs a PV to (en|de)code so 
consider this a feature.  Of course

perl5.7.3 -w -MEncode -e 'Encode::encode_utf8()'

is perfectly safe and legal.

Dan the Encode Maintainer




RE: Encode doesn't like undef

2002-04-30 Thread Paul Marquess

From: Dan Kogai [mailto:[EMAIL PROTECTED]]

> On Tuesday, April 30, 2002, at 07:14 , Paul Marquess wrote:
> > This is with Encode 1.64
> >
> > $ perl5.7.3 -w -MEncode -e 'Encode::encode_utf8(undef)'
> > Use of uninitialized value in subroutine entry at
> > /tmp/bleed/lib/perl5/5.7.3/sun4-solaris/Encode.pm line 183.
> >
> > I don't know Encode well enough to check if there are any other places
> > this will strike.
>
> I think we'd better leave that way;  It needs a PV to (en|de)code so
> consider this a feature.

I agree that passing undef() to one of the encoding functions may be an edge
condition too far, but passing a variable that contains undef is more
common.

$ perl5.7.3 -w -MEncode -e 'Encode::encode_utf8($a)'
Name "main::a" used only once: possible typo at -e line 1.
Use of uninitialized value in subroutine entry at
/tmp/bleed/lib/perl5/5.7.3/sun4-solaris/Encode.pm line 183.

Can this be detected & silenced?


> Of course
>
> perl5.7.3 -w -MEncode -e 'Encode::encode_utf8()'
>
> is perfectly safe and legal.

Paul




Re: Encode doesn't like undef

2002-04-30 Thread Dan Kogai

On Tuesday, April 30, 2002, at 11:42 , Paul Marquess wrote:
> I agree that passing undef() to one of the encoding functions may be an
> edge condition too far, but passing a variable that contains undef is more
> common.
>
> $ perl5.7.3 -w -MEncode -e 'Encode::encode_utf8($a)'
> Name "main::a" used only once: possible typo at -e line 1.
> Use of uninitialized value in subroutine entry at
> /tmp/bleed/lib/perl5/5.7.3/sun4-solaris/Encode.pm line 183.
>
> Can this be detected & silenced?

You've got a point.  A warning should be issued when and only when there 
is danger, and passing undef is in itself harmless.  This can be done 
easily by adding 'defined $str or return;' to each sub concerned.  
Okay, I'll go for that.
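
The guard described here can be sketched on the caller's side as a small wrapper.  The name encode_utf8_quietly() is hypothetical and this is not the code that went into Encode.pm; it only shows the 'defined $str or return;' idiom in context.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper (not part of Encode): return undef quietly for an
# undefined argument instead of passing it on to Encode.
sub encode_utf8_quietly {
    my ($str) = @_;
    defined $str or return;
    return Encode::encode_utf8($str);
}

my $maybe;                                  # deliberately left undefined
my $octets = encode_utf8_quietly($maybe);   # no "uninitialized" warning
print defined $octets ? "encoded\n" : "skipped undef\n";
print length(encode_utf8_quietly("\x{3042}")), " octets\n";
```

Keeping the guard in a wrapper leaves the decision to each caller, which matters if the underlying function is expected to warn on undef like print or tr/// do.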

Dan the Encode Maintainer




2nd Last Call for the Character Model for the WWW

2002-04-30 Thread Misha . Wolf

Owing to the large number of Last Call comments on the:
   Character Model for the World Wide Web
   W3C Working Draft 26 January 2001
and to the large number of changes these comments gave rise to, the W3C
I18N WG has decided to issue a 2nd Last Call for comments:
   Character Model for the World Wide Web
   W3C Working Draft 30 April 2002
   http://www.w3.org/TR/2002/WD-charmod-20020430

The document's abstract says:

   This Architectural Specification provides authors of specifications,
   software developers, and content developers with a common reference
   for interoperable text manipulation on the World Wide Web.  Topics
   addressed include encoding identification, early uniform
   normalization, string identity matching, string indexing, and URI
   conventions, building on the Universal Character Set, defined jointly
   by Unicode and ISO/IEC 10646.  Some introductory material on
   characters and character encodings is also provided.

Please review the document and submit any comments by 31 May 2002.
Comments should preferably be submitted via the Last Call Comment Form
(http://www.w3.org/2002/05/charmod/LastCall).  They may alternatively be
submitted by email to [EMAIL PROTECTED]  To help us process your
comments, please submit each comment separately.  If doing so is too
awkward, please number your comments clearly.

Many thanks,
Misha Wolf
W3C I18N WG Chair








RE: Encode doesn't like undef

2002-04-30 Thread Paul Marquess

From: Dan Kogai [mailto:[EMAIL PROTECTED]]

> On Tuesday, April 30, 2002, at 11:42 , Paul Marquess wrote:
> > I agree that passing undef() to one of the encoding functions may be an
> > edge condition too far, but passing a variable that contains undef is more
> > common.
> >
> > $ perl5.7.3 -w -MEncode -e 'Encode::encode_utf8($a)'
> > Name "main::a" used only once: possible typo at -e line 1.
> > Use of uninitialized value in subroutine entry at
> > /tmp/bleed/lib/perl5/5.7.3/sun4-solaris/Encode.pm line 183.
> >
> > Can this be detected & silenced?
>
> You've got a point.  Warning should warn when and only when there is a
> danger therein and passing undef itself is harmless.  And this can be
> done easily by adding defined $str or return; for each sub concerned.
> Okay, I'll go for that.

Yep, I think that's the fix to go for.

Paul




RE: Encode doesn't like undef

2002-04-30 Thread Nick Ing-Simmons

Paul Marquess [EMAIL PROTECTED] writes:

>I agree that passing undef() to one of the encoding functions may be an edge
>condition too far, but passing a variable that contains undef is more
>common.

<Paraphrase>

nick@bactrian 1078$ perl -w -e "print undef"
Use of uninitialized value in print at -e line 1.

nick@bactrian 1079$ perl -w -e "print $a"
Use of uninitialized value in print at -e line 1.

nick@bactrian 1080$ perl -w -e '$a =~ tr/A/a/'
Name "main::a" used only once: possible typo at -e line 1.
Use of uninitialized value in transliteration (tr///) at -e line 1.

I agree that passing undef() to print/tr functions may be an edge
condition too far, but passing a variable that contains undef is more
common.

Can this be detected & silenced?

</Paraphrase>

Yes it could but we don't for very good reasons.

-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/






Re: [Encode] 1.65 released

2002-04-30 Thread Nick Ing-Simmons

Dan Kogai [EMAIL PROTECTED] writes:

>$Revision: 1.65 $ $Date: 2002/04/30 16:13:37 $
>! Encode.pm
>   encode(undef) no longer warns for "Use of uninitialized value in
>   subroutine entry".  Suggested by Paul.

Can I get warnings + fallbacks yet?

Can we have a bit to enable warns on undef please ;-)

There are casting issues with perlqq etc. using (say) UVxf but only 
passing a U8 (e.g. s[slen]) and not a UV which the format expects.

>   Message-Id: [EMAIL PROTECTED]
>! lib/Encode/Supported.pod
>   Encode::MIME::Header and Encode::Guess mentioned
>   Updated for Encode::HanExtra 0.05 and Encode::JIS2K
>! lib/Encode/Guess.pm
>   POD fix by Miyagawa-kun
>   Message-Id: [EMAIL PROTECTED]
>
>Dan the Encode Maintainer
-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/






Encode should stay undefphobia

2002-04-30 Thread Dan Kogai

On Wednesday, May 1, 2002, at 02:10 , Nick Ing-Simmons wrote:
> Dan Kogai [EMAIL PROTECTED] writes:
>
> Please don't.
>
> $a =~ tr/A/a/;
>
> gives a warning so should encode/decode.

How could I be so dumb as not to anticipate you saying that! (Blame it on 
the fever.)  Paul, I now think Nick has a stronger point than yours, so I 
will revert it in the next version.  Maybe I will document this 
undef-phobia of Encode subs in the POD.

Dan the Warned Man




[Encode] 1.66 Released

2002-04-30 Thread Dan Kogai

My fever was finally down by the time I released Encode-1.66, which is 
available as follows:

Whole:
http://www.dan.co.jp/~dankogai/Encode-1.66.tar.gz or CPAN
Diff against current: 264 lines
http://www.dan.co.jp/~dankogai/current-1.66.diff.gz

And Changes.

$Revision: 1.66 $ $Date: 2002/05/01 05:41:06 $
! Encode.xs t/fallback.t
   WARN_ON_ERR no longer assumes RETURN_ON_ERR so you can issue a warning
   while fallback is in effect.  This even came with a welcome side-effect
   of cleaner code with less nests!  Thank you, NI-XS.  t/fallback.t is
   also modified to test this.
   And of course, the corresponding variables to UV[Xx]f are appropriately
   cast.  This should've concluded NI-XS homework.
! Encode.pm
   encode(undef) does warn again!  Repented upon suggestion by NI-XS.
   Document for unless vs. '' added
   Message-Id: [EMAIL PROTECTED]
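
A small sketch of what the WARN_ON_ERR change allows, namely a warning issued while a fallback is still applied.  It assumes an Encode that behaves as this changelog entry describes; encode_ascii_noisily() is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: encode to ASCII, substituting \x{HHHH} for anything
# unmappable (FB_PERLQQ) while also asking for a warning (WARN_ON_ERR).
sub encode_ascii_noisily {
    my ($text) = @_;
    my $check = Encode::FB_PERLQQ() | Encode::WARN_ON_ERR();
    return Encode::encode('ascii', $text, $check);
}

# Collect warnings instead of printing them, so they can be counted.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $octets = encode_ascii_noisily("a\x{3042}b");
print "$octets\n";
print scalar(@warnings), " warning(s)\n";
```

FB_PERLQQ already includes LEAVE_SRC, so the caller's string is left untouched.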

As you see, this release is about the NI-XS homework issues.  Now I have 
only djgpp left (I think; djgpp is just so slow in my environment).

Dan the Encode Maintainer