Re: Change 16302: Provide the \N{U+HHHH} syntax before we forget.

2002-05-02 Thread Philip Newton

On Fri, 03 May 2002 11:36:46 +0900, [EMAIL PROTECTED] (Sadahiro
Tomoyuki) wrote:

> But Unicode 3.1 extends U+ notation beyond 0x.

Ah! Thanks for the reference.

So maybe that is no longer necessary... by the time 5.8.0 is out,
Unicode 3.2 will have been current for a while. Or should we support the
older notation?

Cheers,
Philip



Encode-InCharset-0.01 Released

2002-05-02 Thread Dan Kogai

I have just released Encode-InCharset-0.01, available as

  http://www.dan.co.jp/~dankogai/Encode-InCharset-0.01.tar.gz and CPAN.

I have developed this module primarily to implement ISO-2022-JP-3 and 
ISO-2022-CN in the future.  To implement encode() for these, you have to know 
which character set a given character belongs to.  This module can also 
be used to check whether a string can safely be encoded
(though fallback is much faster).
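
For comparison, the fallback route mentioned above might look roughly like the sketch below. It uses only the core Encode module; the helper name can_encode is made up for illustration and is not part of Encode or Encode::InCharset.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK);

# Hypothetical helper (not part of any module): returns 1 if every
# character of $string can be mapped into $encoding, 0 otherwise.
sub can_encode {
    my ($encoding, $string) = @_;
    my $copy = $string;    # encode() may modify its argument when CHECK is set
    # With FB_CROAK, encode() dies on the first unmappable character.
    return eval { encode($encoding, $copy, FB_CROAK); 1 } ? 1 : 0;
}

print can_encode('iso-8859-1', "caf\x{e9}")        ? "yes\n" : "no\n";
print can_encode('iso-8859-1', "\x{5c0f}\x{98fc}") ? "yes\n" : "no\n";
```

This only answers yes or no for a whole string; it does not tell you which character set a given character belongs to, which is what the properties are for.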

Dan the Encode Maintainer

NAME
 Encode::InCharset - defines \p{InCharset}

INSTALL
 perl Makefile.PL
 make test && make install

SYNOPSIS
   use Encode::InCharset qw(InJIS0208);
   "I am \x{5c0f}\x{98fc}\x{3000}\x{5f3e}" =~ /(\p{InJIS0208})+/o;
   # guess what is in $1

ABSTRACT
 This module provides the In*Charset* Unicode properties, each of which
 matches the characters in *Charset*.

 As of this writing, Property-matching functions are auto-generated out
 of ucm files in Encode, Encode::HanExtra, and Encode::JIS2K.
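
 For readers who have not seen a property-matching function before, here is
 a hand-written sketch of the general mechanism (see "User-defined Character
 Properties" in perlunicode). The name InMyKana and its ranges are invented
 for illustration; the functions this module generates from ucm files are
 not shown here and may look different.

```perl
use strict;
use warnings;

# Hypothetical property, written by hand for illustration. The sub returns
# one range per line: start and end code points in hex, tab-separated.
sub InMyKana {
    return "3040\t309F\n30A0\t30FF\n";
}

my $text = "abc\x{3042}\x{30a2}xyz";
if ($text =~ /(\p{InMyKana}+)/) {
    printf "matched %d character(s)\n", length $1;
}
```

 Perl looks the sub up by name when it compiles \p{InMyKana}, so the sub must
 be visible in the package where the pattern is used (or be fully qualified).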

DESCRIPTION
 As of this writing, this module supports the character properties shown
 below. Since the names are self-explanatory, I will not discuss them in
 detail.

   InASCII InAdobeStandardEncoding InAdobeSymbol InAdobeZdingbat
   InBIG5EXT InBIG5PLUS InBIG5_ETEN InBIG5_HKSCS InCCCII InCP1006
   InCP1026 InCP1047 InCP1250 InCP1251 InCP1252 InCP1253 InCP1254
   InCP1255 InCP1256 InCP1257 InCP1258 InCP37 InCP424 InCP437 InCP500
   InCP737 InCP775 InCP850 InCP852 InCP855 InCP856 InCP857 InCP860
   InCP861 InCP862 InCP863 InCP864 InCP865 InCP866 InCP869 InCP874
   InCP875 InCP932 InCP936 InCP949 InCP950 InDingbats InEUC_CN
   InEUC_JISX0213 InEUC_JP InEUC_KR InEUC_TW InGB12345 InGB18030 InGB2312
   InGSM0338 InHp_Roman8 InISO_8859_1 InISO_8859_10 InISO_8859_11
   InISO_8859_13 InISO_8859_14 InISO_8859_15 InISO_8859_16 InISO_8859_2
   InISO_8859_3 InISO_8859_4 InISO_8859_5 InISO_8859_6 InISO_8859_7
   InISO_8859_8 InISO_8859_9 InISO_IR_165 InJIS0201 InJIS0208 InJIS0212
   InJIS0213_1 InJIS0213_2 InJohab InKOI8_F InKOI8_R InKOI8_U InKSC5601
   InMacArabic InMacCentralEurRoman InMacChineseSimp InMacChineseTrad
   InMacCroatian InMacCyrillic InMacDingbats InMacFarsi InMacGreek
   InMacHebrew InMacIcelandic InMacJapanese InMacKorean InMacRoman
   InMacRomanian InMacRumanian InMacSami InMacSymbol InMacThai
   InMacTurkish InMacUkrainian InNextstep InPOSIX_BC InShift_JIS
   InShift_JISX0213 InSymbol InVISCII

   EXPORT

   # will import all of them
   use Encode::InCharset;
   # will import only properties in qw()
   use Encode::InCharset qw(In...)

SEE ALSO
 the Encode manpage, the perlunicode manpage

AUTHOR
 Dan Kogai <[EMAIL PROTECTED]>

COPYRIGHT AND LICENSE
 Copyright 2002 by Dan Kogai

 This library is free software; you can redistribute it and/or modify it
 under the same terms as Perl itself.

 See http://www.perl.com/perl/misc/Artistic.html




Re: Change 16302: Provide the \N{U+HHHH} syntax before we forget.

2002-05-02 Thread SADAHIRO Tomoyuki


On Fri, 3 May 2002 02:30:11 +0300
Jarkko Hietaniemi <[EMAIL PROTECTED]> wrote:

> On Thu, May 02, 2002 at 08:01:34AM +0200, Philip Newton wrote:
> > On Wed, 1 May 2002 07:00:05 -0700, [EMAIL PROTECTED] (Jarkko Hietaniemi) wrote:
> > 
> > > Change 16302 by jhi@alpha on 2002/05/01 12:54:24
> > > 
> > >   Provide the \N{U+HHHH} syntax before we forget.
> > 
> > Do we also want to support U-HH? I seem to recall from somewhere
> 
> Hmmm.  One always learns something new... where did you find that format?
> 
> > that U+ went to U+ and that code points beyond that were
> > U- (i.e. U+ form took 4 hex chars and U- form took 8 hex chars,
> > or something like that.)
> > 

U- format is mentioned in Preface, 0.2 Notational Convention,
in Unicode 3.0.

http://www.unicode.org/uni2book/Preface.pdf
http://www.unicode.org/uni2book/u2.html

But Unicode 3.1 extends U+ notation beyond 0x.
cf. http://www.unicode.org/unicode/reports/tr27/

Citation from here
   II Notational Changes for the Standard
  Section 0.2 Notational Conventions, page xxviii:
  change the description of the U+ notation to read:

  In running text, an individual Unicode code point
  can be expressed as U+n, where n is from four to six
  hexadecimal digits, using the digits 0-9 and A-F
  (for 10 through 15, respectively).
  There should be no leading zeros, unless the codepoint
  would have fewer than four hexadecimal digits;
  for example,
U+0001, U+0012, U+0123, U+1234, U+12345, U+102345.
End of citation

Therefore U-0001 is U+1 and U-0010 is U+10.
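
The U+n convention in the citation (at least four hex digits, no extra leading zeros) happens to be what sprintf's %04X produces. A minimal sketch, with a helper name (u_plus) that is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: format a code point in the U+n notation, padding
# to four hex digits and using more only when the value needs them.
sub u_plus {
    my ($cp) = @_;
    return sprintf 'U+%04X', $cp;
}

# The same six code points the citation uses as examples.
print join(', ', map { u_plus($_) } 0x1, 0x12, 0x123, 0x1234, 0x12345, 0x102345), "\n";
```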

Regards,
SADAHIRO Tomoyuki




Re: [Patch] User-defined \p{} more like Camel 3 example

2002-05-02 Thread Jarkko Hietaniemi

On Fri, May 03, 2002 at 08:57:42AM +0900, Dan Kogai wrote:
> jhi,
> 
> I submitted this yesterday, but it seems to have been simply overlooked 
> (I got no positive or negative response), so here we go again.

Already applied, #16354.

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



[Patch] User-defined \p{} more like Camel 3 example

2002-05-02 Thread Dan Kogai

jhi,

I submitted this yesterday, but it seems to have been simply overlooked 
(I got no positive or negative response), so here we go again.

Dan

On Thursday, May 2, 2002, at 12:44 , Dan Kogai wrote:
> On Wednesday, May 1, 2002, at 11:23 , Jarkko Hietaniemi wrote:
>> perlunicode.pod and "User-defined Character Properties" already
>> documents it.  I guess accepting \s+ is okay... but as I said,
>> people shouldn't be doing that by hand (much).
>
> And here is the patch that fixes this.  [ \t]+ is picked instead of \s+ 
> because \s+ is too ambiguous with Unicode (plus it catches \n and \r 
> which it should not).
>
> Since Camel 3 doesn't say anything about what whitespace character(s) 
> (is|are) okay (it merely says "like this" -- cf. p. 173), you should 
> apply this patch for the sake of Camel 3 readers.
>
> $sig =~ /Dan[ \t]+the[ \t]+Perl5[ \t]+Porter/;

> diff -du lib/utf8_heavy.pl.old lib/utf8_heavy.pl
--- lib/utf8_heavy.pl.old   Mon Apr 22 08:29:37 2002
+++ lib/utf8_heavy.pl   Thu May  2 00:29:18 2002
@@ -271,7 +271,7 @@
 }
 else {
   LINE:
-   while (/^([0-9a-fA-F]+)(?:\t([0-9a-fA-F]+))?/mg) {
+   while (/^([0-9a-fA-F]+)(?:[ \t]+([0-9a-fA-F]+))?/mg) {
 my $min = hex $1;
 my $max = (defined $2 ? hex $2 : $min);
 next if $max < $start;
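
To see what the patched pattern accepts, here is a small standalone sketch. It is not the utf8_heavy.pl code; parse_ranges is a made-up helper that only borrows the new regex and the min/max logic from the diff above.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the patched pattern to a property definition
# and return a list of [min, max] pairs (code points as numbers).
sub parse_ranges {
    my ($list) = @_;
    my @ranges;
    while ($list =~ /^([0-9a-fA-F]+)(?:[ \t]+([0-9a-fA-F]+))?/mg) {
        my $min = hex $1;
        my $max = defined $2 ? hex $2 : $min;    # single code point: max == min
        push @ranges, [$min, $max];
    }
    return @ranges;
}

# Tab-separated, space-separated, and single-code-point lines.
my @r = parse_ranges("3040\t309F\n30A0   30FF\n3000\n");
printf "%04X-%04X\n", @$_ for @r;
```

With the old pattern (tab only), the space-separated line would have been read as the single code point 30A0.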




Re: Change 16302: Provide the \N{U+HHHH} syntax before we forget.

2002-05-02 Thread Jarkko Hietaniemi

On Thu, May 02, 2002 at 08:01:34AM +0200, Philip Newton wrote:
> On Wed, 1 May 2002 07:00:05 -0700, [EMAIL PROTECTED] (Jarkko Hietaniemi) wrote:
> 
> > Change 16302 by jhi@alpha on 2002/05/01 12:54:24
> > 
> > Provide the \N{U+HHHH} syntax before we forget.
> 
> Do we also want to support U-HH? I seem to recall from somewhere

Hmmm.  One always learns something new... where did you find that format?

> that U+ went to U+ and that code points beyond that were
> U- (i.e. U+ form took 4 hex chars and U- form took 8 hex chars,
> or something like that.)
> 
> > +return chr hex $1 if $arg =~ /^U\+([0-9a-fA-F]+)$/;
> 
> It would be a simple matter of replacing  \+  with  [-+]  .
> 
> Not world-shaking, just asking a question.
> 
> >  //depot/perl/toke.c#431 (text) 
> > Index: perl/toke.c
> > --- perl/toke.c.~1~ Wed May  1 07:00:05 2002
> > +++ perl/toke.c Wed May  1 07:00:05 2002
> > @@ -1540,6 +1540,16 @@
> > e = s - 1;
> > goto cont_scan;
> > }
> > +   if (e > s + 2 && s[1] == 'U' && s[2] == '+') {
> 
> Oh, I suppose this would have to be changed to '&& (s[2] == '+' || s[2]
> == '-')', too.
> 
> Cheers,
> Philip
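
For illustration only, the Perl-side change Philip suggests (replacing \+ with [-+]) could be sketched as below. The sub name code_point_char is invented here and is not the name used in the actual change, and the toke.c side is not covered.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the one-line pattern quoted above, with
# \+ widened to [-+] so that both U+ and U- prefixes are accepted.
sub code_point_char {
    my ($arg) = @_;
    return chr hex $1 if $arg =~ /^U[-+]([0-9a-fA-F]+)$/;
    return;    # not in U+/U- form
}

printf "%04X\n", ord code_point_char('U+263A');
printf "%04X\n", ord code_point_char('U-0000263A');
```

Both calls yield the same character, since hex ignores the leading zeros of the longer form.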

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



[Encode] 1.67 released

2002-05-02 Thread Dan Kogai

I was wondering how Laszlo was doing when I released Encode 1.67, which is 
available as follows.

Whole:
http://www.dan.co.jp/~dankogai/Encode-1.67.tar.gz and CPAN
Diff against current: 147 lines
http://www.dan.co.jp/~dankogai/current-1.67.diff.gz

And the Changes entry.  As you can see, the changes are just cosmetic.

$Revision: 1.67 $ $Date: 2002/05/02 07:33:09 $
! Encode.xs
   Error message now consistent w/ perlqq (\N{U+} -> \x{})
   done in perl@16308 but Philip linted me further.  Now the error
   messages are macronized as ERR_ENCODE_NOMAP and ERR_DECODE_NOMAP
! lib/Encode/Guess.pm
   Sanity check for happier -w by Autrijus

Dan the Encode Maintainer