Re: Change 16302: Provide the \N{U+HHHH} syntax before we forget.
On Fri, 03 May 2002 11:36:46 +0900, [EMAIL PROTECTED] (Sadahiro Tomoyuki) wrote:

> But Unicode 3.1 extends U+ notation beyond 0x.

Ah! Thanks for the reference.

So maybe that is no longer necessary... by the time 5.8.0 is out, Unicode 3.2 will have been current for a while. Or should we support the older notation?

Cheers,
Philip
Encode-InCharset-0.01 Released
I have just released Encode-InCharset-0.01, available as

  http://www.dan.co.jp/~dankogai/Encode-InCharset-0.01.tar.gz

and CPAN.

I developed this module primarily to implement ISO-2022-JP-3 and ISO-2022-CN in the future. To implement encode() for these, you have to know which character set a given character belongs to. But this module can also be used to check whether a string can safely be encoded (though fallback is much faster).

Dan the Encode Maintainer

NAME
    Encode::InCharset - defines \p{InCharset}

INSTALL
    perl Makefile.PL
    make test && make install

SYNOPSIS
    use Encode::InCharset qw(InJIS0208);
    "I am \x{5c0f}\x{98fc}\x{3000}\x{5f3e}" =~ /(\p{InJIS0208})+/o;
    # guess what is in $1

ABSTRACT
    This module provides the In*Charset* Unicode property, which matches
    characters in *Charset*. As of this writing, the property-matching
    functions are auto-generated out of the ucm files in Encode,
    Encode::HanExtra, and Encode::JIS2K.

DESCRIPTION
    As of this writing, this module supports the character properties
    shown below. Since the names are self-explanatory, I am not going to
    discuss them in detail.
    InASCII InAdobeStandardEncoding InAdobeSymbol InAdobeZdingbat
    InBIG5EXT InBIG5PLUS InBIG5_ETEN InBIG5_HKSCS InCCCII
    InCP1006 InCP1026 InCP1047 InCP1250 InCP1251 InCP1252 InCP1253
    InCP1254 InCP1255 InCP1256 InCP1257 InCP1258 InCP37 InCP424
    InCP437 InCP500 InCP737 InCP775 InCP850 InCP852 InCP855 InCP856
    InCP857 InCP860 InCP861 InCP862 InCP863 InCP864 InCP865 InCP866
    InCP869 InCP874 InCP875 InCP932 InCP936 InCP949 InCP950
    InDingbats InEUC_CN InEUC_JISX0213 InEUC_JP InEUC_KR InEUC_TW
    InGB12345 InGB18030 InGB2312 InGSM0338 InHp_Roman8
    InISO_8859_1 InISO_8859_10 InISO_8859_11 InISO_8859_13
    InISO_8859_14 InISO_8859_15 InISO_8859_16 InISO_8859_2
    InISO_8859_3 InISO_8859_4 InISO_8859_5 InISO_8859_6 InISO_8859_7
    InISO_8859_8 InISO_8859_9 InISO_IR_165
    InJIS0201 InJIS0208 InJIS0212 InJIS0213_1 InJIS0213_2 InJohab
    InKOI8_F InKOI8_R InKOI8_U InKSC5601
    InMacArabic InMacCentralEurRoman InMacChineseSimp InMacChineseTrad
    InMacCroatian InMacCyrillic InMacDingbats InMacFarsi InMacGreek
    InMacHebrew InMacIcelandic InMacJapanese InMacKorean InMacRoman
    InMacRomanian InMacRumanian InMacSami InMacSymbol InMacThai
    InMacTurkish InMacUkrainian
    InNextstep InPOSIX_BC InShift_JIS InShift_JISX0213 InSymbol InVISCII

EXPORT
    # will import all of them
    use Encode::InCharset;
    # will import only properties in qw()
    use Encode::InCharset qw(In...)

SEE ALSO
    the Encode manpage, the perlunicode manpage

AUTHOR
    Dan Kogai <[EMAIL PROTECTED]>

COPYRIGHT AND LICENSE
    Copyright 2002 by Dan Kogai

    This library is free software; you can redistribute it and/or modify
    it under the same terms as Perl itself.

    See http://www.perl.com/perl/misc/Artistic.html
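
For readers unfamiliar with how a \p{In...} property can be supplied from Perl code, here is a minimal sketch of Perl's user-defined property mechanism, which is the kind of thing such a module has to provide. It does not use Encode::InCharset itself: the property InToyKana, its two ranges, and the helper all_in_toy_kana are hypothetical and chosen only for illustration; the module's real tables are generated from ucm files as described above.

```perl
use strict;
use warnings;

# Hypothetical toy "charset" (NOT one of the module's real tables):
# it covers only the Hiragana and Katakana blocks.
# A user-defined property is a sub whose name starts with "In" or "Is"
# and which returns one "MIN<tab>MAX" hex range per line.
sub InToyKana {
    return join '',
        "3040\t309F\n",    # Hiragana
        "30A0\t30FF\n";    # Katakana
}

# Hypothetical helper: true when every character of $str is covered by
# the toy charset, i.e. the string could be encoded without fallback.
sub all_in_toy_kana {
    my ($str) = @_;
    return $str =~ /\A\p{InToyKana}*\z/ ? 1 : 0;
}

print all_in_toy_kana("\x{3042}\x{30A2}") ? "encodable\n" : "needs fallback\n";
print all_in_toy_kana("\x{3042}A")        ? "encodable\n" : "needs fallback\n";
```

The same whole-string pattern, with a real property such as InJIS0208 imported from the module, is presumably how the "can this string safely be encoded" check mentioned in the announcement would be written.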
Re: Change 16302: Provide the \N{U+HHHH} syntax before we forget.
On Fri, 3 May 2002 02:30:11 +0300
Jarkko Hietaniemi <[EMAIL PROTECTED]> wrote:

> On Thu, May 02, 2002 at 08:01:34AM +0200, Philip Newton wrote:
> > On Wed, 1 May 2002 07:00:05 -0700, [EMAIL PROTECTED] (Jarkko Hietaniemi) wrote:
> >
> > > Change 16302 by jhi@alpha on 2002/05/01 12:54:24
> > >
> > > Provide the \N{U+} syntax before we forget.
> >
> > Do we also want to support U-HH? I seem to recall from somewhere
>
> Hmmm. One always learns something new... where did you find that format?
>
> > that U+ went to U+ and that code points beyond that were
> > U- (i.e. U+ form took 4 hex chars and U- form took 8 hex chars,
> > or something like that.)
>

U- format is mentioned in Preface, 0.2 Notational Convention, in Unicode 3.0.

  http://www.unicode.org/uni2book/Preface.pdf
  http://www.unicode.org/uni2book/u2.html

But Unicode 3.1 extends U+ notation beyond 0x.
cf. http://www.unicode.org/unicode/reports/tr27/

Citation from here

  II Notational Changes for the Standard

  Section 0.2 Notational Conventions, page xxviii: change the
  description of the U+ notation to read:

  In running text, an individual Unicode code point can be expressed as
  U+n, where n is from four to six hexadecimal digits, using the digits
  0-9 and A-F (for 10 through 15, respectively). There should be no
  leading zeros, unless the codepoint would have fewer than four
  hexadecimal digits; for example, U+0001, U+0012, U+0123, U+1234,
  U+12345, U+102345.

End of citation

Therefore U-0001 is U+1 and U-0010 is U+10.

Regards,
SADAHIRO Tomoyuki
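
The U+n rule cited above (four to six hex digits, leading zeros only to pad up to four) can be sketched in a couple of lines of Perl. The helper name u_plus is hypothetical, and the upper bound of 0x10FFFF is an added assumption based on the Unicode code space rather than something stated in the citation.

```perl
use strict;
use warnings;

# Hypothetical helper: render a code point in the U+n notation quoted
# above. %04X pads to at least four hex digits and never truncates,
# so larger code points come out with five or six digits.
sub u_plus {
    my ($cp) = @_;
    # Assumption: reject anything outside the Unicode code space.
    die "code point out of range\n" if $cp < 0 || $cp > 0x10FFFF;
    return sprintf 'U+%04X', $cp;
}

# The six code points used as examples in the citation.
print u_plus($_), "\n" for 0x1, 0x12, 0x123, 0x1234, 0x12345, 0x102345;
```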
Re: [Patch] User-defined \p{} more like Camel 3 example
On Fri, May 03, 2002 at 08:57:42AM +0900, Dan Kogai wrote:
> jhi,
>
> I've submitted this yesterday but it seems it gets simply overlooked
> (got no positive or negative response) so here we go again.

Already applied, #16354.

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
[Patch] User-defined \p{} more like Camel 3 example
jhi,

I've submitted this yesterday but it seems it gets simply overlooked (got no positive or negative response) so here we go again.

Dan

On Thursday, May 2, 2002, at 12:44 , Dan Kogai wrote:
> On Wednesday, May 1, 2002, at 11:23 , Jarkko Hietaniemi wrote:
>> perlunicode.pod and "User-defined Character Properties" already
>> documents it. I guess accepting \s+ is okay... but as I said,
>> people shouldn't be doing that by hand (much).
>
> And here is the patch that fixes this. [ \t]+ is picked instead of \s+
> because \s+ is too ambiguous with Unicode (plus it catches \n and \r
> which it should not).
>
> Since Camel 3 doesn't say anything about what whitespace character(s)
> (is|are) okay (it merely says "like this" -- cf. pp. 173), you should
> apply this patch for the sake of Camel 3 readers.
>
> $sig =~ /Dan[ \t]+the[ \t]+Perl5[ \t]+Porter/;
>

diff -du lib/utf8_heavy.pl.old lib/utf8_heavy.pl
--- lib/utf8_heavy.pl.old Mon Apr 22 08:29:37 2002
+++ lib/utf8_heavy.pl Thu May 2 00:29:18 2002
@@ -271,7 +271,7 @@
     }
     else {
       LINE:
-        while (/^([0-9a-fA-F]+)(?:\t([0-9a-fA-F]+))?/mg) {
+        while (/^([0-9a-fA-F]+)(?:[ \t]+([0-9a-fA-F]+))?/mg) {
             my $min = hex $1;
             my $max = (defined $2 ? hex $2 : $min);
             next if $max < $start;
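
To see what the patched pattern accepts, here is a small standalone sketch that applies the same regex to a range list outside of utf8_heavy.pl. The function parse_ranges and the sample input are hypothetical; only the regex and the $min/$max logic are taken from the patch.

```perl
use strict;
use warnings;

# Hypothetical standalone harness around the patched regex: parse a
# user-defined property list into [min, max] pairs.
sub parse_ranges {
    my ($list) = @_;
    my @ranges;
    # Same pattern as the "+" line of the patch: the separator may be
    # any run of spaces and tabs, and the second number is optional.
    while ($list =~ /^([0-9a-fA-F]+)(?:[ \t]+([0-9a-fA-F]+))?/mg) {
        my $min = hex $1;
        my $max = defined $2 ? hex $2 : $min;    # single code point
        push @ranges, [ $min, $max ];
    }
    return @ranges;
}

# One line separated by a space, one by a tab, one single code point.
my @ranges = parse_ranges("0041 005A\n0061\t007A\n00DF\n");
printf "%04X..%04X\n", @$_ for @ranges;
```

With the original \t-only pattern, the first line would still match but only as the single code point 0041, which is the silent misparse the patch is meant to avoid.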
Re: Change 16302: Provide the \N{U+HHHH} syntax before we forget.
On Thu, May 02, 2002 at 08:01:34AM +0200, Philip Newton wrote:
> On Wed, 1 May 2002 07:00:05 -0700, [EMAIL PROTECTED] (Jarkko Hietaniemi) wrote:
>
> > Change 16302 by jhi@alpha on 2002/05/01 12:54:24
> >
> > Provide the \N{U+} syntax before we forget.
>
> Do we also want to support U-HH? I seem to recall from somewhere

Hmmm. One always learns something new... where did you find that format?

> that U+ went to U+ and that code points beyond that were
> U- (i.e. U+ form took 4 hex chars and U- form took 8 hex chars,
> or something like that.)
>
> > +return chr hex $1 if $arg =~ /^U\+([0-9a-fA-F]+)$/;
>
> It would be a simple matter of replacing \+ with [-+] .
>
> Not world-shaking, just asking a question.
>
> > //depot/perl/toke.c#431 (text)
> > Index: perl/toke.c
> > --- perl/toke.c.~1~ Wed May 1 07:00:05 2002
> > +++ perl/toke.c Wed May 1 07:00:05 2002
> > @@ -1540,6 +1540,16 @@
> >  e = s - 1;
> >  goto cont_scan;
> >  }
> > + if (e > s + 2 && s[1] == 'U' && s[2] == '+') {
>
> Oh, I suppose this would have to be changed to '&& (s[2] == '+' || s[2]
> == '-')', too.
>
> Cheers,
> Philip

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
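
Philip's suggestion of widening \+ to [-+] can be tried in isolation with the sketch below. The wrapper char_from_u_notation is hypothetical and only models the quoted one-liner; it is not the actual code from change 16302, and the thread leaves open whether the U- form should be accepted at all.

```perl
use strict;
use warnings;

# Hypothetical wrapper modelled on the quoted one-liner, with \+
# widened to [-+] so that both U+ and U- forms are accepted.
sub char_from_u_notation {
    my ($arg) = @_;
    return chr hex $1 if $arg =~ /^U[-+]([0-9a-fA-F]+)$/;
    return;    # not in U+/U- form: undef in scalar context
}

for my $name ('U+263A', 'U-0000263A', 'SMILEY') {
    my $char = char_from_u_notation($name);
    printf "%-12s => %s\n", $name,
        defined $char ? sprintf('U+%04X', ord $char) : 'no match';
}
```

Because hex ignores leading zeros, the eight-digit U- spelling and the four-digit U+ spelling of the same code point yield the same character.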
[Encode] 1.67 released
I wonder how Laszlo is doing as I release Encode 1.67, available as follows.

Whole:
  http://www.dan.co.jp/~dankogai/Encode-1.67.tar.gz and CPAN
Diff against current: 147 lines
  http://www.dan.co.jp/~dankogai/current-1.67.diff.gz

And Changes. As you see, the changes are just cosmetic.

$Revision: 1.67 $ $Date: 2002/05/02 07:33:09 $
! Encode.xs
  Error message now consistent w/ perlqq (\N{U+} -> \x{}); done in
  perl@16308, but Philip linted me further. Now the error messages are
  macronized as ERR_ENCODE_NOMAP and ERR_DECODE_NOMAP.
! lib/Encode/Guess.pm
  Sanity check for happier -w by Autrijus.

Dan the Encode Maintainer