Re: Change 16308: Encode tweak from Dan Kogai.
On Thursday, May 2, 2002, at 03:03 , Philip Newton wrote:
> On Wed, 1 May 2002 09:45:05 -0700, [EMAIL PROTECTED] (Jarkko Hietaniemi) wrote:
>
>>     if (check & ENCODE_DIE_ON_ERR) {
>>         Perl_croak(
>> -           aTHX_ "\"\\N{U+%" UVxf "}\" does not map to %s",
>> +           aTHX_ "\"\\x{%04" UVxf "}\" does not map to %s",
>>             (UV)ch, enc->name[0]);
>>         return &PL_sv_undef; /* never reaches but be safe */
>>     }
>>     if (check & ENCODE_WARN_ON_ERR){
>>         Perl_warner(aTHX_ packWARN(WARN_UTF8),
>> -           "\"\\N{U+%" UVxf "}\" does not map to %s",
>> +           "\"\\x{%" UVxf "}\" does not map to %s",
>>             (UV)ch, enc->name[0]);
>>     }
>
> Shouldn't the formats match? That is, both '% UVxf' or both '%04 UVxf'?
> (I would probably tend to '%04 UVxf', FWIW, since I consider \x{03c0} to
> be "nicer" than \x{3c0} -- since I'm accustomed to four-char codepoints
> in the Unicode book.)

Right. Will be fixed.

Dan
Re: Change 16308: Encode tweak from Dan Kogai.
On Wed, 1 May 2002 09:45:05 -0700, [EMAIL PROTECTED] (Jarkko Hietaniemi) wrote:
>     if (check & ENCODE_DIE_ON_ERR) {
>         Perl_croak(
> -           aTHX_ "\"\\N{U+%" UVxf "}\" does not map to %s",
> +           aTHX_ "\"\\x{%04" UVxf "}\" does not map to %s",
>             (UV)ch, enc->name[0]);
>         return &PL_sv_undef; /* never reaches but be safe */
>     }
>     if (check & ENCODE_WARN_ON_ERR){
>         Perl_warner(aTHX_ packWARN(WARN_UTF8),
> -           "\"\\N{U+%" UVxf "}\" does not map to %s",
> +           "\"\\x{%" UVxf "}\" does not map to %s",
>             (UV)ch, enc->name[0]);
>     }

Shouldn't the formats match? That is, both '% UVxf' or both '%04 UVxf'? (I would probably tend to '%04 UVxf', FWIW, since I consider \x{03c0} to be "nicer" than \x{3c0} -- since I'm accustomed to four-char codepoints in the Unicode book.)

Cheers,
Philip
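The mismatch Philip points out can be illustrated in Perl rather than in the C of Encode.xs. This is only a sketch: Perl's sprintf with %x and %04x stands in for the UVxf-based C formats, and the two helper names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of Encode: format a code point the way
# each of the two messages in the patch would.
sub escape_unpadded { return sprintf('\x{%x}',   $_[0]) }   # warn branch
sub escape_padded   { return sprintf('\x{%04x}', $_[0]) }   # croak branch

# The two forms differ only for code points below 0x1000.
for my $cp (0x3c0, 0x20ac, 0x1d11e) {
    printf "%-12s %s\n", escape_unpadded($cp), escape_padded($cp);
}
```

Padding to four digits only affects short code points; anything with more than four hex digits is printed in full either way.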
Re: Change 16302: Provide the \N{U+HHHH} syntax before we forget.
On Wed, 1 May 2002 07:00:05 -0700, [EMAIL PROTECTED] (Jarkko Hietaniemi) wrote:
> Change 16302 by jhi@alpha on 2002/05/01 12:54:24
>
> Provide the \N{U+} syntax before we forget.

Do we also want to support U-HH? I seem to recall from somewhere that U+ went to U+ and that code points beyond that were U- (i.e. U+ form took 4 hex chars and U- form took 8 hex chars, or something like that.)

> +return chr hex $1 if $arg =~ /^U\+([0-9a-fA-F]+)$/;

It would be a simple matter of replacing \+ with [-+] . Not world-shaking, just asking a question.

> //depot/perl/toke.c#431 (text)
> Index: perl/toke.c
> --- perl/toke.c.~1~ Wed May 1 07:00:05 2002
> +++ perl/toke.c Wed May 1 07:00:05 2002
> @@ -1540,6 +1540,16 @@
>  e = s - 1;
>  goto cont_scan;
>  }
> + if (e > s + 2 && s[1] == 'U' && s[2] == '+') {

Oh, I suppose this would have to be changed to '&& (s[2] == '+' || s[2] == '-')', too.

Cheers,
Philip
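A minimal sketch of the suggested [-+] variant, written as a standalone function. The function name is hypothetical and this is not the actual charnames code; it only reuses the regex from the quoted change with \+ replaced by [-+].

```perl
use strict;
use warnings;

# Hypothetical stand-in for the lookup quoted above: return the character
# for "U+HHHH" (and, with the proposed change, "U-HHHH"), or nothing when
# the argument is not in that form.
sub char_from_u_notation {
    my ($arg) = @_;
    return chr hex $1 if $arg =~ /^U[-+]([0-9a-fA-F]+)$/;
    return;
}

for my $arg ('U+20ac', 'U-0001D11E', 'EURO SIGN') {
    my $ch = char_from_u_notation($arg);
    printf "%-12s => %s\n", $arg,
        defined $ch ? sprintf('U+%04X', ord $ch) : 'no match';
}
```

Note that this sketch does not enforce four digits for U+ or eight for U-; whether that distinction matters is exactly the open question in the message above.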
Re: [Patch] ext/PerlIO/t/fallback.t gets haircut
> I know NI-XS will fix and enhance this test soon but for the time being
> you can use this for peace of mind.

For the time being, applied.

--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen
Re: Encode, charnames and utf8heavy
On Wednesday, May 1, 2002, at 11:23 , Jarkko Hietaniemi wrote:
> perlunicode.pod and "User-defined Character Properties" already
> documents it. I guess accepting \s+ is okay... but as I said,
> people shouldn't be doing that by hand (much).

And here is the patch that fixes this. [ \t]+ is picked instead of \s+ because \s+ is too ambiguous with Unicode (plus it catches \n and \r which it should not). Since Camel 3 doesn't say anything about what whitespace character(s) (is|are) okay (it merely says "like this" -- cf. pp. 173), you should apply this patch for the sake of Camel 3 readers.

$sig =~ /Dan[ \t]+the[ \t]+Perl5[ \t]+Porter/;

> diff -du lib/utf8_heavy.pl.old lib/utf8_heavy.pl
--- lib/utf8_heavy.pl.old Mon Apr 22 08:29:37 2002
+++ lib/utf8_heavy.pl Thu May 2 00:29:18 2002
@@ -271,7 +271,7 @@
 }
 else {
 LINE:
- while (/^([0-9a-fA-F]+)(?:\t([0-9a-fA-F]+))?/mg) {
+ while (/^([0-9a-fA-F]+)(?:[ \t]+([0-9a-fA-F]+))?/mg) {
 my $min = hex $1;
 my $max = (defined $2 ? hex $2 : $min);
 next if $max < $start;
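To see what the patched pattern accepts, here is a small standalone sketch. The function name and the sample input are invented for illustration; only the regex and the min/max logic are taken from the patch.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the loop in the patch: parse lines of
# "MIN" or "MIN<blanks>MAX" (hex) into [min, max] pairs.
sub parse_ranges {
    my ($text) = @_;
    my @ranges;
    while ($text =~ /^([0-9a-fA-F]+)(?:[ \t]+([0-9a-fA-F]+))?/mg) {
        my $min = hex $1;
        my $max = defined $2 ? hex $2 : $min;
        push @ranges, [$min, $max];
    }
    return @ranges;
}

# One tab, two spaces, and a single code point with no upper bound.
my @r = parse_ranges("0041\t005A\n0061  007A\n00DF\n");
printf "%04X..%04X\n", @$_ for @r;
```

With the old tab-only pattern, the second sample line would still match, but only as the single code point 0061, because the optional group fails without an error. That silent fallback is what makes a stray space so hard to spot.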
[Patch] ext/PerlIO/t/fallback.t gets haircut
jhi,

> A bit of noise from ext/PerlIO/t/fallback.t:
>
> ./perl -Ilib ext/PerlIO/t/fallback.t
> 1..8
> ok 1 - opened iso-8859-1 file
> "\N{U+20ac}" does not map to iso-8859-1 at ext/PerlIO/t/fallback.t line 21.
> ok 2 - perlqq escapes
> ok 3 - opened iso-8859-1 file
> ok 4 - HTML escapes
> ok 5 - Opened as ASCII
> # 5c
> ok 6 - Escaped non-mapped char
> ok 7 - Opened as ASCII
> # fffd
> ok 8 - Unicode replacement char

The following patch will make it this way.

> ./perl -I./lib ext/PerlIO/t/fallback.t
1..9
ok 1 - opened iso-8859-1 file
ok 2 - FB_WARN message
ok 3 - perlqq escapes
ok 4 - opened iso-8859-1 file
ok 5 - HTML escapes
ok 6 - Opened as ASCII
# 5c
ok 7 - Escaped non-mapped char
ok 8 - Opened as ASCII
# fffd
ok 9 - Unicode replacement char

I know NI-XS will fix and enhance this test soon but for the time being you can use this for peace of mind.

Dan the Perl5 Porter

--- ext/PerlIO/t/fallback.t.prev Mon Apr 29 02:10:37 2002
+++ ext/PerlIO/t/fallback.t Thu May 2 00:11:06 2002
@@ -5,7 +5,7 @@
 @INC = '../lib';
 require "../t/test.pl";
 skip_all("No perlio") unless (find PerlIO::Layer 'perlio');
-plan (8);
+plan (9);
 }
 use Encode qw(:fallback_all);
@@ -13,12 +13,16 @@
 my $file = "fallback$$.txt";
-$PerlIO::encoding::fallback = Encode::PERLQQ;
-
-ok(open(my $fh,">encoding(iso-8859-1)",$file),"opened iso-8859-1 file");
-my $str = "\x{20AC}";
-print $fh $str,"0.02\n";
-close($fh);
+{
+my $message = '';
+local $SIG{__WARN__} = sub { $message = $_[0] };
+$PerlIO::encoding::fallback = Encode::PERLQQ;
+ok(open(my $fh,">encoding(iso-8859-1)",$file),"opened iso-8859-1 file");
+my $str = "\x{20AC}";
+print $fh $str,"0.02\n";
+close($fh);
+like($message, qr/does not map to iso-8859-1/o, "FB_WARN message");
+}
 open($fh,$file) || die "File cannot be re-opened";
 my $line = <$fh>;
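The technique the patch relies on, localizing $SIG{__WARN__} so a warning becomes something the test can inspect instead of noise on STDERR, can be shown in isolation. This is a sketch only: the helper name is hypothetical and the warning is raised with a plain warn rather than by Encode.

```perl
use strict;
use warnings;

# Hypothetical helper: run a code block and return the warnings it emitted
# instead of letting them reach STDERR.
sub capture_warnings {
    my ($code) = @_;
    my @messages;
    # The handler is restored automatically when this sub returns.
    local $SIG{__WARN__} = sub { push @messages, $_[0] };
    $code->();
    return @messages;
}

my @w = capture_warnings(
    sub { warn "\"\\x{20ac}\" does not map to iso-8859-1\n" }
);
print "captured: $w[0]";
```

Collecting into an array rather than a single scalar, as the patch does, keeps earlier warnings from being overwritten if the block emits more than one.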
Re: Encode, charnames and utf8heavy
On Wed, May 01, 2002 at 11:19:14PM +0900, Dan Kogai wrote:
> On Wednesday, May 1, 2002, at 11:04 , Jarkko Hietaniemi wrote:
> > Yes, it is. It's hack. (Regexps and a small cache. It *really* sucked

Ooops. So goes my memo...ry. It's not a small cache, it can grow to be really big...

> > without that cache...)
>
> Oh yes. I had to say I almost got a hangover :P

--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen
Re: Encode, charnames and utf8heavy
> > I don't think people should be much writing those definitions by hand.
> > It would be easy to have a more user-friendly interface for that.
>
> At least we should document it is delimited by a single tab (Oh my
> python!) or better yet, replace the \t to \s+ in the regex that parses

Oh my make! (Well, it's not a leading tab...)

> it. I already know where it is so if you accept this idea, I'll send
> you a patch.

perlunicode.pod and "User-defined Character Properties" already documents it. I guess accepting \s+ is okay... but as I said, people shouldn't be doing that by hand (much).

> As for the frequency of definition, don't you see it can be a handy way
> to alias character classes? Who knows how creatively users use the

See above.

> features we add...
>
> >> I would like to make this a 5.8.1 todo of mine.
> >
> > Whatever you try, it will be tested in the 5.9 branch first.
>
> I wonder when the branch will happen

When we stop fiddling with 5.8 :-)

> Dan the Encode Maintainer

--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen
Re: Encode, charnames and utf8heavy
On Wednesday, May 1, 2002, at 11:04 , Jarkko Hietaniemi wrote:
> Yes, it is. It's hack. (Regexps and a small cache. It *really* sucked
> without that cache...)

Oh yes. I had to say I almost got a hangover :P

> (And I just remembered that viacode() returning an undef when there's
> no corresponding name is by design.)

It should stay that way because I want to do something like

charnames::viacode(0x5f3e) or die "Sorry, Unicode Consortium says you are nameless, dan.".

> I don't think people should be much writing those definitions by hand.
> It would be easy to have a more user-friendly interface for that.

At least we should document it is delimited by a single tab (Oh my python!) or better yet, replace the \t to \s+ in the regex that parses it. I already know where it is so if you accept this idea, I'll send you a patch.

As for the frequency of definition, don't you see it can be a handy way to alias character classes? Who knows how creatively users use the features we add...

>> I would like to make this a 5.8.1 todo of mine.
>
> Whatever you try, it will be tested in the 5.9 branch first.

I wonder when the branch will happen

Dan the Encode Maintainer
Re: Encode, charnames and utf8heavy
> Is there anything I should fix before Encode 1.67 ? (ahem, besides djgpp

I think we are in pretty good shape. Unless NI-S finds something evil using Tk...

> which I am still waiting for the news from Laszlo)
>
> Dan the Encode Maintainer

--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen
Re: Encode, charnames and utf8heavy
> Speaking of charnames and utf8heavy, charnames::viacode() is incredibly
> slow (I tried to use it extensively to pretty-comment ucm files. I gave

Yes, it is. It's hack. (Regexps and a small cache. It *really* sucked without that cache...)

(And I just remembered that viacode() returning an undef when there's no corresponding name is by design.)

> up and used quicker and dirtier approach originally by NI-XS) and I
> don't really like how unicore/ is laid out. We can at least make use of

Well, some of it is how Unicode Consortium lays out its files :-)

> AnyDBM_File (the key-value pairs needed there is totally SDBM_File safe
> so we can safely use it!) or if we can spend more memory, Storable.
>
> return <<'END'
> 0
> END
>
> is totally counterintuitive and the whitespace in between must be
> exactly a single '\t' and that sucks (I've been annoyed why my test
> script on InMyOwnDefinition didn't work as expected).

I don't think people should be much writing those definitions by hand. It would be easy to have a more user-friendly interface for that.

> I would like to make this a 5.8.1 todo of mine.

Whatever you try, it will be tested in the 5.9 branch first.

> Dan the Encode Maintainer

--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen
Re: Encode, charnames and utf8heavy
On Wednesday, May 1, 2002, at 10:57 , Dan Kogai wrote:
> Okay, I'll change the error message in the next one so it would say
>
> "\x{abcd}" does not map to iso-8859-1 at ext/PerlIO/t/fallback.t line 21.
>
> Autrijus just sent me a patch so it won't take long.

Done in my repository.

Was

> piconv5.7.3 -c -f utf8 -t ascii t/jisx0201.utf
"\N{U+ff61}" does not map to ascii, 134 at /home/dankogai/lib/perl5/5.7.3/i386-freebsd/Encode.pm line 175, <> line 1.

Is

> bleedperl -Mblib `which piconv5.7.3` -c -f utf8 -t ascii t/jisx0201.utf
"\x{ff61}" does not map to ascii at /usr/home/dankogai/work/Encode/blib/lib/Encode.pm line 175, <> line 1.

Is there anything I should fix before Encode 1.67 ? (ahem, besides djgpp which I am still waiting for the news from Laszlo)

Dan the Encode Maintainer
Re: [Encode] 1.66 Released
> Also, is it intentional that there is no \N{U+} syntax...?

Uhhh. What I meant to ask was "was it intentional to use the \N{U+...} syntax, since currently there is no such syntax". I blame low caffeine levels.

> That was planned at some point but as of there is no such thing:
>
> ../perl -Ilib -Ilib -Mcharnames=:full -e '"\N{U+20ac}"'
> Unknown charname 'U+20ac' at lib/unicore/Name.pl line 1

That being said, there is now such a thing. Or will be as soon as I check in the change.

> Why not just use \x{...}? If that's PERLQQ, that's what
> I would expect?

If you wanted to use \N{}, there's charnames::viacode()

$ ./perl -Ilib -Mcharnames=:full -le 'print "\\N{", charnames::viacode(0x263a), "}"'
\N{WHITE SMILING FACE}
$

though for unnamed ones I think I have to do something (like use \N{U+}):

$ ./perl -Ilib -Mcharnames=:full -le 'print "\\N{", charnames::viacode(0x3040), "}"'
\N{}
$

--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen
Encode, charnames and utf8heavy
On Wednesday, May 1, 2002, at 10:30 , Jarkko Hietaniemi wrote:
> Thanks, upgraded.
>
> A bit of noise from ext/PerlIO/t/fallback.t:
>
> ./perl -Ilib ext/PerlIO/t/fallback.t
> 1..8
> ok 1 - opened iso-8859-1 file
> "\N{U+20ac}" does not map to iso-8859-1 at ext/PerlIO/t/fallback.t line 21.
> ok 2 - perlqq escapes
> ok 3 - opened iso-8859-1 file
> ok 4 - HTML escapes
> ok 5 - Opened as ASCII
> # 5c
> ok 6 - Escaped non-mapped char
> ok 7 - Opened as ASCII
> # fffd
> ok 8 - Unicode replacement char
>
> Also, is it intentional that there is no \N{U+} syntax...?
> That was planned at some point but as of there is no such thing

Okay, I'll change the error message in the next one so it would say

"\x{abcd}" does not map to iso-8859-1 at ext/PerlIO/t/fallback.t line 21.

Autrijus just sent me a patch so it won't take long.

> ./perl -Ilib -Ilib -Mcharnames=:full -e '"\N{U+20ac}"'
> Unknown charname 'U+20ac' at lib/unicore/Name.pl line 1
>
> Why not just use \x{...}? If that's PERLQQ, that's what
> I would expect?

Speaking of charnames and utf8heavy, charnames::viacode() is incredibly slow (I tried to use it extensively to pretty-comment ucm files. I gave up and used quicker and dirtier approach originally by NI-XS) and I don't really like how unicore/ is laid out. We can at least make use of AnyDBM_File (the key-value pairs needed there is totally SDBM_File safe so we can safely use it!) or if we can spend more memory, Storable.

return <<'END'
0
END

is totally counterintuitive and the whitespace in between must be exactly a single '\t' and that sucks (I've been annoyed why my test script on InMyOwnDefinition didn't work as expected). I would like to make this a 5.8.1 todo of mine.

Dan the Encode Maintainer
Re: [Encode] 1.66 Released
On Wed, May 01, 2002 at 02:58:13PM +0900, Dan Kogai wrote:
> My fever is down at last when I released Encode-1.66, available as
> follows;
>
> Whole:
> http://www.dan.co.jp/~dankogai/Encode-1.66.tar.gz or CPAN
> Diff against current: 264 lines
> http://www.dan.co.jp/~dankogai/current-1.66.diff.gz
>
> And $Revision.

Thanks, upgraded.

A bit of noise from ext/PerlIO/t/fallback.t:

../perl -Ilib ext/PerlIO/t/fallback.t
1..8
ok 1 - opened iso-8859-1 file
"\N{U+20ac}" does not map to iso-8859-1 at ext/PerlIO/t/fallback.t line 21.
ok 2 - perlqq escapes
ok 3 - opened iso-8859-1 file
ok 4 - HTML escapes
ok 5 - Opened as ASCII
# 5c
ok 6 - Escaped non-mapped char
ok 7 - Opened as ASCII
# fffd
ok 8 - Unicode replacement char

Also, is it intentional that there is no \N{U+} syntax...? That was planned at some point but as of there is no such thing:

../perl -Ilib -Ilib -Mcharnames=:full -e '"\N{U+20ac}"'
Unknown charname 'U+20ac' at lib/unicore/Name.pl line 1

Why not just use \x{...}? If that's PERLQQ, that's what I would expect?

> Changes: 1.66 $ $Date: 2002/05/01 05:41:06 $
> ! Encode.xs t/fallback.t
>    WARN_ON_ERR no longer assumes RETURN_ON_ERR so you can issue a warning
>    while fallback is in effect. This even came with a welcome side-effect
>    of cleaner code with less nests! Thank you, NI-XS. t/fallback.t is
>    also modified to test this.
>    And of course, the corresponding variables to UV[Xx]f are appropriately
>    cast. This should've concluded NI-XS homework.
> ! Encode.pm
>    encode(undef) does warn again! Repented upon suggestion by NI-XS.
>    Document for unless vs. '' added
>    Message-Id: <[EMAIL PROTECTED]>
>
> As you see, this is a NI-XS homework issue. Now I have only djgpp to
> left (I think. djgpp is just s slow on my env.)
>
> Dan the Encode Maintainer

--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen
Re: [PATCH] Let Guess.pm handles uninitialized argument.
On Wednesday, May 1, 2002, at 09:19 , Autrijus Tang wrote:
> This way is self-descriptory; it makes -w happier. :)
>
> /Autrijus/

XieXie. Applied.

Dan the Encode Maintainer
[PATCH] Let Guess.pm handles uninitialized argument.
This way is self-descriptory; it makes -w happier. :)

/Autrijus/

--- /home/autrijus/perl/ext/Encode/lib/Encode/Guess.pm Fri Apr 26 11:40:12 2002
+++ /usr/local/lib/perl5/site_perl/5.7.3/i386-freebsd-thread-multi/Encode/Guess.pm Wed May 1 19:34:06 2002
@@ -69,16 +69,20 @@
 my $class = shift;
 my $obj = ref($class) ? $class : $Encode::Encoding{$Canon};
 my $octet = shift;
+
+# sanity check
+return unless defined $octet and length $octet;
+
 # cheat 0: utf8 flag;
 Encode::is_utf8($octet) and return find_encoding('utf8');
 # cheat 1: BOM
 use Encode::Unicode;
 my $BOM = unpack('n', $octet);
 return find_encoding('UTF-16')
- if ($BOM == 0xFeFF or $BOM == 0xFFFe);
+ if (defined $BOM and ($BOM == 0xFeFF or $BOM == 0xFFFe));
 $BOM = unpack('N', $octet);
 return find_encoding('UTF-32')
- if ($BOM == 0xFeFF or $BOM == 0xFFFe);
+ if (defined $BOM and ($BOM == 0xFeFF or $BOM == 0xFFFe));
 my %try = %{$obj->{Suspects}};
 for my $c (@_){
RE: Encode should stay undefphobia
From: Nick Ing-Simmons [mailto:[EMAIL PROTECTED]]

> Paul Marquess <[EMAIL PROTECTED]> writes:
> >Good catch Nick.
> >
> >Instead of completely backing out the "defined $str or return" change, if
> >you change it to
> >
> >    unless (defined $str) {
> >        warnif('uninitialized', 'Use of Uninitialized value in encode_utf8');
> >        return;
> >    }
> >
> >that gives us the same warning behaviour as print/tr/etc, but more
> >importantly it also gives users of the module the ability to silence the
> >uninitialized warning in the same way they do with print/tr, thus:
> >
> >    use warnings;
> >    ...
> >    {
> >        no warnings 'uninitialized';
> >        Encode::encode_utf8($x);
> >    }
>
> But surely the warning we get now is (as a core warning) already so
> controlled ?

The warning can be controlled if you place a "no warnings" in the scope where the warning is generated. In the case above, that is *inside* the encode_utf8 function. The setting of the warnings pragma in the block that calls encode_utf8 function doesn't leak into the Encode function.

That's where warnings::warnif comes in. It checks to see if the warning is enabled in the calling module. This allows module authors to give users of their module the control over what warnings are generated.

Without adding the warnif calls to the code, the only way you can silence the warning is

    { local $^W = 0 ; Encode::encode_utf8($x); }

and that only works if the function being called isn't itself under the control of the warnings pragma. So for example

    sub xxx { use warnings ; my $a =~ tr/A/a/; }
    { local $^W = 0 ; xxx(); }

still generates the "Use of uninitialized value" warning.

I see that Encode does make use of the warnings pragma in places, so I'm not sure if the "local $^W = 0" trick can be used with it.

> And can we not enhance the message generator to fish the name out
> of somewhere so that is says "Use of undefined in subroutine encode_utf8"
> rather than just "subroutine entry" ?

That would be worth doing regardless.

Paul
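The caller-controlled behaviour described above can be sketched in a single file. The package My::Encoder and its encode_it function are hypothetical stand-ins for Encode and encode_utf8; the only real API used is warnings::warnif, which decides whether to warn from the warnings settings in effect at the call site in the calling package.

```perl
use strict;
use warnings;

package My::Encoder;    # hypothetical module, standing in for Encode

sub encode_it {
    my ($str) = @_;
    unless (defined $str) {
        # Warn only if the *caller* has 'uninitialized' warnings enabled.
        warnings::warnif('uninitialized',
            'Use of uninitialized value in encode_it');
        return;
    }
    return "<$str>";
}

package main;

# Collect warnings so the effect of each call can be counted.
my @seen;
local $SIG{__WARN__} = sub { push @seen, $_[0] };

{
    no warnings 'uninitialized';
    My::Encoder::encode_it(undef);    # silenced by the caller's scope
}
my $silenced = @seen;

My::Encoder::encode_it(undef);        # caller has warnings on here
print "warnings while silenced: $silenced, afterwards: ",
    scalar(@seen), "\n";
```

The point of the design is that the module author writes the check once and each caller decides, lexically, whether the warning is wanted, without resorting to the global $^W.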