Re: Change 16308: Encode tweak from Dan Kogai.

2002-05-01 Thread Dan Kogai

On Thursday, May 2, 2002, at 03:03 , Philip Newton wrote:
> On Wed, 1 May 2002 09:45:05 -0700, [EMAIL PROTECTED] (Jarkko Hietaniemi) wrote:
>
>>  if (check & ENCODE_DIE_ON_ERR) {
>>  Perl_croak(
>> -aTHX_ "\"\\N{U+%" UVxf "}\" does not map to %s",
>> +   aTHX_ "\"\\x{%04" UVxf "}\" does not map to %s",
>>  (UV)ch, enc->name[0]);
>>  return &PL_sv_undef; /* never reaches but be safe */
>>  }
>>  if (check & ENCODE_WARN_ON_ERR){
>>  Perl_warner(aTHX_ packWARN(WARN_UTF8),
>> -"\"\\N{U+%" UVxf "}\" does not map to %s",
>> +   "\"\\x{%" UVxf "}\" does not map to %s",
>>  (UV)ch, enc->name[0]);
>>  }
>
> Shouldn't the formats match? That is, both '% UVxf' or both '%04 UVxf'?
> (I would probably tend to '%04 UVxf', FWIW, since I consider \x{03c0} to
> be "nicer" than \x{3c0} -- since I'm accustomed to four-char codepoints
> in the Unicode book.)

Right.  Will be fixed.

Dan




Re: Change 16308: Encode tweak from Dan Kogai.

2002-05-01 Thread Philip Newton

On Wed, 1 May 2002 09:45:05 -0700, [EMAIL PROTECTED] (Jarkko Hietaniemi) wrote:

>   if (check & ENCODE_DIE_ON_ERR) {
>   Perl_croak(
> - aTHX_ "\"\\N{U+%" UVxf "}\" does not map to %s",
> +   aTHX_ "\"\\x{%04" UVxf "}\" does not map to %s",
>   (UV)ch, enc->name[0]);
>   return &PL_sv_undef; /* never reaches but be safe */
>   }
>   if (check & ENCODE_WARN_ON_ERR){
>   Perl_warner(aTHX_ packWARN(WARN_UTF8),
> - "\"\\N{U+%" UVxf "}\" does not map to %s",
> +   "\"\\x{%" UVxf "}\" does not map to %s",
>   (UV)ch, enc->name[0]);
>   }

Shouldn't the formats match? That is, both '% UVxf' or both '%04 UVxf'?
(I would probably tend to '%04 UVxf', FWIW, since I consider \x{03c0} to
be "nicer" than \x{3c0} -- since I'm accustomed to four-char codepoints
in the Unicode book.)
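
The difference between the two formats can be seen with a minimal Perl sketch. It is only an illustration: it uses Perl's own sprintf with a plain %x in place of the C-level UVxf macro (an assumption so that it runs outside the XS code), and the helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers: format a code point the two ways the patch does.
sub fmt_padded   { return sprintf('\x{%04x}', $_[0]) }   # zero-padded to at least 4 hex digits
sub fmt_unpadded { return sprintf('\x{%x}',   $_[0]) }   # no padding

print fmt_padded(0x3c0),   "\n";   # \x{03c0}
print fmt_unpadded(0x3c0), "\n";   # \x{3c0}
print fmt_padded(0x1d11e), "\n";   # \x{1d11e} (padding never truncates)
```

With %04x the croak and warn messages would agree for every code point, which is the consistency being asked for here.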

Cheers,
Philip



Re: Change 16302: Provide the \N{U+HHHH} syntax before we forget.

2002-05-01 Thread Philip Newton

On Wed, 1 May 2002 07:00:05 -0700, [EMAIL PROTECTED] (Jarkko Hietaniemi) wrote:

> Change 16302 by jhi@alpha on 2002/05/01 12:54:24
> 
>   Provide the \N{U+HHHH} syntax before we forget.

Do we also want to support U-HHHHHHHH? I seem to recall from somewhere
that U+HHHH went up to U+FFFF and that code points beyond that were
U-HHHHHHHH (i.e. U+ form took 4 hex chars and U- form took 8 hex chars,
or something like that.)

> +return chr hex $1 if $arg =~ /^U\+([0-9a-fA-F]+)$/;

It would be a simple matter of replacing  \+  with  [-+]  .

Not world-shaking, just asking a question.
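
As a rough sketch of that one-character change, assuming a standalone helper rather than the real charnames code (the name codepoint_from_name is hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper, not the actual charnames.pm code: accepts both the
# U+HHHH form and the U-HHHHHHHH form by matching [-+] instead of \+.
sub codepoint_from_name {
    my ($arg) = @_;
    return chr hex $1 if $arg =~ /^U[-+]([0-9a-fA-F]+)$/;
    return;    # not a U+/U- style name
}

printf "%04X\n", ord codepoint_from_name('U+20ac');       # 20AC
printf "%04X\n", ord codepoint_from_name('U-0001D11E');   # 1D11E
```

Note that this sketch does not enforce the 4-digit or 8-digit lengths; it only shows the separator change.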

>  //depot/perl/toke.c#431 (text) 
> Index: perl/toke.c
> --- perl/toke.c.~1~   Wed May  1 07:00:05 2002
> +++ perl/toke.c   Wed May  1 07:00:05 2002
> @@ -1540,6 +1540,16 @@
>   e = s - 1;
>   goto cont_scan;
>   }
> + if (e > s + 2 && s[1] == 'U' && s[2] == '+') {

Oh, I suppose this would have to be changed to '&& (s[2] == '+' || s[2]
== '-')', too.

Cheers,
Philip



Re: [Patch] ext/PerlIO/t/fallback.t gets haircut

2002-05-01 Thread Jarkko Hietaniemi

> I know NI-XS will fix and enhance this test soon but for the time being 
> you can use this for peace of mind.

For the time being, applied.

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: Encode, charnames and utf8heavy

2002-05-01 Thread Dan Kogai

On Wednesday, May 1, 2002, at 11:23 , Jarkko Hietaniemi wrote:
> perlunicode.pod and "User-defined Character Properties" already
> documents it.  I guess accepting \s+ is okay... but as I said,
> people shouldn't be doing that by hand (much).

And here is the patch that fixes this.  [ \t]+ is picked instead of \s+ 
because \s+ is too ambiguous with Unicode (plus it catches \n and \r 
which it should not).

Since Camel 3 doesn't say anything about what whitespace character(s) 
(is|are) okay (it merely says "like this" -- cf. p. 173), you should 
apply this patch for the sake of Camel 3 readers.

$sig =~ /Dan[ \t]+the[ \t]+Perl5[ \t]+Porter/;

> diff -du lib/utf8_heavy.pl.old lib/utf8_heavy.pl
--- lib/utf8_heavy.pl.old   Mon Apr 22 08:29:37 2002
+++ lib/utf8_heavy.pl   Thu May  2 00:29:18 2002
@@ -271,7 +271,7 @@
 }
 else {
   LINE:
-   while (/^([0-9a-fA-F]+)(?:\t([0-9a-fA-F]+))?/mg) {
+   while (/^([0-9a-fA-F]+)(?:[ \t]+([0-9a-fA-F]+))?/mg) {
 my $min = hex $1;
 my $max = (defined $2 ? hex $2 : $min);
 next if $max < $start;
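
The reason for choosing [ \t]+ over \s+ can be shown with a small sketch. The parse_ranges helper and the sample list are hypothetical; only the regex shape is taken from the patch above.

```perl
use strict;
use warnings;

# Hypothetical helper: parse a user-defined property list into [min, max]
# pairs, using whichever separator pattern is passed in.
sub parse_ranges {
    my ($list, $sep) = @_;
    my @ranges;
    while ($list =~ /^([0-9a-fA-F]+)(?:${sep}([0-9a-fA-F]+))?/mg) {
        my $min = hex $1;
        my $max = defined $2 ? hex $2 : $min;
        push @ranges, [$min, $max];
    }
    return @ranges;
}

# One range separated by two spaces, then two single code points.
my $list = "0041  005A\n00DF\n0100\n";

my @strict = parse_ranges($list, qr/[ \t]+/);   # three entries
my @loose  = parse_ranges($list, qr/\s+/);      # two entries: \s+ eats the newline

printf "%d ranges with [ \\t]+, %d ranges with \\s+\n", scalar @strict, scalar @loose;
```

With \s+ the newline after 00DF is treated as a separator, so 00DF and 0100 are silently merged into one range 00DF..0100.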




[Patch] ext/PerlIO/t/fallback.t gets haircut

2002-05-01 Thread Dan Kogai

jhi,

> A bit of noise from ext/PerlIO/t/fallback.t:
>
> ./perl -Ilib ext/PerlIO/t/fallback.t
> 1..8
> ok 1 - opened iso-8859-1 file
> "\N{U+20ac}" does not map to iso-8859-1 at ext/PerlIO/t/fallback.t line 21.
> ok 2 - perlqq escapes
> ok 3 - opened iso-8859-1 file
> ok 4 - HTML escapes
> ok 5 - Opened as ASCII
> # 5c
> ok 6 - Escaped non-mapped char
> ok 7 - Opened as ASCII
> # fffd
> ok 8 - Unicode replacement char

The following patch will make it this way.

 > ./perl -I./lib ext/PerlIO/t/fallback.t
1..9
ok 1 - opened iso-8859-1 file
ok 2 - FB_WARN message
ok 3 - perlqq escapes
ok 4 - opened iso-8859-1 file
ok 5 - HTML escapes
ok 6 - Opened as ASCII
# 5c
ok 7 - Escaped non-mapped char
ok 8 - Opened as ASCII
# fffd
ok 9 - Unicode replacement char

I know NI-XS will fix and enhance this test soon but for the time being 
you can use this for peace of mind.

Dan the Perl5 Porter

--- ext/PerlIO/t/fallback.t.prev Mon Apr 29 02:10:37 2002
+++ ext/PerlIO/t/fallback.t Thu May  2 00:11:06 2002
@@ -5,7 +5,7 @@
  @INC = '../lib';
  require "../t/test.pl";
  skip_all("No perlio") unless (find PerlIO::Layer 'perlio');
-plan (8);
+plan (9);
  }
  use Encode qw(:fallback_all);

@@ -13,12 +13,16 @@

  my $file = "fallback$$.txt";

-$PerlIO::encoding::fallback = Encode::PERLQQ;
-
-ok(open(my $fh,">encoding(iso-8859-1)",$file),"opened iso-8859-1 file");
-my $str = "\x{20AC}";
-print $fh $str,"0.02\n";
-close($fh);
+{
+my $message = '';
+local $SIG{__WARN__} = sub { $message = $_[0] };
+$PerlIO::encoding::fallback = Encode::PERLQQ;
+ok(open(my $fh,">encoding(iso-8859-1)",$file),"opened iso-8859-1 file");
+my $str = "\x{20AC}";
+print $fh $str,"0.02\n";
+close($fh);
+like($message, qr/does not map to iso-8859-1/o, "FB_WARN message");
+}

  open($fh,$file) || die "File cannot be re-opened";
  my $line = <$fh>;
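
The warning-capture idiom used in the patch can be tried on its own. This is a minimal sketch that calls Encode::encode with FB_WARN directly instead of going through the :encoding PerlIO layer as the test does, so it swaps in a simpler code path; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();

my $message = '';
my $bytes;
{
    # Route warnings into $message for the duration of this block only.
    local $SIG{__WARN__} = sub { $message = $_[0] };
    # U+20AC (EURO SIGN) has no mapping in iso-8859-1, so FB_WARN warns.
    $bytes = Encode::encode('iso-8859-1', "\x{20AC}0.02\n", Encode::FB_WARN());
}

print "captured: $message" if $message =~ /does not map to iso-8859-1/;
```

Because the handler is installed with local, the previous warning behaviour is restored as soon as the block ends.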




Re: Encode, charnames and utf8heavy

2002-05-01 Thread Jarkko Hietaniemi

On Wed, May 01, 2002 at 11:19:14PM +0900, Dan Kogai wrote:
> On Wednesday, May 1, 2002, at 11:04 , Jarkko Hietaniemi wrote:
> > Yes, it is.  It's a hack.  (Regexps and a small cache.  It *really* sucked

Ooops.  So goes my memo...ry.  It's not a small cache, it can grow
to be really big...

> > without that cache...)
> 
> Oh yes.  I had to say I almost got a hangover :P

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: Encode, charnames and utf8heavy

2002-05-01 Thread Jarkko Hietaniemi

> > I don't think people should be much writing those definitions by hand.
> > It would be easy to have a more user-friendly interface for that.
> 
> At least we should document it is delimited by a single tab (Oh my 
> python!) or better yet, replace the \t to \s+ in the regex that parses 

Oh my make!  (Well, it's not a leading tab...)

> it.  I already know where it is so if you accept this idea, I'll send 
> you a patch.

perlunicode.pod and "User-defined Character Properties" already
documents it.  I guess accepting \s+ is okay... but as I said,
people shouldn't be doing that by hand (much).
 
> As for the frequency of definition, don't you see it can be a handy way 
> to alias character classes?  Who knows how creatively users use the 

See above.

> features we add...
> 
> >> I would like to make this a 5.8.1 todo of mine.
> >
> > Whatever you try, it will be tested in the 5.9 branch first.
> 
> I wonder when the branch will happen

When we stop fiddling with 5.8 :-)

> Dan the Encode Maintainer

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: Encode, charnames and utf8heavy

2002-05-01 Thread Dan Kogai

On Wednesday, May 1, 2002, at 11:04 , Jarkko Hietaniemi wrote:
> Yes, it is.  It's a hack.  (Regexps and a small cache.  It *really* sucked
> without that cache...)

Oh yes.  I had to say I almost got a hangover :P

> (And I just remembered that viacode() returning an undef when there's
> no corresponding name is by design.)

It should stay that way because I want to do something like

charnames::viacode(0x5f3e)
or die "Sorry, Unicode Consortium says you are nameless, dan.".

> I don't think people should be much writing those definitions by hand.
> It would be easy to have a more user-friendly interface for that.

At least we should document it is delimited by a single tab (Oh my 
python!) or better yet, replace the \t to \s+ in the regex that parses 
it.  I already know where it is so if you accept this idea, I'll send 
you a patch.

As for the frequency of definition, don't you see it can be a handy way 
to alias character classes?  Who knows how creatively users use the 
features we add...

>> I would like to make this a 5.8.1 todo of mine.
>
> Whatever you try, it will be tested in the 5.9 branch first.

I wonder when the branch will happen

Dan the Encode Maintainer




Re: Encode, charnames and utf8heavy

2002-05-01 Thread Jarkko Hietaniemi

> Is there anything I should fix before Encode 1.67 ? (ahem, besides djgpp 

I think we are in pretty good shape.  Unless NI-S finds something evil
using Tk...

> which I am still waiting for the news from Laszlo)
> 
> Dan the Encode Maintainer

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: Encode, charnames and utf8heavy

2002-05-01 Thread Jarkko Hietaniemi

> Speaking of charnames and utf8heavy, charnames::viacode() is incredibly 
> slow (I tried to use it extensively to pretty-comment ucm files.  I gave 

Yes, it is.  It's a hack.  (Regexps and a small cache.  It *really* sucked
without that cache...)

(And I just remembered that viacode() returning an undef when there's
no corresponding name is by design.)

> up and used quicker and dirtier approach originally by NI-XS) and I 
> don't really like how unicore/ is laid out.  We can at least make use of 

Well, some of it is how Unicode Consortium lays out its files :-)

> AnyDBM_File (the key-value pairs needed there is totally SDBM_File safe 
> so we can safely use it!) or if we can spend more memory, Storable.
> 
> return <<'END'
> 0 
> END
> 
> is totally counterintuitive and the whitespace in between must be 
> exactly a single '\t' and that sucks (I've been annoyed why my test 
> script on InMyOwnDefinition didn't work as expected).

I don't think people should be much writing those definitions by hand.
It would be easy to have a more user-friendly interface for that.

> I would like to make this a 5.8.1 todo of mine.

Whatever you try, it will be tested in the 5.9 branch first.

> Dan the Encode Maintainer

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: Encode, charnames and utf8heavy

2002-05-01 Thread Dan Kogai

On Wednesday, May 1, 2002, at 10:57 , Dan Kogai wrote:
> Okay,  I'll change the error message in the next one so it would say
>
> "\x{abcd}" does not map to iso-8859-1 at ext/PerlIO/t/fallback.t line 21.
>
> Autrijus just sent me a patch so it won't take long.

Done in my repository.

Was
> > piconv5.7.3 -c -f utf8 -t ascii t/jisx0201.utf
> "\N{U+ff61}" does not map to ascii, 134 at /home/dankogai/lib/perl5/5.7.3/i386-freebsd/Encode.pm line 175, <> line 1.

Is
> > bleedperl -Mblib `which piconv5.7.3` -c -f utf8 -t ascii t/jisx0201.utf
> "\x{ff61}" does not map to ascii at /usr/home/dankogai/work/Encode/blib/lib/Encode.pm line 175, <> line 1.

Is there anything I should fix before Encode 1.67 ? (ahem, besides djgpp 
which I am still waiting for the news from Laszlo)

Dan the Encode Maintainer




Re: [Encode] 1.66 Released

2002-05-01 Thread Jarkko Hietaniemi

> Also, is it intentional that there is no \N{U+} syntax...?

Uhhh.  What I meant to ask was "was it intentional to use the
\N{U+...} syntax, since currently there is no such syntax".  I blame
low caffeine levels.

> That was planned at some point but as of now there is no such thing:
> 
> ../perl -Ilib -Ilib -Mcharnames=:full -e '"\N{U+20ac}"'
> Unknown charname 'U+20ac' at lib/unicore/Name.pl line 1

That being said, there is now such a thing.  Or will be as soon
as I check in the change.

> Why not just use \x{...}?  If that's PERLQQ, that's what
> I would expect?

If you wanted to use \N{}, there's charnames::viacode()

$ ./perl -Ilib -Mcharnames=:full -le 'print "\\N{", charnames::viacode(0x263a), "}"'
\N{WHITE SMILING FACE}
$ 

though for unnamed ones I think I have to do something (like use \N{U+}):

$ ./perl -Ilib -Mcharnames=:full -le 'print "\\N{", charnames::viacode(0x3040), "}"'
\N{}
$ 
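
One possible shape for that fallback, sketched under the assumption that unnamed code points should be rendered in the U+HHHH form; the helper name name_or_hex is hypothetical and this is not how Encode itself formats its messages.

```perl
use strict;
use warnings;
use charnames ':full';

# Hypothetical helper: prefer the character name, fall back to U+HHHH
# when charnames::viacode() returns undef.
sub name_or_hex {
    my ($cp) = @_;
    my $name = charnames::viacode($cp);
    return defined $name
        ? '\N{' . $name . '}'
        : sprintf('\N{U+%04X}', $cp);
}

print name_or_hex(0x263a), "\n";   # \N{WHITE SMILING FACE}
print name_or_hex(0x3040), "\n";   # falls back to the U+ form if no name is known
```

This relies on viacode() returning undef for nameless code points, so the empty \N{} shown above never appears.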


-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Encode, charnames and utf8heavy

2002-05-01 Thread Dan Kogai

On Wednesday, May 1, 2002, at 10:30 , Jarkko Hietaniemi wrote:
> Thanks, upgraded.
>
> A bit of noise from ext/PerlIO/t/fallback.t:
>
> ./perl -Ilib ext/PerlIO/t/fallback.t
> 1..8
> ok 1 - opened iso-8859-1 file
> "\N{U+20ac}" does not map to iso-8859-1 at ext/PerlIO/t/fallback.t line 21.
> ok 2 - perlqq escapes
> ok 3 - opened iso-8859-1 file
> ok 4 - HTML escapes
> ok 5 - Opened as ASCII
> # 5c
> ok 6 - Escaped non-mapped char
> ok 7 - Opened as ASCII
> # fffd
> ok 8 - Unicode replacement char
>
> Also, is it intentional that there is no \N{U+} syntax...?
> That was planned at some point but as of now there is no such thing

Okay,  I'll change the error message in the next one so it would say

"\x{abcd}" does not map to iso-8859-1 at ext/PerlIO/t/fallback.t line 21.

Autrijus just sent me a patch so it won't take long.

> ./perl -Ilib -Ilib -Mcharnames=:full -e '"\N{U+20ac}"'
> Unknown charname 'U+20ac' at lib/unicore/Name.pl line 1
>
> Why not just use \x{...}?  If that's PERLQQ, that's what
> I would expect?

Speaking of charnames and utf8heavy, charnames::viacode() is incredibly 
slow (I tried to use it extensively to pretty-comment ucm files.  I gave 
up and used quicker and dirtier approach originally by NI-XS) and I 
don't really like how unicore/ is laid out.  We can at least make use of 
AnyDBM_File (the key-value pairs needed there is totally SDBM_File safe 
so we can safely use it!) or if we can spend more memory, Storable.

return <<'END'
0   
END

is totally counterintuitive and the whitespace in between must be 
exactly a single '\t' and that sucks (I've been annoyed why my test 
script on InMyOwnDefinition didn't work as expected).

I would like to make this a 5.8.1 todo of mine.

Dan the Encode Maintainer




Re: [Encode] 1.66 Released

2002-05-01 Thread Jarkko Hietaniemi

On Wed, May 01, 2002 at 02:58:13PM +0900, Dan Kogai wrote:
> My fever is down at last when I released Encode-1.66, available as 
> follows;
> 
> Whole:
>   http://www.dan.co.jp/~dankogai/Encode-1.66.tar.gz or CPAN
> Diff against current: 264 lines
>   http://www.dan.co.jp/~dankogai/current-1.66.diff.gz
> 
> And $Revision.

Thanks, upgraded.

A bit of noise from ext/PerlIO/t/fallback.t:

../perl -Ilib ext/PerlIO/t/fallback.t
1..8
ok 1 - opened iso-8859-1 file
"\N{U+20ac}" does not map to iso-8859-1 at ext/PerlIO/t/fallback.t line 21.
ok 2 - perlqq escapes
ok 3 - opened iso-8859-1 file
ok 4 - HTML escapes
ok 5 - Opened as ASCII
# 5c
ok 6 - Escaped non-mapped char
ok 7 - Opened as ASCII
# fffd
ok 8 - Unicode replacement char

Also, is it intentional that there is no \N{U+} syntax...?
That was planned at some point but as of now there is no such thing:

../perl -Ilib -Ilib -Mcharnames=:full -e '"\N{U+20ac}"'
Unknown charname 'U+20ac' at lib/unicore/Name.pl line 1

Why not just use \x{...}?  If that's PERLQQ, that's what
I would expect?

> Changes: 1.66 $ $Date: 2002/05/01 05:41:06 $
> ! Encode.xs t/fallback.t
>WARN_ON_ERR no longer assumes RETURN_ON_ERR so you can issue a warning
>while fallback is in effect.  This even came with a welcome side-effect
>of cleaner code with less nests!  Thank you, NI-XS.  t/fallback.t is
>also modified to test this.
>And of course, the corresponding variables to UV[Xx]f are appropriately
>cast.  This should've concluded NI-XS homework.
> ! Encode.pm
>encode(undef) does warn again!  Repented upon suggestion by NI-XS.
>Document for unless vs. '' added
>Message-Id: <[EMAIL PROTECTED]>
> 
> As you see, this is a NI-XS homework issue.  Now I have only djgpp
> left (I think.  djgpp is just so slow on my env.)
> 
> Dan the Encode Maintainer

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: [PATCH] Let Guess.pm handles uninitialized argument.

2002-05-01 Thread Dan Kogai

On Wednesday, May 1, 2002, at 09:19 , Autrijus Tang wrote:
> This way is self-descriptory; it makes -w happier. :)
>
> /Autrijus/

XieXie.  Applied.

Dan the Encode Maintainer




[PATCH] Let Guess.pm handles uninitialized argument.

2002-05-01 Thread Autrijus Tang

This way is self-descriptory; it makes -w happier. :)

/Autrijus/

--- /home/autrijus/perl/ext/Encode/lib/Encode/Guess.pm  Fri Apr 26 11:40:12 2002
+++ /usr/local/lib/perl5/site_perl/5.7.3/i386-freebsd-thread-multi/Encode/Guess.pm Wed May  1 19:34:06 2002
@@ -69,16 +69,20 @@
 my $class = shift;
 my $obj   = ref($class) ? $class : $Encode::Encoding{$Canon};
 my $octet = shift;
+
+# sanity check
+return unless defined $octet and length $octet;
+
 # cheat 0: utf8 flag;
 Encode::is_utf8($octet) and return find_encoding('utf8');
 # cheat 1: BOM
 use Encode::Unicode;
 my $BOM = unpack('n', $octet);
 return find_encoding('UTF-16') 
-   if ($BOM == 0xFeFF or $BOM == 0xFFFe);
+   if (defined $BOM and ($BOM == 0xFeFF or $BOM == 0xFFFe));
 $BOM = unpack('N', $octet);
 return find_encoding('UTF-32') 
-   if ($BOM == 0xFeFF or $BOM == 0xFFFe);
+   if (defined $BOM and ($BOM == 0xFeFF or $BOM == 0xFFFe));
 
 my %try =  %{$obj->{Suspects}};
 for my $c (@_){





RE: Encode should stay undefphobia

2002-05-01 Thread Paul Marquess

From: Nick Ing-Simmons [mailto:[EMAIL PROTECTED]]
 
> Paul Marquess <[EMAIL PROTECTED]> writes:
> >Good catch Nick.
> >
> >Instead of completely backing out the "defined $str or return" change, if
> >you change it to
> >
> >   unless (defined $str) {
> > warnif('uninitialized', 'Use of Uninitialized value in encode_utf8');
> > return;
> >   }
> >
> >that gives us the same warning behaviour as print/tr/etc, but more
> >importantly it also gives users of the module the ability to silence the
> >uninitialized warning in the same way they do with print/tr, thus:
> >
> >  use warnings;
> >  ...
> >  {
> >no warnings 'uninitialized';
> >Encode::encode_utf8($x);
> >  }
> 
> But surely the warning we get now is (as a core warning) already so 
> controlled ? 

The warning can be controlled if you place a "no warnings" in the scope where the 
warning is generated. In the case above, that is *inside* the encode_utf8 function.

The setting of the warnings pragma in the block that calls encode_utf8 function 
doesn't leak into the Encode function. 

That's where warnings::warnif comes in. It checks to see if the warning is enabled in 
the calling module. This allows module authors to give users of their module the 
control over what warnings are generated.
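
A self-contained sketch of that behaviour follows. The package My::Codec and the function encode_or_warn are hypothetical stand-ins, not Encode's actual code; only warnings::warnif and the 'uninitialized' category are real.

```perl
use strict;
use warnings;

package My::Codec;    # hypothetical module standing in for Encode

sub encode_or_warn {
    my ($str) = @_;
    unless (defined $str) {
        # Warns only if 'uninitialized' is enabled where we were called from.
        warnings::warnif('uninitialized',
            'Use of uninitialized value in encode_or_warn');
        return;
    }
    return uc $str;    # placeholder for real encoding work
}

package main;

my @seen;
local $SIG{__WARN__} = sub { push @seen, $_[0] };

My::Codec::encode_or_warn(undef);        # caller has warnings on: warns
{
    no warnings 'uninitialized';
    My::Codec::encode_or_warn(undef);    # caller silenced the category: no warning
}

print scalar(@seen), " warning(s) seen\n";
```

The second call stays silent even though the module itself has warnings enabled, because warnif consults the caller's lexical warning settings rather than the module's.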

Without adding the warnif calls to the code, the only way you can silence the warning 
is 

  {
local $^W = 0 ;
Encode::encode_utf8($x);
  }

and that only works if the function being called isn't itself under the control of the 
warnings pragma. So for example

sub xxx
{
use warnings ;
my $a =~ tr/A/a/;
}

{
local $^W = 0 ;
xxx();
}

still generates the "Use of uninitialized value" warning.

I see that Encode does make use of the warnings pragma in places, so I'm not sure if 
the "local $^W = 0" trick can be used with it.

> And can we not enhance the message generator to fish the name out 
> of somewhere so that it says "Use of undefined in subroutine encode_utf8"
> rather than just "subroutine entry" ? 

That would be worth doing regardless.

Paul