Re: BOM and principle of least surprise

2004-05-16 Thread Erland Sommarskog
Jarkko Hietaniemi ([EMAIL PROTECTED]) writes:
>> Both input data and the script. Just because the script has been saved
>> in UTF-8, does not mean that literals in the script are taken as UTF-8.
> 
> Oh, great.  Now you want to mix different encodings in the same file.
> I give up :-)

I think you misunderstood me. This script was in my original post:

   use strict;

   use MSSQL::OlleDB;
   $| = 1;
   my $i = 0;
   foreach (1..2) {
      my $db = 'räksmörgås';
      print "Len " . length($db) . " Str: $db\n";
      my $X = MSSQL::OlleDB->connect(undef, undef, undef, $db);
      $i++;
      print "$i\n" if $i % 50 == 0;
   }
 
This script is supposed to connect to a database called "räksmörgås", 
a name which in SQL Server is stored as Unicode, in UTF-16. OlleDB is
my XS module, and it uses SvUTF8 to determine whether $db is in UTF-8
or not, and then converts to UTF-16 from the ANSI code page or UTF-8.

First I had saved the script in ANSI format, and I connected as I had
expected. Then I saved the script in UTF-8. It still said "räksmörgås"
when I looked at the file, but SvUTF8 still returned false, so I did
not connect to the database successfully.
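The effect described above can be reproduced without a database. The following is a minimal sketch, not code from OlleDB: `describe` is a hypothetical helper, and `utf8::is_utf8` is used here as the Perl-level view of the flag that SvUTF8 tests in XS. The name is written with hex escapes so the result does not depend on how the example file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: report the length and UTF8 flag of a scalar,
# roughly what an XS module would learn from SvUTF8.
sub describe {
    my ($str) = @_;
    return sprintf "length %d, UTF8 flag %s",
        length($str), utf8::is_utf8($str) ? 'on' : 'off';
}

# "räksmörgås" as raw UTF-8 octets, as Perl sees the literal when the
# script is saved in UTF-8 and nothing tells Perl about the encoding.
my $bytes = "r\xC3\xA4ksm\xC3\xB6rg\xC3\xA5s";

# The same name after the octets have been decoded into characters.
my $chars = decode('UTF-8', $bytes);

print "as bytes: ", describe($bytes), "\n";
print "decoded:  ", describe($chars), "\n";
```

The undecoded string is 13 octets long with the flag off, while the decoded one is 10 characters long with the flag on, which is the difference a module checking SvUTF8 would act on.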

>> To be able to do that, it would have to understand byte-order marks
>> (which it doesn't). I think there was a suggestion that you could
>> specify an 
>
>In 5.8.5 it will.

Will such an option include the possibility to say that I want Perl to
determine the encoding from the byte-order mark?

-- 
Erland Sommarskog, Stockholm, [EMAIL PROTECTED]


Re: BOM and principle of least surprise

2004-05-16 Thread Jarkko Hietaniemi
>>>To be able to do that, it would have to understand byte-order marks
>>>(which it doesn't). I think there was a suggestion that you could
>>>specify an 
>>
>>In 5.8.5 it will.
> 
> 
> Will such an option include the possibility to say that I want Perl to
> determine the encoding from the byte-order mark?

No.  The patch I submitted peeks at the beginning of a Perl script and
if it either sees a BOM or something that looks like raw BOMless UTF-16
(every other byte zero, every other not) of either endianness, Perl will
understand.
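As an illustration of the kind of check described above, here is a minimal sketch in Perl. It is not the submitted patch (which lives inside perl itself); `sniff_encoding` is a hypothetical name, and looking at only the first two byte pairs for the BOMless case is an assumption made to keep the example short.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the encoding of a script from its first
# few bytes. Returns an encoding name, or undef if nothing is detected.
sub sniff_encoding {
    my ($head) = @_;

    # Explicit byte-order marks.
    return 'UTF-8'    if $head =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $head =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $head =~ /\A\xFF\xFE/;

    # BOMless UTF-16 heuristic: every other byte zero, every other not.
    return 'UTF-16BE' if $head =~ /\A(?:\x00[^\x00]){2}/;
    return 'UTF-16LE' if $head =~ /\A(?:[^\x00]\x00){2}/;

    return undef;
}

for my $head ("\xEF\xBB\xBFuse strict;", "u\x00s\x00e\x00", "use strict;") {
    my $guess = sniff_encoding($head);
    print defined $guess ? $guess : 'no BOM or UTF-16 pattern', "\n";
}
```

The BOM tests come first so that a file with a BOM is never misread by the zero-byte heuristic.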

Nothing for input files, someone would have to write a PerlIO layer for
that.
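For input data that is already in memory, explicit decoding is an alternative to a dedicated layer. The sketch below assumes the data is known to be UTF-16 and carries a BOM; in that case Encode's "UTF-16" encoding uses the BOM to pick the endianness. This is explicit decoding by the caller, not automatic detection by Perl.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The text "hi" as UTF-16 octets with a BOM, in both byte orders.
my $le = "\xFF\xFE" . "h\x00i\x00";
my $be = "\xFE\xFF" . "\x00h\x00i";

for my $octets ($le, $be) {
    # 'UTF-16' (no LE/BE suffix) reads the BOM and strips it.
    my $text = decode('UTF-16', $octets);
    print "decoded: $text\n";
}
```

Both inputs decode to the same two-character string.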

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen


[Encode] 2.00 released!

2004-05-16 Thread Dan Kogai
Porters,
I have just released Encode version 2.00.  Though the major version has 
been incremented, there are no big feature (addition|change)s.

=head1 AVAILABILITY
http://www.dan.co.jp/~dankogai/Encode-2.00.tar.gz
or CPAN near you
=head1 CHANGES
$Revision: 2.0 $ $Date: 2004/05/16 20:55:15 $
* version updated to 2.00
   -- sorry, no big feature change.  I just hate version 1.100 :)
! lib/Encode/Guess.pm
  Unicode/Unicode.pm
  addressed  UTF-(8|32LE) + BOM misguessing
  https://rt.cpan.org/Ticket/Display.html?id=6279
! Encode.pm
  s/is_utif8/is_utf8/ in POD
! Encode/lib/Encode/CN/HZ.pm
  Fixes "make test" failure after the patch to pp_hot.c
  by Sadahiro-san
  Message-Id: <[EMAIL PROTECTED]>
! bin/piconv
  From:   [EMAIL PROTECTED]
  Subject: [PATCH] "piconv -C 512" badly broken
  Message-Id: <[EMAIL PROTECTED]>
Some of the changes are already committed in Perl 5.8.[34] and 
maintperl, but without a new release older perls would be left behind, 
so I released one.

Enjoy!
Dan the Encode Maintainer


Re: BOM and principle of least surprise

2004-05-16 Thread Erland Sommarskog
Jarkko Hietaniemi ([EMAIL PROTECTED]) writes:
>>>> To be able to do that, it would have to understand byte-order marks
>>>> (which it doesn't). I think there was a suggestion that you could
>>>> specify an 
>>>
>>>In 5.8.5 it will.
>> 
>> 
>> Will such an option include the possibility to say that I want Perl to
>> determine the encoding from the byte-order mark?
> 
> No.  The patch I submitted peeks at the beginning of a Perl script and
> if it either sees a BOM or something that looks like raw BOMless UTF-16
> (every other byte zero, every other not) of either endianness, Perl will
> understand.

I think I understood that the change was only for the script as such. Let's
forget input files for the moment.

So Perl 5.8.5 will be able to read a UTF-16 file?

And if it sees a UTF-8 BOM, that will imply a "use utf8"?
 
Will this require that I specify an option to Perl, or will this be 
the default behaviour?

-- 
Erland Sommarskog, Stockholm, [EMAIL PROTECTED]


Re: BOM and principle of least surprise

2004-05-16 Thread Jarkko Hietaniemi
>>No.  The patch I submitted peeks at the beginning of a Perl script and
>>if it either sees a BOM or something that looks like raw BOMless UTF-16
>>(every other byte zero, every other not) of either endianness, Perl will
>>understand.
> 
> 
> I think I understood that the change was only for the script as such. Let's
> forget input files for the moment.
> 
> So Perl 5.8.5 will be able to read a UTF-16 file?

Assuming the perl maintainers will approve that patch into the 5.8
maintenance branch, yes.

> And if it sees a UTF-8 BOM, that will imply a "use utf8"?

Not quite 100% semantically the same (use utf8 does many things behind
the curtains), but for your purposes (that the script has been stored in
UTF-8), I think so.

Though I must say that personally I would avoid using BOM with UTF-8:
there is little reason to use a byte order mark with UTF-8 since UTF-8
is byte order independent.

> Will this require that I specify an option to Perl, or will this be 
> the default behaviour?

The default.  It was supposed to be the default already in 5.8.0 but
it seems the feature wasn't tested well enough.  Having it optional
makes little sense because *without* the detection the script is simply
illegal Perl: UTF-16 doesn't parse as Perl, and the UTF-8 BOM doesn't
parse as Perl.

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen