Re: CGI and UTF

2003-01-21 Thread Jarkko Hietaniemi
-C:1 / -C:0 it is.  (The :part being optional.)

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen



Re: CGI and UTF

2003-01-21 Thread Jarkko Hietaniemi
On Tue, Jan 21, 2003 at 11:25:11AM +, Peter Haworth wrote:
> On Sat, 18 Jan 2003 18:56:57 +0200, Jarkko Hietaniemi wrote:
> > Now Perl-5.8.1-to-be has been changed to
> >
> > (1) not to do any implicit UTF-8-ification of any filehandles unless
> > explicitly asked to do so (either by the -C command line switch or by
> > setting the env var PERL_UTF8_LOCALE to a true value, the switch wins
> > if both are present)
> 
> Is there a way to specify a "negative" -C to turn off UTF-8 in the face of
> PERL_UTF8_LOCALE being set?

Um, no, there isn't...  Should we have -C:0 and -C:1?

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen



Re: CGI and UTF

2003-01-21 Thread Peter Haworth
On Sat, 18 Jan 2003 18:56:57 +0200, Jarkko Hietaniemi wrote:
> Now Perl-5.8.1-to-be has been changed to
>
> (1) not to do any implicit UTF-8-ification of any filehandles unless
> explicitly asked to do so (either by the -C command line switch or by
> setting the env var PERL_UTF8_LOCALE to a true value, the switch wins
> if both are present)

Is there a way to specify a "negative" -C to turn off UTF-8 in the face of
PERL_UTF8_LOCALE being set?

-- 
Peter Haworth   [EMAIL PROTECTED]
"That's about all there is to it. Now you just need to go off and buy a book
 about object-oriented design methodology, and bang your forehead with it for 
 the next six months or so."-- Perl 5 "perlobj" man page



Re: CGI and UTF

2003-01-19 Thread Benjamin Franz
On Sat, 18 Jan 2003, Jarkko Hietaniemi wrote:

> Now Perl-5.8.1-to-be has been changed to
> 
> (1) not to do any implicit UTF-8-ification of any filehandles unless
> explicitly asked to do so (either by the -C command line switch
> or by setting the env var PERL_UTF8_LOCALE to a true value, the switch
> wins if both are present) (and if the locale settings do not indicate
> a UTF-8 locale, both are silent no-ops)
> 
> (2) illegal UTF-8 causing a -w(arning) immediately when read in e.g. by <>
> (an immediate croak is a possibility, but a warning is how it now works,
> and a croak would be, err, even more non-traditional for UNIX...)
> 
> Note that the above do not change the fact that if a *programmer* wants
> their code to be UTF-8 aware, they need to think about the evil binmode().

Wonderful. :) This will definitely simplify the day I have to migrate our
existing codebase to 5.8.

Thank you.

-- 
Benjamin Franz

"If the code and the comments disagree, then both are probably wrong."
-- Norm Schryer, Bell Labs 




Re: CGI and UTF

2003-01-18 Thread Jarkko Hietaniemi
Now Perl-5.8.1-to-be has been changed to

(1) not to do any implicit UTF-8-ification of any filehandles unless
explicitly asked to do so (either by the -C command line switch
or by setting the env var PERL_UTF8_LOCALE to a true value, the switch
wins if both are present) (and if the locale settings do not indicate
a UTF-8 locale, both are silent no-ops)

(2) illegal UTF-8 causing a -w(arning) immediately when read in e.g. by <>
(an immediate croak is a possibility, but a warning is how it now works,
and a croak would be, err, even more non-traditional for UNIX...)

Note that the above do not change the fact that if a *programmer* wants
their code to be UTF-8 aware, they need to think about the evil binmode().

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen



Re: CGI and UTF

2003-01-05 Thread Jarkko Hietaniemi
On Sun, Jan 05, 2003 at 12:16:38PM -0600, Earl Hood wrote:
> > > This is Bad Juju (tm). It _guarantees_ script breakage (potentially
> > > silently!) for Unix people doing _anything_ but ASCII text manipulation.  
> > 
> > I repeat: I don't think you can do "more than ASCII" by hanging tooth
> > and nail to the "everything is bytes" credo.
> 
> This statement assumes someone is working with characters.  It is
> common for many to use regexs and other operators (substr, index,
> et al.) on binary data directly.

True.  I think what I was referring to (somewhere earlier in my
message) is that you won't get Unicode data mixed into your data
unless you ask for it, explicitly or implicitly.

> > I repeat: all your filehandles are still 'binary' unless you either
> > explicitly (binmode) or implicitly (locale) command them not be.
> > If you try to push Unicode (data marked as UTF-8, such as characters
> > beyond 255) on such a filehandle, you'll get 'Wide character' warning.
> > If you do not like the locale implicit switching, reset your locale
> > to something not /utf-?8/i in it before running the script.
> 
> I think this reasoning is flawed since it assumes the author of
> the script has complete control over the environment.  For example,
> the script can be used by others in environments the author does not
> control.  Therefore, older programs can quietly break, or behave
> differently.
>
> According to the perllocale manpage, locale should have no effect
> unless the 'use locale' pragma is specified.  It appears from
> Benjamin's script that he is not using the pragma, so even if the
> environment has a utf-8 locale, the script should be unaffected.

True, too.  The enabling of UTF-8ness based on locale is an
exception as to how things were done before.  But I'm delegating
responsibility about that decision to Larry Wall :-)
I'm trying to get an opinion about this from him, and I just logged
a problem ticket about this issue. 

> --ewh

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen



Re: CGI and UTF

2003-01-05 Thread Earl Hood
On January 5, 2003 at 05:42, Jarkko Hietaniemi wrote:

> > This is Bad Juju (tm). It _guarantees_ script breakage (potentially
> > silently!) for Unix people doing _anything_ but ASCII text manipulation.  
> 
> I repeat: I don't think you can do "more than ASCII" by hanging tooth
> and nail to the "everything is bytes" credo.

This statement assumes someone is working with characters.  It is
common for many to use regexs and other operators (substr, index,
et al.) on binary data directly.

> I repeat: all your filehandles are still 'binary' unless you either
> explicitly (binmode) or implicitly (locale) command them not be.
> If you try to push Unicode (data marked as UTF-8, such as characters
> beyond 255) on such a filehandle, you'll get 'Wide character' warning.
> If you do not like the locale implicit switching, reset your locale
> to something not /utf-?8/i in it before running the script.

I think this reasoning is flawed since it assumes the author of
the script has complete control over the environment.  For example,
the script can be used by others in environments the author does not
control.  Therefore, older programs can quietly break, or behave
differently.

According to the perllocale manpage, locale should have no effect
unless the 'use locale' pragma is specified.  It appears from
Benjamin's script that he is not using the pragma, so even if the
environment has a utf-8 locale, the script should be unaffected.
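
A script that cannot control its environment can at least inspect it.
The following is a minimal sketch of such a check, not anything Perl
itself does: the helper name locale_wants_utf8 is made up for
illustration, and it assumes the usual POSIX precedence (LC_ALL over
LC_CTYPE over LANG, with empty values treated as unset).  The pattern
is the /utf-?8/i test mentioned in the quoted text.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether the effective locale names UTF-8.
# Assumed precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
sub locale_wants_utf8 {
    my (%env) = @_;
    for my $var (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env{$var};
        # An unset or empty variable defers to the next one in line.
        next unless defined $value && length $value;
        return $value =~ /utf-?8/i ? 1 : 0;
    }
    return 0;    # nothing set: behaves like the "C" locale
}

if (locale_wants_utf8(%ENV)) {
    warn "note: locale indicates UTF-8; binary handles should use binmode\n";
}
```

Passing the environment in as a list, rather than reading %ENV inside
the helper, keeps the check easy to try against other settings.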

--ewh



Re: CGI and UTF

2003-01-05 Thread Jarkko Hietaniemi
> > or implicitly (locale) command them not be.
> 
> Not fine without a warning. This is 'action at a distance' (this is the
> same reason un'local'ized usage of the 'special' variables is nearly

On that we can agree, kind of-- I find the *whole* locale system to be
a Bad Idea (tm) (not just any UTF-8 parts of it).  Locales are *all*
about action-at-a-distance.

> always a Bad Idea (tm)). It causes breakage that can be hard to find the
> cause of. Perl needs a mandatory warning if the locale changes my
> filehandles to text mode and I haven't made some kind of _explicit_
> declaration that I want that behavior to happen.
>
> The change is of a bad 'type': An incompatible change in Perl semantics
> without so much as a warning being issued by either the compiler or the
> runtime - except to make the code fall over dead many lines away from the
> actual breakage. If the string is invalid UTF8, why didn't Perl complain
> _when I read it_ instead of dozens of lines away when I tried to use that
> string for something else? That is _broken_.

See below.

> > If you try to push Unicode (data marked as UTF-8, such as characters
> > beyond 255) on such a filehandle, you'll get 'Wide character' warning.
> 
> But it _reads_ binary data through a UTF8 layer silently. No warnings. Try
> the code I posted on an actual jpg file with UTF-8 locale set in the
> environment. The first complaint is when the code falls over dead in the
> 'jpegsize' sub - many lines of code away from the  read.

I think now I reached your page.  I have to think more about this,
though, not to make the checking at the point of reading for example
unreasonably slow.  And I'll be rather Internet connectivity
challenged in the coming weeks, so please be patient.

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen



Re: CGI and UTF

2003-01-05 Thread Benjamin Franz
On Sun, 5 Jan 2003, Jarkko Hietaniemi wrote:
> I repeat: all your filehandles are still 'binary' unless you either
> explicitly (binmode)

Fine.

> or implicitly (locale) command them not be.

Not fine without a warning. This is 'action at a distance' (this is the
same reason un'local'ized usage of the 'special' variables is nearly
always a Bad Idea (tm)). It causes breakage that can be hard to find the
cause of. Perl needs a mandatory warning if the locale changes my
filehandles to text mode and I haven't made some kind of _explicit_
declaration that I want that behavior to happen.

The change is of a bad 'type': An incompatible change in Perl semantics
without so much as a warning being issued by either the compiler or the
runtime - except to make the code fall over dead many lines away from the
actual breakage. If the string is invalid UTF8, why didn't Perl complain
_when I read it_ instead of dozens of lines away when I tried to use that
string for something else? That is _broken_.

> If you try to push Unicode (data marked as UTF-8, such as characters
> beyond 255) on such a filehandle, you'll get 'Wide character' warning.

But it _reads_ binary data through a UTF8 layer silently. No warnings. Try
the code I posted on an actual jpg file with UTF-8 locale set in the
environment. The first complaint is when the code falls over dead in the
'jpegsize' sub - many lines of code away from the  read.

-- 
Jerry

"If the code and the comments disagree, then both are probably wrong."
-- Norm Schryer, Bell Labs 





Re: CGI and UTF

2003-01-04 Thread Jarkko Hietaniemi
> Treating a 'string' as anything but a sequence of 'bytes/octets' _without 
> my explicit request or a runtime warning that I haven't specified fh 
> semantics_.

I'm still not quite following what you are upset about.

(I'm starting to suspect that it must be because I've so completely
bought in to the Unicode model Perl has now and am unable to see what
could be the problem...)  Perl does *not* haphazardly handle a string
as anything other than bytes/octets.  It does so only if you either

(1) explicitly inject Unicode into it by chr(), \x{...}, etc.
(2) either explicitly (binmode) or implicitly (locale) twiddle
a filehandle so that it converts

As far as I can understand, you were bitten by the locale.  As I told
you, that is as wanted by Larry, and also (independently of Perl)
by the Linux Unicode people.

> > The only obvious 'magic' I can think of is the behaviour where Perl
> > checks your locale settings, and if they indicate use of UTF-8, Perl
> > switches the default encoding of the STD* streams, and any further
> > file opens to UTF-8.  This bit of magic was specifically requested by
> > Larry Wall, and also by the Linux "Unicodification" project.
> 
> This is Bad Juju (tm). It _guarantees_ script breakage (potentially
> silently!) for Unix people doing _anything_ but ASCII text manipulation.  

I repeat: I don't think you can do "more than ASCII" by hanging tooth
and nail to the "everything is bytes" credo.

> > The locale-induced UTF-8 magic can lead into situation where you have
> > to explicitly mark your filehandles "binary" (with binmode, please
> > don't use bytes), because otherwise any data going out would be
> > expected to be Unicode, that is, *text*.  If you are pushing out
> > binary bits and bytes, you should tell Perl about it.   You are
> > also simultaneously complaining about "wanting to specify things
> > yourself" and "having to use binmode"?
> 
> Yes. Because _needing_ to 'tell Perl' that I am pushing binary rather than
> text _is a change_ for *nix platforms. I should have to 'tell Perl' I am
> pushing _anything else_ than binary. Or _at a minimum_ a mandatory warning
> should be issued that I didn't declare the filehandle's encoding layer and
> it is now using encoding 'X' if I haven't explicitly indicated that I
> *WANT* the system environment changing my filehandle's encodings.

I repeat: all your filehandles are still 'binary' unless you either
explicitly (binmode) or implicitly (locale) command them not to be.
If you try to push Unicode (data marked as UTF-8, such as characters
beyond 255) on such a filehandle, you'll get a 'Wide character' warning.
If you do not like the implicit locale switching, reset your locale
to something without /utf-?8/i in it before running the script.
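
The 'Wide character' case can be reproduced in a few lines.  This is
only a sketch, assuming Perl 5.8 or later (it uses in-memory
filehandles so no real file is touched); the variable names are made
up for illustration.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, so they can be inspected.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };

my $smiley = "\x{263A}";    # a character beyond 255

# 1. Plain handle, no layer: Perl warns, then writes the UTF-8 bytes anyway.
my $plain = '';
open my $out, '>', \$plain or die "open: $!";
print {$out} $smiley;
close $out;

# 2. Same data after an explicit binmode: the handle expects characters,
#    so no warning is issued.
my $marked = '';
open $out, '>', \$marked or die "open: $!";
binmode $out, ':utf8';
print {$out} $smiley;
close $out;

print STDOUT "warnings seen: ", scalar(@warnings), "\n";
```

Only the first print should add an entry to @warnings; the bytes that
end up in the two buffers are the same.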

> > Back to the 'UNIX' way of I/O: I'm sorry but I think the UNIX way and
> > the Unicode can't transparently cohabit.  I'm very much a UNIX geek
> > and systems programmer, and I like the simple symmetrical world of
> > UNIX I/O, but I cannot see how the byte streams of UNIX and the
> > multiple variable and fixed length encodings of Unicode can work
> > simultaneously without some sort of explicit switching.
> 
> _Explicit_ switching is what I am asking for. _Implicit_ switching is what
> I am complaining about. If you want to switch based on the system env -
> fine: _But at least warn me with a good immediate warning_ before
> changing my fh semantics if I haven't said something like

The assumption is that if you have a locale setup that indicates
UTF-8, Perl is going to assume you knew what you were doing when
you set up the locale.  *All* locale effects are 'implicit'.

> binmode FH, ':crlf|:raw|:env';
> 
> before I go my $data = <FH>;
> 
> "Malformed UTF-8 character (unexpected end of string) at
> ./error-example.pl line 40." isn't useful: It is obscure and is produced
> distantly from the actual breakage.

perldiag has this:

=item Malformed UTF-8 character (%s)

Perl detected something that didn't comply with UTF-8 encoding rules.

One possible cause is that you read in data that you thought to be in
UTF-8 but it wasn't (it was for example legacy 8-bit data).  Another
possibility is careless use of utf8::upgrade().
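
For code that wants the complaint at the point of reading rather than
many lines later, one option is to validate the bytes explicitly.
What follows is a sketch under assumptions, not a description of what
perl does internally: it assumes the Encode module (shipped with 5.8),
and decode_utf8_strict is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode $bytes as UTF-8, dying at once if malformed.
sub decode_utf8_strict {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC asks
    # decode() not to modify its input in place.
    return Encode::decode('UTF-8', $bytes,
                          Encode::FB_CROAK | Encode::LEAVE_SRC);
}

# Valid UTF-8 bytes decode to an 11-character string.
my $good = decode_utf8_strict("Descripci\xC3\xB3n");

# The first bytes of a JPEG file are not valid UTF-8, so this dies
# inside the eval and the error is available immediately.
my $bad   = eval { decode_utf8_strict("\xFF\xD8\xFF\xE0") };
my $error = $@;

print "rejected binary data at read time\n" if !defined $bad;
```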

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen



Re: CGI and UTF

2002-12-26 Thread Benjamin Franz
On Sun, 24 Nov 2002, Jarkko Hietaniemi wrote:

> > 1) x.0 release. I haven't seen a x.0 release of _any_ software I was
> >willing to put the family jewels on without quite a bit of testing
> >first.
> 
> So are you conducting testing?

Slowly, informally. Work schedules leave little time to explore 5.8, 
except like now when I am actually on 'vacation' (which is why it is a 
month since the original message).

> > 2) The very first machine I installed it on immediately had script
> >breakage _specifically_ because the rather broken (IMHO) behavior
> >re making the use of either 'use bytes' or 'binmode' mandatory
> 
> Could you please specify the circumstances of the breakage further?
> What got broken, what had to be changed?

Stripped of most irrelevant code and cleaned up slightly, this is
essentially what happened (with the necessary 'binmode' commented out just
to point to the change). Yes - I know about (and frequently use)
Image::Size, et al. This is a fragment of a script that is distributed
'standalone' and so could not depend on anything not distributed with Perl
5.005 to be present.

#!/usr/bin/perl -w

use strict;

my $file = '/home/snowhare/images/test.jpg';
my ($width,$height) = jpegsize($file);

print "width = $width, height = $height\n";
exit 0;

# Slurp the whole file into a scalar.
sub readfile {
    my ($filename)=@_;
    if (! open (NEWFILE,$filename)) {
        print STDERR "$filename could not be opened for reading\n$!";
        return;
    }
    #binmode NEWFILE;
    my ($savedreadstate) = $/;
    undef $/;
    my $data = <NEWFILE>;
    $/ = $savedreadstate;
    close (NEWFILE);

    return ($data);
}

# Walk the JPEG markers until a start-of-frame marker (0xC0..0xC3) is found.
sub jpegsize {
    my ($filename) = @_;

    my $jpeg = readfile($filename);

    my($count) = 2;
    my($length)= length($jpeg);
    my($ch)= "";

    while (($ch ne "\xda") && ($count<$length)) {
        # Find next marker (jpeg markers begin with 0xFF)
        while (($ch ne "\xff") && ($count < $length)) {
            $ch=substr($jpeg,$count,1);
            $count++;
        }
        # jpeg markers can be padded with unlimited 0xFF's
        while (($ch eq "\xff") && ($count<$length)) {
            $ch=substr($jpeg,$count,1);
            $count++;
        }
        # Now, $ch contains the value of the marker.
        if ((ord($ch) >= 0xC0) && (ord($ch) <= 0xC3)) {
            $count  += 3;
            my ($a,$b,$c,$d) = unpack("C"x4,substr($jpeg,$count,4));
            my $width    = $c<<8 | $d;
            my $height   = $a<<8 | $b;
            return($width,$height);
        } else {
            # We **MUST** skip variables, since FF's within variable names are
            # NOT valid jpeg markers
            my ($c1,$c2)= unpack("C"x2,substr($jpeg,$count,2));
            $count += $c1<<8|$c2;
        }
    }
}
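
For comparison, here is a sketch of a reader that asks for bytes
explicitly, so that the locale cannot change what it returns.  It is
an illustration only: read_binary is a made-up name, and it uses
lexical filehandles and File::Temp, so unlike the script above it
would not run on 5.005.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: slurp a file as raw bytes, independent of locale.
sub read_binary {
    my ($filename) = @_;
    open my $fh, '<', $filename or die "$filename: $!";
    binmode $fh;        # the call that is commented out in the script above
    local $/;           # slurp mode, restored automatically on scope exit
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Round-trip four bytes that are not valid UTF-8 (a JPEG-style header).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xFF\xD8\xFF\xC0";
close $tmp;

my $bytes = read_binary($path);
printf "read %d bytes\n", length $bytes;
```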

> >the last few years to 'magically' try to muck with charset encodings; 
> >5.8.0 has specifically realized those fears as quite justified.
> 
> I'm sorry but you are not being very helpful at all.  You "distrust"
> "magic" but you do not really say what behaviour of Perl 5.8.0 you
> find disturbing.

Treating a 'string' as anything but a sequence of 'bytes/octets' _without 
my explicit request or a runtime warning that I haven't specified fh 
semantics_.

> The only obvious 'magic' I can think of is the behaviour where Perl
> checks your locale settings, and if they indicate use of UTF-8, Perl
> switches the default encoding of the STD* streams, and any further
> file opens to UTF-8.  This bit of magic was specifically requested by
> Larry Wall, and also by the Linux "Unicodification" project.

This is Bad Juju (tm). It _guarantees_ script breakage (potentially
silently!) for Unix people doing _anything_ but ASCII text manipulation.  

If you want to break something as fundamental to *nix boxes as binary mode
filehandles - _at least_ force the script writer acknowledge this _deep_
change to FH semantics. Then they are forced to become aware of the issue
_before_ a script gets its operating assumptions yanked out from under it.

I would lobby for a mandatory runtime warning to be issued on any
filehandle where neither 'binmode FH;' or 'binmode FH, LAYER;' has been
seen before a filehandle is used for the first time with an explanation of
the issue.

> The locale-induced UTF-8 magic can lead into situation where you have
> to explicitly mark your filehandles "binary" (with binmode, please
> don't use bytes), because otherwise any data going out would be
> expected to be Unicode, that is, *text*.  If you are pushing out
> binary bits and bytes, you should tell Perl about it.   You are
> also simultaneously complaining about "wanting to specify things
> yourself" and "having to use binmode"?

Yes. Because _needing_ to 'tell Perl' that I am pushing binary rather than
text _is a change_ for *nix platforms. I should have to 'tell Perl' I am
pushing _anything else_ than binary. Or _at a minimum_ a mandatory warning
should be issued that I didn't declare the filehandle's enc

Re: CGI and UTF

2002-11-24 Thread Andreas J. Koenig
> On Sun, 24 Nov 2002 10:41:49 -0800 (PST), Benjamin Franz <[EMAIL PROTECTED]> 
>said:

  > I'm not Cisco, or the original guy, but here are the reasons I (and as the
  > Admin/Lead Programmer here my position is basically my company's position
  > on this) won't use Perl 5.8.0 for production servers:

  > 1) x.0 release. I haven't seen a x.0 release of _any_ software I was
  >willing to put the family jewels on without quite a bit of testing
  >first.

Before 5.8.0 came out, it wasn't clear if there would ever be a 5.6.2.
Since 5.8.0 is out it is pretty clear, 5.6.2 won't happen. This seems
to be in accordance with the fact that 5.8.0 comes with 5 times more
tests than 5.6.1.

IF there is one test missing, YOU probably know which.

  > 2) The very first machine I installed it on immediately had script
  >breakage _specifically_ because the rather broken (IMHO) behavior
  >re making the use of either 'use bytes' or 'binmode' mandatory
  >if you want to get the same filehandle behavior semantics on *nix 
  >boxes that Perl (and virtually all other *nix programs) have had 
  >historically.

  >"thenkyouverramuch".

:-( Benjamin, please send us your bugreport. WITH the details.
Tänkuvérmaç

  > I should either have been less specific or more correct ...

<|=>



-- 
andreas



Re: CGI and UTF

2002-11-24 Thread Jarkko Hietaniemi
> 1) x.0 release. I haven't seen a x.0 release of _any_ software I was
>willing to put the family jewels on without quite a bit of testing
>first.

So are you conducting testing?

> 2) The very first machine I installed it on immediately had script
>breakage _specifically_ because the rather broken (IMHO) behavior
>re making the use of either 'use bytes' or 'binmode' mandatory

Could you please specify the circumstances of the breakage further?
What got broken, what had to be changed?

>if you want to get the same filehandle behavior semantics on *nix 
>boxes that Perl (and virtually all other *nix programs) have had 
>historically. I don't relish the prospect of identifying essentially
>every use of 'open' in every program we have ever written just to
>add 'binmode' or 'use bytes' to them to proof them against 5.8.0
>originated dain bramage. When I open a file handle and read a file
>I expect (by default) to get _exactly_ what is in the file. If I
>want Unicode semantics, I'll explicitly specify them myself 
>"thenkyouverramuch".

I'm afraid here you can't both have your cake and eat it, see below.

>Unicode is great - I am a huge believer in it - but don't
>go mucking up *nix semantics by making 'text mode' filehandles the 
>default: It _breaks_ things that were running 100% clean under
>warnings and strict for years. I've distrusted the trend in Perl for 
>the last few years to 'magically' try to muck with charset encodings; 
>5.8.0 has specifically realized those fears as quite justified.

I'm sorry but you are not being very helpful at all.  You "distrust"
"magic" but you do not really say what behaviour of Perl 5.8.0 you
find disturbing.

The only obvious 'magic' I can think of is the behaviour where Perl
checks your locale settings, and if they indicate use of UTF-8, Perl
switches the default encoding of the STD* streams, and any further
file opens to UTF-8.  This bit of magic was specifically requested by
Larry Wall, and also by the Linux "Unicodification" project.

Other than that, you *do* need to *explicitly* turn on any encoding
conversions on filehandles.  Perl doesn't "guess" on input, or do
any implicit conversions on output.

The other magic I can think of is that Perl scripts can now be
saved in BOM-marked UTF-16, and Perl knows how to parse them.

The locale-induced UTF-8 magic can lead to a situation where you have
to explicitly mark your filehandles "binary" (with binmode, please
don't use bytes), because otherwise any data going out would be
expected to be Unicode, that is, *text*.  If you are pushing out
binary bits and bytes, you should tell Perl about it.   You are
also simultaneously complaining about "wanting to specify things
yourself" and "having to use binmode"?

If you are not affected by the locale UTF-8 magic, all handles are
just like they used to be.  In this case you do have to explicitly
tell that a filehandle is Unicode, just like you say you wanted.
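
As an illustration of telling Perl explicitly that a filehandle is
Unicode, here is a minimal sketch assuming Perl 5.8 or later.  It
writes through an :encoding(UTF-8) layer into an in-memory buffer
and reads the result back both ways; all variable names are made up
for the example.

```perl
use strict;
use warnings;

my $text = "Descripci\x{F3}n \x{263A}";   # 13 characters, one above 255

# Write through an explicit encoding layer; the buffer receives UTF-8 bytes.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open: $!";
print {$out} $text;
close $out;

# Read the same data back with no layer: we see the encoded bytes.
open my $raw, '<', \$buffer or die "open: $!";
my $as_bytes = do { local $/; <$raw> };
close $raw;

# Read it back through the layer: we get the original characters.
open my $in, '<:encoding(UTF-8)', \$buffer or die "open: $!";
my $as_text = do { local $/; <$in> };
close $in;

printf "%d bytes on disk, %d characters in memory\n",
    length $as_bytes, length $as_text;
```

The two lengths differ because the accented letter takes two bytes and
the smiley three once encoded.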

Back to the 'UNIX' way of I/O: I'm sorry but I think the UNIX way and
the Unicode can't transparently cohabit.  I'm very much a UNIX geek
and systems programmer, and I like the simple symmetrical world of
UNIX I/O, but I cannot see how the byte streams of UNIX and the
multiple variable and fixed length encodings of Unicode can work
simultaneously without some sort of explicit switching.

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen



Re: CGI and UTF

2002-11-24 Thread Benjamin Franz
On Wed, 20 Nov 2002, Nicholas Clark wrote:

> 
> [such as house policy on not using .0 versions? time taken to assess and
> approve releases meaning that approving 5.8.0 is a lot of effort?
> Something specific they don't like about 5.8.0?]
> 
> Basically is there something that the perl development community needs to do
> (or change) that would avoid this in future?

I'm not Cisco, or the original guy, but here are the reasons I (and as the
Admin/Lead Programmer here my position is basically my company's position
on this) won't use Perl 5.8.0 for production servers:

1) x.0 release. I haven't seen a x.0 release of _any_ software I was
   willing to put the family jewels on without quite a bit of testing
   first.

2) The very first machine I installed it on immediately had script
   breakage _specifically_ because the rather broken (IMHO) behavior
   re making the use of either 'use bytes' or 'binmode' mandatory
   if you want to get the same filehandle behavior semantics on *nix 
   boxes that Perl (and virtually all other *nix programs) have had 
   historically. I don't relish the prospect of identifying essentially
   every use of 'open' in every program we have ever written just to
   add 'binmode' or 'use bytes' to them to proof them against 5.8.0
   originated dain bramage. When I open a file handle and read a file
   I expect (by default) to get _exactly_ what is in the file. If I
   want Unicode semantics, I'll explicitly specify them myself 
   "thenkyouverramuch".

   Unicode is great - I am a huge believer in it - but don't
   go mucking up *nix semantics by making 'text mode' filehandles the 
   default: It _breaks_ things that were running 100% clean under
   warnings and strict for years. I've distrusted the trend in Perl for 
   the last few years to 'magically' try to muck with charset encodings; 
   5.8.0 has specifically realized those fears as quite justified.

-- 
Benjamin Franz

I should either have been less specific or more correct ...

---Andy Armstrong <[EMAIL PROTECTED]>






RE: CGI and UTF

2002-11-20 Thread Mark Proctor
Barry,

Yes I can do that, I already do for SOAP::Lite. Now that I have it
working with the example below should I be using anything else? I've
found that I have to do the same for variables from DBI as well as CGI -
this does seem a bit excessive, so any simpler/more efficient technique
would be appreciated. I guess I could use one of the encoding modules,
but would that be any better than what I have below?

$utf8text = pack('U*', unpack('U*', $q->param('text')));

Mark

-Original Message-
From: Barry Caplan [mailto:[EMAIL PROTECTED]] 
Sent: 20 November 2002 19:00
To: Mark Proctor; 'Andreas J. Koenig'
Cc: [EMAIL PROTECTED]
Subject: RE: CGI and UTF


At 06:47 PM 11/20/2002 +, Mark Proctor wrote:
>Unfortunately I have asked the cisco admins if we can have perl5.8 and
>they said no way.

Yeah but you can use various CPAN modules, even if you install them in a
local directory, right?

Barry





RE: CGI and UTF

2002-11-20 Thread Mark Proctor
Nicholas,

Cisco are very much "if it's not broken don't fix it" - they are
generally slow to use new technologies. We are still using standard
htaccess files with lists of user names for authentication, which
causes a huge problem for large htaccess files because of the 8K limit
and I've been struggling for over a year now to get them to move to
mod_perl. I will pose your question to the euro sysadmin guy, but when I
spoke to him he basically said that they are in the middle of upgrading
all the servers and moving everything to the US - and the
research/preparation and application testing necessary to move to 5.8
wouldn't fit in with the available resources.

Mark

-Original Message-
From: Nicholas Clark [mailto:[EMAIL PROTECTED]] 
Sent: 20 November 2002 20:02
To: Mark Proctor
Cc: 'Andreas J. Koenig'; [EMAIL PROTECTED]
Subject: Re: CGI and UTF



On Wed, Nov 20, 2002 at 05:38:20PM -, Mark Proctor wrote:

[upgrading from 5.6.1 to 5.8]

> I have checked with the sysadmins at cisco and they said "no way" :(

I'm not asking this as an attempt to provide arguments to give them back
- if
they are sure of their position, then it is necessary to work within it.

But did they say *why* they are so insistent that 5.8.0 is not feasible?

[such as house policy on not using .0 versions? time taken to assess and
approve releases meaning that approving 5.8.0 is a lot of effort?
Something specific they don't like about 5.8.0?]

Basically is there something that the perl development community needs
to do
(or change) that would avoid this in future?

Nicholas Clark
-- 
Befunge better than perl?   http://www.perl.org/advocacy/spoofathon/




CGI and UTF

2002-11-20 Thread Mark Proctor (mproctor)
I'm having some problems with XML/UTF8 and CGI variables in perl5.6.1.

I have attached an example of the problem; an example string is
Descripción. You will need to have XML::Simple installed.

The example takes an input string and then prints it twice: once with
concatenation, and once just displaying the inputted string. The mangling
occurs when you concatenate an XML string with a CGI string.

I'm not sure why this happens, but here is a first attempt at a possible
theory. All XML parsing is done in UTF8, but perl has no idea of encodings
for incoming CGI streams and assumes them to be iso-8859-1 (latin1) - I
read this somewhere, I don't know if it's correct. String operations upgrade
non-UTF8 strings to UTF8, so perl tries to convert the CGI string from
iso-8859-1 to UTF8, thus mangling it as it's already UTF8.

Can anyone point me in the right direction, explain where I'm going wrong
and maybe provide some useful links - there seems to be very little
information on building internationalised web pages with UTF8 and
perl5.6.1.
 
Thanks
 
Mark
 


testUTF8.pl
Description: testUTF8.pl
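
The theory above can be tried out in isolation.  The sketch below is
not the attached script: it assumes Perl 5.8 or later with the Encode
module (so it does not apply as-is to 5.6.1), and it assumes the XML
side hands back a character string, with a hard-coded string standing
in for the parser's output.

```perl
use strict;
use warnings;
use Encode ();

# What arrives from the browser: the UTF-8 *bytes* for "Descripción".
my $cgi_bytes = "Descripci\xC3\xB3n";

# Stand-in for parsed XML: a *character* string with a character above 255.
my $xml_text = "\x{263A} ";

# Concatenation upgrades $cgi_bytes as if it were Latin-1, so 0xC3 and
# 0xB3 become two separate characters instead of one accented letter.
my $mangled = $xml_text . $cgi_bytes;

# Decoding the bytes first gives the intended result.
my $fixed = $xml_text . Encode::decode('UTF-8', $cgi_bytes);

printf "mangled: %d characters, fixed: %d characters\n",
    length $mangled, length $fixed;
```

The mangled string comes out one character longer than the fixed one,
which is the double-encoding effect described in the message.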


Re: CGI and UTF

2002-11-20 Thread Nicholas Clark

On Wed, Nov 20, 2002 at 05:38:20PM -, Mark Proctor wrote:

[upgrading from 5.6.1 to 5.8]

> I have checked with the sysadmins at cisco and they said "no way" :(

I'm not asking this as an attempt to provide arguments to give them back - if
they are sure of their position, then it is necessary to work within it.

But did they say *why* they are so insistent that 5.8.0 is not feasible?

[such as house policy on not using .0 versions? time taken to assess and
approve releases meaning that approving 5.8.0 is a lot of effort?
Something specific they don't like about 5.8.0?]

Basically is there something that the perl development community needs to do
(or change) that would avoid this in future?

Nicholas Clark
-- 
Befunge better than perl?   http://www.perl.org/advocacy/spoofathon/



RE: CGI and UTF

2002-11-20 Thread Mark Proctor
Success - I found this:
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&threadm=20020429145407.00874.5678%40mb-me.aol.com&rnum=1&prev=/groups%3Fq%3Dperl%2Bpack%2Bcgi%2Butf%2BOR%2Butf8%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3DUTF-8%26as_qdr%3Dall%26selm%3D20020429145407.00874.5678%2540mb-me.aol.com%26rnum%3D1

This line can take a UTF8 input and tag it as UTF8
$text = pack('U*', unpack('U*', $q->param('text')));

Is this the only way to tag a string that has come in from CGI as UTF8?
Is it technically correct? Are there any occasions when it could cause a
problem?
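
For readers on Perl 5.8 or later (not the 5.6.1 installation this thread is
constrained to), the bundled Encode module offers a documented way to turn
UTF-8 bytes into a character string. The following is only a minimal sketch,
assuming the parameter arrives as raw UTF-8 bytes; decode_param is a
hypothetical helper name, not part of CGI.pm or Encode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn raw UTF-8 bytes into a Perl character string.
# Returns undef when the input is undefined or is not valid UTF-8.
sub decode_param {
    my ($bytes) = @_;
    return undef unless defined $bytes;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return $text;         # undef if decode() died on malformed input
}

# "ó" (U+00F3) is the two bytes 0xC3 0xB3 in UTF-8.
my $text = decode_param("Descripci\xC3\xB3n");
print "characters: ", length($text), "\n";
print "flagged as UTF-8: ", (utf8::is_utf8($text) ? "yes" : "no"), "\n";

# The same word sent as Latin-1 bytes is rejected instead of silently mangled.
print "latin1 input accepted: ",
    (defined decode_param("Descripci\xF3n") ? "yes" : "no"), "\n";
```

Unlike a bare pack/unpack round trip, this sketch reports malformed input by
returning undef, so a browser that submits the form in another encoding can be
detected rather than stored as garbage.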

Thanks 

Mark

-Original Message-
From: Mark Proctor [mailto:[EMAIL PROTECTED]] 
Sent: 20 November 2002 18:47
To: 'Barry Caplan'; 'Andreas J. Koenig'
Cc: [EMAIL PROTECTED]
Subject: RE: CGI and UTF


Unfortunately I have asked the cisco admins if we can have perl5.8 and
they said no way.

I have tried doing stuff like this:
$text = $q->param('text');
if ($q->param('text')) {
  print $text . $xml->{message};
} else {
  print "\x{00F3}" . $xml->{message};
}

And it works and displays fine. But when I display this in the textarea, so
that I can resubmit it, it still comes back mangled :(

Mark

-Original Message-
From: Barry Caplan [mailto:[EMAIL PROTECTED]] 
Sent: 20 November 2002 18:42
To: Mark Proctor; 'Andreas J. Koenig'
Cc: [EMAIL PROTECTED]
Subject: RE: CGI and UTF


Mark,

I think 5.8 has an Encode module with a normalize function. CPAN probably
has something similar. The perl docs for those modules are probably a
good place to start to understand unicode normalization. unicode.org is
the definitive source but could be pretty pedantic if this is your first
exposure.

Barry Caplan
www.i18n.com

At 05:38 PM 11/20/2002 +, Mark Proctor wrote:
>I have checked with the sysadmins at cisco and they said "no way" :(
>So I have to get this working. Someone has said that I need to
>"normalise" the params from cgi - but I have no idea what that means.
>
>Mark






RE: CGI and UTF

2002-11-20 Thread Barry Caplan
At 06:47 PM 11/20/2002 +, Mark Proctor wrote:
>Unfortunately I have asked the cisco admins if we can have perl5.8 and
>they said no way.

Yeah but you can use various CPAN modules, even if you install them in a local 
directory, right?

Barry




RE: CGI and UTF

2002-11-20 Thread Mark Proctor
Unfortunately I have asked the cisco admins if we can have perl5.8 and
they said no way.

I have tried doing stuff like this:
$text = $q->param('text');
if ($q->param('text')) {
  print $text . $xml->{message};
} else {
  print "\x{00F3}" . $xml->{message};
}

And it works and displays fine. But when I display this in the textarea, so
that I can resubmit it, it still comes back mangled :(

Mark

-Original Message-
From: Barry Caplan [mailto:[EMAIL PROTECTED]] 
Sent: 20 November 2002 18:42
To: Mark Proctor; 'Andreas J. Koenig'
Cc: [EMAIL PROTECTED]
Subject: RE: CGI and UTF


Mark,

I think 5.8 has an Encode module with a normalize function. CPAN probably
has something similar. The perl docs for those modules are probably a
good place to start to understand unicode normalization. unicode.org is
the definitive source but could be pretty pedantic if this is your first
exposure.

Barry Caplan
www.i18n.com

At 05:38 PM 11/20/2002 +, Mark Proctor wrote:
>I have checked with the sysadmins at cisco and they said "no way" :(
>So I have to get this working. Someone has said that I need to
>"normalise" the params from cgi - but I have no idea what that means.
>
>Mark





RE: CGI and UTF

2002-11-20 Thread Barry Caplan
Mark,

I think 5.8 has an Encode module with a normalize function. CPAN probably has something
similar. The perl docs for those modules are probably a good place to start to
understand unicode normalization. unicode.org is the definitive source but could be
pretty pedantic if this is your first exposure.
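
To make "normalization" concrete: in Perl 5.8 and later the functions live in
the bundled Unicode::Normalize module rather than in Encode. The sketch below
assumes such a Perl and uses illustrative variable names. It shows two
spellings of the same word that compare as different until both are brought to
the same normalization form. Note that normalization is a separate matter from
the byte-versus-character mix-up described in the original post.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# "ó" as the single precomposed code point U+00F3.
my $precomposed = "Descripci\x{F3}n";

# "o" followed by U+0301 COMBINING ACUTE ACCENT; renders the same way.
my $decomposed = "Descripcio\x{301}n";

print "raw comparison: ",
    ($precomposed eq $decomposed ? "equal" : "different"), "\n";

# NFC composes where possible, so both spellings collapse to one form.
print "after NFC: ",
    (NFC($precomposed) eq NFC($decomposed) ? "equal" : "different"), "\n";

# NFD goes the other way and splits the accent off again.
printf "NFC length %d, NFD length %d\n",
    length(NFC($decomposed)), length(NFD($precomposed));
```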

Barry Caplan
www.i18n.com

At 05:38 PM 11/20/2002 +, Mark Proctor wrote:
>I have checked with the sysadmins at cisco and they said "no way" :(
>So I have to get this working. Someone has said that I need to
>"normalise" the params from cgi - but I have no idea what that means.
>
>Mark




Re: CGI and UTF

2002-11-20 Thread Daisuke Maki
I can't quite tell if it's related, but while using AxKit I encountered 
problems with using "." to concatenate strings. I changed the module in 
question to have a "use bytes" at the top, and that problem went away.

I think it also went away when I used sprintf() to concatenate the strings.

might as well give it a shot? ;)
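
As a rough illustration of what the "use bytes" pragma changes, the sketch
below compares character and byte semantics for the same string. The helper
names char_len and byte_len are hypothetical. The pragma exposes Perl's
internal representation, and its current documentation discourages it for
anything beyond debugging, so this should be read as an explanation of the
workaround and not as a recommendation.

```perl
use strict;
use warnings;

# U+263A is above 0xFF, so Perl must store this string internally as UTF-8.
my $text = "Descripci\x{F3}n\x{263A}";

# Default semantics: length() counts characters.
sub char_len {
    my ($s) = @_;
    return length $s;
}

# Inside the lexical scope of "use bytes", length() counts the bytes of
# the internal representation instead.
sub byte_len {
    my ($s) = @_;
    use bytes;
    return length $s;
}

printf "characters: %d, bytes: %d\n", char_len($text), byte_len($text);
```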

--d

Mark Proctor wrote:
I have checked with the sysadmins at cisco and they said "no way" :(
So I have to get this working. Someone has said that I need to
"normalise" the params from cgi - but I have no idea what that means.

Mark

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of
Andreas J. Koenig
Sent: 20 November 2002 17:31
To: Mark Proctor
Cc: [EMAIL PROTECTED]
Subject: Re: CGI and UTF




On Wed, 20 Nov 2002 15:57:43 -, "Mark Proctor" <[EMAIL PROTECTED]> said:

  > I'm having some problems with XML/UTF8 and CGI variables in perl5.6.1

If you have any chance to upgrade to perl-5.8.0, please do it now. The
Unicode model of 5.8.0 is much more mature than that of 5.6.* and the
number of found bugs is close to zero. Your script looks OK and runs
fine under 5.8.0







RE: CGI and UTF

2002-11-20 Thread Mark Proctor
I have checked with the sysadmins at cisco and they said "no way" :(
So I have to get this working. Someone has said that I need to
"normalise" the params from cgi - but I have no idea what that means.

Mark

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of
Andreas J. Koenig
Sent: 20 November 2002 17:31
To: Mark Proctor
Cc: [EMAIL PROTECTED]
Subject: Re: CGI and UTF


>>>>> On Wed, 20 Nov 2002 15:57:43 -, "Mark Proctor"
<[EMAIL PROTECTED]> said:

  > I'm having some problems with XML/UTF8 and CGI variables in
perl5.6.1

If you have any chance to upgrade to perl-5.8.0, please do it now. The
Unicode model of 5.8.0 is much more mature than that of 5.6.* and the
number of found bugs is close to zero. Your script looks OK and runs
fine under 5.8.0

-- 
andreas




Re: CGI and UTF

2002-11-20 Thread Andreas J. Koenig
> On Wed, 20 Nov 2002 15:57:43 -, "Mark Proctor" <[EMAIL PROTECTED]> said:

  > I'm having some problems with XML/UTF8 and CGI variables in perl5.6.1

If you have any chance to upgrade to perl-5.8.0, please do it now. The
Unicode model of 5.8.0 is much more mature than that of 5.6.* and the
number of found bugs is close to zero. Your script looks OK and runs
fine under 5.8.0

-- 
andreas



CGI and UTF

2002-11-20 Thread Mark Proctor



I'm having some problems with XML/UTF8 and CGI 
variables in perl5.6.1
 
I have attached an example of the problem, an example 
string is Descripción - although you will need to have 
XML::Simple installed. 

 
The example takes an input string and then prints it twice - once with
concatenation and once just displaying the input string. The mangling occurs
when you concatenate an XML string with a CGI string.
 
I'm not sure why this happens, but here is a first attempt at a possible
theory. All XML parsing is done in UTF8, but perl has no idea of the encoding
of incoming CGI streams and assumes them to be iso-8859-1 (latin1) - I read
this somewhere and don't know if it's correct. String operations upgrade
non-UTF8 strings to UTF8, so perl tries to convert the CGI string from
iso-8859-1 to UTF8, thus mangling it as it's already UTF8.
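
The effect described in this theory can be reproduced on a newer Perl. The
following is a minimal sketch, assuming Perl 5.8 or later with the Encode
module, so it does not run on the 5.6.1 installation discussed in this thread.
The variable names are illustrative, and the byte string stands in for what
$q->param('text') would return.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for "Descripción", as a CGI parameter would arrive.
# "ó" (U+00F3) is the two bytes 0xC3 0xB3 in UTF-8.
my $cgi_bytes = "Descripci\xC3\xB3n";

# A character string, standing in for text returned by an XML parser.
my $xml_text = decode('UTF-8', "Descripci\xC3\xB3n");

# Concatenation upgrades $cgi_bytes as if every byte were a Latin-1
# character, so 0xC3 and 0xB3 end up as two separate characters.
my $mangled = $xml_text . ' / ' . $cgi_bytes;

# Decoding the CGI bytes first keeps "ó" as a single character.
my $clean = $xml_text . ' / ' . decode('UTF-8', $cgi_bytes);

printf "mangled: %d characters\n", length $mangled;
printf "clean:   %d characters\n", length $clean;

# Encode explicitly on output so the terminal or browser receives UTF-8.
print encode('UTF-8', $clean), "\n";
```

The mangled string is one character longer than the clean one because the
two-byte "ó" was treated as two Latin-1 characters.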
 
Can anyone point me in the right direction, explain where I'm going wrong and
maybe provide some useful links - there seems to be very little information on
building internationalised web pages with UTF8 and perl5.6.1.
 
Thanks
 
Mark
 


testUTF8.pl
Description: Binary data