Re: CGI and UTF

2003-01-05 Thread Benjamin Franz
On Sun, 5 Jan 2003, Jarkko Hietaniemi wrote:
 I repeat: all your filehandles are still 'binary' unless you either
 explicitly (binmode)

Fine.

 or implicitly (locale) command them not to be.

Not fine without a warning. This is 'action at a distance' (this is the
same reason un'local'ized usage of the 'special' variables is nearly
always a Bad Idea (tm)). It causes breakage that can be hard to find the
cause of. Perl needs a mandatory warning if the locale changes my
filehandles to text mode and I haven't made some kind of _explicit_
declaration that I want that behavior to happen.

The change is of a bad 'type': An incompatible change in Perl semantics
without so much as a warning being issued by either the compiler or the
runtime - except to make the code fall over dead many lines away from the
actual breakage. If the string is invalid UTF8, why didn't Perl complain
_when I read it_ instead of dozens of lines away when I tried to use that
string for something else? That is _broken_.
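
For illustration only, a script can get the complaint at the point of
reading by taking the bytes in raw and decoding them strictly itself.
This is a sketch, not what Perl does on its own: it assumes the core
Encode module, and `read_utf8_strict` is a hypothetical helper name that
does not come from the script under discussion.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: read a whole file as bytes, then decode it strictly
# so that malformed UTF-8 is reported here rather than many lines later.
sub read_utf8_strict {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };   # slurp the raw bytes
    close($fh);
    $bytes = '' unless defined $bytes;
    # FB_CROAK makes decode() die on the first malformed sequence.
    return decode('UTF-8', $bytes, Encode::FB_CROAK);
}
```

Wrapping the call in eval lets the caller report the bad file by name
instead of failing later in unrelated code.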

 If you try to push Unicode (data marked as UTF-8, such as characters
 beyond 255) on such a filehandle, you'll get 'Wide character' warning.

But it _reads_ binary data through a UTF8 layer silently. No warnings. Try
the code I posted on an actual jpg file with a UTF-8 locale set in the
environment. The first complaint is when the code falls over dead in the
'jpegsize' sub - many lines of code away from the fh read.
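
A defensive pattern for this case, sketched under the assumption that
the script only wants bytes: declare the handle binary explicitly, so
that a UTF-8 locale in the environment cannot change how the file is
read. `slurp_binary` is a hypothetical name used for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: read a file as raw bytes, whatever the locale says.
sub slurp_binary {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);    # explicit: treat the handle as bytes, no decoding
    local $/;        # slurp mode
    my $data = <$fh>;
    close($fh);
    return defined $data ? $data : '';
}
```

The returned string can then be handed to byte-oriented code such as a
JPEG size parser without any character decoding having happened.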

-- 
Jerry

If the code and the comments disagree, then both are probably wrong.
-- Norm Schryer, Bell Labs 





Re: CGI and UTF

2003-01-05 Thread Jarkko Hietaniemi
  or implicitly (locale) command them not to be.
 
 Not fine without a warning. This is 'action at a distance' (this is the
 same reason un'local'ized usage of the 'special' variables is nearly

On that we can agree, kind of: I find the *whole* locale system to be
a Bad Idea (tm) (not just any UTF-8 parts of it).  Locales are *all*
about action-at-a-distance.

 always a Bad Idea (tm)). It causes breakage that can be hard to find the
 cause of. Perl needs a mandatory warning if the locale changes my
 filehandles to text mode and I haven't made some kind of _explicit_
 declaration that I want that behavior to happen.

 The change is of a bad 'type': An incompatible change in Perl semantics
 without so much as a warning being issued by either the compiler or the
 runtime - except to make the code fall over dead many lines away from the
 actual breakage. If the string is invalid UTF8, why didn't Perl complain
 _when I read it_ instead of dozens of lines away when I tried to use that
 string for something else? That is _broken_.

See below.

  If you try to push Unicode (data marked as UTF-8, such as characters
  beyond 255) on such a filehandle, you'll get 'Wide character' warning.
 
 But it _reads_ binary data through a UTF8 layer silently. No warnings. Try
 the code I posted on an actual jpg file with a UTF-8 locale set in the
 environment. The first complaint is when the code falls over dead in the
 'jpegsize' sub - many lines of code away from the fh read.

I think I am on the same page with you now.  I have to think more about
this, though, so that checking at the point of reading, for example,
does not become unreasonably slow.  And I'll be rather Internet
connectivity challenged in the coming weeks, so please be patient.

-- 
Jarkko Hietaniemi http://www.iki.fi/jhi/ There is this special
biologist word we use for 'stable'.  It is 'dead'. -- Jack Cohen



Re: CGI and UTF

2003-01-05 Thread Earl Hood
On January 5, 2003 at 05:42, Jarkko Hietaniemi wrote:

  This is Bad Juju (tm). It _guarantees_ script breakage (potentially
  silently!) for Unix people doing _anything_ but ASCII text manipulation.  
 
 I repeat: I don't think you can do more than ASCII by hanging on tooth
 and nail to the 'everything is bytes' credo.

This statement assumes someone is working with characters.  It is
common for many to use regexes and other operators (substr, index,
et al.) on binary data directly.
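
As an illustration of that kind of byte-oriented code, here is a sketch
that uses index, substr and unpack on raw data. The `find_sof0` name and
the sample bytes are made up for the example; the only outside fact it
relies on is that a JPEG baseline frame header starts with the marker
bytes FF C0, followed by a 2-byte length, a 1-byte precision, and the
2-byte height and width.

```perl
use strict;
use warnings;

# Hypothetical helper: scan a byte string for a JPEG baseline frame header
# (marker FF C0) and return (width, height), or an empty list if none is
# found. A naive scan like this can be fooled by the same two bytes
# appearing inside another segment.
sub find_sof0 {
    my ($bytes) = @_;
    my $pos = index($bytes, "\xFF\xC0");
    return if $pos < 0;
    # Need marker (2) + length (2) + precision (1) + height (2) + width (2).
    return if length($bytes) < $pos + 9;
    my ($height, $width) = unpack('n n', substr($bytes, $pos + 5, 4));
    return ($width, $height);
}

# Made-up sample: start-of-image marker, then a frame header for 640x480.
my $sample = "\xFF\xD8" . "\xFF\xC0" . pack('n C n n', 17, 8, 480, 640);
my ($w, $h) = find_sof0($sample);
print "width=$w height=$h\n";
```

Code like this only works if each element of the string is one byte of
the file, which is exactly what an implicit UTF-8 layer takes away.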

 I repeat: all your filehandles are still 'binary' unless you either
 explicitly (binmode) or implicitly (locale) command them not to be.
 If you try to push Unicode (data marked as UTF-8, such as characters
 beyond 255) on such a filehandle, you'll get 'Wide character' warning.
 If you do not like the locale implicit switching, reset your locale
 to something without /utf-?8/i in it before running the script.

I think this reasoning is flawed since it assumes the author of
the script has complete control over the environment.  For example,
the script can be used by others in environments the author does not
control.  Therefore, older programs can quietly break, or behave
differently.
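
A script that has to run in environments its author does not control
could at least detect the situation. The sketch below applies the
/utf-?8/i test quoted above to the locale environment variables. It
assumes the usual precedence of LC_ALL, then LC_CTYPE, then LANG, and
`locale_looks_utf8` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether the effective locale name looks like
# a UTF-8 locale. Takes the environment as a list of key/value pairs so it
# can be checked against values other than the real %ENV.
sub locale_looks_utf8 {
    my (%env) = @_;
    for my $name (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env{$name};
        next unless defined $value && length $value;
        # The first variable that is set decides the answer.
        return $value =~ /utf-?8/i ? 1 : 0;
    }
    return 0;
}

warn "UTF-8 locale detected; use binmode on binary filehandles\n"
    if locale_looks_utf8(%ENV);
```

This does not fix anything by itself, but it turns a silent change of
behavior into a visible one.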

According to the perllocale manpage, locale should have no effect
unless the 'use locale' pragma is specified.  It appears from
Benjamin's script that he is not using the pragma, so even if the
environment has a utf-8 locale, the script should be unaffected.

--ewh



Re: CGI and UTF

2003-01-05 Thread Jarkko Hietaniemi
On Sun, Jan 05, 2003 at 12:16:38PM -0600, Earl Hood wrote:
   This is Bad Juju (tm). It _guarantees_ script breakage (potentially
   silently!) for Unix people doing _anything_ but ASCII text manipulation.  
  
  I repeat: I don't think you can do more than ASCII by hanging on tooth
  and nail to the 'everything is bytes' credo.
 
 This statement assumes someone is working with characters.  It is
 common for many to use regexes and other operators (substr, index,
 et al.) on binary data directly.

True.  I think what I was referring to (somewhere earlier in my
message) is that you won't get Unicode data mixed into your data
unless you ask for it, explicitly or implicitly.

  I repeat: all your filehandles are still 'binary' unless you either
  explicitly (binmode) or implicitly (locale) command them not to be.
  If you try to push Unicode (data marked as UTF-8, such as characters
  beyond 255) on such a filehandle, you'll get 'Wide character' warning.
  If you do not like the locale implicit switching, reset your locale
  to something without /utf-?8/i in it before running the script.
 
 I think this reasoning is flawed since it assumes the author of
 the script has complete control over the environment.  For example,
 the script can be used by others in environments the author does not
 control.  Therefore, older programs can quietly break, or behave
 differently.

 According to the perllocale manpage, locale should have no effect
 unless the 'use locale' pragma is specified.  It appears from
 Benjamin's script that he is not using the pragma, so even if the
 environment has a utf-8 locale, the script should be unaffected.

True, too.  The enabling of UTF-8ness based on locale is an
exception to how things were done before.  But I'm delegating
responsibility for that decision to Larry Wall :-)
I'm trying to get an opinion about this from him, and I just logged
a problem ticket about this issue. 

 --ewh

-- 
Jarkko Hietaniemi http://www.iki.fi/jhi/ There is this special
biologist word we use for 'stable'.  It is 'dead'. -- Jack Cohen