Re: CGI and UTF
On Sat, 18 Jan 2003 18:56:57 +0200, Jarkko Hietaniemi wrote: Now Perl-5.8.1-to-be has been changed to (1) not to do any implicit UTF-8-ification of any filehandles unless explicitly asked to do so (either by the -C command line switch or by setting the env var PERL_UTF8_LOCALE to a true value, the switch wins if both are present) Is there a way to specify a negative -C to turn off UTF-8 in the face of PERL_UTF8_LOCALE being set? -- Peter Haworth [EMAIL PROTECTED] That's about all there is to it. Now you just need to go off and buy a book about object-oriented design methodology, and bang your forehead with it for the next six months or so.-- Perl 5 perlobj man page
Re: CGI and UTF
-C:1 / -C:0 it is. (The :part being optional.) -- Jarkko Hietaniemi [EMAIL PROTECTED] http://www.iki.fi/jhi/ There is this special biologist word we use for 'stable'. It is 'dead'. -- Jack Cohen
Re: CGI and UTF
On Sat, 18 Jan 2003, Jarkko Hietaniemi wrote: Now Perl-5.8.1-to-be has been changed to (1) not to do any implicit UTF-8-ification of any filehandles unless explicitly asked to do so (either by the -C command line switch or by setting the env var PERL_UTF8_LOCALE to a true value, the switch wins if both are present) (and if the locale settings do not indicate a UTF-8 locale, both are silent no-ops) (2) illegal UTF-8 causing a -w(arning) immediately when read in e.g. by (an immediate croak is a possibility, but a warning is how it now works, and a croak would be, err, even more non-traditional for UNIX...) Note that the above do not change the fact that if a *programmer* wants their code to be UTF-8 aware, they need to think about the evil binmode(). Wonderful. :) This will definitely simplify the day I have to migrate our existing codebase to 5.8. Thank you. -- Benjamin Franz If the code and the comments disagree, then both are probably wrong. -- Norm Schryer, Bell Labs
Re: CGI and UTF
Now Perl-5.8.1-to-be has been changed to (1) not to do any implicit UTF-8-ification of any filehandles unless explicitly asked to do so (either by the -C command line switch or by setting the env var PERL_UTF8_LOCALE to a true value, the switch wins if both are present) (and if the locale settings do not indicate a UTF-8 locale, both are silent no-ops) (2) illegal UTF-8 causing a -w(arning) immediately when read in e.g. by (an immediate croak is a possibility, but a warning is how it now works, and a croak would be, err, even more non-traditional for UNIX...) Note that the above do not change the fact that if a *programmer* wants their code to be UTF-8 aware, they need to think about the evil binmode(). -- Jarkko Hietaniemi [EMAIL PROTECTED] http://www.iki.fi/jhi/ There is this special biologist word we use for 'stable'. It is 'dead'. -- Jack Cohen
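For readers who want to see what "thinking about binmode()" looks like in practice, here is a minimal sketch. The helper names `slurp_bytes` and `slurp_utf8` are hypothetical and not from the thread. It assumes Perl 5.8 or later, where `binmode` accepts a layer argument. The point is only that each filehandle states its own intent instead of inheriting it from the locale or a command line switch.

```perl
use strict;
use warnings;

# Hypothetical helper: read a file as raw bytes, whatever the
# locale or command line switches would otherwise suggest.
sub slurp_bytes {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);              # no layer argument: bytes in, bytes out
    local $/;                  # slurp mode
    my $data = <$fh>;
    close($fh);
    return $data;
}

# Hypothetical helper: read a file as UTF-8 text, asking for the
# decoding layer explicitly.
sub slurp_utf8 {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh, ':utf8');     # explicit request for UTF-8 decoding
    local $/;
    my $text = <$fh>;
    close($fh);
    return $text;
}
```

With helpers like these, a file holding the five bytes `c a f C3 A9` comes back as a five-byte string from `slurp_bytes` and as a four-character string from `slurp_utf8`.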
Re: CGI and UTF
On Sun, 5 Jan 2003, Jarkko Hietaniemi wrote: I repeat: all your filehandles are still 'binary' unless you either explicitly (binmode) Fine. or implicitly (locale) command them not to be. Not fine without a warning. This is 'action at a distance' (this is the same reason un'local'ized usage of the 'special' variables is nearly always a Bad Idea (tm)). It causes breakage that can be hard to find the cause of. Perl needs a mandatory warning if the locale changes my filehandles to text mode and I haven't made some kind of _explicit_ declaration that I want that behavior to happen. The change is of a bad 'type': An incompatible change in Perl semantics without so much as a warning being issued by either the compiler or the runtime - except to make the code fall over dead many lines away from the actual breakage. If the string is invalid UTF8, why didn't Perl complain _when I read it_ instead of dozens of lines away when I tried to use that string for something else? That is _broken_. If you try to push Unicode (data marked as UTF-8, such as characters beyond 255) on such a filehandle, you'll get a 'Wide character' warning. But it _reads_ binary data through a UTF8 layer silently. No warnings. Try the code I posted on an actual jpg file with a UTF-8 locale set in the environment. The first complaint is when the code falls over dead in the 'jpegsize' sub - many lines of code away from the fh read. -- Jerry If the code and the comments disagree, then both are probably wrong. -- Norm Schryer, Bell Labs
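The complaint above is that invalid UTF-8 goes unnoticed at the point of reading. One way a script can get the failure at the read site on its own is sketched below: read raw bytes, then decode them strictly and treat a decoding error as a signal. This is an illustration, not what Perl itself does. The helper name `decode_utf8_strict` is hypothetical, and the sketch assumes the `Encode` module that ships with Perl 5.8 and later.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the decoded text, or undef if the
# byte string is not well-formed UTF-8.
sub decode_utf8_strict {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    # FB_CROAK makes decode() die on malformed input; eval turns
    # that into an undef return value.
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return $text;
}
```

Fed the first bytes of a JPEG file (`\xFF\xD8\xFF\xE0`), this returns undef right where the data is read, rather than letting a later routine such as the 'jpegsize' sub mentioned above fall over.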
Re: CGI and UTF
or implicitly (locale) command them not to be. Not fine without a warning. This is 'action at a distance' (this is the same reason un'local'ized usage of the 'special' variables is nearly On that we can agree, kind of-- I find the *whole* locale system to be a Bad Idea (tm) (not just any UTF-8 parts of it). Locales are *all* about action-at-a-distance. always a Bad Idea (tm)). It causes breakage that can be hard to find the cause of. Perl needs a mandatory warning if the locale changes my filehandles to text mode and I haven't made some kind of _explicit_ declaration that I want that behavior to happen. The change is of a bad 'type': An incompatible change in Perl semantics without so much as a warning being issued by either the compiler or the runtime - except to make the code fall over dead many lines away from the actual breakage. If the string is invalid UTF8, why didn't Perl complain _when I read it_ instead of dozens of lines away when I tried to use that string for something else? That is _broken_. See below. If you try to push Unicode (data marked as UTF-8, such as characters beyond 255) on such a filehandle, you'll get a 'Wide character' warning. But it _reads_ binary data through a UTF8 layer silently. No warnings. Try the code I posted on an actual jpg file with a UTF-8 locale set in the environment. The first complaint is when the code falls over dead in the 'jpegsize' sub - many lines of code away from the fh read. I think now I reached your page. I have to think more about this, though, not to make the checking at the point of reading for example unreasonably slow. And I'll be rather Internet connectivity challenged in the coming weeks, so please be patient. -- Jarkko Hietaniemi [EMAIL PROTECTED] http://www.iki.fi/jhi/ There is this special biologist word we use for 'stable'. It is 'dead'. -- Jack Cohen
Re: CGI and UTF
On January 5, 2003 at 05:42, Jarkko Hietaniemi wrote: This is Bad Juju (tm). It _guarantees_ script breakage (potentially silently!) for Unix people doing _anything_ but ASCII text manipulation. I repeat: I don't think you can do more than ASCII by hanging tooth and nail to the 'everything is bytes' credo. This statement assumes someone is working with characters. It is common for many to use regexes and other operators (substr, index, et al.) on binary data directly. I repeat: all your filehandles are still 'binary' unless you either explicitly (binmode) or implicitly (locale) command them not to be. If you try to push Unicode (data marked as UTF-8, such as characters beyond 255) on such a filehandle, you'll get a 'Wide character' warning. If you do not like the locale implicit switching, reset your locale to something not /utf-?8/i in it before running the script. I think this reasoning is flawed since it assumes the author of the script has complete control over the environment. For example, the script can be used by others in environments the author does not control. Therefore, older programs can quietly break, or behave differently. According to the perllocale manpage, locale should have no effect unless the 'use locale' pragma is specified. It appears from Benjamin's script that he is not using the pragma, so even if the environment has a utf-8 locale, the script should be unaffected. --ewh
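As a concrete illustration of string operators used on binary data, here is a small sketch. The helper names `looks_like_jpeg` and `be16_at` are hypothetical and not taken from anyone's script in this thread. Both assume the string holds the file's bytes exactly as stored on disk; if the filehandle silently decoded the data, offsets and comparisons like these can stop matching the file contents.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the byte string starts with the
# JPEG start-of-image marker (bytes FF D8).
sub looks_like_jpeg {
    my ($bytes) = @_;
    return 0 if length($bytes) < 2;
    return substr($bytes, 0, 2) eq "\xFF\xD8" ? 1 : 0;
}

# Hypothetical helper: read a big-endian unsigned 16-bit integer
# at a byte offset, as used for JPEG segment lengths.
sub be16_at {
    my ($bytes, $offset) = @_;
    return unpack('n', substr($bytes, $offset, 2));
}
```

For the six bytes `FF D8 FF E0 00 10`, `looks_like_jpeg` is true and `be16_at($bytes, 4)` gives 16.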
Re: CGI and UTF
On Sun, Jan 05, 2003 at 12:16:38PM -0600, Earl Hood wrote: This is Bad Juju (tm). It _guarantees_ script breakage (potentially silently!) for Unix people doing _anything_ but ASCII text manipulation. I repeat: I don't think you can do more than ASCII by hanging tooth and nail to the 'everything is bytes' credo. This statement assumes someone is working with characters. It is common for many to use regexes and other operators (substr, index, et al.) on binary data directly. True. I think what I was referring to (somewhere earlier in my message) is that you won't get Unicode data mixed into your data unless you ask so, explicitly or implicitly. I repeat: all your filehandles are still 'binary' unless you either explicitly (binmode) or implicitly (locale) command them not to be. If you try to push Unicode (data marked as UTF-8, such as characters beyond 255) on such a filehandle, you'll get a 'Wide character' warning. If you do not like the locale implicit switching, reset your locale to something not /utf-?8/i in it before running the script. I think this reasoning is flawed since it assumes the author of the script has complete control over the environment. For example, the script can be used by others in environments the author does not control. Therefore, older programs can quietly break, or behave differently. According to the perllocale manpage, locale should have no effect unless the 'use locale' pragma is specified. It appears from Benjamin's script that he is not using the pragma, so even if the environment has a utf-8 locale, the script should be unaffected. True, too. The enabling of UTF-8ness based on locale is an exception as to how things were done before. But I'm delegating responsibility about that decision to Larry Wall :-) I'm trying to get an opinion about this from him, and I just logged a problem ticket about this issue. --ewh -- Jarkko Hietaniemi [EMAIL PROTECTED] http://www.iki.fi/jhi/ There is this special biologist word we use for 'stable'. It is 'dead'. -- Jack Cohen
Re: CGI and UTF
On Wed, 20 Nov 2002, Nicholas Clark wrote: [such as house policy on not using .0 versions? time taken to assess and approve releases meaning that approving 5.8.0 is a lot of effort? Something specific they don't like about 5.8.0?] Basically is there something that the perl development community needs to do (or change) that would avoid this in future? I'm not Cisco, or the original guy, but here are the reasons I (and as the Admin/Lead Programmer here my position is basically my company's position on this) won't use Perl 5.8.0 for production servers: 1) x.0 release. I haven't seen an x.0 release of _any_ software I was willing to put the family jewels on without quite a bit of testing first. 2) The very first machine I installed it on immediately had script breakage _specifically_ because of the rather broken (IMHO) behavior re making the use of either 'use bytes' or 'binmode' mandatory if you want to get the same filehandle behavior semantics on *nix boxes that Perl (and virtually all other *nix programs) have had historically. I don't relish the prospect of identifying essentially every use of 'open' in every program we have ever written just to add 'binmode' or 'use bytes' to them to proof them against 5.8.0 originated dain bramage. When I open a file handle and read a file I expect (by default) to get _exactly_ what is in the file. If I want Unicode semantics, I'll explicitly specify them myself thenkyouverramuch. Unicode is great - I am a huge believer in it - but don't go mucking up *nix semantics by making 'text mode' filehandles the default: It _breaks_ things that were running 100% clean under warnings and strict for years. I've distrusted the trend in Perl for the last few years to 'magically' try to muck with charset encodings; 5.8.0 has specifically realized those fears as quite justified. -- Benjamin Franz I should either have been less specific or more correct ... ---Andy Armstrong [EMAIL PROTECTED]
Re: CGI and UTF
On Wed, 20 Nov 2002 15:57:43 -, Mark Proctor [EMAIL PROTECTED] said: I'm having some problems with XML/UTF8 and CGI variables in perl5.6.1 If you have any chance to upgrade to perl-5.8.0, please do it now. The Unicode model of 5.8.0 is much more mature than that of 5.6.* and the number of found bugs is close to zero. Your script looks OK and runs fine under 5.8.0 -- andreas
RE: CGI and UTF
I have checked with the sysadmins at cisco and they said no way :( So I have to get this working. Someone has said that I need to normalise the params from cgi - but I have no idea what that means. Mark -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Andreas J. Koenig Sent: 20 November 2002 17:31 To: Mark Proctor Cc: [EMAIL PROTECTED] Subject: Re: CGI and UTF On Wed, 20 Nov 2002 15:57:43 -, Mark Proctor [EMAIL PROTECTED] said: I'm having some problems with XML/UTF8 and CGI variables in perl5.6.1 If you have any chance to upgrade to perl-5.8.0, please do it now. The Unicode model of 5.8.0 is much more mature than that of 5.6.* and the number of found bugs is close to zero. Your script looks OK and runs fine under 5.8.0 -- andreas
Re: CGI and UTF
I can't quite tell if it's related, but while using AxKit I encountered problems with using . to concatenate strings. I changed the module in question to have a use bytes at the top, and that problem went away. I think it also went away when I use sprintf() to concatenate the strings. might as well give it a shot? ;) --d Mark Proctor wrote: I have checked with the sysadmins at cisco and they said no way :( So I have to get this working. Someone has said that I need to normalise the params from cgi - but I have no idea what that means. Mark -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Andreas J. Koenig Sent: 20 November 2002 17:31 To: Mark Proctor Cc: [EMAIL PROTECTED] Subject: Re: CGI and UTF On Wed, 20 Nov 2002 15:57:43 -, Mark Proctor [EMAIL PROTECTED] said: I'm having some problems with XML/UTF8 and CGI variables in perl5.6.1 If you have any chance to upgrade to perl-5.8.0, please do it now. The Unicode model of 5.8.0 is much more mature than that of 5.6.* and the number of found bugs is close to zero. Your script looks OK and runs fine under 5.8.0
RE: CGI and UTF
At 06:47 PM 11/20/2002 +, Mark Proctor wrote: Unfortunately I have asked the cisco admins if we can have perl5.8 and they said no way. Yeah but you can use various CPAN modules, even if you install them in a local directory, right? Barry
Re: CGI and UTF
On Wed, Nov 20, 2002 at 05:38:20PM -, Mark Proctor wrote: [upgrading from 5.6.1 to 5.8] I have checked with the sysadmins at cisco and they said no way :( I'm not asking this as an attempt to provide arguments to give them back - if they are sure of their position, then it is necessary to work within it. But did they say *why* they are so insistent that 5.8.0 is not feasible? [such as house policy on not using .0 versions? time taken to assess and approve releases meaning that approving 5.8.0 is a lot of effort? Something specific they don't like about 5.8.0?] Basically is there something that the perl development community needs to do (or change) that would avoid this in future? Nicholas Clark -- Befunge better than perl? http://www.perl.org/advocacy/spoofathon/
CGI and UTF
I'm having some problems with XML/UTF8 and CGI variables in perl5.6.1. I have attached an example of the problem (an example string is Descripción), although you will need to have XML::Simple installed. The example takes an input string and then prints it twice: once with concatenation, and once just displaying the inputted string. The mangling occurs when you concatenate an XML string with a CGI string. I'm not sure why this happens, but here is a first attempt at a possible theory. All XML parsing is done in UTF8, but perl has no idea of encodings for incoming CGI streams and assumes them to be iso-8859-1 (latin1); I read this somewhere and don't know if it's correct. String operations upgrade non-UTF8 strings to UTF8, so perl tries to convert the CGI string from iso-8859-1 to UTF8, thus mangling it, as it's already UTF8. Can anyone point me in the right direction, explain where I'm going wrong, and maybe provide some useful links? There seems to be very little information on building internationalised web pages with UTF8 and perl5.6.1. Thanks, Mark testUTF8.pl Description: testUTF8.pl
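The theory in this message can be illustrated with a short sketch. It is not the attached testUTF8.pl, which is not reproduced in the archive. It assumes Perl 5.8 or later with the core `Encode` module, which the original poster did not have on 5.6.1, so it shows the mechanism rather than a drop-in fix. The helper name `decode_param` and the sample values are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn the raw UTF-8 bytes of a form
# parameter into a Perl character string.
sub decode_param {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes);
}

# Character string, standing in for text that came out of an XML
# parser (it contains a code point above 255).
my $xml_text = "\x{263A}";

# Byte string, standing in for an undecoded CGI parameter: the
# accented letter is the two bytes C3 B3.
my $cgi_bytes = "Descripci\xC3\xB3n";

# Naive concatenation: perl upgrades the byte string as if each
# byte were a Latin-1 character, so C3 and B3 become two separate
# characters. This is the mangling.
my $mangled = $xml_text . $cgi_bytes;

# Decoding first gives one character for the accented letter.
my $fixed = $xml_text . decode_param($cgi_bytes);
```

Here `$mangled` is 13 characters long and `$fixed` is 12, the difference being the accented letter counted as two characters instead of one.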
RE: CGI and UTF
Nicholas, Cisco are very much 'if it's not broken, don't fix it' - they are generally slow to use new technologies. We are still using a standard htaccess file with lists of user names for authentication, which causes a huge problem for large htaccess files because of the 8K limit, and I've been struggling for over a year now to get them to move to mod_perl. I will pose your question to the euro sysadmin guy, but when I spoke to him he basically said that they are in the middle of upgrading all the servers and moving everything to the US - and the research/preparation and application testing necessary to move to 5.8 wouldn't fit in with the available resources. Mark -Original Message- From: Nicholas Clark [mailto:[EMAIL PROTECTED]] Sent: 20 November 2002 20:02 To: Mark Proctor Cc: 'Andreas J. Koenig'; [EMAIL PROTECTED] Subject: Re: CGI and UTF On Wed, Nov 20, 2002 at 05:38:20PM -, Mark Proctor wrote: [upgrading from 5.6.1 to 5.8] I have checked with the sysadmins at cisco and they said no way :( I'm not asking this as an attempt to provide arguments to give them back - if they are sure of their position, then it is necessary to work within it. But did they say *why* they are so insistent that 5.8.0 is not feasible? [such as house policy on not using .0 versions? time taken to assess and approve releases meaning that approving 5.8.0 is a lot of effort? Something specific they don't like about 5.8.0?] Basically is there something that the perl development community needs to do (or change) that would avoid this in future? Nicholas Clark -- Befunge better than perl? http://www.perl.org/advocacy/spoofathon/