Re: UTF-8 + HTML::Template + CGI::Fast
On 4 Dec 2009, at 17:33, Mark Morgan wrote:
> On Fri, Dec 4, 2009 at 5:15 PM, Peter Corlett wrote:
> [...]
>> Getting something other than Latin-1, Windows-1252 or UTF-8 posted to your
>> web forms is vanishingly unlikely.
>
> The above assertion assumes that the user's default locale is the same as yours...

... which is why you use <form accept-charset="utf-8"> to avoid that particular nasal demon, and then you have much better confidence in what character encoding(s) you're going to see.
Re: UTF-8 + HTML::Template + CGI::Fast
On Fri, Dec 4, 2009 at 5:15 PM, Peter Corlett wrote:
> As far as I could tell from the last time I had this problem, if you omitted
> the accept-charset attribute from the <form> tag, the browser would use its
> default character set. Which was UTF-8 in Firefox and Windows-1252 on IE.
> Setting <form accept-charset="utf-8"> made IE play nicely.
[...]
> Getting something other than Latin-1, Windows-1252 or UTF-8 posted to your
> web forms is vanishingly unlikely.

The above assertion assumes that the user's default locale is the same as yours...

Mark.
Re: UTF-8 + HTML::Template + CGI::Fast
On 4 Dec 2009, at 15:19, Mark Fowler wrote:
[...]
> Let's assume you're sane and you've told your webserver to serve utf-8 (and
> you've got a utf-8 header in the Content-Type) for the page the form is
> created from. Most browsers will return you utf-8 in this situation. Some
> will not (they are broken.)

As far as I could tell from the last time I had this problem, if you omitted the accept-charset attribute from the <form> tag, the browser would use its default character set. Which was UTF-8 in Firefox and Windows-1252 on IE. Setting <form accept-charset="utf-8"> made IE play nicely.

If you've specifically asked for UTF-8 text, and the byte stream you receive is not a valid UTF-8 encoding, you can safely assume that it's actually Windows-1252 instead. Windows-1252 is a superset of Latin-1, so that assumption still holds true even if the client has sent Latin-1.

Getting something other than Latin-1, Windows-1252 or UTF-8 posted to your web forms is vanishingly unlikely.
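The fallback described above (try UTF-8 first, and treat anything that fails to decode as Windows-1252) could be sketched roughly as follows. This is only an illustration: the helper name decode_form_value is made up for the example, and it assumes the value arrives as raw octets. LEAVE_SRC is passed so that the octets are still intact for the second attempt.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn raw form octets into a Perl character string.
sub decode_form_value {
    my ($octets) = @_;
    return unless defined $octets;

    # Strict UTF-8 first; croak on malformed input, but leave $octets untouched.
    my $text = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text if defined $text;

    # Not valid UTF-8: assume Windows-1252, which also covers Latin-1 text.
    return decode('cp1252', $octets);
}

my $from_utf8   = decode_form_value("caf\xC3\xA9");  # UTF-8 octets
my $from_cp1252 = decode_form_value("caf\xE9");      # Latin-1 / cp1252 octets
print "same text\n" if $from_utf8 eq $from_cp1252;
```

Both calls should come back as the same four-character string, which is the point of the trick: the caller no longer cares which of the two encodings the browser picked.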
Re: UTF-8 + HTML::Template + CGI::Fast
On Fri, Dec 4, 2009 at 11:49 AM, Philip Potter wrote:
> I don't see how you're supposed to guess what encoding the user agent
> used if it won't tell you. Does anyone else have any ideas?

Let's assume you're sane and you've told your webserver to serve utf-8 (and you've got a utf-8 header in the Content-Type) for the page the form is created from. Most browsers will return you utf-8 in this situation. Some will not (they are broken.) Your choices are:

a) Treat this as latin-1 (very wrong, but probably what the user meant)

    eval { $string = Encode::decode("utf8", $string, Encode::FB_CROAK); }

(actually, this'll effectively have it in your default encoding, which is _probably_ latin-1)

b) Display an error message (most correct)

    eval { $string = Encode::decode("utf8", $string, Encode::FB_CROAK); 1 } or handle_errors();

c) Put a \x{fffd} where the character you don't understand is (slightly less correct, but might just work)

    $string = Encode::decode("utf8", $string, Encode::FB_DEFAULT);

See "perldoc Encode" and in particular the section on "Handling Malformed Data".

Note in the above examples I've used "utf8" not "UTF-8" which is probably what you want (it's more lax)

I'm not even going to get into a conversation about normalised forms here.

Mark.
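To make the "utf8" versus "UTF-8" remark and option (c) a bit more concrete, here is a small sketch. The sample string is invented for the example; the encoding names printed are the ones the Encode documentation gives for the two spellings.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# The two spellings resolve to different encoding objects.
my $strict_name = find_encoding('UTF-8')->name;   # utf-8-strict
my $lax_name    = find_encoding('utf8')->name;    # utf8
print "$strict_name vs $lax_name\n";

# Option (c): a lone \xE9 is not valid UTF-8, so with FB_DEFAULT it is
# replaced by U+FFFD and decoding carries on with the rest of the string.
my $octets = "caf\xE9 au lait";
my $text   = decode('utf8', $octets, Encode::FB_DEFAULT());
printf "U+%04X\n", ord(substr($text, 3, 1));      # U+FFFD
```

The replacement character at least keeps the rest of the field readable, which is why (c) "might just work" for display purposes even though the original character is lost.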
Re: UTF-8 + HTML::Template + CGI::Fast
James Laver wrote:
> This is one of the fun things about character sets. There are three ways to
> determine character set:
>
> 3. Checking if it looks like a given character set (very lossy). Eg. the
> is_utf8() function only checks if it *could* be utf-8. If you pass it ascii
> text, it'll pass. Subsets of some other character sets will also pass.
> There are no guarantees, just percentage chances. Not exactly the world's
> best fallback.

When I asked a related question on this list and then read the docs with more educated eyes, I got the impression that the is_utf8 function merely tells you that the string is in internal utf8 format - which has nothing to do with what format the string came in as. It is very confusing.

Because I have mixed input coming into my app, and I can't reliably (enough for me) tell what it is (could be any of the iso variants or utf8), I don't bother with any of it and have removed all attempts to decode it. I just treat it all as strings. As it is a message switch it becomes SEP or a UAP.

Dirk
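A quick way to see the point about is_utf8 reporting only the internal flag: the sketch below feeds it octets that are perfectly valid UTF-8 and gets "no", then decodes them and gets "yes". The sample string is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Valid UTF-8 octets for "café", but still just a byte string to Perl.
my $octets = "caf\xC3\xA9";
my $flag_before = utf8::is_utf8($octets) ? 1 : 0;   # 0: flag is off
my $len_before  = length $octets;                    # 5 (bytes)

# After decoding, Perl holds a character string and the flag is on.
my $text = decode('UTF-8', $octets);
my $flag_after = utf8::is_utf8($text) ? 1 : 0;       # 1: flag is on
my $len_after  = length $text;                       # 4 (characters)

print "before: flag=$flag_before len=$len_before\n";
print "after:  flag=$flag_after len=$len_after\n";
```

So the flag says nothing about what the browser sent; it only records whether something has already decoded the data.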
Re: UTF-8 + HTML::Template + CGI::Fast
2009/12/4 Nicholas Clark : > On Fri, Dec 04, 2009 at 11:49:09AM +, Philip Potter wrote: > >> I don't know if this problem is in general solvable, because user >> agents are not required to declare what encoding they are using to >> submit form contents. Even when the form uses the >> accept-charset="utf-8" attribute to restrict the user agent to only >> one charset, firefox doesn't append charset=utf-8 to the Content-type: >> HTTP header. >> >> I don't see how you're supposed to guess what encoding the user agent >> used if it won't tell you. Does anyone else have any ideas? > > I've not used it, but see http://www.joshisanerd.com/set/ > and Encode::HEBCI. > > It's a very crafty idea of using HTML entities and hidden form fields to start > to deduce which particular crack the browser is smoking. Great idea! The demo app recognised all encodings I threw at it except macFarsi... I also found this document: http://niwo.mnsys.org/saved/~flavell/charset/form-i18n.html which, though maybe a little dated, covers the issues involved well. Phil
Re: UTF-8 + HTML::Template + CGI::Fast
On Fri, Dec 4, 2009 at 11:49 AM, Philip Potter wrote:
> I don't know if this problem is in general solvable, because user
> agents are not required to declare what encoding they are using to
> submit form contents. Even when the form uses the
> accept-charset="utf-8" attribute to restrict the user agent to only
> one charset, firefox doesn't append charset=utf-8 to the Content-type:
> HTTP header.
>
> I don't see how you're supposed to guess what encoding the user agent
> used if it won't tell you. Does anyone else have any ideas?
>
> Phil
>

This is one of the fun things about character sets. There are three ways to determine character set:

1. Separate data that tells you what the character set is (that would be the http headers, and many browsers do set them (and most servers do, making life easier on browser developers))

2. Character set data embedded in the data, with data prior to that being in a specified required character set (that would be html, specifying that the charset is utf-8 in a meta http-equiv tag (alas, you're not a browser, that isn't going to help))

3. Checking if it looks like a given character set (very lossy). Eg. the is_utf8() function only checks if it *could* be utf-8. If you pass it ascii text, it'll pass. Subsets of some other character sets will also pass. There are no guarantees, just percentage chances. Not exactly the world's best fallback.

Of course if you get a choice between a potentially lying piece of software (software, it's hateful) and percentage chances, your chances of it working right most of the time are of course nonexistent. Better to give up.

--James
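For method 1 in the list above, checking whether the request actually declared a charset might look something like this. It is a rough sketch only: charset_from_content_type is a name invented here, the regex is deliberately simple rather than a full header parser, and as noted in the thread many browsers send no charset parameter at all, in which case nothing is returned.

```perl
use strict;
use warnings;

# Hypothetical helper: pull a charset parameter out of a Content-Type value.
sub charset_from_content_type {
    my ($header) = @_;
    return unless defined $header;
    # Match e.g. "...; charset=UTF-8" or "...; charset="utf-8""
    return lc $1 if $header =~ /;\s*charset\s*=\s*"?([^\s;"]+)"?/i;
    return;    # no charset declared
}

my $declared = charset_from_content_type(
    'application/x-www-form-urlencoded; charset=UTF-8');
my $missing  = charset_from_content_type(
    'application/x-www-form-urlencoded');

print defined $declared ? "declared: $declared\n" : "nothing declared\n";
print defined $missing  ? "declared: $missing\n"  : "nothing declared\n";
```

When the helper returns nothing you are back to methods 2 and 3, or to one of the fallbacks discussed elsewhere in the thread.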
Re: UTF-8 + HTML::Template + CGI::Fast
On Fri, Dec 04, 2009 at 11:49:09AM +, Philip Potter wrote: > I don't know if this problem is in general solvable, because user > agents are not required to declare what encoding they are using to > submit form contents. Even when the form uses the > accept-charset="utf-8" attribute to restrict the user agent to only > one charset, firefox doesn't append charset=utf-8 to the Content-type: > HTTP header. > > I don't see how you're supposed to guess what encoding the user agent > used if it won't tell you. Does anyone else have any ideas? I've not used it, but see http://www.joshisanerd.com/set/ and Encode::HEBCI. It's a very crafty idea of using HTML entities and hidden form fields to start to deduce which particular crack the browser is smoking. Nicholas Clark
Re: UTF-8 + HTML::Template + CGI::Fast
> Every string needs to be laundered AFAIK.

Thanks, if there is no magical Apache config then I'd best get started!

I'm not using "use utf8", and I'm removing the utf8 fh I was passing to H::T, as neither should make a difference here and this project doesn't have a comprehensive set of unit tests to prove it.

Thanks

Andy
Re: UTF-8 + HTML::Template + CGI::Fast
2009/12/4 Andrew McGregor : >> If a user enters UTF-8 chars then the string displayed is corrupt, or >> looks like it has been double encoded. > > Adding this: > > 1293 $rtbx_senderDetails = decode("utf8", $rtbx_senderDetails); > > Just before passing the param to HTML::Template seems to work. > > So I'm switching the UTF flag on which means it is handled correctly > by HTML::Template? > > Also, is there a more global way to resolve this as there are a few fields? > I don't know if this problem is in general solvable, because user agents are not required to declare what encoding they are using to submit form contents. Even when the form uses the accept-charset="utf-8" attribute to restrict the user agent to only one charset, firefox doesn't append charset=utf-8 to the Content-type: HTTP header. I don't see how you're supposed to guess what encoding the user agent used if it won't tell you. Does anyone else have any ideas? Phil
Re: UTF-8 + HTML::Template + CGI::Fast
I don't see why "use utf8;" would be required, or helpful here.

AFAIK (and from the perlunicode manpage), "use utf8;" tells perl that the Perl source code itself contains UTF-8 encoded characters. Which is probably not the case here.

If Andy happens to have latin1 encoded Perl script files (which is quite common), I bet "use utf8;" is not going to help, and is probably going to confuse him when he tries to, say, match a strange character with a regexp.

"As a compatibility measure, the "use utf8" pragma must be explicitly included to enable recognition of UTF-8 in the Perl scripts themselves (in string or regular expression literals, or in identifier names) on ASCII-based machines".

dams

2009/12/4 Dave Hodgkinson
>
> On 4 Dec 2009, at 10:25, Andrew McGregor wrote:
>
> > Hello,
> >
> > I'm hoping someone can help me with a UTF-8 problem.
> >
> > If a user enters latin-1 into a (CGI::Fast) form and submits, and the
> > string is displayed (HTML::Template) as expected.
> >
> > If a user enters UTF-8 chars then the string displayed is corrupt, or
> > looks like it has been double encoded.
> >
> > If I remove CGI::Fast then there is no problem.
> >
> > If I die $string before hitting HTML::Template then the UTF-8 string
> > is displayed properly.
> >
> > I've patched HTML::Template to allow UTF-8 files. I've tried passing
> > in a fh opened as UTF-8 (segfault).
> >
> > I read somewhere that Perl doesn't cope well concatenating latin-1 and
> > UTF-8 strings - could this be happening in HTML::Template?
> >
> > Or should I be posting somewhere else, HTML::Template list or CGI::Fast list?
> >
> > This is perl, v5.8.8 built for i386-linux-thread-multi
>
>
> Consistency is the key. If you know Latin-1 is coming in from the
> form, use Encode to convert it to UTF-8 and thus tell perl that's
> what it is.
>
> Oh, and "use utf8;" for good measure.
>
> Apparently, a string isn't just a stream of octets :)
>
> At least you don't have MySQL in the loop.
>
> --
> Dave Hodgkinson                         MSN: daveh...@hotmail.com
> Site: http://www.davehodgkinson.com     UK: +44 7768 490620
> Blog: http://www.davehodgkinson.com/blog
> Photos: http://www.flickr.com/photos/davehodg
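The effect of "use utf8;" on source code (as opposed to form input) can be seen with a literal. This sketch assumes the script file itself is saved as UTF-8; the variable names are just for illustration.

```perl
use strict;
use warnings;

# No "use utf8" in scope: the two UTF-8 bytes of the literal are seen
# as two separate characters.
my $len_without = length("é");

# "use utf8" is lexically scoped; inside this block the same literal is
# read as one character.
my $len_with = do { use utf8; length("é") };

print "without: $len_without, with: $len_with\n";
```

Nothing here touches data read from a form, a file or a socket, which is why the pragma makes no difference to the problem being discussed.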
Re: UTF-8 + HTML::Template + CGI::Fast
On Fri, Dec 4, 2009 at 11:18 AM, Dave Hodgkinson wrote: > > Every string needs to be laundered AFAIK. > > Maybe 5.12 will make it all better? If the discussions on p5p are anything to go by, it will fix some things and make others more painful (by virtue of being slightly incompatible with the old ways of doing things). Most notably I think are changes to what \d, \w and \s mean in regards to unicode things. Character sets are hard, get me a beer. --James
Re: UTF-8 + HTML::Template + CGI::Fast
On 4 Dec 2009, at 10:54, Andrew McGregor wrote: >> If a user enters UTF-8 chars then the string displayed is corrupt, or >> looks like it has been double encoded. > > Adding this: > > 1293 $rtbx_senderDetails = decode("utf8", $rtbx_senderDetails); > > Just before passing the param to HTML::Template seems to work. > > So I'm switching the UTF flag on which means it is handled correctly > by HTML::Template? > > Also, is there a more global way to resolve this as there are a few fields? > Every string needs to be laundered AFAIK. Maybe 5.12 will make it all better? -- Dave HodgkinsonMSN: daveh...@hotmail.com Site: http://www.davehodgkinson.com UK: +44 7768 490620 Blog: http://www.davehodgkinson.com/blog Photos: http://www.flickr.com/photos/davehodg
Re: UTF-8 + HTML::Template + CGI::Fast
On 4 Dec 2009, at 10:25, Andrew McGregor wrote: > Hello, > > I'm hoping someone can help me with a UTF-8 problem. > > If a user enters latin-1 into a (CGI::Fast) form and submits, and the > string is displayed (HTML::Template) as expected. > > If a user enters UTF-8 chars then the string displayed is corrupt, or > looks like it has been double encoded. > > If I remove CGI::Fast then there is no problem. > > If I die $string before hitting HTML::Template then the UTF-8 string > is displayed properly. > > I've patched HTML::Template to allow UTF-8 files. I've tried passing > in a fh opened as UTF-8 (segfault). > > I read somewhere that Perl doesn't cope well concatenating latin-1 and > UTF-8 strings - could this be happening in HTML::Template? > > Or should I be posting somewhere else, HTML::Template list or CGI::Fast list? > > This is perl, v5.8.8 built for i386-linux-thread-multi Consistency is the key. If you know Latin-1 is coming in from the form, use Encode to convert it to UTF-8 and thus tell perl that's what it is. Oh, and "use utf8;" for good measure. Apparently, a string isn't just a stream of octets :) At least you don't have MySQL in the loop. -- Dave HodgkinsonMSN: daveh...@hotmail.com Site: http://www.davehodgkinson.com UK: +44 7768 490620 Blog: http://www.davehodgkinson.com/blog Photos: http://www.flickr.com/photos/davehodg
Re: UTF-8 + HTML::Template + CGI::Fast
> If a user enters UTF-8 chars then the string displayed is corrupt, or > looks like it has been double encoded. Adding this: 1293 $rtbx_senderDetails = decode("utf8", $rtbx_senderDetails); Just before passing the param to HTML::Template seems to work. So I'm switching the UTF flag on which means it is handled correctly by HTML::Template? Also, is there a more global way to resolve this as there are a few fields? Thanks Andy
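On the "more global way" question: one possible approach, sketched here under assumptions, is to decode every field in one place right after reading the form, instead of sprinkling decode() calls through the code. The helper name decode_params and the sample field names are invented for the example; it takes a plain hash of raw values so that it does not depend on any particular CGI module's interface.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode every raw form value from UTF-8 octets
# into a Perl character string, returning a new hash.
sub decode_params {
    my (%raw) = @_;
    my %text;
    for my $name (keys %raw) {
        # Same call as the one-off fix, applied to each field.
        # Undefined values are passed through untouched.
        $text{$name} = defined $raw{$name}
            ? decode('utf8', $raw{$name})
            : undef;
    }
    return %text;
}

my %fields = decode_params(
    rtbx_senderDetails => "Andr\xC3\xA9",   # UTF-8 octets as submitted
    subject            => "hello",
);
print length($fields{rtbx_senderDetails}), " characters\n";
```

The decoded hash can then be handed to the template in one go, so every field gets the same treatment and nothing is decoded twice.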