Re: UTF-8 + HTML::Template + CGI::Fast

2009-12-04 Thread Peter Corlett

On 4 Dec 2009, at 17:33, Mark Morgan wrote:
> On Fri, Dec 4, 2009 at 5:15 PM, Peter Corlett wrote:
> [...]
>> Getting something other than Latin-1, Windows-1252 or UTF-8 posted
>> to your web forms is vanishingly unlikely.
>
> The above assertion assumes that the user's default locale is the same
> as yours...

... which is why you use the accept-charset attribute to avoid that
particular nasal demon, and then you have much better confidence in
what character encoding(s) you're going to see.





Re: UTF-8 + HTML::Template + CGI::Fast

2009-12-04 Thread Mark Morgan
On Fri, Dec 4, 2009 at 5:15 PM, Peter Corlett wrote:
> As far as I could tell from the last time I had this problem, if you omitted
> the accept-charset attribute from the <form> tag, the browser would use its
> default character set. Which was UTF-8 in Firefox and Windows-1252 on IE.
> Setting it explicitly made IE play nicely.
[...]
> Getting something other than Latin-1, Windows-1252 or UTF-8 posted to your
> web forms is vanishingly unlikely.

The above assertion assumes that the user's default locale is the same
as yours...

Mark.


Re: UTF-8 + HTML::Template + CGI::Fast

2009-12-04 Thread Peter Corlett

On 4 Dec 2009, at 15:19, Mark Fowler wrote:
[...]
> Let's assume you're sane and you've told your webserver to serve utf-8
> (and you've got a utf-8 header in the Content-Type) for the page the
> form is created from.  Most browsers will return you utf-8 in this
> situation.  Some will not (they are broken.)

As far as I could tell from the last time I had this problem, if you
omitted the accept-charset attribute from the <form> tag, the browser
would use its default character set. Which was UTF-8 in Firefox and
Windows-1252 on IE. Setting it explicitly made IE play nicely.


If you've specifically asked for UTF-8 text, and the byte stream you  
receive is not a valid UTF-8 encoding, you can safely assume that it's  
actually Windows-1252 instead. Windows-1252 is a superset of Latin-1,  
so that assumption still holds true even if the client has sent Latin-1.


Getting something other than Latin-1, Windows-1252 or UTF-8 posted to  
your web forms is vanishingly unlikely.
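
A minimal sketch of that fallback, assuming the raw form value is still a
byte string. The helper name decode_form_bytes() is made up for
illustration; "cp1252" is the name Encode uses for Windows-1252.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: try strict UTF-8 first, and if the bytes are not
# valid UTF-8, assume they are Windows-1252 (which also covers Latin-1 text).
sub decode_form_bytes {
    my ($bytes) = @_;

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # $bytes intact so it can be reused for the fallback.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $text if defined $text;

    return decode('cp1252', $bytes);
}

my $from_utf8   = decode_form_bytes("caf\xC3\xA9");  # UTF-8 encoded "cafe" with e-acute
my $from_cp1252 = decode_form_bytes("caf\xE9");      # the same word as Latin-1 / cp1252
print "same text\n" if $from_utf8 eq $from_cp1252;
```

The guess can still be wrong for short strings that happen to be valid in
both encodings, so treat it as a heuristic rather than a guarantee.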





Re: UTF-8 + HTML::Template + CGI::Fast

2009-12-04 Thread Mark Fowler
On Fri, Dec 4, 2009 at 11:49 AM, Philip Potter wrote:

> I don't see how you're supposed to guess what encoding the user agent
> used if it won't tell you. Does anyone else have any ideas?

Let's assume you're sane and you've told your webserver to serve utf-8
(and you've got a utf-8 header in the Content-Type) for the page the
form is created from.  Most browsers will return you utf-8 in this
situation.  Some will not (they are broken.)

Your choices are:

a) Treat this as latin-1 (very wrong, but probably what the user meant)

eval { $string = Encode::decode("utf8", $string, Encode::FB_CROAK); };

(actually, if the decode fails this leaves $string untouched, which
effectively has it in your default encoding, which is _probably_ latin-1)

b) Display an error message (most correct)

eval { $string = Encode::decode("utf8", $string, Encode::FB_CROAK); 1 }
  or handle_errors();

c) Put a \x{fffd} where the character you don't understand is
(slightly less correct, but might just work)

$string = Encode::decode("utf8", $string, Encode::FB_DEFAULT);

See "perldoc Encode" and in particular the section on "Handling Malformed Data"

Note in the above examples I've used "utf8" not "UTF-8". The lax "utf8"
is probably what you want; Encode treats "UTF-8" as the strict variant.
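
For what the two names mean to Encode, here is a small sketch. The
decodes_ok() helper and the sample byte string are my own illustration;
the bytes F4 90 80 80 would encode U+110000, one past the top of Unicode.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding FB_CROAK LEAVE_SRC);

# Encode resolves the two spellings to two different encoding objects.
my $lax_name    = find_encoding('utf8')->name;    # 'utf8'
my $strict_name = find_encoding('UTF-8')->name;   # 'utf-8-strict'

# Well-formed as a bit pattern, but beyond the Unicode maximum U+10FFFF.
my $beyond_unicode = "\xF4\x90\x80\x80";

# Hypothetical helper: 1 if $bytes decodes under $encoding without
# croaking, 0 otherwise.
sub decodes_ok {
    my ($encoding, $bytes) = @_;
    no warnings 'utf8';
    return eval { decode($encoding, $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

printf "%-12s accepts the sample: %d\n", $lax_name,    decodes_ok('utf8',  $beyond_unicode);
printf "%-12s accepts the sample: %d\n", $strict_name, decodes_ok('UTF-8', $beyond_unicode);
```

Which one is appropriate depends on how much you trust the sender; the
strict name refuses input that the lax one may let through.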

I'm not even going to get into a conversation about normalised forms here.

Mark.


Re: UTF-8 + HTML::Template + CGI::Fast

2009-12-04 Thread Dirk Koopman

James Laver wrote:
> This is one of the fun things about character sets. There are three
> ways to determine character set:
> [...]
> 3. Checking if it looks like a given character set (very lossy). Eg.
> the is_utf8() function only checks if it *could* be utf-8. If you pass
> it ascii text, it'll pass. Subsets of some other character sets will
> also pass. There are no guarantees, just percentage chances. Not
> exactly the world's best fallback.



When I asked a related question on this list and then read the docs with 
more educated eyes, I got the impression that the is_utf8 function 
merely tells you that the string is in internal utf8 format - which has 
nothing to do with what format the string came in as. It is very confusing.
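
A short sketch of that distinction, using a sample string of my own: the
flag reported by utf8::is_utf8() changes when the bytes are decoded, not
according to what the bytes happened to contain on the wire.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "caf\xC3\xA9";            # octets as they might arrive from a form
my $text  = decode('UTF-8', $bytes);  # character string after decoding

# The raw octets are valid UTF-8, yet the internal flag is off until decoded.
my $flag_before = utf8::is_utf8($bytes) ? 1 : 0;
my $flag_after  = utf8::is_utf8($text)  ? 1 : 0;

print "before decode: flag=$flag_before length=", length($bytes), "\n";  # length 5
print "after decode:  flag=$flag_after length=",  length($text),  "\n";  # length 4
```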


Because I have mixed input coming into my app, and I can't reliably 
(enough for me) tell what it is (could be any of the iso variants or 
utf8), I don't bother with any of it and have removed all attempts to 
decode it. I just treat it all as strings. As it is a message switch it 
becomes SEP or a UAP.


Dirk




Re: UTF-8 + HTML::Template + CGI::Fast

2009-12-04 Thread Philip Potter
2009/12/4 Nicholas Clark:
> On Fri, Dec 04, 2009 at 11:49:09AM +, Philip Potter wrote:
>
>> I don't know if this problem is in general solvable, because user
>> agents are not required to declare what encoding they are using to
>> submit form contents. Even when the form uses the
>> accept-charset="utf-8" attribute to restrict the user agent to only
>> one charset, firefox doesn't append charset=utf-8 to the Content-type:
>> HTTP header.
>>
>> I don't see how you're supposed to guess what encoding the user agent
>> used if it won't tell you. Does anyone else have any ideas?
>
> I've not used it, but see http://www.joshisanerd.com/set/
> and Encode::HEBCI.
>
> It's a very crafty idea of using HTML entities and hidden form fields to start
> to deduce which particular crack the browser is smoking.

Great idea! The demo app recognised all encodings I threw at it except
macFarsi...
I also found this document:
http://niwo.mnsys.org/saved/~flavell/charset/form-i18n.html
which, though maybe a little dated, covers the issues involved well.

Phil


Re: UTF-8 + HTML::Template + CGI::Fast

2009-12-04 Thread James Laver
On Fri, Dec 4, 2009 at 11:49 AM, Philip Potter wrote:
> I don't know if this problem is in general solvable, because user
> agents are not required to declare what encoding they are using to
> submit form contents. Even when the form uses the
> accept-charset="utf-8" attribute to restrict the user agent to only
> one charset, firefox doesn't append charset=utf-8 to the Content-type:
> HTTP header.
>
> I don't see how you're supposed to guess what encoding the user agent
> used if it won't tell you. Does anyone else have any ideas?
>
> Phil
>

This is one of the fun things about character sets. There are three
ways to determine character set:

1. Separate data that tells you what the character set is (that would
be the http headers, and many browsers do set them (and most servers
do, making life easier on browser developers))
2. Character set data embedded in the data, with data prior to that
being in a specified required character set (that would be html,
specifying that the charset is utf-8 in a meta http-equiv tag (alas,
you're not a browser, so that isn't going to help))
3. Checking if it looks like a given character set (very lossy). Eg.
the is_utf8() function only checks if it *could* be utf-8. If you pass
it ascii text, it'll pass. Subsets of some other character sets will
also pass. There are no guarantees, just percentage chances. Not
exactly the world's best fallback.

Of course if you get a choice between a potentially lying piece of
software (software, it's hateful) and percentage chances, your chances
of it working right most of the time are of course nonexistent. Better
to give up.

--James


Re: UTF-8 + HTML::Template + CGI::Fast

2009-12-04 Thread Nicholas Clark
On Fri, Dec 04, 2009 at 11:49:09AM +, Philip Potter wrote:

> I don't know if this problem is in general solvable, because user
> agents are not required to declare what encoding they are using to
> submit form contents. Even when the form uses the
> accept-charset="utf-8" attribute to restrict the user agent to only
> one charset, firefox doesn't append charset=utf-8 to the Content-type:
> HTTP header.
> 
> I don't see how you're supposed to guess what encoding the user agent
> used if it won't tell you. Does anyone else have any ideas?

I've not used it, but see http://www.joshisanerd.com/set/
and Encode::HEBCI.

It's a very crafty idea of using HTML entities and hidden form fields to start
to deduce which particular crack the browser is smoking.
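
To give a flavour of the idea only (this is NOT the Encode::HEBCI API,
which I haven't used), here is a much simplified sketch. It assumes a
hypothetical hidden field such as
<input type="hidden" name="probe" value="&eacute;">, so the browser sends
back U+00E9 in whatever encoding it picked. The field name, the table and
the function name are all made up.

```perl
use strict;
use warnings;

# Bytes that U+00E9 (e-acute) turns into under a few encodings.
# A single probe character cannot tell Latin-1 and Windows-1252 apart.
my %PROBE_FINGERPRINT = (
    "\xC3\xA9" => 'UTF-8',
    "\xE9"     => 'cp1252',     # identical byte in ISO-8859-1
    "\x8E"     => 'MacRoman',
);

# Hypothetical helper: map the submitted probe bytes to an Encode
# encoding name, or undef when the fingerprint is not recognised.
sub guess_encoding_from_probe {
    my ($probe_bytes) = @_;
    return undef unless defined $probe_bytes;
    return $PROBE_FINGERPRINT{$probe_bytes};
}

my $guess = guess_encoding_from_probe("\xC3\xA9");
print "browser appears to have used $guess\n" if defined $guess;
```

A real implementation would need several probe characters to separate
encodings that agree on any single one.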

Nicholas Clark


Re: UTF-8 + HTML::Template + CGI::Fast

2009-12-04 Thread Andrew McGregor
> Every string needs to be laundered AFAIK.

Thanks, if there is no magical Apache config then I'd best get started!

I'm not using "use utf8;", and I'm removing the utf8 fh I was passing to
H::T, as neither should make a difference here and this project
doesn't have a comprehensive set of unit tests to prove it.

Thanks

Andy


Re: UTF-8 + HTML::Template + CGI::Fast

2009-12-04 Thread Philip Potter
2009/12/4 Andrew McGregor:
>> If a user enters UTF-8 chars then the string displayed is corrupt, or
>> looks like it has been double encoded.
>
> Adding this:
>
> $rtbx_senderDetails = decode("utf8", $rtbx_senderDetails);
>
> Just before passing the param to HTML::Template seems to work.
>
> So I'm switching the UTF flag on which means it is handled correctly
> by HTML::Template?
>
> Also, is there a more global way to resolve this as there are a few fields?
>
I don't know if this problem is in general solvable, because user
agents are not required to declare what encoding they are using to
submit form contents. Even when the form uses the
accept-charset="utf-8" attribute to restrict the user agent to only
one charset, firefox doesn't append charset=utf-8 to the Content-type:
HTTP header.

I don't see how you're supposed to guess what encoding the user agent
used if it won't tell you. Does anyone else have any ideas?

Phil


Re: UTF-8 + HTML::Template + CGI::Fast

2009-12-04 Thread damien krotkine
I don't see why "use utf8;" would be required, or helpful here. AFAIK (and
from the perlunicode manpage), "use utf8;" is to be used to tell perl that
you use UTF-8 encoded characters in the Perl source code. Which is probably
not the case.

If Andy happens to have latin1 encoded Perl script files (which is quite
common), I bet "use utf8;" is not going to help, and is probably going to
confuse him when he tries to, say, match a strange character with a regexp.

"As a compatibility measure, the "use utf8" pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves (in
string or regular expression literals, or in identifier names)
on ASCII-based machines".

dams

2009/12/4 Dave Hodgkinson 

>
> On 4 Dec 2009, at 10:25, Andrew McGregor wrote:
>
> > Hello,
> >
> > I'm hoping someone can help me with a UTF-8 problem.
> >
> > If a user enters latin-1 into a (CGI::Fast) form and submits, the
> > string is displayed (HTML::Template) as expected.
> >
> > If a user enters UTF-8 chars then the string displayed is corrupt, or
> > looks like it has been double encoded.
> >
> > If I remove CGI::Fast then there is no problem.
> >
> > If I die $string before hitting HTML::Template then the UTF-8 string
> > is displayed properly.
> >
> > I've patched HTML::Template to allow UTF-8 files.  I've tried passing
> > in a fh opened as UTF-8 (segfault).
> >
> > I read somewhere that Perl doesn't cope well concatenating latin-1 and
> > UTF-8 strings - could this be happening in HTML::Template?
> >
> > Or should I be posting somewhere else, HTML::Template list or CGI::Fast
> > list?
> >
> > This is perl, v5.8.8 built for i386-linux-thread-multi
>
>
> Consistency is the key. If you know Latin-1 is coming in from the
> form, use Encode to convert it to UTF-8 and thus tell perl that's
> what it is.
>
> Oh, and "use utf8;" for good measure.
>
> Apparently, a string isn't just a stream of octets :)
>
> At least you don't have MySQL in the loop.
>
>
> --
> Dave Hodgkinson  MSN: daveh...@hotmail.com
> Site: http://www.davehodgkinson.com  UK: +44 7768 490620
> Blog: http://www.davehodgkinson.com/blog
> Photos: http://www.flickr.com/photos/davehodg


Re: UTF-8 + HTML::Template + CGI::Fast

2009-12-04 Thread James Laver
On Fri, Dec 4, 2009 at 11:18 AM, Dave Hodgkinson  wrote:
>
> Every string needs to be laundered AFAIK.
>
> Maybe 5.12 will make it all better?

If the discussions on p5p are anything to go by, it will fix some
things and make others more painful (by virtue of being slightly
incompatible with the old ways of doing things). Most notable, I think,
are the changes to what \d, \w and \s mean with regard to Unicode.

Character sets are hard, get me a beer.

--James


Re: UTF-8 + HTML::Template + CGI::Fast

2009-12-04 Thread Dave Hodgkinson

On 4 Dec 2009, at 10:54, Andrew McGregor wrote:

>> If a user enters UTF-8 chars then the string displayed is corrupt, or
>> looks like it has been double encoded.
> 
> Adding this:
> 
> $rtbx_senderDetails = decode("utf8", $rtbx_senderDetails);
> 
> Just before passing the param to HTML::Template seems to work.
> 
> So I'm switching the UTF flag on which means it is handled correctly
> by HTML::Template?
> 
> Also, is there a more global way to resolve this as there are a few fields?
> 

Every string needs to be laundered AFAIK.

Maybe 5.12 will make it all better?

-- 
Dave Hodgkinson  MSN: daveh...@hotmail.com
Site: http://www.davehodgkinson.com  UK: +44 7768 490620
Blog: http://www.davehodgkinson.com/blog
Photos: http://www.flickr.com/photos/davehodg

Re: UTF-8 + HTML::Template + CGI::Fast

2009-12-04 Thread Dave Hodgkinson

On 4 Dec 2009, at 10:25, Andrew McGregor wrote:

> Hello,
> 
> I'm hoping someone can help me with a UTF-8 problem.
> 
> If a user enters latin-1 into a (CGI::Fast) form and submits, the
> string is displayed (HTML::Template) as expected.
> 
> If a user enters UTF-8 chars then the string displayed is corrupt, or
> looks like it has been double encoded.
> 
> If I remove CGI::Fast then there is no problem.
> 
> If I die $string before hitting HTML::Template then the UTF-8 string
> is displayed properly.
> 
> I've patched HTML::Template to allow UTF-8 files.  I've tried passing
> in a fh opened as UTF-8 (segfault).
> 
> I read somewhere that Perl doesn't cope well concatenating latin-1 and
> UTF-8 strings - could this be happening in HTML::Template?
> 
> Or should I be posting somewhere else, HTML::Template list or CGI::Fast list?
> 
> This is perl, v5.8.8 built for i386-linux-thread-multi


Consistency is the key. If you know Latin-1 is coming in from the
form, use Encode to convert it to UTF-8 and thus tell perl that's 
what it is. 
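
Spelled out as a sketch, with a sample value of my own: decode the
incoming Latin-1 octets into a Perl character string once on the way in,
and encode to UTF-8 octets once on the way out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $latin1_bytes = "caf\xE9";                        # what the form sent, as Latin-1
my $text         = decode('ISO-8859-1', $latin1_bytes);  # now a character string
my $utf8_bytes   = encode('UTF-8', $text);           # octets for a UTF-8 page

print length($text), " characters, ", length($utf8_bytes), " octets out\n";
```

Everything between those two calls (templating included) then only ever
sees character strings, which is the consistency being asked for.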

Oh, and "use utf8;" for good measure.

Apparently, a string isn't just a stream of octets :)

At least you don't have MySQL in the loop.


-- 
Dave Hodgkinson  MSN: daveh...@hotmail.com
Site: http://www.davehodgkinson.com  UK: +44 7768 490620
Blog: http://www.davehodgkinson.com/blog
Photos: http://www.flickr.com/photos/davehodg

Re: UTF-8 + HTML::Template + CGI::Fast

2009-12-04 Thread Andrew McGregor
> If a user enters UTF-8 chars then the string displayed is corrupt, or
> looks like it has been double encoded.

Adding this:

$rtbx_senderDetails = decode("utf8", $rtbx_senderDetails);

Just before passing the param to HTML::Template seems to work.

So I'm switching the UTF flag on which means it is handled correctly
by HTML::Template?

Also, is there a more global way to resolve this as there are a few fields?
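
One possible shape for a more global fix, as a sketch only: decode every
field in one pass before anything reaches the template. decode_params()
and the rtbx_subject field are hypothetical; with CGI.pm the raw hash
would presumably be filled from its param() method, which is not shown
here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: take raw name => octets pairs and return the same
# pairs with every value decoded from UTF-8 into a character string.
sub decode_params {
    my (%raw) = @_;
    my %decoded;
    for my $name (keys %raw) {
        # Without a CHECK argument, malformed bytes become U+FFFD
        # instead of raising an error.
        $decoded{$name} = decode("utf8", $raw{$name});
    }
    return %decoded;
}

my %fields = decode_params(
    rtbx_senderDetails => "Andr\xC3\xA9",   # sample octets, not real data
    rtbx_subject       => "na\xC3\xAFve",   # hypothetical second field
);
print length($fields{rtbx_senderDetails}), " characters\n";   # 5
```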

Thanks

Andy