BTW -- I wonder about the Catalyst behavior here.

On Sat, Aug 1, 2015 at 10:36 PM, Bill Moseley <mose...@hank.org> wrote:
> On Sat, Aug 1, 2015 at 6:31 AM, Stefan <maill...@s.profanter.me> wrote:
>
>> Hi,
>>
>> if a URL parameter contains a Unicode character (e.g.
>> www.example.com/?param=%D6lso%DF which stands for param=Ölsoße), the
>> parameter is not correctly parsed as Unicode.
>>
> One note here -- data over the wire must be encoded into octets. So, all Unicode characters must be encoded and then decoded when received. (You can't send "Unicode characters".) UTF-8 is used now (for obvious reasons). http://tools.ietf.org/html/rfc3986. You are specifying %D6 -- although the Unicode character is U+00D6, the UTF-8 octet sequence is 0xC3 0x96. See: http://www.fileformat.info/info/unicode/char/00D6/index.htm

Unless otherwise instructed, Catalyst uses UTF-8 <https://github.com/perl-catalyst/catalyst-runtime/blob/master/lib/Catalyst/Engine.pm#L579> as the encoding for decoding query parameters -- query parameters are decoded from UTF-8 octets to Perl characters.

As your example showed, if you use invalid UTF-8 sequences then Encode::decode() as used by Catalyst will replace those with the U+FFFD substitution character <http://www.fileformat.info/info/unicode/char/fffd/index.htm> "�". This may or may not be what you want.

Personally, I think it's not correct to silently modify user input. You intended to pass "Ölsoße" but ended up with "�lso�e" -- is that really the data you would want to process/store for the request? Seems unlikely.

If "param" is supposed to be passed as textual, UTF-8-encoded octets, and it isn't, then maybe returning a 400 is a better way of handling that. That probably would have helped you see what is wrong in this case. I.e., use "eval { decode( $default_query_encoding, $str, FB_CROAK | LEAVE_SRC ); }" to catch invalid data and return to the client the "$str" that failed and why.
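To make the difference concrete, here is a minimal standalone sketch using only the core Encode module. It is not Catalyst's actual code path; the helper names percent_decode and percent_encode are made up for illustration, the encoding is hard-coded to UTF-8 rather than taken from $default_query_encoding, and the Latin-1 form of the value is assumed to be %D6lso%DFe.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: turn %XX escapes back into raw octets (no character decoding yet).
sub percent_decode {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

# Hypothetical helper: percent-encode every octet outside RFC 3986's unreserved set.
sub percent_encode {
    my ($octets) = @_;
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $octets;
}

# What the client should send: characters -> UTF-8 octets -> percent-encoding.
my $text = "\x{D6}lso\x{DF}e";                         # "Ölsoße" as Perl characters
my $good = percent_encode( encode('UTF-8', $text) );   # %C3%96lso%C3%9Fe

# What was sent instead: Latin-1 octets 0xD6 and 0xDF, which are not valid UTF-8.
my $bad_octets = percent_decode('%D6lso%DFe');

# Lenient decode (the default): invalid sequences are replaced with U+FFFD.
my $lenient = decode('UTF-8', $bad_octets);

# Strict decode: croak on invalid input so the caller could answer with a 400.
my $strict = eval { decode('UTF-8', $bad_octets, FB_CROAK | LEAVE_SRC) };
chomp( my $error = $@ // '' );

binmode STDOUT, ':encoding(UTF-8)';
print "good:    $good\n";
print "lenient: $lenient\n";
print defined $strict ? "strict:  accepted\n" : "strict:  rejected ($error)\n";
```

With LEAVE_SRC the original octets stay intact after the failed decode, so they are still available for the error response.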
Of course, it is also possible that you have some query parameters that you want decoded as UTF-8 and some that might represent something else (a raw sequence of bytes), and want more manual control. In that case $c->config->{do_not_decode_query} could be used to bypass the decoding. But then, you must manually decode() yourself.

-- 
Bill Moseley
mose...@hank.org
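P.S. A rough sketch of what the manual route might look like, in plain Perl without Catalyst. It assumes the values reach you as raw, already percent-decoded octets once automatic decoding is bypassed; the %is_text table, the parameter names, and decode_param are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical list of parameters that carry text; everything else stays raw octets.
my %is_text = map { $_ => 1 } qw(param title);

# Decode one raw value according to its parameter name.
# Returns (value, undef) on success or (undef, message) on invalid UTF-8.
sub decode_param {
    my ($name, $octets) = @_;
    return ($octets, undef) unless $is_text{$name};    # binary: leave untouched
    my $chars = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
    return defined $chars
        ? ($chars, undef)
        : (undef, "invalid UTF-8 in '$name'");
}

my ($title)      = decode_param(title => "\xC3\x96l");      # U+00D6 followed by "l"
my ($blob)       = decode_param(blob  => "\xD6\xDF");       # raw bytes pass through
my (undef, $err) = decode_param(param => "\xD6lso\xDFe");   # rejected as invalid UTF-8
```

The error message is what you would hand back with the 400; the binary parameter is never run through decode() at all.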
_______________________________________________
List: Catalyst@lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
Dev site: http://dev.catalyst.perl.org/