Re: [whatwg] [URL] Starting work on a URL spec

Brett Zamir Fri, 23 Jul 2010 23:49:57 -0700

 On 7/24/2010 2:02 PM, Boris Zbarsky wrote:

On 7/24/10 1:50 AM, Brett Zamir wrote:

I would be particularly interested in data on this last, across
different browsers, operating systems, and locales... There seem to be
servers out there expecting their URIs in UTF-8 and others expecting
them in ISO-8859-1, and it's not clear to me how to make things work
with them all.


Seems to me that if they are not in UTF-8, they should be treated as
bugs, even if that is not a de jure standard.


Treated as bugs by whom?

By the servers/scripting languages. While it is great that the browsers are involved in the process, I think it would be reasonable to invite the other stake-holders to join the discussions.

The scenario is that a user types some non-ASCII text in the url bar. This needs to be url-encoded to actually go on the wire, which raises the question of what encoding. If the user is using IRIs, the answer is UTF-8. A number of servers barf if you do this, especially because some server-side scripting languages (PHP, e.g., last I checked) default to URI-unescaping via something other than UTF-8.
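To make the mismatch concrete, here is a minimal Python sketch. It assumes Python's standard urllib.parse as a stand-in for both the browser and the server, and the sample string is hypothetical. It percent-encodes the same text under each charset and then unescapes it with the wrong one:

```python
from urllib.parse import quote, unquote

text = "café"  # hypothetical input typed into the URL bar

# The same characters produce different bytes on the wire per charset.
as_utf8 = quote(text, encoding="utf-8")
as_latin1 = quote(text, encoding="iso-8859-1")
print(as_utf8)    # caf%C3%A9
print(as_latin1)  # caf%E9

# A server that unescapes UTF-8 input as ISO-8859-1 gets mojibake.
print(unquote(as_utf8, encoding="iso-8859-1"))  # cafÃ©

# A server that unescapes ISO-8859-1 input as UTF-8 hits an invalid byte;
# errors="replace" substitutes U+FFFD instead of raising.
print(unquote(as_latin1, encoding="utf-8", errors="replace"))
```

Nothing in the escaped form says which charset produced it, so the receiving side can only guess or follow a convention.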

Hopefully to be fixed in PHP6 with its promise of full Unicode support...

Though per http://www.slideshare.net/kfish/unicode-php6-presentation :

*Slide 34:* Conversions & Encoding “HTTP Input Encoding”

With Unicode semantics switch enabled, we need to convert HTTP input to Unicode
GET requests have no encoding at all and POST ones rarely come marked with the encoding

Encoding detection is not reliable
*Correctly decoding HTTP input is somewhat of an unsolved problem*

*Slide 35:* Conversions & Encoding “HTTP Input Encoding”
PHP will perform lazy decoding

Delays decoding data in $_GET, $_POST, and $_REQUEST until the first time you access them

Allows user to set expected encoding or just rely on a default one
Allows decoding errors to be handled by the same mechanism
Applications should also use filter extension to filter incoming data
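The lazy-decoding idea on slide 35 can be sketched roughly as follows. This is a Python analogue under stated assumptions, not PHP's implementation: the LazyParams class and its encoding attribute are hypothetical names invented for illustration.

```python
from urllib.parse import unquote_to_bytes


class LazyParams:
    """Hypothetical holder that decodes raw query values on first access."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = dict(raw)      # name -> still percent-encoded value
        self._decoded = {}         # cache filled lazily
        self.encoding = encoding   # the caller may change this before reading

    def __getitem__(self, name):
        if name not in self._decoded:
            data = unquote_to_bytes(self._raw[name].replace("+", " "))
            # Strict decoding: a wrong charset raises UnicodeDecodeError
            # here, at access time, where the application can handle it.
            self._decoded[name] = data.decode(self.encoding)
        return self._decoded[name]


params = LazyParams({"q": "caf%E9"})
params.encoding = "iso-8859-1"  # set the expected encoding before first use
print(params["q"])  # café
```

Deferring the decode gives the application a chance to state the expected encoding first, and any decoding failure surfaces at the point of access rather than during request parsing.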

So some browsers encode the non-query part of the URI as UTF-8 and the query part as ... something (user's default filesystem encoding, say, for lack of a better guess). Others always use UTF-8 (and end up with some servers not usable). Others... I have no idea. That's why I want data. ;) In particular, while the "just use UTF-8, and if the user can't access the site, sucks to be the user" approach has a certain theoretical-purity appeal, it doesn't seem like something I want to do to my friends and family (always a good criterion for things you'd like to do to users).
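The split behaviour described above, with the path always in UTF-8 and the query in some other charset, might look like this in a short Python sketch. The build_url helper, the path, and the parameter name are hypothetical, and ISO-8859-1 merely stands in for whatever encoding a given browser picks:

```python
from urllib.parse import quote, urlencode


def build_url(path, query, query_encoding):
    """Percent-encode the path as UTF-8 and the query in query_encoding."""
    encoded_path = quote(path, encoding="utf-8")  # "/" is left unescaped
    encoded_query = urlencode(query, encoding=query_encoding)
    return f"{encoded_path}?{encoded_query}"


# Hypothetical path and parameter names, chosen only for illustration.
print(build_url("/café", {"q": "café"}, "iso-8859-1"))  # /caf%C3%A9?q=caf%E9
print(build_url("/café", {"q": "café"}, "utf-8"))       # /caf%C3%A9?q=caf%C3%A9
```

In the first result the same word is escaped two different ways within a single URL, which is the situation a server has to untangle.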

What I meant is to try to get the server systems on board to fix the issue, including in the long-term. I appreciate you all being admirably practical champions of present-day compatibility, though I'd hope there is a vision to make things work better for the future, even if there will be some inevitable growing pains for a subset of users (as the lack of standardization no doubt creates pains for another subset as it is).


Brett
