On 2012/10/16 1:30, Robin Berjon wrote:
On 15/10/2012 17:49 , Ted Hardie wrote:
On Mon, Oct 15, 2012 at 8:07 AM, Robin Berjon <[email protected]> wrote:
URLs to non-Web things (e.g. mailto:, smsto:, tel:, etc.) happen in Web
contexts. Libraries written to process those in Web contexts are likely to
be reused elsewhere. There isn't really an option to have some of this in
Web use cases and something else outside of it. If it's used for the Web, it
*will* leak. Probably a lot, and probably fast.
One first question is how much we want it to leak. An example that Anne
brought up is a URL with a space character in it. It is clear that these
things exist on the Web, in not too small numbers. On the other hand,
it's also clear that there are many places (some of them defined by
specs, some of them just somewhere in scripts and the like) that will
just 'blow up' when they get a space.
Do we want to make sure that all browsers treat such a space in the same
way? Most probably yes, and in this case, maybe they already do. Does it
make sense to write that down? I'd also very much say yes.
Do we want to make sure that all other places that accept URIs or IRIs
also accept a space and treat it the same? Maybe we would like to do so,
but is it possible? Quite clearly no (just think HTTP request header).
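To illustrate the HTTP case, here is a minimal sketch. It assumes a naive
parser that splits the request line (method SP request-target SP
HTTP-version) on single spaces; the name split_request_line is made up for
this example and is not taken from any real server.

```python
def split_request_line(line: str) -> list[str]:
    # Hypothetical naive parser: expects exactly three space-separated fields.
    return line.split(" ")

escaped = split_request_line("GET /my%20file.html HTTP/1.1")
raw = split_request_line("GET /my file.html HTTP/1.1")

print(len(escaped))  # 3
print(len(raw))      # 4: the raw space splits the target in two
```

A receiver that expects three fields has no reliable way to tell which of
the four belongs to the request target.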
This essentially means that the fork is already here. In some sense,
that's really bad news. But if we look more closely, the news may not be
that bad. First, at least for the case with the space in it, we know how
to convert it to an equivalent without a space: use %20 (except maybe in
form parts). But we need to make sure that this is written down somewhere.
Second, and that will be more obvious for some more esoteric cases than
just a space, I think that even among those who agree that such cases
should be described, and should be handled uniformly by browsers, there
will be quite some agreement that it's better not to produce such things.
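The space conversion mentioned above can be sketched as follows. This is
only an illustration, not a proposal for the spec text: escape_spaces is a
hypothetical helper, and the example URL is invented. The second call shows
the form-data exception, where Python's urlencode writes a space as "+".

```python
from urllib.parse import urlencode

def escape_spaces(url: str) -> str:
    # Hypothetical helper: replace each literal space with %20
    # and leave every other character untouched.
    return url.replace(" ", "%20")

print(escape_spaces("http://example.org/my file.html"))
# http://example.org/my%20file.html

# Form encoding is the exception: a space becomes "+".
print(urlencode({"q": "my file"}))
# q=my+file
```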
What we end up with is something I'd call a semi-fork, which is a subset
of "recommended" URIs/IRIs within a larger set of (sometimes, but not
always) tolerated ones.
We already have this for the XML case; it's called LEIRIs
(http://tools.ietf.org/html/draft-ietf-iri-3987bis-12#section-6).
At one point, we tried to do something similar to what Anne is now
trying to address, but we did not get very far because once one goes
beyond the simple cases (such as a space), it gets messy quite quickly
(read: different browsers do different things). Even though there are
representatives of all major browser vendors subscribed to the IRI WG
mailing list, we also didn't get much in terms of contributions or
feedback (Adam and Anne occasionally were exceptions).
I agree. But that argues that an xmpp URI seen in a jabber context
and an xmpp URI seen in a web context should be the same;
Syntactically correct xmpp URIs should be the same indeed, and I think
they currently are.
or, to
re-iterate, that a fork would be harmful. Changing the URI parsing in
web contexts only is likely to be problematic because of leakage.
Avoiding that by retaining one way is my personal preference for the
way forward. But if those working on web-specific specs do not agree
and choose to fork, then we *must* mark the difference between the
contexts, or the results will be even worse.
I think that we're in ruthlessly violent agreement here :)
At this point we have to look at what status Anne's work could be
published under. It doesn't have to be a fork, it could simply be
published as The One True Way to parse URLs (after reviews, etc.
obviously). Is that something that could be acceptable?
I think it can easily be the One True Way to parse URLs in Web Browsers.
Given some of the current differences between browsers, even that may be
tough, but I very much hope that Anne can be successful.
I think that in a way similar to how the HTML5 spec currently
distinguishes between an authoring version and a parsing version, Anne's
document can be the parsing version for Web browsers, and RFC 3986, and
3987bis, can be the authoring version(s).
Of course, that's not a strict parallel. As an example, Anne plans to
clearly document/spec how URL equivalence works in JavaScript. For
everybody who uses JavaScript, this will clearly be a good thing.
However, as http://tools.ietf.org/html/rfc3986#section-6,
http://tools.ietf.org/html/rfc3987#section-5, and
http://tools.ietf.org/html/draft-ietf-iri-comparison-01 should make
quite clear, how to compare URIs/IRIs/URLs depends very much on the
application. At one end, a spider will take as many shortcuts as
possible, while at the other end, XML namespaces and RDF will do
codepoint-by-codepoint comparison, and there is clearly some value in
documenting that. (Also, an extended JavaScript library may provide
quite a few variants to deal with these application needs.)
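As a rough sketch of two points on that spectrum: codepoint_equal is the
strict comparison, and normalized_equal applies only the case normalization
of RFC 3986 section 6.2.2.1 (lowercase scheme and host, uppercase hex digits
in percent-escapes). The function names are invented for this example, and
a real spider would likely go further than this.

```python
import re
from urllib.parse import urlsplit

def codepoint_equal(a: str, b: str) -> bool:
    # Strict comparison, character by character.
    return a == b

def case_normalize(uri: str) -> str:
    parts = urlsplit(uri)  # urlsplit already lowercases the scheme
    # Lowercase the host (and port), but keep any userinfo as written.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    parts = parts._replace(netloc=userinfo + at + hostport.lower())
    # Uppercase the hex digits of percent-escapes.
    return re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), parts.geturl())

def normalized_equal(a: str, b: str) -> bool:
    return case_normalize(a) == case_normalize(b)

a = "HTTP://Example.ORG/a%7eb"
b = "http://example.org/a%7Eb"
print(codepoint_equal(a, b))   # False
print(normalized_equal(a, b))  # True
```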
Last but not least, I would like to mention that if there's anything
that we can reasonably do to make the gap in the semi-fork narrower, then
we should give it a try. Two examples: First, RFC 3987 was quite strict
about character normalization in some circumstances. It has turned out
that browsers did it differently, so we changed the spec. Also, we had
to find out that query parts don't get converted using UTF-8 as often as
we would like. So we also adapted the spec, even though that's still
under discussion. If there are other cases that we *can* address, please
tell us. On the other hand, I'd hope that with the work that Anne does,
he also tries to narrow the gap where possible, e.g. by choosing a
solution closer to RFC 3986/3987bis where browsers disagree.
Regards, Martin.