((It was requested that we move these conversations to the public-iri mailing 
list, so I'm willing to take that advice, bcc'ing www-archive@w3.org to make 
sure anyone looking there has a place to follow up.))

I agree that a "processing spec" which reflects what browsers actually do, or 
should do, when confronted with a string which claims to be a "URL", seems like 
a good idea.

In 2009, the transition from 
http://tools.ietf.org/html/draft-duerst-iri-bis-06 to 
http://tools.ietf.org/html/draft-duerst-iri-bis-07#section-13.1 was a major 
restructuring of the IRI spec, moving from a BNF-derived syntax specification 
to a more directive "processing model". This was followed by the formation of 
an IRI working group to manage this work; the documents in that series, 
http://tools.ietf.org/html/draft-ietf-iri-3987bis , contained, up to version 
06:

http://tools.ietf.org/html/draft-ietf-iri-3987bis-06#section-6.2 "Web Address 
Processing". 

   Many popular web browsers have taken the approach of being quite
   liberal in what is accepted as a "URL" or its relative forms.  This
   section describes their behavior in terms of a preprocessor which
   maps strings into the IRI space for subsequent parsing and
   interpretation as an IRI.

   In some situations, it might be appropriate to describe the syntax
   that a liberal consumer implementation might accept as a "Web
   Address" or "Hypertext Reference" or "HREF".  However, technical
   specifications SHOULD restrict the syntactic form allowed by
   compliant producers to the IRI or IRI reference syntax defined in
   this document even if they want to mandate this processing.
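The "preprocessor" the quoted text describes might look roughly like the 
sketch below. This is only a minimal illustration under my own assumptions -- 
the particular fixups, their order, and the function name are invented, not 
taken from any draft:

```python
def preprocess_web_address(s: str) -> str:
    """Map a liberal 'web address' string toward IRI syntax (illustrative)."""
    # Strip the leading/trailing whitespace that browsers commonly tolerate.
    s = s.strip()
    # Drop embedded tab and newline characters.
    for c in "\t\r\n":
        s = s.replace(c, "")
    # Percent-encode spaces, which are not legal in an IRI.
    return s.replace(" ", "%20")

print(preprocess_web_address("  http://example.org/a b\n"))
# → http://example.org/a%20b
```

The point is just that the fixups happen before the string is handed to a 
strict IRI parser, so producers can still be held to the stricter syntax.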

The specification may not have matched exact browser behavior, or expected or 
wanted browser behavior, but it at least attempted what was claimed to be 
wanted -- describe a forgiving processing model that accepts arbitrary input, 
while also providing a stricter interpretation of "legal" IRI syntax for IRI 
producers.  

It's hard to find much evidence of issues, discussion, or comments in the 
tracker or on the mailing list; it's great to now finally get some 
participation, testing, and resolution of open issues.  

My understanding is that around August 2011, to address the issue of 'venue 
selection', there was some kind of agreement that "Web Address Processing" 
would be handled in W3C (originally in the HTML WG and now in WebApps) while 
the stricter interpretation would be handled in IETF, and that this section on 
"Web Address Processing" would be removed from the IETF document: 
http://tools.ietf.org/html/draft-ietf-iri-3987bis-07 .   

If there is agreement now that the entire IRI / URL processing model will be 
described in a W3C specification (from string-of-characters in some document 
charset into "strings sent to IRI component processor or to HTTP client 
interface")  I think that's workable.

I think, though, that it is risky to have the same processing described in two 
different parallel specifications.

One path would be for 3987bis to instead normatively reference the W3C 
specification for URL processing (most of sections 3.1-3.5 of 
http://tools.ietf.org/html/draft-ietf-iri-3987bis-12 ), leaving the remaining 
components in the IETF documents: how to compare IRIs (in the comparison 
document), considerations for dealing with BIDI IRIs (in the bidi document), 
and how to remain compatible with legacy systems which accept not only 
ASCII-only URIs but also Unicode-based IRIs requiring some compatibility with 
RFC 3987 (such as XML processors which only accept LEIRIs).

That would also converge the specifications. The only remaining concern is 
whether the W3C specification will follow the same backward compatibility 
guidelines -- to only make specification changes if there is a large majority 
of commonly deployed implementations that implement the change. If, for 
example, 30% of installed browsers treat "\" as if it is "/" and 70% of 
installed browsers do not, then the 30% should not lead to a change in the 
processing model.   Of course, percentages and market share are fluid, but 
let's look for some stability...  

Liberal handling of previously illegal strings as IRIs has security 
implications that should be examined carefully; this is not like content 
parsing or style sheet application.
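One illustration of the kind of problem I mean (the tiny authority extractor 
below is invented for the example, not any spec's algorithm): if one component 
treats "\" as "/" and another does not, the two can disagree about which host 
a string names.

```python
def host_of(url: str, backslash_is_slash: bool) -> str:
    """Extract the host of an http URL under two parsing policies.

    Deliberately simplistic, invented parser -- for illustration only.
    """
    if backslash_is_slash:
        url = url.replace("\\", "/")             # the liberal fixup
    rest = url.split("://", 1)[1]                # drop the scheme
    authority = rest.split("/", 1)[0]            # authority ends at first "/"
    if "@" in authority:
        authority = authority.rsplit("@", 1)[1]  # drop any userinfo
    return authority

ambiguous = "http://example.com\\@evil.com/"
print(host_of(ambiguous, backslash_is_slash=True))   # → example.com
print(host_of(ambiguous, backslash_is_slash=False))  # → evil.com
```

A filter that checks the host under one policy while the fetching component 
uses the other would pass the check for one host and fetch from another.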

Larry
--
http://larry.masinter.net


> -----Original Message-----
> From: Jan Algermissen [mailto:jan.algermis...@nordsc.com]
> Sent: Tuesday, October 16, 2012 4:45 AM
> To: Anne van Kesteren
> Cc: Martin J. Dürst; Robin Berjon; Ted Hardie; Larry Masinter; p...@w3.org; 
> Peter
> Saint-Andre (stpe...@stpeter.im); Pete Resnick (presn...@qualcomm.com);
> www-archive@w3.org; Michael(tm) Smith
> Subject: Re: URL work in HTML 5 (semifork)
> 
> 
> On Oct 16, 2012, at 1:29 PM, Anne van Kesteren wrote:
> 
> > I'm not arguing URLs should be allowed to contain SP, just that they
> > can (and do) in certain contexts and that we need to deal with that
> > (either by terminating processing or converting it to %20 or ignoring
> > it in case of domain names, if I remember correctly).
> 
> I am not understanding your perceived problem with two specs.
> 
> There is the RFC and that is telling us what a valid URI looks like.
> 
> In addition to that you can standardize 'recovery' algorithms for turning
> broken URIs to valid ones. Maybe with different 'heuristics levels' before
> giving up and reporting an error.
> 
> Any piece of software that wishes to be nice on 'URI providers' and process
> broken URIs to some extent can apply that standardized algorithm in a fixup
> phase before handing it on to the component that expects a valid URI.
> 
> The emphasis is then on fixing to get a valid URI as early in the stack
> as possible and avoid the fork on software components that deal with URIs.
> 
> I just don't see any need to mangle any specs. Syntax definition and fixing
> algorithm are orthogonal aspects, really. They belong in different specs.
> 
> Jan
