On Tue, Oct 4, 2016 at 8:14 PM, David Walker <d...@mudsite.com> wrote:

> Hi all,
>
> A couple weeks back I took a look at 72811[1].  The bug being that
> parse_url() didn't accept IPv6 addresses without a scheme, like it did for
> IPv4 addresses.  I attempted to patch the specific bug within the scope of
> how parse_url() was processing URIs.  After opening a PR for the
> resolution, Yasuo and Christoph both chimed in that perhaps replacing the
> implementation with an re2c based parser would be better.  We found a
> parser[2] that did almost everything necessary.  I took it and made it more
> strictly adhere to RFC3986[3].
>
> I have updated my original PR[4] and created an RFC[5] that aims to replace
> the parsing in parse_url() so that it adheres more strictly to RFC3986.
> This will introduce a BC break, explained in the RFC, that at the very
> least warrants some discussion.  We had kicked around the idea on the PR of
> deprecating parse_url() and creating a new function with the more-compliant
> parser, but opted against it.
>
> I'm looking for discussion on whether a total replacement is the preferred
> way to go about this, and whether we should be making parse_url() more
> strictly standards-compliant, since today it departs from RFC3986 in many
> places in order to provide semi-reasonable parsing patterns.
>
> --
> Dave
>
> [1] - https://bugs.php.net/bug.php?id=72811
> [2] - https://github.com/staskobzar/url_parser_re2c
> [3] - https://tools.ietf.org/html/rfc3986
> [4] - https://github.com/php/php-src/pull/2079
> [5] - https://wiki.php.net/rfc/replace_parse_url
>

Are you aware of the WHATWG URL standard [1]? Quoting the first goal
statement:

> Align RFC 3986 and RFC 3987 with contemporary implementations and
obsolete them in the process. (E.g., spaces, other "illegal" code points,
query encoding, equality, canonicalization, are all concepts not entirely
shared, or defined.) URL parsing needs to become as solid as HTML parsing.

Basically this is the standard that describes how URL parsing actually
works in the wild, in browser implementations. In particular it also
includes a description of URL parsing in algorithmic form, including
specific directions as to which errors are fatal and which are not.

Also quoting from the goals:

> Standardize on the term URL. URI and IRI are just confusing. In practice
a single algorithm is used for both so keeping them distinct is not helping
anyone. URL also easily wins the search result popularity contest.

For this reason, I would recommend against introducing the term "URI"
anywhere. In particular the suggestion from this thread to use parse_uri()
for this functionality seems like it will cause a lot of confusion.

The URL standard also specifies the interface of the URL object used by
JavaScript and I think we should consider whether we may want to simply
adopt this (object-oriented) interface (potentially with adjustments for
PHP specifics).

I think an important part of this interface is that the URL is constructed
using URL(url [, base]), where "base" is the base URL against which
relative URLs are resolved. This base URL is required for parsing
non-absolute URLs. To me this makes a lot of sense and I think it makes it
much clearer how "incomplete" URLs are being treated.
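To make the base-URL point concrete, here is a minimal sketch in TypeScript using the URL class that JavaScript runtimes already ship. The tryParse() helper and the example.com addresses are my own illustration, not part of the standard or of the RFC under discussion; a PHP interface might look quite different.

```typescript
// Hypothetical helper: wraps the standard URL constructor so that a
// fatal parse failure (the constructor throws a TypeError) becomes null.
function tryParse(input: string, base?: string): URL | null {
  try {
    return new URL(input, base);
  } catch {
    return null;
  }
}

// An absolute URL needs no base.
const absolute = tryParse("https://example.com/a/b?x=1#frag");

// A relative reference is resolved against the base URL.
const relative = tryParse("../c?y=2", "https://example.com/a/b");

// The same relative reference without a base is a fatal error.
const noBase = tryParse("../c?y=2");

// A scheme-less IPv6 authority picks up its scheme from the base.
const ipv6 = tryParse("//[::1]:8080/status", "http://example.com/");

console.log(absolute?.pathname); // "/a/b"
console.log(relative?.href);     // "https://example.com/c?y=2"
console.log(noBase);             // null
console.log(ipv6?.hostname, ipv6?.port); // "[::1]" "8080"
```

The scheme-less IPv6 case from the original bug report stops being ambiguous here: without a base it is rejected outright, and with one the caller has said explicitly how the missing parts are to be filled in.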

While we're at it, what's the state of IDN? May this be the time to
properly support it?

Nikita

 [1]: https://url.spec.whatwg.org/
