Re: [whatwg] New URL Standard

Boris Zbarsky Mon, 24 Sep 2012 07:06:35 -0700

On 9/24/12 4:58 AM, Anne van Kesteren wrote:

Say you have <a href="data:test"/>; the concern is what e.g.
a.protocol and a.pathname would return here. For invalid URLs they
would return ":" and "" respectively. If we treat this as a valid URL
you would get "data:" and "test". In Gecko I get "http:" and "". If I
make that <a href="data:text/html,test"/> Gecko will give meaningful
answers (well pathname is still "", maybe that is okay and pathname
should only work for hierarchical URLs).


Ah, I see.

So what happens here is that Gecko treats this as an invalid URL (more precisely, it cannot create an internal "URI" object from this string). I guess that's what you were getting at: that data: URLs actually have a concept of "invalid" in Gecko. This is actually true for all schemes Gecko supports, in general. For example, "http://something or other" (with the spaces) will do the same thing.

For an invalid URI, .protocol currently returns "http:" in Gecko. I have no idea why, offhand. It could just as easily return ":".

As far as .pathname, what Gecko does is exactly what you say: .pathname only works on hierarchical schemes.

More general, what I want is that for *any* given input in <a
href="..."/>, xhr.open("GET", ...), new URL(...), etc. I want to be
able to tell what the various URL components are going to be. The kind
of predictability we have for the HTML parser, I want to have for the
URL parser as well.


Yes, absolutely agreed.

(If that means handling data URLs at the layer of the URL parser
rather than a separate parser that goes over the path, as Gecko
appears to be doing, so be it.)

We could change Gecko's handling here, for what it's worth. One reason for the current handling is that right now we don't even make <a> into a link unless its href is a valid URI as far as Gecko is concerned. But I'm considering changing that anyway, since no one else bothers with such niceties and they complicate implementation a bit...

If you want constructive advice, it would be interesting to get a full list
of all the weird stuff that UAs do here so we can evaluate which parts of it
are needed and why.  I can try to produce such a list for Gecko, if there
seems to be motion on the general idea.


I think that would be a great start. I'm happy to start out with
Gecko's behavior and iterate over time as feedback comes in from other
browsers.


Hmm.  So here goes at least a partial list:

1) On Windows and OS/2, Gecko replaces '\\' with '/' in file:// URI strings before doing anything else with the string when parsing a new URL. That includes relative URI strings being resolved against a file:// base.

2) file:// URIs are parsed as a "no authority" URL in Gecko. Quoting the IDL comment:


35     /**
36      * blah:foo/bar    => blah:///foo/bar
37      * blah:/foo/bar   => blah:///foo/bar
38      * blah://foo/bar  => blah://foo/bar
39      * blah:///foo/bar => blah:///foo/bar
40      */

where the thing on the left is the input string and the thing on the right is the normalized form that the parser produces from it. Note that this is different from how HTTP URIs are parsed, for all except the item on line number 38 there.

3) Gecko does not allow setting a username, password, hostname, port on an existing "no authority" URL object, including file://. Attempts to do that throw internally; I believe for web stuff it just becomes a no-op.

4) For "no authority" URLs, including file://, on Windows and OS/2 only, if what looks like authority section looks like a drive letter, it's treated as part of the path. For example, "file://c:/" is treated as the filename "c:\". "Looks like a drive letter" is defined as "ASCII letter (any case), followed by a ':' or '|' and then followed by end of string or '/' or '\\'". I'm not sure why this is checking for '\\' again, honestly. ;)

5) When parsing a "no authority" URL (including file://), and when item 4 above does not apply, it looks like Gecko skips everything after "file://" up until the next '/', '?', or '#' char before parsing path stuff.

6) On Windows and OS/2, when dynamically parsing a path for a "no authority" URL (not sure whether this is actually web-exposed, fwiw...) Gecko will do something involving looking for a path that's only an ASCII letter followed by ':' or '|' followed by end of string. I'm not quite sure what that part is about... It might have to do with the fact that URI objects in Gecko can have concepts of "directory", "filename", "extension" or something like that.

7) When doing URI equality comparisons, if two file:// URIs only differ in their directory/filename/extension (so the actual file path), then an equality comparison is done on the underlying file path objects. What this means depends on the OS. On "Unix" this is just a straight-up byte by byte compare of file paths. I think OS X now follows the "Unix" code path as do most other supported platforms. But note that "file path" in this case is normalized in various ways. Specifically: trailing '/' are stripped and some sort of normalization of HFS paths (possibly with a volume name) to POSIX paths is done on OSX. One result of the latter is that file:///Users%2fbzbarsky ends up seeing my home directory, which is... slightly surprising. On "Unix", the path bytes are treated as UTF-8 if they're valid UTF-8, else treated as whatever the current locale charset is, I think. Oh, and there is some sort of escaping going on for directory names, filenames, extensions. Not sure what that's about, if anything. The URI-escaping code is black magic, but I'm happy to run some black-box tests on it if someone wants to provide test strings.

The things that don't go through the "Unix" code for this stuff are Windows and OS/2. I'm not going to dig through the OS/2 stuff, but on Windows if the filename contains a nonempty directory name and the second char is '|' that's converted to a ':'. Again, escaping for directory names and file names and extensions. Again, things that look like UTF-8 are treated thus and other stuff uses the current codepage. After all that, the actual equality comparison is done via _wcsicmp on the return value of GetShortPathNameW. So whatever things that combination considers equal are equal.

8) When actually resolving a file:// URL, the underlying file path object as described above is used to get the data. Plus there's a bit of weirdness about symlinks, I think... Mostly affects what's shown in the url bar when pointing the browser to a symlink.
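
To make item 2 concrete, here is a rough Python sketch of the "no authority" normalization, written only from the four rows of the IDL comment quoted above. The function name normalize_no_authority is made up for illustration; this is not Gecko's code, and inputs that the comment does not list (for example four or more leading slashes) are simply passed through here as an assumption.

```python
def normalize_no_authority(spec: str) -> str:
    """Hypothetical helper reproducing the four rows of the IDL comment."""
    scheme, colon, rest = spec.partition(":")
    if not colon:
        raise ValueError(f"no scheme in {spec!r}")
    if rest.startswith("//"):
        # Rows "blah://foo/bar" and "blah:///foo/bar": left unchanged.
        # Assumption: anything else starting with "//" is also left alone.
        return spec
    # Rows "blah:foo/bar" and "blah:/foo/bar": force an empty authority.
    return f"{scheme}:///{rest.lstrip('/')}"


for raw in ("blah:foo/bar", "blah:/foo/bar", "blah://foo/bar", "blah:///foo/bar"):
    print(f"{raw:<16} => {normalize_no_authority(raw)}")
```

The loop prints the same left and right columns as the IDL comment, which makes it easy to compare against whatever another UA produces for the same inputs.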

That's what I can spot offhand. I won't guarantee there is nothing else. :(
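
For anyone who wants to write tests against item 4, the "looks like a drive letter" definition can be sketched as a small Python predicate. This is only a reading of the prose definition, not the actual Gecko check: looks_like_drive_letter is a hypothetical name, the argument is assumed to be the text that follows "file://", and '\\' is assumed to mean a single backslash character.

```python
import re

# ASCII letter (any case), then ':' or '|', then end of string, '/' or '\'.
# Assumption: the '\\' in the description denotes one backslash character.
_DRIVE_LETTER = re.compile(r"[A-Za-z][:|](?:[/\\]|\Z)")


def looks_like_drive_letter(candidate: str) -> bool:
    """Hypothetical predicate; candidate is the text after "file://"."""
    return _DRIVE_LETTER.match(candidate) is not None


for sample in ("c:/", "C|", "c:\\dir", "cd:/", "c:foo", "host/path"):
    print(f"{sample!r}: {looks_like_drive_letter(sample)}")
```

Under this reading, "c:/" from the "file://c:/" example counts as a drive letter, while "cd:/" and "c:foo" do not.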


-Boris
