On 9/24/12 4:58 AM, Anne van Kesteren wrote:
Say you have <a href="data:test"/>; the concern is what e.g.
a.protocol and a.pathname would return here. For invalid URLs they
would return ":" and "" respectively. If we treat this as a valid URL
you would get "data:" and "test". In Gecko I get "http:" and "". If I
make that <a href="data:text/html,test"/> Gecko will give meaningful
answers (well pathname is still "", maybe that is okay and pathname
should only work for hierarchical URLs).

Ah, I see.

So what happens here is that Gecko treats this as an invalid URL (more precisely, it cannot create an internal "URI" object from this string). I guess that's what you were getting at: that data: URLs actually have a concept of "invalid" in Gecko. This is actually true for all schemes Gecko supports, in general. For example, "http://something or other" (with the spaces) will do the same thing.

For an invalid URI, .protocol currently returns "http:" in Gecko. I have no idea why, offhand. It could just as easily return ":".

As far as .pathname, what Gecko does is exactly what you say: .pathname only works on hierarchical schemes.

More generally, what I want is that for *any* given input in <a
href="..."/>, xhr.open("GET", ...), new URL(...), etc. I want to be
able to tell what the various URL components are going to be. The kind
of predictability we have for the HTML parser, I want to have for the
URL parser as well.

Yes, absolutely agreed.

(If that means handling data URLs at the layer of the URL parser
rather than a separate parser that goes over the path, as Gecko
appears to be doing, so be it.)

We could change Gecko's handling here, for what it's worth. One reason for the current handling is that right now we don't even make <a> into a link unless its href is a valid URI as far as Gecko is concerned. But I'm considering changing that anyway, since no one else bothers with such niceties and they complicate implementation a bit...

If you want constructive advice, it would be interesting to get a full list
of all the weird stuff that UAs do here so we can evaluate which parts of it
are needed and why.  I can try to produce such a list for Gecko, if there
seems to be motion on the general idea.

I think that would be a great start. I'm happy to start out with
Gecko's behavior and iterate over time as feedback comes in from other
browsers.

Hmm.  So here goes at least a partial list:

1) On Windows and OS/2, Gecko replaces '\\' with '/' in file:// URI strings before doing anything else with the string when parsing a new URL. That includes relative URI strings being resolved against a file:// base.

2) file:// URIs are parsed as a "no authority" URL in Gecko. Quoting the IDL comment:

35     /**
36      * blah:foo/bar    => blah:///foo/bar
37      * blah:/foo/bar   => blah:///foo/bar
38      * blah://foo/bar  => blah://foo/bar
39      * blah:///foo/bar => blah:///foo/bar
40      */

where the thing on the left is the input string and the thing on the right is the normalized form that the parser produces from it. Note that this is different from how HTTP URIs are parsed, for all except the item on line number 38 there.
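
To make the mapping in item 2 concrete, here is a minimal Python sketch that reproduces only the four input/output pairs from the IDL comment. The function name normalize_no_authority is hypothetical, and this is not Gecko's code; it makes no claim about inputs other than those four shapes.

```python
def normalize_no_authority(url: str) -> str:
    """Restate the four cases from the IDL comment (illustrative only)."""
    scheme, sep, rest = url.partition(":")
    if not sep:
        raise ValueError("no scheme separator found")
    if rest.startswith("//"):
        # "blah://foo/bar" and "blah:///foo/bar" come out unchanged.
        return url
    # "blah:foo/bar" and "blah:/foo/bar" both gain an empty authority.
    return f"{scheme}:///{rest.lstrip('/')}"


for spec in ("blah:foo/bar", "blah:/foo/bar", "blah://foo/bar", "blah:///foo/bar"):
    print(f"{spec:16} => {normalize_no_authority(spec)}")
```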

3) Gecko does not allow setting a username, password, hostname, port on an existing "no authority" URL object, including file://. Attempts to do that throw internally; I believe for web stuff it just becomes a no-op.

4) For "no authority" URLs, including file://, on Windows and OS/2 only, if what looks like the authority section looks like a drive letter, it's treated as part of the path. For example, "file://c:/" is treated as the filename "c:\". "Looks like a drive letter" is defined as "ASCII letter (any case), followed by a ':' or '|' and then followed by end of string or '/' or '\\'". I'm not sure why this is checking for '\\' again, honestly. ;)
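
The "looks like a drive letter" definition in item 4 can be written down as a small predicate. This is a sketch of the quoted definition only, not Gecko's implementation; the name looks_like_drive_letter is hypothetical, and the argument is assumed to be whatever follows the "//" in the URL string.

```python
import re

# ASCII letter (any case), then ':' or '|', then end of string or '/' or '\'.
_DRIVE_RE = re.compile(r"[A-Za-z][:|](?:[/\\]|\Z)")


def looks_like_drive_letter(after_slashes: str) -> bool:
    """Return True if the text after '//' starts like a drive letter."""
    return _DRIVE_RE.match(after_slashes) is not None


print(looks_like_drive_letter("c:/"))         # the "file://c:/" example
print(looks_like_drive_letter("localhost/"))  # an ordinary authority
```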

5) When parsing a "no authority" URL (including file://), and when item 4 above does not apply, it looks like Gecko skips everything after "file://" up until the next '/', '?', or '#' char before parsing path stuff.
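
As a rough illustration of item 5, the skipping step might look like the following. This is a sketch of the behavior as described above, under the assumption that the drive-letter case from item 4 has already been ruled out; the function name path_part_after_authority is hypothetical and none of this is taken from Gecko's source.

```python
def path_part_after_authority(spec: str, prefix: str = "file://") -> str:
    """Drop everything after the prefix up to the next '/', '?' or '#'."""
    if not spec.startswith(prefix):
        raise ValueError(f"expected a string starting with {prefix!r}")
    rest = spec[len(prefix):]
    for i, ch in enumerate(rest):
        if ch in "/?#":
            # Path (or query/fragment) parsing would continue from here.
            return rest[i:]
    # No delimiter found: nothing is left to parse as a path.
    return ""


print(path_part_after_authority("file://ignored/some/dir?x=1"))
```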

6) On Windows and OS/2, when dynamically parsing a path for a "no authority" URL (not sure whether this is actually web-exposed, fwiw...) Gecko will do something involving looking for a path that's only an ASCII letter followed by ':' or '|' followed by end of string. I'm not quite sure what that part is about... It might have to do with the fact that URI objects in Gecko can have concepts of "directory", "filename", "extension" or something like that.

7) When doing URI equality comparisons, if two file:// URIs only differ in their directory/filename/extension (so the actual file path), then an equality comparison is done on the underlying file path objects. What this means depends on the OS. On "Unix" this is just a straight-up byte-by-byte compare of file paths. I think OS X now follows the "Unix" code path, as do most other supported platforms. But note that "file path" in this case is normalized in various ways. Specifically: trailing '/' characters are stripped, and some sort of normalization of HFS paths (possibly with a volume name) to POSIX paths is done on OS X. One result of the latter is that file:///Users%2fbzbarsky ends up seeing my home directory, which is ... slightly surprising. On "Unix", the path bytes are treated as UTF-8 if they're valid UTF-8, else treated as whatever the current locale charset is, I think. Oh, and there is some sort of escaping going on for directory names, filenames, extensions. Not sure what that's about, if anything. The URI-escaping code is black magic, but I'm happy to run some black-box tests on it if someone wants to provide test strings.

The things that don't go through the "Unix" code for this stuff are Windows and OS/2. I'm not going to dig through the OS/2 stuff, but on Windows if the filename contains a nonempty directory name and the second char is '|' that's converted to a ':'. Again, escaping for directory names and file names and extensions. Again, things that look like UTF-8 are treated thus and other stuff uses the current codepage. After all that, the actual equality comparison is done via _wcsicmp on the return value of GetShortPathNameW. So whatever things that combination considers equal are equal.

8) When actually resolving a file:// URL, the underlying file path object as described above is used to get the data. Plus there's a bit of weirdness about symlinks, I think... Mostly affects what's shown in the url bar when pointing the browser to a symlink.

That's what I can spot offhand. I won't guarantee there is nothing else. :(

-Boris
