Re: [whatwg] [URL] Starting work on a URL spec

2010-08-04 Thread Bjartur Thorlacius
On Tue, 03 Aug 2010, Adam Barth w...@adambarth.com wrote:
 On Tue, Aug 3, 2010 at 8:21 AM, bjartur svartma...@gmail.com wrote:
  On 7/25/10 8:57 AM, Adam Barth wrote:
  It may not be an _html_ interoperability problem, but it's certainly a
  _web_ interoperability problem.
 
  It's a question of how HTTP messages are encoded (and in particular the
  encoding of the IRI).
  WHATWG does not specify HTTP; these concerns should be directed to the IETF.
 
 There are various ways to spec lawyer things so you can make this work
 appear to be the responsibility of various folks.  The work needs to
 be done.  I'm inclined to do the work first and worry about what
 organization (if any) has jurisdiction later.
Yeah, true. I've been through a repetitive "ask the county", "ask the school
authorities", "ask the county" loop when asking my school to implement a
SHOULD from the national government. *shrugs*

But really, you should discuss this with the HTTP WG of IETF by raising
the issue in http...@hplb.hp.com. I recommend searching the archives,
http://www.ics.uci.edu/pub/ietf/http/hypermail, for counter-arguments
before posting, as this issue has probably been raised before. Then
someone should fork RFC 2616 (or the latest working draft, if there's
a current one).

Patching the RFC == doing the work (good luck getting consensus on
your side if you don't provide rationale, don't defend your decisions,
and ignore the IETF, though).


Re: [whatwg] [URL] Starting work on a URL spec

2010-08-03 Thread bjartur
 On 7/25/10 8:57 AM, Adam Barth wrote:
  There's also the related question of what browsers should do with input 
  typed into the URL field. Other than establishing that these rules may be 
  different between the URL field and URLs present in content, I'm not sure 
  this is amenable to spec. But perhaps a survey of what browsers do would 
  be useful.
 
  I wasn't planning to cover that because it's not critical to
  interoperability
 
 Unfortunately, it is.  In particular, servers need to know what to
 expect the browser to send if a user types non-ASCII into the url bar.
 There are real interoperability problems out there due to differing
 server and browser behavior in this regard.
 
 It may not be an _html_ interoperability problem, but it's certainly a
 _web_ interoperability problem.
 
It's a question of how HTTP messages are encoded (and in particular the encoding 
of the IRI).
WHATWG does not specify HTTP; these concerns should be directed to the IETF.


Re: [whatwg] [URL] Starting work on a URL spec

2010-08-03 Thread Adam Barth
On Tue, Aug 3, 2010 at 8:21 AM, bjartur svartma...@gmail.com wrote:
 On 7/25/10 8:57 AM, Adam Barth wrote:
  There's also the related question of what browsers should do with input 
  typed into the URL field. Other than establishing that these rules may be 
  different between the URL field and URLs present in content, I'm not sure 
  this is amenable to spec. But perhaps a survey of what browsers do would 
  be useful.
 
  I wasn't planning to cover that because it's not critical to
  interoperability

 Unfortunately, it is.  In particular, servers need to know what to
 expect the browser to send if a user types non-ASCII into the url bar.
 There are real interoperability problems out there due to differing
 server and browser behavior in this regard.

 It may not be an _html_ interoperability problem, but it's certainly a
 _web_ interoperability problem.

 It's a question of how HTTP messages are encoded (and in particular the encoding 
 of the IRI).
 WHATWG does not specify HTTP; these concerns should be directed to the IETF.

There are various ways to spec lawyer things so you can make this work
appear to be the responsibility of various folks.  The work needs to
be done.  I'm inclined to do the work first and worry about what
organization (if any) has jurisdiction later.

Adam


Re: [whatwg] [URL] Starting work on a URL spec

2010-07-26 Thread Maciej Stachowiak

On Jul 25, 2010, at 5:57 AM, Adam Barth wrote:

 2010/7/24 Maciej Stachowiak m...@apple.com:
 On Jul 24, 2010, at 9:55 AM, Adam Barth wrote:
 2010/7/23 Ian Fette (イアンフェッティ) ife...@google.com:
 http://code.google.com/apis/safebrowsing/developers_guide_v2.html#Canonicalization
  lists
 some interesting cases we've come across on the anti-phishing team in
 Google. To the extent you're concerned with / interested in
 canonicalization, it may be worth taking a look at (not to suggest you
 follow that in determining how to parse/canonicalize URLs, but rather to
 make sure that you have some correct way of handling the listed URLs).
 
 Thanks.  That's helpful.
 
 BTW, are you covering canonicalization?
 
 Yes.  The three main things I'm hoping to cover are parsing,
 canonicalization, and resolving relative URLs.
 
 Is there any place in the Web platform where canonicalize is exposed by 
 itself in a Web-facing way? I think resolve against a base and parse into 
 components are the only algorithms whose effects can be observed directly. I 
 think we only need to spec canonicalize if it turns out to be a useful 
 subroutine.
 
 As far as I know, you can only see f(x) =
 canonicalize(parse(resolve(x))) and also some breakdown components of
 f(x) in HTMLAnchorElement and window.location.hash (and friends).
 
 Conceptually, it's a bit easier to think about them as three separate
 functions.  The main difference between parse and canonicalize is that
 parse segments the input and canonicalize takes the segments, mutates
 them, and assembles them into a new string.
 
 I haven't studied resolve in as much detail yet, so I'm less clear how
 that fits into the puzzle.

I would consider canonicalize() to be part of resolve(). Every time you 
retrieve a cooked URL (as opposed to original source text), you both resolve 
it against a possible base and canonicalize it as a single step. The two are 
not exposed separately. It's not clear to me that making this operation into 
three separate steps with a parse in the middle is helpful, or even 
representative of a good implementation strategy. I would think of parse() as 
something that happens after canonicalization in the cases where single 
components of the URL are exposed.
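Maciej's model can be sketched roughly with Python's urllib.parse as a stand-in 
(purely illustrative; Python's library is not a browser implementation, and the 
step names are his, not an API):

```python
from urllib.parse import urljoin, urlsplit

# Step 1: resolve against a possible base and canonicalize, as one
# operation, producing the "cooked" URL.
cooked = urljoin("http://example.com/a/b/c", "../d?q=1#frag")
# cooked == "http://example.com/a/d?q=1#frag"

# Step 2: parse only the cooked URL, in the cases where single
# components are exposed (as Location and HTMLAnchorElement do).
parts = urlsplit(cooked)
# parts.path == "/a/d", parts.query == "q=1"
```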

Regards,
Maciej



Re: [whatwg] [URL] Starting work on a URL spec

2010-07-26 Thread Maciej Stachowiak

On Jul 25, 2010, at 6:43 AM, Boris Zbarsky wrote:

 On 7/25/10 8:57 AM, Adam Barth wrote:
 There's also the related question of what browsers should do with input 
 typed into the URL field. Other than establishing that these rules may be 
 different between the URL field and URLs present in content, I'm not sure 
 this is amenable to spec. But perhaps a survey of what browsers do would be 
 useful.
 
 I wasn't planning to cover that because it's not critical to
 interoperability
 
 Unfortunately, it is.  In particular, servers need to know what to
 expect the browser to send if a user types non-ASCII into the url bar.
 There are real interoperability problems out there due to differing
 server and browser behavior in this regard.
 
 It may not be an _html_ interoperability problem, but it's certainly a
 _web_ interoperability problem.
 
 There are also other
 considerations there because the URLs are displayed to users as
 security indicators.
 
 What's displayed is not a concern, in my opinion, in terms of
 interoperability.  What's put on the wire is.  The constraints that need
 to be imposed are much looser than on a href (e.g. we don't need to
 define exactly what url gets loaded if the user types "monkey" in the
 url bar), but sorting out the non-ASCII issue is definitely desirable.

One thing to keep in mind is that browsers do all sorts of non-interoperable 
things for input that is not a valid URL, such as guessing that it is a 
hostname or performing a search with a search engine. So there's a limit to how 
much this can be spec'd. I agree that for certain URL-like strings that a user 
may type or cut & paste, there is an interop issue.

Regards,
Maciej



Re: [whatwg] [URL] Starting work on a URL spec

2010-07-26 Thread Adam Barth
2010/7/26 Maciej Stachowiak m...@apple.com:
 On Jul 25, 2010, at 5:57 AM, Adam Barth wrote:
 2010/7/24 Maciej Stachowiak m...@apple.com:
 On Jul 24, 2010, at 9:55 AM, Adam Barth wrote:
 2010/7/23 Ian Fette (イアンフェッティ) ife...@google.com:
 http://code.google.com/apis/safebrowsing/developers_guide_v2.html#Canonicalization
  lists
 some interesting cases we've come across on the anti-phishing team in
 Google. To the extent you're concerned with / interested in
 canonicalization, it may be worth taking a look at (not to suggest you
 follow that in determining how to parse/canonicalize URLs, but rather to
 make sure that you have some correct way of handling the listed URLs).

 Thanks.  That's helpful.

 BTW, are you covering canonicalization?

 Yes.  The three main things I'm hoping to cover are parsing,
 canonicalization, and resolving relative URLs.

 Is there any place in the Web platform where canonicalize is exposed by 
 itself in a Web-facing way? I think resolve against a base and parse into 
 components are the only algorithms whose effects can be observed directly. 
 I think we only need to spec canonicalize if it turns out to be a useful 
 subroutine.

 As far as I know, you can only see f(x) =
 canonicalize(parse(resolve(x))) and also some breakdown components of
 f(x) in HTMLAnchorElement and window.location.hash (and friends).

 Conceptually, it's a bit easier to think about them as three separate
 functions.  The main difference between parse and canonicalize is that
 parse segments the input and canonicalize takes the segments, mutates
 them, and assembles them into a new string.

 I haven't studied resolve in as much detail yet, so I'm less clear how
 that fits into the puzzle.

 I would consider canonicalize() to be part of resolve(). Every time you 
 retrieve a cooked URL (as opposed to original source text), you both 
 resolve it against a possible base and canonicalize it as a single step. The 
 two are not exposed separately. It's not clear to me that making this 
 operation into three separate steps with a parse in the middle is helpful, or 
 even representative of a good implementation strategy. I would think of 
 parse() as something that happens after canonicalization in the cases where 
 single components of the URL are exposed.

That's an interesting way to think about what's going on.  Different
parts of the URL get different canonicalization transformations
applied to them.  For example, the range of characters that make sense
in a host name are different than those that make sense in a port or
query, so, in some sense, the canonicalization algorithm needs to
understand something about how the URL parses, or at least how to
distinguish host names from, e.g., ports and queries.
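Adam's point, that each component gets its own canonicalization rules, can be 
sketched with a few toy rules (all invented for illustration; not any browser's 
actual algorithm):

```python
from urllib.parse import quote

def canonicalize_host(host: str) -> str:
    # Host names are case-insensitive, so they are lowercased
    # (IDNA handling of non-ASCII hosts is omitted here).
    return host.lower()

def canonicalize_port(port: str, scheme: str) -> str:
    # A port matching the scheme's default is dropped entirely.
    defaults = {"http": "80", "https": "443"}
    return "" if defaults.get(scheme) == port else port

def canonicalize_query(query: str) -> str:
    # Queries keep their case; only unsafe characters are escaped.
    return quote(query, safe="=&")
```

Because a lowercasing rule that is right for hosts would be wrong for queries, 
the canonicalizer has to know which segment it is looking at, which is the sense 
in which it needs to understand how the URL parses.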

Adam


Re: [whatwg] [URL] Starting work on a URL spec

2010-07-26 Thread Maciej Stachowiak

On Jul 25, 2010, at 11:16 PM, Adam Barth wrote:

 2010/7/26 Maciej Stachowiak m...@apple.com:
 On Jul 25, 2010, at 5:57 AM, Adam Barth wrote:
 2010/7/24 Maciej Stachowiak m...@apple.com:
 On Jul 24, 2010, at 9:55 AM, Adam Barth wrote:
 2010/7/23 Ian Fette (イアンフェッティ) ife...@google.com:
 http://code.google.com/apis/safebrowsing/developers_guide_v2.html#Canonicalization
  lists
 some interesting cases we've come across on the anti-phishing team in
 Google. To the extent you're concerned with / interested in
 canonicalization, it may be worth taking a look at (not to suggest you
 follow that in determining how to parse/canonicalize URLs, but rather to
 make sure that you have some correct way of handling the listed URLs).
 
 Thanks.  That's helpful.
 
 BTW, are you covering canonicalization?
 
 Yes.  The three main things I'm hoping to cover are parsing,
 canonicalization, and resolving relative URLs.
 
 Is there any place in the Web platform where canonicalize is exposed by 
 itself in a Web-facing way? I think resolve against a base and parse into 
 components are the only algorithms whose effects can be observed directly. 
 I think we only need to spec canonicalize if it turns out to be a useful 
 subroutine.
 
 As far as I know, you can only see f(x) =
 canonicalize(parse(resolve(x))) and also some breakdown components of
 f(x) in HTMLAnchorElement and window.location.hash (and friends).
 
 Conceptually, it's a bit easier to think about them as three separate
 functions.  The main difference between parse and canonicalize is that
 parse segments the input and canonicalize takes the segments, mutates
 them, and assembles them into a new string.
 
 I haven't studied resolve in as much detail yet, so I'm less clear how
 that fits into the puzzle.
 
 I would consider canonicalize() to be part of resolve(). Every time you 
 retrieve a cooked URL (as opposed to original source text), you both 
 resolve it against a possible base and canonicalize it as a single step. The 
 two are not exposed separately. It's not clear to me that making this 
 operation into three separate steps with a parse in the middle is helpful, 
 or even representative of a good implementation strategy. I would think of 
 parse() as something that happens after canonicalization in the cases where 
 single components of the URL are exposed.
 
 That's an interesting way to think about what's going on.  Different
 parts of the URL get different canonicalization transformations
 applied to them.  For example, the range of characters that make sense
 in a host name are different than those that make sense in a port or
 query, so, in some sense, the canonicalization algorithm needs to
 understand something about how the URL parses, or at least how to
 distinguish host names from, e.g., ports and queries.

Yes, but the relative resolution algorithm needs to find URL part boundaries as 
well. I guess part of the issue here is that we have two different senses of 
"parse":

(1) Find the URL component boundaries in a source string, to be used by other 
algorithms for reference purposes. In that sense, you may need to do it to both 
the base URL and the possibly-relative reference before resolve(). However, 
this step isn't really exposed directly to the Web.

(2) Extract URL components of a resolved canonicalized URL, with the 
appropriate post-processing to expose them via APIs like Location and 
HTMLAnchorElement.

I've been thinking of parse() in sense #2, since that is the version actually 
exposed as API. You can think of this as taking a resolved canonicalized URL as 
input, and having a tuple of strings representing the components as output. The 
only other public operation is resolve+canonicalize, which conceptually takes a 
base URL, a possibly relative URL reference, and an optional document encoding 
as input, and which produces the resolved canonicalized URL as output.

While there are other ways to factor these operations, using a different 
approach will make it less obvious how to glue them to the relevant other specs.

Regards,
Maciej





Re: [whatwg] [URL] Starting work on a URL spec

2010-07-25 Thread Adam Barth
2010/7/24 Maciej Stachowiak m...@apple.com:
 On Jul 24, 2010, at 9:55 AM, Adam Barth wrote:
 2010/7/23 Ian Fette (イアンフェッティ) ife...@google.com:
 http://code.google.com/apis/safebrowsing/developers_guide_v2.html#Canonicalization
  lists
 some interesting cases we've come across on the anti-phishing team in
 Google. To the extent you're concerned with / interested in
 canonicalization, it may be worth taking a look at (not to suggest you
 follow that in determining how to parse/canonicalize URLs, but rather to
 make sure that you have some correct way of handling the listed URLs).

 Thanks.  That's helpful.

 BTW, are you covering canonicalization?

 Yes.  The three main things I'm hoping to cover are parsing,
 canonicalization, and resolving relative URLs.

 Is there any place in the Web platform where canonicalize is exposed by 
 itself in a Web-facing way? I think resolve against a base and parse into 
 components are the only algorithms whose effects can be observed directly. I 
 think we only need to spec canonicalize if it turns out to be a useful 
 subroutine.

As far as I know, you can only see f(x) =
canonicalize(parse(resolve(x))) and also some breakdown components of
f(x) in HTMLAnchorElement and window.location.hash (and friends).

Conceptually, it's a bit easier to think about them as three separate
functions.  The main difference between parse and canonicalize is that
parse segments the input and canonicalize takes the segments, mutates
them, and assembles them into a new string.

I haven't studied resolve in as much detail yet, so I'm less clear how
that fits into the puzzle.
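As a runnable toy of the three-function split just described (the rules here are 
invented for illustration and deliberately minimal):

```python
def resolve(base: str, ref: str) -> str:
    # Toy resolve: absolute references win; relative references are
    # joined against the base's directory (dot-segments omitted).
    return ref if "://" in ref else base.rsplit("/", 1)[0] + "/" + ref

def parse(url: str) -> dict:
    # parse segments the input without changing it.
    scheme, _, rest = url.partition("://")
    host, _, path = rest.partition("/")
    return {"scheme": scheme, "host": host, "path": "/" + path}

def canonicalize(parts: dict) -> str:
    # canonicalize mutates the segments (per-segment rules) and
    # assembles them into a new string.
    return f'{parts["scheme"].lower()}://{parts["host"].lower()}{parts["path"]}'

# The composition observable on the Web platform:
url = canonicalize(parse(resolve("http://example.com/a", "HTTP://Example.COM/B")))
# url == "http://example.com/B"
```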

 There's also the related question of what browsers should do with input typed 
 into the URL field. Other than establishing that these rules may be different 
 between the URL field and URLs present in content, I'm not sure this is 
 amenable to spec. But perhaps a survey of what browsers do would be useful.

I wasn't planning to cover that because it's not critical to
interoperability, at least not in the same way as understanding what to
do with the href attribute of the a tag is.  There are also other
considerations there because the URLs are displayed to users as
security indicators.

Adam


Re: [whatwg] [URL] Starting work on a URL spec

2010-07-25 Thread Adam Barth
2010/7/25 Boris Zbarsky bzbar...@mit.edu:
 On 7/25/10 8:57 AM, Adam Barth wrote:
 There's also the related question of what browsers should do with input 
 typed into the URL field. Other than establishing that these rules may be 
 different between the URL field and URLs present in content, I'm not sure 
 this is amenable to spec. But perhaps a survey of what browsers do would be 
 useful.

 I wasn't planning to cover that because it's not critical to
 interoperability

 Unfortunately, it is.  In particular, servers need to know what to
 expect the browser to send if a user types non-ASCII into the url bar.
 There are real interoperability problems out there due to differing
 server and browser behavior in this regard.

 It may not be an _html_ interoperability problem, but it's certainly a
 _web_ interoperability problem.

 There are also other
 considerations there because the URLs are displayed to users as
 security indicators.

 What's displayed is not a concern, in my opinion, in terms of
 interoperability.  What's put on the wire is.  The constraints that need
 to be imposed are much looser than on a href (e.g. we don't need to
 define exactly what url gets loaded if the user types "monkey" in the
 url bar), but sorting out the non-ASCII issue is definitely desirable.

Okiedokes.  I'll add that to my list of things to pay attention to.  I
can't promise I'll get to it in this round though.

Thanks,
Adam


Re: [whatwg] [URL] Starting work on a URL spec

2010-07-25 Thread Ian Hickson
On Sun, 25 Jul 2010, Adam Barth wrote:

 As far as I know, you can only see f(x) = 
 canonicalize(parse(resolve(x))) and also some breakdown components of 
 f(x) in HTMLAnchorElement and window.location.hash (and friends).

Can you see the result of resolve(x) without seeing its result go through 
parse() and canonicalize()? If not, then we should just define resolve() 
as doing the canonicalize() step.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] [URL] Starting work on a URL spec

2010-07-25 Thread Adam Barth
On Sun, Jul 25, 2010 at 6:05 PM, Ian Hickson i...@hixie.ch wrote:
 On Sun, 25 Jul 2010, Adam Barth wrote:

 As far as I know, you can only see f(x) =
 canonicalize(parse(resolve(x))) and also some breakdown components of
 f(x) in HTMLAnchorElement and window.location.hash (and friends).

 Can you see the result of resolve(x) without seeing its result go through
 parse() and canonicalize()?

I don't know of any way to do that.  I can tell you that in WebKit,
the function that usually gets called to resolve URLs (called
completeURL if you want to look it up in the source) returns a
canonicalized URL.

 If not, then we should just define resolve()
 as doing the canonicalize() step.

Yeah, what might make the most sense is to use canonicalize to
post-process both resolving and parsing.  We can choose the names so
that calling the canonicalizing version is easy.

Adam


Re: [whatwg] [URL] Starting work on a URL spec

2010-07-25 Thread Boris Zbarsky

On 7/25/10 3:05 PM, Adam Barth wrote:

I don't know of any way to do that.  I can tell you that in WebKit,
the function that usually gets called to resolve URLs (called
completeURL if you want to look it up in the source) returns a
canonicalized URL.


The same is true in Gecko.  The way an nsIURI object is typically 
constructed is from an nsIURI base (possibly null) and a URL string, and 
the return value is resolved and canonicalized.


-Boris


Re: [whatwg] [URL] Starting work on a URL spec

2010-07-24 Thread Boris Zbarsky

On 7/24/10 1:50 AM, Brett Zamir wrote:

I would be particularly interested in data on this last, across
different browsers, operating systems, and locales... There seem to be
servers out there expecting their URIs in UTF-8 and others expecting
them in ISO-8859-1, and it's not clear to me how to make things work
with them all.


Seems to me that if they are not in UTF-8, they should be treated as
bugs, even if that is not a de jure standard.


Treated as bugs by whom?

The scenario is that a user types some non-ASCII text in the url bar. 
This needs to be url-encoded to actually go on the wire, which raises 
the question of what encoding.  If the user is using IRIs, the answer is 
UTF-8.  A number of servers barf if you do this, especially because some 
server-side scripting languages (PHP, e.g., last I checked) default to 
URI-unescaping via something other than UTF-8.


So some browsers encode the non-query part of the URI as UTF-8 and the 
query part as ... something (user's default filesystem encoding, say, 
for lack of a better guess).  Others always use UTF-8 (and end up with 
some servers not usable).  Others... I have no idea.  That's why I want 
data.  ;)  In particular, while the "just use UTF-8, and if the user 
can't access the site, sucks to be the user" approach has a certain 
theoretical-purity appeal, it doesn't seem like something I want to do 
to my friends and family (always a good criterion for things you'd like 
to do to users).
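The encoding split Boris describes can be made concrete with a short sketch 
(illustration only; which encoding a given browser actually picks is exactly 
the open question):

```python
from urllib.parse import quote

# The same typed string yields different bytes on the wire depending
# on which encoding is chosen before percent-escaping -- this is the
# server/browser interop ambiguity under discussion.
typed = "münchen"
utf8_form = quote(typed.encode("utf-8"))         # m%C3%BCnchen
latin1_form = quote(typed.encode("iso-8859-1"))  # m%FCnchen
```

A server expecting one form will mis-decode the other, which is why "always use 
UTF-8" breaks against servers that unescape with a legacy encoding.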


-Boris


Re: [whatwg] [URL] Starting work on a URL spec

2010-07-24 Thread Boris Zbarsky

On 7/24/10 2:49 AM, Brett Zamir wrote:

By the servers/scripting languages. While it is great that the browsers
are involved in the process, I think it would be reasonable to invite
the other stake-holders to join the discussions.


If they're willing to talk to us, great.  My past experience talking to 
server developers has been ... suboptimal enough that now I just route 
around the damage instead, by default.  You may be right that in this 
case that's not a good idea.



Hopefully to be fixed in PHP6 with its promise of full Unicode support...

Though per http://www.slideshare.net/kfish/unicode-php6-presentation :


Right.  Not holding my breath yet.  ;)


What I meant is to try to get the server systems on board to fix the
issue, including in the long-term. I appreciate you all being admirably
practical champions of present-day compatibility, though I'd hope there
is a vision to make things work better for the future.


Yep.  That vision is "always use UTF-8"; there are just coordination 
problems getting there.


-Boris


Re: [whatwg] [URL] Starting work on a URL spec

2010-07-24 Thread Adam Barth
2010/7/23 Ian Fette (イアンフェッティ) ife...@google.com:
 http://code.google.com/apis/safebrowsing/developers_guide_v2.html#Canonicalization
  lists
 some interesting cases we've come across on the anti-phishing team in
 Google. To the extent you're concerned with / interested in
 canonicalization, it may be worth taking a look at (not to suggest you
 follow that in determining how to parse/canonicalize URLs, but rather to
 make sure that you have some correct way of handling the listed URLs).

Thanks.  That's helpful.

 BTW, are you covering canonicalization?

Yes.  The three main things I'm hoping to cover are parsing,
canonicalization, and resolving relative URLs.

Adam


 On Fri, Jul 23, 2010 at 9:02 PM, Boris Zbarsky bzbar...@mit.edu wrote:
 On 7/23/10 11:59 PM, Silvia Pfeiffer wrote:
 Is that URLs as values of attributes in HTML or is that URLs as pasted
 into the address bar? I believe their processing differs...

 It certainly does in Firefox (the latter have a lot more fixup done to
 them, and there are also differences in terms of how character encodings are
 handled).

 I would be particularly interested in data on this last, across different
 browsers, operating systems, and locales...  There seem to be servers out
 there expecting their URIs in UTF-8 and others expecting them in ISO-8859-1,
 and it's not clear to me how to make things work with them all.

 -Boris




Re: [whatwg] [URL] Starting work on a URL spec

2010-07-24 Thread Peter Kasting
On Fri, Jul 23, 2010 at 8:59 PM, Silvia Pfeiffer
silviapfeiff...@gmail.comwrote:

 Is that URLs as values of attributes in HTML or is that URLs as pasted into
 the address bar? I believe their processing differs...


I strongly suggest ignoring browser address bars.  As the author of most of
the Chromium omnibox code, I can testify that there's a ton of fixup,
heuristics, and other stuff that's designed to get the user what they want
that should never be in a spec.  I think limiting the scope to URLs consumed
as part of web content makes more sense.

PK


Re: [whatwg] [URL] Starting work on a URL spec

2010-07-24 Thread Maciej Stachowiak

On Jul 24, 2010, at 9:55 AM, Adam Barth wrote:

 2010/7/23 Ian Fette (イアンフェッティ) ife...@google.com:
 http://code.google.com/apis/safebrowsing/developers_guide_v2.html#Canonicalization
  lists
 some interesting cases we've come across on the anti-phishing team in
 Google. To the extent you're concerned with / interested in
 canonicalizaiton, it may be worth taking a look at (not to suggest you
 follow that in determining how to parse/canonicalize URLs, but rather to
 make sure that you have some correct way of handling the listed URLs).
 
 Thanks.  That's helpful.
 
 BTW, are you covering canonicalization?
 
 Yes.  The three main things I'm hoping to cover are parsing,
 canonicalization, and resolving relative URLs.

Is there any place in the Web platform where canonicalize is exposed by 
itself in a Web-facing way? I think resolve against a base and parse into 
components are the only algorithms whose effects can be observed directly. I 
think we only need to spec canonicalize if it turns out to be a useful 
subroutine.

There's also the related question of what browsers should do with input typed 
into the URL field. Other than establishing that these rules may be different 
between the URL field and URLs present in content, I'm not sure this is 
amenable to spec. But perhaps a survey of what browsers do would be useful.

Regards,
Maciej



Re: [whatwg] [URL] Starting work on a URL spec

2010-07-23 Thread Boris Zbarsky

On 7/23/10 3:11 PM, Adam Barth wrote:

Please let me know if you know of any public URL parsing test suites.
My main starting point will be the WebKit URL parsing test suite,


There's a bit at 
http://mxr.mozilla.org/mozilla-central/source/netwerk/test/unit/test_standardurl.js


I thought there was some other stuff there too, but can't find it at the 
moment.  This only tests authority urls.


-Boris


Re: [whatwg] [URL] Starting work on a URL spec

2010-07-23 Thread Charles McCathieNevile

On Fri, 23 Jul 2010 21:11:35 +0200, Adam Barth w...@adambarth.com wrote:


I've begun working on a specification for how browsers process URLs:

http://github.com/abarth/url-spec

The repository is currently empty, but I'll be adding the basic
skeleton over the next few weeks.  My intention is to triangulate
between how IE, Firefox, Chrome, Safari, and Opera process URLs to
find an algorithm that is both compatible with the web and moderately
sane.


Good luck ;)

Seriously, it is probably worth looking at things like curl that are not  
browsers but consume URLs (and in turn are used by various systems that  
interact with URLs for things like software updates, synchronisation, etc).


cheers

Chaals

--
Charles McCathieNevile  Opera Software, Standards Group
je parle français -- hablo español -- jeg lærer norsk
http://my.opera.com/chaals   Try Opera: http://www.opera.com


Re: [whatwg] [URL] Starting work on a URL spec

2010-07-23 Thread Silvia Pfeiffer
Is that URLs as values of attributes in HTML or is that URLs as pasted into
the address bar? I believe their processing differs...
Good luck with it, anyway. I'm sure you've seen http://esw.w3.org/UriTesting.

Cheers,
Silvia.

On Sat, Jul 24, 2010 at 5:11 AM, Adam Barth w...@adambarth.com wrote:

 I've begun working on a specification for how browsers process URLs:

 http://github.com/abarth/url-spec

 The repository is currently empty, but I'll be adding the basic
 skeleton over the next few weeks.  My intention is to triangulate
 between how IE, Firefox, Chrome, Safari, and Opera process URLs to
 find an algorithm that is both compatible with the web and moderately
 sane.

 Please let me know if you know of any public URL parsing test suites.
 My main starting point will be the WebKit URL parsing test suite,

 http://trac.webkit.org/browser/trunk/LayoutTests/fast/url

 which was adapted from the GURL parsing library.

 Thanks,
 Adam



Re: [whatwg] [URL] Starting work on a URL spec

2010-07-23 Thread Boris Zbarsky

On 7/23/10 11:59 PM, Silvia Pfeiffer wrote:

Is that URLs as values of attributes in HTML or is that URLs as pasted
into the address bar? I believe their processing differs...


It certainly does in Firefox (the latter have a lot more fixup done to 
them, and there are also differences in terms of how character encodings 
are handled).


I would be particularly interested in data on this last, across 
different browsers, operating systems, and locales...  There seem to be 
servers out there expecting their URIs in UTF-8 and others expecting 
them in ISO-8859-1, and it's not clear to me how to make things work 
with them all.


-Boris


Re: [whatwg] [URL] Starting work on a URL spec

2010-07-23 Thread イアンフェッティ
http://code.google.com/apis/safebrowsing/developers_guide_v2.html#Canonicalization
lists some interesting cases we've come across on the anti-phishing team in
Google. To the extent you're concerned with / interested in
canonicalization, it may be worth taking a look at (not to suggest you
follow that in determining how to parse/canonicalize URLs, but rather to
make sure that you have some correct way of handling the listed URLs).

BTW, are you covering canonicalization?

-Ian

On Fri, Jul 23, 2010 at 9:02 PM, Boris Zbarsky bzbar...@mit.edu wrote:

 On 7/23/10 11:59 PM, Silvia Pfeiffer wrote:

 Is that URLs as values of attributes in HTML or is that URLs as pasted
 into the address bar? I believe their processing differs...


 It certainly does in Firefox (the latter have a lot more fixup done to
 them, and there are also differences in terms of how character encodings are
 handled).

 I would be particularly interested in data on this last, across different
 browsers, operating systems, and locales...  There seem to be servers out
 there expecting their URIs in UTF-8 and others expecting them in ISO-8859-1,
 and it's not clear to me how to make things work with them all.

 -Boris



Re: [whatwg] [URL] Starting work on a URL spec

2010-07-23 Thread Brett Zamir

 On 7/24/2010 12:02 PM, Boris Zbarsky wrote:

On 7/23/10 11:59 PM, Silvia Pfeiffer wrote:

Is that URLs as values of attributes in HTML or is that URLs as pasted
into the address bar? I believe their processing differs...


It certainly does in Firefox (the latter have a lot more fixup done to 
them, and there are also differences in terms of how character 
encodings are handled).


I would be particularly interested in data on this last, across 
different browsers, operating systems, and locales...  There seem to 
be servers out there expecting their URIs in UTF-8 and others 
expecting them in ISO-8859-1, and it's not clear to me how to make 
things work with them all.


Seems to me that if they are not in UTF-8, they should be treated as 
bugs, even if that is not a de jure standard.


Brett