Re: [whatwg] Web Addresses vs Legacy Extended IRI (again)

2009-04-30 Thread Ian Hickson
On Sun, 29 Mar 2009, Giovanni Campagna wrote:

 (In this email I will use URL5 as shorthand for Web Addresses, as that 
 previously was the URL part of HTML5.)

This section is to be extracted from HTML5 shortly. I've forwarded your 
e-mails to DanC, the editor of the Web Addresses spec.

Cheers,
-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-04-27 Thread Ian Hickson
On Sun, 22 Mar 2009, Giovanni Campagna wrote:
 
  As far as I can tell the LEIRI requirements aren't actually an 
  accurate description of what browsers do.
 
 My question was more specific: what are the *technical differences* 
 between LEIRI and Web Addresses?

I don't think there's a complete documentation of this anywhere.


 Can't we have one technology instead of two?

Web addresses and LEIRIs are both maintained by the W3C and the IETF now, 
so I recommend discussing this with the editors of the relevant specs. The 
relevant section is going to be removed from HTML5 as soon as practical.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-04-27 Thread Ian Hickson
On Mon, 23 Mar 2009, Julian Reschke wrote:
 Ian Hickson wrote:
 
  Note that the Web addresses draft isn't specific to HTML5. It is 
  intended to apply to any user agent that interacts with Web content, 
  not just Web browsers and HTML. (That's why we took it out of HTML5.) 
 
 Be careful; depending on what you call Web content. For instance, I 
 would consider the Atom feed content (RFC4287) as Web content, but 
 Atom really uses IRIs, and doesn't need workarounds for broken IRIs in 
 content (as far as I can tell).

There are implementations of Atom that treat the URLs therein just like 
those in HTML content. I haven't studied existing content to see if this 
is required for compatibility, though. I wouldn't be surprised if it was, 
since much Atom content is just generated based on content that is 
primarily intended for HTML generation.


 Don't leak out workarounds into areas where they aren't needed.

I'd much rather we just had one set of interpretations of URLs, defined in 
one place, than the four or more we have now.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Web Addresses vs Legacy Extended IRI (again)

2009-03-30 Thread Giovanni Campagna
2009/3/29 Kristof Zelechovski giecr...@stegny.2a.pl:
 It is not clear that the server will be able to correctly support various
 representations of characters in the path component, e.g. identify accented
 characters with their decompositions using combining diacritical marks.  The
 peculiarities can depend on the underlying file system conventions.
 Therefore, if all representations are considered equally appropriate,
 various resources may suddenly become unavailable, depending on the encoding
 decisions taken by the user agent.
 Chris

It is not clear to me that the server will be able to support the
composed form of à or ø. Where is the conversion from ISO-8859-1 to
UCS specified? Nowhere.
If a server knows it cannot deal with Unicode Normalization, it should
either use an encoding form of Unicode (UTF-8, UTF-16), implement a
technology that uses IRIs directly (because Normalization is
introduced only when converting to a URI), or generate IRIs with
binary path data in opaque form (i.e. percent-encoded).
By the way, the server should be able to deal with both composed and
decomposed forms of accented characters (or use neither of them), because
I may type the path directly in my address bar (do you know what IME I
use?).
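As a minimal illustration of this ambiguity (a Python sketch using only the standard library, not part of the original thread): the composed and decomposed spellings of "à" are canonically equivalent but percent-encode differently, which is why emitting pre-encoded opaque path data sidesteps the problem.

```python
# The same accented character in its NFC (composed) and NFD (decomposed)
# forms produces different percent-encoded paths under UTF-8.
import unicodedata
from urllib.parse import quote

nfc = unicodedata.normalize("NFC", "a\u0300")  # single codepoint U+00E0
nfd = unicodedata.normalize("NFD", "\u00e0")   # "a" + U+0300 combining accent

print(quote(nfc))  # %C3%A0
print(quote(nfd))  # a%CC%80
```

A server that percent-encodes its paths itself, rather than leaving the conversion to the user agent, at least controls which of the two forms goes on the wire.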

Giovanni


Re: [whatwg] Web Addresses vs Legacy Extended IRI (again)

2009-03-29 Thread Giovanni Campagna
2009/3/29 Anne van Kesteren ann...@opera.com:
 On Sun, 29 Mar 2009 14:37:19 +0200, Giovanni Campagna
 scampa.giova...@gmail.com wrote:

 Summing up, the differences between URL5 and LEIRI are only about the
 percent sign and its uses for delimiters.

 I'm not sure if you're correct about those differences, but even if you are
 they are not the only differences. E.g. LEIRIs perform normalization if the
 input encoding is non-Unicode. URLs do not. URLs can encode their query
 component per the input encoding (and do so for HTML and some APIs). LEIRIs
 cannot.

What is the problem with normalization? Is there a standard for
conversion from non-Unicode to Unicode?
I guess not, so normalization (which should always be done) is perfectly legal.

In addition, IRIs are defined as a sequence of Unicode codepoints. It
does not matter how those codepoints are stored (ASCII, ISO-8859-1,
UTF-8); only the Unicode version of them matters.
This is the same as URL5s, by the way, because neither is defined
on octets and both use the RFC 3986 method for percent-encoding (using
UTF-8).

 (Also, I'm not sure if the WHATWG list is the right place to discuss this as
 the editor of the new draft might not read this list at all.)


Unfortunately, I cannot join the public-html list. I could cross-post
this to www-html or www-archive but it would break the archives and
make it difficult to follow.

 --
 Anne van Kesteren
 http://annevankesteren.nl/


Giovanni


Re: [whatwg] Web Addresses vs Legacy Extended IRI (again)

2009-03-29 Thread Anne van Kesteren
On Sun, 29 Mar 2009 15:01:51 +0200, Giovanni Campagna  
scampa.giova...@gmail.com wrote:

2009/3/29 Anne van Kesteren ann...@opera.com:
I'm not sure if you're correct about those differences, but even if you  
are they are not the only differences. E.g. LEIRIs perform  
normalization if the input encoding is non-Unicode. URLs do not. URLs  
can encode their query
component per the input encoding (and do so for HTML and some APIs).  
LEIRIs cannot.


What is the problem with normalization? Is there a standard for
conversion from non-Unicode to Unicode?
I guess not, so normalization (which should always be done) is perfectly  
legal.


It's about Unicode Normalization. (And it should not always be done.)



In addition, IRIs are defined as a sequence of Unicode codepoints. It
does not matter how those codepoints are stored (ASCII, ISO-8859-1,
UTF-8), only the Unicode version of them.


Please read the IRI specification again. Specifically section 3.1.



This is the same as URL5s, by the way, because neither is defined
on octets and both use the RFC 3986 method for percent-encoding (using
UTF-8)


No, it's not always using UTF-8.


--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] Web Addresses vs Legacy Extended IRI (again)

2009-03-29 Thread Giovanni Campagna
2009/3/29 Anne van Kesteren ann...@opera.com:
 On Sun, 29 Mar 2009 15:01:51 +0200, Giovanni Campagna
 scampa.giova...@gmail.com wrote:

 2009/3/29 Anne van Kesteren ann...@opera.com:

 I'm not sure if you're correct about those differences, but even if you
 are they are not the only differences. E.g. LEIRIs perform normalization if
 the input encoding is non-Unicode. URLs do not. URLs can encode their query
 component per the input encoding (and do so for HTML and some APIs).
 LEIRIs cannot.

 What is the problem with normalization? Is there a standard for
 conversion from non-Unicode to Unicode?
 I guess not, so normalization (which should always be done) is perfectly
 legal.

 It's about Unicode Normalization. (And it should not always be done.)

If I convert from ISO-8859-1 and find À (decimal 192), I can emit
"À" (U+00C0 LATIN CAPITAL LETTER A WITH GRAVE) or "A" (U+0041 LATIN
CAPITAL LETTER A) followed by U+0300 COMBINING GRAVE ACCENT.
One is NFC, the other is NFD, and both are legal and simple.
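The equivalence Giovanni describes is easy to check (an illustrative sketch using Python's standard unicodedata module):

```python
# NFC composes "A" + U+0300 into the single codepoint U+00C0;
# NFD decomposes U+00C0 back into the two-codepoint sequence.
import unicodedata

composed = "\u00c0"     # À, LATIN CAPITAL LETTER A WITH GRAVE
decomposed = "A\u0300"  # A followed by COMBINING GRAVE ACCENT

print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True
```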


 In addition, IRIs are defined as a sequence of Unicode codepoints. It
 does not matter how those codepoints are stored (ASCII, ISO-8859-1,
 UTF-8), only the Unicode version of them.

 Please read the IRI specification again. Specifically section 3.1.

The specification says that an IRI must be in normalized UCS when created
from user input; otherwise it must be converted to Unicode if it is not
already (and the conversion should be normalizing); text already in
UTF-8/16/32 is converted to UCS but not normalized.
I don't see a particular problem in this.

 This is the same as URL5s, by the way, because neither is defined
 on octets and both use the RFC 3986 method for percent-encoding (using
 UTF-8)

 No, it's not always using UTF-8.

RFC 3986 never creates percent-encoding itself (percent-encoding is used
for unspecified binary data) but says that text components should be
encoded as UTF-8 and that the rules are established by scheme-specific
syntaxes.

 --
 Anne van Kesteren
 http://annevankesteren.nl/


Giovanni


Re: [whatwg] Web Addresses vs Legacy Extended IRI (again)

2009-03-29 Thread Kristof Zelechovski
It is not clear that the server will be able to correctly support various
representations of characters in the path component, e.g. identify accented
characters with their decompositions using combining diacritical marks.  The
peculiarities can depend on the underlying file system conventions.
Therefore, if all representations are considered equally appropriate,
various resources may suddenly become unavailable, depending on the encoding
decisions taken by the user agent.
Chris




Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-23 Thread Julian Reschke

Ian Hickson wrote:

...
Note that the Web addresses draft isn't specific to HTML5. It is intended 
to apply to any user agent that interacts with Web content, not just Web 
browsers and HTML. (That's why we took it out of HTML5.)

...


Be careful; it depends on what you call "Web content". For instance, I 
would consider Atom feed content (RFC 4287) as Web content, but 
Atom really uses IRIs, and doesn't need workarounds for broken IRIs in 
content (as far as I can tell).


Don't leak out workarounds into areas where they aren't needed.

BR, Julian




Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-23 Thread Anne van Kesteren
On Mon, 23 Mar 2009 09:45:39 +0100, Julian Reschke julian.resc...@gmx.de  
wrote:

Ian Hickson wrote:

...
Note that the Web addresses draft isn't specific to HTML5. It is  
intended to apply to any user agent that interacts with Web content,  
not just Web browsers and HTML. (That's why we took it out of HTML5.)

...


Be careful; depending on what you call Web content. For instance, I  
would consider the Atom feed content (RFC4287) as Web content, but  
Atom really uses IRIs, and doesn't need workarounds for broken IRIs in  
content (as far as I can tell).


Are you sure browser implementations of feeds reject non-IRIs in some way?  
I would expect them to use the same URL handling everywhere.




Don't leak out workarounds into areas where they aren't needed.


I'm not convinced that having two ways of handling essentially the same  
thing is good.



--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-23 Thread Julian Reschke

Anne van Kesteren wrote:
Be careful; depending on what you call Web content. For instance, I 
would consider the Atom feed content (RFC4287) as Web content, but 
Atom really uses IRIs, and doesn't need workarounds for broken IRIs in 
content (as far as I can tell).


Are you sure browser implementations of feeds reject non-IRIs in some 
way? I would expect them to use the same URL handling everywhere.


I wasn't talking of browser implementations of feeds, but feed readers 
in general.



Don't leak out workarounds into areas where they aren't needed.


I'm not convinced that having two ways of handling essentially the same 
thing is good.


It's unavoidable, as the relaxed syntax doesn't work in many cases, for 
instance, when whitespace acts as a delimiter.


BR, Julian


Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-23 Thread Anne van Kesteren
On Mon, 23 Mar 2009 11:25:19 +0100, Julian Reschke julian.resc...@gmx.de  
wrote:

Anne van Kesteren wrote:
Be careful; depending on what you call Web content. For instance, I  
would consider the Atom feed content (RFC4287) as Web content, but  
Atom really uses IRIs, and doesn't need workarounds for broken IRIs in  
content (as far as I can tell).
 Are you sure browser implementations of feeds reject non-IRIs in some  
way? I would expect them to use the same URL handling everywhere.


I wasn't talking of browser implementations of feeds, but feed readers  
in general.


Well yes, and a subset of those is browser based. Besides that, most feed  
readers handle HTML. Do you think they should have two separate URL  
parsing functions?




Don't leak out workarounds into areas where they aren't needed.
 I'm not convinced that having two ways of handling essentially the  
same thing is good.


It's unavoidable, as the relaxed syntax doesn't work in many cases, for  
instance, when whitespace acts as a delimiter.


Obviously you would first split on whitespace and then parse the URLs. You  
can still use the same generic URL handling.
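Anne's two-step approach can be sketched in a few lines (illustrative Python; the helper name and the example attribute value are hypothetical, not from any spec):

```python
# Tokenize a whitespace-delimited list of URLs first, then hand each
# token to the one generic URL parser.
from urllib.parse import urlsplit

def parse_url_list(value: str):
    """Split on ASCII whitespace, then parse each token as a URL."""
    return [urlsplit(token) for token in value.split()]

parts = parse_url_list("http://example.com/a  http://example.org/b")
print([p.netloc for p in parts])  # ['example.com', 'example.org']
```

The splitting rule lives in the host language (HTML attribute syntax, in this case), while the URL parser itself stays shared.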



--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-23 Thread Julian Reschke

Anne van Kesteren wrote:
I wasn't talking of browser implementations of feeds, but feed 
readers in general.


Well yes, and a subset of those is browser based. Besides that, most 
feed readers handle HTML. Do you think they should have two separate URL 
parsing functions?


Yes, absolutely.


Don't leak out workarounds into areas where they aren't needed.
 I'm not convinced that having two ways of handling essentially the 
same thing is good.


It's unavoidable, as the relaxed syntax doesn't work in many cases, 
for instance, when whitespace acts as a delimiter.


Obviously you would first split on whitespace and then parse the URLs. 
You can still use the same generic URL handling.


In which case IRI handling should be totally sufficient.

Best regards, Julian


Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-23 Thread Anne van Kesteren
On Mon, 23 Mar 2009 11:31:01 +0100, Julian Reschke julian.resc...@gmx.de  
wrote:

Anne van Kesteren wrote:
Well yes, and a subset of those is browser based. Besides that, most  
feed readers handle HTML. Do you think they should have two separate  
URL parsing functions?


Yes, absolutely.


Why?


 I'm not convinced that having two ways of handling essentially the  
same thing is good.


It's unavoidable, as the relaxed syntax doesn't work in many cases,  
for instance, when whitespace acts as a delimiter.


Obviously you would first split on whitespace and then parse the URLs.  
You can still use the same generic URL handling.


In which case IRI handling should be totally sufficient.


I don't follow. I said "I'm not convinced that having two ways of handling  
essentially the same thing is good." Then you said "It's unavoidable."  
Then I pointed out it is avoidable. And then you say this. It doesn't add  
up.



--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-23 Thread Julian Reschke

Anne van Kesteren wrote:
On Mon, 23 Mar 2009 11:31:01 +0100, Julian Reschke 
julian.resc...@gmx.de wrote:

Anne van Kesteren wrote:
Well yes, and a subset of those is browser based. Besides that, most 
feed readers handle HTML. Do you think they should have two separate 
URL parsing functions?


Yes, absolutely.


Why?


Because it's preferable to the alternative, which is leaking the 
non-conformant URI/IRI handling into other places.


Obviously you would first split on whitespace and then parse the URLs. 
You can still use the same generic URL handling.


In which case IRI handling should be totally sufficient.


I don't follow. I said "I'm not convinced that having two ways of 
handling essentially the same thing is good." Then you said "It's 
unavoidable." Then I pointed out it is avoidable. And then you say this. 
It doesn't add up.


The issue is that it's *not* the same thing.

BR, Julian


Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-23 Thread Anne van Kesteren
On Mon, 23 Mar 2009 11:46:15 +0100, Julian Reschke julian.resc...@gmx.de  
wrote:
Because it's preferable to the alternative, which is, leaking out the  
non-conformant URI/IRI handling into other places.


Apparently that is already happening in part anyway due to LEIRIs. Modulo  
the URL encoding bit (which you can set to always be UTF-8 for non-HTML  
contexts) I'm not sure what's so bad about allowing a few more characters.




The issue is that it's *not* the same thing.


Well, no, not exactly. But they perform essentially the same task, modulo  
a few characters. And since one is a superset of the other (as long as URL  
encoding is UTF-8) I don't see a point in having both.



--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-23 Thread Julian Reschke

Anne van Kesteren wrote:
On Mon, 23 Mar 2009 11:46:15 +0100, Julian Reschke 
julian.resc...@gmx.de wrote:
Because it's preferable to the alternative, which is, leaking out the 
non-conformant URI/IRI handling into other places.


Apparently that is already happening in part anyway due to LEIRIs. 
Modulo the URL encoding bit (which you can set to always be UTF-8 for 
non-HTML contexts) I'm not sure what's so bad about allowing a few more 
characters.


Whitespace is a big issue - auto-highlighting will fail all over the place.


The issue is that it's *not* the same thing.


Well, no, not exactly. But they perform essentially the same task, 
modulo a few characters. And since one is a superset of the other (as 
long as URL encoding is UTF-8) I don't see a point in having both.


Well, then let's just agree that we disagree on that.

BR, Julian



Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-23 Thread Anne van Kesteren
On Mon, 23 Mar 2009 11:58:59 +0100, Julian Reschke julian.resc...@gmx.de  
wrote:
Whitespace is a big issue - auto-highlighting will fail all over the  
place.


Auto-highlighting and linking code already fails all over the place due to  
e.g. punctuation issues. A solution for whitespace specifically is to  
simply forbid it, but still require parsers to handle it, as browsers  
already do for HTML and XMLHttpRequest. Apparently browsers also handle it  
for HTTP, as otherwise e.g. http://www.usafa.af.mil/ would not work, which  
returns a 302 with "Location: index.cfm?catname=AFA Homepage". Similarly  
http://www.flightsimulator.nl/ gives a URL in the Location header that  
contains a "\", which is also illegal, but it is handled fine. (Thanks to  
Philip`.)
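The lenient recovery described here can be sketched as follows (illustrative Python; the helper name is hypothetical, and real browsers apply more elaborate rules than this):

```python
# A raw space in a Location header is invalid per RFC 3986, but a
# client can percent-encode it and then resolve the result against
# the request URL, which is roughly what browsers tolerate.
from urllib.parse import urljoin

def resolve_location(base: str, location: str) -> str:
    # Escape raw spaces before resolving the (possibly relative) reference.
    return urljoin(base, location.replace(" ", "%20"))

url = resolve_location("http://www.usafa.af.mil/",
                       "index.cfm?catname=AFA Homepage")
print(url)  # http://www.usafa.af.mil/index.cfm?catname=AFA%20Homepage
```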


(Whitespace is one of the things LEIRIs introduce by the way.)



The issue is that it's *not* the same thing.


Well, no, not exactly. But they perform essentially the same task,  
modulo a few characters. And since one is a superset of the other (as  
long as URL encoding is UTF-8) I don't see a point in having both.


Well, then let's just agree that we disagree on that.


I would still be interested in hearing your point. Is it whitespace?


--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-23 Thread Julian Reschke

Anne van Kesteren wrote:

The issue is that it's *not* the same thing.


Well, no, not exactly. But they perform essentially the same task, 
modulo a few characters. And since one is a superset of the other (as 
long as URL encoding is UTF-8) I don't see a point in having both.


Well, then let's just agree that we disagree on that.


I would still be interested in hearing your point. Is it whitespace?


...and other characters that are not allowed in URIs and IRIs, such as 
"{" and "}" (which therefore can be used as delimiters).


BR, Julian


Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-23 Thread Julian Reschke

Anne van Kesteren wrote:
On Mon, 23 Mar 2009 12:50:46 +0100, Julian Reschke 
julian.resc...@gmx.de wrote:
...and other characters that are not allowed in URIs and IRIs, such as 
"{" and "}" (which therefore can be used as delimiters).


And keeping them invalid but requiring user agents to handle those 
characters as part of a URL (after it has been determined what the URL 
is for a given context) does not work because?


You are essentially proposing to change existing specifications (such as 
Atom). I just do not see the point.


If you think it's worthwhile, propose that change to the relevant 
standards body (in this case IETF Applications Area).


BR, Julian


Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-23 Thread Anne van Kesteren
On Mon, 23 Mar 2009 12:50:46 +0100, Julian Reschke julian.resc...@gmx.de  
wrote:
...and other characters that are not allowed in URIs and IRIs, such as  
"{" and "}" (which therefore can be used as delimiters).


And keeping them invalid but requiring user agents to handle those  
characters as part of a URL (after it has been determined what the URL is  
for a given context) does not work because?



--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-23 Thread Ian Hickson
On Mon, 23 Mar 2009, Julian Reschke wrote:
 
 You are essentially proposing to change existing specifications (such as 
 Atom). I just do not see the point.

The point is to ensure there is only one way to handle strings that are 
purported to be IRIs but that are invalid. Right now, there are at least 
three different ways to do it: the way that the URI/IRI specs say, the way 
that the LEIRI docs say, and the way that legacy HTML content relies on.

My understanding is that even command line software, feed readers, and 
other non-Web browser tools agree that the specs are wrong here.

For example, curl will not refuse to fetch the URL http://example.com/% 
despite that URL being invalid.
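Python's urllib behaves the same way as curl here (an illustrative sketch, not a claim about what any spec requires): the parser is purely syntactic and does not reject the malformed percent-escape, leaving error handling entirely to the caller.

```python
# urlsplit happily decomposes a URL whose path contains a bare "%",
# even though that is invalid per RFC 3986's percent-encoding grammar.
from urllib.parse import urlsplit

parts = urlsplit("http://example.com/%")
print(parts.netloc, parts.path)  # example.com /%
```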

Thus, we need a spec they are willing to follow. The idea of not limiting 
it to HTML is to prevent tools that deal both with HTML and with other 
languages (like Atom, CSS, DOM APIs, etc) from having to have two 
different implementations if they want to be conforming.


 If you think it's worthwhile, propose that change to the relevant 
 standards body (in this case IETF Applications Area).

This was the first thing we tried, but the people on the URI lists were 
not interested in making their specs useful for the real world. We are now 
routing around that negative energy. We're having a meeting later this 
week to see if the IETF will adopt the spec anyway, though.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-23 Thread Julian Reschke

Ian Hickson wrote:

On Mon, 23 Mar 2009, Julian Reschke wrote:
You are essentially proposing to change existing specifications (such as 
Atom). I just do not see the point.


The point is to ensure there is only one way to handle strings that are 
purported to be IRIs but that are invalid. Right now, there are at least 
three different ways to do it: the way that the URI/IRI specs say, the way 
that the LEIRI docs say, and the way that legacy HTML content relies on.
My understanding is that even command line software, feed readers, and 
other non-Web browser tools agree that the specs are wrong here.


For example, curl will not refuse to fetch the URL http://example.com/% 
despite that URL being invalid.


Should it refuse to?

Thus, we need a spec they are willing to follow. The idea of not limiting 
it to HTML is to prevent tools that deal both with HTML and with other 
languages (like Atom, CSS, DOM APIs, etc) from having to have two 
different implementations if they want to be conforming.


I understand that you want everybody to use the same rules, and you want 
these rules to be the ones needed for HTML content. I disagree with that.


Do not leak that stuff into places where it's not needed.

For instance, there are lots of cases where the Atom feed format can be 
used in absence of HTML.



...
If you think it's worthwhile, propose that change to the relevant 

standards body (in this case IETF Applications Area).


This was the first thing we tried, but the people on the URI lists were 
not interested in making their specs useful for the real world. We are now 
routing around that negative energy. We're having a meeting later this 
week to see if the IETF will adopt the spec anyway, though.


Adopting the spec is not the same thing as mandating its use all over 
the place.


BR, Julian



Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-23 Thread Ian Hickson

[cc'ed DanC since I don't think Dan is on the WHATWG list, and he's the 
editor of the draft at this point]

On Mon, 23 Mar 2009, Julian Reschke wrote:
  
  For example, curl will not refuse to fetch the URL 
  http://example.com/% despite that URL being invalid.
 
 Should it refuse to?

The URI/IRI specs don't say, because they don't cover error handling.

This is what the Web addresses spec is supposed to cover. It doesn't 
change the rules for anything that the URI spec defines, it just also says 
how to handle errors.

That way, we can have interoperability across all inputs.

I personally don't care if we say that http://example.com/% should be 
thrown out or accepted. However, I _do_ care that we get something that is 
widely and uniformly implemented, and the best way to do that is to write 
a spec that matches what people have already implemented.


  Thus, we need a spec they are willing to follow. The idea of not 
  limiting it to HTML is to prevent tools that deal both with HTML and 
  with other languages (like Atom, CSS, DOM APIs, etc) from having to 
  have two different implementations if they want to be conforming.
 
 I understand that you want everybody to use the same rules, and you want 
 these rules to be the ones needed for HTML content. I disagree with 
 that.

I want everyone to follow the same rules. I don't care what those rules 
are, so long as everyone (or at least the vast majority of systems) is 
willing to follow them. Right now, it seems to me that most systems do the 
same thing, so it makes sense to follow what they do. This really has 
nothing to do with HTML.


 Do not leak that stuff into places where it's not needed.

Interoperability and uniformity in implementations is important 
everywhere. If there are areas that are self-contained and never interact 
with the rest of the Internet, then they can do whatever they like. I do 
not believe I have ever suggested doing anything to such software. 
However, 'curl' obviously isn't self-contained; people will take URLs from 
e-mails and paste them into the command line to fetch files from FTP 
servers, and we should ensure that this works the same way whether the 
user is using Pine with wget or Mail.app with curl or any other 
combination of mail client and download tool.


 For instance, there are lots of cases where the Atom feed format can be 
 used in absence of HTML.

Sure, but the tools that use Atom still need to process URLs in the same 
way as other tools. It would be very bad if a site had an RSS feed and an 
Atom feed and they both said that the item's URL was http://example.com/% 
but in one feed that resulted in one file being fetched but in another it 
resulted in another file being fetched.


   If you think it's worthwhile, propose that change to the relevant 
   standards body (in this case IETF Applications Area).
  
  This was the first thing we tried, but the people on the URI lists 
  were not interested in making their specs useful for the real world. 
  We are now routing around that negative energy. We're having a meeting 
  later this week to see if the IETF will adopt the spec anyway, though.
 
 Adopting the spec is not the same thing as mandating its use all over 
 the place.

I think it is important that we have interoperable use of URLs in the 
transitive closure of places that use URLs, starting from any common 
starting point, like the URL in an e-mail example above. I believe this 
includes most if not all Internet software. I also believe that in 
practice most software is already doing this, though often in subtly 
different ways since the URI and IRI specs did not define error handling.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-23 Thread Julian Reschke

Ian Hickson wrote:
[cc'ed DanC since I don't think Dan is on the WHATWG list, and he's the 
editor of the draft at this point]


On Mon, 23 Mar 2009, Julian Reschke wrote:
For example, curl will not refuse to fetch the URL 
http://example.com/% despite that URL being invalid.

Should it refuse to?


The URI/IRI specs don't say, because they don't cover error handling.


Indeed.

This is what the Web addresses spec is supposed to cover. It doesn't 
change the rules for anything that the URI spec defines, it just also says 
how to handle errors.


That way, we can have interoperability across all inputs.

I personally don't care if we say that http://example.com/% should be 
thrown out or accepted. However, I _do_ care that we get something that is 
widely and uniformly implemented, and the best way to do that is to write 
a spec that matches what people have already implemented.


I'm OK with doing that for browsers.

I'm *very* skeptical about the idea that it needs to be the same way 
everywhere else.


Thus, we need a spec they are willing to follow. The idea of not 
limiting it to HTML is to prevent tools that deal both with HTML and 
with other languages (like Atom, CSS, DOM APIs, etc) from having to 
have two different implementations if they want to be conforming.
I understand that you want everybody to use the same rules, and you want 
these rules to be the ones needed for HTML content. I disagree with 
that.


I want everyone to follow the same rules. I don't care what those rules 
are, so long as everyone (or at least the vast majority of systems) is 
willing to follow them. Right now, it seems to me that most systems do the 
same thing, so it makes sense to follow what they do. This really has 
nothing to do with HTML.


Your perspective on most systems differs from mine.


Do not leak that stuff into places where it's not needed.


Interoperability and uniformity in implementations is important 
everywhere. If there are areas that are self-contained and never interact 
with the rest of the Internet, then they can do whatever they like. I do 
not believe I have ever suggested doing anything to such software. 
However, 'curl' obviously isn't self-contained; people will take URLs from 
e-mails and paste them into the command line to fetch files from FTP 
servers, and we should ensure that this works the same way whether the 
user is using Pine with wget or Mail.app with curl or any other 
combination of mail client and download tool.


How many people paste URLs into command lines? And of these, how many 
remember that they likely need to quote them?


For instance, there are lots of cases where the Atom feed format can be 
used in absence of HTML.


Sure, but the tools that use Atom still need to process URLs in the same 
way as other tools. It would be very bad if a site had an RSS feed and an 
Atom feed and they both said that the item's URL was http://example.com/% 
but in one feed that resulted in one file being fetched but in another it 
resulted in another file being fetched.


Yes, that would be bad.

However, what seems to be more likely is that one tool refuses to fetch 
the file (because the URI parser didn't like it), while in the other 
case the tool puts the invalid URL onto the wire, in which case the 
server's behavior decides.


I think this is totally ok, and the more tools reject the URL early, the 
better.
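Both behaviours in this exchange are easy to reproduce; here is a minimal Python sketch of the invalid URL from Ian's example (the lenient/strict contrast is illustrative, not a claim about any particular feed reader):

```python
# "http://example.com/%" is invalid: "%" is not followed by two hex
# digits. urllib.parse is lenient and never rejects it; a tool that
# instead re-encodes before fetching asks the server for a different
# resource, so two tools can fetch two different files.
from urllib.parse import urlsplit, quote

raw = "http://example.com/%"
parts = urlsplit(raw)      # lenient parse, no error raised
print(parts.path)          # /%
print(quote(parts.path))   # /%25 -- not the same bytes on the wire
```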


    If you think it's worthwhile, propose that change to the relevant
    standards body (in this case IETF Applications Area).

   This was the first thing we tried, but the people on the URI lists
   were not interested in making their specs useful for the real world.
   We are now routing around that negative energy. We're having a
   meeting later this week to see if the IETF will adopt the spec
   anyway, though.

  Adopting the spec is not the same thing as mandating its use all over
  the place.


 I think it is important that we have interoperable use of URLs in the
 transitive closure of places that use URLs, starting from any common
 starting point, like the URL in an e-mail example above. I believe this
 includes most if not all Internet software. I also believe that in
 practice most software is already doing this, though often in subtly
 different ways since the URI and IRI specs did not define error
 handling.


If the consequence of this is that invalid URLs do not interoperate, 
then I think this is a *feature*, not a bug.


Best regards, Julian


Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-23 Thread Ian Hickson
On Mon, 23 Mar 2009, Julian Reschke wrote:
 
 However, what seems to be more likely is that one tool refuses to fetch 
 the file (because the URI parser didn't like it), while in the other 
 case, the tool puts the invalid URL on to the wire

IMHO this is basically the definition of a standards failure.


 I think this is totally ok

I think considering this behaviour to be ok is basically ignoring 19 years 
of experience with the Web which has shown repeatedly and at huge cost 
that having different tools act differently in the same situation is a bad 
idea and only causes end users to have a bad experience.


 If the consequence of this is that invalid URLs do not interoperate, 
 then I think this is a *feature*, not a bug.

I fundamentally disagree. Users don't care what the source of a lack of 
interoperability is. Whether it's an engineering error or a flaw in the 
standard or a flaw in the content is irrelevant, the result is the same: 
an unhappy user.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-23 Thread Maciej Stachowiak


On Mar 23, 2009, at 2:25 PM, Ian Hickson wrote:


 On Mon, 23 Mar 2009, Julian Reschke wrote:

  However, what seems to be more likely is that one tool refuses to
  fetch the file (because the URI parser didn't like it), while in the
  other case, the tool puts the invalid URL on to the wire

 IMHO this is basically the definition of a standards failure.

  I think this is totally ok

 I think considering this behaviour to be ok is basically ignoring 19
 years of experience with the Web which has shown repeatedly and at
 huge cost that having different tools act differently in the same
 situation is a bad idea and only causes end users to have a bad
 experience.

  If the consequence of this is that invalid URLs do not interoperate,
  then I think this is a *feature*, not a bug.

 I fundamentally disagree. Users don't care what the source of a lack
 of interoperability is. Whether it's an engineering error or a flaw in
 the standard or a flaw in the content is irrelevant, the result is the
 same: an unhappy user.


I largely agree with Ian's perspective on this. The primary purpose of
standards is to enable interoperability, therefore failure to
interoperate is by definition a standards failure (either in the design
of the standard or in correct implementation of the standard).


Regards,
Maciej



Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-22 Thread Ian Hickson
On Sat, 21 Mar 2009, Giovanni Campagna wrote:
 
 Now I would like to ask:
 are there any major differences that require the W3C / WHATWG to
 publish another specification, just for HTML5, instead of just
 referencing the IRI-bis draft or the LEIRI working group note?

As far as I can tell the LEIRI requirements aren't actually an accurate 
description of what browsers do.

Note that the Web addresses draft isn't specific to HTML5. It is intended 
to apply to any user agent that interacts with Web content, not just Web 
browsers and HTML. (That's why we took it out of HTML5.)

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Web Addresses vs Legacy Extended IRI

2009-03-22 Thread Giovanni Campagna
2009/3/22 Ian Hickson i...@hixie.ch:
 On Sat, 21 Mar 2009, Giovanni Campagna wrote:

 Now I would like to ask:
 are there any major differences that require the W3C / WHATWG to
 publish another specification, just for HTML5, instead of just
 referencing the IRI-bis draft or the LEIRI working group note?

 As far as I can tell the LEIRI requirements aren't actually an accurate
 description of what browsers do.

My question was more specific: what are the *technical differences*
between LEIRI and Web Addresses?
Can't we have one technology instead of two?

 Note that the Web addresses draft isn't specific to HTML5. It is intended
 to apply to any user agent that interacts with Web content, not just Web
 browsers and HTML. (That's why we took it out of HTML5.)

Unfortunately, languages outside HTML5 (notably XLink, XML Base, SVG,
XForms) that use the W3C Schema definition of the anyURI type use
exactly LEIRIs.
Other technologies (XMLNS, RDF) instead use pure URIs / IRIs, and I
wouldn't see much benefit in relaxing their syntax (because they never
actually process their identifiers).
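A concrete sketch of what the LEIRI relaxation amounts to in practice: a LEIRI may contain characters (for example a literal space) that a strict URI must percent-encode. This Python fragment is illustrative only; `quote` merely shows the escaping a processor would apply before the reference can go on the wire:

```python
# A literal space is permitted in a LEIRI but not in a URI; a processor
# must percent-encode it before dereferencing the address.
from urllib.parse import quote

leiri = "http://example.com/a b"     # legal as a LEIRI, not as a URI
print(quote(leiri, safe=":/"))       # http://example.com/a%20b
```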

 --
 Ian Hickson               U+1047E                )\._.,--,'``.    fL
 http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
 Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Giovanni