Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2012-10-09 Thread Julian Reschke

On 2012-10-09 17:33, Anne van Kesteren wrote:

On Tue, Oct 9, 2012 at 4:59 PM, Julian Reschke  wrote:

The test case at

   http://greenbytes.de/tech/tc/datauri/#svg

seems to imply that Opera doesn't do this right yet, though. (tested with
12.02)


Yeah, for some reason Opera has different behavior when entering in
the address bar.


Indeed. Will update test comment.

Best regards, Julian



Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2012-10-09 Thread Anne van Kesteren
On Tue, Oct 9, 2012 at 4:59 PM, Julian Reschke  wrote:
> The test case at
>
>   http://greenbytes.de/tech/tc/datauri/#svg
>
> seems to imply that Opera doesn't do this right yet, though. (tested with
> 12.02)

Yeah, for some reason Opera has different behavior when entering in
the address bar.


-- 
http://annevankesteren.nl/


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2012-10-09 Thread Julian Reschke

On 2012-10-09 13:51, Anne van Kesteren wrote:

On Tue, Oct 9, 2012 at 1:50 AM, Ian Hickson  wrote:

On Sat, 10 Sep 2011, Daniel Holbert wrote:

I'm writing with a proposal to improve the handling of "#" in data URIs.
I'm particularly looking for feedback from other browser vendors, but of
course feedback from others is welcome as well. [...]


Anne has since tried to respec URL parsing in detail, with the work in
progress being here:

http://url.spec.whatwg.org/

I recommend checking that spec to see if it does what you want, and if
not, working with Anne to see if it can be adjusted accordingly or if
something else needs to happen.


This is not written down explicitly just yet, but for data URLs I
think we want the fragment to *not* be part of the actual resource,
but rather as an input to the resource so things like

data:text/html,:target{background:lime}test#x

work. (Fails in Chrome, but works fine in Opera and Firefox already.)


Clarifying: that sounds like making it parse just like in any other URI 
(with which I would agree).


The test case at

  http://greenbytes.de/tech/tc/datauri/#svg

seems to imply that Opera doesn't do this right yet, though. (tested 
with 12.02)


Best regards, Julian



Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2012-10-09 Thread Anne van Kesteren
On Tue, Oct 9, 2012 at 1:50 AM, Ian Hickson  wrote:
> On Sat, 10 Sep 2011, Daniel Holbert wrote:
>> I'm writing with a proposal to improve the handling of "#" in data URIs.
>> I'm particularly looking for feedback from other browser vendors, but of
>> course feedback from others is welcome as well. [...]
>
> Anne has since tried to respec URL parsing in detail, with the work in
> progress being here:
>
>http://url.spec.whatwg.org/
>
> I recommend checking that spec to see if it does what you want, and if
> not, working with Anne to see if it can be adjusted accordingly or if
> something else needs to happen.

This is not written down explicitly just yet, but for data URLs I
think we want the fragment to *not* be part of the actual resource,
but rather as an input to the resource so things like

data:text/html,:target{background:lime}test#x

work. (Fails in Chrome, but works fine in Opera and Firefox already.)


-- 
http://annevankesteren.nl/


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2012-10-08 Thread Ian Hickson
On Sat, 10 Sep 2011, Daniel Holbert wrote:
> 
> I'm writing with a proposal to improve the handling of "#" in data URIs. 
> I'm particularly looking for feedback from other browser vendors, but of 
> course feedback from others is welcome as well. [...]

Anne has since tried to respec URL parsing in detail, with the work in 
progress being here:

   http://url.spec.whatwg.org/

I recommend checking that spec to see if it does what you want, and if 
not, working with Anne to see if it can be adjusted accordingly or if 
something else needs to happen.

-- 
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-14 Thread Daniel Holbert

On 09/14/2011 10:11 AM, Daniel Holbert wrote:

I'll file a few WebKit bugs and report back with bug links for those who
are interested.


Filed:  https://bugs.webkit.org/show_bug.cgi?id=68089
(Component is "New Bugs" right now, because I wasn't sure where 
data-URI-parsing bugs belong in WebKit.  If anyone knows the correct 
component, please update that field - thanks!)


Also filed: https://bugs.webkit.org/show_bug.cgi?id=68090
(this second bug is about a mostly-unrelated SVG text-sizing issue that 
I ran across when writing my set of testcases for this thread)


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-14 Thread Daniel Holbert

On 09/14/2011 01:26 AM, Julian Reschke wrote:

On 2011-09-14 10:16, Robert O'Callahan wrote:

Yeah. Will you fix it in Webkit? :-)


:-)

Maybe we should start with opening a ticket, so this is properly tracked?


I'll file a few WebKit bugs and report back with bug links for those who 
are interested.


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-14 Thread Julian Reschke

On 2011-09-14 10:16, Robert O'Callahan wrote:

On Sat, Sep 10, 2011 at 11:01 PM, Ryosuke Niwa  wrote:


Have implementors actively opposed this idea?  It seems like sticking to
the RFC is a cleaner option if possible.



Yeah. Will you fix it in Webkit? :-)


:-)

Maybe we should start with opening a ticket, so this is properly tracked?



Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-14 Thread Robert O'Callahan
On Sat, Sep 10, 2011 at 11:01 PM, Ryosuke Niwa  wrote:

> Have implementors actively opposed this idea?  It seems like sticking to
> the RFC is a cleaner option if possible.
>

Yeah. Will you fix it in Webkit? :-)

Rob
-- 
"If we claim to be without sin, we deceive ourselves and the truth is not in
us. If we confess our sins, he is faithful and just and will forgive us our
sins and purify us from all unrighteousness. If we claim we have not sinned,
we make him out to be a liar and his word is not in us." [1 John 1:8-10]


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-13 Thread Michael A. Puls II
On Sun, 11 Sep 2011 10:21:48 -0400, Michael A. Puls II  
 wrote:


Encoding the data (markup for example) for the data URI is simple. Just  
use encodeURIComponent(markup) (on a UTF-8 page) in JS on the data. You  
still hand-author the markup. You just paste the markup into a textarea  
and have something (like encodeURIComponent()) percent-encode it for you.


For example, . (I'll make  
one for files with the File API at a later date.)


--
Michael


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-13 Thread Mikko Rantalainen
2011-09-11 00:15 EEST: Daniel Holbert:
> Browsers handle the "#" character in data URIs very differently, and the
> arguably "correct" behavior is probably not what authors actually want
> in many cases.
> 
> This could be more intuitive/do-what-I-mean if we restricted the cases
> under which "#" is treated as a fragment-ID delimiter inside of data
> URIs.  In particular: when a "#" character is followed by ">" or "<" in
> a data URI, I propose that we *don't* treat the "#" as a delimiter, and
> instead just treat it as part of the encoded document.

Please, no. We already have WAY too much of "do what I mean" stuff in
HTML and it clearly does not work in the long run. The only sane
interpretation of literal "#" in URI is to use it always as a separator
for the fragment identifier. If some user agent does not follow this
logic, that user agent should be fixed.
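To make the strict reading concrete, here is a minimal sketch (not from the thread) of "always split at the first literal #". The helper name splitFragment and the sample URIs are assumptions made up for illustration.

```javascript
// Hypothetical helper: treat the first literal "#" as the fragment delimiter,
// regardless of scheme or of what follows it.
function splitFragment(uri) {
  const i = uri.indexOf("#");
  if (i === -1) {
    return { resource: uri, fragment: null };
  }
  return { resource: uri.slice(0, i), fragment: uri.slice(i + 1) };
}

// An unescaped "#" truncates the document part of the data URI...
const unescaped = splitFragment("data:text/html,<p>Issue #42</p>");

// ...while "%23" keeps the whole payload together.
const escaped = splitFragment("data:text/html,<p>Issue %2342</p>");

console.log(unescaped, escaped);
```

Under this rule the first URI yields a truncated document and a stray fragment in every user agent, which is the consistent output-that-doesn't-make-sense argued for here.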

> When an author writes a data URI for a document that contains a "#"
> character, she may unintentionally end up with broken results (or at
> least inconsistently-handled results), because the "#" may be treated as
> the end of the document & the beginning of the URI's fragment identifier.

When an author writes invalid markup the results should be invalid, too.
I agree with WHATWG/HTML5 that defining results for any binary input is
required but I think that the spec should not try to second guess the
author intention. If the input does not make sense, the output must not
make sense. We just need a spec that outputs the *same*
output-that-doesn't-make-sense for every user agent. After that the
author *will* notice her error and she *will* fix the markup.

-- 
Mikko


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-12 Thread Michal Zalewski
[ Julian Reschke ]

> Observation: javascript: IMHO isn't a URI scheme (it just occupies a place
> in the same lexical space), so maybe the right thing to do is to document it
> as historic exception that only exists in browsers.

In one of its modes, it's roughly equivalent to data:
(javascript:"foo"), so I'm not sure the distinction is really that
strong.

> Maybe. Or it makes sense to do it one at a time :-).

My only concern is that treating them separately had some funny side
effects in the past - for example, right now, the origin inheritance
for documents created from about:blank, data:.., and javascript:...
URLs are completely different in current implementations, for no good
reason.

[ Daniel Holbert ]

> However, fragment identifiers don't have any special significance in 
> "javascript:" URIs, so the "javascript:"
> handler ends up requesting the full URI, ignoring any distinction between the 
> pre-# and post-# part.

In the above use case, you could imagine # having the same purpose as
for data, in principle.

/mz


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-12 Thread Bjoern Hoehrmann
* Michal Zalewski wrote:
>What about javascript: URLs?
>
>Right now, every browser seems to treat javascript:alert('#') in an
>"intuitive" manner.

With ...#x"> 
Firefox will scroll down to the second p element with value "#x". That
is neither very intuitive nor interoperable. As it is, the draft says
http://tools.ietf.org/html/draft-hoehrmann-javascript-scheme-03 if you
want the "# as content" behavior, then that should be specified by some
higher level protocol (like the HTML specification when you want this
for  attribute values for instance) as fragment identifiers are
scheme-independent.
-- 
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-12 Thread Daniel Holbert

On 09/12/2011 12:47 PM, Michal Zalewski wrote:

What about javascript: URLs?

Right now, every browser seems to treat javascript:alert('#') in an
"intuitive" manner.


FWIW -- in Gecko, I believe we actually do split "#')" out as a fragment 
identifier there (under-the-hood in our URI parsing code).


However, fragment identifiers don't have any special significance in 
"javascript:" URIs, so the "javascript:" handler ends up requesting the 
full URI, ignoring any distinction between the pre-# and post-# part.
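For illustration only: the same split can be observed with the WHATWG URL API. This is an assumption about the runtime (a global URL constructor, as in current browsers and Node.js); it postdates this thread and is not Gecko's internal parser.

```javascript
// Parse the example from this thread with the standard URL API.
const url = new URL("javascript:alert('#')");

const parts = {
  scheme: url.protocol,     // "javascript:"
  beforeHash: url.pathname, // everything between the scheme and the "#"
  fragment: url.hash,       // the "#" and everything after it
};

// A handler that does not care about fragments can use the whole string.
const fullUri = url.href;

console.log(parts, fullUri);
```

Here url.hash comes back as "#')", matching the split described above, while url.href still contains the complete script text.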


~Daniel


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-12 Thread Julian Reschke

On 2011-09-12 21:47, Michal Zalewski wrote:

What about javascript: URLs?

Right now, every browser seems to treat javascript:alert('#') in an
"intuitive" manner.

This likely goes beyond data: and javascript:, so I think it would be
useful to look at it more holistically.


Maybe. Or it makes sense to do it one at a time :-).

Observation: javascript: IMHO isn't a URI scheme (it just occupies a 
place in the same lexical space), so maybe the right thing to do is to 
document it as historic exception that only exists in browsers.


Best regards, Julian



Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-12 Thread Michal Zalewski
What about javascript: URLs?

Right now, every browser seems to treat javascript:alert('#') in an
"intuitive" manner.

This likely goes beyond data: and javascript:, so I think it would be
useful to look at it more holistically.

/mz


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-12 Thread Aryeh Gregor
On Mon, Sep 12, 2011 at 7:19 AM, Simon Pieters  wrote:
> Making it magic seems confusing and error prone. Making data: URLs not
> support fragments at all seems to be taking away a useful feature. I think
> we should stick to the "correct" behavior and always treat # as fragment
> delimiter in data: URLs. I thought we did this already in Opera but
> apparently we have some bugs. If you have any test cases demonstrating where
> Opera doesn't follow the spec, that would be helpful in getting it fixed.

I was surprised at first by the fact that # was always treated as a
fragment delimiter in data: URLs by some browsers, but after reading
this thread, I agree it's the only sane choice.  It's not what I'd
initially expected, but if that's how all browsers behaved, I'd have
figured it out pretty quickly.

I find Michael's point about % particularly convincing.  Authors have
to encode % anyway for their data URL handling to be robust, and once
they're doing that, encoding # is not much extra effort.  Making #
work without encoding encourages authors to forget about %.  Trying to
key off  for text/html only is just a really scary and evil hack.
If browsers agreed on treating # in data URLs like in any other URL,
authors would quickly figure it out and remember to encode.  I think
this is the best solution by far.

This is assuming that the compat issues aren't bad enough to rule it
out, of course . . .


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-12 Thread Simon Pieters

On Mon, 12 Sep 2011 13:19:56 +0200, Simon Pieters  wrote:

If you have any test cases demonstrating where Opera doesn't follow the  
spec, that would be helpful in getting it fixed.


Sorry I must have glossed over this part of your email:


  http://people.mozilla.org/~dholbert/dataURIHashTests/tests_v1.xhtml




cheers



--
Simon Pieters
Opera Software


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-12 Thread Simon Pieters
On Sat, 10 Sep 2011 23:15:09 +0200, Daniel Holbert   
wrote:



Hi whatwg,

I'm writing with a proposal to improve the handling of "#" in data URIs.  
I'm particularly looking for feedback from other browser vendors, but of  
course feedback from others is welcome as well.


SUMMARY:

Browsers handle the "#" character in data URIs very differently, and the  
arguably "correct" behavior is probably not what authors actually want  
in many cases.


Making it magic seems confusing and error prone. Making data: URLs not  
support fragments at all seems to be taking away a useful feature. I think  
we should stick to the "correct" behavior and always treat # as fragment  
delimiter in data: URLs. I thought we did this already in Opera but  
apparently we have some bugs. If you have any test cases demonstrating  
where Opera doesn't follow the spec, that would be helpful in getting it  
fixed.


cheers
--
Simon Pieters
Opera Software


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-11 Thread Bjartur Thorlacius

On Sun, 11 Sep 2011 18:44, Michael A. Puls II wrote:

I don't think < and > are in the list of safe URI characters. All
URI-based functions seem to percent-encode them too. Keeping them
encoded is definitely good for data URIs in text/plain documents so they
don't interfere with the < and > that encase the URI.


You are right, they are delims and MUST be percent encoded.


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-11 Thread Nils Dagsson Moskopp
"Michael A. Puls II" wrote on Sun, 11 Sep 2011 13:54:01 -0400:

> On Sun, 11 Sep 2011 11:30:07 -0400, Daniel Holbert
> wrote:
> 
> > […]
> >
> >    data:text/html,<i>here is some italic text</i>
> 
> I don't really like that though as it's not portable. If I wanted to
> copy that from the address field and paste it into a plain-text
> document, it'd look funny like this:
> 
> <data:text/html,<i>here is some italic text</i>>
> 
> And, for mail clients that linkify links in plain-text messages, I
> can see that going wrong with the link (the clickable, underlined
> part and href) ending up as only "data:text/html,
> 
> > So we can proactively check for >/< characters anywhere after the
> > "#", and if we find them, then we can pretty safely assume that the
> > author intended for the "#" to be part of the document, rather than
> > a fragment-ID delimiter.
> 
> I still don't like it personally as it further encourages authors to
> not encode their data and is not portable. But, if this is to happen,
> it should definitely be limited to mime types that contain markup.
> It wouldn't be useful for data:text/plain (how would you
> differentiate in that case?). And, for text/javascript and text/css
> etc., some other type of lookahead character(s) would have to be
> used.

I sincerely hope this observation pretty much kills the original
proposal on the spot. Not only would specifying that stuff involve lots
of black magic, all talks about author expectations go off the rails as
soon as interpretation of the data URI depends on the mime type.

-- 
Nils Dagsson Moskopp // erlehmann



Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-11 Thread Michael A. Puls II

On Sun, 11 Sep 2011 12:14:08 -0400, Glenn Maynard  wrote:


On Sun, Sep 11, 2011 at 10:21 AM, Michael A. Puls II
wrote:

Not only must "#" be "%23" if you don't want it as a frag id, but ">"
and "<" should be "%3E" and "%3C".



I'm not sure about the spec on this, but Firefox actively unencodes %3E
and %3C.  Pasting this into the address bar and copying it back out turns
them back into literal < and > characters:

data:text/html,foobar#vector%3Cint%3E

which suggests that escaping these characters isn't necessary or  
encouraged.


Firefox aggressively decodes %HH in the address field to make the URI  
human-readable (which I hate btw, but that's another discussion). It  
usually copies the correct/original value to the clipboard though. In this  
case though, Firefox copies < and > to the clipboard decoded just like you  
say. I can't say I think that's a good idea and wonder if it's intentional  
as you suspect. Chrome does it too though. Opera doesn't and leaves them  
encoded, which I think is better.


I love what Safari does and I think what it does is the right thing to do.  
It will resolve the data URI and convert raw spaces to %20 and convert <  
and > to %3C and %3E (and anything else that should be encoded in a URI to  
%HH) if they're not encoded. It could be argued that Safari shouldn't do  
that visually. But, as far as copying to the clipboard, it should and  
does, which is awesome.


I don't think < and > are in the list of safe URI characters. All  
URI-based functions seem to percent-encode them too. Keeping them encoded  
is definitely good for data URIs in text/plain documents so they don't  
interfere with the < and > that encase the URI.
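As a quick check of the claim about URI-based functions, here is a small sketch using the two standard JavaScript encoders. The sample string is made up for illustration.

```javascript
// Sample payload containing angle brackets, a slash and a "#".
const sample = "<b>#1</b>";

// encodeURI escapes "<" and ">" but leaves "#" and "/" alone,
// because it assumes it is handed a complete URI.
const viaEncodeURI = encodeURI(sample);

// encodeURIComponent also escapes "#" and "/", which is what a
// data URI payload needs if "#" is meant as content.
const viaEncodeURIComponent = encodeURIComponent(sample);

console.log(viaEncodeURI);          // %3Cb%3E#1%3C/b%3E
console.log(viaEncodeURIComponent); // %3Cb%3E%231%3C%2Fb%3E
```

Both functions percent-encode the angle brackets; only encodeURIComponent also protects the "#".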


--
Michael


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-11 Thread Michael A. Puls II
On Sun, 11 Sep 2011 11:30:07 -0400, Daniel Holbert   
wrote:



On 09/11/2011 07:21 AM, Michael A. Puls II wrote:

Not only must "#" be "%23" if you don't want it as a frag id, but ">"
and "<" should be "%3E" and "%3C".

[...]
> Of course, if you can percent-encode everything needed as you type, you
> can hand-author the URI data. But, who wants to do that,

As I noted in a response to Nils earlier in this thread,  
Firefox/Webkit/Opera don't actually require authors to percent-encode  
brackets and spaces in data URIs. (not sure whether that's correct per  
spec or not).


For example
   data:text/html,<i>here is some italic text</i>
works just fine in all three.

So that makes it quite easy to hand-author data URIs, in fact. (aside  
from this "#" gotcha)


Yes, but it's important to know that the browser still percent-decodes  
everything after the ",". It's just that in this case, there are no %HH to  
decode. You have to be careful here and know that the data/markup is still  
not literal. For example, if you want a literal "%5E", you have to use  
%255E. If you include a URI with a bunch of %HH, you have to escape all  
those "%". So, while typing, if you have no problem typing %25, you should  
have no problem typing %23.
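A minimal sketch of the decoding step described above. It approximates the browser's percent-decoding with decodeURIComponent, which is an assumption: a real data URI handler decodes bytes and is more forgiving of malformed input. The helper name decodeDataPayload is hypothetical.

```javascript
// Hypothetical stand-in for the percent-decoding a browser applies to
// everything after the "," in a data URI.
function decodeDataPayload(payload) {
  return decodeURIComponent(payload);
}

const decodedOnce = decodeDataPayload("%5E");    // becomes "^"
const keptLiteral = decodeDataPayload("%255E");  // becomes the literal text "%5E"

// Going the other way, an encoder produces the escapes for you.
const escapedPercent = encodeURIComponent("%5E"); // "%255E"
const escapedHash = encodeURIComponent("#");      // "%23"

console.log(decodedOnce, keptLiteral, escapedPercent, escapedHash);
```

The point is the symmetry: "%" needs "%25" for exactly the same reason that "#" needs "%23".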


Are you saying that data URI authors know that they have to escape "%",  
but don't know that they have to escape "#"? Or, are you saying that the  
problem is more serious and data URI authors think the data is  
*completely* literal? If the latter, we definitely shouldn't be  
encouraging anything but properly-encoded data.


FWIW, I asked for advice on "#" in mailto URIs (since mailto URI handlers  
don't make use of frag ids for mailto and frag ids are not specified for  
mailto) at  
 and  
wanted to propose that '#' be allowed as-is when authoring without having  
to percent-encode it. But, that didn't go over too well.



   data:text/html,<i>here is some italic text</i>


I don't really like that though as it's not portable. If I wanted to copy  
that from the address field and paste it into a plain-text document, it'd  
look funny like this:


<data:text/html,<i>here is some italic text</i>>

And, for mail clients that linkify links in plain-text messages, I can see  
that going wrong with the link (the clickable, underlined part and href)  
ending up as only "data:text/html,

So we can proactively check for >/< characters anywhere after the "#",  
and if we find them, then we can pretty safely assume that the author  
intended for the "#" to be part of the document, rather than a  
fragment-ID delimiter.


I still don't like it personally as it further encourages authors to not  
encode their data and is not portable. But, if this is to happen, it  
should definitely be limited to mime types that contain markup. It  
wouldn't be useful for data:text/plain (how would you differentiate in  
that case?). And, for text/javascript and text/css etc., some other type  
of lookahead character(s) would have to be used.


--
Michael


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-11 Thread Julian Reschke

On 2011-09-11 17:30, Daniel Holbert wrote:

On 09/11/2011 07:21 AM, Michael A. Puls II wrote:

Not only must "#" be "%23" if you don't want it as a frag id, but ">"
and "<" should be "%3E" and "%3C".

[...]
 > Of course, if you can percent-encode everything needed as you type, you
 > can hand-author the URI data. But, who wants to do that,

As I noted in a response to Nils earlier in this thread,
Firefox/Webkit/Opera don't actually require authors to percent-encode
brackets and spaces in data URIs. (not sure whether that's correct per
spec or not).
...


It's not correct per RFC 2397 (data) and RFCs 3986 (URI) and 3987 (IRI), 
but the HTML spec certainly could *make* it correct by introducing an 
additional layer (if there was consensus to do so). Right now HTML5 
conformance requires valid IRIs, so unescaped whitespace or angle 
brackets in @hrefs make the document non-conforming.



...


Best regards, Julian


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-11 Thread Julian Reschke

On 2011-09-11 18:56, Daniel Holbert wrote:

On 09/11/2011 02:09 AM, Julian Reschke wrote:

Given the fact that this change made it into the release without any
major uproar there might be a chance that other UAs might simply adopt
it.


(To be clear -- the proposal hasn't made it into any releases yet. Right
now it's just an idea.)
...


Understood. I was referring to the changed behavior as of Firefox 6.

Best regards, Julian


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-11 Thread Daniel Holbert

On 09/11/2011 02:09 AM, Julian Reschke wrote:

Given the fact that this change made it into the release without any
major uproar there might be a chance that other UAs might simply adopt it.


(To be clear -- the proposal hasn't made it into any releases yet. Right 
now it's just an idea.)


~Daniel


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-11 Thread Glenn Maynard
On Sat, Sep 10, 2011 at 5:15 PM, Daniel Holbert wrote:

> This could be more intuitive/do-what-I-mean if we restricted the cases
> under which "#" is treated as a fragment-ID delimiter inside of data URIs.
>  In particular: when a "#" character is followed by ">" or "<" in a data
> URI, I propose that we *don't* treat the "#" as a delimiter, and instead
> just treat it as part of the encoded document.
>

An HTML document in a data: URI containing a # is probably followed by a >
or <; but that's an "if", not "iff".  It doesn't imply that a # followed by
a > or < is *always* intended as part of the data and not an actual
fragment.

data:text/html,foobar#vector<int>

I don't think adding black magic to URI parsing will make things less
confusing.
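To make the "if, not iff" point concrete, here is a rough sketch comparing strict splitting with one possible reading of the proposed heuristic ("<" or ">" anywhere after the "#"). Both helper names are hypothetical, and the second sample URI is the unescaped form of the percent-encoded one quoted further down in this message.

```javascript
// Strict behavior: the first "#" always starts the fragment.
function splitStrict(uri) {
  const i = uri.indexOf("#");
  return i === -1
    ? { resource: uri, fragment: null }
    : { resource: uri.slice(0, i), fragment: uri.slice(i + 1) };
}

// One reading of the proposal (data: URIs only): if "<" or ">" appears
// anywhere after the first "#", keep the "#" as part of the document.
function splitProposed(uri) {
  const i = uri.indexOf("#");
  if (i !== -1 && /[<>]/.test(uri.slice(i + 1))) {
    return { resource: uri, fragment: null };
  }
  return splitStrict(uri);
}

const markup = 'data:text/html,<a href="#top">up</a>';
const ambiguous = "data:text/html,foobar#vector<int>";

const results = {
  markupStrict: splitStrict(markup),           // document truncated at "#"
  markupProposed: splitProposed(markup),       // whole string kept as document
  ambiguousStrict: splitStrict(ambiguous),     // fragment is "vector<int>"
  ambiguousProposed: splitProposed(ambiguous), // intended fragment is lost
};

console.log(results);
```

The heuristic rescues the first URI but silently drops the fragment of the second, which is the kind of surprise being objected to here.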

> Firefox parses fragment-identifiers strictly, potentially giving authors
> headaches and truncating content that renders fine in Opera/Webkit.
>

I'd say the opposite: WebKit breaks this author's expectations and
encourages headaches, by not parsing the above URIs in the ordinary way,
where Firefox matches my expectations.  I was certainly surprised to find
that Chrome fails the above.


On Sun, Sep 11, 2011 at 10:21 AM, Michael A. Puls II
wrote:

> Not only must "#" be "%23" if you don't want it as a frag id, but ">" and
> "<" should be "%3E" and "%3C".
>

I'm not sure about the spec on this, but Firefox actively unencodes %3E and
%3C.  Pasting this into the address bar and copying it back out turns them
back into literal < and > characters:

data:text/html,foobar#vector%3Cint%3E

which suggests that escaping these characters isn't necessary or encouraged.

-- 
Glenn Maynard


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-11 Thread Daniel Holbert

On 09/11/2011 07:21 AM, Michael A. Puls II wrote:

Not only must "#" be "%23" if you don't want it as a frag id, but ">"
and "<" should be "%3E" and "%3C".

[...]
> Of course, if you can percent-encode everything needed as you type, you
> can hand-author the URI data. But, who wants to do that,

As I noted in a response to Nils earlier in this thread, 
Firefox/Webkit/Opera don't actually require authors to percent-encode 
brackets and spaces in data URIs. (not sure whether that's correct per 
spec or not).


For example
  data:text/html,<i>here is some italic text</i>
works just fine in all three.

So that makes it quite easy to hand-author data URIs, in fact. (aside 
from this "#" gotcha)


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-11 Thread Michael A. Puls II
On Sat, 10 Sep 2011 17:15:09 -0400, Daniel Holbert   
wrote:


Browsers handle the "#" character in data URIs very differently, and the  
arguably "correct" behavior is probably not what authors actually want  
in many cases.


This could be more intuitive/do-what-I-mean if we restricted the cases  
under which "#" is treated as a fragment-ID delimiter inside of data  
URIs.   In particular: when a "#" character is followed by ">" or "<" in  
a data URI, I propose that we *don't* treat the "#" as a delimiter, and  
instead just treat it as part of the encoded document.


Not only must "#" be "%23" if you don't want it as a frag id, but ">" and  
"<" should be "%3E" and "%3C".


Encoding the data (markup for example) for the data URI is simple. Just  
use encodeURIComponent(markup) (on a UTF-8 page) in JS on the data. You  
still hand-author the markup. You just paste the markup into a textarea  
and have something (like encodeURIComponent()) percent-encode it for you.
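A minimal sketch of that workflow, assuming a helper named toDataUri (hypothetical) and a made-up piece of markup:

```javascript
// Hypothetical helper: percent-encode hand-authored markup and prepend
// the data: prefix. The default MIME type is an assumption for this sketch.
function toDataUri(markup, mimeType = "text/html") {
  return "data:" + mimeType + "," + encodeURIComponent(markup);
}

// Markup containing the characters discussed in this thread: <, >, % and #.
const markup = '<p id="x">100% #1</p>';
const uri = toDataUri(markup);

// The payload round-trips through percent-decoding unchanged.
const payload = uri.slice(uri.indexOf(",") + 1);
const roundTripped = decodeURIComponent(payload);

console.log(uri);
```

Because every "#" and "%" in the markup is escaped, the resulting URI has no fragment unless one is appended deliberately after the encoded payload.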


Of course, if you can percent-encode everything needed as you type, you  
can hand-author the URI data. But, who wants to do that, except for simple  
data? It's like hand-authoring mime messages. It's not something you would  
normally do to create an email or mht file.


If you need to encode the data URI data as base64 instead, you can do  
encodeURIComponent(btoa(unescape(encodeURIComponent(markup)))); (on a  
utf-8 page).
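The base64 variant can be sketched as below. Note the substitution: this uses TextEncoder to get the UTF-8 bytes instead of the unescape(encodeURIComponent()) idiom, and falls back to Node's Buffer when btoa is unavailable. The helper name toBase64DataUri is hypothetical.

```javascript
// Hypothetical helper: base64-encode the UTF-8 bytes of the markup and
// build a ";base64" data URI from them.
function toBase64DataUri(markup, mimeType = "text/html") {
  // UTF-8 bytes of the markup, turned into a "binary string" for btoa.
  const bytes = new TextEncoder().encode(markup);
  let binary = "";
  for (const b of bytes) {
    binary += String.fromCharCode(b);
  }
  // btoa exists in browsers and recent Node.js; Buffer is the Node fallback.
  const b64 = typeof btoa === "function"
    ? btoa(binary)
    : Buffer.from(binary, "binary").toString("base64");
  // Percent-encode "+", "/" and "=" in the base64 text.
  return "data:" + mimeType + ";base64," + encodeURIComponent(b64);
}

const base64Uri = toBase64DataUri("<i>hi</i>");
console.log(base64Uri);
```

Base64 output never contains "#" or "%", so the fragment question does not arise for the payload itself.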


And, there's already   
too.


Given that, I personally don't think browsers should be too lax with  
authors that don't properly-encode their data. Javascript URI  
(bookmarklet) authors already get away with that (even though there's  
pages like ), but at the same time  
often run into unexpected (to them) percent-decoding of the URI data  
before it's executed.


--
Michael


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-11 Thread Julian Reschke

On 2011-09-11 04:51, Boris Zbarsky wrote:

...
I think you misunderstand my position. I'm weakly against the proposal
in question; the strongest argument in favor of the proposal is that
there is either a current or future deployed base of data: URIs that
won't work without it but do work in either past browsers or some subset
of future ones.

Of course the simplest way to prevent the future URIs thing being a
problem is for UAs that don't follow the URI spec here right now to fix
that, but I haven't sensed much willingness to do that in the past, or
earlier in this discussion. :(
...


+1 for trying to sanitize the parsing in Firefox.

Given the fact that this change made it into the release without any 
major uproar there might be a chance that other UAs might simply adopt it.



Given the choice between converging on this proposal and the status quo
in which UAs just do wildly different totally wacky things, I'd pick the
proposal, I think


If we can't get the perfect fix (UAs consistently doing what the spec 
says), then of course converging on something that is less broken than 
before may be good.


Best regards, Julian


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-10 Thread Ryosuke Niwa
Have implementors actively opposed this idea?  It seems like sticking to
the RFC is a cleaner option if possible.

- Ryosuke

On Sat, Sep 10, 2011 at 9:01 PM, Daniel Holbert wrote:

> On 09/10/2011 08:09 PM, Bjoern Hoehrmann wrote:
>> So he would make the same suggestion even if everybody implemented the
>> correct behavior.
>
> No -- sorry if I wasn't clear on that.  A big part of the motivation for
> this proposal right now is the inconsistent level of "forgiveness" across
> browsers right now.  If browsers had already converged on "correct" (strict)
> parsing behavior, I wouldn't be making this proposal.
>
> ~Daniel
>


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-10 Thread Daniel Holbert

On 09/10/2011 08:09 PM, Bjoern Hoehrmann wrote:

> So he would make the same suggestion even if everybody implemented the
> correct behavior.


No -- sorry if I wasn't clear on that.  A big part of the motivation for 
this proposal right now is the inconsistent level of "forgiveness" 
across browsers right now.  If browsers had already converged on 
"correct" (strict) parsing behavior, I wouldn't be making this proposal.


~Daniel


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-10 Thread Boris Zbarsky

On 9/10/11 11:09 PM, Bjoern Hoehrmann wrote:

I read Daniel as saying that Firefox now implements the correct behavior


Yep.


but the definition of "correct behavior" should be changed to allow some
additional convenience for no reason other than convenience.


He explicitly mentioned that we're getting bug reports of "this works in 
Chrome but not in Firefox" variety, no?  Possibly further down the 
thread, not in the initial post.



So he would make the same suggestion even if everybody implemented the correct
behavior.


That's possible.  If we had UA compat on the correct behavior now, I 
would be rather opposed to changing anything.



If the argument is "but this breaks too many sites" or "we think
that some vendor or other will not implement the correct behavior due to
convenience issues" or whatever other reason


The "argument" is that other vendors have not implemented correct 
behavior even though they've known for years that theirs is incorrect. 
Gecko was just as guilty as anyone else in this matter until recently, 
because getting to correct behavior involved breaking compatibility 
promises we'd made.


So I think the proposal is being offered as a possible middle ground 
that people who don't want to converge on the correct behavior might be 
willing to converge on.  If people are willing to converge on the 
correct behavior, I think that would be pretty good too.  Given Adam's 
response earlier in this thread, I'm not holding my hopes up.


-Boris


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-10 Thread Bjoern Hoehrmann
* Boris Zbarsky wrote:
>I think you misunderstand my position.  I'm weakly against the proposal 
>in question; the strongest argument in favor of the proposal is that 
>there is either a current or future deployed base of data: URIs that 
>won't work without it but do work in either past browsers or some subset 
>of future ones.
>
>Of course the simplest way to prevent the future URIs thing being a 
>problem is for UAs that don't follow the URI spec here right now to fix 
>that, but I haven't sensed much willingness to do that in the past, or 
>earlier in this discussion.  :(
>
>Given the choice between converging on this proposal and the status quo 
>in which UAs just do wildly different totally wacky things, I'd pick the 
>proposal, I think

I read Daniel as saying that Firefox now implements the correct behavior
but the definition of "correct behavior" should be changed to allow some
additional convenience for no reason other than convenience. So he would
make the same suggestion even if everybody implemented the correct
behavior. If the argument is "but this breaks too many sites" or "we think
that some vendor or other will not implement the correct behavior due to
convenience issues" or whatever other reason, we would have a different
argument.
-- 
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-10 Thread Boris Zbarsky

On 9/10/11 10:39 PM, Bjoern Hoehrmann wrote:

* Boris Zbarsky wrote:

On the other hand, this would presumably mostly be a problem for people
hand-writing data: URIs.  Any sort of data: URI generator would get this
right, as you point out.


(That seems very much like saying "Any sort of SQL query generator would
get this right."


Yes, it sort of does, especially if you want it to be pessimistic.  ;)


especially when adopting the proposal so you "normally"
don't "need to" "get this right".)


I think you misunderstand my position.  I'm weakly against the proposal 
in question; the strongest argument in favor of the proposal is that 
there is either a current or future deployed base of data: URIs that 
won't work without it but do work in either past browsers or some subset 
of future ones.


Of course the simplest way to prevent the future URIs thing being a 
problem is for UAs that don't follow the URI spec here right now to fix 
that, but I haven't sensed much willingness to do that in the past, or 
earlier in this discussion.  :(


Given the choice between converging on this proposal and the status quo 
in which UAs just do wildly different totally wacky things, I'd pick the 
proposal, I think


-Boris


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-10 Thread Bjoern Hoehrmann
* Boris Zbarsky wrote:
>On the other hand, this would presumably mostly be a problem for people 
>hand-writing data: URIs.  Any sort of data: URI generator would get this 
>right, as you point out.

(That seems very much like saying "Any sort of SQL query generator would
get this right." especially when adopting the proposal so you "normally"
don't "need to" "get this right".)


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-10 Thread Boris Zbarsky

On 9/10/11 9:04 PM, Nils Dagsson Moskopp wrote:

Oops, partial misunderstanding. While I did not think of SVG (thanks),
I wanted to know how often authors have erred here by not properly
encoding their data, expecting it to work.


Good question.

Given that it used to work in Gecko, WebKit, and Presto (unlike SVG from 
data:, which did not really work in Gecko), it might have been 
reasonably common


On the other hand, this would presumably mostly be a problem for people 
hand-writing data: URIs.  Any sort of data: URI generator would get this 
right, as you point out.
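
As a rough illustration of what such a generator might do (a minimal Python sketch; `make_data_uri` is a hypothetical helper, not something referenced in this thread), either percent-encoding every reserved character or base64-encoding the payload means that no literal "#" ever ends up in the URI:

```python
import base64
from urllib.parse import quote


def make_data_uri(document: str, mime: str = "text/html",
                  use_base64: bool = False) -> str:
    """Build a data: URI whose payload contains no literal '#'.

    Hypothetical helper for illustration only.
    """
    raw = document.encode("utf-8")
    if use_base64:
        # The base64 alphabet (A-Z, a-z, 0-9, '+', '/', '=') has no '#'.
        payload = base64.b64encode(raw).decode("ascii")
        return f"data:{mime};charset=utf-8;base64,{payload}"
    # safe='' percent-encodes everything except unreserved characters,
    # so '#' becomes %23, '<' becomes %3C and a space becomes %20.
    return f"data:{mime};charset=utf-8,{quote(raw, safe='')}"


doc = "<p style='color:#f00'>hi</p>"
print(make_data_uri(doc))
print(make_data_uri(doc, use_base64=True))
```

Hand-written URIs skip exactly this step, which is where the stray "#" comes from.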


I suspect that data: URI usage on the web is rare enough so far that 
there are no serious backwards-compat issues.



Btw: Are there possible security implications of data URI parse changes?


Not so much implications of the "changes", since it's not like UAs 
actually parse them per spec... but yes, a URI like this:


  data:text/html,#doStuff()

is very difficult to sanitize if your URI parser just treats the part 
before '#' as the data while a browser treats everything after the ',' 
as the data.  So there are definitely security implications to the fact 
that the browser behavior is not consistent, either across browsers, 
within a given browser, or with the specs.
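
To make the sanitization problem concrete, here is a minimal Python sketch. The example URI and both helper names are made up for illustration; `urlsplit` is the standard-library parser. It contrasts what a generic URI parser takes to be the payload with what a consumer that ignores "#" would take it to be:

```python
from urllib.parse import urlsplit


def strict_payload(uri: str) -> str:
    """Payload as a generic URI parser sees it: it ends at the first '#'."""
    # For a data: URI, everything between "data:" and '#' lands in .path.
    return urlsplit(uri).path.split(",", 1)[1]


def lenient_payload(uri: str) -> str:
    """Payload as seen by a consumer that treats everything after ',' as data."""
    return uri.split(",", 1)[1]


# Hypothetical example: the markup after '#' is invisible to the strict view.
uri = "data:text/html,hello#<b>world</b>"
print(strict_payload(uri))   # what a sanitizer built on urlsplit inspects
print(lenient_payload(uri))  # what a lenient consumer would render
```

A filter that only examines the first value never sees the markup that the second interpretation would hand to the HTML parser.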


-Boris


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-10 Thread Daniel Holbert

On 09/10/2011 04:53 PM, Nils Dagsson Moskopp wrote:
>> Browsers handle the "#" character in data URIs very differently, and
>> the arguably "correct" behavior is probably not what authors actually
>> want in many cases.
> Do you have any evidence for that assertion, e.g. author surveys,
> occurrence in sites, number of duplicates in mozilla bugzilla (relative
> to other common bugs)?

No large-scale data like that, just a few anecdotal reports in IRC of 
Firefox purportedly being "broken" on particular content (that contained 
a "#"), whereas Chromium was "working".  (one instance about a week ago, 
which prompted this proposal)


Plus, a concern that people can *almost* just stick pure HTML/SVG into a 
data URI (see examples below) except for "#" characters which break things.


> This change would probably have to be communicated to other software
> working with data URIs (Python's urlparse module comes to mind).

Sure, ultimately. One step at a time.

> Do you
> intend to update the RFC on the point or leave that usage
> non-conforming?

I'm not sure. Right now this is just a proposal for better 
interoperability, but ultimately, yeah, it'd be great to have this 
specified.


>> Note that in cases where an author *accidentally* includes "#" inside
>> their data URI (e.g.),
>
> What's with the unencoded bracket (should be %3C) and space (should be
> %20) beforehand? Why wouldn't parsing stop at those points?

Those are fine, actually -- but I should have included an actual URI 
that loads in browsers, like the following (this is what I meant):

  data:text/html,

So to answer your question -- that does render just fine (giant red 
page) in Chromium, without any need to encode the space or the brackets. 
 It also renders fine in Opera if you point an  at it.  (but 
not if you type it directly into the URLbar -- that's the inconsistency 
on their part that I mentioned in my post)


And in Firefox, it renders fine if you just encode the # character:
  data:text/html,
(that makes it load fine from Opera's URLbar, too.)

So no -- practically at least, there's no need to encode the >/< or the 
space character.


~Daniel


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-10 Thread Nils Dagsson Moskopp
Boris Zbarsky wrote on Sat, 10 Sep 2011 20:34:18 -0400:

> On 9/10/11 7:53 PM, Nils Dagsson Moskopp wrote:  
> > Is fragment use in data URIs possible at all?  
> 
> Possible, and desirable; otherwise SVG data: URIs are pretty much
> useless.  

Thanks, I had not thought of that.

> > The last point – interoperability – is satisfied by any widely
> > implemented outcome. The first point – author expectations – I
> > question. So, how often does this occur?  
> 
> Having a '#' in the document being encoded in a data: URI?  Pretty
> much any time the document includes any CSS (colors) or SVG (paint
> servers and the like).  

Oops, partial misunderstanding. While I did not think of SVG (thanks),
I wanted to know how often authors have erred here by not properly
encoding their data, expecting it to work. It just seems weird to me;
the people I know who know about data URIs certainly know how percent
encoding works – often taking advantage of automatic tools for the
conversion.

Btw: Are there possible security implications of data URI parse changes?

-- 
Nils Dagsson Moskopp // erlehmann



Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-10 Thread Bjoern Hoehrmann
* Daniel Holbert wrote:
>  In particular: when a "#" character is followed by ">" or "<" in a data 
>URI, I propose that we *don't* treat the "#" as a delimiter, and instead 
>just treat it as part of the encoded document.

Your proposal does not explain whether this applies to base64 encoded
ones, whether the angle brackets have to occur literally or if they can
also occur in their percent-encoded form, or how you handle multiple '#'
characters like in data:...,...#...<...#example. You also don't say on
which layer this would happen. Obviously having this in the URI syntax
specification with an expectation that all parsing libraries would be
updated to treat the 'data' as a special case is unlikely to go down
well (problem starting with angle brackets being disallowed entirely).

If treating the part after the first "#" as fragment identifier doesn't
cause compatibility problems, as you seem to be suggesting, then that's
great, explaining URI processing would be much simpler. We also do not
have special rules for  despite
someone crafting such an address most likely means the "#" to be data.
There are a number of implementations where the "#" is treated as data,
'javascript' and 'mailto' come to mind, but there it's unreliable and
not widely used, and, more importantly, it's all or nothing, not guess-
work. 

You have to escape all sorts of characters in 'data' URLs to make them
work reliably, you have to escape spaces for instance in order to use
them as part of a white-space separated list of URLs or other syntax
that relies on URLs containing no spaces, and you have to escape '#'s
so they work reliably right now and for however long the current pack
of browsers will be around, even if you don't care about all the non-
browser implementations that are unlikely to support this.

If there isn't very clear evidence that this is needed for reasons of
compatibility, it seems preferable by far to have simpler rules that
actually reflect how this stuff works everywhere than have some magic
rules that apply some of the time that robust code cannot rely upon. I
wouldn't mind such fixups in the address bar, as that is a user input
do what I mean interface, but beyond that it just adds complexity for
very little convenience in edge cases.


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-10 Thread Boris Zbarsky

On 9/10/11 7:53 PM, Nils Dagsson Moskopp wrote:

Is fragment use in data URIs possible at all?


Possible, and desirable; otherwise SVG data: URIs are pretty much useless.


The last point – interoperability – is satisfied by any widely
implemented outcome. The first point – author expectations – I
question. So, how often does this occur?


Having a '#' in the document being encoded in a data: URI?  Pretty much 
any time the document includes any CSS (colors) or SVG (paint servers 
and the like).


-Boris


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-10 Thread Boris Zbarsky

On 9/10/11 7:44 PM, Adam Barth wrote:

It seems like a bad idea to require look-ahead to parse data URLs.  Is
there some reason we can't just treat the whole payload as part of the
document?


Yes.  It breaks the xlink:href='#greenRect' sort of thing in SVG, unless 
you do some sort of other bizarre special-casing that violates the URI RFCs.


Of course WebKit does in fact do that sort of thing in SVG; see 
https://bugs.webkit.org/show_bug.cgi?id=63283 for a more egregious 
example...  But if you fixed those bugs, then you'd break things that 
should actually work, unless you actually allow data: URIs with a 
fragment identifier (which once again is required by the RFCs).


As it is, last I checked WebKit's behavior here is inconsistent; for a 
data: URI containing a '#' it in many cases treats the part after the 
'#' as _both_ part of the content and a fragment identifier.  Again, all 
sorts of broken.
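
A small Python sketch of why "treat the whole payload as the document" goes wrong for SVG. The URI below is a made-up stand-in for a document with a trailing fragment reference, not one of the actual testcases:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Hypothetical SVG data URI that ends with a fragment reference.
uri = ("data:image/svg+xml,"
       "<svg xmlns='http://www.w3.org/2000/svg'><rect id='target'/></svg>"
       "#target")


def parses_as_xml(text: str) -> bool:
    """Return True if text is a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# "Whole payload is the document": everything after the first ','.
whole_payload = uri.split(",", 1)[1]

# Generic URI parsing: the first '#' starts the fragment identifier.
parts = urlsplit(uri)
payload_without_fragment = parts.path.split(",", 1)[1]

print(parses_as_xml(whole_payload))             # "#target" trails the root element
print(parses_as_xml(payload_without_fragment))  # fragment split off first
print(parts.fragment)
```

With the fragment left in the payload, the XML parser sees character data after the root element and rejects the document.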


-Boris


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-10 Thread Bjartur Thorlacius

On Sat, 10 Sep 2011 21:15, Daniel Holbert wrote:

* Opera is interesting -- it can exhibit either the Firefox or WebKit
behaviors in tests A/B/C, depending on whether you view the data URI as an
embedded element (via iframe/img) or view it directly. When you view it
as an embedded element (in my testcase), Opera matches WebKit on A/B/C
(including the XML parse error on C). However, if you *directly view*
the data URIs (right-click on iframe, Frame|Open, focus URLbar & hit
enter), then Opera matches Firefox. Also, Opera passes test D.


So Opera treats the src attribute as a URI, but the href attribute and
identifiers input by users as URI references?  This does not conform to 
the WHATWG HTML5 standard, which delegates the definition of URI to 
RFC 3986, which in turn defines the "#" character as beginning a fragment 
identifier. Nor does it quite conform to RFC 2396 (delegated to by HTML 
4.01), which forbids "#" in URIs but uses it as a separator between URIs 
and URI references (without specifying how to parse URIs that are not 
part of URI references). I believe the HTML 4.01 usage of the term URI 
instead of URI reference to be an error, but the HTML working group has 
to confirm that (or editor(s) on its behalf).
According to my interpretation of RFC 2396 the "#" should terminate the 
URI, as a URI can't contain "#", but this isn't stated explicitly, so I 
can't tell whether Opera violates the RFC, and thus whether it's 
"correct". It's clearly violating RFC 3986, however.
The correct thing to do seems to be to violate HTML 4.01 & RFC 2396 
but conform to HTML5 & RFC 3986. Adding a special case for one URI 
scheme seems a little odd, but I can't think of a use case for fragment 
identifiers in data URIs.


Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-10 Thread Nils Dagsson Moskopp
Daniel Holbert wrote on Sat, 10 Sep 2011 14:15:09 -0700:

> […]

> Browsers handle the "#" character in data URIs very differently, and
> the arguably "correct" behavior is probably not what authors actually
> want in many cases.

Do you have any evidence for that assertion, e.g. author surveys,
occurrence in sites, number of duplicates in mozilla bugzilla (relative
to other common bugs)?

Anecdotally, my take: As an author, I would not think that the
semantics of „#“ in URIs change depending on the scheme. Additionally,
people tend to become confused when stuff gets special-cased
arbitrarily, see the hashbang scenario.

> This could be more intuitive/do-what-I-mean if we restricted the
> cases under which "#" is treated as a fragment-ID delimiter inside of
> data URIs. In particular: when a "#" character is followed by ">" or
> "<" in a data URI, I propose that we *don't* treat the "#" as a
> delimiter, and instead just treat it as part of the encoded document.

This change would probably have to be communicated to other software
working with data URIs (Python's urlparse module comes to mind). Do you
intend to update the RFC on the point or leave that usage
non-conforming?

> Now, a set of tests, to which I'll refer below:
>http://people.mozilla.org/~dholbert/dataURIHashTests/tests_v1.xhtml

> […]

> THE PROPOSAL & HOW IT HELPS:
> 
> We can help out the author by relaxing our fragment-ID-parsing rules
> a bit here.
> 
> Note that in cases where an author *accidentally* includes "#" inside 
> their data URI (e.g. ), there almost
> certainly will be more content following it -- in particular, there
> will be an , or an , or at least a ">" (if it's inside
> the final tag) still to come.

What's with the unencoded bracket (should be %3C) and space (should be
%20) beforehand? Why wouldn't parsing stop at those points?

If it doesn't, the given string isn't a URI anyway, or is it? If it
isn't, error recovery rules are pretty much arbitrary (looking it up in
Your Favourite Search Engine seems to be one popular way).

> So we can proactively check for >/< characters anywhere after the
> "#", and if we find them, then we can pretty safely assume that the
> author intended for the "#" to be part of the document, rather than a
> fragment-ID delimiter.

Is fragment use in data URIs possible at all? Also, my common sense
tingles: It seems to me that would be a category error. Discuss.

> […]
> 
> With my proposal here -- relaxing the situations under which "#"
> should be treated as a delimiter in a data URI -- I think we'd better
> match author expectations and improve the browser-compatibility
> picture.

The last point – interoperability – is satisfied by any widely
implemented outcome. The first point – author expectations – I
question. So, how often does this occur?

> Thoughts?

Interesting.

-- 
Nils Dagsson Moskopp // erlehmann



Re: [whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-10 Thread Adam Barth
It seems like a bad idea to require look-ahead to parse data URLs.  Is
there some reason we can't just treat the whole payload as part of the
document?  That's almost certainly what authors want.

Adam


On Sat, Sep 10, 2011 at 2:15 PM, Daniel Holbert  wrote:
> Hi whatwg,
>
> I'm writing with a proposal to improve the handling of "#" in data URIs. I'm
> particularly looking for feedback from other browser vendors, but of course
> feedback from others is welcome as well.
>
> SUMMARY:
> 
> Browsers handle the "#" character in data URIs very differently, and the
> arguably "correct" behavior is probably not what authors actually want in
> many cases.
>
> This could be more intuitive/do-what-I-mean if we restricted the cases under
> which "#" is treated as a fragment-ID delimiter inside of data URIs.  In
> particular: when a "#" character is followed by ">" or "<" in a data URI, I
> propose that we *don't* treat the "#" as a delimiter, and instead just treat
> it as part of the encoded document.
>
> Now, a set of tests, to which I'll refer below:
>  http://people.mozilla.org/~dholbert/dataURIHashTests/tests_v1.xhtml
>
> PROBLEM:
> 
> When an author writes a data URI for a document that contains a "#"
> character, she may unintentionally end up with broken results (or at least
> inconsistently-handled results), because the "#" may be treated as the end
> of the document & the beginning of the URI's fragment identifier.
>
> (I believe this to be the _technically_ correct (albeit unintuitive)
> behavior per the URI RFC [1] -- it's the behavior we've implemented in
> Firefox 6 [2] and it's what I've described as "Correct" in my testcase.
> (with quotes to indicate unintuitiveness))
>
> Technically, the author *really* should encode the "#" character as "%23",
> if she doesn't want it to be a delimiter.
>
> However, this gotcha is easy to overlook -- especially because Opera &
> Webkit are less strict than Firefox in this respect and will gladly accept
> "#" inside data URIs under some circumstances.
>
> THE PROPOSAL & HOW IT HELPS:
> 
> We can help out the author by relaxing our fragment-ID-parsing rules a bit
> here.
>
> Note that in cases where an author *accidentally* includes "#" inside their
> data URI (e.g. ), there almost certainly will be
> more content following it -- in particular, there will be an , or an
> , or at least a ">" (if it's inside the final tag) still to come.
>
> So we can proactively check for >/< characters anywhere after the "#", and
> if we find them, then we can pretty safely assume that the author intended
> for the "#" to be part of the document, rather than a fragment-ID delimiter.
>
> OVERVIEW OF BROWSERS' CURRENT HANDLING OF "#" IN DATA URIs:
> ===
> url: http://people.mozilla.org/~dholbert/dataURIHashTests/tests_v1.xhtml
>
>  * Firefox 6+ breaks the author's expectations in my tests A & B due to URI
> parsing strictness. (But if we were to implement the above proposal, we'd
> match the author's expectations.)  We pass test C due to correctly trimming
> "#target" off of the end and scrolling to the referenced element.  And we
> fail test D only due to a bug with over-enforcing same-origin checks.[3]
>
>  * WebKit matches the author's expectations on A & B -- however, that's only
> because they don't seem to support "#ref" suffixes on the ends of data URIs
> at all, so they _always_ include "#" in the document.  (They *do* apparently
> support _relative_ references within data URI documents, e.g.
> xlink:href='#greenRect' as used in test B.)  So, Webkit ends up failing test
> C because they don't strip off the "#target" suffix (resulting in broken
> XML).  They fail test D presumably for the same reason.  (They also have
> some zooming issues on the  examples, but I'm ignoring those for the
> purposes of this post.)
>
>  * Opera is interesting -- it can exhibit either the Firefox or WebKit
> behaviors in tests A/B/C, depending on whether you view the data URI as an embedded
> element (via iframe/img) or view it directly.  When you view it as an
> embedded element (in my testcase), Opera matches WebKit on A/B/C (including
> the XML parse error on C).  However, if you *directly view* the data URIs
> (right-click on iframe, Frame|Open, focus URLbar & hit enter), then Opera
> matches Firefox.  Also, Opera passes test D.
>
> (I don't have results for IE -- I briefly tried to support it in the test,
> but I had issues getting data URIs to work there at all.)
>
> CONCLUSION:
> ===
> So - to sum up the test-results above: webkit doesn't give "#" any special
> delimiter status in data URIs, which is a bug, but probably matches what
> authors intend a lot of the time; Opera sometimes behaves like Webkit and
> sometimes not; and Firefox parses fragment-identifiers strictly, potentially
> giving authors headaches and truncating content that renders fine in
> Opera/Webkit.
>
> With my proposal here [...]

[whatwg] Proposal for improved handling of '#' inside of data URIs

2011-09-10 Thread Daniel Holbert

Hi whatwg,

I'm writing with a proposal to improve the handling of "#" in data URIs. 
I'm particularly looking for feedback from other browser vendors, but of 
course feedback from others is welcome as well.


SUMMARY:

Browsers handle the "#" character in data URIs very differently, and the 
arguably "correct" behavior is probably not what authors actually want in 
many cases.


This could be more intuitive/do-what-I-mean if we restricted the cases 
under which "#" is treated as a fragment-ID delimiter inside of data URIs. 
 In particular: when a "#" character is followed by ">" or "<" in a data 
URI, I propose that we *don't* treat the "#" as a delimiter, and instead 
just treat it as part of the encoded document.


Now, a set of tests, to which I'll refer below:
  http://people.mozilla.org/~dholbert/dataURIHashTests/tests_v1.xhtml

PROBLEM:

When an author writes a data URI for a document that contains a "#" 
character, she may unintentionally end up with broken results (or at least 
inconsistently-handled results), because the "#" may be treated as the end 
of the document & the beginning of the URI's fragment identifier.


(I believe this to be the _technically_ correct (albeit unintuitive) 
behavior per the URI RFC [1] -- it's the behavior we've implemented in 
Firefox 6 [2] and it's what I've described as "Correct" in my testcase. 
(with quotes to indicate unintuitiveness))


Technically, the author *really* should encode the "#" character as "%23", 
if she doesn't want it to be a delimiter.


However, this gotcha is easy to overlook -- especially because Opera & 
Webkit are less strict than Firefox in this respect and will gladly accept 
"#" inside data URIs under some circumstances.


THE PROPOSAL & HOW IT HELPS:

We can help out the author by relaxing our fragment-ID-parsing rules a bit 
here.


Note that in cases where an author *accidentally* includes "#" inside 
their data URI (e.g. ), there almost certainly 
will be more content following it -- in particular, there will be an 
, or an , or at least a ">" (if it's inside the final tag) 
still to come.


So we can proactively check for >/< characters anywhere after the "#", and 
if we find them, then we can pretty safely assume that the author intended 
for the "#" to be part of the document, rather than a fragment-ID delimiter.
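
One way the check could be sketched in Python. This is only an illustration under stated assumptions, not a specification: `split_data_uri` is a hypothetical name, only literal (not percent-encoded or base64-encoded) angle brackets are considered, and when several "#" characters are present the first one with no "<" or ">" after it is taken as the delimiter:

```python
from typing import Optional, Tuple


def split_data_uri(uri: str) -> Tuple[str, Optional[str]]:
    """Split a data: URI into (uri_without_fragment, fragment).

    Assumed reading of the proposal: a '#' that has a literal '<' or '>'
    anywhere after it is document content, not a delimiter.
    """
    # Position of the last angle bracket, or -1 if there is none.
    last_bracket = max(uri.rfind("<"), uri.rfind(">"))
    # The first '#' after that point has no bracket following it.
    idx = uri.find("#", last_bracket + 1)
    if idx == -1:
        return uri, None
    return uri[:idx], uri[idx + 1:]


# '#' inside markup: kept as part of the document.
print(split_data_uri("data:text/html,<p style='color:#f00'>hi</p>"))
# '#' after the last tag: still a fragment delimiter.
print(split_data_uri("data:image/svg+xml,<svg></svg>#target"))
# No brackets at all: behaves like ordinary URI parsing.
print(split_data_uri("data:text/plain,a#b"))
```

Note that the scan has to look ahead to the end of the URI before it can classify any "#".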


OVERVIEW OF BROWSERS' CURRENT HANDLING OF "#" IN DATA URIs:
===
url: http://people.mozilla.org/~dholbert/dataURIHashTests/tests_v1.xhtml

 * Firefox 6+ breaks the author's expectations in my tests A & B due to 
URI parsing strictness. (But if we were to implement the above proposal, 
we'd match the author's expectations.)  We pass test C due to correctly 
trimming "#target" off of the end and scrolling to the referenced element. 
 And we fail test D only due to a bug with over-enforcing same-origin 
checks.[3]


 * WebKit matches the author's expectations on A & B -- however, that's 
only because they don't seem to support "#ref" suffixes on the ends of 
data URIs at all, so they _always_ include "#" in the document.  (They 
*do* apparently support _relative_ references within data URI documents, 
e.g. xlink:href='#greenRect' as used in test B.)  So, Webkit ends up 
failing test C because they don't strip off the "#target" suffix 
(resulting in broken XML).  They fail test D presumably for the same 
reason.  (They also have some zooming issues on the  examples, but 
I'm ignoring those for the purposes of this post.)


 * Opera is interesting -- it can exhibit either the Firefox or WebKit 
behaviors in tests A/B/C, depending on whether you view the data URI as an embedded 
element (via iframe/img) or view it directly.  When you view it as an 
embedded element (in my testcase), Opera matches WebKit on A/B/C 
(including the XML parse error on C).  However, if you *directly view* the 
data URIs (right-click on iframe, Frame|Open, focus URLbar & hit enter), 
then Opera matches Firefox.  Also, Opera passes test D.


(I don't have results for IE -- I briefly tried to support it in the test, 
but I had issues getting data URIs to work there at all.)


CONCLUSION:
===
So - to sum up the test-results above: webkit doesn't give "#" any special 
delimiter status in data URIs, which is a bug, but probably matches what 
authors intend a lot of the time; Opera sometimes behaves like Webkit and 
sometimes not; and Firefox parses fragment-identifiers strictly, 
potentially giving authors headaches and truncating content that renders 
fine in Opera/Webkit.


With my proposal here -- relaxing the situations under which "#" should be 
treated as a delimiter in a data URI -- I think we'd better match author 
expectations and improve the browser-compatibility picture.


Thoughts?

Thanks,
Daniel Holbert
Mozilla Corporation

P.S. Thanks to Robert O'Callahan for coming up with this proposal a week 
or so back.


P.P.S. Browser versions that I