On 11/3/11 4:21 AM, Henri Sivonen wrote:
On Thu, Nov 3, 2011 at 1:57 AM, David Flanagan<dflana...@mozilla.com>  wrote:
Firefox, Chrome and Safari all seem to do the right thing: wait for the next
character before tokenizing the CR.
See http://software.hixie.ch/utilities/js/live-dom-viewer/saved/1247
I hadn't used the live dom viewer before.  That's really useful!

Firefox tokenizes the CR immediately, emits an LF and then skips over
the next character if it is an LF. When I designed the solution
Firefox uses, I believed it was more correct and more compatible with
legacy than whatever the spec said at the time.
I'm having a Duh! moment... I currently wait for the next character, but what you describe also works, and it allows the document.write() spec to make sense.

Chrome seems to wait for the next character before tokenizing the CR.

And I think this means that the description of document.write needs to be 
changed.
All along, I've felt that having U+0000 and CRLF handling as a
stream preprocessing step was bogus and that both should happen upon
tokenization. So far, I've managed to convince Hixie about U+0000
handling.
Each tokenizer state would need a rule for CR that says "emit LF, save the current tokenizer state, and switch to the after CR state". Actually, tokenizer states that already have a rule for LF or whitespace would have to integrate this CR rule into that rule. The new after CR state would then have two rules: on LF, skip the character and restore the saved state; on anything else, push the character back and restore the saved state.
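As a rough illustration, the scheme above can be sketched like this (a minimal, hypothetical Python sketch; the class and state names are illustrative and not taken from any real parser):

```python
# Sketch of the proposed CR handling: on CR, emit LF, save the current
# state, and enter an "after CR" state that swallows a following LF or
# pushes anything else back into the saved state.

AFTER_CR = "after CR state"

class Tokenizer:
    def __init__(self):
        self.state = "data state"   # whichever tokenizer state is current
        self.saved_state = None
        self.output = []            # characters handed to later stages

    def consume(self, ch):
        if self.state == AFTER_CR:
            self.state = self.saved_state
            if ch == "\n":
                return              # skip the LF of a CRLF pair
            self.consume(ch)        # push back: reprocess in the saved state
            return
        if ch == "\r":
            self.output.append("\n")  # emit LF in place of the CR immediately
            self.saved_state = self.state
            self.state = AFTER_CR
            return
        self.output.append(ch)      # normal handling (greatly simplified)
```

Feeding "a\r\nb" or "a\rb" both yield the characters a, LF, b, and a CR at the end of a document.write() chunk is emitted as LF right away rather than making the tokenizer wait for the next character.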

Similarly, what should the tokenizer do if the document.write emits half of
a UTF-16 surrogate pair as the last character?
The parser operates on UTF-16 code units, so a lone surrogate is emitted.

The spec seems pretty unambiguous that it operates on code points (though I implemented mine using 16-bit code units). §13.2.1 says: "The input to the HTML parsing process consists of a stream of Unicode code points". Also, §13.2.2.3 includes a list of code points beyond the BMP that are parse errors. And finally, the tests in http://code.google.com/p/html5lib/source/browse/testdata/tokenizer/unicodeCharsProblematic.test require unpaired surrogates to be converted to the U+FFFD replacement character. (Though my experience is that modifying my tokenizer to pass those tests causes other tests to fail, which makes me wonder whether unpaired surrogates are supposed to be replaced in only some tokenizer states rather than all of them.)
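For what it's worth, the replacement the html5lib tests seem to expect can be sketched over a stream of 16-bit code units like this (a hypothetical helper, not from the spec or html5lib; the function name is mine):

```python
# Replace unpaired UTF-16 surrogates with U+FFFD while combining valid
# surrogate pairs into a single code point beyond the BMP.

def replace_unpaired_surrogates(units):
    """units: list of 16-bit code unit values; returns a list of code points."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:                       # high (lead) surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                lo = units[i + 1]                       # valid pair: combine
                out.append(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00))
                i += 2
                continue
            out.append(0xFFFD)                          # lone lead surrogate
        elif 0xDC00 <= u <= 0xDFFF:                     # lone trail surrogate
            out.append(0xFFFD)
        else:
            out.append(u)                               # ordinary BMP unit
        i += 1
    return out
```

For example, the pair 0xD83D 0xDE00 combines to U+1F600, while a lone 0xD800 or 0xDC00 becomes U+FFFD.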
Thanks, Henri!

    David
