In Blink's implementation, we actually use two additional tokenizer
states for CDATA:
CDATASectionRightSquareBracketState,
CDATASectionDoubleRightSquareBracketState,
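
For illustration only, here is a minimal Python sketch of how two extra
states can replace the three-character lookahead. It is not Blink's actual
code: the state names are adapted from the ones above, the function name
tokenize_cdata is hypothetical, and the handling of a third "]" and of EOF
(flushing the pending brackets) are assumptions of this sketch.

```python
from enum import Enum, auto


class State(Enum):
    CDATA_SECTION = auto()
    CDATA_SECTION_RIGHT_SQUARE_BRACKET = auto()
    CDATA_SECTION_DOUBLE_RIGHT_SQUARE_BRACKET = auto()
    DATA = auto()


def tokenize_cdata(text):
    """Consume a CDATA section body one character at a time, with no lookahead.

    Returns (emitted_characters, input_remaining_after_the_section).
    """
    state = State.CDATA_SECTION
    out = []
    i = 0
    while i < len(text) and state is not State.DATA:
        ch = text[i]
        if state is State.CDATA_SECTION:
            if ch == "]":
                state = State.CDATA_SECTION_RIGHT_SQUARE_BRACKET
            else:
                out.append(ch)
            i += 1
        elif state is State.CDATA_SECTION_RIGHT_SQUARE_BRACKET:
            if ch == "]":
                state = State.CDATA_SECTION_DOUBLE_RIGHT_SQUARE_BRACKET
                i += 1
            else:
                # False alarm: emit the pending "]" and reconsume ch.
                out.append("]")
                state = State.CDATA_SECTION
        else:  # CDATA_SECTION_DOUBLE_RIGHT_SQUARE_BRACKET
            if ch == ">":
                state = State.DATA
                i += 1
            elif ch == "]":
                # "]]]": the oldest bracket is plain data; two remain pending.
                out.append("]")
                i += 1
            else:
                # False alarm: emit both pending brackets and reconsume ch.
                out.append("]]")
                state = State.CDATA_SECTION
    # EOF inside the section: flush any brackets still pending (assumption).
    if state is State.CDATA_SECTION_RIGHT_SQUARE_BRACKET:
        out.append("]")
    elif state is State.CDATA_SECTION_DOUBLE_RIGHT_SQUARE_BRACKET:
        out.append("]]")
    return "".join(out), text[i:]
```

Calling tokenize_cdata("a]b]]c]]]>rest") walks through every branch: the
lone and doubled brackets are emitted as data, and only the final "]]>"
closes the section.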
Adam
On Sun, Jun 8, 2014 at 6:24 PM, Geoffrey Sneddon
wrote:
> It would aid programmatic conversion of the spec, and make reading it less
> confusing for me (thereby avoiding bugs like 25871), if these states matched
> the model of the rest of the tokenizer.
>
> Thus I propose the bogus comment state becomes:
>
>> Consume the next input character:
>>
>> U+003E GREATER-THAN SIGN (>):
>>
>> Switch to the data state. Emit the comment token.
>>
>> U+0000 NULL:
>>
>> Append a U+FFFD REPLACEMENT CHARACTER character to the comment token's data.
>>
>> EOF:
>>
>> Switch to the data state. Emit the comment token. Reconsume the EOF
>> character.
>>
>> Anything else:
>>
>> Append the current input character to the comment token's data.
>
> This also necessitates creating a new comment token prior to entering
> the bogus comment state.
>
> The CDATA section state should become:
>
>> Consume the next input character:
>>
>> U+005D RIGHT SQUARE BRACKET (]):
>>
>> If the three characters starting from the current input character are U+005D
>> RIGHT SQUARE BRACKET U+005D RIGHT SQUARE BRACKET U+003E GREATER-THAN SIGN
>> (]]>), then consume those characters and switch to the data state.
>> Otherwise, emit the current input character as a character token.
>>
>> EOF:
>>
>> Switch to the data state. Reconsume the EOF character.
>>
>> Anything else:
>>
>> Emit the current input character as a character token.
>
> No changes are needed elsewhere for this. (There is no consistent style
> for lookahead — and most cases are ASCII case-insensitive words — so I
> went with what seems sane here!)
>
> /Geoffrey
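
As a rough sketch of the bogus comment state proposed in the quoted text,
the per-character loop could look like the Python below. The function name
bogus_comment_state, its parameters, and the return shape are hypothetical;
the comment_data argument stands in for the comment token that must already
exist before the state is entered.

```python
REPLACEMENT_CHARACTER = "\ufffd"


def bogus_comment_state(text, pos, comment_data=""):
    """Run the per-character bogus comment state starting at text[pos].

    Returns (comment_data, next_pos). On return the comment token has been
    emitted and the tokenizer is back in the data state.
    """
    data = [comment_data]
    while True:
        if pos >= len(text):
            # EOF: emit the comment; pos is not advanced, so the data
            # state sees (reconsumes) EOF next.
            return "".join(data), pos
        ch = text[pos]
        pos += 1
        if ch == ">":
            # U+003E: switch to the data state and emit the comment token.
            return "".join(data), pos
        if ch == "\x00":
            # U+0000 NULL is replaced rather than appended as-is.
            data.append(REPLACEMENT_CHARACTER)
        else:
            data.append(ch)
```

For example, bogus_comment_state("a\x00b>rest", 0) yields the comment data
"a\ufffdb" and leaves the position just past the ">".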