Martin Atkins wrote:
> Since your list is long, I'm only going to address things I have an

>> | 7.3.3. HTML-Based Discovery
>
> In practice, few implementations actually use an HTML parser to find
> these elements. These extra restrictions are present to facilitate
> regex-based parsing.

Yes, and this is the problem. Implementors may *think* that they can
get away with regexp parsing when in fact they can't. HTML/XHTML
requires at least a context-free parser, which is one level above
regular expressions in the Chomsky hierarchy. Even if they start mixing
regexps and other code, it is likely that they won't handle all the
corner cases of HTML correctly.

The effect is that an OpenID login may work on 80% of all sites ... and
not on the other 20% that use a different parser. And the user will not
even know _why_ his login fails. After all, validators and other HTML
checking tools will tell him that his site is valid HTML. It's even
possible that some parsers fail on things other parsers require.

> The regex-based parsers employed by existing implementations require
> explicit <head> start and end tags. I agree that this is not ideal, but
> it's hardly an onerous requirement on document authors.

Currently, the spec does not require explicit start and end tags for
the HEAD element. It talks about a "HEAD section", which is always there
even if it is not marked (see
<http://www.w3.org/TR/html401/intro/sgmltut.html#h-3.2.1>).

This is already an incompatibility caused by unclear wording.

In order to facilitate regexp parsing, just requiring the start and end
tags is not enough. Additional restrictions may also be necessary to
avoid cases where overly simple regexp-based parsers might fail:

- <head> start tags with attributes.
- order of attributes within the <LINK> tag.
- single quotes vs. double quotes vs. no quotes.
- unescaped "<"/">" within attributes.*
- numeric character references.*
- line feeds within tags.*
- additional XML namespaces that allow attributes like foo:href.*
- <LINK> tags within <!-- comments -->.*
- [to be continued]

(* = inspired by a real-world implementation failing to handle these
cases correctly)

If you want to handle all of these correctly, you already need a true
HTML parser.

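To make a few of the cases listed above concrete, here is a minimal
Python sketch. It is not taken from any existing OpenID library: the
function and class names, the regexp, and the op.example URLs are all
made up for illustration, and the rel value "openid2.provider" is
assumed to be the one the draft uses for HTML-based discovery. Note
also that Python's html.parser is only a tolerant tokenizer, not a
complete HTML parser (it does not infer an implied HEAD section, for
instance), so it only stands in for "something better than a regexp".

```python
import re
from html.parser import HTMLParser

# Hypothetical naive pattern: assumes rel before href, double quotes,
# a single space between attributes, and everything on one line.
NAIVE_LINK = re.compile(r'<link rel="openid2\.provider" href="([^"]*)"')


def find_provider_regex(html):
    """Return the raw href matched by the naive regexp, or None."""
    match = NAIVE_LINK.search(html)
    return match.group(1) if match else None


class ProviderFinder(HTMLParser):
    """Records the href of the first <link> whose rel contains openid2.provider."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.provider = None

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attribute values
        # arrive unquoted and with character references resolved.
        if tag != "link" or self.provider is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").split()
        if "openid2.provider" in rel_values and attributes.get("href"):
            self.provider = attributes["href"]


def find_provider_parser(html):
    """Return the provider URL found by the tokenizer, or None."""
    finder = ProviderFinder()
    finder.feed(html)
    finder.close()
    return finder.provider


# Attribute order swapped, single quotes, line feed inside the tag.
REORDERED = ("<html><head><link href='http://op.example/server'\n"
             " rel='openid2.provider'></head></html>")

# A stale LINK element that has been commented out.
COMMENTED = ('<head><!-- <link rel="openid2.provider" '
             'href="http://old.example/"> --></head>')

# Numeric character reference inside the attribute value.
CHARREF = ('<head><link rel="openid2.provider" '
           'href="http://op.example/?a=1&#38;b=2"></head>')

if __name__ == "__main__":
    for name, doc in [("reordered", REORDERED),
                      ("commented", COMMENTED),
                      ("charref", CHARREF)]:
        print(name, find_provider_regex(doc), find_provider_parser(doc))
```

On these three inputs the naive regexp misses the reordered element,
picks up the commented-out one, and returns the href with the character
reference left unresolved, while the tokenizer-based version handles
all three. All three documents would pass a validator.
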
The fewer of these restrictions are added, the less likely it is that
regexp-based parsers will interoperate.

> This is mostly an ideological argument founded on whether we're allowed
> to impose additional restrictions on HTML documents that are making use
> of OpenID discovery. There is certainly no *practical* reason why this
> shouldn't be done, assuming that the restrictions are sufficient to
> prevent the above attack.

There are practical problems:

* Users can't use existing HTML tools to check for the additional
  restrictions. A validator will say "valid HTML" but the OpenID login
  fails due to a "parsing error" (e.g. the PHP implementation used on
  OpenID Enabled). And different RPs will choke on different things.

* Users can't use existing HTML tools that do not honor the additional
  restrictions. An HTML pretty-printer may simply re-format the code in
  a way that ad-hoc parsers cannot parse; a hypothetical htmlcrush
  program might remove the optional quotes, entity references and tags
  in good faith.

* Other specs might also impose restrictions, which can be incompatible
  with OpenID's restrictions.

The more restrictions are added, the more likely it is that these
practical problems will arise.

Claus