Claus Färber wrote:
>
> In order to facilitate regexp parsing, just requiring the start and end
> tags is not enough. Additional restrictions may also be necessary to
> avoid cases where too simple regexp-based parsers might fail:
>
> - <head> start with attributes.
> - order of attributes within the <LINK> tag.
> - single quotes vs. double quotes vs. no quotes.
> - unescaped "<"/">" within attributes.*
> - numeric character references.*
> - line feeds within tags.*
> - additional XML namespaces that allow attributes like foo:href.*
> - <LINK> tags within <!-- comments -->.*
> - [to be continued]
>
> (* = inspired by a real-world implementation failing to handle these
> cases correctly)
>

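To make a few of the cases in that list concrete, here is a rough Python sketch comparing a deliberately naive regex with the standard library's html.parser. The pattern, the function names and the example.com URLs are all made up for illustration; this is not the code of any existing OpenID library, and it does not check that the LINK sits inside <head>.

```python
import re
from html.parser import HTMLParser

# Hypothetical "too simple" pattern: fixed attribute order, double quotes only.
NAIVE = re.compile(r'<link rel="openid\.server" href="([^"]*)"', re.IGNORECASE)


def naive_server(html):
    """Return the href captured by the naive regex, or None."""
    match = NAIVE.search(html)
    return match.group(1) if match else None


class LinkFinder(HTMLParser):
    """Records the href of the first LINK whose rel contains openid.server."""

    def __init__(self):
        super().__init__()
        self.server = None

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names, strips the quotes
        # and decodes character references in attribute values.
        if tag != "link" or self.server is not None:
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        if "openid.server" in rels and attributes.get("href"):
            self.server = attributes["href"]


def parsed_server(html):
    """Return the openid.server href found by the HTML parser, or None."""
    finder = LinkFinder()
    finder.feed(html)
    finder.close()
    return finder.server


# Attribute order swapped, single quotes, <head> carrying an attribute.
DOC_ORDER = (
    '<head profile="http://example.com/profile">'
    "<link href='http://example.com/server' rel=\"openid.server\"></head>"
)

# A commented-out LINK, then a real one with line feeds inside the tag,
# an unquoted rel value and a numeric character reference in the href.
DOC_COMMENT = (
    '<!-- <link rel="openid.server" href="http://old.example.com/"> -->\n'
    '<link\n rel=openid.server\n href="http://example.com/s?a=1&#38;b=2">'
)

if __name__ == "__main__":
    for doc in (DOC_ORDER, DOC_COMMENT):
        print("regex: ", naive_server(doc))
        print("parser:", parsed_server(doc))
```

On the first document the regex finds nothing at all; on the second it picks up the commented-out URL, while the parser skips the comment and returns the real href with the character reference decoded.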
And, in theory, the OpenID spec could add additional restrictions to
"fix" the above problems. Whether it should or not is of course up for
debate; I'd be interested to hear from Brad Fitzpatrick and JanRain's
developers, who are responsible for the most-used implementations
currently using regex parsing. Why didn't you guys use an HTML parser?
I assume there must have been a reason.

>
>> This is mostly an ideological argument founded on whether we're allowed
>> to impose additional restrictions on HTML documents that are making use
>> of OpenID discovery. There is certainly no *practical* reason why this
>> shouldn't be done, assuming that the restrictions are sufficient to
>> prevent the above attack.
>
> There are practical problems:
>
> * Users can't use existing HTML tools to check for the additional
> restrictions. A validator will say "valid HTML" but the OpenID
> login fails due to a "parsing error" (e.g. the PHP implementation used
> on OpenID Enabled). And different RP will choke on different things.

An HTML validator also won't help them if they transpose the values of
openid.server and openid.delegate, or if they type rel="opnid.server"
instead. There are OpenID-specific "validation"/checking tools in the
works which will hopefully be able to give users good information about
potential pitfalls in the way they have written their HTML, in addition
to pointing out things such as a missing openid.server LINK.

> * Users can't use existing HTML tools that do not honor the additional
> restrictions. A HTML pretty-printer may simply re-format the code in
> a way unparsable by ad-hoc parsers; a hypothetical htmlcrush program
> might may remove the optional quotes, entity references and tags in
> good faith.

Indeed. But those documents wouldn't conform to the OpenID specification
(assuming that it went into more detail about the restrictions it is
adding to HTML).

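As a rough idea of what such an OpenID-specific check could look like, here is a minimal Python sketch. The function name, the warning texts and the 0.8 similarity cutoff are all assumptions chosen for illustration, not taken from any of the tools in the works. It flags rel values that look like misspellings of the known ones and reports a missing openid.server LINK; it makes no attempt to detect transposed openid.server and openid.delegate values.

```python
import difflib
from html.parser import HTMLParser

KNOWN_RELS = ("openid.server", "openid.delegate")


class RelCollector(HTMLParser):
    """Collects every rel token that appears on a LINK element."""

    def __init__(self):
        super().__init__()
        self.rels = []

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            rel = dict(attrs).get("rel") or ""
            self.rels.extend(rel.lower().split())


def check_openid_links(html):
    """Return a list of human-readable warnings (empty if nothing was found)."""
    collector = RelCollector()
    collector.feed(html)
    collector.close()

    warnings = []
    for rel in collector.rels:
        if rel in KNOWN_RELS:
            continue
        # Assumed heuristic: a near miss of a known rel value is probably a typo.
        close = difflib.get_close_matches(rel, KNOWN_RELS, n=1, cutoff=0.8)
        if close:
            warnings.append(f'rel="{rel}" looks like a typo for "{close[0]}"')
    if "openid.server" not in collector.rels:
        warnings.append("no openid.server LINK found")
    return warnings


if __name__ == "__main__":
    page = '<link rel="opnid.server" href="http://example.com/server">'
    for warning in check_openid_links(page):
        print(warning)
```

Unrelated rel values such as "stylesheet" fall well below the cutoff and are ignored.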
I think the main point here is that, regardless of the outcome of this
debate, people *will* write regex-based parsers, whether the spec allows
for it or not. We have a choice between ignoring the issue, so that all
of these regex-based parsers fail in interesting ways on odd cases, or
accepting that this is inevitable and listing in detail a set of rules
for regex-based parsing, in addition to a set of restrictions on HTML
that make those parsing rules possible.

I'd love it if everyone would use proper HTML or XML parsers, but that
just isn't going to happen no matter how much we wish it would. In the
end, "almost there but not quite" implementations hurt no-one but the
end-user, and OpenID is what will get the blame for any negative user
experience, not the libraries that use incompatible regex-based parsers.

_______________________________________________
specs mailing list
specs@openid.net
http://openid.net/mailman/listinfo/specs