Would it be worthwhile to write up the steps the JanRain parser uses and see which of them could be included in the OpenID spec to help out implementors?
--David

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Josh Hoyt
Sent: Friday, January 26, 2007 1:07 PM
To: Martin Atkins
Cc: specs@openid.net
Subject: Re: HTML parsing in HTML-based discovery

On 1/26/07, Martin Atkins <[EMAIL PROTECTED]> wrote:
> And, in theory, the OpenID spec could add additional restrictions to
> "fix" the above problems.
>
> Whether it should or not is of course up for debate; I'd be interested
> to hear from Brad Fitzpatrick and JanRain's developers, who are
> responsible for the most-used implementations currently using regex
> parsing. Why didn't you guys use an HTML parser? I assume there must
> have been a reason.

While our parser [1] is implemented using regular expressions, I don't
think it's what most people mean when they say "parsing with regexps."

The reason that (most of) the JanRain libraries don't use an HTML parser
is that I was afraid "loose" parsing of HTML could lead to OpenID
discovery markup being recognized in parts of the document other than
the head, opening vulnerabilities where someone could hijack a URL by,
e.g., adding a malicious comment to a blog. When implementing this
parser, I tried to be aware of both the varied markup that exists across
the Web and the security implications of different parsing strategies.

In short, the parser tries to recognize any OpenID markup that is
unambiguously in the <head> of an HTML document and reject everything
else. If you're interested in the details, read the comments at the top
of the file (again, [1]).

The parser does not work for some valid XHTML cases, most notably
documents that use an explicit XML namespace for the XHTML tags. It also
does not handle some cases of valid HTML 4. [2] I am sure there are
other cases in which valid markup is not recognized.
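The head-only strategy described above could be sketched roughly as follows. This is purely illustrative: the regexes and names are my own, not the python-openid code, which handles far more edge cases (comments, scripts, ambiguous markup) than this does.

```python
import re

# Hypothetical sketch of a head-only discovery scan: find the <head>
# section, refuse to look anywhere else, and only then extract
# openid.server / openid.delegate <link> tags.
HEAD_RE = re.compile(r'<head[^>]*>(.*?)</head>', re.IGNORECASE | re.DOTALL)
LINK_RE = re.compile(r'<link\b([^>]*?)/?>', re.IGNORECASE)
ATTR_RE = re.compile(r'\b(rel|href)\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)

def openid_links(html):
    """Return {rel: href} for OpenID <link> tags found inside <head> only."""
    head_match = HEAD_RE.search(html)
    if head_match is None:
        return {}  # no unambiguous head: reject the whole document
    found = {}
    for link in LINK_RE.finditer(head_match.group(1)):
        attrs = dict((k.lower(), v) for k, v in ATTR_RE.findall(link.group(1)))
        if attrs.get('rel') in ('openid.server', 'openid.delegate'):
            found[attrs['rel']] = attrs.get('href')
    return found
```

The key point is the early bail-out: a `<link>` tag in a blog comment in the body is never even scanned, which is what closes the URL-hijacking hole Josh mentions.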
My implementation and design strategy was to build something that is
immune to the attacks I could conceive of and that works for the vast
majority of the markup that exists in the wild.

Ironically, another reason we implemented this parser instead of using
an (X)HTML parsing library is that, according to the spec, only certain
entities should be processed in the <link> tag's attributes, while a
real (X)HTML parser would process all of the entities. It's arguable
whether processing the entities is right or wrong, because the user who
marked up the page is violating the spec, but I figured that by
following the spec exactly, the users who put in this markup would have
a more consistent experience (i.e., their markup would be broken for all
libraries rather than accepted by some and rejected by others).

In an ideal world, we'd be able to specify that the document was, e.g.,
valid XHTML and just use a nice, strict parser on it. The reality of the
situation forces us to deal with the markup that's out there, which
leads to trade-offs no matter how we choose to deal with that markup.

Josh

1. http://www.openidenabled.com/resources/darcsweb?r=python-openid;a=headblob;f=/openid/consumer/parse.py
2. http://www.intertwingly.net/blog/2006/12/28/Unobtrusive-OpenID#c1167404328

_______________________________________________
specs mailing list
specs@openid.net
http://openid.net/mailman/listinfo/specs
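[Editor's sketch of the restricted entity handling Josh describes. Assuming the "certain entities" are the four basic XML entities (the exact whitelist is the spec's, not confirmed here), the difference from a real parser is that only whitelisted entities are replaced, in a single pass, and everything else, numeric references included, passes through untouched.]

```python
import re

# Only these entities are replaced in <link> attribute values; the
# whitelist here is an assumption for illustration.
BASIC_ENTITIES = {'amp': '&', 'lt': '<', 'gt': '>', 'quot': '"'}
ENTITY_RE = re.compile(r'&(amp|lt|gt|quot);')

def replace_basic_entities(value):
    """Single-pass replacement of whitelisted entities only.

    A single regex pass (rather than chained str.replace calls) avoids
    double-decoding, e.g. '&amp;lt;' becomes '&lt;', not '<'.
    """
    return ENTITY_RE.sub(lambda m: BASIC_ENTITIES[m.group(1)], value)
```

A full (X)HTML parser would instead decode every entity it knows, which is exactly the behavior the email argues against for out-of-spec markup.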