On 1/26/07, Martin Atkins <[EMAIL PROTECTED]> wrote:
> And, in theory, the OpenID spec could add additional restrictions to
> "fix" the above problems.
>
> Whether it should or not is of course up for debate; I'd be interested
> to hear from Brad Fitzpatrick and JanRain's developers who are
> responsible for the most-used implementations currently using regex
> parsing. Why didn't you guys use an HTML parser? I assume there must
> have been a reason.

While our parser [1] is implemented using regular expressions, I don't
think it's what most people mean when they say "parsing with regexps."

The reason that (most of) the JanRain libraries don't use an HTML
parser is that I was afraid "loose" parsing of HTML could lead to
OpenID discovery markup being recognized in parts of the document
other than the head, opening vulnerabilities where someone could
hijack a URL by, e.g., adding a malicious comment to a blog.

When implementing this parser, I tried to be aware of both the varied
markup that exists across the Web and the security implications of
different parsing strategies. In short, the parser tries to recognize
any OpenID markup that is unambiguously in the <head> of an HTML
document and reject everything else. If you're interested in the
details, read the comments at the top of the file (again, [1]).
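To make the idea concrete, here's a minimal sketch of that strategy in
Python. This is NOT the actual parser from [1] (which is considerably
more careful); the function name and the cut-off markers are my own
illustration of "only trust markup unambiguously in the head":

```python
import re

def find_openid_links(html):
    """Illustrative sketch only: restrict <link> discovery to an
    unambiguous <head> section.

    The head is taken to end at the first </head>, <body>, or comment
    start, so a <link> injected later in the page (e.g. in a blog
    comment) is never considered.
    """
    # Conservatively cut the document at the first marker that could
    # mean we have left (or can no longer trust) the head.
    lowered = html.lower()
    end = len(html)
    for marker in ('</head', '<body', '<!--'):
        pos = lowered.find(marker)
        if pos != -1:
            end = min(end, pos)
    head = html[:end]

    links = []
    for tag in re.findall(r'<link\b[^>]*>', head, re.IGNORECASE):
        rel = re.search(r'rel\s*=\s*["\']([^"\']+)["\']', tag, re.IGNORECASE)
        href = re.search(r'href\s*=\s*["\']([^"\']+)["\']', tag, re.IGNORECASE)
        if rel and href and 'openid' in rel.group(1).lower():
            links.append((rel.group(1), href.group(1)))
    return links
```

Given markup like
<head><link rel="openid.server" href="..."></head><body>...</body>,
only the head link is returned; a link in the body, or hidden inside a
comment, is simply never scanned.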

The parser does not work for some valid XHTML cases, most notably the
use of an explicit XML namespace prefix on the XHTML tags. It also
does not deal with some cases of valid HTML 4. [2] I am sure that
there are other cases in which valid markup is not recognized. My
design strategy was to implement something that is immune to the
attacks I could conceive of and that works for the vast majority of
the cases that exist in the wild.
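For the curious, here's why the namespace-prefix case defeats a
regex-based approach. The XHTML below is hypothetical but valid; a
pattern anchored on the literal tag name "link" never sees the
prefixed element, while a namespace-aware XML parser does:

```python
import re
import xml.etree.ElementTree as ET

# Valid XHTML using an explicit prefix for the XHTML namespace, so the
# <link> element is spelled <x:link> (hypothetical example document).
xhtml = ('<x:html xmlns:x="http://www.w3.org/1999/xhtml"><x:head>'
         '<x:link rel="openid.server" href="https://example.com/sv" />'
         '</x:head></x:html>')

# A regex keyed on the literal tag name finds nothing:
assert re.findall(r'<link\b[^>]*>', xhtml, re.IGNORECASE) == []

# A namespace-aware parser finds it, because the element's local name
# is still "link" in the XHTML namespace:
tree = ET.fromstring(xhtml)
links = tree.findall('.//{http://www.w3.org/1999/xhtml}link')
assert len(links) == 1
```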

Ironically, another reason we implemented this parser instead of using
an (X)HTML parsing library is that, according to the spec, only
certain entities should be processed in the <link> tag's attributes,
whereas a real (X)HTML parser would process all of them. Whether
processing the extra entities is right or wrong is arguable, since a
user who puts them in is already violating the spec, but I figured
that by following the spec exactly, those users would get a more
consistent experience (i.e. their markup would be broken for all
libraries rather than accepted by some and rejected by others).
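A sketch of what spec-exact entity handling looks like, assuming (as I
read the spec's HTML discovery rules) that only the four named
entities below are defined for <link> attribute values; the function
name is mine, not from the library:

```python
import re

# The only entities assumed to be defined for <link> attribute values.
ALLOWED_ENTITIES = {'amp': '&', 'lt': '<', 'gt': '>', 'quot': '"'}

def decode_link_attribute(value):
    """Decode only the allowed named entities. Any other entity
    reference (or a numeric one like &#65;) is left untouched,
    rather than being expanded as a general-purpose (X)HTML parser
    would expand it."""
    def replace(match):
        return ALLOWED_ENTITIES.get(match.group(1), match.group(0))
    return re.sub(r'&([A-Za-z]+);', replace, value)
```

So decode_link_attribute('a&amp;b') yields 'a&b', while '&copy;' or
'&#65;' pass through unchanged instead of being expanded.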

In an ideal world, we'd be able to specify that the document was e.g.
valid XHTML and just use a nice, strict parser on it. The reality of
the situation forces us to deal with the markup that's out there,
which leads to trade-offs, no matter how we choose to deal with that
markup.

Josh

1. 
http://www.openidenabled.com/resources/darcsweb?r=python-openid;a=headblob;f=/openid/consumer/parse.py
2. http://www.intertwingly.net/blog/2006/12/28/Unobtrusive-OpenID#c1167404328
_______________________________________________
specs mailing list
specs@openid.net
http://openid.net/mailman/listinfo/specs
