Would it be worthwhile to write up the steps the JanRain parser uses and
see which of them could be included in the OpenID spec to help
implementors?

--David 

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
Behalf Of Josh Hoyt
Sent: Friday, January 26, 2007 1:07 PM
To: Martin Atkins
Cc: specs@openid.net
Subject: Re: HTML parsing in HTML-based discovery

On 1/26/07, Martin Atkins <[EMAIL PROTECTED]> wrote:
> And, in theory, the OpenID spec could add additional restrictions to 
> "fix" the above problems.
>
> Whether it should or not is of course up for debate; I'd be interested

> to hear from Brad Fitzpatrick and JanRain's developers who are 
> responsible for the most-used implementations currently using regex 
> parsing. Why didn't you guys use an HTML parser? I assume there must 
> have been a reason.

While our parser [1] is implemented using regular expressions, I don't
think it's what most people mean when they say "parsing with regexps."

The reason that (most of) the JanRain libraries don't use an HTML parser
is that I was afraid "loose" parsing of HTML could lead to OpenID
discovery markup being recognized in parts of the document other than
the head, leading to vulnerabilities where someone could hijack a URL
by, e.g., adding a malicious comment to a blog.

When implementing this parser, I tried to be aware of both the varied
markup that exists across the Web and the security implications of
different parsing strategies. In short, the parser tries to recognize
any OpenID markup that is unambiguously in the <head> of an HTML
document and reject anything else. If you're interested in the details,
read the comments at the top of the file (again, [1]).
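The general idea can be sketched roughly like this (a minimal illustration of the head-only strategy, not the actual parse.py code; the tag and attribute patterns here are simplified assumptions):

```python
import re

# Anything after the first <body>, </head>, or </html> tag can no
# longer be unambiguously part of the document head, so only the text
# before that point is searched.
HEAD_END_RE = re.compile(r'<\s*(body|/head|/html)[\s>]', re.IGNORECASE)
LINK_RE = re.compile(
    r'<link\s[^>]*rel=["\']?openid\.server["\']?[^>]*>', re.IGNORECASE)
HREF_RE = re.compile(r'href=["\']?([^"\'\s>]+)', re.IGNORECASE)

def find_openid_server(html):
    """Return the openid.server URL if a <link> for it appears
    unambiguously in the head, or None otherwise."""
    boundary = HEAD_END_RE.search(html)
    head = html[:boundary.start()] if boundary else html
    link = LINK_RE.search(head)
    if link is None:
        return None
    href = HREF_RE.search(link.group(0))
    return href.group(1) if href else None
```

With this approach, a link planted in a blog comment (i.e. in the body) never reaches the matching step at all, because the search stops at the head boundary.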

The parser does not work for some valid XHTML cases, most notably when
an explicit XML namespace prefix is used for the XHTML tags. It also
does not handle some cases of valid HTML 4. [2] I am sure there are
other cases in which valid markup is not recognized. My design strategy
was to implement something that is immune to the attacks I could
conceive of and that works for the vast majority of the markup that
exists in the wild.
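For concreteness, here are two illustrative examples (assumed, not taken from any test suite) of valid markup along those lines that a parser matching literal <link>/<head> tags would miss:

```python
# Valid XHTML may put the XHTML elements under an explicit namespace
# prefix, so the link element is spelled "h:link", not "link".
xhtml_prefixed = """<h:html xmlns:h="http://www.w3.org/1999/xhtml">
  <h:head><h:link rel="openid.server" href="http://example.com/server"/></h:head>
  <h:body/>
</h:html>"""

# HTML 4 permits omitting the <head> and <body> tags entirely; the
# head ends implicitly, so there is no </head> marker to anchor on.
html4_implicit = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<title>Example</title>
<link rel="openid.server" href="http://example.com/server">
<p>The body starts implicitly here.</p>"""
```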

Ironically, another reason that we implemented this parser instead of
using an (X)HTML parsing library is that, according to the spec, only
certain entities should be processed in the <link> tag's attributes, and
a real (X)HTML parser would process all of the entities. It's arguable
whether processing the entities is right or wrong, because the user who
marked up the page is violating the spec either way, but I figured that
by following the spec exactly, the users who put in this markup would
have a more consistent experience (i.e., their markup would be broken
for all libraries rather than accepted by some and rejected by others).
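Here is a sketch of what spec-exact entity handling looks like, under the assumption that only the four basic entities (&amp;, &lt;, &gt;, &quot;) are permitted in the attribute values (check the spec text for the authoritative list):

```python
import re

# Decode only the assumed-permitted entities, in a single left-to-right
# pass so that something like "&amp;lt;" decodes to "&lt;" rather than
# being double-decoded all the way to "<".
SPEC_ENTITY_RE = re.compile(r'&(amp|lt|gt|quot);')
REPLACEMENTS = {'amp': '&', 'lt': '<', 'gt': '>', 'quot': '"'}

def decode_spec_entities(value):
    """Expand the permitted entities; leave any other entity reference
    (e.g. &eacute; or &#38;) untouched."""
    return SPEC_ENTITY_RE.sub(lambda m: REPLACEMENTS[m.group(1)], value)
```

A general-purpose (X)HTML parser would also expand named and numeric references like &eacute; and &#38;, so the two approaches disagree exactly on the out-of-spec markup described above.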

In an ideal world, we'd be able to specify that the document was e.g.
valid XHTML and just use a nice, strict parser on it. The reality of the
situation forces us to deal with the markup that's out there, which
leads to trade-offs, no matter how we choose to deal with that markup.

Josh

1. http://www.openidenabled.com/resources/darcsweb?r=python-openid;a=headblob;f=/openid/consumer/parse.py
2. http://www.intertwingly.net/blog/2006/12/28/Unobtrusive-OpenID#c1167404328
_______________________________________________
specs mailing list
specs@openid.net
http://openid.net/mailman/listinfo/specs
