Re: BeautifulSoup

Mike Meyer Fri, 19 Aug 2005 20:25:34 -0700

"Paul McGuire" <[EMAIL PROTECTED]> writes:

> Here's a pyparsing program that reads my personal web page, and spits
> out HTML with all of the HREF's reversed.


Parsing HTML isn't easy, which makes me wonder how good this solution
really is. Not meant as a comment on the quality of this code or
PyParsing, but as curiosity from someone who does a lot of [X]HTML
herding.

> -- Paul
> (Download pyparsing at http://pyparsing.sourceforge.net.)

If it were in the ports tree, I'd have grabbed it and tried it
myself. But it isn't, so I'm going to be lazy and ask. If PyParsing
really makes dealing with HTML this easy, I may package it as a port
myself.

> from pyparsing import Literal, quotedString
> import urllib
>
> LT = Literal("<")
> GT = Literal(">")
> EQUALS = Literal("=")
> htmlAnchor = (LT + "A" + "HREF" + EQUALS +
>               quotedString.setResultsName("href") + GT)
>
> def convertHREF(s,l,toks):
>     # do HREF conversion here - for demonstration, we will just
>     # reverse them
>     print toks.href
>     return "<A HREF=%s>" % toks.href[::-1]
>
> htmlAnchor.setParseAction( convertHREF )
>
> inputURL = "http://www.geocities.com/ptmcg"
> inputPage = urllib.urlopen(inputURL)
> inputHTML = inputPage.read()
> inputPage.close()
>
> print htmlAnchor.transformString( inputHTML )

How well does it deal with other attributes in front of the href, like
<A onClick="..." href="...">?
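
To make that question concrete, here is a rough stand-in for the quoted
grammar, written as a regular expression in current Python 3. It is only
an approximation and not pyparsing itself: the pattern and the names
ANCHOR and matches are made up for illustration, and the real grammar may
differ in details. It just shows why a fixed token sequence would be
expected to miss an anchor with an extra attribute in front of the href.

```python
import re

# Hypothetical regex approximation of the quoted grammar (an assumption,
# not pyparsing): "<", "A", "HREF", "=", a quoted string, ">" in that
# fixed order, with optional whitespace between the tokens.
ANCHOR = re.compile(r'<\s*A\s*HREF\s*=\s*("[^"]*"|\'[^\']*\')\s*>')

def matches(html):
    """Return True if the fixed-sequence pattern finds an anchor in html."""
    return ANCHOR.search(html) is not None

plain = '<A HREF="http://example.com/">'
extra = '<A onClick="go()" href="http://example.com/">'

print(matches(plain))  # the anchor with only an HREF attribute
print(matches(extra))  # the anchor with onClick in front of href
```

Under this approximation the first anchor is found and the second is
not, since nothing in the pattern allows for tokens between "A" and
"HREF".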

How about if my HTML has things that look like HTML in attributes,
like <TAG ATTRIBUTE="stuff<A HREF=stuff">?
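
For comparison, here is a minimal sketch of how both cases could be
checked against a tokenizing parser, using html.parser from the Python 3
standard library. This is a swapped-in technique, not the pyparsing
approach quoted above; HrefCollector and collect_hrefs are hypothetical
names chosen for the example.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect href values from <a> start tags only (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)

def collect_hrefs(html):
    """Feed html to a fresh HrefCollector and return the hrefs it saw."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs

# One tag whose attribute value merely looks like an anchor, followed by
# a real anchor that has another attribute in front of its href.
tricky = ('<TAG ATTRIBUTE="stuff<A HREF=stuff">'
          '<A onClick="go()" href="real.html">link</A>')

print(collect_hrefs(tricky))
```

Because the parser consumes the whole quoted attribute value before it
looks for the end of the tag, the anchor-like text inside ATTRIBUTE is
not reported as a tag, and the position of href among the attributes
does not matter.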

     Thanks,
     <mike
-- 
Mike Meyer <[EMAIL PROTECTED]>                  http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.