You might find these threads on comp.lang.python interesting: http://tinyurl.com/5zmpn http://tinyurl.com/6mxmb
Peter Kim wrote:
Which method is best and most pythonic to scrape text data with minimal formatting?
I'm trying to read a large html file and strip out most of the markup, but keep the simple formatting like <p>, <b>, and <i>. For example:
<p class="BodyText" style="MARGIN: 0in 0in 12pt"><font face="Times New Roman"><b style="font-weight: normal"><span lang="EN-GB" style="FONT-SIZE: 12pt">Trigger:</span></b><span lang="EN-GB" style="FONT-SIZE: 12pt"><span style="spacerun: yes"> </span> Debate on budget in Feb-Mar. New moves to cut medical costs by better technology.</span></font></p>
I want to change the above to:
<p><b>Trigger:</b> Debate on budget in Feb-Mar. New moves to cut medical costs by better technology.</p>
Since I wanted some practice in regex, I started with something like this:
pattern = r"(?:<)(.+?)(?: ?.*?>)(.*?)(</\1>)"
result = re.compile(pattern, re.IGNORECASE | re.VERBOSE | re.DOTALL).findall(html)
But it's getting messy real fast and somehow the non-greedy parts don't seem to work as intended. Also I realized that the html file is going to be 10,000+ lines, so I wonder if regex can be used for large strings.
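Size isn't the problem (re handles multi-megabyte strings fine); the non-greedy surprises usually come from nesting. A small demo of the trap, using a simplified pattern of the same shape as yours (the tag names and strings here are just illustrative):

```python
import re

# On flat markup, a non-greedy group behaves as expected:
flat = "<b>one</b> and <b>two</b>"
print(re.findall(r"<b>(.*?)</b>", flat))  # ['one', 'two']

# But with nested tags, the pattern anchors on the *outer* tag first,
# and the non-greedy group happily swallows the inner markup whole:
nested = "<font><b>bold</b></font>"
m = re.search(r"<(\w+)[^>]*>(.*?)</\1>", nested)
print(m.group(2))  # <b>bold</b>
```

Non-greedy means "expand as little as possible and still match", not "stop at the nearest tag" -- so once the match starts at <font>, group 2 must grow until </font>. This is why single regexes over nested HTML get messy fast.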
So I'm thinking of using sgmllib.py (as in the Dive into Python example). Is this where I should be using libxml2.py? As you can tell this is my first foray into both parsing and regex so advice in terms of best practice would be very helpful.
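A parser is the right instinct for this job. As a rough sketch of the event-driven approach sgmllib uses -- written here with the closely related stdlib HTMLParser; the class name and tag whitelist are just illustrative choices, not anything from the original post:

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Keep only <p>, <b>, <i> (attributes dropped); pass text through."""
    KEEP = {"p", "b", "i"}

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Re-emit kept tags bare, discarding class/style/etc. attributes.
        if tag in self.KEEP:
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in self.KEEP:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text content always passes through untouched.
        self.out.append(data)

s = TagStripper()
s.feed('<p class="BodyText"><font face="Times"><b style="x">Trigger:</b>'
       ' Debate on budget.</font></p>')
print("".join(s.out))  # <p><b>Trigger:</b> Debate on budget.</p>
```

Because the parser fires a callback per tag, nesting is handled for free -- the whole problem the regex was fighting -- and feed() can be called incrementally, so a 10,000+ line file is no trouble.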
Thanks, Peter Kim _______________________________________________ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor