You might find these threads on comp.lang.python interesting: http://tinyurl.com/5zmpn http://tinyurl.com/6mxmb
Peter Kim wrote:
Which method is best and most pythonic to scrape text data with minimal formatting?
I'm trying to read a large html file and strip out most of the markup, but keep the simple formatting like <p>, <b>, and <i>. For example:
<p class="BodyText" style="MARGIN: 0in 0in 12pt"><font face="Times New Roman"><b style="font-weight: normal"><span lang="EN-GB" style="FONT-SIZE: 12pt">Trigger:</span></b><span lang="EN-GB" style="FONT-SIZE: 12pt"><span style="spacerun: yes"> </span> Debate on budget in Feb-Mar. New moves to cut medical costs by better technology.</span></font></p>
I want to change the above to:
<p><b>Trigger:</b> Debate on budget in Feb-Mar. New moves to cut medical costs by better technology.</p>
Since I wanted some practice in regex, I started with something like this:
pattern = r"(?:<)(.+?)(?: ?.*?>)(.*?)(</\1>)"
result = re.compile(pattern, re.IGNORECASE | re.VERBOSE | re.DOTALL).findall(html)
But it's getting messy real fast and somehow the non-greedy parts don't seem to work as intended. Also I realized that the html file is going to be 10,000+ lines, so I wonder if regex can be used for large strings.
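Size isn't the problem (re handles multi-megabyte strings fine); the non-greedy surprises usually come from nesting. A small demo of the trap, using a simplified pattern of the same shape as yours (the tag names and strings here are just illustrative):

```python
import re

# On flat markup, a non-greedy group behaves as expected:
flat = "<b>one</b> and <b>two</b>"
print(re.findall(r"<b>(.*?)</b>", flat))  # ['one', 'two']

# But with nested tags, the pattern anchors on the *outer* tag first,
# and the non-greedy group happily swallows the inner markup whole:
nested = "<font><b>bold</b></font>"
m = re.search(r"<(\w+)[^>]*>(.*?)</\1>", nested)
print(m.group(2))  # <b>bold</b>
```

Non-greedy means "expand as little as possible and still match", not "stop at the nearest tag" -- so once the match starts at <font>, group 2 must grow until </font>. This is why single regexes over nested HTML get messy fast.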
So I'm thinking of using sgmllib.py (as in the Dive into Python example). Is this where I should be using libxml2.py? As you can tell this is my first foray into both parsing and regex so advice in terms of best practice would be very helpful.
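A parser is the right instinct for this job. As a rough sketch of the event-driven approach sgmllib uses -- written here with the closely related stdlib HTMLParser; the class name and tag whitelist are just illustrative choices, not anything from the original post:

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Keep only <p>, <b>, <i> (attributes dropped); pass text through."""
    KEEP = {"p", "b", "i"}

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Re-emit kept tags bare, discarding class/style/etc. attributes.
        if tag in self.KEEP:
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in self.KEEP:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text content always passes through untouched.
        self.out.append(data)

s = TagStripper()
s.feed('<p class="BodyText"><font face="Times"><b style="x">Trigger:</b>'
       ' Debate on budget.</font></p>')
print("".join(s.out))  # <p><b>Trigger:</b> Debate on budget.</p>
```

Because the parser fires a callback per tag, nesting is handled for free -- the whole problem the regex was fighting -- and feed() can be called incrementally, so a 10,000+ line file is no trouble.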
Thanks, Peter Kim _______________________________________________ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor