Re: HTML Code - Line Number

Tim Roberts Fri, 27 Apr 2012 23:03:17 -0700

[email protected] wrote:
>
>For scraping purposes, I am having a bit of trouble writing a block
>of code to define, and find, the relative position (line number) of a
>string of HTML code. I can pull out one string that I want, and then
>there is always a line of code, directly beneath the one I can pull
>out, that begins with the following:
><td align="left" valign="top" class="body_cols_middle">
>
>However, because this string of HTML code above is not unique to just
>the information I need (which I cannot currently pull out), I was
>hoping there is a way to effectively say "if you find the html string
>_____ in the line of HTML code above, and the string <td align="left"
>valign="top" class="body_cols_middle"> in the line immediately
>following, then pull everything that follows this second string."


Regular expression-based screen scraping is extremely delicate.  All it
takes is one tweak to the HTML, and your scraping fails although the page
continues to look the same.
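
To illustrate the kind of breakage meant here, the sketch below matches the
exact tag text from the question with a regular expression. The cell content
"42.00" and the "tweaked" variant are made up for the example; the only
change in the tweaked line is the order of two attributes, which a browser
renders identically.

```python
import re

# Pattern built from the literal tag text quoted in the question.
pattern = re.compile(
    r'<td align="left" valign="top" class="body_cols_middle">(.*?)</td>'
)

# Hypothetical page content; "42.00" is an invented value.
original = '<td align="left" valign="top" class="body_cols_middle">42.00</td>'
# Same cell with align/valign swapped: renders the same in a browser.
tweaked = '<td valign="top" align="left" class="body_cols_middle">42.00</td>'

original_match = pattern.search(original)
tweaked_match = pattern.search(tweaked)

print(original_match.group(1) if original_match else None)
print(tweaked_match)
```

The first search finds the cell; the second finds nothing, even though the
two pages would look the same to a reader.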

A much better plan is to use sgmllib to write yourself a mini HTML parser.
You can handle "td" tags with the attributes you want, and count down until
you get to the "td" tag you want.
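
A minimal sketch of that idea follows. Two assumptions to note: sgmllib was
removed in Python 3, so this uses the standard library's html.parser instead,
and rather than counting cells it sets a flag once a marker string has been
seen and then takes the next "td" with the wanted class. The class name
CellAfterMarker, the marker "Price", and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class CellAfterMarker(HTMLParser):
    """Collect the text of the first matching <td> that follows a marker."""

    def __init__(self, marker, css_class="body_cols_middle"):
        super().__init__()
        self.marker = marker          # text that identifies the preceding cell
        self.css_class = css_class    # class attribute of the wanted <td>
        self.seen_marker = False
        self.in_target = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Only consider <td> tags once the marker has been seen.
        if tag == "td" and self.seen_marker and not self.done:
            if dict(attrs).get("class") == self.css_class:
                self.in_target = True

    def handle_endtag(self, tag):
        if tag == "td" and self.in_target:
            self.in_target = False
            self.done = True          # stop after the first matching cell

    def handle_data(self, data):
        if self.in_target:
            self.chunks.append(data)
        elif not self.done and self.marker in data:
            self.seen_marker = True

    @property
    def text(self):
        return "".join(self.chunks).strip()


# Invented sample page: the same class appears on cells we do not want.
sample = """
<table>
<tr><td align="left" valign="top" class="body_cols_middle">ignore me</td></tr>
<tr><td class="body_cols_left">Price</td>
<td align="left" valign="top" class="body_cols_middle">42.00</td></tr>
<tr><td align="left" valign="top" class="body_cols_middle">also ignored</td></tr>
</table>
"""

parser = CellAfterMarker("Price")
parser.feed(sample)
parser.close()
print(parser.text)
```

Because the parser compares parsed attributes rather than raw text, attribute
order and spacing inside the tag no longer matter. The class comparison here
is an exact match, so a cell carrying several classes would need a looser
test.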
-- 
Tim Roberts, [email protected]
Providenza & Boekelheide, Inc.
