The other thought I had was that I may not be properly trapping the end of the first <tr> row, and the beginning of the next <tr> row.
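To check that hunch, here is a minimal sketch using only the standard `re` module. The names `html`, `greedy`, and `lazy` are made up for illustration, and both patterns are simplified stand-ins rather than the exact expression quoted below. The idea it tests: a class like `[\s\S]+` can match any character, so it runs to the end of the input and only backtracks as far as the last closing tag, which would swallow the row boundary.

```python
import re

# Two rows in the same shape as the sample data, joined without line breaks.
html = (
    "<tr><td valign=top>8583</td><td valign=top>"
    "<a href=lic_details.asp?lic_number=8583>New Horizon Technical Academy, Inc #4</a></td>"
    "<td valign=top>Jefferson</td><td valign=top>70114</td></tr>"
    "<tr><td valign=top>9371</td><td valign=top>"
    "<a href=lic_details.asp?lic_number=9371>Career Learning Center</a></td>"
    "<td valign=top>Jefferson</td><td valign=top>70113</td></tr>"
)

# Greedy name group: [\s\S]+ matches anything, including the next <tr>.
greedy = re.compile(
    r'<tr><td valign=top>(?P<zlicense>\d+)</td>.*?<a [^>]*>(?P<zname>[\s\S]+)</.*?>'
)

# Bounded name group: [^<]+ stops at the first tag, and .*?</tr> ends the row
# at the nearest closing </tr> instead of the last one.
lazy = re.compile(
    r'<tr><td valign=top>(?P<zlicense>\d+)</td>.*?<a [^>]*>(?P<zname>[^<]+)</a>.*?</tr>'
)

greedy_names = [m.group('zname') for m in greedy.finditer(html)]
lazy_names = [m.group('zname') for m in lazy.finditer(html)]

print(len(greedy_names))  # one match that spans both rows
print(lazy_names)         # one name per row
```

If that reading is right, the fix is less about trapping `</tr><tr>` explicitly and more about keeping the name group from crossing a `<` in the first place.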
On Oct 2, 8:38 am, John <jmg3...@gmail.com> wrote:
> On Oct 2, 1:10 am, "504cr...@gmail.com" <504cr...@gmail.com> wrote:
>
> > I'm kind of new to regular expressions, and I've spent hours trying to
> > finesse a regular expression to build a substitution.
> >
> > What I'd like to do is extract data elements from HTML and structure
> > them so that they can more readily be imported into a database.
> >
> > No -- sorry -- I don't want to use BeautifulSoup (though I have for
> > other projects). Humor me, please -- I'd really like to see if this
> > can be done with just regular expressions.
> >
> > Note that the output is referenced using named groups.
> >
> > My challenge is successfully matching the HTML tags in between the
> > first table row, and the second table row.
> >
> > I'd appreciate any suggestions to improve the approach.
> >
> > rText = "<tr><td valign=top>8583</td><td valign=top><a
> > href=lic_details.asp?lic_number=8583>New Horizon Technical Academy,
> > Inc #4</a></td><td valign=top>Jefferson</td><td valign=top>70114</td></
> > tr><tr><td valign=top>9371</td><td valign=top><a href=lic_details.asp?
> > lic_number=9371>Career Learning Center</a></td><td
> > valign=top>Jefferson</td><td valign=top>70113</td></tr>"
> >
> > rText = re.compile(r'(<tr><td valign=top>)(?P<zlicense>\d+)(</td>)(<td
> > valign=top>)(<a href=lic_details.asp)(\?lic_number=\d+)(>)(?P<zname>[A-
> > Za-z0-9#\s\S\W]+)(</.*?>).+$').sub(r'LICENSE:\g<zlicense>|NAME:
> > \g<zname>\n', rText)
> >
> > print rText
> >
> > LICENSE:8583|NAME:New Horizon Technical Academy, Inc #4</a></td><td
> > valign=top>Jefferson</td><td valign=top>70114</td></tr><tr><td
> > valign=top>9371</td><td valign=top><a href=lic_details.asp?
> > lic_number=9371>Career Learning Center|PARISH:Jefferson|ZIP:70113
>
> Some suggestions to start off with:
>
> * triple-quote your multiline strings
> * consider using the re.X, re.M, and re.S options for re.compile()
> * save your re object after you compile it
> * note that re.sub() returns a new string
>
> Also, it sounds like you want to replace the first 2 <td> elements for
> each <tr> element with their content separated by a pipe (throwing
> away the <td> tags themselves), correct?
>
> ---John
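
For reference, John's four suggestions might be combined roughly as follows. This is only a sketch under assumptions: `row_re` and `result` are hypothetical names, the pattern is one possible rewrite rather than anything John posted, and it assumes every row has exactly the cell layout of the two sample rows.

```python
import re

# Triple-quoted so each row can sit on its own line without wrap damage.
html = """<tr><td valign=top>8583</td><td valign=top><a href=lic_details.asp?lic_number=8583>New Horizon Technical Academy, Inc #4</a></td><td valign=top>Jefferson</td><td valign=top>70114</td></tr>
<tr><td valign=top>9371</td><td valign=top><a href=lic_details.asp?lic_number=9371>Career Learning Center</a></td><td valign=top>Jefferson</td><td valign=top>70113</td></tr>"""

# Compile once and keep the object. re.X lets the pattern carry whitespace and
# comments (literal spaces must be written as [ ]); re.S lets . match newlines.
row_re = re.compile(r"""
    <tr>\s*
    <td[ ]valign=top>(?P<zlicense>\d+)</td>\s*
    <td[ ]valign=top><a[ ]href=lic_details\.asp\?lic_number=\d+>
        (?P<zname>[^<]+)      # name: everything up to the next tag
    </a></td>
    .*?                       # remaining cells, matched lazily
    </tr>\s*                  # end of this row plus any trailing whitespace
    """, re.X | re.S)

# sub() returns a new string; html itself is left unchanged.
result = row_re.sub(r'LICENSE:\g<zlicense>|NAME:\g<zname>\n', html)
print(result)
```

Binding the output to a separate name instead of reassigning `rText` keeps the original input around for comparison while the pattern is still being tuned.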