RE: trouble scraping tables

Steve Sat, 17 Mar 2007 07:21:00 -0800

> Read it again Steve. The only error in it (just one) was that I left out
> the question-mark in the negative-lookahead... I just don't need them
> often. Other than that, if you take the time to read what I wrote, I
> said in my explanation that the text being searched would have to appear
> in the first table row (as was my interpretation of the question being
> identifying the table by a column header), but thank you for being a
> gratuitous enough to take the time out while offering advice to someone
> else to belittle me, it's greatly appreciated.


It was not my intention to belittle anyone, and I'm sorry if I came across
that way. However, I must counter that your claim is incorrect. Fixing the
negative lookahead syntax will not make it work, as the regex will still
match from the start of the first table, not the table which contains the
text being searched for. Also, it will match until the last instance of the
text being searched for (as a result of incorrectly used greedy repetition).

Here's the regex you proposed, with the missing question mark added:

"<table[^>]*>\s*<tr[^>]*>(?!</tr>).*<td>Code 2</td>"

Essentially, this regex contains only three significant differences from JJ
Cool's original regex. Yours asserts that:

- The match starts from the first table whose first <tr> isn't empty (i.e.,
whose opening <tr> tag isn't immediately followed by "</tr>").
- Nothing but whitespace may appear between the opening <table> tag and the
first <tr> tag. (This is a problem since elements such as <caption>,
<thead>, and <tbody> commonly precede table rows.)
- The text being searched for ("Code 2") must be the entire contents of a
<td> element.

However, none of the above features were requested by JJ, and the first is
quite different from the claimed effect of the negative lookahead.
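
For anyone who wants to check the two problems described above, here is a
rough sketch in Python rather than ColdFusion. The sample markup is made up
for illustration, and re.DOTALL is my assumption so that the dot also
matches newlines; results in another regex engine may differ.

```python
import re

# The proposed pattern with the "?" restored. re.DOTALL is an assumption,
# used so that "." also matches newlines in multi-line markup.
PROPOSED = re.compile(
    r"<table[^>]*>\s*<tr[^>]*>(?!</tr>).*<td>Code 2</td>", re.DOTALL
)

# Hypothetical page: the text appears in the second and third tables only.
html = (
    "<table><tr><td>Code 1</td></tr></table>"
    "<table><tr><td>Code 2</td></tr></table>"
    "<table><tr><td>Code 2</td></tr></table>"
)

match = PROPOSED.search(html)

# The match begins at the very first table, which does not contain the text.
starts_at_first_table = match.start() == 0

# The greedy ".*" runs on to the last "<td>Code 2</td>" in the input.
last_cell = html.rindex("<td>Code 2</td>")
ends_at_last_instance = match.end() == last_cell + len("<td>Code 2</td>")

print(starts_at_first_table, ends_at_last_instance)
```

Both flags come out true for this sample, which is the behavior described
earlier: the match starts too early and ends too late.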

As noted previously, "<table[^>]*>(?:.(?!</table>))*?Code 2" should solve
the problem as described so far (or at least as I've understood it). It
enforces that the match start from the beginning of the table which contains
"Code 2" (note how the dot operator paired with the negative lookahead is
repeated lazily).
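
As a quick check, the same pattern can be tried in Python (again a sketch,
not ColdFusion; the markup and the id attributes are invented for the
example, and re.DOTALL is assumed so the dot matches newlines):

```python
import re

# The pattern from the paragraph above. re.DOTALL is an assumption so that
# "." also matches the newlines between tags.
TABLE_WITH_TEXT = re.compile(
    r"<table[^>]*>(?:.(?!</table>))*?Code 2", re.DOTALL
)

# Hypothetical page with two sibling tables; only the second has "Code 2".
html = (
    '<table id="first"><tr><td>Code 1</td></tr></table>\n'
    '<table id="second"><tr><td>Code 2</td></tr></table>\n'
)

match = TABLE_WITH_TEXT.search(html)
matched_text = match.group(0) if match else None
print(matched_text)
```

Here the match begins at the second table's opening tag rather than the
first. One caveat: because the lookahead is tested after each character
instead of before it, a table whose opening tag is immediately followed by
"</table>" is not guarded against, so an empty table directly ahead of the
target would still be swept into the match.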

However, note that the above does not support nested tables. Nesting is
trickier since most regex flavors, including ColdFusion's, do not support
recursion (exceptions include PCRE, which offers recursive patterns, and the
.NET framework's regex engine, which can emulate recursion using balancing
groups). Despite this, we can still support nested tables in ColdFusion
regexes as long as we know in advance the maximum depth of table nesting we
might come across. For example, we can modify the regex like so:

"<(table)[^>]*>(?:(?:<\1[^>]*>.*?</\1>|.)(?!</\1>))*?Code 2"

This version supports up to one level of nesting (the same logic can easily
be extended to support more levels, if necessary), and will match from the
start of the outermost table which contains "Code 2", even if the text
appears in a nested table.
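
To see the difference between the two patterns, here is a small Python
sketch (not ColdFusion; the markup is a made-up example, and re.DOTALL is
assumed so the dot matches newlines). The nested table closes before
"Code 2" appears in the outer table:

```python
import re

# re.DOTALL is an assumption so that "." also matches newlines.
FLAT = re.compile(r"<table[^>]*>(?:.(?!</table>))*?Code 2", re.DOTALL)
NESTED = re.compile(
    r"<(table)[^>]*>(?:(?:<\1[^>]*>.*?</\1>|.)(?!</\1>))*?Code 2", re.DOTALL
)

# Hypothetical markup: an inner table ends, then the outer table
# carries the text being searched for.
html = (
    "<table><tr><td>"
    "<table><tr><td>x</td></tr></table>"
    "</td><td>Code 2</td></tr></table>"
)

flat_match = FLAT.search(html)
nested_match = NESTED.search(html)

print(flat_match)            # the non-nesting pattern finds nothing here
print(nested_match.start())  # offset where the nesting-aware match begins
```

The first pattern cannot get past the inner "</table>", so it finds no
match, while the second steps over the inner table as a unit and matches
from the outer table's opening tag. Since the second pattern leans heavily
on backtracking, it is worth trying it against representative pages before
relying on it.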






Archive: http://www.houseoffusion.com/groups/RegEx/message.cfm/messageid:1029