Re: Matching anchor tags

Peter Boughton Fri, 16 Jan 2009 15:02:59 -0800

>Do I understand all that correctly and does anybody see if there is any 
>way this could be re-factored and simplified?


Mostly fine - but depending on how well you know/trust the input, there are 
some things that could be improved.


As a quick[ish] general comment about Regex and HTML/XML - parsing markup is 
not something regular expressions were designed for. They can work for 
relatively simple things like this, but once you get into invalid/non-standard 
HTML, or even just nested tags, regexes rapidly increase in complexity.

For fiddling with HTML/XML, it's a good idea to consider:
 - Functions like xmlParse or htmlParse (Railo) to create CF objects.
 - XPath 
 - jQuery selectors (Sizzle)
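
To illustrate the parser route, here is a minimal sketch in Python (not CF), 
using the standard library's html.parser in place of xmlParse/htmlParse. The 
class name AnchorCollector, the helper anchor_attributes and the sample markup 
are made up for this example.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the attributes of every <a> start tag as a dict."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names before calling us,
        # so <A HREF=...> and <a href=...> are handled the same way.
        if tag == "a":
            self.anchors.append(dict(attrs))


def anchor_attributes(html):
    """Return one attribute dict per anchor start tag found in html."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.anchors


# Sample input (hypothetical): upper-case tag, a line break between
# attributes, spaces around '=', and a single-quoted value.
sample = (
    '<p><A HREF="http://www.target.com"\n   target = "_blank">link</A> '
    "<a href='plain.html'>x</a></p>"
)
anchors = anchor_attributes(sample)
print(anchors)
```

The parser deals with case, whitespace and quoting, so none of that has to be 
encoded in a pattern.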



Anyway, now to give my opinion on what you've done. :)



>< - match the opening angle bracket '<' character.
Yep.


>(/?) - match an optional forward slash character '/' and put result in 
>back reference 1.
Yep, although unless you're using this backreference for something 
specifically, I would probably do something more like this:
</a>|<a ...>

Which I think is clearer and more precise (but doesn't have the backref which 
you may be using)
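
A rough sketch of the alternation form, in Python. The "..." above is left 
open, so \s[^>]* is only a stand-in I have assumed for it, and the sample text 
is made up.

```python
import re

# Assumed stand-in for the elided "...": whitespace, then anything up to '>'.
# Note this stand-in does not match a bare <a> with no attributes.
ANCHOR_TAG = re.compile(r'</a>|<a\s[^>]*>')

sample = '<p>See <a href="one.html">one</a> and <abbr>two</abbr>.</p>'
tags = ANCHOR_TAG.findall(sample)
print(tags)  # ['<a href="one.html">', '</a>']
```

Each alternative spells out one kind of tag, so <abbr> and </abbr> are not 
picked up.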


>a - match the 'a' character.
Yep.

You may want to make your expression case-insensitive if you want to match 
uppercase 'A' tags also.
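
As a small illustration in Python (re.IGNORECASE is Python's flag; in CF the 
NoCase variants such as REFindNoCase do the same job). The pattern here is 
only the first part of the expression, with \b assumed to stop tags like 
<abbr> matching.

```python
import re

sample = '<A HREF="x">x</A> <a href="y">y</a>'

# Case-sensitive: only the lower-case tags are found.
strict = re.findall(r'<(/?)a\b', sample)

# Case-insensitive: upper-case 'A' tags are found as well.
relaxed = re.findall(r'<(/?)a\b', sample, re.IGNORECASE)

print(len(strict), len(relaxed))  # 2 4
```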


>([^>]*?(?=target|>)) - match the minimum zero or more characters until 
>either 'target' or '>' and put result in back reference 2.

Correct description, but not necessarily what you want - consider:
<a href="www.target.com" class="something" target="something"...>

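To stop the lookahead firing on the first 'target' (the one inside the href 
value, where the optional target group then matches nothing and the real 
attribute ends up in the last back reference), you could do: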
([^>]*?(?=\starget="|>))

Here \s matches any whitespace (space/newline/tab). Requiring it before 
'target', together with the =" after it, makes it much more likely that you 
are matching the attribute itself.
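
To see the difference on the example tag, here is a sketch in Python. The two 
patterns are assembled from the pieces quoted in this thread. The closing > 
and the names LOOSE and TIGHT are my assumptions, not something from the 
original expression.

```python
import re

tag = '<a href="www.target.com" class="something" target="something">'

# Original lookahead: stops at the first "target", inside the href value.
LOOSE = re.compile(r'<(/?)a([^>]*?(?=target|>))( *target="[^"]*" *)?([^>]*)>')

# Tighter lookahead: needs whitespace before, and =" after, the word.
TIGHT = re.compile(r'<(/?)a([^>]*?(?=\starget="|>))( *target="[^"]*" *)?([^>]*)>')

loose = LOOSE.search(tag)
tight = TIGHT.search(tag)

print(repr(loose.group(2)))  # ' href="www.'
print(repr(loose.group(3)))  # None: the optional target group was skipped
print(repr(tight.group(2)))  # ' href="www.target.com" class="something"'
print(repr(tight.group(3)))  # ' target="something"'
```

With the looser lookahead the real target attribute lands in back reference 4 
instead of 3.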

Also, I think you know, but for reference of anyone else reading, the lookahead 
(?=...) part is a non-capturing zero-width match - it checks that the text 
follows without consuming it, so 'target' is not put into the 2nd back 
reference. (Does that explanation make sense?)
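
If it helps, a tiny Python sketch of the zero-width behaviour. The sample 
string is made up.

```python
import re

text = ' href="x" target="_blank">'
m = re.search(r'([^>]*?(?=\starget="|>))', text)

captured = m.group(1)      # what went into the back reference
remainder = text[m.end():]  # what is still unconsumed after the match

print(repr(captured))   # ' href="x"'
print(repr(remainder))  # ' target="_blank">'
```

The lookahead has to see the target text for the match to stop there, but that 
text is left for the next part of the expression to consume.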


>( *target="[^"]*" *)? - match an optional 'target="..." with zero or 
>more non-double quote characters between the double quotes and put in 
>back reference 3.

Yes, with zero or more spaces allowed on either side.

With HTML, you can technically have line breaks and tabs between attributes - 
so whilst this might work for specific input, in general it's a good idea to 
use \s as above.
Again, with general HTML stuff you can have spaces around the equals sign also.

If you know your input is consistently formatted the above is unnecessary, but 
as an example:
(\s*target\s*=\s*"[^"]*"\s*)?

And that's still not considering single quotes, unquoted values, etc.
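
A quick Python sketch comparing the two forms. I have dropped the outer 
optional group here so that search() has to find the attribute. The names 
NARROW and FLEXIBLE and the three sample tags are made up.

```python
import re

# Spaces only, no whitespace allowed around '='.
NARROW = re.compile(r' *target="[^"]*" *')

# Any whitespace between attributes and around '='.
FLEXIBLE = re.compile(r'\s*target\s*=\s*"[^"]*"\s*')

plain = '<a href="x" target="_blank">'
spread = '<a href="x"\n\ttarget = "_blank">'
single = "<a href=\"x\" target='_blank'>"

for name, tag in (("plain", plain), ("spread", spread), ("single", single)):
    print(name, bool(NARROW.search(tag)), bool(FLEXIBLE.search(tag)))
```

Only the \s version finds the attribute in the tag that is spread over two 
lines, and neither finds the single-quoted one.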


>([^>]*) - match zero or more non-angle bracket characters that maybe 
>left after all the above an put into back reference 4.

Yep.



Hopefully that is helpful?
Let me know if there's anything I've not explained properly.




