Re: Regex help with invalid HTML

Peter Boughton Tue, 17 Nov 2009 06:18:05 -0800

> I have no control over this code 

Parsing HTML with RegEx is only remotely viable when you know exactly what the 
markup will look like - if you don't control the HTML, then using RegEx is a 
futile effort.


RegEx is for dealing with Regular languages, and HTML is not a Regular 
language - even modern regex engines that implement non-Regular features 
*cannot* reliably deal with the potential complexity of HTML.
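
To make the nesting problem concrete, here is a minimal Java sketch. The class 
name, method name and sample markup are made up for illustration, and it only 
shows one failure mode (nested tags), not everything that can go wrong with 
uncontrolled HTML.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedTagDemo {
    // Naive pattern: a <div>, then as little text as possible, then the
    // first </div> that follows. It has no notion of nesting depth.
    private static final Pattern DIV =
        Pattern.compile("<div>(.*?)</div>", Pattern.DOTALL);

    /** Returns what the naive regex believes is the content of the first div. */
    public static String firstDivByRegex(String html) {
        Matcher m = DIV.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<div>outer <div>inner</div> tail</div>";
        // The lazy match stops at the first </div>, which actually closes
        // the INNER div, so the captured content is cut short and still
        // contains an unclosed <div>.
        System.out.println(firstDivByRegex(html)); // outer <div>inner
    }
}
```

Making the quantifier greedy doesn't fix it either: it would then run on to the 
last `</div>` in the whole document and swallow any sibling divs on the way.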

The correct solution is to **use a tool designed for parsing HTML**.

There isn't one native to CF, but there are a number of Java ones available - 
take a look at:
http://java-source.net/open-source/html-parsers

I haven't used any of those, but I'd probably start with TagSoup or NekoHTML, 
since they look promising. Any HTML parser that produces a DOM structure which 
you can run XPath expressions against will allow you to extract the specific 
information you want.
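
As a rough sketch of the XPath half of that, the Java below uses only the JDK's 
own `javax.xml.xpath` classes. The class name, the `linksIn` helper, the sample 
markup and the `nav` class are all assumptions for illustration. The 
`parseWellFormed` method is just a stand-in so the example runs on its own: it 
only accepts well-formed markup, so with invalid HTML you would get the 
`Document` from whichever HTML parser you picked instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {
    /** Collects the href of every <a> inside an element with the given class. */
    public static List<String> linksIn(Document dom, String cssClass) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // cssClass is spliced in directly: fine for a sketch, not for untrusted input.
        String expr = "//*[@class='" + cssClass + "']//a/@href";
        NodeList nodes = (NodeList) xpath.evaluate(expr, dom, XPathConstants.NODESET);
        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            hrefs.add(nodes.item(i).getNodeValue());
        }
        return hrefs;
    }

    /** Stand-in for the HTML parser: the JDK XML parser, well-formed input only. */
    public static Document parseWellFormed(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) throws Exception {
        String markup = "<html><body><div class='nav'>"
                + "<a href='/one'>One</a><p><a href='/two'>Two</a></p>"
                + "</div><a href='/other'>Other</a></body></html>";
        System.out.println(linksIn(parseWellFormed(markup), "nav")); // [/one, /two]
    }
}
```

A real HTML parser may change the case of element names or put elements in a 
namespace, so the XPath expression might need adjusting to match what the 
parser actually produces.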

So yeah, it might involve a bit of effort getting one of those to work, but 
it's far more stable and reliable than attempting to use regex for something it 
simply isn't designed for. 

Archive: 
http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:328460
