Hey Chaps,

I've been doing a little work with some RSS feeds of late, and for the most 
part all is well. The one problem I'm running into is people who publish RSS 
feeds containing lots of junk HTML (urgh!), like inline links, images, divs 
and whatnot, in the description content of the feed.

I only want the plain-text version of these feeds, not all the other junk. 
This means stripping out HTML tags such as <div> and <a>, some of which are 
being published escaped as &lt; and &gt;. I also want to convert HTML 
character entities into their plain-text equivalents, for instance turning 
&amp; into a plain &.

Presumably this can all be done with regex (I couldn't find any nice built-in 
CF functions), but my skills in this area are pretty much non-existent. I know 
some of you are fairly experienced with this kind of thing.
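
To make the goal concrete, here is a rough sketch of the cleanup described 
above. It is written in Python rather than CF purely for illustration, the 
function name clean_description is made up, and the tag pattern is a 
simplifying assumption rather than a real HTML parser, so treat it as a 
starting point only.

```python
import html
import re

# Assumption: a "tag" is "<" or "</" followed by a letter, up to the next ">".
# This is a rough heuristic, not a real HTML parser.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def clean_description(text):
    """Return a plain-text version of an RSS description (hypothetical helper)."""
    # Decode entities first, so escaped tags such as &lt;div&gt; become real
    # tags and &amp; becomes a plain &.
    decoded = html.unescape(text)
    # Replace each tag with a space so neighbouring words do not run together.
    without_tags = TAG_RE.sub(" ", decoded)
    # Collapse the leftover runs of whitespace.
    return re.sub(r"\s+", " ", without_tags).strip()


if __name__ == "__main__":
    print(clean_description('<div>Fish &amp; chips <a href="http://example.com">menu</a></div>'))
    print(clean_description("&lt;p&gt;Hello&lt;/p&gt;"))
```

Decoding before stripping is what catches the tags published as &lt; and &gt;. 
The downside is that text which was deliberately escaped to show a literal 
tag gets stripped too.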

I'm also hoping to run some form of regex 'find' using the same rules first, 
so I can say to the user 'This feed appears to contain lots of redundant crap, 
would you like it cleaned for you? This may cause formatting issues.' or 
something to that effect. I can then process the replace rules if they choose 
to do so.
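
The 'find' step might look something like the sketch below, again in Python 
rather than CF just to show the idea. The names count_markup and 
looks_cluttered and the threshold of 3 are made up for illustration, and the 
pattern is only a rough guess at what counts as markup.

```python
import re

# Assumption: "markup" means a literal tag, the start of an escaped tag
# (&lt; followed by a tag name), or any named or numeric character entity.
MARKUP_RE = re.compile(r"</?[A-Za-z][^>]*>|&lt;/?[A-Za-z]|&(?:[A-Za-z]+|#\d+);")


def count_markup(text):
    """Count tag-like and entity-like matches in a description (hypothetical helper)."""
    return len(MARKUP_RE.findall(text))


def looks_cluttered(text, threshold=3):
    """Return True when there is enough markup to be worth prompting the user."""
    return count_markup(text) >= threshold


if __name__ == "__main__":
    sample = '<div>Hi &amp; <a href="#">x</a></div>'
    if looks_cluttered(sample):
        print("This feed appears to contain lots of markup. Clean it?")
```

Counting matches rather than testing for a single one means a feed with one 
stray &amp; does not trigger the prompt.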

I'd appreciate any advice.

Rob 

Archive: 
http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:322053