Re: Cleaning stored text to get valid XML

2007-05-05 Thread Les Mizzell
Re: http://www.nelsonmullins.com/rss/rss_newsletters.cfm

Finally found a function that seems to clean out most of the MS Word 
control characters and other crap that was causing me problems. Running 
two filters on the body text seems to be taking care of it for now.


<cfscript>
function ReplaceMicrosoftChars(arg_str) {
    return ReplaceList(arg_str,
        "#Chr(19)#,#Chr(20)#,#Chr(25)#,#chr(8216)#,#chr(8217)#,#Chr(8211)#,#Chr(8212)#,#Chr(145)#,#Chr(146)#,#Chr(147)#,#chr(8220)#,#chr(8221)#,#Chr(148)#,#Chr(29)#,#Chr(28)#,#Chr(150)#,#Chr(151)#,#Chr(8230)#",
        "--,--,',',',--,--,',',,,-,-,...");
}
</cfscript>


<!--- CLEAN HTML --->
<cfset request.bodynohtml = #rereplacenocase(stories.body,"<[^>]*>","","all")#>

<!--- CLEAN WORD --->
<cfset request.msclean=#ReplaceMicrosoftChars(request.bodynohtml)#>
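
For comparison, the same replace-by-lookup idea can be sketched in JavaScript, the language of the cleanWord function posted later in this thread. The function name and the exact character-to-ASCII mapping here are assumptions chosen for illustration, not the mapping used by ReplaceMicrosoftChars. The table covers the Unicode "smart" punctuation plus the C1 code points that show up when Windows-1252 bytes are read as Latin-1.

```javascript
// Hypothetical helper (not from the thread): map Word punctuation to ASCII.
const WORD_PUNCTUATION = new Map([
  ["\u2018", "'"],   // left single quote
  ["\u2019", "'"],   // right single quote / apostrophe
  ["\u201C", '"'],   // left double quote
  ["\u201D", '"'],   // right double quote
  ["\u2013", "-"],   // en dash
  ["\u2014", "--"],  // em dash
  ["\u2026", "..."], // ellipsis
  // Windows-1252 bytes 0x85, 0x91-0x94, 0x96, 0x97 decoded as Latin-1
  ["\u0085", "..."],
  ["\u0091", "'"],
  ["\u0092", "'"],
  ["\u0093", '"'],
  ["\u0094", '"'],
  ["\u0096", "-"],
  ["\u0097", "--"],
]);

function replaceWordPunctuation(text) {
  // Every character in this class has an entry in the table above.
  const pattern =
    /[\u2018\u2019\u201C\u201D\u2013\u2014\u2026\u0085\u0091-\u0094\u0096\u0097]/g;
  return text.replace(pattern, (ch) => WORD_PUNCTUATION.get(ch));
}

console.log(replaceWordPunctuation("\u201CIt\u2019s\u201D \u2013 ok\u2026"));
```

A single lookup table keeps each character and its replacement side by side, which avoids the position-matching problem of two parallel comma-separated lists.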




Feed seems to be working now, until the client finds something else to 
throw in there that the above doesn't cover!!!

Of course, a better way to do this would be to create valid XML text 
right from the start, but I've got hundreds of records of legacy data to 
deal with.
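
Several of the characters in the list above (Chr(19), Chr(20), Chr(25), Chr(28), Chr(29)) are control characters, and XML 1.0 does not permit control characters below 0x20 other than tab, line feed and carriage return. Rather than listing them one by one, a catch-all filter can drop anything outside the XML 1.0 character range. The sketch below is in JavaScript and the function name is made up for illustration; it removes the offending characters rather than translating them.

```javascript
// Hypothetical helper: remove every code point that XML 1.0 does not allow.
// Allowed: #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF
function stripInvalidXmlChars(text) {
  const notAllowed =
    /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/gu;
  return text.replace(notAllowed, "");
}

// Chr(19) and Chr(20) are dropped; the tab and newline survive.
console.log(JSON.stringify(stripInvalidXmlChars("a\u0013b\u0014c\tD\n")));
```

Running a filter like this before escaping means a new, unlisted control character cannot break the feed, although visible characters such as curly quotes still need their own handling.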




Archive: 
http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:277074
Subscription: http://www.houseoffusion.com/groups/CF-Talk/subscribe.cfm
Unsubscribe: 
http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=11502.10531.4


Re: Cleaning stored text to get valid XML

2007-05-05 Thread Claude Schneegans
> Finally found a function that seems to clean most of the MS Word control
> characters and other crap out that was causing me problems. Using two
> filters on the body text seems to be taking care of my problems now..

This will clean only about 1% of the crap, maybe not even that...

Here is a function that will clean up more, and I'm still improving it ;-)

function cleanWord (html)
// cleans pasted text from Word
{
    //alert(html)
    html = html.replace(/<o:p>\s*<\/o:p>/g, "") ;
    html = html.replace(/<o:p>.*?<\/o:p>/g, "") ;

    // Remove mso-xxx styles.
    html = html.replace( /\s*mso-[^:]+:[^;"]+;?/gi, "" ) ;

    // Remove margin styles.
    html = html.replace( /\s*MARGIN: 0cm 0cm 0pt\s*;/gi, "" ) ;
    html = html.replace( /\s*MARGIN: 0cm 0cm 0pt\s*"/gi, "\"" ) ;

    html = html.replace( /\s*TEXT-INDENT: 0cm\s*;/gi, "" ) ;
    html = html.replace( /\s*TEXT-INDENT: 0cm\s*"/gi, "\"" ) ;

    html = html.replace( /\s*TEXT-ALIGN: [^\s;]+;?"/gi, "\"" ) ;

    html = html.replace( /\s*PAGE-BREAK-BEFORE: [^\s;]+;?"/gi, "\"" ) ;

    html = html.replace( /\s*FONT-VARIANT: [^\s;]+;?"/gi, "\"" ) ;

    html = html.replace( /\s*tab-stops:[^;"]*;?/gi, "" ) ;
    html = html.replace( /\s*tab-stops:[^"]*/gi, "" ) ;

    html = html.replace( /\s*FONT-FAMILY:[^;"]*;?/gi, "" ) ;

    // Remove Class attributes
    html = html.replace(/<(\w[^>]*)\s*class=([^ |>]*)([^>]*)/gi, "<$1$3") ;

    // Remove styles.
    html = html.replace( /<(\w[^>]*)style="([^\"]*)"([^>]*)/gi, "<$1$3" ) ;

    // Remove empty styles.
    html = html.replace( /\s*style="\s*"/gi, '' ) ;

    html = html.replace( /<SPAN[^>]*>\s*&nbsp;\s*<\/SPAN>/gi, '&nbsp;' ) ;

    html = html.replace( /<SPAN[^>]*>\s*<\/SPAN>/gi, '' ) ;

    // Remove Lang attributes
    html = html.replace(/<(\w[^>]*) lang=([^ |>]*)([^>]*)/gi, "<$1$3") ;

    // Unwrap attribute-less SPANs (repeated to catch nesting)
    html = html.replace( /<SPAN\s*>([\s\S]*?)<\/SPAN>/gi, '$1' ) ;
    html = html.replace( /<SPAN\s*>([\s\S]*?)<\/SPAN>/gi, '$1' ) ;
    html = html.replace( /<SPAN\s*>([\s\S]*?)<\/SPAN>/gi, '$1' ) ;

    // remove all font tags
    html = html.replace( /<\/?FONT[^>]*>/gi, '' ) ;
    html = html.replace( /<\/?FONT[^>]*>/gi, '' ) ;
    html = html.replace( /<\/?FONT[^>]*>/gi, '' ) ;
    html = html.replace( /<\/?DIV([^>]*)>/gi, '' ) ;

    // Remove XML elements and declarations
    html = html.replace(/<\\?\?xml[^>]*>/gi, "") ;

    // Remove Tags with XML namespace declarations: <o:p></o:p>
    html = html.replace(/<\/?\w+:[^>]*>/gi, "") ;

    html = html.replace( /<H\d>\s*<\/H\d>/gi, '' ) ;

    // clean up H tags
    html = html.replace( /<H1([^>]*)>/gi, '<H1>' ) ;
    html = html.replace( /<H2([^>]*)>/gi, '<H2>' ) ;
    html = html.replace( /<H3([^>]*)>/gi, '<H3>' ) ;
    html = html.replace( /<H4([^>]*)>/gi, '<H4>' ) ;
    html = html.replace( /<H5([^>]*)>/gi, '<H5>' ) ;
    html = html.replace( /<H6([^>]*)>/gi, '<H6>' ) ;
    html = html.replace( /<P([^>]*)>/gi,  '<P>' ) ;
    html = html.replace( /<BR([^>]*)>/gi, '<BR>' ) ;
    html = html.replace( /<P>\s*(<P>)+<\/P>/gi, '<P>' ) ;
    html = html.replace( /<\/P>\s*(<\/P>)+<\/P>/gi, '</P>' ) ;

    html = html.replace( /<(U|I|STRIKE)>&nbsp;<\/\1>/g, '&nbsp;' ) ;

    // no comment...
    html = html.replace( /<!--[\s\S]*?-->/gi, '' ) ;

    // transform bullet lists
    var re = new RegExp("<P>·<SPAN>(&nbsp;| )*</SPAN>([\\s\\S]*?)</P>", "gi");
    html = html.replace( re, "<LI>$2</LI>" ) ;
    re = new RegExp("<P>·(&nbsp;| )*([\\s\\S]*?)</P>", "gi");
    // $3 is the item text here; $2 is only the leading spacing
    html = html.replace( /(<BR>|<P>)[§·-](&nbsp;| )*([\s\S]*?)<\/P>/gi, "<LI>$3</LI>" ) ;
    // remove spaces at beginning
    html = html.replace( /^(&nbsp;| )*\s*/, '') ;
    // replace all stupid <P align=center>...</P> because they are
    // overridden by higher style declarations like justify, etc.
    html = html.replace( /<P\s*align=center>([\s\S]*?)<\/P>/gi, '<BR><CENTER>$1</CENTER>' ) ;
    // remove useless </CENTER><CENTER>
    html = html.replace( /<\/CENTER>(\s*<BR>\s*)<CENTER>/gi, '$1' ) ;
    // remove useless <BR> in <TD>
    html = html.replace( /(<TD[^>]*>)\s*<BR>\s*/gi, '$1' ) ;
    // replace <CENTER>...</CENTER> inside of TDs
    html = html.replace( /(<TD[^>]*)>\s*<CENTER>([\s\S]*?)<\/CENTER>\s*<\/TD>/gi,
        '$1 align=center>$2</TD>' ) ;
    // remove Paragraphs inside TD
    html = html.replace( /(<TD[^>]*>)\s*<P[^>]*>([\s\S]*?)\s*<\/P>\s*([\s\S]*?<\/TD>)/gi,
        '$1$2$3' ) ;

    // Remove empty tags (three times, just to make sure).
    html = html.replace( /<([^\s>]+)[^>]*>\s*<\/\1>/g, '' ) ;
    html = html.replace( /<([^\s>]+)[^>]*>\s*<\/\1>/g, '' ) ;
    html = html.replace( /<([^\s>]+)[^>]*>\s*<\/\1>/g, '' ) ;
    html = html.replace( /[^\n\r]<P>/gi, '<P>' ) ;
    html = html.replace( /[^\n\r]<BR>/gi, '<BR>' ) ;

    //alert(html)
    return (html);
}

-- 
___
REUSE CODE! Use custom tags;
See http://www.contentbox.com/claude/customtags/tagstore.cfm
(Please send any spam to this address: [EMAIL PROTECTED])
Thanks.



Re: Cleaning stored text to get valid XML

2007-05-05 Thread Les Mizzell
> This will clean only about 1% of the crap, maybe not even...


The data in question has been entered using fckeditor, which has taken 
care of a good bit of the problem stuff for me. It was the few things 
left over that fck didn't deal with that were giving me fits.

I see some great potential with cleanWord! Thanks for sharing!!! I'll 
certainly add it to my toolbox and may pull some bits and pieces as 
needed for the current project as well!!!



Re: Cleaning stored text to get valid XML

2007-05-04 Thread Robertson-Ravo, Neil (RX)
Have you tried XMLFormat() around the content?




-----Original Message-----
From: Les Mizzell
To: CF-Talk
Sent: Fri May 04 02:30:03 2007
Subject: Cleaning stored text to get valid XML

I've had an application set up for a while now that allows a user to post 
and email newsletters from their site.

The body text is entered on a form using fckeditor, and since these 
folks are lawyers, almost everything is pasted from Word and fckeditor 
is handling whatever is thrown at it.

Now, they wish to create an RSS feed from all their newsletters. Oh boy. 
There's all kinds of crap in the data - curly quotes, apostrophes, HTML 
tags, and gawd knows what else.

I've been going nutz trying to clean the existing text enough to create 
valid XML text so it will display.

I start by getting rid of all the HTML junk, which is working fine:

<cfset request.bodynohtml = #rereplacenocase(stories.body,"<[^>]*>","","all")#>

After that, it gets a little weird. I've tried all sorts of functions,
xmlFormat2.cfm, ConvertSpecialChars ... chaining rereplacenocase to get 
rid of left and right quotes, apostrophes, whatever other junk I keep 
finding ...

Nothing seems to be getting rid of everything, and the feed still isn't 
displaying correctly. I know my code base is OK because I've created two 
other feeds that are working. There's *something* in the text that's 
still stopping a correct display.

How are you folks handling this sort of thing?

This one is working:
http://www.nelsonmullins.com/rss/rss_press.cfm

This one ain't - there's something in the body text somewhere I'm not 
stripping out...
http://www.nelsonmullins.com/rss/rss_newsletters.cfm

Suggestions?





Re: Cleaning stored text to get valid XML

2007-05-04 Thread Les Mizzell
Robertson-Ravo, Neil (RX) wrote:
> Have you tried XMLFormat() around the content?


Yup, and the version up now is using it. Still doesn't want to display...

http://www.nelsonmullins.com/rss/rss_newsletters.cfm

Here's what I've got in there right now:

<cfset request.bodynohtml = #rereplacenocase(stories.body,"<[^>]*>","","all")#>
<cfset request.bodyXMLFormat = #XMLFormat(request.bodynohtml)#>
<cfset request.bodylimit = #left(request.bodyXMLFormat, 300)#>

<cfset request.titlenohtml = #rereplacenocase(pr.title,"<[^>]*>","","all")#>
<cfset request.titleXMLFormat = #XMLFormat(request.titlenohtml)#>


</cfsilent>
<cfset queryAddRow(data,1)>
  <cfset querySetCell(data,"title","#ConvertSpecialChars(request.titleXMLFormat)#")>
  <cfset querySetCell(data,"body","#request.bodylimit# ...")>
  <cfset querySetCell(data,"subject","#ConvertSpecialChars(request.titleXMLFormat)#")>
  <cfset querySetCell(data,"date","#dateformat(pr.date, 'mm/dd/')#")>
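
One thing worth checking in the snippet above: the body is escaped with XMLFormat() first and cut to 300 characters afterwards, so the cut can land in the middle of an entity such as &amp; and leave a dangling fragment that an XML parser will reject. Whether that is what breaks this particular feed is a guess. The sketch below illustrates the ordering problem in JavaScript; escapeXml and excerptForFeed are hypothetical names, and escapeXml only stands in for the five basic escapes, not for everything XMLFormat() does.

```javascript
// Hypothetical stand-in for the five basic XML escapes.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Truncate the plain text first, then escape, so no entity is ever split.
// Note: slice() counts UTF-16 code units, not visible characters.
function excerptForFeed(plainText, maxChars) {
  const cut = plainText.length > maxChars
    ? plainText.slice(0, maxChars) + " ..."
    : plainText;
  return escapeXml(cut);
}

// Escape-then-cut leaves a broken entity; cut-then-escape does not.
console.log(escapeXml("AT&T rocks").slice(0, 4)); // AT&a
console.log(excerptForFeed("AT&T rocks", 4));     // AT&amp;T ...
```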


RE: Cleaning stored text to get valid XML

2007-05-04 Thread Jeremy
If the data is being pasted from Word, you will probably need to clean
(Replace) the special Microsoft characters that Word generates for most of
its invisible characters. I know I had problems posting data pasted from
Word to my db, and it was those special characters causing the problem.
Check the Adobe site; they had an article on there somewhere with the hex
codes for most of the problem characters. You may also need to replace
quotes with something else, or remove them altogether.
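
When it is not clear which characters are still in the text, it can help to list them by code point before deciding what to replace. The following JavaScript sketch (the function name is made up for illustration) counts every character that is not printable ASCII, tab, or a line break, and reports each one in hex.

```javascript
// Hypothetical diagnostic: count characters outside printable ASCII.
function listSuspectChars(text) {
  const found = new Map();
  for (const ch of text) { // iterates by code point, not by UTF-16 unit
    const cp = ch.codePointAt(0);
    const isPlainAscii =
      cp === 0x09 || cp === 0x0a || cp === 0x0d || (cp >= 0x20 && cp <= 0x7e);
    if (!isPlainAscii) {
      const key = "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
      found.set(key, (found.get(key) || 0) + 1);
    }
  }
  return found;
}

// Reports U+2019 twice and U+0013 once.
console.log(listSuspectChars("a\u2019b\u0013\u2019"));
```

Running this over a stored newsletter body gives a short list of code points to look up, which is easier than guessing at replacements one feed error at a time.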
