Re: Cleaning Microsoft .docx special chars

2010-02-26 Thread denstar

On Thu, Feb 25, 2010 at 6:32 AM, Kevin Pepperman
...
 We need it to be raw text, and we have specific filtering for meta
 descriptions, page titles, etc. (e.g. no  in meta description tags so the
 XHTML will validate), so our CMS limits what can be entered with a
 combination of jQuery and server-side validation after submitting via XHR.
 So I probably won't be able to use a full RTE for this situation.

I've seen this come up enough to toss another possible solution out there:

We're starting to use the Jericho HTML library for parsing tags in
CFEclipse; there is an example of cleaning bad markup on the library's
site.

There's also an example of converting HTML into plain text while keeping at
least some formatting, which could work as well.

I haven't really experimented with the HTML-to-text conversion, but the
cleaning stuff seems to be a pretty viable solution.

Just tossing it out there, as you could do the cleaning as part of a
custom tag or whatnot.
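
For anyone who wants to try it from CFML, here is a minimal sketch, assuming the Jericho HTML Parser 3.x jar is on the ColdFusion classpath. The variable names and the sample markup are made up for illustration; only the Jericho class and method names come from the library.

```cfml
<cfscript>
// Sample input; in practice this would be the pasted content.
html = "<p>Some <b>pasted</b> content</p>";

// Assumption: the Jericho jar is on the classpath so createObject can find it.
source = createObject("java", "net.htmlparser.jericho.Source").init(javaCast("string", html));

// TextExtractor: the text content with the tags dropped.
plainText = source.getTextExtractor().toString();

// Renderer: plain text that tries to keep some of the formatting.
renderedText = source.getRenderer().toString();
</cfscript>
```

Either call could be wrapped in a custom tag or UDF and run before the content is saved.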

:den

-- 
The future belongs to those who prepare for it today.
Malcolm 

~|
Want to reach the ColdFusion community with something they want? Let them know 
on the House of Fusion mailing lists
Archive: 
http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:331192
Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm
Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=89.70.4


Cleaning Microsoft .docx special chars

2010-02-25 Thread Kevin Pepperman

I just wondered if anyone else has seen this before, and I wanted to post
this to help anyone else that has this same issue.

We have an outsourced writer doing textual content for products in a site
that I work on.

The writer always supplies me with the content and I add it via a CMS.

It was always in Microsoft Word format, so I know to strip out special chars
using the DeMoronize UDF found at CFLIB; the function is built into the
CMS.

But recently he has been using a newer version of Word and giving me .docx
 files.

At first I didn't really notice, but recently some of the content has been
indexed by Google, and since the text was used in meta descriptions and
page titles in places, I started seeing new &#xxx; character references
that would not render in Google and that I hadn't seen before.

When rendered correctly on our site, they look exactly like the smart quotes
and several other special characters such as the ellipsis (...).

The DeMoronize() UDF used to remove these with the chr(n) function, but
these look completely different and were missed by the UDF.

These are the new entities that I am seeing. They show up fine in UTF-8
HTML, but posting them to the CMS stored them in MySQL in this form, and
Google would not render them correctly in its results.

&#8220; &#8221; &#8217; &#160; &#39; &#8230; &#8482;

The fix for DeMoronize() is simple.

Add these lines.

text = ReReplace(text, "&##8220;", """", "All");
text = ReReplace(text, "&##8221;", """", "All");
text = ReReplace(text, "&##8217;", "'", "All");
text = ReReplace(text, "&##160;", " ", "All");
text = ReReplace(text, "&##39;", "'", "All");
text = ReReplace(text, "&##8230;", "...", "All");
text = ReReplace(text, "&##8482;", "&trade;", "All");
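
If the list keeps growing, the same idea can be written as a loop over paired arrays. This is only a sketch: the function name replaceWordEntities is hypothetical and is not part of the CFLIB DeMoronize() UDF, and it assumes ColdFusion 9 or later for the cfscript syntax. It uses Replace() instead of ReReplace() because the patterns are literal strings, not regular expressions.

```cfml
<cfscript>
// Hypothetical helper; the name and structure are illustrative only.
function replaceWordEntities(text) {
    // Numeric character references seen in the pasted Word content,
    // paired by position with their plain-text replacements.
    var codes = [8220, 8221, 8217, 160, 39, 8230, 8482];
    var plain = ['"', '"', "'", " ", "'", "...", "&trade;"];
    var i = 0;

    for (i = 1; i <= arrayLen(codes); i = i + 1) {
        // "&##" is a literal "&#" inside a CFML string.
        text = replace(text, "&##" & codes[i] & ";", plain[i], "all");
    }
    return text;
}

// Example call: curly quotes and an ellipsis become plain ASCII.
cleaned = replaceWordEntities("&##8220;Hi&##8221; there&##8230;");
</cfscript>
```

Adding another entity is then a matter of appending one value to each array.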


Server notes:

The MySQL tables, columns, and connections all use UTF-8 with the
utf8_unicode_ci collation.

Application.cfc contains SetEncoding("form", "utf-8");
SetEncoding("url", "utf-8"); and I am using <cfprocessingdirective
pageencoding="UTF-8">.
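
Roughly, the setup looks like the sketch below. The application name and the choice of onRequestStart as the place for the SetEncoding() calls are assumptions for illustration, not a copy of the real file.

```cfml
<cfcomponent output="false">
    <!--- Tells ColdFusion to read this file as UTF-8. --->
    <cfprocessingdirective pageencoding="UTF-8">

    <!--- Assumed application name, for illustration only. --->
    <cfset this.name = "exampleCms">

    <cffunction name="onRequestStart" returntype="boolean" output="false">
        <cfargument name="targetPage" type="string" required="true">
        <!--- Decode incoming form and URL variables as UTF-8. --->
        <cfset setEncoding("form", "utf-8")>
        <cfset setEncoding("url", "utf-8")>
        <cfreturn true>
    </cffunction>
</cfcomponent>
```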

Has anyone else seen this?

-- 
/Kevin Pepperman


Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:331119


RE: Cleaning Microsoft .docx special chars

2010-02-25 Thread Will Swain

We use the built-in rich text editor in CF8+ to handle pasting from Word, and
it seems to handle it very well. It might be an option: just use 'Paste from
Word' or 'Paste as plain text'.

Will

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:331124


Re: Cleaning Microsoft .docx special chars

2010-02-25 Thread Tom Chiverton

On Thursday 25 Feb 2010, Kevin Pepperman wrote:
 But recently he has been using a newer version of Word and giving me .docx
  files.

Ask him to stop? I would hope he would. Your interchange formats may even be
specified in the contract.


-- 
Helping to autoschediastically cluster collaborative communities as part of 
the IT team of the year 2010, '09 and '08



Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:331126


Re: Cleaning Microsoft .docx special chars

2010-02-25 Thread Kevin Pepperman


 Ask him to stop ?


That is probably the best solution. :)

I'll require him to submit .txt files from now on.

 We use the built in richtext editor in cf8+ to handle pasting from Word


We need it to be raw text, and we have specific filtering for meta
descriptions, page titles, etc. (e.g. no  in meta description tags so the
XHTML will validate), so our CMS limits what can be entered with a
combination of jQuery and server-side validation after submitting via XHR.
So I probably won't be able to use a full RTE for this situation.

Thanks!

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:331127