Re: Cleaning Microsoft .docx special chars
On Thu, Feb 25, 2010 at 6:32 AM, Kevin Pepperman ... wrote:
> We need it to be raw text, and we have specific filtering for Meta descriptions, page titles etc. (eg. no in meta description tags so the XHTML will validate) --so our CMS limits what can be entered with a combination of jQuery and server side validations after submitting to an XHR. So I probably won't be able to use a full RTE for this situation.

I've seen this come up enough to toss another possible solution out there:

We're starting to use the Jericho HTML library for parsing tags in CFEclipse; there is an example of cleaning bad markup on its site. There's also an example of changing HTML into plain text, but with at least some formatting, which could work as well.

I haven't really experimented with the HTML text conversion, but the cleaning stuff seems to be a pretty viable solution.

Just tossing it out there, as you could do the cleaning as part of a custom tag or whatnot.

:den

-- 
The future belongs to those who prepare for it today.
Malcolm

~|
Want to reach the ColdFusion community with something they want? Let them know on the House of Fusion mailing lists
Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:331192
Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm
Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=89.70.4
Cleaning Microsoft .docx special chars
I just wondered if anyone else has seen this before, and I wanted to post this to help anyone else who has the same issue.

We have an outsourced writer doing textual content for products on a site that I work on. The writer always supplies me with the content and I add it via a CMS. It was always in Microsoft Word format, so I know to strip out special chars using the DeMoronize UDF found at CFLIB; the function is built into the CMS.

But recently he has been using a newer version of Word and giving me .docx files. At first I didn't really notice, but recently some of the content has been indexed by Google, and since the text was used in Meta descriptions and page titles in places, I started seeing new &#xxx; chars in Google that would not render and that I hadn't seen before.

When rendered correctly on our site, they look exactly like the smart quotes and several other strange things like ... etc. The DeMoronize() UDF used to remove these with the chr(n) function, but these look completely different and were missed by the UDF.

These are the new entities that I am seeing. They show up fine in UTF-8 HTML, but posting them to the CMS was adding them like this to MySQL, and Google would not render them correctly in its results.

&#8220; &#8221; &#8217; &#160; &#39; &#8230; &#8482;

The fix for DeMoronize() is simple. Add these lines:

text = ReReplace(text, "&##8220;", '"', "All");
text = ReReplace(text, "&##8221;", '"', "All");
text = ReReplace(text, "&##8217;", "'", "All");
text = ReReplace(text, "&##160;", " ", "All");
text = ReReplace(text, "&##39;", "'", "All");
text = ReReplace(text, "&##8230;", "...", "All");
text = ReReplace(text, "&##8482;", "&trade;", "All");

Server notes: The MySQL tables, columns, connections and collation are all utf8_unicode_ci, Application.cfc contains SetEncoding("form", "utf-8"); SetEncoding("url", "utf-8"); and I am using <cfprocessingdirective pageencoding="UTF-8">.

Has anyone else seen this?

-- 
/Kevin Pepperman

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:331119
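The per-entity replacements in the message above could also be driven from a lookup table, so that adding another entity later is a one-line change. This is only a sketch and has not been run against the CFLIB version of DeMoronize(). The names cleanWordEntities and entityMap are made up for illustration, and the table holds just the seven entities listed in the post.

```cfml
<cfscript>
// Hypothetical helper, not part of the CFLIB DeMoronize() UDF.
// Replaces the numeric entities listed in the post with plain ASCII.
function cleanWordEntities(text) {
    var entityMap = StructNew();
    var code = "";

    entityMap["8220"] = '"';       // left double quote
    entityMap["8221"] = '"';       // right double quote
    entityMap["8217"] = "'";       // right single quote
    entityMap["160"]  = " ";       // non-breaking space
    entityMap["39"]   = "'";       // apostrophe
    entityMap["8230"] = "...";     // ellipsis
    entityMap["8482"] = "&trade;"; // trademark sign, kept as a named entity as in the post

    for (code in entityMap) {
        // "##" is an escaped pound sign inside a CFML string,
        // so this builds the literal text &#NNNN; for each code.
        text = Replace(text, "&##" & code & ";", entityMap[code], "All");
    }
    return text;
}
</cfscript>
```

Usage would be along the lines of cleaned = cleanWordEntities(form.metaDescription); where form.metaDescription is a hypothetical field name. Replace() is used here instead of ReReplace() because the entities are literal strings and need no regular expression.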
RE: Cleaning Microsoft .docx special chars
We use the built-in rich text editor in CF8+ to handle pasting from Word, and it seems to handle it very well. Might be an option.

Just 'Paste from Word' or 'Paste as plain text'.

Will

-----Original Message-----
From: Kevin Pepperman [mailto:chorno...@gmail.com]
Sent: 25 February 2010 08:58
To: cf-talk
Subject: Cleaning Microsoft .docx special chars

[snip]

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:331124
Re: Cleaning Microsoft .docx special chars
On Thursday 25 Feb 2010, Kevin Pepperman wrote:
> But recently he has been using a newer version of Word and giving me .docx files.

Ask him to stop? I would hope he would. Your interchange formats may even be specified in the contract.

-- 
Helping to autoschediastically cluster collaborative communities as part of the IT team of the year 2010, '09 and '08

This email is sent for and on behalf of Halliwells LLP.

Halliwells LLP is a limited liability partnership registered in England and Wales under registered number OC307980 whose registered office address is at Halliwells LLP, 3 Hardman Square, Spinningfields, Manchester, M3 3EB. A list of members is available for inspection at the registered office together with a list of those non members who are referred to as partners. We use the word partner to refer to a member of the LLP, or an employee or consultant with equivalent standing and qualifications. Regulated by the Solicitors Regulation Authority.

CONFIDENTIALITY

This email is intended only for the use of the addressee named above and may be confidential or legally privileged. If you are not the addressee you must not read it and must not use any information contained in nor copy it nor inform any person other than Halliwells LLP or the addressee of its existence or contents. If you have received this email in error please delete it and notify Halliwells LLP IT Department on 0870 365 2500.

For more information about Halliwells LLP visit www.halliwells.co

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:331126
Re: Cleaning Microsoft .docx special chars
> Ask him to stop ?

That is probably the best solution. :) I'll require him to submit .txt files from now on.

> We use the built in richtext editor in cf8+ to handle pasting from Word

We need it to be raw text, and we have specific filtering for Meta descriptions, page titles etc. (eg. no in meta description tags so the XHTML will validate) --so our CMS limits what can be entered with a combination of jQuery and server side validations after submitting to an XHR. So I probably won't be able to use a full RTE for this situation.

Thanks!

On Thu, Feb 25, 2010 at 8:17 AM, Tom Chiverton tom.chiver...@halliwells.com wrote:
> On Thursday 25 Feb 2010, Kevin Pepperman wrote:
> > But recently he has been using a newer version of Word and giving me .docx files.
>
> Ask him to stop ? I would hope he would. Your interchange formats may even be specified in the contract

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:331127