[Boston.pm] Converting Microsoft Word Special Characters
Hi All,

I was wondering if I could get some help here. I am looking for an existing function/method/module that will properly convert all special characters (like those from Microsoft Word: smart quotes, mdash, ellipses, bullet points, etc.) to either a matching simpler character, or an HTML entity. HTML::Entities does a close job, but it does not handle everything correctly. I need to clean this data up for use in a Google product feed (XML).

Here is an example of some text I am having trouble with (the +'s are actually bullet points):

== begin ==
My doctor has recommended a dream specialist, and together we are trying to figure out what these nightmares mean. Jump into Hidden Object action in Doors of the Mind – Inner Mysteries.ADVANTAGES OF THE COMPLETE VERSION :DOORS OF THE MIND: INNER MYSTERIES + Dark atmosphere+ Spooky gameplay+ Explore a world of nightmares!
=== end ===

And here is the output from using HTML::Entities:

== begin ==
My doctor has recommended a dream specialist, and together we are trying to figure out what these nightmares mean. Jump into Hidden Object action in Doors of the Mind &acirc;&#128;&#147; Inner Mysteries.ADVANTAGES OF THE COMPLETE VERSION :DOORS OF THE MIND: INNER MYSTERIES&Acirc;&nbsp;+&Acirc;&nbsp;Dark atmosphere+&Acirc;&nbsp;Spooky gameplay+&Acirc;&nbsp;Explore a world of nightmares!
=== end ===

Notice the extra &Acirc; all over the place. Any help you can provide would be immensely helpful.

Thanks.
--Alex

___
Boston-pm mailing list
Boston-pm@mail.pm.org
http://mail.pm.org/mailman/listinfo/boston-pm
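A likely cause of the stray Acirc entities in the output above: the text is UTF-8, but it reached HTML::Entities as undecoded bytes, so each byte was encoded as if it were a Latin-1 character. In UTF-8 the en dash is the three bytes E2 80 93 and a no-break space is C2 A0, which matches the acirc/#128/#147 and Acirc/nbsp sequences shown. A minimal sketch of the difference, assuming the input really is UTF-8 (the sample string and variable names are made up for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# Raw UTF-8 bytes, as they might arrive from a file or database:
# "Mind ", EN DASH (E2 80 93), " Inner", NO-BREAK SPACE (C2 A0), BULLET (E2 80 A2)
my $bytes = "Mind \xE2\x80\x93 Inner\xC2\xA0\xE2\x80\xA2";

# Without decoding, every byte is treated as a separate Latin-1 character.
my $wrong = encode_entities($bytes);

# Decoding first turns each multi-byte sequence into one character,
# which HTML::Entities can then map to its named entity.
my $chars = decode('UTF-8', $bytes);
my $right = encode_entities($chars);

print "undecoded: $wrong\n";
print "decoded:   $right\n";
```

If the data is read from a file, opening the handle with an `:encoding(UTF-8)` layer has the same effect as the explicit decode() call.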
Re: [Boston.pm] Converting Microsoft Word Special Characters
On 05/18/2012 07:55 PM, Alex Brelsfoard wrote:
> Hi All, I was wondering if I could get some help here. I am looking for an existing function/method/module that will properly convert all special characters (like those from Microsoft Word: smart quotes, mdash, ellipses, bullet points, etc.) to either a matching simpler character, or an HTML entity. HTML::Entities does a close job, but it does not handle everything correctly.

years ago someone i know wrote such a beast. it is appropriately called the demoroniser (replaces 'smart' crapola).

http://www.fourmilab.ch/webtools/demoroniser/

it may do the trick. at least it is pure perl and would be easy for you to hack to your specific needs. note that it is very old and written in perl4 code!

uri
Re: [Boston.pm] Converting Microsoft Word Special Characters
Thanks Uri,

Yeah, I found that one when I was Googling. Sadly, it only converts a few special characters (smart quotes and em and en dashes). I need something that handles as many as can be thought of or found.

--Alex

On Sat, May 19, 2012 at 1:26 AM, Uri Guttman u...@stemsystems.com wrote:
> years ago someone i know wrote such a beast. it is appropriately called the demoroniser (replaces 'smart' crapola).
>
> http://www.fourmilab.ch/webtools/demoroniser/
>
> it may do the trick. at least it is pure perl and would be easy for you to hack to your specific needs. note that it is very old and written in perl4 code!
>
> uri
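For an XML feed, note that XML itself predefines only five named entities (amp, lt, gt, quot, apos), so HTML names such as nbsp or ndash are not valid there unless a DTD declares them. One way to cover as many characters as possible is a small lookup table for the ones that have an obvious ASCII stand-in, with a numeric character reference as the fallback for everything else. The sketch below uses only core modules; the table contents and the name clean_for_feed are hypothetical, and it assumes UTF-8 input bytes and output destined for element content:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical and deliberately incomplete table; extend as needed.
my %ascii_for = (
    "\x{2018}" => "'",    # left single quotation mark
    "\x{2019}" => "'",    # right single quotation mark
    "\x{201C}" => '"',    # left double quotation mark
    "\x{201D}" => '"',    # right double quotation mark
    "\x{2013}" => '-',    # en dash
    "\x{2014}" => '--',   # em dash
    "\x{2026}" => '...',  # horizontal ellipsis
    "\x{2022}" => '*',    # bullet
    "\x{00A0}" => ' ',    # no-break space
);

sub clean_for_feed {
    my ($bytes) = @_;

    # Assumption: the input is UTF-8 encoded bytes.
    my $text = decode('UTF-8', $bytes);

    # Escape XML markup characters first, so that the numeric
    # references generated below are not escaped a second time.
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;

    # Replace each non-ASCII character with its ASCII stand-in if the
    # table has one, otherwise with a decimal character reference.
    $text =~ s/([^\x00-\x7F])/exists $ascii_for{$1} ? $ascii_for{$1} : sprintf('&#%d;', ord $1)/ge;

    return $text;
}

# EN DASH, NO-BREAK SPACE and TRADE MARK SIGN given as UTF-8 bytes
print clean_for_feed("A \xE2\x80\x93 B\xC2\xA0\xE2\x84\xA2 & <c>"), "\n";
# prints: A - B &#8482; &amp; &lt;c&gt;
```

HTML::Entities also provides encode_entities_numeric(), which may be enough on decoded text if no ASCII substitutions are wanted.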