[Boston.pm] Converting Microsoft Word Special Characters

2012-05-18 Thread Alex Brelsfoard
Hi All,

I was wondering if I could get some help here.  I am looking for an
existing function/method/module that will properly convert all special
characters (like those from Microsoft Word: smart quotes, mdash, ellipses,
bullet points, etc.) to either a matching simpler character, or an HTML
entity.

HTML::Entities comes close, but it does not handle everything
correctly.

I need to clean this data up for use in a Google product feed (XML).

Here is an example of some text I am having trouble with:
(the +'s are actually bullet points)
== begin ==
My doctor has recommended a dream specialist, and together we are trying to
figure out what these nightmares mean. Jump into Hidden Object action in
Doors of the Mind – Inner Mysteries.ADVANTAGES OF THE COMPLETE VERSION
:DOORS OF THE MIND: INNER MYSTERIES + Dark atmosphere+ Spooky
gameplay+ Explore a world of nightmares!
=== end ===

And here is the output from using HTML::Entities:
== begin ==
My doctor has recommended a dream specialist, and together we are trying to
figure out what these nightmares mean. Jump into Hidden Object action in
Doors of the Mind &acirc;&#128;&#147; Inner Mysteries.ADVANTAGES OF THE
COMPLETE VERSION :DOORS OF THE MIND: INNER
MYSTERIES&Acirc;&nbsp;+&Acirc;&nbsp;Dark atmosphere+&Acirc;&nbsp;Spooky
gameplay+&Acirc;&nbsp;Explore a world of nightmares!
=== end ===

Notice the extra &Acirc; all over the place.

Any help you can provide would be greatly appreciated.

Thanks.
--Alex
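
A note on the output above: the sequence &acirc;&#128;&#147; is what the
three UTF-8 bytes of an en dash (E2 80 93) look like when each byte is
entity-encoded on its own, and &Acirc;&nbsp; is the same effect for a
non-breaking space (C2 A0). That suggests the text reached the encoder as
raw bytes rather than as decoded characters. The sketch below is
illustrative only and is written in Python rather than Perl; the function
names and the sample string are made up for the example. In Perl the
corresponding first step would likely be Encode::decode('UTF-8', $bytes)
before calling HTML::Entities, assuming the input really is UTF-8.

```python
from xml.sax.saxutils import escape


def mojibake_entities(raw: bytes) -> str:
    """Reproduce the symptom: read UTF-8 bytes as Latin-1, then encode."""
    wrong = raw.decode("latin-1")  # one character per byte
    return wrong.encode("ascii", "xmlcharrefreplace").decode("ascii")


def clean_entities(raw: bytes) -> str:
    """Decode as UTF-8 first, then emit numeric character references."""
    text = raw.decode("utf-8")  # assumption: the input is UTF-8
    text = escape(text)  # escape &, < and > for XML
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Hypothetical sample: an en dash and two non-breaking spaces.
sample = "Mind \u2013 Inner\u00a0+\u00a0Dark".encode("utf-8")

print(mojibake_entities(sample))
print(clean_entities(sample))
```

Numeric references are used here because XML predefines only five named
entities (amp, lt, gt, apos, quot), so a name such as nbsp would not be
valid in an XML feed without a DTD that declares it.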

___
Boston-pm mailing list
Boston-pm@mail.pm.org
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] Converting Microsoft Word Special Characters

2012-05-18 Thread Uri Guttman

On 05/18/2012 07:55 PM, Alex Brelsfoard wrote:


 Hi All,

 I was wondering if I could get some help here.  I am looking for an
 existing function/method/module that will properly convert all special
 characters (like those from Microsoft Word: smart quotes, mdash, ellipses,
 bullet points, etc.) to either a matching simpler character, or an HTML
 entity.

 HTML::Entities does a close job, but it does not handle everything
 correctly.


years ago someone i know wrote such a beast. it is appropriately called
the demoroniser (replaces 'smart' crapola).

http://www.fourmilab.ch/webtools/demoroniser/

it may do the trick. at least it is pure perl and would be easy for you
to hack to your specific needs.

note that it is very old and written in perl4 code!

uri






Re: [Boston.pm] Converting Microsoft Word Special Characters

2012-05-18 Thread Alex Brelsfoard
Thanks Uri,

Yeah, I found that one when I was Googling.  Sadly, it only converts a few
special characters (smart quotes and em and en dashes).
I need something that handles as many as can be thought of or found.

--Alex
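
One way to cover "as many as can be found" is a two-stage approach: an
explicit table for the characters that have an obvious plain-ASCII
stand-in, and a numeric character reference for anything else outside
ASCII, so nothing is silently dropped. The sketch below is only an
illustration under those assumptions, written in Python; the table
contents and the names WORD_CHARS and simplify are hypothetical and would
need extending for real data. A Perl version could use a hash lookup for
the table in the same way.

```python
# Hypothetical replacement table: extend it as new characters turn up.
WORD_CHARS = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
    "\u2022": "*",    # bullet
    "\u00a0": " ",    # non-breaking space
}

_TABLE = str.maketrans(WORD_CHARS)


def simplify(text: str) -> str:
    """Map known characters to ASCII; others become numeric references."""
    replaced = text.translate(_TABLE)
    return replaced.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Text saved by Word as Windows-1252 bytes must be decoded first.
from_word = b"\x93Doors\x94 \x96 Inner\x85".decode("cp1252")
print(simplify(from_word))
print(simplify("\u2022\u00a0Dark caf\u00e9"))
```

The table handles the lossy "simpler character" case, while the fallback
keeps unknown characters intact as references. Note that simplify does
not escape &, < or >, so for an XML feed that step would still have to
be done separately, before the numeric references are added.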


