On 10/18/2014 09:08 PM, Stefano Rivera wrote:
Control: tag -1 upstream
Hi Steve (2014.10.04_07:54:18_-0700)
http://www.nextinpact.com/news/90246-twitch-et-valve-veulent-plus-transparence-sur-contenus-sponsorises.htm
Running it on that page doesn't seem to introduce any
&nbsp_place_holder; entities.
Can you find any HTML that reproduces this?
SR
Hi Stefano.
My first description was misleading, sorry.
It's the RSS feed of the website that should be parsed with html2text,
as I use it through rss2email.
Steps to reproduce:
wget http://www.nextinpact.com/rss/news.xml
python /usr/share/pyshared/html2text.py news.xml > news.html
python /usr/share/pyshared/html2text.py news.html
The first pass converts XML to HTML and the second one HTML to plain text (I
guess that's what rss2email does).
There you should see &nbsp_place_holder; entities.
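To illustrate the two passes without html2text, here is a minimal sketch using only the Python 3 standard library. The feed fragment is made up (it is not copied from news.xml), and I am assuming the feed carries its HTML escaped inside <description>, as RSS feeds commonly do. The function names first_pass and second_pass are mine, not html2text's or rss2email's:

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical feed fragment for illustration; not taken from news.xml.
SAMPLE_RSS = (
    "<rss><channel><item>"
    "<description>&lt;p&gt;soci&amp;eacute;t&amp;eacute;&amp;nbsp;: "
    "vid&amp;eacute;o&lt;/p&gt;</description>"
    "</item></channel></rss>"
)


def first_pass(rss_text):
    """XML -> HTML: the XML parser undoes one level of escaping."""
    root = ET.fromstring(rss_text)
    return root.find("./channel/item/description").text


def second_pass(html_text):
    """Decode the HTML entities that remain (tags are left in place)."""
    return html.unescape(html_text)


html_fragment = first_pass(SAMPLE_RSS)
print(html_fragment)               # HTML with &eacute; and &nbsp; still encoded
print(second_pass(html_fragment))  # accented characters and a real no-break space
```

This only shows what I would expect each stage to produce; it says nothing about where html2text goes wrong.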
Looking further into this feed, I see other similar problems:
&eacute; which should produce "é" sometimes gives "e".
&egrave; which should produce "è" sometimes gives "e".
&ecirc; which should produce "ê" sometimes gives "e".
...
But this behavior is not consistent: I see words like "société" written
as "societe", and others like "vidéo" written as "video" but also as "vidéo".
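Since the behavior is inconsistent, a small check can help list exactly which words lost their accents. This is only a sketch with Python 3's standard library; the helper names strip_accents and lost_accents and the sample strings are hypothetical, and the comparison is a plain whitespace split, so punctuation attached to a word will hide it:

```python
import html
import unicodedata


def strip_accents(text):
    """Return text with combining accents removed (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def lost_accents(source_html, converted_text):
    """List expected words that appear only in unaccented form in the output."""
    expected_words = html.unescape(source_html).split()
    actual_words = set(converted_text.split())
    return [
        word
        for word in expected_words
        if word != strip_accents(word)           # the word carries an accent
        and word not in actual_words             # accented form is missing
        and strip_accents(word) in actual_words  # unaccented form is present
    ]


# Made-up input and output, mirroring the "societe" / "vidéo" case above.
source = "soci&eacute;t&eacute; vid&eacute;o"
converted = "societe vid\u00e9o"
print(lost_accents(source, converted))
```

Run against the real feed and the real html2text output, this should show whether the loss depends on the word or on where it appears.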
Steve