On 10/18/2014 09:08 PM, Stefano Rivera wrote:
Control: tag -1 upstream

Hi Steve (2014.10.04_07:54:18_-0700)
http://www.nextinpact.com/news/90246-twitch-et-valve-veulent-plus-transparence-sur-contenus-sponsorises.htm

Running it on that page doesn't seem to introduce any
&nbsp_place_holder; entities.

Can you find any HTML that reproduces this?

SR


Hi Stefano.

My first description was misleading, sorry.

It's the RSS feed of the website that should be parsed with html2text, as I use it through rss2email.

Steps to reproduce:
wget http://www.nextinpact.com/rss/news.xml
python /usr/share/pyshared/html2text.py news.xml > news.html
python /usr/share/pyshared/html2text.py news.html
The first pass converts XML to HTML and the second one converts HTML to plain text (I guess that's what rss2email does).

There you should see &nbsp_place_holder; entities.
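
To check the second-pass output without reading it by eye, something like the following could be used. It is only a minimal sketch: `find_placeholders` is a hypothetical helper (not part of html2text or rss2email), and it assumes the output has already been captured as a string.

```python
# Literal marker reported in this bug.
PLACEHOLDER = "&nbsp_place_holder;"


def find_placeholders(text):
    """Return (line_number, count) pairs for lines that still contain the marker.

    Hypothetical helper for this report; it only scans already-converted text.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        count = line.count(PLACEHOLDER)
        if count:
            hits.append((lineno, count))
    return hits
```

For example, after redirecting the second command to a file (say `news.txt`, a name chosen here for illustration), calling `find_placeholders(open("news.txt", encoding="utf-8").read())` would list the affected lines.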

Looking further in this feed I see other similar problems:
é, which should produce "é", sometimes gives "e".
è, which should produce "è", sometimes gives "e".
ê, which should produce "ê", sometimes gives "e".
...

But this behavior is not consistent: I see words like "société" written as "societe", and others like "vidéo" written as "video" but also as "vidéo".
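
A rough way to list such inconsistent words in the converted text is sketched below. This is an illustration only, using the standard `unicodedata` module; `strip_accents` and `inconsistent_words` are hypothetical names, and the approach assumes the output is available as a Unicode string.

```python
import re
import unicodedata
from collections import defaultdict


def strip_accents(word):
    """Return word with combining accents removed (e.g. "vidéo" -> "video")."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def inconsistent_words(text):
    """Map each unaccented base form to its spellings, when more than one occurs."""
    # Compose characters first so that "\w+" keeps accented letters inside a word.
    text = unicodedata.normalize("NFC", text).lower()
    variants = defaultdict(set)
    for word in re.findall(r"\w+", text):
        variants[strip_accents(word)].add(word)
    return {base: sorted(forms) for base, forms in variants.items() if len(forms) > 1}
```

The result is only a list of candidates: French has legitimate pairs such as "a" / "à" or "ou" / "où", so each hit still needs a manual look.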


Steve

_______________________________________________
Python-modules-team mailing list
Python-modules-team@lists.alioth.debian.org
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/python-modules-team
