1. What encoding does the BBC's RSS page use? UTF-8 or something else?
2. Confirm that your source file's encoding and your document's encoding match the RSS page's encoding.
3. Pass fromEncoding when you create your BeautifulSoup instance.
> e.g. soup = BeautifulSoup(html, fromEncoding="utf-8")
>
> 2010/7/16 Andy <chees...@titan.physx.u-szeged.hu>
>
> Dear Nice people
>
> I've been using Beautiful Soup to filter the BBC's RSS feed. However,
> recently the BBC have changed the feed and it is causing me problems with
> the pound (money) symbol. The initial error was "UnicodeEncodeError: 'ascii'
> codec can't encode character u'\xa3'", which means that the default encoding
> can't process this (Unicode) character. I was having similar problems with
> HTML characters appearing, but I used a simple regex system to
> remove/substitute them with something suitable.
>
> I tried applying the same approach and made a generic regex pattern
> (re.compile(u"""\u\[A-Fa-f0-9\]\{4\}""")) but this fails because it doesn't
> follow the standard pattern for ASCII. I'm not sure that I 100% understand
> the Unicode system, but is there a simple way to remove/substitute these
> non-ASCII strings?
>
> Thanks for any help!
>
> Andy
> _______________________________________________
> Tutor maillist - Tutor@python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor
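On the regex question: u'\xa3' is a single character in the decoded string, not the six-character text "\u00a3", so a pattern that looks for a backslash followed by hex digits has nothing to match. If you really do need ASCII-only output, here is a rough sketch using only the standard library. It assumes Python 3; to_ascii, the sample string and the "GBP " substitute are all made up for illustration and are not part of Beautiful Soup or the BBC feed.

```python
import unicodedata

POUND = "\xa3"  # the pound sign from the error message


def to_ascii(text, replacements=None):
    """Return an ASCII-only copy of text (hypothetical helper)."""
    # Assumption: "GBP " is an acceptable stand-in for the pound sign.
    if replacements is None:
        replacements = {POUND: "GBP "}
    for char, substitute in replacements.items():
        text = text.replace(char, substitute)
    # Split accented letters into base letter + combining mark,
    # then drop every character that is still outside ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Stand-in for bytes fetched from the feed, assumed to be UTF-8.
raw_bytes = "Price rises to \xa35 at caf\xe9".encode("utf-8")
text = raw_bytes.decode("utf-8")  # decode first, using the feed's encoding
ascii_text = to_ascii(text)
print(ascii_text)
```

Decoding with the feed's real encoding comes first; the substitution step only matters if whatever consumes the text cannot handle non-ASCII characters at all.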