Re: [Tutor] Removing GB pound symbols from Beautiful soup output

Ethan Wei Fri, 16 Jul 2010 08:31:51 -0700

1st. What encoding does the BBC's RSS page use? UTF-8 or something else?
2nd. Confirm that your file encoding and document encoding match the RSS page's encoding.
3rd. Add fromEncoding to your BeautifulSoup instance.


>  ex. soup = BeautifulSoup(html,fromEncoding="utf-8")
>
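
If the goal is only to substitute or drop the non-ASCII characters once the
text has been decoded correctly, something like the following might work. It
is a minimal sketch in Python 3 syntax using only the standard library; the
sample string and the "GBP " replacement are made up for illustration, and
UTF-8 is an assumption that should be checked against the real feed. (In the
newer bs4 package the keyword is spelled from_encoding rather than
fromEncoding.)

```python
import re

# Hypothetical sample standing in for the bytes fetched from the feed.
# UTF-8 is assumed here; check what the real feed declares.
raw = "Tickets cost \u00a35 each".encode("utf-8")

# Decode once, using the feed's encoding, to get a proper unicode string.
text = raw.decode("utf-8")

# Option 1: substitute the pound sign with an ASCII stand-in.
substituted = text.replace("\u00a3", "GBP ")

# Option 2: drop every character outside the ASCII range.
ascii_only = re.sub(r"[^\x00-\x7f]", "", text)

print(substituted)
print(ascii_only)
```

The regex matches the characters themselves rather than a literal "\uXXXX"
escape, since the escape only exists in the source code or repr, not in the
string.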


2010/7/16 Andy <chees...@titan.physx.u-szeged.hu>

> Dear Nice people
>
> I've been using beautiful soup to filter the BBC's rss feed. However,
> recently the BBC have changed the feed and it is causing me problems with
> the pound (money) symbol. The initial error was "UnicodeEncodeError: 'ascii'
> codec can't encode character u'\xa3'" which means that the default encoding
> can't process this (unicode) character. I was having similar problems with
> HTML characters appearing but I used a simple regex system to
> remove/substitute them to something suitable.
> I tried applying the same approach and made a generic regex pattern
> (re.compile(u"""\u\[A-Fa-f0-9\]\{4\}""") but this fails because it doesn't
> follow the standard pattern for ascii. I'm not sure that I 100% understand
> the unicode system but is there a simple way to remove/substitute these non
> ascii strings?
>
> Thanks for any help!
>
> Andy
> _______________________________________________
> Tutor maillist  -  Tutor@python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor
>

