In article <[email protected]>,
Filip  <[email protected]> wrote:
>
>I tried to fix that with BeautifulSoup + regexp filtering of some
>particular cases I encountered. That was slow, and after running my
>data scraper for a while, a lot of new problems (exceptions from the
>XPath parser) kept showing up. Not to mention that BeautifulSoup
>stripped almost all of the content from some heavily broken pages
>(a 50+KiB page reduced to a few hundred bytes). Character
>encoding conversion was hell too - even UTF-8 pages had some non-
>standard characters causing issues.

Have you tried lxml?
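
A minimal sketch of what that might look like, assuming the third-party
lxml package is installed (pip install lxml); the sample markup here is
made up for illustration:

```python
from lxml import html

# Deliberately malformed HTML: unclosed <p> and <b> tags, no </html>.
# Passing bytes lets lxml apply its own encoding detection rather than
# choking on a pre-decoded string.
broken = b"<html><body><p>Hello <b>world<p>second para</body>"

# lxml's HTML parser (libxml2 underneath) recovers from broken markup
# instead of raising, building a repaired element tree.
doc = html.fromstring(broken)

# XPath then works on the repaired tree as usual.
paragraphs = [p.text_content() for p in doc.xpath("//p")]
print(paragraphs)
```

The recovering parser tends to keep far more of a badly broken page than
an aggressive cleanup pass, and it's C-backed, so it's usually much
faster too.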
-- 
Aahz ([email protected])           <*>         http://www.pythoncraft.com/

"At Resolver we've found it useful to short-circuit any doubt and just        
refer to comments in code as 'lies'. :-)"
--Michael Foord paraphrases Christian Muirhead on python-dev, 2009-03-22
-- 
http://mail.python.org/mailman/listinfo/python-list