I'm trying to build a search engine in Python and am stuck at the point where I parse HTML to extract useful text. Ideally, one should be able to extract the text (stripped of HTML tags) along with each word's position (for phrase searches) and font size (to weight words appropriately).
However, this part gets very tedious (especially with bad HTML and CSS), and my code is already unwieldy. It seems to me that this task should be part of any semi-sophisticated Python-based screen scraper, and that it would be a commonly solved problem. Yet no amount of googling has turned up anything useful. Any ideas?
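
To make the requirement concrete, here is a rough sketch of the kind of output I'm after, using only the standard library's html.parser module. The names WordExtractor, extract_words and TAG_WEIGHTS are made up for illustration, and the tag-based weights are only a stand-in for real font sizes, which would need the CSS to be resolved (not attempted here).

```python
import re
from html.parser import HTMLParser

# Hypothetical weights: emphasis and heading tags stand in for font size.
# Real font sizes would require resolving CSS, which this sketch skips.
TAG_WEIGHTS = {"h1": 6, "h2": 5, "h3": 4, "h4": 3, "b": 2, "strong": 2}
SKIPPED = {"script", "style"}  # content that is never visible text


class WordExtractor(HTMLParser):
    """Collects (position, word, weight) tuples from visible text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.words = []      # (position, word, weight)
        self.open_tags = []  # currently open tags that carry a weight
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.open_tags:
            # Tolerate bad nesting: drop the most recent matching open tag.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        if self.skip_depth:
            return
        # A word takes the largest weight among its enclosing tags.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=1)
        for word in re.findall(r"\w+", data.lower()):
            # Position is the running word index, usable for phrase search.
            self.words.append((len(self.words), word, weight))


def extract_words(html):
    parser = WordExtractor()
    parser.feed(html)
    parser.close()
    return parser.words


if __name__ == "__main__":
    sample = "<h1>Hello World</h1><p>plain <b>bold</b> text</p>"
    for position, word, weight in extract_words(sample):
        print(position, word, weight)
```

Consecutive positions make phrase matching a matter of checking adjacent indices, and the weight can feed straight into ranking. This does nothing for inline styles or stylesheets, which is exactly the part that gets tedious.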
_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com