2009/1/13 Girish Redekar <girish.rede...@gmail.com>:
> I'm trying to build a search engine in Python and am stuck at the place where I
> parse HTML to get useful text. One should ideally be able to parse the text
> (out of HTML tags) along with its position (for phrase searches) and
> font-size (to weigh words appropriately).
>
> However, this part gets very tedious (especially with bad HTML and CSS) and
> my code is already unwieldy. It seems to me that this task should've been a
> part of any Python-based semi-sophisticated screen scraper and that it would
> be a commonly solved problem. Yet, no amount of googling has returned
> anything useful.
>
> Any ideas?
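
For the requirement quoted above (text, plus each word's position, plus some
font-size weight), here is a minimal sketch using only the standard library's
html.parser. Everything named in it is an assumption made for illustration:
the TAG_WEIGHTS table, the WeightedTextParser class and the extract_tokens
helper are hypothetical, and tag names are used as a rough stand-in for font
size, since real font sizes would need the CSS resolved as well.

```python
import re
from html.parser import HTMLParser

# Hypothetical weights standing in for font size; tune these for a real index.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "h3": 2.0, "b": 1.5, "strong": 1.5}
SKIPPED_TAGS = {"script", "style"}  # their contents are not visible text
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # never have a close tag
WORD_RE = re.compile(r"\w+")


class WeightedTextParser(HTMLParser):
    """Collects (word, position, weight) tuples from visible text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []  # stack of currently open tag names
        self.tokens = []     # (lowercased word, word index, weight)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Tolerate bad nesting: unwind to the most recent matching open tag,
        # and ignore close tags that were never opened.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(tag in SKIPPED_TAGS for tag in self.open_tags):
            return
        # The heaviest enclosing tag decides the weight; plain text gets 1.0.
        weight = max((TAG_WEIGHTS.get(tag, 1.0) for tag in self.open_tags), default=1.0)
        for match in WORD_RE.finditer(data):
            self.tokens.append((match.group().lower(), len(self.tokens), weight))


def extract_tokens(html):
    """Return a list of (word, position, weight) for one HTML document."""
    parser = WeightedTextParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    return parser.tokens


if __name__ == "__main__":
    for token in extract_tokens("<h1>Big News</h1><p>plain <b>bold</b> text"):
        print(token)
```

The position is just a running word index, which is enough for phrase
matching (adjacent words have consecutive positions). One known weakness of
this sketch: an unclosed heading tag keeps boosting everything after it, so
really broken markup would still need extra handling.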

I wrote this article a while back:
http://www.ibm.com/developerworks/aix/library/au-threadingpython/

I didn't fully explore it, but it seems like thread pools and Beautiful Soup
could work...

> _______________________________________________
> Web-SIG mailing list
> Web-SIG@python.org
> Web SIG: http://www.python.org/sigs/web-sig
> Unsubscribe:
> http://mail.python.org/mailman/options/web-sig/noah.gift%40gmail.com