2009/1/12 Girish Redekar:
> I'm trying to build a search engine in python am stuck at the place where I
> parse HTML to get useful text. One should ideally be able to parse the text
> (out of HTML tags) along with its position (for phrase searches) and
> font-size (to weigh words appropriately).

Have a look at html5lib for HTML parsing:
http://code.google.com/p/html5lib

It builds on the HTML5 parsing rules, which are compatible with how the
four most-used browsers (IE, Firefox, Safari and Opera) actually parse
HTML as of now. Where those browsers do not parse HTML exactly the same
way, the algorithm generally picks the "least illogical" behavior.

The result can be an html5lib-specific tree (SimpleTree), a BeautifulSoup
tree, an ElementTree/lxml tree, or a minidom document. This means that,
for instance, you can replace your BeautifulSoup parsing code with
html5lib and keep the processing code as-is.

For font-size, however, you'd have to parse and "apply" CSS, and I have
no solution at hand for that (though I don't really understand the use
case either, actually...).

-- 
Thomas Broyer
_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
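
To make the "text along with its position" part concrete, here is a minimal sketch, not a definitive implementation. It assumes an html5lib release that provides `html5lib.parse` with the `treebuilder` and `namespaceHTMLElements` arguments, and it asks for an ElementTree result. The helper names (`parse_html`, `iter_visible_text`, `word_positions`) are made up for illustration, and the tokenizer is a naive `\w+` split. The processing code only touches the ElementTree API, so it does not care which parser built the tree.

```python
import re
import xml.etree.ElementTree as ET

try:
    import html5lib  # third-party package: pip install html5lib
except ImportError:  # the tree-walking helpers below still work without it
    html5lib = None

# Elements whose text is never shown to the reader (an assumption of this sketch).
SKIPPED = {"script", "style"}
WORD = re.compile(r"\w+")


def parse_html(markup):
    """Parse possibly broken HTML into an ElementTree element via html5lib."""
    if html5lib is None:
        raise RuntimeError("html5lib is not installed")
    # namespaceHTMLElements=False keeps tags as plain "p", "b", ... names.
    return html5lib.parse(markup, treebuilder="etree",
                          namespaceHTMLElements=False)


def iter_visible_text(element):
    """Yield text chunks in document order, skipping comments, scripts, styles."""
    if not isinstance(element.tag, str) or element.tag in SKIPPED:
        return  # the parent still yields this element's tail text
    if element.text:
        yield element.text
    for child in element:
        yield from iter_visible_text(child)
        if child.tail:
            yield child.tail


def word_positions(element):
    """Return (position, lowercased word) pairs for a phrase-search index."""
    words = []
    for chunk in iter_visible_text(element):
        words.extend(word.lower() for word in WORD.findall(chunk))
    return list(enumerate(words))
```

With html5lib installed, `word_positions(parse_html(page_source))` gives the pairs to feed into a positional index, where `page_source` is whatever markup you fetched. Font size is deliberately left out, since that needs CSS to be parsed and applied, which this sketch does not attempt.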