Thanks Noah. Beautiful Soup does give a usable tree; however, getting from the tree to the result I want is still a long way off.
I'm using lxml (for speed reasons), and it also returns a tree similar to Beautiful Soup's. I have even got as far as parsing the CSS and getting the attributes for each text element. However, getting from there to a simple list of the form

[ (word1, fontsize1, position1), (word2, fontsize2, position2), (word3, fontsize3, position3) ... ]

is still tedious, because font sizes in HTML/CSS can be expressed in multiple ways (<FONT> tags, sizes in pixels, relative sizes, larger default sizes for headers, etc.). One can get down and code each of these cases, but I was hoping someone has already (and reliably) worked on the same problem.

Thanks,
Girish

On Mon, Jan 12, 2009 at 4:59 PM, Noah Gift <noah.g...@gmail.com> wrote:

> 2009/1/13 Girish Redekar <girish.rede...@gmail.com>:
> > I'm trying to build a search engine in Python and am stuck at the place
> > where I parse HTML to get useful text. One should ideally be able to parse
> > the text (out of HTML tags) along with its position (for phrase searches)
> > and font-size (to weigh words appropriately).
> >
> > However, this part gets very tedious (especially with bad HTML and CSS) and
> > my code is already unwieldy. It seems to me that this task should've been a
> > part of any Python-based semi-sophisticated screen scraper and that it would
> > be a commonly solved problem. Yet, no amount of googling has returned
> > anything useful.
> >
> > Any ideas?
>
> I wrote this article a while back:
>
> http://www.ibm.com/developerworks/aix/library/au-threadingpython/
>
> I didn't fully explore it, but it seems like thread pools and
> Beautiful Soup could work...
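
To make the target list above concrete, here is a rough sketch of one way to produce (word, fontsize, position) tuples. It is not an existing library's solution: it uses the standard library's html.parser instead of lxml so that it runs without extra installs, it only looks at <FONT size=...>, inline style="font-size: ..." (px, em, %) and heading tags, and it ignores external stylesheets entirely. The names WordSizeParser and extract_words, the 16px base size, and the pixel and scale tables are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

BASE_PX = 16.0  # assumed default body font size
# Assumed scale factors for headings, relative to the parent's size.
HEADING_SCALE = {"h1": 2.0, "h2": 1.5, "h3": 1.17, "h4": 1.0, "h5": 0.83, "h6": 0.67}
# Assumed pixel sizes for <font size="1"> .. <font size="7">.
FONT_TAG_PX = {1: 10.0, 2: 13.0, 3: 16.0, 4: 18.0, 5: 24.0, 6: 32.0, 7: 48.0}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # never have a close tag
SKIP_TAGS = {"script", "style"}  # their text is not page content
SIZE_RE = re.compile(r"font-size\s*:\s*(\d+(?:\.\d+)?)\s*(px|em|%)", re.I)


class WordSizeParser(HTMLParser):
    """Collects (word, font_size_px, position) while tracking inherited size."""

    def __init__(self):
        super().__init__()
        self.stack = [("#root", BASE_PX)]  # (tag, effective size) per open element
        self.words = []

    def _resolve(self, tag, attrs, parent_px):
        attrs = dict(attrs)
        px = parent_px
        if tag in HEADING_SCALE:
            px = parent_px * HEADING_SCALE[tag]
        if tag == "font" and attrs.get("size"):
            size = attrs["size"].strip()
            try:
                # "+2" / "-1" are relative to the default level 3.
                level = 3 + int(size) if size[:1] in ("+", "-") else int(size)
                px = FONT_TAG_PX[max(1, min(7, level))]
            except ValueError:
                pass  # malformed size attribute: keep the inherited size
        match = SIZE_RE.search(attrs.get("style") or "")
        if match:
            value, unit = float(match.group(1)), match.group(2).lower()
            if unit == "px":
                px = value
            elif unit == "em":
                px = parent_px * value
            else:  # percent
                px = parent_px * value / 100
        return px

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, self._resolve(tag, attrs, self.stack[-1][1])))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray close tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        tag, px = self.stack[-1]
        if tag in SKIP_TAGS:
            return
        for word in data.split():
            # Position is the running word index, usable for phrase searches.
            self.words.append((word, px, len(self.words)))


def extract_words(html):
    parser = WordSizeParser()
    parser.feed(html)
    parser.close()
    return parser.words
```

Calling extract_words('<h1>Big</h1><p>plain <font size="5">loud</font></p>') yields one tuple per word with the size inherited from the enclosing tags. Unclosed tags (for example a <p> that is never closed) stay on the stack until an ancestor closes, so badly broken markup would still need extra handling, as would sizes set through stylesheets.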
_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com