Re: [Web-SIG] HTML parsing - get text position and font size
2009/1/13 Girish Redekar girish.rede...@gmail.com:
> I'm trying to build a search engine in Python and am stuck at the place where I parse HTML to get useful text. One should ideally be able to parse the text (out of HTML tags) along with its position (for phrase searches) and font size (to weight words appropriately). However, this part gets very tedious (especially with bad HTML and CSS) and my code is already unwieldy. It seems to me that this task should have been a part of any semi-sophisticated Python-based screen scraper and that it would be a commonly solved problem. Yet no amount of googling has returned anything useful. Any ideas?

I wrote this article a while back: http://www.ibm.com/developerworks/aix/library/au-threadingpython/

I didn't fully explore it, but it seems like thread pools and Beautiful Soup could work...

___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/noah.gift%40gmail.com
Re: [Web-SIG] HTML parsing - get text position and font size
Thanks Noah - Beautiful Soup does give a tree that can be used; however, getting from the tree to the result I want is still a long way. I'm using lxml (for speed concerns), and it also returns a tree similar to BS. I have even got as far as parsing the CSS and getting the attributes for each text element. However, getting from here to a simple list of the form:

[ (word1, fontsize1, position1), (word2, fontsize2, position2), (word3, fontsize3, position3) ... ]

is still tedious, as font sizes in HTML/CSS can be expressed in multiple ways (FONT tags, sizes in pixels, relative sizes, larger default sizes for headers, etc.). One can get down and code each of these cases, but I was hoping someone has already (and reliably) worked on the same.

Thanks,
Girish

On Mon, Jan 12, 2009 at 4:59 PM, Noah Gift noah.g...@gmail.com wrote:
> 2009/1/13 Girish Redekar girish.rede...@gmail.com:
>> I'm trying to build a search engine in Python and am stuck at the place where I parse HTML to get useful text. One should ideally be able to parse the text (out of HTML tags) along with its position (for phrase searches) and font size (to weight words appropriately). However, this part gets very tedious (especially with bad HTML and CSS) and my code is already unwieldy. It seems to me that this task should have been a part of any semi-sophisticated Python-based screen scraper and that it would be a commonly solved problem. Yet no amount of googling has returned anything useful. Any ideas?
>
> I wrote this article a while back: http://www.ibm.com/developerworks/aix/library/au-threadingpython/
>
> I didn't fully explore it, but it seems like thread pools and Beautiful Soup could work...
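A minimal sketch of how the (word, fontsize, position) list described above might be produced. It is not a full CSS implementation: it uses only the standard-library html.parser (the same walk could be applied to an lxml or Beautiful Soup tree), ignores stylesheets entirely, and handles just FONT tags, inline px/em/% sizes, and heading defaults. The names extract_words and WordSizeExtractor, the 16px base size, and the scale tables are assumptions for illustration, loosely modelled on typical browser defaults.

```python
import re
from html.parser import HTMLParser

BASE_PX = 16.0  # assumed default body font size, in pixels
# Assumed scale factors for tags that change text size by default.
TAG_SCALE = {"h1": 2.0, "h2": 1.5, "h3": 1.17, "big": 1.2, "small": 0.8}
# Assumed mapping of legacy <font size="1..7"> values to pixels.
FONT_TAG_PX = {1: 10, 2: 13, 3: 16, 4: 18, 5: 24, 6: 32, 7: 48}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
SIZE_RE = re.compile(r"font-size\s*:\s*([\d.]+)\s*(px|em|%)", re.I)


class WordSizeExtractor(HTMLParser):
    """Collects (word, font size in px, position) tuples from inline markup only."""

    def __init__(self):
        super().__init__()
        self.stack = [("#root", BASE_PX)]  # (tag, size inherited by its children)
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never contain text
        attrs = dict(attrs)
        parent = self.stack[-1][1]
        size = parent * TAG_SCALE.get(tag, 1.0)
        # Legacy <font size="N"> markup.
        if tag == "font" and (attrs.get("size") or "").isdigit():
            size = float(FONT_TAG_PX.get(int(attrs["size"]), size))
        # An inline style overrides the tag default; em and % are relative to the parent.
        match = SIZE_RE.search(attrs.get("style") or "")
        if match:
            value, unit = float(match.group(1)), match.group(2).lower()
            if unit == "px":
                size = value
            elif unit == "em":
                size = parent * value
            else:
                size = parent * value / 100
        self.stack.append((tag, size))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if any(tag in ("script", "style") for tag, _ in self.stack):
            return  # not visible text
        size = self.stack[-1][1]
        for word in re.findall(r"\w+", data):
            self.words.append((word, size, len(self.words)))


def extract_words(html):
    parser = WordSizeExtractor()
    parser.feed(html)
    parser.close()
    return parser.words
```

Here position is simply the running word index, which is enough for phrase matching. Sizes that come from external or embedded stylesheets would still need a real CSS cascade, which is the tedious part the message refers to.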
Re: [Web-SIG] HTML parsing - get text position and font size
2009/1/12 Girish Redekar girish.rede...@gmail.com:
> is still tedious, as font sizes in HTML/CSS can be expressed in multiple ways (FONT tags, sizes in pixels, relative sizes, larger default sizes for headers, etc.). One can get down and code each of these cases, but I was hoping someone has already (and reliably) worked on the same.

So basically you want a full-on headless browser? Pretty non-trivial. Your best bet would probably be to hook into a Mozilla instance somehow (PyXPCOM, anyone?) and try to read the styles from the DOM there.

Cheers,
Dirkjan
Re: [Web-SIG] HTML parsing - get text position and font size
2009/1/12 Girish Redekar:
> I'm trying to build a search engine in Python and am stuck at the place where I parse HTML to get useful text. One should ideally be able to parse the text (out of HTML tags) along with its position (for phrase searches) and font size (to weight words appropriately).

Have a look at html5lib for HTML parsing: http://code.google.com/p/html5lib

It builds on the HTML5 parsing rules, which are compatible with how the four most used browsers (IE, Firefox, Safari and Opera) actually parse HTML as of now. Since those browsers do not parse HTML in exactly the same way, the algorithm generally picks the least illogical behaviour where they differ. The result can be either an html5lib-specific tree (SimpleTree) or a BeautifulSoup, ElementTree/lxml or minidom tree. This means that, for instance, you can replace your BeautifulSoup parsing code with html5lib and keep the processing code as-is.

However, for font size you'd have to parse and apply CSS, and for this I have no solution at hand (but I don't really understand the use case either, actually...)

--
Thomas Broyer
Re: [Web-SIG] HTML parsing - get text position and font size
Girish Redekar wrote:
> I'm trying to build a search engine in Python and am stuck at the place where I parse HTML to get useful text. One should ideally be able to parse the text (out of HTML tags) along with its position (for phrase searches) and font size (to weight words appropriately).

Word weighting should be based on semantics, not style.

However, if you really need it, there is the cssutils package for CSS parsing. I'm writing a CSS parser, too: http://hg.mperillo.ath.cx/pdfimg/file/tip/pdfimg/style/css/ It uses PLY, so it should be easy to read and modify. It is still at a very early stage.

[...]

Regards
Manlio Perillo
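To illustrate the "semantics, not style" suggestion, here is a rough sketch that weights each word by the most significant semantic element enclosing it, with no CSS involved. It uses only the standard-library html.parser. The class name SemanticWeigher, the function weigh_words, and the weight values are hypothetical and would need tuning for a real ranking function.

```python
import re
from html.parser import HTMLParser

# Hypothetical weights per semantic tag; anything else counts as 1.0.
TAG_WEIGHT = {"title": 5.0, "h1": 4.0, "h2": 3.0, "h3": 2.0,
              "strong": 1.5, "b": 1.5, "em": 1.2}
SKIP_TAGS = ("script", "style")


class SemanticWeigher(HTMLParser):
    """Collects (word, weight, position) tuples based on enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # currently open tags that carry a weight
        self.skip = 0        # depth inside <script>/<style>
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip += 1
        elif tag in TAG_WEIGHT:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip = max(0, self.skip - 1)
        elif tag in self.open_tags:
            # Close the most recent matching tag and anything left open inside it.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx:]

    def handle_data(self, data):
        if self.skip:
            return
        # The strongest enclosing tag wins; plain text gets weight 1.0.
        weight = max((TAG_WEIGHT[t] for t in self.open_tags), default=1.0)
        for word in re.findall(r"\w+", data):
            self.words.append((word.lower(), weight, len(self.words)))


def weigh_words(html):
    parser = SemanticWeigher()
    parser.feed(html)
    parser.close()
    return parser.words
```

Taking the maximum over open tags is one possible design choice; multiplying nested weights would be another. Either way the result does not depend on how a page happens to be styled.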