Re: [Web-SIG] HTML parsing - get text position and font size

2009-01-12 Thread Manlio Perillo
Girish Redekar wrote: I'm trying to build a search engine in Python and am stuck at the place where I parse HTML to get useful text. One should ideally be able to parse the text (out of HTML tags) along with its position (for phrase searches) and font-size (to weigh words appropriately). W

Re: [Web-SIG] HTML parsing - get text position and font size

2009-01-12 Thread Thomas Broyer
2009/1/12 Girish Redekar: > I'm trying to build a search engine in Python and am stuck at the place where I parse HTML to get useful text. One should ideally be able to parse the text (out of HTML tags) along with its position (for phrase searches) and font-size (to weigh words appropriately). H

Re: [Web-SIG] HTML parsing - get text position and font size

2009-01-12 Thread Dirkjan Ochtman
2009/1/12 Girish Redekar: > is still tedious as font sizes in HTML/CSS can be expressed in multiple ways (like tags, sizes in pixels, relative sizes, default larger size for headers, etc.). One can get down and code each of these cases, but I was hoping someone has already (and reliably) wo
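
To illustrate the cases listed in the quoted text (pixel sizes, relative sizes, larger defaults for headers), here is a minimal sketch of normalizing a font-size value to pixels. Everything in it is an assumption for illustration, not something posted in the thread: the function names font_size_to_px and heading_px are hypothetical, the 16px base and the keyword table follow common browser defaults, and the 1.2 step for "larger"/"smaller" is a typical but not mandated ratio.

```python
import re

# Common browser defaults for CSS absolute-size keywords (an assumption;
# actual values depend on the user agent and its settings).
KEYWORD_PX = {
    "xx-small": 9.0, "x-small": 10.0, "small": 13.0, "medium": 16.0,
    "large": 18.0, "x-large": 24.0, "xx-large": 32.0,
}

# Typical default heading sizes, expressed in em relative to the parent.
HEADING_EM = {"h1": 2.0, "h2": 1.5, "h3": 1.17, "h4": 1.0, "h5": 0.83, "h6": 0.67}

# A number followed by one of the units this sketch understands.
_LENGTH = re.compile(r"^([0-9]*\.?[0-9]+)(px|pt|em|%)$")


def font_size_to_px(value, parent_px=16.0):
    """Convert one CSS font-size value to pixels (hypothetical helper).

    Unknown or unsupported values fall back to the parent size.
    """
    value = value.strip().lower()
    if value in KEYWORD_PX:
        return KEYWORD_PX[value]
    if value == "larger":
        return parent_px * 1.2  # assumed scaling step
    if value == "smaller":
        return parent_px / 1.2  # assumed scaling step
    match = _LENGTH.match(value)
    if match is None:
        return parent_px
    number, unit = float(match.group(1)), match.group(2)
    if unit == "px":
        return number
    if unit == "pt":
        return number * 96.0 / 72.0  # CSS defines 1pt as 1/72in, 1in as 96px
    if unit == "em":
        return number * parent_px
    return number / 100.0 * parent_px  # percentage of the parent size


def heading_px(tag, parent_px=16.0):
    """Default size for a heading tag when no CSS overrides it."""
    return HEADING_EM.get(tag.lower(), 1.0) * parent_px
```

A real page would also need the cascade resolved (which rule applies to which element) before a helper like this can be called with the right parent size; that part is not sketched here.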

Re: [Web-SIG] HTML parsing - get text position and font size

2009-01-12 Thread Girish Redekar
Thanks Noah - Beautiful Soup does give a tree that can be used - however, getting from the tree to the result I desire is still a long way. I'm using lxml (for speed concerns) and it also returns a tree similar to BS. I have even got as far as parsing the css and getting the attributes for each t

Re: [Web-SIG] HTML parsing - get text position and font size

2009-01-12 Thread Noah Gift
2009/1/13 Girish Redekar: > I'm trying to build a search engine in Python and am stuck at the place where I parse HTML to get useful text. One should ideally be able to parse the text (out of HTML tags) along with its position (for phrase searches) and font-size (to weigh words appropriately). >

[Web-SIG] HTML parsing - get text position and font size

2009-01-12 Thread Girish Redekar
I'm trying to build a search engine in Python and am stuck at the place where I parse HTML to get useful text. One should ideally be able to parse the text (out of HTML tags) along with its position (for phrase searches) and font-size (to weigh words appropriately). However, this part gets very tediou
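
As a rough illustration of what the question asks for (text out of the tags, with a word position for phrase searches and a size-based weight), here is a minimal sketch using only the standard library's html.parser. It is not the approach anyone in the thread posted: the class WeightedTextParser, the function extract_words, and the TAG_WEIGHT table are hypothetical, the weights are made-up multipliers keyed on tag names only, and CSS is ignored entirely.

```python
from html.parser import HTMLParser

# Hypothetical weight multipliers per tag; CSS rules are not considered.
TAG_WEIGHT = {"h1": 2.0, "h2": 1.5, "h3": 1.17, "big": 1.2, "small": 0.8}

# Elements whose text content should never be indexed.
SKIPPED = {"script", "style"}


class WeightedTextParser(HTMLParser):
    """Collect (word, position, weight) triples from an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open weighted tags as (tag, effective weight)
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.words = []

    def current_weight(self):
        return self.stack[-1][1] if self.stack else 1.0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in TAG_WEIGHT:
            # Nested weighted tags multiply, e.g. <h1><small>..</small></h1>.
            self.stack.append((tag, self.current_weight() * TAG_WEIGHT[tag]))

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in TAG_WEIGHT:
            # Pop back to the nearest matching open tag; this tolerates
            # mis-nested markup instead of raising.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i][0] == tag:
                    del self.stack[i:]
                    break

    def handle_data(self, data):
        if self.skip_depth:
            return
        for word in data.split():
            # Position is the running word index across the whole document.
            self.words.append((word.lower(), len(self.words), self.current_weight()))


def extract_words(html):
    parser = WeightedTextParser()
    parser.feed(html)
    parser.close()
    return parser.words
```

Calling extract_words on a page returns one triple per word, in document order, so adjacent positions can be compared for phrase matching and the weight can feed into ranking. Splitting on whitespace leaves punctuation attached to words; a real indexer would want a proper tokenizer.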