Girish Redekar wrote:
I'm trying to build a search engine in Python and am stuck at the place
where I parse HTML to get useful text. Ideally, one should be able to
extract the text (outside the HTML tags) along with its position (for
phrase searches) and its font size (to weight words appropriately).
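
To make the requirement concrete, here is a minimal sketch of one possible approach, not something taken from the thread. It uses the standard-library html.parser module to collect each word with its running position and a rough weight. The weight is only a stand-in for font size: the TAG_WEIGHTS table, the WeightedTextExtractor class and the extract_words helper are hypothetical names invented for this illustration, and the sketch ignores CSS entirely.

```python
from html.parser import HTMLParser
import re

# Hypothetical weights standing in for font size (an assumption, not real
# browser defaults); replace with values derived from actual CSS if needed.
TAG_WEIGHTS = {"h1": 6, "h2": 5, "h3": 4, "h4": 3, "b": 2, "strong": 2}
DEFAULT_WEIGHT = 1
SKIP_TAGS = {"script", "style"}                    # content is never visible text
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never have an end tag


class WeightedTextExtractor(HTMLParser):
    """Collect (position, word, weight) triples from an HTML document."""

    def __init__(self):
        super().__init__()
        self.words = []       # list of (position, word, weight)
        self._open_tags = []  # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching tag; this tolerates
        # sloppy markup where inner tags were never closed.
        if tag in self._open_tags:
            while self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIP_TAGS for t in self._open_tags):
            return
        # The heaviest enclosing tag decides the weight of the text.
        weight = max(
            (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self._open_tags),
            default=DEFAULT_WEIGHT,
        )
        for word in re.findall(r"\w+", data):
            # Position is the word's index in document order,
            # which is what a phrase search needs.
            self.words.append((len(self.words), word.lower(), weight))


def extract_words(html):
    """Return a list of (position, word, weight) triples for the given HTML."""
    parser = WeightedTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.words


if __name__ == "__main__":
    sample = "<body><h1>Big Title</h1><p>plain <b>bold</b> text</p></body>"
    for position, word, weight in extract_words(sample):
        print(position, word, weight)
```

Consecutive positions let a phrase search check that two words are adjacent, and the weight can feed into ranking. Mapping real font sizes (pixels, relative sizes, inherited CSS rules) onto that weight is the hard part and is not attempted here.
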
W
2009/1/12 Girish Redekar:
> I'm trying to build a search engine in Python and am stuck at the place where I
> parse HTML to get useful text. One should ideally be able to parse the text
> (out of HTML tags) along with its position (for phrase searches) and
> font-size (to weigh words appropriately).
H
2009/1/12 Girish Redekar:
> is still tedious as font sizes in html/css can be expressed in multiple
> methods (like tags, sizes in pixels, relative sizes, default larger
> size for header etc). One can get down and code each of these cases, but I
> was hoping someone has already (and reliably) wo
Thanks Noah. Beautiful Soup does give a tree that can be used; however,
getting from the tree to the result I want is still a long way.
I'm using lxml (for speed concerns) and it also returns a tree similar to
BS. I have even got as far as parsing the CSS and getting the attributes for
each t
2009/1/13 Girish Redekar:
> I'm trying to build a search engine in Python and am stuck at the place where I
> parse HTML to get useful text. One should ideally be able to parse the text
> (out of HTML tags) along with its position (for phrase searches) and
> font-size (to weigh words appropriately).
>
I'm trying to build a search engine in Python and am stuck at the place where I
parse HTML to get useful text. One should ideally be able to parse the text
(out of HTML tags) along with its position (for phrase searches) and
font-size (to weigh words appropriately).
However, this part gets very tediou