Re: [Web-SIG] HTML parsing - get text position and font size

2009-01-12 Thread Noah Gift
2009/1/13 Girish Redekar girish.rede...@gmail.com:
 I'm trying to build a search engine in Python and am stuck at the place where I
 parse HTML to get useful text. One should ideally be able to parse the text
 (out of HTML tags) along with its position (for phrase searches) and
 font-size (to weigh words appropriately).

 However, this part gets very tedious (especially with bad html and css) and
 my code is already unwieldy. It seems to me that this task should've been a
 part of any python based semi-sophisticated screen scraper and that it would
 be a commonly solved problem. Yet, no amount of googling has returned
 anything useful.

 Any ideas?

I wrote this article a while back:

http://www.ibm.com/developerworks/aix/library/au-threadingpython/

I didn't fully explore it, but it seems like thread pools and
Beautiful Soup could work...


 ___
 Web-SIG mailing list
 Web-SIG@python.org
 Web SIG: http://www.python.org/sigs/web-sig
 Unsubscribe:
 http://mail.python.org/mailman/options/web-sig/noah.gift%40gmail.com


___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: 
http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com


Re: [Web-SIG] HTML parsing - get text position and font size

2009-01-12 Thread Girish Redekar
Thanks Noah - Beautiful Soup does give a tree that can be used. However,
getting from the tree to the result I desire is still a long way off.

I'm using lxml (for speed concerns) and it also returns a tree similar to BS.
I have even got as far as parsing the CSS and getting the attributes for
each text element. However, getting from here to a simple list of the form:
[ (word1, fontsize1, position1), (word2, fontsize2, position2), (word3,
fontsize3, position3) ... ]
is still tedious, as font sizes in HTML/CSS can be expressed in multiple
ways (FONT tags, sizes in pixels, relative sizes, larger default sizes
for headers, etc.). One can get down and code each of these cases, but I
was hoping someone had already (and reliably) worked on the same.

Thanks,
Girish
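
A minimal sketch of the (word, fontsize, position) list described above,
using only the standard library's html.parser rather than lxml. Everything
numeric here is an assumption made for illustration: the 16px base size, the
per-tag scale factors, and the mapping for the legacy FONT size attribute.
Position is taken to mean the word's index in document order. Only inline
"font-size: Npx" styles are read; stylesheets, relative units and relative
FONT sizes such as "+1" are not handled.

```python
from html.parser import HTMLParser
import re

BASE_PX = 16.0  # assumed default size; not taken from any stylesheet

# Assumed scale factors for elements that change the size by default.
TAG_SCALE = {"h1": 2.0, "h2": 1.5, "h3": 1.17, "small": 0.83}

# Assumed pixel sizes for the legacy <font size="1..7"> attribute.
FONT_ATTR_PX = {1: 10.0, 2: 13.0, 3: 16.0, 4: 18.0, 5: 24.0, 6: 32.0, 7: 48.0}

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # no end tag
SKIP_TAGS = {"script", "style"}  # their content is not page text

PX_IN_STYLE = re.compile(r"font-size\s*:\s*(\d+(?:\.\d+)?)px", re.I)


class WordCollector(HTMLParser):
    """Collects (word, font size in px, word index) tuples."""

    def __init__(self):
        super().__init__()
        self.stack = [("#root", BASE_PX)]  # (tag, effective size in px)
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        # Start from the parent's size, scaled by the tag's assumed default.
        size = self.stack[-1][1] * TAG_SCALE.get(tag, 1.0)
        if tag == "font" and (attrs.get("size") or "").isdigit():
            size = FONT_ATTR_PX.get(int(attrs["size"]), size)
        # An inline pixel size overrides everything above.
        match = PX_IN_STYLE.search(attrs.get("style") or "")
        if match:
            size = float(match.group(1))
        self.stack.append((tag, size))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        tag, size = self.stack[-1]
        if tag in SKIP_TAGS:
            return
        for word in data.split():
            self.words.append((word, size, len(self.words)))


def extract_words(html):
    parser = WordCollector()
    parser.feed(html)
    parser.close()
    return parser.words
```

Keeping a stack of effective sizes means each text node only needs to look at
the innermost open element, which tolerates some unclosed tags but is far from
what a browser does with badly broken markup.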





Re: [Web-SIG] HTML parsing - get text position and font size

2009-01-12 Thread Dirkjan Ochtman
2009/1/12 Girish Redekar girish.rede...@gmail.com:
 is still tedious as font sizes in html/css can be expressed in multiple
 methods (like FONT tags, sizes in pixels, relative sizes, default larger
 size for header etc). One can get down and code each of these cases, but I
 was hoping someone has already (and reliably) worked on the same

So basically you want a full-on headless browser? Pretty non-trivial.

Your best bet would probably be to hook into a Mozilla instance
somehow (PyXPCOM, anyone?) and try to read the styles from the DOM
there.

Cheers,

Dirkjan


Re: [Web-SIG] HTML parsing - get text position and font size

2009-01-12 Thread Thomas Broyer
2009/1/12 Girish Redekar:
 I'm trying to build a search engine in Python and am stuck at the place where I
 parse HTML to get useful text. One should ideally be able to parse the text
 (out of HTML tags) along with its position (for phrase searches) and
 font-size (to weigh words appropriately).

Have a look at html5lib for HTML parsing: http://code.google.com/p/html5lib
It builds on the HTML5 parsing rules, which are compatible with how
the four most used browsers (IE, Firefox, Safari and Opera) actually
parse HTML as of now (where those browsers do not parse HTML exactly the
same way, the algorithm generally follows the least illogical behavior).
The result can be either an html5lib-specific tree (SimpleTree) or a
BeautifulSoup, ElementTree/lxml or minidom tree. This means that, for
instance, you can replace your BeautifulSoup parsing code with
html5lib and keep the processing code as-is.
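
A rough sketch of that separation, assuming the ElementTree output is the one
chosen: the processing function below touches only the ElementTree API, so it
does not care which parser built the tree. The function names are made up for
illustration. If html5lib is not installed, the sketch falls back to the
standard library's XML parser, which only accepts well-formed markup and is
not a substitute for html5lib's error recovery.

```python
import xml.etree.ElementTree as ET

try:
    import html5lib  # third-party; optional for this sketch
except ImportError:
    html5lib = None


def parse_to_etree(markup):
    """Parsing step: returns the root ElementTree element."""
    if html5lib is not None:
        # Tolerant parse following the HTML5 algorithm, with plain
        # (non-namespaced) tag names.
        return html5lib.parse(markup, treebuilder="etree",
                              namespaceHTMLElements=False)
    # Fallback for well-formed input only; it does not repair bad HTML.
    return ET.fromstring(markup)


def words_with_positions(root):
    """Processing step: depends only on the ElementTree API."""
    # Joining with a space keeps words in adjacent block elements apart,
    # at the cost of splitting a word that spans an inline tag.
    text = " ".join(root.itertext())
    return [(word, pos) for pos, word in enumerate(text.split())]
```

Swapping the parser then means changing parse_to_etree only.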

However, for font-size, you'd have to parse and apply CSS and for
this I have no solution at hand (but I don't really understand the
use-case either actually...)

-- 
Thomas Broyer


Re: [Web-SIG] HTML parsing - get text position and font size

2009-01-12 Thread Manlio Perillo

Girish Redekar wrote:
I'm trying to build a search engine in Python and am stuck at the place 
where I parse HTML to get useful text. One should ideally be able to 
parse the text (out of HTML tags) along with its position (for phrase 
searches) and font-size (to weigh words appropriately).




Word weighting should be based on semantics, not style.

However, if you really need it, for CSS parsing, there is the cssutils package.
I'm writing a CSS parser, too:
http://hg.mperillo.ath.cx/pdfimg/file/tip/pdfimg/style/css/

using PLY, so it should be easy to read/modify.
It is still at a very early stage.
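
Whichever parser extracts the declarations, the font-size values still have
to be reduced to one unit before they can be compared. A minimal sketch of
that step follows; it is independent of cssutils and of the parser linked
above. The keyword table, the 1.2 step for "larger"/"smaller", and the 16px
default parent size are assumptions chosen for illustration, and unknown
values simply inherit the parent size.

```python
import re

# Assumed pixel values for the CSS absolute-size keywords, based on a
# 16px "medium"; real browsers may differ.
KEYWORD_PX = {
    "xx-small": 9.0, "x-small": 10.0, "small": 13.0, "medium": 16.0,
    "large": 18.0, "x-large": 24.0, "xx-large": 32.0,
}

LENGTH = re.compile(r"^([0-9]*\.?[0-9]+)(px|pt|em|%)$")


def font_size_px(value, parent_px=16.0):
    """Convert one CSS font-size value to pixels.

    parent_px is the already-resolved size of the parent element, which
    relative units and keywords are measured against.
    """
    value = value.strip().lower()
    if value in KEYWORD_PX:
        return KEYWORD_PX[value]
    if value == "larger":
        return parent_px * 1.2  # assumed step
    if value == "smaller":
        return parent_px / 1.2  # assumed step
    match = LENGTH.match(value)
    if match is None:
        return parent_px  # unrecognized value: inherit from the parent
    number, unit = float(match.group(1)), match.group(2)
    if unit == "px":
        return number
    if unit == "pt":
        return number * 96.0 / 72.0  # CSS defines 1pt as 1/72in at 96px/in
    if unit == "em":
        return number * parent_px
    return parent_px * number / 100.0  # percentage of the parent size
```

Resolving sizes top-down, so that each element's result becomes parent_px for
its children, keeps relative units such as em and % consistent.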



 [...]


Regards  Manlio Perillo