Re: [Web-SIG] HTML parsing - get text position and font size

Noah Gift Mon, 12 Jan 2009 03:29:17 -0800

2009/1/13 Girish Redekar <[email protected]>:
> I'm trying to build a search engine in Python and am stuck at the place where I
> parse HTML to get useful text. One should ideally be able to parse the text
> (out of HTML tags) along with its position (for phrase searches) and
> font-size (to weigh words appropriately).
>
> However, this part gets very tedious (especially with bad HTML and CSS) and
> my code is already unwieldy. It seems to me that this task should've been a
> part of any Python-based semi-sophisticated screen scraper and that it would
> be a commonly solved problem. Yet, no amount of googling has returned
> anything useful.
>
> Any ideas?


I wrote this article a while back:

http://www.ibm.com/developerworks/aix/library/au-threadingpython/

I didn't fully explore it, but it seems like thread pools and
Beautiful Soup could work...
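
To make the idea concrete, here is a minimal sketch of what the question
asks for: pull words out of HTML together with their position and a rough
font-size weight. It is not taken from the article above. It assumes
Python 3 and, to stay free of extra installs, uses the standard library's
html.parser instead of Beautiful Soup. The names WeightedTextParser,
extract_words, extract_many, TAG_WEIGHTS and BASE_PX are made up for
illustration, and the weight numbers are arbitrary assumptions. Only inline
"font-size" styles and a few tags are considered; external CSS is ignored.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser

# Assumed, illustrative weights per tag (not taken from any browser stylesheet).
TAG_WEIGHTS = {"h1": 2.0, "h2": 1.5, "h3": 1.25, "big": 1.2, "small": 0.8}
SKIP_TAGS = {"script", "style"}            # text inside these is never indexed
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # no closing tag
BASE_PX = 16.0                             # assumed default font size
FONT_SIZE_RE = re.compile(
    r"font-size\s*:\s*(\d+(?:\.\d+)?)\s*(px|pt|em|%)", re.I)


class WeightedTextParser(HTMLParser):
    """Collects (position, word, weight) tuples while parsing."""

    def __init__(self):
        super().__init__()
        self.stack = [("root", 1.0)]       # (tag, effective weight)
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        parent_weight = self.stack[-1][1]
        weight = parent_weight * TAG_WEIGHTS.get(tag, 1.0)
        match = FONT_SIZE_RE.search(dict(attrs).get("style") or "")
        if match:
            value, unit = float(match.group(1)), match.group(2).lower()
            if unit == "px":
                weight = value / BASE_PX
            elif unit == "pt":
                weight = value * 4.0 / 3.0 / BASE_PX
            elif unit == "em":
                weight = parent_weight * value
            else:                          # percentage of the parent size
                weight = parent_weight * value / 100.0
        self.stack.append((tag, weight))

    def handle_endtag(self, tag):
        # Tolerate bad HTML: unwind to the nearest matching open tag, if any.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if any(tag in SKIP_TAGS for tag, _ in self.stack):
            return
        weight = self.stack[-1][1]
        for word in re.findall(r"\w+", data):
            # Position is the running word index, usable for phrase search.
            self.words.append((len(self.words), word.lower(), weight))


def extract_words(html):
    parser = WeightedTextParser()
    parser.feed(html)
    parser.close()                         # flush any buffered text
    return parser.words


def extract_many(pages, max_workers=4):
    """Run extract_words over several pages with a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_words, pages))


if __name__ == "__main__":
    sample = ('<h1>Big Title</h1><p>plain '
              '<span style="font-size: 32px">loud</span> text</p>')
    for position, word, weight in extract_words(sample):
        print(position, word, weight)
```

Threads mostly pay off when the workers also download the pages, since
that part is I/O-bound; the parsing itself is CPU-bound and gains little
from a thread pool in CPython.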


> _______________________________________________
> Web-SIG mailing list
> [email protected]
> Web SIG: http://www.python.org/sigs/web-sig
> Unsubscribe:
> http://mail.python.org/mailman/options/web-sig/noah.gift%40gmail.com
>
>
