Re: web page text extractor

kublai Fri, 13 Jul 2007 06:01:17 -0700

On Jul 13, 5:44 pm, Paul McGuire <[EMAIL PROTECTED]> wrote:
> On Jul 12, 4:42 am, kublai <[EMAIL PROTECTED]> wrote:
>
> > Hello,
>
> > For a project, I need to develop a corpus of online news stories.  I'm
> > looking for an application that, given the url of a web page, "copies"
> > the rendered text of the web page (not the source HTML text), opens a
> > text editor (Notepad), and displays the copied text for the user to
> > examine and save into a text file. Graphics and sidebars to be
> > ignored. The examples I have come across are much too complex for me
> > to customize for this simple job. Can anyone point me in the right
> > direction?
>
> > Thanks,
> > gk
>
> One of the examples provided with pyparsing is an HTML stripper - view
> it online at http://pyparsing.wikispaces.com/space/showimage/htmlStripper.py.
>
> -- Paul


Stripping tags is indeed one strategy that came to mind. I'm wondering
how much information (paragraph breaks, for example) would be lost, and
whether that loss would be acceptable for the project. I looked at
pyparsing and I see that it has a lot of text-processing capabilities
that I can use along the way. I will certainly try it. Thanks for the
post.
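
To make the paragraphing concern concrete, here is a minimal sketch of
tag stripping that tries to keep paragraph breaks. It is not the
pyparsing htmlStripper example; it uses the standard library's
html.parser module (Python 3) instead. The names TextExtractor and
html_to_text, and the BLOCK_TAGS and SKIP_TAGS lists, are assumptions
made up for illustration, and the sketch makes no attempt to drop
sidebars or navigation text.

```python
import re
from html.parser import HTMLParser

# Assumed list of tags whose boundaries count as paragraph breaks.
BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr"}
# Tags whose contents are dropped entirely.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects text content, marking block-level boundaries with newlines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Source line breaks are not paragraph breaks, so fold all
            # whitespace runs (including newlines) into a single space.
            self.chunks.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    """Return the text of an HTML string, one blank line between paragraphs."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = "".join(parser.chunks).split("\n")
    # Trim each paragraph and discard the empty ones.
    paragraphs = [" ".join(line.split()) for line in lines]
    return "\n\n".join(p for p in paragraphs if p)
```

With this approach inline markup such as bold or links disappears, while
block-level boundaries survive as blank lines, so the loss is mostly
limited to formatting rather than paragraph structure. Whether that is
acceptable would still depend on the pages in the corpus.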

Best,
gk

-- 
http://mail.python.org/mailman/listinfo/python-list
