[Tutor] Extract main text from HTML document

Simon Connah Sat, 05 May 2018 14:28:09 -0700

Hi,

I'm writing a very simple web scraper. It'll download a page from a
website and then store the result in a database of some sort. The
problem is that the page will obviously include a whole heap of HTML,
JavaScript and maybe even some CSS, none of which is useful to me.


I was wondering if there was a way in which I could download a web
page and then just extract the main body of text without all of the
HTML.

The title is obviously easy but the main body of text could contain
all sorts of HTML and I'm interested to know how I might go about
removing the bits that are not needed but still keep the meaning of
the document intact.
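
To make the question concrete, here is a minimal sketch of the naive
approach, using only html.parser from the standard library. The names
TextExtractor and extract_text are made up for illustration. It only
drops tags and the contents of script and style elements; it makes no
attempt to work out which part of the page is the "main" text, which is
the part I'm really asking about.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def get_text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()
```

Calling extract_text(page_source) on a downloaded page gives back the
text with the markup removed, but navigation menus, footers and the like
are still in there along with the article itself.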

Does anyone have any suggestions on this front at all?

Thanks for any help.

Simon.
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
