That looks like a useful combination. Thanks.
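
For the archive, here is a minimal sketch of how the combination suggested below might be used. It is not a definitive solution: the function names extract_text and fetch_text are made up for illustration, and it simply strips script and style elements and returns whatever text is left in the body. Navigation menus, footers and the like will still come through, so picking out only the "main" text would need extra, site-specific work.

```python
from bs4 import BeautifulSoup


def extract_text(html):
    """Return (title, body_text) with markup, scripts and styles removed."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove elements whose contents are code rather than readable text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    title = soup.title.get_text(strip=True) if soup.title else ""

    # Fall back to the whole document if there is no <body> element.
    body = soup.body if soup.body else soup

    # Join the remaining text nodes, one per line, trimming whitespace.
    text = body.get_text(separator="\n", strip=True)
    return title, text


def fetch_text(url):
    """Download a page and return its (title, body_text)."""
    # Imported here so extract_text also works on saved HTML
    # when requests is not installed.
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx responses
    return extract_text(response.text)
```

Calling fetch_text with the URL of a page should give back a title and a block of plain text that can then be stored in the database.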

On 6 May 2018 at 17:32, Mark Lawrence <breamore...@gmail.com> wrote:
> On 05/05/18 18:59, Simon Connah wrote:
>>
>> Hi,
>>
>> I'm writing a very simple web scraper. It'll download a page from a
>> website and then store the result in a database of some sort. The
>> problem is that this will obviously include a whole heap of HTML,
>> JavaScript and maybe even some CSS. None of which is useful to me.
>>
>> I was wondering if there was a way in which I could download a web
>> page and then just extract the main body of text without all of the
>> HTML.
>>
>> The title is obviously easy but the main body of text could contain
>> all sorts of HTML and I'm interested to know how I might go about
>> removing the bits that are not needed but still keep the meaning of
>> the document intact.
>>
>> Does anyone have any suggestions on this front at all?
>>
>> Thanks for any help.
>>
>> Simon.
>
>
> A combination of requests http://docs.python-requests.org/en/master/ and
> beautiful soup https://www.crummy.com/software/BeautifulSoup/bs4/doc/ should
> fit the bill. Both are installable with pip and are regarded as best of
> breed.
>
> --
> My fellow Pythonistas, ask not what our language can do for you, ask
> what you can do for our language.
>
> Mark Lawrence
>
>
> _______________________________________________
> Tutor maillist  -  Tutor@python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor