Re: [Tutor] Extract main text from HTML document
That looks like a useful combination. Thanks.

On 6 May 2018 at 17:32, Mark Lawrence wrote:
> On 05/05/18 18:59, Simon Connah wrote:
>>
>> Hi,
>>
>> I'm writing a very simple web scraper. It'll download a page from a
>> website and then store the result in a database of some sort. The
>> problem is that this will obviously include a whole heap of HTML,
>> JavaScript and maybe even some CSS. None of which is useful to me.
>>
>> I was wondering if there was a way in which I could download a web
>> page and then just extract the main body of text without all of the
>> HTML.
>>
>> The title is obviously easy but the main body of text could contain
>> all sorts of HTML and I'm interested to know how I might go about
>> removing the bits that are not needed but still keep the meaning of
>> the document intact.
>>
>> Does anyone have any suggestions on this front at all?
>>
>> Thanks for any help.
>>
>> Simon.
>
>
> A combination of requests http://docs.python-requests.org/en/master/ and
> beautiful soup https://www.crummy.com/software/BeautifulSoup/bs4/doc/ should
> fit the bill. Both are installable with pip and are regarded as best of
> breed.
>
> --
> My fellow Pythonistas, ask not what our language can do for you, ask
> what you can do for our language.
>
> Mark Lawrence

___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Extract main text from HTML document
Two things. First, you can download the page as a string and delete
everything between each "<" and ">", i.e. the tags themselves. Second, it
might be worth looking at the Udacity CS101 course, as it is all about
building a search engine.

On Sat, 5 May 2018 at 22:27, Simon Connah wrote:
> Hi,
>
> I'm writing a very simple web scraper. It'll download a page from a
> website and then store the result in a database of some sort. The
> problem is that this will obviously include a whole heap of HTML,
> JavaScript and maybe even some CSS. None of which is useful to me.
>
> I was wondering if there was a way in which I could download a web
> page and then just extract the main body of text without all of the
> HTML.
>
> The title is obviously easy but the main body of text could contain
> all sorts of HTML and I'm interested to know how I might go about
> removing the bits that are not needed but still keep the meaning of
> the document intact.
>
> Does anyone have any suggestions on this front at all?
>
> Thanks for any help.
>
> Simon.
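
The tag-stripping suggestion in this message could be sketched with nothing
but the standard library. This is only an illustration, not code from the
thread: the names TextExtractor and strip_tags are made up for the example,
and it assumes that the contents of <script> and <style> elements should be
thrown away along with the tags, since they are not readable text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    # Hypothetical choice: elements whose contents are not readable text.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def strip_tags(html):
    """Return the text of an HTML string with all markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Using a parser rather than a regular expression means that markup-like text
inside a script (for example a string containing "<p>") is not mistaken for
real tags. It still keeps navigation menus, footers and similar page
furniture, so it is a starting point rather than a way to find the "main"
text.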
Re: [Tutor] Extract main text from HTML document
Thanks for the replies, everyone. Beautiful Soup looks like a good option.

My primary goal is to extract the main body text, the title and the meta
description from a web page and run them through one of the cloud natural
language processing services to find out some information that I'd like to
know. I'd like to do this for quite a few websites.

This is all for a little project I have in mind. I'm not even sure if
it'll work but it'll be fun to try. I might have to do some custom work on
top of what Beautiful Soup offers though, as I need to get very specific
data in a certain format.

On 5 May 2018 at 22:43, boB Stepp wrote:
> On Sat, May 5, 2018 at 12:59 PM, Simon Connah wrote:
>
>> I was wondering if there was a way in which I could download a web
>> page and then just extract the main body of text without all of the
>> HTML.
>
> I do not have any experience with this, but I like to collect books.
> One of them [1] says on page 245:
>
> "Beautiful Soup is a module for extracting information from an HTML
> page (and is much better for this purpose than regular expressions)."
>
> I believe this topic has come up before on this list as well as the
> main Python list. You may want to check it out. It can be installed
> with pip.
>
> [1] "Automate the Boring Stuff with Python -- Practical Programming
> for Total Beginners" by Al Sweigart.
>
> HTH!
> --
> boB
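
For the three pieces named in this message (title, meta description and
body text), a rough Beautiful Soup sketch might look like the following.
The function name summarise_page and the returned dictionary layout are
invented for the example, and treating the <p> elements as "the main body
text" is an assumption that will not hold for every site.

```python
from bs4 import BeautifulSoup


def summarise_page(html):
    """Return the title, meta description and paragraph text of a page."""
    soup = BeautifulSoup(html, "html.parser")

    # <title> may be missing, so guard before reading it.
    title = soup.title.get_text(strip=True) if soup.title else None

    # <meta name="description" content="..."> is optional as well.
    meta = soup.find("meta", attrs={"name": "description"})
    description = meta.get("content") if meta else None

    # Assumption: the main text lives in <p> elements. Menus, scripts and
    # styles are normally outside <p>, so they drop out here.
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    body = "\n\n".join(text for text in paragraphs if text)

    return {"title": title, "description": description, "body": body}
```

Returning a plain dictionary keeps the result easy to store in a database
or to pass on to another service. Sites that put their text in <div> or
<article> elements without <p> tags would need a different selector.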
Re: [Tutor] Extract main text from HTML document
On 05/05/18 18:59, Simon Connah wrote:
> Hi,
>
> I'm writing a very simple web scraper. It'll download a page from a
> website and then store the result in a database of some sort. The
> problem is that this will obviously include a whole heap of HTML,
> JavaScript and maybe even some CSS. None of which is useful to me.
>
> I was wondering if there was a way in which I could download a web
> page and then just extract the main body of text without all of the
> HTML.
>
> The title is obviously easy but the main body of text could contain
> all sorts of HTML and I'm interested to know how I might go about
> removing the bits that are not needed but still keep the meaning of
> the document intact.
>
> Does anyone have any suggestions on this front at all?
>
> Thanks for any help.
>
> Simon.

A combination of requests http://docs.python-requests.org/en/master/ and
beautiful soup https://www.crummy.com/software/BeautifulSoup/bs4/doc/ should
fit the bill. Both are installable with pip and are regarded as best of
breed.

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence
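
As a minimal sketch of how the requests and Beautiful Soup combination
recommended here might fit together: the helper names fetch_html and
html_to_text, the 10-second timeout and the example.com URL are all
assumptions made for illustration, not details from the thread.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def html_to_text(html):
    """Return the visible text of an HTML document, one fragment per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents are never meant to be read as text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # stripped_strings yields each text fragment with whitespace trimmed
    # and skips fragments that are empty.
    return "\n".join(soup.stripped_strings)


# Example call (placeholder URL, needs network access):
# print(html_to_text(fetch_html("https://example.com/")))
```

Keeping the download and the parsing in separate functions makes the
parsing step easy to try out on a saved HTML string without touching the
network.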
Re: [Tutor] Extract main text from HTML document
On Sat, May 5, 2018 at 12:59 PM, Simon Connah wrote:

> I was wondering if there was a way in which I could download a web
> page and then just extract the main body of text without all of the
> HTML.

I do not have any experience with this, but I like to collect books.
One of them [1] says on page 245:

"Beautiful Soup is a module for extracting information from an HTML
page (and is much better for this purpose than regular expressions)."

I believe this topic has come up before on this list as well as the
main Python list. You may want to check it out. It can be installed
with pip.

[1] "Automate the Boring Stuff with Python -- Practical Programming
for Total Beginners" by Al Sweigart.

HTH!
--
boB
Re: [Tutor] Extract main text from HTML document
On 05/05/2018 11:59 AM, Simon Connah wrote:
> Hi,
>
> I'm writing a very simple web scraper. It'll download a page from a
> website and then store the result in a database of some sort. The
> problem is that this will obviously include a whole heap of HTML,
> JavaScript and maybe even some CSS. None of which is useful to me.
>
> I was wondering if there was a way in which I could download a web
> page and then just extract the main body of text without all of the
> HTML.
>
> The title is obviously easy but the main body of text could contain
> all sorts of HTML and I'm interested to know how I might go about
> removing the bits that are not needed but still keep the meaning of
> the document intact.
>
> Does anyone have any suggestions on this front at all?

There's so much prior art in this space that it's not really worth
reinventing it, unless you're using it as an exercise to teach yourself
more Python (always a worthy goal!).

Here's one guy's summary of _some_ of the existing options, albeit
probably the best-known ones:

https://elitedatascience.com/python-web-scraping-libraries