Re: [Tutor] Extract main text from HTML document

2018-05-07 Thread Simon Connah
That looks like a useful combination. Thanks.

On 6 May 2018 at 17:32, Mark Lawrence  wrote:
> On 05/05/18 18:59, Simon Connah wrote:
>>
>> Hi,
>>
>> I'm writing a very simple web scraper. It'll download a page from a
>> website and then store the result in a database of some sort. The
>> problem is that this will obviously include a whole heap of HTML,
>> JavaScript and maybe even some CSS. None of which is useful to me.
>>
>> I was wondering if there was a way in which I could download a web
>> page and then just extract the main body of text without all of the
>> HTML.
>>
>> The title is obviously easy but the main body of text could contain
>> all sorts of HTML and I'm interested to know how I might go about
>> removing the bits that are not needed but still keep the meaning of
>> the document intact.
>>
>> Does anyone have any suggestions on this front at all?
>>
>> Thanks for any help.
>>
>> Simon.
>
>
> A combination of requests http://docs.python-requests.org/en/master/ and
> beautiful soup https://www.crummy.com/software/BeautifulSoup/bs4/doc/ should
> fit the bill.  Both are installable with pip and are regarded as best of
> breed.
>
> --
> My fellow Pythonistas, ask not what our language can do for you, ask
> what you can do for our language.
>
> Mark Lawrence
>
>
> ___
> Tutor maillist  -  Tutor@python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Extract main text from HTML document

2018-05-06 Thread Brian Lockwood
Two things. The first thing is that you can download the page as a string
and delete a everything between tags. Secondly It might be worth looking at
Udacity cs101 as this course is all about a search engine.
On Sat, 5 May 2018 at 22:27, Simon Connah  wrote:

> Hi,
>
> I'm writing a very simple web scraper. It'll download a page from a
> website and then store the result in a database of some sort. The
> problem is that this will obviously include a whole heap of HTML,
> JavaScript and maybe even some CSS. None of which is useful to me.
>
> I was wondering if there was a way in which I could download a web
> page and then just extract the main body of text without all of the
> HTML.
>
> The title is obviously easy but the main body of text could contain
> all sorts of HTML and I'm interested to know how I might go about
> removing the bits that are not needed but still keep the meaning of
> the document intact.
>
> Does anyone have any suggestions on this front at all?
>
> Thanks for any help.
>
> Simon.
> ___
> Tutor maillist  -  Tutor@python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor
>
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Extract main text from HTML document

2018-05-06 Thread Simon Connah
Thanks for the replies, everyone. Beautiful Soup looks like a good option.

My primary goal is to extract the main body text, the title and the
meta description from a web page and run it through one of the cloud
Natural Language processing services to find out some information that
I'd like to know and I'd like to do it to quite a few websites.

This is all for a little project I have in mind. I'm not even sure if
it'll work but it'll be fun to try. I might have to do some custom
work on top of what Beautiful Soup offers though as I need to get very
specific data in a certain format.

On 5 May 2018 at 22:43, boB Stepp  wrote:
> On Sat, May 5, 2018 at 12:59 PM, Simon Connah  wrote:
>
>> I was wondering if there was a way in which I could download a web
>> page and then just extract the main body of text without all of the
>> HTML.
>
> I do not have any experience with this, but I like to collect books.
> One of them [1] says on page 245:
>
> "Beautiful Soup is a module for extracting information from an HTML
> page (and is much better for this purpose than regular expressions)."
>
> I believe this topic has come up before on this list as well as the
> main Python list.  You may want to check it out.  It can be installed
> with pip.
>
> [1] "Automate the Boring Stuff with Python -- Practical Programming
> for Total Beginners" by Al Sweigart.
>
> HTH!
> --
> boB
> ___
> Tutor maillist  -  Tutor@python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Extract main text from HTML document

2018-05-06 Thread Mark Lawrence

On 05/05/18 18:59, Simon Connah wrote:

Hi,

I'm writing a very simple web scraper. It'll download a page from a
website and then store the result in a database of some sort. The
problem is that this will obviously include a whole heap of HTML,
JavaScript and maybe even some CSS. None of which is useful to me.

I was wondering if there was a way in which I could download a web
page and then just extract the main body of text without all of the
HTML.

The title is obviously easy but the main body of text could contain
all sorts of HTML and I'm interested to know how I might go about
removing the bits that are not needed but still keep the meaning of
the document intact.

Does anyone have any suggestions on this front at all?

Thanks for any help.

Simon.


A combination of requests http://docs.python-requests.org/en/master/ and 
beautiful soup https://www.crummy.com/software/BeautifulSoup/bs4/doc/ 
should fit the bill.  Both are installable with pip and are regarded as 
best of breed.


--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Extract main text from HTML document

2018-05-05 Thread boB Stepp
On Sat, May 5, 2018 at 12:59 PM, Simon Connah  wrote:

> I was wondering if there was a way in which I could download a web
> page and then just extract the main body of text without all of the
> HTML.

I do not have any experience with this, but I like to collect books.
One of them [1] says on page 245:

"Beautiful Soup is a module for extracting information from an HTML
page (and is much better for this purpose than regular expressions)."

I believe this topic has come up before on this list as well as the
main Python list.  You may want to check it out.  It can be installed
with pip.

[1] "Automate the Boring Stuff with Python -- Practical Programming
for Total Beginners" by Al Sweigart.

HTH!
-- 
boB
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Extract main text from HTML document

2018-05-05 Thread Mats Wichmann
On 05/05/2018 11:59 AM, Simon Connah wrote:
> Hi,
> 
> I'm writing a very simple web scraper. It'll download a page from a
> website and then store the result in a database of some sort. The
> problem is that this will obviously include a whole heap of HTML,
> JavaScript and maybe even some CSS. None of which is useful to me.
> 
> I was wondering if there was a way in which I could download a web
> page and then just extract the main body of text without all of the
> HTML.
> 
> The title is obviously easy but the main body of text could contain
> all sorts of HTML and I'm interested to know how I might go about
> removing the bits that are not needed but still keep the meaning of
> the document intact.
> 
> Does anyone have any suggestions on this front at all?

there's so much prior art in this space it's not really worth
reinventing this, unless you're using it as an exercise to teach
yourself more Python (always a worth goal!)

Here's one guy's summary of _some_ of the existing practice, albeit
probably the best known.

https://elitedatascience.com/python-web-scraping-libraries


___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor