Re: [Tutor] Web scraping using selenium and navigating nested dictionaries / lists.
Peter, I am aware that I am avoiding functions that can make my life easier, but I want to learn these data-structure navigation concepts to improve my programming skills. What you have provided I will review in depth and have a play with. A big thanks.

-----Original Message-----
From: Tutor On Behalf Of Peter Otten
Sent: Sunday, 27 January 2019 10:13 PM
To: tutor@python.org
Subject: Re: [Tutor] Web scraping using selenium and navigating nested dictionaries / lists.

mhysnm1...@gmail.com wrote:

> All,
>
> Goal of new project.
>
> I want to scrape all my books from Audible.com that I have purchased.
> Eventually I want to export this as a CSV file or maybe JSON. I have not
> got that far yet. The reasoning behind this is to learn selenium for my
> work and get the list of books I have purchased. Killing two birds with
> one stone here. The work focus is to see if selenium can automate some of
> the testing I have to do and collect useful information from the web page
> for my reports. This part of the goal is in the future, as I need to
> build my Python skills up.
>
> Thus far, I have been successful in logging into Audible and showing the
> library of books. I am able to store the table of books and want to use
> BeautifulSoup to extract the relevant information. Information I will
> want from the table is:
>
> * Author
> * Title
> * Date purchased
> * Length
> * Is the book in a series (there is a link for this)
> * Link to the page storing the publish details
> * Download link
>
> Hopefully this has given you enough information on what I am trying to
> achieve at this stage. As I learn more about what I am doing, I am adding
> possible extra tasks, such as verifying whether I have the book already
> downloaded via iTunes.
>
> Learning goals:
>
> Using the BeautifulSoup structure that I have extracted from the page
> source for the table, I want to navigate the tree structure.
> BeautifulSoup provides children, siblings and parents methods. This is
> where I get stuck with programming logic. BeautifulSoup does provide the
> find_all method plus selectors, which I do not want to use for this
> exercise, as I want to learn how to walk a tree starting at the root and
> visiting each node of the tree.

I think you make your life harder than necessary if you avoid the tools
provided by the library you are using.

> Then I can look at the attributes for the tag as I go. I believe I have
> to set up a recursive loop or function call. Not sure on how to do this.
> Pseudo code:
>
> Build table structure.
> Start at the root node.
> Check to see if there are any children.
> Pass first child to function.
> Print attributes for tag at this level.
> In function, check for any sibling nodes.
> If they exist, call function again.
> If no siblings, then start at first sibling and get its child.
>
> This is where I get stuck. Each sibling can have children and they can
> have siblings. So how do I ensure I visit each node in the tree?

The problem with your description is that siblings do not matter. Just

- process root
- iterate over its children and call the function recursively with every
  child as the new root.

To make the function more useful you can pass a function instead of
hard-coding what you want to do with the elements. Given

    def process_elements(elem, do_stuff):
        do_stuff(elem)
        for child in elem.children:
            process_elements(child, do_stuff)

you can print all elements with

    soup = BeautifulSoup(...)
    process_elements(soup, print)

and

    process_elements(soup, lambda elem: print(elem.name))

will print only the names. You need a bit of error checking to make it
work, though.

But wait -- Python's generators let you rewrite process_elements so that
you can use it without a callback:

    def gen_elements(elem):
        yield elem
        for child in elem.children:
            yield from gen_elements(child)

    for elem in gen_elements(soup):
        print(elem.name)

Note that 'yield from iterable' is a shortcut for 'for x in iterable:
yield x', so there are actually two loops in gen_elements().

> Any tips or tricks for this would be appreciated, as I could use this in
> other situations.

_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
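Peter's recursive walk and its generator version are not specific to BeautifulSoup; the same depth-first pattern applies to any tree. Here is a self-contained sketch of the same idea using the standard library's xml.etree.ElementTree instead (the table snippet is invented for illustration; an Element iterates directly over its children rather than exposing a .children attribute):

```python
import xml.etree.ElementTree as ET

def gen_elements(elem):
    """Yield elem, then recurse into every child (depth-first, pre-order)."""
    yield elem
    for child in elem:                 # an Element iterates over its children
        yield from gen_elements(child)

root = ET.fromstring("<table><tr><td>Author</td><td>Title</td></tr></table>")
print([elem.tag for elem in gen_elements(root)])  # → ['table', 'tr', 'td', 'td']
```

The same generator works unchanged on a BeautifulSoup tree once `for child in elem:` is replaced by `for child in elem.children:`, as in Peter's version.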
Re: [Tutor] Web scraping using selenium and navigating nested dictionaries / lists.
Marco, thanks. The reason for learning selenium is the automation: I want to test web sites for keyboard and mouse interaction and record the results. That at least is the long-term goal. In the short term, I will have a look at your suggestion.

From: Marco Mistroni
Sent: Sunday, 27 January 2019 9:46 PM
To: mhysnm1...@gmail.com
Cc: tutor@python.org
Subject: Re: [Tutor] Web scraping using selenium and navigating nested dictionaries / lists.

Hi, my 2 cents: have a look at scrapy for scraping. Selenium is a very good tool to learn, but it is mainly for automating UAT of GUIs. Scrapy will scrape for you, and you can automate it via cron. It's the same stuff I am doing at the moment. Hope that helps.
Re: [Tutor] Web scraping using selenium and navigating nested dictionaries / lists.
Hi, my 2 cents: have a look at scrapy for scraping. Selenium is a very good tool to learn, but it is mainly for automating UAT of GUIs. Scrapy will scrape for you, and you can automate it via cron. It's the same stuff I am doing at the moment. Hope that helps.

On Sun, Jan 27, 2019, 8:34 AM:
[Tutor] Web scraping using selenium and navigating nested dictionaries / lists.
All,

Goal of new project.

I want to scrape all my books from Audible.com that I have purchased. Eventually I want to export this as a CSV file or maybe JSON. I have not got that far yet. The reasoning behind this is to learn selenium for my work and get the list of books I have purchased. Killing two birds with one stone here. The work focus is to see if selenium can automate some of the testing I have to do and collect useful information from the web page for my reports. This part of the goal is in the future, as I need to build my Python skills up.

Thus far, I have been successful in logging into Audible and showing the library of books. I am able to store the table of books and want to use BeautifulSoup to extract the relevant information. Information I will want from the table is:

* Author
* Title
* Date purchased
* Length
* Is the book in a series (there is a link for this)
* Link to the page storing the publish details
* Download link

Hopefully this has given you enough information on what I am trying to achieve at this stage. As I learn more about what I am doing, I am adding possible extra tasks, such as verifying whether I have the book already downloaded via iTunes.

Learning goals:

Using the BeautifulSoup structure that I have extracted from the page source for the table, I want to navigate the tree structure. BeautifulSoup provides children, siblings and parents methods. This is where I get stuck with programming logic. BeautifulSoup does provide the find_all method plus selectors, which I do not want to use for this exercise, as I want to learn how to walk a tree starting at the root and visiting each node of the tree. Then I can look at the attributes for the tag as I go. I believe I have to set up a recursive loop or function call; I am not sure how to do this. Pseudo code:

Build table structure.
Start at the root node.
Check to see if there are any children.
Pass first child to function.
Print attributes for tag at this level.
In function, check for any sibling nodes.
If they exist, call function again.
If no siblings, then start at first sibling and get its child.

This is where I get stuck. Each sibling can have children, and they can have siblings. So how do I ensure I visit each node in the tree?

Any tips or tricks for this would be appreciated, as I could use this in other situations.

Sean
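The post mentions exporting the result as CSV; once the fields listed above have been scraped, the standard library's csv module covers that part. A sketch with made-up book data (the scraping itself is not shown, and a StringIO stands in for the output file):

```python
import csv
import io

# Hypothetical records standing in for scraped data; the field names
# follow the list in the post (only a few shown here).
books = [
    {"Author": "Ann Leckie", "Title": "Ancillary Justice", "Length": "12 hrs"},
    {"Author": "Andy Weir", "Title": "The Martian", "Length": "10 hrs"},
]

buffer = io.StringIO()  # stands in for open("library.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=["Author", "Title", "Length"])
writer.writeheader()    # first row: the column names
writer.writerows(books)
print(buffer.getvalue())
```

DictWriter is convenient here because each scraped book can be built up as a dict and any missing optional field (e.g. a series link) can be given a default rather than breaking the column alignment.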
[Tutor] web scraping using Python and urlopen in Python 3.3
Hi, I am new to Python, trying to learn it by carrying out specific tasks. I want to start with trying to scrape the contents of a web page. I have downloaded Python 3.3 and BeautifulSoup 4. If I call upon urlopen in any form, such as below, I get the error shown below the code. Does urlopen not apply to Python 3.3? If not, then what's the syntax I should be using? Thanks so much.

    import urllib
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(urllib.urlopen("http://www.pinterest.com"))

    Traceback (most recent call last):
      File "C:\Users\Seema\workspace\example\main.py", line 3, in <module>
        soup = BeautifulSoup(urllib.urlopen("http://www.pinterest.com"))
    AttributeError: 'module' object has no attribute 'urlopen'
Re: [Tutor] web scraping using Python and urlopen in Python 3.3
Seema,

On 7 November 2012 15:44, Seema V Srivastava <seema@gmail.com> wrote:
> Hi, I am new to Python, trying to learn it by carrying out specific
> tasks. I want to start with trying to scrape the contents of a web page.
> I have downloaded Python 3.3 and BeautifulSoup 4. If I call upon urlopen
> in any form, such as below, I get the error shown below the code. Does
> urlopen not apply to Python 3.3? If not, then what's the syntax I should
> be using? Thanks so much.

See the documentation: http://docs.python.org/2/library/urllib.html#utility-functions

Quote: "Also note that the urllib.urlopen() function has been removed in
Python 3 in favor of urllib2.urlopen()."

Walter
Re: [Tutor] web scraping using Python and urlopen in Python 3.3
On 11/07/2012 10:44 AM, Seema V Srivastava wrote:
> Hi, I am new to Python, trying to learn it by carrying out specific
> tasks. I want to start with trying to scrape the contents of a web page.
> I have downloaded Python 3.3 and BeautifulSoup 4. If I call upon urlopen
> in any form, such as below, I get the error shown below the code. Does
> urlopen not apply to Python 3.3? If not, then what's the syntax I should
> be using? Thanks so much.
>
> import urllib
> from bs4 import BeautifulSoup
> soup = BeautifulSoup(urllib.urlopen("http://www.pinterest.com"))
>
> Traceback (most recent call last):
>   File "C:\Users\Seema\workspace\example\main.py", line 3, in <module>
>     soup = BeautifulSoup(urllib.urlopen("http://www.pinterest.com"))
> AttributeError: 'module' object has no attribute 'urlopen'

Since you're trying to learn, let me point out a few things that would let
you teach yourself, which is usually quicker and more effective than asking
on a mailing list. (Go ahead and ask, but if you figure out the simpler
ones yourself, you'll learn faster.) (BTW, I'm using 3.2, but it'll
probably be very close.)

First, that error has nothing to do with BeautifulSoup. If it had, I
wouldn't have responded, since I don't have any experience with BS. The
way you could learn that for yourself is to factor the line giving the
error:

    tmp = urllib.urlopen("http://www.pinterest.com")
    soup = BeautifulSoup(tmp)

Now you'll get the error on the first line, before doing anything with
BeautifulSoup. Now that you have narrowed it to urllib.urlopen, go find
the docs for that. I used DuckDuckGo with the keywords "python urllib
urlopen", and the first match was:

http://docs.python.org/2/library/urllib.html

and even though this is the 2.7.3 docs, the first paragraph tells you
something useful:

Note: The urllib module has been split into parts and renamed in Python 3
to urllib.request, urllib.parse, and urllib.error.
The 2to3 tool will automatically adapt imports when converting your
sources to Python 3. Also note that the urllib.urlopen() function has been
removed in Python 3 in favor of urllib2.urlopen().

Now, the next question I'd ask is whether you're working from a book (or
online tutorial), and whether that book is describing Python 2.x. If so,
you might encounter this type of pain many times.

Anyway, another place you can learn is from the interactive interpreter.
Just run python3 and experiment:

    >>> import urllib
    >>> urllib.urlopen
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'module' object has no attribute 'urlopen'
    >>> dir(urllib)
    ['__builtins__', '__cached__', '__doc__', '__file__', '__name__',
    '__package__', '__path__']

Notice that dir shows us the attributes of urllib, and none of them look
directly useful. That's because urllib is a package, not just a module. A
package is a container for other modules. We can also look at __file__:

    >>> urllib.__file__
    '/usr/lib/python3.2/urllib/__init__.py'

That __init__.py is another clue; that's the way packages are initialized.
But when I try importing urllib2, I get:

    ImportError: No module named urllib2

So back to the website. But using the dropdown at the upper left, I can
change from 2.7 to 3.3:

http://docs.python.org/3.3/library/urllib.html

There it is quite explicit.
urllib is a package that collects several modules for working with URLs:

* urllib.request for opening and reading URLs
* urllib.error containing the exceptions raised by urllib.request
* urllib.parse for parsing URLs
* urllib.robotparser for parsing robots.txt files

So, if we continue to play with the interpreter, we can try:

    >>> import urllib.request
    >>> dir(urllib.request)
    ['AbstractBasicAuthHandler', 'AbstractDigestAuthHandler',
    'AbstractHTTPHandler', 'BaseHandler', 'CacheFTPHandler',
    'ContentTooShortError', 'FTPHandler', 'FancyURLopener', 'FileHandler',
    'HTTPBasicAuthHandler', 'HTTPCookieProcessor', 'HTTPDefaultErrorHandler',
    'HTTPDigestAuthHandler', 'HTTPError', 'HTTPErrorProcessor', ..
    'urljoin', 'urlopen', 'urlparse', 'urlretrieve', 'urlsplit', 'urlunparse']

I chopped off part of the long list of things that was imported in that
module. But one of them is urlopen, which is what you were looking for
before. So back to your own sources, try:

    tmp = urllib.request.urlopen("http://www.pinterest.com")
    tmp
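Pulling the walkthrough above together, a minimal Python 3 fetch helper might look like the sketch below. The data: URL in the demo call is only there so the sketch can run without network access; any http:// URL, such as the pinterest.com address from the thread, works the same way:

```python
from urllib.request import urlopen  # in Python 3, urlopen lives in urllib.request

def fetch(url):
    """Open a URL and return its body decoded as text."""
    with urlopen(url) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)

# A data: URL exercises the call without touching the network.
print(fetch("data:text/plain;charset=utf-8,hello"))
```

The decoded string can then be handed straight to BeautifulSoup, which is what the original post was trying to do in one line.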
Re: [Tutor] web scraping using Python and urlopen in Python 3.3
On 11/07/2012 11:25 AM, Walter Prins wrote:
> See the documentation:
> http://docs.python.org/2/library/urllib.html#utility-functions
>
> Quote: "Also note that the urllib.urlopen() function has been removed in
> Python 3 in favor of urllib2.urlopen()."
>
> Walter

Unfortunately, that's a bug in the 2.7 documentation. The actual Python 3
approach does not use urllib2. See
http://docs.python.org/3.3/library/urllib.html

-- DaveA
Re: [Tutor] Web scraping
[EMAIL PROTECTED] wrote:
> I am looking for a web scraping sample. Who can help me?

Take a look at Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/

Kent
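Beautiful Soup is a third-party package; to give a taste of the kind of work it automates, here is a minimal sketch using only the standard library's html.parser to collect link targets (the HTML snippet is invented for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

parser = LinkCollector()
parser.feed('<p><a href="/one">one</a> and <a href="/two">two</a></p>')
print(parser.links)  # → ['/one', '/two']
```

Beautiful Soup layers a much friendlier tree-navigation API on top of this kind of event-driven parsing, which is why it is the usual recommendation for scraping.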
Re: [Tutor] Web scraping
An alternative win32 approach is to use something like IEC
(http://www.mayukhbose.com/python/IEC/index.php) or PAMIE
(http://pamie.sourceforge.net/), or you can use the python win32 extensions
(http://starship.python.net/crew/skippy/win32/Downloads.html) and use IE to
navigate through the DOM... but PAMIE is easier. Good luck.

On 6/8/05, Kent Johnson [EMAIL PROTECTED] wrote:
> [EMAIL PROTECTED] wrote:
> > I am looking for a web scraping sample. Who can help me?
>
> Take a look at Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/
>
> Kent

--
'There is only one basic human right, and that is to do as you damn well
please. And with it comes the only basic human duty, to take the
consequences.'