Re: [Tutor] Web scraping using selenium and navigating nested dictionaries / lists.
Peter, I am aware that I am avoiding functions that can make my life easier, but I want to learn these data-structure navigation concepts to improve my programming skills. What you have provided I will review in depth and have a play with. A big thanks.

-----Original Message-----
From: Tutor On Behalf Of Peter Otten
Sent: Sunday, 27 January 2019 10:13 PM
To: tutor@python.org
Subject: Re: [Tutor] Web scraping using selenium and navigating nested dictionaries / lists.

mhysnm1...@gmail.com wrote:

> All,
>
> Goal of new project.
>
> I want to scrape all my books from Audible.com that I have purchased.
> Eventually I want to export this as a CSV file or maybe JSON. I have not
> got that far yet. The reasoning behind this is to learn selenium for my
> work and get the list of books I have purchased. Killing two birds with
> one stone here. The work focus is to see if selenium can automate some of
> the testing I have to do and collect useful information from the web page
> for my reports. This part of the goal is in the future, as I need to
> build my Python skills up.
>
> Thus far, I have been successful in logging into Audible and showing the
> library of books. I am able to store the table of books and want to use
> BeautifulSoup to extract the relevant information. Information I will
> want from the table is:
>
> * Author
> * Title
> * Date purchased
> * Length
> * Is the book in a series (there is a link for this)
> * Link to the page storing the publish details
> * Download link
>
> Hopefully this has given you enough information on what I am trying to
> achieve at this stage. As I learn more about what I am doing, I am adding
> possible extra tasks, such as verifying whether I have the book already
> downloaded via iTunes.
>
> Learning goals:
>
> Using the BeautifulSoup structure that I have extracted from the page
> source for the table, I want to navigate the tree structure.
> BeautifulSoup provides children, siblings and parents methods. This is
> where I get stuck with programming logic. BeautifulSoup does provide the
> find_all method plus selectors, which I do not want to use for this
> exercise, as I want to learn how to walk a tree starting at the root and
> visiting each node of the tree.

I think you make your life harder than necessary if you avoid the tools
provided by the library you are using.

> Then I can look at the attributes for the tag as I go. I believe I have
> to set up a recursive loop or function call. Not sure on how to do this.
> Pseudo code:
>
> Build table structure.
> Start at the root node.
> Check to see if there are any children.
> Pass first child to function.
> Print attributes for tag at this level.
> In function, check for any sibling nodes.
> If they exist, call function again.
> If no siblings, then start at first sibling and get its child.
>
> This is where I get stuck. Each sibling can have children and they can
> have siblings. So how do I ensure I visit each node in the tree?

The problem with your description is that siblings do not matter. Just

- process root
- iterate over its children and call the function recursively with every
  child as the new root.

To make the function more useful you can pass a function instead of
hard-coding what you want to do with the elements. Given

    def process_elements(elem, do_stuff):
        do_stuff(elem)
        for child in elem.children:
            process_elements(child, do_stuff)

you can print all elements with

    soup = BeautifulSoup(...)
    process_elements(soup, print)

and

    process_elements(soup, lambda elem: print(elem.name))

will print only the names. You need a bit of error checking to make it
work, though.

But wait -- Python's generators let you rewrite process_elements so that
you can use it without a callback:

    def gen_elements(elem):
        yield elem
        for child in elem.children:
            yield from gen_elements(child)

    for elem in gen_elements(soup):
        print(elem.name)

Note that 'yield from iterable' is a shortcut for 'for x in iterable:
yield x', so there are actually two loops in gen_elements().

> Any tips or tricks for this would be appreciated, as I could use this in
> other situations.

_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
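Peter's recursive walk and its generator version are not specific to BeautifulSoup; the same depth-first pattern applies to any tree. Here is a self-contained sketch of the same idea using the standard library's xml.etree.ElementTree instead (the table snippet is invented for illustration; an Element iterates directly over its children rather than exposing a .children attribute):

```python
import xml.etree.ElementTree as ET

def gen_elements(elem):
    """Yield elem, then recurse into every child (depth-first, pre-order)."""
    yield elem
    for child in elem:                 # an Element iterates over its children
        yield from gen_elements(child)

root = ET.fromstring("<table><tr><td>Author</td><td>Title</td></tr></table>")
print([elem.tag for elem in gen_elements(root)])  # → ['table', 'tr', 'td', 'td']
```

The same generator works unchanged on a BeautifulSoup tree once `for child in elem:` is replaced by `for child in elem.children:`, as in Peter's version.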
Re: [Tutor] Web scraping using selenium and navigating nested dictionaries / lists.
Marco, thanks. The reason for learning selenium is the automation: I want to test web sites for keyboard and mouse interaction and record the results. That at least is the long-term goal. In the short term, I will have a look at your suggestion.

From: Marco Mistroni
Sent: Sunday, 27 January 2019 9:46 PM
To: mhysnm1...@gmail.com
Cc: tutor@python.org
Subject: Re: [Tutor] Web scraping using selenium and navigating nested dictionaries / lists.

Hi, my 2 cents: have a look at scrapy for scraping. Selenium is a very good tool to learn, but it is mainly for automating UAT of GUIs. Scrapy will scrape for you, and you can automate it via cron. It's the same stuff I am doing at the moment. Hope that helps.
Re: [Tutor] Web scraping using selenium and navigating nested dictionaries / lists.
Hi, my 2 cents: have a look at scrapy for scraping. Selenium is a very good tool to learn, but it is mainly for automating UAT of GUIs. Scrapy will scrape for you, and you can automate it via cron. It's the same stuff I am doing at the moment. Hope that helps.

On Sun, Jan 27, 2019, 8:34 AM:
[Tutor] Web scraping using selenium and navigating nested dictionaries / lists.
All,

Goal of new project.

I want to scrape all my books from Audible.com that I have purchased. Eventually I want to export this as a CSV file or maybe JSON. I have not got that far yet. The reasoning behind this is to learn selenium for my work and get the list of books I have purchased. Killing two birds with one stone here. The work focus is to see if selenium can automate some of the testing I have to do and collect useful information from the web page for my reports. This part of the goal is in the future, as I need to build my Python skills up.

Thus far, I have been successful in logging into Audible and showing the library of books. I am able to store the table of books and want to use BeautifulSoup to extract the relevant information. Information I will want from the table is:

* Author
* Title
* Date purchased
* Length
* Is the book in a series (there is a link for this)
* Link to the page storing the publish details
* Download link

Hopefully this has given you enough information on what I am trying to achieve at this stage. As I learn more about what I am doing, I am adding possible extra tasks, such as verifying whether I have the book already downloaded via iTunes.

Learning goals:

Using the BeautifulSoup structure that I have extracted from the page source for the table, I want to navigate the tree structure. BeautifulSoup provides children, siblings and parents methods. This is where I get stuck with programming logic. BeautifulSoup does provide the find_all method plus selectors, which I do not want to use for this exercise, as I want to learn how to walk a tree starting at the root and visiting each node of the tree. Then I can look at the attributes for the tag as I go. I believe I have to set up a recursive loop or function call; I am not sure how to do this. Pseudo code:

Build table structure.
Start at the root node.
Check to see if there are any children.
Pass first child to function.
Print attributes for tag at this level.
In function, check for any sibling nodes.
If they exist, call function again.
If no siblings, then start at first sibling and get its child.

This is where I get stuck. Each sibling can have children, and they can have siblings. So how do I ensure I visit each node in the tree?

Any tips or tricks for this would be appreciated, as I could use this in other situations.

Sean
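The post mentions exporting the result as CSV; once the fields listed above have been scraped, the standard library's csv module covers that part. A sketch with made-up book data (the scraping itself is not shown, and a StringIO stands in for the output file):

```python
import csv
import io

# Hypothetical records standing in for scraped data; the field names
# follow the list in the post (only a few shown here).
books = [
    {"Author": "Ann Leckie", "Title": "Ancillary Justice", "Length": "12 hrs"},
    {"Author": "Andy Weir", "Title": "The Martian", "Length": "10 hrs"},
]

buffer = io.StringIO()  # stands in for open("library.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=["Author", "Title", "Length"])
writer.writeheader()    # first row: the column names
writer.writerows(books)
print(buffer.getvalue())
```

DictWriter is convenient here because each scraped book can be built up as a dict and any missing optional field (e.g. a series link) can be given a default rather than breaking the column alignment.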
[Tutor] web scraping using Python and urlopen in Python 3.3
Hi, I am new to Python, trying to learn it by carrying out specific tasks. I want to start with trying to scrape the contents of a web page. I have downloaded Python 3.3 and BeautifulSoup 4. If I call upon urlopen in any form, such as below, I get the error shown below the code. Does urlopen not apply to Python 3.3? If not, then what's the syntax I should be using? Thanks so much.

    import urllib
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(urllib.urlopen("http://www.pinterest.com"))

    Traceback (most recent call last):
      File "C:\Users\Seema\workspace\example\main.py", line 3, in <module>
        soup = BeautifulSoup(urllib.urlopen("http://www.pinterest.com"))
    AttributeError: 'module' object has no attribute 'urlopen'
Re: [Tutor] web scraping using Python and urlopen in Python 3.3
Seema,

On 7 November 2012 15:44, Seema V Srivastava <seema@gmail.com> wrote:
> Hi, I am new to Python, trying to learn it by carrying out specific
> tasks. I want to start with trying to scrape the contents of a web page.
> I have downloaded Python 3.3 and BeautifulSoup 4. If I call upon urlopen
> in any form, such as below, I get the error shown below the code. Does
> urlopen not apply to Python 3.3? If not, then what's the syntax I should
> be using? Thanks so much.

See the documentation: http://docs.python.org/2/library/urllib.html#utility-functions

Quote: "Also note that the urllib.urlopen() function has been removed in
Python 3 in favor of urllib2.urlopen()."

Walter
Re: [Tutor] web scraping using Python and urlopen in Python 3.3
On 11/07/2012 10:44 AM, Seema V Srivastava wrote:
> Hi, I am new to Python, trying to learn it by carrying out specific
> tasks. I want to start with trying to scrape the contents of a web page.
> I have downloaded Python 3.3 and BeautifulSoup 4. If I call upon urlopen
> in any form, such as below, I get the error shown below the code. Does
> urlopen not apply to Python 3.3? If not, then what's the syntax I should
> be using? Thanks so much.
>
> import urllib
> from bs4 import BeautifulSoup
> soup = BeautifulSoup(urllib.urlopen("http://www.pinterest.com"))
>
> Traceback (most recent call last):
>   File "C:\Users\Seema\workspace\example\main.py", line 3, in <module>
>     soup = BeautifulSoup(urllib.urlopen("http://www.pinterest.com"))
> AttributeError: 'module' object has no attribute 'urlopen'

Since you're trying to learn, let me point out a few things that would let
you teach yourself, which is usually quicker and more effective than asking
on a mailing list. (Go ahead and ask, but if you figure out the simpler
ones yourself, you'll learn faster.) (BTW, I'm using 3.2, but it'll
probably be very close.)

First, that error has nothing to do with BeautifulSoup. If it had, I
wouldn't have responded, since I don't have any experience with BS. The
way you could learn that for yourself is to factor the line giving the
error:

    tmp = urllib.urlopen("http://www.pinterest.com")
    soup = BeautifulSoup(tmp)

Now you'll get the error on the first line, before doing anything with
BeautifulSoup. Now that you have narrowed it to urllib.urlopen, go find
the docs for that. I used DuckDuckGo with the keywords "python urllib
urlopen", and the first match was:

http://docs.python.org/2/library/urllib.html

and even though this is the 2.7.3 docs, the first paragraph tells you
something useful:

Note: The urllib module has been split into parts and renamed in Python 3
to urllib.request, urllib.parse, and urllib.error.
The 2to3 tool will automatically adapt imports when converting your
sources to Python 3. Also note that the urllib.urlopen() function has been
removed in Python 3 in favor of urllib2.urlopen().

Now, the next question I'd ask is whether you're working from a book (or
online tutorial), and whether that book is describing Python 2.x. If so,
you might encounter this type of pain many times.

Anyway, another place you can learn is from the interactive interpreter.
Just run python3 and experiment:

    >>> import urllib
    >>> urllib.urlopen
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'module' object has no attribute 'urlopen'
    >>> dir(urllib)
    ['__builtins__', '__cached__', '__doc__', '__file__', '__name__',
    '__package__', '__path__']

Notice that dir shows us the attributes of urllib, and none of them look
directly useful. That's because urllib is a package, not just a module. A
package is a container for other modules. We can also look at __file__:

    >>> urllib.__file__
    '/usr/lib/python3.2/urllib/__init__.py'

That __init__.py is another clue; that's the way packages are initialized.
But when I try importing urllib2, I get:

    ImportError: No module named urllib2

So back to the website. But using the dropdown at the upper left, I can
change from 2.7 to 3.3:

http://docs.python.org/3.3/library/urllib.html

There it is quite explicit.
urllib is a package that collects several modules for working with URLs:

* urllib.request for opening and reading URLs
* urllib.error containing the exceptions raised by urllib.request
* urllib.parse for parsing URLs
* urllib.robotparser for parsing robots.txt files

So, if we continue to play with the interpreter, we can try:

    >>> import urllib.request
    >>> dir(urllib.request)
    ['AbstractBasicAuthHandler', 'AbstractDigestAuthHandler',
    'AbstractHTTPHandler', 'BaseHandler', 'CacheFTPHandler',
    'ContentTooShortError', 'FTPHandler', 'FancyURLopener', 'FileHandler',
    'HTTPBasicAuthHandler', 'HTTPCookieProcessor', 'HTTPDefaultErrorHandler',
    'HTTPDigestAuthHandler', 'HTTPError', 'HTTPErrorProcessor', ..
    'urljoin', 'urlopen', 'urlparse', 'urlretrieve', 'urlsplit', 'urlunparse']

I chopped off part of the long list of things that was imported in that
module. But one of them is urlopen, which is what you were looking for
before. So back to your own sources, try:

    tmp = urllib.request.urlopen("http://www.pinterest.com")
    tmp
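Pulling the walkthrough above together, a minimal Python 3 fetch helper might look like the sketch below. The data: URL in the demo call is only there so the sketch can run without network access; any http:// URL, such as the pinterest.com address from the thread, works the same way:

```python
from urllib.request import urlopen  # in Python 3, urlopen lives in urllib.request

def fetch(url):
    """Open a URL and return its body decoded as text."""
    with urlopen(url) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)

# A data: URL exercises the call without touching the network.
print(fetch("data:text/plain;charset=utf-8,hello"))
```

The decoded string can then be handed straight to BeautifulSoup, which is what the original post was trying to do in one line.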
Re: [Tutor] web scraping using Python and urlopen in Python 3.3
On 11/07/2012 11:25 AM, Walter Prins wrote:
> See the documentation:
> http://docs.python.org/2/library/urllib.html#utility-functions
>
> Quote: "Also note that the urllib.urlopen() function has been removed in
> Python 3 in favor of urllib2.urlopen()."
>
> Walter

Unfortunately, that's a bug in the 2.7 documentation. The actual Python 3
approach does not use urllib2. See
http://docs.python.org/3.3/library/urllib.html

-- DaveA
Re: [Tutor] Web scraping
[EMAIL PROTECTED] wrote:
> I am looking for a web scraping sample. Who can help me?

Take a look at Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/

Kent
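Beautiful Soup is a third-party package; to give a taste of the kind of work it automates, here is a minimal sketch using only the standard library's html.parser to collect link targets (the HTML snippet is invented for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

parser = LinkCollector()
parser.feed('<p><a href="/one">one</a> and <a href="/two">two</a></p>')
print(parser.links)  # → ['/one', '/two']
```

Beautiful Soup layers a much friendlier tree-navigation API on top of this kind of event-driven parsing, which is why it is the usual recommendation for scraping.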
Re: [Tutor] Web scraping
An alternative win32 approach is to use something like IEC
(http://www.mayukhbose.com/python/IEC/index.php) or PAMIE
(http://pamie.sourceforge.net/), or you can use the python win32 extensions
(http://starship.python.net/crew/skippy/win32/Downloads.html) and use IE to
navigate through the DOM... but PAMIE is easier. Good luck.

On 6/8/05, Kent Johnson [EMAIL PROTECTED] wrote:
> [EMAIL PROTECTED] wrote:
> > I am looking for a web scraping sample. Who can help me?
>
> Take a look at Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/
>
> Kent

--
'There is only one basic human right, and that is to do as you damn well
please. And with it comes the only basic human duty, to take the
consequences.'