Re: Parsing/Crawler Questions - solution
On Mar 7, 9:56 pm, bruce bedoug...@earthlink.net wrote: and this solution will somehow allow a user to create a web parsing/scraping app for parsing links and javascript from a web page?

not just parsing the links and the static javascript, but:

* actually executing the javascript, giving the "page" a chance to actually _look_ like it would if it was being viewed in a "real" web browser. so any XMLHttpRequests will _actually_ get executed, and _actually_ result in the content of the web page being _properly_ modified. so, e.g. instead of seeing a "Loader" page on gmail you would _actually_ see the user's email and the adverts (assuming you went to the trouble of putting in the username/password), because the AJAX would _actually_ get executed by the WebKit engine, and the DOM model accessed thereafter.

* giving the user the opportunity to call DOM methods such as getElementsByTagName and the opportunity to access properties such as document.anchors.

in webkit-glib gdom bindings, that would be:

  anchor_list = gdom_document_get_elements_by_tag_name(doc, "a");

or

  g_object_get(doc, "anchors", &anchor_list, NULL);

which in pywebkitgtk (thanks to python-pygobject auto-generation of python bindings from gobject bindings) translates into:

  doc.get_elements_by_tag_name("a")

or

  doc.props.anchors

which in pyjamas-desktop, a high-level abstraction on top of _that_, turns into:

  from pyjamas import DOM
  anchor_list = DOM.getElementsByTagName(doc, "a")

or

  from pyjamas import DOM
  anchor_list = DOM.getAttribute(doc, "anchors")

answer: yes.

l.

-----Original Message-----
From: python-list-bounces+bedouglas=earthlink@python.org [mailto:python-list-bounces+bedouglas=earthlink@python.org] On Behalf Of lkcl
Sent: Saturday, March 07, 2009 2:34 AM
To: python-l...@python.org
Subject: Re: Parsing/Crawler Questions - solution

[snip]

-- http://mail.python.org/mailman/listinfo/python-list
Re: Parsing/Crawler Questions - solution
bruce wrote:

[snip]

the assumption that most people appear to make is that if you create a parser, and run and test it once.. then if it gets you the data, it's working.. when you run the same app.. 100s of times, and you're slamming the webserver... then you realize that that's a vastly different animal than simply running a single query a few times...

The assumption is that most websites edit and remove data from time to time, so taking the union of data collected throughout several runs might populate your program with redundant (but slightly different) or outdated data. The assumption is that this redundant or outdated data is not useful to most people.

-- http://mail.python.org/mailman/listinfo/python-list
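The point about unions accumulating stale entries is easy to see in a tiny sketch. The course records below are made up for illustration; a naive union of two runs keeps both the outdated and the edited version of a record, while keying on the course number and letting the latest run win does not:

```python
# Hypothetical example: two crawl runs of the same (edited) course list.
run1 = [("CSCI-101", "Intro to Programming"),         # before the site edit
        ("CSCI-201", "Data Structures")]
run2 = [("CSCI-101", "Introduction to Programming"),  # title edited by admin
        ("CSCI-201", "Data Structures")]

# Naive union across runs keeps both versions of CSCI-101:
naive_union = set(run1) | set(run2)
print(len(naive_union))  # 3 -- one course appears twice, slightly different

# Keying by course number, with later runs overwriting earlier records:
merged = {}
for run in (run1, run2):          # runs in chronological order
    for number, title in run:
        merged[number] = title
print(len(merged))  # 2
```

The merged dictionary keeps only the latest title per course, which is usually what you want when the site is being edited under you.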
Re: Parsing/Crawler Questions - solution
On Mar 7, 12:19 am, rounderwe...@gmail.com wrote: So, it sounds like your update means that it is related to a specific url. I'm curious about this issue myself. I've often wondered how one could properly crawl an AJAX-ish site when you're not sure how quickly the data will be returned after the page has been loaded.

you want to look at the webkit engine - no, not the graphical browser - the ParseTree example - and combine it with pywebkitgtk - no, not the original version, the one which has DOM-manipulation bindings through webkit-glib.

the webkit parse tree example, despite being based on the "GTK port" as they like to call it in webkit (which just means that it links with GTK, not QT4 or wxWidgets), is a console-based application. in other words, despite it being GTK, it still does NOT output graphical crap to the screen, yet it still *executes* the javascript on the page. dummy functions for mouse, keyboard and console errors are given as examples and are left as an exercise for the application writer to fill in the blanks.

combining this parse tree example with pywebkitgtk (see demobrowser.py) would provide a means by which web pages can be executed AT THE CONSOLE, NOT AS A GUI APP; then, thanks to the glib / gobject bindings, a python app will be able to walk the DOM tree as expected.

i _just_ fixed pyjamas-desktop's iterators in the pyjamas.DOM module for someone, on the pyjamas-dev mailing list.

http://github.com/lkcl/pyjamas-desktop/tree/8ed365b89efe5d1d3451c3e3ced662a2dd014540

so, actually, you may be better off starting from pyjamas-desktop and then cutting out the "fire up the GTK window" bit from pyjd.py. pyjd.py is based on pywebkitgtk's demobrowser.py.

the alternative to webkit is to use python-hulahop - it will do the same thing, but just using python bindings to gecko instead of python-bindings-to-glib-bindings-to-webkit.

l.

-- http://mail.python.org/mailman/listinfo/python-list
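The DOM walk being described -- execute the page in a headless WebKit, then iterate over anchors from Python -- would look roughly like the sketch below. In the real setup, `doc` would be the document object exposed by pywebkitgtk's webkit-glib bindings after the page (and its javascript) has finished loading; here a small stub stands in for it so the walking logic itself is visible. The stub classes and names are illustrative, not the actual pywebkitgtk API.

```python
# Sketch of walking an already-executed DOM. `doc` would really come
# from pywebkitgtk's webkit-glib DOM bindings; Stub* are placeholders.

class StubAnchor:
    def __init__(self, href, text):
        self.href = href
        self.text = text

class StubDocument:
    """Stands in for the post-javascript DOM document."""
    def __init__(self, anchors):
        self._anchors = anchors

    def get_elements_by_tag_name(self, tag):
        return list(self._anchors) if tag == "a" else []

def extract_links(doc):
    """Collect anchor hrefs from the (already-executed) DOM."""
    return [a.href for a in doc.get_elements_by_tag_name("a")]

doc = StubDocument([StubAnchor("/soc/term_20091.html", "Schedule of Classes"),
                    StubAnchor("/ws/soc/api/classes/aest/20091", "AEST")])
print(extract_links(doc))
# ['/soc/term_20091.html', '/ws/soc/api/classes/aest/20091']
```

The point of the layering in the thread is that once javascript has run, this loop sees the AJAX-populated anchors, not just the ones present in the raw HTML.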
RE: Parsing/Crawler Questions - solution
and this solution will somehow allow a user to create a web parsing/scraping app for parsing links and javascript from a web page?

-----Original Message-----
From: python-list-bounces+bedouglas=earthlink@python.org [mailto:python-list-bounces+bedouglas=earthlink@python.org] On Behalf Of lkcl
Sent: Saturday, March 07, 2009 2:34 AM
To: python-list@python.org
Subject: Re: Parsing/Crawler Questions - solution

[snip]

-- http://mail.python.org/mailman/listinfo/python-list
Re: Parsing/Crawler Questions - solution
So, it sounds like your update means that it is related to a specific url. I'm curious about this issue myself. I've often wondered how one could properly crawl an AJAX-ish site when you're not sure how quickly the data will be returned after the page has been loaded.

John, your advice has really helped me. Bruce / anyone else, have you had any further experience with this type of parsing / crawling?

On Mar 5, 2:50 pm, bruce bedoug...@earthlink.net wrote:

[snip]

-- http://mail.python.org/mailman/listinfo/python-list
RE: Parsing/Crawler Questions..
hi john..

You're missing the issue, so a little clarification... I've got a number of test parsers that point to a given classlist site.. the scripts work. the issue that one faces is that you never know if you've gotten all of the items/links that you're looking for based on the XPath functions. This could be due to an error in the parsing, or it could be due to an admin changing the site (removing/adding courses etc...)

So I'm trying to figure out an approach to handling these issues... As far as I can tell, an approach might be to run the parser script across the target site X number of times within a narrow timeframe (a few minutes). Based on the results of this process, you might be able to develop an overall tree of what the actual class/course links/list should be. But you don't know from hour to hour, day to day if this list is stable, as it could change.. The only way you know for certain is to physically examine a site. You can't do this if you're going to develop an automated system for 5-10 sites, or for 500-1000... These are the issues that I'm grappling with.. not how to write the XPath parsing functions...

Thanks..

-----Original Message-----
From: python-list-bounces+bedouglas=earthlink@python.org [mailto:python-list-bounces+bedouglas=earthlink@python.org] On Behalf Of John Nagle
Sent: Wednesday, March 04, 2009 10:23 PM
To: python-list@python.org
Subject: Re: Parsing/Crawler Questions..

bruce wrote: hi phillip... thanks for taking a sec to reply... i'm solid on the test app i've created.. but as an example.. i have a parser for usc (southern cal) and it extracts the courselist/class schedule... my issue was that i realized that multiple runs of the app were giving different results... in my case, the class schedule isn't static.. (actually, none of the class/course lists need be static.. they could easily change). so i don't have a priori knowledge of what the actual class/course list site would look like, unless i physically examined the site, each time i run the app... i'm inclined to think i might need to run the parser a number of times within a given time frame, and then take a union/join of the output of the different runs.. this would, in theory, give me a high probability that i'd get 100% of the class list...

I think I see the problem. I took a look at the USC class list, and it's been made "Web 2.0". When you read the page, you don't get the class list; you get a Javascript thing that builds a class list on demand, using JSON, no less. See "http://web-app.usc.edu/soc/term_20091.html". I'm not sure how you're handling this. The Javascript actually has to be run before you get anything.

John Nagle

-- http://mail.python.org/mailman/listinfo/python-list
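Nagle's observation -- that the raw HTML contains only a javascript loader, not the course list -- can be checked programmatically before deciding whether a plain urllib fetch is enough. A quick heuristic sketch (the page snippet below is fabricated for illustration; the real page would come from urllib):

```python
# Heuristic check for a "Web 2.0" page: the data you want is absent from
# the static HTML while <script> tags are present. Sample HTML made up.

raw_html = """
<html><head>
<script src="/soc/dev/scripts/soc.js"></script>
<script>var coursesrc = '/ws/soc/api/classes/aest/20091';</script>
</head><body><div id="courses">Loading...</div></body></html>
"""

def looks_js_rendered(html, expected_marker):
    """True if a marker we expect in the rendered page (e.g. a course
    code) is missing from the static HTML, which has <script> tags."""
    return expected_marker not in html and "<script" in html

print(looks_js_rendered(raw_html, "AEST-220"))  # True: must run the JS
```

A `True` here means a crawler that only reads the served HTML will never see the course list, which is exactly the situation Nagle describes.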
Re: Parsing/Crawler Questions..
bruce wrote:

hi john.. You're missing the issue, so a little clarification... I've got a number of test parsers that point to a given classlist site.. the scripts work. the issue that one faces is that you never know if you've gotten all of the items/links that you're looking for based on the XPath functions. This could be due to an error in the parsing, or it could be due to an admin changing the site (removing/adding courses etc...)

What URLs are you looking at?

John Nagle

-- http://mail.python.org/mailman/listinfo/python-list
RE: Parsing/Crawler Questions..
hi..

the url i'm focusing on is irrelevant to the issue i'm trying to solve at this time. i think an approach will be to fire up a number of parsing attempts, and to track the returned depts/classes/etc... in theory (hopefully) i should be able to create a process to build a kind of statistical representation of what the site looks like (names of depts, names/number of classes for given depts, etc..). if i'm correct, this would provide a complete list/understanding of what the courselist looks like.

i could then run the parsing process a number of times, examining the actual values/results for each query, and taking the highest/oldest values for the given query.. the idea being that the app will return correct results for most of the queries, most of the time.. so, on a statistical basis, i can take the results that are returned with the highest frequency... so this approach might work. but again, i haven't seen anything in the literature/'net that talks about this...

thoughts...

thanks

-----Original Message-----
From: python-list-bounces+bedouglas=earthlink@python.org [mailto:python-list-bounces+bedouglas=earthlink@python.org] On Behalf Of John Nagle
Sent: Thursday, March 05, 2009 8:38 AM
To: python-list@python.org
Subject: Re: Parsing/Crawler Questions..

[snip]

-- http://mail.python.org/mailman/listinfo/python-list
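bruce's "highest frequency" idea can be sketched directly: run the parser N times, count how often each item shows up, and keep the items seen in a majority of runs. This is only an illustration of the voting step (the course codes are made up), not a full crawler:

```python
from collections import Counter

def consolidate(runs, threshold=0.5):
    """Keep items that appeared in more than `threshold` of the runs.
    `runs` is a list of per-run result sets from repeated parses."""
    counts = Counter()
    for run in runs:
        counts.update(set(run))   # count each item at most once per run
    needed = len(runs) * threshold
    return {item for item, n in counts.items() if n > needed}

# Three parse runs; one run dropped a course, one picked up a spurious entry.
runs = [
    {"CSCI-101", "CSCI-201", "MATH-110"},
    {"CSCI-101", "MATH-110"},                          # flaky fetch
    {"CSCI-101", "CSCI-201", "MATH-110", "JUNK-999"},  # parse glitch
]
print(sorted(consolidate(runs)))
# ['CSCI-101', 'CSCI-201', 'MATH-110']
```

Items that show up in only one run out of three (the parse glitch) are voted out, while a course missed by a single flaky run is still kept.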
Re: Parsing/Crawler Questions..
On Mar 5, 2009, at 12:31 PM, bruce wrote: hi.. the url i'm focusing on is irrelevant to the issue i'm trying to solve at this time.

Not if we're to understand the situation you're trying to describe. From what I can tell, you're saying that the target site displays different results each time your crawler visits it. It's as if, e.g., the site knows about 100 courses but only displays 80 randomly chosen ones to each visitor. If that's the case, then it is truly bizarre.

[snip]

-- http://mail.python.org/mailman/listinfo/python-list
Re: Parsing/Crawler Questions - solution
Philip Semanchuk wrote: On Mar 5, 2009, at 12:31 PM, bruce wrote: hi.. the url i'm focusing on is irrelevant to the issue i'm trying to solve at this time.

Not if we're to understand the situation you're trying to describe. From what I can tell, you're saying that the target site displays different results each time your crawler visits it. It's as if, e.g., the site knows about 100 courses but only displays 80 randomly chosen ones to each visitor. If that's the case, then it is truly bizarre.

Agreed. The course list isn't changing that rapidly. I suspect the original poster is doing something like reading the DOM of a dynamic page while the page is still updating, running a browser in a subprocess. Is that right?

I've had to deal with that in Javascript. My AdRater browser plug-in (http://www.sitetruth.com/downloads) looks at Google-served ads and rates the advertisers. There, I have to watch for page-change events and update the annotations I'm adding to ads. But you don't need to work that hard here.

The USC site is actually querying a server which provides the requested data in JSON format. See

   http://web-app.usc.edu/soc/dev/scripts/soc.js

Reverse-engineer that and you'll be able to get the underlying data. (It's an amusing script; many little fixes to data items are performed, something that should have been done at the database front end.)

The way to get USC class data is this:

1. Start here: http://web-app.usc.edu/soc/term_20091.html
2. Examine all the department pages under that page.
3. On each page, look for the value of "coursesrc", like this:

   var coursesrc = '/ws/soc/api/classes/aest/20091'

4. For each "coursesrc" value found, construct a URL like this:

   http://web-app.usc.edu/ws/soc/api/classes/aest/20091

5. Read that URL. This will return the department's course list in JSON format.
6. From the JSON tree, pull out "CourseData" items, which look like this:

   "CourseData": {
       "prefix": "AEST", "number": "220", "sequence": "B", "suffix": {},
       "title": "Advanced Leadership Laboratory II",
       "description": "Additional exposure to the military experience for continuing AFROTC cadets, focusing on customs and courtesies, drill and ceremonies, and the environment of an Air Force officer. Credit\/No Credit.",
       "units": "1",
       "restriction_by_major": {}, "restriction_by_class": {}, "restriction_by_school": {},
       "CourseNotes": {}, "CourseTermNotes": {},
       "prereq_text": "AEST-220A", "coreq_text": {},
       "SectionData": {
           "id": "41799", "session": "790", "dclass_code": "D",
           "title": "Advanced Leadership Laboratory II",
           "section_title": {}, "description": {}, "notes": {},
           "type": "Lec", "units": "1",
           "spaces_available": "30", "number_registered": "2", "wait_qty": "0",
           "canceled": "N", "blackboard": "Y", "comment": {},
           "day": {}, "start_time": "TBA", "end_time": "TBA",
           "location": "OFFICE",
           "instructor": {"last_name": "Hampton", "first_name": "Daniel"},
           "syllabus": {"format": {}, "filesize": {}},
           "IsDistanceLearning": "N"
       }
   }

Parsing the JSON is left as an exercise for the student. (There's a Python module for that.) And no, the data isn't changing; you can read those pages of JSON over and over and get the same data every time.

John Nagle

-- http://mail.python.org/mailman/listinfo/python-list
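Nagle's recipe can be sketched in a few lines of (modern) Python. The network parts are stubbed out here -- in practice you would fetch with urllib -- and the sample page and JSON below are cut-down fabrications; in particular the top-level "OfferedCourses" key is made up, so inspect the real response for the actual structure. The coursesrc extraction and URL construction follow steps 3-6 above:

```python
import json
import re

BASE = "http://web-app.usc.edu"

def find_coursesrc(dept_page_html):
    """Step 3: pull the coursesrc value out of a department page."""
    m = re.search(r"var\s+coursesrc\s*=\s*'([^']+)'", dept_page_html)
    return m.group(1) if m else None

def api_url(coursesrc):
    """Step 4: turn a coursesrc path into the JSON endpoint URL."""
    return BASE + coursesrc

def course_data(json_text):
    """Step 6: pull CourseData items out of the department's JSON.
    NOTE: the "OfferedCourses" key is an assumption for this sketch."""
    tree = json.loads(json_text)
    return [c["CourseData"] for c in tree.get("OfferedCourses", [])]

# Cut-down, fabricated samples standing in for the real pages:
dept_page = "<script>var coursesrc = '/ws/soc/api/classes/aest/20091'</script>"
dept_json = '{"OfferedCourses": [{"CourseData": {"prefix": "AEST", "number": "220"}}]}'

src = find_coursesrc(dept_page)
print(api_url(src))            # http://web-app.usc.edu/ws/soc/api/classes/aest/20091
print(course_data(dept_json))  # [{'prefix': 'AEST', 'number': '220'}]
```

The payoff of scraping the JSON endpoint rather than the rendered page is that no javascript needs to be executed at all.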
RE: Parsing/Crawler Questions - solution
john...

again the problem i'm facing really has nothing to do with a specific url... the app i have for the usc site works... but for any number of reasons... you might get different results when running the app..

-the server could be screwed up..
-data might be cached..
-data might be changed, and not updated..
-actual app problems...
-networking issues...
-memory corruption issues...
-process constraint issues..
-web server overload..
-etc...

the assumption that most people appear to make is that if you create a parser, and run and test it once.. then if it gets you the data, it's working.. when you run the same app.. 100s of times, and you're slamming the webserver... then you realize that that's a vastly different animal than simply running a single query a few times...

so.. nope, i'm not running the app and getting data from a dynamic page that hasn't finished uploading/creating the content.. but what my analysis is showing, not only for the usc site, but for others as well.. is that there might be differences in what gets returned... which is where a smoothing algorithmic approach appears to be workable.. i've been starting to test this approach, and it actually might have a chance of working...

so.. as i've stated a number of times.. focusing on a specific url isn't the issue.. the larger issue is how you can programmatically/algorithmically/automatically be reasonably ensured that what you have is exactly what's on the site...

ain't screen scraping fun!!!

-----Original Message-----
From: python-list-bounces+bedouglas=earthlink@python.org [mailto:python-list-bounces+bedouglas=earthlink@python.org] On Behalf Of John Nagle
Sent: Thursday, March 05, 2009 10:54 AM
To: python-list@python.org
Subject: Re: Parsing/Crawler Questions - solution

[snip]

-- http://mail.python.org/mailman/listinfo/python-list
RE: Parsing/Crawler Questions - solution
hi john... update... further investigation has revealed that apparently, for some urls/sites, the server serves up pages that take awhile to be fetched... this appears to be a potential problem, in that it appears that the parsescript never gets anything from the python mech/urllib read function. the curious issue is that i can run a single test script, pointing to the url, and after a bit of time.. the resulting content is fetched/downloaded correctly. by the way, i can get the same results in my test browsing environment, if i start it with only a subset of the urs that i've been using to test the app. hmm... might be a resource issue, a timing issue,.. or something else... hmmm... thanks again the problem i'm facing really has nothing to do with a specific url... the app i have for the usc site works... but for any number of reasons... you might get different results when running the app.. -the server could be screwed up.. -data might be cached -data might be changed, and not updated.. -actual app problems... -networking issues... -memory corruption issues... -process constraint issues.. -web server overload.. -etc... the assumption that most people appear to make is that if you create a parser, and run and test it once.. then if it gets you the data, it's working.. when you run the same app.. 100s of times, and you're slamming the webserver... then you realize that that's a vastly different animal than simply running a snigle query a few times... so.. nope, i'm not running the app and getting data from a dynamic page that hasn't finished uploading/creating the content.. but what my analysis is showing, not only for the usc, but for others as well.. is that there might be differences in what gets returned... which is where a smoothing algorithmic approach appears to be workable.. i've been starting to test this approach, and it actually might have a chance of working... so.. as i've stated a number of times.. focusing on a specific url isn't the issue.. 
the larger issue is how you can programatically/algorithmically/automatically, be reasonably ensured that what you have is exactly what's on the site... ain't screen scraping fun!!! -Original Message- From: python-list-bounces+bedouglas=earthlink@python.org [mailto:python-list-bounces+bedouglas=earthlink@python.org]on Behalf Of John Nagle Sent: Thursday, March 05, 2009 10:54 AM To: python-list@python.org Subject: Re: Parsing/Crawler Questions - solution Philip Semanchuk wrote: On Mar 5, 2009, at 12:31 PM, bruce wrote: hi.. the url i'm focusing on is irrelevant to the issue i'm trying to solve at this time. Not if we're to understand the situation you're trying to describe. From what I can tell, you're saying that the target site displays different results each time your crawler visits it. It's as if e.g. the site knows about 100 courses but only displays 80 randomly chosen ones to each visitor. If that's the case, then it is truly bizarre. Agreed. The course list isn't changing that rapidly. I suspect the original poster is doing something like reading the DOM of a dynamic page while the page is still updating, running a browser in a subprocess. Is that right? I've had to deal with that in Javascript. My AdRater browser plug-in (http://www.sitetruth.com/downloads) looks at Google-served ads and rates the advertisers. There, I have to watch for page-change events and update the annotations I'm adding to ads. But you don't need to work that hard here. The USC site is actually querying a server which provides the requested data in JSON format. See http://web-app.usc.edu/soc/dev/scripts/soc.js Reverse-engineer that and you'll be able to get the underlying data. (It's an amusing script; many little fixes to data items are performed, something that should have been done at the database front end.) The way to get USC class data is this: 1. Start here: http://web-app.usc.edu/soc/term_20091.html; 2. Examine all the department pages under that page. 3. 
On each page, look for the value of coursesrc, like this:

   var coursesrc = '/ws/soc/api/classes/aest/20091'

4. For each coursesrc value found, construct a URL like this:

   http://web-app.usc.edu/ws/soc/api/classes/aest/20091

5. Read that URL. This will return the department's course list in JSON format.

6. From the JSON tree, pull out CourseData items, which look like this:

CourseData: {"prefix":"AEST", "number":"220", "sequence":"B", "suffix":{}, "title":"Advanced Leadership Laboratory II", "description":"Additional exposure to the military experience for continuing AFROTC cadets, focusing on customs and courtesies, drill and ceremonies, and the environment of an Air Force officer. Credit\/No Credit.", "units":"1", "restriction_by_major":{}, "restriction_by_class":{}, "restriction_by_school":{}, "CourseNotes":{}, "CourseTermNotes":{}, "prereq_text":"AEST-220A", "coreq_text":{}, "SectionData":{"id":"41799", "session":"790", "dclass_code":"D
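Sketched in Python, steps 3-6 might look something like this (the helper names here are mine, and the exact nesting of the JSON tree is an assumption that has to be checked against a live response):

```python
import json
import re

BASE = "http://web-app.usc.edu"

def extract_coursesrc(page_html):
    # step 3: find every coursesrc value in a department page's inline script
    return re.findall(r"var\s+coursesrc\s*=\s*'([^']+)'", page_html)

def api_url(coursesrc):
    # step 4: turn '/ws/soc/api/classes/aest/20091' into a fetchable URL
    return BASE + coursesrc

def course_titles(json_text):
    # step 6: pull CourseData items out of the JSON fetched in step 5.
    # NOTE: the OfferedCourses/course nesting is a guess, not taken from
    # the thread -- verify it against a real response.
    tree = json.loads(json_text)
    return [c["CourseData"]["title"] for c in tree["OfferedCourses"]["course"]]

# demo on a snippet shaped like the department pages described above
page = "<script>var coursesrc = '/ws/soc/api/classes/aest/20091'</script>"
print([api_url(s) for s in extract_coursesrc(page)])
# -> ['http://web-app.usc.edu/ws/soc/api/classes/aest/20091']
```

In real use, step 5 would be a urllib fetch of each api_url; the demo above only shows the string handling.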
Re: Parsing/Crawler Questions..
bruce wrote:

Hi... Sorry that this is a bit off track. Ok, maybe way off track! But I don't have anyone to bounce this off of..

I'm working on a crawling project, crawling a college website, to extract course/class information. I've built a quick test app in python to crawl the site. I crawl at the top level, and work my way down to getting the required course/class schedule. The app works. I can consistently run it and extract the information. The required information is based upon an XPath analysis of the DOM for the given pages that I'm parsing.

My issue is now that I have a basic app that works, I need to figure out how I guarantee that I'm correctly crawling the site. How do I know when I've got an error at a given node/branch, so that the app knows that it's not going to fetch the underlying branch/nodes of the tree..

[snip]

If you were crawling the site yourself, how would _you_ know when you had an error at a given node/branch?
--
http://mail.python.org/mailman/listinfo/python-list
Re: Parsing/Crawler Questions..
On Mar 4, 2009, at 4:44 PM, bruce wrote:

Hi... Sorry that this is a bit off track. Ok, maybe way off track! But I don't have anyone to bounce this off of..

I'm working on a crawling project, crawling a college website, to extract course/class information. I've built a quick test app in python to crawl the site. I crawl at the top level, and work my way down to getting the required course/class schedule. The app works. I can consistently run it and extract the information. The required information is based upon an XPath analysis of the DOM for the given pages that I'm parsing.

My issue is now that I have a basic app that works, I need to figure out how I guarantee that I'm correctly crawling the site. How do I know when I've got an error at a given node/branch, so that the app knows that it's not going to fetch the underlying branch/nodes of the tree..

When running the app, I can get 5000 classes on one run, 4700 on another, etc... So I need some method of determining when I get a complete tree... How do I know when I have a complete tree!

hi Bruce,

To put this another way, you're trying to convince yourself that your program is correct, yes? For instance, you're worried that you might be doing something like discovering a URL on a site but failing to pursue that URL, yes?

The standard way of testing any program is to feed it known input and look for expected output. Repeat as necessary. In your case that would mean crawling a site where you know all of the URLs and seeing if your program finds them all. And that, of course, isn't proof of correctness; it just means that that particular site didn't trigger any error conditions that would cause your program to misbehave.

I think every modern OS makes it easy to run a Web server on your local machine. You might want to set up a suite of test sites on your machine and point your program at localhost. That way you can build a site to test your application in areas you fear it may be weak.
I'm unclear on what you're using to parse the pages, but (X)HTML in the wild is very often invalid in the strict sense. If the tools you're using expect/insist on well-formed XML or valid HTML, they'll be disappointed on most sites and you'll probably be missing URLs. The canonical solution for parsing real-world Web pages with Python is BeautifulSoup.

HTH
Philip
--
http://mail.python.org/mailman/listinfo/python-list
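As a point of comparison, even the stdlib's lenient parser will pull links out of fairly sloppy markup; a minimal sketch (BeautifulSoup remains the better tool here, with real error recovery):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, tolerating sloppy markup."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lowercased, even for <A HREF=...>
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# deliberately invalid HTML: unclosed tags, uppercase, unquoted attribute
messy = "<P><A HREF='/soc/term_20091.html'>term<p><a href=/dept.html>dept"
parser = LinkCollector()
parser.feed(messy)
print(parser.links)
# -> ['/soc/term_20091.html', '/dept.html']
```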
RE: Parsing/Crawler Questions..
hi philip... thanks for taking a sec to reply...

i'm solid on the test app i've created.. but as an example.. i have a parser for usc (southern cal) and it extracts the course list/class schedule... my issue was that i realized that multiple runs of the app were giving different results...

in my case, the class schedule isn't static.. (actually, none of the class/course lists need be static.. they could easily change). so i don't have a priori knowledge of what the actual class/course list site would look like, unless i physically examined the site each time i run the app...

i'm inclined to think i might need to run the parser a number of times within a given time frame, and then take a union/join of the output of the different runs.. this would, in theory, give me a high probability that i'd get 100% of the class list...

most crawlers, and most research that i've seen, focus on the indexing or crawling function/architecture.. haven't really seen any articles/research/pointers dealing with this kind of process...

thoughts/comments are welcome..

thanks

-----Original Message-----
From: python-list-bounces+bedouglas=earthlink@python.org
[mailto:python-list-bounces+bedouglas=earthlink@python.org] On Behalf Of Philip Semanchuk
Sent: Wednesday, March 04, 2009 6:15 PM
To: python-list (General)
Subject: Re: Parsing/Crawler Questions..

On Mar 4, 2009, at 4:44 PM, bruce wrote:

Hi... Sorry that this is a bit off track. Ok, maybe way off track! But I don't have anyone to bounce this off of..

I'm working on a crawling project, crawling a college website, to extract course/class information. I've built a quick test app in python to crawl the site. I crawl at the top level, and work my way down to getting the required course/class schedule. The app works. I can consistently run it and extract the information. The required information is based upon an XPath analysis of the DOM for the given pages that I'm parsing.
My issue is now that I have a basic app that works, I need to figure out how I guarantee that I'm correctly crawling the site. How do I know when I've got an error at a given node/branch, so that the app knows that it's not going to fetch the underlying branch/nodes of the tree..

When running the app, I can get 5000 classes on one run, 4700 on another, etc... So I need some method of determining when I get a complete tree... How do I know when I have a complete tree!

hi Bruce,

To put this another way, you're trying to convince yourself that your program is correct, yes? For instance, you're worried that you might be doing something like discovering a URL on a site but failing to pursue that URL, yes?

The standard way of testing any program is to feed it known input and look for expected output. Repeat as necessary. In your case that would mean crawling a site where you know all of the URLs and seeing if your program finds them all. And that, of course, isn't proof of correctness; it just means that that particular site didn't trigger any error conditions that would cause your program to misbehave.

I think every modern OS makes it easy to run a Web server on your local machine. You might want to set up a suite of test sites on your machine and point your program at localhost. That way you can build a site to test your application in areas you fear it may be weak.

I'm unclear on what you're using to parse the pages, but (X)HTML in the wild is very often invalid in the strict sense. If the tools you're using expect/insist on well-formed XML or valid HTML, they'll be disappointed on most sites and you'll probably be missing URLs. The canonical solution for parsing real-world Web pages with Python is BeautifulSoup.

HTH
Philip
--
http://mail.python.org/mailman/listinfo/python-list
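bruce's union-of-runs idea could be sketched like this (keying courses by a stable id, and the two-run "seen at least twice" threshold, are my assumptions, not anything specified in the thread):

```python
def merge_runs(runs):
    """Union the class lists from several crawl runs, keyed by course id.
    Later runs win if details conflict."""
    merged = {}
    for run in runs:
        for course_id, details in run.items():
            merged[course_id] = details
    return merged

def stable(runs, required=2):
    """Keep only courses seen in at least `required` runs --
    a crude confidence filter against flaky fetches."""
    counts = {}
    for run in runs:
        for course_id in run:
            counts[course_id] = counts.get(course_id, 0) + 1
    return {cid for cid, n in counts.items() if n >= required}

# two runs that each missed a different course
run1 = {"AEST-220B": "Advanced Leadership Lab II", "CSCI-101": "Intro"}
run2 = {"AEST-220B": "Advanced Leadership Lab II", "MATH-225": "Linear Algebra"}
print(sorted(merge_runs([run1, run2])))   # every id seen in any run
# -> ['AEST-220B', 'CSCI-101', 'MATH-225']
print(sorted(stable([run1, run2])))       # only ids seen in both runs
# -> ['AEST-220B']
```

The union maximizes coverage; the threshold filter trades coverage for confidence that an entry wasn't a one-off fetch artifact.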
Re: Parsing/Crawler Questions..
bruce wrote:

hi philip... thanks for taking a sec to reply...

i'm solid on the test app i've created.. but as an example.. i have a parser for usc (southern cal) and it extracts the course list/class schedule... my issue was that i realized that multiple runs of the app were giving different results...

in my case, the class schedule isn't static.. (actually, none of the class/course lists need be static.. they could easily change). so i don't have a priori knowledge of what the actual class/course list site would look like, unless i physically examined the site each time i run the app...

i'm inclined to think i might need to run the parser a number of times within a given time frame, and then take a union/join of the output of the different runs.. this would, in theory, give me a high probability that i'd get 100% of the class list...

I think I see the problem. I took a look at the USC class list, and it's been made Web 2.0. When you read the page, you don't get the class list; you get a Javascript thing that builds a class list on demand, using JSON, no less. See

http://web-app.usc.edu/soc/term_20091.html

I'm not sure how you're handling this. The Javascript actually has to be run before you get anything.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list