Re: Parsing/Crawler Questions - solution

2009-03-08 Thread lkcl
On Mar 7, 9:56 pm, bruce bedoug...@earthlink.net wrote:
 

 and this solution will somehow allow a user to create a web parsing/scraping
 app for parsing links, and javascript from a web page?


 not just parsing the links and the static javascript, but:

 * actually executing the javascript, giving the "page" a
chance to actually _look_ like it would if it was being viewed in a
"real" web browser.

 so any XMLHTTPRequests will _actually_ get executed, _actually_
result in _actually_ having the content of the web page _properly_
modified.

 so, e.g. instead of seeing a "Loader" page on gmail you would
_actually_ see the user's email and the adverts (assuming you went to
the trouble of putting in the username/password) because the AJAX
would _actually_ get executed by the WebKit engine, and the DOM model
accessed thereafter.


 * giving the user the opportunity to call DOM methods such as
getElementsByTagName and the opportunity to access properties such as
document.anchors.

  in webkit-glib gdom bindings, that would be:

 * anchor_list = gdom_document_get_elements_by_tag_name(doc, "a");

or

 * g_object_get(doc, "anchors", &anchor_list, NULL);

  which in pywebkitgtk (thanks to python-pygobject auto-generation of
python bindings from gobject bindings) translates into:

 * doc.get_elements_by_tag_name("a")

or

 * doc.props.anchors

  which in pyjamas-desktop, a high-level abstraction on top of _that_,
turns into:

 * from pyjamas import DOM
   anchor_list = DOM.getElementsByTagName(doc, "a")

or

 * from pyjamas import DOM
   anchor_list = DOM.getAttribute(doc, "anchors")
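
 a minimal sketch tying those pieces together (assuming the
pyjamas-desktop bindings above, plus the node-list iterator support
mentioned further down the thread; reading "href" through
DOM.getAttribute is an assumption about these bindings, not something
documented here):

    from pyjamas import DOM

    def list_hrefs(doc):
        # doc is the document object exposed by the running engine
        for anchor in DOM.getElementsByTagName(doc, "a"):
            # "href" lookup via getAttribute is an assumption
            print DOM.getAttribute(anchor, "href")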

answer: yes.

l.

 -Original Message-
 From: python-list-bounces+bedouglas=earthlink@python.org

 [mailto:python-list-bounces+bedouglas=earthlink@python.org]on Behalf
 Of lkcl
 Sent: Saturday, March 07, 2009 2:34 AM
 To: python-l...@python.org
 Subject: Re: Parsing/Crawler Questions - solution

 On Mar 7, 12:19 am, rounderwe...@gmail.com wrote:
  So, it sounds like your update means that it is related to a specific
  url.

  I'm curious about this issue myself.  I've often wondered how one
 could properly crawl an AJAX-ish site when you're not sure how quickly
 the data will be returned after the page has been loaded.

  you want to look at the webkit engine - no, not the graphical browser
 - the ParseTree example - and combine it with pywebkitgtk - no, not the
 original version, but the one which has DOM-manipulation bindings
 through webkit-glib.

 the webkit parse tree example, despite being based on the "GTK
 port" as they like to call it in webkit (which just means that it
 links with GTK rather than QT4 or wxWidgets), is a console-based application.

 in other words, despite it being GTK, it still does NOT output
 graphical crap to the screen, yet it still *executes* the javascript
 on the page.

 dummy functions for mouse, keyboard, console errors are given as
 examples and are left as an exercise for the application writer to
 fill-in-the-blanks.

 combining this parse tree example with pywebkitgtk (see
 demobrowser.py) would provide a means by which web pages can be
 executed AT THE CONSOLE NOT AS A GUI APP, then, thanks to the glib /
 gobject bindings, a python app will be able to walk the DOM tree as
 expected.

 i _just_ fixed pyjamas-desktop's iterators in the pyjamas.DOM module
 for someone, on the pyjamas-dev mailing list.

 http://github.com/lkcl/pyjamas-desktop/tree/8ed365b89efe5d1d3451c3e3ced662a2dd014540

 so, actually, you may be better off starting from pyjamas-desktop and
 then cutting out the "fire up the GTK window" bit from pyjd.py.

 pyjd.py is based on pywebkitgtk's demobrowser.py

 the alternative to webkit is to use python-hulahop - it will do the
 same thing, but just using python bindings to gecko instead of python-
 bindings-to-glib-bindings-to-webkit.

 l.
 --http://mail.python.org/mailman/listinfo/python-list

--
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing/Crawler Questions - solution

2009-03-07 Thread Lie Ryan

bruce wrote:

john...

again the problem i'm facing really has nothing to do with a specific
url... the app i have for the usc site works...

but for any number of reasons... you might get different results when
running the app..
-the server could be screwed up..
-data might be cached
-data might be changed, and not updated..
-actual app problems...
-networking issues...
-memory corruption issues...
-process constraint issues..
-web server overload..
-etc...

the assumption that most people appear to make is that if you create a
parser, and run and test it once.. then if it gets you the data, it's
working.. when you run the same app.. 100s of times, and you're slamming the
webserver... then you realize that that's a vastly different animal than
simply running a single query a few times...


The assumption is that most websites edit and remove data from time to time, 
and that using the union of data collected throughout several runs might 
populate your program with redundant (but slightly different) or 
outdated data. The further assumption is that this redundant or outdated 
data is not useful for most people.

--
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing/Crawler Questions - solution

2009-03-07 Thread lkcl
On Mar 7, 12:19 am, rounderwe...@gmail.com wrote:
 So, it sounds like your update means that it is related to a specific
 url.

 I'm curious about this issue myself.  I've often wondered how one
 could properly crawl an AJAX-ish site when you're not sure how quickly
 the data will be returned after the page has been loaded.

 you want to look at the webkit engine - no, not the graphical browser
- the ParseTree example - and combine it with pywebkitgtk - no, not the
original version, but the one which has DOM-manipulation bindings
through webkit-glib.

the webkit parse tree example, despite being based on the "GTK
port" as they like to call it in webkit (which just means that it
links with GTK rather than QT4 or wxWidgets), is a console-based application.

in other words, despite it being GTK, it still does NOT output
graphical crap to the screen, yet it still *executes* the javascript
on the page.

dummy functions for mouse, keyboard, console errors are given as
examples and are left as an exercise for the application writer to
fill-in-the-blanks.

combining this parse tree example with pywebkitgtk (see
demobrowser.py) would provide a means by which web pages can be
executed AT THE CONSOLE NOT AS A GUI APP, then, thanks to the glib /
gobject bindings, a python app will be able to walk the DOM tree as
expected.

i _just_ fixed pyjamas-desktop's iterators in the pyjamas.DOM module
for someone, on the pyjamas-dev mailing list.


http://github.com/lkcl/pyjamas-desktop/tree/8ed365b89efe5d1d3451c3e3ced662a2dd014540

so, actually, you may be better off starting from pyjamas-desktop and
then cutting out the "fire up the GTK window" bit from pyjd.py.

pyjd.py is based on pywebkitgtk's demobrowser.py

the alternative to webkit is to use python-hulahop - it will do the
same thing, but just using python bindings to gecko instead of python-
bindings-to-glib-bindings-to-webkit.


l.
--
http://mail.python.org/mailman/listinfo/python-list


RE: Parsing/Crawler Questions - solution

2009-03-07 Thread bruce


and this solution will somehow allow a user to create a web parsing/scraping
app for parsing links, and javascript from a web page?


-Original Message-
From: python-list-bounces+bedouglas=earthlink@python.org
[mailto:python-list-bounces+bedouglas=earthlink@python.org]on Behalf
Of lkcl
Sent: Saturday, March 07, 2009 2:34 AM
To: python-list@python.org
Subject: Re: Parsing/Crawler Questions - solution


On Mar 7, 12:19 am, rounderwe...@gmail.com wrote:
 So, it sounds like your update means that it is related to a specific
 url.

 I'm curious about this issue myself.  I've often wondered how one
 could properly crawl an AJAX-ish site when you're not sure how quickly
 the data will be returned after the page has been loaded.

 you want to look at the webkit engine - no, not the graphical browser
- the ParseTree example - and combine it with pywebkitgtk - no, not the
original version, but the one which has DOM-manipulation bindings
through webkit-glib.

the webkit parse tree example, despite being based on the "GTK
port" as they like to call it in webkit (which just means that it
links with GTK rather than QT4 or wxWidgets), is a console-based application.

in other words, despite it being GTK, it still does NOT output
graphical crap to the screen, yet it still *executes* the javascript
on the page.

dummy functions for mouse, keyboard, console errors are given as
examples and are left as an exercise for the application writer to
fill-in-the-blanks.

combining this parse tree example with pywebkitgtk (see
demobrowser.py) would provide a means by which web pages can be
executed AT THE CONSOLE NOT AS A GUI APP, then, thanks to the glib /
gobject bindings, a python app will be able to walk the DOM tree as
expected.

i _just_ fixed pyjamas-desktop's iterators in the pyjamas.DOM module
for someone, on the pyjamas-dev mailing list.


http://github.com/lkcl/pyjamas-desktop/tree/8ed365b89efe5d1d3451c3e3ced662a2dd014540

so, actually, you may be better off starting from pyjamas-desktop and
then cutting out the "fire up the GTK window" bit from pyjd.py.

pyjd.py is based on pywebkitgtk's demobrowser.py

the alternative to webkit is to use python-hulahop - it will do the
same thing, but just using python bindings to gecko instead of python-
bindings-to-glib-bindings-to-webkit.


l.
--
http://mail.python.org/mailman/listinfo/python-list

--
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing/Crawler Questions - solution

2009-03-06 Thread rounderweget
So, it sounds like your update means that it is related to a specific
url.

I'm curious about this issue myself.  I've often wondered how one
could properly crawl an AJAX-ish site when you're not sure how quickly
the data will be returned after the page has been loaded.

John, your advice has really helped me.  Bruce / anyone else, have you
had any further experience with this type of parsing / crawling?


On Mar 5, 2:50 pm, bruce bedoug...@earthlink.net wrote:
 hi john...

 update...

 further investigation has revealed that apparently, for some urls/sites, the
 server serves up pages that take a while to be fetched... this appears to be
 a potential problem, in that it appears that the parse script never gets
 anything from the python mech/urllib read function.

 the curious issue is that i can run a single test script, pointing to the
 url, and after a bit of time.. the resulting content is fetched/downloaded
 correctly. by the way, i can get the same results in my test browsing
 environment, if i start it with only a subset of the urls that i've been
 using to test the app.

 hmm... might be a resource issue, a timing issue,.. or something else...
 hmmm...

 thanks

 again the problem i'm facing really has nothing to do with a specific
 url... the app i have for the usc site works...

 but for any number of reasons... you might get different results when
 running the app..
 -the server could be screwed up..
 -data might be cached
 -data might be changed, and not updated..
 -actual app problems...
 -networking issues...
 -memory corruption issues...
 -process constraint issues..
 -web server overload..
 -etc...

 the assumption that most people appear to make is that if you create a
 parser, and run and test it once.. then if it gets you the data, it's
 working.. when you run the same app.. 100s of times, and you're slamming the
 webserver... then you realize that that's a vastly different animal than
 simply running a single query a few times...

 so.. nope, i'm not running the app and getting data from a dynamic page that
 hasn't finished uploading/creating the content..

 but what my analysis is showing, not only for the usc, but for others as
 well.. is that there might be differences in what gets returned...

 which is where a smoothing algorithmic approach appears to be workable..

 i've been starting to test this approach, and it actually might have a
 chance of working...

 so.. as i've stated a number of times.. focusing on a specific url isn't the
 issue.. the larger issue is how you can
 programmatically/algorithmically/automatically, be reasonably ensured that
 what you have is exactly what's on the site...

 ain't screen scraping fun!!!

 -Original Message-
 From: python-list-bounces+bedouglas=earthlink@python.org

 [mailto:python-list-bounces+bedouglas=earthlink@python.org]on Behalf
 Of John Nagle
 Sent: Thursday, March 05, 2009 10:54 AM
 To: python-l...@python.org
 Subject: Re: Parsing/Crawler Questions - solution

 Philip Semanchuk wrote:
  On Mar 5, 2009, at 12:31 PM, bruce wrote:

  hi..

  the url i'm focusing on is irrelevant to the issue i'm trying to solve at
  this time.

  Not if we're to understand the situation you're trying to describe. From
  what I can tell, you're saying that the target site displays different
  results each time your crawler visits it. It's as if e.g. the site knows
  about 100 courses but only displays 80 randomly chosen ones to each
  visitor. If that's the case, then it is truly bizarre.

      Agreed.  The course list isn't changing that rapidly.

      I suspect the original poster is doing something like reading the DOM
 of a dynamic page while the page is still updating, running a browser
 in a subprocess.  Is that right?

      I've had to deal with that in Javascript.  My AdRater browser plug-in
 (http://www.sitetruth.com/downloads) looks at Google-served ads and
 rates the advertisers.   There, I have to watch for page-change events
 and update the annotations I'm adding to ads.

      But you don't need to work that hard here. The USC site is actually
 querying a server which provides the requested data in JSON format.  See

        http://web-app.usc.edu/soc/dev/scripts/soc.js

 Reverse-engineer that and you'll be able to get the underlying data.
 (It's an amusing script; many little fixes to data items are performed,
 something that should have been done at the database front end.)

 The way to get USC class data is this:

 1.  Start here: "http://web-app.usc.edu/soc/term_20091.html"
 2.  Examine all the department pages under that page.
 3.  On each page, look for the value of coursesrc, like this:
         var coursesrc = '/ws/soc/api/classes/aest/20091'
 4.  For each coursesrc value found, construct a URL like this:
        http://web-app.usc.edu/ws/soc/api/classes/aest/20091
 5.  Read that URL.  This will return the department's course list in
      JSON format.
 6.  From the JSON tree, pull out CourseData items, which look like

RE: Parsing/Crawler Questions..

2009-03-05 Thread bruce
hi john..

You're missing the issue, so a little clarification...

I've got a number of test parsers that point to a given classlist site.. the
scripts work.

the issue that one faces is that you never know if you've gotten all of
the items/links that you're looking for based on the XPath functions. This
could be due to an error in the parsing, or it could be due to an admin
changing the site (removing/adding courses etc...)

So I'm trying to figure out an approach to handling these issues...

As far as I can tell... An approach might be to run the parser script across
the target site X number of times within a narrow timeframe (a few minutes).
Based on the results of this process, you might be able to develop an
overall tree of what the actual class/course links/list should be. But you
don't know from hour to hour, day to day if this list is stable, as it could
change..

The only way you know for certain is to physically examine a site. You can't
do this if you're going to develop an automated system for 5-10 sites, or
for 500-1000...

These are the issues that I'm grappling with.. not how to write the XPath
parsing functions...

Thanks..


-Original Message-
From: python-list-bounces+bedouglas=earthlink@python.org
[mailto:python-list-bounces+bedouglas=earthlink@python.org]on Behalf
Of John Nagle
Sent: Wednesday, March 04, 2009 10:23 PM
To: python-list@python.org
Subject: Re: Parsing/Crawler Questions..


bruce wrote:
 hi phillip...

 thanks for taking a sec to reply...

 i'm solid on the test app i've created.. but as an example.. i have a parse
 for usc (southern cal) and it extracts the courselist/class schedule... my
 issue was that i realized the multiple runs of the app was giving different
 results... in my case, the class schedule isn't static.. (actually, none of
 the class/course lists need be static.. they could easily change).

 so i don't have a priori knowledge of what the actual class/course list site
 would look like, unless i physically examined the site, each time i run the
 app...

 i'm inclined to think i might need to run the parser a number of times
 within a given time frame, and then take a union/join of the output of the
 different runs.. this would in theory, give me a high probability that i'd
 get 100% of the class list...

 I think I see the problem.  I took a look at the USC class list, and
it's been made Web 2.0.  When you read the page, you don't get the
class list; you get a Javascript thing that builds a class list on
demand, using JSON, no less.

 See "http://web-app.usc.edu/soc/term_20091.html".

 I'm not sure how you're handling this.  The Javascript actually
has to be run before you get anything.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list

--
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing/Crawler Questions..

2009-03-05 Thread John Nagle

bruce wrote:

hi john..

You're missing the issue, so a little clarification...

I've got a number of test parsers that point to a given classlist site.. the
scripts work.

the issue that one faces is that you never know if you've gotten all of
the items/links that you're looking for based on the XPath functions. This
could be due to an error in the parsing, or it could be due to an admin
changing the site (removing/adding courses etc...)


   What URLs are you looking at?

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


RE: Parsing/Crawler Questions..

2009-03-05 Thread bruce
hi..

the url i'm focusing on is irrelevant to the issue i'm trying to solve at
this time.

i think an approach will be to fire up a number of parsing attempts, and to
track the returned depts/classes/etc... in theory (hopefully) i should be
able to create a process to build a kind of statistical representation of
what the site looks like (names of depts, names/number of classes for given
depts, etc..) if i'm correct, this would provide a complete
list/understanding of what the courselist looks like

i could then run the parsing process a number of times, examining the actual
value/results for the query, and taking the highest/oldest values for the
given query.. the idea being that the app will return correct results for
most of the queries, most of the time.. so from a statistical basis.. i can
take the results that are returned with the highest frequency...

so this approach might work. but again, haven't seen anything in the
literature/'net that talks about this...
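
a rough sketch of that frequency idea (run_parser() here is a
hypothetical stand-in for one complete fetch-and-parse pass, yielding
hashable records such as (dept, course) tuples):

    from collections import defaultdict

    def majority_snapshot(run_parser, runs=5):
        # count how often each record shows up across several runs,
        # then keep only the records seen in a majority of them
        counts = defaultdict(int)
        for _ in range(runs):
            for record in run_parser():
                counts[record] += 1
        return [record for record, n in counts.items() if n > runs // 2]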


thoughts...

thanks



-Original Message-
From: python-list-bounces+bedouglas=earthlink@python.org
[mailto:python-list-bounces+bedouglas=earthlink@python.org]on Behalf
Of John Nagle
Sent: Thursday, March 05, 2009 8:38 AM
To: python-list@python.org
Subject: Re: Parsing/Crawler Questions..


bruce wrote:
 hi john..

 You're missing the issue, so a little clarification...

 I've got a number of test parsers that point to a given classlist site..
the
 scripts work.

 the issue that one faces is that you never know if you've gotten all of
 the items/links that you're looking for based on the XPath functions. This
 could be due to an error in the parsing, or it could be due to an admin
 changing the site (removing/adding courses etc...)

What URLs are you looking at?

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list

--
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing/Crawler Questions..

2009-03-05 Thread Philip Semanchuk


On Mar 5, 2009, at 12:31 PM, bruce wrote:


hi..

the url i'm focusing on is irrelevant to the issue i'm trying to solve at
this time.


Not if we're to understand the situation you're trying to describe.  
From what I can tell, you're saying that the target site displays  
different results each time your crawler visits it. It's as if e.g.  
the site knows about 100 courses but only displays 80 randomly chosen  
ones to each visitor. If that's the case, then it is truly bizarre.

i think an approach will be to fire up a number of parsing attempts, and to
track the returned depts/classes/etc... in theory (hopefully) i should be
able to create a process to build a kind of statistical representation of
what the site looks like (names of depts, names/number of classes for given
depts, etc..) if i'm correct, this would provide a complete
list/understanding of what the courselist looks like

i could then run the parsing process a number of times, examining the actual
value/results for the query, and taking the highest/oldest values for the
given query.. the idea being that the app will return correct results for
most of the queries, most of the time.. so from a statistical basis.. i can
take the results that are returned with the highest frequency...


thoughts...

thanks



-Original Message-
From: python-list-bounces+bedouglas=earthlink@python.org
[mailto:python-list-bounces+bedouglas=earthlink@python.org]on Behalf
Of John Nagle
Sent: Thursday, March 05, 2009 8:38 AM
To: python-list@python.org
Subject: Re: Parsing/Crawler Questions..


bruce wrote:

hi john..

You're missing the issue, so a little clarification...

I've got a number of test parsers that point to a given classlist site.. the
scripts work.

the issue that one faces is that you never know if you've gotten all of
the items/links that you're looking for based on the XPath functions. This
could be due to an error in the parsing, or it could be due to an admin
changing the site (removing/adding courses etc...)


   What URLs are you looking at?

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list

--
http://mail.python.org/mailman/listinfo/python-list


--
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing/Crawler Questions - solution

2009-03-05 Thread John Nagle

Philip Semanchuk wrote:

On Mar 5, 2009, at 12:31 PM, bruce wrote:


hi..

the url i'm focusing on is irrelevant to the issue i'm trying to solve at
this time.


Not if we're to understand the situation you're trying to describe. From 
what I can tell, you're saying that the target site displays different 
results each time your crawler visits it. It's as if e.g. the site knows 
about 100 courses but only displays 80 randomly chosen ones to each 
visitor. If that's the case, then it is truly bizarre.


Agreed.  The course list isn't changing that rapidly.

I suspect the original poster is doing something like reading the DOM
of a dynamic page while the page is still updating, running a browser
in a subprocess.  Is that right?

I've had to deal with that in Javascript.  My AdRater browser plug-in
(http://www.sitetruth.com/downloads) looks at Google-served ads and
rates the advertisers.   There, I have to watch for page-change events
and update the annotations I'm adding to ads.

But you don't need to work that hard here. The USC site is actually
querying a server which provides the requested data in JSON format.  See

http://web-app.usc.edu/soc/dev/scripts/soc.js

Reverse-engineer that and you'll be able to get the underlying data.
(It's an amusing script; many little fixes to data items are performed,
something that should have been done at the database front end.)

The way to get USC class data is this:

1.  Start here: "http://web-app.usc.edu/soc/term_20091.html"
2.  Examine all the department pages under that page.
3.  On each page, look for the value of coursesrc, like this:
var coursesrc = '/ws/soc/api/classes/aest/20091'
4.  For each coursesrc value found, construct a URL like this:
http://web-app.usc.edu/ws/soc/api/classes/aest/20091
5.  Read that URL.  This will return the department's course list in
JSON format.
6.  From the JSON tree, pull out CourseData items, which look like this:

"CourseData":
{"prefix": "AEST",
"number": "220",
"sequence": "B",
"suffix": {},
"title": "Advanced Leadership Laboratory II",
"description": "Additional exposure to the military experience for continuing 
AFROTC cadets, focusing on customs and courtesies, drill and ceremonies, and the 
environment of an Air Force officer. Credit\/No Credit.",

"units": "1",
"restriction_by_major": {},
"restriction_by_class": {},
"restriction_by_school": {},
"CourseNotes": {},
"CourseTermNotes": {},
"prereq_text": "AEST-220A",
"coreq_text": {},
"SectionData": {"id": "41799",
"session": "790",
"dclass_code": "D",
"title": "Advanced Leadership Laboratory II",
"section_title": {},
"description": {},
"notes": {},
"type": "Lec",
"units": "1",
"spaces_available": "30",
"number_registered": "2",
"wait_qty": "0",
"canceled": "N",
"blackboard": "Y",
"comment": {},
"day": {}, "start_time": "TBA",
"end_time": "TBA",
"location": "OFFICE",
"instructor": {"last_name": "Hampton", "first_name": "Daniel"},
"syllabus": {"format": {}, "filesize": {}},
"IsDistanceLearning": "N"}}},

Parsing the JSON is left as an exercise for the student.  (There's
a Python module for that.)
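
     A short sketch of steps 3-6 above (assuming Python 2.6's json module
in the standard library; the regular expression and the tree walk are
illustrative, not lifted from soc.js):

    import json, re, urllib2

    BASE = "http://web-app.usc.edu"

    def department_courses(dept_page_url):
        html = urllib2.urlopen(dept_page_url).read()
        # steps 3-4: pull the coursesrc value out of the department page
        m = re.search(r"var coursesrc = '([^']+)'", html)
        if m is None:
            return []
        # step 5: fetch the JSON course list for that department
        data = json.load(urllib2.urlopen(BASE + m.group(1)))
        # step 6: walk the tree and collect the CourseData items
        courses = []
        def walk(node):
            if isinstance(node, dict):
                if "CourseData" in node:
                    courses.append(node["CourseData"])
                for value in node.values():
                    walk(value)
            elif isinstance(node, list):
                for value in node:
                    walk(value)
        walk(data)
        return courses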

And no, the data isn't changing; you can read those pages of JSON over and
over and get the same data every time.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


RE: Parsing/Crawler Questions - solution

2009-03-05 Thread bruce
john...

again the problem i'm facing really has nothing to do with a specific
url... the app i have for the usc site works...

but for any number of reasons... you might get different results when
running the app..
-the server could be screwed up..
-data might be cached
-data might be changed, and not updated..
-actual app problems...
-networking issues...
-memory corruption issues...
-process constraint issues..
-web server overload..
-etc...

the assumption that most people appear to make is that if you create a
parser, and run and test it once.. then if it gets you the data, it's
working.. when you run the same app.. 100s of times, and you're slamming the
webserver... then you realize that that's a vastly different animal than
simply running a single query a few times...

so.. nope, i'm not running the app and getting data from a dynamic page that
hasn't finished uploading/creating the content..

but what my analysis is showing, not only for the usc, but for others as
well.. is that there might be differences in what gets returned...

which is where a smoothing algorithmic approach appears to be workable..

i've been starting to test this approach, and it actually might have a
chance of working...

so.. as i've stated a number of times.. focusing on a specific url isn't the
issue.. the larger issue is how you can
programmatically/algorithmically/automatically, be reasonably ensured that
what you have is exactly what's on the site...

ain't screen scraping fun!!!



-Original Message-
From: python-list-bounces+bedouglas=earthlink@python.org
[mailto:python-list-bounces+bedouglas=earthlink@python.org]on Behalf
Of John Nagle
Sent: Thursday, March 05, 2009 10:54 AM
To: python-list@python.org
Subject: Re: Parsing/Crawler Questions - solution


Philip Semanchuk wrote:
 On Mar 5, 2009, at 12:31 PM, bruce wrote:

 hi..

 the url i'm focusing on is irrelevant to the issue i'm trying to solve at
 this time.

 Not if we're to understand the situation you're trying to describe. From
 what I can tell, you're saying that the target site displays different
 results each time your crawler visits it. It's as if e.g. the site knows
 about 100 courses but only displays 80 randomly chosen ones to each
 visitor. If that's the case, then it is truly bizarre.

 Agreed.  The course list isn't changing that rapidly.

 I suspect the original poster is doing something like reading the DOM
of a dynamic page while the page is still updating, running a browser
in a subprocess.  Is that right?

 I've had to deal with that in Javascript.  My AdRater browser plug-in
(http://www.sitetruth.com/downloads) looks at Google-served ads and
rates the advertisers.   There, I have to watch for page-change events
and update the annotations I'm adding to ads.

 But you don't need to work that hard here. The USC site is actually
querying a server which provides the requested data in JSON format.  See

http://web-app.usc.edu/soc/dev/scripts/soc.js

Reverse-engineer that and you'll be able to get the underlying data.
(It's an amusing script; many little fixes to data items are performed,
something that should have been done at the database front end.)

The way to get USC class data is this:

1.  Start here: "http://web-app.usc.edu/soc/term_20091.html"
2.  Examine all the department pages under that page.
3.  On each page, look for the value of coursesrc, like this:
var coursesrc = '/ws/soc/api/classes/aest/20091'
4.  For each coursesrc value found, construct a URL like this:
http://web-app.usc.edu/ws/soc/api/classes/aest/20091
5.  Read that URL.  This will return the department's course list in
 JSON format.
6.  From the JSON tree, pull out CourseData items, which look like this:

"CourseData":
{"prefix": "AEST",
"number": "220",
"sequence": "B",
"suffix": {},
"title": "Advanced Leadership Laboratory II",
"description": "Additional exposure to the military experience for continuing
AFROTC cadets, focusing on customs and courtesies, drill and ceremonies, and
the environment of an Air Force officer. Credit\/No Credit.",
"units": "1",
"restriction_by_major": {},
"restriction_by_class": {},
"restriction_by_school": {},
"CourseNotes": {},
"CourseTermNotes": {},
"prereq_text": "AEST-220A",
"coreq_text": {},
"SectionData": {"id": "41799",
"session": "790",
"dclass_code": "D",
"title": "Advanced Leadership Laboratory II",
"section_title": {},
"description": {},
"notes": {},
"type": "Lec",
"units": "1",
"spaces_available": "30",
"number_registered": "2",
"wait_qty": "0",
"canceled": "N",
"blackboard": "Y",
"comment": {},
"day": {}, "start_time": "TBA",
"end_time": "TBA",
"location": "OFFICE",
"instructor": {"last_name": "Hampton", "first_name": "Daniel"},
"syllabus": {"format": {}, "filesize": {}},
"IsDistanceLearning": "N"}}},

Parsing the JSON is left as an exercise for the student.  (There's
a Python module for that.)

And no, the data isn't changing; you can read those pages of JSON over and
over and get the same data every time.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list

--
http://mail.python.org/mailman/listinfo/python-list

RE: Parsing/Crawler Questions - solution

2009-03-05 Thread bruce
hi john...


update...

further investigation has revealed that apparently, for some urls/sites, the
server serves up pages that take a while to be fetched... this appears to be
a potential problem, in that it appears that the parse script never gets
anything from the python mech/urllib read function.

the curious issue is that i can run a single test script, pointing to the
url, and after a bit of time.. the resulting content is fetched/downloaded
correctly. by the way, i can get the same results in my test browsing
environment, if i start it with only a subset of the urls that i've been
using to test the app.

hmm... might be a resource issue, a timing issue,.. or something else...
hmmm...
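
one way to tell a slow server apart from a real failure (a rough
sketch, assuming python 2.6's timeout argument to urllib2.urlopen; the
retry policy is just an illustration):

    import socket
    import time
    import urllib2

    def fetch_with_patience(url, timeout=60, retries=3, delay=5):
        # give a slow server a generous timeout, and retry a couple of
        # times before concluding the page really cannot be fetched
        for attempt in range(retries):
            try:
                return urllib2.urlopen(url, timeout=timeout).read()
            except (urllib2.URLError, socket.timeout):
                if attempt == retries - 1:
                    raise
                time.sleep(delay)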

thanks




again the problem i'm facing really has nothing to do with a specific
url... the app i have for the usc site works...

but for any number of reasons... you might get different results when
running the app..
-the server could be screwed up..
-data might be cached
-data might be changed, and not updated..
-actual app problems...
-networking issues...
-memory corruption issues...
-process constraint issues..
-web server overload..
-etc...

the assumption that most people appear to make is that if you create a
parser, and run and test it once.. then if it gets you the data, it's
working.. when you run the same app.. 100s of times, and you're slamming the
webserver... then you realize that that's a vastly different animal than
simply running a single query a few times...

so.. nope, i'm not running the app and getting data from a dynamic page that
hasn't finished uploading/creating the content..

but what my analysis is showing, not only for the usc, but for others as
well.. is that there might be differences in what gets returned...

which is where a smoothing algorithmic approach appears to be workable..

i've been starting to test this approach, and it actually might have a
chance of working...

so.. as i've stated a number of times.. focusing on a specific url isn't the
issue.. the larger issue is how you can
programmatically/algorithmically/automatically, be reasonably ensured that
what you have is exactly what's on the site...

ain't screen scraping fun!!!



-Original Message-
From: python-list-bounces+bedouglas=earthlink@python.org
[mailto:python-list-bounces+bedouglas=earthlink@python.org]on Behalf
Of John Nagle
Sent: Thursday, March 05, 2009 10:54 AM
To: python-list@python.org
Subject: Re: Parsing/Crawler Questions - solution


Philip Semanchuk wrote:
 On Mar 5, 2009, at 12:31 PM, bruce wrote:

 hi..

 the url i'm focusing on is irrelevant to the issue i'm trying to solve at
 this time.

 Not if we're to understand the situation you're trying to describe. From
 what I can tell, you're saying that the target site displays different
 results each time your crawler visits it. It's as if e.g. the site knows
 about 100 courses but only displays 80 randomly chosen ones to each
 visitor. If that's the case, then it is truly bizarre.

 Agreed.  The course list isn't changing that rapidly.

 I suspect the original poster is doing something like reading the DOM
of a dynamic page while the page is still updating, running a browser
in a subprocess.  Is that right?

 I've had to deal with that in Javascript.  My AdRater browser plug-in
(http://www.sitetruth.com/downloads) looks at Google-served ads and
rates the advertisers.   There, I have to watch for page-change events
and update the annotations I'm adding to ads.

 But you don't need to work that hard here. The USC site is actually
querying a server which provides the requested data in JSON format.  See

http://web-app.usc.edu/soc/dev/scripts/soc.js

Reverse-engineer that and you'll be able to get the underlying data.
(It's an amusing script; many little fixes to data items are performed,
something that should have been done at the database front end.)

The way to get USC class data is this:

1.  Start here: "http://web-app.usc.edu/soc/term_20091.html"
2.  Examine all the department pages under that page.
3.  On each page, look for the value of coursesrc, like this:
var coursesrc = '/ws/soc/api/classes/aest/20091'
4.  For each coursesrc value found, construct a URL like this:
http://web-app.usc.edu/ws/soc/api/classes/aest/20091
5.  Read that URL.  This will return the department's course list in
 JSON format.
6.  From the JSON tree, pull out CourseData items, which look like this:

"CourseData":
{"prefix": "AEST",
"number": "220",
"sequence": "B",
"suffix": {},
"title": "Advanced Leadership Laboratory II",
"description": "Additional exposure to the military experience for continuing
AFROTC cadets, focusing on customs and courtesies, drill and ceremonies, and
the environment of an Air Force officer. Credit\/No Credit.",
"units": "1",
"restriction_by_major": {},
"restriction_by_class": {},
"restriction_by_school": {},
"CourseNotes": {},
"CourseTermNotes": {},
"prereq_text": "AEST-220A",
"coreq_text": {},
"SectionData": {"id": "41799",
"session": "790",
"dclass_code": "D"

Re: Parsing/Crawler Questions..

2009-03-04 Thread MRAB

bruce wrote:

Hi...

Sorry that this is a bit off track. Ok, maybe way off track!

But I don't have anyone to bounce this off of..

I'm working on a crawling project, crawling a college website, to extract
course/class information. I've built a quick test app in python to crawl the
site. I crawl at the top level, and work my way down to getting the required
course/class schedule. The app works. I can consistently run it and extract
the information. The required information is based upon an XPath analysis of
the DOM for the given pages that I'm parsing.

My issue is now that I have a basic app that works, I need to figure out
how I guarantee that I'm correctly crawling the site. How do I know when
I've got an error at a given node/branch, so that the app knows that it's
not going to fetch the underlying branch/nodes of the tree..


[snip]
If you were crawling the site yourself, how would _you_ know when you
had an error at a given node/branch?

--
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing/Crawler Questions..

2009-03-04 Thread Philip Semanchuk


On Mar 4, 2009, at 4:44 PM, bruce wrote:


Hi...

Sorry that this is a bit off track. Ok, maybe way off track!

But I don't have anyone to bounce this off of..

I'm working on a crawling project, crawling a college website, to extract
course/class information. I've built a quick test app in python to crawl the
site. I crawl at the top level, and work my way down to getting the required
course/class schedule. The app works. I can consistently run it and extract
the information. The required information is based upon an XPath analysis of
the DOM for the given pages that I'm parsing.

My issue is now that I have a basic app that works, I need to figure out
how I guarantee that I'm correctly crawling the site. How do I know when
I've got an error at a given node/branch, so that the app knows that it's
not going to fetch the underlying branch/nodes of the tree..

When running the app, I can get 5000 classes on one run, 4700 on another,
etc... So I need some method of determining when I get a complete tree...


How do I know when I have a complete tree!



hi Bruce,
To put this another way, you're trying to convince yourself that your  
program is correct, yes? For instance, you're worried that you might  
be doing something like discovering a URL on a site but failing to  
pursue that URL, yes?


The standard way of testing any program is to test known input and  
look for expected output. Repeat as necessary. In your case that would  
mean crawling a site where you know all of the URLs and to see if your  
program finds them all. And that, of course, isn't proof of  
correctness, it just means that that particular site didn't trigger  
any error conditions that would cause your program to misbehave.


I think every modern OS makes it easy to run a Web server on your local
machine. You might want to set up a suite of test sites on your machine and
point your program at localhost. That way you can build a site to test your
application in areas you fear it may be weak.
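
For example (a sketch; test_site_copy is a hypothetical directory of
saved pages), Python's own SimpleHTTPServer is enough for that kind of
local test harness (the same thing "python -m SimpleHTTPServer" gives
you from the command line):

    # serve a directory of captured pages at http://localhost:8000/
    import os
    import SimpleHTTPServer
    import SocketServer

    os.chdir("test_site_copy")   # hypothetical directory of saved pages
    handler = SimpleHTTPServer.SimpleHTTPRequestHandler
    SocketServer.TCPServer(("", 8000), handler).serve_forever()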


I'm unclear on what you're using to parse the pages, but (X)HTML is  
very often invalid in the strict sense of validity. If the tools  
you're using expect/insist on well-formed XML or valid HTML, they'll  
be disappointed on most sites and you'll probably be missing URLs. The  
canonical solution for parsing real-world Web pages with Python is  
BeautifulSoup.
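
A minimal sketch with the BeautifulSoup 3 API of the day (fetching and
link extraction only; error handling left out):

    import urllib2
    from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3.x

    def page_links(url):
        # BeautifulSoup copes with the invalid HTML most real sites serve
        soup = BeautifulSoup(urllib2.urlopen(url).read())
        return [a["href"] for a in soup.findAll("a", href=True)]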


HTH
Philip






--
http://mail.python.org/mailman/listinfo/python-list


RE: Parsing/Crawler Questions..

2009-03-04 Thread bruce
hi phillip...

thanks for taking a sec to reply...

i'm solid on the test app i've created.. but as an example.. i have a parse
for usc (southern cal) and it extracts the courselist/class schedule... my
issue was that i realized the multiple runs of the app was giving different
results... in my case, the class schedule isn't static.. (actually, none of
the class/course lists need be static.. they could easily change).

so i don't have a priori knowledge of what the actual class/course list site
would look like, unless i physically examined the site, each time i run the
app...

i'm inclined to think i might need to run the parser a number of times
within a given time frame, and then take a union/join of the output of the
different runs.. this would in theory, give me a high probability that i'd
get 100% of the class list...
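
a sketch of that union/join idea (run_parser() is a hypothetical
stand-in for one complete run of the parser, yielding hashable course
records):

    def combined_class_list(run_parser, runs=5):
        # union of everything seen across several runs, on the theory that
        # a transient miss in one run will be caught by another
        combined = set()
        for _ in range(runs):
            combined.update(run_parser())
        return sorted(combined)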

most crawlers, and most research that i've seen focus on the indexing, or
crawling function/architecture.. haven't really seen any
articles/research/pointers dealing with this kind of process...

thoughts/comments are welcome..

thanks



-Original Message-
From: python-list-bounces+bedouglas=earthlink@python.org
[mailto:python-list-bounces+bedouglas=earthlink@python.org]on Behalf
Of Philip Semanchuk
Sent: Wednesday, March 04, 2009 6:15 PM
To: python-list (General)
Subject: Re: Parsing/Crawler Questions..



On Mar 4, 2009, at 4:44 PM, bruce wrote:

 Hi...

 Sorry that this is a bit off track. Ok, maybe way off track!

 But I don't have anyone to bounce this off of..

 I'm working on a crawling project, crawling a college website, to
 extract
 course/class information. I've built a quick test app in python to
 crawl the
 site. I crawl at the top level, and work my way down to getting the
 required
 course/class schedule. The app works. I can consistently run it and
 extract
 the information. The required information is based upon an XPath
 analysis of
 the DOM for the given pages that I'm parsing.

 My issue is now that I have a basic app that works, I need to
 figure out
 how I guarantee that I'm correctly crawling the site. How do I know
 when
 I've got an error at a given node/branch, so that the app knows that
 it's
 not going to fetch the underlying branch/nodes of the tree..

 When running the app, I can get 5000 classes on one run, 4700 on
 another,
 etc... So I need some method of determining when I get a complete
 tree...

 How do I know when I have a complete tree!


hi Bruce,
To put this another way, you're trying to convince yourself that your
program is correct, yes? For instance, you're worried that you might
be doing something like discovering a URL on a site but failing to
pursue that URL, yes?

The standard way of testing any program is to test known input and
look for expected output. Repeat as necessary. In your case that would
mean crawling a site where you know all of the URLs and to see if your
program finds them all. And that, of course, isn't proof of
correctness, it just means that that particular site didn't trigger
any error conditions that would cause your program to misbehave.

I think every modern OS makes it easy to run a Web server on your
local machine. You might want to set up a suite of test sites on your
machine and point your program at localhost. That way you can build a
site to test your application in areas you fear it may be weak.

I'm unclear on what you're using to parse the pages, but (X)HTML is
very often invalid in the strict sense of validity. If the tools
you're using expect/insist on well-formed XML or valid HTML, they'll
be disappointed on most sites and you'll probably be missing URLs. The
canonical solution for parsing real-world Web pages with Python is
BeautifulSoup.

HTH
Philip






--
http://mail.python.org/mailman/listinfo/python-list

--
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing/Crawler Questions..

2009-03-04 Thread John Nagle

bruce wrote:

hi phillip...

thanks for taking a sec to reply...

i'm solid on the test app i've created.. but as an example.. i have a parse
for usc (southern cal) and it extracts the courselist/class schedule... my
issue was that i realized the multiple runs of the app was giving different
results... in my case, the class schedule isn't static.. (actually, none of
the class/course lists need be static.. they could easily change).

so i don't have a priori knowledge of what the actual class/course list site
would look like, unless i physically examined the site, each time i run the
app...

i'm inclined to think i might need to run the parser a number of times
within a given time frame, and then take a union/join of the output of the
different runs.. this would in theory, give me a high probability that i'd
get 100% of the class list...


I think I see the problem.  I took a look at the USC class list, and
it's been made Web 2.0.  When you read the page, you don't get the
class list; you get a Javascript thing that builds a class list on
demand, using JSON, no less.

See "http://web-app.usc.edu/soc/term_20091.html".

I'm not sure how you're handling this.  The Javascript actually
has to be run before you get anything.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list