Philip Semanchuk wrote:
> On Mar 5, 2009, at 12:31 PM, bruce wrote:
>> hi..
>> the url i'm focusing on is irrelevant to the issue i'm trying to solve at
>> this time.
>
> Not if we're to understand the situation you're trying to describe. From
> what I can tell, you're saying that the target site displays different
> results each time your crawler visits it. It's as if e.g. the site knows
> about 100 courses but only displays 80 randomly chosen ones to each
> visitor. If that's the case, then it is truly bizarre.

Agreed. The course list isn't changing that rapidly.
I suspect the original poster is doing something like reading the DOM
of a dynamic page while the page is still updating, running a browser
in a subprocess. Is that right?
I've had to deal with that in JavaScript. My AdRater browser plug-in
(http://www.sitetruth.com/downloads) looks at Google-served ads and
rates the advertisers. There, I have to watch for page-change events
and update the annotations I'm adding to ads.
But you don't need to work that hard here. The USC site is actually
querying a server which provides the requested data in JSON format. See
http://web-app.usc.edu/soc/dev/scripts/soc.js
Reverse-engineer that and you'll be able to get the underlying data.
(It's an amusing script; many little fixes to data items are performed,
something that should have been done at the database front end.)
The way to get USC class data is this:
1. Start here: "http://web-app.usc.edu/soc/term_20091.html"
2. Examine all the department pages under that page.
3. On each page, look for the value of "coursesrc", like this:
var coursesrc = '/ws/soc/api/classes/aest/20091'
4. For each "coursesrc" value found, construct a URL like this:
http://web-app.usc.edu/ws/soc/api/classes/aest/20091
5. Read that URL. This will return the department's course list in
JSON format.
6. From the JSON tree, pull out CourseData items, which look like this:
"CourseData":
{"prefix":"AEST",
"number":"220",
"sequence":"B",
"suffix":{},
"title":"Advanced Leadership Laboratory II",
"description":"Additional exposure to the military experience for continuing
AFROTC cadets, focusing on customs and courtesies, drill and ceremonies, and the
environment of an Air Force officer. Credit\/No Credit.",
"units":"1",
"restriction_by_major":{},
"restriction_by_class":{},
"restriction_by_school":{},
"CourseNotes":{},
"CourseTermNotes":{},
"prereq_text":"AEST-220A",
"coreq_text":{},
"SectionData":{"id":"41799",
"session":"790",
"dclass_code":"D",
"title":"Advanced Leadership Laboratory II",
"section_title":{},
"description":{},
"notes":{},
"type":"Lec",
"units":"1",
"spaces_available":"30",
"number_registered":"2",
"wait_qty":"0",
"canceled":"N",
"blackboard":"Y",
"comment":{},
"day":{},
"start_time":"TBA",
"end_time":"TBA",
"location":"OFFICE",
"instructor":{"last_name":"Hampton","first_name":"Daniel"},
"syllabus":{"format":{},"filesize":{}},
"IsDistanceLearning":"N"}}},
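
Steps 3 and 4 above can be sketched in Python roughly as follows. This is only an illustration, not tested against the live site: it assumes the "coursesrc" assignment always appears in the page text in the form shown in step 3, and the names BASE_URL, COURSESRC_RE and find_course_urls are made up for this example. It is written for Python 3.

```python
import re
from urllib.parse import urljoin

# Host taken from the URLs in the steps above.
BASE_URL = "http://web-app.usc.edu/"

# Matches lines such as: var coursesrc = '/ws/soc/api/classes/aest/20091'
# (assumption: the value is a quoted string literal, single or double quotes)
COURSESRC_RE = re.compile(r"""var\s+coursesrc\s*=\s*['"]([^'"]+)['"]""")


def find_course_urls(page_text):
    """Return the JSON URL for every coursesrc assignment found in page_text."""
    return [urljoin(BASE_URL, path) for path in COURSESRC_RE.findall(page_text)]


# Stand-in for the text of one department page.
sample_page = "<script>var coursesrc = '/ws/soc/api/classes/aest/20091'</script>"
course_urls = find_course_urls(sample_page)
print(course_urls)
```

Fetching each department page and each resulting URL (steps 2 and 5) could be done with urllib.request.urlopen; that part is left out here so the sketch runs without network access.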
Parsing the JSON is left as an exercise for the student. (There's
a Python module for that.)
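
For anyone who wants a head start on step 6, here is a minimal sketch using the standard library's json module. The overall shape of the JSON tree is not shown above, so the code simply walks the whole tree and collects whatever sits under a "CourseData" key; the "wrapper" key in the sample and the helper names iter_course_data and text_or_none are hypothetical. The sample also reflects the feed's habit of using {} for empty fields.

```python
import json


def iter_course_data(node):
    """Yield every value stored under a "CourseData" key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "CourseData":
                # The sample shows a single object; accept a list as well.
                if isinstance(value, list):
                    yield from value
                else:
                    yield value
            else:
                yield from iter_course_data(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_course_data(item)


def text_or_none(value):
    """The feed uses {} for empty fields; map that to None."""
    return None if value == {} else value


# Trimmed copy of the sample above; the "wrapper" key is made up.
sample = json.loads("""
{"wrapper": {"CourseData": {"prefix": "AEST", "number": "220",
  "sequence": "B", "suffix": {},
  "title": "Advanced Leadership Laboratory II", "units": "1",
  "prereq_text": "AEST-220A", "coreq_text": {}}}}
""")

courses = list(iter_course_data(sample))
for course in courses:
    sequence = text_or_none(course["sequence"]) or ""
    print(course["prefix"], course["number"] + sequence, course["title"])
```

With real data, the string passed to json.loads would be the body read from one of the URLs built in step 4.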
And no, the data isn't changing; you can read those pages of JSON over and
over and get the same data every time.
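
If you want to check that for yourself, one way (a sketch, with the helper name json_fingerprint invented here) is to hash each read in a canonical form and compare the hashes. Parsing first and re-serializing with sorted keys means differences in whitespace or key order do not count as changes. The three strings below are stand-ins for bodies read from the same URL at different times.

```python
import hashlib
import json


def json_fingerprint(text):
    """Hash parsed JSON in canonical form so key order and spacing are ignored."""
    canonical = json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Stand-ins for two reads of the same data and one read that differs.
first_read = '{"prefix": "AEST", "number": "220"}'
second_read = '{"number":"220","prefix":"AEST"}'
changed_read = '{"prefix": "AEST", "number": "221"}'

same = json_fingerprint(first_read) == json_fingerprint(second_read)
different = json_fingerprint(first_read) != json_fingerprint(changed_read)
print(same, different)
```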
John Nagle