Re: Utility to screenscrape sites using javascript ?

2010-01-31 Thread Nobody
On Sat, 30 Jan 2010 11:28:47 -0800, KB wrote:

>>> I have a service I subscribe to that uses javascript to stream news.
>>
>> There's a Python interface to SpiderMonkey (Mozilla's JavaScript
>> interpreter):
>>
>> http://pypi.python.org/pypi/python-spidermonkey
>
> Thanks! I don't see a documentation page, but how would one use this?
> Would you download the HTML using urllib2/mechanize, then parse for the
> .js script and use spider-monkey to execute the script and the output is
> passed back to python?

Something like that.
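
As a rough sketch of the "parse for the .js script" step: the snippet below
pulls inline script bodies and external script URLs out of a page that has
already been downloaded. It is written against the Python 3 standard library
(html.parser) rather than the urllib2/mechanize setup mentioned above, and
the names ScriptExtractor and extract_scripts are made up for illustration.
Fetching the page and any external .js files is left out.

```python
from html.parser import HTMLParser


class ScriptExtractor(HTMLParser):
    """Collect inline <script> bodies and external script URLs."""

    def __init__(self):
        super().__init__()
        self.inline_scripts = []   # source text of inline scripts
        self.external_srcs = []    # values of <script src="...">
        self._in_script = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.external_srcs.append(src)
            self._in_script = True
            self._buffer = []

    def handle_data(self, data):
        # Text inside <script> arrives here unparsed.
        if self._in_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            code = "".join(self._buffer).strip()
            if code:
                self.inline_scripts.append(code)
            self._in_script = False


def extract_scripts(html):
    """Return (inline_scripts, external_srcs) for an HTML string."""
    parser = ScriptExtractor()
    parser.feed(html)
    parser.close()
    return parser.inline_scripts, parser.external_srcs
```

Each string in the first list is what you would then hand to the JS
interpreter; the URLs in the second list would need to be downloaded first.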

The homepage link:

http://github.com/davisp/python-spidermonkey

has some examples further down. The first one starts with:

 >>> import spidermonkey
 >>> rt = spidermonkey.Runtime()
 >>> cx = rt.new_context()
 >>> cx.execute("var x = 3; x *= 4; x;")
 12

It goes on to mention the .add_global(name, object) method, using it to
name a Python object such that its variables and methods can be accessed
from JS.

For scraping web pages, I suspect that you'll at least need to create an
object and name it document, so that document.location etc works. How
much you'll need to implement will depend upon what the web page uses,
although you can probably use .__getattr__() to serve up dummy handlers
for calls which can be ignored.
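
A minimal sketch of that idea, in plain Python so it runs without the
wrapper installed. The class names (DummyNode, Document) and the access log
are my own invention, not part of python-spidermonkey, and I haven't checked
how the wrapper treats objects that rely on __getattr__, so take the
add_global line in the trailing comment as an assumption:

```python
class DummyNode(object):
    """Accepts any attribute access or call, and records what was used."""

    def __init__(self, name, log):
        self._name = name
        self._log = log

    def __getattr__(self, attr):
        # Only reached when normal lookup fails. Refuse private names so
        # that a missing _name/_log cannot cause infinite recursion.
        if attr.startswith("_"):
            raise AttributeError(attr)
        child = "%s.%s" % (self._name, attr)
        self._log.append(child)
        return DummyNode(child, self._log)

    def __call__(self, *args):
        # Calls are ignored apart from being logged.
        self._log.append("%s(...)" % self._name)
        return DummyNode(self._name + "()", self._log)


class Document(DummyNode):
    """Stand-in for the browser's document object."""

    def __init__(self, url):
        log = []
        DummyNode.__init__(self, "document", log)
        self.location = url    # a real value the page may read
        self.accessed = log    # everything else the page touched


# With the wrapper available, this would presumably be exposed to JS as:
#   cx.add_global("document", Document("http://example.com/news"))
```

After running the page's script, the accessed list shows which parts of
document the page really uses, so you know what needs a real implementation
instead of a dummy.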

Further documentation is probably a case of reading the SpiderMonkey
documentation, then reading the python-spidermonkey source if it isn't
clear how the wrapper relates to SpiderMonkey itself.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Utility to screenscrape sites using javascript ?

2010-01-30 Thread Nobody
On Sat, 30 Jan 2010 06:21:01 -0800, KB wrote:

> I have a service I subscribe to that uses javascript to stream news.
> Ideally I would like to use python to parse the information for me. Note
> there is an option to take a static snapshot of the current stream but
> that is still done via Javascript. (I can reference the snapshot with a
> unique URL though, so I can pass that to a parser as long as it can
> resolve the javascript and get at the content)
>
> I had a quick look at Windmill but it doesn't appear to be what I am
> looking for. Does anyone else have any experience in screenscraping sites
> that utilise javascript? Can you share how you did it and perhaps some
> sample code if possible?

There's a Python interface to SpiderMonkey (Mozilla's JavaScript
interpreter):

http://pypi.python.org/pypi/python-spidermonkey



Re: Utility to screenscrape sites using javascript ?

2010-01-30 Thread KB

> On Sat, 30 Jan 2010 06:21:01 -0800, KB wrote:
>> I have a service I subscribe to that uses javascript to stream news.
>
> There's a Python interface to SpiderMonkey (Mozilla's JavaScript
> interpreter):
>
> http://pypi.python.org/pypi/python-spidermonkey

Thanks! I don't see a documentation page, but how would one use this?
Would you download the HTML using urllib2/mechanize, then parse for
the .js script and use spider-monkey to execute the script and the
output is passed back to python?

TIA.