Re: Utility to screenscrape sites using javascript ?
On Sat, 30 Jan 2010 11:28:47 -0800, KB wrote:

> > > I have a service I subscribe to that uses javascript to stream news.
> >
> > There's a Python interface to SpiderMonkey (Mozilla's JavaScript
> > interpreter):
> >
> > http://pypi.python.org/pypi/python-spidermonkey
>
> Thanks! I don't see a documentation page, but how would one use this?
>
> Would you download the HTML using urllib2/mechanize, then parse for the
> .js script and use spider-monkey to execute the script and the output is
> passed back to python?

Something like that. The homepage link:

http://github.com/davisp/python-spidermonkey

has some examples further down. The first one starts with:

    >>> import spidermonkey
    >>> rt = spidermonkey.Runtime()
    >>> cx = rt.new_context()
    >>> cx.execute("var x = 3; x *= 4; x;")
    12

It goes on to mention the .add_global(name, object) method, which binds a Python object to a name so that its variables and methods can be accessed from JS.

For scraping web pages, I suspect that you'll at least need to create an object and name it "document", so that document.location etc. works. How much you'll need to implement will depend upon what the web page uses, although you can probably use .__getattr__() to serve up dummy handlers for calls which can be ignored.

Further documentation is probably a case of reading the SpiderMonkey documentation, then reading the python-spidermonkey source if it isn't clear how the wrapper relates to SpiderMonkey itself.

--
http://mail.python.org/mailman/listinfo/python-list
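To make the dummy-handler suggestion in the reply above a bit more concrete, here is a rough sketch in plain Python. It only shows the .__getattr__() trick on its own; it does not import python-spidermonkey. The class name DummyBrowserObject, the "ignored" list and the example URL are made up for illustration, and representing document.location as a plain string is a simplification (in a browser it is an object).

```python
class DummyBrowserObject(object):
    """Hypothetical stand-in for a browser object such as `document`.

    Attributes passed as keyword arguments are served normally; any other
    attribute is treated as a method the page may call and we can ignore.
    """

    def __init__(self, name, **known):
        self._name = name
        self.ignored = []            # record of calls we swallowed
        self.__dict__.update(known)  # e.g. location="http://..."

    def __getattr__(self, attr):
        # Python only calls __getattr__ when normal lookup has failed,
        # so the "known" attributes above never reach this point.
        if attr.startswith("__"):
            # Let copy/pickle/introspection fail in the usual way.
            raise AttributeError(attr)

        def handler(*args, **kwargs):
            # Dummy handler: remember the call, do nothing, return None.
            self.ignored.append((attr, args))
            return None

        return handler


document = DummyBrowserObject("document",
                              location="http://example.com/snapshot")
document.write("<p>hello</p>")       # swallowed by the dummy handler

# With python-spidermonkey, the thread suggests something like
#     cx.add_global("document", document)
# How the wrapper exposes Python attributes to JS is not verified here.
```

After running this, document.ignored holds the swallowed call, which is a cheap way to find out which browser methods a page's script actually relies on before deciding what needs a real implementation.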
Re: Utility to screenscrape sites using javascript ?
On Sat, 30 Jan 2010 06:21:01 -0800, KB wrote:

> I have a service I subscribe to that uses javascript to stream news.
> Ideally I would like to use python to parse the information for me.
>
> Note there is an option to take a static snapshot of the current stream
> but that is still done via Javascript. (I can reference the snapshot with
> a unique URL though, so I can pass that to a parser as long as it can
> resolve the javascript and get at the content)
>
> I had a quick look at Windmill but it doesn't appear to be what I am
> looking for.
>
> Does anyone else have any experience in screenscraping sites that utilise
> javascript? Can you share how you did it and perhaps some sample code if
> possible?

There's a Python interface to SpiderMonkey (Mozilla's JavaScript interpreter):

http://pypi.python.org/pypi/python-spidermonkey
Re: Utility to screenscrape sites using javascript ?
On Sat, 30 Jan 2010 06:21:01 -0800, KB wrote:

> > I have a service I subscribe to that uses javascript to stream news.
>
> There's a Python interface to SpiderMonkey (Mozilla's JavaScript
> interpreter):
>
> http://pypi.python.org/pypi/python-spidermonkey

Thanks! I don't see a documentation page, but how would one use this?

Would you download the HTML using urllib2/mechanize, then parse for the .js script and use spider-monkey to execute the script and the output is passed back to python?

TIA.
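For the "parse for the .js script" step asked about above, one possible sketch using only the standard library follows. It assumes Python 3, where urllib2's role is taken by urllib.request; the download itself is left out so the example runs without a network. ScriptCollector and extract_scripts are hypothetical names, not part of any library mentioned in the thread.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect inline <script> bodies and external script URLs."""

    def __init__(self):
        super().__init__()
        self.inline = []      # source text of inline scripts
        self.external = []    # values of <script src="...">
        self._in_script = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.external.append(src)
            self._in_script = True
            self._buf = []

    def handle_data(self, data):
        # Script text may arrive in several chunks, so buffer it.
        if self._in_script:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            body = "".join(self._buf).strip()
            if body:
                self.inline.append(body)
            self._in_script = False


def extract_scripts(html):
    """Return (inline_sources, external_urls) found in an HTML string."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.inline, parser.external
```

The inline sources are what would be handed to a JavaScript interpreter; the external URLs would have to be fetched separately (and resolved against the page URL) before they can be executed.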