David,
I've written middleware to intercept a JS-specific request before it is
processed. I haven't used WaitFor.js, so I can't help you there, but I can
help get you started with PhantomJS.
from selenium import webdriver
from scrapy.http import HtmlResponse

class JSMiddleware(object):  # downloader middlewares need no special base class
    def process_request(self, request, spider):
        if request.meta.get('js'):  # you probably want a conditional trigger
            driver = webdriver.PhantomJS()
            driver.get(request.url)
            body = driver.page_source
            url = driver.current_url
            driver.quit()  # release the PhantomJS process
            return HtmlResponse(url, body=body, encoding='utf-8',
                                request=request)
        return None  # returning None lets Scrapy download the request normally
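For completeness, the middleware also has to be registered in settings.py; the dotted path and priority number below are placeholders for whatever your project actually uses:

```python
# settings.py -- 'myproject.middlewares.JSMiddleware' and the 543 priority
# are examples only; use your own module path and pick a priority that
# runs before the default HTTP download handler kicks in.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.JSMiddleware': 543,
}
```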
That's the simplest approach. You may eventually want to add options to the
webdriver.PhantomJS() call, such as a desired_capabilities dict (for example,
to set a user agent string) or service_args (for SSL handling such as
--ignore-ssl-errors). You may also want to wrap the driver.get() call in a
try/except block. Additionally, you should do something with the cookies that
come back from PhantomJS via driver.get_cookies().
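To make those suggestions concrete, here is a sketch. The helper names (phantomjs_capabilities, cookies_to_dict) are mine, not part of any library; the phantomjs.page.settings.* capability keys and the --ignore-ssl-errors flag come from PhantomJS/GhostDriver. The driver calls are shown as comments so the sketch stands on its own without PhantomJS installed:

```python
def phantomjs_capabilities(user_agent=None):
    """Build a desired_capabilities dict for webdriver.PhantomJS().

    PhantomJS reads per-page settings from keys of the form
    'phantomjs.page.settings.<name>'.
    """
    caps = {'browserName': 'phantomjs', 'javascriptEnabled': True}
    if user_agent:
        caps['phantomjs.page.settings.userAgent'] = user_agent
    return caps


def cookies_to_dict(selenium_cookies):
    """Flatten driver.get_cookies() output (a list of dicts carrying
    'name', 'value', 'domain', etc.) into a simple {name: value}
    mapping you can pass to a Scrapy Request's cookies= argument."""
    return {c['name']: c['value'] for c in selenium_cookies}


# Usage inside process_request (commented out; requires PhantomJS):
#
# driver = webdriver.PhantomJS(
#     desired_capabilities=phantomjs_capabilities('Mozilla/5.0 ...'),
#     service_args=['--ignore-ssl-errors=true'])
# try:
#     driver.get(request.url)
#     cookies = cookies_to_dict(driver.get_cookies())
# except Exception:
#     pass  # decide whether to retry, log, or drop the request
# finally:
#     driver.quit()
```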
Also, if you want every request to go through PhantomJS, you can simply remove
the request.meta['js'] conditional. Otherwise, you could set that flag on
initial requests in a spider.make_requests_from_url override, or give the
spider an instance method like spider.run_js(request) that inspects the
request and decides whether it needs JS based on whatever criteria you come
up with.
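A minimal sketch of that per-request tagging; needs_js() and its fragment-URL check are placeholders for whatever criteria fit your site:

```python
def needs_js(url):
    """Placeholder heuristic: treat fragment-routed pages (the '#...'
    style URLs common in AJAX apps) as needing JS rendering."""
    return '#' in url


def tag_request_meta(url, meta=None):
    """Return a meta dict with the 'js' flag set, suitable for
    Request(url, meta=tag_request_meta(url)) inside a
    make_requests_from_url override."""
    meta = dict(meta or {})
    meta['js'] = needs_js(url)
    return meta
```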
There are a lot of options for you with PhantomJS, so it's really up to
you, but this should be a decent starting point. I hope this answers your
question.
--
Respectfully,
Joey Espinosa
http://about.me/joelinux
On Thu, May 14, 2015 at 9:57 AM David Fishburn <[email protected]>
wrote:
> Thanks for the response José.
>
> That integrates Splash as the JS renderer. From the documentation I have
> read, it looks like Splash does not support Windows.
>
> David
>
>
> On Thursday, May 14, 2015 at 12:24:08 AM UTC-4, José Ricardo wrote:
>
>> Hi David, have you given ScrapyJS a try?
>>
>> https://github.com/scrapinghub/scrapyjs
>>
>> Besides rendering the page, it can also take screenshots :)
>>
>> Regards,
>>
>> José
>>
>> On Wed, May 13, 2015 at 3:54 PM, Travis Leleu <[email protected]>
>> wrote:
>>
>>> Hi David,
>>>
>>> Honestly, I have yet to find a good integration with scrapy / JS
>>> browser. The current methods seem to all download the basic page via
>>> urllib3, then send that HTML to render and fetch other resources.
>>>
>>> This causes a bottleneck -- the browser process, usually exposed via an
>>> API, takes a lot of CPU / time to render the page. It also doesn't easily
>>> use proxies, which means that all subsequent requests will be from one IP
>>> address.
>>>
>>> I think it would be a lot of work to build this into scrapy.
>>>
>>> In my work, I tend to just write my own (scaled down) scraping engine
>>> that works more directly with a headless js browser.
>>>
>>> On Wed, May 13, 2015 at 12:32 PM, David Fishburn <[email protected]>
>>> wrote:
>>>
>>>> I am new to Scrapy and Python.
>>>>
>>>> I have a site I need to scrape, but it is all AJAX-driven, so I will need
>>>> something like PhantomJS to yield the final page rendering.
>>>>
>>>> I have been searching in vain really for a simple example of a
>>>> downloader middleware which uses PhantomJS. It has been around long enough
>>>> that I am sure someone has already written one. I can find complete
>>>> projects for Splash and others, but I am on Windows.
>>>>
>>>> It doesn't need to be fancy, just take the Scrapy request and return
>>>> the PhantomJS page (most likely using WaitFor.js, which the PhantomJS
>>>> dev team wrote, to only return the page after it has stopped making AJAX
>>>> calls).
>>>>
>>>> I am completely lost trying to get started. The documentation (
>>>> http://doc.scrapy.org/en/latest/topics/downloader-middleware.html)
>>>> talks about the APIs, but it doesn't give a basic application which I could
>>>> begin modifying to plugin the PhantomJS calls which I have shown below
>>>> (which are very simple).
>>>>
>>>> Anyone have something I can use?
>>>>
>>>> This code does what I want when using the Scrapy shell:
>>>>
>>>>
>>>> D:\Python27\Scripts\scrapy.exe shell
>>>> https://sapui5.netweaver.ondemand.com/sdk/#docs/api/symbols/sap.html
>>>>
>>>> >>> from selenium import webdriver
>>>> >>> driver = webdriver.PhantomJS()
>>>> >>> driver.set_window_size(1024, 768)
>>>> >>> driver.get('https://sapui5.netweaver.ondemand.com/sdk/#docs/api/symbols/sap.html')
>>>> -- Wait here for 30 seconds and let the AJAX calls finish
>>>> >>> driver.save_screenshot('screen.png')
>>>> >>> print driver.page_source
>>>> >>> driver.quit()
>>>>
>>>>
>>>> The screen shot contains a properly rendered browser.
>>>>
>>>>
>>>> Thanks for any advice you can give.
>>>> David
>>>>
>>>>
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "scrapy-users" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at http://groups.google.com/group/scrapy-users.
>>>> For more options, visit https://groups.google.com/d/optout.