Hi David,

Honestly, I have yet to find a good integration between Scrapy and a JS
browser.  The current methods all seem to download the basic page via
urllib3, then send that HTML to a browser to render it and fetch the other
resources.

This causes a bottleneck -- the browser process, usually exposed via an
API, takes a lot of CPU / time to render the page.  It also doesn't easily
use proxies, which means that all subsequent requests will be from one IP
address.

I think it would be a lot of work to build this into scrapy.

In my work, I tend to just write my own (scaled down) scraping engine that
works more directly with a headless JS browser.
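
For what it's worth, here is a minimal sketch of the kind of downloader
middleware being asked about.  It is only an illustration, not a tested
integration: the names HeadlessBrowserMiddleware and wait_until_stable are
made up for this example, it assumes the same selenium webdriver.PhantomJS()
driver used in the shell session quoted below, and "the page source stopped
changing between two polls" is just a rough stand-in for "the AJAX calls have
finished".  It also shares one browser across all requests, so it has exactly
the bottleneck and single-IP limitations described above.

```python
import time


def wait_until_stable(driver, timeout=30.0, interval=0.5,
                      sleep=time.sleep, clock=time.time):
    """Poll driver.page_source until two consecutive reads are identical.

    Heuristic only (an assumption of this sketch): an unchanged page source
    is taken to mean the AJAX calls are done.  Returns the last source read,
    even if the timeout expires first.
    """
    deadline = clock() + timeout
    previous = driver.page_source
    while clock() < deadline:
        sleep(interval)
        current = driver.page_source
        if current == previous:
            return current
        previous = current
    return previous


class HeadlessBrowserMiddleware(object):
    """Hypothetical downloader middleware that renders pages in PhantomJS."""

    def __init__(self, driver=None):
        if driver is None:
            # Imported lazily so the module loads without selenium installed.
            from selenium import webdriver
            driver = webdriver.PhantomJS()
            driver.set_window_size(1024, 768)
        self.driver = driver

    @classmethod
    def from_crawler(cls, crawler):
        from scrapy import signals
        middleware = cls()
        # Shut the browser down when the spider finishes.
        crawler.signals.connect(middleware.spider_closed,
                                signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        from scrapy.http import HtmlResponse
        # Blocks until the page is rendered; one request at a time.
        self.driver.get(request.url)
        body = wait_until_stable(self.driver)
        # Returning a response here skips Scrapy's own download.
        return HtmlResponse(self.driver.current_url, body=body,
                            encoding='utf-8', request=request)

    def spider_closed(self, spider):
        self.driver.quit()
```

To try it, the class would be listed in the project's DOWNLOADER_MIDDLEWARES
setting, for example under a hypothetical path such as
myproject.middlewares.HeadlessBrowserMiddleware.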

On Wed, May 13, 2015 at 12:32 PM, David Fishburn <[email protected]>
wrote:

> I am new to Scrapy and Python.
>
> I have a site I need to scrape, but it is all AJAX driven, so I will need
> something like PhantomJS to yield the final page rendering.
>
> I have been searching in vain really for a simple example of a downloader
> middleware which uses PhantomJS.  It has been around long enough that I am
> sure someone has already written one.  I can find complete projects for
> Splash and others, but I am on Windows.
>
> It doesn't need to be fancy, just take the Scrapy request and return the
> PhantomJS page (most likely using the WaitFor.js, which the PhantomJS dev
> team wrote, to only return the page after it has stopped making AJAX calls).
>
> I am completely lost trying to get started.  The documentation (
> http://doc.scrapy.org/en/latest/topics/downloader-middleware.html) talks
> about the APIs, but it doesn't give a basic application which I could begin
> modifying to plug in the PhantomJS calls which I have shown below (which are
> very simple).
>
> Anyone have something I can use?
>
> This code does what I want when using the Scrapy shell:
>
>
> D:\Python27\Scripts\scrapy.exe shell https://sapui5.netweaver.ondemand.com/sdk/#docs/api/symbols/sap.html
>
> >>> import time
> >>> from selenium import webdriver
> >>> driver = webdriver.PhantomJS()
> >>> driver.set_window_size(1024, 768)
> >>> driver.get('https://sapui5.netweaver.ondemand.com/sdk/#docs/api/symbols/sap.html')
> >>> time.sleep(30)  # wait 30 seconds to let the AJAX calls finish
> >>> driver.save_screenshot('screen.png')
> >>> print driver.page_source
> >>> driver.quit()
>
>
> The screenshot shows a properly rendered page.
>
>
> Thanks for any advice you can give.
> David
>
>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
>
