Hi David,

Honestly, I have yet to find a good integration between Scrapy and a JS browser. The current methods all seem to download the basic page via urllib3, then send that HTML to the browser to render and fetch the other resources.
This causes a bottleneck -- the browser process, usually exposed via an API, takes a lot of CPU / time to render the page. It also doesn't easily use proxies, which means that all subsequent requests will come from one IP address.

I think it would be a lot of work to build this into Scrapy. In my work, I tend to just write my own (scaled-down) scraping engine that works more directly with a headless JS browser.

On Wed, May 13, 2015 at 12:32 PM, David Fishburn <[email protected]> wrote:
> I am new to Scrapy and Python.
>
> I have a site I need to scrape, but it is all AJAX driven, so I will need
> something like PhantomJS to yield the final page rendering.
>
> I have been searching in vain really for a simple example of a downloader
> middleware which uses PhantomJS. It has been around long enough that I am
> sure someone has already written one. I can find complete projects for
> Splash and others, but I am on Windows.
>
> It doesn't need to be fancy, just take the Scrapy request and return the
> PhantomJS page (most likely using the WaitFor.js, which the PhantomJS dev
> team wrote, to only return the page after it has stopped making AJAX calls).
>
> I am completely lost trying to get started. The documentation (
> http://doc.scrapy.org/en/latest/topics/downloader-middleware.html) talks
> about the APIs, but it doesn't give a basic application which I could begin
> modifying to plug in the PhantomJS calls which I have shown below (which are
> very simple).
>
> Anyone have something I can use?
>
> This code does what I want when using the Scrapy shell:
>
> D:\Python27\Scripts\scrapy.exe shell https://sapui5.netweaver.ondemand.com/sdk/#docs/api/symbols/sap.html
>
> >>> from selenium import webdriver
> >>> driver = webdriver.PhantomJS()
> >>> driver.set_window_size(1024, 768)
> >>> driver.get('https://sapui5.netweaver.ondemand.com/sdk/#docs/api/symbols/sap.html')
> >>> # Wait here for 30 seconds to let the AJAX calls finish
> >>> driver.save_screenshot('screen.png')
> >>> print driver.page_source
> >>> driver.quit()
>
> The screenshot contains a properly rendered page.
>
> Thanks for any advice you can give.
> David
>
> --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
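
For reference, the simple approach David describes could be sketched roughly as below. This is only an illustration under stated assumptions, not an existing or tested integration: the class name `PhantomJSMiddleware` and its `driver_factory`, `response_factory` and `wait_seconds` parameters are made up for this example, and it assumes a Selenium release that still ships `webdriver.PhantomJS` (as in the shell session above). It relies on the documented behaviour that a downloader middleware's `process_request` may return a Response, in which case Scrapy skips its own download.

```python
import time


class PhantomJSMiddleware(object):
    """Hypothetical downloader middleware that renders each request in a browser."""

    def __init__(self, driver_factory=None, response_factory=None, wait_seconds=30):
        # Both factories are assumptions of this sketch; they make the class
        # usable (and testable) without Selenium or Scrapy being importable.
        self._driver_factory = driver_factory or self._default_driver
        self._response_factory = response_factory or self._default_response
        self.wait_seconds = wait_seconds
        self.driver = None

    @staticmethod
    def _default_driver():
        # Imported lazily; assumes an older Selenium with PhantomJS support.
        from selenium import webdriver
        driver = webdriver.PhantomJS()
        driver.set_window_size(1024, 768)
        return driver

    @staticmethod
    def _default_response(url, body, request):
        # Wrap the rendered HTML so spiders receive a normal HtmlResponse.
        from scrapy.http import HtmlResponse
        return HtmlResponse(url, body=body, encoding='utf-8', request=request)

    def process_request(self, request, spider):
        # Start the browser on first use and reuse it afterwards.
        if self.driver is None:
            self.driver = self._driver_factory()
        self.driver.get(request.url)
        # Crude fixed wait, mirroring the manual pause in the shell session.
        time.sleep(self.wait_seconds)
        # Returning a response here tells Scrapy not to download the page itself.
        return self._response_factory(
            self.driver.current_url, self.driver.page_source, request)

    def close(self):
        # Quit the browser if one was started; call this when the crawl ends.
        if self.driver is not None:
            self.driver.quit()
            self.driver = None
```

To try it, the class would be listed in the project's `DOWNLOADER_MIDDLEWARES` setting (the module path depends on your project layout). Note that `time.sleep` blocks Scrapy's event loop on every request and a single browser handles all pages, so this sketch has exactly the bottleneck described above; it is a starting point rather than a scalable solution.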
