Sorry, I left the import out of that example I just sent (I often hate when
tutorials leave that out), so here you go:

    from scrapy.http import HtmlResponse

Also, I realize that you don't need to subclass BaseMiddleware for your
middleware class. That's an artifact of my own boilerplate code, because I
usually have a base middleware class that contains common things I want all
middleware to have. You can just do this in the previous example (my bad):

    class JSMiddleware(object):
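
Putting both fixes together, a self-contained version of that middleware
looks like this (assuming Selenium and PhantomJS are installed):

    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class JSMiddleware(object):
        def process_request(self, request, spider):
            if request.meta.get('js'):  # conditional trigger, as before
                driver = webdriver.PhantomJS()
                driver.get(request.url)
                body = driver.page_source
                url = driver.current_url
                driver.quit()  # shut PhantomJS down between requests
                return HtmlResponse(url, body=body, encoding='utf-8',
                                    request=request)
            return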

--
Respectfully,

Joey Espinosa
http://about.me/joelinux

On Thu, May 14, 2015 at 10:13 AM Joey Espinosa <[email protected]>
wrote:

> David,
>
> I've written middleware to intercept a JS-specific request before it is
> processed. I haven't used WaitFor.js, so I can't help you there, but I can
> help get you started with PhantomJS.
>
>     class JSMiddleware(BaseMiddleware):
>         def process_request(self, request, spider):
>             if request.meta.get('js'):  # you probably want a conditional trigger
>                 driver = webdriver.PhantomJS()
>                 driver.get(request.url)
>                 body = driver.page_source
>                 url = driver.current_url
>                 driver.quit()  # shut PhantomJS down so processes don't pile up
>                 return HtmlResponse(url, body=body, encoding='utf-8',
>                                     request=request)
>             return
>
> That's the simplest approach. You may end up wanting to add options to the
> webdriver.PhantomJS() call, such as desired_capabilities with SSL handling
> options or a user agent string. You may also want to wrap the driver.get()
> call in a try/except block. Additionally, you should do something with the
> cookies that come back from PhantomJS via driver.get_cookies().
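>
> Roughly, those additions might look like this (fetch_with_phantomjs is just
> a name I made up, the user agent string is a placeholder, and falling back
> to Scrapy's own downloader on an error is one choice among several):
>
>     from selenium import webdriver
>     from selenium.common.exceptions import WebDriverException
>     from scrapy.http import HtmlResponse
>
>     def fetch_with_phantomjs(request):
>         caps = webdriver.DesiredCapabilities.PHANTOMJS.copy()
>         # placeholder UA; use whatever your target site expects
>         caps['phantomjs.page.settings.userAgent'] = 'my-crawler/1.0'
>         driver = webdriver.PhantomJS(
>             desired_capabilities=caps,
>             service_args=['--ignore-ssl-errors=true'])  # PhantomJS SSL flag
>         try:
>             driver.get(request.url)
>             cookies = driver.get_cookies()  # list of dicts; merge into your cookie handling
>             return HtmlResponse(driver.current_url, body=driver.page_source,
>                                 encoding='utf-8', request=request)
>         except WebDriverException:
>             return None  # fall back to Scrapy's normal download path
>         finally:
>             driver.quit()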
>
> Also, if you want every request to go through JS, you can remove the
> request.meta['js'] conditional. Otherwise, you could insert that flag into
> initial requests in a spider.make_requests_from_url override, or you could
> simply have a spider instance method like spider.run_js(request), where the
> spider looks at the request and decides whether it needs JS based on some
> criteria you come up with. A sketch of the override approach is below.
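>
> Roughly (the spider name and URL are placeholders I made up):
>
>     from scrapy import Spider
>
>     class MySpider(Spider):
>         name = 'myspider'
>         start_urls = ['http://example.com/']
>
>         def make_requests_from_url(self, url):
>             request = super(MySpider, self).make_requests_from_url(url)
>             request.meta['js'] = True  # tells JSMiddleware to use PhantomJS
>             return request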
>
> There are a lot of options for you with PhantomJS, so it's really up to
> you, but this should be a decent starting point. I hope this answers your
> question.
>
> --
> Respectfully,
>
> Joey Espinosa
> http://about.me/joelinux
>
>
> On Thu, May 14, 2015 at 9:57 AM David Fishburn <[email protected]>
> wrote:
>
>> Thanks for the response, José.
>>
>> That integrates Splash as the JS renderer.  From the documentation I have
>> read, it looks like Splash does not support Windows.
>>
>> David
>>
>>
>> On Thursday, May 14, 2015 at 12:24:08 AM UTC-4, José Ricardo wrote:
>>
>>> Hi David, have you given ScrapyJS a try?
>>>
>>> https://github.com/scrapinghub/scrapyjs
>>>
>>> Besides rendering the page, it can also take screenshots :)
>>>
>>> Regards,
>>>
>>> José
>>>
>>> On Wed, May 13, 2015 at 3:54 PM, Travis Leleu <[email protected]>
>>> wrote:
>>>
>>>> Hi David,
>>>>
>>>> Honestly, I have yet to find a good scrapy / JS browser integration.
>>>> The current methods all seem to download the basic page via urllib3,
>>>> then hand that HTML to the browser to render and fetch other resources.
>>>>
>>>> This causes a bottleneck -- the browser process, usually exposed via an
>>>> API, takes a lot of CPU / time to render the page.  It also doesn't easily
>>>> use proxies, which means that all subsequent requests will be from one IP
>>>> address.
>>>>
>>>> I think it would be a lot of work to build this into scrapy.
>>>>
>>>> In my work, I tend to just write my own (scaled down) scraping engine
>>>> that works more directly with a headless js browser.
>>>>
>>>> On Wed, May 13, 2015 at 12:32 PM, David Fishburn <[email protected]>
>>>> wrote:
>>>>
>>>>> I am new to Scrapy and Python.
>>>>>
>>>>> I have a site I need to scrape, but it is all AJAX driven, so I will
>>>>> need something like PhantomJS to yield the final page rendering.
>>>>>
>>>>> I have really been searching in vain for a simple example of a
>>>>> downloader middleware that uses PhantomJS.  It has been around long
>>>>> enough that I am sure someone has already written one.  I can find
>>>>> complete projects for Splash and others, but I am on Windows.
>>>>>
>>>>> It doesn't need to be fancy; it just needs to take the Scrapy request
>>>>> and return the PhantomJS page (most likely using WaitFor.js, which the
>>>>> PhantomJS dev team wrote, to return the page only after it has stopped
>>>>> making AJAX calls).
>>>>>
>>>>> I am completely lost trying to get started.  The documentation (
>>>>> http://doc.scrapy.org/en/latest/topics/downloader-middleware.html)
>>>>> talks about the APIs, but it doesn't give a basic application that I
>>>>> could begin modifying to plug in the (very simple) PhantomJS calls I
>>>>> have shown below.
>>>>>
>>>>> Anyone have something I can use?
>>>>>
>>>>> This code does what I want when using the Scrapy shell:
>>>>>
>>>>>
>>>>> D:\Python27\Scripts\scrapy.exe shell
>>>>> https://sapui5.netweaver.ondemand.com/sdk/#docs/api/symbols/sap.html
>>>>>
>>>>> >>> from selenium import webdriver
>>>>> >>> driver = webdriver.PhantomJS()
>>>>> >>> driver.set_window_size(1024, 768)
>>>>> >>> driver.get('https://sapui5.netweaver.ondemand.com/sdk/#docs/api/symbols/sap.html')
>>>>> >>> # wait here for 30 seconds to let the AJAX calls finish
>>>>> >>> driver.save_screenshot('screen.png')
>>>>> >>> print driver.page_source
>>>>> >>> driver.quit()
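>>>>>
>>>>> (I gather Selenium's WebDriverWait could replace that manual 30-second
>>>>> pause by polling for an element that only shows up once the AJAX calls
>>>>> are done; the element ID below is just a guess on my part.)
>>>>>
>>>>>     from selenium.webdriver.common.by import By
>>>>>     from selenium.webdriver.support.ui import WebDriverWait
>>>>>     from selenium.webdriver.support import expected_conditions as EC
>>>>>
>>>>>     # block for up to 30 seconds until the rendered content appears
>>>>>     WebDriverWait(driver, 30).until(
>>>>>         EC.presence_of_element_located((By.ID, 'content')))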
>>>>>
>>>>>
>>>>> The screen shot contains a properly rendered browser.
>>>>>
>>>>>
>>>>> Thanks for any advice you can give.
>>>>> David
>>>>>
>>>>>
>>>>>
