Thanks Joey.
Since I am new to Python and Scrapy, if I run:
d:\python27\scripts\scrapy startproject ui5
Which files do I have to create (and in which directory should I add the code
you supplied), and which files do I have to modify (e.g. settings.py) so that
it is called automatically at the outset?
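My guess so far (in case it helps whoever answers) is that your class would go
in something like ui5/ui5/middlewares.py and get registered in settings.py
roughly like this; the module path and the 543 priority number are only
assumptions on my part:

    # ui5/ui5/settings.py -- layout assumed from "scrapy startproject ui5"
    DOWNLOADER_MIDDLEWARES = {
        'ui5.middlewares.JSMiddleware': 543,  # assumed path and priority
    }

Is that even close?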
I hope that is not too much to ask.
Thanks,
David
On Thursday, May 14, 2015 at 10:13:09 AM UTC-4, Joey Espinosa wrote:
>
> David,
>
> I've written middleware to intercept a JS-specific request before it is
> processed. I haven't used WaitFor.js, so I can't help you there, but I can
> help get you started with PhantomJS.
>
> from scrapy.http import HtmlResponse
> from selenium import webdriver
>
>
> class JSMiddleware(object):  # downloader middleware needs no special base class
>     def process_request(self, request, spider):
>         if request.meta.get('js'):  # you probably want a conditional trigger
>             driver = webdriver.PhantomJS()
>             driver.get(request.url)
>             body = driver.page_source
>             url = driver.current_url
>             driver.quit()  # shut down the PhantomJS process
>             return HtmlResponse(url, body=body, encoding='utf-8', request=request)
>         return
>
> That's the simplest approach. You may end up wanting to add options to the
> webdriver.PhantomJS() call, such as desired_capabilities with SSL handling
> options or a user-agent string. You may also want to wrap the driver.get()
> call in a try/except block. Additionally, you should do something with the
> cookies that come back from PhantomJS via driver.get_cookies().
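>
> As a rough sketch of what those additions might look like (the capability
> key, the --ignore-ssl-errors flag, and the user-agent string below are just
> illustrative assumptions, not required values):
>
> from selenium import webdriver
> from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
>
> caps = DesiredCapabilities.PHANTOMJS.copy()
> caps['phantomjs.page.settings.userAgent'] = 'my-scrapy-bot/1.0'  # placeholder UA
>
> driver = webdriver.PhantomJS(
>     desired_capabilities=caps,
>     service_args=['--ignore-ssl-errors=true'],  # SSL processing option
> )
> try:
>     driver.get(request.url)
> except Exception:
>     driver.quit()
>     raise
> cookies = driver.get_cookies()  # list of cookie dicts you can copy into the response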
>
> Also, if you want every request to go through JS, then you can remove the
> request.meta['js'] conditional. Otherwise, you could set that flag on
> initial requests in a spider.make_requests_from_url override, or you could
> simply have a spider instance method like spider.run_js(request) where the
> spider looks at the request and decides whether it needs JS based on some
> criteria you come up with.
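>
> For instance, a minimal sketch of the make_requests_from_url route (the
> spider name and start URL are placeholders, not taken from your project):
>
> import scrapy
>
> class UI5Spider(scrapy.Spider):
>     name = 'ui5'  # placeholder
>     start_urls = ['https://sapui5.netweaver.ondemand.com/sdk/']
>
>     def make_requests_from_url(self, url):
>         # tag initial requests so JSMiddleware routes them through PhantomJS
>         return scrapy.Request(url, meta={'js': True}, dont_filter=True)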
>
> There are a lot of options for you with PhantomJS, so it's really up to
> you, but this should be a decent starting point. I hope this answers your
> question.
>
> --
> Respectfully,
>
> Joey Espinosa
> http://about.me/joelinux
>
>
> On Thu, May 14, 2015 at 9:57 AM David Fishburn <[email protected]> wrote:
>
>> Thanks for the response José.
>>
>> That integrates Splash as the JS renderer. From the documentation I have
>> read, it looks like Splash does not support Windows.
>>
>> David
>>
>>
>> On Thursday, May 14, 2015 at 12:24:08 AM UTC-4, José Ricardo wrote:
>>
>>> Hi David, have you given ScrapyJS a try?
>>>
>>> https://github.com/scrapinghub/scrapyjs
>>>
>>> Besides rendering the page, it can also take screenshots :)
>>>
>>> Regards,
>>>
>>> José
>>>
>>> On Wed, May 13, 2015 at 3:54 PM, Travis Leleu <[email protected]>
>>> wrote:
>>>
>>>> Hi David,
>>>>
>>>> Honestly, I have yet to find a good integration between Scrapy and a JS
>>>> browser. The current methods all seem to download the basic page via
>>>> urllib3, then send that HTML to a browser process to render and fetch
>>>> other resources.
>>>>
>>>> This causes a bottleneck -- the browser process, usually exposed via an
>>>> API, takes a lot of CPU / time to render the page. It also doesn't easily
>>>> use proxies, which means that all subsequent requests will be from one IP
>>>> address.
>>>>
>>>> I think it would be a lot of work to build this into scrapy.
>>>>
>>>> In my work, I tend to just write my own (scaled down) scraping engine
>>>> that works more directly with a headless js browser.
>>>>
>>>> On Wed, May 13, 2015 at 12:32 PM, David Fishburn <[email protected]>
>>>> wrote:
>>>>
>>>>> I am new to Scrapy and Python.
>>>>>
>>>>> I have a site I need to scrape, but it is all AJAX driven, so I will need
>>>>> something like PhantomJS to yield the final rendered page.
>>>>>
>>>>> I have been searching in vain really for a simple example of a
>>>>> downloader middleware which uses PhantomJS. It has been around long
>>>>> enough
>>>>> that I am sure someone has already written one. I can find complete
>>>>> projects for Splash and others, but I am on Windows.
>>>>>
>>>>> It doesn't need to be fancy: just take the Scrapy request and return
>>>>> the PhantomJS page (most likely using WaitFor.js, which the PhantomJS
>>>>> dev team wrote, to return the page only after it has stopped making AJAX
>>>>> calls).
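>>>>>
>>>>> (A sketch of the "wait until the AJAX stops" part that I think could work
>>>>> with Selenium's WebDriverWait instead of WaitFor.js; the 30-second timeout
>>>>> and the readyState check are just my guess at what "finished" means here:)
>>>>>
>>>>> from selenium.webdriver.support.ui import WebDriverWait
>>>>>
>>>>> # block for up to 30 seconds until the document reports itself complete
>>>>> WebDriverWait(driver, 30).until(
>>>>>     lambda d: d.execute_script('return document.readyState') == 'complete')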
>>>>>
>>>>> I am completely lost trying to get started. The documentation
>>>>> (http://doc.scrapy.org/en/latest/topics/downloader-middleware.html)
>>>>> talks about the APIs, but it doesn't give a basic application which I
>>>>> could begin modifying to plug in the PhantomJS calls I have shown below
>>>>> (which are very simple).
>>>>>
>>>>> Anyone have something I can use?
>>>>>
>>>>> This code does what I want when using the Scrapy shell:
>>>>>
>>>>>
>>>>> D:\Python27\Scripts\scrapy.exe shell https://sapui5.netweaver.ondemand.com/sdk/#docs/api/symbols/sap.html
>>>>>
>>>>> >>> from selenium import webdriver
>>>>> >>> driver = webdriver.PhantomJS()
>>>>> >>> driver.set_window_size(1024, 768)
>>>>> >>> driver.get('https://sapui5.netweaver.ondemand.com/sdk/#docs/api/symbols/sap.html')
>>>>> # Wait here for 30 seconds and let the AJAX calls finish
>>>>> >>> driver.save_screenshot('screen.png')
>>>>> >>> print driver.page_source
>>>>> >>> driver.quit()
>>>>>
>>>>>
>>>>> The screenshot shows the page properly rendered by the browser.
>>>>>
>>>>>
>>>>> Thanks for any advice you can give.
>>>>> David
>>>>>
>>>>>
>>>>>