Thanks again José.

I did some more Googling around.  Didn't know what Docker was, but found it 
here:
    https://docs.docker.com/

The Splash instructions I saw were always referencing Linux paths, and I 
didn't find any Windows references.

So Docker is essentially a virtual machine.  If Docker runs on your 
platform (and it does run on Windows), then it can run your Python / Scrapy 
/ Splash code.
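
If I'm reading the scrapyjs README right, the Windows setup would look roughly 
like this (untested on my side; the setting names come from the README, and on 
Windows "localhost" may need to be the Docker VM's IP, e.g. from boot2docker ip):

    # Start Splash in Docker first; it listens on port 8050 by default:
    #   docker run -p 8050:8050 scrapinghub/splash

    # settings.py -- wire scrapyjs / Splash into the Scrapy project
    SPLASH_URL = 'http://localhost:8050'
    DOWNLOADER_MIDDLEWARES = {
        'scrapyjs.SplashMiddleware': 725,
    }
    DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'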

Thank you.
David



On Thursday, May 14, 2015 at 1:10:48 PM UTC-4, José Ricardo wrote:
>
> David, it seems that there shouldn't be any problem running Splash from 
> Docker on Windows :)
>
> On Thu, May 14, 2015 at 10:17 AM, Joey Espinosa <[email protected]> wrote:
>
>> Crap, and obviously (I need more coffee):
>>
>>     from selenium import webdriver
>>
>> --
>> Respectfully,
>>
>> Joey Espinosa
>> http://about.me/joelinux
>>
>> On Thu, May 14, 2015 at 10:16 AM Joey Espinosa <[email protected]> wrote:
>>
>>> Sorry, if you want the import for that example I just sent (I often hate 
>>> when tutorials leave that out), here you go:
>>>
>>>     from scrapy.http import HtmlResponse
>>>     
>>> Also, I realize that you don't need to subclass BaseMiddleware for your 
>>> middleware class. That's an artifact of my own boilerplate code, because I 
>>> usually have a base middleware class that contains common things I want all 
>>> middleware to have. You can just do this in the previous example (my bad):
>>>
>>>     class JSMiddleware(object):
>>>
>>> --
>>> Respectfully,
>>>
>>> Joey Espinosa
>>> http://about.me/joelinux
>>>
>>> On Thu, May 14, 2015 at 10:13 AM Joey Espinosa <[email protected]> wrote:
>>>
>>>> David,
>>>>
>>>> I've written middleware to intercept a JS-specific request before it is 
>>>> processed. I haven't used WaitFor.js, so I can't help you there, but I can 
>>>> help get you started with PhantomJS.
>>>>
>>>>     class JSMiddleware(BaseMiddleware):
>>>>         def process_request(self, request, spider):
>>>>             if request.meta.get('js'):  # you probably want a conditional trigger
>>>>                 driver = webdriver.PhantomJS()
>>>>                 driver.get(request.url)
>>>>                 body = driver.page_source
>>>>                 return HtmlResponse(driver.current_url, body=body,
>>>>                                     encoding='utf-8', request=request)
>>>>             return
>>>>
>>>> That's the simplest approach. You may end up wanting to add options to 
>>>> the webdriver.PhantomJS() call, such as desired_capabilities (including SSL 
>>>> processing options or a user agent string). You may also want to wrap the 
>>>> driver.get() call in a try/except block. Additionally, you should do 
>>>> something with the cookies that come back from PhantomJS via 
>>>> driver.get_cookies().
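>>>>
>>>> For instance, a rough sketch of those additions inside process_request 
>>>> (untested; the user agent string and the meta key for the cookies are 
>>>> just placeholders I picked for illustration):
>>>>
>>>>     from selenium import webdriver
>>>>     from selenium.common.exceptions import WebDriverException
>>>>
>>>>     caps = dict(webdriver.DesiredCapabilities.PHANTOMJS)
>>>>     caps['phantomjs.page.settings.userAgent'] = 'Mozilla/5.0 (my crawler)'
>>>>     driver = webdriver.PhantomJS(
>>>>         desired_capabilities=caps,
>>>>         service_args=['--ignore-ssl-errors=true'])  # relax SSL checks
>>>>     try:
>>>>         driver.get(request.url)
>>>>     except WebDriverException:
>>>>         driver.quit()
>>>>         return  # fall back to Scrapy's normal download handler
>>>>     # hand the PhantomJS cookies to the spider so it can reuse them
>>>>     request.meta['phantomjs_cookies'] = driver.get_cookies()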
>>>>
>>>> Also, if you want every request to go through JS, you can remove the 
>>>> request.meta['js'] conditional. Otherwise, you could set that flag on 
>>>> initial requests in a spider.make_requests_from_url override (see the 
>>>> sketch below), or you could simply have a spider instance method like 
>>>> spider.run_js(request) where the spider looks at the request and decides 
>>>> whether it needs JS based on some criteria you come up with.
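>>>>
>>>> A minimal sketch of the make_requests_from_url override (assuming the 
>>>> same 'js' meta key as above; the spider name and URL are placeholders):
>>>>
>>>>     from scrapy import Spider
>>>>
>>>>     class MySpider(Spider):
>>>>         name = 'myspider'
>>>>         start_urls = ['http://example.com']
>>>>
>>>>         def make_requests_from_url(self, url):
>>>>             request = super(MySpider, self).make_requests_from_url(url)
>>>>             request.meta['js'] = True  # route initial requests through PhantomJS
>>>>             return request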
>>>>
>>>> There are a lot of options for you with PhantomJS, so it's really up to 
>>>> you, but this should be a decent starting point. I hope this answers your 
>>>> question.
>>>>
>>>> --
>>>> Respectfully,
>>>>
>>>> Joey Espinosa
>>>> http://about.me/joelinux
>>>>
>>>>
>>>> On Thu, May 14, 2015 at 9:57 AM David Fishburn <[email protected]> wrote:
>>>>
>>>>> Thanks for the response José.  
>>>>>
>>>>> That integrates Splash as the JS renderer.  From the documentation I 
>>>>> have read, it looks like Splash does not support Windows.
>>>>>
>>>>> David
>>>>>
>>>>>
>>>>> On Thursday, May 14, 2015 at 12:24:08 AM UTC-4, José Ricardo wrote:
>>>>>
>>>>>> Hi David, have you given ScrapyJS a try?
>>>>>>
>>>>>> https://github.com/scrapinghub/scrapyjs
>>>>>>
>>>>>> Besides rendering the page, it can also take screenshots :)
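>>>>>>
>>>>>> Per request, it looks roughly like this inside a spider callback (going 
>>>>>> by the README; the html/png args control what Splash returns):
>>>>>>
>>>>>>     from scrapy import Request
>>>>>>
>>>>>>     yield Request(url, self.parse_result, meta={
>>>>>>         'splash': {
>>>>>>             'args': {'html': 1, 'png': 1},
>>>>>>         }
>>>>>>     })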
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> José
>>>>>>
>>>>>> On Wed, May 13, 2015 at 3:54 PM, Travis Leleu <[email protected]> wrote:
>>>>>>
>>>>>>> Hi David,
>>>>>>>
>>>>>>> Honestly, I have yet to find a good integration between Scrapy and a 
>>>>>>> JS browser.  The current methods all seem to download the basic page via 
>>>>>>> urllib3, then send that HTML off to be rendered and to fetch other resources.
>>>>>>>
>>>>>>> This causes a bottleneck -- the browser process, usually exposed via 
>>>>>>> an API, takes a lot of CPU / time to render the page.  It also doesn't 
>>>>>>> easily use proxies, which means that all subsequent requests will be 
>>>>>>> from 
>>>>>>> one IP address.
>>>>>>>
>>>>>>> I think it would be a lot of work to build this into scrapy.
>>>>>>>
>>>>>>> In my work, I tend to just write my own (scaled down) scraping 
>>>>>>> engine that works more directly with a headless js browser.
>>>>>>>
>>>>>>> On Wed, May 13, 2015 at 12:32 PM, David Fishburn <[email protected]> wrote:
>>>>>>>
>>>>>>>> I am new to Scrapy and Python.
>>>>>>>>
>>>>>>>> I have a site I need to scrape, but it is all AJAX driven, so I will 
>>>>>>>> need something like PhantomJS to yield the final page rendering.
>>>>>>>>
>>>>>>>> I have been searching in vain really for a simple example of a 
>>>>>>>> downloader middleware which uses PhantomJS.  It has been around long 
>>>>>>>> enough 
>>>>>>>> that I am sure someone has already written one.  I can find complete 
>>>>>>>> projects for Splash and others, but I am on Windows.
>>>>>>>>
>>>>>>>> It doesn't need to be fancy, just take the Scrapy request and 
>>>>>>>> return the PhantomJS page (most likely using WaitFor.js, which the 
>>>>>>>> PhantomJS dev team wrote, to only return the page after it has stopped 
>>>>>>>> making AJAX calls).
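>>>>>>>>
>>>>>>>> I haven't tried this yet, but instead of WaitFor.js, selenium's own 
>>>>>>>> WebDriverWait looks like it could handle the "wait until the AJAX is 
>>>>>>>> done" part.  The '#content' selector is just a placeholder here:
>>>>>>>>
>>>>>>>>     from selenium.webdriver.support.ui import WebDriverWait
>>>>>>>>
>>>>>>>>     # driver is the webdriver.PhantomJS() instance shown below
>>>>>>>>     driver.get(url)
>>>>>>>>     # wait up to 30 seconds for the AJAX-rendered element to appear
>>>>>>>>     WebDriverWait(driver, 30).until(
>>>>>>>>         lambda d: d.find_element_by_css_selector('#content'))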
>>>>>>>>
>>>>>>>> I am completely lost trying to get started.  The documentation 
>>>>>>>> (http://doc.scrapy.org/en/latest/topics/downloader-middleware.html) 
>>>>>>>> talks about the APIs, but it doesn't give a basic application which I 
>>>>>>>> could begin modifying to plug in the PhantomJS calls I have shown 
>>>>>>>> below (which are very simple).
>>>>>>>>
>>>>>>>> Anyone have something I can use?
>>>>>>>>
>>>>>>>> This code does what I want when using the Scrapy shell:
>>>>>>>>
>>>>>>>>
>>>>>>>> D:\Python27\Scripts\scrapy.exe shell https://sapui5.netweaver.ondemand.com/sdk/#docs/api/symbols/sap.html
>>>>>>>>
>>>>>>>> >>> from selenium import webdriver
>>>>>>>> >>> driver = webdriver.PhantomJS()
>>>>>>>> >>> driver.set_window_size(1024, 768)
>>>>>>>> >>> driver.get('https://sapui5.netweaver.ondemand.com/sdk/#docs/api/symbols/sap.html')
>>>>>>>> # wait here for 30 seconds to let the AJAX calls finish
>>>>>>>> >>> driver.save_screenshot('screen.png')
>>>>>>>> >>> print driver.page_source
>>>>>>>> >>> driver.quit()
>>>>>>>>
>>>>>>>>
>>>>>>>> The screenshot shows the properly rendered page.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for any advice you can give.
>>>>>>>> David
>>>>>>>>
>>>>>>>>
>>>>>>>>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
