Inside your project directory (in this case, it looks like "ui5"), create
yourself a middleware directory to hold all of your middleware modules.
Then create a file for this purpose and give it a relevant name (like
ui5\middleware\javascript.py). Now you have your middleware module.
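
Inside that file goes the JSMiddleware class from my earlier message, plus
the imports it needs. Roughly like this (untested, so treat it as a sketch;
I've also added a driver.quit() so you don't leak PhantomJS processes):

    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class JSMiddleware(object):
        def process_request(self, request, spider):
            # only render requests explicitly flagged for JS
            if request.meta.get('js'):
                driver = webdriver.PhantomJS()
                driver.get(request.url)
                url = driver.current_url
                body = driver.page_source
                driver.quit()  # clean up the PhantomJS process
                return HtmlResponse(url, body=body, encoding='utf-8',
                                    request=request)
            # returning None lets Scrapy download the request as usual
            return None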

Next, you have to let Scrapy know about this middleware. On to settings.py!

    DOWNLOADER_MIDDLEWARES = {
        'ui5.middleware.javascript.JSMiddleware': 99
    }

That's assuming, of course, that you use all the same naming I suggested in
the earlier examples; change it as appropriate. Also, since you mentioned
you're new to Python: that string inside DOWNLOADER_MIDDLEWARES needs to
resolve to a class that Python can "see", so "ui5" must be importable (on
the PYTHONPATH), and both the ui5 and middleware directories need an
__init__.py file so Python treats them as packages. Just a heads up on that
in case you try this and get an error like "No module named 'ui5'."
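
For reference, after "scrapy startproject ui5" plus the new files, the
layout should look roughly like this (the __init__.py files can be empty;
they just mark the directories as packages):

    ui5/
        scrapy.cfg
        ui5/
            __init__.py
            settings.py
            middleware/
                __init__.py
                javascript.py
            spiders/
                __init__.py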

That should be it.
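
To actually route a page through PhantomJS, flag the request in your spider.
Something like this (the URL and the parse_page callback are placeholders):

    from scrapy.http import Request

    # inside your spider class:
    def parse(self, response):
        # any request carrying meta={'js': True} gets rendered by PhantomJS
        yield Request('https://example.com/ajax-page', meta={'js': True},
                      callback=self.parse_page)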

A side note about Docker (since you mentioned it in another response)...
Windows doesn't have the kernel features Docker actually depends on, so to
get Docker working on Windows, it creates a lightweight Linux VM and runs
Docker inside that. So when you run processes within Docker on Windows,
you're running an abstraction within an abstraction within an abstraction.
That's not tremendously efficient just to make up for a lack of familiarity
with Scrapy. If you're intent on making use of a project only supported on
Linux, I'd sooner suggest a VM with Ubuntu, since you'd already be part of
the way there by attempting Docker anyway. I don't use Windows at all, but
one of my colleagues does, and he set up a cheap Ubuntu computer and uses NX
to connect to it from Windows and do his development within Docker. I'm not
knocking Docker (I use it quite heavily myself); I'm just cautioning against
throwing too many I-need-to-learn-this-from-scratch things at yourself all
at once. You'll be overwhelmed. Just my two cents.

On Thu, May 14, 2015 at 1:38 PM David Fishburn <[email protected]>
wrote:

> Thanks Joey.
>
> Since I am new to Python and Scrapy, if I run:
>
> d:\python27\scripts\scrapy startproject ui5
>
> Which files do I have to create (and in which directory do I add the code
> you supplied), and which files do I have to modify (e.g. settings.py) to
> have it called automatically at the outset?
>
> I hope that is not too much to ask.
>
> Thanks,
> David
>
>
> On Thursday, May 14, 2015 at 10:13:09 AM UTC-4, Joey Espinosa wrote:
>
>> David,
>>
>> I've written middleware to intercept a JS-specific request before it is
>> processed. I haven't used WaitFor.js, so I can't help you there, but I can
>> help get you started with PhantomJS.
>>
>>     class JSMiddleware(object):
>>         def process_request(self, request, spider):
>>             if request.meta.get('js'):  # you probably want a conditional trigger
>>                 driver = webdriver.PhantomJS()
>>                 driver.get(request.url)
>>                 body = driver.page_source
>>                 return HtmlResponse(driver.current_url, body=body,
>>                                     encoding='utf-8', request=request)
>>             return
>>
>> That's the simplest approach. You may end up wanting to add options to the
>> webdriver.PhantomJS() call, such as desired_capabilities (for SSL handling
>> or a custom user agent string). You may also want to wrap the driver.get()
>> call in a try/except block. Additionally, you should do something with the
>> cookies that come back from PhantomJS via driver.get_cookies().
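>>
>> For example, inside process_request, something along these lines (the
>> capability key and service arg are the standard PhantomJS ones, but
>> double-check them against your Selenium/PhantomJS versions):
>>
>>     caps = dict(webdriver.DesiredCapabilities.PHANTOMJS)
>>     caps['phantomjs.page.settings.userAgent'] = 'Mozilla/5.0 (custom)'
>>     driver = webdriver.PhantomJS(desired_capabilities=caps,
>>                                  service_args=['--ignore-ssl-errors=true'])
>>     try:
>>         driver.get(request.url)
>>     except Exception:
>>         driver.quit()  # don't leave orphaned PhantomJS processes around
>>         raise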
>>
>> Also, if you want every request to go through JS, you can remove the
>> request.meta['js'] conditional. Otherwise, you could set that flag on the
>> initial requests in a spider.make_requests_from_url override, or you could
>> simply give the spider an instance method like spider.run_js(request) that
>> looks at a request and decides whether it needs JS based on whatever
>> criteria you come up with.
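>>
>> As a quick sketch of the override route (MySpider stands in for whatever
>> your spider class is called):
>>
>>     def make_requests_from_url(self, url):
>>         request = super(MySpider, self).make_requests_from_url(url)
>>         request.meta['js'] = True
>>         return request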
>>
>> There are a lot of options for you with PhantomJS, so it's really up to
>> you, but this should be a decent starting point. I hope this answers your
>> question.
>>
>> --
>> Respectfully,
>>
>> Joey Espinosa
>> http://about.me/joelinux
>>
>>
>> On Thu, May 14, 2015 at 9:57 AM David Fishburn <[email protected]>
>> wrote:
>>
>>> Thanks for the response José.
>>>
>>> That integrates Splash as the JS renderer.  From the documentation I
>>> have read, it looks like Splash does not support Windows.
>>>
>>> David
>>>
>>>
>>> On Thursday, May 14, 2015 at 12:24:08 AM UTC-4, José Ricardo wrote:
>>>
>>>> Hi David, have you given ScrapyJS a try?
>>>>
>>>> https://github.com/scrapinghub/scrapyjs
>>>>
>>>> Besides rendering the page, it can also take screenshots :)
>>>>
>>>> Regards,
>>>>
>>>> José
>>>>
>>>> On Wed, May 13, 2015 at 3:54 PM, Travis Leleu <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi David,
>>>>>
>>>>> Honestly, I have yet to find a good integration between Scrapy and a JS
>>>>> browser.  The current methods all seem to download the basic page via
>>>>> urllib3, then send that HTML off to be rendered and fetch other resources.
>>>>>
>>>>> This causes a bottleneck -- the browser process, usually exposed via
>>>>> an API, takes a lot of CPU / time to render the page.  It also doesn't
>>>>> easily use proxies, which means that all subsequent requests will be from
>>>>> one IP address.
>>>>>
>>>>> I think it would be a lot of work to build this into scrapy.
>>>>>
>>>>> In my work, I tend to just write my own (scaled down) scraping engine
>>>>> that works more directly with a headless js browser.
>>>>>
>>>>> On Wed, May 13, 2015 at 12:32 PM, David Fishburn <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I am new to Scrapy and Python.
>>>>>>
>>>>>> I have a site I need to scrape, but it is all AJAX driven, so I will
>>>>>> need something like PhantomJS to yield the final page rendering.
>>>>>>
>>>>>> I have been searching in vain really for a simple example of a
>>>>>> downloader middleware which uses PhantomJS.  It has been around long
>>>>>> enough that I am sure someone has already written one.  I can find
>>>>>> complete projects for Splash and others, but I am on Windows.
>>>>>>
>>>>>> It doesn't need to be fancy, just take the Scrapy request and return
>>>>>> the PhantomJS page (most likely using WaitFor.js, which the PhantomJS
>>>>>> dev team wrote, to only return the page after it has stopped making
>>>>>> AJAX calls).
>>>>>>
>>>>>> I am completely lost trying to get started.  The documentation (
>>>>>> http://doc.scrapy.org/en/latest/topics/downloader-middleware.html)
>>>>>> talks about the APIs, but it doesn't give a basic application which I
>>>>>> could begin modifying to plug in the PhantomJS calls I have shown below
>>>>>> (which are very simple).
>>>>>>
>>>>>> Anyone have something I can use?
>>>>>>
>>>>>> This code does what I want when using the Scrapy shell:
>>>>>>
>>>>>>
>>>>>> D:\Python27\Scripts\scrapy.exe shell
>>>>>> https://sapui5.netweaver.ondemand.com/sdk/#docs/api/symbols/sap.html
>>>>>>
>>>>>> >>> from selenium import webdriver
>>>>>> >>> driver = webdriver.PhantomJS()
>>>>>> >>> driver.set_window_size(1024, 768)
>>>>>> >>> driver.get('https://sapui5.netweaver.ondemand.com/sdk/#docs/api/symbols/sap.html')
>>>>>> -- wait here for 30 seconds and let the AJAX calls finish
>>>>>> >>> driver.save_screenshot('screen.png')
>>>>>> >>> print driver.page_source
>>>>>> >>> driver.quit()
>>>>>>
>>>>>>
>>>>>> The screenshot shows a properly rendered page.
>>>>>>
>>>>>>
>>>>>> Thanks for any advice you can give.
>>>>>> David
>>>>>>
>>>>>>
>>>>>>
