Thanks Joey.
Since I am new to Python and Scrapy, if I run:
d:\python27\scripts\scrapy startproject ui5
Which files do I have to create (and in which directory should I add the code
you supplied), and which files do I have to modify (e.g. settings.py) so that
it is called automatically at the outset?
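My guess so far (in case it helps whoever answers) is that your class would go
in something like ui5/ui5/middlewares.py and get registered in settings.py
roughly like this; the module path and the 543 priority number are only
assumptions on my part:

    # ui5/ui5/settings.py -- layout assumed from "scrapy startproject ui5"
    DOWNLOADER_MIDDLEWARES = {
        'ui5.middlewares.JSMiddleware': 543,  # assumed path and priority
    }

Is that even close?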
I hope that is not too much to ask.
Thanks,
David
On Thursday, May 14, 2015 at 10:13:09 AM UTC-4, Joey Espinosa wrote:
>
> David,
>
> I've written middleware to intercept a JS-specific request before it is
> processed. I haven't used WaitFor.js, so I can't help you there, but I can
> help get you started with PhantomJS.
>
> from scrapy.http import HtmlResponse
> from selenium import webdriver
>
>
> class JSMiddleware(object):  # downloader middleware needs no special base class
>     def process_request(self, request, spider):
>         if request.meta.get('js'):  # you probably want a conditional trigger
>             driver = webdriver.PhantomJS()
>             driver.get(request.url)
>             body = driver.page_source
>             url = driver.current_url
>             driver.quit()  # shut down the PhantomJS process
>             return HtmlResponse(url, body=body, encoding='utf-8', request=request)
>         return
>
> That's the simplest approach. You may end up wanting to add options to the
> webdriver.PhantomJS() call, such as desired_capabilities with SSL handling
> options or a user-agent string. You may also want to wrap the driver.get()
> call in a try/except block. Additionally, you should do something with the
> cookies that come back from PhantomJS via driver.get_cookies().
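>
> As a rough sketch of what those additions might look like (the capability
> key, the --ignore-ssl-errors flag, and the user-agent string below are just
> illustrative assumptions, not required values):
>
> from selenium import webdriver
> from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
>
> caps = DesiredCapabilities.PHANTOMJS.copy()
> caps['phantomjs.page.settings.userAgent'] = 'my-scrapy-bot/1.0'  # placeholder UA
>
> driver = webdriver.PhantomJS(
>     desired_capabilities=caps,
>     service_args=['--ignore-ssl-errors=true'],  # SSL processing option
> )
> try:
>     driver.get(request.url)
> except Exception:
>     driver.quit()
>     raise
> cookies = driver.get_cookies()  # list of cookie dicts you can copy into the response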
>
> Also, if you want every request to go through JS, then you can remove the
> request.meta['js'] conditional. Otherwise, you could set that flag on
> initial requests in a spider.make_requests_from_url override, or you could
> simply have a spider instance method like spider.run_js(request) where the
> spider looks at the request and decides whether it needs JS based on some
> criteria you come up with.
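>
> For instance, a minimal sketch of the make_requests_from_url route (the
> spider name and start URL are placeholders, not taken from your project):
>
> import scrapy
>
> class UI5Spider(scrapy.Spider):
>     name = 'ui5'  # placeholder
>     start_urls = ['https://sapui5.netweaver.ondemand.com/sdk/']
>
>     def make_requests_from_url(self, url):
>         # tag initial requests so JSMiddleware routes them through PhantomJS
>         return scrapy.Request(url, meta={'js': True}, dont_filter=True)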
>
> There are a lot of options for you with PhantomJS, so it's really up to
> you, but this should be a decent starting point. I hope this answers your
> question.
>
> --
> Respectfully,
>
> Joey Espinosa
> http://about.me/joelinux
>
>
> On Thu, May 14, 2015 at 9:57 AM David Fishburn <[email protected]> wrote:
>
>> Thanks for the response José.
>>
>> That integrates Splash as the JS renderer. From the documentation I have
>> read, it looks like Splash does not support Windows.
>>
>> David
>>
>>
>> On Thursday, May 14, 2015 at 12:24:08 AM UTC-4, José Ricardo wrote:
>>
>>> Hi David, have you given ScrapyJS a try?
>>>
>>> https://github.com/scrapinghub/scrapyjs
>>>
>>> Besides rendering the page, it can also take screenshots :)
>>>
>>> Regards,
>>>
>>> José
>>>
>>> On Wed, May 13, 2015 at 3:54 PM, Travis Leleu <[email protected]>
>>> wrote:
>>>
>>>> Hi David,
>>>>
>>>> Honestly, I have yet to find a good integration between Scrapy and a JS
>>>> browser. The current methods all seem to download the basic page via
>>>> urllib3, then send that HTML to a browser process to render and fetch
>>>> other resources.
>>>>
>>>> This causes a bottleneck -- the browser process, usually exposed via an
>>>> API, takes a lot of CPU / time to render the page. It also doesn't easily
>>>> use proxies, which means that all subsequent requests will be from one IP
>>>> address.
>>>>
>>>> I think it would be a lot of work to build this into scrapy.
>>>>
>>>> In my work, I tend to just write my own (scaled down) scraping engine
>>>> that works more directly with a headless js browser.
>>>>
>>>> On Wed, May 13, 2015 at 12:32 PM, David Fishburn <[email protected]>
>>>> wrote:
>>>>
>>>>> I am new to Scrapy and Python.
>>>>>
>>>>> I have a site I need to scrape, but it is all AJAX driven, so I will need
>>>>> something like PhantomJS to yield the final rendered page.
>>>>>
>>>>> I have been searching in vain really for a simple example of a
>>>>> downloader middleware which uses PhantomJS. It has been around long
>>>>> enough
>>>>> that I am sure someone has already written one. I can find complete
>>>>> projects for Splash and others, but I am on Windows.
>>>>>
>>>>> It doesn't need to be fancy: just take the Scrapy request and return
>>>>> the PhantomJS page (most likely using WaitFor.js, which the PhantomJS
>>>>> dev team wrote, to return the page only after it has stopped making AJAX
>>>>> calls).
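>>>>>
>>>>> (A sketch of the "wait until the AJAX stops" part that I think could work
>>>>> with Selenium's WebDriverWait instead of WaitFor.js; the 30-second timeout
>>>>> and the readyState check are just my guess at what "finished" means here:)
>>>>>
>>>>> from selenium.webdriver.support.ui import WebDriverWait
>>>>>
>>>>> # block for up to 30 seconds until the document reports itself complete
>>>>> WebDriverWait(driver, 30).until(
>>>>>     lambda d: d.execute_script('return document.readyState') == 'complete')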
>>>>>
>>>>> I am completely lost trying to get started. The documentation
>>>>> (http://doc.scrapy.org/en/latest/topics/downloader-middleware.html)
>>>>> talks about the APIs, but it doesn't give a basic application which I
>>>>> could begin modifying to plug in the PhantomJS calls I have shown below
>>>>> (which are very simple).
>>>>>
>>>>> Anyone have something I can use?
>>>>>
>>>>> This code does what I want when using the Scrapy shell:
>>>>>
>>>>>
>>>>> D:\Python27\Scripts\scrapy.exe shell https://sapui5.netweaver.ondemand.com/sdk/#docs/api/symbols/sap.html
>>>>>
>>>>> >>> from selenium import webdriver
>>>>> >>> driver = webdriver.PhantomJS()
>>>>> >>> driver.set_window_size(1024, 768)
>>>>> >>> driver.get('https://sapui5.netweaver.ondemand.com/sdk/#docs/api/symbols/sap.html')
>>>>> # Wait here for 30 seconds and let the AJAX calls finish
>>>>> >>> driver.save_screenshot('screen.png')
>>>>> >>> print driver.page_source
>>>>> >>> driver.quit()
>>>>>
>>>>>
>>>>> The screenshot shows the page properly rendered by the browser.
>>>>>
>>>>>
>>>>> Thanks for any advice you can give.
>>>>> David
>>>>>
>>>>>
>>>>>