Besides my PhantomJS Middleware post, I believe a download handler (the DOWNLOAD_HANDLERS setting) is different from a downloader middleware, which is why I have posted this separately.
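To illustrate the difference as I understand it: a middleware is registered in DOWNLOADER_MIDDLEWARES keyed by class path with an ordering number, while a download handler is registered in DOWNLOAD_HANDLERS keyed by URI scheme. The names below are hypothetical, just to show the two shapes:

    # Hypothetical names, only to show the two registration shapes.
    DOWNLOADER_MIDDLEWARES = {
        # key = dotted class path, value = ordering number
        'myproject.middlewares.SomeMiddleware': 543,
    }
    DOWNLOAD_HANDLERS = {
        # key = URI scheme, value = dotted class path
        'http': 'myproject.handlers.SomeDownloadHandler',
    }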
I did find this GitHub project by flisky:

    https://github.com/flisky/scrapy-phantomjs-downloader

It provides a single Python file: scrapy_phantomjs/downloader/handler.py

Given I am new to both Python and Scrapy, I am having a very hard time understanding where to put that file and how to reference it. Assume I have the following project structure:

    /
        scrapy.cfg
        sapui5api/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                sapui5api_spiders.py

1. Which directory should handler.py go in?
2. How do I reference it (what name do I use)?

This is what I have tried so far. I added /sapui5api/spiders/handler.py, which starts with:

    from __future__ import unicode_literals

    from scrapy import signals
    from scrapy.signalmanager import SignalManager
    from scrapy.responsetypes import responsetypes
    from scrapy.xlib.pydispatch import dispatcher
    from selenium import webdriver
    from six.moves import queue
    from twisted.internet import defer, threads
    from twisted.python.failure import Failure


    class PhantomJSDownloadHandler(object):
        # ... rest of the class as copied from the GitHub project above

In /sapui5api/settings.py I added:

    DOWNLOAD_HANDLERS = {
        'http': 'crawler.http.PhantomJSDownloadHandler',
        'https': 'crawler.https.PhantomJSDownloadHandler'
    }

I also tried:

    DOWNLOAD_HANDLERS = {
        'http': 'sapui5api.spiders.PhantomJSDownloadHandler',
        'https': 'sapui5api.spiders.PhantomJSDownloadHandler'
    }

Really quite guessing at this point.
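As far as I can tell, Scrapy resolves each DOWNLOAD_HANDLERS value with scrapy.utils.misc.load_object (the same function that appears in the traceback below): everything before the last dot is imported as a module, and the last segment is looked up as an attribute on that module. So the string has to mirror the file layout exactly. A minimal sketch, assuming handler.py stays at sapui5api/spiders/handler.py as above (not tested against Scrapy 0.24):

    # sapui5api/settings.py -- a sketch, assuming the file lives at
    # sapui5api/spiders/handler.py. 'sapui5api.spiders.handler' is the
    # module path; 'PhantomJSDownloadHandler' is the class inside it.
    DOWNLOAD_HANDLERS = {
        'http': 'sapui5api.spiders.handler.PhantomJSDownloadHandler',
        'https': 'sapui5api.spiders.handler.PhantomJSDownloadHandler',
    }

By that logic, the first attempt fails because there is no 'crawler' package in the project at all, and the second fails because 'sapui5api.spiders' is the package, not the module; the class lives in the handler module inside it.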
This is the output from the run (using the first DOWNLOAD_HANDLERS attempt):

    D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\settings\deprecated.py:26: ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask [email protected] for alternatives):
        BOT_VERSION: no longer used (user agent defaults to Scrapy now)
      warnings.warn(msg, ScrapyDeprecationWarning)
    2015-05-14 10:08:34-0400 [scrapy] INFO: Scrapy 0.24.6 started (bot: sapui5api)
    2015-05-14 10:08:34-0400 [scrapy] INFO: Optional features available: ssl, http11
    2015-05-14 10:08:34-0400 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'sapui5api.spiders', 'SPIDER_MODULES': ['sapui5api.spiders'], 'USER_AGENT': 'sapui5api/1.0', 'BOT_NAME': 'sapui5api'}
    2015-05-14 10:08:35-0400 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    Traceback (most recent call last):
      File "d:\python27\scripts\scrapy-script.py", line 9, in <module>
        load_entry_point('scrapy==0.24.6', 'console_scripts', 'scrapy')()
      File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\cmdline.py", line 143, in execute
        _run_print_help(parser, _run_command, cmd, args, opts)
      File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\cmdline.py", line 89, in _run_print_help
        func(*a, **kw)
      File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\cmdline.py", line 150, in _run_command
        cmd.run(args, opts)
      File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\commands\crawl.py", line 60, in run
        self.crawler_process.start()
      File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\crawler.py", line 92, in start
        if self.start_crawling():
      File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\crawler.py", line 124, in start_crawling
        return self._start_crawler() is not None
      File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\crawler.py", line 139, in _start_crawler
        crawler.configure()
      File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\crawler.py", line 47, in configure
        self.engine = ExecutionEngine(self, self._spider_closed)
      File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\core\engine.py", line 64, in __init__
        self.downloader = downloader_cls(crawler)
      File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\core\downloader\__init__.py", line 73, in __init__
        self.handlers = DownloadHandlers(crawler)
      File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\core\downloader\handlers\__init__.py", line 22, in __init__
        cls = load_object(clspath)
      File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\utils\misc.py", line 42, in load_object
        raise ImportError("Error loading object '%s': %s" % (path, e))
    ImportError: Error loading object 'crawler.http.PhantomJSDownloadHandler': No module named crawler.http

I am not sure how the whole naming scheme works between Python and Scrapy. How do you know which directory to put handler.py in? The docs only talk about creating a handler; they never mention what directory these files have to go in or how to reference them properly after you create them.

Any help is greatly appreciated.

David
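P.S. One sanity check, assuming the file stays at sapui5api/spiders/handler.py and this is run from the directory containing scrapy.cfg: import the class by hand the same way load_object would, before handing the dotted path to Scrapy.

    # Mimic scrapy.utils.misc.load_object by hand; if this import fails,
    # the matching DOWNLOAD_HANDLERS entry will fail with the same error.
    import importlib

    module = importlib.import_module('sapui5api.spiders.handler')
    handler_cls = getattr(module, 'PhantomJSDownloadHandler')
    print(handler_cls)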
