I am using Scrapy and it is great; building a crawler with it is very fast. But as the number of web sites grows, I need to create new spiders. These sites are all of the same type, so all the spiders use the same items, pipelines, and parsing process.
The contents of the project directory:
test/
├── scrapy.cfg
└── test
    ├── __init__.py
    ├── items.py
    ├── mybasespider.py
    ├── pipelines.py
    ├── settings.py
    ├── spider1_settings.py
    ├── spider2_settings.py
    └── spiders
        ├── __init__.py
        ├── spider1.py
        └── spider2.py
To reduce source code redundancy, mybasespider.py contains a base spider, MyBaseSpider, which holds about 95% of the source code. All other spiders inherit from it; if a spider needs to do something special, it overrides some of the class methods. Normally, creating a new spider only takes a few lines of code, as in the sketch below.
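For illustration, here is a minimal sketch of the idea (assuming Scrapy 1.0+); the parsing logic and the parse_item method are hypothetical placeholders, not my real code:

    # mybasespider.py -- a simplified sketch of the shared base spider
    import scrapy

    class MyBaseSpider(scrapy.Spider):

        def parse(self, response):
            # shared crawling logic used by every spider
            for href in response.xpath('//a/@href').extract():
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_item)

        def parse_item(self, response):
            # subclasses override this method when a site needs
            # special handling
            yield {'url': response.url,
                   'title': response.xpath('//title/text()').extract_first()}

    # spiders/spider1.py -- a new spider usually needs only a few lines
    from test.mybasespider import MyBaseSpider

    class Spider1(MyBaseSpider):
        name = 'spider1'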
All common settings go in settings.py, and one spider's special settings go in <spider name>_settings.py. For example, the special settings of spider1 (spider1_settings.py):
    from settings import *

    LOG_FILE = 'spider1.log'
    LOG_LEVEL = 'INFO'
    JOBDIR = 'spider1-job'

    START_URLS = [
        'http://test1.com/',
    ]
The special settings of spider2 (spider2_settings.py):
    from settings import *

    LOG_FILE = 'spider2.log'
    LOG_LEVEL = 'DEBUG'
    JOBDIR = 'spider2-job'

    START_URLS = [
        'http://test2.com/',
    ]
Scrapy reads LOG_FILE, LOG_LEVEL, and JOBDIR before launching a spider. All of the URLs in START_URLS are filled into MyBaseSpider.start_urls; each spider gets different contents, but the name "START_URLS" used in the base spider MyBaseSpider never changes.
The contents of scrapy.cfg:
    [settings]
    default = test.settings
    spider1 = test.spider1_settings
    spider2 = test.spider2_settings

    [deploy]
    url = http://localhost:6800/
    project = test
To run a spider, such as spider1:

1. export SCRAPY_PROJECT=spider1
2. scrapy crawl spider1

(SCRAPY_PROJECT selects which entry in the [settings] section of scrapy.cfg provides the settings module.)
But this approach can't be used to run spiders under scrapyd: scrapyd-deploy always uses the 'default' entry in the [settings] section of scrapy.cfg to build the egg file, and deploys that to scrapyd.
I have several questions:

1. Is this the right way to use multiple spiders in one project without creating a project per spider? Are there better ways?
2. How can I separate a spider's special settings as above, yet still run under scrapyd and keep source code redundancy low?
3. If all spiders use the same JOBDIR, is it safe to run them all concurrently? Could the persistent spider state become corrupted?
Any insights would be greatly appreciated!