I am using Scrapy and it is great; building a crawler with it is very fast. But as the number of web sites grows, I need to create new spiders. These sites are all of the same type, so all the spiders use the same items, pipelines, and parsing process.
The contents of the project directory:
test/
├── scrapy.cfg
└── test
    ├── __init__.py
    ├── items.py
    ├── mybasespider.py
    ├── pipelines.py
    ├── settings.py
    ├── spider1_settings.py
    ├── spider2_settings.py
    └── spiders
        ├── __init__.py
        ├── spider1.py
        └── spider2.py
To reduce source code redundancy, mybasespider.py contains a base spider, MyBaseSpider, which holds about 95% of the source code. All other spiders inherit from it; if a spider needs to do something special, it overrides some of the class methods. Normally, creating a new spider only takes a few lines of code, as in the sketch below.
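For illustration, here is a minimal sketch of the idea (assuming Scrapy 1.0+); the parsing logic and the parse_item method are hypothetical placeholders, not my real code:

    # mybasespider.py -- a simplified sketch of the shared base spider
    import scrapy

    class MyBaseSpider(scrapy.Spider):

        def parse(self, response):
            # shared crawling logic used by every spider
            for href in response.xpath('//a/@href').extract():
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_item)

        def parse_item(self, response):
            # subclasses override this method when a site needs
            # special handling
            yield {'url': response.url,
                   'title': response.xpath('//title/text()').extract_first()}

    # spiders/spider1.py -- a new spider usually needs only a few lines
    from test.mybasespider import MyBaseSpider

    class Spider1(MyBaseSpider):
        name = 'spider1'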
All common settings go in settings.py, and one spider's special settings go in <spider name>_settings.py. For example, the special settings of spider1 (spider1_settings.py):
    from settings import *

    LOG_FILE = 'spider1.log'
    LOG_LEVEL = 'INFO'
    JOBDIR = 'spider1-job'

    START_URLS = [
        'http://test1.com/',
    ]
The special settings of spider2 (spider2_settings.py):
    from settings import *

    LOG_FILE = 'spider2.log'
    LOG_LEVEL = 'DEBUG'
    JOBDIR = 'spider2-job'

    START_URLS = [
        'http://test2.com/',
    ]
Scrapy reads LOG_FILE, LOG_LEVEL, and JOBDIR before launching a spider. All of the URLs in START_URLS are filled into MyBaseSpider.start_urls; each spider gets different contents, but the name "START_URLS" used in the base spider MyBaseSpider never changes.
The contents of scrapy.cfg:
    [settings]
    default = test.settings
    spider1 = test.spider1_settings
    spider2 = test.spider2_settings

    [deploy]
    url = http://localhost:6800/
    project = test
To run a spider, such as spider1:

1. export SCRAPY_PROJECT=spider1
2. scrapy crawl spider1

(SCRAPY_PROJECT selects which entry in the [settings] section of scrapy.cfg provides the settings module.)
But this approach can't be used to run spiders under scrapyd: scrapyd-deploy always uses the 'default' entry in the [settings] section of scrapy.cfg to build the egg file, and deploys that to scrapyd.
I have several questions:

1. Is this the right way to use multiple spiders in one project without creating a project per spider? Are there better ways?
2. How can I separate a spider's special settings as above, yet still run under scrapyd and keep source code redundancy low?
3. If all spiders use the same JOBDIR, is it safe to run them all concurrently? Could the persistent spider state become corrupted?
Any insights would be greatly appreciated!