Assuming that scrapyd is your production environment, LOG_LEVEL can probably be the same for all spiders, LOG_FILE isn't needed, and JOBDIR isn't really needed either. And there is no need for a START_URLS setting: you can just define the start URLs on the child spiders, as in the sketch below.
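
For example, a minimal sketch of what I mean (class names and URLs are only placeholders; your real shared logic stays in the base class):

import scrapy

class MyBaseSpider(scrapy.Spider):
    """Base spider holding the shared items/pipelines/parsing logic."""

    def parse(self, response):
        # The common parsing code lives here; child spiders inherit it unchanged.
        pass

class Spider1(MyBaseSpider):
    name = 'spider1'
    # No START_URLS setting needed: each child spider declares its own list.
    start_urls = ['http://test1.com/']

class Spider2(MyBaseSpider):
    name = 'spider2'
    start_urls = ['http://test2.com/']

That way the only per-spider code is the name and the start_urls, and the shared 95% never gets duplicated.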
On Thursday, July 31, 2014 at 14:20:18 UTC-3, lnxpgn wrote:
>
> I am using Scrapy, it is great! So fast to build a crawler. As the number
> of web sites increases, I need to create new spiders. These web sites are
> of the same type, and all the spiders use the same items, pipelines, and
> parsing process.
>
> The contents of the project directory:
>
> test/
> ├── scrapy.cfg
> └── test
>     ├── __init__.py
>     ├── items.py
>     ├── mybasespider.py
>     ├── pipelines.py
>     ├── settings.py
>     ├── spider1_settings.py
>     ├── spider2_settings.py
>     └── spiders
>         ├── __init__.py
>         ├── spider1.py
>         └── spider2.py
>
> To reduce source code redundancy, mybasespider.py has a base spider
> "MyBaseSpider"; 95% of the source code is in it, and all other spiders
> inherit from it. If a spider needs something special, it overrides some
> class methods; normally only a few lines of source code are needed to
> create a new spider.
>
> All common settings are placed in settings.py; one spider's special
> settings are in <spider name>_settings.py, such as:
>
> the special settings of spider1:
>
> from settings import *
>
> LOG_FILE = 'spider1.log'
> LOG_LEVEL = 'INFO'
> JOBDIR = 'spider1-job'
> START_URLS = [
>     'http://test1.com/',
> ]
>
> the special settings of spider2:
>
> from settings import *
>
> LOG_FILE = 'spider2.log'
> LOG_LEVEL = 'DEBUG'
> JOBDIR = 'spider2-job'
> START_URLS = [
>     'http://test2.com/',
> ]
>
> Scrapy uses LOG_FILE, LOG_LEVEL, and JOBDIR before launching a spider.
> All URLs in START_URLS are filled into MyBaseSpider.start_urls; each
> spider has different contents, but the name "START_URLS" used in the base
> spider "MyBaseSpider" doesn't change.
>
> The contents of scrapy.cfg:
>
> [settings]
> default = test.settings
> spider1 = spider1.settings
> spider2 = spider2.settings
>
> [deploy]
> url = http://localhost:6800/
> project = test
>
> To run a spider, such as spider1:
> 1. export SCRAPY_PROJECT=spider1
> 2. scrapy crawl spider1
>
> But this way can't be used to run spiders in scrapyd. scrapyd-deploy
> always uses the 'default' project name from the scrapy.cfg 'settings'
> section to build an egg file and deploy it to scrapyd.
>
> I have several questions:
> 1. Is this the way to use multiple spiders in one project without
>    creating a project per spider? Are there better ways?
> 2. How can a spider's special settings be separated as above, but still
>    run in scrapyd while reducing source code redundancy?
> 3. If all spiders use the same JOBDIR, is it safe to run all spiders
>    concurrently? Does the persistent spider state get corrupted?
>
> Any insights would be greatly appreciated!
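
Regarding question 2: if a spider really does need a special setting when running under scrapyd, it can be passed per job through scrapyd's schedule.json endpoint instead of a separate settings module. A rough sketch, assuming scrapyd is listening on localhost:6800, the project was deployed as 'test', and the requests library is installed; the start_url argument is only an illustration and would have to be read in the spider's __init__:

import requests

# Schedule spider1 on scrapyd, overriding one Scrapy setting and passing a
# spider argument for this job only.
resp = requests.post(
    'http://localhost:6800/schedule.json',
    data={
        'project': 'test',
        'spider': 'spider1',
        'setting': 'LOG_LEVEL=INFO',       # per-job setting override
        'start_url': 'http://test1.com/',  # spider argument (illustrative)
    },
)
print(resp.json())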
