Assuming that scrapyd is your production environment, LOG_LEVEL can probably be the same for all spiders, LOG_FILE isn't needed, and JOBDIR isn't really needed either. And there is no need for a START_URLS setting: you can just define the start URLs on the child spiders, as in the sketch below.
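
For example, a minimal sketch of what I mean (class names and URLs are only placeholders; your real shared logic stays in the base class):

import scrapy

class MyBaseSpider(scrapy.Spider):
    """Base spider holding the shared items/pipelines/parsing logic."""

    def parse(self, response):
        # The common parsing code lives here; child spiders inherit it unchanged.
        pass

class Spider1(MyBaseSpider):
    name = 'spider1'
    # No START_URLS setting needed: each child spider declares its own list.
    start_urls = ['http://test1.com/']

class Spider2(MyBaseSpider):
    name = 'spider2'
    start_urls = ['http://test2.com/']

That way the only per-spider code is the name and the start_urls, and the shared 95% never gets duplicated.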
On Thursday, July 31, 2014 at 14:20:18 UTC-3, lnxpgn wrote:
>
> I am using Scrapy, it is great! So fast to build a crawler. As the number
> of web sites increases, I need to create new spiders. These web sites are
> of the same type, and all the spiders use the same items, pipelines, and
> parsing process.
>
> The contents of the project directory:
>
> test/
> ├── scrapy.cfg
> └── test
>     ├── __init__.py
>     ├── items.py
>     ├── mybasespider.py
>     ├── pipelines.py
>     ├── settings.py
>     ├── spider1_settings.py
>     ├── spider2_settings.py
>     └── spiders
>         ├── __init__.py
>         ├── spider1.py
>         └── spider2.py
>
> To reduce source code redundancy, mybasespider.py has a base spider
> "MyBaseSpider"; 95% of the source code is in it, and all other spiders
> inherit from it. If a spider needs something special, it overrides some
> class methods; normally only a few lines of source code are needed to
> create a new spider.
>
> All common settings are placed in settings.py; one spider's special
> settings are in <spider name>_settings.py, such as:
>
> the special settings of spider1:
>
> from settings import *
>
> LOG_FILE = 'spider1.log'
> LOG_LEVEL = 'INFO'
> JOBDIR = 'spider1-job'
> START_URLS = [
>     'http://test1.com/',
> ]
>
> the special settings of spider2:
>
> from settings import *
>
> LOG_FILE = 'spider2.log'
> LOG_LEVEL = 'DEBUG'
> JOBDIR = 'spider2-job'
> START_URLS = [
>     'http://test2.com/',
> ]
>
> Scrapy uses LOG_FILE, LOG_LEVEL, and JOBDIR before launching a spider.
> All URLs in START_URLS are filled into MyBaseSpider.start_urls; each
> spider has different contents, but the name "START_URLS" used in the base
> spider "MyBaseSpider" doesn't change.
>
> The contents of scrapy.cfg:
>
> [settings]
> default = test.settings
> spider1 = spider1.settings
> spider2 = spider2.settings
>
> [deploy]
> url = http://localhost:6800/
> project = test
>
> To run a spider, such as spider1:
> 1. export SCRAPY_PROJECT=spider1
> 2. scrapy crawl spider1
>
> But this way can't be used to run spiders in scrapyd. scrapyd-deploy
> always uses the 'default' project name from the scrapy.cfg 'settings'
> section to build an egg file and deploy it to scrapyd.
>
> I have several questions:
> 1. Is this the way to use multiple spiders in one project without
>    creating a project per spider? Are there better ways?
> 2. How can a spider's special settings be separated as above, but still
>    run in scrapyd while reducing source code redundancy?
> 3. If all spiders use the same JOBDIR, is it safe to run all spiders
>    concurrently? Does the persistent spider state get corrupted?
>
> Any insights would be greatly appreciated!
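
Regarding question 2: if a spider really does need a special setting when running under scrapyd, it can be passed per job through scrapyd's schedule.json endpoint instead of a separate settings module. A rough sketch, assuming scrapyd is listening on localhost:6800, the project was deployed as 'test', and the requests library is installed; the start_url argument is only an illustration and would have to be read in the spider's __init__:

import requests

# Schedule spider1 on scrapyd, overriding one Scrapy setting and passing a
# spider argument for this job only.
resp = requests.post(
    'http://localhost:6800/schedule.json',
    data={
        'project': 'test',
        'spider': 'spider1',
        'setting': 'LOG_LEVEL=INFO',       # per-job setting override
        'start_url': 'http://test1.com/',  # spider argument (illustrative)
    },
)
print(resp.json())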
