Hello, you have information in the stats at the end, and in the logs directly:
'offsite/domains': 1, 'offsite/filtered': 2, and DEBUG: Crawled (200) <GET http://farmaplus.com.ve/ catalog.do?page=1&offSet=0&comboCategorySelected=1&op= requestSearch&searchBox=ranitidina&go=> (referer: None) DEBUG: Filtered offsite request to 'farmaplus.com.ve': <GET http:// farmaplus.com.ve/catalog.do?op=requestPage&selectedPage=1&category=1& offSet=0&page=1&searchBox=ranitidina> Your spider is filtering requests to farmaplus.com.ve, which is the domain you're interested in. This is because you haven't set your spider's allowed_domains attribute correctly. It should be a list of domain names, eg. allowed_domains = ["farmaplus.com.ve"] This should unblock fetching pages from that website. I haven't checked your callbacks so you may need to fix those to get items out. Hope this helps. Regards, Paul. On Friday, August 19, 2016 at 5:34:36 AM UTC+2, [email protected] wrote: > > Hi. > > I'm trying to execute, what I believe is, a basic scrape task. However, > it's not working out. No items are loaded/saved. > > My start_url is the result of a search and has the following structure: > > # ------------------------------------------ start_url (page_1) structure > --------------- > product_1 > product_2 > ... > product_n > > link_to_page_1 > link_to_page_2 > # ------------------------------------------ start_url (page_1) structure > --------------- > > > > This particular search result has one additional page (a total of 2), but > in general, the links to next pages are: > > link_to_page_1 > link_to_page_2 > ... > link_to_page_5 > link_to_next_set_of_5_or_less > link_to_last_set_of_5_or_less > > > > Each product has its url and I'm interested in the details of each product > found in those urls. > > I created the Scrapy project, created the item.py file: > > # --------------------------------------- Begin item.py > ---------------------- > from scrapy.item import Item, Field > > > class MedicinesSearchItem(Item): > # Primary fields > name = Field() > name_url = Field() > image_url = Field() > availability = Field() > > # Calculated fields > images = Field() > > # Housekeeping fields > url = Field() > project = Field() > spider = Field() > server = Field() > date = Field() > > > class MedicinesItem(Item): > # Primary fields > name = Field() > image_url = Field() > stores = Field() > availabilities = Field() > update_times = Field() > update_dates = Field() > presentation_and_component = Field() > #active_component = Field() > manufacturer = Field() > > # Calculated fields > images = Field() > > # Housekeeping fields > url = Field() > project = Field() > spider = Field() > server = Field() > date = Field() > # --------------------------------------- End item.py > ---------------------- > > > > and the spider file: > > # --------------------------------------- Begin farmaplus3.py > ---------------------- > from scrapy.loader import ItemLoader > from medicines.items import MedicinesItem > import scrapy > > > class Farmaplus3Spider(scrapy.Spider): > name = "farmaplus3" > allowed_domains = ["web"] > # Start on a medicine search page > start_urls = ( > ' > http://farmaplus.com.ve/catalog.do?page=1&offSet=0&comboCategorySelected=1&op=requestSearch&searchBox=ranitidina&go= > ', > ) > > def parse(self, response): > for next_page in response.xpath( > '//*[@class="pageBarTableStyle"]//a/@href'): > url = response.urljoin(next_page.extract()) > yield scrapy.Request(url, self.parse_medicines_results) > > def parse_medicines_results(self, response): > for next_medicine in response.xpath( > '//*[@class="productNameThumbnail"]/@href'): > url = response.urljoin(next_medicine.extract()) > yield scrapy.Request(url, self.parse_medicines) > > def parse_medicines(self, response): > # Create the medicine item > item = MedicinesItem() > # Load fields using XPath expressions > item['name'] = response.xpath( > '//*[@class="productTitleDetail"]/text()').extract() > item['image_url'] = response.xpath( > '//*[@class="productImageDetail"]/@src').extract() > item['stores'] = response.xpath( > '//*[@id="branchDetailTable"]//td[@class="middleFirst"]/a/text()').extract > () > yield item > # --------------------------------------- End farmaplus3.py > ---------------------- > > > > However, running > > $ scrapy crawl farmaplus3 -o items.json > > > > produces the following output: > > 2016-08-18 21:03:11 [scrapy] INFO: Scrapy 1.1.1 started (bot: medicines) > 2016-08-18 21:03:11 [scrapy] INFO: Overridden settings: { > 'NEWSPIDER_MODULE': 'medicines.spiders', 'FEED_URI': 'items.json', > 'SPIDER_MODULES': ['medicines.spiders'], 'BOT_NAME': 'medicines', > 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'json'} > 2016-08-18 21:03:11 [scrapy] INFO: Enabled extensions: > ['scrapy.extensions.feedexport.FeedExporter', > 'scrapy.extensions.logstats.LogStats', > 'scrapy.extensions.telnet.TelnetConsole', > 'scrapy.extensions.corestats.CoreStats'] > 2016-08-18 21:03:11 [scrapy] INFO: Enabled downloader middlewares: > ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', > 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', > 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', > 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', > 'scrapy.downloadermiddlewares.retry.RetryMiddleware', > 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', > 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', > 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', > 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', > 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', > 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware', > 'scrapy.downloadermiddlewares.stats.DownloaderStats'] > 2016-08-18 21:03:11 [scrapy] INFO: Enabled spider middlewares: > ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', > 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', > 'scrapy.spidermiddlewares.referer.RefererMiddleware', > 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', > 'scrapy.spidermiddlewares.depth.DepthMiddleware'] > 2016-08-18 21:03:11 [scrapy] INFO: Enabled item pipelines: > [] > 2016-08-18 21:03:11 [scrapy] INFO: Spider opened > 2016-08-18 21:03:11 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), > scraped 0 items (at 0 items/min) > 2016-08-18 21:03:11 [scrapy] DEBUG: Telnet console listening on 127.0.0.1: > 6023 > 2016-08-18 21:03:19 [scrapy] DEBUG: Redirecting (302) to <GET http:// > farmaplus.com.ve/users/farmaplus.com.ve/staticResources/robots.txt> from > <GET http://farmaplus.com.ve/robots.txt> > 2016-08-18 21:03:20 [scrapy] DEBUG: Crawled (404) <GET http:// > farmaplus.com.ve/users/farmaplus.com.ve/staticResources/robots.txt> > (referer: None) > 2016-08-18 21:03:23 [scrapy] DEBUG: Crawled (200) <GET http:// > farmaplus.com.ve/catalog.do?page=1&offSet=0&comboCategorySelected=1&op=requestSearch&searchBox=ranitidina&go= > > > (referer: None) > 2016-08-18 21:03:23 [scrapy] DEBUG: Filtered offsite request to ' > farmaplus.com.ve': <GET http:// > farmaplus.com.ve/catalog.do?op=requestPage&selectedPage=1&category=1&offSet=0&page=1&searchBox=ranitidina > > > 2016-08-18 21:03:23 [scrapy] INFO: Closing spider (finished) > 2016-08-18 21:03:23 [scrapy] INFO: Dumping Scrapy stats: > {'downloader/request_bytes': 793, > 'downloader/request_count': 3, > 'downloader/request_method_count/GET': 3, > 'downloader/response_bytes': 21242, > 'downloader/response_count': 3, > 'downloader/response_status_count/200': 1, > 'downloader/response_status_count/302': 1, > 'downloader/response_status_count/404': 1, > 'finish_reason': 'finished', > 'finish_time': datetime.datetime(2016, 8, 19, 1, 3, 23, 498723), > 'log_count/DEBUG': 5, > 'log_count/INFO': 7, > 'offsite/domains': 1, > 'offsite/filtered': 2, > 'request_depth_max': 1, > 'response_received_count': 2, > 'scheduler/dequeued': 1, > 'scheduler/dequeued/memory': 1, > 'scheduler/enqueued': 1, > 'scheduler/enqueued/memory': 1, > 'start_time': datetime.datetime(2016, 8, 19, 1, 3, 11, 510607)} > 2016-08-18 21:03:23 [scrapy] INFO: Spider closed (finished) > > > > My project tree is: > . > ├── medicines > │ ├── __init__.py > │ ├── __init__.pyc > │ ├── items.py > │ ├── items.pyc > │ ├── pipelines.py > │ ├── settings.py > │ ├── settings.pyc > │ └── spiders > │ ├── basic2.py > │ ├── basic2.pyc > │ ├── basic.py > │ ├── basic.pyc > │ ├── farmaplus2.py > │ ├── farmaplus2.pyc > │ ├── farmaplus3.py > │ ├── farmaplus3.pyc > │ ├── farmaplus.py > │ ├── farmaplus.pyc > │ ├── __init__.py > │ └── __init__.pyc > └── scrapy.cfg > > The strategy I'm trying to implement is: > > i) collect the next page urls > ii) from each of the next pages, collect the product urls > iii) from each of the product urls, extract the details to fill in the > item fields > > When I do this "manually" with scrapy shell it seems to work. That is, > the selectors are OK. The urls seem OK and I'm able to load item fields. > Running the file, however, doesn't give the expected results. > > What am I missing here? Why do I end up without any items? > -- You received this message because you are subscribed to the Google Groups "scrapy-users" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at https://groups.google.com/group/scrapy-users. For more options, visit https://groups.google.com/d/optout.
