Hi Paul,

I feel silly. For some reason I thought 'web' was a keyword that meant "any domain". Not knowing how to read the output didn't help at all.

Your advice worked. The items are out. Thanks.
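For reference, in case someone else trips over the same thing: the fix was essentially just pointing allowed_domains at the real domain (it has to be a list of actual domain names, not a keyword). The top of the spider now looks roughly like this, with the rest of farmaplus3.py as quoted below:

    import scrapy

    class Farmaplus3Spider(scrapy.Spider):
        name = "farmaplus3"
        # List the real domain(s) to crawl; requests to anything else are
        # dropped by OffsiteMiddleware ('offsite/filtered' in the stats).
        allowed_domains = ["farmaplus.com.ve"]
        # Start on a medicine search page
        start_urls = (
            'http://farmaplus.com.ve/catalog.do?page=1&offSet=0&comboCategorySelected=1&op=requestSearch&searchBox=ranitidina&go=',
        )

With that one change the page and product requests go through and the items end up in items.json.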
On Friday, August 19, 2016 at 5:01:46 AM UTC-4, Paul Tremberth wrote:
>
> Hello,
>
> you have information in the stats at the end, and in the logs directly:
>
> 'offsite/domains': 1,
> 'offsite/filtered': 2,
>
> and
>
> DEBUG: Crawled (200) <GET http://farmaplus.com.ve/catalog.do?page=1&offSet=0&comboCategorySelected=1&op=requestSearch&searchBox=ranitidina&go=> (referer: None)
> DEBUG: Filtered offsite request to 'farmaplus.com.ve': <GET http://farmaplus.com.ve/catalog.do?op=requestPage&selectedPage=1&category=1&offSet=0&page=1&searchBox=ranitidina>
>
> Your spider is filtering requests to farmaplus.com.ve, which is the domain you're interested in.
> This is because you haven't set your spider's allowed_domains attribute correctly.
> It should be a list of domain names, e.g.
> allowed_domains = ["farmaplus.com.ve"]
>
> This should unblock fetching pages from that website.
> I haven't checked your callbacks so you may need to fix those to get items out.
>
> Hope this helps.
>
> Regards,
> Paul.
>
> On Friday, August 19, 2016 at 5:34:36 AM UTC+2, [email protected] wrote:
>>
>> Hi.
>>
>> I'm trying to execute what I believe is a basic scrape task. However, it's not working out. No items are loaded/saved.
>>
>> My start_url is the result of a search and has the following structure:
>>
>> # ------------------------------------------ start_url (page_1) structure ---------------
>> product_1
>> product_2
>> ...
>> product_n
>>
>> link_to_page_1
>> link_to_page_2
>> # ------------------------------------------ start_url (page_1) structure ---------------
>>
>> This particular search result has one additional page (a total of 2), but in general, the links to next pages are:
>>
>> link_to_page_1
>> link_to_page_2
>> ...
>> link_to_page_5
>> link_to_next_set_of_5_or_less
>> link_to_last_set_of_5_or_less
>>
>> Each product has its url and I'm interested in the details of each product found in those urls.
>>
>> I created the Scrapy project and the items.py file:
>>
>> # --------------------------------------- Begin items.py ----------------------
>> from scrapy.item import Item, Field
>>
>>
>> class MedicinesSearchItem(Item):
>>     # Primary fields
>>     name = Field()
>>     name_url = Field()
>>     image_url = Field()
>>     availability = Field()
>>
>>     # Calculated fields
>>     images = Field()
>>
>>     # Housekeeping fields
>>     url = Field()
>>     project = Field()
>>     spider = Field()
>>     server = Field()
>>     date = Field()
>>
>>
>> class MedicinesItem(Item):
>>     # Primary fields
>>     name = Field()
>>     image_url = Field()
>>     stores = Field()
>>     availabilities = Field()
>>     update_times = Field()
>>     update_dates = Field()
>>     presentation_and_component = Field()
>>     #active_component = Field()
>>     manufacturer = Field()
>>
>>     # Calculated fields
>>     images = Field()
>>
>>     # Housekeeping fields
>>     url = Field()
>>     project = Field()
>>     spider = Field()
>>     server = Field()
>>     date = Field()
>> # --------------------------------------- End items.py ----------------------
>>
>> and the spider file:
>>
>> # --------------------------------------- Begin farmaplus3.py ----------------------
>> from scrapy.loader import ItemLoader
>> from medicines.items import MedicinesItem
>> import scrapy
>>
>>
>> class Farmaplus3Spider(scrapy.Spider):
>>     name = "farmaplus3"
>>     allowed_domains = ["web"]
>>     # Start on a medicine search page
>>     start_urls = (
>>         'http://farmaplus.com.ve/catalog.do?page=1&offSet=0&comboCategorySelected=1&op=requestSearch&searchBox=ranitidina&go=',
>>     )
>>
>>     def parse(self, response):
>>         for next_page in response.xpath(
>>                 '//*[@class="pageBarTableStyle"]//a/@href'):
>>             url = response.urljoin(next_page.extract())
>>             yield scrapy.Request(url, self.parse_medicines_results)
>>
>>     def parse_medicines_results(self, response):
>>         for next_medicine in response.xpath(
>>                 '//*[@class="productNameThumbnail"]/@href'):
>>             url = response.urljoin(next_medicine.extract())
>>             yield scrapy.Request(url, self.parse_medicines)
>>
>>     def parse_medicines(self, response):
>>         # Create the medicine item
>>         item = MedicinesItem()
>>         # Load fields using XPath expressions
>>         item['name'] = response.xpath(
>>             '//*[@class="productTitleDetail"]/text()').extract()
>>         item['image_url'] = response.xpath(
>>             '//*[@class="productImageDetail"]/@src').extract()
>>         item['stores'] = response.xpath(
>>             '//*[@id="branchDetailTable"]//td[@class="middleFirst"]/a/text()').extract()
>>         yield item
>> # --------------------------------------- End farmaplus3.py ----------------------
>>
>> However, running
>>
>> $ scrapy crawl farmaplus3 -o items.json
>>
>> produces the following output:
>>
>> 2016-08-18 21:03:11 [scrapy] INFO: Scrapy 1.1.1 started (bot: medicines)
>> 2016-08-18 21:03:11 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'medicines.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['medicines.spiders'], 'BOT_NAME': 'medicines', 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'json'}
>> 2016-08-18 21:03:11 [scrapy] INFO: Enabled extensions:
>> ['scrapy.extensions.feedexport.FeedExporter',
>>  'scrapy.extensions.logstats.LogStats',
>>  'scrapy.extensions.telnet.TelnetConsole',
>>  'scrapy.extensions.corestats.CoreStats']
>> 2016-08-18 21:03:11 [scrapy] INFO: Enabled downloader middlewares:
>> ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
>>  'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
>>  'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
>>  'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
>>  'scrapy.downloadermiddlewares.retry.RetryMiddleware',
>>  'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
>>  'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
>>  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
>>  'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
>>  'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
>>  'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
>>  'scrapy.downloadermiddlewares.stats.DownloaderStats']
>> 2016-08-18 21:03:11 [scrapy] INFO: Enabled spider middlewares:
>> ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
>>  'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
>>  'scrapy.spidermiddlewares.referer.RefererMiddleware',
>>  'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
>>  'scrapy.spidermiddlewares.depth.DepthMiddleware']
>> 2016-08-18 21:03:11 [scrapy] INFO: Enabled item pipelines:
>> []
>> 2016-08-18 21:03:11 [scrapy] INFO: Spider opened
>> 2016-08-18 21:03:11 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>> 2016-08-18 21:03:11 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
>> 2016-08-18 21:03:19 [scrapy] DEBUG: Redirecting (302) to <GET http://farmaplus.com.ve/users/farmaplus.com.ve/staticResources/robots.txt> from <GET http://farmaplus.com.ve/robots.txt>
>> 2016-08-18 21:03:20 [scrapy] DEBUG: Crawled (404) <GET http://farmaplus.com.ve/users/farmaplus.com.ve/staticResources/robots.txt> (referer: None)
>> 2016-08-18 21:03:23 [scrapy] DEBUG: Crawled (200) <GET http://farmaplus.com.ve/catalog.do?page=1&offSet=0&comboCategorySelected=1&op=requestSearch&searchBox=ranitidina&go=> (referer: None)
>> 2016-08-18 21:03:23 [scrapy] DEBUG: Filtered offsite request to 'farmaplus.com.ve': <GET http://farmaplus.com.ve/catalog.do?op=requestPage&selectedPage=1&category=1&offSet=0&page=1&searchBox=ranitidina>
>> 2016-08-18 21:03:23 [scrapy] INFO: Closing spider (finished)
>> 2016-08-18 21:03:23 [scrapy] INFO: Dumping Scrapy stats:
>> {'downloader/request_bytes': 793,
>>  'downloader/request_count': 3,
>>  'downloader/request_method_count/GET': 3,
>>  'downloader/response_bytes': 21242,
>>  'downloader/response_count': 3,
>>  'downloader/response_status_count/200': 1,
>>  'downloader/response_status_count/302': 1,
>>  'downloader/response_status_count/404': 1,
>>  'finish_reason': 'finished',
>>  'finish_time': datetime.datetime(2016, 8, 19, 1, 3, 23, 498723),
>>  'log_count/DEBUG': 5,
>>  'log_count/INFO': 7,
>>  'offsite/domains': 1,
>>  'offsite/filtered': 2,
>>  'request_depth_max': 1,
>>  'response_received_count': 2,
>>  'scheduler/dequeued': 1,
>>  'scheduler/dequeued/memory': 1,
>>  'scheduler/enqueued': 1,
>>  'scheduler/enqueued/memory': 1,
>>  'start_time': datetime.datetime(2016, 8, 19, 1, 3, 11, 510607)}
>> 2016-08-18 21:03:23 [scrapy] INFO: Spider closed (finished)
>>
>> My project tree is:
>>
>> .
>> ├── medicines
>> │   ├── __init__.py
>> │   ├── __init__.pyc
>> │   ├── items.py
>> │   ├── items.pyc
>> │   ├── pipelines.py
>> │   ├── settings.py
>> │   ├── settings.pyc
>> │   └── spiders
>> │       ├── basic2.py
>> │       ├── basic2.pyc
>> │       ├── basic.py
>> │       ├── basic.pyc
>> │       ├── farmaplus2.py
>> │       ├── farmaplus2.pyc
>> │       ├── farmaplus3.py
>> │       ├── farmaplus3.pyc
>> │       ├── farmaplus.py
>> │       ├── farmaplus.pyc
>> │       ├── __init__.py
>> │       └── __init__.pyc
>> └── scrapy.cfg
>>
>> The strategy I'm trying to implement is:
>>
>> i) collect the next page urls
>> ii) from each of the next pages, collect the product urls
>> iii) from each of the product urls, extract the details to fill in the item fields
>>
>> When I do this "manually" with scrapy shell it seems to work. That is, the selectors are OK. The urls seem OK and I'm able to load item fields. Running the file, however, doesn't give the expected results.
>>
>> What am I missing here? Why do I end up without any items?
